CN116993973A - Semantic segmentation method and device for transparent object in image and electronic equipment - Google Patents

Semantic segmentation method and device for transparent object in image and electronic equipment

Info

Publication number
CN116993973A
Authority
CN
China
Prior art keywords
features
feature
semantic
boundary
complementary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211466056.7A
Other languages
Chinese (zh)
Inventor
王昌安
王亚彪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211466056.7A priority Critical patent/CN116993973A/en
Publication of CN116993973A publication Critical patent/CN116993973A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a semantic segmentation method and device for transparent objects in images, and an electronic device; it belongs to the technical field of computers and can also be applied to the field of automatic driving. According to the application, coarse semantic segmentation features and object boundary features are extracted from the image, and the complementary information between them is fully exploited to construct complementary attention features. The complementary attention features are then used to correct the semantic segmentation features and the object boundary features respectively, realizing bidirectional interaction of complementary information between the semantic branch and the boundary branch. The corrected semantic features and corrected boundary features obtained through this collaborative optimization pay more attention to the feature values that contribute most to their own branch, and the semantic segmentation map predicted on this basis identifies transparent objects that are otherwise difficult to segment in the image more accurately.

Description

Semantic segmentation method and device for transparent object in image and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a semantic segmentation method and apparatus for transparent objects in an image, and an electronic device.
Background
For automatic driving systems, it is critical to accurately detect transparent objects such as glass. Transparent objects are widely distributed in daily life, such as glass objects of windows, water cups, glasses, doors and the like.
Because a transparent object has the special property of transparency, it reflects the background environment into the image; that is, the appearance of the transparent object changes as the environment changes, and it has no fixed texture information. Meanwhile, because of this transparency, the pattern inside a transparent object and the surrounding pattern are very similar, so the boundaries of some transparent objects often appear very blurred. A method that accurately performs semantic segmentation on transparent objects is therefore needed.
Disclosure of Invention
The embodiment of the application provides a semantic segmentation method and device for transparent objects in an image and electronic equipment, which can realize accurate semantic segmentation of the transparent objects in the image. The technical scheme is as follows:
in one aspect, a semantic segmentation method for transparent objects in an image is provided, the method comprising:
extracting semantic segmentation features and object boundary features of an image containing a transparent object, wherein the semantic segmentation features are used for distinguishing a foreground and a background of the image, and the object boundary features are used for indicating boundaries of different objects in the foreground;
generating complementary attention features based on the semantic segmentation features and the object boundary features, the complementary attention features being used to characterize complementary information of the semantic segmentation features and the object boundary features;
based on the complementary attention features, respectively correcting the semantic segmentation features and the object boundary features to obtain corrected semantic features and corrected boundary features;
based on the corrected semantic features and the corrected boundary features, generating a semantic segmentation map that identifies the location and boundary of the transparent object in the image.
In one aspect, there is provided a semantic segmentation apparatus for transparent objects in an image, the apparatus comprising:
the extraction module is used for extracting semantic segmentation features and object boundary features of an image containing a transparent object, wherein the semantic segmentation features are used for distinguishing the foreground and the background of the image, and the object boundary features are used for indicating the boundaries of different objects in the foreground;
a feature generation module for generating complementary attention features based on the semantic segmentation features and the object boundary features, the complementary attention features being used to characterize complementary information of the semantic segmentation features and the object boundary features;
a correction module, used for respectively correcting the semantic segmentation feature and the object boundary feature based on the complementary attention feature to obtain a corrected semantic feature and a corrected boundary feature;
and the image generation module is used for generating a semantic segmentation graph for identifying the position and the boundary of the transparent object in the image based on the corrected semantic features and the corrected boundary features.
In some embodiments, the extraction module comprises:
the first extraction submodule is used for inputting the image into a feature extraction network, and extracting semantic segmentation features of the image under multiple scales through multiple convolution layers in the feature extraction network;
and the generation sub-module is used for generating the object boundary features of the image based on the semantic segmentation features under the first scale and the semantic segmentation features under the last scale.
In some embodiments, the feature extraction network further comprises a hole space convolution pooling pyramid ASPP sub-network, the generating sub-module comprising:
the first extraction unit is used for inputting the semantic segmentation features under the last scale into the ASPP sub-network, and extracting the features of the semantic segmentation features under the last scale through the ASPP sub-network to obtain multi-scale semantic features;
the first fusion unit is used for fusing the semantic segmentation feature under the first scale and the multi-scale semantic feature to obtain the object boundary feature.
In some embodiments, the ASPP subnetwork comprises a one-dimensional convolution layer, a plurality of pooled pyramid layers and an ASPP pooled layer, wherein the plurality of pooled pyramid layers have the same convolution kernel size, but different pooled pyramid layers fill the receptive field of the convolution kernel based on different expansion rates, each pooled pyramid layer for extracting pooled pyramid features at one scale;
the first extraction unit includes:
the dimension-reducing convolution subunit is used for carrying out dimension-reducing convolution operation on the semantic segmentation feature under the last scale through the one-dimensional convolution layer to obtain dimension-reducing features;
the cavity convolution subunit is used for respectively carrying out cavity convolution operation based on different expansion rates on the semantic segmentation features under the last scale through the plurality of pooling pyramid layers to obtain pooling pyramid features under the plurality of scales;
chi Huazi unit, configured to perform pooling operation on the semantic segmentation feature under the last scale through the ASPP pooling layer to obtain ASPP pooling features;
and the fusion subunit is used for fusing the dimension reduction feature, the pooling pyramid features under the multiple scales and the ASPP pooling feature to obtain the multi-scale semantic feature.
In some embodiments, the ASPP pooling layer comprises a mean pooling layer, a one-dimensional convolution layer, and an upsampling layer;
the pooling subunit is configured to:
carrying out mean value pooling operation on the semantic segmentation features under the last scale through the mean value pooling layer to obtain mean value pooling features;
performing dimension reduction convolution operation on the average pooling feature through the one-dimensional convolution layer to obtain dimension reduction pooling feature;
and carrying out up-sampling operation on the dimension reduction pooling feature through the up-sampling layer to obtain the ASPP pooling feature.
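For illustration, the following is a minimal PyTorch-style sketch of the ASPP sub-network described above; the channel widths, the number of pooling pyramid layers, and the expansion (dilation) rates of 6, 12 and 18 are assumptions made for the sketch and are not details fixed by the embodiments above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Sketch of a hole (atrous) spatial pyramid pooling block; hyperparameters assumed."""
    def __init__(self, in_ch, out_ch=256, dilations=(6, 12, 18)):
        super().__init__()
        # one-dimensional (1x1) convolution branch: dimension-reducing convolution
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                    nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # pooling pyramid branches: same 3x3 kernel, different expansion (dilation) rates
        self.pyramids = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d, bias=False),
                          nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for d in dilations])
        # ASPP pooling branch: mean pooling -> 1x1 convolution -> upsampling
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                  nn.ReLU(inplace=True))
        # fuse the dimension-reduction feature, pyramid features and pooled feature
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (len(dilations) + 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):                       # x: semantic segmentation feature under the last scale
        h, w = x.shape[-2:]
        feats = [self.reduce(x)]                # dimension-reduction feature
        feats += [p(x) for p in self.pyramids]  # pooling pyramid features under several rates
        pooled = F.interpolate(self.pool(x), size=(h, w), mode='bilinear', align_corners=False)
        feats.append(pooled)                    # ASPP pooling feature
        return self.project(torch.cat(feats, dim=1))  # multi-scale semantic feature
```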
In some embodiments, the feature generation module comprises:
and the second extraction submodule is used for extracting complementary information of the semantic segmentation features and the object boundary features under the multiple scales through the cascade complementary attention network under the multiple scales to obtain the complementary attention features under the multiple scales.
In some embodiments, the complementary attention features at each scale include semantic channel complementary features and boundary channel complementary features;
The second extraction submodule includes:
the merging unit is used for merging, for the complementary attention network under any scale, the semantic branch features and the boundary branch features in the input signal along the channel dimension to obtain a joint feature;
the second extraction unit is used for inputting the joint feature into an attention extraction layer in the complementary attention network under the scale, and performing feature extraction on the joint feature through the attention extraction layer to obtain a two-channel attention feature, wherein the two-channel attention feature comprises a semantic channel attention feature and a boundary channel attention feature;
the second fusion unit is used for fusing the semantic channel attention features and the semantic branch features to obtain semantic channel complementary features in the complementary attention features under the scale;
the second fusion unit is further configured to fuse the boundary channel attention feature with the boundary branch feature, so as to obtain the boundary channel complementary feature in the complementary attention features under the scale.
In some embodiments, for the complementary attention network under the last scale, the semantic branch features are the multi-scale semantic features and the boundary branch features are the object boundary features;
and for the complementary attention networks under the remaining scales, the semantic branch features are obtained by fusing the semantic channel complementary features in the complementary attention features under the previous scale with the semantic segmentation features under the previous scale, and the boundary branch features are obtained by fusing the boundary channel complementary features in the complementary attention features under the previous scale with the semantic segmentation features under the first scale.
In some embodiments, the second extraction unit is configured to:
performing a channel-by-channel convolution operation on the joint feature in the spatial dimension through the attention extraction layer to obtain a contextual attention feature, wherein the contextual attention feature has the same dimensions as the joint feature;
and performing a point-by-point convolution operation on the contextual attention feature in the channel dimension to obtain the two-channel attention feature.
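A minimal sketch of such an attention extraction layer is given below, assuming PyTorch, a 7x7 depthwise kernel, and a sigmoid gate; these are illustrative assumptions rather than details fixed by the embodiments above.

```python
import torch
import torch.nn as nn

class AttentionExtractionLayer(nn.Module):
    """Channel-by-channel (depthwise) conv in the spatial dimension, followed by a
    point-by-point (1x1) conv across channels that outputs a two-channel attention map.
    Kernel size and the sigmoid activation are assumptions."""
    def __init__(self, joint_channels, kernel_size=7):
        super().__init__()
        # groups == channels keeps the contextual feature the same size as the joint feature
        self.depthwise = nn.Conv2d(joint_channels, joint_channels, kernel_size,
                                   padding=kernel_size // 2, groups=joint_channels)
        # point-by-point convolution across the channel dimension -> 2 channels
        self.pointwise = nn.Conv2d(joint_channels, 2, kernel_size=1)

    def forward(self, joint_feature):
        context = self.depthwise(joint_feature)        # contextual attention feature, same shape as input
        attn = torch.sigmoid(self.pointwise(context))  # two-channel attention feature, (N, 2, H, W)
        return attn                                    # channel 0: semantic, channel 1: boundary (assumed order)
```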
In some embodiments, the second fusion unit is configured to:
multiplying the semantic channel attention feature and the semantic branch feature element by element to obtain a semantic weighted feature;
performing a convolution operation on the semantic weighted feature to obtain a semantic weighted convolution feature;
and adding the semantic weighted convolution feature and the semantic branch feature element by element to obtain the semantic channel complementary feature.
In some embodiments, the second fusion unit is further configured to:
multiplying the boundary channel attention feature and the boundary branch feature element by element to obtain a boundary weighted feature;
performing a convolution operation on the boundary weighted feature to obtain a boundary weighted convolution feature;
and adding the boundary weighted convolution feature and the boundary branch feature element by element to obtain the boundary channel complementary feature.
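Putting the joint-feature construction, the attention extraction layer, and the element-by-element fusion described above together, a complementary attention network at one scale might look like the sketch below; equal channel counts on the two branches, the kernel sizes, and the sigmoid gate are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ComplementaryAttentionBlock(nn.Module):
    """Sketch of one complementary attention network: merge the two branches along
    the channel dimension, extract a two-channel attention feature, then weight,
    convolve, and residually add each branch. Hyperparameters are assumed."""
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        joint = 2 * channels
        # same depthwise + pointwise structure as the attention extraction layer above
        self.attn = nn.Sequential(
            nn.Conv2d(joint, joint, kernel_size, padding=kernel_size // 2, groups=joint),
            nn.Conv2d(joint, 2, kernel_size=1),
            nn.Sigmoid())
        # convolutions applied to the weighted features before the element-wise addition
        self.sem_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bnd_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, sem_branch, bnd_branch):
        joint = torch.cat([sem_branch, bnd_branch], dim=1)  # merge along the channel dimension
        attn = self.attn(joint)                             # two-channel attention feature
        sem_attn, bnd_attn = attn[:, 0:1], attn[:, 1:2]
        # semantic channel: weight, convolve, then add back the semantic branch feature
        sem_comp = sem_branch + self.sem_conv(sem_attn * sem_branch)
        # boundary channel: same pattern on the boundary branch feature
        bnd_comp = bnd_branch + self.bnd_conv(bnd_attn * bnd_branch)
        return sem_comp, bnd_comp                           # complementary features under this scale
```

Under the last scale, sem_branch and bnd_branch would correspond to the multi-scale semantic feature and the object boundary feature; under the remaining scales they would be the fused branch inputs described above.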
In some embodiments, the correction module is configured to:
fusing semantic channel complementary features in the complementary attention features under the multiple scales to obtain the corrected semantic features;
and fusing the boundary channel complementary features in the complementary attention features under the multiple scales to obtain the corrected boundary features.
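The embodiments above do not fix the exact operator used to fuse the complementary features across scales; one plausible sketch, assuming bilinear upsampling to the finest resolution followed by concatenation and a projection convolution supplied by the caller, is the following.

```python
import torch
import torch.nn.functional as F

def fuse_multiscale(complementary_feats, out_conv):
    """Fuse per-scale complementary features into a corrected feature (assumed scheme).

    complementary_feats: list of (N, C, H_k, W_k) tensors, finest scale first.
    out_conv: a module (e.g. a 1x1 nn.Conv2d) projecting the concatenation back to C channels.
    """
    target = complementary_feats[0].shape[-2:]   # finest-scale resolution
    aligned = [F.interpolate(f, size=target, mode='bilinear', align_corners=False)
               for f in complementary_feats]
    return out_conv(torch.cat(aligned, dim=1))   # corrected semantic or corrected boundary feature
```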
In some embodiments, the transparent object is a glass object.
In one aspect, an electronic device is provided that includes one or more processors and one or more memories having stored therein at least one computer program loaded and executed by the one or more processors to implement a method of semantic segmentation of a transparent object in an image as in any of the possible implementations described above.
In one aspect, a computer readable storage medium is provided, in which at least one computer program is stored, the at least one computer program being loaded and executed by a processor to implement a method for semantic segmentation of transparent objects in an image as in any one of the possible implementations described above.
In one aspect, a computer program product is provided that includes one or more computer programs stored in a computer-readable storage medium. The one or more processors of the electronic device are capable of reading the one or more computer programs from the computer-readable storage medium, the one or more processors executing the one or more computer programs such that the electronic device is capable of performing the method of semantic segmentation of transparent objects in an image of any of the possible embodiments described above.
The technical scheme provided by the embodiment of the application has the beneficial effects that at least:
Coarse semantic segmentation features and object boundary features are first extracted from the image, and the complementary information between them is then fully exploited to construct complementary attention features, so that the complementary attention features can correct the semantic segmentation features and the object boundary features respectively. Through this bidirectional interaction, useful pixel-level category information is transferred from the semantic branch to the boundary branch, and useful pixel-level boundary information is transferred from the boundary branch to the semantic branch. The corrected semantic features and corrected boundary features obtained through collaborative optimization can therefore focus on the feature values that contribute most to their own branch, and the semantic segmentation map predicted on this basis identifies transparent objects that are difficult to segment in the image more accurately, so that semantic segmentation of transparent objects is realized precisely.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of an implementation environment of a semantic segmentation method for transparent objects in an image according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for semantic segmentation of transparent objects in an image according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for semantic segmentation of transparent objects in an image according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a semantic segmentation process for transparent objects according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a complementary attention network provided by an embodiment of the present application;
FIG. 6 is an effect diagram of a semantic segmentation method for transparent objects in an image according to an embodiment of the present application;
FIG. 7 is a flowchart of a training method of a semantic segmentation model according to an embodiment of the present application;
FIG. 8 is a graph showing the comparison of the effects of a semantic segmentation method for transparent objects according to an embodiment of the present application;
FIG. 9 is a visual illustration of a dual channel complementary attention feature provided by an embodiment of the present application;
fig. 10 is a schematic structural diagram of a semantic segmentation device for transparent objects in an image according to an embodiment of the present application;
fig. 11 is a schematic structural view of an image processing apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
The terms "first," "second," and the like in this disclosure are used for distinguishing between similar elements or items having substantially the same function and function, and it should be understood that there is no logical or chronological dependency between the terms "first," "second," and "n," and that there is no limitation on the amount and order of execution.
The term "at least one" in the present application means one or more, meaning "a plurality of" means two or more, for example, a plurality of convolution layers means two or more.
The term "comprising at least one of A or B" in the present application relates to the following cases: only a, only B, and both a and B.
The user related information (including but not limited to user equipment information, personal information, behavior information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.) and signals referred to in the present application, when applied to a specific product or technology by the method of the present application, are all licensed, agreed, authorized, or fully authorized by the user, and the collection, use, and processing of the related information, data, and signals is required to comply with relevant laws and regulations and standards of the relevant country and region. For example, the images for detecting transparent objects referred to in the present application are all acquired with sufficient authorization.
The embodiment of the application relates to a semantic segmentation technology in the field of artificial intelligence, and before introducing the embodiment of the application, basic concepts in the field of artificial intelligence are introduced first.
Artificial intelligence (Artificial Intelligence, AI): artificial intelligence is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision. The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Computer Vision (CV): in the AI field, computer vision technology is a rapidly developing branch. Computer vision is the science of studying how to make a machine "see"; more specifically, it uses cameras, computers, and other machines in place of human eyes to identify and measure targets, and performs further graphics processing so that the machine produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision typically includes image segmentation, image recognition, image retrieval, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, optical character recognition (Optical Character Recognition, OCR), video processing, 3D (3 Dimensions) technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies.
Machine Learning (ML): machine learning is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like.
Image segmentation (Image Segmentation): image segmentation is one of the core problems of computer vision, with increasing importance to scene understanding. In the image segmentation task, the image is represented as a collection of physically meaningful connected regions, that is, the objects and the background in the image are respectively marked and positioned according to prior knowledge of the objects and the background, and then the objects are separated from the background or other pseudo objects. The image segmentation plays a role in image understanding applications such as target recognition, behavior analysis and the like, so that the data volume to be processed in the subsequent image analysis, recognition and other processing processes is greatly reduced, and meanwhile, the information about the structural characteristics of the image is reserved. Currently, there are three main directions of research for image segmentation, namely semantic segmentation, instance segmentation and panoramic segmentation.
Semantic segmentation (Semantic Segmentation): semantic segmentation refers to pixel-level image segmentation. For a given image, semantic segmentation assigns a class to each pixel in the image and finally outputs a semantic segmentation map. In the semantic segmentation map, the image is divided into several mutually disjoint regions according to the class assigned to each pixel; each region is composed of pixels belonging to the same class, pixels within the same region show consistency or similarity, and pixels in different regions are clearly different. For example, for an image containing a specified object (e.g., a transparent object), separating that object from the background is semantic segmentation with respect to that object. The purpose of semantic segmentation is to understand the content of an image at the pixel level and assign an object class to each pixel in the image.
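As a minimal, hypothetical illustration of this pixel-level assignment (not part of the embodiments below), per-pixel class scores can be turned into a semantic segmentation map by taking the highest-scoring class at each pixel:

```python
import torch

# toy example: scores for two classes (background vs. transparent object) on a 4x4 image
logits = torch.randn(1, 2, 4, 4)   # shape (N, num_classes, H, W)
seg_map = logits.argmax(dim=1)     # shape (N, H, W): one class label per pixel
```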
With the research and progress of artificial intelligence technology, it has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, unmanned driving, automatic driving, unmanned aerial vehicles, robots, intelligent medical care, intelligent customer service, and the like; it is believed that, as technology develops, artificial intelligence will be applied in more fields and become increasingly important. The technical solution provided by the embodiments of the present application relates to a glass object segmentation method based on an adaptive multi-scale feature reciprocal evolution module: by constructing a plurality of cascaded complementary attention networks, a multi-scale feature reciprocal evolution module that fuses semantic branch features and boundary branch features is formed, so that accurate semantic segmentation of transparent objects such as glass can be realized, as described below.
The terms related to the embodiments of the present application are explained as follows:
transparent object: refers to an object having light-transmitting properties, through which the environment behind the transparent object can be seen, and thus has special properties like "transparent". The transparent object according to the embodiment of the application refers to an object with transparency (refers to the light transmission degree of the object) meeting a set condition, such as glass, ice, transparent plastic, transparent film, and the like, wherein the transparency refers to the degree that visible light passes through the object but is not scattered. The setting condition is that the transparency is larger than the transparency threshold, or that the scattering rate of visible light is smaller than the scattering threshold, or that the visibility to the environment is larger than the visibility threshold, or the like, and the setting condition is not particularly limited here.
Glass: an amorphous, inorganic, non-metallic material and a typical transparent object. Glass includes colored glass and plain glass, both of which transmit light and therefore belong to transparent objects.
Scale: the scale space of a signal is obtained by filtering the original signal with a series of Gaussian filters to obtain a set of low-frequency signals; the scale space of image features treats the features extracted from the image as the original signal. A pyramid of image features can efficiently express image features at multiple scales. In general, filtering starts from one bottom-level feature, and the series of features obtained by filtering are called image pyramid features at different scales in the scale space. In the embodiments of the present application, the bottom-level feature is called the feature under the first scale and the top-level feature is called the feature under the last scale. It should be noted that the different scales represent the order in the filtering process and do not imply a linear increase or decrease in the spatial dimension or the channel dimension.
Attention mechanism (Attention Mechanism): a means of quickly screening high-value information from a large amount of information using limited attention resources. The visual attention mechanism is a brain signal processing mechanism peculiar to human vision: by quickly scanning the whole image, human vision locates the target region that needs attention, i.e., the focus of attention, and then devotes more attention resources to that region to obtain more detailed information about the target while suppressing other useless information.
Attention mechanisms are widely used in deep learning tasks such as natural language processing, image recognition, and speech recognition, and are among the core techniques in deep learning most worth attention and understanding. Specifically, in deep learning an attention mechanism can be formed by masking: a mask is essentially a set of weights, different weights can be assigned to different features with the mask, and key features are given higher weights, so that the deep neural network focuses on the key features with higher weights. Of course, the deep neural network needs to learn from a large amount of sample data to determine which features are the key features, so that such features are given higher weights in practical applications. In the embodiments of the present application, the complementary attention feature is a set of masks extracted based on the attention mechanism; it provides, for each pixel in the image, attention weights on a semantic channel and on a boundary channel respectively.
Intelligent Vehicle Infrastructure Cooperative Systems (IVICS): the vehicle-road cooperative system is one development direction of Intelligent Traffic Systems (Intelligent Traffic System, ITS). The vehicle-road cooperative system adopts advanced wireless communication, new-generation Internet, and other technologies to carry out omnidirectional, dynamic, real-time vehicle-vehicle and vehicle-road information interaction, and develops active vehicle safety control and cooperative road management on the basis of full space-time dynamic traffic information acquisition and fusion, fully realizing effective cooperation among people, vehicles, and roads, ensuring traffic safety and improving traffic efficiency, thereby forming a safe, efficient, and environment-friendly road traffic system.
Automatic driving technology generally includes high-precision maps, environment perception, behavior decision, path planning, motion control, and other technologies, and has broad application prospects. For automatic driving systems, accurately detecting transparent objects such as glass is critical. Transparent objects are widely distributed in daily life, for example glass objects such as windows, cups, glasses, and doors.
Because a transparent object has the special property of transparency, it reflects the background environment into the image; that is, its appearance changes as the environment changes and it has no fixed texture information. Meanwhile, because of this transparency, the pattern inside a transparent object and the surrounding pattern are very similar, so the boundaries of some transparent objects often appear very blurred.
These characteristics mean that a transparent object has no fixed texture information and its boundary appears blurred, which poses a great challenge to the semantic segmentation of transparent objects. In view of this, the embodiments of the present application provide a transparent object segmentation method for RGB (red green blue) images, which can realize more accurate segmentation of transparent objects by effectively mining the semantic branch features and boundary branch features of the transparent object.
The system architecture of the embodiment of the present application is described below.
Fig. 1 is a schematic diagram of an implementation environment of a semantic segmentation method for transparent objects in an image according to an embodiment of the present application. Referring to fig. 1, the implementation environment includes an image capturing apparatus 101 and an image processing apparatus 102, which are connected directly or indirectly through wired or wireless communication; the present application is not limited in this respect.
The image capturing device 101 is configured to capture an image to be detected, for example, the image capturing device 101 captures an image including a transparent object, and transmits the captured image including the transparent object to the image processing device 102. The image capturing device 101 may be a capturing device that is deployed independently, or may be a camera assembly that is subordinate to an electronic device integrated with the image processing device, which is not particularly limited in the embodiment of the present application.
The image processing device 102 is configured to perform semantic segmentation on the image sent by the image capturing device 101, and optionally perform a downstream image post-processing task based on a semantic segmentation map output by the semantic segmentation task, for example, in an autopilot system, further involving downstream tasks such as obstacle detection, environment awareness, behavior decision, path planning, and motion control.
In some embodiments, the image capturing apparatus 101 and the image processing apparatus 102 may be integrated electronic apparatuses, for example, the image capturing apparatus 101 is an in-vehicle camera, the image processing apparatus 102 is an in-vehicle terminal, and the in-vehicle camera and the in-vehicle terminal are collectively controlled by an automated driving system. The vehicle-mounted camera sends the collected image to the vehicle-mounted terminal, and the vehicle-mounted terminal performs semantic segmentation on the collected image through a local semantic segmentation model.
In other embodiments, the image capturing device 101 and the image processing device 102 may be different devices that are deployed independently, for example, the image capturing device 101 is an in-vehicle terminal configured with a camera component, and the image processing device 102 is an image processing server in the cloud. After the vehicle-mounted terminal collects the image through the camera component, an image processing request (such as a semantic segmentation request) carrying the collected image is sent to an image processing server, and an image processing result (such as a semantic segmentation graph) is returned by the image processing server.
In some embodiments, the image processing device 102 is a different device than the image processing server, and the image processing device 102 may pre-process the image and then send the pre-processed image to the image processing server for further processing. Alternatively, the image processing apparatus 102 takes on primary image processing work, and the image processing server takes on secondary image processing work; alternatively, the image processing apparatus 102 performs a secondary image processing job, and the image processing server performs a primary image processing job; alternatively, a distributed computing architecture is employed between the image processing device 102 and the image processing server for collaborative image processing.
In some embodiments, the image processing server is a stand-alone physical server, or a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), and basic cloud computing services such as big data and artificial intelligence platforms.
In some embodiments, the image processing device 102 is a vehicle-mounted terminal, a smart voice interaction device, a smart home appliance, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto. The embodiment of the application can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent transportation, auxiliary driving and the like.
The following describes the main flow of semantic segmentation in the embodiment of the present application.
Fig. 2 is a flowchart of a semantic segmentation method of a transparent object in an image according to an embodiment of the present application. Referring to fig. 2, this embodiment is performed by an electronic device, which may be provided as the image processing device 102 in the above-described implementation environment, and includes the steps of:
201. For an image containing a transparent object, semantic segmentation features and object boundary features of the image are extracted, the semantic segmentation features being used to distinguish the foreground from the background of the image, and the object boundary features being used to indicate the boundaries of different objects in the foreground.
The image related to the embodiment of the application can be an image acquired by the image acquisition device and sent to the electronic device, or an image acquired by the electronic device, or an image locally stored by the electronic device, or an image from the cloud, and the embodiment of the application does not specifically limit the source of the image.
It should be noted that the images from the above sources are all RGB images; that is, accurate semantic segmentation of transparent objects can be realized with only a single-modality signal, namely the RGB image collected by the original camera component, without additionally introducing signals of other modalities (such as text or grayscale modalities) as auxiliary information, which greatly improves the applicability to real scenarios.
The transparent object according to the embodiments of the present application refers to an object with a light-transmitting property, where the light-transmitting property refers to the transparency of part or all of the object, and transparency refers to the degree to which visible light passes through the object without being scattered; that is, transparency measures how much light the object transmits. Optionally, an object whose transparency meets a set condition is regarded as a transparent object, such as glass, ice, transparent plastic, or transparent film. Optionally, the set condition is that the transparency is greater than a transparency threshold, or that the scattering rate of visible light is smaller than a scattering threshold, or that the visibility of the environment through the object is greater than a visibility threshold, and the set condition is not specifically limited here. Since the environment behind a transparent object can be seen through it, the transparent object has no fixed texture information and its boundaries often appear blurred.
In some embodiments, for an original image containing a transparent object, semantic segmentation features of the image for characterizing depth features of the image on a semantic segmentation task and object boundary features for characterizing boundary features of the image on a boundary detection task may be initially extracted. Optionally, the semantic segmentation feature and the object boundary feature are extracted by a feature extraction network, where the feature extraction network may be: resNet (Residual Networks, residual network), ASPP (Atrous Spatial Pyramid Pooling, hole space convolution pooling pyramid), FPN (Feature Pyramid Networks, feature pyramid network), CNN (Convolutional Neural Networks, convolutional neural network), DNN (Deep Neural Network ), etc., or the feature extraction network may be a combination of one or more of the above, and the structure of the feature extraction network is not specifically limited in the embodiments of the present application.
202. Based on the semantic segmentation feature and the object boundary feature, a complementary attention feature is generated, the complementary attention feature being used to characterize complementary information of the semantic segmentation feature and the object boundary feature.
In some embodiments, for the semantic segmentation feature and the object boundary feature extracted in step 201, the semantic segmentation feature and the object boundary feature may be fused to generate a complementary attention feature that characterizes complementary information of the semantic segmentation feature and the object boundary feature, where the complementary information includes complementary information of the semantic segmentation feature that contributes to the boundary detection task and complementary information of the object boundary feature that contributes to the semantic segmentation task.
In some embodiments, the complementary attention feature is a two-channel complementary attention feature comprising a semantic channel complementary feature and a boundary channel complementary feature. The semantic channel complementary feature characterizes the complementary information that the object boundary features adaptively convey to the semantic segmentation task, so that useful complementary information is effectively transferred from the boundary branch to the semantic branch; the boundary channel complementary feature characterizes the complementary information that the semantic segmentation features adaptively convey to the boundary detection task, so that useful complementary information is effectively transferred from the semantic branch to the boundary branch. Intuitively, on one hand the object boundary features provide the semantic segmentation task with more precise position information about the transparent object, and on the other hand the semantic segmentation features provide the boundary detection task with a supervision signal about non-transparent objects, which largely removes interference from noise such as the boundaries of non-transparent objects. Integrating the complementary information that the semantic branch contributes to the boundary branch with the complementary information that the boundary branch contributes to the semantic branch helps extract more accurate corrected semantic features and corrected boundary features at the feature level.
203. The semantic segmentation feature and the object boundary feature are corrected respectively based on the complementary attention feature, to obtain a corrected semantic feature and a corrected boundary feature.
In some embodiments, the semantic segmentation feature and the object boundary feature can be corrected based on the complementary attention feature; for example, the semantic segmentation feature is weighted by the semantic channel complementary feature in the complementary attention feature to obtain the corrected semantic feature, and the object boundary feature is weighted by the boundary channel complementary feature in the complementary attention feature to obtain the corrected boundary feature. The two-channel complementary attention feature gives higher weights to the feature values in the semantic segmentation feature that matter most to the semantic segmentation task and to the feature values in the object boundary feature that matter most to the boundary detection task, so that each task applies high attention to its most important feature values, improving detection accuracy on both the semantic segmentation branch and the boundary detection branch.
Extracting the complementary attention feature makes full use of the complementary information between the semantic segmentation feature and the object boundary feature: through bidirectional interaction between the two, the complementary attention feature corrects the initially extracted semantic segmentation feature and object boundary feature, so that the corrected semantic feature and corrected boundary feature are obtained by collaborative optimization. Because the boundary of a transparent object is often less clear than that of an ordinary object and its appearance is affected by environmental changes, the semantic segmentation feature and object boundary feature initially extracted in step 201 often have poor expressive power; it is therefore necessary to use the complementary information between them, and the corrected semantic feature and corrected boundary feature greatly alleviate the difficulty of extracting accurate features caused by the light-transmitting property of transparent objects.
204. Based on the corrected semantic features and the corrected boundary features, a semantic segmentation map is generated that identifies the location and boundary of the transparent object in the image.
In some embodiments, the semantic segmentation map of the image can be further predicted based on the corrected semantic features in step 203, and the boundary detection map of the image can be further predicted based on the corrected boundary features in step 203. Since only a transparent object needs to be precisely cut out from an image in the semantic segmentation task, the above-mentioned boundary detection map is not an essential item in the output result.
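As a high-level sketch of steps 201 to 204, the overall flow could be expressed as below; the component interfaces backbone, complementary_attention, and seg_head are hypothetical and are used here only to make the data flow concrete, treating each stage as a black box and ignoring the multi-scale cascade for brevity.

```python
import torch

def segment_transparent_objects(image, backbone, complementary_attention, seg_head):
    """Hypothetical end-to-end flow of steps 201-204.

    image: (N, 3, H, W) RGB tensor; backbone returns (semantic_feat, boundary_feat);
    complementary_attention returns (corrected_semantic, corrected_boundary);
    seg_head maps the corrected features to per-class logits.
    """
    semantic_feat, boundary_feat = backbone(image)             # step 201
    corrected_sem, corrected_bnd = complementary_attention(    # steps 202-203
        semantic_feat, boundary_feat)
    logits = seg_head(corrected_sem, corrected_bnd)            # step 204
    # upsample the logits to the input resolution, then take a per-pixel class decision
    logits = torch.nn.functional.interpolate(
        logits, size=image.shape[-2:], mode='bilinear', align_corners=False)
    return logits.argmax(dim=1)                                # semantic segmentation map
```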
In the embodiments of the present application, the extraction of a complementary attention feature from semantic segmentation features and object boundary features under a single scale, and the correction using that complementary attention feature, are taken as an example for explanation. The next embodiment describes in detail how to extract semantic segmentation features and object boundary features under different scales of the scale space and generate a complementary attention feature for each scale to correct the semantic segmentation features and object boundary features under that scale, which is not repeated here.
All the above optional solutions can be combined to form an optional embodiment of the present disclosure, which is not described in detail herein.
According to the method provided by the embodiments of the present application, coarse semantic segmentation features and object boundary features are first extracted from the image, and the complementary information between them is then fully exploited to construct complementary attention features, so that the complementary attention features can correct the semantic segmentation features and the object boundary features respectively. Through bidirectional interaction, useful pixel-level category information is transferred from the semantic branch to the boundary branch, and useful pixel-level boundary information is transferred from the boundary branch to the semantic branch, so that the corrected semantic features and corrected boundary features obtained by collaborative optimization can focus on the feature values that contribute most to their own branch. The semantic segmentation map predicted on this basis can therefore identify transparent objects that are difficult to segment in the image more accurately, so that semantic segmentation of transparent objects is realized precisely.
In the above embodiment, the main flow of the semantic segmentation method of the transparent object in the image is described, and in the embodiment of the present application, the above semantic segmentation flow will be described in detail for the case of extracting the semantic segmentation feature and the object boundary feature respectively at different scales in the scale space.
Fig. 3 is a flowchart of a semantic segmentation method of a transparent object in an image according to an embodiment of the present application. Referring to fig. 3, this embodiment is performed by an electronic device, which may be provided as the image processing device 102 in the above-described implementation environment, and includes the steps of:
301. An image containing a transparent object is input into a feature extraction network, and semantic segmentation features of the image under multiple scales are extracted through multiple convolution layers in the feature extraction network, wherein the semantic segmentation features are used for distinguishing the foreground and the background of the image.
In the embodiments of the present application, the transparent object refers to an object with a light-transmitting property; for example, the transparent object is a glass object, or another object such as ice, transparent plastic, or transparent film, and the kind of transparent object is not specifically limited here. In particular, in the field of automatic driving, vehicle-mounted cameras often capture RGB images containing glass objects such as windows, glasses, cups, and doors, so automatic driving algorithms depend on accurate semantic segmentation of transparent objects, especially glass objects.
In the embodiments of the present application, the feature extraction network is described as including a convolutional sub-network and an ASPP sub-network, where the convolutional sub-network includes a plurality of convolution layers and the ASPP sub-network is described in detail in step 302 below. This is only one exemplary architecture of the feature extraction network; the feature extraction network may also be a combination of one or more of CNN, DNN, ResNet, ASPP, and FPN, which is not specifically limited in the embodiments of the present application.
In some embodiments, an image including a transparent object is obtained, for example, an RGB image is read from a local database, for example, a camera assembly is turned on to capture an RGB image, for example, a frame of RGB video image is captured from a video stream recorded by the camera assembly, for example, an RGB image is downloaded from a cloud database, and the source of the image in the embodiments of the present application is not specifically limited.
After the image is obtained, based on the architecture of the feature extraction network, an image (such as an RGB image) containing a transparent object is input into a convolution sub-network of the feature extraction network, feature extraction is performed on the image through a plurality of cascade convolution layers in the convolution sub-network, each convolution layer is used for extracting semantic segmentation features of the image under one scale, and finally the semantic segmentation features under a plurality of scales respectively output by the plurality of convolution layers can be obtained.
In some embodiments, if the convolution sub-network is a conventional CNN, the multiple convolution layers are cascaded (i.e., connected in series), that is, the semantic segmentation feature at a certain scale output by the previous convolution layer is input directly into the current convolution layer. If the convolution sub-network is a ResNet, residual connections exist among the convolution layers, that is, the semantic segmentation feature at a certain scale output by the previous convolution layer is spliced with semantic segmentation features at other scales output by other preceding convolution layers, and the spliced feature is input into the current convolution layer. For example, with a residual connection every other convolution layer, the semantic segmentation feature at the 1st scale output by the 1st convolution layer is spliced with the original RGB image and input into the 2nd convolution layer; with a residual connection every two convolution layers, the semantic segmentation feature at the 2nd scale output by the 2nd convolution layer is spliced with the original RGB image and input into the 3rd convolution layer, and so on, which is not repeated here.
Illustratively, taking the convolution sub-network as a ResNet as an example, the ResNet includes n cascaded residual blocks, and each residual block internally includes i residual-connected convolution layers; that is, the residual blocks satisfy a cascade relationship and the interior of each residual block satisfies a residual connection relationship, where n is an integer greater than or equal to 1 and i is an integer greater than or equal to 2. Taking ResNet-50 as an example, ResNet-50 here refers to n=4 residual blocks, each containing i=6 convolution layers; inside each residual block, the input of the 1st convolution layer is spliced with the output of the 3rd convolution layer and input into the 4th convolution layer, and the output of the 3rd convolution layer is spliced with the output of the 6th convolution layer and input into the next residual block. This is merely an example; the convolution sub-network may also be ResNet-18, ResNet-34, ResNet-101, ResNet-152, etc., or a conventional CNN, which is not limited in the embodiments of the present application.
Fig. 4 is a schematic diagram of a semantic segmentation process for transparent objects according to an embodiment of the present application. As shown in fig. 4, the convolution sub-network is illustrated as ResNet-50. The image 400 is acquired and input into the convolution sub-network 410 of the feature extraction network, where the convolution sub-network 410 contains 4 cascaded residual blocks. The image 400 is input into the 1st residual block Layer1, and the semantic segmentation feature F_1 at the 1st scale is extracted; F_1 is input into the 2nd residual block Layer2, and the semantic segmentation feature F_2 at the 2nd scale is extracted; F_2 is input into the 3rd residual block Layer3, and the semantic segmentation feature F_3 at the 3rd scale is extracted; F_3 is input into the 4th residual block Layer4, and the semantic segmentation feature F_4 at the 4th scale is extracted. In total, the convolution sub-network extracts 4 semantic segmentation features {F_1, F_2, F_3, F_4} at 4 scales. Note that the convolution layers inside each residual block and their connection relationships are not shown here. It should be understood that, for a ResNet-type convolution sub-network, the output of each residual block can be taken as the semantic segmentation feature at one scale, as in this illustration; for a non-ResNet convolution sub-network, or a convolution sub-network that does not contain residual blocks, the outputs of all convolution layers, or of a selected subset of convolution layers, can be taken as the semantic segmentation features at multiple scales.
Further, regardless of the type of the convolution sub-network, the convolution operation inside a single convolution layer is described by taking one convolution layer as an example. A convolution layer may include one or more convolution kernels, and each convolution kernel corresponds to a scanning window whose size is equal to the size of the convolution kernel. During the convolution operation, the scanning window slides over the input feature map of the current convolution layer according to a set target step size and scans each region of the input feature map in turn, where the target step size may be set by a developer. Taking one convolution kernel as an example, when its scanning window slides to any region of the input feature map, the feature values recorded in that region are read, a dot-product operation is performed between the convolution kernel and these feature values, the products are accumulated, and the accumulated result is output as a new feature value. The scanning window then slides to the next region of the input feature map according to the target step size and the convolution operation is performed again, until all regions of the input feature map have been scanned. All the output new feature values form the output feature map of the current convolution layer, which is either input directly to the next convolution layer, or spliced with the input feature map and the output feature maps of other preceding convolution layers before being input to the next convolution layer. The convolution operation is the same inside every convolution layer and is not repeated.
It should be further noted that the convolution sub-network is used to extract the semantic segmentation features at the multiple scales, and each semantic segmentation feature can essentially be represented as an H×W×C feature map. Generally, as the depth of the convolution layers increases, the spatial size of the semantic segmentation features decreases (i.e., the width W and height H decrease) while the channel dimension increases (i.e., the depth, number of channels, or dimension C of the semantic segmentation features increases).
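As an illustration of the above multi-scale extraction, the following is a minimal sketch assuming a torchvision-style ResNet-50 backbone whose four residual stages supply the semantic segmentation features F_1 to F_4. The stage layout and channel counts follow torchvision's implementation rather than the block arrangement recited above, so treat it only as an approximation of the described convolution sub-network.

```python
# Hedged sketch: multi-scale feature extraction with a ResNet-50-style backbone.
# Assumes torchvision is available; weights are randomly initialized.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultiScaleBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        net = resnet50()  # randomly initialized ResNet-50
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4

    def forward(self, x):
        x = self.stem(x)
        f1 = self.layer1(x)   # semantic segmentation feature at scale 1 (highest resolution)
        f2 = self.layer2(f1)  # scale 2
        f3 = self.layer3(f2)  # scale 3
        f4 = self.layer4(f3)  # scale 4 (lowest resolution, most channels)
        return f1, f2, f3, f4

feats = MultiScaleBackbone()(torch.randn(1, 3, 512, 512))
print([f.shape for f in feats])  # spatial size shrinks, channel count grows with depth
```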
302. Inputting the semantic segmentation features under the last scale into an ASPP sub-network in the feature extraction network, and extracting features of the semantic segmentation features under the last scale through the ASPP sub-network to obtain multi-scale semantic features.
When the ASPP sub-network is configured in the feature extraction network in addition to the convolution sub-network, the semantic segmentation feature under the last scale is input into the ASPP sub-network, so that convolution operations with different expansion rates (also called dilation or hole rates) and a pooling operation are introduced through the ASPP sub-network, thereby extracting a multi-scale semantic feature that contains feature information from different receptive fields. In a feature extraction network comprising the convolution sub-network and the ASPP sub-network, the convolution sub-network introduces a deep network model that extracts deep semantic segmentation features at a series of different scales, while the ASPP sub-network introduces dilated convolution operations with different expansion rates, so that the multi-scale semantic feature can fuse features from different receptive fields.
In some embodiments, the ASPP sub-network includes a one-dimensional convolution layer, a plurality of pooling pyramid layers and an ASPP pooling layer, which are connected in parallel; that is, the semantic segmentation feature under the last scale is input into the one-dimensional convolution layer, the pooling pyramid layers and the ASPP pooling layer of the ASPP sub-network respectively. The one-dimensional convolution layer is used to reduce the dimension of the input feature map, each pooling pyramid layer is used to extract a pooling pyramid feature at one scale, and the ASPP pooling layer is used to pool the input feature map. Based on the ASPP sub-network configured as above, step 302 may include the following sub-steps 3021-3024:
3021. Perform a dimension-reduction convolution operation on the semantic segmentation feature under the last scale through the one-dimensional convolution layer of the ASPP sub-network to obtain the dimension reduction feature.
In some embodiments, the semantic segmentation feature under the last scale is input into the one-dimensional convolution layer of the ASPP sub-network, which contains K convolution kernels of size 1×1. A convolution operation is performed on the semantic segmentation feature under the last scale with these K 1×1 convolution kernels, which reduces its dimension, that is, reduces its channel dimension to K and yields the dimension reduction feature, where K is a hyper-parameter predefined by a technician and K is an integer greater than or equal to 1.
For example, still taking the convolution sub-network ResNet-50 as an illustration, the semantic segmentation feature under the last scale is F_4. Suppose F_4 is an H_4×W_4×C_4 feature map; after F_4 is input into the one-dimensional convolution layer of the ASPP sub-network, a convolution operation is performed on F_4 with the K 1×1 convolution kernels, and an H_4×W_4×K dimension reduction feature is output, reducing the dimension from C_4 to K. The convolution operation is the same as the example in step 301 and is not repeated.
3022. Perform dilated (hole) convolution operations based on different expansion rates on the semantic segmentation feature under the last scale through the plurality of pooling pyramid layers of the ASPP sub-network respectively, to obtain pooling pyramid features under multiple scales.
In some embodiments, the semantic segmentation feature under the last scale is input into each of the pooling pyramid layers of the ASPP sub-network, and each pooling pyramid layer contains a dilated convolution kernel. A dilated convolution kernel differs from a conventional convolution kernel in that it is padded with 0 according to a given expansion rate on the basis of the conventional kernel, so as to realize receptive fields of different sizes. Each pooling pyramid layer performs a dilated convolution operation on the input semantic segmentation feature under the last scale with its own dilated convolution kernel, so as to extract a pooling pyramid feature at one scale with the receptive field corresponding to that kernel, and the outputs of the pooling pyramid layers together form the pooling pyramid features under multiple scales.
In some embodiments, the pooling pyramid layers all use convolution kernels of the same size, but different pooling pyramid layers pad their kernels based on different expansion rates, so that receptive fields of different sizes can be constructed from kernels of the same size. Within each receptive field, only the positions belonging to the original kernel carry non-zero weight coefficients and the remaining positions are filled with 0, so receptive fields of different sizes are introduced on the basis of kernels of the same size, and the resulting pooling pyramid features under multiple scales reflect pooled feature information under multiple receptive fields.
For example, still taking the convolution sub-network ResNet-50 as an illustration, the semantic segmentation feature under the last scale is F_4. Assume that 3 pooling pyramid layers are configured in the ASPP sub-network, and the dilated convolution kernels of the 3 layers are all of size 3×3 but have different expansion rates, so as to construct receptive fields of different scales (the scale of the receptive field refers to the size of the scanning window); a technician can configure different expansion rates to realize multi-scale pooling pyramid feature extraction. F_4 is input into each of the 3 pooling pyramid layers of the ASPP sub-network, each pooling pyramid layer performs a dilated convolution operation on F_4 with its own dilated convolution kernel and outputs a pooling pyramid feature at one scale, and the outputs of the 3 layers together form the pooling pyramid features under 3 scales. The convolution operation is the same as the example in step 301, except that the conventional convolution kernel is replaced with a dilated convolution kernel, whose receptive fields of different scales correspond to scanning windows of different sizes; this is not repeated.
3023. And carrying out pooling operation on the semantic segmentation features under the last scale through an ASPP pooling layer of the ASPP subnetwork to obtain ASPP pooling features.
In some embodiments, the semantic segmentation feature under the last scale is input into the ASPP pooling layer, the ASPP pooling layer performs pooling operation on the semantic segmentation feature under the last scale, and the ASPP pooling feature is output. Optionally, the ASPP pooling layer is provided as a mean pooling layer, a maximum pooling layer, etc., or as a cascade of pooling layers and other post-processing layers.
In the following, one possible ASPP pooling layer architecture is described as an example. It comprises a mean pooling layer, a one-dimensional convolution layer and an up-sampling layer that satisfy a cascade relationship, where the mean pooling layer performs mean pooling on the input feature map, the one-dimensional convolution layer reduces the dimension of the input feature map, and the up-sampling layer up-samples the input feature map.
In some embodiments, the semantic segmentation feature under the last scale is first input into the mean pooling layer, and a mean pooling operation is performed on it through the mean pooling layer to obtain the mean pooling feature. Specifically, when the scanning window scans any region of the semantic segmentation feature under the last scale, the average value of the feature values in the scanning window is obtained and output as a new feature value of the mean pooling feature. Here the ASPP pooling layer is only described as being configured with a mean pooling layer as an example; optionally, the mean pooling layer may be replaced with a maximum pooling layer, the difference being that the maximum of the feature values in the scanning window is selected as the new feature value, which is not repeated here.
In some embodiments, the average pooling feature is input into a one-dimensional convolution layer, and the dimension-reducing convolution operation is performed on the average pooling feature by the one-dimensional convolution layer based on the same manner as in the sub-step 3021, so as to obtain the dimension-reducing pooling feature, and the dimension-reducing convolution operation mode is not described herein.
In some embodiments, the dimension reduction pooling feature is input into an upsampling layer, and upsampling operation is performed on the dimension reduction pooling feature by the upsampling layer to obtain the ASPP pooling feature. Optionally, the upsampling manner of the upsampling layer includes, but is not limited to: deconvolution (Transposed Convolution), inverse pooling (un-pooling), bilinear interpolation (Bilinear), etc., the up-sampling method is not particularly limited here.
The above example provides a processing flow under one possible architecture of the ASPP pooling layer. By constructing cascaded mean pooling, one-dimensional convolution and up-sampling layers, adaptive mean pooling can be realized through the mean pooling layer, that is, only the size of the final output mean pooling feature needs to be specified rather than the size and step of the pooling kernel; then the channels of the mean pooling feature are compressed by the one-dimensional convolution layer to obtain a global dimension-reduction pooling feature; and finally the dimension-reduction pooling feature is converted back to the original spatial size by the up-sampling layer. In this way, the spatial size of the feature map passing through the ASPP pooling layer remains unchanged while the depth, i.e., the number of channels (or dimension), is compressed, so that the ASPP pooling feature is characterized by more concentrated semantic feature information.
3024. And fusing the dimension reduction feature, the pooling pyramid feature under the multiple scales and the ASPP pooling feature through the ASPP sub-network to obtain the multi-scale semantic feature.
In some embodiments, the dimension reduction features obtained in step 3021, the pooling pyramid features under multiple scales obtained in step 3022, and the ASPP pooling features obtained in step 3023 are fused to obtain multi-scale semantic features. Alternatively, the fusion means includes, but is not limited to: splicing, addition by element, multiplication by element, bilinear fusion, etc., the fusion mode is not particularly limited here.
In one possible implementation, a fusion manner of splicing and convolution is provided, that is, the dimension reduction feature, the pooling pyramid features under multiple scales and the ASPP pooling feature are spliced (Concat, i.e., combined in the channel dimension), the spliced feature is input into a one-dimensional convolution layer for a dimension-reduction convolution operation, and the multi-scale semantic feature is output. This fusion manner of splicing and convolution reduces the dimension of the multi-scale semantic feature while keeping its spatial size the same as that of the semantic segmentation feature under the last scale, so that a pooled feature with global semantic information and a compressed dimension is obtained, and the multi-scale semantic feature can further be used to synthesize the subsequent object boundary feature so as to improve its expression capability.
Still taking fig. 4 as an example, the semantic segmentation feature F_4 at the 4th scale output by the 4th residual block Layer4 in the convolution sub-network 410 is input into the ASPP sub-network 420, features are further extracted from F_4 through the ASPP sub-network 420, and the multi-scale semantic feature F_5 is output. Finally, a feature sequence {F_1, F_2, F_3, F_4, F_5} formed by the semantic segmentation features at 4 scales and the multi-scale semantic feature is obtained. The convolution sub-network 410 and the ASPP sub-network 420 form the feature extraction network, which can be regarded as a backbone network with a ResNet-50 + ASPP architecture; the input RGB image 400 is fed into this backbone network to obtain the features F_i, i = {1, 2, 3, 4, 5}.
In the above steps 3021-3024, a multi-scale semantic feature extraction flow under one possible architecture of the ASPP sub-network is provided. By configuring a one-dimensional convolution layer, a plurality of pooling pyramid layers and an ASPP pooling layer connected in parallel, the multi-scale semantic feature captures features with different receptive field sizes without losing semantic detail, and the channel dimension is effectively compressed, so that the resulting low-dimensional multi-scale semantic feature has rich expression capability. In other embodiments, the ASPP sub-network may be configured with more or fewer pooling pyramid layers, and other processing layers may be added or removed, which is not limited here.
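The following is a hedged sketch of an ASPP sub-network matching sub-steps 3021-3024: a 1×1 dimension-reduction branch, three dilated 3×3 branches, and a pooling branch, fused by splicing and a 1×1 convolution. The dilation rates (6, 12, 18) and the output channel number K = 256 are illustrative assumptions, not values fixed by the text.

```python
# Hedged sketch of the ASPP sub-network (sub-steps 3021-3024).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, k=256, dilations=(6, 12, 18)):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, k, kernel_size=1)                 # 3021: 1x1 dimension reduction
        self.pyramids = nn.ModuleList([                                  # 3022: dilated 3x3 branches
            nn.Conv2d(in_ch, k, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)                              # 3023: adaptive mean pooling
        self.pool_conv = nn.Conv2d(in_ch, k, kernel_size=1)              # 1x1 channel compression
        self.fuse = nn.Conv2d(k * (2 + len(dilations)), k, kernel_size=1)  # 3024: concat + 1x1 conv

    def forward(self, f_last):
        h, w = f_last.shape[-2:]
        branches = [self.reduce(f_last)]
        branches += [conv(f_last) for conv in self.pyramids]
        pooled = self.pool_conv(self.pool(f_last))                       # 1x1 spatial size
        branches.append(F.interpolate(pooled, size=(h, w), mode='bilinear',
                                      align_corners=False))              # upsample back
        return self.fuse(torch.cat(branches, dim=1))                     # multi-scale semantic feature F_5

f4 = torch.randn(1, 2048, 16, 16)
print(ASPP(2048)(f4).shape)  # torch.Size([1, 256, 16, 16]); spatial size unchanged, channels compressed
```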
303. And fusing the semantic segmentation feature under the first scale with the multi-scale semantic feature to obtain the object boundary feature of the image, wherein the object boundary feature is used for indicating the boundary of different objects in the foreground.
In some embodiments, the first-scale semantic segmentation feature of the semantic segmentation features at the multiple scales extracted in the step 301 is fused with the multi-scale semantic feature extracted in the step 302 to obtain the object boundary feature. Alternatively, the fusion means includes, but is not limited to: splicing, addition by element, multiplication by element, bilinear fusion, etc., the fusion mode is not particularly limited here.
In one possible implementation, a fusion manner of splicing and convolution (concat + conv) is provided. Still taking fig. 4 as an example, the semantic segmentation feature F_1 at the 1st scale output by the 1st residual block Layer1 in the convolution sub-network 410 is spliced with the multi-scale semantic feature F_5 output by the ASPP sub-network 420, the spliced feature is input into a convolution layer for a convolution operation, and the object boundary feature is output. The convolution layer here is not limited to a one-dimensional convolution layer, and the convolution kernel size is configured by the technician. Through this fusion manner of splicing and convolution, the semantic segmentation feature at the first scale and the multi-scale semantic feature can be fully fused, so that the object boundary feature has a stronger expression capability.
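A minimal sketch of this splicing-and-convolution fusion of F_1 and F_5 follows. Because the text does not state how the differing spatial sizes of F_1 and F_5 are matched, the sketch assumes F_5 is bilinearly upsampled to F_1's resolution before splicing.

```python
# Hedged sketch of step 303: object boundary feature = conv(concat(F_1, upsample(F_5))).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryFeatureHead(nn.Module):
    def __init__(self, c1, c5, out_ch=256):
        super().__init__()
        self.fuse = nn.Conv2d(c1 + c5, out_ch, kernel_size=3, padding=1)  # kernel size is illustrative

    def forward(self, f1, f5):
        f5_up = F.interpolate(f5, size=f1.shape[-2:], mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([f1, f5_up], dim=1))   # boundary branch supervision signal

fb = BoundaryFeatureHead(256, 256)(torch.randn(1, 256, 128, 128), torch.randn(1, 256, 16, 16))
print(fb.shape)  # torch.Size([1, 256, 128, 128])
```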
The above steps 302-303 provide one possible implementation of generating the object boundary feature of the image based on the semantic segmentation features at the first and last scales: the ASPP sub-network first extracts the multi-scale semantic feature from the semantic segmentation feature at the last scale, and this multi-scale semantic feature is then fused with the semantic segmentation feature at the first scale to obtain the object boundary feature. Because the multi-scale semantic feature, which reflects rich receptive fields, participates in the generation of the object boundary feature, and the semantic segmentation feature at the first scale is a bottom-level feature that naturally has the richest texture and other details, the fused object boundary feature has stronger boundary discrimination.
In other embodiments, instead of steps 302-303, the semantic segmentation features at the first and last scales may be fused directly to generate the object boundary feature, or the semantic segmentation features at all scales may be fused together to generate the object boundary feature. In these cases no ASPP sub-network needs to be configured, which simplifies the structure of the feature extraction network, reduces the training cost and simplifies the synthesis flow of the object boundary feature.
In the above steps 301-303, one possible implementation of extracting the semantic segmentation features and the object boundary feature of the image containing a transparent object is provided: on one hand, the semantic segmentation features of the image at multiple scales are extracted through the convolution sub-network; on the other hand, the multi-scale semantic feature is extracted through the ASPP sub-network and then fused with the semantic segmentation feature at the first scale to extract the object boundary feature. In this way, both rich semantic segmentation features at multiple scales and an accurate object boundary feature with strong expression capability are obtained.
304. Through the complementary attention network at any scale among the cascaded complementary attention networks at multiple scales, combine the semantic branch feature and the boundary branch feature in its input signal in the channel dimension to obtain a joint feature.
In some embodiments, based on the design that the feature extraction network extracts semantic segmentation features at multiple scales and multi-scale semantic features respectively, complementary attention networks at multiple scales are correspondingly configured, the complementary attention networks at different scales are cascaded in turn, and the complementary attention network at each scale is used for extracting complementary attention features at the current scale. Optionally, the complementary attention feature at each scale is a two-channel feature map, in which a semantic channel complementary feature and a boundary channel complementary feature are involved.
In the following, a process flow of extracting the complementary attention feature at the current scale by the complementary attention network at any scale will be described by taking the complementary attention network at this scale as an example.
In some embodiments, the complementary attention networks at different scales extract the complementary attention features at their own scales in the same manner, but their input signals differ. That is, although the input signal of the complementary attention network at every scale contains one semantic branch feature and one boundary branch feature, the semantic branch feature and the boundary branch feature are obtained in different ways at different scales.
Optionally, for the complementary attention network at the last scale, the multi-scale semantic features extracted in the step 302 are taken as semantic branch features in the input signal, and the object boundary features extracted in the step 303 are taken as boundary branch features in the input signal. In other words, for a complementary attention network at the last scale, the semantic branch features are multi-scale semantic features and the boundary branch features are object boundary features.
Optionally, for the complementary attention networks at the remaining scales, that is, at all scales except the last scale, the semantic channel complementary feature in the complementary attention feature extracted by the complementary attention network at the previous scale in the cascade is fused with the semantic segmentation feature extracted by the feature extraction network at that previous scale, to obtain the semantic branch feature in the input signal; similarly, the boundary channel complementary feature in the complementary attention feature extracted by the complementary attention network at the previous scale is fused with the semantic segmentation feature at the first scale, to obtain the boundary branch feature in the input signal. In other words, for the complementary attention networks at the remaining scales, the semantic branch feature is obtained by fusing the semantic channel complementary feature at the previous scale with the semantic segmentation feature at the previous scale, and the boundary branch feature is obtained by fusing the boundary channel complementary feature at the previous scale with the semantic segmentation feature at the first scale.
Still taking fig. 4 as an example, the semantic segmentation flow is also configured with 4 cascaded complementary attention networks MSM_1 to MSM_4, where MSM_1 is the complementary attention network at the first scale and MSM_4 is the complementary attention network at the last scale. For the complementary attention network MSM_4 at the last scale, the multi-scale semantic feature F_5 output by the ASPP sub-network 420 is taken as the semantic branch feature in its input signal; since F_5 carries the richest semantic information, it can provide MSM_4 with a better supervision signal on the semantic branch. The object boundary feature obtained by fusing F_5 and F_1 is taken as the boundary branch feature in the input signal; since F_5 has the richest semantic information and F_1 has the richest texture information of the object, the object boundary feature obtained by fusing them can directly serve as the supervision signal of the boundary branch.
For the complementary attention networks MSM_1 to MSM_3 at the remaining scales, take the complementary attention network MSM_i at the i-th scale as an example, where i is 1, 2 or 3. The semantic channel complementary feature P_{i+1}^s in the complementary attention feature extracted by the complementary attention network MSM_{i+1} at the previous scale, i.e., the (i+1)-th scale, is fused with the semantic segmentation feature F_{i+1} at the (i+1)-th scale extracted by the convolution sub-network 410 to obtain the semantic branch feature; because both P_{i+1}^s and F_{i+1} are involved, the detail information of the lower-level features and the rich semantic information of the higher-level features can be well utilized. Similarly, the boundary channel complementary feature P_{i+1}^b in the complementary attention feature extracted by MSM_{i+1} at the (i+1)-th scale is fused with the semantic segmentation feature F_1 at the first scale to obtain the boundary branch feature; involving F_1 in this way avoids losing the rich texture information of the lower-level feature F_1 during the extraction of complementary information.
Based on the above configuration of the input signals, in the case where the cascaded complementary attention networks MSM_1 to MSM_4 at 4 scales are configured and the scale index is denoted by i, i ∈ {1, 2, 3, 4}, let F_i^s denote the semantic branch feature and F_i^b the boundary branch feature in the input signal of the complementary attention network MSM_i, and let P_{i+1}^s and P_{i+1}^b denote the semantic channel complementary feature and the boundary channel complementary feature output by MSM_{i+1}. The input signal can then be expressed as follows:

F_i^s = F_5, if i = 4;  F_i^s = [P_{i+1}^s ; F_{i+1}], if i ∈ {1, 2, 3}
F_i^b = [F_5 ; F_1], if i = 4;  F_i^b = [P_{i+1}^b ; F_1], if i ∈ {1, 2, 3}

where [· ; ·] denotes the feature fusion operation. In one possible implementation, the feature fusion operation uses splicing and convolution, or other manners such as element-wise addition, element-wise multiplication, splicing or bilinear fusion, which are not specifically limited here.
It should be noted that the case of configuring 4 cascaded complementary attention networks is only taken as an example here. The number of complementary attention networks only needs to be kept consistent with the number of scales of the semantic segmentation features extracted by the feature extraction network, and more or fewer scales may be configured; the number of scales is likewise not specifically limited.
Through the above configuration of the input signals of the complementary attention network at each scale, the semantic channel complementary feature and the boundary channel complementary feature in the complementary attention feature at each scale are extracted from the two branches, i.e., the semantic branch and the boundary branch. The complementary attention feature at each scale inherits, on its two channels, the feature information of the complementary attention feature at the previous scale, so that through the cascaded complementary attention networks at multiple scales the semantic branch and the boundary branch can fully and bidirectionally transfer the complementary information each of them needs.
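The input-signal configuration above can be summarized in the following hedged sketch. The fusion is written as a bare channel concatenation after bilinear alignment; the 1×1 convolution that the splicing-and-convolution fusion would add, and the spatial alignment itself, are assumptions the text leaves implicit.

```python
# Hedged sketch of the input-signal configuration for MSM_4 -> MSM_1.
import torch
import torch.nn.functional as F

def align_concat(a, b):
    # resize b to a's resolution, then merge in the channel dimension: [a ; b]
    b = F.interpolate(b, size=a.shape[-2:], mode='bilinear', align_corners=False)
    return torch.cat([a, b], dim=1)

def build_msm_inputs(i, feats, f5, boundary_feat, prev_sem=None, prev_bnd=None):
    """Return (F_i^s, F_i^b) for MSM_i; feats = [F_1, F_2, F_3, F_4], i in {1, ..., 4}."""
    if i == 4:
        return f5, boundary_feat                 # F_4^s = F_5, F_4^b = object boundary feature [F_5 ; F_1]
    sem = align_concat(feats[i], prev_sem)       # fuse F_{i+1} with P_{i+1}^s from MSM_{i+1}
    bnd = align_concat(feats[0], prev_bnd)       # fuse F_1 with P_{i+1}^b from MSM_{i+1}
    return sem, bnd
```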
In some embodiments, after the semantic branch feature F_i^s and the boundary branch feature F_i^b in the input signal of the complementary attention network MSM_i at any scale are configured in the above manner, the semantic branch feature F_i^s and the boundary branch feature F_i^b are spliced, that is, merged in the channel dimension, to obtain the joint feature [F_i^s ; F_i^b] between the semantics and the boundary.
305. The joint feature is input into an attention extraction layer in the complementary attention network at the current scale, and feature extraction is performed on the joint feature through the attention extraction layer to obtain a dual-channel attention feature, wherein the dual-channel attention feature comprises a semantic channel attention feature and a boundary channel attention feature.
In some embodiments, the joint feature obtained in step 304 is input into the attention extraction layer of the complementary attention network MSM_i at the current scale, and further feature extraction is performed on the joint feature through the attention extraction layer, so that the joint feature is converted into a dual-channel attention feature, in which one channel is the semantic channel attention feature and the other channel is the boundary channel attention feature.
In some embodiments, the attention extraction layer is configured as a depth separable convolution (Depthwise Separable Convolution) layer for splitting the convolution into a channel-by-channel convolution of a spatial dimension and a point-by-point convolution of a channel dimension.
On this basis, the attention extraction layer first performs a channel-by-channel convolution operation in the spatial dimension on the joint feature to obtain a contextual attention feature. That is, assuming the joint feature has D channels, D single-channel convolution kernels are configured in one-to-one correspondence with the D channels of the joint feature, and each single-channel convolution kernel performs a convolution operation on only one channel of the joint feature. Performing the channel-by-channel convolution operation on the D-dimensional joint feature with the D single-channel convolution kernels yields a D-dimensional contextual attention feature, so the contextual attention feature has the same dimension as the joint feature; in other words, the channel-by-channel convolution operation does not change the channel dimension. This channel-by-channel convolution operation takes into account the context information inside each channel of the joint feature.
Further, a point-by-point convolution operation in the channel dimension is performed on the contextual attention feature to obtain a two-channel attention map A_0, and A_0 is then nonlinearly mapped through an activation layer to obtain the final dual-channel attention feature A. In the point-by-point convolution operation, each convolution kernel performs a convolution operation over all channels of the contextual attention feature, so that the feature information of all channels is combined on one output channel; since the output feature map must have 2 channels, only two convolution kernels need to be configured in the point-by-point convolution operation to convert the contextual attention feature into the two-channel attention map A_0. Each channel of A_0 is obtained by a point-by-point convolution of the contextual attention feature with one convolution kernel, so each channel can fully fuse the information of all channels of the contextual attention feature at a deep channel level. Then, the two-channel attention map A_0 is nonlinearly mapped by the activation function of the activation layer to obtain the dual-channel attention feature A, where the activation function may be sigmoid, ReLU or tanh; the activation function is not specifically limited here.
For the dual-channel attention feature A, the attention feature of each channel can be written as A_c, where c denotes the channel number, c ∈ {1, 2}, and the dual-channel attention feature A is expressed by the following formula:

A = σ(SepConv([F_i^s ; F_i^b])), A = [A_1 ; A_2]

where σ denotes the sigmoid activation function, SepConv denotes the depth separable convolution operation, [· ; ·] denotes the merging operation in the channel dimension, F_i^s denotes the semantic branch feature and F_i^b the boundary branch feature in the input signal of the complementary attention network MSM_i at the i-th scale, A_1 denotes the semantic channel attention feature and A_2 the boundary channel attention feature in the dual-channel attention feature A.
Schematically, fig. 5 is a schematic diagram of a complementary attention network provided by an embodiment of the present application. Referring to fig. 5, for the complementary attention network MSM_i at any scale, the semantic branch feature F_i^s and the boundary branch feature F_i^b in the input signal are spliced, and the resulting joint feature is input into the attention extraction layer, which is illustrated here as a depth separable convolution layer 510. In the depth separable convolution layer 510, a channel-by-channel convolution and a point-by-point convolution are performed in turn on the joint feature, the resulting two-channel attention map A_0 is activated to obtain the dual-channel attention feature, and the dual-channel attention feature comprises the semantic channel attention feature 501 and the boundary channel attention feature 502.
In the above process, the convolution is split into a channel-by-channel convolution in the spatial dimension and a point-by-point convolution in the channel dimension. The channel-by-channel convolution fully considers the context information of the feature, and the point-by-point convolution deeply fuses the information between features at the channel level, which greatly improves the extraction capability for complementary information, so that the complementary attention feature has a stronger expression capability.
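A minimal sketch of the attention extraction layer of steps 304-305 follows: a depthwise (channel-by-channel) convolution, a pointwise convolution down to two channels, and a sigmoid activation. The 3×3 depthwise kernel size is an assumption; the text does not fix it.

```python
# Hedged sketch of the attention extraction layer: A = sigmoid(SepConv([F_i^s ; F_i^b])).
import torch
import torch.nn as nn

class AttentionExtractor(nn.Module):
    def __init__(self, joint_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(joint_ch, joint_ch, kernel_size=3,
                                   padding=1, groups=joint_ch)   # per-channel spatial context
        self.pointwise = nn.Conv2d(joint_ch, 2, kernel_size=1)   # fuse channels -> two-channel map A_0

    def forward(self, joint):                 # joint = [F_i^s ; F_i^b] along the channel dimension
        a0 = self.pointwise(self.depthwise(joint))
        a = torch.sigmoid(a0)                 # dual-channel attention feature A
        return a[:, 0:1], a[:, 1:2]           # A_1: semantic channel, A_2: boundary channel

a1, a2 = AttentionExtractor(512)(torch.randn(1, 512, 32, 32))
```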
306. And fusing the semantic channel attention features and the semantic branch features to obtain semantic channel complementary features in the complementary attention features under the scale.
In some embodiments, for the dual-channel attention feature extracted in step 305, the semantic channel attention feature in the dual-channel attention feature is fused with the semantic branch feature F_i^s in the input signal of the complementary attention network MSM_i at the current scale, to obtain the semantic channel complementary feature in the output signal of MSM_i (i.e., in the complementary attention feature at the current scale). Optionally, the fusion manner includes but is not limited to: splicing, element-wise addition, element-wise multiplication, bilinear fusion, etc.; the fusion manner is not specifically limited here.
In one possible implementation manner, a fusion manner of sequentially performing multiplication by element, convolution and addition by element is provided, namely, firstly, the semantic channel attention feature and the semantic branch feature are multiplied by element to obtain a semantic weighted feature; then, carrying out convolution operation on the semantic weighted features to obtain semantic weighted convolution features; and finally, adding the semantic weighted convolution feature and the semantic branch feature according to elements to obtain the semantic channel complementary feature.
Still taking fig. 5 as an example, first, the semantic channel attention feature 501 in the dual-channel attention feature and the semantic branch feature F_i^s in the input signal are multiplied element by element to obtain the semantic weighted feature; the semantic weighted feature is then input into a convolution layer for a convolution operation to obtain the semantic weighted convolution feature; finally, the semantic weighted convolution feature and the semantic branch feature F_i^s are added element by element to obtain the semantic channel complementary feature P_i^s.
In the above process, through the fusion manner of element-wise multiplication, convolution and element-wise addition in turn, the semantic branch feature in the input signal and the newly extracted semantic channel attention feature can be fully fused. Because the semantic channel attention feature captures the complementary information of the semantic and boundary branch features, fusing the semantic branch feature back in on the basis of the optimized semantic channel attention feature avoids losing the detail information of the semantic branch feature in the input signal and improves the expression capability of the semantic channel complementary feature.
307. And fusing the boundary channel attention feature and the boundary branch feature to obtain boundary channel complementary features in the complementary attention features under the scale.
In some embodiments, for the dual-channel attention feature extracted in step 305, the boundary channel attention feature in the dual-channel attention feature is fused with the boundary branch feature F_i^b in the input signal of the complementary attention network MSM_i at the current scale, to obtain the boundary channel complementary feature in the output signal of MSM_i (i.e., in the complementary attention feature at the current scale). Optionally, the fusion manner includes but is not limited to: splicing, element-wise addition, element-wise multiplication, bilinear fusion, etc.; the fusion manner is not specifically limited here.
In one possible implementation, a fusion mode of sequentially performing multiplication by element, convolution and addition by element is provided, namely, firstly, the boundary channel attention feature and the boundary branch feature are subjected to multiplication by element to obtain a boundary weighting feature; then, carrying out convolution operation on the boundary weighted feature to obtain a boundary weighted convolution feature; and finally, adding the boundary weighted convolution characteristic and the boundary branch characteristic according to elements to obtain the boundary channel complementary characteristic.
Still taking fig. 5 as an example, first, the boundary channel attention feature 502 in the dual-channel attention feature and the boundary branch feature F_i^b in the input signal are multiplied element by element to obtain the boundary weighted feature; the boundary weighted feature is then input into a convolution layer for a convolution operation to obtain the boundary weighted convolution feature; finally, the boundary weighted convolution feature and the boundary branch feature F_i^b are added element by element to obtain the boundary channel complementary feature P_i^b.
In the above process, through the fusion manner of element-wise multiplication, convolution and element-wise addition in turn, the boundary branch feature in the input signal and the newly extracted boundary channel attention feature can be fully fused. Because the boundary channel attention feature captures the complementary information of the semantic and boundary branch features, fusing the boundary branch feature back in on the basis of the optimized boundary channel attention feature avoids losing the detail information of the boundary branch feature in the input signal and improves the expression capability of the boundary channel complementary feature.
In the examples of steps 306 and 307, for both the semantic branch and the boundary branch, a fusion manner of element-wise multiplication, convolution and element-wise addition in turn is provided. The semantic channel complementary feature P_i^s and the boundary channel complementary feature P_i^b obtained in this way form the complementary attention feature P_i at the current scale, expressed by the following formulas:

P_i = [P_i^s ; P_i^b]
P_i^s = Conv(A_1 ⊙ F_i^s) ⊕ F_i^s
P_i^b = Conv(A_2 ⊙ F_i^b) ⊕ F_i^b

where P_i denotes the complementary attention feature at the i-th scale, P_i^s the semantic channel complementary feature and P_i^b the boundary channel complementary feature in the complementary attention feature, F_i^s and F_i^b denote the semantic branch feature and the boundary branch feature in the input signal of the complementary attention network MSM_i at the i-th scale, A_1 and A_2 denote the semantic channel attention feature and the boundary channel attention feature in the dual-channel attention feature A, Conv denotes a convolution operation, ⊙ denotes element-wise multiplication and ⊕ denotes element-wise addition.
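Putting steps 304-307 together, the following hedged sketch shows one complementary attention network MSM_i end to end: the joint feature, the two-channel attention map, and the multiply-convolve-add fusion on each branch. Channel counts and kernel sizes are illustrative assumptions.

```python
# Hedged sketch of one complementary attention network MSM_i (steps 304-307).
import torch
import torch.nn as nn

class MSM(nn.Module):
    def __init__(self, sem_ch, bnd_ch):
        super().__init__()
        joint_ch = sem_ch + bnd_ch
        self.depthwise = nn.Conv2d(joint_ch, joint_ch, 3, padding=1, groups=joint_ch)
        self.pointwise = nn.Conv2d(joint_ch, 2, 1)
        self.sem_conv = nn.Conv2d(sem_ch, sem_ch, 3, padding=1)
        self.bnd_conv = nn.Conv2d(bnd_ch, bnd_ch, 3, padding=1)

    def forward(self, f_sem, f_bnd):
        joint = torch.cat([f_sem, f_bnd], dim=1)                 # [F_i^s ; F_i^b]
        a = torch.sigmoid(self.pointwise(self.depthwise(joint)))  # A = sigma(SepConv(joint))
        a1, a2 = a[:, 0:1], a[:, 1:2]
        p_sem = self.sem_conv(a1 * f_sem) + f_sem   # P_i^s = Conv(A_1 * F_i^s) + F_i^s
        p_bnd = self.bnd_conv(a2 * f_bnd) + f_bnd   # P_i^b = Conv(A_2 * F_i^b) + F_i^b
        return p_sem, p_bnd

p_s, p_b = MSM(256, 256)(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32))
```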
In the above steps 304-307, one possible implementation is provided for extracting complementary information from the semantic segmentation features and the object boundary feature at the multiple scales, to obtain the complementary attention features at multiple scales, where the complementary attention feature at each scale includes a semantic channel complementary feature and a boundary channel complementary feature. In each complementary attention network, the boundary branch feature and the semantic branch feature in the input signal are combined, and the complementary attention feature at the current scale is further extracted on the basis of the joint feature, so that the layer-by-layer complementary attention networks can deeply extract the complementary information of semantics and boundary from the features and realize the co-evolution of the two branch features.
In addition, when the complementary attention network at each scale predicts the complementary attention feature at its own scale, complementary information can be transferred to the other branch in a concise and efficient manner (i.e., bidirectional transfer between the semantic branch and the boundary branch), so that invalid regions in the feature are suppressed and valid regions are enhanced, improving the expression capability of the complementary attention feature at each scale.
Further, in the above process, a possible implementation manner of generating the complementary attention feature based on the semantic segmentation feature and the object boundary feature is realized, where the complementary attention feature is used to characterize complementary information of the semantic segmentation feature and the object boundary feature, and by adaptively configuring a stacked multi-scale cascade complementary attention network according to the multi-scale semantic segmentation feature, detailed information of a low-level feature and rich semantic information of a high-level feature can be better utilized, so that the utilization rate of the multi-scale feature is improved.
308. And fusing the complementary features of the semantic channels in the complementary attention features under the multiple scales to obtain corrected semantic features.
In some embodiments, the processing as in steps 304-307 described above is performed for each of the complementary attention networks at multiple scales, thereby extracting complementary attention features at multiple scales, respectively, using the complementary attention networks at multiple scales. Optionally, the complementary attention network MSM at all scales i And fusing the complementary features of the semantic channels in the output complementary attention features to obtain corrected semantic features. Alternatively, the fusion means includes, but is not limited to: splicing, addition by element, multiplication by element, bilinear fusion, etc., the fusion mode is not particularly limited here.
In one possible implementation, a fusion manner of splicing and convolution is provided. Still taking fig. 4 as an example, the semantic channel complementary features P_1^s to P_4^s output on the semantic branch by the complementary attention networks MSM_1 to MSM_4 at the 4 scales are spliced, the spliced feature is input into a convolution layer for a convolution operation, and the corrected semantic feature is output. The convolution layer here is not limited to a one-dimensional convolution layer, and the convolution kernel size is configured by the technician. Through this fusion manner of splicing and convolution, the semantic channel complementary features in the complementary attention features at multiple scales can be fully fused, so that the final corrected semantic feature suppresses invalid regions and enhances valid regions, and attention is focused on the pixels indicated by the feature values that contribute more to the semantic segmentation task.
309. And fusing the boundary channel complementary features in the complementary attention features under the multiple scales to obtain the corrected boundary features.
In some embodiments, the processing as in steps 304-307 described above is performed for each of the complementary attention networks at multiple scales, thereby extracting complementary attention features at multiple scales, respectively, using the complementary attention networks at multiple scales. Optionally, the complementary attention network MSM at all scales i And fusing the boundary channel complementary features in the output complementary attention features to obtain corrected boundary features. Alternatively, the fusion means includes, but is not limited to: splicing, addition by element, multiplication by element, bilinear fusion, etc., the fusion mode is not particularly limited here.
In one possible implementation, a fusion manner of splicing and convolution is provided. Still taking fig. 4 as an example, the boundary channel complementary features P_1^b to P_4^b output on the boundary branch by the complementary attention networks MSM_1 to MSM_4 at the 4 scales are spliced, the spliced feature is input into a convolution layer for a convolution operation, and the corrected boundary feature is output. The convolution layer here is not limited to a one-dimensional convolution layer, and the convolution kernel size is configured by the technician. Through this fusion manner of splicing and convolution, the boundary channel complementary features in the complementary attention features at multiple scales can be fully fused, so that the final corrected boundary feature suppresses invalid regions and enhances valid regions, and attention is focused on the pixels indicated by the feature values that contribute more to the boundary detection task.
In the foregoing steps 308-309, one possible implementation of correcting the semantic segmentation feature and the object boundary feature based on the complementary attention features is provided, yielding the corrected semantic feature and the corrected boundary feature. Under the multi-scale complementary attention network architecture, the semantic branch fuses the multi-scale semantic channel complementary features under the hierarchical cascade structure, and the boundary branch fuses the multi-scale boundary channel complementary features under the hierarchical cascade structure, so that the complementary information transferred bidirectionally across multiple scales is fully utilized in both branches, and the respective expression capabilities of the corrected semantic feature and the corrected boundary feature are improved.
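The multi-scale fusion of steps 308-309 can be sketched as below for either branch. Resizing the per-scale complementary features to a common resolution before splicing is an assumption; the text does not state how their spatial sizes are matched.

```python
# Hedged sketch of steps 308-309: splice the per-scale complementary features and convolve.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    def __init__(self, per_scale_ch, num_scales, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(per_scale_ch * num_scales, out_ch, kernel_size=1)

    def forward(self, complementary_feats):
        target = complementary_feats[0].shape[-2:]       # align to the first (finest) scale
        aligned = [F.interpolate(p, size=target, mode='bilinear', align_corners=False)
                   for p in complementary_feats]
        return self.fuse(torch.cat(aligned, dim=1))      # corrected semantic or boundary feature

# usage (hypothetical tensors): corrected_sem = MultiScaleFusion(256, 4, 256)([p1_s, p2_s, p3_s, p4_s])
```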
310. Based on the corrected semantic feature and the corrected boundary feature, a semantic segmentation map is generated that identifies the location and boundary of the transparent object in the image.
In some embodiments, the corrected semantic feature is input into a convolution prediction network, and the corrected semantic feature is reconstructed by the convolution prediction network to output a semantic segmentation map for the original RGB image. Optionally, the convolution prediction network may be a ResNet, CNN or other convolution network, which is not specifically limited here.
Fig. 6 is an effect diagram of the semantic segmentation method for transparent objects in an image according to an embodiment of the present application. As shown in fig. 6, the left part shows an original RGB image 601 that contains transparent objects which are difficult to segment: eyeglasses made of glass and a mineral water bottle. With the method provided by the embodiment of the application, the semantic segmentation map 602 is obtained, and it can be seen that the eyeglass lenses and the mineral water bottle are accurately segmented in the semantic segmentation map 602.
In other embodiments, the corrected boundary feature is input into another convolution prediction network, and the corrected boundary feature is reconstructed by the convolution prediction network to output a boundary detection map for the original RGB image. Optionally, the convolution prediction network may be a ResNet, CNN or other convolution network, which is not specifically limited here.
Continuing with fig. 4 as an example, the corrected semantic feature is input into a convolution prediction network to obtain the semantic segmentation map 401, and the corrected boundary feature is input into another convolution prediction network to obtain the boundary detection map 402. This semantic segmentation manner can fully mine the bidirectional complementary relationship between the semantic branch features and the boundary branch features: on one hand, the complementary information transferred from the boundary branch to the semantic branch helps the semantic segmentation task locate the semantic extent of the transparent object more accurately; on the other hand, the complementary information transferred from the semantic branch to the boundary branch helps the boundary detection task effectively suppress noise caused by false detections and the like, realizing efficient optimization of boundary recognition.
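A hedged sketch of the two convolution prediction networks of step 310 follows; the head depth, the two-class output for segmentation, the single-channel boundary map and the bilinear upsampling to the input resolution are all assumptions consistent with, but not dictated by, the text.

```python
# Hedged sketch of step 310: small convolutional heads for segmentation and boundary prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictionHead(nn.Module):
    def __init__(self, in_ch, num_outputs):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, num_outputs, 1))

    def forward(self, feat, out_size):
        logits = self.conv(feat)
        return F.interpolate(logits, size=out_size, mode='bilinear', align_corners=False)

seg_head = PredictionHead(256, num_outputs=2)   # transparent object vs. background logits
bnd_head = PredictionHead(256, num_outputs=1)   # boundary probability logits
seg_map = seg_head(torch.randn(1, 256, 128, 128), out_size=(512, 512))
```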
The semantic segmentation method for transparent objects in an image according to the embodiment of the application can realize accurate semantic segmentation of transparent objects from an RGB image alone, without auxiliary information from other modalities, and is therefore easier to apply in real scenes. In addition, by introducing the adaptive multi-scale complementary attention networks, the complementary effect of semantics and boundary in transparent object recognition is taken into account: the semantic branch features and the boundary branch features adaptively transfer to each other the complementary information the other side needs through an attention mechanism, realizing the co-evolution of the features on the two branches. This promotes both accurate semantic segmentation prediction and accurate boundary detection, and ultimately improves the prediction accuracy of the overall model, i.e., of the overall semantic segmentation task and boundary detection task, which is of great significance for effectively recognizing transparent objects in real-scene applications.
All the above optional solutions can be combined to form an optional embodiment of the present disclosure, which is not described in detail herein.
According to the method provided by the embodiments of the present application, coarse semantic segmentation features and object boundary features are first extracted from the image, and the complementary information between them is then fully exploited to construct complementary attention features. These complementary attention features correct the semantic segmentation features and the object boundary features respectively: through bidirectional interaction, useful pixel-level category information is transferred from the semantic branch to the boundary branch, and useful pixel-level boundary information is transferred from the boundary branch to the semantic branch, so that the corrected semantic features and corrected boundary features obtained by this collaborative optimization focus on the feature values that contribute most to their own branches. The semantic segmentation map predicted on this basis can therefore identify transparent objects that are difficult to segment more accurately, achieving accurate semantic segmentation of transparent objects.
The above embodiments describe in detail how accurate semantic segmentation of transparent objects in an image is achieved by using a trained feature extraction network and complementary attention networks at multiple scales; together, the feature extraction network and the multi-scale complementary attention networks are regarded as a semantic segmentation model.
Fig. 7 is a flowchart of a training method of a semantic segmentation model according to an embodiment of the present application. Referring to fig. 7, this embodiment is performed by an electronic device, which may be provided as the image processing device 102 or the image processing server in the above-described implementation environment, and includes the steps of:
701. Various parameters of the semantic segmentation model are initialized.
Optionally, all parameters of the feature extraction network and of the complementary attention network at each scale in the semantic segmentation model are randomly assigned, or these networks are pre-trained, so as to obtain an initialized semantic segmentation model.
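As a minimal sketch of these two initialization options, the following might be used; the attribute name `feature_extractor` and the checkpoint path are hypothetical and not defined in this application.

```python
import torch
import torch.nn as nn

def init_semantic_segmentation_model(model: nn.Module, pretrained_backbone_path: str = None):
    """Step 701 sketch: initialize all parameters of the semantic segmentation model."""
    # Option 1: random assignment of all parameters.
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.ones_(m.weight)
            nn.init.zeros_(m.bias)
    # Option 2: load pre-trained weights for the feature extraction network, if available
    # (assumes the model exposes a `feature_extractor` sub-module).
    if pretrained_backbone_path is not None:
        state = torch.load(pretrained_backbone_path, map_location="cpu")
        model.feature_extractor.load_state_dict(state, strict=False)
    return model
```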
702. Each sample image in the sample image set is input into the semantic segmentation model, and a semantic segmentation map and a boundary detection map of each sample image are output through the semantic segmentation model.
The sample image set may include a plurality of sample images containing transparent objects, and each sample image is annotated with the position and boundary of the transparent objects. Optionally, each sample image is configured with a semantic annotation map and a boundary annotation map.
For each sample image, the manner of acquiring the semantic segmentation map and the boundary detection map is the same as that in the previous embodiment, and will not be described again.
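A sample image set of this kind (RGB images paired with semantic annotation maps and boundary annotation maps) might be wrapped as below; the directory layout, file naming, and image size are assumptions made for this sketch only.

```python
import os

from PIL import Image
from torch.utils.data import Dataset
from torchvision.transforms import InterpolationMode
import torchvision.transforms.functional as TF

class TransparentObjectDataset(Dataset):
    """Each sample: an RGB image, its semantic annotation map, and its boundary annotation map."""

    def __init__(self, root: str, size=(512, 512)):
        self.root = root
        self.size = list(size)
        self.names = sorted(os.listdir(os.path.join(root, "images")))

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        image = Image.open(os.path.join(self.root, "images", name)).convert("RGB")
        semantic = Image.open(os.path.join(self.root, "semantic_labels", name))
        boundary = Image.open(os.path.join(self.root, "boundary_labels", name))

        image = TF.to_tensor(TF.resize(image, self.size))                        # [3, H, W], float
        semantic = TF.pil_to_tensor(
            TF.resize(semantic, self.size, interpolation=InterpolationMode.NEAREST)
        ).squeeze(0).long()                                                      # [H, W], class indices
        boundary = TF.to_tensor(
            TF.resize(boundary, self.size, interpolation=InterpolationMode.NEAREST)
        )                                                                        # [1, H, W], 0/1 boundary mask
        return image, semantic, boundary
```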
703. A loss function value for the current iteration is calculated based on the difference between the semantic segmentation map and the semantic annotation map of each sample image and the difference between the boundary detection map and the boundary annotation map of each sample image.
In some embodiments, for all sample images in the sample image set, cross entropy loss values between the semantic segmentation maps and the semantic annotation maps are obtained, and Dice loss values between the boundary detection maps and the boundary annotation maps are obtained. The Dice loss is a loss constructed from the Dice coefficient, a similarity measure between the boundary detection map and the boundary annotation map that ranges from 0 to 1, with larger values indicating greater similarity. The cross entropy loss value and the Dice loss value are then summed, with or without weighting, to obtain the total loss function value of the current iteration.
In this process, the cross entropy loss value measures well the difference between different semantic segmentation maps, while the Dice loss value copes with the strong positive/negative sample imbalance typical of semantic segmentation, which improves the accuracy of the loss function value. In other embodiments, cross entropy loss values may be used for both branches, which is not specifically limited in the embodiments of the present application.
Still taking fig. 4 as an example, assume the input RGB image 400 is a sample image: the semantic segmentation map 401 and the boundary detection map 402 are predicted through the whole semantic segmentation model; the cross entropy loss value L_s of the semantic branch is calculated based on the semantic segmentation map 401 and the semantic annotation map of the sample image; the Dice loss value L_b of the boundary branch is calculated based on the boundary detection map 402 and the boundary annotation map of the sample image; and the cross entropy loss value L_s and the Dice loss value L_b are weighted and summed to obtain the loss function value of the current iteration.
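Under these assumptions (cross entropy L_s for the semantic branch, Dice loss L_b for the boundary branch, combined by weighted summation), the per-iteration loss might be computed as in the sketch below; the smoothing term and the weights w_s, w_b are illustrative values, not values given in this application.

```python
import torch
import torch.nn.functional as F

def dice_loss(boundary_logits: torch.Tensor, boundary_target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Dice loss = 1 - Dice coefficient; the coefficient measures the overlap between the
    predicted boundary map and the boundary annotation map (0 to 1, larger = more similar)."""
    prob = torch.sigmoid(boundary_logits)                      # [N, 1, H, W]
    inter = (prob * boundary_target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + boundary_target.sum(dim=(1, 2, 3))
    dice = (2 * inter + eps) / (union + eps)
    return 1 - dice.mean()

def total_loss(sem_logits, sem_target, bnd_logits, bnd_target, w_s: float = 1.0, w_b: float = 1.0):
    L_s = F.cross_entropy(sem_logits, sem_target)              # semantic branch, [N, C, H, W] vs [N, H, W]
    L_b = dice_loss(bnd_logits, bnd_target)                    # boundary branch
    return w_s * L_s + w_b * L_b                               # weighted summation
```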
704. The semantic segmentation model is iteratively trained based on the loss function values until a stop training condition is satisfied.
In some embodiments, the loss function value obtained in step 703 is compared with a set threshold. When the loss function value is greater than the set threshold, the parameters of each layer in the semantic segmentation model are adjusted based on a back propagation algorithm; the set threshold may be chosen by a technician and is a value between 0 and 1. When the loss function value is less than or equal to the set threshold, the training-stop condition is met, training of the semantic segmentation model stops, and a trained semantic segmentation model is obtained, which can then be used as in the previous embodiments to predict the actual semantic segmentation map of each RGB image.
In other embodiments, the training-stop condition may also be configured as one or more of the following: the number of iteration steps reaches a set number of steps, which is set by a technician; or the change of the loss function value over N consecutive iterations is smaller than a set change amount, where the value of N and the set change amount are set by a technician. The training-stop condition is not specifically limited in the embodiments of the present application.
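A minimal training loop combining these stop conditions with the threshold test of the previous paragraph might look as follows; the optimizer, learning rate, threshold, step limit, and change tolerance are all assumed example values, and `compute_loss` is assumed to have the signature of the loss sketch above.

```python
import torch

def train(model, loader, compute_loss, device="cuda",
          loss_threshold=0.05, max_steps=100_000, patience=10, min_delta=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.train().to(device)
    history, step = [], 0
    for _ in range(10_000):                           # outer bound; the stop conditions return early
        for image, sem_target, bnd_target in loader:
            image = image.to(device)
            sem_logits, bnd_logits = model(image)     # semantic segmentation map & boundary detection map
            loss = compute_loss(sem_logits, sem_target.to(device), bnd_logits, bnd_target.to(device))
            optimizer.zero_grad()
            loss.backward()                           # back propagation adjusts each layer's parameters
            optimizer.step()
            step += 1
            history.append(loss.item())
            # Stop condition 1: loss function value no greater than the set threshold.
            if loss.item() <= loss_threshold:
                return model
            # Stop condition 2: iteration steps reach the set number of steps.
            if step >= max_steps:
                return model
            # Stop condition 3: loss change over N consecutive iterations below the set amount.
            if len(history) > patience and max(history[-patience:]) - min(history[-patience:]) < min_delta:
                return model
    return model
```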
With the above training mode of the semantic segmentation model, only RGB sample images need to be provided in the training stage, and no auxiliary information from other modalities needs to be introduced, which greatly reduces the training cost. The whole semantic segmentation model is constructed as a dual-branch network: one branch extracts and corrects semantic features for accurate semantic segmentation of transparent objects, and the other branch extracts and corrects boundary features for accurate prediction of transparent object boundaries. A multi-scale complementary attention network is built to adaptively realize collaborative optimization of the features in the two branches, making full use of the complementary information transferred in both directions between them; moreover, the multi-scale cascade architecture exploits both the rich texture information in low-level features and the rich semantic information in high-level features, and therefore achieves a good semantic segmentation effect on objects with unstable textures and blurred boundaries such as transparent objects.
Hereinafter, the semantic segmentation effect on transparent objects according to the embodiments of the present application is described with reference to fig. 8 and fig. 9.
Fig. 8 is a comparison diagram of the effects of the semantic segmentation method for transparent objects according to an embodiment of the present application. As shown in fig. 8, for an RGB image 801 that includes a large-area glass window, the glass window reflects the texture of the environment outside the window, so it is highly difficult to segment semantically. Fig. 8 also shows a boundary annotation map 802 of the image 801, which is the correct annotation of the boundary of each object in the image 801, and a semantic annotation map 803, which is the correct semantic segmentation result for each object in the image 801. Processing the image 801 with the semantic segmentation model and method of the previous embodiments yields a machine-produced boundary detection map 804 and a machine-produced semantic segmentation map 805. Comparing the boundary annotation map 802 with the boundary detection map 804 shows that boundary detection remains highly accurate even with a large-area glass window, and comparing the semantic annotation map 803 with the semantic segmentation map 805 shows that semantic segmentation is likewise highly accurate.
Fig. 9 is a visualization of the dual-channel complementary attention features provided in an embodiment of the present application. As shown in fig. 9, for 10 RGB images used for testing, the semantic segmentation model of the previous embodiments extracts the multi-scale, dual-channel complementary attention features from each RGB image; the semantic-channel complementary features at the multiple scales are fused to obtain the corrected semantic features, and the boundary-channel complementary features at the multiple scales are fused to obtain the corrected boundary features. To illustrate the attention effect of the corrected semantic features and the corrected boundary features, fig. 9 displays them as images for the 10 RGB images, where the shade of a pixel represents the likelihood that the pixel belongs to the semantics or boundary of a transparent object: the darker the pixel, the higher the likelihood, and the lighter the pixel, the lower the likelihood. Comparing the semantic annotation maps of the 10 RGB images with the corrected semantic features shows that the corrected semantic features fully distinguish pixels that belong to the transparent object from those that do not, containing extremely rich and accurate semantic information after correction. Similarly, comparing the boundary annotation maps with the corrected boundary features shows that the corrected boundary features fully represent the pixels located on the actual boundary of the transparent object, containing extremely rich and accurate texture boundary information after correction. Even when the number of test samples increases greatly, the semantic segmentation accuracy and the boundary detection accuracy therefore remain stably at a very high level, and accurate semantic segmentation and boundary detection can be achieved for transparent objects such as glass, giving the method high usability.
Fig. 10 is a schematic structural diagram of a semantic segmentation device for transparent objects in an image according to an embodiment of the present application. Referring to fig. 10, the device includes:
an extraction module 1001, configured to extract, for an image including a transparent object, a semantic segmentation feature of the image and an object boundary feature, where the semantic segmentation feature is used to distinguish the foreground and the background of the image, and the object boundary feature is used to indicate boundaries of different objects in the foreground;
a feature generation module 1002 for generating a complementary attention feature based on the semantic segmentation feature and the object boundary feature, the complementary attention feature being used to characterize complementary information of the semantic segmentation feature and the object boundary feature;
a correction module 1003, configured to correct the semantic segmentation feature and the object boundary feature based on the complementary attention feature, to obtain a corrected semantic feature and a corrected boundary feature;
an image generation module 1004 is configured to generate a semantic segmentation map for identifying a location and a boundary of the transparent object in the image based on the modified semantic feature and the modified boundary feature.
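The four modules above map naturally onto a dual-branch forward pass. The following is a schematic composition only; the sub-module classes passed to the constructor (feature extractor, complementary attention stack, prediction heads) are assumed names for this sketch rather than components defined in this application.

```python
import torch.nn as nn

class TransparentObjectSegmenter(nn.Module):
    """Schematic wiring of the extraction, feature generation, correction and image generation modules."""
    def __init__(self, feature_extractor, complementary_attention, semantic_head, boundary_head):
        super().__init__()
        self.feature_extractor = feature_extractor               # extraction module 1001
        self.complementary_attention = complementary_attention   # feature generation 1002 + correction 1003
        self.semantic_head = semantic_head                       # image generation module 1004
        self.boundary_head = boundary_head                       # optional boundary detection output

    def forward(self, image):
        # 1001: semantic segmentation features and object boundary features.
        semantic_feats, boundary_feats = self.feature_extractor(image)
        # 1002 + 1003: complementary attention features, then corrected semantic / boundary features.
        corrected_semantic, corrected_boundary = self.complementary_attention(semantic_feats, boundary_feats)
        # 1004: semantic segmentation map (and, optionally, a boundary detection map).
        out_size = image.shape[-2:]
        return (self.semantic_head(corrected_semantic, out_size),
                self.boundary_head(corrected_boundary, out_size))
```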
According to the device provided by the embodiments of the present application, coarse semantic segmentation features and object boundary features are first extracted from the image, and the complementary information between them is then fully exploited to construct complementary attention features. These complementary attention features correct the semantic segmentation features and the object boundary features respectively: through bidirectional interaction, useful pixel-level category information is transferred from the semantic branch to the boundary branch, and useful pixel-level boundary information is transferred from the boundary branch to the semantic branch, so that the corrected semantic features and corrected boundary features obtained by this collaborative optimization focus on the feature values that contribute most to their own branches. The semantic segmentation map predicted on this basis can therefore identify transparent objects that are difficult to segment more accurately, achieving accurate semantic segmentation of transparent objects.
In some embodiments, based on the apparatus composition of fig. 10, the extraction module 1001 includes:
the first extraction submodule is used for inputting the image into a feature extraction network, and extracting semantic segmentation features of the image under multiple scales through multiple convolution layers in the feature extraction network;
the generation sub-module is used for generating object boundary features of the image based on the semantic segmentation features at the first scale and the semantic segmentation features at the last scale.
In some embodiments, the feature extraction network further comprises an ASPP sub-network, based on the apparatus composition of fig. 10, the generating sub-module comprises:
the first extraction unit is used for inputting the semantic segmentation features under the last scale into the ASPP sub-network, and extracting the features of the semantic segmentation features under the last scale through the ASPP sub-network to obtain multi-scale semantic features;
the first fusion unit is used for fusing the semantic segmentation feature under the first scale and the multi-scale semantic feature to obtain the object boundary feature.
In some embodiments, the ASPP sub-network includes a one-dimensional convolution layer, a plurality of pooling pyramid layers, and an ASPP pooling layer, wherein the plurality of pooling pyramid layers have the same convolution kernel size, but different pooling pyramid layers fill the receptive field of the convolution kernel based on different expansion rates, each pooling pyramid layer being used to extract pooling pyramid features at one scale;
Based on the device composition of fig. 10, the first extraction unit comprises:
the dimension reduction convolution subunit is used for carrying out a dimension reduction convolution operation on the semantic segmentation feature under the last scale through the one-dimensional convolution layer to obtain a dimension reduction feature;
the hole convolution subunit is used for respectively carrying out hole convolution operations based on different expansion rates on the semantic segmentation feature under the last scale through the plurality of pooling pyramid layers to obtain pooling pyramid features under a plurality of scales;
the pooling subunit is used for carrying out a pooling operation on the semantic segmentation feature under the last scale through the ASPP pooling layer to obtain an ASPP pooling feature;
and the fusion subunit is used for fusing the dimension reduction feature, the pooling pyramid feature under the multiple scales and the ASPP pooling feature to obtain the multi-scale semantic feature.
In some embodiments, the ASPP pooling layer comprises a mean pooling layer, a one-dimensional convolution layer, and an upsampling layer;
the pooling subunit is configured to:
carrying out a mean pooling operation on the semantic segmentation feature under the last scale through the mean pooling layer to obtain a mean pooling feature;
performing a dimension reduction convolution operation on the mean pooling feature through the one-dimensional convolution layer to obtain a dimension reduction pooling feature;
And carrying out up-sampling operation on the dimension reduction pooling feature through the up-sampling layer to obtain the ASPP pooling feature.
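The ASPP sub-network and its pooling layer described by these sub-units might be sketched as follows. The channel sizes and the dilation (expansion) rates 6, 12, 18 are common ASPP choices assumed for illustration; the application does not fix these values, and the 1x1 convolutions stand in for the "one-dimensional convolution layers" above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Dimension-reducing 1x1 convolution, several dilated 3x3 convolutions sharing the same kernel
    size but different expansion rates, and a mean-pooling branch, fused by concatenation."""
    def __init__(self, in_ch: int, out_ch: int, dilations=(6, 12, 18)):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                    nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.pyramids = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d, bias=False),
                          nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for d in dilations])
        # ASPP pooling layer: mean pooling -> 1x1 convolution -> upsampling (in forward()).
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                  nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(nn.Conv2d(out_ch * (len(dilations) + 2), out_ch, 1, bias=False),
                                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        feats = [self.reduce(x)]                               # dimension reduction feature
        feats += [p(x) for p in self.pyramids]                 # pooling pyramid features at several scales
        pooled = F.interpolate(self.pool(x), size=(h, w),      # ASPP pooling feature, upsampled
                               mode="bilinear", align_corners=False)
        feats.append(pooled)
        return self.fuse(torch.cat(feats, dim=1))              # multi-scale semantic feature
```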
In some embodiments, based on the apparatus composition of fig. 10, the feature generation module 1002 includes:
and the second extraction submodule is used for extracting complementary information of the semantic segmentation features and the object boundary features under the multiple scales through the cascade complementary attention network under the multiple scales to obtain the complementary attention features under the multiple scales.
In some embodiments, the complementary attention features at each scale include semantic channel complementary features and boundary channel complementary features;
based on the apparatus composition of fig. 10, the second extraction submodule includes:
the merging unit is used for, for the complementary attention network under any scale, merging the semantic branch features and the boundary branch features in the input signal in the channel dimension to obtain a combined feature;
the second extraction unit is used for inputting the combined feature into an attention extraction layer in a complementary attention network under the scale, and extracting the feature of the combined feature through the attention extraction layer to obtain a dual-channel attention feature, wherein the dual-channel attention feature comprises a semantic channel attention feature and a boundary channel attention feature;
The second fusion unit is used for fusing the semantic channel attention features and the semantic branch features to obtain semantic channel complementary features in the complementary attention features under the scale;
the second fusing unit is further configured to fuse the boundary channel attention feature with the boundary branch feature, so as to obtain a boundary channel complementary feature in the complementary attention features under the scale.
In some embodiments, for a complementary attention network at a last scale, the semantic branch feature is the multi-scale semantic feature and the boundary branch feature is the object boundary feature;
and for the complementary attention network under the other scales, the semantic branch features are obtained by fusing semantic channel complementary features in the complementary attention features under the previous scale with semantic segmentation features under the previous scale, and the boundary branch features are obtained by fusing boundary channel complementary features in the complementary attention features under the previous scale with semantic segmentation features under the first scale.
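Following this cascading rule literally (the last scale is seeded with the multi-scale semantic feature and the object boundary feature, every subsequent scale fuses the previous scale's complementary features with the corresponding segmentation features), the wiring might look like the sketch below. The ordering convention, the `fuse_sem` / `fuse_bnd` fusion callables (e.g. concatenation plus a 1x1 convolution that also handles any resampling), and the attention-network interface are assumptions for illustration.

```python
def cascade_complementary_attention(attn_nets, seg_feats, multi_scale_sem, obj_boundary,
                                    fuse_sem, fuse_bnd):
    """attn_nets: complementary attention networks ordered from the last (deepest) scale towards
    the first; seg_feats: per-scale semantic segmentation features in the same order, so
    seg_feats[-1] is the first-scale feature. Returns the per-scale semantic-channel and
    boundary-channel complementary features."""
    sem_branch, bnd_branch = multi_scale_sem, obj_boundary     # seeds for the last scale
    sem_comps, bnd_comps = [], []
    first_scale_seg = seg_feats[-1]
    for i, attn in enumerate(attn_nets):
        sem_comp, bnd_comp = attn(sem_branch, bnd_branch)      # complementary features at this scale
        sem_comps.append(sem_comp)
        bnd_comps.append(bnd_comp)
        if i + 1 < len(attn_nets):
            # Seed the next scale: fuse this scale's complementary features as stated above.
            sem_branch = fuse_sem(sem_comp, seg_feats[i])      # with the previous scale's segmentation feature
            bnd_branch = fuse_bnd(bnd_comp, first_scale_seg)   # with the first-scale segmentation feature
    return sem_comps, bnd_comps
```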
In some embodiments, the second extraction unit is configured to:
carrying out a channel-by-channel convolution operation on the combined feature through the attention extraction layer to obtain a context attention feature, wherein the context attention feature has the same dimensions as the combined feature;
And carrying out point-by-point convolution operation on the context attention feature in the channel dimension to obtain the dual-channel attention feature.
In some embodiments, the second fusion unit is configured to:
multiplying the semantic channel attention feature and the semantic branch feature by elements to obtain a semantic weighted feature;
performing convolution operation on the semantic weighted features to obtain semantic weighted convolution features;
and adding the semantic weighted convolution feature and the semantic branch feature according to elements to obtain the semantic channel complementary feature.
In some embodiments, the second fusion unit is further to:
multiplying the boundary channel attention feature and the boundary branch feature by elements to obtain a boundary weighted feature;
performing convolution operation on the boundary weighted feature to obtain a boundary weighted convolution feature;
and adding the boundary weighted convolution characteristic and the boundary branch characteristic by elements to obtain the boundary channel complementary characteristic.
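Putting the merging unit, the depthwise-then-pointwise attention extraction, and the two fusion rules together, one complementary attention network at a single scale might look like the sketch below. The channel counts, the sigmoid applied to the attention map, and broadcasting the single-channel attention over all feature channels are assumptions of this sketch, not details fixed by the application.

```python
import torch
import torch.nn as nn

class ComplementaryAttentionBlock(nn.Module):
    """One scale of the complementary attention network: concatenate the two branches, extract a
    dual-channel attention map (channel-by-channel then point-by-point convolution), and fuse
    each attention channel back into its own branch."""
    def __init__(self, channels: int):
        super().__init__()
        joint = 2 * channels
        self.depthwise = nn.Conv2d(joint, joint, kernel_size=3, padding=1, groups=joint)  # channel-by-channel
        self.pointwise = nn.Conv2d(joint, 2, kernel_size=1)                               # point-by-point, 2 channels
        self.sem_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bnd_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, sem_branch: torch.Tensor, bnd_branch: torch.Tensor):
        joint = torch.cat([sem_branch, bnd_branch], dim=1)      # merge on the channel dimension
        context = self.depthwise(joint)                          # context attention, same dims as the input
        attn = torch.sigmoid(self.pointwise(context))            # dual-channel attention feature
        sem_attn, bnd_attn = attn[:, 0:1], attn[:, 1:2]
        # Semantic channel complementary feature: multiply by elements, convolve, add the branch back.
        sem_comp = self.sem_conv(sem_attn * sem_branch) + sem_branch
        # Boundary channel complementary feature: the symmetric operation on the boundary branch.
        bnd_comp = self.bnd_conv(bnd_attn * bnd_branch) + bnd_branch
        return sem_comp, bnd_comp
```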
In some embodiments, the correction module 1003 is to:
fusing semantic channel complementary features in the complementary attention features under the multiple scales to obtain the corrected semantic features;
and fusing the boundary channel complementary features in the complementary attention features under the multiple scales to obtain the corrected boundary features.
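The final correction step (fusing the per-scale complementary features of each branch) could be realized in several ways; the sketch below assumes concatenation followed by a 1x1 convolution, with all scales resampled to a common resolution and sharing one channel count, none of which is mandated by the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Fuses complementary features from several scales into one corrected feature."""
    def __init__(self, channels: int, num_scales: int):
        super().__init__()
        self.project = nn.Conv2d(channels * num_scales, channels, kernel_size=1)

    def forward(self, feats):
        target = feats[0].shape[-2:]   # resample every scale to a common resolution
        feats = [F.interpolate(f, size=target, mode="bilinear", align_corners=False) for f in feats]
        return self.project(torch.cat(feats, dim=1))

# Hypothetical usage: corrected_semantic = MultiScaleFusion(256, 4)(sem_comps)
#                     corrected_boundary = MultiScaleFusion(256, 4)(bnd_comps)
```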
In some embodiments, the transparent object is a glass object.
All the above optional solutions can be combined to form an optional embodiment of the present disclosure, which is not described in detail herein.
It should be noted that: the semantic segmentation device for transparent objects in images provided in the above embodiments only exemplifies the division of the functional modules when the transparent objects are semantically segmented, and in practical application, the functional allocation can be completed by different functional modules according to needs, that is, the internal structure of the electronic device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the semantic segmentation device for the transparent object in the image provided by the above embodiment belongs to the same concept as the semantic segmentation method for the transparent object in the image, and the specific implementation process of the semantic segmentation device is detailed in the semantic segmentation method for the transparent object in the image, which is not described herein.
Fig. 11 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application, and the image processing apparatus is an exemplary illustration of an electronic apparatus, and as shown in fig. 11, an image processing apparatus is illustrated as a terminal 1100. Optionally, the device types of the terminal 1100 include: vehicle-mounted terminal, intelligent voice interaction equipment, intelligent household appliances, intelligent mobile phones, tablet computers, notebook computers, desktop computers, intelligent sound boxes, intelligent watches and the like. Terminal 1100 may also be referred to by other names of user devices, portable terminals, laptop terminals, desktop terminals, and the like.
Generally, the terminal 1100 includes: a processor 1101 and a memory 1102.
Optionally, the processor 1101 includes one or more processing cores, such as a 4-core processor or an 8-core processor. Optionally, the processor 1101 is implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). In some embodiments, the processor 1101 includes a main processor and a coprocessor; the main processor is a processor for processing data in the awake state, also referred to as a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1101 is integrated with a GPU (Graphics Processing Unit) responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1101 further includes an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
In some embodiments, memory 1102 includes one or more computer-readable storage media, optionally non-transitory. Memory 1102 also optionally includes high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1102 is used to store at least one program code for execution by processor 1101 to implement the method of semantic segmentation of transparent objects in images provided by various embodiments of the present application.
In some embodiments, the terminal 1100 may further optionally include: a peripheral interface 1103 and at least one peripheral. The processor 1101, the memory 1102, and the peripheral interface 1103 can be connected by a bus or signal lines. The individual peripheral devices can be connected to the peripheral device interface 1103 by buses, signal lines or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1104, a display screen 1105, a camera assembly 1106, audio circuitry 1107, and a power supply 1108.
A peripheral interface 1103 may be used to connect I/O (Input/Output) related at least one peripheral device to the processor 1101 and memory 1102. In some embodiments, the processor 1101, memory 1102, and peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments, any one or both of the processor 1101, memory 1102, and peripheral interface 1103 are implemented on separate chips or circuit boards, which is not limited by this embodiment.
The Radio Frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1104 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 1104 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1104 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. Optionally, the radio frequency circuitry 1104 communicates with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity ) networks. In some embodiments, the radio frequency circuitry 1104 further includes NFC (Near Field Communication ) related circuitry, which is not limiting of the application.
The display screen 1105 is used to display a UI (User Interface). Optionally, the UI includes graphics, text, icons, video, and any combination thereof. When the display 1105 is a touch display, the display 1105 also has the ability to collect touch signals at or above the surface of the display 1105. The touch signal can be input to the processor 1101 as a control signal for processing. Optionally, the display 1105 is also used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards. In some embodiments, the display 1105 is one, providing a front panel of the terminal 1100; in other embodiments, the display 1105 is at least two, and is respectively disposed on different surfaces of the terminal 1100 or in a folded design; in still other embodiments, the display 1105 is a flexible display disposed on a curved surface or a folded surface of the terminal 1100. Even alternatively, the display screen 1105 is arranged in an irregular pattern that is not rectangular, i.e., a shaped screen. Alternatively, the display screen 1105 is made of LCD (Liquid Crystal Display ), OLED (Organic Light-Emitting Diode), or the like.
The camera assembly 1106 is used to capture images or video. Optionally, the camera assembly 1106 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, Virtual Reality (VR) shooting, or other fused shooting functions. In some embodiments, the camera assembly 1106 also includes a flash. Optionally, the flash is a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, used for light compensation under different color temperatures.
In some embodiments, the audio circuit 1107 includes a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert them into electrical signals, and input them to the processor 1101 for processing or to the radio frequency circuit 1104 for voice communication. For stereo acquisition or noise reduction purposes, a plurality of microphones may be disposed at different portions of the terminal 1100. Optionally, the microphone is an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 1101 or the radio frequency circuit 1104 into sound waves. Optionally, the speaker is a conventional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 1107 also includes a headphone jack.
A power supply 1108 is used to power the various components in terminal 1100. Optionally, the power supply 1108 is an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power supply 1108 includes a rechargeable battery, the rechargeable battery supports wired or wireless charging. The rechargeable battery is also used to support fast charge technology.
In some embodiments, terminal 1100 also includes one or more sensors 1110. The one or more sensors 1110 include, but are not limited to: acceleration sensor 1111, gyroscope sensor 1112, pressure sensor 1113, optical sensor 1114, and proximity sensor 1115.
In some embodiments, the acceleration sensor 1111 detects the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 1100. For example, the acceleration sensor 1111 is configured to detect components of gravitational acceleration on three coordinate axes. Optionally, the processor 1101 controls the display screen 1105 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 1111. The acceleration sensor 1111 is also used for acquisition of motion data of a game or a user.
In some embodiments, the gyro sensor 1112 detects a body direction and a rotation angle of the terminal 1100, and the gyro sensor 1112 and the acceleration sensor 1111 cooperate to collect 3D actions of the user on the terminal 1100. The processor 1101 realizes the following functions according to the data collected by the gyro sensor 1112: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
Optionally, the pressure sensor 1113 is disposed at a side frame of the terminal 1100 and/or at a lower layer of the display screen 1105. When the pressure sensor 1113 is disposed at a side frame of the terminal 1100, a grip signal of the terminal 1100 by a user can be detected, and the processor 1101 performs a right-left hand recognition or a quick operation according to the grip signal collected by the pressure sensor 1113. When the pressure sensor 1113 is disposed at the lower layer of the display screen 1105, the processor 1101 realizes control of the operability control on the UI interface according to the pressure operation of the user on the display screen 1105. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 1114 is used to collect the ambient light intensity. In one embodiment, the processor 1101 controls the display brightness of the display screen 1105 based on the intensity of ambient light collected by the optical sensor 1114. Specifically, when the intensity of the ambient light is high, the display luminance of the display screen 1105 is turned up; when the ambient light intensity is low, the display luminance of the display screen 1105 is turned down. In another embodiment, the processor 1101 also dynamically adjusts the shooting parameters of the camera assembly 1106 based on the intensity of ambient light collected by the optical sensor 1114.
A proximity sensor 1115, also referred to as a distance sensor, is typically provided on the front panel of the terminal 1100. The proximity sensor 1115 is used to collect a distance between a user and the front surface of the terminal 1100. In one embodiment, when the proximity sensor 1115 detects that the distance between the user and the front surface of the terminal 1100 gradually decreases, the processor 1101 controls the display 1105 to switch from the bright screen state to the off screen state; when the proximity sensor 1115 detects that the distance between the user and the front surface of the terminal 1100 gradually increases, the processor 1101 controls the display screen 1105 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 11 is not limiting of terminal 1100, and can include more or fewer components than shown, or combine certain components, or employ a different arrangement of components.
Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application, where the electronic device 1200 may generate a relatively large difference due to different configurations or performances, and the electronic device 1200 includes one or more processors (Central Processing Units, CPU) 1201 and one or more memories 1202, where at least one computer program is stored in the memories 1202, and the at least one computer program is loaded and executed by the one or more processors 1201 to implement the method for semantic segmentation of transparent objects in images according to the embodiments described above. Optionally, the electronic device 1200 further includes a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.
In an exemplary embodiment, a computer readable storage medium is also provided, for example a memory comprising at least one computer program executable by a processor in an electronic device to perform the method of semantic segmentation of transparent objects in images in the various embodiments described above. For example, the computer readable storage medium includes ROM (Read-Only Memory), RAM (Random-Access Memory), CD-ROM (Compact Disc Read-Only Memory), magnetic tape, floppy disk, optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, comprising one or more computer programs, the one or more computer programs stored in a computer readable storage medium. The one or more processors of the electronic device are capable of reading the one or more computer programs from the computer-readable storage medium, the one or more processors executing the one or more computer programs so that the electronic device is capable of executing to perform the semantic segmentation method of transparent objects in images in the embodiments described above.
Those of ordinary skill in the art will appreciate that all or a portion of the steps implementing the above-described embodiments can be implemented by hardware, or can be implemented by a program instructing the relevant hardware, optionally stored in a computer readable storage medium, optionally a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description covers only preferred embodiments of the present application and is not intended to limit the present application; the scope of protection of the present application is defined by the appended claims.

Claims (17)

1. A method for semantic segmentation of transparent objects in an image, the method comprising:
extracting semantic segmentation features and object boundary features of an image containing a transparent object, wherein the semantic segmentation features are used for distinguishing a foreground and a background of the image, and the object boundary features are used for indicating boundaries of different objects in the foreground;
generating complementary attention features based on the semantic segmentation features and the object boundary features, the complementary attention features being used to characterize complementary information of the semantic segmentation features and the object boundary features;
based on the complementary attention features, respectively correcting the semantic segmentation features and the object boundary features to obtain corrected semantic features and corrected boundary features;
based on the modified semantic features and the modified boundary features, a semantic segmentation map is generated that identifies the location and boundary of the transparent object in the image.
2. The method of claim 1, wherein the extracting semantic segmentation features and object boundary features of the image comprises:
inputting the image into a feature extraction network, and extracting semantic segmentation features of the image under multiple scales through multiple convolution layers in the feature extraction network;
and generating object boundary features of the image based on the semantic segmentation features at the first scale and the semantic segmentation features at the last scale.
3. The method of claim 2, wherein the feature extraction network further comprises a hole space convolution pooling pyramid ASPP sub-network, the generating object boundary features of the image based on semantic segmentation features at a first scale and semantic segmentation features at a last scale comprising:
inputting the semantic segmentation features under the last scale into the ASPP sub-network, and extracting the features of the semantic segmentation features under the last scale through the ASPP sub-network to obtain multi-scale semantic features;
and fusing the semantic segmentation features under the first scale with the multi-scale semantic features to obtain the object boundary features.
4. The method of claim 3, wherein the ASPP sub-network comprises a one-dimensional convolution layer, a plurality of pooling pyramid layers, and an ASPP pooling layer, wherein the plurality of pooling pyramid layers have the same convolution kernel size, but different pooling pyramid layers fill the receptive field of the convolution kernel based on different expansion rates, each pooling pyramid layer being used to extract pooling pyramid features at one scale;
the feature extraction of the semantic segmentation feature under the last scale through the ASPP sub-network is performed, and the obtaining of the multi-scale semantic feature comprises the following steps:
performing dimension reduction convolution operation on the semantic segmentation features under the last scale through the one-dimensional convolution layer to obtain dimension reduction features;
carrying out hole convolution operation based on different expansion rates on the semantic segmentation features under the last scale through the plurality of pooling pyramid layers to obtain pooling pyramid features under a plurality of scales;
carrying out pooling operation on the semantic segmentation features under the last scale through the ASPP pooling layer to obtain ASPP pooling features;
and fusing the dimension reduction feature, the pooling pyramid feature under the multiple scales and the ASPP pooling feature to obtain the multi-scale semantic feature.
5. The method of claim 4, wherein the ASPP pooling layer comprises a mean pooling layer, a one-dimensional convolution layer, and an upsampling layer;
the step of performing pooling operation on the semantic segmentation feature under the last scale through the ASPP pooling layer to obtain ASPP pooling features comprises the following steps:
carrying out mean value pooling operation on the semantic segmentation features under the last scale through the mean value pooling layer to obtain mean value pooling features;
performing dimension reduction convolution operation on the mean value pooling feature through the one-dimensional convolution layer to obtain dimension reduction pooling feature;
and carrying out up-sampling operation on the dimension reduction pooling feature through the up-sampling layer to obtain the ASPP pooling feature.
6. The method of claim 3, wherein the generating complementary attention features based on the semantic segmentation features and the object boundary features comprises:
and respectively extracting complementary information of the semantic segmentation features and the object boundary features under the multiple scales through the cascade complementary attention network under the multiple scales to obtain the complementary attention features under the multiple scales.
7. The method of claim 6, wherein the complementary attention features at each scale include semantic channel complementary features and boundary channel complementary features;
The extracting complementary information of the semantic segmentation features and the object boundary features under the multiple scales through the cascade complementary attention network under the multiple scales respectively to obtain the complementary attention features under the multiple scales comprises the following steps:
combining semantic branch features and boundary branch features in an input signal in a channel dimension for a complementary attention network under any scale to obtain a combined feature;
inputting the combined features into an attention extraction layer in a complementary attention network under the scale, and extracting the features of the combined features through the attention extraction layer to obtain dual-channel attention features, wherein the dual-channel attention features comprise semantic channel attention features and boundary channel attention features;
fusing the semantic channel attention features and the semantic branch features to obtain semantic channel complementary features in the complementary attention features under the scale;
and fusing the boundary channel attention feature and the boundary branch feature to obtain boundary channel complementary features in the complementary attention features under the scale.
8. The method of claim 7, wherein for a complementary attention network at a last scale, the semantic branch feature is the multi-scale semantic feature and the boundary branch feature is the object boundary feature;
And for the complementary attention network under the rest scales, the semantic branch features are obtained by fusing semantic channel complementary features in the complementary attention features under the previous scale with semantic segmentation features under the previous scale, and the boundary branch features are obtained by fusing boundary channel complementary features in the complementary attention features under the previous scale with the semantic segmentation features under the first scale.
9. The method according to claim 7 or 8, wherein the performing feature extraction on the combined feature through the attention extraction layer to obtain the dual-channel attention feature comprises:
carrying out a channel-by-channel convolution operation on the combined feature in the spatial dimension through the attention extraction layer to obtain a contextual attention feature, wherein the contextual attention feature has the same dimensions as the combined feature;
and carrying out a point-by-point convolution operation in the channel dimension on the contextual attention feature to obtain the dual-channel attention feature.
10. The method according to claim 7 or 8, wherein the fusing the semantic channel attention feature with the semantic branch feature to obtain a semantic channel complementary feature of the complementary attention features at the scale comprises:
Multiplying the semantic channel attention feature and the semantic branch feature by elements to obtain a semantic weighted feature;
performing convolution operation on the semantic weighted features to obtain semantic weighted convolution features;
and adding the semantic weighted convolution features and the semantic branch features by elements to obtain the semantic channel complementary features.
11. The method according to claim 7 or 8, wherein fusing the boundary channel attention feature with the boundary branch feature to obtain a boundary channel complementary feature of the complementary attention features at the scale comprises:
multiplying the boundary channel attention feature and the boundary branch feature by elements to obtain a boundary weighted feature;
performing convolution operation on the boundary weighted feature to obtain a boundary weighted convolution feature;
and adding the boundary weighted convolution characteristic and the boundary branch characteristic by elements to obtain the boundary channel complementary characteristic.
12. The method according to claim 7 or 8, wherein the modifying the semantic segmentation feature and the object boundary feature based on the complementary attention feature, respectively, to obtain a modified semantic feature and a modified boundary feature comprises:
Fusing semantic channel complementary features in the complementary attention features under the multiple scales to obtain the corrected semantic features;
and fusing the boundary channel complementary features in the complementary attention features under the multiple scales to obtain the corrected boundary features.
13. The method of claim 1, wherein the transparent object is a glass object.
14. A semantic segmentation apparatus for transparent objects in an image, the apparatus comprising:
the extraction module is used for extracting semantic segmentation features and object boundary features of an image containing a transparent object, wherein the semantic segmentation features are used for distinguishing the foreground and the background of the image, and the object boundary features are used for indicating the boundaries of different objects in the foreground;
a feature generation module for generating complementary attention features based on the semantic segmentation features and the object boundary features, the complementary attention features being used to characterize complementary information of the semantic segmentation features and the object boundary features;
the correction module is used for respectively correcting the semantic segmentation feature and the object boundary feature based on the complementary attention feature to obtain a corrected semantic feature and a corrected boundary feature;
And the image generation module is used for generating a semantic segmentation graph for identifying the position and the boundary of the transparent object in the image based on the corrected semantic features and the corrected boundary features.
15. An electronic device comprising one or more processors and one or more memories, the one or more memories having stored therein at least one computer program loaded and executed by the one or more processors to implement the method of semantic segmentation of a transparent object in an image as claimed in any of claims 1 to 13.
16. A computer readable storage medium, wherein at least one computer program is stored in the computer readable storage medium, the at least one computer program being loaded and executed by a processor to implement the method of semantic segmentation of transparent objects in an image according to any one of claims 1 to 13.
17. A computer program product, characterized in that the computer program product comprises at least one computer program that is loaded and executed by a processor to implement the method of semantic segmentation of transparent objects in an image according to any one of claims 1 to 13.
CN202211466056.7A 2022-11-22 2022-11-22 Semantic segmentation method and device for transparent object in image and electronic equipment Pending CN116993973A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211466056.7A CN116993973A (en) 2022-11-22 2022-11-22 Semantic segmentation method and device for transparent object in image and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211466056.7A CN116993973A (en) 2022-11-22 2022-11-22 Semantic segmentation method and device for transparent object in image and electronic equipment

Publications (1)

Publication Number Publication Date
CN116993973A true CN116993973A (en) 2023-11-03

Family

ID=88522058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211466056.7A Pending CN116993973A (en) 2022-11-22 2022-11-22 Semantic segmentation method and device for transparent object in image and electronic equipment

Country Status (1)

Country Link
CN (1) CN116993973A (en)

Legal Events

Date Code Title Description
PB01 Publication