WO2021136528A1 - Method and apparatus for instance segmentation - Google Patents

Method and apparatus for instance segmentation

Info

Publication number
WO2021136528A1
WO2021136528A1 PCT/CN2020/142438 CN2020142438W WO2021136528A1 WO 2021136528 A1 WO2021136528 A1 WO 2021136528A1 CN 2020142438 W CN2020142438 W CN 2020142438W WO 2021136528 A1 WO2021136528 A1 WO 2021136528A1
Authority
WO
WIPO (PCT)
Prior art keywords
map
original image
instance
feature
feature map
Prior art date
Application number
PCT/CN2020/142438
Other languages
English (en)
French (fr)
Inventor
孙昆阳
陈昊
沈春华
颜友亮
邵滨
许松岑
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2021136528A1 publication Critical patent/WO2021136528A1/zh
Priority to US17/853,799 priority Critical patent/US20220335619A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/12Edge-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • the embodiments of the present application relate to the field of computer vision technology, and in particular, to a method and device for instance segmentation.
  • Image segmentation technology is an important part of image semantic understanding.
  • the current image segmentation tasks mainly include: semantic segmentation and instance segmentation.
  • Semantic segmentation is to divide the instances in the image into corresponding categories, such as people, cats, dogs, etc., without distinguishing different instances belonging to the same category. For example, when there are multiple cats in an image, semantic segmentation will predict all pixels of the multiple cats as a category of "cat". Instance segmentation also needs to distinguish different instances on the basis of specific categories, such as distinguishing which pixels belong to the first cat and which pixels belong to the second cat.
  • the embodiments of the present application provide a method and device for instance segmentation, so as to provide a way of performing instance segmentation.
  • a method for instance segmentation is provided.
  • the terminal pre-trains the instance segmentation network.
  • the original image can be input into the trained segmentation network, and the segmentation network can output at least one feature fusion map corresponding to the original image, where each feature fusion map includes at least one instance.
  • the function of the feature fusion map is to mark the pixels of the instance included in the original image, and any feature fusion map output by the segmentation network can be used to mark the pixels of at least one instance included in the feature fusion map.
  • a sample original image can be input into the segmentation network to be trained, and the sample original image is labeled with pixels of at least one instance.
  • the segmentation network to be trained may perform the following processing on each instance group in the sample original image, where each instance group includes at least one labeled instance: predicting at least two different first basic feature maps, and predicting, for each first basic feature map, its corresponding first attention feature map.
  • the first attention feature map has the same size as the first basic feature map, the pixel value of each pixel in the first attention feature map represents the weight value of the pixel at the corresponding position in the corresponding first basic feature map, and there are pixels with different pixel values in the first attention feature map.
  • the segmentation network to be trained may perform weighting on the pixel values in the at least two first basic feature maps and the respectively corresponding first attention feature maps to predict a first feature fusion map, and train the segmentation network to be trained according to the first feature fusion map and the sample original image.
  • By extracting the first basic feature maps and the first attention feature maps from the sample original image in which instance pixels are labeled as described above, the first feature fusion map is obtained, and the segmentation model is trained using the first feature fusion map and the sample original image.
  • the segmentation model trained in this way can accurately and quickly determine the instance pixels in the subsequent input original image to achieve accurate segmentation of the instances in the original image.
  • the pixel value of the pixel in the first attention feature map is within a set value range, for example, between 0-1. It can also be between 0-0.5, or between 0.5-1.
  • the original image of the sample input into the segmentation network to be trained can not only be marked with pixels of the instance, but also be marked with a detection frame, which is used to identify the instance.
  • an instance corresponds to a detection frame, and the pixels of the instance are located in the detection frame.
  • the segmentation network trained with sample original images labeled with detection frames has the ability to mark the detection frames of instances; when the original image is input into the segmentation network, the segmentation network outputs the detection frames of the instances included in the original image. For example, it may output the coordinates of the detection frames in the original image, or it may output an image with the detection frames, where this image differs from the original image only in whether the detection frames are included, or it may output both the image with the detection frames and the coordinates of the detection frames.
  • if the first basic feature map is the basic feature map of the detection frame image corresponding to the instance group, then training the segmentation network to be trained according to the first feature fusion map and the sample original image may specifically be training the segmentation network to be trained according to the first feature fusion map and the detection frame image.
  • the size of the image is preset in the segmentation network, and before the sample original image is input to the segmentation network to be trained, the sample original image can be scaled to the size preset in the segmentation network. In this way, after training the segmentation network, before inputting the original image into the trained segmentation network, the original image can be scaled to the size preset in the segmentation network.
  • the size of the image is preset in the segmentation network, and the segmentation network can adjust the size of the image to achieve the preset size.
  • the size of the first basic feature map predicted by the segmentation network to be trained may be a preset size in the segmentation network to be trained.
  • since the first attention feature map has the same size as the first basic feature map, the size of the first attention feature map and the size of the first feature fusion map are both the preset size in the segmentation network to be trained.
  • in order to train the segmentation network more accurately, before the segmentation network to be trained is trained according to the first feature fusion map and the sample original image, the size of the first feature fusion map and/or the size of the sample original image may also be scaled so that the size of the first feature fusion map is the same as the size of the sample original image.
  • a method for instance segmentation is provided.
  • the terminal first obtains the original image, and then processes the original image to determine the detection frame of each instance included in the original image. Furthermore, for each detection frame image, at least two different basic feature maps and an attention feature map corresponding to each basic feature map are determined, and the pixel values of the at least two basic feature maps and the respectively corresponding attention feature maps are weighted to obtain a feature fusion map corresponding to the detection frame image, where the feature fusion map is used to mark the pixels of the instances included in the detection frame.
  • the attention feature map has the same size as the basic feature map, the pixel value of each pixel in the attention feature map represents the weight value of the pixel at the corresponding position in the corresponding basic feature map, and there are pixels with different pixel values in the attention feature map.
  • by performing instance segmentation in this manner of extracting basic feature maps and attention feature maps and obtaining a feature fusion map, the instance pixels in the original image can be determined accurately and quickly, so that instance segmentation is performed accurately. In addition, because there are pixels with different pixel values in the attention feature map, the weight of each pixel can be taken into account and the pixels of an instance can be distinguished even more accurately.
  • a device for instance segmentation is provided, and the device has the function of implementing the first aspect and any possible implementation of the first aspect, or the second aspect and any possible implementation of the second aspect.
  • These functions can be realized by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more functional modules corresponding to the above-mentioned functions.
  • a device for instance segmentation is provided; the device may be the terminal in the above method embodiments, or a chip disposed in the terminal.
  • the device includes a transceiver, a processor, and optionally, a memory.
  • the memory is used to store computer programs or instructions;
  • the processor is coupled to the memory and the transceiver respectively;
  • when the processor executes the computer programs or instructions, the device is caused to perform, through the transceiver, the method executed by the terminal in the first aspect and any possible implementation of the first aspect, or in the second aspect and any possible implementation of the second aspect.
  • a computer program product is provided, including computer program code; when the computer program code runs on a computer, the computer is caused to execute the method executed by the terminal in the first aspect and any possible implementation of the first aspect, or in the second aspect and any possible implementation of the second aspect.
  • the present application provides a chip system that includes a processor and a memory, where the processor and the memory are electrically coupled; the memory is used to store computer program instructions; the processor is used to execute some or all of the computer program instructions in the memory, and when the some or all of the computer program instructions are executed, they are used to implement the functions of the terminal in the first aspect and any possible implementation of the first aspect, or in the second aspect and any possible implementation of the second aspect.
  • the chip system may further include a transceiver, and the transceiver is configured to send a signal processed by the processor or receive a signal input to the processor.
  • the chip system can be composed of chips, and can also include chips and other discrete devices.
  • a computer-readable storage medium is provided, which stores a computer program; when the computer program is run, the method executed by the terminal in the first aspect and any possible implementation of the first aspect, or in the second aspect and any possible implementation of the second aspect, is executed.
  • FIG. 1 is a schematic diagram of an instance segmentation scenario provided in an embodiment of this application;
  • FIG. 2 is a schematic flowchart of instance segmentation provided in an embodiment of this application;
  • FIG. 3 and FIG. 4 are schematic flowcharts of instance segmentation provided in embodiments of this application;
  • FIG. 5A shows basic feature maps provided in an embodiment of this application;
  • FIG. 5B is a framework diagram of a network model for instance segmentation provided in an embodiment of this application;
  • FIG. 5C is an example diagram of weighting provided in an embodiment of this application;
  • FIG. 6 and FIG. 7 are structural diagrams of devices for instance segmentation provided in embodiments of this application.
  • the embodiments of the present application provide a method and device for instance segmentation.
  • the method and the device are based on the same technical idea; because the principles by which the method and the device solve the problem are similar, the implementations of the device and the method may refer to each other, and repeated descriptions are not provided again.
  • a user can take a picture containing a portrait on a terminal or other device, and the portrait can be regarded as an instance in the picture.
  • Users can also use terminals or other devices to perform instance segmentation, so as to implement functions such as virtualization or replacement of the background outside the portrait, which can be used in live-stream production, movie production, animation production, and other scenarios.
  • For example, the protagonist can be selected and the background grayed out to finally produce the effect of retaining color only on the portrait protagonist.
  • Another example is instance segmentation of vehicles on the road by a terminal.
  • the vehicle-mounted terminal can assist the automatic driving system to make better driving decisions based on the results of the instance segmentation of the vehicles on the road.
  • A terminal, also called user equipment (UE), mobile station (MS), mobile terminal (MT), etc., is a device that provides voice and/or data connectivity to users.
  • terminal devices include handheld devices with wireless connection functions, vehicle-mounted devices, and Internet of Things devices.
  • terminal devices can be: mobile phones, tablets, notebook computers, handheld computers, mobile internet devices (MID), wearable devices, virtual reality (VR) devices, augmented reality (AR) devices, wireless terminals in industrial control, wireless terminals in self-driving, wireless terminals in remote medical surgery, wireless terminals in smart grids, wireless terminals in transportation safety, wireless terminals in smart cities, or wireless terminals in smart homes, etc.
  • Instance segmentation is a task of identifying the contour of an instance at the pixel level. The more accurate the edge of the instance obtained by the instance segmentation, the finer the instance segmentation and the better the segmentation effect.
  • the "and/or” in this application describes the association relationship of the associated objects, indicating that there can be three relationships, for example, A and/or B, which can mean: A alone exists, A and B exist at the same time, and B exists alone. This situation.
  • the character "/” generally indicates that the associated objects before and after are in an "or” relationship.
  • the multiple involved in this application refers to two or more.
  • the word "exemplary” is used to mean serving as an example, illustration, or illustration. Any embodiments or implementations described as “examples” in this application should not be construed as being more preferred or advantageous than other embodiments or implementations. Rather, the term example is used to present the concept in a concrete way.
  • As shown in FIG. 2, a schematic flowchart of instance segmentation of the present application is provided.
  • the original image is first obtained and processed to extract the basic feature maps and the attention feature maps. Optionally, the original image is first input into the backbone network, and the backbone network obtains images at different resolutions, that is, a feature pyramid; the basic feature maps and the attention feature maps are then extracted from the feature pyramid.
  • the attention feature map has the same size as the basic feature map, and the pixel value of each pixel in the attention feature map represents the weight value of the pixel at the corresponding position in the corresponding basic feature map.
  • weighted fusion is performed on the basic feature map and the attention feature map to obtain a feature fusion image for instance segmentation.
  • the feature fusion image may represent the result of instance segmentation.
  • As shown in FIG. 3, a schematic flowchart of instance segmentation is provided. In this embodiment, a neural network model for instance segmentation, referred to as a segmentation network for short, is first trained, and the segmentation network is then used for instance segmentation.
  • Step 301: Input a sample original image in which pixels of at least one instance are labeled into the segmentation network to be trained.
  • the user can predetermine a batch of original images.
  • the original images can be pictures taken by the terminal camera or camera, or collected video frames.
  • the user can mark the pixels of the instance on the original image.
  • the pixel values of the pixels occupied by the instance and the pixels occupied by the background image can be set to different values.
  • the original image in which instance pixels have been marked by the user can be called a sample original image. If a sample original image includes multiple instances, the marks of the instances are also different from each other.
  • For example, if the original image is a picture of three people taken by a camera, the pixel values of the background image other than the three people can be marked as 0, the pixel values of the pixels occupied by the first person are marked as 2, the pixel values of the pixels occupied by the second person are marked as 4, and the pixel values of the pixels occupied by the third person are marked as 6. If the user mistakenly gives two instances the same mark, the terminal will regard these two instances as one instance.
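  • As an illustration only (not a procedure required by the embodiments), the following Python sketch shows one assumed way of composing such a label map from per-instance binary masks, using the label values 0/2/4/6 from the example above; the helper name and toy resolution are hypothetical.

```python
import numpy as np

def build_label_map(instance_masks, labels, height, width):
    """Compose one integer label map from per-instance binary masks (background stays 0)."""
    label_map = np.zeros((height, width), dtype=np.uint8)
    for mask, label in zip(instance_masks, labels):
        label_map[mask] = label   # pixels occupied by this instance get its own label value
    return label_map

# Toy 4x4 example matching the three-person image described above (labels 2, 4 and 6, background 0).
h = w = 4
person1 = np.zeros((h, w), dtype=bool); person1[0:2, 0:2] = True
person2 = np.zeros((h, w), dtype=bool); person2[2:4, 0:2] = True
person3 = np.zeros((h, w), dtype=bool); person3[0:2, 2:4] = True
print(build_label_map([person1, person2, person3], [2, 4, 6], h, w))
```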
  • the detection frame of the instance can also be marked on the original image of the sample, and the detection frame is used to identify the instance.
  • an instance corresponds to a detection frame, and the pixels occupied by an instance are located in its corresponding detection frame. It can also be a detection frame corresponding to multiple instances, and the pixels occupied by the multiple instances are located in the corresponding detection frame.
  • the size of the image can be preset in the segmentation network.
  • the sample original image Before inputting the sample original image into the segmentation network to be trained, the sample original image may be scaled to a preset size in the segmentation network.
  • the segmentation network can adjust the size of the input sample original image to achieve a preset size.
  • the segmentation network to be trained may perform the following processing procedures from step 302 to step 304 on each instance group in the sample original image.
  • Each instance group includes at least one marked instance.
  • Step 302 The segmentation network to be trained predicts at least two different first basic feature maps, and predicts the corresponding first attention feature map for each first basic feature map.
  • the first attention feature map has the same size as the first basic feature map, the pixel value of each pixel in the first attention feature map represents the weight value of the pixel at the corresponding position in the corresponding first basic feature map, and there are pixels with different pixel values in the first attention feature map.
  • the first basic feature map may be a basic feature map corresponding to the original image of the sample, or a basic feature map corresponding to a detection frame image corresponding to an instance group.
  • When the segmentation network to be trained predicts the first basic feature maps, the prediction can be performed by the DeepLabV3+ algorithm.
  • the DeepLabV3+ algorithm can extract the basic features accurately so as to represent the edges of instances well; for portraits, for example, it represents both the edges and the limbs of the human figure well.
  • a feature fusion map may be determined by at least two basic feature maps and corresponding attention feature maps, and the number of the basic feature maps and the attention feature maps are, for example, 4 respectively.
  • the image is input into the segmentation network, and the segmentation network performs feature extraction, and outputs 4 basic feature maps.
  • the pixel value of the first attention feature map is within a set value range, for example, between 0-1. It can also be between 0-0.5, or between 0.5-1.
  • the size of the image is preset in the segmentation network, and before the sample original image is input to the segmentation network to be trained, the sample original image can be scaled to the size preset in the segmentation network.
  • the segmentation network can adjust the size of the image to achieve a preset size.
  • the size of the first basic feature map predicted by the segmentation network to be trained may be a preset size in the segmentation network to be trained.
  • since the size of the first attention feature map and the size of the first basic feature map are the same, the size of the first attention feature map and the size of the first feature fusion map are both the preset size in the segmentation network to be trained.
  • the original image of the marked sample can be scaled to the preset size first; or after the basic feature map is extracted, the basic feature map can be scaled to reach the preset size.
  • For example, if the preset image size in the segmentation network is R*R, the basic feature map and the attention feature map can be scaled using the following formulas, where one detection frame predicted by the segmentation network contains one predicted instance:
  • r_i = RoIPool_{R×R}(B, p_i);
  • where B is the basic feature map, corresponding to the original image, predicted by the bottom module (Bottom Module) in the segmentation network; p_i is the coordinates of the detection frame of the i-th instance in the original image; and r_i is the basic feature map of size R*R obtained by mapping the coordinates of the detection frame of the i-th instance onto B, extracting the basic features located inside the detection frame, and then scaling them.
  • a′_i = interpolate_{M×M→R×R}(a_i);
  • where a_i is the attention feature map of the i-th instance initially predicted by the segmentation network, with size M*M; it is then scaled to size R*R, and a′_i is the attention feature map after scaling; a′_i is the attention feature map of the i-th instance, corresponding to the basic feature map r_i of the i-th instance.
  • The pixel values in the attention feature map represent weight values. The segmentation network may also normalize the pixel values of the scaled attention feature map so that the normalized pixel values lie within a set value range. The normalization formula is as follows:
  • s_i = softmax(a′_i); where s_i is the normalized attention feature map of the i-th instance. The normalization can be understood as dividing all pixel values by a same value.
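  • The three formulas above can be reproduced with standard operators. The following Python sketch is a minimal, non-authoritative interpretation that assumes torchvision's roi_align as the RoIPool operator; the tensor shapes, the example sizes, and the dimension chosen for the softmax are assumptions rather than values specified by the embodiments.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

R, M, K = 56, 14, 4                  # preset size R*R, raw attention size M*M, K base maps (example values)
B = torch.randn(1, K, 128, 128)      # Bases: base feature maps predicted for the whole original image
p_i = torch.tensor([[0., 10., 20., 90., 100.]])   # (batch index, x1, y1, x2, y2) of the i-th detection frame
a_i = torch.randn(1, K, M, M)        # attention feature maps initially predicted for the i-th instance

# r_i = RoIPool_{R×R}(B, p_i): crop the base features inside the detection frame and resize to R*R.
r_i = roi_align(B, p_i, output_size=(R, R), spatial_scale=1.0)      # -> (1, K, R, R)

# a'_i = interpolate_{M×M→R×R}(a_i): scale the attention maps from M*M to R*R.
a_prime_i = F.interpolate(a_i, size=(R, R), mode="bilinear", align_corners=False)

# s_i = softmax(a'_i): normalize so the weights lie in a set value range; the text does not say over
# which dimension the softmax runs, so normalizing across the K maps at each position is an assumption.
s_i = F.softmax(a_prime_i, dim=1)
```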
  • Step 303 The segmentation network to be trained performs weighting processing on the pixel values in the at least two first basic feature maps and the corresponding first attention feature maps to predict the first feature fusion map.
  • Specifically, see the following formula:
  • m_i = Σ_{t=1}^{K} s_i^t ∘ r_i^t;
  • where s_i^t is the normalized t-th attention feature map of the i-th instance, r_i^t is the t-th basic feature map of the i-th instance, and m_i is the feature fusion map corresponding to the i-th detection frame image. K is the total number of basic feature maps corresponding to one instance (the i-th instance). ∘ denotes the element-wise (dot) product of matrices, that is, a pointwise multiplication of the pixel values at corresponding positions.
  • In other words, the weighting multiplies, at each position, the pixel value of the first basic feature map by the pixel value of the first attention feature map and then sums the products over the maps, and the resulting value is the pixel value at the corresponding position in the first feature fusion map.
  • As shown in FIG. 5C, a schematic diagram of the weighting process is provided.
  • three first basic feature maps r1, r2, r3 with a size of 2*2 are extracted, and the respective pixel values are shown in FIG. 5C.
  • the three first basic feature maps respectively correspond to the first attention feature maps s1, s2, and s3, and their respective pixel values are shown in FIG. 5C.
  • the pixel value at position 1 in the first feature fusion map is: 60*0.6+61*0.56+60*0.58; the pixel value at position 2 is: 70*0.7+70*0.7+73*0.72; the pixel value at position 3 is: 65*0.2+66*0.21+65*0.2; and the pixel value at position 4 is: 75*0.1+75*0.1+76*0.11.
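  • To make the weighting concrete, the following Python sketch recomputes the FIG. 5C example with the formula m_i = Σ_t s_i^t ∘ r_i^t; it assumes that positions 1-4 are laid out row by row in the 2*2 maps and is provided purely as a worked illustration.

```python
import torch

# The three 2*2 first basic feature maps r1, r2, r3 and their first attention feature maps s1, s2, s3,
# using the pixel values described for FIG. 5C (positions 1-4 assumed to be laid out row by row).
r = torch.tensor([[[60., 70.], [65., 75.]],
                  [[61., 70.], [66., 75.]],
                  [[60., 73.], [65., 76.]]])
s = torch.tensor([[[0.60, 0.70], [0.20, 0.10]],
                  [[0.56, 0.70], [0.21, 0.10]],
                  [[0.58, 0.72], [0.20, 0.11]]])

# m_i = sum over t of s_i^t ∘ r_i^t: multiply pixel values at each position, then sum over the maps.
m = (s * r).sum(dim=0)
print(m)   # position 1: 60*0.6 + 61*0.56 + 60*0.58 = 104.96, and so on for positions 2-4
```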
  • It should be noted that the predicted first basic feature maps and first attention feature maps are unlabeled images.
  • Step 304: Train the segmentation network to be trained according to the first feature fusion map and the sample original image, to obtain the trained segmentation network.
  • if the first basic feature map is the basic feature map of the detection frame image corresponding to the instance group, then training the segmentation network to be trained according to the first feature fusion map and the sample original image may be training the segmentation network to be trained according to the first feature fusion map and the detection frame image.
  • In order to train the segmentation network more accurately, before the segmentation network to be trained is trained according to the first feature fusion map and the sample original image/detection frame image, the size of the first feature fusion map and/or the size of the sample original image/detection frame image may also be scaled so that the size of the first feature fusion map is the same as the size of the sample original image/detection frame image. It is also possible to scale the first feature fusion map and the sample original image (detection frame image) at the same time to make their sizes the same.
  • When training is performed according to a certain instance, the segmentation network to be trained may treat the other instances as background during training.
  • During model training, a large number of sample original images are needed, so a large number of feature fusion images are obtained. The segmentation network to be trained compares the instance pixels extracted from the predicted first feature fusion map with the marked pixels of the corresponding instance in the sample original image, computes the difference, and uses the obtained difference to update the network parameters in reverse, so that the instance pixels extracted by the segmentation network from the sample original image are almost the same as the marked instance pixels.
  • if the sample original image is also marked with instance detection frames, the segmentation network to be trained may also compare the detection frame extracted from the predicted first feature fusion map with the corresponding marked detection frame in the sample original image, compute the difference, and use the obtained difference to update the network parameters in reverse, so that the detection frame extracted by the segmentation network from the sample original image is almost the same as the marked detection frame. Training on extracting instance pixels is then performed.
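  • The passage above describes the training signal only as computing the difference and updating the network parameters in reverse. The following Python sketch shows one assumed form of such a training step; the binary cross-entropy loss for masks, the L1 loss for detection frames, and the segmentation_net interface are illustrative assumptions, not choices specified by the embodiments.

```python
import torch
import torch.nn.functional as F

def train_step(segmentation_net, optimizer, sample_image, labeled_mask, labeled_box):
    """One training step: predict, compare with the labels, and back-propagate the difference.

    sample_image: (1, 3, H, W) sample original image
    labeled_mask: (1, 1, R, R) labeled instance pixels for one instance group
    labeled_box:  (1, 4)       labeled detection frame coordinates
    """
    pred_fusion_map, pred_box = segmentation_net(sample_image)   # assumed network interface

    # Difference between the instance pixels extracted from the predicted fusion map and the labeled pixels.
    mask_loss = F.binary_cross_entropy_with_logits(pred_fusion_map, labeled_mask)
    # Difference between the predicted detection frame and the labeled detection frame.
    box_loss = F.l1_loss(pred_box, labeled_box)

    loss = mask_loss + box_loss
    optimizer.zero_grad()
    loss.backward()          # update the network parameters "in reverse", i.e. by back-propagation
    optimizer.step()
    return loss.item()
```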
  • the network framework of the trained segmentation network is shown in Figure 5B. As shown in Figure 5B, the Attention masks are predicted in Detection Head.
  • the segmentation network includes a backbone network. After the image is input into the backbone network, the backbone network can output basic feature maps and attention feature maps. Among them, Detection Head is the head of the detection frame.
  • The Detection Head is a module of the neural network dedicated to detection prediction; it can output the confidence (class), that is, the category probability, which can be understood as the probability of each category for the instance in the detection frame predicted by the network, for example, a probability of 90% that the instance is a person, a probability of 20% that the instance is a cat, a probability of 15% that the instance is a dog, and so on. Box is the detection frame predicted by the segmentation network, which can specifically be the coordinates of the four corners of the detection frame.
  • An additional convolutional layer is added to the detection module of the FCOS network to predict the attention feature maps; that is, the attention masks are the attention feature maps predicted by the segmentation network for any detection frame image.
  • The Bottom module is a sub-module of the BlendMask network, specifically used to predict the base feature maps (Bases). The Bases correspond to the original image, and the basic feature maps used for feature fusion can subsequently be cropped from the Bases, for example by cropping the corresponding features from the Bases according to the detection frame of each instance.
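  • As a schematic rendering only of the two pieces named above, the following Python sketch adds one extra convolutional layer to a simplified detection head to predict K attention masks and uses a small bottom module to predict the Bases; the channel counts, K = 4, the attention size, and the layer structure are assumptions, and the real FCOS/BlendMask heads contain more layers than shown here.

```python
import torch
import torch.nn as nn

K = 4            # number of base feature maps / attention maps (example value)
ATTN_SIZE = 14   # M: spatial size of each predicted attention mask (example value)

class DetectionHeadWithAttention(nn.Module):
    """Simplified FCOS-style head: class confidence, box, plus an extra conv predicting attention masks."""
    def __init__(self, in_ch=256, num_classes=80):
        super().__init__()
        self.cls_conv = nn.Conv2d(in_ch, num_classes, 3, padding=1)   # confidence / class probabilities
        self.box_conv = nn.Conv2d(in_ch, 4, 3, padding=1)             # detection frame coordinates
        # The additional convolutional layer: K attention masks of size M*M per location.
        self.attn_conv = nn.Conv2d(in_ch, K * ATTN_SIZE * ATTN_SIZE, 3, padding=1)

    def forward(self, feat):
        return self.cls_conv(feat), self.box_conv(feat), self.attn_conv(feat)

class BottomModule(nn.Module):
    """Bottom module predicting the Bases (base feature maps) aligned with the original image."""
    def __init__(self, in_ch=256):
        super().__init__()
        self.tower = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, K, 3, padding=1),
        )

    def forward(self, feat):
        return self.tower(feat)

# Toy usage on one level of the feature pyramid.
feat = torch.randn(1, 256, 32, 32)
cls_logits, boxes, attn_masks = DetectionHeadWithAttention()(feat)
bases = BottomModule()(feat)
```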
  • Step 305 Obtain the original image.
  • the original image may be an image that has not undergone any processing, such as a picture taken by a terminal camera or camera, or a captured video frame.
  • Step 306 Input the original image into the trained segmentation network, and the segmentation network can output at least one feature fusion map corresponding to the original image.
  • Each feature fusion map includes at least one instance.
  • the function of the feature fusion map is to mark the pixels of the instance included in the original image, and any feature fusion map output by the segmentation network can be used to mark the pixels of at least one instance included in the feature fusion map.
  • the segmentation network can also output the detection frames of the instances included in the original image. For example, it can output the coordinates of the detection frames in the original image, or it can output an image with the detection frames, where this image differs from the original image only in whether the detection frames are included; it can also output both the image with the detection frames and the coordinates of the detection frames.
  • in an example, before the original image is input into the trained segmentation network, the original image is scaled to the preset size in the segmentation network.
  • By extracting the first basic feature maps and the first attention feature maps from the sample original image in which instance pixels are labeled as described above, the first feature fusion map is obtained, and the segmentation model is trained using the first feature fusion map and the sample original image.
  • the segmentation model trained in this way can accurately and quickly determine the instance pixels in the subsequent input original image to achieve accurate segmentation of the instances in the original image.
  • Step 401 Obtain the original image.
  • the original image may be a picture or video frame taken by a camera or a camera.
  • Step 402 Determine the detection frame of each instance included in the original image.
  • the terminal may receive the detection frame of each instance marked by the user on the original image.
  • Alternatively, a network model used to predict the detection frames of instances may be stored on the terminal.
  • the terminal can input the original image into the network model used to predict the detection frames of instances, and the network model can output the detection frame of each instance included in the original image.
  • the network model can be an existing network model, which will not be repeated here.
  • Step 403 For each detection frame image, determine at least two different basic feature maps, and an attention feature map corresponding to each basic feature map.
  • Step 404: Perform weighting on the pixel values of the at least two basic feature maps and the respectively corresponding attention feature maps to obtain a feature fusion map corresponding to the detection frame image, where the feature fusion map is used to mark the pixels of the instances included in the detection frame image.
  • the attention feature map has the same size as the basic feature map, the pixel value of each pixel in the attention feature map represents the weight value of the pixel at the corresponding position in the corresponding basic feature map, and there are pixels with different pixel values in the attention feature map.
  • the terminal can use the DeepLabV3+ algorithm to extract the basic feature map, or use an algorithm to extract the attention feature map.
  • the detection frame image can be input into a pre-trained segmentation network, and the segmentation network can be the segmentation network described above.
  • the segmentation network outputs at least two basic feature maps and corresponding attention feature maps. Then the terminal may perform weighting processing on the pixel values of the basic feature map and the pixel values of the attention feature map to obtain the corresponding feature fusion map.
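  • The feature fusion map obtained in this way marks the instance pixels inside a detection frame image; one still has to binarize it and place it back in the original image to obtain a final instance mask. The following Python sketch shows one plausible post-processing step; the sigmoid, the 0.5 threshold, and the assumption of a valid box are illustrative choices not given in the embodiments.

```python
import torch
import torch.nn.functional as F

def fusion_map_to_instance_mask(fusion_map, box, image_size, threshold=0.5):
    """Turn the feature fusion map of one detection frame image into a full-image binary instance mask.

    fusion_map: (R, R) fused map for one detection frame image
    box:        (x1, y1, x2, y2) detection frame in original-image coordinates (assumed valid, x2 > x1, y2 > y1)
    image_size: (H, W) of the original image
    """
    x1, y1, x2, y2 = [int(round(float(v))) for v in box]
    h, w = y2 - y1, x2 - x1
    # Resize the fused map to the detection frame size, then binarize it.
    resized = F.interpolate(fusion_map[None, None], size=(h, w),
                            mode="bilinear", align_corners=False)[0, 0]
    instance_in_box = torch.sigmoid(resized) > threshold
    # Paste the box-local mask back into a mask covering the whole original image.
    full_mask = torch.zeros(image_size, dtype=torch.bool)
    full_mask[y1:y2, x1:x2] = instance_in_box
    return full_mask

mask = fusion_map_to_instance_mask(torch.randn(56, 56), (10, 20, 90, 100), (480, 640))
```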
  • Based on the same technical idea as the above instance segmentation method, a device 600 for instance segmentation is provided.
  • the device 600 can perform each step performed by the terminal in the methods of FIG. 3 and FIG. 4; to avoid redundancy, details are not described here again.
  • the device 600 may be a terminal or a chip applied in the terminal.
  • the apparatus 600 may include: a processing module 610, optionally, a transceiver module 620 and a storage module 630; the processing module 610 may be connected to the storage module 630 and the transceiver module 620 respectively, and the storage module 630 may also be connected to the transceiver module 620 .
  • the transceiver module 620 may be used to receive the original image.
  • the storage module 630 can be used to store the original image and store the segmented network.
  • the processing module 610 is configured to obtain an original image; and input the original image into the trained segmentation network to obtain at least one feature fusion map corresponding to the original image, the feature fusion map Pixels used to mark instances included in the original image, and each feature fusion map includes at least one instance;
  • the processing module 610 is further configured to train the segmentation network according to the following methods:
  • the segmentation network to be trained performs the following processing on each instance group in the sample original image in which pixels of instances are labeled, where each instance group includes at least one labeled instance:
  • predicting at least two different first basic feature maps and a first attention feature map corresponding to each first basic feature map; performing weighting on the pixel values of the at least two first basic feature maps and the respectively corresponding first attention feature maps to predict a first feature fusion map; and training the segmentation network to be trained according to the first feature fusion map and the sample original image;
  • the first attention feature map has the same size as the first basic feature map, the pixel value of each pixel in the first attention feature map represents the weight value of the pixel at the corresponding position in the corresponding first basic feature map, and there are pixels with different pixel values in the first attention feature map.
  • the processing module 610 is further configured to obtain the detection frame of the instance included in the original image.
  • the processing module 610 is further configured to scale the original image to a size preset in the segmentation network before inputting the original image into the trained segmentation network.
  • the processing module 610 is further configured to, before the segmentation network to be trained is trained according to the first feature fusion map and the sample original image, scale the size of the first feature fusion map and/or the size of the sample original image so that the size of the first feature fusion map is the same as the size of the sample original image.
  • the processing module 610 is configured to obtain an original image; determine the detection frame of each instance included in the original image; for each detection frame image, determine at least two different basic feature maps and an attention feature map corresponding to each basic feature map; and perform weighting on the pixel values of the at least two basic feature maps and the respectively corresponding attention feature maps to obtain a feature fusion map corresponding to the detection frame image;
  • the feature fusion map is used to mark the pixels of the instances included in the detection frame image;
  • the attention feature map has the same size as the basic feature map, the pixel value of each pixel in the attention feature map represents the weight value of the pixel at the corresponding position in the corresponding basic feature map, and there are pixels with different pixel values in the attention feature map.
  • FIG. 7 is a schematic block diagram of a device 700 for instance segmentation according to an embodiment of the present application. It should be understood that the apparatus 700 can execute the steps performed by the terminal in the above-mentioned methods of FIG. 3 and FIG. 4; to avoid redundancy, details are not described here again.
  • the apparatus 700 includes a processor 710, and optionally, a memory 730 and a transceiver 720. The processor 710 and the memory 730 are electrically coupled.
  • the memory 730 is configured to store a computer program; the processor 710 may be configured to call the computer program or instructions stored in the memory, so as to perform the foregoing instance segmentation method through the transceiver 720.
  • the processing module 610 in FIG. 6 may be implemented by the processor 710, the transceiver module 620 may be implemented by the transceiver 720, and the storage module 630 may be implemented by the memory 730.
  • the foregoing processor may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
  • the processor may further include a hardware chip or other general-purpose processors.
  • the above-mentioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof.
  • the above-mentioned PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (generic array logic, GAL) and other programmable logic devices , Discrete gates or transistor logic devices, discrete hardware components, etc. or any combination thereof.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory mentioned in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory can be read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), and electrically available Erase programmable read-only memory (Electrically EPROM, EEPROM) or flash memory.
  • the volatile memory may be a random access memory (Random Access Memory, RAM), which is used as an external cache.
  • By way of example and not limitation, many forms of RAM are available, such as static random access memory (Static RAM, SRAM), dynamic random access memory (Dynamic RAM, DRAM), synchronous dynamic random access memory (Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (Synchlink DRAM, SLDRAM), and direct rambus random access memory (Direct Rambus RAM, DR RAM). It should be noted that the memory described in this application is intended to include, but is not limited to, these and any other suitable types of memory.
  • the embodiment of the present application also provides a computer storage medium that stores a computer program; when the computer program is executed by a computer, the computer can be used to execute the above-mentioned instance segmentation method.
  • the embodiment of the present application also provides a computer program product containing instructions, which when running on a computer, enables the computer to execute the method for instance segmentation provided above.
  • this application can be provided as methods, systems, or computer program products. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device.
  • the device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and thus the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A method and apparatus for instance segmentation. A segmentation network to be trained performs the following processing on each instance group in a sample original image in which pixels of instances are labeled, where each instance group includes at least one labeled instance: predicting at least two different first basic feature maps and a first attention feature map corresponding to each first basic feature map; performing weighting on the pixel values of the at least two first basic feature maps and the respectively corresponding first attention feature maps to obtain a first feature fusion map; and training the segmentation network to be trained according to the first feature fusion map and the sample original image. The pixel value of each pixel in a first attention feature map represents the weight value of the pixel at the corresponding position in the corresponding first basic feature map. The segmentation model can accurately determine the instance pixels in an original image.

Description

Method and apparatus for instance segmentation
Cross-Reference to Related Applications
This application claims priority to Chinese Patent Application No. 201911418245.5, filed with the Chinese Patent Office on December 31, 2019 and entitled "Method and apparatus for instance segmentation", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of this application relate to the field of computer vision technology, and in particular, to a method and apparatus for instance segmentation.
Background
Image segmentation technology is an important part of image semantic understanding. Current image segmentation tasks mainly include semantic segmentation and instance segmentation. Semantic segmentation divides the instances in an image into corresponding categories, such as person, cat, and dog, without distinguishing different instances that belong to the same category. For example, when there are multiple cats in an image, semantic segmentation predicts all the pixels of the cats as a whole as the category "cat". Instance segmentation further needs to distinguish different instances on the basis of specific categories, for example, distinguishing which pixels belong to the first cat and which pixels belong to the second cat.
As video and image applications on mobile devices become more and more widespread, the necessity of instance segmentation becomes increasingly prominent; it is an indispensable technology in portrait photography, video special effects, and AR scenarios. How to perform instance segmentation accurately is a technical problem that urgently needs to be solved.
Summary
Embodiments of this application provide a method and apparatus for instance segmentation, so as to provide a way of performing instance segmentation.
According to a first aspect, a method for instance segmentation is provided. A terminal trains an instance segmentation network in advance. After an original image is obtained, the original image can be input into the trained segmentation network, and the segmentation network can output at least one feature fusion map corresponding to the original image, where each feature fusion map includes at least one instance. The function of a feature fusion map is to mark the pixels of the instances included in the original image, so any feature fusion map output by the segmentation network can be used to mark the pixels of the at least one instance included in that feature fusion map.
When the segmentation network is trained, a sample original image may be input into the segmentation network to be trained, where pixels of at least one instance are labeled in the sample original image. The segmentation network to be trained may perform the following processing on each instance group in the sample original image, where each instance group includes at least one labeled instance: predicting at least two different first basic feature maps, and predicting, for each first basic feature map, its corresponding first attention feature map. The first attention feature map has the same size as the first basic feature map, the pixel value of each pixel in the first attention feature map represents the weight value of the pixel at the corresponding position in the corresponding first basic feature map, and there are pixels with different pixel values in the first attention feature map. The segmentation network to be trained may perform weighting on the pixel values in the at least two first basic feature maps and the respectively corresponding first attention feature maps to predict a first feature fusion map, and train the segmentation network to be trained according to the first feature fusion map and the sample original image.
By extracting the first basic feature maps and the first attention feature maps from the sample original image in which instance pixels are labeled as described above, the first feature fusion map is obtained, and the segmentation model is trained using the first feature fusion map and the sample original image. A segmentation model trained in this way can accurately and quickly determine the instance pixels in a subsequently input original image, so as to accurately segment the instances in the original image. In addition, there are pixels with different pixel values in the attention feature map, so that the weight of each pixel can be taken into account and the pixels of an instance can be distinguished even more accurately.
In a possible implementation, the pixel values of the pixels in the first attention feature map lie within a set value range, for example between 0 and 1. The range may also be between 0 and 0.5, or between 0.5 and 1.
In a possible implementation, the sample original image input into the segmentation network to be trained may be labeled not only with instance pixels but also with detection frames, where a detection frame is used to identify an instance. Generally, one instance corresponds to one detection frame, and the pixels of the instance are located inside the detection frame. A segmentation network trained with sample original images labeled with detection frames has the ability to mark the detection frames of instances; when an original image is input into such a segmentation network, the segmentation network outputs the detection frames of the instances included in the original image. For example, it may output the coordinates of the detection frames in the original image, or it may output an image with the detection frames, where the image differs from the original image only in whether the detection frames are included, or it may output both the image with the detection frames and the detection frame coordinates.
In a possible implementation, the first basic feature map is the basic feature map of the detection frame image corresponding to the instance group; in that case, training the segmentation network to be trained according to the first feature fusion map and the sample original image may specifically be training the segmentation network to be trained according to the first feature fusion map and the detection frame image.
In a possible implementation, an image size is preset in the segmentation network, and before the sample original image is input into the segmentation network to be trained, the sample original image may first be scaled to the size preset in the segmentation network. Accordingly, after the segmentation network has been trained, before the original image is input into the trained segmentation network, the original image may first be scaled to the size preset in the segmentation network.
In a possible implementation, an image size is preset in the segmentation network, and the segmentation network can adjust the size of an image to reach the preset size. Specifically, during model training, the size of the first basic feature map predicted by the segmentation network to be trained may be the size preset in the segmentation network to be trained. Furthermore, because the first attention feature map has the same size as the first basic feature map, the size of the first attention feature map and the size of the first feature fusion map are both the size preset in the segmentation network to be trained. To train the segmentation network more accurately, before the segmentation network to be trained is trained according to the first feature fusion map and the sample original image, the size of the first feature fusion map and/or the size of the sample original image may also be scaled so that the size of the first feature fusion map is the same as the size of the sample original image.
According to a second aspect, a method for instance segmentation is provided. A terminal first obtains an original image and then processes the original image to determine the detection frame of each instance included in the original image. Then, for each detection frame image, at least two different basic feature maps and an attention feature map corresponding to each basic feature map are determined; the pixel values of the at least two basic feature maps and the respectively corresponding attention feature maps are weighted to obtain a feature fusion map corresponding to the detection frame image, where the feature fusion map is used to mark the pixels of the instances included in the detection frame. The attention feature map has the same size as the basic feature map, the pixel value of each pixel in the attention feature map represents the weight value of the pixel at the corresponding position in the corresponding basic feature map, and there are pixels with different pixel values in the attention feature map.
By performing instance segmentation in the above manner of extracting basic feature maps and attention feature maps and obtaining a feature fusion map, the instance pixels in the original image can be determined accurately and quickly, so that instance segmentation is performed accurately. In addition, there are pixels with different pixel values in the attention feature map, so that the weight of each pixel can be taken into account and the pixels of an instance can be distinguished even more accurately.
According to a third aspect, an apparatus for instance segmentation is provided. The apparatus has the function of implementing the first aspect and any possible implementation of the first aspect, or the second aspect and any possible implementation of the second aspect. These functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more functional modules corresponding to the above functions.
According to a fourth aspect, an apparatus for instance segmentation is provided. The apparatus may be the terminal in the foregoing method embodiments, or a chip disposed in the terminal. The apparatus includes a transceiver and a processor, and optionally further includes a memory. The memory is used to store computer programs or instructions, and the processor is coupled to the memory and the transceiver respectively; when the processor executes the computer programs or instructions, the apparatus is caused to perform, through the transceiver, the method performed by the terminal in the first aspect and any possible implementation of the first aspect, or in the second aspect and any possible implementation of the second aspect.
According to a fifth aspect, a computer program product is provided. The computer program product includes computer program code; when the computer program code is run on a computer, the computer is caused to perform the method performed by the terminal in the first aspect and any possible implementation of the first aspect, or in the second aspect and any possible implementation of the second aspect.
According to a sixth aspect, this application provides a chip system. The chip system includes a processor and a memory that are electrically coupled to each other; the memory is used to store computer program instructions; the processor is used to execute some or all of the computer program instructions in the memory, and when the some or all of the computer program instructions are executed, they are used to implement the functions of the terminal in the method of the first aspect and any possible implementation of the first aspect, or of the second aspect and any possible implementation of the second aspect.
In a possible design, the chip system may further include a transceiver, where the transceiver is configured to send a signal processed by the processor or to receive a signal input to the processor. The chip system may consist of chips, or may include chips and other discrete components.
According to a seventh aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program; when the computer program is run, the method performed by the terminal in the first aspect and any possible implementation of the first aspect, or in the second aspect and any possible implementation of the second aspect, is performed.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an instance segmentation scenario provided in an embodiment of this application;
FIG. 2 is a schematic flowchart of instance segmentation provided in an embodiment of this application;
FIG. 3 and FIG. 4 are schematic flowcharts of instance segmentation provided in embodiments of this application;
FIG. 5A shows basic feature maps provided in an embodiment of this application;
FIG. 5B is a framework diagram of a network model for instance segmentation provided in an embodiment of this application;
FIG. 5C is an example diagram of weighting provided in an embodiment of this application;
FIG. 6 and FIG. 7 are structural diagrams of apparatuses for instance segmentation provided in embodiments of this application.
Detailed Description
The following describes the embodiments of this application in detail with reference to the accompanying drawings.
Embodiments of this application provide a method and apparatus for instance segmentation. The method and the apparatus are based on the same technical idea; because the principles by which the method and the apparatus solve the problem are similar, the implementations of the apparatus and the method may refer to each other, and repeated descriptions are not provided again.
To facilitate understanding of the embodiments of this application, the application scenarios of this application are introduced next. The service scenarios described in the embodiments of this application are intended to describe the technical solutions of the embodiments more clearly and do not constitute a limitation on the technical solutions provided by the embodiments. A person of ordinary skill in the art may know that, as new service scenarios emerge, the technical solutions provided in the embodiments of this application are also applicable to similar technical problems.
For example, as shown in FIG. 1, a user may take a picture containing a portrait on a terminal or other device, and the portrait can be regarded as an instance in the picture. The user may also use the terminal or other device to perform instance segmentation, so as to implement functions such as virtualization or replacement of the background outside the portrait, which can be applied in scenarios such as live-stream production, movie production, and animation production. For example, the protagonist can be selected and the background grayed out to finally produce the effect of retaining color only on the portrait protagonist. As another example, a terminal performs instance segmentation on vehicles on the road; during automated driving, the vehicle-mounted terminal can then assist the automated driving system in making better driving decisions based on the instance segmentation results for the vehicles on the road.
To facilitate understanding of the embodiments of this application, some terms used in the embodiments are explained below for those skilled in the art.
1) Terminal, also called user equipment (UE), mobile station (MS), mobile terminal (MT), etc., is a device that provides voice and/or data connectivity to users. For example, terminal devices include handheld devices with a wireless connection function, vehicle-mounted devices, and Internet of Things devices. Currently, a terminal device may be: a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a mobile internet device (MID), a wearable device, a virtual reality (VR) device, an augmented reality (AR) device, a wireless terminal in industrial control, a wireless terminal in self-driving, a wireless terminal in remote medical surgery, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, or a wireless terminal in a smart home, etc.
2) Instance segmentation is a task of identifying the contour of an instance at the pixel level. The more accurate the instance edges obtained by instance segmentation, the finer the instance segmentation and the better the segmentation effect.
"And/or" in this application describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate the following three cases: only A exists, both A and B exist, and only B exists. The character "/" generally indicates an "or" relationship between the associated objects before and after it.
"Multiple" in this application means two or more.
In the descriptions of this application, terms such as "first" and "second" are merely used for the purpose of distinguishing descriptions, and cannot be understood as indicating or implying relative importance, nor as indicating or implying an order.
In addition, in the embodiments of this application, the word "exemplary" is used to mean serving as an example, illustration, or explanation. Any embodiment or implementation described as an "example" in this application should not be construed as being preferred or more advantageous than other embodiments or implementations. Rather, the word "example" is intended to present a concept in a concrete manner.
As shown in FIG. 2, a schematic flowchart of instance segmentation of this application is provided.
First, an original image is obtained, and the original image is processed to extract basic feature maps and attention feature maps. Optionally, the original image is first input into a backbone network, and the backbone network obtains images at different resolutions, that is, a feature pyramid; the basic feature maps and the attention feature maps are then extracted from the feature pyramid. An attention feature map has the same size as the corresponding basic feature map, and the pixel value of each pixel in the attention feature map represents the weight value of the pixel at the corresponding position in the corresponding basic feature map. Next, the basic feature maps and the attention feature maps are weighted and fused to obtain a feature fusion image for instance segmentation. The feature fusion image can represent the result of instance segmentation.
Next, the instance segmentation process of this application is described in detail.
As shown in FIG. 3, a schematic flowchart of instance segmentation is provided. In this embodiment, a neural network model for instance segmentation, referred to as a segmentation network for short, is trained first, and the segmentation network is then used for instance segmentation.
The specific process of training the segmentation network is described in the following steps.
Step 301: Input a sample original image in which pixels of at least one instance are labeled into the segmentation network to be trained.
To train the segmentation network, a user may determine a batch of original images in advance. The original images may be pictures taken by a camera of the terminal, or collected video frames.
The user may mark the pixels of the instances on the original image; for example, the pixel values of the pixels occupied by an instance and of the pixels occupied by the background image may be set to different values. An original image in which instance pixels have been marked by the user may be called a sample original image. If a sample original image includes multiple instances, the mark of each instance is also different. For example, if the original image is a picture of three people taken by a camera, the pixel values of the background image other than the three people may be marked as 0, the pixel values of the pixels occupied by the first person marked as 2, the pixel values of the pixels occupied by the second person marked as 4, and the pixel values of the pixels occupied by the third person marked as 6. If the user mistakenly gives two instances the same mark, the terminal will regard these two instances as one instance.
To improve detection performance, detection frames of the instances may also be marked on the sample original image, where a detection frame is used to identify an instance. Generally, one instance corresponds to one detection frame, and the pixels occupied by an instance are located inside its corresponding detection frame. One detection frame may also correspond to multiple instances, in which case the pixels occupied by these multiple instances are located inside the corresponding detection frame.
To train the segmentation network accurately, an image size may be preset in the segmentation network. Before the sample original image is input into the segmentation network to be trained, the sample original image may first be scaled to the size preset in the segmentation network. Of course, the segmentation network may instead adjust the size of the input sample original image to reach the preset size.
The segmentation network to be trained may perform the processing of the following step 302 to step 304 on each instance group in the sample original image. Each instance group includes at least one marked instance.
Step 302: The segmentation network to be trained predicts at least two different first basic feature maps, and predicts, for each first basic feature map, its corresponding first attention feature map.
The first attention feature map has the same size as the first basic feature map, the pixel value of each pixel in the first attention feature map represents the weight value of the pixel at the corresponding position in the corresponding first basic feature map, and there are pixels with different pixel values in the first attention feature map.
If each instance group includes one marked instance, one instance corresponds to one feature fusion map. If each instance group includes multiple instances, these multiple instances correspond to one feature fusion map. The first basic feature map may be a basic feature map corresponding to the sample original image, or a basic feature map corresponding to the detection frame image corresponding to the instance group.
When the segmentation network to be trained predicts the first basic feature maps, the prediction may be performed using the DeepLabV3+ algorithm. The DeepLabV3+ algorithm can extract basic features accurately and thus represent instance edges well; for example, it represents both the edges and the limbs of a portrait well.
One feature fusion map may be determined from at least two basic feature maps and the respectively corresponding attention feature maps, and the numbers of basic feature maps and attention feature maps are, for example, 4 each. As shown in FIG. 5A, an image is input into the segmentation network, the segmentation network performs feature extraction, and four basic feature maps are output.
The pixel values of the first attention feature map lie within a set value range, for example between 0 and 1; the range may also be between 0 and 0.5, or between 0.5 and 1.
An image size is preset in the segmentation network; before the sample original image is input into the segmentation network to be trained, the sample original image may first be scaled to the size preset in the segmentation network. Alternatively, the segmentation network may adjust the size of the image to reach the preset size. For example, during model training, the size of the first basic feature maps predicted by the segmentation network to be trained may be the size preset in the segmentation network to be trained. Furthermore, because the first attention feature map has the same size as the first basic feature map, the size of the first attention feature map and the size of the first feature fusion map are both the size preset in the segmentation network to be trained.
The marked sample original image may first be scaled to reach the preset size; alternatively, after the basic feature maps are extracted, the basic feature maps may be scaled to reach the preset size.
For example, if the preset image size in the segmentation network is R*R, the following formulas may be used to scale the basic feature maps and the attention feature maps, where one detection frame predicted by the segmentation network includes one predicted instance.
r_i = RoIPool_{R×R}(B, p_i);
where B is the basic feature map, corresponding to the original image, predicted by the bottom module (Bottom Module) in the segmentation network; p_i is the coordinates of the detection frame of the i-th instance in the original image; and r_i is the basic feature map of size R*R obtained by mapping the coordinates of the detection frame of the i-th instance onto B, extracting the basic features located inside the detection frame, and then scaling them.
a′_i = interpolate_{M×M→R×R}(a_i);
where a_i is the attention feature map of the i-th instance initially predicted by the segmentation network, with size M*M; it is then scaled to size R*R, and a′_i is the attention feature map after scaling; a′_i is the attention feature map of the i-th instance and corresponds to the basic feature map r_i of the i-th instance.
The pixel values in the attention feature map represent weight values. The segmentation network may also normalize the pixel values of the scaled attention feature map so that the normalized pixel values lie within a set value range. The normalization formula is as follows:
s_i = softmax(a′_i); where s_i is the normalized attention feature map of the i-th instance. The normalization may be understood as dividing all pixel values by a same value.
Step 303: The segmentation network to be trained performs weighting on the pixel values in the at least two first basic feature maps and the respectively corresponding first attention feature maps to predict a first feature fusion map.
Specifically, see the following formula:
m_i = Σ_{t=1}^{K} s_i^t ∘ r_i^t;
where s_i^t is the normalized t-th attention feature map of the i-th instance, r_i^t is the t-th basic feature map of the i-th instance, and m_i is the feature fusion map corresponding to the i-th detection frame image. K is the total number of basic feature maps corresponding to one instance (the i-th instance). ∘ denotes the element-wise (dot) product of matrices, that is, a pointwise multiplication of the pixel values at corresponding positions. The weighting thus multiplies, at each position, the pixel value of the first basic feature map by the pixel value of the first attention feature map and then sums the products, and the resulting value is the pixel value at the corresponding position in the first feature fusion map.
As shown in FIG. 5C, a schematic diagram of the weighting process is provided. For one instance, three first basic feature maps r1, r2, r3 of size 2*2 are extracted, and their pixel values are shown in FIG. 5C. The first attention feature maps respectively corresponding to the three first basic feature maps are s1, s2, and s3, and their pixel values are shown in FIG. 5C. The pixel value at position 1 in the first feature fusion map is: 60*0.6+61*0.56+60*0.58; the pixel value at position 2 is: 70*0.7+70*0.7+73*0.72; the pixel value at position 3 is: 65*0.2+66*0.21+65*0.2; and the pixel value at position 4 is: 75*0.1+75*0.1+76*0.11.
It should be noted that the predicted first basic feature maps and first attention feature maps are unlabeled images.
Step 304: Train the segmentation network to be trained according to the first feature fusion map and the sample original image, to obtain the trained segmentation network.
If the first basic feature map is the basic feature map of the detection frame image corresponding to the instance group, then training the segmentation network to be trained according to the first feature fusion map and the sample original image may be training the segmentation network to be trained according to the first feature fusion map and the detection frame image.
To train the segmentation network more accurately, before the segmentation network to be trained is trained according to the first feature fusion map and the sample original image/detection frame image, the size of the first feature fusion map and/or the size of the sample original image/detection frame image may also be scaled so that the size of the first feature fusion map is the same as the size of the sample original image/detection frame image. It is also possible to scale the first feature fusion map and the sample original image (detection frame image) at the same time to make their sizes the same.
When training is performed according to a certain instance, the segmentation network to be trained may treat the other instances as background during training.
During model training, a large number of sample original images are needed, so a large number of feature fusion images are obtained. The segmentation network to be trained compares the instance pixels extracted from the predicted first feature fusion map with the marked pixels of the corresponding instance in the sample original image, computes the difference, and uses the obtained difference to update the network parameters in reverse, so that the instance pixels extracted by the segmentation network from the sample original image are almost the same as the marked instance pixels.
If the sample original image is also marked with instance detection frames, the segmentation network to be trained may also compare the detection frame extracted from the predicted first feature fusion map with the corresponding marked detection frame in the sample original image, compute the difference, and use the obtained difference to update the network parameters in reverse, so that the detection frame extracted by the segmentation network from the sample original image is almost the same as the marked detection frame. Training on extracting instance pixels is then performed.
The network framework of the trained segmentation network is shown in FIG. 5B. As shown in FIG. 5B, the attention masks are predicted in the Detection Head.
The segmentation network includes a backbone network; after an image is input into the backbone network, the backbone network can output basic feature maps and attention feature maps. The Detection Head is the detection-frame head, a module of the neural network dedicated to detection prediction; it can output the confidence (class), that is, the category probability, which can be understood as the probability of each category for the instance in the detection frame predicted by the network, for example, a probability of 90% that the instance is a person, a probability of 20% that it is a cat, a probability of 15% that it is a dog, and so on. Box is the detection frame predicted by the segmentation network, which may specifically be the coordinates of the four corners of the detection frame. An additional convolutional layer is added to the detection module of the FCOS network to predict the attention feature maps; that is, the attention masks are the attention feature maps predicted by the segmentation network for any detection frame image. The Bottom module is a sub-module of the BlendMask network, specifically used to predict the base feature maps (Bases). The Bases correspond to the original image, and the basic feature maps used for feature fusion can subsequently be cropped from the Bases, for example by cropping the corresponding features from the Bases according to the detection frame of each instance.
After the segmentation network has been trained, the specific process of obtaining an instance segmentation result with the segmentation network is described in the following steps.
Step 305: Obtain an original image.
The original image may be an image that has not undergone any processing, for example a picture taken by a camera of the terminal, or a collected video frame.
Step 306: Input the original image into the trained segmentation network, and the segmentation network can output at least one feature fusion map corresponding to the original image.
Each feature fusion map includes at least one instance. The function of a feature fusion map is to mark the pixels of the instances included in the original image, so any feature fusion map output by the segmentation network can be used to mark the pixels of the at least one instance included in that feature fusion map. The segmentation network may also output the detection frames of the instances included in the original image; for example, it may output the coordinates of the detection frames in the original image, or it may output an image with the detection frames, where this image differs from the original image only in whether the detection frames are included, or it may output both the image with the detection frames and the coordinates of the detection frames.
Of course, it is also possible that the original image contains no instance at all.
In an example, before the original image is input into the trained segmentation network, the original image is scaled to the size preset in the segmentation network.
By extracting the first basic feature maps and the first attention feature maps from the sample original image in which instance pixels are labeled as described above, the first feature fusion map is obtained, and the segmentation model is trained using the first feature fusion map and the sample original image. A segmentation model trained in this way can accurately and quickly determine the instance pixels in a subsequently input original image, so as to accurately segment the instances in the original image.
As shown in FIG. 4, a schematic flowchart of another instance segmentation is provided:
Step 401: Obtain an original image.
The original image may be a picture or a video frame taken by a camera.
Step 402: Determine the detection frame of each instance included in the original image.
The terminal may receive the detection frame of each instance marked by the user on the original image. Alternatively, a network model used to predict the detection frames of instances may be stored on the terminal; the terminal may input the original image into the network model used to predict the detection frames of instances, and the network model may output the detection frame of each instance included in the original image. The network model may be an existing network model, and details are not described here again; an illustrative sketch follows below.
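As a hedged illustration of this step only, the following Python sketch uses torchvision's off-the-shelf Faster R-CNN as such an existing network model for predicting detection frames; the embodiments do not prescribe any particular detector, and the 0.5 score threshold and the helper name detect_boxes are assumptions.

```python
import torch
import torchvision

# Load an existing, pre-trained detection model (any detector that outputs boxes would do).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_boxes(original_image, score_threshold=0.5):
    """original_image: (3, H, W) float tensor in [0, 1]; returns one (x1, y1, x2, y2) detection frame per instance."""
    with torch.no_grad():
        output = detector([original_image])[0]   # torchvision detectors return a list of dicts
    keep = output["scores"] > score_threshold
    return output["boxes"][keep]

boxes = detect_boxes(torch.rand(3, 480, 640))
```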
Step 403: For each detection frame image, determine at least two different basic feature maps and an attention feature map corresponding to each basic feature map.
Step 404: Perform weighting on the pixel values of the at least two basic feature maps and the respectively corresponding attention feature maps to obtain a feature fusion map corresponding to the detection frame image, where the feature fusion map is used to mark the pixels of the instances included in the detection frame image.
The attention feature map has the same size as the basic feature map, the pixel value of each pixel in the attention feature map represents the weight value of the pixel at the corresponding position in the corresponding basic feature map, and there are pixels with different pixel values in the attention feature map.
The terminal may use the DeepLabV3+ algorithm to extract the basic feature maps, and may also use an algorithm to extract the attention feature maps.
Alternatively, the detection frame image may be input into a pre-trained segmentation network, which may be the segmentation network described above. The segmentation network outputs at least two basic feature maps and the respectively corresponding attention feature maps. The terminal may then perform weighting on the pixel values of the basic feature maps and the pixel values of the attention feature maps to obtain the corresponding feature fusion map.
The foregoing describes the instance segmentation method of the embodiments of this application; the following describes the instance segmentation apparatus of the embodiments of this application.
Based on the same technical idea as the foregoing instance segmentation method, as shown in FIG. 6, an apparatus 600 for instance segmentation is provided. The apparatus 600 can perform each step performed by the terminal in the methods of FIG. 3 and FIG. 4; to avoid redundancy, details are not described here again. The apparatus 600 may be a terminal, or a chip applied in a terminal. The apparatus 600 may include a processing module 610, and optionally a transceiver module 620 and a storage module 630; the processing module 610 may be connected to the storage module 630 and the transceiver module 620 respectively, and the storage module 630 may also be connected to the transceiver module 620.
The transceiver module 620 may be used to receive the original image.
The storage module 630 may be used to store the original image and to store the segmentation network.
In one implementation, the processing module 610 is configured to obtain an original image, and input the original image into the trained segmentation network to obtain at least one feature fusion map corresponding to the original image, where the feature fusion map is used to mark the pixels of the instances included in the original image, and each feature fusion map includes at least one instance;
the processing module 610 is further configured to train the segmentation network in the following manner:
the segmentation network to be trained performs the following processing on each instance group in the sample original image in which pixels of instances are labeled, where each instance group includes at least one labeled instance:
predicting at least two different first basic feature maps and a first attention feature map corresponding to each first basic feature map; performing weighting on the pixel values of the at least two first basic feature maps and the respectively corresponding first attention feature maps to predict a first feature fusion map; and training the segmentation network to be trained according to the first feature fusion map and the sample original image;
the first attention feature map has the same size as the first basic feature map, the pixel value of each pixel in the first attention feature map represents the weight value of the pixel at the corresponding position in the corresponding first basic feature map, and there are pixels with different pixel values in the first attention feature map.
In one implementation, the processing module 610 is further configured to obtain the detection frames of the instances included in the original image.
In one implementation, the processing module 610 is further configured to scale the original image to the size preset in the segmentation network before the original image is input into the trained segmentation network.
In one implementation, the processing module 610 is further configured to, before the segmentation network to be trained is trained according to the first feature fusion map and the sample original image, scale the size of the first feature fusion map and/or the size of the sample original image so that the size of the first feature fusion map is the same as the size of the sample original image.
In one implementation, the processing module 610 is configured to obtain an original image; determine the detection frame of each instance included in the original image; for each detection frame image, determine at least two different basic feature maps and an attention feature map corresponding to each basic feature map; and perform weighting on the pixel values of the at least two basic feature maps and the respectively corresponding attention feature maps to obtain a feature fusion map corresponding to the detection frame image, where the feature fusion map is used to mark the pixels of the instances included in the detection frame image;
the attention feature map has the same size as the basic feature map, the pixel value of each pixel in the attention feature map represents the weight value of the pixel at the corresponding position in the corresponding basic feature map, and there are pixels with different pixel values in the attention feature map.
图7是本申请实施例的实例分割的装置700的示意性框图。应理解,所述装置700能够执行上述图3和图4的方法中由终端执行的各个步骤,为了避免冗余,此处不再详述。装置700包括:处理器710,可选的,还包括存储器730和收发器720。所述处理器710和所述存储器730之间电耦合。
示例的,存储器730,用于存储计算机程序;所述处理器710,可以用于调用所述存储器中存储的计算机程序或指令,以通过所述收发器720执行上述的实例分割的方法。
图6中的处理模块610可以通过处理器710来实现,收发模块620可以通过收发器720来实现,存储模块630可以通过存储器730来实现。
The above processor may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP. The processor may further include a hardware chip or another general-purpose processor. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or any combination thereof. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
It should also be understood that the memory mentioned in the embodiments of this application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), used as an external cache. By way of illustration rather than limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory described in this application is intended to include, but is not limited to, these and any other suitable types of memory.
An embodiment of this application further provides a computer storage medium storing a computer program; when the computer program is executed by a computer, the computer is caused to perform the above instance segmentation method.
An embodiment of this application further provides a computer program product containing instructions; when the computer program product runs on a computer, the computer is caused to perform the instance segmentation method provided above.
A person skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
This application is described with reference to the flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or the other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or the other programmable device to produce computer-implemented processing, and the instructions executed on the computer or the other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of this application have been described, a person skilled in the art, once aware of the basic inventive concept, may make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of this application.
Obviously, a person skilled in the art may make various changes and variations to the embodiments of this application without departing from the spirit and scope of the embodiments of this application. Thus, if these modifications and variations of the embodiments of this application fall within the scope of the claims of this application and their equivalent technologies, this application is also intended to include these changes and variations.

Claims (18)

  1. An instance segmentation method, wherein the method comprises:
    obtaining an original image;
    inputting the original image into a trained segmentation network to obtain at least one feature fusion map corresponding to the original image, wherein the feature fusion map is used to mark pixels of instances comprised in the original image, and each feature fusion map comprises at least one instance;
    wherein the segmentation network is trained in the following manner:
    a segmentation network to be trained performs the following processing for each instance group in a sample original image in which pixels of instances are labelled, wherein each instance group comprises at least one labelled instance:
    predicting at least two different first base feature maps and a first attention feature map corresponding to each of the first base feature maps; performing weighted processing on pixel values of the at least two first base feature maps and the respectively corresponding first attention feature maps to predict a first feature fusion map; and training the segmentation network to be trained according to the first feature fusion map and the sample original image;
    wherein the first attention feature map has a same size as the first base feature map, a pixel value of each pixel in the first attention feature map represents a weight value of a pixel at a corresponding position in the corresponding first base feature map, and the first attention feature map comprises pixels with different pixel values.
  2. The method according to claim 1, wherein pixel values in the first attention feature map are in a range of 0 to 1.
  3. The method according to claim 1, wherein the sample original image is further labelled with detection boxes, and the detection boxes are used to identify instances;
    when the at least one feature fusion map corresponding to the original image is obtained, the method further comprises:
    obtaining detection boxes of the instances comprised in the original image.
  4. The method according to claim 3, wherein the first base feature map is a base feature map of a detection-box image corresponding to the instance group;
    the training the segmentation network to be trained according to the first feature fusion map and the sample original image comprises:
    training the segmentation network to be trained according to the first feature fusion map and the detection-box image.
  5. The method according to any one of claims 1 to 4, wherein before the inputting the original image into the trained segmentation network, the method further comprises:
    scaling the original image to a size preset in the segmentation network.
  6. The method according to any one of claims 1 to 4, wherein the size of the first base feature map, the size of the first attention feature map, and the size of the first feature fusion map are all sizes preset in the segmentation network to be trained;
    before the training the segmentation network to be trained according to the first feature fusion map and the sample original image, the method further comprises:
    scaling the size of the first feature fusion map and/or the size of the sample original image so that the size of the first feature fusion map is the same as the size of the sample original image.
  7. An instance segmentation method, wherein the method comprises:
    obtaining an original image;
    determining a detection box of each instance comprised in the original image;
    for each detection-box image, determining at least two different base feature maps and an attention feature map corresponding to each of the base feature maps; and performing weighted processing on pixel values of the at least two base feature maps and the respectively corresponding attention feature maps to obtain a feature fusion map corresponding to the detection-box image, wherein the feature fusion map is used to mark pixels of instances comprised in the detection-box image;
    wherein the attention feature map has a same size as the base feature map, a pixel value of each pixel in the attention feature map represents a weight value of a pixel at a corresponding position in the corresponding base feature map, and the attention feature map comprises pixels with different pixel values.
  8. An instance segmentation apparatus, wherein the apparatus comprises:
    a storage module, configured to store a trained segmentation network;
    a processing module, configured to obtain an original image, and input the original image into the trained segmentation network to obtain at least one feature fusion map corresponding to the original image, wherein the feature fusion map is used to mark pixels of instances comprised in the original image, and each feature fusion map comprises at least one instance;
    the processing module is further configured to train the segmentation network in the following manner:
    the segmentation network to be trained performs the following processing for each instance group in a sample original image in which pixels of instances are labelled, wherein each instance group comprises at least one labelled instance:
    predicting at least two different first base feature maps and a first attention feature map corresponding to each of the first base feature maps; performing weighted processing on pixel values of the at least two first base feature maps and the respectively corresponding first attention feature maps to predict a first feature fusion map; and training the segmentation network to be trained according to the first feature fusion map and the sample original image;
    wherein the first attention feature map has a same size as the first base feature map, a pixel value of each pixel in the first attention feature map represents a weight value of a pixel at a corresponding position in the corresponding first base feature map, and the first attention feature map comprises pixels with different pixel values.
  9. The apparatus according to claim 8, wherein pixel values in the first attention feature map are in a range of 0 to 1.
  10. The apparatus according to claim 8, wherein the sample original image is further labelled with detection boxes, and the detection boxes are used to identify instances;
    the processing module is further configured to obtain detection boxes of the instances comprised in the original image.
  11. The apparatus according to claim 10, wherein the first base feature map is a base feature map of a detection-box image corresponding to the instance group;
    the training the segmentation network to be trained according to the first feature fusion map and the sample original image comprises:
    training the segmentation network to be trained according to the first feature fusion map and the detection-box image.
  12. The apparatus according to any one of claims 8 to 11, wherein the processing module is further configured to scale the original image to a size preset in the segmentation network before inputting the original image into the trained segmentation network.
  13. The apparatus according to any one of claims 8 to 11, wherein the size of the first base feature map, the size of the first attention feature map, and the size of the first feature fusion map are all sizes preset in the segmentation network to be trained;
    the processing module is further configured to, before training the segmentation network to be trained according to the first feature fusion map and the sample original image, scale the size of the first feature fusion map and/or the size of the sample original image so that the size of the first feature fusion map is the same as the size of the sample original image.
  14. An instance segmentation apparatus, wherein the apparatus comprises:
    a processing module, configured to obtain an original image; determine a detection box of each instance comprised in the original image; for each detection-box image, determine at least two different base feature maps and an attention feature map corresponding to each of the base feature maps; and perform weighted processing on pixel values of the at least two base feature maps and the respectively corresponding attention feature maps to obtain a feature fusion map corresponding to the detection-box image, wherein the feature fusion map is used to mark pixels of instances comprised in the detection-box image;
    wherein the attention feature map has a same size as the base feature map, a pixel value of each pixel in the attention feature map represents a weight value of a pixel at a corresponding position in the corresponding base feature map, and the attention feature map comprises pixels with different pixel values.
  15. An instance segmentation apparatus, comprising a processor, a memory, and a transceiver;
    wherein the memory stores a computer program or instructions;
    the transceiver is configured to receive and/or send signals; and
    when executing the computer program or instructions, the processor causes the apparatus to perform the method according to any one of claims 1 to 6 or claim 7.
  16. A computer-readable storage medium, wherein the storage medium stores computer instructions, and when the computer instructions are executed by a computer, the computer is caused to perform the method according to any one of claims 1 to 6 or claim 7.
  17. A chip system, comprising a processor and a memory, the processor and the memory being electrically coupled;
    wherein the memory is configured to store computer program instructions; and
    the processor is configured to execute some or all of the computer program instructions in the memory, and when the some or all of the computer program instructions are executed, the method according to any one of claims 1 to 6 or claim 7 is implemented.
  18. A computer program product, wherein the computer program product comprises computer instructions, and when the computer instructions are executed by a computer, the computer is caused to perform the method according to any one of claims 1 to 6 or claim 7.
PCT/CN2020/142438 2019-12-31 2020-12-31 Instance segmentation method and apparatus WO2021136528A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/853,799 US20220335619A1 (en) 2019-12-31 2022-06-29 Instance segmentation method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911418245.5 2019-12-31
CN201911418245.5A CN111192277A (zh) 2019-12-31 2019-12-31 Instance segmentation method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/853,799 Continuation US20220335619A1 (en) 2019-12-31 2022-06-29 Instance segmentation method and apparatus

Publications (1)

Publication Number Publication Date
WO2021136528A1 true WO2021136528A1 (zh) Instance segmentation method and apparatus

Family

ID=70709705

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/142438 WO2021136528A1 (zh) 2019-12-31 2020-12-31 Instance segmentation method and apparatus

Country Status (3)

Country Link
US (1) US20220335619A1 (zh)
CN (1) CN111192277A (zh)
WO (1) WO2021136528A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115171029A (zh) * 2022-09-09 2022-10-11 山东省凯麟环保设备股份有限公司 Instance segmentation method and system for urban scenes based on unmanned driving

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111192277A (zh) * 2019-12-31 2020-05-22 华为技术有限公司 Instance segmentation method and apparatus
CN112241955B (zh) * 2020-10-27 2023-08-25 平安科技(深圳)有限公司 Bone fragment segmentation method and apparatus for three-dimensional images, computer device, and storage medium
CN112765955B (zh) * 2021-01-22 2023-05-26 中国人民公安大学 Cross-modal instance segmentation method under Chinese referring expressions
CN112837330B (zh) * 2021-03-02 2024-05-10 中国农业大学 Leaf segmentation method based on a multi-scale dual-attention mechanism and a fully convolutional neural network
CN113255700B (zh) * 2021-06-10 2021-11-02 展讯通信(上海)有限公司 Method and apparatus for processing image feature maps, storage medium, and terminal
CN113824989B (zh) * 2021-07-13 2024-02-27 腾讯科技(深圳)有限公司 Video processing method and apparatus, and computer-readable storage medium
CN113792738A (zh) * 2021-08-05 2021-12-14 北京旷视科技有限公司 Instance segmentation method and apparatus, electronic device, and computer-readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120099771A1 (en) * 2010-10-20 2012-04-26 Lao Zhiqiang Computer aided detection of architectural distortion in mammography
CN106651877A (zh) * 2016-12-20 2017-05-10 北京旷视科技有限公司 Instance segmentation method and apparatus
CN109584248A (zh) * 2018-11-20 2019-04-05 西安电子科技大学 Infrared surface target instance segmentation method based on feature fusion and densely connected networks
CN109902702A (zh) * 2018-07-26 2019-06-18 华为技术有限公司 Object detection method and apparatus
CN111192277A (zh) * 2019-12-31 2020-05-22 华为技术有限公司 Instance segmentation method and apparatus

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019089192A1 (en) * 2017-11-03 2019-05-09 Siemens Aktiengesellschaft Weakly-supervised semantic segmentation with self-guidance
CN108345887B (zh) * 2018-01-29 2020-10-02 清华大学深圳研究生院 Training method for an image semantic segmentation model and image semantic segmentation method
CN109447169B (zh) * 2018-11-02 2020-10-27 北京旷视科技有限公司 Image processing method, training method and apparatus for its model, and electronic system
CN110532955B (zh) * 2019-08-30 2022-03-08 中国科学院宁波材料技术与工程研究所 Instance segmentation method and apparatus based on feature attention and sub-upsampling


Also Published As

Publication number Publication date
CN111192277A (zh) 2020-05-22
US20220335619A1 (en) 2022-10-20

Similar Documents

Publication Publication Date Title
WO2021136528A1 (zh) Instance segmentation method and apparatus
US20200250461A1 (en) Target detection method, apparatus, and system
WO2022134337A1 (zh) Face occlusion detection method, system, device, and storage medium
CN110163076B (zh) Image data processing method and related apparatus
CN110348294B (zh) Method, apparatus, and computer device for locating charts in PDF documents
CN107563372B (zh) License plate localization method based on the deep-learning SSD framework
CN108475331B (zh) Method, apparatus, system, and computer-readable medium for object detection
WO2022078041A1 (zh) Training method for an occlusion detection model and beautification method for face images
CN112508975A (zh) Image recognition method, apparatus, device, and storage medium
CN110163188B (zh) Method, apparatus, and device for video processing and for embedding a target object in a video
WO2022021029A1 (zh) Detection model training method and apparatus, detection model using method, and storage medium
US11113507B2 (en) System and method for fast object detection
WO2020151299A1 (zh) Yellow no-parking line recognition method and apparatus, computer device, and storage medium
CN113807361B (zh) Neural network, object detection method, neural network training method, and related products
CN112836625A (zh) Face liveness detection method and apparatus, and electronic device
CN114049512A (zh) Model distillation method, object detection method and apparatus, and electronic device
CN111767854B (zh) SLAM loop closure detection method combining scene text semantic information
CN113850136A (zh) Vehicle orientation recognition method and system based on YOLOv5 and BCNN
WO2023279799A1 (zh) Object recognition method and apparatus, and electronic system
Sharjeel et al. Real time drone detection by moving camera using COROLA and CNN algorithm
CN114972492A (zh) Pose determination method, device, and computer storage medium based on a bird's-eye view
WO2024001617A1 (zh) Method and apparatus for recognizing mobile-phone-use behavior
US20230089845A1 (en) Visual Localization Method and Apparatus
WO2020244076A1 (zh) Face recognition method and apparatus, electronic device, and storage medium
CN115512207A (zh) Single-stage object detection method based on multi-path feature fusion and high-order loss-aware sampling

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20909018

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20909018

Country of ref document: EP

Kind code of ref document: A1