WO2023066099A1 - Image matting processing - Google Patents

Image matting processing

Info

Publication number
WO2023066099A1
Authority
WO
WIPO (PCT)
Prior art keywords
probability
image
target image
semantic
matting
Prior art date
Application number
PCT/CN2022/124757
Other languages
English (en)
French (fr)
Inventor
程俊奇
四建楼
钱晨
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023066099A1 publication Critical patent/WO2023066099A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • The present disclosure relates to computer vision technology, and in particular to image matting processing.
  • As part of basic image editing technology, image matting is widely used in scenarios such as image editing software and camera back-end algorithms.
  • In view of this, embodiments of the present disclosure provide at least an image matting processing method and apparatus, an electronic device, and a storage medium.
  • In a first aspect, a matting processing method is provided. The method includes: performing semantic segmentation processing on a target image to obtain a semantic probability map corresponding to the target image, the semantic probability map including, for each pixel in the target image, a first probability that the pixel belongs to a target object, the target object being the foreground or background of the target image; performing probability conversion processing based on the semantic probability map to obtain a trimap, where for each pixel in the trimap, the value corresponding to the pixel represents the probability that the pixel belongs to any one of the foreground, the background, or a to-be-determined region in the target image; and performing matting processing according to the trimap and the target image to obtain a matting result.
  • In a second aspect, a network training method is provided for jointly training a semantic segmentation network and a matting network. The method includes: obtaining a training sample set, the training sample set including a plurality of sample data; for each sample data in the training sample set, processing the sample data to obtain a first image containing global image information of a sample image and a segmentation label corresponding to the first image, as well as a second image containing local image information of the sample image and a matting label corresponding to the second image; performing semantic segmentation processing on the first image through the semantic segmentation network to obtain a semantic probability map output by the semantic segmentation network; performing probability conversion processing based on the semantic probability map to obtain a trimap; performing matting processing on the trimap and the second image to obtain a matting result; and adjusting the network parameters of the semantic segmentation network according to the difference between the semantic probability map and the segmentation label, and adjusting the network parameters of the matting network based on the difference between the matting result and the matting label.
  • In a third aspect, an image matting processing apparatus is provided. The apparatus includes: a segmentation processing module, configured to perform semantic segmentation processing on a target image to obtain a semantic probability map corresponding to the target image, the semantic probability map including, for each pixel in the target image, a first probability that the pixel belongs to a target object, the target object being the foreground or background of the target image; a conversion processing module, configured to perform probability conversion processing based on the semantic probability map to obtain a trimap, where for each pixel in the trimap, the value corresponding to the pixel represents the probability that the pixel belongs to any one of the foreground, the background, or a to-be-determined region in the target image; and a matting processing module, configured to perform matting processing according to the trimap and the target image to obtain a matting result.
  • In a fourth aspect, a network training apparatus is provided for jointly training a semantic segmentation network and a matting network. The apparatus includes: a sample acquisition module, configured to obtain a training sample set, the training sample set including a plurality of sample data; a sample processing module, configured to process each sample data in the training sample set to obtain a first image containing global image information of a sample image and a segmentation label corresponding to the first image, as well as a second image containing local image information of the sample image and a matting label corresponding to the second image; a semantic segmentation module, configured to perform semantic segmentation processing on the first image through the semantic segmentation network to obtain a semantic probability map output by the semantic segmentation network; a conversion processing module, configured to perform probability conversion processing based on the semantic probability map to obtain a trimap; a matting processing module, configured to perform matting processing on the trimap and the second image to obtain a matting result; and a network adjustment module, configured to adjust the network parameters of the semantic segmentation network according to the difference between the semantic probability map and the segmentation label, and to adjust the network parameters of the matting network based on the difference between the matting result and the matting label.
  • In a fifth aspect, an electronic device is provided. The device includes a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the method of any embodiment of the present disclosure when executing the computer instructions.
  • In a sixth aspect, a computer-readable storage medium is provided, on which a computer program is stored, where the program, when executed by a processor, implements the method of any embodiment of the present disclosure.
  • In the image matting processing method and apparatus, electronic device, and storage medium provided by the embodiments of the present disclosure, a trimap is obtained by probability conversion from the semantic probability map produced by semantically segmenting the target image. This makes obtaining the trimap faster and more convenient: manual annotation is no longer needed, nor is training a prediction network with trimap annotations, so the matting process is simpler to implement. Moreover, because this probability-conversion approach is based on the semantic probability map from semantic segmentation, the generated trimap is more accurate, thereby achieving accurate and fast matting.
  • FIG. 1 shows a flowchart of a matting processing method provided by at least one embodiment of the present disclosure;
  • FIG. 2 shows a schematic diagram of a matting process provided by at least one embodiment of the present disclosure;
  • FIG. 3 is a flowchart of the matting process based on FIG. 2;
  • FIG. 4 shows a schematic diagram of a target image provided by at least one embodiment of the present disclosure;
  • FIG. 5 shows a schematic diagram of a semantic probability map provided by at least one embodiment of the present disclosure;
  • FIG. 6 shows a schematic diagram of a trimap provided by at least one embodiment of the present disclosure;
  • FIG. 7 shows a schematic diagram of transparency provided by at least one embodiment of the present disclosure;
  • FIG. 8 shows a schematic diagram of a foreground provided by at least one embodiment of the present disclosure;
  • FIG. 9 shows a schematic flowchart of a network training method provided by at least one embodiment of the present disclosure;
  • FIG. 10 shows a schematic structural diagram of a matting processing apparatus provided by at least one embodiment of the present disclosure;
  • FIG. 11 shows a schematic structural diagram of a network training apparatus provided by at least one embodiment of the present disclosure.
  • Matting may refer to extracting a target object from an image, where the target object may be the foreground or the background of the image.
  • For example, when the image is a person image, the person's hair may be extracted; or, when the image is a landscape image, the sky in the landscape image may be extracted.
  • The person's hair or the sky may be referred to as the target object to be extracted by the matting processing.
  • In one example, the purpose of the matting processing may be to perform special-effect rendering or object replacement on the target object after it is extracted from the image, for example, rendering the sky fiery red, or applying a hair-dyeing special effect to a person's hair.
  • A typical deep-network-based image matting algorithm may take the image to be processed and a trimap as the input of the deep network. Because the trimap serves as a guide that indicates the transparency of some pixels in the input image, this approach can often obtain a finer matting result.
  • However, in the related art, the trimap is either provided by user annotation or predicted by a trimap prediction network.
  • User annotation is too complicated and inconvenient for users, while network prediction of the trimap requires a large number of trimap annotations, whose application is narrow and of limited significance.
  • To address the complexity of obtaining a trimap in the related art, embodiments of the present disclosure provide a matting processing method that requires neither manual trimap annotation by the user nor pre-training of a prediction network for predicting the trimap; instead, the trimap can be obtained from the result of semantic segmentation combined with probability conversion. Referring to the flow of the matting processing method shown in FIG. 1, the method may include the following processing:
  • In step 100, semantic segmentation processing is performed on the target image to be processed, obtaining a semantic probability map corresponding to the target image.
  • The image to be subjected to matting processing may be referred to as the target image.
  • For example, assuming that a person's hair and face are to be extracted from a person image, the person image may be called the target image.
  • The person's hair and face are the targets to be extracted by the matting processing and may be called the target object.
  • In this embodiment, semantic segmentation processing may be performed on the target image, for example, through a semantic segmentation network.
  • The semantic segmentation network includes but is not limited to commonly used semantic segmentation networks such as SegNet, U-Net, DeepLab, and FCN.
  • After the semantic segmentation processing, the semantic probability map of the target image can be obtained. The semantic probability map may include, for each pixel in the target image, a first probability that the pixel belongs to the target object, where the target object may be the foreground or background of the target image.
  • Taking extraction of the foreground as an example, the semantic probability map may indicate that the probability that one pixel in the target image belongs to the foreground is 0.85, while the probability that another pixel belongs to the foreground is 0.24.
  • In step 102, probability conversion processing is performed based on the semantic probability map to obtain a trimap.
  • In this step, probability conversion processing may be performed based on the result of the semantic segmentation processing to obtain the trimap.
  • The trimap obtained through the probability conversion processing in this embodiment may be denoted soft-trimap.
  • The probability conversion processing may map, through a mathematical conversion, the probability corresponding to each pixel in the semantic probability map to the value corresponding to that pixel in the soft-trimap.
  • Specifically, the probabilities in the semantic probability map can be converted in the following two parts:
  • 1) Based on the semantic probability map, the first probability is converted to obtain a second probability.
  • The trimap soft-trimap may include three kinds of regions: "determined foreground", "determined background", and "to-be-determined region".
  • In this embodiment, the probability that a pixel belongs to the to-be-determined region of the trimap may be referred to as the second probability.
  • When converting the first probability, with which a pixel belongs to the target object, into the second probability, the following conversion principle can be followed: the higher the probability, represented by the first probability, that the pixel belongs to the foreground or the background, the lower the probability, represented by the second probability, that the pixel belongs to the to-be-determined region of the trimap. For example, the closer the first probability is to 1 or 0, the closer the second probability is to 0; the closer the first probability is to 0.5, the closer the second probability is to 1.
  • The rationale is that if a pixel in the image has a high probability of belonging to the foreground, or a high probability of belonging to the background, then its probability of belonging to the to-be-determined region is low; whereas when the pixel's probability of belonging to the foreground or background is near 0.5, it is more uncertain whether the pixel belongs to the foreground or the background, so its probability of belonging to the to-be-determined region is higher.
  • Based on this conversion principle, the first probability can be converted to obtain the second probability.
  • The embodiments of the present disclosure do not limit the specific formula of the probability conversion; the following is only an example:
  • un = -k4*score^4 + k3*score^3 - k2*score^2 + k1*score .......(1)
  • In formula (1), un represents the second probability that the pixel belongs to the to-be-determined region, and score represents the first probability, in the semantic probability map, that the pixel belongs to the target object. Formula (1) is a polynomial fit: through polynomial fitting, the first probability of the pixel is fitted to obtain the second probability. This embodiment does not limit the specific values of the coefficients k1/k2/k3/k4.
  • Using polynomial fitting to convert the first probability into the second probability makes the conversion computationally efficient while accurately reflecting the conversion principle above.
  • 2) The trimap is generated according to the first probability and the second probability corresponding to each pixel in the semantic probability map.
  • As above, a semantic probability map can be obtained by performing semantic segmentation processing on the target image, and it roughly distinguishes the foreground and background of the target image. Taking extraction of the foreground as an example, if the first probability that a pixel belongs to the foreground is 0.96, the pixel is very likely foreground; if the first probability is 0.14, the pixel is very likely background.
  • After the second probability is obtained based on the semantic probability map, the second probability that each pixel belongs to the to-be-determined region is available.
  • For each pixel in the target image, the first probability corresponding to the pixel in the semantic probability map and the second probability that the pixel belongs to the to-be-determined region can be fused, yielding the value corresponding to the pixel in the trimap soft-trimap, which can represent the probability that the pixel belongs to any one of the determined foreground, the determined background, or the to-be-determined region of the target image.
  • For example, in the soft-trimap, the closer a pixel's value is to 1, the more likely the pixel belongs to the foreground of the target image; the closer the value is to 0, the more likely the pixel belongs to the background; and the closer the value is to 0.5, the more likely the pixel belongs to the to-be-determined region. That is, the value corresponding to a pixel in the soft-trimap expresses the probability that the pixel belongs to any one of the foreground, the background, or the to-be-determined region.
  • The following formula (2) exemplifies a way of performing probability fusion according to the first probability and the second probability of a pixel to obtain the pixel's value in the trimap:
  • soft_trimap = -k5*un/k6*sign(score-k7) + (sign(score-k7)+k8)/k9 .......(2)
  • In formula (2), soft_trimap represents the value corresponding to the pixel in the soft-trimap, un represents the second probability, score represents the first probability, and sign() is the sign function. Likewise, this embodiment does not limit the specific values of the coefficients k5 to k9.
  • In some embodiments, before the above probability conversion processing based on the semantic probability map, pooling processing may be performed on the semantic probability map, and the probability conversion processing is then performed on the pooled semantic probability map. See formula (3) below:
  • score_ = avgpool2d(score, ks, stride) .......(3)
  • As shown in formula (3), in one example, average pooling can be performed on the semantic probability map according to the convolution stride and the kernel size (kernel_size, ks). score_ represents the pooled semantic probability map, which contains the pooled probabilities.
  • If the semantic probability map is pooled, then score in formulas (1) and (2) above is replaced with the pooled probability; that is, the pooled semantic probability map is used to perform the probability conversion.
  • The kernel size used in the pooling can be adjusted, and performing pooling before the probability conversion of the semantic probability map helps adjust, via the kernel size, the width of the to-be-determined region in the soft_trimap to be generated.
  • In some embodiments, assuming the semantic segmentation processing is performed by a semantic segmentation network, the image size of the target image may also be preprocessed before the semantic segmentation processing. The preprocessing may adjust the image size to an integer multiple of the network's downsampling factor, so that the processed image size is divisible by the downsampling factor scale_factor, where scale_factor is the semantic segmentation network's downsampling factor for the target image and its specific value is determined by the network structure of the semantic segmentation network.
  • In step 104, matting processing is performed according to the trimap and the target image to obtain a matting result.
  • In this step, the matting processing may include: taking the trimap and the target image as the input of a matting network to obtain an object residual in the target image output by the matting network (for example, a foreground residual or a background residual, where the foreground residual may indicate the difference between the predicted foreground pixel value and the pixel value of the corresponding pixel of the target image, and the background residual may indicate the difference between the predicted background pixel value and the pixel value of the corresponding pixel of the target image), as well as an initial transparency of the target image.
  • Then, the target object in the target image can be obtained based on the target image and the object residual (for example, the foreground can be obtained based on the foreground residual and the target image, or the background can be obtained based on the background residual and the target image), and the transparency of the target image can be obtained according to the initial transparency and the trimap soft_trimap.
  • In the matting processing method of this embodiment, the trimap is obtained by probability conversion from the semantic probability map produced by semantically segmenting the target image. This makes obtaining the trimap faster and more convenient: manual annotation is no longer needed, nor is training a prediction network with trimap annotations, so the matting process is simpler to implement. Moreover, because this probability-conversion approach is based on the semantic probability map from semantic segmentation, the generated trimap is more accurate, thereby achieving accurate and fast matting.
  • The matting processing method of the embodiments of the present disclosure can be applied to a mobile terminal.
  • Considering the processing capability of the mobile terminal, another embodiment of the present disclosure may miniaturize the network deployed to the mobile terminal and scale the size of the target image, so that the running time and memory consumption stay within the mobile terminal's capacity.
  • An example of matting on the mobile terminal is described below.
  • When performing matting, a semantic segmentation network and a matting network may be used.
  • The semantic segmentation network may be a network such as SegNet or U-Net, and the matting network may include an encoder and a decoder.
  • The encoder of the matting network may adopt the structure of mobv2 (MobileNetV2). Before the matting network is deployed to the mobile terminal, channel compression may be performed on it: the number of channels of the matting network's intermediate features (i.e., the features of the network's middle layers) is compressed. For example, the number of output channels of the convolution kernels used in the matting network can be reduced; assuming a convolution kernel originally has a output channels, compression by a factor of 0.35 gives 0.35*a output channels after compression.
  • FIG. 2 illustrates a schematic diagram of a matting process provided by an embodiment of the present disclosure, and FIG. 3 is a flowchart of the matting process based on FIG. 2. With reference to FIG. 2 and FIG. 3, the process may include the following steps, where this embodiment is described taking the target object being the foreground as an example:
  • In step 300, scaling processing is performed on the target image.
  • For example, the target image in this embodiment may be a person image; see FIG. 4.
  • The person image may be captured by the camera of the user's own mobile terminal, or may be an image stored on the mobile terminal or received from another device.
  • The purpose of the matting processing in this embodiment may be to extract the hair and face regions in the person image.
  • The person's hair and face in the target image can be taken as the foreground.
  • Since the matting processing in this embodiment is executed on the mobile terminal, the target image can be scaled to reduce the processing burden and computation on the mobile terminal. Assuming the size of the target image in FIG. 4 is 1080*1920, the image can be scaled to a size of 480*288, for example by bilinear interpolation, with reference to the following formula (4) and formula (5):
  • scale = max(h/basesize, w/basesize) .......(4)
  • new_h = int(h/scale + k10)  new_w = int(w/scale + k11) .......(5)
  • Here, h and w are the height and width of the target image, basesize is the base size (480 in this example), and int(x) means rounding x to an integer. new_h and new_w are the scaled dimensions of the target image; the specific values of the coefficients in formula (5) are not limited in this embodiment.
  • In addition, according to formula (6) and formula (7), the image size can be further processed to an integer multiple of the downsampling factor, so that the scaled image size is divisible by the semantic segmentation network's downsampling factor scale_factor. It can be understood that other formulas may also be used for this integer-multiple processing; it is not limited to the following two formulas.
  • new_h = int(int(int(new_h - k12 + scale_factor - k13)/scale_factor)*scale_factor) .......(6)
  • new_w = int(int(int(new_w - k14 + scale_factor - k15)/scale_factor)*scale_factor) .......(7)
  • This embodiment does not limit the specific values of the coefficients in formula (6) and formula (7); for example, k12 to k15 may all be set to 1. If the original target image before scaling is denoted A, then the target image obtained by scaling to a 480*288 image and then normalizing it can be denoted B. Referring to FIG. 2, the target image B is the target image after scaling.
  • In step 302, semantic segmentation processing is performed on the scaled target image through the semantic segmentation network to obtain the semantic probability map output by the semantic segmentation network.
  • For example, as shown in FIG. 2, semantic segmentation processing can be performed on the target image B through the semantic segmentation network 21 to obtain the semantic probability map 22 output by the semantic segmentation network.
  • The semantic probability map can be denoted score, and FIG. 5 illustrates this score. It can be seen that the score of the semantic probability map roughly distinguishes the foreground and background of the image based on the probability that each pixel belongs to the foreground.
  • In step 304, probability conversion processing is performed based on the semantic probability map to obtain the trimap.
  • In this step, the soft-trimap can be generated according to the probability conversion processing described in the flow of FIG. 1.
  • For example, the semantic probability map can first be pooled according to formula (3), and then the pooled semantic probability map can undergo probability conversion processing according to formula (1) and formula (2) to generate the trimap. See the trimap 23 in FIG. 2.
  • Referring to FIG. 6, which illustrates the soft-trimap, the probability value of each pixel in the soft-trimap can indicate the probability that the pixel belongs to the three kinds of regions, and the image is roughly divided according to these probability values into "foreground", "background", and "to-be-determined region".
  • In step 306, the trimap and the target image are taken as inputs of the matting network to obtain the foreground residual and the initial transparency output by the matting network.
  • As shown in FIG. 2, the trimap 23 and the target image B can both be taken as inputs of the matting network 24, and the matting network can output a 4-channel result, where the result of one channel is the initial transparency raw_alpha and the result of the other three channels is the foreground residual fg_res.
  • The first result 25 output by the matting network in FIG. 2 may include "raw_alpha+fg_res".
  • In step 308, the foreground of the target image is obtained based on the target image and the foreground residual, and the transparency is obtained according to the initial transparency and the trimap.
  • Continuing with FIG. 2, the foreground residual fg_res can be enlarged by bilinear interpolation back to the scale of the target image before scaling, and then formula (8) is executed:
  • FG = clip(A + fg_res, s1, s2) .......(8)
  • As shown in FIG. 2, the foreground FG of the target image can be obtained from the enlarged foreground residual fg_res and the target image A. Here, clip(x, s1, s2) limits the value of x to [s1, s2]. This embodiment does not limit the specific values of s1 and s2 in formula (8); for example, s1 may be 0 and s2 may be 1.
  • In addition, the transparency can be computed according to the following formula (9) and formula (10):
  • fs = clip((soft_trimap - s3)/s4, s5, s6) .......(9)
  • Alpha = clip(fs + un*raw_alpha, s7, s8) .......(10)
  • Here, Alpha represents the transparency. After Alpha is obtained, it can be enlarged by bilinear interpolation back to the original size of the target image before scaling. Likewise, this embodiment does not limit the specific values of the coefficients s3 to s8 in formula (9) and formula (10).
  • FIG. 7 shows Alpha, and FIG. 8 shows the foreground FG of the final target image.
  • After the foreground and transparency are obtained, image editing can continue based on them; for example, the image editing can be foreground replacement and/or foreground rendering of the image.
  • In the matting processing method of the embodiments of the present disclosure, by performing channel compression and other processing on the matting model and scaling the target image, the matting becomes better suited to the mobile terminal.
  • For example, when a user captures an image with his or her own mobile terminal, the hair matting can be completed directly on the mobile terminal: the hair is extracted and then dyed, so that these processes are performed locally on the mobile terminal without uploading to the cloud, which improves data security and privacy protection.
  • In addition, this matting method takes a single target image as input and directly obtains the matting result; that is, providing one target image is sufficient for the method provided by the embodiments of the present disclosure to obtain a prediction of the foreground in the target image. The input information required is small, which makes the matting more convenient.
  • FIG. 9 shows a network training method provided by at least one embodiment of the present disclosure, which can be used for joint training of a semantic segmentation network and a matting network. As shown in FIG. 9, the method may include the following processing:
  • In step 900, a training sample set is obtained, the training sample set including a plurality of sample data.
  • Each sample data in the training sample set may include a sample image, a first feature label corresponding to the sample image, and a second feature label corresponding to the sample image.
  • The first feature label may be a segmentation label of the sample image, and the second feature label may be a matting label of the sample image.
  • In step 902, for each sample data in the training sample set, the sample data is processed to obtain a first image containing global image information of the sample image together with a segmentation label corresponding to the first image, and a second image containing local image information of the sample image together with a matting label corresponding to the second image.
  • For example, first processing can be performed on the sample image of the sample data to obtain the first image containing most of the image information of the sample image; the first image can be considered to contain the global image information of the sample image.
  • At the same time, the same first processing is performed on the first feature label corresponding to the sample image to obtain the segmentation label corresponding to the first image.
  • For example, the sample image can be scaled according to the semantic segmentation network's size requirements for the input image while still retaining most of the image information of the sample image, obtaining the first image, and the same scaling is performed on the first feature label to obtain the segmentation label.
  • Similarly, second processing is performed on the sample image of the sample data to obtain the second image containing local image information of the sample image, and the same second processing is performed on the second feature label corresponding to the sample image to obtain the corresponding matting label.
  • For example, the sample image may be partially cropped to obtain the second image containing local image information of the sample image, and the same partial cropping may be performed on the second feature label to obtain the matting label, as in the sketch below.
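  • For illustration only, the sample preparation of step 902 might look like the following minimal sketch; the target sizes, the random-crop policy, and the nearest-neighbor resizing of the segmentation label are assumptions, not details fixed by the disclosure.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def prepare_sample(image, seg_label, matte_label,
                   seg_size=(288, 480), crop_size=256):
    """Build the first (global) and second (local) training views, applying
    the same transform to each image and its matching label."""
    # First image: resizing keeps most of the global image information.
    first = TF.resize(image, list(seg_size))
    first_label = TF.resize(seg_label, list(seg_size),
                            interpolation=InterpolationMode.NEAREST)
    # Second image: a random local crop, applied identically to the matte label.
    _, h, w = image.shape  # assumes a CHW tensor larger than crop_size
    top = random.randint(0, h - crop_size)
    left = random.randint(0, w - crop_size)
    second = TF.crop(image, top, left, crop_size, crop_size)
    second_label = TF.crop(matte_label, top, left, crop_size, crop_size)
    return (first, first_label), (second, second_label)
```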
  • In step 904, semantic segmentation processing is performed on the first image through the semantic segmentation network to obtain the semantic probability map output by the semantic segmentation network.
  • In step 906, probability conversion processing is performed based on the semantic probability map to obtain a trimap.
  • In step 908, matting processing is performed based on the trimap and the second image through the matting network to obtain a matting result.
  • In step 910, the network parameters of the semantic segmentation network are adjusted according to the difference between the semantic probability map and the segmentation label, and the network parameters of the matting network are adjusted based on the difference between the matting result and the matting label.
  • In the above training, the first image containing global image information and the first label are used to train the first sub-network, and the second image containing local image information and the second label are used to train the second sub-network, which improves the joint training effect and reduces the risk of degraded network performance.
  • In addition, the soft-trimap is generated by probability conversion processing, which can assist the network training to some extent and yield a better effect.
  • Moreover, the soft-trimap can be adaptively adjusted during network training. For example, while the network parameters of the semantic segmentation network are adjusted according to the difference between the semantic probability map and the segmentation label, and the network parameters of the matting network are adjusted based on the difference between the matting result and the matting label, the parameters of the semantic segmentation network are updated, and thus the semantic probability map output by the semantic segmentation network is updated as well.
  • Since the soft-trimap is generated based on the semantic probability map, an update of the semantic probability map brings an update of the trimap soft-trimap, and the matting result is updated in turn. That is, network training usually iterates multiple times, and after each iteration, if the parameters of the semantic segmentation network are updated, then even for the same input image the semantic probability map, soft-trimap, and matting result adaptively update, and the network parameters continue to be adjusted according to the updated results. This adaptive adjustment of the soft-trimap helps the generated soft-trimap and matting result be dynamically optimized along with the adjustment of the semantic segmentation network, so that the finally trained model performs better and extracts the target object from the target image more accurately.
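  • One joint update could be organized as in the sketch below. The loss choices, the way the trimap is concatenated with the second image, and the helpers probability_conversion (formulas (1) to (3)) and crop_like (aligning the soft-trimap with the locally cropped second image) are hypothetical assumptions layered on the flow of steps 904 to 910; the key point shown is that the soft-trimap is regenerated from the current segmentation output on every iteration, so it adapts as the segmentation parameters are updated.

```python
import torch

def joint_training_step(seg_net, mat_net, optimizer, batch,
                        seg_loss_fn=torch.nn.functional.binary_cross_entropy,
                        mat_loss_fn=torch.nn.functional.l1_loss):
    """One joint update of the segmentation and matting networks.
    `probability_conversion` and `crop_like` are hypothetical helpers."""
    (first, seg_label), (second, matte_label) = batch
    score = seg_net(first)                        # step 904: semantic probability map
    soft_trimap = probability_conversion(score)   # step 906: formulas (1)-(3)
    trimap_local = crop_like(soft_trimap, second)
    matte = mat_net(torch.cat([second, trimap_local], dim=1))  # step 908
    loss = seg_loss_fn(score, seg_label) + mat_loss_fn(matte, matte_label)
    optimizer.zero_grad()                         # step 910: adjust both networks
    loss.backward()
    optimizer.step()
    return loss.item()
```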
  • FIG. 10 illustrates a matting processing apparatus, which can be applied to implement the matting processing method of any embodiment of the present disclosure.
  • The apparatus may include: a segmentation processing module 1001, a conversion processing module 1002, and a matting processing module 1003.
  • The segmentation processing module 1001 is configured to perform semantic segmentation processing on the target image to obtain a semantic probability map corresponding to the target image, the semantic probability map including, for each pixel in the target image, a first probability that the pixel belongs to the target object, the target object being the foreground or background of the target image.
  • The conversion processing module 1002 is configured to perform probability conversion processing based on the semantic probability map to obtain a trimap, where for each pixel in the trimap, the value corresponding to the pixel represents the probability that the pixel belongs to any one of the foreground, the background, or the to-be-determined region in the target image.
  • The matting processing module 1003 is configured to perform matting processing according to the trimap and the target image to obtain a matting result.
  • In one example, when performing probability conversion processing based on the semantic probability map to obtain the trimap, the conversion processing module 1002 is configured to: for each pixel in the semantic probability map, perform probability conversion based on the pixel's first probability to obtain a second probability that the pixel belongs to the to-be-determined region of the trimap; and generate the trimap according to the first probability and the second probability of each pixel in the semantic probability map.
  • In one example, when performing the probability conversion for each pixel in the semantic probability map based on the pixel's first probability to obtain the second probability that the pixel belongs to the to-be-determined region of the trimap, the conversion processing module 1002 is configured to fit the pixel's first probability by means of polynomial fitting to obtain the second probability.
  • In one example, when performing matting processing according to the trimap and the target image to obtain the matting result, the matting processing module 1003 is configured to: perform matting processing on the trimap and the target image to obtain an object residual and an initial transparency of the target image; obtain the target object in the target image based on the target image and the object residual; and determine the transparency of the target image according to the initial transparency and the trimap.
  • In one example, the segmentation processing module 1001 is further configured to perform scaling processing on the target image before the semantic segmentation processing; and when obtaining the target object in the target image based on the target image and the object residual, the matting processing module 1003 is configured to: enlarge the object residual back to the scale of the target image before scaling; and obtain the target object in the target image according to the enlarged object residual and the target image.
  • In one example, when performing semantic segmentation processing on the target image to obtain the semantic probability map corresponding to the target image, the segmentation processing module 1001 is configured to perform the semantic segmentation processing through a semantic segmentation network to obtain the semantic probability map output by the semantic segmentation network; and when performing matting processing according to the trimap and the target image, the matting processing module 1003 is configured to perform the matting processing through a matting network according to the trimap and the target image.
  • In one example, the matting network is a channel-compressed network, where the channel compression compresses the number of channels of the matting network's intermediate features.
  • In one example, the higher the probability, indicated by a pixel's first probability, that the pixel belongs to the foreground or background, the lower the probability, indicated by the corresponding second probability obtained through probability conversion, that the pixel belongs to the to-be-determined region; and when generating the trimap according to the first probability and the second probability of each pixel, the conversion processing module 1002 is configured to: for each pixel in the target image, perform probability fusion according to the first probability and the second probability corresponding to the pixel, and determine the value corresponding to the pixel in the trimap.
  • In one example, the conversion processing module 1002 is further configured to perform pooling processing on the semantic probability map before performing the probability conversion processing based on the semantic probability map to obtain the trimap, obtaining a pooled semantic probability map; performing the probability conversion processing based on the semantic probability map then includes: performing the probability conversion processing on the pooled semantic probability map.
  • In one example, the segmentation processing module 1001 is further configured to, before the semantic segmentation processing is performed on the target image, process the image size of the target image to an integer multiple of the semantic segmentation network's downsampling factor for the target image, so that the processed image size is divisible by the downsampling factor.
  • In one example, the matting result includes the transparency of the target image and the target object; the matting processing module 1003 is further configured to perform object replacement and/or object rendering according to the target object and the transparency.
  • FIG. 11 illustrates a network training apparatus, which can be applied to implement the network training method of any embodiment of the present disclosure; the apparatus is used for joint training of a semantic segmentation network and a matting network.
  • The apparatus may include: a sample acquisition module 1101, a sample processing module 1102, a semantic segmentation module 1103, a conversion processing module 1104, a matting processing module 1105, and a network adjustment module 1106.
  • The sample acquisition module 1101 is configured to acquire a training sample set, the training sample set including a plurality of sample data.
  • The sample processing module 1102 is configured to process each sample data in the training sample set to obtain a first image containing global image information of the sample image together with a segmentation label corresponding to the first image, and a second image containing local image information of the sample image together with a matting label corresponding to the second image.
  • The semantic segmentation module 1103 is configured to perform semantic segmentation processing on the first image through the semantic segmentation network to obtain a semantic probability map output by the semantic segmentation network.
  • The conversion processing module 1104 is configured to perform probability conversion processing based on the semantic probability map to obtain a trimap.
  • The matting processing module 1105 is configured to perform matting processing on the trimap and the second image to obtain a matting result.
  • The network adjustment module 1106 is configured to adjust the network parameters of the semantic segmentation network according to the difference between the semantic probability map and the segmentation label, and to adjust the network parameters of the matting network based on the difference between the matting result and the matting label.
  • The present disclosure also provides an electronic device, which includes a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the method of any embodiment of the present disclosure when executing the computer instructions.
  • The present disclosure also provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the image matting processing method or the network training method described in any embodiment of the present disclosure.
  • One or more embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
  • An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program may be stored, where the program, when executed by a processor, implements the steps of the image matting processing method or the network training method described in any embodiment of the present disclosure.
  • Herein, "and/or" means at least one of the two; for example, "A and/or B" includes three options: A, B, and "A and B".
  • The embodiments of the present disclosure may relate to the field of augmented reality.
  • For example, the target object may involve faces, limbs, gestures, and actions related to the human body, or markers related to objects, or sand tables, display areas, or display items related to venues or places.
  • Vision-related algorithms may involve visual positioning, SLAM, 3D reconstruction, image registration, background segmentation, object keypoint extraction and tracking, and object pose or depth detection.
  • Specific applications may involve not only interactive scenarios such as guided tours, navigation, explanation, reconstruction, virtual-effect overlay, and display related to real scenes or objects, but also person-related special-effect processing, such as interactive scenarios of makeup beautification, body beautification, special-effect display, and virtual model display.
  • The relevant features, states, and attributes of the target object can be detected or identified through a convolutional neural network, which is a network model obtained by model training based on a deep learning framework.
  • Embodiments of the subject matter and functional operations described in this disclosure can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this disclosure and their structural equivalents, or in a combination of one or more of them.
  • Embodiments of the subject matter described in this disclosure can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus.
  • Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by the data processing apparatus.
  • A computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • The processes and logic flows described in this disclosure can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output.
  • The processes and logic flows can also be performed by, and apparatus can also be implemented as, special-purpose logic circuitry, such as an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
  • Computers suitable for the execution of a computer program include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit.
  • Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory.
  • The essential elements of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • Generally, a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, to receive data from them, transmit data to them, or both.
  • However, a computer is not required to have such devices.
  • In addition, a computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks.
  • The processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure provide an image matting processing method and apparatus, an electronic device, and a storage medium. The method includes: performing semantic segmentation processing on a target image to obtain a semantic probability map corresponding to the target image, the semantic probability map including, for each pixel in the target image, a first probability that the pixel belongs to a target object, the target object being the foreground or background of the target image; performing probability conversion processing based on the semantic probability map to obtain a trimap, where for each pixel in the trimap, the value corresponding to the pixel represents the probability that the pixel belongs to any one of the foreground, the background, or a to-be-determined region in the target image; and performing matting processing according to the trimap and the target image to obtain a matting result.

Description

Image Matting Processing
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese patent application No. CN2021112120678, filed with the Chinese Patent Office on October 18, 2021, the entire contents of which are incorporated into the present disclosure by reference.
TECHNICAL FIELD
The present disclosure relates to computer vision technology, and in particular to image matting processing.
BACKGROUND
As part of basic image editing technology, image matting is widely used in scenarios such as image editing software and camera back-end algorithms.
SUMMARY
In view of this, embodiments of the present disclosure provide at least an image matting processing method and apparatus, an electronic device, and a storage medium.
In a first aspect, a matting processing method is provided. The method includes: performing semantic segmentation processing on a target image to obtain a semantic probability map corresponding to the target image, the semantic probability map including, for each pixel in the target image, a first probability that the pixel belongs to a target object, the target object being the foreground or background of the target image; performing probability conversion processing based on the semantic probability map to obtain a trimap, where for each pixel in the trimap, the value corresponding to the pixel represents the probability that the pixel belongs to any one of the foreground, the background, or a to-be-determined region in the target image; and performing matting processing according to the trimap and the target image to obtain a matting result.
In a second aspect, a network training method is provided for jointly training a semantic segmentation network and a matting network. The method includes: obtaining a training sample set, the training sample set including a plurality of sample data; for each sample data in the training sample set, processing the sample data to obtain a first image containing global image information of a sample image and a segmentation label corresponding to the first image, as well as a second image containing local image information of the sample image and a matting label corresponding to the second image; performing semantic segmentation processing on the first image through the semantic segmentation network to obtain a semantic probability map output by the semantic segmentation network; performing probability conversion processing based on the semantic probability map to obtain a trimap; performing matting processing on the trimap and the second image to obtain a matting result; and adjusting the network parameters of the semantic segmentation network according to the difference between the semantic probability map and the segmentation label, and adjusting the network parameters of the matting network based on the difference between the matting result and the matting label.
In a third aspect, an image matting processing apparatus is provided. The apparatus includes: a segmentation processing module, configured to perform semantic segmentation processing on a target image to obtain a semantic probability map corresponding to the target image, the semantic probability map including, for each pixel in the target image, a first probability that the pixel belongs to a target object, the target object being the foreground or background of the target image; a conversion processing module, configured to perform probability conversion processing based on the semantic probability map to obtain a trimap, where for each pixel in the trimap, the value corresponding to the pixel represents the probability that the pixel belongs to any one of the foreground, the background, or a to-be-determined region in the target image; and a matting processing module, configured to perform matting processing according to the trimap and the target image to obtain a matting result.
In a fourth aspect, a network training apparatus is provided for jointly training a semantic segmentation network and a matting network. The apparatus includes: a sample acquisition module, configured to obtain a training sample set, the training sample set including a plurality of sample data; a sample processing module, configured to process each sample data in the training sample set to obtain a first image containing global image information of a sample image and a segmentation label corresponding to the first image, as well as a second image containing local image information of the sample image and a matting label corresponding to the second image; a semantic segmentation module, configured to perform semantic segmentation processing on the first image through the semantic segmentation network to obtain a semantic probability map output by the semantic segmentation network; a conversion processing module, configured to perform probability conversion processing based on the semantic probability map to obtain a trimap; a matting processing module, configured to perform matting processing on the trimap and the second image to obtain a matting result; and a network adjustment module, configured to adjust the network parameters of the semantic segmentation network according to the difference between the semantic probability map and the segmentation label, and to adjust the network parameters of the matting network based on the difference between the matting result and the matting label.
In a fifth aspect, an electronic device is provided. The device includes a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the method of any embodiment of the present disclosure when executing the computer instructions.
In a sixth aspect, a computer-readable storage medium is provided, on which a computer program is stored, where the program, when executed by a processor, implements the method of any embodiment of the present disclosure.
In the image matting processing method and apparatus, electronic device, and storage medium provided by the embodiments of the present disclosure, a trimap is obtained by probability conversion from the semantic probability map produced by semantically segmenting the target image. This makes obtaining the trimap faster and more convenient: manual annotation is no longer needed, nor is training a prediction network with trimap annotations, so the matting process is simpler to implement. Moreover, because this probability-conversion approach is based on the semantic probability map from semantic segmentation, the generated trimap is more accurate, thereby achieving accurate and fast matting.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in one or more embodiments of the present disclosure or the related art more clearly, the following briefly introduces the drawings needed in the description of the embodiments or the related art. Obviously, the drawings described below are merely some embodiments recorded in one or more embodiments of the present disclosure, and a person of ordinary skill in the art may obtain other drawings from these drawings without creative effort.
FIG. 1 shows a flowchart of a matting processing method provided by at least one embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of a matting process provided by at least one embodiment of the present disclosure;
FIG. 3 is a flowchart of the matting process based on FIG. 2;
FIG. 4 shows a schematic diagram of a target image provided by at least one embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a semantic probability map provided by at least one embodiment of the present disclosure;
FIG. 6 shows a schematic diagram of a trimap provided by at least one embodiment of the present disclosure;
FIG. 7 shows a schematic diagram of transparency provided by at least one embodiment of the present disclosure;
FIG. 8 shows a schematic diagram of a foreground provided by at least one embodiment of the present disclosure;
FIG. 9 shows a schematic flowchart of a network training method provided by at least one embodiment of the present disclosure;
FIG. 10 shows a schematic structural diagram of a matting processing apparatus provided by at least one embodiment of the present disclosure;
FIG. 11 shows a schematic structural diagram of a network training apparatus provided by at least one embodiment of the present disclosure.
DETAILED DESCRIPTION
To enable those skilled in the art to better understand the technical solutions in one or more embodiments of the present disclosure, the technical solutions in one or more embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in one or more embodiments of the present disclosure. Obviously, the described embodiments are merely some rather than all of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on one or more embodiments of the present disclosure without creative effort shall fall within the protection scope of the present disclosure.
Matting may refer to extracting a target object from an image, where the target object may be the foreground or the background of the image. For example, when the image is a person image, the person's hair may be extracted; or, when the image is a landscape image, the sky in the landscape image may be extracted. The person's hair or the sky may be referred to as the target object to be extracted by the matting processing. In one example, the purpose of the matting processing may be to perform special-effect rendering or object replacement on the target object after it is extracted from the image, for example, rendering the sky fiery red, or applying a hair-dyeing special effect to a person's hair.
A typical deep-network-based image matting algorithm may take the image to be processed and a trimap as the input of the deep network. Because the trimap serves as a guide that indicates the transparency of some pixels in the input image, this approach can often obtain a finer matting result. However, in the related art, the trimap is either provided by user annotation or predicted by a trimap prediction network. User annotation is too complicated and inconvenient for users, while network prediction of the trimap requires a large number of trimap annotations, whose application is narrow and of limited significance.
To address the complexity of obtaining a trimap in the related art, embodiments of the present disclosure provide a matting processing method that requires neither manual trimap annotation by the user nor pre-training of a prediction network for predicting the trimap; instead, the trimap can be obtained from the result of semantic segmentation combined with probability conversion. Referring to the flow of the matting processing method shown in FIG. 1, the method may include the following processing:
In step 100, semantic segmentation processing is performed on the target image to be processed, obtaining a semantic probability map corresponding to the target image.
The image to be subjected to matting processing may be referred to as the target image. For example, assuming that a person's hair and face are to be extracted from a person image, the person image may be called the target image. The person's hair and face are the targets to be extracted by the matting processing and may be called the target object.
In this embodiment, semantic segmentation processing may be performed on the target image, for example, through a semantic segmentation network. The semantic segmentation network includes but is not limited to commonly used semantic segmentation networks such as SegNet, U-Net, DeepLab, and FCN.
After the semantic segmentation processing, the semantic probability map of the target image can be obtained. The semantic probability map may include, for each pixel in the target image, a first probability that the pixel belongs to the target object, where the target object may be the foreground or background of the target image. Taking extraction of the foreground as an example, the semantic probability map may indicate that the probability that one pixel in the target image belongs to the foreground is 0.85, while the probability that another pixel belongs to the foreground is 0.24.
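For orientation only, the following is a minimal sketch of how such a semantic probability map could be read off a two-class segmentation network's output; the softmax step and the channel layout (channel 1 taken as the foreground/target-object class) are assumptions for illustration, not details fixed by the disclosure.

```python
import torch
import torch.nn.functional as F

def semantic_probability_map(seg_logits: torch.Tensor) -> torch.Tensor:
    """Turn two-class segmentation logits (N, 2, H, W) into a per-pixel
    probability map `score` in [0, 1], shaped (N, 1, H, W)."""
    probs = F.softmax(seg_logits, dim=1)  # per-pixel class probabilities
    return probs[:, 1:2]                  # channel 1 assumed to be the target object
```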
In step 102, probability conversion processing is performed based on the semantic probability map to obtain a trimap.
In this step, probability conversion processing may be performed based on the result of the semantic segmentation processing to obtain the trimap. The trimap obtained through the probability conversion processing in this embodiment may be denoted soft-trimap.
The probability conversion processing may map, through a mathematical conversion, the probability corresponding to each pixel in the semantic probability map to the value corresponding to that pixel in the soft-trimap.
Specifically, the probabilities in the semantic probability map can be converted in the following two parts:
1) Based on the semantic probability map, the first probability is converted to obtain a second probability.
The trimap soft-trimap may include three kinds of regions: "determined foreground", "determined background", and "to-be-determined region". In this embodiment, the probability that a pixel belongs to the to-be-determined region of the trimap may be referred to as the second probability.
When converting the first probability, with which a pixel in the semantic probability map belongs to the target object, into the second probability, the following conversion principle can be followed: the higher the probability, represented by the first probability, that the pixel belongs to the foreground or the background, the lower the probability, represented by the second probability, that the pixel belongs to the to-be-determined region of the trimap. For example, the closer the first probability is to 1 or 0, the closer the second probability is to 0; the closer the first probability is to 0.5, the closer the second probability is to 1. The rationale is that if a pixel in the image has a high probability of belonging to the foreground, or a high probability of belonging to the background, then its probability of belonging to the to-be-determined region is low; whereas when the pixel's probability of belonging to the foreground or background is near 0.5, it is more uncertain whether the pixel belongs to the foreground or the background, so its probability of belonging to the to-be-determined region is higher.
Based on the above conversion principle, the first probability can be converted to obtain the second probability. The embodiments of the present disclosure do not limit the specific formula of the probability conversion; the following is only an example:
un = -k4*score^4 + k3*score^3 - k2*score^2 + k1*score .......(1)
In formula (1), un represents the second probability that the pixel belongs to the to-be-determined region, and score represents the first probability, in the semantic probability map, that the pixel belongs to the target object. Formula (1) is a polynomial fit: through polynomial fitting, the first probability of the pixel is fitted to obtain the second probability. This embodiment does not limit the specific values of the coefficients k1/k2/k3/k4.
It can be understood that practical implementations are not limited to the above polynomial fitting; other functional forms may be adopted, as long as the above conversion principle is followed. This embodiment uses polynomial fitting to convert the first probability into the second probability, which makes the conversion computationally efficient while accurately reflecting the conversion principle above.
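For illustration, formula (1) can be sketched as below. The coefficient defaults are assumptions, since the embodiment does not fix them; k1=4, k2=4, k3=k4=0 gives un = 4*score*(1-score), which is 0 at score = 0 or 1 and peaks at 1 when score = 0.5, consistent with the conversion principle above.

```python
def to_second_probability(score, k1=4.0, k2=4.0, k3=0.0, k4=0.0):
    """Formula (1): map the first probability `score` to the second
    probability `un` that the pixel belongs to the to-be-determined region.
    Works elementwise on scalars or NumPy/PyTorch arrays."""
    return -k4 * score**4 + k3 * score**3 - k2 * score**2 + k1 * score
```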
2) The trimap is generated according to the first probability and the second probability corresponding to each pixel in the semantic probability map.
As above, a semantic probability map can be obtained by performing semantic segmentation processing on the target image, and it roughly distinguishes the foreground and background of the target image. Taking extraction of the foreground as an example, if the first probability that a pixel belongs to the foreground is 0.96, the pixel is very likely foreground; if the first probability is 0.14, the pixel is very likely background.
After the second probability is obtained based on the semantic probability map, the second probability that each pixel belongs to the to-be-determined region is available. For each pixel in the target image, the first probability corresponding to the pixel in the semantic probability map and the second probability that the pixel belongs to the to-be-determined region can be fused, yielding the value corresponding to the pixel in the trimap soft-trimap; this value can represent the probability that the pixel belongs to any one of the determined foreground, the determined background, or the to-be-determined region of the target image.
For example, in the soft-trimap, the closer a pixel's value is to 1, the more likely the pixel belongs to the foreground of the target image; the closer the value is to 0, the more likely the pixel belongs to the background; and the closer the value is to 0.5, the more likely the pixel belongs to the to-be-determined region. That is, the value corresponding to a pixel in the soft-trimap expresses the probability that the pixel belongs to any one of the foreground, the background, or the to-be-determined region.
The following formula (2) exemplifies a way of performing probability fusion according to the first probability and the second probability of a pixel to obtain the pixel's value in the trimap:
soft_trimap = -k5*un/k6*sign(score-k7) + (sign(score-k7)+k8)/k9 .......(2)
In formula (2), soft_trimap represents the value corresponding to the pixel in the soft-trimap, un represents the second probability, score represents the first probability, and sign() is the sign function. Likewise, this embodiment does not limit the specific values of the coefficients k5 to k9.
As described in the above example, for each pixel in the semantic probability map, the first probability corresponding to the pixel is converted to obtain the second probability, and the trimap is generated by combining the pixel's first probability and second probability, thereby realizing the probability conversion processing based on the semantic probability map to obtain the trimap soft_trimap.
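A minimal sketch of the fusion in formula (2) follows; the coefficient values are again only illustrative assumptions. With k5=0.5, k6=1, k7=0.5, k8=1, k9=2, a confident foreground pixel maps near 1, a confident background pixel near 0, and an uncertain pixel near 0.5, matching the interpretation above.

```python
import numpy as np

def fuse_to_soft_trimap(score, un, k5=0.5, k6=1.0, k7=0.5, k8=1.0, k9=2.0):
    """Formula (2): fuse the first probability `score` and the second
    probability `un` into the per-pixel soft-trimap value."""
    s = np.sign(score - k7)  # +1 toward foreground, -1 toward background
    return -k5 * un / k6 * s + (s + k8) / k9
```

For instance, with these defaults and the earlier illustrative formula (1), a pixel with score = 0.96 has un = 4*0.96*0.04 ≈ 0.154 and maps to about 0.92, close to the foreground end of the soft-trimap.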
在一些实施例中,在进行上述的基于语义概率图进行概率转换处理之前,还可以对所述语义概率图进行池化处理,并对池化后的语义概率图进行上述的概率转换处理。请 参见下面的公式(3):
score_=avgpool2d(score,ks,stride).......(3)
如公式(3)所示,在一个示例中,可以对语义概率图进行平均池化处理,并且依据卷积步长stride、卷积核大小(kernel_size,ks)进行池化。score_表示池化后的语义概率图,其中包含各池化后的概率。
如果对语义概率图进行了池化处理,那么上面的公式(1)和公式(2)中的score都替换为池化后的概率,即采用池化后的语义概率图执行概率转换。
上述池化处理中采用的kernel的大小可以调整,并且在对语义概率图进行概率转换前进行池化处理,有助于通过调整卷积核大小,调整将要生成的soft_trimap中的待确定区域的宽度。例如,kernel_size越大,待确定区域的宽度就可以越宽。
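A pooling sketch per formula (3) is given below; the kernel size, stride, and the same-size padding are assumptions (the disclosure fixes none of them), but the sketch illustrates how a larger kernel widens the eventual to-be-determined band.

    import torch
    import torch.nn.functional as F

    def pool_semantic_probability(score, ks=25, stride=1):
        # Formula (3): score_ = avgpool2d(score, ks, stride).
        # stride=1 with padding ks//2 keeps the map size unchanged (odd ks assumed).
        score = score[None, None]  # (H, W) -> (N, C, H, W) for avg_pool2d
        score_ = F.avg_pool2d(score, kernel_size=ks, stride=stride, padding=ks // 2)
        return score_[0, 0]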
In some embodiments, assuming the semantic segmentation of the target image is performed by a semantic segmentation network, the image size of the target image may also be preprocessed before the segmentation. The preprocessing may adjust the image size to an integer multiple of the network's downsampling factor for the target image, so that the adjusted image size is divisible by the downsampling factor scale_factor, where scale_factor is the factor by which the semantic segmentation network downsamples the target image, its specific value being determined by the network architecture.
In step 104, matting processing is performed according to the trimap and the target image to obtain a matting result.
In this step, the matting processing may include: taking the trimap and the target image as inputs of a matting network to obtain an object residual of the target image output by the matting network (for example, a foreground residual or a background residual, where the foreground residual may indicate the difference between the predicted foreground pixel values and the corresponding pixel values of the target image, and the background residual may indicate the difference between the predicted background pixel values and the corresponding pixel values of the target image), together with an initial transparency of the target image. Then, the target object in the target image may be obtained based on the target image and the object residual (for example, the foreground from the foreground residual and the target image, or the background from the background residual and the target image), and the transparency of the target image may be obtained according to the initial transparency and the trimap soft_trimap.
In the matting processing method of this embodiment, a trimap is obtained by probability conversion from the semantic probability map produced by semantic segmentation of the target image, which makes the trimap quicker and easier to obtain: manual annotation is no longer needed, nor is a prediction network trained with trimap annotations, so the matting process is simpler to implement. Moreover, because this probability-conversion approach is grounded in the semantic probability map of the semantic segmentation, the generated trimap is relatively accurate, enabling accurate and fast matting.
The matting processing method of the embodiments of the present disclosure may be applied to mobile devices. Considering the processing capability of mobile devices, another embodiment of the present disclosure may miniaturize the network deployed on the mobile device and scale the target image, so that the runtime and memory consumption stay within the mobile device's capacity. An example of matting on a mobile device is described below.
When performing matting, this embodiment may use a semantic segmentation network and a matting network. The semantic segmentation network may be a network such as SegNet or U-Net, and the matting network may include an encoder and a decoder. The encoder of the matting network may adopt an mobv2 structural design, and before the matting network is deployed to the mobile device, channel compression may be applied to it. The channel compression may compress the channel count of the network's intermediate features (i.e., intermediate-layer features); for example, the number of output channels of the convolution kernels used in the matting network may be reduced. Assuming a convolution kernel originally has a output channels, compression by a factor of 0.35 yields 0.35*a output channels.
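As a sketch of this kind of channel compression (the 0.35 multiplier comes from the example above; the helper name and the minimum-channel floor are illustrative assumptions):

    import torch.nn as nn

    def compressed_conv(in_ch, out_ch, width_mult=0.35, **conv_kwargs):
        # Scale a convolution's output channel count by the width multiplier,
        # e.g. out_ch = a becomes int(0.35 * a), with a small floor so that
        # no layer collapses to zero channels.
        out_ch = max(8, int(out_ch * width_mult))
        return nn.Conv2d(in_ch, out_ch, **conv_kwargs)

For example, compressed_conv(64, 256, kernel_size=3, padding=1) builds a layer with 89 output channels instead of 256.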
FIG. 2 illustrates a schematic diagram of a matting processing procedure provided by an embodiment of the present disclosure, and FIG. 3 is a flowchart of the matting processing based on FIG. 2. With reference to FIGS. 2 and 3, the processing may include the following, described with the target object being the foreground as an example:
In step 300, scaling processing is performed on the target image.
For example, the target image of this embodiment may be a portrait; see FIG. 4. The portrait may be captured by the camera of the user's mobile terminal, or may be an image stored on the mobile terminal or received from another device.
The purpose of the matting processing in this embodiment may be to extract the hair and face regions from the portrait; the person's hair and face in the target image may be taken as the foreground.
Since this embodiment performs matting on a mobile device, the target image may be scaled to lighten the device's processing burden and save computation. Assuming the target image in FIG. 4 has a size of 1080*1920, it may be scaled to 480*288, for example through bilinear interpolation. The scaling may follow formulas (4) and (5):
scale = max(h/basesize, w/basesize) ....... (4)
new_h = int(h/scale + k10), new_w = int(w/scale + k11) ....... (5)
Here h and w are the height and width of the target image, basesize is the base size (480 in this example), and int(x) denotes rounding x to an integer. new_h and new_w are the scaled dimensions of the target image; this embodiment does not limit the specific values of the coefficients in formula (5).
In addition, the image size of the target image may further be adjusted to an integer multiple of the downsampling factor according to formulas (6) and (7), so that the scaled image size is divisible by the downsampling factor scale_factor of the semantic segmentation network. It can be understood that the integer-multiple adjustment may also use other formulas and is not limited to the two below.
new_h = int(int(int(new_h - k12 + scale_factor - k13)/scale_factor)*scale_factor) ...... (6)
new_w = int(int(int(new_w - k14 + scale_factor - k15)/scale_factor)*scale_factor) ...... (7)
This embodiment does not limit the specific values of the coefficients in formulas (6) and (7); for example, k12 to k15 may all be set to 1. If the original target image before scaling is denoted A, then after scaling to 480*288 and normalizing, the resulting target image may be denoted B. As shown in FIG. 2, the target image B is the scaled target image.
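The size computation of formulas (4) through (7) can be sketched as follows; k10=k11=0.5 (round to nearest) is an assumption, k12 to k15 are set to 1 as in the example above, and scale_factor=32 is an assumed downsampling factor (the actual value depends on the segmentation network's architecture). With h=1920, w=1080 this reproduces the 480*288 example.

    def compute_scaled_size(h, w, basesize=480, scale_factor=32):
        scale = max(h / basesize, w / basesize)          # formula (4)
        new_h = int(h / scale + 0.5)                     # formula (5), k10 = 0.5
        new_w = int(w / scale + 0.5)                     # formula (5), k11 = 0.5
        # formulas (6)-(7) with k12..k15 = 1: adjust to a multiple of scale_factor
        new_h = (new_h - 1 + scale_factor - 1) // scale_factor * scale_factor
        new_w = (new_w - 1 + scale_factor - 1) // scale_factor * scale_factor
        return new_h, new_w  # (480, 288) for a 1080*1920 portrait image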
In step 302, semantic segmentation processing is performed on the scaled target image through the semantic segmentation network, to obtain the semantic probability map output by the network.
For example, as shown in FIG. 2, the semantic segmentation network 21 may perform semantic segmentation on the target image B to obtain the semantic probability map 22 output by the network. The semantic probability map may be denoted score, which is illustrated in FIG. 5. It can be seen that score roughly separates the foreground and background of the image based on each pixel's probability of belonging to the foreground.
In step 304, probability conversion processing is performed based on the semantic probability map to obtain a trimap.
In this step, the trimap soft-trimap may be generated by the probability conversion processing described in the flow of FIG. 1. For example, the semantic probability map may first be pooled according to formula (3), and the pooled semantic probability map then converted according to formulas (1) and (2) to generate the trimap; see the trimap 23 in FIG. 2.
See the illustration in FIG. 6, which depicts the soft-trimap. The probability value of each pixel in the soft-trimap can represent the probability that the pixel belongs to each of the three kinds of regions, roughly distinguishing the "foreground", "background", and "to-be-determined region" of the image.
In step 306, the trimap and the target image are taken as inputs of the matting network, to obtain the foreground residual and initial transparency output by the matting network.
As shown in FIG. 2, both the trimap 23 and the target image B may be input into the matting network 24, which may output a 4-channel result: one channel is the initial transparency raw_alpha, and the other three channels are the foreground residual fg_res. The first result 25 output by the matting network in FIG. 2 may include "raw_alpha + fg_res".
In step 308, the foreground of the target image is obtained based on the target image and the foreground residual, and the transparency is obtained according to the initial transparency and the trimap.
Continuing with FIG. 2, the foreground residual fg_res may be upscaled through bilinear interpolation, restoring it to the scale of the target image before the scaling processing, and formula (8) is then applied:
FG = clip(A + fg_res, s1, s2) ....... (8)
As shown in FIG. 2, the foreground FG of the target image can be obtained from the upscaled foreground residual fg_res and the target image A, where clip(x, s1, s2) clamps the value of x to [s1, s2]. This embodiment does not limit the specific values of s1 and s2 in formula (8); for example, s1 may be 0 and s2 may be 1.
In addition, the transparency may be computed according to formulas (9) and (10) below:
fs = clip((soft_trimap - s3)/s4, s5, s6) ....... (9)
Alpha = clip(fs + un*raw_alpha, s7, s8) ....... (10)
Here Alpha denotes the transparency. After Alpha is obtained, it may be upscaled through bilinear interpolation back to the original size of the target image before scaling. Likewise, this embodiment does not limit the specific values of the coefficients s3 to s8 in formulas (9) and (10).
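The composition of formulas (8) through (10) can be sketched as below. The clamp bounds s1 to s8 are assumptions (s1=s5=s7=0, s2=s6=s8=1, s3=s4=0.5), chosen so that fs is 1 in definite foreground, 0 in definite background, and the network's raw_alpha takes over in the to-be-determined band; all inputs are assumed to be float arrays already upscaled to the original resolution.

    import numpy as np

    def compose_matting_result(A, fg_res, soft_trimap, un, raw_alpha):
        FG = np.clip(A + fg_res, 0.0, 1.0)                 # formula (8)
        fs = np.clip((soft_trimap - 0.5) / 0.5, 0.0, 1.0)  # formula (9)
        alpha = np.clip(fs + un * raw_alpha, 0.0, 1.0)     # formula (10)
        return FG, alpha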
FIG. 7 illustrates Alpha, and FIG. 8 illustrates the finally obtained foreground FG of the target image.
Furthermore, after the foreground and transparency of the target image included in the matting result are obtained, image editing may continue based on them; for example, the image editing may be foreground replacement and/or foreground rendering of the image.
In the matting processing method of the embodiments of the present disclosure, applying channel compression and similar processing to the matting model and scaling the target image make the matting better suited to mobile devices. For example, after capturing an image with a mobile terminal, the user can complete the hair matting directly on the terminal, extract the hair, and apply hair-coloring processing, so all of this processing can run locally on the mobile terminal without uploading to the cloud, improving data security and privacy protection. Moreover, as can be seen from FIG. 2, the matting method takes a single target image as input and directly yields the matting result; that is, given one target image, the matting processing method provided by the embodiments of the present disclosure yields the prediction of the foreground in that image. Requiring little input information makes the matting processing more convenient.
In addition, this embodiment does not limit the training methods of the semantic segmentation network and the matting network used in the matting flow. FIG. 9 shows a network training method provided by at least one embodiment of the present disclosure, which may be used for joint training of the semantic segmentation network and the matting network. As shown in FIG. 9, the method may include the following processing:
In step 900, a training sample set is obtained, the training sample set including a plurality of sample data.
In some implementations, each sample data in the training sample set may include a sample image, a first feature label corresponding to the sample image, and a second feature label corresponding to the sample image. Taking the matting scenario as an example, the first feature label may be a segmentation label for the sample image, and the second feature label may be a matting label for the sample image.
In step 902, for each sample data in the training sample set, the sample data is processed to obtain a first image containing the global image information of the sample image together with a segmentation label corresponding to the first image, and a second image containing local image information of the sample image together with a matting label corresponding to the second image.
In some implementations, first processing may be applied to the sample image of the sample data to obtain a first image that includes most of the sample image's information; this first image may be considered to include the global image information of the sample image. The same first processing is applied to the sample image's first feature label to obtain the segmentation label corresponding to the first image. For example, the sample image may be scaled according to the input size required by the semantic segmentation network while still retaining most of its image information, yielding the first image, and the first feature label undergoes the same scaling to yield the segmentation label.
Meanwhile, second processing is applied to the sample image of the sample data to obtain a second image that includes local image information of the sample image, and the same second processing is applied to the sample image's second feature label to obtain the matting label corresponding to the second image. For example, the sample image may be locally cropped to obtain the second image containing local image information, and the second feature label undergoes the same local cropping to yield the matting label.
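A per-sample processing sketch follows; the target sizes, the random-crop logic, and the use of OpenCV are illustrative assumptions, but the sketch shows the split into a global "first image" for the segmentation branch and a local "second image" for the matting branch, with each label transformed exactly like its image.

    import cv2
    import numpy as np

    def build_training_pair(image, seg_label, matte_label,
                            seg_size=(288, 480), crop=512):
        # Global view: resize (dsize is (width, height) in OpenCV).
        first_image = cv2.resize(image, seg_size)
        first_label = cv2.resize(seg_label, seg_size,
                                 interpolation=cv2.INTER_NEAREST)
        # Local view: the same random crop for image and matting label.
        h, w = image.shape[:2]
        y = np.random.randint(0, max(1, h - crop))
        x = np.random.randint(0, max(1, w - crop))
        second_image = image[y:y + crop, x:x + crop]
        second_label = matte_label[y:y + crop, x:x + crop]
        return first_image, first_label, second_image, second_label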
In step 904, semantic segmentation processing is performed on the first image through the semantic segmentation network, to obtain the semantic probability map output by the network.
In step 906, probability conversion processing is performed based on the semantic probability map to obtain a trimap.
The probability conversion processing of this step may refer to the foregoing embodiments and is not detailed again. Through this probability conversion processing, the soft-trimap of the embodiments of the present disclosure can be obtained.
In step 908, matting processing is performed through the matting network based on the trimap and the second image, to obtain a matting result.
In step 910, the network parameters of the semantic segmentation network are adjusted according to the difference between the semantic probability map and the segmentation label, and the network parameters of the matting network are adjusted based on the difference between the matting result and the matting label.
From the above, in the implementations of the present disclosure, by processing each sample data, the semantic segmentation network is trained with the resulting first image containing global image information and its segmentation label, and the matting network is trained with the second image containing local image information and its matting label, which improves the joint training effect and reduces the risk of network performance degradation.
Moreover, in the above training scheme, generating the soft-trimap through probability conversion can, to some extent, help the networks train better.
Specifically, the soft-trimap can adapt during network training. For example, in the process of adjusting the parameters of the semantic segmentation network according to the difference between the semantic probability map and the segmentation label, and adjusting the parameters of the matting network based on the difference between the matting result and the matting label, the segmentation network's parameters are updated, and consequently the semantic probability map it outputs is updated as well.
Further, since the soft-trimap is generated from the semantic probability map, an updated semantic probability map brings an updated trimap soft-trimap, and in turn an updated matting result. That is, training usually iterates many times, and after each iteration, if the segmentation network's parameters have been updated, then even for the same input image the semantic probability map, the soft-trimap, and the matting result all update adaptively, and the network parameters continue to be adjusted according to the updated results. This adaptive adjustment of the soft-trimap helps both the generated soft-trimap and the matting result be dynamically optimized along with the segmentation network, so that the final model trains better and can extract the target object from the target image more accurately.
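One joint iteration might look like the sketch below, assuming the two losses are simply summed and that probability_to_trimap is the conversion described above (recomputed every step, which is what makes the soft-trimap adapt as the segmentation weights change); how the globally derived trimap is spatially aligned with the local crop is glossed over here, since the disclosure does not spell it out.

    import torch

    def joint_training_step(seg_net, matting_net, optimizer,
                            first_image, seg_label, second_image, matte_label,
                            seg_loss_fn, matte_loss_fn, probability_to_trimap):
        score = seg_net(first_image)                    # semantic probability map
        soft_trimap = probability_to_trimap(score)      # regenerated every step
        matte = matting_net(second_image, soft_trimap)  # matting result
        loss = seg_loss_fn(score, seg_label) + matte_loss_fn(matte, matte_label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()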
FIG. 10 illustrates a matting processing apparatus, which may be applied to implement the matting processing method of any embodiment of the present disclosure. As shown in FIG. 10, the apparatus may include: a segmentation processing module 1001, a conversion processing module 1002, and a matting processing module 1003.
The segmentation processing module 1001 is configured to perform semantic segmentation processing on a target image to obtain a semantic probability map corresponding to the target image, the semantic probability map including: for each pixel in the target image, a first probability that the pixel belongs to a target object, the target object being the foreground or background of the target image.
The conversion processing module 1002 is configured to perform probability conversion processing based on the semantic probability map to obtain a trimap, where for each pixel in the trimap, the value corresponding to the pixel represents the probability that the pixel belongs to any one of the foreground, background, or to-be-determined region of the target image.
The matting processing module 1003 is configured to perform matting processing according to the trimap and the target image to obtain a matting result.
In one example, the conversion processing module 1002, when configured to perform probability conversion processing based on the semantic probability map to obtain the trimap, is configured to: for each pixel in the semantic probability map, perform probability conversion based on the pixel's first probability to obtain a second probability that the pixel belongs to the to-be-determined region of the trimap; and generate the trimap according to the first probability and the second probability of each pixel in the semantic probability map.
In one example, the conversion processing module 1002, when configured to, for each pixel in the semantic probability map, perform probability conversion based on the pixel's first probability to obtain the second probability that the pixel belongs to the to-be-determined region of the trimap, is configured to: fit, through polynomial fitting, the pixel's first probability to obtain the second probability that the pixel belongs to the to-be-determined region of the trimap.
In one example, the matting processing module 1003, when configured to perform matting processing according to the trimap and the target image to obtain the matting result, is configured to: perform matting processing according to the trimap and the target image to obtain an object residual and an initial transparency of the target image; obtain the target object in the target image based on the target image and the object residual; and determine the transparency of the target image according to the initial transparency and the trimap.
In one example, the segmentation processing module 1001 further performs scaling processing on the target image before performing semantic segmentation processing on it; and the matting processing module 1003, when configured to obtain the target object in the target image based on the target image and the object residual, is configured to: upscale the object residual to the scale of the target image before the scaling processing; and obtain the target object in the target image according to the upscaled object residual and the target image.
In one example, the segmentation processing module 1001, when configured to perform semantic segmentation processing on the target image to obtain the semantic probability map corresponding to the target image, is configured to: perform semantic segmentation processing on the target image through a semantic segmentation network to obtain the semantic probability map output by the network; and the matting processing module 1003, when configured to perform matting processing according to the trimap and the target image, is configured to: perform matting processing according to the trimap and the target image through a matting network.
In one example, the matting network is a channel-compressed network, the channel compression being compression of the channel count of the matting network's intermediate features.
In one example, the higher the probability, represented by a pixel's first probability, that the pixel belongs to the foreground or background, the lower the probability, represented by the corresponding probability-converted second probability, that the pixel belongs to the to-be-determined region of the trimap; and the conversion processing module 1002, when configured to generate the trimap according to the pixel's first probability and second probability, is configured to: for each pixel in the target image, perform probability fusion according to the first probability and the second probability corresponding to the pixel, and determine the value corresponding to the pixel in the trimap.
In one example, the conversion processing module 1002 is further configured to, before the probability conversion processing based on the semantic probability map to obtain the trimap, perform pooling processing on the semantic probability map to obtain a pooled semantic probability map; the probability conversion processing based on the semantic probability map then includes: performing probability conversion processing on the pooled semantic probability map.
In one example, the segmentation processing module 1001 is further configured to, before performing semantic segmentation processing on the target image and based on the semantic segmentation network's downsampling factor for the target image, adjust the image size of the target image to an integer multiple of the downsampling factor, so that the adjusted image size is divisible by the downsampling factor.
In one example, the matting result includes: the transparency of the target image and the target object; and the matting processing module 1003 is further configured to perform object replacement and/or object rendering according to the target object and the transparency in the matting result.
FIG. 11 illustrates a network training apparatus, which may be applied to implement the network training method of any embodiment of the present disclosure; the apparatus is used for joint training of the semantic segmentation network and the matting network. As shown in FIG. 11, the apparatus may include: a sample acquisition module 1101, a sample processing module 1102, a semantic segmentation module 1103, a conversion processing module 1104, a matting processing module 1105, and a network adjustment module 1106.
The sample acquisition module 1101 is configured to obtain a training sample set, the training sample set including a plurality of sample data.
The sample processing module 1102 is configured to, for each sample data in the training sample set, process the sample data to obtain a first image containing the global image information of a sample image together with a segmentation label corresponding to the first image, and a second image containing local image information of the sample image together with a matting label corresponding to the second image.
The semantic segmentation module 1103 is configured to perform semantic segmentation processing on the first image through the semantic segmentation network, to obtain the semantic probability map output by the network.
The conversion processing module 1104 is configured to perform probability conversion processing based on the semantic probability map to obtain a trimap.
The matting processing module 1105 is configured to perform matting processing on the trimap and the second image to obtain a matting result.
The network adjustment module 1106 is configured to adjust the network parameters of the semantic segmentation network according to the difference between the semantic probability map and the segmentation label, and to adjust the network parameters of the matting network based on the difference between the matting result and the matting label.
The present disclosure further provides an electronic device, the device including a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the matting processing method or the network training method of any embodiment of the present disclosure when executing the computer instructions.
The present disclosure further provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the matting processing method or the network training method of any embodiment of the present disclosure.
Those skilled in the art should understand that one or more embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Therefore, one or more embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The embodiments of the present disclosure further provide a computer-readable storage medium on which a computer program may be stored, where the program, when executed by a processor, implements the steps of the matting processing method or the network training method described in any embodiment of the present disclosure. Here, "and/or" means having at least one of the two; for example, "A and/or B" includes three cases: A, B, and "A and B".
The embodiments of the present disclosure relate to the field of augmented reality: by acquiring image information of a target object in a real environment, detection or recognition of relevant features, states, and attributes of the target object is realized with the help of various vision-related algorithms, thereby obtaining an AR effect that combines virtuality and reality and matches the specific application. By way of example, the target object may involve faces, limbs, gestures, or actions related to the human body; markers or landmarks related to objects; or sandboxes, display areas, or display items related to venues or places. Vision-related algorithms may involve visual localization, SLAM, three-dimensional reconstruction, image registration, background segmentation, keypoint extraction and tracking of objects, and pose or depth detection of objects. Specific applications may involve not only interactive scenarios such as guided tours, navigation, explanation, reconstruction, and virtual-effect overlay display related to real scenes or objects, but also special-effect processing related to people, such as makeup beautification, body beautification, special-effect display, and virtual model display. Detection or recognition of the target object's relevant features, states, and attributes may be realized through a convolutional neural network, which is a network model obtained by model training based on a deep learning framework.
The embodiments in the present disclosure are described in a progressive manner; identical or similar parts of the embodiments may be referred to mutually, and each embodiment focuses on its differences from the others. In particular, the apparatus and device embodiments, being substantially similar to the method embodiments, are described relatively simply; for relevant parts, refer to the description of the method embodiments.
Specific embodiments of the present disclosure have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
Embodiments of the subject matter and functional operations described in the present disclosure may be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in the present disclosure and their structural equivalents, or in a combination of one or more of them. Embodiments of the subject matter described in the present disclosure may be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier to be executed by a data processing apparatus or to control the operation of a data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, generated to encode and transmit information to a suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processing and logic flows described in the present disclosure may be performed by one or more programmable computers executing one or more computer programs, to perform corresponding functions by operating on input data and generating output. The processing and logic flows may also be performed by special-purpose logic circuitry, such as an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus may also be implemented as special-purpose logic circuitry.
Computers suitable for executing a computer program include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The essential components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or the computer will be operatively coupled to such mass storage devices to receive data from or transfer data to them, or both. However, a computer need not have such devices. Furthermore, a computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and memory may be supplemented by, or incorporated in, special-purpose logic circuitry.
Although the present disclosure contains many specific implementation details, these should not be construed as limiting the scope of any disclosure or of what may be claimed, but are mainly used to describe the features of specific embodiments of a particular disclosure. Certain features described in multiple embodiments within the present disclosure may also be implemented in combination in a single embodiment. Conversely, various features described in a single embodiment may also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may function in certain combinations as described above and may even be initially claimed as such, one or more features from a claimed combination may in some cases be removed from the combination, and the claimed combination may be directed to a subcombination or a variant of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that these operations be performed in the particular order shown or sequentially, or that all illustrated operations be performed, to achieve desired results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the above embodiments should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, specific embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims may be performed in a different order and still achieve desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve desired results. In some implementations, multitasking and parallel processing may be advantageous.
The above are merely preferred embodiments of one or more embodiments of the present disclosure and are not intended to limit one or more embodiments of the present disclosure. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of one or more embodiments of the present disclosure shall be included within the protection scope of one or more embodiments of the present disclosure.

Claims (20)

  1. A matting processing method, characterized in that the method comprises:
    performing semantic segmentation processing on a target image to obtain a semantic probability map corresponding to the target image, the semantic probability map comprising: for each pixel in the target image, a first probability that the pixel belongs to a target object, the target object being the foreground or background of the target image;
    performing probability conversion processing based on the semantic probability map to obtain a trimap, wherein for each pixel in the trimap, a value corresponding to the pixel represents a probability that the pixel belongs to any one of the foreground, the background, or a to-be-determined region of the target image;
    performing matting processing according to the trimap and the target image to obtain a matting result.
  2. The method according to claim 1, characterized in that
    performing semantic segmentation processing on the target image to obtain the semantic probability map corresponding to the target image comprises: performing semantic segmentation processing on the target image through a semantic segmentation network to obtain the semantic probability map output by the semantic segmentation network;
    performing matting processing according to the trimap and the target image comprises: performing matting processing according to the trimap and the target image through a matting network.
  3. The method according to claim 2, characterized in that
    the matting network is a channel-compressed network, the channel compression being compression of the channel count of intermediate network features of the matting network.
  4. The method according to claim 1, characterized in that performing probability conversion processing based on the semantic probability map to obtain the trimap comprises:
    for each pixel in the semantic probability map, performing probability conversion based on the first probability of the pixel to obtain a second probability that the pixel belongs to the to-be-determined region of the trimap;
    generating the trimap according to the first probability and the second probability of each pixel in the semantic probability map.
  5. The method according to claim 4, characterized in that for each pixel in the semantic probability map, the higher the probability, represented by the first probability of the pixel, that the pixel belongs to the foreground or background, the lower the probability, represented by the corresponding probability-converted second probability, that the pixel belongs to the to-be-determined region of the trimap;
    generating the trimap according to the first probability and the second probability of each pixel in the semantic probability map comprises: for each pixel in the target image, performing probability fusion according to the first probability and the second probability corresponding to the pixel, and determining the value corresponding to the pixel in the trimap.
  6. The method according to claim 4, characterized in that for each pixel in the semantic probability map, performing probability conversion based on the first probability of the pixel to obtain the second probability that the pixel belongs to the to-be-determined region of the trimap comprises:
    fitting, through polynomial fitting, the first probability of the pixel to obtain the second probability that the pixel belongs to the to-be-determined region of the trimap.
  7. The method according to claim 1, characterized in that before performing probability conversion processing based on the semantic probability map to obtain the trimap, the method further comprises:
    performing pooling processing on the semantic probability map to obtain a pooled semantic probability map;
    performing probability conversion processing based on the semantic probability map comprises: performing probability conversion processing on the pooled semantic probability map.
  8. The method according to claim 1, characterized in that before performing semantic segmentation processing on the target image, the method further comprises:
    based on a downsampling factor of a semantic segmentation network for the target image, adjusting the image size of the target image to an integer multiple of the downsampling factor, so that the adjusted image size is divisible by the downsampling factor.
  9. The method according to claim 1, characterized in that performing matting processing according to the trimap and the target image to obtain the matting result comprises:
    performing matting processing according to the trimap and the target image to obtain an object residual and an initial transparency of the target image;
    obtaining the target object in the target image based on the target image and the object residual;
    determining the transparency of the target image according to the initial transparency and the trimap.
  10. The method according to claim 9, characterized in that before performing semantic segmentation processing on the target image, the method further comprises: performing scaling processing on the target image;
    obtaining the target object in the target image based on the target image and the object residual comprises:
    upscaling the object residual to the scale of the target image before the scaling processing;
    obtaining the target object in the target image according to the upscaled object residual and the target image.
  11. The method according to any one of claims 1 to 10, characterized in that the matting result comprises: the transparency of the target image and the target object;
    the method further comprises:
    performing object replacement and/or object rendering according to the target object and the transparency in the matting result.
  12. A network training method, characterized in that the method is used for joint training of a semantic segmentation network and a matting network, the method comprising:
    obtaining a training sample set, the training sample set comprising a plurality of sample data;
    for each sample data in the training sample set, processing the sample data to obtain a first image containing global image information of a sample image and a segmentation label corresponding to the first image, and a second image containing local image information of the sample image and a matting label corresponding to the second image;
    performing semantic segmentation processing on the first image through the semantic segmentation network to obtain a semantic probability map output by the semantic segmentation network;
    performing probability conversion processing based on the semantic probability map to obtain a trimap;
    performing matting processing through the matting network based on the trimap and the second image to obtain a matting result;
    adjusting network parameters of the semantic segmentation network according to the difference between the semantic probability map and the segmentation label, and adjusting network parameters of the matting network based on the difference between the matting result and the matting label.
  13. A matting processing apparatus, characterized in that the apparatus comprises:
    a segmentation processing module configured to perform semantic segmentation processing on a target image to obtain a semantic probability map corresponding to the target image, the semantic probability map comprising: for each pixel in the target image, a first probability that the pixel belongs to a target object, the target object being the foreground or background of the target image;
    a conversion processing module configured to perform probability conversion processing based on the semantic probability map to obtain a trimap, wherein for each pixel in the trimap, a value corresponding to the pixel represents a probability that the pixel belongs to any one of the foreground, the background, or a to-be-determined region of the target image;
    a matting processing module configured to perform matting processing according to the trimap and the target image to obtain a matting result.
  14. The apparatus according to claim 13, characterized in that
    the conversion processing module, when configured to perform probability conversion processing based on the semantic probability map to obtain the trimap, is configured to: for each pixel in the semantic probability map, perform probability conversion based on the first probability of the pixel to obtain a second probability that the pixel belongs to the to-be-determined region of the trimap; and generate the trimap according to the first probability and the second probability of each pixel in the semantic probability map.
  15. The apparatus according to claim 14, characterized in that
    the conversion processing module, when configured to, for each pixel in the semantic probability map, perform probability conversion based on the first probability of the pixel to obtain the second probability that the pixel belongs to the to-be-determined region of the trimap, is configured to: fit, through polynomial fitting, the first probability of the pixel to obtain the second probability that the pixel belongs to the to-be-determined region of the trimap.
  16. The apparatus according to claim 13, characterized in that
    the matting processing module, when configured to perform matting processing according to the trimap and the target image to obtain the matting result, is configured to: perform matting processing according to the trimap and the target image to obtain an object residual and an initial transparency of the target image; obtain the target object in the target image based on the target image and the object residual; and determine the transparency of the target image according to the initial transparency and the trimap.
  17. The apparatus according to claim 16, characterized in that
    the segmentation processing module further performs scaling processing on the target image before performing semantic segmentation processing on the target image;
    the matting processing module, when configured to obtain the target object in the target image based on the target image and the object residual, is configured to: upscale the object residual to the scale of the target image before the scaling processing; and obtain the target object in the target image according to the upscaled object residual and the target image.
  18. A network training apparatus, characterized in that the apparatus is used for joint training of a semantic segmentation network and a matting network, the apparatus comprising:
    a sample acquisition module configured to obtain a training sample set, the training sample set comprising a plurality of sample data;
    a sample processing module configured to, for each sample data in the training sample set, process the sample data to obtain a first image containing global image information of a sample image and a segmentation label corresponding to the first image, and a second image containing local image information of the sample image and a matting label corresponding to the second image;
    a semantic segmentation module configured to perform semantic segmentation processing on the first image through the semantic segmentation network to obtain a semantic probability map output by the semantic segmentation network;
    a conversion processing module configured to perform probability conversion processing based on the semantic probability map to obtain a trimap;
    a matting processing module configured to perform matting processing on the trimap and the second image to obtain a matting result;
    a network adjustment module configured to adjust network parameters of the semantic segmentation network according to the difference between the semantic probability map and the segmentation label, and to adjust network parameters of the matting network based on the difference between the matting result and the matting label.
  19. An electronic device, characterized in that the device comprises a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the method according to any one of claims 1 to 11, or the method according to claim 12, when executing the computer instructions.
  20. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 11, or the method according to claim 12.
PCT/CN2022/124757 2021-10-18 2022-10-12 Image matting processing WO2023066099A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111212067.8 2021-10-18
CN202111212067.8A CN113657402B (zh) Image matting processing method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023066099A1 true WO2023066099A1 (zh) 2023-04-27

Family

ID=78484219

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/124757 WO2023066099A1 (zh) Image matting processing

Country Status (2)

Country Link
CN (1) CN113657402B (zh)
WO (1) WO2023066099A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657402B (zh) 2021-10-18 2022-02-01 北京市商汤科技开发有限公司 Image matting processing method and apparatus, electronic device, and storage medium
CN114187317B (zh) * 2021-12-10 2023-01-31 北京百度网讯科技有限公司 Image matting method and apparatus, electronic device, and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150287211A1 (en) * 2014-04-04 2015-10-08 Hrl Laboratories Llc Method for classification and segmentation and forming 3d models from images
CN108460770A (zh) * 2016-12-13 2018-08-28 华为技术有限公司 Matting method and apparatus
CN109410185A (zh) * 2018-10-10 2019-03-01 腾讯科技(深圳)有限公司 Image segmentation method and apparatus, and storage medium
CN109461167A (zh) * 2018-11-02 2019-03-12 Oppo广东移动通信有限公司 Training method for image processing model, matting method, apparatus, medium, and terminal
CN109712145A (zh) * 2018-11-28 2019-05-03 山东师范大学 Image matting method and system
CN110751655A (zh) * 2019-09-16 2020-02-04 南京工程学院 Automatic matting method based on semantic segmentation and saliency analysis
CN110930296A (zh) * 2019-11-20 2020-03-27 Oppo广东移动通信有限公司 Image processing method, apparatus, device, and storage medium
CN113657402A (zh) 2021-10-18 2021-11-16 北京市商汤科技开发有限公司 Image matting processing method and apparatus, electronic device, and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110335277A (zh) * 2019-05-07 2019-10-15 腾讯科技(深圳)有限公司 Image processing method and apparatus, computer-readable storage medium, and computer device
CN111340047B (zh) * 2020-02-28 2021-05-11 江苏实达迪美数据处理有限公司 Image semantic segmentation method and system based on multi-scale features and foreground-background contrast

Also Published As

Publication number Publication date
CN113657402B (zh) 2022-02-01
CN113657402A (zh) 2021-11-16

Similar Documents

Publication Publication Date Title
WO2023066099A1 (zh) Image matting processing
US11393152B2 (en) Photorealistic real-time portrait animation
US11276231B2 (en) Semantic deep face models
CN106682632B (zh) Method and apparatus for processing face images
US11410364B2 (en) Systems and methods for realistic head turns and face animation synthesis on mobile device
CN108388882B (zh) Gesture recognition method based on global-local RGB-D multimodality
WO2023071810A1 (zh) Image processing
CN113420719B (zh) Method and apparatus for generating motion capture data, electronic device, and storage medium
EP3104331A1 (en) Digital image manipulation
CN111402399B (zh) Face driving and live-streaming method and apparatus, electronic device, and storage medium
CN113261013A (zh) Systems and methods for realistic head turns and face animation synthesis on mobile devices
CN113949808B (zh) Video generation method and apparatus, readable medium, and electronic device
US11915355B2 (en) Realistic head turns and face animation synthesis on mobile device
CN114723760B (zh) Training method and apparatus for portrait segmentation model, and portrait segmentation method and apparatus
CN114445562A (zh) Three-dimensional reconstruction method and apparatus, electronic device, and storage medium
CN114782864B (zh) Information processing method and apparatus, computer device, and storage medium
Sun et al. Masked lip-sync prediction by audio-visual contextual exploitation in transformers
CN116012232A (zh) Image processing method and apparatus, storage medium, and electronic device
CN115239857B (zh) Image generation method and electronic device
CN107766803B (zh) Scene-segmentation-based video character dressing method and apparatus, and computing device
CN114581542A (zh) Image preview method and apparatus, electronic device, and storage medium
CN117911588A (zh) Virtual object face driving and model training method, apparatus, device, and medium
CN113920023A (zh) Image processing method and apparatus, computer-readable medium, and electronic device
CN113837933A (zh) Network training and image generation method and apparatus, electronic device, and storage medium
CN115205325A (zh) Target tracking method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22882709

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE