CN116109646A - Training data generation method and device, electronic equipment and storage medium - Google Patents

Training data generation method and device, electronic equipment and storage medium

Info

Publication number
CN116109646A
CN116109646A (application CN202111332709.8A)
Authority
CN
China
Prior art keywords: image, processed, feature map, convolution, area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111332709.8A
Other languages
Chinese (zh)
Inventor
吴臣桓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202111332709.8A
Publication of CN116109646A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person
    • G06T 2207/30201 Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The embodiments of the application relate to the technical field of image processing and disclose a training data generation method and apparatus, an electronic device, and a storage medium. The method includes: performing face detection on an image to be processed through a face detection model to obtain a face region of the image to be processed, and determining the area ratio of the face region in the image to be processed; if the area ratio is greater than a ratio threshold, performing portrait segmentation on the image to be processed through a portrait segmentation model to obtain a portrait segmentation result; and if the portrait segmentation result meets the precision requirement, determining the image to be processed as training data. Implementing the embodiments of the application can improve the efficiency of expanding the training data of a portrait segmentation model.

Description

Training data generation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a training data generating method and apparatus, an electronic device, and a storage medium.
Background
A portrait segmentation model generates, for an input image, a portrait segmentation result that describes the position of the portrait region in that image; such models are widely used in various image processing pipelines.
Training a portrait segmentation model requires a large amount of training data containing portraits. In practice, it has been found that the related art usually expands the training data by manually shooting portrait images and then manually labeling each shot image; however, manual labeling is slow, which reduces the efficiency of expanding the training data.
Disclosure of Invention
The embodiments of the present application disclose a training data generation method and apparatus, an electronic device, and a storage medium, which can improve the efficiency of expanding the training data of a portrait segmentation model.
A first aspect of the embodiments of the present application discloses a training data generation method, including:
performing face detection on an image to be processed through a face detection model to obtain a face region of the image to be processed, and determining the area ratio of the face region in the image to be processed;
if the area ratio is greater than a ratio threshold, performing portrait segmentation on the image to be processed through a portrait segmentation model to obtain a portrait segmentation result;
and if the portrait segmentation result meets the precision requirement, determining the image to be processed as training data.
A second aspect of the embodiments of the present application discloses a training data generation apparatus, including:
a first determining unit, configured to perform face detection on an image to be processed through a face detection model to obtain a face region of the image to be processed, and to determine the area ratio of the face region in the image to be processed;
a segmentation unit, configured to perform portrait segmentation on the image to be processed through a portrait segmentation model when the area ratio is greater than a ratio threshold, to obtain a portrait segmentation result;
and a second determining unit, configured to determine the image to be processed as training data when the portrait segmentation result meets the precision requirement.
A third aspect of an embodiment of the present application discloses an electronic device, including:
a memory storing executable program code;
a processor coupled to the memory;
the processor invokes the executable program code stored in the memory to execute the training data generation method disclosed in the first aspect of the embodiments of the present application.
A fourth aspect of the embodiments of the present application discloses a computer-readable storage medium storing a computer program, where the computer program causes a computer to execute the training data generation method disclosed in the first aspect of the embodiments of the present application.
Compared with the related art, the embodiment of the application has the following beneficial effects:
according to the method disclosed by the embodiment of the application, the face area of the image to be processed can be determined through the face detection model, the area occupation ratio of the face area in the image to be processed is determined, if the area occupation ratio of the face area is larger than the occupation ratio threshold, the fact that the face area included in the image to be processed is larger is indicated, and the value of training data serving as a portrait segmentation model is further achieved; further, the image to be processed can be subjected to image segmentation processing through the image segmentation module, so that a result of image segmentation is obtained. It can be understood whether the image segmentation result of the image to be processed accurately affects the performance of a model trained according to the image to be processed, so that the image to be processed can be determined as training data only when the obtained image segmentation result is determined to meet the accuracy requirement. Therefore, by implementing the embodiment of the application, the training data can be quickly generated through the model, so that the expansion efficiency of the training data of the portrait segmentation model is improved; in addition, the training data generated by the method has large face ratio and accurate human image segmentation result, so that the performance of a model obtained by subsequent training according to the training data can be improved.
Drawings
In order to describe the technical solutions in the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a training data generation method disclosed in an embodiment of the present application;
Fig. 2 is a flowchart of another training data generation method disclosed in an embodiment of the present application;
Fig. 3 is a schematic diagram illustrating a face detection box disclosed in an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a face detection model disclosed in an embodiment of the present application;
Fig. 5 is a flowchart of yet another training data generation method disclosed in an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a portrait segmentation model disclosed in an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a training data generation apparatus disclosed in an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an electronic device disclosed in an embodiment of the present application.
Detailed Description
The following description of the technical solutions in the embodiments of the present application will be made clearly and completely with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It should be noted that the terms "first," "second," "third," and "fourth," etc. in the description and claims of the present application are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
The embodiments of the present application disclose a training data generation method and apparatus, an electronic device, and a storage medium, which can improve the efficiency of expanding the training data of a portrait segmentation model.
The technical scheme of the present application will be described in detail with reference to specific embodiments.
In order to describe more clearly the training data generation method and apparatus, electronic device, and storage medium disclosed in the embodiments of the present application, an application scenario suitable for the training data generation method is first introduced. Optionally, the method may be applied to various electronic devices, including but not limited to portable electronic devices such as mobile phones and tablet computers, wearable devices such as smart watches and smart bands, and desktop electronic devices such as desktop computers and televisions; no limitation is imposed here.
With the rapid development of electronic devices, an electronic device can now perform image processing operations such as background blurring and face beautification on a captured image to improve picture quality. During background blurring or portrait beautification, the electronic device needs the portrait segmentation result of the image to determine its portrait region and background region, so that different image processing operations can be applied to the two regions. It can be understood that the accuracy of the portrait segmentation result affects the image processing effect, so determining the portrait segmentation result quickly and accurately is a problem that needs to be solved in image processing.
In practice, it has been found that electronic devices currently generate portrait segmentation results with a trained portrait segmentation model. It can be appreciated that the more accurately the portrait region is labeled in the training data (typically, images containing portraits), the higher the segmentation accuracy of the portrait segmentation model trained on it. To improve the accuracy of the portrait segmentation model, the related art usually expands the training data by manually shooting portrait images and manually labeling the portrait region in each shot image; however, manual labeling is slow, which reduces the efficiency of expanding the training data.
In this regard, the embodiments of the present application disclose a training data generation method: the face region of an image to be processed can be determined through a face detection model, together with the area ratio of the face region in the image to be processed; if the area ratio of the face region is greater than a ratio threshold, the face region included in the image to be processed is comparatively large, so the image has value as training data for a portrait segmentation model. The image to be processed can then be segmented through the portrait segmentation model to obtain a portrait segmentation result. Whether the portrait segmentation result of the image to be processed is accurate affects the performance of a model later trained on that image, so the image to be processed is determined as training data only when the obtained portrait segmentation result is determined to meet the precision requirement. Therefore, by implementing the embodiments of the present application, training data can be generated quickly by models, improving the efficiency of expanding the training data of a portrait segmentation model; in addition, the training data generated in this way has a large face ratio and an accurate portrait segmentation result, so the performance of a model subsequently trained on this data can also be improved.
Referring to fig. 1, fig. 1 is a flowchart of a training data generation method disclosed in an embodiment of the present application. The method may be applied to the above electronic device or another execution entity; no limitation is imposed here. The method may include the following steps:
102. Perform face detection on the image to be processed through a face detection model to obtain the face region of the image to be processed, and determine the area ratio of the face region in the image to be processed.
In the embodiments of the present application, the image to be processed may be an image captured by the electronic device through its camera, an image received from another electronic device, or an image downloaded from the Internet; no limitation is imposed here. In one embodiment, the electronic device may acquire video data containing a portrait and then decode video frames from it frame by frame as images to be processed, so as to quickly obtain multiple frames of images to be processed.
Further, the electronic device may perform face detection on the image to be processed through the face detection model to determine the face region in the image to be processed. The electronic device can then determine the area of the face region and the area of the whole image to be processed, and divide the former by the latter to obtain the area ratio of the face region in the image to be processed.
It should be further noted that the area of the face region may be represented by the number of pixels it contains, and the area ratio may correspondingly be represented by the ratio of the number of pixels in the face region to the total number of pixels in the image to be processed; no limitation is imposed here.
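As a minimal sketch of this pixel-count variant (the function name and mask representation are illustrative assumptions, not from the patent):

```python
import numpy as np

def face_area_ratio_from_mask(face_mask: np.ndarray) -> float:
    """Area ratio as (face pixels) / (total pixels).

    face_mask: boolean array with the same height/width as the image,
    True where a pixel belongs to the detected face region.
    """
    return float(face_mask.sum()) / face_mask.size
```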
104. If the area ratio of the face region in the image to be processed is greater than the ratio threshold, perform portrait segmentation on the image to be processed through a portrait segmentation model to obtain a portrait segmentation result.
It will be appreciated that the training data determined later is used to train a portrait segmentation model, so the larger the area ratio of the face region in the training data, the more valuable the data is for training such a model. Optionally, when the electronic device determines that the area ratio corresponding to the face region of the image to be processed is greater than the ratio threshold, portrait segmentation may be performed on the image to be processed through the portrait segmentation model.
Further, since the front camera of an electronic device is usually close to the user's face during shooting, the area ratio of the face region in a front-facing portrait image captured by the front camera is usually large. Accordingly, a front-facing portrait segmentation model used to segment such images generally needs to segment images whose face region occupies a relatively large area. To make the segmentation results of the front-facing portrait segmentation model more accurate, images to be processed whose face-region area ratio is greater than the ratio threshold can be used as its training data, which can improve the performance of the front-facing portrait segmentation model.
Optionally, the training data determined in step 106 may be used to train the front-facing portrait segmentation model, which processes front-facing images captured by the front camera to obtain the corresponding portrait segmentation results, so the accuracy of those results can be improved.
In practice, it has been found that the training process of a portrait segmentation model generally includes: training the model to be trained according to the training data and the portrait segmentation results corresponding to the training data until the model parameters meet the requirements, at which point training is determined to be finished. In the process of generating training data, the electronic device therefore also needs to determine the portrait segmentation result of the image to be processed.
In the related art, the portrait segmentation result corresponding to an image to be processed is generally determined by manual labeling, which is slow. In this regard, in the embodiments of the present application, the electronic device may perform portrait segmentation on the image to be processed through a trained portrait segmentation model to obtain the portrait segmentation result, thereby determining it more efficiently.
106. If the portrait segmentation result meets the precision requirement, determine the image to be processed as training data.
In the embodiments of the present application, the precision of the portrait segmentation result may refer to an error value between the portrait region determined by the portrait segmentation result and the actual portrait contour in the image to be processed. Optionally, if this error value is smaller than an error threshold, the portrait segmentation result can be determined to meet the precision requirement; conversely, if the error value is greater than or equal to the error threshold, the portrait segmentation result can be determined not to meet the precision requirement.
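The patent does not fix a concrete error metric; one simple choice consistent with the description above is the fraction of pixels on which the predicted portrait region and a reference contour disagree. A hedged sketch (the metric and threshold are illustrative assumptions):

```python
import numpy as np

def meets_precision(pred_mask: np.ndarray,
                    ref_mask: np.ndarray,
                    error_threshold: float = 0.05) -> bool:
    """Accept the segmentation when the pixel disagreement rate is below
    the error threshold; the patent only requires comparing 'an error
    value' against a threshold, so this metric is an assumption."""
    error = float(np.mean(pred_mask != ref_mask))  # mismatched-pixel fraction
    return error < error_threshold
```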
It can be understood that whether the portrait segmentation result of an image to be processed is accurate affects the performance of a model trained on that image, so the image to be processed is determined as training data only when the obtained portrait segmentation result is determined to meet the precision requirement. Further, the electronic device may store multiple frames of training data in a training data set to facilitate classifying and managing the data.
In another embodiment, if the portrait segmentation result obtained through the portrait segmentation model does not meet the precision requirement, the electronic device may determine the image to be processed as an image to be corrected. Optionally, the electronic device may input the image to be corrected into the portrait segmentation model again to obtain a new portrait segmentation result; if the new portrait segmentation result meets the precision requirement, the electronic device can determine the image to be corrected as training data.
In this way, the portrait segmentation result can be generated multiple times through the portrait segmentation model, avoiding chance errors, so that as much qualifying training data as possible is screened out and the efficiency of expanding the training data is improved.
In yet another embodiment, the electronic device may further output the image to be corrected and its corresponding portrait segmentation result, so that a worker can correct the portrait segmentation result according to the image to be corrected and obtain a corrected portrait segmentation result; when the electronic device receives the corrected portrait segmentation result and determines that it meets the precision requirement, it can determine the corresponding image to be corrected as training data.
In this way, portrait segmentation results that do not meet the precision requirement can be corrected manually, so that as much qualifying training data as possible is screened out and the efficiency of expanding the training data is improved. Putting steps 102 to 106 together with the correction branch, a minimal end-to-end sketch is shown below.
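The following sketch assumes placeholder interfaces `detect_face_ratio`, `segment_portrait`, and `meets_precision`; these names and the 0.2 ratio threshold are illustrative, not defined by the patent:

```python
def generate_training_data(images, detect_face_ratio, segment_portrait,
                           meets_precision, ratio_threshold=0.2):
    """Screen candidate images into (image, segmentation) training pairs.

    detect_face_ratio(img) -> area ratio of the face region in [0, 1]
    segment_portrait(img)  -> portrait segmentation result (e.g. a mask)
    meets_precision(seg)   -> True if the result passes the accuracy check
    """
    training_data, to_correct = [], []
    for img in images:
        if detect_face_ratio(img) <= ratio_threshold:
            continue                      # face region too small: skip
        seg = segment_portrait(img)       # step 104: portrait segmentation
        if meets_precision(seg):
            training_data.append((img, seg))   # step 106: accept
        else:
            to_correct.append(img)        # re-run or manual correction path
    return training_data, to_correct
```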
By implementing the method disclosed in the above embodiments, the face region of an image to be processed can be determined through the face detection model, together with the area ratio of the face region in the image; if the area ratio is greater than the ratio threshold, the face region in the image to be processed is comparatively large, so the image has value as training data for a portrait segmentation model. The image to be processed can then be segmented through the portrait segmentation model to obtain a portrait segmentation result, and the image is determined as training data only when that result meets the precision requirement. Training data can thus be generated quickly by models, improving the efficiency of expanding the training data of a portrait segmentation model; in addition, the generated training data has a large face ratio and an accurate portrait segmentation result, so the performance of a model subsequently trained on this data can also be improved.
Referring to fig. 2, fig. 2 is a flowchart of another training data generation method disclosed in an embodiment of the present application. The method may be applied to the above electronic device or another execution entity; no limitation is imposed here. The method may include the following steps:
202. Determine a face detection box in the image to be processed through the face detection model, where the face detection box frames the face region in the image to be processed.
Referring to fig. 3, fig. 3 is a schematic diagram illustrating a face detection box disclosed in an embodiment of the present application. Through the face detection model, the electronic device determines the face detection box 310 in the image 300 to be processed, shown as a rectangular box in fig. 3. It is understood that the face detection box may be rectangular, square, or circular, and fig. 3 should not be taken as limiting the embodiments of the present application. Further, the face detection box frames the face region.
Further, referring to fig. 4, fig. 4 is a schematic structural diagram of a face detection model disclosed in an embodiment of the present application. Optionally, the face detection model may include a feature extraction module 400, a region suggestion module 410, and a first pooling module 420. The feature extraction module 400 may include various convolution layers, convolution-and-pooling layers, or downsampling layers, and is used to extract the feature map of an image; the region suggestion module 410 may include a region proposal network (Region Proposal Network, RPN) used to generate candidate boxes for the feature map extracted by the feature extraction module 400; the first pooling module 420 may include an ROI pooling (Region Of Interest pooling) module used to pool the feature map with the generated candidate boxes so as to obtain a feature map meeting the size requirement.
Optionally, the electronic device may perform feature extraction on the image to be processed through the feature extraction module to obtain a first facial feature map; generate, through the region suggestion module, a plurality of pieces of candidate box information according to the first facial feature map, where the pieces of candidate box information respectively correspond to a plurality of feature points included in the first facial feature map, and each piece represents the probability that the region framed by the corresponding candidate box belongs to a face, together with the coordinate information of that candidate box; further, the electronic device may pool the first facial feature map and the candidate box information through the first pooling module to obtain a second facial feature map, and can then determine the face detection box in the image to be processed according to the second facial feature map.
In one embodiment, the face detection model may further include a fully connected layer (shown as 430 in fig. 4), and the electronic device may process the second facial feature map through the fully connected layer to obtain the face detection box in the image to be processed.
The fully connected layer comprehensively processes the image features included in the second facial feature map to obtain the face detection box. A schematic rendering of this module chain is sketched below.
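The following PyTorch-style sketch mirrors the chain of fig. 4 (backbone, RPN head, ROI pooling, fully connected head). Layer sizes are placeholders, and the step that selects candidate boxes from the RPN outputs is omitted, so this is an illustration of the structure rather than the patent's implementation:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class FaceDetector(nn.Module):
    """Schematic: feature extraction -> RPN head -> ROI pooling -> FC head."""
    def __init__(self):
        super().__init__()
        # Feature extraction module (stand-in for the CPB sub-modules).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Region suggestion head: per-location face probability + box coords.
        self.rpn_cls = nn.Conv2d(64, 1, 1)
        self.rpn_box = nn.Conv2d(64, 4, 1)
        # Fully connected head producing the final face detection box.
        self.fc = nn.Linear(64 * 7 * 7, 4)

    def forward(self, image, candidate_boxes):
        # candidate_boxes: list of per-image (K, 4) box tensors in image coords.
        feat = self.backbone(image)              # first facial feature map
        scores = self.rpn_cls(feat)              # candidate-box information
        coords = self.rpn_box(feat)              # (selection step omitted)
        # ROI pooling crops each candidate box to a fixed 7x7 map,
        # playing the role of the "second facial feature map".
        pooled = roi_pool(feat, candidate_boxes, output_size=(7, 7),
                          spatial_scale=0.25)    # backbone downsamples by 4
        box = self.fc(pooled.flatten(1))         # face detection box
        return box, scores, coords
```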
By implementing this method, the electronic device can quickly determine the face detection box in the image to be processed through the face detection model and then determine the face region through the face detection box, so the face region can be determined faster.
Referring again to fig. 4, in another embodiment, the feature extraction module 400 may include N sub-modules 4001, where N is an integer greater than or equal to 2. As shown in fig. 4, there may be 3 sub-modules 4001; in other embodiments there may be 5, 6, and so on, and fig. 4 should not be taken as limiting the embodiments herein. Optionally, a sub-module 4001 may include a convolution-and-pooling block (Convolution and Pooling Block, CPB) or another convolution module; no limitation is imposed here.
The electronic device can perform feature extraction on the image to be processed through the N sub-modules respectively, to obtain N sub-feature maps corresponding to different receptive fields. The receptive field is the size of the region on the input image from which a pixel on the feature map output by a layer of a convolutional neural network is computed, so sub-feature maps with different receptive fields contain different image information.
Further, the face detection model may also include a fusion module (shown as 440 in fig. 4), which may include various convolution layers. Optionally, the electronic device may concatenate the N sub-feature maps to obtain a concatenation result, and then fuse the concatenation result through the fusion module to obtain the first facial feature map.
By implementing this method, feature extraction can be performed on the image to be processed through the plurality of sub-modules respectively to obtain sub-feature maps corresponding to different receptive fields; the sub-feature maps can then be concatenated and the concatenation result fused to obtain a first facial feature map containing richer feature information, so the robustness of the face detection model can be improved. One way to realise this multi-branch extraction is sketched below.
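In this sketch, branches with different kernel sizes stand in for sub-modules with different receptive fields; the kernel sizes and channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiReceptiveFieldExtractor(nn.Module):
    """N parallel branches with different receptive fields; their outputs
    are concatenated and fused by a 1x1 convolution (the fusion module)."""
    def __init__(self, in_ch=3, branch_ch=16, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, k, padding=k // 2)
            for k in kernel_sizes)
        self.fuse = nn.Conv2d(branch_ch * len(kernel_sizes), branch_ch, 1)

    def forward(self, x):
        sub_maps = [branch(x) for branch in self.branches]  # N sub-feature maps
        return self.fuse(torch.cat(sub_maps, dim=1))  # first facial feature map
```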
In another optional embodiment, the electronic device may reduce the image to be processed to a first size and then input the image to be processed of the first size into the face detection model, so as to determine the face detection box in it through the face detection model. Since the size of the input image is reduced, the amount of computation of the electronic device can be reduced, and its power consumption can be lowered.
204. Determine a first area corresponding to the face detection box according to the coordinate information corresponding to the face detection box, and determine the area ratio of the face region in the image to be processed according to the first area and a second area corresponding to the image to be processed.
In the embodiments of the present application, the face detection model outputs, together with the face detection box, the coordinate information corresponding to the box. The electronic device can therefore determine the first area corresponding to the face detection box from this coordinate information and determine the second area corresponding to the image to be processed from the length and width of the image; the electronic device may then divide the first area by the second area to determine the area ratio of the face region in the image to be processed.
In one embodiment, the electronic device may determine a first length and a first width corresponding to the face detection box according to the coordinate information corresponding to the face detection box; it may then determine the area ratio of the face region in the image to be processed according to the first length and the first width of the face detection box, the second length and the second width of the image to be processed, and the following formula:
s = (w × h) / (W × H)
where s denotes the area ratio of the face region in the image to be processed, w the first length, h the first width, W the second length, and H the second width.
By implementing this method, the area of the face region can be approximated by the area of the face detection box, which speeds up obtaining the face-region area and therefore speeds up the subsequent determination of the area ratio of the face region in the image to be processed. In code, the formula reduces to a one-liner, as sketched below.
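A minimal sketch of the box-based ratio (the names are illustrative):

```python
def area_ratio(box_w: float, box_h: float, img_w: float, img_h: float) -> float:
    """s = (w * h) / (W * H): face-box area over whole-image area."""
    return (box_w * box_h) / (img_w * img_h)
```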
206. If the area ratio of the face region in the image to be processed is greater than the ratio threshold, perform portrait segmentation on the image to be processed through the portrait segmentation model to obtain a portrait segmentation result.
208. If the portrait segmentation result meets the precision requirement, determine the image to be processed as training data.
By implementing the method disclosed in the above embodiments, training data can be generated quickly by models, improving the efficiency of expanding the training data of a portrait segmentation model; the generated training data has a large face ratio and an accurate portrait segmentation result, so the performance of a model subsequently trained on this data can be improved. In addition, the face detection box can be determined quickly in the image to be processed through the face detection model and the face region then determined through the face detection box, so the face region can be determined faster; and feature extraction can be performed on the image to be processed through the plurality of sub-modules respectively to obtain sub-feature maps corresponding to different receptive fields, which are concatenated and fused into a first facial feature map containing richer feature information, improving the robustness of the face detection model.
Referring to fig. 5, fig. 5 is a flowchart of yet another training data generation method disclosed in an embodiment of the present application. The method may be applied to the above electronic device or another execution entity; no limitation is imposed here. The method may include the following steps:
502. Perform face detection on the image to be processed through a face detection model to obtain the face region of the image to be processed, and determine the area ratio of the face region in the image to be processed.
504. If the area ratio of the face region in the image to be processed is greater than the ratio threshold, perform feature extraction on the image to be processed through the encoder to obtain a first portrait feature map.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a portrait segmentation model disclosed in an embodiment of the present application. Optionally, the portrait segmentation model disclosed in the embodiments of the present application may include an encoder 600, a decoder 610, and a second pooling module 620. The encoder 600 may include, but is not limited to, convolution layers and downsampling layers, and is used to extract the features of an image to obtain its corresponding feature map; the decoder 610 may include, but is not limited to, convolution layers and upsampling layers, used to convolve or upsample the feature map generated by the encoder so as to recover its features and obtain the output of the model; the second pooling module 620 may include, but is not limited to, an atrous spatial pyramid pooling (Atrous Spatial Pyramid Pooling, ASPP) module or a spatial pyramid pooling (Spatial Pyramid Pooling, SPP) module, used to apply hole convolutions with different sampling rates to the feature map extracted by the encoder so as to obtain a feature map with richer spatial information.
Optionally, the electronic device may perform feature extraction on the image to be processed through the encoder to obtain the first portrait feature map.
In one embodiment, the encoder may include X self-calibration convolution layers, X being an integer greater than or equal to 2. Referring to fig. 6, the encoder may include 3 self-calibration convolution layers 6001; in other embodiments there may be 5, 6, and so on, and fig. 6 should not be taken as limiting the embodiments of the present application. A self-calibration convolution layer 6001 may include a self-calibrated convolution (Self Calibrated Convolution, SCC) network layer and is used to convolve the image to be processed so as to obtain a feature map with a larger receptive field. Optionally, the self-calibration convolution layer 6001 may be replaced by a transposed convolution layer, a dilated convolution layer, a separable convolution layer, a single-channel convolution layer, or a multi-channel convolution layer; no limitation is imposed here.
Referring again to fig. 6, optionally, the electronic device may perform feature extraction on the image to be processed through the first self-calibration convolution layer included in the encoder, to obtain the convolution feature map output by the first self-calibration convolution layer (shown as 6002 in fig. 6); the convolution feature map output by the Y-th self-calibration convolution layer can then be input into the (Y+1)-th self-calibration convolution layer, which performs feature extraction on it to obtain the convolution feature map output by the (Y+1)-th layer, where Y+1 is an integer greater than or equal to 2 and less than or equal to X. Because the encoder downsamples the image to be processed in the course of feature extraction, the image size of the convolution feature map output by the (Y+1)-th self-calibration convolution layer is smaller than that of the convolution feature map output by the Y-th layer.
Further, the convolution feature map output by the last (X-th) self-calibration convolution layer may be taken as the first portrait feature map (shown as 6003 in fig. 6).
By implementing this method, the image to be processed can be convolved multiple times through the self-calibration convolution layers to obtain a first portrait feature map with a larger receptive field, so the precision of the subsequently obtained portrait segmentation result can be improved. A schematic encoder in this spirit is sketched below.
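In the following sketch, plain strided convolutions stand in for the self-calibrated convolutions (whose internal calibration branch is omitted); the channel sizes are illustrative:

```python
import torch.nn as nn

class Encoder(nn.Module):
    """X downsampling stages; each stage halves the spatial size
    (a stand-in for the self-calibration convolution layers)."""
    def __init__(self, channels=(3, 32, 64, 128)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                          nn.ReLU())
            for cin, cout in zip(channels[:-1], channels[1:]))

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # kept for same-size skip fusion in the decoder
        return x, feats      # x: first portrait feature map (deepest stage)
```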
It should be noted that feature extraction by the encoder generally involves convolving the image to be processed with kernels of a certain size, so the larger the image to be processed, the more convolution operations are needed. Optionally, the portrait segmentation model shown in fig. 6 may further include a resizing module 630, which may include a learnable resizer used to adjust the input image to a target size. The electronic device can reduce the image size of the image to be processed through the resizing module to obtain an image to be processed of the target size, and can then perform feature extraction on it through the encoder to obtain the first portrait feature map.
By implementing this method, the size of the image to be processed can be reduced through the resizing module, which reduces the amount of computation of the portrait segmentation model and thus the power consumption of the electronic device. A minimal stand-in for the resizer is shown below.
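A fixed bilinear resize is used here as a stand-in for the learnable resizer (which would wrap such a resize with learnable convolutional branches); the target size is an illustrative value:

```python
import torch.nn.functional as F

def resize_to_target(image, target_size=(256, 256)):
    """Reduce a (B, C, H, W) image tensor to the target size before encoding."""
    return F.interpolate(image, size=target_size, mode='bilinear',
                         align_corners=False)
```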
506. Perform M hole convolutions on the first portrait feature map through the second pooling module to obtain M convolution results, and fuse the M convolution results to obtain a second portrait feature map, where the M hole convolutions correspond to different sampling rates and M is an integer greater than or equal to 2.
In the embodiments of the present application, the electronic device may perform, through the second pooling module, M hole convolutions at different sampling rates on the first portrait feature map to obtain M convolution results. Hole (dilated) convolution is a form of convolution that inserts holes into an ordinary convolution so as to enlarge the receptive field of the result; the sampling rate refers to the sampling interval over the first portrait feature map during the hole convolution.
The M convolution results can then be fused through the second pooling module to obtain the second portrait feature map.
By implementing this method, the second pooling module can apply hole convolutions with different sampling rates to the first portrait feature map to obtain M convolution results at different scales, and these can then be fused to obtain a second portrait feature map containing richer spatial information. An ASPP-style sketch of this block follows.
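The dilation rates below follow common ASPP practice and are illustrative, not taken from the patent:

```python
import torch
import torch.nn as nn

class HoleConvPooling(nn.Module):
    """Second pooling module: M dilated (hole) convolutions at different
    sampling rates, concatenated and fused by a 1x1 convolution."""
    def __init__(self, ch=128, rates=(1, 6, 12, 18)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in rates)
        self.fuse = nn.Conv2d(ch * len(rates), ch, 1)

    def forward(self, x):
        results = [conv(x) for conv in self.convs]  # M convolution results
        return self.fuse(torch.cat(results, dim=1))  # second portrait feature map
```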
508. Upsample the second portrait feature map through the decoder to obtain the portrait segmentation result.
In the embodiments of the present application, the resolution of the image is reduced after the electronic device downsamples the image to be processed through the encoder. To restore the resolution, the electronic device may upsample the second portrait feature map through the decoder to obtain a portrait segmentation result at the original resolution.
Referring again to fig. 6, the decoder 610 may optionally include P upsampling layers 6101, where P is an integer greater than or equal to 2. The number of upsampling layers 6101 is generally equal to the number of self-calibration convolution layers described above; accordingly, the decoder may include 3 upsampling layers 6101, and in other embodiments there may be 5, 6, and so on. Fig. 6 should not be taken as limiting the embodiments herein.
Referring again to fig. 6, optionally, the electronic device may upsample the second portrait feature map (shown as 6004 in fig. 6) through the first upsampling layer included in the decoder, according to a first convolution feature map (shown as 6003 in fig. 6), to obtain the upsampled feature map output by the first upsampling layer (shown as 6005 in fig. 6), where the first convolution feature map is the convolution feature map output by the encoder that has the same size as the second portrait feature map.
The upsampled feature map output by the Q-th upsampling layer can then be input into the (Q+1)-th upsampling layer, which upsamples it according to a second convolution feature map to obtain the upsampled feature map output by the (Q+1)-th layer, where Q+1 is an integer greater than or equal to 2 and less than or equal to P, and the second convolution feature map is the convolution feature map output by the encoder that has the same size as the upsampled feature map output by the Q-th upsampling layer.
Further, the portrait segmentation result may be generated according to the upsampled feature map output by the P-th upsampling layer (shown as 6006 in fig. 6).
By implementing this method, the convolution feature map output by the encoder that has the same size as the upsampled feature map can be fused in during the upsampling of the second portrait feature map, so the precision of the subsequently obtained portrait segmentation result can be improved. A hedged decoder sketch follows.
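The channel sizes here match the Encoder sketch above; the skip maps are the encoder's intermediate feature maps ordered from deep to shallow (for the Encoder above, `feats[-2::-1]`), and a final resize back to the input resolution is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """P upsampling stages; each doubles the resolution, concatenates the
    same-size encoder feature map (skip connection), then convolves."""
    def __init__(self, enc_channels=(128, 64, 32)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(deep + skip, skip, 3, padding=1)
            for deep, skip in zip(enc_channels[:-1], enc_channels[1:]))
        self.head = nn.Conv2d(enc_channels[-1], 1, 1)  # portrait mask logits

    def forward(self, x, skips):
        # skips: encoder maps from deep to shallow, each matching the size
        # of x after the corresponding doubling of resolution.
        for conv, skip in zip(self.convs, skips):
            x = F.interpolate(x, scale_factor=2, mode='bilinear',
                              align_corners=False)
            x = conv(torch.cat([x, skip], dim=1))
        return self.head(x)  # portrait segmentation result
```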
510. If the portrait segmentation result meets the precision requirement, determine the image to be processed as training data.
By implementing the method disclosed in the above embodiments, training data can be generated quickly by models, improving the efficiency of expanding the training data of a portrait segmentation model; the generated training data has a large face ratio and an accurate portrait segmentation result, so the performance of a model subsequently trained on this data can be improved. In addition, the image to be processed can be convolved multiple times through the self-calibration convolution layers to obtain a first portrait feature map with a larger receptive field, improving the precision of the subsequently obtained portrait segmentation result; the size of the image to be processed can be reduced through the resizing module, reducing the amount of computation of the portrait segmentation model and thus the power consumption of the electronic device; hole convolutions with different sampling rates can be applied to the first portrait feature map through the second pooling module to obtain M convolution results at different scales, which are fused into a second portrait feature map containing richer spatial information; and the encoder's same-size convolution feature maps can be fused in during the upsampling of the second portrait feature map, improving the precision of the subsequently obtained portrait segmentation result.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a training data generation apparatus disclosed in an embodiment of the present application, which may be applied to the above-mentioned electronic device or another execution entity; no limitation is imposed here. The apparatus may include a first determining unit 701, a segmentation unit 702, and a second determining unit 703, wherein:
the first determining unit 701 is configured to perform face detection on an image to be processed through a face detection model to obtain the face region of the image to be processed, and to determine the area ratio of the face region in the image to be processed;
the segmentation unit 702 is configured to perform portrait segmentation on the image to be processed through a portrait segmentation model when the area ratio of the face region in the image to be processed is greater than a ratio threshold, to obtain a portrait segmentation result;
the second determining unit 703 is configured to determine the image to be processed as training data when the portrait segmentation result meets the precision requirement.
By implementing this apparatus, the face region of an image to be processed can be determined through the face detection model, together with the area ratio of the face region in the image; if the area ratio is greater than the ratio threshold, the face region in the image to be processed is comparatively large, so the image has value as training data for a portrait segmentation model. The image to be processed can then be segmented through the portrait segmentation model to obtain a portrait segmentation result, and the image is determined as training data only when that result meets the precision requirement. Therefore, training data can be generated quickly by models, improving the efficiency of expanding the training data of a portrait segmentation model; in addition, the generated training data has a large face ratio and an accurate portrait segmentation result, so the performance of a model subsequently trained on this data can also be improved.
As an optional implementation, the first determining unit 701 is further configured to determine, through the face detection model, a face detection box in the image to be processed, where the face detection box frames the face region in the image to be processed; and to determine a first area corresponding to the face detection box according to the coordinate information of the box, and determine the area ratio of the face region in the image to be processed according to the first area and a second area corresponding to the image to be processed.
By implementing this apparatus, the area of the face region can be approximated by the area of the face detection box, which speeds up obtaining the face-region area and therefore speeds up determining the area ratio of the face region in the image to be processed.
As an optional implementation, the face detection model includes a feature extraction module, a region suggestion module, and a first pooling module. The first determining unit 701 is further configured to: perform feature extraction on the image to be processed through the feature extraction module to obtain a first facial feature map; generate, through the region suggestion module, a plurality of pieces of candidate box information according to the first facial feature map, where the pieces of candidate box information correspond to a plurality of feature points included in the first facial feature map, and each piece represents the probability that the region framed by the corresponding candidate box belongs to a face, together with the coordinate information of that candidate box; pool the first facial feature map and the candidate box information through the first pooling module to obtain a second facial feature map; and determine the face detection box in the image to be processed according to the second facial feature map.
By implementing this apparatus, the electronic device can quickly determine the face detection box in the image to be processed through the face detection model and then determine the face region through the face detection box, so the face region can be determined faster.
As an optional implementation, the feature extraction module includes N sub-modules, where N is an integer greater than or equal to 2. The first determining unit 701 is further configured to perform feature extraction on the image to be processed through the N sub-modules respectively, to obtain N sub-feature maps corresponding to different receptive fields, and to concatenate the N sub-feature maps to obtain the first facial feature map.
By implementing this apparatus, feature extraction can be performed on the image to be processed through the plurality of sub-modules respectively to obtain sub-feature maps corresponding to different receptive fields; the sub-feature maps can then be concatenated and the concatenation result fused to obtain a first facial feature map containing richer feature information, so the robustness of the face detection model can be improved.
As an optional implementation, the portrait segmentation model includes an encoder, a decoder, and a second pooling module. The segmentation unit 702 is further configured to: perform feature extraction on the image to be processed through the encoder to obtain a first portrait feature map; perform M hole convolutions at different sampling rates on the first portrait feature map through the second pooling module to obtain M convolution results, and fuse the M convolution results to obtain a second portrait feature map, where M is an integer greater than or equal to 2; and upsample the second portrait feature map through the decoder to obtain the portrait segmentation result.
By implementing this apparatus, the second pooling module can apply hole convolutions with different sampling rates to the first portrait feature map to obtain M convolution results at different scales, and these can then be fused to obtain a second portrait feature map containing richer spatial information.
As an optional implementation, the portrait segmentation model further includes a resizing module. The segmentation unit 702 is further configured to reduce, through the resizing module, the image size of the image to be processed to obtain an image to be processed of the target size, and to perform feature extraction on the image of the target size through the encoder to obtain the first portrait feature map.
By implementing this apparatus, the size of the image to be processed can be reduced through the resizing module, which reduces the amount of computation of the portrait segmentation model and thus the power consumption of the electronic device.
As an alternative embodiment, the encoder includes X self-calibrating convolutional layers, X being an integer greater than or equal to 2; and a segmentation unit 702, configured to perform feature extraction on an image to be processed through a first self-calibration convolution layer included in the encoder, so as to obtain a convolution feature map output by the first self-calibration convolution layer; inputting the convolution feature map output by the Y self-calibration convolution layer into the Y+1 self-calibration convolution layer, and extracting features of the convolution feature map output by the Y self-calibration convolution layer through the Y+1 self-calibration convolution layer to obtain the convolution feature map output by the Y+1 self-calibration convolution layer, wherein Y is an integer which is greater than or equal to 2 and less than or equal to X, and the image size of the convolution feature map output by the Y+1 self-calibration convolution layer is smaller than the convolution feature map output by the Y self-calibration convolution layer; and taking the convolution characteristic diagram output by the Mth self-calibration convolution layer as a first human image characteristic diagram.
By implementing the device, the self-calibration convolution layer can be used for carrying out convolution processing on the image to be processed for multiple times to obtain the first portrait characteristic diagram with a larger receptive field, so that the precision of a subsequent obtained portrait segmentation result can be improved.
As an alternative embodiment, the decoder includes P up-sampling layers, P being an integer greater than or equal to 2; and the segmentation unit 702 is configured to upsample the second portrait feature map according to a first convolution feature map through the first up-sampling layer included in the decoder to obtain the up-sampling feature map output by the first up-sampling layer, where the first convolution feature map is the convolution feature map output by the encoder that has the same size as the second portrait feature map; input the up-sampling feature map output by the Qth up-sampling layer into the (Q+1)th up-sampling layer, and upsample the up-sampling feature map output by the Qth up-sampling layer according to a second convolution feature map through the (Q+1)th up-sampling layer to obtain the up-sampling feature map output by the (Q+1)th up-sampling layer, where Q is an integer greater than or equal to 1 and less than P, and the second convolution feature map is the convolution feature map output by the encoder that has the same size as the up-sampling feature map output by the Qth up-sampling layer; and generate a portrait segmentation result according to the up-sampling feature map output by the Pth up-sampling layer.
By implementing this device, in the process of upsampling the second portrait feature map layer by layer in the decoder, the encoder's convolution feature map of the same size can be fused into each up-sampling feature map, which improves the precision of the subsequently obtained portrait segmentation result.
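A sketch of one such up-sampling step, U-Net style, under the assumption that fusing the same-size encoder feature map means concatenation followed by a convolution (the patent does not fix the fusion operation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleLayer(nn.Module):
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        # Upsample to the spatial size of the same-size encoder feature map,
        # then fuse the two by concatenation + convolution.
        x = F.interpolate(x, size=skip.shape[2:], mode="bilinear", align_corners=False)
        return self.conv(torch.cat([x, skip], dim=1))

# After the Pth up-sampling layer, a 1x1 convolution plus sigmoid can turn the
# final feature map into a per-pixel portrait probability mask.
head = nn.Sequential(nn.Conv2d(64, 1, kernel_size=1), nn.Sigmoid())
```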
As an alternative embodiment, the apparatus shown in fig. 7 further comprises a third determining unit, not shown, wherein:
and the third determining unit is configured to determine the image to be processed as an image to be corrected if the portrait segmentation result does not meet the precision requirement after the portrait segmentation processing is performed on the image to be processed through the portrait segmentation model to obtain the portrait segmentation result.
By implementing this device, portrait segmentation results that do not meet the precision requirement can be corrected manually, so that as many qualifying images as possible are screened into training data, improving the expansion efficiency of the training data.
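Putting the pieces together, the screening loop of this application can be sketched as follows; detect_faces, segment_portrait, and meets_precision stand in for the face detection model, the portrait segmentation model, and the precision check, none of which are given as code in the patent, and the 0.05 threshold is purely illustrative.

```python
def expand_training_data(images, ratio_threshold: float = 0.05):
    """Screen candidate images into training data for a portrait segmentation model."""
    training_data, to_correct = [], []
    for image in images:
        boxes = detect_faces(image)               # face detection model (stand-in)
        height, width = image.shape[:2]
        face_area = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes)
        if face_area / (height * width) <= ratio_threshold:
            continue                              # face area too small: skip the image
        mask = segment_portrait(image)            # portrait segmentation model (stand-in)
        if meets_precision(image, mask):
            training_data.append((image, mask))   # usable as training data as-is
        else:
            to_correct.append((image, mask))      # route to manual correction
    return training_data, to_correct
```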
Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 8, the electronic device may include:
a memory 801 storing executable program code;
a processor 802 coupled to the memory 801;
The processor 802 invokes the executable program code stored in the memory 801 to execute the training data generation method disclosed in the above embodiments.
The embodiment of the application discloses a computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute the training data generation method disclosed in the above embodiments.
The embodiment of the application also discloses an application publishing platform, wherein the application publishing platform is configured to publish a computer program product, and the computer program product, when run, causes a computer to perform part or all of the steps of the methods in the above method embodiments.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments and that the acts and modules referred to are not necessarily required in the present application.
In the various embodiments of the present application, it should be understood that the order of the sequence numbers of the above processes does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units described above, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-accessible memory. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like, and in particular may be a processor in the computer device) to perform all or part of the steps of the methods of the various embodiments of the present application.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing associated hardware. The program may be stored in a computer-readable storage medium, including a read-only memory (ROM), a random access memory (RAM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), a one-time programmable read-only memory (OTPROM), an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
The foregoing describes in detail a training data generation method and apparatus, an electronic device, and a storage medium. Specific examples have been used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, those skilled in the art may make modifications to the specific embodiments and the scope of application in accordance with the ideas of the present application. In view of the above, the contents of this specification should not be construed as limiting the present application.

Claims (12)

1. A method of generating training data, the method comprising:
performing face detection on an image to be processed through a face detection model to obtain a face area of the image to be processed, and determining an area ratio of the face area in the image to be processed;
if the area ratio is greater than a ratio threshold, performing portrait segmentation processing on the image to be processed through a portrait segmentation model to obtain a portrait segmentation result;
and if the portrait segmentation result meets the precision requirement, determining the image to be processed as training data.
2. The method according to claim 1, wherein the performing face detection on the image to be processed through the face detection model to obtain the face area of the image to be processed, and determining the area ratio of the face area in the image to be processed, comprises:
determining a face detection frame in an image to be processed through a face detection model, wherein the face detection frame is used for selecting a face area in the image to be processed;
and determining a first area corresponding to the face detection frame according to the coordinate information corresponding to the face detection frame, and determining the area ratio of the face area in the image to be processed according to the first area and a second area corresponding to the image to be processed.
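Claim 2 amounts to a ratio of two box areas; a minimal sketch, assuming detection frames in (x1, y1, x2, y2) pixel coordinates:

```python
def face_area_ratio(box, image_width: int, image_height: int) -> float:
    x1, y1, x2, y2 = box
    first_area = max(0, x2 - x1) * max(0, y2 - y1)  # area of the face detection frame
    second_area = image_width * image_height        # area of the image to be processed
    return first_area / second_area

# e.g. a 200x300 pixel frame in a 1080x1920 image:
# face_area_ratio((100, 100, 300, 400), 1080, 1920) -> ~0.029
```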
3. The method of claim 2, wherein the face detection model comprises a feature extraction module, a region suggestion module, and a first pooling module; the step of determining the face detection frame in the image to be processed through the face detection model comprises the following steps:
extracting features of the image to be processed through the feature extraction module to obtain a first facial feature map;
generating, by the region suggestion module, a plurality of pieces of candidate frame information according to the first facial feature map, wherein the plurality of pieces of candidate frame information respectively correspond to a plurality of feature points included in the first facial feature map, and each piece of candidate frame information represents the probability that the region selected by the corresponding candidate frame belongs to a face, together with the coordinate information of the corresponding candidate frame;
performing pooling processing on the first facial feature map and the plurality of pieces of candidate frame information through the first pooling module to obtain a second facial feature map;
and determining a face detection frame in the image to be processed according to the second facial feature map.
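The flow of claim 3 matches a two-stage detector in the spirit of Faster R-CNN; the sketch below assumes PyTorch and torchvision, with illustrative channel and anchor counts.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class RegionSuggestion(nn.Module):
    """Per-feature-point candidate frame information: face probability + coordinates."""
    def __init__(self, ch: int, anchors_per_point: int = 9):
        super().__init__()
        self.score = nn.Conv2d(ch, anchors_per_point, 1)       # probability per candidate
        self.coords = nn.Conv2d(ch, anchors_per_point * 4, 1)  # coordinates per candidate

    def forward(self, feat: torch.Tensor):
        return self.score(feat), self.coords(feat)

feat = torch.randn(1, 256, 50, 50)                  # first facial feature map
scores, coords = RegionSuggestion(256)(feat)        # candidate frame information
boxes = [torch.tensor([[10.0, 10.0, 40.0, 40.0]])]  # one candidate frame (illustrative)
pooled = roi_pool(feat, boxes, output_size=7)       # pooled second facial feature map
```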
4. The method according to claim 3, wherein the feature extraction module comprises N sub-modules, N being an integer greater than or equal to 2; the feature extraction of the image to be processed by the feature extraction module to obtain the first facial feature map comprises:
performing feature extraction on the image to be processed through the N sub-modules respectively to obtain N sub-feature maps corresponding to different receptive fields;
and performing stitching processing on the N sub-feature maps to obtain a stitching result, and performing fusion processing on the stitching result to obtain a first facial feature map.
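One plausible reading of claim 4's N sub-modules, shown below, is an Inception-style block: parallel convolutions with growing kernel sizes (hence growing receptive fields), stitched by channel concatenation and fused by a 1x1 convolution. The exact sub-module design is an assumption.

```python
import torch
import torch.nn as nn

class MultiReceptiveField(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernels=(1, 3, 5, 7)):
        super().__init__()
        # N sub-modules: one convolution per receptive field; odd kernels with
        # padding k // 2 keep the spatial size identical across branches.
        self.subs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in kernels
        )
        self.fuse = nn.Conv2d(out_ch * len(kernels), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        stitched = torch.cat([sub(x) for sub in self.subs], dim=1)  # stitching result
        return self.fuse(stitched)                                  # first facial feature map
```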
5. The method of claim 1, wherein the portrait segmentation model comprises an encoder, a decoder, and a second pooling module; the step of performing the portrait segmentation processing on the image to be processed through the portrait segmentation model to obtain the portrait segmentation result comprises:
performing feature extraction on the image to be processed through the encoder to obtain a first portrait feature map;
performing M dilated convolutions on the first portrait feature map through the second pooling module to obtain M convolution results, and performing fusion processing on the M convolution results to obtain a second portrait feature map, wherein the M dilated convolutions correspond to different sampling rates, and M is an integer greater than or equal to 2;
and upsampling the second portrait feature map through the decoder to obtain a portrait segmentation result.
6. The method of claim 5, wherein the portrait segmentation model further comprises a resizing module; the step of performing feature extraction on the image to be processed through the encoder to obtain the first portrait feature map comprises:
reducing the image size of the image to be processed through the resizing module to obtain the image to be processed at a target size;
and performing feature extraction on the image to be processed at the target size through the encoder to obtain the first portrait feature map.
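Claim 6's resizing step reduces computation roughly quadratically with the scale factor; as a sketch, bilinear resampling is one common choice, though the patent does not name a specific resampling method:

```python
import torch
import torch.nn.functional as F

def resize_to_target(image_tensor: torch.Tensor, target_hw=(256, 256)) -> torch.Tensor:
    """Resize an (N, C, H, W) image batch to the target size before encoding."""
    return F.interpolate(image_tensor, size=target_hw, mode="bilinear", align_corners=False)
```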
7. The method of claim 5, wherein the encoder comprises X self-calibration convolution layers, X being an integer greater than or equal to 2; the step of performing feature extraction on the image to be processed through the encoder to obtain the first portrait feature map comprises:
performing feature extraction on the image to be processed through a first self-calibration convolution layer included in the encoder to obtain a convolution feature map output by the first self-calibration convolution layer;
inputting the convolution feature map output by a Yth self-calibration convolution layer into a (Y+1)th self-calibration convolution layer, and performing feature extraction on the convolution feature map output by the Yth self-calibration convolution layer through the (Y+1)th self-calibration convolution layer to obtain a convolution feature map output by the (Y+1)th self-calibration convolution layer, wherein Y is an integer greater than or equal to 1 and less than X, and the image size of the convolution feature map output by the (Y+1)th self-calibration convolution layer is smaller than that of the convolution feature map output by the Yth self-calibration convolution layer;
and taking the convolution feature map output by the Xth self-calibration convolution layer as the first portrait feature map.
8. The method of claim 7, wherein the decoder comprises P up-sampling layers, P being an integer greater than or equal to 2; the step of upsampling the second portrait feature map through the decoder to obtain the portrait segmentation result comprises:
upsampling the second portrait feature map according to a first convolution feature map through a first up-sampling layer included in the decoder to obtain an up-sampling feature map output by the first up-sampling layer, wherein the first convolution feature map is a convolution feature map output by the encoder and having the same size as the second portrait feature map;
inputting the up-sampling feature map output by a Qth up-sampling layer into a (Q+1)th up-sampling layer, and upsampling the up-sampling feature map output by the Qth up-sampling layer according to a second convolution feature map through the (Q+1)th up-sampling layer to obtain an up-sampling feature map output by the (Q+1)th up-sampling layer, wherein Q is an integer greater than or equal to 1 and less than P, and the second convolution feature map is a convolution feature map output by the encoder and having the same size as the up-sampling feature map output by the Qth up-sampling layer;
and generating a portrait segmentation result according to the up-sampling feature map output by the Pth up-sampling layer.
9. The method according to any one of claims 1 to 8, wherein after the portrait segmentation processing is performed on the image to be processed through the portrait segmentation model to obtain the portrait segmentation result, the method further comprises:
and if the portrait segmentation result does not meet the precision requirement, determining the image to be processed as an image to be corrected.
10. A training data generation apparatus, the apparatus comprising:
a first determining unit, configured to perform face detection on an image to be processed through a face detection model, obtain a face area of the image to be processed, and determine an area ratio of the face area in the image to be processed;
a segmentation unit, configured to perform portrait segmentation processing on the image to be processed through a portrait segmentation model when the area ratio is greater than a ratio threshold, to obtain a portrait segmentation result;
and a second determining unit, configured to determine the image to be processed as training data when the portrait segmentation result meets the precision requirement.
11. An electronic device comprising a memory storing executable program code, and a processor coupled to the memory; wherein the processor invokes the executable program code stored in the memory to perform the method of any one of claims 1-9.
12. A computer readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1 to 9.
CN202111332709.8A 2021-11-11 2021-11-11 Training data generation method and device, electronic equipment and storage medium Pending CN116109646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111332709.8A CN116109646A (en) 2021-11-11 2021-11-11 Training data generation method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN116109646A 2023-05-12

Family

ID=86253197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111332709.8A Pending CN116109646A (en) 2021-11-11 2021-11-11 Training data generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116109646A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination