CN108985181A - End-to-end face labeling method based on detection and segmentation - Google Patents

End-to-end face labeling method based on detection and segmentation

Info

Publication number
CN108985181A
CN108985181A (application CN201810654160.6A); granted publication CN108985181B
Authority
CN
China
Prior art keywords
face
segmentation
sub-network
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810654160.6A
Other languages
Chinese (zh)
Other versions
CN108985181B (en)
Inventor
温世平 (Wen Shiping)
董明辉 (Dong Minghui)
黄廷文 (Huang Tingwen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201810654160.6A
Publication of CN108985181A
Application granted
Publication of CN108985181B
Legal status: Expired - Fee Related

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an end-to-end face labeling method based on detection and segmentation, comprising: labeling the regions to be detected in every face image of a face image training set; constructing an end-to-end face segmentation and labeling neural network model composed of a shared feature module, a face component detection module, an ROI feature extraction module, three sub-network modules, and a three-class network module; training the end-to-end face segmentation and labeling neural network model on the face image training set, with all modules trained synchronously; and using the trained face segmentation and labeling neural network model to segment and label test face images. The invention obtains shared features at different scales through stage-by-stage deconvolution upsampling and high/low-level feature fusion, capturing more detail information; it improves segmentation accuracy by designing dedicated sub-networks for semantic classes that occupy small areas; and it improves the expressive power of the whole model through unified training.

Description

End-to-end face labeling method based on detection and segmentation
Technical field
The invention belongs to the field of artificial intelligence and image information processing, and more particularly relates to an end-to-end face labeling method based on detection and segmentation.
Background art
Face labeling refers to performing region segmentation on an image containing a face; depending on the application, the image is divided into 3 classes (hair, background, face) or more classes (the face region is further subdivided). As front-end processing, face labeling has been applied in face recognition, virtual makeup, face swapping, and similar applications, where the accuracy of face labeling plays a crucial role in the quality of the downstream task.
With the development of deep learning and the growth of big-data computing power, researchers have applied deep learning to face labeling, greatly improving its accuracy. Deep-learning face labeling methods fall mainly into sliding-window methods and methods based on fully convolutional neural networks; because sliding-window methods are procedurally complex and offer no accuracy advantage over fully convolutional networks, most existing deep-learning face labeling methods are based on fully convolutional neural networks.
Patent CN105354565A discloses a method for locating and discriminating facial features based on a fully convolutional network: face images are collected and manually annotated to form a training set, a fully convolutional network is trained on the training set, and a face image to be tested is input into the trained network, which outputs pixel-level face segmentation results together with facial feature localization and labeling. Although it discloses a facial feature labeling method based on fully convolutional networks, it does not consider semantic classes that occupy small areas in the samples, so regions such as the eyes, mouth, and nose are labeled inaccurately.
Patent CN107729819A discloses a face labeling method based on sparse fully convolutional neural networks: face images are collected and the hair, skin, and background are manually annotated to form a training set; a model is built from a fully convolutional semantic segmentation method and a group-lasso sparsification method and trained on the training set; a face image to be tested is then input into the trained model to complete pixel-level labeling of hair, skin, and background. Although it discloses a face labeling algorithm based on fully convolutional networks, it does not consider segmentation of the facial features, so its application scenarios are not broad enough.
Summary of the invention
In view of the defects of the prior art, the object of the invention is to solve the low accuracy and poor robustness of face segmentation methods based on hand-crafted features and the output homogenization problem of traditional face segmentation methods based on fully convolutional neural networks; the designed method also handles well the imbalanced class distribution present in face segmentation tasks.
To achieve the above object, an embodiment of the invention provides an end-to-end face labeling method based on detection and segmentation, comprising the following steps:
S1. Label the regions to be detected in every face image of a face image training set;
S2. Construct an end-to-end face segmentation and labeling neural network model composed of a shared feature module, a face component detection module, an ROI feature extraction module, three sub-network modules, and a three-class network module, wherein the shared feature module takes a face image as input and outputs shared feature maps; the face component detection module takes the shared feature maps as input and outputs position box information for each region to be detected; the ROI feature extraction module takes the shared feature maps and the position box information as input and outputs ROI sub-feature maps; each sub-network module takes an ROI sub-feature map as input and outputs the segmentation result of its sub-region; the three-class network module takes the shared feature maps as input and outputs the segmentation results for hair, skin, and background;
S3. Train the end-to-end face segmentation and labeling neural network model on the face image training set, with all modules trained synchronously;
S4. Use the trained face segmentation and labeling neural network model to segment and label a test face image; the fusion of the three-class network's segmentation result with the sub-networks' segmentation results is the segmentation and labeling result.
Specifically, the regions to be detected in step S1 include: left brow, right brow, left eye, right eye, nose, upper lip, inner mouth, lower lip, skin, hair, and background.
Specifically, the shared feature module in step S2 adopts an encoder-decoder structure: the encoder converts the face image into feature maps C1-CN through a fully convolutional network, and the decoder obtains shared feature maps P1-PN from C1-CN through stage-by-stage deconvolution upsampling and high/low-level feature fusion.
Specifically, the face component detection module in step S2 outputs the position box information of the different components.
Specifically, after obtaining an ROI position box, the ROI feature extraction module in step S2 crops the ROI feature from each of the shared feature maps P1-PN, warps the crops to a common size, concatenates them along the channel dimension, and transforms the result to a common channel dimension to obtain the ROI sub-feature map.
Specifically, the three sub-network modules comprise an eyes-plus-brows sub-network module, a nose sub-network module, and a mouth sub-network module; each further processes its corresponding features and decodes the segmentation result of its sub-region. The nose sub-network divides its sub-region into 2 semantic classes (nose, background); the mouth sub-network divides the mouth region into 4 semantic classes (upper lip, inner mouth, lower lip, background); the eyes-plus-brows sub-network divides the eyes-plus-brows region into 3 semantic classes (eye, brow, background).
Specifically, the three-class network module raises the feature maps to the input image dimensions through deconvolution and convolution layers and outputs the segmentation results for the skin, hair, and background regions.
Specifically, the optimization loss function of the training process is defined as:
L_all = L_seg + L_det + L_reg
where L_all is the overall optimization loss; L_reg is a regularization loss used to alleviate model overfitting; L_det is the face component detection module loss, comprising a classification loss and a position regression loss; and L_seg is the segmentation loss, defined as:
L_seg = L_seg^{3cls} + L_seg^{sub1} + L_seg^{sub2} + L_seg^{sub3}
where L_seg^{3cls} is the three-class segmentation loss and L_seg^{sub1}, L_seg^{sub2}, L_seg^{sub3} are the segmentation losses of the three sub-networks; all 4 segmentation losses adopt the cross-entropy loss.
In general, compared with the prior art, the above technical solutions conceived by the invention have the following beneficial effects:
(1) The invention obtains shared features through stage-by-stage deconvolution upsampling and high/low-level feature fusion. Compared with traditional fully convolutional network models, it captures more detailed contour information, alleviating the homogenization problem. Moreover, since contour information at different scales is used, the network model can learn multi-scale information from the data.
(2) The invention adopts a sub-network design, dedicating a sub-network to each semantic class that occupies a small area in the samples, which avoids losing that class's information during training and testing. This design greatly alleviates the loss of segmentation accuracy caused by the imbalanced class distribution.
(3) All modules of the invention can be trained end to end in a unified manner, without stage-wise or module-wise training. During unified training the modules share feature information, which improves the expressive power of the whole model.
Brief description of the drawings
Fig. 1 is a flowchart of the end-to-end face labeling method based on detection and segmentation provided by the invention.
Fig. 2 is a schematic diagram of the working principle of the ROI feature extraction module provided by the invention.
Fig. 3 is a flow diagram of the mouth region sub-network provided by the invention.
Fig. 4 is a schematic diagram of segmentation results of the end-to-end face labeling method based on detection and segmentation provided by the invention.
Detailed description of the embodiments
In order to make the objectives, technical solutions, and advantages of the invention clearer, the invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are merely illustrative of the invention and do not limit it.
Fig. 1 is a flowchart of the end-to-end face labeling method based on detection and segmentation provided by the invention. As shown in Fig. 1, the method comprises the following steps:
S1. Label the regions to be detected in every face image of a face image training set;
S2. Construct an end-to-end face segmentation and labeling neural network model composed of a shared feature module, a face component detection module, an ROI feature extraction module, three sub-network modules, and a three-class network module;
S3. Train the end-to-end face segmentation and labeling neural network model on the face image training set, with all modules trained synchronously;
S4. Use the trained face segmentation and labeling neural network model to segment and label a test face image.
The regions to be detected in step S1 include: left brow, right brow, left eye, right eye, nose, upper lip, inner mouth, lower lip, skin, hair, and background.
The shared feature module in step S2 takes a face image as input and outputs shared feature maps; the face component detection module takes the shared feature maps as input and outputs position box information for the left-eye-plus-left-brow region, the right-eye-plus-right-brow region, the nose region, the mouth region, and the face region; the ROI feature extraction module takes the shared feature maps and the position boxes as input and outputs ROI sub-feature maps; each sub-network module takes an ROI sub-feature map as input and outputs the segmentation result of its sub-region; the three-class network module takes the shared feature maps as input and outputs the segmentation results for hair, skin, and background.
The shared feature module in step S2 adopts an encoder-decoder structure: the encoder converts the face image into feature maps C1-CN through a fully convolutional network, and the decoder obtains shared feature maps P1-PN from C1-CN through stage-by-stage deconvolution upsampling and high/low-level feature fusion.
The stage-by-stage deconvolution upsampling and feature fusion of C1-CN proceeds as follows: first, C1-CN are unified to the same channel dimension K, and the unified CN becomes the shared feature map PN; then, for i = N, ..., 2, the level-i feature map is deconvolved, added to the adjacent feature map C(i-1), and the sum is unified to channel dimension K to obtain the shared feature map P(i-1), finally yielding the shared feature maps P1-PN.
Specifically, the encoder uses a Res50 network module. Res50, proposed by Kaiming He et al., is a convolutional neural network model with residual connections that has strong feature representation and information propagation capabilities. The last convolutional layer of each of the 5 convolution blocks in Res50 is denoted C1-C5 respectively. The decoder first applies a 1×1 convolution kernel to each of these 5 layers to uniformly reduce the channel dimension to 256; then, starting from C5, a stride-2 deconvolution doubles the feature map's spatial scale stage by stage, and the result is added to the corresponding next-level feature map. For example, C5 is upsampled 2× and added channel-wise to C4, and a 3×3 convolution kernel with channel dimension 256 is then applied to the sum. This yields the final shared feature maps P1-P5, as sketched below.
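For illustration only, a minimal PyTorch sketch of the decoder described above; the ResNet-50 channel counts (64, 256, 512, 1024, 2048) and the module interface are assumptions, not the patented implementation:

    import torch.nn as nn

    class SharedFeatureDecoder(nn.Module):
        """Top-down decoder: unify C1-C5 to 256 channels with 1x1 convolutions,
        double the deeper map with a stride-2 transposed convolution, add it to
        the next-shallower map, and smooth the sum with a 3x3 convolution."""
        def __init__(self, in_channels=(64, 256, 512, 1024, 2048), dim=256):
            super().__init__()
            self.lateral = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in in_channels)
            self.smooth = nn.ModuleList(
                nn.Conv2d(dim, dim, 3, padding=1) for _ in in_channels[:-1])
            self.up = nn.ConvTranspose2d(dim, dim, 2, stride=2)  # 2x upsampling

        def forward(self, c_feats):            # [C1, ..., C5], shallow to deep
            lat = [l(c) for l, c in zip(self.lateral, c_feats)]
            p = [None] * len(lat)
            p[-1] = lat[-1]                    # P5 is the unified C5
            for i in range(len(lat) - 2, -1, -1):
                p[i] = self.smooth[i](lat[i] + self.up(p[i + 1]))
            return p                           # [P1, ..., P5]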
The face component detection module in step S2 uses the Faster R-CNN structure as its detector. The detector outputs the position box information of the different components; a position box represents the rectangular region containing the target with 5 numbers, the format comprising the x and y coordinates of the rectangle's upper-left corner and the rectangle's height and width. No features are extracted for the face region; the purpose of detecting the face region is to give the detector a region of emphasis when detecting the face components, improving the model's accuracy on face component detection. The position boxes of the other four region classes are used to extract the corresponding features.
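The patent names Faster R-CNN as the detector; as a hedged illustration, a stock torchvision detector could stand in for the face component detector. The class count (5 component classes plus background) follows the regions listed above; everything else is an assumption:

    import torchvision

    # Stand-in for the face component detector: left eye + brow,
    # right eye + brow, nose, mouth, and face, plus background.
    detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        weights=None, num_classes=6)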
After obtaining an ROI position box, the ROI feature extraction module in step S2 crops the ROI feature from each of the shared feature maps P1-PN, warps the crops to a common size, concatenates them along the channel dimension, and transforms the result to a common channel dimension to obtain the ROI sub-feature map.
Specifically, high-level features generally contain more semantic information while low-level features generally contain more detail information, and both kinds of information matter in face segmentation. Fig. 2 is a schematic diagram of the working principle of the ROI feature extraction module. As shown in Fig. 2, to integrate features from different levels, after obtaining the position box of an ROI, the invention crops the ROI feature from each of the four shared feature maps P1-P4. To facilitate training, ROI features of different sizes are uniformly warped to 14×14. The four resulting 256-channel sub-feature maps are concatenated in the order P1-P4, and a 1×1 convolution kernel then reduces the channel dimension to 256, giving the final extracted sub-feature. This constitutes the ROI feature extraction module. The extracted sub-features are fed into the different sub-networks and processed separately, as sketched below.
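A minimal sketch of this crop-warp-concatenate-reduce procedure, assuming RoIAlign as the warping operator and feature strides of 2, 4, 8, 16 for P1-P4 (both assumptions; the patent does not name the warping operator):

    import torch
    from torchvision.ops import roi_align

    def extract_roi_feature(p_feats, boxes, reduce_1x1,
                            strides=(2, 4, 8, 16), out_size=14):
        """Crop one set of boxes from each of P1-P4, warp every crop to 14x14,
        concatenate along channels, and reduce back to 256 channels. `boxes` is
        an (N, 4) tensor of (x1, y1, x2, y2) in image coordinates for a single
        image (format assumed)."""
        crops = [roi_align(p, [boxes], output_size=out_size, spatial_scale=1.0 / s)
                 for p, s in zip(p_feats, strides)]
        fused = torch.cat(crops, dim=1)        # (N, 4 * 256, 14, 14)
        return reduce_1x1(fused)               # (N, 256, 14, 14)

    # 1x1 convolution that reduces the concatenated channels back to 256
    reduce_1x1 = torch.nn.Conv2d(4 * 256, 256, kernel_size=1)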
The sub-network modules comprise an eyes-plus-brows sub-network module, a nose sub-network module, and a mouth sub-network module. The eyes-plus-brows sub-network module takes the left-eye-plus-left-brow and right-eye-plus-right-brow feature maps as input and outputs the left-eye-plus-left-brow and right-eye-plus-right-brow regions; the nose sub-network module takes the nose feature map as input and outputs the nose region; the mouth sub-network module takes the mouth sub-feature map as input and outputs the mouth region.
Each sub-network further processes its corresponding features and decodes the segmentation result of its sub-region: the nose sub-network divides its sub-region into 2 semantic classes (nose, background); the mouth sub-network divides the mouth region into 4 semantic classes (upper lip, inner mouth, lower lip, background); the eyes-plus-brows sub-network divides the eyes-plus-brows region into 3 semantic classes (eye, brow, background). Each sub-network transforms its ROI feature map back to the original ROI size and places the output result at the corresponding position, as shown in Fig. 3.
Fig. 3 is the flow diagram of the mouth region sub-network in the proposed method; the nose region and the eyes-plus-brows region share the same processing flow. The three sub-networks run in parallel, and each is a fully convolutional network model. Specifically, a sub-network upsamples the ROI sub-feature by a factor of 4 and outputs a 56×56 result.
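A minimal sketch of one such sub-network, assuming two stride-2 transposed convolutions realize the 4× upsampling from 14×14 to 56×56 (layer widths and activation choices are assumptions):

    import torch.nn as nn

    class ComponentSubNet(nn.Module):
        """One component sub-network: a small fully convolutional decoder that
        upsamples the 14x14 ROI sub-feature by 4x to a 56x56 class map, e.g.
        num_classes=4 for the mouth (upper lip, inner mouth, lower lip,
        background)."""
        def __init__(self, num_classes, dim=256):
            super().__init__()
            self.decode = nn.Sequential(
                nn.ConvTranspose2d(dim, dim, 2, stride=2), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(dim, dim, 2, stride=2), nn.ReLU(inplace=True),
                nn.Conv2d(dim, num_classes, 1))

        def forward(self, roi_feat):           # (N, 256, 14, 14)
            return self.decode(roi_feat)       # (N, num_classes, 56, 56)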
Since the class imbalance among skin, hair, and background is not severe, there is no need to design dedicated sub-network structures for them. The three-class network module raises the feature maps to the input image dimensions through deconvolution and convolution layers and then outputs the segmentation results for the skin, hair, and background regions. The rear part of the decoder forms the three-class network module.
Specifically, after the C1 feature map is added to the upsampled C2 feature map, convolution layer 1 (256 channels, 3×3 kernels), deconvolution layer 1 (128 channels, stride 2), convolution layer 2 (128 channels, 3×3 kernels), and convolution layer 3 (3 channels, 3×3 kernels) are applied in sequence, outputting the segmentation result for the three classes hair, skin, and background. This part is the three-class network module.
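The layer stack just listed translates almost directly into code; a sketch (the padding choice and the absence of activations between layers are assumptions):

    import torch.nn as nn

    three_class_head = nn.Sequential(
        nn.Conv2d(256, 256, 3, padding=1),          # convolution layer 1
        nn.ConvTranspose2d(256, 128, 2, stride=2),  # deconvolution layer 1
        nn.Conv2d(128, 128, 3, padding=1),          # convolution layer 2
        nn.Conv2d(128, 3, 3, padding=1))            # layer 3: hair/skin/background

    # Applied to the sum of C1 and the upsampled C2 feature map.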
During training, the face component detection module supplies each sub-network with 20 ROI position-box proposals of its respective class. The features extracted from these 20 proposal boxes are passed through the corresponding sub-network, and the outputs are compared with the ground truth at the corresponding positions to compute a cross-entropy loss; the cross-entropy losses of all proposals are combined to optimize the network parameters. The ground truth comes from a manually segmented face parsing database.
The optimization loss function of the training process is defined as:
L_all = L_seg + L_det + L_reg
where L_all is the overall optimization loss; L_reg is a regularization loss used to alleviate model overfitting; L_det is the face component detection module loss, whose definition follows the original Faster R-CNN formulation and comprises a classification loss and a position regression loss; and L_seg is the segmentation loss, defined as:
L_seg = L_seg^{3cls} + L_seg^{sub1} + L_seg^{sub2} + L_seg^{sub3}
where L_seg^{3cls} is the three-class segmentation loss and L_seg^{sub1}, L_seg^{sub2}, L_seg^{sub3} are the segmentation losses of the three sub-networks; all 4 segmentation losses adopt the cross-entropy loss.
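A sketch of how the combined loss could be assembled, assuming per-pixel cross-entropy targets and externally computed detection and regularization terms:

    import torch.nn.functional as F

    def total_loss(three_cls_logits, three_cls_target,
                   sub_logits, sub_targets, l_det, l_reg):
        """L_all = L_seg + L_det + L_reg, with L_seg the sum of the three-class
        cross-entropy and the three sub-network cross-entropies; l_det (Faster
        R-CNN classification + box regression) and l_reg (regularization) are
        computed elsewhere."""
        l_seg = F.cross_entropy(three_cls_logits, three_cls_target)
        for logits, target in zip(sub_logits, sub_targets):
            # eyes-plus-brows, nose, and mouth sub-network losses
            l_seg = l_seg + F.cross_entropy(logits, target)
        return l_seg + l_det + l_reg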
Step S4: the algorithm model trained in step S3 is used to segment and label the face image to be processed.
Fusing the segmentation result of the three-class network with the segmentation results of the sub-networks yields the 11-class segmentation and labeling result for the face: the segmentation results of the mouth region, the nose region, and the eyes-plus-brows regions (left eye, right eye, left brow, right brow, nose, upper lip, inner mouth, lower lip) are combined with the three-class network's output (skin, hair, background) to obtain the final 11-class segmentation result (left eye, right eye, left brow, right brow, nose, upper lip, inner mouth, lower lip, skin, hair, background), as sketched below.
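A sketch of this fusion step, assuming integer label maps and boxes in pixel coordinates; the remapping of each sub-network's local labels into the 11-class id space is omitted, and the label ids are hypothetical:

    import torch
    import torch.nn.functional as F

    def fuse_results(three_cls_labels, sub_labels, boxes):
        """Start from the 3-class label map (already expressed in the 11-class
        id space) and paste each sub-network's labels back into its ROI."""
        out = three_cls_labels.clone()              # (H, W) integer label map
        for labels, (x1, y1, x2, y2) in zip(sub_labels, boxes):
            # resize the 56x56 label map to the ROI size without blending ids
            roi = F.interpolate(labels[None, None].float(),
                                size=(y2 - y1, x2 - x1),
                                mode="nearest").long()[0, 0]
            keep = roi > 0                          # 0 = sub-network background
            out[y1:y2, x1:x2][keep] = roi[keep]
        return out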
Fig. 4 is a schematic diagram of segmentation results of the proposed end-to-end face labeling method based on detection and segmentation. As shown in Fig. 4, the first row shows the input images, the second row the segmentation results, and the third row the corresponding ground truth. The samples shown vary considerably in head pose, hair style, skin tone, occlusion, and expression, yet the proposed model still obtains good segmentation results.
The above are merely preferred embodiments of the application, but the protection scope of the application is not limited thereto. Any changes or substitutions readily conceivable by those skilled in the art within the technical scope disclosed by the application shall be covered by the protection scope of the application. Therefore, the protection scope of the application shall be subject to the protection scope of the claims.

Claims (8)

1. An end-to-end face labeling method based on detection and segmentation, characterized by comprising the following steps:
S1. labeling the regions to be detected in every face image of a face image training set;
S2. constructing an end-to-end face segmentation and labeling neural network model composed of a shared feature module, a face component detection module, an ROI feature extraction module, three sub-network modules, and a three-class network module, wherein the shared feature module takes a face image as input and outputs shared feature maps; the face component detection module takes the shared feature maps as input and outputs position box information for each region to be detected; the ROI feature extraction module takes the shared feature maps and the position box information as input and outputs ROI sub-feature maps; each sub-network module takes an ROI sub-feature map as input and outputs the segmentation result of its sub-region; and the three-class network module takes the shared feature maps as input and outputs the segmentation results for hair, skin, and background;
S3. training the end-to-end face segmentation and labeling neural network model on the face image training set, with all modules trained synchronously;
S4. using the trained face segmentation and labeling neural network model to segment and label a test face image, the fusion of the three-class network's segmentation result with the sub-networks' segmentation results being the segmentation and labeling result.
2. The face labeling method according to claim 1, characterized in that the regions to be detected in step S1 include: left brow, right brow, left eye, right eye, nose, upper lip, inner mouth, lower lip, skin, hair, and background.
3. The face labeling method according to claim 1 or 2, characterized in that the shared feature module in step S2 adopts an encoder-decoder structure: the encoder converts the face image into feature maps C1-CN through a fully convolutional network, and the decoder obtains shared feature maps P1-PN from C1-CN through stage-by-stage deconvolution upsampling and high/low-level feature fusion.
4. The face labeling method according to claim 1 or 2, characterized in that the face component detection module in step S2 outputs the position box information of the different components.
5. The face labeling method according to claim 4, characterized in that, after obtaining an ROI position box, the ROI feature extraction module in step S2 crops the ROI feature from each of the shared feature maps P1-PN, warps the crops to a common size, concatenates them along the channel dimension, and transforms the result to a common channel dimension to obtain the ROI sub-feature map.
6. The face labeling method according to claim 1 or 2, characterized in that the three sub-network modules comprise an eyes-plus-brows sub-network module, a nose sub-network module, and a mouth sub-network module, each further processing its corresponding features and decoding the segmentation result of its sub-region, wherein the nose sub-network divides its sub-region into 2 semantic classes (nose, background), the mouth sub-network divides the mouth region into 4 semantic classes (upper lip, inner mouth, lower lip, background), and the eyes-plus-brows sub-network divides the eyes-plus-brows region into 3 semantic classes (eye, brow, background).
7. The face labeling method according to claim 1 or 2, characterized in that the three-class network module raises the feature maps to the input image dimensions through deconvolution and convolution layers and outputs the segmentation results for the skin, hair, and background regions.
8. The face labeling method according to claim 1 or 2, characterized in that the optimization loss function of the training process is defined as:
L_all = L_seg + L_det + L_reg
where L_all is the overall optimization loss; L_reg is a regularization loss used to alleviate model overfitting; L_det is the face component detection module loss, comprising a classification loss and a position regression loss; and L_seg is the segmentation loss, defined as:
L_seg = L_seg^{3cls} + L_seg^{sub1} + L_seg^{sub2} + L_seg^{sub3}
where L_seg^{3cls} is the three-class segmentation loss and L_seg^{sub1}, L_seg^{sub2}, L_seg^{sub3} are the segmentation losses of the three sub-networks; all 4 segmentation losses adopt the cross-entropy loss.
CN201810654160.6A 2018-06-22 2018-06-22 End-to-end face labeling method based on detection and segmentation Expired - Fee Related CN108985181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810654160.6A CN108985181B (en) 2018-06-22 2018-06-22 End-to-end face labeling method based on detection and segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810654160.6A CN108985181B (en) 2018-06-22 2018-06-22 End-to-end face labeling method based on detection and segmentation

Publications (2)

Publication Number Publication Date
CN108985181A true CN108985181A (en) 2018-12-11
CN108985181B CN108985181B (en) 2020-07-24

Family

ID=64538442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810654160.6A Expired - Fee Related CN108985181B (en) 2018-06-22 2018-06-22 End-to-end face labeling method based on detection and segmentation

Country Status (1)

Country Link
CN (1) CN108985181B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100508202B1 (en) * 2003-04-14 2005-08-17 주식회사 윈포넷 Face feature extraction method for face detection
CN105654420A (en) * 2015-12-21 2016-06-08 小米科技有限责任公司 Face image processing method and device
CN105869159A (en) * 2016-03-28 2016-08-17 联想(北京)有限公司 Image segmentation method and apparatus
CN107507211A (en) * 2017-07-24 2017-12-22 中国科学院合肥物质科学研究院 Remote sensing image segmentation method based on multi-Agent and MRF

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Sifei Liu et al., "Face Parsing via Recurrent Propagation", arXiv *
Niu Xinya, "Research on Facial Expression Recognition Based on Deep Learning", China Master's Theses Full-text Database *

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657596A (en) * 2018-12-12 2019-04-19 天津卡达克数据有限公司 Vehicle appearance component recognition method based on deep learning
CN109741338A (en) * 2018-12-13 2019-05-10 北京爱奇艺科技有限公司 Face segmentation method, apparatus and device
CN109801249A (en) * 2018-12-27 2019-05-24 深圳豪客互联网有限公司 Image fusion method, apparatus, computer device and storage medium
CN109829374A (en) * 2018-12-27 2019-05-31 深圳豪客互联网有限公司 Image processing method, apparatus, computer device and storage medium
CN111382746A (en) * 2018-12-29 2020-07-07 杭州光启人工智能研究院 Data labeling method, computer device and computer-readable storage medium
CN109903257A (en) * 2019-03-08 2019-06-18 上海大学 Virtual hair dyeing method based on image semantic segmentation
CN111723596B (en) * 2019-03-18 2024-03-22 北京市商汤科技开发有限公司 Gaze area detection and neural network training method, apparatus and device
CN111723596A (en) * 2019-03-18 2020-09-29 北京市商汤科技开发有限公司 Method, apparatus and device for gaze area detection and neural network training
CN110148085A (en) * 2019-04-22 2019-08-20 智慧眼科技股份有限公司 Face image super-resolution reconstruction method and computer-readable storage medium
CN110059768B (en) * 2019-04-30 2022-11-15 福州大学 Semantic segmentation method and system for fusion point and region feature for street view understanding
CN110059768A (en) * 2019-04-30 2019-07-26 福州大学 Semantic segmentation method and system fusing point and region features for street view understanding
CN110287930A (en) * 2019-07-01 2019-09-27 厦门美图之家科技有限公司 Wrinkle classification model training method and device
CN110287930B (en) * 2019-07-01 2021-08-20 厦门美图之家科技有限公司 Wrinkle classification model training method and device
CN110378278B (en) * 2019-07-16 2021-11-02 北京地平线机器人技术研发有限公司 Neural network training method, object searching method, device and electronic equipment
CN110378278A (en) * 2019-07-16 2019-10-25 北京地平线机器人技术研发有限公司 Neural network training method, object search method, apparatus and electronic device
CN110472605A (en) * 2019-08-21 2019-11-19 广州纳丽生物科技有限公司 Skin problem diagnosis method based on deep learning face partitioning
CN110472605B (en) * 2019-08-21 2022-10-14 广州纳丽生物科技有限公司 Skin problem classification method based on deep learning face partition
CN112669197A (en) * 2019-10-16 2021-04-16 顺丰科技有限公司 Image processing method, image processing device, mobile terminal and storage medium
CN111144310A (en) * 2019-12-27 2020-05-12 创新奇智(青岛)科技有限公司 Face detection method and system based on multi-layer information fusion
CN111179287A (en) * 2020-01-03 2020-05-19 厦门美图之家科技有限公司 Portrait instance segmentation method, apparatus, device and storage medium
CN111310718A (en) * 2020-03-09 2020-06-19 成都川大科鸿新技术研究所 High-accuracy detection and comparison method for occluded face images
CN111739025B (en) * 2020-05-08 2024-03-19 北京迈格威科技有限公司 Image processing method, device, terminal and storage medium
CN111739025A (en) * 2020-05-08 2020-10-02 北京迈格威科技有限公司 Image processing method, device, terminal and storage medium
CN111666905A (en) * 2020-06-10 2020-09-15 重庆紫光华山智安科技有限公司 Model training method, pedestrian attribute identification method and related device
CN112330696B (en) * 2020-12-02 2022-08-09 青岛大学 Face segmentation method, face segmentation device and computer-readable storage medium
CN112330696A (en) * 2020-12-02 2021-02-05 青岛大学 Face segmentation method, face segmentation device and computer-readable storage medium
CN112560701A (en) * 2020-12-17 2021-03-26 成都新潮传媒集团有限公司 Face image extraction method and device and computer storage medium
CN112733632A (en) * 2020-12-28 2021-04-30 华南理工大学 Robot control method based on face recognition and gesture recognition
CN112733632B (en) * 2020-12-28 2023-02-14 华南理工大学 Robot control method based on face recognition and gesture recognition
CN113052247A (en) * 2021-03-31 2021-06-29 清华苏州环境创新研究院 Garbage classification method and garbage classifier based on multi-label image recognition
CN113469040A (en) * 2021-06-30 2021-10-01 北京市商汤科技开发有限公司 Image processing method and device, computer equipment and storage medium
CN113469040B (en) * 2021-06-30 2023-10-24 北京市商汤科技开发有限公司 Image processing method, device, computer equipment and storage medium
WO2023273022A1 (en) * 2021-06-30 2023-01-05 北京市商汤科技开发有限公司 Image processing method and apparatus, computer device, storage medium, and computer program product
CN116269285A (en) * 2022-11-28 2023-06-23 电子科技大学 Non-contact normalized heart rate variability estimation system
CN116269285B (en) * 2022-11-28 2024-05-28 电子科技大学 Non-contact normalized heart rate variability estimation system

Also Published As

Publication number Publication date
CN108985181B (en) 2020-07-24

Similar Documents

Publication Publication Date Title
CN108985181A (en) A kind of end-to-end face mask method based on detection segmentation
CN107273876B (en) A kind of micro- expression automatic identifying method of ' the macro micro- transformation model of to ' based on deep learning
CN110163239B (en) Weak supervision image semantic segmentation method based on super-pixel and conditional random field
CN104143079B (en) The method and system of face character identification
CN102496023B (en) Region of interest extraction method of pixel level
CN110428432A (en) The deep neural network algorithm of colon body of gland Image Automatic Segmentation
CN110008832A (en) Based on deep learning character image automatic division method, information data processing terminal
CN108288035A (en) The human motion recognition method of multichannel image Fusion Features based on deep learning
CN109712145A (en) A kind of image matting method and system
CN106295584A (en) Depth migration study is in the recognition methods of crowd's attribute
CN111462183A (en) Behavior identification method and system based on attention mechanism double-current network
CN109598234A (en) Critical point detection method and apparatus
CN110472495B (en) Deep learning face recognition method based on graphic reasoning global features
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN106339719A (en) Image identification method and image identification device
CN110223304A (en) A kind of image partition method, device and computer readable storage medium based on multipath polymerization
CN109214298A (en) A kind of Asia women face value Rating Model method based on depth convolutional network
Mo et al. Background noise filtering and distribution dividing for crowd counting
CN109886391A (en) A kind of neural network compression method based on the positive and negative diagonal convolution in space
CN106067016A (en) A kind of facial image eyeglass detection method and device
CN109978074A (en) Image aesthetic feeling and emotion joint classification method and system based on depth multi-task learning
Yuan et al. Watershed-based superpixels with global and local boundary marching
CN108416304A (en) A kind of three classification method for detecting human face using contextual information
AU2021240205B1 (en) Object sequence recognition method, network training method, apparatuses, device, and medium
Liang et al. Image segmentation and recognition for multi-class chinese food

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200724

Termination date: 20210622