CN112613385A - Face recognition method based on monitoring video - Google Patents

Face recognition method based on monitoring video Download PDF

Info

Publication number
CN112613385A
Authority
CN
China
Prior art keywords
face
conv
depth
layer
monitoring video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011501108.0A
Other languages
Chinese (zh)
Inventor
张正强
阴俊恺
吴震
李斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu 30kaitian Communication Industry Co ltd
Original Assignee
Chengdu 30kaitian Communication Industry Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu 30kaitian Communication Industry Co ltd filed Critical Chengdu 30kaitian Communication Industry Co ltd
Priority to CN202011501108.0A priority Critical patent/CN112613385A/en
Publication of CN112613385A publication Critical patent/CN112613385A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/165Detection; Localisation; Normalisation using facial parts and geometric relationships

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a face recognition method based on a monitoring video, which comprises the following steps: S1, collecting face detection training data and face matching training data; S2, constructing a face detection model and a face matching model; S3, respectively inputting the collected face detection training data and face matching training data into the constructed face detection model and face matching model for training; and S4, performing face recognition on the monitoring video by using the trained face detection model and face matching model. The invention provides a method covering face snapshot, model training and face recognition, which improves face recognition accuracy in monitoring scenes and enhances the practical application capability of the monitoring system.

Description

Face recognition method based on monitoring video
Technical Field
The invention relates to the technical field of face recognition, in particular to a face recognition method based on a monitoring video.
Background
An intelligent video monitoring platform needs to collect various kinds of video information, face information being one of them. As a prerequisite technology for personnel trajectory tracking, blacklist/whitelist control and behavior recognition analysis on the monitoring platform, the accuracy of face recognition directly affects the deployment and control effect of the system. However, existing face recognition methods for surveillance video have poor accuracy and performance.
Disclosure of Invention
The invention aims to provide a face recognition method based on a monitoring video, so as to solve the problem that existing face recognition methods for surveillance video have poor accuracy and performance.
The invention provides a face recognition method based on a monitoring video, which comprises the following steps:
s1, collecting face detection training data and face matching training data;
s2, constructing a face detection model and a face matching model;
s3, respectively inputting the collected face detection training data and face matching training data into the constructed face detection model and face matching model for training;
and S4, performing face recognition on the monitoring video by using the trained face detection model and the face matching model.
Further, the process of acquiring the face detection training data in step S1 is as follows:
s111, collecting face video materials of a plurality of different backgrounds, different angles and different time periods;
s112, dividing the video frame picture into 9 regions according to the nine-square grid for each face video material;
s113, combining 9 areas of each face video material according to the depth of field to obtain a near-scene depth video material and a far-scene depth video material;
s114, cutting the near-depth-of-field video material and the far-depth-of-field video material into near-depth-of-field video frame images and far-depth-of-field video frame images, and marking the face regions in the near-depth-of-field video frame images and the far-depth-of-field video frame images respectively by using a marking tool, wherein all face regions are marked in the same way, covering the same position and the same facial parts;
s115, performing exception screening and data augmentation on the labeled face region to obtain video training data; the video training data comprises near depth of field face detection training data and far depth of field face detection training data.
Further, the method for constructing the face detection model in step S2 includes:
s211, constructing a multi-scale single-stage face detection model; the multi-scale single-stage face detection model comprises a backbone network model, a characteristic pyramid, a context module and a multi-task loss function;
s212, on the basis of the multi-scale single-stage face detection model, a near scene depth face detection model and a far scene depth face detection model are respectively expanded by adjusting the number of network layers of the backbone network model.
Further, the backbone network structure of the near-depth-of-field face detection model is: Conv s2 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> 5 × (Conv dw s1 -> Conv s1) -> Conv dw s2 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Avg Pool s1 -> FC s1 -> Softmax s1; wherein Conv denotes an ordinary convolutional layer, Conv dw denotes a depthwise separable convolutional layer, Avg Pool denotes an average pooling layer, and FC denotes a fully connected layer; a ReLU layer and a batch normalization layer BN follow each convolutional layer, a batch normalization layer BN follows the fully connected layer FC, and s1 and s2 denote strides of 1 and 2 respectively.
Further, the backbone network structure of the far-depth-of-field face detection model is: Conv s2 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> 7 × (Conv dw s1 -> Conv s1) -> Conv dw s2 -> Conv s1 -> Avg Pool s1 -> FC s1 -> Softmax s1; wherein Conv denotes an ordinary convolutional layer, Conv dw denotes a depthwise separable convolutional layer, Avg Pool denotes an average pooling layer, and FC denotes a fully connected layer; a ReLU activation function layer and a batch normalization layer BN follow each convolutional layer, a batch normalization layer BN follows the fully connected layer FC, and s1 and s2 denote strides of 1 and 2 respectively.
Further, the process of acquiring the face matching training data in step S1 is as follows:
s121, collecting a plurality of frontal face pictures with different backgrounds, different angles and different illuminations, and recording the name of each person in the collection process;
s122, marking face key points in the frontal face pictures by using a face detection algorithm, aligning the faces in the frontal face pictures by using the marked face key points, cropping away the image outside the face region in each aligned frontal face picture to obtain a face region image, and finally resizing the face region image to 112 × 112 and classifying the images by name;
and S123, screening and data augmentation are carried out on the classified face region images to obtain face matching training data.
Further, the method for constructing the face matching model in step S2 is to improve on the ResNet-34 network to obtain the face matching model network structure; the face matching model network structure comprises 6 computing units Conv 1, Conv 2.x, Conv 3.x, Conv 4.x, Conv 5.x and Conv 6; wherein Conv 1 is a 3 × 3 convolutional layer Conv (stride = 1); Conv 2.x comprises a max pooling layer Maxpool and a residual block, Conv 3.x, Conv 4.x and Conv 5.x are residual blocks, and the network structure of the residual blocks is a BN -> Conv -> BN -> PReLU -> Conv -> BN structure, where BN denotes a batch normalization layer and PReLU denotes a PReLU activation function layer; Conv 6 is a BN -> Dropout -> FC -> BN structure, where the Dropout layer randomly disables some neural nodes during training to prevent overfitting, and the FC layer is a fully connected layer.
Further, step S4 includes the following steps:
s41, establishing a human face base;
s411, manually collecting frontal face pictures of the personnel to be enrolled;
s412, carrying out face detection on the collected front face image picture by using a near field depth face detection model to obtain a face region image;
(1) marking face key points in the front face picture by using a near-depth-of-field face detection model;
(2) using the marked human face key points to align the human face of the front face picture;
(3) cutting out images outside the face area in the front face picture after the face alignment to obtain a face area image;
s413, extracting a face feature vector from the face region image by using a face matching model;
s414, binding the face feature vector and the corresponding personnel information and storing the face feature vector into a face base;
s42, acquiring a monitoring video and cutting a near scene depth monitoring video and a far scene depth monitoring video from the monitoring video;
s43, cutting the near-depth-of-field monitoring video and the far-depth-of-field monitoring video into near-depth-of-field monitoring video frame images and far-depth-of-field monitoring video frame images; performing face detection on the near-depth-of-field and far-depth-of-field monitoring video frame images with the near-depth-of-field and far-depth-of-field face detection models respectively; extracting face key points from the near-depth-of-field and far-depth-of-field monitoring video frame images, performing face alignment with the face key points, and then cropping away the parts outside the face regions to obtain near-depth-of-field face region monitoring video frame images and far-depth-of-field face region monitoring video frame images; finally, resizing the near-depth-of-field and far-depth-of-field face region monitoring video frame images to 112 × 112;
s44, extracting a face feature vector to be matched from the near-depth face region monitoring video frame image and the far-depth face region monitoring video frame image by using a face matching model;
and S45, matching the face feature vector to be matched with the face feature vectors in the face base to obtain the corresponding face matching result.
Further, the face region in the frontal face picture collected in step S411 occupies at least 1/5 of the whole picture.
Further, the matching method in step S45 is: the face matching model maps the face feature vector to be matched and the face feature vectors in the face base onto a unit hypersphere and calculates the cosine similarity between the face feature vector to be matched and each face feature vector in the face base; the cosine similarity ranges over [-1, 1], and the closer the calculated cosine similarity is to 1, the more similar the faces are; the personnel information corresponding to the face feature vector with the highest cosine similarity is the face matching result.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
the invention provides a face snapshot, model training and face recognition method, which can improve the face recognition precision in a monitoring scene and improve the actual combat application capability of a monitoring system.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a flow chart of a face recognition method based on a surveillance video according to an embodiment of the present invention.
Fig. 2 is a block diagram of a flow of acquiring face detection training data according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of dividing a video frame picture into 9 regions according to a squared figure for each face video material according to the embodiment of the present invention.
Fig. 4 is a block diagram of a process of acquiring face matching training data according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a multi-scale single-stage face detection model according to an embodiment of the present invention.
FIG. 6 is a schematic diagram of a feature pyramid output connection according to an embodiment of the invention.
FIG. 7 is a block diagram of a context module according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating the multi-task loss according to an embodiment of the invention.
Fig. 9 is a schematic diagram of the arrangement of anchors with specific dimensions of the feature pyramid according to the embodiment of the present invention.
Fig. 10 is a schematic view of a backbone network structure of a near depth face detection model according to an embodiment of the present invention.
Fig. 11 is a schematic network structure diagram of the face matching model according to the embodiment of the present invention.
Fig. 12 is a schematic diagram of a residual block network structure of a face matching model according to an embodiment of the present invention.
Fig. 13 is a schematic diagram of unit hypersphere distribution of face feature vectors extracted by the face matching model according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
As shown in fig. 1, the present embodiment provides a face recognition method based on a surveillance video, including the following steps:
s1, collecting face detection training data and face matching training data;
(1) referring to fig. 2, the process of acquiring the face detection training data in step S1 is as follows:
s111, collecting a plurality of face video materials with different backgrounds, different angles and different time periods, wherein the number of the face video materials in different categories is approximately equivalent;
s112, dividing the video frame picture of each face video material into 9 regions according to a 3 × 3 (nine-square) grid; referring to fig. 3, the 9 regions are of equal size; generally, regions 1 to 6 are far-depth-of-field regions and regions 7 to 9 are near-depth-of-field regions, and the specific division is adjusted according to the actual situation;
s113, combining the 9 regions of each face video material according to depth of field to obtain a near-depth-of-field video material and a far-depth-of-field video material; the video materials produced in this step must all be of the same size, namely 112 × 112, the main difference between them being the face size;
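The following is a minimal sketch of the 3 × 3-grid splitting and depth-of-field grouping described in steps S112-S113, assuming the regions are numbered 1 to 9 row by row from the top-left, with regions 1-6 treated as far depth of field and regions 7-9 as near depth of field (the grouping is configurable, since the division may be adjusted per camera); the function name and default parameters are illustrative and not part of the patent.

```python
import numpy as np

def split_nine_grid(frame, far_ids=range(1, 7), near_ids=range(7, 10)):
    """Split a frame into a 3x3 grid and group the cells by depth of field.

    Cells are numbered 1..9 row by row (1 = top-left, 9 = bottom-right).
    Returns (near_region, far_region), each built by stacking its cells
    back into rows.
    """
    h, w = frame.shape[:2]
    cell_h, cell_w = h // 3, w // 3          # remainder pixels are dropped in this sketch
    cells = {}
    for idx in range(9):
        r, c = divmod(idx, 3)
        cells[idx + 1] = frame[r * cell_h:(r + 1) * cell_h,
                               c * cell_w:(c + 1) * cell_w]

    def merge(ids):
        ids = list(ids)
        rows = [np.hstack([cells[i] for i in ids[k:k + 3]])
                for k in range(0, len(ids), 3)]
        return np.vstack(rows)

    far = merge(far_ids)    # upper two rows by default
    near = merge(near_ids)  # bottom row by default
    return near, far

# usage: near, far = split_nine_grid(frame)
# both crops are later resized to the common training size (112 x 112)
```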
and S114, cutting the near-depth-of-field video material and the far-depth-of-field video material into near-depth-of-field video frame images and far-depth-of-field video frame images, and marking the face regions in the near-depth-of-field video frame images and the far-depth-of-field video frame images respectively with a marking tool; all face regions are marked according to approximately the same criteria, covering the same position and the same facial parts. Generally, the aspect ratio of the labeled boxes is 1:1.
And S115, carrying out anomaly screening and data augmentation on the labeled face regions to obtain face detection training data. The face detection training data comprise near-depth-of-field face detection training data and far-depth-of-field face detection training data. The anomaly screening refers to filtering out faces that are too small, which would otherwise harm the subsequent training of the face detection model; the data augmentation refers to augmenting the data by adding noise, adding blur, adjusting brightness and contrast, translation, scaling, mirroring and similar operations, so as to enlarge the face detection training data under different conditions;
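As a sketch of the augmentation operations listed above (noise, blur, brightness/contrast adjustment, mirroring), the snippet below uses OpenCV and NumPy; the parameter values are illustrative assumptions rather than values specified by the patent, and translation/scaling variants would be added analogously.

```python
import cv2
import numpy as np

def augment(img):
    """Return a few augmented copies of an image, with illustrative parameters."""
    out = []
    # additive Gaussian noise
    noise = np.random.normal(0, 10, img.shape).astype(np.float32)
    out.append(np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8))
    # blur
    out.append(cv2.GaussianBlur(img, (5, 5), 0))
    # brightness / contrast: alpha scales contrast, beta shifts brightness
    out.append(cv2.convertScaleAbs(img, alpha=1.2, beta=15))
    # horizontal mirror
    out.append(cv2.flip(img, 1))
    return out
```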
(2) referring to fig. 4, the process of acquiring the face matching training data in step S1 is as follows:
s121, collecting a number of frontal face pictures with different backgrounds, different angles (generally, the angle between the shooting direction and the normal vector of the face is less than 15 degrees) and different illuminations, and recording the name of each person during collection; generally, about 15 frontal face pictures are needed for each person;
s122, marking face key points in the frontal face pictures by using a face detection algorithm, aligning the faces in the frontal face pictures by using the marked face key points, cropping away the image outside the face region in each aligned frontal face picture to obtain a face region image, and finally resizing the face region image to 112 × 112 and classifying the images by name (a sketch of this alignment step is given after step S123 below);
s123, screening and data augmentation are carried out on the classified face region images to obtain face matching training data; the screening removes blurred photos that would affect the training of the face matching model; the data augmentation adds noise and adjusts image brightness and contrast to increase image diversity, so as to enlarge the face matching training data under different conditions;
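A sketch of the key-point-based alignment and 112 × 112 cropping used in step S122: it assumes five key points (eye centres, nose tip, mouth corners) and the commonly used 112 × 112 reference template from the ArcFace code base; the template coordinates are an assumption, since the patent does not specify them.

```python
import cv2
import numpy as np

# Reference positions of the 5 key points in a 112x112 aligned face
# (values follow the widely used ArcFace template; assumed here).
REF_112 = np.array([[38.2946, 51.6963], [73.5318, 51.5014],
                    [56.0252, 71.7366], [41.5493, 92.3655],
                    [70.7299, 92.2041]], dtype=np.float32)

def align_face(img, landmarks):
    """Warp the face so its 5 key points match the 112x112 reference layout."""
    src = np.asarray(landmarks, dtype=np.float32)      # shape (5, 2)
    M, _ = cv2.estimateAffinePartial2D(src, REF_112)   # similarity transform
    return cv2.warpAffine(img, M, (112, 112))
```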
s2, constructing a face detection model and a face matching model;
(1) the method for constructing the face detection model in the step S2 includes:
s211, constructing a multi-scale single-stage face detection model, wherein the multi-scale single-stage face detection model comprises a backbone network model, a characteristic pyramid, a context module and a multitask loss function; the multi-scale single-stage face detection model is used for realizing three tasks of face confidence degree judgment, face region frame selection and face key point marking.
For the multi-scale single-stage face detection model, this embodiment follows the paper RetinaFace: Single-stage Dense Face Localisation in the Wild; the structure of the multi-scale single-stage face detection model is shown in fig. 5. In the backbone network model, the C1 layer is the input layer with an input image size of 640 × 640, and the C2 to C6 layers are down-sampling layers. Because faces in the face detection training data appear at different distances and sizes, multi-scale detection is introduced using the idea of a feature pyramid, which better balances speed and recognition accuracy. In fig. 5, P2 to P6 are the feature pyramid layers, whose parameters are determined by the anchors. P6 is the result of convolving C5 with a 3 × 3 convolution kernel at a stride of 2. The P5 layer takes the output of the C5 layer and performs an up-sampling operation; the P4 layer takes the outputs of the P5 layer and the C4 layer, merges them and performs an up-sampling operation, and so on until P2 is generated. Each pyramid layer has 256 channels, and the layer sizes are 10 × 10, 20 × 20, 40 × 40, 80 × 80 and 160 × 160 respectively. As shown in fig. 6, the output of each feature pyramid layer is split into two branches: one branch is connected to a 3 × 3 DCN layer (deformable convolution), the other passes through the context module, and the two are then merged again to produce the detection results, which include the face confidence, the face bounding box regression and the face key point regression. The detailed structure of the context module is shown in fig. 7, where the Conv layers are all deformable convolutions; this module is used to enlarge the receptive field obtained on the Euclidean grid of the feature pyramid.
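A minimal PyTorch sketch of one top-down merge step of the feature pyramid described above follows; the backbone channel width is a parameter, and the plain 3 × 3 smoothing convolution stands in for the DCN and context-module branches of figs. 6-7, which are omitted here for brevity.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    """One top-down merge step (e.g. P5 -> P4) of a 256-channel feature pyramid."""
    def __init__(self, c_backbone, c_pyramid=256):
        super().__init__()
        self.lateral = nn.Conv2d(c_backbone, c_pyramid, kernel_size=1)
        self.smooth = nn.Conv2d(c_pyramid, c_pyramid, kernel_size=3, padding=1)

    def forward(self, p_upper, c_lower):
        # upsample the higher pyramid level and add the lateral projection
        up = F.interpolate(p_upper, size=c_lower.shape[-2:], mode="nearest")
        return self.smooth(self.lateral(c_lower) + up)
```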
For the anchor settings, referring to fig. 9, the anchor parameters correspond one-to-one to the feature pyramid layers P2-P6 on which the feature maps are generated; from P2 to P6 the feature maps shrink step by step, and the larger feature maps are used to capture small faces. The scale step between anchor sizes at each level follows the referenced RetinaFace design (2^(1/3)), and the aspect ratio is 1:1. With a 640 × 640 input image, anchor sizes range from 16 × 16 at the smallest to 406 × 406 at the largest, giving 102,300 anchors in total, 75% of which come from P2.
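A sketch of this anchor layout: three 1:1 anchors per grid cell on each of P2-P6, starting at 16 px on P2 and doubling per level with a 2^(1/3) scale step; the helper name and exact sizes are assumptions kept consistent with the counts quoted above.

```python
import numpy as np

def make_anchors(img_size=640, strides=(4, 8, 16, 32, 64), base=16, step=2 ** (1 / 3)):
    """Generate 1:1 anchors for P2-P6 as (cx, cy, w, h)."""
    anchors = []
    for level, stride in enumerate(strides):
        n = img_size // stride                       # grid size of this pyramid level
        sizes = [base * (2 ** level) * (step ** k) for k in range(3)]
        for y in range(n):
            for x in range(n):
                cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
                anchors.extend([[cx, cy, s, s] for s in sizes])
    return np.array(anchors, dtype=np.float32)

# 160*160*3 + 80*80*3 + 40*40*3 + 20*20*3 + 10*10*3 = 102,300 anchors,
# 76,800 of them (75%) on P2, and the largest is 256 * 2^(2/3) ~= 406 px.
```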
Referring to fig. 8, the model uses a multi-task loss during training; that is, the loss function of the multi-scale single-stage face detection model includes a face classification loss, a face bounding box regression loss and a face key point regression (landmark regression) loss, and a dense regression loss is additionally introduced. For all negative anchors, only the classification loss is used; for positive anchors, the multi-task loss is computed on the 256-channel outputs produced after the context module, and the feature maps of the different levels, H_n × W_n × 256 with n ∈ {2, …, 6}, are passed through a 1 × 1 convolution so that they share the same loss head.
S212, on the basis of the multi-scale single-stage face detection model, a near scene depth face detection model and a far scene depth face detection model are respectively expanded by adjusting a backbone network model so as to balance the model detection speed and the model detection precision. Specifically, the method comprises the following steps:
referring to fig. 10, the backbone network structure of the near-depth-of-field face detection model is adjusted as follows: Conv s2 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> 5 × (Conv dw s1 -> Conv s1) -> Conv dw s2 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Avg Pool s1 -> FC s1 -> Softmax s1; wherein Conv denotes an ordinary convolutional layer, Conv dw denotes a depthwise separable convolutional layer, Avg Pool denotes an average pooling layer, and FC denotes a fully connected layer; a ReLU layer and a batch normalization layer BN follow each convolutional layer, a batch normalization layer BN follows the fully connected layer FC, and s1 and s2 denote strides of 1 and 2 respectively;
the backbone network structure of the far-depth-of-field face detection model is obtained by modifying the backbone network structure of the near-depth-of-field face detection model: additional down-sampling stages (Conv dw s2 -> Conv s1) are inserted, and the repeated unit 5 × (Conv dw s1 -> Conv s1) is increased to 7 × (Conv dw s1 -> Conv s1). The resulting backbone network structure of the far-depth-of-field face detection model is: Conv s2 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> 7 × (Conv dw s1 -> Conv s1) -> Conv dw s2 -> Conv s1 -> Avg Pool s1 -> FC s1 -> Softmax s1; wherein Conv denotes an ordinary convolutional layer, Conv dw denotes a depthwise separable convolutional layer, Avg Pool denotes an average pooling layer, and FC denotes a fully connected layer; a ReLU activation function layer and a batch normalization layer BN follow each convolutional layer, a batch normalization layer BN follows the fully connected layer FC, and s1 and s2 denote strides of 1 and 2 respectively.
It can be seen that the farther the depth of field, the smaller the faces and the more network layers are required to extract sufficient image detail; in practical applications the number of network layers can therefore be adjusted quantitatively according to requirements;
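The repeated "Conv dw -> Conv" pair in the backbone tables above is a depthwise-separable unit; a PyTorch sketch with BN and ReLU after each convolution, as described, is given below (channel counts are left as parameters since the patent does not list them).

```python
import torch.nn as nn

def conv_dw(c_in, c_out, stride):
    """One 'Conv dw sN -> Conv s1' unit from the backbone tables above:
    depthwise 3x3 conv + BN + ReLU, then pointwise 1x1 conv + BN + ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in),
        nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, 1, stride=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )
```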
(2) the method for constructing the face matching model in step S2 is to improve on the ResNet-34 network to obtain the face matching model network structure shown in fig. 11 (for convenience of representation, the ReLU layers, BN layers, etc. are hidden in fig. 11); the face matching model network structure comprises 6 computing units Conv 1, Conv 2.x, Conv 3.x, Conv 4.x, Conv 5.x and Conv 6; wherein Conv 1 is a 3 × 3 convolutional layer Conv (stride = 1); Conv 2.x comprises a max pooling layer Maxpool and a residual block, Conv 3.x, Conv 4.x and Conv 5.x are residual blocks, and the network structure of the residual blocks is a BN -> Conv -> BN -> PReLU -> Conv -> BN structure, where BN denotes a batch normalization layer and PReLU denotes a PReLU activation function layer; Conv 6 is a BN -> Dropout -> FC -> BN structure, where the Dropout layer randomly disables some neural nodes during training to prevent overfitting, and the FC layer is a fully connected layer.
The face matching model network adopted in the embodiment is improved on the basis of the ResNet 34 network as follows:
the first improvement is as follows: most neural networks are designed to complete the classification task of Image-Net, and use 224 × 224 inputs, while the face matching model of this embodiment uses 112 × 112 inputs, and direct use will result in that the original final extracted feature dimension becomes 3 × 3, and the original dimension is 7 × 7, so it is necessary to replace the first 7 × 7 Conv layer (stride ═ 2) with 3 × 3 Conv layer (stride ═ 1), that is, the portion of the calculation unit Conv 1 in fig. 11.
The second improvement is as follows: a BN -> Conv -> BN -> PReLU -> Conv -> BN structure is adopted as the residual block, i.e., the residual blocks of the computing units Conv 2.x, Conv 3.x, Conv 4.x and Conv 5.x in fig. 11; the stride of the first convolutional layer Conv in the residual block is changed from 2 to 1, and the activation function uses PReLU instead of the original ReLU; the network structure is shown in fig. 12.
The third improvement is as follows: for the last layers of the network, different output structures affect the performance of the model, so a BN -> Dropout -> FC -> BN structure is added after the last residual block of the original model, i.e., the computing unit Conv 6 in fig. 11.
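A PyTorch sketch of the modified residual block described in the second improvement (BN -> Conv -> BN -> PReLU -> Conv -> BN, with the first convolution at stride 1): placing the down-sampling stride on the second convolution and using a 1 × 1 projection shortcut are assumptions following common ResNet practice, as the patent does not spell these details out.

```python
import torch.nn as nn

class ResidualUnit(nn.Module):
    """BN -> Conv -> BN -> PReLU -> Conv -> BN residual block (sketch)."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(c_in),
            nn.Conv2d(c_in, c_out, 3, stride=1, padding=1, bias=False),   # first conv: stride 1
            nn.BatchNorm2d(c_out),
            nn.PReLU(c_out),
            nn.Conv2d(c_out, c_out, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
        )
        # 1x1 projection when the shape changes, so the shortcut can be added
        self.shortcut = (nn.Identity() if stride == 1 and c_in == c_out else
                         nn.Sequential(nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(c_out)))

    def forward(self, x):
        return self.body(x) + self.shortcut(x)
```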
The loss function of the face matching model in this embodiment follows the paper ArcFace: Additive Angular Margin Loss for Deep Face Recognition, and uses an additive angular margin loss to maximize the margin in angular space, so that the classification boundary has a clearer geometric interpretation than in cosine space. Let w_j be the weight of the FC layer and x_i the input of the FC layer, normalized so that

\|w_j\| = \|x_i\| = 1

(i.e., w_j and x_i are both unit vectors); the loss function of the face matching model is then expressed as:

L = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{e^{s\cos(\theta_{y_i}+t)}}{e^{s\cos(\theta_{y_i}+t)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}

wherein t is the angular margin controlling how tightly points of the same class aggregate, m is the batch size during training, s is a normalization (scale) parameter, and \theta_{y_i} is the angle between the input vector and the weight of its class on the hypersphere. Training the model with this loss function therefore causes the face matching model to map face feature vectors onto a unit hypersphere, as shown in fig. 13. Each point in fig. 13 represents a face feature vector on the unit hypersphere; it can be clearly seen that the face feature vectors of similar faces lie close together and are concentrated on the unit hypersphere, while the face feature vectors of different faces are distributed far apart.
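A compact PyTorch sketch of this additive angular margin loss follows; s = 64 and t = 0.5 are commonly used ArcFace defaults and are assumptions here, since the patent does not state the values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginLoss(nn.Module):
    """Additive angular margin loss: normalized weights and features,
    angular margin t added to the target angle, logits scaled by s."""
    def __init__(self, feat_dim, num_classes, s=64.0, t=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.t = s, t

    def forward(self, x, labels):
        # cos(theta_j) between each embedding and each class weight
        cosine = F.linear(F.normalize(x), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cosine)
        target = F.one_hot(labels, cosine.size(1)).bool()
        # add the angular margin only to the ground-truth class
        logits = torch.where(target, torch.cos(theta + self.t), cosine)
        return F.cross_entropy(self.s * logits, labels)
```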
S3, respectively inputting the collected face detection training data and face matching training data into the constructed face detection model and face matching model for training;
s31, training a face detection model;
For the face detection model, refer to the paper RetinaFace: Single-stage Dense Face Localisation in the Wild; the model is trained using the multi-task loss function described above.
S32, training a face matching model;
in order to improve the detection speed, the face matching model adopts a mixed precision training method:
(1) the FP32 weights are converted to FP16 for forward propagation; after the loss (kept in FP32 format) is obtained, the weight gradients are computed in FP16 and applied to the weights;
(2) after the FP32 loss is obtained, it is scaled up by a power of 2 as necessary and stored as FP16, and then back-propagated; when the weights are updated, the gradients are converted back to FP32 and rescaled to the original magnitude;
(3) multiplications use FP16 and additions (accumulations) use FP32 (this strategy is applied according to the network type);
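The scheme above can be sketched with PyTorch's automatic mixed precision, which implements the same ideas (FP16 forward pass, loss scaling before back-propagation, FP32 master weights); this is an illustrative substitute for the hand-rolled procedure described here, not the patent's own implementation.

```python
import torch

def train_step(model, batch, labels, optimizer, loss_fn, scaler):
    """One mixed-precision step: FP16 forward pass, scaled loss for the
    backward pass, FP32 master weights for the update (all delegated to
    torch.cuda.amp rather than done by hand)."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # forward in FP16 where safe
        loss = loss_fn(model(batch), labels)  # loss kept in FP32
    scaler.scale(loss).backward()             # scale the loss before backprop
    scaler.step(optimizer)                    # unscale grads, update FP32 weights
    scaler.update()                           # adjust the loss scale

# scaler = torch.cuda.amp.GradScaler()
```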
s4, carrying out face recognition on the monitoring video by using the trained face detection model and the face matching model;
s41, establishing a human face base;
s411, manually collecting frontal face pictures of the personnel to be enrolled; the collected frontal face pictures are required to be clear and complete, and preferably the face region occupies at least 1/5 of the whole picture (for example, if the collected frontal face picture is 1920 × 1080, the face region should be larger than 384 × 216).
S412, carrying out face detection on the collected front face image picture by using a near field depth face detection model to obtain a face region image;
(1) marking face key points in the front face picture by using a near-depth-of-field face detection model;
(2) using the marked human face key points to align the human face of the front face picture;
(3) cutting out images outside the face area in the front face picture after the face alignment to obtain a face area image;
s413, extracting a face feature vector from the face region image by using a face matching model;
s414, binding the face feature vector and the corresponding personnel information and storing the face feature vector into a face base;
s42, acquiring a monitoring video and cutting a near scene depth monitoring video and a far scene depth monitoring video from the monitoring video; the method of the step is the same as that of the step S112, the video frame picture of the monitoring video is divided into 9 areas according to the nine-square grid; referring to fig. 3, each of the 9 regions has the same size, generally, the regions 1 to 6 are far depth of field regions, and the regions 7 to 9 are near depth of field regions, and the specific division manner is adjusted according to the actual situation;
s43, cutting the near-depth-of-field monitoring video and the far-depth-of-field monitoring video into near-depth-of-field monitoring video frame images and far-depth-of-field monitoring video frame images; performing face detection on the near-depth-of-field and far-depth-of-field monitoring video frame images with the near-depth-of-field and far-depth-of-field face detection models respectively; extracting face key points from the near-depth-of-field and far-depth-of-field monitoring video frame images, performing face alignment with the face key points, and then cropping away the parts outside the face regions to obtain near-depth-of-field face region monitoring video frame images and far-depth-of-field face region monitoring video frame images; finally, resizing the near-depth-of-field and far-depth-of-field face region monitoring video frame images to 112 × 112.
S44, extracting a face feature vector to be matched from the near-depth face region monitoring video frame image and the far-depth face region monitoring video frame image by using a face matching model;
s45, matching the face feature vector to be matched with the face feature vectors in the face base to obtain the corresponding face matching result: the face matching model maps the face feature vector to be matched and the face feature vectors in the face base onto a unit hypersphere and calculates the cosine similarity between the face feature vector to be matched and each face feature vector in the face base; the cosine similarity ranges over [-1, 1], and the closer the calculated cosine similarity is to 1, the more similar the faces are; the personnel information corresponding to the face feature vector with the highest cosine similarity is the face matching result.
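A sketch of this matching step: with L2-normalized feature vectors the cosine similarity reduces to a dot product, and the identity with the highest similarity is returned; the acceptance threshold is an illustrative assumption, since the patent only states that the highest-similarity identity is taken as the result.

```python
import numpy as np

def match_face(query_vec, base_vecs, base_names, threshold=0.4):
    """Match one feature vector against the face base by cosine similarity."""
    query = query_vec / np.linalg.norm(query_vec)
    base = base_vecs / np.linalg.norm(base_vecs, axis=1, keepdims=True)
    sims = base @ query                      # cosine similarity in [-1, 1]
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return None, float(sims[best])       # no sufficiently similar identity
    return base_names[best], float(sims[best])
```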
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A face recognition method based on a monitoring video is characterized by comprising the following steps:
s1, collecting face detection training data and face matching training data;
s2, constructing a face detection model and a face matching model;
s3, respectively inputting the collected face detection training data and face matching training data into the constructed face detection model and face matching model for training;
and S4, performing face recognition on the monitoring video by using the trained face detection model and the face matching model.
2. The surveillance video-based face recognition method according to claim 1, wherein the process of collecting the face detection training data in step S1 is as follows:
s111, collecting face video materials of a plurality of different backgrounds, different angles and different time periods;
s112, dividing the video frame picture into 9 regions according to the nine-square grid for each face video material;
s113, combining 9 areas of each face video material according to the depth of field to obtain a near-scene depth video material and a far-scene depth video material;
s114, cutting the near field depth video material and the far field depth video material into a near field depth video frame image and a far field depth video frame image, and marking face regions in the near field depth video frame image and the far field depth video frame image respectively by using a marking tool, wherein the positions of all the marked face regions are the same, and the included parts are also the same;
s115, performing exception screening and data augmentation on the labeled face region to obtain video training data; the video training data comprises near depth of field face detection training data and far depth of field face detection training data.
3. The surveillance video-based face recognition method according to claim 2, wherein the method for constructing the face detection model in step S2 is as follows:
s211, constructing a multi-scale single-stage face detection model; the multi-scale single-stage face detection model comprises a backbone network model, a characteristic pyramid, a context module and a multi-task loss function;
s212, on the basis of the multi-scale single-stage face detection model, a near scene depth face detection model and a far scene depth face detection model are respectively expanded by adjusting the number of network layers of the backbone network model.
4. The surveillance video-based face recognition method of claim 3, wherein the backbone network structure of the near-depth-of-field face detection model is: Conv s2 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> 5 × (Conv dw s1 -> Conv s1) -> Conv dw s2 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Avg Pool s1 -> FC s1 -> Softmax s1; wherein Conv denotes an ordinary convolutional layer, Conv dw denotes a depthwise separable convolutional layer, Avg Pool denotes an average pooling layer, and FC denotes a fully connected layer; a ReLU layer and a batch normalization layer BN follow each convolutional layer, a batch normalization layer BN follows the fully connected layer FC, and s1 and s2 denote strides of 1 and 2 respectively.
5. The surveillance video-based face recognition method of claim 3, wherein the backbone network structure of the far-depth-of-field face detection model is: Conv s2 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> Conv dw s1 -> Conv s1 -> Conv dw s2 -> Conv s1 -> 7 × (Conv dw s1 -> Conv s1) -> Conv dw s2 -> Conv s1 -> Avg Pool s1 -> FC s1 -> Softmax s1; wherein Conv denotes an ordinary convolutional layer, Conv dw denotes a depthwise separable convolutional layer, Avg Pool denotes an average pooling layer, and FC denotes a fully connected layer; a ReLU activation function layer and a batch normalization layer BN follow each convolutional layer, a batch normalization layer BN follows the fully connected layer FC, and s1 and s2 denote strides of 1 and 2 respectively.
6. The surveillance video-based face recognition method according to claim 3, wherein the process of collecting the face matching training data in step S1 is as follows:
s121, collecting a plurality of frontal face pictures with different backgrounds, different angles and different illuminations, and recording the name of each person in the collection process;
s122, marking face key points in the frontal face pictures by using a face detection algorithm, aligning the faces in the frontal face pictures by using the marked face key points, cropping away the image outside the face region in each aligned frontal face picture to obtain a face region image, and finally resizing the face region image to 112 × 112 and classifying the images by name;
and S123, screening and data augmentation are carried out on the classified face region images to obtain face matching training data.
7. The face recognition method based on monitoring video of any one of claims 3-6, characterized in that the method for constructing the face matching model in step S2 is to improve on the ResNet-34 network to obtain the face matching model network structure; the face matching model network structure comprises 6 computing units Conv 1, Conv 2.x, Conv 3.x, Conv 4.x, Conv 5.x and Conv 6; wherein Conv 1 is a 3 × 3 convolutional layer Conv (stride = 1); Conv 2.x comprises a max pooling layer Maxpool and a residual block, Conv 3.x, Conv 4.x and Conv 5.x are residual blocks, and the network structure of the residual blocks is a BN -> Conv -> BN -> PReLU -> Conv -> BN structure, where BN denotes a batch normalization layer and PReLU denotes a PReLU activation function layer; Conv 6 is a BN -> Dropout -> FC -> BN structure, where the Dropout layer randomly disables some neural nodes during training to prevent overfitting, and the FC layer is a fully connected layer.
8. The surveillance video-based face recognition method according to claim 7, wherein the step S4 includes the steps of:
s41, establishing a human face base;
s411, manually collecting frontal face pictures of the personnel to be enrolled;
s412, carrying out face detection on the collected front face image picture by using a near field depth face detection model to obtain a face region image;
(1) marking face key points in the front face picture by using a near-depth-of-field face detection model;
(2) using the marked human face key points to align the human face of the front face picture;
(3) cutting out images outside the face area in the front face picture after the face alignment to obtain a face area image;
s413, extracting a face feature vector from the face region image by using a face matching model;
s414, binding the face feature vector and the corresponding personnel information and storing the face feature vector into a face base;
s42, acquiring a monitoring video and cutting a near scene depth monitoring video and a far scene depth monitoring video from the monitoring video;
s43, cutting the near-depth-of-field monitoring video and the far-depth-of-field monitoring video into near-depth-of-field monitoring video frame images and far-depth-of-field monitoring video frame images; performing face detection on the near-depth-of-field and far-depth-of-field monitoring video frame images with the near-depth-of-field and far-depth-of-field face detection models respectively; extracting face key points from the near-depth-of-field and far-depth-of-field monitoring video frame images, performing face alignment with the face key points, and then cropping away the parts outside the face regions to obtain near-depth-of-field face region monitoring video frame images and far-depth-of-field face region monitoring video frame images; finally, resizing the near-depth-of-field and far-depth-of-field face region monitoring video frame images to 112 × 112;
s44, extracting a face feature vector to be matched from the near-depth face region monitoring video frame image and the far-depth face region monitoring video frame image by using a face matching model;
and S45, matching the face feature vector to be matched with the face feature vectors in the face base to obtain the corresponding face matching result.
9. The surveillance video-based face recognition method according to claim 8, wherein the face region in the frontal face picture collected in step S411 occupies at least 1/5 of the whole picture.
10. The surveillance video-based face recognition method according to claim 8 or 9, wherein the matching method in step S45 is: the face matching model maps the face feature vector to be matched and the face feature vectors in the face base onto a unit hypersphere and calculates the cosine similarity between the face feature vector to be matched and each face feature vector in the face base; the cosine similarity ranges over [-1, 1], and the closer the calculated cosine similarity is to 1, the more similar the faces are; the personnel information corresponding to the face feature vector with the highest cosine similarity is the face matching result.
CN202011501108.0A 2020-12-18 2020-12-18 Face recognition method based on monitoring video Pending CN112613385A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011501108.0A CN112613385A (en) 2020-12-18 2020-12-18 Face recognition method based on monitoring video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011501108.0A CN112613385A (en) 2020-12-18 2020-12-18 Face recognition method based on monitoring video

Publications (1)

Publication Number Publication Date
CN112613385A true CN112613385A (en) 2021-04-06

Family

ID=75240980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011501108.0A Pending CN112613385A (en) 2020-12-18 2020-12-18 Face recognition method based on monitoring video

Country Status (1)

Country Link
CN (1) CN112613385A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420585A (en) * 2021-04-21 2021-09-21 广州晟烨信息科技股份有限公司 Face acquisition and recognition method, system and storage medium
CN113536886A (en) * 2021-04-07 2021-10-22 广州晟烨信息科技股份有限公司 Face collection feature extraction method, system and storage medium
CN113936312A (en) * 2021-10-12 2022-01-14 南京视察者智能科技有限公司 Face recognition base screening method based on deep learning graph convolution network
CN114037714A (en) * 2021-11-02 2022-02-11 大连理工大学人工智能大连研究院 3D MR and TRUS image segmentation method for prostate system puncture
WO2024021504A1 (en) * 2022-07-29 2024-02-01 成都云天励飞技术有限公司 Facial recognition model training method and apparatus, recognition method, and device and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875515A (en) * 2017-12-11 2018-11-23 北京旷视科技有限公司 Face identification method, device, system, storage medium and capture machine
CN111738099A (en) * 2020-05-30 2020-10-02 华南理工大学 Face automatic detection method based on video image scene understanding
CN111783725A (en) * 2020-07-14 2020-10-16 珠海市卓轩科技有限公司 Face recognition method, face recognition device and storage medium
CN112084904A (en) * 2020-08-26 2020-12-15 武汉普利商用机器有限公司 Face searching method, device and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875515A (en) * 2017-12-11 2018-11-23 北京旷视科技有限公司 Face identification method, device, system, storage medium and capture machine
CN111738099A (en) * 2020-05-30 2020-10-02 华南理工大学 Face automatic detection method based on video image scene understanding
CN111783725A (en) * 2020-07-14 2020-10-16 珠海市卓轩科技有限公司 Face recognition method, face recognition device and storage medium
CN112084904A (en) * 2020-08-26 2020-12-15 武汉普利商用机器有限公司 Face searching method, device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIANKANG DENG et al.: "ArcFace: Additive Angular Margin Loss for Deep Face Recognition", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》, 20 June 2019 (2019-06-20), pages 4685 - 4694, XP055708085, DOI: 10.1109/CVPR.2019.00482 *
JIANKANG DENG et al.: "Single-stage Joint Face Detection and Alignment", 《2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOP》, 28 October 2019 (2019-10-28), pages 1836 - 1839, XP033732617, DOI: 10.1109/ICCVW.2019.00228 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536886A (en) * 2021-04-07 2021-10-22 广州晟烨信息科技股份有限公司 Face collection feature extraction method, system and storage medium
CN113420585A (en) * 2021-04-21 2021-09-21 广州晟烨信息科技股份有限公司 Face acquisition and recognition method, system and storage medium
CN113936312A (en) * 2021-10-12 2022-01-14 南京视察者智能科技有限公司 Face recognition base screening method based on deep learning graph convolution network
CN113936312B (en) * 2021-10-12 2024-06-07 南京视察者智能科技有限公司 Face recognition base screening method based on deep learning graph convolution network
CN114037714A (en) * 2021-11-02 2022-02-11 大连理工大学人工智能大连研究院 3D MR and TRUS image segmentation method for prostate system puncture
CN114037714B (en) * 2021-11-02 2024-05-24 大连理工大学人工智能大连研究院 3D MR and TRUS image segmentation method for prostate system puncture
WO2024021504A1 (en) * 2022-07-29 2024-02-01 成都云天励飞技术有限公司 Facial recognition model training method and apparatus, recognition method, and device and medium

Similar Documents

Publication Publication Date Title
CN112613385A (en) Face recognition method based on monitoring video
US12020474B2 (en) Image processing apparatus, image processing method, and non-transitory computer-readable storage medium
CN111639692B (en) Shadow detection method based on attention mechanism
Abbass et al. A survey on online learning for visual tracking
CN108596053B (en) Vehicle detection method and system based on SSD and vehicle posture classification
US6915025B2 (en) Automatic image orientation detection based on classification of low-level image features
CN111210443A (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
US20030161504A1 (en) Image recognition system and recognition method thereof, and program
WO2011065952A1 (en) Face recognition apparatus and methods
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
CN102147858A (en) License plate character identification method
CN114663707A (en) Improved few-sample target detection method based on fast RCNN
CN116468740A (en) Image semantic segmentation model and segmentation method
CN114492634B (en) Fine granularity equipment picture classification and identification method and system
Alsanad et al. Real-time fuel truck detection algorithm based on deep convolutional neural network
Song et al. Pointwise CNN for 3d object classification on point cloud
JP2013101423A (en) Image matching device and image matching program
CN110633631A (en) Pedestrian re-identification method based on component power set and multi-scale features
Dalara et al. Entity Recognition in Indian Sculpture using CLAHE and machine learning
CN117036658A (en) Image processing method and related equipment
CN114758135A (en) Unsupervised image semantic segmentation method based on attention mechanism
CN115880740A (en) Face living body detection method and device, computer equipment and storage medium
CN113095185A (en) Facial expression recognition method, device, equipment and storage medium
CN111783683A (en) Human body detection method based on feature balance and relationship enhancement

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination