CN116798100A - Face video detection method and device

Face video detection method and device

Info

Publication number
CN116798100A
Authority
CN
China
Prior art keywords
face video
identified
image frames
image frame
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310863404.2A
Other languages
Chinese (zh)
Inventor
Luo Man (罗曼)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310863404.2A
Publication of CN116798100A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30201 Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present specification provide a method and apparatus for detecting face videos. The method comprises the following steps: obtaining a face video to be identified; calculating motion information of at least two image frames in the face video to be identified by using a sparse optical flow algorithm; determining suspected risk image frames in the face video to be identified according to the motion information of the at least two image frames; extracting feature information of the suspected risk image frames in at least two modes respectively; and judging whether the face video to be identified is a fake face video according to the feature information of the suspected risk image frames in the at least two modes. According to the embodiments of the present specification, the efficiency of face video detection can be improved and the personal privacy of users protected.

Description

Face video detection method and device
Technical Field
One or more embodiments of the present disclosure relate to electronic information technology, and in particular, to a method and apparatus for detecting a face video.
Background
Face recognition is an important biometric technology, widely applied to identity authentication in fields such as banking, hospitality, and transportation. In recent years, various attacks against face recognition systems have emerged; in particular, the harm caused by deepfake videos has grown. A deepfake video means that an attacker uses a deepfake tool to synthesize a fake face-scanning video to attack an existing face recognition system, seriously threatening users' personal privacy and financial security. It is therefore necessary to detect a face video to determine whether it is a forged face video.
At present, most face video detection methods take all the face image frames as input and detect them frame by frame or by uniform frame sampling, and the features used are generally limited to the RGB information of the images. As a result, the whole detection process is time-consuming, inefficient, untargeted, and of low accuracy.
Therefore, a more effective method for detecting face videos is needed.
Disclosure of Invention
One or more embodiments of the present disclosure describe a method and an apparatus for detecting a face video, which can more effectively detect a fake face video, thereby improving the security of face recognition.
According to a first aspect, there is provided a method for detecting a face video, wherein the method includes:
obtaining a face video to be identified;
calculating motion information of at least two image frames in the face video to be identified by using a sparse optical flow algorithm;
determining suspected risk image frames in the face video to be identified according to the motion information of the at least two image frames in the face video to be identified;
extracting feature information of the suspected risk image frames in at least two modes respectively;
and judging whether the face video to be identified is a fake face video according to the feature information of the suspected risk image frames in the at least two modes.
The determining suspected risk image frames in the face video to be identified according to the motion information of at least two image frames in the face video to be identified comprises the following steps:
extracting motion information of every two adjacent image frames of the face video to be identified;
obtaining velocity vectors of the pixel points in each image frame according to the motion information of every two adjacent image frames;
and for each image frame in the face video to be identified, determining whether the image frame is a suspected risk image frame in the face video to be identified according to the velocity vector of each pixel point in the image frame and the average velocity vector of the pixel points around it.
The method further comprises: dividing a human face into at least two key regions;
the determining whether the image frame is a suspected risk image frame in the face video to be identified according to the velocity vector of each pixel point in the image frame and the average velocity vector of the pixel points around it comprises:
for each key region included in one image frame, obtaining a difference between the velocity vector of each pixel point in the key region and the average velocity vector of all other pixel points in the key region; obtaining an artifact risk index corresponding to the key region according to the differences of the pixel points in the key region;
and determining whether the image frame is a suspected risk image frame in the face video to be identified according to the artifact risk indexes corresponding to the key regions in the image frame.
The obtaining a difference between the velocity vector of each pixel point in the key region and the average velocity vector of all other pixel points in the key region, and obtaining an artifact risk index corresponding to the key region according to the differences of the pixel points in the key region, comprises:
for each key region, calculating the artifact risk index of the key region as
$$r = \frac{1}{K}\sum_{i=1}^{K}\angle\Big(V_i,\ \frac{1}{K-1}\sum_{j\ne i}V_j\Big)$$
wherein r is the artifact risk index of the key region; i and j are index variables; K is the number of pixel points in the key region; V_i is the velocity vector of the i-th pixel point in the key region; V_j is the velocity vector of the j-th pixel point in the key region; and ∠ denotes the angle between two vectors.
The determining whether the image frame is a suspected risk image frame in the face video to be identified according to the artifact risk indexes corresponding to the key regions in the image frame comprises:
for one image frame, judging whether the artifact risk index corresponding to at least one key region of the image frame is larger than a preset index threshold corresponding to that key region, and if so, determining that the image frame is a suspected risk image frame in the face video to be identified.
Wherein the at least two key regions comprise at least two of: a left eye region, a right eye region, a nose region, a mouth region, and a face contour region.
The feature information of the suspected risk image frames in the at least two modes comprises: at least two of features of multi-frame motion information, single-frame frequency-domain features, and single-frame spatial-domain features.
The extracting feature information of the suspected risk image frames in at least two modes respectively comprises at least two of the following:
extracting features of multi-frame motion information using a FlowNet2 optical flow model or the calcOpticalFlowPyrLK operator;
calculating single-frame frequency-domain features using a DCT transform or an FFT operator;
and obtaining single-frame spatial-domain features by using RGB color information directly, by converting to the YUV color space, or by using image features filtered with the Sobel operator.
According to a second aspect, there is provided a detection apparatus for face video, wherein the apparatus comprises:
the input module is configured to obtain a face video to be identified;
the sparse optical flow algorithm module is configured to calculate motion information of at least two image frames in the face video to be identified by using a sparse optical flow algorithm;
the to-be-detected sequence generation module is configured to determine suspected risk image frames in the to-be-identified face video according to motion information of at least two image frames in the to-be-identified face video, and form a to-be-detected sequence by using the suspected risk image frames;
the feature information extraction module is configured to respectively extract feature information of the suspected risk image frames under at least two modes;
the judging module is configured to judge whether the face video to be identified is a fake face video or not according to the characteristic information of the suspected risk image frames under at least two modes.
According to a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method according to any of the embodiments of the present description.
According to a fourth aspect, there is provided a computing device comprising a memory having executable code stored therein and a processor which, when executing the executable code, implements a method as described in any of the embodiments of the present specification.
The face video detection method and apparatus provided by the embodiments of the present disclosure have at least the following beneficial effects, singly or in combination:
1. In the embodiments of the present specification, rather than taking all image frames as the objects to be detected and performing frame-by-frame detection or uniform frame-sampling detection, the suspected risk image frames in the face video are determined first, and only this subset of the frames, namely the suspected risk image frames, is used for feature extraction and detection. This greatly shortens the time consumed by the detection process and improves efficiency; and because only the suspected risk image frames are detected, the detection is targeted and more accurate.
2. In the embodiments of the present disclosure, feature extraction and detection do not rely on single-image RGB information alone but use multi-mode feature information, that is, detection from multiple dimensions, which further improves the detection accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings required by the embodiments or the description of the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present specification; a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a system architecture to which one embodiment of the present description applies.
Fig. 2 is a flowchart of a method for detecting a face video in an embodiment of the present disclosure.
Fig. 3 is a schematic diagram of a face video detection method according to another embodiment of the present disclosure.
Fig. 4 is a flowchart of a method for determining a suspected risk image frame in a face video to be identified in one embodiment of the present description.
Fig. 5 is a schematic structural diagram of a face video detection device according to an embodiment of the present disclosure.
Detailed Description
The following describes the scheme provided in the present specification with reference to the drawings.
It is first noted that the terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may represent: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
For ease of understanding the methods provided in this specification, the system architecture to which this specification relates and applies is described first. As shown in fig. 1, the system architecture mainly includes three network nodes: a terminal device, a face video acquisition device, and a face video detection device.
The face video acquisition device is arranged at the side of the terminal equipment and is used for acquiring face videos and transmitting the acquired face videos to the face video detection device.
The face video detection device is used for detecting the received face video to be identified and determining whether the face video is a fake face video or not.
Wherein the terminal device depends on the business scenario. For example, in an access control detection business scenario (such as a company gate, where it must be decided whether a person is allowed to pass), the terminal device may be a device provided at the access control point. As another example, in a business scenario of in-store supermarket shopping, the terminal device may be a point-of-sale (POS) machine. In a business scenario of online shopping, the terminal device may be a mobile phone or a computer. And in a business scenario of smart home, the terminal device may be a smart home device such as a smart door lock, a smart socket, a smart light, or a smart camera.
The face video acquisition device may be a camera or the like, provided independently or arranged on the terminal device.
The face video detection device may be provided in a server, and the server may be a single server or a server group formed by a plurality of servers.
It should be understood that the numbers of terminal devices, face video acquisition devices, and face video detection devices in fig. 1 are merely illustrative. Any number may be deployed as required by the implementation.
Fig. 2 is a flowchart of a method for detecting a face video in an embodiment of the present disclosure. The execution subject of the method is the face video detection device. It will be appreciated that the method may be performed by any apparatus, device, platform, or device cluster having computing and processing capabilities. Referring to fig. 2, the method includes:
step 201: and obtaining the face video to be identified.
Step 203: and calculating the motion information of at least two image frames in the face video to be recognized by using a sparse optical flow algorithm.
Step 205: and determining suspected risk image frames in the face video to be identified according to the motion information of at least two image frames in the face video to be identified.
Step 207: and respectively extracting the characteristic information of the suspected risk image frames under at least two modes.
Step 209: and judging whether the face video to be recognized is a fake face video or not according to the characteristic information of the suspected risk image frames under at least two modes.
According to the flow shown in fig. 2, in the embodiments of the present disclosure, rather than taking all image frames as the objects to be detected and performing frame-by-frame detection or uniform frame-sampling detection, the suspected risk image frames in the face video are determined first, and only this subset of the frames, namely the suspected risk image frames, is used for feature extraction and detection. This greatly shortens the time consumed by the detection process, improves efficiency, and, because only the suspected risk image frames are detected, yields higher accuracy.
In addition, in the embodiments of the present specification, feature extraction and detection do not rely on single-image RGB information alone but use multi-mode feature information, that is, detection from multiple dimensions, which further improves the detection accuracy.
The flow shown in fig. 2 is illustrated below in conjunction with fig. 3.
First, step 201: obtaining a face video to be identified.
As described above, the method of the embodiments of the present disclosure may be applied to a variety of different business scenarios. For example, it can be applied to an access control detection scenario (such as a company gate, where it must be decided whether a person is allowed to pass); in this case the face video to be identified is obtained from the access control scenario, and whether the current person is allowed to pass is determined according to the detection result. As another example, it can be applied to in-store supermarket shopping; in this case the face video to be identified is acquired by the POS machine, and whether the person in front of the POS machine is allowed to complete payment is determined according to the detection result. As yet another example, it can be applied to a smart home scenario, such as a smart remote controller that acquires the face video to be identified, so as to determine, according to the detection result, whether the current person is allowed to use the remote controller to open a cabinet door.
Referring to fig. 1, on the terminal device side, after the face video acquisition device acquires the face video to be identified, it uploads the acquired face video to the face video detection device in the cloud.
The processing of the subsequent steps may be performed by the face video detection apparatus.
Step 203: calculating motion information of at least two image frames in the face video to be identified by using a sparse optical flow algorithm.
Optical flow refers to the movement of a target pixel between two consecutive image frames, caused by the movement of an object in the scene or the movement of the camera. It is a two-dimensional vector field representing the displacement of each point from the first frame to the second. Optical flow methods rest on the following assumptions: the pixel intensities of a scene are essentially constant between adjacent frames, and adjacent pixels have similar motion.
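As background (this formulation is standard optical flow theory rather than part of the original disclosure), the constancy assumption says that a pixel keeps its intensity as it moves:

$$I(x, y, t) = I(x + \Delta x,\ y + \Delta y,\ t + \Delta t),$$

and a first-order Taylor expansion yields the optical flow constraint equation

$$I_x u + I_y v + I_t = 0,$$

where $(u, v)$ is the velocity vector of the pixel and $I_x$, $I_y$, $I_t$ are the partial derivatives of the image intensity. One equation cannot determine the two unknowns $(u, v)$; the Lucas-Kanade method mentioned below resolves this by assuming all pixels in a small window share the same motion and solving the resulting overdetermined system by least squares.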
In the embodiments of the present specification, the principle of the sparse optical flow algorithm is used to detect the authenticity of a face video. The reason is that a forged face video generally contains artifacts; for example, when a face image of a legitimate user is re-captured or re-synthesized to forge a face video, artifacts appear in the resulting video. Artifact risk is therefore a condition that commonly arises when a face is tampered with or faked, and the embodiments of the present specification exploit this condition for detection. Since a sparse optical flow algorithm can calculate the motion information from one frame to the next in the face video to be identified, in step 203 it can be used to calculate the velocity vector of each pixel point in each image frame of the face video to be identified.
The sparse optical flow algorithm used in step 203 may be the Lucas-Kanade method.
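As an illustrative sketch (not part of the original disclosure), the sparse optical flow between two adjacent frames can be computed with OpenCV's pyramidal Lucas-Kanade implementation, i.e. the calcOpticalFlowPyrLK operator named later in this description; the corner-selection parameters below are assumptions:

```python
import cv2
import numpy as np

def sparse_flow(prev_gray, next_gray):
    """Track sparse feature points between two grayscale uint8 frames and
    return their positions plus per-point velocity vectors (pixels/frame)."""
    # Pick corner-like points to track; maxCorners/qualityLevel are illustrative.
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=5)
    if pts is None:
        return np.empty((0, 2)), np.empty((0, 2))
    # Pyramidal Lucas-Kanade tracking of the selected points.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    ok = status.ravel() == 1
    p0 = pts.reshape(-1, 2)[ok]
    p1 = nxt.reshape(-1, 2)[ok]
    return p0, p1 - p0  # point positions and their velocity vectors
```

Running this over every pair of adjacent frames yields the motion information consumed by step 205 below.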
Next, step 205: determining suspected risk image frames in the face video to be identified according to the motion information of the at least two image frames.
In the embodiment of the present specification, the process of the present step 205 may include:
extracting motion information of every two adjacent image frames of the face video to be identified;
obtaining velocity vectors of the pixel points in each image frame according to the motion information of every two adjacent image frames;
and for each image frame in the face video to be identified, determining whether the image frame is a suspected risk image frame in the face video to be identified according to the velocity vector of each pixel point in the image frame and the average velocity vector of the pixel points around it.
To make the detection more targeted, it can be performed per partition according to the characteristics of different regions of the face. Thus, in one embodiment of the present disclosure, implementing partition detection may further include: dividing the face into at least two key regions, such as at least two of a left eye region, a right eye region, a nose region, a mouth region, and a face contour region.
Accordingly, in step 205, it is determined whether an image frame is a suspected risk image frame in the face video to be identified according to the velocity vector of each pixel point in a key region of the image frame and the average velocity vector of all other pixel points in that key region. If the velocity vector of a pixel point deviates from the average velocity vector of all the other pixel points in the surrounding key region by more than a threshold, the image frame has a high artifact risk and is determined to be a suspected risk image frame in the face video to be identified.
The face video to be identified is composed of a plurality of image frames, for example 1000 image frames, i.e. 1000 face images. Through the process of step 205, high-risk image frames can be located among these 1000 image frames; for example, 20 suspected risk image frames are located and formed into a sequence to be detected. Features are then extracted from this sequence of 20 suspected risk image frames, and the extracted features are detected.
Following the above partition detection scheme, referring to fig. 4, in one embodiment of the present disclosure, the process of step 205 includes:
step 401: for each key region included in one image frame, obtaining a difference between a speed vector of each pixel point in the key region and an average speed vector of all other pixel points in the key region;
step 403: obtaining an artifact risk index corresponding to the key region according to each difference of each pixel point in the key region;
step 405: and determining whether the image frame is a suspected risk image frame in the face video to be identified according to each artifact risk index corresponding to each key region in the image frame.
In the flow of fig. 4 described above, the gap between two vectors may be represented by the angle between them. In this case, in one embodiment of the present disclosure, one implementation of the foregoing steps 401 and 403 includes:
for each key region, calculating the artifact risk index of the key region as
$$r = \frac{1}{K}\sum_{i=1}^{K}\angle\Big(V_i,\ \frac{1}{K-1}\sum_{j\ne i}V_j\Big)$$
wherein r is the artifact risk index of the key region; i and j are index variables; K is the number of pixel points in the key region; V_i is the velocity vector of the i-th pixel point in the key region; V_j is the velocity vector of the j-th pixel point in the key region; and ∠ denotes the angle between two vectors.
accordingly, the implementation process of step 405 includes:
for one image frame, judging whether the artifact risk index corresponding to at least one key region of the image frame is larger than a preset index threshold corresponding to that key region, and if so, determining that the image frame is a suspected risk image frame in the face video to be identified.
For example, for the first frame in the face video to be identified, the artifact risk index r1 of the left eye region, the artifact risk index r2 of the right eye region, the artifact risk index r3 of the nose region, the artifact risk index r4 of the mouth region, and the artifact risk index r5 of the face contour region are calculated. If any one of these 5 artifact risk indexes is greater than the index threshold corresponding to its key region, the first frame may be determined in step 405 to be a suspected risk image frame and added to the sequence to be detected. The same processing is performed for each subsequent frame in the face video to be identified; whenever a frame is determined to be a suspected risk image frame, it is added to the sequence to be detected.
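The following is a minimal sketch of this per-region screening (not from the original disclosure; the arccos-based angle measure and the uniform averaging are assumptions consistent with the description above):

```python
import numpy as np

def artifact_risk_index(V):
    """V: (K, 2) array of velocity vectors for the K pixel points of one key
    region. Returns r: the mean angle between each vector and the average of
    all the other vectors in the region."""
    V = np.asarray(V, dtype=float)
    K = len(V)
    total = V.sum(axis=0)
    r = 0.0
    for i in range(K):
        mean_others = (total - V[i]) / (K - 1)   # average of V_j for j != i
        cos = np.dot(V[i], mean_others) / (
            np.linalg.norm(V[i]) * np.linalg.norm(mean_others) + 1e-8)
        r += np.arccos(np.clip(cos, -1.0, 1.0))  # angle as the vector "gap"
    return r / K

def is_suspected_frame(region_vectors, thresholds):
    """region_vectors: dict mapping region name -> (K, 2) velocity vectors.
    thresholds: dict mapping region name -> preset index threshold.
    A frame is suspected if any region's risk index exceeds its threshold."""
    return any(artifact_risk_index(V) > thresholds[name]
               for name, V in region_vectors.items() if len(V) > 1)
```

Frames flagged in this way would then be appended to the sequence to be detected.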
Next, step 207: extracting feature information of the suspected risk image frames in at least two modes respectively.
In an embodiment of the present disclosure, the feature information of the suspected risk image frames in at least two modes includes: at least two of features of multi-frame motion information, single-frame frequency-domain features, and single-frame spatial-domain features.
In this step 207, the feature information of the suspected risk image frames is extracted in at least two modes, including at least two of the following:
extracting features of multi-frame motion information using a FlowNet2 optical flow model or the calcOpticalFlowPyrLK operator;
calculating single-frame frequency-domain features using a DCT transform or an FFT operator;
and obtaining single-frame spatial-domain features by using RGB color information directly, by converting to the YUV color space, or by using image features filtered with the Sobel operator, as illustrated in the sketch below.
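As an illustrative sketch of two of these feature types using OpenCV (the disclosure does not fix the exact transforms or preprocessing, so these choices are assumptions):

```python
import cv2
import numpy as np

def frequency_features(gray):
    """Single-frame frequency-domain features via the 2-D DCT.
    cv2.dct requires an even-sized float array, so crop if needed."""
    h, w = gray.shape
    f = np.float32(gray[:h // 2 * 2, :w // 2 * 2]) / 255.0
    return cv2.dct(f)

def spatial_features(bgr):
    """Single-frame spatial-domain features: YUV color channels plus a
    Sobel-filtered edge-magnitude map, stacked along the channel axis."""
    yuv = cv2.cvtColor(bgr, cv2.COLOR_BGR2YUV).astype(np.float32)
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    edges = cv2.magnitude(gx, gy)
    return np.dstack([yuv, edges])
```

The multi-frame motion features could come from the sparse_flow sketch above or from a pretrained FlowNet2 model.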
Next, step 209: judging whether the face video to be identified is a fake face video according to the feature information of the suspected risk image frames in the at least two modes.
In step 209, the feature information of the suspected risk image frames in the at least two modes is fused in series along the channel direction, and the fused features are input into a pre-trained detection model for recognition.
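A sketch of the channel-direction fusion follows (the detection model itself is not specified in the disclosure, so `detector` below is a hypothetical placeholder):

```python
import numpy as np

def fuse_modalities(feature_maps):
    """Concatenate per-modality feature maps of shape (H, W, C_k), or (H, W),
    in series along the channel axis, yielding an (H, W, sum(C_k)) tensor."""
    maps = [m[..., np.newaxis] if m.ndim == 2 else m for m in feature_maps]
    return np.concatenate(maps, axis=-1)

# Hypothetical usage: the fused tensor feeds a pre-trained binary classifier.
# fused = fuse_modalities([motion_feat, frequency_feat, spatial_feat])
# score = detector.predict(fused[np.newaxis])  # 'detector' is assumed
# is_fake = score > 0.5
```

Note that fusion by concatenation requires the modality feature maps to share the same spatial size, so in practice each would first be resized to a common resolution.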
In an embodiment of the present disclosure, a face video detection apparatus is provided, see fig. 5, where the apparatus includes:
an input module 501 configured to obtain a face video to be identified;
the sparse optical flow algorithm module 502 is configured to calculate motion information of at least two image frames in the face video to be identified by using a sparse optical flow algorithm;
the to-be-detected sequence generating module 503 is configured to determine suspected risk image frames in the to-be-identified face video according to motion information of at least two image frames in the to-be-identified face video, and form a to-be-detected sequence by using each suspected risk image frame;
the feature information extracting module 504 is configured to extract feature information of the suspected risk image frames under at least two modes respectively;
the judging module 505 is configured to judge whether the face video to be identified is a fake face video according to the feature information of the suspected risk image frames in at least two modes.
In an embodiment of the apparatus of the present specification, the to-be-detected sequence generation module 503 is configured to perform:
extracting motion information of every two adjacent image frames of the face video to be identified;
obtaining velocity vectors of the pixel points in each image frame according to the motion information of every two adjacent image frames;
and for each image frame in the face video to be identified, determining whether the image frame is a suspected risk image frame in the face video to be identified according to the velocity vector of each pixel point in the image frame and the average velocity vector of the pixel points around it.
In one embodiment of the apparatus of the present specification, the face is divided into at least two key regions;
accordingly, the to-be-detected sequence generation module 503 is configured to perform: for each key region included in one image frame, obtaining a difference between the velocity vector of each pixel point in the key region and the average velocity vector of all other pixel points in the key region; obtaining an artifact risk index corresponding to the key region according to the differences of the pixel points in the key region; and determining whether the image frame is a suspected risk image frame in the face video to be identified according to the artifact risk indexes corresponding to the key regions in the image frame.
In an embodiment of the apparatus of the present specification, the to-be-detected sequence generation module 503 is configured to perform:
for each key region, calculating the artifact risk index of the key region as
$$r = \frac{1}{K}\sum_{i=1}^{K}\angle\Big(V_i,\ \frac{1}{K-1}\sum_{j\ne i}V_j\Big)$$
wherein r is the artifact risk index of the key region; i and j are index variables; K is the number of pixel points in the key region; V_i is the velocity vector of the i-th pixel point in the key region; V_j is the velocity vector of the j-th pixel point in the key region; and ∠ denotes the angle between two vectors;
for one image frame, judging whether an artifact risk index corresponding to at least one key area of the image frame is larger than a preset index threshold corresponding to the key area, and if so, determining that the image frame is a suspected risk image frame in the face video to be identified.
In one embodiment of the apparatus of the present specification, the at least two key regions comprise at least two of: a left eye region, a right eye region, a nose region, a mouth region, and a face contour region.
In one embodiment of the apparatus of the present specification, the feature information of the suspected risk image frames in at least two modes comprises: at least two of features of multi-frame motion information, single-frame frequency-domain features, and single-frame spatial-domain features.
In one embodiment of the apparatus of the present specification, the feature information extraction module 504 is configured to perform at least two of:
extracting features of multi-frame motion information using a FlowNet2 optical flow model or the calcOpticalFlowPyrLK operator;
calculating single-frame frequency-domain features using a DCT transform or an FFT operator;
and obtaining single-frame spatial-domain features by using RGB color information directly, by converting to the YUV color space, or by using image features filtered with the Sobel operator.
In one embodiment of the apparatus of the present specification, the judging module 505 is configured to perform:
fusing the feature information of the suspected risk image frames in the at least two modes in series along the channel direction, and inputting the fused features into a pre-trained detection model for recognition.
The above-described devices are usually implemented at the server side, and may be provided in separate servers, or a combination of some or all of the devices may be provided in the same server. The server can be a single server or a server cluster consisting of a plurality of servers, and the server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system. The above devices may also be implemented in a computer terminal having a relatively high computing power.
An embodiment of the present specification provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of the embodiments of the specification.
An embodiment of the present specification provides a computing device including a memory having executable code stored therein and a processor that, when executing the executable code, performs a method of any of the embodiments of the present specification.
It should be understood that the structures illustrated in the embodiments of the present specification do not constitute a particular limitation on the apparatus of the embodiments of the present specification. In other embodiments of the specification, the apparatus may include more or less components than illustrated, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code.
The foregoing embodiments further illustrate the objectives, technical solutions, and advantageous effects of the present invention in detail. They are not intended to limit the scope of the invention; any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention shall fall within the scope of protection of the invention.

Claims (10)

1. A method for detecting a face video, comprising:
obtaining a face video to be identified;
calculating motion information of at least two image frames in the face video to be identified by using a sparse optical flow algorithm;
determining suspected risk image frames in the face video to be identified according to the motion information of the at least two image frames in the face video to be identified;
extracting feature information of the suspected risk image frames in at least two modes respectively;
and judging whether the face video to be identified is a fake face video according to the feature information of the suspected risk image frames in the at least two modes.
2. The method according to claim 1, wherein the determining the suspected risk image frame in the face video to be identified according to the motion information of at least two image frames in the face video to be identified comprises:
extracting motion information of every two adjacent image frames of the face video to be identified;
obtaining velocity vectors of the pixel points in each image frame according to the motion information of every two adjacent image frames;
and for each image frame in the face video to be identified, determining whether the image frame is a suspected risk image frame in the face video to be identified according to the velocity vector of each pixel point in the image frame and the average velocity vector of the pixel points around it.
3. The method of claim 2, wherein the method further comprises: dividing a human face into at least two key regions;
the determining whether the image frame is a suspected risk image frame in the face video to be identified according to the velocity vector of each pixel point in the image frame and the average velocity vector of the pixel points around it comprises:
for each key region included in one image frame, obtaining a difference between the velocity vector of each pixel point in the key region and the average velocity vector of all other pixel points in the key region; obtaining an artifact risk index corresponding to the key region according to the differences of the pixel points in the key region;
and determining whether the image frame is a suspected risk image frame in the face video to be identified according to the artifact risk indexes corresponding to the key regions in the image frame.
4. The method according to claim 3, wherein the obtaining a difference between the velocity vector of each pixel point in the key region and the average velocity vector of all other pixel points in the key region, and obtaining an artifact risk index corresponding to the key region according to the differences of the pixel points in the key region, comprises:
for each key region, calculating the artifact risk index of the key region as
$$r = \frac{1}{K}\sum_{i=1}^{K}\angle\Big(V_i,\ \frac{1}{K-1}\sum_{j\ne i}V_j\Big)$$
wherein r is the artifact risk index of the key region; i and j are index variables; K is the number of pixel points in the key region; V_i is the velocity vector of the i-th pixel point in the key region; V_j is the velocity vector of the j-th pixel point in the key region; and ∠ denotes the angle between two vectors;
and the determining whether the image frame is a suspected risk image frame in the face video to be identified according to the artifact risk indexes corresponding to the key regions in the image frame comprises:
for one image frame, judging whether the artifact risk index corresponding to at least one key region of the image frame is larger than a preset index threshold corresponding to that key region, and if so, determining that the image frame is a suspected risk image frame in the face video to be identified.
5. The method according to claim 3, wherein the at least two key regions comprise at least two of: a left eye region, a right eye region, a nose region, a mouth region, and a face contour region.
6. The method of claim 1, wherein the feature information of the suspected risk image frames in the at least two modes comprises: at least two of features of multi-frame motion information, single-frame frequency-domain features, and single-frame spatial-domain features.
7. The method of claim 6, wherein the extracting feature information of the suspected risk image frames in at least two modes respectively comprises at least two of:
extracting features of multi-frame motion information using a FlowNet2 optical flow model or the calcOpticalFlowPyrLK operator;
calculating single-frame frequency-domain features using a DCT transform or an FFT operator;
and obtaining single-frame spatial-domain features by using RGB color information directly, by converting to the YUV color space, or by using image features filtered with the Sobel operator.
8. A device for detecting a face video, comprising:
an input module configured to obtain a face video to be identified;
the sparse optical flow algorithm module is configured to calculate motion information of at least two image frames in the face video to be identified by using a sparse optical flow algorithm;
the to-be-detected sequence generation module is configured to determine suspected risk image frames in the to-be-identified face video according to motion information of at least two image frames in the to-be-identified face video, and form a to-be-detected sequence by using the suspected risk image frames;
the feature information extraction module is configured to respectively extract feature information of the suspected risk image frames under at least two modes;
the judging module is configured to judge whether the face video to be identified is a fake face video or not according to the characteristic information of the suspected risk image frames under at least two modes.
9. A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1-7.
10. A computing device comprising a memory having executable code stored therein and a processor, which when executing the executable code, implements the method of any of claims 1-7.
CN202310863404.2A 2023-07-13 2023-07-13 Face video detection method and device Pending CN116798100A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310863404.2A CN116798100A (en) 2023-07-13 2023-07-13 Face video detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310863404.2A CN116798100A (en) 2023-07-13 2023-07-13 Face video detection method and device

Publications (1)

Publication Number Publication Date
CN116798100A (en) 2023-09-22

Family

ID=88046649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310863404.2A Pending CN116798100A (en) 2023-07-13 2023-07-13 Face video detection method and device

Country Status (1)

Country Link
CN (1) CN116798100A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058742A * 2023-09-26 2023-11-14 Tencent Technology (Shenzhen) Co., Ltd. Face counterfeiting detection method and device, electronic equipment and medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination