CN114842399B - Video detection method, training method and device for video detection model

Info

Publication number
CN114842399B
CN114842399B (application number CN202210564026.3A)
Authority
CN
China
Prior art keywords
video
sample
face
frame
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210564026.3A
Other languages
Chinese (zh)
Other versions
CN114842399A
Inventor
李艾仑
王洪斌
吴至友
皮家甜
曾定衡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd
Priority to CN202210564026.3A
Publication of CN114842399A
Application granted
Publication of CN114842399B
Legal status: Active
Anticipated expiration

Classifications

    • G06V20/40 Scenes; scene-specific elements in video content
    • G06F18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045 Combinations of networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V10/764 Image or video recognition using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/82 Image or video recognition using neural networks
    • G06V40/168 Human faces: feature extraction; face representation
    • G06V40/174 Facial expression recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video detection method and device, which are used for solving the problems of low detection accuracy and poor universality of the existing fake video detection method. The video detection method comprises the following steps: acquiring at least one frame of video image of a target face in a video to be detected and a plurality of frames of first optical flow images of the target face based on time sequence arrangement; extracting features of the at least one frame of video image through a video detection model to obtain facial emotion features of the target face; extracting features of the multi-frame first optical flow image through the video detection model to obtain facial action features of the target face; and determining a detection result of the video to be detected at least based on the facial emotion characteristics and the facial action characteristics of the target face.

Description

Video detection method, training method and device for video detection model
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a video detection method, a training method of a video detection model and a training device of the video detection model.
Background
With the development of deep learning, face forgery techniques have emerged in an endless stream, such as creating a face that does not exist or replacing a face in a video with another face. These face forgery techniques are inevitably used by some people for illegal purposes, forging videos that harm others or have a bad influence on society. Therefore, the detection of forged video has become very important.
At present, the detection of forged video is still at a development stage, and most detection methods judge whether a video is real or fake based on changes in face features and on artifacts introduced by the forgery process. However, such methods easily overfit deep-forgery features of certain specific distributions, so that a good detection effect is achieved only on part of the videos; the detection accuracy is therefore low and the generality is poor.
Disclosure of Invention
The embodiment of the application aims to provide a video detection method and device, which are used for solving the problems of low detection accuracy and poor universality of the existing video detection method.
In order to achieve the above purpose, the embodiment of the present application adopts the following technical scheme:
in a first aspect, an embodiment of the present application provides a video detection method, including:
acquiring at least one frame of video image of a target face in a video to be detected and a plurality of frames of first optical flow images of the target face based on time sequence arrangement;
extracting features of the at least one frame of video image through a video detection model to obtain facial emotion features of the target face;
extracting features of the multi-frame first optical flow image through the video detection model to obtain facial action features of the target face;
And determining a detection result of the video to be detected at least based on the facial emotion characteristics and the facial action characteristics of the target face.
It can be seen that this embodiment of the present application exploits the natural law that a real face and a forged face differ in appearance and in dynamic actions: the facial emotion features of the target face are extracted by the video detection model from at least one frame of video image of the target face in the video to be detected, the facial action features of the target face are extracted by the video detection model from the multiple frames of first optical flow images of the target face arranged in time sequence, and the detection result of the video to be detected is then determined at least based on the facial emotion features and the facial action features of the target face. Because the facial emotion features are static features in the spatial domain and reflect the appearance of the face, while the facial action features are dynamic features in the temporal domain and reflect the motion of the face, performing video detection by combining the static spatial-domain facial emotion features with the dynamic temporal-domain facial action features avoids overfitting deep-forgery features of certain specific distributions, thereby improving detection accuracy and generality.
In a second aspect, an embodiment of the present application provides a training method for a video detection model, including:
acquiring a sample video set and an authenticity label corresponding to each sample video in the sample video set, wherein the sample video set comprises real videos and a plurality of fake videos, the fake videos are in one-to-one correspondence with a plurality of face fake algorithms, and each fake video is obtained after the real videos are fake based on the corresponding face fake algorithm;
acquiring at least one frame of video image of a sample face in a target sample video and a plurality of frames of second optical flow images of the sample face based on time sequence arrangement;
extracting features of at least one frame of video image of a sample face in the target sample video through an initial video detection model to obtain facial emotion features of the sample face;
extracting features of a multi-frame second optical flow image of a sample face in the target sample video through the initial video detection model to obtain facial motion features of the sample face;
determining a detection result of the target sample video at least based on facial emotion characteristics and facial action characteristics of a sample face in the target sample video;
And carrying out iterative training on the initial video detection model based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain a video detection model.
It can be seen that in the embodiment of the application, the real video and the fake videos obtained by forging the real video with multiple face forgery algorithms are used as sample videos, and the initial video detection model is trained with the sample videos and their corresponding authenticity labels, so that the resulting video detection model can learn the characteristics of multiple kinds of fake videos; this helps to improve the generalization capability of the video detection model and its detection effect on various videos. In the specific model training process, the facial emotion features of the sample face are extracted by the initial video detection model from at least one frame of video image of the sample face in a sample video, the facial action features of the sample face are extracted by the initial video detection model from the multiple frames of optical flow images of the sample face arranged in time sequence, the sample video is detected at least based on the facial emotion features and the facial action features of the sample face, and the initial video detection model is then iteratively trained based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain the video detection model. In this way, the initial video detection model can fully learn the static features of the sample videos in the spatial domain to accurately extract the facial emotion features that reflect the appearance of the face, and fully learn the dynamic features of the sample videos in the temporal domain to accurately extract the facial action features that reflect the dynamic motion of the face, which prevents the initial video detection model from overfitting deep-forgery features of certain specific distributions and thereby improves the detection accuracy and generality of the obtained video detection model.
In a third aspect, an embodiment of the present application provides a video detection apparatus, including:
the first image acquisition unit is used for acquiring at least one frame of video image of a target face in the video to be detected and a plurality of frames of first optical flow images of the target face based on time sequence arrangement;
the first spatial-domain feature extraction unit is used for performing feature extraction on the at least one frame of video image through a video detection model to obtain facial emotion features of the target face;
the first time domain feature extraction unit is used for extracting features of the multi-frame first optical flow images through the video detection model to obtain facial action features of the target face;
and the first detection unit is used for determining the detection result of the video to be detected at least based on the facial emotion characteristics and the facial action characteristics of the target face.
In a fourth aspect, an embodiment of the present application provides a training device for a video detection model, including:
the sample acquisition unit is used for acquiring a sample video set and an authenticity label corresponding to each sample video in the sample video set, wherein the sample video set comprises real videos and a plurality of types of fake videos, the plurality of types of fake videos are in one-to-one correspondence with a plurality of types of face fake algorithms, and each type of fake video is obtained by carrying out fake processing on the real videos based on the corresponding face fake algorithm;
The second image acquisition unit is used for acquiring at least one frame of video image of the sample face in the target sample video and a plurality of frames of second optical flow images of the sample face based on time sequence arrangement;
the second spatial-domain feature extraction unit is used for performing feature extraction on at least one frame of video image of the sample face in the target sample video through an initial video detection model to obtain facial emotion features of the sample face;
the second time domain feature extraction unit is used for extracting features of a plurality of frames of second optical flow images of the sample face in the target sample video through the initial video detection model to obtain facial action features of the sample face;
the second detection unit is used for determining a detection result of the target sample video at least based on facial emotion characteristics and facial action characteristics of the sample face in the target sample video;
and the training unit is used for carrying out iterative training on the initial video detection model based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain a video detection model.
In a fifth aspect, embodiments of the present application provide an electronic device, including:
A processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method according to the first or second aspect.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the method of the first or second aspect.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is a flow chart of a video detection method according to an embodiment of the present application;
fig. 2 is a flow chart of a video detection method according to another embodiment of the present application;
FIG. 3 is a schematic diagram of a spatial stream network according to an embodiment of the present application;
FIG. 4 is a flowchart of a training method of a video detection model according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a video detection device according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a training device for a video detection model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes, technical solutions and advantages of the present application, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The terms "first", "second" and the like in the description and in the claims are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate so that the embodiments of the present application may be implemented in sequences other than those illustrated or described herein. Furthermore, in the present specification and claims, "and/or" means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
Partial conceptual description:
OpenCV: a cross-platform computer vision and machine learning software library released under the BSD (Berkeley Software Distribution) open-source license. The vision processing algorithms provided by OpenCV are very rich.
Dlib tool: a modern C++ toolbox containing machine learning algorithms and tools for creating complex C++ software to solve practical problems.
Prewitt operator: an edge-detection operator that uses the gray-level differences between a pixel and its upper, lower, left and right neighbors to identify pixels with obvious brightness changes in a digital image, thereby obtaining the boundary information of the target in the digital image.
Freeman chain code encoding: a method of describing a curve or boundary by the coordinates of the curve's starting point and the direction codes of its boundary points. Freeman chain code encoding is often used to represent curves and region boundaries in image processing, computer graphics, pattern recognition, and related fields. For example, Freeman chain code encoding may use an 8-connectivity chain code, i.e., the 8 points adjacent to the center pixel: up, upper right, right, lower right, down, lower left, left and upper left. The 8-connectivity chain code matches the actual neighborhood of a point and can accurately describe the information of the center pixel and its adjacent pixels.
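For illustration only, the following Python sketch (not part of the patent) encodes an ordered boundary as an 8-connectivity Freeman chain code; the direction numbering used here is one common convention and is an assumption:

```python
# One common 8-connectivity direction numbering, in image coordinates (y grows downward):
# 0=right, 1=upper right, 2=up, 3=upper left, 4=left, 5=lower left, 6=down, 7=lower right.
DIRECTIONS = {
    (1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
    (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7,
}

def freeman_chain_code(boundary_points):
    """Encode an ordered list of adjacent (x, y) boundary points as direction codes."""
    return [DIRECTIONS[(x1 - x0, y1 - y0)]
            for (x0, y0), (x1, y1) in zip(boundary_points, boundary_points[1:])]

# A tiny square boundary traversed point by point.
print(freeman_chain_code([(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]))  # [0, 6, 4, 2]
```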
In order to solve the problems of low detection accuracy and poor generality caused by the fact that existing video detection methods achieve a good detection effect only on part of the videos, the embodiment of the application provides a video detection method based on a two-stream network architecture. It exploits the natural law that a real face and a forged face differ in appearance and in dynamic actions: the facial emotion features of the target face are extracted from at least one frame of video image of the target face in the video to be detected based on a video detection model, the facial action features of the target face are extracted from the multiple frames of first optical flow images of the target face arranged in time sequence based on the video detection model, and the detection result of the video to be detected is then determined at least based on the facial emotion features and the facial action features of the target face. Because the facial emotion features are static features in the spatial domain and reflect the appearance of the face, while the facial action features are dynamic features in the temporal domain and reflect the motion of the face, combining the two kinds of features avoids overfitting deep-forgery features of certain specific distributions, thereby further improving detection accuracy and generality.
The embodiment of the application also provides a training method of the video detection model. Real video and the fake videos obtained by forging the real video with multiple face forgery algorithms are used as sample videos, and an initial video detection model is trained with the sample videos and their corresponding authenticity labels, so that the obtained video detection model can learn the characteristics of multiple kinds of fake videos, which helps to improve its generalization capability and its detection effect on various videos. In the specific model training process, the facial emotion features of the sample face are extracted by the initial video detection model from at least one frame of video image of the sample face in a sample video, the facial action features of the sample face are extracted by the initial video detection model from the multiple frames of optical flow images of the sample face arranged in time sequence, the sample video is detected at least based on the facial emotion features and the facial action features of the sample face, and the initial video detection model is then iteratively trained based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video, thereby obtaining the video detection model. In this way, the initial video detection model can fully learn the static features of the sample videos in the spatial domain to accurately extract the facial emotion features that reflect the appearance of the face, and fully learn the dynamic features of the sample videos in the temporal domain to accurately extract the facial action features that reflect the dynamic motion of the face, which prevents the model from overfitting deep-forgery features of certain specific distributions and thereby improves the detection accuracy and generality of the obtained video detection model.
It should be understood that, the video detection method and the training method of the video detection model provided in the embodiments of the present application may be executed by an electronic device or software installed in the electronic device. The electronic devices referred to herein may include terminal devices such as smartphones, tablet computers, notebook computers, desktop computers, intelligent voice interaction devices, intelligent home appliances, smart watches, vehicle terminals, aircraft, etc.; alternatively, the electronic device may further include a server, such as an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides a cloud computing service.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Referring to fig. 1, a flowchart of a video detection method according to an embodiment of the present application is provided, and the method may include the following steps:
s102, at least one frame of video image of a target face in a video to be detected and a plurality of frames of first optical flow images of the target face based on time sequence arrangement are obtained.
In this embodiment of the present application, the target face in the video to be detected refers to the main face in the video to be detected. For example, if the video to be detected contains the face of user A and the face of user B, where the face of user A is in the foreground or its area is larger than that of the face of user B, and the face of user B is in the background or its area is smaller than that of the face of user A, then the face of user A is the target face.
A single-frame video image of the target face can be any video image in the video to be detected that contains the target face, or a frame of video image containing the target face that corresponds to the whole video to be detected. The at least one frame of video image of the target face may be any one or more frames of video images in the video to be detected that contain the target face. Considering that an RGB image contains image data of the three color channels R (red), G (green) and B (blue) and can therefore better reflect the characteristics of the target face in the video to be detected, the single-frame video image may be an RGB image of any frame of the video to be detected that contains the target face.
The first optical flow image is an image containing motion information of the target face, which can express changes between video images. In practical applications, an optical flow image may be obtained by applying an optical flow algorithm to any two time-adjacent frames of video images; the optical flow algorithm may include, but is not limited to, one or more of the Farneback algorithm, the FlowNet algorithm, and the like.
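As a non-limiting sketch of this step, dense optical flow between two time-adjacent grayscale frames can be computed with OpenCV's Farneback implementation and rendered as an image; the parameter values and the HSV rendering below are illustrative assumptions:

```python
import cv2
import numpy as np

def flow_image(prev_gray, next_gray):
    """Dense Farneback optical flow rendered as a BGR image: hue encodes motion
    direction, brightness encodes motion magnitude."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*prev_gray.shape, 3), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2                              # direction as hue
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # magnitude as brightness
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```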
In order for the obtained at least one frame of video image to accurately reflect the appearance characteristics of the face (in particular the facial emotion characteristics) and for the obtained multiple frames of first optical flow images to accurately reflect the motion characteristics of the face (in particular the facial action characteristics), in an alternative implementation, a segmented random sampling manner can be adopted to obtain at least one frame of video image of the target face in the video to be detected and the multiple frames of first optical flow images of the target face arranged in time sequence; of course, in another alternative implementation, the whole video to be detected may be randomly sampled to obtain a single-frame video image of the target face. Specifically, step S102 may include: dividing the video to be detected into a plurality of video segments; randomly sampling the multi-frame RGB images of the target face in each video segment to obtain a plurality of candidate single-frame video images, and determining the at least one frame of video image from the candidate single-frame video images; randomly sampling the multi-frame gray-scale images of the target face in each video segment to obtain multiple frames of candidate gray-scale images; and further determining a first optical flow image corresponding to each frame of candidate gray-scale image based on that candidate gray-scale image and its time-adjacent candidate gray-scale image, and then determining the multiple frames of first optical flow images based on the first optical flow images respectively corresponding to the frames of candidate gray-scale images.
For example, the video to be detected may be equally divided into K segments based on the duration of the video to be detected, each segment having equal duration and each segment containing multiple frames of video images. Then, for each segment, converting the segment into a multi-frame RGB image and a multi-frame gray image based on time sequence arrangement by using OpenCV, and randomly sampling the multi-frame RGB image of the segment to obtain a candidate single-frame RGB image of the segment and a multi-frame candidate gray image based on time sequence arrangement; further, the candidate single-frame RGB image of each segment may be used as a final single-frame video image, or at least one frame RGB image with a relatively good effect (such as high definition and clear face) may be selected from the candidate single-frame RGB images of each segment as a final single-frame video image; meanwhile, an optical flow algorithm may be adopted, a frame of first optical flow image corresponding to each frame of candidate gray-scale image is calculated based on each frame of candidate gray-scale image and the candidate gray-scale images adjacent to each other in time sequence, and then the first optical flow images corresponding to the frames of candidate gray-scale images are determined to be a plurality of frames of first optical flow images based on time sequence arrangement.
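A minimal sketch of this segmented random sampling, assuming OpenCV is used for decoding and color conversion (the function name and the default K=3 are illustrative assumptions, and optical flow would be computed afterwards from the returned grayscale segments):

```python
import cv2
import random

def sample_segments(video_path, k=3):
    """Split the video into k equal segments; from each, randomly pick one RGB frame
    and keep the segment's time-ordered grayscale frames for optical-flow computation."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    assert len(frames) >= k, "video too short for the requested number of segments"

    seg_len = len(frames) // k
    rgb_samples, gray_segments = [], []
    for i in range(k):
        segment = frames[i * seg_len:(i + 1) * seg_len]
        rgb_samples.append(cv2.cvtColor(random.choice(segment), cv2.COLOR_BGR2RGB))
        gray_segments.append([cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in segment])
    return rgb_samples, gray_segments
```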
Optionally, to improve the quality of the single-frame video image and the multi-frame first optical flow image, the RGB images and the gray images of each frame included in each segment may be further preprocessed, such as filtering, before randomly sampling the RGB images and the gray images of each frame. The specific pretreatment mode can be selected according to actual needs, and the embodiment of the application is not limited to this.
Only one specific implementation of S102 described above is shown here. Of course, it should be understood that S102 may be implemented in other manners, which are not limited in this embodiment of the present application.
And S104, extracting features of at least one frame of video image of the target face through the video detection model to obtain facial emotion features of the target face in the video to be detected.
Because a real face and a fake face differ in appearance and in dynamic actions, facial emotion particularly reflects the appearance of the face and facial action particularly reflects the dynamic motion of the face. To extract both types of features accurately, in an optional implementation, as shown in fig. 2, the video detection model of the embodiment of the application may adopt a two-stream network architecture consisting of a spatial stream network and a temporal stream network, where the spatial stream network extracts the facial emotion features of the face and the temporal stream network extracts the facial action features of the face; video detection is then performed by combining the two types of features, namely the facial emotion features and the facial action features of the target face in the video to be detected.
Specifically, as shown in fig. 2, S104 may be implemented as: and extracting features of at least one frame of video image of the target face through a space flow network in the video detection model to obtain facial emotion features of the target face.
Illustratively, at least one frame of video image of the target face is input into a space flow network of the video detection model, and feature extraction is performed on the input single frame of video image by the space flow network, so that facial emotion features of the target face are obtained.
In practice, the spatial stream network may take any suitable structure. Optionally, since the Inception-V3 convolutional neural network increases the depth and width of the network and adds nonlinearity, the spatial stream network may adopt the Inception-V3 convolutional neural network, which effectively alleviates the problem that facial emotion features cannot be accurately extracted due to content differences between single-frame video images. More specifically, to fully utilize the useful information in a single-frame video image and extract rich facial emotion features, as shown in fig. 3, the spatial stream network may include multiple convolution layers, a gated recurrent unit (Gated Recurrent Unit, GRU) layer, a fully connected layer (Fully Connected Layer), and the like, where the multiple convolution layers may include a two-dimensional invariant convolution layer, a two-dimensional spectral convolution layer, and the like, and each convolution layer may be equipped with batch normalization (Batch Normalization, BN) and a linear rectification function (Rectified Linear Unit, ReLU). Specifically, each convolution layer extracts facial emotion features of different sizes from the at least one frame of video image; the GRU layer selects among the facial emotion features extracted by the convolution layers and retains those useful for video detection; and the fully connected layer integrates the facial emotion features retained by the GRU layer to obtain the final facial emotion features.
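As a rough, simplified sketch of such a spatial stream network (plain Conv-BN-ReLU blocks stand in for the Inception-V3 backbone, and all layer sizes are assumptions rather than values given by the patent), a PyTorch implementation might look like this:

```python
import torch
import torch.nn as nn

class SpatialStreamSketch(nn.Module):
    """Stacked Conv-BN-ReLU blocks, a GRU over per-frame features, and a fully
    connected layer producing the facial emotion feature vector."""
    def __init__(self, feature_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.gru = nn.GRU(input_size=128, hidden_size=128, batch_first=True)
        self.fc = nn.Linear(128, feature_dim)

    def forward(self, frames):                    # frames: (batch, num_frames, 3, H, W)
        b, t = frames.shape[:2]
        x = self.backbone(frames.flatten(0, 1))   # (batch * num_frames, 128, 1, 1)
        x = x.view(b, t, 128)                     # per-frame features
        _, h = self.gru(x)                        # keep the features useful across frames
        return self.fc(h[-1])                     # (batch, feature_dim) facial emotion feature

emotion_feat = SpatialStreamSketch()(torch.randn(2, 4, 3, 224, 224))
print(emotion_feat.shape)  # torch.Size([2, 128])
```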
And S106, extracting features of the multi-frame first optical flow images through a video detection model to obtain facial motion features of the target face.
The facial motion feature of the target face refers to a feature capable of reflecting the facial motion of the target face, and includes, for example, but not limited to, a feature reflecting the lip motion of the target face.
Specifically, as shown in fig. 2, S106 may be implemented as: and extracting features of the multi-frame first optical flow images through a time flow network in the video detection model to obtain facial motion features of the target face.
For example, a plurality of frames of first optical flow images of a target face are input into a time flow network in a video detection model, and feature extraction is performed on the plurality of frames of first optical flow images by the time flow network according to the time sequence of the plurality of frames of first optical flow images, so that facial motion features of the target face are obtained.
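Similarly, a minimal temporal stream sketch can stack the two-channel optical flow fields along the channel dimension so the convolutions see motion across frames; the number of flow frames and the layer sizes below are assumptions:

```python
import torch
import torch.nn as nn

class TemporalStreamSketch(nn.Module):
    """Consecutive optical flow fields stacked on the channel axis, followed by
    convolutions and a fully connected layer producing the facial action feature."""
    def __init__(self, num_flow_frames=10, feature_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * num_flow_frames, 64, 3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, feature_dim),
        )

    def forward(self, flow_stack):       # flow_stack: (batch, 2 * num_flow_frames, H, W)
        return self.net(flow_stack)      # (batch, feature_dim) facial action feature

action_feat = TemporalStreamSketch()(torch.randn(2, 20, 112, 112))
print(action_feat.shape)  # torch.Size([2, 128])
```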
It should be noted that in practical applications, the spatial stream network and the temporal stream network may have different network structures, for example, a self-attention layer is introduced into the spatial stream network, so that the spatial stream network can focus on key facial emotion features in the Shan Zhen video image; alternatively, the spatial stream network and the temporal stream network may have the same network structure.
S108, determining a detection result of the video to be detected at least based on the facial emotion characteristics and the facial action characteristics of the target face.
Specifically, the detection result of the video to be detected may indicate whether the video to be detected is a counterfeit video.
In an alternative implementation, as shown in fig. 2, the video detection model in the embodiment of the present application further includes a classification network, where the classification network has a function of identifying authenticity of a face based on the input facial features. Specifically, the classification network may include an emotion recognition network and a voice recognition network, where the emotion recognition network may recognize an emotion expressed by a face, that is, an emotion state of the face, and the voice recognition network may recognize a facial action corresponding to the voice data.
Because the facial emotion characteristics of the target face can reflect the facial emotion of the target face, the facial action characteristics of the target face can reflect the facial action of the target face, and the real face and the fake face have differences in facial emotion and facial action, based on the facial emotion characteristics and the facial action characteristics of the target face are input into the classification network of the video detection model in the S108, whether the target face is a recognition result of the fake face or not can be obtained, and if the target face is the fake face, the video to be detected can be determined to be a fake video; if the target face is a real face, the video to be detected can be determined to be a real video.
In another alternative implementation manner, considering that the pupil size changes correspondingly when the face presents different emotions, and the face motion of the user also changes correspondingly when speaking, in order to accurately identify the true or false of the video to be detected, as shown in fig. 2, S108 may be specifically implemented as follows:
s181, determining the pupil size of the target face based on at least one frame of video image of the target face.
In the embodiment of the present application, the pupil size of the target face may be determined in any suitable manner. Optionally, S181 may be implemented as: segmenting the eye region of the target face from the at least one frame of video image of the target face based on a preset image segmentation algorithm; performing edge detection on the eye region of the target face based on a preset edge detection algorithm to obtain the pupil boundary of the target face; and fitting the pupil boundary of the target face based on a preset fitting algorithm to obtain the pupil size of the target face.
For example, the Dlib tool may be used to extract the target face in the at least one frame of video image, then human-eye key points are detected with one or more image segmentation algorithms commonly used in the art, and the eye region of the target face is segmented from the single frame of video image; then, the eye region of the target face is filtered, for example by applying median filtering with a filtering template of a preset size to remove normally distributed noise in the eye region; then, the eye region of the target face is binarized based on a one-dimensional maximum-entropy threshold segmentation method and a preset threshold to obtain a binarized eye region; further, edge detection is performed on the thresholded eye region by using the Prewitt operator to obtain the pupil boundary of the target face, and the pupil boundary is represented by Freeman chain code encoding; then, using a Hough circle fitting algorithm and the pupil boundary of the target face, the image space is converted into a parameter space based on the standard Hough transform principle, circle-center detection is performed on the eye region, and the radius of the circle is derived from the circle center; this radius is the pupil size of the target face.
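A hypothetical OpenCV sketch of this pupil-size step is given below; the fixed binarization threshold (standing in for the one-dimensional maximum-entropy method), the explicit Prewitt kernels (OpenCV has no built-in Prewitt filter) and the Hough parameters are all illustrative assumptions:

```python
import cv2
import numpy as np

def estimate_pupil_radius(eye_gray):
    """Filter the eye region, binarize it, detect edges with Prewitt kernels,
    and fit a circle with the Hough transform; the radius approximates pupil size."""
    eye = cv2.medianBlur(eye_gray, 5)                                   # remove noise
    _, binary = cv2.threshold(eye, 60, 255, cv2.THRESH_BINARY_INV)      # dark pupil becomes white
    kx = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=np.float32)
    ky = kx.T
    edges = np.maximum(cv2.convertScaleAbs(cv2.filter2D(binary, cv2.CV_32F, kx)),
                       cv2.convertScaleAbs(cv2.filter2D(binary, cv2.CV_32F, ky)))
    circles = cv2.HoughCircles(edges, cv2.HOUGH_GRADIENT, dp=1, minDist=eye.shape[0],
                               param1=100, param2=15, minRadius=3,
                               maxRadius=eye.shape[0] // 2)
    return None if circles is None else float(circles[0, 0, 2])         # fitted radius
```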
Only one specific implementation of determining pupil size is shown here. Of course, it should be understood that pupil size may be determined in other ways, and embodiments of the present application are not limited in this regard.
S182, determining a first detection result of the video to be detected based on facial emotion characteristics and pupil size of the target face.
The first detection result of the video to be detected can be used for indicating the authenticity of the video to be detected.
Optionally, the emotion of a real face theoretically matches its pupil size; for example, the pupil of a real face is smaller when the face appears happy and larger when the face appears frightened, whereas the emotion of a face forged by existing face forgery techniques hardly matches the pupil size. Based on this, in S182, emotion recognition can be performed on the facial emotion characteristics of the target face through the emotion recognition network to obtain the emotional state of the target face, and the first detection result of the video to be detected is then determined based on the matching state between the emotional state of the target face and the pupil size of the target face.
For example, based on the emotional state of the target face and the preset correspondence between the emotional state and the pupil size, the pupil size matched with the emotional state of the target face can be determined, if the difference between the pupil size matched with the emotional state of the target face and the calculated pupil size exceeds a preset threshold, the target face can be determined to be a fake face, and then a first detection result indicating that the video to be detected is a fake video can be obtained; if the difference between the pupil size matched with the emotional state of the target face and the calculated pupil size is smaller than a preset threshold value, the target face can be determined to be a real face, and then a first detection result indicating that the video to be detected is a real video can be obtained.
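The consistency check between the emotional state and the pupil size could be sketched as follows; the expected-size table, the tolerance and the return values are purely illustrative assumptions, not values given by the patent:

```python
# Hypothetical expected pupil radii (in pixels) per emotional state.
EXPECTED_PUPIL_RADIUS = {"happy": 3.0, "frightened": 6.0, "neutral": 4.0}

def first_detection(emotion_state, measured_radius, tolerance=1.5):
    """Judge the video fake when the measured pupil size deviates too far from the
    size expected for the recognised emotional state."""
    expected = EXPECTED_PUPIL_RADIUS.get(emotion_state)
    if expected is None:
        return "unknown"
    return "fake" if abs(expected - measured_radius) > tolerance else "real"

print(first_detection("happy", 5.2))  # 'fake' (deviation 2.2 exceeds the tolerance)
```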
Of course, in practical applications, the first detection result may also include a probability that the video to be detected is a counterfeit video and/or a probability that the video to be detected is a real video. For example, the probability that the video to be detected is a fake video and/or the probability that the video to be detected is a real video may be determined based on the matching degree value between the emotional state of the target face and the pupil size of the target face, so as to obtain the first detection result.
Optionally, in order to further obtain the first detection result with higher accuracy, in S182, the facial emotion feature of the target face may be input into the emotion recognition network to obtain an emotion state of the target face, and text data corresponding to the voice data of the video to be detected may be input into the preset text recognition model to obtain the emotion state of the target face in the video to be detected; further, a first detection result of the video to be detected is determined based on a matching state between an emotion state obtained by the emotion recognition network and an emotion state obtained by a preset text recognition model and a matching state between an emotion state of the target face and a pupil size of the target face.
For example, if the emotion state obtained based on the emotion recognition network is the same as the emotion state obtained by the preset text recognition model, the emotion state and the emotion state are considered to be matched; further, if the emotion state obtained based on the emotion recognition network is matched with the emotion state obtained by the preset text recognition model and the emotion state of the target face is matched with the pupil size of the target face, determining that the video to be detected is a real video; if the emotion state obtained based on the emotion recognition network is not matched with the emotion state obtained by the preset text recognition model or the emotion state of the target face is not matched with the pupil size of the target face, determining that the video to be detected is a fake video.
It can be understood that in the latter implementation manner, whether the emotion state of the target face is matched with the pupil size of the target face is determined, whether the emotion state obtained based on the emotion recognition network is matched with the emotion state obtained by the preset text recognition model is also determined, and the first detection result of the video to be detected is determined by combining the two matching results, so that inaccurate first detection results caused by matching between the emotion of the pseudo face and the pupil size can be avoided.
S183, determining a second detection result of the video to be detected based on the facial motion characteristics of the target face and the voice data of the video to be detected.
Optionally, in consideration of that the voice data of the real video is matched with the facial motion (especially, lip motion) of the target face in the real video, but the voice data of the video forged by the existing face forging technology is difficult to be matched with the facial motion of the face in the video, in S183, the voice data of the video to be detected may be subjected to voice recognition through the voice recognition network to obtain the target facial motion feature corresponding to the voice data, and then the second detection result of the video to be detected is determined based on the matching state between the facial motion feature of the target face and the target facial motion feature corresponding to the voice data.
For example, if the facial motion feature of the target face does not match the target facial motion feature corresponding to the voice data, it may be determined that the video to be detected is a fake video as the second detection result; if the facial motion characteristics of the target face are matched with the target facial motion characteristics corresponding to the voice data, the second detection result can be determined to be that the video to be detected is a real video.
Of course, in practical applications, the second detection result may also include a probability that the video to be detected is a counterfeit video and/or a probability that the video to be detected is a real video. For example, the probability that the video to be detected is a fake video and/or the probability that the video to be detected is a real video may be determined based on the matching degree value of the facial motion feature of the target face and the target facial motion feature corresponding to the voice data, so as to obtain the second detection result.
S184, determining a detection result of the video to be detected based on the first detection result and the second detection result of the video to be detected.
For example, if both the first detection result and the second detection result indicate that the video to be detected is a real video, the video to be detected is finally determined to be a real video; otherwise, the video to be detected is finally determined to be a fake video.
For another example, if the first detection result and the second detection result both include probabilities that the video to be detected is a fake video, the first detection result and the second detection result can be weighted and summed to obtain a final probability, and if the final probability exceeds a preset probability threshold, the video to be detected is determined to be a fake video; otherwise, determining the video to be detected as the real video.
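A minimal sketch of this weighted fusion, assuming equal weights and a 0.5 probability threshold (both assumptions):

```python
def fuse_detections(p_fake_emotion, p_fake_motion, w_emotion=0.5, w_motion=0.5, threshold=0.5):
    """Weighted sum of the two fake probabilities compared against a threshold."""
    p_fake = w_emotion * p_fake_emotion + w_motion * p_fake_motion
    return ("fake" if p_fake > threshold else "real"), p_fake

print(fuse_detections(0.8, 0.3))  # ('fake', 0.55)
```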
Only one specific implementation of S108 described above is shown here. Of course, it should be understood that S108 may be implemented in other manners, which are not limited in this embodiment of the present application.
It should be noted that, in the step S102, if a single-frame video image and a plurality of frames of first optical flow images based on time sequence arrangement are obtained for each segment of the video to be detected, the steps S104 to S108 may be executed for each segment of the video to be detected based on the single-frame video image and the plurality of frames of first optical flow images in the segment, so as to obtain a detection result corresponding to the segment; and then, the detection results corresponding to each segment of the video to be detected are synthesized, and whether the video to be detected is a fake video or not is determined. For example, if the detection result corresponding to the more than 1/2 segment in the video to be detected indicates that the video to be detected is a fake video, determining that the video to be detected is a fake video; otherwise, determining the video to be detected as the real video.
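The per-segment aggregation described above amounts to a simple majority rule, sketched here for illustration:

```python
def aggregate_segments(segment_is_fake):
    """Declare the whole video fake if more than half of its segments are judged fake."""
    return sum(segment_is_fake) > len(segment_is_fake) / 2

print(aggregate_segments([True, False, True]))  # True
```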
According to the video detection method provided by the embodiment of the application, the natural law that a real face and a forged face differ in appearance and in dynamic actions is exploited: the facial emotion features of the target face are extracted from at least one frame of video image of the target face in the video to be detected based on the video detection model, the facial action features of the target face are extracted from the multiple frames of first optical flow images of the target face arranged in time sequence based on the video detection model, and the detection result of the video to be detected is then determined at least based on the facial emotion features and the facial action features of the target face. Because the facial emotion features are static features in the spatial domain and reflect the appearance of the face, while the facial action features are dynamic features in the temporal domain and reflect the motion of the face, combining the static spatial-domain facial emotion features with the dynamic temporal-domain facial action features avoids overfitting deep-forgery features of certain specific distributions, so that detection accuracy and generality can be improved.
The embodiment of the application also provides a training method of the video detection model, and the trained video detection model can be used for detecting the video to be detected. The training process of the video detection model is described in detail below.
Referring to fig. 4, a flowchart of a training method of a video detection model according to an embodiment of the present application is provided, and the method may include the following steps:
s402, acquiring a sample video set and an authenticity label corresponding to each sample video in the sample video set.
The sample video set includes real videos and multiple types of fake videos, where the types of fake videos correspond one-to-one to multiple face forgery algorithms, and each type of fake video is obtained by forging the real videos based on the corresponding face forgery algorithm. In practical applications, the face forgery algorithms may include, but are not limited to, the Face2Face algorithm, the FaceSwap algorithm, the DeepFakes algorithm, and the NeuralTextures algorithm.
The authenticity label corresponding to the sample video is used for indicating whether the sample video is a fake video or not. In practical applications, the genuine-fake label corresponding to the sample video may be represented by a form of one-hot encoding (one-hot), for example, the genuine-fake label corresponding to the genuine video is (1, 0), the genuine-fake label corresponding to the counterfeit video is (0, 1), and so on. Of course, the authenticity label corresponding to the sample video may also be represented by other manners commonly used in the art, which is not limited in this embodiment of the present application.
S404, at least one frame of video image of the sample face in the target sample video and a plurality of frames of second optical flow images of the sample face based on time sequence arrangement are obtained.
The specific implementation of S404 is similar to that of S102 in the embodiment shown in fig. 1, and will not be described herein.
S406, extracting features of at least one frame of video image of the sample face through the initial video detection model to obtain facial emotion features of the sample face.
The specific implementation of S406 is similar to the specific implementation of S104 in the embodiment shown in fig. 1, and will not be described herein.
S408, extracting features of a plurality of frames of second optical flow images of the sample face through the initial video detection model to obtain facial motion features of the sample face.
The implementation of S408 is similar to that of S106 in the embodiment shown in fig. 1, and will not be described here again.
S410, determining a detection result of the target sample video at least based on facial emotion characteristics and facial action characteristics of the sample face.
The implementation of S410 is similar to the implementation of S108 in the embodiment shown in fig. 1, and will not be described here again. For example, S410 may include: determining the pupil size of the sample face based on at least one frame of video image of the sample face in the target sample video; determining a first detection result of the target sample video based on the facial emotion characteristics and the pupil size of the sample face; determining a second detection result of the target sample video based on the facial action characteristics of the sample face in the target sample video and the voice data of the target sample video; and determining whether the target sample video is a fake video based on the first detection result and the second detection result of the target sample video.
And S412, performing iterative training on the initial video detection model based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain a video detection model.
Specifically, the detection loss of the initial video detection model can be determined based on the detection result and the true-false label of each sample video in the sample video set and a preset loss function; further, based on the detection loss of the initial video detection model, iterative training is carried out on the initial video detection model until the training stopping condition is met, and the video detection model is obtained.
More specifically, the S412 may include: repeating the following processes until the initial video detection model meets the preset training stop condition: determining total detection loss of an initial video detection model based on a first detection result and a second detection result of each sample video in the sample video set and an authenticity label of each sample video; based on the total detection loss, model parameters of the initial video detection model are adjusted.
For example, determining a first detection loss of the initial video detection model based on a first detection result of each sample video in the sample video set, an authenticity label of each sample video and a first preset loss function; determining a second detection loss of the initial video detection model based on a second detection result of each sample video in the sample video set, the authenticity label of each sample video and a second preset loss function; further, the first detection loss and the second detection loss of the initial video detection model are weighted and summed to obtain the total detection loss of the initial video detection model. The first detection loss is used for representing the loss generated by the video detection of the initial video detection model based on facial emotion characteristics, and the second detection loss is used for representing the loss generated by the video detection of the initial video detection model based on facial motion characteristics. The first preset loss function and the second preset loss function may be set according to actual needs, which is not limited in the embodiment of the present application.
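As a minimal sketch of the weighted summation described above, assuming cross-entropy as the preset loss for both branches and hypothetical weights w1 and w2 (the embodiment does not fix these choices):

```python
# Illustrative sketch: total detection loss as a weighted sum of the two branch losses.
# Cross-entropy is used because the embodiment mentions it as one possible preset loss;
# the weights w1 / w2 are hypothetical hyperparameters.
import torch
import torch.nn.functional as F

def total_detection_loss(first_logits: torch.Tensor,
                         second_logits: torch.Tensor,
                         labels: torch.Tensor,
                         w1: float = 0.5,
                         w2: float = 0.5) -> torch.Tensor:
    # first_logits / second_logits: (batch, 2) scores from the emotion-feature branch and
    # the motion-feature branch; labels: (batch,) class indices (0 = real, 1 = forged).
    first_loss = F.cross_entropy(first_logits, labels)    # loss of the emotion-feature branch
    second_loss = F.cross_entropy(second_logits, labels)  # loss of the motion-feature branch
    return w1 * first_loss + w2 * second_loss
```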
It can be understood that the total detection loss of the initial video detection model is determined through the above manner, and the detection loss generated by the video detection of the initial video detection model based on different face features is comprehensively considered, so that the obtained total detection loss can more accurately reflect the difference between the detection result of each sample video in the sample video set and the true-false label corresponding to each sample video, and further, the model parameters of the initial video detection model are adjusted by utilizing the total detection loss, thereby being beneficial to improving the detection accuracy of the finally obtained video detection model.
For example, a back propagation algorithm may be employed to determine the detection loss caused by each network in the initial video detection model based on the detection loss of the initial video detection model and the current model parameters of the initial detection model; then, the model parameters of each network are adjusted layer by layer with the aim of reducing the detection loss of the initial video detection model. The model parameters of the initial video detection model may include, but are not limited to: the method comprises the steps of detecting the number of nodes of each network, the connection relation among the nodes of different networks, the weight of the connection edges, the bias corresponding to the nodes in each network and the like in an initial video detection model.
In practical applications, the preset loss function and the training stop condition may be set according to actual needs, for example, the preset loss function may be set as a cross entropy loss function, and the training stop condition may include that the detection loss of the initial video detection model is smaller than a preset loss threshold or the iteration number reaches a preset number of times threshold, which is not limited in this embodiment of the present application.
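A minimal training-loop sketch under the stop conditions mentioned above (loss below a preset threshold or a preset maximum number of iterations); the model, data loader, optimizer choice, and the compute_total_loss helper are placeholders assumed for illustration, not the embodiment's actual implementation:

```python
# Illustrative sketch: iterative training until the detection loss falls below a preset
# threshold or the iteration count reaches a preset maximum.
import torch

def train(model, data_loader, compute_total_loss,
          loss_threshold: float = 0.05, max_iterations: int = 10000,
          lr: float = 1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    iteration = 0
    while True:  # repeat until a stop condition is met
        for frames, flows, labels in data_loader:
            first_logits, second_logits = model(frames, flows)
            loss = compute_total_loss(first_logits, second_logits, labels)
            optimizer.zero_grad()
            loss.backward()   # back-propagate the detection loss through every network
            optimizer.step()  # adjust model parameters to reduce the detection loss
            iteration += 1
            if loss.item() < loss_threshold or iteration >= max_iterations:
                return model
```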
One specific implementation of iterative training of an initial video detection model is shown herein. Of course, it should be understood that other manners in the art may be used to iteratively train the initial video detection model, which is not limited in this embodiment of the present application.
According to the training method for the video detection model provided in the embodiments of the present application, real videos and forged videos obtained by forging the real videos with various face forging algorithms are used as sample videos, and the initial video detection model is trained with these sample videos and their corresponding authenticity labels, so that the resulting video detection model can learn the characteristics of various kinds of forged videos, which improves the generalization capability of the video detection model and its detection effect on various videos. In the specific model training process, the initial video detection model extracts the facial emotion features of the sample face from at least one frame of video image of the sample face in the sample video, extracts the facial motion features of the sample face from a plurality of frames of optical flow images of the sample face based on time sequence arrangement in the sample video, and detects the sample video at least based on the facial emotion features and facial motion features of the sample face; the initial video detection model is then iteratively trained based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain the video detection model. In this way, the initial video detection model can fully learn the static features of the sample videos in the spatial domain so as to accurately extract the facial emotion features reflecting the appearance of the face, and fully learn the dynamic features of the sample videos in the time domain so as to accurately extract the facial motion features reflecting the dynamic motion of the face, which avoids the initial video detection model fitting depth features of some specific distribution and thereby improves the detection accuracy of the obtained video detection model.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In addition, corresponding to the video detection method shown in fig. 1, the embodiment of the application also provides a video detection device. Referring to fig. 5, a schematic structural diagram of a video detection apparatus 500 according to an embodiment of the present application is provided, where the apparatus 500 includes:
a first image obtaining unit 510, configured to obtain at least one frame of video image of a target face in a video to be detected and a plurality of frames of first optical flow images of the target face based on a time sequence arrangement;
the first spatial domain feature extraction unit 520 is configured to perform feature extraction on the at least one frame of video image through a video detection model to obtain facial emotion features of a target face;
A first time domain feature extraction unit 530, configured to perform feature extraction on the multiple frames of first optical flow images through the video detection model, so as to obtain facial motion features of the target face;
the first detection unit 540 is configured to determine a detection result of the video to be detected based on at least the facial emotion feature and the facial action feature of the target face.
Optionally, the first detection unit includes:
a pupil determining subunit, configured to determine a pupil size of the target face based on the at least one frame of video image;
the first detection subunit is used for determining a first detection result of the video to be detected based on the facial emotion characteristics and the pupil size of the target face;
the second detection subunit is used for determining a second detection result of the video to be detected based on the facial action characteristics of the target face and the voice data of the video to be detected;
and the third detection subunit is used for determining the detection result of the video to be detected based on the first detection result and the second detection result.
Optionally, the first detection subunit is specifically configured to:
carrying out emotion recognition on facial emotion characteristics of the target face through an emotion recognition network in the video detection model to obtain an emotion state of the target face;
And determining a first detection result of the video to be detected based on a matching state between the emotion state of the target face and the pupil size of the target face.
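Purely to illustrate the matching idea, the following sketch derives a first detection result from the recognized emotional state and the measured pupil size; the emotion-to-pupil mapping, the ranges, and the function name are hypothetical, since the embodiment does not prescribe a concrete matching rule:

```python
# Illustrative sketch: check whether the recognized emotional state is physiologically
# consistent with the measured pupil size (normalized to [0, 1]).
# The expected ranges below are made-up placeholders, not values from the embodiment.
EXPECTED_PUPIL_RANGE = {
    "fear":     (0.55, 1.00),  # hypothetical: high-arousal emotions tend to dilate the pupil
    "surprise": (0.50, 1.00),
    "neutral":  (0.30, 0.60),
    "sad":      (0.25, 0.55),
}

def first_detection_result(emotion_state: str, pupil_size: float) -> bool:
    """Return True if emotion state and pupil size match (video looks real), else False."""
    low, high = EXPECTED_PUPIL_RANGE.get(emotion_state, (0.0, 1.0))
    return low <= pupil_size <= high
```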
Optionally, the pupil determination subunit is specifically configured to:
based on a preset image segmentation algorithm, segmenting an eye region of the target face from the at least one frame of video image;
performing edge detection on the eye region of the target face based on a preset edge detection algorithm to obtain the pupil boundary of the target face;
and carrying out fitting treatment on the eye region based on a preset fitting algorithm and the pupil boundary of the target face to obtain the pupil size of the target face.
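The following OpenCV sketch mirrors the three steps above (segmentation, edge detection, fitting); the specific algorithms chosen here (simple thresholding as the segmentation step, Canny edge detection, ellipse fitting) and every threshold value are assumptions rather than requirements of the embodiment:

```python
# Illustrative sketch: estimate pupil size from a grayscale eye-region crop.
import cv2
import numpy as np

def estimate_pupil_size(eye_region_gray: np.ndarray) -> float:
    # 1) Segment the dark pupil area with a simple threshold (stand-in for the
    #    preset image segmentation algorithm).
    _, pupil_mask = cv2.threshold(eye_region_gray, 50, 255, cv2.THRESH_BINARY_INV)
    # 2) Edge detection on the segmented region to obtain the pupil boundary.
    edges = cv2.Canny(pupil_mask, 50, 150)
    points = cv2.findNonZero(edges)
    if points is None or len(points) < 5:  # cv2.fitEllipse needs at least 5 points
        return 0.0
    # 3) Fit an ellipse to the boundary points and use the mean axis length as the pupil size.
    (_, _), (major_axis, minor_axis), _ = cv2.fitEllipse(points)
    return 0.5 * (major_axis + minor_axis)
```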
Optionally, the second detection subunit is specifically configured to:
performing voice recognition on the voice data of the video to be detected through a voice recognition network of the video detection model to obtain target facial motion features corresponding to the voice data;
And determining a second detection result of the video to be detected based on the matching state between the facial motion feature of the target face and the target facial motion feature corresponding to the voice data.
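As a hedged sketch of the matching step only (the speech recognition network itself is omitted; the cosine-similarity criterion, feature dimensions, and threshold are illustrative assumptions):

```python
# Illustrative sketch: compare the facial motion feature extracted from the optical-flow
# branch with the target facial motion feature predicted from the speech data.
import numpy as np

def second_detection_result(facial_motion_feat: np.ndarray,
                            speech_predicted_feat: np.ndarray,
                            threshold: float = 0.8) -> bool:
    """Return True if the face motion is consistent with the spoken audio."""
    a = facial_motion_feat / (np.linalg.norm(facial_motion_feat) + 1e-8)
    b = speech_predicted_feat / (np.linalg.norm(speech_predicted_feat) + 1e-8)
    similarity = float(np.dot(a, b))  # cosine similarity between the two feature vectors
    return similarity >= threshold
```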
Optionally, the first spatial domain feature extraction unit is specifically configured to perform feature extraction on the at least one frame of video image through a spatial stream network in the video detection model to obtain facial emotion features of the target face;
The first time domain feature extraction unit is specifically configured to perform feature extraction on the multiple frames of first optical flow images through a time flow network in the video detection model, so as to obtain facial motion features of the target face.
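A minimal two-stream sketch consistent with the spatial-stream / temporal-stream description above; the backbone (a small ResNet), the number of stacked flow frames, and the layer modifications are assumptions for illustration, not the architecture claimed by the embodiment:

```python
# Illustrative sketch: a spatial stream over an RGB frame and a temporal stream over
# stacked optical-flow frames, each producing a 2-way (real / forged) score.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TwoStreamDetector(nn.Module):
    def __init__(self, num_flow_frames: int = 10):
        super().__init__()
        # Spatial stream: extracts facial emotion features from an RGB video frame.
        self.spatial = resnet18(weights=None)
        self.spatial.fc = nn.Linear(self.spatial.fc.in_features, 2)
        # Temporal stream: extracts facial motion features from stacked optical flow
        # images (2 channels per flow frame: horizontal and vertical displacement).
        self.temporal = resnet18(weights=None)
        self.temporal.conv1 = nn.Conv2d(2 * num_flow_frames, 64,
                                        kernel_size=7, stride=2, padding=3, bias=False)
        self.temporal.fc = nn.Linear(self.temporal.fc.in_features, 2)

    def forward(self, rgb_frame: torch.Tensor, flow_stack: torch.Tensor):
        # rgb_frame: (batch, 3, H, W); flow_stack: (batch, 2 * num_flow_frames, H, W)
        return self.spatial(rgb_frame), self.temporal(flow_stack)
```

In such a two-stream design the spatial stream captures per-frame appearance cues while the temporal stream captures motion across frames, matching the spatial-domain / time-domain division described above.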
Optionally, the first image obtaining unit obtains at least one frame of video image of a target face in the video to be detected, including:
dividing the video to be detected into a plurality of video clips;
randomly sampling multi-frame RGB images of a target face in each video segment to obtain a plurality of candidate single-frame video images;
the at least one frame of video image is determined from the plurality of candidate single frame of video images.
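A sketch of the segment-then-sample strategy above, using OpenCV; the number of segments, the frames sampled per segment, and the function name are illustrative assumptions:

```python
# Illustrative sketch: split a video into equal-length segments and randomly sample
# RGB frames from each segment as candidate single-frame video images.
import random
import cv2

def sample_rgb_frames(video_path: str, num_segments: int = 8, frames_per_segment: int = 1):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    segment_len = max(total // num_segments, 1)
    candidates = []
    for s in range(num_segments):
        start, end = s * segment_len, min((s + 1) * segment_len, total)
        if start >= end:
            break
        for idx in random.sample(range(start, end), min(frames_per_segment, end - start)):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame_bgr = cap.read()
            if ok:
                candidates.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return candidates  # at least one of these is then chosen as the input video image
```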
Optionally, the first image obtaining unit obtains the plurality of frames of first optical flow images of the target face based on time sequence arrangement, including:
dividing the video to be detected into a plurality of video clips;
randomly sampling multi-frame gray level images of the target face in each video segment to obtain multi-frame candidate gray level images;
determining a first optical flow image corresponding to each frame of the candidate gray level image based on the candidate gray level image of each frame and the candidate gray level images adjacent to each other in time sequence;
and obtaining the multi-frame first optical flow images based on the first optical flow images respectively corresponding to the multi-frame candidate gray images.
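The optical-flow step could look like the following sketch, which uses OpenCV's Farneback dense optical flow as one possible stand-in; the embodiment does not fix a particular optical flow algorithm, and the parameters shown are common defaults, not requirements:

```python
# Illustrative sketch: compute a first optical flow image for each candidate grayscale
# frame from that frame and its temporally adjacent candidate frame.
import cv2
import numpy as np

def optical_flow_sequence(gray_frames: list[np.ndarray]) -> list[np.ndarray]:
    flows = []
    for prev_gray, next_gray in zip(gray_frames[:-1], gray_frames[1:]):
        # Dense optical flow between two temporally adjacent grayscale frames;
        # the result has 2 channels (horizontal and vertical displacement).
        flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                            pyr_scale=0.5, levels=3, winsize=15,
                                            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flows.append(flow)
    return flows  # time-ordered multi-frame first optical flow images
```

The resulting two-channel flow images can then be stacked along the channel dimension as input to the temporal stream.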
Obviously, the video detection device provided in the embodiment of the present application may be used as an execution body of the video detection method shown in fig. 1, so that the functions of the video detection method implemented in fig. 1 can be implemented. Since the principle is the same, the description is not repeated here.
According to the video detection device provided in the embodiments of the present application, the natural law that a real face and a forged face differ in appearance and in dynamic motion is exploited: the video detection model extracts the facial emotion features of the target face from at least one frame of video image of the target face in the video to be detected, extracts the facial motion features of the target face from the plurality of frames of first optical flow images of the target face based on time sequence arrangement in the video to be detected, and then determines the detection result of the video to be detected at least based on the facial emotion features and the facial motion features of the target face. The facial emotion features are static features in the spatial domain and can reflect the appearance of the face, while the facial motion features are dynamic features in the time domain and can reflect the dynamic motion of the face, so determining the detection result based on both kinds of features improves the accuracy of detecting forged videos.
In addition, corresponding to the training method of the video detection model shown in fig. 4, the embodiment of the application further provides a training device of the video detection model. Referring to fig. 6, a schematic structural diagram of a training apparatus 600 for a video detection model according to an embodiment of the present application is provided, where the apparatus 600 includes:
a sample obtaining unit 610, configured to obtain a sample video set and an authenticity label corresponding to each sample video in the sample video set, where the sample video set includes a real video and multiple kinds of forged videos, the multiple kinds of forged videos are in one-to-one correspondence with multiple kinds of face forging algorithms, and each kind of forged video is obtained by forging the real video based on the corresponding face forging algorithm;
a second image obtaining unit 620, configured to obtain at least one frame of video image of a sample face in the target sample video and a plurality of frames of second optical flow images of the sample face based on a time sequence arrangement;
a second spatial domain feature extraction unit 630, configured to perform feature extraction on at least one frame of video image of a sample face in the target sample video through an initial video detection model, so as to obtain facial emotion features of the sample face;
A second time domain feature extraction unit 640, configured to perform feature extraction on a multi-frame second optical flow image of a sample face in the target sample video through the initial video detection model, so as to obtain facial motion features of the sample face;
a second detection unit 650, configured to determine a detection result of the target sample video based on at least facial emotion features and facial motion features of a sample face in the target sample video;
and the training unit 660 is configured to perform iterative training on the initial video detection model based on the detection result of each sample video in the sample video set and the true-false label corresponding to each sample video, so as to obtain a video detection model.
Optionally, the second detection unit is specifically configured to:
determining the pupil size of a sample face in the target sample video based on at least one frame of video image of the sample face in the target sample video;
determining a first detection result of the sample face in the target sample video based on facial emotion characteristics and pupil sizes of the sample face in the target sample video;
determining a second detection result of the target sample video based on facial motion characteristics of a sample face in the target sample video and voice data of the target sample video;
And determining whether the target sample video is a fake video or not based on the first detection result and the second detection result of the target sample video.
Optionally, the training unit is specifically configured to:
repeating the following processes until the initial video detection model meets the preset training stopping condition:
determining total detection loss of the initial video detection model based on a first detection result and a second detection result of each sample video in the sample video set and the authenticity label of each sample video;
based on the total detection loss, model parameters of the initial video detection model are adjusted.
Obviously, the training device for the video detection model provided in the embodiment of the present application may be used as an execution subject of the training method for the video detection model shown in fig. 4, so that the functions of the training method for the video detection model implemented in fig. 4 can be implemented. Since the principle is the same, the description is not repeated here.
According to the training device for the video detection model provided in the embodiments of the present application, real videos and forged videos obtained by forging the real videos with various face forging algorithms are used as sample videos, and the initial video detection model is trained with these sample videos and their corresponding authenticity labels, so that the resulting video detection model can learn the characteristics of various kinds of forged videos, which improves the generalization capability of the video detection model and its forgery detection effect on various videos. In the specific model training process, the initial video detection model extracts the facial emotion features of the sample face from at least one frame of video image of the sample face in the sample video, extracts the facial motion features of the sample face from a plurality of frames of optical flow images of the sample face based on time sequence arrangement in the sample video, and detects the sample video at least based on the facial emotion features and facial motion features of the sample face; the initial video detection model is then iteratively trained based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain the video detection model. In this way, the initial video detection model can fully learn the static features of the sample videos in the spatial domain so as to accurately extract the facial emotion features reflecting the appearance of the face, and fully learn the dynamic features of the sample videos in the time domain so as to accurately extract the facial motion features reflecting the dynamic motion of the face, which avoids the initial video detection model fitting depth features of some specific distribution and thereby improves the detection accuracy of the obtained video detection model.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to fig. 7, at the hardware level, the electronic device includes a processor, and optionally an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be interconnected by an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, or an EISA (Extended Industry Standard Architecture) bus, among others. The bus may be classified into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one bi-directional arrow is shown in fig. 7, but this does not mean that there is only one bus or one type of bus.
And the memory is used for storing programs. In particular, the program may include program code including computer-operating instructions. The memory may include memory and non-volatile storage and provide instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it, forming the video detection device at the logical level. The processor is configured to execute the program stored in the memory, and is specifically configured to perform the following operations:
acquiring at least one frame of video image of a target face in a video to be detected and a plurality of frames of first optical flow images of the target face based on time sequence arrangement;
extracting features of the at least one frame of video image through a video detection model to obtain facial emotion features of the target face;
extracting features of the multi-frame first optical flow image through the video detection model to obtain facial action features of the target face;
and determining a detection result of the video to be detected at least based on the facial emotion characteristics and the facial action characteristics of the target face.
Alternatively, the processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the same, and forms the training device of the video detection model on a logic level. The processor is used for executing the programs stored in the memory and is specifically used for executing the following operations:
acquiring a sample video set and an authenticity label corresponding to each sample video in the sample video set, wherein the sample video set comprises real videos and a plurality of fake videos, the fake videos are in one-to-one correspondence with a plurality of face fake algorithms, and each fake video is obtained after the real videos are fake based on the corresponding face fake algorithm;
Acquiring at least one frame of video image of a sample face in a target sample video and a plurality of frames of second optical flow images of the sample face based on time sequence arrangement;
extracting features of at least one frame of video image of a sample face in the target sample video through an initial video detection model to obtain facial emotion features of the sample face;
extracting features of a multi-frame second optical flow image of a sample face in the target sample video through the initial video detection model to obtain facial motion features of the sample face;
determining a detection result of the target sample video at least based on facial emotion characteristics and facial action characteristics of a sample face in the target sample video;
and carrying out iterative training on the initial video detection model based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain a video detection model.
The method performed by the video detection apparatus disclosed in the embodiment shown in fig. 1 of the present application or the training method of the video detection model disclosed in the embodiment shown in fig. 4 of the present application may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it may also be a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logical blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
The electronic device may further perform the method of fig. 1 and implement the function of the embodiment of the video detection device shown in fig. 1, or the electronic device may further perform the method of fig. 4 and implement the function of the training device of the video detection model shown in the embodiment of fig. 4, which is not described herein.
Of course, other implementations, such as a logic device or a combination of hardware and software, are not excluded from the electronic device of the present application, that is, the execution subject of the following processing flow is not limited to each logic unit, but may be hardware or a logic device.
The present embodiments also provide a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the method of the embodiment of fig. 1, and in particular to:
acquiring at least one frame of video image of a target face in a video to be detected and a plurality of frames of first optical flow images of the target face based on time sequence arrangement;
extracting features of the at least one frame of video image through a video detection model to obtain facial emotion features of the target face;
Extracting features of the multi-frame first optical flow image through the video detection model to obtain facial action features of the target face;
and determining a detection result of the video to be detected at least based on the facial emotion characteristics and the facial action characteristics of the target face.
Alternatively, the instructions, when executed by a portable electronic device comprising a plurality of applications, enable the portable electronic device to perform the method of the embodiment shown in fig. 4, and in particular to:
acquiring a sample video set and an authenticity label corresponding to each sample video in the sample video set, wherein the sample video set comprises real videos and a plurality of fake videos, the fake videos are in one-to-one correspondence with a plurality of face fake algorithms, and each fake video is obtained after the real videos are fake based on the corresponding face fake algorithm;
acquiring at least one frame of video image of a sample face in a target sample video and a plurality of frames of second optical flow images of the sample face based on time sequence arrangement;
extracting features of at least one frame of video image of a sample face in the target sample video through an initial video detection model to obtain facial emotion features of the sample face;
Extracting features of a multi-frame second optical flow image of a sample face in the target sample video through the initial video detection model to obtain facial motion features of the sample face;
determining a detection result of the target sample video at least based on facial emotion characteristics and facial action characteristics of a sample face in the target sample video;
and carrying out iterative training on the initial video detection model based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain a video detection model.
In summary, the foregoing description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.

Claims (14)

1. A video detection method, comprising:
acquiring at least one frame of video image of a target face in a video to be detected and a plurality of frames of first optical flow images of the target face based on time sequence arrangement;
extracting features of the at least one frame of video image through a video detection model to obtain facial emotion features of the target face;
extracting features of the multi-frame first optical flow image through the video detection model to obtain facial action features of the target face;
and determining a detection result of the video to be detected at least based on the facial emotion characteristics and the facial action characteristics of the target face.
2. The method according to claim 1, wherein the determining the detection result of the video to be detected based on at least the facial emotion feature and the facial action feature of the target face includes:
Determining the pupil size of the target face based on the at least one frame of video image;
determining a first detection result of the video to be detected based on facial emotion characteristics and pupil sizes of the target face;
determining a second detection result of the video to be detected based on the facial action characteristics of the target face and the voice data of the video to be detected;
and determining the detection result of the video to be detected based on the first detection result and the second detection result.
3. The method according to claim 2, wherein the determining the first detection result of the video to be detected based on the facial emotion feature and the pupil size of the target face includes:
carrying out emotion recognition on facial emotion characteristics of the target face through an emotion recognition network in the video detection model to obtain an emotion state of the target face;
and determining a first detection result of the video to be detected based on a matching state between the emotion state of the target face and the pupil size of the target face.
4. The method of claim 2, wherein the determining the pupil size of the target face based on the at least one frame of video image comprises:
Based on a preset image segmentation algorithm, segmenting an eye region of the target face from the at least one frame of video image;
performing edge detection on the eye region of the target face based on a preset edge detection algorithm to obtain the pupil boundary of the target face;
and carrying out fitting treatment on the eye region based on a preset fitting algorithm and the pupil boundary of the target face to obtain the pupil size of the target face.
5. The method according to claim 2, wherein the determining the second detection result of the video to be detected based on the facial motion feature of the target face and the voice data of the video to be detected includes:
performing voice recognition on the voice data of the video to be detected through a voice recognition network of the video detection model to obtain target facial motion features corresponding to the voice data;
and determining a second detection result of the video to be detected based on the matching state between the facial motion feature of the target face and the target facial motion feature corresponding to the voice data.
6. The method according to claim 1, wherein the feature extraction of the at least one frame of video image by the video detection model to obtain facial emotion features of the target face comprises:
Extracting features of the at least one frame of video image through a spatial stream network in the video detection model to obtain facial emotion features of the target face;
extracting features of the multi-frame first optical flow image through the video detection model to obtain facial motion features of the target face, wherein the feature extraction comprises the following steps:
and extracting features of the multi-frame first optical flow images through a time flow network in the video detection model to obtain facial action features of the target face.
7. The method according to claim 1, wherein the acquiring at least one frame of video image of the target face in the video to be detected comprises:
dividing the video to be detected into a plurality of video clips;
randomly sampling multi-frame RGB images of a target face in each video segment to obtain a plurality of candidate single-frame video images;
the at least one frame of video image is determined from the plurality of candidate single frame of video images.
8. The method of claim 1, wherein acquiring the plurality of frames of first optical flow images of the target face based on time sequence arrangement comprises:
dividing the video to be detected into a plurality of video clips;
Randomly sampling multi-frame gray level images of the target face in each video segment to obtain multi-frame candidate gray level images;
determining a first optical flow image corresponding to each frame of the candidate gray level image based on the candidate gray level image of each frame and the candidate gray level images adjacent to each other in time sequence;
and obtaining the multi-frame first optical flow images based on the first optical flow images respectively corresponding to the multi-frame candidate gray images.
9. A method for training a video detection model, comprising:
acquiring a sample video set and an authenticity label corresponding to each sample video in the sample video set, wherein the sample video set comprises real videos and a plurality of fake videos, the fake videos are in one-to-one correspondence with a plurality of face fake algorithms, and each fake video is obtained after the real videos are fake based on the corresponding face fake algorithm;
acquiring at least one frame of video image of a sample face in a target sample video and a plurality of frames of second optical flow images of the sample face based on time sequence arrangement;
extracting features of at least one frame of video image of a sample face in the target sample video through an initial video detection model to obtain facial emotion features of the sample face;
Extracting features of a multi-frame second optical flow image of a sample face in the target sample video through the initial video detection model to obtain facial motion features of the sample face;
determining a detection result of the target sample video at least based on facial emotion characteristics and facial action characteristics of a sample face in the target sample video;
and carrying out iterative training on the initial video detection model based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain a video detection model.
10. The method of claim 9, wherein the determining the detection result of the target sample video based at least on facial emotion features and facial motion features of the sample faces in the target sample video comprises:
determining the pupil size of a sample face in the target sample video based on at least one frame of video image of the sample face in the target sample video;
determining a first detection result of the sample face in the target sample video based on facial emotion characteristics and pupil sizes of the sample face in the target sample video;
Determining a second detection result of the target sample video based on facial motion characteristics of a sample face in the target sample video and voice data of the target sample video;
and determining whether the target sample video is a fake video or not based on the first detection result and the second detection result of the target sample video.
11. The method of claim 10, wherein the iteratively training the initial video detection model based on the detection result of each sample video in the set of sample videos and the authenticity label corresponding to each sample video to obtain a video detection model comprises:
repeating the following processes until the initial video detection model meets the preset training stopping condition:
determining total detection loss of the initial video detection model based on a first detection result and a second detection result of each sample video in the sample video set and the authenticity label of each sample video;
based on the total detection loss, model parameters of the initial video detection model are adjusted.
12. A video detection apparatus, comprising:
the first image acquisition unit is used for acquiring at least one frame of video image of a target face in the video to be detected and a plurality of frames of first optical flow images of the target face based on time sequence arrangement;
The first spatial domain feature extraction unit is used for carrying out feature extraction on the at least one frame of video image through a video detection model to obtain facial emotion features of a target face;
the first time domain feature extraction unit is used for extracting features of the multi-frame first optical flow images through the video detection model to obtain facial action features of the target face;
and the first detection unit is used for determining the detection result of the video to be detected at least based on the facial emotion characteristics and the facial action characteristics of the target face.
13. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 11.
14. A computer readable storage medium, characterized in that instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of any one of claims 1 to 11.