CN116343308B - Fused face image detection method, device, equipment and storage medium

Fused face image detection method, device, equipment and storage medium

Info

Publication number: CN116343308B
Application number: CN202310351690.4A
Authority: CN (China)
Prior art keywords: image, high-frequency information, processing, input, RGB
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN116343308A
Inventors: 贾成昆, 刘永超, 杨睿蝶, 王呈泽
Assignee: Hunan Institute of Traffic Engineering
Application filed by Hunan Institute of Traffic Engineering; priority to CN202310351690.4A

Classifications

    • G06V40/161 Human faces: Detection; Localisation; Normalisation
    • G06V40/168 Human faces: Feature extraction; Face representation
    • G06V10/806 Fusion of extracted features at sensor, preprocessing, feature-extraction or classification level
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Neural network learning methods
    • Y02T10/40 Engine management systems

Abstract

The application provides a fused face image detection method, device, equipment and storage medium, comprising the following steps: preprocessing an RGB image to be detected, decomposing the preprocessed image into a first image, a second image and a third image corresponding to the colour channels, extracting the high-frequency information in these images, and inputting the merged high-frequency information image together with the RGB image into a trained double-flow convolutional neural network to obtain a detection result of whether the RGB image is a fused face image; the double-flow convolutional neural network further comprises self-enhancement processing and/or mutual-enhancement processing. In the double-flow convolutional neural network, self-enhancement processing is added after the convolution processing and mutual-enhancement processing is added after the Stage processing, strengthening the input features and the feature interaction; this improves the accuracy of the fused face image detection result and the overall detection performance.

Description

Fused face image detection method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of face recognition technologies, and in particular, to a method, an apparatus, a device, and a storage medium for detecting a fused face image.
Background
Under the current complicated social situation, in which public security levels are uneven, quickly and accurately confirming personal identities is an important concern not only for public security departments, border inspection and entry-exit ports, railway stations, high-speed rail stations and other relevant institutions, but also for practitioners in any industry that requires identity verification. In the past, identity verification was generally performed manually, and manual verification often has the following problems: 1. low accuracy: the inspector compares the face with the photo by naked eye, so the actual verification effect varies from person to person, is difficult to standardize, and has large errors; 2. subjectivity: inspectors mostly rely on their own experience or subjective judgment, and misjudgments easily occur; 3. slow comparison, high labour cost and low efficiency: in most practical verification scenarios, such as border inspection channels, identity verification is interleaved with the business operations of various industries, such as checking that ticket, document and person match, making it difficult to integrate resources with those businesses; 4. accuracy is difficult to count and verify, and verification results are difficult to recheck.
Among existing face image detection technologies, texture-based fusion detection methods are simple to implement, but since texture only characterizes the surface of an object, high-level image content cannot be obtained from texture features alone; methods based on deep convolutional neural networks generally perform better than texture-based methods, but they require larger data sets, and an insufficient number of samples affects detection accuracy; methods based on hybrid features tend to increase the complexity of the algorithm.
Therefore, how to improve the accuracy of the face image recognition result becomes a problem to be solved.
The above information disclosed in the background section is only for enhancement of understanding of the background of the application, and therefore it may contain information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
The application provides a method, a device, equipment and a storage medium for detecting a fused face image, which are used for solving the problems existing in the prior art.
In a first aspect, the present application provides a fused face image detection method, including the following steps:
s11, preprocessing the acquired RGB image to be detected to obtain a first face image;
s12, decomposing the first face image into a first image, a second image and a third image corresponding to R, G, B color channels, and extracting corresponding first high-frequency information, second high-frequency information and third high-frequency information in the first image, the second image and the third image to obtain a first high-frequency information image, a second high-frequency information image and a third high-frequency information image;
s13, combining the first high-frequency information image, the second high-frequency information image and the third high-frequency information image to obtain a fourth high-frequency information image;
s14, inputting the fourth high-frequency information image and the RGB image into a trained double-flow convolutional neural network to obtain a detection result of whether the RGB image is a fused face image or not;
wherein, after the convolution processing in the dual-flow convolutional neural network and before the next processing step, self-enhancement processing is further included, the self-enhancement processing being used for enhancing the input features; and/or,
after Stage processing in the dual-flow convolutional neural network and before the next Stage processing, mutual-enhancement processing is further included, the mutual-enhancement processing being used for enhancing the interaction of the input features.
In some embodiments, the self-enhancement process includes:
a1, compressing a first input by using global average pooling and global maximum pooling respectively to obtain a corresponding first global space feature and a corresponding second global space feature, wherein the first input is a convolution result obtained after convolution processing;
a2, carrying out convolution processing on the first global space feature and the second global space feature to obtain a corresponding first attention prediction graph and a corresponding second attention prediction graph;
a3, obtaining a channel attention map according to the first attention prediction map and the second attention prediction map;
a4, calculating an enhanced first output characteristic diagram according to the first input and the channel attention diagram.
In some embodiments, the mutual enhancement process includes:
b1, connecting the acquired second input with the third input to obtain a first characteristic diagram, wherein the second input and the third input are respectively corresponding to a frequency flow characteristic diagram and an RGB flow characteristic diagram obtained after Stage processing;
b2, carrying out average pooling and maximum pooling treatment on the first feature map to obtain a feature descriptor;
b3, carrying out convolution processing on the feature descriptors to obtain spatial attention features;
and b4, obtaining an enhanced second output characteristic diagram according to the spatial attention characteristic, the second input and the third input.
In some embodiments, the S11 includes:
and detecting eye coordinates in the RGB image, and carrying out segmentation processing and normalization processing on a face region in the RGB image according to the eye coordinates to obtain a first face image.
In some embodiments, the S12 includes:
s121, decomposing the first face image into a first image, a second image and a third image corresponding to R, G, B color channels;
s122, obtaining a first spectrogram, a second spectrogram and a third spectrogram corresponding to the first image, the second image and the third image through Fourier transformation;
s123, filtering the low-frequency information in the first spectrogram, the second spectrogram and the third spectrogram to obtain a first high-frequency information spectrogram, a second high-frequency information spectrogram and a third high-frequency information spectrogram;
s124, converting the first high-frequency information spectrogram, the second high-frequency information spectrogram and the third high-frequency information spectrogram into RGB color space by utilizing inverse Fourier transform to obtain a first high-frequency information image, a second high-frequency information image and a third high-frequency information image.
In some embodiments, the merging in S13 is as follows:
and splicing the first high-frequency information image, the second high-frequency information image and the third high-frequency information image along the directions of the corresponding R, G, B three color channels to obtain a fourth high-frequency information image.
In some embodiments, the training process of the dual-stream convolutional neural network comprises:
s21, converting a plurality of RGB sample images into corresponding high-frequency information images;
s22, inputting the high-frequency information image and the RGB sample image into a double-flow network for training to obtain a trained double-flow convolutional neural network;
wherein the dual-stream convolutional neural network is used for aggregating RGB image information and high-frequency image information.
In a second aspect, the present application provides a fused face image detection apparatus, including:
the preprocessing module is used for preprocessing the acquired RGB image to be detected to obtain a first face image;
the extraction module is used for decomposing the first face image into a first image, a second image and a third image corresponding to R, G, B color channels, and extracting corresponding first high-frequency information, second high-frequency information and third high-frequency information in the first image, the second image and the third image to obtain a first high-frequency information image, a second high-frequency information image and a third high-frequency information image;
the merging module is used for merging the first high-frequency information image, the second high-frequency information image and the third high-frequency information image to obtain a fourth high-frequency information image;
the detection module is used for inputting the fourth high-frequency information image and the RGB image into a trained double-flow convolutional neural network to obtain a detection result of whether the RGB image is a fused face image or not;
wherein, after the convolution processing in the dual-flow convolutional neural network and before the next processing step, self-enhancement processing is further included, the self-enhancement processing being used for enhancing the input features; and/or,
after Stage processing in the dual-flow convolutional neural network and before the next Stage processing, mutual-enhancement processing is further included, the mutual-enhancement processing being used for enhancing the interaction of the input features.
In a third aspect, the present application provides a terminal device, including:
a memory for storing a computer program;
and the processor is used for reading the computer program in the memory and executing the operation corresponding to the fused face image detection method.
In a fourth aspect, the present application provides a computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, are configured to implement the fused face image detection method.
The method, the device, the equipment and the storage medium for detecting the fused face image provided by the application comprise the following steps:
s11, preprocessing the acquired RGB image to be detected to obtain a first face image;
s12, decomposing the first face image into a first image, a second image and a third image corresponding to R, G, B color channels, and extracting corresponding first high-frequency information, second high-frequency information and third high-frequency information in the first image, the second image and the third image to obtain a first high-frequency information image, a second high-frequency information image and a third high-frequency information image;
s13, combining the first high-frequency information image, the second high-frequency information image and the third high-frequency information image to obtain a fourth high-frequency information image;
s14, inputting the fourth high-frequency information image and the RGB image into a trained double-flow convolutional neural network to obtain a detection result of whether the RGB image is a fused face image or not;
wherein, after the convolution processing in the dual-flow convolutional neural network and before the next processing step, self-enhancement processing is further included, the self-enhancement processing being used for enhancing the input features; and/or,
after Stage processing in the dual-flow convolutional neural network and before the next Stage processing, mutual-enhancement processing is further included, the mutual-enhancement processing being used for enhancing the interaction of the input features.
The method comprises the steps of obtaining high-frequency information images of different color channels from an RGB image, merging the obtained high-frequency information images, and then using the combined high-frequency information images and the RGB image as input of a double-flow convolutional neural network to detect a fused face image; the double-flow convolutional neural network enhances the input characteristics and characteristic interaction by adding self-enhancement processing after convolutional processing and adding mutual-enhancement processing after Stage processing; the method and the device can improve the accuracy of the detection result of the fused face image and improve the detection performance.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart of steps of a method for detecting a fused face image provided in the present application;
fig. 2 is a schematic diagram of a method for detecting a fused face image provided in the present application;
fig. 3 is a schematic diagram of a high-frequency information extraction process of RGB color channels according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an attention feature fusion module involved in an embodiment of the present application;
FIG. 5 is a schematic diagram of a multi-scale channel attention module as referred to in an embodiment of the present application;
FIG. 6 is a schematic diagram of a self-enhancement process involved in an embodiment of the present application;
FIG. 7 is a schematic diagram of a mutual enhancement process involved in an embodiment of the present application;
FIG. 8 is an exemplary diagram of a real face and a fused face provided herein, where images (a) and (b) are real faces and image (c) is the face obtained by fusing images (a) and (b).
Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the embodiments of the present application, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present application, the meaning of "a plurality" or "a number" is two or more, unless explicitly defined otherwise.
It should be understood that the structures, proportions, sizes, etc. shown in the drawings are for illustration purposes only and are not intended to limit the scope of the present disclosure; structural modifications, proportional changes or dimensional adjustments made by those skilled in the art that do not affect the efficacy or the objectives achievable by the present disclosure still fall within its scope.
It should be understood that the term "and/or" as used herein is merely one relationship describing the association of the associated objects, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front-rear association object is an "or" relationship.
The words "if", as used herein, may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrase "if determined" or "if detected (stated condition or event)" may be interpreted as "when determined" or "in response to determination" or "when detected (stated condition or event)" or "in response to detection (stated condition or event), depending on the context.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a product or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such product or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a commodity or system comprising such elements.
Detailed analysis of background art problems:
In current society, as one of the biometric features that is relatively easy to acquire, the face is commonly used as authentication information for personal identity. With the improvement of recognition rates, face recognition systems are widely applied in bank business handling, face-scan payment on mobile phones, border management and other areas.
However, recent studies have demonstrated the vulnerability of face recognition systems to a specific attack known as a fusion attack. This attack combines two facial images with different biometric characteristics into one fused facial image. Because the fused facial image carries biometric information similar to both source images, a face recognition system can match two different people against one forged face image, and is thereby deceived. If such an image is embedded in a passport or other electronic travel document, it poses a serious threat to border management. Fig. 8 is an exemplary diagram of a real face and a fused face provided in the present application, where images (a) and (b) are real faces and image (c) is the face obtained by fusing (a) and (b).
In many countries today, the facial images used in electronic passport applications are provided by the applicant in digital or analogue form. A blacklisted criminal may fuse his or her face image with the face image of a similar-looking accomplice using free software. Because the resulting fused face is very similar to a real face, if the accomplice uses the fused face to apply for a passport or other electronic travel document, the criminal may then use the facial image on that document to fool border inspectors and face recognition systems and pass automatic border control. Therefore, detecting faces generated by this attack is important for safeguarding social security.
There are currently two different techniques for fusion attack detection: 1. no-reference fusion attack detection; 2. differential fusion attack detection. In no-reference fusion detection, an image is analysed on its own, without any reference, and then classified as either a real image or a fused image. In differential fusion attack detection, a newly captured image is compared with the stored reference image for analysis. Furthermore, depending on how the image data are processed, no-reference fusion attack detection can be of two types: (a) print-scan attack detection, in which the captured digital photograph is printed and handed to the passport issuing centre, where it is digitised again by scanning and then stored in the eMRTD; (b) digital attack detection, in which the digitally captured face can be used directly to detect fusion attacks.
In digital attack detection, the most popular fusion attack detection techniques can be broadly divided into four algorithm types: texture-based, image-quality-based, deep-learning-based and hybrid-feature-based fusion detection methods. Early texture-based methods aimed to capture the changes in micro-texture introduced by the fusion process and thus detect fused faces; image-quality-based methods detect fused faces by quantifying differences in compression artifacts and noise introduced during fusion; with the advent of deep learning, deep-learning-based methods, especially those extracting features with pre-trained CNN architectures, have mostly been adopted in recent years. Compared with texture- and image-quality-based methods, deep-learning-based methods can extract richer semantic information and generalize better.
The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 is a step flowchart of a fused face image detection method provided by the present application, and fig. 2 is a schematic diagram of a fused face image detection method provided by an embodiment of the present application, as shown in fig. 1 and fig. 2, where the fused face image detection method provided by the present application includes the following steps:
s11, preprocessing the acquired RGB image to be detected to obtain a first face image;
in some embodiments, the S11 includes:
and detecting eye coordinates in the RGB image, and carrying out segmentation processing and normalization processing on a face region in the RGB image according to the eye coordinates to obtain a first face image.
Specifically, in the present embodiment, the eye coordinates in the RGB image are detected with the dlib landmark detector.
More specifically, in the embodiment of the present application, after the normalization process is completed, the normalized region is obtained and cropped to 224×224 pixels, so as to ensure that the fusion detection algorithm is applied only to the face region; the size of 224×224 pixels is chosen to match the size of the input layer of the D-CNNs.
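For illustration only, a minimal Python sketch of this preprocessing step is given below; it assumes OpenCV and dlib with the 68-point landmark model file shape_predictor_68_face_landmarks.dat (not specified in the text), and the eye-based alignment rule and crop margins are assumptions rather than the exact normalization used in the application.

```python
import cv2
import dlib
import numpy as np

# Hypothetical setup: the patent only states that eye coordinates are found
# with a dlib landmark detector and the face region is segmented, normalized
# and cropped to 224x224 pixels.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def preprocess_face(rgb_image: np.ndarray, out_size: int = 224) -> np.ndarray:
    """Detect eye coordinates, rotate so the eyes are level, crop and resize."""
    gray = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2GRAY)
    faces = detector(gray, 1)
    if len(faces) == 0:
        raise ValueError("no face detected")
    shape = predictor(gray, faces[0])
    # In the 68-point model, indices 36-41 are the left eye, 42-47 the right eye.
    left = np.mean([(shape.part(i).x, shape.part(i).y) for i in range(36, 42)], axis=0)
    right = np.mean([(shape.part(i).x, shape.part(i).y) for i in range(42, 48)], axis=0)
    dx, dy = right[0] - left[0], right[1] - left[1]
    angle = np.degrees(np.arctan2(dy, dx))
    center = (float((left[0] + right[0]) / 2), float((left[1] + right[1]) / 2))
    rot = cv2.getRotationMatrix2D(center, angle, 1.0)
    aligned = cv2.warpAffine(rgb_image, rot, (rgb_image.shape[1], rgb_image.shape[0]))
    # Crop a square region around the eyes (the margin is an assumed heuristic).
    eye_dist = float(np.hypot(dx, dy))
    half = int(1.5 * eye_dist)
    cx, cy = int(center[0]), int(center[1] + 0.4 * eye_dist)
    crop = aligned[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
    return cv2.resize(crop, (out_size, out_size))
```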
Fig. 3 is a schematic diagram of the high-frequency information extraction process for the RGB colour channels according to an embodiment of the present application. As shown in fig. 3, in S12 the first face image is decomposed into a first image, a second image and a third image corresponding to the R, G, B colour channels, and the first high-frequency information, second high-frequency information and third high-frequency information corresponding to the first image, the second image and the third image are extracted to obtain a first high-frequency information image, a second high-frequency information image and a third high-frequency information image;
as shown in FIG. 2 and FIG. 3, R-Channel is R Channel, G-Channel is G Channel, and B-Channel is B Channel.
In some embodiments, the S12 includes:
s121, decomposing the first face image into a first image, a second image and a third image corresponding to R, G, B color channels, wherein the first image, the second image and the third image are specifically expressed as follows:
RGB(X) = [X_r, X_g, X_b]
where X_r, X_g and X_b correspond to the first image, the second image and the third image respectively.
S122, obtaining a first spectrogram, a second spectrogram and a third spectrogram corresponding to the first image, the second image and the third image through Fourier transformation, wherein the first spectrogram, the second spectrogram and the third spectrogram are specifically expressed as follows:
X_fr, X_fg, X_fb = D(X_r, X_g, X_b)
where X_fr, X_fg and X_fb correspond to the first, second and third spectrograms respectively, X_fr, X_fg, X_fb ∈ R^{H×W×1}, and D denotes the DFT (Discrete Fourier Transform).
The spectrum obtained after the DFT has a regular frequency layout, i.e. the low-frequency response lies at the top-left corner and the high-frequency response lies towards the bottom-right corner.
S123, filtering the low-frequency information in the first spectrogram, the second spectrogram and the third spectrogram to obtain a first high-frequency information spectrogram, a second high-frequency information spectrogram and a third high-frequency information spectrogram;
Specifically, in the embodiment of the present application, in order to better extract the high-frequency information, a centre-shifting operation is first performed with the fftshift function, moving the upper-left low-frequency portion to the middle; the image content is then suppressed by filtering out the low-frequency information so as to amplify the fine high-frequency artifacts, which is specifically expressed as:
X'_fr, X'_fg, X'_fb = F(fftshift(X_fr), fftshift(X_fg), fftshift(X_fb), α)
where X'_fr, X'_fg and X'_fb correspond to the first, second and third high-frequency information spectrograms respectively, F denotes the high-pass filtering, α controls the amount of low-frequency component filtered out, and α takes the value 50.
S124, converting the first high-frequency information spectrogram, the second high-frequency information spectrogram and the third high-frequency information spectrogram into the RGB colour space using the inverse Fourier transform to obtain the first high-frequency information image, the second high-frequency information image and the third high-frequency information image, specifically expressed as:
X_hr, X_hg, X_hb = D^{-1}(X'_fr, X'_fg, X'_fb)
where X_hr, X_hg and X_hb correspond to the first, second and third high-frequency information images respectively, X_hr, X_hg, X_hb ∈ R^{H×W×1}, and D^{-1} denotes the IDFT (Inverse Discrete Fourier Transform).
S13, combining the first high-frequency information image, the second high-frequency information image and the third high-frequency information image to obtain a fourth high-frequency information image;
in some embodiments, the merging in S13 is as follows:
the first high-frequency information image, the second high-frequency information image and the third high-frequency information image are spliced along the directions of the corresponding R, G, B three color channels, and a fourth high-frequency information image is obtained, which is specifically expressed as follows:
X_h = cat(X_hr, X_hg, X_hb)
where X_h is the fourth high-frequency information image, X_h ∈ R^{H×W×3}.
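The following NumPy sketch walks through steps S121 to S124 and S13 end to end; since the text only describes F as a high-pass filter controlled by α = 50, the ideal circular mask around the shifted spectrum centre used here is an assumption.

```python
import numpy as np

def high_frequency_image(rgb: np.ndarray, alpha: float = 50.0) -> np.ndarray:
    """Per-channel DFT -> centre shift -> high-pass filter -> inverse DFT -> merge.

    rgb: H x W x 3 array (R, G, B channels); returns the H x W x 3
    high-frequency information image X_h.
    """
    h, w, _ = rgb.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    high_pass = (dist > alpha).astype(np.float64)      # assumed ideal circular mask

    channels = []
    for c in range(3):                                  # S121: split into R, G, B
        spectrum = np.fft.fft2(rgb[:, :, c])            # S122: DFT -> spectrogram
        shifted = np.fft.fftshift(spectrum)             # move low frequencies to the centre
        filtered = shifted * high_pass                  # S123: suppress low frequencies
        back = np.fft.ifft2(np.fft.ifftshift(filtered)) # S124: inverse DFT
        channels.append(np.abs(back))                   # X_hr / X_hg / X_hb
    return np.stack(channels, axis=-1)                  # S13: concatenate -> X_h
```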
S14, inputting the fourth high-frequency information image and the RGB image into a trained double-flow convolutional neural network to obtain a detection result of whether the RGB image is a fused face image or not;
wherein after the convolution processing in the dual-flow convolution neural network and before the next processing of the convolution processing, the method further comprises a self-enhancement processing, wherein the self-enhancement processing is used for enhancing the input characteristics; and/or the number of the groups of groups,
after Stage processing in the dual-flow convolutional neural network and before the next Stage processing, mutual enhancement processing is further included, wherein the mutual enhancement processing is used for enhancing characteristic interaction of input.
In some embodiments, the training process of the dual-stream convolutional neural network comprises:
s21, converting a plurality of RGB sample images into corresponding high-frequency information images;
S22, inputting the high-frequency information image and the RGB sample image into a dual-stream network and training it end to end to obtain the trained dual-stream convolutional neural network, where the dual-stream convolutional neural network takes ShuffleNetV2 as the backbone network;
wherein the dual-stream convolutional neural network is used for aggregating RGB image information and high-frequency image information.
Fig. 4 is a schematic diagram of the attention feature fusion module according to an embodiment of the present application. As shown in fig. 4, in the feature fusion stage the inputs are the high-frequency information stream and the RGB stream, and the two learned features (RGB image information and high-frequency image information) are fused with an AFF (Attention Feature Fusion) module, in which an MS-CAM (Multi-Scale Channel Attention Module) is adopted; the output dimensions after the AFF module remain the same as the input, namely 7×7×1024.
Fig. 5 is a schematic diagram of the multi-scale channel attention module involved in an embodiment of the present application. As shown in fig. 5, one branch uses global average pooling (Global Avg Pooling) to extract the attention of global features, while the other branch directly uses point-wise convolution (Point-wise Conv) to extract the channel attention of local features. Referring to fig. 5, this part consists of a local and a global process, each using Point-wise Conv, ReLU activation functions, Batch Normalization (BN) layers and a Sigmoid activation function.
The fused features are then fed into the Softmax layer for classification. Each stream is based on the same network.
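A PyTorch sketch of this fusion stage is shown below, following the published MS-CAM/AFF design; the channel count of 1024 matches the stated output size, while the reduction ratio r = 4 and other details are assumptions and may differ from the configuration actually used.

```python
import torch
import torch.nn as nn

class MSCAM(nn.Module):
    """Multi-scale channel attention: a global branch (GAP + point-wise convs)
    and a local branch (point-wise convs only), combined through a sigmoid."""
    def __init__(self, channels: int = 1024, r: int = 4):
        super().__init__()
        mid = channels // r
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels))
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels))

    def forward(self, x):
        return torch.sigmoid(self.local_att(x) + self.global_att(x))

class AFF(nn.Module):
    """Attentional fusion of the RGB-stream and frequency-stream feature maps;
    the output keeps the input shape (e.g. 7 x 7 x 1024)."""
    def __init__(self, channels: int = 1024):
        super().__init__()
        self.mscam = MSCAM(channels)

    def forward(self, rgb_feat, freq_feat):
        w = self.mscam(rgb_feat + freq_feat)
        return w * rgb_feat + (1 - w) * freq_feat
```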
As shown in fig. 1, SRM is the self-enhancement module, MRM is the mutual-enhancement module, each SRM is preceded by a ShuffleNet channel-shuffling residual block, Attention Based Fusion is the attention-based fusion module, fc is the fully connected layer of the network, and softmax is the softmax layer of the network.
It should be noted that the lightweight convolutional neural network ShuffleNetV2 is a highly efficient CNN architecture; it uses two operations, grouped point-wise convolution and channel shuffling, which greatly reduce the computational cost while maintaining accuracy, and it has the advantages of low complexity and few parameters. The present application uses two ShuffleNetV2 networks as the backbone. The network is trained end to end. For each dataset, the model is trained with a cross-entropy loss function and optimized with SGD (Stochastic Gradient Descent); the loss function is defined as follows:
L = -(1/N) Σ_{i=1}^{N} [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]
where i is the index of a training sample, N is the number of training samples, ŷ_i is the predicted value of the i-th sample and y_i is the label of the i-th sample. The batch size is 4 during training and validation and 16 during testing. For the learning parameters, the momentum is set to 0.9, the learning rate to 0.001, and the network is trained for 30 epochs.
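A condensed PyTorch training sketch under these settings follows; the model and train_loader are placeholders (the loader is assumed to yield RGB image, high-frequency image and label triples) and are not names taken from the application.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, device: str = "cuda", epochs: int = 30):
    """End-to-end training with cross-entropy loss and SGD, as described above."""
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()                    # cross-entropy loss
    optimizer = torch.optim.SGD(model.parameters(),      # SGD optimizer
                                lr=0.001, momentum=0.9)  # lr 0.001, momentum 0.9
    for epoch in range(epochs):                          # 30 epochs
        for rgb, high_freq, label in train_loader:       # batches of size 4
            rgb = rgb.to(device)
            high_freq = high_freq.to(device)
            label = label.to(device)
            logits = model(rgb, high_freq)                # dual-stream forward pass
            loss = criterion(logits, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```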
It should be further noted that the self-enhancement module is inserted after each convolution block of the backbone network, and the mutual-enhancement module is inserted after each Stage. The self-enhancement module enhances the features of each stream, while the mutual-enhancement module complementarily enhances the feature interaction between the two streams. Through this progressive feature-enhancement flow, the high-frequency information and the RGB information can be used effectively to locate subtle fusion traces.
FIG. 6 is a schematic diagram of a self-enhancement process involved in an embodiment of the present application, as shown in FIG. 6, and in some embodiments, the self-enhancement process includes:
a1, respectively compressing a first input by using GAP (Global average pooling ) and GMP (Global max-pooling) to obtain a corresponding first Global space feature S1 and a second Global space feature S2, wherein the first input is a convolution result obtained after convolution processing; the concrete representation is as follows:
S1 = GAP(f_in), S2 = GMP(f_in)
where f_in denotes the first input, GAP denotes global average pooling and GMP denotes global max pooling.
a2, carrying out convolution processing on the first global space feature and the second global space feature to obtain a corresponding first attention prediction graph Z1 and a corresponding second attention prediction graph Z2;
specifically, in the embodiment of the present application, in order to effectively capture cross-channel interaction information, local cross-channel interaction information is captured from each channel and k neighbors thereof, and for this purpose, the global spatial features S1 and S2 obtained above are subjected to a fast one-dimensional convolution with a size of k, which is specifically expressed as:
Z1 = σ(C1D_k(S1)), Z2 = σ(C1D_k(S2))
where σ is a sigmoid activation function, C1D represents a one-dimensional convolution, and the convolution kernel size k represents the coverage of local cross-channel interactions, i.e., how many neighbors near the channel are involved in the attention prediction of this channel.
a3, obtaining the channel attention map Z from the first attention prediction map Z1 and the second attention prediction map Z2, specifically: Z = Z1 + Z2;
a4, calculating the enhanced first output feature map f_out from the first input and the channel attention map, specifically expressed as:
f_out = f_in ⊗ Z
where f_in denotes the first input and ⊗ denotes element-wise multiplication.
It should be noted that, with the self-enhancement module inserted after each convolution block, traces in different input spaces can be captured through channel attention, thereby enhancing the features of each stream.
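A PyTorch sketch of the self-enhancement module (steps a1-a4) is given below; the kernel size k = 3 and the sharing of one 1-D convolution across the GAP and GMP branches are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class SelfReinforcementModule(nn.Module):
    """Channel attention from GAP and GMP branches, each passed through a
    1-D convolution over the channel dimension (steps a1-a4)."""
    def __init__(self, k: int = 3):
        super().__init__()
        # Shared 1-D convolution capturing local cross-channel interaction of k neighbours.
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f_in.shape
        s1 = f_in.mean(dim=(2, 3))                        # a1: global average pooling -> (B, C)
        s2 = f_in.amax(dim=(2, 3))                        # a1: global max pooling -> (B, C)
        z1 = torch.sigmoid(self.conv(s1.unsqueeze(1)))    # a2: 1-D conv + sigmoid
        z2 = torch.sigmoid(self.conv(s2.unsqueeze(1)))
        z = (z1 + z2).view(b, c, 1, 1)                    # a3: channel attention map Z
        return f_in * z                                   # a4: f_out = f_in * Z (element-wise)
```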
Fig. 7 is a schematic diagram of a mutual enhancement process involved in an embodiment of the present application, as shown in fig. 7, and in some embodiments, the mutual enhancement process includes:
b1, connecting the acquired second input U1 with the third input U2 to obtain a first feature map U, U ∈ R^{H×W×C}, where the second input and the third input correspond respectively to the frequency-stream feature map and the RGB-stream feature map obtained after Stage processing;
Specifically, in the embodiments of the present application, U1 ∈ R^{H×W×C} and U2 ∈ R^{H×W×C} denote the feature maps of the frequency stream and the RGB stream at the l-th Stage of the network, and H, W and C denote the height, width and number of channels of the feature maps.
b2, carrying out average pooling and maximum pooling treatment on the first feature map U to obtain a feature descriptor;
b3, carrying out convolution processing on the feature descriptors to obtain spatial attention features;
Specifically, in the embodiment of the present application, the feature descriptor is subjected to a 7×7 convolution operation and reduced to a single channel, i.e. H×W×1; the spatial attention feature V is then generated through a sigmoid, specifically expressed as:
V = σ( f^{7×7}( [AvgPool(U); MaxPool(U)] ) ) = σ( f^{7×7}( [U_avg; U_max] ) )
where σ is the sigmoid activation function, f^{7×7} is a convolution operation with a filter size of 7, AvgPool is the average pooling operation, MaxPool is the max pooling operation, U_avg is the global average feature, U_max is the global maximum feature, and V is the spatial attention feature, V ∈ R^{H×W×1}.
And b4, obtaining an enhanced second output characteristic diagram according to the spatial attention characteristic, the second input and the third input.
Specifically, in the embodiment of the present application, the spatial attention feature is multiplied by the input feature of each stream to obtain the enhanced feature U_z, i.e. the second output feature map, specifically expressed as:
U_z = V ⊗ U_i
where V is the spatial attention feature, ⊗ denotes element-wise multiplication, and U_i is the input feature of each stream.
It should be noted that the mutual enhancement module is inserted after each Stage and placed after the self-enhancement module. In this way, parallel enhancement of RGB with high frequency branching can be achieved.
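A PyTorch sketch of the mutual-enhancement module (steps b1-b4) follows; interpreting the average and max pooling as channel-wise pooling of the concatenated map (as in CBAM-style spatial attention) is an assumption drawn from the shapes given above.

```python
import torch
import torch.nn as nn

class MutualReinforcementModule(nn.Module):
    """Spatial attention built from the concatenated frequency/RGB feature
    maps and applied back to each stream (steps b1-b4)."""
    def __init__(self):
        super().__init__()
        # 7x7 convolution reducing the 2-channel descriptor to an H x W x 1 attention map.
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, u1: torch.Tensor, u2: torch.Tensor):
        u = torch.cat([u1, u2], dim=1)                    # b1: U = [U1; U2]
        avg = u.mean(dim=1, keepdim=True)                 # b2: average pooling over channels
        mx, _ = u.max(dim=1, keepdim=True)                # b2: max pooling over channels
        descriptor = torch.cat([avg, mx], dim=1)          # feature descriptor
        v = torch.sigmoid(self.conv(descriptor))          # b3: spatial attention V
        return u1 * v, u2 * v                             # b4: enhanced second output maps
```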
The fused face image detection method provided by the application extracts the high-frequency information of the three colour channels in the pixel domain and trains the model with a lightweight convolutional neural network framework. Using the high-frequency information as an input to the neural network allows the model to better explore the subtle differences between a fused image and a real image, so that fused faces can be detected more reliably. The high-frequency information is combined with the RGB image information through a progressive dual-stream network. The method can be applied to various scenarios that require personal identity verification, such as airports, stations, subways, border inspection ports, customs ports, scenic spots, examination rooms, communities, financial and commercial outlets, business enterprises and parks. Fused face detection is a key step and a key link in verification work, and is of great significance for verifying the true and valid identities of persons entering and leaving, for combating illegal entry and exit, forged certificates and impersonated identities, and for checking suspicious persons promptly and accurately. For fused face detection, the invention provides a detection method based on high-frequency features and a progressive dual-stream network. The method performs fused face detection using the high-frequency features of the R, G, B colour channels together with the features of the original image: the RGB information is used to locate abnormal textures, the high-frequency information is used to highlight the differences between a real face and a fused face, and fusing the two contributes to a comprehensive feature representation, so that fused faces can be detected effectively and the accuracy of fused face image detection is improved without increasing the complexity of the face detection algorithm.
The application also provides a fused face image detection device, comprising:
the preprocessing module is used for preprocessing the acquired RGB image to be detected to obtain a first face image;
the extraction module is used for decomposing the first face image into a first image, a second image and a third image corresponding to R, G, B color channels, and extracting corresponding first high-frequency information, second high-frequency information and third high-frequency information in the first image, the second image and the third image to obtain a first high-frequency information image, a second high-frequency information image and a third high-frequency information image;
the merging module is used for merging the first high-frequency information image, the second high-frequency information image and the third high-frequency information image to obtain a fourth high-frequency information image;
the detection module is used for inputting the fourth high-frequency information image and the RGB image into a trained double-flow convolutional neural network to obtain a detection result of whether the RGB image is a fused face image or not;
wherein, after the convolution processing in the dual-flow convolutional neural network and before the next processing step, self-enhancement processing is further included, the self-enhancement processing being used for enhancing the input features; and/or,
after Stage processing in the dual-flow convolutional neural network and before the next Stage processing, mutual-enhancement processing is further included, the mutual-enhancement processing being used for enhancing the interaction of the input features.
The application also provides a terminal device, comprising:
a memory for storing a computer program;
and the processor is used for reading the computer program in the memory and executing the operation corresponding to the fused face image detection method.
The application also provides a computer readable storage medium, wherein computer executable instructions are stored in the computer readable storage medium, and the computer executable instructions are used for realizing the fused face image detection method when being executed by a processor.
It should be understood that, although the steps in the flowcharts in the above embodiments are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily occurring in sequence, but may be performed alternately or alternately with other steps or at least a portion of the other steps or stages.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program, which may be stored on a non-transitory computer readable storage medium and which, when executed, may comprise the steps of the above-described embodiments of the methods. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (9)

1. The method for detecting the fused face image is characterized by comprising the following steps of:
s11, preprocessing the acquired RGB image to be detected to obtain a first face image;
s12, decomposing the first face image into a first image, a second image and a third image corresponding to R, G, B color channels, and extracting corresponding first high-frequency information, second high-frequency information and third high-frequency information in the first image, the second image and the third image to obtain a first high-frequency information image, a second high-frequency information image and a third high-frequency information image;
s13, combining the first high-frequency information image, the second high-frequency information image and the third high-frequency information image to obtain a fourth high-frequency information image;
s14, inputting the fourth high-frequency information image and the RGB image into a trained double-flow convolutional neural network to obtain a detection result of whether the RGB image is a fused face image or not;
wherein after the convolution processing in the dual-flow convolution neural network and before the next processing of the convolution processing, the method further comprises a self-enhancement processing, wherein the self-enhancement processing is used for enhancing the input characteristics; and after Stage processing in the dual-flow convolutional neural network and before the next processing of Stage processing, further comprising mutual enhancement processing, wherein the mutual enhancement processing is used for enhancing the characteristic interaction of the input; the mutual enhancement process follows each Stage process;
the mutual enhancement process includes:
b1, connecting the acquired second input with the third input to obtain a first characteristic diagram, wherein the second input and the third input are respectively corresponding to a frequency flow characteristic diagram and an RGB flow characteristic diagram obtained after Stage processing;
b2, carrying out average pooling and maximum pooling treatment on the first feature map to obtain a feature descriptor;
b3, carrying out convolution processing on the feature descriptors to obtain spatial attention features;
and b4, obtaining an enhanced second output characteristic diagram according to the spatial attention characteristic, the second input and the third input.
2. The fused face image detection method according to claim 1, wherein the self-enhancement processing includes:
a1, compressing a first input by using global average pooling and global maximum pooling respectively to obtain a corresponding first global space feature and a corresponding second global space feature, wherein the first input is a convolution result obtained after convolution processing;
a2, carrying out convolution processing on the first global space feature and the second global space feature to obtain a corresponding first attention prediction graph and a corresponding second attention prediction graph;
a3, obtaining a channel attention map according to the first attention prediction map and the second attention prediction map;
a4, calculating an enhanced first output characteristic diagram according to the first input and the channel attention diagram.
3. The fused face image detection method according to any one of claims 1-2, wherein S11 includes:
and detecting eye coordinates in the RGB image, and carrying out segmentation processing and normalization processing on a face region in the RGB image according to the eye coordinates to obtain a first face image.
4. The fused face image detection method according to claim 1, wherein S12 includes:
s121, decomposing the first face image into a first image, a second image and a third image corresponding to R, G, B color channels;
s122, obtaining a first spectrogram, a second spectrogram and a third spectrogram corresponding to the first image, the second image and the third image through Fourier transformation;
s123, filtering the low-frequency information in the first spectrogram, the second spectrogram and the third spectrogram to obtain a first high-frequency information spectrogram, a second high-frequency information spectrogram and a third high-frequency information spectrogram;
s124, converting the first high-frequency information spectrogram, the second high-frequency information spectrogram and the third high-frequency information spectrogram into RGB color space by utilizing inverse Fourier transform to obtain a first high-frequency information image, a second high-frequency information image and a third high-frequency information image.
5. The fused face image detection method according to claim 1 or 4, wherein the merging in S13 is performed as follows:
concatenating the first high-frequency information image, the second high-frequency information image and the third high-frequency information image along the corresponding R, G, B color channel dimension to obtain a fourth high-frequency information image.
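Continuing the sketch above, the merging of S13 then amounts to stacking the three per-channel high-frequency images back along the color axis; channel_high_frequency is the hypothetical helper sketched under claim 4.

import numpy as np

def merge_high_frequency(rgb_image):
    """Stack the per-channel high-frequency images along the R, G, B axis."""
    r_hf = channel_high_frequency(rgb_image[:, :, 0])
    g_hf = channel_high_frequency(rgb_image[:, :, 1])
    b_hf = channel_high_frequency(rgb_image[:, :, 2])
    return np.stack([r_hf, g_hf, b_hf], axis=-1)  # fourth high-frequency information image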
6. The fused face image detection method of claim 1, wherein the training process of the dual-stream convolutional neural network comprises:
S21, converting a plurality of RGB sample images into corresponding high-frequency information images;
S22, inputting the high-frequency information images and the RGB sample images into the dual-stream network for training to obtain the trained dual-stream convolutional neural network;
wherein the dual-stream convolutional neural network is used for aggregating RGB image information and high-frequency image information.
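A hedged sketch of the training step S22 follows; the model interface (taking the high-frequency image and the RGB image and returning a single real/fused logit), the Adam optimiser, the binary cross-entropy loss and the epoch count are illustrative assumptions rather than details recited in the claim.

import torch
import torch.nn as nn

def train_dual_stream(model, loader, epochs=10, lr=1e-4, device="cuda"):
    """Hypothetical training loop for S21-S22 over (rgb, high_freq, label) batches."""
    model = model.to(device)
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for rgb, high_freq, label in loader:
            rgb, high_freq = rgb.to(device), high_freq.to(device)
            label = label.float().to(device)
            logit = model(high_freq, rgb).squeeze(1)  # aggregate both streams
            loss = criterion(logit, label)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return model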
7. A fused face image detection apparatus, comprising:
a preprocessing module for preprocessing the acquired RGB image to be detected to obtain a first face image;
an extraction module for decomposing the first face image into a first image, a second image and a third image corresponding to the R, G, B color channels, and extracting the corresponding first high-frequency information, second high-frequency information and third high-frequency information from the first image, the second image and the third image to obtain a first high-frequency information image, a second high-frequency information image and a third high-frequency information image;
a merging module for merging the first high-frequency information image, the second high-frequency information image and the third high-frequency information image to obtain a fourth high-frequency information image;
a detection module for inputting the fourth high-frequency information image and the RGB image into a trained dual-stream convolutional neural network to obtain a detection result of whether the RGB image is a fused face image;
wherein the dual-stream convolutional neural network further comprises, after the convolution processing and before the processing that follows the convolution processing, a self-enhancement process for enhancing the input features; and further comprises, after the Stage processing and before the processing that follows the Stage processing, a mutual enhancement process for enhancing feature interaction between the inputs; the mutual enhancement process follows each Stage process;
the mutual enhancement process includes: connecting the acquired second input with the third input to obtain a first feature map, wherein the second input and the third input are respectively the frequency-stream feature map and the RGB-stream feature map obtained after the Stage processing; performing average pooling and maximum pooling on the first feature map to obtain a feature descriptor; performing convolution on the feature descriptor to obtain a spatial attention feature; and obtaining an enhanced second output feature map according to the spatial attention feature, the second input and the third input.
8. A terminal device, comprising:
a memory for storing a computer program;
a processor for reading the computer program in the memory and performing the operations corresponding to the fused face image detection method according to any one of claims 1 to 6.
9. A computer readable storage medium having stored therein computer executable instructions which, when executed by a processor, implement the fused face image detection method according to any one of claims 1-6.
CN202310351690.4A 2023-04-04 2023-04-04 Fused face image detection method, device, equipment and storage medium Active CN116343308B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310351690.4A CN116343308B (en) 2023-04-04 2023-04-04 Fused face image detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116343308A (en) 2023-06-27
CN116343308B (en) 2024-02-09

Family

ID=86892883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310351690.4A Active CN116343308B (en) 2023-04-04 2023-04-04 Fused face image detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116343308B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117557562B (en) * 2024-01-11 2024-03-22 齐鲁工业大学(山东省科学院) Image tampering detection method and system based on double-flow network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977830A (en) * 2019-03-16 2019-07-05 四川大学 Face fusion detection method based on color and vein binary channels convolutional neural networks and Recognition with Recurrent Neural Network
WO2021088300A1 (en) * 2019-11-09 2021-05-14 北京工业大学 Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network
CN113129261A (en) * 2021-03-11 2021-07-16 重庆邮电大学 Image tampering detection method based on double-current convolutional neural network
WO2022105608A1 (en) * 2020-11-19 2022-05-27 上海点泽智能科技有限公司 Rapid face density prediction and face detection method and apparatus, electronic device, and storage medium
CN115439507A (en) * 2022-09-06 2022-12-06 浙江大学 Three-dimensional video target tracking method based on multi-level mutual enhancement and relevant pyramid
CN115565210A (en) * 2021-06-30 2023-01-03 长沙理工大学 Feature cascade-based lightweight face fusion attack detection method
CN115861605A (en) * 2021-09-24 2023-03-28 腾讯科技(深圳)有限公司 Image data processing method, computer equipment and readable storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Kristina Host et al. An overview of Human Action Recognition in sports based on Computer Vision. Heliyon, 2022, pp. 1-25. *
Min Long et al. Face Morphing Detection Based on a Two-Stream Network with Channel Attention and Residual of Multiple Color Spaces. International Conference on Machine Learning for Cyber Security, 2023, pp. 439-454. *
Wang Shanmin. Facial expression feature learning method based on deep convolutional networks. China Master's Theses Full-text Database, Information Science and Technology, 2021, I138-2351. *
Xiao Qin et al. Application of Bessel moments in image recognition. 《信息***工程》, 2013, p. 57. *

Similar Documents

Publication Publication Date Title
Li et al. Identification of deep network generated images using disparities in color components
Asghar et al. Copy-move and splicing image forgery detection and localization techniques: a review
Park et al. Double JPEG detection in mixed JPEG quality factors using deep convolutional neural network
Zhang et al. Face morphing detection using Fourier spectrum of sensor pattern noise
Kraetzer et al. Modeling attacks on photo-ID documents and applying media forensics for the detection of facial morphing
Singh et al. Fast and efficient region duplication detection in digital images using sub-blocking method
Makrushin et al. Generalized Benford's law for blind detection of morphed face images
CN116343308B (en) Fused face image detection method, device, equipment and storage medium
Sharma et al. Comprehensive analyses of image forgery detection methods from traditional to deep learning approaches: an evaluation
Mahale et al. Image inconsistency detection using local binary pattern (LBP)
Goodwin et al. Blind video tamper detection based on fusion of source features
Hakimi et al. Image-splicing forgery detection based on improved lbp and k-nearest neighbors algorithm
Siddiqi et al. Image Splicing‐Based Forgery Detection Using Discrete Wavelet Transform and Edge Weighted Local Binary Patterns
Jwaid et al. An efficient technique for image forgery detection using local binary pattern (hessian and center symmetric) and transformation method
Chetty et al. Nonintrusive image tamper detection based on fuzzy fusion
Dhar et al. Paper currency detection system based on combined SURF and LBP features
CN111259894B (en) Certificate information identification method and device and computer equipment
Doan et al. Image tampering detection based on a statistical model
Tripathi et al. Automated image splicing detection using texture based feature criterion and fuzzy support vector machine based classifier
Chen et al. A features decoupling method for multiple manipulations identification in image operation chains
Jia et al. Face morphing attack detection based on high-frequency features and progressive enhancement learning
Aloraini FaceMD: convolutional neural network-based spatiotemporal fusion facial manipulation detection
Raheem et al. Statistical analysis of image quality measures for face liveness detection
CN113077355A (en) Insurance claim settlement method and device, electronic equipment and storage medium
Long et al. Face morphing detection based on a two-stream network with channel attention and residual of multiple color spaces

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant