CN111796681A - Self-adaptive sight estimation method and medium based on differential convolution in man-machine interaction - Google Patents

Self-adaptive sight estimation method and medium based on differential convolution in man-machine interaction

Info

Publication number
CN111796681A
CN111796681A (application CN202010647088.1A)
Authority
CN
China
Prior art keywords
convolution
sight line
differential
gaze
human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010647088.1A
Other languages
Chinese (zh)
Inventor
罗元 (Luo Yuan)
陈旭 (Chen Xu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202010647088.1A
Publication of CN111796681A
Legal status: Pending (current)

Classifications

    • G06F 3/013: Eye tracking input arrangements (G06F 3/01: input arrangements for interaction between user and computer)
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045: Combinations of networks (G06N 3/04: neural network architectures)
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods (G06N 3/02: neural networks)
    • G06V 40/19: Sensors for eye characteristics, e.g. of the iris (G06V 40/18)
    • G06V 40/193: Preprocessing; Feature extraction (G06V 40/18: eye characteristics)
    • G06V 40/197: Matching; Classification (G06V 40/18: eye characteristics)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Ophthalmology & Optometry (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)
  • Eye Examination Apparatus (AREA)

Abstract

The invention claims an adaptive sight line estimation method and medium based on differential convolution for human-computer interaction. The method comprises the following steps: S1, preprocess the face image, detect the face and locate the eye region with the MTCNN algorithm, and extract human eye feature information; S2, estimate the head pose directly from the face image; S3, automatically fuse the head pose and the eye feature map in the fully connected layer of a convolutional neural network and perform a preliminary gaze estimation; S4, train a differential convolutional network to predict the gaze difference between eye images; and S5, calibrate the preliminary gaze estimation result with the gaze difference and output the final gaze estimation result. Verification on the public Eyediap data set, and comparison with well-performing gaze estimation models from recent years, shows that the proposed model estimates the gaze direction more accurately under free head movement.

Description

Self-adaptive sight estimation method and medium based on differential convolution in man-machine interaction
Technical Field
The invention belongs to the field of image processing and pattern recognition, and particularly relates to an adaptive sight line estimation method based on differential convolution.
Background
With the rapid development of computer vision, artificial intelligence, and related fields, gaze estimation technology has attracted extensive research attention. Gaze is a very important non-verbal cue for analyzing human behavior and psychological state; it is one expression of human attention and interest, and gaze information helps to infer a person's internal state or intention, enabling a better understanding of interactions between individuals. Gaze estimation therefore plays an important role in many research areas, such as human-computer interaction, virtual reality, social interaction analysis, and medical applications.
Gaze estimation in the broad sense covers research on eyeballs, eye movements, lines of sight, and related topics. Generally speaking, gaze estimation methods fall into two broad categories: model-based methods and appearance-based methods. The basic idea of model-based methods is to estimate the gaze direction from features such as corneal reflections, combined with prior knowledge of the 3D eyeball. Appearance-based methods directly extract visual features of the eyes and train a regression model that maps appearance to gaze direction. Comparative analysis shows that model-based methods achieve higher accuracy, but they also place higher demands on image quality and resolution; meeting these demands generally requires special hardware and greatly restricts the mobility of the user's head. Appearance-based methods perform better on low-resolution, high-noise images, but training the model requires a large amount of data and is prone to overfitting. With the rise of deep learning and the release of large public data sets, appearance-based methods are receiving increasing attention.
At present, although research on gaze estimation has advanced considerably, differences in eye shape and intraocular structure between individuals limit the accuracy attainable by a universal model, and large head movements further degrade the experimental results and reduce recognition accuracy.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art by providing an adaptive sight line estimation method and medium based on differential convolution in human-computer interaction. The technical scheme of the invention is as follows:
a self-adaptive sight line estimation method based on differential convolution in human-computer interaction comprises the following steps:
s1, preprocessing the face image by utilizing a bilinear difference method to carry out multi-scale scaling, detecting the face by utilizing an optimized multi-task cascade convolution neural network algorithm, realizing pupil center positioning, and extracting human eye characteristic information;
s2, directly estimating the head pose by using the face image;
s3, automatically fusing the head pose in the step S1 and the human eye feature map in the step S2 by utilizing the full connection layer of the convolutional neural network to carry out preliminary sight estimation;
s4, predicting the gaze difference of the eyes by training by utilizing a differential convolution network;
and S5, calibrating the preliminary gaze estimation result with the obtained gaze difference, and outputting the final gaze estimation result.
Further, in step S1, the optimized multi-task cascaded convolutional neural network algorithm outputs 5 facial feature points, so that pupil center localization is completed at the same time as face detection; the output of the algorithm includes the pupil center positions.
Further, the step S2 of estimating the head pose directly from the face image specifically includes: locating the head position and orientation with a real-time head pose estimation system based on random regression forests; using T_t = [T_x, T_y, T_z] to denote the head position information at time t and R_t = [R_y, R_p, R_r] the head rotation angle information at time t, the head deflection parameters at time t can be written as h_t = (T_t, R_t).
Further, the step S3 automatically fuses the head pose and the eye feature map by using the full connection layer of the convolutional neural network to perform the preliminary gaze estimation, which specifically includes:
using a convolutional neural network based approach, we map 3@48 x 72 eyesTaking an image I as input, wherein 3 represents the number of channels of an eye image, 48 multiplied by 72 represents the size of the eye image, preprocessing the image, applying the preprocessed image to a convolutional layer, inputting an obtained characteristic map into a full-link layer, and finally obtaining a primary sight direction g in the full-link layer by training a linear regressionp(I) The loss function is:
Figure BDA0002573535300000021
wherein, ggt(I) For the true gaze direction, D is the training data set and | is the cardinality computation graph.
Further, the step S4 of predicting the gaze difference of the eyes by training a differential convolutional network specifically includes:
differential convolution analyzes the pattern direction of a sample relative to its neighboring samples; the differential computation reflects the change across consecutive samples by computing the difference between sample activations;
the differential convolutional network adopts a parallel structure in which each branch consists of three convolutional layers, each followed by batch normalization and a ReLU unit; max pooling is applied after the first and second layers to reduce the image size; after the third layer, the feature maps of the two input images are normalized and concatenated into a new tensor, and two fully connected layers are then applied to this tensor to predict the gaze difference of the two input images.
Further, the differential convolutional network selects a ReLU function as an activation function of the convolutional layer and the fully-connected layer, and the formula is as follows:
f(x)=max(0,x) (10)
where x is the input, f (x) is the output after the ReLU unit;
training a gaze estimation model using a loss function, using dp(I, J) represents the predicted gaze difference of the difference network, then the loss function LdComprises the following steps:
Figure BDA0002573535300000031
wherein I is a test image, F is a reference image, DkTo a subset of the training set D, only images of one eye of the kth individual are included.
Further, the step S5 of calibrating the preliminary gaze estimation result with the obtained gaze difference and outputting the final gaze estimation result is specifically: the differential convolutional network predicts the difference d_p(I, F) between a test image I and a reference image F, which is combined with the true gaze value g_gt(F) to predict the final gaze direction g_gt(F) + d_p(I, F), by the formula:
g(I) = Σ_(F∈D_c) w(I, F)·(g_gt(F) + d_p(I, F)) / Σ_(F∈D_c) w(I, F)    (12)
where D_c is the calibration set of reference images and w(·) weights the importance of each prediction.
A storage medium, the storage medium being a computer readable storage medium storing one or more programs that, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform the method of any of the above.
The invention has the following advantages and beneficial effects:
At present, most appearance-based gaze estimation methods regress the gaze direction directly from a single face or eye image. However, due to differences in eye shape and intraocular structure between individuals, the accuracy of a universal model is limited, and its output often exhibits high variability and subject-dependent bias. Meanwhile, when the head deflection angle is too large, the gaze estimation result is also strongly affected. The present disclosure therefore provides an adaptive gaze estimation method based on differential convolution to solve the above problems: differential convolution is introduced, a differential convolutional neural network is trained directly to predict the gaze difference between two eye input images of the same subject, and the gaze difference is then used to calibrate the initial gaze estimation result. In addition, head pose information is fused into the model to improve the robustness of the gaze estimation system.
Tests on the public data set Eyediap show that the gaze estimation error is smallest when head pose information is fused in and the differential network is used for calibration. It can be seen that introducing differential convolution effectively calibrates the gaze estimation result and reduces the gaze estimation error, while fusing head pose information makes the system more robust to changes in head pose. To compare the gaze estimation performance of different models more clearly, the proposed algorithm model was compared with other gaze estimation methods based on convolutional neural networks; the proposed model achieves a smaller gaze estimation error and excellent performance.
Drawings
FIG. 1 is a diagram of a preferred embodiment of a line-of-sight estimation framework based on a differential convolutional network (DNet) according to the present invention;
fig. 2 is a diagram of a differential convolutional network structure.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical solution of the invention for solving the above technical problems is as follows:
s1, preprocessing the face image by using a bilinear difference method to perform multi-scale scaling, detecting the face by using an optimized multi-task cascade convolution neural network algorithm (in the invention, the existing algorithm is adopted, so the invention is abbreviated), realizing pupil center positioning at the same time, and extracting human eye characteristic information;
And S2, estimating the head pose directly from the face image. A real-time head pose estimation system based on random regression forests is employed to locate the head position and orientation. Using T_t = [T_x, T_y, T_z] to denote the head position information at time t and R_t = [R_y, R_p, R_r] the head rotation angle information at time t, the head deflection parameters at time t can be written as h_t = (T_t, R_t).
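The pose parameterization of step S2 can be packed as below. This is only a sketch of the data layout: the random-regression-forest estimator itself is left as a stub, since the patent references it as an existing system rather than defining it.

```python
# Sketch of the head pose parameters h_t = (T_t, R_t) from step S2.
from dataclasses import dataclass

@dataclass
class HeadPose:
    T: tuple  # T_t = (T_x, T_y, T_z): head position at time t
    R: tuple  # R_t = (R_y, R_p, R_r): head rotation angles at time t

def estimate_head_pose(face_img) -> HeadPose:
    # Stub for the random-regression-forest system; a real implementation
    # would regress the six pose values from the input frame.
    Tx = Ty = Tz = Ry = Rp = Rr = 0.0  # placeholder values
    return HeadPose(T=(Tx, Ty, Tz), R=(Ry, Rp, Rr))  # h_t = (T_t, R_t)
```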
S3, automatically fusing the head pose and the human eye feature map in the fully connected layer of a convolutional neural network to perform a preliminary gaze estimation. A convolutional neural network based approach is adopted: an eye image I of size 3@48×72 is taken as input, where 3 is the number of channels of the eye image and 48×72 is its size. The image is preprocessed and passed through the convolutional layers, the obtained feature map is fed into the fully connected layers, and a preliminary gaze direction g_p(I) is finally obtained in the fully connected layer by training a linear regression. The loss function is:
L_p = (1/|D|) Σ_(I∈D) ||g_p(I) - g_gt(I)||    (9)
where g_gt(I) is the true gaze direction, D is the training data set, and |·| denotes the cardinality of a set.
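A minimal PyTorch sketch of this step is given below. The patent fixes only the 3@48×72 input and the fusion of head pose with the eye feature map at the fully connected layer; the channel counts, kernel sizes, and six-dimensional pose vector used here are illustrative assumptions.

```python
# Sketch of step S3: CNN over the eye image, head pose fused in the FC layer.
import torch
import torch.nn as nn

class GazeNet(nn.Module):
    def __init__(self, pose_dim: int = 6):  # pose_dim assumes h_t = (T_t, R_t)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            # 48x72 input pooled twice -> 12x18 feature map, concatenated with h_t
            nn.Linear(128 * 12 * 18 + pose_dim, 256), nn.ReLU(),
            nn.Linear(256, 2),  # preliminary gaze direction g_p(I): (yaw, pitch)
        )

    def forward(self, eye: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        f = self.features(eye).flatten(1)
        return self.fc(torch.cat([f, pose], dim=1))

def loss_lp(g_pred: torch.Tensor, g_true: torch.Tensor) -> torch.Tensor:
    # L_p: mean distance between predicted and true gaze over the batch.
    return (g_pred - g_true).norm(dim=1).mean()
```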
And S4, predicting the gaze difference of the eyes by training a differential convolutional network. Differential convolution analyzes the pattern direction of a sample relative to its neighboring samples; the differential computation reflects the change across consecutive samples by computing the difference between sample activations.
The differential convolutional network adopts a parallel structure in which each branch consists of three convolutional layers, each followed by batch normalization and a ReLU unit. Max pooling is applied after the first and second layers to reduce the image size. After the third layer, the feature maps of the two input images are normalized and concatenated into a new tensor. Two fully connected layers are then applied to this tensor to predict the gaze difference of the two input images.
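The branch structure just described can be sketched in PyTorch as follows, with the two branches sharing weights (a common Siamese arrangement that the parallel-branch description is consistent with, though the patent does not state weight sharing explicitly); channel counts are again illustrative.

```python
# Sketch of the differential network (step S4, Fig. 2): two parallel branches
# of three conv layers (each with batch norm + ReLU), max pooling after
# layers 1 and 2, normalized feature concatenation, then two FC layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffNet(nn.Module):
    def __init__(self):
        super().__init__()
        def block(cin, cout, pool):
            layers = [nn.Conv2d(cin, cout, 3, padding=1),
                      nn.BatchNorm2d(cout), nn.ReLU()]
            if pool:
                layers.append(nn.MaxPool2d(2))  # after layers 1 and 2 only
            return nn.Sequential(*layers)
        self.branch = nn.Sequential(block(3, 32, True),
                                    block(32, 64, True),
                                    block(64, 128, False))
        self.fc = nn.Sequential(
            nn.Linear(2 * 128 * 12 * 18, 256), nn.ReLU(),
            nn.Linear(256, 2),  # predicted gaze difference d_p(I, F)
        )

    def forward(self, img_i: torch.Tensor, img_f: torch.Tensor) -> torch.Tensor:
        fi = F.normalize(self.branch(img_i).flatten(1), dim=1)
        ff = F.normalize(self.branch(img_f).flatten(1), dim=1)
        return self.fc(torch.cat([fi, ff], dim=1))  # spliced into one tensor
```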
The differential convolutional network selects a ReLU function as an activation function of the convolutional layer and the fully-connected layer, and the formula is as follows:
f(x)=max(0,x) (10)
where x is the input and f (x) is the output after the ReLU unit.
A gaze estimation model is trained using a loss function; with d_p(I, F) denoting the gaze difference predicted by the differential network, the loss function L_d is:
L_d = (1/|D|) Σ_k Σ_(I,F∈D_k) ||d_p(I, F) - (g_gt(I) - g_gt(F))||    (11)
where I is a test image, F is a reference image, and D_k is the subset of the training set D containing only the images of one eye of the k-th individual.
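Under the reconstruction of equation (11) above, the loss reduces to a simple batch computation over same-subject image pairs; the sketch below assumes batched tensors of predicted differences and ground-truth gaze values.

```python
# Sketch of the differential loss L_d for a batch of same-subject pairs.
import torch

def loss_ld(d_pred: torch.Tensor, g_i: torch.Tensor, g_f: torch.Tensor) -> torch.Tensor:
    """d_pred: predicted differences d_p(I, F); g_i, g_f: true gaze of the
    two images in each pair, drawn from the same subject's subset D_k."""
    return (d_pred - (g_i - g_f)).norm(dim=1).mean()
```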
And S5, calibrating the preliminary gaze estimation result with the obtained gaze difference, and outputting the final gaze estimation result. The differential convolutional network predicts the difference d_p(I, F) between a test image I and a reference image F, which is combined with the true gaze value g_gt(F) to predict the final gaze direction g_gt(F) + d_p(I, F), by the formula:
g(I) = Σ_(F∈D_c) w(I, F)·(g_gt(F) + d_p(I, F)) / Σ_(F∈D_c) w(I, F)    (12)
where D_c is the calibration set of reference images and w(·) weights the importance of each prediction.
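Combining the pieces, the calibration of step S5 can be sketched as below. The patent does not specify the weighting function w(·), so a softmax over the negative magnitudes of the predicted differences is assumed here (references predicted to lie closest to the test gaze count most); that choice is an illustration, not the claimed method.

```python
# Sketch of step S5: calibrate the test image's gaze against the set D_c.
import torch

def calibrated_gaze(diff_net, test_img, ref_imgs, ref_gazes):
    """test_img: (3, 48, 72); ref_imgs: (N, 3, 48, 72) calibration images with
    known gaze ref_gazes: (N, 2). diff_net is assumed to be in eval() mode.
    Returns the final gaze estimate following the reconstructed eq. (12)."""
    n = ref_imgs.shape[0]
    test_batch = test_img.unsqueeze(0).expand(n, -1, -1, -1)
    d = diff_net(test_batch, ref_imgs)        # d_p(I, F) for each reference F
    votes = ref_gazes + d                     # g_gt(F) + d_p(I, F)
    w = torch.softmax(-d.norm(dim=1), dim=0)  # assumed weighting w(I, F)
    return (w.unsqueeze(1) * votes).sum(dim=0)
```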
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (8)

1. A self-adaptive sight line estimation method based on differential convolution in human-computer interaction is characterized by comprising the following steps:
S1, preprocessing the face image with bilinear interpolation for multi-scale scaling, detecting the face with an optimized multi-task cascaded convolutional neural network algorithm while locating the pupil centers, and extracting human eye feature information;
S2, directly estimating the head pose from the face image;
S3, automatically fusing the head pose from step S2 and the human eye feature map from step S1 in the fully connected layer of a convolutional neural network to perform a preliminary gaze estimation;
S4, predicting the gaze difference of the eyes by training a differential convolutional network;
and S5, calibrating the preliminary gaze estimation result with the obtained gaze difference, and outputting the final gaze estimation result.
2. The adaptive sight line estimation method based on differential convolution in human-computer interaction according to claim 1, wherein in step S1 the optimized multi-task cascaded convolutional neural network algorithm outputs 5 facial feature points, so that pupil center localization is completed at the same time as face detection; the output of the algorithm includes the pupil center positions.
3. The adaptive sight line estimation method based on differential convolution in human-computer interaction according to claim 2, wherein the step S2 of estimating the head pose directly from the face image specifically includes: locating the head position and orientation with a real-time head pose estimation system based on random regression forests; using T_t = [T_x, T_y, T_z] to denote the head position information at time t and R_t = [R_y, R_p, R_r] the head rotation angle information at time t, the head deflection parameters at time t can be written as h_t = (T_t, R_t).
4. The adaptive sight line estimation method based on differential convolution in human-computer interaction according to claim 3, wherein the step S3 is to perform preliminary sight line estimation by automatically fusing a head pose and a human eye feature map by using a full connection layer of a convolutional neural network, and specifically includes:
adopting a method based on a convolutional neural network: an eye image I of size 3@48×72 is taken as input, where 3 is the number of channels of the eye image and 48×72 is its size; the image is preprocessed and passed through the convolutional layers, the obtained feature map is fed into the fully connected layers, and a preliminary gaze direction g_p(I) is finally obtained in the fully connected layer by training a linear regression; the loss function is:
L_p = (1/|D|) Σ_(I∈D) ||g_p(I) - g_gt(I)||    (9)
where g_gt(I) is the true gaze direction, D is the training data set, and |·| denotes the cardinality of a set.
5. The adaptive sight line estimation method based on differential convolution in human-computer interaction according to claim 4, wherein the step S4 of predicting the gaze difference of the eyes by training a differential convolutional network specifically includes:
differential convolution analyzes the pattern direction of a sample relative to its neighboring samples; the differential computation reflects the change across consecutive samples by computing the difference between sample activations;
the differential convolutional network adopts a parallel structure in which each branch consists of three convolutional layers, each followed by batch normalization and a ReLU unit; max pooling is applied after the first and second layers to reduce the image size; after the third layer, the feature maps of the two input images are normalized and concatenated into a new tensor, and two fully connected layers are then applied to this tensor to predict the gaze difference of the two input images.
6. The adaptive sight line estimation method based on differential convolution in human-computer interaction according to claim 5, characterized in that the differential convolution network selects a ReLU function as an activation function of a convolution layer and a full link layer, and the formula is as follows:
f(x)=max(0,x) (10)
where x is the input, f (x) is the output after the ReLU unit;
training a gaze estimation model using a loss function; with d_p(I, F) denoting the gaze difference predicted by the differential network, the loss function L_d is:
L_d = (1/|D|) Σ_k Σ_(I,F∈D_k) ||d_p(I, F) - (g_gt(I) - g_gt(F))||    (11)
where I is a test image, F is a reference image, and D_k is the subset of the training set D containing only the images of one eye of the k-th individual.
7. The adaptive sight line estimation method based on differential convolution in human-computer interaction according to claim 6, wherein the step S5 of calibrating the preliminary gaze estimation result with the obtained gaze difference and outputting the final gaze estimation result is specifically: the differential convolutional network predicts the difference d_p(I, F) between a test image I and a reference image F, which is combined with the true gaze value g_gt(F) to predict the final gaze direction g_gt(F) + d_p(I, F), by the formula:
g(I) = Σ_(F∈D_c) w(I, F)·(g_gt(F) + d_p(I, F)) / Σ_(F∈D_c) w(I, F)    (12)
where D_c is the calibration set of reference images and w(·) weights the importance of each prediction.
8. A storage medium being a computer readable storage medium storing one or more programs which, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform the method of any of claims 1-7 above.
CN202010647088.1A 2020-07-07 2020-07-07 Self-adaptive sight estimation method and medium based on differential convolution in man-machine interaction Pending CN111796681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010647088.1A CN111796681A (en) 2020-07-07 2020-07-07 Self-adaptive sight estimation method and medium based on differential convolution in man-machine interaction

Publications (1)

Publication Number Publication Date
CN111796681A 2020-10-20

Family

ID=72809704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010647088.1A Pending CN111796681A (en) 2020-07-07 2020-07-07 Self-adaptive sight estimation method and medium based on differential convolution in man-machine interaction

Country Status (1)

Country Link
CN (1) CN111796681A (en)


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046734A (en) * 2019-11-12 2020-04-21 重庆邮电大学 Multi-modal fusion sight line estimation method based on expansion convolution

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GANG LIU et al.: "A Differential Approach for Gaze Estimation", arXiv:1904.09459v3 *
GANG LIU et al.: "A Differential Approach for Gaze Estimation", IEEE *
CHEN Xuefeng (陈雪峰): "Research on Visual Attention Detection with Behavior Feature Fusion" (行为特征融合的视觉注意力检测技术研究), China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418074A (en) * 2020-11-20 2021-02-26 重庆邮电大学 Coupled posture face recognition method based on self-attention
CN112418074B (en) * 2020-11-20 2022-08-23 重庆邮电大学 Coupled posture face recognition method based on self-attention
CN112597823A (en) * 2020-12-07 2021-04-02 深延科技(北京)有限公司 Attention recognition method and device, electronic equipment and storage medium
CN112711984B (en) * 2020-12-09 2022-04-12 北京航空航天大学 Fixation point positioning method and device and electronic equipment
CN112711984A (en) * 2020-12-09 2021-04-27 北京航空航天大学 Fixation point positioning method and device and electronic equipment
CN113642393A (en) * 2021-07-07 2021-11-12 重庆邮电大学 Attention mechanism-based multi-feature fusion sight line estimation method
CN113642393B (en) * 2021-07-07 2024-03-22 重庆邮电大学 Attention mechanism-based multi-feature fusion sight estimation method
CN113705349A (en) * 2021-07-26 2021-11-26 电子科技大学 Attention power analysis method and system based on sight estimation neural network
CN113705349B (en) * 2021-07-26 2023-06-06 电子科技大学 Attention quantitative analysis method and system based on line-of-sight estimation neural network
CN113838135A (en) * 2021-10-11 2021-12-24 重庆邮电大学 Pose estimation method, system and medium based on LSTM double-current convolution neural network
CN113838135B (en) * 2021-10-11 2024-03-19 重庆邮电大学 Pose estimation method, system and medium based on LSTM double-flow convolutional neural network
CN113807330A (en) * 2021-11-19 2021-12-17 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Three-dimensional sight estimation method and device for resource-constrained scene
CN114898453A (en) * 2022-05-23 2022-08-12 重庆邮电大学 Cooperative network based sight line estimation method
CN116226712A (en) * 2023-03-03 2023-06-06 湖北商贸学院 Online learner concentration monitoring method, system and readable storage medium

Similar Documents

Publication Publication Date Title
CN111796681A (en) Self-adaptive sight estimation method and medium based on differential convolution in man-machine interaction
CN109117731B (en) Classroom teaching cognitive load measurement system
JP7273157B2 (en) Model training method, device, terminal and program
US10614289B2 (en) Facial tracking with classifiers
US20230377190A1 (en) Method and device for training models, method and device for detecting body postures, and storage medium
CN111325726A (en) Model training method, image processing method, device, equipment and storage medium
CN109359512A (en) Eyeball position method for tracing, device, terminal and computer readable storage medium
CN111046734B (en) Multi-modal fusion sight line estimation method based on expansion convolution
CN111598168B (en) Image classification method, device, computer equipment and medium
CN112330684B (en) Object segmentation method and device, computer equipment and storage medium
CN110555426A (en) Sight line detection method, device, equipment and storage medium
CN114120432A (en) Online learning attention tracking method based on sight estimation and application thereof
Aslan et al. Multimodal video-based apparent personality recognition using long short-term memory and convolutional neural networks
CN110222780A (en) Object detecting method, device, equipment and storage medium
CN111754546A (en) Target tracking method, system and storage medium based on multi-feature map fusion
Sumer et al. Attention flow: End-to-end joint attention estimation
CN113177559A (en) Image recognition method, system, device and medium combining breadth and dense convolutional neural network
Sun et al. Personality assessment based on multimodal attention network learning with category-based mean square error
Duraisamy et al. Classroom engagement evaluation using computer vision techniques
CN111339878B (en) Correction type real-time emotion recognition method and system based on eye movement data
Lee et al. Automatic facial recognition system assisted-facial asymmetry scale using facial landmarks
CN110889393A (en) Human body posture estimation method and device
CN115116117A (en) Learning input data acquisition method based on multi-mode fusion network
CN113805695A (en) Reading understanding level prediction method and device, electronic equipment and storage medium
Elahi et al. Webcam-based accurate eye-central localization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201020)