WO2022268183A1 - Video-based random gesture authentication method and system - Google Patents

Video-based random gesture authentication method and system

Info

Publication number
WO2022268183A1
WO2022268183A1 (application PCT/CN2022/100935)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
gesture
video
random
physiological
Prior art date
Application number
PCT/CN2022/100935
Other languages
English (en)
French (fr)
Inventor
康文雄 (KANG Wenxiong)
宋文伟 (SONG Wenwei)
Original Assignee
华南理工大学 (South China University of Technology)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华南理工大学 (South China University of Technology)
Publication of WO2022268183A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F 21/31 User authentication
    • G06F 21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F 21/44 Program or device authentication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Definitions

  • The invention belongs to the field of biometric recognition and video understanding, and more specifically relates to a video-based, in-the-air random gesture authentication method and system.
  • Biometric authentication technology is a typical and complex pattern recognition problem that has long been at the forefront of artificial intelligence. It refers to the science and technology of identifying a person by acquiring and analyzing the physiological and behavioral characteristics of the human body. Common biometric modalities include fingerprints, irises, faces, palm prints, hand shapes, veins, handwriting, gait and voiceprints. After years of development, biometric authentication technology has penetrated every aspect of production and daily life: from unlocking electronic devices, supermarket checkout and community access control to high-speed rail boarding and airport security checks, biometric traits have become important digital identity credentials in the era of the Internet of Everything.
  • Biometric authentication is related to the privacy and property security of the public, and involves many moral and ethical issues. Therefore, the public urgently needs a safer, more friendly and more efficient biometric authentication technology.
  • Existing biometric recognition technologies are not perfect, and different biometric modalities have their own advantages and disadvantages. The face is the most prominent modality in biometrics because the information it carries is highly discriminative; however, it touches the public's sensitive identity information and infringes on user privacy to a certain extent, and without effective supervision and legal constraints, face recognition technology is difficult to popularize on a large scale.
  • Fingerprint technology is relatively mature after more than 50 years of development, but the authentication process requires touch sensors, which are easily affected by grease, water stains and the like, and which also increase the possibility of cross-infection with bacteria and viruses.
  • Although iris authentication can be contactless, iris images are difficult to acquire, a high degree of user cooperation is required, and the user experience is poor.
  • The above modalities also face the serious problem of spoofing attacks; although liveness detection can be performed, hidden risks remain, and the templates are irreplaceable.
  • Vein-based authentication methods have good anti-counterfeiting capability, but the amount of information carried by veins is relatively small and difficult to mine, and it is greatly affected by the collection equipment, individual differences and temperature.
  • Unlike the physiological traits above, gait recognition, signature recognition and voiceprint recognition are mainly based on behavioral characteristics.
  • The behavioral features involved in gait recognition and signature recognition are relatively simple and lack feature-rich physiological information, so the recognition performance is relatively poor.
  • The voiceprint is a behavioral characteristic with physiological properties: on the one hand, the voice reflects differences in the speaker's innate vocal organs; on the other hand, it contains the speaker's acquired, distinctive pronunciation and speech habits.
  • However, the user must speak during authentication, which leads to a poor user experience and limited application scenarios.
  • At present there are two video-based gesture authentication modes and two video-based gesture authentication systems. The two modes are gesture authentication based on system-defined gesture types and gesture authentication based on user-defined gesture types.
  • In the first type, gesture authentication based on system-defined gesture types, the user must use the gestures specified by the system for registration and authentication, and the registration gesture and the gesture used for authentication must be identical. This requires the user to memorize the gesture type; unfamiliarity with the gesture easily leads to unnatural execution, and forgetting the gesture degrades the authentication performance.
  • In the second type, gesture authentication based on user-defined gesture types, users can design their own gestures for registration and authentication, but the registration and authentication gestures must still be identical.
  • The two video-based gesture authentication systems are an authentication system based on a two-stream convolutional neural network and an authentication system based on a three-dimensional convolutional neural network.
  • The authentication system based on the two-stream convolutional neural network uses optical flow to represent behavioral features, which requires twice the number of parameters and amount of computation, and the computation of optical flow is itself inefficient.
  • The authentication system based on the three-dimensional convolutional neural network models spatiotemporal features directly through three-dimensional convolution and extracts behavioral and physiological features at the same time, but the parameters and computation of three-dimensional convolution are also very large. Neither system can meet the real-time requirements of actual authentication products. It can be seen that current video-based gesture authentication methods still have many deficiencies in authentication mode and system design and cannot meet the needs of use.
  • The purpose of the present invention is to overcome the deficiencies of existing biometric recognition and gesture authentication technologies and to provide a video-based random gesture authentication method and system that require no memorization of gestures and make authentication more efficient and secure.
  • The random gesture feature extractor is obtained after training and testing the temporal difference symbiotic (co-occurrence) neural network model. The model includes a residual physiological feature extraction module, a co-occurrence behavioral feature extraction module, a feature fusion module based on the behavioral feature norm, and an inter-frame difference module.
  • The residual physiological feature extraction module takes the random gesture video as input and extracts physiological features; the inter-frame difference module subtracts the same channels of adjacent frames for the input video and for the output features of each layer of the residual physiological feature extraction module, and sums all channels of each difference feature element-wise to obtain differential pseudo-modalities; the co-occurrence behavioral feature extraction module takes the gesture video differential pseudo-modalities as input and extracts behavioral features; the feature fusion module based on the behavioral feature norm fuses the physiological and behavioral features to make full use of their complementary advantages in identity information, improving authentication accuracy and system security.
  • In the registration mode, the input user name and the extracted feature vector of the random gesture are added to the gesture template database; in the authentication mode, the multiple feature vectors corresponding to the user name are first retrieved from the gesture template database, their cosine distances to the feature vector of the user to be authenticated are computed, and the smallest cosine distance is compared with a threshold: if it is below the threshold, authentication passes, otherwise it fails. The threshold is an authentication threshold set manually according to the application scenario.
  • To collect the user's random gesture video, the user only needs to improvise a gesture that meets the requirements in front of the camera; the random gesture does not need to be memorized.
  • During data collection, the gesture should involve all five fingers as fully as possible and show the palm from multiple angles.
  • A T-frame gesture clip is intercepted from the dynamic gesture video, followed by frame-by-frame center cropping, image resizing and image standardization; the size of the finally intercepted video is (T, C, W, H), where T is the number of frames, C the number of channels, W the image width and H the image height.
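For illustration, the sketch below (PyTorch; the function name, the choice of the middle clip and the 224-pixel output size are our assumptions, not the patent's code) shows this kind of clip interception, frame-by-frame center cropping and resizing to obtain a (T, C, W, H) clip; image standardization is shown separately further below.

```python
import torch
import torch.nn.functional as F

def preprocess_gesture_video(frames: torch.Tensor, T: int = 20, out_size: int = 224) -> torch.Tensor:
    """Hypothetical preprocessing sketch. frames: (N, C, H, W) float tensor of a captured gesture video.

    Intercepts the middle T-frame clip, center-crops each frame to a square and resizes it,
    returning a clip of shape (T, C, out_size, out_size); standardization is applied afterwards.
    """
    N, C, H, W = frames.shape
    start = max((N - T) // 2, 0)                      # middle T-frame gesture segment
    clip = frames[start:start + T].float()
    side = min(H, W)                                   # frame-by-frame center crop to a square
    top, left = (H - side) // 2, (W - side) // 2
    clip = clip[:, :, top:top + side, left:left + side]
    clip = F.interpolate(clip, size=(out_size, out_size), mode="bilinear", align_corners=False)
    return clip
```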
  • The random gesture feature extractor is obtained after the temporal difference symbiotic neural network model is trained and tested, as follows:
  • The final dataset size is (P, Q, N, C, W, H), where P is the number of collected users, Q the number of random gestures performed by each user, and N the number of video frames of each random gesture;
  • the data set is divided into training samples and test samples, which are used for training and testing of the temporal difference symbiotic neural network model.
  • the test set needs to take into account the cross-period problem in biometric identification, that is, as time goes by, biological characteristics will change to a certain extent, usually reflected in behavioral characteristics. Therefore, the test set of random gestures needs to collect the random gestures of many people (for example, 100 people) after a week apart as the test set of the second stage.
  • the neural network finally deployed in the authentication system is mainly selected based on the equal error rate of the samples in the second stage, so that the model has good performance in real scenarios.
  • In the training phase, a random T-frame gesture clip is intercepted from the random gesture video, and random rotation, random color jitter and image standardization are applied; the random gesture video after this online processing is propagated forward through the temporal difference symbiotic neural network model to obtain the fused feature, which is then fed into the loss function, and the model is optimized by backpropagation;
  • In the testing phase, the middle T-frame gesture clip is intercepted from the random gesture video and image standardization is applied; the clip is then fed into the temporal difference symbiotic neural network to obtain the fused feature for distance computation.
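A minimal sketch of the training/testing clip selection and the training-time augmentation just described (assuming PyTorch tensors; the ±15° rotation follows the embodiment described later, while the jitter ranges are our assumptions):

```python
import random
import torch
import torchvision.transforms.functional as TF

def sample_clip(video: torch.Tensor, T: int = 20, train: bool = True) -> torch.Tensor:
    """video: (N, C, H, W). Training uses a random T-frame window (N-T+1 possible clips);
    testing uses the middle window, as described above."""
    N = video.shape[0]
    start = random.randint(0, N - T) if train else (N - T) // 2
    return video[start:start + T]

def augment_clip(clip: torch.Tensor) -> torch.Tensor:
    """Apply the SAME random rotation and color jitter to every frame of one clip.
    The +/-15 degree rotation follows the embodiment; the jitter ranges are assumptions."""
    angle = random.uniform(-15.0, 15.0)
    b = random.uniform(0.8, 1.2)       # brightness factor
    c = random.uniform(0.8, 1.2)       # contrast factor
    s = random.uniform(0.8, 1.2)       # saturation factor
    out = []
    for frame in clip:                 # identical parameters reused for each frame
        frame = TF.rotate(frame, angle)
        frame = TF.adjust_brightness(frame, b)
        frame = TF.adjust_contrast(frame, c)
        frame = TF.adjust_saturation(frame, s)
        out.append(frame)
    return torch.stack(out)
```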
  • Gesture authentication can be regarded as a metric learning task.
  • the model should map the user's random gesture video to a feature space with small intra-class spacing and large inter-class spacing.
  • Compared with the triplet loss and the contrastive loss, AM-Softmax does not require carefully designed sample pairs; compared with SphereFace and L-Softmax, AM-Softmax is simpler and more interpretable. This system uses the AM-Softmax loss function for model training:
  • $L_{AMS} = -\frac{1}{Bt}\sum_{i=1}^{Bt}\log\frac{e^{s\,(W_{y_i}^{T}f_i - m)}}{e^{s\,(W_{y_i}^{T}f_i - m)} + \sum_{j\neq y_i} e^{s\,W_j^{T}f_i}}$
  • where $W_i$ ($W_i$ including $W_{y_i}$ and $W_j$) and $f_i$ are the normalized weight coefficients and the user identity feature vector respectively, $L_{AMS}$ is the loss function, Bt is the batch size used in training, i denotes the i-th sample in the batch, $y_i$ denotes the correct user name of that sample, fdim is the dimension of the feature output by the feature fusion module based on the behavioral feature norm (512 in this system, as shown in Figure 2), j denotes the j-th dimension of the fdim-dimensional feature, and s and m are hyperparameters.
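A hedged PyTorch sketch of an AM-Softmax training head consistent with this description (fdim = 512 follows the text; the number of users, and the values s = 30 and m = 0.5 taken from the embodiment later in the document, are configuration choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Standard AM-Softmax head consistent with the description above.
    fdim=512 follows the text; num_users, and s=30 / m=0.5 (from the embodiment), are assumptions."""
    def __init__(self, fdim: int = 512, num_users: int = 100, s: float = 30.0, m: float = 0.5):
        super().__init__()
        self.W = nn.Parameter(torch.randn(fdim, num_users))
        self.s, self.m = s, m

    def forward(self, f: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # f: (Bt, fdim) fused identity features, y: (Bt,) integer user labels
        f = F.normalize(f, dim=1)                     # normalized identity feature vectors
        W = F.normalize(self.W, dim=0)                # normalized weight coefficients
        cos = f @ W                                   # (Bt, num_users) cosine similarities
        margin = torch.zeros_like(cos).scatter_(1, y.unsqueeze(1), self.m)
        logits = self.s * (cos - margin)              # subtract the margin m only on the true class
        return F.cross_entropy(logits, y)
```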
  • the samples in the test set of the first stage and the samples in the test set of the second stage are tested in turn.
  • the random gesture videos were first paired.
  • the random gesture pairs from the same user were marked as positive samples, and the random gesture pairs from different users were marked as negative samples.
  • 25,000 pairs of positive and negative sample pairs were randomly selected for testing.
  • During testing, a T-frame gesture clip containing rich motion is first intercepted and image standardization is applied; the clip is then fed into the time-difference symbiotic neural network model to obtain the user identity feature combining physiological and behavioral features, and the distances of the 50,000 sample pairs are computed.
  • 1000 values uniformly sampled between the minimum and maximum pair distances are used in turn as thresholds, i.e. Threshold = [min, min+step, min+2·step, ..., max], where step is the uniform sampling step size. If the cosine distance of a sample pair is less than the threshold, authentication passes; otherwise it fails.
  • FAR represents the probability that the system mistakenly authenticates an unregistered user, that is, the ratio of the number of negative sample pairs whose cosine distance is less than the threshold to all negative sample pairs in the test set:
  • $FAR_{thres} = \frac{FP_{thres}}{FP_{thres} + TN_{thres}}$
  • where $FP_{thres}$ is the number of negative samples accepted by the system at threshold thres and $TN_{thres}$ is the number of negative samples rejected by the system.
  • FRR represents the probability that the system mistakenly rejects a registered user, that is, the ratio of the number of positive sample pairs whose cosine distance is greater than the threshold to all positive sample pairs in the test set:
  • $FRR_{thres} = \frac{FN_{thres}}{FN_{thres} + TP_{thres}}$
  • where $FN_{thres}$ is the number of positive samples rejected by the system and $TP_{thres}$ is the number of positive samples accepted by the system.
  • The smaller the FRR, the better the usability of the algorithm, i.e. users are less likely to be rejected when accessing their own accounts; the smaller the FAR, the stronger the security of the algorithm, i.e. it is harder for a user to impersonate and attack someone else's account.
  • FAR and FRR have a performance trade-off: by traversing different thresholds, the FAR and FRR at each threshold can be obtained; as the threshold increases, FAR increases and FRR decreases.
  • EER is the error rate when FRR is equal to FAR, and it is used to evaluate the matching accuracy of different parameters, because FRR and FAR are treated equally at this time. Algorithms with lower EER can show better performance in authentication tasks. Therefore, the model with the lowest EER is finally selected as the feature extractor.
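The FAR/FRR/EER evaluation described above can be sketched as follows (NumPy; the linspace-based sweep stands in for the 1000 uniformly sampled thresholds, and the function name is our own):

```python
import numpy as np

def far_frr_eer(pos_dist: np.ndarray, neg_dist: np.ndarray, n_thresholds: int = 1000):
    """Sweep thresholds between the smallest and largest observed cosine distances and
    report FAR and FRR per threshold plus the EER (the point where FAR ~= FRR)."""
    dists = np.concatenate([pos_dist, neg_dist])
    thresholds = np.linspace(dists.min(), dists.max(), n_thresholds)
    fars = np.array([np.mean(neg_dist < t) for t in thresholds])   # impostor pairs accepted
    frrs = np.array([np.mean(pos_dist >= t) for t in thresholds])  # genuine pairs rejected
    i = np.argmin(np.abs(fars - frrs))
    eer = (fars[i] + frrs[i]) / 2.0
    return thresholds, fars, frrs, eer
```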
  • The T frames of random gesture images are treated as an image batch of size T for forward propagation through the 18-layer convolutional neural network; through global average pooling and a fully connected operation, the physiological features are represented as a T×fdim-dimensional feature vector, which is then averaged over the time dimension to obtain an fdim-dimensional physiological feature vector.
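A sketch of this residual physiological feature branch under the stated design, using torchvision's ResNet-18 as a stand-in for the 18-layer residual network (the intermediate feature maps that feed the inter-frame difference module, and the ImageNet initialization used in the embodiment, are omitted here for brevity):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ResidualPhysiologicalExtractor(nn.Module):
    """Treat the T frames of a clip as an image batch for a standard 18-layer residual network,
    project to fdim and average over time. torchvision's ResNet-18 is an assumed stand-in."""
    def __init__(self, fdim: int = 512):
        super().__init__()
        backbone = resnet18(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])   # up to global average pooling
        self.fc = nn.Linear(512, fdim)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        Bt, T, C, H, W = video.shape
        x = video.reshape(Bt * T, C, H, W)            # frames as an image batch of size Bt*T
        x = self.backbone(x).flatten(1)               # (Bt*T, 512)
        x = self.fc(x)                                # (Bt*T, fdim) per-frame physiological features
        return x.reshape(Bt, T, -1).mean(dim=1)       # average over the time dimension -> (Bt, fdim)
```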
  • The steps of obtaining the behavioral features through the co-occurrence behavioral feature extraction module are: input the random gesture video and obtain the random gesture video differential pseudo-modalities through the inter-frame difference module; feed the differential pseudo-modalities into the co-occurrence behavioral feature extraction module; after each convolution operation, concatenate the output of the previous layer with the differential pseudo-modality representing the corresponding residual physiological features along the channel dimension; through global average pooling and a fully connected operation, represent the behavioral features as an fdim-dimensional feature vector.
  • The differential pseudo-modality obtained by the inter-frame difference module is:
  • $IS_{fn}(x,y,t) = \sum_{chn=1}^{ch}\left[F^{chn}_{fn}(x,y,t+1) - F^{chn}_{fn}(x,y,t)\right]$
  • where $IS_{fn}(x,y,t)$ is the differential pseudo-modality, chn, fn and t denote the chn-th channel, the fn-th layer of features of the residual physiological feature extraction module and the t-th frame respectively, ch is the total number of channels of the current feature map, x and y are the abscissa and ordinate of the feature map or image, and $F^{chn}_{fn}(x,y,t)$ is the chn-th channel feature map of frame t in the fn-th layer features.
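In code, the formula above reduces to a subtraction of adjacent frames followed by a channel-wise sum; a minimal sketch (assuming the feature maps are already reshaped to (Bt, T, ch, w, h)):

```python
import torch

def interframe_difference(feat: torch.Tensor) -> torch.Tensor:
    """feat: (Bt, T, ch, w, h) input frames or feature maps from one layer of the
    physiological branch. Subtract the same channel of adjacent frames and sum all
    channels element-wise, giving a (Bt, T-1, w, h) differential pseudo-modality."""
    diff = feat[:, 1:] - feat[:, :-1]     # adjacent-frame, same-channel subtraction
    return diff.sum(dim=2)                # element-wise sum over all channels
```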
  • The steps of obtaining the fused feature through the feature fusion module based on the behavioral feature norm are: normalize the physiological feature output by the residual physiological feature extraction module; add the normalized physiological feature to the behavioral feature output by the co-occurrence behavioral feature extraction module to obtain the fused feature; normalize the fused feature. The final fused feature is:
  • $\hat{F} = \dfrac{\lambda\,P/\|P\|_2 + B}{\left\|\lambda\,P/\|P\|_2 + B\right\|_2}$
  • where $\|\cdot\|_2$ denotes the two-norm, λ is a hyperparameter, and α is the angle between the physiological feature vector P and the behavioral feature vector B.
  • The proportions of the physiological and behavioral features are adjusted automatically through the feature fusion module based on the behavioral feature norm, wherein:
  • when the angle α between the behavioral and physiological features is less than 120° and the behavioral feature norm is less than λ, the physiological feature outweighs the behavioral feature; when α is greater than 120°, the behavioral feature norm must be less than λ and at the same time greater than −λ(1+2cosα) for the physiological feature to outweigh the behavioral feature, i.e. $-\lambda(1+2\cos\alpha) < \|B\|_2 < \lambda$;
  • when the angle between the behavioral and physiological features is less than 120° and the behavioral feature norm is greater than λ, the behavioral feature outweighs the physiological feature; when the angle is greater than 120°, the behavioral feature norm must be greater than λ and at the same time less than $-\lambda/(1+2\cos\alpha)$ for the behavioral feature to outweigh the physiological feature, i.e. $\lambda < \|B\|_2 < -\lambda/(1+2\cos\alpha)$.
  • the system can automatically adjust the proportion of physiological features and behavioral features according to the size of the behavioral feature length.
  • The module also limits the upper bound of the proportion of either feature, preventing a feature whose norm is too large from dominating in the early stage of training and obliterating the other feature.
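A minimal sketch of this fusion rule (PyTorch; lam corresponds to the hyperparameter λ, set to 1 in the embodiment described later):

```python
import torch
import torch.nn.functional as F

def fuse_by_behavior_norm(phys: torch.Tensor, behav: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Fusion based on the behavioral feature norm. phys, behav: (Bt, fdim).
    The physiological feature is L2-normalized and scaled by lam, the behavioral feature is
    added with its norm left free, and the sum is normalized again, so the behavioral norm
    controls the relative weight of the two features."""
    phys_hat = lam * F.normalize(phys, dim=1)   # physiological term with fixed norm lam
    fused = phys_hat + behav                    # behavioral norm decides its own share
    return F.normalize(fused, dim=1)            # final fused identity feature
```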
  • the present invention also provides a system for implementing the aforementioned method.
  • a video-based random gesture authentication system comprising:
  • a mode selection module is used to select a registration mode or an authentication mode
  • the collection module is used to input the user name and collect the user's random gesture video
  • the feature extraction module is used to input the preprocessed dynamic gesture video to the random gesture feature extractor to extract the feature vector containing the user's physiological characteristics and behavioral features.
  • the random gesture feature extractor is a time difference symbiotic neural network model.
  • the residual physiological feature extraction module takes the random gesture video as input and extracts physiological features; the inter-frame difference module subtracts the same channels of adjacent frames for the input video and for the output features of each layer of the residual physiological feature extraction module, and sums all channels of each difference feature element-wise to obtain differential pseudo-modalities;
  • the co-occurrence behavioral feature extraction module uses gesture video differential pseudo-modality as input for extracting behavioral features;
  • the feature fusion module based on the behavioral feature modulus performs feature fusion of physiological features and behavioral features;
  • the registration and authentication module is used to add, in the registration mode, the input user name and the extracted feature vector of the random gesture to the gesture template database; in the authentication mode, it first retrieves the multiple feature vectors corresponding to the user name from the gesture template database, then computes their cosine distances to the feature vector of the user to be authenticated, and compares the smallest cosine distance with the threshold: if it is below the threshold, authentication passes, otherwise it fails, wherein the threshold is an authentication threshold set manually according to the application scenario.
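The registration and authentication logic of this module can be sketched as follows (an in-memory example of our own; the threshold value is application-dependent, as noted elsewhere in the document):

```python
import numpy as np

class GestureTemplateDB:
    """Minimal in-memory sketch of the registration/authentication module described above;
    a real system would persist the templates."""
    def __init__(self):
        self.templates = {}                                        # user name -> list of feature vectors

    def register(self, user: str, feature: np.ndarray) -> None:
        self.templates.setdefault(user, []).append(feature / np.linalg.norm(feature))

    def authenticate(self, user: str, feature: np.ndarray, threshold: float = 0.3) -> bool:
        if user not in self.templates:
            return False
        f = feature / np.linalg.norm(feature)
        # cosine distance = 1 - cosine similarity; accept if the closest template is below the threshold
        dists = [1.0 - float(np.dot(t, f)) for t in self.templates[user]]
        return min(dists) < threshold
```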
  • the random gesture authentication method disclosed in the present invention can achieve at least the following beneficial effects:
  • Random gestures combine physiological and behavioral characteristics, are rich in information, and make authentication more accurate
  • the present invention also provides a random gesture authentication system based on video, which has the same beneficial effects as the above random gesture authentication method based on video.
  • the system provided by the present invention also has the following advantages:
  • a new time-difference symbiotic neural network model is disclosed.
  • the residual physiological feature extraction module and the symbiotic behavior feature extraction module can extract physiological and behavioral features related to user identity, respectively.
  • Compared with the mainstream three-dimensional convolutional neural networks and two-stream two-dimensional convolutional neural networks, the disclosed network has higher accuracy and faster running speed.
  • a feature fusion strategy is disclosed, which can automatically assign physiological and behavioral feature weights according to the size of the behavioral feature modulus. Compared with the existing feature fusion strategy, it has better performance improvement.
  • FIG. 1 is a schematic diagram of the principles of the video-based random gesture authentication method and system of the present invention.
  • FIG. 2 is a schematic diagram of a video-based random gesture authentication method and a random gesture feature extractor in the system of the present invention.
  • Fig. 3 is a schematic diagram of the video-based random gesture authentication method and the inter-frame difference module in the system of the present invention.
  • FIG. 1 is a schematic diagram of the principle of a video-based random gesture authentication method provided by the present invention, including the following steps:
  • Step 1 Construct a random gesture dataset and train a random gesture feature extractor.
  • the random gesture feature extractor is obtained after training and testing with deep learning technology.
  • it is first necessary to collect high-quality random gesture samples.
  • Gesture sample collection requires N-frame video collection of several random gestures of several users to obtain a random gesture video dataset.
  • 64 frames of video are collected.
  • the frame rate of the video signal is 15fps, that is, there are 15 frames of images per second of video. Understandably, 15fps is just a concrete example, and if disk storage allows, bigger is better. 15fps is a relatively suitable value. If it is too low, the timing information will be insufficient. If it is too high, the storage pressure will be high and there will be a lot of redundant information.
  • The present invention collects random gestures. The random gestures do not need to be memorized; the user only needs to improvise a gesture that meets the requirements in front of the camera, i.e. the gesture should involve all five fingers as fully as possible and show multiple angles of the palm. The corresponding user name needs to be recorded during video capture.
  • the size of the data set is (P, Q, N, C, W, H), where P is the number of collected users, Q is the number of random gestures performed by each user, N is the number of video frames of each random gesture, and C is the channel Number, W is the image width, H is the image height.
  • the random gesture video dataset needs to be divided into training set and test set.
  • the test set should take into account the cross-period problem in biometric identification, that is, as time goes by, the biological characteristics will change to a certain extent, usually reflected in behavioral characteristics.
  • the test set of random gestures needs to collect random gesture samples of multiple people (e.g. 100 people) in a second stage after a preset time interval (e.g. one week). Since in real application scenarios the authentication system needs to be robust to gesture differences of the same user caused by the passage of time, the neural network finally deployed in the authentication system is selected mainly according to the equal error rate of the second-stage random gesture samples, so that the temporal difference symbiotic neural network model performs well in real scenes.
  • Temporal data augmentation intercepts a random T-frame gesture clip from the selected N-frame random gesture video.
  • In this way, N−T+1 different T-frame random gestures can be derived from one N-frame gesture of the same user, achieving a very good data augmentation effect in the time dimension.
  • our method applies the same random rotation and random color dithering (brightness, contrast, and saturation) to all frames of the same gesture video.
  • When N takes the value 64 and T takes the value 20, at a video capture frame rate of 15 fps this is equivalent to a quick gesture performed for about 1.3 s.
  • For random rotation, a random ±15° rotation is applied.
  • Gesture authentication can be regarded as a metric learning task.
  • the model should map the user's random gesture video to a feature space with small intra-class spacing and large inter-class spacing.
  • Compared with the triplet loss and the contrastive loss, AM-Softmax does not require carefully designed sample pairs; compared with SphereFace and L-Softmax, AM-Softmax is simpler and more interpretable.
  • The present invention adopts the AM-Softmax loss function for training the time difference symbiotic neural network model; the AM-Softmax loss function is as follows:
  • $L_{AMS} = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{s\,(W_{y_i}^{T}f_i - m)}}{e^{s\,(W_{y_i}^{T}f_i - m)} + \sum_{j\neq y_i} e^{s\,W_j^{T}f_i}}$
  • where n is the batch size used during training, i denotes the i-th sample in the batch, $W_i$ ($W_i$ including $W_{y_i}$ and $W_j$) and $f_i$ are the normalized weight coefficient and the user identity feature vector (i.e. the output of the feature fusion module based on the behavioral feature norm in Fig. 2), $y_i$ denotes the correct user name of the sample, fdim is the dimension of the feature output by the feature fusion module based on the behavioral feature norm (512 dimensions in one embodiment of the present invention, as shown in Figure 2), j denotes the j-th dimension of the fdim-dimensional feature, T denotes transpose, and s and m are hyperparameters.
  • the test samples collected in the first phase and the second phase are tested sequentially.
  • the random gesture videos were first paired, and the random gesture pairs from the same user were marked as positive samples, and the random gesture pairs from different users were marked as negative samples. Finally, 25,000 pairs of positive and negative sample pairs were randomly selected for testing.
  • During testing, the middle T-frame gesture clip of each video is first intercepted (the middle T frames usually contain rich motion; in one embodiment T takes a value of 20), image standardization is applied, and the clip is fed into the temporal difference symbiotic neural network to obtain the user identity feature fusing physiological and behavioral features; the distances of the 50,000 sample pairs are then computed, and 1000 values uniformly sampled between the minimum and maximum distances are used in turn as thresholds:
  • Threshold = [min, min+step, min+2×step, ..., max], where step is the uniform sampling step size. If the cosine distance of a sample pair is less than the threshold, authentication passes; otherwise it fails.
  • FAR represents the probability of incorrectly authenticating an unregistered user, that is, the ratio of the number of negative sample pairs whose cosine distance is less than the threshold to all negative sample pairs in the test set:
  • $FAR_{thres} = \frac{FP_{thres}}{FP_{thres} + TN_{thres}}$
  • where $FP_{thres}$ is the number of negative samples accepted at threshold thres and $TN_{thres}$ is the number of negative samples rejected.
  • FRR represents the probability of incorrectly rejecting a registered user, that is, the ratio of the number of positive sample pairs whose cosine distance is greater than the threshold to all positive sample pairs in the test set:
  • $FRR_{thres} = \frac{FN_{thres}}{FN_{thres} + TP_{thres}}$
  • where $FN_{thres}$ is the number of positive samples rejected and $TP_{thres}$ is the number of positive samples accepted.
  • By traversing different thresholds, the false acceptance rate FAR and the false rejection rate FRR at each threshold can be obtained. When the threshold increases, the false acceptance rate FAR increases and the false rejection rate FRR decreases.
  • The EER is the error rate at the point where the false rejection rate FRR equals the false acceptance rate FAR (i.e. EER = FRR = FAR at that point); it is used to evaluate the matching accuracy of different parameters because at this threshold the false rejection rate FRR and the false acceptance rate FAR are treated equally.
  • Algorithms with lower error rate EER can show better performance in authentication tasks.
  • the temporal difference co-occurrence neural network model with the lowest error rate EER is selected as the random gesture feature extractor.
  • Step 2 Select Registration Mode or Authentication Mode.
  • the random gesture feature extractor can be deployed in the system to extract the user's identity features during registration and authentication.
  • Step 3 Enter the user name and collect the user's random gesture video.
  • Random gestures do not need to be memorized. You only need to improvise a gesture that meets the requirements in front of the camera. The gesture should fully mobilize the five fingers and show multiple angles of the palm.
  • the frame rate of the video signal is 15 fps, that is, there are 15 frames of images in the video per second.
  • Step 4 Preprocess random gesture videos.
  • the random gesture feature extractor needs to be initialized with a model pre-trained on the ImageNet image dataset, so during image standardization the means [0.485, 0.456, 0.406] are subtracted from the three channels of all video frames and the result is divided by the standard deviations [0.229, 0.224, 0.225] (the means and standard deviations are statistics of the ImageNet dataset).
  • the size of the final intercepted video is (T, C, W, H), where T is the number of frames, C is the number of channels, W is the image width, and H is the image height.
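For reference, the standardization step with the quoted ImageNet statistics looks like this (a sketch; the frames are assumed to have been scaled to [0, 1] first):

```python
import torch

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def standardize_clip(clip: torch.Tensor) -> torch.Tensor:
    """clip: (T, 3, W, H) frames scaled to [0, 1]. Subtract the ImageNet channel means and
    divide by the ImageNet standard deviations, as described above."""
    return (clip - IMAGENET_MEAN) / IMAGENET_STD
```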
  • Step 5 Input the preprocessed dynamic gesture video to the random gesture feature extractor obtained after training and testing, and extract the feature vector including the user's physiological and behavioral features.
  • Random gestures have both physiological and behavioral features.
  • the random gesture feature extractor needs to be able to extract the above two kinds of features at the same time and fuse them, making full use of the complementary advantages of physiological and behavioral features in identity information to improve authentication accuracy and system security.
  • the random gesture feature extractor is obtained after training and testing through a temporal difference co-occurrence neural network model.
  • the fast and accurate temporal difference symbiotic neural network model provided by this embodiment includes a residual physiological feature extraction module, a symbiotic behavior feature extraction module, an inter-frame difference module, and a feature fusion module based on the behavioral feature norm.
  • the residual physiological feature extraction module includes an input layer and a standard 18-layer residual network to extract the physiological features of each frame of gesture images, while providing differential pseudo-modal input for the co-occurrence behavior feature extraction module.
  • the input is the original gesture video (Bt, T, 3, 224, 224), that is, the gesture video with the batch size Bt of T frames and three channels with a size of 224 ⁇ 224.
  • the input needs to be converted to (Bt ⁇ T, 3, 224, 224), that is, the video frames are processed separately, and information interaction between frames is not involved.
  • the shape of the physiological feature is (Bt ⁇ T, fdim), and the physiological feature needs to be converted into (Bt, T, fdim) in the final output.
  • the co-occurrence behavior feature extraction module includes five input layers, five two-dimensional convolutional layers, one two-dimensional pooling layer, one global average pooling layer and one fully connected layer. After all convolutional layers, the BN layer is used for batch normalization, and the activation function uses ReLU.
  • the input is the difference pseudo-modality obtained after the feature map obtained by convolution of the original gesture video frame and the residual physiological feature extraction module Conv1, Layer1, Layer2, and Layer3 is processed by the inter-frame difference module.
  • In the co-occurrence behavior feature extraction module, Conv1 can directly convolve its differential pseudo-modality, while Conv2, Conv3, Conv4 and Conv5 first concatenate the feature map obtained by the previous convolution with the differential pseudo-modality from the inter-frame difference module along the channel dimension before convolving. Finally, through global average pooling and a fully connected operation, the behavioral features are expressed as fdim-dimensional feature vectors.
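A sketch of this co-occurrence behavioral branch is given below. Only the structure stated in the text is taken from the patent (five convolutions with BN and ReLU, channel-wise concatenation of each previous output with the next differential pseudo-modality, global average pooling and a fully connected layer); the channel widths and strides chosen to make the spatial sizes line up are our assumptions:

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin: int, cout: int, stride: int) -> nn.Sequential:
    # every convolution is followed by batch normalization (BN) and ReLU, as described above
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class CoOccurrenceBehaviorExtractor(nn.Module):
    """Each differential pseudo-modality has T-1 channels; pm_img comes from the raw frames and
    pm_conv1/pm_l1/pm_l2/pm_l3 from the Conv1/Layer1/Layer2/Layer3 feature maps. Channel widths
    and strides are assumptions chosen so the spatial sizes line up (224/56/56/28/14)."""
    def __init__(self, T: int = 20, fdim: int = 512, width: int = 64):
        super().__init__()
        pm = T - 1
        self.conv1 = conv_bn_relu(pm, width, stride=2)           # raw-frame pseudo-modality, 224 -> 112
        self.pool = nn.MaxPool2d(2)                              # 112 -> 56
        self.conv2 = conv_bn_relu(width + pm, width, stride=1)   # concat Conv1 pseudo-modality (56)
        self.conv3 = conv_bn_relu(width + pm, width, stride=2)   # concat Layer1 pseudo-modality, 56 -> 28
        self.conv4 = conv_bn_relu(width + pm, width, stride=2)   # concat Layer2 pseudo-modality, 28 -> 14
        self.conv5 = conv_bn_relu(width + pm, width, stride=1)   # concat Layer3 pseudo-modality (14)
        self.head = nn.Linear(width, fdim)

    def forward(self, pm_img, pm_conv1, pm_l1, pm_l2, pm_l3):
        x = self.pool(self.conv1(pm_img))
        x = self.conv2(torch.cat([x, pm_conv1], dim=1))
        x = self.conv3(torch.cat([x, pm_l1], dim=1))
        x = self.conv4(torch.cat([x, pm_l2], dim=1))
        x = self.conv5(torch.cat([x, pm_l3], dim=1))
        x = x.mean(dim=(2, 3))                                   # global average pooling
        return self.head(x)                                      # fdim-dimensional behavioral feature
```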
  • the inter-frame difference module is a bridge between the residual physiological feature extraction module and the co-occurrence behavior feature extraction module, and its input is from the residual physiological feature extraction module.
  • the shape is (Bt ⁇ T, ch, w, h), which needs to be converted to (Bt,T,ch,w,h), where ch is the number of channels, w and h are the width and height of the original image or feature map, respectively.
  • the number of input image channels is 3
  • the width and height are (224, 224)
  • the numbers of feature map channels obtained after Conv1, Layer1, Layer2 and Layer3 of the residual physiological feature extraction module are 64, 64, 128 and 256, and the corresponding feature map widths and heights are (56, 56), (56, 56), (28, 28) and (14, 14), respectively.
  • the inter-frame difference module uses the above-mentioned layers of convolution features (including the input image) to subtract the same channels of adjacent frames, and then sums all the channels of each difference feature element-wise.
  • the formula is:
  • $IS_{fn}(x,y,t) = \sum_{chn=1}^{ch}\left[F^{chn}_{fn}(x,y,t+1) - F^{chn}_{fn}(x,y,t)\right]$
  • where $IS_{fn}(x,y,t)$ is the differential pseudo-modality, chn denotes the chn-th channel, fn indicates that the feature comes from the fn-th layer of the residual physiological feature extraction module, t denotes the t-th frame, ch is the total number of channels of the current feature map, x and y are the abscissa and ordinate of the feature map or image, and $F^{chn}_{fn}(x,y,t)$ denotes the chn-th channel feature map of the t-th frame image in the fn-th layer features of the residual physiological feature extraction module.
  • Through the inter-frame difference module, the feature maps with different channel numbers output by the different convolutional layers of the residual physiological feature extraction module can be uniformly expressed as differential pseudo-modalities with T−1 channels, which represent the user's behavior information well while greatly reducing the amount of computation.
  • the output feature pseudo-mode shape of the final inter-frame difference module is (Bt,T-1,w,h).
  • Feature fusion through the feature fusion module based on the behavioral feature norm includes: averaging the physiological features output by the residual physiological feature extraction module over the video-frame dimension to output a physiological feature of size (Bt, fdim), and then normalizing it: $\hat{P} = P/\|P\|_2$. The normalized physiological feature is then added to the behavioral feature output by the co-occurrence behavior feature extraction module to obtain the fused feature: $F = \lambda\hat{P} + B$.
  • Here λ is a hyperparameter (λ = 1 in one embodiment of the present invention), and $p_n$ and $b_n$ represent the value of the n-th dimension of the physiological and behavioral feature vectors, respectively.
  • Finally, the fused feature is normalized: $\hat{F} = F/\|F\|_2$.
  • the angle ⁇ between physiological characteristics and behavioral characteristics determines the upper limit of the contribution value, and the smaller the angle, the larger the upper limit.
  • When $\mu_p > 1$, the physiological feature has the larger proportion. In this case, when the angle α between the behavioral and physiological features is less than 120° and the behavioral feature norm is less than λ, the physiological feature is dominant; when α is greater than 120°, the behavioral feature norm must be less than λ and at the same time greater than −λ(1+2cosα) for the physiological feature to be dominant.
  • When $\mu_b > 1$, the behavioral feature has the larger proportion. In this case, when the angle between the behavioral and physiological features is less than 120° and the behavioral feature norm is greater than λ, the behavioral feature is dominant; when the angle is greater than 120°, the behavioral feature norm must be greater than λ and at the same time less than $-\lambda/(1+2\cos\alpha)$ for the behavioral feature to be dominant.
  • the system can automatically adjust the proportion of physiological features and behavioral features according to the size of the behavioral feature length.
  • this module also limits the upper limit of the proportion of the two features to prevent the size of a certain feature from being too large in the early stage of training and occupying a dominant position, causing another feature to be obliterated.
  • Step 6: in the registration mode, add the input user name and the extracted random gesture feature vector to the gesture template database; in the authentication mode, first retrieve the multiple feature vectors corresponding to the user name from the gesture template database, then compute their cosine distances to the feature vector of the user to be authenticated, and compare the smallest cosine distance with the threshold; if it is below the threshold, authentication passes, otherwise it fails;
  • the threshold refers to the authentication threshold manually set according to the application scenario; in one embodiment of the present invention, the value range of the threshold is [0, 1].
  • The threshold can be chosen dynamically to balance and meet the needs of the actual application. For example, in settings with high security requirements, such as banks or customs, successful attacks by impostors must be avoided as far as possible, so the threshold should be lowered (e.g. to 0.2) to make the false acceptance rate FAR lower. Conversely, in settings with relatively low security requirements, such as access control in public office areas or control of home appliances, the threshold should be raised (e.g. to 0.3) so that registered users are correctly recognized as often as possible and the FRR is reduced. How far the threshold is lowered or raised is determined by the user according to requirements.
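As a concrete (hypothetical) way to expose this choice in a deployment, the operating threshold can simply be a per-scenario configuration value; the 0.2 and 0.3 figures follow the examples above, while the scenario names and the default are our own:

```python
# Illustrative configuration only: the operating threshold is an application-level choice.
SCENARIO_THRESHOLDS = {
    "bank_or_customs": 0.2,   # high security: lower threshold, lower FAR
    "office_access": 0.3,     # convenience-oriented: higher threshold, lower FRR
}

def pick_threshold(scenario: str, default: float = 0.25) -> float:
    return SCENARIO_THRESHOLDS.get(scenario, default)
```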
  • a system for implementing the aforementioned method is also provided. That is, a video-based random gesture authentication system, including the following modules:
  • a mode selection module is used to select a registration mode or an authentication mode
  • the collection module is used to input the user name and collect the user's random gesture video
  • the feature extraction module is used to input the preprocessed dynamic gesture video to the random gesture feature extractor to extract the feature vector containing the user's physiological characteristics and behavioral features.
  • the random gesture feature extractor is a time difference symbiotic neural network model.
  • the residual physiological feature extraction module takes the random gesture video as input and extracts physiological features; the inter-frame difference module subtracts the same channels of adjacent frames for the input video and for the output features of each layer of the residual physiological feature extraction module, and sums all channels of each difference feature element-wise to obtain differential pseudo-modalities;
  • the co-occurrence behavioral feature extraction module takes the gesture video differential pseudo-modalities as input and extracts behavioral features;
  • the feature fusion module based on the behavioral feature norm fuses the physiological and behavioral features;
  • the registration and authentication module is used to add, in the registration mode, the input user name and the extracted feature vector of the random gesture to the gesture template database; in the authentication mode, it first retrieves the multiple feature vectors corresponding to the user name from the gesture template database, then computes their cosine distances to the feature vector of the user to be authenticated, and compares the smallest cosine distance with the threshold: if it is below the threshold, authentication passes, otherwise it fails, wherein the threshold is an authentication threshold set manually according to the application scenario.
  • The present invention evaluates the equal error rate of random gesture authentication with the temporal difference symbiotic neural network model on the dynamic gesture authentication dataset, and compares it with current mainstream video understanding networks (TSN, TSM), the two-stream convolutional neural network, the three-dimensional convolutional neural network, and an image classification network (ResNet18).
  • Using the temporal difference symbiotic neural network model for authentication, this method achieves an equal error rate of 2.580% on the first-stage test set and 6.485% on the second-stage test set, i.e. it misidentifies only 2.580% and 6.485% of registered/unregistered users (equivalent to recognition accuracies of 97.420% and 93.515%). The equal error rate is far lower than that of other existing methods, which proves the effectiveness of random gestures.
  • The temporal difference symbiotic neural network has the lowest equal error rate on the test sets of both stage 1 and stage 2, which proves that it has stronger authentication performance. This experiment is intended only to prove the effectiveness of random gesture authentication and the superiority of the temporal difference symbiotic neural network.
  • each embodiment in this specification is described in a progressive manner, each embodiment focuses on the difference from other embodiments, and the same and similar parts of each embodiment can be referred to each other.
  • the video-based random gesture authentication system disclosed in the embodiment since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and for the relevant details, please refer to the description of the method part.
  • The present invention uses video-based fast random gestures for authentication, which can complete the user's identity authentication by performing an improvised random gesture without any memorization.
  • It does not touch sensitive user privacy information and can achieve safer, more efficient and friendlier identity authentication.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Computer Hardware Design (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Collating Specific Patterns (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The video-based random gesture authentication method disclosed in the present invention comprises: selecting a registration mode or an authentication mode; collecting a video of the user's random gesture; preprocessing the random gesture video; inputting the preprocessed dynamic gesture video into a random gesture feature extractor to extract a feature vector containing the user's physiological and behavioral features; in the registration mode, adding the input user name and the extracted feature vector of the random gesture to a gesture template database; in the authentication mode, first retrieving the multiple feature vectors corresponding to the user name from the gesture template database, then computing their cosine distances to the feature vector of the user to be authenticated and comparing the smallest cosine distance with a threshold: if it is below the threshold, authentication passes, otherwise it fails. The present invention uses random gestures, which combine physiological and behavioral features, making authentication safer, more efficient and friendlier. The present invention also provides a corresponding system.

Description

Video-based random gesture authentication method and system

Technical field

The present invention belongs to the field of biometric recognition and video understanding, and more specifically relates to a video-based, in-the-air random gesture authentication method and system.

Background art

Biometric authentication is a typical and complex pattern recognition problem that has long been at the forefront of artificial intelligence. It refers to the science and technology of identifying a person by acquiring and analyzing the physiological and behavioral characteristics of the human body. Common biometric modalities include fingerprints, irises, faces, palm prints, hand shapes, veins, handwriting, gait and voiceprints. After years of development, biometric authentication has penetrated every aspect of production and daily life: from unlocking electronic devices, supermarket checkout and community access control to high-speed rail boarding and airport security checks, biometric traits have become important digital identity credentials in the era of the Internet of Everything.

Biometric authentication concerns the privacy and property security of the public and involves many moral and ethical issues, so the public urgently needs safer, friendlier and more efficient biometric authentication technology. However, existing biometric technologies are not perfect, and different modalities have their own advantages and disadvantages. The face is the most prominent biometric modality because the information it carries is highly discriminative; however, it touches the public's sensitive identity information and infringes on user privacy to a certain extent, and without effective supervision and legal constraints, face recognition is difficult to popularize on a large scale. Fingerprint technology is relatively mature after more than 50 years of development, but the authentication process requires touch sensors, is easily affected by grease and water stains, and also increases the possibility of cross-infection with bacteria and viruses. Iris authentication can be contactless, but the images are difficult to acquire, a high degree of user cooperation is needed, and the user experience is poor. The above modalities also jointly face the serious problem of spoofing attacks; although liveness detection can be performed, hidden risks remain and the templates are irreplaceable. Vein-based authentication has good anti-counterfeiting capability, but the amount of information carried by veins is relatively small and hard to mine, and it is strongly affected by the acquisition equipment, individual differences and temperature. Unlike the physiological traits above (face, fingerprint, iris and vein), gait recognition, signature recognition and voiceprint recognition are mainly based on behavioral characteristics. The behavioral features involved in gait and signature recognition are relatively simple and lack feature-rich physiological information, so their recognition performance is relatively poor. The voiceprint is a behavioral characteristic with physiological properties: on the one hand, the voice reflects differences in the speaker's innate vocal organs; on the other hand, it contains the speaker's acquired, distinctive pronunciation and speech habits. However, the user must speak during authentication, so the user experience is poor and the application scenarios are limited.

At present there are two video-based gesture authentication modes and two video-based gesture authentication systems. The two modes are gesture authentication based on system-defined gesture types and gesture authentication based on user-defined gesture types. In the first, the user must use the gestures specified by the system for registration and authentication, and the registration gesture and the authentication gesture must be identical; this requires the user to memorize the gesture type, unfamiliarity with the gesture easily leads to unnatural execution, and forgetting it degrades the authentication performance. In the second, users design their own gestures for registration and authentication, but the registration and authentication gestures must still be identical; this relieves some of the memorization burden and lets users choose familiar gestures, but authentication still degrades when the gesture is forgotten, and a user-defined gesture type is easy to steal, increasing the risk of intrusion. In addition, both modes require a fairly long gesture video (about 4 s), so user friendliness is poor. The two video-based gesture authentication systems are a system based on a two-stream convolutional neural network and a system based on a three-dimensional convolutional neural network. The two-stream system uses optical flow to represent behavioral features, which doubles the number of parameters and the amount of computation, and computing optical flow is itself inefficient. The three-dimensional-convolution system models spatiotemporal features directly through 3D convolution and extracts behavioral and physiological features at the same time, but the parameters and computation of 3D convolution are also very large. Neither system can meet the real-time requirements of actual authentication products. It can be seen that current video-based gesture authentication methods still have many deficiencies in authentication mode and system design and cannot meet practical needs.
Summary of the invention

The purpose of the present invention is to overcome the deficiencies of existing biometric recognition and gesture authentication technologies and to provide a video-based random gesture authentication method and system that require no memorization of gestures and make authentication more efficient and secure.

To achieve the above purpose, the video-based random gesture authentication method provided by the present invention comprises the following steps:

selecting a registration mode or an authentication mode;

inputting a user name and collecting a video of the user's random gesture;

preprocessing the random gesture video;

inputting the preprocessed dynamic gesture video into a random gesture feature extractor to extract a feature vector containing the user's physiological and behavioral features, the random gesture feature extractor being obtained after training and testing a temporal difference co-occurrence neural network model; the temporal difference co-occurrence neural network model comprises a residual physiological feature extraction module, a co-occurrence behavioral feature extraction module, a feature fusion module based on the behavioral feature norm and an inter-frame difference module; the residual physiological feature extraction module takes the random gesture video as input and extracts physiological features; the inter-frame difference module subtracts the same channels of adjacent frames for the input video and for the output features of each layer of the residual physiological feature extraction module, and sums all channels of each difference feature element-wise to obtain differential pseudo-modalities; the co-occurrence behavioral feature extraction module takes the gesture video differential pseudo-modalities as input and extracts behavioral features; the feature fusion module based on the behavioral feature norm fuses the physiological and behavioral features so as to make full use of their complementary advantages in identity information and improve authentication accuracy and system security;

in the registration mode, adding the input user name and the extracted feature vector of the random gesture to a gesture template database; in the authentication mode, first retrieving the multiple feature vectors corresponding to the user name from the gesture template database, then computing their cosine distances to the feature vector of the user to be authenticated and comparing the smallest cosine distance with a threshold: if it is below the threshold, authentication passes, otherwise it fails, wherein the threshold is an authentication threshold set manually according to the application scenario.
Preferably, collecting the user's random gesture video only requires the user to improvise, in front of the camera, a gesture that meets the requirements; the random gesture needs no memorization, and during data collection the gesture should involve all five fingers as fully as possible and show the palm from multiple angles.

Preferably, a T-frame gesture clip is intercepted from the dynamic gesture video, followed by frame-by-frame center cropping, image resizing and image standardization; the size of the finally intercepted video is (T, C, W, H), where T is the number of frames, C the number of channels, W the image width and H the image height.

Preferably, the random gesture feature extractor obtained after training and testing the temporal difference co-occurrence neural network model is obtained as follows:

collecting N-frame videos of several random gestures from several users and recording the corresponding user names to form a random gesture video dataset;

processing the random gesture video dataset by cropping the gesture action region from the frames and resizing the images, the final dataset size being (P, Q, N, C, W, H), where P is the number of collected users, Q the number of random gestures performed by each user, and N the number of frames of each random gesture video;

dividing the dataset into training samples and test samples for training and testing the temporal difference co-occurrence neural network model. The test set must take into account the cross-session problem in biometric recognition, i.e. biometric traits change to a certain extent over time, usually in the behavioral features. Therefore, the test set of random gestures must include random gestures of many people (for example 100) collected one week later as the second-stage test set. The neural network finally deployed in the authentication system is selected mainly according to the equal error rate on the second-stage samples, so that the model performs well in real scenarios.

In the training phase, a random T-frame gesture clip is intercepted from each random gesture video, and random rotation, random color jitter and image standardization are applied; the random gesture video after this online processing is propagated forward through the temporal difference co-occurrence neural network model to obtain the fused feature, which is then fed into the loss function, and the model is optimized by backpropagation;

In the testing phase, the middle T-frame gesture clip is intercepted from the random gesture video and image standardization is applied; the clip is then fed into the temporal difference co-occurrence neural network to obtain the fused feature for distance computation.
Gesture authentication can be regarded as a metric learning task: after training, the model should map the user's random gesture videos into a feature space with small intra-class distances and large inter-class distances. Considering that, compared with the triplet loss and contrastive loss, AM-Softmax does not require carefully designed sample pairs, and compared with SphereFace and L-Softmax it is simpler and more interpretable, this system uses the AM-Softmax loss function for model training:

$L_{AMS} = -\frac{1}{Bt}\sum_{i=1}^{Bt}\log\frac{e^{s\,(W_{y_i}^{T}f_i - m)}}{e^{s\,(W_{y_i}^{T}f_i - m)} + \sum_{j\neq y_i} e^{s\,W_j^{T}f_i}}$

where $W_i$ ($W_i$ including $W_{y_i}$ and $W_j$) and $f_i$ are the normalized weight coefficients and the user identity feature vector respectively, $L_{AMS}$ is the loss function, Bt is the batch size used in training, i denotes the i-th sample in the batch, $y_i$ denotes the correct user name of that sample, fdim is the dimension of the feature output by the feature fusion module based on the behavioral feature norm (512 in this system, as shown in Fig. 2), and j denotes the j-th dimension of the fdim-dimensional feature. s and m are hyperparameters; in one embodiment of the present invention, s = 30 and m = 0.5.

In the testing phase, the samples of the first-stage test set and the samples of the second-stage test set are tested in turn. Before testing, the random gesture videos are first paired: random gesture pairs from the same user are marked as positive samples and pairs from different users as negative samples, and finally 25,000 positive pairs and 25,000 negative pairs are randomly selected for testing. During testing, a T-frame gesture clip containing rich motion is first intercepted and image standardization is applied; the clip is then fed into the temporal difference co-occurrence neural network model to obtain the user identity feature fusing physiological and behavioral features, and the distances of the 50,000 sample pairs are computed. The maximum and minimum of the 50,000 pair distances are then computed, and 1000 values uniformly sampled between them are used in turn as thresholds, i.e. Threshold = [min, min+step, min+2·step, ..., max], where step is the uniform sampling step. If the cosine distance of a sample pair is less than the threshold, authentication passes; otherwise it fails.
The false acceptance rate FAR, the false rejection rate FRR and the equal error rate EER are computed. FAR denotes the probability that the system mistakenly authenticates an unregistered user, i.e. the ratio of the number of negative sample pairs whose cosine distance is below the threshold to all negative sample pairs in the test set:

$FAR_{thres} = \frac{FP_{thres}}{FP_{thres} + TN_{thres}}$

where $FP_{thres}$ is the number of negative samples accepted by the system at threshold thres and $TN_{thres}$ is the number of negative samples rejected by the system. FRR denotes the probability that the system mistakenly rejects a registered user, i.e. the ratio of the number of positive sample pairs whose cosine distance is above the threshold to all positive sample pairs in the test set:

$FRR_{thres} = \frac{FN_{thres}}{FN_{thres} + TP_{thres}}$

where $FN_{thres}$ is the number of positive samples rejected by the system and $TP_{thres}$ is the number of positive samples accepted by the system.

The smaller the FRR, the better the usability of the algorithm, i.e. a user is less likely to be rejected when accessing their own account; the smaller the FAR, the stronger the security of the algorithm, i.e. it is harder for a user to impersonate and attack someone else's account. Usually there is a trade-off between FAR and FRR: by traversing different thresholds, the FAR and FRR at each threshold can be obtained; as the threshold increases, FAR rises and FRR falls. The EER is the error rate at which FRR equals FAR, and it is used to evaluate the matching accuracy of different parameters because FRR and FAR are treated equally at that point. An algorithm with a lower EER performs better in the authentication task, so the model with the lowest EER is finally selected as the feature extractor.
Preferably, the T frames of random gesture images are treated as an image batch of size T for forward propagation through an 18-layer convolutional neural network; through global average pooling and a fully connected operation, the physiological features are represented as a T×fdim-dimensional feature vector; the T×fdim-dimensional feature vector is averaged over the time dimension to obtain an fdim-dimensional physiological feature vector.

Preferably, the steps of obtaining the behavioral features through the co-occurrence behavioral feature extraction module are: input the random gesture video and obtain the random gesture video differential pseudo-modalities through the inter-frame difference module; feed the differential pseudo-modalities into the co-occurrence behavioral feature extraction module; after each convolution operation, concatenate the output of the previous layer with the differential pseudo-modality representing the corresponding residual physiological features along the channel dimension; through global average pooling and a fully connected operation, represent the behavioral features as an fdim-dimensional feature vector.

Preferably, the differential pseudo-modality obtained through the inter-frame difference module is

$IS_{fn}(x,y,t) = \sum_{chn=1}^{ch}\left[F^{chn}_{fn}(x,y,t+1) - F^{chn}_{fn}(x,y,t)\right]$

where $IS_{fn}(x,y,t)$ is the differential pseudo-modality, chn, fn and t denote the chn-th channel, the fn-th layer of features of the residual physiological feature extraction module and the t-th frame respectively, ch is the total number of channels of the current feature map, and x, y are the abscissa and ordinate of the feature map or image, respectively.
Preferably, the steps of obtaining the fused feature through the feature fusion module based on the behavioral feature norm are: normalize the physiological feature output by the residual physiological feature extraction module; add the normalized physiological feature to the behavioral feature output by the co-occurrence behavioral feature extraction module to obtain the fused feature; normalize the fused feature. The final fused feature is

$\hat{F} = \dfrac{\lambda\,P/\|P\|_2 + B}{\left\|\lambda\,P/\|P\|_2 + B\right\|_2}$

where $\hat{F}$ is the normalized fused feature containing both physiological and behavioral features, the physiological feature is $P = (p_1, p_2, \dots, p_n)^T$, the behavioral feature is $B = (b_1, b_2, \dots, b_n)^T$, $\|\cdot\|_2$ denotes the two-norm, λ is a hyperparameter, and α is the angle between the physiological feature vector P and the behavioral feature vector B.
Preferably, the proportions of the physiological and behavioral features are adjusted automatically through the feature fusion module based on the behavioral feature norm, wherein:

when the angle α between the behavioral and physiological features is less than 120° and the behavioral feature norm is less than λ, the physiological feature outweighs the behavioral feature; when α is greater than 120°, the behavioral feature norm must be smaller than λ and at the same time greater than −λ(1+2cosα) for the physiological feature to outweigh the behavioral feature, i.e. $-\lambda(1+2\cos\alpha) < \|B\|_2 < \lambda$;

when the angle between the behavioral and physiological features is less than 120° and the behavioral feature norm is greater than λ, the behavioral feature outweighs the physiological feature; when the angle is greater than 120°, the behavioral feature norm must be greater than λ and at the same time smaller than $-\lambda/(1+2\cos\alpha)$ for the behavioral feature to outweigh the physiological feature, i.e. $\lambda < \|B\|_2 < -\lambda/(1+2\cos\alpha)$.

Through the feature fusion module based on the behavioral feature norm, the system can automatically adjust the proportions of the physiological and behavioral features according to the size of the behavioral feature norm. At the same time, the module limits the upper bound of the proportion of either feature, preventing one feature whose norm is too large from dominating in the early stage of training and obliterating the other.
The present invention also provides a system for implementing the aforementioned method.

A video-based random gesture authentication system, comprising:

a mode selection module for selecting a registration mode or an authentication mode;

a collection module for inputting the user name and collecting the user's random gesture video;

a data processing module for preprocessing the random gesture video;

a feature extraction module for inputting the preprocessed dynamic gesture video into the random gesture feature extractor and extracting the feature vector containing the user's physiological and behavioral features, the random gesture feature extractor being obtained after training and testing a temporal difference co-occurrence neural network model; the temporal difference co-occurrence neural network model comprises a residual physiological feature extraction module, a co-occurrence behavioral feature extraction module, a feature fusion module based on the behavioral feature norm and an inter-frame difference module; the residual physiological feature extraction module takes the random gesture video as input and extracts physiological features; the inter-frame difference module subtracts the same channels of adjacent frames for the input video and for the output features of each layer of the residual physiological feature extraction module, and sums all channels of each difference feature element-wise to obtain differential pseudo-modalities; the co-occurrence behavioral feature extraction module takes the gesture video differential pseudo-modalities as input and extracts behavioral features; the feature fusion module based on the behavioral feature norm fuses the physiological and behavioral features;

a registration and authentication module for, in the registration mode, adding the input user name and the extracted feature vector of the random gesture to a gesture template database, and, in the authentication mode, first retrieving the multiple feature vectors corresponding to the user name from the gesture template database, then computing their cosine distances to the feature vector of the user to be authenticated and comparing the smallest cosine distance with a threshold: if it is below the threshold, authentication passes, otherwise it fails, wherein the threshold is an authentication threshold set manually according to the application scenario.
Compared with other biometric modalities and existing gesture authentication methods, the random gesture authentication method disclosed in the present invention can achieve at least the following beneficial effects:

(1) Random gestures combine physiological and behavioral features, are rich in information, and make authentication more accurate;

(2) Random gestures are extremely difficult to imitate, so security is higher;

(3) Random gestures are executed easily and naturally, so the quality of the collected data is higher;

(4) Random gestures require no memorization and are executed quickly (<1.3 s), giving a good user experience and high authentication efficiency;

(5) The gesture is performed in the air, so collection is convenient, clean and hygienic, and unaffected by stains;

(6) Sensitive identity information is decoupled, and user privacy is not touched.

The present invention also provides a video-based random gesture authentication system, which has the same beneficial effects as the above video-based random gesture authentication method. In addition, compared with existing gesture authentication systems, the system provided by the present invention also has the following advantages:

(1) A new temporal difference co-occurrence neural network model is disclosed, in which the residual physiological feature extraction module and the co-occurrence behavioral feature extraction module can respectively extract identity-related physiological and behavioral features. Compared with mainstream three-dimensional convolutional neural networks and two-stream two-dimensional convolutional neural networks, the disclosed network has higher accuracy and faster running speed.

(2) A feature fusion strategy is disclosed that can automatically assign the weights of physiological and behavioral features according to the size of the behavioral feature norm; compared with existing feature fusion strategies, it brings a better performance improvement.
Brief description of the drawings

Fig. 1 is a schematic diagram of the principle of the video-based random gesture authentication method and system of the present invention.

Fig. 2 is a schematic diagram of the random gesture feature extractor in the video-based random gesture authentication method and system of the present invention.

Fig. 3 is a schematic diagram of the inter-frame difference module in the video-based random gesture authentication method and system of the present invention.

Detailed description of the embodiments

To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, Fig. 1 is a schematic diagram of the principle of the video-based random gesture authentication method provided by the present invention, which includes the following steps:

Step 1: construct a random gesture dataset and train the random gesture feature extractor.

In this step, the random gesture feature extractor is obtained after training and testing with deep learning techniques. To obtain a high-performance random gesture feature extractor, high-quality random gesture samples must first be collected.

Gesture sample collection requires N-frame video collection of several random gestures from several users to obtain a random gesture video dataset. In one embodiment of the present invention, 64-frame videos are collected, and the frame rate of the video signal is set; in one embodiment, the frame rate is 15 fps, i.e. 15 frames of images per second of video. It should be understood that 15 fps is only a concrete example; if disk storage allows, a higher frame rate is better. 15 fps is a fairly suitable value: if it is too low, the temporal information is insufficient; if it is too high, the storage pressure is high and there is a lot of redundant information. The present invention collects random gestures; the random gestures do not need to be memorized, and the user only needs to improvise a gesture that meets the requirements in front of the camera, i.e. the gesture should involve all five fingers as fully as possible and show the palm from multiple angles. The corresponding user name must be recorded during video capture.

After collection, the random gesture video dataset must be preliminarily processed by cropping the gesture action region from the frames and resizing the images so that they meet the preset image-size requirements of the random gesture feature extractor. The dataset size is (P, Q, N, C, W, H), where P is the number of collected users, Q the number of random gestures performed by each user, N the number of frames of each random gesture video, C the number of channels, W the image width and H the image height.

Before formal training, the random gesture video dataset must be divided into a training set and a test set. The test set must take into account the cross-session problem in biometric recognition, i.e. biometric traits change to a certain extent over time, usually in the behavioral features. In one embodiment of the present invention, the test set of random gestures collects second-stage random gesture samples of many people (e.g. 100) after a preset interval (e.g. one week). Since in real application scenarios the authentication system must be robust to gesture differences of the same user caused by the passage of time, the neural network finally deployed in the authentication system is selected mainly according to the equal error rate of the second-stage random gesture samples, so that the temporal difference co-occurrence neural network model performs well in real scenes.

In the training phase, a user's random gesture is randomly selected and online data augmentation is applied, including temporal augmentation and spatial augmentation. Temporal augmentation intercepts a random T-frame gesture clip from the selected N-frame random gesture video; with this method, one N-frame gesture of the same user can yield N−T+1 different T-frame random gestures, achieving a very good augmentation effect in the time dimension. For spatial augmentation, the method applies the same random rotation and random color jitter (brightness, contrast and saturation) to all frames of the same gesture video. In one embodiment of the present invention, considering the real-time requirements of the system, N takes the value 64 and T takes the value 20; at a 15 fps capture frame rate this is equivalent to a quick gesture performed for about 1.3 s. For random rotation, a random ±15° rotation is applied.
Gesture authentication can be regarded as a metric learning task: after training, the model should map the user's random gesture videos into a feature space with small intra-class distances and large inter-class distances. Considering that, compared with the triplet loss and contrastive loss, AM-Softmax does not require carefully designed sample pairs, and compared with SphereFace and L-Softmax it is simpler and more interpretable, the present invention adopts the AM-Softmax loss function for training the temporal difference co-occurrence neural network model. The AM-Softmax loss function is as follows:

$L_{AMS} = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{s\,(W_{y_i}^{T}f_i - m)}}{e^{s\,(W_{y_i}^{T}f_i - m)} + \sum_{j\neq y_i} e^{s\,W_j^{T}f_i}}$

where $L_{AMS}$ is the loss function, n is the batch size used during training, i denotes the i-th sample in the batch, $W_i$ ($W_i$ including $W_{y_i}$ and $W_j$) and $f_i$ are the normalized weight coefficient and the user identity feature vector (i.e. the output of the feature fusion module based on the behavioral feature norm in Fig. 2), $y_i$ denotes the correct user name of the sample, fdim is the dimension of the feature output by the feature fusion module based on the behavioral feature norm (512 dimensions in one embodiment of the present invention, as shown in Fig. 2), j denotes the j-th dimension of the fdim-dimensional feature, T denotes transpose, and s and m are hyperparameters; in one embodiment of the present invention, s = 30 and m = 0.5.
In the testing phase, the test samples collected in the first and second stages are tested in turn. Before testing, the random gesture videos are first paired: random gesture pairs from the same user are marked as positive samples and pairs from different users as negative samples, and finally 25,000 positive pairs and 25,000 negative pairs are randomly selected for testing. During testing, the middle T-frame gesture clip of each video is first intercepted (the middle T frames usually contain rich motion; in one embodiment of the present invention, T takes the value 20), image standardization is applied, and the clip is fed into the temporal difference co-occurrence neural network to obtain the user identity feature fusing physiological and behavioral features; the distances of the 50,000 sample pairs are then computed. The maximum and minimum of the 50,000 pair distances are computed, and 1000 values uniformly sampled between them are used in turn as thresholds, i.e. Threshold = [min, min+step, min+2×step, ..., max], where step is the uniform sampling step. If the cosine distance of a sample pair is less than the threshold, authentication passes; otherwise it fails.

The system's false acceptance rate FAR, false rejection rate FRR and equal error rate EER are computed. FAR denotes the probability of mistakenly authenticating an unregistered user, i.e. the ratio of the number of negative sample pairs whose cosine distance is below the threshold to all negative sample pairs in the test set:

$FAR_{thres} = \frac{FP_{thres}}{FP_{thres} + TN_{thres}}$

where $FP_{thres}$ is the number of negative samples accepted at threshold thres and $TN_{thres}$ is the number of negative samples rejected. FRR denotes the probability of mistakenly rejecting a registered user, i.e. the ratio of the number of positive sample pairs whose cosine distance is above the threshold to all positive sample pairs in the test set:

$FRR_{thres} = \frac{FN_{thres}}{FN_{thres} + TP_{thres}}$

where $FN_{thres}$ is the number of positive samples rejected and $TP_{thres}$ is the number of positive samples accepted.

The smaller the false rejection rate FRR, the better the usability of the method, i.e. a user is less likely to be rejected when accessing their own account; the smaller the false acceptance rate FAR, the stronger the security of the method, i.e. it is harder to impersonate and attack someone else's account. Usually there is a trade-off between the false acceptance rate FAR and the false rejection rate FRR: by traversing different thresholds, the FAR and FRR at each threshold can be obtained; as the threshold increases, the false acceptance rate FAR rises and FRR falls. The EER is the error rate at which the false rejection rate FRR equals the false acceptance rate FAR (EER is the value of FRR and FAR when FRR = FAR, i.e. the three values are equal at that point, EER = FRR = FAR); it is used to evaluate the matching accuracy of different parameters because the false rejection rate FRR and the false acceptance rate FAR are treated equally at that point. An algorithm with a lower EER performs better in the authentication task. In one embodiment of the present invention, the temporal difference co-occurrence neural network model with the lowest EER is selected as the random gesture feature extractor.
Step 2: select enrolment mode or authentication mode.
Once training of the random gesture feature extractor is complete, the extractor can be deployed in the system to extract user identity features in the enrolment and authentication phases.
Step 3: enter the user name and capture the user's random gesture video.
A random gesture requires no memorisation; the user simply improvises, in front of the camera, a gesture that satisfies the requirements: it should engage all five fingers as fully as possible and show the palm from several angles. In one embodiment the video is captured at 15 fps, i.e. 15 images per second of video.
Step 4: pre-process the random gesture video.
In the enrolment and authentication phases, the middle T frames of the captured gesture video are first cropped so as to obtain the segment of the random gesture video with relatively rich motion. Frame-by-frame centre cropping, resizing and image normalisation are then applied to remove irrelevant background and make the gesture video frames satisfy the size and distribution requirements of the random gesture feature extractor. In one embodiment, since the random gesture feature extractor is initialised with a model pre-trained on the ImageNet image data set, image normalisation subtracts the per-channel means [0.485, 0.456, 0.406] from all video frames and divides by the standard deviations [0.229, 0.224, 0.225] (both statistics of the ImageNet data set). The final cropped video has size (T, C, W, H), where T is the number of frames, C the number of channels, W the image width and H the image height.
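A minimal preprocessing sketch corresponding to the steps above is shown next, assuming PyTorch/torchvision; approximating the gesture-region crop by a square centre crop and resizing to 224×224 are illustrative simplifications, not requirements of the disclosure.

```python
# Preprocessing sketch: middle T-frame clip, per-frame centre crop, resize,
# and ImageNet normalisation.
import torch
import torchvision.transforms.functional as TF

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def preprocess(video, T=20, out_size=224):
    """video: float tensor (N, C, H, W) with values in [0, 1]."""
    N = video.shape[0]
    start = (N - T) // 2                      # middle T frames, where motion is richest
    clip = video[start:start + T]

    frames = []
    for frame in clip:
        side = min(frame.shape[-2:])          # square centre crop as a stand-in for the gesture region
        f = TF.center_crop(frame, [side, side])
        f = TF.resize(f, [out_size, out_size], antialias=True)
        f = TF.normalize(f, IMAGENET_MEAN, IMAGENET_STD)
        frames.append(f)
    return torch.stack(frames)                # (T, C, out_size, out_size)
```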
Step 5: feed the pre-processed dynamic gesture video into the trained and tested random gesture feature extractor and extract a feature vector containing the user's physiological and behavioural features.
A random gesture carries both physiological and behavioural characteristics, so the random gesture feature extractor must be able to extract both kinds of feature simultaneously and fuse them, making full use of their complementary identity information to improve authentication accuracy and system security.
In one embodiment the random gesture feature extractor is obtained by training and testing a temporal-difference symbiotic neural network model. Referring to Figs. 2 and 3, the fast and accurate temporal-difference symbiotic neural network model provided in this embodiment comprises a residual physiological feature extraction module, a symbiotic behavioural feature extraction module, an inter-frame difference module and a feature fusion module based on the behavioural feature norm.
The residual physiological feature extraction module comprises an input layer and a standard 18-layer residual network; it extracts the physiological feature of every gesture frame and at the same time supplies the difference pseudo-modality inputs for the symbiotic behavioural feature extraction module. Its input is the raw gesture video (Bt, T, 3, 224, 224), i.e. a batch of Bt gesture videos of T three-channel frames of size 224×224. For forward propagation the input is reshaped to (Bt×T, 3, 224, 224), i.e. the video frames are processed independently with no inter-frame interaction. At the end of the module, after global average pooling and a fully connected operation, the physiological feature has shape (Bt×T, fdim); at output it is reshaped to (Bt, T, fdim).
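A possible per-frame realisation of this branch is sketched below, assuming torchvision's ResNet-18 layer naming and ImageNet-pretrained initialisation; it is an illustrative reading of the module, not the exact embodiment.

```python
# Per-frame physiological branch: a ResNet-18 trunk applied to each frame
# independently; the intermediate feature maps are kept for the inter-frame
# difference module.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class PhysiologicalBranch(nn.Module):
    def __init__(self, fdim=512):
        super().__init__()
        net = resnet18(weights="IMAGENET1K_V1")          # ImageNet-pretrained initialisation
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512, fdim)

    def forward(self, video):                            # video: (Bt, T, 3, 224, 224)
        Bt, T = video.shape[:2]
        x = video.flatten(0, 1)                          # (Bt*T, 3, 224, 224): no inter-frame interaction
        feats = []                                       # taps handed to the difference module
        x = self.stem(x);   feats.append(x)              # "Conv1": 64 ch, 56x56
        x = self.layer1(x); feats.append(x)              # 64 ch, 56x56
        x = self.layer2(x); feats.append(x)              # 128 ch, 28x28
        x = self.layer3(x); feats.append(x)              # 256 ch, 14x14
        x = self.layer4(x)
        phys = self.fc(self.pool(x).flatten(1))          # (Bt*T, fdim)
        return phys.view(Bt, T, -1), feats
```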
The symbiotic behavioural feature extraction module comprises five input layers, five 2D convolution layers, one 2D pooling layer, one global average pooling layer and one fully connected layer. Every convolution layer is followed by a BN layer for batch normalisation, and ReLU is used as the activation function. Its inputs are the difference pseudo-modalities obtained by passing the raw gesture video frames and the feature maps produced by the Conv1, Layer1, Layer2 and Layer3 stages of the residual physiological feature extraction module through the inter-frame difference module. Within the symbiotic behavioural feature extraction module, only Conv1 convolves a difference pseudo-modality directly; before convolving, Conv2, Conv3, Conv4 and Conv5 first concatenate the feature map produced by the previous convolution layer with the corresponding difference pseudo-modality from the inter-frame difference module along the channel dimension. Finally, after global average pooling and a fully connected operation, the behavioural feature is expressed as an fdim-dimensional feature vector.
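The sketch below illustrates one plausible layout of this branch; the channel widths, strides and pooling position are assumptions chosen only to make the concatenation pattern with the difference pseudo-modalities explicit, and are not fixed by the disclosure.

```python
# Symbiotic behavioural branch sketch: five Conv+BN+ReLU blocks, each consuming
# the previous output concatenated with the matching difference pseudo-modality.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class BehaviouralBranch(nn.Module):
    def __init__(self, T=20, fdim=512, width=64):
        super().__init__()
        d = T - 1                                         # every pseudo-modality has T-1 channels
        self.conv1 = conv_block(d, width, stride=2)       # raw-frame differences, 224 -> 112
        self.pool = nn.MaxPool2d(2)                       # 112 -> 56
        self.conv2 = conv_block(width + d, width)             # + Conv1 differences (56x56)
        self.conv3 = conv_block(width + d, 2 * width, 2)       # + Layer1 differences, 56 -> 28
        self.conv4 = conv_block(2 * width + d, 4 * width, 2)   # + Layer2 differences, 28 -> 14
        self.conv5 = conv_block(4 * width + d, 8 * width)       # + Layer3 differences (14x14)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8 * width, fdim)

    def forward(self, diffs):
        """diffs: list of 5 pseudo-modalities with spatial sizes 224/56/56/28/14."""
        x = self.pool(self.conv1(diffs[0]))
        x = self.conv2(torch.cat([x, diffs[1]], dim=1))
        x = self.conv3(torch.cat([x, diffs[2]], dim=1))
        x = self.conv4(torch.cat([x, diffs[3]], dim=1))
        x = self.conv5(torch.cat([x, diffs[4]], dim=1))
        return self.fc(self.gap(x).flatten(1))            # (Bt, fdim) behavioural feature
```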
The inter-frame difference module is the bridge between the residual physiological feature extraction module and the symbiotic behavioural feature extraction module. Its input comes from the residual physiological feature extraction module with shape (Bt×T, ch, w, h) and is first reshaped to (Bt, T, ch, w, h), where ch is the number of channels and w and h are the width and height of the original image or feature map. In the residual physiological feature extraction module the input image has 3 channels and width and height (224, 224); the feature maps obtained after Conv1, Layer1, Layer2 and Layer3 have 64, 64, 128 and 256 channels respectively, with widths and heights (56, 56), (56, 56), (28, 28) and (14, 14). The inter-frame difference module subtracts, for each of these convolutional features (including the input image), the same channel of adjacent frames, and then sums all channels of each difference feature element-wise:

$$IS_{fn}(x,y,t)=\sum_{chn=1}^{ch}\bigl(F_{fn}^{chn}(x,y,t+1)-F_{fn}^{chn}(x,y,t)\bigr)$$

where IS_fn(x, y, t) is the difference pseudo-modality, chn denotes the chn-th channel, fn denotes the fn-th tapped layer of the residual physiological feature extraction module, t denotes the t-th frame, ch is the total number of channels of the current feature map, x and y are the horizontal and vertical coordinates of the feature map or image, and F_{fn}^{chn}(x, y, t) is the chn-th channel of the feature map of frame t at layer fn of the residual physiological feature extraction module.
Through the inter-frame difference module, the feature maps output by the different convolution layers of the residual physiological feature extraction module, which have different numbers of channels, are uniformly represented as (T−1)-channel difference pseudo-modalities, which represent the user's behavioural information well while greatly reducing the amount of computation. The final output pseudo-modality of the inter-frame difference module has shape (Bt, T−1, w, h).
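The difference operation itself reduces to a few tensor operations, as in the sketch below, which follows the formula above; the function name and argument layout are illustrative.

```python
# Inter-frame difference sketch: subtract adjacent frames channel by channel,
# then sum the difference over channels, giving a (Bt, T-1, w, h) pseudo-modality.
import torch

def frame_difference(feat, Bt, T):
    """feat: (Bt*T, ch, w, h) frames or feature maps from the physiological branch."""
    ch, w, h = feat.shape[1:]
    feat = feat.view(Bt, T, ch, w, h)
    diff = feat[:, 1:] - feat[:, :-1]   # adjacent-frame, same-channel subtraction
    return diff.sum(dim=2)              # element-wise sum over channels -> (Bt, T-1, w, h)
```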
Feature fusion is performed by the feature fusion module based on the behavioural feature norm as follows. The physiological features output by the residual physiological feature extraction module are averaged over the video-frame dimension, giving a physiological feature of size (Bt, fdim), which is then normalised:

$$\hat{P}=\lambda\frac{P}{\|P\|_{2}}$$

The normalised physiological feature is then added to the behavioural feature output by the symbiotic behavioural feature extraction module to obtain the fused feature:

$$F=\lambda\frac{P}{\|P\|_{2}}+B$$

where the physiological feature is P = (p_1, p_2, ..., p_n)^T, \hat{P} is the normalised physiological feature, the behavioural feature is B = (b_1, b_2, ..., b_n)^T, ||·||_2 denotes the 2-norm, λ is a hyper-parameter whose larger values make the physiological feature more important (λ = 1 in one embodiment), and p_n and b_n denote the n-th components of the physiological and behavioural feature vectors. Finally the fused feature is normalised:

$$\hat{F}=\frac{\lambda\frac{P}{\|P\|_{2}}+B}{\sqrt{\lambda^{2}+\|B\|_{2}^{2}+2\lambda\|B\|_{2}\cos\alpha}}$$

where \hat{F} is the normalised fused feature which, through the training of the temporal-difference symbiotic neural network model, contains a reasonable proportion of physiological and behavioural features, and α is the angle between the physiological feature vector P and the behavioural feature vector B.
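A minimal sketch of this fusion rule is given below, assuming PyTorch; it follows the equations above, with λ as a plain hyper-parameter.

```python
# Fusion based on the behavioural feature norm: frame-averaged physiological
# feature scaled to length lambda, added to the raw behavioural feature, then
# the sum is re-normalised to unit length.
import torch
import torch.nn.functional as F

def fuse(phys, behav, lam=1.0, eps=1e-8):
    """phys: (Bt, T, fdim) per-frame physiological features; behav: (Bt, fdim)."""
    p = phys.mean(dim=1)                                   # average over the frame dimension
    p_hat = lam * p / (p.norm(dim=1, keepdim=True) + eps)  # length-lambda physiological feature
    fused = p_hat + behav                                  # behavioural norm controls its weight
    return F.normalize(fused, dim=1)                       # unit-length identity feature
```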
If the physiological and behavioural features are both normalised first and then added and normalised again, a balanced fused feature is obtained:

$$\hat{F}_{eq}=\frac{\frac{P}{\|P\|_{2}}+\frac{B}{\|B\|_{2}}}{\sqrt{2+2\cos\alpha}}$$

where \hat{F}_{eq} is the balanced fused feature, in which the fused physiological and behavioural features contribute equally, and \hat{B} is the normalised behavioural feature (normalised in the same way as the physiological feature). Comparing \hat{F} with \hat{F}_{eq} shows by how many times the weights of the physiological and behavioural features under the fusion method based on the behavioural feature norm are boosted relative to the balanced case:

boost of the physiological feature relative to its balanced contribution:

$$\mu_{p}=\frac{\lambda\sqrt{2+2\cos\alpha}}{\sqrt{\lambda^{2}+\|B\|_{2}^{2}+2\lambda\|B\|_{2}\cos\alpha}}$$

boost of the behavioural feature relative to its balanced contribution:

$$\mu_{b}=\frac{\|B\|_{2}\sqrt{2+2\cos\alpha}}{\sqrt{\lambda^{2}+\|B\|_{2}^{2}+2\lambda\|B\|_{2}\cos\alpha}}$$
The angle α between the physiological and behavioural features determines the upper bound of the contribution: the smaller the angle, the larger the bound. When μ_p > 1 the physiological feature has the larger weight, which requires

$$(\lambda-\|B\|_{2})\,(\lambda+\|B\|_{2}+2\lambda\cos\alpha)>0$$

It follows that when the angle α between the behavioural and physiological features is smaller than 120° and the behavioural feature norm is smaller than λ, the physiological feature dominates; when α is larger than 120°, the behavioural feature norm must be smaller than λ and at the same time larger than −λ(1+2cosα) for the physiological feature to dominate.
When μ_b > 1 the behavioural feature has the larger weight, which requires

$$(\|B\|_{2}-\lambda)\,(\lambda+\|B\|_{2}+2\|B\|_{2}\cos\alpha)>0$$

That is, when the angle between the behavioural and physiological features is smaller than 120° and the behavioural feature norm is larger than λ, the behavioural feature dominates; when the angle is larger than 120°, the behavioural feature norm must be larger than λ and at the same time smaller than

$$-\frac{\lambda}{1+2\cos\alpha}$$

for the behavioural feature to dominate.
Through the feature fusion module based on the behavioural feature norm, the system can automatically adjust the weights of the physiological and behavioural features according to the magnitude of the behavioural feature norm. The module also caps the weights of the two features, preventing either feature from dominating early in training because of an excessively large norm and thereby drowning out the other.
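As a purely illustrative numerical check (not taken from the disclosure), the small script below evaluates the two boost factors for λ = 1 and confirms the dominance conditions stated above.

```python
# Numerical check of the dominance conditions with lambda = 1 (illustrative).
import math

def boosts(lam, b_norm, alpha_deg):
    """Return (mu_p, mu_b): contribution boost of each feature vs. balanced fusion."""
    a = math.radians(alpha_deg)
    fused = math.sqrt(lam**2 + b_norm**2 + 2 * lam * b_norm * math.cos(a))
    balanced = math.sqrt(2 + 2 * math.cos(a))
    return lam * balanced / fused, b_norm * balanced / fused

print(boosts(1.0, 0.5, 60))  # ||B|| < lambda, alpha < 120 deg -> mu_p > 1 (physiological dominates)
print(boosts(1.0, 2.0, 60))  # ||B|| > lambda, alpha < 120 deg -> mu_b > 1 (behavioural dominates)
```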
Step 6: in enrolment mode, add the entered user name and the extracted random gesture feature vector to the gesture template database; in authentication mode, first retrieve the multiple feature vectors stored in the gesture template database for the entered user name, then compute their cosine distances to the feature vector of the user to be authenticated, and compare the smallest cosine distance with a threshold: if it is below the threshold the authentication succeeds, otherwise it fails. The threshold is an authentication threshold set manually according to the application scenario; in one embodiment it lies in the range [0, 1].
In practical use the threshold can be chosen dynamically to balance the needs of the application. In settings with high security requirements, such as banks or customs, successful impostor attacks must be avoided as far as possible, so the threshold should be lowered (e.g. to 0.2) to reduce the false acceptance rate FAR. Conversely, in settings with relatively low security requirements, such as access control for a shared office area or control of household appliances, the threshold should be raised (e.g. to 0.3) so that enrolled users are recognised correctly as often as possible, reducing the FRR. How far the threshold is lowered or raised is determined by the user according to the requirements.
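For illustration, a minimal matching sketch of the enrolment/authentication decision is given below; the threshold value and tensor shapes are assumptions.

```python
# Matching sketch: cosine distance against every enrolled template for the
# claimed user name; accept if the smallest distance is below the threshold.
import torch
import torch.nn.functional as F

def authenticate(probe, templates, threshold=0.25):
    """probe: (fdim,) identity feature; templates: (K, fdim) enrolled features."""
    sim = F.cosine_similarity(probe.unsqueeze(0), templates, dim=1)
    dist = 1.0 - sim                                   # cosine distance per enrolled template
    return bool(dist.min() < threshold), dist.min().item()
```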
In one embodiment, a system implementing the foregoing method is also provided, namely a video-based random gesture authentication system comprising the following modules:
a mode selection module, configured to select enrolment mode or authentication mode;
a capture module, configured to receive the entered user name and capture the user's random gesture video;
a data processing module, configured to pre-process the random gesture video;
a feature extraction module, configured to feed the pre-processed dynamic gesture video into a random gesture feature extractor and extract a feature vector containing the user's physiological and behavioural features, the random gesture feature extractor being obtained by training and testing a temporal-difference symbiotic neural network model; the temporal-difference symbiotic neural network model comprises a residual physiological feature extraction module, a symbiotic behavioural feature extraction module, a feature fusion module based on the behavioural feature norm and an inter-frame difference module; the residual physiological feature extraction module takes the random gesture video as input and extracts physiological features; the inter-frame difference module subtracts the same channel of adjacent frames of the input video and of the output features of each layer of the residual physiological feature extraction module and sums all channels of each difference feature element-wise to obtain difference pseudo-modalities; the symbiotic behavioural feature extraction module takes the gesture video difference pseudo-modalities as input and extracts behavioural features; and the feature fusion module based on the behavioural feature norm fuses the physiological and behavioural features;
an enrolment and authentication module, configured to, in enrolment mode, add the entered user name and the extracted feature vector of the random gesture to a gesture template database; and, in authentication mode, to first retrieve the multiple feature vectors stored in the gesture template database for the entered user name, compute their cosine distances to the feature vector of the user to be authenticated, and compare the smallest cosine distance with a threshold: if it is below the threshold the authentication succeeds, otherwise it fails, the threshold being an authentication threshold set manually according to the application scenario.
To demonstrate the effectiveness and superiority of the random gesture authentication method and system based on the temporal-difference symbiotic neural network model disclosed herein, the equal error rates of the temporal-difference symbiotic neural network model for random gesture authentication on the dynamic gesture authentication data set are reported and compared experimentally with current mainstream video understanding networks (TSN, TSM, two-stream convolutional neural networks, 3D convolutional neural networks) and an image classification network (ResNet18). The experimental results are shown in the following table:
[Table: equal error rates of the temporal-difference symbiotic neural network and the compared networks on the stage-one and stage-two test sets]
It can be seen that, using the temporal-difference symbiotic neural network model for authentication, the method achieves an equal error rate of 2.580% on the stage-one test set and 6.485% on the stage-two test set, i.e. only 2.580% and 6.485% of enrolled/unenrolled users are mis-identified respectively (equivalent to recognition accuracies of 97.420% and 93.515%); the equal error rate is far below that of the other existing methods, which demonstrates the effectiveness of random gestures. Comparison with the performance of current mainstream video understanding networks and the image classification network on random gesture authentication shows that the temporal-difference symbiotic neural network has the lowest equal error rate on both the stage-one and stage-two test sets, proving its stronger authentication performance. This experiment is intended only to demonstrate the effectiveness of random gesture authentication and the superiority of the temporal-difference symbiotic neural network.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and for the identical or similar parts the embodiments may refer to one another. Since the video-based random gesture authentication system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief and reference may be made to the description of the method where relevant.
By authenticating with fast, video-based random gestures, the present invention completes user identity authentication without memorisation, requiring only an improvised random gesture; the model used runs fast, the gesture is decoupled from sensitive identity information and does not touch the user's private information, enabling safer, more efficient and friendlier identity authentication.
The above embodiments are preferred embodiments of the present invention, but the embodiments of the invention are not limited by them; any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the invention shall be an equivalent replacement and is included within the scope of protection of the invention.

Claims (10)

  1. A video-based random gesture authentication method, characterised by comprising the following steps:
    selecting an enrolment mode or an authentication mode;
    entering a user name and capturing a random gesture video of the user;
    pre-processing the random gesture video;
    feeding the pre-processed dynamic gesture video into a random gesture feature extractor and extracting a feature vector containing physiological and behavioural features of the user, the random gesture feature extractor being obtained by training and testing a temporal-difference symbiotic neural network model; wherein the temporal-difference symbiotic neural network model comprises a residual physiological feature extraction module, a symbiotic behavioural feature extraction module, a feature fusion module based on the behavioural feature norm and an inter-frame difference module; the residual physiological feature extraction module takes the random gesture video as input and extracts physiological features; the inter-frame difference module subtracts the same channel of adjacent frames of the input video and of the output features of each layer of the residual physiological feature extraction module and sums all channels of each difference feature element-wise to obtain difference pseudo-modalities; the symbiotic behavioural feature extraction module takes the gesture video difference pseudo-modalities as input and extracts behavioural features; and the feature fusion module based on the behavioural feature norm fuses the physiological and behavioural features;
    in the enrolment mode, adding the entered user name and the extracted feature vector of the random gesture to a gesture template database; in the authentication mode, first retrieving the multiple feature vectors stored in the gesture template database for the entered user name, then computing their cosine distances to the feature vector of the user to be authenticated, and comparing the smallest cosine distance with a threshold: if it is below the threshold the authentication succeeds, otherwise it fails, the threshold being an authentication threshold set manually according to the application scenario.
  2. The video-based random gesture authentication method according to claim 1, characterised in that: in capturing the user's random gesture video, the random gesture requires no memorisation, and enrolment and authentication only require the user to improvise a gesture.
  3. The video-based random gesture authentication method according to claim 1, characterised in that pre-processing the random gesture video comprises: extracting a T-frame gesture clip from the dynamic gesture video, then performing frame-by-frame centre cropping, image resizing and image normalisation, the finally cropped video having size (T, C, W, H), where T is the number of frames, C the number of channels, W the image width and H the image height.
  4. The video-based random gesture authentication method according to claim 1, characterised in that obtaining the random gesture feature extractor by training and testing the temporal-difference symbiotic neural network model comprises:
    capturing N-frame videos of several random gestures of several users and recording the corresponding user names to form a random gesture video data set;
    processing the random gesture video data set by cropping the gesture motion region from the frames of the random gesture video data set and resizing the images, the final data set having size (P, Q, N, C, W, H), where P is the number of users, Q the number of random gestures performed by each user and N the number of frames per random gesture video;
    splitting the random gesture video data set into a training set and a test set for training and testing the temporal-difference symbiotic neural network model, wherein, for the samples in the test set, random gestures of multiple users are additionally collected after a preset interval to form a second-stage test set;
    in the training stage, extracting a random T-frame gesture clip from the random gesture video and pre-processing it; forward-propagating the pre-processed random gesture video through the temporal-difference symbiotic neural network model to obtain a fused feature, feeding it into the loss function, and optimising the temporal-difference symbiotic neural network model by back-propagation;
    in the test stage, extracting the middle T-frame gesture clip from the random gesture video, applying image normalisation, and feeding it into the temporal-difference symbiotic neural network to obtain the fused feature used for distance computation.
  5. The video-based random gesture authentication method according to claim 1, characterised in that the step of obtaining the physiological feature through the residual physiological feature extraction module is: treating the T frames of random gesture images as an image batch of size T and forward-propagating them through an 18-layer convolutional neural network; expressing the physiological feature as a T×fdim-dimensional feature vector through global average pooling and a fully connected operation; and averaging the T×fdim-dimensional feature vector over the time dimension to obtain an fdim-dimensional physiological feature vector.
  6. The video-based random gesture authentication method according to claim 1, characterised in that the step of obtaining the behavioural feature through the symbiotic behavioural feature extraction module is: inputting the random gesture video and obtaining the random gesture video difference pseudo-modalities through the inter-frame difference module; feeding the random gesture video difference pseudo-modalities into the symbiotic behavioural feature extraction module; after each convolution operation, concatenating the output of the previous layer with the difference pseudo-modality representing the corresponding residual physiological feature along the channel dimension; and expressing the behavioural feature as an fdim-dimensional feature vector through global average pooling and a fully connected operation.
  7. The video-based random gesture authentication method according to claim 1, characterised in that the difference pseudo-modality obtained through the inter-frame difference module is:

    $$IS_{fn}(x,y,t)=\sum_{chn=1}^{ch}\bigl(F_{fn}^{chn}(x,y,t+1)-F_{fn}^{chn}(x,y,t)\bigr)$$

    where IS_fn(x, y, t) is the difference pseudo-modality, chn, fn and t denote respectively the chn-th channel, the feature of the fn-th layer of the residual physiological feature extraction module and the t-th frame, ch is the total number of channels of the current feature map, and x and y are the horizontal and vertical coordinates of the feature map or image.
  8. The video-based random gesture authentication method according to any one of claims 1 to 7, characterised in that the step of obtaining the fused feature through the feature fusion module based on the behavioural feature norm comprises: normalising the physiological feature output by the residual physiological feature extraction module; adding the normalised physiological feature to the behavioural feature output by the symbiotic behavioural feature extraction module to obtain the fused feature; and normalising the fused feature, the final fused feature being:

    $$\hat{F}=\frac{\lambda\frac{P}{\|P\|_{2}}+B}{\sqrt{\lambda^{2}+\|B\|_{2}^{2}+2\lambda\|B\|_{2}\cos\alpha}}$$

    where \hat{F} is the normalised fused feature containing the physiological and behavioural features, the physiological feature is P = (p_1, p_2, ..., p_n)^T, the behavioural feature is B = (b_1, b_2, ..., b_n)^T, ||·||_2 denotes the 2-norm, λ is a hyper-parameter and α is the angle between the physiological feature vector P and the behavioural feature vector B.
  9. The video-based random gesture authentication method according to claim 8, characterised in that the weights of the physiological and behavioural features are adjusted automatically through the feature fusion module based on the behavioural feature norm, wherein
    when the angle α between the behavioural and physiological features is smaller than 120° and the behavioural feature norm is smaller than λ, the physiological feature has a larger weight than the behavioural feature; when α is larger than 120°, the behavioural feature norm must be smaller than λ and at the same time larger than −λ(1+2cosα) for the physiological feature to have the larger weight, i.e.

    $$(\lambda-\|B\|_{2})\,(\lambda+\|B\|_{2}+2\lambda\cos\alpha)>0$$

    when the angle between the behavioural and physiological features is smaller than 120° and the behavioural feature norm is larger than λ, the behavioural feature has a larger weight than the physiological feature; when the angle is larger than 120°, the behavioural feature norm must be larger than λ and at the same time smaller than

    $$-\frac{\lambda}{1+2\cos\alpha}$$

    for the behavioural feature to have the larger weight, i.e.

    $$(\|B\|_{2}-\lambda)\,(\lambda+\|B\|_{2}+2\|B\|_{2}\cos\alpha)>0$$
  10. A video-based random gesture authentication system, characterised in that it is used to implement the method of claim 1 and comprises:
    a mode selection module, configured to select enrolment mode or authentication mode;
    a capture module, configured to receive the entered user name and capture the user's random gesture video;
    a data processing module, configured to pre-process the random gesture video;
    a feature extraction module, configured to feed the pre-processed dynamic gesture video into a random gesture feature extractor and extract a feature vector containing the user's physiological and behavioural features, the random gesture feature extractor being obtained by training and testing a temporal-difference symbiotic neural network model; wherein the temporal-difference symbiotic neural network model comprises a residual physiological feature extraction module, a symbiotic behavioural feature extraction module, a feature fusion module based on the behavioural feature norm and an inter-frame difference module; the residual physiological feature extraction module takes the random gesture video as input and extracts physiological features; the inter-frame difference module subtracts the same channel of adjacent frames of the input video and of the output features of each layer of the residual physiological feature extraction module and sums all channels of each difference feature element-wise to obtain difference pseudo-modalities; the symbiotic behavioural feature extraction module takes the gesture video difference pseudo-modalities as input and extracts behavioural features; and the feature fusion module based on the behavioural feature norm fuses the physiological and behavioural features;
    an enrolment and authentication module, configured to, in enrolment mode, add the entered user name and the extracted feature vector of the random gesture to a gesture template database; and, in authentication mode, to first retrieve the multiple feature vectors stored in the gesture template database for the entered user name, compute their cosine distances to the feature vector of the user to be authenticated, and compare the smallest cosine distance with a threshold: if it is below the threshold the authentication succeeds, otherwise it fails, the threshold being an authentication threshold set manually according to the application scenario.
PCT/CN2022/100935 2021-06-23 2022-06-23 Video-based random gesture authentication method and system WO2022268183A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110699895.2 2021-06-23
CN202110699895.2A CN113343198B (zh) 2021-06-23 2021-06-23 Video-based random gesture authentication method and system

Publications (1)

Publication Number Publication Date
WO2022268183A1 true WO2022268183A1 (zh) 2022-12-29

Family

ID=77478002

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100935 WO2022268183A1 (zh) 2021-06-23 2022-06-23 Video-based random gesture authentication method and system

Country Status (2)

Country Link
CN (1) CN113343198B (zh)
WO (1) WO2022268183A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117055738A (zh) * 2023-10-11 2023-11-14 湖北星纪魅族集团有限公司 Gesture recognition method, wearable device and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343198B (zh) 2021-06-23 2022-12-16 华南理工大学 Video-based random gesture authentication method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170206405A1 (en) * 2016-01-14 2017-07-20 Nvidia Corporation Online detection and classification of dynamic gestures with recurrent convolutional neural networks
CN109919057A (zh) * 2019-02-26 2019-06-21 北京理工大学 Multi-modal fusion gesture recognition method based on an efficient convolutional neural network
CN112380512A (zh) * 2020-11-02 2021-02-19 华南理工大学 Convolutional neural network dynamic gesture authentication method, apparatus, storage medium and device
CN112507898A (zh) * 2020-12-14 2021-03-16 重庆邮电大学 Multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and TCN
CN113343198A (zh) * 2021-06-23 2021-09-03 华南理工大学 Video-based random gesture authentication method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9589120B2 (en) * 2013-04-05 2017-03-07 Microsoft Technology Licensing, Llc Behavior based authentication for touch screen devices


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117055738A (zh) * 2023-10-11 2023-11-14 湖北星纪魅族集团有限公司 Gesture recognition method, wearable device and storage medium
CN117055738B (zh) * 2023-10-11 2024-01-19 湖北星纪魅族集团有限公司 Gesture recognition method, wearable device and storage medium

Also Published As

Publication number Publication date
CN113343198A (zh) 2021-09-03
CN113343198B (zh) 2022-12-16


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22827675

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22827675

Country of ref document: EP

Kind code of ref document: A1