CN117115928B - Rural area network co-construction convenience service terminal based on multiple identity authentications - Google Patents

Rural area network co-construction convenience service terminal based on multiple identity authentications Download PDF

Info

Publication number
CN117115928B
CN117115928B (application CN202311096199.8A)
Authority
CN
China
Prior art keywords
face
user
training
feature
image
Prior art date
Legal status
Active
Application number
CN202311096199.8A
Other languages
Chinese (zh)
Other versions
CN117115928A (en)
Inventor
牛节省
梁春芝
李丰生
李建辉
李如飞
Current Assignee
Beijing Guowang Shengyuan Intelligent Terminal Technology Co ltd
Original Assignee
Beijing Guowang Shengyuan Intelligent Terminal Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Guowang Shengyuan Intelligent Terminal Technology Co ltd filed Critical Beijing Guowang Shengyuan Intelligent Terminal Technology Co ltd
Priority to CN202311096199.8A priority Critical patent/CN117115928B/en
Publication of CN117115928A publication Critical patent/CN117115928A/en
Application granted granted Critical
Publication of CN117115928B publication Critical patent/CN117115928B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V10/94 Hardware or software architectures specially adapted for image or video understanding
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/12 Fingerprints or palmprints
    • G06V40/13 Sensors therefor
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G06V40/172 Classification, e.g. identification
    • G06V40/70 Multimodal biometrics, e.g. combining information from different biometric modalities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The application discloses a rural network co-construction convenience service terminal based on multiple identity authentications, which analyzes the user's face image with deep learning technology to intelligently determine whether identity authentication succeeds.

Description

Rural area network co-construction convenience service terminal based on multiple identity authentications
Technical Field
The application relates to the field of identity authentication, and more particularly, to a village network co-building convenience service terminal based on multiple identity authentications.
Background
With the continuous advance of rural informatization in China, the rural network co-construction convenience service terminal, as an intelligent device integrating government, financial, medical and other services, provides convenient information services for rural residents. More and more rural residents handle various matters through the network, such as paying water and electricity bills, querying agricultural information, and applying for educational subsidies.
However, existing rural network co-construction convenience service terminals rely on a single mode of identity authentication and are prone to authentication failure or misidentification in cases of identity theft, forgery and the like. An optimized solution is therefore desired.
Disclosure of Invention
The present application has been made in order to solve the above technical problems. The embodiment of the application provides a village-network co-construction convenience service terminal and system based on multiple identity authentications, which analyze and process face images of users by utilizing an image processing technology based on deep learning so as to intelligently judge whether the identity authentication is successful or not.
According to one aspect of the present application, there is provided a rural area network co-construction convenience service terminal based on multiple identity authentications, including:
the touch screen display is used for displaying a user interface and receiving a user request;
a computer communicatively coupled to the touch screen display for performing identity authentication and processing the user request;
a printer communicatively coupled to the computer for printing documents;
a card reader communicatively connected to the computer for extracting a user identity tag from a user identity card;
the fingerprint instrument is communicatively connected with the computer and is used for collecting fingerprint information of a user;
the camera is communicatively connected with the computer and is used for collecting face images of the user;
a scanner communicatively coupled to the computer for scanning a user document;
A voice recognition module communicatively coupled to the computer for recognizing a voice command of a user;
and a speech synthesis module communicatively connected to the computer for outputting a speech prompt.
Compared with the prior art, the village network co-building convenience service terminal and system based on multiple identity authentications analyze and process face images of users by utilizing an image processing technology based on deep learning so as to intelligently judge whether the identity authentication is successful or not.
Drawings
The foregoing and other objects, features and advantages of the present application will become more apparent from the following more particular description of embodiments of the present application, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification, illustrate the application and not constitute a limitation to the application. In the drawings, like reference numerals generally refer to like parts or steps.
Fig. 1 is a block diagram of a village-network co-building convenience service terminal based on multiple identity authentications according to an embodiment of the present application;
fig. 2 is a block diagram of a computer in a village network co-building convenience service terminal based on multiple identity authentications according to an embodiment of the present application;
Fig. 3 is a system architecture diagram of a village-network co-building convenience service terminal based on multiple identity authentications according to an embodiment of the present application;
fig. 4 is a block diagram of a training phase of a village-network co-building convenience service terminal based on multiple identity authentications according to an embodiment of the present application;
fig. 5 is a block diagram of an image feature extraction module in a village-network co-building convenience service terminal based on multiple identity authentications according to an embodiment of the present application;
fig. 6 is a block diagram of a context information multi-scale aggregation unit in a village-network co-building convenience service terminal based on multiple identity authentications according to an embodiment of the present application;
fig. 7 is a block diagram of an identity authentication module in a village-network co-building convenience service terminal based on multiple identity authentications according to an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
As used in this application and in the claims, the terms "a," "an," "the," and/or "the" are not specific to the singular, but may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the steps and elements are explicitly identified, and they do not constitute an exclusive list, as other steps or elements may be included in a method or apparatus.
Although the present application makes various references to certain modules in a system according to embodiments of the present application, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative, and different aspects of the systems and methods may use different modules.
Flowcharts are used in this application to describe the operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in order precisely. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Also, other operations may be added to or removed from these processes.
With the continuous advance of rural informatization in China, the rural network co-construction convenience service terminal, as an intelligent device integrating government, financial, medical and other services, provides convenient information services for rural residents. More and more rural residents handle various matters through the network, such as paying water and electricity bills, querying agricultural information, and applying for educational subsidies. However, existing terminals rely on a single mode of identity authentication and are prone to authentication failure or misidentification in cases of identity theft, forgery and the like, so an optimized solution is desired. Specifically, when a user needs to use the terminal of the invention, identity authentication is performed first; authentication can use an identity card, social security card, bank card, fingerprint, face or other mode, and several modes can also be combined to improve security and accuracy. After authentication succeeds, the user can select the required service item on the touch screen display, such as information inquiry, government transactions or utility payment. Depending on the service item, the user may need to provide other credentials or documents, such as a household register, marriage certificate or water bill; the scanner can then be used to scan them and upload the scan results to the rural network co-construction platform. Meanwhile, the user can also use the voice recognition module for voice input or voice inquiry, and the voice synthesis module can output voice prompts or feedback. When the user completes the required service item, the printer can print the related bill or certificate, and the session ends.
Face recognition has become one of the main modes in the identity authentication process: the user needs no additional learning or training and can complete authentication simply by facing the camera, so the operation is simple and convenient; moreover, compared with contact-based modes such as fingerprints, face recognition requires no direct contact between the user and the terminal device, providing a more convenient, hygienic and friendly user experience. Therefore, in the technical scheme of the application, in order to realize reliable user identity authentication, and especially reliable face recognition, the application uses deep-learning-based image processing technology to analyze the user's face image so as to intelligently determine whether identity authentication succeeds.
In the technical scheme of the application, the village-network co-building convenience service terminal based on multiple identity authentications is provided. Fig. 1 is a block diagram of a village-network co-building convenience service terminal based on multiple identity authentications according to an embodiment of the present application. As shown in fig. 1, a rural area network co-construction convenience service terminal 300 based on multiple identity authentications according to an embodiment of the present application includes: a touch screen display 310 for displaying a user interface and receiving a user request; a computer 320 communicatively coupled to the touch screen display for performing identity authentication and processing the user request; a printer 330 communicatively connected to the computer for printing documents; a card reader 340 communicatively connected to the computer for extracting a user identity tag from a user identity card; a fingerprint instrument 350 communicatively coupled to the computer for collecting user fingerprint information; a camera 360 communicably connected to the computer for capturing images of the user's face; a scanner 370 communicatively coupled to the computer for scanning the user's credentials; a voice recognition module 380 communicatively connected to the computer for recognizing a voice command of a user; a speech synthesis module 390 communicatively coupled to the computer for outputting voice prompts.
In particular, the touch screen display 310 is used to display a user interface and receive user requests. Wherein the touch screen display is a display integrated with a touch sensing technology, which can input instructions and control devices through a user's touch operation. Its main function is to provide a user interface and to receive user requests. The touch screen display may serve as the primary user interface for the device, replacing the traditional keyboard and mouse. The user can directly touch the icons, buttons, menus and other elements on the screen to operate, so that interaction with the device is achieved.
In particular, the computer 320, which is communicatively coupled to the touch screen display, is used for authentication and processing the user request. In particular, in one specific example of the present application, as shown in fig. 2 and 3, the computer 320 communicatively connected to the touch screen display, comprises: the image feature extraction module 321 is configured to perform image feature extraction on the user face image to obtain a multi-scale context aggregation face feature map; and an identity authentication module 322, configured to determine whether identity authentication is successful based on the user identity tag and the multi-scale context aggregated face feature map.
Specifically, the image feature extraction module 321 is configured to perform image feature extraction on the face image of the user to obtain a multi-scale context aggregation face feature map. In particular, in one specific example of the present application, as shown in fig. 5, the image feature extraction module 321 includes: the neighborhood feature extraction unit 3211 is configured to perform neighborhood feature extraction on the user face image to obtain a face local feature map; and a context information multi-scale aggregation unit 3212, configured to perform context information multi-scale aggregation on the face local feature map to obtain the multi-scale context aggregated face feature map.
More specifically, the neighborhood feature extraction unit 3211 is configured to perform neighborhood feature extraction on the user's face image to obtain a face local feature map. In particular, in one specific example of the present application, the neighborhood feature extraction unit 3211 is configured to pass the user's face image through a face local feature extractor based on a convolutional neural network model to obtain the face local feature map. That is, a convolutional neural network model, which performs very well at local feature extraction, applies local convolution kernels to filter the user's face image and thereby capture the facial features in the image. Specifically, each layer of the face local feature extractor based on the convolutional neural network model processes its input data in the forward pass as follows: convolve the input data to obtain a convolution feature map; pool the convolution feature map over local feature matrices to obtain a pooled feature map; and apply a nonlinear activation to the pooled feature map to obtain an activation feature map. The output of the last layer of the face local feature extractor is the face local feature map, and the input of its first layer is the user's face image.
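As an illustration of this layer-wise processing, the following is a minimal PyTorch sketch of such a face local feature extractor; the class name, channel widths, kernel size and input resolution are assumptions made for the example, not values taken from the patent.

    import torch
    import torch.nn as nn

    class FaceLocalFeatureExtractor(nn.Module):
        """Illustrative CNN-based face local feature extractor: each layer
        applies a convolution, pools the convolved map, then applies a
        nonlinear activation, mirroring the per-layer processing above."""
        def __init__(self, in_channels: int = 3, channels=(32, 64, 128)):
            super().__init__()
            layers, prev = [], in_channels
            for c in channels:
                layers += [
                    nn.Conv2d(prev, c, kernel_size=3, padding=1),  # local convolution kernels
                    nn.MaxPool2d(kernel_size=2),                   # pooling over local regions
                    nn.ReLU(inplace=True),                         # nonlinear activation
                ]
                prev = c
            self.backbone = nn.Sequential(*layers)

        def forward(self, face_image: torch.Tensor) -> torch.Tensor:
            # face_image: (B, 3, H, W) -> face local feature map: (B, 128, H/8, W/8)
            return self.backbone(face_image)

    # Example: a batch of two 224x224 user face images
    x = torch.randn(2, 3, 224, 224)
    local_feature_map = FaceLocalFeatureExtractor()(x)  # shape (2, 128, 28, 28)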
Notably, the convolutional neural network (CNN) is a deep learning model designed specifically for data with a grid structure, such as images and video. CNNs perform well in image processing and have produced important breakthroughs in many computer vision tasks. The core idea of a CNN is to extract features from the input data through components such as convolution layers, pooling layers and fully connected layers, and to use these features for classification, recognition or other tasks. The general structure and main components of a CNN are as follows. Convolution layer: the convolution layer is one of the core components of a CNN. It extracts local features of an image by applying a set of learnable filters (convolution kernels) in a sliding-window operation over the input image. Each filter can detect a different feature in the input image, such as edges or texture, and a convolution layer can compute multiple filters in parallel to generate multiple feature maps. Activation function: an activation function is typically applied after the convolution operation to introduce nonlinearity. Common activation functions include ReLU, Sigmoid and Tanh, which increase the expressive power and nonlinear fitting ability of the network. Pooling layer: the pooling layer reduces the spatial dimensions of the feature map while preserving important features. Common pooling operations include max pooling and average pooling, which reduce computation, extract key features and enhance the translational invariance of the model. Fully connected layer: the fully connected layer is typically located after the convolution layers and maps the high-dimensional features to a probability distribution over the target classes. Each neuron in a fully connected layer is connected to all neurons in the previous layer, and feature combination and classification are achieved through learned weights and biases. Dropout layer: to prevent overfitting, Dropout layers are often used in CNNs. A Dropout layer randomly disables a portion of the neurons during training to reduce the dependency between neurons and improve the generalization ability of the model. The training of a CNN typically uses the back-propagation algorithm and a gradient-descent optimizer to minimize the loss function between the predicted output and the true labels. With large-scale training data and an appropriate network architecture, a CNN can learn abstract features of images and achieve excellent performance in tasks such as image classification, object detection, face recognition and image generation.
It should be noted that, in other specific examples of the present application, the neighborhood feature extraction may be performed on the face image of the user in other manners to obtain a face local feature map, for example: inputting a face image of a user; processing the input image by using a face detection algorithm to determine the face position and the bounding box existing in the image; and carrying out face alignment operation on each detected face. This may be achieved by aligning key points such as eyes, nose and mouth in the face image with specific reference points to ensure that the face has a consistent pose and position during subsequent processing; and dividing the aligned face image into a plurality of neighborhoods. Each neighborhood represents a local region of the face; for each neighborhood, a Convolutional Neural Network (CNN) or other feature extraction model is used to extract local features. This can be achieved by applying a convolution layer and an activation function on each neighborhood. The convolution layer applies sliding window operation to each neighborhood to extract local features; for each neighborhood, a pooling layer may be further applied to reduce the size of the feature map and preserve important features. Common pooling operations include maximum pooling and average pooling; and carrying out context aggregation on the local feature map of each neighborhood to obtain a more global face representation. This may be achieved by different methods, such as pyramid pooling, spatial pyramid pooling, or attention mechanisms, etc.; and combining the aggregated feature images to obtain a final facial local feature image. This may be achieved by stitching, overlaying or otherwise combining the feature maps of each neighborhood.
More specifically, the context information multi-scale aggregation unit 3212 is configured to perform context information multi-scale aggregation on the face local feature map to obtain the multi-scale context aggregated face feature map. In a convolutional neural network, the receptive field generally expands as the number of layers increases. However, in some complex scenes, such as images containing multiple categories of objects, the expansion of the receptive field may lead to incorrect aggregation of context information. In particular, as the receptive field expands, the network may blend features of objects or regions belonging to different categories. For example, the user's face image may contain information about non-target persons, such as the faces of passers-by, or background objects such as tables; if the receptive field is too large, the network may mix their features, making it difficult to accurately extract the facial feature information of the target person. Therefore, in the technical scheme of the application, context information multi-scale aggregation is performed on the face local feature map to obtain the multi-scale context aggregated face feature map. That is, context information at more scales is obtained during feature extraction and then fused, so that the facial feature information of the target person is extracted adaptively and more discriminative features are obtained. In particular, in one specific example of the present application, as shown in fig. 6, the context information multi-scale aggregation unit 3212 includes: a downsampling subunit 32121, configured to downsample the face local feature map using pooling kernels of different scales to obtain a plurality of face local pooled feature maps; and a fusion subunit 32122, configured to fuse the plurality of face local pooled feature maps using a context content encoder to obtain the multi-scale context aggregated face feature map.
The downsampling subunit 32121 is configured to downsample the face local feature map with pooling kernels of different scales to obtain a plurality of face local pooled feature maps. It should be appreciated that using pooling kernels of different scales makes it possible to capture the distribution of the implicit facial features at different scales. It is worth mentioning that, in computer vision, downsampling is an operation that reduces the size of an image or feature map: it lowers the resolution by applying some sampling method, thereby reducing the dimensionality of the data. During downsampling, a pooling operation is typically used, the most common being max pooling and average pooling. In this way, features at multiple scales can be extracted from the face local feature map, with each resulting feature map corresponding to a pooling kernel of a different scale. These face local pooled feature maps can be used for subsequent feature aggregation, classification or other face analysis tasks to improve the performance and robustness of the model.
Notably, a pooling kernel is a rectangular region of fixed size used in the pooling operation: it downsamples the input image or feature map and produces the downsampled result by aggregating the values within the kernel. Accordingly, in one possible implementation, the face local feature map may be downsampled with pooling kernels of different scales to obtain a plurality of face local pooled feature maps, for example as follows. Input: the face local feature map. Define pooling kernels of different scales: select several kernel sizes, which may be rectangular regions of different sizes. Downsample with each kernel: move the pooling kernel over the face local feature map with a fixed stride; within the kernel, extract the maximum or the average value (depending on whether max pooling or average pooling is used) as the downsampled value; and collect the extracted values into a new face local pooled feature map. Repeat this downsampling for each pooling kernel of a different scale to obtain a plurality of face local pooled feature maps. Output: a plurality of face local pooled feature maps, each corresponding to the downsampling result of a pooling kernel of a specific scale.
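To make the downsampling step concrete, here is a small PyTorch sketch that applies pooling kernels of several scales to one face local feature map; the kernel sizes (2, 4, 8) and the choice of average pooling are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def multi_scale_pool(local_feature_map: torch.Tensor, kernel_sizes=(2, 4, 8)):
        """Downsample one face local feature map with pooling kernels of
        several scales; returns one pooled feature map per scale."""
        pooled_maps = []
        for k in kernel_sizes:
            # average pooling; max pooling would be an equally valid choice
            pooled_maps.append(F.avg_pool2d(local_feature_map, kernel_size=k, stride=k))
        return pooled_maps

    maps = multi_scale_pool(torch.randn(2, 128, 28, 28))
    print([tuple(m.shape) for m in maps])  # [(2,128,14,14), (2,128,7,7), (2,128,3,3)]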
The fusion subunit 32122 is configured to fuse the plurality of face local pooled feature maps using a context content encoder to obtain the multi-scale context aggregated face feature map. That is, global context information interaction is performed by the context content encoder; in this way, the problem of incorrect context information aggregation can be remedied to some extent. In the technical scheme of the application, the encoding process of fusing the plurality of face local pooled feature maps with the context content encoder to obtain the multi-scale context aggregated face feature map comprises: vectorizing the plurality of face local pooled feature maps (performing global average pooling on each feature matrix) to obtain a plurality of face local pooled feature vectors; passing the plurality of face local pooled feature vectors through a transformer-based context content encoder to obtain a multi-scale context aggregated face feature vector; and reconstructing the multi-scale context aggregated face feature vector into the multi-scale context aggregated face feature map. Passing the plurality of face local pooled feature vectors through the transformer-based context content encoder to obtain the multi-scale context aggregated face feature vector comprises: arranging the plurality of face local pooled feature vectors one-dimensionally to obtain a global face local pooled feature vector; calculating the product between the global face local pooled feature vector and the transpose of each face local pooled feature vector to obtain a plurality of self-attention association matrices; normalizing each of the self-attention association matrices to obtain a plurality of normalized self-attention association matrices; passing each normalized self-attention association matrix through a Softmax classification function to obtain a plurality of probability values; weighting each face local pooled feature vector by the corresponding probability value to obtain a plurality of context-semantic face local pooled feature vectors; and concatenating the context-semantic face local pooled feature vectors to obtain the multi-scale context aggregated face feature vector. More specifically, the purpose of reconstructing the multi-scale context aggregated face feature vector is to restore or generate the multi-scale context aggregated face feature map for subsequent tasks such as face recognition and facial expression analysis. It is worth mentioning that vector reconstruction restores an encoded or compressed vector to the original high-dimensional form through an inverse operation or inverse transformation, so as to recover missing information or a representation close to the original vector; it is used in a variety of fields where a more accurate or near-original vector representation is needed.
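The fusion step can be sketched as follows, assuming PyTorch: each pooled map is global-average-pooled into one token and the tokens are fused with a standard transformer encoder. This is a simplified stand-in for the self-attention weighting procedure described above, not a reproduction of it; all names and dimensions are assumptions.

    import torch
    import torch.nn as nn

    class ContextContentEncoder(nn.Module):
        """Illustrative transformer-based fusion: each pooled map becomes one
        token via global average pooling, the tokens attend to each other,
        and the outputs are concatenated into a multi-scale context
        aggregated face feature vector."""
        def __init__(self, channels: int = 128, num_layers: int = 2, num_heads: int = 4):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

        def forward(self, pooled_maps):
            # pooled_maps: list of (B, C, h_i, w_i) tensors, one per pooling scale
            tokens = torch.stack([m.mean(dim=(2, 3)) for m in pooled_maps], dim=1)  # (B, S, C)
            fused = self.encoder(tokens)                                            # (B, S, C)
            return fused.flatten(1)                                                 # (B, S*C)

    pooled = [torch.randn(2, 128, s, s) for s in (14, 7, 3)]
    aggregated_vector = ContextContentEncoder()(pooled)  # shape (2, 384)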
It should be noted that, in other specific examples of the present application, context information multi-scale aggregation may also be performed on the face local feature map in other ways to obtain the multi-scale context aggregated face feature map, for example: input the face local feature map; define context window sizes for a plurality of scales, chosen according to the characteristics of the task and dataset (small-scale windows to capture details, large-scale windows to obtain broader context information); apply the context window of each scale to the face local feature map, moving it over the feature map by sliding or otherwise to obtain the features within the window; within each context window, extract context features using a convolutional neural network (CNN) or other feature extraction model, which can be achieved by applying a convolution layer and an activation function within the window; for each context window, a pooling layer may further be applied to reduce the size of the feature map while preserving important features, with common pooling operations including max pooling and average pooling; aggregate the features extracted within each context window to obtain the context features at that scale, using, for example, the average, the maximum or another combination of the feature maps; repeat the extraction and aggregation for the context window of each scale; and combine the context features at the different scales, by stitching, stacking or otherwise combining features of different scales, to obtain the multi-scale context aggregated face feature map.
It should be noted that, in other specific examples of the present application, the image feature extraction may be performed on the face image of the user in other manners to obtain a multi-scale context aggregation face feature map, for example: processing the input image by using a face detection algorithm to determine the face position and the bounding box existing in the image; and carrying out face alignment operation on each detected face. This may be achieved by aligning key points such as eyes, nose and mouth in the face image with specific reference points to ensure that the face has a consistent pose and position during subsequent processing; for the aligned face images, a Convolutional Neural Network (CNN) or other feature extraction model is used to extract multi-scale face features. These features may include shallow and deep feature representations to capture different levels of detail and contextual information; and carrying out context aggregation on the extracted multi-scale face features to obtain a more global face representation. This may be achieved by different methods, such as pyramid pooling, spatial pyramid pooling, or attention mechanisms, etc. The aggregated features comprehensively utilize the context information of different scales, so that the robustness and the discriminant of the face representation are improved; and converting the feature map after the context aggregation into a face feature map. This may be achieved by applying a suitable linear or non-linear transformation, for example using a fully connected layer, a convolution layer or an activation function, etc., to convert the feature map into a more compact, high-dimensional representation of the face features.
Specifically, the identity authentication module 322 is configured to determine whether the identity authentication is successful based on the user identity tag and the multi-scale context aggregated face feature map. In particular, in one specific example of the present application, as shown in fig. 7, the identity authentication module 322 includes: a classification result generating unit 3221, configured to pass the multi-scale context aggregation face feature map through a classifier to obtain a classification result, where the classification result is used to represent an identity tag corresponding to the user face image; and an authentication result generating unit 3222 configured to determine whether the authentication is successful based on the matching between the classification result and the user identity tag.
More specifically, the classification result generating unit 3221 is configured to pass the multi-scale context aggregated face feature map through a classifier to obtain a classification result, where the classification result is used to represent an identity tag corresponding to the face image of the user. That is, after the multi-scale context aggregation face feature map is obtained, the multi-scale context aggregation face feature map is further passed through a classifier to obtain a classification result. Specifically, the multi-scale context aggregation face feature map is unfolded to be a classification feature vector based on a row vector or a column vector; performing full-connection coding on the classification feature vectors by using a plurality of full-connection layers of the classifier to obtain coded classification feature vectors; and passing the coding classification feature vector through a Softmax classification function of the classifier to obtain the classification result.
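A minimal sketch of this classifier head, assuming PyTorch; the flattened input dimension, hidden width and number of registered identity labels are placeholder values, not figures from the patent.

    import torch
    import torch.nn as nn

    class IdentityClassifier(nn.Module):
        """Illustrative classifier head: the aggregated face features are
        flattened into a classification feature vector, encoded by fully
        connected layers, and normalized by Softmax into a probability per
        registered identity label."""
        def __init__(self, in_features: int = 384, num_identities: int = 1000):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(in_features, 256), nn.ReLU(inplace=True),
                nn.Linear(256, num_identities),
            )

        def forward(self, feature_vector: torch.Tensor) -> torch.Tensor:
            logits = self.fc(feature_vector)
            return torch.softmax(logits, dim=-1)  # probability of each identity label

    probs = IdentityClassifier()(torch.randn(2, 384))
    predicted_label = probs.argmax(dim=-1)  # classification result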
A Classifier (Classifier) refers to a machine learning model or algorithm that is used to classify input data into different categories or labels. The classifier is part of supervised learning, which performs classification tasks by learning mappings from input data to output categories.
The fully connected layer (Fully Connected Layer) is one type of layer commonly found in neural networks. In the fully connected layer, each neuron is connected to all neurons of the upper layer, and each connection has a weight. This means that each neuron in the fully connected layer receives inputs from all neurons in the upper layer, and weights these inputs together, and then passes the result to the next layer.
The Softmax classification function is a commonly used activation function for multi-classification problems. It converts each element of the input vector into a probability value between 0 and 1, and the sum of these probability values equals 1. The Softmax function is commonly used at the output layer of a neural network, and is particularly suited for multi-classification problems, because it can map the network output into probability distributions for individual classes. During the training process, the output of the Softmax function may be used to calculate the loss function and update the network parameters through a back propagation algorithm. Notably, the output of the Softmax function does not change the relative magnitude relationship between elements, but rather normalizes them. Thus, the Softmax function does not change the characteristics of the input vector, but simply converts it into a probability distribution form.
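A small numerical example of this behaviour, using PyTorch's softmax (the logits are arbitrary): the outputs are positive, preserve the ordering of the inputs and sum to 1.

    import torch

    logits = torch.tensor([2.0, 1.0, 0.1])
    probs = torch.softmax(logits, dim=0)
    print(probs)        # tensor([0.6590, 0.2424, 0.0986]) -- ordering of inputs preserved
    print(probs.sum())  # tensor(1.) -- probabilities sum to 1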
More specifically, the authentication result generating unit 3222 is configured to determine whether the authentication is successful based on the matching between the classification result and the user identity tag. That is, after the classification result is obtained, whether the identity authentication is successful is further determined based on the matching between the classification result and the user identity tag.
Accordingly, in one possible implementation, it may be determined whether the authentication is successful based on the matching between the classification result and the user identity tag, for example, by: and processing the face image of the user by using a feature extraction algorithm to extract corresponding face features. These features are typically a vector or feature descriptor representing the unique features of the face; and comparing the extracted face features with known user identity tags. This can be achieved in two ways: and (3) classification judgment: and inputting the face features into the classifier by using the trained classifier, and judging the identity class to which the face features belong. The classifier may be a Support Vector Machine (SVM), decision tree, neural network, etc.; similarity calculation: and calculating the similarity between the extracted face features and the features corresponding to the known identity tags. Common similarity calculation methods include euclidean distance, cosine similarity and the like; judging a matching result: judging whether the matching is successful or not according to the classification result or the similarity calculation result; if the classification result or the similarity calculation result is consistent with the user identity label or has higher similarity, namely the matching is successful, the identity authentication can be considered to be successful; if the classification result or the similarity calculation result is inconsistent with the user identity label or has low similarity, namely the matching fails, the identity authentication can be considered to fail.
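Both matching strategies described above can be sketched as below, assuming PyTorch; the confidence and similarity thresholds are illustrative assumptions rather than values specified by the application.

    import torch
    import torch.nn.functional as F

    def authenticate(probs: torch.Tensor, card_label: int,
                     min_confidence: float = 0.9) -> bool:
        """Classification-based check: succeed only if the predicted identity
        matches the label read from the identity card and the classifier's
        confidence clears a (hypothetical) threshold."""
        predicted = int(probs.argmax())
        return predicted == card_label and float(probs[predicted]) >= min_confidence

    def cosine_match(face_feat: torch.Tensor, enrolled_feat: torch.Tensor,
                     threshold: float = 0.8) -> bool:
        """Similarity-based check: compare the live face feature vector with
        the feature vector enrolled for the claimed identity."""
        return float(F.cosine_similarity(face_feat, enrolled_feat, dim=0)) >= threshold

    print(authenticate(torch.tensor([0.05, 0.93, 0.02]), card_label=1))  # True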
It should be noted that, in other specific examples of the present application, whether the identity authentication is successful may also be determined by other manners based on the user identity tag and the multi-scale context aggregate face feature map, for example: first, the user's identity tag information, such as a user name, ID number, or other identification information, needs to be collected. These information will be used to compare with the results in the authentication process; and acquiring a face image of the user through a camera or other face acquisition equipment. The quality of the collected image is ensured to be good enough for subsequent face detection and feature extraction; and detecting the acquired face image by using a face detection algorithm, and determining the position and the bounding box of the face. Then, carrying out alignment operation on the detected face so that the position and the posture of the face in the image are kept consistent, and improving the accuracy of subsequent feature extraction; and processing the aligned face images by using a feature extraction algorithm, and extracting a multi-scale context aggregation face feature map. The feature images contain multi-scale and context information of the human face, and can be used for subsequent identity authentication; and comparing the extracted multi-scale context aggregation face feature map with the identity tag of the user. Various matching or classification algorithms may be used to calculate the similarity between features or make classification decisions. If the feature map is successfully matched with the user identity tag or is judged by classification, the identity authentication is considered to be successful; otherwise, the identity authentication is considered to fail.
It should be noted that, in other specific examples of the present application, the authentication and processing of the user request may also be performed by other manners, for example: the touch screen display firstly displays a login interface, and requires a user to input identity credentials, such as a user name and a password, fingerprints, facial recognition and the like; the user inputs identity credential information on the touch screen and the system will verify the accuracy and validity of this information. The verification mode depends on the specific application and the equipment setting; and the system adopts corresponding processing measures according to the identity authentication result. If the authentication is successful, the user will be authorized to access the relevant functions and data; if authentication fails, the user may be required to reenter credentials or be denied access; if the identity authentication is successful, the touch screen display is switched to the main interface, and the functions of the equipment and options operable by the user are displayed. The main interface generally comprises elements such as icons, menus, buttons and the like, and a user can operate the main interface through a touch screen; the user performs operations such as clicking on icons, buttons, menus, etc., or gesture operations such as sliding, zooming, etc., through the touch screen interface. The touch screen display can sense and capture the operation of a user and convert the operation into a corresponding instruction or request; and the system performs corresponding processing according to the operation request of the user. This may involve performing specific functions, displaying relevant information, invoking other applications or services, etc.; after processing the user request, the touch screen display may provide feedback and a result display to inform the user of the status and result of the operation. Such as displaying a successful or failed message, popup dialog, changing the status of an interface element, etc.
In particular, the printer 330, which is communicatively coupled to the computer, is used to print documents. It will be appreciated that documents play an important role in business and financial activities, and they provide a basis for recording, reconciling and auditing transactions and activities, helping to ensure accuracy, legitimacy and traceability of transactions.
In particular, the card reader 340, which is communicatively connected to the computer, is used to extract the user identity tag from the user identity card. Among them, a card reader is a device for reading and analyzing information stored on various types of cards. It may communicate with the card by physical contact or wirelessly and transmit data on the card to a computer or other device. User identity tags refer to tags or identifiers used to identify and categorize user identity attributes. These tags may be used to distinguish between particular attributes, rights or roles of a user for authentication, authorization and access control in a system or application.
In particular, the fingerprint instrument 350, which is communicatively coupled to the computer, is used to collect user fingerprint information. A fingerprint instrument is a device for collecting and recognizing human fingerprints: by sensing and recording the shape, ridge pattern and characteristic points of a fingerprint, it converts them into digital data for subsequent comparison and identification.
In particular, the camera 360, which is communicatively connected to the computer, is used to capture images of the user's face. The main function of the camera is to capture facial images of users so as to perform face recognition and other applications. The face data can be acquired in a real-time video stream or still image mode and transmitted to a corresponding algorithm for analysis and processing.
In particular, the scanner 370, which is communicatively coupled to the computer, is used to scan user credentials. A scanner is a device used to convert paper documents, photographs or other planar objects into digitized images or text. It converts image information on an object into digitized data for storage, editing or sharing on a computer or other device through optical sensors and image processing techniques.
In particular, the voice recognition module 380, which is communicatively coupled to the computer, is used to recognize the voice instructions of the user. The speech recognition module is a technical module for converting human speech into text or commands. It may be converted to recognizable text form by analyzing the sound and intonation features in the speech signal for comprehension and processing by a computer or other device.
In particular, the speech synthesis module 390, which is communicatively coupled to the computer, is used to output speech prompts. The speech synthesis module is a technical module for converting text or other forms of information into audible human speech. It can convert text to speech and play it out through an audio output device (e.g., a speaker) to enable a computer or other device to interact verbally with the user.
It should be appreciated that training of the face local feature extractor, the context encoder, and the classifier based on the convolutional neural network model is required prior to the inference using the neural network model described above. That is, the village network co-building convenience service terminal 300 based on multiple identity authentications according to the present application further includes a training stage 400 for training the face local feature extractor, the context encoder and the classifier based on the convolutional neural network model.
Fig. 4 is a block diagram of the training phase of the rural network co-construction convenience service terminal based on multiple identity authentications according to an embodiment of the present application. As shown in fig. 4, the rural network co-construction convenience service terminal 300 based on multiple identity authentications according to the embodiment of the present application includes a training phase 400, which comprises: a training image acquisition unit 410, configured to acquire training data, where the training data includes a training user face image and the true value of the identity tag corresponding to the training user face image; a training convolution unit 420, configured to pass the training user face image through the face local feature extractor based on the convolutional neural network model to obtain a training face local feature map; a training image downsampling unit 430, configured to downsample the training face local feature map using pooling kernels of different scales to obtain a plurality of training face local pooled feature maps; a training multi-scale context aggregation unit 440, configured to fuse the plurality of training face local pooled feature maps using the context content encoder to obtain a training multi-scale context aggregated face feature map; a classification loss unit 450, configured to pass the training multi-scale context aggregated face feature map through the classifier to obtain a classification loss function value; and a training unit 460, configured to train the face local feature extractor based on the convolutional neural network model, the context content encoder and the classifier based on the classification loss function value, where, in each iteration of the training, a fine-grained density prediction search optimization of the weight space is performed on the multi-scale context aggregated face feature vector obtained by unfolding the multi-scale context aggregated face feature map.
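One ordinary training iteration with the classification loss can be sketched as follows, assuming PyTorch and reusing the illustrative modules above (FaceLocalFeatureExtractor, multi_scale_pool, ContextContentEncoder, IdentityClassifier); the fine-grained density prediction search optimization of the weight space is deliberately not reproduced here, since its exact update rule is given only by the patent's formula.

    import torch
    import torch.nn as nn

    def train_step(extractor, pooler, encoder, classifier, optimizer,
                   face_images, identity_labels):
        """One illustrative training iteration: extract features, fuse them,
        classify, and update all modules with the cross-entropy classification
        loss. The patent's fine-grained density prediction search optimization
        of the weight space is not reproduced here."""
        optimizer.zero_grad()
        local_map = extractor(face_images)      # training face local feature map
        pooled_maps = pooler(local_map)         # multi-scale downsampling
        fused_vec = encoder(pooled_maps)        # context aggregation
        logits = classifier.fc(fused_vec)       # raw scores; cross_entropy applies Softmax internally
        loss = nn.functional.cross_entropy(logits, identity_labels)
        loss.backward()
        optimizer.step()
        return float(loss)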
In particular, in the technical solution of the present application, the plurality of face local pooled feature maps are obtained by downsampling, with pooling kernels of different scales, the face local feature map produced by passing the user's face image through the face local feature extractor based on the convolutional neural network model. Thus, on top of the local image semantic association features expressed by the face local feature map, each feature value of each face local pooled feature map expresses spatially aggregated image semantic features at the corresponding pooling scale; and after the face local pooled feature maps are fused by the context content encoder, context association encoding at feature-map granularity is performed on the image semantic features of the corresponding aggregation scales. As a result, the multi-scale context aggregated face feature map has, in addition to the aggregation-scale semantic expression dimension at feature-value granularity, a semantic expression dimension across aggregation scales at feature-map granularity, i.e. it has super-resolution expression characteristics in a multi-dimensional context, which can affect the efficiency of classification by the classifier. Therefore, when the multi-scale context aggregated face feature map is trained by the classifier, in each iteration a fine-grained density prediction search optimization of the weight space is applied to the multi-scale context aggregated face feature vector V obtained by unfolding the multi-scale context aggregated face feature map. In this optimization, the weight matrix of the current iteration is updated from that of the previous iteration as a function of V, where M1 and M2 denote the weight matrices of the previous iteration and the current iteration respectively (at the first iteration they are set with different initialization strategies, for example M1 as a unit matrix and M2 as the diagonal matrix of the mean of the feature vectors to be classified), the corresponding global mean values of the involved feature vectors enter the update, and b is a bias vector, initially set, for example, as a unit vector. In this way, given the super-resolution expression characteristics of the multi-scale context aggregated face feature vector in the multi-dimensional context, the fine-grained density prediction search optimization of the weight space performs a feed-forward serialized mapping of the vector space onto which the multi-scale context aggregated face feature vector is projected; while providing a corresponding fine-grained weight search strategy for the dense prediction task in the weight space, it reduces the overall sequential complexity of the representation of the multi-scale context aggregated face feature vector within the weight search space, thereby improving training efficiency.
As described above, the rural network co-construction convenience service terminal 300 based on multiple identity authentications according to the embodiment of the present application may be implemented in various wireless terminals, for example a server running a rural network co-construction convenience service algorithm based on multiple identity authentications. In one possible implementation, the terminal 300 may be integrated into a wireless terminal as a software module and/or a hardware module. For example, it may be a software module in the operating system of the wireless terminal, or an application developed for the wireless terminal; of course, it may also be one of many hardware modules of the wireless terminal.
Alternatively, in another example, the rural network co-construction convenience service terminal 300 based on multiple identity authentications and the wireless terminal may be separate devices, and the terminal 300 may be connected to the wireless terminal through a wired and/or wireless network and transmit interactive information in an agreed data format.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (1)

1. A rural area network co-construction convenience service terminal based on multiple identity authentications, characterized by comprising:
a touch screen display for displaying a user interface and receiving a user request;
a computer communicatively coupled to the touch screen display for performing identity authentication and processing the user request;
a printer communicatively coupled to the computer for printing documents;
a card reader communicatively connected to the computer for extracting a user identity tag from a user identity card;
a fingerprint instrument communicatively connected to the computer for collecting fingerprint information of a user;
a camera communicatively connected to the computer for collecting a face image of the user;
a scanner communicatively coupled to the computer for scanning a user document;
a voice recognition module communicatively coupled to the computer for recognizing a voice command of a user;
a speech synthesis module communicatively coupled to the computer for outputting a speech prompt;
the computer includes:
the image feature extraction module is used for extracting image features of the face image of the user to obtain a multi-scale context aggregation face feature image;
the identity authentication module is used for determining whether the identity authentication is successful or not based on the user identity tag and the multi-scale context aggregation face feature map;
the image feature extraction module comprises:
the neighborhood feature extraction unit is used for extracting neighborhood features of the face image of the user to obtain a face local feature map;
the context information multi-scale aggregation unit is used for carrying out context information multi-scale aggregation on the face local feature map to obtain the multi-scale context aggregation face feature map;
the neighborhood feature extraction unit is configured to: the face image of the user passes through a face local feature extractor based on a convolutional neural network model to obtain the face local feature map;
The context information multiscale aggregation unit comprises:
a downsampling subunit, configured to downsample the face local feature map by using pooling check with different scales to obtain a plurality of face local pooling feature maps;
and a fusion subunit, configured to fuse the plurality of face local pooling feature maps using a context content encoder to obtain the multi-scale context aggregated face feature map;
the identity authentication module comprises:
the classification result generation unit is used for passing the multi-scale context aggregation face feature map through a classifier to obtain a classification result, wherein the classification result represents the identity tag corresponding to the face image of the user;
the identity authentication result generation unit is used for determining whether the identity authentication is successful or not based on the matching between the classification result and the user identity label;
the terminal also comprises a training module for training the face local feature extractor, the context content encoder and the classifier based on the convolutional neural network model;
wherein the training module comprises:
the training image acquisition unit is used for acquiring training data, wherein the training data comprise training user face images and the true values of identity labels corresponding to the training user face images;
the training convolution unit is used for passing the training user face image through the face local feature extractor based on the convolutional neural network model to obtain a training face local feature map;
the training image downsampling unit is used for downsampling the training face local feature map by using pooling kernels of different scales to obtain a plurality of training face local pooling feature maps;
the training multi-scale context aggregation unit is used for fusing the plurality of training face local pooling feature maps by using the context content encoder to obtain a training multi-scale context aggregation face feature map;
the classification loss unit is used for passing the training multi-scale context aggregation face feature map through the classifier to obtain a classification loss function value;
the training unit is used for training the face local feature extractor, the context content encoder and the classifier based on the convolutional neural network model based on the classification loss function value, wherein in each round of iteration of the training, fine granularity density prediction search optimization iteration of a weight space is carried out on the multi-scale context aggregation face feature vector obtained after the multi-scale context aggregation face feature map is unfolded;
The training unit is used for: carrying out fine granularity density prediction search optimization iteration of a weight space on the multi-scale context aggregation face feature vector obtained after the multi-scale context aggregation face feature map is unfolded by using the following optimization formula;
wherein, in the optimization formula, W and W′ are the weight matrices of the previous iteration and the current iteration, respectively, and at the first iteration different initialization strategies are adopted to set W and W′; V is the multi-scale context aggregation face feature vector; μ1 and μ2 respectively denote the global means of the corresponding feature vectors; b is a bias vector, initially set as a unit vector; ⊕ denotes position-wise addition, ⊙ denotes position-wise multiplication, and ⊗ denotes vector multiplication.
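As a companion to the claim, the following is a small sketch of the identity authentication decision under the same assumed shapes as the earlier snippets: the multi-scale context aggregation face feature map is passed through the classifier, and the predicted identity label is matched against the identity tag extracted from the user identity card by the card reader. The exact-match rule and the integer encoding of the tag are illustrative assumptions.

import torch
import torch.nn as nn

def authenticate(feature_map, classifier, card_identity_tag):
    # Sketch of the identity authentication module: classify the multi-scale
    # context aggregation face feature map, then check the predicted identity
    # label against the tag read from the user's identity card.
    logits = classifier(feature_map.flatten(1))
    predicted_label = int(logits.argmax(dim=1).item())
    return predicted_label == card_identity_tag

# Usage with toy values (label space and card tag are assumptions).
classifier = nn.Linear(64 * 28 * 28, 1000)
face_map = torch.randn(1, 64, 28, 28)
print(authenticate(face_map, classifier, card_identity_tag=42))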
CN202311096199.8A 2023-08-29 2023-08-29 Rural area network co-construction convenience service terminal based on multiple identity authentications Active CN117115928B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311096199.8A CN117115928B (en) 2023-08-29 2023-08-29 Rural area network co-construction convenience service terminal based on multiple identity authentications

Publications (2)

Publication Number Publication Date
CN117115928A CN117115928A (en) 2023-11-24
CN117115928B true CN117115928B (en) 2024-03-22

Family

ID=88808780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311096199.8A Active CN117115928B (en) 2023-08-29 2023-08-29 Rural area network co-construction convenience service terminal based on multiple identity authentications

Country Status (1)

Country Link
CN (1) CN117115928B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN206601735U (en) * 2016-09-29 2017-10-31 成都优势互动科技有限公司 Terminal for public convenience based on a variety of authentications
KR102240495B1 (en) * 2020-06-11 2021-04-15 주식회사 카카오뱅크 Method for managing abusing user about identification and authentication, and server for the method
CN112633085A (en) * 2020-12-08 2021-04-09 特斯联科技集团有限公司 Human face detection method, system, storage medium and terminal based on attention guide mechanism

Also Published As

Publication number Publication date
CN117115928A (en) 2023-11-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant