CN115273863A - Compound network class attendance system and method based on voice recognition and face recognition - Google Patents

Info

Publication number
CN115273863A
Authority
CN
China
Prior art keywords
attendance
face
voiceprint
information
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210662375.9A
Other languages
Chinese (zh)
Inventor
陈荣征
李浩能
李育廷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Vocational and Technical College
Original Assignee
Guangdong Vocational and Technical College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Vocational and Technical College
Priority to CN202210662375.9A
Publication of CN115273863A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/04 - Training, enrolment or model building
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/01 - Protocols
    • H04L67/12 - Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Game Theory and Decision Science (AREA)
  • Databases & Information Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an online class attendance system and method combining voice recognition and face recognition. A voiceprint recognition model is established based on the X-Vector and PLDA algorithms, and a face recognition model is established based on the YOLOv3 algorithm. During attendance checking, the original voice of a student reading displayed entry information aloud and the original video of the student are collected and preprocessed to obtain voiceprint feature information and a face feature image. A first attendance score is obtained from the voiceprint feature information through the voiceprint recognition model, a second attendance score is obtained from the face feature image through the face recognition model, and the final attendance result is obtained by combining the first attendance score and the second attendance score. The invention thus obtains two attendance scores through face recognition and voiceprint recognition and synthesizes them into the student's final attendance result.

Description

Compound network class attendance system and method based on voice recognition and face recognition
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a composite network class attendance system and method based on voice recognition and face recognition.
Background
Class attendance checking acquires, in some manner, the attendance of students for a specific course within a given time period. Attendance checking effectively improves campus management and supervises students in their coursework; moreover, student attendance is important data for evaluating classroom teaching effectiveness and an important parameter for measuring a student's comprehensive end-of-term performance. Modern classes are divided into online classes and offline classes: offline classes are taught face-to-face between students and teachers, while online classes deliver course study through network courses. The traditional attendance methods of offline classes include manual roll call, manual attendance statistics, and mobile-terminal check-in. In some situations, offline classes must be moved online; in that case it is difficult for a teacher to check the attendance of every student in the class, which hampers the teacher's assessment of classroom teaching effectiveness, reduces students' enthusiasm for course learning, and easily leads to slack and lazy study habits.
Disclosure of Invention
The invention aims to provide a composite network class attendance system and method based on voice recognition and face recognition, so as to solve one or more technical problems in the prior art, or at least provide a beneficial alternative.
The solution adopted by the invention to solve the technical problem is as follows: a composite network class attendance system based on voice recognition and face recognition comprises: a communication module, a cloud server, a data acquisition module, a first processing module, a second processing module, a voiceprint recognition module, a face recognition module and an attendance analysis module, wherein the cloud server is connected with the attendance analysis module and the communication module respectively, and the first and second processing modules are connected with the data acquisition module, the voiceprint recognition module and the face recognition module respectively;
the communication module is used for acquiring attendance request information of a teacher and sending the attendance request information to the cloud server;
the cloud server is used for acquiring attendance request information, selecting entry information from the random entry database, displaying the entry information on a display screen used by the student and prompting the student to read aloud;
the cloud server stores a student face database and a student voiceprint sample library; the student face database comprises frontal face images of all students and annotation information corresponding to each frontal face image, and the student voiceprint sample library comprises voiceprint sample information of all students and the student name corresponding to each piece of voiceprint sample information;
the data acquisition module comprises a microphone and a camera; the microphone is used for collecting the original voice of the student reading the entry information aloud, and the camera is used for collecting the original video of the student while reading;
the first processing module is used for preprocessing original voice information, screening effective voice information, extracting the characteristics of the effective voice information and outputting voiceprint characteristic information;
the second processing module is used for capturing an image frame from the original video at an arbitrary moment, denoising the image frame, and performing face detection and facial feature point extraction on the denoised image frame to generate a face feature image;
the voiceprint recognition module is used for constructing a voiceprint recognition model, and the voiceprint recognition model is used for carrying out voiceprint recognition on the voiceprint characteristic information and outputting a first attendance score;
the face recognition module is used for constructing a face recognition model, and the face recognition model carries out face recognition on the face characteristic image to obtain a second attendance score;
the attendance analysis module is used for assigning a first weight to the first attendance score and a second weight to the second attendance score, and summing the weighted first attendance score and the weighted second attendance score to obtain an attendance comprehensive score;
the attendance analysis module is also used for presetting an attendance qualified score, outputting an attendance result of the student to the cloud server according to the attendance comprehensive score, and storing the attendance result by the cloud server.
As a further improvement of the above technical solution, the first processing module stores a voiceprint feature processing program, and the voiceprint feature processing program comprises: performing pre-emphasis, framing, frame shifting and windowing on the original voice information to generate first voice information; and extracting features of the first voice information through fast Fourier transform, a filter bank and discrete cosine transform to generate the voiceprint feature information;

wherein the pre-emphasis process satisfies the following formula:

Y[n] = X[n] - βX[n-1];

wherein Y[n] represents the pre-emphasized voice information, X[n] is the nth sampling point of the original voice information, X[n-1] is the (n-1)th sampling point of the original voice information, and β is a constant with β ∈ [0.9, 1.0];

the windowing process satisfies the following formulas:

T[n] = Y[n]·f[n];

f[n] = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1;

wherein T[n] represents the first voice information, f[n] is a Hamming window function, and N is the window width.
As a further improvement of the above technical solution, the second processing module stores a face feature processing program, and the face feature processing program comprises: capturing an image frame from the original video at an arbitrary moment; high-pass filtering the image frame through the Laplacian operator, and denoising the high-pass-filtered image frame through median filtering; and performing face detection and facial feature point extraction on the denoised image frame to generate the face feature image;

wherein the Laplacian operator satisfies the following formula:

∇²f(x, y) = ∂²f/∂x² + ∂²f/∂y²;

wherein ∇² denotes the Laplacian operator and f(x, y) is the image frame;

the median filtering process satisfies the following formula:

g(x, y) = med{ f_H(x-k, y-l) | (k, l) ∈ A };

wherein g(x, y) is the denoised image frame, f_H(x, y) is the high-pass-filtered image frame, and A is a two-dimensional template selected as a 3 × 3 area.
A composite network class attendance method based on voice recognition and face recognition, applied to the composite network class attendance system based on voice recognition and face recognition, comprises the following steps:

the communication module acquires attendance request information of a teacher; after receiving the attendance request information, the cloud server randomly selects entry information from the random entry library and prompts the student to read it aloud, and the data acquisition module collects the original voice of the student reading aloud and the original video of the student while reading;

the first processing module preprocesses the original voice information to generate first voice information, extracts features of the first voice information, and outputs voiceprint feature information;

the second processing module captures an image frame from the original video at an arbitrary moment, denoises the image frame, and performs face detection and facial feature point extraction on the denoised image frame to generate a face feature image;

the voiceprint recognition module constructs a voiceprint recognition model based on the X-Vector algorithm and the PLDA algorithm, and the voiceprint recognition model performs voiceprint recognition on the voiceprint feature information and outputs a first attendance score;

the face recognition module constructs a face recognition model based on the YOLOv3 algorithm, and the face recognition model performs face recognition on the face feature image and outputs a second attendance score;

the attendance analysis module assigns a first weight to the first attendance score and a second weight to the second attendance score, and sums the weighted first attendance score and the weighted second attendance score to obtain an attendance comprehensive score;

the attendance analysis module presets an attendance qualified score, outputs the student's attendance result to the cloud server according to the attendance comprehensive score, and the cloud server stores the attendance result.
As a further improvement of the foregoing technical solution, extracting features of the first voice information and outputting the voiceprint feature information comprises:

performing fast Fourier transform on the first voice information to obtain the linear spectrum of the first voice information; wherein the linear spectrum satisfies the following formula:

U(k) = Σ_{n=0}^{N-1} T[n]·e^(-j2πnk/N), 0 ≤ k ≤ N-1;

wherein U(k) is the linear spectrum of the first voice information, T[n] is the first voice information, and N is the window width of the window function used in the fast Fourier transform;

taking the modulus of the linear spectrum of the first voice information and squaring it to obtain the discrete power spectrum of the first voice information; wherein the discrete power spectrum satisfies the following formula:

P(k) = |U(k)|²;

wherein P(k) is the discrete power spectrum of the first voice information;

constructing a Gammatone filter bank, and integrating the discrete power spectrum over frequency through the Gammatone filter bank; wherein the time-domain impulse response of the Gammatone filter bank satisfies the following formula:

h(t) = c·t^(n-1)·e^(-2πbt)·cos(2πf₀t + φ), t ≥ 0;

wherein c is a proportionality coefficient, n is the filter order, b is the time attenuation coefficient, f₀ is the center frequency of the Gammatone filter, and φ is the phase of the Gammatone filter;

calculating the long-time frame power of the first voice information, and masking and suppressing noise other than the human voice; wherein the long-time frame power satisfies the following formula:

Q(i, j) = (1/(2M+1))·Σ_{i'=i-M}^{i+M} P(i', j);

wherein Q(i, j) represents the long-time frame power, and P(i', j) represents the power spectrum of the current frame and of each of the M frames before and after it;

normalizing in the time domain and the frequency domain;

and raising the time-frequency-normalized power spectrum to a nonlinear power and reducing its dimension through discrete cosine transform to finally obtain the PNCC coefficients, which represent the voiceprint feature information.
As a further improvement of the above technical solution, the second processing module captures an image frame from the original video at an arbitrary moment, denoises the image frame, and performs face detection and facial feature point extraction on the denoised image frame to generate a face feature image, comprising:

the second processing module captures an image frame from the original video at an arbitrary moment, high-pass filters the image frame, and denoises the high-pass-filtered image frame through median filtering to obtain a first image;

and performing face detection and facial feature point extraction on the first image through a multi-task cascaded convolutional neural network to generate the face feature image.
As a further improvement of the above technical solution, the multi-task cascaded convolutional neural network comprises a recommendation network, an optimization network and an output network, wherein the recommendation network is configured to perform regression prediction on the first image, merge candidate windows through non-maximum suppression, and output first candidate boxes; the optimization network is configured to filter out non-face candidate windows among the first candidate boxes through a multilayer convolutional neural network and, after training through a fully connected layer, output second candidate boxes; and the output network is configured to filter out overlapping candidate windows among the second candidate boxes and output the face feature image.
As a further improvement of the above technical solution, the voiceprint recognition module constructs a voiceprint recognition model based on the X-Vector algorithm and the PLDA algorithm, and the voiceprint recognition model performs voiceprint recognition on the voiceprint feature information and outputs a first attendance score, comprising the following steps:
the voiceprint recognition module acquires a student voiceprint sample library through the cloud server, establishes an X-Vector model according to the student voiceprint sample library and an X-Vector algorithm, and outputs an X-Vector feature Vector corresponding to the student voiceprint sample library;
according to the X-Vector feature Vector corresponding to the student voiceprint sample library, the voiceprint recognition module establishes a PLDA model based on a PLDA algorithm, trains the PLDA model through an EM algorithm and generates a voiceprint recognition model;
the voiceprint recognition module inputs the voiceprint feature information into the X-Vector model to obtain an X-Vector feature Vector corresponding to the voiceprint feature information, inputs the X-Vector feature Vector corresponding to the voiceprint feature information into the voiceprint recognition model, and outputs a first attendance score.
As a further improvement of the above technical solution, the face recognition module constructs a face recognition model based on the YOLOv3 algorithm, and the face recognition model performs face recognition on the face feature image and outputs a second attendance score, comprising the following steps:

the face recognition module calls the student face database stored in the cloud server, divides the student face database into a first training set and a first test set according to a ratio of 8:2, and trains a YOLOv3 network model with the first training set to obtain the face recognition model;

the face recognition module sets a second performance evaluation index, and obtains performance parameters of the face recognition model through the first test set;

and the face recognition model acquires the face feature image; the face feature image is input into the face recognition model, and a second attendance score is output.
The beneficial effects of the invention are as follows: the invention discloses a composite network class attendance system and method based on voice recognition and face recognition, wherein a voiceprint recognition model is established based on the X-Vector and PLDA algorithms and a face recognition model is established based on the YOLOv3 algorithm; from the collected original voice of a student reading entry information aloud during attendance checking and the original video of the student, a first attendance score is obtained through the voiceprint recognition model and a second attendance score is obtained through the face recognition model, and the two attendance scores are synthesized into the student's final attendance result. The invention facilitates teachers' online class attendance checking, reduces the workload of that checking, makes student attendance results more accurate, and improves the confidence of the attendance results.
Drawings
In order to more clearly illustrate the technical solution in the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below. Clearly, the described drawings illustrate only some embodiments of the invention, not all of them, and a person skilled in the art can derive other designs and drawings from them without inventive effort.
FIG. 1 is a method flow chart of the composite network class attendance method based on voice recognition and face recognition;

FIG. 2 is a method flow chart of obtaining voiceprint feature information in the composite network class attendance method based on voice recognition and face recognition;

FIG. 3 is a method flow chart of obtaining a face feature image in the composite network class attendance method based on voice recognition and face recognition;

FIG. 4 is a method flow chart of face detection and facial feature point extraction by the multi-task cascaded convolutional neural network in the composite network class attendance method based on voice recognition and face recognition;

FIG. 5 is a method flow chart of constructing a voiceprint recognition model and obtaining a first attendance score in the composite network class attendance method based on voice recognition and face recognition;

FIG. 6 is a method flow chart of constructing a face recognition model and obtaining a second attendance score in the composite network class attendance method based on voice recognition and face recognition.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It is noted that while a division of functional blocks is depicted in the system diagram, and logical order is depicted in the flowchart, in some cases the steps depicted and described may be performed in a different order than the division of blocks in the system or the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
A composite network class attendance system based on voice recognition and face recognition comprises: a communication module, a cloud server, a data acquisition module, a first processing module, a second processing module, a voiceprint recognition module, a face recognition module and an attendance analysis module, wherein the cloud server is connected with the attendance analysis module and the communication module respectively, and the first and second processing modules are connected with the data acquisition module, the voiceprint recognition module and the face recognition module respectively;
the communication module is used for acquiring attendance request information of a teacher and sending the attendance request information to the cloud server;
the cloud server is used for acquiring attendance request information, selecting entry information from the random entry database, displaying the entry information on a display screen used by the student and prompting the student to read aloud;
the data acquisition module comprises a microphone and a camera; the microphone is used for collecting the original voice of the student reading the entry information aloud, and the camera is used for collecting the original video of the student while reading;
the first processing module is used for preprocessing original voice information, screening effective voice information, extracting the characteristics of the effective voice information and outputting voiceprint characteristic information;
the second processing module is used for capturing an image frame from the original video at an arbitrary moment, the image frame being a frontal face image of the student, denoising the image frame, and performing face detection and facial feature point extraction on the denoised image frame to generate a face feature image;
the voiceprint recognition module is used for constructing a voiceprint recognition model, and the voiceprint recognition model is used for carrying out voiceprint recognition on the voiceprint characteristic information and outputting a first attendance score;
the face recognition module is used for constructing a face recognition model, and the face recognition model carries out face recognition on the face characteristic image to obtain a second attendance score;
the attendance analysis module is used for assigning a first weight to the first attendance score and a second weight to the second attendance score, and summing the weighted first attendance score and the weighted second attendance score to obtain an attendance comprehensive score;
the attendance analysis module is also used for judging, according to a preset attendance qualified score, whether the attendance comprehensive score is smaller than the attendance qualified score; if so, the student's attendance is regarded as not passing, and the student's attendance result is uploaded to the cloud server for storage.
Further, the first processing module stores a voiceprint feature processing program, which comprises: performing pre-emphasis, framing, frame shifting and windowing on the original voice information to generate first voice information; and extracting features of the first voice information through fast Fourier transform, a filter bank and discrete cosine transform to generate the voiceprint feature information.

Further, the second processing module stores a face feature processing program, which comprises: capturing an image frame from the original video at an arbitrary moment; high-pass filtering the image frame through the Laplacian operator and denoising the high-pass-filtered image frame through median filtering; and performing face detection and facial feature point extraction on the denoised image frame to generate the face feature image.
The invention also discloses a composite network class attendance method based on voice recognition and face recognition, which is applied to the composite network class attendance system based on voice recognition and face recognition described above. Referring to fig. 1 to 6, the method comprises the following steps:

S100, the communication module acquires attendance request information of a teacher; after receiving the attendance request information, the cloud server randomly selects entry information from the random entry library and prompts the student to read it aloud, and the data acquisition module collects the original voice of the student reading aloud and the original video of the student while reading;

S200, the first processing module preprocesses the original voice information to generate first voice information, extracts features of the first voice information, and outputs voiceprint feature information;

S300, the second processing module captures an image frame from the original video at an arbitrary moment, denoises the image frame, and performs face detection and facial feature point extraction on the denoised image frame to generate a face feature image;

S400, the voiceprint recognition module constructs a voiceprint recognition model based on the X-Vector algorithm and the PLDA algorithm, and the voiceprint recognition model performs voiceprint recognition on the voiceprint feature information and outputs a first attendance score;

S500, the face recognition module constructs a face recognition model based on the YOLOv3 algorithm, and the face recognition model performs face recognition on the face feature image and outputs a second attendance score;

S600, the attendance analysis module assigns a first weight to the first attendance score and a second weight to the second attendance score, and sums the weighted first attendance score and the weighted second attendance score to obtain an attendance comprehensive score;

S700, the attendance analysis module presets an attendance qualified score, outputs the student's attendance result to the cloud server according to the attendance comprehensive score, and the cloud server stores the attendance result.
Further, in step S100, the communication module acquires attendance request information of a teacher; the cloud server calls an arbitrary piece of entry information in the random entry library according to the attendance request information and prompts the student to read it aloud, the random entry library comprising a plurality of preset pieces of entry information, each containing 10 to 20 words; and the data acquisition module collects the original voice of the student reading the entry information aloud and synchronously collects the original video of the student while reading.
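By way of illustration only, the random entry selection performed by the cloud server in step S100 can be sketched as follows in Python; the library contents and the function name pick_attendance_entry are hypothetical and not part of the claimed system:

    import random

    # Hypothetical stand-in for the random entry library; each preset entry
    # contains roughly 10 to 20 words, as described above.
    ENTRY_LIBRARY = [
        "the quick brown fox jumps over the lazy dog near the quiet river bank today",
        "every student reads one short sentence aloud so the system can hear a clear voice sample",
    ]

    def pick_attendance_entry(library=ENTRY_LIBRARY):
        """Select one entry at random to display on the student's screen."""
        return random.choice(library)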
In this embodiment, the data acquisition module is a microphone and a camera of a device used by the student during the class session.
Further, in step S200, the first processing module acquires the original voice information and preprocesses it; the preprocessing step is: performing pre-emphasis, framing, frame shifting and windowing on the original voice information to generate first voice information; wherein the pre-emphasis process satisfies the following formula:

Y[n] = X[n] - βX[n-1];

wherein Y[n] represents the pre-emphasized voice information, X[n] is the nth sampling point of the original voice information, X[n-1] is the (n-1)th sampling point of the original voice information, and β is a constant with β ∈ [0.9, 1.0];

the windowing process satisfies the following formulas:

T[n] = Y[n]·f[n];

f[n] = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1;

wherein T[n] represents the first voice information, f[n] is a Hamming window function, and N is the window width.
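A minimal Python sketch of this preprocessing chain is given below; the frame length and frame shift (25 ms and 10 ms at a 16 kHz sampling rate) and the value of β are assumed values, not prescribed by the embodiment:

    import numpy as np

    def preprocess_speech(x, beta=0.97, frame_len=400, hop=160):
        """Pre-emphasis, framing with frame shift, and Hamming windowing."""
        # Pre-emphasis: Y[n] = X[n] - beta * X[n-1]
        y = np.append(x[0], x[1:] - beta * x[:-1])
        if len(y) < frame_len:                      # pad very short signals
            y = np.pad(y, (0, frame_len - len(y)))
        # Framing with a frame shift of `hop` samples
        n_frames = 1 + (len(y) - frame_len) // hop
        frames = np.stack([y[i * hop : i * hop + frame_len] for i in range(n_frames)])
        # Windowing: T[n] = Y[n] * f[n], with f[n] the Hamming window
        return frames * np.hamming(frame_len)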
In this embodiment, the first processing module extracts features of the first voice information through the PNCC algorithm. Compared with the traditional MFCC feature extraction algorithm, the PNCC algorithm replaces the logarithmic nonlinearity of the MFCC coefficients with a power-law nonlinearity, and adds a noise suppression algorithm based on asymmetric filtering to suppress background excitation, together with a temporal masking module, to improve speech recognition in noisy scenes.
Referring to fig. 2, extracting features of the first voice information through the PNCC algorithm comprises the following steps:

S210, performing fast Fourier transform on the first voice information to obtain the linear spectrum of the first voice information; wherein the linear spectrum satisfies the following formula:

U(k) = Σ_{n=0}^{N-1} T[n]·e^(-j2πnk/N), 0 ≤ k ≤ N-1;

wherein U(k) is the linear spectrum of the first voice information, T[n] is the first voice information, and N is the window width of the window function used in the fast Fourier transform;

S220, taking the modulus of the linear spectrum of the first voice information and squaring it to obtain the discrete power spectrum of the first voice information; wherein the discrete power spectrum satisfies the following formula:

P(k) = |U(k)|²;

wherein P(k) is the discrete power spectrum of the first voice information;

S230, constructing a Gammatone filter bank, and integrating the discrete power spectrum over frequency through the Gammatone filter bank; wherein the time-domain impulse response of the Gammatone filter bank satisfies the following formula:

h(t) = c·t^(n-1)·e^(-2πbt)·cos(2πf₀t + φ), t ≥ 0;

wherein c is a proportionality coefficient, n is the filter order, b is the time attenuation coefficient, f₀ is the center frequency of the Gammatone filter, and φ is the phase of the Gammatone filter;

S240, calculating the long-time frame power of the first voice information, and masking and suppressing noise other than the human voice; wherein the long-time frame power satisfies the following formula:

Q(i, j) = (1/(2M+1))·Σ_{i'=i-M}^{i+M} P(i', j);

wherein Q(i, j) represents the long-time frame power, and P(i', j) represents the power spectrum of the current frame and of each of the M frames before and after it;

S250, normalizing in the time domain and the frequency domain; wherein the normalization process satisfies the following formulas:

V(i, j) = (1/(a-b+1))·Σ_{j'=b}^{a} F(i, j')/Q(i, j');

a = min(j+N, J);

b = max(j-N, 1);

P_norm(i, j) = P(i, j)·V(i, j);

wherein F(i, j) is the power remaining after suppressing noise other than the human voice, J is the number of Gammatone channels, N is the half-width of the frequency smoothing window, and P_norm(i, j) is the power spectrum after time-frequency normalization;

S260, raising the time-frequency-normalized power spectrum to a nonlinear power and reducing its dimension through discrete cosine transform to finally obtain the PNCC coefficients, which represent the voiceprint feature information.
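The chain S210 to S260 can be condensed into the following Python sketch. It assumes a precomputed Gammatone filterbank matrix fbank of shape (n_channels, frames.shape[1]//2 + 1), replaces the full asymmetric noise suppression and masking stages with a simple medium-time smoothing placeholder, and uses the 1/15 power-law exponent commonly associated with PNCC, so it is an approximation of the described algorithm rather than a faithful implementation:

    import numpy as np
    from scipy.fft import dct

    def pncc_like_features(frames, fbank, n_ceps=13):
        """Simplified PNCC-style features from windowed frames (steps S210-S260)."""
        spec = np.fft.rfft(frames, axis=1)              # S210: fast Fourier transform
        power = np.abs(spec) ** 2                       # S220: discrete power spectrum
        chan = power @ fbank.T                          # S230: Gammatone integration
        smoothed = np.empty_like(chan)                  # S240/S250 placeholder:
        for i in range(chan.shape[0]):                  # medium-time smoothing over
            lo, hi = max(0, i - 2), min(chan.shape[0], i + 3)  # +/- 2 frames
            smoothed[i] = chan[lo:hi].mean(axis=0)
        nonlin = smoothed ** (1.0 / 15.0)               # S260: power-law nonlinearity
        return dct(nonlin, type=2, axis=1, norm="ortho")[:, :n_ceps]  # DCT reduction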
Further, referring to fig. 3, in step S300, the second processing module captures an image frame from the original video at an arbitrary moment, denoises the image frame, and performs face detection and facial feature point extraction on the denoised image frame to generate a face feature image, comprising the following steps:

S310, the second processing module captures an image frame from the original video at an arbitrary moment, high-pass filters the image frame, and denoises the high-pass-filtered image frame through median filtering to obtain a first image;

wherein the Laplacian operator used for high-pass filtering satisfies the following formula:

∇²f(x, y) = ∂²f/∂x² + ∂²f/∂y²;

wherein ∇² denotes the Laplacian operator and f(x, y) is the image frame;

the median filtering process satisfies the following formula:

g(x, y) = med{ f_H(x-k, y-l) | (k, l) ∈ A };

wherein g(x, y) is the denoised image frame, f_H(x, y) is the high-pass-filtered image frame, and A is a two-dimensional template selected as a 3 × 3 area;
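An OpenCV rendering of step S310 might look as follows; cv2.Laplacian and cv2.medianBlur stand in for the Laplacian high-pass filter and the 3 × 3 median template A above, and the frame-grabbing helper is an illustrative assumption:

    import cv2

    def grab_frame(video_path):
        """Capture one image frame from the original video at an arbitrary moment."""
        cap = cv2.VideoCapture(video_path)
        ok, frame = cap.read()
        cap.release()
        if not ok:
            raise RuntimeError("could not read a frame from the video")
        return frame

    def denoise_frame(frame):
        """Laplacian high-pass filtering followed by 3x3 median filtering (S310)."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        lap = cv2.Laplacian(gray, cv2.CV_64F, ksize=3)  # high-pass filter
        lap8 = cv2.convertScaleAbs(lap)                 # back to 8-bit range
        return cv2.medianBlur(lap8, 3)                  # median filter, 3x3 template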
and S320, performing face detection and face feature point extraction on the first image through the multitask cascade convolution neural network to generate a face feature image.
In this embodiment, the Multi-task Cascaded Convolutional Network (MTCNN) is a convolutional neural network that handles face detection and facial feature point localization simultaneously. It comprises three multi-task convolutional neural networks: a recommendation network (P-Net), an optimization network (R-Net) and an output network (O-Net), each of which has three learning tasks: face classification, bounding box regression, and facial feature point localization.

The invention uses the pre-trained multi-task cascaded convolutional neural network to perform face detection and facial feature point extraction on the first image; before being input into the network, the first image must be preprocessed into a format that conforms to the network's input requirements.
Further, referring to fig. 4, the process of face detection and facial feature point extraction is divided into three stages:

S321, the recommendation network performs regression prediction on the first image, merges candidate windows through non-maximum suppression, and outputs first candidate boxes.

Specifically, in step S321, the first image is input into the recommendation network, which obtains candidate windows and the bounding box regression vectors of the face regions, performs regression prediction with the bounding boxes to calibrate the candidate windows, and merges them through non-maximum suppression (NMS) to output the first candidate boxes;

S322, the optimization network filters out non-face candidate windows among the first candidate boxes through a multilayer convolutional neural network and outputs second candidate boxes after training through a fully connected layer.

Specifically, in step S322, the first candidate boxes obtained by the recommendation network are used as the input of the optimization network; the optimization network filters out most non-face candidate windows through the multilayer convolutional neural network, trains through a fully connected layer, fine-tunes the candidate boxes using the bounding box regression vectors, and finally removes overlapping candidate windows through non-maximum suppression (NMS) to output the second candidate boxes;

S323, the output network filters out overlapping candidate windows among the second candidate boxes and outputs the face feature image.

Specifically, in step S323, the second candidate boxes obtained from the optimization network are used as the input of the output network; the output network removes the overlapping candidate windows among the second candidate boxes, finally completes face detection on the first image, and outputs the face feature image.
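The three-stage P-Net/R-Net/O-Net pipeline is available in open-source form; the sketch below uses the third-party mtcnn Python package as a stand-in for the pre-trained network described above (the package choice and the cropping logic are illustrative assumptions, not part of the claimed method):

    import cv2
    from mtcnn import MTCNN  # pip install mtcnn; bundles pre-trained P-Net, R-Net, O-Net

    def extract_face_feature_image(first_image_bgr):
        """Run the cascaded detector on the first image and crop the face region."""
        detector = MTCNN()
        results = detector.detect_faces(cv2.cvtColor(first_image_bgr, cv2.COLOR_BGR2RGB))
        if not results:
            return None, None
        best = max(results, key=lambda r: r["confidence"])  # most confident face
        x, y, w, h = best["box"]
        x, y = max(x, 0), max(y, 0)
        keypoints = best["keypoints"]  # eyes, nose and mouth corners (feature points)
        return first_image_bgr[y : y + h, x : x + w], keypoints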
Further, in step S400, a voiceprint recognition model is constructed based on the X-Vector deep neural network (DNN) and the PLDA (Probabilistic Linear Discriminant Analysis) algorithm, and voiceprint recognition is performed on the voiceprint feature information through the constructed voiceprint recognition model to obtain a first attendance score. Referring to fig. 5, the steps of constructing the voiceprint recognition model and performing voiceprint recognition on the voiceprint feature information are as follows:
s410, the voiceprint recognition module acquires a student voiceprint sample library through the cloud server, establishes an X-Vector model according to the student voiceprint sample library and an X-Vector algorithm, and outputs an X-Vector feature Vector corresponding to the student voiceprint sample library;
s420, according to the X-Vector feature vectors corresponding to the student voiceprint sample library, the voiceprint recognition module establishes a PLDA model based on a PLDA algorithm, trains the PLDA model through an EM algorithm and generates a voiceprint recognition model;
s430, the voiceprint recognition module inputs the voiceprint feature information into the X-Vector model to obtain an X-Vector feature Vector corresponding to the voiceprint feature information, inputs the X-Vector feature Vector corresponding to the voiceprint feature information into the voiceprint recognition model, and outputs a first attendance score.
Specifically, in step S410, the cloud server stores a student voiceprint sample library, the student voiceprint sample library includes a plurality of voiceprint sample information of all students and a student name corresponding to each voiceprint sample information, and each voiceprint sample information is subjected to feature extraction processing in advance through a PNCC algorithm.
In the present application, the X-Vector model is trained through a deep neural network structure; the X-Vector model accepts input of arbitrary length and converts it into a fixed-length feature representation, and a data augmentation strategy is adopted during training to strengthen the model's robustness to noise interference. The X-Vector model can be divided into nine layers: the first to fifth layers are frame-level deep convolutional neural network layers, the sixth layer is a statistics pooling layer, the seventh and eighth layers are segment-level fully connected layers, and the ninth layer is a classification layer based on a Softmax classifier. During training of the X-Vector model, the frame-level layers process the time sequence of each sample in the student voiceprint sample library, the statistics pooling layer aggregates the frame-level outputs over time, the segment-level fully connected layers extract the X-Vector feature vectors corresponding to all sample data in the student voiceprint sample library, and the classification layer produces the output.
The loss function of the X-Vector model satisfies the following formula:

F(P, Q) = -Σ_i P(x_i)·log Q(x_i);

wherein F(P, Q) represents the loss function of the X-Vector model, P(x_i) represents the probability distribution of the ith sample value in the student voiceprint sample library, and Q(x_i) represents the probability distribution predicted by the X-Vector model for the ith sample in the student voiceprint sample library.
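A compact PyTorch sketch of the nine-layer structure follows; the layer widths, kernel sizes and speaker count are illustrative values in the spirit of the published X-Vector architecture, not parameters fixed by this embodiment, and the loss is the cross-entropy F(P, Q) above:

    import torch
    import torch.nn as nn

    class XVectorNet(nn.Module):
        """Five frame-level TDNN layers, statistics pooling, two segment-level
        fully connected layers, and a Softmax-based classification layer."""
        def __init__(self, feat_dim=13, n_speakers=500):
            super().__init__()
            self.frame_layers = nn.Sequential(           # layers 1-5 (frame level)
                nn.Conv1d(feat_dim, 512, 5), nn.ReLU(),
                nn.Conv1d(512, 512, 3, dilation=2), nn.ReLU(),
                nn.Conv1d(512, 512, 3, dilation=3), nn.ReLU(),
                nn.Conv1d(512, 512, 1), nn.ReLU(),
                nn.Conv1d(512, 1500, 1), nn.ReLU(),
            )
            self.segment6 = nn.Linear(3000, 512)         # layer 7; x-vectors read here
            self.segment7 = nn.Linear(512, 512)          # layer 8
            self.out = nn.Linear(512, n_speakers)        # layer 9 (Softmax classifier)

        def forward(self, feats):                        # feats: (batch, feat_dim, frames)
            h = self.frame_layers(feats)
            stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # layer 6: pooling
            xvec = self.segment6(stats)                  # fixed-length embedding
            logits = self.out(torch.relu(self.segment7(torch.relu(xvec))))
            return logits, xvec

    # Training minimizes the cross-entropy loss F(P, Q) defined above:
    # loss = nn.CrossEntropyLoss()(logits, speaker_labels)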
After an X-Vector model is constructed, X-Vector feature vectors corresponding to all sample data in a student voiceprint sample library are obtained through the X-Vector model. In step S420, a PLDA model is constructed according to a PLDA algorithm, X-Vector feature vectors corresponding to all sample data in a student voiceprint sample library are used as input of the PLDA model, the PLDA model is trained through an EM algorithm, parameters of the PLDA model are updated, and then the voiceprint recognition model is generated.
The PLDA model is a channel compensation algorithm used to further extract the speaker information contained in the X-Vector feature vectors. The X-Vector feature vectors corresponding to all sample data in the student voiceprint sample library obtained through the X-Vector model are used as the input of the PLDA model, and the data x_ij input into the PLDA model is defined as the jth X-Vector feature vector of the ith student; x_ij satisfies the following formula:

x_ij = μ + F·h_i + G·w_ij + ε_ij;

wherein μ represents the mean vector, F represents the identity information matrix of each student, G represents the channel information matrix, h_i represents the latent variable of the voiceprint sample information corresponding to the ith student, w_ij represents the latent channel variable of the jth X-Vector feature vector of the ith student, and ε_ij represents the residual of the jth X-Vector feature vector of the ith student.
In this embodiment, the PLDA model is trained through the EM algorithm to obtain the latent variables h_i and w_ij in the above formula, and the voiceprint recognition model is finally generated. The parameter update process of the EM algorithm comprises the following steps: calculating the mean of all data input into the PLDA model and subtracting the mean from each training datum; initializing the channel information matrix and reducing the dimension of the mean through principal component analysis; and calculating the latent variables h_i and w_ij and updating the parameters of the PLDA model according to them.
In step S430, the voiceprint recognition module inputs the voiceprint feature information into the X-Vector model, the X-Vector model extracts the X-Vector feature vector corresponding to the voiceprint feature information, this feature vector is input into the voiceprint recognition model, and the voiceprint recognition model calculates the similarity between the voiceprint feature information and the student's voiceprint sample information and outputs a first attendance score.
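The scoring step can be illustrated with the following sketch, in which cosine similarity stands in for the PLDA log-likelihood-ratio score and the mapping to a 0-100 attendance score is an assumed convention rather than the embodiment's actual scale:

    import numpy as np

    def first_attendance_score(test_xvec, enrolled_xvec):
        """Compare a test x-vector with the student's enrolled x-vector."""
        cos = float(np.dot(test_xvec, enrolled_xvec) /
                    (np.linalg.norm(test_xvec) * np.linalg.norm(enrolled_xvec)))
        return 100.0 * max(0.0, cos)  # clamp negatives; scale to a 0-100 score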
Further, referring to fig. 6, in step S500, a face recognition model is constructed through the YOLOv3 algorithm, and according to the face feature image obtained in step S300, the face recognition model performs face recognition on the face feature image and outputs a second attendance score.
In this embodiment, constructing the face recognition model through the YOLOv3 algorithm comprises the following steps:

S510, the face recognition module calls the student face database stored in the cloud server, divides the student face database into a first training set and a first test set according to a ratio of 8:2, and trains a YOLOv3 network model with the first training set to obtain the face recognition model;
s520, the face recognition module sets a second performance evaluation index, and obtains performance parameters of the face recognition model through the first test set;
s530, the face recognition model obtains a face feature image, the face feature image is input into the face recognition model, and a second attendance score is output.
Specifically, in step S510, a student face database is stored in the cloud server; the student face database comprises frontal face images of all students and annotation information corresponding to each frontal face image, the annotation information comprising the face region of the student's frontal face image and the student name corresponding to that image. After obtaining the student face database, the face recognition module divides it into a first training set and a first test set according to a ratio of 8:2.
It should be noted that the hyper-parameters of the YOLOv3 network model need to be set before the YOLOv3 network is trained, and their setting affects the training effect of the YOLOv3 network model. The hyper-parameters to be set in the present application are the learning rate, the batch size, the number of iterations and the activation function. The learning rate is set through a learning rate decay function, which is used to obtain a learning rate that keeps the loss function of YOLOv3 oscillating in a region near the optimal value; the batch size is set to 25; and the number of iterations of the YOLOv3 network model depends on how the learning rate decays.
Specifically, the learning rate decay function ω satisfies the following formula:

ω = y^x · ω₀;

wherein x represents the number of iterations, y represents the decay rate, and ω₀ represents the initial learning rate.
In step S520, the face recognition module evaluates the performance of the face recognition model through the second performance evaluation index and the first test set. The performance evaluation comprises: dividing the results on the first test set into four categories, namely the number of samples that are actually true and classified as true by the trained network model, actually true but classified as false, actually false but classified as true, and actually false and classified as false; and calculating the proportion, among all samples of the first test set, of the samples that are actually true and classified as true together with the samples that are actually false and classified as false; this proportion represents the accuracy of the face recognition model.
In particular, a standard accuracy is set for the face recognition model; if the accuracy of the face recognition model is not greater than the standard accuracy, the hyper-parameters of the YOLOv3 network model are reset and the YOLOv3 network model is trained again.
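In code, the accuracy described above reduces to the standard confusion-matrix ratio; the retraining threshold of 0.95 below is a hypothetical standard accuracy, not a value given by the embodiment:

    def accuracy(tp, tn, fp, fn):
        """Proportion of correctly classified samples in the first test set:
        tp = actually true, classified true; tn = actually false, classified false."""
        return (tp + tn) / (tp + tn + fp + fn)

    def needs_retraining(tp, tn, fp, fn, standard_accuracy=0.95):
        """Reset the hyper-parameters and retrain if accuracy is not above standard."""
        return accuracy(tp, tn, fp, fn) <= standard_accuracy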
In step S530, the face feature image is acquired and input into the face recognition model; the face recognition model performs face recognition on the face feature image, obtains similarity data between the face feature image and the corresponding student's frontal face image in the student face database, and outputs a second attendance score.
According to the invention, the face recognition model and the voiceprint recognition model are used for checking attendance of students to obtain the first attendance score and the second attendance score, the first attendance score and the second attendance score are integrated to obtain the final attendance integrated score, and the attendance integrated score obtained by the method has higher confidence level.
Further, in step S600, the attendance analysis module assigns a first weight to the first attendance score and a second weight to the second attendance score, and sums the weighted first attendance score and the weighted second attendance score to obtain the attendance comprehensive score.
The first weight and the second weight are obtained through calculation of the accuracy of the face recognition model and the accuracy of the voiceprint recognition model, the accuracy of the face recognition model is used for representing the accuracy of the face recognition model in processing the first test set, and the accuracy of the voiceprint recognition model is used for representing the accuracy of the voiceprint recognition model in processing the student voiceprint sample library.
Further, in step S700, the attendance analysis module presets an attendance qualified score and judges whether the attendance comprehensive score is smaller than the attendance qualified score; if it is smaller, the student's attendance is regarded as not passing, and if it is not smaller, the student's attendance is regarded as passing. In this embodiment, whether or not the student's attendance passes, the attendance analysis module uploads the student's attendance result to the cloud server for storage.
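Steps S600 and S700 amount to a weighted sum followed by a threshold test, sketched below; the passing score of 60 is an assumed example, and in practice the weights would come from the model accuracies described above:

    def attendance_result(first_score, second_score, w1, w2, passing_score=60.0):
        """Fuse the two attendance scores (S600) and decide pass/fail (S700)."""
        comprehensive = w1 * first_score + w2 * second_score
        passed = comprehensive >= passing_score
        return comprehensive, passed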
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that the present invention is not limited to the details of the embodiments shown and described, but is capable of numerous modifications and substitutions without departing from the spirit of the present invention and within the scope of the appended claims.

Claims (9)

1. A composite network class attendance system based on voice recognition and face recognition, characterized by comprising: a communication module, a cloud server, a data acquisition module, a first processing module, a second processing module, a voiceprint recognition module, a face recognition module and an attendance analysis module, wherein the cloud server is connected with the attendance analysis module and the communication module respectively;
the communication module is used for acquiring attendance request information of a teacher and sending the attendance request information to the cloud server;
the cloud server is used for acquiring attendance request information, selecting entry information from the random entry database, displaying the entry information on a display screen used by the student and prompting the student to read aloud;
the student face database comprises frontal face images of all students and annotation information corresponding to each frontal face image; the student voiceprint sample library comprises voiceprint sample information of all students and the student name corresponding to each piece of voiceprint sample information;
the data acquisition module comprises a microphone and a camera; the microphone is used for collecting the original voice of the student reading the entry information aloud, and the camera is used for collecting the original video of the student while reading;
the first processing module is used for preprocessing original voice information, screening effective voice information, extracting the characteristics of the effective voice information and outputting voiceprint characteristic information;
the second processing module is used for capturing an image frame from the original video at an arbitrary moment, the image frame being a frontal face image of the student, denoising the image frame, and performing face detection and facial feature point extraction on the denoised image frame to generate a face feature image;
the voiceprint recognition module is used for constructing a voiceprint recognition model, and the voiceprint recognition model is used for carrying out voiceprint recognition on the voiceprint characteristic information and outputting a first attendance score;
the face recognition module is used for constructing a face recognition model, and the face recognition model carries out face recognition on the face feature image to obtain a second attendance score;
the attendance analysis module is used for assigning a first weight to the first attendance score and a second weight to the second attendance score, and summing the weighted first attendance score and the weighted second attendance score to obtain an attendance comprehensive score;
the attendance analysis module is also used for presetting an attendance qualified score, outputting an attendance result of the student to the cloud server according to the attendance comprehensive score, and storing the attendance result by the cloud server.
2. The system of claim 1, wherein the first processing module records a voiceprint feature processing program, and the voiceprint feature processing program comprises: pre-emphasis, framing, frame shifting and windowing are carried out on original voice information to generate first voice information; extracting the characteristics of the first voice information through fast Fourier transform, a filter bank and discrete cosine transform to generate voiceprint characteristic information;
wherein the pre-emphasis process satisfies the following formula:
Y[n]=X[n]-βX[n-1];
where Y[n] is the pre-emphasized voice signal, X[n] is the nth sampling point of the original voice information, X[n-1] is the (n-1)th sampling point, and β is a constant with β ∈ [0.9, 1.0];
the windowing process satisfies the following formulas:
T[n] = Y[n] · f[n];
f[n] = 0.54 − 0.46 · cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1;
where T[n] is the first voice signal, f[n] is the Hamming window function, and N is the window width.
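A minimal numpy sketch of this preprocessing chain follows. The pre-emphasis coefficient β = 0.97, frame length of 400 samples and frame shift of 160 samples (25 ms / 10 ms at 16 kHz) are assumed values for illustration only.

```python
import numpy as np

def preemphasis(x: np.ndarray, beta: float = 0.97) -> np.ndarray:
    # Y[n] = X[n] - beta * X[n-1], with beta in [0.9, 1.0]
    return np.append(x[0], x[1:] - beta * x[:-1])

def frame_and_window(y: np.ndarray, frame_len: int = 400,
                     frame_shift: int = 160) -> np.ndarray:
    # Split the signal into overlapping frames and apply the Hamming
    # window f[n] = 0.54 - 0.46*cos(2*pi*n/(N-1)) to each frame
    n_frames = 1 + (len(y) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([y[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])
    return frames * window  # T[n] = Y[n] * f[n], frame by frame
```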
3. The system of claim 1, wherein the second processing module records a facial feature processing program that: intercepts an image frame from the original video at an arbitrary moment; high-pass filters the image frame with the Laplacian operator and denoises the high-pass filtered image frame with a median filter; and performs face detection and facial feature point extraction on the denoised image frame to generate face feature information;
wherein the Laplacian operator satisfies the following formula:
∇²f(x, y) = ∂²f/∂x² + ∂²f/∂y²;
where ∇² denotes the Laplacian operator and f(x, y) is the image frame;
the median filtering process satisfies the following formula:
g(x, y) = med{ f̂(x − k, y − l) | (k, l) ∈ A };
where g(x, y) is the denoised image frame, f̂(x, y) is the high-pass filtered image frame, and A is a two-dimensional template, here selected as a 3 × 3 region.
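A short scipy sketch of this claim-3 denoising chain, under the assumption that scipy's discrete Laplacian and a 3 × 3 median filter match the operators intended by the claim:

```python
import numpy as np
from scipy import ndimage

def denoise_frame(frame: np.ndarray) -> np.ndarray:
    # High-pass filter the image frame with the Laplacian operator
    high_pass = ndimage.laplace(frame.astype(float))
    # Denoise the high-pass filtered frame with a median filter over
    # the 3 x 3 two-dimensional template A
    return ndimage.median_filter(high_pass, size=3)
```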
4. A composite network class attendance method based on voice recognition and face recognition, applied to the composite network class attendance system based on voice recognition and face recognition of any one of claims 1 to 3, characterized by comprising the following steps:
the communication module acquires attendance request information from a teacher; after receiving the attendance request information, the cloud server randomly selects entry information from the random entry database and prompts the student to read it aloud; the data acquisition module acquires the original voice information of the student reading aloud and the original video of the student while reading;
the first processing module preprocesses the original voice information to generate first voice information, extracts features from the first voice information and outputs voiceprint feature information;
the second processing module intercepts an image frame from the original video at an arbitrary moment, denoises the image frame, and performs face detection and facial feature point extraction on the denoised image frame to generate face feature information;
the voiceprint recognition module builds a voiceprint recognition model based on the X-Vector algorithm and the PLDA algorithm; the voiceprint recognition model performs voiceprint recognition on the voiceprint feature information and outputs a first attendance score;
the face recognition module constructs a face recognition model based on the YOLOv3 algorithm; the face recognition model performs face recognition on the face feature image and outputs a second attendance score;
the attendance analysis module assigns a first weight to the first attendance score and a second weight to the second attendance score, and accumulates the two weighted scores to obtain a composite attendance score;
the attendance analysis module presets a qualified attendance score, determines the student's attendance result from the composite attendance score, and outputs the attendance result to the cloud server, which stores it.
5. The method of claim 4, wherein extracting features from the first voice information and outputting voiceprint feature information comprises:
performing a fast Fourier transform on the first voice information to obtain its linear spectrum; wherein the transform satisfies the following formula:
U(k) = Σ_{n=0}^{N−1} T[n] · e^{−j2πnk/N}, 0 ≤ k ≤ N − 1;
where U(k) is the linear spectrum of the first voice information, T[n] is the first voice signal, and N is the window width of the window function used in the fast Fourier transform;
taking the modulus of the linear spectrum and squaring it to obtain the discrete power spectrum of the first voice information; wherein the discrete power spectrum satisfies the following formula:
P(k) = |U(k)|²;
where P(k) is the discrete power spectrum of the first voice information;
constructing a Gammatone filter bank and performing frequency integration on the discrete power spectrum through the Gammatone filter bank; wherein the time-domain impulse response of the Gammatone filter satisfies the following formula:
g(t) = c · t^{n−1} · e^{−2πbt} · cos(2πf₀t + φ), t ≥ 0;
where c is a proportionality coefficient, n is the order of the Gammatone filter, b is the time attenuation coefficient, f₀ is the center frequency of the Gammatone filter, and φ is the phase of the Gammatone filter;
calculating the long-time frame power of the first voice information and masking and suppressing noise other than speech; wherein the long-time frame power satisfies the following formula:
Q(i, j) = (1 / (2M + 1)) · Σ_{i′=i−M}^{i+M} P(i′, j);
where Q(i, j) is the long-time frame power and P(i′, j) is the power spectrum of the current frame and of the M frames before and after it;
normalizing in the time and frequency domains;
and applying a power-law nonlinearity to the time-frequency-normalized power spectrum and reducing its dimensionality through a discrete cosine transform to finally obtain the PNCC coefficients, which represent the voiceprint feature information.
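The following Python sketch traces this claim-5 pipeline end to end. It is heavily simplified: the Gammatone filter bank matrix is taken as given, the time-frequency normalization is reduced to a plain mean normalization, and PNCC's full asymmetric noise suppression is omitted; the FFT size, the M = 2 averaging half-window, the 1/15 power-law exponent and the 13 output coefficients are all assumptions of this sketch.

```python
import numpy as np
from scipy.fft import dct

def pncc_features(frames: np.ndarray, gammatone_fb: np.ndarray,
                  n_fft: int = 512, M: int = 2, n_ceps: int = 13) -> np.ndarray:
    # frames: windowed frames T[n]; gammatone_fb: (n_bands, n_fft//2 + 1)
    # matrix of Gammatone filter magnitude responses (construction assumed)
    U = np.fft.rfft(frames, n=n_fft, axis=1)        # linear spectrum U(k)
    P = np.abs(U) ** 2                              # discrete power spectrum P(k)
    bands = P @ gammatone_fb.T                      # frequency integration
    # Long-time frame power: average over the M frames before and after
    Q = np.stack([bands[max(0, i - M): i + M + 1].mean(axis=0)
                  for i in range(len(bands))])
    # Simplified time-frequency normalization (stand-in for PNCC's
    # asymmetric noise suppression and mean power normalization)
    normalized = bands * Q / (Q.mean(axis=0, keepdims=True) + 1e-10)
    # Power-law nonlinearity and DCT dimensionality reduction
    nonlinear = np.maximum(normalized, 1e-10) ** (1.0 / 15.0)
    return dct(nonlinear, type=2, axis=1, norm='ortho')[:, :n_ceps]
```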
6. The composite network class attendance method based on voice recognition and face recognition of claim 4, wherein the second processing module intercepting an image frame from the original video at an arbitrary moment, denoising the image frame, and performing face detection and facial feature point extraction on the denoised image frame to generate the face feature information comprises:
the second processing module intercepts an image frame from the original video at an arbitrary moment, high-pass filters the image frame, and denoises the high-pass filtered image frame with a median filter to obtain a first image;
and performs face detection and facial feature point extraction on the first image through a multitask cascaded convolutional neural network to generate the face feature image.
7. The method of claim 6, wherein the multitask cascaded convolutional neural network comprises a recommendation network, an optimization network and an output network; the recommendation network is used for performing regression prediction on the first image and outputting first candidate boxes merged through non-maximum suppression; the optimization network is used for filtering non-face candidate windows out of the first candidate boxes through a multilayer convolutional neural network and, after training through a fully connected layer, outputting second candidate boxes; and the output network is used for filtering overlapping candidate windows out of the second candidate boxes and outputting the face feature image.
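For illustration, this cascade corresponds to the P-Net/R-Net/O-Net structure implemented by the open-source `mtcnn` Python package; the sketch below assumes that package and a placeholder frame file name.

```python
import cv2
from mtcnn import MTCNN  # pip install mtcnn

detector = MTCNN()
# "student_frame.jpg" is a placeholder for an intercepted video frame
frame = cv2.cvtColor(cv2.imread("student_frame.jpg"), cv2.COLOR_BGR2RGB)

# detect_faces runs the full cascade: candidate boxes proposed by the
# first network, non-face windows filtered by the second, and overlapping
# windows removed via non-maximum suppression before output
for det in detector.detect_faces(frame):
    x, y, w, h = det["box"]            # final face candidate window
    points = det["keypoints"]          # five facial feature points
    print(det["confidence"], (x, y, w, h), points)
```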
8. The composite network class attendance method based on voice recognition and face recognition of claim 4, wherein the voiceprint recognition module constructing a voiceprint recognition model based on the X-Vector algorithm and the PLDA algorithm, performing voiceprint recognition on the voiceprint feature information and outputting a first attendance score further comprises:
the voiceprint recognition module acquires the student voiceprint sample library through the cloud server, establishes an X-Vector model from the student voiceprint sample library according to the X-Vector algorithm, and outputs the X-Vector feature vectors corresponding to the student voiceprint sample library;
from the X-Vector feature vectors corresponding to the student voiceprint sample library, the voiceprint recognition module establishes a PLDA model based on the PLDA algorithm and trains the PLDA model through the EM algorithm to generate the voiceprint recognition model;
the voiceprint recognition module inputs the voiceprint feature information into the X-Vector model to obtain the corresponding X-Vector feature vector, inputs that vector into the voiceprint recognition model, and outputs the first attendance score.
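A hedged sketch of this claim-8 scoring step. It assumes SpeechBrain's pretrained x-vector encoder (`speechbrain/spkrec-xvect-voxceleb`; the `speechbrain.pretrained` module path is from SpeechBrain 0.5 — newer releases moved it to `speechbrain.inference`) as a stand-in for a model trained on the student voiceprint sample library, and substitutes cosine similarity for the EM-trained PLDA log-likelihood score.

```python
import numpy as np
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Pretrained x-vector extractor used here in place of a model trained
# on the student voiceprint sample library (an assumption of this sketch)
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb")

def xvector(wav_path: str) -> np.ndarray:
    # Extract the x-vector embedding for one utterance
    signal, _ = torchaudio.load(wav_path)
    return encoder.encode_batch(signal).squeeze().detach().numpy()

def first_attendance_score(test_vec: np.ndarray,
                           enrolled_vec: np.ndarray) -> float:
    # Cosine similarity as a simplified stand-in for PLDA scoring
    return float(np.dot(test_vec, enrolled_vec) /
                 (np.linalg.norm(test_vec) * np.linalg.norm(enrolled_vec)))
```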
9. The composite network class attendance method based on voice recognition and face recognition of claim 4, wherein the face recognition module constructing a face recognition model based on the YOLOv3 algorithm, performing face recognition on the face feature image and outputting a second attendance score further comprises:
the face recognition module calls the student face database stored in the cloud server, divides the student face database into a first training set and a first test set at a ratio of 8:2, and trains a YOLOv3 network with the first training set to generate a face comparison model;
the face recognition module sets a second performance evaluation index and obtains the performance parameters of the face comparison model through the first test set;
and the face recognition module acquires the face feature image, inputs it into the face comparison model, and outputs the second attendance score.
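As a small illustration of the 8:2 split in claim 9, assuming placeholder file paths and labels; training and evaluating the YOLOv3-based comparison model itself is outside the scope of this sketch.

```python
from sklearn.model_selection import train_test_split

# Placeholder student face database: annotated frontal face images
face_image_paths = [f"faces/img_{i:03d}.jpg" for i in range(100)]
student_names = [f"student_{i % 20}" for i in range(100)]

# 8:2 split into the first training set and the first test set
train_x, test_x, train_y, test_y = train_test_split(
    face_image_paths, student_names, test_size=0.2, random_state=0)
print(len(train_x), len(test_x))  # 80 20
```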
CN202210662375.9A 2022-06-13 2022-06-13 Compound network class attendance system and method based on voice recognition and face recognition Pending CN115273863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210662375.9A CN115273863A (en) 2022-06-13 2022-06-13 Compound network class attendance system and method based on voice recognition and face recognition


Publications (1)

Publication Number Publication Date
CN115273863A true CN115273863A (en) 2022-11-01



Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440686A (en) * 2013-07-29 2013-12-11 上海交通大学 Mobile authentication system and method based on voiceprint recognition, face recognition and location service
CN105224850A (en) * 2015-10-24 2016-01-06 北京进化者机器人科技有限公司 Combined right-discriminating method and intelligent interactive system
CN108520565A (en) * 2018-03-07 2018-09-11 南京奥工信息科技有限公司 A kind of synthesis cloud computing platform for long-distance education
CN108399809A (en) * 2018-03-26 2018-08-14 滨州职业学院 Virtual teaching system, cloud platform management system and processing terminal manage system
CN111199741A (en) * 2018-11-20 2020-05-26 阿里巴巴集团控股有限公司 Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium
CN110047504A (en) * 2019-04-18 2019-07-23 东华大学 Method for distinguishing speek person under identity vector x-vector linear transformation
CN110232352A (en) * 2019-06-12 2019-09-13 东北大学 A kind of improved method of the multitask concatenated convolutional neural network model for recognition of face
CN110349598A (en) * 2019-07-15 2019-10-18 桂林电子科技大学 A kind of end-point detecting method under low signal-to-noise ratio environment
CN110852703A (en) * 2019-10-22 2020-02-28 佛山科学技术学院 Attendance checking method, system, equipment and medium based on side face multi-feature fusion face recognition
CN112257591A (en) * 2020-10-22 2021-01-22 安徽天盛智能科技有限公司 Remote video teaching quality evaluation method and system based on machine vision
CN112542174A (en) * 2020-12-25 2021-03-23 南京邮电大学 VAD-based multi-dimensional characteristic parameter voiceprint identification method
CN112735388A (en) * 2020-12-28 2021-04-30 马上消费金融股份有限公司 Network model training method, voice recognition processing method and related equipment
CN113469002A (en) * 2021-06-24 2021-10-01 淮阴工学院 Identity recognition method based on block chain mutual authentication, biological multi-feature recognition and multi-source data fusion
CN114332989A (en) * 2021-12-08 2022-04-12 重庆邮电大学 Face detection method and system of multitask cascade convolution neural network

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
KAIPENG ZHANG et al.: "Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks", IEEE SIGNAL PROCESSING LETTERS, vol. 23, no. 10, 26 August 2016 *
KIM C. et al.: "Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 23 March 2016 *
SNYDER DAVID et al.: "Deep neural network-based speaker embeddings for end-to-end speaker verification", 2016 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, 9 February 2017 *
SNYDER DAVID et al.: "X-vectors: Robust DNN Embeddings for Speaker Recognition", 2018 ICASSP, 13 September 2018 *
LIU Xiang; WANG Mingzhong; CHEN Zhiguang; WU Sijie; YUAN Zizhen; YANG Hongping: "Construction of a live interactive video teaching platform in higher vocational education: the case of Guangzhou Donghua Vocational College", Wireless Internet Technology, no. 04, 25 February 2020 *
WANG Ning: "Research on Speaker Recognition Technology for Compressed Speech", China Master's Theses Full-text Database (Information Science and Technology), 15 March 2018 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination