CN115273863A - Compound network class attendance system and method based on voice recognition and face recognition - Google Patents

Info

Publication number
CN115273863A
Authority
CN
China
Prior art keywords
attendance
face
voiceprint
information
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210662375.9A
Other languages
Chinese (zh)
Inventor
陈荣征
李浩能
李育廷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Vocational and Technical College
Original Assignee
Guangdong Vocational and Technical College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Vocational and Technical College
Priority to CN202210662375.9A
Publication of CN115273863A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/04 - Training, enrolment or model building
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/01 - Protocols
    • H04L67/12 - Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Game Theory and Decision Science (AREA)
  • Databases & Information Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an online class attendance system and method combining voice recognition and face recognition. A voiceprint recognition model is established based on the X-Vector and PLDA algorithms, and a face recognition model is established based on the YOLOv3 algorithm. During attendance checking, the original voice of a student reading displayed entry information aloud and the original video of the student are collected and preprocessed to obtain voiceprint feature information and a face feature image. A first attendance score is obtained from the voiceprint feature information through the voiceprint recognition model, a second attendance score is obtained from the face feature image through the face recognition model, and the final attendance result is obtained by combining the first attendance score and the second attendance score. The invention thus obtains two attendance scores through face recognition and voiceprint recognition and synthesizes them into the student's final attendance result.

Description

Compound network class attendance system and method based on voice recognition and face recognition
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a composite network class attendance system and method based on voice recognition and face recognition.
Background
Class attendance checking acquires, in some manner, the attendance of students for a specific course within a given time period. Attendance checking effectively improves campus management and supervises students in their coursework; moreover, student attendance is important data for evaluating classroom teaching effectiveness and an important parameter for measuring a student's comprehensive end-of-term performance. Modern classes are divided into online classes and offline classes: offline classes are taught face-to-face between students and teachers, while online classes deliver course study through network courses. The traditional attendance methods of offline classes include manual roll call, manual attendance statistics, and mobile-terminal check-in. In some situations, offline classes must be moved online; in that case it is difficult for a teacher to check the attendance of every student in the class, which hampers the teacher's assessment of classroom teaching effectiveness, reduces students' enthusiasm for course learning, and easily leads to slack and lazy study habits.
Disclosure of Invention
The invention aims to provide a composite network class attendance system and method based on voice recognition and face recognition, so as to solve one or more technical problems in the prior art, or at least provide a beneficial alternative.
The solution adopted by the invention to solve the technical problem is as follows: a composite network class attendance system based on voice recognition and face recognition comprises: a communication module, a cloud server, a data acquisition module, a first processing module, a second processing module, a voiceprint recognition module, a face recognition module and an attendance analysis module, wherein the cloud server is connected with the attendance analysis module and the communication module respectively, and the first and second processing modules are connected with the data acquisition module, the voiceprint recognition module and the face recognition module respectively;
the communication module is used for acquiring attendance request information of a teacher and sending the attendance request information to the cloud server;
the cloud server is used for acquiring attendance request information, selecting entry information from the random entry database, displaying the entry information on a display screen used by the student and prompting the student to read aloud;
the cloud server stores a student face database and a student voiceprint sample library; the student face database comprises frontal face images of all students and annotation information corresponding to each frontal face image, and the student voiceprint sample library comprises voiceprint sample information of all students and the student name corresponding to each piece of voiceprint sample information;
the data acquisition module comprises a microphone and a camera; the microphone is used for collecting the original voice of the student reading the entry information aloud, and the camera is used for collecting the original video of the student while reading;
the first processing module is used for preprocessing original voice information, screening effective voice information, extracting the characteristics of the effective voice information and outputting voiceprint characteristic information;
the second processing module is used for capturing an image frame from the original video at an arbitrary moment, denoising the image frame, and performing face detection and facial feature point extraction on the denoised image frame to generate a face feature image;
the voiceprint recognition module is used for constructing a voiceprint recognition model, and the voiceprint recognition model is used for carrying out voiceprint recognition on the voiceprint characteristic information and outputting a first attendance score;
the face recognition module is used for constructing a face recognition model, and the face recognition model carries out face recognition on the face characteristic image to obtain a second attendance score;
the attendance analysis module is used for assigning a first weight to the first attendance score and a second weight to the second attendance score, and summing the weighted first attendance score and the weighted second attendance score to obtain an attendance comprehensive score;
the attendance analysis module is also used for presetting an attendance qualified score, outputting an attendance result of the student to the cloud server according to the attendance comprehensive score, and storing the attendance result by the cloud server.
As a further improvement of the above technical solution, the first processing module stores a voiceprint feature processing program, and the voiceprint feature processing program comprises: performing pre-emphasis, framing, frame shifting and windowing on the original voice information to generate first voice information; and extracting features of the first voice information through fast Fourier transform, a filter bank and discrete cosine transform to generate the voiceprint feature information;

wherein the pre-emphasis process satisfies the following formula:

Y[n] = X[n] - βX[n-1];

wherein Y[n] represents the pre-emphasized voice information, X[n] is the nth sampling point of the original voice information, X[n-1] is the (n-1)th sampling point of the original voice information, and β is a constant with β ∈ [0.9, 1.0];

the windowing process satisfies the following formulas:

T[n] = Y[n]·f[n];

f[n] = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1;

wherein T[n] represents the first voice information, f[n] is a Hamming window function, and N is the window width.
As a further improvement of the above technical solution, the second processing module stores a face feature processing program, and the face feature processing program comprises: capturing an image frame from the original video at an arbitrary moment; high-pass filtering the image frame through the Laplacian operator, and denoising the high-pass-filtered image frame through median filtering; and performing face detection and facial feature point extraction on the denoised image frame to generate the face feature image;

wherein the Laplacian operator satisfies the following formula:

∇²f(x, y) = ∂²f/∂x² + ∂²f/∂y²;

wherein ∇² denotes the Laplacian operator and f(x, y) is the image frame;

the median filtering process satisfies the following formula:

g(x, y) = med{ f_H(x-k, y-l) | (k, l) ∈ A };

wherein g(x, y) is the denoised image frame, f_H(x, y) is the high-pass-filtered image frame, and A is a two-dimensional template selected as a 3 × 3 area.
A composite network class attendance method based on voice recognition and face recognition, applied to the composite network class attendance system based on voice recognition and face recognition, comprises the following steps:

the communication module acquires attendance request information of a teacher; after receiving the attendance request information, the cloud server randomly selects entry information from the random entry library and prompts the student to read it aloud, and the data acquisition module collects the original voice of the student reading aloud and the original video of the student while reading;

the first processing module preprocesses the original voice information to generate first voice information, extracts features of the first voice information, and outputs voiceprint feature information;

the second processing module captures an image frame from the original video at an arbitrary moment, denoises the image frame, and performs face detection and facial feature point extraction on the denoised image frame to generate a face feature image;

the voiceprint recognition module constructs a voiceprint recognition model based on the X-Vector algorithm and the PLDA algorithm, and the voiceprint recognition model performs voiceprint recognition on the voiceprint feature information and outputs a first attendance score;

the face recognition module constructs a face recognition model based on the YOLOv3 algorithm, and the face recognition model performs face recognition on the face feature image and outputs a second attendance score;

the attendance analysis module assigns a first weight to the first attendance score and a second weight to the second attendance score, and sums the weighted first attendance score and the weighted second attendance score to obtain an attendance comprehensive score;

the attendance analysis module presets an attendance qualified score, outputs the student's attendance result to the cloud server according to the attendance comprehensive score, and the cloud server stores the attendance result.
As a further improvement of the foregoing technical solution, extracting features of the first voice information and outputting the voiceprint feature information comprises:

performing fast Fourier transform on the first voice information to obtain the linear spectrum of the first voice information; wherein the linear spectrum satisfies the following formula:

U(k) = Σ_{n=0}^{N-1} T[n]·e^(-j2πnk/N), 0 ≤ k ≤ N-1;

wherein U(k) is the linear spectrum of the first voice information, T[n] is the first voice information, and N is the window width of the window function used in the fast Fourier transform;

taking the modulus of the linear spectrum of the first voice information and squaring it to obtain the discrete power spectrum of the first voice information; wherein the discrete power spectrum satisfies the following formula:

P(k) = |U(k)|²;

wherein P(k) is the discrete power spectrum of the first voice information;

constructing a Gammatone filter bank, and integrating the discrete power spectrum over frequency through the Gammatone filter bank; wherein the time-domain impulse response of the Gammatone filter bank satisfies the following formula:

h(t) = c·t^(n-1)·e^(-2πbt)·cos(2πf₀t + φ), t ≥ 0;

wherein c is a proportionality coefficient, n is the filter order, b is the time attenuation coefficient, f₀ is the center frequency of the Gammatone filter, and φ is the phase of the Gammatone filter;

calculating the long-time frame power of the first voice information, and masking and suppressing noise other than the human voice; wherein the long-time frame power satisfies the following formula:

Q(i, j) = (1/(2M+1))·Σ_{i'=i-M}^{i+M} P(i', j);

wherein Q(i, j) represents the long-time frame power, and P(i', j) represents the power spectrum of the current frame and of each of the M frames before and after it;

normalizing in the time domain and the frequency domain;

and raising the time-frequency-normalized power spectrum to a nonlinear power and reducing its dimension through discrete cosine transform to finally obtain the PNCC coefficients, which represent the voiceprint feature information.
As a further improvement of the above technical solution, the second processing module captures an image frame from the original video at an arbitrary moment, denoises the image frame, and performs face detection and facial feature point extraction on the denoised image frame to generate a face feature image, comprising:

the second processing module captures an image frame from the original video at an arbitrary moment, high-pass filters the image frame, and denoises the high-pass-filtered image frame through median filtering to obtain a first image;

and performing face detection and facial feature point extraction on the first image through a multi-task cascaded convolutional neural network to generate the face feature image.
As a further improvement of the above technical solution, the multi-task cascaded convolutional neural network comprises a recommendation network, an optimization network and an output network, wherein the recommendation network is configured to perform regression prediction on the first image, merge candidate windows through non-maximum suppression, and output first candidate boxes; the optimization network is configured to filter out non-face candidate windows among the first candidate boxes through a multilayer convolutional neural network and, after training through a fully connected layer, output second candidate boxes; and the output network is configured to filter out overlapping candidate windows among the second candidate boxes and output the face feature image.
As a further improvement of the above technical solution, the voiceprint recognition module constructs a voiceprint recognition model based on the X-Vector algorithm and the PLDA algorithm, and the voiceprint recognition model performs voiceprint recognition on the voiceprint feature information and outputs a first attendance score, comprising the following steps:
the voiceprint recognition module acquires a student voiceprint sample library through the cloud server, establishes an X-Vector model according to the student voiceprint sample library and an X-Vector algorithm, and outputs an X-Vector feature Vector corresponding to the student voiceprint sample library;
according to the X-Vector feature Vector corresponding to the student voiceprint sample library, the voiceprint recognition module establishes a PLDA model based on a PLDA algorithm, trains the PLDA model through an EM algorithm and generates a voiceprint recognition model;
the voiceprint recognition module inputs the voiceprint feature information into the X-Vector model to obtain an X-Vector feature Vector corresponding to the voiceprint feature information, inputs the X-Vector feature Vector corresponding to the voiceprint feature information into the voiceprint recognition model, and outputs a first attendance score.
As a further improvement of the above technical solution, the face recognition module constructs a face recognition model based on the YOLOv3 algorithm, and the face recognition model performs face recognition on the face feature image and outputs a second attendance score, comprising the following steps:

the face recognition module calls the student face database stored in the cloud server, divides the student face database into a first training set and a first test set according to a ratio of 8:2, and trains a YOLOv3 network model with the first training set to obtain the face recognition model;

the face recognition module sets a second performance evaluation index, and obtains performance parameters of the face recognition model through the first test set;

and the face recognition model acquires the face feature image; the face feature image is input into the face recognition model, and a second attendance score is output.
The beneficial effects of the invention are as follows: the invention discloses a composite network class attendance system and method based on voice recognition and face recognition, wherein a voiceprint recognition model is established based on the X-Vector and PLDA algorithms and a face recognition model is established based on the YOLOv3 algorithm; from the collected original voice of a student reading entry information aloud during attendance checking and the original video of the student, a first attendance score is obtained through the voiceprint recognition model and a second attendance score is obtained through the face recognition model, and the two attendance scores are synthesized into the student's final attendance result. The invention facilitates teachers' online class attendance checking, reduces the workload of that checking, makes student attendance results more accurate, and improves the confidence of the attendance results.
Drawings
In order to more clearly illustrate the technical solution in the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below. Clearly, the described drawings illustrate only some embodiments of the invention, not all of them, and a person skilled in the art can derive other designs and drawings from them without inventive effort.
FIG. 1 is a method flow chart of the composite network class attendance method based on voice recognition and face recognition;

FIG. 2 is a method flow chart of obtaining voiceprint feature information in the composite network class attendance method based on voice recognition and face recognition;

FIG. 3 is a method flow chart of obtaining a face feature image in the composite network class attendance method based on voice recognition and face recognition;

FIG. 4 is a method flow chart of face detection and facial feature point extraction by the multi-task cascaded convolutional neural network in the composite network class attendance method based on voice recognition and face recognition;

FIG. 5 is a method flow chart of constructing a voiceprint recognition model and obtaining a first attendance score in the composite network class attendance method based on voice recognition and face recognition;

FIG. 6 is a method flow chart of constructing a face recognition model and obtaining a second attendance score in the composite network class attendance method based on voice recognition and face recognition.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It is noted that while a division of functional blocks is depicted in the system diagram, and logical order is depicted in the flowchart, in some cases the steps depicted and described may be performed in a different order than the division of blocks in the system or the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
A composite network class attendance system based on voice recognition and face recognition comprises: a communication module, a cloud server, a data acquisition module, a first processing module, a second processing module, a voiceprint recognition module, a face recognition module and an attendance analysis module, wherein the cloud server is connected with the attendance analysis module and the communication module respectively, and the first and second processing modules are connected with the data acquisition module, the voiceprint recognition module and the face recognition module respectively;
the communication module is used for acquiring attendance request information of a teacher and sending the attendance request information to the cloud server;
the cloud server is used for acquiring attendance request information, selecting entry information from the random entry database, displaying the entry information on a display screen used by the student and prompting the student to read aloud;
the data acquisition module comprises a microphone and a camera; the microphone is used for collecting the original voice of the student reading the entry information aloud, and the camera is used for collecting the original video of the student while reading;
the first processing module is used for preprocessing original voice information, screening effective voice information, extracting the characteristics of the effective voice information and outputting voiceprint characteristic information;
the second processing module is used for capturing an image frame from the original video at an arbitrary moment, the image frame being a frontal face image of the student, denoising the image frame, and performing face detection and facial feature point extraction on the denoised image frame to generate a face feature image;
the voiceprint recognition module is used for constructing a voiceprint recognition model, and the voiceprint recognition model is used for carrying out voiceprint recognition on the voiceprint characteristic information and outputting a first attendance score;
the face recognition module is used for constructing a face recognition model, and the face recognition model carries out face recognition on the face characteristic image to obtain a second attendance score;
the attendance analysis module is used for assigning a first weight to the first attendance score and a second weight to the second attendance score, and summing the weighted first attendance score and the weighted second attendance score to obtain an attendance comprehensive score;
the attendance analysis module is also used for judging, according to a preset attendance qualified score, whether the attendance comprehensive score is smaller than the attendance qualified score; if so, the student's attendance is regarded as not passing, and the student's attendance result is uploaded to the cloud server for storage.
Further, the first processing module stores a voiceprint feature processing program, which comprises: performing pre-emphasis, framing, frame shifting and windowing on the original voice information to generate first voice information; and extracting features of the first voice information through fast Fourier transform, a filter bank and discrete cosine transform to generate the voiceprint feature information.

Further, the second processing module stores a face feature processing program, which comprises: capturing an image frame from the original video at an arbitrary moment; high-pass filtering the image frame through the Laplacian operator and denoising the high-pass-filtered image frame through median filtering; and performing face detection and facial feature point extraction on the denoised image frame to generate the face feature image.
The invention also discloses a composite network class attendance method based on voice recognition and face recognition, which is applied to the composite network class attendance system based on voice recognition and face recognition described above. Referring to fig. 1 to 6, the method comprises the following steps:

S100, the communication module acquires attendance request information of a teacher; after receiving the attendance request information, the cloud server randomly selects entry information from the random entry library and prompts the student to read it aloud, and the data acquisition module collects the original voice of the student reading aloud and the original video of the student while reading;

S200, the first processing module preprocesses the original voice information to generate first voice information, extracts features of the first voice information, and outputs voiceprint feature information;

S300, the second processing module captures an image frame from the original video at an arbitrary moment, denoises the image frame, and performs face detection and facial feature point extraction on the denoised image frame to generate a face feature image;

S400, the voiceprint recognition module constructs a voiceprint recognition model based on the X-Vector algorithm and the PLDA algorithm, and the voiceprint recognition model performs voiceprint recognition on the voiceprint feature information and outputs a first attendance score;

S500, the face recognition module constructs a face recognition model based on the YOLOv3 algorithm, and the face recognition model performs face recognition on the face feature image and outputs a second attendance score;

S600, the attendance analysis module assigns a first weight to the first attendance score and a second weight to the second attendance score, and sums the weighted first attendance score and the weighted second attendance score to obtain an attendance comprehensive score;

S700, the attendance analysis module presets an attendance qualified score, outputs the student's attendance result to the cloud server according to the attendance comprehensive score, and the cloud server stores the attendance result.
Further, in step S100, the communication module acquires attendance request information of a teacher; the cloud server calls an arbitrary piece of entry information in the random entry library according to the attendance request information and prompts the student to read it aloud, the random entry library comprising a plurality of preset pieces of entry information, each containing 10 to 20 words; and the data acquisition module collects the original voice of the student reading the entry information aloud and synchronously collects the original video of the student while reading.
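By way of illustration only, the random entry selection performed by the cloud server in step S100 can be sketched as follows in Python; the library contents and the function name pick_attendance_entry are hypothetical and not part of the claimed system:

    import random

    # Hypothetical stand-in for the random entry library; each preset entry
    # contains roughly 10 to 20 words, as described above.
    ENTRY_LIBRARY = [
        "the quick brown fox jumps over the lazy dog near the quiet river bank today",
        "every student reads one short sentence aloud so the system can hear a clear voice sample",
    ]

    def pick_attendance_entry(library=ENTRY_LIBRARY):
        """Select one entry at random to display on the student's screen."""
        return random.choice(library)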
In this embodiment, the data acquisition module is a microphone and a camera of a device used by the student during the class session.
Further, in step S200, the first processing module acquires the original voice information and preprocesses it; the preprocessing step is: performing pre-emphasis, framing, frame shifting and windowing on the original voice information to generate first voice information; wherein the pre-emphasis process satisfies the following formula:

Y[n] = X[n] - βX[n-1];

wherein Y[n] represents the pre-emphasized voice information, X[n] is the nth sampling point of the original voice information, X[n-1] is the (n-1)th sampling point of the original voice information, and β is a constant with β ∈ [0.9, 1.0];

the windowing process satisfies the following formulas:

T[n] = Y[n]·f[n];

f[n] = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1;

wherein T[n] represents the first voice information, f[n] is a Hamming window function, and N is the window width.
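A minimal Python sketch of this preprocessing chain is given below; the frame length and frame shift (25 ms and 10 ms at a 16 kHz sampling rate) and the value of β are assumed values, not prescribed by the embodiment:

    import numpy as np

    def preprocess_speech(x, beta=0.97, frame_len=400, hop=160):
        """Pre-emphasis, framing with frame shift, and Hamming windowing."""
        # Pre-emphasis: Y[n] = X[n] - beta * X[n-1]
        y = np.append(x[0], x[1:] - beta * x[:-1])
        if len(y) < frame_len:                      # pad very short signals
            y = np.pad(y, (0, frame_len - len(y)))
        # Framing with a frame shift of `hop` samples
        n_frames = 1 + (len(y) - frame_len) // hop
        frames = np.stack([y[i * hop : i * hop + frame_len] for i in range(n_frames)])
        # Windowing: T[n] = Y[n] * f[n], with f[n] the Hamming window
        return frames * np.hamming(frame_len)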
In this embodiment, the first processing module extracts features of the first voice information through the PNCC algorithm. Compared with the traditional MFCC feature extraction algorithm, the PNCC algorithm replaces the logarithmic nonlinearity of the MFCC coefficients with a power-law nonlinearity, and adds a noise suppression algorithm based on asymmetric filtering to suppress background excitation, together with a temporal masking module, to improve speech recognition in noisy scenes.
Referring to fig. 2, extracting features of the first voice information through the PNCC algorithm comprises the following steps:

S210, performing fast Fourier transform on the first voice information to obtain the linear spectrum of the first voice information; wherein the linear spectrum satisfies the following formula:

U(k) = Σ_{n=0}^{N-1} T[n]·e^(-j2πnk/N), 0 ≤ k ≤ N-1;

wherein U(k) is the linear spectrum of the first voice information, T[n] is the first voice information, and N is the window width of the window function used in the fast Fourier transform;

S220, taking the modulus of the linear spectrum of the first voice information and squaring it to obtain the discrete power spectrum of the first voice information; wherein the discrete power spectrum satisfies the following formula:

P(k) = |U(k)|²;

wherein P(k) is the discrete power spectrum of the first voice information;

S230, constructing a Gammatone filter bank, and integrating the discrete power spectrum over frequency through the Gammatone filter bank; wherein the time-domain impulse response of the Gammatone filter bank satisfies the following formula:

h(t) = c·t^(n-1)·e^(-2πbt)·cos(2πf₀t + φ), t ≥ 0;

wherein c is a proportionality coefficient, n is the filter order, b is the time attenuation coefficient, f₀ is the center frequency of the Gammatone filter, and φ is the phase of the Gammatone filter;

S240, calculating the long-time frame power of the first voice information, and masking and suppressing noise other than the human voice; wherein the long-time frame power satisfies the following formula:

Q(i, j) = (1/(2M+1))·Σ_{i'=i-M}^{i+M} P(i', j);

wherein Q(i, j) represents the long-time frame power, and P(i', j) represents the power spectrum of the current frame and of each of the M frames before and after it;

S250, normalizing in the time domain and the frequency domain; wherein the normalization process satisfies the following formulas:

V(i, j) = (1/(a-b+1))·Σ_{j'=b}^{a} F(i, j')/Q(i, j');

a = min(j+N, J);

b = max(j-N, 1);

P_norm(i, j) = P(i, j)·V(i, j);

wherein F(i, j) is the power remaining after suppressing noise other than the human voice, J is the number of Gammatone channels, N is the half-width of the frequency smoothing window, and P_norm(i, j) is the power spectrum after time-frequency normalization;

S260, raising the time-frequency-normalized power spectrum to a nonlinear power and reducing its dimension through discrete cosine transform to finally obtain the PNCC coefficients, which represent the voiceprint feature information.
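The chain S210 to S260 can be condensed into the following Python sketch. It assumes a precomputed Gammatone filterbank matrix fbank of shape (n_channels, frames.shape[1]//2 + 1), replaces the full asymmetric noise suppression and masking stages with a simple medium-time smoothing placeholder, and uses the 1/15 power-law exponent commonly associated with PNCC, so it is an approximation of the described algorithm rather than a faithful implementation:

    import numpy as np
    from scipy.fft import dct

    def pncc_like_features(frames, fbank, n_ceps=13):
        """Simplified PNCC-style features from windowed frames (steps S210-S260)."""
        spec = np.fft.rfft(frames, axis=1)              # S210: fast Fourier transform
        power = np.abs(spec) ** 2                       # S220: discrete power spectrum
        chan = power @ fbank.T                          # S230: Gammatone integration
        smoothed = np.empty_like(chan)                  # S240/S250 placeholder:
        for i in range(chan.shape[0]):                  # medium-time smoothing over
            lo, hi = max(0, i - 2), min(chan.shape[0], i + 3)  # +/- 2 frames
            smoothed[i] = chan[lo:hi].mean(axis=0)
        nonlin = smoothed ** (1.0 / 15.0)               # S260: power-law nonlinearity
        return dct(nonlin, type=2, axis=1, norm="ortho")[:, :n_ceps]  # DCT reduction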
Further, referring to fig. 3, in step S300, the second processing module captures an image frame from the original video at an arbitrary moment, denoises the image frame, and performs face detection and facial feature point extraction on the denoised image frame to generate a face feature image, comprising the following steps:

S310, the second processing module captures an image frame from the original video at an arbitrary moment, high-pass filters the image frame, and denoises the high-pass-filtered image frame through median filtering to obtain a first image;

wherein the Laplacian operator used for high-pass filtering satisfies the following formula:

∇²f(x, y) = ∂²f/∂x² + ∂²f/∂y²;

wherein ∇² denotes the Laplacian operator and f(x, y) is the image frame;

the median filtering process satisfies the following formula:

g(x, y) = med{ f_H(x-k, y-l) | (k, l) ∈ A };

wherein g(x, y) is the denoised image frame, f_H(x, y) is the high-pass-filtered image frame, and A is a two-dimensional template selected as a 3 × 3 area;
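An OpenCV rendering of step S310 might look as follows; cv2.Laplacian and cv2.medianBlur stand in for the Laplacian high-pass filter and the 3 × 3 median template A above, and the frame-grabbing helper is an illustrative assumption:

    import cv2

    def grab_frame(video_path):
        """Capture one image frame from the original video at an arbitrary moment."""
        cap = cv2.VideoCapture(video_path)
        ok, frame = cap.read()
        cap.release()
        if not ok:
            raise RuntimeError("could not read a frame from the video")
        return frame

    def denoise_frame(frame):
        """Laplacian high-pass filtering followed by 3x3 median filtering (S310)."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        lap = cv2.Laplacian(gray, cv2.CV_64F, ksize=3)  # high-pass filter
        lap8 = cv2.convertScaleAbs(lap)                 # back to 8-bit range
        return cv2.medianBlur(lap8, 3)                  # median filter, 3x3 template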
and S320, performing face detection and face feature point extraction on the first image through the multitask cascade convolution neural network to generate a face feature image.
In this embodiment, the Multi-task Cascaded Convolutional Network (MTCNN) is a convolutional neural network that handles face detection and facial feature point localization simultaneously. It comprises three multi-task convolutional neural networks: a recommendation network (P-Net), an optimization network (R-Net) and an output network (O-Net), each of which has three learning tasks: face classification, bounding box regression, and facial feature point localization.

The invention uses the pre-trained multi-task cascaded convolutional neural network to perform face detection and facial feature point extraction on the first image; before being input into the network, the first image must be preprocessed into a format that conforms to the network's input requirements.
Further, referring to fig. 4, the process of face detection and facial feature point extraction is divided into three stages:

S321, the recommendation network performs regression prediction on the first image, merges candidate windows through non-maximum suppression, and outputs first candidate boxes.

Specifically, in step S321, the first image is input into the recommendation network, which obtains candidate windows and the bounding box regression vectors of the face regions, performs regression prediction with the bounding boxes to calibrate the candidate windows, and merges them through non-maximum suppression (NMS) to output the first candidate boxes;

S322, the optimization network filters out non-face candidate windows among the first candidate boxes through a multilayer convolutional neural network and outputs second candidate boxes after training through a fully connected layer.

Specifically, in step S322, the first candidate boxes obtained by the recommendation network are used as the input of the optimization network; the optimization network filters out most non-face candidate windows through the multilayer convolutional neural network, trains through a fully connected layer, fine-tunes the candidate boxes using the bounding box regression vectors, and finally removes overlapping candidate windows through non-maximum suppression (NMS) to output the second candidate boxes;

S323, the output network filters out overlapping candidate windows among the second candidate boxes and outputs the face feature image.

Specifically, in step S323, the second candidate boxes obtained from the optimization network are used as the input of the output network; the output network removes the overlapping candidate windows among the second candidate boxes, finally completes face detection on the first image, and outputs the face feature image.
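The three-stage P-Net/R-Net/O-Net pipeline is available in open-source form; the sketch below uses the third-party mtcnn Python package as a stand-in for the pre-trained network described above (the package choice and the cropping logic are illustrative assumptions, not part of the claimed method):

    import cv2
    from mtcnn import MTCNN  # pip install mtcnn; bundles pre-trained P-Net, R-Net, O-Net

    def extract_face_feature_image(first_image_bgr):
        """Run the cascaded detector on the first image and crop the face region."""
        detector = MTCNN()
        results = detector.detect_faces(cv2.cvtColor(first_image_bgr, cv2.COLOR_BGR2RGB))
        if not results:
            return None, None
        best = max(results, key=lambda r: r["confidence"])  # most confident face
        x, y, w, h = best["box"]
        x, y = max(x, 0), max(y, 0)
        keypoints = best["keypoints"]  # eyes, nose and mouth corners (feature points)
        return first_image_bgr[y : y + h, x : x + w], keypoints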
Further, in step S400, a voiceprint recognition model is constructed based on the X-Vector deep neural network (DNN) and the PLDA (Probabilistic Linear Discriminant Analysis) algorithm, and voiceprint recognition is performed on the voiceprint feature information through the constructed voiceprint recognition model to obtain a first attendance score. Referring to fig. 5, the steps of constructing the voiceprint recognition model and performing voiceprint recognition on the voiceprint feature information are as follows:
s410, the voiceprint recognition module acquires a student voiceprint sample library through the cloud server, establishes an X-Vector model according to the student voiceprint sample library and an X-Vector algorithm, and outputs an X-Vector feature Vector corresponding to the student voiceprint sample library;
s420, according to the X-Vector feature vectors corresponding to the student voiceprint sample library, the voiceprint recognition module establishes a PLDA model based on a PLDA algorithm, trains the PLDA model through an EM algorithm and generates a voiceprint recognition model;
s430, the voiceprint recognition module inputs the voiceprint feature information into the X-Vector model to obtain an X-Vector feature Vector corresponding to the voiceprint feature information, inputs the X-Vector feature Vector corresponding to the voiceprint feature information into the voiceprint recognition model, and outputs a first attendance score.
Specifically, in step S410, the cloud server stores a student voiceprint sample library, the student voiceprint sample library includes a plurality of voiceprint sample information of all students and a student name corresponding to each voiceprint sample information, and each voiceprint sample information is subjected to feature extraction processing in advance through a PNCC algorithm.
In the present application, the X-Vector model is trained through a deep neural network structure; the X-Vector model accepts input of arbitrary length and converts it into a fixed-length feature representation, and a data augmentation strategy is adopted during training to strengthen the model's robustness to noise interference. The X-Vector model can be divided into nine layers: the first to fifth layers are frame-level deep convolutional neural network layers, the sixth layer is a statistics pooling layer, the seventh and eighth layers are segment-level fully connected layers, and the ninth layer is a classification layer based on a Softmax classifier. During training of the X-Vector model, the frame-level layers process the time sequence of each sample in the student voiceprint sample library, the statistics pooling layer aggregates the frame-level outputs over time, the segment-level fully connected layers extract the X-Vector feature vectors corresponding to all sample data in the student voiceprint sample library, and the classification layer produces the output.
The loss function of the X-Vector model satisfies the following formula:

F(P, Q) = -Σ_i P(x_i)·log Q(x_i);

wherein F(P, Q) represents the loss function of the X-Vector model, P(x_i) represents the probability distribution of the ith sample value in the student voiceprint sample library, and Q(x_i) represents the probability distribution predicted by the X-Vector model for the ith sample in the student voiceprint sample library.
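A compact PyTorch sketch of the nine-layer structure follows; the layer widths, kernel sizes and speaker count are illustrative values in the spirit of the published X-Vector architecture, not parameters fixed by this embodiment, and the loss is the cross-entropy F(P, Q) above:

    import torch
    import torch.nn as nn

    class XVectorNet(nn.Module):
        """Five frame-level TDNN layers, statistics pooling, two segment-level
        fully connected layers, and a Softmax-based classification layer."""
        def __init__(self, feat_dim=13, n_speakers=500):
            super().__init__()
            self.frame_layers = nn.Sequential(           # layers 1-5 (frame level)
                nn.Conv1d(feat_dim, 512, 5), nn.ReLU(),
                nn.Conv1d(512, 512, 3, dilation=2), nn.ReLU(),
                nn.Conv1d(512, 512, 3, dilation=3), nn.ReLU(),
                nn.Conv1d(512, 512, 1), nn.ReLU(),
                nn.Conv1d(512, 1500, 1), nn.ReLU(),
            )
            self.segment6 = nn.Linear(3000, 512)         # layer 7; x-vectors read here
            self.segment7 = nn.Linear(512, 512)          # layer 8
            self.out = nn.Linear(512, n_speakers)        # layer 9 (Softmax classifier)

        def forward(self, feats):                        # feats: (batch, feat_dim, frames)
            h = self.frame_layers(feats)
            stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # layer 6: pooling
            xvec = self.segment6(stats)                  # fixed-length embedding
            logits = self.out(torch.relu(self.segment7(torch.relu(xvec))))
            return logits, xvec

    # Training minimizes the cross-entropy loss F(P, Q) defined above:
    # loss = nn.CrossEntropyLoss()(logits, speaker_labels)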
After an X-Vector model is constructed, X-Vector feature vectors corresponding to all sample data in a student voiceprint sample library are obtained through the X-Vector model. In step S420, a PLDA model is constructed according to a PLDA algorithm, X-Vector feature vectors corresponding to all sample data in a student voiceprint sample library are used as input of the PLDA model, the PLDA model is trained through an EM algorithm, parameters of the PLDA model are updated, and then the voiceprint recognition model is generated.
The PLDA model is a channel compensation algorithm used to further extract the speaker information contained in the X-Vector feature vectors. The X-Vector feature vectors corresponding to all sample data in the student voiceprint sample library obtained through the X-Vector model are used as the input of the PLDA model, and the data x_ij input into the PLDA model is defined as the jth X-Vector feature vector of the ith student; x_ij satisfies the following formula:

x_ij = μ + F·h_i + G·w_ij + ε_ij;

wherein μ represents the mean vector, F represents the identity information matrix of each student, G represents the channel information matrix, h_i represents the latent variable of the voiceprint sample information corresponding to the ith student, w_ij represents the latent channel variable of the jth X-Vector feature vector of the ith student, and ε_ij represents the residual of the jth X-Vector feature vector of the ith student.
In this embodiment, the PLDA model is trained through the EM algorithm to obtain the latent variables h_i and w_ij in the above formula, and the voiceprint recognition model is finally generated. The parameter update process of the EM algorithm comprises the following steps: calculating the mean of all data input into the PLDA model and subtracting the mean from each training datum; initializing the channel information matrix and reducing the dimension of the mean through principal component analysis; and calculating the latent variables h_i and w_ij and updating the parameters of the PLDA model according to them.
In step S430, the voiceprint recognition module inputs the voiceprint feature information into the X-Vector model, the X-Vector model extracts the X-Vector feature vector corresponding to the voiceprint feature information, this feature vector is input into the voiceprint recognition model, and the voiceprint recognition model calculates the similarity between the voiceprint feature information and the student's voiceprint sample information and outputs a first attendance score.
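The scoring step can be illustrated with the following sketch, in which cosine similarity stands in for the PLDA log-likelihood-ratio score and the mapping to a 0-100 attendance score is an assumed convention rather than the embodiment's actual scale:

    import numpy as np

    def first_attendance_score(test_xvec, enrolled_xvec):
        """Compare a test x-vector with the student's enrolled x-vector."""
        cos = float(np.dot(test_xvec, enrolled_xvec) /
                    (np.linalg.norm(test_xvec) * np.linalg.norm(enrolled_xvec)))
        return 100.0 * max(0.0, cos)  # clamp negatives; scale to a 0-100 score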
Further, referring to fig. 6, in step S500, a face recognition model is constructed through the YOLOv3 algorithm, and according to the face feature image obtained in step S300, the face recognition model performs face recognition on the face feature image and outputs a second attendance score.
In this embodiment, constructing the face recognition model through the YOLOv3 algorithm comprises the following steps:

S510, the face recognition module calls the student face database stored in the cloud server, divides the student face database into a first training set and a first test set according to a ratio of 8:2, and trains a YOLOv3 network model with the first training set to obtain the face recognition model;
s520, the face recognition module sets a second performance evaluation index, and obtains performance parameters of the face recognition model through the first test set;
s530, the face recognition model obtains a face feature image, the face feature image is input into the face recognition model, and a second attendance score is output.
Specifically, in step S510, a student face database is stored in the cloud server; the student face database comprises frontal face images of all students and annotation information corresponding to each frontal face image, the annotation information comprising the face region of the student's frontal face image and the student name corresponding to that image. After obtaining the student face database, the face recognition module divides it into a first training set and a first test set according to a ratio of 8:2.
It should be noted that the hyper-parameters of the YOLOv3 network model need to be set before the YOLOv3 network is trained, and their setting affects the training effect of the YOLOv3 network model. The hyper-parameters to be set in the present application are the learning rate, the batch size, the number of iterations and the activation function. The learning rate is set through a learning rate decay function, which is used to obtain a learning rate that keeps the loss function of YOLOv3 oscillating in a region near the optimal value; the batch size is set to 25; and the number of iterations of the YOLOv3 network model depends on how the learning rate decays.
Specifically, the learning rate decay function ω satisfies the following formula:

ω = y^x · ω₀;

wherein x represents the number of iterations, y represents the decay rate, and ω₀ represents the initial learning rate.
In step S520, the face recognition module evaluates the performance of the face recognition model through the second performance evaluation index and the first test set. The performance evaluation comprises: dividing the results on the first test set into four categories, namely the number of samples that are actually true and classified as true by the trained network model, actually true but classified as false, actually false but classified as true, and actually false and classified as false; and calculating the proportion, among all samples of the first test set, of the samples that are actually true and classified as true together with the samples that are actually false and classified as false; this proportion represents the accuracy of the face recognition model.
In particular, a standard accuracy is set for the face recognition model; if the accuracy of the face recognition model is not greater than the standard accuracy, the hyper-parameters of the YOLOv3 network model are reset and the YOLOv3 network model is trained again.
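In code, the accuracy described above reduces to the standard confusion-matrix ratio; the retraining threshold of 0.95 below is a hypothetical standard accuracy, not a value given by the embodiment:

    def accuracy(tp, tn, fp, fn):
        """Proportion of correctly classified samples in the first test set:
        tp = actually true, classified true; tn = actually false, classified false."""
        return (tp + tn) / (tp + tn + fp + fn)

    def needs_retraining(tp, tn, fp, fn, standard_accuracy=0.95):
        """Reset the hyper-parameters and retrain if accuracy is not above standard."""
        return accuracy(tp, tn, fp, fn) <= standard_accuracy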
In step S530, the face feature image is acquired and input into the face recognition model; the face recognition model performs face recognition on the face feature image, obtains similarity data between the face feature image and the corresponding student's frontal face image in the student face database, and outputs a second attendance score.
According to the invention, the face recognition model and the voiceprint recognition model are used for checking attendance of students to obtain the first attendance score and the second attendance score, the first attendance score and the second attendance score are integrated to obtain the final attendance integrated score, and the attendance integrated score obtained by the method has higher confidence level.
Further, in step S600, the attendance analysis module assigns a first weight to the first attendance score and a second weight to the second attendance score, and sums the weighted first attendance score and the weighted second attendance score to obtain the attendance comprehensive score.
The first weight and the second weight are obtained through calculation of the accuracy of the face recognition model and the accuracy of the voiceprint recognition model, the accuracy of the face recognition model is used for representing the accuracy of the face recognition model in processing the first test set, and the accuracy of the voiceprint recognition model is used for representing the accuracy of the voiceprint recognition model in processing the student voiceprint sample library.
Further, in step S700, the attendance analysis module presets an attendance qualified score and judges whether the attendance comprehensive score is smaller than the attendance qualified score; if it is smaller, the student's attendance is regarded as not passing, and if it is not smaller, the student's attendance is regarded as passing. In this embodiment, whether or not the student's attendance passes, the attendance analysis module uploads the student's attendance result to the cloud server for storage.
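Steps S600 and S700 amount to a weighted sum followed by a threshold test, sketched below; the passing score of 60 is an assumed example, and in practice the weights would come from the model accuracies described above:

    def attendance_result(first_score, second_score, w1, w2, passing_score=60.0):
        """Fuse the two attendance scores (S600) and decide pass/fail (S700)."""
        comprehensive = w1 * first_score + w2 * second_score
        passed = comprehensive >= passing_score
        return comprehensive, passed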
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that the present invention is not limited to the details of the embodiments shown and described, but is capable of numerous modifications and substitutions without departing from the spirit of the present invention and within the scope of the appended claims.

Claims (9)

1. A composite network class attendance system based on voice recognition and face recognition, characterized by comprising: a communication module, a cloud server, a data acquisition module, a first processing module, a second processing module, a voiceprint recognition module, a face recognition module and an attendance analysis module, wherein the cloud server is connected with the attendance analysis module and the communication module respectively;
the communication module is used for acquiring attendance request information of a teacher and sending the attendance request information to the cloud server;
the cloud server is used for acquiring attendance request information, selecting entry information from the random entry database, displaying the entry information on a display screen used by the student and prompting the student to read aloud;
the student face database comprises frontal face images of all students and annotation information corresponding to each frontal face image; the student voiceprint sample library comprises voiceprint sample information of all students and the student name corresponding to each piece of voiceprint sample information;
the data acquisition module comprises a microphone and a camera; the microphone is used for collecting the original voice of the student reading the entry information aloud, and the camera is used for collecting the original video of the student while reading;
the first processing module is used for preprocessing original voice information, screening effective voice information, extracting the characteristics of the effective voice information and outputting voiceprint characteristic information;
the second processing module is used for capturing an image frame from the original video at an arbitrary moment, the image frame being a frontal face image of the student, denoising the image frame, and performing face detection and facial feature point extraction on the denoised image frame to generate a face feature image;
the voiceprint recognition module is used for constructing a voiceprint recognition model, and the voiceprint recognition model is used for carrying out voiceprint recognition on the voiceprint characteristic information and outputting a first attendance score;
the face recognition module is used for constructing a face recognition model, and the face recognition model carries out face recognition on the face feature image to obtain a second attendance score;
the attendance analysis module is used for assigning a first weight to the first attendance score and a second weight to the second attendance score, and summing the weighted first attendance score and the weighted second attendance score to obtain an attendance comprehensive score;
the attendance analysis module is also used for presetting an attendance qualified score, outputting an attendance result of the student to the cloud server according to the attendance comprehensive score, and storing the attendance result by the cloud server.
2. The system of claim 1, wherein the first processing module records a voiceprint feature processing program, and the voiceprint feature processing program comprises: pre-emphasis, framing, frame shifting and windowing are carried out on original voice information to generate first voice information; extracting the characteristics of the first voice information through fast Fourier transform, a filter bank and discrete cosine transform to generate voiceprint characteristic information;
wherein the pre-emphasis process satisfies the following formula:
Y[n]=X[n]-βX[n-1];
where Y[n] is the pre-emphasized voice signal, X[n] is the nth sampling point of the original voice information, X[n-1] is the (n-1)th sampling point, and β is a constant with β ∈ [0.9, 1.0];
the windowing process satisfies the following formulas:
T[n] = Y[n] · f[n];
f[n] = 0.54 − 0.46 · cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1;
where T[n] is the first voice signal, f[n] is the Hamming window function, and N is the window width.
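A minimal numpy sketch of this preprocessing chain follows. The pre-emphasis coefficient β = 0.97, frame length of 400 samples and frame shift of 160 samples (25 ms / 10 ms at 16 kHz) are assumed values for illustration only.

```python
import numpy as np

def preemphasis(x: np.ndarray, beta: float = 0.97) -> np.ndarray:
    # Y[n] = X[n] - beta * X[n-1], with beta in [0.9, 1.0]
    return np.append(x[0], x[1:] - beta * x[:-1])

def frame_and_window(y: np.ndarray, frame_len: int = 400,
                     frame_shift: int = 160) -> np.ndarray:
    # Split the signal into overlapping frames and apply the Hamming
    # window f[n] = 0.54 - 0.46*cos(2*pi*n/(N-1)) to each frame
    n_frames = 1 + (len(y) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([y[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])
    return frames * window  # T[n] = Y[n] * f[n], frame by frame
```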
3. The system of claim 1, wherein the second processing module records a facial feature processing program that: intercepts an image frame from the original video at an arbitrary moment; high-pass filters the image frame with the Laplacian operator and denoises the high-pass filtered image frame with a median filter; and performs face detection and facial feature point extraction on the denoised image frame to generate face feature information;
wherein the Laplacian operator satisfies the following formula:
∇²f(x, y) = ∂²f/∂x² + ∂²f/∂y²;
where ∇² denotes the Laplacian operator and f(x, y) is the image frame;
the median filtering process satisfies the following formula:
g(x, y) = med{ f̂(x − k, y − l) | (k, l) ∈ A };
where g(x, y) is the denoised image frame, f̂(x, y) is the high-pass filtered image frame, and A is a two-dimensional template, here selected as a 3 × 3 region.
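A short scipy sketch of this claim-3 denoising chain, under the assumption that scipy's discrete Laplacian and a 3 × 3 median filter match the operators intended by the claim:

```python
import numpy as np
from scipy import ndimage

def denoise_frame(frame: np.ndarray) -> np.ndarray:
    # High-pass filter the image frame with the Laplacian operator
    high_pass = ndimage.laplace(frame.astype(float))
    # Denoise the high-pass filtered frame with a median filter over
    # the 3 x 3 two-dimensional template A
    return ndimage.median_filter(high_pass, size=3)
```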
4. A composite network class attendance method based on voice recognition and face recognition, applied to the composite network class attendance system based on voice recognition and face recognition of any one of claims 1 to 3, characterized by comprising the following steps:
the communication module acquires attendance request information from a teacher; after receiving the attendance request information, the cloud server randomly selects entry information from the random entry database and prompts the student to read it aloud; the data acquisition module acquires the original voice information of the student reading aloud and the original video of the student while reading;
the first processing module preprocesses the original voice information to generate first voice information, extracts features from the first voice information and outputs voiceprint feature information;
the second processing module intercepts an image frame from the original video at an arbitrary moment, denoises the image frame, and performs face detection and facial feature point extraction on the denoised image frame to generate face feature information;
the voiceprint recognition module builds a voiceprint recognition model based on the X-Vector algorithm and the PLDA algorithm; the voiceprint recognition model performs voiceprint recognition on the voiceprint feature information and outputs a first attendance score;
the face recognition module constructs a face recognition model based on the YOLOv3 algorithm; the face recognition model performs face recognition on the face feature image and outputs a second attendance score;
the attendance analysis module assigns a first weight to the first attendance score and a second weight to the second attendance score, and accumulates the two weighted scores to obtain a composite attendance score;
the attendance analysis module presets a qualified attendance score, determines the student's attendance result from the composite attendance score, and outputs the attendance result to the cloud server, which stores it.
5. The method of claim 4, wherein extracting features from the first voice information and outputting voiceprint feature information comprises:
performing a fast Fourier transform on the first voice information to obtain its linear spectrum; wherein the transform satisfies the following formula:
U(k) = Σ_{n=0}^{N−1} T[n] · e^{−j2πnk/N}, 0 ≤ k ≤ N − 1;
where U(k) is the linear spectrum of the first voice information, T[n] is the first voice signal, and N is the window width of the window function used in the fast Fourier transform;
taking the modulus of the linear spectrum and squaring it to obtain the discrete power spectrum of the first voice information; wherein the discrete power spectrum satisfies the following formula:
P(k) = |U(k)|²;
where P(k) is the discrete power spectrum of the first voice information;
constructing a Gammatone filter bank and performing frequency integration on the discrete power spectrum through the Gammatone filter bank; wherein the time-domain impulse response of the Gammatone filter satisfies the following formula:
g(t) = c · t^{n−1} · e^{−2πbt} · cos(2πf₀t + φ), t ≥ 0;
where c is a proportionality coefficient, n is the order of the Gammatone filter, b is the time attenuation coefficient, f₀ is the center frequency of the Gammatone filter, and φ is the phase of the Gammatone filter;
calculating the long-time frame power of the first voice information and masking and suppressing noise other than speech; wherein the long-time frame power satisfies the following formula:
Q(i, j) = (1 / (2M + 1)) · Σ_{i′=i−M}^{i+M} P(i′, j);
where Q(i, j) is the long-time frame power and P(i′, j) is the power spectrum of the current frame and of the M frames before and after it;
normalizing in the time and frequency domains;
and applying a power-law nonlinearity to the time-frequency-normalized power spectrum and reducing its dimensionality through a discrete cosine transform to finally obtain the PNCC coefficients, which represent the voiceprint feature information.
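The following Python sketch traces this claim-5 pipeline end to end. It is heavily simplified: the Gammatone filter bank matrix is taken as given, the time-frequency normalization is reduced to a plain mean normalization, and PNCC's full asymmetric noise suppression is omitted; the FFT size, the M = 2 averaging half-window, the 1/15 power-law exponent and the 13 output coefficients are all assumptions of this sketch.

```python
import numpy as np
from scipy.fft import dct

def pncc_features(frames: np.ndarray, gammatone_fb: np.ndarray,
                  n_fft: int = 512, M: int = 2, n_ceps: int = 13) -> np.ndarray:
    # frames: windowed frames T[n]; gammatone_fb: (n_bands, n_fft//2 + 1)
    # matrix of Gammatone filter magnitude responses (construction assumed)
    U = np.fft.rfft(frames, n=n_fft, axis=1)        # linear spectrum U(k)
    P = np.abs(U) ** 2                              # discrete power spectrum P(k)
    bands = P @ gammatone_fb.T                      # frequency integration
    # Long-time frame power: average over the M frames before and after
    Q = np.stack([bands[max(0, i - M): i + M + 1].mean(axis=0)
                  for i in range(len(bands))])
    # Simplified time-frequency normalization (stand-in for PNCC's
    # asymmetric noise suppression and mean power normalization)
    normalized = bands * Q / (Q.mean(axis=0, keepdims=True) + 1e-10)
    # Power-law nonlinearity and DCT dimensionality reduction
    nonlinear = np.maximum(normalized, 1e-10) ** (1.0 / 15.0)
    return dct(nonlinear, type=2, axis=1, norm='ortho')[:, :n_ceps]
```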
6. The composite network class attendance method based on voice recognition and face recognition of claim 4, wherein the second processing module intercepting an image frame from the original video at an arbitrary moment, denoising the image frame, and performing face detection and facial feature point extraction on the denoised image frame to generate the face feature information comprises:
the second processing module intercepts an image frame from the original video at an arbitrary moment, high-pass filters the image frame, and denoises the high-pass filtered image frame with a median filter to obtain a first image;
and performs face detection and facial feature point extraction on the first image through a multitask cascaded convolutional neural network to generate the face feature image.
7. The method of claim 6, wherein the multitask cascaded convolutional neural network comprises a recommendation network, an optimization network and an output network; the recommendation network is used for performing regression prediction on the first image and outputting first candidate boxes merged through non-maximum suppression; the optimization network is used for filtering non-face candidate windows out of the first candidate boxes through a multilayer convolutional neural network and, after training through a fully connected layer, outputting second candidate boxes; and the output network is used for filtering overlapping candidate windows out of the second candidate boxes and outputting the face feature image.
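For illustration, this cascade corresponds to the P-Net/R-Net/O-Net structure implemented by the open-source `mtcnn` Python package; the sketch below assumes that package and a placeholder frame file name.

```python
import cv2
from mtcnn import MTCNN  # pip install mtcnn

detector = MTCNN()
# "student_frame.jpg" is a placeholder for an intercepted video frame
frame = cv2.cvtColor(cv2.imread("student_frame.jpg"), cv2.COLOR_BGR2RGB)

# detect_faces runs the full cascade: candidate boxes proposed by the
# first network, non-face windows filtered by the second, and overlapping
# windows removed via non-maximum suppression before output
for det in detector.detect_faces(frame):
    x, y, w, h = det["box"]            # final face candidate window
    points = det["keypoints"]          # five facial feature points
    print(det["confidence"], (x, y, w, h), points)
```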
8. The composite network class attendance method based on voice recognition and face recognition of claim 4, wherein the voiceprint recognition module constructing a voiceprint recognition model based on the X-Vector algorithm and the PLDA algorithm, performing voiceprint recognition on the voiceprint feature information and outputting a first attendance score further comprises:
the voiceprint recognition module acquires the student voiceprint sample library through the cloud server, establishes an X-Vector model from the student voiceprint sample library according to the X-Vector algorithm, and outputs the X-Vector feature vectors corresponding to the student voiceprint sample library;
from the X-Vector feature vectors corresponding to the student voiceprint sample library, the voiceprint recognition module establishes a PLDA model based on the PLDA algorithm and trains the PLDA model through the EM algorithm to generate the voiceprint recognition model;
the voiceprint recognition module inputs the voiceprint feature information into the X-Vector model to obtain the corresponding X-Vector feature vector, inputs that vector into the voiceprint recognition model, and outputs the first attendance score.
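A hedged sketch of this claim-8 scoring step. It assumes SpeechBrain's pretrained x-vector encoder (`speechbrain/spkrec-xvect-voxceleb`; the `speechbrain.pretrained` module path is from SpeechBrain 0.5 — newer releases moved it to `speechbrain.inference`) as a stand-in for a model trained on the student voiceprint sample library, and substitutes cosine similarity for the EM-trained PLDA log-likelihood score.

```python
import numpy as np
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Pretrained x-vector extractor used here in place of a model trained
# on the student voiceprint sample library (an assumption of this sketch)
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb")

def xvector(wav_path: str) -> np.ndarray:
    # Extract the x-vector embedding for one utterance
    signal, _ = torchaudio.load(wav_path)
    return encoder.encode_batch(signal).squeeze().detach().numpy()

def first_attendance_score(test_vec: np.ndarray,
                           enrolled_vec: np.ndarray) -> float:
    # Cosine similarity as a simplified stand-in for PLDA scoring
    return float(np.dot(test_vec, enrolled_vec) /
                 (np.linalg.norm(test_vec) * np.linalg.norm(enrolled_vec)))
```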
9. The composite network class attendance method based on voice recognition and face recognition of claim 4, wherein the face recognition module constructing a face recognition model based on the YOLOv3 algorithm, performing face recognition on the face feature image and outputting a second attendance score further comprises:
the face recognition module calls the student face database stored in the cloud server, divides the student face database into a first training set and a first test set at a ratio of 8:2, and trains a YOLOv3 network with the first training set to generate a face comparison model;
the face recognition module sets a second performance evaluation index and obtains the performance parameters of the face comparison model through the first test set;
and the face recognition module acquires the face feature image, inputs it into the face comparison model, and outputs the second attendance score.
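As a small illustration of the 8:2 split in claim 9, assuming placeholder file paths and labels; training and evaluating the YOLOv3-based comparison model itself is outside the scope of this sketch.

```python
from sklearn.model_selection import train_test_split

# Placeholder student face database: annotated frontal face images
face_image_paths = [f"faces/img_{i:03d}.jpg" for i in range(100)]
student_names = [f"student_{i % 20}" for i in range(100)]

# 8:2 split into the first training set and the first test set
train_x, test_x, train_y, test_y = train_test_split(
    face_image_paths, student_names, test_size=0.2, random_state=0)
print(len(train_x), len(test_x))  # 80 20
```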
CN202210662375.9A 2022-06-13 2022-06-13 Compound network class attendance system and method based on voice recognition and face recognition Pending CN115273863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210662375.9A CN115273863A (en) 2022-06-13 2022-06-13 Compound network class attendance system and method based on voice recognition and face recognition


Publications (1)

Publication Number Publication Date
CN115273863A true CN115273863A (en) 2022-11-01



Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440686A (en) * 2013-07-29 2013-12-11 上海交通大学 Mobile authentication system and method based on voiceprint recognition, face recognition and location service
CN105224850A (en) * 2015-10-24 2016-01-06 北京进化者机器人科技有限公司 Combined right-discriminating method and intelligent interactive system
CN108520565A (en) * 2018-03-07 2018-09-11 南京奥工信息科技有限公司 A kind of synthesis cloud computing platform for long-distance education
CN108399809A (en) * 2018-03-26 2018-08-14 滨州职业学院 Virtual teaching system, cloud platform management system and processing terminal manage system
CN111199741A (en) * 2018-11-20 2020-05-26 阿里巴巴集团控股有限公司 Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium
CN110047504A (en) * 2019-04-18 2019-07-23 东华大学 Method for distinguishing speek person under identity vector x-vector linear transformation
CN110232352A (en) * 2019-06-12 2019-09-13 东北大学 A kind of improved method of the multitask concatenated convolutional neural network model for recognition of face
CN110349598A (en) * 2019-07-15 2019-10-18 桂林电子科技大学 A kind of end-point detecting method under low signal-to-noise ratio environment
CN110852703A (en) * 2019-10-22 2020-02-28 佛山科学技术学院 Attendance checking method, system, equipment and medium based on side face multi-feature fusion face recognition
CN112257591A (en) * 2020-10-22 2021-01-22 安徽天盛智能科技有限公司 Remote video teaching quality evaluation method and system based on machine vision
CN112542174A (en) * 2020-12-25 2021-03-23 南京邮电大学 VAD-based multi-dimensional characteristic parameter voiceprint identification method
CN112735388A (en) * 2020-12-28 2021-04-30 马上消费金融股份有限公司 Network model training method, voice recognition processing method and related equipment
CN113469002A (en) * 2021-06-24 2021-10-01 淮阴工学院 Identity recognition method based on block chain mutual authentication, biological multi-feature recognition and multi-source data fusion
CN114332989A (en) * 2021-12-08 2022-04-12 重庆邮电大学 Face detection method and system of multitask cascade convolution neural network

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
KAIPENG ZHANG et al.: "Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks", IEEE SIGNAL PROCESSING LETTERS, vol. 23, no. 10, 26 August 2016 *
KIM C. et al.: "Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 23 March 2016 *
SNYDER DAVID et al.: "Deep neural network-based speaker embeddings for end-to-end speaker verification", 2016 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, 9 February 2017 *
SNYDER DAVID et al.: "X-vectors: Robust DNN Embeddings for Speaker Recognition", 2018 ICASSP, 13 September 2018 *
LIU Xiang; WANG Mingzhong; CHEN Zhiguang; WU Sijie; YUAN Zizhen; YANG Hongping: "Construction of a live interactive video teaching platform in higher vocational education: the case of Guangzhou Donghua Vocational College", Wireless Internet Technology, no. 04, 25 February 2020 *
WANG Ning: "Research on Speaker Recognition Technology for Compressed Speech", China Master's Theses Full-text Database (Information Science and Technology), 15 March 2018 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination