CN110807370B - Conference speaker identity noninductive confirmation method based on multiple modes - Google Patents
- Publication number
- CN110807370B CN110807370B CN201910968323.2A CN201910968323A CN110807370B CN 110807370 B CN110807370 B CN 110807370B CN 201910968323 A CN201910968323 A CN 201910968323A CN 110807370 B CN110807370 B CN 110807370B
- Authority
- CN
- China
- Prior art keywords
- word
- speaker
- conference
- algorithm
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/55—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/08—Use of distortion metrics or a particular distance between probe pattern and reference templates
Abstract
The invention provides a multi-modal method for non-perceptive confirmation of conference speaker identity. During a conference that uses image, voice, and text modalities, the speaker is confirmed by recognizing the speaker's facial expression, voice, and speaking style. The whole process is automated by artificial-intelligence algorithm models, requires no manual intervention, greatly improves meeting and office efficiency, and achieves high accuracy.
Description
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a multi-modality-based method for non-perceptive confirmation of conference speaker identity.
Background
With economic development, efficient office work has become increasingly inseparable from conference systems. At present, many conference systems need to record the speech content of each speaker for convenient summarizing and reporting. A method that intelligently and quickly distinguishes speakers is therefore needed.
At present, most conference systems record speech content through microphones. To distinguish different speakers, one microphone must be allocated to each speaker, but multiple microphones placed close together cause crosstalk: because the microphones are so near each other, one person's speech may be picked up by several of them, so the other microphones must be switched off whenever someone speaks. Although this does distinguish speakers, it is cumbersome and requires human intervention. A speaker-identity non-perception confirmation method based on multiple modalities such as image, voice, and text is therefore needed.
Disclosure of Invention
To avoid the cumbersome repeated closing and opening of microphones at different positions that traditional microphone allocation requires in a conventional conference, the invention provides a multi-modal method for non-perceptive confirmation of conference speaker identity. Specifically, conference speakers are distinguished by automatically recognizing three aspects of the speaker: expression, voice, and speaking style. The method comprises an expression recognition method based on a deep learning model, a voice recognition method based on an artificial-intelligence algorithm, and a speaking-content recognition method based on a text clustering algorithm.
In the expression recognition method based on the deep learning model, face photos of speakers at the conference site are first collected and preprocessed with operations such as random interference, deformation, and rotation; multiple groups of training sets are then generated with a GAN network; the sample data are trained with a Faster R-CNN model; and finally the deep learning model is generated.
As an improvement, the voice recognition method comprises the following specific steps:
(1) Data acquisition and processing
Collect conference-site voice data in real time, segment the data into 4-8 second intervals, and denoise each segment as a processing unit;
(2) Model building and training
Assume the training speech data consists of multiple utterances from multiple persons, where the j-th utterance of the i-th person is defined as X_ij. The model is constructed as: X_ij = μ + F·h_i + G·w_ij + ε_ij, where μ is the data mean, F and G are subspace feature matrices, h_i and w_ij are the corresponding latent factors, and ε_ij is the residual noise term; after construction, the training process is solved iteratively with the EM algorithm;
(3) Model testing
Whether two utterances come from the same speaker is decided by how likely they were generated from the same feature h_i in speaker space, computed as a log-likelihood ratio score:

score(η_1, η_2) = log [ p(η_1, η_2 | H_s) / ( p(η_1 | H_d) · p(η_2 | H_d) ) ]

where η_1 and η_2 are the two test utterances; H_s and H_d denote the hypotheses that the two utterances come from the same space and from different spaces, respectively; p(η_1, η_2 | H_s) is the probability that they come from the same space; and p(η_1 | H_d), p(η_2 | H_d) are the probabilities that each belongs to its own separate space.
As an improvement, a text clustering algorithm is adopted to identify the speech content. The text clustering method comprises two parts: sentence-vector representation and text clustering. All sentences are first represented as vectors, and the sentence vectors are then clustered with the DBSCAN algorithm.
As an improvement, the Skip-gram model of the word2vec tool is adopted to train word vectors for the text, forming a word-vector matrix X ∈ R^{m×n}, in which x_i ∈ R^m is the word vector of feature word i in m-dimensional space. The Euclidean distance between two vectors is d(w_i, w_j) = ||x_i − x_j||_2, where d(w_i, w_j) is the semantic distance between feature word i and feature word j, and x_i and x_j are the word vectors corresponding to feature words w_i and w_j.
As an improvement, the Skip-gram model includes an input layer, a projection layer, and an output layer. The input layer is the current feature word, whose word vector is denoted W_t ∈ R^m; the output layer gives the probabilities of the words in the context window of the feature word; the projection layer is used to maximize the value of the objective function L.
As an improvement, assume a word sequence w_1, w_2, …, w_N. The objective function is written as:

L = (1/N) · Σ_{j=1}^{N} Σ_{−c ≤ k ≤ c, k ≠ 0} log p(w_{j+k} | w_j)

where N is the length of the word sequence; c is the context length of the current feature word, taken as 5-10 words; and p(w_{j+k} | w_j) is the probability that context feature word w_{j+k} occurs given that the current word w_j has occurred.
When text clustering is performed on all sentence-vector representations with the DBSCAN algorithm and the number of speakers is known, the cluster count matching that number is obtained by adjusting the algorithm's radius and minimum-points parameters; the corresponding text clusters are obtained, and the speech content of different speakers is thereby separated.
The beneficial effects are as follows: the multi-modality-based conference speaker identity non-perceptive confirmation method confirms the speaker's identity by recognizing the speaker's expression, voice, and speaking style during a conference that uses image, voice, and text modalities. The whole process is automated by an artificial-intelligence algorithm model, requires no manual intervention, greatly improves meeting and office efficiency, and achieves high accuracy.
Drawings
Fig. 1 is a flow chart illustrating the principle of the present invention.
FIG. 2 is a schematic diagram of the DBSCAN algorithm of the present invention.
Detailed Description
The drawings of the invention are further described below in conjunction with the embodiments.
The invention is based on a conference with image, voice, and text modalities and confirms the speaker's identity by recognizing the speaker's expression, voice, and speaking style; the whole process is automated without manual intervention. Specifically:
(1) Speaker expression recognition
A person's expression while speaking differs greatly from their expression while not speaking. From real-time video of the conference site, each participant's expression is recognized with a deep learning model, the speaking state is judged, and the speaker is confirmed;
(2) Speaker voice recognition
Each person's voice differs greatly in frequency and tone. From real-time audio of the conference site, speakers are distinguished with an artificial-intelligence algorithm, determining the speaker's identity;
(3) Speaker speech style identification
When the above two approaches perform poorly, the text of the speech content produced by speech recognition can be classified with a clustering algorithm into as many paragraph categories as there are known speakers, thereby distinguishing the speakers' identities.
For speaker expression recognition, face photos of speakers at the conference site are first collected and preprocessed, including random interference, deformation, and rotation; multiple groups of training sets are generated with a GAN network; sample data are trained with a Faster R-CNN model; and finally the deep learning model is generated.
Example 1
About 1,000 pictures of speakers' faces at a conference site are collected and manually classified into two categories, speaking and not speaking. Basic operations such as random interference, deformation, and rotation are then applied, and a GAN network is used to generate a larger training set, about 10 times the size of the source data set. The sample data are trained with a Faster R-CNN model, and the accuracy of the final model reaches 85%.
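As an illustrative sketch only, the random-interference, deformation, and rotation operations mentioned above might look as follows (the GAN-based set generation and Faster R-CNN training are omitted; all function names and parameter values are invented for illustration and are not part of the patent):

```python
import numpy as np

def random_noise(img, scale=10.0, rng=None):
    # Add Gaussian "random interference" to a uint8 grayscale image.
    rng = rng if rng is not None else np.random.default_rng(0)
    noisy = img.astype(np.float64) + rng.normal(0.0, scale, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def horizontal_flip(img):
    # A simple "deformation": mirror the image left to right.
    return img[:, ::-1]

def rotate90(img, k=1):
    # Rotate by multiples of 90 degrees (arbitrary angles would need e.g. OpenCV).
    return np.rot90(img, k)

def augment(img, n=10):
    # Produce n randomized variants of one face photo.
    rng = np.random.default_rng(42)
    out = []
    for _ in range(n):
        v = random_noise(img, scale=rng.uniform(5, 20), rng=rng)
        if rng.random() < 0.5:
            v = horizontal_flip(v)
        v = rotate90(v, k=int(rng.integers(0, 4)))
        out.append(v)
    return out

face = np.zeros((32, 32), dtype=np.uint8)   # stand-in for a collected face photo
variants = augment(face, n=10)
print(len(variants))  # 10
```

In practice such classical augmentations are applied before or alongside GAN-based generation to enlarge the training set roughly tenfold, as described in the example.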
For speaker voice recognition, the specific embodiment of the invention is as follows:
1) Data acquisition: collect voice data in real time at the conference site, segment the data every 4-8 seconds (preferably 5 seconds), and take each segment as a processing unit;
2) Data processing: because conference-site speech is standard (mostly Mandarin) and the site is quiet with low noise, the data require essentially no processing;
3) Model construction: assume the training speech data consists of the voices of I speakers, each having J distinct utterances of their own, and define the j-th utterance of the i-th speaker as X_ij. Following factor analysis, the generative model of X_ij is:

X_ij = μ + F·h_i + G·w_ij + ε_ij

where μ is the data mean, F and G are subspace feature matrices, and ε_ij is the noise term. The model can be viewed in two parts: the first two terms on the right of the equals sign depend only on the speaker and are independent of any particular utterance; this is called the signal part and describes differences between speakers. The last two terms describe differences between different utterances of the same speaker and are called the noise part.

Two latent variables thus describe the data structure of an utterance. The matrices F and G contain the basic factors of the respective latent spaces and can be regarded as feature vectors of those spaces: each column of F corresponds to a feature vector of the between-class (speaker) space, and each column of G corresponds to a feature vector of the within-class space. The vectors h_i and w_ij are the representations in those spaces; for example, h_i can be regarded as the representation of X_ij in speaker space. In the recognition scoring stage, the greater the likelihood that the h_i features of two utterances are identical, the more confidently the two utterances can be attributed to the same speaker.
4) Model training: the parameters μ, F, G and the latent variables h_i, w_ij, ε_ij are solved iteratively with the EM algorithm.
5) Model testing: whether two utterances come from the same speaker is decided by how likely they were generated from the same feature h_i in speaker space, computed with a log-likelihood ratio score:

score(η_1, η_2) = log [ p(η_1, η_2 | H_s) / ( p(η_1 | H_d) · p(η_2 | H_d) ) ]

where η_1 and η_2 are the two test utterances; H_s and H_d denote the hypotheses that they come from the same space and from different spaces, respectively; p(η_1, η_2 | H_s) is the probability that they come from the same space; and p(η_1 | H_d), p(η_2 | H_d) are the probabilities that each belongs to its own separate space. The log-likelihood ratio measures the similarity of the two utterances: the higher the score, the greater the probability that they belong to the same speaker.
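As an illustrative sketch only, the scoring idea can be reduced to a one-dimensional special case of the model X_ij = μ + F·h_i + G·w_ij + ε_ij: take μ = 0, collapse the speaker subspace to a single between-speaker variance, and fold the remaining terms into a within-speaker variance. Under the same-speaker hypothesis the two observations share the speaker factor and are therefore correlated; all variance values here are invented for the example:

```python
import math

def log_gauss2(x1, x2, var, cov):
    # Log-density of a zero-mean bivariate Gaussian with equal marginal variances.
    det = var * var - cov * cov
    quad = (var * (x1 * x1 + x2 * x2) - 2 * cov * x1 * x2) / det
    return -0.5 * (quad + math.log(det)) - math.log(2 * math.pi)

def llr_score(x1, x2, var_between=1.0, var_within=0.25):
    # log p(x1, x2 | same speaker) - log p(x1) p(x2 | different speakers).
    var = var_between + var_within
    same = log_gauss2(x1, x2, var, var_between)  # shared speaker factor -> correlated
    diff = log_gauss2(x1, x2, var, 0.0)          # independent speaker factors
    return same - diff

# Two utterances with similar speaker features score higher than dissimilar ones.
print(llr_score(1.0, 1.1) > llr_score(1.0, -1.1))  # True
```

A full implementation would estimate F, G, and the noise covariance with EM over many speakers and score full feature vectors, but the ratio of "same space" to "different spaces" likelihoods is the same idea as the formula above.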
For speaker speaking-style recognition, a text clustering algorithm is adopted to identify the speech content. The method comprises two parts: sentence-vector representation and text clustering. All sentences are first represented as vectors, and the sentence vectors are then clustered with the DBSCAN algorithm.
1) Sentence vector representation
The invention adopts the Skip-gram model of the Word2vec tool to train word vectors for the text. The model constructs a Huffman tree based on hierarchical softmax and, given the current input word, predicts the occurrence probabilities of its context words from large-scale unlabeled text data; that is, surrounding words are predicted from the current word. Based on the co-occurrence of words within a sliding window, co-occurrence probabilities between words are computed, so that the word vector generated for each feature word contains some textual structure and semantic information.
The Skip-gram model includes an input layer, a projection layer, and an output layer. The input layer is the current feature word, with word vector W_t ∈ R^m; the output layer gives the probabilities of the words in the context window of the feature word; the purpose of the projection layer is to maximize the value of the objective function L. Assume a word sequence w_1, w_2, …, w_N; the objective function is written as:

L = (1/N) · Σ_{j=1}^{N} Σ_{−c ≤ k ≤ c, k ≠ 0} log p(w_{j+k} | w_j)

where N is the length of the word sequence; c is the context length of the current feature word, generally 5-10 words; and p(w_{j+k} | w_j) is the probability that context feature word w_{j+k} occurs given that the current word w_j has occurred.
All word vectors obtained through Skip-gram training form the word-vector matrix X ∈ R^{m×n}, in which x_i ∈ R^m is the word vector of feature word i in m-dimensional space. The similarity between feature words can be measured by the distance between the corresponding word vectors; the Euclidean distance between two vectors is:

d(w_i, w_j) = ||x_i − x_j||_2

where d(w_i, w_j) is the semantic distance between feature word i and feature word j, and x_i and x_j are the word vectors corresponding to feature words w_i and w_j. The smaller the value of d(w_i, w_j), the smaller the semantic distance between the two feature words and the more similar their semantics. Finally, each sentence vector is obtained by adding up the word vectors of the sentence.
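A minimal sketch of the distance measure and the sentence-vector summation, using hand-made three-dimensional word vectors in place of trained word2vec vectors (the vocabulary and vector values are invented for illustration):

```python
import numpy as np

# Illustrative word vectors; a real system would train these with word2vec.
vecs = {
    "meeting": np.array([0.9, 0.1, 0.0]),
    "conference": np.array([0.8, 0.2, 0.1]),
    "banana": np.array([0.0, 0.1, 0.9]),
}

def distance(wi, wj):
    # d(w_i, w_j) = ||x_i - x_j||_2 : smaller means more similar semantics.
    return float(np.linalg.norm(vecs[wi] - vecs[wj]))

def sentence_vector(words):
    # Sentence vector = sum of the word vectors, as described above.
    return np.sum([vecs[w] for w in words], axis=0)

print(distance("meeting", "conference") < distance("meeting", "banana"))  # True
```

Because "meeting" and "conference" point in similar directions, their distance is small; summing word vectors then gives each sentence a single point that the clustering step below can operate on.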
2) Text clustering
When clustering all sentence-vector representations, the DBSCAN algorithm is adopted; DBSCAN is a density-based algorithm. DBSCAN divides sample points (here, the sentence vectors) into three classes. Core point: a point whose neighborhood contains at least the minimum number of samples, where the neighborhood is the area within a specified radius. Edge point: a point that is not a core point but has a core point in its neighborhood. Noise point: any point that is neither a core point nor an edge point. As shown in Fig. 2, A is a core point, B and C are edge points, and N is a noise point.
Step 1: divide the samples into core points and non-core points according to the number of samples in each neighborhood.
Step 2: divide the non-core points into edge points and noise points according to whether a core point exists in their neighborhood.
Step 3: initialize one cluster for each point.
Step 4: select a core point, traverse the samples in its neighborhood, and merge the clusters of the core point and each such sample.
Step 5: repeat Step 4 until all core points have been visited.
When the number of speakers is known, the cluster count matching that number is obtained by adjusting the algorithm's radius and minimum-points parameters; the corresponding text clusters are obtained, and the speech content of different speakers is separated.
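The five steps above can be sketched as a minimal, unoptimized DBSCAN, assuming the sentence vectors are points in Euclidean space; `eps` and `min_pts` correspond to the radius and minimum-points parameters mentioned in the text:

```python
import numpy as np

def dbscan(points, eps, min_pts):
    # Minimal DBSCAN following the five steps above (a sketch, not production code).
    n = len(points)
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    neighbors = [np.flatnonzero(d[i] <= eps) for i in range(n)]
    core = [i for i in range(n) if len(neighbors[i]) >= min_pts]  # Step 1
    labels = list(range(n))                                       # Step 3

    def find(i):
        # Union-find root lookup with path compression, used to merge clusters.
        while labels[i] != i:
            labels[i] = labels[labels[i]]
            i = labels[i]
        return i

    for i in core:                      # Steps 4-5: merge each core neighborhood
        for j in neighbors[i]:
            labels[find(j)] = find(i)

    core_set = set(core)
    out = []
    for i in range(n):                  # Step 2: non-core with no core neighbor = noise
        if i not in core_set and not any(j in core_set for j in neighbors[i]):
            out.append(-1)
        else:
            out.append(find(i))
    return out

# Two dense groups of "sentence vectors" (two speakers) plus one outlier.
pts = np.array([[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1], [10, 0]])
labels = dbscan(pts, eps=0.5, min_pts=3)
n_clusters = len({l for l in labels if l != -1})
print(n_clusters, labels[-1])  # 2 -1  (two clusters; the outlier is noise)
```

With the number of speakers known, one would sweep `eps` and `min_pts` until `n_clusters` equals that number, then assign each cluster's sentences to one speaker.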
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.
Claims (7)
1. A conference speaker identity noninductive confirmation method based on multiple modes, characterized in that: conference speakers are automatically identified and distinguished according to three aspects of the speaker, namely expression, voice, and speaking style, wherein the method comprises an expression recognition method based on a deep learning model, a voice recognition method based on an artificial intelligence algorithm, and a speaking-content recognition method based on a text clustering algorithm;
the voice recognition method comprises the following specific steps:
(1) Data acquisition and processing
Collecting conference site voice data in real time, segmenting the data at intervals of 4-8 seconds, taking each segment as a processing unit, and carrying out denoising treatment on the data;
(2) Model building and training
Assume that the training data voice contains multiple persons, where the j-th voice of the i-th person is defined as X_ij. The constructed model is: X_ij = μ + F·h_i + G·w_ij + ε_ij, where μ is the data mean, F and G are subspace feature matrices, h_i and w_ij are the corresponding latent factors, and ε_ij is the noise term; after construction, the training process is solved iteratively with the EM algorithm;
(3) Model testing
Whether two voices come from the same speaker is decided by how likely they were generated from the same feature h_i in speaker space, computed as a log-likelihood ratio score:

score(η_1, η_2) = log [ p(η_1, η_2 | H_s) / ( p(η_1 | H_d) · p(η_2 | H_d) ) ]

where η_1 and η_2 are the two test voices; H_s and H_d denote the hypotheses that the two voices come from the same space and from different spaces, respectively; p(η_1, η_2 | H_s) is the probability that they come from the same space; and p(η_1 | H_d), p(η_2 | H_d) are the probabilities that η_1 and η_2 belong to their respective different spaces.
2. The multi-modality based conference speaker identity non-perception confirmation method of claim 1, wherein: in the expression recognition method based on the deep learning model, face photos of speakers at the conference site are first collected and preprocessed, including random interference, deformation, and rotation; multiple groups of training sets are then generated with a GAN network; the sample data are trained with a Faster R-CNN model; and finally the deep learning model is generated.
3. The multi-modality based conference speaker identity non-perception confirmation method of claim 1, wherein: the text clustering algorithm is adopted to identify the speaking content, and the method comprises two parts of sentence vector representation and text clustering, wherein all sentence vector representations are firstly carried out, and then text clustering is carried out on all sentence vector representations through the DBSCAN algorithm.
4. A multi-modal based conference speaker identity non-perception confirmation method as claimed in claim 3, wherein: word-vector training is performed on the text with the Skip-gram model of the word2vec tool, forming the word-vector matrix X ∈ R^{m×n}, in which x_i ∈ R^m is the word vector of feature word i in m-dimensional space; the Euclidean distance between two vectors is d(w_i, w_j) = ||x_i − x_j||_2, where d(w_i, w_j) is the semantic distance between feature word i and feature word j, and x_i and x_j are the word vectors corresponding to feature words w_i and w_j.
5. The multi-modality based conference speaker identity non-perception confirmation method of claim 4, wherein: the Skip-gram model comprises an input layer, a projection layer, and an output layer; the input layer is the current feature word, whose word vector is denoted W_t ∈ R^m; the output layer gives the probabilities of the words in the context window of the feature word; the projection layer is used to maximize the value of the objective function L.
6. The multi-modality based conference speaker identity non-perception confirmation method of claim 5, wherein: assume a word sequence w_1, w_2, …, w_N; the objective function is written as:

L = (1/N) · Σ_{j=1}^{N} Σ_{−c ≤ k ≤ c, k ≠ 0} log p(w_{j+k} | w_j)

where N is the length of the word sequence; c is the context length of the current feature word, taken as 5-10 words; and p(w_{j+k} | w_j) is the probability that context feature word w_{j+k} occurs given that the current word w_j has occurred.
7. A multi-modal based conference speaker identity non-perception confirmation method as claimed in claim 3, wherein: when text clustering is performed on all sentence-vector representations with the DBSCAN algorithm and the number of speakers is known, the cluster count matching that number is obtained by adjusting the algorithm's radius and minimum-points parameters; the corresponding text clusters are obtained, and the speech content of different speakers is thereby separated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910968323.2A CN110807370B (en) | 2019-10-12 | 2019-10-12 | Conference speaker identity noninductive confirmation method based on multiple modes |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910968323.2A CN110807370B (en) | 2019-10-12 | 2019-10-12 | Conference speaker identity noninductive confirmation method based on multiple modes |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110807370A CN110807370A (en) | 2020-02-18 |
CN110807370B true CN110807370B (en) | 2024-01-30 |
Family
ID=69488298
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910968323.2A Active CN110807370B (en) | 2019-10-12 | 2019-10-12 | Conference speaker identity noninductive confirmation method based on multiple modes |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110807370B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113746822B (en) * | 2021-08-25 | 2023-07-21 | 广州市昇博电子科技有限公司 | Remote conference management method and system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107993665A (en) * | 2017-12-14 | 2018-05-04 | 科大讯飞股份有限公司 | Spokesman role determines method, intelligent meeting method and system in multi-conference scene |
CN109960743A (en) * | 2019-01-16 | 2019-07-02 | 平安科技(深圳)有限公司 | Conference content differentiating method, device, computer equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110046941A1 (en) * | 2009-08-18 | 2011-02-24 | Manuel-Devados Johnson Smith Johnson | Advanced Natural Language Translation System |
-
2019
- 2019-10-12 CN CN201910968323.2A patent/CN110807370B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN110807370A (en) | 2020-02-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Kanda et al. | Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers | |
CN106503805B (en) | Machine-learning-based bimodal multi-party dialogue sentiment analysis method | |
CN111524527B (en) | Speaker separation method, speaker separation device, electronic device and storage medium | |
US10515292B2 (en) | Joint acoustic and visual processing | |
EP0549265A2 (en) | Neural network-based speech token recognition system and method | |
CN110111797A (en) | Speaker recognition method based on Gaussian supervectors and a deep neural network | |
CN108735200A (en) | Automatic speaker labeling method | |
CN104538025A (en) | Method and device for converting gestures to Chinese and Tibetan bilingual voices | |
WO2023048746A1 (en) | Speaker-turn-based online speaker diarization with constrained spectral clustering | |
CN113113022A (en) | Method for automatically identifying identity based on voiceprint information of speaker | |
Bellagha et al. | Speaker naming in tv programs based on speaker role recognition | |
CN113239903B (en) | Cross-modal lip-reading adversarial dual-contrast self-supervised learning method | |
CN111091840A (en) | Method for establishing gender identification model and gender identification method | |
CN110807370B (en) | Conference speaker identity noninductive confirmation method based on multiple modes | |
CN111462762B (en) | Speaker vector regularization method and device, electronic equipment and storage medium | |
CN112233655A (en) | Neural network training method for improving voice command word recognition performance | |
CN114970695B (en) | Speaker segmentation clustering method based on non-parametric Bayesian model | |
WO2016152132A1 (en) | Speech processing device, speech processing system, speech processing method, and recording medium | |
CN113516987B (en) | Speaker recognition method, speaker recognition device, storage medium and equipment | |
CN115472182A (en) | Speech emotion recognition method and device based on attention feature fusion in a multi-channel autoencoder | |
Chit et al. | Myanmar continuous speech recognition system using convolutional neural network | |
CN110265003B (en) | Method for recognizing voice keywords in broadcast signal | |
Maruf et al. | Effects of noise on RASTA-PLP and MFCC based Bangla ASR using CNN | |
Kim | Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection. | |
Hussein et al. | Arabic speaker recognition using HMM |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right |
Effective date of registration: 2021-03-11
Address after: Rooms 1201 and 1209, Building C, Xingzhi Science Park, Qixia Economic and Technological Development Zone, Nanjing, Jiangsu Province, 210000
Applicant after: Nanjing Xingyao Intelligent Technology Co.,Ltd.
Address before: Room 1211, Building C, Xingzhi Science Park, 6 Xingzhi Road, Nanjing Economic and Technological Development Zone, Jiangsu Province, 210000
Applicant before: Nanjing Shixing Intelligent Technology Co.,Ltd.
GR01 | Patent grant | ||