CN110807370B - Conference speaker identity noninductive confirmation method based on multiple modes - Google Patents

Conference speaker identity noninductive confirmation method based on multiple modes

Info

Publication number
CN110807370B
CN110807370B (application CN201910968323.2A)
Authority
CN
China
Prior art keywords
word
speaker
conference
algorithm
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910968323.2A
Other languages
Chinese (zh)
Other versions
CN110807370A (en)
Inventor
杨理想
王云甘
周亚
孙振平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Xingyao Intelligent Technology Co ltd
Original Assignee
Nanjing Xingyao Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Xingyao Intelligent Technology Co ltd filed Critical Nanjing Xingyao Intelligent Technology Co ltd
Priority to CN201910968323.2A priority Critical patent/CN110807370B/en
Publication of CN110807370A publication Critical patent/CN110807370A/en
Application granted granted Critical
Publication of CN110807370B publication Critical patent/CN110807370B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G10L 17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a multi-modal method for noninductive confirmation of conference speaker identity. In a conference that uses the multiple modes of images, voice and text, the speaker's identity is confirmed by recognizing the speaker's expression, voice and speaking style. The whole process is automated through artificial-intelligence algorithm models without manual intervention, realizing noninductive confirmation of the speaker's identity, greatly improving meeting and office efficiency, and achieving high accuracy.

Description

Conference speaker identity noninductive confirmation method based on multiple modes
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a conference speaker identity noninductive confirmation method based on multiple modes.
Background
With the development of the economy, efficient office work has become increasingly inseparable from conference systems. At the current stage, many conference systems need to record the speech content of each speaker so that it can be summarized and reported. To meet this need, a method that intelligently and quickly distinguishes speakers is required.
At present, most conference systems use a microphone to record the speaker's voice and thus the speech content. To distinguish different speakers, one microphone must be allocated to each speaker, but allocating multiple microphones causes crosstalk: because the microphones are too close to one another, one person speaking may be picked up as speech on several microphones, so the other microphones must be turned off while one person speaks in order to tell the speakers apart. Although this can distinguish the speakers, it is very cumbersome and requires human intervention. There is therefore a need for a speaker identity noninductive confirmation method based on multiple modes such as image, voice and text.
Disclosure of Invention
In order to solve the cumbersome problem in conventional conferences that, with traditional microphone allocation, microphones at different positions must be repeatedly turned off and on to adjust for distance, the invention provides a multi-modal conference speaker identity noninductive confirmation method. Specifically, conference speakers are distinguished by automatically recognizing three aspects of the speaker: expression, voice and speaking style. The method comprises an expression recognition method based on a deep learning model, a voice recognition method based on an artificial intelligence algorithm, and a speech content recognition method based on a text clustering algorithm.
In the expression recognition method based on the deep learning model, face photo information of speakers at the conference site is first collected; preprocessing operations such as random perturbation, deformation and rotation are applied; several groups of training sets are then generated with a GAN network; the sample data are trained with a Faster R-CNN model; and finally the deep learning model is produced.
As an improvement, the voice recognition method comprises the following specific steps:
(1) Data acquisition and processing
Collecting conference site voice data in real time, segmenting the data at intervals of 4-8 seconds, and processing the data by taking each segment as a processing unit to remove noise;
(2) Model building and training
Assume that the training data speech consists of multiple utterances from multiple persons, where the j-th utterance of the i-th person is defined as X_ij. The model is constructed as X_ij = μ + F·h_i + G·w_ij + ε_ij, where μ is the data mean, F and G are space feature matrices, h_i and w_ij are the feature representations in the respective spaces, and ε_ij is the noise term; after construction, the training process is solved iteratively with the EM algorithm;
(3) Model testing
Whether two utterances come from the same speaker is decided by comparing the likelihood that they were generated by the same feature h_i in speaker space with the likelihood that they were generated by different features, using a log-likelihood-ratio score computed as follows:
score = log [ p(η_1, η_2 | H_s) / ( p(η_1 | H_d) · p(η_2 | H_d) ) ]
where η_1 and η_2 denote the two test utterances; H_s and H_d denote, respectively, the hypotheses that the two test utterances come from the same space and from different spaces; p(η_1, η_2 | H_s) denotes the probability that η_1 and η_2 come from the same space; and p(η_1 | H_d) and p(η_2 | H_d) denote the probabilities that each belongs to its own, different space.
As an improvement, a text clustering algorithm is adopted to identify the speech content. The text clustering method comprises two parts, sentence vector representation and text clustering: sentence vector representations are first computed for all sentences, and text clustering is then performed on all the sentence vector representations with the DBSCAN algorithm.
As an improvement, the Skip-gram model of the word2vec tool is adopted to train word vectors for the text, forming a word vector matrix X ∈ R^(m×n), in which x_i ∈ R^m denotes the word vector of feature word i in the m-dimensional space. The Euclidean distance between two vectors is given by d(w_i, w_j) = ‖x_i − x_j‖_2, where d(w_i, w_j) denotes the semantic distance between feature word i and feature word j, and x_i and x_j denote the word vectors corresponding to feature words w_i and w_j.
As an improvement, the Skip-gram model includes an input layer, a projection layer and an output layer. The input layer is the current feature word, whose word vector is denoted W_t ∈ R^m; the output layer gives the probabilities of the words in the context window of the feature word; and the projection layer is used to maximize the value of the objective function L.
As an improvement, assume a word sequence w_1, w_2, …, w_N; the objective function is written as
L = (1/N) Σ_{j=1}^{N} Σ_{-c ≤ k ≤ c, k ≠ 0} log p(w_{j+k} | w_j)
where N is the length of the word sequence; c denotes the context length of the current feature word, taken as 5-10 words; and p(w_{j+k} | w_j) is the probability that the context feature word w_{j+k} occurs given that the current word w_j has occurred.
When text clustering is performed on all sentence vector representations with the DBSCAN algorithm and the number of speakers is known, the radius and minimum-point-number parameters of the algorithm are adjusted until the number of clusters matches the number of speakers; the corresponding text clusters are obtained, and the speech contents of the different speakers are thereby separated.
The beneficial effects are that: in the multi-modal conference speaker identity noninductive confirmation method, in a conference using the multiple modes of images, voice and text, the speaker's identity is confirmed by identifying the speaker's expression, voice and speaking style. The whole process is automatic: noninductive confirmation of the speaker's identity is achieved through artificial-intelligence algorithm models without any manual intervention, which greatly improves conference and office efficiency, with high accuracy.
Drawings
Fig. 1 is a flow chart illustrating the principle of the present invention.
FIG. 2 is a schematic diagram of the DBSCAN algorithm of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and the embodiments.
In a conference based on the multiple modes of image, voice and text, the invention confirms the identity of the speaker by identifying the speaker's expression, the speaker's voice and the speaker's speaking style, and the whole process can be automated without manual intervention. Specifically:
(1) Speaker expression recognition
A person's expression when speaking differs greatly from the expression when not speaking. From real-time video of the conference site, the expression of each participant is recognized with a deep learning model, the participant's speaking state is judged, and the speaker is confirmed;
(2) Speaker voice recognition
Each person's voice differs greatly in frequency and tone. From real-time audio of the conference site, speakers are distinguished with an artificial intelligence algorithm, and the speaker's identity is thereby determined;
(3) Speaker speech style identification
When the above two approaches perform poorly, the speech-content text obtained from speech recognition can be grouped by a clustering algorithm into a number of categories equal to the known number of speakers, so that the identities of the different speakers are distinguished.
For speaker expression recognition, face photo information of speakers at the conference site is first collected; the information is preprocessed, including random perturbation, deformation and rotation; several groups of training sets are generated with a GAN network; the sample data are trained with a Faster R-CNN model; and finally the deep learning model is produced.
Example 1
About 1000 pictures of speakers' faces at a conference site are collected and manually classified into two categories, speaking and non-speaking. Basic operations such as random perturbation, deformation and rotation are then applied, and a GAN network is used to generate more training data, yielding roughly 10 times the size of the source data set. The sample data are trained with a Faster R-CNN model, and the accuracy of the final model reaches 85%.
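The following is a minimal sketch, under stated assumptions, of two ingredients of this step: the basic perturbation/deformation/rotation augmentation and a Faster R-CNN detector with two foreground classes (speaking face, non-speaking face). It uses PyTorch/torchvision, which the patent does not prescribe, and all hyper-parameter values are illustrative.

```python
# Hedged sketch: basic augmentation plus a two-foreground-class Faster R-CNN.
# Library choice (torch/torchvision), class count and parameter values are
# assumptions for illustration, not values prescribed by the patent.
import torch
import torchvision
from torchvision import transforms as T
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# "Random interference, deformation, rotation" preprocessing on face images.
augment = T.Compose([
    T.ToTensor(),
    T.RandomRotation(degrees=15),                                    # rotation
    T.RandomAffine(degrees=0, translate=(0.05, 0.05), shear=5),      # deformation
    T.Lambda(lambda x: (x + 0.02 * torch.randn_like(x)).clamp(0, 1)),  # noise
])

# Faster R-CNN with 3 classes: background, speaking face, non-speaking face.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=3)
```

The GAN-based enlargement of the training set would sit between these two stages; it is omitted here because the patent does not specify the generator architecture.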
For speaker voice recognition, the specific embodiment of the invention is as follows: 1) Data acquisition: voice data are collected in real time at the conference site and segmented every 4-8 seconds, preferably 5 seconds, with each segment taken as a processing unit; 2) Data processing: because speech at the conference site is standard (mostly Mandarin) and the site is quiet with low noise, the data require essentially no further processing; 3) Model construction: assume the training data speech is made up of the voices of I speakers, each having J distinct utterances of their own, and define the j-th utterance of the i-th speaker as X_ij. Then, following factor analysis, the generative model of X_ij is:
X_ij = μ + F·h_i + G·w_ij + ε_ij
where μ is the data mean, F and G are space feature matrices, and ε_ij is the noise term. This model can be seen as two parts: the first two terms on the right of the equals sign depend only on the speaker and not on any specific utterance of that speaker; they are called the signal part and describe the differences between speakers. The last two terms on the right describe the differences between different utterances of the same speaker and are called the noise part.
Two latent variables are used to describe the data structure of an utterance. The middle two terms on the right of the equals sign are each a matrix multiplied by a vector, and they are the core of the factor analysis. The two matrices F and G contain the basic factors of the respective latent variable spaces and can be regarded as the basis vectors of those spaces: each column of F corresponds to a basis vector of the inter-class (between-speaker) space, and each column of G corresponds to a basis vector of the intra-class (within-speaker) space. The vectors h_i and w_ij are the feature representations in the respective spaces; for example, h_i can be regarded as the representation of X_ij in speaker space. In the recognition scoring stage, the greater the likelihood that the h_i features of two utterances are identical, the more confident one can be that the two utterances belong to the same speaker. 4) Model training: the model parameters μ, F, G and the noise covariance are solved iteratively with the EM algorithm. 5) Model testing: whether two utterances were generated by the same feature h_i in speaker space, or by different features, is measured with the log-likelihood-ratio score:
score = log [ p(η_1, η_2 | H_s) / ( p(η_1 | H_d) · p(η_2 | H_d) ) ]
where η_1 and η_2 denote the two test utterances; H_s and H_d denote, respectively, the hypotheses that the two test utterances come from the same space and from different spaces; p(η_1, η_2 | H_s) denotes the probability that η_1 and η_2 come from the same space; and p(η_1 | H_d) and p(η_2 | H_d) denote the probabilities that each belongs to its own, different space. By computing the log-likelihood ratio, the similarity of the two utterances can be measured: the higher the score, the greater the probability that the two utterances belong to the same speaker.
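As a concrete illustration of step 5), the following is a minimal numpy/scipy sketch of this log-likelihood-ratio score, evaluated in the standard closed form for the Gaussian model above. The function name and the assumption that the EM-trained parameters μ, F, G and the noise covariance are already available are illustrative, not part of the patent.

```python
# Hedged sketch of the log-likelihood-ratio scoring step. Assumes mu, F, G and
# the noise covariance Sigma_eps have already been estimated by EM; all names
# are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr_score(x1, x2, mu, F, G, Sigma_eps):
    """Score two utterance vectors: higher score => more likely same speaker."""
    Sigma_b = F @ F.T                      # between-speaker (signal) covariance
    Sigma_w = G @ G.T + Sigma_eps          # within-speaker + noise covariance
    Sigma_tot = Sigma_b + Sigma_w

    # Joint covariance of [x1; x2] under the same-speaker hypothesis H_s:
    # the shared latent h_i couples the two utterances through Sigma_b.
    joint_mu = np.concatenate([mu, mu])
    joint_cov = np.block([[Sigma_tot, Sigma_b],
                          [Sigma_b,   Sigma_tot]])

    log_p_same = multivariate_normal.logpdf(
        np.concatenate([x1, x2]), joint_mu, joint_cov)
    # Under H_d the two utterances are independent draws from the total model.
    log_p_diff = (multivariate_normal.logpdf(x1, mu, Sigma_tot) +
                  multivariate_normal.logpdf(x2, mu, Sigma_tot))
    return log_p_same - log_p_diff
```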
For speaker speaking style recognition, a text clustering algorithm is adopted to recognize the speech content. The method comprises two parts, sentence vector representation and text clustering: sentence vector representations are first computed for all sentences, and text clustering is then performed on all the sentence vector representations with the DBSCAN algorithm.
1) Sentence vector representation
The invention adopts the Skip-gram model of the Word2vec tool to train word vectors for the text. The model builds a Huffman tree based on Hierarchical Softmax and, from large-scale unlabeled text data, predicts the occurrence probabilities of the context words of the currently input word; that is, the surrounding words are predicted from the current word. Following the principle of word co-occurrence within a window, the co-occurrence probabilities between words are computed with a sliding window, so that the word vector generated for each feature word carries a certain amount of text structure information and semantic information.
The Skip-gram model includes an input layer, a projection layer and an output layer. The input layer is the current feature word, with word vector W_t ∈ R^m; the output layer gives the probabilities of the words in the context window of the feature word; and the purpose of the projection layer is to maximize the value of the objective function L. Assume there is a word sequence w_1, w_2, …, w_N; the objective function is written as
L = (1/N) Σ_{j=1}^{N} Σ_{-c ≤ k ≤ c, k ≠ 0} log p(w_{j+k} | w_j)
where N is the length of the word sequence; c denotes the context length of the current feature word, generally taken as 5-10 words; and p(w_{j+k} | w_j) is the probability that the context feature word w_{j+k} occurs given that the current word w_j has occurred.
All word vectors obtained through Skip-gram model training form the word vector matrix X ∈ R^(m×n), in which x_i ∈ R^m denotes the word vector of feature word i in the m-dimensional space. The similarity between feature words can be measured by the distance between the corresponding word vectors; the Euclidean distance between two vectors is
d(w_i, w_j) = ‖x_i − x_j‖_2
where d(w_i, w_j) denotes the semantic distance between feature word i and feature word j, and x_i and x_j denote the word vectors corresponding to feature words w_i and w_j. The smaller the value of d(w_i, w_j), the smaller the semantic distance between the two feature words and the more similar their meanings. Finally, the sentence vector is obtained by summing the word vectors of the sentence.
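A minimal sketch of this sentence-vector step is given below, using gensim's Word2Vec with Skip-gram (sg=1) and hierarchical softmax (hs=1) as described above. The tokenized transcript sentences, the 100-dimensional vector size and the helper-function names are illustrative assumptions, not values fixed by the patent.

```python
# Hedged sketch: Skip-gram word vectors, sentence vectors by summation, and the
# Euclidean semantic distance. gensim is an assumed implementation of word2vec.
import numpy as np
from gensim.models import Word2Vec

sentences = [["今天", "讨论", "项目", "进度"],
             ["预算", "需要", "重新", "评估"]]          # placeholder tokens

model = Word2Vec(sentences, vector_size=100, window=5,
                 sg=1, hs=1, min_count=1)               # Skip-gram + H-Softmax

def sentence_vector(tokens, model):
    """Sum the word vectors of the tokens, as described above."""
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def semantic_distance(w_i, w_j, model):
    """Euclidean distance d(w_i, w_j) = ||x_i - x_j||_2 between word vectors."""
    return float(np.linalg.norm(model.wv[w_i] - model.wv[w_j]))
```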
2) Text clustering
When all sentence vector representations are clustered, the DBSCAN algorithm is adopted; DBSCAN is a density-based clustering algorithm. It divides the sample points, here the sentence vector representations, into three classes: core points, whose neighborhood contains at least the minimum number of samples (the neighborhood being the area within a specified radius); edge points, which are not core points but have at least one core point in their neighborhood; and noise points, which are neither core points nor edge points. The three classes of points are illustrated in fig. 2, where A is a core point, B and C are edge points, and N is a noise point.
Step 1: divide the samples into core points and non-core points according to the number of samples in each neighborhood.
Step 2: divide the non-core points into edge points and noise points according to whether core points exist in their neighborhoods.
Step 3: initialize one cluster for each point.
Step 4: select a core point, traverse the samples in its neighborhood, and merge the clusters of the core point and of those samples.
Step 5: repeat step 4 until all core points have been visited.
When the number of speakers is known, the radius and minimum-point-number parameters of the algorithm are adjusted until the number of clusters matches the number of speakers, as sketched below; the corresponding text clusters are obtained, and the speech contents of the different speakers are separated.
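A minimal scikit-learn sketch of this tuning loop follows. The eps scan range, the min_samples value, and the variable names X (the sentence-vector matrix) and num_speakers are illustrative assumptions.

```python
# Hedged sketch: scan the radius (eps) until DBSCAN yields as many clusters as
# there are known speakers. scikit-learn is an assumed implementation of DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_by_speaker(X, num_speakers, min_samples=3):
    """Return labels whose non-noise cluster count matches num_speakers."""
    for eps in np.linspace(0.1, 5.0, 50):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # drop noise
        if n_clusters == num_speakers:
            return labels, eps
    return None, None

# labels[k] then indexes the speaker cluster of sentence k; grouping sentences
# by label separates the speech content of the different speakers.
```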
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (7)

1. A conference speaker identity noninductive confirmation method based on multiple modes, characterized in that: conference speakers are automatically identified and distinguished according to three aspects of the speaker, namely expression, voice and speaking style, and the method comprises an expression recognition method based on a deep learning model, a voice recognition method based on an artificial intelligence algorithm and a speech content recognition method based on a text clustering algorithm;
the voice recognition method comprises the following specific steps:
(1) Data acquisition and processing
Collecting conference site voice data in real time, segmenting the data at intervals of 4-8 seconds, taking each segment as a processing unit, and carrying out denoising treatment on the data;
(2) Model building and training
Assume that the training data speech contains the voices of a plurality of persons, where the j-th utterance of the i-th person is defined as X_ij. The model is constructed as X_ij = μ + F·h_i + G·w_ij + ε_ij, where μ is the data mean, F and G are space feature matrices, h_i and w_ij are the feature representations in the respective spaces, and ε_ij is the noise term; after construction, the training process is solved iteratively with the EM algorithm;
(3) Model testing
Whether two utterances come from the same speaker is decided by comparing the likelihood that they were generated by the same feature h_i in speaker space with the likelihood that they were generated by different features, the score being computed as a log-likelihood ratio:
score = log [ p(η_1, η_2 | H_s) / ( p(η_1 | H_d) · p(η_2 | H_d) ) ]
where η_1 and η_2 denote the two test utterances; H_s and H_d denote, respectively, the hypotheses that the two test utterances come from the same space and from different spaces; p(η_1, η_2 | H_s) denotes the probability that η_1 and η_2 come from the same space; and p(η_1 | H_d) and p(η_2 | H_d) denote the probabilities that η_1 and η_2 belong to their own, different spaces.
2. The multi-modality based conference speaker identity non-perception confirmation method of claim 1, wherein: in the expression recognition method based on the deep learning model, face photo information of speakers at the conference site is first collected; information preprocessing, including random perturbation, deformation and rotation, is applied; several groups of training sets are then generated with a GAN network; the sample data are trained with a Faster R-CNN model; and finally the deep learning model is produced.
3. The multi-modality based conference speaker identity non-perception confirmation method of claim 1, wherein: the text clustering algorithm is adopted to identify the speech content and comprises two parts, sentence vector representation and text clustering: sentence vector representations are first computed for all sentences, and text clustering is then performed on all the sentence vector representations through the DBSCAN algorithm.
4. A multi-modal based conference speaker identity non-perception confirmation method as claimed in claim 3, wherein: word vector training is carried out on the text with the Skip-gram model of the word2vec tool to form a word vector matrix X ∈ R^(m×n), in which x_i ∈ R^m denotes the word vector of feature word i in the m-dimensional space; the Euclidean distance between two vectors is given by d(w_i, w_j) = ‖x_i − x_j‖_2, where d(w_i, w_j) denotes the semantic distance between feature word i and feature word j, and x_i and x_j denote the word vectors corresponding to feature words w_i and w_j.
5. The multi-modality based conference speaker identity non-perception confirmation method of claim 4, wherein: the Skip-gram model comprises an input layer, a projection layer and an output layer; the input layer is the current feature word, whose word vector is denoted W_t ∈ R^m; the output layer gives the probabilities of the words in the context window of the feature word; and the projection layer is used to maximize the value of the objective function L.
6. The multi-modality based conference speaker identity non-perception confirmation method of claim 5, wherein: assume there is a word sequence w_1, w_2, …, w_N; the objective function is written as
L = (1/N) Σ_{j=1}^{N} Σ_{-c ≤ k ≤ c, k ≠ 0} log p(w_{j+k} | w_j)
where N is the length of the word sequence; c denotes the context length of the current feature word, taken as 5-10 words; and p(w_{j+k} | w_j) is the probability that the context feature word w_{j+k} occurs given that the current word w_j has occurred.
7. A multi-modal based conference speaker identity non-perception confirmation method as claimed in claim 3, wherein: when text clustering is performed on all sentence vector representations through the DBSCAN algorithm and the number of speakers is known, the radius and minimum-point-number parameters of the algorithm are adjusted until the number of clusters matches the number of speakers; the corresponding text clusters are obtained, and the speech contents of the different speakers are thereby separated.
CN201910968323.2A 2019-10-12 2019-10-12 Conference speaker identity noninductive confirmation method based on multiple modes Active CN110807370B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910968323.2A CN110807370B (en) 2019-10-12 2019-10-12 Conference speaker identity noninductive confirmation method based on multiple modes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910968323.2A CN110807370B (en) 2019-10-12 2019-10-12 Conference speaker identity noninductive confirmation method based on multiple modes

Publications (2)

Publication Number Publication Date
CN110807370A CN110807370A (en) 2020-02-18
CN110807370B true CN110807370B (en) 2024-01-30

Family

ID=69488298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910968323.2A Active CN110807370B (en) 2019-10-12 2019-10-12 Conference speaker identity noninductive confirmation method based on multiple modes

Country Status (1)

Country Link
CN (1) CN110807370B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113746822B (en) * 2021-08-25 2023-07-21 广州市昇博电子科技有限公司 Remote conference management method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107993665A (en) * 2017-12-14 2018-05-04 科大讯飞股份有限公司 Spokesman role determines method, intelligent meeting method and system in multi-conference scene
CN109960743A (en) * 2019-01-16 2019-07-02 平安科技(深圳)有限公司 Conference content differentiating method, device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110046941A1 (en) * 2009-08-18 2011-02-24 Manuel-Devados Johnson Smith Johnson Advanced Natural Language Translation System

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107993665A (en) * 2017-12-14 2018-05-04 科大讯飞股份有限公司 Spokesman role determines method, intelligent meeting method and system in multi-conference scene
CN109960743A (en) * 2019-01-16 2019-07-02 平安科技(深圳)有限公司 Conference content differentiating method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN110807370A (en) 2020-02-18

Similar Documents

Publication Publication Date Title
Kanda et al. Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers
CN106503805B (en) A kind of bimodal based on machine learning everybody talk with sentiment analysis method
CN111524527B (en) Speaker separation method, speaker separation device, electronic device and storage medium
US10515292B2 (en) Joint acoustic and visual processing
EP0549265A2 (en) Neural network-based speech token recognition system and method
CN110111797A (en) Method for distinguishing speek person based on Gauss super vector and deep neural network
CN108735200A (en) A kind of speaker's automatic marking method
CN104538025A (en) Method and device for converting gestures to Chinese and Tibetan bilingual voices
WO2023048746A1 (en) Speaker-turn-based online speaker diarization with constrained spectral clustering
CN113113022A (en) Method for automatically identifying identity based on voiceprint information of speaker
Bellagha et al. Speaker naming in tv programs based on speaker role recognition
CN113239903B (en) Cross-modal lip reading antagonism dual-contrast self-supervision learning method
CN111091840A (en) Method for establishing gender identification model and gender identification method
CN110807370B (en) Conference speaker identity noninductive confirmation method based on multiple modes
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
CN112233655A (en) Neural network training method for improving voice command word recognition performance
CN114970695B (en) Speaker segmentation clustering method based on non-parametric Bayesian model
WO2016152132A1 (en) Speech processing device, speech processing system, speech processing method, and recording medium
CN113516987B (en) Speaker recognition method, speaker recognition device, storage medium and equipment
CN115472182A (en) Attention feature fusion-based voice emotion recognition method and device of multi-channel self-encoder
Chit et al. Myanmar continuous speech recognition system using convolutional neural network
CN110265003B (en) Method for recognizing voice keywords in broadcast signal
Maruf et al. Effects of noise on RASTA-PLP and MFCC based Bangla ASR using CNN
Kim Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection.
Hussein et al. Arabic speaker recognition using HMM

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210311

Address after: 210000 rooms 1201 and 1209, building C, Xingzhi Science Park, Qixia Economic and Technological Development Zone, Nanjing, Jiangsu Province

Applicant after: Nanjing Xingyao Intelligent Technology Co.,Ltd.

Address before: Room 1211, building C, Xingzhi Science Park, 6 Xingzhi Road, Nanjing Economic and Technological Development Zone, Jiangsu Province, 210000

Applicant before: Nanjing Shixing Intelligent Technology Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant