CN110807370B - Conference speaker identity noninductive confirmation method based on multiple modes - Google Patents

Conference speaker identity noninductive confirmation method based on multiple modes

Info

Publication number
CN110807370B
CN110807370B (application CN201910968323.2A)
Authority
CN
China
Prior art keywords
word
speaker
conference
algorithm
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910968323.2A
Other languages
Chinese (zh)
Other versions
CN110807370A (en)
Inventor
杨理想
王云甘
周亚
孙振平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Xingyao Intelligent Technology Co ltd
Original Assignee
Nanjing Xingyao Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Xingyao Intelligent Technology Co ltd filed Critical Nanjing Xingyao Intelligent Technology Co ltd
Priority to CN201910968323.2A priority Critical patent/CN110807370B/en
Publication of CN110807370A publication Critical patent/CN110807370A/en
Application granted granted Critical
Publication of CN110807370B publication Critical patent/CN110807370B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G10L 17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a multi-modal method for noninductive confirmation of conference speaker identity. In a conference that uses the multiple modes of images, voice and text, the speaker's identity is confirmed by recognizing the speaker's expression, voice and speaking style. The whole process is automated through artificial-intelligence algorithm models without manual intervention, realizing noninductive confirmation of the speaker's identity, greatly improving meeting and office efficiency, and achieving high accuracy.

Description

Conference speaker identity noninductive confirmation method based on multiple modes
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a conference speaker identity noninductive confirmation method based on multiple modes.
Background
With the development of the economy, efficient office work has become increasingly inseparable from conference systems. At the current stage, many conference systems need to record the speech content of each speaker so that it can be summarized and reported. To meet this need, a method that intelligently and quickly distinguishes speakers is required.
At present, most conference systems use a microphone to record the speaker's voice and thus the speech content. To distinguish different speakers, one microphone must be allocated to each speaker, but allocating multiple microphones causes crosstalk: because the microphones are too close to one another, one person speaking may be picked up as speech on several microphones, so the other microphones must be turned off while one person speaks in order to tell the speakers apart. Although this can distinguish the speakers, it is very cumbersome and requires human intervention. There is therefore a need for a speaker identity noninductive confirmation method based on multiple modes such as image, voice and text.
Disclosure of Invention
In order to solve the cumbersome problem in conventional conferences that, with traditional microphone allocation, microphones at different positions must be repeatedly turned off and on to adjust for distance, the invention provides a multi-modal conference speaker identity noninductive confirmation method. Specifically, conference speakers are distinguished by automatically recognizing three aspects of the speaker: expression, voice and speaking style. The method comprises an expression recognition method based on a deep learning model, a voice recognition method based on an artificial intelligence algorithm, and a speech content recognition method based on a text clustering algorithm.
In the expression recognition method based on the deep learning model, face photo information of speakers at the conference site is first collected; preprocessing operations such as random perturbation, deformation and rotation are applied; several groups of training sets are then generated with a GAN network; the sample data are trained with a Faster R-CNN model; and finally the deep learning model is produced.
As an improvement, the voice recognition method comprises the following specific steps:
(1) Data acquisition and processing
Collecting conference site voice data in real time, segmenting the data at intervals of 4-8 seconds, and processing the data by taking each segment as a processing unit to remove noise;
(2) Model building and training
Assume that the training data speech consists of multiple utterances from multiple persons, where the j-th utterance of the i-th person is defined as X_ij. The model is constructed as X_ij = μ + F·h_i + G·w_ij + ε_ij, where μ is the data mean, F and G are space feature matrices, h_i and w_ij are the feature representations in the respective spaces, and ε_ij is the noise term; after construction, the training process is solved iteratively with the EM algorithm;
(3) Model testing
Whether two utterances come from the same speaker is decided by comparing the likelihood that they were generated by the same feature h_i in speaker space with the likelihood that they were generated by different features, using a log-likelihood-ratio score computed as follows:
score = log [ p(η_1, η_2 | H_s) / ( p(η_1 | H_d) · p(η_2 | H_d) ) ]
where η_1 and η_2 denote the two test utterances; H_s and H_d denote, respectively, the hypotheses that the two test utterances come from the same space and from different spaces; p(η_1, η_2 | H_s) denotes the probability that η_1 and η_2 come from the same space; and p(η_1 | H_d) and p(η_2 | H_d) denote the probabilities that each belongs to its own, different space.
As an improvement, a text clustering algorithm is adopted to identify the speech content. The text clustering method comprises two parts, sentence vector representation and text clustering: sentence vector representations are first computed for all sentences, and text clustering is then performed on all the sentence vector representations with the DBSCAN algorithm.
As an improvement, the Skip-gram model of the word2vec tool is adopted to train word vectors for the text, forming a word vector matrix X ∈ R^(m×n), in which x_i ∈ R^m denotes the word vector of feature word i in the m-dimensional space. The Euclidean distance between two vectors is given by d(w_i, w_j) = ‖x_i − x_j‖_2, where d(w_i, w_j) denotes the semantic distance between feature word i and feature word j, and x_i and x_j denote the word vectors corresponding to feature words w_i and w_j.
As an improvement, the Skip-gram model includes an input layer, a projection layer and an output layer. The input layer is the current feature word, whose word vector is denoted W_t ∈ R^m; the output layer gives the probabilities of the words in the context window of the feature word; and the projection layer is used to maximize the value of the objective function L.
As an improvement, assume a word sequence w_1, w_2, …, w_N; the objective function is written as
L = (1/N) Σ_{j=1}^{N} Σ_{-c ≤ k ≤ c, k ≠ 0} log p(w_{j+k} | w_j)
where N is the length of the word sequence; c denotes the context length of the current feature word, taken as 5-10 words; and p(w_{j+k} | w_j) is the probability that the context feature word w_{j+k} occurs given that the current word w_j has occurred.
When text clustering is performed on all sentence vector representations with the DBSCAN algorithm and the number of speakers is known, the radius and minimum-point-number parameters of the algorithm are adjusted until the number of clusters matches the number of speakers; the corresponding text clusters are obtained, and the speech contents of the different speakers are thereby separated.
The beneficial effects are that: in the multi-modal conference speaker identity noninductive confirmation method, in a conference using the multiple modes of images, voice and text, the speaker's identity is confirmed by identifying the speaker's expression, voice and speaking style. The whole process is automatic: noninductive confirmation of the speaker's identity is achieved through artificial-intelligence algorithm models without any manual intervention, which greatly improves conference and office efficiency, with high accuracy.
Drawings
Fig. 1 is a flow chart illustrating the principle of the present invention.
FIG. 2 is a schematic diagram of the DBSCAN algorithm of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and the embodiments.
In a conference based on the multiple modes of image, voice and text, the invention confirms the identity of the speaker by identifying the speaker's expression, the speaker's voice and the speaker's speaking style, and the whole process can be automated without manual intervention. Specifically:
(1) Speaker expression recognition
A person's expression when speaking differs greatly from the expression when not speaking. From real-time video of the conference site, the expression of each participant is recognized with a deep learning model, the participant's speaking state is judged, and the speaker is confirmed;
(2) Speaker voice recognition
Each person's voice differs greatly in frequency and tone. From real-time audio of the conference site, speakers are distinguished with an artificial intelligence algorithm, and the speaker's identity is thereby determined;
(3) Speaker speech style identification
When the above two approaches perform poorly, the speech-content text obtained from speech recognition can be grouped by a clustering algorithm into a number of categories equal to the known number of speakers, so that the identities of the different speakers are distinguished.
For speaker expression recognition, face photo information of speakers at the conference site is first collected; the information is preprocessed, including random perturbation, deformation and rotation; several groups of training sets are generated with a GAN network; the sample data are trained with a Faster R-CNN model; and finally the deep learning model is produced.
Example 1
About 1000 pictures of speakers' faces at a conference site are collected and manually classified into two categories, speaking and non-speaking. Basic operations such as random perturbation, deformation and rotation are then applied, and a GAN network is used to generate more training data, yielding roughly 10 times the size of the source data set. The sample data are trained with a Faster R-CNN model, and the accuracy of the final model reaches 85%.
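The following is a minimal sketch, under stated assumptions, of two ingredients of this step: the basic perturbation/deformation/rotation augmentation and a Faster R-CNN detector with two foreground classes (speaking face, non-speaking face). It uses PyTorch/torchvision, which the patent does not prescribe, and all hyper-parameter values are illustrative.

```python
# Hedged sketch: basic augmentation plus a two-foreground-class Faster R-CNN.
# Library choice (torch/torchvision), class count and parameter values are
# assumptions for illustration, not values prescribed by the patent.
import torch
import torchvision
from torchvision import transforms as T
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# "Random interference, deformation, rotation" preprocessing on face images.
augment = T.Compose([
    T.ToTensor(),
    T.RandomRotation(degrees=15),                                    # rotation
    T.RandomAffine(degrees=0, translate=(0.05, 0.05), shear=5),      # deformation
    T.Lambda(lambda x: (x + 0.02 * torch.randn_like(x)).clamp(0, 1)),  # noise
])

# Faster R-CNN with 3 classes: background, speaking face, non-speaking face.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=3)
```

The GAN-based enlargement of the training set would sit between these two stages; it is omitted here because the patent does not specify the generator architecture.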
For speaker voice recognition, the specific embodiment of the invention is as follows: 1) Data acquisition: voice data are collected in real time at the conference site and segmented every 4-8 seconds, preferably 5 seconds, with each segment taken as a processing unit; 2) Data processing: because speech at the conference site is standard (mostly Mandarin) and the site is quiet with low noise, the data require essentially no further processing; 3) Model construction: assume the training data speech is made up of the voices of I speakers, each having J distinct utterances of their own, and define the j-th utterance of the i-th speaker as X_ij. Then, following factor analysis, the generative model of X_ij is:
X_ij = μ + F·h_i + G·w_ij + ε_ij
where μ is the data mean, F and G are space feature matrices, and ε_ij is the noise term. This model can be seen as two parts: the first two terms on the right of the equals sign depend only on the speaker and not on any specific utterance of that speaker; they are called the signal part and describe the differences between speakers. The last two terms on the right describe the differences between different utterances of the same speaker and are called the noise part.
Two latent variables are used to describe the data structure of an utterance. The middle two terms on the right of the equals sign are each a matrix multiplied by a vector, and they are the core of the factor analysis. The two matrices F and G contain the basic factors of the respective latent variable spaces and can be regarded as the basis vectors of those spaces: each column of F corresponds to a basis vector of the inter-class (between-speaker) space, and each column of G corresponds to a basis vector of the intra-class (within-speaker) space. The vectors h_i and w_ij are the feature representations in the respective spaces; for example, h_i can be regarded as the representation of X_ij in speaker space. In the recognition scoring stage, the greater the likelihood that the h_i features of two utterances are identical, the more confident one can be that the two utterances belong to the same speaker. 4) Model training: the model parameters μ, F, G and the noise covariance are solved iteratively with the EM algorithm. 5) Model testing: whether two utterances were generated by the same feature h_i in speaker space, or by different features, is measured with the log-likelihood-ratio score:
score = log [ p(η_1, η_2 | H_s) / ( p(η_1 | H_d) · p(η_2 | H_d) ) ]
where η_1 and η_2 denote the two test utterances; H_s and H_d denote, respectively, the hypotheses that the two test utterances come from the same space and from different spaces; p(η_1, η_2 | H_s) denotes the probability that η_1 and η_2 come from the same space; and p(η_1 | H_d) and p(η_2 | H_d) denote the probabilities that each belongs to its own, different space. By computing the log-likelihood ratio, the similarity of the two utterances can be measured: the higher the score, the greater the probability that the two utterances belong to the same speaker.
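As a concrete illustration of step 5), the following is a minimal numpy/scipy sketch of this log-likelihood-ratio score, evaluated in the standard closed form for the Gaussian model above. The function name and the assumption that the EM-trained parameters μ, F, G and the noise covariance are already available are illustrative, not part of the patent.

```python
# Hedged sketch of the log-likelihood-ratio scoring step. Assumes mu, F, G and
# the noise covariance Sigma_eps have already been estimated by EM; all names
# are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr_score(x1, x2, mu, F, G, Sigma_eps):
    """Score two utterance vectors: higher score => more likely same speaker."""
    Sigma_b = F @ F.T                      # between-speaker (signal) covariance
    Sigma_w = G @ G.T + Sigma_eps          # within-speaker + noise covariance
    Sigma_tot = Sigma_b + Sigma_w

    # Joint covariance of [x1; x2] under the same-speaker hypothesis H_s:
    # the shared latent h_i couples the two utterances through Sigma_b.
    joint_mu = np.concatenate([mu, mu])
    joint_cov = np.block([[Sigma_tot, Sigma_b],
                          [Sigma_b,   Sigma_tot]])

    log_p_same = multivariate_normal.logpdf(
        np.concatenate([x1, x2]), joint_mu, joint_cov)
    # Under H_d the two utterances are independent draws from the total model.
    log_p_diff = (multivariate_normal.logpdf(x1, mu, Sigma_tot) +
                  multivariate_normal.logpdf(x2, mu, Sigma_tot))
    return log_p_same - log_p_diff
```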
For speaker speaking style recognition, a text clustering algorithm is adopted to recognize the speech content. The method comprises two parts, sentence vector representation and text clustering: sentence vector representations are first computed for all sentences, and text clustering is then performed on all the sentence vector representations with the DBSCAN algorithm.
1) Sentence vector representation
The invention adopts the Skip-gram model of the Word2vec tool to train word vectors for the text. The model builds a Huffman tree based on Hierarchical Softmax and, from large-scale unlabeled text data, predicts the occurrence probabilities of the context words of the currently input word; that is, the surrounding words are predicted from the current word. Following the principle of word co-occurrence within a window, the co-occurrence probabilities between words are computed with a sliding window, so that the word vector generated for each feature word carries a certain amount of text structure information and semantic information.
The Skip-gram model includes an input layer, a projection layer and an output layer. The input layer is the current feature word, with word vector W_t ∈ R^m; the output layer gives the probabilities of the words in the context window of the feature word; and the purpose of the projection layer is to maximize the value of the objective function L. Assume there is a word sequence w_1, w_2, …, w_N; the objective function is written as
L = (1/N) Σ_{j=1}^{N} Σ_{-c ≤ k ≤ c, k ≠ 0} log p(w_{j+k} | w_j)
where N is the length of the word sequence; c denotes the context length of the current feature word, generally taken as 5-10 words; and p(w_{j+k} | w_j) is the probability that the context feature word w_{j+k} occurs given that the current word w_j has occurred.
All word vectors obtained through Skip-gram model training form the word vector matrix X ∈ R^(m×n), in which x_i ∈ R^m denotes the word vector of feature word i in the m-dimensional space. The similarity between feature words can be measured by the distance between the corresponding word vectors; the Euclidean distance between two vectors is
d(w_i, w_j) = ‖x_i − x_j‖_2
where d(w_i, w_j) denotes the semantic distance between feature word i and feature word j, and x_i and x_j denote the word vectors corresponding to feature words w_i and w_j. The smaller the value of d(w_i, w_j), the smaller the semantic distance between the two feature words and the more similar their meanings. Finally, the sentence vector is obtained by summing the word vectors of the sentence.
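A minimal sketch of this sentence-vector step is given below, using gensim's Word2Vec with Skip-gram (sg=1) and hierarchical softmax (hs=1) as described above. The tokenized transcript sentences, the 100-dimensional vector size and the helper-function names are illustrative assumptions, not values fixed by the patent.

```python
# Hedged sketch: Skip-gram word vectors, sentence vectors by summation, and the
# Euclidean semantic distance. gensim is an assumed implementation of word2vec.
import numpy as np
from gensim.models import Word2Vec

sentences = [["今天", "讨论", "项目", "进度"],
             ["预算", "需要", "重新", "评估"]]          # placeholder tokens

model = Word2Vec(sentences, vector_size=100, window=5,
                 sg=1, hs=1, min_count=1)               # Skip-gram + H-Softmax

def sentence_vector(tokens, model):
    """Sum the word vectors of the tokens, as described above."""
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def semantic_distance(w_i, w_j, model):
    """Euclidean distance d(w_i, w_j) = ||x_i - x_j||_2 between word vectors."""
    return float(np.linalg.norm(model.wv[w_i] - model.wv[w_j]))
```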
2) Text clustering
When all sentence vector representations are clustered, the DBSCAN algorithm is adopted; DBSCAN is a density-based clustering algorithm. It divides the sample points, here the sentence vector representations, into three classes: core points, whose neighborhood contains at least the minimum number of samples (the neighborhood being the area within a specified radius); edge points, which are not core points but have at least one core point in their neighborhood; and noise points, which are neither core points nor edge points. The three classes of points are illustrated in fig. 2, where A is a core point, B and C are edge points, and N is a noise point.
Step 1: divide the samples into core points and non-core points according to the number of samples in each neighborhood.
Step 2: divide the non-core points into edge points and noise points according to whether core points exist in their neighborhoods.
Step 3: initialize one cluster for each point.
Step 4: select a core point, traverse the samples in its neighborhood, and merge the clusters of the core point and of those samples.
Step 5: repeat step 4 until all core points have been visited.
When the number of speakers is known, the radius and minimum-point-number parameters of the algorithm are adjusted until the number of clusters matches the number of speakers, as sketched below; the corresponding text clusters are obtained, and the speech contents of the different speakers are separated.
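A minimal scikit-learn sketch of this tuning loop follows. The eps scan range, the min_samples value, and the variable names X (the sentence-vector matrix) and num_speakers are illustrative assumptions.

```python
# Hedged sketch: scan the radius (eps) until DBSCAN yields as many clusters as
# there are known speakers. scikit-learn is an assumed implementation of DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_by_speaker(X, num_speakers, min_samples=3):
    """Return labels whose non-noise cluster count matches num_speakers."""
    for eps in np.linspace(0.1, 5.0, 50):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # drop noise
        if n_clusters == num_speakers:
            return labels, eps
    return None, None

# labels[k] then indexes the speaker cluster of sentence k; grouping sentences
# by label separates the speech content of the different speakers.
```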
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (7)

1. A conference speaker identity noninductive confirmation method based on multiple modes, characterized in that: conference speakers are automatically identified and distinguished according to three aspects of the speaker, namely expression, voice and speaking style, and the method comprises an expression recognition method based on a deep learning model, a voice recognition method based on an artificial intelligence algorithm and a speech content recognition method based on a text clustering algorithm;
the voice recognition method comprises the following specific steps:
(1) Data acquisition and processing
Collecting conference site voice data in real time, segmenting the data at intervals of 4-8 seconds, taking each segment as a processing unit, and carrying out denoising treatment on the data;
(2) Model building and training
Assume that the training data speech contains the voices of a plurality of persons, where the j-th utterance of the i-th person is defined as X_ij. The model is constructed as X_ij = μ + F·h_i + G·w_ij + ε_ij, where μ is the data mean, F and G are space feature matrices, h_i and w_ij are the feature representations in the respective spaces, and ε_ij is the noise term; after construction, the training process is solved iteratively with the EM algorithm;
(3) Model testing
Whether two utterances come from the same speaker is decided by comparing the likelihood that they were generated by the same feature h_i in speaker space with the likelihood that they were generated by different features, the score being computed as a log-likelihood ratio:
score = log [ p(η_1, η_2 | H_s) / ( p(η_1 | H_d) · p(η_2 | H_d) ) ]
where η_1 and η_2 denote the two test utterances; H_s and H_d denote, respectively, the hypotheses that the two test utterances come from the same space and from different spaces; p(η_1, η_2 | H_s) denotes the probability that η_1 and η_2 come from the same space; and p(η_1 | H_d) and p(η_2 | H_d) denote the probabilities that η_1 and η_2 belong to their own, different spaces.
2. The multi-modality based conference speaker identity non-perception confirmation method of claim 1, wherein: in the expression recognition method based on the deep learning model, face photo information of speakers at the conference site is first collected; information preprocessing, including random perturbation, deformation and rotation, is applied; several groups of training sets are then generated with a GAN network; the sample data are trained with a Faster R-CNN model; and finally the deep learning model is produced.
3. The multi-modality based conference speaker identity non-perception confirmation method of claim 1, wherein: the text clustering algorithm is adopted to identify the speech content and comprises two parts, sentence vector representation and text clustering: sentence vector representations are first computed for all sentences, and text clustering is then performed on all the sentence vector representations through the DBSCAN algorithm.
4. A multi-modal based conference speaker identity non-perception confirmation method as claimed in claim 3, wherein: word vector training is carried out on the text with the Skip-gram model of the word2vec tool to form a word vector matrix X ∈ R^(m×n), in which x_i ∈ R^m denotes the word vector of feature word i in the m-dimensional space; the Euclidean distance between two vectors is given by d(w_i, w_j) = ‖x_i − x_j‖_2, where d(w_i, w_j) denotes the semantic distance between feature word i and feature word j, and x_i and x_j denote the word vectors corresponding to feature words w_i and w_j.
5. The multi-modality based conference speaker identity non-perception confirmation method of claim 4, wherein: the Skip-gram model comprises an input layer, a projection layer and an output layer; the input layer is the current feature word, whose word vector is denoted W_t ∈ R^m; the output layer gives the probabilities of the words in the context window of the feature word; and the projection layer is used to maximize the value of the objective function L.
6. The multi-modality based conference speaker identity non-perception confirmation method of claim 5, wherein: assume there is a word sequence w_1, w_2, …, w_N; the objective function is written as
L = (1/N) Σ_{j=1}^{N} Σ_{-c ≤ k ≤ c, k ≠ 0} log p(w_{j+k} | w_j)
where N is the length of the word sequence; c denotes the context length of the current feature word, taken as 5-10 words; and p(w_{j+k} | w_j) is the probability that the context feature word w_{j+k} occurs given that the current word w_j has occurred.
7. A multi-modal based conference speaker identity non-perception confirmation method as claimed in claim 3, wherein: when text clustering is performed on all sentence vector representations through the DBSCAN algorithm and the number of speakers is known, the radius and minimum-point-number parameters of the algorithm are adjusted until the number of clusters matches the number of speakers; the corresponding text clusters are obtained, and the speech contents of the different speakers are thereby separated.
CN201910968323.2A 2019-10-12 2019-10-12 Conference speaker identity noninductive confirmation method based on multiple modes Active CN110807370B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910968323.2A CN110807370B (en) 2019-10-12 2019-10-12 Conference speaker identity noninductive confirmation method based on multiple modes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910968323.2A CN110807370B (en) 2019-10-12 2019-10-12 Conference speaker identity noninductive confirmation method based on multiple modes

Publications (2)

Publication Number Publication Date
CN110807370A CN110807370A (en) 2020-02-18
CN110807370B true CN110807370B (en) 2024-01-30

Family

ID=69488298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910968323.2A Active CN110807370B (en) 2019-10-12 2019-10-12 Conference speaker identity noninductive confirmation method based on multiple modes

Country Status (1)

Country Link
CN (1) CN110807370B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113746822B (en) * 2021-08-25 2023-07-21 广州市昇博电子科技有限公司 Remote conference management method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107993665A (en) * 2017-12-14 2018-05-04 科大讯飞股份有限公司 Spokesman role determines method, intelligent meeting method and system in multi-conference scene
CN109960743A (en) * 2019-01-16 2019-07-02 平安科技(深圳)有限公司 Conference content differentiating method, device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110046941A1 (en) * 2009-08-18 2011-02-24 Manuel-Devados Johnson Smith Johnson Advanced Natural Language Translation System

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107993665A (en) * 2017-12-14 2018-05-04 科大讯飞股份有限公司 Spokesman role determines method, intelligent meeting method and system in multi-conference scene
CN109960743A (en) * 2019-01-16 2019-07-02 平安科技(深圳)有限公司 Conference content differentiating method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN110807370A (en) 2020-02-18

Similar Documents

Publication Publication Date Title
Kanda et al. Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers
CN106503805B (en) A kind of bimodal based on machine learning everybody talk with sentiment analysis method
CN111524527B (en) Speaker separation method, speaker separation device, electronic device and storage medium
US10515292B2 (en) Joint acoustic and visual processing
EP0549265A2 (en) Neural network-based speech token recognition system and method
CN110111797A (en) Method for distinguishing speek person based on Gauss super vector and deep neural network
CN108735200A (en) A kind of speaker's automatic marking method
CN104538025A (en) Method and device for converting gestures to Chinese and Tibetan bilingual voices
WO2023048746A1 (en) Speaker-turn-based online speaker diarization with constrained spectral clustering
CN113113022A (en) Method for automatically identifying identity based on voiceprint information of speaker
Bellagha et al. Speaker naming in tv programs based on speaker role recognition
CN113239903B (en) Cross-modal lip reading antagonism dual-contrast self-supervision learning method
CN111091840A (en) Method for establishing gender identification model and gender identification method
CN110807370B (en) Conference speaker identity noninductive confirmation method based on multiple modes
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
CN112233655A (en) Neural network training method for improving voice command word recognition performance
CN114970695B (en) Speaker segmentation clustering method based on non-parametric Bayesian model
WO2016152132A1 (en) Speech processing device, speech processing system, speech processing method, and recording medium
CN113516987B (en) Speaker recognition method, speaker recognition device, storage medium and equipment
CN115472182A (en) Attention feature fusion-based voice emotion recognition method and device of multi-channel self-encoder
Chit et al. Myanmar continuous speech recognition system using convolutional neural network
CN110265003B (en) Method for recognizing voice keywords in broadcast signal
Maruf et al. Effects of noise on RASTA-PLP and MFCC based Bangla ASR using CNN
Kim Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection.
Hussein et al. Arabic speaker recognition using HMM

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210311

Address after: 210000 rooms 1201 and 1209, building C, Xingzhi Science Park, Qixia Economic and Technological Development Zone, Nanjing, Jiangsu Province

Applicant after: Nanjing Xingyao Intelligent Technology Co.,Ltd.

Address before: Room 1211, building C, Xingzhi Science Park, 6 Xingzhi Road, Nanjing Economic and Technological Development Zone, Jiangsu Province, 210000

Applicant before: Nanjing Shixing Intelligent Technology Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant