CN111462762A - Speaker vector regularization method and device, electronic equipment and storage medium - Google Patents

Speaker vector regularization method and device, electronic equipment and storage medium

Info

Publication number
CN111462762A
Authority
CN
China
Prior art keywords
speaker
vector
regularization
recognized
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010218732.3A
Other languages
Chinese (zh)
Other versions
CN111462762B (en)
Inventor
蔡云麒 (Yunqi Cai)
王东 (Dong Wang)
李蓝天 (Lantian Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202010218732.3A priority Critical patent/CN111462762B/en
Publication of CN111462762A publication Critical patent/CN111462762A/en
Application granted granted Critical
Publication of CN111462762B publication Critical patent/CN111462762B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/2132 Feature extraction based on discrimination criteria, e.g. discriminant analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a speaker vector regularization method and device, electronic equipment and a storage medium. The method comprises the following steps: determining a speaker vector of a voice to be recognized; inputting the speaker vector into a discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, wherein the speaker regularization vector obeys a Gaussian distribution as a whole, and the vectors characterizing each speaker within it each obey a Gaussian distribution; the discriminative standard flow model is obtained by training based on sample speaker vectors and the speaker labels corresponding to the sample speaker vectors; and determining the speaker recognition result of the voice to be recognized based on the speaker regularization vector. The method, device, electronic equipment and storage medium provided by the embodiment of the invention are well compatible with the back-end scoring model and improve the performance of the voiceprint recognition system.

Description

Speaker vector regularization method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of voiceprint recognition, in particular to a speaker vector regularization method, a speaker vector regularization device, electronic equipment and a storage medium.
Background
With the development of deep learning technology, voiceprint recognition based on deep speaker characterization vectors has achieved satisfactory recognition performance, and has therefore gradually moved from research laboratories into a variety of practical applications.
In the prior art, the training objective of a speaker vector model is only to distinguish different speakers to the maximum extent, so the distribution of the speaker vectors produced by deep speaker vector model inference is free and unconstrained. However, back-end scoring methods for speaker recognition, such as PLDA (Probabilistic Linear Discriminant Analysis), generally impose specific requirements on the distribution of speaker vectors.
Disclosure of Invention
The embodiments of the invention provide a speaker vector regularization method and device, electronic equipment and a storage medium, to solve the problems that the existing speaker vector model is not well compatible with the back-end scoring model and that the performance of the voiceprint recognition system is consequently poor.
In a first aspect, an embodiment of the present invention provides a speaker vector regularization method, including:
determining a speaker vector of a voice to be recognized;
inputting the speaker vector into a discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, wherein the speaker regularization vector obeys a Gaussian distribution as a whole, and the vectors characterizing each speaker within the speaker regularization vector each obey a Gaussian distribution; the discriminative standard flow model is obtained by training based on a sample speaker vector and a speaker label corresponding to the sample speaker vector;
and determining the speaker recognition result of the voice to be recognized based on the speaker regularization vector.
Optionally, the discriminative standard flow model is trained based on a maximum likelihood estimation method, with a training objective of maximizing the probability of the sample speaker vector.
Optionally, the optimization function used for training the discriminative standard flow model is:

L = Σ_i [ log p_y(z_i) + log |det(∂f⁻¹(x)/∂x)|_{x=x_i} ]

where L is the optimization function, x_i is the i-th sample speaker vector, z_i = f⁻¹(x_i) is the sample speaker regularization vector corresponding to x_i, y is the sample speaker label, p_y(z_i) is the probability density of the regularization vector z_i under sample speaker y, and f is the mapping function of the discriminative standard flow model.
Optionally, a vector characterizing any speaker in the speaker regularization vector includes a first component and a second component;
wherein the first component obeys a conditional distribution associated with the any speaker and the second component obeys a marginal distribution that is not associated with each speaker.
Optionally, the determining the speaker vector of the speech to be recognized specifically includes:
inputting the voice to be recognized into a speaker vector extraction model to obtain a speaker vector of the voice to be recognized, which is output by the speaker vector extraction model;
the speaker vector extraction model is obtained by joint training with a classifier based on sample voice and a speaker label corresponding to the sample voice.
Optionally, the speaker vector extraction model includes a local feature extraction layer, a time sequence feature extraction layer, and a fusion output layer;
the inputting the speech to be recognized into a speaker vector extraction model to obtain the speaker vector of the speech to be recognized output by the speaker vector extraction model specifically includes:
inputting the voice to be recognized into the local feature extraction layer to obtain local features output by the local feature extraction layer;
inputting the voice to be recognized into the time sequence feature extraction layer to obtain the time sequence feature output by the time sequence feature extraction layer;
and inputting the local features and the time sequence features into the fusion output layer to obtain the speaker vector output by the fusion output layer.
Optionally, the determining, based on the speaker regularization vector, a speaker recognition result of the speech to be recognized specifically includes:
and inputting the speaker regularization vector into a back-end scoring model to obtain a speaker recognition result of the speech to be recognized, which is output by the back-end scoring model.
In a second aspect, an embodiment of the present invention provides a speaker vector regularization apparatus, including:
the determining unit is used for determining a speaker vector of the voice to be recognized;
the regularization unit is used for inputting the speaker vector into a discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, the speaker regularization vector integrally obeys Gaussian distribution, and vectors representing all speakers in the speaker regularization vector respectively obey the Gaussian distribution; the discriminative standard flow model is obtained by training based on a sample speaker vector and a speaker label corresponding to the sample speaker vector;
and the recognition unit is used for determining the speaker recognition result of the voice to be recognized based on the speaker regularization vector.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the speaker vector regularization method as described in the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the speaker vector regularization method as described in the first aspect.
According to the speaker vector regularization method and device, the electronic equipment and the storage medium provided by the embodiments of the invention, the speaker vector is regularized through the discriminative standard flow model to obtain a speaker regularization vector with stronger representation capability, from which the speaker recognition result is obtained. This greatly reduces the pressure on the back-end scoring model, is well compatible with the back-end scoring model, and improves the performance of the voiceprint recognition system.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a speaker vector regularization method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of data distribution output by standard flow model transformation according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of data distribution output after transformation by a differentiated standard flow model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a reversible transformation of a discriminative standard flow model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a stackable streaming invertible transformation of a discriminative standard flow model according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a speaker vector regularization apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Voiceprint recognition, also known as speaker recognition, is a biometric identification technique that uses computers and various information recognition techniques to automatically identify a speaker from the voiceprint in the speech signal, which carries the speaker's individual information. A voiceprint is a kind of information in the speech signal: a general term for both the speech features contained in the signal that can characterize the speaker's identity and the speech models built on those features.
Traditional voiceprint recognition techniques are based on statistical models, the most classical of which is the Gaussian mixture model-universal background model (GMM-UBM) architecture. To further enhance the expressive power of speaker characteristics under limited data, various subspace models were proposed in succession, the most notable being the i-vector model. The i-vector model introduced an important concept, the speaker characterization vector (speaker embedding): a fixed-length continuous vector used to characterize the speaker's characteristics. A space describing speaker characteristics is constructed from these speaker characterization vectors.
In recent years, a series of voiceprint recognition models based on deep learning have been proposed, such as the d-vector model and the x-vector model, collectively called deep speaker characterization vectors (deep speaker embeddings). Through further optimization of model structure, pooling strategies, and training criteria, deep characterization vectors have achieved the most advanced recognition performance to date. Despite this great progress, however, deep speaker vector models share a defect: the training objective only maximally distinguishes different speakers and does not consider the spatial distribution of the speaker vectors.
Fig. 1 is a schematic flowchart of a speaker vector regularization method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
step 101, determining a speaker vector of a voice to be recognized;
specifically, the speech to be recognized contains speech data from different speakers. The speaker vector is a deep speaker characterization vector, and particularly, a continuous vector with a fixed length is used for characterizing the characteristics of each speaker. A space describing the speaker characteristics is constructed through the speaker vector. In the space, the distribution of the speaker vector is free and unconstrained, and is specifically represented by: the distribution of each speaker is extremely complex; the distribution between different speakers is significantly different.
For the back-end scoring method of the voiceprint recognition mainstream, the speaker vector is mostly required to meet the prior assumption of Gaussian distribution, namely, the prior probability of the speaker and the conditional probability of each speaker are assumed to meet the Gaussian distribution. Obviously, the speaker vector of the speech to be recognized cannot well meet the prior requirement of the back-end scoring method, and therefore the speaker vector of the speech to be recognized needs to be regularized.
The speaker vector of the speech to be recognized can be obtained through different speaker vector extraction models, and the embodiment of the invention does not specifically limit the type and the internal structure of the speaker vector extraction model.
Step 102, inputting the speaker vector into a discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, wherein the speaker regularization vector obeys a Gaussian distribution as a whole, and the vectors characterizing each speaker within it each obey a Gaussian distribution; the discriminative standard flow model is obtained by training based on sample speaker vectors and the corresponding speaker labels;
specifically, a standard flow model (NF) is a deep generative model that has emerged in recent years. Similar to most generative models, the essential goal of standard flow models is to fit the distribution of the data space. The standard flow model uses a simple normal distribution and an invertible mapping to fit the true data distribution. Although normalization of observed variables to a standard gaussian distribution is well based on a standard flow model, there is one significant drawback: the standard flow model is only an edge distribution that optimizes the data. This means that data from different classes tend to converge together in hidden space and the distribution of the classes (conditional distribution) is still non-gaussian. The standard flow model is difficult to be effective for open discrimination tasks including voiceprint recognition.
Compared with a standard flow model, a Discriminative standard flow model (DNF) considers class labels of data in the process of optimizing data distribution, allows the whole hidden variable space to obey gaussian distribution, and allows various classes to obey gaussian distribution in the hidden variable space, so that regularization of edge distribution (whole data space) and condition distribution (data space of each class) of an observed variable is realized.
For example, there are 3 observed variables in a certain observation space, x1, x2, and x3, respectively. Fig. 2 is a schematic diagram of data distribution output by transformation of a standard flow model according to an embodiment of the present invention, and as shown in fig. 2, the standard flow model can well normalize an observation variable into a global standard gaussian distribution, but x1, x2, and x3 are grouped together in a transformed space. Fig. 3 is a schematic diagram of data distribution output by the discriminative standard flow model transformation according to the embodiment of the present invention, and as shown in fig. 3, the discriminative standard flow model considers a class label of data in the process of optimizing data distribution, and after optimization, not only the entire data space obeys gaussian distribution, but also the data space of each class also obeys gaussian distribution.
The speaker vector is input into the discriminative standard flow model, which takes into account the influence of the speaker label on the speaker vector distribution, optimizes the data distribution of the speaker vectors, and outputs speaker regularization vectors. Unlike the standard flow model, which regularizes the speaker vectors only as a whole and ignores the distribution of the vectors characterizing each individual speaker, the discriminative standard flow model outputs speaker regularization vectors that obey a Gaussian distribution as a whole, while the vectors characterizing each speaker also each obey a Gaussian distribution.
Before step 102 is executed, the discriminative standard flow model may be trained in advance, specifically as follows: first, a large number of sample speaker vectors and their corresponding speaker labels are collected, and an initial discriminative standard flow model is trained on them. The trained, optimized model learns a normalized space between the sample speaker vectors and their regularization vectors, determined on the basis of the speaker labels and discriminative between sample speakers, thereby realizing regularization of the speaker vector.
Step 103, determining a speaker recognition result of the voice to be recognized based on the speaker regularization vector.
Specifically, the speaker regularization vectors obtained after regularization by the discriminative standard flow model obey a Gaussian distribution as a whole, and the vectors characterizing each speaker within them each obey a Gaussian distribution, so they satisfy the prior assumptions required by the back-end scoring model, improving compatibility with it. The speaker regularization vector is input into a back-end scoring model, which selects reasonable parameters through training so that the distinguishability between different speakers is stronger.
According to the speaker vector regularization method provided by the embodiment of the invention, the speaker vector is regularized through the discriminative standard flow model, so that the speaker regularization vector with stronger representation capability is obtained, the prior assumption required by the back-end scoring model is met, the pressure of the back-end scoring model is greatly reduced, the speaker regularization vector can be well compatible with the back-end scoring model, and the performance of a voiceprint recognition system is improved.
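To make steps 101 to 103 concrete, the following is a minimal Python sketch of the overall pipeline. The names extractor, dnf_flow, and scorer are hypothetical placeholders for the speaker vector extraction model, the discriminative standard flow model, and the back-end scoring model described above; they are assumptions for illustration, not components named by the patent.

```python
# Minimal pipeline sketch; `extractor`, `dnf_flow` and `scorer` are
# hypothetical stand-ins for the three models described in steps 101-103.
def recognize(speech_feats, enrolled, extractor, dnf_flow, scorer):
    x = extractor(speech_feats)    # step 101: speaker vector of the speech
    z = dnf_flow.normalize(x)      # step 102: regularization vector, z = f^-1(x)
    # step 103: score the regularized vector against each enrolled speaker
    scores = {spk: scorer(z, z_enr) for spk, z_enr in enrolled.items()}
    return max(scores, key=scores.get)
```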
Based on the above embodiment, the discriminative standard flow model is trained based on the maximum likelihood estimation method, and the training target is the probability maximization of the sample speaker vector.
Specifically, for the training of the discriminative standard flow model in the above embodiment, the training criterion based on the maximum likelihood is adopted, the speaker labels corresponding to the sample speaker vectors are used for the probability estimation of all the sample speaker vectors, and the training target is the probability maximization of the sample speaker vectors.
The operation of the discriminative standard flow model is illustrated below by way of example. Fig. 4 is a schematic diagram of the reversible transformation of the discriminative standard flow model according to an embodiment of the present invention. As shown in fig. 4, the observed variable x represents the speaker vector, the hidden variable z represents the speaker regularization vector, and the category y represents the speaker label. The hidden variable z and the observed variable x are linked by a reversible transformation x = f(z); f⁻¹(x) is the inverse function of f(z) and can be viewed as the mapping function of the discriminative standard flow model.
The probability density relationship between the observed variable x and the hidden variable z can be formulated as:

log p_y(x) = log p_y(z) + log |det(∂f⁻¹(x)/∂x)|

where p_y(x) and p_y(z) denote the probability density functions of the observed variable x and the hidden variable z for category y, respectively, and the second term on the right represents the change of entropy between the two distributions during the transformation. To enhance the flexibility of the model, f is typically composed of a series of relatively simple reversible transformations:

f = f_T ∘ f_{T−1} ∘ … ∘ f_1

where f_t denotes a single reversible transformation, t ∈ [1, T]. Through each f_t the variables move closer to the target distribution. Each f_t can be represented by a structured neural network.
Fig. 5 is a schematic diagram of the stackable streaming invertible transformations of the discriminative standard flow model provided by an embodiment of the present invention; when T = 3, the whole transformation process may be as shown in fig. 5.
The entire transformation process can be formulated as:

log p_y(x) = log p_y(z) + Σ_{t=1}^{T} log |det(∂f_t⁻¹(h_t)/∂h_t)|, where h_t = f_t(h_{t−1}), h_0 = z and h_T = x

Based on this probability-distribution transformation property of the discriminative standard flow model, the maximum-likelihood training criterion is adopted to maximize the probability of all observed variables.
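The patent does not fix the form of the structured neural network implementing each f_t. As one common choice (an assumption here, not the patent's prescription), a RealNVP-style affine coupling layer gives an invertible transform whose log-determinant is cheap to compute; a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible transform f_t, sketched as an affine coupling layer;
    the layer type and sizes are illustrative assumptions."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        # small network predicting log-scale and shift for the second half
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        # normalizing direction f_t^-1: data -> latent, returns log|det J|
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)             # bounded scales for stability
        z2 = (x2 - t) * torch.exp(-log_s)
        log_det = -log_s.sum(dim=1)           # log|det dz/dx| of this layer
        return torch.cat([x1, z2], dim=1), log_det

    def inverse(self, z):
        # generating direction f_t: latent -> data
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(z1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        return torch.cat([z1, z2 * torch.exp(log_s) + t], dim=1)
```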
Based on any of the above embodiments, the optimization function for training the discriminative standard flow model is:

L = Σ_i [ log p_y(z_i) + log |det(∂f⁻¹(x)/∂x)|_{x=x_i} ]

where L is the optimization function, x_i is the i-th sample speaker vector, z_i = f⁻¹(x_i) is the sample speaker regularization vector corresponding to x_i, y is the sample speaker label, p_y(z_i) is the probability density of the regularization vector z_i under sample speaker y, and f is the mapping function of the discriminative standard flow model.
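A minimal PyTorch sketch of this optimization function, assuming a stack of invertible layers like the coupling layer above and per-class diagonal Gaussians in z-space (the diagonal form and the parameter names mu and log_var are simplifying assumptions, not the patent's specification):

```python
import math
import torch

def dnf_loss(x, y, layers, mu, log_var):
    """Negative of the objective L: maximize log p_y(z_i) + log|det df^-1/dx|.

    layers: list of invertible layers mapping x -> z (e.g. AffineCoupling)
    mu, log_var: per-class Gaussian parameters, shape [num_classes, dim]
    """
    z, log_det = x, x.new_zeros(x.size(0))
    for layer in layers:               # z = f^-1(x), accumulating log-dets
        z, ld = layer(z)
        log_det = log_det + ld
    # log p_y(z) under the diagonal Gaussian of each sample's speaker class y
    var = log_var[y].exp()
    log_pz = -0.5 * (((z - mu[y]) ** 2 / var) + log_var[y]
                     + math.log(2 * math.pi)).sum(dim=1)
    return -(log_pz + log_det).mean()  # minimize the negative likelihood
```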
The speaker vector regularization method provided by the embodiment of the invention has the advantages of simple structure of the discriminative standard flow model, no excessive calculation overhead, low resource consumption and improvement on the performance of the voiceprint recognition system.
Based on any of the above embodiments, the vector characterizing any speaker in the speaker regularization vector includes a first component and a second component;
wherein the first component obeys a conditional distribution associated with any speaker and the second component obeys a marginal distribution that is not associated with each speaker.
Specifically, the speaker regularization vectors output by the discriminative standard flow model are subjected to gaussian distribution as a whole, and the vectors representing the speakers in the speaker regularization vectors are subjected to gaussian distribution respectively. The vector characterizing any speaker in the speaker regularization vector can be decomposed into a first component and a second component, the first component obeys a conditional Gaussian distribution related to any speaker, and the second component obeys an edge Gaussian distribution unrelated to each speaker.
By way of example, let the observed variable x represent the speaker vector, the hidden variable z represent the speaker regularization vector, and the category y represent the speaker label. For a discriminative standard flow model with observed variable x and hidden variable z, the distribution of each class y can be formulated as:

p_y(z) = N(z; μ_y, Σ_y)

where p_y(z) is the probability density function, μ_y is the mean for category y, and Σ_y is the covariance for category y.
Here μ_y is composed of two parts, μ̂_y and μ_0. Suppose μ_y is an n-dimensional variable: its first p dimensions represent the class-dependent mean μ̂_y of the distribution associated with category y, and its last n−p dimensions represent the class-independent mean μ_0 shared by all variables. Clearly, the μ̂_y of different categories are independent of one another, while μ_0 is shared by all categories. This is expressed as:

μ_y = [μ̂_y; μ_0]
Correspondingly, Σ_y is composed of two parts, Σ̂_y and Σ_0: Σ̂_y is the p × p covariance of the class-dependent distribution associated with category y, and Σ_0 is the (n−p) × (n−p) covariance of the class-independent distribution. Clearly, the Σ̂_y of different categories are independent of one another, while Σ_0 is shared by all categories. This is expressed as:

Σ_y = blockdiag(Σ̂_y, Σ_0)
Through training optimization, the discriminative standard flow model establishes a z-space with class discriminability, in which the distribution p_y(z) of each class y decomposes into two parts: a class-dependent N(μ̂_y, Σ̂_y) distribution and a class-independent N(μ_0, Σ_0) distribution. The discriminative standard flow can thus Gaussianize both the marginal distribution N(μ_0, Σ_0) of the data space and the conditional distributions N(μ̂_y, Σ̂_y), realizing Gaussian regularization of the data space. Correspondingly, the speaker regularization vectors output by the discriminative standard flow model obey a Gaussian distribution as a whole, and the vectors characterizing each speaker within them each obey a Gaussian distribution.
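The block structure of μ_y and Σ_y above can be assembled as follows; a short sketch under the assumption that Σ_y is block-diagonal in the class-dependent and shared parts (parameter names are illustrative):

```python
import torch

def class_gaussian(mu_hat_y, sigma_hat_y, mu0, sigma0):
    """Assemble mu_y = [mu_hat_y; mu0] and a block-diagonal Sigma_y from a
    class-dependent part (first p dims) and a shared part (last n-p dims)."""
    mu_y = torch.cat([mu_hat_y, mu0])        # [n] = [p] + [n-p]
    p = mu_hat_y.numel()
    n = p + mu0.numel()
    sigma_y = torch.zeros(n, n)
    sigma_y[:p, :p] = sigma_hat_y            # class-dependent p x p block
    sigma_y[p:, p:] = sigma0                 # shared (n-p) x (n-p) block
    return mu_y, sigma_y
```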
In the speaker vector regularization method provided by the embodiment of the invention, the discriminative standard flow model inherits the discriminative ability of the discriminative model on different categories and also inherits the description ability of the statistical model on data distribution, and the regularized data distribution has stronger representation ability, so that the voiceprint recognition performance is obviously improved.
Based on any of the above embodiments, determining a speaker vector of a speech to be recognized specifically includes:
inputting the voice to be recognized into a speaker vector extraction model to obtain a speaker vector of the voice to be recognized, which is output by the speaker vector extraction model;
the speaker vector extraction model is obtained by joint training with a classifier based on sample voice and a corresponding speaker label.
Specifically, different speakers' vocal organs (such as the tongue, oral cavity, nasal cavity, vocal cords, and lungs) differ in size and form, and speakers also differ in age, personality, language habits, and so on; as a result, characteristics such as vocal capacity and pronunciation frequency vary greatly between speakers. No two people have identical voiceprints. Voiceprint recognition therefore identifies the speaker based on the acoustic features contained in speech.
The speaker vector extraction model is used for extracting a speaker vector containing acoustic features from the voice to be recognized.
The speaker vector extraction model can be obtained through pre-training, specifically through combined training with a classifier based on sample voices and corresponding speaker labels. The speaker vector extraction model extracts acoustic features corresponding to speakers from input sample voice to generate speaker vectors, and then the speaker vectors are transmitted to a classifier to distinguish different speakers.
Based on any one of the embodiments, the speaker vector extraction model comprises a local feature extraction layer, a time sequence feature extraction layer and a fusion output layer;
inputting the voice to be recognized into the speaker vector extraction model to obtain the speaker vector of the voice to be recognized output by the speaker vector extraction model, and the method specifically comprises the following steps:
inputting the voice to be recognized into a local feature extraction layer to obtain local features output by the local feature extraction layer;
inputting the voice to be recognized into a time sequence feature extraction layer to obtain time sequence features output by the time sequence feature extraction layer;
and inputting the local features and the time sequence features into the fusion output layer to obtain the speaker vector output by the fusion output layer.
Specifically, the speaker vector extraction model needs to extract acoustic features including local features and time sequence features in the speech to be recognized.
The local feature extraction layer is used for extracting the local features of the voice to be recognized, and a convolutional neural network with local feature learning capability can be selected. The time sequence feature extraction layer is used for extracting time sequence features of the voice to be recognized, and a time delay neural network with long-term dynamic description capability can be selected. The fusion output layer is used for splicing the local features and the time sequence features of the voice to be recognized and outputting the speaker vector with distinguishing information.
The fusion output layer may specifically perform pooling operation on the local features and the timing features, so as to obtain a speaker vector, which is not specifically limited in the embodiment of the present invention.
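A minimal PyTorch sketch of such a three-part extractor follows; the specific 2-D CNN for the local layer, the TDNN (dilated 1-D convolutions) for the time-sequence layer, mean pooling, and all layer sizes are illustrative assumptions rather than the patent's specified architecture:

```python
import torch
import torch.nn as nn

class SpeakerVectorExtractor(nn.Module):
    """Local feature layer (2-D CNN) + time-sequence layer (TDNN) +
    fusion output layer (pooling, concatenation, projection)."""
    def __init__(self, feat_dim=40, emb_dim=256):
        super().__init__()
        self.local = nn.Sequential(            # local feature extraction layer
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.temporal = nn.Sequential(         # time-sequence feature layer
            nn.Conv1d(feat_dim, 256, 5, dilation=1), nn.ReLU(),
            nn.Conv1d(256, 256, 3, dilation=2), nn.ReLU(),
        )
        self.fuse = nn.Linear(32 * feat_dim + 256, emb_dim)

    def forward(self, feats):                  # feats: [batch, time, feat_dim]
        local = self.local(feats.unsqueeze(1))           # [B, 32, T, F]
        local = local.mean(dim=2).flatten(1)             # pool time -> [B, 32*F]
        temp = self.temporal(feats.transpose(1, 2)).mean(dim=2)  # [B, 256]
        # fusion output layer: splice the pooled features, project to the vector
        return self.fuse(torch.cat([local, temp], dim=1))
```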
Based on any of the above embodiments, determining a speaker recognition result of the to-be-recognized speech based on the speaker regularization vector specifically includes:
and inputting the speaker regularization vector into the back-end scoring model to obtain the speaker recognition result of the speech to be recognized output by the back-end scoring model.
The back-end scoring model may adopt PLDA. Since the speaker regularization vectors obey a Gaussian distribution as a whole, and the vectors characterizing each speaker also each obey a Gaussian distribution, training and scoring of the back-end scoring model become more accurate; through training, the back-end scoring model selects reasonable parameters so that the distinction between different speakers is stronger, thereby improving the performance of the voiceprint recognition system.
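As one concrete (simplified) form of PLDA scoring, the two-covariance model computes a same-speaker versus different-speaker log-likelihood ratio. The sketch below assumes mean-centred vectors with a between-speaker covariance B and within-speaker covariance W estimated beforehand; it illustrates the scoring rule, not the patent's exact back end:

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_score(x1, x2, B, W):
    """Two-covariance PLDA log-likelihood-ratio score (a simplified sketch;
    B: between-speaker covariance, W: within-speaker covariance)."""
    d = x1.shape[0]
    tot = B + W                          # marginal covariance of one vector
    # joint covariance of [x1; x2] when both come from the same speaker
    joint = np.block([[tot, B], [B, tot]])
    ll_same = multivariate_normal.logpdf(
        np.concatenate([x1, x2]), mean=np.zeros(2 * d), cov=joint)
    ll_diff = (multivariate_normal.logpdf(x1, mean=np.zeros(d), cov=tot)
               + multivariate_normal.logpdf(x2, mean=np.zeros(d), cov=tot))
    return ll_same - ll_diff             # higher => same speaker more likely
```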
The speaker vector regularization method provided by the embodiment of the invention can regularize edge data in complex scenarios, greatly reducing the pressure on the back-end scoring model; the voiceprint recognition system can thus still achieve extremely high recognition accuracy in complex application scenarios, improving its generalization and robustness.
Based on any of the above embodiments, fig. 6 is a schematic structural diagram of a speaker vector regularization apparatus provided in an embodiment of the present invention, as shown in fig. 6, the apparatus includes a determining unit 601, a regularization unit 602, and an identifying unit 603;
the determining unit 601 is configured to determine a speaker vector of a speech to be recognized;
the regularization unit 602 is configured to input the speaker vector to the discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, where the speaker regularization vector entirely obeys gaussian distribution, and vectors representing speakers in the speaker regularization vector respectively obey gaussian distribution; the discriminative standard flow model is obtained by training based on the sample speaker vector and the corresponding speaker label;
the recognition unit 603 is configured to determine a speaker recognition result of the speech to be recognized based on the speaker regularization vector.
The speaker vector regularization device provided by the embodiment of the invention performs regularization on speaker vectors through the discriminative standard flow model to obtain speaker regularization vectors with stronger representation capability, which meet the prior assumption required by the back-end scoring model. This greatly reduces the pressure on the back-end scoring model, is well compatible with it, and improves the performance of the voiceprint recognition system.
Based on any of the above embodiments, the regularization unit 602 includes:
the discriminative standard flow model is obtained by training based on a maximum likelihood estimation method, and the probability of the training target being a sample speaker vector is maximized.
Based on any of the above embodiments, the regularization unit 602 includes:
the optimization function used to train the discriminative standard flow model is:
Figure BDA0002425319140000121
wherein L is an optimization function, xiFor the ith sample speaker vector, ziIs equal to xiThe corresponding sample speaker regular vector, y is a sample speaker label,
Figure BDA0002425319140000122
canonical vector z corresponding to sample speaker yiF is a mapping function representation of the discriminative standard flow model.
Based on any of the above embodiments, the regularization unit 602 includes:
the vector for representing any speaker in the speaker regularization vector comprises a first component and a second component;
wherein the first component obeys a conditional distribution associated with any speaker and the second component obeys a marginal distribution that is not associated with each speaker.
Based on any of the embodiments described above, the determining unit 601 is specifically configured to:
inputting the voice to be recognized into a speaker vector extraction model to obtain a speaker vector of the voice to be recognized, which is output by the speaker vector extraction model;
the speaker vector extraction model is obtained by joint training with a classifier based on sample voice and a corresponding speaker label.
Based on any of the above embodiments, the regularization unit 602 includes:
the speaker vector extraction model comprises a local feature extraction layer, a time sequence feature extraction layer and a fusion output layer;
inputting the voice to be recognized into the speaker vector extraction model to obtain the speaker vector of the voice to be recognized output by the speaker vector extraction model, and the method specifically comprises the following steps:
inputting the voice to be recognized into a local feature extraction layer to obtain local features output by the local feature extraction layer;
inputting the voice to be recognized into a time sequence feature extraction layer to obtain time sequence features output by the time sequence feature extraction layer;
and inputting the local features and the time sequence features into the fusion output layer to obtain the speaker vector output by the fusion output layer.
Based on any of the above embodiments, the identifying unit 603 is specifically configured to:
and inputting the speaker regularization vector into the back-end scoring model to obtain the speaker recognition result of the speech to be recognized output by the back-end scoring model.
Based on any of the above embodiments, fig. 7 is a hardware structure diagram of an electronic device provided in an embodiment of the present invention, and as shown in fig. 7, the electronic device may include: a processor (processor)701, a communication interface (communication interface)704, a memory (memory)702 and a communication bus 703, wherein the processor 701, the communication interface 704 and the memory 702 complete communication with each other through the communication bus 703. The processor 701 may call logic instructions in the memory 702 to perform the following method: determining a speaker vector of a voice to be recognized; inputting the speaker vector into the discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, wherein the speaker regularization vector integrally obeys Gaussian distribution, and vectors representing all speakers in the speaker regularization vector respectively obey the Gaussian distribution; the discriminative standard flow model is obtained by training based on the sample speaker vector and the corresponding speaker label; and determining a speaker recognition result of the voice to be recognized based on the speaker regularization vector.
Furthermore, the logic instructions in the memory 702 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
A non-transitory computer-readable storage medium provided by an embodiment of the present invention has a computer program stored thereon, where the computer program is executed by a processor, and the method provided by the foregoing embodiments includes, for example: determining a speaker vector of a voice to be recognized; inputting the speaker vector into the discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, wherein the speaker regularization vector integrally obeys Gaussian distribution, and vectors representing all speakers in the speaker regularization vector respectively obey the Gaussian distribution; the discriminative standard flow model is obtained by training based on the sample speaker vector and the corresponding speaker label; and determining a speaker recognition result of the voice to be recognized based on the speaker regularization vector.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of speaker vector regularization, comprising:
determining a speaker vector of a voice to be recognized;
inputting the speaker vector into a discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, wherein the speaker regularization vector obeys a Gaussian distribution as a whole, and the vectors characterizing each speaker within the speaker regularization vector each obey a Gaussian distribution; the discriminative standard flow model is obtained by training based on a sample speaker vector and a speaker label corresponding to the sample speaker vector;
and determining the speaker recognition result of the voice to be recognized based on the speaker regularization vector.
2. The speaker vector regularization method as claimed in claim 1 wherein the discriminative standard flow model is trained based on a maximum likelihood estimation method with a training objective to maximize the probability of the sample speaker vector.
3. The speaker vector regularization method according to claim 2, wherein the optimization function used to train the discriminative standard flow model is:

L = Σ_i [ log p_y(z_i) + log |det(∂f⁻¹(x)/∂x)|_{x=x_i} ]

where L is the optimization function, x_i is the i-th sample speaker vector, z_i = f⁻¹(x_i) is the sample speaker regularization vector corresponding to x_i, y is the sample speaker label, p_y(z_i) is the probability density of the regularization vector z_i under sample speaker y, and f is the mapping function of the discriminative standard flow model.
4. The speaker vector regularization method as claimed in claim 1 wherein the vector characterizing any speaker in the speaker regularization vector comprises a first component and a second component;
wherein the first component obeys a conditional distribution associated with the any speaker and the second component obeys a marginal distribution that is not associated with each speaker.
5. The speaker vector regularization method according to any one of claims 1 to 4, wherein said determining a speaker vector of a speech to be recognized specifically comprises:
inputting the voice to be recognized into a speaker vector extraction model to obtain a speaker vector of the voice to be recognized, which is output by the speaker vector extraction model;
the speaker vector extraction model is obtained by joint training with a classifier based on sample voice and a speaker label corresponding to the sample voice.
6. The speaker vector regularization method according to claim 5 wherein said speaker vector extraction model comprises a local feature extraction layer, a temporal feature extraction layer, and a fusion output layer;
the inputting the speech to be recognized into a speaker vector extraction model to obtain the speaker vector of the speech to be recognized output by the speaker vector extraction model specifically includes:
inputting the voice to be recognized into the local feature extraction layer to obtain local features output by the local feature extraction layer;
inputting the voice to be recognized into the time sequence feature extraction layer to obtain the time sequence feature output by the time sequence feature extraction layer;
and inputting the local features and the time sequence features into the fusion output layer to obtain the speaker vector output by the fusion output layer.
7. The speaker vector regularization method according to any one of claims 1 to 4, wherein said determining the speaker recognition result of the speech to be recognized based on the speaker regularization vector specifically comprises:
and inputting the speaker regularization vector into a back-end scoring model to obtain a speaker recognition result of the speech to be recognized, which is output by the back-end scoring model.
8. A speaker vector regularization apparatus comprising:
the determining unit is used for determining a speaker vector of the voice to be recognized;
the regularization unit is used for inputting the speaker vector into a discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, the speaker regularization vector integrally obeys Gaussian distribution, and vectors representing all speakers in the speaker regularization vector respectively obey the Gaussian distribution; the discriminative standard flow model is obtained by training based on a sample speaker vector and a speaker label corresponding to the sample speaker vector;
and the recognition unit is used for determining the speaker recognition result of the voice to be recognized based on the speaker regularization vector.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the speaker vector regularization method as claimed in any one of claims 1 to 7.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the steps of the speaker vector regularization method as claimed in any one of claims 1 to 7.
CN202010218732.3A 2020-03-25 2020-03-25 Speaker vector regularization method and device, electronic equipment and storage medium Active CN111462762B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010218732.3A CN111462762B (en) 2020-03-25 2020-03-25 Speaker vector regularization method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010218732.3A CN111462762B (en) 2020-03-25 2020-03-25 Speaker vector regularization method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111462762A true CN111462762A (en) 2020-07-28
CN111462762B CN111462762B (en) 2023-02-24

Family

ID=71679788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010218732.3A Active CN111462762B (en) 2020-03-25 2020-03-25 Speaker vector regularization method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111462762B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233680A (en) * 2020-09-27 2021-01-15 科大讯飞股份有限公司 Speaker role identification method and device, electronic equipment and storage medium
CN116612521A (en) * 2023-06-15 2023-08-18 山东睿芯半导体科技有限公司 Face recognition method, device, chip and terminal

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10254485A (en) * 1997-03-10 1998-09-25 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Speaker normalizing device, speaker adaptive device and speech recognizer
US20050097075A1 (en) * 2000-07-06 2005-05-05 Microsoft Corporation System and methods for providing automatic classification of media entities according to consonance properties
CN1985302A (en) * 2004-07-09 2007-06-20 索尼德国有限责任公司 Method for classifying music
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning
CN105931646A (en) * 2016-04-29 2016-09-07 江西师范大学 Speaker identification method base on simple direct tolerance learning algorithm
US20160335887A1 (en) * 2015-05-12 2016-11-17 Here Global B.V. System and Method for Roundabouts from Probe Data Using Vector Fields
JP2017097188A (en) * 2015-11-25 2017-06-01 日本電信電話株式会社 Speaker-likeness evaluation device, speaker identification device, speaker collation device, speaker-likeness evaluation method, and program
CN107112011A (en) * 2014-12-22 2017-08-29 英特尔公司 Cepstrum normalized square mean for audio feature extraction
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
US20190115031A1 (en) * 2016-07-15 2019-04-18 Tencent Technology (Shenzhen) Company Limited Identity vector generation method, computer device, and computer-readable storage medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10254485A (en) * 1997-03-10 1998-09-25 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Speaker normalizing device, speaker adaptive device and speech recognizer
US20050097075A1 (en) * 2000-07-06 2005-05-05 Microsoft Corporation System and methods for providing automatic classification of media entities according to consonance properties
CN1985302A (en) * 2004-07-09 2007-06-20 索尼德国有限责任公司 Method for classifying music
CN107112011A (en) * 2014-12-22 2017-08-29 英特尔公司 Cepstrum normalized square mean for audio feature extraction
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning
US20160335887A1 (en) * 2015-05-12 2016-11-17 Here Global B.V. System and Method for Roundabouts from Probe Data Using Vector Fields
JP2017097188A (en) * 2015-11-25 2017-06-01 日本電信電話株式会社 Speaker-likeness evaluation device, speaker identification device, speaker collation device, speaker-likeness evaluation method, and program
CN105931646A (en) * 2016-04-29 2016-09-07 江西师范大学 Speaker identification method base on simple direct tolerance learning algorithm
US20190115031A1 (en) * 2016-07-15 2019-04-18 Tencent Technology (Shenzhen) Company Limited Identity vector generation method, computer device, and computer-readable storage medium
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YANG ZHANG ET AL: "VAE-based regularization for deep speaker embedding", arXiv *
李琳 (LI, Lin) et al: "Speaker recognition *** based on probability-modified PLDA", Journal of Tianjin University (Science and Technology) *
杨绪魁 (YANG, Xukui) et al: "Language identification based on regularized i-vector algorithm", Journal of Information Engineering University *
缑新科 (GOU, Xinke) et al: "Speaker verification based on T-matrix normalized PLDA", Computer and Modernization *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233680A (en) * 2020-09-27 2021-01-15 科大讯飞股份有限公司 Speaker role identification method and device, electronic equipment and storage medium
CN112233680B (en) * 2020-09-27 2024-02-13 科大讯飞股份有限公司 Speaker character recognition method, speaker character recognition device, electronic equipment and storage medium
CN116612521A (en) * 2023-06-15 2023-08-18 山东睿芯半导体科技有限公司 Face recognition method, device, chip and terminal

Also Published As

Publication number Publication date
CN111462762B (en) 2023-02-24

Similar Documents

Publication Publication Date Title
US10008209B1 (en) Computer-implemented systems and methods for speaker recognition using a neural network
CN112233698B (en) Character emotion recognition method, device, terminal equipment and storage medium
CN108804453B (en) Video and audio recognition method and device
Zhou et al. A compact representation of visual speech data using latent variables
CN105575388A (en) Emotional speech processing
Huang et al. Characterizing types of convolution in deep convolutional recurrent neural networks for robust speech emotion recognition
JP2018194828A (en) Multi-view vector processing method and apparatus
CN113223560A (en) Emotion recognition method, device, equipment and storage medium
CN112418166B (en) Emotion distribution learning method based on multi-mode information
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
CN109961152B (en) Personalized interaction method and system of virtual idol, terminal equipment and storage medium
CN115101077A (en) Voiceprint detection model training method and voiceprint recognition method
Ivanko et al. An experimental analysis of different approaches to audio–visual speech recognition and lip-reading
Atkar et al. Speech emotion recognition using dialogue emotion decoder and CNN Classifier
JP2015175859A (en) Pattern recognition device, pattern recognition method, and pattern recognition program
Chinmayi et al. Emotion Classification Using Deep Learning
Xu et al. Emotion recognition research based on integration of facial expression and voice
CN111932056A (en) Customer service quality scoring method and device, computer equipment and storage medium
Le Cornu et al. Voicing classification of visual speech using convolutional neural networks
CN115145402A (en) Intelligent toy system with network interaction function and control method
CN115861670A (en) Training method of feature extraction model and data processing method and device
Nguyen Multimodal emotion recognition using deep learning techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant