CN111462762A - Speaker vector regularization method and device, electronic equipment and storage medium - Google Patents

Speaker vector regularization method and device, electronic equipment and storage medium

Info

Publication number
CN111462762A
Authority
CN
China
Prior art keywords
speaker
vector
regularization
recognized
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010218732.3A
Other languages
Chinese (zh)
Other versions
CN111462762B (en)
Inventor
蔡云麒 (Yunqi Cai)
王东 (Dong Wang)
李蓝天 (Lantian Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202010218732.3A priority Critical patent/CN111462762B/en
Publication of CN111462762A publication Critical patent/CN111462762A/en
Application granted granted Critical
Publication of CN111462762B publication Critical patent/CN111462762B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/2132 Feature extraction based on discrimination criteria, e.g. discriminant analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a speaker vector regularization method and device, electronic equipment and a storage medium. The method comprises the following steps: determining a speaker vector of a voice to be recognized; inputting the speaker vector into a discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, wherein the speaker regularization vector obeys a Gaussian distribution as a whole, and the vectors characterizing each speaker within it each obey a Gaussian distribution; the discriminative standard flow model is obtained by training based on sample speaker vectors and the speaker labels corresponding to the sample speaker vectors; and determining the speaker recognition result of the voice to be recognized based on the speaker regularization vector. The method, device, electronic equipment and storage medium provided by the embodiment of the invention are well compatible with the back-end scoring model and improve the performance of the voiceprint recognition system.

Description

Speaker vector regularization method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of voiceprint recognition, in particular to a speaker vector regularization method, a speaker vector regularization device, electronic equipment and a storage medium.
Background
With the development of deep learning technology, voiceprint recognition based on deep speaker characterization vectors has achieved satisfactory recognition performance, and has therefore gradually moved from research laboratories into a variety of practical applications.
In the prior art, the training objective of a speaker vector model is only to distinguish different speakers to the maximum extent, so the distribution of the speaker vectors produced by deep speaker vector model inference is free and unconstrained. However, back-end scoring methods for speaker recognition, such as PLDA (Probabilistic Linear Discriminant Analysis), generally impose specific requirements on the distribution of speaker vectors.
Disclosure of Invention
The embodiments of the invention provide a speaker vector regularization method and device, electronic equipment and a storage medium, to solve the problems that the existing speaker vector model is not well compatible with the back-end scoring model and that the performance of the voiceprint recognition system is consequently poor.
In a first aspect, an embodiment of the present invention provides a speaker vector regularization method, including:
determining a speaker vector of a voice to be recognized;
inputting the speaker vector into a discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, wherein the speaker regularization vector obeys a Gaussian distribution as a whole, and the vectors characterizing each speaker within the speaker regularization vector each obey a Gaussian distribution; the discriminative standard flow model is obtained by training based on a sample speaker vector and a speaker label corresponding to the sample speaker vector;
and determining the speaker recognition result of the voice to be recognized based on the speaker regularization vector.
Optionally, the discriminative standard flow model is trained based on a maximum likelihood estimation method, with a training objective of maximizing the probability of the sample speaker vector.
Optionally, the optimization function used for training the discriminative standard flow model is:

L = Σ_i [ log p_y(z_i) + log |det(∂f⁻¹(x)/∂x)|_{x=x_i} ]

where L is the optimization function, x_i is the i-th sample speaker vector, z_i = f⁻¹(x_i) is the sample speaker regularization vector corresponding to x_i, y is the sample speaker label, p_y(z_i) is the probability density of the regularization vector z_i under sample speaker y, and f is the mapping function of the discriminative standard flow model.
Optionally, a vector characterizing any speaker in the speaker regularization vector includes a first component and a second component;
wherein the first component obeys a conditional distribution associated with the any speaker and the second component obeys a marginal distribution that is not associated with each speaker.
Optionally, the determining the speaker vector of the speech to be recognized specifically includes:
inputting the voice to be recognized into a speaker vector extraction model to obtain a speaker vector of the voice to be recognized, which is output by the speaker vector extraction model;
the speaker vector extraction model is obtained by joint training with a classifier based on sample voice and a speaker label corresponding to the sample voice.
Optionally, the speaker vector extraction model includes a local feature extraction layer, a time sequence feature extraction layer, and a fusion output layer;
the inputting the speech to be recognized into a speaker vector extraction model to obtain the speaker vector of the speech to be recognized output by the speaker vector extraction model specifically includes:
inputting the voice to be recognized into the local feature extraction layer to obtain local features output by the local feature extraction layer;
inputting the voice to be recognized into the time sequence feature extraction layer to obtain the time sequence feature output by the time sequence feature extraction layer;
and inputting the local features and the time sequence features into the fusion output layer to obtain the speaker vector output by the fusion output layer.
Optionally, the determining, based on the speaker regularization vector, a speaker recognition result of the speech to be recognized specifically includes:
and inputting the speaker regularization vector into a back-end scoring model to obtain a speaker recognition result of the speech to be recognized, which is output by the back-end scoring model.
In a second aspect, an embodiment of the present invention provides a speaker vector regularization apparatus, including:
the determining unit is used for determining a speaker vector of the voice to be recognized;
the regularization unit is used for inputting the speaker vector into a discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, the speaker regularization vector integrally obeys Gaussian distribution, and vectors representing all speakers in the speaker regularization vector respectively obey the Gaussian distribution; the discriminative standard flow model is obtained by training based on a sample speaker vector and a speaker label corresponding to the sample speaker vector;
and the recognition unit is used for determining the speaker recognition result of the voice to be recognized based on the speaker regularization vector.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the speaker vector regularization method as described in the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the speaker vector regularization method as described in the first aspect.
According to the speaker vector regularization method and device, the electronic equipment and the storage medium provided by the embodiments of the invention, the speaker vector is regularized through the discriminative standard flow model to obtain a speaker regularization vector with stronger representation capability, from which the speaker recognition result is obtained. This greatly reduces the pressure on the back-end scoring model, is well compatible with the back-end scoring model, and improves the performance of the voiceprint recognition system.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a speaker vector regularization method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of data distribution output by standard flow model transformation according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of data distribution output after transformation by a differentiated standard flow model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a reversible transformation of a discriminative standard flow model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a stackable streaming invertible transformation of a discriminative standard flow model according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a speaker vector regularization apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Voiceprint recognition, also known as speaker recognition, is a biometric identification technique that uses computers and various information recognition techniques to automatically identify a speaker from the voiceprint in the speech signal, which carries the speaker's individual information. A voiceprint is a kind of information in the speech signal: a general term for both the speech features contained in the signal that can characterize the speaker's identity and the speech models built on those features.
Traditional voiceprint recognition techniques are based on statistical models, the most classical of which is the Gaussian mixture model-universal background model (GMM-UBM) architecture. To further enhance the expressive power of speaker characteristics under limited data, various subspace models were proposed in succession, the most notable being the i-vector model. The i-vector model introduced an important concept, the speaker characterization vector (speaker embedding): a fixed-length continuous vector used to characterize the speaker's characteristics. A space describing speaker characteristics is constructed from these speaker characterization vectors.
In recent years, a series of voiceprint recognition models based on deep learning have been proposed, such as the d-vector model and the x-vector model, collectively called deep speaker characterization vectors (deep speaker embeddings). Through further optimization of model structure, pooling strategies, and training criteria, deep characterization vectors have achieved the most advanced recognition performance to date. Despite this great progress, however, deep speaker vector models share a defect: the training objective only maximally distinguishes different speakers and does not consider the spatial distribution of the speaker vectors.
Fig. 1 is a schematic flowchart of a speaker vector regularization method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
step 101, determining a speaker vector of a voice to be recognized;
specifically, the speech to be recognized contains speech data from different speakers. The speaker vector is a deep speaker characterization vector, and particularly, a continuous vector with a fixed length is used for characterizing the characteristics of each speaker. A space describing the speaker characteristics is constructed through the speaker vector. In the space, the distribution of the speaker vector is free and unconstrained, and is specifically represented by: the distribution of each speaker is extremely complex; the distribution between different speakers is significantly different.
For the back-end scoring method of the voiceprint recognition mainstream, the speaker vector is mostly required to meet the prior assumption of Gaussian distribution, namely, the prior probability of the speaker and the conditional probability of each speaker are assumed to meet the Gaussian distribution. Obviously, the speaker vector of the speech to be recognized cannot well meet the prior requirement of the back-end scoring method, and therefore the speaker vector of the speech to be recognized needs to be regularized.
The speaker vector of the speech to be recognized can be obtained through different speaker vector extraction models, and the embodiment of the invention does not specifically limit the type and the internal structure of the speaker vector extraction model.
Step 102, inputting the speaker vector into a discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, wherein the speaker regularization vector obeys a Gaussian distribution as a whole, and the vectors characterizing each speaker within it each obey a Gaussian distribution; the discriminative standard flow model is obtained by training based on sample speaker vectors and the corresponding speaker labels;
specifically, a standard flow model (NF) is a deep generative model that has emerged in recent years. Similar to most generative models, the essential goal of standard flow models is to fit the distribution of the data space. The standard flow model uses a simple normal distribution and an invertible mapping to fit the true data distribution. Although normalization of observed variables to a standard gaussian distribution is well based on a standard flow model, there is one significant drawback: the standard flow model is only an edge distribution that optimizes the data. This means that data from different classes tend to converge together in hidden space and the distribution of the classes (conditional distribution) is still non-gaussian. The standard flow model is difficult to be effective for open discrimination tasks including voiceprint recognition.
Compared with a standard flow model, a Discriminative standard flow model (DNF) considers class labels of data in the process of optimizing data distribution, allows the whole hidden variable space to obey gaussian distribution, and allows various classes to obey gaussian distribution in the hidden variable space, so that regularization of edge distribution (whole data space) and condition distribution (data space of each class) of an observed variable is realized.
For example, there are 3 observed variables in a certain observation space, x1, x2, and x3, respectively. Fig. 2 is a schematic diagram of data distribution output by transformation of a standard flow model according to an embodiment of the present invention, and as shown in fig. 2, the standard flow model can well normalize an observation variable into a global standard gaussian distribution, but x1, x2, and x3 are grouped together in a transformed space. Fig. 3 is a schematic diagram of data distribution output by the discriminative standard flow model transformation according to the embodiment of the present invention, and as shown in fig. 3, the discriminative standard flow model considers a class label of data in the process of optimizing data distribution, and after optimization, not only the entire data space obeys gaussian distribution, but also the data space of each class also obeys gaussian distribution.
The speaker vector is input into the discriminative standard flow model, which takes into account the influence of the speaker label on the speaker vector distribution, optimizes the data distribution of the speaker vectors, and outputs speaker regularization vectors. Unlike the standard flow model, which regularizes the speaker vectors only as a whole and ignores the distribution of the vectors characterizing each individual speaker, the discriminative standard flow model outputs speaker regularization vectors that obey a Gaussian distribution as a whole, while the vectors characterizing each speaker also each obey a Gaussian distribution.
Before step 102 is executed, the discriminative standard flow model may be trained in advance, specifically as follows: first, a large number of sample speaker vectors and their corresponding speaker labels are collected, and an initial discriminative standard flow model is trained on them. The trained, optimized model learns a normalized space between the sample speaker vectors and their regularization vectors, determined on the basis of the speaker labels and discriminative between sample speakers, thereby realizing regularization of the speaker vector.
Step 103, determining a speaker recognition result of the voice to be recognized based on the speaker regularization vector.
Specifically, the speaker regularization vectors obtained after regularization by the discriminative standard flow model obey a Gaussian distribution as a whole, and the vectors characterizing each speaker within them each obey a Gaussian distribution, so they satisfy the prior assumptions required by the back-end scoring model, improving compatibility with it. The speaker regularization vector is input into a back-end scoring model, which selects reasonable parameters through training so that the distinguishability between different speakers is stronger.
According to the speaker vector regularization method provided by the embodiment of the invention, the speaker vector is regularized through the discriminative standard flow model, so that the speaker regularization vector with stronger representation capability is obtained, the prior assumption required by the back-end scoring model is met, the pressure of the back-end scoring model is greatly reduced, the speaker regularization vector can be well compatible with the back-end scoring model, and the performance of a voiceprint recognition system is improved.
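To make steps 101 to 103 concrete, the following is a minimal Python sketch of the overall pipeline. The names extractor, dnf_flow, and scorer are hypothetical placeholders for the speaker vector extraction model, the discriminative standard flow model, and the back-end scoring model described above; they are assumptions for illustration, not components named by the patent.

```python
# Minimal pipeline sketch; `extractor`, `dnf_flow` and `scorer` are
# hypothetical stand-ins for the three models described in steps 101-103.
def recognize(speech_feats, enrolled, extractor, dnf_flow, scorer):
    x = extractor(speech_feats)    # step 101: speaker vector of the speech
    z = dnf_flow.normalize(x)      # step 102: regularization vector, z = f^-1(x)
    # step 103: score the regularized vector against each enrolled speaker
    scores = {spk: scorer(z, z_enr) for spk, z_enr in enrolled.items()}
    return max(scores, key=scores.get)
```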
Based on the above embodiment, the discriminative standard flow model is trained based on the maximum likelihood estimation method, and the training target is the probability maximization of the sample speaker vector.
Specifically, for the training of the discriminative standard flow model in the above embodiment, the training criterion based on the maximum likelihood is adopted, the speaker labels corresponding to the sample speaker vectors are used for the probability estimation of all the sample speaker vectors, and the training target is the probability maximization of the sample speaker vectors.
The operation of the discriminative standard flow model is illustrated below by way of example. Fig. 4 is a schematic diagram of the reversible transformation of the discriminative standard flow model according to an embodiment of the present invention. As shown in fig. 4, the observed variable x represents the speaker vector, the hidden variable z represents the speaker regularization vector, and the category y represents the speaker label. The hidden variable z and the observed variable x are linked by a reversible transformation x = f(z); f⁻¹(x) is the inverse function of f(z) and can be viewed as the mapping function of the discriminative standard flow model.
The probability density relationship between the observed variable x and the hidden variable z can be formulated as:

log p_y(x) = log p_y(z) + log |det(∂f⁻¹(x)/∂x)|

where p_y(x) and p_y(z) denote the probability density functions of the observed variable x and the hidden variable z for category y, respectively, and the second term on the right represents the change of entropy between the two distributions during the transformation. To enhance the flexibility of the model, f is typically composed of a series of relatively simple reversible transformations:

f = f_T ∘ f_{T−1} ∘ … ∘ f_1

where f_t denotes a single reversible transformation, t ∈ [1, T]. Through each f_t the variables move closer to the target distribution. Each f_t can be represented by a structured neural network.
Fig. 5 is a schematic diagram of the stackable streaming invertible transformations of the discriminative standard flow model provided by an embodiment of the present invention; when T = 3, the whole transformation process may be as shown in fig. 5.
The entire transformation process can be formulated as:

log p_y(x) = log p_y(z) + Σ_{t=1}^{T} log |det(∂f_t⁻¹(h_t)/∂h_t)|, where h_t = f_t(h_{t−1}), h_0 = z and h_T = x

Based on this probability-distribution transformation property of the discriminative standard flow model, the maximum-likelihood training criterion is adopted to maximize the probability of all observed variables.
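The patent does not fix the form of the structured neural network implementing each f_t. As one common choice (an assumption here, not the patent's prescription), a RealNVP-style affine coupling layer gives an invertible transform whose log-determinant is cheap to compute; a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible transform f_t, sketched as an affine coupling layer;
    the layer type and sizes are illustrative assumptions."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        # small network predicting log-scale and shift for the second half
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        # normalizing direction f_t^-1: data -> latent, returns log|det J|
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)             # bounded scales for stability
        z2 = (x2 - t) * torch.exp(-log_s)
        log_det = -log_s.sum(dim=1)           # log|det dz/dx| of this layer
        return torch.cat([x1, z2], dim=1), log_det

    def inverse(self, z):
        # generating direction f_t: latent -> data
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(z1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        return torch.cat([z1, z2 * torch.exp(log_s) + t], dim=1)
```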
Based on any of the above embodiments, the optimization function for training the discriminative standard flow model is:

L = Σ_i [ log p_y(z_i) + log |det(∂f⁻¹(x)/∂x)|_{x=x_i} ]

where L is the optimization function, x_i is the i-th sample speaker vector, z_i = f⁻¹(x_i) is the sample speaker regularization vector corresponding to x_i, y is the sample speaker label, p_y(z_i) is the probability density of the regularization vector z_i under sample speaker y, and f is the mapping function of the discriminative standard flow model.
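A minimal PyTorch sketch of this optimization function, assuming a stack of invertible layers like the coupling layer above and per-class diagonal Gaussians in z-space (the diagonal form and the parameter names mu and log_var are simplifying assumptions, not the patent's specification):

```python
import math
import torch

def dnf_loss(x, y, layers, mu, log_var):
    """Negative of the objective L: maximize log p_y(z_i) + log|det df^-1/dx|.

    layers: list of invertible layers mapping x -> z (e.g. AffineCoupling)
    mu, log_var: per-class Gaussian parameters, shape [num_classes, dim]
    """
    z, log_det = x, x.new_zeros(x.size(0))
    for layer in layers:               # z = f^-1(x), accumulating log-dets
        z, ld = layer(z)
        log_det = log_det + ld
    # log p_y(z) under the diagonal Gaussian of each sample's speaker class y
    var = log_var[y].exp()
    log_pz = -0.5 * (((z - mu[y]) ** 2 / var) + log_var[y]
                     + math.log(2 * math.pi)).sum(dim=1)
    return -(log_pz + log_det).mean()  # minimize the negative likelihood
```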
The speaker vector regularization method provided by the embodiment of the invention has the advantages of simple structure of the discriminative standard flow model, no excessive calculation overhead, low resource consumption and improvement on the performance of the voiceprint recognition system.
Based on any of the above embodiments, the vector characterizing any speaker in the speaker regularization vector includes a first component and a second component;
wherein the first component obeys a conditional distribution associated with any speaker and the second component obeys a marginal distribution that is not associated with each speaker.
Specifically, the speaker regularization vectors output by the discriminative standard flow model are subjected to gaussian distribution as a whole, and the vectors representing the speakers in the speaker regularization vectors are subjected to gaussian distribution respectively. The vector characterizing any speaker in the speaker regularization vector can be decomposed into a first component and a second component, the first component obeys a conditional Gaussian distribution related to any speaker, and the second component obeys an edge Gaussian distribution unrelated to each speaker.
By way of example, let the observed variable x represent the speaker vector, the hidden variable z represent the speaker regularization vector, and the category y represent the speaker label. For a discriminative standard flow model with observed variable x and hidden variable z, the distribution of each class y can be formulated as:

p_y(z) = N(z; μ_y, Σ_y)

where p_y(z) is the probability density function, μ_y is the mean for category y, and Σ_y is the covariance for category y.
Here μ_y is composed of two parts, μ̂_y and μ_0. Suppose μ_y is an n-dimensional variable: its first p dimensions represent the class-dependent mean μ̂_y of the distribution associated with category y, and its last n−p dimensions represent the class-independent mean μ_0 shared by all variables. Clearly, the μ̂_y of different categories are independent of one another, while μ_0 is shared by all categories. This is expressed as:

μ_y = [μ̂_y; μ_0]
Correspondingly, Σ_y is composed of two parts, Σ̂_y and Σ_0: Σ̂_y is the p × p covariance of the class-dependent distribution associated with category y, and Σ_0 is the (n−p) × (n−p) covariance of the class-independent distribution. Clearly, the Σ̂_y of different categories are independent of one another, while Σ_0 is shared by all categories. This is expressed as:

Σ_y = blockdiag(Σ̂_y, Σ_0)
Through training optimization, the discriminative standard flow model establishes a z-space with class discriminability, in which the distribution p_y(z) of each class y decomposes into two parts: a class-dependent N(μ̂_y, Σ̂_y) distribution and a class-independent N(μ_0, Σ_0) distribution. The discriminative standard flow can thus Gaussianize both the marginal distribution N(μ_0, Σ_0) of the data space and the conditional distributions N(μ̂_y, Σ̂_y), realizing Gaussian regularization of the data space. Correspondingly, the speaker regularization vectors output by the discriminative standard flow model obey a Gaussian distribution as a whole, and the vectors characterizing each speaker within them each obey a Gaussian distribution.
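The block structure of μ_y and Σ_y above can be assembled as follows; a short sketch under the assumption that Σ_y is block-diagonal in the class-dependent and shared parts (parameter names are illustrative):

```python
import torch

def class_gaussian(mu_hat_y, sigma_hat_y, mu0, sigma0):
    """Assemble mu_y = [mu_hat_y; mu0] and a block-diagonal Sigma_y from a
    class-dependent part (first p dims) and a shared part (last n-p dims)."""
    mu_y = torch.cat([mu_hat_y, mu0])        # [n] = [p] + [n-p]
    p = mu_hat_y.numel()
    n = p + mu0.numel()
    sigma_y = torch.zeros(n, n)
    sigma_y[:p, :p] = sigma_hat_y            # class-dependent p x p block
    sigma_y[p:, p:] = sigma0                 # shared (n-p) x (n-p) block
    return mu_y, sigma_y
```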
In the speaker vector regularization method provided by the embodiment of the invention, the discriminative standard flow model inherits the discriminative ability of the discriminative model on different categories and also inherits the description ability of the statistical model on data distribution, and the regularized data distribution has stronger representation ability, so that the voiceprint recognition performance is obviously improved.
Based on any of the above embodiments, determining a speaker vector of a speech to be recognized specifically includes:
inputting the voice to be recognized into a speaker vector extraction model to obtain a speaker vector of the voice to be recognized, which is output by the speaker vector extraction model;
the speaker vector extraction model is obtained by joint training with a classifier based on sample voice and a corresponding speaker label.
Specifically, different speakers' vocal organs (such as the tongue, oral cavity, nasal cavity, vocal cords, and lungs) differ in size and form, and speakers also differ in age, personality, language habits, and so on; as a result, characteristics such as vocal capacity and pronunciation frequency vary greatly between speakers. No two people have identical voiceprints. Voiceprint recognition therefore identifies the speaker based on the acoustic features contained in speech.
The speaker vector extraction model is used for extracting a speaker vector containing acoustic features from the voice to be recognized.
The speaker vector extraction model can be obtained through pre-training, specifically through combined training with a classifier based on sample voices and corresponding speaker labels. The speaker vector extraction model extracts acoustic features corresponding to speakers from input sample voice to generate speaker vectors, and then the speaker vectors are transmitted to a classifier to distinguish different speakers.
Based on any one of the embodiments, the speaker vector extraction model comprises a local feature extraction layer, a time sequence feature extraction layer and a fusion output layer;
inputting the voice to be recognized into the speaker vector extraction model to obtain the speaker vector of the voice to be recognized output by the speaker vector extraction model, and the method specifically comprises the following steps:
inputting the voice to be recognized into a local feature extraction layer to obtain local features output by the local feature extraction layer;
inputting the voice to be recognized into a time sequence feature extraction layer to obtain time sequence features output by the time sequence feature extraction layer;
and inputting the local features and the time sequence features into the fusion output layer to obtain the speaker vector output by the fusion output layer.
Specifically, the speaker vector extraction model needs to extract acoustic features including local features and time sequence features in the speech to be recognized.
The local feature extraction layer is used for extracting the local features of the voice to be recognized, and a convolutional neural network with local feature learning capability can be selected. The time sequence feature extraction layer is used for extracting time sequence features of the voice to be recognized, and a time delay neural network with long-term dynamic description capability can be selected. The fusion output layer is used for splicing the local features and the time sequence features of the voice to be recognized and outputting the speaker vector with distinguishing information.
The fusion output layer may specifically perform pooling operation on the local features and the timing features, so as to obtain a speaker vector, which is not specifically limited in the embodiment of the present invention.
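A minimal PyTorch sketch of such a three-part extractor follows; the specific 2-D CNN for the local layer, the TDNN (dilated 1-D convolutions) for the time-sequence layer, mean pooling, and all layer sizes are illustrative assumptions rather than the patent's specified architecture:

```python
import torch
import torch.nn as nn

class SpeakerVectorExtractor(nn.Module):
    """Local feature layer (2-D CNN) + time-sequence layer (TDNN) +
    fusion output layer (pooling, concatenation, projection)."""
    def __init__(self, feat_dim=40, emb_dim=256):
        super().__init__()
        self.local = nn.Sequential(            # local feature extraction layer
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.temporal = nn.Sequential(         # time-sequence feature layer
            nn.Conv1d(feat_dim, 256, 5, dilation=1), nn.ReLU(),
            nn.Conv1d(256, 256, 3, dilation=2), nn.ReLU(),
        )
        self.fuse = nn.Linear(32 * feat_dim + 256, emb_dim)

    def forward(self, feats):                  # feats: [batch, time, feat_dim]
        local = self.local(feats.unsqueeze(1))           # [B, 32, T, F]
        local = local.mean(dim=2).flatten(1)             # pool time -> [B, 32*F]
        temp = self.temporal(feats.transpose(1, 2)).mean(dim=2)  # [B, 256]
        # fusion output layer: splice the pooled features, project to the vector
        return self.fuse(torch.cat([local, temp], dim=1))
```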
Based on any of the above embodiments, determining a speaker recognition result of the to-be-recognized speech based on the speaker regularization vector specifically includes:
and inputting the speaker regularization vector into the back-end scoring model to obtain the speaker recognition result of the speech to be recognized output by the back-end scoring model.
The back-end scoring model may adopt PLDA. Since the speaker regularization vectors obey a Gaussian distribution as a whole, and the vectors characterizing each speaker also each obey a Gaussian distribution, training and scoring of the back-end scoring model become more accurate; through training, the back-end scoring model selects reasonable parameters so that the distinction between different speakers is stronger, thereby improving the performance of the voiceprint recognition system.
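As one concrete (simplified) form of PLDA scoring, the two-covariance model computes a same-speaker versus different-speaker log-likelihood ratio. The sketch below assumes mean-centred vectors with a between-speaker covariance B and within-speaker covariance W estimated beforehand; it illustrates the scoring rule, not the patent's exact back end:

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_score(x1, x2, B, W):
    """Two-covariance PLDA log-likelihood-ratio score (a simplified sketch;
    B: between-speaker covariance, W: within-speaker covariance)."""
    d = x1.shape[0]
    tot = B + W                          # marginal covariance of one vector
    # joint covariance of [x1; x2] when both come from the same speaker
    joint = np.block([[tot, B], [B, tot]])
    ll_same = multivariate_normal.logpdf(
        np.concatenate([x1, x2]), mean=np.zeros(2 * d), cov=joint)
    ll_diff = (multivariate_normal.logpdf(x1, mean=np.zeros(d), cov=tot)
               + multivariate_normal.logpdf(x2, mean=np.zeros(d), cov=tot))
    return ll_same - ll_diff             # higher => same speaker more likely
```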
The speaker vector regularization method provided by the embodiment of the invention can regularize edge data in complex scenarios, greatly reducing the pressure on the back-end scoring model; the voiceprint recognition system can thus still achieve extremely high recognition accuracy in complex application scenarios, improving its generalization and robustness.
Based on any of the above embodiments, fig. 6 is a schematic structural diagram of a speaker vector regularization apparatus provided in an embodiment of the present invention, as shown in fig. 6, the apparatus includes a determining unit 601, a regularization unit 602, and an identifying unit 603;
the determining unit 601 is configured to determine a speaker vector of a speech to be recognized;
the regularization unit 602 is configured to input the speaker vector to the discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, where the speaker regularization vector entirely obeys gaussian distribution, and vectors representing speakers in the speaker regularization vector respectively obey gaussian distribution; the discriminative standard flow model is obtained by training based on the sample speaker vector and the corresponding speaker label;
the recognition unit 603 is configured to determine a speaker recognition result of the speech to be recognized based on the speaker regularization vector.
The speaker vector regularization device provided by the embodiment of the invention performs regularization on speaker vectors through the discriminative standard flow model to obtain speaker regularization vectors with stronger representation capability, which meet the prior assumption required by the back-end scoring model. This greatly reduces the pressure on the back-end scoring model, is well compatible with it, and improves the performance of the voiceprint recognition system.
Based on any of the above embodiments, the regularization unit 602 includes:
the discriminative standard flow model is obtained by training based on a maximum likelihood estimation method, and the probability of the training target being a sample speaker vector is maximized.
Based on any of the above embodiments, the regularization unit 602 includes:
the optimization function used to train the discriminative standard flow model is:
Figure BDA0002425319140000121
wherein L is an optimization function, xiFor the ith sample speaker vector, ziIs equal to xiThe corresponding sample speaker regular vector, y is a sample speaker label,
Figure BDA0002425319140000122
canonical vector z corresponding to sample speaker yiF is a mapping function representation of the discriminative standard flow model.
Based on any of the above embodiments, the regularization unit 602 includes:
the vector for representing any speaker in the speaker regularization vector comprises a first component and a second component;
wherein the first component obeys a conditional distribution associated with any speaker and the second component obeys a marginal distribution that is not associated with each speaker.
Based on any of the embodiments described above, the determining unit 601 is specifically configured to:
inputting the voice to be recognized into a speaker vector extraction model to obtain a speaker vector of the voice to be recognized, which is output by the speaker vector extraction model;
the speaker vector extraction model is obtained by joint training with a classifier based on sample voice and a corresponding speaker label.
Based on any of the above embodiments, the regularization unit 602 includes:
the speaker vector extraction model comprises a local feature extraction layer, a time sequence feature extraction layer and a fusion output layer;
inputting the voice to be recognized into the speaker vector extraction model to obtain the speaker vector of the voice to be recognized output by the speaker vector extraction model, and the method specifically comprises the following steps:
inputting the voice to be recognized into a local feature extraction layer to obtain local features output by the local feature extraction layer;
inputting the voice to be recognized into a time sequence feature extraction layer to obtain time sequence features output by the time sequence feature extraction layer;
and inputting the local features and the time sequence features into the fusion output layer to obtain the speaker vector output by the fusion output layer.
Based on any of the above embodiments, the identifying unit 603 is specifically configured to:
and inputting the speaker regularization vector into the back-end scoring model to obtain the speaker recognition result of the speech to be recognized output by the back-end scoring model.
Based on any of the above embodiments, fig. 7 is a hardware structure diagram of an electronic device provided in an embodiment of the present invention, and as shown in fig. 7, the electronic device may include: a processor (processor)701, a communication interface (communication interface)704, a memory (memory)702 and a communication bus 703, wherein the processor 701, the communication interface 704 and the memory 702 complete communication with each other through the communication bus 703. The processor 701 may call logic instructions in the memory 702 to perform the following method: determining a speaker vector of a voice to be recognized; inputting the speaker vector into the discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, wherein the speaker regularization vector integrally obeys Gaussian distribution, and vectors representing all speakers in the speaker regularization vector respectively obey the Gaussian distribution; the discriminative standard flow model is obtained by training based on the sample speaker vector and the corresponding speaker label; and determining a speaker recognition result of the voice to be recognized based on the speaker regularization vector.
Furthermore, the logic instructions in the memory 702 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
A non-transitory computer-readable storage medium provided by an embodiment of the present invention has a computer program stored thereon, where the computer program is executed by a processor, and the method provided by the foregoing embodiments includes, for example: determining a speaker vector of a voice to be recognized; inputting the speaker vector into the discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, wherein the speaker regularization vector integrally obeys Gaussian distribution, and vectors representing all speakers in the speaker regularization vector respectively obey the Gaussian distribution; the discriminative standard flow model is obtained by training based on the sample speaker vector and the corresponding speaker label; and determining a speaker recognition result of the voice to be recognized based on the speaker regularization vector.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of speaker vector regularization, comprising:
determining a speaker vector of a voice to be recognized;
inputting the speaker vector into a discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, wherein the speaker regularization vector obeys a Gaussian distribution as a whole, and the vectors characterizing each speaker within the speaker regularization vector each obey a Gaussian distribution; the discriminative standard flow model is obtained by training based on a sample speaker vector and a speaker label corresponding to the sample speaker vector;
and determining the speaker recognition result of the voice to be recognized based on the speaker regularization vector.
2. The speaker vector regularization method as claimed in claim 1 wherein the discriminative standard flow model is trained based on a maximum likelihood estimation method with a training objective to maximize the probability of the sample speaker vector.
3. The speaker vector regularization method according to claim 2, wherein the optimization function used to train the discriminative standard flow model is:

L = Σ_i [ log p_y(z_i) + log |det(∂f⁻¹(x)/∂x)|_{x=x_i} ]

where L is the optimization function, x_i is the i-th sample speaker vector, z_i = f⁻¹(x_i) is the sample speaker regularization vector corresponding to x_i, y is the sample speaker label, p_y(z_i) is the probability density of the regularization vector z_i under sample speaker y, and f is the mapping function of the discriminative standard flow model.
4. The speaker vector regularization method as claimed in claim 1 wherein the vector characterizing any speaker in the speaker regularization vector comprises a first component and a second component;
wherein the first component obeys a conditional distribution associated with the any speaker and the second component obeys a marginal distribution that is not associated with each speaker.
5. The speaker vector regularization method according to any one of claims 1 to 4, wherein said determining a speaker vector of a speech to be recognized specifically comprises:
inputting the voice to be recognized into a speaker vector extraction model to obtain a speaker vector of the voice to be recognized, which is output by the speaker vector extraction model;
the speaker vector extraction model is obtained by joint training with a classifier based on sample voice and a speaker label corresponding to the sample voice.
6. The speaker vector regularization method according to claim 5 wherein said speaker vector extraction model comprises a local feature extraction layer, a temporal feature extraction layer, and a fusion output layer;
the inputting the speech to be recognized into a speaker vector extraction model to obtain the speaker vector of the speech to be recognized output by the speaker vector extraction model specifically includes:
inputting the voice to be recognized into the local feature extraction layer to obtain local features output by the local feature extraction layer;
inputting the voice to be recognized into the time sequence feature extraction layer to obtain the time sequence feature output by the time sequence feature extraction layer;
and inputting the local features and the time sequence features into the fusion output layer to obtain the speaker vector output by the fusion output layer.
7. The speaker vector regularization method according to any one of claims 1 to 4, wherein said determining the speaker recognition result of the speech to be recognized based on the speaker regularization vector specifically comprises:
and inputting the speaker regularization vector into a back-end scoring model to obtain a speaker recognition result of the speech to be recognized, which is output by the back-end scoring model.
8. A speaker vector regularization apparatus comprising:
the determining unit is used for determining a speaker vector of the voice to be recognized;
the regularization unit is used for inputting the speaker vector into a discriminative standard flow model to obtain a speaker regularization vector output by the discriminative standard flow model, the speaker regularization vector integrally obeys Gaussian distribution, and vectors representing all speakers in the speaker regularization vector respectively obey the Gaussian distribution; the discriminative standard flow model is obtained by training based on a sample speaker vector and a speaker label corresponding to the sample speaker vector;
and the recognition unit is used for determining the speaker recognition result of the voice to be recognized based on the speaker regularization vector.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the speaker vector regularization method as claimed in any one of claims 1 to 7.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the steps of the speaker vector regularization method as claimed in any one of claims 1 to 7.
CN202010218732.3A 2020-03-25 2020-03-25 Speaker vector regularization method and device, electronic equipment and storage medium Active CN111462762B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010218732.3A CN111462762B (en) 2020-03-25 2020-03-25 Speaker vector regularization method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010218732.3A CN111462762B (en) 2020-03-25 2020-03-25 Speaker vector regularization method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111462762A true CN111462762A (en) 2020-07-28
CN111462762B CN111462762B (en) 2023-02-24

Family

ID=71679788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010218732.3A Active CN111462762B (en) 2020-03-25 2020-03-25 Speaker vector regularization method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111462762B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233680A (en) * 2020-09-27 2021-01-15 科大讯飞股份有限公司 Speaker role identification method and device, electronic equipment and storage medium
CN116612521A (en) * 2023-06-15 2023-08-18 山东睿芯半导体科技有限公司 Face recognition method, device, chip and terminal

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10254485A (en) * 1997-03-10 1998-09-25 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Speaker normalizing device, speaker adaptive device and speech recognizer
US20050097075A1 (en) * 2000-07-06 2005-05-05 Microsoft Corporation System and methods for providing automatic classification of media entities according to consonance properties
CN1985302A (en) * 2004-07-09 2007-06-20 索尼德国有限责任公司 Method for classifying music
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning
CN105931646A (en) * 2016-04-29 2016-09-07 江西师范大学 Speaker identification method base on simple direct tolerance learning algorithm
US20160335887A1 (en) * 2015-05-12 2016-11-17 Here Global B.V. System and Method for Roundabouts from Probe Data Using Vector Fields
JP2017097188A (en) * 2015-11-25 2017-06-01 日本電信電話株式会社 Speaker-likeness evaluation device, speaker identification device, speaker collation device, speaker-likeness evaluation method, and program
CN107112011A (en) * 2014-12-22 2017-08-29 英特尔公司 Cepstrum normalized square mean for audio feature extraction
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
US20190115031A1 (en) * 2016-07-15 2019-04-18 Tencent Technology (Shenzhen) Company Limited Identity vector generation method, computer device, and computer-readable storage medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10254485A (en) * 1997-03-10 1998-09-25 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Speaker normalizing device, speaker adaptive device and speech recognizer
US20050097075A1 (en) * 2000-07-06 2005-05-05 Microsoft Corporation System and methods for providing automatic classification of media entities according to consonance properties
CN1985302A (en) * 2004-07-09 2007-06-20 索尼德国有限责任公司 Method for classifying music
CN107112011A (en) * 2014-12-22 2017-08-29 英特尔公司 Cepstrum normalized square mean for audio feature extraction
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning
US20160335887A1 (en) * 2015-05-12 2016-11-17 Here Global B.V. System and Method for Roundabouts from Probe Data Using Vector Fields
JP2017097188A (en) * 2015-11-25 2017-06-01 日本電信電話株式会社 Speaker-likeness evaluation device, speaker identification device, speaker collation device, speaker-likeness evaluation method, and program
CN105931646A (en) * 2016-04-29 2016-09-07 江西师范大学 Speaker identification method base on simple direct tolerance learning algorithm
US20190115031A1 (en) * 2016-07-15 2019-04-18 Tencent Technology (Shenzhen) Company Limited Identity vector generation method, computer device, and computer-readable storage medium
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YANG ZHANG ET AL: "VAE-based regularization for deep speaker embedding", arXiv *
李琳 (LI, Lin) et al: "Speaker recognition *** based on probability-modified PLDA", Journal of Tianjin University (Science and Technology) *
杨绪魁 (YANG, Xukui) et al: "Language identification based on regularized i-vector algorithm", Journal of Information Engineering University *
缑新科 (GOU, Xinke) et al: "Speaker verification based on T-matrix normalized PLDA", Computer and Modernization *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233680A (en) * 2020-09-27 2021-01-15 科大讯飞股份有限公司 Speaker role identification method and device, electronic equipment and storage medium
CN112233680B (en) * 2020-09-27 2024-02-13 科大讯飞股份有限公司 Speaker character recognition method, speaker character recognition device, electronic equipment and storage medium
CN116612521A (en) * 2023-06-15 2023-08-18 山东睿芯半导体科技有限公司 Face recognition method, device, chip and terminal

Also Published As

Publication number Publication date
CN111462762B (en) 2023-02-24

Similar Documents

Publication Publication Date Title
US10008209B1 (en) Computer-implemented systems and methods for speaker recognition using a neural network
CN112233698B (en) Character emotion recognition method, device, terminal equipment and storage medium
CN108804453B (en) Video and audio recognition method and device
Zhou et al. A compact representation of visual speech data using latent variables
CN105575388A (en) Emotional speech processing
Huang et al. Characterizing types of convolution in deep convolutional recurrent neural networks for robust speech emotion recognition
JP2018194828A (en) Multi-view vector processing method and apparatus
CN113223560A (en) Emotion recognition method, device, equipment and storage medium
CN112418166B (en) Emotion distribution learning method based on multi-mode information
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
CN109961152B (en) Personalized interaction method and system of virtual idol, terminal equipment and storage medium
CN115101077A (en) Voiceprint detection model training method and voiceprint recognition method
Ivanko et al. An experimental analysis of different approaches to audio–visual speech recognition and lip-reading
Atkar et al. Speech emotion recognition using dialogue emotion decoder and CNN Classifier
JP2015175859A (en) Pattern recognition device, pattern recognition method, and pattern recognition program
Chinmayi et al. Emotion Classification Using Deep Learning
Xu et al. Emotion recognition research based on integration of facial expression and voice
CN111932056A (en) Customer service quality scoring method and device, computer equipment and storage medium
Le Cornu et al. Voicing classification of visual speech using convolutional neural networks
CN115145402A (en) Intelligent toy system with network interaction function and control method
CN115861670A (en) Training method of feature extraction model and data processing method and device
Nguyen Multimodal emotion recognition using deep learning techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant