CN117423344A - Voiceprint recognition method and device based on neural network - Google Patents

Voiceprint recognition method and device based on neural network

Info

Publication number
CN117423344A
Authority
CN
China
Prior art keywords
voiceprint
identified
signal
mel
voice print
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311262765.8A
Other languages
Chinese (zh)
Inventor
胡光强
许敏
张军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HUADI COMPUTER GROUP CO Ltd
Original Assignee
HUADI COMPUTER GROUP CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HUADI COMPUTER GROUP CO Ltd filed Critical HUADI COMPUTER GROUP CO Ltd
Priority to CN202311262765.8A priority Critical patent/CN117423344A/en
Publication of CN117423344A publication Critical patent/CN117423344A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/18: Artificial neural networks; Connectionist approaches
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a voiceprint recognition method and device based on a neural network. The method comprises the following steps: collecting a voiceprint signal to be identified and preprocessing it; performing feature extraction on the preprocessed voiceprint signal to be identified, and determining voiceprint features of the voiceprint signal to be identified; inputting the voiceprint features into a pre-trained neural network classifier model, and outputting the set of all possible class labels of the voiceprint signal to be identified, wherein the classifier model adopts self-organizing mapping based on quantum optimization; and traversing the set of possible class labels according to a pre-constructed Bayesian optimization probability model to obtain the optimal class label of the voiceprint signal to be identified.

Description

Voiceprint recognition method and device based on neural network
Technical Field
The invention relates to the technical field of voiceprint recognition, in particular to a voiceprint recognition method and device based on a neural network.
Background
Voiceprint recognition is a technique for performing authentication or identification using individual voice features, and has been widely studied and used in recent years. With the rapid development of various application scenarios such as intelligent voice assistant, intelligent home, mobile payment, security verification and the like, the accuracy and reliability of the voiceprint recognition technology become particularly important.
However, in practical applications, existing voiceprint recognition techniques face a number of challenges. The following problems remain to be solved in the prior art:
data quality and resolution problems: conventional data acquisition methods may fail to ensure sufficient information volume and resolution, thereby affecting the accuracy of the identification.
Insufficient preprocessing: existing data preprocessing techniques may not be advanced enough; failing to effectively eliminate noise or to perform proper data normalization may affect the performance and generalization ability of the model.
Insufficient data diversity and volume: traditional data expansion approaches may be too simple and fail to take into account the time dependence of the voiceprint data, resulting in insufficient generalization ability of the model in the face of different environmental and noise conditions.
Limited feature extraction capability: conventional Convolutional Neural Networks (CNNs), while performing well in voiceprint recognition, may suffer from deficiencies in handling multi-scale features and in introducing attention mechanisms.
Classifier performance and efficiency problems: existing classifiers may not fully utilize advanced optimization techniques such as quantum optimization and thus have limitations in computational efficiency and accuracy.
Lack of uncertainty assessment: conventional models typically give only the single most likely classification result without evaluating the model's uncertainty about that result, which is a disadvantage in terms of safety and reliability.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a voiceprint recognition method and device based on a neural network.
According to one aspect of the present invention, there is provided a voiceprint recognition method based on a neural network, including:
collecting a voiceprint signal to be identified and preprocessing it;
performing feature extraction on the preprocessed voiceprint signal to be identified, and determining voiceprint features of the voiceprint signal to be identified;
inputting the voiceprint features into a pre-trained neural network classifier model, and outputting the set of all possible class labels of the voiceprint signal to be identified, wherein the classifier model adopts self-organizing mapping based on quantum optimization;
traversing the set of possible class labels according to a pre-constructed Bayesian optimization probability model, and obtaining the optimal class label of the voiceprint signal to be identified.
Optionally, the classifier model training process is as follows:
collecting a voiceprint signal dataset of a microphone device;
performing feature extraction on the voiceprint signal data set by using a Mel frequency cepstrum coefficient feature extraction method to determine a Mel frequency cepstrum coefficient set;
preprocessing the voiceprint signal data set by a combination of maximum-minimum normalization and a time-domain noise filter;
performing data expansion on the preprocessed voiceprint signal data set using an echo-state-based generative adversarial network algorithm to generate an expanded voiceprint signal data set;
performing feature extraction on the expanded voiceprint signal data set by adopting a multi-scale convolutional neural network, and determining a voiceprint feature set of the voiceprint signal data set;
the Mel frequency cepstrum coefficient set is formed into a Mel frequency cepstrum coefficient matrix, and is fused with the voiceprint feature set to determine the voiceprint fusion feature set;
inputting the voiceprint fusion feature set into a self-organizing map classifier based on quantum optimization for training to obtain a classifier model.
Optionally, the microphone device is any one of the following: a dynamic microphone, a condenser microphone, or an array microphone.
Optionally, performing feature extraction on the voiceprint signal data set by using a mel-frequency cepstral coefficient feature extraction method and determining the mel-frequency cepstral coefficient set includes:
performing Fourier transformation on the voiceprint signals of the voiceprint signal data set respectively to obtain a frequency spectrum feature set;
mapping the spectrum feature set to a Mel scale to obtain a Mel domain feature set;
and performing discrete cosine transform on the Mel domain feature set to obtain a Mel frequency cepstrum coefficient set.
Optionally, the maximum-minimum normalization formula is:
x' = (x - min(x)) / (max(x) - min(x))
where x' is the normalized data, and min(x) and max(x) are the minimum and maximum values of the original data x, respectively.
Optionally, in the time-domain noise filter formula, x' is the normalized data, α is a weight parameter between 0 and 1, n is the local window size, and w_i is the weight of each point in the window, satisfying Σ_i w_i = 1.
Optionally, the generative adversarial network algorithm comprises a generator G_ESN and a discriminator D, wherein the generator G_ESN consists of an original generator and an echo state network.
Optionally, in the Bayesian optimization model, Y is the set of all possible class labels, μ(x, y) and σ²(x, y) are the mean and variance, respectively, x is the voiceprint signal to be identified, and y is a category in Y.
Optionally, the method further comprises: performing uncertainty measurement on the optimal class label and determining the uncertainty of the optimal label, wherein the uncertainty measure is expressed by the following formula:
U(x, y*) = σ²(x, y*)
where x is the voiceprint signal to be identified, y* is the optimal class label, and σ² is the variance.
According to another aspect of the present invention, there is provided a voiceprint recognition apparatus based on a neural network, including:
the acquisition module is used for acquiring the voiceprint signal to be identified and preprocessing the voiceprint signal;
the feature extraction module is used for performing feature extraction on the preprocessed voiceprint signal to be identified and determining voiceprint features of the voiceprint signal to be identified;
the output module is used for inputting the voiceprint features into a pre-trained neural network classifier model and outputting the set of all possible class labels of the voiceprint signal to be identified, wherein the classifier model adopts self-organizing mapping based on quantum optimization;
the obtaining module is used for traversing the set of possible class labels according to a pre-constructed Bayesian optimization probability model to obtain the optimal class label of the voiceprint signal to be identified.
According to a further aspect of the present invention there is provided a computer readable storage medium storing a computer program for performing the method according to any one of the above aspects of the present invention.
According to still another aspect of the present invention, there is provided an electronic device including: a processor; a memory for storing the processor-executable instructions; the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method according to any of the above aspects of the present invention.
Therefore, the invention provides a voiceprint recognition method based on a neural network. On the design of the classifier, self-organizing mapping (SOM) based on quantum optimization is adopted to improve accuracy and calculation efficiency, uncertainty assessment based on Bayesian optimization is introduced, and a more reliable classification result is provided.
Drawings
Exemplary embodiments of the present invention may be more completely understood in consideration of the following drawings:
FIG. 1 is a flow chart of a voiceprint recognition method based on a neural network according to an exemplary embodiment of the present invention;
fig. 2 is a schematic structural diagram of a voiceprint recognition method based on a neural network according to an exemplary embodiment of the present invention;
fig. 3 is a schematic structural diagram of a voiceprint recognition device based on a neural network according to an exemplary embodiment of the present invention;
fig. 4 is a structure of an electronic device provided in an exemplary embodiment of the present invention.
Detailed Description
Hereinafter, exemplary embodiments according to the present invention will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some embodiments of the present invention and not all embodiments of the present invention, and it should be understood that the present invention is not limited by the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
It will be appreciated by those of skill in the art that the terms "first," "second," etc. in embodiments of the present invention are used merely to distinguish between different steps, devices or modules, etc., and do not represent any particular technical meaning nor necessarily logical order between them.
It should also be understood that in embodiments of the present invention, "plurality" may refer to two or more, and "at least one" may refer to one, two or more.
It should also be appreciated that any component, data, or structure referred to in an embodiment of the invention may be generally understood as one or more without explicit limitation or the contrary in the context.
In addition, the term "and/or" in the present invention is merely an association relationship describing the association object, and indicates that three relationships may exist, for example, a and/or B may indicate: a exists alone, A and B exist together, and B exists alone. In the present invention, the character "/" generally indicates that the front and rear related objects are an or relationship.
It should also be understood that the description of the embodiments of the present invention emphasizes the differences between the embodiments, and that the same or similar features may be referred to each other, and for brevity, will not be described in detail.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, the techniques, methods, and apparatus should be considered part of the specification.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Embodiments of the invention are operational with numerous other general purpose or special purpose computing system environments or configurations with electronic devices, such as terminal devices, computer systems, servers, etc. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with the terminal device, computer system, server, or other electronic device include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, small computer systems, mainframe computer systems, and distributed cloud computing technology environments that include any of the foregoing, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc., that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment in which tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computing system storage media including memory storage devices.
Exemplary method
Fig. 1 is a flowchart of a voiceprint recognition method based on a neural network according to an exemplary embodiment of the present invention. The embodiment can be applied to an electronic device. As shown in fig. 1, the voiceprint recognition method 100 based on a neural network includes the following steps:
Step 101, collecting a voiceprint signal to be identified and preprocessing it;
Step 102, performing feature extraction on the preprocessed voiceprint signal to be identified, and determining voiceprint features of the voiceprint signal to be identified;
Step 103, inputting the voiceprint features into a pre-trained neural network classifier model, and outputting the set of all possible class labels of the voiceprint signal to be identified, wherein the classifier model adopts self-organizing mapping based on quantum optimization;
Step 104, traversing the set of possible class labels according to a pre-constructed Bayesian optimization probability model to obtain the optimal class label of the voiceprint signal to be identified.
Specifically, the invention provides a voiceprint recognition method based on a neural network. The method focuses on high quality and high resolution from the data acquisition stage onward; advanced techniques, such as a local-correlation-based time-domain noise filter and a multi-scale convolutional neural network (MS-CNN), are employed during the data preprocessing and feature extraction stages to capture voiceprint characteristics more fully and accurately; in the classifier design, self-organizing mapping (SOM) based on quantum optimization is adopted to improve accuracy and computational efficiency; in addition, uncertainty evaluation based on Bayesian optimization is introduced to provide more reliable classification results.
Specifically, the training steps of the classifier model are as follows:
step one: data acquisition and annotation
The data of the invention mainly come from high-quality microphone equipment, in particular dynamic microphones, condenser microphones, array microphones and the like, so as to ensure that the collected voiceprint signals have sufficient information content and resolution. The data are stored in WAV or MP3 format for compatibility with various data analysis tools.
The voiceprint signal is one-dimensional time series data, typically denoted by x (t), where x is the amplitude of the voiceprint signal and t is time. Such signals may be further broken down into a number of attributes and features, such as frequency, phase, etc.
Further, preliminary feature extraction is performed on the voiceprint signal according to the Mel Frequency Cepstral Coefficient (MFCC) feature extraction method. The method comprises the following steps:
1. Apply the Fourier transform to the signal x(t) to obtain X(f).
Specifically, in order to extract effective features from the voiceprint signal x(t), the present invention employs the Fourier transform for preliminary feature extraction. Given a time signal x(t), its Fourier transform is expressed as:
X(f) = ∫ x(t)·e^(-j2πft) dt
where X(f) is the frequency-domain representation, f is the frequency, and j is the imaginary unit.
2. Map X(f) onto the Mel scale to obtain M(f).
Specifically, the calculation formula of M(f) is:
3. Perform the Discrete Cosine Transform (DCT) on M(f) to obtain the MFCC.
Specifically, the MFCC may be expressed as:
MFCC = DCT(M(f))
in a specific embodiment, consider a simple voiceprint signal x (t) =sin (2pi ft), where f=440 Hz.
Further, a fourier transform is applied, resulting in X (f) =δ (f-440) +δ (f+440), where δ is a dirac δ -function.
Further, mapping onto Mel scale to obtain
Further, the MFCC is characterized as 392.5. And taking the characteristic as a newly added characteristic, cascading with the original signal, and enhancing the characteristic characterization capability of the original voiceprint signal.
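For illustration, the three-step MFCC pipeline above can be sketched in Python roughly as follows. The 2595·log10(1 + f/700) Mel mapping and the simple energy binning are assumptions standing in for the formulas not reproduced in this text, and mfcc_features is a hypothetical helper name, not the patent's implementation.

# Minimal sketch of the three-step MFCC pipeline described above.
# The Mel mapping uses the common 2595*log10(1 + f/700) formula, which is an
# assumption; the patent's own Mel-scale expression is not reproduced here.
import numpy as np
from scipy.fftpack import dct  # discrete cosine transform

def mfcc_features(x, sample_rate, n_mel_bins=26, n_coeffs=13):
    # Step 1: Fourier transform of the time signal x(t) -> |X(f)|
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)

    # Step 2: map the spectrum onto the Mel scale (simple binning sketch)
    mel_freqs = 2595.0 * np.log10(1.0 + freqs / 700.0)
    bin_edges = np.linspace(mel_freqs.min(), mel_freqs.max(), n_mel_bins + 1)
    mel_energies = np.array([
        spectrum[(mel_freqs >= lo) & (mel_freqs < hi)].sum()
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:])
    ])

    # Step 3: discrete cosine transform of the log Mel energies -> MFCC
    return dct(np.log(mel_energies + 1e-10), norm='ortho')[:n_coeffs]

# Example: the 440 Hz sine used in the embodiment above
t = np.arange(0, 1.0, 1.0 / 16000)
x = np.sin(2 * np.pi * 440 * t)
coeffs = mfcc_features(x, sample_rate=16000)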
Step two: data preprocessing
The preprocessing of voiceprint data is a crucial step and affects the performance and generalization ability of the subsequent neural network model. The invention adopts a combination of maximum-minimum normalization and a specific noise filter for data preprocessing.
The collected voiceprint signal data consist of multiple frequency components and may contain different levels of noise. Maximum-minimum normalization rescales the data into a predetermined range (typically [0, 1] or [-1, 1]). Specifically, given a voiceprint signal sample x2(t), its normalization formula is:
x2(t)' = (x2(t) - min(x2(t))) / (max(x2(t)) - min(x2(t)))
where x2(t)' is the normalized data, min(x2(t)) and max(x2(t)) are the minimum and maximum values of the original data, respectively, and x2(t) is the given voiceprint signal sample before normalization.
Further, the normalized data are input to a noise filter. Conventional noise filters typically operate in the frequency domain, whereas the present invention employs a time-domain noise filter based on local correlation. For each time point t, the noise-filtered value x″(t) is calculated as follows:
where α is a weight parameter between 0 and 1, n is the local window size, and w_i is the weight of each point in the window, satisfying Σ_i w_i = 1.
After maximum and minimum normalization of the voiceprint signal x, it is input to a noise filter to obtain final preprocessed data x″. The calculation steps can be represented by the following formula:
x″=NoiseFilter(MinMaxNorm(x))
where NoiseFilter is the noise filter and MinMaxNorm is the maximum-minimum normalization function.
By combining maximum-minimum normalization with the specific noise filter, the important characteristics of the voiceprint data are retained, the influence of noise and other irrelevant factors is reduced, and the voiceprint recognition accuracy is further improved.
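A minimal Python sketch of the preprocessing pipeline x″ = NoiseFilter(MinMaxNorm(x)) might look as follows. Blending each sample with a weighted local average is one plausible reading of the local-correlation filter, and the α value, window size, and uniform weights are illustrative assumptions rather than the patent's exact parameters.

# Sketch of x'' = NoiseFilter(MinMaxNorm(x)).
# The filter blends each normalized sample with a weighted local average,
# which is one plausible reading of the local-correlation filter described above.
import numpy as np

def min_max_norm(x):
    # rescale the raw signal into [0, 1]
    return (x - x.min()) / (x.max() - x.min())

def noise_filter(x_norm, alpha=0.7, n=5):
    weights = np.ones(n) / n          # w_i, uniform and summing to 1 (assumed)
    local_avg = np.convolve(x_norm, weights, mode='same')
    return alpha * x_norm + (1 - alpha) * local_avg

x = np.random.randn(16000)            # stand-in voiceprint signal
x_pre = noise_filter(min_max_norm(x)) # final preprocessed data x''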
Step three: data augmentation
It will be appreciated that the amount and diversity of data is a critical factor in the voiceprint recognition task. The invention provides an improved echo-state-based generative adversarial network algorithm for data expansion.
First, an Echo State Network (ESN) is used as the basic part of the generator. The echo state network comprises an input weight matrix W_in, an echo-state (internal) weight matrix W, and an output weight matrix W_out. For the preprocessed data, let the data sequence be X = [x(1), x(2), ..., x(N)], where N is the length of the time series.
Then, the dynamic update rule of the hidden state h (t) is:
h(t) = (1 - α)·h(t-1) + α·tanh(W_in·x(t) + W·h(t-1))
where α is a forgetting factor ranging from 0 to 1, W_in is the input-to-hidden-layer weight matrix, and W is the weight matrix of the connections within the hidden layer.
In a standard generative adversarial network, the loss function of the generator G and discriminator D is generally defined as:
min_G max_D V(D, G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 - D(G(z)))]
where p_data(x) is the true data distribution and p_z(z) is the noise distribution of the generator input.
In the present invention, an Echo State Network (ESN) is incorporated into the generator so that the generator can not only generate new samples but also capture the time-dependent nature of the voiceprint data. Specifically, the generator, denoted G_ESN, is formed by combining an original generator with an echo state network and can be expressed as:
G_ESN(z) = G(ESN(z))
Then, the modified loss function replaces G with G_ESN:
min_{G_ESN} max_D V(D, G_ESN) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 - D(G_ESN(z)))]
by incorporating the ESN into the GAN, the generated data better mimics the time-dependent nature of the voiceprint data.
Step four: data feature extraction
Since voiceprint recognition is a complex pattern recognition problem, feature extraction from the expanded sample data requires extracting key features from the high-dimensional raw sound signal for high-accuracy recognition. Traditional convolutional neural networks (CNNs) have performed well in voiceprint recognition but still suffer from certain limitations, particularly in dealing with multi-scale features.
The invention performs feature extraction based on a multi-scale convolutional neural network (MS-CNN); in particular, an adaptive learning rate and an attention mechanism are introduced to enhance model performance.
Specifically, in the feature extraction stage, a multi-scale convolution operation is first performed on the input voiceprint data X. Convolution kernels K_1, K_2, K_3 of three different scales are employed to perform the convolution operation, which can be expressed as:
F_1 = X * K_1
F_2 = X * K_2
F_3 = X * K_3
where * denotes the convolution operation, X is the input voiceprint data matrix of dimension N×M, N is the number of samples, M is the number of features per sample, and F is the feature matrix extracted by the MS-CNN.
Further, for finer feature extraction, the learning rate LR is adaptively adjusted after each convolution operation, which can be expressed as:
where L is the loss function, ∇_W L is the gradient of the loss function with respect to the weights W, W is a weight parameter in the CNN, and LR is the learning rate.
Further, attention-mechanism weighting is performed. After multi-scale feature extraction, an attention mechanism is adopted to weight and combine the features F_1, F_2, F_3 extracted at the different scales, which can be expressed as:
F = α_1×F_1 + α_2×F_2 + α_3×F_3
where α_1, α_2, α_3 are the attention weights.
Further, in processing voiceprint data, frequency domain information often contains a number of useful features in addition to time domain information. In order to more comprehensively capture the characteristics of the voiceprint, the invention introduces a frequency domain information fusion mechanism.
Specifically, the mel-frequency cepstral coefficients obtained in step one are first combined into a mel-frequency cepstral coefficient matrix by means of a multi-scale sliding window.
Further, a learnable weight parameter β is used to fuse this matrix with the time-domain information F, which can be expressed as:
where the frequency-domain feature is the mel-frequency cepstral coefficient matrix and its gradient with respect to the loss function L is used to update β; the fused data are:
In this fusion step, F_final is the feature set that is ultimately used for voiceprint recognition.
The frequency-domain information fusion mechanism provided by the invention further enriches the expressive power of the features and improves the accuracy of voiceprint recognition. By introducing multi-scale convolution, an adaptive learning rate, and an attention mechanism, the features extracted from the voiceprint data become more accurate and robust, thereby improving the accuracy of the voiceprint recognition task.
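A compact sketch of the MS-CNN feature extractor described in this step follows. The three kernel sizes, the softmax attention over F_1..F_3, and the β-weighted fusion with the MFCC matrix mirror the description above, while all layer sizes and tensor shapes are illustrative assumptions.

# Sketch of the multi-scale feature extractor: three 1-D convolutions stand in
# for K_1, K_2, K_3, learnable softmax weights play the role of alpha_1..alpha_3,
# and a learnable beta fuses the frequency-domain (MFCC) features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSCNN(nn.Module):
    def __init__(self, in_ch=1, out_ch=32):
        super().__init__()
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1)   # K_1
        self.conv2 = nn.Conv1d(in_ch, out_ch, kernel_size=5, padding=2)   # K_2
        self.conv3 = nn.Conv1d(in_ch, out_ch, kernel_size=7, padding=3)   # K_3
        self.attn = nn.Parameter(torch.zeros(3))     # attention logits
        self.beta = nn.Parameter(torch.tensor(0.5))  # frequency-domain fusion weight

    def forward(self, x, mfcc):
        # x: (batch, 1, T) time-domain signal; mfcc: (batch, out_ch, T) frequency features
        feats = [self.conv1(x), self.conv2(x), self.conv3(x)]      # F_1, F_2, F_3
        alpha = F.softmax(self.attn, dim=0)
        fused_time = sum(a * f for a, f in zip(alpha, feats))      # F = sum alpha_i * F_i
        return fused_time + self.beta * mfcc                       # F_final (time + frequency)

model = MSCNN()
x = torch.randn(4, 1, 160)
mfcc = torch.randn(4, 32, 160)
features = model(x, mfcc)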
Step five: training classifier
The data after feature extraction are input to a classifier, and the classifier is trained. The invention uses a self-organizing map (SOM) classifier based on quantum optimization to achieve higher classification accuracy and computational efficiency.
Self-organizing maps (SOM) map the input space (i.e., the voiceprint feature space) to a low-dimensional grid. Specifically, let the feature-extracted data be X = {x_1, x_2, ..., x_N}, where x_i is a voiceprint feature vector and N is the number of samples.
Further, for an n×m SOM grid, a weight matrix W is defined, where w_ij, the weight of node (i, j) in the grid, is a d-dimensional vector.
Further, the basic update formula of the SOM is:
w_ij(t+1) = w_ij(t) + α(t)·h_ij(ξ, t)·(x(t) - w_ij(t))
where α(t) is the learning rate, h_ij(ξ, t) is a neighborhood function, typically a Gaussian function; ξ is the winning node (i.e., the node closest to the input x(t)); and t is the time step.
Further, the invention introduces a quantum optimization algorithm to adjust SOM grid weights. Specifically, the optimization objective is to minimize the following loss function J:
where w_c(i) is the weight of the SOM node closest to x_i, λ is the regularization parameter, and the final term is a quantum optimization operation.
The quantum optimization operation is realized with quantum gates and qubits and is used to quickly find the globally optimal solution.
Further, to minimize J(W), the gradient with respect to W is taken, yielding:
Further, the weights W are updated using stochastic gradient descent:
where η is the step size.
In this way, not only is each data point x_i drawn toward its nearest SOM node w_c(i), but the overall weight matrix W also approaches the quantum-optimized state.
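A classical SOM training loop implementing the update rule above is sketched below. The quantum-optimization regularization term is only summarized in the text, so it is deliberately omitted here; the grid size, learning-rate schedule, and Gaussian neighborhood are assumptions made for illustration.

# Sketch of the SOM update w_ij(t+1) = w_ij(t) + a(t)*h_ij(xi,t)*(x - w_ij(t)).
# The quantum-optimization term of the loss J is not implemented in this sketch.
import numpy as np

def train_som(X, grid=(8, 8), epochs=10, lr0=0.5, sigma0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(size=(grid[0], grid[1], d))          # weight matrix w_ij
    rows, cols = np.indices(grid)
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                     # decaying learning rate alpha(t)
        sigma = sigma0 * (1 - t / epochs) + 1e-3
        for x in rng.permutation(X):
            dist = np.linalg.norm(W - x, axis=2)
            bi, bj = np.unravel_index(dist.argmin(), grid)   # winning node xi
            h = np.exp(-((rows - bi) ** 2 + (cols - bj) ** 2) / (2 * sigma ** 2))
            W += lr * h[..., None] * (x - W)            # neighborhood-weighted update
    return W

X = np.random.randn(200, 16)      # stand-in voiceprint fusion features
W = train_som(X)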
The process of voiceprint recognition by the classifier model trained through the steps is as follows:
and performing voiceprint classification by using the trained classifier model. Based on the Bayesian optimization model reasoning strategy, not only can the classification accuracy be improved, but also an uncertainty measure can be provided for each classification result.
Specifically, the trained model f maps a voiceprint data point x to a corresponding class label y. The uncertainty of this mapping is expressed by the conditional probability P(y|x), which can be written as:
P(y|x) = exp(f(x, y)) / Σ_{y'} exp(f(x, y'))
where f(x, y) is the joint score of the model for the input x and the label y, and y' traverses all possible labels.
Further, Bayesian optimization is applied to the probabilistic model P(y|x) to find the most likely class label. Specifically, let Y be the set of all possible class labels; the optimization objective is to maximize P(y|x) over y in Y.
by Gaussian Process Regression (GPR), there are:
wherein μ (x, y) and σ 2 (x, y) are mean and variance, respectively.
Then, the optimization objective becomes:
y* = argmax_{y in Y} [μ(x, y) + β·σ(x, y)]
where β is an adjustable parameter used to trade off accuracy against uncertainty.
Further, the uncertainty measure is evaluated via the variance of the model, specifically:
U(x, y*) = σ²(x, y*)
in voiceprint recognition, the following steps are followed:
1. setting input and output.
Input x (voiceprint data point)
Output y * (most likely class label), U (x, y * ) (uncertainty measurement)
2. P (y|x) was modeled using GPR.
3. Calculate allMu (x, y) and sigma below 2 (x,y)。
4. Find out
5. Calculating uncertainty measure U (x, y * )=σ 2 (x,y * )。
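The five inference steps can be sketched with scikit-learn's Gaussian process regressor as follows. Modeling each candidate label with its own regressor, the choice of β, and the toy data are assumptions made for illustration, not the patent's exact inference procedure.

# Sketch of the inference steps above: one Gaussian-process regressor models
# the score for each candidate label y, the label maximizing mu + beta*sigma is
# selected, and sigma^2 is returned as the uncertainty U(x, y*).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def bayes_optimal_label(x, gprs, beta=0.1):
    # gprs: dict mapping each candidate label y to a fitted GaussianProcessRegressor
    scores = {}
    for y, gpr in gprs.items():
        mu, sigma = gpr.predict(x.reshape(1, -1), return_std=True)
        scores[y] = (mu[0] + beta * sigma[0], sigma[0] ** 2)
    y_star = max(scores, key=lambda y: scores[y][0])
    return y_star, scores[y_star][1]          # optimal label and U(x, y*) = sigma^2

# toy usage: two candidate speaker labels with toy training scores
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 8))
gprs = {y: GaussianProcessRegressor().fit(X_train, rng.normal(size=40))
        for y in ("spk_a", "spk_b")}
label, uncertainty = bayes_optimal_label(rng.normal(size=8), gprs)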
Based on the method, the accuracy of voiceprint recognition is improved, and a reliable uncertainty assessment is provided for each classification result by introducing Bayesian optimization and uncertainty measurement.
Therefore, the invention provides a voiceprint recognition method based on a neural network. On the design of the classifier, self-organizing mapping (SOM) based on quantum optimization is adopted to improve accuracy and calculation efficiency, uncertainty assessment based on Bayesian optimization is introduced, and a more reliable classification result is provided.
Exemplary apparatus
Fig. 3 is a schematic structural diagram of a voiceprint recognition device based on a neural network according to an exemplary embodiment of the present invention. As shown in fig. 3, the apparatus 300 includes:
the acquisition module 310 is used for acquiring and preprocessing the voiceprint signal to be identified;
the feature extraction module 320 is configured to perform feature extraction on the pre-processed voiceprint signal to be identified, and determine voiceprint features of the voiceprint signal to be identified;
the output module 330 is configured to input the voiceprint features into a pre-trained neural network classifier model, and output all possible class label sets of the voiceprint signals to be identified, where the classifier model adopts self-organizing mapping based on quantum optimization;
and the obtaining module 340 is configured to traverse the possible class label set according to the pre-constructed bayesian optimization probability model, and obtain an optimal class label of the voiceprint signal to be identified.
Optionally, the classifier model training process of the output module 330 is as follows:
the acquisition sub-module is used for acquiring a voiceprint signal data set of the microphone equipment;
the characteristic extraction submodule is used for carrying out characteristic extraction on the voiceprint signal data set by utilizing a Mel frequency cepstrum coefficient characteristic extraction method to determine a Mel frequency cepstrum coefficient set;
the preprocessing sub-module is used for preprocessing the voiceprint signal data set by adopting a maximum and minimum normalization and time domain noise filter combination method;
the expansion sub-module is used for carrying out data expansion on the voice print signal data set after pretreatment based on the generation of the echo state and the countermeasure network algorithm, and generating an expanded voice print signal data set;
the second feature extraction submodule is used for carrying out feature extraction on the expanded voiceprint signal data set by adopting a multi-scale convolutional neural network and determining a voiceprint feature set of the voiceprint signal data set;
the fusion submodule is used for forming a Mel frequency cepstrum coefficient set into a Mel frequency cepstrum coefficient matrix, and fusing the Mel frequency cepstrum coefficient matrix with the voiceprint feature set to determine the voiceprint fusion feature set;
and the training sub-module is used for inputting the voiceprint fusion feature set into the self-organizing map classifier based on quantum optimization for training to obtain a classifier model.
Optionally, the microphone device is any one of the following: a dynamic microphone, a condenser microphone, or an array microphone.
Optionally, the feature extraction submodule includes:
the first transformation unit is used for carrying out Fourier transformation on the voiceprint signals of the voiceprint signal data set respectively to obtain a frequency spectrum feature set;
the mapping unit is used for mapping the spectrum feature set to a Mel scale to obtain a Mel domain feature set;
and the second transformation unit is used for performing discrete cosine transformation on the Mel domain feature set to obtain a Mel frequency cepstrum coefficient set.
Optionally, the maximum-minimum normalization formula is:
x' = (x - min(x)) / (max(x) - min(x))
where x' is the normalized data, and min(x) and max(x) are the minimum and maximum values of the original data x, respectively.
Optionally, in the time-domain noise filter formula, x' is the normalized data, α is a weight parameter between 0 and 1, n is the local window size, and w_i is the weight of each point in the window, satisfying Σ_i w_i = 1.
Optionally, the generative adversarial network algorithm comprises a generator G_ESN and a discriminator D, wherein the generator G_ESN consists of an original generator and an echo state network.
Optionally, in the Bayesian optimization model, Y is the set of all possible class labels, μ(x, y) and σ²(x, y) are the mean and variance, respectively, x is the voiceprint signal to be identified, and y is a category in Y.
Optionally, the method further comprises: performing uncertainty measurement on the optimal class label and determining the uncertainty of the optimal label, wherein the uncertainty measure is expressed by the following formula:
U(x, y*) = σ²(x, y*)
where x is the voiceprint signal to be identified, y* is the optimal class label, and σ² is the variance.
Exemplary electronic device
Fig. 4 is a structure of an electronic device provided in an exemplary embodiment of the present invention. As shown in fig. 4, the electronic device 40 includes one or more processors 41 and memory 42.

Claims (10)

1. A voiceprint recognition method based on a neural network, comprising:
collecting a voiceprint signal to be identified and preprocessing it;
performing feature extraction on the preprocessed voiceprint signal to be identified, and determining voiceprint features of the voiceprint signal to be identified;
inputting the voiceprint features into a pre-trained neural network classifier model, and outputting the set of all possible class labels of the voiceprint signal to be identified, wherein the classifier model adopts self-organizing mapping based on quantum optimization;
traversing the set of possible class labels according to a pre-constructed Bayesian optimization probability model, and acquiring the optimal class label of the voiceprint signal to be identified.
2. The method of claim 1, wherein the classifier model training process is as follows:
collecting a voiceprint signal dataset of a microphone device;
performing feature extraction on the voiceprint signal data set by using a Mel frequency cepstrum coefficient feature extraction method to determine a Mel frequency cepstrum coefficient set;
preprocessing the voiceprint signal data set by a combination of maximum-minimum normalization and a time-domain noise filter;
performing data expansion on the preprocessed voiceprint signal data set using an echo-state-based generative adversarial network algorithm to generate an expanded voiceprint signal data set;
performing feature extraction on the extended voiceprint signal data set by adopting a multi-scale convolutional neural network, and determining a voiceprint feature set of the voiceprint signal data set;
the Mel frequency cepstrum coefficient set is formed into a Mel frequency cepstrum coefficient matrix, and is fused with the voiceprint feature set to determine a voiceprint fusion feature set;
inputting the voiceprint fusion feature set to a self-organizing map classifier based on quantum optimization for training to obtain the classifier model.
3. The method of claim 2, wherein the microphone device is any one of the following: a dynamic microphone, a condenser microphone, or an array microphone.
4. The method of claim 1, wherein performing feature extraction on the voiceprint signal data set using a mel-frequency cepstral coefficient feature extraction method and determining a mel-frequency cepstral coefficient set comprises:
performing Fourier transformation on the voiceprint signals of the voiceprint signal data set respectively to obtain a frequency spectrum feature set;
mapping the spectrum feature set to a Mel scale to obtain a Mel domain feature set;
and performing discrete cosine transform on the Mel domain feature set to obtain the Mel frequency cepstrum coefficient set.
5. The method of claim 2, wherein the maximum-minimum normalization formula is:
x' = (x - min(x)) / (max(x) - min(x))
where x' is the normalized data, and min(x) and max(x) are the minimum and maximum values of the original data x, respectively.
6. The method of claim 2, wherein, in the time-domain noise filter formula, x' is the normalized data, α is a weight parameter between 0 and 1, n is the local window size, and w_i is the weight of each point in the window, satisfying Σ_i w_i = 1.
7. The method of claim 2, wherein the generative adversarial network algorithm comprises a generator G_ESN and a discriminator D, wherein the generator G_ESN consists of an original generator and an echo state network.
8. The method according to claim 1, wherein, in the Bayesian optimization model, Y is the set of all possible class labels, μ(x, y) and σ²(x, y) are the mean and variance, respectively, x is the voiceprint signal to be identified, and y is a category in Y.
9. The method as recited in claim 1, further comprising: performing uncertainty measurement on the optimal class label and determining the uncertainty of the optimal label, wherein the uncertainty measure is expressed as:
U(x, y*) = σ²(x, y*)
where x is the voiceprint signal to be identified, y* is the optimal class label, and σ² is the variance.
10. A voiceprint recognition device based on a neural network, comprising:
an acquisition module, used for acquiring a voiceprint signal to be identified and preprocessing it;
a feature extraction module, used for performing feature extraction on the preprocessed voiceprint signal to be identified and determining voiceprint features of the voiceprint signal to be identified;
an output module, used for inputting the voiceprint features into a pre-trained neural network classifier model and outputting the set of all possible class labels of the voiceprint signal to be identified, wherein the classifier model adopts self-organizing mapping based on quantum optimization; and
an obtaining module, used for traversing the set of possible class labels according to a pre-constructed Bayesian optimization probability model and acquiring the optimal class label of the voiceprint signal to be identified.
CN202311262765.8A 2023-09-27 2023-09-27 Voiceprint recognition method and device based on neural network Pending CN117423344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311262765.8A CN117423344A (en) 2023-09-27 2023-09-27 Voiceprint recognition method and device based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311262765.8A CN117423344A (en) 2023-09-27 2023-09-27 Voiceprint recognition method and device based on neural network

Publications (1)

Publication Number Publication Date
CN117423344A true CN117423344A (en) 2024-01-19

Family

ID=89531564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311262765.8A Pending CN117423344A (en) 2023-09-27 2023-09-27 Voiceprint recognition method and device based on neural network

Country Status (1)

Country Link
CN (1) CN117423344A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118155662A (en) * 2024-05-09 2024-06-07 国网江西省电力有限公司南昌供电分公司 Transformer voiceprint fault identification method based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination