CN112053695A - Voiceprint recognition method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN112053695A (application CN202010955823.5A)
- Authority
- CN
- China
- Prior art keywords
- voice
- recognized
- voiceprint
- spectrum information
- frequency spectrum
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/18—Artificial neural networks; Connectionist approaches
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Abstract
The embodiments of the present application disclose a voiceprint recognition method and apparatus, an electronic device, and a storage medium. The method includes: acquiring spectrum information of a speech to be recognized; identifying valid speech segments and invalid speech segments in the speech to be recognized according to the spectrum information; removing the invalid speech segments and splicing the valid speech segments to obtain a valid speech; acquiring spectrum information of the valid speech; performing feature extraction on the spectrum information of the valid speech through a feature extraction model based on a deep convolutional neural network to obtain a voiceprint feature vector to be recognized corresponding to the speech to be recognized; and performing similarity calculation between the voiceprint feature vector to be recognized and the existing voiceprint feature vectors in a speech feature library to determine the speaker identity information corresponding to the speech to be recognized. Because the invalid speech segments are removed, high-quality speech data are provided to the feature extraction model, which improves the accuracy of the voiceprint recognition result.
Description
Technical Field
The embodiments of the present application relate to the technical field of identity recognition, and in particular to a voiceprint recognition method and apparatus, an electronic device, and a storage medium.
Background
Voiceprint recognition, also called speaker recognition, is a biometric technique that identifies a speaker from the characteristics of his or her voice, and is widely applicable in fields such as security, finance, and anti-fraud.
Currently, the most widely used approach to voiceprint recognition is the iVector/PLDA algorithm. Its process is as follows: obtain the spectrum information of the speech as MFCCs (Mel-frequency cepstral coefficients); map the high-dimensional MFCC features to a low-dimensional vector, the iVector, through Gaussian-supervector factor analysis, where the iVector contains both the speaker's voiceprint information and channel information; apply the PLDA algorithm to the iVector for channel compensation to obtain a voiceprint feature vector; and match this voiceprint feature vector against the vectors in a database to determine the speaker's identity.
Because the iVector contains both speaker information and channel information, it still carries noise and background sound even after PLDA channel compensation. These strongly affect the recognition result, so the accuracy of recognition is low.
Disclosure of Invention
The embodiments of the present application provide a voiceprint recognition method and apparatus, an electronic device, and a storage medium, which help improve the accuracy of recognition results.
To solve the above problem, in a first aspect, an embodiment of the present application provides a voiceprint recognition method, including:
acquiring spectrum information of a speech to be recognized;
identifying valid speech segments and invalid speech segments in the speech to be recognized according to the spectrum information;
removing the invalid speech segments, and splicing the valid speech segments to obtain a valid speech;
acquiring spectrum information of the valid speech;
performing feature extraction on the spectrum information of the valid speech through a feature extraction model based on a deep convolutional neural network to obtain a voiceprint feature vector to be recognized corresponding to the speech to be recognized; and
performing similarity calculation between the voiceprint feature vector to be recognized and the existing voiceprint feature vectors in a speech feature library to determine the speaker identity information corresponding to the speech to be recognized.
Optionally, the identifying valid speech segments and invalid speech segments in the speech to be recognized according to the spectrum information includes:
inputting the spectrum information of the speech to be recognized into a classification model based on a deep convolutional neural network to obtain the valid speech segments and the invalid speech segments in the speech to be recognized.
Optionally, the invalid speech segments include noise segments and/or background sound segments.
Optionally, the acquiring spectrum information of the speech to be recognized includes:
performing short-time Fourier transform processing on the speech to be recognized to obtain the spectrum information of the speech to be recognized; or
calculating Mel-frequency cepstral coefficients corresponding to the speech to be recognized as the spectrum information of the speech to be recognized.
Optionally, the performing similarity calculation between the voiceprint feature vector to be recognized and the existing voiceprint feature vectors in a speech feature library to determine the speaker identity information corresponding to the speech to be recognized includes:
performing similarity calculation between the voiceprint feature vector to be recognized and the existing voiceprint feature vectors in the speech feature library, and taking the existing voiceprint feature vector with the largest similarity value to the voiceprint feature vector to be recognized as a candidate vector;
if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to a preset threshold, determining the speaker identity information corresponding to the candidate vector as the speaker identity information corresponding to the speech to be recognized; and
if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is smaller than the preset threshold, determining that the voiceprint recognition fails.
Optionally, the deep convolutional neural network is a ResNet network.
In a second aspect, an embodiment of the present application provides a voiceprint recognition apparatus, including:
a first spectrum information acquisition module, configured to acquire spectrum information of a speech to be recognized;
a speech segment recognition module, configured to identify valid speech segments and invalid speech segments in the speech to be recognized according to the spectrum information;
a valid speech splicing module, configured to remove the invalid speech segments and splice the valid speech segments to obtain a valid speech;
a second spectrum information acquisition module, configured to acquire spectrum information of the valid speech;
a feature extraction module, configured to perform feature extraction on the spectrum information of the valid speech through a feature extraction model based on a deep convolutional neural network to obtain a voiceprint feature vector to be recognized corresponding to the speech to be recognized; and
a speaker identity determination module, configured to perform similarity calculation between the voiceprint feature vector to be recognized and the existing voiceprint feature vectors in a speech feature library to determine the speaker identity information corresponding to the speech to be recognized.
Optionally, the speech segment recognition module is specifically configured to:
input the spectrum information of the speech to be recognized into a classification model based on a deep convolutional neural network to obtain the valid speech segments and the invalid speech segments in the speech to be recognized.
Optionally, the invalid speech segments include noise segments and/or background sound segments.
Optionally, the first spectrum information acquisition module is specifically configured to:
perform short-time Fourier transform processing on the speech to be recognized to obtain the spectrum information of the speech to be recognized; or
calculate Mel-frequency cepstral coefficients corresponding to the speech to be recognized as the spectrum information of the speech to be recognized.
Optionally, the speaker identity determination module includes:
a candidate vector determination unit, configured to perform similarity calculation between the voiceprint feature vector to be recognized and the existing voiceprint feature vectors in the speech feature library, and take the existing voiceprint feature vector with the largest similarity value to the voiceprint feature vector to be recognized as a candidate vector;
a speaker identity determination unit, configured to determine, if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to a preset threshold, the speaker identity information corresponding to the candidate vector as the speaker identity information corresponding to the speech to be recognized; and
a recognition failure determination unit, configured to determine that the voiceprint recognition fails if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is smaller than the preset threshold.
Optionally, the deep convolutional neural network is a ResNet network.
In a third aspect, an embodiment of the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the voiceprint recognition method described in the embodiments of the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the voiceprint recognition method disclosed in the embodiments of the present application.
In the voiceprint recognition method and apparatus, electronic device, and storage medium provided by the embodiments of the present application, the spectrum information of the speech to be recognized is acquired; valid and invalid speech segments in the speech to be recognized are identified according to the spectrum information; the invalid segments are removed and the valid segments are spliced to obtain a valid speech; the spectrum information of the valid speech is acquired; feature extraction is performed on that spectrum information through a feature extraction model based on a deep convolutional neural network to obtain the voiceprint feature vector to be recognized; and similarity calculation is performed between this vector and the existing voiceprint feature vectors in a speech feature library to determine the speaker identity corresponding to the speech to be recognized. Because the valid and invalid speech segments are identified first, the invalid segments are removed, and only the spliced valid segments are used for voiceprint recognition, speech data of as high quality as possible are provided to the feature extraction model. Moreover, extracting voiceprint features with an end-to-end feature extraction model avoids the poor feature selection caused by relying on hand-crafted experience, thereby improving the accuracy of the voiceprint recognition result.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a voiceprint recognition method according to a first embodiment of the present application;
fig. 2 is a schematic structural diagram of a voiceprint recognition apparatus according to a second embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on these embodiments without creative effort shall fall within the protection scope of the present application.
Example one
As shown in fig. 1, the voiceprint recognition method provided by this embodiment includes steps 110 to 160.
Step 110, acquiring spectrum information of a speech to be recognized.
Short-time Fourier transform (STFT) processing may be performed on the speech to be recognized to obtain its spectrum information; alternatively, the Mel-frequency cepstral coefficients (MFCCs) corresponding to the speech to be recognized may be used as its spectrum information.
In an embodiment of the present application, the acquiring spectrum information of the speech to be recognized includes:
performing short-time Fourier transform processing on the speech to be recognized to obtain the spectrum information of the speech to be recognized; or
calculating Mel-frequency cepstral coefficients corresponding to the speech to be recognized as the spectrum information of the speech to be recognized.
When short-time Fourier transform processing is performed on the speech to be recognized, the speech is first framed and windowed to obtain a number of windowed data frames, and Fourier transform processing is then performed on each windowed data frame to obtain the spectrum information of the speech to be recognized.
When the Mel-frequency cepstral coefficients corresponding to the speech to be recognized are calculated, the speech is first pre-emphasized, framed, and windowed to obtain a number of windowed data frames; Fourier transform processing is performed on each data frame to obtain the spectra within the different time windows; these spectra are passed through a Mel filter bank to obtain the Mel spectrum; and cepstral analysis is performed on the Mel spectrum, that is, the logarithm of the Mel spectrum is taken and an inverse Fourier transform is applied, yielding the Mel-frequency cepstral coefficients corresponding to the speech to be recognized.
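As an illustration of the two spectrum options above, the following is a minimal numpy sketch of both steps, STFT and MFCC. All numeric parameters (25 ms frames, 10 ms hop, Hamming window, pre-emphasis coefficient 0.97, 26 Mel filters, 13 coefficients) are common defaults assumed for the example, not values fixed by the patent; the cepstral step is realized here with a DCT-II, the usual implementation of the inverse-transform step described above.

```python
import numpy as np

def stft(signal, sr, frame_ms=25, hop_ms=10):
    """Framing + windowing + Fourier transform per frame, as in the STFT step."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = 1 + max(0, len(signal) - frame) // hop
    window = np.hamming(frame)
    frames = np.stack([signal[i * hop:i * hop + frame] * window for i in range(n)])
    return np.fft.rfft(frames, axis=1)  # one spectrum per windowed frame

def mel_filterbank(n_filters, n_bins, sr):
    """Triangular Mel filters spanning 0 .. sr/2."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(0.0, mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_bins - 1) * 2.0 * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(signal, sr, n_filters=26, n_ceps=13):
    # pre-emphasis (0.97 is a commonly assumed coefficient)
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    spec = np.abs(stft(signal, sr)) ** 2                 # power spectrum per frame
    fb = mel_filterbank(n_filters, spec.shape[1], sr)
    log_mel = np.log(spec @ fb.T + 1e-10)                # log Mel spectrum
    # cepstral step: DCT-II realizes the "inverse transform of the log Mel spectrum"
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * k + 1) / (2.0 * n_filters)))
    return log_mel @ dct.T                               # shape: (frames, n_ceps)
```

For a 1-second signal at 16 kHz this yields one 13-dimensional coefficient vector per 10 ms hop, which is the per-frame spectrum information the later steps operate on.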
Step 120, identifying valid speech segments and invalid speech segments in the speech to be recognized according to the spectrum information.
Here, the invalid speech segments include noise segments and/or background sound segments.
The speech to be recognized is analyzed according to the spectrum information to determine its valid and invalid speech segments.
In an embodiment of the present application, the identifying valid speech segments and invalid speech segments in the speech to be recognized according to the spectrum information includes:
inputting the spectrum information of the speech to be recognized into a classification model based on a deep convolutional neural network to obtain the valid speech segments and the invalid speech segments in the speech to be recognized.
The spectrum information of the speech to be recognized may be input into a classification model based on a deep convolutional neural network, which classifies the spectrum information and identifies the valid and invalid speech segments in the speech to be recognized; at least one valid speech segment and at least one invalid speech segment may be obtained.
The deep convolutional neural network may be a ResNet network, for example a ResNet50 network. The classification model may be built by adding fully connected layers after the ResNet50 network so as to classify the speech to be recognized; for example, 2 fully connected layers may be added so that the final output distinguishes 2 classes, one class being valid speech segments and the other invalid speech segments.
By using the deep neural network to filter out invalid speech such as noise and background sound, speech data of as high quality as possible can be provided to the feature extraction model, so that a more accurate voiceprint feature vector to be recognized can be extracted.
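The classifier internals are not reproduced here. Assuming some framewise valid/invalid decision is available (from the CNN classification model above, or any stand-in), the following small sketch shows how per-frame labels can be collapsed into the contiguous valid and invalid segments referred to in this step:

```python
def segments_from_labels(labels):
    """Collapse a framewise valid/invalid label sequence into
    (start_frame, end_frame_exclusive, label) runs.  Each run of
    identical labels becomes one speech segment."""
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((start, i, labels[start]))
            start = i
    return runs

# e.g. 1 = valid frame, 0 = invalid (noise / background sound) frame
print(segments_from_labels([0, 0, 1, 1, 1, 0, 1]))
# → [(0, 2, 0), (2, 5, 1), (5, 6, 0), (6, 7, 1)]
```

The runs labelled invalid are the segments removed in step 130; the runs labelled valid are kept and spliced.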
Step 130, removing the invalid speech segments, and splicing the valid speech segments to obtain a valid speech.
Because the valid speech segments carry the speaker's identity information and the invalid speech segments do not, the invalid speech segments are removed and the valid speech segments are spliced in chronological order to obtain the valid speech. Using the spliced valid segments as the valid speech provides high-quality speech data for the subsequent feature extraction model.
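A minimal sketch of this removal-and-splicing step. It assumes segments are given as (start_frame, end_frame, is_valid) runs and that a fixed hop size maps frame indices back to sample indices; both conventions are assumptions for the example, not details from the patent.

```python
import numpy as np

def splice_valid(speech, segments, hop):
    """Drop invalid runs and concatenate valid runs in chronological order.

    speech:   1-D sample array of the speech to be recognized
    segments: list of (start_frame, end_frame_exclusive, is_valid) runs
    hop:      samples per frame hop (assumed framing convention)
    """
    valid = [speech[s * hop:e * hop] for s, e, ok in segments if ok]
    return np.concatenate(valid) if valid else np.empty(0, dtype=speech.dtype)
```

Because the segment list is already in time order, a plain concatenation preserves the chronological splicing described above.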
Step 140, acquiring spectrum information of the valid speech.
Short-time Fourier transform processing may be performed on the valid speech to obtain its spectrum information; alternatively, the Mel-frequency cepstral coefficients (MFCCs) corresponding to the valid speech may be used as its spectrum information.
When short-time Fourier transform processing is performed on the valid speech, the valid speech is first framed and windowed to obtain a number of windowed data frames, and Fourier transform processing is then performed on each windowed data frame to obtain the spectrum information of the valid speech.
When the Mel-frequency cepstral coefficients corresponding to the valid speech are calculated, the valid speech is first pre-emphasized, framed, and windowed to obtain a number of windowed data frames; Fourier transform processing is performed on each data frame to obtain the spectra within the different time windows; these spectra are passed through a Mel filter bank to obtain the Mel spectrum; and cepstral analysis is performed on the Mel spectrum, that is, the logarithm is taken and an inverse Fourier transform applied, yielding the Mel-frequency cepstral coefficients corresponding to the valid speech.
Step 150, performing feature extraction on the spectrum information of the valid speech through a feature extraction model based on a deep convolutional neural network to obtain a voiceprint feature vector to be recognized corresponding to the speech to be recognized.
The spectrum information of the valid speech is input into an end-to-end feature extraction model based on a deep convolutional neural network, and the feature extraction model extracts features from it to obtain the voiceprint feature vector to be recognized corresponding to the speech to be recognized. The dimension of the voiceprint feature vector to be recognized can be set as required; it may, for example, be a 512-dimensional vector.
The deep convolutional neural network may be a ResNet network, for example a ResNet50 network. The feature extraction model may be implemented by adding fully connected layers after the ResNet50 network so as to output a vector of the preset dimension; for example, 2 fully connected layers may be added so that a 512-dimensional vector is finally output.
Step 160, performing similarity calculation between the voiceprint feature vector to be recognized and the existing voiceprint feature vectors in a speech feature library, and determining the speaker identity information corresponding to the speech to be recognized.
Here, the speech feature library stores the existing voiceprint feature vectors and the corresponding speaker identity information.
Similarity calculation is performed between the voiceprint feature vector to be recognized and each existing voiceprint feature vector in the speech feature library to determine the speaker identity information corresponding to the speech to be recognized; for example, the speaker identity information corresponding to the existing voiceprint feature vector with the largest similarity value may be taken as the speaker identity information corresponding to the speech to be recognized.
In an embodiment of the present application, the performing similarity calculation between the voiceprint feature vector to be recognized and the existing voiceprint feature vectors in a speech feature library to determine the speaker identity information corresponding to the speech to be recognized includes:
performing similarity calculation between the voiceprint feature vector to be recognized and the existing voiceprint feature vectors in the speech feature library, and taking the existing voiceprint feature vector with the largest similarity value to the voiceprint feature vector to be recognized as a candidate vector;
if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to a preset threshold, determining the speaker identity information corresponding to the candidate vector as the speaker identity information corresponding to the speech to be recognized; and
if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is smaller than the preset threshold, determining that the voiceprint recognition fails.
Similarity calculation is performed between the voiceprint feature vector to be recognized and each existing voiceprint feature vector in the speech feature library, yielding a similarity value for each existing vector. These similarity values are compared, and the existing voiceprint feature vector with the largest similarity value is taken as the candidate vector. If the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to a preset threshold, the voiceprint recognition succeeds, and the speaker identity information corresponding to the candidate vector is taken as the speaker identity information corresponding to the speech to be recognized; if that similarity value is smaller than the preset threshold, the voiceprint recognition fails, that is, the speaker identity corresponding to the speech to be recognized is not identified. Comparing the largest similarity value with a preset threshold, and accepting the candidate's identity only when the threshold is met, makes the determination more reliable.
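The patent does not fix the similarity measure; cosine similarity is a common choice and is assumed in the sketch below, and the 0.7 threshold is purely illustrative. This shows the candidate-selection-plus-threshold logic of step 160:

```python
import numpy as np

def identify(query, library, threshold=0.7):
    """Compare a query voiceprint vector against every enrolled vector.

    query:     voiceprint feature vector to be recognized (1-D array)
    library:   dict mapping speaker identity -> enrolled voiceprint vector
    threshold: preset similarity threshold (0.7 is an assumed example value)

    Returns (speaker_id, similarity); speaker_id is None if recognition fails.
    """
    best_id, best_sim = None, -1.0
    q = query / np.linalg.norm(query)
    for speaker, vec in library.items():
        sim = float(q @ (vec / np.linalg.norm(vec)))  # cosine similarity
        if sim > best_sim:                            # track the candidate vector
            best_id, best_sim = speaker, sim
    # accept the candidate only if it clears the preset threshold
    return (best_id, best_sim) if best_sim >= threshold else (None, best_sim)
```

The largest-similarity vector plays the role of the candidate vector above; a sub-threshold best match is reported as a recognition failure rather than a forced identification.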
In the voiceprint recognition method provided by the embodiments of the present application, the spectrum information of the speech to be recognized is acquired; valid and invalid speech segments are identified according to the spectrum information; the invalid segments are removed and the valid segments are spliced to obtain a valid speech; the spectrum information of the valid speech is acquired; feature extraction is performed on it through a feature extraction model based on a deep convolutional neural network to obtain the voiceprint feature vector to be recognized; and similarity calculation is performed between this vector and the existing voiceprint feature vectors in a speech feature library to determine the speaker identity corresponding to the speech to be recognized. Because the valid and invalid speech segments are identified first, the invalid segments are removed, and the spliced valid segments are used for voiceprint recognition, speech data of as high quality as possible are provided to the feature extraction model; extracting voiceprint features with an end-to-end feature extraction model also avoids the poor feature selection caused by relying on hand-crafted experience, thereby improving the accuracy of the voiceprint recognition result.
Example two
As shown in fig. 2, the voiceprint recognition apparatus 200 provided by this embodiment includes:
a first spectrum information acquisition module 210, configured to acquire spectrum information of a speech to be recognized;
a speech segment recognition module 220, configured to identify valid speech segments and invalid speech segments in the speech to be recognized according to the spectrum information;
a valid speech splicing module 230, configured to remove the invalid speech segments and splice the valid speech segments to obtain a valid speech;
a second spectrum information acquisition module 240, configured to acquire spectrum information of the valid speech;
a feature extraction module 250, configured to perform feature extraction on the spectrum information of the valid speech through a feature extraction model based on a deep convolutional neural network to obtain a voiceprint feature vector to be recognized corresponding to the speech to be recognized; and
a speaker identity determination module 260, configured to perform similarity calculation between the voiceprint feature vector to be recognized and the existing voiceprint feature vectors in the speech feature library to determine the speaker identity information corresponding to the speech to be recognized.
Optionally, the speech segment recognition module is specifically configured to:
and inputting the frequency spectrum information of the voice to be recognized into a classification model based on a deep convolutional neural network to obtain an effective voice segment and an invalid voice segment in the voice to be recognized.
Optionally, the invalid voice segments include noise segments and/or background sound segments.
Optionally, the first spectrum information obtaining module is specifically configured to:
carrying out short-time Fourier transform processing on the voice to be recognized to obtain frequency spectrum information of the voice to be recognized; or
calculating Mel-frequency cepstral coefficients corresponding to the voice to be recognized as the frequency spectrum information of the voice to be recognized.
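The short-time Fourier transform option can be sketched as follows. The 25 ms frame and 10 ms hop at a 16 kHz sampling rate are conventional choices assumed here, since the embodiment does not fix framing parameters.

```python
import numpy as np

def stft_magnitude(signal, frame_len=400, hop=160):
    """Magnitude spectrogram via the short-time Fourier transform.

    frame_len=400 and hop=160 correspond to 25 ms frames with a 10 ms
    shift at 16 kHz -- assumed values, not fixed by the embodiment.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One row per frame, frame_len // 2 + 1 frequency bins per row.
    return np.abs(np.fft.rfft(frames, axis=1))
```

The Mel-frequency cepstral coefficient alternative can be computed from the same magnitude spectrogram by applying a mel filter bank, a logarithm, and a discrete cosine transform.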
Optionally, the speaker identity determining module includes:
a candidate vector determining unit, configured to perform similarity calculation on the voiceprint feature vector to be recognized and existing voiceprint feature vectors in a speech feature library, and use an existing voiceprint feature vector with a largest similarity value with the voiceprint feature vector to be recognized as a candidate vector;
the speaker identity determining unit is used for determining that the speaker identity information corresponding to the candidate vector is the speaker identity information corresponding to the voice to be recognized if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to a preset threshold value;
and the recognition failure determining unit is used for determining that the voiceprint recognition fails if the similarity value of the voiceprint feature vector to be recognized and the candidate vector is smaller than the preset threshold value.
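The candidate selection and threshold decision performed by these units can be sketched with cosine similarity, a common choice for comparing voiceprint embeddings; both the metric and the default threshold value below are assumptions, as the embodiment only requires a similarity value and a preset threshold.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two voiceprint feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def determine_speaker(query, library, threshold=0.8):
    """Pick the most similar library vector as the candidate; accept it
    only if its similarity reaches the preset threshold, otherwise
    report recognition failure (None)."""
    candidate, best = None, -1.0
    for speaker_id, vec in library.items():
        sim = cosine_similarity(query, vec)
        if sim > best:
            candidate, best = speaker_id, sim
    return candidate if best >= threshold else None
```

Returning `None` corresponds to the recognition-failure branch: no enrolled speaker is similar enough to the voice to be recognized.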
Optionally, the deep convolutional neural network is a ResNet network.
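The choice of ResNet matters because its skip connections keep gradients flowing through very deep feature extractors. A minimal fully connected residual block illustrating the F(x) + x structure is sketched below; real ResNet blocks use convolutions, batch normalization, and many stacked layers.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(x + W2 @ relu(W1 @ x)): the residual branch learns a
    correction F(x), while the identity shortcut passes x through
    unchanged, which is what lets very deep networks train stably."""
    return relu(x + w2 @ relu(w1 @ x))
```

With zero weights the block reduces to the identity shortcut (plus the final ReLU), which is why adding more blocks never has to hurt the network's representational power.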
The voiceprint recognition apparatus provided in this embodiment of the present application is configured to implement the steps of the voiceprint recognition method described in the first embodiment; for the specific implementation of each module of the apparatus, refer to the corresponding step, and details are not repeated here.
The voiceprint recognition apparatus provided by this embodiment of the application works as follows: the first spectrum information obtaining module acquires the frequency spectrum information of the voice to be recognized; the voice segment recognition module recognizes the effective and invalid voice segments in the voice to be recognized according to that spectrum information; the effective voice splicing module removes the invalid segments and splices the effective segments to obtain an effective voice; the second spectrum information obtaining module acquires the frequency spectrum information of the effective voice; the feature extraction module performs feature extraction on it through the feature extraction model based on a deep convolutional neural network to obtain the voiceprint feature vector to be recognized corresponding to the voice to be recognized; and the speaker identity determining module performs similarity calculation between that vector and the existing voiceprint feature vectors in the voice feature library to determine the speaker identity corresponding to the voice to be recognized. Because the effective and invalid voice segments are recognized first in the voiceprint recognition process, the invalid segments removed, and the effective segments retained and spliced before voiceprint recognition is performed, the feature extraction model is supplied with voice data of the highest possible quality. Moreover, extracting the voiceprint features with an end-to-end feature extraction model avoids the poor feature selection caused by the limits of human experience, so the accuracy of the voiceprint recognition result is improved.
Example three
An embodiment of the present application also provides an electronic device. As shown in fig. 3, the electronic device 300 may include one or more processors 310 and one or more memories 320 connected to the processors 310. The electronic device 300 may also include an input interface 330 and an output interface 340 for communicating with another apparatus or system. Program code executed by the processor 310 may be stored in the memory 320.
The processor 310 in the electronic device 300 calls the program code stored in the memory 320 to perform the voiceprint recognition method in the above embodiment.
The elements of the electronic device described above may be interconnected by a bus, such as a data bus, an address bus, a control bus, an expansion bus, or a local bus, or any combination thereof.
The embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the voiceprint recognition method according to the first embodiment of the present application.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The voiceprint recognition method and apparatus, electronic device, and storage medium provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementation of the present application, and the description of the embodiments is only intended to help in understanding the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Claims (10)
1. A voiceprint recognition method, comprising:
acquiring frequency spectrum information of a voice to be recognized;
identifying, according to the frequency spectrum information, the effective voice segments and invalid voice segments in the voice to be recognized;
removing the invalid voice segments, and splicing the effective voice segments to obtain effective voice;
acquiring frequency spectrum information of the effective voice;
performing feature extraction on the frequency spectrum information of the effective voice through a feature extraction model based on a deep convolutional neural network to obtain a voiceprint feature vector to be recognized corresponding to the voice to be recognized;
and performing similarity calculation on the voiceprint feature vector to be recognized and the existing voiceprint feature vector in the voice feature library to determine the speaker identity information corresponding to the voice to be recognized.
2. The method according to claim 1, wherein the identifying valid speech segments and invalid speech segments in the speech to be identified according to the spectrum information comprises:
and inputting the frequency spectrum information of the voice to be recognized into a classification model based on a deep convolutional neural network to obtain an effective voice segment and an invalid voice segment in the voice to be recognized.
3. The method according to claim 1 or 2, characterized in that the invalid voice segments comprise noise segments and/or background sound segments.
4. The method according to claim 1, wherein the obtaining of the spectrum information of the speech to be recognized comprises:
carrying out short-time Fourier transform processing on the voice to be recognized to obtain frequency spectrum information of the voice to be recognized; or
calculating Mel-frequency cepstral coefficients corresponding to the voice to be recognized as the frequency spectrum information of the voice to be recognized.
5. The method according to claim 1, wherein the calculating the similarity between the voiceprint feature vector to be recognized and existing voiceprint feature vectors in a speech feature library to determine the identity information of the speaker corresponding to the speech to be recognized comprises:
similarity calculation is carried out on the voiceprint feature vector to be recognized and the existing voiceprint feature vector in the voice feature library, and the existing voiceprint feature vector with the largest similarity value with the voiceprint feature vector to be recognized is used as a candidate vector;
if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to a preset threshold value, determining the identity information of the speaker corresponding to the candidate vector as the identity information of the speaker corresponding to the voice to be recognized;
and if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is smaller than the preset threshold, determining that the voiceprint recognition fails.
6. The method of claim 1, wherein the deep convolutional neural network is a ResNet network.
7. A voiceprint recognition apparatus comprising:
the first frequency spectrum information acquisition module is used for acquiring frequency spectrum information of the voice to be recognized;
the voice segment recognition module is used for recognizing the effective voice segments and invalid voice segments in the voice to be recognized according to the frequency spectrum information;
the effective voice splicing module is used for removing the invalid voice segments and splicing the effective voice segments to obtain an effective voice;
the second spectrum information acquisition module is used for acquiring the spectrum information of the effective voice;
the characteristic extraction module is used for extracting the characteristics of the frequency spectrum information of the effective voice through a characteristic extraction model based on a deep convolutional neural network to obtain a voiceprint characteristic vector to be recognized corresponding to the voice to be recognized;
and the speaker identity determining module is used for carrying out similarity calculation on the voiceprint feature vector to be recognized and the existing voiceprint feature vector in the voice feature library to determine the speaker identity information corresponding to the voice to be recognized.
8. The apparatus of claim 7, wherein the speech segment recognition module is specifically configured to:
and inputting the frequency spectrum information of the voice to be recognized into a classification model based on a deep convolutional neural network to obtain an effective voice segment and an invalid voice segment in the voice to be recognized.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the voiceprint recognition method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the steps of the voiceprint recognition method of one of the claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010955823.5A CN112053695A (en) | 2020-09-11 | 2020-09-11 | Voiceprint recognition method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112053695A true CN112053695A (en) | 2020-12-08 |
Family
ID=73610126
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010955823.5A Withdrawn CN112053695A (en) | 2020-09-11 | 2020-09-11 | Voiceprint recognition method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112053695A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104021790A (en) * | 2013-02-28 | 2014-09-03 | 联想(北京)有限公司 | Sound control unlocking method and electronic device |
CN104834849A (en) * | 2015-04-14 | 2015-08-12 | 时代亿宝(北京)科技有限公司 | Dual-factor identity authentication method and system based on voiceprint recognition and face recognition |
CN110164452A (en) * | 2018-10-10 | 2019-08-23 | 腾讯科技(深圳)有限公司 | A kind of method of Application on Voiceprint Recognition, the method for model training and server |
CN110473552A (en) * | 2019-09-04 | 2019-11-19 | 平安科技(深圳)有限公司 | Speech recognition authentication method and system |
CN110970036A (en) * | 2019-12-24 | 2020-04-07 | 网易(杭州)网络有限公司 | Voiceprint recognition method and device, computer storage medium and electronic equipment |
Worldwide application: 2020-09-11, CN CN202010955823.5A (published as CN112053695A), status: withdrawn.
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112767944A (en) * | 2020-12-31 | 2021-05-07 | 中国工商银行股份有限公司 | Voiceprint recognition method and device |
CN113035202A (en) * | 2021-01-28 | 2021-06-25 | 北京达佳互联信息技术有限公司 | Identity recognition method and device |
CN113035202B (en) * | 2021-01-28 | 2023-02-28 | 北京达佳互联信息技术有限公司 | Identity recognition method and device |
WO2022179360A1 (en) * | 2021-02-24 | 2022-09-01 | 嘉楠明芯(北京)科技有限公司 | Voiceprint recognition method and apparatus, and computer-readable storage medium |
CN112767950A (en) * | 2021-02-24 | 2021-05-07 | 嘉楠明芯(北京)科技有限公司 | Voiceprint recognition method and device and computer readable storage medium |
CN112750441B (en) * | 2021-04-02 | 2021-07-23 | 北京远鉴信息技术有限公司 | Voiceprint recognition method and device, electronic equipment and storage medium |
CN112750441A (en) * | 2021-04-02 | 2021-05-04 | 北京远鉴信息技术有限公司 | Voiceprint recognition method and device, electronic equipment and storage medium |
CN113448975A (en) * | 2021-05-26 | 2021-09-28 | 科大讯飞股份有限公司 | Method, device and system for updating character image library and storage medium |
CN113448975B (en) * | 2021-05-26 | 2023-01-17 | 科大讯飞股份有限公司 | Method, device and system for updating character image library and storage medium |
CN113593579A (en) * | 2021-07-23 | 2021-11-02 | 马上消费金融股份有限公司 | Voiceprint recognition method and device and electronic equipment |
CN113593579B (en) * | 2021-07-23 | 2024-04-30 | 马上消费金融股份有限公司 | Voiceprint recognition method and device and electronic equipment |
CN113628628A (en) * | 2021-07-29 | 2021-11-09 | 的卢技术有限公司 | Steering wheel adjusting method and system based on voiceprint recognition and storage medium |
CN113436634B (en) * | 2021-07-30 | 2023-06-20 | 中国平安人寿保险股份有限公司 | Voice classification method and device based on voiceprint recognition and related equipment |
CN113436634A (en) * | 2021-07-30 | 2021-09-24 | 中国平安人寿保险股份有限公司 | Voice classification method and device based on voiceprint recognition and related equipment |
CN113488059A (en) * | 2021-08-13 | 2021-10-08 | 广州市迪声音响有限公司 | Voiceprint recognition method and system |
WO2023036016A1 (en) * | 2021-09-07 | 2023-03-16 | 广西电网有限责任公司贺州供电局 | Voiceprint recognition method and system applied to electric power operation |
CN114598516B (en) * | 2022-02-28 | 2024-04-26 | 北京梧桐车联科技有限责任公司 | Information encryption and information decryption methods, devices, equipment and storage medium |
CN114598516A (en) * | 2022-02-28 | 2022-06-07 | 北京梧桐车联科技有限责任公司 | Information encryption method, information decryption method, device, equipment and storage medium |
CN114648978A (en) * | 2022-04-27 | 2022-06-21 | 腾讯科技(深圳)有限公司 | Voice verification processing method and related device |
WO2024082928A1 (en) * | 2022-10-21 | 2024-04-25 | 腾讯科技(深圳)有限公司 | Voice processing method and apparatus, and device and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112053695A (en) | Voiceprint recognition method and device, electronic equipment and storage medium | |
WO2020177380A1 (en) | Voiceprint detection method, apparatus and device based on short text, and storage medium | |
CN110956966B (en) | Voiceprint authentication method, voiceprint authentication device, voiceprint authentication medium and electronic equipment | |
CN108875463B (en) | Multi-view vector processing method and device | |
CN110265035B (en) | Speaker recognition method based on deep learning | |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
CN111445900A (en) | Front-end processing method and device for voice recognition and terminal equipment | |
CN109979466B (en) | Voiceprint identity identification method and device and computer readable storage medium | |
CN112331217B (en) | Voiceprint recognition method and device, storage medium and electronic equipment | |
CN110782902A (en) | Audio data determination method, apparatus, device and medium | |
Poddar et al. | Quality measures for speaker verification with short utterances | |
CN110570870A (en) | Text-independent voiceprint recognition method, device and equipment | |
CN112614510B (en) | Audio quality assessment method and device | |
CN113327584A (en) | Language identification method, device, equipment and storage medium | |
CN113112992B (en) | Voice recognition method and device, storage medium and server | |
Maazouzi et al. | MFCC and similarity measurements for speaker identification systems | |
CN108630208B (en) | Server, voiceprint-based identity authentication method and storage medium | |
CN112992155B (en) | Far-field voice speaker recognition method and device based on residual error neural network | |
CN111524524B (en) | Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium | |
Ghezaiel et al. | Nonlinear multi-scale decomposition by EMD for Co-Channel speaker identification | |
CN113035230A (en) | Authentication model training method and device and electronic equipment | |
EP3613040B1 (en) | Speaker recognition method and system | |
CN114512133A (en) | Sound object recognition method, sound object recognition device, server and storage medium | |
CN112309404A (en) | Machine voice identification method, device, equipment and storage medium | |
CN112185347A (en) | Language identification method, language identification device, server and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20201208 |