CN112053695A - Voiceprint recognition method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112053695A
Authority
CN
China
Prior art keywords
voice
recognized
voiceprint
spectrum information
frequency spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010955823.5A
Other languages
Chinese (zh)
Inventor
邹佳宏
梁延峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202010955823.5A priority Critical patent/CN112053695A/en
Publication of CN112053695A publication Critical patent/CN112053695A/en
Withdrawn legal-status Critical Current

Classifications

    • G: Physics
    • G10: Musical instruments; acoustics
    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L17/18: Artificial neural networks; connectionist approaches
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques where the extracted parameters are the cepstrum


Abstract

An embodiment of the present application discloses a voiceprint recognition method, a voiceprint recognition device, electronic equipment and a storage medium, wherein the method includes: acquiring frequency spectrum information of a voice to be recognized; identifying, according to the frequency spectrum information, effective voice segments and invalid voice segments in the voice to be recognized; removing the invalid voice segments and splicing the effective voice segments to obtain effective voice; acquiring frequency spectrum information of the effective voice; performing feature extraction on the frequency spectrum information of the effective voice through a feature extraction model based on a deep convolutional neural network to obtain a voiceprint feature vector to be recognized corresponding to the voice to be recognized; and performing similarity calculation between the voiceprint feature vector to be recognized and the existing voiceprint feature vectors in a voice feature library to determine the speaker identity information corresponding to the voice to be recognized. By removing the invalid voice segments, high-quality voice data is provided to the feature extraction model, which improves the accuracy of the voiceprint recognition result.

Description

Voiceprint recognition method and device, electronic equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of identity recognition, in particular to a voiceprint recognition method and device, electronic equipment and a storage medium.
Background
Voiceprint recognition, also called speaker recognition, is a biometric technology that identifies a speaker according to the characteristics of the speaker's voice; it can be widely applied in fields such as security, finance, and anti-fraud.
Currently, the most widely used voiceprint recognition method is the iVector/PLDA algorithm. The process includes the following steps: obtaining the spectral information of the voice through MFCC (Mel-Frequency Cepstral Coefficients); mapping the obtained high-dimensional MFCC feature information to a low-dimensional vector, the iVector, through Gaussian supervector factor analysis, where the iVector contains both the speaker's voiceprint information and channel information; performing channel compensation on the iVector using the PLDA algorithm to obtain a voiceprint feature vector; and matching the voiceprint feature vector against the vectors in a database to determine the speaker's identity.
Because the iVector contains both speaker information and channel information, it still carries noise and background sound even after PLDA channel compensation, which strongly affects the recognition result and leads to low recognition accuracy.
Disclosure of Invention
The embodiment of the application provides a voiceprint recognition method and device, electronic equipment and a storage medium, and is beneficial to improving the accuracy of recognition results.
In order to solve the above problem, in a first aspect, an embodiment of the present application provides a voiceprint recognition method, including:
acquiring frequency spectrum information of a voice to be recognized;
identifying, according to the frequency spectrum information, effective voice segments and invalid voice segments in the voice to be recognized;
removing the invalid voice segments, and splicing the effective voice segments to obtain effective voice;
acquiring frequency spectrum information of the effective voice;
performing feature extraction on the frequency spectrum information of the effective voice through a feature extraction model based on a deep convolutional neural network to obtain a voiceprint feature vector to be recognized corresponding to the voice to be recognized;
and performing similarity calculation on the voiceprint feature vector to be recognized and the existing voiceprint feature vector in the voice feature library to determine the speaker identity information corresponding to the voice to be recognized.
Optionally, the identifying, according to the frequency spectrum information, effective voice segments and invalid voice segments in the voice to be recognized includes:
inputting the frequency spectrum information of the voice to be recognized into a classification model based on a deep convolutional neural network to obtain the effective voice segments and invalid voice segments in the voice to be recognized.
Optionally, the invalid voice segments include noise segments and/or background sound segments.
Optionally, the acquiring of the frequency spectrum information of the voice to be recognized includes:
performing short-time Fourier transform processing on the voice to be recognized to obtain the frequency spectrum information of the voice to be recognized; or
calculating a Mel frequency cepstrum coefficient corresponding to the voice to be recognized as the frequency spectrum information of the voice to be recognized.
Optionally, the performing similarity calculation on the voiceprint feature vector to be recognized and existing voiceprint feature vectors in a speech feature library to determine speaker identity information corresponding to the speech to be recognized includes:
similarity calculation is carried out on the voiceprint feature vector to be recognized and the existing voiceprint feature vector in the voice feature library, and the existing voiceprint feature vector with the largest similarity value with the voiceprint feature vector to be recognized is used as a candidate vector;
if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to a preset threshold value, determining the identity information of the speaker corresponding to the candidate vector as the identity information of the speaker corresponding to the voice to be recognized;
and if the similarity value of the voiceprint feature vector to be identified and the candidate vector is smaller than the preset threshold value, determining that voiceprint identification fails.
Optionally, the deep convolutional neural network is a ResNet network.
In a second aspect, an embodiment of the present application provides a voiceprint recognition apparatus, including:
the first frequency spectrum information acquisition module is used for acquiring frequency spectrum information of the voice to be recognized;
the voice segment recognition module is used for identifying, according to the frequency spectrum information, the effective voice segments and invalid voice segments in the voice to be recognized;
the effective voice splicing module is used for removing the invalid voice fragments and splicing the effective voice fragments to obtain effective voice;
the second spectrum information acquisition module is used for acquiring the spectrum information of the effective voice;
the characteristic extraction module is used for extracting the characteristics of the frequency spectrum information of the effective voice through a characteristic extraction model based on a deep convolutional neural network to obtain a voiceprint characteristic vector to be recognized corresponding to the voice to be recognized;
and the speaker identity determining module is used for carrying out similarity calculation on the voiceprint feature vector to be recognized and the existing voiceprint feature vector in the voice feature library to determine the speaker identity information corresponding to the voice to be recognized.
Optionally, the voice segment recognition module is specifically configured to:
and inputting the frequency spectrum information of the voice to be recognized into a classification model based on a deep convolutional neural network to obtain an effective voice segment and an invalid voice segment in the voice to be recognized.
Optionally, the inactive speech segments include noise segments and/or background sound segments.
Optionally, the first spectrum information obtaining module is specifically configured to:
carrying out short-time Fourier transform processing on the voice to be recognized to obtain frequency spectrum information of the voice to be recognized; or
calculating a Mel frequency cepstrum coefficient corresponding to the voice to be recognized as the frequency spectrum information of the voice to be recognized.
Optionally, the speaker identity determining module includes:
a candidate vector determining unit, configured to perform similarity calculation on the voiceprint feature vector to be recognized and existing voiceprint feature vectors in a speech feature library, and use an existing voiceprint feature vector with a largest similarity value with the voiceprint feature vector to be recognized as a candidate vector;
the speaker identity determining unit is used for determining that the speaker identity information corresponding to the candidate vector is the speaker identity information corresponding to the voice to be recognized if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to a preset threshold value;
and the recognition failure determining unit is used for determining that the voiceprint recognition fails if the similarity value of the voiceprint feature vector to be recognized and the candidate vector is smaller than the preset threshold value.
Optionally, the deep convolutional neural network is a ResNet network.
In a third aspect, an embodiment of the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the voiceprint recognition method according to the embodiment of the present application.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the voiceprint recognition method disclosed in the present application.
The voiceprint recognition method and device, electronic equipment and storage medium provided by the embodiments of the present application acquire the frequency spectrum information of the voice to be recognized and, according to that information, identify the effective voice segments and invalid voice segments in the voice to be recognized. The invalid voice segments are removed and the effective voice segments are spliced to obtain the effective voice, whose frequency spectrum information is then acquired. Feature extraction is performed on this frequency spectrum information through a feature extraction model based on a deep convolutional neural network to obtain the voiceprint feature vector to be recognized corresponding to the voice to be recognized, and similarity calculation between this vector and the existing voiceprint feature vectors in a voice feature library determines the identity of the speaker corresponding to the voice to be recognized. Because the effective and invalid voice segments are identified first, the invalid segments removed, and only the retained, spliced effective segments used for voiceprint recognition, the feature extraction model is supplied with voice data of the highest possible quality. Extracting the voiceprint features with an end-to-end feature extraction model also avoids the poor feature selection that results from relying on experience, thereby improving the accuracy of the voiceprint recognition result.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a flowchart of a voiceprint recognition method according to a first embodiment of the present application;
fig. 2 is a schematic structural diagram of a voiceprint recognition apparatus according to a second embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
Embodiment One
As shown in fig. 1, the voiceprint recognition method provided by this embodiment includes steps 110 to 160.
Step 110, obtaining the frequency spectrum information of the voice to be recognized.
Short-Time Fourier Transform (STFT) processing may be performed on the speech to be recognized to obtain spectrum information of the speech to be recognized; alternatively, a Mel Frequency Cepstrum Coefficient (MFCC) corresponding to the speech to be recognized may also be used as the spectrum information of the speech to be recognized.
In an embodiment of the present application, the acquiring the spectrum information of the speech to be recognized includes:
carrying out short-time Fourier transform processing on the voice to be recognized to obtain frequency spectrum information of the voice to be recognized; or
calculating a Mel frequency cepstrum coefficient corresponding to the voice to be recognized as the frequency spectrum information of the voice to be recognized.
When short-time Fourier transform processing is carried out on the voice to be recognized, framing and windowing processing is carried out on the voice to be recognized to obtain a plurality of windowed data frames, and then Fourier transform processing is carried out on each windowed data frame to obtain frequency spectrum information of the voice to be recognized.
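As an illustrative sketch of the framing, windowing, and Fourier transform steps above (the patent fixes none of the parameters, so the 16 kHz sample rate, 25 ms frame, and 10 ms hop below are assumptions):

```python
import numpy as np

def stft_spectrum(signal, frame_len=400, hop=160):
    """Frame the signal, apply a Hann window, and FFT each frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # one-sided magnitude spectrum per windowed data frame
    return np.abs(np.fft.rfft(frames, axis=1))

# illustrative input: one second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
spec = stft_spectrum(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 201): 98 frames, 201 frequency bins
```

Each row of `spec` is the magnitude spectrum of one windowed frame; stacking the rows over time gives the spectrogram that the later models consume.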
When calculating the Mel frequency cepstrum coefficient corresponding to the speech to be recognized, firstly, performing pre-emphasis, framing and windowing on the speech to be recognized to obtain a plurality of windowed data frames; fourier transform processing is carried out on each data frame to obtain frequency spectrums in different time windows; inputting the frequency spectrums in different time windows into a Mel filter bank to obtain Mel frequency spectrums; and performing cepstrum analysis on the Mel frequency spectrum, namely taking logarithm of the Mel frequency spectrum, and performing Fourier inverse transformation to obtain a Mel frequency cepstrum coefficient corresponding to the voice to be recognized.
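The Mel frequency cepstrum pipeline described above can be sketched in NumPy/SciPy as follows. All numeric choices (0.97 pre-emphasis coefficient, 26 mel filters, 13 kept coefficients) are conventional assumptions rather than values from the patent, and the final step is written as the type-II DCT commonly used in practice for the inverse-transform stage of cepstral analysis:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    # pre-emphasis boosts high frequencies before framing
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # framing + Hann windowing
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)]) * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # triangular mel filter bank between 0 Hz and the Nyquist frequency
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, frame_len // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # cepstral step: decorrelate the log mel spectrum, keep low coefficients
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]

coeffs = mfcc(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
print(coeffs.shape)  # (98, 13): 13 cepstral coefficients per frame
```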
Step 120, identifying effective voice segments and invalid voice segments in the voice to be recognized according to the frequency spectrum information.
The invalid voice segments include noise segments and/or background sound segments.
The voice to be recognized is analyzed according to the frequency spectrum information to determine the effective voice segments and invalid voice segments it contains.
In an embodiment of the present application, the identifying, according to the frequency spectrum information, effective voice segments and invalid voice segments in the voice to be recognized includes:
inputting the frequency spectrum information of the voice to be recognized into a classification model based on a deep convolutional neural network to obtain the effective voice segments and invalid voice segments in the voice to be recognized.
The frequency spectrum information of the voice to be recognized can be input into a classification model based on a deep convolutional neural network, the classification model is used for classifying the frequency spectrum information of the voice to be recognized, effective voice segments and invalid voice segments in the voice to be recognized are recognized, and at least one effective voice segment and at least one invalid voice segment can be obtained.
The deep convolutional neural network may be a ResNet network, for example, a ResNet50 network. The classification model may be constructed by adding fully connected layers behind the ResNet50 network to classify the voice to be recognized; for example, 2 fully connected layers may be added so that the final output covers 2 classes, i.e., one class for effective voice segments and one class for invalid voice segments.
By using the deep neural network to filter out invalid audio such as noise and background sound, voice data of the highest possible quality can be provided to the feature extraction model so that a more accurate voiceprint feature vector to be recognized can be extracted.
Step 130, removing the invalid voice segments and splicing the effective voice segments to obtain effective voice.
Because the effective voice segments contain the speaker's identity information while the invalid voice segments do not, the invalid voice segments are removed, and the effective voice segments are spliced in their chronological order to obtain the effective voice. Using the spliced effective segments as the effective voice provides high-quality voice data for the subsequent feature extraction model.
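For illustration, the removal-and-splice step can be sketched as follows. The segment representation (a start time, sample array, and validity flag per segment) is hypothetical; the patent does not prescribe a data structure:

```python
import numpy as np

def splice_valid(segments):
    """segments: list of (start_time, samples, is_valid) tuples.
    Drops invalid segments and concatenates the effective ones in
    chronological order to form the effective voice."""
    valid = sorted((s for s in segments if s[2]), key=lambda s: s[0])
    if not valid:
        return np.array([])
    return np.concatenate([samples for _, samples, _ in valid])

speech = [(0.0, np.zeros(160), False),      # noise-only segment, removed
          (1.0, np.ones(320), True),        # effective segment
          (3.0, np.full(160, 2.0), True)]   # effective segment
effective = splice_valid(speech)
print(effective.shape)  # (480,): only the two effective segments remain
```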
Step 140, obtaining the spectrum information of the effective voice.
The effective voice can be subjected to short-time Fourier transform processing to obtain frequency spectrum information of the effective voice; alternatively, Mel Frequency Cepstrum Coefficient (MFCC) corresponding to the effective speech may be used as the spectrum information of the effective speech.
When the effective voice is subjected to short-time Fourier transform processing, the effective voice is subjected to framing and windowing processing to obtain a plurality of windowed data frames, and then Fourier transform processing is performed on each windowed data frame to obtain frequency spectrum information of the effective voice.
When calculating the Mel frequency cepstrum coefficient corresponding to the effective voice, firstly, carrying out pre-emphasis, framing and windowing on the effective voice to obtain a plurality of windowed data frames; fourier transform processing is carried out on each data frame to obtain frequency spectrums in different time windows; inputting the frequency spectrums in different time windows into a Mel filter bank to obtain Mel frequency spectrums; and performing cepstrum analysis on the Mel frequency spectrum, namely taking logarithm of the Mel frequency spectrum, and performing Fourier inverse transformation to obtain a Mel frequency cepstrum coefficient corresponding to the effective voice.
Step 150, performing feature extraction on the frequency spectrum information of the effective voice through a feature extraction model based on a deep convolutional neural network to obtain the voiceprint feature vector to be recognized corresponding to the voice to be recognized.
Inputting the frequency spectrum information of the effective voice into an end-to-end characteristic extraction model based on a deep convolutional neural network, and performing characteristic extraction on the frequency spectrum information of the effective voice by using the characteristic extraction model to obtain a voiceprint characteristic vector to be recognized corresponding to the voice to be recognized. The dimension of the voiceprint feature vector to be identified can be set according to requirements, and can be a 512-dimensional vector for example.
The deep convolutional neural network may be a ResNet network, for example, a ResNet50 network. The feature extraction model may be implemented by adding fully connected layers behind the ResNet50 network so as to output a vector of a preset dimension; for example, 2 fully connected layers may be added so that a 512-dimensional vector is finally output.
Step 160, performing similarity calculation between the voiceprint feature vector to be recognized and the existing voiceprint feature vectors in the voice feature library to determine the speaker identity information corresponding to the voice to be recognized.
The voice feature library stores the existing voiceprint feature vectors and the corresponding speaker identity information.
Similarity calculation is performed on the voiceprint feature vectors to be recognized and the existing voiceprint feature vectors in the voice feature library respectively to determine the speaker identity information corresponding to the voice to be recognized, for example, the speaker identity information corresponding to the existing voiceprint feature vector with the largest similarity value can be used as the speaker identity information corresponding to the voice to be recognized.
In an embodiment of the present application, the performing similarity calculation on the voiceprint feature vector to be recognized and existing voiceprint feature vectors in a speech feature library to determine speaker identity information corresponding to the speech to be recognized includes:
similarity calculation is carried out on the voiceprint feature vector to be recognized and the existing voiceprint feature vector in the voice feature library, and the existing voiceprint feature vector with the largest similarity value with the voiceprint feature vector to be recognized is used as a candidate vector;
if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to a preset threshold value, determining the identity information of the speaker corresponding to the candidate vector as the identity information of the speaker corresponding to the voice to be recognized;
and if the similarity value of the voiceprint feature vector to be identified and the candidate vector is smaller than the preset threshold value, determining that voiceprint identification fails.
Similarity calculation is performed between the voiceprint feature vector to be recognized and each existing voiceprint feature vector in the voice feature library, yielding a similarity value for every existing vector. These values are compared, and the existing voiceprint feature vector with the largest similarity value is taken as the candidate vector. If the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to the preset threshold value, voiceprint recognition succeeds, and the speaker identity information corresponding to the candidate vector is used as the speaker identity information corresponding to the voice to be recognized. If that similarity value is smaller than the preset threshold value, voiceprint recognition fails, i.e., the speaker identity corresponding to the voice to be recognized is not recognized. Comparing the maximum similarity value against a preset threshold value, rather than accepting the best match unconditionally, makes the determination result more accurate.
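The matching logic above can be sketched with cosine similarity. The patent fixes neither the similarity measure nor the threshold value, so both are assumptions here:

```python
import numpy as np

def identify(query, library, threshold=0.7):
    """library: dict mapping speaker_id -> enrolled voiceprint vector.
    Returns (speaker_id, similarity), or (None, similarity) when the
    best match falls below the preset threshold."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    best_id, best_sim = max(((sid, cos(query, vec))
                             for sid, vec in library.items()),
                            key=lambda pair: pair[1])
    return (best_id, best_sim) if best_sim >= threshold else (None, best_sim)

# toy 2-D "voiceprints" for two enrolled speakers
lib = {"alice": np.array([1.0, 0.0]), "bob": np.array([0.0, 1.0])}
print(identify(np.array([0.9, 0.1]), lib)[0])  # alice
```

With a stricter threshold the same query can fail recognition, which corresponds to the "voiceprint recognition fails" branch above.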
The voiceprint recognition method provided by the embodiment of the present application acquires the frequency spectrum information of the voice to be recognized and, according to that information, identifies the effective voice segments and invalid voice segments in the voice to be recognized. The invalid voice segments are removed and the effective voice segments are spliced to obtain the effective voice, whose frequency spectrum information is then acquired. Feature extraction is performed on this frequency spectrum information through a feature extraction model based on a deep convolutional neural network to obtain the voiceprint feature vector to be recognized corresponding to the voice to be recognized, and similarity calculation between this vector and the existing voiceprint feature vectors in a voice feature library determines the identity of the speaker corresponding to the voice to be recognized. Because the effective and invalid voice segments are identified first, the invalid segments removed, and only the retained, spliced effective segments used for voiceprint recognition, the feature extraction model is supplied with voice data of the highest possible quality. Extracting the voiceprint features with an end-to-end feature extraction model also avoids the poor feature selection that results from relying on experience, thereby improving the accuracy of the voiceprint recognition result.
Embodiment Two
In the present embodiment, as shown in fig. 2, a voiceprint recognition apparatus 200 includes:
a first spectrum information obtaining module 210, configured to obtain spectrum information of a voice to be recognized;
a voice segment recognition module 220, configured to identify, according to the frequency spectrum information, the effective voice segments and invalid voice segments in the voice to be recognized;
an effective speech splicing module 230, configured to remove the invalid speech segments and splice the effective speech segments to obtain an effective speech;
a second spectrum information obtaining module 240, configured to obtain spectrum information of the valid voice;
a feature extraction module 250, configured to perform feature extraction on the frequency spectrum information of the effective speech through a feature extraction model based on a deep convolutional neural network, so as to obtain a voiceprint feature vector to be recognized, which corresponds to the speech to be recognized;
and the speaker identity determining module 260 is configured to perform similarity calculation on the voiceprint feature vector to be recognized and existing voiceprint feature vectors in the voice feature library, and determine speaker identity information corresponding to the voice to be recognized.
Optionally, the voice segment recognition module is specifically configured to:
and inputting the frequency spectrum information of the voice to be recognized into a classification model based on a deep convolutional neural network to obtain an effective voice segment and an invalid voice segment in the voice to be recognized.
Optionally, the inactive speech segments include noise segments and/or background sound segments.
Optionally, the first spectrum information obtaining module is specifically configured to:
perform short-time Fourier transform processing on the voice to be recognized to obtain the frequency spectrum information of the voice to be recognized; or
calculate a Mel-frequency cepstral coefficient corresponding to the voice to be recognized as the frequency spectrum information of the voice to be recognized.
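The first of the two options can be sketched with a hand-rolled short-time Fourier transform. The 400-sample / 160-sample framing (25 ms windows with a 10 ms hop at 16 kHz) is a common convention, not a value from the application, and production code would typically call an existing STFT routine instead.

```python
import numpy as np

def stft_magnitude(signal, frame_len=400, hop=160):
    """Minimal short-time Fourier transform sketch (magnitude spectrum).

    Returns an array of shape (n_frames, frame_len // 2 + 1), one row of
    magnitude-spectrum bins per Hann-windowed frame. The frame and hop
    sizes are illustrative assumptions.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))
```

A Mel-frequency cepstral representation, the second option, would additionally pass these magnitudes through a Mel filter bank, a logarithm, and a discrete cosine transform.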
Optionally, the speaker identity determining module includes:
a candidate vector determining unit, configured to perform similarity calculation on the voiceprint feature vector to be recognized and existing voiceprint feature vectors in a speech feature library, and use an existing voiceprint feature vector with a largest similarity value with the voiceprint feature vector to be recognized as a candidate vector;
a speaker identity determining unit, configured to determine, if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to a preset threshold, that the speaker identity information corresponding to the candidate vector is the speaker identity information corresponding to the voice to be recognized;
and a recognition failure determining unit, configured to determine that voiceprint recognition fails if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is smaller than the preset threshold.
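The candidate-selection and thresholding logic of these units can be sketched as follows. Cosine similarity and the 0.8 threshold are illustrative assumptions, since the application fixes neither the similarity measure nor the threshold value.

```python
import numpy as np

def identify_speaker(query_vec, library, threshold=0.8):
    """Match a voiceprint vector against a library of enrolled vectors.

    `library` maps speaker IDs to enrolled voiceprint vectors. The
    enrolled vector with the largest cosine similarity is the candidate;
    it is accepted only if its similarity clears the preset threshold,
    otherwise recognition is reported as failed (speaker ID None).
    """
    query = query_vec / np.linalg.norm(query_vec)
    best_id, best_sim = None, -1.0
    for speaker_id, vec in library.items():
        sim = float(query @ (vec / np.linalg.norm(vec)))  # cosine similarity
        if sim > best_sim:
            best_id, best_sim = speaker_id, sim
    return (best_id, best_sim) if best_sim >= threshold else (None, best_sim)
```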
Optionally, the deep convolutional neural network is a ResNet network.
The voiceprint recognition device provided in the embodiment of the present application is configured to implement each step of the voiceprint recognition method described in the first embodiment of the present application, and specific implementation manners of each module of the device refer to the corresponding step, which is not described herein again.
In the voiceprint recognition apparatus provided by the embodiments of the application, the first spectrum information obtaining module acquires the frequency spectrum information of the voice to be recognized; the voice segment recognition module recognizes the effective and invalid voice segments in the voice to be recognized according to the frequency spectrum information; the effective voice splicing module removes the invalid segments and splices the effective segments to obtain the effective voice; the second spectrum information obtaining module acquires the frequency spectrum information of the effective voice; the feature extraction module performs feature extraction on that information through a feature extraction model based on a deep convolutional neural network to obtain the voiceprint feature vector to be recognized corresponding to the voice to be recognized; and the speaker identity determining module performs similarity calculation between that vector and the existing voiceprint feature vectors in the voice feature library to determine the identity of the speaker. Because the effective and invalid voice segments are recognized first, the invalid segments removed, and the effective segments retained and spliced for voiceprint recognition, the feature extraction model is supplied with voice data of as high a quality as possible; and because the voiceprint features are extracted with an end-to-end feature extraction model, the poor feature selection that can result from the limits of manual experience is avoided, improving the accuracy of the voiceprint recognition result.
EXAMPLE III
Embodiments of the present application also provide an electronic device. As shown in fig. 3, the electronic device 300 may include one or more processors 310 and one or more memories 320 connected to the processors 310. The electronic device 300 may also include an input interface 330 and an output interface 340 for communicating with another apparatus or system. Program code executed by the processor 310 may be stored in the memory 320.
The processor 310 in the electronic device 300 calls the program code stored in the memory 320 to perform the voiceprint recognition method in the above embodiment.
The above elements in the above electronic device may be connected to each other by a bus, such as one of a data bus, an address bus, a control bus, an expansion bus, and a local bus, or any combination thereof.
The embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the voiceprint recognition method according to the first embodiment of the present application.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The voiceprint recognition method and apparatus, the electronic device, and the storage medium provided by the embodiments of the application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the application, and the description of the embodiments is only intended to help in understanding the method and its core idea. Meanwhile, a person skilled in the art may, according to the idea of the application, make changes to the specific implementations and the application scope. In summary, the content of this specification should not be construed as limiting the application.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, or alternatively by hardware alone. Based on this understanding, the above technical solutions may be embodied in the form of a software product, stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disc, and including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in parts of the embodiments.

Claims (10)

1. A voiceprint recognition method, comprising:
acquiring frequency spectrum information of a voice to be recognized;
according to the frequency spectrum information, identifying effective voice segments and ineffective voice segments in the voice to be identified;
removing the invalid voice segments, and splicing the effective voice segments to obtain effective voice;
acquiring frequency spectrum information of the effective voice;
performing feature extraction on the frequency spectrum information of the effective voice through a feature extraction model based on a deep convolutional neural network to obtain a voiceprint feature vector to be recognized corresponding to the voice to be recognized;
and performing similarity calculation on the voiceprint feature vector to be recognized and the existing voiceprint feature vector in the voice feature library to determine the speaker identity information corresponding to the voice to be recognized.
2. The method according to claim 1, wherein the recognizing, according to the frequency spectrum information, effective voice segments and invalid voice segments in the voice to be recognized comprises:
inputting the frequency spectrum information of the voice to be recognized into a classification model based on a deep convolutional neural network to obtain the effective voice segments and the invalid voice segments in the voice to be recognized.
3. The method according to claim 1 or 2, wherein the invalid voice segments comprise noise segments and/or background sound segments.
4. The method according to claim 1, wherein the obtaining of the spectrum information of the speech to be recognized comprises:
carrying out short-time Fourier transform processing on the voice to be recognized to obtain frequency spectrum information of the voice to be recognized; or
calculating a Mel-frequency cepstral coefficient corresponding to the voice to be recognized as the frequency spectrum information of the voice to be recognized.
5. The method according to claim 1, wherein the calculating the similarity between the voiceprint feature vector to be recognized and existing voiceprint feature vectors in a speech feature library to determine the identity information of the speaker corresponding to the speech to be recognized comprises:
performing similarity calculation on the voiceprint feature vector to be recognized and the existing voiceprint feature vectors in the voice feature library, and taking the existing voiceprint feature vector with the largest similarity value to the voiceprint feature vector to be recognized as a candidate vector;
if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to a preset threshold value, determining the identity information of the speaker corresponding to the candidate vector as the identity information of the speaker corresponding to the voice to be recognized;
and if the similarity value of the voiceprint feature vector to be identified and the candidate vector is smaller than the preset threshold value, determining that voiceprint identification fails.
6. The method of claim 1, wherein the deep convolutional neural network is a ResNet network.
7. A voiceprint recognition apparatus comprising:
the first frequency spectrum information acquisition module is used for acquiring frequency spectrum information of the voice to be recognized;
the voice segment identification module is used for identifying an effective voice segment and an ineffective voice segment in the voice to be identified according to the frequency spectrum information;
the effective voice splicing module is used for removing the invalid voice fragments and splicing the effective voice fragments to obtain effective voice;
the second spectrum information acquisition module is used for acquiring the spectrum information of the effective voice;
the characteristic extraction module is used for extracting the characteristics of the frequency spectrum information of the effective voice through a characteristic extraction model based on a deep convolutional neural network to obtain a voiceprint characteristic vector to be recognized corresponding to the voice to be recognized;
and the speaker identity determining module is used for carrying out similarity calculation on the voiceprint feature vector to be recognized and the existing voiceprint feature vector in the voice feature library to determine the speaker identity information corresponding to the voice to be recognized.
8. The apparatus of claim 7, wherein the speech segment recognition module is specifically configured to:
and inputting the frequency spectrum information of the voice to be recognized into a classification model based on a deep convolutional neural network to obtain an effective voice segment and an invalid voice segment in the voice to be recognized.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the voiceprint recognition method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the voiceprint recognition method according to any one of claims 1 to 6.
CN202010955823.5A 2020-09-11 2020-09-11 Voiceprint recognition method and device, electronic equipment and storage medium Withdrawn CN112053695A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010955823.5A CN112053695A (en) 2020-09-11 2020-09-11 Voiceprint recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112053695A true CN112053695A (en) 2020-12-08

Family

ID=73610126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010955823.5A Withdrawn CN112053695A (en) 2020-09-11 2020-09-11 Voiceprint recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112053695A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021790A (en) * 2013-02-28 2014-09-03 联想(北京)有限公司 Sound control unlocking method and electronic device
CN104834849A (en) * 2015-04-14 2015-08-12 时代亿宝(北京)科技有限公司 Dual-factor identity authentication method and system based on voiceprint recognition and face recognition
CN110164452A (en) * 2018-10-10 2019-08-23 腾讯科技(深圳)有限公司 A kind of method of Application on Voiceprint Recognition, the method for model training and server
CN110473552A (en) * 2019-09-04 2019-11-19 平安科技(深圳)有限公司 Speech recognition authentication method and system
CN110970036A (en) * 2019-12-24 2020-04-07 网易(杭州)网络有限公司 Voiceprint recognition method and device, computer storage medium and electronic equipment

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767944A (en) * 2020-12-31 2021-05-07 中国工商银行股份有限公司 Voiceprint recognition method and device
CN113035202A (en) * 2021-01-28 2021-06-25 北京达佳互联信息技术有限公司 Identity recognition method and device
CN113035202B (en) * 2021-01-28 2023-02-28 北京达佳互联信息技术有限公司 Identity recognition method and device
WO2022179360A1 (en) * 2021-02-24 2022-09-01 嘉楠明芯(北京)科技有限公司 Voiceprint recognition method and apparatus, and computer-readable storage medium
CN112767950A (en) * 2021-02-24 2021-05-07 嘉楠明芯(北京)科技有限公司 Voiceprint recognition method and device and computer readable storage medium
CN112750441B (en) * 2021-04-02 2021-07-23 北京远鉴信息技术有限公司 Voiceprint recognition method and device, electronic equipment and storage medium
CN112750441A (en) * 2021-04-02 2021-05-04 北京远鉴信息技术有限公司 Voiceprint recognition method and device, electronic equipment and storage medium
CN113448975A (en) * 2021-05-26 2021-09-28 科大讯飞股份有限公司 Method, device and system for updating character image library and storage medium
CN113448975B (en) * 2021-05-26 2023-01-17 科大讯飞股份有限公司 Method, device and system for updating character image library and storage medium
CN113593579A (en) * 2021-07-23 2021-11-02 马上消费金融股份有限公司 Voiceprint recognition method and device and electronic equipment
CN113593579B (en) * 2021-07-23 2024-04-30 马上消费金融股份有限公司 Voiceprint recognition method and device and electronic equipment
CN113628628A (en) * 2021-07-29 2021-11-09 的卢技术有限公司 Steering wheel adjusting method and system based on voiceprint recognition and storage medium
CN113436634B (en) * 2021-07-30 2023-06-20 中国平安人寿保险股份有限公司 Voice classification method and device based on voiceprint recognition and related equipment
CN113436634A (en) * 2021-07-30 2021-09-24 中国平安人寿保险股份有限公司 Voice classification method and device based on voiceprint recognition and related equipment
CN113488059A (en) * 2021-08-13 2021-10-08 广州市迪声音响有限公司 Voiceprint recognition method and system
WO2023036016A1 (en) * 2021-09-07 2023-03-16 广西电网有限责任公司贺州供电局 Voiceprint recognition method and system applied to electric power operation
CN114598516B (en) * 2022-02-28 2024-04-26 北京梧桐车联科技有限责任公司 Information encryption and information decryption methods, devices, equipment and storage medium
CN114598516A (en) * 2022-02-28 2022-06-07 北京梧桐车联科技有限责任公司 Information encryption method, information decryption method, device, equipment and storage medium
CN114648978A (en) * 2022-04-27 2022-06-21 腾讯科技(深圳)有限公司 Voice verification processing method and related device
WO2024082928A1 (en) * 2022-10-21 2024-04-25 腾讯科技(深圳)有限公司 Voice processing method and apparatus, and device and medium

Similar Documents

Publication Publication Date Title
CN112053695A (en) Voiceprint recognition method and device, electronic equipment and storage medium
WO2020177380A1 (en) Voiceprint detection method, apparatus and device based on short text, and storage medium
CN110956966B (en) Voiceprint authentication method, voiceprint authentication device, voiceprint authentication medium and electronic equipment
CN108875463B (en) Multi-view vector processing method and device
CN110265035B (en) Speaker recognition method based on deep learning
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN111445900A (en) Front-end processing method and device for voice recognition and terminal equipment
CN109979466B (en) Voiceprint identity identification method and device and computer readable storage medium
CN112331217B (en) Voiceprint recognition method and device, storage medium and electronic equipment
CN110782902A (en) Audio data determination method, apparatus, device and medium
Poddar et al. Quality measures for speaker verification with short utterances
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
CN112614510B (en) Audio quality assessment method and device
CN113327584A (en) Language identification method, device, equipment and storage medium
CN113112992B (en) Voice recognition method and device, storage medium and server
Maazouzi et al. MFCC and similarity measurements for speaker identification systems
CN108630208B (en) Server, voiceprint-based identity authentication method and storage medium
CN112992155B (en) Far-field voice speaker recognition method and device based on residual error neural network
CN111524524B (en) Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium
Ghezaiel et al. Nonlinear multi-scale decomposition by EMD for Co-Channel speaker identification
CN113035230A (en) Authentication model training method and device and electronic equipment
EP3613040B1 (en) Speaker recognition method and system
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium
CN112309404A (en) Machine voice identification method, device, equipment and storage medium
CN112185347A (en) Language identification method, language identification device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20201208)