CN112053695A - Voiceprint recognition method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112053695A
Authority
CN
China
Prior art keywords
voice
recognized
voiceprint
spectrum information
frequency spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010955823.5A
Other languages
Chinese (zh)
Inventor
邹佳宏
梁延峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202010955823.5A priority Critical patent/CN112053695A/en
Publication of CN112053695A publication Critical patent/CN112053695A/en
Withdrawn legal-status Critical Current

Classifications

    • G: Physics
    • G10: Musical instruments; acoustics
    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L17/18: Artificial neural networks; connectionist approaches
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques where the extracted parameters are the cepstrum


Abstract

An embodiment of the present application discloses a voiceprint recognition method, a voiceprint recognition device, electronic equipment and a storage medium, wherein the method includes: acquiring frequency spectrum information of a voice to be recognized; identifying, according to the frequency spectrum information, effective voice segments and invalid voice segments in the voice to be recognized; removing the invalid voice segments and splicing the effective voice segments to obtain effective voice; acquiring frequency spectrum information of the effective voice; performing feature extraction on the frequency spectrum information of the effective voice through a feature extraction model based on a deep convolutional neural network to obtain a voiceprint feature vector to be recognized corresponding to the voice to be recognized; and performing similarity calculation between the voiceprint feature vector to be recognized and the existing voiceprint feature vectors in a voice feature library to determine the speaker identity information corresponding to the voice to be recognized. By removing the invalid voice segments, high-quality voice data is provided to the feature extraction model, which improves the accuracy of the voiceprint recognition result.

Description

Voiceprint recognition method and device, electronic equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of identity recognition, in particular to a voiceprint recognition method and device, electronic equipment and a storage medium.
Background
Voiceprint recognition, also called speaker recognition, is a biometric technology that identifies a speaker according to the characteristics of the speaker's voice; it can be widely applied in fields such as security, finance, and anti-fraud.
Currently, the most widely used voiceprint recognition method is the iVector/PLDA algorithm. The process includes the following steps: obtaining the spectral information of the voice through MFCC (Mel-Frequency Cepstral Coefficients); mapping the obtained high-dimensional MFCC feature information to a low-dimensional vector, the iVector, through Gaussian supervector factor analysis, where the iVector contains both the speaker's voiceprint information and channel information; performing channel compensation on the iVector using the PLDA algorithm to obtain a voiceprint feature vector; and matching the voiceprint feature vector against the vectors in a database to determine the speaker's identity.
Because the iVector contains both speaker information and channel information, it still carries noise and background sound even after PLDA channel compensation, which strongly affects the recognition result and leads to low recognition accuracy.
Disclosure of Invention
The embodiment of the application provides a voiceprint recognition method and device, electronic equipment and a storage medium, and is beneficial to improving the accuracy of recognition results.
In order to solve the above problem, in a first aspect, an embodiment of the present application provides a voiceprint recognition method, including:
acquiring frequency spectrum information of a voice to be recognized;
identifying, according to the frequency spectrum information, effective voice segments and invalid voice segments in the voice to be recognized;
removing the invalid voice segments, and splicing the effective voice segments to obtain effective voice;
acquiring frequency spectrum information of the effective voice;
performing feature extraction on the frequency spectrum information of the effective voice through a feature extraction model based on a deep convolutional neural network to obtain a voiceprint feature vector to be recognized corresponding to the voice to be recognized;
and performing similarity calculation on the voiceprint feature vector to be recognized and the existing voiceprint feature vector in the voice feature library to determine the speaker identity information corresponding to the voice to be recognized.
Optionally, the identifying, according to the frequency spectrum information, effective voice segments and invalid voice segments in the voice to be recognized includes:
inputting the frequency spectrum information of the voice to be recognized into a classification model based on a deep convolutional neural network to obtain the effective voice segments and invalid voice segments in the voice to be recognized.
Optionally, the invalid voice segments include noise segments and/or background sound segments.
Optionally, the acquiring of the frequency spectrum information of the voice to be recognized includes:
performing short-time Fourier transform processing on the voice to be recognized to obtain the frequency spectrum information of the voice to be recognized; or
calculating a Mel frequency cepstrum coefficient corresponding to the voice to be recognized as the frequency spectrum information of the voice to be recognized.
Optionally, the performing similarity calculation on the voiceprint feature vector to be recognized and existing voiceprint feature vectors in a speech feature library to determine speaker identity information corresponding to the speech to be recognized includes:
similarity calculation is carried out on the voiceprint feature vector to be recognized and the existing voiceprint feature vector in the voice feature library, and the existing voiceprint feature vector with the largest similarity value with the voiceprint feature vector to be recognized is used as a candidate vector;
if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to a preset threshold value, determining the identity information of the speaker corresponding to the candidate vector as the identity information of the speaker corresponding to the voice to be recognized;
and if the similarity value of the voiceprint feature vector to be identified and the candidate vector is smaller than the preset threshold value, determining that voiceprint identification fails.
Optionally, the deep convolutional neural network is a ResNet network.
In a second aspect, an embodiment of the present application provides a voiceprint recognition apparatus, including:
the first frequency spectrum information acquisition module is used for acquiring frequency spectrum information of the voice to be recognized;
the voice segment recognition module is used for identifying, according to the frequency spectrum information, the effective voice segments and invalid voice segments in the voice to be recognized;
the effective voice splicing module is used for removing the invalid voice fragments and splicing the effective voice fragments to obtain effective voice;
the second spectrum information acquisition module is used for acquiring the spectrum information of the effective voice;
the characteristic extraction module is used for extracting the characteristics of the frequency spectrum information of the effective voice through a characteristic extraction model based on a deep convolutional neural network to obtain a voiceprint characteristic vector to be recognized corresponding to the voice to be recognized;
and the speaker identity determining module is used for carrying out similarity calculation on the voiceprint feature vector to be recognized and the existing voiceprint feature vector in the voice feature library to determine the speaker identity information corresponding to the voice to be recognized.
Optionally, the voice segment recognition module is specifically configured to:
and inputting the frequency spectrum information of the voice to be recognized into a classification model based on a deep convolutional neural network to obtain an effective voice segment and an invalid voice segment in the voice to be recognized.
Optionally, the inactive speech segments include noise segments and/or background sound segments.
Optionally, the first spectrum information obtaining module is specifically configured to:
carrying out short-time Fourier transform processing on the voice to be recognized to obtain frequency spectrum information of the voice to be recognized; or
calculating a Mel frequency cepstrum coefficient corresponding to the voice to be recognized as the frequency spectrum information of the voice to be recognized.
Optionally, the speaker identity determining module includes:
a candidate vector determining unit, configured to perform similarity calculation on the voiceprint feature vector to be recognized and existing voiceprint feature vectors in a speech feature library, and use an existing voiceprint feature vector with a largest similarity value with the voiceprint feature vector to be recognized as a candidate vector;
the speaker identity determining unit is used for determining that the speaker identity information corresponding to the candidate vector is the speaker identity information corresponding to the voice to be recognized if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to a preset threshold value;
and the recognition failure determining unit is used for determining that the voiceprint recognition fails if the similarity value of the voiceprint feature vector to be recognized and the candidate vector is smaller than the preset threshold value.
Optionally, the deep convolutional neural network is a ResNet network.
In a third aspect, an embodiment of the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the voiceprint recognition method according to the embodiment of the present application.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the voiceprint recognition method disclosed in the present application.
The voiceprint recognition method and device, electronic equipment and storage medium provided by the embodiments of the present application acquire the frequency spectrum information of the voice to be recognized and, according to that information, identify the effective voice segments and invalid voice segments in the voice to be recognized. The invalid voice segments are removed and the effective voice segments are spliced to obtain the effective voice, whose frequency spectrum information is then acquired. Feature extraction is performed on this frequency spectrum information through a feature extraction model based on a deep convolutional neural network to obtain the voiceprint feature vector to be recognized corresponding to the voice to be recognized, and similarity calculation between this vector and the existing voiceprint feature vectors in a voice feature library determines the identity of the speaker corresponding to the voice to be recognized. Because the effective and invalid voice segments are identified first, the invalid segments removed, and only the retained, spliced effective segments used for voiceprint recognition, the feature extraction model is supplied with voice data of the highest possible quality. Extracting the voiceprint features with an end-to-end feature extraction model also avoids the poor feature selection that results from relying on experience, thereby improving the accuracy of the voiceprint recognition result.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a flowchart of a voiceprint recognition method according to a first embodiment of the present application;
fig. 2 is a schematic structural diagram of a voiceprint recognition apparatus according to a second embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
Embodiment One
As shown in fig. 1, the voiceprint recognition method provided by this embodiment includes steps 110 to 160.
Step 110, obtaining the frequency spectrum information of the voice to be recognized.
Short-Time Fourier Transform (STFT) processing may be performed on the speech to be recognized to obtain spectrum information of the speech to be recognized; alternatively, a Mel Frequency Cepstrum Coefficient (MFCC) corresponding to the speech to be recognized may also be used as the spectrum information of the speech to be recognized.
In an embodiment of the present application, the acquiring the spectrum information of the speech to be recognized includes:
carrying out short-time Fourier transform processing on the voice to be recognized to obtain frequency spectrum information of the voice to be recognized; or
calculating a Mel frequency cepstrum coefficient corresponding to the voice to be recognized as the frequency spectrum information of the voice to be recognized.
When short-time Fourier transform processing is carried out on the voice to be recognized, framing and windowing processing is carried out on the voice to be recognized to obtain a plurality of windowed data frames, and then Fourier transform processing is carried out on each windowed data frame to obtain frequency spectrum information of the voice to be recognized.
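As an illustrative sketch of the framing, windowing, and Fourier transform steps above (the patent fixes none of the parameters, so the 16 kHz sample rate, 25 ms frame, and 10 ms hop below are assumptions):

```python
import numpy as np

def stft_spectrum(signal, frame_len=400, hop=160):
    """Frame the signal, apply a Hann window, and FFT each frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # one-sided magnitude spectrum per windowed data frame
    return np.abs(np.fft.rfft(frames, axis=1))

# illustrative input: one second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
spec = stft_spectrum(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 201): 98 frames, 201 frequency bins
```

Each row of `spec` is the magnitude spectrum of one windowed frame; stacking the rows over time gives the spectrogram that the later models consume.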
When calculating the Mel frequency cepstrum coefficient corresponding to the speech to be recognized, firstly, performing pre-emphasis, framing and windowing on the speech to be recognized to obtain a plurality of windowed data frames; fourier transform processing is carried out on each data frame to obtain frequency spectrums in different time windows; inputting the frequency spectrums in different time windows into a Mel filter bank to obtain Mel frequency spectrums; and performing cepstrum analysis on the Mel frequency spectrum, namely taking logarithm of the Mel frequency spectrum, and performing Fourier inverse transformation to obtain a Mel frequency cepstrum coefficient corresponding to the voice to be recognized.
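The Mel frequency cepstrum pipeline described above can be sketched in NumPy/SciPy as follows. All numeric choices (0.97 pre-emphasis coefficient, 26 mel filters, 13 kept coefficients) are conventional assumptions rather than values from the patent, and the final step is written as the type-II DCT commonly used in practice for the inverse-transform stage of cepstral analysis:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    # pre-emphasis boosts high frequencies before framing
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # framing + Hann windowing
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)]) * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # triangular mel filter bank between 0 Hz and the Nyquist frequency
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, frame_len // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # cepstral step: decorrelate the log mel spectrum, keep low coefficients
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]

coeffs = mfcc(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
print(coeffs.shape)  # (98, 13): 13 cepstral coefficients per frame
```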
Step 120, identifying effective voice segments and invalid voice segments in the voice to be recognized according to the frequency spectrum information.
The invalid voice segments include noise segments and/or background sound segments.
The voice to be recognized is analyzed according to the frequency spectrum information to determine the effective voice segments and invalid voice segments it contains.
In an embodiment of the present application, the identifying, according to the frequency spectrum information, effective voice segments and invalid voice segments in the voice to be recognized includes:
inputting the frequency spectrum information of the voice to be recognized into a classification model based on a deep convolutional neural network to obtain the effective voice segments and invalid voice segments in the voice to be recognized.
The frequency spectrum information of the voice to be recognized can be input into a classification model based on a deep convolutional neural network, the classification model is used for classifying the frequency spectrum information of the voice to be recognized, effective voice segments and invalid voice segments in the voice to be recognized are recognized, and at least one effective voice segment and at least one invalid voice segment can be obtained.
The deep convolutional neural network may be a ResNet network, for example, a ResNet50 network. The classification model may be constructed by adding fully connected layers behind the ResNet50 network to classify the voice to be recognized; for example, 2 fully connected layers may be added so that the final output covers 2 classes, i.e., one class for effective voice segments and one class for invalid voice segments.
By using the deep neural network to filter out invalid audio such as noise and background sound, voice data of the highest possible quality can be provided to the feature extraction model so that a more accurate voiceprint feature vector to be recognized can be extracted.
Step 130, removing the invalid voice segments and splicing the effective voice segments to obtain effective voice.
Because the effective voice segments contain the speaker's identity information while the invalid voice segments do not, the invalid voice segments are removed, and the effective voice segments are spliced in their chronological order to obtain the effective voice. Using the spliced effective segments as the effective voice provides high-quality voice data for the subsequent feature extraction model.
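For illustration, the removal-and-splice step can be sketched as follows. The segment representation (a start time, sample array, and validity flag per segment) is hypothetical; the patent does not prescribe a data structure:

```python
import numpy as np

def splice_valid(segments):
    """segments: list of (start_time, samples, is_valid) tuples.
    Drops invalid segments and concatenates the effective ones in
    chronological order to form the effective voice."""
    valid = sorted((s for s in segments if s[2]), key=lambda s: s[0])
    if not valid:
        return np.array([])
    return np.concatenate([samples for _, samples, _ in valid])

speech = [(0.0, np.zeros(160), False),      # noise-only segment, removed
          (1.0, np.ones(320), True),        # effective segment
          (3.0, np.full(160, 2.0), True)]   # effective segment
effective = splice_valid(speech)
print(effective.shape)  # (480,): only the two effective segments remain
```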
Step 140, obtaining the spectrum information of the effective voice.
The effective voice can be subjected to short-time Fourier transform processing to obtain frequency spectrum information of the effective voice; alternatively, Mel Frequency Cepstrum Coefficient (MFCC) corresponding to the effective speech may be used as the spectrum information of the effective speech.
When the effective voice is subjected to short-time Fourier transform processing, the effective voice is subjected to framing and windowing processing to obtain a plurality of windowed data frames, and then Fourier transform processing is performed on each windowed data frame to obtain frequency spectrum information of the effective voice.
When calculating the Mel frequency cepstrum coefficient corresponding to the effective voice, firstly, carrying out pre-emphasis, framing and windowing on the effective voice to obtain a plurality of windowed data frames; fourier transform processing is carried out on each data frame to obtain frequency spectrums in different time windows; inputting the frequency spectrums in different time windows into a Mel filter bank to obtain Mel frequency spectrums; and performing cepstrum analysis on the Mel frequency spectrum, namely taking logarithm of the Mel frequency spectrum, and performing Fourier inverse transformation to obtain a Mel frequency cepstrum coefficient corresponding to the effective voice.
Step 150, performing feature extraction on the frequency spectrum information of the effective voice through a feature extraction model based on a deep convolutional neural network to obtain the voiceprint feature vector to be recognized corresponding to the voice to be recognized.
Inputting the frequency spectrum information of the effective voice into an end-to-end characteristic extraction model based on a deep convolutional neural network, and performing characteristic extraction on the frequency spectrum information of the effective voice by using the characteristic extraction model to obtain a voiceprint characteristic vector to be recognized corresponding to the voice to be recognized. The dimension of the voiceprint feature vector to be identified can be set according to requirements, and can be a 512-dimensional vector for example.
The deep convolutional neural network may be a ResNet network, for example, a ResNet50 network. The feature extraction model may be implemented by adding fully connected layers behind the ResNet50 network so as to output a vector of a preset dimension; for example, 2 fully connected layers may be added so that a 512-dimensional vector is finally output.
Step 160, performing similarity calculation between the voiceprint feature vector to be recognized and the existing voiceprint feature vectors in the voice feature library to determine the speaker identity information corresponding to the voice to be recognized.
The voice feature library stores the existing voiceprint feature vectors and the corresponding speaker identity information.
Similarity calculation is performed on the voiceprint feature vectors to be recognized and the existing voiceprint feature vectors in the voice feature library respectively to determine the speaker identity information corresponding to the voice to be recognized, for example, the speaker identity information corresponding to the existing voiceprint feature vector with the largest similarity value can be used as the speaker identity information corresponding to the voice to be recognized.
In an embodiment of the present application, the performing similarity calculation on the voiceprint feature vector to be recognized and existing voiceprint feature vectors in a speech feature library to determine speaker identity information corresponding to the speech to be recognized includes:
similarity calculation is carried out on the voiceprint feature vector to be recognized and the existing voiceprint feature vector in the voice feature library, and the existing voiceprint feature vector with the largest similarity value with the voiceprint feature vector to be recognized is used as a candidate vector;
if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to a preset threshold value, determining the identity information of the speaker corresponding to the candidate vector as the identity information of the speaker corresponding to the voice to be recognized;
and if the similarity value of the voiceprint feature vector to be identified and the candidate vector is smaller than the preset threshold value, determining that voiceprint identification fails.
Similarity calculation is performed between the voiceprint feature vector to be recognized and each existing voiceprint feature vector in the voice feature library, yielding a similarity value for every existing vector. These values are compared, and the existing voiceprint feature vector with the largest similarity value is taken as the candidate vector. If the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to the preset threshold value, voiceprint recognition succeeds, and the speaker identity information corresponding to the candidate vector is used as the speaker identity information corresponding to the voice to be recognized. If that similarity value is smaller than the preset threshold value, voiceprint recognition fails, i.e., the speaker identity corresponding to the voice to be recognized is not recognized. Comparing the maximum similarity value against a preset threshold value, rather than accepting the best match unconditionally, makes the determination result more accurate.
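The matching logic above can be sketched with cosine similarity. The patent fixes neither the similarity measure nor the threshold value, so both are assumptions here:

```python
import numpy as np

def identify(query, library, threshold=0.7):
    """library: dict mapping speaker_id -> enrolled voiceprint vector.
    Returns (speaker_id, similarity), or (None, similarity) when the
    best match falls below the preset threshold."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    best_id, best_sim = max(((sid, cos(query, vec))
                             for sid, vec in library.items()),
                            key=lambda pair: pair[1])
    return (best_id, best_sim) if best_sim >= threshold else (None, best_sim)

# toy 2-D "voiceprints" for two enrolled speakers
lib = {"alice": np.array([1.0, 0.0]), "bob": np.array([0.0, 1.0])}
print(identify(np.array([0.9, 0.1]), lib)[0])  # alice
```

With a stricter threshold the same query can fail recognition, which corresponds to the "voiceprint recognition fails" branch above.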
The voiceprint recognition method provided by the embodiment of the present application acquires the frequency spectrum information of the voice to be recognized and, according to that information, identifies the effective voice segments and invalid voice segments in the voice to be recognized. The invalid voice segments are removed and the effective voice segments are spliced to obtain the effective voice, whose frequency spectrum information is then acquired. Feature extraction is performed on this frequency spectrum information through a feature extraction model based on a deep convolutional neural network to obtain the voiceprint feature vector to be recognized corresponding to the voice to be recognized, and similarity calculation between this vector and the existing voiceprint feature vectors in a voice feature library determines the identity of the speaker corresponding to the voice to be recognized. Because the effective and invalid voice segments are identified first, the invalid segments removed, and only the retained, spliced effective segments used for voiceprint recognition, the feature extraction model is supplied with voice data of the highest possible quality. Extracting the voiceprint features with an end-to-end feature extraction model also avoids the poor feature selection that results from relying on experience, thereby improving the accuracy of the voiceprint recognition result.
Embodiment Two
In the present embodiment, as shown in fig. 2, a voiceprint recognition apparatus 200 includes:
a first spectrum information obtaining module 210, configured to obtain spectrum information of a voice to be recognized;
a voice segment recognition module 220, configured to identify, according to the frequency spectrum information, the effective voice segments and invalid voice segments in the voice to be recognized;
an effective speech splicing module 230, configured to remove the invalid speech segments and splice the effective speech segments to obtain an effective speech;
a second spectrum information obtaining module 240, configured to obtain spectrum information of the valid voice;
a feature extraction module 250, configured to perform feature extraction on the frequency spectrum information of the effective speech through a feature extraction model based on a deep convolutional neural network, so as to obtain a voiceprint feature vector to be recognized, which corresponds to the speech to be recognized;
and the speaker identity determining module 260 is configured to perform similarity calculation on the voiceprint feature vector to be recognized and existing voiceprint feature vectors in the voice feature library, and determine speaker identity information corresponding to the voice to be recognized.
Optionally, the voice segment recognition module is specifically configured to:
and inputting the frequency spectrum information of the voice to be recognized into a classification model based on a deep convolutional neural network to obtain an effective voice segment and an invalid voice segment in the voice to be recognized.
Optionally, the inactive speech segments include noise segments and/or background sound segments.
Optionally, the first spectrum information obtaining module is specifically configured to:
perform short-time Fourier transform processing on the voice to be recognized to obtain the frequency spectrum information of the voice to be recognized; or
calculate a Mel-frequency cepstral coefficient corresponding to the voice to be recognized as the frequency spectrum information of the voice to be recognized.
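The first of the two options can be sketched with a hand-rolled short-time Fourier transform. The 400-sample / 160-sample framing (25 ms windows with a 10 ms hop at 16 kHz) is a common convention, not a value from the application, and production code would typically call an existing STFT routine instead.

```python
import numpy as np

def stft_magnitude(signal, frame_len=400, hop=160):
    """Minimal short-time Fourier transform sketch (magnitude spectrum).

    Returns an array of shape (n_frames, frame_len // 2 + 1), one row of
    magnitude-spectrum bins per Hann-windowed frame. The frame and hop
    sizes are illustrative assumptions.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))
```

A Mel-frequency cepstral representation, the second option, would additionally pass these magnitudes through a Mel filter bank, a logarithm, and a discrete cosine transform.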
Optionally, the speaker identity determining module includes:
a candidate vector determining unit, configured to perform similarity calculation on the voiceprint feature vector to be recognized and existing voiceprint feature vectors in a speech feature library, and use an existing voiceprint feature vector with a largest similarity value with the voiceprint feature vector to be recognized as a candidate vector;
a speaker identity determining unit, configured to determine, if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to a preset threshold, that the speaker identity information corresponding to the candidate vector is the speaker identity information corresponding to the voice to be recognized;
and a recognition failure determining unit, configured to determine that voiceprint recognition fails if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is smaller than the preset threshold.
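The candidate-selection and thresholding logic of these units can be sketched as follows. Cosine similarity and the 0.8 threshold are illustrative assumptions, since the application fixes neither the similarity measure nor the threshold value.

```python
import numpy as np

def identify_speaker(query_vec, library, threshold=0.8):
    """Match a voiceprint vector against a library of enrolled vectors.

    `library` maps speaker IDs to enrolled voiceprint vectors. The
    enrolled vector with the largest cosine similarity is the candidate;
    it is accepted only if its similarity clears the preset threshold,
    otherwise recognition is reported as failed (speaker ID None).
    """
    query = query_vec / np.linalg.norm(query_vec)
    best_id, best_sim = None, -1.0
    for speaker_id, vec in library.items():
        sim = float(query @ (vec / np.linalg.norm(vec)))  # cosine similarity
        if sim > best_sim:
            best_id, best_sim = speaker_id, sim
    return (best_id, best_sim) if best_sim >= threshold else (None, best_sim)
```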
Optionally, the deep convolutional neural network is a ResNet network.
The voiceprint recognition device provided in the embodiment of the present application is configured to implement each step of the voiceprint recognition method described in the first embodiment of the present application, and specific implementation manners of each module of the device refer to the corresponding step, which is not described herein again.
In the voiceprint recognition apparatus provided by the embodiments of the application, the first spectrum information obtaining module acquires the frequency spectrum information of the voice to be recognized; the voice segment recognition module recognizes the effective and invalid voice segments in the voice to be recognized according to the frequency spectrum information; the effective voice splicing module removes the invalid segments and splices the effective segments to obtain the effective voice; the second spectrum information obtaining module acquires the frequency spectrum information of the effective voice; the feature extraction module performs feature extraction on that information through a feature extraction model based on a deep convolutional neural network to obtain the voiceprint feature vector to be recognized corresponding to the voice to be recognized; and the speaker identity determining module performs similarity calculation between that vector and the existing voiceprint feature vectors in the voice feature library to determine the identity of the speaker. Because the effective and invalid voice segments are recognized first, the invalid segments removed, and the effective segments retained and spliced for voiceprint recognition, the feature extraction model is supplied with voice data of as high a quality as possible; and because the voiceprint features are extracted with an end-to-end feature extraction model, the poor feature selection that can result from the limits of manual experience is avoided, improving the accuracy of the voiceprint recognition result.
EXAMPLE III
Embodiments of the present application also provide an electronic device. As shown in fig. 3, the electronic device 300 may include one or more processors 310 and one or more memories 320 connected to the processors 310. The electronic device 300 may also include an input interface 330 and an output interface 340 for communicating with another apparatus or system. Program code executed by the processor 310 may be stored in the memory 320.
The processor 310 in the electronic device 300 calls the program code stored in the memory 320 to perform the voiceprint recognition method in the above embodiment.
The above elements in the above electronic device may be connected to each other by a bus, such as one of a data bus, an address bus, a control bus, an expansion bus, and a local bus, or any combination thereof.
The embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the voiceprint recognition method according to the first embodiment of the present application.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The voiceprint recognition method and apparatus, the electronic device, and the storage medium provided by the embodiments of the application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the application, and the description of the embodiments is only intended to help in understanding the method and its core idea. Meanwhile, a person skilled in the art may, according to the idea of the application, make changes to the specific implementations and the application scope. In summary, the content of this specification should not be construed as limiting the application.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, or alternatively by hardware alone. Based on this understanding, the above technical solutions may be embodied in the form of a software product, stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disc, and including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in parts of the embodiments.

Claims (10)

1. A voiceprint recognition method, comprising:
acquiring frequency spectrum information of a voice to be recognized;
according to the frequency spectrum information, identifying effective voice segments and ineffective voice segments in the voice to be identified;
removing the invalid voice segments, and splicing the effective voice segments to obtain effective voice;
acquiring frequency spectrum information of the effective voice;
performing feature extraction on the frequency spectrum information of the effective voice through a feature extraction model based on a deep convolutional neural network to obtain a voiceprint feature vector to be recognized corresponding to the voice to be recognized;
and performing similarity calculation on the voiceprint feature vector to be recognized and the existing voiceprint feature vector in the voice feature library to determine the speaker identity information corresponding to the voice to be recognized.
2. The method according to claim 1, wherein the recognizing, according to the frequency spectrum information, effective voice segments and invalid voice segments in the voice to be recognized comprises:
inputting the frequency spectrum information of the voice to be recognized into a classification model based on a deep convolutional neural network to obtain the effective voice segments and the invalid voice segments in the voice to be recognized.
3. The method according to claim 1 or 2, wherein the invalid voice segments comprise noise segments and/or background sound segments.
4. The method according to claim 1, wherein the obtaining of the spectrum information of the speech to be recognized comprises:
carrying out short-time Fourier transform processing on the voice to be recognized to obtain frequency spectrum information of the voice to be recognized; or
calculating a Mel-frequency cepstral coefficient corresponding to the voice to be recognized as the frequency spectrum information of the voice to be recognized.
5. The method according to claim 1, wherein the calculating the similarity between the voiceprint feature vector to be recognized and existing voiceprint feature vectors in a speech feature library to determine the identity information of the speaker corresponding to the speech to be recognized comprises:
performing similarity calculation on the voiceprint feature vector to be recognized and the existing voiceprint feature vectors in the voice feature library, and taking the existing voiceprint feature vector with the largest similarity value to the voiceprint feature vector to be recognized as a candidate vector;
if the similarity value between the voiceprint feature vector to be recognized and the candidate vector is greater than or equal to a preset threshold value, determining the identity information of the speaker corresponding to the candidate vector as the identity information of the speaker corresponding to the voice to be recognized;
and if the similarity value of the voiceprint feature vector to be identified and the candidate vector is smaller than the preset threshold value, determining that voiceprint identification fails.
6. The method of claim 1, wherein the deep convolutional neural network is a ResNet network.
7. A voiceprint recognition apparatus comprising:
the first frequency spectrum information acquisition module is used for acquiring frequency spectrum information of the voice to be recognized;
the voice segment identification module is used for identifying an effective voice segment and an ineffective voice segment in the voice to be identified according to the frequency spectrum information;
the effective voice splicing module is used for removing the invalid voice fragments and splicing the effective voice fragments to obtain effective voice;
the second spectrum information acquisition module is used for acquiring the spectrum information of the effective voice;
the characteristic extraction module is used for extracting the characteristics of the frequency spectrum information of the effective voice through a characteristic extraction model based on a deep convolutional neural network to obtain a voiceprint characteristic vector to be recognized corresponding to the voice to be recognized;
and the speaker identity determining module is used for carrying out similarity calculation on the voiceprint feature vector to be recognized and the existing voiceprint feature vector in the voice feature library to determine the speaker identity information corresponding to the voice to be recognized.
8. The apparatus of claim 7, wherein the speech segment recognition module is specifically configured to:
and inputting the frequency spectrum information of the voice to be recognized into a classification model based on a deep convolutional neural network to obtain an effective voice segment and an invalid voice segment in the voice to be recognized.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the voiceprint recognition method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the voiceprint recognition method according to any one of claims 1 to 6.
CN202010955823.5A 2020-09-11 2020-09-11 Voiceprint recognition method and device, electronic equipment and storage medium Withdrawn CN112053695A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010955823.5A CN112053695A (en) 2020-09-11 2020-09-11 Voiceprint recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112053695A true CN112053695A (en) 2020-12-08

Family

ID=73610126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010955823.5A Withdrawn CN112053695A (en) 2020-09-11 2020-09-11 Voiceprint recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112053695A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021790A (en) * 2013-02-28 2014-09-03 联想(北京)有限公司 Sound control unlocking method and electronic device
CN104834849A (en) * 2015-04-14 2015-08-12 时代亿宝(北京)科技有限公司 Dual-factor identity authentication method and system based on voiceprint recognition and face recognition
CN110164452A (en) * 2018-10-10 2019-08-23 腾讯科技(深圳)有限公司 A kind of method of Application on Voiceprint Recognition, the method for model training and server
CN110473552A (en) * 2019-09-04 2019-11-19 平安科技(深圳)有限公司 Speech recognition authentication method and system
CN110970036A (en) * 2019-12-24 2020-04-07 网易(杭州)网络有限公司 Voiceprint recognition method and device, computer storage medium and electronic equipment

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767944A (en) * 2020-12-31 2021-05-07 中国工商银行股份有限公司 Voiceprint recognition method and device
CN113035202A (en) * 2021-01-28 2021-06-25 北京达佳互联信息技术有限公司 Identity recognition method and device
CN113035202B (en) * 2021-01-28 2023-02-28 北京达佳互联信息技术有限公司 Identity recognition method and device
WO2022179360A1 (en) * 2021-02-24 2022-09-01 嘉楠明芯(北京)科技有限公司 Voiceprint recognition method and apparatus, and computer-readable storage medium
CN112767950A (en) * 2021-02-24 2021-05-07 嘉楠明芯(北京)科技有限公司 Voiceprint recognition method and device and computer readable storage medium
CN112750441B (en) * 2021-04-02 2021-07-23 北京远鉴信息技术有限公司 Voiceprint recognition method and device, electronic equipment and storage medium
CN112750441A (en) * 2021-04-02 2021-05-04 北京远鉴信息技术有限公司 Voiceprint recognition method and device, electronic equipment and storage medium
CN113448975A (en) * 2021-05-26 2021-09-28 科大讯飞股份有限公司 Method, device and system for updating character image library and storage medium
CN113448975B (en) * 2021-05-26 2023-01-17 科大讯飞股份有限公司 Method, device and system for updating character image library and storage medium
CN113593579A (en) * 2021-07-23 2021-11-02 马上消费金融股份有限公司 Voiceprint recognition method and device and electronic equipment
CN113593579B (en) * 2021-07-23 2024-04-30 马上消费金融股份有限公司 Voiceprint recognition method and device and electronic equipment
CN113628628A (en) * 2021-07-29 2021-11-09 的卢技术有限公司 Steering wheel adjusting method and system based on voiceprint recognition and storage medium
CN113436634B (en) * 2021-07-30 2023-06-20 中国平安人寿保险股份有限公司 Voice classification method and device based on voiceprint recognition and related equipment
CN113436634A (en) * 2021-07-30 2021-09-24 中国平安人寿保险股份有限公司 Voice classification method and device based on voiceprint recognition and related equipment
CN113488059A (en) * 2021-08-13 2021-10-08 广州市迪声音响有限公司 Voiceprint recognition method and system
WO2023036016A1 (en) * 2021-09-07 2023-03-16 广西电网有限责任公司贺州供电局 Voiceprint recognition method and system applied to electric power operation
CN114598516B (en) * 2022-02-28 2024-04-26 北京梧桐车联科技有限责任公司 Information encryption and information decryption methods, devices, equipment and storage medium
CN114598516A (en) * 2022-02-28 2022-06-07 北京梧桐车联科技有限责任公司 Information encryption method, information decryption method, device, equipment and storage medium
CN114648978A (en) * 2022-04-27 2022-06-21 腾讯科技(深圳)有限公司 Voice verification processing method and related device
WO2024082928A1 (en) * 2022-10-21 2024-04-25 腾讯科技(深圳)有限公司 Voice processing method and apparatus, and device and medium

Similar Documents

Publication Publication Date Title
CN112053695A (en) Voiceprint recognition method and device, electronic equipment and storage medium
WO2020177380A1 (en) Voiceprint detection method, apparatus and device based on short text, and storage medium
CN110956966B (en) Voiceprint authentication method, voiceprint authentication device, voiceprint authentication medium and electronic equipment
CN108875463B (en) Multi-view vector processing method and device
CN110265035B (en) Speaker recognition method based on deep learning
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN111445900A (en) Front-end processing method and device for voice recognition and terminal equipment
CN109979466B (en) Voiceprint identity identification method and device and computer readable storage medium
CN112331217B (en) Voiceprint recognition method and device, storage medium and electronic equipment
CN110782902A (en) Audio data determination method, apparatus, device and medium
Poddar et al. Quality measures for speaker verification with short utterances
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
CN112614510B (en) Audio quality assessment method and device
CN113327584A (en) Language identification method, device, equipment and storage medium
CN113112992B (en) Voice recognition method and device, storage medium and server
Maazouzi et al. MFCC and similarity measurements for speaker identification systems
CN108630208B (en) Server, voiceprint-based identity authentication method and storage medium
CN112992155B (en) Far-field voice speaker recognition method and device based on residual error neural network
CN111524524B (en) Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium
Ghezaiel et al. Nonlinear multi-scale decomposition by EMD for Co-Channel speaker identification
CN113035230A (en) Authentication model training method and device and electronic equipment
EP3613040B1 (en) Speaker recognition method and system
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium
CN112309404A (en) Machine voice identification method, device, equipment and storage medium
CN112185347A (en) Language identification method, language identification device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20201208)