CN114547568A - Identity verification method, device and equipment based on voice - Google Patents

Identity verification method, device and equipment based on voice

Info

Publication number
CN114547568A
Authority
CN
China
Prior art keywords
voice data
user
voice
verified
voiceprint
Prior art date
Legal status
Pending
Application number
CN202210122476.7A
Other languages
Chinese (zh)
Inventor
方硕
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202210122476.7A
Publication of CN114547568A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/30: Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F 21/31: User authentication
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Embodiments of this specification disclose a voice-based identity verification method, apparatus, and device. The scheme may include: after obtaining an identity verification request for a target user, judging whether the voice data to be verified carried in the request meets a preset voice data quality condition; if so, generating a to-be-verified voiceprint fusion feature from the voice data to be verified, and comparing it with a pre-stored reference voiceprint fusion feature of the target user to generate an identity verification result for the target user.

Description

Identity verification method, device and equipment based on voice
Technical Field
The present application relates to the field of identity verification technologies, and in particular, to a voice-based identity authentication method, apparatus, and device.
Background
With the advent of the information age, more and more scenarios require verifying a user's identity to ensure that the identity the user claims is real rather than fabricated, thereby protecting the user's rights and interests and keeping services running safely and stably. Existing identity verification methods are varied, for example methods based on identity documents (ID cards), on face recognition technology, or on voice recognition. However, in current voice-based user verification, as long as the audio file provided by the user contains voiceprint features, those features are used to generate the verification result, without considering how interference information contained in the audio file affects that result.
Therefore, how to improve the accuracy of voice-based identity verification results has become a pressing technical problem.
Disclosure of Invention
The voice-based identity verification method, apparatus, and device provided by the embodiments of this specification can improve the accuracy of identity verification results generated from voice.
In order to solve the above technical problem, the embodiments of the present specification are implemented as follows:
an identity authentication method based on voice provided by the embodiments of the present specification includes:
acquiring an identity authentication request aiming at a target user, wherein the identity authentication request carries voice data to be authenticated;
judging whether the voice data to be verified meets a preset voice data quality condition or not to obtain a first judgment result;
if the first judgment result shows that the voice data to be verified meets the preset voice data quality condition, generating a voiceprint fusion feature to be verified according to the voice data to be verified;
and comparing the voiceprint fusion features to be verified with prestored reference voiceprint fusion features of the target user to obtain an identity verification result aiming at the target user.
An identity authentication device based on voice provided by the embodiments of this specification includes:
a first acquisition module, configured to acquire an identity authentication request for a target user, where the identity authentication request carries voice data to be verified;
the judging module is used for judging whether the voice data to be verified meets a preset voice data quality condition or not to obtain a first judging result;
the first generation module is used for generating a voiceprint fusion feature to be verified according to the voice data to be verified if the first judgment result shows that the voice data to be verified meets a preset voice data quality condition;
and the comparison module is used for comparing the voiceprint fusion features to be verified with prestored reference voiceprint fusion features of the target user to obtain an identity verification result aiming at the target user.
An identity authentication device based on voice provided by the embodiments of this specification includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquiring an identity authentication request aiming at a target user, wherein the identity authentication request carries voice data to be authenticated;
judging whether the voice data to be verified meets a preset voice data quality condition or not to obtain a first judgment result;
if the first judgment result shows that the voice data to be verified meets the preset voice data quality condition, generating a voiceprint fusion feature to be verified according to the voice data to be verified;
and comparing the voiceprint fusion features to be verified with prestored reference voiceprint fusion features of the target user to obtain an identity verification result aiming at the target user.
At least one embodiment provided in the present specification can achieve the following advantageous effects:
after an identity authentication request for a target user is obtained, whether the voice data to be verified carried in the request meets a preset voice data quality condition is judged; if so, this indicates that an identity verification result generated from the voice data to be verified can be reasonably accurate, so a to-be-verified voiceprint fusion feature of good accuracy can be generated from that voice data; this feature is then compared with the pre-stored reference voiceprint fusion feature of the target user, which improves the accuracy of the resulting user identity verification result.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments described in the present application, and that those skilled in the art can obtain other drawings from them without any creative effort.
Fig. 1 is a schematic view of a scenario of a voice-based authentication method provided in an embodiment of the present specification;
fig. 2 is a schematic flowchart of a voice-based authentication method provided in an embodiment of the present disclosure;
FIG. 3 is a schematic swimlane flow chart of a voice-based authentication method according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a voice-based authentication apparatus corresponding to fig. 2 provided in an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a voice-based authentication device corresponding to fig. 2 provided in an embodiment of this specification.
Detailed Description
To make the objects, technical solutions and advantages of one or more embodiments of the present disclosure more apparent, the technical solutions of one or more embodiments of the present disclosure will be described in detail and completely with reference to the specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present specification, and not all embodiments. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without making any creative effort fall within the scope of protection of one or more embodiments of the present specification.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
In service industries that operate on a network order-dispatching platform, such as express delivery, food delivery, designated driving, ride-hailing, and housekeeping, platform users (such as couriers and delivery riders) come into contact with a large amount of third-party personal information, so verifying these users' identities is essential to keeping that information secure.
In the prior art, identity verification is usually completed by randomly performing face recognition on couriers, delivery riders, and similar users several times; this verification method disturbs users and reduces their work efficiency. An imperceptible, real-time, all-weather identity verification method is therefore needed.
Since couriers and delivery riders often need to communicate with customers by voice during their work, voice-based identity verification methods are increasingly popular. However, existing voice-based verification schemes often do not consider that interference information contained in the audio file affects the user verification result, so the accuracy and credibility of the generated verification result cannot be guaranteed.
In order to solve the defects in the prior art, the scheme provides the following embodiments:
fig. 1 is a schematic view of a scenario of a voice-based authentication method provided in an embodiment of this specification. As shown in fig. 1, the terminal device 101 bound by the target user may be in communication connection with the authentication server 102 and the terminal device 103 of the third party, respectively. The target user and the third party can use the terminal device 101 to perform a telephone call with the terminal device 103 or send a voice chat message for communication, and an audio file formed in the communication process of the target user and the third party can be used as voice data to be verified and used for verifying the identity of the target user.
Specifically, the terminal device 101 bound to the target user may send an identity verification request carrying voice data to be verified to the identity verification server 102 through a client or a server (not shown in fig. 1) of a specified application. The identity verification server 102 judges whether the voice data to be verified meets a preset voice data quality condition; if so, it can generate a to-be-verified voiceprint fusion feature of good accuracy based on the voice data to be verified, and then compare that feature with the pre-stored reference voiceprint fusion feature of the target user to generate an identity verification result of good accuracy and credibility for the target user.
Next, a speech-based authentication method provided in an embodiment of the specification will be specifically described with reference to the accompanying drawings:
fig. 2 is a flowchart of a voice-based authentication method according to an embodiment of the present disclosure. From the program perspective, the execution subject of the process may be a terminal device or an authentication server bound by a target user, or an application program loaded on the terminal device or the authentication server bound by the target user. As shown in fig. 2, the process may include the following steps:
step 202: acquiring an identity authentication request aiming at a target user, wherein the identity authentication request carries voice data to be authenticated.
In the embodiment of the present specification, when authentication needs to be performed for a target user, an authentication request may be generated based on voice data to be authenticated, which is collected by a terminal device bound to the target user within a preset time period. The terminal device bound with the target user may refer to a terminal device logged in a registered account of the target user, or may refer to a terminal device which is pre-designated by the target user and has a target unique device identifier. The preset time period may be set according to actual requirements, for example, the last half hour, the last 5 minutes, and the like, and is not particularly limited.
In practical applications, the authentication request may include, in addition to the voice data to be authenticated, the identity information of the target user, so as to verify whether the holder of the terminal device bound to the target user is the target user through the voice data to be authenticated.
In practical applications, the voice data to be verified may be extracted from a voice call recording or a voice chat message on the terminal device bound to the target user, and should contain only the voice of the holder of that terminal device. Specifically, for a voice call recording, the uplink and downlink call channels can be separated, so the voice data to be verified can be obtained by extracting the speech in the specified channel; alternatively, single-channel audio that contains both the user's and the third party's voice can be processed with a channel separation algorithm to obtain the voice data to be verified. For a voice chat message, the voice data to be verified can be obtained simply by extracting the audio data corresponding to the voice chat messages sent by the terminal device bound to the target user.
Step 204: and judging whether the voice data to be verified meets a preset voice data quality condition or not to obtain a first judgment result.
In the embodiments of this specification, some voice data may contain a large amount of interference information; an identity verification result generated from such voice data would not be sufficiently accurate, so such data is not suitable for identity verification. Based on this, a preset voice data quality condition can be used to check whether the voice data to be verified is suitable for the identity recognition process. Specifically, the preset voice data quality condition may check and evaluate the voice data to be verified in terms of voice duration, sampling rate, sampling precision, voice clarity, whether the voice is live voice, and the like.
Step 206: and if the first judgment result shows that the voice data to be verified meets the preset voice data quality condition, generating the voiceprint fusion feature to be verified according to the voice data to be verified.
In this embodiment of the present specification, if the first determination result indicates that the voice data to be verified satisfies the preset voice data quality condition, it may generally indicate that an authentication result with better accuracy can be generated based on the voice data to be verified, so as to allow an authentication result to be generated according to the voice data to be verified.
To facilitate an understanding of the present invention, a brief introduction to voiceprint recognition is given first. Each person's vocal organs (such as the tongue, teeth, larynx, lungs, and nasal cavity) differ greatly in size and shape, and each person's speaking habits also differ greatly, so different individuals can be distinguished by voiceprint recognition. Voiceprint recognition compares a voiceprint fusion feature with a pre-stored reference voiceprint fusion feature to confirm whether they belong to the same user, thereby implementing the identity verification function.
The voiceprint fusion features may refer to feature vectors that can characterize a speaker's particular organ structure or behavior habits. Specifically, Mel-frequency cepstral coefficients (MFCCs) of the voice data to be verified, Mel-scale Filter Bank features (fbanks), and the like can be extracted to generate the voiceprint fusion features to be verified.
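For illustration only (this code is not part of the patent disclosure), the feature extraction just described might be sketched as follows; librosa as the toolkit, the sample rate, frame parameters, feature dimensions, and the time-averaging fusion are all assumptions.

```python
# Illustrative sketch: extract MFCC and log-Mel filter-bank (FBank) features and
# fuse them into one fixed-length vector. Library choice and all parameters
# (16 kHz, 25 ms window, 10 ms hop, 20 MFCCs, 40 mels) are assumptions.
import librosa
import numpy as np

def extract_voiceprint_fusion_feature(wav_path: str, sr: int = 16000) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)                      # mono waveform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=400, hop_length=160)      # (20, frames)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40,
                                         n_fft=400, hop_length=160)
    fbank = librosa.power_to_db(mel)                            # (40, frames)
    # Naive fusion: concatenate per-frame features, then average over time.
    return np.concatenate([mfcc, fbank], axis=0).mean(axis=1)   # (60,)
```

A production system would more likely feed frame-level features into a speaker-embedding model; the simple time average above only keeps the sketch short.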
The pre-stored reference voiceprint fusion feature may refer to a voiceprint fusion feature of the target user, which is pre-stored as a comparison reference. The generation principle of the pre-stored reference voiceprint fusion feature and the to-be-verified voiceprint fusion feature may be the same.
In this embodiment of the present specification, if the first determination result indicates that the voice data to be verified does not satisfy the preset voice data quality condition, it may generally indicate that a better-accuracy authentication result cannot be generated based on the voice data to be verified, so that the process may jump to an end step to terminate the authentication process, so as to ensure the accuracy of the authentication result.
Step 208: and comparing the voiceprint fusion features to be verified with prestored reference voiceprint fusion features of the target user to obtain an identity verification result aiming at the target user.
The identity authentication result may be used to reflect whether the voice data to be authenticated carried in the identity authentication request in step 202 belongs to the target user, and may further reflect whether the terminal device holder bound to the target user is the target user.
In practical application, under a normal condition, if the authentication result indicates that the voice data to be authenticated carried in the authentication request belongs to data generated based on the voice sent by the target user, the authentication can be determined to pass; otherwise, the authentication fails.
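The patent states only that the two fusion features are "compared"; as an assumption, a common way to do this is cosine similarity against a decision threshold, sketched below.

```python
# Illustrative comparison of the to-be-verified fusion feature with the stored
# reference fusion feature. Cosine similarity and the 0.7 threshold are assumptions.
import numpy as np

def verify_identity(probe: np.ndarray, reference: np.ndarray,
                    threshold: float = 0.7) -> bool:
    cos = float(np.dot(probe, reference) /
                (np.linalg.norm(probe) * np.linalg.norm(reference) + 1e-12))
    return cos >= threshold          # True: identity verification passes
```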
In the method in fig. 2, after an authentication request for a target user is obtained, it is determined whether voice data to be authenticated carried in the authentication request meets a preset voice data quality condition, and if so, it may indicate that an authentication result generated based on the voice data to be authenticated is good in accuracy, so that a voiceprint fusion feature to be authenticated with good accuracy is generated according to the voice data to be authenticated; and then the voiceprint fusion characteristics to be verified with better accuracy are compared with the prestored reference voiceprint fusion characteristics of the target user, so that the accuracy of the obtained user identity verification result is improved.
Based on the method in fig. 2, some specific embodiments of the method are also provided in the examples of this specification, which are described below.
In practical applications, the voice data to be verified may contain only silence and no valid speech that can be used for verification. In this case, an accurate identity verification result cannot be generated from it, so silence data needs to be screened out in advance.
Based on this, step 204: before the determining whether the voice data to be verified meets the preset voice data quality condition, the method may further include:
and carrying out silence detection on the voice data to be verified to obtain a silence detection result.
Correspondingly, step 204: judging whether the voice data to be verified meets a preset voice data quality condition or not, specifically, the judging may include:
and if the silence detection result shows that the voice data to be verified does not belong to silence data, judging whether the voice data to be verified meets the preset voice data quality condition or not.
In this embodiment of the present specification, the silence detection may be used to detect whether voice data to be verified is silence data. Specifically, the silence detection can be realized by a gaussian mixture model method, a threshold discrimination algorithm, a model matching algorithm, a high-order statistical method and the like.
In the embodiments of this specification, after silence detection is performed on the voice data to be verified to obtain a silence detection result: if the voice data to be verified is determined to be silence data, that is, it contains no speech, it normally cannot be used for identity verification, so the process can jump to the end step and the verification is terminated; conversely, if the voice data to be verified is determined not to be silence data, whether it meets the preset voice data quality condition can be further judged.
In the embodiments of this specification, terminating verification for voice data that turns out to be silence after silence detection not only helps ensure the accuracy of the verification result, but also avoids wasting device resources on the subsequent verification steps.
In the embodiment of the present specification, a silence detection model based on a gaussian classification model is further provided.
Specifically, the performing silence detection on the voice data to be verified to obtain a silence detection result may include:
and extracting the frequency spectrum characteristic of the voice data to be verified to obtain the target frequency spectrum characteristic data of the voice data to be verified.
Inputting the target frequency spectrum characteristic data into a silence detection model to obtain a silence detection result output by the silence detection model; the silence detection model is obtained by training a gaussian classification model by using a spectral feature data sample carrying a first classification label in advance, the spectral feature data sample is feature data obtained by performing spectral feature extraction on a first voice data sample, and the first classification label can be used for indicating whether the first voice data sample is silence data.
In this embodiment, the first voice data sample may include user voice data pre-labeled with the first classification label. The first classification tag may include a tag for indicating that the first voice data sample is silence data and a tag for indicating that the first voice data sample is non-silence data.
The user voice data used as first voice data samples may be collected by different devices in different scenarios, including but not limited to: voice calls made through a Bluetooth headset while riding or walking outdoors, calls made on a mobile phone in an indoor environment, or voice chat messages sent by users in an instant messaging application. Training the silence detection model with user voice data collected from a variety of scenarios and devices helps improve the accuracy of the trained model.
The silence detection model may be trained as follows: the first voice data samples carrying the first classification labels are fed into a Gaussian classification model, which outputs a predicted classification of whether each sample is silence data, and the parameters of the Gaussian classification model are optimized with the goal of minimizing the difference between the predicted classifications and the first classification labels. When the prediction accuracy of the Gaussian classification model exceeds a preset value, the trained silence detection model can be considered accurate enough, so training can be stopped and the model put into use.
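As a sketch only, one possible reading of the Gaussian classification model is a Gaussian naive Bayes classifier over clip-level spectral features; scikit-learn's GaussianNB, the librosa features, and the label convention below are assumptions, not the patent's exact model.

```python
# Minimal sketch of a supervised Gaussian classifier for silence detection.
import numpy as np
import librosa
from sklearn.naive_bayes import GaussianNB

def spectral_feature(wav_path: str, sr: int = 16000) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)
    log_mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40))
    return log_mel.mean(axis=1)                    # one 40-dim vector per clip

def train_silence_detector(sample_paths, labels):
    # labels: 1 = silence data, 0 = non-silence, one per first voice data sample
    X = np.stack([spectral_feature(p) for p in sample_paths])
    return GaussianNB().fit(X, np.asarray(labels))

def is_silence(detector, wav_path: str) -> bool:
    return bool(detector.predict(spectral_feature(wav_path)[None, :])[0] == 1)
```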
In practical applications, after the silence detection result for the voice data to be verified is generated by using the silence detection model, although the silence detection result indicates that the voice data to be verified does not belong to the silence data, the voice data to be verified usually contains more or less silence segments, and in order to facilitate the subsequent generation of an authentication result with better accuracy based on the voice data to be verified, each silence segment contained in the voice data to be verified needs to be removed.
Based on this, before the determining whether the voice data to be verified meets the preset voice data quality condition, the method may further include:
and performing time-frequency domain feature extraction on the voice data to be verified to obtain time domain feature data and frequency domain feature data of the voice data to be verified.
And extracting user voice data from the voice data to be verified according to the time domain characteristic data and the frequency domain characteristic data to obtain a user voice data set.
Correspondingly, the determining whether the voice data to be verified meets a preset voice data quality condition to obtain a first determination result may specifically include:
and judging whether the user voice data set meets a preset voice data quality condition or not to obtain a second judgment result.
In this embodiment of the present specification, the extracting of the time-frequency domain feature may be a process of analyzing and processing the time domain and the frequency domain of the voice data to be verified. The time domain feature data may include: one or more of a short-time zero-crossing rate, a short-time energy, a short-time average amplitude difference function, and a short-time autocorrelation function. The frequency domain feature data may include: one or more of cepstral distance, frequency variance, spectral entropy.
In this embodiment of the present specification, because there is a large difference between the time-frequency domain characteristics of the silence segment and the voice segment, when extracting the user voice data, the start point and the end point of the voice segment may be identified from the voice data to be verified according to the time-domain characteristic data and the frequency-domain characteristic data of the voice data to be verified, and the user voice data set is obtained by extracting each voice segment in the voice data to be verified as the user voice data. And the segments which are not extracted in the voice data to be verified belong to the silent segments, so that all the silent segments contained in the voice data to be verified can be eliminated.
In practical application, the user voice data set may include several pieces of extracted user voice data. The extraction of the user voice data can be realized by a threshold discrimination algorithm, a model matching algorithm, a Gaussian mixture model method, a high-order statistical method and the like.
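For illustration only, a toy endpoint detector based on the time-domain features named above (short-time energy and zero-crossing rate) is sketched below; the frame sizes and thresholds are assumptions, and a real system would also use the frequency-domain features.

```python
# Illustrative endpoint detection: keep frames with high short-time energy and
# low zero-crossing rate, then collect contiguous voiced runs as speech segments.
import numpy as np

def extract_user_voice_segments(y: np.ndarray, sr: int = 16000,
                                frame_ms: int = 25, hop_ms: int = 10,
                                energy_ratio: float = 0.1, zcr_max: float = 0.25):
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    if len(y) < frame:
        return []
    n = (len(y) - frame) // hop + 1
    frames = np.stack([y[i * hop: i * hop + frame] for i in range(n)])
    energy = (frames ** 2).mean(axis=1)                    # short-time energy
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    voiced = (energy > energy_ratio * energy.max()) & (zcr < zcr_max)
    segments, start = [], None
    for i, v in enumerate(voiced):                         # find start/end points
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start * hop, i * hop + frame)); start = None
    if start is not None:
        segments.append((start * hop, len(y)))
    return [y[s:e] for s, e in segments]                   # the user voice data set
```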
In this embodiment of the present specification, when determining whether the user voice data set satisfies the preset voice data quality condition, the determining may be performed according to whether each piece of user voice data included in the user voice data set satisfies the preset voice data quality condition, so as to generate a final determination result indicating whether the user voice data set satisfies the preset voice data quality condition in combination with a determination result corresponding to each piece of user voice data. Or, the user voice data in the user voice data set may be spliced to obtain a piece of comprehensive user voice data, so as to generate a determination result indicating whether the user voice data set satisfies a preset voice data quality condition according to whether the comprehensive user voice data satisfies the preset voice data quality condition, which is not specifically limited.
In the embodiment of the present specification, the user voice data is extracted from the voice data to be verified, so that each silent segment included in the voice data to be verified can be removed, and the accuracy of the generated authentication result can be improved based on the user voice data set not including the silent segment.
In practical applications, if the user's voice duration is too short, the generated to-be-verified voiceprint fusion feature may not be stable enough, which affects the accuracy of the identity verification result generated from that feature.
Based on this, the preset voice data quality condition may include: the user voice duration is greater than or equal to a first threshold.
Correspondingly, the determining whether the user voice data set meets a preset voice data quality condition may specifically include:
and judging whether the total voice duration of the user voice data set is greater than or equal to the first threshold value or not.
In this embodiment of the present specification, the user voice duration may refer to a sum of durations of user voice data included in the user voice data set, that is, a duration that can be used to generate an effective voice with a voiceprint fusion feature to be verified. The selection of the first threshold is related to the algorithm and the accuracy requirement of voiceprint recognition, and may be set according to the actual situation, which is not specifically limited herein.
In the embodiment of the present specification, if the total voice duration of the user voice data set is less than the first threshold, it may generally indicate that the valid voice duration of the user is too short, and when a user authentication result is generated according to a voiceprint fusion feature to be authenticated generated by the user voice data set, the accuracy of the user authentication result is generally poor, so that the user voice data set cannot meet the requirement of authentication, and therefore, the end step may be skipped, thereby terminating the authentication process; on the contrary, if the total voice duration of the user voice data set is greater than or equal to the first threshold, the voiceprint fusion feature to be verified with better accuracy can be generated according to the user voice data set, so that the accuracy of a user identity verification result generated based on the voiceprint fusion feature to be verified subsequently can be improved.
In practical applications, a user may communicate with a third party in a noisy environment, or the sound collection capability of a terminal device used by the user may be poor, so that the user voice data extracted from the voice data to be verified may have high noise and poor quality.
Based on this, the preset voice data quality condition may include: the user voice quality score is greater than or equal to a second threshold.
Correspondingly, the determining whether the user voice data set meets a preset voice data quality condition may specifically include:
performing voice quality analysis on the user voice data set through a voice quality analysis model to obtain a voice quality score of the user voice data set; the voice quality analysis model is obtained by training a deep learning model by utilizing a second voice data sample carrying a voice quality score label in advance, and the voice quality score label can be used for expressing a preset voice quality score of the second voice data sample.
And judging whether the voice quality score of the user voice data set is greater than or equal to the second threshold value.
In this embodiment, the second voice data samples may include user voice data labeled with a voice quality score. The voice quality score label can be generated according to one or more parameters such as signal-to-noise ratio, segmental signal-to-noise ratio, PESQ (Perceptual Evaluation of Speech Quality), log-likelihood ratio measure, log spectral distance, short-time objective intelligibility, weighted spectral slope measure, perceptual objective speech quality assessment, sampling rate, and sampling precision.
The user voice data used as second voice data samples may be collected by different devices in different scenarios, including but not limited to: voice calls made through a Bluetooth headset while riding or driving outdoors, or calls made on a mobile phone in an indoor environment. The second voice data samples may also be speech data that contains no silence segments.
The voice quality analysis model may be trained as follows: second voice data samples carrying voice quality score labels are fed into a deep learning model, whose output is a predicted voice quality score for each sample, and the model is optimized iteratively based on a loss function. When the error between the predicted voice quality score of a second voice data sample and its voice quality score label is smaller than a preset value, the trained voice quality analysis model can be considered accurate enough, so training can be stopped and the model put into use.
In this embodiment of the present specification, when performing voice quality analysis on the user voice data set through the voice quality analysis model, the voice quality analysis model may be used to perform voice quality analysis on each piece of user voice data in the user voice data set to obtain a voice quality score of each piece of user voice data, and further calculate (for example, an average value, a weighted average value, and the like) the voice quality score of each piece of user voice data to obtain the voice quality score of the user voice data set. Or, the user voice data in the user voice data set may be spliced to obtain a piece of comprehensive user voice data, so that the voice quality score of the comprehensive user voice data is generated by using the voice quality analysis model to obtain the voice quality score of the user voice data set, which is not specifically limited.
In the embodiment of the present specification, a trained voice quality analysis model is used to generate a voice quality score of the user voice data set, and if the voice quality score is smaller than a second threshold, even if a voiceprint fusion feature to be verified is generated according to the user voice data set, the accuracy of the voiceprint fusion feature to be verified is poor, so that the requirement of voiceprint recognition cannot be met, and therefore, the process can be directly skipped to the end step to terminate the identity verification process; on the contrary, if the voice quality score is greater than or equal to the second threshold, the voiceprint fusion feature to be verified with better accuracy can be further generated according to the user voice data set, so that the accuracy of a user identity verification result generated based on the voiceprint fusion feature to be verified subsequently can be improved.
In the embodiment of the present specification, the trained voice quality analysis model is used to generate the voice quality score of the user voice data set, which is beneficial to improving the accuracy of the obtained voice quality score and further beneficial to ensuring the accuracy of the generated authentication result.
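As a sketch of the aggregation just described (and not the patent's exact procedure), per-segment quality scores could be combined by a duration-weighted average and checked against the second threshold; the `quality_model.score` interface, the weighting scheme, and the threshold value are all hypothetical.

```python
# Illustrative aggregation of per-segment voice quality scores for the user
# voice data set; weighting, threshold, and scoring interface are assumptions.
import numpy as np

SECOND_THRESHOLD = 3.0   # assumed value of the "second threshold"

def voice_quality_score(segments, sr, quality_model) -> float:
    scores = np.array([quality_model.score(seg, sr) for seg in segments])  # hypothetical API
    durations = np.array([len(seg) / sr for seg in segments])
    return float(np.average(scores, weights=durations))

def meets_quality_score_condition(segments, sr, quality_model) -> bool:
    return voice_quality_score(segments, sr, quality_model) >= SECOND_THRESHOLD
```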
In practical applications, the holder of the terminal device bound to the target user may attempt to cheat during verification, for example by playing back a recording, splicing waveforms, or synthesizing speech. Therefore, to ensure the accuracy of the generated identity verification result, live-voice detection needs to be performed on the user voice data set extracted from the voice data to be verified.
Based on this, the preset voice data quality condition may include: the user voice belongs to living body voice.
The determining whether the user voice data set meets a preset voice data quality condition may further include:
performing living voice detection on the user voice data set through a living voice detection model to obtain a living voice detection result of the user voice data set; the living voice detection model is obtained by training a deep learning model by utilizing a third voice data sample carrying a second classification label in advance, and the second classification label can be used for indicating whether the third voice data sample belongs to living voice.
And judging whether the living voice detection result indicates that the user voice data set belongs to living voice.
In this embodiment, the third voice data sample may include user voice data pre-labeled with the second classification label. The second classification label can be used for indicating whether the third voice data sample belongs to living voice, and for non-living voice, a specific form of fake sound attack can be further labeled.
The user voice data as the third voice data sample may include user voice data (i.e., living voice data samples) acquired by different devices in different scenes, and may also include voice data (i.e., non-living voice data samples) formed after performing processing such as playback, waveform concatenation, voice synthesis, voice emulation, and the like on the above-mentioned directly acquired user voice data.
The training process for the living body voice detection model is similar to the training process for the silence detection model, and details are not repeated here.
In the embodiments of this specification, live-voice detection is performed with a trained live-voice detection model to judge whether the user voice contained in the voice data to be verified comes from a living speaker. If it does not, this may indicate cheating or an attack, so the process can jump directly to the end step and terminate the identity verification, which effectively resists fake-voice attacks such as recording playback, waveform splicing, and speech synthesis. Conversely, if the voice is live voice, an identity verification result of good accuracy can be further generated from the user voice data set extracted from the voice data to be verified.
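For illustration only, a live-voice (anti-spoofing) detector of the kind described could be a small neural classifier over log-Mel spectrograms; the architecture, input shape, and label convention below are assumptions and not the patent's model.

```python
# Illustrative PyTorch sketch: a tiny CNN that classifies a log-Mel spectrogram
# as spoofed (0) or live (1). Untrained here; shown only to make the idea concrete.
import torch
import torch.nn as nn

class LiveVoiceDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 2),                      # logits: [spoofed, live]
        )

    def forward(self, log_mel: torch.Tensor) -> torch.Tensor:
        return self.net(log_mel)                   # log_mel: (batch, 1, mels, frames)

model = LiveVoiceDetector()
dummy_batch = torch.randn(4, 1, 64, 200)           # 4 segments of log-Mel features
is_live = model(dummy_batch).argmax(dim=1) == 1     # True where predicted live
```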
As mentioned above, the user voice data set extracted from the voice data to be verified generally contains no silence segments, only voice segments. However, besides the voice of the holder of the terminal device bound to the target user, it may also contain the voice of other people speaking in the holder's environment. Therefore, the utterances of different speakers in the user voice data set need to be separated to ensure the accuracy of the subsequently generated identity verification result.
Based on this, after the determining whether the user voice data set meets the preset voice data quality condition and obtaining the second determination result, the method may further include:
and if the second judgment result shows that the user voice data set meets the preset voice data quality condition, performing voiceprint feature extraction on each piece of user voice data contained in the user voice data set to obtain voiceprint feature data of each piece of user voice data.
Dividing each piece of user voice data according to the voiceprint feature data of each piece of user voice data to obtain at least one user voice data subset; the user voice data contained in each user voice data subset belongs to the same user, and the user voice data contained in different user voice data subsets belongs to different users.
Judging whether the amount of voice data in a target voice data subset is larger than a third threshold, to obtain a third judgment result; the target voice data subset is the user voice data subset with the largest amount of voice data.
Correspondingly, the generating the voiceprint fusion features to be verified according to the voice data to be verified may specifically include:
and if the third judgment result shows that the voice data volume of the target voice data subset is larger than a third threshold value, generating the voiceprint fusion feature to be verified according to the voiceprint feature data of the user voice data contained in the target voice data subset.
In the embodiments of this specification, the voiceprint features may include Mel-frequency cepstral coefficients (MFCCs), Mel-scale Filter Bank features (FBanks), and the like.
Dividing the user voice data means assigning user voice data belonging to the same user to the same user voice data subset and assigning user voice data of different users to different subsets, so that each user voice data subset corresponds to one user. In practical applications, the division may be implemented by a speaker clustering technique, or by consistency comparison of the voiceprint feature data of the pieces of user voice data, which is not specifically limited.
Generally speaking, the voice data of the holder of the terminal device bound by the target user has the largest data volume in the user voice data set, and the voice data of other persons mixed in the user voice data set has the smaller data volume. Therefore, the user voice data subset with the largest voice data amount can be considered to be a set of voices of the holder, and the user voice data subset with the largest voice data amount can be used as a target voice data subset to generate the voiceprint fusion feature to be verified. In practical applications, the maximum voice data amount may refer to the longest voice duration or the largest number of voice entries, which is not limited specifically.
Similarly, in order to avoid the problem of poor accuracy of the identity verification result caused by the fact that the effective voice data volume is too small, the voiceprint fusion feature to be verified can be generated according to the voiceprint feature data of the user voice data contained in the target voice data subset after the voice data volume of the target voice data subset is determined to be larger than the third threshold. And if the voice data volume of the target voice data subset is smaller than the third threshold, skipping to the end and terminating the identity authentication process.
The generating process of the to-be-verified voiceprint fusion feature may be to perform fusion calculation on the voiceprint feature data of the user voice data included in the target voice data subset to generate the to-be-verified voiceprint fusion feature, and the fusion calculation may include averaging, weighted averaging, and the like.
In the embodiment of the specification, according to the voiceprint feature data of each piece of user voice data, dividing the user voice data into user voice data subsets corresponding to speakers one by one; screening out a target voice data subset corresponding to a user to be authenticated (namely a holder of the terminal equipment bound with the target user) from the plurality of user voice data subsets according to the voice data volume of the user voice data subsets; and generating the voiceprint fusion characteristics to be verified according to the target voice data subset, so that the accuracy of the identity verification result is improved.
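As a sketch under stated assumptions (agglomerative clustering on cosine distance, scikit-learn 1.2+ for the `metric` argument, and simple averaging as the fusion calculation), the per-speaker division and fusion described above could look like this.

```python
# Illustrative sketch: cluster segment-level voiceprint features by speaker, keep
# the largest cluster (assumed to be the device holder), and average its features.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def fuse_largest_speaker(embeddings: np.ndarray, durations: np.ndarray,
                         min_total_seconds: float = 10.0):
    # embeddings: (n_segments, dim) voiceprint feature data, one row per segment
    if len(embeddings) == 1:
        labels = np.zeros(1, dtype=int)
    else:
        labels = AgglomerativeClustering(
            n_clusters=None, distance_threshold=0.6,
            metric="cosine", linkage="average").fit_predict(embeddings)
    largest = max(set(labels), key=lambda l: durations[labels == l].sum())
    if durations[labels == largest].sum() <= min_total_seconds:  # "third threshold"
        return None                      # too little data: terminate verification
    return embeddings[labels == largest].mean(axis=0)            # fused voiceprint
```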
In this embodiment of the present specification, the data format of the voice data to be verified carried in the authentication request may not be a preset format, or the voice data to be verified may further include voice data of a third party communicating with the user, and therefore, before performing the authentication by using the voice data to be verified, the voice data to be verified may also need to be preprocessed.
Based on this, the identity authentication request carries the audio file to be processed containing the voice data to be authenticated; the audio file to be processed is a file obtained by acquiring audio with the terminal device bound with the target user in the user call process.
Correspondingly, before the determining whether the voice data to be verified meets the preset voice data quality condition, the method may further include:
and judging whether the format type of the audio file to be processed belongs to a preset format type or not, and obtaining a fourth judgment result.
And if the fourth judgment result shows that the format type of the audio file to be processed does not belong to a preset format type, performing audio decoding processing on the audio file to be processed to obtain the audio file of the preset format type.
And carrying out channel separation processing on the audio file with the preset format type to obtain the voice data to be verified.
In this embodiment of the present specification, the audio decoding may be configured to convert an audio file to be processed, which does not belong to a preset format type, into an audio file belonging to a preset format type.
The channel separation can be used for processing a single-channel/multi-channel audio file containing voice data of a user to be authenticated and a third party to obtain voice data only containing the user to be authenticated, and the voice data is used as the voice data to be authenticated.
In practical applications, if the format type of the audio file to be processed is determined to belong to the preset format type, the process of audio decoding can be skipped directly, and channel separation processing is performed. Of course, if the audio file to be processed does not include the third-party voice data communicated with the user, the channel separation process may also be skipped, which is not described herein again.
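For illustration only, the decoding and channel-separation preprocessing could be sketched with pydub (which delegates decoding to ffmpeg); the target format (16 kHz, 16-bit WAV) and the assumption that the user's voice sits on the uplink channel are both illustrative.

```python
# Illustrative preprocessing sketch: decode an arbitrary audio file to a preset
# format and split a two-channel call recording into uplink/downlink channels.
from pydub import AudioSegment

def preprocess_audio(audio_path: str, out_path: str = "to_be_verified.wav") -> str:
    audio = AudioSegment.from_file(audio_path)          # decodes mp3/amr/m4a/...
    audio = audio.set_frame_rate(16000).set_sample_width(2)
    if audio.channels == 2:
        uplink, downlink = audio.split_to_mono()         # one speaker per channel
        user_audio = uplink                              # assumed: user on uplink
    else:
        user_audio = audio       # single channel: a separation model would be needed
    user_audio.export(out_path, format="wav")
    return out_path
```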
In this embodiment of the present specification, since the voice-based identity authentication needs to be implemented based on the pre-stored reference voiceprint fusion feature of the target user, the user is further required to perform registration before executing the method in fig. 2, so as to generate and store the reference voiceprint fusion feature of the target user. It is worth noting that the generation principle and the processing process of the pre-stored reference voiceprint fusion feature of the target user and the voiceprint fusion feature to be verified can be the same, so that the accuracy of the pre-stored reference voiceprint fusion feature of the target user can be guaranteed, and the accuracy of the voice-based identity verification result can be improved.
Specifically, step 208: before comparing the to-be-verified voiceprint fusion feature with the pre-stored reference voiceprint fusion feature of the target user, the method may further include:
and acquiring a user voice data sample set of the target user meeting the preset voice data quality condition.
And carrying out voiceprint feature extraction on each user voice data sample contained in the user voice data sample set to obtain a voiceprint feature data sample of each user voice data sample.
Dividing each user voice sample according to the voiceprint characteristic data sample of each user voice data sample to obtain at least one user voice data sample subset; the user voice data samples contained in each user voice data sample subset belong to the same user, and the user voice data samples contained in different user voice data sample subsets belong to different users.
Judging whether the amount of voice data in a target voice data sample subset is larger than a fourth threshold, to obtain a fourth judgment result; the target voice data sample subset is the user voice data sample subset with the largest amount of voice data.
And if the fourth judgment result shows that the voice data volume of the target voice data sample subset is larger than a fourth threshold, generating the reference voiceprint fusion feature of the target user according to the voiceprint feature data sample of the user voice data sample in the target voice data sample subset.
And storing the reference voiceprint fusion characteristics of the target user to obtain the pre-stored reference voiceprint fusion characteristics of the target user.
In the embodiment of the present specification, the process of acquiring the set of user voice data samples satisfying the requirement is substantially the same as the process of acquiring the set of user voice data mentioned above. Specifically, to-be-registered voice data submitted in the target user registration process is acquired, and the to-be-registered voice data is subjected to audio decoding, channel separation, silence detection and the like. If the voice data to be registered does not belong to the mute data, time-frequency domain feature extraction can be performed on the voice data to be registered, a user voice data sample is extracted according to the extracted time-domain feature data and frequency-domain feature data, a mute segment is removed, and a user voice data sample set is obtained. And continuously judging whether the user voice data sample set meets a preset voice data quality condition (for example, the user voice time length is greater than or equal to a first threshold, the user voice quality score is greater than or equal to a second threshold, and the user voice belongs to living voice) until the user voice data sample set meeting the preset voice data quality condition is determined.
Similarly, for the user voice data sample set meeting the preset voice data quality condition, according to the difference of speakers, dividing each user voice sample in the user voice data sample set into user voice data sample subsets corresponding to the speakers one by one. And selecting the user voice data sample subset with the largest voice data amount as the target voice data sample subset corresponding to the target user.
And then judging whether the voice data amount of the target voice data sample subset is larger than a fourth threshold value. If the value is less than or equal to the fourth threshold value, skipping to the ending step, thereby terminating the identity registration process; and if the value is larger than the fourth threshold value, generating a reference voiceprint fusion characteristic with better accuracy of the target user according to the target voice data sample subset.
In this embodiment of the present description, although the target speech data sample subset meets the preset speech data quality condition, the quality of each user speech data sample in the target speech data sample subset is still different, so that the reference voiceprint fusion feature of the target user can be generated based on the user speech data samples with better quality in the target speech data sample subset, so as to improve the accuracy of the pre-stored reference voiceprint fusion feature of the target user.
Based on this, the generating a reference voiceprint fusion feature of the target user according to the voiceprint feature data sample of the user voice data sample in the target voice data sample subset may specifically include:
and acquiring the voice quality score of the user voice data sample in the target voice data sample subset.
And sorting the user voice data samples in the target voice data sample subset in descending order of voice quality score to obtain a user voice data sample sequence.
And generating the reference voiceprint fusion characteristics of the target user according to the voiceprint characteristic data samples of the first N user voice data samples in the user voice data sample sequence.
In this embodiment of the present specification, the voice quality score of the user voice data sample in the target voice data sample subset may be obtained by performing voice quality analysis on the user voice data sample through a voice quality analysis model.
In this embodiment of the present specification, the generating of the reference voiceprint fusion feature may refer to performing fusion calculation on the voiceprint feature data samples of the first N user voice data samples in the user voice data sample sequence to generate the reference voiceprint fusion feature of the target user, where the fusion calculation may include averaging, weighted averaging, and the like. Wherein, N may be defined according to actual requirements, which is not specifically limited.
In the embodiment of the present specification, the voiceprint feature data samples of the first N user voice data samples with higher voice quality scores are selected to generate the reference voiceprint fusion feature of the target user, which is beneficial to ensuring the accuracy of the reference voiceprint fusion feature of the target user, and is further beneficial to improving the accuracy of the obtained user authentication result.
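As a sketch of the registration-time fusion just described, the N highest-scoring samples could be selected and their voiceprint features averaged; the value of N, the scoring source, and plain averaging (rather than a weighted scheme) are assumptions.

```python
# Illustrative sketch: build the reference voiceprint fusion feature from the
# top-N user voice data samples ranked by voice quality score.
import numpy as np

def build_reference_voiceprint(features, quality_scores, n: int = 5) -> np.ndarray:
    # features: list of per-sample voiceprint feature vectors (equal length)
    # quality_scores: matching scores from the voice quality analysis model
    order = np.argsort(quality_scores)[::-1]        # descending quality
    top = [features[i] for i in order[:n]]
    return np.mean(np.stack(top), axis=0)           # stored as the reference feature
```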
FIG. 3 is a schematic swim-lane flow chart of a voice-based authentication method corresponding to FIG. 2 according to an embodiment of the present disclosure. Fig. 3 is explained using, as an example, a scenario in which the voice data to be verified is uploaded to an authentication server for verification; the execution subjects of the flow shown in fig. 3 may include the terminal device bound to the target user, the authentication server, and the like.
As shown in fig. 3, in the preprocessing stage, the terminal device bound to the target user may extract phone call recordings or voice chat messages generated within a preset time period to obtain a to-be-processed audio file containing the voice data to be verified, and may then generate an authentication request for the target user carrying the to-be-processed audio file and send it to the authentication server.
After the authentication server obtains the authentication request for the target user, it may determine whether the format type of the to-be-processed audio file belongs to a preset format type. If it does not, audio decoding processing may be performed on the to-be-processed audio file to obtain an audio file of the preset format type; if it does, the audio decoding processing may be skipped. The authentication server may then perform channel separation processing on the audio file of the preset format type to obtain the voice data to be verified. Of course, if the audio file of the preset format type does not contain third-party voice data, the channel separation step can also be skipped and the voice data to be verified obtained directly.
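As a hedged illustration of the decoding and channel-separation steps, the sketch below uses the pydub library and assumes the preset format type is 16 kHz, 16-bit audio and that the first channel of a call recording carries the target user's voice; none of these choices are prescribed by the specification.

```python
from pydub import AudioSegment  # assumed decoder; relies on ffmpeg being installed

PRESET_FORMAT_TYPES = {"wav"}  # hypothetical preset format type

def decode_and_separate(path: str, file_ext: str) -> AudioSegment:
    """Decode a to-be-processed audio file and separate its channels,
    keeping the channel assumed to carry the target user's voice."""
    audio = AudioSegment.from_file(path)  # audio decoding for arbitrary input formats
    if file_ext.lower() not in PRESET_FORMAT_TYPES:
        # Convert to the assumed preset format type: 16 kHz, 16-bit samples.
        audio = audio.set_frame_rate(16000).set_sample_width(2)
    if audio.channels > 1:
        # Channel separation, e.g. for a two-sided call recording.
        channels = audio.split_to_mono()
        return channels[0]  # assumption: channel 0 is the target user's side
    return audio  # single channel: no third-party channel to separate
```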
Next, spectral feature extraction is performed on the voice data to be verified to obtain its target spectral feature data, and the target spectral feature data is input into a silence detection model to perform silence detection on the voice data to be verified. If the voice data to be verified is silence data, the flow jumps to the ending step and the authentication process is terminated. If it is not silence data, time-frequency domain feature extraction is performed on the voice data to be verified to obtain its time-domain feature data and frequency-domain feature data, and user voice data is extracted from the voice data to be verified according to that feature data to obtain a user voice data set with silent segments removed.
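The sketch below illustrates where the spectral-feature extraction and silence handling sit in the flow; it substitutes a simple energy-based check for the trained silence detection model and uses librosa, both of which are assumptions rather than the patented implementation.

```python
import numpy as np
import librosa

def extract_user_speech(waveform: np.ndarray, sr: int = 16000):
    """Extract target spectral feature data, apply a crude silence check, and
    keep only the non-silent user speech segments."""
    # Target spectral feature data: log-Mel spectrogram frames.
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=40)
    log_mel = librosa.power_to_db(mel)

    # Crude whole-utterance silence decision, standing in for the trained
    # silence detection model described in the specification.
    if waveform.size == 0 or float(np.max(np.abs(waveform))) < 1e-3:
        return None, log_mel  # silence data: the verification flow would stop here

    # Time-domain cue: drop silent segments and keep the user speech.
    intervals = librosa.effects.split(waveform, top_db=30)
    user_speech = np.concatenate([waveform[start:end] for start, end in intervals])
    return user_speech, log_mel
```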
The flow then checks whether the user voice data set meets the preset voice data quality condition, which may include a user voice duration condition, a user voice quality score condition, and a living voice condition. The user voice duration condition may mean that the user voice duration is greater than or equal to a first threshold; the user voice quality score condition may mean that the user voice quality score is greater than or equal to a second threshold; and the living voice condition may mean that the user voice belongs to living voice. If the user voice data set does not meet the preset voice data quality condition, the flow jumps to the ending step and the authentication process is terminated; if it does, the voiceprint feature extraction stage may be executed.
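A minimal sketch of the three quality gates follows; the thresholds and the quality-score and liveness inputs are placeholders, since the specification describes the underlying models only abstractly.

```python
def passes_quality_conditions(user_speech, sr, quality_score, is_live,
                              min_seconds=3.0, min_score=0.6):
    """Check the preset voice data quality conditions: duration (first threshold),
    quality score (second threshold), and living voice. Values are placeholders."""
    duration_ok = (len(user_speech) / sr) >= min_seconds
    score_ok = quality_score >= min_score
    return duration_ok and score_ok and is_live
```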
In the voiceprint feature extraction stage, a same-person determination is performed first. Specifically, the authentication server may extract the voiceprint feature data of each piece of user voice data contained in the user voice data set, divide the user voice data according to that voiceprint feature data into at least one user voice data subset, and take the user voice data subset with the largest amount of voice data as the target voice data subset, thereby screening out third-party voice data.
It is then determined whether the voice data amount of the target voice data subset is greater than a fourth threshold. If the voice data amount is less than or equal to the fourth threshold, the flow jumps to the ending step and the authentication process is terminated; if it is greater than the fourth threshold, the voiceprint fusion feature to be verified is generated according to the voiceprint feature data of the user voice data contained in the target voice data subset.
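The same-person determination and data-volume check might look as follows; the greedy cosine-similarity grouping and the numeric thresholds are assumptions made for illustration, as the specification does not prescribe a particular grouping method.

```python
import numpy as np

def largest_speaker_subset(embeddings, sim_threshold=0.75):
    """Greedy same-person grouping of per-segment voiceprint embeddings by
    cosine similarity; returns the subset with the largest amount of voice data."""
    groups = []  # each group holds indices of embeddings judged to be one speaker
    for i, emb in enumerate(embeddings):
        emb = emb / np.linalg.norm(emb)
        for group in groups:
            centroid = np.mean([embeddings[j] for j in group], axis=0)
            centroid = centroid / np.linalg.norm(centroid)
            if float(np.dot(emb, centroid)) >= sim_threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    target = max(groups, key=len)  # target voice data subset: the largest group
    return [embeddings[j] for j in target]

def fuse_to_verify(subset_embeddings, min_count=3):
    """Abort if the subset's data volume is at or below the threshold; otherwise
    fuse it into the voiceprint fusion feature to be verified."""
    if len(subset_embeddings) <= min_count:
        return None  # too little voice data: verification is terminated
    fused = np.mean(np.stack(subset_embeddings), axis=0)
    return fused / np.linalg.norm(fused)
```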
In the feature comparison stage, the authentication server may compare the voiceprint fusion feature to be verified with the pre-stored reference voiceprint fusion feature of the target user to obtain the authentication result for the target user.
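The specification does not name the comparison metric; the sketch below assumes cosine similarity against a decision threshold, which is a common but not authoritative choice.

```python
import numpy as np

def verify_identity(fusion_to_verify: np.ndarray,
                    reference_fusion: np.ndarray,
                    decision_threshold: float = 0.8) -> bool:
    """Compare the voiceprint fusion feature to be verified against the
    pre-stored reference voiceprint fusion feature of the target user."""
    a = fusion_to_verify / np.linalg.norm(fusion_to_verify)
    b = reference_fusion / np.linalg.norm(reference_fusion)
    similarity = float(np.dot(a, b))  # cosine similarity of the two features
    return similarity >= decision_threshold  # True: the identity is verified
```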
The scheme in fig. 3 may further include, before voice-based authentication is performed, a stage in which the target user performs identity registration so as to generate the pre-stored reference voiceprint fusion feature of the target user. The generation principle of the pre-stored reference voiceprint fusion feature may be the same as that of the voiceprint fusion feature to be verified in fig. 3. In the generation process for the target user, however, in addition to the audio decoding, channel separation, silence detection, voice data quality detection, same-person determination, and voiceprint feature fusion shown in fig. 3, the multiple user voice data samples submitted by the target user for identity registration may be sorted by their voice quality scores, and the voiceprint feature data samples of the first N user voice data samples with the higher voice quality scores may be selected to generate the pre-stored reference voiceprint fusion feature of the target user. This further improves the accuracy of the pre-stored reference voiceprint fusion feature of the target user and thus helps ensure the accuracy of the subsequently generated identity verification results.
Based on the same idea, the embodiments of the present specification further provide an apparatus corresponding to the above method. Fig. 4 is a schematic structural diagram of a voice-based authentication apparatus corresponding to fig. 2 provided in an embodiment of the present disclosure. As shown in fig. 4, the apparatus may include:
the first obtaining module 402 may be configured to obtain an authentication request for a target user, where the authentication request carries voice data to be authenticated.
The determining module 404 may be configured to determine whether the voice data to be verified meets a preset voice data quality condition, so as to obtain a first determination result.
The first generating module 406 may be configured to generate a voiceprint fusion feature to be verified according to the voice data to be verified if the first determination result indicates that the voice data to be verified meets a preset voice data quality condition.
The comparison module 408 may be configured to compare the voiceprint fusion feature to be verified with the pre-stored reference voiceprint fusion feature of the target user, so as to obtain an identity verification result for the target user.
Based on the apparatus shown in fig. 4, the embodiments of this specification further provide some specific implementations of that apparatus, which are described below.
Optionally, the apparatus shown in fig. 4 may further include:
and the silence detection module can be used for carrying out silence detection on the voice data to be verified to obtain a silence detection result.
Correspondingly, the determining module 404 may be specifically configured to:
and if the silence detection result shows that the voice data to be verified does not belong to silence data, judging whether the voice data to be verified meets the preset voice data quality condition.
Optionally, the silence detection module may include:
the spectral feature extraction unit may be configured to perform spectral feature extraction on the voice data to be verified to obtain target spectral feature data of the voice data to be verified.
A silence detection result generating unit, configured to input the target frequency spectrum feature data into a silence detection model and obtain a silence detection result output by the silence detection model; the silence detection model is obtained by training a Gaussian classification model in advance with spectral feature data samples carrying a first classification label, the spectral feature data sample being feature data obtained by performing spectral feature extraction on a first voice data sample, and the first classification label being used to indicate whether the first voice data sample is silence data.
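As one hedged realization of training such a Gaussian classification model, the sketch below fits a Gaussian naive Bayes classifier from scikit-learn on spectral feature data samples and their first classification labels; the specific classifier and feature layout are assumptions, not the model prescribed by the specification.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def train_silence_detection_model(spectral_samples: np.ndarray,
                                  first_labels: np.ndarray) -> GaussianNB:
    """Fit a Gaussian classifier on spectral feature data samples; each first
    classification label marks whether the sample is silence data (1) or not (0)."""
    model = GaussianNB()
    model.fit(spectral_samples, first_labels)
    return model

# Usage sketch: rows are frame-level spectral feature vectors, labels come from
# annotated first voice data samples.
# silence_model = train_silence_detection_model(features, labels)
# is_silence = silence_model.predict(new_frames)
```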
Optionally, the apparatus shown in fig. 4 may further include:
and the time-frequency domain feature extraction module can be used for extracting the time-frequency domain features of the voice data to be verified to obtain the time domain feature data and the frequency domain feature data of the voice data to be verified.
And the user voice data extraction module can be used for extracting the user voice data from the voice data to be verified according to the time domain characteristic data and the frequency domain characteristic data to obtain a user voice data set.
Correspondingly, the determining module 404 may include:
the determining unit may be configured to determine whether the user voice data set meets a preset voice data quality condition, so as to obtain a second determination result.
Optionally, the preset voice data quality condition may include: the user voice duration is greater than or equal to a first threshold.
Correspondingly, the determining unit may be specifically configured to:
and judging whether the total voice duration of the user voice data set is greater than or equal to the first threshold value or not.
Optionally, the preset voice data quality condition may include: the user voice quality score is greater than or equal to a second threshold.
Correspondingly, the determining module 404 may further include:
the voice quality analysis unit may be configured to perform voice quality analysis on the user voice data set through a voice quality analysis model to obtain a voice quality score of the user voice data set; the voice quality analysis model is obtained by training a deep learning model by utilizing a second voice data sample carrying a voice quality score label in advance, and the voice quality score label can be used for expressing a preset voice quality score of the second voice data sample.
Correspondingly, the determining unit may be specifically configured to:
and judging whether the voice quality score of the user voice data set is greater than or equal to the second threshold value.
Optionally, the preset voice data quality condition may include: the user voice belongs to living body voice.
Correspondingly, the determining module 404 may further include:
the living voice detection unit can be used for carrying out living voice detection on the user voice data set through a living voice detection model to obtain a living voice detection result of the user voice data set; the living voice detection model is obtained by training a deep learning model by utilizing a third voice data sample carrying a second classification label in advance, and the second classification label can be used for indicating whether the third voice data sample belongs to living voice.
Correspondingly, the determining unit may be specifically configured to:
and judging whether the living voice detection result indicates that the user voice data set belongs to living voice.
Optionally, the apparatus shown in fig. 4 may further include:
the first voiceprint feature extraction module may be configured to, if the second determination result indicates that the user voice data set meets a preset voice data quality condition, perform voiceprint feature extraction on each piece of user voice data included in the user voice data set to obtain voiceprint feature data of each piece of user voice data.
The first dividing module may be configured to divide the user voice data according to voiceprint feature data of the user voice data to obtain at least one user voice data subset; the user voice data contained in each user voice data subset belongs to the same user, and the user voice data contained in different user voice data subsets belongs to different users.
The first voice data volume judging module may be used to judge whether the voice data volume of a target voice data subset is greater than a third threshold, to obtain a third judgment result; the target voice data subset is the user voice data subset with the largest voice data amount.
Correspondingly, the first generating module 406 may specifically be configured to:
and if the third judgment result shows that the voice data volume of the target voice data subset is larger than a third threshold value, generating the voiceprint fusion feature to be verified according to the voiceprint feature data of the user voice data contained in the target voice data subset.
Optionally, the identity authentication request carries a to-be-processed audio file containing the to-be-verified voice data; the to-be-processed audio file is a file obtained through audio acquisition by the terminal device bound to the target user during user communication; the apparatus may further include:
the format type judging module may be configured to judge whether the format type of the audio file to be processed belongs to a preset format type, so as to obtain a fourth judgment result.
The audio decoding module may be configured to perform audio decoding processing on the audio file to be processed to obtain the audio file of the preset format type if the fourth determination result indicates that the format type of the audio file to be processed does not belong to the preset format type.
And the channel separation module can be used for carrying out channel separation processing on the audio file with the preset format type to obtain the voice data to be verified.
Optionally, the apparatus shown in fig. 4 may further include:
the second obtaining module may be configured to obtain a user voice data sample set of the target user, where the user voice data sample set meets the preset voice data quality condition.
And the second voiceprint feature extraction module may be configured to perform voiceprint feature extraction on each user voice data sample included in the user voice data sample set, so as to obtain a voiceprint feature data sample of each user voice data sample.
The second dividing module may be configured to divide the user voice samples according to voiceprint feature data samples of the user voice data samples to obtain at least one user voice data sample subset; the user voice data samples contained in each user voice data sample subset belong to the same user, and the user voice data samples contained in different user voice data sample subsets belong to different users.
The second voice data volume judging module may be configured to judge whether the voice data volume of a target voice data sample subset is greater than a fourth threshold, so as to obtain a fourth judgment result; the target voice data sample subset is the user voice data sample subset with the largest voice data amount.
The second generating module may be configured to, if the fourth determination result indicates that the voice data amount of the target voice data sample subset is greater than a fourth threshold, generate a reference voiceprint fusion feature of the target user according to a voiceprint feature data sample of the user voice data sample in the target voice data sample subset.
And the storage module can be used for storing the reference voiceprint fusion characteristics of the target user to obtain the pre-stored reference voiceprint fusion characteristics of the target user.
Optionally, the second generating module may be specifically configured to:
and acquiring the voice quality score of the user voice data sample in the target voice data sample subset.
And sequencing the user voice data samples in the target voice data sample subset according to the sequence of the voice quality scores from large to small to obtain a user voice data sample sequence.
And generating the reference voiceprint fusion characteristics of the target user according to the voiceprint characteristic data samples of the first N user voice data samples in the user voice data sample sequence.
Based on the same idea, the embodiment of the present specification further provides a device corresponding to the method.
Fig. 5 is a schematic structural diagram of a voice-based authentication device corresponding to fig. 2 provided in an embodiment of this specification. As shown in fig. 5, the device 500 may include:
at least one processor 510; and a memory 530 communicatively coupled to the at least one processor; wherein the memory 530 stores instructions 520 executable by the at least one processor 510 to enable the at least one processor 510 to:
acquiring an identity authentication request aiming at a target user, wherein the identity authentication request carries voice data to be authenticated.
And judging whether the voice data to be verified meets a preset voice data quality condition or not to obtain a first judgment result.
And if the first judgment result shows that the voice data to be verified meets the preset voice data quality condition, generating the voiceprint fusion feature to be verified according to the voice data to be verified.
And comparing the voiceprint fusion features to be verified with prestored reference voiceprint fusion features of the target user to obtain an identity verification result aiming at the target user.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus shown in fig. 5, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
In the 1990s, an improvement in a technology could be clearly distinguished as an improvement in hardware (for example, an improvement in a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement in a method flow). However, as technology has developed, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be implemented with a hardware entity module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs a digital system onto a single PLD by himself, without needing a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, nowadays, instead of manually fabricating integrated circuit chips, this programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must also be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained merely by slightly logically programming the method flow with one of the above hardware description languages and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides implementing a controller purely by computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for implementing various functions may also be regarded as structures within the hardware component. Or, the means for implementing various functions may even be regarded both as software modules for implementing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (which may include, but is not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device may include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory, random access memory (RAM), and/or non-volatile memory in a computer-readable medium, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of other like elements in the process, method, article, or device that includes the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (which may include, but is not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules may include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (23)

1. A voice-based authentication method, comprising:
acquiring an identity authentication request aiming at a target user, wherein the identity authentication request carries voice data to be authenticated;
judging whether the voice data to be verified meets a preset voice data quality condition or not to obtain a first judgment result;
if the first judgment result shows that the voice data to be verified meets the preset voice data quality condition, generating a voiceprint fusion feature to be verified according to the voice data to be verified;
and comparing the voiceprint fusion features to be verified with prestored reference voiceprint fusion features of the target user to obtain an identity verification result aiming at the target user.
2. The method according to claim 1, before determining whether the voice data to be verified meets a preset voice data quality condition, further comprising:
carrying out silence detection on the voice data to be verified to obtain a silence detection result;
the judging whether the voice data to be verified meets the preset voice data quality condition specifically comprises:
and if the silence detection result shows that the voice data to be verified does not belong to silence data, judging whether the voice data to be verified meets the preset voice data quality condition or not.
3. The method according to claim 2, wherein the performing silence detection on the voice data to be verified to obtain a silence detection result specifically includes:
performing spectrum feature extraction on the voice data to be verified to obtain target spectrum feature data of the voice data to be verified;
inputting the target frequency spectrum characteristic data into a silence detection model to obtain a silence detection result output by the silence detection model; the silence detection model is obtained by training a Gaussian mixture classification model by using a spectral feature data sample carrying a first classification label in advance, the spectral feature data sample is feature data obtained by performing spectral feature extraction on a first voice data sample, and the first classification label is used for indicating whether the first voice data sample is silence data.
4. The method according to any one of claims 1 to 3, before the determining whether the voice data to be verified meets a preset voice data quality condition, further comprising:
performing time-frequency domain feature extraction on the voice data to be verified to obtain time domain feature data and frequency domain feature data of the voice data to be verified;
extracting user voice data from the voice data to be verified according to the time domain characteristic data and the frequency domain characteristic data to obtain a user voice data set;
the judging whether the voice data to be verified meets the preset voice data quality condition or not to obtain a first judging result specifically comprises:
and judging whether the user voice data set meets a preset voice data quality condition or not to obtain a second judgment result.
5. The method of claim 4, wherein the preset voice data quality condition comprises: the user voice duration is greater than or equal to a first threshold;
the determining whether the user voice data set meets a preset voice data quality condition specifically includes:
and judging whether the total voice duration of the user voice data set is greater than or equal to the first threshold value or not.
6. The method of claim 4, wherein the preset voice data quality condition comprises: the user voice quality score is greater than or equal to a second threshold;
the determining whether the user voice data set meets a preset voice data quality condition specifically includes:
performing voice quality analysis on the user voice data set through a voice quality analysis model to obtain a voice quality score of the user voice data set; the voice quality analysis model is obtained by training a deep learning model by utilizing a second voice data sample carrying a voice quality score label in advance, wherein the voice quality score label is used for expressing a preset voice quality score of the second voice data sample;
and judging whether the voice quality score of the user voice data set is greater than or equal to the second threshold value.
7. The method of claim 4, wherein the preset voice data quality condition comprises: the user voice belongs to living voice;
the judging whether the user voice data set meets the preset voice data quality condition further comprises:
performing living voice detection on the user voice data set through a living voice detection model to obtain a living voice detection result of the user voice data set; the living body voice detection model is obtained by training a deep learning model by utilizing a third voice data sample carrying a second classification label in advance, wherein the second classification label is used for indicating whether the third voice data sample belongs to living body voice;
and judging whether the living voice detection result indicates that the user voice data set belongs to living voice.
8. The method according to claim 4, wherein said determining whether the user voice data set satisfies a predetermined voice data quality condition, and after obtaining a second determination result, further comprises:
if the second judgment result shows that the user voice data set meets the preset voice data quality condition, performing voiceprint feature extraction on each piece of user voice data contained in the user voice data set to obtain voiceprint feature data of each piece of user voice data;
dividing each piece of user voice data according to the voiceprint feature data of each piece of user voice data to obtain at least one user voice data subset; the user voice data contained in each user voice data subset belongs to the same user, and the user voice data contained in different user voice data subsets belong to different users;
judging whether the voice data volume of a target voice data subset is larger than a third threshold value or not to obtain a third judgment result; wherein the target voice data subset is the user voice data subset with the largest voice data amount;
generating a voiceprint fusion feature to be verified according to the voice data to be verified, which specifically comprises:
and if the third judgment result shows that the voice data volume of the target voice data subset is larger than a third threshold value, generating the voiceprint fusion feature to be verified according to the voiceprint feature data of the user voice data contained in the target voice data subset.
9. The method according to claim 1, wherein the authentication request carries a to-be-processed audio file containing the to-be-authenticated voice data; the audio file to be processed is a file obtained by acquiring audio with the terminal equipment bound with the target user in the process of user communication;
before the step of judging whether the voice data to be verified meets the preset voice data quality condition, the method further comprises the following steps:
judging whether the format type of the audio file to be processed belongs to a preset format type or not, and obtaining a fourth judgment result;
if the fourth judgment result shows that the format type of the audio file to be processed does not belong to a preset format type, performing audio decoding processing on the audio file to be processed to obtain an audio file of the preset format type;
and carrying out channel separation processing on the audio file with the preset format type to obtain the voice data to be verified.
10. The method according to claim 1, before comparing the voiceprint fusion feature to be verified with the pre-stored reference voiceprint fusion feature of the target user, further comprising:
acquiring a user voice data sample set of the target user, wherein the user voice data sample set meets the preset voice data quality condition;
performing voiceprint feature extraction on each user voice data sample contained in the user voice data sample set to obtain a voiceprint feature data sample of each user voice data sample;
dividing each user voice sample according to the voiceprint characteristic data sample of each user voice data sample to obtain at least one user voice data sample subset; the user voice data samples contained in each user voice data sample subset belong to the same user, and the user voice data samples contained in different user voice data sample subsets belong to different users;
judging whether the voice data volume of a target voice data sample subset is larger than a fourth threshold value or not to obtain a fourth judgment result; wherein the target voice data sample subset is the user voice data sample subset with the largest voice data amount;
if the fourth judgment result indicates that the voice data volume of the target voice data sample subset is larger than a fourth threshold value, generating a reference voiceprint fusion characteristic of the target user according to the voiceprint characteristic data sample of the user voice data sample in the target voice data sample subset;
and storing the reference voiceprint fusion characteristics of the target user to obtain the pre-stored reference voiceprint fusion characteristics of the target user.
11. The method according to claim 10, wherein the generating the reference voiceprint fusion feature of the target user according to the voiceprint feature data samples of the user speech data samples in the target speech data sample subset specifically comprises:
acquiring voice quality scores of the user voice data samples in the target voice data sample subset;
sequencing the user voice data samples in the target voice data sample subset according to the sequence of the voice quality scores from large to small to obtain a user voice data sample sequence;
and generating the reference voiceprint fusion characteristics of the target user according to the voiceprint characteristic data samples of the first N user voice data samples in the user voice data sample sequence.
12. A voice-based authentication apparatus comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring an authentication request aiming at a target user, and the authentication request carries voice data to be authenticated;
the judging module is used for judging whether the voice data to be verified meets a preset voice data quality condition or not to obtain a first judging result;
the first generation module is used for generating a voiceprint fusion feature to be verified according to the voice data to be verified if the first judgment result shows that the voice data to be verified meets a preset voice data quality condition;
and the comparison module is used for comparing the voiceprint fusion features to be verified with prestored reference voiceprint fusion features of the target user to obtain an identity verification result aiming at the target user.
13. The apparatus of claim 12, further comprising:
the silence detection module is used for carrying out silence detection on the voice data to be verified to obtain a silence detection result;
the judgment module is specifically configured to:
and if the silence detection result shows that the voice data to be verified does not belong to silence data, judging whether the voice data to be verified meets the preset voice data quality condition.
14. The apparatus of claim 13, the silence detection module comprising:
the spectral feature extraction unit is used for extracting spectral features of the voice data to be verified to obtain target spectral feature data of the voice data to be verified;
a silence detection result generation unit, configured to input the target frequency spectrum feature data into a silence detection model, and obtain a silence detection result output by the silence detection model; the silence detection model is obtained by training a Gaussian mixture classification model by using a spectral feature data sample carrying a first classification label in advance, the spectral feature data sample is feature data obtained by performing spectral feature extraction on a first voice data sample, and the first classification label is used for indicating whether the first voice data sample is silence data.
15. The apparatus of any of claims 12-14, further comprising:
the time-frequency domain feature extraction module is used for extracting time-frequency domain features of the voice data to be verified to obtain time-domain feature data and frequency-domain feature data of the voice data to be verified;
the user voice data extraction module is used for extracting user voice data from the voice data to be verified according to the time domain characteristic data and the frequency domain characteristic data to obtain a user voice data set;
the judging module comprises:
and the judging unit is used for judging whether the user voice data set meets a preset voice data quality condition or not to obtain a second judging result.
16. The apparatus of claim 15, the preset voice data quality condition comprising: the user voice duration is greater than or equal to a first threshold;
the judging unit is specifically configured to:
and judging whether the total voice duration of the user voice data set is greater than or equal to the first threshold value or not.
17. The apparatus of claim 15, the preset voice data quality condition comprising: the user voice quality score is greater than or equal to a second threshold;
the judging module further comprises:
the voice quality analysis unit is used for carrying out voice quality analysis on the user voice data set through a voice quality analysis model to obtain a voice quality score of the user voice data set; the voice quality analysis model is obtained by training a deep learning model by utilizing a second voice data sample carrying a voice quality score label in advance, wherein the voice quality score label is used for expressing a preset voice quality score of the second voice data sample;
the determining unit is specifically configured to:
and judging whether the voice quality score of the user voice data set is greater than or equal to the second threshold value.
18. The apparatus of claim 15, the preset voice data quality condition comprising: the user voice belongs to living voice;
the judging module further comprises:
the living voice detection unit is used for carrying out living voice detection on the user voice data set through a living voice detection model to obtain a living voice detection result of the user voice data set; the living body voice detection model is obtained by training a deep learning model by utilizing a third voice data sample carrying a second classification label in advance, wherein the second classification label is used for indicating whether the third voice data sample belongs to living body voice;
the judging unit is specifically configured to:
and judging whether the living voice detection result indicates that the user voice data set belongs to living voice.
19. The apparatus of claim 14, further comprising:
a first voiceprint feature extraction module, configured to, if the second determination result indicates that the user voice data set meets a preset voice data quality condition, perform voiceprint feature extraction on each piece of user voice data included in the user voice data set to obtain voiceprint feature data of each piece of user voice data;
the first division module is used for dividing each piece of user voice data according to the voiceprint feature data of each piece of user voice data to obtain at least one user voice data subset; the user voice data contained in each user voice data subset belongs to the same user, and the user voice data contained in different user voice data subsets belong to different users;
the first voice data volume judging module is used for judging whether the voice data volume of a target voice data subset is larger than a third threshold value or not to obtain a third judgment result; wherein the target voice data subset is the user voice data subset with the largest voice data amount;
the first generation module is specifically configured to:
and if the third judgment result shows that the voice data volume of the target voice data subset is larger than a third threshold value, generating the voiceprint fusion feature to be verified according to the voiceprint feature data of the user voice data contained in the target voice data subset.
20. The apparatus according to claim 12, wherein the authentication request carries a to-be-processed audio file containing the to-be-authenticated voice data; the audio file to be processed is a file obtained by acquiring audio with the terminal equipment bound with the target user in the process of user communication; the device, still include:
the format type judging module is used for judging whether the format type of the audio file to be processed belongs to a preset format type or not to obtain a fourth judging result;
the audio decoding module is configured to perform audio decoding processing on the audio file to be processed to obtain an audio file of a preset format type if the fourth determination result indicates that the format type of the audio file to be processed does not belong to the preset format type;
and the channel separation module is used for carrying out channel separation processing on the audio file with the preset format type to obtain the voice data to be verified.
21. The apparatus of claim 12, further comprising:
the second acquisition module is used for acquiring a user voice data sample set of the target user, wherein the user voice data sample set meets the preset voice data quality condition;
a second voiceprint feature extraction module, configured to perform voiceprint feature extraction on each user voice data sample included in the user voice data sample set, to obtain a voiceprint feature data sample of each user voice data sample;
the second division module is used for dividing each user voice sample according to the voiceprint feature data sample of each user voice data sample to obtain at least one user voice data sample subset; the user voice data samples contained in each user voice data sample subset belong to the same user, and the user voice data samples contained in different user voice data sample subsets belong to different users;
the second voice data volume judging module is used for judging whether the voice data volume of a target voice data sample subset is larger than a fourth threshold value or not to obtain a fourth judgment result; wherein the target voice data sample subset is the user voice data sample subset with the largest voice data amount;
a second generating module, configured to generate a reference voiceprint fusion feature of the target user according to a voiceprint feature data sample of the user voice data sample in the target voice data sample subset if the fourth determination result indicates that the voice data amount of the target voice data sample subset is greater than a fourth threshold;
and the storage module is used for storing the reference voiceprint fusion characteristics of the target user to obtain the pre-stored reference voiceprint fusion characteristics of the target user.
22. The apparatus of claim 21, wherein the second generating module is specifically configured to:
acquiring voice quality scores of the user voice data samples in the target voice data sample subset;
sequencing the user voice data samples in the target voice data sample subset according to the sequence of the voice quality scores from large to small to obtain a user voice data sample sequence;
and generating the reference voiceprint fusion characteristics of the target user according to the voiceprint characteristic data samples of the first N user voice data samples in the user voice data sample sequence.
23. A voice-based authentication device comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquiring an identity authentication request aiming at a target user, wherein the identity authentication request carries voice data to be authenticated;
judging whether the voice data to be verified meets a preset voice data quality condition or not to obtain a first judgment result;
if the first judgment result shows that the voice data to be verified meets the preset voice data quality condition, generating voiceprint fusion characteristics to be verified according to the voice data to be verified;
and comparing the voiceprint fusion features to be verified with prestored reference voiceprint fusion features of the target user to obtain an identity verification result aiming at the target user.
CN202210122476.7A 2022-02-09 2022-02-09 Identity verification method, device and equipment based on voice Pending CN114547568A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210122476.7A CN114547568A (en) 2022-02-09 2022-02-09 Identity verification method, device and equipment based on voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210122476.7A CN114547568A (en) 2022-02-09 2022-02-09 Identity verification method, device and equipment based on voice

Publications (1)

Publication Number Publication Date
CN114547568A true CN114547568A (en) 2022-05-27

Family

ID=81673100

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210122476.7A Pending CN114547568A (en) 2022-02-09 2022-02-09 Identity verification method, device and equipment based on voice

Country Status (1)

Country Link
CN (1) CN114547568A (en)

Similar Documents

Publication Publication Date Title
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
US11127416B2 (en) Method and apparatus for voice activity detection
JP6394709B2 (en) SPEAKER IDENTIFYING DEVICE AND FEATURE REGISTRATION METHOD FOR REGISTERED SPEECH
US9875739B2 (en) Speaker separation in diarization
US8825479B2 (en) System and method for recognizing emotional state from a speech signal
CN105096940B (en) Method and apparatus for carrying out speech recognition
US10573307B2 (en) Voice interaction apparatus and voice interaction method
US20180218731A1 (en) Voice interaction apparatus and voice interaction method
CN110473566A (en) Audio separation method, device, electronic equipment and computer readable storage medium
CN104143326A (en) Voice command recognition method and device
CN108899033B (en) Method and device for determining speaker characteristics
CN111785275A (en) Voice recognition method and device
CN112509598B (en) Audio detection method and device and storage medium
US20230401338A1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
US20180308501A1 (en) Multi speaker attribution using personal grammar detection
CN109920435A (en) A kind of method for recognizing sound-groove and voice print identification device
CN110782902A (en) Audio data determination method, apparatus, device and medium
CN114708869A (en) Voice interaction method and device and electric appliance
CN109273012B (en) Identity authentication method based on speaker recognition and digital voice recognition
Pradhan et al. Speaker verification under degraded condition: a perceptual study
CN108665901B (en) Phoneme/syllable extraction method and device
CN114125506B (en) Voice auditing method and device
CN115547345A (en) Voiceprint recognition model training and related recognition method, electronic device and storage medium
CN114547568A (en) Identity verification method, device and equipment based on voice
CN104464756A (en) Small speaker emotion recognition system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination