CN112735426A - Voice verification method and system, computer device and storage medium - Google Patents


Info

Publication number
CN112735426A
Authority
CN
China
Prior art keywords
audio signal
voice
voice data
time
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011551178.7A
Other languages
Chinese (zh)
Inventor
陈东鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Voiceai Technologies Co ltd
Original Assignee
Voiceai Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Voiceai Technologies Co ltd
Priority to CN202011551178.7A
Publication of CN112735426A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Computational Linguistics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to a voice verification method and system, a computer device and a storage medium. The method comprises the following steps: a voice verification terminal emits a sound wave to the external environment, wherein the sound wave contains a first audio signal that changes according to a preset rule, and the preset rule makes the first audio signal different at different moments; first voice data is collected by a first collector while the sound wave is being emitted, the first collector being arranged at a position where it can receive and collect the sound wave; if the first voice data contains the first audio signal corresponding to the time at which the first voice data was collected, the first audio signal is removed from the first voice data to obtain second voice data; voiceprint features of the second voice data are extracted; and if the voiceprint features match preset voiceprint features, the voice verification is passed. The method and the device can defend against recording replay attacks and improve the security of voice verification.

Description

Voice verification method and system, computer device and storage medium
Technical Field
The present application relates to voice recognition technology, and in particular, to a voice verification method, a voice verification system, a computer device, and a computer-readable storage medium.
Background
Recording playback is a common way to attack voiceprint recognition systems. A typical scenario is: while the user performs voice verification, a third party illegally records the voice, and later passes voice verification using the user's voice/voiceprint information obtained from the recording (for example, via a miniature eavesdropping device hidden on an ATM by a criminal).
Disclosure of Invention
Based on this, it is necessary to provide a voice verification method and system, a computer device, and a storage medium with high security, to solve the problem that conventional voice verification methods are easily defeated by recording playback.
A voice verification method, the method comprising: a voice verification terminal emits a sound wave to the external environment, wherein the sound wave contains a first audio signal that changes according to a preset rule, and the preset rule makes the first audio signal different at different moments; collecting first voice data by a first collector while the sound wave is being emitted, the first collector being arranged at a position where it can receive and collect the sound wave; if the first voice data contains the first audio signal corresponding to the time at which the first voice data was collected, reversely eliminating the first audio signal at the position where it is detected in the first voice data, to obtain second voice data; extracting voiceprint features of the second voice data; and if the voiceprint features match preset voiceprint features, passing the voice verification.
In one embodiment, in the step of the voice verification terminal emitting the sound wave to the external environment, the sound wave further contains a second audio signal; the second audio signal is located at the head of a special audio signal, and the special audio signal includes the second audio signal and the first audio signal following the second audio signal in the time domain. The method further comprises: detecting whether the first voice data contains one and only one second audio signal; if so, the step of removing the first audio signal from the first voice data further includes removing the second audio signal; otherwise, the voice verification is not passed.
In one embodiment, the step of detecting whether the first voice data has the single second audio signal comprises: and carrying out Fourier transformation on the first voice data, and detecting whether a second audio signal meeting a preset condition exists in the first voice data in a frequency spectrum.
In one embodiment, the step of detecting whether a second audio signal meeting a preset condition exists in the first voice data in the frequency spectrum comprises: and scanning the frequency spectrum line by line, and when scanning a signal of which the signal mode, the time length and the frequency range all accord with the preset conditions, identifying the scanned signal as the second audio signal and taking the second audio signal as the starting point of the special audio signal.
In one embodiment, the first audio signal is a sequence of signals related to the emission time of the sound wave.
In one embodiment, the first audio signal comprises a plurality of sub-signals spaced apart from each other in the time domain, each sub-signal comprising a portion of time information, the time information of the sub-signals being combined together to form the complete transmission time instant.
In one embodiment, the time interval between adjacent second audio signals is constant.
In one embodiment, the first audio signal further includes information corresponding to a device identification code of the voice authentication terminal.
In one embodiment, if the first voice data includes a first audio signal corresponding to the time of the first voice data acquisition, the step of eliminating the first audio signal in a reverse direction at a position where the first audio signal is detected in the first voice data includes: obtaining a first audio signal corresponding to the time of the first voice data acquisition according to the preset rule; and judging whether the first voice data contains a first audio signal corresponding to the time of the first voice data acquisition, if so, reversely eliminating the first audio signal at the position where the first audio signal is detected in the first voice data, otherwise, failing to pass the voice verification.
In one embodiment, the method further comprises a step of voiceprint registration, and the step of voiceprint registration comprises: acquiring registration voice data; and extracting voiceprint features from the registered voice data to serve as the preset voiceprint features.
In one embodiment, the step of acquiring the registration voice data includes: the registered terminal transmits the sound waves to an external environment; acquiring the registration voice data through a second acquisition device under the condition that the registration terminal transmits the sound waves, wherein the second acquisition device is arranged at a position capable of receiving and acquiring the sound waves transmitted by the registration terminal; the step of extracting voiceprint features from the registered voice data as the preset voiceprint features comprises: if the registered voice data contains a first audio signal corresponding to the time of acquisition of the registered voice data, removing the first audio signal from the registered voice data to obtain third voice data; and extracting the voiceprint feature of the third voice data to serve as the preset voiceprint feature.
In one embodiment, the step of acquiring the registration voice data includes: and performing identity authentication, and collecting the voice password input by the user under the condition that the identity authentication is passed.
In one embodiment, the enrollment voice data includes at least 8 syllables and the signal-to-noise ratio is greater than 5 dB.
A voice verification system, comprising: a voice verification terminal, which includes a sound wave emitting device for emitting a sound wave to the external environment, the sound wave containing a first audio signal that changes according to a preset rule, the preset rule making the first audio signal different at different moments; a collector, arranged at a position where it can receive and collect the sound wave, for collecting voice data; a voice acquisition control module, for controlling the collector to collect first voice data while the sound wave is being emitted; and a voice processing device, which includes a silencing module, a voiceprint extraction module and a matching module, wherein the silencing module is used for reversely eliminating, when the first voice data contains the first audio signal corresponding to the time of the first voice data acquisition, the first audio signal at the position where it is detected in the first voice data, to obtain second voice data; the voiceprint extraction module is used for extracting the voiceprint features of the second voice data; and the matching module is used for passing the voice verification when the voiceprint features match the preset voiceprint features.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the voice authentication method of any of the preceding embodiments when the computer program is executed.
A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the voice authentication method of any of the preceding embodiments.
A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the voice authentication method of any of the preceding embodiments.
In one embodiment, when the first voice data is detected to contain two groups of second audio signals, and the first audio signal in one group corresponds to the time at which the first voice data was acquired, the time of the earlier recording is deduced from the time information contained in the first audio signal of the other group, so as to trace the recording/eavesdropping.
According to the above voice verification method and system, computer device and storage medium, while the user's first voice data is being collected during voice verification, a first audio signal that differs at different moments is emitted; in the subsequent steps the first audio signal is detected and removed according to the collection time of the first voice data and the preset rule, and voiceprint verification is performed on the cleaned voice data. If a lawbreaker makes an illegal recording near the voice verification terminal, that recording contains the first audio signal of the recording time, so its voiceprint features are contaminated and will fail the match during verification. The recording replay attack is thus successfully defended against, and the security of voice verification is improved.
Drawings
FIG. 1 is a diagram of an exemplary implementation of a voice authentication method;
FIG. 2 is a flow diagram that illustrates a method for voice authentication, according to one embodiment;
FIG. 3 is a flow chart illustrating a voice authentication method according to another embodiment;
FIG. 4 is a diagram illustrating an internal structure of a computer device according to an embodiment;
FIG. 5 is a diagram of the internal structure of a computer device in another embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood that when an element is referred to as being "secured to" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. The terms "vertical," "horizontal," "upper," "lower," "left," "right," and the like as used herein are for illustrative purposes only. When an element or layer is referred to as being "on," "adjacent to," "connected to," or "coupled to" other elements or layers, it can be directly on, adjacent to, connected or coupled to the other elements or layers or intervening elements or layers may be present. In contrast, when an element is referred to as being "directly on," "directly adjacent to," "directly connected to" or "directly coupled to" other elements or layers, there are no intervening elements or layers present. It will be understood that, although the terms first, second, third, etc. may be used to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section discussed below could be termed a second element, component, region, layer or section without departing from the teachings of the present invention.
When the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
Conventional defenses against recording replay attacks mainly fall into two categories:

The first works at the application level, requiring the user to read aloud, at each verification, plaintext content that changes over time, such as a dynamic random number or a short passage of text. This approach gives a poor user experience and requires a display screen to show the plaintext content. In addition, the user cannot set a fixed voice password of his own in the way a conventional password is set. If a fixed voice password is set, it cannot be prevented that the user's voice is stolen at some verification and used for a later replay attack (e.g., a miniature eavesdropping device hidden on an ATM by a criminal).

The second is technical: a machine learning method is used to judge whether the received voice was produced by a real person or by a device, but with existing techniques the sound produced by high-fidelity recording and playback equipment is difficult to distinguish.
The voice verification method provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The terminal 102 may be used as a voice authentication terminal, and some steps of the voice authentication method provided by the present application may be executed on the terminal 102 or the server 104. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, portable wearable devices, and Automatic Teller Machines (ATMs), and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In an embodiment of the present application, as shown in fig. 2, a voice verification method is provided, which is described by taking the method as an example applied to the terminal 102 in fig. 1, and the method includes the following steps:
and S210, the voice verification terminal transmits sound waves to the external environment.
The sound wave contains a first audio signal that changes according to a preset rule; by setting a suitable rule, the first audio signal is made different at different moments, for example by having the first audio signal vary with time. In one embodiment of the present application, the sound wave is emitted by a speaker mounted on the voice verification terminal; in other embodiments, other devices capable of producing sound waves, such as other devices known in the art that can convert electrical signals into sound waves, may be used to emit the sound wave containing the first audio signal.
S220, collecting first voice data through a first collector in a sound wave emission state.
In one embodiment of the application, the voice verification terminal maintains the state of transmitting sound waves, prompts a user to perform voice verification, and collects voice of the user. The first collector is installed at a position capable of receiving and collecting sound waves emitted by the loudspeaker. It can be understood that, in order to ensure that the first collector can really collect the sound waves emitted by the loudspeaker, the first collector may be installed on the voice verification terminal close to the loudspeaker. In one embodiment of the present application, the first collector is a microphone; in other embodiments, the first collector may also adopt other devices known in the art which can convert the sound wave into an electric signal.
S230, detecting whether the first voice data includes the first audio signal, if yes, the process goes to step S240.
If the first voice data collected by the first collector includes a first audio signal corresponding to the time of collecting the first voice data, the process proceeds to step S240. Since the acquisition time of the first voice data is known, the first audio signal corresponding to the time can be known through the preset rule, and whether the signal is included in the first voice data is further determined. It can be understood that the first voice data actually includes the first audio signal corresponding to the sound wave emission time, but since the first collector is installed close to the speaker, the collection time of the first voice data can be approximately considered to be the same as the sound wave emission time. In other embodiments, the collection time of the first voice data may be modified to approximate the sound wave emission time, for example, the time required from emission to reception of the sound wave may be calculated according to the distance between the speaker that emits the sound wave and the first collector that collects the sound wave, and the collection time of the first voice data is obtained by adding the time to the sound wave emission time. In one embodiment of the present application, step S230 is performed on a server; in other embodiments, step S230 may also be performed on the voice authentication terminal.
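A sketch of this check under the illustrative year/month/day/hour/weekday frequency mapping described later in this description; the frequencies decoded from the first voice data would come from the spectrum-scanning step discussed below, and the function names and the 10 Hz tolerance are assumptions for illustration only, not the patent's prescribed procedure.

```python
from datetime import datetime

def expected_frequencies(acquisition_time: datetime):
    """Expected sub-signal frequencies for the acquisition time, per the illustrative
    rule used later in this description (year, month*100, day*100, hour*100, weekday*100)."""
    t = acquisition_time
    return [float(t.year), t.month * 100.0, t.day * 100.0,
            t.hour * 100.0, t.isoweekday() * 100.0]

def contains_expected_signal(decoded_freqs, acquisition_time, tol_hz=10.0):
    """True if the frequency sequence decoded from the first voice data matches,
    within tol_hz, the first audio signal expected at the acquisition time."""
    expected = expected_frequencies(acquisition_time)
    if len(decoded_freqs) != len(expected):
        return False
    return all(abs(d - e) <= tol_hz for d, e in zip(decoded_freqs, expected))
```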
S240, removing the first audio signal in the first voice data to obtain second voice data.
And obtaining a corresponding first audio signal according to the acquisition time of the first voice data and the preset rule, and reversely eliminating the position of the first audio signal detected in the first voice data. The known first audio signal can be used for echo cancellation directly, and the first audio signal is erased from the first voice data, so that the voiceprint recognition of the subsequent steps is not influenced. Specifically, the first audio signal in the first voice data may be canceled by superimposing a signal having the same amplitude and an opposite phase to the first audio signal at a position where the first audio signal is detected in the first voice data. In one embodiment of the present application, after determining the position of the first audio signal, the average energy of the position is subtracted from the frequency spectrum, so as to eliminate the influence and obtain the second voice data. In one embodiment of the present application, step S240 is performed on a server; in other embodiments, step S240 may also be performed on the voice authentication terminal.
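A minimal sketch of the spectral-subtraction variant mentioned above (subtracting the average energy at the detected position of one known sub-signal), assuming SciPy's STFT/ISTFT; the frame length, and passing in the tone frequency and its detected time span explicitly, are illustrative assumptions rather than the patent's fixed procedure.

```python
import numpy as np
from scipy.signal import stft, istft

def remove_tone(speech, sr, tone_hz, start_s, end_s, nperseg=512):
    """Suppress a known tone (one sub-signal of the first audio signal) between
    start_s and end_s by subtracting its average magnitude in that frequency bin."""
    f, t, Z = stft(speech, fs=sr, nperseg=nperseg)
    bin_idx = np.argmin(np.abs(f - tone_hz))          # frequency bin nearest the tone
    frames = (t >= start_s) & (t <= end_s)            # frames covering the detected span
    mag = np.abs(Z[bin_idx, frames])
    avg = mag.mean() if mag.size else 0.0
    # subtract the average energy at that position, keeping the original phase
    new_mag = np.clip(mag - avg, 0.0, None)
    Z[bin_idx, frames] = new_mag * np.exp(1j * np.angle(Z[bin_idx, frames]))
    _, cleaned = istft(Z, fs=sr, nperseg=nperseg)
    return cleaned
```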
And S250, extracting the voiceprint characteristics of the second voice data.
A voiceprint feature is the spectrum of a sound wave, carrying speech information, that can be displayed by an electro-acoustic instrument. Each person's voiceprint is determined by his or her vocal organs; because vocal organs differ from person to person, the sound-wave spectra obtained by analysis also differ, so speakers can be distinguished by their voiceprint features.
Specifically, the voiceprint features of the voice password data may be extracted by a voiceprint feature extraction algorithm. Common voiceprint feature extraction algorithms include, but are not limited to, Mel-Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction coefficients (PLP), deep features (Deep Feature), Power-Normalized Cepstral Coefficients (PNCC), Power-Normalized Perceptual Linear Prediction (PNPLP), and the like.
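The sketch below illustrates one possible MFCC extraction for the second voice data, assuming the open-source librosa library; the sampling rate, the 20-coefficient setting and the time-averaging into a single vector are illustrative assumptions (practical voiceprint systems usually feed frame-level features into a speaker model rather than averaging them directly).

```python
import numpy as np
import librosa

def extract_mfcc(second_voice, sr=16000, n_mfcc=20):
    """Frame-level MFCCs of the second voice data; averaging over time
    gives a crude utterance-level voiceprint vector for illustration."""
    mfcc = librosa.feature.mfcc(y=second_voice, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
    return mfcc.mean(axis=1)                                           # shape (n_mfcc,)
```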
In one embodiment of the present application, step S250 is performed on a server; in other embodiments, step S250 may also be performed on the voice authentication terminal.
And S260, judging whether the voiceprint features are matched with the preset voiceprint features.
And if the voiceprint features are matched with the preset voiceprint features, the voice verification is passed. In an embodiment of the present application, the extracted voiceprint features are matched with the voiceprint features extracted in advance, the matching refers to calculating a similarity between two voiceprint features, and if the similarity between the two voiceprint features meets a preset threshold, the voice verification is passed. In an embodiment of the present application, the preset voiceprint feature is obtained by processing a user voice collected during voiceprint registration.
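As a hedged illustration of this matching step, the sketch below computes the cosine similarity between the extracted voiceprint vector and the pre-stored one and compares it against a preset threshold; the patent does not fix a particular similarity measure, and the 0.8 threshold is purely an assumed example.

```python
import numpy as np

def verify(voiceprint, enrolled_voiceprint, threshold=0.8):
    """Voice verification passes if the similarity meets the preset threshold."""
    sim = np.dot(voiceprint, enrolled_voiceprint) / (
        np.linalg.norm(voiceprint) * np.linalg.norm(enrolled_voiceprint) + 1e-12)
    return sim >= threshold
```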
In one embodiment of the present application, step S260 is performed on a server; in other embodiments, step S260 may also be performed on the voice authentication terminal. It is understood that the foregoing embodiments describe the voice authentication method in a terminal-server architecture; in other embodiments, all of steps S210 to S260 of the foregoing embodiments may be executed in the voice authentication terminal.
According to the above voice verification method, while the user's first voice data is being collected during voice verification, a first audio signal that differs at different moments is emitted; in the subsequent steps the first audio signal is detected and removed according to the collection time of the first voice data and the preset rule, and voiceprint verification is performed on the cleaned voice data. If a lawbreaker makes an illegal recording near the voice verification terminal, that recording contains the first audio signal of the recording time, so its voiceprint features are contaminated and will fail the match during verification. The recording replay attack is thus successfully defended against, and the security of voice verification is improved.
In one embodiment of the present application, the sound wave emitted in step S210 includes a special audio signal including a second audio signal in addition to the first audio signal. The second audio signal is located at the head of the special audio signal and the first audio signal follows the second audio signal in the time domain. Correspondingly, in step S230, it is further required to detect whether the first voice data has a single second audio signal, and if so, in step S240, the second audio signal is further required to be removed; if there is no second audio signal in the first voice data, or there are two (or more) sets of second audio signals, the voice verification is not passed. It can be understood that, in the case where a lawless person performs illegal recording near the voice verification terminal, two sets of second audio signals are detected when performing voice verification using recording playback, and thus it is determined that the recorded playback voice is not verified. The second audio signal is arranged at the head of the special audio signal, so that the position of the first audio signal in the same special audio signal can be indicated, and the first audio signal can be conveniently recognized.
In one embodiment of the application, the second audio signal is a header that follows a rule known to the system but unknown to an attacker, and it sounds like noise to the human ear. The first audio signal carries the information to be embedded, which varies with time and may be, for example, the current time information. In one embodiment of the present application, different voice verification terminals have different device IDs, and the first audio signal further contains the device ID information of the current voice verification terminal.
In one embodiment of the present application, the position of the header of the regular signal (the second audio signal) is detected first: a Fourier transform is performed on the waveform of the first voice data to obtain a spectrum, with time on the x-axis and frequency on the y-axis. The header of the special audio signal appears in the spectrum as a specific pattern at some location, such as a bright oblique line or a bright curve following some function f(x, y) ("bright" meaning that the energy of that frequency is concentrated at that moment). The spectrum is scanned line by line; when a bright oblique line/curve is found whose duration matches the preset special-audio-signal pattern and whose frequency range falls within the range in which the preset special audio signal may exist, the starting point of the special audio signal has been found. Next, starting from that point, the search for the first audio signal continues along the time axis. A simple case is that the first audio signal consists only of a group of signals at specific frequencies, which are the information to be decoded; the information contained in the signal (e.g., time information) can then be decoded and compared with the acquisition time of the first voice data to determine whether the first voice data contains the first audio signal corresponding to its acquisition time. Further, when two groups of second audio signals are detected in the first voice data, and the first audio signal in one group corresponds to the acquisition time of the first voice data, the time of the earlier recording can be estimated from the time information decoded from the other group, so that the recording/eavesdropping can be traced.
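A rough sketch of such a spectrum scan, assuming the header is a linear chirp rising at a known rate inside a known frequency band; the chirp parameters, the tolerances and the STFT settings are illustrative assumptions, not the exact detection procedure of the patent.

```python
import numpy as np
from scipy.signal import stft

def find_header_starts(audio, sr, f_lo=2000.0, f_hi=6000.0,
                       dur_s=0.1, slope_hz_per_s=40000.0, nperseg=512):
    """Scan the spectrogram for a bright rising line: per-frame peak frequencies that
    climb at roughly slope_hz_per_s inside [f_lo, f_hi] for dur_s seconds.
    Returns the start times (in seconds) of candidate headers."""
    f, t, Z = stft(audio, fs=sr, nperseg=nperseg)
    peak_f = f[np.argmax(np.abs(Z), axis=0)]         # dominant frequency per frame
    hop = t[1] - t[0]
    win = max(2, int(round(dur_s / hop)))            # frames per expected header
    starts = []
    for i in range(len(t) - win):
        seg = peak_f[i:i + win]
        if not ((seg >= f_lo) & (seg <= f_hi)).all():
            continue
        slope = np.polyfit(t[i:i + win], seg, 1)[0]  # Hz per second
        if abs(slope - slope_hz_per_s) < 0.2 * slope_hz_per_s:
            starts.append(t[i])
    return starts
```

If exactly one start time is returned, the first voice data contains a single second audio signal; two or more returned start times would indicate a possible replayed recording, as described above.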
In one embodiment of the present application, the first audio signal is a signal sequence related to the emission time instant of the sound wave.
In one embodiment of the present application, the first audio signal includes a plurality of sub-signals spaced apart from each other in the time domain; each sub-signal carries part of the time information, and the time information of the sub-signals combines to give the complete emission time. The composition of the first audio signal is illustrated by a specific embodiment: assuming the current time is 11 o'clock on Thursday (the fourth day of the week), September 10, 2020, a signal whose frequency changes every 20 ms may be emitted, with the frequencies 2020, 900, 1000, 1100 and 400 Hz in turn (encoding the year, month, day, hour and day of the week); in other embodiments the frequencies may take other values, as long as the rule is time-dependent. Furthermore, to prevent the rule from being discovered by others, the frequencies of the sub-signals may be further transformed and encrypted, so that what is emitted is noise that sounds random but has the time information embedded in it (namely, the first audio signal). The second audio signal may be placed at the head of a group of sub-signals of the first audio signal, i.e. the aforementioned 2020, 900, 1000, 1100 and 400 Hz signals appear in sequence after the second audio signal appears. In one embodiment, the time interval between adjacent second audio signals is constant; therefore, when the time interval between two adjacent second audio signals is detected not to be that fixed value, it can be determined that two or more groups of second audio signals are present.
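The sketch below shows, under stated assumptions, how such a time-encoded first audio signal could be synthesized; the 100 Hz-per-unit mapping and the 20 ms sub-signal length follow the example above, while the sampling rate, amplitude and function names are illustrative assumptions (a deployed system would additionally transform/encrypt the frequency mapping, as noted above).

```python
import numpy as np

SAMPLE_RATE = 16000          # assumed sampling rate, Hz
SUB_SIGNAL_MS = 20           # each sub-signal lasts 20 ms, as in the example above

def time_to_frequencies(year, month, day, hour, weekday):
    """Map the current time to a frequency sequence, e.g. 2020, 900, 1000, 1100, 400 Hz
    for 11 o'clock on Thursday (weekday 4), September 10, 2020."""
    return [float(year), month * 100.0, day * 100.0, hour * 100.0, weekday * 100.0]

def synthesize_first_audio_signal(freqs_hz, sr=SAMPLE_RATE):
    """Concatenate 20 ms sine sub-signals, one per frequency."""
    n = int(sr * SUB_SIGNAL_MS / 1000)
    t = np.arange(n) / sr
    return np.concatenate([np.sin(2 * np.pi * f * t) for f in freqs_hz])

# Example: 11:00, Thursday, September 10, 2020
freqs = time_to_frequencies(2020, 9, 10, 11, 4)   # [2020, 900, 1000, 1100, 400] Hz
first_audio = synthesize_first_audio_signal(freqs)
```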
FIG. 3 is a flow chart of a voice authentication method in another embodiment, comprising the steps of:
s302, registration voice data is obtained.
In one embodiment of the application, the registration terminal prompts the user to register an identity and to enter a voice password. The password may be set by the user himself: the user speaks the voice password to be set, and the setting is completed. The content of the voice password may be in any language, or even meaningless speech, as long as the length and quality of the voice meet the requirements. For embodiments in which the user sets the voice password himself, the subsequent voice verification step does not need a display screen to show content such as dynamic random numbers or random characters for the user to read aloud, so the cost of configuring a display screen can be saved.
In one embodiment of the present application, it is required that the registered voice data contain at least 8 syllables and that the signal-to-noise ratio is greater than 5 dB.
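A sketch of checking the signal-to-noise-ratio requirement on the registration voice data, assuming a noise-only segment of the same recording channel is available as a reference; the syllable count would in practice come from a speech or syllable recognizer and is not sketched here.

```python
import numpy as np

def snr_db(speech, noise):
    """Estimate SNR in dB from a speech segment and a noise-only reference segment."""
    p_noise = np.mean(np.square(np.asarray(noise, dtype=np.float64))) + 1e-12
    p_total = np.mean(np.square(np.asarray(speech, dtype=np.float64)))
    p_signal = max(p_total - p_noise, 1e-12)   # the speech segment also contains noise
    return 10.0 * np.log10(p_signal / p_noise)

def registration_quality_ok(speech, noise, min_snr_db=5.0):
    """Registration voice data is accepted only if its SNR exceeds 5 dB."""
    return snr_db(speech, noise) > min_snr_db
```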
In an embodiment of the present application, the registration terminal and the voice verification terminal may be the same terminal, that is, the user may perform registration and voice verification on the same terminal; in other embodiments, the registered terminal may be a different terminal from the voice authentication terminal.
In one embodiment of the present application, step S302 includes:
the registered terminal transmits sound waves to the external environment. As in step S210, the sound wave emitted by the registered terminal also includes the first audio signal that varies according to the preset rule. By setting suitable rules such that the first audio signal is not the same at different moments in time, the first audio signal is made to vary over time, for example. In one embodiment of the present application, the sound wave is emitted through a speaker installed at the registered terminal; in other embodiments, other devices capable of forming sound waves, such as other devices capable of converting electrical signals into sound waves, known in the art, may be used to emit sound waves including the first audio signal.
In one embodiment of the present application, the sound wave emitted in step S302 includes a special audio signal including a second audio signal in addition to the first audio signal. The second audio signal is located at the head of the special audio signal and the first audio signal follows the second audio signal in the time domain.
In one embodiment of the application, the second audio signal is a header that follows a rule known to the system but unknown to an attacker, and it sounds like noise to the human ear. The first audio signal carries the information to be embedded, which varies with time and may be, for example, the current time information. In one embodiment of the present application, different voice verification terminals have different device IDs, and the first audio signal further contains the device ID information of the current voice verification terminal.
In one embodiment of the present application, the first audio signal is a signal sequence related to the emission time instant of the sound wave.
In one embodiment of the present application, the first audio signal includes a plurality of sub-signals spaced apart from each other in the time domain; each sub-signal carries part of the time information, and the time information of the sub-signals combines to give the complete emission time. The composition of the first audio signal is illustrated by a specific embodiment: assuming the current time is 11 o'clock on Thursday (the fourth day of the week), September 10, 2020, a signal whose frequency changes every 20 ms may be emitted, with the frequencies 2020, 900, 1000, 1100 and 400 Hz in turn (encoding the year, month, day, hour and day of the week); in other embodiments the frequencies may take other values, as long as the rule is time-dependent. Furthermore, to prevent the rule from being discovered by others, the frequencies of the sub-signals may be further transformed and encrypted, so that what is emitted is noise that sounds random but has the time information embedded in it (namely, the first audio signal). The second audio signal may be placed at the head of a group of sub-signals of the first audio signal, i.e. the aforementioned 2020, 900, 1000, 1100 and 400 Hz signals appear in sequence after the second audio signal appears. In one embodiment, the time interval between adjacent second audio signals is constant; therefore, when the time interval between two adjacent second audio signals is detected not to be that fixed value, it can be determined that two or more groups of second audio signals are present.
Step S302 further includes: and acquiring registration voice data through a second acquisition device in the state that the registration terminal transmits the sound waves, wherein the second acquisition device is arranged at a position capable of receiving and acquiring the sound waves transmitted by the registration terminal.
In one embodiment of the present application, the second collector is installed at a position capable of receiving and collecting sound waves emitted from the speaker. It will be appreciated that the second collector may be mounted on the registration terminal close to the loudspeaker in order to ensure that the second collector can actually collect the sound waves emitted by the loudspeaker. In one embodiment of the present application, the second collector is a microphone; in other embodiments, the second collector may also adopt other devices known in the art which can convert the sound wave into an electric signal.
In the above embodiment, the first audio signal is also mixed into the registration voice data collected during registration, which prevents a lawbreaker from recording, near the registration terminal, a voice password free of the first audio signal and using it for a recording replay attack.
In an embodiment of the present application, the step S302 further includes a step of performing identity authentication when acquiring the registration voice data, and acquiring a voice password entered by the user when the identity authentication passes. The registered terminal in this embodiment does not emit sound waves containing the first audio signal to the external environment at step S302.
S304, extracting the voiceprint feature from the registered voice data to serve as the preset voiceprint feature.
Correspondingly, for the embodiment of keeping the registered terminal to emit the sound wave when acquiring the registered voice data, in step S304, it is further required to detect whether the registered voice data includes the first audio signal and/or the second audio signal corresponding to the time of acquiring the registered voice data, and if so, remove all the first audio signal and the second audio signal detected in the registered voice data to obtain third voice data, and then extract the voiceprint feature of the third voice data to serve as the preset voiceprint feature.
And S310, the voice verification terminal transmits sound waves to the external environment.
Steps S302 and S304 are steps of the user registration phase. After the user completes the registration, a stage for voice authentication is started from step S310. Similar to step S210, the sound wave emitted by the voice verification terminal includes the first audio signal that changes according to the preset rule, and the first audio signal is made different at different times by setting a suitable rule, for example, the first audio signal changes with time. In one embodiment of the present application, the sound wave is emitted by a speaker mounted on the voice authentication terminal; in other embodiments, other devices capable of forming sound waves, such as other devices capable of converting electrical signals into sound waves, known in the art, may be used to emit sound waves including the first audio signal.
In one embodiment of the present application, the sound wave emitted in step S310 contains, in addition to the first audio signal, a special audio signal including a second audio signal. The second audio signal is located at the head of the special audio signal, and the first audio signal follows the second audio signal in the time domain.
In one embodiment of the application, the second audio signal is a header that follows a rule known to the system but unknown to an attacker, and it sounds like noise to the human ear. The first audio signal carries the information to be embedded, which varies with time and may be, for example, the current time information. In one embodiment of the present application, different voice verification terminals have different device IDs, and the first audio signal further contains the device ID information of the current voice verification terminal.
In one embodiment of the present application, the first audio signal is a signal sequence related to the emission time instant of the sound wave.
In one embodiment of the present application, the first audio signal includes a plurality of sub-signals spaced apart from each other in the time domain; each sub-signal carries part of the time information, and the time information of the sub-signals combines to give the complete emission time. The composition of the first audio signal is illustrated by a specific embodiment: assuming the current time is 11 o'clock on Thursday (the fourth day of the week), September 10, 2020, a signal whose frequency changes every 20 ms may be emitted, with the frequencies 2020, 900, 1000, 1100 and 400 Hz in turn (encoding the year, month, day, hour and day of the week); in other embodiments the frequencies may take other values, as long as the rule is time-dependent. Furthermore, to prevent the rule from being discovered by others, the frequencies of the sub-signals may be further transformed and encrypted, so that what is emitted is noise that sounds random but has the time information embedded in it (namely, the first audio signal). The second audio signal may be placed at the head of a group of sub-signals of the first audio signal, i.e. the aforementioned 2020, 900, 1000, 1100 and 400 Hz signals appear in sequence after the second audio signal appears. In one embodiment, the time interval between adjacent second audio signals is constant; therefore, when the time interval between two adjacent second audio signals is detected not to be that fixed value, it can be determined that two or more groups of second audio signals are present.
And S320, acquiring first voice data through the first acquisition unit in the state of sound wave emission.
In one embodiment of the application, the voice verification terminal maintains the state of transmitting sound waves, prompts a user to perform voice verification, and collects voice of the user. The first collector is installed at a position capable of receiving and collecting sound waves emitted by the loudspeaker. It can be understood that, in order to ensure that the first collector can really collect the sound waves emitted by the loudspeaker, the first collector may be installed on the voice verification terminal close to the loudspeaker. In one embodiment of the present application, the first collector is a microphone; in other embodiments, the first collector may also adopt other devices known in the art which can convert the sound wave into an electric signal.
S330, detecting whether the first voice data contains a single second audio signal; if so, proceeding to step S340; otherwise, the voice verification is not passed.
If there is no second audio signal in the first voice data, or there are two (or more) sets of second audio signals, the voice verification is not passed.
In one embodiment of the present application, the position of the header of the regular signal (the second audio signal) is detected first: a Fourier transform is performed on the waveform of the first voice data to obtain a spectrum, with time on the x-axis and frequency on the y-axis. The header of the special audio signal appears in the spectrum as a specific pattern at some location, such as a bright oblique line or a bright curve following some function f(x, y) ("bright" meaning that the energy of that frequency is concentrated at that moment). The spectrum is scanned line by line; when a bright oblique line/curve is found whose duration matches the preset special-audio-signal pattern and whose frequency range falls within the range in which the preset special audio signal may exist, the starting point of the special audio signal has been found. Next, starting from that point, the search for the first audio signal continues along the time axis. A simple case is that the first audio signal consists only of a group of signals at specific frequencies, which are the information to be decoded; the information contained in the signal (e.g., time information) can then be decoded and compared with the acquisition time of the first voice data to determine whether the first voice data contains the first audio signal corresponding to its acquisition time. Further, when two groups of second audio signals are detected in the first voice data, and the first audio signal in one group corresponds to the acquisition time of the first voice data, the time of the earlier recording can be estimated from the time information decoded from the other group, so that the recording/eavesdropping can be traced.
In one embodiment of the present application, the first audio signal is a signal sequence related to the emission time instant of the sound wave.
In one embodiment of the present application, the first audio signal includes a plurality of sub-signals spaced apart from each other in the time domain; each sub-signal carries part of the time information, and the time information of the sub-signals combines to give the complete emission time. The composition of the first audio signal is illustrated by a specific embodiment: assuming the current time is 11 o'clock on Thursday (the fourth day of the week), September 10, 2020, a signal whose frequency changes every 20 ms may be emitted, with the frequencies 2020, 900, 1000, 1100 and 400 Hz in turn (encoding the year, month, day, hour and day of the week); in other embodiments the frequencies may take other values, as long as the rule is time-dependent. Furthermore, to prevent the rule from being discovered by others, the frequencies of the sub-signals may be further transformed and encrypted, so that what is emitted is noise that sounds random but has the time information embedded in it (namely, the first audio signal). The second audio signal may be placed at the head of a group of sub-signals of the first audio signal, i.e. the aforementioned 2020, 900, 1000, 1100 and 400 Hz signals appear in sequence after the second audio signal appears. In one embodiment, the time interval between adjacent second audio signals is constant; therefore, when the time interval between two adjacent second audio signals is detected not to be that fixed value, it can be determined that two or more groups of second audio signals are present.
S340, removing the first and second audio signals in the first voice data to obtain second voice data.
In one embodiment of the present application, after the positions of the first audio signal and the second audio signal are determined, the average energy of the positions is subtracted from the frequency spectrum, so as to eliminate the influence of the positions and obtain the second voice data.
And S350, extracting the voiceprint characteristics of the second voice data.
And S360, judging whether the voiceprint features are matched with the preset voiceprint features.
If the voiceprint feature matches the preset voiceprint feature obtained in step S304, the voice verification is passed. In an embodiment of the present application, the extracted voiceprint features are matched with the voiceprint features extracted in advance, the matching refers to calculating a similarity between two voiceprint features, and if the similarity between the two voiceprint features meets a preset threshold, the voice verification is passed.
It should be understood that although the steps in the flowcharts of FIGS. 2-3 are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-3 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The present application correspondingly provides a voice verification system, comprising:
the voice verification terminal comprises a sound wave transmitting device used for transmitting sound waves to the external environment, wherein the sound waves contain first audio signals changing according to preset rules, and the preset rules enable the first audio signals to be different at different moments.
And the collector is arranged at the position capable of receiving and collecting the sound waves and is used for collecting voice data.
And the voice acquisition control module is used for controlling the acquisition device to acquire first voice data in the sound wave emission state.
And the silencing module is used for removing the first audio signal in the first voice data to obtain second voice data when the first voice data contains the first audio signal corresponding to the time of acquisition of the first voice data.
And the voiceprint extraction module is used for extracting the voiceprint characteristics of the second voice data.
And the matching module is used for passing the voice verification when the voiceprint characteristics are matched with the preset voiceprint characteristics.
In one embodiment of the application, the sound wave further comprises a second audio signal, the second audio signal being located at a head of a special audio signal, the special audio signal comprising the second audio signal and a first audio signal following the second audio signal in the time domain. The silencing module is further used for detecting whether the first voice data has a single second audio signal or not, and if so, removing the second audio signal in the first voice data; otherwise, a signal indicating that the voice authentication is not passed is output.
In an embodiment of the application, the step of detecting whether the first voice data contains a single second audio signal is performed by the silencing module, which performs a Fourier transform on the first voice data and detects in the spectrum whether a second audio signal meeting the preset condition exists in the first voice data.
In one embodiment of the application, the first audio signal is a sequence of signals related to the emission moment of the sound wave.
In an embodiment of the application, the silencing module is configured to obtain, according to the preset rule, a first audio signal corresponding to the time of the first voice data collection, and determine whether the first voice data includes the first audio signal corresponding to the time of the first voice data collection, if so, remove the first audio signal from the first voice data, otherwise, output a signal indicating that the voice verification fails.
For the specific definition of the voice verification system, reference may be made to the above definition of the voice verification method, which is not described herein again. The various modules in the voice verification system described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store voice authentication data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a voice authentication method.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 5. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a voice authentication method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the structure shown in fig. 5 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution of the present application is applied; a particular computer device may include more or fewer components than shown, or combine certain components, or have a different arrangement of components.
The present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the voice authentication method according to any one of the foregoing embodiments when executing the computer program.
The present application further provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the voice authentication method according to any of the preceding embodiments.
The present application further provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the voice authentication method according to any of the preceding embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium, and when executed may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application and are described in relative detail, but they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (11)

1. A voice verification method, the method comprising:
emitting, by a voice verification terminal, a sound wave to the external environment, wherein the sound wave contains a first audio signal that changes according to a preset rule, the preset rule causing the first audio signal to differ at different moments;
acquiring first voice data through a first acquisition device while the sound wave is being emitted, wherein the first acquisition device is arranged at a position where it can receive and acquire the sound wave;
if the first voice data contains a first audio signal corresponding to the time at which the first voice data was acquired, eliminating the first audio signal in reverse at the position where it is detected in the first voice data, to obtain second voice data;
extracting voiceprint features of the second voice data;
and if the voiceprint features match preset voiceprint features, determining that the voice verification is passed.
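For illustration of claim 1 only, the following Python sketch shows one possible "preset rule" that makes the first audio signal differ at different moments: the emission timestamp, taken to whole seconds, is encoded digit by digit as a sequence of near-ultrasonic tones. The sampling rate, frequencies, tone length, and amplitude are assumptions introduced for this sketch; the patent does not prescribe a particular encoding.

import time
import numpy as np

SAMPLE_RATE = 48000      # Hz (assumed)
TONE_SECONDS = 0.05      # duration of one digit tone (assumed)
BASE_FREQ = 18000        # Hz; digit d maps to BASE_FREQ + d * STEP (assumed, near-ultrasonic)
STEP = 200               # Hz spacing between digit tones (assumed)

def first_audio_signal(emit_time: float) -> np.ndarray:
    """Encode the emission moment (whole seconds) as a tone sequence."""
    digits = [int(c) for c in str(int(emit_time))]
    t = np.arange(int(SAMPLE_RATE * TONE_SECONDS)) / SAMPLE_RATE
    tones = [np.sin(2 * np.pi * (BASE_FREQ + d * STEP) * t) for d in digits]
    return 0.1 * np.concatenate(tones)   # low amplitude so speech remains dominant

if __name__ == "__main__":
    probe = first_audio_signal(time.time())
    print(len(probe) / SAMPLE_RATE, "seconds of probe audio")

Because the timestamp changes every second, a recording replayed later carries a probe that no longer matches the probe expected at the new acquisition time, which is what lets the method reject replayed audio.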
2. The voice verification method according to claim 1, wherein, in the step of the voice verification terminal emitting the sound wave to the external environment, the sound wave further contains a second audio signal located at the head of a special audio signal, the special audio signal comprising the second audio signal and the first audio signal that follows it in the time domain;
the method further comprising: detecting whether the first voice data contains one and only one second audio signal; if so, removing the first audio signal from the first voice data further comprises removing the second audio signal; otherwise, the voice verification is not passed.
3. The voice verification method according to claim 2, wherein the step of detecting whether the first voice data contains one and only one second audio signal comprises: performing a Fourier transform on the first voice data, and detecting in the frequency spectrum whether a second audio signal satisfying a preset condition is present in the first voice data.
4. The voice verification method according to claim 3, wherein the first audio signal is a signal sequence related to the emission moment of the sound wave.
5. The voice verification method according to claim 4, wherein the first audio signal comprises a plurality of sub-signals spaced apart in the time domain, each sub-signal carrying a portion of the time information, and the time information of the sub-signals is combined to form the complete emission moment.
6. The voice verification method according to claim 5, wherein the time interval between adjacent second audio signals is constant.
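Claims 4 to 6 can be pictured as follows: extending the digit-tone encoding sketched after claim 1, the emission moment is split into fixed-size chunks, each chunk is emitted as a short sub-signal, and the sub-signals are separated by silence in the time domain; a receiver that decodes every chunk can reassemble the complete emission moment. The chunk size, gap length, and tone mapping below are assumptions made for this sketch.

import numpy as np

SAMPLE_RATE = 48000
TONE_SECONDS = 0.05
GAP_SECONDS = 0.20            # silence between sub-signals (assumed)
BASE_FREQ, STEP = 18000, 200  # digit-to-frequency mapping (assumed)

def digit_tone(d: int) -> np.ndarray:
    t = np.arange(int(SAMPLE_RATE * TONE_SECONDS)) / SAMPLE_RATE
    return 0.1 * np.sin(2 * np.pi * (BASE_FREQ + d * STEP) * t)

def split_time_signal(emit_time: float, chunk_digits: int = 2) -> np.ndarray:
    """Emit the timestamp as sub-signals of `chunk_digits` digits, spaced by silence."""
    digits = f"{int(emit_time):010d}"
    gap = np.zeros(int(SAMPLE_RATE * GAP_SECONDS))
    pieces = []
    for i in range(0, len(digits), chunk_digits):
        pieces.extend(digit_tone(int(c)) for c in digits[i:i + chunk_digits])
        pieces.append(gap)    # time-domain spacing between sub-signals
    return np.concatenate(pieces)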
7. The voice verification method according to claim 1, wherein the step of eliminating the first audio signal in reverse at the position where it is detected in the first voice data, if the first voice data contains the first audio signal corresponding to the time at which the first voice data was acquired, comprises:
obtaining, according to the preset rule, the first audio signal corresponding to the time at which the first voice data was acquired;
and determining whether the first voice data contains the first audio signal corresponding to the time at which the first voice data was acquired; if so, eliminating the first audio signal in reverse at the position where it is detected in the first voice data; otherwise, the voice verification is not passed.
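One plausible reading of the "eliminating in reverse" step of claims 1, 7, and 9 is anti-phase subtraction of the known probe: locate the expected first audio signal in the recording by cross-correlation, estimate its amplitude by least squares, and subtract the scaled copy at that position. The correlation threshold and the scaling step are assumptions; the patent does not fix a particular cancellation algorithm.

from typing import Optional
import numpy as np

def cancel_first_audio_signal(first_voice: np.ndarray,
                              expected: np.ndarray,
                              min_correlation: float = 0.3) -> Optional[np.ndarray]:
    """Return second voice data, or None if the expected probe is not found."""
    # locate the expected probe by cross-correlation
    corr = np.correlate(first_voice, expected, mode="valid")
    offset = int(np.argmax(np.abs(corr)))
    segment = first_voice[offset:offset + len(expected)]
    norm = np.linalg.norm(segment) * np.linalg.norm(expected)
    if norm == 0 or abs(corr[offset]) / norm < min_correlation:
        return None                                    # probe absent -> verification fails
    # least-squares amplitude of the probe inside the recording
    scale = float(np.dot(segment, expected) / np.dot(expected, expected))
    cleaned = first_voice.copy()
    cleaned[offset:offset + len(expected)] -= scale * expected   # subtract the probe in anti-phase
    return cleaned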
8. The voice verification method according to claim 1, further comprising a voiceprint registration step, the voiceprint registration step comprising:
acquiring registration voice data;
and extracting voiceprint features from the registration voice data to serve as the preset voiceprint features.
9. A voice verification system, comprising:
a voice verification terminal comprising a sound wave emitting device and a voice recognition device, wherein the sound wave emitting device is configured to emit a sound wave to the external environment, the sound wave containing a first audio signal that changes according to a preset rule, the preset rule causing the first audio signal to differ at different moments;
a collector, arranged at a position where it can receive and collect the sound wave, configured to collect voice data;
a voice acquisition control module, configured to control the collector to acquire first voice data while the sound wave is being emitted;
a voice processing device comprising a silencing module, wherein the silencing module is configured to, when the first voice data contains a first audio signal corresponding to the time at which the first voice data was acquired, eliminate the first audio signal in reverse at the position where it is detected in the first voice data, to obtain second voice data;
a voiceprint extraction module, configured to extract voiceprint features of the second voice data;
and a matching module, configured to pass the voice verification when the voiceprint features match the preset voiceprint features.
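To show how the modules recited in claim 9 might fit together, the sketch below wires them into one object, with the concrete signal processing and voiceprint model injected as callables. The class name, the callable-injection design, the cosine score, and the 0.7 threshold are all assumptions made for this sketch rather than the patent's implementation.

from dataclasses import dataclass
from typing import Callable, Optional
import numpy as np

@dataclass
class VoiceVerificationSystem:
    emit: Callable[[float], np.ndarray]               # sound wave emitting device: time -> probe
    collect: Callable[[float], np.ndarray]            # collector driven by the acquisition control module
    cancel: Callable[[np.ndarray, np.ndarray], Optional[np.ndarray]]  # silencing module
    extract: Callable[[np.ndarray], np.ndarray]       # voiceprint extraction module
    enrolled: np.ndarray                               # preset voiceprint features
    threshold: float = 0.7                             # assumed match threshold

    def verify(self, emit_time: float, record_seconds: float) -> bool:
        probe = self.emit(emit_time)                   # first audio signal for this moment
        first_voice = self.collect(record_seconds)     # acquired while the probe is playing
        second_voice = self.cancel(first_voice, probe) # remove the probe from the recording
        if second_voice is None:
            return False                               # probe missing -> likely a replayed recording
        features = self.extract(second_voice)
        score = float(np.dot(features, self.enrolled) /
                      (np.linalg.norm(features) * np.linalg.norm(self.enrolled) + 1e-12))
        return score >= self.threshold                 # matching module decision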
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 8.
11. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 8.
CN202011551178.7A 2020-12-24 2020-12-24 Voice verification method and system, computer device and storage medium Pending CN112735426A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011551178.7A CN112735426A (en) 2020-12-24 2020-12-24 Voice verification method and system, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011551178.7A CN112735426A (en) 2020-12-24 2020-12-24 Voice verification method and system, computer device and storage medium

Publications (1)

Publication Number Publication Date
CN112735426A true CN112735426A (en) 2021-04-30

Family

ID=75615291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011551178.7A Pending CN112735426A (en) 2020-12-24 2020-12-24 Voice verification method and system, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN112735426A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102522087A (en) * 2011-11-25 2012-06-27 青岛海信移动通信技术股份有限公司 Method, device and terminal for transmitting anti-bugging voice
CN102664013A (en) * 2012-04-18 2012-09-12 南京邮电大学 Audio digital watermark method of discrete cosine transform domain based on energy selection
CN104810022A (en) * 2015-05-11 2015-07-29 东北师范大学 Time-domain digital audio watermarking method based on audio breakpoint
CN105933272A (en) * 2015-12-30 2016-09-07 ***股份有限公司 Voiceprint recognition method capable of preventing recording attack, server, terminal, and system
CN106503513A (en) * 2016-09-23 2017-03-15 北京小米移动软件有限公司 Method for recognizing sound-groove and device
CN107463647A (en) * 2017-07-25 2017-12-12 努比亚技术有限公司 Audio identification method to set up, application method, terminal and computer-readable recording medium
CN109599119A (en) * 2018-12-28 2019-04-09 娄奥林 A kind of defence method that confrontation voice messaging is stolen
US20190206396A1 (en) * 2017-12-29 2019-07-04 Comcast Cable Communications, LLC Localizing and Verifying Utterances by Audio Fingerprinting

Similar Documents

Publication Publication Date Title
Zhang et al. Hearing your voice is not enough: An articulatory gesture based liveness detection for voice authentication
WO2017114307A1 (en) Voiceprint authentication method capable of preventing recording attack, server, terminal, and system
Dong et al. Secure mmWave-radar-based speaker verification for IoT smart home
Gałka et al. Playback attack detection for text-dependent speaker verification over telephone channels
US20180146370A1 (en) Method and apparatus for secured authentication using voice biometrics and watermarking
US9484037B2 (en) Device, system, and method of liveness detection utilizing voice biometrics
Zhang et al. Voiceprint mimicry attack towards speaker verification system in smart home
CN108417216B (en) Voice verification method and device, computer equipment and storage medium
WO2019002831A1 (en) Detection of replay attack
Gong et al. Protecting voice controlled systems using sound source identification based on acoustic cues
GB2541466A (en) Replay attack detection
CN103678977A (en) Method and electronic device for protecting information security
JP2007264507A (en) User authentication system, illegal user discrimination method, and computer program
WO2018129869A1 (en) Voiceprint verification method and apparatus
CN108959866B (en) Continuous identity authentication method based on high-frequency sound wave frequency
US11676610B2 (en) Acoustic signatures for voice-enabled computer systems
Huang et al. Stop deceiving! an effective defense scheme against voice impersonation attacks on smart devices
Dai et al. Speech based human authentication on smartphones
Huang et al. Pcr-auth: Solving authentication puzzle challenge with encoded palm contact response
Zhang et al. Volere: Leakage resilient user authentication based on personal voice challenges
CN112351047B (en) Double-engine based voiceprint identity authentication method, device, equipment and storage medium
Shirvanian et al. Quantifying the breakability of voice assistants
Li et al. Security and privacy problems in voice assistant applications: A survey
Li et al. Defend data poisoning attacks on voice authentication
Wu et al. HVAC: Evading Classifier-based Defenses in Hidden Voice Attacks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination