WO1995005656A1 - A speaker verification system - Google Patents
A speaker verification system
- Publication number
- WO1995005656A1 (PCT/AU1994/000468)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/10—Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
- the outputs of the four decision algorithms are fed to the MLP, and its output is compared with a threshold value. Based on the comparison of the MLP output value with a threshold value for each of the words requested from the user, the identity claim is either accepted or rejected.
- the threshold value can be adjusted to vary the level of security of the speaker recognition system.
- the speaker verification system includes a training or adaptive learning facility.
- samples of speech from the claimed identity are required to train or serve as templates for the algorithm.
- the process of collecting the speech data and updating the algorithms to use the new data is known as training.
- At the time of entering a new identity on the system, that person must leave a sample of his/her voice to serve as training and reference templates.
- the person is required to repeat all words in a limited vocabulary a number of times, and the digitised speech is then converted to acoustic features as described above.
- the reference templates, code books, etc. of the individual algorithms are updated to take account of the subtle changes in the user's voice over time.
- the system is trained to adapt to changes in the person's voice which may develop with age.
- the processing can be performed in software. However, the four decision algorithms can be run in hardware if desired, such as on a dedicated DSP board. Furthermore, the four algorithms can be run in parallel so as to minimise decision time.
- the digitisation and feature extraction can also be performed principally in hardware if desired. In most applications, the processing will be split between local and centralised processing.
- a centralised storage centre holds the reference templates, code books and network weights for each speaker. The appropriate templates, code books etc. for the claimed identity may then be down-loaded to the processing units.
- Processing of identity claims may be carried out centrally in a "bank" of processors or localised at the point of transaction. Either approach may be accommodated, the optimum arrangement being dictated by application conditions and restraints. For example, point-of-sale credit card authorisation via speaker verification would normally be performed at some centralised processing centre, while authenticating ATM transactions by speaker verification could be performed either locally at the ATM, or centralised.
- Although the invention has been described with particular reference to speaker verification, it can also be used for speaker identification.
- the characteristics derived from the speech sample of the user are compared with the characteristics from speech samples from a list of persons using the abovedescribed techniques.
- the system will indicate whether the speaker is one of the persons on the list, and/or the person(s) on the list which most closely match(es) the speaker.
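The final decision mechanism described in this specification feeds the outputs of the four decision algorithms to a multi-layer perceptron and compares the result with an adjustable threshold. The sketch below is illustrative only: the network weights are invented, untrained values (a real system would learn them per speaker), and the two-hidden-node topology is an assumption, not taken from the patent.

```python
import math

def mlp_fuse(scores, w_hidden, w_out):
    """Two-layer perceptron combining the four algorithm scores."""
    hidden = [math.tanh(sum(w * s for w, s in zip(row, scores)))
              for row in w_hidden]
    return math.tanh(sum(w * h for w, h in zip(w_out, hidden)))

def verify(scores, threshold=0.5):
    # Illustrative fixed weights; in practice these are trained per speaker.
    w_hidden = [[0.5, 0.5, 0.5, 0.5], [1.0, -0.2, 0.3, 0.1]]
    w_out = [1.0, 0.5]
    return mlp_fuse(scores, w_hidden, w_out) >= threshold

# Four high scores (DTW, VQ, RNN, LTF) accept the claim; low scores reject it.
print(verify([0.9, 0.8, 0.9, 0.7]))      # → True
print(verify([-0.8, -0.9, -0.7, -0.8]))  # → False
```

Raising the threshold trades false acceptances for false rejections, which is how the security level of the system is varied.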
Abstract
The claimed identity of a person is verified from a sample of that person's speech. The speech sample is processed to extract a set of characteristic features. These features are then compared with stored features previously derived from a speech sample of the claimed identity, using a plurality of separate algorithms in parallel. The outputs of the individual algorithms are then fed to an artificial neural network which is used as a further decision algorithm to arrive at the final decision to accept or reject the speaker's identity claim. By using several different decision algorithms, and combining the outputs of those algorithms in a neural network, the speaker verification technique is more universally applicable, and hence more reliable.
Description
"A SPEAKER VERIFICATION SYSTEM"
THIS INVENTION relates to method and apparatus for speaker verification. In particular, the invention is directed to a speaker recognition technique in which characteristics of a speech sample are compared with stored characteristics using several different algorithms, and the results of the individual algorithms are combined in a neural network to arrive at a final decision.
BACKGROUND ART
The aim of speaker verification is to accept or reject a claim to a particular identity based on one or more samples of the claimant's speech. Unlike speech recognition which is aimed at deciphering the spoken word, speaker verification is an identification process. It has been found that a person's voice can be used as an identifying characteristic of that person. The person's voice can therefore be used as a robust, secure means of identification, obviating the need for artificial measures such as PIN numbers, security codes, access cards etc.
Speaker verification can be used as a security measure in any area or application in which the identity of an individual must be authenticated. Areas of immediate application include:
Automatic Teller Machine (ATM) transactions
Telephone banking
Access control (e.g. to secure building access)
Credit card transactions
In particular, speaker verification is ideally suited to transactions conducted over a telephone link as other means of authentication are unsuitable or impractical.
Although there are various known techniques for speaker verification, such techniques normally rely upon a single decision-making algorithm. It has been found that the accuracy can vary depending on the particular characteristic being compared, or the particular decision-making algorithm being used. Furthermore, while some speech features may be accurate characterising features of some speakers, other speakers may have different characterising speech features. Thus, the known speaker verification techniques are generally not universally applicable.
It is an object of the present invention to provide improved method and apparatus for speaker verification which overcomes or ameliorates the disadvantages of known techniques, or which at least provides the consumer with a useful choice.
SUMMARY OF THE INVENTION
In one broad form, the present invention provides a method of verifying whether a speaker is a claimed identity, comprising the steps of obtaining a speech sample from the speaker, deriving one or more features from the speech sample, comparing the derived feature(s) with stored feature(s) previously derived from a speech sample of the claimed identity, using a plurality of separate algorithms, characterised in that the method further comprises the step of processing a combination of the individual results obtained using the plurality of algorithms to arrive at a final verification decision.
In another broad form, the present invention provides apparatus for verifying whether a speaker is a claimed identity, comprising means for obtaining a speech sample from a speaker, means for deriving one or more feature(s) of the speech sample, memory means for storing one or more feature(s) previously derived from a speech sample of the claimed identity, means for comparing the feature(s) derived from the speaker with the stored feature(s), using a plurality
of separate algorithms, characterised in that the apparatus further comprises processing means for deriving a final verification decision from a combination of the results of the plurality of algorithms.
The speech sample may be obtained via a microphone, telephone link or other suitable audio device. The speech sample may comprise one or more words. The speech sample is typically converted to digital format, and a set of predetermined characteristic features are derived from the digitised sample. These features may include cepstral co-efficients, fundamental frequency (or pitch), energy, duration, zero crossing rate and linear prediction co-efficients. The derived features are then compared with similar characteristic features previously derived from a speech sample of the claimed identity and stored in a suitable memory.
Several independent decision algorithms are used in the comparison. Such algorithms may suitably include dynamic time warping, vector quantisation, recurrent neural network and long term features.
Unlike known techniques, the outputs of all the independent algorithms are then utilised to arrive at a final verification decision. In the preferred embodiment, a neural network is used to compute the similarity between the speaker's speech sample and that of the claimed identity. The neural network of the claimed identity has been previously trained to distinguish the speech sample of the claimed identity from others stored in memory using an iterative process. The output of the decision algorithms for the user's sample are applied to the known identity's neural network for final verification. The system also includes a training facility to allow the stored sample of the claimed identity to be updated when a positive verification is made, to accommodate or compensate for changes in speech patterns
with age.
In order that the invention may be more fully understood and put into practice, a preferred embodiment will now be described with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a schematic diagram of the speaker verification system of the preferred embodiment;
Fig. 2 is a schematic of a dynamic time warping process;
Fig. 3 is a schematic diagram of vector quantisation mapping;
Fig. 4 illustrates the architecture of a typical recurrent neural network; and
Fig. 5 illustrates the architecture of the multi-layer subsystem arrangement of Fig. 1.
DESCRIPTION OF PREFERRED EMBODIMENT
In the speaker verification system of the preferred embodiment, characteristic features of a speech sample are extracted and correlated with corresponding features of a stored speech sample previously obtained from the true identity. (Typically, the features of the latter speech sample are stored instead of, or in addition to, the speech sample itself). The correlation outputs are then combined in an artificial neural network to arrive at the final decision.
The major components of the speaker verification system are shown in Fig. 1. When a transaction is initiated (e.g. at an ATM, or over a telephone link), the user is asked to supply a speech sample. The sample typically comprises several words which correspond to some or all of the words previously recorded by the known identity. The system may prompt the user to repeat a number of specific words. The user's vocal rendition of these words is then digitised, and a predetermined number of acoustic features are extracted from the digitised samples. These acoustic features, together with the identity claim, are
then subjected to a number of decision algorithms. The outputs of the individual decision algorithms are employed by a final decision mechanism, such as an artificial neural network of multi-layer perceptron (MLP) architecture, to make a final decision as to whether to accept or reject the identity claim.
By using a multiple sub-system approach of employing several decision algorithms in parallel, the outputs of which are combined via the neural network, the final decision is a far more reliable indicator of identity than systems hitherto used.
Referring to Fig. 1, the transaction initiation (and prompt) section of the speaker verification system is the primary user interface of the system. This user interface is necessarily application dependent. For instance, a speaker verifying ATM would include a microphone, loudspeaker and the current LED display, while remote phone transactions involving speaker verification would make use of the simple resource of the telephone handset.
In use, once an identity claim is made (e.g. by inserting a card), the system prompts the user to repeat a number of predetermined words. These words are selected from a known list and presented in a random order so as to minimise potential abuse of the system (i.e. the possibility of using a recording of the true speaker is minimised). The user is audibly prompted for each required utterance using a voice. For those applications in which a visual display of text is also possible (such as an ATM), the requested utterances can be displayed instead of, or in addition to, the audible prompt.
The user's utterances are then converted into a format suitable for the decision algorithms. Signals from the microphone at the point of application (e.g. telephone or ATM microphone) are digitised using suitable analog-to-digital (A/D) hardware. The resulting stream of numbers is analysed via a speech detection algorithm
in order to determine the start and end points of the user's utterances of the individual words.
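The endpoint-detection step above can be sketched as a simple frame-energy threshold test. This is an illustrative assumption: the patent does not specify the detection algorithm, and the frame length and threshold below are arbitrary values chosen for the example.

```python
def detect_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices of the region whose frame
    energy exceeds the threshold; a crude endpoint detector."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    energies = [sum(s * s for s in f) / len(f) for f in frames if f]
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None  # no speech found
    return active[0] * frame_len, (active[-1] + 1) * frame_len

# Silence, then a burst of "speech", then silence again.
signal = [0.0] * 320 + [0.5, -0.5] * 160 + [0.0] * 320
print(detect_endpoints(signal))  # → (320, 640)
```

Production systems typically add hangover smoothing and adaptive noise floors, omitted here for brevity.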
The resulting speech waveform (with all portions of silence eliminated) is split into a number of small overlapping frames (typically of 64ms duration) in order to exploit the pseudo-stationary nature of speech.
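The framing step might look as follows. The 64 ms figure comes from the text; the 8 kHz sampling rate (giving 512-sample frames) and the 50% overlap are assumptions made for illustration, as the patent states neither.

```python
def split_frames(samples, frame_len=512, hop=256):
    """Split a signal into overlapping frames: 64 ms at an assumed
    8 kHz rate, with an assumed 50% overlap between frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

frames = split_frames(list(range(2048)))
print(len(frames), len(frames[0]))  # → 7 512
```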
Various signal processing techniques are employed to convert each frame into a smaller number of acoustic features which have been found to maximally encode the identity of the speaker. In the preferred embodiment, the characteristic acoustic parameters which are extracted for utilisation in the decision algorithms include:
• Cepstral co-efficients
• Fundamental frequency (pitch)
• Energy
• Duration
• Zero crossing rate
• Linear prediction co-efficients
The methods used to extract these acoustic features are known, and need not be described in detail in this application.
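Two of the simpler features in this list, frame energy and zero crossing rate, can be sketched directly; cepstral and linear prediction co-efficients require more machinery and are omitted here.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs that change sign."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

frame = [0.3, -0.2, 0.4, -0.1, 0.2, -0.3]
print(round(frame_energy(frame), 3))        # → 0.072
print(round(zero_crossing_rate(frame), 2))  # → 1.0
```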
In the preferred embodiment, a plurality of independent algorithms process the extracted set of acoustic features and the identity claim, each outputting a score measuring the closeness of the acoustic features to those previously extracted from speech samples of the true identity and stored in memory.
Four independent decision algorithms are used in the described embodiment, namely
• Dynamic Time Warping (DTW)
• Vector Quantisation (VQ)
• Recurrent Neural Network (RNN)
• Long Term Features (LTF)
These algorithms are described below.
Dynamic Time Warping
In each decision algorithm, the acoustic features extracted from the user's speech are compared or correlated with those of the individual whose identity is being claimed. The latter features are pre-stored in a set of reference templates. One obstacle to comparison of incoming acoustic features with those stored as templates is that speakers often repeat the same word with slight timing differences at each utterance. Dynamic time warping (DTW) is a technique for normalising utterances (of the same word) to the same duration, hence allowing a much simpler comparison of acoustic features. DTW is described in more detail in references [1, 2] identified at the end of this specification. The disclosure of all such references is incorporated herein by reference. For each word uttered by each speaker, five reference templates are stored. A recent innovation to the DTW process has been to employ the time alignment information to enhance the performance of the DTW algorithm [as described in references 3, 4]. A schematic diagram of the DTW process is shown in Fig. 2. Namely, an input and reference frame are time aligned so as to minimise the difference. A score is output by the decision algorithm, dependent on the closeness of the acoustic features of the time aligned input and reference frames.
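The core DTW recursion can be sketched as follows. This is the textbook dynamic-programming formulation over scalar features, not necessarily the exact variant of references [1, 2].

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic time warping distance between two feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            # Best of: stretch the input, stretch the reference, or advance both.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# The same "word" spoken twice as slowly still aligns perfectly...
print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 2, 3, 3, 4, 4]))  # → 0.0
# ...while a genuinely different pattern accumulates cost.
print(dtw_distance([1, 2, 3, 4], [2, 3, 4, 5]))  # → 2.0
```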
Vector Quantisation
Vector quantisation (VQ) is a technique chiefly employed for data reduction by computing a code book of elements onto which all input frames are matched. A discussion of the vector quantisation approach to speaker recognition can be found in reference [5].
As employed for speaker verification [6, 3], separate code books are constructed for each speaker. The distortion (or separation) between input frames and the nearest code book element in the code book is averaged and used as a measure of the closeness of the user to that person for whom the code book was constructed.
Fig. 3 is a schematic diagram of the mapping from input frame (I) to the nearest code book element showing the distortion (arrow length) for that frame. Preferably, a vector of distortion values is derived rather than a simple mean. Linear weighting of these values has been found to increase speaker verification performance over that of a standard mean distortion measure.
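The per-frame distortion vector described above might be computed as follows; the two-dimensional, two-element code book is a toy assumption chosen so the arithmetic is easy to follow.

```python
def vq_distortions(frames, codebook):
    """Squared-error distortion from each input frame to its nearest
    codebook vector (arrow length in Fig. 3, squared)."""
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(d2(f, c) for c in codebook) for f in frames]

codebook = [(0.0, 0.0), (1.0, 1.0)]          # toy per-speaker code book
frames = [(0.1, 0.0), (0.9, 1.0), (0.5, 0.5)]
ds = vq_distortions(frames, codebook)
print([round(d, 2) for d in ds])             # → [0.01, 0.01, 0.5]
mean_distortion = sum(ds) / len(ds)          # the simple-mean score
```

Keeping the whole vector `ds` (rather than only `mean_distortion`) is what allows the linearly weighted variant mentioned in the text.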
Recurrent Neural Network
A neural network is a type of artificial-intelligence system modelled after the neurons (nerve cells) in a biological nervous system and intended to simulate the way in which a brain processes information, learns, and remembers. A neural network is designed as an interconnected system of processing elements, each with a limited number of inputs (comparable to the impulse-receiving dendrites of a neuron) and an output (comparable to the synapse over which a nerve impulse travels to the next neuron). Rather than being programmed, these processing elements are able to "learn" by receiving weighted inputs - roughly, weak to strong or negative to positive - that, with adjustment, time, and repetition, can be made to produce appropriate outputs. Neural networks can be implemented either through hardware circuits (the fast method) or through software that simulates such a network (a slower method). Neural networks help computers "learn" by association and recognition.
Recurrent neural networks (RNN) [7] have found recent application in the area of speech processing due to their ability to handle time-varying signals. One of the broad family of Multi-Layer Perceptrons [8], RNNs differ by employing self-connections at each node, thus allowing previous frames of data to influence current outputs.
As applied to speaker verification [9], a separate RNN is trained for each speaker. The network is repeatedly presented with samples of speech from the individual and from other speakers. Fig. 4 illustrates the architecture of a typical RNN to which frames of acoustic features are fed. The network "learns" the characteristics of the individual's voice by altering node connection strengths to produce outputs corresponding to those shown (e.g. an output of 1 when the input is from the true identity, and an output of -1 when the input is from any other speaker).
The decision algorithm involves passing the input features extracted from the user's speech through the RNN for the claimed identity. The decision algorithm output is based on the output of the network.
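For illustration only, the self-connection that distinguishes an RNN node can be sketched as a single tanh unit whose previous output feeds back into its input as each frame arrives; the weights below are arbitrary stand-ins, not trained values from the disclosure:

```python
import math

# Toy sketch of one recurrent node: the self-connection (w_self) feeds
# the previous output back in, so earlier frames influence the current
# output. Weights are illustrative placeholders, not trained values.
def recurrent_node(frames, w_in=0.8, w_self=0.5, bias=0.0):
    """Run one tanh node with a self-connection over a frame sequence."""
    h = 0.0
    for x in frames:
        h = math.tanh(w_in * x + w_self * h + bias)  # previous h feeds back
    return h  # final output, nominally near +1 (true speaker) or -1 (impostor)
```

A trained network of such nodes would drive this output towards +1 for the enrolled speaker and -1 for impostors, which is what the decision algorithm then thresholds.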
Long Term Features
Long term features (LTF) [4], a technique applied for text-independent speaker recognition, averages each acoustic feature over the duration of a speaker's utterance. The mean and variance values for the acoustic features of the user's utterance are then compared with similar values derived from utterances of the claimed identity.
A multi-layer perceptron may be used to compute the similarity between the LTFs derived from the user utterance and those LTFs for the claimed identity [10].
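The mean-and-variance computation described for LTFs can be sketched as follows (illustrative only; the comparison function here is a plain Euclidean distance, standing in for the MLP similarity measure of reference [10]):

```python
# Illustrative long term features: per-feature mean and variance over an
# utterance, compared against stored values for the claimed identity.
def long_term_features(frames):
    """frames: list of equal-length feature vectors -> (means, variances)."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / n
                 for d in range(dims)]
    return means, variances

def ltf_distance(ltf_a, ltf_b):
    """Euclidean distance between two (means, variances) pairs.

    A stand-in for the MLP similarity of reference [10]."""
    flat_a = ltf_a[0] + ltf_a[1]
    flat_b = ltf_b[0] + ltf_b[1]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) ** 0.5
```

Because only utterance-level statistics survive, the comparison is independent of what was said, which is why the text calls this a text-independent technique.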
Rather than using a single decision algorithm to verify or reject a user's claim of identity, the speaker verification system of the preferred embodiment employs all four above-described independent decision algorithms to make the binary decision to accept or reject the user's identity claim.
More particularly, a neural network of MLP architecture is employed to combine the disparate outputs of the four independent decision algorithms so as to arrive at the final decision. The MLP may suitably be a software or hardware-based neural network. As shown in Fig. 5, for each word that the user is requested to utter, the outputs of the four decision algorithms are fed to the MLP, and its output is compared with a threshold value. Based on the comparison of the MLP output value with the threshold value for each of the words requested from the user, the identity claim is either accepted or rejected.
Preferably, the threshold value can be adjusted to vary the level of security of the speaker recognition system.
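The fusion step — the four algorithm scores fed to an MLP whose output is compared with an adjustable threshold — can be sketched as below; the network weights are illustrative placeholders, not values from the disclosure:

```python
import math

# Illustrative fusion MLP: the four decision-algorithm scores (DTW, VQ,
# RNN, LTF) are combined by one hidden layer of tanh units, and the
# output is thresholded. Weights are placeholders, not trained values.
def mlp_fuse(scores, w_hidden, w_out):
    """One hidden layer of tanh units over the four algorithm scores."""
    hidden = [math.tanh(sum(w * s for w, s in zip(ws, scores)))
              for ws in w_hidden]
    return math.tanh(sum(w * h for w, h in zip(w_out, hidden)))

def verify(scores, w_hidden, w_out, threshold=0.0):
    """Accept the identity claim if the fused score exceeds the threshold."""
    return mlp_fuse(scores, w_hidden, w_out) > threshold
```

Raising the threshold rejects borderline claims, which is the security adjustment the preceding paragraph describes.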
The speaker verification system includes a training or adaptive learning facility. In order for each decision algorithm to compute a score based on the correlation of the input speech with speech from the claimed identity, samples of speech from the claimed identity are required to train or serve as templates for the algorithm. The process of collecting the speech data and updating the algorithms to use the new data is known as training. At the time of entering a new identity on the system, that person must leave a sample of his/her voice to serve as training and reference templates. Typically, the person is required to repeat all words in a limited vocabulary a number of times, and the digitised speech is then converted to acoustic features as described above. These values are then used by the algorithms to build code books, reference templates or train networks to be representative of that speaker, the exact details being dependent on the particular algorithm. When a user, who is verified as the claimed identity, uses the system, the reference templates, code books, etc. of the individual algorithms are updated to take account of the subtle changes in the user's voice over time. Thus, the system is trained to adapt to changes in the person's voice which may develop with age.
The processing can be performed in software. However, the four decision algorithms can be run in hardware if desired, such as on a dedicated DSP board.
Furthermore, the four algorithms can be run in parallel so as to minimise decision time. The digitisation and feature extraction can also be performed principally in hardware if desired. In most applications, the processing will be split between local and centralised processing. A centralised storage centre holds the reference templates, code books and network weights for each speaker. The appropriate templates, code books etc. for the claimed identity may then be down-loaded to the processing units.
Processing of identity claims may be carried out centrally in a "bank" of processors or localised at the point of transaction. Either approach may be accommodated, the optimum arrangement being dictated by application conditions and constraints. For example, point-of-sale credit card authorisation via speaker verification would normally be performed at some centralised processing centre, while authenticating ATM transactions by speaker verification could be performed either locally at the ATM, or centrally.
The foregoing describes only one embodiment of the invention, and modifications which are obvious to those skilled in the art may be made thereto without departing from the scope of the invention. For example, although the preferred embodiment has been described with reference to four particular decision algorithms, the number, and type, of decision algorithms can be varied.
Further, although the invention has been described with particular reference to speaker verification, it can also be used for speaker identification. In such an application, the characteristics derived from the speech sample of the user are compared with the characteristics from speech samples from a list of persons using the above-described techniques. The system will indicate whether the speaker is one of the persons on the list, and/or the person(s) on the list which most closely match(es) the speaker.
REFERENCES
[1] Furui, S., "Speaker Independent Isolated Word Recognition Using Dynamic Features of Speech Spectra", IEEE Trans. ASSP, vol. 34, 1986, 52-59.
[2] Doddington, G., "A Method of Speaker Verification", PhD Thesis, The University of Wisconsin, 1971.
[3] Booth, I., Barlow, M., Watson, B., "Enhancements to DTW and VQ decision algorithms for speaker recognition", Proc. Fourth Aust. Int. Conf. Speech Science and Technology, Brisbane, December 1992, 483-488.
[4] Barlow, M., "Prosodic Acoustic Correlates of Speaker Characteristics", PhD Thesis, University of NSW, 1991.
[5] Soong, F., Rosenberg, A., Rabiner, L., Juang, B., "A Vector Quantisation Approach to Speaker Recognition", Proc. ICASSP-85, 1985, 387-390.
[6] Matsui, T., Furui, S., "Comparison of Text-Independent Speaker Recognition Methods using VQ-Distortion and Discrete/Continuous HMMs", Proc. ICASSP-91, 1991, 157-160.
[7] Pineda, F.J., "Generalization of Back-Propagation to Recurrent Neural Networks", Physical Review Letters, vol. 59, no. 19, 1987, 2229-2232.
[8] Hertz, J., Krogh, A., Palmer, R.G., "Introduction to the Theory of Neural Computation", Addison-Wesley, 1991.
[9] Shrimpton, D., Watson, B., "Comparison of Recurrent Neural Network Architectures for Speaker Verification", Proc. Fourth Aust. Int. Conf. Speech Science and Technology, Brisbane, December 1992, 460-464.
[10] Blauensteiner, L., "Speaker Verification Group Report on Long-Term Feature Averaging Techniques Using Neural Network Implementations", Tech. Report, Speaker Verification Group, University of Qld, 1993.
Claims
1. A method of verifying whether a speaker is a claimed identity, comprising the steps of obtaining a speech sample from the speaker, deriving one or more features from the speech sample, comparing the derived feature(s) with stored feature(s) previously derived from a speech sample of the claimed identity, using a plurality of separate algorithms, characterised in that the method further comprises the step of processing a combination of the individual results obtained using the plurality of algorithms to arrive at a final verification decision.
2. A method as claimed in claim 1, wherein the processing step comprises processing the results of the separate algorithms through an artificial neural network trained on data relating to the claimed identity.
3. A method as claimed in claim 2, wherein the artificial neural network is of multi-layer perceptron architecture.
4. A method as claimed in claim 2, wherein the processing step further comprises comparing the output of the artificial neural network with a threshold value, the final verification decision being dependent on that comparison.
5. A method as claimed in claim 4, wherein the threshold value is variable.
6. A method as claimed in claim 2, wherein in the event of a positive verification, the artificial neural network is further trained on data derived from the speech sample of the speaker.
7. A method as claimed in claim 1, wherein the feature(s) comprise(s) one or more of the following acoustic features:
Cepstral co-efficients
Fundamental frequency (pitch)
Energy
Duration
Zero Crossing Rate
Linear Prediction co-efficients.
8. A method as claimed in claim 7 wherein the step of deriving one or more features from the speech sample comprises converting the speech sample to digital form, removing portions of silence from the resultant digital speech waveform, and dividing the waveform into a plurality of overlapping frames.
9. A method as claimed in claim 1, wherein said algorithms comprise the following algorithms:
• Dynamic Time Warping
• Vector Quantisation
• Recurrent Neural Network
• Long Term Features.
10. A method as claimed in claim 9, wherein the comparison of the derived feature(s) with the stored feature(s) is carried out using said plurality of separate algorithms in parallel.
11. Apparatus for verifying whether a speaker is a claimed identity, comprising means for obtaining a speech sample from a speaker, means for deriving one or more feature(s) of the speech sample, memory means for storing one or more feature(s) previously derived from a speech sample of the claimed identity, means for comparing the feature(s) derived from the speaker with the stored feature(s), using a plurality of separate algorithms, characterised in that the apparatus further comprises processing means for deriving a final verification decision from a combination of the results of the plurality of algorithms.
12. Apparatus as claimed in claim 11, wherein said processing means includes an artificial neural network.
13. Apparatus as claimed in claim 12, wherein the feature(s) comprises one or more of the following acoustic features:
Cepstral co-efficients
Fundamental frequency (pitch)
Energy
Duration
Zero Crossing Rate
Linear Prediction co-efficients.
14. Apparatus as claimed in claim 12, wherein the algorithms comprise the following algorithms:
• Dynamic Time Warping
• Vector Quantisation
• Recurrent Neural Network
• Long Term Features.
15. A method of ascertaining whether a speaker is one of a group of identities, comprising the steps of obtaining a speech sample from the speaker, deriving one or more features from the speech sample, comparing the derived feature(s) with stored feature(s) previously derived from speech samples of the identities, using a plurality of separate algorithms, characterised in that for each comparison with a particular identity in the group, the method further comprises the step of processing a combination of the individual results obtained using the plurality of algorithms to arrive at a final decision with regard to that identity.
16. A method as claimed in claim 15, wherein the processing step comprises processing the results of the separate algorithms through an artificial neural network trained on data relating to the claimed identity.
17. Apparatus for ascertaining whether a speaker is one of a group of identities, comprising means for obtaining a speech sample from a speaker, means for deriving one or more feature(s) of the speech sample, memory means for storing one or more feature(s) previously derived from speech samples of the identities, means for comparing the feature(s) derived from the speaker with the stored feature(s) derived from the identities, using a plurality of separate algorithms, characterised in that the apparatus further comprises processing means for deriving a final decision in relation to an identity from a combination of the results of the plurality of algorithms used in the comparison with that identity.
18. Apparatus as claimed in claim 17, wherein said processing means includes an artificial neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU73786/94A AU7378694A (en) | 1993-08-12 | 1994-08-12 | A speaker verification system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AUPM0542 | 1993-08-12 | ||
AUPM054293 | 1993-08-12 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1995005656A1 true WO1995005656A1 (en) | 1995-02-23 |
Family
ID=3777128
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/AU1994/000468 WO1995005656A1 (en) | 1993-08-12 | 1994-08-12 | A speaker verification system |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO1995005656A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0121248A1 (en) * | 1983-03-30 | 1984-10-10 | Nec Corporation | Speaker verification system and process |
AU8649691A (en) * | 1990-10-03 | 1992-04-28 | Imagination Technologies Limited | Methods and apparatus for verifying the originator of a sequence of operations |
EP0592150A1 (en) * | 1992-10-09 | 1994-04-13 | AT&T Corp. | Speaker verification |
DE4240978A1 (en) * | 1992-12-05 | 1994-06-09 | Telefonbau & Normalzeit Gmbh | Improving recognition quality for speaker identification - verifying characteristic vectors and corresp. index sequence provided by vector quantisation |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0870300A1 (en) * | 1995-06-07 | 1998-10-14 | Rutgers University | Speaker verification system |
US5839103A (en) * | 1995-06-07 | 1998-11-17 | Rutgers, The State University Of New Jersey | Speaker verification system using decision fusion logic |
EP0870300A4 (en) * | 1995-06-07 | 1999-04-21 | Univ Rutgers | Speaker verification system |
EP0780830A3 (en) * | 1995-12-22 | 1998-08-12 | Ncr International Inc. | Speaker verification system |
DE19630109A1 (en) * | 1996-07-25 | 1998-01-29 | Siemens Ag | Method for speaker verification using at least one speech signal spoken by a speaker, by a computer |
WO1998040875A1 (en) * | 1997-03-13 | 1998-09-17 | Telia Ab (Publ) | Speaker verification system |
EP0902415A1 (en) * | 1997-09-15 | 1999-03-17 | Koninklijke KPN N.V. | Method of and arrangement for providing improved speaker reference data and speaker verification |
WO1999014742A1 (en) * | 1997-09-15 | 1999-03-25 | Koninklijke Kpn N.V. | Method and arrangement for providing speaker reference data for speaker verification |
US6249759B1 (en) | 1998-01-16 | 2001-06-19 | Nec Corporation | Communication apparatus using speech vector comparison and recognition |
GB2334864A (en) * | 1998-01-16 | 1999-09-01 | Nec Corp | Mobile phone has vector coded password protection |
GB2334864B (en) * | 1998-01-16 | 2000-03-15 | Nec Corp | Communication apparatus |
US6185536B1 (en) * | 1998-03-04 | 2001-02-06 | Motorola, Inc. | System and method for establishing a communication link using user-specific voice data parameters as a user discriminator |
EP2368213A2 (en) * | 2008-11-28 | 2011-09-28 | The Nottingham Trent University | Biometric identity verification |
US10257191B2 (en) | 2008-11-28 | 2019-04-09 | Nottingham Trent University | Biometric identity verification |
WO2014114116A1 (en) * | 2013-01-28 | 2014-07-31 | Tencent Technology (Shenzhen) Company Limited | Method and system for voiceprint recognition |
US9502038B2 (en) | 2013-01-28 | 2016-11-22 | Tencent Technology (Shenzhen) Company Limited | Method and device for voiceprint recognition |
WO2021075012A1 (en) * | 2019-10-17 | 2021-04-22 | 日本電気株式会社 | Speaker authentication system, method, and program |
JPWO2021075012A1 (en) * | 2019-10-17 | 2021-04-22 | ||
JP7259981B2 (en) | 2019-10-17 | 2023-04-18 | 日本電気株式会社 | Speaker authentication system, method and program |
CN112885355A (en) * | 2021-01-25 | 2021-06-01 | 上海头趣科技有限公司 | Speech recognition method based on multiple features |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2021286422B2 (en) | End-to-end speaker recognition using deep neural network | |
US6539352B1 (en) | Subword-based speaker verification with multiple-classifier score fusion weight and threshold adaptation | |
Melin et al. | Voice Recognition with Neural Networks, Type-2 Fuzzy Logic and Genetic Algorithms. | |
US7502736B2 (en) | Voice registration method and system, and voice recognition method and system based on voice registration method and system | |
WO2010120626A1 (en) | Speaker verification system | |
WO1995005656A1 (en) | A speaker verification system | |
Dash et al. | Speaker identification using mel frequency cepstralcoefficient and bpnn | |
Karthikeyan et al. | Hybrid machine learning classification scheme for speaker identification | |
Ozaydin | Design of a text independent speaker recognition system | |
zohra Chelali et al. | Speaker identification system based on PLP coefficients and artificial neural network | |
Shah et al. | Interactive voice response with pattern recognition based on artificial neural network approach | |
RU2161826C2 (en) | Automatic person identification method | |
Shah et al. | Neural network solution for secure interactive voice response | |
Naik et al. | Evaluation of a high performance speaker verification system for access Control | |
Olsson | Text dependent speaker verification with a hybrid HMM/ANN system | |
Sharma et al. | Text-independent speaker identification using backpropagation mlp network classifier for a closed set of speakers | |
Reda et al. | Artificial neural network & mel-frequency cepstrum coefficients-based speaker recognition | |
Das | Utterance based speaker identification using ANN | |
Melin et al. | Voice recognition with neural networks, fuzzy logic and genetic algorithms | |
Faundez-Zanuy et al. | Nonlinear predictive models: overview and possibilities in speaker recognition | |
Ren et al. | A hybrid GMM speaker verification system for mobile devices in variable environments | |
Abd Al-Rahman et al. | Using Deep Learning Neural Networks to Recognize and Authenticate the Identity of the Speaker | |
Anitha et al. | PASSWORD SECURED SPEAKER RECOGNITION USING TIME AND FREQUENCY DOMAIN FEATURES | |
Nedic et al. | Recent developments in speaker verification at IDIAP | |
Chetouani et al. | A new nonlinear feature extraction algorithm for speaker verification. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AM AT AU BB BG BR BY CA CH CN CZ DE DK ES FI GB GE HU JP KE KG KP KR KZ LK LT LU LV MD MG MN MW NL NO NZ PL PT RO RU SD SE SI SK TJ TT UA US UZ VN |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): KE MW SD AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG |
|
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
122 | Ep: pct application non-entry in european phase | ||
NENP | Non-entry into the national phase |
Ref country code: CA |