WO1995005656A1 - A speaker verification system - Google Patents

A speaker verification system

Info

Publication number
WO1995005656A1
WO1995005656A1 (PCT/AU1994/000468)
Authority
WO
WIPO (PCT)
Prior art keywords
algorithms
speaker
identity
feature
speech sample
Application number
PCT/AU1994/000468
Other languages
French (fr)
Inventor
Thomas Downs
Ah-Chung Tsoi
Mark Schulz
Brian Carrington Lovell
Ian Michael Booth
Michael Glynn Barlow
Original Assignee
The University Of Queensland
Application filed by The University Of Queensland filed Critical The University Of Queensland
Priority to AU73786/94A priority Critical patent/AU7378694A/en
Publication of WO1995005656A1 publication Critical patent/WO1995005656A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/10: Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
    • G10L17/18: Artificial neural networks; Connectionist approaches


Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The claimed identity of a person is verified from a sample of that person's speech. The speech sample is processed to extract a set of characteristic features. These features are then compared with stored features previously derived from a speech sample of the claimed identity, using a plurality of separate algorithms in parallel. The outputs of the individual algorithms are then fed to an artificial neural network which is used as a further decision algorithm to arrive at the final decision to accept or reject the speaker's identity claim. By using several different decision algorithms, and combining the outputs of those algorithms in a neural network, the speaker verification technique is more universally applicable, and hence more reliable.

Description

"A SPEAKER VERIFICATION SYSTEM" THIS INVENTION relates to method and apparatus for speaker verification. In particular, the invention is directed to a speaker recognition technique in which characteristics of a speech sample are compared with stored characteristics using several different algorithms, and the results of the individual algorithms are combined in a neural network to arrive at a final decision. BACKGROUND ART
The aim of speaker verification is to accept or reject a claim to a particular identity based on one or more samples of the claimant's speech. Unlike speech recognition which is aimed at deciphering the spoken word, speaker verification is an identification process. It has been found that a person's voice can be used as an identifying characteristic of that person. The person's voice can therefore be used as a robust, secure means of identification, obviating the need for artificial measures such as PIN numbers, security codes, access cards etc.
Speaker verification can be used as a security measure in any area or application in which the identity of an individual must be authenticated. Areas of immediate application include:
• Automatic Teller Machine (ATM) transactions
• Telephone banking
• Access control (e.g. to secure building access)
• Credit card transactions

In particular, speaker verification is ideally suited to transactions conducted over a telephone link, as other means of authentication are unsuitable or impractical.
Although there are various known techniques for speaker verification, such techniques normally rely upon a single decision-making algorithm. It has been found that the accuracy can vary depending on the particular characteristic being compared, or the particular decision-making algorithm being used. Furthermore, while some speech features may be accurate characterising features of some speakers, other speakers may have different characterising speech features. Thus, the known speaker verification techniques are generally not universally applicable.
It is an object of the present invention to provide an improved method and apparatus for speaker verification which overcomes or ameliorates the disadvantages of known techniques, or which at least provides the consumer with a useful choice.
SUMMARY OF THE INVENTION

In one broad form, the present invention provides a method of verifying whether a speaker is a claimed identity, comprising the steps of obtaining a speech sample from the speaker, deriving one or more features from the speech sample, comparing the derived feature(s) with stored feature(s) previously derived from a speech sample of the claimed identity, using a plurality of separate algorithms, characterised in that the method further comprises the step of processing a combination of the individual results obtained using the plurality of algorithms to arrive at a final verification decision.
In another broad form, the present invention provides apparatus for verifying whether a speaker is a claimed identity, comprising means for obtaining a speech sample from a speaker, means for deriving one or more feature(s) of the speech sample, memory means for storing one or more feature(s) previously derived from a speech sample of the claimed identity, means for comparing the feature(s) derived from the speaker with the stored feature(s), using a plurality of separate algorithms, characterised in that the apparatus further comprises processing means for deriving a final verification decision from a combination of the results of the plurality of algorithms.
The speech sample may be obtained via a microphone, telephone link or other suitable audio device. The speech sample may comprise one or more words. The speech sample is typically converted to digital format, and a set of predetermined characteristic features are derived from the digitised sample. These features may include cepstral co-efficients, fundamental frequency (or pitch), energy, duration, zero crossing rate and linear prediction co-efficients. The derived features are then compared with similar characteristic features previously derived from a speech sample of the claimed identity and stored in a suitable memory.
Several independent decision algorithms are used in the comparison. Such algorithms may suitably include dynamic time warping, vector quantisation, recurrent neural network and long term features.
Unlike known techniques, the outputs of all the independent algorithms are then utilised to arrive at a final verification decision. In the preferred embodiment, a neural network is used to compute the similarity between the speaker's speech sample and that of the claimed identity. The neural network of the claimed identity has been previously trained to distinguish the speech sample of the claimed identity from others stored in memory using an iterative process. The output of the decision algorithms for the user's sample are applied to the known identity's neural network for final verification. The system also includes a training facility to allow the stored sample of the claimed identity to be updated when a positive verification is made, to accommodate or compensate for changes in speech patterns with age.
In order that the invention may be more fully understood and put into practice, a preferred embodiment will now be described with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a schematic diagram of the speaker verification system of the preferred embodiment;
Fig. 2 is a schematic of a dynamic time warping process;
Fig. 3 is a schematic diagram of vector quantisation mapping;
Fig. 4 illustrates the architecture of a typical recurrent neural network; and
Fig. 5 illustrates the architecture of the multi-layer subsystem arrangement of Fig. 1.
DESCRIPTION OF PREFERRED EMBODIMENT
In the speaker verification system of the preferred embodiment, characteristic features of a speech sample are extracted and correlated with corresponding features of a stored speech sample previously obtained from the true identity. (Typically, the features of the latter speech sample are stored instead of, or in addition to, the speech sample itself). The correlation outputs are then combined in an artificial neural network to arrive at the final decision.
The major components of the speaker verification system are shown in Fig. 1. When a transaction is initiated (e.g. at an ATM, or over a telephone link), the user is asked to supply a speech sample. The sample typically comprises several words which correspond to some or all of the words previously recorded by the known identity. The system may prompt the user to repeat a number of specific words. The user's vocal rendition of these words is then digitised, and a predetermined number of acoustic features are extracted from the digitised samples. These acoustic features, together with the identity claim, are then subjected to a number of decision algorithms. The outputs of the individual decision algorithms are employed by a final decision mechanism, such as an artificial neural network of multi-layer perceptron (MLP) architecture, to make a final decision as to whether to accept or reject the identity claim.
By using a multiple-subsystem approach, in which several decision algorithms are employed in parallel and their outputs are combined via the neural network, the final decision is a far more reliable indicator of identity than that of systems hitherto used.
Referring to Fig. 1, the transaction initiation (and prompt) section of the speaker verification system is the primary user interface of the system. This user interface is necessarily application dependent. For instance, a speaker-verifying ATM would include a microphone, loudspeaker and the usual LED display, while remote telephone transactions involving speaker verification would simply use the telephone handset.
In use, once an identity claim is made (e.g. by inserting a card), the system prompts the user to repeat a number of predetermined words. These words are selected from a known list and presented in a random order so as to minimise potential abuse of the system (i.e. the possibility of using a recording of the true speaker is minimised). The user is audibly prompted for each required utterance by a voice prompt. For those applications in which a visual display of text is also possible (such as an ATM), the requested utterances can be displayed instead of, or in addition to, the audible prompt.
The user's utterances are then converted into a format suitable for the decision algorithms. Signals from the microphone at the point of application (e.g. telephone or ATM microphone) are digitised using suitable analog-to-digital (A/D) hardware. The resulting stream of numbers is analysed by a speech detection algorithm in order to determine the start and end points of the user's utterances of the individual words.
The resulting speech waveform (with all portions of silence eliminated) is split into a number of small overlapping frames (typically of 64ms duration) in order to exploit the pseudo-stationary nature of speech.
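By way of illustration only, the framing step can be sketched as follows (Python/NumPy; the 8 kHz telephone-bandwidth sample rate and 50% overlap are assumptions, the specification fixing only the approximately 64ms frame length):

```python
import numpy as np

def frame_signal(speech, sample_rate=8000, frame_ms=64, hop_ms=32):
    """Split a silence-trimmed speech waveform into overlapping frames.

    The 64 ms frame length follows the specification; the sample rate
    and hop size are illustrative choices only.
    """
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)      # samples between frame starts
    n_frames = 1 + max(0, (len(speech) - frame_len) // hop_len)
    return np.stack([speech[i * hop_len:i * hop_len + frame_len]
                     for i in range(n_frames)])     # (n_frames, frame_len)
```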
Various signal processing techniques are employed to convert each frame into a smaller number of acoustic features which have been found to maximally encode the identity of the speaker. In the preferred embodiment, the characteristic acoustic parameters which are extracted for utilisation in the decision algorithms include:
• Cepstral co-efficients
• Fundamental frequency (pitch)
• Energy
• Duration
• Zero crossing rate
• Linear prediction co-efficients

The methods used to extract these acoustic features are known, and need not be described in detail in this application.
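Two of the listed features can nonetheless be computed directly from a frame without further machinery, as the following sketch shows (Python/NumPy; the function name is illustrative):

```python
import numpy as np

def frame_features(frame):
    """Energy and zero crossing rate for a single frame.

    The cepstral and linear prediction co-efficients would come from
    standard signal-processing routines and are omitted here.
    """
    frame = frame.astype(np.float64)
    energy = float(np.sum(frame ** 2))
    # A zero crossing occurs wherever consecutive samples change sign.
    zcr = float(np.mean(np.sign(frame[:-1]) != np.sign(frame[1:])))
    return np.array([energy, zcr])
```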
In the preferred embodiment, a plurality of independent algorithms process the extracted set of acoustic features and the identity claim, each outputting a score measuring the closeness of the acoustic features to those previously extracted from speech samples of the true identity and stored in memory.
Four independent decision algorithms are used in the described embodiment, namely
• Dynamic Time Warping (DTW)
• Vector Quantisation (VQ)
• Recurrent Neural Network (RNN)
• Long Term Features (LTF)

These algorithms are described below.
Dynamic Time Warping
In each decision algorithm, the acoustic features extracted from the user's speech are compared or correlated with those of the individual whose identity is being claimed. The latter features are pre-stored in a set of reference templates. One obstacle to comparison of incoming acoustic features with those stored as templates is that speakers often repeat the same word with slight timing differences at each utterance. Dynamic time warping (DTW) is a technique for normalising utterances (of the same word) to the same duration, hence allowing a much simpler comparison of acoustic features. DTW is described in more detail in references [1, 2] identified at the end of this specification. The disclosure of all such references is incorporated herein by reference. For each word uttered by each speaker, five reference templates are stored. A recent innovation to the DTW process has been to employ the time alignment information to enhance the performance of the DTW algorithm [as described in references 3, 4]. A schematic diagram of the DTW process is shown in Fig. 2. Namely, an input and reference frame are time aligned so as to minimise the difference. A score is output by the decision algorithm, dependent on the closeness of the acoustic features of the time-aligned input and reference frames.
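A textbook DTW alignment cost, sketched below in Python/NumPy, illustrates the idea; the local path constraints and score normalisation of the actual system are not specified here and are assumptions:

```python
import numpy as np

def dtw_score(input_seq, reference_seq):
    """Dynamic time warping cost between two sequences of feature vectors.

    Lower cost means the two utterances of the word are closer after
    time alignment.
    """
    n, m = len(input_seq), len(reference_seq)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(input_seq[i - 1] - reference_seq[j - 1])
            # Allow a match, an insertion, or a deletion at each step.
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m] / (n + m)  # length-normalised alignment cost
```

With five reference templates stored per word, the word-level decision score could, for example, be the minimum cost over the five templates.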
Vector Quantisation
Vector quantisation (VQ) is a technique chiefly employed for data reduction by computing a code book of elements onto which all input frames are matched. A discussion of the vector quantisation approach to speaker recognition can be found in reference [5].
As employed for speaker verification [6, 3], separate code books are constructed for each speaker. The distortion (or separation) between input frames and the nearest code book element is averaged and used as a measure of the closeness of the user to the person for whom the code book was constructed.
Fig. 3 is a schematic diagram of the mapping from input frame (I) to the nearest code book element, showing the distortion (arrow length) for that frame. Preferably, a vector of distortion values is derived rather than a simple mean. Linear weighting of these values has been found to increase speaker verification performance over that of a standard mean distortion measure.
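The distortion computation can be sketched as follows (Python/NumPy; the k-means construction of the code book mentioned in the comment is an assumption, as the specification does not name a construction algorithm):

```python
import numpy as np

def vq_score(frames, codebook, weights=None):
    """Distortion of input frames against one speaker's code book.

    `codebook` is a (n_codes, n_features) array built from enrolment
    speech - by k-means clustering, say. Returns the mean distortion,
    or a linearly weighted sum over the per-frame distortion vector
    when `weights` (one weight per frame) is supplied, reflecting the
    weighted variant described above.
    """
    # Distance from every frame to every code word: (n_frames, n_codes).
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    distortions = dists.min(axis=1)        # nearest code word per frame
    if weights is None:
        return float(distortions.mean())   # standard mean distortion
    return float(np.dot(weights, distortions))
```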
Recurrent Neural Network
A neural network is a type of artificial-intelligence system modelled after the neurons (nerve cells) in a biological nervous system and intended to simulate the way in which a brain processes information, learns, and remembers. A neural network is designed as an interconnected system of processing elements, each with a limited number of inputs (comparable to the impulse-receiving dendrites of a neuron) and an output (comparable to the synapse over which a nerve impulse travels to the next neuron). Rather than being programmed, these processing elements are able to "learn" by receiving weighted inputs - roughly, weak to strong or negative to positive - that, with adjustment, time, and repetition, can be made to produce appropriate outputs. Neural networks can be implemented either through hardware circuits (the fast method) or through software that simulates such a network (a slower method). Neural networks help computers "learn" by association and recognition.
Recurrent neural networks (RNN) [7] have found recent application in the area of speech processing due to their ability to handle time-varying signals. One of the broad family of Multi-Layer Perceptrons [8], RNNs differ by employing self-connections at each node, thus allowing previous frames of data to influence current outputs.
As applied to speaker verification [9], a separate RNN is trained for each speaker. The network is repeatedly presented with samples of speech from the individual and from other speakers. Fig. 4 illustrates the architecture of a typical RNN to which frames of acoustic features are fed. The network "learns" the characteristics of the individual's voice by altering node connection strengths to produce outputs corresponding to those shown (e.g. an output of 1 when the input is from the true identity, and an output of -1 when the input is from any other speaker).
The decision algorithm involves passing the input features extracted from the user's speech through the RNN for the claimed identity. The decision algorithm output is based on the output of the network.
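The forward pass of such a network might be sketched as follows (Python/NumPy; the single-hidden-layer topology and tanh activations are illustrative assumptions, the specification describing only self-connected nodes trained per speaker):

```python
import numpy as np

def rnn_score(frames, W_in, w_self, w_out, b_h, b_o):
    """Run an utterance's feature frames through a small recurrent network.

    Per-speaker weights are assumed to have been trained toward +1 for
    the true speaker and -1 for any other speaker, as described above.
    """
    h = np.zeros(len(w_self))
    outputs = []
    for x in frames:
        # Each hidden unit also receives its own previous activation
        # (a self-connection), so earlier frames influence the output.
        h = np.tanh(W_in @ x + w_self * h + b_h)
        outputs.append(np.tanh(w_out @ h + b_o))
    return float(np.mean(outputs))  # utterance-level score in (-1, 1)
```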
Long Term Features
A technique applied for text-independent speaker recognition, long term features (LTF) [4] finds the average of each acoustic feature over the duration of a speaker's utterance. The mean and variance values for the acoustic features of the user's utterance are then compared with similar values derived from utterances of the claimed identity.
A multi-layer perceptron may be used to compute the similarity between the LTFs derived from the user utterance and those LTFs for the claimed identity [10].
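The LTF computation, and a stand-in comparison, can be sketched as follows (Python/NumPy; the cited approach [10] uses an MLP for the comparison, so the simple distance measure here is a simplifying assumption):

```python
import numpy as np

def long_term_features(frame_features):
    """Mean and variance of each acoustic feature over the utterance."""
    return np.concatenate([frame_features.mean(axis=0),
                           frame_features.var(axis=0)])

def ltf_score(user_frames, reference_ltf):
    """Closeness of the user's long term features to the stored ones.

    A negative Euclidean distance stands in for the MLP-based
    similarity computation; it is a placeholder, not the cited method.
    """
    user_ltf = long_term_features(user_frames)
    return -float(np.linalg.norm(user_ltf - reference_ltf))
```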
Rather than using a single decision algorithm to verify or reject a user's claim of identity, the speaker verification system of the preferred embodiment employs all four above-described independent decision algorithms to make the binary decision to accept or reject the user's identity claim.
More particularly, a neural network of MLP architecture is employed to combine the disparate outputs of the four independent decision algorithms so as to arrive at the final decision. The MLP may suitably be a software or hardware-based neural network. As shown in Fig. 5, for each word that the user is requested to utter, the outputs of the four decision algorithms are fed to the MLP, and its output is compared with a threshold value. Based on the comparison of the MLP output value with a threshold value for each of the words requested from the user, the identity claim is either accepted or rejected.
Preferably, the threshold value can be adjusted to vary the level of security of the speaker recognition system.
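The fusion stage might be sketched as follows (Python/NumPy; the MLP's layer sizes and activations, and the all-words-must-pass rule, are assumptions beyond what the specification fixes):

```python
import numpy as np

def mlp_fuse(scores, W1, b1, w2, b2):
    """Combine the four algorithm outputs (DTW, VQ, RNN, LTF) for one word.

    A one-hidden-layer MLP with illustrative tanh/sigmoid activations;
    the specification states only that an MLP combines the four scores.
    """
    h = np.tanh(W1 @ scores + b1)
    return float(1.0 / (1.0 + np.exp(-(w2 @ h + b2))))

def verify_claim(per_word_scores, mlp_params, threshold=0.5):
    """Accept the identity claim only if every requested word passes.

    Requiring all words to clear the threshold is one plausible reading
    of the per-word comparison; raising `threshold` tightens security
    at the cost of more false rejections.
    """
    return all(mlp_fuse(s, *mlp_params) > threshold for s in per_word_scores)
```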
The speaker verification system includes a training or adaptive learning facility. In order for each decision algorithm to compute a score based on the correlation of the input speech with speech from the claimed identity, samples of speech from the claimed identity are required to train, or serve as templates for, the algorithm. The process of collecting the speech data and updating the algorithms to use the new data is known as training. At the time of entering a new identity on the system, that person must leave a sample of his/her voice to serve as training and reference templates. Typically, the person is required to repeat all words in a limited vocabulary a number of times, and the digitised speech is then converted to acoustic features as described above. These values are then used by the algorithms to build code books, reference templates or train networks to be representative of that speaker, the exact details being dependent on the particular algorithm. When a user who is verified as the claimed identity uses the system, the reference templates, code books, etc. of the individual algorithms are updated to take account of the subtle changes in the user's voice over time. Thus, the system is trained to adapt to changes in the person's voice which may develop with age.
The processing can be performed in software. However, the four decision algorithms can be run in hardware if desired, such as on a dedicated DSP board. Furthermore, the four algorithms can be run in parallel so as to minimise decision time. The digitisation and feature extraction can also be performed principally in hardware if desired. In most applications, the processing will be split between local and centralised processing. A centralised storage centre holds the reference templates, code books and network weights for each speaker. The appropriate templates, code books etc. for the claimed identity may then be downloaded to the processing units.
Processing of identity claims may be carried out centrally in a "bank" of processors or localised at the point of transaction. Either approach may be accommodated, the optimum arrangement being dictated by application conditions and restraints. For example, point-of-sale credit card authorisation via speaker verification would normally be performed at some centralised processing centre, while authenticating ATM transactions by speaker verification could be performed either locally at the ATM, or centralised.
The foregoing describes only one embodiment of the invention, and modifications which are obvious to those skilled in the art may be made thereto without departing from the scope of the invention. For example, although the preferred embodiment has been described with reference to four particular decision algorithms, the number, and type, of decision algorithms can be varied.
Further, although the invention has been described with particular reference to speaker verification, it can also be used for speaker identification. In such an application, the characteristics derived from the speech sample of the user are compared with the characteristics from speech samples of a list of persons using the above-described techniques. The system will indicate whether the speaker is one of the persons on the list, and/or the person(s) on the list which most closely match(es) the speaker.

REFERENCES
[1] Furui, S., "Speaker Independent Isolated Word Recognition Using Dynamic Features of Speech Spectra", IEEE Trans. ASSP, vol. 34, 1986, 52-59.
[2] Doddington, G., "A Method of Speaker Verification", PhD Thesis, The University of Wisconsin, 1971.
[3] Booth, I., Barlow, M., Watson, B., "Enhancements to DTW and VQ decision algorithms for speaker recognition", Proc. Fourth Aust. Int. Conf. Speech Science and Technology, Brisbane, December 1992, 483-488.
[4] Barlow, M., "Prosodic Acoustic Correlates of Speaker Characteristics", PhD Thesis, University of NSW, 1991.
[5] Soong, F., Rosenberg, A., Rabiner, L., Juang, B.H., "A Vector Quantisation Approach to Speaker Recognition", Proc. ICASSP-85, 1985, 387-390.
[6] Matsui, T., Furui, S., "Comparison of Text-Independent Speaker Recognition Methods using VQ-Distortion and Discrete/Continuous HMMs", Proc. ICASSP-91, 1991, 157-160.
[7] Pineda, F.J., "Generalization of Back-Propagation to Recurrent Neural Networks", Physical Review Letters, vol. 59, no. 19, 1987, 2229-2232.
[8] Hertz, J., Krogh, A., Palmer, R.G., "Introduction to the Theory of Neural Computation", Addison-Wesley, 1991.
[9] Shrimpton, D., Watson, B., "Comparison of Recurrent Neural Network Architectures for Speaker Verification", Proc. Fourth Aust. Int. Conf. Speech Science and Technology, Brisbane, December 1992, 460-464.
[10] Blauensteiner, L., "Speaker Verification Group Report on Long-Term Feature Averaging Techniques Using Neural Network Implementations", Tech. Report, Speaker Verification Group, University of Qld, 1993.

Claims

CLAIMS:
1. A method of verifying whether a speaker is a claimed identity, comprising the steps of obtaining a speech sample from the speaker, deriving one or more features from the speech sample, comparing the derived feature(s) with stored feature(s) previously derived from a speech sample of the claimed identity, using a plurality of separate algorithms, characterised in that the method further comprises the step of processing a combination of the individual results obtained using the plurality of algorithms to arrive at a final verification decision.
2. A method as claimed in claim 1, wherein the processing step comprises processing the results of the separate algorithms through an artificial neural network trained on data relating to the claimed identity.
3. A method as claimed in claim 2, wherein the artificial neural network is of multi-layer perceptron architecture.
4. A method as claimed in claim 2, wherein the processing step further comprises comparing the output of the artificial neural network with a threshold value, the final verification decision being dependent on that comparison.
5. A method as claimed in claim 4, wherein the threshold value is variable.
6. A method as claimed in claim 2, wherein in the event of a positive verification, the artificial neural network is further trained on data derived from the speech sample of the speaker.
7. A method as claimed in claim 1, wherein the feature(s) comprise(s) one or more of the following acoustic features:
• Cepstral co-efficients
• Fundamental frequency (pitch)
• Energy
• Duration
• Zero Crossing Rate
• Linear Prediction co-efficients.
8. A method as claimed in claim 7 wherein the step of deriving one or more features from the speech sample comprises converting the speech sample to digital form, removing portions of silence from the resultant digital speech waveform, and dividing the waveform into a plurality of overlapping frames.
9. A method as claimed in claim 1, wherein said algorithms comprise the following algorithms:
• Dynamic Time Warping
• Vector Quantisation
• Recurrent Neural Network
• Long Term Features.
10. A method as claimed in claim 9, wherein the comparison of the derived feature(s) with the stored feature(s) is carried out using said plurality of separate algorithms in parallel.
11. Apparatus for verifying whether a speaker is a claimed identity, comprising means for obtaining a speech sample from a speaker, means for deriving one or more feature(s) of the speech sample, memory means for storing one or more feature(s) previously derived from a speech sample of the claimed identity, means for comparing the feature(s) derived from the speaker with the stored feature(s), using a plurality of separate algorithms, characterised in that the apparatus further comprises processing means for deriving a final verification decision from a combination of the results of the plurality of algorithms.
12. Apparatus as claimed in claim 11, wherein said processing means includes an artificial neural network.
13. Apparatus as claimed in claim 12, wherein the feature(s) comprises one or more of the following acoustic features:
Cepstral co-efficients
Fundamental frequency (pitch)
Energy
Duration
Zero Crossing Rate
Linear Prediction co-efficients.
14. Apparatus as claimed in claim 12, wherein the algorithms comprise the following algorithms:
• Dynamic Time Warping
• Vector Quantisation
• Recurrent Neural Network
• Long Term Features.
15. A method of ascertaining whether a speaker is one of a group of identities, comprising the steps of obtaining a speech sample from the speaker, deriving one or more features from the speech sample, comparing the derived feature(s) with stored feature(s) previously derived from speech samples of the identities, using a plurality of separate algorithms, characterised in that for each comparison with a particular identity in the group, the method further comprises the step of processing a combination of the individual results obtained using the plurality of algorithms to arrive at a final decision with regard to that identity.
16. A method as claimed in claim 15, wherein the processing step comprises processing the results of the separate algorithms through an artificial neural network trained on data relating to the claimed identity.
17. Apparatus for ascertaining whether a speaker is one of a group of identities, comprising means for obtaining a speech sample from a speaker, means for deriving one or more feature(s) of the speech sample, memory means for storing one or more feature(s) previously derived from speech samples of the identities, means for comparing the feature(s) derived from the speaker with the stored feature(s) derived from the identities, using a plurality of separate algorithms, characterised in that the apparatus further comprises processing means for deriving a final decision in relation to an identity from a combination of the results of the plurality of algorithms used in the comparison with that identity.
18. Apparatus as claimed in claim 17, wherein said processing means includes an artificial neural network.
PCT/AU1994/000468 1993-08-12 1994-08-12 A speaker verification system WO1995005656A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU73786/94A AU7378694A (en) 1993-08-12 1994-08-12 A speaker verification system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AUPM0542 1993-08-12
AUPM054293 1993-08-12

Publications (1)

Publication Number Publication Date
WO1995005656A1 (en)

Family

ID=3777128

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU1994/000468 WO1995005656A1 (en) 1993-08-12 1994-08-12 A speaker verification system

Country Status (1)

Country Link
WO (1) WO1995005656A1 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE19630109A1 (en) * 1996-07-25 1998-01-29 Siemens Ag Method for speaker verification using at least one speech signal spoken by a speaker, by a computer
EP0780830A3 (en) * 1995-12-22 1998-08-12 Ncr International Inc. Speaker verification system
WO1998040875A1 (en) * 1997-03-13 1998-09-17 Telia Ab (Publ) Speaker verification system
EP0870300A1 (en) * 1995-06-07 1998-10-14 Rutgers University Speaker verification system
EP0902415A1 (en) * 1997-09-15 1999-03-17 Koninklijke KPN N.V. Method of and arrangement for providing improved speaker reference data and speaker verification
GB2334864A (en) * 1998-01-16 1999-09-01 Nec Corp Mobile phone has vector coded password protection
US6185536B1 (en) * 1998-03-04 2001-02-06 Motorola, Inc. System and method for establishing a communication link using user-specific voice data parameters as a user discriminator
EP2368213A2 (en) * 2008-11-28 2011-09-28 The Nottingham Trent University Biometric identity verification
WO2014114116A1 (en) * 2013-01-28 2014-07-31 Tencent Technology (Shenzhen) Company Limited Method and system for voiceprint recognition
US9502038B2 (en) 2013-01-28 2016-11-22 Tencent Technology (Shenzhen) Company Limited Method and device for voiceprint recognition
US10257191B2 (en) 2008-11-28 2019-04-09 Nottingham Trent University Biometric identity verification
WO2021075012A1 (en) * 2019-10-17 2021-04-22 日本電気株式会社 Speaker authentication system, method, and program
CN112885355A (en) * 2021-01-25 2021-06-01 上海头趣科技有限公司 Speech recognition method based on multiple features

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0121248A1 (en) * 1983-03-30 1984-10-10 Nec Corporation Speaker verification system and process
AU8649691A (en) * 1990-10-03 1992-04-28 Imagination Technologies Limited Methods and apparatus for verifying the originator of a sequence of operations
EP0592150A1 (en) * 1992-10-09 1994-04-13 AT&T Corp. Speaker verification
DE4240978A1 (en) * 1992-12-05 1994-06-09 Telefonbau & Normalzeit Gmbh Improving recognition quality for speaker identification - verifying characteristic vectors and corresp. index sequence provided by vector quantisation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0121248A1 (en) * 1983-03-30 1984-10-10 Nec Corporation Speaker verification system and process
AU8649691A (en) * 1990-10-03 1992-04-28 Imagination Technologies Limited Methods and apparatus for verifying the originator of a sequence of operations
EP0592150A1 (en) * 1992-10-09 1994-04-13 AT&T Corp. Speaker verification
DE4240978A1 (en) * 1992-12-05 1994-06-09 Telefonbau & Normalzeit Gmbh Improving recognition quality for speaker identification - verifying characteristic vectors and corresp. index sequence provided by vector quantisation

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0870300A1 (en) * 1995-06-07 1998-10-14 Rutgers University Speaker verification system
US5839103A (en) * 1995-06-07 1998-11-17 Rutgers, The State University Of New Jersey Speaker verification system using decision fusion logic
EP0870300A4 (en) * 1995-06-07 1999-04-21 Univ Rutgers Speaker verification system
EP0780830A3 (en) * 1995-12-22 1998-08-12 Ncr International Inc. Speaker verification system
DE19630109A1 (en) * 1996-07-25 1998-01-29 Siemens Ag Method for speaker verification using at least one speech signal spoken by a speaker, by a computer
WO1998040875A1 (en) * 1997-03-13 1998-09-17 Telia Ab (Publ) Speaker verification system
EP0902415A1 (en) * 1997-09-15 1999-03-17 Koninklijke KPN N.V. Method of and arrangement for providing improved speaker reference data and speaker verification
WO1999014742A1 (en) * 1997-09-15 1999-03-25 Koninklijke Kpn N.V. Method and arrangement for providing speaker reference data for speaker verification
US6249759B1 (en) 1998-01-16 2001-06-19 Nec Corporation Communication apparatus using speech vector comparison and recognition
GB2334864A (en) * 1998-01-16 1999-09-01 Nec Corp Mobile phone has vector coded password protection
GB2334864B (en) * 1998-01-16 2000-03-15 Nec Corp Communication apparatus
US6185536B1 (en) * 1998-03-04 2001-02-06 Motorola, Inc. System and method for establishing a communication link using user-specific voice data parameters as a user discriminator
EP2368213A2 (en) * 2008-11-28 2011-09-28 The Nottingham Trent University Biometric identity verification
US10257191B2 (en) 2008-11-28 2019-04-09 Nottingham Trent University Biometric identity verification
WO2014114116A1 (en) * 2013-01-28 2014-07-31 Tencent Technology (Shenzhen) Company Limited Method and system for voiceprint recognition
US9502038B2 (en) 2013-01-28 2016-11-22 Tencent Technology (Shenzhen) Company Limited Method and device for voiceprint recognition
WO2021075012A1 (en) * 2019-10-17 2021-04-22 日本電気株式会社 Speaker authentication system, method, and program
JPWO2021075012A1 (en) * 2019-10-17 2021-04-22
JP7259981B2 (en) 2019-10-17 2023-04-18 日本電気株式会社 Speaker authentication system, method and program
CN112885355A (en) * 2021-01-25 2021-06-01 上海头趣科技有限公司 Speech recognition method based on multiple features

Similar Documents

Publication Publication Date Title
AU2021286422B2 (en) End-to-end speaker recognition using deep neural network
US6539352B1 (en) Subword-based speaker verification with multiple-classifier score fusion weight and threshold adaptation
Melin et al. Voice Recognition with Neural Networks, Type-2 Fuzzy Logic and Genetic Algorithms.
US7502736B2 (en) Voice registration method and system, and voice recognition method and system based on voice registration method and system
WO2010120626A1 (en) Speaker verification system
WO1995005656A1 (en) A speaker verification system
Dash et al. Speaker identification using mel frequency cepstralcoefficient and bpnn
Karthikeyan et al. Hybrid machine learning classification scheme for speaker identification
Ozaydin Design of a text independent speaker recognition system
zohra Chelali et al. Speaker identification system based on PLP coefficients and artificial neural network
Shah et al. Interactive voice response with pattern recognition based on artificial neural network approach
RU2161826C2 (en) Automatic person identification method
Shah et al. Neural network solution for secure interactive voice response
Naik et al. Evaluation of a high performance speaker verification system for access Control
Olsson Text dependent speaker verification with a hybrid HMM/ANN system
Sharma et al. Text-independent speaker identification using backpropagation mlp network classifier for a closed set of speakers
Reda et al. Artificial neural network & mel-frequency cepstrum coefficients-based speaker recognition
Das Utterance based speaker identification using ANN
Melin et al. Voice recognition with neural networks, fuzzy logic and genetic algorithms
Faundez-Zanuy et al. Nonlinear predictive models: overview and possibilities in speaker recognition
Ren et al. A hybrid GMM speaker verification system for mobile devices in variable environments
Abd Al-Rahman et al. Using Deep Learning Neural Networks to Recognize and Authenticate the Identity of the Speaker
Anitha et al. PASSWORD SECURED SPEAKER RECOGNITION USING TIME AND FREQUENCY DOMAIN FEATURES
Nedic et al. Recent developments in speaker verification at IDIAP
Chetouani et al. A new nonlinear feature extraction algorithm for speaker verification.

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AM AT AU BB BG BR BY CA CH CN CZ DE DK ES FI GB GE HU JP KE KG KP KR KZ LK LT LU LV MD MG MN MW NL NO NZ PL PT RO RU SD SE SI SK TJ TT UA US UZ VN

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): KE MW SD AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA