WO1995005656A1 - A speaker verification system - Google Patents
A speaker verification system
- Publication number
- WO1995005656A1 (PCT/AU1994/000468)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/10—Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
- the outputs of the four decision algorithms are fed to the MLP, and its output is compared with a threshold value. Based on the comparison of the MLP output value with a threshold value for each of the words requested from the user, the identity claim is either accepted or rejected.
- the threshold value can be adjusted to vary the level of security of the speaker recognition system.
- the speaker verification system includes a training or adaptive learning facility.
- samples of speech from the claimed identity are required to train or serve as templates for the algorithm.
- the process of collecting the speech data and updating the algorithms to use the new data is known as training.
- At the time of entering a new identity on the system, that person must leave a sample of his/her voice to serve as training and reference templates.
- the person is required to repeat all words in a limited vocabulary a number of times, and the digitised speech is then converted to acoustic features as described above.
- the reference templates, code books, etc. of the individual algorithms are updated to take account of the subtle changes in the user's voice over time.
- the system is trained to adapt to changes in the person's voice which may develop with age.
- the processing can be performed in software. However, the four decision algorithms can be run in hardware if desired, such as on a dedicated DSP board. Furthermore, the four algorithms can be run in parallel so as to minimise decision time.
- the digitisation and feature extraction can also be performed principally in hardware if desired. In most applications, the processing will be split between local and centralised processing.
- a centralised storage centre holds the reference templates, code books and network weights for each speaker. The appropriate templates, code books etc. for the claimed identity may then be down-loaded to the processing units.
- Processing of identity claims may be carried out centrally in a "bank" of processors or localised at the point of transaction. Either approach may be accommodated, the optimum arrangement being dictated by application conditions and restraints. For example, point-of-sale credit card authorisation via speaker verification would normally be performed at some centralised processing centre, while authenticating ATM transactions by speaker verification could be performed either locally at the ATM, or centralised.
- Although the invention has been described with particular reference to speaker verification, it can also be used for speaker identification.
- the characteristics derived from the speech sample of the user are compared with the characteristics from speech samples from a list of persons using the abovedescribed techniques.
- the system will indicate whether the speaker is one of the persons on the list, and/or the person(s) on the list which most closely match(es) the speaker.
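The final decision mechanism described in this specification feeds the outputs of the four decision algorithms to a multi-layer perceptron and compares the result with an adjustable threshold. The sketch below is illustrative only: the network weights are invented, untrained values (a real system would learn them per speaker), and the two-hidden-node topology is an assumption, not taken from the patent.

```python
import math

def mlp_fuse(scores, w_hidden, w_out):
    """Two-layer perceptron combining the four algorithm scores."""
    hidden = [math.tanh(sum(w * s for w, s in zip(row, scores)))
              for row in w_hidden]
    return math.tanh(sum(w * h for w, h in zip(w_out, hidden)))

def verify(scores, threshold=0.5):
    # Illustrative fixed weights; in practice these are trained per speaker.
    w_hidden = [[0.5, 0.5, 0.5, 0.5], [1.0, -0.2, 0.3, 0.1]]
    w_out = [1.0, 0.5]
    return mlp_fuse(scores, w_hidden, w_out) >= threshold

# Four high scores (DTW, VQ, RNN, LTF) accept the claim; low scores reject it.
print(verify([0.9, 0.8, 0.9, 0.7]))      # → True
print(verify([-0.8, -0.9, -0.7, -0.8]))  # → False
```

Raising the threshold trades false acceptances for false rejections, which is how the security level of the system is varied.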
Abstract
The claimed identity of a person is verified from a sample of that person's speech. The speech sample is processed to extract a set of characteristic features. These features are then compared with stored features previously derived from a speech sample of the claimed identity, using a plurality of separate algorithms in parallel. The outputs of the individual algorithms are then fed to an artificial neural network which is used as a further decision algorithm to arrive at the final decision to accept or reject the speaker's identity claim. By using several different decision algorithms, and combining the outputs of those algorithms in a neural network, the speaker verification technique is more universally applicable, and hence more reliable.
Description
"A SPEAKER VERIFICATION SYSTEM"
THIS INVENTION relates to method and apparatus for speaker verification. In particular, the invention is directed to a speaker recognition technique in which characteristics of a speech sample are compared with stored characteristics using several different algorithms, and the results of the individual algorithms are combined in a neural network to arrive at a final decision.
BACKGROUND ART
The aim of speaker verification is to accept or reject a claim to a particular identity based on one or more samples of the claimant's speech. Unlike speech recognition which is aimed at deciphering the spoken word, speaker verification is an identification process. It has been found that a person's voice can be used as an identifying characteristic of that person. The person's voice can therefore be used as a robust, secure means of identification, obviating the need for artificial measures such as PIN numbers, security codes, access cards etc.
Speaker verification can be used as a security measure in any area or application in which the identity of an individual must be authenticated. Areas of immediate application include:
Automatic Teller Machine (ATM) transactions
Telephone banking
Access control (e.g. to secure building access)
Credit card transactions
In particular, speaker verification is ideally suited to transactions conducted over a telephone link as other means of authentication are unsuitable or impractical.
Although there are various known techniques for speaker verification, such techniques normally rely upon a single decision-making algorithm. It has been found that the accuracy can vary depending on the particular characteristic being compared, or the particular decision-making algorithm being used. Furthermore, while some speech features may be accurate characterising features of some speakers, other speakers may have different characterising speech features. Thus, the known speaker verification techniques are generally not universally applicable.
It is an object of the present invention to provide improved method and apparatus for speaker verification which overcomes or ameliorates the disadvantages of known techniques, or which at least provides the consumer with a useful choice.
SUMMARY OF THE INVENTION
In one broad form, the present invention provides a method of verifying whether a speaker is a claimed identity, comprising the steps of obtaining a speech sample from the speaker, deriving one or more features from the speech sample, comparing the derived feature(s) with stored feature(s) previously derived from a speech sample of the claimed identity, using a plurality of separate algorithms, characterised in that the method further comprises the step of processing a combination of the individual results obtained using the plurality of algorithms to arrive at a final verification decision.
In another broad form, the present invention provides apparatus for verifying whether a speaker is a claimed identity, comprising means for obtaining a speech sample from a speaker, means for deriving one or more feature(s) of the speech sample, memory means for storing one or more feature(s) previously derived from a speech sample of the claimed identity, means for comparing the feature(s) derived from the speaker with the stored feature(s), using a plurality
of separate algorithms, characterised in that the apparatus further comprises processing means for deriving a final verification decision from a combination of the results of the plurality of algorithms.
The speech sample may be obtained via a microphone, telephone link or other suitable audio device. The speech sample may comprise one or more words. The speech sample is typically converted to digital format, and a set of predetermined characteristic features are derived from the digitised sample. These features may include cepstral co-efficients, fundamental frequency (or pitch), energy, duration, zero crossing rate and linear prediction co-efficients. The derived features are then compared with similar characteristic features previously derived from a speech sample of the claimed identity and stored in a suitable memory.
Several independent decision algorithms are used in the comparison. Such algorithms may suitably include dynamic time warping, vector quantisation, recurrent neural network and long term features.
Unlike known techniques, the outputs of all the independent algorithms are then utilised to arrive at a final verification decision. In the preferred embodiment, a neural network is used to compute the similarity between the speaker's speech sample and that of the claimed identity. The neural network of the claimed identity has been previously trained to distinguish the speech sample of the claimed identity from others stored in memory using an iterative process. The output of the decision algorithms for the user's sample are applied to the known identity's neural network for final verification. The system also includes a training facility to allow the stored sample of the claimed identity to be updated when a positive verification is made, to accommodate or compensate for changes in speech patterns
with age.
In order that the invention may be more fully understood and put into practice, a preferred embodiment will now be described with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a schematic diagram of the speaker verification system of the preferred embodiment;
Fig. 2 is a schematic of a dynamic time warping process;
Fig. 3 is a schematic diagram of vector quantisation mapping;
Fig. 4 illustrates the architecture of a typical recurrent neural network; and
Fig. 5 illustrates the architecture of the multi-layer subsystem arrangement of Fig. 1.
DESCRIPTION OF PREFERRED EMBODIMENT
In the speaker verification system of the preferred embodiment, characteristic features of a speech sample are extracted and correlated with corresponding features of a stored speech sample previously obtained from the true identity. (Typically, the features of the latter speech sample are stored instead of, or in addition to, the speech sample itself). The correlation outputs are then combined in an artificial neural network to arrive at the final decision.
The major components of the speaker verification system are shown in Fig. 1. When a transaction is initiated (e.g. at an ATM, or over a telephone link), the user is asked to supply a speech sample. The sample typically comprises several words which correspond to some or all of the words previously recorded by the known identity. The system may prompt the user to repeat a number of specific words. The user's vocal rendition of these words is then digitised, and a predetermined number of acoustic features are extracted from the digitised samples. These acoustic features, together with the identity claim, are
then subjected to a number of decision algorithms. The outputs of the individual decision algorithms are employed by a final decision mechanism, such as an artificial neural network of multi-layer perceptron (MLP) architecture, to make a final decision as to whether to accept or reject the identity claim.
By using a multiple sub-system approach of employing several decision algorithms in parallel, the outputs of which are combined via the neural network, the final decision is a far more reliable indicator of identity than systems hitherto used.
Referring to Fig. 1, the transaction initiation (and prompt) section of the speaker verification system is the primary user interface of the system. This user interface is necessarily application dependent. For instance, a speaker verifying ATM would include a microphone, loudspeaker and the current LED display, while remote phone transactions involving speaker verification would make use of the simple resource of the telephone handset.
In use, once an identity claim is made (e.g. by inserting a card), the system prompts the user to repeat a number of predetermined words. These words are selected from a known list and presented in a random order so as to minimise potential abuse of the system (i.e. the possibility of using a recording of the true speaker is minimised). The user is audibly prompted for each required utterance using a voice. For those applications in which a visual display of text is also possible (such as an ATM), the requested utterances can be displayed instead of, or in addition to, the audible prompt.
The user's utterances are then converted into a format suitable for the decision algorithms. Signals from the microphone at the point of application (e.g. telephone or ATM microphone) are digitised using suitable analog-to-digital (A/D) hardware. The resulting stream of numbers is analysed via a speech detection algorithm
in order to determine the start and end points of the user's utterances of the individual words.
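The endpoint-detection step above can be sketched as a simple frame-energy threshold test. This is an illustrative assumption: the patent does not specify the detection algorithm, and the frame length and threshold below are arbitrary values chosen for the example.

```python
def detect_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices of the region whose frame
    energy exceeds the threshold; a crude endpoint detector."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    energies = [sum(s * s for s in f) / len(f) for f in frames if f]
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None  # no speech found
    return active[0] * frame_len, (active[-1] + 1) * frame_len

# Silence, then a burst of "speech", then silence again.
signal = [0.0] * 320 + [0.5, -0.5] * 160 + [0.0] * 320
print(detect_endpoints(signal))  # → (320, 640)
```

Production systems typically add hangover smoothing and adaptive noise floors, omitted here for brevity.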
The resulting speech waveform (with all portions of silence eliminated) is split into a number of small overlapping frames (typically of 64ms duration) in order to exploit the pseudo-stationary nature of speech.
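The framing step might look as follows. The 64 ms figure comes from the text; the 8 kHz sampling rate (giving 512-sample frames) and the 50% overlap are assumptions made for illustration, as the patent states neither.

```python
def split_frames(samples, frame_len=512, hop=256):
    """Split a signal into overlapping frames: 64 ms at an assumed
    8 kHz rate, with an assumed 50% overlap between frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

frames = split_frames(list(range(2048)))
print(len(frames), len(frames[0]))  # → 7 512
```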
Various signal processing techniques are employed to convert each frame into a smaller number of acoustic features which have been found to maximally encode the identity of the speaker. In the preferred embodiment, the characteristic acoustic parameters which are extracted for utilisation in the decision algorithms include:
• Cepstral co-efficients
• Fundamental frequency (pitch)
• Energy
• Duration
• Zero crossing rate
• Linear prediction co-efficients
The methods used to extract these acoustic features are known, and need not be described in detail in this application.
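Two of the simpler features in this list, frame energy and zero crossing rate, can be sketched directly; cepstral and linear prediction co-efficients require more machinery and are omitted here.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs that change sign."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

frame = [0.3, -0.2, 0.4, -0.1, 0.2, -0.3]
print(round(frame_energy(frame), 3))        # → 0.072
print(round(zero_crossing_rate(frame), 2))  # → 1.0
```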
In the preferred embodiment, a plurality of independent algorithms process the extracted set of acoustic features and the identity claim, each outputting a score measuring the closeness of the acoustic features to those previously extracted from speech samples of the true identity and stored in memory.
Four independent decision algorithms are used in the described embodiment, namely
• Dynamic Time Warping (DTW)
• Vector Quantisation (VQ)
• Recurrent Neural Network (RNN)
• Long Term Features (LTF)
These algorithms are described below.
Dynamic Time Warping
In each decision algorithm, the acoustic features extracted from the user's speech are compared or correlated with those of the individual whose identity is being claimed. The latter features are pre-stored in a set of reference templates. One obstacle to comparison of incoming acoustic features with those stored as templates is that speakers often repeat the same word with slight timing differences at each utterance. Dynamic time warping (DTW) is a technique for normalising utterances (of the same word) to the same duration, hence allowing a much simpler comparison of acoustic features. DTW is described in more detail in references [1, 2] identified at the end of this specification. The disclosure of all such references is incorporated herein by reference. For each word uttered by each speaker, five reference templates are stored. A recent innovation to the DTW process has been to employ the time alignment information to enhance the performance of the DTW algorithm [as described in references 3, 4]. A schematic diagram of the DTW process is shown in Fig. 2. Namely, an input and reference frame are time aligned so as to minimise the difference. A score is output by the decision algorithm, dependent on the closeness of the acoustic features of the time aligned input and reference frames.
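The core DTW recursion can be sketched as follows. This is the textbook dynamic-programming formulation over scalar features, not necessarily the exact variant of references [1, 2].

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic time warping distance between two feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            # Best of: stretch the input, stretch the reference, or advance both.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# The same "word" spoken twice as slowly still aligns perfectly...
print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 2, 3, 3, 4, 4]))  # → 0.0
# ...while a genuinely different pattern accumulates cost.
print(dtw_distance([1, 2, 3, 4], [2, 3, 4, 5]))  # → 2.0
```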
Vector Quantisation
Vector quantisation (VQ) is a technique chiefly employed for data reduction by computing a code book of elements onto which all input frames are matched. A discussion of the vector quantisation approach to speaker recognition can be found in reference [5].
As employed for speaker verification [6, 3], separate code books are constructed for each speaker. The distortion (or separation) between input frames and the nearest code book element in the code book is averaged and used as a measure of the closeness of the user to that person for whom the code book was constructed.
Fig. 3 is a schematic diagram of the mapping from input frame (I) to the nearest code book element showing the distortion (arrow length) for that frame. Preferably, a vector of distortion values is derived rather than a simple mean. Linear weighting of these values has been found to increase speaker verification performance over that of a standard mean distortion measure.
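The per-frame distortion vector described above might be computed as follows; the two-dimensional, two-element code book is a toy assumption chosen so the arithmetic is easy to follow.

```python
def vq_distortions(frames, codebook):
    """Squared-error distortion from each input frame to its nearest
    codebook vector (arrow length in Fig. 3, squared)."""
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(d2(f, c) for c in codebook) for f in frames]

codebook = [(0.0, 0.0), (1.0, 1.0)]          # toy per-speaker code book
frames = [(0.1, 0.0), (0.9, 1.0), (0.5, 0.5)]
ds = vq_distortions(frames, codebook)
print([round(d, 2) for d in ds])             # → [0.01, 0.01, 0.5]
mean_distortion = sum(ds) / len(ds)          # the simple-mean score
```

Keeping the whole vector `ds` (rather than only `mean_distortion`) is what allows the linearly weighted variant mentioned in the text.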
Recurrent Neural Network
A neural network is a type of artificial-intelligence system modelled after the neurons (nerve cells) in a biological nervous system and intended to simulate the way in which a brain processes information, learns, and remembers. A neural network is designed as an interconnected system of processing elements, each with a limited number of inputs (comparable to the impulse-receiving dendrites of a neuron) and an output (comparable to the synapse over which a nerve impulse travels to the next neuron). Rather than being programmed, these processing elements are able to "learn" by receiving weighted inputs - roughly, weak to strong or negative to positive - that, with adjustment, time, and repetition, can be made to produce appropriate outputs. Neural networks can be implemented either through hardware circuits (the fast method) or through software that simulates such a network (a slower method). Neural networks help computers "learn" by association and recognition.
Recurrent neural networks (RNN) [7] have found recent application in the area of speech processing due to their ability to handle time-varying signals. One of the broad family of Multi-Layer Perceptrons [8], RNNs differ by employing self-connections at each node, thus allowing previous frames of data to influence current outputs.
As applied to speaker verification [9], a separate RNN is trained for each speaker. The network is repeatedly presented with samples of speech from the individual and from other speakers. Fig. 4 illustrates the architecture of a typical RNN to which frames of acoustic features are fed. The network "learns" the characteristics of the individual's voice by altering node connection strengths to produce outputs corresponding to those shown (e.g. an output of 1 when the input is from the true identity, and an output of -1 when the input is from any other speaker).
The decision algorithm involves passing the input features extracted from the user's speech through the RNN for the claimed identity. The decision algorithm output is based on the output of the network.
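For illustration only, the self-connection that distinguishes an RNN node can be sketched as a single tanh unit whose previous output feeds back into its input as each frame arrives; the weights below are arbitrary stand-ins, not trained values from the disclosure:

```python
import math

# Toy sketch of one recurrent node: the self-connection (w_self) feeds
# the previous output back in, so earlier frames influence the current
# output. Weights are illustrative placeholders, not trained values.
def recurrent_node(frames, w_in=0.8, w_self=0.5, bias=0.0):
    """Run one tanh node with a self-connection over a frame sequence."""
    h = 0.0
    for x in frames:
        h = math.tanh(w_in * x + w_self * h + bias)  # previous h feeds back
    return h  # final output, nominally near +1 (true speaker) or -1 (impostor)
```

A trained network of such nodes would drive this output towards +1 for the enrolled speaker and -1 for impostors, which is what the decision algorithm then thresholds.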
Long Term Features
Long term features (LTF) [4], a technique applied for text-independent speaker recognition, averages each acoustic feature over the duration of a speaker's utterance. The mean and variance values for the acoustic features of the user's utterance are then compared with similar values derived from utterances of the claimed identity.
A multi-layer perceptron may be used to compute the similarity between the LTFs derived from the user utterance and those LTFs for the claimed identity [10].
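The mean-and-variance computation described for LTFs can be sketched as follows (illustrative only; the comparison function here is a plain Euclidean distance, standing in for the MLP similarity measure of reference [10]):

```python
# Illustrative long term features: per-feature mean and variance over an
# utterance, compared against stored values for the claimed identity.
def long_term_features(frames):
    """frames: list of equal-length feature vectors -> (means, variances)."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / n
                 for d in range(dims)]
    return means, variances

def ltf_distance(ltf_a, ltf_b):
    """Euclidean distance between two (means, variances) pairs.

    A stand-in for the MLP similarity of reference [10]."""
    flat_a = ltf_a[0] + ltf_a[1]
    flat_b = ltf_b[0] + ltf_b[1]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) ** 0.5
```

Because only utterance-level statistics survive, the comparison is independent of what was said, which is why the text calls this a text-independent technique.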
Rather than using a single decision algorithm to verify or reject a user's claim of identity, the speaker verification system of the preferred embodiment employs all four above-described independent decision algorithms to make the binary decision to accept or reject the user's identity claim.
More particularly, a neural network of MLP architecture is employed to combine the disparate outputs of the four independent decision algorithms so as to arrive at the final decision. The MLP may suitably be a software or hardware-based neural network. As shown in Fig. 5, for each word that the user is requested to utter, the outputs of the four decision algorithms are fed to the MLP, and its output is compared with a threshold value. Based on the comparison of the MLP output value with the threshold value for each of the words requested from the user, the identity claim is either accepted or rejected.
Preferably, the threshold value can be adjusted to vary the level of security of the speaker recognition system.
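The fusion step — the four algorithm scores fed to an MLP whose output is compared with an adjustable threshold — can be sketched as below; the network weights are illustrative placeholders, not values from the disclosure:

```python
import math

# Illustrative fusion MLP: the four decision-algorithm scores (DTW, VQ,
# RNN, LTF) are combined by one hidden layer of tanh units, and the
# output is thresholded. Weights are placeholders, not trained values.
def mlp_fuse(scores, w_hidden, w_out):
    """One hidden layer of tanh units over the four algorithm scores."""
    hidden = [math.tanh(sum(w * s for w, s in zip(ws, scores)))
              for ws in w_hidden]
    return math.tanh(sum(w * h for w, h in zip(w_out, hidden)))

def verify(scores, w_hidden, w_out, threshold=0.0):
    """Accept the identity claim if the fused score exceeds the threshold."""
    return mlp_fuse(scores, w_hidden, w_out) > threshold
```

Raising the threshold rejects borderline claims, which is the security adjustment the preceding paragraph describes.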
The speaker verification system includes a training or adaptive learning facility. In order for each decision algorithm to compute a score based on the correlation of the input speech with speech from the claimed identity, samples of speech from the claimed identity are required to train or serve as templates for the algorithm. The process of collecting the speech data and updating the algorithms to use the new data is known as training. At the time of entering a new identity on the system, that person must leave a sample of his/her voice to serve as training and reference templates. Typically, the person is required to repeat all words in a limited vocabulary a number of times, and the digitised speech is then converted to acoustic features as described above. These values are then used by the algorithms to build code books, reference templates or train networks to be representative of that speaker, the exact details being dependent on the particular algorithm. When a user, who is verified as the claimed identity, uses the system, the reference templates, code books, etc. of the individual algorithms are updated to take account of the subtle changes in the user's voice over time. Thus, the system is trained to adapt to changes in the person's voice which may develop with age.
The processing can be performed in software. However, the four decision algorithms can be run in hardware if desired, such as on a dedicated DSP board.
Furthermore, the four algorithms can be run in parallel so as to minimise decision time. The digitisation and feature extraction can also be performed principally in hardware if desired. In most applications, the processing will be split between local and centralised processing. A centralised storage centre holds the reference templates, code books and network weights for each speaker. The appropriate templates, code books etc. for the claimed identity may then be down-loaded to the processing units.
Processing of identity claims may be carried out centrally in a "bank" of processors or localised at the point of transaction. Either approach may be accommodated, the optimum arrangement being dictated by application conditions and constraints. For example, point-of-sale credit card authorisation via speaker verification would normally be performed at some centralised processing centre, while authenticating ATM transactions by speaker verification could be performed either locally at the ATM, or centrally.
The foregoing describes only one embodiment of the invention, and modifications which are obvious to those skilled in the art may be made thereto without departing from the scope of the invention. For example, although the preferred embodiment has been described with reference to four particular decision algorithms, the number, and type, of decision algorithms can be varied.
Further, although the invention has been described with particular reference to speaker verification, it can also be used for speaker identification. In such an application, the characteristics derived from the speech sample of the user are compared with the characteristics from speech samples from a list of persons using the above-described techniques. The system will indicate whether the speaker is one of the persons on the list, and/or the person(s) on the list which most closely match(es) the speaker.
REFERENCES
[1] Furui, S., "Speaker Independent Isolated Word Recognition Using Dynamic Features of Speech Spectra", IEEE Trans. ASSP, vol. 34, 1986, 52-59.
[2] Doddington, G., "A Method of Speaker Verification", PhD Thesis, The University of Wisconsin, 1971.
[3] Booth, I., Barlow, M., Watson, B., "Enhancements to DTW and VQ decision algorithms for speaker recognition", Proc. Fourth Aust. Int. Conf. Speech Science and Technology, Brisbane, December 1992, 483-488.
[4] Barlow, M., "Prosodic Acoustic Correlates of Speaker Characteristics", PhD Thesis, University of NSW, 1991.
[5] Soong, F., Rosenberg, A., Rabiner, L., Juang, B., "A Vector Quantisation Approach to Speaker Recognition", Proc. ICASSP-85, 1985, 387-390.
[6] Matsui, T., Furui, S., "Comparison of Text-Independent Speaker Recognition Methods using VQ-Distortion and Discrete/Continuous HMMs", Proc. ICASSP-91, 1991, 157-160.
[7] Pineda, F.J., "Generalization of Back-Propagation to Recurrent Neural Networks", Physical Review Letters, vol. 59, no. 19, 1987, 2229-2232.
[8] Hertz, J., Krogh, A., Palmer, R.G., "Introduction to the Theory of Neural Computation", Addison-Wesley, 1991.
[9] Shrimpton, D., Watson, B., "Comparison of Recurrent Neural Network Architectures for Speaker Verification", Proc. Fourth Aust. Int. Conf. Speech Science and Technology, Brisbane, December 1992, 460-464.
[10] Blauensteiner, L., "Speaker Verification Group Report on Long-Term Feature Averaging Techniques Using Neural Network Implementations", Tech. Report, Speaker Verification Group, University of Qld, 1993.
Claims
1. A method of verifying whether a speaker is a claimed identity, comprising the steps of obtaining a speech sample from the speaker, deriving one or more features from the speech sample, comparing the derived feature(s) with stored feature(s) previously derived from a speech sample of the claimed identity, using a plurality of separate algorithms, characterised in that the method further comprises the step of processing a combination of the individual results obtained using the plurality of algorithms to arrive at a final verification decision.
2. A method as claimed in claim 1, wherein the processing step comprises processing the results of the separate algorithms through an artificial neural network trained on data relating to the claimed identity.
3. A method as claimed in claim 2, wherein the artificial neural network is of multi-layer perceptron architecture.
4. A method as claimed in claim 2, wherein the processing step further comprises comparing the output of the artificial neural network with a threshold value, the final verification decision being dependent on that comparison.
5. A method as claimed in claim 4, wherein the threshold value is variable.
6. A method as claimed in claim 2, wherein in the event of a positive verification, the artificial neural network is further trained on data derived from the speech sample of the speaker.
7. A method as claimed in claim 1, wherein the feature(s) comprise(s) one or more of the following acoustic features:
Cepstral co-efficients
Fundamental frequency (pitch)
Energy
Duration
Zero Crossing Rate
Linear Prediction co-efficients.
8. A method as claimed in claim 7 wherein the step of deriving one or more features from the speech sample comprises converting the speech sample to digital form, removing portions of silence from the resultant digital speech waveform, and dividing the waveform into a plurality of overlapping frames.
9. A method as claimed in claim 1, wherein said algorithms comprise the following algorithms:
• Dynamic Time Warping
• Vector Quantisation
• Recurrent Neural Network
• Long Term Features.
10. A method as claimed in claim 9, wherein the comparison of the derived feature(s) with the stored feature(s) is carried out using said plurality of separate algorithms in parallel.
11. Apparatus for verifying whether a speaker is a claimed identity, comprising means for obtaining a speech sample from a speaker, means for deriving one or more feature(s) of the speech sample, memory means for storing one or more feature(s) previously derived from a speech sample of the claimed identity, means for comparing the feature(s) derived from the speaker with the stored feature(s), using a plurality of separate algorithms, characterised in that the apparatus further comprises processing means for deriving a final verification decision from a combination of the results of the plurality of algorithms.
12. Apparatus as claimed in claim 11, wherein said processing means includes an artificial neural network.
13. Apparatus as claimed in claim 12, wherein the feature(s) comprises one or more of the following acoustic features:
Cepstral co-efficients
Fundamental frequency (pitch)
Energy
Duration
Zero Crossing Rate
Linear Prediction co-efficients.
14. Apparatus as claimed in claim 12, wherein the algorithms comprise the following algorithms:
• Dynamic Time Warping
• Vector Quantisation
• Recurrent Neural Network
• Long Term Features.
15. A method of ascertaining whether a speaker is one of a group of identities, comprising the steps of obtaining a speech sample from the speaker, deriving one or more features from the speech sample, comparing the derived feature(s) with stored feature(s) previously derived from speech samples of the identities, using a plurality of separate algorithms, characterised in that for each comparison with a particular identity in the group, the method further comprises the step of processing a combination of the individual results obtained using the plurality of algorithms to arrive at a final decision with regard to that identity.
16. A method as claimed in claim 15, wherein the processing step comprises processing the results of the separate algorithms through an artificial neural network trained on data relating to the claimed identity.
17. Apparatus for ascertaining whether a speaker is one of a group of identities, comprising means for obtaining a speech sample from a speaker, means for deriving one or more feature(s) of the speech sample, memory means for storing one or more feature(s) previously derived from speech samples of the identities, means for comparing the feature(s) derived from the speaker with the stored feature(s) derived from the identities, using a plurality of separate algorithms, characterised in that the apparatus further comprises processing means for deriving a final decision in relation to an identity from a combination of the results of the plurality of algorithms used in the comparison with that identity.
18. Apparatus as claimed in claim 17, wherein said processing means includes an artificial neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU73786/94A AU7378694A (en) | 1993-08-12 | 1994-08-12 | A speaker verification system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AUPM0542 | 1993-08-12 | ||
AUPM054293 | 1993-08-12 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1995005656A1 true WO1995005656A1 (en) | 1995-02-23 |
Family
ID=3777128
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/AU1994/000468 WO1995005656A1 (en) | 1993-08-12 | 1994-08-12 | A speaker verification system |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO1995005656A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0121248A1 (en) * | 1983-03-30 | 1984-10-10 | Nec Corporation | Speaker verification system and process |
AU8649691A (en) * | 1990-10-03 | 1992-04-28 | Imagination Technologies Limited | Methods and apparatus for verifying the originator of a sequence of operations |
EP0592150A1 (en) * | 1992-10-09 | 1994-04-13 | AT&T Corp. | Speaker verification |
DE4240978A1 (en) * | 1992-12-05 | 1994-06-09 | Telefonbau & Normalzeit Gmbh | Improving recognition quality for speaker identification - verifying characteristic vectors and corresp. index sequence provided by vector quantisation |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0870300A1 (en) * | 1995-06-07 | 1998-10-14 | Rutgers University | Speaker verification system |
US5839103A (en) * | 1995-06-07 | 1998-11-17 | Rutgers, The State University Of New Jersey | Speaker verification system using decision fusion logic |
EP0870300A4 (en) * | 1995-06-07 | 1999-04-21 | Univ Rutgers | Speaker verification system |
EP0780830A3 (en) * | 1995-12-22 | 1998-08-12 | Ncr International Inc. | Speaker verification system |
DE19630109A1 (en) * | 1996-07-25 | 1998-01-29 | Siemens Ag | Method for speaker verification using at least one speech signal spoken by a speaker, by a computer |
WO1998040875A1 (en) * | 1997-03-13 | 1998-09-17 | Telia Ab (Publ) | Speaker verification system |
EP0902415A1 (en) * | 1997-09-15 | 1999-03-17 | Koninklijke KPN N.V. | Method of and arrangement for providing improved speaker reference data and speaker verification |
WO1999014742A1 (en) * | 1997-09-15 | 1999-03-25 | Koninklijke Kpn N.V. | Method and arrangement for providing speaker reference data for speaker verification |
US6249759B1 (en) | 1998-01-16 | 2001-06-19 | Nec Corporation | Communication apparatus using speech vector comparison and recognition |
GB2334864A (en) * | 1998-01-16 | 1999-09-01 | Nec Corp | Mobile phone has vector coded password protection |
GB2334864B (en) * | 1998-01-16 | 2000-03-15 | Nec Corp | Communication apparatus |
US6185536B1 (en) * | 1998-03-04 | 2001-02-06 | Motorola, Inc. | System and method for establishing a communication link using user-specific voice data parameters as a user discriminator |
EP2368213A2 (en) * | 2008-11-28 | 2011-09-28 | The Nottingham Trent University | Biometric identity verification |
US10257191B2 (en) | 2008-11-28 | 2019-04-09 | Nottingham Trent University | Biometric identity verification |
WO2014114116A1 (en) * | 2013-01-28 | 2014-07-31 | Tencent Technology (Shenzhen) Company Limited | Method and system for voiceprint recognition |
US9502038B2 (en) | 2013-01-28 | 2016-11-22 | Tencent Technology (Shenzhen) Company Limited | Method and device for voiceprint recognition |
WO2021075012A1 (en) * | 2019-10-17 | 2021-04-22 | 日本電気株式会社 | Speaker authentication system, method, and program |
JPWO2021075012A1 (en) * | 2019-10-17 | 2021-04-22 | ||
JP7259981B2 (en) | 2019-10-17 | 2023-04-18 | 日本電気株式会社 | Speaker authentication system, method and program |
CN112885355A (en) * | 2021-01-25 | 2021-06-01 | 上海头趣科技有限公司 | Speech recognition method based on multiple features |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2021286422B2 (en) | End-to-end speaker recognition using deep neural network | |
US6539352B1 (en) | Subword-based speaker verification with multiple-classifier score fusion weight and threshold adaptation | |
Melin et al. | Voice Recognition with Neural Networks, Type-2 Fuzzy Logic and Genetic Algorithms. | |
US7502736B2 (en) | Voice registration method and system, and voice recognition method and system based on voice registration method and system | |
WO2010120626A1 (en) | Speaker verification system | |
WO1995005656A1 (en) | A speaker verification system | |
Dash et al. | Speaker identification using mel frequency cepstralcoefficient and bpnn | |
Karthikeyan et al. | Hybrid machine learning classification scheme for speaker identification | |
Ozaydin | Design of a text independent speaker recognition system | |
zohra Chelali et al. | Speaker identification system based on PLP coefficients and artificial neural network | |
Shah et al. | Interactive voice response with pattern recognition based on artificial neural network approach | |
RU2161826C2 (en) | Automatic person identification method | |
Shah et al. | Neural network solution for secure interactive voice response | |
Naik et al. | Evaluation of a high performance speaker verification system for access Control | |
Olsson | Text dependent speaker verification with a hybrid HMM/ANN system | |
Sharma et al. | Text-independent speaker identification using backpropagation mlp network classifier for a closed set of speakers | |
Reda et al. | Artificial neural network & mel-frequency cepstrum coefficients-based speaker recognition | |
Das | Utterance based speaker identification using ANN | |
Melin et al. | Voice recognition with neural networks, fuzzy logic and genetic algorithms | |
Faundez-Zanuy et al. | Nonlinear predictive models: overview and possibilities in speaker recognition | |
Ren et al. | A hybrid GMM speaker verification system for mobile devices in variable environments | |
Abd Al-Rahman et al. | Using Deep Learning Neural Networks to Recognize and Authenticate the Identity of the Speaker | |
Anitha et al. | PASSWORD SECURED SPEAKER RECOGNITION USING TIME AND FREQUENCY DOMAIN FEATURES | |
Nedic et al. | Recent developments in speaker verification at IDIAP | |
Chetouani et al. | A new nonlinear feature extraction algorithm for speaker verification. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AM AT AU BB BG BR BY CA CH CN CZ DE DK ES FI GB GE HU JP KE KG KP KR KZ LK LT LU LV MD MG MN MW NL NO NZ PL PT RO RU SD SE SI SK TJ TT UA US UZ VN |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): KE MW SD AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG |
|
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
122 | Ep: pct application non-entry in european phase | ||
NENP | Non-entry into the national phase |
Ref country code: CA |