CA1246745A - Man/machine communications system using formant based speech analysis and synthesis - Google Patents

Man/machine communications system using formant based speech analysis and synthesis

Info

Publication number
CA1246745A
CA1246745A
Authority
CA
Canada
Prior art keywords
speech
vocabulary
formant
measure
reference speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
CA000503281A
Other languages
French (fr)
Inventor
Melvyn J. Hunt
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Application granted
Publication of CA1246745A
Expired

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/065 Adaptation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

MAN/MACHINE COMMUNICATIONS SYSTEM USING FORMANT BASED SPEECH
ANALYSIS AND SYNTHESIS

ABSTRACT OF THE DISCLOSURE

Formants are extracted and stored from reference speech. Input speech is suitably processed to derive unlabelled candidate formants. The sets of formants from the input and reference speech are compared using dynamic programming techniques. A further sequence comparison provides time alignment of the input and reference speech. The sequence comparisons extract a dissimilarity measure based on the formant frequencies and other characteristics of the speech. The reference speech resulting in the lowest dissimilarity measure identifies the input speech recognized by the system. System feedback may be provided and is composed of designated responsive multi-voiced speech. The multi-voiced output speech is obtained primarily by altering the prosodic parameters and formant frequencies of the designated responsive speech. Thus, in an aircraft communication system, say, the designated responsive speech may use one voice to provide an information response to the pilot's recognized spoken question and another, appropriately strident, voice to issue warnings to the pilot. The system may also be placed in a training mode to evaluate performance and adjust parameters.
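For orientation, here is a minimal Python sketch of the recognition flow the abstract describes: dynamic-programming (DTW) alignment of frame sequences followed by a lowest-total-dissimilarity decision. The frame representation and the `local_dissimilarity` callable are illustrative stand-ins, not the patent's exact formulation.

```python
# Minimal sketch, assuming frames are per-frame feature objects and
# local_dissimilarity(a, b) returns a non-negative per-frame cost.
import numpy as np

def dtw_total_dissimilarity(input_frames, reference_frames, local_dissimilarity):
    """Total dissimilarity under the best non-linear time alignment (DTW)."""
    n, m = len(input_frames), len(reference_frames)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = local_dissimilarity(input_frames[i - 1], reference_frames[j - 1])
            # Match, insertion, or deletion step in the alignment path.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

def recognize(input_frames, vocabulary, local_dissimilarity):
    """Pick the vocabulary item whose reference frames align with the
    lowest total dissimilarity, as the abstract describes."""
    return min(vocabulary, key=lambda item: dtw_total_dissimilarity(
        input_frames, vocabulary[item], local_dissimilarity))
```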

Claims (16)

The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:
1. A speech recognition system comprising:
a) means for extracting and storing from a reference speech vocabulary comprised of a plurality of vocabulary items, on a frame by frame basis, a set of formant parameters comprising frequencies and bandwidths, together with a measure of energy and a measure of spectrum balance for each item of said reference speech vocabulary;
b) means for storing vocabulary item template information for said reference vocabulary;
c) means for storing information defining syntactically allowed sequences of vocabulary items in speech to be recognized;
d) means for extracting, on a frame by frame basis, a set of unlabelled potentially errorful candidate formant parameters comprising frequencies and bandwidths, together with a measure of energy and a measure of spectrum balance for the speech to be recognized;
e) means for comparing sets of said unlabelled potentially errorful candidate formant parameters with any set of said formant parameters of said reference speech vocabulary to provide a formant dissimilarity measure between the two sets that is not unduly sensitive to errors present in either set;
f) means for comparing said measure of energy and said measure of spectrum balance for the speech to be recognized with said measure of energy and said measure of spectrum balance for the reference speech vocabulary to provide energy and spectrum balance dissimilarity measures;
g) means for combining said formant dissimilarity measure and said energy and spectrum balance dissimilarity measures to produce local dissimilarity measures;
h) means for identifying a sequence of vocabulary item templates by aligning the speech to be recognized with the reference speech vocabulary which alignment results in the lowest total dissimilarity measure, wherein the total dissimilarity measure is the sum of local dissimilarity measures over aligned frame pairs of the speech to be recognized and the reference speech vocabulary; and
i) means for outputting the identified sequence of vocabulary item templates.
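Purely as illustration of elements (e) through (g) of claim 1, a per-frame ("local") dissimilarity might combine a formant term with energy and spectrum-balance terms. The `Frame` fields, the weights, and the simple index-paired formant term below are assumptions for the sketch; the patent's error-tolerant formant comparison is the least-cost matching elaborated in claim 3.

```python
# Hedged sketch of a local dissimilarity measure; weights are assumptions.
from dataclasses import dataclass

@dataclass
class Frame:
    formant_freqs: list      # candidate formant frequencies, Hz
    formant_bws: list        # corresponding bandwidths, Hz
    energy: float            # measure of frame energy (e.g. log energy)
    spectrum_balance: float  # measure of low- vs high-band energy

def local_dissimilarity(x: Frame, r: Frame,
                        w_formant=1.0, w_energy=0.3, w_balance=0.3):
    # Placeholder formant term: index-paired frequency differences in kHz.
    # The error-tolerant version is sketched after claim 3 below.
    formant_term = sum(abs(a - b) for a, b in
                       zip(x.formant_freqs, r.formant_freqs)) / 1000.0
    energy_term = abs(x.energy - r.energy)
    balance_term = abs(x.spectrum_balance - r.spectrum_balance)
    return (w_formant * formant_term + w_energy * energy_term
            + w_balance * balance_term)
```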
2. A speaker verification system comprising:
a) means for instructing a speaker to provide speech to be recognized corresponding to at least one item of a reference speech vocabulary comprised of a plurality of vocabulary items for all speakers;
b) means for storing speaker identities corresponding to the speaker's reference speech vocabulary;
c) means for extracting and storing from said reference speech vocabulary for each speaker to be identified, on a frame by frame basis, a set of formant parameters comprising frequencies and bandwidths, together with a measure of energy and a measure of spectrum balance for each frame of each item of said reference speech vocabulary for each speaker to be identified;
d) means for storing vocabulary item template information for said reference vocabulary for each speaker to be identified;
e) means for storing information defining syntactically allowed sequences of vocabulary items for each speaker to be identified;
f) means for extracting, on a frame by frame basis, a set of unlabelled potentially errorful candidate formant parameters comprising frequencies and bandwidths, together with a measure of energy and a measure of spectrum balance of the specified sequence of vocabulary items produced by the speaker;
g) means for comparing sets of said unlabelled potentially errorful candidate formant parameters with any set of the formant parameters of said reference vocabulary to provide a formant dissimilarity measure between the two sets that is not unduly sensitive to the presence of errors in either set;
h) means for determining the syntactically allowed sequence of reference speech templates and their non-linear time alignments that minimize a local dissimilarity measure comprising said formant dissimilarity measure, energy and spectrum balance dissimilarities summed over aligned frame pairs of the frames of the speech to be recognized and the frames of the reference vocabulary;
i) means for outputting the reference vocabulary determined by the syntactically allowed sequence of reference speech templates;
j) means for identifying the reference speech vocabulary by aligning the speech to be recognized with the reference speech vocabulary which alignment results in the lowest total dissimilarity measure, wherein the total dissimilarity measure is the sum of the local dissimilarity measures over aligned frame pairs of the speech to be recognized and the reference speech vocabulary; and
k) means for outputting a positive speaker identity corresponding to the identified reference speech vocabulary if the total dissimilarity measure is below a predetermined acceptable limit.
3. The system of claim 1 wherein the total dissimilarity measure is a least cost explanation of one set in terms of the other set, whereby when each formant parameter in the reference speech vocabulary set is paired with said unlabelled potentially errorful candidate formant parameters in the speech to be recognized there is a cost that is a monotonically increasing function of a difference in their frequencies and when an unlabelled potentially errorful candidate formant parameter is left unpaired there is a cost inversely related to a confidence measure placed on that formant candidate.
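One plausible reading of claim 3's "least cost explanation", sketched under stated assumptions: the frequency-ordered reference formants are aligned against the unlabelled candidates by dynamic programming, with a pairing cost that grows monotonically with the frequency gap and, following the claim's wording, an unpairing cost inversely related to the confidence placed on a candidate. The cost scales and the handling of unexplained reference formants are assumptions.

```python
import numpy as np

def formant_dissimilarity(candidates, confidences, references,
                          pair_scale=1e-3, skip_scale=0.1, miss_cost=1.0):
    """Least-cost explanation of the reference formant set by the candidates.

    candidates, references: ascending formant frequencies in Hz;
    confidences: per-candidate confidence values in (0, 1].
    """
    n, m = len(candidates), len(references)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                # Pairing cost: monotonically increasing in the frequency gap.
                pair = pair_scale * abs(candidates[i - 1] - references[j - 1])
                D[i, j] = min(D[i, j], D[i - 1, j - 1] + pair)
            if i > 0:
                # Candidate left unpaired: cost inversely related to its
                # confidence measure (the claim's wording, taken literally).
                D[i, j] = min(D[i, j], D[i - 1, j] + skip_scale / confidences[i - 1])
            if j > 0:
                # Reference formant with no matching candidate (an assumed
                # fixed cost; the claim does not specify this case).
                D[i, j] = min(D[i, j], D[i, j - 1] + miss_cost)
    return D[n, m]
```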
4. The speaker verification system of claim 2 wherein said total dissimilarity measure is obtained by further comparing the formant parameter of the speech to be recognized with the formant parameters of the reference speech vocabulary given the determined time alignment.
5. The speaker verification system of claim 2 wherein said total dissimilarity measure is a formant dissimilarity measure.
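Claims 2, 4 and 5 amount to an accept/reject rule, sketched here by reusing the earlier `dtw_total_dissimilarity` and `local_dissimilarity` sketches. Concatenating the reference frames for the instructed word sequence is a simplification of the syntax-constrained search of claim 2(h), and the threshold value is an assumption.

```python
def verify_speaker(input_frames, claimed_identity, instructed_sequence,
                   reference_templates, threshold=50.0):
    """reference_templates: {identity: {vocab_item: reference frames}}.

    Returns (accepted, total_dissimilarity) for the claimed identity.
    """
    templates = reference_templates[claimed_identity]
    # Align the whole utterance against the reference frames for the
    # instructed sequence, as spoken by the claimed speaker at enrolment.
    reference = [f for item in instructed_sequence for f in templates[item]]
    total = dtw_total_dissimilarity(input_frames, reference, local_dissimilarity)
    # Accept only if the best alignment is good enough (claim 2(k)).
    return total < threshold, total
```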
6. A multi-voiced output system comprising:
a) means for extracting from a reference speech vocabulary set of natural speech, on a frame by frame basis, formant parameters comprising frequencies and bandwidths, energy, fundamental frequency, voiced and unvoiced decision, for each frame of said reference speech vocabulary set;
b) first means for storing at least the formant parameters, energy and voiced and unvoiced decision for each frame of said reference speech vocabulary set;
c) second means for storing syntactic and prosodic rules applicable to said reference speech vocabulary set;
d) means for selecting reference speech out of said reference speech vocabulary set, and choosing a set of parameters for modifying said selected reference speech;
e) means for modifying said selected reference speech in accordance with said chosen parameters by altering one or more of the formant parameters, energy, voiced and unvoiced decisions stored in said first means;
f) means for synthesizing said modified reference speech using an excitation waveform of duration and form similar to the excitation waveform of said selected reference speech; and
g) means for suitably analog converting and outputting said synthesized modified selected reference speech.
7. The system of claim 6 wherein:
a) said first storage means includes storage of the fundamental frequency of each frame of said reference speech vocabulary set; and
b) said modifying means includes altering the fundamental frequency of said selected reference speech.
8. The system of claim 6 wherein:
a) said first storage means also includes means for storing the bandwidth of each frame of said reference speech vocabulary set; and
b) said modifying means includes means for altering said bandwidth.
9. The system of claim 6, wherein:
a) said means for extraction includes a Laryngograph.
10. The system of claim 6, wherein:
a) said extraction means provides an error signal from a linear predictive analysis of said reference speech vocabulary set, said error signal being stored in said first storage means;
and b) said synthesizing means uses said error signal as the excitation waveform.
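Claims 6 through 10 describe producing different output voices by altering stored frame parameters before resynthesis. A hedged sketch of just the parameter-modification step follows; the `SynthFrame` fields and the scale factors for a "warning" voice are assumptions, and the actual formant synthesis from an excitation waveform is not shown.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SynthFrame:
    formant_freqs: tuple  # formant frequencies, Hz
    formant_bws: tuple    # formant bandwidths, Hz
    energy: float         # frame energy
    f0: float             # fundamental frequency, Hz (ignored if unvoiced)
    voiced: bool          # voiced/unvoiced decision

def modify_voice(frames, formant_scale=1.0, f0_scale=1.0, energy_scale=1.0):
    """Return frames with altered formant frequencies, pitch and energy."""
    return [replace(f,
                    formant_freqs=tuple(formant_scale * x for x in f.formant_freqs),
                    f0=f0_scale * f.f0 if f.voiced else f.f0,
                    energy=energy_scale * f.energy)
            for f in frames]

def warning_voice(frames):
    # An assumed "appropriately strident" voice: raised formants,
    # higher pitch, and more energy than the neutral stored voice.
    return modify_voice(frames, formant_scale=1.1, f0_scale=1.3, energy_scale=1.5)
```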
11. A man/machine speech communications system comprising:
a) means for extracting and storing from a reference speech vocabulary comprised of a plurality of vocabulary items, on a frame by frame basis, formant parameters comprising frequencies and bandwidths, energy and spectrum balance measures for each frame of said reference speech vocabulary, said reference speech vocabulary being divided into a recognition speech vocabulary and an output speech vocabulary;
b) means for storing vocabulary item template information for said recognition speech reference vocabulary;
c) means for storing information defining syntactically allowed sequences of vocabulary items in speech to be recognized;
d) means for extracting, on a frame by frame basis, unlabelled potentially errorful candidate formant parameters comprising frequencies and bandwidths, energy and spectrum balance measures for each frame of said speech to be recognized;
e) means for comparing sets of said unlabelled potentially errorful candidate formant parameters with any set of the recognition speech formant parameters to provide a formant dissimilarity measure between the two sets that is not unduly sensitive to the presence of errors in either set;
f) means for determining the syntactically allowed sequence of recognition speech templates and their non-linear time alignments that minimize a total dissimilarity measure comprising at least said formant dissimilarity measure, energy and spectrum balance dissimilarities summed over aligned frame pairs;
g) means for outputting a signal indicative of the recognition speech template having the lowest total dissimilarity measure;

h) said means for extracting and storing further including extraction and storage of fundamental frequency, voiced and unvoiced decision, for each frame of said output speech vocabulary;
i) means for storage of syntactic and prosodic rules applicable to said output speech vocabulary;
j) means for selecting a reference speech out of said output speech vocabulary responsive to said output of a signal indicative of recognition speech template, and means for choosing a set of parameters for modifying said selected output speech;
k) means for modifying the characteristics of said selected output speech in accordance with said chosen parameters by altering one or more of said stored formant parameters, energy or duration or form of the excitation waveform of said selected output speech;
l) means for synthesizing said modified selected output speech; and
m) means for suitably analog converting and outputting said synthesized modified selected output reference speech.
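Tying the sketches together, claim 11's end-to-end loop could look like the following: recognize the input, select a designated response, and voice it with parameters suited to the message type. The response table and the voice functions are illustrative; `recognize`, `local_dissimilarity` and `warning_voice` are the earlier sketches.

```python
def respond(input_frames, recognition_vocab, responses, output_vocab):
    """responses: {recognized_item: (reply_item, voice_function)}."""
    item = recognize(input_frames, recognition_vocab, local_dissimilarity)
    reply, voice = responses[item]
    # e.g. an information reply in a neutral voice, or a pilot warning
    # rendered by warning_voice() for an appropriately strident delivery.
    return voice(output_vocab[reply])  # modified frames, ready for resynthesis
```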
12. A speech recognition method comprising the steps of:
a) extracting and storing from a reference speech vocabulary comprised of a plurality of vocabulary items, on a frame by frame basis, a set of formant parameters comprising frequencies and bandwidths, a measure of energy and a measure of spectrum balance for each frame of each item of said reference speech vocabulary;
b) storing vocabulary item template information for said reference speech vocabulary;
c) storing information defining allowed sequences of vocabulary items in speech to be recognized;
d) extracting, on a frame by frame basis, unlabelled candidate formant parameters comprising frequencies and bandwidths, a measure of energy and a measure of spectrum balance for the speech to be recognized;
e) comparing sets of said unlabelled candidate formant parameters with any set of formant parameters of said reference speech vocabulary to provide a formant dissimilarity measure between the two sets that is not unduly sensitive to the presence of errors in either set;
f) comparing energy and spectrum balance measures for the speech to be recognized with the reference speech vocabulary;
g) determining the syntactically allowed sequence of vocabulary item template information and their non-linear time alignments with the allowed sequence of vocabulary items in the speech to be recognized that minimize a total dissimilarity measure, said total dissimilarity measure comprising said formant dissimilarity measure, energy and spectrum balance dissimilarities summed over aligned frame pairs; and
h) outputting the determined sequence of vocabulary items corresponding to the template.
13. A speaker verification method comprising the steps of:
a) extracting and storing from a reference speech vocabulary, for each speaker to be identified, on a frame by frame basis, formant frequencies and bandwidths, energy and spectrum balance;
b) storing whole-word template information for said reference vocabulary;
c) storing information defining sequences of words in the reference speech vocabulary;
d) instructing a speaker to say a specified sequence of words and to identify himself or herself;
e) extracting on a frame by frame basis unlabelled candidate formant frequencies and bandwidths, energy and spectrum balance of the speaker's words;
f) comparing sets of unlabelled candidate formant frequencies and bandwidths with the formant frequencies and bandwidths of the reference speech for the identified speaker to provide a formant dissimilarity measure between the two sets that is not unduly sensitive to the presence of errors in either set;
g) comparing sets of the energy and spectrum balance to provide a further dissimilarity measure which is combined with the formant dissimilarity measure to provide a total dissimilarity measure;
h) determining the time alignment of the specified sequence of words with the reference speech templates corresponding to the speaker's claimed identity that minimizes the total summed formant dissimilarity measure over aligned frame pairs; and
i) measuring the equivalence between the time aligned specified sequence of words and the reference speech templates and determining whether the equivalence is above an acceptable lower limit for speaker verification.
14. A method of providing a multi-voiced output comprising the steps of:
a) extracting from a reference speech vocabulary, on a frame by frame basis, formant parameters comprising frequencies and bandwidths, energy, fundamental frequency, voiced and unvoiced decision, for each frame of said reference speech vocabulary;
b) storing in a first means at least said formant parameters, energy and voiced and unvoiced decision for each frame of said reference speech vocabulary;
c) storing in a second means syntactic and prosodic rules applicable to said reference speech vocabulary;
d) selecting reference speech out of said reference speech vocabulary, and choosing a set of parameters for modifying said selected reference speech;
e) modifying the characteristics of said selected reference speech in accordance with said chosen set of parameters by altering one or more of said stored formant parameters, energy or duration or form of the excitation waveform of said selected reference speech;
f) re-synthesizing said modified selected reference speech; and
g) suitably analog converting and outputting said re-synthesized modified selected reference speech.
15. The system of claim 1, further characterized by:
a) means for extracting and storing boundaries of vocabulary items for the speech to be recognized from the speech recognition system;
b) means for extracting and storing boundaries of vocabulary items for the speech to be recognized independently of said speech recognition system;
c) means for determining the correspondence between the two sets of vocabulary item boundaries;
d) means for identifying and storing vocabulary item templates of the speech to be recognized independently of said speech recognition system;
e) means for comparing the identified sequence of vocabulary item templates from said speech recognition system with the corresponding independently identified and stored vocabulary item templates within said independently extracted and stored vocabulary item boundaries;
f) means for outputting a reliability measure of said speech recognition system as a result of at least a portion of the correspondence determined between the two sets of vocabulary item boundaries and identified sequence comparison of the two sets of vocabulary item templates.
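One plausible form of the reliability measure in claim 15, under stated assumptions: compare the recognizer's vocabulary-item boundaries and labels against the independently derived ones and report the fraction that agree. The frame tolerance, the one-to-one item correspondence, and the equal weighting are all illustrative choices.

```python
def reliability_measure(recognized, independent, boundary_tolerance=3):
    """recognized, independent: lists of (label, start_frame, end_frame),
    assumed to correspond item for item."""
    pairs = list(zip(recognized, independent))
    if not pairs:
        return 0.0
    boundary_hits = sum(abs(r[1] - i[1]) <= boundary_tolerance and
                        abs(r[2] - i[2]) <= boundary_tolerance
                        for r, i in pairs)
    label_hits = sum(r[0] == i[0] for r, i in pairs)
    # Equal-weight combination of boundary and label agreement.
    return 0.5 * (boundary_hits + label_hits) / len(pairs)
```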
16. The system of claim 15 further characterized by:
a) means for constraining said means for identifying a sequence of vocabulary item templates to match said corresponding vocabulary items identified by the independent means;
b) said means for comparing the identified sequence of vocabulary item templates including means for passing said speech to be recognized through a portion of said speech recognition system at least twice.
CA000503281A 1985-03-25 1986-03-04 Man/machine communications system using formant based speech analysis and synthesis Expired CA1246745A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US71544385A 1985-03-25 1985-03-25
US715,443 1985-03-25

Publications (1)

Publication Number Publication Date
CA1246745A true CA1246745A (en) 1988-12-13

Family

ID=24874072

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000503281A Expired CA1246745A (en) 1985-03-25 1986-03-04 Man/machine communications system using formant based speech analysis and synthesis

Country Status (1)

Country Link
CA (1) CA1246745A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0645757B1 (en) * 1993-09-23 2000-04-05 Xerox Corporation Semantic co-occurrence filtering for speech recognition and signal transcription applications
WO1996033486A1 (en) * 1995-04-18 1996-10-24 Oriol Espar Figueras Speech recognition process and device
ES2110899A1 (en) * 1995-04-18 1998-02-16 Figueras Oriol Espar Speech recognition process and device
CN112951245A (en) * 2021-03-09 2021-06-11 江苏开放大学(江苏城市职业学院) Dynamic voiceprint feature extraction method integrated with static component
CN112951245B (en) * 2021-03-09 2023-06-16 江苏开放大学(江苏城市职业学院) Dynamic voiceprint feature extraction method integrated with static component
CN115879405A (en) * 2023-02-24 2023-03-31 湖南遥光科技有限公司 Circuit performance detection method, computer storage medium and terminal device
CN115879405B (en) * 2023-02-24 2023-11-17 湖南遥光科技有限公司 Circuit performance detection method, computer storage medium and terminal equipment
CN118173102A (en) * 2024-05-15 2024-06-11 百鸟数据科技(北京)有限责任公司 Bird voiceprint recognition method in complex scene
CN118173102B (en) * 2024-05-15 2024-07-16 百鸟数据科技(北京)有限责任公司 Bird voiceprint recognition method in complex scene

Similar Documents

Publication Publication Date Title
Yoshimura et al. Mixed excitation for HMM-based speech synthesis.
DE69831076T2 (en) METHOD AND DEVICE FOR LANGUAGE ANALYSIS AND SYNTHESIS BY ALLPASS-SIEB CHAIN FILTERS
US4624011A (en) Speech recognition system
Vepa et al. New objective distance measures for spectral discontinuities in concatenative speech synthesis
US5144672A (en) Speech recognition apparatus including speaker-independent dictionary and speaker-dependent
Hunt et al. Speaker dependent and independent speech recognition experiments with an auditory model
Bocklet et al. Age and gender recognition based on multiple systems-early vs. late fusion.
US5202926A (en) Phoneme discrimination method
Chetouani et al. A New Nonlinear speaker parameterization algorithm for speaker identification
Teixeira et al. Prosodic features for automatic text-independent evaluation of degree of nativeness for language learners.
Elenius et al. Effects of emphasizing transitional or stationary parts of the speech signal in a discrete utterance recognition system
Hansen et al. Robust speech recognition training via duration and spectral-based stress token generation
CA1246745A (en) Man/machine communications system using formant based speech analysis and synthesis
US4924518A (en) Phoneme similarity calculating apparatus
Wicaksana et al. Spoken language identification on local language using MFCC, random forest, KNN, and GMM
Dawande et al. Analysis of different feature extraction techniques for speaker recognition system: A review
Siegel et al. A pattern classification algorithm for the voiced/unvoiced decision
Aull et al. Lexical stress and its application in large vocabulary speech recognition
Pedone et al. Phoneme-level text to audio synchronization on speech signals with background music
Dutono et al. Effects of compound parameters on speaker-independent word recognition
Fu et al. Polynomial-Decomposition-Based LPC for Formant Estimation
KR19990050440A (en) Voice recognition method and voice recognition device using voiced, unvoiced and silent section information
Samouelian Frame-level phoneme classification using inductive inference
Pellom et al. Spectral normalization employing hidden Markov modeling of line spectrum pair frequencies
Mariani et al. Acoustic-phonetic recognition of connected speech using transient information

Legal Events

Date Code Title Description
MKEX Expiry