CA1246745A - Man/machine communications system using formant based speech analysis and synthesis - Google Patents

Man/machine communications system using formant based speech analysis and synthesis

Info

Publication number
CA1246745A
CA1246745A
Authority
CA
Canada
Prior art keywords
speech
vocabulary
formant
measure
reference speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
CA000503281A
Other languages
French (fr)
Inventor
Melvyn J. Hunt
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Application granted
Publication of CA1246745A
Expired

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/065 Adaptation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

MAN/MACHINE COMMUNICATIONS SYSTEM USING FORMANT BASED SPEECH
ANALYSIS AND SYNTHESIS

ABSTRACT OF THE DISCLOSURE

Formants are extracted and stored from reference speech. Input speech is suitably processed to derive unlabelled candidate formants. The sets of formants from the input and reference speech are compared using dynamic programming techniques. A further sequence comparison provides time alignment of the input and reference speech. The sequence comparisons extract a dissimilarity measure based on the formant frequencies and other characteristics of the speech. The reference speech resulting in the lowest dissimilarity measure identifies the input speech recognized by the system. System feedback may be provided and is composed of designated responsive multi-voiced speech. The multi-voiced output speech is obtained primarily by altering the prosodic parameters and formant frequencies of the designated responsive speech. Thus, in an aircraft communication system, say, the designated responsive speech may use one voice to provide an information response to the pilot's recognized spoken question and another, appropriately strident, voice to issue warnings to the pilot. The system may also be placed in a training mode to evaluate performance and adjust parameters.
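For orientation, here is a minimal Python sketch of the recognition flow the abstract describes: dynamic-programming (DTW) alignment of frame sequences followed by a lowest-total-dissimilarity decision. The frame representation and the `local_dissimilarity` callable are illustrative stand-ins, not the patent's exact formulation.

```python
# Minimal sketch, assuming frames are per-frame feature objects and
# local_dissimilarity(a, b) returns a non-negative per-frame cost.
import numpy as np

def dtw_total_dissimilarity(input_frames, reference_frames, local_dissimilarity):
    """Total dissimilarity under the best non-linear time alignment (DTW)."""
    n, m = len(input_frames), len(reference_frames)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = local_dissimilarity(input_frames[i - 1], reference_frames[j - 1])
            # Match, insertion, or deletion step in the alignment path.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

def recognize(input_frames, vocabulary, local_dissimilarity):
    """Pick the vocabulary item whose reference frames align with the
    lowest total dissimilarity, as the abstract describes."""
    return min(vocabulary, key=lambda item: dtw_total_dissimilarity(
        input_frames, vocabulary[item], local_dissimilarity))
```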

Claims (16)

The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:
1. A speech recognition system comprising:
a) means for extracting and storing from a reference speech vocabulary comprised of a plurality of vocabulary items, on a frame by frame basis, a set of formant parameters comprising frequencies and bandwidths, together with a measure of energy and a measure of spectrum balance for each item of said reference speech vocabulary;
b) means for storing vocabulary item template information for said reference vocabulary;
c) means for storing information defining syntactically allowed sequences of vocabulary items in speech to be recognized;
d) means for extracting, on a frame by frame basis, a set of unlabelled potentially errorful candidate formant parameters comprising frequencies and bandwidths, together with a measure of energy and a measure of spectrum balance for the speech to be recognized;
e) means for comparing sets of said unlabelled potentially errorful candidate formant parameters with any set of said formant parameters of said reference speech vocabulary to provide a formant dissimilarity measure between the two sets that is not unduly sensitive to errors present in either set;
f) means for comparing said measure of energy and said measure of spectrum balance for the speech to be recognized with said measure of energy and said measure of spectrum balance for the reference speech vocabulary to provide energy and spectrum balance dissimilarity measures;
g) means for combining said formant dissimilarity measure and said energy and spectrum balance dissimilarity measures to produce local dissimilarity measures;
h) means for identifying a sequence of vocabulary item templates by aligning the speech to be recognized with the reference speech vocabulary which alignment results in the lowest total dissimilarity measure, wherein the total dissimilarity measure is the sum of local dissimilarity measures over aligned frame pairs of the speech to be recognized and the reference speech vocabulary; and
i) means for outputting the identified sequence of vocabulary item templates.
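Purely as illustration of elements (e) through (g) of claim 1, a per-frame ("local") dissimilarity might combine a formant term with energy and spectrum-balance terms. The `Frame` fields, the weights, and the simple index-paired formant term below are assumptions for the sketch; the patent's error-tolerant formant comparison is the least-cost matching elaborated in claim 3.

```python
# Hedged sketch of a local dissimilarity measure; weights are assumptions.
from dataclasses import dataclass

@dataclass
class Frame:
    formant_freqs: list      # candidate formant frequencies, Hz
    formant_bws: list        # corresponding bandwidths, Hz
    energy: float            # measure of frame energy (e.g. log energy)
    spectrum_balance: float  # measure of low- vs high-band energy

def local_dissimilarity(x: Frame, r: Frame,
                        w_formant=1.0, w_energy=0.3, w_balance=0.3):
    # Placeholder formant term: index-paired frequency differences in kHz.
    # The error-tolerant version is sketched after claim 3 below.
    formant_term = sum(abs(a - b) for a, b in
                       zip(x.formant_freqs, r.formant_freqs)) / 1000.0
    energy_term = abs(x.energy - r.energy)
    balance_term = abs(x.spectrum_balance - r.spectrum_balance)
    return (w_formant * formant_term + w_energy * energy_term
            + w_balance * balance_term)
```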
2. A speaker verification system comprising:
a) means for instructing a speaker to provide speech to be recognized corresponding to at least one item of a reference speech vocabulary comprised of a plurality of vocabulary items for all speakers;
b) means for storing speaker identities corresponding to the speaker's reference speech vocabulary;
c) means for extracting and storing from said reference speech vocabulary for each speaker to be identified, on a frame by frame basis, a set of formant parameters comprising frequencies and bandwidths, together with a measure of energy and a measure of spectrum balance for each frame of each item of said reference speech vocabulary for each speaker to be identified;
d) means for storing vocabulary item template information for said reference vocabulary for each speaker to be identified;
e) means for storing information defining syntactically allowed sequences of vocabulary items for each speaker to be identified;
f) means for extracting, on a frame by frame basis, a set of unlabelled potentially errorful candidate formant parameters comprising frequencies and bandwidths, together with a measure of energy and a measure of spectrum balance of the specified sequence of vocabulary items produced by the speaker;
g) means for comparing sets of said unlabelled potentially errorful candidate formant parameters with any set of the formant parameters of said reference vocabulary to provide a formant dissimilarity measure between the two sets that is not unduly sensitive to the presence of errors in either set;
h) means for determining the syntactically allowed sequence of reference speech templates and their non-linear time alignments that minimize a local dissimilarity measure comprising said formant dissimilarity measure, energy and spectrum balance dissimilarities summed over aligned frame pairs of the frames of the speech to be recognized and the frames of the reference vocabulary;
i) means for outputting the reference vocabulary determined by the syntactically allowed sequence of reference speech templates;
j) means for identifying the reference speech vocabulary by aligning the speech to be recognized with the reference speech vocabulary which alignment results in the lowest total dissimilarity measure, wherein the total dissimilarity measure is the sum of the local dissimilarity measures over aligned frame pairs of the speech to be recognized and the reference speech vocabulary; and
k) means for outputting a positive speaker identity corresponding to the identified reference speech vocabulary if the total dissimilarity measure is below a predetermined acceptable limit.
3. The system of claim 1 wherein the total dissimilarity measure is a least cost explanation of one set in terms of the other set, whereby when each formant parameter in the reference speech vocabulary set is paired with said unlabelled potentially errorful candidate formant parameters in the speech to be recognized there is a cost that is a monotonically increasing function of a difference in their frequencies and when an unlabelled potentially errorful candidate formant parameter is left unpaired there is a cost inversely related to a confidence measure placed on that formant candidate.
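One plausible reading of claim 3's "least cost explanation", sketched under stated assumptions: the frequency-ordered reference formants are aligned against the unlabelled candidates by dynamic programming, with a pairing cost that grows monotonically with the frequency gap and, following the claim's wording, an unpairing cost inversely related to the confidence placed on a candidate. The cost scales and the handling of unexplained reference formants are assumptions.

```python
import numpy as np

def formant_dissimilarity(candidates, confidences, references,
                          pair_scale=1e-3, skip_scale=0.1, miss_cost=1.0):
    """Least-cost explanation of the reference formant set by the candidates.

    candidates, references: ascending formant frequencies in Hz;
    confidences: per-candidate confidence values in (0, 1].
    """
    n, m = len(candidates), len(references)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                # Pairing cost: monotonically increasing in the frequency gap.
                pair = pair_scale * abs(candidates[i - 1] - references[j - 1])
                D[i, j] = min(D[i, j], D[i - 1, j - 1] + pair)
            if i > 0:
                # Candidate left unpaired: cost inversely related to its
                # confidence measure (the claim's wording, taken literally).
                D[i, j] = min(D[i, j], D[i - 1, j] + skip_scale / confidences[i - 1])
            if j > 0:
                # Reference formant with no matching candidate (an assumed
                # fixed cost; the claim does not specify this case).
                D[i, j] = min(D[i, j], D[i, j - 1] + miss_cost)
    return D[n, m]
```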
4. The speaker verification system of claim 2 wherein said total dissimilarity measure is obtained by further comparing the formant parameter of the speech to be recognized with the formant parameters of the reference speech vocabulary given the determined time alignment.
5. The speaker verification system of claim 2 wherein said total dissimilarity measure is a formant dissimilarity measure.
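Claims 2, 4 and 5 amount to an accept/reject rule, sketched here by reusing the earlier `dtw_total_dissimilarity` and `local_dissimilarity` sketches. Concatenating the reference frames for the instructed word sequence is a simplification of the syntax-constrained search of claim 2(h), and the threshold value is an assumption.

```python
def verify_speaker(input_frames, claimed_identity, instructed_sequence,
                   reference_templates, threshold=50.0):
    """reference_templates: {identity: {vocab_item: reference frames}}.

    Returns (accepted, total_dissimilarity) for the claimed identity.
    """
    templates = reference_templates[claimed_identity]
    # Align the whole utterance against the reference frames for the
    # instructed sequence, as spoken by the claimed speaker at enrolment.
    reference = [f for item in instructed_sequence for f in templates[item]]
    total = dtw_total_dissimilarity(input_frames, reference, local_dissimilarity)
    # Accept only if the best alignment is good enough (claim 2(k)).
    return total < threshold, total
```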
6. A multi-voiced output system comprising:
a) means for extracting from a reference speech vocabulary set of natural speech, on a frame by frame basis, formant parameters comprising frequencies and bandwidths, energy, fundamental frequency, voiced and unvoiced decision, for each frame of said reference speech vocabulary set;
b) first means for storing at least the formant parameters, energy and voiced and unvoiced decision for each frame of said reference speech vocabulary set;
c) second means for storing syntactic and prosodic rules applicable to said reference speech vocabulary set;
d) means for selecting reference speech out of said reference speech vocabulary set, and choosing a set of parameters for modifying said selected reference speech;
e) means for modifying said selected reference speech in accordance with said chosen parameters by altering one or more of the formant parameters, energy, voiced and unvoiced decisions stored in said first means;
f) means for synthesizing said modified reference speech using an excitation waveform of duration and form similar to the excitation waveform of said selected reference speech; and
g) means for suitably analog converting and outputting said synthesized modified selected reference speech.
7. The system of claim 6 wherein:
a) said first storage means includes storage of the fundamental frequency of each frame of said reference speech vocabulary set; and
b) said modifying means includes altering the fundamental frequency of said selected reference speech.
8. The system of claim 6 wherein:
a) said first storage means also includes means for storing the bandwidth of each frame of said reference speech vocabulary set; and
b) said modifying means includes means for altering said bandwidth.
9. The system of claim 6, wherein:
a) said means for extraction includes a Laryngograph.
10. The system of claim 6, wherein:
a) said extraction means provides an error signal from a linear predictive analysis of said reference speech vocabulary set, said error signal being stored in said first storage means;
and b) said synthesizing means uses said error signal as the excitation waveform.
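Claims 6 through 10 describe producing different output voices by altering stored frame parameters before resynthesis. A hedged sketch of just the parameter-modification step follows; the `SynthFrame` fields and the scale factors for a "warning" voice are assumptions, and the actual formant synthesis from an excitation waveform is not shown.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SynthFrame:
    formant_freqs: tuple  # formant frequencies, Hz
    formant_bws: tuple    # formant bandwidths, Hz
    energy: float         # frame energy
    f0: float             # fundamental frequency, Hz (ignored if unvoiced)
    voiced: bool          # voiced/unvoiced decision

def modify_voice(frames, formant_scale=1.0, f0_scale=1.0, energy_scale=1.0):
    """Return frames with altered formant frequencies, pitch and energy."""
    return [replace(f,
                    formant_freqs=tuple(formant_scale * x for x in f.formant_freqs),
                    f0=f0_scale * f.f0 if f.voiced else f.f0,
                    energy=energy_scale * f.energy)
            for f in frames]

def warning_voice(frames):
    # An assumed "appropriately strident" voice: raised formants,
    # higher pitch, and more energy than the neutral stored voice.
    return modify_voice(frames, formant_scale=1.1, f0_scale=1.3, energy_scale=1.5)
```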
11. A man/machine speech communications system comprising:
a) means for extracting and storing from a reference speech vocabulary comprised of a plurality of vocabulary items, on a frame by frame basis, formant parameters comprising frequencies and bandwidths, energy and spectrum balance measures for each frame of said reference speech vocabulary, said reference speech vocabulary being divided into a recognition speech vocabulary and an output speech vocabulary;
b) means for storing vocabulary item template information for said recognition speech reference vocabulary;
c) means for storing information defining syntactically allowed sequences of vocabulary items in speech to be recognized;
d) means for extracting, on a frame by frame basis, unlabelled potentially errorful candidate formant parameters comprising frequencies and bandwidths, energy and spectrum balance measures for each frame of said speech to be recognized;
e) means for comparing sets of said unlabelled potentially errorful candidate formant parameters with any set of the recognition speech formant parameters to provide a formant dissimilarity measure between the two sets that is not unduly sensitive to the presence of errors in either set;
f) means for determining the syntactically allowed sequence of recognition speech templates and their non-linear time alignments that minimize a total dissimilarity measure comprising at least said formant dissimilarity measure, energy and spectrum balance dissimilarities summed over aligned frame pairs;
g) means for outputting a signal indicative of the recognition speech template having the lowest total dissimilarity measure;

h) said means for extracting and storing further including extraction and storage of fundamental frequency, voiced and unvoiced decision, for each frame of said output speech vocabulary;
i) means for storage of syntactic and prosodic rules applicable to said output speech vocabulary;
j) means for selecting a reference speech out of said output speech vocabulary responsive to said output of a signal indicative of recognition speech template, and means for choosing a set of parameters for modifying said selected output speech;
k) means for modifying the characteristics of said selected output speech in accordance with said chosen parameters by altering one or more of said stored formant parameters, energy or duration or form of the excitation waveform of said selected output speech;
l) means for synthesizing said modified selected output speech; and
m) means for suitably analog converting and outputting said synthesized modified selected output reference speech.
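Tying the sketches together, claim 11's end-to-end loop could look like the following: recognize the input, select a designated response, and voice it with parameters suited to the message type. The response table and the voice functions are illustrative; `recognize`, `local_dissimilarity` and `warning_voice` are the earlier sketches.

```python
def respond(input_frames, recognition_vocab, responses, output_vocab):
    """responses: {recognized_item: (reply_item, voice_function)}."""
    item = recognize(input_frames, recognition_vocab, local_dissimilarity)
    reply, voice = responses[item]
    # e.g. an information reply in a neutral voice, or a pilot warning
    # rendered by warning_voice() for an appropriately strident delivery.
    return voice(output_vocab[reply])  # modified frames, ready for resynthesis
```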
12. A speech recognition method comprising the steps of:
a) extracting and storing from a reference speech vocabulary comprised of a plurality of vocabulary items, on a frame by frame basis, a set of formant parameters comprising frequencies and bandwidths, a measure of energy and a measure of spectrum balance for each frame of each item of said reference speech vocabulary;
b) storing vocabulary item template information for said reference speech vocabulary;
c) storing information defining allowed sequences of vocabulary items in speech to be recognized;
d) extracting, on a frame by frame basis, unlabelled candidate formant parameters comprising frequencies and bandwidths, a measure of energy and a measure of spectrum balance for the speech to be recognized;
e) comparing sets of said unlabelled candidate formant parameters with any set of formant parameters of said reference speech vocabulary to provide a formant dissimilarity measure between the two sets that is not unduly sensitive to the presence of errors in either set;
f) comparing energy and spectrum balance measures for the speech to be recognized with the reference speech vocabulary;
g) determining the syntactically allowed sequence of vocabulary item template information and their non-linear time alignments with the allowed sequence of vocabulary items in the speech to be recognized that minimize a total dissimilarity measure, said total dissimilarity measure comprising said formant dissimilarity measure, energy and spectrum balance dissimilarities summed over aligned frame pairs; and
h) outputting the determined sequence of vocabulary items corresponding to the template.
13. A speaker verification method comprising the steps of:
a) extracting and storing from a reference speech vocabulary, for each speaker to be identified, on a frame by frame basis, formant frequencies and bandwidths, energy and spectrum balance;
b) storing whole-word template information for said reference vocabulary;
c) storing information defining sequences of words in the reference speech vocabulary;
d) instructing a speaker to say a specified sequence of words and to identify himself or herself;
e) extracting on a frame by frame basis unlabelled candidate formant frequencies and bandwidths, energy and spectrum balance of the speaker's words;
f) comparing sets of unlabelled candidate formant frequencies and bandwidths with the formant frequencies and bandwidths of the reference speech for the identified speaker to provide a formant dissimilarity measure between the two sets that is not unduly sensitive to the presence of errors in either set;
g) comparing sets of the energy and spectrum balance to provide a further dissimilarity measure which is combined with the formant dissimilarity measure to provide a total dissimilarity measure;
h) determining the time alignment of the specified sequence of words with the reference speech templates corresponding to the speaker's claimed identity that minimizes the total summed formant dissimilarity measure over aligned frame pairs; and
i) measuring the equivalence between the time aligned specified sequence of words and the reference speech templates and determining whether the equivalence is above an acceptable lower limit for speaker verification.
14. A method of providing a multi-voiced output comprising the steps of:
a) extracting from a reference speech vocabulary, on a frame by frame basis, formant parameters comprising frequencies and bandwidths, energy, fundamental frequency, voiced and unvoiced decision, for each frame of said reference speech vocabulary;
b) storing in a first means at least said formant parameters, energy and voiced and unvoiced decision for each frame of said reference speech vocabulary;
c) storing in a second means syntactic and prosodic rules applicable to said reference speech vocabulary;
d) selecting reference speech out of said reference speech vocabulary, and choosing a set of parameters for modifying said selected reference speech;
e) modifying the characteristics of said selected reference speech in accordance with said chosen set of parameters by altering one or more of said stored formant parameters, energy or duration or form of the excitation waveform of said selected reference speech;
f) re-synthesizing said modified selected reference speech; and
g) suitably analog converting and outputting said re-synthesized modified selected reference speech.
15. The system of claim 1, further characterized by:
a) means for extracting and storing boundaries of vocabulary items for the speech to be recognized from the speech recognition system;
b) means for extracting and storing boundaries of vocabulary items for the speech to be recognized independently of said speech recognition system;
c) means for determining the correspondence between the two sets of vocabulary item boundaries;
d) means for identifying and storing vocabulary item templates of the speech to be recognized independently of said speech recognition system;
e) means for comparing the identified sequence of vocabulary item templates from said speech recognition system with the corresponding independently identified and stored vocabulary item templates within said independently extracted and stored vocabulary item boundaries;
f) means for outputting a reliability measure of said speech recognition system as a result of at least a portion of the correspondence determined between the two sets of vocabulary item boundaries and identified sequence comparison of the two sets of vocabulary item templates.
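One plausible form of the reliability measure in claim 15, under stated assumptions: compare the recognizer's vocabulary-item boundaries and labels against the independently derived ones and report the fraction that agree. The frame tolerance, the one-to-one item correspondence, and the equal weighting are all illustrative choices.

```python
def reliability_measure(recognized, independent, boundary_tolerance=3):
    """recognized, independent: lists of (label, start_frame, end_frame),
    assumed to correspond item for item."""
    pairs = list(zip(recognized, independent))
    if not pairs:
        return 0.0
    boundary_hits = sum(abs(r[1] - i[1]) <= boundary_tolerance and
                        abs(r[2] - i[2]) <= boundary_tolerance
                        for r, i in pairs)
    label_hits = sum(r[0] == i[0] for r, i in pairs)
    # Equal-weight combination of boundary and label agreement.
    return 0.5 * (boundary_hits + label_hits) / len(pairs)
```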
16. The system of claim 15 further characterized by:
a) means for constraining said means for identifying a sequence of vocabulary item templates to match said corresponding vocabulary items identified by the independent means;
b) said means for comparing the identified sequence of vocabulary item templates including means for passing said speech to be recognized through a portion of said speech recognition system at least twice.
CA000503281A 1985-03-25 1986-03-04 Man/machine communications system using formant based speech analysis and synthesis Expired CA1246745A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US71544385A 1985-03-25 1985-03-25
US715,443 1985-03-25

Publications (1)

Publication Number Publication Date
CA1246745A true CA1246745A (en) 1988-12-13

Family

ID=24874072

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000503281A Expired CA1246745A (en) 1985-03-25 1986-03-04 Man/machine communications system using formant based speech analysis and synthesis

Country Status (1)

Country Link
CA (1) CA1246745A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0645757B1 (en) * 1993-09-23 2000-04-05 Xerox Corporation Semantic co-occurrence filtering for speech recognition and signal transcription applications
WO1996033486A1 (en) * 1995-04-18 1996-10-24 Oriol Espar Figueras Speech recognition process and device
ES2110899A1 (en) * 1995-04-18 1998-02-16 Figueras Oriol Espar Speech recognition process and device
CN112951245A (en) * 2021-03-09 2021-06-11 江苏开放大学(江苏城市职业学院) Dynamic voiceprint feature extraction method integrated with static component
CN112951245B (en) * 2021-03-09 2023-06-16 江苏开放大学(江苏城市职业学院) Dynamic voiceprint feature extraction method integrated with static component
CN115879405A (en) * 2023-02-24 2023-03-31 湖南遥光科技有限公司 Circuit performance detection method, computer storage medium and terminal device
CN115879405B (en) * 2023-02-24 2023-11-17 湖南遥光科技有限公司 Circuit performance detection method, computer storage medium and terminal equipment
CN118173102A (en) * 2024-05-15 2024-06-11 百鸟数据科技(北京)有限责任公司 Bird voiceprint recognition method in complex scene
CN118173102B (en) * 2024-05-15 2024-07-16 百鸟数据科技(北京)有限责任公司 Bird voiceprint recognition method in complex scene

Similar Documents

Publication Publication Date Title
Yoshimura et al. Mixed excitation for HMM-based speech synthesis.
DE69831076T2 (en) METHOD AND DEVICE FOR LANGUAGE ANALYSIS AND SYNTHESIS BY ALLPASS-SIEB CHAIN FILTERS
US4624011A (en) Speech recognition system
Vepa et al. New objective distance measures for spectral discontinuities in concatenative speech synthesis
US5144672A (en) Speech recognition apparatus including speaker-independent dictionary and speaker-dependent
Hunt et al. Speaker dependent and independent speech recognition experiments with an auditory model
Bocklet et al. Age and gender recognition based on multiple systems-early vs. late fusion.
US5202926A (en) Phoneme discrimination method
Chetouani et al. A New Nonlinear speaker parameterization algorithm for speaker identification
Teixeira et al. Prosodic features for automatic text-independent evaluation of degree of nativeness for language learners.
Elenius et al. Effects of emphasizing transitional or stationary parts of the speech signal in a discrete utterance recognition system
Hansen et al. Robust speech recognition training via duration and spectral-based stress token generation
CA1246745A (en) Man/machine communications system using formant based speech analysis and synthesis
US4924518A (en) Phoneme similarity calculating apparatus
Wicaksana et al. Spoken language identification on local language using MFCC, random forest, KNN, and GMM
Dawande et al. Analysis of different feature extraction techniques for speaker recognition system: A review
Siegel et al. A pattern classification algorithm for the voiced/unvoiced decision
Aull et al. Lexical stress and its application in large vocabulary speech recognition
Pedone et al. Phoneme-level text to audio synchronization on speech signals with background music
Dutono et al. Effects of compound parameters on speaker-independent word recognition
Fu et al. Polynomial-Decomposition-Based LPC for Formant Estimation
KR19990050440A (en) Voice recognition method and voice recognition device using voiced, unvoiced and silent section information
Samouelian Frame-level phoneme classification using inductive inference
Pellom et al. Spectral normalization employing hidden Markov modeling of line spectrum pair frequencies
Mariani et al. Acoustic-phonetic recognition of connected speech using transient information

Legal Events

Date Code Title Description
MKEX Expiry