CA1246745A - Man/machine communications system using formant based speech analysis and synthesis - Google Patents
Info
- Publication number
- CA1246745A (application CA000503281A)
- Authority
- CA
- Canada
- Prior art keywords
- speech
- vocabulary
- formant
- measure
- reference speech
- Prior art date
- Legal status
- Expired
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Mobile Radio Communication Systems (AREA)
Abstract
MAN/MACHINE COMMUNICATIONS SYSTEM USING FORMANT BASED SPEECH
ANALYSIS AND SYNTHESIS
ABSTRACT OF THE DISCLOSURE
Formants are extracted and stored from reference speech. Input speech is suitably processed to derive unlabelled candidate formants. The sets of formants from the input and reference speech are compared using dynamic programming techniques. A further sequence comparison provides time alignment of the input and reference speech. The sequence comparisons extract a dissimilarity measure based on the formant frequencies and other characteristics of the speech. The reference speech resulting in the lowest dissimilarity measure identifies the input speech recognized by the system. System feedback may be provided and is composed of designated responsive multi-voiced speech. The multi-voiced output speech is obtained primarily by altering the prosodic parameters and formant frequencies of the designated responsive speech. Thus, the designated responsive speech may, say in an aircraft communication system, use one voice output when providing an information response to the pilot's recognized input speech question and another, appropriately strident, voice to issue the pilot warnings. The system also may be placed in a training mode to evaluate performance and adjust parameters.
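The dynamic-programming alignment and summed dissimilarity described in the abstract can be sketched as follows. This is a minimal illustration, not the patent's actual measure: the toy (F1, F2) formant values and the absolute-difference local cost are assumptions.

```python
def dtw(ref, inp, local_cost):
    """Dynamic time warping: the total dissimilarity is the sum of
    local dissimilarities over the best-aligned frame pairs."""
    n, m = len(ref), len(inp)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = local_cost(ref[i - 1], inp[j - 1])
            # advance in the reference, the input, or both
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def formant_cost(a, b):
    # illustrative local dissimilarity: summed absolute formant differences (Hz)
    return sum(abs(x - y) for x, y in zip(a, b))

# two toy "utterances": per-frame (F1, F2) formant frequencies
ref = [(700, 1200), (500, 1500), (300, 2300)]
inp = [(710, 1190), (705, 1210), (510, 1480), (290, 2310)]
score = dtw(ref, inp, formant_cost)
```

The non-linear alignment lets the four input frames map onto the three reference frames despite differing speaking rates; a full system would combine formant, energy, and spectrum-balance costs into each local measure.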
Claims (16)
1. A speech recognition system comprising:
a) means for extracting and storing from a reference speech vocabulary comprised of a plurality of vocabulary items, on a frame by frame basis, a set of formant parameters comprising frequencies and bandwidths, together with a measure of energy and a measure of spectrum balance for each item of said reference speech vocabulary;
b) means for storing vocabulary item template information for said reference vocabulary;
c) means for storing information defining syntactically allowed sequences of vocabulary items in speech to be recognized;
d) means for extracting, on a frame by frame basis, a set of unlabelled potentially errorful candidate formant parameters comprising frequencies and bandwidths, together with a measure of energy and a measure of spectrum balance for the speech to be recognized;
e) means for comparing sets of said unlabelled potentially errorful candidate formant parameters with any set of said formant parameters of said reference speech vocabulary to provide a formant dissimilarity measure between the two sets that is not unduly sensitive to errors present in either set;
f) means for comparing said measure of energy and said measure of spectrum balance for the speech to be recognized with said measure of energy, and said measure of spectrum balance for the reference speech vocabulary to provide energy and spectrum balance dissimilarity measures;
g) means for combining said formant dissimilarity measure and said energy and spectrum balance dissimilarity measure to produce local dissimilarity measures;
h) means for identifying a sequence of vocabulary item templates by aligning the speech to be recognized with the reference speech vocabulary which alignment results in the lowest total dissimilarity measure, wherein the total dissimilarity measure is the sum of local dissimilarity measures over aligned frame pairs of the speech to be recognized and the reference speech vocabulary; and i) means for outputting the identified sequence of vocabulary item templates.
2. A speaker verification system comprising:
a) means for instructing a speaker to provide speech to be recognized corresponding to at least one of a reference speech vocabulary comprised of a plurality of vocabulary items for all speakers;
b) means for storing speaker identities corresponding to the speaker's reference speech vocabulary;
c) means for extracting and storing from said reference speech vocabulary for each speaker to be identified, on a frame by frame basis, a set of formant parameters comprising frequencies and bandwidths, together with a measure of energy and a measure of spectrum balance for each frame of each item of said reference speech vocabulary for each speaker to be identified;
d) means for storing vocabulary item template information for said reference vocabulary for each speaker to be identified;
e) means for storing information defining syntactically allowed sequences of vocabulary items for each speaker to be identified;
f) means for extracting on a frame by frame basis, a set of unlabelled potentially errorful candidate formant parameters comprising frequencies and bandwidths, together with a measure of energy and a measure of spectrum balance of said produced specified sequence of vocabulary items to be recognized;
g) means for comparing sets of said unlabelled potentially errorful candidate formant parameters with any set of the formant parameters of said reference vocabulary to provide a formant dissimilarity measure between the two sets that is not unduly sensitive to the presence of errors in either set;
h) means for determining the syntactically allowed sequence of reference speech templates and their non-linear time alignments that minimize a local dissimilarity measure comprising said formant dissimilarity measure, energy and spectrum balance dissimilarities summed over aligned frame pairs of the frames of the speech to be recognized and the frames of the reference vocabulary;
i) means for outputting the reference vocabulary determined by the syntactically allowed sequence of reference speech templates;
j) means for identifying the reference speech vocabulary by aligning the speech to be recognized with the reference speech vocabulary which alignment results in a lowest total dissimilarity measure, wherein the total dissimilarity measure is the sum of the local dissimilarity measures over aligned frame pairs of the speech to be recognized and the reference speech vocabulary; and k) means for outputting a positive speaker identity corresponding to the identified reference speech vocabulary if the total dissimilarity measure is below a predetermined acceptable limit.
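The acceptance step of the speaker verification claim reduces to a threshold test on the total dissimilarity of the best alignment. A minimal sketch, in which the limit value and identity label are illustrative assumptions:

```python
def verify_speaker(claimed_identity, total_dissimilarity, limit):
    """Output a positive speaker identity only when the total
    dissimilarity of the aligned utterance is below the preset
    acceptable limit; otherwise reject the claim."""
    return claimed_identity if total_dissimilarity < limit else None

# the limit would be tuned, e.g. in the system's training mode
accepted = verify_speaker("pilot-1", 42.0, limit=50.0)
rejected = verify_speaker("pilot-1", 73.5, limit=50.0)
```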
3. The system of claim 1 wherein the total dissimilarity measure is a least cost explanation of one set in terms of the other set, whereby when each formant parameter in the reference speech vocabulary set is paired with said unlabelled potentially errorful candidate formant parameters in the speech to be recognized there is a cost that is a monotonically increasing function of a difference in their frequencies and when an unlabelled potentially errorful candidate formant parameter is left unpaired there is a cost inversely related to a confidence measure placed on that formant candidate.
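Claim 3's least-cost explanation can be sketched as a monotone matching between the frequency-ordered reference formants and the frequency-ordered candidates (so pairings never cross in frequency). The linear pair cost and the confidence-based skip cost below follow the claim's wording, but their exact functional forms and constants are assumptions:

```python
def explain_cost(ref_formants, candidates, pair_cost, skip_cost):
    """Least-cost explanation of the reference formant set in terms of a
    potentially errorful candidate set.  Every reference formant must be
    paired; a candidate may be left unpaired at a cost."""
    n, m = len(ref_formants), len(candidates)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for j in range(1, m + 1):          # leading candidates left unpaired
        D[0][j] = D[0][j - 1] + skip_cost(candidates[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                # pair reference formant i with candidate j
                D[i - 1][j - 1] + pair_cost(ref_formants[i - 1], candidates[j - 1]),
                # leave candidate j unpaired
                D[i][j - 1] + skip_cost(candidates[j - 1]),
            )
    return D[n][m]

# pairing cost: monotonically increasing in the frequency difference
pair_cost = lambda f_ref, cand: abs(f_ref - cand[0])
# unpairing cost: inversely related to the candidate's confidence, per claim 3
skip_cost = lambda cand: 10.0 / cand[1]

ref = [500.0, 1500.0, 2500.0]                    # labelled F1..F3 (Hz)
cands = [(480.0, 0.9), (900.0, 0.2),             # (frequency, confidence)
         (1520.0, 0.8), (2490.0, 0.9)]
cost = explain_cost(ref, cands, pair_cost, skip_cost)
```

Here the spurious 900 Hz candidate is left unpaired rather than forcing a large frequency mismatch, which is what makes the measure "not unduly sensitive" to errorful candidates.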
4. The speaker verification system of claim 2 wherein said total dissimilarity measure is obtained by further comparing the formant parameter of the speech to be recognized with the formant parameters of the reference speech vocabulary given the determined time alignment.
5. The speaker verification system of claim 2 wherein said total dissimilarity measure is a formant dissimilarity measure.
6. A multi-voiced output system comprising:
a) means for extracting from a reference speech vocabulary set of natural speech, on a frame by frame basis, formant parameters comprising frequencies and bandwidths, energy, fundamental frequency, voiced and unvoiced decision, for each frame of said reference speech vocabulary set;
b) first means for storing at least the formant parameters, energy and voiced and unvoiced decision for each frame of said reference speech vocabulary set;
c) second means for storing syntactic and prosodic rules applicable to said reference speech vocabulary set;
d) means for selecting reference speech out of said reference speech vocabulary set, and choosing a set of parameters for modifying said selected reference speech;
e) means for modifying said selected reference speech in accordance with said chosen parameters by altering one or more of the formant parameters, energy, voiced and unvoiced decisions stored in said first means;
f) means for synthesizing said modified reference speech using an excitation waveform of duration and form similar to the excitation waveform of said selected reference speech; and g) means for suitably analog converting and outputting said synthesized modified selected reference speech.
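The multi-voiced modification of claim 6(e) amounts to reshaping stored per-frame parameters before resynthesis. A minimal sketch in which the frame layout and scale factors are hypothetical, not the patent's storage format:

```python
def make_voice(frames, f0_scale=1.0, formant_scale=1.0, energy_scale=1.0):
    """Derive a new output voice from stored analysis frames by scaling
    the prosodic and formant parameters; synthesis itself is not shown."""
    modified = []
    for fr in frames:
        modified.append({
            "voiced": fr["voiced"],
            # fundamental frequency applies only to voiced frames
            "f0": fr["f0"] * f0_scale if fr["voiced"] else 0.0,
            # shift each formant frequency, keep its bandwidth
            "formants": [(f * formant_scale, bw) for f, bw in fr["formants"]],
            "energy": fr["energy"] * energy_scale,
        })
    return modified

# one analysis frame of a calm "information" voice ...
frame = {"voiced": True, "f0": 110.0, "energy": 1.0,
         "formants": [(500.0, 60.0), (1500.0, 90.0)]}
# ... retargeted as a higher-pitched, more strident "warning" voice
warning = make_voice([frame], f0_scale=1.5, formant_scale=1.1)
```

Because the same stored reference frames feed every derived voice, one recorded vocabulary can yield several perceptually distinct output voices.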
7. The system of claim 6 wherein:
a) said first storage means includes storage of the fundamental frequency of each frame of said reference speech vocabulary set; and b) said modifying means includes altering the fundamental frequency of said selected reference speech.
8. The system of claim 6 wherein:
a) said first storage means also includes means for storing the bandwidth of the vocabulary set of each frame of said reference speech; and b) said modifying means includes means for altering said bandwidth.
9. The system of claim 6, wherein:
a) said means for extraction includes a Laryngograph.
10. The system of claim 6, wherein:
a) said extraction means provides an error signal from a linear predictive analysis of said reference speech vocabulary set, said error signal being stored in said first storage means;
and b) said synthesizing means uses said error signal as the excitation waveform.
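Claim 10's scheme stores the linear-predictive error (residual) signal and reuses it as the synthesis excitation. A minimal sketch with an illustrative fixed second-order predictor; a real system would estimate the coefficients per frame from the speech:

```python
def lpc_residual(x, a):
    """Inverse-filter the speech x with LPC coefficients a[0..p-1] to
    obtain the prediction-error (residual) signal to be stored."""
    p = len(a)
    return [x[n] - sum(a[k] * x[n - 1 - k] for k in range(min(p, n)))
            for n in range(len(x))]

def lpc_synthesize(res, a):
    """Drive the all-pole LPC synthesis filter with the stored residual
    as its excitation waveform, reconstructing the analysed speech."""
    p = len(a)
    y = []
    for n in range(len(res)):
        y.append(res[n] + sum(a[k] * y[n - 1 - k] for k in range(min(p, n))))
    return y

a = [0.9, -0.2]                    # illustrative 2nd-order predictor
x = [1.0, 0.5, -0.3, 0.8, 0.1]    # toy speech samples
res = lpc_residual(x, a)
y = lpc_synthesize(res, a)        # y reproduces x
```

Using the natural residual as excitation preserves the character of the original voicing, so modifications to formants or prosody (as in claims 6 and 11) change the voice without making it sound mechanical.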
11. A man/machine speech communications system comprising:
a) means for extracting and storing from a reference speech vocabulary comprised of a plurality of vocabulary items, on a frame by frame basis, formant parameters comprising frequencies and bandwidths, energy and spectrum balance measures for each frame of said reference speech vocabulary, said reference speech vocabulary being divided into a recognition speech vocabulary and an output speech vocabulary;
b) means for storing vocabulary item template information for said recognition speech reference vocabulary;
c) means for storing information defining syntactically allowed sequences of vocabulary items in speech to be recognized;
d) means for extracting, on a frame by frame basis, unlabelled potentially errorful candidate formant parameters comprising frequencies and bandwidths, energy and spectrum balance measures for each frame of said speech to be recognized;
e) means for comparing sets of said unlabelled potentially errorful candidate formant parameters with any set of the recognition speech formant parameters to provide a formant dissimilarity measure between the two sets that is not unduly sensitive to the presence of errors in either set;
f) means for determining the syntactically allowed sequence of recognition speech templates and their non-linear time alignments that minimize a total dissimilarity measure comprising at least said formant dissimilarity measure, energy and spectrum balance dissimilarities summed over aligned frame pairs;
g) means for outputting a signal indicative of the recognition speech template having the lowest total dissimilarity measure;
h) said means for extracting and storing further including extraction and storage of fundamental frequency, voiced and unvoiced decision, for each frame of said output speech vocabulary;
i) means for storage of syntactic and prosodic rules applicable to said output speech vocabulary;
j) means for selecting a reference speech out of said output speech vocabulary responsive to said output of a signal indicative of recognition speech template, and means for choosing a set of parameters for modifying said selected output speech;
k) means for modifying the characteristics of said selected output speech in accordance with said chosen parameters by altering one or more of said stored formant parameters, energy or duration or form of the excitation waveform of said selected output speech;
l) means for synthesizing said modified selected output speech; and m) means for suitably analog converting and outputting said synthesized modified selected output reference speech.
12. A speech recognition method comprising the steps of:
a) extracting and storing from a reference speech vocabulary comprised of a plurality of vocabulary items, on a frame by frame basis, a set of formant parameters comprising frequencies and bandwidths, a measure of energy and a measure of spectrum balance for each frame of each item of said reference speech vocabulary;
b) storing vocabulary item template information for said reference speech vocabulary;
c) storing information defining allowed sequences of vocabulary items in speech to be recognized;
d) extracting, on a frame by frame basis, unlabelled candidate formant parameters comprising frequencies and bandwidths, a measure of energy and a measure of spectrum balance for the speech to be recognized;
e) comparing sets of said unlabelled candidate formant parameters with any set of formant parameters of said reference speech vocabulary to provide a formant dissimilarity measure between the two sets that is not unduly sensitive to the presence of errors in either set;
f) comparing energy and spectrum balance measures for the speech to be recognized with the reference speech vocabulary;
g) determining the syntactically allowed sequence of vocabulary item template information and their non-linear time alignments with the allowed sequence of vocabulary items in the speech to be recognized that minimize a total dissimilarity measure, said total dissimilarity measure comprising said formant dissimilarity measure, energy and spectrum balance dissimilarities summed over aligned frame pairs; and h) outputting the determined sequence of vocabulary items corresponding to the template.
13. A speaker verification method comprising the steps of:
a) extracting and storing from a reference speech vocabulary, for each speaker to be identified, on a frame by frame basis, formant frequencies and bandwidths, energy and spectrum balance;
b) storing whole-word template information for said reference vocabulary;
c) storing information defining sequences of words in the reference speech vocabulary;
d) instructing a speaker to say a specified sequence of words and to identify himself or herself;
e) extracting on a frame by frame basis unlabelled candidate formant frequencies and bandwidths, energy and spectrum balance of the speaker's words;
f) comparing sets of unlabelled candidate formant frequencies and bandwidths with the formant frequencies and bandwidths of the reference speech for the identified speaker to provide a formant dissimilarity measure between the two sets that is not unduly sensitive to the presence of errors in either set;
g) comparing sets of the energy and spectrum balance to provide a further dissimilarity measure which is combined with the formant dissimilarity measure to provide a total dissimilarity measure;
h) determining the time alignment of the specified sequence of words with the reference speech templates corresponding to the speaker's claimed identity that minimizes the total summed formant dissimilarity measure over aligned frame pairs; and i) measuring the equivalence between the time aligned specified sequence of words and the reference speech templates and determining whether the equivalence is above an acceptable lower limit for speaker verification.
14. A method of providing a multi-voiced output comprising the steps of:
a) extracting from a reference speech vocabulary, on a frame by frame basis, formant parameters comprising frequencies and bandwidths, energy, fundamental frequency, and a voiced/unvoiced decision, for each item of said reference speech vocabulary;
b) storing in a first means at least said formant parameters, energy and voiced/unvoiced decision for each item of said reference speech vocabulary;
c) storing in a second means syntactic and prosodic rules applicable to said reference speech vocabulary;
d) selecting reference speech out of said reference speech vocabulary, and choosing a set of parameters for modifying said selected reference speech;
e) modifying the characteristics of said selected reference speech in accordance with said chosen set of parameters by altering one or more of said stored formant parameters, energy or duration or form of the excitation waveform of said selected reference speech;
f) re-synthesizing said modified selected reference speech; and g) converting said re-synthesized modified selected reference speech to analog form and outputting it.
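Claim 14's modify-then-resynthesize loop (steps d through f) can be pictured with a cascade formant synthesizer: each stored formant drives a second-order digital resonator, and a "multi-voiced" output is obtained by scaling the stored formant frequencies and fundamental before resynthesis. This is a hypothetical sketch; the sample rate, resonator form, impulse-train excitation and scaling scheme are illustrative assumptions, not the patent's implementation.

```python
import math

def resonator(signal, freq_hz, bw_hz, fs):
    """Second-order digital resonator realizing one formant
    (frequency and bandwidth set the pole location)."""
    r = math.exp(-math.pi * bw_hz / fs)
    b1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
    b2 = -r * r
    a0 = 1.0 - b1 - b2  # normalizing input gain
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a0 * x + b1 * y1 + b2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_frame(formants, f0_hz, n_samples, fs=8000):
    """Step (f): excite a cascade of formant resonators with an impulse
    train at the (possibly modified) fundamental frequency."""
    period = int(fs / f0_hz)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]
    for freq, bw in formants:
        signal = resonator(signal, freq, bw, fs)
    return signal

def shift_voice(formants, f0_hz, formant_scale, pitch_scale):
    """Step (e): one simple 'multi-voiced' modification -- scale the
    stored formant frequencies and the fundamental to simulate a
    different voice."""
    return [(f * formant_scale, bw) for f, bw in formants], f0_hz * pitch_scale
```

Step (g) would then pass the synthesized samples through a digital-to-analog converter.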
15. A system of claim 1, further characterized by:
a) means for extracting and storing boundaries of vocabulary items for the speech to be recognized from the speech recognition system;
b) means for extracting and storing boundaries of vocabulary items for the speech to be recognized independently of said speech recognition system;
c) means for determining the correspondence between the two sets of vocabulary item boundaries;
d) means for identifying and storing vocabulary item templates of the speech to be recognized independently of said speech recognition system;
e) means for comparing the identified sequence of vocabulary item templates from said speech recognition system with the corresponding independently identified and stored vocabulary item templates within said independently extracted and stored vocabulary item boundaries;
f) means for outputting a reliability measure of said speech recognition system based on at least a portion of the determined correspondence between the two sets of vocabulary item boundaries and the comparison of the identified sequences of vocabulary item templates.
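The reliability output of claim 15 can be viewed as a combination of two scores: how well the recognizer's vocabulary item boundaries agree with independently detected ones (steps a-c), and how often the two template identifications agree (steps d-e). A hypothetical sketch, where the frame tolerance, the equal weighting and the score definitions are all illustrative assumptions:

```python
# Hypothetical reliability measure combining boundary correspondence
# (steps a-c) and template-sequence agreement (steps d-e) of claim 15.

def boundary_correspondence(rec_bounds, indep_bounds, tol=3):
    """Fraction of independently detected boundaries that the recognizer
    placed within `tol` frames."""
    if not indep_bounds:
        return 1.0
    matched = sum(
        1 for b in indep_bounds
        if any(abs(b - r) <= tol for r in rec_bounds)
    )
    return matched / len(indep_bounds)

def label_agreement(rec_labels, indep_labels):
    """Fraction of vocabulary items on which the recognizer and the
    independent means identified the same template."""
    if not indep_labels:
        return 1.0
    return sum(1 for a, b in zip(rec_labels, indep_labels) if a == b) / len(indep_labels)

def reliability(rec_bounds, indep_bounds, rec_labels, indep_labels, w=0.5):
    """Step (f): combine the two measures into a single reliability
    score in [0, 1]."""
    return (w * boundary_correspondence(rec_bounds, indep_bounds)
            + (1 - w) * label_agreement(rec_labels, indep_labels))
```

A downstream application could, for instance, re-prompt the user whenever the score falls below a chosen confidence floor.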
16. The system of claim 15 further characterized by:
a) means for constraining said means for identifying a sequence of vocabulary item templates to match said corresponding vocabulary items identified by the independent means;
b) said means for comparing the identified sequence of vocabulary item templates including means for passing said speech to be recognized through a portion of said speech recognition system at least twice.
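Claim 16 amounts to running the recognizer twice: an unconstrained first pass, then a second pass constrained to the vocabulary item identified by the independent means, with agreement between the passes feeding the reliability measure. A hypothetical sketch, in which the `distance` placeholder, the template format and the single-item constraint are illustrative assumptions:

```python
# Hypothetical two-pass recognition sketch of claim 16.

def distance(frames, template):
    """Placeholder frame-sequence distance (sum of absolute differences)."""
    return sum(abs(x - y) for x, y in zip(frames, template))

def recognize(frames, templates, allowed=None):
    """One recognizer pass: pick the best-matching template, optionally
    constrained to an allowed subset of vocabulary items."""
    candidates = (templates if allowed is None
                  else {k: v for k, v in templates.items() if k in allowed})
    return min(candidates, key=lambda k: distance(frames, candidates[k]))

def two_pass_recognition(frames, templates, independent_id):
    """Pass 1 is unconstrained; pass 2 is constrained to the item the
    independent means identified. Returns the first-pass result and
    whether the two passes agree."""
    first = recognize(frames, templates)
    second = recognize(frames, templates, allowed={independent_id})
    return first, first == second
```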
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US71544385A | 1985-03-25 | 1985-03-25 | |
US715,443 | 1985-03-25 |
Publications (1)
Publication Number | Publication Date |
---|---|
CA1246745A true CA1246745A (en) | 1988-12-13 |
Family
ID=24874072
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA000503281A Expired CA1246745A (en) | 1985-03-25 | 1986-03-04 | Man/machine communications system using formant based speech analysis and synthesis |
Country Status (1)
Country | Link |
---|---|
CA (1) | CA1246745A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0645757B1 (en) * | 1993-09-23 | 2000-04-05 | Xerox Corporation | Semantic co-occurrence filtering for speech recognition and signal transcription applications |
WO1996033486A1 (en) * | 1995-04-18 | 1996-10-24 | Oriol Espar Figueras | Speech recognition process and device |
ES2110899A1 (en) * | 1995-04-18 | 1998-02-16 | Figueras Oriol Espar | Speech recognition process and device |
CN112951245A (en) * | 2021-03-09 | 2021-06-11 | 江苏开放大学(江苏城市职业学院) | Dynamic voiceprint feature extraction method integrated with static component |
CN112951245B (en) * | 2021-03-09 | 2023-06-16 | 江苏开放大学(江苏城市职业学院) | Dynamic voiceprint feature extraction method integrated with static component |
CN115879405A (en) * | 2023-02-24 | 2023-03-31 | 湖南遥光科技有限公司 | Circuit performance detection method, computer storage medium and terminal device |
CN115879405B (en) * | 2023-02-24 | 2023-11-17 | 湖南遥光科技有限公司 | Circuit performance detection method, computer storage medium and terminal equipment |
CN118173102A (en) * | 2024-05-15 | 2024-06-11 | 百鸟数据科技(北京)有限责任公司 | Bird voiceprint recognition method in complex scene |
CN118173102B (en) * | 2024-05-15 | 2024-07-16 | 百鸟数据科技(北京)有限责任公司 | Bird voiceprint recognition method in complex scene |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yoshimura et al. | Mixed excitation for HMM-based speech synthesis. | |
DE69831076T2 | METHOD AND DEVICE FOR SPEECH ANALYSIS AND SYNTHESIS USING ALL-PASS FILTER CHAINS | |
US4624011A (en) | Speech recognition system | |
Vepa et al. | New objective distance measures for spectral discontinuities in concatenative speech synthesis | |
US5144672A (en) | Speech recognition apparatus including speaker-independent dictionary and speaker-dependent | |
Hunt et al. | Speaker dependent and independent speech recognition experiments with an auditory model | |
Bocklet et al. | Age and gender recognition based on multiple systems-early vs. late fusion. | |
US5202926A (en) | Phoneme discrimination method | |
Chetouani et al. | A New Nonlinear speaker parameterization algorithm for speaker identification | |
Teixeira et al. | Prosodic features for automatic text-independent evaluation of degree of nativeness for language learners. | |
Elenius et al. | Effects of emphasizing transitional or stationary parts of the speech signal in a discrete utterance recognition system | |
Hansen et al. | Robust speech recognition training via duration and spectral-based stress token generation | |
CA1246745A (en) | Man/machine communications system using formant based speech analysis and synthesis | |
US4924518A (en) | Phoneme similarity calculating apparatus | |
Wicaksana et al. | Spoken language identification on local language using MFCC, random forest, KNN, and GMM | |
Dawande et al. | Analysis of different feature extraction techniques for speaker recognition system: A review | |
Siegel et al. | A pattern classification algorithm for the voiced/unvoiced decision | |
Aull et al. | Lexical stress and its application in large vocabulary speech recognition | |
Pedone et al. | Phoneme-level text to audio synchronization on speech signals with background music | |
Dutono et al. | Effects of compound parameters on speaker-independent word recognition | |
Fu et al. | Polynomial-Decomposition-Based LPC for Formant Estimation | |
KR19990050440A (en) | Voice recognition method and voice recognition device using voiced, unvoiced and silent section information | |
Samouelian | Frame-level phoneme classification using inductive inference | |
Pellom et al. | Spectral normalization employing hidden Markov modeling of line spectrum pair frequencies | |
Mariani et al. | Acoustic-phonetic recognition of connected speech using transient information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MKEX | Expiry |