US7069217B2 - Waveform synthesis - Google Patents

Waveform synthesis

Info

Publication number
US7069217B2
US7069217B2 US09/043,171 US4317198A
Authority
US
United States
Prior art keywords
waveform
sequence
cycles
point
successive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
US09/043,171
Other languages
English (en)
Other versions
US20010018652A1 (en)
Inventor
Stephen McLaughlin
Michael Banbrook
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
British Telecommunications PLC
Original Assignee
British Telecommunications PLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by British Telecommunications PLC
Assigned to BRITISH TELECOMMUNICATIONS PUBLIC LIMITED COMPANY. Assignment of assignors' interest (see document for details). Assignors: BANBROOK, MICHAEL; MCLAUGHLIN, STEPHEN
Publication of US20010018652A1
Application granted
Publication of US7069217B2
Anticipated expiration
Current status: Expired - Fee Related

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/06 Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L13/07 Concatenation rules

Definitions

  • the corresponding point s_i in the state sequence space is represented by the value of that sample x_i together with those of a preceding and a succeeding sample x_(i-j), x_(i+k) (where j is conveniently equal to k, and in this case both are equal to 10); a sketch of this delay embedding appears after this list.
  • the attractor of FIG. 4 consists of a double loop (which, in the projection indicated, appears to cross itself but does not in fact do so in three dimensions).
  • each voiced sound gives rise to an attractor of this nature, all of which can adequately be represented in a three dimensional state space, although it might also be possible to use as few as two dimensions or as many as four, five or more.
  • the important parameters for an effective representation of voiced sounds in such a state space are the number of dimensions selected and the time delay between adjacent samples.
  • the shapes of the attractors vary considerably, tracking the shapes of the speech waveforms from which they are derived, although there is some relationship between the topologies of respective attractors and the sounds to which they correspond.
  • voiced sounds such as vowels and voiced consonants
  • the state space representation will not follow successive closely similar loops with a well defined topology, but instead will follow a trajectory which passes in an apparently random fashion through a volume in the state sequence space.
  • a speech synthesizer comprises a loudspeaker 2, fed from the analogue output of a digital-to-analogue converter 4, coupled to an output port of a central processing unit 6 in communication with a storage system 8 (comprising random access memory 8a, for use by the CPU 6 in calculation; program memory 8b, for storing the CPU operating program; and data constant memory 8c, for storing data for use in synthesis).
  • the apparatus of FIG. 6 may conveniently be provided by a personal computer and sound card, such as an Elonex (TM) personal computer comprising a 33 MHz Intel 486 microprocessor as the CPU 6 and an Ultrasound Max (TM) soundcard providing the digital-to-analogue converter 4 and output to a loudspeaker 2.
  • Any other digital processor of similar or higher power could be used instead.
  • the storage system 8 comprises a mass storage device (e.g. a hard disk) containing the operating program and data to be used in synthesis, and a random access memory comprising partitioned areas 8a, 8b, 8c, the program and data being loaded into the latter two areas, respectively, prior to use of the apparatus of FIG. 6.
  • the stored data held within the stored data memory 8c comprises a set of records 10a, 10b, . . . 10c, each of which represents a small segment of a word which may be considered to be unambiguously distinguishable regardless of its context in a word or phrase (i.e. each corresponds to a phoneme or allophone).
  • the phonemes can be represented by any of a number of different phonetic alphabets; in this embodiment, SAMPA (the Speech Assessment Methods Phonetic Alphabet, as disclosed in A. Breen, "Speech Synthesis Models: A Review", Electronics and Communication Engineering Journal, pages 19–31, February 1992) is used.
  • Each of the records comprises a respective waveform recording 11, comprising successive digital values (e.g. sampled at 20 kHz) of the waveform of an actual utterance of the phoneme in question, as successive samples x_1, x_2 . . . x_N.
  • each of the records 10 associated with a voiced sound comprises, for each stored sample x_i, a transform matrix defined by nine stored constant values.
  • the data memory 8c comprises on the order of thirty to forty records 10 (depending on the phonetic alphabet chosen), each consisting of on the order of half a second of recorded digital waveform (i.e., for sampling at 20 kHz, around ten thousand samples x_i, each sample record for a voiced sound having an associated nine-element transform matrix); a sketch of such a record layout appears after this list.
  • an utterance to be synthesised by the speech synthesizer consists of a sequence of portions, each with an associated duration: a silence portion 14a, followed by a word comprising a sequence of portions 14b–14f each consisting of a phoneme of predetermined duration, followed by a further silence portion 14g, followed by a further word comprised of phoneme portions 14h–14j each of an associated duration, and so on.
  • the sequence of phonemes, together with their durations, is either stored or derived by one of several well-known rule systems, which form no part of the present invention but are comprised within the control program.
  • the closest point selected in step 508 will generally be the last point on the current strand (in this case s_21); however, it may correspond instead to one of the nearest neighbours on that strand (as in this case, where s_22 is closer), or to a point on another strand of the trajectory where this is closely spaced in the state sequence space, as indicated in FIG. 9c.
  • in step 520 the CPU 6 determines whether the required predetermined duration of the phoneme being synthesised has been reached. If not, the CPU 6 returns to step 508 of the control program and determines the new closest point on the trajectory to the most recently synthesised point. In many cases this may be the same as the point s_(i+1) from which the synthesised point was itself calculated, but this is not necessarily so (a sketch of this synthesis loop appears after this list).
  • a human speaker recites a single utterance of a desired sound (e.g. a vowel)
  • the CPU 26 and analogue-to-digital converter 24 sample the analogue waveform thus produced at the output of the microphone 22 and store successive samples (e.g. around 10,000 samples, corresponding to around half a second of speech) in the working memory area 28a.
  • the CPU 26 is arranged to normalise the pitch of the recorded utterance by determining the start and end of each pitch pulse period (illustrated in FIG. 1), for example by determining the zero crossing points thereof, and then equalising the number of samples within each pitch period (for example to 140 samples per period) by interpolating between the originally stored samples; a pitch-normalisation sketch appears after this list.
  • the stored data are transferred (either by a communications link or a removable carrier such as a floppy disk) to the memory 8 of the synthesis apparatus of FIG. 6.
  • unvoiced sounds do not exhibit stable low-dimensional behaviour; they do not follow regular, repeating attractors in state sequence space, and synthesis from an attractor as described above is therefore unstable. Accordingly, unvoiced sounds are produced in this embodiment by simply outputting, in succession, the stored waveform values x_i for the unvoiced sound to the DAC 4. The same is true of plosive sounds.
  • the present invention interpolates between two waveforms, one representing each sound, in state sequence space.
  • the state space representation is useful where one or both of the waveforms between which interpolation is performed are being synthesised (i.e. one or both are voiced waveforms).
  • the synthesised points in state space are derived, and then the interpolated point is calculated between them; in fact, as discussed below, it is only necessary to interpolate on one co-ordinate axis, so that the state space representation plays no part in the actual interpolation process.
  • the interpolation is performed over more than one pitch pulse cycle (for example 10 cycles) by progressively linearly varying the Euclidean distance between the two waveforms in state sequence space; an interpolation sketch appears after this list.
  • an index j is initialised (e.g. at zero).
  • the transformation matrix is calculated directly at each newly synthesised point; in this case, the synthesizer of FIG. 6 incorporates the functionality of the apparatus of FIG. 10 .
  • Such calculation reduces the required storage space by around one order of magnitude, although higher processing speed is required.
  • a corresponding pair of points s_k^a, s_l^b are read from the stored waveform records 10; as described in the first embodiment, the points correspond to matching parts of the respective pitch pulse cycles of the two waveforms.
  • in step 814 the CPU 6 performs steps 610–622 of FIG. 12, to calculate the transform matrices T_k for each point along this stored track (one possible construction is sketched after this list).
  • although in this embodiment each interpolated trajectory and set of transformation matrices is used only once, to calculate only a single output value, in fact fewer interpolated trajectories and sets of transformation matrices could be calculated, and the same trajectory used for several successive output samples.
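
By way of illustration, the delay embedding described in the first item above can be sketched in a few lines of NumPy. The function name embed_waveform and the use of NumPy are assumptions for illustration, not taken from the patent.

```python
import numpy as np

def embed_waveform(x: np.ndarray, delay: int = 10) -> np.ndarray:
    """Map samples x to state-space points s_i = (x_(i-delay), x_i, x_(i+delay))."""
    i = np.arange(delay, len(x) - delay)   # indices with both neighbours in range
    return np.stack([x[i - delay], x[i], x[i + delay]], axis=1)

# For a voiced sound, the rows of embed_waveform(x) trace out the closed-loop
# attractor (e.g. the double loop of FIG. 4) in three-dimensional state space.
```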
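The record layout described above might be held in memory along the following lines; this is a minimal sketch under assumed names, not the patent's actual data format.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class PhonemeRecord:                     # hypothetical name, not from the patent
    sampa_label: str                     # phoneme identity in the SAMPA alphabet
    samples: np.ndarray                  # ~10,000 samples at 20 kHz (about half a second)
    transforms: Optional[np.ndarray]     # (N, 3, 3) per-sample transform matrices for
                                         # voiced sounds (nine constants each); None otherwise
```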
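The synthesis loop of steps 508–520 can be sketched as follows, assuming the embedding above: the CPU repeatedly finds the stored trajectory point closest to the last synthesised state-space point, applies that point's transform matrix to step forward, and emits one output sample per step. Taking the middle coordinate as the output sample is an assumption consistent with the (x_(i-j), x_i, x_(i+k)) embedding; all names are illustrative.

```python
import numpy as np

def synthesise_voiced(traj: np.ndarray, transforms: np.ndarray,
                      s: np.ndarray, n_samples: int) -> np.ndarray:
    """traj: (N, 3) stored attractor points; transforms: (N, 3, 3); s: (3,) start point."""
    out = np.empty(n_samples)
    for t in range(n_samples):
        i = int(np.argmin(np.sum((traj - s) ** 2, axis=1)))  # closest stored point (step 508)
        s = transforms[i] @ s                                # evolve the synthesised point
        out[t] = s[1]                                        # middle coordinate as the sample
    return out
```

The loop runs for a fixed number of samples, corresponding to the duration test of step 520.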
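The pitch-normalisation step can be sketched as below: pitch period boundaries are located (here crudely, via rising zero crossings, as the text suggests) and each period is resampled to a fixed 140 samples by linear interpolation between the originally stored samples. A practical implementation would need a more robust pitch marker; the helper name is assumed.

```python
import numpy as np

def normalise_pitch(x: np.ndarray, period_len: int = 140) -> np.ndarray:
    zc = np.where((x[:-1] < 0.0) & (x[1:] >= 0.0))[0]   # rising zero crossings
    if len(zc) < 2:                                     # no complete pitch period found
        return x.astype(float)
    periods = []
    for a, b in zip(zc[:-1], zc[1:]):                   # one pitch period per boundary pair
        t = np.linspace(a, b, period_len, endpoint=False)
        periods.append(np.interp(t, np.arange(len(x)), x.astype(float)))
    return np.concatenate(periods)
```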
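For pitch-normalised, phase-aligned waveforms, the progressive interpolation between two sounds reduces to a cycle-by-cycle linear cross-fade whose weight rises from 0 to 1 over the chosen number of cycles (10 in the text), so each synthesised cycle lies progressively closer to the target sound. A sketch, with assumed names:

```python
import numpy as np

def interpolate_sounds(a: np.ndarray, b: np.ndarray,
                       n_cycles: int = 10, period_len: int = 140) -> np.ndarray:
    """a, b: pitch-normalised, phase-aligned waveforms >= n_cycles * period_len samples."""
    out = np.empty(n_cycles * period_len)
    for c in range(n_cycles):
        w = c / (n_cycles - 1)                        # weight rises linearly from 0 to 1
        seg = slice(c * period_len, (c + 1) * period_len)
        out[seg] = (1.0 - w) * a[seg] + w * b[seg]    # move linearly from sound a to sound b
    return out
```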
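Steps 610–622, which compute the transform matrices T_k, are not reproduced in this extract. One standard way to obtain such a local linear map, offered here only as an assumption about how it might be done, is a least-squares fit that sends each near neighbour of a trajectory point to its successor:

```python
import numpy as np

def local_transform(traj: np.ndarray, i: int, n_nbrs: int = 8) -> np.ndarray:
    """Least-squares 3x3 map sending each near neighbour of traj[i] to its successor."""
    d = np.sum((traj[:-1] - traj[i]) ** 2, axis=1)   # distances to traj[i], last point excluded
    nbrs = np.argsort(d)[:n_nbrs]                    # indices of the nearest neighbours
    S, S_next = traj[nbrs], traj[nbrs + 1]           # neighbours and their successors
    X, *_ = np.linalg.lstsq(S, S_next, rcond=None)   # solves S @ X = S_next in least squares
    return X.T                                       # M = X.T satisfies M @ s_j = s_(j+1) approx.
```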

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrophonic Musical Instruments (AREA)
  • Lasers (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
GB9600774-5 1996-01-15
GBGB9600774.5A GB9600774D0 (en) 1996-01-15 1996-01-15 Waveform synthesis
PCT/GB1997/000060 WO1997026648A1 (en) 1996-01-15 1997-01-09 Waveform synthesis

Publications (2)

Publication Number Publication Date
US20010018652A1 US20010018652A1 (en) 2001-08-30
US7069217B2 (en) 2006-06-27

Family

ID=10787066

Family Applications (1)

Application Number Title Priority Date Filing Date
US09/043,171 Expired - Fee Related US7069217B2 (en) 1996-01-15 1997-01-09 Waveform synthesis

Country Status (8)

Country Link
US (1) US7069217B2 (de)
EP (1) EP0875059B1 (de)
JP (1) JP4194656B2 (de)
AU (1) AU724355B2 (de)
CA (1) CA2241549C (de)
DE (1) DE69722585T2 (de)
GB (1) GB9600774D0 (de)
WO (1) WO1997026648A1 (de)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3912913B2 (ja) * 1998-08-31 2007-05-09 Canon Inc. Speech synthesis method and apparatus
JP4656443B2 (ja) * 2007-04-27 2011-03-23 Casio Computer Co., Ltd. Waveform generation apparatus and waveform generation processing program
JP5347405B2 (ja) * 2008-09-25 2013-11-20 Casio Computer Co., Ltd. Waveform generation apparatus and waveform generation processing program
JP5224552B2 (ja) * 2010-08-19 2013-07-03 Tohru Ifukube Speech generation apparatus and control program therefor
US11373672B2 (en) 2016-06-14 2022-06-28 The Trustees Of Columbia University In The City Of New York Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
WO2017218492A1 (en) * 2016-06-14 2017-12-21 The Trustees Of Columbia University In The City Of New York Neural decoding of attentional selection in multi-speaker environments


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4022974A (en) 1976-06-03 1977-05-10 Bell Telephone Laboratories, Incorporated Adaptive linear prediction speech synthesizer
US4635520A (en) * 1983-07-28 1987-01-13 Nippon Gakki Seizo Kabushiki Kaisha Tone waveshape forming device
US4718093A (en) * 1984-03-27 1988-01-05 Exxon Research And Engineering Company Speech recognition method including biased principal components
US4622877A (en) 1985-06-11 1986-11-18 The Board Of Trustees Of The Leland Stanford Junior University Independently controlled wavetable-modification instrument and method for generating musical sound
US5111505A (en) * 1988-07-21 1992-05-05 Sharp Kabushiki Kaisha System and method for reducing distortion in voice synthesis through improved interpolation
EP0385444A2 (de) 1989-03-02 1990-09-05 Yamaha Corporation Apparatus for generating a musical tone signal
US5745651A (en) * 1994-05-30 1998-04-28 Canon Kabushiki Kaisha Speech synthesis apparatus and method for causing a computer to perform speech synthesis by calculating product of parameters for a speech waveform and a read waveform generation matrix
US5832437A (en) * 1994-08-23 1998-11-03 Sony Corporation Continuous and discontinuous sine wave synthesis of speech signals from harmonic data of different pitch periods

Non-Patent Citations (12)

* Cited by examiner, † Cited by third party
Title
Daniel P. Lathrop et al., "Characterization of an experimental strange attractor by periodic orbits", Physical Review A, pp. 4028-4031, 1989. *
Gabriel B. Mindlin et al., "Topological analysis and synthesis of chaotic time series", Physica D, pp. 229-242, 1992. *
IBM Technical Disclosure Bulletin, vol. 28, No. 3, Aug. 1985, New York, US, pp. 1248-1249, Anonymous, Use of the Grid Search Technique for Improving Synthetic Speech Control-Data.
IEE Colloquium on 'Exploiting Chaos in Signal Processing' (Digest No. 1994/143), Jun. 6, 1994, London, GB, pp. 8/1-10, Banbrook et al, "Is speech chaotic?: invariant geometrical measures for speech data".
IEEE 100 The Authoritative Dictionary of IEEE Standards Terms, Seventh Edition, Standards Information Network IEEE Press 2000. p. 1000. *
IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E76-A, No. 11, Nov. 1993, JP, pp. 1964-1970, Hirokawa et al, "High quality speech synthesis system based on waveform concatenation of phoneme segment".
International Conference on Acoustics, Speech, and Signal Processing 1988, vol. 1, Apr. 11-14, 1988, New York, NY, pp. 675-678, Everett, "Word synthesis based on line spectrum pairs".
Kleijn, W.B. and Paliwal, K.K. (Eds), 'Speech Coding and Synthesis', pp. 557-559, 581-587, 600-610 Elsevier Science B.V., 1995.
M. Banbrook and S. McLaughlin, "Speech Characterisation by Non-Linear Methods", presented at IEEE workshop on Nonlinear Signal and Image Processing NSIP '95, pp. 396-400, Jun. 1995.
M. Casdagli, "Chaos and Deterministic versus Stochastic Non-Linear Modelling", Journal of the Royal Statistical Society B, vol. 54, No. 2, pp. 303-328, 1991.
Mark Shelhamer, "Correlation Dimension of Optokinetic Nystagmus as Evidence of Chaos in the Oculomotor System", IEEE Transactions on Biomedical Engineering, vol. 39, no. 12, pp. 1319-1321, 1992. *
Westall, F.A. and Ip, S.F.A, "Digital Signal Processing in Telecommunications", pp. 295-297, Chapman & Hall, 1993.

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040133585A1 (en) * 2000-07-11 2004-07-08 Fabrice Pautot Data-processing arrangement comprising confidential data
US7486794B2 (en) * 2000-07-11 2009-02-03 Gemalto Sa Data-processing arrangement comprising confidential data
US7714935B2 (en) * 2002-05-31 2010-05-11 Leader Electronics Corporation Data structure for waveform synthesis data and method and apparatus for synthesizing waveform
US20040034530A1 (en) * 2002-05-31 2004-02-19 Tomomi Hara Data structure for waveform synthesis data and method and apparatus for synthesizing waveform
US20080172349A1 (en) * 2007-01-12 2008-07-17 Toyota Engineering & Manufacturing North America, Inc. Neural network controller with fixed long-term and adaptive short-term memory
US8373056B2 (en) * 2010-03-17 2013-02-12 Casio Computer Co., Ltd Waveform generation apparatus and waveform generation program
US20110226116A1 (en) * 2010-03-17 2011-09-22 Casio Computer Co., Ltd. Waveform generation apparatus and waveform generation program
US20120016672A1 (en) * 2010-07-14 2012-01-19 Lei Chen Systems and Methods for Assessment of Non-Native Speech Using Vowel Space Characteristics
US9262941B2 (en) * 2010-07-14 2016-02-16 Educational Testing Services Systems and methods for assessment of non-native speech using vowel space characteristics
US20120310650A1 (en) * 2011-05-30 2012-12-06 Yamaha Corporation Voice synthesis apparatus
US8996378B2 (en) * 2011-05-30 2015-03-31 Yamaha Corporation Voice synthesis apparatus
US8719030B2 (en) * 2012-09-24 2014-05-06 Chengjun Julian Chen System and method for speech synthesis
US9933990B1 (en) * 2013-03-15 2018-04-03 Sonitum Inc. Topological mapping of control parameters

Also Published As

Publication number Publication date
AU724355B2 (en) 2000-09-21
DE69722585D1 (de) 2003-07-10
EP0875059A1 (de) 1998-11-04
CA2241549A1 (en) 1997-07-24
JP2000503412A (ja) 2000-03-21
DE69722585T2 (de) 2004-05-13
EP0875059B1 (de) 2003-06-04
CA2241549C (en) 2002-09-10
JP4194656B2 (ja) 2008-12-10
US20010018652A1 (en) 2001-08-30
AU1389797A (en) 1997-08-11
GB9600774D0 (en) 1996-03-20
WO1997026648A1 (en) 1997-07-24

Similar Documents

Publication Publication Date Title
US6836761B1 (en) Voice converter for assimilation by frame synthesis with temporal alignment
US5740320A (en) Text-to-speech synthesis by concatenation using or modifying clustered phoneme waveforms on basis of cluster parameter centroids
EP2276019B1 (de) Apparatus and method for creating a singing synthesis database, and apparatus and method for generating a pitch curve
US7069217B2 (en) Waveform synthesis
EP2270773B1 (de) Apparatus and method for creating a singing synthesis database, and apparatus and method for generating a pitch curve
US7035791B2 (en) Feature-domain concatenative speech synthesis
US8280724B2 (en) Speech synthesis using complex spectral modeling
JP2000172285A (ja) Demi-syllable concatenative formant-based speech synthesizer performing independent cross-fading in the filter parameter and source domains
EP0380572A1 (de) Speech generation from digitally stored coarticulated speech segments.
JPS63285598A (ja) Phoneme-concatenation parametric rule synthesis system
US5890118A (en) Interpolating between representative frame waveforms of a prediction error signal for speech synthesis
JPH0727397B2 (ja) Speech synthesis apparatus
JP4430174B2 (ja) Voice conversion apparatus and voice conversion method
WO2004027753A1 (en) Method of synthesis for a steady sound signal
JP4454780B2 (ja) Speech information processing apparatus and method, and storage medium
JP2000099020A (ja) Vibrato control method and program recording medium
JP3904871B2 (ja) Prosody generation method and prosody generation program for singing voice synthesis, and recording medium storing the program
Jayasinghe Machine Singing Generation Through Deep Learning
Rodet Sound analysis, processing and synthesis tools for music research and production
CN118262696A (en) Singing voice synthesis model training method, singing voice synthesis method, device and storage medium
CN117995163A (zh) Speech editing method and apparatus
JPH0962295A (ja) Speech segment creation method, speech synthesis method, and apparatus therefor
JPS58105198A (ja) Speech analysis and synthesis method
JPH07104795A (ja) Speech synthesis-by-rule apparatus

Legal Events

Date Code Title Description
AS Assignment

Owner name: BRITISH TELECOMMUNICATIONS PUBLIC LIMITED COMPANY,

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MCLAUGHLIN, STEPHEN;BANBROOK, MICHAEL;REEL/FRAME:009456/0754

Effective date: 19980206

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FPAY Fee payment

Year of fee payment: 4

REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees
STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20140627