EP1286332A1 - Sound processing method and apparatus for modifying a sound characteristic, e.g. an impression of age associated with a voice - Google Patents

Sound processing method and apparatus for modifying a sound characteristic, e.g. an impression of age associated with a voice

Info

Publication number
EP1286332A1
Authority
EP
European Patent Office
Prior art keywords
frequency
sound data
sound
sampling
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP01402177A
Other languages
English (en)
French (fr)
Inventor
Pierre-Yves Oudeyer
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony France SA
Original Assignee
Sony France SA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony France SA filed Critical Sony France SA
Priority to EP01402177A priority Critical patent/EP1286332A1
Publication of EP1286332A1 publication Critical patent/EP1286332A1
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04Time compression or expansion

Definitions

  • the invention relates to a method and device for processing a sound such as a voice delivering a natural or synthesised utterance, so as to controllably modify a characteristic feature thereof, and notably its apparent age.
  • the sound characteristics of the human voice evolve significantly with age, from infancy through childhood, youth, adulthood, and advancing years. Similar evolutions in the sound of utterances can be observed in animals. This makes it possible to associate an age with a voice, simply on the basis of everyday experience.
  • an utterance - be it natural or synthesised - conveys an impression of age, hereafter referred to generically as the "age" of the utterance.
  • robotic pets, humanoids and communicating objects can be expected to simulate a lifespan, e.g. starting from infancy in an initial phase.
  • Other applications are, inter alia, in voice synthesis where it is often desirable to control the age of a voice among other parameters such as male/female, speed of delivery, emotion portrayed, etc.
  • Figure 1 illustrates what would be a standard approach for providing a control in the age of an utterance where the latter is produced from samples of a human voice.
  • the samples are obtained from a number N of different human speakers S1, S2, S3, ..., SN, each having a respective age A1, A2, A3, ..., AN corresponding to a given age group, e.g. from a child to an elderly person.
  • the voices V1, V2, V3, ..., VN obtained from each of the speakers are processed independently of each other.
  • the analog voice output is fed to a digitizer 2 where it is converted to successive intensity or amplitude values.
  • the digitizer 2 operates according to a given sampling frequency F1 (Hz) provided by a signal from a sampling clock 4 set for that frequency.
  • the sampling frequency corresponds to the number (F1) of intensity or amplitude values that are sampled i.e. measured, per second on the voice signal. Each sampled value is converted into a digital word of a given number of bits, depending on the required dynamic range.
  • the successive digital words corresponding to respective successive samples are compiled into a sound file 6 for reading by a digital sound player, usually a synthesiser which produces a digital-to-analog conversion on the sound data.
  • Sound files 6 for respective speakers S1-SN are collected and stored in a sound file memory 8.
  • a sound file may contain complete words, or it may result from intermediate processing steps in order to correspond to a collection of basic sound elements extracted from the voice information, allowing words to be formed by so-called concatenative techniques.
  • the digital data of a sound file 6 is reproduced by a digital sound player 10, which converts the digitally-expressed intensity or amplitude values as it receives them into a corresponding time-varying audio signal suitably amplified to be played on a speaker 12 or other transducer.
  • a chosen sound file is selected from the memory 8 by a sound file reader 14, which serves to determine and organise the digital data to be sent to the digital sound player 10.
  • the sound file reader 14 can read selected portions of a sound file 6 to obtain a required concatenation of syllables for producing an utterance.
  • the sampled data contained in the sound files are passed on to the digital sound player at the frequency F1 at which they were sampled, so that the delivery rate of the initially-recorded voice is respected; i.e. the playback speed is the same as the recording speed, as in a normal recording and playback system, the speed in this case being set by the sampling frequency F1, which fixes the delivery rate.
  • the data supplied to the digital sound player 10 will come from a sound file - or possibly a set of sound files - originating specifically from a voice Vi of a speaker Si having that prescribed age.
  • the sound file reader 14 has to access one or more corresponding different sound files from the sound file memory 8.
  • the invention proposes a method of controllably modifying a value of a characteristic associated to a sound contained in sound data to which is associated a sampling frequency, the sound exhibiting a given value of the characteristic for a first sampling frequency F1, characterised in that it comprises the steps of:
  • the above process is particularly advantageous for controllably modifying an age associated to a voice, the sound data corresponding to at least one utterance, the value of the characteristic being an age associated to the utterance.
  • the time-adjusting step is preferably performed by means of a Pitch-Synchronous Overlap and Add (PSOLA) algorithm or the like.
  • the sound data can be obtained by sampling an analog sound signal at the first sampling frequency F1.
  • All the sound data can be obtained by sampling a voice from the same speaker.
  • the sound data may also be obtained by synthetically generating sound data values so as to exhibit the given value of the characteristic for the first sampling frequency F1.
  • the time adjusting step causes the sampling duration of the time-adjusted form of the sound data portion sampled at the second frequency F2 to be substantially the same as the first sampling duration R1.
  • the time-adjusted sound data can be played substantially at the second frequency F2.
  • the second frequency F2 can be substantially continuously variable, e.g. over a range covering frequency values above and below the first frequency F1.
  • When applied to controlling the age of a voice, the age can be controllably increased by correspondingly decreasing the value of the second frequency F2.
  • the invention relates to the implementation of the above method in a device which simulates an ageing process, such as a robotic pet having a life-cycle, wherein the second frequency F2 is caused to decrease progressively with time.
  • the invention provides a device for controllably modifying a value of a characteristic associated to a sound contained in sound data to which is associated a sampling frequency, the sound exhibiting a given value of the characteristic for a first sampling frequency F1, characterised in that it comprises:
  • the invention relates to a device which simulates an ageing process, such as a robotic pet having a life-cycle, characterised in that it comprises a device according to the third aspect and means for causing the second frequency F2 to decrease progressively with time.
  • the invention relates to a computer program providing computer executable instructions, which when loaded onto a data processor causes the data processor to operate the above method.
  • the computer program can be embodied in a recording medium of any suitable form.
  • the embodiment shown in Figure 2 forms a system 20 which exploits the possibility of using the same source of sound, whether it be of natural origin as for a human speaker S or a source of virtual voice data 22, from which to produce utterances having different possible ages.
  • the original voice signal 24 e.g. as picked up by a microphone is supplied to a standard digitizer 26 which is run by a sampling clock 28 for a first sampling frequency F1 (Hz).
  • the first sampling frequency can typically be 16 000 Hz or 32 000 Hz, or any other usual sampling frequency.
  • the digitizer 26 thereby produces a number F1 of sampled amplitude values of the voice signal per second, each sampled value being expressed as a number by a binary word of a given bit length.
  • For an inputted voice signal 24 lasting a time interval t, the digitizer produces a time-ordered sequence of t.F1 amplitude values, each giving numerically the instantaneous amplitude of the signal 24 at the moment when it was sampled, and referred to hereafter as a digital sample value.
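The sample-count bookkeeping described above can be sketched as follows (a minimal illustration; the function name and the example values are ours, not part of the patent disclosure):

```python
def num_samples(t_seconds: float, f1_hz: int) -> int:
    """A signal lasting t seconds, digitised at F1 Hz, yields t.F1 sample values."""
    return round(t_seconds * f1_hz)

# A 2-second voice signal digitised at 16 000 Hz gives 32 000 digital sample values.
print(num_samples(2.0, 16_000))  # 32000
```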
  • the digital sample values produced by the digitizer 26 are compiled into one or several sound files 30, each containing a digital recording of the voice signal V.
  • the speaker has an age A, whereupon the voice recorded in the sound file(s) will have an ascribed age A.
  • the sound file(s) 30 is/are stored in a sound file memory 32 from which their data can be selectively accessed for reading out by a player.
  • the sound file memory 32 is accessed by a sound data processor 34 which uses the voice sample data to generate utterance data UD, from which a desired utterance is subsequently expressed audibly.
  • the utterance data UD may correspond to a complete pronounced word or a sequence of words already contained in a sound file 30, whereupon the processor 34 has the task of identifying the required word(s) and reading the corresponding digital samples.
  • the processor 34 may also operate in a concatenative mode, in which it generates words or phrases by selecting and concatenating sound elements such as syllables or phonemes contained in a sound file. In this case, it selectively reads out the corresponding groups of digital samples to form the concatenation required for the utterance data UD to be reproduced audibly.
  • the sound data processor 34 generates the utterance data UD in response to commands specifying the utterance to be produced, the latter coming e.g. from the central control system of a robotic pet or the like, as a function of its operating context.
  • the description of the operation that follows is based on an example of a real-time access from the sound file memory 32 to produce the audible age-modified voice on the fly.
  • the operation is given with reference to a portion of utterance data UD having a sampling duration R1 at a sampling frequency equal to F1.
  • the sampling duration R1 would simply be equal to t.
  • an utterance data portion UD undergoes a time adjustment with reference to a second frequency F2 (Hz).
  • the time adjustment is such that the sampling duration of the time-adjusted form of utterance data portion, hereafter designated UDta, at that second frequency F2 is substantially the same as the sampling duration R1 of the original utterance data portion UD at the sampling frequency F1.
  • this expedient has the remarkable effect of changing the age A of the voice as a function of this second frequency F2 when playing the time-adjusted form of utterance data UDta at a frequency at or in the region of the second frequency F2.
  • the second frequency F2 is produced by a continuously variable frequency generator 36 which can be electronically controlled to generate a range of frequency values for F2 above and below the initial frequency F1, respectively to rejuvenate or age the voice conveyed by the utterance data.
  • the variable frequency generator delivers a sampling clock signal at frequency F2 to the processor 34 for managing the data flow rate accordingly, and to a sound player 40 to provide the play sampling frequency F2.
  • the time-adjustment is produced by a pitch-synchronous overlap and add - generally referred to in the literature by the acronym "PSOLA" - algorithm unit 38.
  • the PSOLA algorithm unit 38 is operative to perform on the incoming utterance data portion UD a PSOLA synthesis in which, e.g., the following steps are performed:
  • More information on PSOLA algorithms can be found on the following web page: http://www.fon.let.uva.nl/praat/manual/PSOLA.html (copyright ppgb, March 30, 2001).
  • the PSOLA algorithm unit 38 thus judiciously produces an appropriate number of samples for the time adjusted utterance data UDta so that a time-adjusted utterance data portion corresponding to the voice signal 24, for instance, will have the same duration t when played at the sampling frequency of F2 (age modified signal 24' at the bottom of figure 2).
  • the PSOLA algorithm unit 38 receives as input parameters the values of F1 and F2 or information expressing the ratio of these two values in order to calculate the required adjustment.
  • the usual function of a PSOLA algorithm unit is, on the contrary, to modify the sampling duration of a given sound sequence so that it becomes contracted or expanded when played at the same frequency as its digitisation sampling frequency. In other words, it serves to accelerate or slow down the rate of playing through a digital sound sequence. The steps performed to provide this adjustment nevertheless allow the changed playing rate to maintain the same pitch characteristics as the original sound.
  • a typical application in the art of a PSOLA algorithm is in the field of voice synthesis, where it serves to modify the delivery rate of speech (slow/fast talking speed) without thereby altering the pitch of the voice.
  • the PSOLA algorithm unit 38 is used instead to maintain an initial duration of a given voice portion despite a change in the play sampling frequency F2 relative to the sampling frequency F1 for which the sound data was initially prepared.
  • the sampling duration of the utterance data UD at the second frequency F2 would be t.F1/F2 without the above time-adjustment.
  • if the second frequency F2 is less than F1 (F2 < F1), the sampling duration of the utterance data UD would be expanded, while it would be contracted if the frequency F2 were greater than F1 (F2 > F1).
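In numbers (an illustrative sketch with hypothetical frequency values, not taken from the patent):

```python
def unadjusted_duration(t: float, f1: float, f2: float) -> float:
    """Duration of a t-second recording (sampled at F1) when its samples are
    simply played back at F2 with no time adjustment: t * F1 / F2 seconds."""
    return t * f1 / f2

def psola_scale_factor(f1: float, f2: float) -> float:
    """Time-scaling factor the time-adjustment step must apply so that playback
    at F2 again lasts t seconds: the data must hold t.F2 samples instead of t.F1."""
    return f2 / f1

# A 2 s recording sampled at 16 kHz, played at 12 kHz, stretches to about 2.67 s...
print(unadjusted_duration(2.0, 16_000, 12_000))
# ...so the time adjustment must compact the data to 75% of its sample count.
print(psola_scale_factor(16_000, 12_000))  # 0.75
```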
  • the time-adjusted form UDta of the sound data portion illustrated just below, produced by the PSOLA algorithm unit 38, has the same sampling duration t for the second frequency F2.
  • the time-adjusted utterance data UDta outputted by the PSOLA algorithm unit 38 is fed to a digital sound data player 40 where it is digital-to-analog converted into an audible form through a loudspeaker 42.
  • the selection of the apparent age to give to a voice can thus be achieved by correspondingly adjusting the frequency F2.
  • this adjustment is effected through a user-accessible selector 44 which presents a visible scale 46 along which can be displaced a moveable cursor 48.
  • the scale 46 spans an age range from very young to very old respectively at upper and lower limits of the frequency F2 for a range which contains the first frequency value F1.
  • the position of the cursor 48 is arranged to select a corresponding value of the frequency F2.
  • the thus-designated value of F2 is expressed as appropriate frequency-setting data for the frequency selection input of the variable frequency generator 36.
  • This frequency setting data is also sent to the PSOLA algorithm unit 38 so that it can take that parameter into account accordingly.
  • the scale and cursor 46, 48 can be material devices, such as a potentiometer or similar variable control device with graduations, or it can be a virtual device presented on a display screen, the cursor being displaceable by a mouse, trackball, or designated by a pushbutton etc.
  • the frequency-setting data can also be generated without user intervention in the course of the execution of a general management program. For instance, if applied to a robotic pet having a pre-ordained life cycle, the part of its program that governs its ageing process will regularly update the F2 frequency-setting data by gradually decreasing its value to make the voice correlate with the simulated age.
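One way such an ageing program might schedule the frequency is a simple linear interpolation over the simulated lifespan. This is only a sketch of the idea; the endpoint frequencies and the function name are invented for illustration and are not specified by the patent:

```python
def f2_for_simulated_age(age: float, lifespan: float,
                         f_young: float = 20_000.0,
                         f_old: float = 12_000.0) -> float:
    """Decrease the play sampling frequency F2 linearly from f_young at
    'birth' to f_old at the end of the life cycle (values illustrative only)."""
    progress = min(max(age / lifespan, 0.0), 1.0)  # clamp life progress to [0, 1]
    return f_young + progress * (f_old - f_young)

print(f2_for_simulated_age(0.0, 10.0))   # 20000.0 (youngest voice)
print(f2_for_simulated_age(10.0, 10.0))  # 12000.0 (oldest voice)
```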
  • the implementation of the system 20 can also allow the sound processing to be performed in advance, rather than on the fly as described above.
  • the sound processor 34 can be set to produce utterance data (UD) for later use, the data being stored in a memory which is read out and fed to the PSOLA algorithm unit 38 as and when required to be rendered audible.
  • the outputted time-adjusted utterance data UDta from the PSOLA algorithm unit 38 can be stored in a memory and supplied to the digital sound player as and when required.
  • the operation of the system 20 when handling virtual voice data is analogous in all respects.
  • the virtual voice data are produced entirely by a voice generating algorithm according to a programmed parameterisation (which may include the possibility of adding an emotional content to the voice).
  • the voice can again be made to sound human or animal-like, or may be invented sounds e.g. to correspond to the "personality" of a robotic pet.
  • Such an algorithm typically has a parameter for establishing a prescribed playing frequency, i.e. the frequency at which the digital sound elements are to be fed through a digital sound player to sound as intended.
  • the "intended" sound is the voice of a prescribed age (e.g. A as above) and the prescribed frequency shall also be designated F1.
  • This frequency is supplied as data to the sound data processor 34 and to the PSOLA algorithm unit 38 for scaling as mentioned above.
  • Figure 3 expresses the main operating steps of the system 20 for producing a voice with a variable age A.
  • the process starts with a step E2 of preparing the voice data of a given age A for a sampling frequency F1, corresponding to a sampling duration R1.
  • This data can be produced from the virtual voice data source 22 or from the digitizer 26 sampling a human voice or other analog sound source.
  • the age of the voice to be produced from this voice data is selected at a step E4, the age being specified relative to the initial age A through a user input (cf. cursor 48 and scale 46), or from a program output.
  • the voice data is processed at step E8 to produce utterance data UD.
  • the utterance data UD is submitted to a PSOLA algorithm or the like at step E10 so as to produce time-adjusted utterance data UDta, which yields the sampling duration R1 when played at a play sampling frequency of F2.
  • This time-adjusted utterance data is then played, e.g. at the frequency F2, to produce a voice of variable age, as follows: F2 < F1 gives an age > A, and F2 > F1 gives an age < A.
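The rule just stated can be expressed directly (a trivial sketch; the function name is ours):

```python
def apparent_age_shift(f1: float, f2: float) -> str:
    """F2 < F1 shifts the apparent age above the recorded age A;
    F2 > F1 shifts it below A; F2 == F1 leaves it unchanged."""
    if f2 < f1:
        return "older"
    if f2 > f1:
        return "younger"
    return "unchanged"

print(apparent_age_shift(16_000, 12_000))  # older
print(apparent_age_shift(16_000, 20_000))  # younger
```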
  • the variability in the age of the voice exhibits a remarkably smooth and accurate transition with a corresponding change in the frequency F2.
  • the latter can thus advantageously be made substantially continuously variable to give a similarly continuous change in the age of the voice, the rate of change being determined by the application.
  • the system 20 can for instance be incorporated in a device which simulates a life cycle, in which case the second frequency F2 is made to decrease with time over the life cycle.
  • at the start of the life cycle, the voice can be made to sound younger than the age A of the speaker, while towards the end of the life cycle it can be made to sound older than that age A, by allowing the second frequency to take values in a range which contains the initial sampling frequency F1.
  • the invention finds many different technical applications: in robotic pets; in age-simulation systems allowing operators and users of talking computerised systems to choose an appropriate age of voice; in creating animated characters of different ages in studio productions; in educational training; in computerised reading of texts; etc.
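For a concrete flavour of the time-adjustment step, the sketch below implements a plain overlap-add (OLA) time stretch on a synthetic tone. This is not the pitch-synchronous PSOLA of the patent (real PSOLA places analysis frames on detected pitch marks so that pitch is preserved exactly, which this sketch omits); it only shows the core idea of producing about F2/F1 times the original number of samples. All parameter values are illustrative:

```python
import math

def ola_time_stretch(samples, factor, frame=512, hop_out=128):
    """Simplified overlap-add time scaling: returns about len(samples) * factor
    samples. With factor = F2 / F1, factor < 1 compacts the data (voice aged)."""
    n_out = int(len(samples) * factor)
    # Hann window, so overlapping frames cross-fade smoothly
    win = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame) for i in range(frame)]
    out = [0.0] * (n_out + frame)
    norm = [0.0] * (n_out + frame)
    pos = 0                              # write position in the output
    while pos + frame <= n_out + frame:
        start = int(pos / factor)        # read position advances at 1/factor speed
        if start + frame > len(samples):
            break
        for i in range(frame):           # windowed copy of one input frame
            out[pos + i] += samples[start + i] * win[i]
            norm[pos + i] += win[i]
        pos += hop_out
    # normalise by the summed window envelope to avoid amplitude ripple
    return [o / n if n > 1e-6 else 0.0 for o, n in zip(out[:n_out], norm[:n_out])]

# A 4096-sample tone compacted by factor 0.75 yields 3072 samples,
# i.e. the same play duration when the sampling clock drops to 0.75 * F1.
tone = [math.sin(2 * math.pi * 220 * i / 16_000) for i in range(4096)]
stretched = ola_time_stretch(tone, 0.75)
print(len(stretched))  # 3072
```

In the patent's system this scaling (with pitch-synchronous frame placement) is what lets the utterance keep its duration R1 while the play clock moves from F1 to F2.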

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Toys (AREA)
EP01402177A 2001-08-14 2001-08-14 Sound processing method and apparatus for modifying a sound characteristic, e.g. an impression of age associated with a voice Withdrawn EP1286332A1 (de)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP01402177A EP1286332A1 (de) 2001-08-14 2001-08-14 Sound processing method and apparatus for modifying a sound characteristic, e.g. an impression of age associated with a voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP01402177A EP1286332A1 (de) 2001-08-14 2001-08-14 Sound processing method and apparatus for modifying a sound characteristic, e.g. an impression of age associated with a voice

Publications (1)

Publication Number Publication Date
EP1286332A1 (de) 2003-02-26

Family

ID=8182853

Family Applications (1)

Application Number Title Priority Date Filing Date
EP01402177A Withdrawn EP1286332A1 (de) 2001-08-14 2001-08-14 Sound processing method and apparatus for modifying a sound characteristic, e.g. an impression of age associated with a voice

Country Status (1)

Country Link
EP (1) EP1286332A1 (de)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0688010A1 (de) * 1994-06-16 1995-12-20 Canon Kabushiki Kaisha Method and apparatus for speech synthesis
JPH10133852A (ja) * 1996-10-31 1998-05-22 Toshiba Corp Personal computer and method of managing voice attribute parameters

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0688010A1 (de) * 1994-06-16 1995-12-20 Canon Kabushiki Kaisha Method and apparatus for speech synthesis
JPH10133852A (ja) * 1996-10-31 1998-05-22 Toshiba Corp Personal computer and method of managing voice attribute parameters

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MOULINES E ET AL: "Non-parametric techniques for pitch-scale and time-scale modification of speech", SPEECH COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 16, no. 2, 1 February 1995 (1995-02-01), pages 175 - 205, XP004024959, ISSN: 0167-6393 *
TITZE I ET AL: "Considerations in voice transformation with physiologic scaling principles", SPEECH COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 22, no. 2-3, 1 August 1997 (1997-08-01), pages 113 - 123, XP004100286, ISSN: 0167-6393 *
VELDHUIS R ET AL: "Time-scale and pitch modifications of speech signals and resynthesis from the discrete short-time Fourier transform", SPEECH COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 18, no. 3, 1 May 1996 (1996-05-01), pages 257 - 279, XP004018610, ISSN: 0167-6393 *

Similar Documents

Publication Publication Date Title
US5890115A (en) Speech synthesizer utilizing wavetable synthesis
JP6791258B2 (ja) Speech synthesis method, speech synthesis device, and program
CN104412320B (zh) Automatic performance technique using audio waveform data
JPH06102877A (ja) Sound composition device
JP2008146094A (ja) Voice intonation calibration method
US7432435B2 (en) Tone synthesis apparatus and method
KR20000005183A (ko) Image synthesis method and apparatus
US5659664A (en) Speech synthesis with weighted parameters at phoneme boundaries
CN113016028A (zh) Sound processing method and sound processing system
US7457752B2 (en) Method and apparatus for controlling the operation of an emotion synthesizing device
EP1286332A1 (de) Sound processing method and apparatus for modifying a sound characteristic, e.g. an impression of age associated with a voice
JP5779838B2 (ja) Sound processing device and program
CN115349147A (zh) Sound signal generation method, estimation model training method, sound signal generation system, and program
SE516521C2 (sv) Device and method for speech synthesis
JPH1115489A (ja) Singing sound synthesis device
JP3233036B2 (ja) Singing sound synthesis device
JPH09179576A (ja) Speech synthesis method
JP6036903B2 (ja) Display control device and display control method
WO2004027758A1 (en) Method for controlling duration in speech synthesis
JPH06250695A (ja) Pitch control method and device
JP5471138B2 (ja) Phoneme code conversion device and speech synthesis device
JP5552797B2 (ja) Speech synthesis device and speech synthesis method
JP6787491B2 (ja) Sound generation device and method
JP3284634B2 (ja) Rule-based speech synthesis device
EP1256933B1 (de) Method and apparatus for controlling an emotion synthesizing device

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR


AX Request for extension of the european patent

Extension state: AL LT LV MK RO SI

17P Request for examination filed

Effective date: 20030826

AKX Designation fees paid

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR

17Q First examination report despatched

Effective date: 20040203

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20041201