EP1105867A1 - Method and device for the concatenation of audiosegments, taking into account coarticulation - Google Patents
Method and device for the concatenation of audiosegments, taking into account coarticulation
- Publication number
- EP1105867A1 (application EP99942891A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- area
- audio segment
- sound
- areas
- concatenation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
- G10L13/07—Concatenation rules
Definitions
- the invention relates to a method and a device for concatenating audio segments for generating synthesized acoustic data, in particular synthesized speech.
- the invention further relates to synthesized speech signals generated by the co-articulation-appropriate concatenation of speech segments according to the invention, and to a data carrier containing a computer program for the generation of synthesized acoustic data, in particular synthesized speech, according to the invention.
- the invention moreover relates to a data memory which contains audio segments suitable for co-articulation-appropriate concatenation according to the invention, and to a sound carrier which contains acoustic data synthesized according to the invention.
- both the prior art presented below and the present invention relate to the entire area of synthesis of acoustic data by concatenation of individual audio segments obtained in any way.
- the following statements relate specifically to synthesized speech data through concatenation of individual speech segments.
- data-based speech synthesis is increasingly being carried out, in which corresponding segments are selected from a database comprising individual speech segments and linked (concatenated) with one another.
- the speech quality depends primarily on the number and type of available speech segments, because only speech can be synthesized that is represented by speech segments in the database.
- various methods are known that concatenate the speech segments according to complex rules.
- in these methods, an inventory, i.e. a database comprising the speech audio segments, can be used that is complete and manageable.
- An inventory is complete if it can be used to generate any phonetic sequence of the language to be synthesized, and is manageable if the number and type of data in the inventory can be processed in a desired manner using the technically available means.
- such a method must ensure that the concatenation of the individual inventory elements generates synthesized speech that differs as little as possible from naturally spoken speech.
- synthesized speech must therefore be fluent and exhibit the same co-articulatory effects as natural speech.
- co-articulatory effects, i.e. the mutual influence of successive speech sounds, must be reproduced: the inventory elements should be such that they take into account the co-articulation of individual successive speech sounds, and a procedure for concatenating the inventory elements should chain the elements taking into account the co-articulation of individual consecutive speech sounds as well as the superordinate co-articulation of several consecutive speech sounds, also across word and sentence boundaries.
- a sound is a class of arbitrary sound events (noises, sounds, tones, etc.).
- the sound events are divided into sound classes according to a classification scheme.
- a sound event belongs to a sound if, with regard to the parameters used for classification (e.g. spectrum, pitch, volume, chest or head voice, co-articulation, resonance spaces, emotion, etc.), the values of the sound event lie within the value ranges defined for that sound.
- the classification scheme for sounds depends on the type of application.
- the definition of the term "sound" used here is not limited to these parameters; any other parameters can be used.
- if the pitch or the emotional expression is also included as a parameter in the classification, two 'a' sounds with different pitches or with different emotional expression belong to different sounds in the sense of this definition.
- sounds can also be the tones of a musical instrument, such as a violin, at different pitches or in different playing styles (détaché, spiccato, col legno, etc.), as well as a dog's bark or the squeak of a car door.
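- To make the classification idea above concrete, the following is an illustrative sketch (not part of the patent) of checking membership in sound classes defined by parameter value ranges; the class names, parameters and ranges are hypothetical.

```python
from typing import Optional

# Hypothetical sound classes, each defined by value ranges per parameter
# (cf. spectrum, pitch, volume, emotion, etc. above).
SOUND_CLASSES = {
    "a_low":  {"pitch_hz": (80.0, 180.0),  "volume_db": (-30.0, 0.0)},
    "a_high": {"pitch_hz": (180.0, 400.0), "volume_db": (-30.0, 0.0)},
}

def classify(event: dict) -> Optional[str]:
    """Return the first sound class whose value ranges all contain the event's values."""
    for name, ranges in SOUND_CLASSES.items():
        if all(param in event and lo <= event[param] <= hi
               for param, (lo, hi) in ranges.items()):
            return name
    return None  # the event belongs to no defined sound class

print(classify({"pitch_hz": 120.0, "volume_db": -10.0}))  # -> a_low
```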
- Sounds can be reproduced by audio segments that contain corresponding acoustic data.
- in the sense of the previous definitions, the term "sound" can be replaced by the term "phone" and the term "phonetic character" by the term "phoneme". (This also applies the other way around, since phones are sounds classified according to the IPA scheme.)
- a static sound has areas that are similar to preceding or subsequent areas of that sound.
- the similarity does not necessarily have to be an exact correspondence to the periods of a sine tone, but is analogous to the similarity that exists between the areas of the static phones defined below.
- a dynamic sound has no areas that resemble previous or subsequent areas of the dynamic sound, such as the sound event of an explosion or a dynamic phone.
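- One possible operationalization of this static/dynamic distinction (our own hedged sketch, not the patent's method): compare the magnitude spectra of successive frames; high average similarity suggests a static sound. Frame size and threshold are illustrative assumptions.

```python
import numpy as np

def is_static(signal: np.ndarray, frame: int = 256, threshold: float = 0.9) -> bool:
    """Heuristic: a sound is 'static' if successive frames have similar spectra."""
    spectra = [np.abs(np.fft.rfft(signal[i:i + frame]))
               for i in range(0, len(signal) - frame + 1, frame)]
    if len(spectra) < 2:
        return False
    sims = [float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            for a, b in zip(spectra, spectra[1:])]
    return float(np.mean(sims)) >= threshold

t = np.arange(44100) / 44100.0
print(is_static(np.sin(2 * np.pi * 440 * t)))                  # periodic tone -> True
print(is_static(np.random.default_rng(0).normal(size=44100)))  # noise burst  -> False
```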
- a phone is a sound generated by the speech organs (a speech sound).
- the phones are divided into static and dynamic phones.
- Static phones include vowels, diphthongs, nasals, laterals, vibrants and fricatives.
- the dynamic phones include plosives, affricates, glottal stops and flapped sounds.
- a phoneme is the formal description of a phone, the formal description generally being given by phonetic characters.
- the co-articulation describes the phenomenon that a sound, i.e. also a phone, is influenced by upstream and downstream sounds or phones, whereby the co-articulation occurs between immediately adjacent sounds/phones, but can also extend as a superordinate co-articulation over a sequence of several sounds/phones (for example, when rounding the lips).
- the initial co-articulation area covers the area from the beginning of the sound / phone to the end of the co-articulation due to an upstream sound / phone.
- the solo articulation range is the range of the sound / phon that is not influenced by a preceding or following sound or a preceding or following phon.
- the end co-articulation area covers the area from the start of co-articulation due to a downstream sound / phone to the end of the sound / phone.
- the co-articulation area comprises an end co-articulation area and the adjacent initial co-articulation area of the adjacent sound / phone.
- a polyphone is a series of phones.
- the elements of an inventory are coded audio segments that reproduce sounds, parts of sounds, sound sequences or parts of sound sequences, or phones, parts of phones, polyphones or parts of polyphones.
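- As a concrete, non-normative illustration, an inventory element carrying the three areas defined above could be represented as follows; the field names are ours, not the patent's.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class InventoryElement:
    samples: np.ndarray     # coded waveform (possibly in a transformed form)
    initial_coart_end: int  # index where the initial co-articulation area ends
    solo_end: int           # index where the solo articulation area ends

    @property
    def initial_coarticulation(self) -> np.ndarray:
        return self.samples[:self.initial_coart_end]

    @property
    def solo_articulation(self) -> np.ndarray:
        return self.samples[self.initial_coart_end:self.solo_end]

    @property
    def end_coarticulation(self) -> np.ndarray:
        return self.samples[self.solo_end:]
```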
- FIG. 2a shows a conventional audio segment, while FIGS. 2b-2l show audio segments according to the invention.
- audio segments can also be formed from smaller or larger audio segments that are contained in the inventory or a database.
- audio segments can also be present in a transformed form (e.g. a Fourier-transformed form) in the inventory or in a database.
- Audio segments for the present method can also originate from an upstream synthesis step (which is not part of the method). Audio segments contain at least part of an initial co-articulation area, a solo articulation area and / or an end co-articulation area. Instead of audio segments, areas of audio segments can also be used.
- Concatenation means the joining of two audio segments.
- the concatenation moment is the point in time at which two audio segments are joined together.
- the concatenation can be done in different ways, e.g. with a crossfade or a hardfade (see also Figures 3a-3e):
- in a crossfade, a temporally rear area of a first audio segment and a temporally front area of a second audio segment are processed with suitable transition functions, and these two areas are then added in an overlapping manner such that the shorter of the two areas is at most completely overlapped by the longer one.
- in a hardfade, a temporally rear area of a first audio segment and a temporally front area of a second audio segment are processed with suitable transition functions, the two audio segments being joined together in such a way that the rear area of the first audio segment and the front area of the second audio segment do not overlap.
- the coarticulation area is particularly noticeable in that a concatenation in it is associated with discontinuities (e.g. spectral jumps).
- a hardfade represents a limit case of a crossfade in which the overlap of the temporally rear area of a first audio segment and the temporally front area of a second audio segment has a length of zero. This allows a crossfade to be replaced by a hardfade in certain, e.g. extremely time-critical, applications; however, such a replacement must be considered carefully, since it leads to significant quality losses when concatenating audio segments that should actually be concatenated by a crossfade.
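- The two concatenation types can be sketched as follows (an illustration under our own assumptions: raw sample arrays and a linear transition function; the patent leaves the transition functions open):

```python
import numpy as np

def hardfade(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Limit case of a crossfade: overlap of length zero, segments joined directly."""
    return np.concatenate([a, b]).astype(float)

def crossfade(a: np.ndarray, b: np.ndarray, overlap: int) -> np.ndarray:
    """Weight the rear area of a and the front area of b, then add them overlapping."""
    if overlap <= 0:  # degenerates to a hardfade
        return hardfade(a, b)
    fade = np.linspace(0.0, 1.0, overlap)
    out = np.concatenate([a, b[overlap:]]).astype(float)
    out[len(a) - overlap:len(a)] = a[-overlap:] * (1.0 - fade) + b[:overlap] * fade
    return out
```

A linear symmetric weighting like `fade` here is also the variant preferred later in the description; any other transition function could be substituted.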
- WO 95/30193 discloses a method and a device for converting text into audible speech signals using a neural network.
- the text to be converted into speech is converted into a sequence of phonemes using a conversion unit, with additional information being generated about the syntactic boundaries of the text and the emphasis of the individual syntactic components. These are forwarded together with the phonemes to a facility that determines the duration of the pronunciation of the individual phonemes based on rules.
- a processor generates a suitable input for the neural network from each individual phoneme in conjunction with the corresponding syntactic and temporal information, this input for the neural network also comprising the corresponding prosodic information for the entire phoneme sequence. From the available audio segments, the neural network now selects those that best reproduce the entered phonemes and links these audio segments accordingly. In this concatenation, the duration, total amplitude and frequency of the individual audio segments are adapted to upstream and downstream audio segments, taking into account the prosodic information of the speech to be synthesized, and are connected to one another in time. A change in individual areas of the audio segments is not described here.
- to generate the audio segments required for this method, the neural network must first be trained.
- US Pat. No. 5,524,172 describes a device for generating synthesized speech which uses the so-called diphone method.
- a text that is to be converted into synthesized speech is divided into phoneme sequences, corresponding prosodic information being assigned to each phoneme sequence.
- two diphones representing the phoneme are selected for each phoneme in the sequence and concatenated taking into account the corresponding prosodic information.
- the two diphones are each weighted using a suitable filter and the duration and pitch of both diphones are changed so that when the diphones are concatenated, a synthesized phoneme sequence is generated, the duration and pitch of which correspond to the duration and pitch of the desired phoneme sequence.
- the individual diphones are added in such a way that a temporally rear area of a first diphone and a temporally front area of a second diphone overlap, the concatenation moment generally being in the stationary region of the individual diphones (see FIG. 2a). Since a variation of the concatenation moment taking into account the co-articulation of successive audio segments (diphones) is not provided here, the quality (naturalness and intelligibility) of a speech synthesized in this way can be negatively influenced.
- the database also provides audio segments that differ slightly, but are suitable for synthesizing the same phoneme. In this way, the natural variation of the language is to be simulated in order to achieve a higher quality of the synthesized language.
- Both the use of the smoothing filter and the selection from a number of different audio segments for realizing a phoneme require high computing power from the system components used when implementing this method.
- the size of the database increases due to the increased number of audio segments provided.
- in this method, too, a co-articulation-dependent choice of the concatenation moment of individual audio segments is not provided, whereby the quality of the synthesized speech can be reduced.
- DE 689 15 353 T2 aims to improve the sound quality by specifying a procedure for how the transition between two adjacent samples is to be designed. This is particularly relevant for low sampling rates.
- the speech synthesis described in this document uses waveforms that represent the sounds to be concatenated. For waveforms of upstream sounds, a final sample value and an associated zero-crossing point are determined in each case, while for waveforms of downstream sounds, a first upper sample value and an associated zero-crossing point are each determined.
- sounds are connected to one another in a maximum of four different ways.
- the number of connection types is reduced to two if the waveforms are generated in accordance with the Nyquist theorem.
- DE 689 15 353 T2 describes that the range of waveforms used extends between the last sample of the upstream waveform and the first sample of the downstream waveform.
- a synthesized phoneme sequence has an authentic speech quality if it cannot be distinguished by the listener from the same phoneme sequence spoken by a real speaker.
- the acoustic data synthesized with the invention, in particular synthesized speech data, should have an authentic acoustic quality, in particular an authentic speech quality.
- the invention provides a method according to claim 1, a device according to claim 14, synthesized speech signals according to claim 28, a data carrier according to claim 39, a data memory according to claim 51, and a sound carrier according to claim 60.
- the invention thus makes it possible to generate synthesized acoustic data which reproduce a sequence of sounds, in that, when concatenating audio segment areas, the moment of concatenation of two audio segment areas is determined as a function of properties of the audio segment areas to be linked, in particular the co-articulation effects relating to the two audio segment areas.
- according to the invention, the concatenation moment is preferably chosen in the vicinity of the boundaries of the solo articulation area. In this way, a voice quality is achieved that cannot be achieved with the prior art.
- the computing power required is not higher than in the prior art.
- for this purpose, the invention provides a co-articulation-appropriate selection of the audio segment areas and of the different types of concatenation.
- a higher degree of naturalness of the synthesized acoustic data is achieved if a temporally downstream audio segment area whose beginning reproduces a static sound is connected to a temporally preceding audio segment area by means of a crossfade, or if a temporally downstream audio segment area whose beginning reproduces a dynamic sound is connected to a temporally preceding audio segment area by means of a hardfade.
- the invention makes it possible to reduce the number of audio segment areas necessary for data synthesis by using audio segment areas whose beginnings always reproduce a dynamic sound, whereby all concatenations of these audio segment areas can be carried out by means of a hardfade.
- for this purpose, downstream audio segment areas are connected to upstream audio segment areas, the beginnings of which each reproduce a dynamic sound.
- synthesized acoustic data of high quality can also be generated according to the invention, even with low computing power (for example in the case of answering machines or car control systems).
- the invention provides for the simulation of acoustic phenomena which result from the mutual influence of individual segments of corresponding natural acoustic data.
- individual audio segments or individual areas of the audio segments are processed using suitable functions.
- the frequency, the duration, the amplitude or the spectrum of the audio segments can be changed.
- prosodic information and / or superordinate co-articulation effects are preferably taken into account to solve this task.
- the signal curve of synthesized acoustic data can additionally be improved if the concatenation moment is placed at points of the individual audio segment regions to be linked, at which the two regions used match in terms of one or more suitable properties.
- suitable properties can include: zero crossing, amplitude value, slope, derivative of any degree, spectrum, pitch, amplitude value in a frequency range, volume, speech style, speech emotion, or other properties considered in the sound classification scheme.
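- For instance, matching on the zero-crossing property could be sketched like this (an assumption-laden illustration; the search window size is arbitrary): the concatenation moment is shifted to the nearest sample where the signal changes sign.

```python
import numpy as np

def nearest_zero_crossing(x: np.ndarray, moment: int, window: int = 64) -> int:
    """Shift a proposed concatenation moment to the nearest zero crossing in x."""
    lo, hi = max(1, moment - window), min(len(x), moment + window)
    candidates = [i for i in range(lo, hi) if np.sign(x[i - 1]) != np.sign(x[i])]
    return min(candidates, key=lambda i: abs(i - moment)) if candidates else moment
```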
- the invention makes it possible to improve the selection of the audio segment areas for generating the synthesized acoustic data and to make their concatenation more efficient by using heuristic knowledge: audio segment areas are preferably used that reproduce sounds/phones or parts of sound sequences/phone sequences.
- the invention allows the synthesized acoustic data generated to be used by converting these data into acoustic signals and/or speech signals and/or by storing them on a data carrier.
- the invention can be used to provide synthesized speech signals which differ from known synthesized speech signals in that they cannot be distinguished from real speech in their naturalness and intelligibility.
- audio segment areas, each reproducing parts of the sound sequence/phoneme sequence of the speech to be synthesized, are concatenated in a co-articulation-appropriate manner by determining the areas of the audio segments to be used and the moment of concatenation of these areas according to the invention, as defined in claim 28.
- An additional improvement of the synthesized speech can be achieved if a temporally downstream audio segment area whose beginning reproduces a static sound or a static phone is connected to a temporally preceding audio segment area by means of a crossfade, or if a temporally downstream audio segment area whose beginning reproduces a dynamic sound or a dynamic phone is connected to a temporally preceding audio segment area by means of a hardfade.
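- Expressed as a rule, reusing the crossfade/hardfade sketches above (the boolean flag would come from the inventory's coded segment data; this dispatch is our illustration, not the patent's wording):

```python
def concatenate(a, b, b_starts_static: bool, overlap: int):
    """Crossfade when the downstream area begins with a static sound/phone,
    hardfade when it begins with a dynamic one."""
    return crossfade(a, b, overlap) if b_starts_static else hardfade(a, b)
```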
- a fast and efficient procedure is particularly desirable when generating synthesized speech.
- Such audio segment areas can be generated beforehand with the invention by co-articulation-appropriate concatenation of corresponding audio segment areas.
- the invention provides speech signals which have a natural speech flow, speech melody and speech rhythm in that audio segment areas are processed before and / or after concatenation in their entirety or in individual areas with the aid of suitable functions. It is particularly advantageous to additionally carry out this variation in areas in which the corresponding moments of the concatenations lie, in order, inter alia, to change the frequency, duration, amplitude or spectrum.
- An additionally improved signal curve can be achieved if the concatenation moments are located at locations of the audio segment regions to be linked, at which these correspond in one or more suitable properties.
- the speech signals can be converted into acoustic signals or stored on a data carrier.
- a data carrier is provided which contains a computer program which enables the method according to the invention to be carried out or the device according to the invention and its various embodiments to be controlled.
- the data carrier according to the invention also allows the generation of voice signals which have concatenations that are appropriate for co-articulation.
- the invention provides a data memory which contains audio segments that are suitable for being concatenated according to the invention into synthesized acoustic data.
- a data carrier preferably contains audio segments which are suitable for carrying out the method according to the invention and for use in the device according to the invention or the data carrier according to the invention.
- the data carrier can also include voice signals according to the invention.
- the invention also enables the provision of a sound carrier containing data which were generated at least partially by the method according to the invention or the device according to the invention, or by using the data carrier according to the invention or the data memory according to the invention, in particular synthesized speech signals.
- Figure 1a Schematic representation of an inventive device for generating synthesized acoustic data
- Figure 1b Structure of a sound / phon.
- Figure 2a Structure of a conventional audio segment according to the prior art, consisting of parts of two sounds, ie a diphone for speech. It is essential that the solo articulation areas are only partially contained in the conventional diphone audio segment.
- Figure 2b Structure of an audio segment according to the invention, which reproduces parts of a sound / phon with downstream co-articulation areas (quasi a 'shifted' diphone for speech).
- Figure 2c Structure of an audio segment according to the invention, which reproduces parts of a sound / phon with upstream coarticulation areas.
- Figure 2d Structure of an audio segment according to the invention, which reproduces parts of a sound / phon with downstream coarticulation areas and contains additional areas.
- Figure 2e Structure of an audio segment according to the invention, which reproduces parts of a sound / phon with upstream coarticulation areas and contains additional areas.
- Figure 2f Structure of an audio segment according to the invention, which reproduces parts of several sounds/phones (for speech: a polyphone), each with downstream co-articulation areas. Sounds/phones 2 to (n-1) are contained completely in the audio segment.
- Figure 2g Structure of an audio segment according to the invention, which reproduces parts of several sounds/phones (for speech: a polyphone), each with upstream co-articulation areas. Sounds/phones 2 to (n-1) are contained completely in the audio segment.
- Figure 2h Structure of an audio segment according to the invention, which reproduces parts of several sounds/phones (for speech: a polyphone), each with downstream co-articulation areas, and contains additional areas. Sounds/phones 2 to (n-1) are contained completely in the audio segment.
- Figure 2i Structure of an audio segment according to the invention, which reproduces parts of several sounds/phones (for speech: a polyphone), each with upstream co-articulation areas, and contains additional areas. Sounds/phones 2 to (n-1) are contained completely in the audio segment.
- Figure 2j Structure of an audio segment according to the invention, which reproduces part of a sound/phone from the beginning of a sound sequence/phone sequence.
- Figure 2k Structure of an audio segment according to the invention, which reproduces parts of sounds/phones from the beginning of a sound sequence/phone sequence.
- Figure 2l Structure of an audio segment according to the invention, which reproduces a sound/phone from the end of a sound sequence/phone sequence.
- Figure 3a Concatenation according to the prior art using the example of two conventional audio segments. The segments begin and end with parts of the solo articulation areas (usually half each).
- Figure 3al concatenation according to the prior art.
- the solo articulation area of the middle phone comes from two different audio segments.
- Figure 3b Concatenation according to the inventive method using the example of two audio segments according to the invention, each containing a sound/phone with downstream coarticulation areas. Both sounds/phones come from the middle of a sound sequence.
- Figure 3bl concatenation of these audio segments using a crossfade.
- the solo articulation area comes from an audio segment.
- the transition between the audio segments takes place between two areas and is therefore less sensitive to differences (in the spectrum, frequency, amplitude, etc.).
- the audio segments can also be edited with additional transition functions before concatenation.
- Figure 3bll concatenation of these audio segments using a hardfade.
- Figure 3c Concatenation according to the inventive method using the example of two audio segments according to the invention, each containing a sound/phone with downstream coarticulation areas, the first audio segment coming from the beginning of a sound sequence.
- Figure 3cll concatenation of these audio segments using a hardfade.
- Figure 3d Concatenation according to the inventive method using the example of two audio segments according to the invention, each of which contains a sound / a phon with upstream co-articulation areas. Both audio segments come from the middle of a sound sequence.
- Figure 3dl concatenation of these audio segments using a crossfade.
- the solo articulation area comes from an audio segment.
- Figure 3dll concatenation of these audio segments using a hardfade.
- Figure 3el concatenation of these audio segments using a crossfade.
- Figure 3ell concatenation of these audio segments using a hardfade.
- Figure 4 Schematic representation of the steps of a method according to the invention for generating synthesized acoustic data.
- in order to use the invention, for example, to convert a text into synthesized speech, it is necessary in a preceding step to subdivide this text into a sequence of phonetic characters or phonemes using known methods or devices. Prosodic information corresponding to the text should preferably also be generated.
- the phonetic sequence or phoneme sequence as well as the prosodic and additional information serve as input variables for the method and the device according to the invention.
- the sounds / phones to be synthesized are fed to an input unit 101 of the device 1 for generating synthesized speech data and stored in a first storage unit 103 (see FIG. 1a).
- with the aid of a selection device 105, audio segment areas which reproduce sounds or phones, or parts thereof, corresponding to the individual entered phonetic characters or phonemes are selected from an inventory of audio segments (elements) stored in a database 107, or from an upstream synthesis device 108 (which is not part of the invention), and are stored in a second memory unit 109 in an order which corresponds to the sequence of the entered phonetic characters or phonemes.
- the selection device 105 preferably selects the audio segments which reproduce the largest parts of sound sequences or polyphones corresponding to a sequence of phonetic characters or phonemes from the entered sound string or phoneme sequence, so that a minimum number of audio segments is required for the synthesis of the entered phoneme sequence.
- alternatively, the selection device 105 preferably selects the longest audio segment areas which reproduce parts of the entered sequence of sounds/phonemes, in order to synthesize the entered sequence of sounds or phonemes from a minimal number of audio segment areas (see the sketch below). In this case, it is advantageous to use audio segment areas reproducing concatenated sounds/phones that begin with a static sound/phone.
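- A hedged sketch of this selection preference: greedily cover the entered phoneme sequence with the longest matching inventory entries. Greedy matching only approximates the minimal-number criterion, and keying the inventory by phoneme tuples is our assumption.

```python
def select_segments(phonemes, inventory):
    """Cover `phonemes` with as few inventory entries as a greedy longest match allows."""
    selected, i = [], 0
    while i < len(phonemes):
        for length in range(len(phonemes) - i, 0, -1):  # try the longest match first
            key = tuple(phonemes[i:i + length])
            if key in inventory:
                selected.append(inventory[key])
                i += length
                break
        else:
            raise KeyError(f"no inventory element covers {phonemes[i]!r}")
    return selected
```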
- the concatenation moments of two successive audio segment areas are determined with the aid of a concatenation device 111 as follows:
- if an audio segment area is to be used to synthesize the beginning of the entered sound sequence/phoneme sequence (step 1), an audio segment area which reproduces the beginning of a sound sequence/phoneme sequence is to be selected from the inventory and chained with a temporally downstream audio segment area (see FIG. 3c and step 3 in FIG. 4).
- if the concatenation is carried out in the form of a crossfade, the moment of concatenation is placed in the rear area of the first audio segment area and in the front area of the second audio segment area, these two areas overlapping during the concatenation or at least immediately adjoining one another (see Figures 3bl, 3cl, 3dl and 3el, concatenation using crossfade).
- if the concatenation is carried out in the form of a hardfade, the moment of the concatenation lies temporally immediately behind the rear area of the first audio segment area and temporally immediately before the front area of the second audio segment area (see Figures 3bll, 3cll, 3dll and 3ell, concatenation using hardfade).
- new audio segments which begin with the reproduction of a static sound/phone can be generated from the originally available audio segment areas. This is achieved by concatenating audio segment areas that begin with the reproduction of a dynamic sound/phone with audio segment areas that begin with the reproduction of a static sound/phone. Although this increases the number of audio segments and the scope of the inventory, it can provide a computational advantage when generating synthesized speech data, since fewer individual concatenations are required to generate a sound sequence/phoneme sequence and the remaining concatenations only have to be carried out in the form of a crossfade.
- the new chained audio segments thus generated are preferably fed to the database 107 or another storage unit 113.
- a further advantage of this concatenation of the original audio segment areas into new, longer audio segments arises if, for example, a sequence of sounds/phones occurs frequently in the entered sound sequence/phone sequence. Then one of the new correspondingly linked audio segments can be used (see the cache sketch below), and it is not necessary to re-concatenate the originally available audio segment areas each time this sequence of sounds/phones occurs.
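- The reuse idea can be captured by a small cache (an illustrative sketch; keying by the sound/phone sequence and the `build` callback are our assumptions):

```python
class ConcatenationCache:
    """Store segments concatenated once, keyed by the sound/phone sequence they reproduce."""
    def __init__(self):
        self._cache = {}

    def get_or_build(self, key, build):
        # `build` performs the actual concatenation (e.g. a crossfade) on first use only;
        # later occurrences of the same sequence reuse the stored result.
        if key not in self._cache:
            self._cache[key] = build()
        return self._cache[key]
```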
- when storing such chained audio segments, superordinate co-articulation effects are preferably also captured, or specific co-articulation effects in the form of additional data are assigned to the stored chained audio segment.
- if an audio segment area is to be used to synthesize the end of the entered sound sequence/phoneme sequence, an audio segment area which reproduces the end of a sound sequence/phoneme sequence is to be selected from the inventory and concatenated with a temporally preceding audio segment area (see FIG. 3e and step 8 in FIG. 4).
- the individual audio segments are stored in coded form in the database 107, the coded form of the audio segments being able to indicate, in addition to the waveform of the respective audio segment, which parts of sound sequences/phone sequences the respective audio segment reproduces, what type of concatenation (e.g. hardfade, linear or exponential crossfade) is to be carried out with which temporally subsequent audio segment area, and at which moment the concatenation with which temporally subsequent audio segment area takes place.
- the encoded form of the audio segments preferably also contains information relating to prosody, superordinate co-articulations and transition functions, which are used to achieve an additional improvement in speech quality.
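- The coded form described above might look like the following record (a sketch; the field names and the string encoding of the concatenation type are our choices, not the patent's):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class CodedSegment:
    waveform: np.ndarray
    reproduces: tuple   # parts of sound sequences/phone sequences covered
    fade_type: str      # e.g. "hardfade", "linear_crossfade", "exp_crossfade"
    concat_moment: int  # sample index of the concatenation moment
    extra: dict = field(default_factory=dict)  # prosody, superordinate
                                               # co-articulations, transition functions
```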
- those audio segment areas are selected as temporally downstream that correspond to the properties of the audio segment areas upstream in each case, including the type of concatenation and the concatenation moment.
- the concatenation of two successive audio segment areas takes place with the aid of the concatenation device 111.
- the waveform, the type of concatenation, the concatenation moment and any additional information of the first audio segment area and the second audio segment area are loaded from the database or the synthesis device (FIG. 3b and steps 10 and 11).
- preferably, those audio segment areas are selected which match one another with regard to their type of concatenation and their concatenation moment. In this case, it is no longer necessary to load the information regarding the type of concatenation and the concatenation moment of the second audio segment area.
- the waveform of the first audio segment area in a temporally rear area and the waveform of the second audio segment area in a temporally front area are each processed with suitable transition functions, e.g. multiplied by a suitable weighting function (see Figure 3b, steps 12 and 13).
- the lengths of the backward area of the first audio segment area and of the front area of the second audio segment area result from the type of concatenation and the temporal position of the concatenation moment, and these lengths can also be stored in the coded form of the audio segments in the database.
- if the two audio segment areas are to be linked with a crossfade, they are added in an overlapping manner in accordance with the respective concatenation moment (see FIGS. 3bl, 3cl, 3dl and 3el, step 15).
- a linear symmetrical crossfade is preferably to be used here, but any other type of crossfade or any type of transition function can also be used.
- if the concatenation is to be carried out in the form of a hardfade, the two audio segment areas are connected one after the other without overlapping (see FIGS. 3bll, 3cll, 3dll and 3ell, step 15).
- the two audio segment areas are arranged directly one behind the other in time. In order to be able to further process the synthesized speech data generated in this way, these are preferably stored in a third memory unit 115.
- the previously linked audio segment areas are regarded as the first audio segment area (step 1)
- the prosodic and additional information which is entered in addition to the sequence of sounds/phones should preferably also be taken into account when concatenating the audio segment areas.
- the frequency, duration, amplitude and / or spectral properties of the audio segment areas are changed before and / or after their concatenation so that the synthesized speech data have a natural word and / or sentence melody (steps 14, 17 or 18).
- the processing of the two audio segment areas with the aid of suitable functions in the area of the concatenation moment is also provided, in order, inter alia, to adapt the frequencies, durations, amplitudes and spectral properties.
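- A minimal sketch of such processing (our simplification: amplitude scaling plus a duration change by linear-interpolation resampling; note that plain resampling also shifts pitch, which real systems avoid with pitch-synchronous methods):

```python
import numpy as np

def adjust(segment: np.ndarray, gain: float = 1.0, stretch: float = 1.0) -> np.ndarray:
    """Scale amplitude by `gain` and duration by `stretch` via resampling."""
    n_out = max(1, int(round(len(segment) * stretch)))
    x_old = np.linspace(0.0, 1.0, len(segment))
    x_new = np.linspace(0.0, 1.0, n_out)
    return gain * np.interp(x_new, x_old, segment)
```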
- the invention also allows superordinate acoustic phenomena of real speech, such as superordinate co-articulation effects or speech style (e.g. whispering, emphasis, singing voice, falsetto, emotional expression), to be taken into account when synthesizing the sound sequence/phone sequence.
- information relating to such higher-level phenomena is additionally stored in coded form with the corresponding audio segments, so that when selecting the audio segment areas, only those are selected which correspond to the higher-level co-articulation properties of the audio segment areas upstream and / or downstream.
- the synthesized speech data thus generated preferably have a form which, using an output unit 117, allows the speech data to be converted into acoustic speech signals and the speech data and / or speech signals to be stored on an acoustic, optical, magnetic or electrical data carrier (step 19).
- inventory elements are preferably created by recording real spoken speech.
- the quality of the inventory depends, among other things, on the degree of training of the speaker building the inventory, i.e. on his or her ability to control the speech to be recorded (e.g. to control the pitch of the speech or to speak exactly at one pitch).
- in general, the invention can be used for the synthesis of any acoustic data or any sound events. Therefore, this invention can also be used for the generation and/or provision of synthesized speech data and/or speech signals for any languages or dialects, as well as for the synthesis of music.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE19837661 | 1998-08-19 | ||
DE1998137661 DE19837661C2 (en) | 1998-08-19 | 1998-08-19 | Method and device for co-articulating concatenation of audio segments |
PCT/EP1999/006081 WO2000011647A1 (en) | 1998-08-19 | 1999-08-19 | Method and device for the concatenation of audiosegments, taking into account coarticulation |
Publications (2)
Publication Number | Publication Date |
---|---|
EP1105867A1 true EP1105867A1 (en) | 2001-06-13 |
EP1105867B1 EP1105867B1 (en) | 2003-06-25 |
Family
ID=7878051
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP99942891A Expired - Lifetime EP1105867B1 (en) | 1998-08-19 | 1999-08-19 | Method and device for the concatenation of audiosegments, taking into account coarticulation |
Country Status (7)
Country | Link |
---|---|
US (1) | US7047194B1 (en) |
EP (1) | EP1105867B1 (en) |
AT (1) | ATE243876T1 (en) |
AU (1) | AU5623199A (en) |
CA (1) | CA2340073A1 (en) |
DE (2) | DE19861167A1 (en) |
WO (1) | WO2000011647A1 (en) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7369994B1 (en) | 1999-04-30 | 2008-05-06 | At&T Corp. | Methods and apparatus for rapid acoustic unit selection from a large speech corpus |
US7941481B1 (en) | 1999-10-22 | 2011-05-10 | Tellme Networks, Inc. | Updating an electronic phonebook over electronic communication networks |
US7308408B1 (en) * | 2000-07-24 | 2007-12-11 | Microsoft Corporation | Providing services for an information processing system using an audio interface |
DE10042571C2 (en) * | 2000-08-22 | 2003-02-06 | Univ Dresden Tech | Process for concatenative speech synthesis using graph-based building block selection with a variable evaluation function |
JP3901475B2 (en) * | 2001-07-02 | 2007-04-04 | 株式会社ケンウッド | Signal coupling device, signal coupling method and program |
US7379875B2 (en) * | 2003-10-24 | 2008-05-27 | Microsoft Corporation | Systems and methods for generating audio thumbnails |
DE102004044649B3 (en) * | 2004-09-15 | 2006-05-04 | Siemens Ag | Speech synthesis using database containing coded speech signal units from given text, with prosodic manipulation, characterizes speech signal units by periodic markings |
US20080154601A1 (en) * | 2004-09-29 | 2008-06-26 | Microsoft Corporation | Method and system for providing menu and other services for an information processing system using a telephone or other audio interface |
US8510113B1 (en) * | 2006-08-31 | 2013-08-13 | At&T Intellectual Property Ii, L.P. | Method and system for enhancing a speech database |
US8374868B2 (en) * | 2009-08-21 | 2013-02-12 | General Motors Llc | Method of recognizing speech |
US20110046957A1 (en) * | 2009-08-24 | 2011-02-24 | NovaSpeech, LLC | System and method for speech synthesis using frequency splicing |
JP6047922B2 (en) * | 2011-06-01 | 2016-12-21 | ヤマハ株式会社 | Speech synthesis apparatus and speech synthesis method |
US9368104B2 (en) * | 2012-04-30 | 2016-06-14 | Src, Inc. | System and method for synthesizing human speech using multiple speakers and context |
CN106471569B (en) * | 2014-07-02 | 2020-04-28 | 雅马哈株式会社 | Speech synthesis apparatus, speech synthesis method, and storage medium therefor |
KR20180081504A (en) * | 2015-11-09 | 2018-07-16 | 소니 주식회사 | Decode device, decode method, and program |
CN111145723B (en) * | 2019-12-31 | 2023-11-17 | 广州酷狗计算机科技有限公司 | Method, device, equipment and storage medium for converting audio |
CN113066459B (en) * | 2021-03-24 | 2023-05-30 | 平安科技(深圳)有限公司 | Song information synthesis method, device, equipment and storage medium based on melody |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0727397B2 (en) | 1988-07-21 | 1995-03-29 | シャープ株式会社 | Speech synthesizer |
FR2636163B1 (en) | 1988-09-02 | 1991-07-05 | Hamon Christian | METHOD AND DEVICE FOR SYNTHESIZING SPEECH BY ADDING-COVERING WAVEFORMS |
SE469576B (en) * | 1992-03-17 | 1993-07-26 | Televerket | PROCEDURE AND DEVICE FOR SYNTHESIS |
US5463715A (en) * | 1992-12-30 | 1995-10-31 | Innovation Technologies | Method and apparatus for speech generation from phonetic codes |
CN1057625C (en) * | 1994-04-28 | 2000-10-18 | 摩托罗拉公司 | A method and apparatus for converting text into audible signals using a neural network |
BE1010336A3 (en) * | 1996-06-10 | 1998-06-02 | Faculte Polytechnique De Mons | Synthesis method of its. |
- 1998
- 1998-08-19 DE DE19861167A patent/DE19861167A1/en not_active Ceased
- 1999
- 1999-08-19 WO PCT/EP1999/006081 patent/WO2000011647A1/en active IP Right Grant
- 1999-08-19 AT AT99942891T patent/ATE243876T1/en not_active IP Right Cessation
- 1999-08-19 US US09/763,149 patent/US7047194B1/en not_active Expired - Lifetime
- 1999-08-19 DE DE59906115T patent/DE59906115D1/en not_active Expired - Lifetime
- 1999-08-19 CA CA002340073A patent/CA2340073A1/en not_active Abandoned
- 1999-08-19 AU AU56231/99A patent/AU5623199A/en not_active Abandoned
- 1999-08-19 EP EP99942891A patent/EP1105867B1/en not_active Expired - Lifetime
Non-Patent Citations (1)
Title |
---|
See references of WO0011647A1 * |
Also Published As
Publication number | Publication date |
---|---|
WO2000011647A1 (en) | 2000-03-02 |
US7047194B1 (en) | 2006-05-16 |
DE19861167A1 (en) | 2000-06-15 |
ATE243876T1 (en) | 2003-07-15 |
CA2340073A1 (en) | 2000-03-02 |
DE59906115D1 (en) | 2003-07-31 |
EP1105867B1 (en) | 2003-06-25 |
AU5623199A (en) | 2000-03-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
DE60112512T2 (en) | Coding of expression in speech synthesis | |
DE19610019C2 (en) | Digital speech synthesis process | |
DE4237563C2 (en) | Method for synthesizing speech | |
DE69821673T2 (en) | Method and apparatus for editing synthetic voice messages, and storage means with the method | |
EP1105867B1 (en) | Method and device for the concatenation of audiosegments, taking into account coarticulation | |
DE69031165T2 (en) | SYSTEM AND METHOD FOR TEXT-LANGUAGE IMPLEMENTATION WITH THE CONTEXT-DEPENDENT VOCALALLOPHONE | |
DE60126575T2 (en) | Apparatus and method for synthesizing a singing voice and program for realizing the method | |
DE69925932T2 (en) | LANGUAGE SYNTHESIS BY CHAINING LANGUAGE SHAPES | |
DE69909716T2 (en) | Formant speech synthesizer using concatenation of half-syllables with independent cross-fading in the filter coefficient and source range | |
DE60035001T2 (en) | Speech synthesis with prosody patterns | |
DE69521955T2 (en) | Method of speech synthesis by chaining and partially overlapping waveforms | |
DE60216651T2 (en) | Speech synthesis device | |
DE2115258A1 (en) | Speech synthesis by concatenating words encoded in formant form | |
DD143970A1 (en) | METHOD AND ARRANGEMENT FOR SYNTHESIS OF LANGUAGE | |
US6424937B1 (en) | Fundamental frequency pattern generator, method and program | |
DE60202161T2 (en) | Method, apparatus and program for analyzing and synthesizing speech | |
DE69318209T2 (en) | Method and arrangement for speech synthesis | |
DE60205421T2 (en) | Method and apparatus for speech synthesis | |
EP0058130B1 (en) | Method for speech synthesizing with unlimited vocabulary, and arrangement for realizing the same | |
DE19841683A1 (en) | Device and method for digital speech processing | |
EP1344211B1 (en) | Device and method for differentiated speech output | |
DE60305944T2 (en) | METHOD FOR SYNTHESIS OF A STATIONARY SOUND SIGNAL | |
DE60311482T2 (en) | METHOD FOR CONTROLLING DURATION OF LANGUAGE SYNTHESIS | |
DE60316678T2 (en) | PROCESS FOR SYNTHETIZING LANGUAGE | |
DE19837661C2 (en) | Method and device for co-articulating concatenation of audio segments |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 20010319 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE |
|
17Q | First examination report despatched |
Effective date: 20010928 |
|
GRAG | Despatch of communication of intention to grant |
Free format text: ORIGINAL CODE: EPIDOS AGRA |
|
GRAG | Despatch of communication of intention to grant |
Free format text: ORIGINAL CODE: EPIDOS AGRA |
|
GRAH | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOS IGRA |
|
RAP1 | Party data changed (applicant data changed or rights of an application transferred) |
Owner name: BUSKIES, CHRISTOPH |
|
RIN1 | Information on inventor provided before grant (corrected) |
Inventor name: BUSKIES, CHRISTOPH |
|
GRAG | Despatch of communication of intention to grant |
Free format text: ORIGINAL CODE: EPIDOS AGRA |
|
GRAH | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOS IGRA |
|
GRAH | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOS IGRA |
|
GRAA | (expected) grant |
Free format text: ORIGINAL CODE: 0009210 |
|
AK | Designated contracting states |
Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: NL Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20030625 Ref country code: IT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT;WARNING: LAPSES OF ITALIAN PATENTS WITH EFFECTIVE DATE BEFORE 2007 MAY HAVE OCCURRED AT ANY TIME BEFORE 2007. THE CORRECT EFFECTIVE DATE MAY BE DIFFERENT FROM THE ONE RECORDED. Effective date: 20030625 Ref country code: IE Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20030625 Ref country code: GB Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20030625 Ref country code: FR Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20030625 Ref country code: FI Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20030625 |
|
REG | Reference to a national code |
Ref country code: GB Ref legal event code: FG4D Free format text: NOT ENGLISH |
|
REG | Reference to a national code |
Ref country code: CH Ref legal event code: EP |
|
REG | Reference to a national code |
Ref country code: IE Ref legal event code: FG4D Free format text: GERMAN |
|
REF | Corresponds to: |
Ref document number: 59906115 Country of ref document: DE Date of ref document: 20030731 Kind code of ref document: P |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: LU Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20030819 Ref country code: CY Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20030819 Ref country code: AT Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20030819 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: NL Payment date: 20030829 Year of fee payment: 5 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: MC Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20030831 Ref country code: LI Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20030831 Ref country code: CH Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20030831 Ref country code: BE Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20030831 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SE Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20030925 Ref country code: PT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20030925 Ref country code: GR Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20030925 Ref country code: DK Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20030925 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: GB Payment date: 20031024 Year of fee payment: 5 |
|
NLV1 | Nl: lapsed or annulled due to failure to fulfill the requirements of art. 29p and 29m of the patents act | ||
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: ES Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20031222 |
|
REG | Reference to a national code |
Ref country code: IE Ref legal event code: FD4D |
|
BERE | Be: lapsed |
Owner name: *BUSKIES CHRISTOPH Effective date: 20030831 |
|
REG | Reference to a national code |
Ref country code: CH Ref legal event code: PL |
|
PLBE | No opposition filed within time limit |
Free format text: ORIGINAL CODE: 0009261 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT |
|
26N | No opposition filed |
Effective date: 20040326 |
|
EN | Fr: translation not filed | ||
REG | Reference to a national code |
Ref country code: GB Ref legal event code: ERR Free format text: CORRECTION FOR CODE "EP GBV" |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: DE Payment date: 20180831 Year of fee payment: 20 |
|
REG | Reference to a national code |
Ref country code: DE Ref legal event code: R071 Ref document number: 59906115 Country of ref document: DE |