EP3580755A1 - Method and device for dynamically modifying the voice timbre by frequency shifting of the formants of a spectral envelope - Google Patents

Method and device for dynamically modifying the voice timbre by frequency shifting of the formants of a spectral envelope

Info

Publication number
EP3580755A1
EP3580755A1 (application EP18703604.1A)
Authority
EP
European Patent Office
Prior art keywords
frequency
spectral envelope
sound signal
frequencies
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
EP18703604.1A
Other languages
English (en)
French (fr)
Inventor
Jean-Julien Aucouturier
Pablo ARIAS
Axel ROEBEL
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Centre National de la Recherche Scientifique CNRS
Institut de Recherche et Coordination Acoustique/Musique (IRCAM)
Sorbonne Universite
Original Assignee
Centre National de la Recherche Scientifique CNRS
Institut de Recherche et Coordination Acoustique/Musique (IRCAM)
Sorbonne Universite
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Centre National de la Recherche Scientifique CNRS, Institut de Recherche et de Coordination Acoustique Musique IRCA, Sorbonne Universite filed Critical Centre National de la Recherche Scientifique CNRS
Publication of EP3580755A1 (German)


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 - Changing voice quality, e.g. pitch or formants
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 - Speech enhancement by changing the amplitude
    • G10L21/0324 - Details of processing therefor
    • G10L21/0332 - Details of processing therefor involving modification of waveforms
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques specially adapted for comparison or discrimination

Definitions

  • the present invention relates to the field of acoustic processing. More specifically, it relates to the modification of acoustic signals containing speech, in order to give the voice a particular timbre, for example a smiling timbre.
  • Smiles change the sound of our voices in a recognizable way, to the point that customer-service departments advise their employees to smile on the phone. Even if the smile is not seen by the customer, it is heard, and positively influences customer satisfaction.
  • the two-stage architecture proposed by Quené requires analyzing a portion of the signal before being able to resynthesize it, and thus induces a temporal shift between the moment when the word is pronounced and the moment when its transformation can be broadcast. Quené's method therefore does not allow a voice to be modified in real time.
  • a real-time voice modification can be applied to call-center operators: the operator's voice can be modified in real time before being transmitted to a customer, in order to sound more smiling.
  • the customer would then perceive their interlocutor as smiling, which is likely to improve customer satisfaction.
  • Non-player characters are all the characters, often secondary, that are controlled by the computer. These characters are often associated with various lines of dialogue to deliver, which allow the player to advance in the plot of a video game. These lines are usually stored as audio files that play when the player interacts with non-player characters. It is interesting, from a single neutral audio file, to apply different filters to the neutral voice to produce a timbre, for example smiling or tense, in order to simulate an emotion of the non-player character and to increase the sensation of immersion in the game.
  • the invention describes a method of modifying a sound signal, said method comprising: a step of obtaining time frames of the sound signal, in the frequency domain; for at least one time frame, the application of a first transformation of the sound signal in the frequency domain, comprising: a step of extracting a spectral envelope of the sound signal for said at least one time frame; a step of calculating the formant frequencies of said spectral envelope; a step of modifying the spectral envelope of the sound signal, said modification comprising the application of an increasing continuous function of transforming the frequencies of the spectral envelope, parameterized by at least two frequencies of formants of the spectral envelope.
  • the step of modifying the spectral envelope of the sound signal also comprises the application of a filter to the spectral envelope, said filter being parameterized by the frequency of a third formant of the spectral envelope of the sound signal.
  • the method comprises a step of classifying a time frame, according to a set of classes of time frames comprising at least one class of voiced frames and a class of unvoiced frames.
  • the method comprises: for each voiced frame, the application of said first transformation of the sound signal in the frequency domain; for each unvoiced frame, the application of a second transformation of the sound signal in the frequency domain, said second transformation comprising a step of applying a filter for increasing the energy of the sound signal centered on a predefined frequency.
  • the second transformation of the sound signal comprises: the step of extracting a spectral envelope of the sound signal for said at least one time frame; the application of an increasing continuous function of transforming the frequencies of the spectral envelope, parameterized identically to the increasing continuous frequency-transformation function applied to the spectral envelope of the immediately preceding time frame.
  • the application of an increasing continuous function of transforming the frequencies of the spectral envelope comprises: a calculation, for a set of initial frequencies determined from formants of the spectral envelope, of modified frequencies; a linear interpolation between the initial frequencies of the set of initial frequencies determined from formants of the spectral envelope and the modified frequencies.
  • At least one modified frequency is obtained by multiplying an initial frequency of the set of initial frequencies by a multiplier coefficient.
  • the set of frequencies determined from formants of the spectral envelope comprises: a first initial frequency calculated from half the frequency of a first formant of the spectral envelope of the sound signal; a second initial frequency calculated from the frequency of a second formant of the spectral envelope of the sound signal; a third initial frequency calculated from the frequency of a third formant of the spectral envelope of the sound signal; a fourth initial frequency calculated from the frequency of a fourth formant of the spectral envelope of the sound signal; a fifth initial frequency calculated from the frequency of a fifth formant of the spectral envelope of the sound signal.
  • a first modified frequency is calculated as being equal to the first initial frequency;
  • a second modified frequency is calculated by multiplying the second initial frequency by the multiplier coefficient;
  • a third modified frequency is calculated by multiplying the third initial frequency by the multiplier coefficient;
  • a fourth modified frequency is calculated by multiplying the fourth initial frequency by the multiplier coefficient;
  • a fifth modified frequency is calculated as equal to the fifth initial frequency.
  • each initial frequency is calculated from the frequency of a formant of a current time frame.
  • each initial frequency is calculated as the average of the formant frequencies of the same rank over two or more successive time frames.
  • the method is a method of modifying an audio signal comprising a voice in real time, comprising: receiving audio samples; creating a time frame of audio samples, when a sufficient number of samples is available to form said frame; applying a frequency transformation to the audio samples of said frame; applying the first transformation of the sound signal to at least one time frame in the frequency domain.
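The real-time framing just described can be sketched in Python as follows. This is an illustrative sketch only: the frame length, the analysis window, and the use of an FFT are assumptions for the example, not values mandated by the patent.

```python
import numpy as np

def frames_to_spectra(samples, frame_len, hop=None):
    """Cut an incoming stream of audio samples into time frames and
    convert each frame to the frequency domain.  A frame is emitted only
    once enough samples are available, as described above."""
    hop = hop if hop is not None else frame_len
    window = np.hanning(frame_len)                   # analysis window (assumption)
    spectra = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        spectra.append(np.fft.rfft(frame * window))  # frequency-domain frame
    return spectra

# 50 ms frames at a 16000 Hz sampling frequency -> 800 samples per frame
sr = 16000
frame_len = sr * 50 // 1000
```

With one second of audio this yields 20 non-overlapping frames of 401 frequency bins each; in a streaming setting the same loop runs as samples arrive, which is what bounds the latency to one frame duration.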
  • the invention also describes a method for applying a smiling timbre to a voice, implementing a method of modifying a sound signal according to the invention, said at least two formant frequencies being frequencies of formants affected by the smiling timbre of a voice.
  • said increasing continuous frequency-transformation function of the spectral envelope has been determined during a training phase, by comparing spectral envelopes of phonemes uttered by speakers in a neutral or smiling manner.
  • the invention also describes a computer program product comprising program code instructions recorded on a computer readable medium for implementing the steps of the method when said program is running on a computer.
  • the invention makes it possible to modify a voice in real time to give it a timbre, for example a smiling or tense timbre.
  • the method of the invention has low computational complexity, and can run in real time on ordinary computing hardware.
  • the invention introduces a minimum delay between the initial voice and the modified voice.
  • the invention produces voices perceived as natural.
  • FIG. 1, an example of spectral envelopes for the vowel 'a', uttered by an experimenter with and without a smile;
  • FIG. 2, an example of a system implementing the invention;
  • FIGS. 3a and 3b, two examples of methods according to the invention;
  • FIGS. 4a and 4b, two examples of increasing continuous frequency-transformation functions of the spectral envelope of a time frame according to the invention;
  • FIGS. 5a, 5b and 5c, three examples of modified vowel spectral envelopes according to the invention;
  • FIGS. 6a, 6b and 6c, three examples of spectrograms of speech uttered with and without a smile;
  • FIG. 7, an example of a vowel spectrogram transformation according to the invention;
  • FIG. 8, three examples of transformations of vowel spectrograms according to three examples of implementation of the invention.
  • FIG. 1 represents an example of spectral envelopes for the vowel 'a', uttered by an experimenter with and without a smile.
  • the graph 100 represents two spectral envelopes: the spectral envelope 120 represents the spectral envelope of the vowel 'a', pronounced without a smile by an experimenter; the spectral envelope 130 represents the same vowel 'a', uttered by the same experimenter, but smiling.
  • the two spectral envelopes 120 and 130 represent an interpolation of the peaks of the Fourier spectrum of the sound: the horizontal axis 110 represents the frequency, on a logarithmic scale; the vertical axis 111 represents the magnitude of the sound at a given frequency.
  • the spectral envelope 120 comprises a fundamental frequency F0 121, and several formants, among which a first formant F1 122, a second formant F2 123, a third formant F3 124, a fourth formant F4 125 and a fifth formant F5 126.
  • the spectral envelope 130 comprises a fundamental frequency F0 131, and several formants, among which a first formant F1 132, a second formant F2 133, a third formant F3 134, a fourth formant F4 135 and a fifth formant F5 136.
  • the fundamental frequencies F0 121 and 131 are the same for the two spectral envelopes.
  • the spectral envelope of the smiling voice also has a greater intensity around the frequency of the third formant F3 134.
  • FIG. 2 represents an exemplary system implementing the invention.
  • the system 200 presents an exemplary implementation of the invention, in the case of a connection between a user 240 and a teleoperator 210.
  • the teleoperator 210 communicates in this example through an audio headset equipped with a microphone, connected to a workstation.
  • This workstation is connected to a server 220, which can for example be used for a whole call center, or a group of teleoperators.
  • the server 220 communicates over a communication link with a relay antenna 230, enabling a radio link with the mobile phone of a user 240.
  • the user 240 can use a landline.
  • the teleoperator can also use a telephone, in association with the server 220.
  • the invention can thus be applied to all the system architectures allowing a connection between a user and a teleoperator, comprising at least one server or a workstation.
  • the teleoperator 210 generally speaks in a neutral voice.
  • a method according to the invention can thus be applied, for example by the server 220 or by the workstation of the teleoperator 210, to modify in real time the sound of the teleoperator's voice, and to transmit to the client 240 a modified voice that sounds naturally smiling.
  • the client also responds to a smiling-sounding voice, thereby improving the overall interaction between the client 240 and the teleoperator 210.
  • the invention is however not restricted to this example.
  • it can be used to modify neutral voices in real time.
  • it can be used to give a timbre (tense, smiling, etc.) to the neutral voice of a non-player character in a video game, in order to give the player the sensation that the non-player character feels an emotion.
  • It can be used, on the same principle, to modify in real time sentences said by a humanoid robot, in order to give the user of the humanoid robot the feeling that it experiences an emotion, and to improve the interaction between the user and the humanoid robot.
  • the invention can also be applied to players' voices in online video games, or therapeutically, by modifying a patient's voice in real time in order to improve the patient's emotional state, by giving them the impression of speaking with a smiling voice themselves.
  • FIG. 3a represents a first example of a method according to the invention.
  • the method 300a is a method of modifying a sound signal, and may be used for example to impart an emotion to a voice track pronounced in a neutral manner. The emotion may consist in making the voice more smiling, but may also consist in making the voice less smiling or more tense, or in imparting intermediate emotional states.
  • the method 300a comprises a step 310 for obtaining time frames of the sound signal, and their transformation in the frequency domain.
  • Step 310 consists in obtaining successive time frames forming the sound signal.
  • the audio frames can be obtained in different ways. For example, they can be obtained by recording a speaking operator through a microphone, by reading an audio file, or by receiving audio data, for example over a network connection.
  • the time frames may be of fixed or variable duration.
  • the time frames can have as short a duration as possible while still allowing a good spectral analysis, for example 25 or 50 ms. This duration advantageously allows the sound signal to be representative of a phoneme, while limiting the latency generated by the modification of the sound signal.
  • the sound signal can be of different types.
  • it may be a mono, stereo signal, or a signal with more than two channels.
  • Method 300a can be applied to all or part of the signal channels.
  • the signal can be sampled at different frequencies, for example 16000 Hz, 22050 Hz, 32000 Hz, 44100 Hz, 48000 Hz, 88200 Hz or 96000 Hz.
  • the samples can be represented in different ways. For example, they may be sound samples represented on 8, 12, 16, 24 or 32 bits. The invention can thus be applied to any type of computer representation of a sound signal.
  • the time frames can be obtained either directly in the form of their frequency transform, or acquired in the time domain and then transformed into the frequency domain.
  • the sound signal may for example be obtained directly in the frequency domain if it is initially stored or transmitted in a compressed audio format, for example MP3 (MPEG-1/2 Audio Layer 3), AAC (Advanced Audio Coding), WMA (Windows Media Audio), or any other compression format in which the audio signal is stored in the frequency domain.
  • the frames can also be obtained initially in the time domain, and then converted into the frequency domain. For example, a sound can be recorded live using a microphone, for example a microphone in which the teleoperator 210 would speak.
  • the time frames are then initially constituted by storing a given number of successive samples (defined by the duration of the frame and the sampling frequency of the sound signal), then applying a frequency transformation to the sound signal.
  • the frequency transformation can for example be a DFT (Discrete Fourier Transform), a DCT (Discrete Cosine Transform), an MDCT (Modified Discrete Cosine Transform), or any other transformation suitable for converting the sound samples from the time domain to the frequency domain.
  • the method 300a then comprises, for at least one time frame, the application of a first transformation 320a of the sound signal in the frequency domain.
  • the first transformation 320a comprises an extraction step 330 of a spectral envelope of the sound signal for said at least one frame.
  • the extraction of the spectral envelope of the sound signal from the frequency transform of a frame is well known to those skilled in the art.
  • the extraction of the spectral envelope can be performed in many ways known to those skilled in the art. It can be performed for example by linear predictive coding, as described by Makhoul, J. (1975). Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4), 561-580.
  • the extraction of the spectral envelope can also be carried out by cepstral transformation, as described for example by Röbel, A., Villavicencio, F., & Rodet, X. (2007). Cepstral and all-pole based spectral envelope modeling with unknown model order. Pattern Recognition Letters, 28(11), 1343-1350. Any other spectral envelope estimation method known to those skilled in the art can also be used.
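As an illustration of the cepstral approach cited above, here is a minimal Python sketch of envelope estimation by cepstral smoothing. The liftering order is an arbitrary assumption; the cited reference additionally estimates the model order, which this sketch does not.

```python
import numpy as np

def cepstral_envelope(frame, n_coeffs=30):
    """Estimate a spectral envelope by cepstral smoothing: compute the
    real cepstrum of the frame, keep only the low-quefrency coefficients
    (the slowly varying part of the log spectrum), and transform back."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # avoid log(0)
    cepstrum = np.fft.ifft(log_mag).real
    lifter = np.zeros_like(cepstrum)
    lifter[:n_coeffs] = 1.0                      # keep low quefrencies,
    lifter[-(n_coeffs - 1):] = 1.0               # symmetrically
    smoothed_log = np.fft.fft(cepstrum * lifter).real
    return np.exp(smoothed_log)                  # envelope, linear magnitude
```

The result is a smooth, strictly positive magnitude curve over the same frequency bins as the frame's spectrum, which is the form the formant-calculation step below operates on.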
  • the first transformation 320a also comprises a calculation step 340 of the formant frequencies of said spectral envelope.
  • Many methods of extracting formants can be used in the invention.
  • the calculation of the formant frequencies of the spectral envelope can for example be carried out by the method described by McCandless, S. (1974). An algorithm for automatic formant extraction using linear prediction spectra. IEEE Transactions on Acoustics, Speech, and Signal Processing, 22(2), 135-141.
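In the same spirit as the LPC-based method cited above, formant frequencies can be read off the pole angles of an all-pole model. The Python sketch below (autocorrelation method, fixed order, no pre-emphasis or bandwidth checks) is a simplified illustration, not the McCandless algorithm itself.

```python
import numpy as np

def lpc_formants(frame, sr, order=12):
    """Rough formant estimation: fit an all-pole model by the
    autocorrelation method, then convert the angles of the complex
    pole pairs to frequencies in Hz."""
    # autocorrelation sequence r[0..order]
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:][:order + 1]
    # solve the Yule-Walker normal equations R a = r[1:] for the LPC coefficients
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # roots of the prediction polynomial A(z) = 1 - sum a_k z^-k
    poles = np.roots(np.concatenate(([1.0], -a)))
    freqs = sorted(np.angle(p) * sr / (2 * np.pi) for p in poles if p.imag > 0)
    return [f for f in freqs if f > 50]   # drop near-DC roots
```

Feeding it a synthetic frame with known spectral peaks should return frequencies near those peaks, which is how such a sketch can be sanity-checked.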
  • the method 300a also comprises a modification step 350 of the spectral envelope of the sound signal.
  • the modification of the spectral envelope of the sound spectrum makes it possible to obtain a spectral envelope more representative of the desired emotion.
  • the modification step 350 of the spectral envelope comprises the application 351 of an increasing continuous function of transforming the frequencies of the spectral envelope, parameterized by at least two frequencies of formants of the spectral envelope.
  • the modification step 350 of the spectral envelope of the sound signal also comprises the application 352 of a dynamic filter to the spectral envelope, said filter being parameterized by the frequency of a third formant F3 of the spectral envelope of the sound signal.
  • This step makes it possible to increase or reduce the signal intensity around the frequency of the third formant F3 of the spectral envelope of the sound signal, so that the modified spectral envelope is even closer to that of a phoneme emitted with the desired emotion. For example, as shown in FIG. 1, an increase in the sound intensity around the frequency of the third formant F3 makes it possible to obtain a spectral envelope even closer to what the spectral envelope of the same phoneme uttered with a smile would be.
  • the filter used at this stage can be of different types.
  • This filter makes it possible to increase the intensity of the spectrum for frequencies around that of the formant F3, and thus to obtain a spectral envelope closer to that which would have been obtained by a smiling speaker.
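One possible shape for such a filter is sketched below in Python, under the assumption of a Gaussian-shaped gain bump centred on F3; the patent does not mandate this shape, and the gain and width defaults are arbitrary illustrative values.

```python
import numpy as np

def boost_around(envelope, freqs, center_hz, gain_db=6.0, width_hz=500.0):
    """Raise (or, with a negative gain, lower) the envelope's intensity
    in a band centred on a formant frequency, using a Gaussian bump."""
    gain = 10 ** (gain_db / 20.0)
    bump = 1.0 + (gain - 1.0) * np.exp(-0.5 * ((freqs - center_hz) / width_hz) ** 2)
    return envelope * bump
```

At the centre frequency the envelope is multiplied by the full gain, and far from it the multiplier decays back to 1, leaving the rest of the envelope essentially untouched.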
  • the spectral envelope can be applied to the sound spectrum.
  • Many embodiments are possible for applying the spectral envelope to the sound spectrum. For example, it is possible to multiply each of the components of the spectrum by the corresponding value of the envelope, as described for example by Liuni, M. et al. (2013). Phase vocoder and beyond. Musica/Tecnologia, Vol. 7, p. 77-89.
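Following the multiplication described above, applying a modified envelope to the spectrum can be sketched as scaling each frequency bin by the ratio of the modified to the original envelope. This is a simplified reading of the cited phase-vocoder technique, leaving the phases untouched.

```python
import numpy as np

def apply_envelope(spectrum, old_env, new_env):
    """Scale each complex spectral component by the ratio between the
    modified and the original envelope magnitude at that bin."""
    ratio = new_env / np.maximum(old_env, 1e-12)  # guard against division by zero
    return spectrum * ratio
```

Because only magnitudes are rescaled, an inverse transform of the result reconstructs a signal whose spectral envelope matches the modified one.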
  • an inverse frequency transform can be applied directly to the sound spectrum, in order to reconstruct the audio signal and listen to it directly. This allows, for example, listening to the modified voice of a non-player character in a video game.
  • the modified sound signal can be transmitted in raw or compressed form, in the frequency domain or in the time domain.
  • the method 300a may be used to modify an audio signal comprising a voice in real time, in order to affect in real time an emotion to a neutral voice.
  • This modification can for example be performed in real time.
  • This method makes it possible to apply an expression in real time to a neutral voice.
  • the step of creating the frame (or windowing) induces a latency in the execution of the method, since the audio samples can only be processed once all the samples of a frame have been received.
  • this latency depends solely on the duration of the time frames, and may be low, for example if the time frames have a duration of 50 ms.
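As a quick check of this latency claim, under the illustrative assumption of 50 ms frames and a 44100 Hz sampling frequency:

```python
# a frame can only be processed once it is complete, so the windowing
# latency equals the frame duration
sr = 44100                      # sampling frequency (illustrative)
frame_len = int(0.050 * sr)     # 50 ms frame -> 2205 samples
latency_ms = 1000.0 * frame_len / sr
```

Shorter frames reduce this latency further, at the cost of a coarser spectral analysis, as noted earlier for the 25 ms option.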
  • the invention also relates to a computer program product comprising program code instructions recorded on a computer readable medium for implementing the method 300a, or any other method according to different embodiments of the invention.
  • Said computer program may for example be stored and / or executed on the teleoperator workstation 210, or on the server 220.
  • FIG. 3b represents a second example of a method according to the invention.
  • the method 300b is also a method of modifying a sound signal, making it possible to treat the time frames differently according to the type of information they contain.
  • the method 300b comprises a classification step 360 of a time frame, according to a set of classes of time frames comprising at least one class of voiced frames and a class of unvoiced frames.
  • a time frame may belong to a class of voiced frames if it includes a vowel, and to an unvoiced frame class if it does not include a vowel, for example if it includes a consonant.
  • the classification can for example be based on the zero-crossing rate (ZCR) of the frame.
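A toy voiced/unvoiced decision based on the zero-crossing rate can be sketched as follows; the threshold is an assumption for the example, and a real classifier would combine several features.

```python
import numpy as np

def is_voiced(frame, zcr_threshold=0.1):
    """Voiced frames (vowels) change sign rarely, while unvoiced frames
    (noisy consonants) change sign often: compare the fraction of
    sign changes against a threshold."""
    signs = np.sign(frame)
    zcr = np.mean(signs[1:] != signs[:-1])  # fraction of sign changes
    return bool(zcr < zcr_threshold)
```

A 100 Hz sine (vowel-like) crosses zero about 200 times per second, far below the threshold, while white noise crosses on roughly half of its samples.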
  • the method 300b comprises, for each voiced frame, the application of the first transformation 320a of the sound signal in the frequency domain. All the embodiments of the invention discussed with reference to FIG. 3a may be applied to the first transformation 320a in the context of method 300b.
  • the method 300b comprises, for each unvoiced frame, the application of a second transformation 320b of the sound signal in the frequency domain.
  • the second transformation 320b of the sound signal in the frequency domain comprises a step of applying a filter for increasing the energy of the sound signal 370 centered on a frequency, for example a predefined frequency.
  • This feature makes it possible to refine the transformation of the audio signal by applying a transformation to unvoiced frames, for which the spectral envelope has no formant structure.
  • the second transformation 320b of the sound signal also comprises the step 330 of extracting a spectral envelope of the sound signal, for the frame concerned, and an application step 351b of an increasing continuous function of transforming the frequencies of the spectral envelope.
  • the application step 351b of an increasing continuous function of transforming the frequencies of the spectral envelope is parameterized identically to the increasing continuous function used for the immediately preceding time frame.
  • for a voiced frame, the increasing continuous frequency-transformation function of the envelope is parameterized according to the formant frequencies of the spectral envelope of that frame.
  • the transformation applied to the voiced frame is then applied, with the same parameters, to the immediately following unvoiced frame. If several unvoiced frames follow the voiced frame, the same transformation function, with the same parameters, can be applied to the successive unvoiced frames.
  • This characteristic makes it possible to apply a frequency-transformation function to the spectral envelope of the unvoiced frames, even though they do not include formants, while benefiting from a transformation that is as coherent as possible with the preceding voiced frames.
  • FIGS. 4a and 4b show two examples of increasing continuous frequency transforming functions of the spectral envelope of a time frame according to the invention.
  • FIG. 4a represents a first example of an increasing continuous function of transforming the frequencies of the spectral envelope of a time frame according to the invention.
  • the function 400a defines the frequencies of the modified spectral envelope, represented on the abscissa axis 401, as a function of the frequencies of the initial spectral envelope, represented on the ordinate axis 402.
  • This function thus makes it possible to construct the modified spectral envelope as follows: the intensity of each frequency of the modified spectral envelope is equal to the intensity of the frequency of the initial spectral envelope indicated by the function. For example, the intensity for the frequency 411a of the modified spectral envelope is equal to the intensity for the frequency 410a of the initial spectral envelope.
  • the frequency transformation function is defined as follows:
  • for each initial frequency of a set of initial frequencies determined from formants of the spectral envelope, a modified frequency is calculated.
  • the modified frequencies 411a, 421a, 431a, 441a and 451a corresponding to the initial frequencies 410a, 420a, 430a, 440a and 450a are calculated;
  • Linear interpolations are then performed between the initial frequencies of the set of initial frequencies determined from formants of the spectral envelope and the modified frequencies.
  • the linear interpolation 460 makes it possible to define linearly, for each initial frequency between the first initial frequency 410a and the second initial frequency 420a, a modified frequency between the first modified frequency 411a and the second modified frequency 421a.
  • Linear interpolation 461 makes it possible to define linearly, for each initial frequency between the second initial frequency 420a and the third initial frequency 430a, a modified frequency, between the second modified frequency 421a and the third modified frequency 431a;
  • Linear interpolation 462 makes it possible to define linearly, for each initial frequency between the third initial frequency 430a and the fourth initial frequency 440a, a modified frequency, between the modified third frequency 431a and the modified fourth frequency 441a;
  • Linear interpolation 463 makes it possible to define linearly, for each initial frequency between the fourth initial frequency 440a and the fifth initial frequency 450a, a modified frequency, between the modified fourth frequency 441a and the modified fifth frequency 451a.
  • the modified frequencies can be calculated in different ways. Some of them can be equal to the initial frequencies. Others can be obtained by multiplying an initial frequency by a multiplier coefficient a. This makes it possible, depending on whether the multiplier coefficient a is greater or less than one, to obtain modified frequencies higher or lower than the initial frequencies.
  • a modified frequency higher than the corresponding initial frequency (a > 1) is associated with a happier or more smiling voice;
  • a modified frequency lower than the corresponding initial frequency (a < 1) is associated with a more tense, or less smiling, voice.
  • the values of the coefficient a make it possible to define both the transformation to be applied to the voice and the strength of this transformation.
  • the initial frequencies for setting the transformation function are as follows:
  • a first initial frequency (410a) calculated from half the frequency of a first formant (F1) of the spectral envelope of the sound signal;
  • a second initial frequency (420a) calculated from the frequency of a second formant (F2) of the spectral envelope of the sound signal;
  • a third initial frequency (430a) calculated from the frequency of a third formant (F3) of the spectral envelope of the sound signal;
  • a fourth initial frequency (440a) calculated from the frequency of a fourth formant (F4) of the spectral envelope of the sound signal.
  • the frequencies of the spectral envelope lower than the first initial frequency 410a, and higher than the fifth initial frequency 450a, are thus not modified. This restricts the transformation to the frequencies of the formants affected by the tense or smiling tone of the voice, leaving unmodified, for example, the fundamental frequency F0.
  • the initial frequencies correspond to the frequencies of the formants of the current time frame.
  • the parameters of the transformation function are modified for each time frame.
  • the initial frequencies can also be calculated as the average of the formant frequencies of the same rank over two or more successive time frames.
  • the first initial frequency 410a can be calculated as the average of the frequencies of the first formants F1 for the spectral envelopes of n successive time frames, with n ≥ 2.
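The frame-averaging described above can be sketched as follows; the per-frame formant values, and the formant tracker that would produce them, are hypothetical:

```python
import numpy as np

# Hypothetical per-frame formant frequency estimates (Hz), one row per
# successive time frame, columns F1..F4 (as a formant tracker might return).
formants = np.array([
    [510.0, 1480.0, 2520.0, 3490.0],
    [495.0, 1520.0, 2480.0, 3510.0],
    [505.0, 1500.0, 2500.0, 3500.0],
])

n = 3  # number of successive frames to average, n >= 2
initial_freqs = formants[-n:].mean(axis=0)  # average of same-rank formants
print(initial_freqs)  # smoothed F1..F4 used to set the transformation
```

Averaging over several frames smooths out tracker jitter before the transformation function is parameterized.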
  • the frequency transformation is mainly applied between the second formant F2 and the fourth formant F4.
  • the modified frequencies can thus be calculated in the following way:
  • a first modified frequency 411a is calculated as being equal to the first initial frequency 410a;
  • a second modified frequency 421a is calculated by multiplying the second initial frequency 420a by the multiplying coefficient a;
  • a third modified frequency 431a is calculated by multiplying the third initial frequency 430a by the multiplying coefficient a;
  • a fourth modified frequency 441a is calculated by multiplying the fourth initial frequency 440a by the multiplying coefficient a;
  • a fifth modified frequency 451a is calculated as being equal to the fifth initial frequency 450a.
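Taken together, the steps above (endpoints kept fixed, intermediate breakpoints scaled by a, linear interpolation in between) amount to resampling the spectral envelope along the warped frequency axis. A minimal numpy sketch, with an illustrative single-Gaussian "formant" and assumed breakpoint frequencies:

```python
import numpy as np

def shift_envelope(freqs, env, breakpoints, alpha):
    """Warp a sampled spectral envelope with the piecewise-linear
    transformation: intermediate breakpoints are scaled by `alpha`, the
    first and last are kept in place, and the increasing warp is
    inverted by interpolation to resample the envelope."""
    bp = np.asarray(breakpoints, dtype=float)
    mod = bp * alpha
    mod[0], mod[-1] = bp[0], bp[-1]  # endpoints unchanged (F0 region untouched)
    # extend the warp with identity below the first and above the last breakpoint
    bp_full = np.concatenate(([freqs[0]], bp, [freqs[-1]]))
    mod_full = np.concatenate(([freqs[0]], mod, [freqs[-1]]))
    warped = np.interp(freqs, bp_full, mod_full)  # warp(f) for every bin
    # the modified envelope at g takes the initial envelope value at warp^-1(g)
    return np.interp(freqs, warped, env)

freqs = np.arange(0.0, 8000.0, 10.0)
# toy envelope: a single Gaussian "formant" at 2.5 kHz (illustrative)
env = np.exp(-0.5 * ((freqs - 2500.0) / 200.0) ** 2)
env_up = shift_envelope(freqs, env, [250.0, 1500.0, 2500.0, 3500.0, 4500.0], 1.1)
print(freqs[np.argmax(env_up)])  # peak moves up, from 2500 Hz to about 2750 Hz
```

With `alpha < 1` the same function moves the peak down, as in the tense-voice transformation 400b.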
  • the example transformation function 400a transforms the spectral envelope of a time frame to obtain a more smiling voice, thanks to higher frequencies, especially between the second formant F2 and the fourth formant F4.
  • the multiplier coefficient a is predefined.
  • the multiplier a may be equal to 1.1 (a 10% increase in frequencies).
  • the multiplier coefficient a may depend on the intensity of modification of the voice to be generated.
  • the multiplier coefficient a can also be determined for a given user. For example, it can be determined during a training phase, in which the user utters phonemes with a neutral voice and then with a smiling voice. Comparing the frequencies of the different formants between the phonemes uttered with a neutral voice and with a smiling voice then makes it possible to calculate a multiplier coefficient a adapted to that user.
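The training-phase calibration described above could, under one simple assumption (a single ratio fitted across the formants), be sketched as follows; the measured frequencies are hypothetical:

```python
import numpy as np

# Hypothetical training-phase measurements for one user (Hz): frequencies
# of formants F2-F4 on the same phoneme, neutral vs. smiling utterance.
neutral = np.array([1500.0, 2500.0, 3500.0])
smiling = np.array([1650.0, 2730.0, 3880.0])

# One simple estimator: the mean ratio of smiling to neutral frequencies.
alpha = float(np.mean(smiling / neutral))
print(round(alpha, 3))  # a multiplier close to 1.1 for these values
```

A per-phoneme coefficient, as mentioned below, would simply repeat this estimate for each phoneme of interest.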
  • the value of the coefficient a depends on the phoneme.
  • a method according to the invention comprises a step of detecting the current phoneme, and the value of the coefficient a is then defined accordingly for the current frame.
  • the values of a may have been determined for a given phoneme during a training phase.
  • FIG. 4b represents a second example of an increasing continuous function of transforming the frequencies of the spectral envelope of a time frame according to the invention.
  • FIG. 4b represents a second function 400b, making it possible to give a voice a more tense or less smiling tone.
  • The representation of FIG. 4b is identical to that of FIG. 4a: the frequencies of the modified spectral envelope are represented on the abscissa axis 401, as a function of the frequencies of the initial spectral envelope, represented on the ordinate axis 402.
  • the function 400b is also constructed by computing, for each initial frequency 410b, 420b, 430b, 440b, 450b, a modified frequency 411b, 421b, 431b, 441b, 451b, and then defining linear interpolations 460b, 461b, 462b and 463b between the initial frequencies and the modified frequencies.
  • the modified frequencies 411b and 451b are equal to the initial frequencies 410b and 450b;
  • the modified frequencies 421b, 431b and 441b are obtained by multiplying the initial frequencies 420b, 430b and 440b by a factor a < 1.
  • the frequencies of the second formant F2, third formant F3 and fourth formant F4 of the spectral envelope modified by the function 400b will be lower than those of the corresponding formants of the initial spectral envelope. This gives the voice a tense tone.
  • the functions 400a and 400b are given by way of example only. Any continuous, increasing function of the frequencies of a spectral envelope, parameterized from the frequencies of the envelope's formants, can be used in the invention. For example, a function defined according to the formant frequencies related to the smiling character of the voice is particularly suitable.
  • Figures 5a, 5b and 5c show three examples of modified vowel spectral envelopes according to the invention.
  • FIG. 5a represents the spectral envelope 510a of the phoneme 'e', uttered in a neutral manner by an experimenter, and the spectral envelope 520a of the same phoneme 'e' uttered in a smiling manner by the experimenter.
  • Figure 5a also shows the spectral envelope 530a modified by a method according to the invention to make the voice more smiling.
  • the spectral envelope 530a thus represents the result of the application of a method according to the invention to the spectral envelope 510a.
  • FIG. 5b represents the spectral envelope 510b of the phoneme 'a', uttered in a neutral manner by an experimenter, and the spectral envelope 520b of the same phoneme 'a' uttered in a smiling manner by the experimenter.
  • Figure 5b also shows the spectral envelope 530b modified by a method according to the invention to make the voice more smiling.
  • the spectral envelope 530b thus represents the result of the application of a method according to the invention to the spectral envelope 510b.
  • FIG. 5c represents the spectral envelope 510c of the phoneme 'e', uttered in a neutral manner by a second experimenter, and the spectral envelope 520c of the same phoneme 'e' uttered in a smiling manner by the second experimenter.
  • Figure 5c also shows the envelope spectral 530c modified by a method according to the invention to make the voice more smiling.
  • the spectral envelope 530c thus represents the result of the application of a method according to the invention to the spectral envelope 510c.
  • the method according to the invention comprises the application of the frequency transformation function 400a shown in FIG. 4a, and the application of a bi-quad filter centered on the frequency of the third formant F3 of the envelope.
  • FIGS. 5a, 5b and 5c show that the method according to the invention makes it possible to preserve the overall shape of the envelope of the phoneme, while modifying the position and the amplitude of certain formants, so as to simulate a voice appearing smiling, while remaining natural.
  • the method according to the invention allows the transformed spectral envelope to closely match a smiling-voice spectral envelope in the upper-mid frequencies of the spectrum, as shown by the similarity of curves 521a and 531a, 521b and 531b, and 521c and 531c respectively.
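The patent does not specify the coefficients of the bi-quad filter centered on F3; one common realization of such a filter is a peaking-EQ bi-quad (Audio EQ Cookbook formulas). The F3 frequency, Q and gain below are assumptions for illustration only:

```python
import numpy as np

def peaking_biquad(f0, fs, q, gain_db):
    """Bi-quad peaking-EQ coefficients (RBJ Audio EQ Cookbook formulas),
    normalized so that a[0] == 1."""
    A = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([1.0 + alpha * A, -2.0 * np.cos(w0), 1.0 - alpha * A])
    a = np.array([1.0 + alpha / A, -2.0 * np.cos(w0), 1.0 - alpha / A])
    return b / a[0], a / a[0]

def gain_at(b, a, f, fs):
    """Magnitude response of the bi-quad at frequency f (Hz)."""
    z = np.exp(-2j * np.pi * f / fs)
    return abs((b[0] + b[1] * z + b[2] * z**2) /
               (a[0] + a[1] * z + a[2] * z**2))

fs = 16000.0
f3 = 2800.0  # assumed F3 frequency, for illustration only
b, a = peaking_biquad(f3, fs, q=5.0, gain_db=6.0)
print(gain_at(b, a, f3, fs))       # boost of about +6 dB at F3
print(gain_at(b, a, f3 / 4, fs))   # gain close to 1 far from F3
```

Such a filter boosts the amplitude around F3 while leaving the rest of the envelope almost untouched, consistent with the amplitude modification of certain formants mentioned above.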
  • FIGS. 6a, 6b and 6c show three examples of speech spectrograms uttered with and without a smile.
  • FIG. 6a represents a spectrogram 610a of a neutrally pronounced phoneme 'a', and a spectrogram 620a of the same phoneme 'a' to which the invention has been applied, in order to make the voice more smiling.
  • Figure 6b shows a spectrogram 610b of a neutrally pronounced phoneme 'e', and a spectrogram 620b of the same phoneme 'e' to which the invention has been applied, in order to make the voice more smiling.
  • FIG. 6c represents a spectrogram 610c of a neutrally pronounced phoneme 'i', and a spectrogram 620c of the same phoneme 'i' to which the invention has been applied, in order to make the voice more smiling.
  • Each of the spectrograms shows the evolution over time of the sound intensity for different frequencies, and reads as follows:
  • the horizontal axis represents the time within the diction of the phoneme;
  • the vertical axis represents the different frequencies;
  • the sound intensities are represented, for a given time and frequency, by the corresponding gray level: white represents zero intensity, while a very dark gray represents a strong intensity of the frequency at the corresponding time.
  • FIG. 7 represents a spectrogram 710 of a neutrally pronounced phoneme ⁇ ', and a spectrogram 720 of the same phoneme ⁇ ' to which the invention has been applied, in order to make the voice more smiling.
  • Each of the spectrograms shows the evolution over time of the intensity for different frequencies, according to the same representation as that of FIGS. 6a to 6c.
  • FIG. 8 represents three examples of transformations of vowel spectrograms according to three implementation examples of the invention.
  • the value of the multiplier coefficient a may be modified over time, for example to simulate a gradual change in the timbre of the voice.
  • the value of the multiplier coefficient a can increase to give the impression of an increasingly smiling voice, or decrease to give the impression of an increasingly tense voice.
  • Spectrogram 810 represents a spectrogram of a vowel uttered in a neutral tone and modified by the invention, with a constant multiplier coefficient a.
  • Spectrogram 820 represents a spectrogram of a vowel uttered in a neutral tone and modified by the invention, with a decreasing multiplier coefficient a.
  • Spectrogram 830 represents a spectrogram of a vowel uttered in a neutral tone and modified by the invention, with an increasing multiplier coefficient a.
  • This example demonstrates the ability of a method according to the invention to adjust the transformation of the spectral envelope, in order to produce effects in real time, for example to produce a more or less smiling voice.
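The time-varying coefficient a of spectrograms 810, 820 and 830 can be sketched as per-frame multiplier schedules; the start and end values below are assumptions, not taken from the patent:

```python
import numpy as np

n_frames = 100  # number of successive time frames (illustrative)

# Per-frame schedules for the multiplier coefficient a, mirroring
# spectrograms 810 (constant), 820 (decreasing) and 830 (increasing).
alpha_constant = np.full(n_frames, 1.1)
alpha_decreasing = np.linspace(1.1, 0.9, n_frames)  # more and more tense
alpha_increasing = np.linspace(0.9, 1.1, n_frames)  # more and more smiling

# In a frame-by-frame processing loop, frame k would be transformed with
# its own coefficient, e.g. alpha_increasing[k].
print(alpha_increasing[0], alpha_increasing[-1])
```

Because the transformation is parameterized per time frame, such schedules are enough to produce a gradual change of timbre in real time.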

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Electrophonic Musical Instruments (AREA)
EP18703604.1A 2017-02-13 2018-02-12 Verfahren und vorrichtung zur dynamischen modifizierung des stimmklangs durch frequenzverschiebung der formanten einer spektralen hüllkurve Ceased EP3580755A1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR1751163A FR3062945B1 (fr) 2017-02-13 2017-02-13 Methode et appareil de modification dynamique du timbre de la voix par decalage en frequence des formants d'une enveloppe spectrale
PCT/EP2018/053433 WO2018146305A1 (fr) 2017-02-13 2018-02-12 Methode et appareil de modification dynamique du timbre de la voix par decalage en fréquence des formants d'une enveloppe spectrale

Publications (1)

Publication Number Publication Date
EP3580755A1 true EP3580755A1 (de) 2019-12-18

Family

ID=58501711

Family Applications (1)

Application Number Title Priority Date Filing Date
EP18703604.1A Ceased EP3580755A1 (de) 2017-02-13 2018-02-12 Verfahren und vorrichtung zur dynamischen modifizierung des stimmklangs durch frequenzverschiebung der formanten einer spektralen hüllkurve

Country Status (7)

Country Link
US (1) US20190378532A1 (de)
EP (1) EP3580755A1 (de)
JP (1) JP2020507819A (de)
CN (1) CN110663080A (de)
CA (1) CA3053032A1 (de)
FR (1) FR3062945B1 (de)
WO (1) WO2018146305A1 (de)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109817193B * 2019-02-21 2022-11-22 深圳市魔耳乐器有限公司 A timbre fitting *** based on a time-varying multi-segment spectrum
US20210407527A1 * 2019-08-08 2021-12-30 Avaya Inc. Optimizing interaction results using ai-guided manipulated video
CN111816198A * 2020-08-05 2020-10-23 上海影卓信息科技有限公司 Voice conversion method and *** for changing speech pitch and timbre
CN112289330A * 2020-08-26 2021-01-29 北京字节跳动网络技术有限公司 Audio processing method, apparatus, device and storage medium
CN112397087B * 2020-11-13 2023-10-31 展讯通信(上海)有限公司 Formant envelope estimation and speech processing method and apparatus, storage medium, terminal
CN112506341B * 2020-12-01 2022-05-03 瑞声新能源发展(常州)有限公司科教城分公司 Method, apparatus, terminal device and storage medium for generating a vibration effect
CN113611326B * 2021-08-26 2023-05-12 中国地质大学(武汉) Real-time speech emotion recognition method and apparatus
EP4145444A1 (de) * 2021-09-07 2023-03-08 Avaya Management L.P. Optimierung von interaktionsergebnissen unter verwendung von ki-geführter manipulierter sprache

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3282693B2 * 1993-10-01 2002-05-20 日本電信電話株式会社 Voice quality conversion method
US7065485B1 * 2002-01-09 2006-06-20 At&T Corp Enhancing speech intelligibility using variable-rate time-scale modification
JP3941611B2 * 2002-07-08 2007-07-04 ヤマハ株式会社 Singing synthesis apparatus, singing synthesis method, and singing synthesis program
JP4076887B2 * 2003-03-24 2008-04-16 ローランド株式会社 Vocoder device
CN100440314C * 2004-07-06 2008-12-03 中国科学院自动化研究所 High-quality real-time voice conversion method based on speech analysis and synthesis
CN101004911B * 2006-01-17 2012-06-27 纽昂斯通讯公司 Method and apparatus for generating a frequency warping function and performing frequency warping
US8224648B2 * 2007-12-28 2012-07-17 Nokia Corporation Hybrid approach in voice conversion
WO2011026247A1 * 2009-09-04 2011-03-10 Svox Ag Speech enhancement techniques on the power spectrum
US9324337B2 * 2009-11-17 2016-04-26 Dolby Laboratories Licensing Corporation Method and system for dialog enhancement
CN102184731A * 2011-05-12 2011-09-14 北京航空航天大学 Emotional speech conversion method combining prosodic and voice-quality parameters
CN103038825B * 2011-08-05 2014-04-30 华为技术有限公司 Speech enhancement method and device
JP6433063B2 * 2014-11-27 2018-12-05 日本放送協会 Speech processing apparatus and program
CN106024010B * 2016-05-19 2019-08-20 渤海大学 Method for extracting dynamic features of a speech signal based on formant curves

Also Published As

Publication number Publication date
FR3062945B1 (fr) 2019-04-05
CN110663080A (zh) 2020-01-07
JP2020507819A (ja) 2020-03-12
WO2018146305A1 (fr) 2018-08-16
CA3053032A1 (fr) 2018-08-16
FR3062945A1 (fr) 2018-08-17
US20190378532A1 (en) 2019-12-12

Similar Documents

Publication Publication Date Title
EP3580755A1 (de) Verfahren und vorrichtung zur dynamischen modifizierung des stimmklangs durch frequenzverschiebung der formanten einer spektralen hüllkurve
EP2415047B1 (de) Klassifizieren von in einem Tonsignal enthaltenem Hintergrundrauschen
EP2419900B1 (de) Verfahren und einrichtung zur objektiven evaluierung der sprachqualität eines sprachsignals unter berücksichtigung der klassifikation der in dem signal enthaltenen hintergrundgeräusche
JP2017506767A (ja) 話者辞書に基づく発話モデル化のためのシステムおよび方法
Maruri et al. V-Speech: noise-robust speech capturing glasses using vibration sensors
EP1593116A1 (de) Verfahren zur differenzierten digitalen sprach- und musikbearbeitung, rauschfilterung, erzeugung von spezialeffekten und einrichtung zum ausführen des verfahrens
Braun et al. Effect of noise suppression losses on speech distortion and ASR performance
CN114203163A (zh) 音频信号处理方法及装置
Chenchah et al. A bio-inspired emotion recognition system under real-life conditions
EP1606792A1 (de) Verfahren zur analyse der grundfrequenz, verfahren und vorrichtung zur sprachkonversion unter dessen verwendung
Saleem et al. E2E-V2SResNet: Deep residual convolutional neural networks for end-to-end video driven speech synthesis
Parisae et al. Adaptive attention mechanism for single channel speech enhancement
CN112885318A (zh) 多媒体数据生成方法、装置、电子设备及计算机存储介质
Chen et al. CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application
González-Salazar et al. Enhancing speech recorded from a wearable sensor using a collection of autoencoders
CN116013343A (zh) 语音增强方法、电子设备和存储介质
EP0621582B1 (de) Verfahren zur Spracherkennung mit Lernphase
Xiao et al. Speech intelligibility enhancement by non-parallel speech style conversion using CWT and iMetricGAN based CycleGAN
Weber et al. Constructing a dataset of speech recordings with lombard effect
Tzudir et al. Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients
Bae et al. A neural text-to-speech model utilizing broadcast data mixed with background music
de Souza et al. Multitaper-mel spectrograms for keyword spotting
Bous A neural voice transformation framework for modification of pitch and intensity
Wen et al. Multi-Stage Progressive Audio Bandwidth Extension
US11380345B2 (en) Real-time voice timbre style transform

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20190813

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

RIN1 Information on inventor provided before grant (corrected)

Inventor name: ARIAS, PABLO

Inventor name: AUCOUTURIER, JEAN-JULIEN

Inventor name: ROEBEL, AXEL

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20200820

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

REG Reference to a national code

Ref country code: DE

Ref legal event code: R003

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 20220512