WO2016108722A1 - Method to restore the vocal tract configuration

Method to restore the vocal tract configuration

Info

Publication number
WO2016108722A1
WO2016108722A1 (PCT/RU2015/000198)
Authority
WO
WIPO (PCT)
Prior art keywords
vocal tract
configuration
acoustic characteristics
lengths
speech
Application number
PCT/RU2015/000198
Other languages
French (fr)
Inventor
Ilja Sergeevich MAKAROV
Original Assignee
Obshestvo S Ogranichennoj Otvetstvennostyu "Integrirovannye Biometricheskie Reshenija I Sistemy"
Application filed by Obshestvo S Ogranichennoj Otvetstvennostyu "Integrirovannye Biometricheskie Reshenija I Sistemy"
Publication of WO2016108722A1


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/75: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, for modelling vocal tract parameters

Definitions

  • Fig. 1 shows a diagram of one of the options of the method of the vocal tract configuration restoration.
  • The input signal is a digitized voice record of an arbitrary person speaking any language.
  • Input signal sampling rate should be at least 8,000 Hz and minimum quantization level should be 8 bit/sample.
  • The person can pronounce arbitrary speech material (speech material means separate sounds, combinations of sounds, words, phrases, or texts in the given language; non-speech sounds such as coughing, breathing, chirruping, etc. are not speech material).
  • The digitized record can have any acceptable sound format such as wav, mpeg, mp4, etc. Any voice recording device can be used, e.g. a microphone, dictaphone, telephone, video camera, etc.
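The minimum input format stated above (sampling rate of at least 8,000 Hz, quantization of at least 8 bit/sample) can be checked with a short sketch using Python's standard `wave` module. The in-memory test tone and the helper name `meets_requirements` are illustrative, not part of the patented method:

```python
import io
import math
import struct
import wave

MIN_RATE_HZ = 8000     # minimum sampling rate required by the method
MIN_SAMPLE_BITS = 8    # minimum quantization level

def meets_requirements(framerate: int, sample_bits: int) -> bool:
    """Check that a digitized record satisfies the minimum input format."""
    return framerate >= MIN_RATE_HZ and sample_bits >= MIN_SAMPLE_BITS

# Build a short 8 kHz, 16-bit mono test tone entirely in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 2 bytes = 16 bit/sample
    w.setframerate(8000)
    for n in range(800):       # 0.1 s of a 440 Hz tone
        sample = int(10000 * math.sin(2 * math.pi * 440 * n / 8000))
        w.writeframes(struct.pack("<h", sample))

buf.seek(0)
with wave.open(buf, "rb") as w:
    ok = meets_requirements(w.getframerate(), 8 * w.getsampwidth())
```

The same check applies unchanged to a record read from a `.wav` file on disk.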
  • The record of the English word "seed" pronounced by a male speaker is used here as an example.
  • Fig. 2 shows a chart of acoustic wave of this word (oscillogram).
  • The digitized human voice record is filtered to remove noise and distortions (additive noise, distortions induced by the communication channel, reverberation, etc.).
  • Any noise and distortion reduction algorithms can be used as filtration algorithms (for example, algorithms described in S. Vaseghi, Advanced Digital Signal Processing and Noise Reduction, 2nd ed. John Wiley & Sons, Ltd, 2000).
  • Fig. 3 shows a chart of the speech wave of the word "seed" after filtration of external noise using the spectral subtraction algorithm (see S. Vaseghi, Chapter 11, P. 333-352, Advanced Digital Signal Processing and Noise Reduction, 2nd ed. John Wiley & Sons, Ltd, 2000).
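A minimal numpy sketch of magnitude spectral subtraction in the spirit of the algorithm cited above; the frame length, spectral floor, and the assumption that the first frames of the record are speech-free are illustrative choices, not the published parameters:

```python
import numpy as np

def spectral_subtraction(signal, frame_len=256, noise_frames=5, floor=0.01):
    """Magnitude spectral subtraction: estimate an average noise spectrum
    from the first `noise_frames` frames (assumed speech-free) and subtract
    it from every frame, keeping a small spectral floor."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)
    noise_mag = mag[:noise_frames].mean(axis=0)            # noise estimate
    clean_mag = np.maximum(mag - noise_mag, floor * mag)   # spectral floor
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    return clean.reshape(-1)

rng = np.random.default_rng(0)
t = np.arange(4096) / 8000.0
speech_like = np.sin(2 * np.pi * 300 * t)        # stands in for a vowel
noise = lambda n: 0.3 * rng.standard_normal(n)
signal = np.concatenate([noise(1280), speech_like + noise(4096)])
denoised = spectral_subtraction(signal)

# energy of the noise-only lead-in drops sharply after subtraction
noise_energy_before = float(np.mean(signal[:1280] ** 2))
noise_energy_after = float(np.mean(denoised[:1280] ** 2))
```

Non-overlapping rectangular frames are used here only to keep the sketch short; a practical implementation would use overlapped, windowed frames with overlap-add resynthesis.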
  • The digitized voice record, filtered from noise, is analysed by an automatic speech/non-speech detection algorithm, which determines the boundaries of the beginning and end of all pauses inside the digitized record (a pause means any section of the digitized record where the person is silent).
  • Any speech-nonspeech detection algorithm described in the international literature can be used (for example, algorithm described in Q. Li, J. Zheng, A. Tsai, and Q. Zhou, Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition, IEEE Transactions on Speech and Audio Process., vol. 10, No. 3, 2002, P. 146-157).
  • Fig. 4 shows as an example the result of the application of the pause detection algorithm described in Q. Li, J. Zheng, A. Tsai, and Q. Zhou (op. cit.).
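A simple frame-energy thresholding sketch of speech/non-speech segmentation, producing "pause-speech-pause" boundaries as in Fig. 4. This is a much-simplified stand-in for the endpoint detector cited above; the frame length and threshold are illustrative:

```python
import numpy as np

def detect_speech(signal, frame_len=160, threshold_ratio=0.1):
    """Label each frame 'speech' when its energy exceeds a fraction of the
    maximum frame energy; return (start, end) sample indices of speech runs."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).mean(axis=1)
    active = energy > threshold_ratio * energy.max()
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i * frame_len                 # speech run begins
        elif not a and start is not None:
            segments.append((start, i * frame_len))
            start = None
    if start is not None:                         # speech runs to the end
        segments.append((start, n * frame_len))
    return segments

# pause - speech - pause, as in Fig. 4
sig = np.concatenate([np.zeros(800),
                      np.sin(2 * np.pi * 200 * np.arange(1600) / 8000.0),
                      np.zeros(800)])
segments = detect_speech(sig)
```

For the synthetic signal above, the detector returns a single speech segment spanning the sine-tone region.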
  • The acoustic characteristics of each vowel or vowel-like sound are determined automatically using short-term analysis (see L. Rabiner, R. Schafer, Digital Processing of Speech Signals. Prentice-Hall, Inc. 1976).
  • The acoustic characteristics are calculated in a moving analysis window with a duration of 15 to 40 ms and a step of at least 1 ms.
  • The window can have any shape (in particular, popular windows such as the Hamming window can be used).
  • Resonance frequencies of the vocal tract or any other parameters describing the short-term amplitude-frequency spectrum of the speech signal can be used as acoustic characteristics (for example, Fast Fourier transform (FFT) spectra, linear predictive coding (LPC) coefficients, mel-frequency cepstral coefficients (MFCC), etc.).
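The short-term analysis described above can be sketched as follows. The window and step sizes chosen here (25 ms window, 10 ms step) fall inside the stated ranges, and the per-frame magnitude spectrum stands in for whichever acoustic characteristics are ultimately chosen:

```python
import numpy as np

def short_term_spectra(signal, fs=8000, win_ms=25, step_ms=10):
    """Slice the signal into overlapping Hamming-windowed frames and return
    the magnitude spectrum of each frame (one row per analysis window)."""
    win = int(fs * win_ms / 1000)     # 200 samples at 8 kHz
    step = int(fs * step_ms / 1000)   # 80 samples
    window = np.hamming(win)
    starts = range(0, len(signal) - win + 1, step)
    frames = np.stack([signal[s : s + win] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 8000
t = np.arange(fs) / fs                         # 1 s of signal
vowel_like = np.sin(2 * np.pi * 500 * t)
spectra = short_term_spectra(vowel_like, fs)

# each row is one acoustic-characteristics vector; for this test tone the
# peak bin of every frame should sit near 500 Hz (500 * 200 / 8000 = 12.5)
peak_bins = spectra.argmax(axis=1)
```

Each row of `spectra` corresponds to one analysis window; real features (LPC, MFCC, resonance frequencies) would be derived from these frames in the same framing scheme.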
  • If resonance frequencies of the vocal tract are used as acoustic characteristics, any algorithm for automatic estimation of resonance frequencies described in the international literature can be used for their detection (for example, the method based on linear prediction of speech described in J. Markel, A. Gray, Linear Prediction of Speech. Springer-Verlag. 1976). If parameters describing the short-term amplitude-frequency spectrum of the speech signal are used as acoustic characteristics, they can be computed using any algorithm described in the international sources (for example, see X. Huang, A. Acero, H.-W. Hon, Spoken Language Processing: a Guide to Theory, Algorithm, and System Development. Prentice-Hall, Inc. 2001).
  • Fig. 6 shows an example of the dynamic spectrum (sonogram) of the vowel in the word "seed" with the values of the first three resonance frequencies computed using the algorithm described in J. Markel, A. Gray, Linear Prediction of Speech. Springer-Verlag. 1976. Further, the acoustic characteristics computed in N successive analysis windows will be designated as {a_1, ..., a_N}, where a_i is the set (vector) of acoustic characteristics computed in the i-th time window.
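A sketch of LPC-based resonance-frequency estimation in the spirit of Markel and Gray: fit an all-pole model by the autocorrelation method, then read resonances off the angles of the complex roots of A(z). The model order and the synthetic test signal (white noise through a single resonator) are illustrative:

```python
import numpy as np

def lpc(signal, order):
    """Autocorrelation-method LPC: solve the normal equations R a = r."""
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1:][: order + 1]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1 : order + 1])
    return np.concatenate([[1.0], -a])        # A(z) = 1 - sum a_k z^-k

def resonance_frequencies(signal, fs, order=8):
    """Resonances = angles of complex roots of A(z) in the upper half-plane."""
    roots = np.roots(lpc(signal, order))
    roots = roots[np.imag(roots) > 0.01]
    return np.sort(np.angle(roots) * fs / (2 * np.pi))

# synthesize a signal with one known resonance at 700 Hz (bandwidth ~100 Hz)
fs = 8000
f0, bw = 700.0, 100.0
pole_r = np.exp(-np.pi * bw / fs)
a1 = 2 * pole_r * np.cos(2 * np.pi * f0 / fs)
a2 = -pole_r * pole_r
rng = np.random.default_rng(1)
x = rng.standard_normal(4000)
y = np.zeros_like(x)
for n in range(2, len(x)):                    # two-pole resonator
    y[n] = x[n] + a1 * y[n - 1] + a2 * y[n - 2]

freqs = resonance_frequencies(y, fs, order=4)
```

For real speech, the LPC order is usually tied to the sampling rate (e.g. 8-12 at 8 kHz) and weak or wide-bandwidth roots are filtered out before labelling formants.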
  • The articulatory code book, i.e. a special database containing a large number of pairs (configuration of the vocal tract model, acoustic characteristics corresponding to that configuration), is loaded from ROM. The model is based on the fact that one of the sound generation sources is the voice source produced by the oscillation of the vocal cords; this source participates in the generation of several groups of sounds, and by the participation of this voice source the sounds are divided into vowels and consonants.
  • For the notion of an articulatory code book see, for example, J. Schroeter, M. M. Sondhi, Techniques for estimating vocal tract shapes from the speech signal. IEEE Trans. on Speech and Audio Processing. 1994. Vol. 2. No. 1, Pt. 2. P. 133-150.
  • Each such configuration of the vocal tract from the articulatory code book should be approximated by a sequence of cylinder tubes of different lengths and variable cross-section areas.
  • Any algorithms described in the international literature can be used for such approximation (e.g. algorithm developed in P. Badin, I.S. Makarov, V.N. Sorokin, Algorithm for calculating the cross-section areas of the vocal tract // Acoustical Physics. Vol. 51. No. 1. 2005. P. 38-43).
  • Fig. 7 shows, as an example, a configuration of the vocal tract, the corresponding distribution of the cross-section areas of the approximating cylinder tubes, and the corresponding acoustic spectrum.
  • For each vector of acoustic characteristics a_i calculated for a vowel or vowel-like segment of the digitized speech, the most similar vector of acoustic characteristics a* is selected from the articulatory code book. Any metric can be used as a measure of similarity (e.g. the Euclidean distance). The cross-section area distribution S_0 and lengths l_0 of the cylinder tubes corresponding to a* are used as the first approximation for further calculations.
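The code-book lookup can be sketched as a nearest-neighbour search under the Euclidean metric. All numeric values in this toy code book (formant triples loosely resembling /i/, /a/, /u/, and four-tube area/length vectors) are illustrative, not taken from any real articulatory code book:

```python
import numpy as np

# toy articulatory code book: each entry pairs an acoustic-characteristics
# vector (here: three resonance frequencies, Hz) with the cross-section
# areas (cm^2) and lengths (cm) of the approximating cylinder tubes
code_book = [
    {"acoustic": np.array([270.0, 2290.0, 3010.0]),   # /i/-like entry
     "areas": np.array([4.0, 0.7, 0.5, 3.2]),
     "lengths": np.array([4.4, 4.4, 4.4, 4.4])},
    {"acoustic": np.array([730.0, 1090.0, 2440.0]),   # /a/-like entry
     "areas": np.array([0.6, 1.0, 4.5, 6.0]),
     "lengths": np.array([4.2, 4.2, 4.2, 4.2])},
    {"acoustic": np.array([300.0, 870.0, 2240.0]),    # /u/-like entry
     "areas": np.array([5.0, 3.0, 1.0, 0.8]),
     "lengths": np.array([4.5, 4.5, 4.5, 4.5])},
]

def first_approximation(a_i):
    """Return (S_0, l_0) from the code-book entry whose acoustic vector
    is closest to a_i in the Euclidean metric."""
    best = min(code_book, key=lambda e: np.linalg.norm(e["acoustic"] - a_i))
    return best["areas"], best["lengths"]

measured = np.array([280.0, 2250.0, 3000.0])   # measured formants of a vowel
S0, l0 = first_approximation(measured)
```

For the measured formants above, the /i/-like entry is the nearest, so its areas and lengths become the initial approximation for the iteration.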
  • c is the speed of propagation of sound waves in the vocal tract
  • ρ is the density of air in the vocal tract
  • α and β are coefficients introduced to account for acoustic losses due to viscous friction and heat conductivity in the vocal tract and for the acoustic impedance of the walls of the cylinder tubes (different formulae for these coefficients and specific constant values are provided in M. M. Sondhi, J. Schroeter, A Hybrid Time-Frequency Domain Articulatory Speech Synthesizer // IEEE Trans. Acoust., Speech, and Signal Process. ASSP-35. 1987. P. 955-967; I.S. Makarov, Approximating the vocal tract by conical horns // Acoustical Physics, 2009. Vol. 55. No. 2. P. 261-269).
  • The symbol "+" denotes the operation of computing the pseudo-inverse matrix.
  • Step 3. Using S_Iter and l_Iter and equations (1-2), calculate T_Iter.
  • Step 4. Calculate the measure of similarity d between T_Iter and a_i.
  • Step 5. If d < Thr or Iter > Iter_max, move to Step 6. Otherwise move to Step 1.
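Since equations (1)-(2) are not reproduced in this excerpt, the structure of the iteration (Steps 3-5 above) can only be sketched with a placeholder forward map. Only the loop structure, the discrepancy test against Thr and Iter_max, and the use of the pseudo-inverse ("+") reflect the description; the map `forward` and its Jacobian are invented stand-ins for the acoustic model:

```python
import numpy as np

# hypothetical mixing matrix standing in for the acoustic model of the tubes
M = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.1, 0.6, 1.0]])

def forward(S):
    """Placeholder for equations (1)-(2): areas -> acoustic characteristics."""
    return M @ np.log(S)

def jacobian(S):
    """Derivative of the placeholder map: column j of M divided by S_j."""
    return M / S

def restore_areas(a_target, S0, thr=1e-8, iter_max=50):
    """Gauss-Newton-style loop: update S with the pseudo-inverse ("+") of
    the Jacobian until the discrepancy d falls below the threshold Thr."""
    S = S0.copy()
    for _ in range(iter_max):
        d = np.linalg.norm(forward(S) - a_target)       # Step 4
        if d < thr:                                     # Step 5
            break
        S = S + np.linalg.pinv(jacobian(S)) @ (a_target - forward(S))
        S = np.maximum(S, 1e-3)    # cross-section areas must stay positive
    return S

S_true = np.array([4.0, 1.5, 2.5])
a_target = forward(S_true)                 # "measured" acoustic vector
S_opt = restore_areas(a_target, S0=np.array([2.0, 2.0, 2.0]))
```

With the invertible placeholder map the loop reduces to Newton's method and recovers `S_true`; with the real acoustic model the pseudo-inverse also handles non-square Jacobians (more tubes than acoustic characteristics).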
  • the vocal tract configuration is determined based on the functions of the cross-section areas and lengths of cylinder tubes approximating the vocal tract.
  • {S_opt, l_opt} is recalculated into the respective configuration of the vocal tract.
  • Any algorithm described in the literature can be used for recalculation, e.g., B. Story, On the ability of a physiologically constrained area function model of the vocal tract to produce normal formant patterns under perturbed conditions // J. Acoust. Soc. Amer. 115 (4), April 2004. P. 1760-1770.
  • Fig. 9 shows as an example the original configuration of the vocal tract (corresponding to S_0, dashed line) and the calculated configuration of the vocal tract (corresponding to S_opt, solid line).

Abstract

This invention pertains to the automatic processing of the human voice and can be used in different applications of speech technologies. The technical result of this invention is improved accuracy of articulation restoration and faster data processing during restoration of the vocal tract configuration. The method of restoration of the vocal tract configuration includes the following steps: preliminary processing of the audio signal; determination of the vector of acoustic characteristics for vowels and vowel-like segments; determination of the most similar vectors of acoustic characteristics using an articulatory code book; determination of the functions of areas and lengths of the cylinder tubes from the vectors of acoustic characteristics; determination of the vocal tract configuration based on the functions of the areas and lengths of the cylinder tubes approximating the vocal tract.

Description

METHOD TO RESTORE THE VOCAL TRACT CONFIGURATION
TECHNICAL FIELD
This invention pertains to the automatic processing of the human voice and can be used in different speech technology applications, including the areas related to the following tasks: automatic correction of pronunciation in foreign language training systems or in rehabilitation of various voice and hearing disorders, automatic speech recognition, automatic personal identification and verification based on voice, automatic speech synthesis from arbitrary text, and speech coding in mobile communication and VoIP systems.
BACKGROUND
The problem of automatic restoration of the vocal tract configuration using only an acoustic record of the human voice is called in the specialized literature the speech inverse problem. More precisely, the speech inverse problem is formulated as the problem of finding the shape of the vocal tract, the articulation parameters, the cross-section area function, or the articulation control from measured acoustic parameters of the speech signal.
The prior art includes a known method based on so-called sensitivity functions, which determines the cross-section area function by an iterative procedure minimizing the discrepancy between measured resonance frequencies and the resonance frequencies of the articulatory model (B. Story, Technique for "tuning" vocal tract area functions based on acoustic sensitivity functions // J. Acoust. Soc. Am. 119 (2), February 2006. P. 715-718; S. Adachi, H. Takemoto, T. Kitamura, P. Mokhtari, and K. Honda, Vocal tract length perturbation and its application to male-female vocal tract shape conversion // J. Acoust. Soc. Am. 121 (6), June 2007. P. 3874-3885). This method has the following drawbacks. Firstly, it uses only the resonance frequencies of the vocal tract as acoustic parameters; unfortunately, automatic determination of the resonance frequencies of the tract is a difficult and non-trivial problem, and no generalizations of this method to other acoustic parameters are known to us. Secondly, to launch the iteration process this method uses the same cross-section area function for different sounds; in many cases this leads to a large number of iterations being needed to achieve the required accuracy, which in turn significantly increases processing time.
There is also the known regularization method, which is based on minimization of the discrepancy between parameters of the measured acoustic signal and parameters calculated using mathematical models of articulation and acoustics, together with an additional stabilizing functional (J. Schroeter, M. M. Sondhi, Techniques for estimating vocal tract shapes from the speech signal. IEEE Trans. on Speech and Audio Processing. 1994. Vol. 2. No. 1, Pt. 2. P. 133-150; V. Sorokin, A. Leonov, A. Trushkin, Estimation of stability and accuracy of inverse problem solution for the vocal tract // Speech Communication. Vol. 30. No. 1. 2000. P. 55-74). A special database is used as the initial approximation for minimization: a so-called articulatory code book containing numerous configurations of the vocal tract and the corresponding acoustic parameters (J. Schroeter, M. M. Sondhi, op. cit.). The main drawback of this approach is its very significant computing and, as a consequence, time costs: processing speed is on the order of dozens of seconds or even minutes per one second of speech. The key factors that significantly increase processing time are: 1) the necessity of non-linear minimization with non-linear equality and inequality constraints, and 2) the necessity of launching the non-linear minimization from different initial approximations from the articulatory code book.
CONCEPT OF INVENTION
This invention is aimed at eliminating the drawbacks of the existing solutions. The technical result of this invention is improved accuracy of articulation restoration and faster data processing during restoration of the vocal tract configuration.
The said technical result is achieved through the following means. First, an articulatory code book is used that (unlike in the regularization method) contains not only vocal tract configurations but also the corresponding cross-section area functions; this makes it possible to use different cross-section area functions as initial approximations for different sounds (rather than a single function, as in the sensitivity-function method), which significantly reduces the number of iterations and considerably improves the accuracy of the solution. Second, the algorithm for minimization of the acoustic-parameter discrepancy allows (unlike the standard sensitivity-function method) the use of not only resonance frequencies but any standard voice-technology parameters describing the acoustic spectrum of sounds; unlike the regularization method, it is less computationally intensive and requires less processing time. Third, additional automatic algorithms for pre-processing of the speech signal are used (noise filtration, speech/non-speech detection, identification of the boundaries of vowels and vowel-like sounds, etc.), which ensures that the speech inverse problem is solved in fully automatic mode.
The method of restoration of the vocal tract configuration includes the following steps: preliminary processing of the audio signal; determination of the vector of acoustic characteristics for vowels and vowel-like segments; determination of the most similar vectors of acoustic characteristics using an articulatory code book; determination of the functions of areas and lengths of the cylinder tubes from the vectors of acoustic characteristics; determination of the vocal tract configuration based on the functions of the areas and lengths of the cylinder tubes approximating the vocal tract.
The steps of the vocal tract configuration restoration method can be performed in a cyclic manner. Preliminary processing of the audio signal can include noise filtration, separation of speech segments from pauses, delineation of the boundaries of sounds, and selection of vowels and vowel-like segments.
Any known articulatory code books can be used as code book.
It is also possible to create an own articulatory code book using any of the known methods.
Resonance frequencies of the vocal tract can be used as acoustic characteristics for determination of the functions of the areas and lengths from initial approximations.
Parameters describing short-time amplitude-frequency spectrum of the speech signal can be used as acoustic characteristics for determination of the functions of the areas and lengths based on initial approximations.
For determination of the configuration of the vocal tract any known algorithm of conversion of the functions of the areas and lengths into corresponding vocal tract configuration based on initial approximations can be used.
This invention can be realized in the form of a vocal tract configuration restoration system including: one or more command processing devices, one or more data storage devices, and one or more programs, where the one or more programs are stored in the one or more data storage devices and executed on the one or more processors, and the one or more programs include the following functions: preliminary processing of the audio signal; determination of the vector of acoustic characteristics for vowels and vowel-like segments; determination of the most similar vectors of acoustic characteristics using an articulatory code book; determination of the functions of areas and lengths of the cylinder tubes from the vectors of acoustic characteristics; determination of the vocal tract configuration based on the functions of the areas and lengths of the cylinder tubes approximating the vocal tract. The vocal tract configuration restoration method can be performed in a cyclic manner.
Preliminary processing of the audio signal can include noise filtration, speech/non-speech detection, segmentation of speech into sounds, and selection of vowels and vowel-like segments.
Any known articulatory code books can be used as code book.
It is also possible to create an own articulatory code book using any of the known methods.
Resonance frequencies of the vocal tract can be used as acoustic characteristics for determination of the functions of the areas and lengths from initial approximations.
Parameters describing short-time amplitude-frequency spectrum of the speech signal can be used as acoustic characteristics for determination of the functions of the areas and lengths using initial approximations. For determination of the vocal tract configuration any known algorithm of conversion of the functions of the areas and lengths into corresponding vocal tract configuration based on initial approximations can be used.
BRIEF DESCRIPTION OF DRAWINGS
Fig. 1 - Diagram of one of the options of the method of the vocal tract configuration restoration.
Fig. 2 - Acoustic wave plot of word "seed".
Fig. 3 - Speech wave plot of word "seed" after additive noise filtration.
Fig. 4 - Result of segmentation of the speech wave into "pause-speech-pause" sections. Vertical lines show boundaries between pause and speech.
Fig. 5 - Results of automatic determination of boundaries of vowel sound in word "seed". Start and end of vowel are shown by vertical lines.
Fig. 6 - Plot of dynamic spectrum of vowel in word "seed" with resonance frequency values marked by white asterisks.
Fig. 7 - Configuration of the vocal tract from articulatory code book, corresponding distribution of the area of cross-section and acoustic spectrum.
Fig. 8 - Initial distribution of the areas of cross-sections S_0 (top to bottom) and distribution of the areas S_opt, calculated using the developed algorithm (bottom to top).
Fig. 9 - Initial configuration of the vocal tract (dashed line) and configuration computed by the developed algorithm (solid line). Both configurations correspond to distribution of the cross-section areas shown on Fig. 8.
DETAILED DESCRIPTION OF INVENTION
This invention in its different variants can be implemented as a computer method, in the form of a system or a machine-readable medium containing instructions for using the said method. The invention can be realized as a distributed computer system.
In this invention the system means a computer system, PC (personal computer), CNC (computer numeric control), PLC (programmable logic controller), computerized control systems and any other devices that can perform a defined, clearly determined sequence of operations (actions, instructions). Command processing device means an electronic unit or integrated circuit (microprocessor) that executes machine instructions (programs).
A command processing device reads and executes machine instructions (programs) from one or more data storage devices. Data storage devices include, but are not limited to, hard drives (HDD), flash memory, ROM (read-only memory), solid-state drives (SSD) and optical drives.
Program means a sequence of instructions intended for execution by a computer control device or a command processing device. Some terms used below in the description of the invention are reviewed next.
Articulation is the work of separate articulatory organs in the production of speech sounds. All active pronouncing organs are engaged in the pronunciation of any speech sound. The position of these organs required for the creation of a given sound forms its articulation and determines the separability of sounds and their clearness.
Approximation is a scientific method comprising the substitution of some objects by other objects that are similar to some extent but simpler.
Approximation allows studying the numeric characteristics and qualitative properties of an object by reducing the task to the study of simpler and more convenient objects (e.g. objects whose characteristics are easily calculated or whose properties are already known).
To improve the accuracy of articulation restoration and to reduce the processing time when solving the inverse speech problem, a method of restoration of the vocal tract configuration is proposed that includes the following steps: preliminary processing of the audio signal; determination of the vectors of acoustic characteristics for vowels and vowel-like segments; determination of the most similar vectors of acoustic characteristics using the articulatory code book; determination of the functions of the areas and lengths of the cylinder tubes for the vectors of acoustic characteristics; determination of the vocal tract configuration based on the functions of the areas and lengths of the cylinder tubes approximating the vocal tract.
Fig. 1 shows a diagram of one of the options of the method of the vocal tract configuration restoration.
The input signal is a digitized voice record of a random person speaking any language. The input signal sampling rate should be at least 8,000 Hz and the minimum quantization level should be 8 bit/sample. The person can pronounce random speech material (speech material means separate sounds, combinations of sounds, words, phrases and texts in this language; non-speech sounds like coughing, breathing, chirrup, etc. are not speech material). The digitized record can have any acceptable sound format such as wav, mpeg, mp4, etc. Any sound recording device can be used, e.g. a microphone, voice recorder, telephone, video camera, etc. The record of the English word "seed" pronounced by a male is used here as an example. Fig. 2 shows a chart of the acoustic wave of this word (oscillogram).
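The stated input requirements can be checked programmatically. The following Python sketch (the function name is illustrative, not part of the described method) validates a WAV recording against the stated minima of 8,000 Hz and 8 bit/sample using only the standard library:

```python
import wave

def check_input_format(path):
    """Return True when a WAV recording meets the stated minimum
    requirements: sampling rate >= 8000 Hz and >= 8 bits per sample."""
    with wave.open(path, "rb") as w:
        rate_ok = w.getframerate() >= 8000      # sampling rate in Hz
        depth_ok = w.getsampwidth() * 8 >= 8    # quantization in bits/sample
        return rate_ok and depth_ok
```

Formats other than wav (mpeg, mp4, etc.) would require a decoding library, but the check itself is the same.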
Preliminary processing of audio signal.
At the first stage the digitized human voice record is filtered to remove noise and distortions (additive noise, distortions induced by the communication channel, reverberation, etc.). Any noise and distortion reduction algorithms can be used as filtration algorithms (for example, the algorithms described in S. Vaseghi, Advanced Digital Signal Processing and Noise Reduction, 2nd ed. John Wiley & Sons, Ltd, 2000). Fig. 3 shows a chart of the speech wave of the word "seed" after filtration of external noise using the spectral subtraction algorithm (see S. Vaseghi, Chapter 11, P. 333-352, Advanced Digital Signal Processing and Noise Reduction, 2nd ed. John Wiley & Sons, Ltd, 2000).
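The spectral subtraction idea can be illustrated by a simplified sketch (not Vaseghi's full algorithm; the frame length and the assumption that the leading frames contain only noise are illustrative): the noise magnitude spectrum estimated from the leading frames is subtracted from the magnitude spectrum of every frame, while the phase is kept unchanged.

```python
import numpy as np

def spectral_subtraction(signal, frame_len=256, noise_frames=5):
    """Minimal spectral subtraction: estimate the noise magnitude spectrum
    from the first `noise_frames` frames (assumed non-speech), subtract it
    from each frame's magnitude, floor at zero, and resynthesize with the
    original phase."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    noise_mag = np.abs(spectra[:noise_frames]).mean(axis=0)
    clean_mag = np.maximum(np.abs(spectra) - noise_mag, 0.0)
    clean = np.fft.irfft(clean_mag * np.exp(1j * np.angle(spectra)),
                         n=frame_len, axis=1)
    return clean.reshape(-1)
```

Practical implementations add overlapping windows, over-subtraction factors and a spectral floor to suppress musical noise.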
Then the digitized voice record filtered from noise is analysed by an automatic speech/non-speech detection algorithm, which determines the boundaries of the beginning and end of all pauses inside the digitized record (a pause means any section of the digitized record where the person is silent). Any speech/non-speech detection algorithm described in the international literature can be used (for example, the algorithm described in Q. Li, J. Zheng, A. Tsai, and Q. Zhou, Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition, IEEE Transactions on Speech and Audio Process., vol. 10, No. 3, 2002, P. 146-157). Fig. 4 shows as an example the result of the application of the pause detection algorithm of Q. Li et al. (cited above) to the word "seed"; vertical straight lines show the boundaries separating the pauses from the speech segment. Further processing is performed in the sections which do not correspond to pauses (i.e. only in the speech segments). Start and end boundaries are defined for each speech segment in the automatic mode. Any algorithm of automatic speech segmentation can be used for automatic detection of the boundaries of sounds (for example, Dynamic Time Warping, DTW, described in L. Rabiner, A. Rosenberg, J. Wilpon, and T. Zampini, A bootstrapping training technique for obtaining demisyllable reference patterns, J. Acoust. Soc. Amer. 71 (6), June 1982, P. 1588-1595, or an algorithm based on Hidden Markov Models, HMM, described, for example, in F. Brugnara, D. Falavigna, and M. Omologo, Automatic segmentation and labeling of speech based on Hidden Markov Models, Speech Communication 12 (1993), P. 357-370).
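The pause/speech segmentation step can be illustrated with a toy energy-threshold detector. The robust endpoint detector of Li et al. cited above is far more elaborate; the frame length and threshold ratio below are illustrative assumptions, not values from the method.

```python
import numpy as np

def detect_speech(signal, frame_len=160, threshold_ratio=0.1):
    """Label each frame as speech (True) or pause (False): a frame counts
    as speech when its short-time energy exceeds a fixed fraction of the
    maximum frame energy in the record."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)
    return energy > threshold_ratio * energy.max()
```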
For further analysis the segments of the digitized record are selected, which correspond to vowel and vowel-like sounds (definition of terms "vowel" and "vowel-like sound" is provided, for example, in monograph P. Ladefoged, I. Maddieson, The Sounds of the World's Languages. Wiley-Blackwell. 1996). Fig. 5 shows as an example the boundaries of vowel in the word "seed".
Determination of the vectors of acoustic characteristics for vowels and vowel-like segments.
Acoustic characteristics of each vowel or vowel-like sound are determined in the automatic mode using short-term analysis (see L. Rabiner, R. Schafer, Digital Processing of Speech Signals. Prentice-Hall, Inc. 1976). In the short-term analysis the acoustic characteristics are calculated in a moving time analysis window with a duration from 15 msec to 40 msec and a step of at least 1 msec. The window can have an arbitrary shape (in particular, different popular windows such as the Hamming window can be used). Resonance frequencies of the vocal tract or any other parameters describing the short-term amplitude-frequency spectrum of the speech signal can be used as acoustic characteristics (for example, Fast Fourier transform (FFT) coefficients, linear predictive coding (LPC) coefficients, mel-frequency cepstral coefficients (MFCC), etc.). If resonance frequencies of the vocal tract are used as acoustic characteristics, any algorithm of automatic evaluation of resonance frequencies described in the international literature can be used for their detection (for example, the method based on linear prediction of speech described in J. Markel, A. Gray, Linear Prediction of Speech. Springer-Verlag. 1976). If parameters describing the short-term amplitude-frequency spectrum of the speech signal are used as acoustic characteristics, they can be defined using any algorithm described in the international sources (for example, see X. Huang, A. Acero, H.-W. Hon, Spoken language processing: a guide to theory, algorithm, and system development. Prentice-Hall, Inc. 2001). Fig. 6 shows an example of a chart of the dynamic spectrum (sonogram) of the vowel in the word "seed" with the values of the first three resonance frequencies computed using the algorithms described in J. Markel, A. Gray, Linear Prediction of Speech. Springer-Verlag. 1976. Further, the acoustic characteristics defined in N successive time windows of the analysis will be designated as {a_1, ..., a_N}, where a_i is the set (vector) of acoustic characteristics computed in the i-th time window.
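Resonance-frequency estimation by linear prediction, in the spirit of the Markel & Gray reference, can be sketched for one analysis window as follows: an all-pole model is fitted by the autocorrelation method and the resonances are read off the angles of the complex roots of the prediction polynomial. The function name and the model order are illustrative assumptions.

```python
import numpy as np

def lpc_formants(frame, fs, order=10):
    """Estimate resonance frequencies (Hz) of one analysis frame by LPC:
    Hamming-window the frame, solve the autocorrelation (Yule-Walker)
    normal equations, and convert the angles of the complex poles of the
    prediction polynomial to frequencies."""
    w = frame * np.hamming(len(frame))
    # autocorrelation at lags 0..order
    r = np.correlate(w, w, "full")[len(w) - 1: len(w) + order]
    # normal equations R a = r[1..order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # prediction polynomial A(z) = 1 - sum_k a_k z^-k
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]            # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)   # pole angle -> frequency in Hz
    return np.sort(freqs)
```

Production formant trackers add bandwidth filtering and continuity constraints across successive windows.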
The most similar vectors of acoustic characteristics are determined using the articulatory code book.
For further analysis the articulatory code book is loaded from ROM, i.e. a special database containing a large set of pairs: a configuration of the vocal tract model and the acoustic characteristics corresponding to that configuration. The vocal tract model is based on the fact that one of the sound generation sources is the voice source produced by the oscillation of the vocal cords; this source participates in the generation of several groups of sounds, and in terms of the participation of this voice source the sounds are divided into vowels and consonants. Already existing code books can be used as the articulatory code book (for example, see J. Schroeter, M. Sondhi, Techniques for estimating vocal tract shapes from the speech signal. IEEE Trans. On Speech and Audio Processing. 1994. Vol. 2. No. 1, Pt. 2. P. 133-150). It is also possible to develop a specific articulatory code book using the methods described in the literature (e.g. the method of development of articulatory code books described in the same work of J. Schroeter and M. Sondhi).
For further analysis each such configuration of the vocal tract from the articulatory code book should be approximated by a sequence of cylinder tubes of different lengths and variable cross-section areas. Any algorithms described in the international literature can be used for such approximation (e.g. the algorithm developed in P. Badin, I.S. Makarov, V.N. Sorokin, Algorithm for calculating the cross-section areas of the vocal tract // Acoustical Physics. Vol. 51. No. 1. 2005. P. 38-43). Fig. 7 shows as an example some configuration of the vocal tract, the corresponding distribution of the cross-section areas of the approximating cylinder tubes and the corresponding acoustic spectrum. In the further description the articulatory code book will be designated as {a_k^c, S_k^c, l_k^c}, k = 1, ..., M, where M is the total number of vectors in the articulatory code book, S_k^c is the k-th function of distribution of the cross-section areas of the cylinder tubes in the code book, l_k^c is the k-th function of distribution of the lengths of the cylinder tubes in the code book, and a_k^c is the vector of acoustic characteristics corresponding to these functions of distribution of the cross-section areas and lengths of the cylinder tubes.
For each vector of acoustic characteristics a_i calculated for a vowel or vowel-like segment of the digitized speech, the most similar vector of acoustic characteristics a_k^c is selected from the articulatory code book. Any metric can be used as a measure of similarity (e.g. the Euclidean metric). The functions of distribution of the cross-section areas S_k^c and lengths l_k^c of the cylinder tubes corresponding to a_k^c are used as the first approximations for further calculations.
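With the Euclidean metric, the code book lookup reduces to a nearest-neighbour search. A minimal sketch (the array layout, one acoustic vector per row, is an illustrative assumption):

```python
import numpy as np

def nearest_codebook_entry(a_i, codebook_acoustics):
    """Return the index k of the code book acoustic vector closest to the
    measured vector a_i in the Euclidean metric; the associated area and
    length functions S_k, l_k then serve as the first approximation."""
    d = np.linalg.norm(codebook_acoustics - a_i, axis=1)
    return int(np.argmin(d))
```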
Determination of the functions of the cross-section areas and lengths of cylinder tubes of vectors of acoustic characteristics.
Further calculations significantly depend on the acoustic characteristics calculated from the input digitized acoustic signal. If vocal tract resonance frequencies are used as acoustic characteristics, the automatic algorithm, using S_k^c and l_k^c as the first approximations, iteratively changes the cross-section area and length of each cylinder tube so as to reduce the distance in the Euclidean metric between the measured resonance frequencies and the resonance frequencies calculated from the current distribution of the areas and lengths of the cylinder tubes. Any methods described in the international literature can be used as the algorithm of iterative modification of the areas and lengths (e.g., the algorithms described in B. Story, Technique for "tuning" vocal tract area functions based on acoustic sensitivity functions // J. Acoust. Soc. Am. 119 (2), February 2006. P. 715-718; S. Adachi, H. Takemoto, T. Kitamura, P. Mokhtari, and K. Honda, Vocal tract length perturbation and its application to male-female vocal tract shape conversion // J. Acoust. Soc. Am. 121 (6), June 2007. P. 3874-3885). Any algorithms described in the literature can be used as algorithms of generation of the resonance frequencies of the tract from the current distribution of the cross-section areas and lengths of the cylinder tubes (e.g., the algorithm described in I.S. Makarov, Approximating the vocal tract by conical horns // Acoustical Physics, 2009, vol. 55. No 2. P. 261-269). The result of the algorithm is the distribution of the cross-section areas and lengths of the cylinder tubes {Sopt, lopt} which generates the resonance frequencies least different, in the Euclidean metric, from the resonance frequencies evaluated from the digitized voice signal.
If the parameters describing the short-term amplitude-frequency spectrum of the speech signal are used as acoustic characteristics, the algorithm of determination of {Sopt, lopt} is different. The transfer function of the vocal tract approximated by N cylinder tubes (where S_i and l_i are the cross-section area and the length of the i-th cylinder tube) is determined as (see I.S. Makarov, Approximating the vocal tract by conical horns // Acoustical Physics, 2009, vol. 55. No 2. P. 261-269):

T(j2πf) = 1 / (A(j2πf) + C(j2πf)·Z_L(j2πf)).     (1)

Here j = √(−1), f is the frequency (in Hz), Z_L(j2πf) is the radiation acoustic impedance at the lips, and A(j2πf) and C(j2πf) are calculated using the following matrix relations:

[A(j2πf); C(j2πf)]^Tr = K_N·K_{N−1}·...·K_1·[1; 0]^Tr,
K_i = [cosh(ψ_i l_i), (ρc/S_i)·sinh(ψ_i l_i); (S_i/(ρc))·sinh(ψ_i l_i), cosh(ψ_i l_i)],     (2)

where ψ_i is the complex propagation constant in the i-th tube (equal to j2πf/c in the lossless case). Here c is the speed of propagation of sound waves in the vocal tract, ρ is the density of air in the vocal tract, and σ and γ are coefficients entering ψ_i that are introduced to account for the acoustic losses due to viscous friction and heat conductivity in the vocal tract and for the acoustic impedance of the walls of the cylinder tubes (different formulae for these coefficients and specific constant values are provided in M. M. Sondhi, J. Schroeter, A Hybrid Time-Frequency Domain Articulatory Speech Synthesizer // IEEE Trans. Acoust., Speech, and Signal Process. ASSP-35. 1987. P. 955-967; I.S. Makarov, Approximating the vocal tract by conical horns // Acoustical Physics, 2009, vol. 55. No 2. P. 261-269).

From (1) we have the following formulae for the local derivatives of T with respect to S_i and l_i:

∂T/∂S_i = −(∂A/∂S_i + Z_L·∂C/∂S_i) / (A + C·Z_L)²,
∂T/∂l_i = −(∂A/∂l_i + Z_L·∂C/∂l_i) / (A + C·Z_L)².     (3)

According to (2), the local derivatives of A and C with respect to S_i and l_i are determined as follows:

[∂A/∂S_i; ∂C/∂S_i]^Tr = K_N·...·K_{i+1}·(∂K_i/∂S_i)·K_{i−1}·...·K_1·[1; 0]^Tr,     (4a)

[∂A/∂l_i; ∂C/∂l_i]^Tr = K_N·...·K_{i+1}·(∂K_i/∂l_i)·K_{i−1}·...·K_1·[1; 0]^Tr,     (4b)

where the derivatives ∂K_i/∂S_i and ∂K_i/∂l_i are obtained by direct differentiation of the entries of K_i in (2).
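The chain-matrix evaluation of the tract transfer function can be sketched as follows. This is a simplified lossless variant with an ideal open termination at the lips; the model described above additionally includes the loss coefficients and the radiation impedance Z_L, and the function name and default constants are illustrative assumptions.

```python
import numpy as np

def tract_transfer_function(S, l, freqs, c=35000.0, rho=0.00114):
    """Transfer function of a vocal tract approximated by cylinder tubes,
    chained glottis-to-lips with lossless transmission-line (ABCD)
    matrices.  Areas S are in cm^2, lengths l in cm, c in cm/s, rho in
    g/cm^3.  An ideal open end (zero pressure at the lips) is assumed,
    so T = 1 / D with D the lower-right entry of the chained matrix."""
    T = np.empty(len(freqs), dtype=complex)
    for m, f in enumerate(freqs):
        k = 2.0 * np.pi * f / c                 # wavenumber
        K = np.eye(2, dtype=complex)
        for Si, li in zip(S, l):                # chain the tube sections
            Ki = np.array([
                [np.cos(k * li), 1j * rho * c / Si * np.sin(k * li)],
                [1j * Si / (rho * c) * np.sin(k * li), np.cos(k * li)]])
            K = K @ Ki
        T[m] = 1.0 / K[1, 1]
    return T
```

For a uniform 17.5 cm tube this sketch places the first resonance near 500 Hz, the textbook value for a neutral vocal tract.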
Having relations (1)-(4), we obtain the following algorithm for the calculation of {Sopt, lopt}. Let us introduce the following definitions: f = (f_1, ..., f_V)^Tr is the set of frequencies at which the transfer function is calculated, Tr is the transposition symbol, T = (T(j2πf_1), ..., T(j2πf_V)) is the set of values of the transfer function calculated at the frequencies f, J = [∂T/∂S, ∂T/∂l] is the Jacobi matrix, Iter is the number of the current iteration, Iter_max is the maximum number of iterations of the algorithm, and Thr is the desired value of the similarity between the acoustic characteristics.

Step 0: the algorithm input data are: 1) a_i, the set (vector) of acoustic characteristics calculated in the i-th time window; 2) the functions of distribution of cross-section areas S^c and lengths l^c of the cylinder tubes from the articulatory code book. Assume that Iter = 0, S_Iter = S^c, l_Iter = l^c. The target transfer function T is calculated from a_i using the relations from X. Huang, A. Acero, H.-W. Hon, Spoken language processing: a guide to theory, algorithm, and system development. Prentice-Hall, Inc. 2001. Using S_Iter and l_Iter and equations (1)-(2), calculate T_Iter.

Step 1: Assume that Iter = Iter + 1.

Step 2: Calculate (S_Iter, l_Iter) = (S_{Iter−1}, l_{Iter−1}) + [J_{Iter−1}]^+ · (T − T_{Iter−1}). Here the symbol "+" means the operation of calculation of the pseudo-inverse matrix.

Step 3: Using S_Iter and l_Iter and equations (1)-(2), calculate T_Iter.

Step 4: Calculate the measure of similarity d between T_Iter and T.

Step 5: If d < Thr or Iter > Iter_max, move to Step 6. Otherwise move to Step 1.

Step 6: Assume that Sopt = S_Iter, lopt = l_Iter. Fig. 8 shows as an example S_0 (top) and S_opt (bottom), calculated using the described algorithm.
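Steps 0-6 amount to a Gauss-Newton iteration with a pseudo-inverse update. The following generic sketch abstracts the forward model (the transfer-function calculation) as a callable and estimates the Jacobian numerically — an illustrative simplification of the analytic derivatives above; the function name and tolerances are assumptions.

```python
import numpy as np

def fit_tube_parameters(forward, x0, target, thr=1e-6, iter_max=50, eps=1e-6):
    """Starting from the code book first approximation x0 (concatenated
    areas and lengths), repeatedly update the parameters with the
    Moore-Penrose pseudo-inverse of a forward-difference Jacobian until
    the model output matches the target acoustic characteristics within
    thr, or iter_max iterations are exhausted."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iter_max):
        y = forward(x)
        if np.linalg.norm(target - y) < thr:
            break                                 # similarity threshold reached
        J = np.empty((len(y), len(x)))
        for i in range(len(x)):                   # forward-difference Jacobian
            dx = x.copy()
            dx[i] += eps
            J[:, i] = (forward(dx) - y) / eps
        x = x + np.linalg.pinv(J) @ (target - y)  # pseudo-inverse update (Step 2)
    return x
```

For a linear forward model the pseudo-inverse step converges in a single iteration; for the nonlinear tube model the loop runs until d < Thr or Iter_max, as in Step 5.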
The vocal tract configuration is determined based on the functions of the cross-section areas and lengths of cylinder tubes approximating the vocal tract.
{Sopt, lopt} is recalculated into the respective configuration of the vocal tract. Any algorithm described in the literature can be used for this recalculation, e.g., B. Story, On the ability of a physiologically constrained area function model of the vocal tract to produce normal formant patterns under perturbed conditions // J. Acoust. Soc. Amer. 115 (4), April 2004. P. 1760-1770. Fig. 9 shows as an example the original configuration of the vocal tract (corresponding to S_0, dashed line) and the calculated configuration of the vocal tract (corresponding to Sopt, solid line).
It is evident for a specialist in this field that the specific options of implementing the method and system of vocal tract configuration restoration were described here for illustrative purposes; different modifications are acceptable within the framework, concept and scope of the invention.

Claims

1. Method of restoration of the vocal tract configuration characterized by the following:
• Preliminary processing of audio signal.
• Determination of the vectors of acoustic characteristics for vowels and vowel-like segments.
• Determination of the most similar vectors of acoustic characteristics using the articulatory code book.
• Determination of the functions of the cross-section areas and lengths of cylinder tubes of vectors of acoustic characteristics.
• Determination of the configuration of the vocal tract based on the functions of the cross-section areas and lengths of cylinder tubes approximating the vocal tract.
2. Method as per item 1 characterized by the fact that the preliminary processing of the audio signal includes noise filtering, and/or separation of the segments of speech from pauses, and/or determination of the boundaries of the sounds, and/or selection of vowels and vowel-like sounds.
3. Method as per item 1 characterized by the fact that configuration of the vocal tract is restored in cyclic manner.
4. Method as per item 1 characterized by the fact that any known articulatory code book is used as a code book.
5. Method as per item 1 characterized by the fact that an own articulatory code book is developed using any of the known methods.
6. Method as per item 1 characterized by the fact that, when the functions of the cross-section areas and lengths are calculated using the first approximations, the resonance frequencies of the vocal tract are used as acoustic characteristics.
7. Method as per item 1 characterized by the fact that, when the functions of the cross-section areas and lengths are calculated using the first approximations, the parameters describing the short-term amplitude-frequency spectrum of the speech signal are used as acoustic characteristics.
8. Method as per item 1 characterized by the fact that any known algorithm of conversion of the functions of the cross-section areas and lengths based on the first approximations into respective configuration of the vocal tract is used for determination of the configuration of the vocal tract.
9. The system of restoration of the vocal tract configuration contains:
• at least one command processing device;
• at least one data storage device;
• one or more computer programs loaded into at least one of the said data storage devices and executed in at least one of the said command processing devices, while one or more computer programs contain instructions for the usage of the method described in item 1.
10. Machine-readable media containing machine-readable instructions executable by one or more processors, which during their execution implement the method of the vocal tract configuration restoration as described in any of the items 1-8.
PCT/RU2015/000198 2014-12-30 2015-03-30 Method to restore the vocal tract configuration WO2016108722A1 (en)

Priority application: RU2014154164, filed 2014-12-30. Published as WO2016108722A1 on 2016-07-07.
