WO2021149213A1 - Learning device, speech enhancement device, methods therefor, and program - Google Patents

Learning device, speech enhancement device, methods therefor, and program

Info

Publication number
WO2021149213A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
quality evaluation
subjective
mask
function
Prior art date
Application number
PCT/JP2020/002270
Other languages
English (en)
Japanese (ja)
Inventor
Yuma Koizumi (小泉 悠馬)
Original Assignee
Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority to PCT/JP2020/002270
Publication of WO2021149213A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 — Machine learning
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 — Noise filtering
    • G10L 21/0216 — Noise filtering characterised by the method used for estimating noise

Definitions

  • the present invention relates to a speech enhancement technique.
  • T is an integer greater than or equal to 1.
  • the purpose of speech enhancement is to estimate s from x with high accuracy.
  • the observation signal X = Q(x) ∈ C^{F×K} is obtained by expressing the time-domain observation signal x in the time-frequency domain by a frequency-domain conversion process Q such as the short-time Fourier transform.
  • T, F, and K are positive integers: T represents the number of samples of the observation signal x belonging to a predetermined time interval (time length), F represents the number of discrete frequencies (bandwidth) belonging to a predetermined band in the time-frequency domain, and K represents the number of discrete times (time length) belonging to a predetermined time interval in the time-frequency domain.
  • M(x; θ) ⊙ Q(x) represents multiplying Q(x) by the TF mask M(x; θ) element by element.
  • ⊙ is the Hadamard product (element-wise product).
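  • As a rough illustration only (not part of the patent), the masking operation of equation (1) can be sketched in PyTorch as follows; `mask_dnn`, the STFT settings, and all names here are assumptions, with Q taken to be an STFT and Q+ its inverse.

```python
import torch

def enhance(x, mask_dnn, n_fft=512, hop=128):
    """Sketch of y = Q+(M(x; theta) (Hadamard) Q(x)) with Q = STFT."""
    win = torch.hann_window(n_fft)
    # Q: frequency-domain conversion, X = Q(x) in C^{F x K}
    X = torch.stft(x, n_fft=n_fft, hop_length=hop, window=win,
                   return_complex=True)
    # M(x; theta): TF mask estimated by a DNN from the magnitude spectrogram
    M = mask_dnn(X.abs())              # hypothetical module, output in [0, 1]
    # Hadamard product, then Q+: time-domain conversion (inverse STFT)
    y = torch.istft(M * X, n_fft=n_fft, hop_length=hop, window=win,
                    length=x.shape[-1])
    return y
```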
  • θ is a parameter of the DNN, and is usually learned so as to minimize the signal-to-distortion ratio (SDR) cost L_SDR represented by, for example, the following equation (2).
  • One reason equation (2) is widely used as a cost function for DNN speech enhancement is that L_SDR is differentiable with respect to θ.
  • DNN learning is performed by a gradient method using gradients obtained by the error back-propagation method.
  • ∇_θ is a differential operator with respect to θ. Since ∇_θ L_SDR can be calculated analytically, DNN learning can be performed efficiently.
  • is a positive constant.
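  • Equation (2) itself is not reproduced above; as a hedged sketch, one common differentiable SDR-style cost takes the form below, with `eps` playing the role of the positive constant just mentioned. This is an assumed variant, not necessarily the patent's exact formula.

```python
import torch

def sdr_loss(s, y, eps=1e-8):
    # negative signal-to-distortion ratio, averaged over the mini-batch;
    # differentiable with respect to y (and hence theta), as the text notes
    num = (s ** 2).sum(dim=-1)
    den = ((s - y) ** 2).sum(dim=-1) + eps
    return -10.0 * torch.log10(num / den + eps).mean()
```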
  • in Non-Patent Document 3, a method of approximating P(s, y) with a differentiable function D_φ(s, y) having a parameter φ has been proposed.
  • D ⁇ (s, y) is designed by DNN
  • D ⁇ (s, y) is differentiable with respect to ⁇
  • ⁇ ⁇ D ⁇ (s, y) can be calculated analytically.
  • L_M^(GAN) = (D_φ(s, y) − 1)^2 (7)
  • The problem with the prior art described in Non-Patent Document 3 is the stability of learning. In order for learning to proceed so that the OSQAS of the test data improves stably, the approximation of equation (5) needs to be highly accurate. However, in this prior art, the OSQAS does not improve stably even if the number of learning iterations is increased. Therefore, there is still room for improvement in this prior art.
  • The present invention has been made in view of these points, and its purpose is to approximate, with high accuracy, an objective index that imitates the subjective evaluation of human sound quality by a differentiable function, and to stabilize the learning of that differentiable function.
  • In one aspect of the present invention, a second approximation function is obtained by updating a first approximation function, which approximates an objective index imitating the subjective sound quality evaluation of an input, so as to minimize a first cost function based on the sum of: the error between the subjective sound quality evaluation of the target sound and an objective index imitating the subjective sound quality evaluation of the target sound, obtained by inputting a target audio signal representing the target sound into the first approximation function; the error between the subjective sound quality evaluation of the observed sound based on the target sound and an objective index imitating the subjective sound quality evaluation of the observed sound, obtained by inputting an observation signal representing the observed sound into the first approximation function; and the error between the subjective sound quality evaluation of the emphasized sound corresponding to the after-masked sound signal obtained by applying a first mask to the observation signal and an objective index imitating the subjective sound quality evaluation of the emphasized sound, obtained by inputting an emphasized audio signal representing the emphasized sound into the first approximation function.
  • According to the present invention, an objective index that imitates the subjective evaluation of human sound quality can be approximated by a differentiable function with high accuracy, and the learning of the differentiable function can be stabilized.
  • FIG. 1 is a block diagram illustrating the functional configuration of the learning device of the embodiment.
  • FIG. 2 is a diagram for explaining the learning method of the embodiment.
  • FIG. 3 is a diagram for explaining the learning method of the embodiment.
  • FIG. 4 is a block diagram for explaining the functional configuration of the speech enhancement device of the embodiment.
  • FIG. 5 is a diagram for explaining the speech enhancement method of the embodiment. FIG. 6 is a diagram for intuitively exemplifying the learning result of the differentiable function that approximates OSQAS. FIG. 7 is a diagram for explaining the learning results of the embodiment.
  • FIG. 8 is a block diagram for explaining a hardware configuration.
  • the present embodiment provides a method for improving the accuracy of the approximation of equation (5) and stabilizing the learning of the differentiable function D ⁇ (s, ⁇ ).
  • In the present embodiment, a cost function L_D is used instead of the cost function L_D^(GAN). FIG. 6(a) intuitively illustrates D_φ(s, ·) learned so as to minimize the cost function L_D^(GAN), and FIG. 6(b) intuitively illustrates D_φ(s, ·) learned so as to minimize the cost function L_D.
  • In FIG. 6, the solid line represents the true OSQAS, and the dotted and broken lines represent the learned D_φ(s, ·). The horizontal axis represents the amount of noise contained in the input observation sound, and the vertical axis represents the PESQ score.
  • In the prior art, D_φ(s, ·) is learned so as to minimize the GAN cost function L_D^(GAN), which is based on (i) the error ε_φ(y) with respect to the OSQAS (P(s, y)) and (ii) the error ε_φ(s) with respect to the OSQAS (P(s, s)) in the case of perfect speech enhancement. An error in OSQAS when speech enhancement fails is not taken into consideration. Therefore, D_φ(s, ·) is not necessarily learned accurately in the region where speech enhancement fails, which destabilizes learning.
  • In the present embodiment, the cost value of the following cost function L_D is calculated using a mini-batch of size M, and φ is learned so as to minimize it: L_D = (1/M) Σ_{j=1}^{M} [ε_φ(s^(j)) + ε_φ(x^(j)) + ε_φ(y^(j))] (8), where ε_φ(·) denotes the error between the true OSQAS P(s^(j), ·) and the approximation D_φ(s^(j), ·).
  • M is a positive integer.
  • j = 1, ..., M.
  • s^(j) is the j-th target audio signal.
  • x^(j) is the j-th observation signal.
  • y^(j) is the j-th emphasized audio signal.
  • s^(j), x^(j), y^(j), and n^(j) are each time-series signals of T samples in the time domain; the target audio signal s^(j) represents the target sound, the observation signal x^(j) represents the observed sound, the emphasized audio signal y^(j) represents the emphasized sound, and the noise signal n^(j) represents noise. That is, the parameter φ of the first approximation function D_φ(s^(j), ·), which approximates an objective index imitating the subjective sound quality evaluation P(s^(j), ·) of an input, is learned using the values D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) obtained by inputting the target audio signal s^(j), the observation signal x^(j), and the emphasized audio signal y^(j) into the first approximation function.
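  • As a minimal sketch of computing the first cost function L_D (assuming, as one plausible choice, squared errors between the true OSQAS P and its approximation D_φ; `P` and `D_phi` are hypothetical callables):

```python
def cost_L_D(P, D_phi, s, x, y):
    # errors at three operating points: perfect enhancement (s),
    # no enhancement (x), and the current enhanced output (y)
    e_s = (P(s, s) - D_phi(s, s)) ** 2   # epsilon_phi(s)
    e_x = (P(s, x) - D_phi(s, x)) ** 2   # epsilon_phi(x)
    e_y = (P(s, y) - D_phi(s, y)) ** 2   # epsilon_phi(y)
    return (e_s + e_x + e_y).mean()      # mini-batch average, cf. equation (8)
```

Since P is typically non-differentiable (e.g., PESQ), its values would be precomputed; gradients flow only through D_phi.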
  • the learning device 11 of the present embodiment includes storage units 111 and 112, a mask estimation application unit 113a, mask application units 113b and 113c, model application units 114a to 114c, approximation function application units 115a to 115c, gradient calculation units 116a to 116h, parameter update units 117a to 117d, a memory 118, and a control unit 119. The learning device 11 executes each process under the control of the control unit 119, and the data obtained in each process is stored in the memory 118, read out as needed, and used for other processes.
  • the speech enhancement device 12 of the present embodiment includes a model storage unit 120, an input unit 121, a frequency domain conversion unit 122, a mask estimation unit 123, a mask application unit 124, a time domain conversion unit 125, an output unit 126, a control unit 127, and a memory 128.
  • the speech enhancement device 12 executes each process under the control of the control unit 127, and the data obtained in each process is stored in the memory 128, read out as needed, and used for other processes.
  • learning data consisting of the target audio signal s^(j) and the corresponding observation signal x^(j) is prepared.
  • j = 1, ..., M
  • M is an integer of 1 or more.
  • the target audio signals s^(1), ..., s^(M) are stored in the storage unit 111, and the observation signals x^(1), ..., x^(M) are stored in the storage unit 112. Under this premise, the following steps 1, 2, and 3 are executed.
  • <<Step 1: Pre-learning of the model M_θ>>
  • the model M ⁇ is pre-learned.
  • the control unit 119 sets the parameter θ of the model M_θ to an initial value (step S119aa).
  • the model application unit 114a extracts the observation signal x^(i) from the storage unit 112, applies the model M_θ to the observation signal x^(i), and obtains and outputs the mask M(x^(i); θ) (step S114a).
  • the mask M(x^(i); θ) is input to the gradient calculation unit 116a.
  • the gradient calculation unit 116a extracts the target audio signal s^(i) from the storage unit 111, and obtains and outputs the gradient ∇_θ L_SDR^(i) (step S116a).
  • the gradients ∇_θ L_SDR^(1), ..., ∇_θ L_SDR^(N) are input to the gradient calculation unit 116b.
  • the gradient calculation unit 116b obtains and outputs the gradient ∇_θ L_SDR using the gradients ∇_θ L_SDR^(1), ..., ∇_θ L_SDR^(N) (step S116b).
  • the gradient ∇_θ L_SDR is input to the parameter update unit 117a. The parameter update unit 117a updates the parameter θ by the gradient method using the gradient ∇_θ L_SDR (step S117a).
  • the control unit 119 determines whether or not the convergence condition is satisfied.
  • Examples of the convergence condition include: the processing of steps S114a, S116a, S119b, S116b, and S117a having been repeated a predetermined number of times; and the amounts of change in θ and L_SDR being equal to or less than predetermined values. If the convergence condition is not satisfied, the process returns to step S119ab. If the convergence condition is satisfied, the parameter update unit 117a outputs the parameter θ and the process of step 1 ends (step S119c).
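  • A minimal training-loop sketch of step 1, assuming the `enhance` and `sdr_loss` sketches above and a hypothetical DataLoader `loader` yielding (target, observation) mini-batches:

```python
import torch

opt_theta = torch.optim.Adam(mask_dnn.parameters(), lr=1e-4)
for epoch in range(num_epochs):                # convergence check simplified
    for s_batch, x_batch in loader:
        y_batch = enhance(x_batch, mask_dnn)   # mask estimation + application
        loss = sdr_loss(s_batch, y_batch)      # cost based on equation (2)
        opt_theta.zero_grad()
        loss.backward()                        # error back-propagation
        opt_theta.step()                       # gradient update of theta
```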
  • <<Step 2: Pre-learning of the approximation function D_φ(s, ·)>>
  • the approximation function D ⁇ (s, ⁇ ) is pre-learned.
  • the process of step 2 will be described with reference to FIG.
  • the control unit 119 sets the parameter φ of the approximation function D_φ(s, ·) to an initial value (step S119da).
  • the parameter ⁇ obtained in step 1 is input to the mask estimation application unit 113a.
  • the mask estimation application unit 113a extracts the observation signal x^(j) from the storage unit 112 and applies the model M_θ to the observation signal x^(j) to obtain the mask M(x^(j); θ). Further, the mask estimation application unit 113a applies the mask M(x^(j); θ) to the observation signal x^(j), and obtains and outputs the emphasized audio signal y^(j) as in the following equation (1b) (step S113a).
  • y^(j) = Q+(M(x^(j); θ) ⊙ Q(x^(j))) (1b)
  • the observed signal x (j) and the emphasized audio signal y (j) are input to the approximation function application unit 115a.
  • the approximation function application unit 115a further extracts the target audio signal s (j) from the storage unit 111.
  • the approximation function application unit 115a inputs s^(j), x^(j), and y^(j) into the approximation function D_φ(s, ·), and obtains and outputs D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) (step S115a).
  • D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) are input to the gradient calculation unit 116c. Further, the separately calculated P(s^(j), s^(j)), P(s^(j), x^(j)), and P(s^(j), y^(j)) are input to the gradient calculation unit 116c. Next, the gradient calculation unit 116c obtains the error ε_φ(s^(j)) between P(s^(j), s^(j)) and D_φ(s^(j), s^(j)), the error ε_φ(x^(j)) between P(s^(j), x^(j)) and D_φ(s^(j), x^(j)), and the error ε_φ(y^(j)) between P(s^(j), y^(j)) and D_φ(s^(j), y^(j)), and obtains and outputs the gradient ∇_φ L_D^(j) (step S116c).
  • the gradients ∇_φ L_D^(1), ..., ∇_φ L_D^(M) are input to the gradient calculation unit 116d.
  • the gradient calculation unit 116d obtains and outputs the gradient ∇_φ L_D using the gradients ∇_φ L_D^(1), ..., ∇_φ L_D^(M) (step S116d).
  • the gradient ∇_φ L_D is input to the parameter update unit 117b.
  • the parameter update unit 117b updates the parameter φ by the gradient method using the gradient ∇_φ L_D. That is, the parameter update unit 117b updates and outputs the parameter φ so as to minimize L_D of equation (8) (step S117b).
  • the control unit 119 determines whether or not the convergence condition is satisfied.
  • Examples of the convergence condition include: the processing of steps S115a, S116c, S119e, S116d, and S117b having been repeated a predetermined number of times; and the amounts of change in φ and L_D being equal to or less than predetermined values. If the convergence condition is not satisfied, the process returns to step S119db. If the convergence condition is satisfied, the parameter update unit 117b outputs the parameter φ and the process of step 2 ends (step S119f).
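  • A corresponding sketch of step 2, in which θ is frozen and only φ is updated to minimize L_D; `d_phi` (the DNN realizing D_φ) and `pesq_score` (the true OSQAS P) are hypothetical names:

```python
opt_phi = torch.optim.Adam(d_phi.parameters(), lr=1e-4)
for s_batch, x_batch in loader:
    with torch.no_grad():
        y_batch = enhance(x_batch, mask_dnn)   # equation (1b), theta fixed
    loss = cost_L_D(pesq_score, d_phi, s_batch, x_batch, y_batch)
    opt_phi.zero_grad()
    loss.backward()
    opt_phi.step()                             # minimize L_D of equation (8)
```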
  • That is, step 2 is an approximation function learning step of updating the first approximation function D_φ(s^(j), ·), which approximates an objective index imitating the subjective sound quality evaluation P(s^(j), ·) of an input, so as to minimize the first cost function L_D based on the sum of: the error ε_φ(s^(j)) between the subjective sound quality evaluation P(s^(j), s^(j)) of the target sound and the objective index D_φ(s^(j), s^(j)) imitating the subjective sound quality evaluation of the target sound, obtained by inputting the target audio signal s^(j) representing the target sound into the first approximation function; the error ε_φ(x^(j)) between the subjective sound quality evaluation P(s^(j), x^(j)) of the observed sound based on the target sound and the objective index D_φ(s^(j), x^(j)) imitating the subjective sound quality evaluation of the observed sound, obtained by inputting the observation signal x^(j) representing the observed sound into the first approximation function; and the error ε_φ(y^(j)) between the subjective sound quality evaluation P(s^(j), y^(j)) of the emphasized sound corresponding to the after-masked sound signal obtained by applying the first mask M(x^(j); θ) to the observation signal x^(j) and the objective index D_φ(s^(j), y^(j)) imitating the subjective sound quality evaluation of the emphasized sound, obtained by inputting the emphasized audio signal y^(j) into the first approximation function. Here, the first mask M(x^(j); θ) is obtained by applying the first model M_θ to the observation signal x^(j). In this approximation function learning step, the first approximation function D_φ(s^(j), ·) is updated without updating the first model M_θ so as to minimize the first cost function L_D, whereby the second approximation function is obtained.
  • <<Step 3: Learning of the model M_θ and the approximation function D_φ(s, ·)>> The control unit 119 sets the parameters θ and φ obtained in steps 1 and 2 as the initial values (step S119ga).
  • the model application unit 114b extracts the observation signal x^(i) from the storage unit 112, applies the model M_θ to the observation signal x^(i), and obtains and outputs the mask M(x^(i); θ) (step S114b).
  • the mask M(x^(i); θ) is input to the mask application unit 113b.
  • Q(x^(i)) is further input to the mask application unit 113b, and the mask application unit 113b obtains and outputs the emphasized audio signal y^(i) according to equation (1a) (step S113b).
  • the emphasized audio signal y (i) is input to the approximation function application unit 115b.
  • the approximation function application unit 115b further extracts the target audio signal s (i) from the storage unit 111.
  • the approximation function application unit 115b inputs s^(i) and y^(i) into the approximation function D_φ(s, ·), and obtains and outputs D_φ(s^(i), y^(i)) (step S115b).
  • D ⁇ (s (i) , y (i) ) is input to the gradient calculation unit 116e.
  • the gradient calculation unit 116e obtains and outputs the gradient ∇_θ L_M^(i).
  • L_M^(i) satisfies the following equation (4b) (step S116e).
  • L_M^(i) = −E[D_φ(s^(i), y^(i))]_{x,y} (4b)
  • the gradients ∇_θ L_M^(1), ..., ∇_θ L_M^(N) are input to the gradient calculation unit 116f.
  • the gradient calculation unit 116f obtains and outputs the gradient ∇_θ L_M using the gradients ∇_θ L_M^(1), ..., ∇_θ L_M^(N) (step S116f).
  • the gradient ∇_θ L_M is input to the parameter update unit 117c.
  • the parameter update unit 117c updates the parameter θ by the gradient method using the gradient ∇_θ L_M. The updated parameter θ is input to the model application units 114b and 114c (step S117c).
  • the model application unit 114c extracts the observation signal x^(j) from the storage unit 112, applies the model M_θ to the observation signal x^(j), and obtains and outputs the mask M(x^(j); θ) (step S114c).
  • the mask M (x (j) ; ⁇ ) is input to the mask application unit 113c.
  • Q(x^(j)) is further input to the mask application unit 113c, and the mask application unit 113c obtains and outputs the emphasized audio signal y^(j) according to the following equation (1a') (step S113c).
  • y^(j) = Q+(M(x^(j); θ) ⊙ Q(x^(j))) (1a')
  • the emphasized audio signal y^(j) is input to the approximation function application unit 115c. Further, the approximation function application unit 115c extracts the target audio signal s^(j) from the storage unit 111 and the observation signal x^(j) from the storage unit 112. The approximation function application unit 115c inputs s^(j), x^(j), and y^(j) into the approximation function D_φ(s, ·), and obtains and outputs D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) (step S115c).
  • D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) are input to the gradient calculation unit 116g. Further, the separately calculated P(s^(j), s^(j)), P(s^(j), x^(j)), and P(s^(j), y^(j)) are input to the gradient calculation unit 116g. Next, the gradient calculation unit 116g obtains the error ε_φ(s^(j)) between P(s^(j), s^(j)) and D_φ(s^(j), s^(j)), the error ε_φ(x^(j)) between P(s^(j), x^(j)) and D_φ(s^(j), x^(j)), and the error ε_φ(y^(j)) between P(s^(j), y^(j)) and D_φ(s^(j), y^(j)), and obtains and outputs the gradient ∇_φ L_D^(j) (step S116g).
  • the gradients ∇_φ L_D^(1), ..., ∇_φ L_D^(M) are input to the gradient calculation unit 116h. The gradient calculation unit 116h obtains and outputs the gradient ∇_φ L_D using these gradients (step S116h).
  • the gradient ∇_φ L_D is input to the parameter update unit 117d.
  • the parameter update unit 117d updates the parameter φ by the gradient method using the gradient ∇_φ L_D. That is, the parameter update unit 117d updates and outputs the parameter φ so as to minimize L_D of equation (8).
  • the updated parameter φ is input to the approximation function application units 115b and 115c (step S117d).
  • the control unit 119 determines whether or not the convergence condition is satisfied.
  • Examples of the convergence condition include: the processing of steps S114b, S113b, S115b, S116e, S119h, S116f, S117c, S114c, S113c, S115c, S116g, S119e, S116h, and S117d having been repeated a predetermined number of times; and the amounts of change in θ, φ, L_M, and L_D being equal to or less than predetermined values. If the convergence condition is satisfied, the parameter update unit 117c outputs the parameter θ, the parameter update unit 117d outputs the parameter φ, and the process of step 3 ends (step S119j).
  • That is, the processing of steps S114c, S113c, S115c, S116g, S119e, S116h, and S117d is an approximation function learning step similar to step 2: the first approximation function D_φ(s^(j), ·), which approximates an objective index imitating the subjective sound quality evaluation P(s^(j), ·) of an input, is updated so as to minimize the first cost function L_D without updating the first model M_θ, where the first mask M(x^(j); θ) is obtained by applying the first model M_θ to the observation signal x^(j). Meanwhile, the processing of steps S114b, S113b, S115b, S116e, S119h, S116f, and S117c is a model learning step in which the first model M_θ is updated so as to minimize the second cost function L_M of equation (4b) without updating the approximation function, whereby the second model M_θ is obtained.
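  • The alternation of step 3 can be sketched as follows: one update of θ minimizing L_M (equation (4b)), then one update of φ minimizing L_D (equation (8)) with θ frozen. A simplified single-batch sketch using the hypothetical names introduced above:

```python
for step in range(num_steps):
    # model learning step: raise the predicted OSQAS of the enhanced sound
    y = enhance(x, mask_dnn)
    loss_m = -d_phi(s, y).mean()           # L_M = -E[D_phi(s, y)], eq. (4b)
    opt_theta.zero_grad(); loss_m.backward(); opt_theta.step()

    # approximation function learning step: re-fit D_phi around the new outputs
    with torch.no_grad():
        y = enhance(x, mask_dnn)           # equation (1a'), theta fixed
    loss_d = cost_L_D(pesq_score, d_phi, s, x, y)
    opt_phi.zero_grad(); loss_d.backward(); opt_phi.step()
```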
  • ⁇ Speech enhancement processing> The information for identifying the model M ⁇ and the approximate function D ⁇ (s, ⁇ ) learned as described above is stored in the model storage unit 120 of the speech enhancement device 12 (FIG. 4). For example, the parameters ⁇ and ⁇ output in step S119j are stored in the model storage unit 120. Under this premise, the following speech enhancement processing is executed.
  • an observation signal x, which is a time-series acoustic signal in the time domain, is input to the input unit 121 of the speech enhancement device 12 (FIG. 4) (step S121).
  • the observation signal x is input to the frequency domain conversion unit 122. The frequency domain conversion unit 122 applies the frequency-domain conversion process Q to the observation signal x, and obtains and outputs the observation signal X = Q(x) in the time-frequency domain (step S122).
  • the observation signal x is input to the mask estimation unit 123.
  • the mask estimation unit 123 applies the model M ⁇ to the observation signal x to estimate and output the TF mask M (x; ⁇ ) (step S123).
  • the observation signal X and the TF mask M (x; ⁇ ) are input to the mask application unit 124.
  • the mask application unit 124 applies (multiplies) the TF mask M(x; θ) to the observation signal X in the time-frequency domain, and obtains and outputs the masked audio signal M(x; θ) ⊙ X (step S124).
  • the audio signal M (x; ⁇ ) ⁇ X is input to the time domain conversion unit 125.
  • the time domain conversion unit 125 applies a time-domain conversion process Q+ such as an inverse short-time Fourier transform to the masked audio signal M(x; θ) ⊙ X, and obtains and outputs the time-domain emphasized voice y (equation (1)) (step S125).
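  • A minimal end-to-end inference sketch for the speech enhancement device; the file names and the trained `mask_dnn` are assumptions:

```python
import torch
import torchaudio

x, sr = torchaudio.load("observed.wav")    # observation signal x (step S121)
with torch.no_grad():
    y = enhance(x, mask_dnn)               # steps S122-S125: Q, mask, apply, Q+
torchaudio.save("enhanced.wav", y, sr)     # output of the emphasized voice y
```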
  • the learning device 11 and the speech enhancement device 12 in each embodiment are, for example, devices configured by a general-purpose or dedicated computer, equipped with a processor (hardware processor) such as a CPU (central processing unit) and memories such as a RAM (random-access memory) and a ROM (read-only memory), executing a predetermined program.
  • This computer may have one processor and memory, or may have a plurality of processors and memory.
  • This program may be installed in a computer or may be recorded in a ROM or the like in advance.
  • a part or all of the processing units may be configured using an electronic circuit that realizes the processing functions by itself, instead of an electronic circuit (circuitry) that realizes the functional configuration by reading a program, like a CPU.
  • the electronic circuit constituting one device may include a plurality of CPUs.
  • FIG. 8 is a block diagram illustrating the hardware configurations of the learning device 11 and the speech enhancement device 12 in each embodiment.
  • the learning device 11 and the speech enhancement device 12 of this example include a CPU (Central Processing Unit) 10a, an output unit 10b, an output unit 10c, a RAM (Random Access Memory) 10d, a ROM (Read Only Memory) 10e, an auxiliary storage device 10f, and a bus 10g.
  • the CPU 10a of this example has a control unit 10aa, a calculation unit 10ab, and a register 10ac, and executes various arithmetic processes according to various programs read into the register 10ac.
  • the output unit 10b is an output terminal, a display, or the like on which data is output.
  • the output unit 10c is a LAN card or the like controlled by the CPU 10a that has read a predetermined program.
  • the RAM 10d is an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or the like, and has a program area 10da in which a predetermined program is stored and a data area 10db in which various data are stored.
  • the auxiliary storage device 10f is, for example, a hard disk, an MO (Magneto-Optical disc), a semiconductor memory, or the like, and has a program area 10fa for storing a predetermined program and a data area 10fb for storing various data.
  • the bus 10g connects the CPU 10a, the output unit 10b, the output unit 10c, the RAM 10d, the ROM 10e, and the auxiliary storage device 10f so that information can be exchanged.
  • the CPU 10a writes the program stored in the program area 10fa of the auxiliary storage device 10f to the program area 10da of the RAM 10d according to the read OS (Operating System) program.
  • the CPU 10a writes various data stored in the data area 10fb of the auxiliary storage device 10f to the data area 10db of the RAM 10d. Then, the address on the RAM 10d in which this program or data is written is stored in the register 10ac of the CPU 10a.
  • the control unit 10aa of the CPU 10a sequentially reads out these addresses stored in the register 10ac, reads the program or data from the area on the RAM 10d indicated by the read address, and causes the calculation unit 10ab to sequentially execute the operations indicated by the program.
  • the calculation result is stored in the register 10ac.
  • the above program can be recorded on a computer-readable recording medium.
  • a computer-readable recording medium is a non-transitory recording medium. Examples of such a recording medium are a magnetic recording device, an optical disc, a magneto-optical recording medium, a semiconductor memory, and the like.
  • the distribution of this program is carried out, for example, by selling, transferring, renting, etc., a portable recording medium such as a DVD or CD-ROM on which the program is recorded.
  • the program may be stored in the storage device of the server computer, and the program may be distributed by transferring the program from the server computer to another computer via a network.
  • the computer that executes such a program first temporarily stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. Then, when the process is executed, the computer reads the program stored in its own storage device and executes the process according to the read program.
  • a computer may read the program directly from a portable recording medium and execute processing according to the program; further, each time the program is transferred from the server computer to this computer, processing according to the received program may be executed sequentially.
  • the above processing may be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only by execution instructions and result acquisition, without transferring the program from the server computer to this computer.
  • the program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).
  • the present device is configured by executing a predetermined program on a computer, but at least a part of these processing contents may be realized by hardware.
  • OSQAS is not limited to PESQ, and may be any value as long as it is an objective index that imitates the subjective evaluation of human sound quality.
  • in step 3 described above, the model M_θ was updated first, but the approximation function D_φ(s, ·) may be updated first in step 3 instead.
  • although a DNN is used in the above-described embodiment, other models such as a probabilistic model may be used.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

In the present invention, a second approximation function is obtained by updating a first approximation function, which approximates an objective index imitating the subjective sound quality evaluation of an input, so as to minimize a first cost function based on the sum of: an error between the subjective sound quality evaluation of a target sound and an objective index that imitates the subjective sound quality evaluation of the target sound and is obtained by inputting a target audio signal representing the target sound into the first approximation function; an error between the subjective sound quality evaluation of an observed sound based on the target sound and an objective index that imitates the subjective sound quality evaluation of the observed sound and is obtained by inputting an observation signal representing the observed sound into the first approximation function; and an error between the subjective sound quality evaluation of an emphasized sound corresponding to a masked sound signal obtained by applying a first mask to the observation signal and an objective index that imitates the subjective sound quality evaluation of the emphasized sound and is obtained by inputting an emphasized audio signal representing the emphasized sound into the first approximation function.
PCT/JP2020/002270 2020-01-23 2020-01-23 Learning device, speech enhancement device, methods therefor, and program WO2021149213A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/002270 WO2021149213A1 (fr) 2020-01-23 2020-01-23 Learning device, speech enhancement device, methods therefor, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/002270 WO2021149213A1 (fr) 2020-01-23 2020-01-23 Learning device, speech enhancement device, methods therefor, and program

Publications (1)

Publication Number Publication Date
WO2021149213A1 (fr) 2021-07-29

Family

ID=76993302

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/002270 WO2021149213A1 (fr) 2020-01-23 2020-01-23 Learning device, speech enhancement device, methods therefor, and program

Country Status (1)

Country Link
WO (1) WO2021149213A1 (fr)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0689095A (ja) * 1992-09-08 1994-03-29 Nippon Telegr & Teleph Corp <Ntt> Acoustic signal selection device
JP2006313181A (ja) * 2005-05-06 2006-11-16 Nissan Motor Co Ltd Voice input device and voice input method

Similar Documents

Publication Publication Date Title
Bonastre et al. ALIZE/SpkDet: a state-of-the-art open source software for speaker recognition
Magron et al. Model-based STFT phase recovery for audio source separation
JP6623376B2 (ja) Sound source enhancement device, method thereof, and program
Hwang et al. LP-WaveNet: Linear prediction-based WaveNet speech synthesis
WO2020045313A1 (fr) Mask estimation device, mask estimation method, and mask estimation program
Nakamura et al. Real-time audio-to-score alignment of music performances containing errors and arbitrary repeats and skips
JP2022031196A (ja) Noise removal method and apparatus
Llombart et al. Progressive loss functions for speech enhancement with deep neural networks
JP6721165B2 (ja) Input sound mask processing learning device, input data processing function learning device, input sound mask processing learning method, input data processing function learning method, and program
Moliner et al. Diffusion-based audio inpainting
WO2021149213A1 (fr) Learning device, speech enhancement device, methods therefor, and program
JP7036054B2 (ja) Acoustic model learning device, acoustic model learning method, and program
JP2018031910A (ja) Sound source enhancement learning device, sound source enhancement device, sound source enhancement learning method, program, and signal processing learning device
JP4981579B2 (ja) Error correction model learning method, device, program, and recording medium on which the program is recorded
JP2007304445A (ja) Method, device, and program for restoring and extracting frequency components, and recording medium on which the program is recorded
Moliner et al. Blind audio bandwidth extension: A diffusion-based zero-shot approach
GB2622654A Patched multi-condition training for robust speech recognition
Gabrielli et al. A multi-stage algorithm for acoustic physical model parameters estimation
Lü et al. Feature compensation based on independent noise estimation for robust speech recognition
Nortier et al. Unsupervised speech enhancement with diffusion-based generative models
JP7428251B2 (ja) Target sound signal generation device, target sound signal generation method, and program
US20220130406A1 Noise spatial covariance matrix estimation apparatus, noise spatial covariance matrix estimation method, and program
Wang et al. Speech Enhancement Control Design Algorithm for Dual‐Microphone Systems Using β‐NMF in a Complex Environment
JP7264282B2 (ja) Speech enhancement device, learning device, methods therefor, and program
JP7156064B2 (ja) Latent variable optimization device, filter coefficient optimization device, latent variable optimization method, filter coefficient optimization method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20915634

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: JP

122 Ep: pct application non-entry in european phase

Ref document number: 20915634

Country of ref document: EP

Kind code of ref document: A1