WO2023095470A1 - Signal processing device, signal processing method, and signal processing program - Google Patents
Signal processing device, signal processing method, and signal processing program
- Publication number
- WO2023095470A1 (PCT/JP2022/037913)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- acoustic
- mixed
- signal
- environmental sound
- feature
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
Definitions
- the present disclosure relates to technology for separating multiple acoustic signals from a mixed acoustic signal.
- Patent Literature 1 discloses a signal processing apparatus that includes: a conversion unit that converts an input mixed acoustic signal into a plurality of first internal states; a weighting unit that, when auxiliary information about the acoustic signal of a target sound source is input, generates a second internal state that is a weighted sum of the plurality of first internal states based on the auxiliary information, and that, when no auxiliary information is input, generates the second internal state by selecting one of the plurality of first internal states; and a mask estimator that estimates a mask based on the second internal state.
- the conventional technology described above may require complicated preparatory processing to create the auxiliary information about the acoustic signal of the target sound source in advance, and there is a risk that the performance of separating a plurality of acoustic signals from a mixed acoustic signal will be degraded; further improvement was therefore required.
- the present disclosure has been made to solve the above problems, and an object of the present disclosure is to provide a technique capable of preventing deterioration in the performance of separating a plurality of acoustic signals from a mixed acoustic signal.
- a signal processing device includes: a mixed acoustic signal acquisition unit that acquires a mixed acoustic signal including a plurality of acoustic signals; a mixed feature quantity conversion unit that converts the mixed acoustic signal into a mixed feature quantity representing the features of the mixed acoustic signal; a mask estimation unit that estimates a plurality of masks corresponding to each of the plurality of acoustic signals based on the mixed feature quantity; an acoustic signal conversion unit that uses the plurality of masks to calculate, from the mixed feature quantity, a plurality of separated feature quantities corresponding to each of the plurality of acoustic signals, and converts the calculated plurality of separated feature quantities into a plurality of separated acoustic signals; an environmental sound section estimation unit that, based on the plurality of separated acoustic signals, estimates an environmental sound section containing only an acoustic signal representing the environmental sound within all input sections of the mixed acoustic signal; an environmental sound signal extraction unit that extracts the mixed acoustic signal of the estimated environmental sound section from the mixed acoustic signal as an environmental sound signal; and an environmental sound feature quantity conversion unit that converts the environmental sound signal into an environmental sound feature quantity representing the features of the environmental sound signal. The mask estimation unit weights the mixed feature quantity using the environmental sound feature quantity and estimates the plurality of masks based on the weighted mixed feature quantity.
- FIG. 1 is a block diagram showing the configuration of a signal processing device according to an embodiment of the present disclosure.
- FIG. 2 is a block diagram showing the configuration of a learning device according to an embodiment of the present disclosure.
- FIG. 3 is a flowchart for explaining the sound source separation processing of the signal processing device according to the present embodiment.
- FIG. 4 is a flowchart for explaining the learning processing of the learning device according to the present embodiment.
- when the mixed acoustic signal contains noise (environmental sound) that was not used for training the neural network model, the performance of separating a plurality of acoustic signals from the mixed acoustic signal may be degraded.
- according to one aspect of the present disclosure, a signal processing device includes: a mixed acoustic signal acquisition unit that acquires a mixed acoustic signal including a plurality of acoustic signals; a mixed feature quantity conversion unit that converts the mixed acoustic signal into a mixed feature quantity representing the features of the mixed acoustic signal; a mask estimation unit that estimates a plurality of masks corresponding to each of the plurality of acoustic signals based on the mixed feature quantity; an acoustic signal conversion unit that uses the plurality of masks to calculate, from the mixed feature quantity, a plurality of separated feature quantities corresponding to each of the plurality of acoustic signals, and converts the calculated plurality of separated feature quantities into a plurality of separated acoustic signals; an environmental sound section estimation unit that, based on the plurality of separated acoustic signals, estimates an environmental sound section containing only an acoustic signal representing the environmental sound within all input sections of the mixed acoustic signal; an environmental sound signal extraction unit that extracts the mixed acoustic signal of the estimated environmental sound section from the mixed acoustic signal as an environmental sound signal; and an environmental sound feature quantity conversion unit that converts the environmental sound signal into an environmental sound feature quantity representing the features of the environmental sound signal.
- the mask estimation unit weights the mixed feature quantity using the environmental sound feature quantity, and estimates the plurality of masks based on the weighted mixed feature quantity.
- according to this configuration, the mixed acoustic signal of the environmental sound section, which contains only the acoustic signal representing the environmental sound, is extracted from the mixed acoustic signal as the environmental sound signal; the mixed feature quantity is weighted using the environmental sound feature quantity representing the features of that environmental sound signal; and a plurality of masks are estimated based on the weighted mixed feature quantity.
- therefore, the plurality of masks are estimated in real time using the environmental sound signal extracted from the mixed acoustic signal, and the mixed acoustic signal is separated into a plurality of separated acoustic signals using the estimated plurality of masks.
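The flow above can be sketched end to end. Everything below is a hypothetical stand-in: `to_mixed_feature`, `to_env_feature`, and `estimate_masks` are placeholder functions for illustration, not the patent's learned acoustic models.

```python
# Minimal sketch of the described separation flow, with placeholder models.

def to_mixed_feature(mixed_signal):
    # First acoustic model stand-in: identity "feature" for illustration.
    return list(mixed_signal)

def to_env_feature(env_signal):
    # Fourth acoustic model stand-in: one scalar weight from the env signal.
    return sum(abs(s) for s in env_signal) / max(len(env_signal), 1)

def estimate_masks(weighted_feature, n_sources=2):
    # Second acoustic model stand-in: equal masks summing to one.
    return [[1.0 / n_sources] * len(weighted_feature) for _ in range(n_sources)]

def separate(mixed_signal, env_signal):
    feature = to_mixed_feature(mixed_signal)
    env_feature = to_env_feature(env_signal)
    # Weight the mixed feature using the environmental sound feature.
    weighted = [f * env_feature for f in feature]
    masks = estimate_masks(weighted)
    # Apply each mask to the mixed feature to obtain separated features.
    separated_features = [[m * f for m, f in zip(mask, feature)]
                          for mask in masks]
    # Third acoustic model stand-in: return features as signals directly.
    return separated_features

signals = separate([1.0, 2.0, 3.0], [0.5, 0.5])
```

The sketch mirrors the claim structure: acquisition, feature conversion, environmental-sound weighting, mask estimation, and mask application, each as a replaceable component.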
- in the above signal processing device, the mixed feature quantity conversion unit may include a first acoustic model that outputs the mixed feature quantity when the mixed acoustic signal is input.
- the mask estimation unit may include a second acoustic model that outputs the plurality of masks when the mixed feature quantity is input.
- the acoustic signal conversion unit may include a third acoustic model that outputs the plurality of separated acoustic signals when the calculated plurality of separated feature quantities are input.
- the environmental sound feature quantity conversion unit may include a fourth acoustic model that outputs the environmental sound feature quantity when the environmental sound signal is input.
- the mixed acoustic signal is input to the first acoustic model, and the mixed feature quantity is output from the first acoustic model.
- the mixed features are input to the second acoustic model, and multiple masks are output from the second acoustic model.
- the plurality of calculated separated feature amounts are input to the third acoustic model, and the plurality of separated acoustic signals are output from the third acoustic model.
- the environmental sound signal is input to the fourth acoustic model, and the environmental sound feature quantity is output from the fourth acoustic model.
- according to this configuration, the mixed feature quantity can be easily estimated by the first acoustic model, the plurality of masks can be easily estimated by the second acoustic model, the plurality of separated acoustic signals can be easily estimated by the third acoustic model, and the environmental sound feature quantity can be easily estimated by the fourth acoustic model.
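The four acoustic models share one input-to-output calling pattern, which can be sketched with a common callable wrapper. The `AcousticModel` class and the toy transforms below are assumptions for illustration only; the actual models are trained neural networks.

```python
# Hedged sketch: the four acoustic models as one shared callable interface.

class AcousticModel:
    """Placeholder for a trained model: input in, output out."""

    def __init__(self, transform):
        self.transform = transform

    def __call__(self, x):
        return self.transform(x)

# Toy transforms standing in for the four learned models.
first = AcousticModel(lambda sig: [s * 2.0 for s in sig])       # signal -> mixed feature
second = AcousticModel(lambda feat: [[0.5] * len(feat)          # feature -> two masks
                                     for _ in range(2)])
third = AcousticModel(lambda feats: feats)                      # separated features -> signals
fourth = AcousticModel(lambda env: sum(env) / len(env))         # env signal -> env feature
```

In a real system each `transform` would be a network forward pass; the wrapper only shows that all four units reduce to the same "input one quantity, output another" contract.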
- the above signal processing device may further include: an acoustic signal acquisition unit that acquires a mixed acoustic signal for learning and a plurality of correct acoustic signals corresponding to the correct answers of the plurality of acoustic signals included in the mixed acoustic signal for learning; and a parameter update unit that updates each parameter of the first acoustic model, the second acoustic model, the third acoustic model, and the fourth acoustic model.
- the mixed feature quantity conversion unit inputs the mixed acoustic signal for learning to the first acoustic model and acquires the mixed feature quantity output from the first acoustic model. The environmental sound feature quantity conversion unit inputs a correct environmental sound signal, which indicates the environmental sound corresponding to the correct answer among the plurality of correct acoustic signals, to the fourth acoustic model and acquires the environmental sound feature quantity output from the fourth acoustic model. The mask estimation unit weights the mixed feature quantity output from the first acoustic model using the environmental sound feature quantity output from the fourth acoustic model, inputs the weighted mixed feature quantity to the second acoustic model, and acquires the plurality of masks output from the second acoustic model. The acoustic signal conversion unit uses the plurality of masks to calculate the plurality of separated feature quantities from the mixed feature quantity, inputs them to the third acoustic model, and acquires the plurality of separated acoustic signals output from the third acoustic model.
- the parameter update unit calculates an error between each of the plurality of separated acoustic signals output from the third acoustic model and each of the plurality of correct acoustic signals, and may update each parameter of the first acoustic model, the second acoustic model, the third acoustic model, and the fourth acoustic model based on the plurality of calculated errors.
- the learning mixed acoustic signal and a plurality of correct acoustic signals corresponding to the correct answers of the plurality of acoustic signals included in the learning mixed acoustic signal are acquired.
- a learning mixed acoustic signal is input to the first acoustic model, and a mixed feature amount is output from the first acoustic model.
- a correct environmental sound signal representing an environmental sound corresponding to the correct answer among the plurality of correct sound signals is input to the fourth acoustic model, and the fourth acoustic model outputs an environmental sound feature amount.
- the mixed feature output from the first acoustic model is weighted using the environmental sound feature output from the fourth acoustic model.
- the weighted mixed features are input to the second acoustic model, and multiple masks are output from the second acoustic model.
- a plurality of separated feature quantities corresponding to each of the plurality of correct acoustic signals are calculated from the mixed feature quantity using the plurality of masks output from the second acoustic model.
- the calculated plurality of separated feature quantities are input to the third acoustic model, and a plurality of separated acoustic signals are output from the third acoustic model.
- errors between each of the plurality of separated acoustic signals output from the third acoustic model and each of the plurality of correct acoustic signals are calculated.
- Each parameter of the first acoustic model, the second acoustic model, the third acoustic model, and the fourth acoustic model is updated based on the plurality of calculated errors.
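The learning steps above (forward pass, per-signal error, parameter update) can be sketched with a single learnable parameter. The one-parameter `separate_with_gain` model and the finite-difference update are illustrative assumptions; the patent trains four neural networks with backpropagation-style updates.

```python
# Illustrative training step: compute the error between each separated
# signal and its correct signal, then nudge a parameter to reduce it.

def separate_with_gain(mixed, gain):
    # Stand-in for models 1-3: split the mixture with one learnable gain.
    return [[gain * m for m in mixed], [(1.0 - gain) * m for m in mixed]]

def total_error(outputs, targets):
    # Sum of squared errors over all separated signals and samples.
    return sum((o - t) ** 2
               for out, tgt in zip(outputs, targets)
               for o, t in zip(out, tgt))

def training_step(mixed, targets, gain, lr=0.02, eps=1e-4):
    base = total_error(separate_with_gain(mixed, gain), targets)
    bumped = total_error(separate_with_gain(mixed, gain + eps), targets)
    grad = (bumped - base) / eps      # finite-difference gradient estimate
    return gain - lr * grad           # gradient-descent update

mixed = [1.0, 2.0]
targets = [[0.6, 1.2], [0.4, 0.8]]    # correct separated signals
gain = 0.5
for _ in range(50):
    gain = training_step(mixed, targets, gain)
```

After the loop the gain converges to the value that reproduces the correct signals, which is the essence of the parameter update described above.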
- according to this configuration, the first acoustic model, the second acoustic model, the third acoustic model, and the fourth acoustic model can be trained, and the estimation accuracy of each of these models can be improved.
- in the above signal processing device, the plurality of acoustic signals may include an acoustic signal representing the environmental sound and an acoustic signal representing a sound other than the environmental sound.
- the sound other than the environmental sound may be a voice uttered by a person.
- the acoustic signal representing the environmental sound and the acoustic signal representing the voice uttered by the person can be separated from the mixed acoustic signal.
- the sound other than the environmental sound may be a sound emitted by a specific object.
- in the above signal processing device, the environmental sound signal extraction unit may store the extracted environmental sound signal in a memory, and the environmental sound feature quantity conversion unit may read the environmental sound signal from the memory and convert the read environmental sound signal into the environmental sound feature quantity.
- the extracted environmental sound signal is stored in the memory each time the mixed sound signal is acquired, and the environmental sound feature quantity is generated using the environmental sound signal stored in the memory.
- therefore, the plurality of masks can be estimated in real time using the environmental sound feature quantity, and the plurality of separated acoustic signals can be accurately separated from the mixed acoustic signal using the plurality of masks.
- the present disclosure can be realized not only as a signal processing device having the characteristic configuration described above, but also as a signal processing method that executes characteristic processing corresponding to that configuration. It can likewise be realized as a computer program that causes a computer to execute the characteristic processing included in such a signal processing method. Therefore, the following other aspects can also achieve the same effects as the signal processing device described above.
- a signal processing method according to another aspect of the present disclosure is a method in which a computer: acquires a mixed acoustic signal including a plurality of acoustic signals; converts the mixed acoustic signal into a mixed feature quantity representing the features of the mixed acoustic signal; estimates a plurality of masks corresponding to each of the plurality of acoustic signals based on the mixed feature quantity; calculates, using the plurality of masks, a plurality of separated feature quantities corresponding to each of the plurality of acoustic signals from the mixed feature quantity; converts the calculated plurality of separated feature quantities into a plurality of separated acoustic signals; estimates, based on the plurality of separated acoustic signals, an environmental sound section containing only an acoustic signal representing the environmental sound within all input sections of the mixed acoustic signal; extracts the mixed acoustic signal of the estimated environmental sound section from the mixed acoustic signal as an environmental sound signal; and converts the environmental sound signal into an environmental sound feature quantity representing the features of the environmental sound signal. In estimating the plurality of masks, the mixed feature quantity is weighted using the environmental sound feature quantity, and the plurality of masks are estimated based on the weighted mixed feature quantity.
- a signal processing program according to another aspect of the present disclosure causes a computer to function as: a mixed acoustic signal acquisition unit that acquires a mixed acoustic signal including a plurality of acoustic signals; a mixed feature quantity conversion unit that converts the mixed acoustic signal into a mixed feature quantity representing the features of the mixed acoustic signal; a mask estimation unit that estimates a plurality of masks corresponding to each of the plurality of acoustic signals based on the mixed feature quantity; an acoustic signal conversion unit that calculates, using the plurality of masks, a plurality of separated feature quantities corresponding to each of the plurality of acoustic signals from the mixed feature quantity and converts the calculated plurality of separated feature quantities into a plurality of separated acoustic signals; an environmental sound section estimation unit that estimates, based on the plurality of separated acoustic signals, an environmental sound section containing only an acoustic signal representing the environmental sound within all input sections of the mixed acoustic signal; an environmental sound signal extraction unit that extracts the mixed acoustic signal of the estimated environmental sound section from the mixed acoustic signal as an environmental sound signal; and an environmental sound feature quantity conversion unit that converts the environmental sound signal into an environmental sound feature quantity representing the features of the environmental sound signal.
- the mask estimation unit weights the mixed feature quantity using the environmental sound feature quantity, and estimates the plurality of masks based on the weighted mixed feature quantity.
- a non-transitory computer-readable recording medium according to another aspect of the present disclosure records a signal processing program that causes a computer to function as: a mixed acoustic signal acquisition unit that acquires a mixed acoustic signal including a plurality of acoustic signals; a mixed feature quantity conversion unit that converts the mixed acoustic signal into a mixed feature quantity representing the features of the mixed acoustic signal; a mask estimation unit that estimates a plurality of masks corresponding to each of the plurality of acoustic signals based on the mixed feature quantity; an acoustic signal conversion unit that calculates, using the plurality of masks, a plurality of separated feature quantities corresponding to each of the plurality of acoustic signals from the mixed feature quantity and converts the calculated plurality of separated feature quantities into a plurality of separated acoustic signals; an environmental sound section estimation unit that estimates, based on the plurality of separated acoustic signals, an environmental sound section containing only an acoustic signal representing the environmental sound within all input sections of the mixed acoustic signal; an environmental sound signal extraction unit that extracts the mixed acoustic signal of the estimated environmental sound section from the mixed acoustic signal as an environmental sound signal; and an environmental sound feature quantity conversion unit that converts the environmental sound signal into an environmental sound feature quantity representing the features of the environmental sound signal.
- the mask estimation unit weights the mixed feature quantity using the environmental sound feature quantity, and estimates the plurality of masks based on the weighted mixed feature quantity.
- FIG. 1 is a block diagram showing the configuration of a signal processing device 1 according to an embodiment of the present disclosure.
- the signal processing device 1 separates a plurality of acoustic signals from the mixed acoustic signal.
- a mixed acoustic signal includes a plurality of acoustic signals.
- the plurality of acoustic signals include, for example, acoustic signals indicating environmental sounds and acoustic signals indicating sounds other than the environmental sounds. Sounds other than environmental sounds are, for example, voices uttered by people.
- the signal processing device 1 shown in FIG. 1 includes a mixed acoustic signal acquisition unit 11, a mixed feature quantity conversion unit 12, an environmental sound signal storage unit 13, an environmental sound feature quantity conversion unit 14, a mask estimation unit 15, an acoustic signal conversion unit 16, an acoustic signal output unit 17, an environmental sound section estimation unit 18, and an environmental sound signal extraction unit 19.
- the processor is composed of, for example, a CPU (Central Processing Unit).
- the environmental sound signal storage unit 13 is realized by a memory.
- the memory is composed of, for example, ROM (Read Only Memory) or EEPROM (Electrically Erasable Programmable Read Only Memory).
- the signal processing device 1 may be, for example, a computer, a smartphone, a tablet computer, or a server. Moreover, the signal processing device 1 may be incorporated in other devices such as a car navigation device or home electric appliances.
- the mixed acoustic signal acquisition unit 11 acquires a mixed acoustic signal including multiple acoustic signals.
- the mixed sound signal includes a first sound signal representing ambient sounds around a person and a second sound signal representing a person's voice.
- the mixed acoustic signal acquisition unit 11 may be connected to a microphone (not shown).
- the microphone picks up sounds from a plurality of sound sources, converts them into acoustic signals, and outputs the converted acoustic signals to the signal processing device 1 as mixed acoustic signals.
- a microphone picks up a voice spoken by a person and environmental sounds around the person.
- a mixed acoustic signal acquisition unit 11 acquires a mixed acoustic signal from a microphone.
- the mixed acoustic signal acquisition unit 11 acquires a mixed acoustic signal for a predetermined period every predetermined period. For example, the mixed acoustic signal acquisition unit 11 may acquire the mixed acoustic signal for 10 seconds every 10 seconds.
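The fixed-period acquisition described above can be sketched as slicing the input stream into equal-length chunks. The `acquire_chunks` helper and the chunk length in samples are illustrative assumptions (the patent's example uses 10-second periods).

```python
# Sketch of fixed-period acquisition: yield one complete chunk per period.

def acquire_chunks(samples, chunk_len):
    """Yield successive fixed-length chunks of the input signal."""
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        if len(chunk) == chunk_len:   # process only complete periods
            yield chunk

# A 25-sample stream with a 10-sample period yields two complete chunks.
chunks = list(acquire_chunks(list(range(25)), 10))
```

Each yielded chunk corresponds to one acquisition by the mixed acoustic signal acquisition unit 11, which downstream units then process as a unit.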
- the mixed acoustic signal acquisition unit 11 acquires the mixed acoustic signal picked up by the microphone directly from the microphone, but the present disclosure is not particularly limited to this.
- a mixed acoustic signal picked up by a microphone or the like may be recorded on a computer-readable recording medium.
- the mixed acoustic signal acquisition unit 11 may acquire the mixed acoustic signal from a computer-readable recording medium.
- the computer-readable recording medium is, for example, a semiconductor memory, hard disk drive, optical disk, or USB (Universal Serial Bus) memory.
- the mixed acoustic signal acquisition unit 11 may acquire the mixed acoustic signal from another device via a network such as the Internet.
- the mixed feature quantity conversion unit 12 converts the mixed sound signal acquired by the mixed sound signal acquisition unit 11 into a mixed feature quantity indicating the features of the mixed sound signal.
- a mixed feature amount is a feature amount representing a mixed acoustic signal by a vector or a matrix, and is, for example, an embedding vector.
- the mixed feature amount conversion unit 12 includes a first acoustic model that outputs a mixed feature amount when a mixed acoustic signal is input.
- the first acoustic model is, for example, a convolutional neural network, a recurrent neural network, a long short-term memory network, or a deep neural network.
- the first acoustic model converts the input mixed acoustic signal into a mixed feature amount and outputs the mixed feature amount.
- the first acoustic model is machine-learned by the learning device 2, which will be described later.
- the mixed feature amount conversion unit 12 inputs the mixed acoustic signal to the first acoustic model and acquires the mixed feature amount output from the first acoustic model.
- the mixed feature amount transforming section 12 outputs the mixed feature amount transformed from the mixed acoustic signal to the mask estimating section 15 and the acoustic signal transforming section 16 .
- the environmental sound signal storage unit 13 stores, as environmental sound signals, mixed sound signals in environmental sound sections containing only sound signals representing environmental sounds in all input sections of the mixed sound signal.
- the environmental sound signal storage unit 13 temporarily stores the environmental sound signal.
- the environmental sound signal stored in the environmental sound signal storage unit 13 is newly updated every predetermined period.
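The storage unit's update behavior, keeping only the most recently extracted environmental sound signal and replacing it each period, can be sketched as follows. The `EnvSignalStore` class and its API are hypothetical, not the patent's implementation.

```python
# Sketch of the environmental sound signal storage unit's behavior.

class EnvSignalStore:
    def __init__(self):
        self._signal = None

    def update(self, env_signal):
        # Each new extraction replaces the previous period's signal.
        self._signal = list(env_signal)

    def read(self):
        # Returns the most recently stored environmental sound signal.
        return self._signal

store = EnvSignalStore()
store.update([0.1, 0.2])
store.update([0.3])        # a newer period overwrites the older signal
```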
- the environmental sound feature amount conversion unit 14 converts the environmental sound signal into an environmental sound feature amount that indicates the feature of the environmental sound signal.
- the environmental sound feature quantity conversion unit 14 reads the environmental sound signal from the environmental sound signal storage unit 13 and converts the read environmental sound signal into the environmental sound feature quantity.
- the environmental sound feature quantity is a feature quantity expressing the environmental sound signal by a vector or matrix, and is, for example, an embedding vector.
- the environmental sound feature amount conversion unit 14 includes a fourth acoustic model that outputs an environmental sound feature amount when the environmental sound signal is input.
- the fourth acoustic model is, for example, a convolutional neural network, a recurrent neural network, a long short-term memory network, or a deep neural network.
- the fourth acoustic model is machine-learned by the learning device 2, which will be described later.
- the environmental sound feature quantity conversion unit 14 inputs the environmental sound signal to the fourth acoustic model and acquires the environmental sound feature quantity output from the fourth acoustic model.
- the environmental sound feature amount corresponds to auxiliary information.
- the environmental sound feature amount conversion unit 14 outputs the environmental sound feature amount converted from the environmental sound signal to the mask estimation unit 15 .
- the mask estimation unit 15 estimates a plurality of masks corresponding to each of the plurality of acoustic signals based on the mixed feature quantity transformed by the mixed feature quantity transformation unit 12 .
- the mask estimating unit 15 includes a second acoustic model that outputs a plurality of masks when mixed features are input.
- the second acoustic model is, for example, a convolutional neural network, a recurrent neural network, a long short-term memory network, or a deep neural network.
- the second acoustic model is machine-learned by the learning device 2, which will be described later.
- the mask estimation unit 15 also weights the mixed feature amount using the environmental sound feature amount converted by the environmental sound feature amount conversion unit 14, and estimates a plurality of masks based on the weighted mixed feature amount.
- the multiple masks are, for example, time-frequency masks.
- the mask estimation unit 15 inputs the mixed feature quantity weighted using the environmental sound feature quantity to the second acoustic model, and acquires a plurality of masks corresponding to each of the plurality of acoustic signals output from the second acoustic model.
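One plausible form of the weighting step is an elementwise product of the mixed-feature and environmental-sound embedding vectors. The choice of elementwise product is an assumption; the patent states only that the mixed feature quantity is weighted using the environmental sound feature quantity.

```python
# Assumed weighting scheme: elementwise product of two embedding vectors.

def weight_mixed_feature(mixed_feature, env_feature):
    if len(mixed_feature) != len(env_feature):
        raise ValueError("embedding dimensions must match")
    return [m * e for m, e in zip(mixed_feature, env_feature)]

weighted = weight_mixed_feature([1.0, 2.0, 3.0], [0.5, 1.0, 2.0])
```

The weighted vector is what would be fed to the second acoustic model, so dimensions that the environmental sound emphasizes carry more influence on the estimated masks.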
- the mask estimation unit 15 outputs a plurality of masks estimated from the mixed feature quantity to the acoustic signal conversion unit 16 .
- according to this configuration, a mask for extracting an acoustic signal representing the environmental sound and a mask for extracting an acoustic signal representing a voice other than the environmental sound can be accurately estimated.
- in the present embodiment, the mask estimation unit 15 estimates, based on the weighted mixed feature quantity, a first mask for extracting a first acoustic signal representing the environmental sound and a second mask for extracting a second acoustic signal representing human voice.
- the acoustic signal transforming unit 16 uses the multiple masks estimated by the mask estimating unit 15 to extract multiple separated feature quantities corresponding to the respective multiple acoustic signals from the mixed feature quantity transformed by the mixed feature quantity transforming unit 12. calculate.
- the separation feature amount is a feature amount representing the acoustic signal included in the mixed acoustic signal in a vector or matrix, and is, for example, an embedding vector.
- the acoustic signal transforming unit 16 masks the mixed feature using the multiple masks estimated by the mask estimating unit 15, and calculates multiple separated features corresponding to each of the multiple acoustic signals.
- the acoustic signal conversion unit 16 converts the calculated plurality of separated feature amounts into a plurality of separated acoustic signals.
- the acoustic signal conversion unit 16 includes a third acoustic model that outputs a plurality of separated acoustic signals when the plurality of calculated separated feature amounts are input.
- the third acoustic model is, for example, a convolutional neural network, a recurrent neural network, a long short-term memory network, or a deep neural network.
- the third acoustic model is machine-learned by the learning device 2, which will be described later.
- the acoustic signal conversion unit 16 inputs the plurality of calculated separated feature amounts to the third acoustic model, and acquires the plurality of separated acoustic signals output from the third acoustic model.
- the acoustic signal conversion unit 16 outputs a plurality of separated acoustic signals converted from the plurality of separated feature quantities to the acoustic signal output unit 17 and the environmental sound section estimation unit 18 .
- the acoustic signal transforming unit 16 uses the first mask estimated by the mask estimating unit 15 to calculate the first separated feature corresponding to the first acoustic signal from the mixed feature, and the mask estimating unit 15 Using the estimated second mask, a second separated feature amount corresponding to the second acoustic signal is calculated from the mixed feature amount.
- the acoustic signal conversion unit 16 multiplies the mixed feature quantity and the first mask in each time-frequency component to calculate the first separated feature quantity corresponding to the first acoustic signal, and multiplies the mixed feature quantity and the second mask in each time-frequency component to calculate the second separated feature quantity corresponding to the second acoustic signal. Further, the acoustic signal conversion unit 16 converts the calculated first separated feature quantity into the first separated acoustic signal, and converts the calculated second separated feature quantity into the second separated acoustic signal.
- the acoustic signal output unit 17 outputs a plurality of separated acoustic signals converted by the acoustic signal conversion unit 16.
- the acoustic signal output unit 17 outputs a plurality of separated acoustic signals separated from the mixed acoustic signal.
- the acoustic signal output unit 17 may output all of the plurality of separated acoustic signals, or may output a portion of the plurality of separated acoustic signals.
- the acoustic signal output unit 17 outputs the first separated acoustic signal representing the environmental sound and the second separated acoustic signal representing the human voice converted by the acoustic signal conversion unit 16.
- environmental sounds such as factory noise, vehicle interior noise, and vehicle exterior noise can be removed from the input mixed acoustic signal, and only human voices can be extracted.
- the second separated acoustic signal representing human voice is then used for speech recognition, for example.
- the first separated acoustic signal representing environmental sound is used, for example, to detect an event occurring around a person.
- the acoustic signal output unit 17 may output both the first separated acoustic signal and the second separated acoustic signal, or may output either the first separated acoustic signal or the second separated acoustic signal.
- the environmental sound section estimation unit 18 estimates, based on the plurality of separated acoustic signals converted by the acoustic signal conversion unit 16, an environmental sound section that contains only an acoustic signal representing the environmental sound among all input sections of the mixed acoustic signal.
- for example, the environmental sound section estimation unit 18 estimates the environmental sound section containing only the acoustic signal representing the environmental sound by subtracting the sections of the second separated acoustic signal representing the human voice from the sections of the first separated acoustic signal representing the environmental sound.
- the environmental sound section estimation unit 18 may also perform voice activity detection (VAD) processing on all input sections of each of the plurality of separated acoustic signals to distinguish speech sections containing human voice from non-speech sections containing sounds other than human voice, and estimate as environmental sound sections only the non-speech sections that do not overlap with any speech section. For example, the environmental sound section estimation unit 18 uses VAD processing to distinguish speech sections from non-speech sections in all input sections of the first separated acoustic signal representing the environmental sound, and likewise in all input sections of the second separated acoustic signal representing the human voice. Then, the environmental sound section estimation unit 18 may estimate, as the environmental sound section, only the non-speech sections among all input sections of the mixed acoustic signal that do not overlap with a speech section.
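The interval logic described above can be sketched at the frame level. This is a toy illustration: the boolean voice-activity decisions are hand-written inputs standing in for a real VAD, and the per-frame granularity is an assumption.

```python
def environmental_sound_frames(env_active, speech_active):
    """Keep only frames where environmental sound is present and no speech overlaps."""
    return [e and not s for e, s in zip(env_active, speech_active)]

# Toy VAD decisions for six frames (True = activity detected in that frame).
env_active    = [True, True, True, False, True, True]   # from the env. sound signal
speech_active = [False, True, False, False, False, True]  # from the voice signal

frames = environmental_sound_frames(env_active, speech_active)
# Frames 0, 2, and 4 survive: environmental sound with no overlapping speech.
```

In the device, the surviving frames index back into the mixed acoustic signal, which is then extracted as the environmental sound signal.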
- the environmental sound signal extraction unit 19 extracts the mixed sound signal in the environmental sound section estimated by the environmental sound section estimation unit 18 from the mixed sound signal as the environmental sound signal.
- the environmental sound signal extraction unit 19 stores the extracted environmental sound signal in the environmental sound signal storage unit 13.
- the environmental sound signal extraction unit 19 stores the environmental sound signal in the environmental sound signal storage unit 13 every predetermined period, and updates the environmental sound signal in the environmental sound signal storage unit 13 .
- the predetermined time period is the interval at which the mixed acoustic signal is acquired.
- In this way, the environmental sound signal is stored in the environmental sound signal storage unit 13 every predetermined period, and the environmental sound signal stored in the environmental sound signal storage unit 13 is converted into the environmental sound feature indicating the characteristics of the environmental sound signal. The converted environmental sound feature is used to estimate the plurality of masks. Therefore, a plurality of acoustic signals can be separated from the mixed acoustic signal using environmental sound that changes in real time.
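The disclosure does not pin down the weighting operation applied to the mixed feature. As a purely illustrative sketch, one plausible realization gates each frame of the mixed feature with a sigmoid of the environmental sound feature; the shapes, the sigmoid, and the broadcasting convention are all assumptions.

```python
import numpy as np

def weight_mixed_feature(mixed: np.ndarray, env_feat: np.ndarray) -> np.ndarray:
    """Weight each frame of the mixed feature by a gate derived from the env. feature."""
    gate = 1.0 / (1.0 + np.exp(-env_feat))  # squash to (0, 1)
    return mixed * gate                     # broadcast gate (F,) over frames (T, F)

mixed = np.ones((3, 4))   # 3 time frames x 4 feature dimensions (invented shape)
env_feat = np.zeros(4)    # sigmoid(0) = 0.5 in every dimension
weighted = weight_mixed_feature(mixed, env_feat)
```

The weighted feature is what the second acoustic model consumes when estimating the masks.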
- FIG. 2 is a block diagram showing the configuration of the learning device 2 according to the embodiment of the present disclosure.
- the learning device 2 learns the parameters of each acoustic model (for example, neural network) of the mixed feature quantity transforming unit 12, the environmental sound feature quantity transforming unit 14, the mask estimating unit 15, and the acoustic signal transforming unit 16.
- the learning acoustic signal acquisition unit 21, the mixed feature quantity conversion unit 12, the environmental sound feature quantity conversion unit 14, the mask estimation unit 15, the acoustic signal conversion unit 16, and the parameter update unit 22 are implemented by a processor.
- the processor is, for example, a CPU.
- the learning device 2 may be, for example, a computer or a server. In the present embodiment, the signal processing device 1 and the learning device 2 are separate devices, but the signal processing device 1 may include the learning acoustic signal acquisition unit 21 and the parameter update unit 22 of the learning device 2. That is, the signal processing device 1 may have the functions of the learning device 2.
- the learning acoustic signal acquisition unit 21 acquires a learning mixed acoustic signal and a plurality of correct acoustic signals corresponding to the correct answers of the plurality of acoustic signals included in the learning mixed acoustic signal.
- the learning acoustic signal acquisition unit 21 outputs the plurality of correct acoustic signals to the parameter update unit 22, outputs the learning mixed acoustic signal to the mixed feature conversion unit 12, and outputs the correct environmental sound signal, i.e. the signal representing the environmental sound among the plurality of correct acoustic signals, to the environmental sound feature conversion unit 14.
- the learning acoustic signal acquisition unit 21 may be connected to a microphone (not shown).
- the microphone individually picks up sounds from a plurality of sound sources, converts them into acoustic signals, and outputs each converted acoustic signal to the signal processing device 1 as a correct acoustic signal.
- a microphone individually picks up a voice uttered by a person and ambient environmental sounds.
- the microphone picks up a sound in which a plurality of sounds identical to the plurality of correct acoustic signals are mixed, converts the sound into an acoustic signal, and outputs the converted acoustic signal to the signal processing device 1 as the learning mixed acoustic signal.
- the learning acoustic signal acquisition unit 21 acquires the learning mixed acoustic signal and the plurality of correct acoustic signals from the microphones. Further, the learning acoustic signal acquisition unit 21 acquires a plurality of items of teacher data, each item consisting of a learning mixed acoustic signal and a plurality of correct acoustic signals.
- the learning acoustic signal acquisition unit 21 directly acquires the learning mixed acoustic signal and the plurality of correct acoustic signals picked up by the microphones.
- a learning mixed acoustic signal picked up by a microphone or the like and a plurality of correct acoustic signals may be recorded on a computer-readable recording medium.
- the learning acoustic signal acquisition unit 21 may acquire the learning mixed acoustic signal and the plurality of correct acoustic signals from a computer-readable recording medium.
- the learning acoustic signal acquisition unit 21 may acquire the learning mixed acoustic signal and the plurality of correct acoustic signals from another device via a network such as the Internet.
- the parameter updating unit 22 updates each parameter of the first acoustic model, the second acoustic model, the third acoustic model, and the fourth acoustic model.
- the mixed feature amount conversion unit 12 converts the learning mixed acoustic signal acquired by the learning acoustic signal acquisition unit 21 into a mixed feature amount indicating the feature of the learning mixed acoustic signal.
- the mixed feature amount conversion unit 12 inputs the learning mixed acoustic signal acquired by the learning acoustic signal acquisition unit 21 to the first acoustic model, and acquires the mixed feature amount output from the first acoustic model.
- the environmental sound feature conversion unit 14 converts the correct environmental sound signal, i.e. the signal representing the environmental sound corresponding to the correct answer among the plurality of correct acoustic signals acquired by the learning acoustic signal acquisition unit 21, into an environmental sound feature indicating the characteristics of the correct environmental sound signal.
- the environmental sound feature conversion unit 14 inputs the correct environmental sound signal representing the environmental sound corresponding to the correct answer among the plurality of correct acoustic signals acquired by the learning acoustic signal acquisition unit 21 to the fourth acoustic model, and acquires the environmental sound feature output from the fourth acoustic model.
- the mask estimation unit 15 weights the mixed feature using the environmental sound feature converted by the environmental sound feature conversion unit 14, and estimates a plurality of masks corresponding to each of the plurality of correct acoustic signals based on the weighted mixed feature.
- the mask estimation unit 15 weights the mixed feature output from the first acoustic model using the environmental sound feature output from the fourth acoustic model, inputs the weighted mixed feature to the second acoustic model, and acquires the plurality of masks output from the second acoustic model.
- the acoustic signal conversion unit 16 uses the masks output from the second acoustic model to calculate a plurality of separated feature amounts corresponding to each of the plurality of correct acoustic signals from the mixed feature amount.
- the acoustic signal transforming unit 16 masks the mixed feature using the masks estimated by the mask estimating unit 15, and calculates a plurality of separated features corresponding to each of the correct acoustic signals.
- the acoustic signal conversion unit 16 converts the calculated plurality of separated feature quantities into a plurality of separated acoustic signals.
- the acoustic signal conversion unit 16 inputs the plurality of calculated separated feature amounts to the third acoustic model, and acquires the plurality of separated acoustic signals output from the third acoustic model.
- the parameter update unit 22 calculates the error between each of the plurality of separated acoustic signals output from the third acoustic model and the corresponding correct acoustic signal acquired by the learning acoustic signal acquisition unit 21, and based on the plurality of calculated errors, updates each parameter of the first acoustic model of the mixed feature conversion unit 12, the second acoustic model of the mask estimation unit 15, the third acoustic model of the acoustic signal conversion unit 16, and the fourth acoustic model of the environmental sound feature conversion unit 14.
- the parameter update unit 22 updates each parameter of the first acoustic model, the second acoustic model, the third acoustic model, and the fourth acoustic model using the error backpropagation method. More specifically, the parameter update unit 22 calculates the error between each of the plurality of separated acoustic signals output from the third acoustic model and the corresponding correct acoustic signal, computes the average of the plurality of calculated errors, and updates each parameter of the first acoustic model, the second acoustic model, the third acoustic model, and the fourth acoustic model so that the average is minimized.
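The averaged-error update can be illustrated with a deliberately tiny stand-in model. A single scalar gain `w` replaces the four acoustic models, and an analytic gradient replaces backpropagation through a network; both are assumptions made only to show the loss-averaging and minimization loop.

```python
import numpy as np

def average_error(separated, correct):
    """Mean of the per-signal mean squared errors, as described above."""
    return np.mean([np.mean((s - c) ** 2) for s, c in zip(separated, correct)])

rng = np.random.default_rng(1)
correct = [rng.random(16), rng.random(16)]  # two toy "correct acoustic signals"
w = 0.2      # stand-in parameter for the four acoustic models (assumption)
lr = 0.1     # learning rate (assumption)

losses = []
for _ in range(200):
    separated = [w * c for c in correct]   # toy "model": scale the targets by w
    losses.append(average_error(separated, correct))
    # analytic gradient of the averaged squared error with respect to w
    grad = np.mean([np.mean(2 * (w * c - c) * c) for c in correct])
    w -= lr * grad

assert losses[-1] < losses[0]  # the averaged error decreases as w is updated
```

The fixed point is `w = 1`, where every separated signal equals its correct signal, mirroring the goal of the update rule in the text.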
- Each part of the learning device 2 processes the plurality of items of teacher data and repeatedly updates the parameters of the first acoustic model, the second acoustic model, the third acoustic model, and the fourth acoustic model, whereby the first acoustic model, the second acoustic model, the third acoustic model, and the fourth acoustic model are learned.
- The environmental sound feature conversion unit 14 including the trained acoustic model is mounted on the signal processing device 1.
- FIG. 3 is a flowchart for explaining sound source separation processing of the signal processing device 1 according to the present embodiment.
- the mixed acoustic signal acquisition unit 11 acquires a mixed acoustic signal including a plurality of acoustic signals.
- the mixed sound signal includes a first sound signal representing ambient sounds around a person and a second sound signal representing a person's voice.
- the second acoustic signal may indicate not only the voice of one person but also the voices of a plurality of people.
- step S2 the mixed feature amount conversion unit 12 converts the mixed sound signal acquired by the mixed sound signal acquisition unit 11 into a mixed feature amount indicating the feature of the mixed sound signal.
- the mixed feature amount transforming unit 12 inputs the mixed acoustic signal to the trained first acoustic model, and acquires the mixed feature amount output from the first acoustic model.
- step S3 the environmental sound feature value conversion unit 14 reads the environmental sound signal representing only the environmental sound from the environmental sound signal storage unit 13.
- step S4 the environmental sound feature quantity conversion unit 14 converts the environmental sound signal read from the environmental sound signal storage unit 13 into an environmental sound feature quantity indicating the characteristics of the environmental sound signal.
- the environmental sound feature quantity conversion unit 14 inputs the environmental sound signal to the trained fourth acoustic model, and acquires the environmental sound feature quantity output from the fourth acoustic model.
- step S5 the mask estimation unit 15 uses the environmental sound feature quantity converted by the environmental sound feature quantity conversion unit 14 to weight the mixed feature quantity.
- step S6 the mask estimation unit 15 estimates a plurality of masks corresponding to each of the plurality of acoustic signals based on the mixed feature quantity weighted using the environmental sound feature quantity.
- the mask estimation unit 15 inputs the mixed feature weighted using the environmental sound feature to the trained second acoustic model, and acquires the plurality of masks corresponding to each of the plurality of acoustic signals output from the second acoustic model.
- the mask estimation unit 15 inputs the mixed feature weighted using the environmental sound feature to the trained second acoustic model, and acquires the first mask corresponding to the first acoustic signal and the second mask corresponding to the second acoustic signal output from the second acoustic model.
- In the first sound source separation process, the mask estimation unit 15 may estimate the plurality of masks corresponding to each of the plurality of acoustic signals based on the mixed feature converted by the mixed feature conversion unit 12, without weighting with the environmental sound feature. Then, in the second and subsequent sound source separation processes, the mask estimation unit 15 may estimate the plurality of masks corresponding to each of the plurality of acoustic signals based on the mixed feature weighted using the environmental sound feature.
- step S7 the acoustic signal conversion unit 16 uses the plurality of masks estimated by the mask estimation unit 15 to calculate, from the mixed feature converted by the mixed feature conversion unit 12, a plurality of separated features corresponding to each of the plurality of acoustic signals.
- the acoustic signal conversion unit 16 multiplies the mixed feature converted by the mixed feature conversion unit 12 by each of the plurality of masks estimated by the mask estimation unit 15 in each time-frequency component, thereby calculating the plurality of separated features corresponding to each of the plurality of acoustic signals.
- the acoustic signal conversion unit 16 multiplies the mixed feature converted by the mixed feature conversion unit 12 by the first mask estimated by the mask estimation unit 15 in each time-frequency component to calculate the first separated feature corresponding to the first acoustic signal, and multiplies the mixed feature by the second mask estimated by the mask estimation unit 15 in each time-frequency component to calculate the second separated feature corresponding to the second acoustic signal.
- step S8 the acoustic signal conversion unit 16 converts the calculated plurality of separated feature amounts into a plurality of separated acoustic signals.
- the acoustic signal conversion unit 16 inputs the plurality of calculated separated feature amounts to the learned third acoustic model, and acquires the plurality of separated acoustic signals output from the third acoustic model.
- the acoustic signal conversion unit 16 inputs the calculated first separated feature to the trained third acoustic model and acquires the first separated acoustic signal output from the third acoustic model, and inputs the calculated second separated feature to the trained third acoustic model and acquires the second separated acoustic signal output from the third acoustic model.
- the acoustic signal output unit 17 outputs the plurality of separated acoustic signals converted by the acoustic signal conversion unit 16.
- the acoustic signal output unit 17 outputs the first separated acoustic signal and the second separated acoustic signal converted by the acoustic signal conversion unit 16.
- the environmental sound section estimation unit 18 estimates, based on the plurality of separated acoustic signals converted by the acoustic signal conversion unit 16, the environmental sound section that contains only the acoustic signal representing the environmental sound among all input sections of the mixed acoustic signal. For example, the environmental sound section estimation unit 18 estimates the environmental sound section based on the first separated acoustic signal and the second separated acoustic signal converted by the acoustic signal conversion unit 16.
- step S11 the environmental sound signal extraction unit 19 extracts, from the mixed acoustic signal acquired by the mixed acoustic signal acquisition unit 11, the mixed acoustic signal in the environmental sound section estimated by the environmental sound section estimation unit 18 as the environmental sound signal.
- step S12 the environmental sound signal extraction unit 19 stores the extracted environmental sound signal in the environmental sound signal storage unit 13. After the process of step S12 is completed, the process returns to step S1.
- In this way, the mixed acoustic signal in the environmental sound section, which contains only the acoustic signal representing the environmental sound, is extracted as the environmental sound signal, the mixed feature is weighted using the environmental sound feature indicating the characteristics of the environmental sound signal, and the plurality of masks are estimated based on the weighted mixed feature. Therefore, the plurality of masks are estimated using the environmental sound signal extracted from the mixed acoustic signal in real time, and the mixed acoustic signal is separated into the plurality of separated acoustic signals using the estimated masks.
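The real-time loop summarized above (convert, weight, estimate masks, separate, update the stored environmental sound) can be sketched schematically. Every function below is a trivial stub standing in for a trained acoustic model or the VAD-based section estimation; all constants and shapes are assumptions, not the disclosed models.

```python
stored_env_signal = [0.0] * 4                  # initial stored environmental sound

def to_mixed_feature(signal):                  # stub for the first acoustic model
    return list(signal)

def to_env_feature(env_signal):                # stub for the fourth acoustic model
    return [0.5] * 4

def estimate_masks(weighted):                  # stub for the second acoustic model
    return [0.7] * len(weighted), [0.3] * len(weighted)

def separate(feature, mask):                   # masking + stub third acoustic model
    return [f * m for f, m in zip(feature, mask)]

def env_only_sections(env_sep, voice_sep):     # stub section estimation
    return [abs(v) < 1e-6 for v in voice_sep]  # frames with no voice energy

def process(mixed_signal):
    """One pass of the separation loop (steps S1 through S12, schematically)."""
    global stored_env_signal
    mixed_feat = to_mixed_feature(mixed_signal)
    env_feat = to_env_feature(stored_env_signal)
    weighted = [x * g for x, g in zip(mixed_feat, env_feat)]
    m_env, m_voice = estimate_masks(weighted)
    env_sep = separate(mixed_feat, m_env)
    voice_sep = separate(mixed_feat, m_voice)
    sections = env_only_sections(env_sep, voice_sep)
    # store the newly extracted environmental-sound frames for the next pass
    extracted = [s for s, keep in zip(mixed_signal, sections) if keep]
    stored_env_signal = extracted or stored_env_signal
    return env_sep, voice_sep

env_sep, voice_sep = process([1.0, 0.0, 2.0, 0.0])
```

The point of the sketch is the data flow: the environmental sound extracted in one pass feeds the weighting of the next pass, which is what lets the device track environmental sound that changes in real time.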
- FIG. 4 is a flowchart for explaining the learning process of the learning device 2 according to this embodiment.
- the learning acoustic signal acquisition unit 21 acquires a learning mixed acoustic signal and a plurality of correct acoustic signals.
- the plurality of correct sound signals include a first correct sound signal representing environmental sounds around a person and a second correct sound signal representing a human voice.
- step S22 the mixed feature quantity conversion unit 12 converts the learning mixed acoustic signal acquired by the learning acoustic signal acquisition unit 21 into a mixed feature quantity indicating the features of the learning mixed acoustic signal.
- the mixed feature conversion unit 12 inputs the learning mixed acoustic signal acquired by the learning acoustic signal acquisition unit 21 to the untrained first acoustic model, and acquires the mixed feature output from the first acoustic model.
- step S23 the environmental sound feature conversion unit 14 converts the correct environmental sound signal, which represents the environmental sound corresponding to the correct answer among the plurality of correct acoustic signals acquired by the learning acoustic signal acquisition unit 21, into an environmental sound feature indicating the characteristics of the correct environmental sound signal.
- the environmental sound feature conversion unit 14 inputs the correct environmental sound signal among the plurality of correct acoustic signals acquired by the learning acoustic signal acquisition unit 21 to the untrained fourth acoustic model, and acquires the environmental sound feature output from the fourth acoustic model.
- step S24 the mask estimation unit 15 uses the environmental sound feature quantity converted by the environmental sound feature quantity conversion unit 14 to weight the mixed feature quantity.
- the mask estimation unit 15 estimates a plurality of masks corresponding to each of the plurality of correct acoustic signals based on the mixed feature quantity weighted using the environmental sound feature quantity.
- the mask estimation unit 15 inputs the mixed feature weighted using the environmental sound feature to the untrained second acoustic model, and acquires the plurality of masks corresponding to each of the plurality of correct acoustic signals output from the second acoustic model.
- the mask estimation unit 15 inputs the mixed feature weighted using the environmental sound feature to the untrained second acoustic model, and acquires the first mask corresponding to the first correct acoustic signal and the second mask corresponding to the second correct acoustic signal output from the second acoustic model.
- step S26 the acoustic signal conversion unit 16 calculates, using the plurality of masks estimated by the mask estimation unit 15, a plurality of separated features corresponding to each of the plurality of correct acoustic signals from the mixed feature converted by the mixed feature conversion unit 12. At this time, the acoustic signal conversion unit 16 multiplies the mixed feature by each of the plurality of masks in each time-frequency component to calculate the plurality of separated features corresponding to each of the plurality of correct acoustic signals.
- For example, the acoustic signal conversion unit 16 multiplies the mixed feature converted by the mixed feature conversion unit 12 by the first mask estimated by the mask estimation unit 15 in each time-frequency component to calculate the first separated feature corresponding to the first correct acoustic signal, and multiplies the mixed feature by the second mask estimated by the mask estimation unit 15 in each time-frequency component to calculate the second separated feature corresponding to the second correct acoustic signal.
- step S27 the acoustic signal conversion unit 16 converts the calculated plurality of separated feature quantities into a plurality of separated acoustic signals.
- the acoustic signal conversion unit 16 inputs the plurality of calculated separated features to the untrained third acoustic model, and acquires the plurality of separated acoustic signals output from the third acoustic model.
- For example, the acoustic signal conversion unit 16 inputs the calculated first separated feature to the untrained third acoustic model and acquires the first separated acoustic signal output from the third acoustic model, and inputs the calculated second separated feature to the untrained third acoustic model and acquires the second separated acoustic signal output from the third acoustic model.
- step S28 the parameter update unit 22 calculates the error between each of the plurality of separated acoustic signals output from the third acoustic model and the corresponding correct acoustic signal among the plurality of correct acoustic signals acquired by the learning acoustic signal acquisition unit 21. For example, the parameter update unit 22 calculates the error between the first separated acoustic signal output from the third acoustic model and the first correct acoustic signal, and the error between the second separated acoustic signal output from the third acoustic model and the second correct acoustic signal.
- step S29 the parameter updating unit 22 calculates the average of the multiple calculated errors. For example, the parameter updating unit 22 calculates the average of the error between the first separated acoustic signal and the first correct acoustic signal and the error between the second separated acoustic signal and the second correct acoustic signal.
- step S30 the parameter update unit 22 updates, based on the calculated average of the errors, each parameter of the first acoustic model of the mixed feature conversion unit 12, the second acoustic model of the mask estimation unit 15, the third acoustic model of the acoustic signal conversion unit 16, and the fourth acoustic model of the environmental sound feature conversion unit 14.
- One item of teacher data includes a learning mixed acoustic signal and a plurality of correct acoustic signals, and the learning acoustic signal acquisition unit 21 acquires one item of teacher data from among the plurality of items of teacher data. Then, the processing of steps S21 to S30 is performed for all of the plurality of items of teacher data, whereby the first acoustic model, the second acoustic model, the third acoustic model, and the fourth acoustic model are learned.
- the learning mixed acoustic signal and a plurality of correct acoustic signals corresponding to the correct answers of the plurality of acoustic signals included in the learning mixed acoustic signal are acquired.
- a learning mixed acoustic signal is input to the first acoustic model, and a mixed feature amount is output from the first acoustic model.
- a correct environmental sound signal representing an environmental sound corresponding to the correct answer among the plurality of correct sound signals is input to the fourth acoustic model, and the fourth acoustic model outputs an environmental sound feature amount.
- the mixed feature output from the first acoustic model is weighted using the environmental sound feature output from the fourth acoustic model.
- the weighted mixed feature is input to the second acoustic model, and the plurality of masks are output from the second acoustic model. Separated features corresponding to each of the plurality of acoustic signals are calculated from the mixed feature using the plurality of masks output from the second acoustic model. The plurality of calculated separated features are input to the third acoustic model, and the plurality of separated acoustic signals are output from the third acoustic model. Errors between each of the plurality of separated acoustic signals output from the third acoustic model and each of the plurality of correct acoustic signals are calculated. Each parameter of the first acoustic model, the second acoustic model, the third acoustic model, and the fourth acoustic model is updated based on the plurality of calculated errors.
- In this way, the first acoustic model, the second acoustic model, the third acoustic model, and the fourth acoustic model can be learned, and their estimation accuracy can be improved.
- sounds other than environmental sounds may be sounds emitted by specific objects.
- the sound emitted by the particular object may be, for example, the sound of the siren of a police car, fire engine or ambulance.
- the learning device 2 learns the first to fourth acoustic models using a learning mixed acoustic signal obtained by mixing an acoustic signal indicating the siren sound and an acoustic signal indicating the environmental sound other than the siren sound.
- the signal processing device 1 can separate and output the siren sound and environmental sounds other than the siren sound.
- the multiple masks are time-frequency masks, but the present disclosure is not limited to this.
- the multiple masks may be vectors indicating the degree of contribution of each element of the mixed feature to each acoustic signal.
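As a sketch of this alternative, a contribution-vector mask can be normalized so that, for each element of the mixed feature, the contributions across sources sum to one. The normalization convention and all values below are assumptions made purely for illustration.

```python
import numpy as np

# Raw per-source contribution scores for each element of the mixed feature
# (invented values; in the device these would come from an acoustic model).
mixed_feature = np.array([0.8, 0.2, 0.5, 0.9])
raw = np.array([[0.9, 0.1, 0.6, 0.2],   # contribution to source 1
                [0.3, 0.8, 0.2, 0.9]])  # contribution to source 2

masks = raw / raw.sum(axis=0)           # normalize per element across sources
separated = masks * mixed_feature       # one row per separated feature

# Under this convention, the separated features reconstruct the mixture.
assert np.allclose(separated.sum(axis=0), mixed_feature)
```

Unlike the time-frequency masks in the main embodiment, such vectors describe element-wise contribution without assuming a time-frequency layout.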
- each component may be implemented by dedicated hardware or by executing a software program suitable for each component.
- Each component may be realized by a program execution unit such as a CPU or processor reading and executing a software program recorded in a recording medium such as a hard disk or a semiconductor memory.
- the program may be executed by another independent computer system by recording the program on a recording medium and transferring it, or by transferring the program via a network.
- Some or all of the components may be realized as an LSI (Large Scale Integration) circuit. Circuit integration is not limited to LSI, and may be realized by a dedicated circuit or a general-purpose processor. An FPGA (Field Programmable Gate Array) that can be programmed after LSI manufacture, or a reconfigurable processor in which the connections and settings of circuit cells inside the LSI can be reconfigured, may also be used. Furthermore, some or all of the functions of each component may be realized by a processor such as a CPU executing a program.
- The order in which the steps shown in the above flowcharts are executed is for illustrative purposes, in order to describe the present disclosure specifically, and an order other than the above may be used as long as the same effect is obtained. Also, some of the above steps may be executed concurrently (in parallel) with other steps.
- the technology according to the present disclosure eliminates the need for complicated preparatory processing for creating in advance auxiliary information about an acoustic signal of a target sound source, and prevents deterioration in the performance of separating a plurality of acoustic signals from a mixed acoustic signal. Therefore, it is useful as a technique for separating a plurality of acoustic signals from a mixed acoustic signal.
Abstract
Description
In the above conventional technology, when sound source separation is performed using auxiliary information about a target sound source, the sound of the target sound source must be recorded in advance and the auxiliary information generated from the recorded sound, which may require complicated preparatory processing for creating the auxiliary information about the acoustic signal of the target sound source beforehand.
FIG. 1 is a block diagram showing the configuration of a signal processing device 1 according to an embodiment of the present disclosure.
Claims (10)
- A signal processing device comprising:
a mixed acoustic signal acquisition unit that acquires a mixed acoustic signal containing a plurality of acoustic signals;
a mixed feature conversion unit that converts the mixed acoustic signal into a mixed feature representing features of the mixed acoustic signal;
a mask estimation unit that estimates, based on the mixed feature, a plurality of masks each corresponding to one of the plurality of acoustic signals;
an acoustic signal conversion unit that calculates, from the mixed feature using the plurality of masks, a plurality of separated features each corresponding to one of the plurality of acoustic signals, and converts the calculated plurality of separated features into a plurality of separated acoustic signals;
an environmental sound section estimation unit that estimates, based on the plurality of separated acoustic signals, an environmental sound section containing only an acoustic signal representing environmental sound within the entire input section of the mixed acoustic signal;
an environmental sound signal extraction unit that extracts, from the mixed acoustic signal, the mixed acoustic signal of the estimated environmental sound section as an environmental sound signal; and
an environmental sound feature conversion unit that converts the environmental sound signal into an environmental sound feature representing features of the environmental sound signal,
wherein the mask estimation unit weights the mixed feature using the environmental sound feature, and estimates the plurality of masks based on the weighted mixed feature.
- The signal processing device according to claim 1, wherein
the mixed feature conversion unit includes a first acoustic model that outputs the mixed feature when the mixed acoustic signal is input,
the mask estimation unit includes a second acoustic model that outputs the plurality of masks when the mixed feature is input,
the acoustic signal conversion unit includes a third acoustic model that outputs the plurality of separated acoustic signals when the calculated plurality of separated features are input, and
the environmental sound feature conversion unit includes a fourth acoustic model that outputs the environmental sound feature when the environmental sound signal is input.
- The signal processing device according to claim 2, further comprising:
a training acoustic signal acquisition unit that acquires a training mixed acoustic signal and a plurality of correct acoustic signals corresponding to correct answers for the plurality of acoustic signals contained in the training mixed acoustic signal; and
a parameter update unit that updates parameters of the first acoustic model, the second acoustic model, the third acoustic model, and the fourth acoustic model,
wherein the mixed feature conversion unit inputs the training mixed acoustic signal to the first acoustic model and acquires the mixed feature output from the first acoustic model,
the environmental sound feature conversion unit inputs, to the fourth acoustic model, a correct environmental sound signal representing environmental sound corresponding to a correct answer among the plurality of correct acoustic signals, and acquires the environmental sound feature output from the fourth acoustic model,
the mask estimation unit weights the mixed feature output from the first acoustic model using the environmental sound feature output from the fourth acoustic model, inputs the weighted mixed feature to the second acoustic model, and acquires the plurality of masks output from the second acoustic model,
the acoustic signal conversion unit calculates, from the mixed feature using the plurality of masks output from the second acoustic model, a plurality of separated features each corresponding to one of the plurality of correct acoustic signals, inputs the calculated plurality of separated features to the third acoustic model, and acquires the plurality of separated acoustic signals output from the third acoustic model, and
the parameter update unit calculates an error between each of the plurality of acoustic signals output from the third acoustic model and each of the plurality of correct acoustic signals, and updates the parameters of the first acoustic model, the second acoustic model, the third acoustic model, and the fourth acoustic model based on the calculated plurality of errors.
- The signal processing device according to any one of claims 1 to 3, wherein the plurality of acoustic signals include an acoustic signal representing the environmental sound and an acoustic signal representing a sound other than the environmental sound.
- The signal processing device according to claim 4, wherein the sound other than the environmental sound is a voice uttered by a person.
- The signal processing device according to claim 4, wherein the sound other than the environmental sound is a sound emitted by a specific object.
- The signal processing device according to any one of claims 1 to 3, wherein the environmental sound signal extraction unit stores the extracted environmental sound signal in a memory, and the environmental sound feature conversion unit reads the environmental sound signal from the memory and converts the read environmental sound signal into the environmental sound feature.
- The signal processing device according to any one of claims 1 to 3, further comprising an acoustic signal output unit that outputs the plurality of separated acoustic signals converted by the acoustic signal conversion unit.
- A signal processing method in which a computer:
acquires a mixed acoustic signal containing a plurality of acoustic signals;
converts the mixed acoustic signal into a mixed feature representing features of the mixed acoustic signal;
estimates, based on the mixed feature, a plurality of masks each corresponding to one of the plurality of acoustic signals;
calculates, from the mixed feature using the plurality of masks, a plurality of separated features each corresponding to one of the plurality of acoustic signals, and converts the calculated plurality of separated features into a plurality of separated acoustic signals;
estimates, based on the plurality of separated acoustic signals, an environmental sound section containing only an acoustic signal representing environmental sound within the entire input section of the mixed acoustic signal;
extracts, from the mixed acoustic signal, the mixed acoustic signal of the estimated environmental sound section as an environmental sound signal; and
converts the environmental sound signal into an environmental sound feature representing features of the environmental sound signal,
wherein, in estimating the plurality of masks, the mixed feature is weighted using the environmental sound feature, and the plurality of masks are estimated based on the weighted mixed feature.
- A signal processing program causing a computer to function as:
a mixed acoustic signal acquisition unit that acquires a mixed acoustic signal containing a plurality of acoustic signals;
a mixed feature conversion unit that converts the mixed acoustic signal into a mixed feature representing features of the mixed acoustic signal;
a mask estimation unit that estimates, based on the mixed feature, a plurality of masks each corresponding to one of the plurality of acoustic signals;
an acoustic signal conversion unit that calculates, from the mixed feature using the plurality of masks, a plurality of separated features each corresponding to one of the plurality of acoustic signals, and converts the calculated plurality of separated features into a plurality of separated acoustic signals;
an environmental sound section estimation unit that estimates, based on the plurality of separated acoustic signals, an environmental sound section containing only an acoustic signal representing environmental sound within the entire input section of the mixed acoustic signal;
an environmental sound signal extraction unit that extracts, from the mixed acoustic signal, the mixed acoustic signal of the estimated environmental sound section as an environmental sound signal; and
an environmental sound feature conversion unit that converts the environmental sound signal into an environmental sound feature representing features of the environmental sound signal,
wherein the mask estimation unit weights the mixed feature using the environmental sound feature, and estimates the plurality of masks based on the weighted mixed feature.
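The claimed pipeline can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the shapes, the stand-in `to_feature` function, the element-wise product used as the weighting, and the random softmax masks all stand in for the learned acoustic models of the disclosure. It shows only the data flow: the environmental sound feature weights the mixed feature before mask estimation, and the masks then split the mixed feature into per-source separated features.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_feature(signal):
    """Stand-in for an acoustic model converting a signal to a feature map.
    Here: a fake (frequency, time) magnitude spectrogram."""
    return np.abs(signal.reshape(64, -1))

n_sources = 2
mixed_signal = rng.standard_normal(64 * 50)  # mixed acoustic signal
env_signal = rng.standard_normal(64 * 50)    # extracted environmental sound section

mixed_feature = to_feature(mixed_signal)     # mixed feature conversion
env_feature = to_feature(env_signal)         # environmental sound feature conversion

# Weight the mixed feature with the environmental sound feature
# (an element-wise product is one simple choice of weighting).
weighted = mixed_feature * env_feature

# Mask estimation: a stand-in producing n_sources masks that sum to 1
# at every time-frequency point (softmax over the source axis).
logits = rng.standard_normal((n_sources,) + weighted.shape)
masks = np.exp(logits) / np.exp(logits).sum(axis=0)

# Apply each mask to the mixed feature to get the separated features,
# which the third acoustic model would convert to separated signals.
separated_features = [m * mixed_feature for m in masks]

# Because the masks sum to one, the separated features
# reconstruct the mixed feature exactly.
assert np.allclose(sum(separated_features), mixed_feature)
```

In the disclosure, `masks` would come from the second acoustic model given the weighted feature, and the parameter update of claim 3 would backpropagate the error between the converted separated signals and the correct signals through all four models; the sketch replaces those learned components with random arrays to keep the data flow visible.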
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---
JP2021191319 | 2021-11-25 | |
JP2021-191319 | 2021-11-25 | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023095470A1 true WO2023095470A1 (ja) | 2023-06-01 |
Family
ID=86539314
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2022/037913 WO2023095470A1 (ja) | 2021-11-25 | 2022-10-11 | Signal processing device, signal processing method, and signal processing program |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2023095470A1 (ja) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011141540A (ja) * | 2009-12-09 | 2011-07-21 | Sharp Corp | Audio signal processing device, television receiver, audio signal processing method, program, and recording medium |
JP2020134657A (ja) | 2019-02-18 | 2020-08-31 | Nippon Telegraph and Telephone Corporation | Signal processing device, learning device, signal processing method, learning method, and program |
- 2022-10-11 WO PCT/JP2022/037913 patent/WO2023095470A1/ja active Application Filing
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011141540A (ja) * | 2009-12-09 | 2011-07-21 | Sharp Corp | Audio signal processing device, television receiver, audio signal processing method, program, and recording medium |
JP2020134657A (ja) | 2019-02-18 | 2020-08-31 | Nippon Telegraph and Telephone Corporation | Signal processing device, learning device, signal processing method, learning method, and program |
Non-Patent Citations (2)
Title |
---|
MARC DELCROIX; KATERINA ZMOLIKOVA; TSUBASA OCHIAI; KEISUKE KINOSHITA; TOMOHIRO NAKATANI: "Speaker activity driven neural speech extraction", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 9 February 2021 (2021-02-09), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081876897 * |
MASATAKA SUZUKI, NOBUTAKA ONO, TORU TANIGUCHI, MASARU SAKAI, AKINORI KAWAMURA, MIQUEL ESPI, SHIGEKI SAGAYAMA: "Auxiliary-function-based independent vector analysis with non-speech frame information for speech enhancement", IEICE TECHNICAL REPORT, EA, IEICE, JP, vol. 112, no. 293 (EA2012-87), 9 November 2012 (2012-11-09), JP, pages 35 - 38, XP009546090 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3346462B1 (en) | Speech recognizing method and apparatus | |
CN110415687B (zh) | Speech processing method, apparatus, medium, and electronic device | |
CN110444214B (zh) | Speech signal processing model training method, apparatus, electronic device, and storage medium | |
KR101153093B1 (ko) | Method and apparatus for multi-sensory speech enhancement | |
JP4166153B2 (ja) | Device and method for discriminating a dog's emotions based on acoustic feature analysis of its barks | |
CN109817222B (zh) | Age recognition method, apparatus, and terminal device | |
KR20160089210A (ko) | Method and apparatus for training a language model, and method and apparatus for language recognition | |
US20070260455A1 (en) | Feature-vector compensating apparatus, feature-vector compensating method, and computer program product | |
JP2001517325A (ja) | Recognition system | |
JP6991041B2 (ja) | Generation device, generation method, and generation program | |
CN110837758B (zh) | Keyword input method, apparatus, and electronic device | |
CN111785288A (zh) | Speech enhancement method, apparatus, device, and storage medium | |
JPWO2017146073A1 (ja) | Voice quality conversion device, voice quality conversion method, and program | |
CN111667834B (zh) | Hearing aid device and hearing aid method | |
JP4705414B2 (ja) | Speech recognition device, speech recognition method, speech recognition program, and recording medium | |
JP6099032B2 (ja) | Signal processing device, signal processing method, and computer program | |
CN110797039B (zh) | Speech processing method, apparatus, terminal, and medium | |
US20180033432A1 (en) | Voice interactive device and voice interaction method | |
KR20190032868A (ko) | Speech recognition method and device | |
WO2023095470A1 (ja) | Signal processing device, signal processing method, and signal processing program | |
CN111028833B (zh) | Interaction method and apparatus for a vehicle | |
JP6891144B2 (ja) | Generation device, generation method, and generation program | |
CN113504891B (zh) | Volume adjustment method, apparatus, device, and storage medium | |
CN112489678B (zh) | Scene recognition method and device based on channel features | |
CN112002307B (zh) | Speech recognition method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22898254 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2023563546 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2022898254 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: 2022898254 Country of ref document: EP Effective date: 20240522 |