WO2022145015A1 - Signal processing device, signal processing method, and signal processing program - Google Patents

Signal processing device, signal processing method, and signal processing program Download PDF

Info

Publication number
WO2022145015A1
WO2022145015A1 PCT/JP2020/049247 JP2020049247W WO2022145015A1 WO 2022145015 A1 WO2022145015 A1 WO 2022145015A1 JP 2020049247 W JP2020049247 W JP 2020049247W WO 2022145015 A1 WO2022145015 A1 WO 2022145015A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
speaker
feature vector
speaker feature
signal processing
Prior art date
Application number
PCT/JP2020/049247
Other languages
English (en)
Japanese (ja)
Inventor
Keisuke Kinoshita
Tomohiro Nakatani
Marc Delcroix
Original Assignee
Nippon Telegraph and Telephone Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation filed Critical Nippon Telegraph and Telephone Corporation
Priority to PCT/JP2020/049247 priority Critical patent/WO2022145015A1/fr
Priority to JP2022572857A priority patent/JPWO2022145015A1/ja
Publication of WO2022145015A1 publication Critical patent/WO2022145015A1/fr

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/028Voice signal separating using properties of sound source
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/0308Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Definitions

  • The present invention relates to a signal processing device, a signal processing method, and a signal processing program.
  • Sound source separation technology separates an acoustic signal in which sounds from multiple sound sources are mixed into a signal for each sound source.
  • Of such technologies, the second sound source separation technique targets sound from multiple sound sources picked up by a single microphone.
  • The second sound source separation technique is said to be more difficult than the first sound source separation technique because information regarding microphone positions cannot be used.
  • The techniques described in Non-Patent Documents 1 to 3 are known as second sound source separation techniques, which perform sound source separation based only on the input acoustic signal, without using microphone position information.
  • The technique described in Non-Patent Document 1 separates an input acoustic signal into a predetermined number of sound sources.
  • A bi-directional long short-term memory neural network (hereinafter referred to as BLSTM) is used to estimate the mask for extracting each sound source.
  • The BLSTM parameters are updated so that the distance between a previously given correct separation signal and the separation signal extracted by applying the estimated mask to the observation signal is minimized.
  • The technique described in Non-Patent Document 2 performs sound source separation while changing the number of separated signals according to the number of sound sources contained in the input signal.
  • As in Non-Patent Document 1, the mask is estimated using a BLSTM, but in Non-Patent Document 2 the separation mask estimated at any one time corresponds to only one sound source in the input signal.
  • Using this separation mask, the observation signal is separated into a separation signal and a residual signal obtained by removing the separation signal from the observation signal.
  • It is then automatically determined whether another sound source signal still remains in the residual signal; if so, the residual signal is input to the BLSTM to extract another sound source. If no other sound source signal remains in the residual signal, the process ends at that point.
  • This mask estimation process is repeated until no sound source signal remains in the residual signal, extracting each sound source one by one, whereby sound source separation and sound source number estimation are achieved at the same time.
  • For this determination, methods such as thresholding the volume of the residual signal, or inputting the residual signal to another neural network to check whether another sound source remains, have been proposed.
  • The same BLSTM is used repeatedly for mask estimation.
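In outline, the source-by-source loop of Non-Patent Document 2 can be sketched as below. This is only an illustration of the described procedure, not the authors' implementation; `estimate_mask` and `source_remains` are hypothetical stand-ins for the BLSTM mask estimator and the residual check (volume threshold or a second network) mentioned above.

```python
import numpy as np

def iterative_separation(observation, estimate_mask, source_remains):
    """Sketch of the source-by-source separation loop of Non-Patent Document 2.

    observation:    magnitude spectrogram of the mixture (T x F NumPy array)
    estimate_mask:  callable returning a separation mask for one source (e.g. a BLSTM)
    source_remains: callable deciding whether another source is left in the residual
    """
    residual, separated = observation.copy(), []
    while source_remains(residual):           # e.g. volume threshold or a second network
        mask = estimate_mask(residual)        # mask for exactly one source in the input
        separated.append(mask * residual)     # extract that source
        residual = (1.0 - mask) * residual    # remove it from the residual
    return separated                          # number of sources = number of iterations

# toy usage with a random "mixture" and dummy callbacks (illustrative only)
mixture = np.abs(np.random.randn(100, 257))
sources = iterative_separation(
    mixture,
    estimate_mask=lambda r: (r > r.mean()).astype(float),   # stand-in for the BLSTM mask
    source_remains=lambda r: r.mean() > 0.5,                # stand-in for the residual check
)
```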
  • The techniques described in Non-Patent Documents 1 and 2 are batch processes that are applied to the entire input signal at once, and therefore lack real-time capability. For example, when trying to apply these processes to a recorded conference, processing cannot even start until the recording of the conference audio is complete. That is, the techniques described in Non-Patent Documents 1 and 2 cannot be applied to applications in which sound source separation is applied from the start of a conference and each separated voice is sequentially transcribed using automatic speech recognition.
  • The technique described in Non-Patent Document 3 was devised in view of such a problem.
  • In the technique described in Non-Patent Document 3, the input signal is divided into a plurality of time blocks (blocks of about 5 to 10 seconds in length), and processing is applied sequentially to each block.
  • In the technique described in Non-Patent Document 3, the method of Non-Patent Document 2 is applied to the first block.
  • The speaker feature vectors corresponding to the extracted speakers are also calculated and output.
  • In the second block, each of the speaker feature vectors obtained from the processing of the first block is used to repeatedly extract the voices corresponding to those speakers, in the same order as in the first block. In the technique described in Non-Patent Document 3, the speakers are likewise extracted in the third and subsequent blocks in the order in which they were detected in the past blocks.
  • In the technique described in Non-Patent Document 3, when a new speaker who did not appear in the first block appears in the second block, the voice component of the new speaker remains in the residual signal even after all the speakers that appeared in the first block have been extracted from the observation signal of the second block. Therefore, by using the above-mentioned determination process, the presence of the new speaker can be detected, and as a result, the mask and the speaker feature vector of the new speaker can be estimated.
  • In this way, sound source separation and sound source number estimation can be performed even for long recordings, in the form of block-online processing.
  • Making the order of speaker extraction common among the blocks also serves to track speakers across time blocks (that is, to determine which separated sound obtained in one time block and which separated sound obtained in a different time block belong to the same speaker).
  • If the technique described in Non-Patent Document 3 is used, even when a speaker extracted in a past block does not speak in the current block, that speaker's feature vector is used to extract the signal corresponding to the speaker, just as for the other speakers.
  • However, if the speaker's signal is not included in the block, the mask corresponding to the speaker is 0, and as a result a signal with a sound pressure of 0 is ideally extracted.
  • When the technique described in Non-Patent Document 3 is used, it is not possible to know in advance whether a specific speaker is speaking in a new time block. Therefore, for every speaker that has been extracted even once in a past block, sound source extraction from the new time block must be attempted using that speaker's feature vector, and only by examining the sound pressure of the output signal can it then be determined whether the speaker is actually speaking.
  • That is, the technique described in Non-Patent Document 3 performs extraction processing that is not originally necessary for speakers who are not speaking (silent speakers).
  • Furthermore, when the technique described in Non-Patent Document 3 is used, the speakers of a new time block are extracted in the order in which their utterances were made in the past.
  • However, the optimum order of speaker extraction is likely to differ from block to block (see Non-Patent Document 1). That is, trying to extract speakers in the same order in all blocks increases the amount of processing and impairs the optimality of the processing.
  • The present invention has been made in view of the above, and an object of the present invention is to provide a signal processing device, a signal processing method, and a signal processing program capable of improving processing accuracy for an acoustic signal while reducing the amount of processing.
  • In order to solve the above problem, the signal processing device according to the present invention includes: an extraction unit that, for an input acoustic signal, repeatedly extracts a speaker feature vector for each time block as many times as the number of speakers present in that time block; an external memory unit that stores the speaker feature vectors extracted by the extraction unit; an instruction unit that, when the speaker feature vector of a speaker who has never appeared in the time blocks so far is extracted, instructs that the speaker feature vector be written to an unused memory slot of the external memory unit, and, when the speaker feature vector of a speaker who has already appeared in the time blocks so far is extracted, instructs that the speaker feature vector be written to the memory slot of the external memory unit corresponding to that speaker; a writing unit that receives an instruction from the instruction unit and writes the speaker feature vector to the external memory unit; a reading unit that receives an instruction from the instruction unit and reads the speaker feature vector from the external memory unit; and a processing unit that executes signal processing based on the speaker feature vector read by the reading unit.
  • FIG. 1 is a diagram schematically showing an example of a configuration of a signal processing device according to an embodiment.
  • FIG. 2 is a diagram schematically showing another example of the configuration of the signal processing device according to the embodiment.
  • FIG. 3 is a diagram schematically showing another example of the configuration of the signal processing device according to the embodiment.
  • FIG. 4 is a diagram schematically showing another example of the configuration of the signal processing device according to the embodiment.
  • FIG. 5 is a flowchart showing a processing procedure of the data processing method according to the embodiment.
  • FIG. 6 is a diagram showing an example of a computer in which a signal processing device is realized by executing a program.
  • FIG. 1 is a diagram showing an example of a configuration of a signal processing device according to an embodiment.
  • The signal processing device 10 is realized, for example, by reading a predetermined program into a computer including a ROM (Read Only Memory), a RAM (Random Access Memory), and a CPU (Central Processing Unit), and having the CPU execute the predetermined program.
  • The signal processing device 10 includes a speaker feature vector extraction unit 11 (extraction unit), an external memory unit 12, a memory control instruction unit 13 (instruction unit), an external memory writing unit 14 (writing unit), an external memory reading unit 15 (reading unit), a sound source extraction unit 16, an utterance section detection unit 17, a voice recognition unit 18, a repetition control unit 19, and a learning unit 20.
  • When an acoustic signal is input, the signal processing device 10 divides it into time blocks and inputs the acoustic signal of each divided time block to the speaker feature vector extraction unit 11.
  • The speaker feature vector extraction unit 11 repeatedly extracts, from the input acoustic signal (hereinafter referred to as the observation signal), a speaker feature vector for each time block, as many times as the number of speakers present in the block.
  • The number of repetitions of the speaker feature vector extraction process is set by the repetition control unit 19.
  • The speaker feature vector extraction unit 11 can use various methods for extracting speaker feature vectors. For example, when the technique described in Non-Patent Document 3 or the technique described in Reference 1 is used, a speaker feature vector can be extracted in the first time block for each of an arbitrary number of speakers included in that block.
  • Reference 1: Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu, "End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors", arXiv:2005.09921, 2021.
  • The extraction of the speaker feature vector can be formulated as in Eq. (1).
  • In Eq. (1), I_b is the total number of speakers in time block b, a_{b,i} is the speaker feature vector of the i-th speaker extracted in time block b, X_b is the observation signal of time block b, and NN_embed[·] is a neural network such as a BLSTM.
  • h_{b,0} is an initial vector (in some cases, a matrix) for informing the network that this is the first iteration of speaker feature vector extraction, and is set appropriately according to NN_embed[·].
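As a concrete illustration of Eq. (1), the following minimal sketch assumes that NN_embed[·] is a BLSTM-based encoder which, given the block observation X_b and the previous internal state h_{b,i-1}, returns one speaker feature vector a_{b,i} and an updated state h_{b,i}. The network architecture, tensor shapes, and class names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SpeakerEmbeddingExtractor(nn.Module):
    """Hypothetical NN_embed[.]: returns one speaker feature vector per call (sketch)."""

    def __init__(self, feat_dim: int = 80, hidden_dim: int = 256, emb_dim: int = 128):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.state_proj = nn.Linear(2 * hidden_dim, 2 * hidden_dim)
        self.emb_proj = nn.Linear(2 * hidden_dim, emb_dim)

    def forward(self, x_b: torch.Tensor, h_prev: torch.Tensor):
        # x_b: (1, T, feat_dim) observation of time block b
        # h_prev: (1, 2*hidden_dim) state encoding which speakers were already extracted
        frames, _ = self.blstm(x_b)                 # (1, T, 2*hidden_dim)
        pooled = frames.mean(dim=1) + h_prev        # condition on the previous iteration
        a_bi = self.emb_proj(torch.tanh(pooled))    # speaker feature vector a_{b,i}
        h_bi = torch.tanh(self.state_proj(pooled))  # updated internal state h_{b,i}
        return a_bi, h_bi


def extract_block_speakers(net, x_b, num_speakers, hidden_dim=256):
    """Repeat Eq. (1): (a_{b,i}, h_{b,i}) = NN_embed[X_b, h_{b,i-1}], i = 1..I_b."""
    h = torch.zeros(1, 2 * hidden_dim)  # h_{b,0}: initial vector for the first iteration
    vectors = []
    for _ in range(num_speakers):       # in the device, the repetition control unit decides when to stop
        a, h = net(x_b, h)
        vectors.append(a)
    return vectors
```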
  • The external memory unit 12 is a memory for storing the speaker feature vectors extracted by the speaker feature vector extraction unit 11.
  • The external memory unit 12 has a plurality of memory addresses, and stores the speaker feature vector of one speaker at one memory address.
  • The external memory unit 12 may be a semiconductor memory capable of rewriting data, such as a RAM (Random Access Memory), a flash memory, or an NVSRAM (Non Volatile Static Random Access Memory).
  • The external memory unit 12 may also be a storage device such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), or an optical disk.
  • The memory control instruction unit 13 receives the speaker feature vector each time a speaker feature vector is extracted by the speaker feature vector extraction unit 11.
  • When the received speaker feature vector is that of a speaker who has not appeared before, the memory control instruction unit 13 instructs the external memory unit 12 to write the vector to a new memory address. That is, when the speaker feature vector of a speaker who has never appeared in the time blocks so far is extracted, the memory control instruction unit 13 instructs that this speaker feature vector be written to an unused memory slot of the external memory unit 12.
  • When the received speaker feature vector is that of a speaker who has appeared in the past, the memory control instruction unit 13 instructs that the speaker feature vector be written, in an appropriate form, to the memory address of the external memory unit 12 corresponding to that speaker. That is, when the speaker feature vector of a speaker who has already appeared in the time blocks so far is extracted, the memory control instruction unit 13 instructs that the speaker feature vector be written to the memory slot of the external memory unit 12 corresponding to this speaker.
  • The memory control instruction unit 13 gives instructions to write and read speaker feature vectors to and from the external memory unit 12 using a neural-network-based mechanism such as those described in References 2 to 4.
  • Reference 2: A. Graves, G. Wayne, and I. Danihelka, "Neural Turing Machines", arXiv:1410.5401, 2014.
  • Reference 3: Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez, Edward Grefenstette, and Tiago Ramalho, "Hybrid computing using a neural network with dynamic external memory", Nature, 538(7626):471-476, 2016.
  • Reference 4: Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus, "End-To-End Memory Networks", Advances in Neural Information Processing Systems 28, pp. 2440-2448, 2015.
  • The external memory writing unit 14 writes the speaker feature vector to the external memory unit 12 in response to an instruction from the memory control instruction unit 13.
  • The external memory reading unit 15 receives an instruction from the memory control instruction unit 13 and reads a speaker feature vector for each speaker from the external memory unit 12. By implementing the exchange of information with the external memory unit 12 using a neural network in this way, the signal processing device 10 can optimize the entire system by error backpropagation.
  • The memory control instruction unit 13 generates the instruction vector for the external memory writing unit 14, which writes to the external memory unit 12, and for the external memory reading unit 15, which reads from the external memory unit 12, according to the procedures described in References 2 to 4 and the like.
  • Specifically, the memory control instruction unit 13 first generates a 1×M key vector k_{b,i} using a neural network, based on the output a_{b,i} from the speaker feature vector extraction unit 11 and its own past output (instruction vector) w_{b,i-1}.
  • The memory control instruction unit 13 also generates a parameter used in the calculation of Eq. (2).
  • The memory control instruction unit 13 then measures the closeness of this key vector to each address n of the current external memory M_{b,i}, and calculates the instruction vector w_{b,i} as in Eq. (2).
  • M_{b,i} is the external memory matrix (of size N×M) before the i-th speaker feature vector is written in the b-th time block, where N is the total number of memory addresses and M is the length of the vector that can be written to each address.
  • w_{b,i} is a 1×N-dimensional instruction vector, and its elements w_{b,i}(n) have the property shown in Eq. (4).
  • In Eq. (5), c is a constant for enhancing sparsity.
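Equations (2) to (5) are not reproduced in this text, but the description corresponds to the content-based addressing of References 2 to 4. The sketch below shows one common formulation under that assumption: the key vector k_{b,i} is compared with every memory address by cosine similarity, and a softmax sharpened by the constant c yields an instruction vector whose elements are non-negative and sum to 1; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def instruction_vector(key: torch.Tensor, memory: torch.Tensor, c: float = 10.0) -> torch.Tensor:
    """Content-based addressing in the style of References 2-4 (sketch).

    key:    (M,)   key vector k_{b,i} produced by the memory control instruction unit
    memory: (N, M) external memory matrix M_{b,i} (N addresses, word length M)
    c:      sharpening constant that enhances sparsity (cf. Eq. (5))
    returns w: (N,) instruction vector; non-negative elements summing to 1 (cf. Eq. (4))
    """
    similarity = F.cosine_similarity(memory, key.unsqueeze(0), dim=1)  # closeness to each address n
    return torch.softmax(c * similarity, dim=0)
```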
  • The external memory writing unit 14 writes the speaker feature vector to the external memory unit 12 and updates the memory based on the instruction vector w_{b,i} output from the memory control instruction unit 13.
  • The writing process by the external memory writing unit 14 is performed as a pair of an erasing process and a writing process, as described below.
  • The speaker feature vector a_{b,i} extracted by the speaker feature vector extraction unit 11 is passed to the external memory writing unit 14 and written to the external memory unit 12 in an appropriate form.
  • The external memory writing unit 14 erases the memory according to Eq. (6), based on the erasing vector e_{b,i}, which is a 1×N vector, and the instruction vector w_{b,i}.
  • The erasing vector e_{b,i} is also an output of the memory control instruction unit 13.
  • In Eq. (6), L is a 1×N vector consisting of N ones.
  • The erasing vector e_{b,i} is often set as in Eq. (7).
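A sketch of the paired erase/write update is given below, using the erase-then-add form of Reference 2 as a stand-in; the exact shapes used in Eqs. (6) to (8) may differ from this assumption. Setting the erase vector to all ones fully clears the addressed slot, which matches the common choice mentioned for Eq. (7).

```python
import torch

def write_memory(memory: torch.Tensor, w: torch.Tensor, erase: torch.Tensor,
                 a_bi: torch.Tensor) -> torch.Tensor:
    """Paired erase/write update in the style of Reference 2 (sketch).

    memory: (N, M) external memory before writing the i-th speaker vector
    w:      (N,)   instruction vector w_{b,i} from the memory control instruction unit
    erase:  (M,)   erase vector (all ones fully erases the addressed slot)
    a_bi:   (M,)   speaker feature vector to be written
    """
    erased = memory * (1.0 - torch.outer(w, erase))  # erase step (cf. Eq. (6))
    return erased + torch.outer(w, a_bi)             # write/add step
```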
  • The external memory reading unit 15 receives an instruction from the memory control instruction unit 13, reads the speaker feature vector of each speaker from the external memory unit 12, and outputs it to the memory control instruction unit 13.
  • The external memory reading unit 15 reads the updated speaker feature vector from the external memory unit 12 based on the instruction vector w_{b,i} output from the memory control instruction unit 13.
  • The external memory reading unit 15 can read data from the external memory unit 12 as shown in Eq. (9) by using the instruction vector w_{b,i}.
  • The speaker feature vector r_{b,i} to be read is obtained by multiplying the memory matrix M_{b,i} by the instruction vector w_{b,i}.
  • A value of 0 for the n-th element w_{b,i}(n) of the instruction vector corresponds to an instruction not to read information from the n-th address; accordingly, information from memory addresses whose element values are close to 1 is mainly read, while information from addresses whose element values are close to 0 is hardly read at all.
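Reading with the same instruction vector then reduces to a weighted sum of the memory rows, as Eq. (9) describes; a minimal sketch:

```python
import torch

def read_memory(memory: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Read the speaker feature vector r_{b,i} = w_{b,i} . M_{b,i} (cf. Eq. (9), sketch).

    Rows whose instruction-vector element is close to 1 dominate the result;
    rows whose element is close to 0 contribute almost nothing.
    """
    return w @ memory  # (N,) @ (N, M) -> (M,)
```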
  • The sound source extraction unit 16 extracts the voice of the speaker corresponding to this speaker feature vector from the observation signal, based on the observation signal and the speaker feature vector output from the memory control instruction unit 13 (for details, see Non-Patent Document 3).
  • The sound source extraction unit 16 extracts the separated voice Ŝ_{b,i} as shown in Eq. (10), using the observation signal X_b and the speaker feature vector r_{b,i} read from the external memory unit 12.
  • As NN_extract[·], it is common to use a neural network such as a BLSTM or a convolutional neural network.
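A minimal sketch of Eq. (10) follows, assuming NN_extract[·] is a BLSTM mask-estimation network conditioned on the read speaker feature vector (a SpeakerBeam-style design as in Reference 5); the module and its dimensions are illustrative, not the patented implementation.

```python
import torch
import torch.nn as nn

class TargetSpeakerExtractor(nn.Module):
    """Hypothetical NN_extract[.]: estimates the separated voice of one speaker (sketch)."""

    def __init__(self, feat_dim: int = 80, emb_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim + emb_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        self.mask_head = nn.Linear(2 * hidden_dim, feat_dim)

    def forward(self, x_b: torch.Tensor, r_bi: torch.Tensor) -> torch.Tensor:
        # x_b:  (1, T, feat_dim) observation of time block b
        # r_bi: (1, emb_dim) speaker feature vector read from the external memory unit
        cond = r_bi.unsqueeze(1).expand(-1, x_b.size(1), -1)  # broadcast over frames
        h, _ = self.blstm(torch.cat([x_b, cond], dim=-1))
        mask = torch.sigmoid(self.mask_head(h))               # time-frequency mask
        return mask * x_b                                     # separated voice S_hat_{b,i}
```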
  • Even if the speaker feature vector extraction unit 11 extracts the speaker feature vectors in a different order between time blocks, the correspondence between the separated voices Ŝ_{b,i} (i = 1, ..., I_b) extracted in time block b and the separated voices Ŝ_{b',i} extracted in a different time block b' (b ≠ b') can be determined, and the speaker can thus be followed between time blocks, by using maximum-value detection and comparison processing on the above-described instruction vectors for the external memory unit 12.
  • The utterance section detection unit 17 outputs the utterance section detection result of the speaker corresponding to this speaker feature vector, based on the observation signal and the speaker feature vector output from the memory control instruction unit 13 (for details, see Reference 1).
  • The voice recognition unit 18 outputs the voice recognition result of the speaker corresponding to this speaker feature vector, based on the observation signal and the speaker feature vector output from the memory control instruction unit 13 (for details, see Reference 5).
  • Reference 5: Marc Delcroix, Shinji Watanabe, Tsubasa Ochiai, Keisuke Kinoshita, Shigeki Karita, Atsunori Ogawa, Tomohiro Nakatani, "End-to-end SpeakerBeam for single channel target speech recognition", Interspeech 2019, pp. 451-455, 2019.
  • The sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18 are examples of processing units that process voice. The signal processing device 10 described here as the signal processing device according to the embodiment has all of these processing units, but the present invention is not limited thereto.
  • The signal processing device according to the embodiment may be a signal processing device 10A having the sound source extraction unit 16 (see FIG. 2), a signal processing device 10B having the utterance section detection unit 17 (see FIG. 3), or a signal processing device 10C having the voice recognition unit 18 (see FIG. 4).
  • The repetition control unit 19 determines the number of repetitions of the extraction process by the speaker feature vector extraction unit 11, based on the state of the extraction process of the speaker feature vector extraction unit 11 or on the processing results of the sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18. In other words, the repetition control unit 19 determines the number of repetitions based on, for example, the state of the speaker feature vector extraction process of the speaker feature vector extraction unit 11. Alternatively, the repetition control unit 19 uses the output results from the utterance section detection unit 17 and the voice recognition unit 18 to determine the number of repetitions of the speaker feature vector extraction process. Ideally, the speaker feature vector extraction unit 11 should extract I_b speaker feature vectors in each block b.
  • Specifically, the repetition control unit 19 calculates a scalar value f̂_{b,i} (0 ≤ f̂_{b,i} ≤ 1) indicating whether the repetition should be stopped, as in Eq. (11), using the internal state vector h_{b,i} and the observation signal X_b output from the speaker feature vector extraction unit 11 together with the separated voice Ŝ_{b,i}. If the scalar value f̂_{b,i} is larger than a predetermined value, the repetition is stopped; if it is lower than the predetermined value, the repetition is continued.
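One way to realize the stop decision of Eq. (11) is a small network that maps the internal state, the block observation, and the current separated voice to a scalar in [0, 1]; the pooling and layer sizes below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class StopFlagEstimator(nn.Module):
    """Hypothetical estimator of the stop flag f_hat_{b,i} in Eq. (11) (sketch)."""

    def __init__(self, state_dim: int = 512, feat_dim: int = 80):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim + 2 * feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, h_bi, x_b, s_hat_bi) -> torch.Tensor:
        # h_bi: (1, state_dim); x_b and s_hat_bi: (1, T, feat_dim)
        pooled = torch.cat([h_bi, x_b.mean(dim=1), s_hat_bi.mean(dim=1)], dim=-1)
        return torch.sigmoid(self.mlp(pooled))  # 0 <= f_hat <= 1

# Repetition rule: stop when f_hat exceeds a predetermined value (e.g. 0.5), otherwise continue.
```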
  • The repetition control unit 19 inputs different auxiliary information to the neural network of the speaker feature vector extraction unit 11 each time the speaker feature vector extraction unit 11 extracts a speaker feature vector, so that the speaker feature vector extraction unit 11 outputs the extraction result of a speaker feature vector corresponding to a different sound source.
  • The repetition control unit 19 virtually recognizes all the speaker feature vectors included in the input acoustic signal.
  • The repetition control unit 19 also recognizes which speaker each speaker feature vector corresponds to.
  • The repetition control unit 19 inputs auxiliary information about the speaker feature vector of the next speaker to the speaker feature vector extraction unit 11, so that the speaker feature vector extraction unit 11 extracts the speaker feature vector of the next speaker.
  • The repetition control unit 19 stops the repetition of the extraction process by the speaker feature vector extraction unit 11 when the extraction of speaker feature vectors for all the speakers in the observation signal of the current time block is completed.
  • The learning unit 20 optimizes the parameters used by the signal processing device 10 using learning data.
  • The learning unit 20 optimizes the parameters of the neural networks constituting the signal processing device 10 based on a predetermined objective function.
  • The learning data consists of an input signal (observation signal), a correct clean signal corresponding to each sound source included in the input signal, correct utterance time information, correct utterance content, and the total number of speakers I (total speaker count) included in the input signal.
  • The learning unit 20 learns the parameters based on the learning data so that the error between the output of the signal processing device 10 and the correct answer information becomes small.
  • A loss function based on the error of Eq. (12) is provided so that the system output (separated voice) Ŝ_{b,i} becomes close to the correct separated voice S_{b,i}.
  • The cross-entropy loss function of Eq. (13) is provided for the scalar value used to determine the number of repetitions, in order to estimate the correct number of sound sources.
  • The learning unit 20 also updates the parameters so that the total number of addresses used in the external memory becomes equal to the total number of speakers I, that is, so that Eq. (14) becomes small at the same time.
  • T_count is a preset threshold value, and is generally set to 1.
  • min(·) is a function that outputs T_count when the input value is larger than T_count, and outputs the input value as it is when the input value is smaller than T_count.
  • Equation (14) is a loss function that encourages the number of memory addresses used in the external memory to match the total number of speakers I.
  • The learning unit 20 learns the parameters of the neural networks so that the value of Eq. (15), which is the sum of all the loss function values, becomes small.
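The individual terms of Eqs. (12) to (15) are not reproduced in this text. The sketch below combines a separation loss, a cross-entropy loss on the stop flags, and a memory-usage penalty that encourages the number of used external-memory addresses to match the total number of speakers I; the exact functional forms (for example, whether the separation loss is a squared error) are assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(s_hat, s_ref, f_hat, f_ref, usage, num_speakers, t_count=1.0):
    """Sketch of the combined objective (cf. Eqs. (12)-(15); forms are assumed).

    s_hat, s_ref: separated and reference (correct) voices, same shape
    f_hat, f_ref: predicted and correct stop flags, values in [0, 1]
    usage:        (N,) soft usage of each external-memory address over the utterance
    num_speakers: total number of speakers I in the learning data
    """
    loss_sep = F.mse_loss(s_hat, s_ref)                  # bring S_hat close to the correct voice
    loss_cnt = F.binary_cross_entropy(f_hat, f_ref)      # estimate the correct number of sources
    used = torch.clamp(usage, max=t_count).sum()         # min(., T_count), then count used addresses
    loss_mem = (used - num_speakers) ** 2                # match the usage to I (cf. Eq. (14))
    return loss_sep + loss_cnt + loss_mem                # cf. Eq. (15)
```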
  • Similarly, for the signal processing device 10B having the utterance section detection unit 17 and the signal processing device 10C having the voice recognition unit 18, the learning unit 20 may optimize the parameters of the neural networks using a loss function corresponding to the processing unit provided in the signal processing device. For the signal processing device 10B having the utterance section detection unit 17, see Reference 6, and for the signal processing device 10C having the voice recognition unit 18, see Reference 7.
  • Reference 6: Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, Shinji Watanabe, "End-to-End Neural Speaker Diarization with Permutation-free Objectives", Proc. Interspeech, pp. 4300-4304, 2019.
  • Reference 7: Shigeki Karita et al., "A comparative study on transformer vs RNN in speech applications", IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
  • FIG. 5 is a flowchart showing a processing procedure of the signal processing method according to the embodiment.
  • When the acoustic signal (observation signal) of time block b is input to the speaker feature vector extraction unit 11, the speaker feature vector extraction unit 11 estimates and extracts, from the observation signal of time block b, the speaker feature vector of one speaker present in time block b (step S4).
  • The memory control instruction unit 13 determines whether the speaker feature vector extracted by the speaker feature vector extraction unit 11 is the speaker feature vector of a speaker who has never appeared in the time blocks so far (step S5).
  • The case where the extracted vector is the speaker feature vector of a speaker who has never appeared in the time blocks so far (step S5: Yes) will be described first.
  • In this case, the memory control instruction unit 13 instructs that the speaker feature vector be written to an unused memory slot of the external memory unit 12, and the external memory writing unit 14 writes the speaker feature vector to the unused memory slot of the external memory unit 12 (step S6).
  • The case where the extracted vector is the speaker feature vector of a speaker who has already appeared in the time blocks so far (step S5: No) will be described next.
  • In this case, the memory control instruction unit 13 instructs that the speaker feature vector be written to the memory slot of the external memory unit 12 corresponding to this speaker, and the external memory writing unit 14 writes the speaker feature vector to the memory slot of the external memory unit 12 corresponding to this speaker (step S7).
  • The external memory reading unit 15 reads the speaker feature vector corresponding to the one speaker extracted by the speaker feature vector extraction unit 11 from the external memory unit 12 according to the instruction from the memory control instruction unit 13 (step S8), and outputs it to the memory control instruction unit 13.
  • The sound source extraction unit 16 extracts the voice of the speaker corresponding to this speaker feature vector from the observation signal, based on the observation signal and the speaker feature vector corresponding to the one speaker output from the memory control instruction unit 13 (step S9).
  • The utterance section detection unit 17 detects the utterance section of the speaker corresponding to this speaker feature vector, based on the observation signal and the speaker feature vector corresponding to the one speaker output from the memory control instruction unit 13, and outputs the result (step S10).
  • The voice recognition unit 18 outputs the voice recognition result of the speaker corresponding to this speaker feature vector, based on the observation signal and the speaker feature vector corresponding to the one speaker output from the memory control instruction unit 13 (step S11).
  • Steps S9 to S11 may be processed in parallel or in series, and the order of processing is not particularly limited.
  • Processing may also be executed according to which voice processing function units are provided; for example, when only the sound source extraction unit 16 is provided, the sound source extraction process of step S9 is executed, and the process proceeds to step S12.
  • The repetition control unit 19 determines whether or not to stop the repetition based on the processing results of the sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18 (step S12).
  • The repetition control unit 19 may instead determine whether or not to stop the repetition based on the state of the extraction process of the speaker feature vector extraction unit 11.
  • When the repetition control unit 19 determines that the repetition is not to be stopped (step S12: No), the process returns to step S4 in order to proceed with the processing for the next speaker in this time block b, and the speaker feature vector of the next speaker is extracted.
  • When the repetition control unit 19 determines that the repetition is to be stopped (step S12: Yes), the signal processing device 10 outputs the processing result for this time block b (step S13).
  • The output results are the sound source extraction result, the utterance section detection result, and the voice recognition result. The signal processing device 10 may also output the processing results of all the time blocks collectively.
  • The signal processing device 10 then determines whether the processing of all the time blocks is completed (step S14).
  • When the processing of all the time blocks is completed (step S14: Yes), the signal processing device 10 ends the processing for the input acoustic signal.
  • When the processing of all the time blocks is not completed (step S14: No), 1 is added to the time block index b in order to process the next time block (step S15), the process returns to step S4, and the processing is continued.
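Putting the flowchart of FIG. 5 together, the block-online processing can be summarized by the following high-level sketch; every object and method name stands in for one of the units described above and is illustrative only.

```python
def process_signal(blocks, extractor, memory, controller, processors, repetition_control):
    """High-level sketch of steps S4-S15 of FIG. 5 (all names are illustrative)."""
    results = []
    for x_b in blocks:                                     # one time block at a time (S15 advances b)
        block_result, h = [], None
        while True:
            a, h = extractor(x_b, h)                       # S4: extract one speaker feature vector
            if controller.is_new_speaker(a, memory):       # S5: never-appeared speaker?
                slot = memory.unused_slot()                # S6: write to an unused memory slot
            else:
                slot = memory.slot_of(a)                   # S7: write to that speaker's existing slot
            memory.write(slot, a)
            r = memory.read(slot)                          # S8: read back the stored speaker feature vector
            block_result.append([p(x_b, r) for p in processors])  # S9-S11: extraction / VAD / ASR
            if repetition_control.should_stop(h, x_b, block_result[-1]):  # S12
                break
        results.append(block_result)                       # S13: output the result for this block
    return results                                         # S14: end when all blocks are processed
```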
  • As described above, the signal processing device 10 repeatedly extracts the speaker feature vector for each time block and writes it to the external memory unit 12. At this time, when the speaker feature vector of a speaker who has never appeared in the time blocks so far is extracted, the signal processing device 10 writes the speaker feature vector to an unused memory slot of the external memory unit 12. When the speaker feature vector of a speaker who has already appeared in the time blocks so far is extracted, the signal processing device 10 writes the speaker feature vector to the memory slot of the external memory unit 12 corresponding to this speaker.
  • The signal processing device 10 does not execute the extraction process itself for a speaker who is not speaking (silent speaker). Therefore, unlike the conventional technique, the signal processing device 10 does not need to attempt sound source extraction for silent speakers in every time block, so the amount of processing can be reduced compared with the conventional technique; and since processing can be appropriately performed only for the speakers who are actually speaking, the processing accuracy can be improved.
  • Further, since the signal processing device 10 extracts the speaker feature vectors for each time block, it is not necessary to extract the speaker feature vectors in the same order in all the time blocks as in the conventional technique, so the optimality of the processing is not impaired.
  • As a result, the signal processing device 10 can improve the processing accuracy for the acoustic signal while reducing the amount of processing.
  • FIG. 6 is a diagram showing an example of a computer in which the signal processing device 10 is realized by executing a program.
  • The computer 1000 has, for example, a memory 1010 and a CPU 1020.
  • The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these parts is connected by a bus 1080.
  • The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012.
  • The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System).
  • The hard disk drive interface 1030 is connected to the hard disk drive 1090.
  • The disk drive interface 1040 is connected to the disk drive 1100.
  • A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100.
  • The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120.
  • The video adapter 1060 is connected to, for example, the display 1130.
  • The hard disk drive 1090 stores, for example, an OS (Operating System) 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the signal processing device 10 is implemented as a program module 1093 in which computer-executable code is described.
  • The program module 1093 is stored in, for example, the hard disk drive 1090.
  • For example, a program module 1093 for executing processing similar to the functional configuration of the signal processing device 10 is stored in the hard disk drive 1090.
  • The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
  • The setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 and executes them as needed.
  • The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.) and read from the other computer by the CPU 1020 via the network interface 1070.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides a signal processing device (10) comprising: a speaker feature vector extraction unit (11) that repeatedly extracts, for an input acoustic signal, a speaker feature vector for each time block a number of times corresponding to the number of speakers present in the time block; a memory control instruction unit (13) that instructs that the speaker feature vector be written to an unused memory slot of an external memory unit (12) when the speaker feature vector of a speaker who has not appeared in the time blocks so far is extracted, and that instructs that the speaker feature vector be written to the memory slot of the external memory unit (12) corresponding to a speaker who has already appeared when the speaker feature vector of a speaker who has already appeared in the time blocks so far is extracted; and a sound source extraction unit (16), an utterance section detection unit (17), and a voice recognition unit (18) that execute signal processing based on the speaker feature vector.
PCT/JP2020/049247 2020-12-28 2020-12-28 Signal processing device, signal processing method, and signal processing program WO2022145015A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2020/049247 WO2022145015A1 (fr) 2020-12-28 2020-12-28 Signal processing device, signal processing method, and signal processing program
JP2022572857A JPWO2022145015A1 (fr) 2020-12-28 2020-12-28

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/049247 WO2022145015A1 (fr) 2020-12-28 2020-12-28 Signal processing device, signal processing method, and signal processing program

Publications (1)

Publication Number Publication Date
WO2022145015A1 true WO2022145015A1 (fr) 2022-07-07

Family

ID=82259163

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/049247 WO2022145015A1 (fr) 2020-12-28 2020-12-28 Signal processing device, signal processing method, and signal processing program

Country Status (2)

Country Link
JP (1) JPWO2022145015A1 (fr)
WO (1) WO2022145015A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007233239A * 2006-03-03 2007-09-13 National Institute Of Advanced Industrial & Technology Utterance event separation method, utterance event separation system, and utterance event separation program
JP2020013034A * 2018-07-19 2020-01-23 Hitachi, Ltd. Speech recognition device and speech recognition method
WO2020039571A1 * 2018-08-24 2020-02-27 Mitsubishi Electric Corporation Voice separation device, voice separation method, voice separation program, and voice separation system
JP2020134657A * 2019-02-18 2020-08-31 Nippon Telegraph and Telephone Corporation Signal processing device, learning device, signal processing method, learning method, and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NEUMANN THILO VON; KINOSHITA KEISUKE; DELCROIX MARC; ARAKI SHOKO; NAKATANI TOMOHIRO; HAEB-UMBACH REINHOLD: "All-neural Online Source Separation, Counting, and Diarization for Meeting Analysis", ICASSP 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 12 May 2019 (2019-05-12), pages 91 - 95, XP033565103, DOI: 10.1109/ICASSP.2019.8682572 *

Also Published As

Publication number Publication date
JPWO2022145015A1 (fr) 2022-07-07

Similar Documents

Publication Publication Date Title
Kreuk et al. Fooling end-to-end speaker verification with adversarial examples
  • JP6679898B2 (ja) Keyword detection device, keyword detection method, and computer program for keyword detection
  • EP2943951B1 (fr) Speaker verification and identification using artificial-neural-network-based sub-phonetic unit discrimination
  • WO2020228173A1 (fr) Illegal speech detection method, apparatus and device, and computer-readable storage medium
  • WO2016095218A1 (fr) Speaker identification using spatial information
US20140350934A1 (en) Systems and Methods for Voice Identification
  • WO2019151507A1 (fr) Learning device, method, and program
Wang et al. Recurrent deep stacking networks for supervised speech separation
US8160875B2 (en) System and method for improving robustness of speech recognition using vocal tract length normalization codebooks
US10089977B2 (en) Method for system combination in an audio analytics application
  • CN110335608B Voiceprint verification method, apparatus, device, and storage medium
  • JP2020042257A (ja) Speech recognition method and device
  • JP6985221B2 (ja) Speech recognition device and speech recognition method
  • JP2010078877A (ja) Speech recognition device, speech recognition method, and speech recognition program
  • KR20210141115A (ko) Method and apparatus for estimating utterance time
Qian et al. Noise robust speech recognition on aurora4 by humans and machines
  • CN112489623A Training method for a language identification model, language identification method, and related device
  • KR101122591B1 (ko) Speech recognition apparatus and method using keyword recognition
  • CN108847251B Voice deduplication method, apparatus, server, and storage medium
  • JP5670298B2 (ja) Noise suppression device, method, and program
  • WO2022145015A1 (fr) Signal processing device, signal processing method, and signal processing program
  • JP2012063611A (ja) Speech recognition result search device, speech recognition result search method, and speech recognition result search program
Broughton et al. Improving end-to-end neural diarization using conversational summary representations
  • KR20130126570A (ko) Apparatus for discriminative training of acoustic models considering phoneme error results in keywords, and computer-readable recording medium on which a method therefor is recorded
  • JPH06266386A (ja) Word spotting method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20968026

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022572857

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20968026

Country of ref document: EP

Kind code of ref document: A1