JP7376896B2

JP7376896B2 - Learning device, learning method, learning program, generation device, generation method, and generation program

Info

Publication number: JP7376896B2
Application number: JP2020092463A
Authority: JP
Inventors: 邦夫柏野; 康智大石; 隆仁川西; 博俊竹内
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2020-05-27
Filing date: 2020-05-27
Publication date: 2023-11-09
Anticipated expiration: 2040-05-27
Also published as: JP2021189247A

Description

Application of Article 30, Paragraph 2 of the Patent Act: Proceedings of the 2020 Spring Meeting of the Acoustical Society of Japan, published March 2, 2020.

The present invention relates to a learning device, a learning method, a learning program, a generation device, a generation method, and a generation program.

Conventionally, a technique called sound source separation is known, in which the signal of a target sound is separated from an acoustic signal based on the physical properties of the target sound. Sound source separation exploits physical properties such as the direction of arrival of the target sound, its acoustic properties, timbre, voice quality, the statistical independence of the sound sources, and the commonality of element signals.

For example, one known sound source separation technique separates a target speaker's voice from mixed speech by adapting a sound source separation model to the target speaker, using speech actually uttered by that speaker (see, for example, Non-Patent Document 1).

Marc Delcroix, Katerina Zmolikova, Keisuke Kinoshita, Shoko Araki, Atsunori Ogawa, Tomohiro Nakatani, "SpeakerBeam: A computer that listens to the voice of the person you want to hear: selective listening to speech based on deep learning," NTT Technical Journal, 2018.9

However, conventional techniques may fail to separate sound sources accurately when the physical properties of the target sound are unknown, or when the acoustic signal contains another sound whose physical properties resemble those of the target sound.

For example, the technique described in Non-Patent Document 1 can separate the sound of a first sound source from which adaptation audio was actually obtained. For a second sound source whose sound has physical properties similar to those of the first, however, it is difficult to separate the sound unless adaptation audio has also been obtained from that second sound source.

To solve the problems described above and achieve the object, a learning device includes: a first feature calculation unit that uses a first model to calculate a first feature obtained by mapping, into a first space, information that expresses a label in a semantically interpretable form; a second feature calculation unit that uses a second model to calculate a second feature obtained by mapping an acoustic signal into the first space; an index calculation unit that calculates an index for evaluating a mask, the mask being generated based on the first feature and the second feature and used to extract the component corresponding to the label from the acoustic signal; and an update unit that updates the parameters of the first model and the parameters of the second model so that the index is optimized.

According to the present invention, sound source separation can be performed accurately even when the physical properties of the target sound are unknown, and even when the acoustic signal contains a sound whose physical properties resemble those of the target sound.

FIG. 1 is a diagram showing a configuration example of the learning device according to the first embodiment.
FIG. 2 is a schematic diagram showing the flow of the learning process according to the first embodiment.
FIG. 3 is a diagram explaining Triplet Loss.
FIG. 4 is a flowchart showing the processing flow of the learning device according to the first embodiment.
FIG. 5 is a diagram showing a configuration example of the learning device according to the second embodiment.
FIG. 6 is a schematic diagram showing the flow of the learning process according to the second embodiment.
FIG. 7 is a flowchart showing the processing flow of the learning device according to the second embodiment.
FIG. 8 is a diagram showing a configuration example of the generation device according to the third embodiment.
FIG. 9 is a schematic diagram showing the flow of the generation process according to the third embodiment.
FIG. 10 is a flowchart showing the processing flow of the generation device according to the third embodiment.
FIG. 11 is a schematic diagram showing the flow of the generation process according to the fourth embodiment.
FIG. 12 is a flowchart showing the processing flow of the generation device according to the fourth embodiment.
FIG. 13 is a diagram explaining how data were combined in the experiment.
FIG. 14 is a diagram showing the parameter settings used in the experiment.
FIG. 15 is a diagram showing a spectrogram obtained in the experiment.
FIG. 16 is a diagram showing a mask obtained in the experiment.
FIG. 17 is a diagram showing a mask obtained in the experiment.
FIG. 18 is a diagram showing an example of a computer that executes the generation program.

DESCRIPTION OF EMBODIMENTS
Embodiments of the learning device, learning method, learning program, generation device, generation method, and generation program according to the present application are described in detail below with reference to the drawings. The present invention is not limited to the embodiments described below.

[First embodiment]
The learning device according to the first embodiment trains a sound source separation model. The sound source separation model in this embodiment accepts, as input, an acoustic signal and information from which a label can be identified, and estimates a mask for extracting the target sound component from the acoustic signal. The sound source that emits the target sound is called the target sound source. A label is information that identifies the target sound source, and information from which a label can be identified is called label information.

The sound source separation model of this embodiment can identify a label from label information. Label information may be any information that expresses the label in a semantically interpretable form. Language is one such form, so label information may be expressed as a character string.

For example, the character string "violin" can be interpreted as meaning the violin, a type of musical instrument. Therefore, when the character string "violin" is input as label information, the sound source separation model of this embodiment identifies the label as violin. In other words, it estimates a mask for separating the violin sound as the target sound.

By contrast, with the technique described in Non-Patent Document 1, for example, separating the violin sound as the target sound required inputting into the model an audio signal obtained by actually playing a violin.

Label information is not limited to character strings. For example, label information may be an image showing the object identified by the label, or a signal obtained from the audio of an utterance containing a word sequence that corresponds to the label. When the label information is an audio signal, the sound source separation model of this embodiment identifies the label from the linguistic meaning contained in the audio, not from the physical properties of the audio signal. In the following description, a signal obtained by observing sound may be referred to as an acoustic signal.

[Configuration of the first embodiment]
First, the configuration of the learning device according to the first embodiment is described with reference to FIG. 1. FIG. 1 is a diagram showing a configuration example of the learning device according to the first embodiment. As shown in FIG. 1, the learning device 10 includes a label feature calculation unit 101, a spectrogram feature calculation unit 102, a mask generation unit 103, an index calculation unit 104, and an update unit 105. The learning device 10 also stores label encoder information 111 and audio encoder information 112.

The label feature calculation unit 101 takes as input label information from which a label can be identified, uses the first model to calculate a first feature obtained by mapping the label information into the first space, and outputs it. The label encoder information 111 is the information needed to construct the first model. When the first model is a neural network, the label encoder information 111 consists of parameters such as the weights and biases of each unit.

The spectrogram feature calculation unit 102 takes an acoustic signal as input, uses the second model to calculate a second feature obtained by mapping the input acoustic signal (hereinafter, the input acoustic signal) into the first space, and outputs it. The audio encoder information 112 is the information needed to construct the second model. When the second model is a neural network, the audio encoder information 112 consists of parameters such as the weights and biases of each unit.

The mask generation unit 103 takes the first feature and the second feature as input and, based on them, generates and outputs a mask for extracting the component corresponding to the label from the acoustic signal. The index calculation unit 104 takes as input the mask generated from the first feature and the second feature, calculates an index for evaluating the mask, and outputs it. The update unit 105 takes the index as input and updates the parameters of the first model and the parameters of the second model so that the index is optimized. That is, the update unit 105 updates and outputs the label encoder information 111 and the audio encoder information 112.

Note that the index calculation unit 104 may calculate the index without using the mask generated by the mask generation unit 103. In that case, the mask generation unit 103 need not generate a mask in this embodiment.

The learning process performed by the learning device 10 is described in detail with reference to FIG. 2. FIG. 2 is a schematic diagram showing the flow of the learning process according to the first embodiment. Each method shown in FIG. 2 is an example and may be replaced with another method as appropriate.

As shown in FIG. 2, the label feature calculation unit 101 inputs label information (Label input) to the label encoder (Label encoder). Here, the label information is assumed to be a character string such as "Writing" or "Cough".

The label feature calculation unit 101 applies one-hot encoding to the label information, converting it into an s-dimensional binary vector (Binary vector). It then inputs the s-dimensional binary vector into a three-layer fully connected neural network (Fully connected network) and obtains a label feature (Label feature), which is a 1×1×h-dimensional vector.

In this way, the label encoder maps label information into an h-dimensional latent space. The h-dimensional latent space is an example of the first space. The label encoder, which includes the fully connected neural network, is an example of the first model. The label feature is an example of the first feature, and h is an example of the first number of dimensions.

Note that a label encoder consisting of one-hot encoding and a three-layer fully connected neural network, as shown in FIG. 2, is only one example of a means for obtaining a label feature from label information. For example, the label encoder may use a vectorization method such as word2vec together with an LSTM (see, for example, Reference 1).
Reference 1: Shota Ikawa, Kunio Kashino, "Acoustic event search with an onomatopoeic query: measuring distance between onomatopoeic words and sounds," In Proc. Detection and Classification of Acoustic Scenes and Events (DCASE), 2018.

Depending on the capability of the label encoder, not only words but also sentences, onomatopoeic words, and the like may be used as label information. For example, word2vec can be used to convert a sentence composed of words into a vector.

In this way, the label feature calculation unit 101 uses the label encoder to calculate a label feature obtained by mapping, into the h-dimensional latent space, information that expresses a label in a semantically interpretable form. It calculates, as the label feature, the output obtained by inputting into the label encoder a vector created from a label expressed as a character string. The label feature it calculates includes at least an h-dimensional variable.

In particular, in the example of FIG. 2, the label feature calculation unit 101 inputs an s-dimensional one-hot vector (where s is a preset number of words), created from a label representing a given word, into the neural network included in the label encoder, and calculates the resulting 1×1×h-dimensional feature (where h is an arbitrary preset number) as the label feature.
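The label encoder described above (one-hot encoding followed by a three-layer fully connected network) can be sketched as follows. This is a minimal illustration, not the patented implementation: the vocabulary, the hidden width, the ReLU between layers, and the random initialisation are all assumptions introduced here, and the function names are hypothetical.

```python
import numpy as np

def one_hot(label, vocabulary):
    """Return an s-dimensional binary vector with a single 1 at the label's index."""
    vec = np.zeros(len(vocabulary), dtype=np.float32)
    vec[vocabulary.index(label)] = 1.0
    return vec

def init_label_encoder(s, h, hidden=32, seed=0):
    """Randomly initialise three fully connected layers: s -> hidden -> hidden -> h."""
    rng = np.random.default_rng(seed)
    sizes = [s, hidden, hidden, h]
    return [(rng.normal(0.0, 0.1, (m, n)).astype(np.float32),
             np.zeros(n, dtype=np.float32))
            for m, n in zip(sizes[:-1], sizes[1:])]

def label_encoder(binary_vector, params):
    """Map the one-hot vector to a 1 x 1 x h label feature."""
    out = binary_vector
    for i, (weight, bias) in enumerate(params):
        out = out @ weight + bias
        if i < len(params) - 1:
            out = np.maximum(out, 0.0)  # ReLU between layers (an assumption)
    return out.reshape(1, 1, -1)

vocabulary = ["Writing", "Cough", "Violin"]  # hypothetical label set, so s = 3
params = init_label_encoder(s=len(vocabulary), h=8)
label_feature = label_encoder(one_hot("Cough", vocabulary), params)
print(label_feature.shape)  # (1, 1, 8)
```

In training, the weights and biases in `params` would play the role of the label encoder information 111 and would be updated rather than left at their initial values.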

Meanwhile, the spectrogram feature calculation unit 102 inputs the input acoustic signal (Audio input) to the audio encoder. First, it calculates the amplitude spectrogram of the input acoustic signal. For example, it calculates an f×t amplitude spectrogram by performing a short-time Fourier transform (STFT) on frames formed with a 64 ms Hamming window, shifted in steps of 8 ms. Here, f and t are the number of frequency bins and the number of time bins, respectively.
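As a rough illustration of the STFT step, the sketch below computes an amplitude spectrogram with a 64 ms Hamming window and an 8 ms shift. The 16 kHz sampling rate, the test tone, and the absence of edge padding are assumptions made for this example; the source does not state them.

```python
import numpy as np

def amplitude_spectrogram(signal, sample_rate, window_ms=64, hop_ms=8):
    """Return an (f, t) amplitude spectrogram via a framed, Hamming-windowed FFT."""
    win = int(sample_rate * window_ms / 1000)   # samples per frame
    hop = int(sample_rate * hop_ms / 1000)      # samples per shift
    window = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop   # assumes len(signal) >= win
    frames = np.stack([signal[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # transpose to (f, t)

sample_rate = 16000                              # assumed sampling rate
time = np.arange(sample_rate) / sample_rate      # one second of audio
signal = np.sin(2 * np.pi * 440 * time)          # 440 Hz test tone
spec = amplitude_spectrogram(signal, sample_rate)
print(spec.shape)  # (513, 118)
```

With these assumed settings each frame holds 1024 samples, so f = 513 frequency bins, and one second of audio yields t = 118 frames.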

The spectrogram feature calculation unit 102 then inputs the amplitude spectrogram into Audio U-Net (see, for example, Reference 2) and obtains a spectrogram feature (Spectrogram feature), which is an f×t×h-dimensional vector.
Reference 2: Rouditchenko, Andrew, et al. "Self-supervised Audio-visual Co-segmentation." IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

Here, the amplitude spectrogram is the sequence of amplitude spectra at each time, concatenated along the time axis. The power spectrogram is the square of the amplitude spectrogram. For example, the spectrogram feature calculation unit 102 may input the logarithm of the power spectrogram into Audio U-Net, instead of the amplitude spectrogram, to obtain the spectrogram feature. In the following description, the amplitude spectrogram is simply called the spectrogram.
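The relationship between the amplitude spectrogram, the power spectrogram, and its logarithm can be written in a few lines. The small constant `eps` is an assumption added here to avoid taking the logarithm of zero; the source does not mention it.

```python
import numpy as np

def log_power_spectrogram(amplitude, eps=1e-10):
    """Square the amplitude spectrogram and take the natural logarithm."""
    power = amplitude ** 2           # power spectrogram
    return np.log(power + eps)       # eps guards against log(0) (an assumption)

amplitude = np.array([[1.0, np.e]])
log_power = log_power_spectrogram(amplitude)  # approximately [[0.0, 2.0]]
```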

The spectrogram feature can be regarded as a set of h-dimensional feature vectors that preserves the spectrogram size f×t. To process spectrograms obtained from input acoustic signals in mini-batches, the spectrogram feature calculation unit 102 may discard the frames after t when a spectrogram has more than t time frames, and zero-pad it when it has fewer than t.
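The truncation and zero-padding used to give every spectrogram in a mini-batch the same number of time frames might look like the sketch below. The function name is hypothetical, and the frequency axis is assumed to come first.

```python
import numpy as np

def fix_time_frames(spectrogram, t):
    """Truncate or zero-pad an (f, frames) spectrogram to exactly t time frames."""
    frames = spectrogram.shape[1]
    if frames >= t:
        return spectrogram[:, :t]                            # drop frames after t
    return np.pad(spectrogram, ((0, 0), (0, t - frames)))    # zero-pad at the end

long_spec = np.ones((4, 10))
short_spec = np.ones((4, 3))
batch = np.stack([fix_time_frames(s, 6) for s in (long_spec, short_spec)])
print(batch.shape)  # (2, 4, 6)
```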

In this way, the audio encoder maps the input acoustic signal into the h-dimensional latent space. The audio encoder, which includes Audio U-Net, is an example of the second model. The spectrogram feature is an example of the second feature.

Note that an audio encoder consisting of the STFT and Audio U-Net, as shown in FIG. 2, is only one example of a means for obtaining a spectrogram feature from an input acoustic signal. For example, instead of the STFT, the audio encoder may use MFCCs (mel-frequency cepstral coefficients), a band-pass filter bank, a CNN (convolutional neural network), or the like. The audio encoder may also use a model that combines CNNs instead of Audio U-Net.

In this way, the spectrogram feature calculation unit 102 calculates, as the spectrogram feature, a feature that includes at least an h-dimensional variable, a variable corresponding to time, and a variable corresponding to frequency components.

In particular, in the example of FIG. 2, the spectrogram feature calculation unit 102 inputs an f×t-dimensional spectrogram (where f is the number of frequency bins and t is the number of time bins), created from the input acoustic signal, into the neural network included in the audio encoder, and calculates the resulting f×t×h-dimensional feature as the spectrogram feature.

The mask generation unit 103 generates a mask for extracting the component corresponding to the label from the acoustic signal, based on the label feature and the spectrogram feature. As shown in FIG. 2, the mask generation unit 103 first calculates the inner product (Dot product) of the label feature and the spectrogram feature.

The mask generation unit 103 then obtains a mask by passing the inner product through an activation function. With ReLU6 as the activation function, x as the label feature, and y_{f,t} as the spectrogram feature at the time-frequency point (f, t), the mask generation unit 103 can calculate the mask m_{f,t} at the time-frequency point (f, t) as m_{f,t} = ReLU6(x^T y_{f,t}). The mask obtained by the mask generation unit 103 of the first embodiment, which has an element for each time-frequency point, may be called a spectrogram mask to distinguish it from the time mask described later.
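The mask computation m_{f,t} = ReLU6(x^T y_{f,t}) can be sketched directly in NumPy. The array layout (a 1×1×h label feature and an f×t×h spectrogram feature) follows the description above; the function names and the toy values are assumptions for illustration.

```python
import numpy as np

def relu6(z):
    """ReLU6 activation: clip values to the range [0, 6]."""
    return np.clip(z, 0.0, 6.0)

def spectrogram_mask(label_feature, spectrogram_feature):
    """Inner product over the h axis at every time-frequency point, then ReLU6."""
    x = label_feature.reshape(-1)            # (h,)
    return relu6(spectrogram_feature @ x)    # (f, t, h) @ (h,) -> (f, t)

label_feature = np.array([1.0, 2.0]).reshape(1, 1, 2)     # h = 2
spectrogram_feature = np.array([[[1.0, 1.0], [-1.0, 0.0]],
                                [[3.0, 3.0], [0.0, 0.5]]])  # f = 2, t = 2
mask = spectrogram_mask(label_feature, spectrogram_feature)
print(mask.tolist())  # [[3.0, 0.0], [6.0, 1.0]]
```

The inner products here are 3, -1, 9, and 1, so ReLU6 clips the negative value to 0 and the value 9 to 6.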

The index calculation unit 104 calculates the index based on the similarity between the label feature and the spectrogram feature. It may use the inner product itself, the mask, or a similarity score (Similarity score) calculated from the mask as the index. Besides the inner product, it can also calculate the index using the L1 distance, the L2 distance, the Lp distance, or various statistical divergences. In the example of FIG. 2 the mask generation unit 103 calculates the inner product, but the index calculation unit 104 may calculate it instead. The update unit 105 updates the parameters of the label encoder and the parameters of the audio encoder so that the index is minimized.

The learning device 10 can evaluate and update each model using Triplet Loss. FIG. 3 is a diagram explaining Triplet Loss. In FIG. 3, the functions f and g are the audio encoder and the label encoder, respectively. A_a is the input acoustic signal. L_a is the positive label, that is, the label to be associated with the acoustic signal A_a. L_b is a negative label, that is, a label other than the positive label. Sim is a function that computes similarity. There is an enormous number of ways to choose combinations of pair data. As one example, following the method described in Reference 3, the data useful for efficient learning may be selected from the data in a mini-batch: hard positives (data with a positive label that have a large loss with respect to the anchor) and hard negatives (data with a negative label that have a small loss with respect to the anchor).
Reference 3: Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "Facenet: A unified embedding for face recognition and clustering." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
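One possible reading of hard-negative selection within a mini-batch is sketched below. It assumes a square similarity matrix in which entry (i, j) is the similarity between audio i and label j, that the diagonal holds the positive pairs, that every label in the batch is distinct, and that a "small loss with respect to the anchor" corresponds to a high similarity. None of these details is specified in the source, and the function name is hypothetical.

```python
import numpy as np

def hardest_negatives(similarity):
    """For each anchor i, return the index j != i of the most similar negative label."""
    masked = similarity.astype(float).copy()
    np.fill_diagonal(masked, -np.inf)        # exclude the positive pair (i, i)
    return np.argmax(masked, axis=1)

similarity = np.array([[0.9, 0.2, 0.5],
                       [0.1, 0.8, 0.3],
                       [0.7, 0.6, 0.4]])
print(hardest_negatives(similarity).tolist())  # [2, 2, 0]
```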

The index calculation unit 104 calculates an index that becomes smaller as the similarity between the spectrogram feature and the label feature calculated from the label associated with the acoustic signal becomes larger, and that becomes larger as the similarity between the spectrogram feature and a label feature calculated from a label different from the one associated with the acoustic signal becomes larger.

For example, the index calculation unit 104 can calculate the GMP (global mean pooling) of the mask as the similarity. Because GMP aggregates the frequency and time components, the index calculation unit 104 can obtain a scalar similarity from the f×t×1-dimensional mask. In this case, the function Sim in FIG. 3 is the function that computes the GMP. The index calculation unit 104 can then calculate the loss function S_n − S_p as the index.
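The GMP similarity and the loss S_n − S_p can be sketched as follows. The sketch assumes that S_p is the GMP of the mask produced with the positive label and S_n the GMP of the mask produced with a negative label, which matches the behaviour described for the index but is not spelled out in the text; the function names are hypothetical.

```python
import numpy as np

def gmp_similarity(mask):
    """Global mean pooling: average the mask over all time-frequency points."""
    return float(np.mean(mask))

def triplet_style_loss(mask_positive, mask_negative):
    """Loss S_n - S_p: small when the positive mask is strong and the negative one weak."""
    s_p = gmp_similarity(mask_positive)
    s_n = gmp_similarity(mask_negative)
    return s_n - s_p

mask_positive = np.full((2, 3), 4.0)   # toy mask for the correct label
mask_negative = np.full((2, 3), 1.0)   # toy mask for a different label
print(triplet_style_loss(mask_positive, mask_negative))  # -3.0
```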

The index calculation unit 104 may also calculate the loss function as in equation (1), using the method described in Reference 4.
Reference 4: Yair Movshovitz-Attias, Alexander Toshev, Thomas K. Leung, Sergey Ioffe, and Saurabh Singh. No fuss distance metric learning using proxies. In ICCV, pages 360-368, 2017.

In equation (1), B is the mini-batch size. Each mini-batch contains combinations of an input acoustic signal and label information. x is the label feature output by the label encoder, and y is the spectrogram feature output by the audio encoder. i, j, and k are identifiers for the input acoustic signals and label information within the mini-batch. An x and a y whose identifiers match form a positive pair; an x and a y whose identifiers do not match form a negative pair. For example, negative pairs may be selected at random from the mini-batch.

Note that the component corresponding to the correct label is extracted by multiplying each time-frequency point of the spectrogram of the input acoustic signal by the value of the mask at that point. Consequently, the more of the component corresponding to the correct label the input acoustic signal contains, the larger the mask elements tend to be, and the larger the GMP value is expected to be. The index calculation unit 104 of this embodiment uses this property to calculate the similarity.
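Applying the mask is an element-wise multiplication over time-frequency points, as in the short sketch below. The toy values and the function name are assumptions for illustration only.

```python
import numpy as np

def apply_mask(spectrogram, mask):
    """Multiply each time-frequency point of the spectrogram by the mask value there."""
    return spectrogram * mask

spectrogram = np.array([[1.0, 2.0],
                        [3.0, 4.0]])
mask = np.array([[0.0, 1.0],
                 [0.5, 2.0]])
extracted = apply_mask(spectrogram, mask)
print(extracted.tolist())  # [[0.0, 2.0], [1.5, 8.0]]
```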

Note that a mask may also be used to block or attenuate non-target sounds. In that case, the ordering of the similarities of positive and negative pairs may be reversed, and the index calculation unit 104 may respond by, for example, flipping the sign of the loss function.

The update unit 105 uses a technique such as error backpropagation to update the parameters of both the label encoder and the audio encoder so that the loss function is minimized. Minimizing the loss function amounts to optimizing the mask.

[Processing flow of the first embodiment]
FIG. 4 is a flowchart showing the processing flow of the learning device according to the first embodiment. As shown in FIG. 4, the label feature calculation unit 101 first uses the label encoder to calculate a label feature from label information (step S101). Next, the spectrogram feature calculation unit 102 uses the audio encoder to calculate a spectrogram feature from the input acoustic signal (step S102). Steps S101 and S102 may be executed in the reverse order, or in parallel.

The mask generation unit 103 then calculates the inner product of the label feature and the spectrogram feature (step S103) and generates a spectrogram mask from the inner product (step S104). The index calculation unit 104 then aggregates the spectrogram mask to calculate the similarity (step S105).
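
As an illustration of steps S103 to S105, the following sketch computes the inner product at each time-frequency point, turns it into a mask, and aggregates the mask into a single similarity. The sigmoid used to form the mask and the global maximum used for aggregation are assumptions made for this example, as are the function names; the text only states that a mask is generated from the inner product and then aggregated.

```python
import math


def sigmoid(z):
    """Squash a real number into (0, 1). Assumed mask nonlinearity."""
    return 1.0 / (1.0 + math.exp(-z))


def spectrogram_mask(label_feat, audio_feat):
    """Inner product over the h axis at every time-frequency point.

    label_feat: list of h floats (the 1 x 1 x h label feature, flattened).
    audio_feat: f x t x h nested lists (the spectrogram feature).
    Returns an f x t mask with values in (0, 1).
    """
    return [
        [sigmoid(sum(a * b for a, b in zip(label_feat, point))) for point in row]
        for row in audio_feat
    ]


def similarity(mask):
    """Aggregate the mask into one scalar (global maximum, an assumption)."""
    return max(max(row) for row in mask)


# Toy data: h = 2, f = 2, t = 2.
label_feat = [1.0, 0.0]
audio_feat = [
    [[2.0, 0.0], [-2.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.0]],
]
mask = spectrogram_mask(label_feat, audio_feat)
score = similarity(mask)
```

Points whose audio feature aligns with the label feature receive mask values close to 1, so a signal containing more of the labeled component tends to yield a larger aggregated score.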

The index calculation unit 104 calculates the loss function based on the calculated similarity and the similarity of the negative pair data (step S106). For example, the index calculation unit 104 may calculate the similarity of the negative pair data between steps S105 and S106. The update unit 105 then updates the parameters of each encoder so that the loss function is minimized (step S107).
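
The loss in step S106 can be pictured with a margin-based Triplet Loss over a matrix of similarities. Equation (1) is not reproduced here, so the hinge form, the `margin` parameter, and the way negatives are drawn are assumptions for illustration. What the sketch keeps from the text is that matching identifiers give positive pairs, non-matching identifiers give negative pairs, and negatives may be drawn at random from the mini-batch.

```python
import random


def triplet_loss(sim, margin=1.0, rng=None):
    """Margin-based triplet loss over a mini-batch (illustrative form).

    sim[a][b] is the similarity between label a and acoustic signal b.
    Diagonal entries are positive pairs; off-diagonal entries are negatives.
    Requires a mini-batch size B of at least 2.
    """
    rng = rng or random.Random(0)
    B = len(sim)
    total = 0.0
    for i in range(B):
        others = [n for n in range(B) if n != i]
        j = rng.choice(others)  # a non-matching label for signal i
        k = rng.choice(others)  # a non-matching signal for label i
        # Each term shrinks as the positive similarity grows and
        # grows as the negative similarity grows.
        total += max(0.0, sim[j][i] - sim[i][i] + margin)
        total += max(0.0, sim[i][k] - sim[i][i] + margin)
    return total / B


well_separated = [[0.9, 0.1], [0.2, 0.8]]
confused = [[0.1, 0.9], [0.8, 0.2]]
loss_good = triplet_loss(well_separated, margin=0.5)
loss_bad = triplet_loss(confused, margin=0.5)
```

The loss reaches zero once every positive pair exceeds its sampled negatives by at least the margin, and it is larger for the confused batch than for the well-separated one.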

If the termination condition is satisfied (step S108, Yes), the learning device 10 ends the processing. If it is not satisfied (step S108, No), the learning device 10 returns to step S101 and repeats the processing using the updated models. Examples of the termination condition are that all data in the prepared mini-batch has been processed, that the processing has been repeated a specified number of times, and that the size of the parameter updates has converged.
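
The loop around steps S101 to S108 might be organized as follows. This is only a sketch: `step` is a hypothetical callback that runs steps S101 to S107 on one mini-batch and returns the magnitude of the parameter update, and combining the example termination conditions in a single loop is a choice made here, not something the text prescribes.

```python
def train(batches, step, max_epochs=10, tol=1e-4):
    """Repeat updates until one of the example termination conditions holds.

    batches: list of mini-batches.
    step: hypothetical callback; performs one update, returns its magnitude.
    Returns the number of passes over the data that were executed.
    """
    for epoch in range(max_epochs):        # specified number of repetitions
        largest_update = 0.0
        for batch in batches:              # process all prepared data
            largest_update = max(largest_update, step(batch))
        if largest_update < tol:           # parameter updates have converged
            return epoch + 1
    return max_epochs


# Toy callback whose update magnitude shrinks tenfold on every call.
state = {"update": 1.0}


def shrinking_step(batch):
    state["update"] *= 0.1
    return state["update"]


passes = train([0, 1], shrinking_step, max_epochs=10, tol=1e-4)
```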

[Effects of the first embodiment]
As described above, the label feature calculation unit 101 uses the first model to calculate a first feature by mapping information that expresses a label in a semantically interpretable form into a first space. The spectrogram feature calculation unit 102 uses the second model to calculate a second feature by mapping the acoustic signal into the first space. The index calculation unit 104 calculates an index for evaluating a mask that is generated from the first feature and the second feature and that is used to extract the component corresponding to the label from the acoustic signal. The update unit 105 updates the parameters of the first model and the second model so that the index is optimized. The learning device 10 can therefore train each model as long as a label that identifies the source of the target sound is expressed in a semantically interpretable form. According to the first embodiment, sound source separation can thus be performed accurately even when the physical properties of the target sound are unknown, and even when the acoustic signal contains sounds whose physical properties are similar to those of the target sound.

The label feature calculation unit 101 also calculates, as the first feature, the output obtained by inputting a vector created from a label expressed as a character string into the first model. The learning device 10 can therefore train the sound source separation model from labels expressed in a human-recognizable form, such as character strings, even when the physical properties of the target sound are unknown.

The label feature calculation unit 101 also calculates, as the first feature, a feature that includes at least variables of a first number of dimensions. The spectrogram feature calculation unit 102 calculates, as the second feature, a feature that includes at least variables of the first number of dimensions, a variable corresponding to time, and a variable corresponding to frequency components. The index calculation unit 104 calculates the index based on the inner product of the first feature and the second feature. By mapping the label feature and the feature of the input acoustic signal into a latent space with the same number of dimensions, the learning device 10 can calculate the index easily.

The index calculation unit 104 also calculates an index that becomes smaller as the similarity between the second feature and the first feature calculated from the label associated with the acoustic signal becomes larger, and that becomes larger as the similarity between the second feature and the first feature calculated from a label different from the associated label becomes larger. The update unit 105 updates the parameters of the first model and the second model so that the index is minimized. The learning device 10 can thereby perform training by Triplet Loss using a distance metric.

The label feature calculation unit 101 also inputs an s-dimensional one-hot vector (s is a preset number of words), created from a label representing a predetermined word, into a neural network serving as the first model, and calculates the resulting 1×1×h-dimensional feature (h is an arbitrary preset number) as the first feature. The spectrogram feature calculation unit 102 inputs an f×t-dimensional spectrogram (f is the number of frequency bins and t is the number of time bins), created from the acoustic signal, into a neural network serving as the second model, and calculates the resulting f×t×h-dimensional feature as the second feature. The learning device 10 can thereby map the spectrogram into the latent space while preserving its structure along the time and frequency directions.
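
To make the shapes concrete, the toy example below maps an s-dimensional one-hot vector to a 1×1×h feature and an f×t spectrogram to an f×t×h feature. The two encoders are stand-ins (a single linear layer and a pointwise projection, both assumptions); the neural networks of the embodiment are not specified here.

```python
def one_hot(index, s):
    """s-dimensional one-hot vector for the word at position `index`."""
    return [1.0 if n == index else 0.0 for n in range(s)]


def label_encoder(vec, weights):
    """Stand-in for the first model: linear map with an s x h weight matrix.

    Returns a 1 x 1 x h nested list.
    """
    h = len(weights[0])
    feat = [sum(v * row[d] for v, row in zip(vec, weights)) for d in range(h)]
    return [[feat]]


def audio_encoder(spec, proj):
    """Stand-in for the second model: expands every time-frequency point
    of an f x t spectrogram into h values. Returns f x t x h nested lists.
    """
    return [[[value * p for p in proj] for value in row] for row in spec]


def shape(nested):
    """Dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


s, h, f, t = 4, 3, 2, 5
weights = [[float(n + d) for d in range(h)] for n in range(s)]
x = label_encoder(one_hot(2, s), weights)
y = audio_encoder([[1.0] * t for _ in range(f)], [0.5, 1.0, 2.0])
x_shape, y_shape = shape(x), shape(y)
```

Because both features share the last axis of size h, an inner product over that axis leaves one value per time-frequency point, which is why the time-frequency layout of the spectrogram survives the mapping.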

[Label assignment method]
In the first embodiment, data combining an input acoustic signal with label information is input as training data. Any method may be used to assign labels to the input acoustic signals. For example, an expert may listen to each input acoustic signal and decide which labels to assign. Training with labels assigned in this way is referred to here as exhaustive supervised learning.

However, exhaustive supervised learning has the problem of high labor cost. One way to automate label assignment is to associate the audio of a video with labels indicating the objects that appear in the video, where those objects can be obtained by image recognition. Another is to assign labels on a large scale using crowdsourcing or similar means.

In addition, a dry source is not always available as an input acoustic signal for training. The input acoustic signal therefore contains considerable noise and reverberation, and may not correspond one-to-one with a label. Moreover, with crowdsourcing and similar approaches, labels are assigned by non-experts, so labeling criteria may vary.

In the first embodiment, however, labels do not necessarily have to correspond one-to-one with input acoustic signals. For example, as long as the input acoustic signal contains at least a target sound corresponding to "Writing", the signal may be labeled "Writing". This is because the similarity calculated from such an input acoustic signal and that label is expected to be larger than, at least, the similarity calculated with other labels. Triplet Loss can thus be used even in situations where exhaustive supervised learning cannot be performed.

[Second embodiment]
In the first embodiment, the learning device 10 calculates the loss function based on the inner product of the label feature and the spectrogram feature. The mask generated from that inner product, however, also makes it possible to actually separate the target sound component corresponding to the label and to output a synthesized acoustic signal. In the second embodiment, the learning device 10 performs training so that the synthesized acoustic signal is optimized.

[Configuration of the second embodiment]
The configuration of the learning device according to the second embodiment is described with reference to FIG. 5, which shows a configuration example of that device. In FIG. 5, parts that are the same as in the first embodiment are given the same reference numerals as in FIG. 1 and elsewhere, and their description is omitted. As shown in FIG. 5, the learning device 10a includes an extraction unit 106, a synthesis unit 107, and an update unit 108.

As described above, the spectrogram feature calculation unit 102 calculates a spectrogram in the course of calculating the spectrogram feature from the input acoustic signal. In the second embodiment, the spectrogram feature calculation unit 102 outputs this spectrogram to the extraction unit 106. The extraction unit 106 takes as input the spectrogram and the mask generated by the mask generation unit 103.

The extraction unit 106 applies the mask to the spectrogram, extracts a predetermined component, and outputs it. For example, the extraction unit 106 may multiply each time-frequency component of the spectrogram by the corresponding mask value as a weight, or may select the components to extract based on the mask values.
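
The two ways of applying the mask mentioned above can be sketched as follows. The function names and the threshold value of 0.5 are illustrative assumptions.

```python
def apply_as_weight(mask, spec):
    """Multiply each time-frequency component by its mask value."""
    return [[m * v for m, v in zip(m_row, s_row)]
            for m_row, s_row in zip(mask, spec)]


def apply_as_selector(mask, spec, threshold=0.5):
    """Keep a component only where the mask value reaches the threshold."""
    return [[v if m >= threshold else 0.0 for m, v in zip(m_row, s_row)]
            for m_row, s_row in zip(mask, spec)]


mask = [[0.9, 0.2]]
spec = [[2.0, 4.0]]
weighted = apply_as_weight(mask, spec)    # components scaled by the mask
selected = apply_as_selector(mask, spec)  # components kept or discarded
```

Weighting keeps a graded contribution from every point, whereas selection makes a binary decision per point.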

The synthesis unit 107 takes the components extracted by the extraction unit 106 as input, synthesizes an acoustic signal from them, and outputs it. For example, the synthesis unit 107 may synthesize the acoustic signal by the McAulay-Quatieri method (see, for example, Reference 5), which extracts sinusoidal parameters from the time-frequency components and performs additive sinusoidal synthesis. Alternatively, the synthesis unit 107 may use the Griffin-Lim method (see, for example, Reference 6), which estimates and restores the phase of the time-frequency components by iterative processing.
Reference 5: R. J. McAulay, T. F. Quatieri. "Speech Analysis/Synthesis Based on a Sinusoidal Representation," IEEE Trans. ASSP, vol.34, no.4, pp.744-754, 1986.
Reference 6: D. W. Griffin and J. S. Lim. "Signal estimation from modified short-time Fourier transform," IEEE Trans. ASSP, vol.32, no.2, pp.236-243, 1984.

The update unit 108 takes as input the input acoustic signal and the acoustic signal synthesized by the synthesis unit 107, updates each model so that a loss function on the acoustic signals is minimized, and outputs the updated parameters. For example, as in the first embodiment, the update unit 108 may adopt Triplet Loss and optimize a loss function calculated from the acoustic signal synthesized from positive pair data and the acoustic signal synthesized from negative pair data.

FIG. 6 is a schematic diagram showing the flow of the learning processing according to the second embodiment. As shown in FIG. 6, the synthesis unit 107 synthesizes an output acoustic signal (Audio output) from the spectrogram and the mask. The update unit 108 then updates the models (Updating) using Triplet Loss.

[Processing flow of the second embodiment]
FIG. 7 is a flowchart showing the processing flow of the learning device according to the second embodiment. As shown in FIG. 7, the label feature calculation unit 101 first calculates a label feature from the label information using the label encoder (step S121). Next, the spectrogram feature calculation unit 102 calculates a spectrogram feature from the input acoustic signal using the audio encoder (step S122). Steps S121 and S122 may be executed in the reverse order or in parallel.

The mask generation unit 103 then calculates the inner product of the label feature and the spectrogram feature (step S123) and generates a spectrogram mask from the inner product (step S124). The extraction unit 106 applies the spectrogram mask to the input acoustic signal and extracts a predetermined component (step S125). The synthesis unit 107 then synthesizes an acoustic signal from the extracted component (step S126).

The index calculation unit 104 calculates the loss function based on the synthesized acoustic signal (step S127). The update unit 108 then updates the parameters of each encoder so that the loss function is minimized (step S128).

If the termination condition is satisfied (step S129, Yes), the learning device 10a ends the processing. If it is not satisfied (step S129, No), the learning device 10a returns to step S121 and repeats the processing using the updated models. Examples of the termination condition are that all data in the prepared mini-batch has been processed, that the processing has been repeated a specified number of times, and that the size of the parameter updates has converged.

[Effects of the second embodiment]
Like the learning device 10 of the first embodiment, the learning device 10a of the second embodiment can train each model as long as a label that identifies the source of the target sound is expressed in a semantically interpretable form. The second embodiment therefore likewise allows accurate sound source separation even when the physical properties of the target sound are unknown, and even when the acoustic signal contains sounds whose physical properties are similar to those of the target sound.

[Third embodiment]
The generation device according to the third embodiment uses a trained sound source separation model to generate masks and to separate target sounds using the masks. The sound source separation model includes a label encoder and an audio encoder.

[Configuration of the third embodiment]
First, the configuration of the generation device according to the third embodiment is described with reference to FIG. 8, which shows a configuration example of the generation device. As shown in FIG. 8, the generation device 20 includes a label feature calculation unit 201, a spectrogram feature calculation unit 202, a mask generation unit 203, an index calculation unit 204, an extraction unit 206, and a synthesis unit 207. The generation device 20 also stores label encoder information 211 and audio encoder information 212.

The label feature calculation unit 201, spectrogram feature calculation unit 202, mask generation unit 203, index calculation unit 204, extraction unit 206, and synthesis unit 207 have the same functions as the label feature calculation unit 101, spectrogram feature calculation unit 102, mask generation unit 103, index calculation unit 104, extraction unit 106, and synthesis unit 107, respectively. The label encoder information 211 is information on the trained label encoder, and the audio encoder information 212 is information on the trained audio encoder.

The label feature calculation unit 201 takes as input label information that identifies a label, uses the first model to calculate a first feature by mapping the label information into the first space, and outputs it. The spectrogram feature calculation unit 202 takes the input acoustic signal as input, uses the second model to calculate a second feature by mapping the input acoustic signal into the first space, and outputs it.

The mask generation unit 203 takes the first feature and the second feature as input, generates from them a mask for extracting the component corresponding to the label from the acoustic signal, and outputs the mask. The index calculation unit 204 takes the mask generated from the first feature and the second feature as input and calculates the similarity from the mask.

The extraction unit 206 takes as input the mask and the spectrogram obtained from the input acoustic signal, applies the mask to the spectrogram, extracts a predetermined component, and outputs it. The synthesis unit 207 takes the component extracted by the extraction unit 206 as input, synthesizes an acoustic signal from it, and outputs the signal.

For example, the synthesis unit 207 may synthesize the acoustic signal by the McAulay-Quatieri method (see, for example, Reference 5), which extracts sinusoidal parameters from the time-frequency components and performs additive sinusoidal synthesis. Alternatively, the synthesis unit 207 may use the Griffin-Lim method (see, for example, Reference 6), which estimates and restores the phase of the time-frequency components by iterative processing.

The generation device 20 outputs the output acoustic signal synthesized by the synthesis unit 207. As shown in FIG. 8, in addition to the output acoustic signal, the generation device 20 may also output the similarity calculated by the index calculation unit 204 or the mask generated by the mask generation unit 203.

FIG. 9 is a schematic diagram showing the flow of the generation processing according to the third embodiment. As shown in FIG. 9, unlike the learning devices of the first and second embodiments, the generation device 20 of the third embodiment stores trained models in advance in a storage device or the like and uses them to generate masks. The generation device 20 therefore has no update unit and performs no processing related to model updating. This does not, however, preclude adding a learning function equivalent to that of the learning device to the generation device 20 so that it also performs model updating.

[Processing flow of the third embodiment]
FIG. 10 is a flowchart showing the processing flow of the generation device according to the third embodiment. As shown in FIG. 10, the label feature calculation unit 201 first calculates a label feature from the label information using the label encoder (step S201). Next, the spectrogram feature calculation unit 202 calculates a spectrogram feature from the input acoustic signal using the audio encoder (step S202). Steps S201 and S202 may be executed in the reverse order or in parallel.

The mask generation unit 203 then calculates the inner product of the label feature and the spectrogram feature (step S203) and generates a spectrogram mask from the inner product (step S204). The extraction unit 206 applies the spectrogram mask to the input acoustic signal and extracts a predetermined component (step S205). The synthesis unit 207 then synthesizes an acoustic signal from the extracted component (step S206).

The generation device 20 outputs the generated acoustic signal as the output acoustic signal (step S207). The generation device 20 may also output the spectrogram mask itself or the similarity calculated from the spectrogram mask.

[Effects of the third embodiment]
As described above, the label feature calculation unit 201 uses the first model to calculate a first feature by mapping information that expresses a label in a semantically interpretable form into a first space. The spectrogram feature calculation unit 202 uses the second model to calculate a second feature by mapping the acoustic signal into the first space. The mask generation unit 203 generates, from the first feature and the second feature, a mask for extracting the component corresponding to the label from the acoustic signal. The generation device 20 can therefore generate a mask for a given label as long as that label, which identifies the source of the target sound, is expressed in a semantically interpretable form. According to the third embodiment, sound source separation can thus be performed accurately even when the physical properties of the target sound are unknown, and even when the acoustic signal contains sounds whose physical properties are similar to those of the target sound.

[Fourth embodiment]
In the embodiments described so far, the mask is used to extract components at each time-frequency point. In some cases, however, it is desirable to use a mask to separate sound sources along the time direction. In particular, when sounds with different labels occur within a given period without overlapping in time, the target sound for each label can be separated if the mask can identify the time span corresponding to that label.

In the fourth embodiment, therefore, as shown in FIG. 11, a mask in the time direction, that is, a time mask, is generated in which the components at each time-frequency point are aggregated along the frequency direction. FIG. 11 is a schematic diagram showing the flow of the generation processing according to the fourth embodiment.

As shown in FIG. 11, the spectrogram feature calculation unit 202 aggregates the generated spectrogram feature along the frequency direction, so that the size of the spectrogram feature in the frequency direction becomes 1. The mask generation unit 203 then calculates the inner product of the label feature and the aggregated spectrogram feature.

Alternatively, the mask generation unit 203 may generate the time mask by further aggregating, along the frequency direction, a spectrogram mask that contains a component for each time-frequency point. In that case, the spectrogram feature calculation unit 202 does not aggregate the spectrogram feature.

As also shown in FIG. 11, the index calculation unit 204 can calculate the similarity by further aggregating the time mask along the time direction. The spectrogram mask is an example of the first mask, and the time mask is an example of the second mask.

The time mask can thus be regarded as a mask obtained by aggregating the frequency components of the spectrogram mask. There are, for example, two ways to generate it: a first method that aggregates the spectrogram feature in advance without actually generating a spectrogram mask, and a second method that actually generates a spectrogram mask and then aggregates it. The first method reduces the amount of computation, whereas the second method yields both the spectrogram mask and the time mask.
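
The two generation methods can be compared in a small sketch. Averaging is used as the frequency aggregation and a sigmoid as the mask nonlinearity; both are assumptions, and with a nonlinear mask the two methods need not return identical values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def time_mask_method1(label_feat, audio_feat):
    """Aggregate the f x t x h feature over frequency, then take inner products."""
    f, t, h = len(audio_feat), len(audio_feat[0]), len(label_feat)
    pooled = [
        [sum(audio_feat[q][n][d] for q in range(f)) / f for d in range(h)]
        for n in range(t)
    ]  # t x h: the frequency axis has been reduced to size 1
    return [sigmoid(dot(label_feat, frame)) for frame in pooled]


def time_mask_method2(label_feat, audio_feat):
    """Build the full f x t spectrogram mask, then aggregate it over frequency."""
    spec_mask = [[sigmoid(dot(label_feat, p)) for p in row] for row in audio_feat]
    f, t = len(spec_mask), len(spec_mask[0])
    time_mask = [sum(spec_mask[q][n] for q in range(f)) / f for n in range(t)]
    return spec_mask, time_mask


# Toy data: h = 1, f = 2, t = 2.
label_feat = [1.0]
audio_feat = [[[2.0], [-2.0]], [[0.0], [0.0]]]
m1 = time_mask_method1(label_feat, audio_feat)
spec_mask, m2 = time_mask_method2(label_feat, audio_feat)
clip_similarity = max(m1)  # further aggregation along time (maximum assumed)
```

Method 1 evaluates t inner products instead of f×t, while method 2 returns the spectrogram mask as a by-product.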

[Processing flow of the fourth embodiment]
FIG. 12 is a flowchart showing the processing flow of the generation device according to the fourth embodiment. As shown in FIG. 12, the label feature calculation unit 201 first calculates a label feature from the label information using the label encoder (step S221). Next, the spectrogram feature calculation unit 202 uses the audio encoder to calculate, from the input acoustic signal, a spectrogram feature whose frequency components have been aggregated (step S222). Steps S221 and S222 may be executed in the reverse order or in parallel.

Next, the mask generation unit 203 calculates the inner product of the label feature and the spectrogram feature (step S223). The mask generation unit 203 then generates a time mask from the inner product (step S224).

The generation device 20 then applies the time mask to the input acoustic signal and extracts a predetermined component (step S225). The generation device 20 also synthesizes an acoustic signal based on the extracted component (step S226).

The generation device 20 outputs the generated acoustic signal as an output acoustic signal (step S227). Note that the generation device 20 may output the time mask itself, or may output the degree of similarity calculated from the time mask.
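
Steps S223 to S225 can be illustrated with a short sketch. It assumes that the encoders of steps S221 and S222 have already produced a label feature (one vector) and a frequency-aggregated spectrogram feature (one vector per time frame), that the inner product is converted into a mask value with a sigmoid, and that the time mask is applied by scaling every frequency bin of a frame. None of these details is fixed by the text, and the function names are hypothetical. The synthesis of step S226 is omitted.

```python
import math

def generate_time_mask(label_feat, frame_feats):
    """S223 and S224: one inner product per time frame, mapped into (0, 1)."""
    mask = []
    for frame in frame_feats:
        inner = sum(a * b for a, b in zip(label_feat, frame))  # S223
        mask.append(1.0 / (1.0 + math.exp(-inner)))            # S224 (sigmoid assumed)
    return mask

def apply_time_mask(spectrogram, time_mask):
    """S225: scale all frequency bins of frame t by the mask value for frame t.

    spectrogram is indexed [frequency][time].
    """
    return [[value * time_mask[t] for t, value in enumerate(row)]
            for row in spectrogram]
```

With a label feature of `[1.0, -1.0]` and frame features `[[5.0, 0.0], [0.0, 5.0]]`, the first frame is kept almost unchanged and the second is suppressed, which matches the case of labels that do not overlap in time.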

[Effects of the fourth embodiment]
As described above, the label feature calculation unit 201 uses a first model to calculate a first feature obtained by mapping, to a first space, information expressing a label in a manner in which its meaning can be interpreted. The spectrogram feature calculation unit 202 uses a second model to calculate a second feature obtained by mapping an acoustic signal to the first space. The mask generation unit 203 generates a second mask obtained by aggregating the frequency components of a first mask, the first mask being generated based on the first feature and the second feature and having a value for each time-frequency point for extracting a component corresponding to the label from the acoustic signal. Therefore, according to the fourth embodiment, the target sound can be separated efficiently, particularly when sounds with different labels are present without temporal overlap.

The fourth embodiment is useful, for example, for a news program in which segments such as a politics segment and a sports segment are separated in time. For example, by applying the time mask obtained in the third embodiment to the script of the news read aloud in each segment, the portion of the script corresponding to a specific segment can be identified.

Here, the label encoder and the audio encoder used in the fourth embodiment may be, for example, encoders trained with a triplet loss using an acoustic signal synthesized from the component extracted by the time mask. This is achieved by having the mask generation unit 103 aggregate the frequency components of the generated spectrogram mask and generate a time mask during learning.

From the above, the following embodiment is conceivable. The label feature calculation unit 101 uses a first model to calculate a first feature obtained by mapping, to a first space, information expressing a label in a manner in which its meaning can be interpreted. The spectrogram feature calculation unit 102 uses a second model to calculate a second feature obtained by mapping an acoustic signal to the first space. The index calculation unit 104 calculates an index for evaluating a second mask obtained by aggregating the frequency components of a first mask, the first mask being generated based on the first feature and the second feature and having a value for each time-frequency point for extracting a component corresponding to the label from the acoustic signal. The updating unit 105 updates the parameters of the first model and the parameters of the second model so that the index is optimized.
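
As a rough illustration of an index that evaluates the second mask, the sketch below computes a triplet-style hinge value from two time masks: one generated with the label associated with the acoustic signal and one generated with a different label. The mean over time as the degree of similarity, the hinge form, and the margin value are all assumptions made for this example, and the function names are hypothetical. The parameter update performed by the updating unit 105 is not shown.

```python
def similarity(time_mask):
    """Assumed similarity: the time mask aggregated (averaged) over time."""
    return sum(time_mask) / len(time_mask)

def triplet_style_index(mask_for_true_label, mask_for_other_label, margin=0.5):
    """Hinge-style index to be minimised.

    It does not increase when the similarity for the associated label grows,
    and does not decrease when the similarity for a different label grows.
    """
    positive = similarity(mask_for_true_label)
    negative = similarity(mask_for_other_label)
    return max(0.0, margin + negative - positive)
```

For example, `triplet_style_index([0.9, 0.9], [0.1, 0.1])` evaluates to `0.0`, because the associated label already scores higher than the other label by more than the margin, whereas swapping the two masks gives a clearly positive value.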

[Experimental results]
Experiments conducted based on the embodiments will now be described. In the experiments, a spectrogram mask was generated for a mixed sound consisting of two types of sound sources without temporal overlap. The experiments also verified the usefulness of mediating through latent variables by setting the number of dimensions h of the latent variables to be no greater than the number l of sound source classes.

In the experiments, the sound source separation model described in the first embodiment was trained using a dataset consisting of the created mixed sounds. Furthermore, using the trained sound source separation model, a test mixed sound and only one of the two labels were input to the generation device described in the third embodiment, and it was checked whether a spectrogram was generated in the corresponding region.

The dataset used in the experiments was FSD Kaggle 2018, published for DCASE 2018 challenge task2 (Reference 7: http://dcase.community/challenge2018/index). FSD Kaggle 2018 is a dataset of approximately 9,500 items of environmental sounds in 41 classes.

Of the dataset, the manually annotated data was used. To avoid extremely short data, only data with a length of 3 seconds or more was used. From the data satisfying these conditions, two items of different classes were extracted, and the two signals (the signals of label A and label B) were joined with silence in between, as shown in FIG. 13. FIG. 13 is a diagram illustrating how data was joined in the experiments. The joined single-channel data always corresponds to exactly two classes.
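
The data preparation described above can be sketched as follows. The 3-second minimum length comes from the text. The length of the silent gap is not stated, so `gap_seconds` is an assumed parameter, and the function names and the representation of a clip as a list of samples are hypothetical.

```python
def long_enough(clip, sample_rate, min_seconds=3.0):
    """Keep only clips of at least min_seconds (3 seconds in the experiment)."""
    return len(clip) >= min_seconds * sample_rate

def join_with_silence(clip_a, clip_b, sample_rate, gap_seconds=1.0):
    """Join the label A and label B clips with silence in between.

    gap_seconds is an assumed value; the text does not give the gap length.
    """
    silence = [0.0] * int(round(sample_rate * gap_seconds))
    return list(clip_a) + silence + list(clip_b)
```

Because the two clips are drawn from different classes and never overlap in time, each joined single-channel signal corresponds to exactly two labels.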

FIG. 14 is a diagram showing the set values of the parameters in the experiments. As shown in FIG. 14, the number of dimensions h of the latent variables was set to 32. The two signals were the sound of a trumpet and the sound of jangling keys, corresponding to label A and label B, respectively.

FIG. 15 is a diagram showing a spectrogram obtained in the experiments. FIGS. 16 and 17 are diagrams showing masks obtained in the experiments. FIG. 16 shows the mask obtained when the trumpet was specified as the label. FIG. 17 shows the mask obtained when the sound of jangling keys was specified as the label. These figures show that, according to the embodiments, masks capable of separating each label are generated.

[System configuration, etc.]
Each component of each illustrated device is functionally conceptual and does not necessarily need to be physically configured as illustrated. In other words, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed by each device can be realized by a CPU and a program analyzed and executed by the CPU, or can be realized as hardware using wired logic.

Among the processes described in the present embodiment, all or part of the processes described as being performed automatically can also be performed manually, and all or part of the processes described as being performed manually can also be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified.

[Program]
As one embodiment, the learning device 10 and the generation device 20 can be implemented by installing, on a desired computer, a program that executes the above learning process or generation process as packaged software or online software. For example, by causing an information processing device to execute the above program, the information processing device can be made to function as the learning device 10 or the generation device 20. The information processing device referred to here includes desktop and notebook personal computers. The category of information processing devices also includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as slate terminals such as PDAs (Personal Digital Assistants).

The learning device 10 and the generation device 20 can also be implemented as a server device that treats a terminal device used by a user as a client and provides the client with a service related to the above learning process or generation process. For example, the server device is implemented as a server device that provides a service taking a label and an acoustic signal as input and outputting the separated signal of the target sound. In this case, the server device may be implemented as a web server, or as a cloud that provides the service related to the above processing through outsourcing.

FIG. 18 is a diagram showing an example of a computer that executes the learning program. Note that the generation process may also be executed by a similar computer. The computer 1000 includes, for example, a memory 1010 and a CPU 1020. The computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.

The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the learning device 10 is implemented as the program module 1093, in which computer-executable code is written. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, the program module 1093 for executing processing similar to that of the functional configuration of the learning device 10 is stored in the hard disk drive 1090. Note that the hard disk drive 1090 may be replaced by an SSD.

The setting data used in the processing of the embodiments described above is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as necessary, and executes the processing of the embodiments described above.

Note that the program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090; they may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like), and may be read from that computer by the CPU 1020 via the network interface 1070.

10, 10a: Learning device
20: Generation device
101, 201: Label feature calculation unit
102, 202: Spectrogram feature calculation unit
103, 203: Mask generation unit
104, 204: Index calculation unit
105, 108: Updating unit
106, 206: Extraction unit
107, 207: Synthesis unit
111, 211: Label encoder information
112, 212: Audio encoder information

Claims

A learning device comprising:
a first feature calculation unit that uses a first model to calculate a first feature obtained by mapping, to a first space, information expressing a label in a manner in which its meaning can be interpreted;
a second feature calculation unit that uses a second model to calculate a second feature obtained by mapping an acoustic signal to the first space;
an index calculation unit that calculates an index for evaluating a mask that is generated based on the first feature and the second feature and that is used to extract a component corresponding to the label from the acoustic signal; and
an updating unit that updates parameters of the first model and parameters of the second model so that the index is optimized.

The learning device according to claim 1, wherein the first feature calculation unit calculates, as the first feature, an output obtained by inputting, to the first model, a vector created from a label expressed as a character string.

The learning device according to claim 1 or 2, wherein
the first feature calculation unit calculates, as the first feature, a feature including at least a variable of a first number of dimensions,
the second feature calculation unit calculates, as the second feature, a feature including at least a variable of the first number of dimensions, a variable corresponding to time, and a variable corresponding to a frequency component, and
the index calculation unit calculates the index based on a degree of similarity between the first feature and the second feature.

The learning device according to any one of claims 1 to 3, wherein
the index calculation unit calculates an index that becomes smaller as the degree of similarity between the second feature and the first feature calculated from the label associated with the acoustic signal becomes larger, and that becomes larger as the degree of similarity between the second feature and the first feature calculated from a label different from the label associated with the acoustic signal becomes larger, and
the updating unit updates the parameters of the first model and the parameters of the second model so that the index is minimized.

The learning device according to any one of claims 1 to 4, wherein
the first feature calculation unit calculates, as the first feature, a feature of 1×1×h dimensions (h being an arbitrary preset number) obtained by inputting, to a neural network serving as the first model, an s-dimensional one-hot vector (s being a preset number of words) created from a label representing a predetermined word, and
the second feature calculation unit calculates, as the second feature, a feature of f×t×h dimensions obtained by inputting, to a neural network serving as the second model, a spectrogram of f×t dimensions (f being the number of frequency bins and t being the number of time bins) created from the acoustic signal.

A learning device comprising:
a first feature calculation unit that uses a first model to calculate a first feature obtained by mapping, to a first space, information from which a label can be identified;
a second feature calculation unit that uses a second model to calculate a second feature obtained by mapping an acoustic signal to the first space;
an index calculation unit that calculates an index for evaluating a mask that is generated based on the first feature and the second feature and that is used to extract a component corresponding to the label from the acoustic signal; and
an updating unit that updates parameters of the first model and parameters of the second model so that the index is optimized.

A learning method executed by a learning device, the learning method comprising:
a first feature calculation step of using a first model to calculate a first feature obtained by mapping, to a first space, information expressing a label in a manner in which its meaning can be interpreted;
a second feature calculation step of using a second model to calculate a second feature obtained by mapping an acoustic signal to the first space;
an index calculation step of calculating an index for evaluating a mask that is generated based on the first feature and the second feature and that is used to extract a component corresponding to the label from the acoustic signal; and
an updating step of updating parameters of the first model and parameters of the second model so that the index is optimized.

A learning program for causing a computer to function as the learning device according to any one of claims 1 to 6.

A generation device that uses a first model and a second model that have been trained by a method of: calculating, using the first model, a first feature obtained by mapping, to a first space, first information expressing a label in a manner in which its meaning can be interpreted; calculating, using the second model, a second feature obtained by mapping a first acoustic signal to the first space; calculating an index for evaluating a mask that is generated based on the first feature and the second feature and that is used to extract a component corresponding to the label from the first acoustic signal; and updating parameters of the first model and parameters of the second model so that the index is optimized, the generation device comprising:
a first feature calculation unit that uses the first model to calculate a third feature obtained by mapping, to the first space, second information expressing a label in a manner in which its meaning can be interpreted;
a second feature calculation unit that uses the second model to calculate a fourth feature obtained by mapping a second acoustic signal to the first space; and
a mask generation unit that generates, based on the third feature and the fourth feature, a mask for extracting a component corresponding to the label from the second acoustic signal.

A generation method executed by a generation device that uses a first model and a second model that have been trained by a method of: calculating, using the first model, a first feature obtained by mapping, to a first space, first information expressing a label in a manner in which its meaning can be interpreted; calculating, using the second model, a second feature obtained by mapping a first acoustic signal to the first space; calculating an index for evaluating a mask that is generated based on the first feature and the second feature and that is used to extract a component corresponding to the label from the first acoustic signal; and updating parameters of the first model and parameters of the second model so that the index is optimized, the generation method comprising:
a first feature calculation step of using the first model to calculate a third feature obtained by mapping, to the first space, second information expressing a label in a manner in which its meaning can be interpreted;
a second feature calculation step of using the second model to calculate a fourth feature obtained by mapping a second acoustic signal to the first space; and
a mask generation step of generating, based on the third feature and the fourth feature, a mask for extracting a component corresponding to the label from the second acoustic signal.

A generation program for causing a computer to function as the generation device according to claim 9.