WO2023135788A1 - Voice processing learning method, voice processing learning device, and program - Google Patents

Voice processing learning method, voice processing learning device, and program

Info

Publication number
WO2023135788A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
speech
target
data
expression
Prior art date
Application number
PCT/JP2022/001315
Other languages
French (fr)
Japanese (ja)
Inventor
Hiroshi Sato
Naoki Makishima
Original Assignee
Nippon Telegraph and Telephone Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to PCT/JP2022/001315
Publication of WO2023135788A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches

Definitions

  • the present invention relates to speech recognition technology, and more particularly to target speaker extraction technology for extracting only the target speaker's uttered voice from mixed sound that includes other speakers' utterances and noise in addition to the target speaker's utterances.
  • Blind source separation enables speech recognition by separating speech, which is difficult to recognize as a mixed sound, into the speech of each speaker (see, for example, Non-Patent Document 1).
  • Target speaker extraction uses the pre-registered utterances of the target speaker as auxiliary information, and obtains only the voice of the pre-registered speaker from the mixed sound (see, for example, Non-Patent Document 2). It has the advantage of not requiring prior information on the number of speakers included in the mixed sound, and is a practically useful technique. Since the extracted speech contains only the voice of the target speaker, speech recognition is possible.
  • target speaker extraction is implemented with a neural network. Its training uses paired data whose inputs are a mixed sound containing the target speaker's speech and a pre-registered utterance of the target speaker, and whose output is the target speaker's speech to be extracted from the mixed sound.
  • that is, to train a target speaker extraction model, the dataset must include, as the pre-registered utterance of each target speaker to be extracted, an utterance by that speaker that is different from the speech to be extracted (the enrollment speech).
  • the object of the present invention is to learn practically usable target speaker extraction from a speech database without speaker labels or a speech database containing the speech of an unspecified number of speakers.
  • a speech processing learning method according to one aspect performs: a feature extraction process that extracts, from the speech received as the enrollment speech, which is the same speech as the target speech (the utterance of the target speaker), a feature amount consisting of time-series data of fixed-length vectors; a speaker expression extraction process that extracts a speaker expression, which is a fixed-length vector, from the feature amount extracted by the feature extraction process; a target speaker extraction process that uses the extracted speaker expression to extract the speech estimated to be the target speech from a mixed sound composed of the target speech, non-target speech uttered by a speaker different from the target speaker, and noise; and an optimization process that calculates a loss function using the speech extracted by the target speaker extraction process and the target speech, and optimizes the target speaker extraction process so that the calculated value is minimized.
  • FIG. 1 is a diagram showing a functional configuration example of a speech processing learning device according to an embodiment of the present invention.
  • FIG. 2 is a diagram showing a processing flow example of the speech processing learning method in the speech processing learning device according to the embodiment.
  • FIG. 5 is a diagram showing a functional configuration example of a first modification of the speech processing learning device according to the embodiment.
  • the speech processing learning device 1 shown in FIG. 1 includes a feature extraction unit 11, a speaker expression extraction unit 13, a target speaker extraction unit 15, and an optimization unit 16.
  • the speech processing learning method of the embodiment is realized by the speech processing learning device 1 performing steps S11 to S16 illustrated in FIG. 2.
  • in one aspect, the speech processing learning device 1 learns using the target speech i, the enrollment speech i′, and the mixed sound X that constitute the learning data γ created by a learning data creation device 2, described later, and thereby creates a target speaker extraction model Z, which is a trained model.
  • the feature extraction unit 11 performs feature extraction processing (step S11). That is, the feature extraction unit 11 acquires the enrollment speech i′ from the paired data stored in the learning data γ, described later, performs feature extraction on this enrollment speech i′, and obtains time-series data of a fixed-length vector representation. The feature extraction unit 11 outputs the obtained time-series data to the speaker expression extraction unit 13.
  • for feature extraction, for example, known filter bank features over short-time windows or a short-time Fourier transform result may be used.
  • the feature extraction unit 11 can also be configured as a neural network.
  • an example of the neural network of the feature extraction unit 11 is a known convolutional neural network.
  • in the usual approach, the target speech and the enrollment speech are utterances of the same speaker but are different utterances; in the present invention, however, the enrollment speech i′ is set to the same speech as the target speech i, which is the utterance of the target speaker.
  • alternatively, the feature extraction unit 11 may be configured to receive the target speech i from the learning data γ, set the received target speech i as the enrollment speech i′, and perform the feature extraction described above on that enrollment speech i′.
  • the speaker expression extraction unit 13 performs speaker expression extraction processing (step S13). That is, the speaker expression extraction unit 13 extracts a speaker expression, which is a fixed-length vector, from the time-series data received from the feature extraction unit 11 in step S11.
  • the speaker expression extraction unit 13 can be constructed using a known neural network or a known speaker expression extractor such as i-vector.
  • the speaker expression extraction unit 13 outputs the extracted speaker expression to the target speaker extraction unit 15.
  • the target speaker extraction unit 15 performs target speaker extraction processing (step S15). That is, the target speaker extraction unit 15 uses the speaker expression received from the speaker expression extraction unit 13 in step S13, acquires the mixed sound X from the paired data stored in the learning data γ, described later, and extracts from the mixed sound X the utterance estimated to be that of the target speaker.
  • the target speaker extraction unit 15 can be configured with a known neural network.
  • an example of the neural network of the target speaker extraction unit 15 is a known network structure such as ConvTasNet.
  • the target speaker extraction unit 15 outputs the utterance estimated to be that of the target speaker to the optimization unit 16.
  • the target speaker extraction unit 15 also receives optimized parameters from the optimization unit 16 in step S16, described later, and creates the target speaker extraction model Z by reflecting those parameters in the parameters of its model for target speaker extraction.
  • the optimization unit 16 performs optimization processing (step S16). That is, the optimization unit 16 calculates a loss function using the utterance estimated to be that of the target speaker, received from the target speaker extraction unit 15 in step S15, and the target speech i from the paired data stored in the learning data γ, described later. The optimization unit 16 then optimizes the parameters of the target speaker extraction model of the target speaker extraction unit 15 so that the calculated value of the loss function is minimized.
  • an example of the loss function used by the optimization unit 16 is the known sd-SNR.
  • an example of the optimization method is the well-known Adam method.
  • the optimization processing performed by the optimization unit 16 can be applied to all parameters of the model for target speaker extraction.
  • the optimization processing can also be applied only to the remaining parameters while some parameters are kept fixed.
  • the optimization unit 16 outputs the optimized parameters of the model for target speaker extraction to the target speaker extraction unit 15.
  • the parameters output from the optimization unit 16 to the target speaker extraction unit 15 are reflected in the parameters of the target speaker extraction model held by the target speaker extraction unit 15, and the target speaker extraction model Z is thereby created.
  • the speech processing learning method is realized by the speech processing learning device 1 performing the processing from step S11 to step S16 described above.
  • the target speaker extraction model Z shown in FIG. 1, which is a trained model, is created.
  • FIG. 3 shows a diagram showing a functional configuration example of a learning data creation device according to an embodiment of the present invention.
  • the learning data creation device 2 shown in FIG. 3 is a device that creates the learning data γ, consisting of paired data used by the speech processing learning device 1, in order to learn target speaker extraction.
  • the learning data creation device 2 includes a target utterance extraction unit 21, a non-target utterance extraction unit 22, a noise extraction unit 23, a voice mixing unit 24, and a data creation unit 25.
  • FIG. 4 is a diagram showing a processing flow example of a learning data creation method in the learning data creation device according to the embodiment of the present invention.
  • the learning data creation method according to one embodiment of the present invention is realized by the learning data creation device 2 performing the processing of steps S21 to S25 illustrated in FIG. 4 .
  • the target utterance extraction unit 21 performs target utterance extraction processing (step S21). That is, the target utterance extraction unit 21 receives speech data from a previously prepared speech database α containing utterances of an unspecified number of speakers, and extracts from it the target speech i, which is the utterance of the target speaker. The target utterance extraction unit 21 outputs the extracted target speech i to the voice mixing unit 24 and the data creation unit 25.
  • the non-target utterance extraction unit 22 performs non-target utterance extraction processing (step S22). That is, the non-target utterance extraction unit 22 receives speech data from the speech database α described above and extracts from it the non-target speech k, which is speech uttered by a speaker other than the target speaker. The non-target utterance extraction unit 22 outputs the extracted non-target speech k to the voice mixing unit 24.
  • the noise extraction unit 23 performs noise extraction processing (step S23). That is, the noise extraction unit 23 receives noise data from a noise database β prepared in advance and extracts from it the noise r, which is a noise signal. The noise extraction unit 23 outputs the extracted noise r to the voice mixing unit 24.
  • the voice mixing unit 24 performs voice mixing processing (step S24). That is, the voice mixing unit 24 mixes the target speech i received from the target utterance extraction unit 21, the non-target speech k received from the non-target utterance extraction unit 22, and the noise r received from the noise extraction unit 23 with arbitrary gains to create the mixed sound X.
  • the voice mixing unit 24 outputs the created mixed sound X to the data creation unit 25.
  • the data creation unit 25 performs data creation processing (step S25). That is, the data creation unit 25 creates the learning data γ, which is a data set, based on the target speech i received from the target utterance extraction unit 21 and the mixed sound X received from the voice mixing unit 24.
  • the data creation unit 25 sets the enrollment speech i′ to the same speech as the target speech i, which is the utterance of the target speaker, and then creates the paired data of the three signals, the target speech i, the enrollment speech i′, and the mixed sound X, as the learning data γ.
  • the learning data creation method is implemented by the learning data creation device 2 performing the processing from step S21 to step S25 described above, and the learning data used in the speech processing learning device 1 can be created.
  • the learning data creation device 2 can create learning data used in the speech processing learning device 1 without preparing enrollment voices other than the target voice i.
  • the learning data creation device 2 can construct a learning data set from a speech database without speaker labels or from a speech database α containing utterances of an unspecified number of speakers.
  • the speech database α described above may also consist of utterance data of an unspecified number of speakers together with data with speaker labels.
  • in that case, paired data is generated for the unlabeled utterance data using the learning data creation method described above, while paired data for the speech with speaker labels can also be generated by a conventional method that uses the speaker labels.
  • alternatively, paired data can also be generated for the speech with speaker labels using the technique of the embodiment of the present invention described above.
  • steps S21, S22, and S23 are described above as being performed in this order, but these three processes may be performed in any order, and two or all three of them may also be performed in parallel.
  • FIG. 5 is a diagram showing a functional configuration example of the first modification of the speech processing learning device according to the embodiment of the present invention.
  • in the first modification, the speech processing learning device 1 has a first data extension unit 12 in addition to the elements of the speech processing learning device 1 shown in FIG. 1.
  • FIG. 6 shows a processing flow example of the speech processing learning method in the first modification of the speech processing learning device according to the embodiment of the present invention.
  • the speech processing learning method of this modification is realized by the speech processing learning device 1 shown in FIG. 5 performing steps S11 to S16 illustrated in FIG. 6.
  • the first data extension unit 12 performs a first data extension process (step S12). That is, the first data extension unit 12 performs data augmentation conversion on the time-series data, which is the feature amount extracted by the feature extraction unit 11 in the process of step S11.
  • Examples of data expansion methods include a method of randomly replacing part of the time-frequency representation with a predetermined value such as 0 (zero) or an average value.
  • a specific example is the known SpecAugment method.
  • as an implementation, one or more of SpecAugment's frequency masking, time masking, and time warping can be used.
  • the first data extension unit 12 outputs the converted data to the speaker expression extraction unit 13 .
  • in this modification, the speaker expression extraction unit 13 performs the speaker expression extraction of step S13 using the converted data received from the first data extension unit 12; the other processing is the same as the processing of step S13 shown in FIG. 2.
  • the speech processing learning method is realized by the speech processing learning device 1 of FIG. 5 performing steps S11 to S16 of FIG. 6. As a result, the target speaker extraction model Z shown in FIG. 5, which is a trained model, is created.
  • FIG. 7 is a diagram showing a functional configuration example of a second modification of the speech processing learning device according to the embodiment of the present invention.
  • in the second modification, the speech processing learning device 1 has a second data extension unit 14 in addition to the elements of the speech processing learning device 1 shown in FIG. 5.
  • FIG. 8 shows a processing flow example of the speech processing learning method in the second modification of the speech processing learning device according to the embodiment of the present invention.
  • the speech processing learning method of this modification is realized by the speech processing learning device 1 shown in FIG. 7 performing steps S11 to S16 illustrated in FIG. 8.
  • the second data extension unit 14 performs a second data extension process (step S14). That is, the second data extension unit 14 performs data augmentation conversion on the fixed-length speaker expressions extracted by the speaker expression extraction unit 13 in the process of step S13.
  • as a data augmentation method, for example, some elements of the vector may be replaced with a predetermined value such as 0 (zero) or an average value using the known dropout technique.
  • in this modification, the target speaker extraction unit 15 performs the target speaker extraction of step S15 using the converted data received from the second data extension unit 14.
  • the other processing is the same as the processing of step S15 shown in FIG. 2.
  • the speech processing learning method is realized by the speech processing learning device 1 of FIG. 7 performing steps S11 to S16 of FIG. 8. As a result, the target speaker extraction model Z shown in FIG. 7, which is a trained model, is created.
  • in FIG. 8, both the processing of the first data extension unit 12 (step S12) and the processing of the second data extension unit 14 (step S14) are performed, but the speech processing learning device 1 may be configured to perform only one of them.
  • that is, as in the first modification, the speech processing learning device 1 may perform only the processing of the first data extension unit 12 (step S12) without the processing of the second data extension unit 14 (step S14), or it may perform only the processing of the second data extension unit 14 (step S14) without the processing of the first data extension unit 12 (step S12).
  • FIG. 9 shows an example of the performance evaluation result of the target speaker extraction model when the target speaker extraction experiment was performed.
  • FIG. 9 shows results for four cases: a model trained with conventional speaker labels ((a) in FIG. 9), a model trained with the speech processing learning method of FIG. 2 ((b) in FIG. 9), a model trained with the speech processing learning method of FIG. 6 ((c) in FIG. 9), and a model trained with the speech processing learning method of FIG. 8 ((d) in FIG. 9).
  • the evaluation metrics are the signal-to-distortion ratio (SDR), for which a larger value indicates higher extraction performance, and the character error rate (CER), for which a smaller value indicates higher extraction performance.
  • when using the model trained with the speech processing learning method of FIG. 2 (case (b) of FIG. 9), the signal-to-distortion ratio (SDR) was 15.5 and the character error rate (CER) was 12.3%.
  • it can be seen that the enrollment speech i′, even though it is the same speech as the target speech, plays the role of an ordinary enrollment speech. In other words, from the result of FIG. 9(b), it can be said that a practical level of target speaker extraction performance is achieved even with speech data to which no speaker labels are attached.
  • when the target speech i is used as the enrollment speech i′, there is no variation in speaker characteristics between the target speech and the enrollment speech. Compared with the conventional method, in which the enrollment speech is a different utterance of the target speaker and such variation is present, it can therefore be difficult to obtain sufficient robustness against this variation. The results show that adopting the data augmentation of this embodiment contributes to acquiring robustness against this variation.
  • for case (d) of FIG. 9, the signal-to-distortion ratio (SDR) was 17.2, leaving only a small gap to the result of FIG. 9(a) trained with speaker labels.
  • the character error rate (CER) for case (d) of FIG. 9 was 9.7%, close to the 8.1% CER of the speaker-labeled result in FIG. 9(a).
  • speech resources without speaker labels are available in far larger quantities than those with speaker labels, especially for real data.
  • the performance of the model in FIG. 9(d) was not significantly inferior to that of the model in FIG. 9(a); it is therefore believed that training on the large volumes of available data without speaker labels can greatly improve practical performance.
  • the speech processing learning device and the speech processing learning method have been described above for one embodiment of the present invention and its modifications. This method is believed to achieve a practical level of target speaker extraction performance even for speech data without speaker labels. In other words, regardless of whether speaker labels are assigned, practically usable target speaker extraction can be implemented without preparing enrollment speech.
  • expanding the range of usable data in this way has the following two benefits.
  • one is that data that could not be included in conventional training data can now be used, which increases the variation of utterances and speakers in the data. This makes it possible to improve the robustness of speech enhancement to speaker differences and to improve target speaker extraction performance.
  • the other is that speech enhancement can be retrained on, and thus adapted to, datasets without speaker labels and datasets in which many speakers have only a single utterance.
  • a program describing this processing can be recorded on a computer-readable recording medium.
  • any computer-readable recording medium may be used, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.
  • this program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded.
  • the program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers via a network.
  • a computer that executes such a program, for example, first stores the program recorded on a portable recording medium or transferred from the server computer in its own storage device; when executing the processing, it reads the program from its own recording medium and executes processing according to it. As other execution forms, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may sequentially execute processing according to the program each time the program is transferred from the server computer. The processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in this embodiment also includes information that is used for processing by the computer and that conforms to a program, such as data that is not a direct command to the computer but has the property of prescribing the processing of the computer.
  • in the above, the device is configured by executing a predetermined program on a computer, but at least part of the processing may be implemented in hardware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A feature extraction unit 11 extracts a feature amount from a voice that is the same as the subject voice, which is the voice spoken by the target speaker, received as the enrollment voice. A speaker expression extraction unit 13 extracts a speaker expression from the extracted feature amount. A target speaker extraction unit 15 uses the extracted speaker expression to extract a voice inferred to be the subject voice from a mixed sound constituted by the subject voice, a non-subject voice that is the voice of a speaker different from the target speaker, and noise. An optimization unit 16 calculates a loss function using the inferred voice and the subject voice, and optimizes the target speaker extraction process so that the calculated value is minimized.

Description

Speech processing learning method, speech processing learning device, and program
The present invention relates to speech recognition technology, and more particularly to a target speaker extraction technology for extracting only the target speaker's speech from a mixed sound that contains other speakers' utterances and noise in addition to the target speaker's utterances.
In recent years, the performance of speech recognition has improved with the development of deep learning technology. However, mixed sound from multiple speakers (overlapping speech) remains an example of a situation in which speech recognition is difficult. The following techniques have been devised to deal with this.
Blind source separation enables speech recognition by separating a mixed sound, which is difficult to recognize as it is, into the speech of each speaker (see, for example, Non-Patent Document 1).
Target speaker extraction uses a pre-registered utterance of the target speaker as auxiliary information and obtains only the voice of that pre-registered speaker from the mixed sound (see, for example, Non-Patent Document 2). It has the advantage of not requiring prior information about the number of speakers in the mixed sound and is a practically useful technique. Since the extracted speech contains only the target speaker's voice, speech recognition becomes possible.
Target speaker extraction is implemented with a neural network. Its training uses paired data whose inputs are a mixed sound containing the target speaker's speech and a pre-registered utterance of the target speaker, and whose output is the target speaker's speech to be extracted from the mixed sound. That is, to train a target speaker extraction model, the dataset must include, as the pre-registered utterance of each target speaker to be extracted, an utterance by that speaker (the enrollment speech) that is different from the speech to be extracted.
Therefore, target speaker extraction could not be learned from datasets without speaker labels or from datasets containing the speech of unspecified speakers. Even when speaker labels were assigned, it could not be learned from datasets in which each speaker utters only a single utterance.
As a result, it was difficult to train a target speaker extraction model on, or adapt it to the domain of, speech logs anonymized for privacy reasons or speech logs of applications in which multiple utterances from the same speaker often cannot be obtained.
Furthermore, when training target speaker extraction from recordings of meetings and the like, speaker annotations may have to be made manually, and that cost can be a burden. Moreover, when speakers with similar voice characteristics are present, perfect speaker annotation is not always possible.
In view of the above problems, the object of the present invention is to learn practically usable target speaker extraction from a speech database without speaker labels or a speech database containing the speech of an unspecified number of speakers.
To solve the above problems, a speech processing learning method according to one aspect of the present invention performs: a feature extraction process that extracts, from the speech received as the enrollment speech, which is the same speech as the target speech (the utterance of the target speaker), a feature amount consisting of time-series data of fixed-length vectors; a speaker expression extraction process that extracts a speaker expression, which is a fixed-length vector, from the extracted feature amount; a target speaker extraction process that uses the speaker expression to extract the speech estimated to be the target speech from a mixed sound composed of the target speech, non-target speech uttered by a speaker different from the target speaker, and noise; and an optimization process that calculates a loss function using the speech extracted by the target speaker extraction process and the target speech, and optimizes the target speaker extraction process so that the calculated value is minimized.
According to the present invention, practically usable target speaker extraction can be learned from a speech database without speaker labels or a speech database containing the speech of an unspecified number of speakers.
FIG. 1 is a diagram showing a functional configuration example of a speech processing learning device according to an embodiment of the present invention.
FIG. 2 is a diagram showing a processing flow example of the speech processing learning method in the speech processing learning device according to the embodiment.
FIG. 3 is a diagram showing a functional configuration example of a learning data creation device according to the embodiment.
FIG. 4 is a diagram showing a processing flow example of the learning data creation method in the learning data creation device according to the embodiment.
FIG. 5 is a diagram showing a functional configuration example of a first modification of the speech processing learning device according to the embodiment.
FIG. 6 is a diagram showing a processing flow example of the speech processing learning method in the first modification.
FIG. 7 is a diagram showing a functional configuration example of a second modification of the speech processing learning device according to the embodiment.
FIG. 8 is a diagram showing a processing flow example of the speech processing learning method in the second modification.
FIG. 9 is a diagram showing an example of performance evaluation results of the target speaker extraction model in a target speaker extraction experiment.
FIG. 10 is a diagram illustrating a functional configuration of a computer.
Hereinafter, embodiments of the present invention will be described in detail. Components having the same function are given the same reference numerals, and redundant description is omitted.
FIG. 1 shows a functional configuration example of a speech processing learning device according to one embodiment of the present invention. The speech processing learning device 1 shown in FIG. 1 includes a feature extraction unit 11, a speaker expression extraction unit 13, a target speaker extraction unit 15, and an optimization unit 16. FIG. 2 shows a processing flow example of the speech processing learning method in the speech processing learning device according to the embodiment. The speech processing learning method of the embodiment is realized by the speech processing learning device 1 performing steps S11 to S16 illustrated in FIG. 2. In one aspect, the speech processing learning device 1 learns using the target speech i, the enrollment speech i′, and the mixed sound X that constitute the learning data γ created by a learning data creation device 2, described later, and thereby creates a target speaker extraction model Z, which is a trained model.
Hereinafter, the functions of the speech processing learning device 1 and the speech processing learning method it performs will be described with reference to FIGS. 1 and 2, along with the processing performed by each of its elements.
[Feature extraction unit 11]
The feature extraction unit 11 performs feature extraction processing (step S11). That is, the feature extraction unit 11 acquires the enrollment speech i′ from the paired data stored in the learning data γ, described later, performs feature extraction on this enrollment speech i′, and obtains time-series data of a fixed-length vector representation. The feature extraction unit 11 outputs the obtained time-series data to the speaker expression extraction unit 13. For feature extraction, for example, known filter bank features over short-time windows or a short-time Fourier transform result may be used. The feature extraction unit 11 can also be configured as a neural network; an example is a known convolutional neural network. In the usual approach, the target speech and the enrollment speech are utterances of the same speaker but are different utterances; in the present invention, as described later, the enrollment speech i′ stored in the learning data γ is set to the same speech as the target speech i, which is the utterance of the target speaker.
Therefore, the feature extraction unit 11 may instead be configured to receive the target speech i from the learning data γ, set the received target speech i as the enrollment speech i′, and perform the feature extraction described above on that enrollment speech i′.
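As a concrete illustration of step S11, the following is a minimal sketch of one possible feature extraction front end, here a log-mel filter bank computed with torchaudio. The function name and the frame parameters (25 ms window, 10 ms hop, 80 mel bins, 16 kHz sampling) are illustrative assumptions and are not specified in this publication.

```python
# Minimal sketch of step S11: turn an enrollment waveform into time-series
# feature vectors. Parameter values are illustrative assumptions only.
import torch
import torchaudio

def extract_features(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """waveform: (1, num_samples) -> features: (num_frames, n_mels)."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=400,        # 25 ms window at 16 kHz
        hop_length=160,   # 10 ms hop
        n_mels=80,
    )
    spec = mel(waveform)                        # (1, n_mels, num_frames)
    log_spec = torch.log(spec + 1e-6)           # log compression
    return log_spec.squeeze(0).transpose(0, 1)  # (num_frames, n_mels)

# Usage: with i' set equal to i, the enrollment features come from the target speech itself.
# enroll_feats = extract_features(target_speech_i)
```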
[Speaker expression extraction unit 13]
The speaker expression extraction unit 13 performs speaker expression extraction processing (step S13). That is, the speaker expression extraction unit 13 extracts a speaker expression, which is a fixed-length vector, from the time-series data received from the feature extraction unit 11 in step S11. The speaker expression extraction unit 13 can be constructed using a known neural network or a known speaker expression extractor such as i-vector. The speaker expression extraction unit 13 outputs the extracted speaker expression to the target speaker extraction unit 15.
[Target speaker extraction unit 15]
The target speaker extraction unit 15 performs target speaker extraction processing (step S15). That is, the target speaker extraction unit 15 uses the speaker expression received from the speaker expression extraction unit 13 in step S13, acquires the mixed sound X from the paired data stored in the learning data γ, described later, and extracts from the mixed sound X the utterance estimated to be that of the target speaker. The target speaker extraction unit 15 can be configured with a known neural network; an example is a known network structure such as ConvTasNet. The target speaker extraction unit 15 outputs the utterance estimated to be that of the target speaker to the optimization unit 16. The target speaker extraction unit 15 also receives optimized parameters from the optimization unit 16 in step S16, described later, and creates the target speaker extraction model Z by reflecting those parameters in the parameters of its model for target speaker extraction.
[Optimization unit 16]
The optimization unit 16 performs optimization processing (step S16). That is, the optimization unit 16 calculates a loss function using the utterance estimated to be that of the target speaker, received from the target speaker extraction unit 15 in step S15, and the target speech i received from the paired data stored in the learning data γ, described later. The optimization unit 16 optimizes the parameters of the target speaker extraction model of the target speaker extraction unit 15 so that the calculated value of the loss function is minimized.
An example of the loss function used by the optimization unit 16 is the known sd-SNR, and an example of the optimization method is the well-known Adam method. The optimization processing can be applied to all parameters of the model for target speaker extraction, or some parameters can be fixed and only the remaining parameters optimized. The optimization unit 16 outputs the optimized parameters of the model for target speaker extraction to the target speaker extraction unit 15. These parameters are reflected in the parameters of the target speaker extraction model held by the target speaker extraction unit 15, and the target speaker extraction model Z is thereby created.
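A minimal sketch of step S16 is given below, under the assumption that the SNR-type loss is realized as a negative scale-invariant SDR (used here as a stand-in for the sd-SNR named above) and that all parameters are updated with Adam; freezing a subset of parameters would simply exclude them from the optimizer. The commented usage refers to the hypothetical helpers sketched in the earlier examples.

```python
# Minimal sketch of step S16: SNR-type loss between the extracted speech and the
# target speech i, minimized with Adam. The scale-invariant SDR below is a stand-in
# for the sd-SNR named in the text.
import torch

def neg_si_sdr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """estimate, target: (batch, samples). Returns a loss value to minimize."""
    target = target - target.mean(dim=-1, keepdim=True)
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    scale = (estimate * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    projection = scale * target                       # target component of the estimate
    noise = estimate - projection                     # residual distortion
    ratio = projection.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
    return -(10 * torch.log10(ratio + eps)).mean()

# One optimization step over a (target_i, enrollment_i_prime, mixture_X) triple:
# optimizer = torch.optim.Adam(list(extractor.parameters()) + list(spk_net.parameters()), lr=1e-3)
# feats = extract_features(enrollment_i_prime)         # step S11 (with i' = i)
# emb = spk_net(feats)                                  # step S13
# estimate = extractor(mixture_X, emb.unsqueeze(0))     # step S15
# loss = neg_si_sdr(estimate.squeeze(1), target_i)      # step S16
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```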
By the speech processing learning device 1 performing steps S11 to S16 described above, the speech processing learning method is realized, and the target speaker extraction model Z shown in FIG. 1, which is a trained model, is created.
<Learning data creation device>
A method of creating the learning data used by the speech processing learning device 1 according to the embodiment of the present invention will be described below. FIG. 3 shows a functional configuration example of a learning data creation device according to an embodiment of the present invention.
The learning data creation device 2 shown in FIG. 3 is a device that creates the learning data γ, consisting of paired data used by the speech processing learning device 1, in order to learn target speaker extraction. As shown in FIG. 3, the learning data creation device 2 includes a target utterance extraction unit 21, a non-target utterance extraction unit 22, a noise extraction unit 23, a voice mixing unit 24, and a data creation unit 25. FIG. 4 shows a processing flow example of the learning data creation method in the learning data creation device according to the embodiment. The learning data creation method according to one embodiment of the present invention is realized by the learning data creation device 2 performing steps S21 to S25 illustrated in FIG. 4.
Hereinafter, the functions of the learning data creation device 2 and the learning data creation method it performs will be described with reference to FIGS. 3 and 4, along with the processing performed by each of its elements.
[Target utterance extraction unit 21]
The target utterance extraction unit 21 performs target utterance extraction processing (step S21). That is, the target utterance extraction unit 21 receives speech data from a previously prepared speech database α containing utterances of an unspecified number of speakers, and extracts from it the target speech i, which is the utterance of the target speaker. The target utterance extraction unit 21 outputs the extracted target speech i to the voice mixing unit 24 and the data creation unit 25.
[Non-target utterance extraction unit 22]
The non-target utterance extraction unit 22 performs non-target utterance extraction processing (step S22). That is, the non-target utterance extraction unit 22 receives speech data from the speech database α described above, and extracts from it the non-target speech k, which is speech uttered by a speaker other than the target speaker. The non-target utterance extraction unit 22 outputs the extracted non-target speech k to the voice mixing unit 24.
[Noise extraction unit 23]
The noise extraction unit 23 performs noise extraction processing (step S23). That is, the noise extraction unit 23 receives noise data from a noise database β prepared in advance and extracts from it the noise r, which is a noise signal. The noise extraction unit 23 outputs the extracted noise r to the voice mixing unit 24.
[Voice mixing unit 24]
The voice mixing unit 24 performs voice mixing processing (step S24). That is, the voice mixing unit 24 mixes the target speech i received from the target utterance extraction unit 21, the non-target speech k received from the non-target utterance extraction unit 22, and the noise r received from the noise extraction unit 23 with arbitrary gains to create the mixed sound X. The voice mixing unit 24 outputs the created mixed sound X to the data creation unit 25.
[Data creation unit 25]
The data creation unit 25 performs data creation processing (step S25). That is, the data creation unit 25 creates the learning data γ, which is a data set, based on the target speech i received from the target utterance extraction unit 21 and the mixed sound X received from the voice mixing unit 24. The data creation unit 25 sets the enrollment speech i′ to the same speech as the target speech i, which is the utterance of the target speaker, and then creates the paired data of the three signals, the target speech i, the enrollment speech i′, and the mixed sound X, as the learning data γ.
By the learning data creation device 2 performing steps S21 to S25 described above, the learning data creation method is realized, and the learning data used by the speech processing learning device 1 can be created.
The learning data creation device 2 can thus create the learning data used by the speech processing learning device 1 without preparing any enrollment speech other than the target speech i. The learning data creation device 2 can construct a learning data set from a speech database without speaker labels or from a speech database α containing utterances of an unspecified number of speakers.
The speech database α described above may also consist of utterance data of an unspecified number of speakers together with data with speaker labels. In that case, paired data is generated for the unlabeled utterance data using the learning data creation method described above, while paired data for the speaker-labeled speech can be generated by a conventional method that uses the speaker labels. Alternatively, paired data can also be generated for the speaker-labeled speech using the technique of the embodiment described above.
In the above description, steps S21, S22, and S23 are performed in this order, but these three processes may be performed in any order, and two or all three of them may also be performed in parallel.
<First modification>
A first modification of the embodiment of the present invention will be described below. FIG. 5 is a diagram showing a functional configuration example of the first modification of the speech processing learning device according to the embodiment. In this modification, the speech processing learning device 1 has a first data extension unit 12 in addition to the elements of the speech processing learning device 1 shown in FIG. 1. FIG. 6 shows a processing flow example of the speech processing learning method in the first modification. The speech processing learning method of this modification is realized by the speech processing learning device 1 shown in FIG. 5 performing steps S11 to S16 illustrated in FIG. 6.
Hereinafter, the functions of the speech processing learning device 1 of FIG. 5 and the speech processing learning method it performs will be described with reference to FIGS. 5 and 6, focusing on the functions and processing that differ from the speech processing learning device 1 of FIG. 1.
[First data extension unit 12]
The first data extension unit 12 performs first data extension processing (step S12). That is, the first data extension unit 12 applies a data augmentation transformation to the time-series data, which is the feature amount extracted by the feature extraction unit 11 in step S11. Examples of data augmentation methods include randomly replacing part of the time-frequency representation with a predetermined value such as 0 (zero) or an average value; a specific example is the known SpecAugment. As an implementation, one or more of SpecAugment's frequency masking, time masking, and time warping can be used. The first data extension unit 12 outputs the transformed data to the speaker expression extraction unit 13.
[Speaker expression extraction unit 13]
In this modification, the speaker expression extraction unit 13 performs the speaker expression extraction of step S13 using the transformed data received from the first data extension unit 12. The other processing is the same as the processing of step S13 shown in FIG. 2.
By the speech processing learning device 1 of FIG. 5 performing steps S11 to S16 of FIG. 6, the speech processing learning method is realized, and the target speaker extraction model Z shown in FIG. 5, which is a trained model, is created.
<Second modification>
A second modification of the embodiment of the present invention will be described below. FIG. 7 is a diagram showing a functional configuration example of the second modification of the speech processing learning device according to the embodiment. In this modification, the speech processing learning device 1 has a second data extension unit 14 in addition to the elements of the speech processing learning device 1 shown in FIG. 5. FIG. 8 shows a processing flow example of the speech processing learning method in the second modification. The speech processing learning method of this modification is realized by the speech processing learning device 1 shown in FIG. 7 performing steps S11 to S16 illustrated in FIG. 8.
Hereinafter, the functions of the speech processing learning device 1 of FIG. 7 and the speech processing learning method it performs will be described with reference to FIGS. 7 and 8, focusing on the functions and processing that differ from the speech processing learning device 1 of FIG. 5.
[Second data extension unit 14]
The second data extension unit 14 performs the second data extension process (step S14). That is, it applies a data augmentation transform to the fixed-length speaker expression extracted by the speaker expression extraction unit 13 in step S13. One example of such an augmentation is to replace some elements of the vector with a predetermined value such as 0 (zero) or the mean value using the well-known dropout technique.
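As an illustration only, the following sketch applies dropout-style masking to a fixed-length speaker expression; the embedding dimension, drop probability, and replacement values are assumptions made for this example rather than values specified by the embodiment.

```python
import numpy as np

def augment_speaker_embedding(embedding, drop_prob=0.1, replacement="zero"):
    """Randomly replace a fraction of the embedding's elements with a fixed value.

    replacement="zero" sets the selected elements to 0, as in standard dropout;
    replacement="mean" sets them to the mean of the embedding instead.
    """
    mask = np.random.rand(embedding.shape[0]) < drop_prob
    value = 0.0 if replacement == "zero" else float(embedding.mean())
    augmented = embedding.copy()
    augmented[mask] = value
    return augmented

# Example: a 256-dimensional speaker expression with roughly 10% of its elements replaced.
speaker_expression = np.random.randn(256).astype(np.float32)
augmented_expression = augment_speaker_embedding(speaker_expression, drop_prob=0.1)
```

Masking individual elements in this way is the usual motivation for dropout-style augmentation: the downstream extractor is discouraged from relying on any single dimension of the speaker representation.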
[Target speaker extraction unit 15]
In this modification, the process of step S15 performed by the target speaker extraction unit 15 extracts the target speaker using the transformed data received from the second data extension unit 14. The rest of the processing is the same as the processing of step S15 shown in FIG. 2.
The speech processing learning method is realized by the speech processing learning device 1 of FIG. 7 performing the processing from step S11 to step S16 of FIG. 8. As a result, the target speaker extraction model Z shown in FIG. 7, which is a trained model, is created.
In FIG. 8, both the process of the first data extension unit 12 (step S12) and the process of the second data extension unit 14 (step S14) are performed; however, the speech processing learning device 1 may be configured to perform only one of them. That is, as in the first modification, the speech processing learning device 1 may be configured to perform only the process of the first data extension unit 12 (step S12) without performing the process of the second data extension unit 14 (step S14). Conversely, it may be configured to perform only the process of the second data extension unit 14 (step S14) without performing the process of the first data extension unit 12 (step S12). A sketch of one training step with both augmentations switchable is given below.
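The following minimal PyTorch-style sketch shows how one training step might look when the target speech itself serves as the enrollment speech and either, both, or neither augmentation is enabled. All modules (feature_extractor, speaker_encoder, extractor), their dimensions, the masking probabilities, and the mean-squared-error loss are hypothetical placeholders chosen for illustration; the embodiment does not prescribe these architectures or this particular loss function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in modules; the embodiment does not prescribe these architectures.
feature_extractor = nn.Linear(80, 80)        # stands in for the feature extraction of step S11
speaker_encoder = nn.Linear(80, 256)         # stands in for step S13 (mean-pooled below)
extractor = nn.Linear(80 + 256, 80)          # stands in for the target speaker extraction of step S15

params = (list(feature_extractor.parameters())
          + list(speaker_encoder.parameters())
          + list(extractor.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)

def training_step(mixture_spec, target_spec, use_first_aug=True, use_second_aug=True):
    """One training step in which the target speech is reused as the enrollment speech."""
    # Step S11: features of the enrollment speech (= the target speech i itself).
    enroll_feats = feature_extractor(target_spec)                     # (batch, frames, 80)

    # Step S12 (optional): randomly zero out some frequency bins, SpecAugment-style.
    if use_first_aug:
        freq_mask = (torch.rand(enroll_feats.size(0), 1, enroll_feats.size(2)) > 0.1).float()
        enroll_feats = enroll_feats * freq_mask

    # Step S13: fixed-length speaker expression obtained by mean pooling over time.
    speaker_expr = speaker_encoder(enroll_feats).mean(dim=1)          # (batch, 256)

    # Step S14 (optional): dropout-style masking of the speaker expression.
    if use_second_aug:
        speaker_expr = F.dropout(speaker_expr, p=0.1, training=True)

    # Step S15: estimate the target speech from the mixture, conditioned on the speaker expression.
    expr_tiled = speaker_expr.unsqueeze(1).expand(-1, mixture_spec.size(1), -1)
    estimate = extractor(torch.cat([mixture_spec, expr_tiled], dim=-1))

    # Step S16: loss between the estimate and the true target (MSE here, only as an example) and update.
    loss = F.mse_loss(estimate, target_spec)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with random stand-in data: batch of 4, 300 frames, 80-dimensional features.
loss_value = training_step(torch.randn(4, 300, 80), torch.randn(4, 300, 80))
```

Passing use_first_aug=False or use_second_aug=False reproduces the single-augmentation configurations described above.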
<Performance evaluation results>
Performance evaluation results of target speaker extraction models trained with the learning methods described above will now be described. FIG. 9 shows an example of the performance evaluation results of the target speaker extraction models in a target speaker extraction experiment. FIG. 9 compares four cases: a model trained with conventional speaker labels ((a) in FIG. 9), a model trained with the speech processing learning method of FIG. 2 ((b) in FIG. 9), a model trained with the speech processing learning method of FIG. 6 ((c) in FIG. 9), and a model trained with the speech processing learning method of FIG. 8 ((d) in FIG. 9).
The Corpus of Spontaneous Japanese (CSJ) was used for this experiment. Two metrics were adopted to measure target speaker extraction performance: the signal-to-distortion ratio (SDR) and the character error rate (CER). A higher SDR indicates higher extraction performance, and a lower CER indicates higher extraction performance.
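For reference, the formulations below are the commonly used definitions of these two metrics; the document itself does not spell them out, so they are stated here only as the usual conventions (a BSS-eval-style SDR and an edit-distance-based CER).

```latex
\mathrm{SDR} = 10\log_{10}\frac{\lVert s_{\mathrm{target}}\rVert^{2}}
{\lVert e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}}\rVert^{2}},
\qquad
\mathrm{CER} = \frac{S + D + I}{N}
```

Here s_target is the component of the extracted signal aligned with the true target speech, the e terms are the interference, noise, and artifact error components, S, D, and I are the numbers of character substitutions, deletions, and insertions in the recognition result, and N is the number of characters in the reference transcript.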
All results in FIG. 9 use the Corpus of Spontaneous Japanese (CSJ) and are evaluated on the same evaluation set. The conditions with and without speaker labels use the same pairs of mixed sound X and target speech i during training and differ only in how the enrollment speech i' is selected. The performance of the models can therefore be compared directly.
As shown in (b) of FIG. 9, the model trained with the speech processing learning method of FIG. 2 achieved a signal-to-distortion ratio (SDR) of 15.5 and a character error rate (CER) of 12.3%. Even compared with the model trained with conventional speaker labels ((a) in FIG. 9), which achieved an SDR of 17.3 and a CER of 8.1%, this represents target speaker extraction performance sufficient for practical use.
By adopting the speech processing learning method of the speech processing learning device 1 of this embodiment, the information in the enrollment speech i' is varied appropriately, so that the enrollment speech i' sufficiently fulfills the role of a true enrollment speech. In other words, the result in (b) of FIG. 9 shows that a practical level of target speaker extraction performance is achieved even with speech data that carries no speaker labels.
As shown in (c) of FIG. 9, the model trained with the speech processing learning method of FIG. 6 achieved an SDR of 16.4 and a CER of 11.5%; compared with the model in (b) of FIG. 9, the target speaker extraction performance is higher in both SDR and CER.
Similarly, as shown in (d) of FIG. 9, the model trained with the speech processing learning method of FIG. 8 achieved an SDR of 17.2 and a CER of 9.7%; compared with the model in (b) of FIG. 9, the target speaker extraction performance is again higher in both SDR and CER.
The results in (c) and (d) of FIG. 9 show that the data augmentation methods achieve high target speaker extraction performance even with speech data that has no speaker labels.
When the target speech i is used as the enrollment speech i', no variation in speaker characteristics arises between the target speech and the enrollment speech. Therefore, compared with the conventional approach, in which the enrollment speech is a different utterance by the target speaker, sufficient robustness against such variation may not be acquired. Adopting the data augmentation methods of this embodiment contributes to acquiring robustness against this variation.
In particular, in the case of (d) in FIG. 9, the SDR of 17.2 is only marginally lower than the SDR of 17.3 obtained with speaker labels in (a) of FIG. 9. The CER of 9.7% in (d) of FIG. 9 is likewise close to the CER of 8.1% in (a) of FIG. 9. Speech resources without speaker labels are available in far larger quantities than those with speaker labels, especially among real-world data. In this experiment, which used the same amount of data, the model performance in (d) of FIG. 9 was not significantly inferior to that in (a) of FIG. 9. It is therefore expected that practical performance can be improved substantially by training on the large amounts of speech data without speaker labels that are available.
The speech processing learning device and speech processing learning method have been described above for one embodiment of the present invention and its modifications. This approach is considered to achieve a practical level of target speaker extraction performance even for speech data without speaker labels. That is, regardless of whether speaker labels are assigned, practically usable target speaker extraction can be realized without preparing separate enrollment speech.
Owing to the above effects, it becomes possible to train target speaker extraction using data that could not be exploited previously. That is, the range of speech data that can be used for training the target speaker extraction model is expanded.
This broader range of usable data yields two benefits. First, by using data that could not previously be included in the training data, the variety of utterances and speakers in the data can be increased, which improves the robustness of speech enhancement to speaker differences and thus the performance of target speaker extraction. Second, domain adaptation can be achieved by retraining the speech enhancement even on datasets without speaker labels or datasets containing much data in which each speaker contributes only a single utterance.
Note that the various processes described above are not only executed chronologically in the order described, but may also be executed in parallel or individually depending on the processing capacity of the device executing the processes or as necessary. It goes without saying that appropriate modifications are possible without departing from the gist of the present invention.
[Program, recording medium]
The various processes described above can be implemented by loading a program that executes each step of the above methods into the recording unit 2020 of the computer 2000 shown in FIG. 10 and operating the control unit 2010, the input unit 2030, the output unit 2040, the display unit 2050, and the like.
The program describing this processing can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any type, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.
The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers via a network.
A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to that program, or it may sequentially execute processing according to the received program each time the program is transferred from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in this embodiment includes information that is used for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).
In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing may instead be realized in hardware.

Claims (8)

1.  A speech processing learning method comprising:
     a feature extraction process of extracting a feature amount composed of time-series data of fixed-length vectors from a speech received as an enrollment speech, the speech being identical to a target speech that is an uttered speech of a target speaker;
     a speaker expression extraction process of extracting a speaker expression, which is a fixed-length vector, from the feature amount extracted by the feature extraction process;
     a target speaker extraction process of extracting, using the speaker expression extracted by the speaker expression extraction process, a speech estimated to be the target speech from a mixed sound composed of the target speech, a non-target speech that is a speech of a speaker different from the target speaker, and noise; and
     an optimization process of calculating a loss function using the speech extracted by the target speaker extraction process and the target speech, and optimizing the target speaker extraction process so that the calculated value is minimized.
2.  The speech processing learning method according to claim 1, further comprising a first data extension process of performing data augmentation on the feature amount extracted by the feature extraction process,
     wherein the speaker expression extraction process extracts the speaker expression, which is a fixed-length vector, from the feature amount augmented by the first data extension process.
3.  The speech processing learning method according to claim 2, wherein the data augmentation in the first data extension process randomly replaces part of the feature amount with a predetermined value.
4.  The speech processing learning method according to claim 1, further comprising a second data extension process of performing data augmentation on the speaker expression extracted by the speaker expression extraction process,
     wherein the target speaker extraction process extracts the speech estimated to be the target speech from the mixed sound using the speaker expression augmented by the second data extension process.
5.  The speech processing learning method according to claim 4, wherein the data augmentation in the second data extension process replaces part of the speaker expression with a predetermined value.
6.  A speech processing learning device comprising:
     a feature extraction unit that extracts a feature amount composed of time-series data of fixed-length vectors from a speech received as an enrollment speech, the speech being identical to a target speech that is an uttered speech of a target speaker;
     a speaker expression extraction unit that extracts a speaker expression, which is a fixed-length vector, from the feature amount extracted by the feature extraction unit;
     a target speaker extraction unit that extracts, using the speaker expression extracted by the speaker expression extraction unit, a speech estimated to be the target speech from a mixed sound composed of the target speech, a non-target speech that is a speech of a speaker different from the target speaker, and noise; and
     an optimization unit that calculates a loss function using the speech extracted by the target speaker extraction unit and the target speech, and optimizes the target speaker extraction unit so that the calculated value is minimized.
7.  The speech processing learning device according to claim 6, further comprising:
     a first data extension unit that performs data augmentation on the feature amount extracted by the feature extraction unit; and
     a second data extension unit that performs data augmentation on the speaker expression extracted by the speaker expression extraction unit,
     wherein the speaker expression extraction unit extracts the speaker expression, which is a fixed-length vector, from the feature amount augmented by the first data extension unit, and
     the target speaker extraction unit extracts the speech estimated to be the target speech from the mixed sound using the speaker expression augmented by the second data extension unit.
8.  A program for causing a computer to execute the speech processing learning method according to any one of claims 1 to 5.
PCT/JP2022/001315 2022-01-17 2022-01-17 Voice processing learning method, voice processing learning device, and program WO2023135788A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/001315 WO2023135788A1 (en) 2022-01-17 2022-01-17 Voice processing learning method, voice processing learning device, and program


Publications (1)

Publication Number Publication Date
WO2023135788A1

Family

ID=87278631


Country Status (1)

Country Link
WO (1) WO2023135788A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05143094A (en) * 1991-11-26 1993-06-11 Sekisui Chem Co Ltd Speaker recognition system
JP2019219574A (en) * 2018-06-21 2019-12-26 株式会社東芝 Speaker model creation system, recognition system, program and control device



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22920313

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2023573787

Country of ref document: JP

Kind code of ref document: A