WO2023135788A1 - Voice processing learning method, voice processing learning device, and program - Google Patents

Voice processing learning method, voice processing learning device, and program

Info

Publication number
WO2023135788A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
speech
target
data
expression
Prior art date
Application number
PCT/JP2022/001315
Other languages
French (fr)
Japanese (ja)
Inventor
Hiroshi Sato
Naoki Makishima
Original Assignee
Nippon Telegraph and Telephone Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to PCT/JP2022/001315
Publication of WO2023135788A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches

Definitions

  • the present invention relates to speech recognition technology, and more particularly to target speaker extraction technology for extracting only the target speaker's uttered voice from mixed sound that includes other speakers' utterances and noise in addition to the target speaker's utterances.
  • Blind source separation enables speech recognition by separating speech, which is difficult to recognize as a mixed sound, into the speech of each speaker (see, for example, Non-Patent Document 1).
  • Target speaker extraction uses the pre-registered utterances of the target speaker as auxiliary information, and obtains only the voice of the pre-registered speaker from the mixed sound (see, for example, Non-Patent Document 2). It has the advantage of not requiring prior information on the number of speakers included in the mixed sound, and is a practically useful technique. Since the extracted speech contains only the voice of the target speaker, speech recognition is possible.
  • target speaker extraction is implemented with a neural network. Its training uses paired data whose inputs are a mixed sound containing the target speaker's speech and a pre-registered utterance of the target speaker, and whose output is the target speaker's speech to be extracted from the mixed sound.
  • that is, to train a target speaker extraction model, the dataset must include, as the pre-registered utterance of each target speaker to be extracted, an utterance by that speaker that is different from the speech to be extracted (the enrollment speech).
  • the object of the present invention is to learn practically usable target speaker extraction from a speech database without speaker labels or a speech database containing the speech of an unspecified number of speakers.
  • a speech processing learning method according to one aspect performs: a feature extraction process that extracts, from the speech received as the enrollment speech, which is the same speech as the target speech (the utterance of the target speaker), a feature amount consisting of time-series data of fixed-length vectors; a speaker expression extraction process that extracts a speaker expression, which is a fixed-length vector, from the feature amount extracted by the feature extraction process; a target speaker extraction process that uses the extracted speaker expression to extract the speech estimated to be the target speech from a mixed sound composed of the target speech, non-target speech uttered by a speaker different from the target speaker, and noise; and an optimization process that calculates a loss function using the speech extracted by the target speaker extraction process and the target speech, and optimizes the target speaker extraction process so that the calculated value is minimized.
  • FIG. 1 is a diagram showing a functional configuration example of a speech processing learning device according to an embodiment of the present invention.
  • FIG. 2 is a diagram showing a processing flow example of the speech processing learning method in the speech processing learning device according to the embodiment.
  • FIG. 5 is a diagram showing a functional configuration example of a first modification of the speech processing learning device according to the embodiment.
  • the speech processing learning device 1 shown in FIG. 1 includes a feature extraction unit 11, a speaker expression extraction unit 13, a target speaker extraction unit 15, and an optimization unit 16.
  • the speech processing learning method of the embodiment is realized by the speech processing learning device 1 performing steps S11 to S16 illustrated in FIG. 2.
  • in one aspect, the speech processing learning device 1 learns using the target speech i, the enrollment speech i′, and the mixed sound X that constitute the learning data γ created by a learning data creation device 2, described later, and thereby creates a target speaker extraction model Z, which is a trained model.
  • the feature extraction unit 11 performs feature extraction processing (step S11). That is, the feature extraction unit 11 acquires the enrollment speech i′ from the paired data stored in the learning data γ, described later, performs feature extraction on this enrollment speech i′, and obtains time-series data of a fixed-length vector representation. The feature extraction unit 11 outputs the obtained time-series data to the speaker expression extraction unit 13.
  • for feature extraction, for example, known filter bank features over short-time windows or a short-time Fourier transform result may be used.
  • the feature extraction unit 11 can also be configured as a neural network.
  • an example of the neural network of the feature extraction unit 11 is a known convolutional neural network.
  • in the usual approach, the target speech and the enrollment speech are utterances of the same speaker but are different utterances; in the present invention, however, the enrollment speech i′ is set to the same speech as the target speech i, which is the utterance of the target speaker.
  • alternatively, the feature extraction unit 11 may be configured to receive the target speech i from the learning data γ, set the received target speech i as the enrollment speech i′, and perform the feature extraction described above on that enrollment speech i′.
  • the speaker expression extraction unit 13 performs speaker expression extraction processing (step S13). That is, the speaker expression extraction unit 13 extracts a speaker expression, which is a fixed-length vector, from the time-series data received from the feature extraction unit 11 in step S11.
  • the speaker expression extraction unit 13 can be constructed using a known neural network or a known speaker expression extractor such as i-vector.
  • the speaker expression extraction unit 13 outputs the extracted speaker expression to the target speaker extraction unit 15.
  • the target speaker extraction unit 15 performs target speaker extraction processing (step S15). That is, the target speaker extraction unit 15 uses the speaker expression received from the speaker expression extraction unit 13 in step S13, acquires the mixed sound X from the paired data stored in the learning data γ, described later, and extracts from the mixed sound X the utterance estimated to be that of the target speaker.
  • the target speaker extraction unit 15 can be configured with a known neural network.
  • an example of the neural network of the target speaker extraction unit 15 is a known network structure such as ConvTasNet.
  • the target speaker extraction unit 15 outputs the utterance estimated to be that of the target speaker to the optimization unit 16.
  • the target speaker extraction unit 15 also receives optimized parameters from the optimization unit 16 in step S16, described later, and creates the target speaker extraction model Z by reflecting those parameters in the parameters of its model for target speaker extraction.
  • the optimization unit 16 performs optimization processing (step S16). That is, the optimization unit 16 calculates a loss function using the utterance estimated to be that of the target speaker, received from the target speaker extraction unit 15 in step S15, and the target speech i from the paired data stored in the learning data γ, described later. The optimization unit 16 then optimizes the parameters of the target speaker extraction model of the target speaker extraction unit 15 so that the calculated value of the loss function is minimized.
  • an example of the loss function used by the optimization unit 16 is the known sd-SNR.
  • an example of the optimization method is the well-known Adam method.
  • the optimization processing performed by the optimization unit 16 can be applied to all parameters of the model for target speaker extraction.
  • the optimization processing can also be applied only to the remaining parameters while some parameters are kept fixed.
  • the optimization unit 16 outputs the optimized parameters of the model for target speaker extraction to the target speaker extraction unit 15.
  • the parameters output from the optimization unit 16 to the target speaker extraction unit 15 are reflected in the parameters of the target speaker extraction model held by the target speaker extraction unit 15, and the target speaker extraction model Z is thereby created.
  • the speech processing learning method is realized by the speech processing learning device 1 performing the processing from step S11 to step S16 described above.
  • the target speaker extraction model Z shown in FIG. 1, which is a trained model, is created.
  • FIG. 3 shows a diagram showing a functional configuration example of a learning data creation device according to an embodiment of the present invention.
  • the learning data creation device 2 shown in FIG. 3 is a device that creates the learning data γ, consisting of paired data used by the speech processing learning device 1, in order to learn target speaker extraction.
  • the learning data creation device 2 includes a target utterance extraction unit 21, a non-target utterance extraction unit 22, a noise extraction unit 23, a voice mixing unit 24, and a data creation unit 25.
  • FIG. 4 is a diagram showing a processing flow example of a learning data creation method in the learning data creation device according to the embodiment of the present invention.
  • the learning data creation method according to one embodiment of the present invention is realized by the learning data creation device 2 performing the processing of steps S21 to S25 illustrated in FIG. 4 .
  • the target utterance extraction unit 21 performs target utterance extraction processing (step S21). That is, the target utterance extraction unit 21 receives speech data from a previously prepared speech database α containing utterances of an unspecified number of speakers, and extracts from it the target speech i, which is the utterance of the target speaker. The target utterance extraction unit 21 outputs the extracted target speech i to the voice mixing unit 24 and the data creation unit 25.
  • the non-target utterance extraction unit 22 performs non-target utterance extraction processing (step S22). That is, the non-target utterance extraction unit 22 receives speech data from the speech database α described above and extracts from it the non-target speech k, which is speech uttered by a speaker other than the target speaker. The non-target utterance extraction unit 22 outputs the extracted non-target speech k to the voice mixing unit 24.
  • the noise extraction unit 23 performs noise extraction processing (step S23). That is, the noise extraction unit 23 receives noise data from a noise database β prepared in advance and extracts from it the noise r, which is a noise signal. The noise extraction unit 23 outputs the extracted noise r to the voice mixing unit 24.
  • the voice mixing unit 24 performs voice mixing processing (step S24). That is, the voice mixing unit 24 mixes the target speech i received from the target utterance extraction unit 21, the non-target speech k received from the non-target utterance extraction unit 22, and the noise r received from the noise extraction unit 23 with arbitrary gains to create the mixed sound X.
  • the voice mixing unit 24 outputs the created mixed sound X to the data creation unit 25.
  • the data creation unit 25 performs data creation processing (step S25). That is, the data creation unit 25 creates the learning data γ, which is a data set, based on the target speech i received from the target utterance extraction unit 21 and the mixed sound X received from the voice mixing unit 24.
  • the data creation unit 25 sets the enrollment speech i′ to the same speech as the target speech i, which is the utterance of the target speaker, and then creates the paired data of the three signals, the target speech i, the enrollment speech i′, and the mixed sound X, as the learning data γ.
  • the learning data creation method is implemented by the learning data creation device 2 performing the processing from step S21 to step S25 described above, and the learning data used in the speech processing learning device 1 can be created.
  • the learning data creation device 2 can create learning data used in the speech processing learning device 1 without preparing enrollment voices other than the target voice i.
  • the learning data creation device 2 can construct a learning data set from a speech database without speaker labels or from a speech database α containing utterances of an unspecified number of speakers.
  • the speech database α described above may also consist of utterance data of an unspecified number of speakers together with data with speaker labels.
  • in that case, paired data is generated for the unlabeled utterance data using the learning data creation method described above, while paired data for the speech with speaker labels can also be generated by a conventional method that uses the speaker labels.
  • alternatively, paired data can also be generated for the speech with speaker labels using the technique of the embodiment of the present invention described above.
  • steps S21, S22, and S23 are described above as being performed in this order, but these three processes may be performed in any order, and two or all three of them may also be performed in parallel.
  • FIG. 5 is a diagram showing a functional configuration example of the first modification of the speech processing learning device according to the embodiment of the present invention.
  • in the first modification, the speech processing learning device 1 has a first data extension unit 12 in addition to the elements of the speech processing learning device 1 shown in FIG. 1.
  • FIG. 6 shows a processing flow example of the speech processing learning method in the first modification of the speech processing learning device according to the embodiment of the present invention.
  • the speech processing learning method of this modification is realized by the speech processing learning device 1 shown in FIG. 5 performing steps S11 to S16 illustrated in FIG. 6.
  • the first data extension unit 12 performs a first data extension process (step S12). That is, the first data extension unit 12 performs data augmentation conversion on the time-series data, which is the feature amount extracted by the feature extraction unit 11 in the process of step S11.
  • Examples of data expansion methods include a method of randomly replacing part of the time-frequency representation with a predetermined value such as 0 (zero) or an average value.
  • a specific example is the known SpecAugment method.
  • as an implementation, one or more of SpecAugment's frequency masking, time masking, and time warping can be used.
  • the first data extension unit 12 outputs the converted data to the speaker expression extraction unit 13 .
  • in this modification, the speaker expression extraction unit 13 performs the speaker expression extraction of step S13 using the converted data received from the first data extension unit 12; the other processing is the same as the processing of step S13 shown in FIG. 2.
  • the speech processing learning method is realized by the speech processing learning device 1 of FIG. 5 performing steps S11 to S16 of FIG. 6. As a result, the target speaker extraction model Z shown in FIG. 5, which is a trained model, is created.
  • FIG. 7 is a diagram showing a functional configuration example of a second modification of the speech processing learning device according to the embodiment of the present invention.
  • in the second modification, the speech processing learning device 1 has a second data extension unit 14 in addition to the elements of the speech processing learning device 1 shown in FIG. 5.
  • FIG. 8 shows a processing flow example of the speech processing learning method in the second modification of the speech processing learning device according to the embodiment of the present invention.
  • the speech processing learning method of this modification is realized by the speech processing learning device 1 shown in FIG. 7 performing steps S11 to S16 illustrated in FIG. 8.
  • the second data extension unit 14 performs a second data extension process (step S14). That is, the second data extension unit 14 performs data augmentation conversion on the fixed-length speaker expressions extracted by the speaker expression extraction unit 13 in the process of step S13.
  • as a data augmentation method, for example, some elements of the vector may be replaced with a predetermined value such as 0 (zero) or an average value using the known dropout technique.
  • in this modification, the target speaker extraction unit 15 performs the target speaker extraction of step S15 using the converted data received from the second data extension unit 14.
  • the other processing is the same as the processing of step S15 shown in FIG. 2.
  • the speech processing learning method is realized by the speech processing learning device 1 of FIG. 7 performing steps S11 to S16 of FIG. 8. As a result, the target speaker extraction model Z shown in FIG. 7, which is a trained model, is created.
  • in FIG. 8, both the processing of the first data extension unit 12 (step S12) and the processing of the second data extension unit 14 (step S14) are performed, but the speech processing learning device 1 may be configured to perform only one of them.
  • that is, as in the first modification, the speech processing learning device 1 may perform only the processing of the first data extension unit 12 (step S12) without the processing of the second data extension unit 14 (step S14), or it may perform only the processing of the second data extension unit 14 (step S14) without the processing of the first data extension unit 12 (step S12).
  • FIG. 9 shows an example of the performance evaluation result of the target speaker extraction model when the target speaker extraction experiment was performed.
  • FIG. 9 shows results for four cases: a model trained with conventional speaker labels ((a) in FIG. 9), a model trained with the speech processing learning method of FIG. 2 ((b) in FIG. 9), a model trained with the speech processing learning method of FIG. 6 ((c) in FIG. 9), and a model trained with the speech processing learning method of FIG. 8 ((d) in FIG. 9).
  • the evaluation metrics are the signal-to-distortion ratio (SDR), for which a larger value indicates higher extraction performance, and the character error rate (CER), for which a smaller value indicates higher extraction performance.
  • when using the model trained with the speech processing learning method of FIG. 2 (case (b) of FIG. 9), the signal-to-distortion ratio (SDR) was 15.5 and the character error rate (CER) was 12.3%.
  • it can be seen that the enrollment speech i′, even though it is the same speech as the target speech, plays the role of an ordinary enrollment speech. In other words, from the result of FIG. 9(b), it can be said that a practical level of target speaker extraction performance is achieved even with speech data to which no speaker labels are attached.
  • when the target speech i is used as the enrollment speech i′, there is no variation in speaker characteristics between the target speech and the enrollment speech. Compared with the conventional method, in which the enrollment speech is a different utterance of the target speaker and such variation is present, it can therefore be difficult to obtain sufficient robustness against this variation. The results show that adopting the data augmentation of this embodiment contributes to acquiring robustness against this variation.
  • for case (d) of FIG. 9, the signal-to-distortion ratio (SDR) was 17.2, leaving only a small gap to the result of FIG. 9(a) trained with speaker labels.
  • the character error rate (CER) for case (d) of FIG. 9 was 9.7%, close to the 8.1% CER of the speaker-labeled result in FIG. 9(a).
  • speech resources without speaker labels are available in far larger quantities than those with speaker labels, especially for real data.
  • the performance of the model in FIG. 9(d) was not significantly inferior to that of the model in FIG. 9(a); it is therefore believed that training on the large volumes of available data without speaker labels can greatly improve practical performance.
  • the speech processing learning device and the speech processing learning method have been described above for one embodiment of the present invention and its modifications. This method is believed to achieve a practical level of target speaker extraction performance even for speech data without speaker labels. In other words, regardless of whether speaker labels are assigned, practically usable target speaker extraction can be implemented without preparing enrollment speech.
  • expanding the range of usable data in this way has the following two benefits.
  • one is that data that could not be included in conventional training data can now be used, which increases the variation of utterances and speakers in the data. This makes it possible to improve the robustness of speech enhancement to speaker differences and to improve target speaker extraction performance.
  • the other is that speech enhancement can be retrained on, and thus adapted to, datasets without speaker labels and datasets in which many speakers have only a single utterance.
  • a program describing this processing can be recorded on a computer-readable recording medium.
  • any computer-readable recording medium may be used, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.
  • this program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded.
  • the program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers via a network.
  • a computer that executes such a program, for example, first stores the program recorded on a portable recording medium or transferred from the server computer in its own storage device; when executing the processing, it reads the program from its own recording medium and executes processing according to it. As other execution forms, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may sequentially execute processing according to the program each time the program is transferred from the server computer. The processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in this embodiment also includes information that is used for processing by the computer and that conforms to a program, such as data that is not a direct command to the computer but has the property of prescribing the processing of the computer.
  • in the above, the device is configured by executing a predetermined program on a computer, but at least part of the processing may be implemented in hardware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A feature extraction unit 11 extracts a feature amount from a voice that is the same as the subject voice, which is the voice spoken by the target speaker, received as the enrollment voice. A speaker expression extraction unit 13 extracts a speaker expression from the extracted feature amount. A target speaker extraction unit 15 uses the extracted speaker expression to extract a voice inferred to be the subject voice from a mixed sound constituted by the subject voice, a non-subject voice that is the voice of a speaker different from the target speaker, and noise. An optimization unit 16 calculates a loss function using the inferred voice and the subject voice, and optimizes the target speaker extraction process so that the calculated value is minimized.

Description

Speech processing learning method, speech processing learning device, and program
The present invention relates to speech recognition technology, and more particularly to a target speaker extraction technology for extracting only the target speaker's speech from a mixed sound that contains other speakers' utterances and noise in addition to the target speaker's utterances.
In recent years, the performance of speech recognition has improved with the development of deep learning technology. However, mixed sound from multiple speakers (overlapping speech) remains an example of a situation in which speech recognition is difficult. The following techniques have been devised to deal with this.
Blind source separation enables speech recognition by separating a mixed sound, which is difficult to recognize as it is, into the speech of each speaker (see, for example, Non-Patent Document 1).
Target speaker extraction uses a pre-registered utterance of the target speaker as auxiliary information and obtains only the voice of that pre-registered speaker from the mixed sound (see, for example, Non-Patent Document 2). It has the advantage of not requiring prior information about the number of speakers in the mixed sound and is a practically useful technique. Since the extracted speech contains only the target speaker's voice, speech recognition becomes possible.
Target speaker extraction is implemented with a neural network. Its training uses paired data whose inputs are a mixed sound containing the target speaker's speech and a pre-registered utterance of the target speaker, and whose output is the target speaker's speech to be extracted from the mixed sound. That is, to train a target speaker extraction model, the dataset must include, as the pre-registered utterance of each target speaker to be extracted, an utterance by that speaker (the enrollment speech) that is different from the speech to be extracted.
Therefore, target speaker extraction could not be learned from datasets without speaker labels or from datasets containing the speech of unspecified speakers. Even when speaker labels were assigned, it could not be learned from datasets in which each speaker utters only a single utterance.
As a result, it was difficult to train a target speaker extraction model on, or adapt it to the domain of, speech logs anonymized for privacy reasons or speech logs of applications in which multiple utterances from the same speaker often cannot be obtained.
Furthermore, when training target speaker extraction from recordings of meetings and the like, speaker annotations may have to be made manually, and that cost can be a burden. Moreover, when speakers with similar voice characteristics are present, perfect speaker annotation is not always possible.
In view of the above problems, the object of the present invention is to learn practically usable target speaker extraction from a speech database without speaker labels or a speech database containing the speech of an unspecified number of speakers.
To solve the above problems, a speech processing learning method according to one aspect of the present invention performs: a feature extraction process that extracts, from the speech received as the enrollment speech, which is the same speech as the target speech (the utterance of the target speaker), a feature amount consisting of time-series data of fixed-length vectors; a speaker expression extraction process that extracts a speaker expression, which is a fixed-length vector, from the extracted feature amount; a target speaker extraction process that uses the speaker expression to extract the speech estimated to be the target speech from a mixed sound composed of the target speech, non-target speech uttered by a speaker different from the target speaker, and noise; and an optimization process that calculates a loss function using the speech extracted by the target speaker extraction process and the target speech, and optimizes the target speaker extraction process so that the calculated value is minimized.
According to the present invention, practically usable target speaker extraction can be learned from a speech database without speaker labels or a speech database containing the speech of an unspecified number of speakers.
FIG. 1 is a diagram showing a functional configuration example of a speech processing learning device according to an embodiment of the present invention.
FIG. 2 is a diagram showing a processing flow example of the speech processing learning method in the speech processing learning device according to the embodiment.
FIG. 3 is a diagram showing a functional configuration example of a learning data creation device according to the embodiment.
FIG. 4 is a diagram showing a processing flow example of the learning data creation method in the learning data creation device according to the embodiment.
FIG. 5 is a diagram showing a functional configuration example of a first modification of the speech processing learning device according to the embodiment.
FIG. 6 is a diagram showing a processing flow example of the speech processing learning method in the first modification.
FIG. 7 is a diagram showing a functional configuration example of a second modification of the speech processing learning device according to the embodiment.
FIG. 8 is a diagram showing a processing flow example of the speech processing learning method in the second modification.
FIG. 9 is a diagram showing an example of performance evaluation results of the target speaker extraction model in a target speaker extraction experiment.
FIG. 10 is a diagram illustrating a functional configuration of a computer.
Hereinafter, embodiments of the present invention will be described in detail. Components having the same function are given the same reference numerals, and redundant description is omitted.
FIG. 1 shows a functional configuration example of a speech processing learning device according to one embodiment of the present invention. The speech processing learning device 1 shown in FIG. 1 includes a feature extraction unit 11, a speaker expression extraction unit 13, a target speaker extraction unit 15, and an optimization unit 16. FIG. 2 shows a processing flow example of the speech processing learning method in the speech processing learning device according to the embodiment. The speech processing learning method of the embodiment is realized by the speech processing learning device 1 performing steps S11 to S16 illustrated in FIG. 2. In one aspect, the speech processing learning device 1 learns using the target speech i, the enrollment speech i′, and the mixed sound X that constitute the learning data γ created by a learning data creation device 2, described later, and thereby creates a target speaker extraction model Z, which is a trained model.
Hereinafter, the functions of the speech processing learning device 1 and the speech processing learning method it performs will be described with reference to FIGS. 1 and 2, along with the processing performed by each of its elements.
[Feature extraction unit 11]
The feature extraction unit 11 performs feature extraction processing (step S11). That is, the feature extraction unit 11 acquires the enrollment speech i′ from the paired data stored in the learning data γ, described later, performs feature extraction on this enrollment speech i′, and obtains time-series data of a fixed-length vector representation. The feature extraction unit 11 outputs the obtained time-series data to the speaker expression extraction unit 13. For feature extraction, for example, known filter bank features over short-time windows or a short-time Fourier transform result may be used. The feature extraction unit 11 can also be configured as a neural network; an example is a known convolutional neural network. In the usual approach, the target speech and the enrollment speech are utterances of the same speaker but are different utterances; in the present invention, as described later, the enrollment speech i′ stored in the learning data γ is set to the same speech as the target speech i, which is the utterance of the target speaker.
Therefore, the feature extraction unit 11 may instead be configured to receive the target speech i from the learning data γ, set the received target speech i as the enrollment speech i′, and perform the feature extraction described above on that enrollment speech i′.
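As a concrete illustration of step S11, the following is a minimal sketch of one possible feature extraction front end, here a log-mel filter bank computed with torchaudio. The function name and the frame parameters (25 ms window, 10 ms hop, 80 mel bins, 16 kHz sampling) are illustrative assumptions and are not specified in this publication.

```python
# Minimal sketch of step S11: turn an enrollment waveform into time-series
# feature vectors. Parameter values are illustrative assumptions only.
import torch
import torchaudio

def extract_features(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """waveform: (1, num_samples) -> features: (num_frames, n_mels)."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=400,        # 25 ms window at 16 kHz
        hop_length=160,   # 10 ms hop
        n_mels=80,
    )
    spec = mel(waveform)                        # (1, n_mels, num_frames)
    log_spec = torch.log(spec + 1e-6)           # log compression
    return log_spec.squeeze(0).transpose(0, 1)  # (num_frames, n_mels)

# Usage: with i' set equal to i, the enrollment features come from the target speech itself.
# enroll_feats = extract_features(target_speech_i)
```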
[Speaker expression extraction unit 13]
The speaker expression extraction unit 13 performs speaker expression extraction processing (step S13). That is, the speaker expression extraction unit 13 extracts a speaker expression, which is a fixed-length vector, from the time-series data received from the feature extraction unit 11 in step S11. The speaker expression extraction unit 13 can be constructed using a known neural network or a known speaker expression extractor such as i-vector. The speaker expression extraction unit 13 outputs the extracted speaker expression to the target speaker extraction unit 15.
[Target speaker extraction unit 15]
The target speaker extraction unit 15 performs target speaker extraction processing (step S15). That is, the target speaker extraction unit 15 uses the speaker expression received from the speaker expression extraction unit 13 in step S13, acquires the mixed sound X from the paired data stored in the learning data γ, described later, and extracts from the mixed sound X the utterance estimated to be that of the target speaker. The target speaker extraction unit 15 can be configured with a known neural network; an example is a known network structure such as ConvTasNet. The target speaker extraction unit 15 outputs the utterance estimated to be that of the target speaker to the optimization unit 16. The target speaker extraction unit 15 also receives optimized parameters from the optimization unit 16 in step S16, described later, and creates the target speaker extraction model Z by reflecting those parameters in the parameters of its model for target speaker extraction.
[Optimization unit 16]
The optimization unit 16 performs optimization processing (step S16). That is, the optimization unit 16 calculates a loss function using the utterance estimated to be that of the target speaker, received from the target speaker extraction unit 15 in step S15, and the target speech i received from the paired data stored in the learning data γ, described later. The optimization unit 16 optimizes the parameters of the target speaker extraction model of the target speaker extraction unit 15 so that the calculated value of the loss function is minimized.
An example of the loss function used by the optimization unit 16 is the known sd-SNR, and an example of the optimization method is the well-known Adam method. The optimization processing can be applied to all parameters of the model for target speaker extraction, or some parameters can be fixed and only the remaining parameters optimized. The optimization unit 16 outputs the optimized parameters of the model for target speaker extraction to the target speaker extraction unit 15. These parameters are reflected in the parameters of the target speaker extraction model held by the target speaker extraction unit 15, and the target speaker extraction model Z is thereby created.
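A minimal sketch of step S16 is given below, under the assumption that the SNR-type loss is realized as a negative scale-invariant SDR (used here as a stand-in for the sd-SNR named above) and that all parameters are updated with Adam; freezing a subset of parameters would simply exclude them from the optimizer. The commented usage refers to the hypothetical helpers sketched in the earlier examples.

```python
# Minimal sketch of step S16: SNR-type loss between the extracted speech and the
# target speech i, minimized with Adam. The scale-invariant SDR below is a stand-in
# for the sd-SNR named in the text.
import torch

def neg_si_sdr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """estimate, target: (batch, samples). Returns a loss value to minimize."""
    target = target - target.mean(dim=-1, keepdim=True)
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    scale = (estimate * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    projection = scale * target                       # target component of the estimate
    noise = estimate - projection                     # residual distortion
    ratio = projection.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
    return -(10 * torch.log10(ratio + eps)).mean()

# One optimization step over a (target_i, enrollment_i_prime, mixture_X) triple:
# optimizer = torch.optim.Adam(list(extractor.parameters()) + list(spk_net.parameters()), lr=1e-3)
# feats = extract_features(enrollment_i_prime)         # step S11 (with i' = i)
# emb = spk_net(feats)                                  # step S13
# estimate = extractor(mixture_X, emb.unsqueeze(0))     # step S15
# loss = neg_si_sdr(estimate.squeeze(1), target_i)      # step S16
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```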
By the speech processing learning device 1 performing steps S11 to S16 described above, the speech processing learning method is realized, and the target speaker extraction model Z shown in FIG. 1, which is a trained model, is created.
<Learning data creation device>
A method of creating the learning data used by the speech processing learning device 1 according to the embodiment of the present invention will be described below. FIG. 3 shows a functional configuration example of a learning data creation device according to an embodiment of the present invention.
The learning data creation device 2 shown in FIG. 3 is a device that creates the learning data γ, consisting of paired data used by the speech processing learning device 1, in order to learn target speaker extraction. As shown in FIG. 3, the learning data creation device 2 includes a target utterance extraction unit 21, a non-target utterance extraction unit 22, a noise extraction unit 23, a voice mixing unit 24, and a data creation unit 25. FIG. 4 shows a processing flow example of the learning data creation method in the learning data creation device according to the embodiment. The learning data creation method according to one embodiment of the present invention is realized by the learning data creation device 2 performing steps S21 to S25 illustrated in FIG. 4.
Hereinafter, the functions of the learning data creation device 2 and the learning data creation method it performs will be described with reference to FIGS. 3 and 4, along with the processing performed by each of its elements.
[Target utterance extraction unit 21]
The target utterance extraction unit 21 performs target utterance extraction processing (step S21). That is, the target utterance extraction unit 21 receives speech data from a previously prepared speech database α containing utterances of an unspecified number of speakers, and extracts from it the target speech i, which is the utterance of the target speaker. The target utterance extraction unit 21 outputs the extracted target speech i to the voice mixing unit 24 and the data creation unit 25.
[Non-target utterance extraction unit 22]
The non-target utterance extraction unit 22 performs non-target utterance extraction processing (step S22). That is, the non-target utterance extraction unit 22 receives speech data from the speech database α described above, and extracts from it the non-target speech k, which is speech uttered by a speaker other than the target speaker. The non-target utterance extraction unit 22 outputs the extracted non-target speech k to the voice mixing unit 24.
[Noise extraction unit 23]
The noise extraction unit 23 performs noise extraction processing (step S23). That is, the noise extraction unit 23 receives noise data from a noise database β prepared in advance and extracts from it the noise r, which is a noise signal. The noise extraction unit 23 outputs the extracted noise r to the voice mixing unit 24.
[Voice mixing unit 24]
The voice mixing unit 24 performs voice mixing processing (step S24). That is, the voice mixing unit 24 mixes the target speech i received from the target utterance extraction unit 21, the non-target speech k received from the non-target utterance extraction unit 22, and the noise r received from the noise extraction unit 23 with arbitrary gains to create the mixed sound X. The voice mixing unit 24 outputs the created mixed sound X to the data creation unit 25.
[Data creation unit 25]
The data creation unit 25 performs data creation processing (step S25). That is, the data creation unit 25 creates the learning data γ, which is a data set, based on the target speech i received from the target utterance extraction unit 21 and the mixed sound X received from the voice mixing unit 24. The data creation unit 25 sets the enrollment speech i′ to the same speech as the target speech i, which is the utterance of the target speaker, and then creates the paired data of the three signals, the target speech i, the enrollment speech i′, and the mixed sound X, as the learning data γ.
By the learning data creation device 2 performing steps S21 to S25 described above, the learning data creation method is realized, and the learning data used by the speech processing learning device 1 can be created.
The learning data creation device 2 can thus create the learning data used by the speech processing learning device 1 without preparing any enrollment speech other than the target speech i. The learning data creation device 2 can construct a learning data set from a speech database without speaker labels or from a speech database α containing utterances of an unspecified number of speakers.
The speech database α described above may also consist of utterance data of an unspecified number of speakers together with data with speaker labels. In that case, paired data is generated for the unlabeled utterance data using the learning data creation method described above, while paired data for the speaker-labeled speech can be generated by a conventional method that uses the speaker labels. Alternatively, paired data can also be generated for the speaker-labeled speech using the technique of the embodiment described above.
In the above description, steps S21, S22, and S23 are performed in this order, but these three processes may be performed in any order, and two or all three of them may also be performed in parallel.
<First modification>
A first modification of the embodiment of the present invention will be described below. FIG. 5 is a diagram showing a functional configuration example of the first modification of the speech processing learning device according to the embodiment. In this modification, the speech processing learning device 1 has a first data extension unit 12 in addition to the elements of the speech processing learning device 1 shown in FIG. 1. FIG. 6 shows a processing flow example of the speech processing learning method in the first modification. The speech processing learning method of this modification is realized by the speech processing learning device 1 shown in FIG. 5 performing steps S11 to S16 illustrated in FIG. 6.
Hereinafter, the functions of the speech processing learning device 1 of FIG. 5 and the speech processing learning method it performs will be described with reference to FIGS. 5 and 6, focusing on the functions and processing that differ from the speech processing learning device 1 of FIG. 1.
[First data extension unit 12]
The first data extension unit 12 performs first data extension processing (step S12). That is, the first data extension unit 12 applies a data augmentation transformation to the time-series data, which is the feature amount extracted by the feature extraction unit 11 in step S11. Examples of data augmentation methods include randomly replacing part of the time-frequency representation with a predetermined value such as 0 (zero) or an average value; a specific example is the known SpecAugment. As an implementation, one or more of SpecAugment's frequency masking, time masking, and time warping can be used. The first data extension unit 12 outputs the transformed data to the speaker expression extraction unit 13.
[Speaker expression extraction unit 13]
In this modification, the speaker expression extraction unit 13 performs the speaker expression extraction of step S13 using the transformed data received from the first data extension unit 12. The other processing is the same as the processing of step S13 shown in FIG. 2.
By the speech processing learning device 1 of FIG. 5 performing steps S11 to S16 of FIG. 6, the speech processing learning method is realized, and the target speaker extraction model Z shown in FIG. 5, which is a trained model, is created.
<Second modification>
A second modification of the embodiment of the present invention will be described below. FIG. 7 is a diagram showing a functional configuration example of the second modification of the speech processing learning device according to the embodiment. In this modification, the speech processing learning device 1 has a second data extension unit 14 in addition to the elements of the speech processing learning device 1 shown in FIG. 5. FIG. 8 shows a processing flow example of the speech processing learning method in the second modification. The speech processing learning method of this modification is realized by the speech processing learning device 1 shown in FIG. 7 performing steps S11 to S16 illustrated in FIG. 8.
Hereinafter, the functions of the speech processing learning device 1 of FIG. 7 and the speech processing learning method it performs will be described with reference to FIGS. 7 and 8, focusing on the functions and processing that differ from the speech processing learning device 1 of FIG. 5.
[Second data extension unit 14]
The second data extension unit 14 performs the second data extension process (step S14). That is, it applies a data augmentation transform to the fixed-length speaker expression extracted by the speaker expression extraction unit 13 in step S13. One example of such an augmentation is to replace some elements of the vector with a predetermined value such as 0 (zero) or the mean value using the well-known dropout technique.
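As an illustration only, the following sketch applies dropout-style masking to a fixed-length speaker expression; the embedding dimension, drop probability, and replacement values are assumptions made for this example rather than values specified by the embodiment.

```python
import numpy as np

def augment_speaker_embedding(embedding, drop_prob=0.1, replacement="zero"):
    """Randomly replace a fraction of the embedding's elements with a fixed value.

    replacement="zero" sets the selected elements to 0, as in standard dropout;
    replacement="mean" sets them to the mean of the embedding instead.
    """
    mask = np.random.rand(embedding.shape[0]) < drop_prob
    value = 0.0 if replacement == "zero" else float(embedding.mean())
    augmented = embedding.copy()
    augmented[mask] = value
    return augmented

# Example: a 256-dimensional speaker expression with roughly 10% of its elements replaced.
speaker_expression = np.random.randn(256).astype(np.float32)
augmented_expression = augment_speaker_embedding(speaker_expression, drop_prob=0.1)
```

Masking individual elements in this way is the usual motivation for dropout-style augmentation: the downstream extractor is discouraged from relying on any single dimension of the speaker representation.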
[Target speaker extraction unit 15]
In this modification, the process of step S15 performed by the target speaker extraction unit 15 extracts the target speaker using the transformed data received from the second data extension unit 14. The rest of the processing is the same as the processing of step S15 shown in FIG. 2.
The speech processing learning method is realized by the speech processing learning device 1 of FIG. 7 performing the processing from step S11 to step S16 of FIG. 8. As a result, the target speaker extraction model Z shown in FIG. 7, which is a trained model, is created.
In FIG. 8, both the process of the first data extension unit 12 (step S12) and the process of the second data extension unit 14 (step S14) are performed; however, the speech processing learning device 1 may be configured to perform only one of them. That is, as in the first modification, the speech processing learning device 1 may be configured to perform only the process of the first data extension unit 12 (step S12) without performing the process of the second data extension unit 14 (step S14). Conversely, it may be configured to perform only the process of the second data extension unit 14 (step S14) without performing the process of the first data extension unit 12 (step S12). A sketch of one training step with both augmentations switchable is given below.
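The following minimal PyTorch-style sketch shows how one training step might look when the target speech itself serves as the enrollment speech and either, both, or neither augmentation is enabled. All modules (feature_extractor, speaker_encoder, extractor), their dimensions, the masking probabilities, and the mean-squared-error loss are hypothetical placeholders chosen for illustration; the embodiment does not prescribe these architectures or this particular loss function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in modules; the embodiment does not prescribe these architectures.
feature_extractor = nn.Linear(80, 80)        # stands in for the feature extraction of step S11
speaker_encoder = nn.Linear(80, 256)         # stands in for step S13 (mean-pooled below)
extractor = nn.Linear(80 + 256, 80)          # stands in for the target speaker extraction of step S15

params = (list(feature_extractor.parameters())
          + list(speaker_encoder.parameters())
          + list(extractor.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)

def training_step(mixture_spec, target_spec, use_first_aug=True, use_second_aug=True):
    """One training step in which the target speech is reused as the enrollment speech."""
    # Step S11: features of the enrollment speech (= the target speech i itself).
    enroll_feats = feature_extractor(target_spec)                     # (batch, frames, 80)

    # Step S12 (optional): randomly zero out some frequency bins, SpecAugment-style.
    if use_first_aug:
        freq_mask = (torch.rand(enroll_feats.size(0), 1, enroll_feats.size(2)) > 0.1).float()
        enroll_feats = enroll_feats * freq_mask

    # Step S13: fixed-length speaker expression obtained by mean pooling over time.
    speaker_expr = speaker_encoder(enroll_feats).mean(dim=1)          # (batch, 256)

    # Step S14 (optional): dropout-style masking of the speaker expression.
    if use_second_aug:
        speaker_expr = F.dropout(speaker_expr, p=0.1, training=True)

    # Step S15: estimate the target speech from the mixture, conditioned on the speaker expression.
    expr_tiled = speaker_expr.unsqueeze(1).expand(-1, mixture_spec.size(1), -1)
    estimate = extractor(torch.cat([mixture_spec, expr_tiled], dim=-1))

    # Step S16: loss between the estimate and the true target (MSE here, only as an example) and update.
    loss = F.mse_loss(estimate, target_spec)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with random stand-in data: batch of 4, 300 frames, 80-dimensional features.
loss_value = training_step(torch.randn(4, 300, 80), torch.randn(4, 300, 80))
```

Passing use_first_aug=False or use_second_aug=False reproduces the single-augmentation configurations described above.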
<Performance evaluation results>
Performance evaluation results of target speaker extraction models trained with the learning methods described above will now be described. FIG. 9 shows an example of the performance evaluation results of the target speaker extraction models in a target speaker extraction experiment. FIG. 9 compares four cases: a model trained with conventional speaker labels ((a) in FIG. 9), a model trained with the speech processing learning method of FIG. 2 ((b) in FIG. 9), a model trained with the speech processing learning method of FIG. 6 ((c) in FIG. 9), and a model trained with the speech processing learning method of FIG. 8 ((d) in FIG. 9).
The Corpus of Spontaneous Japanese (CSJ) was used for this experiment. Two metrics were adopted to measure target speaker extraction performance: the signal-to-distortion ratio (SDR) and the character error rate (CER). A higher SDR indicates higher extraction performance, and a lower CER indicates higher extraction performance.
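For reference, the formulations below are the commonly used definitions of these two metrics; the document itself does not spell them out, so they are stated here only as the usual conventions (a BSS-eval-style SDR and an edit-distance-based CER).

```latex
\mathrm{SDR} = 10\log_{10}\frac{\lVert s_{\mathrm{target}}\rVert^{2}}
{\lVert e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}}\rVert^{2}},
\qquad
\mathrm{CER} = \frac{S + D + I}{N}
```

Here s_target is the component of the extracted signal aligned with the true target speech, the e terms are the interference, noise, and artifact error components, S, D, and I are the numbers of character substitutions, deletions, and insertions in the recognition result, and N is the number of characters in the reference transcript.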
All results in FIG. 9 use the Corpus of Spontaneous Japanese (CSJ) and are evaluated on the same evaluation set. The conditions with and without speaker labels use the same pairs of mixed sound X and target speech i during training and differ only in how the enrollment speech i' is selected. The performance of the models can therefore be compared directly.
As shown in (b) of FIG. 9, the model trained with the speech processing learning method of FIG. 2 achieved a signal-to-distortion ratio (SDR) of 15.5 and a character error rate (CER) of 12.3%. Even compared with the model trained with conventional speaker labels ((a) in FIG. 9), which achieved an SDR of 17.3 and a CER of 8.1%, this represents target speaker extraction performance sufficient for practical use.
By adopting the speech processing learning method of the speech processing learning device 1 of this embodiment, the information in the enrollment speech i' is varied appropriately, so that the enrollment speech i' sufficiently fulfills the role of a true enrollment speech. In other words, the result in (b) of FIG. 9 shows that a practical level of target speaker extraction performance is achieved even with speech data that carries no speaker labels.
As shown in (c) of FIG. 9, the model trained with the speech processing learning method of FIG. 6 achieved an SDR of 16.4 and a CER of 11.5%; compared with the model in (b) of FIG. 9, the target speaker extraction performance is higher in both SDR and CER.
Similarly, as shown in (d) of FIG. 9, the model trained with the speech processing learning method of FIG. 8 achieved an SDR of 17.2 and a CER of 9.7%; compared with the model in (b) of FIG. 9, the target speaker extraction performance is again higher in both SDR and CER.
The results in (c) and (d) of FIG. 9 show that the data augmentation methods achieve high target speaker extraction performance even with speech data that has no speaker labels.
When the target speech i is used as the enrollment speech i', no variation in speaker characteristics arises between the target speech and the enrollment speech. Therefore, compared with the conventional approach, in which the enrollment speech is a different utterance by the target speaker, sufficient robustness against such variation may not be acquired. Adopting the data augmentation methods of this embodiment contributes to acquiring robustness against this variation.
In particular, in the case of (d) in FIG. 9, the SDR of 17.2 is only marginally lower than the SDR of 17.3 obtained with speaker labels in (a) of FIG. 9. The CER of 9.7% in (d) of FIG. 9 is likewise close to the CER of 8.1% in (a) of FIG. 9. Speech resources without speaker labels are available in far larger quantities than those with speaker labels, especially among real-world data. In this experiment, which used the same amount of data, the model performance in (d) of FIG. 9 was not significantly inferior to that in (a) of FIG. 9. It is therefore expected that practical performance can be improved substantially by training on the large amounts of speech data without speaker labels that are available.
The speech processing learning device and speech processing learning method have been described above for one embodiment of the present invention and its modifications. This approach is considered to achieve a practical level of target speaker extraction performance even for speech data without speaker labels. That is, regardless of whether speaker labels are assigned, practically usable target speaker extraction can be realized without preparing separate enrollment speech.
Owing to the above effects, it becomes possible to train target speaker extraction using data that could not be exploited previously. That is, the range of speech data that can be used for training the target speaker extraction model is expanded.
This broader range of usable data yields two benefits. First, by using data that could not previously be included in the training data, the variety of utterances and speakers in the data can be increased, which improves the robustness of speech enhancement to speaker differences and thus the performance of target speaker extraction. Second, domain adaptation can be achieved by retraining the speech enhancement even on datasets without speaker labels or datasets containing much data in which each speaker contributes only a single utterance.
Note that the various processes described above are not only executed chronologically in the order described, but may also be executed in parallel or individually depending on the processing capacity of the device executing the processes or as necessary. It goes without saying that appropriate modifications are possible without departing from the gist of the present invention.
[Program, recording medium]
The various processes described above can be implemented by loading a program that executes each step of the above methods into the recording unit 2020 of the computer 2000 shown in FIG. 10 and operating the control unit 2010, the input unit 2030, the output unit 2040, the display unit 2050, and the like.
The program describing this processing can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any type, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.
The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers via a network.
A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to that program, or it may sequentially execute processing according to the received program each time the program is transferred from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in this embodiment includes information that is used for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).
In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing may instead be realized in hardware.

Claims (8)

1.  A speech processing learning method comprising:
     a feature extraction process of extracting a feature amount composed of time-series data of fixed-length vectors from a speech received as an enrollment speech, the speech being identical to a target speech that is an uttered speech of a target speaker;
     a speaker expression extraction process of extracting a speaker expression, which is a fixed-length vector, from the feature amount extracted by the feature extraction process;
     a target speaker extraction process of extracting, using the speaker expression extracted by the speaker expression extraction process, a speech estimated to be the target speech from a mixed sound composed of the target speech, a non-target speech that is a speech of a speaker different from the target speaker, and noise; and
     an optimization process of calculating a loss function using the speech extracted by the target speaker extraction process and the target speech, and optimizing the target speaker extraction process so that the calculated value is minimized.
2.  The speech processing learning method according to claim 1, further comprising a first data extension process of performing data augmentation on the feature amount extracted by the feature extraction process,
     wherein the speaker expression extraction process extracts the speaker expression, which is a fixed-length vector, from the feature amount augmented by the first data extension process.
3.  The speech processing learning method according to claim 2, wherein the data augmentation in the first data extension process randomly replaces part of the feature amount with a predetermined value.
4.  The speech processing learning method according to claim 1, further comprising a second data extension process of performing data augmentation on the speaker expression extracted by the speaker expression extraction process,
     wherein the target speaker extraction process extracts the speech estimated to be the target speech from the mixed sound using the speaker expression augmented by the second data extension process.
5.  The speech processing learning method according to claim 4, wherein the data augmentation in the second data extension process replaces part of the speaker expression with a predetermined value.
6.  A speech processing learning device comprising:
     a feature extraction unit that extracts a feature amount composed of time-series data of fixed-length vectors from a speech received as an enrollment speech, the speech being identical to a target speech that is an uttered speech of a target speaker;
     a speaker expression extraction unit that extracts a speaker expression, which is a fixed-length vector, from the feature amount extracted by the feature extraction unit;
     a target speaker extraction unit that extracts, using the speaker expression extracted by the speaker expression extraction unit, a speech estimated to be the target speech from a mixed sound composed of the target speech, a non-target speech that is a speech of a speaker different from the target speaker, and noise; and
     an optimization unit that calculates a loss function using the speech extracted by the target speaker extraction unit and the target speech, and optimizes the target speaker extraction unit so that the calculated value is minimized.
7.  The speech processing learning device according to claim 6, further comprising:
     a first data extension unit that performs data augmentation on the feature amount extracted by the feature extraction unit; and
     a second data extension unit that performs data augmentation on the speaker expression extracted by the speaker expression extraction unit,
     wherein the speaker expression extraction unit extracts the speaker expression, which is a fixed-length vector, from the feature amount augmented by the first data extension unit, and
     the target speaker extraction unit extracts the speech estimated to be the target speech from the mixed sound using the speaker expression augmented by the second data extension unit.
8.  A program for causing a computer to execute the speech processing learning method according to any one of claims 1 to 5.
PCT/JP2022/001315 2022-01-17 2022-01-17 Voice processing learning method, voice processing learning device, and program WO2023135788A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/001315 WO2023135788A1 (en) 2022-01-17 2022-01-17 Voice processing learning method, voice processing learning device, and program


Publications (1)

Publication Number Publication Date
WO2023135788A1

Family

ID=87278631


Country Status (1)

Country Link
WO (1) WO2023135788A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05143094A (en) * 1991-11-26 1993-06-11 Sekisui Chem Co Ltd Speaker recognition system
JP2019219574A (en) * 2018-06-21 2019-12-26 株式会社東芝 Speaker model creation system, recognition system, program and control device



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22920313

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2023573787

Country of ref document: JP

Kind code of ref document: A