JP2019008120A

JP2019008120A - Voice quality conversion system, voice quality conversion method and voice quality conversion program

Info

Publication number: JP2019008120A
Application number: JP2017123363A
Authority: JP
Inventors: Takuya Fujioka; Keika Son; Yusuke Fujita
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2017-06-23
Filing date: 2017-06-23
Publication date: 2019-01-17

Abstract

To provide a voice quality conversion system, a voice quality conversion method, and a voice quality conversion program that can convert voice quality accurately.

SOLUTION: A voice quality conversion server 1000 comprises: a speech recognition unit 200 that recognizes speech having a first voice quality and speech having a second voice quality; a voiced/unvoiced estimation unit 203 that estimates features of the voiced and unvoiced parts in the recognized speech of the first voice quality and of the second voice quality; an accent estimation unit 204 that estimates features of the accent parts in the recognized speech of the first voice quality and of the second voice quality; a conversion model generation unit 2151 that generates a conversion model estimating the relationship between the estimated features of the voiced parts, unvoiced parts, and accents in the speech of the first voice quality and the estimated features of the voiced parts, unvoiced parts, and accents in the speech of the second voice quality; and a voice quality conversion unit 215 that converts input speech having the first voice quality into speech having the second voice quality on the basis of the generated conversion model.

SELECTED DRAWING: Figure 3

Description

The present invention relates to a voice quality conversion system, a voice quality conversion method, and a voice quality conversion program.

Voice quality conversion is a technique that uses speech signal processing to convert the voice quality of one speaker's speech into that of another, target speaker. Applications include the operation of service robots and computer-based automatic response in call centers. In service robot operation, dialogue is realized, for example, by the service robot using speech recognition to hear the other speaker's speech, estimating appropriate response content for that speech, and performing speech synthesis to generate a response voice.

In such a dialogue, if speech recognition fails because of environmental noise, or if the other speaker's question is too difficult for appropriate response content to be estimated, the service robot's automatic response voice is switched to an operator's response voice: an operator at a remote location listens to the other speaker's utterance and answers by speaking, so that the dialogue with the other speaker can continue.

At this time, the other speaker can be kept from feeling that something is off if the service robot's voice with the same content as the operator's utterance is newly synthesized and output (for example, the operator's utterance is speech-recognized and the service robot's voice is newly synthesized from the result). However, newly synthesizing speech takes time (on the order of several seconds) from the operator's utterance until the synthesized speech is generated, which hinders smooth communication. Moreover, correctly recognizing the operator's utterance and then synthesizing speech that reliably expresses its intent is itself not technically easy. It is therefore preferable to convert the operator's utterance into the same voice quality as the service robot's voice without generating new speech, so that the other speaker does not notice the switch. Voice quality conversion is thus an important technology for dialogue through service robot operation.

Meanwhile, in automatic response in a call center, another application of voice quality conversion, a predetermined dialogue system or speech synthesis system performs speech recognition on the other speaker's utterance and generates a response voice. Patent Document 1 is an example of the configuration of such an automatic response system. When the automatic response cannot handle the exchange correctly, however, a human operator ultimately has to respond to the other speaker. In addition, speakers tend to prefer talking with a human operator over a computer's automatic response. If the other speaker cannot tell whether a call center response is automatic or comes from a human operator, the number of responses handled by human operators can therefore be expected to decrease.

Therefore, as with service robot operation, a configuration that converts the operator's speech into the same voice quality as the automatic response voice is considered effective for automatic response in call centers as well.

A basic technique underlying voice quality conversion is the discrimination between voiced and unvoiced parts. For example, Patent Document 2 describes, as its basic technique, obtaining a frequency warping function relating the spectral envelope of the conversion source speaker to that of the conversion target speaker, and performing voice quality conversion by converting the source speaker's spectral envelope into the target speaker's using an average frequency warping function based on averages over "voiced sound sections / unvoiced sound sections". Patent Document 3 discloses, as a technique for discriminating voiced from unvoiced parts of speech on a mobile phone, determining whether the input speech is voiced or unvoiced by using interpolation filters, with filter coefficients prepared separately for voiced and unvoiced sounds, when extending the bandwidth of the vocal tract transfer characteristics extracted by linear prediction analysis of an input narrowband speech signal.

Patent Document 1: JP 2015-70371 A
Patent Document 2: JP 2001-282300 A
Patent Document 3: JP 2015-206958 A

In practical applications of voice quality conversion, however, a speech database for each speaker (for example, a parallel corpus) is often used. That is, to convert voice quality, a parallel corpus containing a speech database of the conversion source speaker and a speech database of the target speaker is prepared in advance. The more closely the non-speaker speech features (information in the speech other than speaker identity) of the two databases in the parallel corpus match, the more accurate the voice quality conversion can be.

However, even when speakers utter the same semantic content, the accent pattern, the positions where voiced and unvoiced parts appear, and the patterns of pause positions and articulation positions differ greatly from speaker to speaker. If the non-speaker speech features arising from these elements do not match in the parallel corpus, the converted speech may be perceived as the wrong phoneme, may be devoiced, or may acquire an unintended accent. Development of a voice quality conversion technique that properly takes these points into account is therefore desired.

The present invention has been made in view of this background, and its object is to provide a voice quality conversion system, a voice quality conversion method, and a voice quality conversion program capable of performing voice quality conversion accurately.

One aspect of the present invention for solving the above problem is a voice quality conversion system that includes a processor and a memory and converts the voice quality of input speech into a different voice quality. The system comprises: a speech recognition unit that recognizes speech having a first voice quality and speech having a second voice quality; a voiced/unvoiced estimation unit that estimates features of the voiced and unvoiced parts in the recognized speech of the first voice quality and of the second voice quality; an accent estimation unit that estimates features of the accent parts in the recognized speech of the first voice quality and of the second voice quality; a conversion model generation unit that generates a conversion model estimating the relationship between the estimated features of the voiced parts, unvoiced parts, and accents in the speech of the first voice quality and the estimated features of the voiced parts, unvoiced parts, and accents in the speech of the second voice quality; and a voice quality conversion unit that, based on the generated conversion model, converts input speech having the first voice quality into speech having the second voice quality.

According to the present invention, voice quality conversion can be performed accurately.

FIG. 1 is a diagram illustrating an example of the configuration of a voice quality conversion system 10 according to Example 1.
FIG. 2 is a diagram illustrating an overview of the functions of the voice quality conversion server 1000.
FIG. 3 is a diagram illustrating an example of the functions provided in the voice quality conversion server 1000.
FIG. 4 is a flowchart illustrating an example of voice quality conversion processing.
FIG. 5 is a diagram illustrating the details of the processing of the speech recognition unit 200.
FIG. 6 is a diagram illustrating the relationship among frames in the speech data, the readings corresponding to the frames, and the confidence of the readings.
FIG. 7 is a diagram illustrating the details of the processing of the voiced/unvoiced estimation unit 203.
FIG. 8 is a diagram illustrating the details of the processing performed by the time alignment processing unit 208.
FIG. 9 is a diagram illustrating the details of the processing performed by the low confidence frame removal unit 211.
FIG. 10 is a diagram illustrating a method for identifying the frames to be removed.
FIG. 11 is a diagram illustrating an overview of the functions of the voice quality conversion server 1000 according to Example 2.
FIG. 12 is a diagram illustrating an example of the functions of the low confidence frame removal unit 211 according to Example 2.
FIG. 13 is a diagram illustrating an example of the information output by the low confidence frame removal unit 211 according to Example 2.

-- Example 1 --
<System configuration>
FIG. 1 is a diagram illustrating an example of the configuration of a voice quality conversion system 10 according to Example 1. As shown in the figure, the voice quality conversion system 10 includes: a service robot 20 that conducts a dialogue with a given speaker by outputting response speech corresponding to input speech; an operator terminal 30, which is used by an operator (a human) who conducts the dialogue with the speaker together with the service robot 20 and to which the operator's speech is input; and a voice quality conversion server 1000 that converts the voice quality of the operator's speech input to the operator terminal 30 into the voice quality of the service robot 20's speech, or converts the voice quality of the service robot 20's speech into the operator's voice quality. The service robot 20, the operator terminal 30, and the voice quality conversion server 1000 are all information processing apparatuses (computers).

The service robot 20 includes a processor 21 such as a CPU (Central Processing Unit); a memory 22 such as a RAM (Random Access Memory) or ROM (Read Only Memory); a communication I/F 23 that communicates with other devices (I/F: interface; the same applies hereinafter); a storage device 24 such as an HDD (Hard Disk Drive) or SSD (Solid State Drive); an input/output device 25 comprising a keyboard, mouse, touch panel, monitor (display), and the like; an audio input I/F 26, such as a microphone, to which the speech of the other speaker and others is input; and an audio output I/F 27, such as a speaker, that outputs sound. These are connected to one another by a bus 28.

Like the service robot 20, the operator terminal 30 includes a processor 31 such as a CPU (Central Processing Unit); a memory 32 such as a RAM (Random Access Memory) or ROM (Read Only Memory); a communication I/F 33 that communicates with other devices; a storage device 34 such as an HDD (Hard Disk Drive) or SSD (Solid State Drive); an input/output device 35 comprising a keyboard, mouse, touch panel, monitor (display), and the like; an audio input I/F 36, such as a microphone, to which the speech of the operator and others is input; and an audio output I/F 37, such as a speaker, that outputs sound. These are connected to one another by a bus 38.

The voice quality conversion server 1000 includes a processor 1001 such as a CPU (Central Processing Unit); a memory 1002 such as a RAM (Random Access Memory) or ROM (Read Only Memory); a communication I/F 1003 that communicates with other devices; a storage device 1004 such as an HDD (Hard Disk Drive) or SSD (Solid State Drive); and an input/output device 1005 comprising a keyboard, mouse, touch panel, monitor (display), and the like. These are connected to one another by a bus 1006.

The voice quality conversion server 1000, the operator terminal 30, and the service robot 20 are communicably connected to one another by a network 50 consisting of, for example, a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, or a dedicated line.

<Functions of the voice quality conversion server>
First, an overview of the functions of the voice quality conversion server 1000 is given.
FIG. 2 is a diagram illustrating an overview of the functions of the voice quality conversion server 1000. As shown in the figure, the voice quality conversion server 1000 converts a conversion source speaker voice 103 into a target speaker voice 104 based on a voice quality conversion model 102. That is, the voice quality conversion server 1000 can convert the voice quality of the operator's speech into the voice quality of the service robot 20's speech, or convert the voice quality of the service robot 20's speech into the voice quality of the operator's speech.

The voice quality conversion model 102 is generated from a conversion source speaker voice database 100, which stores the operator's speech (uttered speech), and a target speaker voice database 101, which stores the speech produced by the service robot 20 (hereinafter, robot voice). The conversion source speaker voice database 100 and the target speaker voice database 101 form a parallel corpus. That is, for speech with a given semantic content, the conversion source speaker voice database 100 stores the operator's speech corresponding to it, the target speaker voice database 101 stores the robot voice corresponding to it, and the two are associated with each other.

Next, the functions of the voice quality conversion server 1000 are described in detail.
FIG. 3 is a diagram illustrating an example of the functions provided in the voice quality conversion server 1000. As shown in the figure, the voice quality conversion server 1000 includes a speech recognition unit 200, a voiced/unvoiced estimation unit 203, an accent estimation unit 204, a speech synthesis unit 206, a time alignment processing unit 208, a low confidence frame removal unit 211, a voice quality conversion unit 215, and a speech output unit 217.

The speech recognition unit 200 recognizes speech having the first voice quality (speech in the conversion source speaker voice database 100) and speech having the second voice quality (speech in the target speaker voice database 101).

The speech recognition unit 200 includes a confidence calculation unit 2001.
When the speech of the first voice quality is recognized, the confidence calculation unit 2001 calculates a confidence, a value indicating how reliable the speech recognition result is.

Specifically, the confidence calculation unit 2001 calculates the confidence of the speech of the first voice quality as the reliability with which its phonemes are recognized.

The voiced/unvoiced estimation unit 203 estimates the features of the voiced and unvoiced parts in the speech of the first voice quality and the speech of the second voice quality recognized by the speech recognition unit 200.

The accent estimation unit 204 estimates the features of the accent parts in the speech of the first voice quality and the speech of the second voice quality recognized by the speech recognition unit 200.

The speech synthesis unit 206 generates synthesized speech of the second voice quality that has the voiced-part, unvoiced-part, and accent features of the speech of the first voice quality estimated by the voiced/unvoiced estimation unit 203 and the accent estimation unit 204.

The time alignment processing unit 208 corrects the timing of pronunciation in the speech of the first voice quality or the speech of the second voice quality, based on the timing of the voiced parts, unvoiced parts, and accent parts in the speech of the first voice quality estimated by the voiced/unvoiced estimation unit 203 and the accent estimation unit 204.

Based on the confidence calculated by the confidence calculation unit 2001, the low confidence frame removal unit 211 removes those portions of the speech of the first voice quality whose confidence does not satisfy a predetermined condition.

Specifically, the low confidence frame removal unit 211 classifies the speech of the first voice quality into a plurality of groups according to phoneme type and, within each group, removes the portions of speech whose confidence does not satisfy the predetermined condition.
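As an illustration only, the per-group removal described above could be sketched in Python as follows. The text does not state the "predetermined condition" or how phonemes are grouped at this point, so the two-way vowel/consonant grouping, the per-group minimum-confidence thresholds, and all function and variable names below are assumptions made for the sketch.

```python
# Hypothetical grouping: the patent only says "according to phoneme type".
VOWELS = {"a", "i", "u", "e", "o"}


def phoneme_group(phoneme):
    """Assign a phoneme to a coarse group (assumed two-way split)."""
    return "vowel" if phoneme in VOWELS else "consonant"


def remove_low_confidence_frames(frames, min_confidence):
    """Keep only frames whose confidence meets the threshold of their group.

    frames: list of (phoneme, confidence) pairs, one per frame.
    min_confidence: dict mapping group name -> minimum confidence to keep
                    (an assumed form of the "predetermined condition").
    """
    kept = []
    for phoneme, confidence in frames:
        if confidence >= min_confidence[phoneme_group(phoneme)]:
            kept.append((phoneme, confidence))
    return kept


# Confidences here are log-likelihoods, so they are negative.
frames = [("k", -0.4), ("o", -0.2), ("o", -2.5), ("n", -3.0)]
thresholds = {"vowel": -1.0, "consonant": -2.0}
kept_frames = remove_low_confidence_frames(frames, thresholds)
```

Using a separate threshold per group reflects the idea that confidence values for different phoneme types may not be directly comparable; whether the actual system uses fixed thresholds is not stated here.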

The voice quality conversion unit 215 converts input speech having the first voice quality into speech having the second voice quality, based on the generated conversion model.

The voice quality conversion unit 215 includes a conversion model generation unit 2151.
The conversion model generation unit 2151 generates a conversion model (the voice quality conversion model 102) that estimates the relationship between the voiced-part, unvoiced-part, and accent features in the speech of the first voice quality and the voiced-part, unvoiced-part, and accent features in the speech of the second voice quality, both as estimated by the voiced/unvoiced estimation unit 203 and the accent estimation unit 204.

Specifically, the conversion model generation unit 2151 generates the conversion model based on the speech of the first voice quality from which the low confidence frame removal unit 211 has removed the low-confidence portions.

The conversion model generation unit 2151 also generates the conversion model based on the synthesized speech generated by the speech synthesis unit 206.

In addition, the conversion model generation unit 2151 generates the conversion model based on the speech of the first voice quality or the speech of the second voice quality corrected by the time alignment processing unit 208.

The speech output unit 217 outputs the speech having the second voice quality produced by the conversion in the voice quality conversion unit 215.

The functions of each information processing apparatus described above are realized by the hardware of that apparatus, or by its processor reading and executing programs stored in the memory or storage device. These programs are stored, for example, in a storage device such as a secondary storage device, nonvolatile semiconductor memory, hard disk drive, or SSD, or on a computer-readable non-transitory data storage medium such as an IC card, SD card, or DVD.

Next, the processing performed in the voice quality conversion system 10 is described.
<Voice quality conversion processing>

FIG. 4 is a flowchart illustrating an example of the processing, among the processing performed by the voice quality conversion system 10, that converts the conversion source speaker voice 103 into the target speaker voice 104 (hereinafter, voice quality conversion processing). This processing starts, for example, at a preset timing (for example, at predetermined time intervals or at a predetermined time), or when the conversion source speaker voice database 100 or the target speaker voice database 101 is updated.

In this example, the conversion source speaker voice 103 is the operator's speech and the target speaker voice 104 is the speech of the service robot 20.

First, the speech recognition unit 200 recognizes the speech recorded in the conversion source speaker voice database 100 and outputs a character string 201 corresponding to the recognized speech. Specifically, for example, the speech recognition unit 200 divides the speech waveform data into a plurality of frames of a predetermined length, recognizes the speech in each frame, and outputs the best reading (character string) for the recognized speech. The speech recognition unit 200 also calculates a speech recognition confidence 202 for each frame.
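The division of waveform data into frames of a predetermined length can be sketched as below. This is a minimal illustration, not the patent's implementation: the function name `split_into_frames`, the parameters `frame_len` and `hop_len`, and the choice to drop a trailing partial frame are all assumptions. Frames overlap when `hop_len` is smaller than `frame_len`.

```python
def split_into_frames(samples, frame_len, hop_len):
    """Split a waveform into fixed-length frames.

    samples: sequence of waveform samples.
    frame_len: number of samples per frame (the "predetermined length").
    hop_len: number of samples by which each frame start is shifted
             (assumed; a value below frame_len gives overlapping frames).
    A trailing partial frame is dropped in this sketch.
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


# Toy waveform of ten samples, frames of four samples shifted by two.
waveform = list(range(10))
frames = split_into_frames(waveform, frame_len=4, hop_len=2)
```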

The processing of the speech recognition unit 200 is now described in detail.
FIG. 5 is a diagram illustrating the details of the processing of the speech recognition unit 200. As shown in the figure, for the speech in each frame in the conversion source speaker voice database 100, the speech recognition unit 200 calculates the likelihood of every possible reading of that speech (for example, "a", "i", "u", ...) (s214). The speech recognition unit 200 then identifies the likelihood 216 of the reading with the highest likelihood among those calculated (s215) and generates the character string 201 corresponding to the reading that has the identified likelihood 216. The speech recognition unit 200 further converts the identified likelihood 216 into the confidence 202 (s217). Specifically, for example, the speech recognition unit 200 takes the logarithm of the likelihood 216 as the confidence 202. Alternatively, the likelihood 216 may be used as the confidence 202 without conversion.
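Steps s214 to s217 can be illustrated with a short Python sketch. The likelihood values and the names `recognize_frame` and `frame_likelihoods` are made up for the example; how the likelihoods themselves are computed (the acoustic model) is not covered here, so they are taken as given inputs.

```python
import math


def recognize_frame(likelihoods):
    """Pick the most likely reading for one frame and its confidence.

    likelihoods: dict mapping each candidate reading to its likelihood
                 (assumed to be positive), i.e. the result of s214.
    Returns (best_reading, confidence), where the confidence is the
    logarithm of the highest likelihood (s215 and s217).
    """
    best_reading = max(likelihoods, key=likelihoods.get)
    best_likelihood = likelihoods[best_reading]
    return best_reading, math.log(best_likelihood)


# Illustrative likelihoods for three consecutive frames.
frame_likelihoods = [
    {"ko": 0.70, "n": 0.10, "a": 0.20},
    {"ko": 0.55, "n": 0.30, "a": 0.15},
    {"ko": 0.05, "n": 0.90, "a": 0.05},
]
results = [recognize_frame(f) for f in frame_likelihoods]
readings = [reading for reading, _ in results]
```

Because the logarithm is monotonic, using the log-likelihood or the raw likelihood as the confidence ranks frames identically; only the scale of any threshold applied later changes.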

FIG. 6 is a diagram illustrating the relationship among frames in the speech data, the readings corresponding to the frames, and the confidence of the readings. As shown in the figure, likelihoods for all readings are calculated for each of a plurality of frame portions 501, 502, and 503, whose ranges are shifted from one another by a predetermined time, and the reading 504 that yields the highest likelihood is identified as the correct reading. Confidences 505, 506, and 507 for frame portions 501, 502, and 503, respectively, are then calculated for this reading 504. In the example, frame portions 501, 502, and 503 each correspond to the reading "ko", and frame portions 508 and 509 each correspond to the reading "n".

Next, as shown in FIG. 3, the voiced/unvoiced estimation unit 203 estimates, based on the character string 201 and the conversion source speaker voice database 100, the features of the voiced and unvoiced parts in the speech recorded in the conversion source speaker voice database 100, and appends information indicating the estimated features (hereinafter, voiced/unvoiced information) to the character string 201.

Here, the details of the processing of the voiced/unvoiced estimation unit 203 will be described.
FIG. 7 is a diagram for explaining the details of the processing of the voiced/unvoiced estimation unit 203. As shown in the figure, the voiced/unvoiced estimation unit 203 calculates a cepstrum for each frame (for example, frame portions 501, 502, 504, 508, and 509) of the voice waveform in the conversion source speaker voice database 100 (s218). The cepstrum is calculated, for example, by the following formula.

c(t) = ift(log(|ft(x(t))|))

Here, c(t) is the cepstrum, ift is the inverse Fourier transform, ft is the Fourier transform, and x(t) is the voice waveform of each frame in the conversion source speaker voice database 100.

The voiced/unvoiced estimation unit 203 determines whether each calculated cepstrum has a peak, and generates information indicating the result as the character string information 205. For example, when a cepstrum has a peak, the voiced/unvoiced estimation unit 203 generates character string information 205 indicating that the voice portion corresponding to that cepstrum is voiced. On the other hand, when a cepstrum has no peak, the voiced/unvoiced estimation unit 203 generates character string information 205 indicating that the voice portion corresponding to that cepstrum is unvoiced.
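
The cepstrum formula and the peak test can be sketched as follows. This is a minimal illustration, not the implementation of the specification: the naive discrete Fourier transform, the small constant eps added before the logarithm, the quefrency search range (min_lag to max_lag), and the peak threshold are all assumptions introduced here.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); adequate for short frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def cepstrum(frame, eps=1e-10):
    """c(t) = ift(log(|ft(x(t))|)); eps avoids log(0) and is an assumption."""
    log_magnitude = [math.log(abs(v) + eps) for v in dft(frame)]
    return [v.real for v in dft(log_magnitude, inverse=True)]


def is_voiced(frame, min_lag, max_lag, threshold):
    """Treat the frame as voiced when the cepstrum has a peak in the lag range.

    The lag range and the threshold are hypothetical tuning parameters.
    """
    c = cepstrum(frame)
    return max(c[min_lag:max_lag + 1]) > threshold


# A pulse train with a period of 16 samples (periodic, hence "voiced" here)
# and a single impulse (flat spectrum, no periodicity).
pulse_train = [1.0 if t % 16 == 0 else 0.0 for t in range(128)]
single_impulse = [1.0] + [0.0] * 127
```

With these two synthetic frames, is_voiced(pulse_train, 8, 64, 1.0) is true and is_voiced(single_impulse, 8, 64, 1.0) is false; real speech would need a range and threshold chosen for the sampling rate in use.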

Next, as shown in FIG. 3, the accent estimation unit 204 estimates, based on the conversion source speaker voice database 100 and the character string 201, the features of the accent portions in the voice recorded in the conversion source speaker voice database 100, and generates the character string information 205 by adding information indicating the estimated features (hereinafter referred to as accent information) to the character string 201.

For example, the accent estimation unit 204 estimates the accent portions based on the amplitude or prosody of the voice waveform in the conversion source speaker voice database 100, and associates the estimated accent portions with the phoneme portions in the character string 201. The accent estimation unit 204 generates information indicating the result as the accent information.

Specifically, for example, the accent estimation unit 204 selects one character from the character string 201 and, for the frames of the conversion source speaker voice database 100 corresponding to the selected character, obtains the average amplitude and the average fundamental frequency of those frames. Then, the accent estimation unit 204 determines whether the voice corresponding to the selected character is accented by comparing these averages with the average amplitude and the average fundamental frequency of the frames corresponding to the characters before and after the selected character.
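
The comparison with neighboring characters can be sketched as follows. The specification says only that the averages are compared; the concrete rule used here (both averages must exceed those of every neighbor by a factor ratio) and the data layout are assumptions made for illustration.

```python
def frame_means(frames):
    """frames: list of (amplitude, fundamental_frequency) pairs of one character."""
    n = len(frames)
    return (sum(a for a, _ in frames) / n, sum(f for _, f in frames) / n)


def detect_accents(chars, ratio=1.2):
    """Return one boolean per character: True when the character looks accented.

    chars: list with, per character, the list of its frames.
    ratio: hypothetical margin by which both averages must exceed the neighbors.
    """
    means = [frame_means(frames) for frames in chars]
    flags = []
    for i, (amp, f0) in enumerate(means):
        neighbors = [means[j] for j in (i - 1, i + 1) if 0 <= j < len(means)]
        flags.append(bool(neighbors) and all(
            amp > ratio * n_amp and f0 > ratio * n_f0 for n_amp, n_f0 in neighbors))
    return flags


# Three characters; the middle one is louder and higher than its neighbors.
example = [
    [(0.2, 110.0), (0.2, 112.0)],
    [(0.5, 150.0), (0.6, 155.0)],
    [(0.2, 108.0)],
]
accent_flags = detect_accents(example)
```

The resulting flags would then be attached to the corresponding characters of the character string 201 as accent information.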

Next, the speech synthesis unit 206 generates a database (target speaker voice database 207) obtained by modifying the target speaker voice database 101 based on the character string information 205, in which the voiced/unvoiced information and the accent information have been added to the character string 201 as described above. That is, the speech synthesis unit 206 generates, from the target speaker voice database 101, a database of voice having the same characteristics (voiced portions, unvoiced portions, and accent portions) as the voice in the conversion source speaker voice database 100.

Then, the time alignment processing unit 208 generates a time-aligned parallel corpus based on the target speaker voice database 207 generated by the speech synthesis unit 206 (for example, it generates two voice waveforms adjusted so that the same phoneme is pronounced at the same time position). That is, the time alignment processing unit 208 generates the conversion source speaker voice database 100 and the target speaker voice database 207 with their time alignment mutually adjusted (namely, the conversion source speaker voice database 209 and the target speaker voice database 210).

Here, the details of the processing performed by the time alignment processing unit 208 will be described.
FIG. 8 is a diagram for explaining the details of the processing performed by the time alignment processing unit 208. As shown in the figure, the time alignment processing unit 208 first generates mel cepstra (for example, mel-frequency cepstral coefficients (MFCC)) of the conversion source speaker voice database 100 and the target speaker voice database 207.

Specifically, for example, the time alignment processing unit 208 calculates spectra by applying a Fourier transform to each voice waveform in the conversion source speaker voice database 100 and the target speaker voice database 207 (s223). Then, the time alignment processing unit 208 calculates mel frequency spectra by applying a mel filter bank to each calculated spectrum (s224). Further, the time alignment processing unit 208 applies a discrete cosine transform to each calculated mel frequency spectrum, thereby generating the mel cepstrum 226 corresponding to the conversion source speaker voice database 100 and the mel cepstrum 227 corresponding to the target speaker voice database 207.
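
The three steps (Fourier transform, mel filter bank, discrete cosine transform) can be sketched for a single frame as follows. This is an illustrative sketch under several assumptions that are not in the specification: a naive DFT, triangular filters spaced on the common 2595*log10(1 + f/700) mel scale, a logarithm taken before the DCT (as is usual for MFCC), and the parameter values num_filters, num_coeffs, and eps.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """s223: power spectrum of one frame via a naive DFT (bins 0 .. n/2)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]


def mel_filter_bank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [hz * n_fft / sample_rate for hz in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if left < k <= center:
                row.append((k - left) / (center - left))
            elif center < k < right:
                row.append((right - k) / (right - center))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct2(values, num_coeffs):
    """Unnormalized DCT-II, keeping the first num_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (i + 0.5) / n) for i, v in enumerate(values))
            for c in range(num_coeffs)]


def mel_cepstrum(frame, sample_rate, num_filters=20, num_coeffs=12, eps=1e-10):
    spectrum = power_spectrum(frame)
    bank = mel_filter_bank(num_filters, len(frame), sample_rate)       # s224
    log_energies = [math.log(sum(w * p for w, p in zip(row, spectrum)) + eps)
                    for row in bank]
    return dct2(log_energies, num_coeffs)
```

Applying mel_cepstrum to every frame of both databases would yield the two feature sequences that the subsequent time alignment compares.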

Then, the time alignment processing unit 208 performs time alignment on the generated mel cepstra (s228). For example, the time alignment is performed by matching based on dynamic programming (DP matching). As a result, the conversion source speaker voice database 209 and the target speaker voice database 210 are generated.
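
DP matching of two feature sequences can be sketched as follows. This is a generic dynamic time warping routine written for illustration; the Euclidean frame distance and the three allowed steps (diagonal, vertical, horizontal) are assumptions, since the specification does not fix the local cost or the path constraints.

```python
import math


def dtw_path(source, target):
    """Align two sequences of feature vectors by dynamic programming.

    Returns the total alignment cost and the list of (source_index,
    target_index) pairs on the optimal warping path.
    """
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end of both sequences to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# The target repeats the middle frame once, so one source frame maps to two.
total, path = dtw_path([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]])
```

The index pairs on the path indicate which frames of the two databases correspond, which is the information needed to produce waveforms in which the same phoneme occurs at the same time position.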

Here, the character string 201 estimated by the speech recognition unit 200 may contain errors. If the character string 201 contains an error, the contents of the conversion source speaker voice database 100 and the target speaker voice database 207 no longer match, and appropriate voice quality conversion cannot be performed.

Therefore, as shown in FIG. 4, the low-confidence frame removal unit 211 removes the portions with a low certainty factor 202 from the voice data of the conversion source speaker voice database 209 and the target speaker voice database 210, thereby generating a corrected conversion source speaker voice database 209 (conversion source speaker voice database 212) and a corrected target speaker voice database 210 (target speaker voice database 213).

Here, the details of the processing performed by the low-confidence frame removal unit 211 will be described.
FIG. 9 is a diagram for explaining the details of the processing performed by the low-confidence frame removal unit 211. As shown in the figure, the low-confidence frame removal unit 211 clusters all frames in the conversion source speaker voice database 209 according to the type (distribution) of the phonemes constituting each frame (s220). As a result, the frames are classified into N clusters (N >= 2).

This clustering is, for example, k-means clustering or decision tree clustering based on phonological information. When decision tree clustering based on phonological information is performed, the low-confidence frame removal unit 211 may prompt the user to extend the conversion source speaker voice database 100 by outputting information indicating which phonological features are currently underrepresented among the frames.
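
As one illustration of the clustering in s220, the k-means option can be sketched as follows. The feature vectors, the deterministic initialization from the first k points, and the fixed iteration count are assumptions made here; the specification does not state which features represent the phoneme type of a frame.

```python
import math


def k_means(points, k, iterations=20):
    """Plain k-means on a list of equal-length feature vectors.

    Initialization uses the first k points (an assumption chosen to keep the
    example deterministic); practical systems usually randomize it.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical two-dimensional frame features forming two obvious groups.
features = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.2), (5.1, 4.9), (0.2, 0.1)]
labels, centroids = k_means(features, k=2)
```

The returned label of each frame plays the role of its cluster number in the subsequent removal step.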

Next, for each cluster obtained in s220, the low-confidence frame removal unit 211 identifies the frames with a high certainty factor 202, identifies the remaining frames as "removal frames", and excludes those removal frames from the conversion source speaker voice database 209 (s221).

FIG. 10 is a diagram for explaining a method of identifying the removal frames. As shown in the figure, the low-confidence frame removal unit 211 classifies all frames in the conversion source speaker voice database 209 into n clusters (cluster 1, cluster 2, cluster 3, ..., cluster n). Then, for each cluster, the low-confidence frame removal unit 211 arranges the frames in that cluster in descending order of the certainty factor 202, and excludes from the conversion source speaker voice database 209 all of the lower-ranked frames that remain after setting aside the top m frames with the highest certainty factors (hereinafter referred to as removal frames). The method of identifying the removal frames is not limited to this; for example, the removal frames may be all frames (in each cluster) whose certainty factor is below a predetermined threshold.
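
Both selection rules described for FIG. 10 can be sketched as follows. The tuple layout (frame id, cluster id, certainty factor) and the function names are hypothetical; only the two rules themselves (keep the top m per cluster, or remove everything below a threshold) come from the description.

```python
from collections import defaultdict


def group_by_cluster(frames):
    """frames: list of (frame_id, cluster_id, certainty) tuples."""
    clusters = defaultdict(list)
    for frame_id, cluster_id, certainty in frames:
        clusters[cluster_id].append((certainty, frame_id))
    return clusters


def removal_frames_top_m(frames, m):
    """Keep the m most certain frames of each cluster; return the ids of the rest."""
    removal = set()
    for members in group_by_cluster(frames).values():
        ranked = sorted(members, key=lambda item: item[0], reverse=True)
        removal.update(frame_id for _, frame_id in ranked[m:])
    return removal


def removal_frames_below(frames, threshold):
    """Alternative rule: remove every frame whose certainty is below threshold."""
    return {frame_id for frame_id, _, certainty in frames if certainty < threshold}


sample = [(0, 1, 0.9), (1, 1, 0.4), (2, 1, 0.7), (3, 2, 0.3), (4, 2, 0.8)]
```

For the sample above, removal_frames_top_m(sample, 2) selects only frame 1, whereas removal_frames_below(sample, 0.5) selects frames 1 and 3; the top-m rule guarantees that every cluster keeps some frames even when all of its certainty factors are low.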

The low-confidence frame removal unit 211 identifies all frames in the target speaker voice database 210 that correspond in time to the removal frames identified in s221 (hereinafter referred to as corresponding removal frames) (s222).

Then, the low-confidence frame removal unit 211 generates the conversion source speaker voice database 212 by removing the removal frames from the conversion source speaker voice database 209. The low-confidence frame removal unit 211 also generates the target speaker voice database 213 by removing the corresponding removal frames from the target speaker voice database 210. In this way, a corrected parallel corpus, from which the frames with low certainty factors have been removed, is created.
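
The paired removal in s221 and s222 can be sketched as follows. The sketch assumes that, after time alignment, frame i of the source database corresponds in time to frame i of the target database; that index correspondence and the function name are assumptions made for illustration.

```python
def remove_paired_frames(source_frames, target_frames, removal_indices):
    """Drop the removal frames and the target frames at the same time positions.

    Assumes both sequences are already time-aligned and of equal length.
    """
    if len(source_frames) != len(target_frames):
        raise ValueError("time-aligned sequences must have the same length")
    keep = [i for i in range(len(source_frames)) if i not in removal_indices]
    return [source_frames[i] for i in keep], [target_frames[i] for i in keep]


source_212, target_213 = remove_paired_frames(
    ["s0", "s1", "s2", "s3"], ["t0", "t1", "t2", "t3"], {1, 3})
```

Removing the same positions from both sides keeps the corpus parallel, so every remaining source frame still has its target counterpart.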

The reason the low-confidence frame removal unit 211 removes the removal frames only after performing the clustering in s220 is to keep the phonemes in each database balanced. Ideally, for appropriate voice quality conversion, all phonemes should be represented in the databases in a well-balanced manner.

Next, as shown in FIG. 4, the voice quality conversion unit 215 (conversion model generation unit 2151) generates the voice quality conversion model 102 by performing machine learning on the conversion source speaker voice database 212 and the target speaker voice database 213.

Once the voice quality conversion model 102 has been generated in this way, the voice quality conversion server 1000 accepts voice input from the operator terminal 30 via the network 50.

When the voice quality conversion server 1000 accepts voice input from the operator terminal 30, that is, when the conversion source speaker voice 103 is input to the voice quality conversion unit 215, the voice quality conversion unit 215 converts the input conversion source speaker voice 103 into voice having the voice quality of the target speaker voice database 101 (target speaker voice 104).

Then, the voice output unit 217 transmits the converted target speaker voice 104 to the service robot 20 via the network 50, and the service robot 20 outputs the target speaker voice 104 through the voice output I/F 27 (that is, it emits voice with the voice quality of the service robot 20). This completes the voice quality conversion from the conversion source speaker voice 103 to the target speaker voice 104.

As described above, the voice quality conversion system 10 of the present embodiment estimates the features of the voiced portions, unvoiced portions, and accent portions in the voice of the first voice quality (the voice quality of the voice in the conversion source speaker voice database 100) and the voice of the second voice quality (the voice quality of the voice in the target speaker voice database 101), generates a conversion model (voice quality conversion model 102) that estimates the relationship between the features of the voiced portions, unvoiced portions, and accent in the voice of the first voice quality and those in the voice of the second voice quality, and converts input voice of the first voice quality into voice of the second voice quality based on the generated conversion model. The voice quality of the input voice can therefore be converted into a different voice quality while the features of the voiced portions, unvoiced portions, and accent are maintained. This makes it possible to perform voice quality conversion accurately between voices of different voice qualities.

For example, according to the voice quality conversion system 10 of the present embodiment, the operator's speech can be converted into the voice quality of the voice emitted by the service robot 20 without becoming unnaturally devoiced, without acquiring unnecessary accents, and without being perceived by the user as a different phoneme at unintended places.

In addition, since the voice quality conversion system 10 of the present embodiment outputs voice having the second voice quality, users of the voice quality conversion system 10 and others can listen to accurately converted voice that retains the features of the first voice quality.

Further, when recognizing the first voice, the voice quality conversion system 10 of the present embodiment removes the portions of the voice of the first voice quality that do not satisfy a predetermined condition, based on the certainty factor, which is a value indicating the certainty of the recognition (low-confidence frame removal unit 211), and generates the conversion model based on the voice of the first voice quality from which those portions have been removed. The accuracy of the speech recognition of the first voice can therefore be improved. This makes it possible to achieve more accurate voice quality conversion.

In particular, since the voice quality conversion system 10 of the present embodiment calculates the certainty factor of the first voice as the certainty of recognition of the phonemes of the first voice, it can convert voice into a voice quality that sounds more natural.

Further, the voice quality conversion system 10 of the present embodiment classifies the voice of the first voice quality into a plurality of groups according to phoneme type (that is, performs clustering) and removes a predetermined proportion of the voice portions of each classified group, so the phonemes can be kept balanced across the groups. This makes it possible to convert voice into a stable voice quality.

In addition, the voice quality conversion system 10 of the present embodiment generates synthesized voice of the second voice quality that has the features of the voiced portions, unvoiced portions, and accent of the first voice, and generates the conversion model based on the generated synthesized voice. By making use of synthesized voice, it can therefore convert voice having a wide variety of semantic content.

Further, the voice quality conversion system 10 of the present embodiment corrects the timing of pronunciation in the voice of the first voice quality or the voice of the second voice quality based on the timing of the voiced portions, unvoiced portions, and accent portions in the voice of the first voice quality, and generates the conversion model based on the corrected voice (time alignment processing unit 208). The correspondence between the voice of the first voice quality and the voice of the second voice quality can therefore be grasped accurately, and accurate voice quality conversion can be performed.

--Embodiment 2--
The voice quality conversion system 10 of the present embodiment warns the user about the accuracy of voice quality conversion by outputting a notification when the accuracy of speech recognition on the conversion source speaker voice database 100 is low.

<Configuration and function>
FIG. 11 is a diagram for explaining an overview of the functions of the voice quality conversion server 1000 according to the second embodiment. As shown in the figure, the voice quality conversion server 1000 according to the second embodiment has substantially the same functions as the voice quality conversion server 1000 according to the first embodiment, but the low-confidence frame removal unit 211 differs from that of the first embodiment.

That is, the low-confidence frame removal unit 211 divides the first voice into a plurality of portions, calculates the certainty factor for each of the divided portions, determines based on the calculated certainty factors whether the accuracy of the speech recognition is sufficient, and, when it determines that the accuracy is not sufficient, outputs information to that effect.

The other elements (the configuration of the voice quality conversion system 10, the functions of the operator terminal 30, and the functions of the service robot 20) are the same as in the first embodiment.

Here, the low-confidence frame removal unit 211 of the present embodiment will be described.
<Low-confidence frame removal unit 211>
FIG. 12 is a diagram illustrating an example of the functions of the low-confidence frame removal unit 211 according to the second embodiment. First, as in the first embodiment, the low-confidence frame removal unit 211 clusters the frames (s220) and excludes the removal frames from the conversion source speaker voice database 209 (s221). For example, the low-confidence frame removal unit 211 performs k-means clustering or decision tree clustering based on phonological information.

Next, based on the frames remaining after the removal frames were excluded in s221, the low-confidence frame removal unit 211 determines whether the accuracy of the speech recognition performed by the speech recognition unit 200 is sufficient (s300).

Specifically, for example, for each frame clustered in s220, the low-confidence frame removal unit 211 checks whether the certainty factor of that frame is at or above a predetermined threshold (for example, 70%). If the proportion of frames whose certainty factor is at or above the threshold is at least a predetermined proportion, or if the number of such frames is at least a predetermined number, the low-confidence frame removal unit 211 determines that the accuracy of the speech recognition is sufficient; otherwise, it determines that the accuracy of the speech recognition is not sufficient. When it determines that the accuracy is not sufficient, the low-confidence frame removal unit 211 outputs information to that effect (for example, by displaying it on the input/output device 1005 to present it to the user).
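
The determination in s300 can be sketched as follows. The threshold of 0.7 mirrors the 70% example in the description, while min_ratio and min_count stand in for the "predetermined proportion" and "predetermined number", whose values the specification does not give; the function name is hypothetical.

```python
def recognition_accuracy_is_sufficient(certainties, threshold, min_ratio, min_count):
    """Return True when enough frames reach the certainty threshold.

    The accuracy is deemed sufficient when either the proportion or the
    number of confident frames reaches its predetermined value.
    """
    confident = sum(1 for c in certainties if c >= threshold)
    ratio = confident / len(certainties) if certainties else 0.0
    return ratio >= min_ratio or confident >= min_count


remaining = [0.9, 0.8, 0.3, 0.2]
if not recognition_accuracy_is_sufficient(remaining, 0.7, min_ratio=0.6, min_count=3):
    warning = "Speech recognition accuracy is insufficient."
else:
    warning = None
```

In this example only two of four frames reach the threshold, so neither criterion is met and a warning message is prepared for display.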

For example, when decision tree clustering based on phonological information was performed in s220, the low-confidence frame removal unit 211 prompts the user to extend the conversion source speaker voice database 100 by outputting information indicating which phonological features are underrepresented among the frames.

FIG. 13 is a diagram illustrating an example of the information output by the low-confidence frame removal unit 211 according to the second embodiment. As shown in the figure, when there is a cluster in which three or fewer frames have a certainty factor of 70% or more ("cluster 2" in the figure), the low-confidence frame removal unit 211 presents a display 300 indicating that the accuracy of speech recognition for that cluster is not sufficient (for example, a highlight or a text warning) on the monitor, display, or the like of the input/output device 1005. The display 300 may instead be presented by the operator terminal 30 or another terminal.
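
The per-cluster check behind the display 300 can be sketched as follows. The values 0.7 and 3 follow the example in FIG. 13; the input layout (cluster id, certainty factor) and the function name are assumptions.

```python
from collections import Counter


def clusters_needing_more_data(frames, threshold=0.7, max_confident=3):
    """Return the ids of clusters with at most max_confident confident frames.

    frames: list of (cluster_id, certainty) pairs.
    """
    cluster_ids = sorted({cluster_id for cluster_id, _ in frames})
    confident = Counter(cluster_id for cluster_id, c in frames if c >= threshold)
    return [cid for cid in cluster_ids if confident[cid] <= max_confident]


frames_by_cluster = (
    [(1, 0.9)] * 5
    + [(2, 0.9), (2, 0.8), (2, 0.2), (2, 0.1)]
    + [(3, 0.75)] * 4
)
flagged = clusters_needing_more_data(frames_by_cluster)
```

Here only cluster 2 is flagged, because it has just two frames at or above the threshold; a user interface could highlight that cluster and ask for more recordings with the corresponding phonological features.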

As described above, the voice quality conversion system 10 of the present embodiment divides the first voice (the voice in the conversion source speaker voice database 100) into a plurality of portions, calculates the certainty factor 202 for each of the divided portions, and outputs information to that effect when the accuracy of the speech recognition is not sufficient. The user can therefore be warned when highly accurate voice quality conversion may not be achieved.

The above description of the embodiments is intended to facilitate understanding of the present invention and does not limit the present invention. The present invention may be changed and improved without departing from its spirit, and the present invention includes equivalents thereof.

10: voice quality conversion system, 1000: voice quality conversion server, 200: speech recognition unit, 203: voiced/unvoiced estimation unit, 204: accent estimation unit, 215: voice quality conversion unit, 2151: conversion model generation unit

Claims

A voice quality conversion system comprising a processor and a memory, for converting the voice quality of input voice into a different voice quality, the system comprising:
a speech recognition unit that recognizes voice having a first voice quality and voice having a second voice quality;
a voiced/unvoiced estimation unit that estimates features of voiced portions and unvoiced portions in the recognized voice of the first voice quality and the recognized voice of the second voice quality;
an accent estimation unit that estimates features of accent portions in the recognized voice of the first voice quality and the recognized voice of the second voice quality;
a conversion model generation unit that generates a conversion model that estimates a relationship between the estimated features of the voiced portions, unvoiced portions, and accent in the voice of the first voice quality and the estimated features of the voiced portions, unvoiced portions, and accent in the voice of the second voice quality; and
a voice quality conversion unit that converts input voice having the first voice quality into voice having the second voice quality based on the generated conversion model.

The voice quality conversion system according to claim 1, further comprising:
the certainty factor calculation unit that calculates, when the first voice is recognized, a certainty factor that is a value indicating the certainty of the speech recognition; and
a low-confidence frame removal unit that removes, based on the calculated certainty factor, portions of the voice of the first voice quality that do not satisfy a predetermined condition,
wherein the conversion model generation unit generates the conversion model based on the voice of the first voice quality from which the portions that do not satisfy the predetermined condition have been removed.

The voice quality conversion system according to claim 1, further comprising a voice output unit that outputs the converted voice having the second voice quality.

The voice quality conversion system according to claim 2, wherein the low-confidence frame removal unit classifies the voice of the first voice quality into a plurality of groups according to phoneme type, and removes, from the voice of each of the classified groups, the portions whose certainty factor does not satisfy a predetermined condition.

The voice quality conversion system according to claim 2, wherein the certainty factor calculation unit calculates the certainty factor of the first voice as the certainty of recognition of the phonemes of the first voice.

The voice quality conversion system according to claim 1, further comprising a speech synthesis unit that generates synthesized voice of the second voice quality having the estimated features of the voiced portions, unvoiced portions, and accent in the first voice,
wherein the conversion model generation unit generates the conversion model based on the generated synthesized voice.

The voice quality conversion system according to claim 2, wherein the low-confidence frame removal unit divides the first voice into a plurality of portions, calculates the certainty factor for each of the divided portions, determines based on the calculated certainty factors whether the accuracy of the speech recognition is sufficient, and, when it determines that the accuracy of the speech recognition is not sufficient, outputs information to that effect.

The voice quality conversion system according to claim 1, further comprising a time alignment processing unit that corrects the timing of pronunciation in the voice of the first voice quality or the voice of the second voice quality based on the timing of the voiced part, the voiceless part, and the accent part in the estimated voice of the first voice quality,
wherein the conversion model generation unit generates the conversion model based on the corrected voice of the first voice quality or the corrected voice of the second voice quality.
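One way to picture the timing correction in this claim is to align per-frame labels of the two voices. The sketch below uses dynamic time warping (DTW), a common alignment technique that the claim itself does not name, so it should be read as a substitute illustration. The label alphabet ("V" voiced, "U" voiceless, "A" accent), the 0/1 mismatch cost, and the function names are assumptions.

```python
def dtw_path(src_labels, tgt_labels):
    """Return a monotone alignment path [(src_index, tgt_index), ...]."""
    n, m = len(src_labels), len(tgt_labels)
    if n == 0 or m == 0:
        raise ValueError("both label sequences must be non-empty")
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local cost: 0 when the labels agree, 1 otherwise.
            d = 0 if src_labels[i - 1] == tgt_labels[j - 1] else 1
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    # Backtrack from the end, preferring the diagonal on ties.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        best = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
        if best == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif best == cost[i - 1][j]:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path


def align_to_source(src_labels, tgt_labels, tgt_frames):
    """Re-time target frames onto the source timeline (first match per frame)."""
    first_match = {}
    for i, j in dtw_path(src_labels, tgt_labels):
        first_match.setdefault(i, j)
    return [tgt_frames[first_match[i]] for i in range(len(src_labels))]


src = ["V", "V", "U", "A"]
tgt = ["V", "U", "U", "A", "A"]
aligned = align_to_source(src, tgt, [10, 20, 30, 40, 50])
```

After alignment the target sequence has one frame per source frame, so paired frames can be fed to conversion model training. Here the integers 10 to 50 stand in for real feature vectors.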

The voice quality conversion system according to claim 1, further comprising:
a voice output unit that outputs the converted voice having the second voice quality;
the confidence calculation unit that, when the first voice is recognized, calculates a confidence level, which is a value indicating the certainty of the voice recognition;
a low-confidence frame removal unit that, based on the calculated confidence level, removes the part of the voice of the first voice quality that does not satisfy a predetermined condition;
a time alignment processing unit that corrects the timing of pronunciation in the voice of the first voice quality or the voice of the second voice quality based on the timing of the voiced part, the voiceless part, and the accent part in the estimated voice of the first voice quality; and
a voice synthesis unit that generates a synthesized voice of the second voice quality having the estimated features of the voiced part, the voiceless part, and the accent in the first voice,
wherein the confidence calculation unit calculates the confidence level of the first voice as the certainty of recognition of the phonemes of the first voice,
the low-confidence frame removal unit
classifies the voice of the first voice quality into a plurality of groups according to phoneme type and, for the voice in each of the classified groups, removes the part of the voice whose confidence level does not satisfy a predetermined condition, and
divides the first voice into a plurality of parts, calculates the confidence level for each of the divided parts, determines, based on the calculated confidence levels, whether the accuracy of the voice recognition is sufficient, and, when it determines that the accuracy is not sufficient, outputs information to that effect, and
the conversion model generation unit generates the conversion model based on the voice of the first voice quality or the voice of the second voice quality in which the timing of pronunciation has been corrected, the voice of the first voice quality from which the part whose confidence level does not satisfy the predetermined condition has been removed, and the generated synthesized voice.

A voice quality conversion method for converting the sound quality of input voice into a different voice quality, wherein an information processing apparatus comprising a processor and a memory executes:
a voice recognition process of recognizing a voice having a first voice quality and a voice having a second voice quality;
a voiced/voiceless estimation process of estimating features of a voiced part and a voiceless part in the recognized voice of the first voice quality and the recognized voice of the second voice quality;
an accent estimation process of estimating a feature of an accent part in the recognized voice of the first voice quality and the recognized voice of the second voice quality;
a conversion model generation process of generating a conversion model that estimates a relation between the estimated features of the voiced part, the voiceless part, and the accent in the voice of the first voice quality and the estimated features of the voiced part, the voiceless part, and the accent in the voice of the second voice quality; and
a voice quality conversion process of converting, based on the generated conversion model, input voice having the first voice quality into voice having the second voice quality.

The voice quality conversion method according to claim 10, further executing:
the confidence calculation process of calculating, when the first voice is recognized, a confidence level, which is a value indicating the certainty of the voice recognition; and
a low-confidence frame removal process of removing, based on the calculated confidence level, the part of the voice of the first voice quality that does not satisfy a predetermined condition,
wherein the conversion model generation process generates the conversion model based on the voice of the first voice quality from which the part of the voice that does not satisfy the predetermined condition has been removed.

The voice quality conversion method according to claim 10, further executing a voice output process of outputting the converted voice having the second voice quality.

A voice quality conversion program for converting the sound quality of input voice into a different voice quality, the program causing an information processing apparatus comprising a processor and a memory to execute:
a voice recognition process of recognizing a voice having a first voice quality and a voice having a second voice quality;
a voiced/voiceless estimation process of estimating features of a voiced part and a voiceless part in the recognized voice of the first voice quality and the recognized voice of the second voice quality;
an accent estimation process of estimating a feature of an accent part in the recognized voice of the first voice quality and the recognized voice of the second voice quality;
a conversion model generation process of generating a conversion model that estimates a relation between the estimated features of the voiced part, the voiceless part, and the accent in the voice of the first voice quality and the estimated features of the voiced part, the voiceless part, and the accent in the voice of the second voice quality; and
a voice quality conversion process of converting, based on the generated conversion model, input voice having the first voice quality into voice having the second voice quality.

The voice quality conversion program according to claim 13, further causing the information processing apparatus to execute:
the confidence calculation process of calculating, when the first voice is recognized, a confidence level, which is a value indicating the certainty of the voice recognition; and
a low-confidence frame removal process of removing, based on the calculated confidence level, the part of the voice of the first voice quality that does not satisfy a predetermined condition,
wherein the conversion model generation process generates the conversion model based on the voice of the first voice quality from which the part of the voice that does not satisfy the predetermined condition has been removed.

The voice quality conversion program according to claim 13, further causing the information processing apparatus to execute a voice output process of outputting the converted voice having the second voice quality.