JP6028289B2

JP6028289B2 - Relay system, relay method and program

Info

Publication number: JP6028289B2
Application number: JP2013037293A
Authority: JP
Inventors: 野村　英司; 豊國田; 石原　晋也
Original assignee: Nippon Telegraph and Telephone East Corp
Current assignee: Nippon Telegraph and Telephone East Corp
Priority date: 2013-02-27
Filing date: 2013-02-27
Publication date: 2016-11-16
Anticipated expiration: 2033-02-27
Also published as: JP2014164241A

Description

The present invention relates to a technique for converting voice in a call.

In recent years, as the capabilities of information processing devices and communication devices have improved, new additional functions have been provided for call functions that conventionally did nothing more than transmit and receive voice. For example, in the technique described in Patent Document 1, when an incoming videophone call is answered, it is determined whether the other party's telephone number is registered in the telephone directory. If it is registered, the received image and voice are each analyzed. If the image analysis shows that no person other than the caller appears and that no person is looking into the screen on the caller's side, and the voice analysis indicates that the caller is not in a public place, the user's own image is transmitted.

Patent Document 1: JP 2009-065620 A

The technique disclosed in Patent Document 1 is merely one example, and newer additional functions are in demand. In view of the above circumstances, an object of the present invention is to provide a new function added to a voice call.

One aspect of the present invention is a relay system that relays a call between a first terminal and a second terminal, the relay system including: a first recognition unit that recognizes utterance content based on voice transmitted from the first terminal; a second recognition unit that recognizes pitch or intonation based on the voice transmitted from the first terminal; a storage unit that stores speech units used when generating a synthesized voice; a voice generation unit that, by combining the speech units stored in the storage unit, generates a synthesized voice that reproduces the utterance content recognized by the first recognition unit and the pitch or intonation recognized by the second recognition unit; and a communication unit that transmits the synthesized voice generated by the voice generation unit to the second terminal.

One aspect of the present invention is a relay method performed by a relay system that relays a call between a first terminal and a second terminal, the relay method including: a first recognition step of recognizing utterance content based on voice transmitted from the first terminal; a second recognition step of recognizing pitch or intonation based on the voice transmitted from the first terminal; a voice generation step of generating, by combining speech units stored in a storage unit, a synthesized voice that reproduces the utterance content recognized in the first recognition step and the pitch or intonation recognized in the second recognition step; and a communication step of transmitting the synthesized voice generated in the voice generation step to the second terminal.

One aspect of the present invention is a program for causing a computer that relays a call between a first terminal and a second terminal to execute: a first recognition step of recognizing utterance content based on voice transmitted from the first terminal; a second recognition step of recognizing pitch or intonation based on the voice transmitted from the first terminal; a voice generation step of generating, by combining speech units stored in a storage unit, a synthesized voice that reproduces the utterance content recognized in the first recognition step and the pitch or intonation recognized in the second recognition step; and a communication step of transmitting the synthesized voice generated in the voice generation step to the second terminal.

The present invention makes it possible to provide a new additional function in a voice call.

FIG. 1 is a system configuration diagram showing the system configuration of a call system 1.
FIG. 2 is a schematic block diagram showing the functional configuration of a relay system 300.
FIG. 3 is a sequence diagram showing a specific example of the processing flow of the call system 1.
FIG. 4 is a flowchart showing a specific example of the processing of the relay system 300 while a call session is established.

FIG. 1 is a system configuration diagram showing the system configuration of the call system 1. The call system 1 includes a first terminal 100, a second terminal 200, and a relay system 300. In the call system 1, a call takes place between the user of the first terminal 100 and the user of the second terminal 200. Voice transmitted from a terminal designated as a conversion target terminal is converted into a different voice by the relay system 300 and relayed to the other terminal. Both the first terminal 100 and the second terminal 200 may be conversion target terminals, or only one of them may be. The following describes the configuration in which the first terminal 100 is a conversion target terminal and the second terminal 200 is not. In the following description, a terminal that is not a conversion target terminal is referred to as a "non-target terminal".

The first terminal 100 is a terminal device having a call function. The first terminal 100 is a device having an information processing function and a communication function, such as a mobile phone, a PHS (Personal Handy-phone System), a smartphone, a fixed-line telephone, a tablet device, a personal computer, a game device, or a television receiver. The first terminal 100 establishes a call session with the second terminal 200 via the relay system 300. Through the call session, the first terminal 100 transmits the user's utterance to the second terminal 200 as a voice signal. Through the call session, the first terminal 100 also receives a voice signal from the second terminal 200 and outputs it from a speaker.

The second terminal 200 is a terminal device having a call function. The second terminal 200 is a device having an information processing function and a communication function, such as a mobile phone, a PHS, a smartphone, a fixed-line telephone, a tablet device, a personal computer, a game device, or a television receiver. The second terminal 200 establishes a call session with the first terminal 100 via the relay system 300. Through the call session, the second terminal 200 transmits the user's utterance to the first terminal 100 as a voice signal. Through the call session, the second terminal 200 also receives a voice signal from the first terminal 100 and outputs it from a speaker.

FIG. 2 is a schematic block diagram showing the functional configuration of the relay system 300. The relay system 300 consists of one or more information processing apparatuses. For example, when the relay system 300 consists of a single information processing apparatus, that apparatus includes a CPU (Central Processing Unit), a memory, an auxiliary storage device, and the like connected by a bus, and executes a relay program. By executing the relay program, the information processing apparatus functions as an apparatus including a communication unit 301, a call control unit 302, a voice recognition unit 303, a pitch recognition unit 304, an intonation recognition unit 305, a voice information storage unit 306, and a voice generation unit 307. All or some of the functions of the relay system 300 may be implemented using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The relay system 300 may also be implemented by dedicated hardware. The relay program may be recorded on a computer-readable recording medium. A computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The relay program may be transmitted and received via a telecommunication line.

The communication unit 301 communicates with the first terminal 100 and the second terminal 200 via a network.
The call control unit 302 controls the call between the first terminal 100 and the second terminal 200. The call control unit 302 controls the call by performing processing in accordance with, for example, SIP (Session Initiation Protocol). After a call session is established between the first terminal 100 and the second terminal 200, the call control unit 302 operates as follows.

The call control unit 302 outputs the voice signal transmitted from the first terminal 100 (the conversion target terminal) to the voice recognition unit 303, the pitch recognition unit 304, and the intonation recognition unit 305. The call control unit 302 does not relay the voice signal transmitted from the first terminal 100 itself; instead, it relays the synthesized voice generated by the voice generation unit 307 to the call partner's terminal (the second terminal 200).

For a voice signal transmitted from the second terminal 200 (the non-target terminal), the call control unit 302 does not output the voice signal to the voice recognition unit 303, the pitch recognition unit 304, or the intonation recognition unit 305; it relays the signal to the call partner's terminal (the first terminal 100) via the communication unit 301.
The call control unit 302 also outputs the voice ID transmitted from the conversion target terminal to the voice generation unit 307. The voice ID is identification information indicating the voice timbre into which the voice of the user calling from the conversion target terminal is converted.

The voice recognition unit 303 recognizes the utterance content of the voice signal output from the call control unit 302 and generates character information. The voice recognition unit 303 generates a voice recognition result by adding time information to the generated character information. The time information indicates the time at which each character was pronounced. For example, the time information may take as its start point the moment at which a human voice is detected in the voice signal transmitted from the conversion target terminal, and may be expressed as the elapsed time from that start point.
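As an illustration of the time information described above, the following is a minimal sketch of a voice recognition result in which each recognized character carries the elapsed time from the start point. The names `TimedChar` and `attach_time_info` are hypothetical and do not come from the patent; the recognizer that produces the characters and their timestamps is assumed to exist elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedChar:
    """One recognized character plus its elapsed time (hypothetical structure)."""
    char: str
    elapsed_s: float  # seconds since a human voice was first detected


def attach_time_info(chars, absolute_times, voice_start):
    """Pair each character with the time elapsed since `voice_start`.

    `chars` and `absolute_times` are assumed outputs of some recognizer;
    `voice_start` is the time at which voice was first detected.
    """
    if len(chars) != len(absolute_times):
        raise ValueError("chars and absolute_times must have the same length")
    return [TimedChar(c, t - voice_start) for c, t in zip(chars, absolute_times)]
```

Keeping the time axis relative to the same start point in every recognition result is what would let a later stage align characters with pitch and intonation values.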

The pitch recognition unit 304 recognizes pitch changes in the voice signal output from the call control unit 302 and generates pitch information. The pitch recognition unit 304 may recognize pitch changes based on, for example, changes in the frequency of the voice signal. The pitch recognition unit 304 generates a pitch recognition result by adding time information to the generated pitch information.
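The patent does not specify how frequency changes are measured. One common approach, shown here only as an assumed example, is to estimate the fundamental frequency of each short frame by autocorrelation and to tag each estimate with its elapsed time. The function names, the frame length, and the 80 to 400 Hz search range are all illustrative choices.

```python
def estimate_f0(frame, sample_rate, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation.

    Returns 0.0 when no positive autocorrelation peak is found.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Similarity between the frame and a copy of itself shifted by `lag`.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def pitch_track(samples, sample_rate, frame_len=400):
    """Return (elapsed seconds, f0 in Hz) for each non-overlapping frame."""
    track = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        f0 = estimate_f0(samples[start:start + frame_len], sample_rate)
        track.append((start / sample_rate, f0))
    return track
```

A production system would more likely use overlapping windows and a more robust estimator; the point here is only that the output is a time-stamped sequence of pitch values.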

The intonation recognition unit 305 recognizes intonation changes in the voice signal output from the call control unit 302 and generates intonation information. The intonation recognition unit 305 may recognize intonation changes based on, for example, changes in the amplitude of the voice signal. The intonation recognition unit 305 generates an intonation recognition result by adding time information to the generated intonation information.
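One simple way to track amplitude changes, offered here as an assumption rather than as the patent's method, is to compute the root-mean-square (RMS) level of each frame and record it with its elapsed time. The name `amplitude_envelope` and the 160-sample frame length are illustrative.

```python
import math


def amplitude_envelope(samples, sample_rate, frame_len=160):
    """Return (elapsed seconds, RMS amplitude) for each non-overlapping frame."""
    envelope = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        envelope.append((start / sample_rate, rms))
    return envelope
```

The resulting envelope can later be used as a per-frame gain when the synthesized voice is assembled.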

The voice information storage unit 306 is configured using a storage device such as a magnetic hard disk device or a semiconductor storage device. The voice information storage unit 306 stores databases containing speech units generated in advance for each voice timbre, each in association with a voice ID. In other words, one database is stored per voice ID.

Based on the voice ID output by the call control unit 302, the voice generation unit 307 selects the database to be used for voice synthesis from among the multiple databases stored in the voice information storage unit 306. From the selected database, the voice generation unit 307 selects speech units based on the voice recognition result output by the voice recognition unit 303, the pitch recognition result output by the pitch recognition unit 304, and the intonation recognition result output by the intonation recognition unit 305. The voice generation unit 307 then connects the selected speech units based on the pitch recognition result and the intonation recognition result to generate a synthesized voice. The voice generation unit 307 may perform the voice synthesis by using, for example, frequency-domain singing articulation splicing.
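The selection and connection steps can be sketched as follows. This is a deliberately crude stand-in and not the frequency-domain splicing method mentioned above: the per-voice-ID database is modeled as a plain dictionary, pitch is changed by nearest-neighbour resampling (which also changes duration), and intonation is applied as a gain. All names and the data layout are hypothetical.

```python
def resample(samples, ratio):
    """Crude pitch change: read the unit `ratio` times faster (nearest neighbour)."""
    n_out = int(len(samples) / ratio)
    last = len(samples) - 1
    return [samples[min(int(i * ratio), last)] for i in range(n_out)]


def synthesize(voice_id, items, databases):
    """Concatenate speech units for one voice ID.

    `databases` maps voice ID -> {character -> list of samples} (assumed layout).
    `items` is a list of (character, pitch_ratio, gain) tuples, standing in for
    the merged voice, pitch, and intonation recognition results.
    """
    units = databases[voice_id]  # select the database for this voice ID
    output = []
    for char, pitch_ratio, gain in items:
        unit = resample(units[char], pitch_ratio)  # reproduce pitch
        output.extend(s * gain for s in unit)      # reproduce intonation
    return output
```

For example, with a database `{"voice-A": {"a": [1.0, 2.0, 3.0, 4.0]}}`, the call `synthesize("voice-A", [("a", 2.0, 0.5)], databases)` reads every second sample of the unit and halves its amplitude.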

FIG. 3 is a sequence diagram showing a specific example of the processing flow of the call system 1. In the flow shown in FIG. 3, the terminal that placed the call becomes the conversion target terminal, and the terminal that received the call becomes the non-target terminal. That is, the call control unit 302 of the relay system 300 recognizes the calling terminal as the conversion target terminal and the called terminal as the non-target terminal.

First, the user of the first terminal 100 (the conversion target terminal) operates the first terminal 100 to designate the second terminal 200 as the call destination and to instruct it to place the call. On receiving the instruction, the first terminal 100 transmits an INVITE message to the relay system 300 (step S101). In response to the INVITE message, the relay system 300 transmits a 100 Trying message to the first terminal 100 (step S102). Also in response to the INVITE message, the relay system 300 transmits an INVITE message to the second terminal 200, which is designated as the call destination (step S103).

On receiving the INVITE message from the relay system 300, the second terminal 200 transmits a 100 Trying message to the relay system 300 (step S104). Next, the second terminal 200 outputs a ringtone to notify its user of the incoming call. The second terminal 200 then transmits a 180 Ringing message to the relay system 300 (step S105).

On receiving the 180 Ringing message from the second terminal 200, the relay system 300 transmits a 180 Ringing message to the first terminal 100 (step S106). When the user of the second terminal 200 takes the second terminal 200 off-hook, the second terminal 200 transmits a 200 OK message to the relay system 300 (step S107). On receiving the 200 OK message from the second terminal 200, the relay system 300 transmits a 200 OK message to the first terminal 100 (step S108).

On receiving the 200 OK message from the relay system 300, the first terminal 100 transmits an ACK message to the relay system 300 (step S109). On receiving the ACK message from the first terminal 100, the relay system 300 transmits an ACK message to the second terminal 200 (step S110). Through the above processing, a call session is established between the first terminal 100 and the second terminal 200 (step S111).

Thereafter, when the user of the second terminal 200 places the second terminal 200 on-hook, the second terminal 200 transmits a BYE message to the relay system 300 (step S112). On receiving the BYE message from the second terminal 200, the relay system 300 transmits a BYE message to the first terminal 100 (step S113). Through the above processing, the call session established between the first terminal 100 and the second terminal 200 ends.
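The relay behaviour in steps S101 to S113 can be summarized as a small forwarding table. The sketch below models only which messages the relay system sends in response to each received message; it is not a SIP implementation, and the function name and the "first"/"second" labels are illustrative.

```python
def relay_actions(message, sender):
    """Return the (message, destination) pairs the relay sends in response.

    `sender` is "first" or "second"; this is a simplified model of FIG. 3.
    """
    if sender not in ("first", "second"):
        raise ValueError(f"unknown sender: {sender}")
    other = "second" if sender == "first" else "first"
    if message == "INVITE":
        # S102 and S103: acknowledge to the caller, then invite the callee.
        return [("100 Trying", sender), ("INVITE", other)]
    if message == "100 Trying":
        # S104: consumed by the relay; nothing is forwarded.
        return []
    if message in ("180 Ringing", "200 OK", "ACK", "BYE"):
        # S106, S108, S110, S113: pass the message on to the other terminal.
        return [(message, other)]
    raise ValueError(f"unsupported message: {message}")
```

For instance, `relay_actions("INVITE", "first")` yields the 100 Trying reply to the first terminal followed by the INVITE to the second terminal, matching steps S102 and S103.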

FIG. 4 is a flowchart showing a specific example of the processing of the relay system 300 while a call session is established. When the communication unit 301 receives a voice signal via the network (step S201), the call control unit 302 of the relay system 300 determines whether the source of the received voice signal is a conversion target terminal (step S202). If the source is not a conversion target terminal (step S202: NO), the call control unit 302 transmits the received voice signal to the call partner's terminal (non-target terminal) (step S207).

On the other hand, if the source is a conversion target terminal (step S202: YES), the call control unit 302 outputs the voice ID transmitted from the conversion target terminal to the voice generation unit 307. The call control unit 302 also outputs the received voice signal to the voice recognition unit 303, the pitch recognition unit 304, and the intonation recognition unit 305. The voice recognition unit 303 performs voice recognition processing on the voice signal and outputs the voice recognition result to the voice generation unit 307 (step S203). The pitch recognition unit 304 performs pitch recognition processing on the voice signal and outputs the pitch recognition result to the voice generation unit 307 (step S204). The intonation recognition unit 305 performs intonation recognition processing on the voice signal and outputs the intonation recognition result to the voice generation unit 307 (step S205).

The voice generation unit 307 generates a synthesized voice based on the voice ID, the voice recognition result, the pitch recognition result, and the intonation recognition result (step S206). The call control unit 302 transmits the voice signal of the synthesized voice generated by the voice generation unit 307 to the terminal that is the call partner of the conversion target terminal (the non-target terminal) (step S207). Through the above processing, a call takes place between the conversion target terminal and the non-target terminal.
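The branch in FIG. 4 can be sketched as a single function that either passes the audio through or replaces it with a synthesized voice. The recognizers and the synthesizer are injected as callables, since the patent does not define their interfaces; every name and signature below is an assumption made for illustration.

```python
def handle_audio(source, audio, conversion_targets, voice_id,
                 recognize_text, recognize_pitch, recognize_intonation,
                 synthesize):
    """Return the audio that should be sent to the call partner (step S207)."""
    if source not in conversion_targets:
        # S202: NO. Relay the received voice signal unchanged.
        return audio
    text = recognize_text(audio)               # S203
    pitch = recognize_pitch(audio)             # S204
    intonation = recognize_intonation(audio)   # S205
    return synthesize(voice_id, text, pitch, intonation)  # S206
```

Passing the set of conversion target terminals as an argument covers both the one-sided case described in the embodiment and the case in which both terminals are conversion targets.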

The call system 1 configured in this way makes it possible to provide a new function added to a voice call. Specifically, the voice of the user of the conversion target terminal is converted into a synthesized voice in the timbre indicated by the voice ID, in a form that reproduces the user's utterance content, pitch, and intonation. The synthesized voice is then transmitted to the terminal of the call partner of the conversion target terminal. The call partner can therefore enjoy the experience of talking with a different person.

For example, when a database for a well-known character (an anime character, a TV program character, or the like) is registered in the voice information storage unit 306 for each voice ID, the user of the conversion target terminal can talk with the user of the non-target terminal in the voice of the well-known character that he or she has selected.

Furthermore, in the call system 1, the synthesized voice reproduces not only the utterance content of the user of the conversion target terminal but also the pitch and intonation. This prevents an unnatural call in a mechanical voice and makes it possible to achieve a call that retains the speaker's characteristics and emotions.

<Modifications>
The pitch recognition unit 304 and the intonation recognition unit 305 need not both be provided; the system may be configured with only one of them.
In the sequence diagram shown in FIG. 3, the calling side is recognized as the conversion target terminal, but the system may be configured so that the called side is recognized as the conversion target terminal. For example, it may be configured as follows. First, a predetermined telephone number is notified to the calling party in advance. In the relay system 300, the telephone number notified in advance is registered in association with the telephone number of the called terminal that is to be the conversion target terminal. When the calling party dials the telephone number notified in advance, the relay system 300 establishes a call session with the telephone number of the conversion target terminal associated in advance with the dialed number. The relay system 300 then recognizes the calling side as the non-target terminal and the called side as the conversion target terminal, and performs the processing.
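The called-side variation amounts to a lookup from the pre-notified number to the registered conversion target terminal. A minimal sketch follows, assuming the registration is held as a dictionary; the function name, the returned fields, and the telephone numbers in the usage note are made up for illustration.

```python
def resolve_call(dialed_number, caller_number, registrations):
    """Decide whom to call and which side is converted.

    `registrations` maps a pre-notified number to the number of the called
    terminal that acts as the conversion target (assumed layout).
    Returns None when the dialed number is not registered.
    """
    callee = registrations.get(dialed_number)
    if callee is None:
        return None
    return {
        "callee": callee,              # session is established with this number
        "conversion_target": callee,   # called side is converted
        "non_target": caller_number,   # calling side is relayed unchanged
    }
```

With `registrations = {"0120-000-001": "03-0000-0002"}`, a call from "03-0000-0003" to "0120-000-001" resolves to "03-0000-0002" as both the callee and the conversion target.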

As the method for establishing the call session, a method may be adopted in which the relay system 300 establishes a call session with each of the first terminal 100 and the second terminal 200 separately (for example, V-shaped call origination).
Any two or all of the voice recognition processing in step S203, the pitch recognition processing in step S204, and the intonation recognition processing in step S205 may be executed in parallel. The order in which these processes are executed need not be limited to the order shown in the flowchart of FIG. 4.
The relay system 300 need not execute one or both of the pitch recognition processing and the intonation recognition processing. In that case, for example, the relay system 300 may perform voice synthesis using pitch information and intonation information stored in advance in the voice information storage unit 306.
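Parallel execution of the recognition steps can be sketched with the standard-library thread pool. The recognizers are passed in as a dictionary of callables, so omitting an entry models a configuration in which that recognition step is not executed; the function name and the dictionary keys are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_all(audio, recognizers):
    """Run every recognizer on the same audio concurrently.

    `recognizers` maps a result name (e.g. "text", "pitch", "intonation")
    to a callable taking the audio. Returns a dict of results by name.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(recognizers))) as pool:
        futures = {name: pool.submit(fn, audio) for name, fn in recognizers.items()}
        # result() blocks until each task finishes and re-raises its exception.
        return {name: future.result() for name, future in futures.items()}
```

Threads are used here only for simplicity; CPU-bound recognizers written in pure Python would likely need processes or native code to gain real parallelism.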

An embodiment of the present invention has been described in detail above with reference to the drawings. However, the specific configuration is not limited to this embodiment and also includes designs and the like that do not depart from the gist of the present invention.

Description of reference signs: 1: call system, 100: first terminal, 200: second terminal, 300: relay system, 301: communication unit, 302: call control unit, 303: voice recognition unit, 304: pitch recognition unit, 305: intonation recognition unit, 306: voice information storage unit, 307: voice generation unit

Claims

A relay system that relays a call between a first terminal and a second terminal, the relay system comprising:
a first recognition unit that recognizes utterance content based on voice transmitted from the first terminal;
a second recognition unit that recognizes pitch based on the voice transmitted from the first terminal;
a storage unit that stores speech units used when generating a synthesized voice;
a voice generation unit that, by combining the speech units stored in the storage unit, generates a synthesized voice that reproduces the utterance content recognized by the first recognition unit and the pitch recognized by the second recognition unit; and
a communication unit that transmits the synthesized voice generated by the voice generation unit to the second terminal.

A relay method performed by a relay system that relays a call between a first terminal and a second terminal, the relay method comprising:
a first recognition step of recognizing utterance content based on voice transmitted from the first terminal;
a second recognition step of recognizing pitch based on the voice transmitted from the first terminal;
a voice generation step of generating, by combining speech units stored in a storage unit, a synthesized voice that reproduces the utterance content recognized in the first recognition step and the pitch recognized in the second recognition step; and
a communication step of transmitting the synthesized voice generated in the voice generation step to the second terminal.

A program for causing a computer that relays a call between a first terminal and a second terminal to execute:
a first recognition step of recognizing utterance content based on voice transmitted from the first terminal;
a second recognition step of recognizing pitch based on the voice transmitted from the first terminal;
a voice generation step of generating, by combining speech units stored in a storage unit, a synthesized voice that reproduces the utterance content recognized in the first recognition step and the pitch recognized in the second recognition step; and
a communication step of transmitting the synthesized voice generated in the voice generation step to the second terminal.