JP2004077738A

JP2004077738A - Content vocalization providing system

Info

Publication number: JP2004077738A
Application number: JP2002237251A
Authority: JP
Inventors: Shinji Hayakawa; 早川　慎司; Mayumi Harada; 原田　真弓; Satoshi Watanabe; 渡辺　聡
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2002-08-16
Filing date: 2002-08-16
Publication date: 2004-03-11

Abstract

PROBLEM TO BE SOLVED: To provide a content vocalization providing system in which a content creator or the like can substantially influence the quality, attributes, etc. of the synthesized voice obtained by vocalizing content, while the workload and cost borne by the content creator or the like are kept low.

SOLUTION: The content vocalization providing system comprises: a voice synthesis condition input means for inputting arbitrary voice synthesis conditions for the conversion into vocalized data in association with content; and a vocalizing means for converting the content to be provided into vocalized data according to the voice synthesis conditions input by the voice synthesis condition input means and transmitting the data to the terminal that requested the content.

COPYRIGHT: (C)2004,JPO

Description

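[0001]
[Technical Field of the Invention]
The present invention relates to a content audio conversion providing system, for example, a system in which a voice conversion unit converts content containing text information into speech and provides it to a user.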
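[0002]
[Prior Art]
In general, technology for converting text data into speech data is called text-to-speech (TTS) technology, and the speech output by TTS is called synthesized speech. JP-A-2001-282268 (hereinafter, Document 1) and JP-A-2002-14952 (hereinafter, Document 2) describe the idea of having a server on a network automatically acquire text information, perform TTS, convert the resulting synthesized speech into data, and distribute it to user terminals over the network. In this specification, a server that is placed on a network and performs TTS is called a voice conversion unit.
[0003]
Document 1 describes a system that automatically obtains text information from the network, generates an audio file in the voice conversion unit, and distributes the file to a Web server or telephone server registered in advance.
[0004]
Document 2 describes a system in which, once a user has registered the URL of a Web page or the like with the voice conversion unit in advance, the voice conversion unit periodically fetches the registered Web page and, if any part has been updated, converts that part into speech and notifies the user.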
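[0005]
[Problems to Be Solved by the Invention]
However, conventional voice conversion units are designed on the premise of autonomous operation and give no consideration to content creators.
[0006]
For example, a content creator has no way of participating in or checking the synthesized speech after TTS, such as what its speech quality is or which parts are read and how; the voice conversion unit converts the content to speech entirely at its own discretion. This is a highly unsatisfactory situation for content creators, for whom the desire to check the synthesized speech and take part in the conversion is entirely natural.
[0007]
Of course, a content creator can check the synthesized speech provided to users by accessing the voice conversion unit and specifying his or her own content, but has no means of correcting the output sound.
[0008]
At present, the only way around this is for the creator to produce the audio for distribution personally. If the audio is recorded by a human speaker, however, this not only requires enormous cost and time but also makes frequent content updates practically impossible. Even with commercially available speech synthesis software, there are many speech synthesis methods, and choosing among them is extremely difficult. This approach, too, is a major obstacle to frequent content updates.
[0009]
Furthermore, not only the content creator but also the user is unable to give the voice conversion unit instructions about the quality or attributes of the synthesized speech provided. That is, the user can adjust the quality and attributes of the synthesized speech only to the extent possible with the controls of the user terminal, such as volume and tone.
[0010]
There is therefore a need for a content audio conversion providing system that lets content creators, users, and others substantially influence the quality and attributes of the synthesized speech produced from content, while keeping the workload and cost borne by content creators and others low.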
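[0011]
[Means for Solving the Problems]
To solve this problem, the present invention provides a content audio conversion providing system that converts content containing text data into voiced data and provides it, the system comprising: speech synthesis condition capturing means that captures arbitrary speech synthesis conditions for the conversion into voiced data in association with the content; and voice conversion means that converts the content to be provided into voiced data in accordance with the speech synthesis conditions captured by the speech synthesis condition capturing means and transmits the voiced data to the terminal that requested the content.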
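[0012]
[Embodiments of the Invention]
(A) First embodiment
A first embodiment of the content audio conversion providing system according to the present invention is described in detail below with reference to the drawings.
[0013]
(A-1) Configuration of the first embodiment
FIG. 1 is a block diagram showing the overall configuration of the content audio conversion providing system of the first embodiment.
[0014]
In FIG. 1, the content audio conversion providing system of the first embodiment has a registration unit 1, a voice conversion unit 2, and a Web server 3, with a user terminal 4 and a content creator terminal 5 as the devices that access the system. All of these are connected by a data network N. The data network N may be, for example, the Internet, a WAN (Wide Area Network), a LAN (Local Area Network), a VPN (Virtual Private Network), or a data bus inside a computer.
[0015]
For convenience, FIG. 1 shows the registration unit 1 and the voice conversion unit 2 as if they were in physically separate locations, but they may be integrated into a single unit. Likewise, the registration unit 1, the voice conversion unit 2, and the Web server 3 may be integrated.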
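[0016]
FIG. 2 is a block diagram showing the detailed configuration of the registration unit 1. As shown in FIG. 2, the registration unit 1 has a control unit 11, an access unit 12, a program storage unit 13, and an information storage unit 14.
[0017]
The control unit 11 controls the other units and performs computation, data transfer, and the like; it consists of, for example, a CPU. The access unit 12 handles data input and output between the data network N and the registration unit 1 and consists of, for example, a modem or an Ethernet (registered trademark) card. The program storage unit 13 stores the programs executed by the control unit 11 and consists of, for example, a hard disk, an optical disk, or semiconductor memory. The programs stored in the program storage unit 13 include, for example, programs and HTML files for building the screens displayed on the user terminal 4 and the content creator terminal 5, and a program that generates the content of requests to the voice conversion unit 2 (for example, a speech synthesis server). The information storage unit 14 stores the content information registered by the content creator in association with the speech synthesis conditions, and is used mainly in response to requests from the user terminal 4. The information storage unit 14 consists of, for example, a hard disk, an optical disk, or semiconductor memory.
[0018]
As noted above, the information storage unit 14 mainly holds content information and speech synthesis conditions. Content information is information related to the content or the content creator: for example, the URL (Uniform Resource Locator) or URI (Uniform Resource Identifier) of the content to be synthesized, a content attribute such as news or column, the registered name of the registered site, and the user ID and password used to verify the content creator. Speech synthesis conditions are the set of conditions for speech synthesis; they are mainly sent to the voice conversion unit 2 and used when the synthesized speech is generated. Specific examples are the speaker's gender, speaking speed, intonation, pitch, sound quality, volume, and the speech synthesis method to be used.
[0019]
For convenience, FIG. 2 shows each functional unit as a separate device, but in practice all of the functional units may be implemented on a single computer.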
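[0020]
Although its detailed configuration is not illustrated, the voice conversion unit 2 mainly has a speech synthesis function and a data transmission/reception function. Specifically, it consists of, for example, a processing device such as a CPU, a storage device such as an HDD or semiconductor memory, a network access device such as a modem or network card, and the programs that run on them.
[0021]
Although its detailed configuration is not illustrated, the Web server 3 mainly consists of a storage function and a data transmission/reception function. Specifically, it consists of, for example, a processing device such as a CPU, a storage device such as an HDD or semiconductor memory, a network access device such as a modem or network card, and the programs that run on them. The Web server 3 stores the text to be synthesized and so-called Web pages written in a language such as HTML. The content creator publishes his or her content on the network N by transferring it to the Web server 3, for example from the content creator terminal 5, or by entering it through an input function of the Web server 3 (including, for example, a recording-medium reader), and having it stored there.
[0022]
Although its detailed configuration is not illustrated, the user terminal 4 mainly consists of information output functions such as a display and a speaker and information input functions such as a keyboard and a microphone. Specific examples are a desktop personal computer, a notebook personal computer, a portable information terminal, a mobile phone, and a networked home appliance.
[0023]
The content creator terminal 5 has the same configuration as the user terminal 4. FIG. 1 shows the two separately for convenience, but no special function is required of the content creator terminal 5, and it is functionally no different from the user terminal 4. The content creator terminal 5 and the user terminal 4 may be one and the same device.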
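[0024]
(A-2) Operation of the first embodiment
The operation of the content audio conversion providing system of the first embodiment is described below, first the registration of speech synthesis conditions for content and then the provision of content as speech.
[0025]
FIG. 3 shows the transitions of the display screen on the content creator terminal 5, specifically the screen transitions from when the content creator accesses the registration unit 1 from the content creator terminal 5 until the speech synthesis conditions and other items are registered.
[0026]
First, when the content creator terminal 5 accesses the registration unit 1, the control unit 11 of the registration unit 1 responds by retrieving programs and data files from the program storage unit 13 and sending screen-generation data to the content creator terminal 5. This data is, for example, an HTML file.
[0027]
On receiving this data, the content creator terminal 5 displays the “content registration screen” SUR1 shown in FIG. 3(A). Of course, the system may be arranged so that the creator first registers as a user with the registration unit 1 on a screen not shown, and the content registration screen SUR1 is retrieved only after the registered user has been verified through a logon screen or the like.
[0028]
The content registration screen SUR1 prompts for information about the content to be synthesized (content information). Content information includes, for example, the location of the Web page (URL, URI, etc.), its registered name, attribute, keywords, a message from the content creator, and the creator's name and contact details. As for input methods, the URL and registered-name fields of SUR1 are text boxes into which characters can be typed freely, while the attribute field is a pull-down list that displays a menu when the ▼ mark in the figure is clicked. The input methods are not limited to these examples. SUR1 includes a “next” button (icon), which the content creator clicks after entering the content information.
[0029]
When the control unit 11 of the registration unit 1 is notified by the content creator terminal 5 that the “next” button on SUR1 has been clicked, it retrieves the data for displaying the “speech synthesis condition setting screen” SUR2 shown in FIG. 3(B) from the program storage unit 13 or elsewhere and sends it to the content creator terminal 5.
[0030]
The speech synthesis condition setting screen SUR2 prompts for the speech synthesis conditions needed for synthesis. Examples of such conditions are the speaker's gender and type, speaking speed, intonation, and pitch. The conditions can be presented to the content creator with, for example, mutually exclusive radio buttons, as used for the gender item on SUR2, or a graphical slide bar, as used for the speaking-speed item. Here too, the input methods are not limited to these examples.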
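[0031]
The speech synthesis condition setting screen SUR2 includes “return”, “preview”, and “register” buttons.
[0032]
The “return” button switches the display from SUR2 back to the content registration screen SUR1 described above.
[0033]
The “preview” button starts a process that lets the content creator listen to the speech actually output under the conditions that have been set. Because the creator can listen immediately after setting the conditions, it is easy to match the settings to the output speech, and the conditions can be set more appropriately. The means of starting and running the preview is not limited to the example in FIG. 3; it may be provided on another screen (at another time) or on another device. The content creator may set conditions and preview as many times as desired.
[0034]
The operation of each unit during a preview is described after the description of how content speech is provided to the user terminal.
[0035]
The “register” button on SUR2 is clicked by the content creator when the speech synthesis conditions have been set. When the control unit 11 of the registration unit 1 is notified by the content creator terminal 5 that the “register” button has been clicked, it stores the content information and the speech synthesis condition settings in the information storage unit 14 in association with each other, retrieves the data for displaying the “registration confirmation screen” SUR3 shown in FIG. 3(C) from the program storage unit 13 or elsewhere, and sends it to the content creator terminal 5.
[0036]
The screen layouts and screen transitions for registering content information and speech synthesis conditions are not limited to those described above. For example, the content registration screen SUR1 and the speech synthesis condition setting screen SUR2 may be placed on a single screen to reduce the number of button operations by the content creator. The registration confirmation screen SUR3 may also be omitted. The number and types of registered items of content information and of speech synthesis conditions are not limited to the examples above.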
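[0037]
FIG. 4 shows a first example of the system-wide processing flow when a user receives, as speech, content whose content information and speech synthesis conditions have been registered in the registration unit 1.
[0038]
The user of the content accesses the registration unit 1 from the user terminal 4. The control unit 11 of the registration unit 1 then checks the files registered in the information storage unit 14, uses what it finds to complete the data for the “audio site list” screen SUR4 (see FIG. 4), which contains a list of the content available as speech, and sends the data to the accessing user terminal 4.
[0039]
The user selects the content he or she wants to hear from the list presented on the audio site list screen SUR4. In the example of SUR4 in FIG. 4, the choices are presented as check boxes that allow multiple selection.
[0040]
Suppose that, with SUR4 displayed, the user selects “TTT News” and clicks the “start” button. The user terminal 4 then sends the registration unit 1 request data (selected-site data) indicating that “TTT News” has been selected (T1).
[0041]
On receiving this request data, the control unit 11 of the registration unit 1 refers to the information storage unit 14 and, from the data registered under the name “TTT News”, obtains the location of the content (here, a URL) and the speech synthesis conditions (here, the values for gender, speaking speed, and intonation). The control unit 11 accesses the content URL through the access unit 12 and obtains the data to be provided as speech (here, an HTML file) from the corresponding Web server 3 (T2).
[0042]
As necessary, the control unit 11 calls and runs a predetermined program from the program storage unit 13 and processes the obtained data to generate text data for speech synthesis. This processing includes, for example, deleting, replacing, changing, or adding HTML tags, and deleting, replacing, changing, or adding character strings according to conditional expressions. Naturally, no such processing is needed if the obtained data can be used as text data for speech synthesis as it is. The registration unit 1 sends at least the text data for speech synthesis and the previously read speech synthesis condition data to the voice conversion unit 2 (T3). The transmitted data also includes information identifying the user terminal 4.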
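The tag-removal step in paragraph [0042] can be sketched with Python's standard `html.parser` module. This is only a minimal illustration under assumptions: the function name `html_to_tts_text`, the set of tags treated as block boundaries, and the sample page are hypothetical, and the patent does not specify how the processing program is implemented.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}
    # Assumed set of tags that end a line of text to be read aloud.
    BLOCK = {"p", "div", "h1", "h2", "h3", "li", "tr", "br", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse whitespace inside each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).split("\n"))
        return "\n".join(line for line in lines if line)


def html_to_tts_text(html: str) -> str:
    """Remove HTML tags and return plain text for the synthesizer."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = ("<html><head><style>p{color:red}</style></head><body>"
          "<h1>TTT News</h1><p>Rain is expected <b>tomorrow</b>.</p></body></html>")
tts_text = html_to_tts_text(sample)
```

Inline tags such as `<b>` are dropped without breaking the sentence, while block tags start a new line so that a heading and the following paragraph are not run together.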
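[0043]
The voice conversion unit 2 uses the received data to produce synthesized speech (synthesized speech data) and sends it to the user terminal 4, with data other than the synthesized speech attached as necessary (T4). Such additional data is, for example, other sound data or screen display data, and may be data sent from the registration unit 1. In the example of FIG. 4, screen display data is attached, but the synthesized speech data alone may be sent. On receiving the data from the voice conversion unit 2, the user terminal 4 uses internal means (not shown) to present the received speech data to the user in audible form (SND1).
[0044]
In the example of FIG. 4, the attached screen display data is output to the screen at the same time (SUR5). The screen lists the items that make up “TTT News”, shades the name of the item currently being read out, and shows the total playback time together with the current playback position. It also shows a “previous” button for switching the speech output to the preceding item, a “next” button for switching to the following item, and a “stop” button for forcibly stopping the speech output.
[0045]
As described above, in the first embodiment the user terminal 4 only needs to send request data to the registration unit 1 in order to receive speech information. In other words, as far as receiving speech information is concerned, the user terminal 4 does not need to access the Web server 3.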
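The T1 to T4 exchange of the first example can be summarized in a short Python sketch. Everything here is an assumption made for illustration: the function `handle_listen_request`, the registry layout, and the stand-in callables are hypothetical, and a real system would fetch the page over the network and run an actual speech synthesizer.

```python
from typing import Callable, Dict, Tuple

# (content URL, speech synthesis conditions) keyed by registered name.
Registry = Dict[str, Tuple[str, dict]]


def handle_listen_request(
    registered_name: str,
    user_terminal: str,
    registry: Registry,
    fetch: Callable[[str], str],
    to_text: Callable[[str], str],
    synthesize_and_send: Callable[[str, dict, str], None],
) -> None:
    """Registration-unit side of steps T1 to T3 (T4 happens in the callback)."""
    url, conditions = registry[registered_name]            # lookup after T1
    html = fetch(url)                                      # T2: get the page
    text = to_text(html)                                   # build TTS input text
    synthesize_and_send(text, conditions, user_terminal)   # T3


# Stand-ins so that the sketch runs without a network or a TTS engine.
sent = []


def fake_fetch(url: str) -> str:
    return "<p>Rain is expected tomorrow.</p>"


def fake_to_text(html: str) -> str:
    return html.replace("<p>", "").replace("</p>", "")


def fake_voice_conversion_unit(text: str, conditions: dict, terminal: str) -> None:
    # T4: a real unit would synthesize audio and transmit it to `terminal`.
    sent.append((terminal, conditions["gender"], text))


registry = {"TTT News": ("http://www.example.com/news.html",
                         {"gender": "male", "speed": 6, "intonation": 4})}
handle_listen_request("TTT News", "terminal-4", registry,
                      fake_fetch, fake_to_text, fake_voice_conversion_unit)
```

Passing the fetch, text-processing, and synthesis steps in as callables mirrors the point made in the text that the user terminal talks only to the registration unit, while the other roles can be reassigned.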
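[0046]
Next, the processing flow when the “preview” button on the speech synthesis condition setting screen SUR2 is clicked is described briefly.
[0047]
When the “preview” button is clicked, the content creator terminal 5 notifies the registration unit 1 of the preview request and the speech synthesis conditions to be previewed. The control unit 11 of the registration unit 1 retrieves the information on the content to be previewed from the information storage unit 14, accesses the content URL through the access unit 12, and obtains the data to be provided as speech (here, an HTML file) from the corresponding Web server 3. As necessary, the control unit 11 then calls and runs a predetermined program from the program storage unit 13, processes the obtained data to generate text data for speech synthesis, and sends at least the text data for speech synthesis and the speech synthesis condition data to the voice conversion unit 2. The transmitted data also includes information identifying the content creator terminal 5. The voice conversion unit 2 uses the received data to produce synthesized speech (synthesized speech data) and sends it to the content creator terminal 5, with data other than the synthesized speech attached as necessary.
[0048]
In this way, the content creator can hear (preview) his or her own content as speech under the speech synthesis conditions he or she has set.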
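[0049]
In FIG. 4 (the first example), the registration unit 1 obtains from the Web server 3 the HTML file on which the speech provided to the user terminal 4 is based. Instead, the voice conversion unit 2 may obtain this HTML file from the Web server 3.
[0050]
FIG. 5 is an explanatory diagram showing the system-wide processing flow in this case (second example).
[0051]
The processing up to the point where the user terminal 4 sends the selected-site data (request data) to the registration unit 1 (T11) is the same as in the first example shown in FIG. 4.
[0052]
On receiving the selected-site data, the control unit 11 of the registration unit 1 refers to the information storage unit 14, obtains the content location (for example, a URL) and the speech synthesis conditions for the selected site, and sends them to the voice conversion unit 2 (T12). The transmitted data also includes information identifying the user terminal 4.
[0053]
The voice conversion unit 2 then accesses the content URL and obtains the data to be provided as speech (here, an HTML file) from the corresponding Web server 3 (T13).
[0054]
The voice conversion unit 2 then processes the obtained data as necessary (deleting, replacing, changing, or adding HTML tags, etc.) to generate text data for speech synthesis, produces synthesized speech (synthesized speech data) according to the received speech synthesis condition data, and sends it to the user terminal 4, with data other than the synthesized speech attached as necessary (T14). The operation of the user terminal 4 at this point is the same as in the first example.
[0055]
The division of roles among the registration unit 1, the voice conversion unit 2, and the Web server 3 is not limited to the first and second examples; other divisions are possible. The point is that speech data to be sent to the user terminal 4 can be produced from the data obtained from the Web server 3. For example, the speech synthesis conditions may be passed from the registration unit 1 to the voice conversion unit 2 by way of the Web server 3. In that case, the speech synthesis conditions are preferably given to the voice conversion unit 2 together with the HTML file from the Web server 3.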
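[0056]
(A-3) Effects of the first embodiment
According to the first embodiment, the content creator can set the conditions for converting the content to speech and can actually check the sound provided to users. The creator can therefore always know what speech is being provided and can freely change its attributes.
[0057]
In addition, the content creator does not have to perform the conversion to speech; the creator only sets the conditions under which the voice conversion unit on the system side produces the synthesized speech. The work is easy and places little burden on the creator, so support for speech does not hinder content updates.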
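[0058]
(B) Second embodiment
Next, a second embodiment of the content audio conversion providing system according to the present invention is described in detail with reference to the drawings.
[0059]
(B-1) Configuration of the second embodiment
The overall configuration of the content audio conversion providing system of the second embodiment can also be represented by FIG. 1 described above; its components include the registration unit 1, the voice conversion unit 2, the Web server 3, the user terminal 4, and the content creator terminal 5, connected through the data network N.
[0060]
The registration unit 1 differs from that of the first embodiment. As shown in FIG. 6, the registration unit 1 of the second embodiment has a control unit 11, an access unit 12, and a program storage unit 13, but no information storage unit 14. That is, the speech synthesis conditions set by the content creator are stored in another device (the Web server 3).
[0061]
Because the registration function differs from the first embodiment in this way, the functions of not only the registration unit 1 but also the voice conversion unit 2, the Web server 3, the user terminal 4, and the content creator terminal 5 differ from those of the first embodiment. These differences are made clear in the description of operation below.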
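[0062]
(B-2) Operation of the second embodiment
The operation of the content audio conversion providing system of the second embodiment is likewise described in the order of registering the speech synthesis conditions for content and then providing the content as speech.
[0063]
FIG. 7 shows the transitions of the display screen on the content creator terminal 5 in the second embodiment, specifically the screen transitions on the content creator terminal 5 during the series of processes that begins when the content creator accesses the registration unit 1 from the content creator terminal 5.
[0064]
The operation of the content creator terminal 5 and the registration unit 1 from when the content creator accesses the registration unit 1 until the “registration confirmation screen” SUR23 shown in FIG. 7(C) is displayed on the content creator terminal 5 is the same as in the first embodiment.
[0065]
In the second embodiment, the registration confirmation screen SUR23 includes a “next” button. When the content creator clicks it, the control unit 11 of the registration unit 1 calls and runs a program from the program storage unit 13 and, using the content information and speech synthesis conditions that the registration unit 1 has obtained so far (held in a buffer memory in the control unit 11), generates the text to be added to the content creator's Web page. The registration unit 1 sends data for displaying this text to the content creator terminal 5, which displays the “link condition display screen” SUR24. The text to be added to the Web page is generated, for example, by preparing a template in advance and inserting the entered speech synthesis condition settings and other values into it.
[0066]
By writing the text displayed on the link condition display screen SUR24 into his or her Web page or the like in the form of a link, the content creator can have the intended speech provided to users.
[0067]
In other words, in the second embodiment the content creator provides users with the intended speech by embedding the speech synthesis conditions directly in the information of the Web page.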
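The template-based generation of the link text described in paragraph [0065] might look like the following Python sketch. The endpoint `http://tts.example.com/speak`, the query parameter names, and the function `build_listen_link` are assumptions for illustration; the patent does not define the actual format of the link shown on screen SUR24.

```python
import html
from string import Template
from urllib.parse import urlencode

# Hypothetical template for the text added to the creator's Web page.
LINK_TEMPLATE = Template('<a href="$href">Listen</a>')


def build_listen_link(tts_endpoint: str, content_url: str, conditions: dict) -> str:
    """Insert the content location and the chosen conditions into the template."""
    query = urlencode({"url": content_url, **conditions})
    # Escape "&" and quotes so that the URL is valid inside an HTML attribute.
    href = html.escape(f"{tts_endpoint}?{query}", quote=True)
    return LINK_TEMPLATE.substitute(href=href)


link = build_listen_link(
    "http://tts.example.com/speak",
    "http://www.example.com/news.html",
    {"gender": "male", "speed": 6, "intonation": 4},
)
```

Because the conditions travel inside the link itself, the registration unit does not need to store them, which matches the absence of the information storage unit 14 in this embodiment.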
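[0068]
FIG. 8 shows the system-wide processing flow of the second embodiment when a user receives content as speech.
[0069]
In the first embodiment, the user terminal 4 sent its request to the registration unit 1 and did not access the Web server 3 directly. In the second embodiment, by contrast, the user terminal 4 accesses the Web server 3 and does not access the registration unit 1 directly. Moreover, in the second embodiment the registration unit 1 plays no part at the stage where content is provided to the user as speech. These are the points on which the two embodiments differ.
[0070]
Suppose that the user terminal 4 accesses the Web server 3 and the Web server 3 sends the user terminal 4 the site data (Web page) for displaying screen SUR25 (T21).
[0071]
Each “listen” button in the site data for screen SUR25 carries a link of the kind defined by description SUB21, which contains information such as the content information of the content available as speech and the speech synthesis conditions.
[0072]
When the user clicks one of the “listen” buttons, a request conforming to the link destination of that button (see description SUB21) is sent from the user terminal 4 to the voice conversion unit 2 (T22).
[0073]
On receiving this request, the voice conversion unit 2 obtains the data needed for conversion (for example, an HTML file) from the location specified by the content location information in the request (T23).
[0074]
The voice conversion unit 2 applies the speech synthesis conditions and other information in the request to the obtained data and sends the resulting speech data to the user terminal 4 (T24). On receiving this data, the user terminal outputs the speech, as in sound output SND1. A screen such as SUR26 may be displayed as necessary.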
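On the receiving side (T22 to T24), the voice conversion unit has to split the request into the content location and the speech synthesis conditions. The sketch below assumes the request arrives as a URL with query parameters; the parameter names, the default values, and the function `parse_listen_request` are hypothetical, since the format of description SUB21 is not given in the text.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical defaults used when a request omits a condition.
DEFAULTS = {"gender": "female", "speed": 5, "intonation": 5}


def parse_listen_request(request_url: str):
    """Split a request into the content location and the synthesis conditions."""
    params = parse_qs(urlsplit(request_url).query)
    if "url" not in params:
        raise ValueError("request does not name the content to convert")
    content_url = params["url"][0]
    conditions = dict(DEFAULTS)
    conditions["gender"] = params.get("gender", [DEFAULTS["gender"]])[0]
    for key in ("speed", "intonation"):
        if key in params:
            conditions[key] = int(params[key][0])
    return content_url, conditions


content_url, conditions = parse_listen_request(
    "http://tts.example.com/speak?url=http%3A%2F%2Fwww.example.com%2Fnews.html"
    "&gender=male&speed=6"
)
```

The unit would then fetch `content_url` (T23) and synthesize the text under `conditions` (T24).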
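[0075]
(B-3) Effects of the second embodiment
According to the second embodiment, the description output by the registration unit 1 is added to the content stored on the Web server 3. In addition to the effects of the first embodiment, this lets the user hear the content as speech simply by accessing the Web server 3 as before, without accessing a new location (for example, the registration unit 1).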
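[0076]
(C) Third embodiment
Next, a third embodiment of the content audio conversion providing system according to the present invention is described in detail with reference to the drawings.
[0077]
(C-1) Configuration of the third embodiment
The overall configuration of the content audio conversion providing system of the third embodiment can also be represented by FIG. 1 described above; its components include the registration unit 1, the voice conversion unit 2, the Web server 3, the user terminal 4, and the content creator terminal 5, connected through the data network N. The internal configuration of the registration unit 1 can be represented by FIG. 2, as in the first embodiment.
[0078]
However, the functions of the units differ from those of the embodiments already described, as is made clear in the description of operation below.
[0079]
In the third embodiment, the registration unit 1 also serves as the Web server 3 for the content provided as speech, so in this respect the Web server 3 in FIG. 1 can be omitted.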
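[0080]
(C-2) Operation of the third embodiment
FIG. 9 shows an example of the data exchange procedure for providing content to the user as speech.
[0081]
From the content creator terminal 5, the content creator sends the registration unit 1 content data that he or she has produced, as shown in FIG. 10, and site identification information describing the speech synthesis conditions and the like, as shown in FIG. 11 (T31). The registration unit 1 stores the received content data and site identification information in the information storage unit 14.
[0082]
When a user accesses content registered in the registration unit 1 from the user terminal 4 (for example, “http://www.xxxx.co.jp” in FIG. 11) (T32), the control unit 11 of the registration unit 1 calls the necessary program from the program storage unit 13 and reads the requested content data (FIG. 10) and site identification information (FIG. 11) from the information storage unit 14. Referring to the site identification information, the control unit 11 adds data for requesting provision as speech (request transmission means) 121 (see FIG. 12) at an appropriate place in the content data and sends the resulting data to the user terminal 4 (T33).
[0083]
As a result, the user terminal 4 displays a screen containing text data 120 and a “listen” button (request transmission means) 121, as shown in FIG. 12. Much as in the second embodiment, the information of the “listen” button 121 includes the destination to which the text data is to be sent and the speech synthesis conditions (gender “male”, speaking speed “6”, intonation “4”, sound quality “4”, volume “3”).
[0084]
When the user clicks the “listen” button 121, a conversion request containing at least the text data 120 and the speech synthesis conditions is sent to the voice conversion unit 2 (T34). On receiving the request from the user terminal 4, the voice conversion unit 2 generates voiced data accordingly and sends it to the user terminal 4 (T35). The user terminal 4 then outputs the desired content as speech.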
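[0085]
In FIG. 12, the “listen” button, which is a speech request button, corresponds to a single piece of content, but the button may instead act on content selected from among several items. FIG. 13 shows an example of the screen displayed on the user terminal 4 in this case (second display example).
[0086]
The screen in FIG. 13 has three news sections 131, 132, and 133 and a “listen to checked articles” button (request transmission means) 134. Each news section has a check box, so the user can tick the sections he or she wants to hear. FIG. 13 shows the state in which the user has selected news sections 131 and 133. If the user clicks the “listen to checked articles” button 134 at this point, at least the URLs of the pages containing the body text of the selected news sections 131 and 133 (see FIG. 14) and the speech synthesis conditions are sent to the voice conversion unit 2. News sections 131 and 133 are therefore output as speech.
[0087]
The content display format on the user terminal 4 may also be as shown in FIG. 15 instead of FIG. 12 or FIG. 13.
[0088]
The screen in FIG. 15 has, in addition to the news sections 131 to 133 and the “listen to checked articles” button (request transmission means) 134, a speech synthesis condition resetting screen 151. The initial state of the resetting screen 151 is the speech synthesis conditions set by the content creator. The user can not only select the news sections to hear but also set the speech synthesis conditions through the resetting screen 151. The speech synthesis conditions sent to the voice conversion unit 2 are those set on the resetting screen 151 at the moment the “listen to checked articles” button 134 is clicked.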
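The behavior of the resetting screen 151, where the creator's settings are the initial state and the user's changes are what is finally sent, can be sketched as a simple merge. The default values below follow the FIG. 12 example in paragraph [0083]; the function name and the idea of a creator-defined set of user-editable items are assumptions for illustration (the text only says that the settable attributes of creator and user may be the same or different).

```python
# Values from the FIG. 12 example (gender, speed, intonation, quality, volume).
CREATOR_DEFAULTS = {"gender": "male", "speed": 6, "intonation": 4,
                    "quality": 4, "volume": 3}

# Assumption: the items a user may change are decided on the creator side.
USER_EDITABLE = {"gender", "speed", "volume"}


def effective_conditions(creator: dict, user_choice: dict, editable: set) -> dict:
    """Start from the creator's settings and apply only permitted user changes."""
    merged = dict(creator)
    for key, value in user_choice.items():
        if key in editable and key in creator:
            merged[key] = value
    return merged


result = effective_conditions(
    CREATOR_DEFAULTS, {"speed": 8, "quality": 1}, USER_EDITABLE
)
```

Here the user's change to the speaking speed is accepted, while the attempt to change the sound quality is ignored because that item is not in the editable set.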
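[0089]
FIG. 15 shows a speech synthesis condition resetting screen 151 that uses radio buttons for selection, but a pull-down selection method as shown in FIG. 16 may be used instead.
[0090]
Needless to say, the methods of selecting content and of setting the speech synthesis conditions are not limited to those described above.
[0091]
(C-3) Effects of the third embodiment
According to the third embodiment, the speech intended by the content creator can be provided to users simply by registering very simple site identification information in the registration unit along with content data such as a Web page. Moreover, the provided speech can be changed very easily by changing the site identification information, without changing the content data.
[0092]
Furthermore, when display screens such as those in FIG. 15 or FIG. 16 are used, the voice conversion unit can compile statistics on the speech synthesis conditions contained in requests, which lets the content creator learn under what speech synthesis conditions users listened to the voiced data.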
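The statistics mentioned in paragraph [0092] could be gathered with a simple tally of the conditions found in each request. This is a minimal sketch; the request log and its field names are hypothetical, and the patent does not say which statistics are collected.

```python
from collections import Counter


def summarize(requests, key):
    """Count how often each value of one condition appears in the requests."""
    return Counter(req[key] for req in requests if key in req)


# Hypothetical request log kept by the voice conversion unit.
log = [
    {"gender": "male", "speed": 6},
    {"gender": "female", "speed": 6},
    {"gender": "female", "speed": 8},
]
gender_counts = summarize(log, "gender")
speed_counts = summarize(log, "speed")
```

A creator could use such counts, for example, to change the default speaking speed to the value users choose most often.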
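[0093]
In the third embodiment as well, the voice conversion unit generates the voiced data automatically, so the content creator does not have to convert large amounts of data to speech personally.
[0094]
(D) Other embodiments
Various modifications were mentioned in the descriptions of the embodiments above; further modifications such as the following are also possible.
[0095]
In the screen transition diagrams for setting the speech synthesis conditions in the embodiments above, the screens were divided by processing step for convenience, but needless to say everything may be placed on a single screen.
[0096]
The data transmission procedures and contents between components and the division of roles in data processing in the embodiments above are all examples, and the invention is not limited to them.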
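[0097]
In the present invention, the attributes for which speech synthesis conditions can be set may be chosen freely. The options available for the speech synthesis conditions mentioned in the embodiments above may also be increased or reduced. For example, for gender, a “robot (robot-like voice)” option may be provided in addition to “male” and “female”, and age may also be taken into account, as in “male in his 20s”, “male in his 30s”, and “male in his 40s”. As another example, the encoding rate of the speech (16 kbps, 32 kbps, etc.) may be made settable. As yet another example, for sound quality, the presence or absence of echo may be made settable.
[0098]
Where both the content creator (content provider side) and the user (content recipient side) can set speech synthesis conditions, as described for the third embodiment, the speech attributes that the content creator can set and those that the user can set may be the same or may differ.
[0099]
Furthermore, the embodiments above set speech synthesis conditions in common for one or more pieces of content, but different conditions may be made settable for different parts of a single piece of content, such as the title, the summary, and the body. The content parts for which the content creator can set conditions may also be distinguished from the parts for which the user can set conditions (the two may partly overlap).
[0100]
In the description of the third embodiment, the user sets the speech synthesis conditions at the time the content is provided, but the conditions may instead be settable in advance. For example, where a user registers keywords or the like and receives the matching articles from an e-mail magazine, the speech synthesis conditions may be set when the keywords are registered.
[0101]
Those who can set the speech synthesis conditions for speech output of content are not limited to content creators and users; they may also be, for example, a content administrator (such as a provider).
[0102]
Furthermore, when a content creator or content administrator sets the speech synthesis conditions, settings that switch automatically according to the type of user terminal may be permitted, such as a low encoding rate if the user terminal is a portable terminal and a high encoding rate for any other terminal.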
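The automatic switching in paragraph [0102] amounts to a lookup keyed on the terminal type. The sketch below uses the 16 kbps and 32 kbps figures given as examples in paragraph [0097]; the terminal type names and the function name are hypothetical, and how the terminal type is detected is left open by the text.

```python
# Rates follow the 16 kbps / 32 kbps examples in the text; the terminal
# type names are hypothetical.
LOW_RATE_KBPS = 16
HIGH_RATE_KBPS = 32
PORTABLE_TYPES = {"mobile_phone", "pda"}


def encoding_rate_kbps(terminal_type: str) -> int:
    """Pick a low rate for portable terminals and a high rate otherwise."""
    return LOW_RATE_KBPS if terminal_type in PORTABLE_TYPES else HIGH_RATE_KBPS


rate_for_phone = encoding_rate_kbps("mobile_phone")
rate_for_desktop = encoding_rate_kbps("desktop_pc")
```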
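[0103]
In the second and third embodiments, data (content, speech synthesis conditions, etc.) is given to the voice conversion unit after the “listen” button is clicked, but the “listen” button may be omitted and the user terminal may immediately pass data received from another device to the voice conversion unit.
[0104]
In the second embodiment too, the speech synthesis conditions may be displayed and the user allowed to modify (reset) them, as in the third embodiment.
[0105]
In the embodiments above, the registration unit captures the speech synthesis conditions without authenticating the content creator, but the conditions may instead be captured after the content creator has been authenticated.
[0106]
Needless to say, the features of the first to third embodiments may be combined wherever combination is possible.
[0107]
[Effects of the Invention]
As described above, the present invention can provide a content audio conversion providing system in which content creators and others can substantially influence the quality and attributes of the synthesized speech produced from content, and in which the workload and cost borne by content creators and others are kept low.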
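[Brief Description of the Drawings]
[FIG. 1] A block diagram showing the overall configuration of the content audio conversion providing system of the first embodiment.
[FIG. 2] A block diagram showing the detailed configuration of the registration unit of the first embodiment.
[FIG. 3] An explanatory diagram showing the transitions of the display screen on the content creator terminal when the speech synthesis conditions for content are set in the first embodiment.
[FIG. 4] An explanatory diagram showing a first example of system-wide processing when content is provided to the user as speech in the first embodiment.
[FIG. 5] An explanatory diagram showing a second example of system-wide processing when content is provided to the user as speech in the first embodiment.
[FIG. 6] A block diagram showing the detailed configuration of the registration unit of the second embodiment.
[FIG. 7] An explanatory diagram showing the transitions of the display screen on the content creator terminal when the speech synthesis conditions for content are set in the second embodiment.
[FIG. 8] An explanatory diagram showing an example of system-wide processing when content is provided to the user as speech in the second embodiment.
[FIG. 9] An explanatory diagram showing an example of system-wide data exchange in the third embodiment.
[FIG. 10] An explanatory diagram showing the content data used in the description of the third embodiment.
[FIG. 11] An explanatory diagram showing the speech synthesis conditions used in the description of the third embodiment.
[FIG. 12] An explanatory diagram showing a first display example that includes a speech request button for content in the third embodiment.
[FIG. 13] An explanatory diagram showing a second display example that includes a speech request button for content in the third embodiment.
[FIG. 14] An explanatory diagram showing a detailed example of a news section in FIG. 13.
[FIG. 15] An explanatory diagram showing a third display example that includes a speech request button for content in the third embodiment.
[FIG. 16] An explanatory diagram showing a fourth display example that includes a speech request button for content in the third embodiment.
[Explanation of Reference Numerals]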
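1: registration unit, 2: voice conversion unit, 3: Web server, 4: user terminal, 5: content creator terminal, 11: control unit, 12: access unit, 13: program storage unit, 14: information storage unit.
[0001]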
[Technical Field of the Invention]
The present invention relates to a content audio conversion providing system, for example, a system in which a voice conversion unit converts content containing text information into speech and provides it to a user.
[0002]
[Prior art]
In general, a technique for converting text data into speech data is called a text-to-speech (TTS) technique, and the speech output by TTS is called synthesized speech. JP-A-2001-282268 (hereinafter referred to as Document 1) and JP-A-2002-14952 (hereinafter referred to as Document 2) describe the idea of a server on a network that automatically obtains text information, performs TTS, converts the resulting synthesized speech into data, and distributes the data to a user terminal via the network. Note that, in this specification, a server that is placed on a network and performs TTS is referred to as a voice conversion unit.
[0003]
Document 1 describes a system in which text information is automatically obtained from a network, a voice file is generated by a voice conversion unit, and the voice file is distributed to a Web server or a telephone server registered in advance.
[0004]
Document 2 describes a system in which, if the user registers the URL of a Web page or the like with the voice conversion unit in advance, the voice conversion unit periodically acquires the registered Web page and, if any part has been updated, converts that part into speech and notifies the user.
[0005]
[Problems to be solved by the invention]
However, the conventional voice conversion unit is configured on the premise of autonomous operation, and no consideration is given to the content creator.
[0006]
For example, the content creator cannot participate in or check the synthesized speech after TTS at all, such as what kind of speech quality it has or which parts are read and how; the voice conversion unit converts the content to speech at its own discretion. This is a highly unsatisfactory situation for the content creator, and the desire to check the synthesized speech after TTS and to participate in the voice conversion is extremely natural for the content creator.
[0007]
Of course, if the content creator accesses the voice conversion unit and specifies his own content, he can check the synthesized voice provided to the user, but there is no way to correct the output sound.
[0008]
Currently, there is no way to avoid this other than for the content creator to create the audio for distribution personally. However, when the audio is recorded by a human speaker, creating it not only requires enormous cost and time but also makes frequent content updates practically difficult. Even when using commercially available speech synthesis software, there are many speech synthesis methods, and selection is extremely difficult. This method, too, is a major obstacle to frequent content updates.
[0009]
In addition, not only the content creator but also the user cannot instruct the voice conversion unit about the quality and attributes of the provided synthesized voice. That is, the user can adjust the quality and attributes of the synthesized speech provided by the speech conversion unit only to the extent that the synthesized speech can be changed by operating the controls such as the volume and sound quality at the user terminal.
[0010]
For this reason, there is a need for a content audio conversion providing system that enables content creators, users, and others to substantially influence the quality and attributes of the synthesized speech obtained by converting content into speech, and that can also reduce the workload and cost of content creators and others.
[0011]
[Means for Solving the Problems]
In order to solve such a problem, the present invention provides a content audio conversion providing system that converts content including text data into voiced data and provides it, the system comprising: speech synthesis condition capturing means for capturing arbitrary speech synthesis conditions for the conversion into voiced data in association with the content; and voice conversion means for converting the content to be provided into voiced data in accordance with the speech synthesis conditions captured by the speech synthesis condition capturing means and transmitting the voiced data to the terminal that requested the content.
[0012]
[Embodiments of the Invention]
(A) First embodiment
Hereinafter, a first embodiment of a content audio conversion providing system according to the present invention will be described in detail with reference to the drawings.
[0013]
(A-1) Configuration of First Embodiment
FIG. 1 is a block diagram showing the overall configuration of the content audio conversion providing system according to the first embodiment.
[0014]
In FIG. 1, the content audio conversion providing system according to the first embodiment includes a registration unit 1, a voice conversion unit 2, and a Web server 3, with a user terminal 4 and a content creator terminal 5 as the devices that access the system. These are all connected by a data network N. Here, the data network N corresponds to, for example, the Internet, a WAN (Wide Area Network), a LAN (Local Area Network), a VPN (Virtual Private Network), or a data bus inside a computer.
[0015]
In FIG. 1, for convenience, the registration unit 1 and the voice conversion unit 2 are illustrated as being in physically different places. However, the registration unit 1 and the voice conversion unit 2 may be integrated into a single unit. Similarly, the registration unit 1, the voice conversion unit 2, and the Web server 3 may be integrated.
[0016]
FIG. 2 is a block diagram illustrating a detailed configuration of the registration unit 1. The registration unit 1 has a control unit 11, an access unit 12, a program storage unit 13, and an information storage unit 14, as shown in FIG.
[0017]
The control unit 11 performs control, calculation, data transfer, and the like for each unit, and includes, for example, a CPU. The access unit 12 performs data input/output between the data network N and the registration unit 1, and is configured by, for example, a modem or an Ethernet (registered trademark) card. The program storage unit 13 stores the programs to be executed by the control unit 11, and is configured by, for example, a hard disk, an optical disk, or a semiconductor memory. The programs stored in the program storage unit 13 include, for example, programs and HTML files for forming the screens displayed on the user terminal 4 and the content creator terminal 5, and a program that generates the content of requests to the voice conversion unit 2 (for example, a speech synthesis server). The information storage unit 14 stores the content information registered by the content creator in association with the speech synthesis conditions, and is mainly used in response to a request from the user terminal 4. The information storage unit 14 includes, for example, a hard disk, an optical disk, or a semiconductor memory.
[0018]
As described above, the contents stored in the information storage unit 14 are mainly content information and speech synthesis conditions. The content information refers to information related to the content or the content creator: for example, the URL (Uniform Resource Locator) or URI (Uniform Resource Identifier) of the content to be subjected to speech synthesis, an attribute of the content such as news or column, the registered name of the registered site, and the user ID and password for confirming the content creator. The speech synthesis conditions refer to a group of conditions for speech synthesis; they are mainly transmitted to the voice conversion unit 2 and used at the stage of generating the synthesized speech. Specific examples are the gender of the speaker, the speaking speed, intonation, pitch, sound quality, volume, and the speech synthesis method to be used.
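The association between content information and speech synthesis conditions held in the information storage unit 14 can be pictured with a small data model. The following Python sketch is illustrative only: the class names, field names, and numeric scales are assumptions, the password is deliberately left out (a real system would keep credentials in a separate, hashed store), and the patent does not prescribe any storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class SynthesisConditions:
    """Speech synthesis conditions set by the content creator (scales assumed)."""
    gender: str = "female"      # e.g. "male" or "female"
    speed: int = 5              # speaking speed on an assumed 1-10 scale
    intonation: int = 5
    pitch: int = 5
    volume: int = 5
    method: str = "default"     # name of the speech synthesis method


@dataclass(frozen=True)
class ContentInfo:
    """Information about the content and its creator."""
    url: str                    # location of the content (URL or URI)
    registered_name: str        # name of the registered site
    attribute: str              # e.g. "news" or "column"
    creator_id: str             # user ID used to confirm the creator


@dataclass
class InformationStore:
    """Stand-in for the information storage unit 14."""
    _records: Dict[str, Tuple[ContentInfo, SynthesisConditions]] = field(
        default_factory=dict)

    def register(self, info: ContentInfo, cond: SynthesisConditions) -> None:
        # Store the content information together with its conditions.
        self._records[info.registered_name] = (info, cond)

    def lookup(self, name: str) -> Optional[Tuple[ContentInfo, SynthesisConditions]]:
        return self._records.get(name)


store = InformationStore()
store.register(
    ContentInfo("http://www.example.com/news.html", "TTT News", "news", "creator01"),
    SynthesisConditions(gender="male", speed=6, intonation=4),
)
info, cond = store.lookup("TTT News")
```

Keying the records by registered name reflects how a user request names a site and the registration unit then retrieves both the location and the conditions in one lookup.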
[0019]
In FIG. 2, each functional unit is shown as a separate device for convenience, but in practice, all the functional units may be realized by one computer.
[0020]
Although the detailed configuration is not shown, the voice conversion unit 2 mainly has a voice synthesis function unit and a data transmission / reception function unit. Specifically, for example, it is composed of an arithmetic device such as a CPU, a storage device such as an HDD or a semiconductor memory, a network access device such as a modem or a network card, and a program operating on these.
[0021]
Although illustration of the detailed configuration is omitted, the Web server 3 mainly includes a storage function unit and a data transmission/reception function unit. Specifically, for example, it is composed of an arithmetic device such as a CPU, a storage device such as an HDD or a semiconductor memory, a network access device such as a modem or a network card, and the programs operating on these. The Web server 3 stores the text body to be synthesized and so-called Web pages described in a language such as HTML. The content creator publishes the content that he or she has created on the network N by transferring it to the Web server 3, for example from the content creator terminal 5, or by entering it through an input function unit of the Web server 3 (including a recording medium reading function), and having it saved there.
[0022]
The user terminal 4 mainly includes an information output function such as a display and a speaker and an information input function such as a keyboard and a microphone, although the detailed configuration is not illustrated. Specifically, a desktop personal computer, a notebook personal computer, a portable information terminal, a mobile phone, an information home appliance, and the like are applicable.
[0023]
The content creator terminal 5 has the same configuration as the user terminal 4. In FIG. 1, the user terminal 4 and the content creator terminal 5 are shown separately for convenience. However, no special function is required of the content creator terminal 5, and it is functionally no different from the user terminal 4. The content creator terminal 5 and the user terminal 4 may be the same device.
[0024]
(A-2) Operation of the first embodiment
Hereinafter, the operation in the content audio conversion providing system of the first embodiment will be described in the order of the operation of registering the audio synthesis condition of the content and the operation of providing the audio conversion of the content.
[0025]
FIG. 3 shows the transition of the display screen on the content creator terminal 5. Specifically, it shows the screen transitions from when the content creator accesses the registration unit 1 from the content creator terminal 5 until the speech synthesis conditions and the like are registered.
[0026]
First, when the registration unit 1 is accessed from the content creator terminal 5, the control unit 11 of the registration unit 1 responds to the access by extracting a program or a data file from the program storage unit 13 and sending data for screen generation to the content creator terminal 5. The transmitted data is, for example, an HTML file.
[0027]
On the content creator terminal 5 that has received this data, the “content registration screen” SUR1 shown in FIG. 3(A) is displayed. Of course, the system may be arranged so that, before this, the creator registers as a user with the registration unit 1 using a screen (not shown), and the content registration screen SUR1 is retrieved only after the registered user has been confirmed using a logon screen or the like.
[0028]
The content registration screen SUR1 prompts the content creator to input information on the content to be subjected to speech synthesis (content information). The content information includes, for example, the location of the Web page (URL, URI, etc.), its registered name, attribute, keywords, a message from the content creator, and the name and contact information of the content creator. As input methods, for example, the URL input portion and the registered name input portion of the content registration screen SUR1 use a text box format in which characters can be freely entered, and the attribute input portion uses a pull-down method in which a menu list is displayed by clicking the ▼ mark in the figure. However, the input method is not limited to the above example. The content registration screen SUR1 includes a “next” button (icon), and the content creator clicks the “next” button when the input of the content information as described above is completed.
[0029]
When the control unit 11 of the registration unit 1 is notified by the content creator terminal 5 that the “next” button of the content registration screen SUR1 has been clicked, it extracts the data for displaying the “speech synthesis condition setting screen” SUR2 shown in FIG. 3(B) from the program storage unit 13 or the like and transmits it to the content creator terminal 5.
[0030]
The speech synthesis condition setting screen SUR2 is a screen that prompts the content creator to set the speech synthesis conditions necessary for speech synthesis. The speech synthesis conditions include, for example, the gender and type of the speaker, speaking speed, intonation, and pitch. As a method of presenting the conditions to the content creator, a mutually exclusive radio button format, as used for the gender item of the speech synthesis condition setting screen SUR2, or a graphical slide bar, as used for the speaking speed item, can be used. Again, the input method is not limited to this example.
[0031]
The speech synthesis condition setting screen SUR2 includes "return", "preview", and "register" buttons.
[0032]
The “return” button is a button for activating the return of the display screen from the speech synthesis condition setting screen SUR2 to the above-described content registration screen SUR1.
[0033]
The "preview" button is a button for starting the process that lets the content creator listen in advance to the sound actually output under the set speech synthesis conditions. Because the preview function allows the content creator to listen immediately after setting the conditions, the creator can easily relate each set value to the resulting output sound and can set the conditions more accurately. The means for starting and executing the preview of the actual output sound are not limited to the example shown in FIG. 3 and may be provided on another screen (at another opportunity) or on another device. The content creator may set the speech synthesis conditions and preview the result any number of times.
[0034]
The operation of each unit during the preview will be described later, after the description of the operation of providing the content to the user terminal by voice.
[0035]
The “register” button on the speech synthesis condition setting screen SUR2 is clicked by the content creator when the above-described setting of the speech synthesis conditions is completed. When the information indicating that the “register” button on the speech synthesis condition setting screen SUR2 has been clicked is given from the content creator terminal 5, the control unit 11 of the registration unit 1 associates the content information with the set values of the speech synthesis conditions and stores them in the information storage unit 14. It also extracts the data for displaying the “registration confirmation screen” SUR3 as shown in FIG. 3 and transmits it to the content creator terminal 5.
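The association of content information with speech synthesis conditions described above can be pictured as a keyed record. The following is only a minimal sketch under stated assumptions: the class names, field names, and the example URL are hypothetical and do not come from the specification, and a plain dictionary stands in for the information storage unit 14.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SynthesisConditions:
    """Speech synthesis conditions set on screen SUR2 (field names are illustrative)."""
    gender: str = "female"
    speed: int = 5
    intonation: int = 5


@dataclass(frozen=True)
class ContentInfo:
    """Content information entered on screen SUR1 (field names are illustrative)."""
    registered_name: str
    url: str
    attribute: str = ""


@dataclass
class InformationStore:
    """Stand-in for information storage unit 14 (an assumption, not the real design)."""
    _records: dict = field(default_factory=dict)

    def register(self, info: ContentInfo, conditions: SynthesisConditions) -> None:
        # Storing both objects under one key is what "associates" them.
        self._records[info.registered_name] = (info, conditions)

    def lookup(self, registered_name: str):
        return self._records[registered_name]

    def list_names(self):
        # Could feed a list such as the one on the voice site list screen.
        return sorted(self._records)


store = InformationStore()
store.register(
    ContentInfo("TTT News", "http://www.example.com/news.html", "news"),
    SynthesisConditions(gender="male", speed=6, intonation=4),
)
info, cond = store.lookup("TTT News")
print(info.url, cond.gender, cond.speed)
```

Keying the record by registered name mirrors how the later request for “TTT News” is resolved; a real system might instead key by an internal identifier.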
[0036]
The screen configuration and screen transition related to registration of content information and speech synthesis conditions are not limited to those described above. For example, the content registration screen SUR1 and the speech synthesis condition setting screen SUR2 described above may be arranged in one screen to reduce the number of button operations by the content creator. Further, the registration confirmation screen SUR3 may be omitted. The number and types of registered items of content information and registered items of speech synthesis conditions are not limited to the above examples.
[0037]
FIG. 4 shows a first example of the processing flow of the entire system when a user receives, by voice, content whose content information and speech synthesis conditions are registered in the registration unit 1.
[0038]
The user of the content accesses the registration unit 1 from the user terminal 4. At this time, the control unit 11 of the registration unit 1 detects the files registered in the information storage unit 14 and, based on the detected information, forms data for a “voice site list” screen SUR4 (see FIG. 4) that lists the contents that can be provided by voice, and transmits the data to the user terminal 4 that has accessed it.
[0039]
The user selects the desired content from the list presented on the voice site list screen SUR4. In the example of the voice site list screen SUR4 in FIG. 4, a plurality of selectable check box options are presented on the screen.
[0040]
Here, it is assumed that the user selects “TTT News” and clicks the “start” button while the voice site list screen SUR4 is displayed. At this time, request data (selected site data) indicating that “TTT News” has been selected is transmitted from the user terminal 4 to the registration unit 1 (T1).
[0041]
When the registration unit 1 receives the request data, the internal control unit 11 refers to the information storage unit 14 and extracts, from the data registered under the registered name “TTT News”, the location information of the content (here, the URL) and the speech synthesis conditions (here, the gender, speaking speed, and intonation values). The control unit 11 accesses the URL of the content through the access unit 12, and acquires the data to be provided by voice (here, an HTML file) from the corresponding Web server 3 (T2).
[0042]
The control unit 11 calls and executes a predetermined program from the program storage unit 13 as necessary, and processes the acquired data to be provided by voice to generate text data for speech synthesis. This processing corresponds to, for example, operations such as deletion, replacement, change, and addition of HTML tags, and deletion, replacement, change, and addition of character strings according to conditional expressions. Of course, when the acquired data can be used as it is as text data for speech synthesis, such processing need not be performed. The registration unit 1 transmits at least the text data for speech synthesis and the speech synthesis condition data read out earlier to the voice conversion unit 2 (T3). The transmission data at this time also includes information for specifying the user terminal 4.
[0043]
The voice conversion unit 2 forms a synthesized voice (synthesized voice data) using the received data, and transmits the synthesized voice data to the user terminal 4 with data other than the synthesized voice added as necessary (T4). The data other than the synthesized voice corresponds to, for example, other sound data, data for screen display, and the like, and these may be data transmitted from the registration unit 1. In the example of FIG. 4, data for screen display is added and transmitted, but only the synthesized voice data may be transmitted. The user terminal 4 that has received these data from the voice conversion unit 2 uses internal means (not shown) to output the received voice data in a form that the user can hear (SND1).
[0044]
In the example of FIG. 4, the added screen display data is output to the screen at the same time (SUR5). That is, the items constituting “TTT News” are displayed side by side, and the name of the item currently being output as audio is shaded. The screen also displays the total time of the audio output and time information indicating the current position in the output, together with a “previous” button for switching the audio output to the previous item, a “next” button for switching the audio output to the next item, a “stop” button for forcibly stopping the audio output, and the like.
[0045]
As described above, in the case of the first embodiment, it suffices to transmit request data from the user terminal 4 to the registration unit 1 in order to receive the audio information. In other words, as far as receiving the audio information is concerned, there is no need to access the Web server 3 from the user terminal 4.
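The exchange T1 to T4 of the first example can be summarized as a short pipeline. The sketch below is an assumption-laden illustration: every function name is hypothetical, the Web server, text processing, and speech synthesizer are replaced by plain callables, and the returned bytes are a placeholder rather than real audio.

```python
def run_first_example(registered_name, terminal_id, registry, fetch, to_text, synthesize):
    """Walk through T1..T4 of FIG. 4 with all collaborators passed in as callables."""
    url, conditions = registry[registered_name]   # T1: request resolved by the registration unit
    html = fetch(url)                             # T2: registration unit fetches from the Web server
    handoff = {                                   # T3: data handed to the voice conversion unit
        "text": to_text(html),
        "conditions": conditions,
        "terminal": terminal_id,
    }
    audio = synthesize(handoff["text"], handoff["conditions"])
    return {"to": handoff["terminal"], "audio": audio}  # T4: sent to the requesting terminal


# Stubs standing in for the Web server 3 and the voice conversion unit 2.
pages = {"http://www.example.com/news.html": "<p>Hello</p>"}
registry = {"TTT News": ("http://www.example.com/news.html", {"gender": "male", "speed": 6})}


def fake_synthesize(text, conditions):
    # Placeholder "audio": just records what would have been synthesized.
    return f"{conditions['gender']}:{conditions['speed']}:{text}".encode("utf-8")


result = run_first_example(
    "TTT News",
    "terminal-4",
    registry,
    fetch=pages.__getitem__,
    to_text=lambda html: html.replace("<p>", "").replace("</p>", ""),
    synthesize=fake_synthesize,
)
print(result)
```

In the second example of FIG. 5, the `fetch` and `to_text` steps would move into the voice conversion unit, while the data passed at the hand-off would shrink to the location information and the conditions.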
[0046]
Next, a brief description will be given of the flow of processing when the "preview" button on the above-described speech synthesis condition setting screen SUR2 is clicked.
[0047]
When the “preview” button is clicked, the content creator terminal 5 notifies the registration unit 1 of the preview request and the speech synthesis conditions for the preview. At this time, the control unit 11 of the registration unit 1 extracts the information of the content to be previewed from the information storage unit 14, accesses the URL of the content through the access unit 12, and acquires the data to be provided by voice (here, an HTML file) from the corresponding Web server 3. Then, the control unit 11 of the registration unit 1 calls and executes a predetermined program from the program storage unit 13 as needed, processes the acquired data to generate text data for speech synthesis, and transmits at least the text data for speech synthesis and the speech synthesis condition data to the voice conversion unit 2. The transmission data at this time also includes information for specifying the content creator terminal 5. The voice conversion unit 2 forms a synthesized voice (synthesized voice data) using the received data, and transmits the synthesized voice to the content creator terminal 5 with data other than the synthesized voice added as necessary.
[0048]
As described above, the content creator can receive the audio output (preview) of his or her own content under the speech synthesis conditions that the content creator has set.
[0049]
In the first example shown in FIG. 4, the registration unit 1 obtains from the Web server 3 the HTML file to be provided to the user terminal 4. Instead, the voice conversion unit 2 may obtain the HTML file to be provided to the user terminal 4 from the Web server 3.
[0050]
FIG. 5 is an explanatory diagram showing a processing flow (second example) of the entire system in this case.
[0051]
The process (T11) until the user terminal 4 transmits the selected site data (request data) to the registration unit 1 is the same as the case of the first example shown in FIG. 4 described above.
[0052]
When the registration unit 1 receives the selected site data, the internal control unit 11 refers to the information storage unit 14 to obtain the location information (for example, a URL) and the speech synthesis conditions of the content related to the selected site data, and transmits them to the voice conversion unit 2 (T12). The transmission data at this time also includes information for specifying the user terminal 4.
[0053]
Thereby, the voice conversion unit 2 accesses the URL of the content and acquires data (here, an HTML file) to be provided by voice from the corresponding Web server 3 (T13).
[0054]
After that, the voice conversion unit 2 processes the acquired data to be provided by voice as needed (deletion, replacement, change, addition, etc. of HTML tags) to generate text data for speech synthesis. It then forms a synthesized voice (synthesized voice data) according to the received speech synthesis condition data, and transmits it to the user terminal 4 with data other than the synthesized voice added as necessary (T14). The operation of the user terminal 4 at this time is the same as in the first example.
[0055]
The division of roles among the registration unit 1, the voice conversion unit 2, and the Web server 3 is not limited to the first and second examples, and may be another division. The point is that the audio data to be transmitted to the user terminal 4 can be formed based on the data acquired from the Web server 3. For example, the speech synthesis condition may be provided from the registration unit 1 to the voice conversion unit 2 via the Web server 3. In this case, it is preferable that the voice conversion unit 2 be given the speech synthesis condition together with the HTML file from the Web server 3.
[0056]
(A-3) Effects of the first embodiment
According to the first embodiment, the content creator can set the conditions relating to the audio conversion of the content by himself, and can actually confirm the sound provided to the user. Therefore, the content creator can always grasp the provided voice, and can freely change its attributes and the like.
[0057]
Also, the content creator does not need to perform the work of vocalizing the content. The content creator only sets the conditions under which the voice conversion unit on the system side converts the content into synthesized voice, so the burden on the content creator is small. Therefore, the conversion to voice does not hinder the updating of the content.
[0058]
(B) Second embodiment
Next, a second embodiment of a content audio conversion providing system according to the present invention will be described in detail with reference to the drawings.
[0059]
(B-1) Configuration of Second Embodiment
The content audio conversion providing system of the second embodiment can also be represented by the overall configuration shown in FIG. 1 described above. Its constituent elements are a registration unit 1, a voice conversion unit 2, a Web server 3, a user terminal 4, and a content creator terminal 5, which are connected via a data network N.
[0060]
The registration unit 1 is different from that of the first embodiment. As shown in FIG. 6, the registration unit 1 of the second embodiment has a control unit 11, an access unit 12, and a program storage unit 13, but is not provided with the information storage unit 14. That is, the speech synthesis conditions set by the content creator are stored in another device (the Web server 3).
[0061]
Because the registration function differs from that of the first embodiment in this way, the functions of not only the registration unit 1 but also the voice conversion unit 2, the Web server 3, the user terminal 4, and the content creator terminal 5 differ from those of the first embodiment. These differences will be clarified in the following description of the operation.
[0062]
(B-2) Operation of the second embodiment
The operation of the content audio conversion providing system according to the second embodiment will be described in the order of the operation of registering the audio synthesis condition of the content and the operation of providing the audio conversion of the content.
[0063]
FIG. 7 shows the transition of the display screen on the content creator terminal 5 in the second embodiment. Specifically, it shows the screen transition on the content creator terminal 5 in the series of processing that starts when the content creator accesses the registration unit 1 from the content creator terminal 5.
[0064]
The operation of the registration unit 1 from when the content creator accesses the registration unit 1 from the content creator terminal 5 until the “registration confirmation screen” SUR23 shown in FIG. 7 is displayed is the same as in the first embodiment.
[0065]
In the case of the second embodiment, the registration confirmation screen SUR23 includes a “Next” button. When the content creator clicks the “Next” button on the registration confirmation screen SUR23, the control unit 11 of the registration unit 1 calls and executes a program from the program storage unit 13 and, using the content information and the speech synthesis conditions acquired by the registration unit 1 (held in the buffer memory in the control unit 11), forms the description that the content creator is to add to the Web page. The registration unit 1 transmits data for displaying the formed description to the content creator terminal 5, and causes the content creator terminal 5 to display a “link condition display screen” SUR24. The description to be added to the Web page is formed, for example, by preparing a template in advance and inserting the input set values of the speech synthesis conditions into the template.
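Forming the description by inserting set values into a template can be sketched as below. This is a minimal sketch under assumptions: the anchor-tag template, the endpoint `http://tts.example.com/speak`, and the query parameter names (`url`, `gender`, `speed`, `intonation`) are hypothetical and are not taken from the specification or its figures.

```python
from html import escape
from string import Template
from urllib.parse import urlencode

# Hypothetical template for the description shown on the link condition display screen.
LINK_TEMPLATE = Template('<a href="$href">listen</a>')


def build_link_description(endpoint, content_url, conditions):
    """Insert the content location and the set condition values into the template."""
    query = urlencode({"url": content_url, **conditions})
    # Escape "&" and quotes so the URL is safe inside an HTML attribute.
    href = escape(f"{endpoint}?{query}", quote=True)
    return LINK_TEMPLATE.substitute(href=href)


link = build_link_description(
    "http://tts.example.com/speak",
    "http://www.example.com/news.html",
    {"gender": "male", "speed": 6, "intonation": 4},
)
print(link)
```

Carrying the conditions in the link itself is what lets the second embodiment do without the information storage unit 14: the Web page becomes the place where the conditions are kept.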
[0066]
The content creator can provide the user with the audio intended by the content creator by describing the content displayed on the link condition display screen SUR24, in a link format, on the Web page or the like that the content creator has created.
[0067]
That is, in the case of the second embodiment, the content creator can provide the user with the audio intended by the content creator by directly incorporating the speech synthesis conditions as the information of the Web page.
[0068]
FIG. 8 shows a flow of processing in the entire system of the second embodiment when a user receives content provided by voice.
[0069]
In the first embodiment, the user terminal 4 sends a request to the registration unit 1, and the user terminal 4 does not directly access the Web server 3. On the other hand, in the second embodiment, the user terminal 4 accesses the Web server 3 and does not directly access the registration unit 1. Further, in the case of the second embodiment, the registration unit 1 does not function at the stage of providing the content to the user by voice. These are the differences between the first embodiment and the second embodiment.
[0070]
It is assumed that the user terminal 4 accesses the Web server 3 and site data (Web page) for displaying the screen SUR25 is transmitted from the Web server 3 to the user terminal 4 (T21).
[0071]
Each "listen" button in the site data for forming the screen SUR25 is provided with a link defined by the description SUB21, which includes information such as the content information that can be provided by voice and the speech synthesis conditions.
[0072]
When the user clicks one of the “listen” buttons, a request according to the link destination description (see the description SUB21) related to the “listen” button is transmitted from the user terminal 4 to the voice conversion unit 2 (T22).
[0073]
Upon receiving this request, the voice conversion unit 2 acquires data (for example, an HTML file) necessary for voice conversion from the location specified by the content location information in the request (T23).
[0074]
The voice conversion unit 2 transmits voice data created by applying voice synthesis conditions in the request to the acquired data to the user terminal 4 (T24). The user terminal that has received this data outputs a sound like the sound output SND1. If necessary, a screen such as the screen SUR26 may be displayed.
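On the receiving side, the voice conversion unit has to separate the content location from the speech synthesis conditions carried in the request. The sketch below assumes, purely for illustration, that the request arrives as a URL with query parameters named `url`, `gender`, `speed`, and `intonation`; the actual format of the description SUB21 is not reproduced here.

```python
from urllib.parse import parse_qs, urlsplit


def parse_speech_request(request_url):
    """Split a request into (content location, speech synthesis conditions)."""
    query = parse_qs(urlsplit(request_url).query)
    # parse_qs returns a list per key; this sketch keeps the first value only.
    flat = {name: values[0] for name, values in query.items()}
    if "url" not in flat:
        raise ValueError("request does not carry the content location")
    content_url = flat.pop("url")
    return content_url, flat


content_url, conditions = parse_speech_request(
    "http://tts.example.com/speak"
    "?url=http%3A%2F%2Fwww.example.com%2Fnews.html&gender=male&speed=6&intonation=4"
)
print(content_url, conditions)
```

All condition values come back as strings, so a real implementation would also validate and convert them before passing them to the synthesizer.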
[0075]
(B-3) Effects of the second embodiment
According to the second embodiment, the description output by the registration unit 1 is added to the content stored in the Web server 3. Therefore, in addition to the effects of the first embodiment, the user can listen to the voiced content simply by accessing the conventional Web server 3, without accessing a new access location (for example, the registration unit 1).
[0076]
(C) Third embodiment
Next, a third embodiment of a content audio conversion providing system according to the present invention will be described in detail with reference to the drawings.
[0077]
(C-1) Configuration of Third Embodiment
The content audio conversion providing system according to the third embodiment can also be represented by the overall configuration shown in FIG. 1 described above. Its constituent elements are a registration unit 1, a voice conversion unit 2, a Web server 3, a user terminal 4, a content creator terminal 5, and the like, which are connected via a data network N. The internal configuration of the registration unit 1 can be represented by FIG. 2, as in the first embodiment.
[0078]
However, the function of each unit is different from that of the above-described embodiment, and will be clarified in the following operation description.
[0079]
In the case of the third embodiment, the registration unit 1 also has the function of the Web server 3 for the content provided by voice, so in this respect the Web server 3 in FIG. 1 can be omitted.
[0080]
(C-2) Operation of the third embodiment
FIG. 9 shows an example of a data transmission / reception procedure for providing content to a user by voice.
[0081]
The content creator inputs, from the content creator terminal 5, the content data as shown in FIG. 10 created by the content creator and the site identification information as shown in FIG. 11, and transmits them to the registration unit 1 (T31). The registration unit 1 stores the received content data and site identification information in the information storage unit 14.
[0082]
When the user accesses the content registered in the registration unit 1 from the user terminal 4 (for example, “http://www.xxx.co.jp” in FIG. 11) (T32), the control unit 11 of the registration unit 1 calls a necessary program from the program storage unit 13 and reads the content data (FIG. 10) and the site identification information (FIG. 11) requested by the user from the information storage unit 14. The control unit 11 refers to the site identification information, adds data (request transmission means) 121 (see FIG. 12) for requesting the provision by voice to an appropriate location of the content data, and transmits the resulting data to the user terminal 4 (T33).
[0083]
As a result, a screen including the text data 120 and the “listen” button (request transmission means) 121 is displayed on the user terminal 4 as shown in FIG. 12. As in the second embodiment, the information of the “listen” button 121 includes the destination of the text data and the speech synthesis conditions (gender “male”, speaking speed “6”, intonation “4”, sound quality “4”, volume “3”).
[0084]
When the user clicks the "listen" button 121, a voice conversion request including at least the text data 120 and the speech synthesis conditions is transmitted to the voice conversion unit 2 (T34). Upon receiving the request from the user terminal 4, the voice conversion unit 2 generates voice data according to the request and transmits the voice data to the user terminal 4 (T35). As a result, the desired content is output as audio from the user terminal 4.
[0085]
In FIG. 12, the “listen” button, which is the audio provision request button, corresponds to one content. However, the audio provision request button may instead be made to act on the contents selected from among a plurality of contents. FIG. 13 shows an example of a display screen (second display example) on the user terminal 4 in this case.
[0086]
The screen shown in FIG. 13 includes three news sections 131, 132, and 133, and a "listen to check article" button (request transmission means) 134. Each of the news sections 131, 132, and 133 is provided with a check box so that the user can check and select the news sections that he or she wants to hear. FIG. 13 shows a state where the user has selected the news sections 131 and 133. When the user clicks the “listen to check article” button (request transmission means) 134 at this stage, at least a page in which the text of the checked news sections 131 and 133 is described (see FIG. 14) is transmitted to the voice conversion unit 2. Therefore, the news sections 131 and 133 are output as voice.
[0087]
The display screen format of the content displayed on the user terminal 4 may be the one shown in FIG. 15 instead of the one shown in FIG. 12 or FIG.
[0088]
The screen shown in FIG. 15 includes a speech synthesis condition reset screen 151 in addition to the news sections 131 to 133 and the “listen to check article” button (request transmission means) 134. The initial state of the speech synthesis condition reset screen 151 is the speech synthesis conditions set by the content creator. The user can not only select the news sections that he or she wants to hear, but also set the speech synthesis conditions through operations on the speech synthesis condition reset screen 151. The speech synthesis conditions transmitted to the voice conversion unit 2 are those set on the speech synthesis condition reset screen 151 at the time the “listen to check article” button (request transmission means) 134 is clicked.
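The combination of checked news sections and user-adjusted conditions can be sketched as follows. This is an illustrative sketch under assumptions: the request is modelled as a plain dictionary, and the rule that a user may only change conditions the content creator has already defined is a design choice of this sketch, not something the specification requires.

```python
def build_listen_request(sections, checked_ids, creator_conditions, user_changes=None):
    """Form the request sent when the "listen to check article" button is clicked."""
    changes = user_changes or {}
    unknown = set(changes) - set(creator_conditions)
    if unknown:
        raise ValueError(f"unknown condition(s): {sorted(unknown)}")
    # The creator's settings are the initial state; user changes override them.
    conditions = {**creator_conditions, **changes}
    # Only the text of the checked sections is sent, in display order.
    text = "\n".join(s["text"] for s in sections if s["id"] in checked_ids)
    return {"text": text, "conditions": conditions}


sections = [
    {"id": 131, "text": "First article."},
    {"id": 132, "text": "Second article."},
    {"id": 133, "text": "Third article."},
]
creator_conditions = {"gender": "male", "speed": 6, "intonation": 4}

request = build_listen_request(sections, {131, 133}, creator_conditions, {"speed": 8})
print(request)
```

If the user touches nothing on the reset screen, `user_changes` is empty and the request carries exactly the conditions the content creator registered.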
[0089]
FIG. 15 shows the selection method using the radio button method as the speech synthesis condition resetting screen 151, but the selection method using the pull-down method as shown in FIG. 16 may be used.
[0090]
It should be noted that the method of selecting the content and the method of setting the voice synthesis condition are not limited to those described above.
[0091]
(C-3) Effects of the third embodiment
According to the third embodiment, by registering very simple site identification information in addition to content data such as a Web page in the registration unit, it is possible to provide a user with a sound intended by the content creator. Further, by changing the site identification information, the provided voice can be changed extremely easily without changing the content data.
[0092]
When a display screen as shown in FIG. 15 or FIG. 16 is applied, the voice conversion unit can collect statistics on the speech synthesis conditions included in the requests, so that the content creator can learn under what speech synthesis conditions users have listened to the vocalized data.
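Collecting statistics on the conditions contained in requests amounts to counting values per condition. The sketch below assumes each logged request is a dictionary with a `conditions` entry; that log format is hypothetical and is used only to illustrate the counting.

```python
from collections import Counter


def condition_statistics(requests):
    """Count how often each value of each speech synthesis condition was requested."""
    stats = {}
    for req in requests:
        for name, value in req["conditions"].items():
            stats.setdefault(name, Counter())[value] += 1
    return stats


logged_requests = [
    {"conditions": {"gender": "male", "speed": 6}},
    {"conditions": {"gender": "female", "speed": 6}},
    {"conditions": {"gender": "female", "speed": 8}},
]
stats = condition_statistics(logged_requests)
print(stats["gender"].most_common(1))
print(stats["speed"].most_common(1))
```

A content creator could use such counts to reconsider the initial values offered on the reset screen, for example when most users raise the speaking speed.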
[0093]
Furthermore, according to the third embodiment, since the voiced data is automatically generated by the voice conversion unit, the content creator does not need to vocalize a large amount of data by himself or herself.
[0094]
(D) Other embodiments
In the description of each of the above embodiments, various modified embodiments have been referred to, but further modified embodiments as exemplified below can be cited.
[0095]
In the screen transition diagram relating to the processing for setting the speech synthesis conditions in each of the above-described embodiments, the screen is divided for each processing step for convenience, but it is needless to say that all the screens may be included in one screen.
[0096]
In addition, in each of the above-described embodiments, the procedure for transmitting data between the components, the contents of transmission, and the sharing of roles related to data processing are all examples, and are not limited to those in the above-described embodiments.
[0097]
In the present invention, the attributes and the like for which speech synthesis conditions can be set may be chosen arbitrarily. For the speech synthesis conditions described in the above embodiments, the number of options that can be set may be increased or decreased. For example, as for gender, “robot (robot-like voice)” may be provided in addition to “male” and “female”, and options by age group such as “20s male”, “30s male”, and “40s male” may also be provided. Further, for example, the audio encoding speed (16 KBPS, 32 KBPS, etc.) may be made settable as a condition. Further, for example, the presence or absence of an echo may be made settable for the sound quality.
[0098]
Further, as described in the third embodiment, when both the content creator (content provider side) and the user (content recipient side) can set the speech synthesis conditions, the voice attributes that the content creator can set may be the same as, or different from, the voice attributes that the user can set.
[0099]
Further, in the above-described embodiments, the speech synthesis conditions are set in common for one or a plurality of contents. However, for one content, different speech synthesis conditions may be set for different parts, such as the title part, the summary part, and the content body. Further, the content parts for which the content creator can set the speech synthesis conditions may be distinguished from the content parts for which the user can set them (the two may overlap).
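Setting different conditions for different parts of one content can be expressed as common conditions plus per-part overrides. The following is a minimal sketch; the part names and condition names are illustrative assumptions, and the fallback to the common conditions for parts without an override is a choice made for this sketch.

```python
def conditions_for(part, common, per_part):
    """Return the conditions for one part: common values overridden by part-specific ones."""
    return {**common, **per_part.get(part, {})}


common = {"gender": "female", "speed": 5}
per_part = {
    "title": {"speed": 3},        # read titles more slowly
    "summary": {"gender": "male"},
}

for part in ("title", "summary", "body"):
    print(part, conditions_for(part, common, per_part))
```

Here the body has no override and is therefore read under the common conditions.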
[0100]
Furthermore, in the description of the third embodiment, the user can set the speech synthesis conditions at the time of receiving the content. However, the user may be able to set the speech synthesis conditions in advance. For example, when a user registers a keyword or the like and receives the matching articles from an e-mail magazine, the speech synthesis conditions may be set when the keyword or the like is registered.
[0101]
In addition, the person who can set the speech synthesis condition at the time of outputting the sound of the content may be not only the content creator and the user but also a content manager (for example, a provider).
[0102]
Furthermore, when the content creator or the content manager sets the speech synthesis conditions, a setting that switches automatically according to the type of the user terminal may be allowed, for example, a low encoding speed if the user terminal is a portable terminal and a high encoding speed otherwise.
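Switching the encoding speed by terminal type can be sketched as a simple selection rule. The values 16 and 32 are the example encoding speeds mentioned earlier in this description, but pairing them with particular terminal types, and the terminal type labels themselves, are assumptions of this sketch.

```python
LOW_KBPS = 16   # assumed choice for portable terminals
HIGH_KBPS = 32  # assumed choice for all other terminals


def encoding_speed_kbps(terminal_type):
    """Pick the encoding speed from the terminal type ("portable" is a hypothetical label)."""
    return LOW_KBPS if terminal_type == "portable" else HIGH_KBPS


def apply_terminal_rule(conditions, terminal_type):
    """Return a copy of the conditions with the encoding speed filled in automatically."""
    return {**conditions, "encoding_kbps": encoding_speed_kbps(terminal_type)}


print(apply_terminal_rule({"gender": "male", "speed": 6}, "portable"))
print(apply_terminal_rule({"gender": "male", "speed": 6}, "desktop"))
```

How the terminal type is detected is outside this sketch; the request would have to carry it, or the system would have to infer it by some other means.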
[0103]
Furthermore, in the above-described second and third embodiments, the case where the data (contents, speech synthesis conditions, and the like) are given to the voice conversion unit after the "listen" button is clicked has been described. Instead, the user terminal may immediately supply the data provided from another device to the voice conversion unit.
[0104]
Also, in the second embodiment, as in the third embodiment, the speech synthesis conditions may be displayed and correction (re-setting) by the user may be allowed.
[0105]
In each of the above embodiments, the registration unit captures the speech synthesis conditions without authenticating the content creator. However, the speech synthesis conditions may be captured after the content creator has been authenticated.
[0106]
The features of the first to third embodiments may be combined as long as the combination is possible.
[0107]
[Effects of the invention]
As described above, according to the present invention, it is possible to provide a content audio conversion providing system in which a content creator or the like can be substantially involved in the quality, attributes, and the like of the synthesized speech obtained by converting content into speech, and in which the work load and cost of the content creator or the like can be reduced.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating an overall configuration of a content audio conversion providing system according to a first embodiment.
FIG. 2 is a block diagram illustrating a detailed configuration of a registration unit according to the first embodiment.
FIG. 3 is an explanatory diagram showing transition of a display screen on a content creator terminal when setting a speech synthesis condition of content according to the first embodiment;
FIG. 4 is an explanatory diagram showing a first example of processing in the entire system when content is provided to a user by voice according to the first embodiment;
FIG. 5 is an explanatory diagram illustrating a second example of processing in the entire system when providing content to a user by voice according to the first embodiment;
FIG. 6 is a block diagram illustrating a detailed configuration of a registration unit according to the second embodiment.
FIG. 7 is an explanatory diagram showing transition of a display screen on a content creator terminal when setting a speech synthesis condition of content according to a second embodiment.
FIG. 8 is an explanatory diagram showing an example of processing in the entire system when content is provided to a user by voice in the second embodiment.
FIG. 9 is an explanatory diagram showing an example of data transmission and reception in the entire system according to the third embodiment.
FIG. 10 is an explanatory diagram showing content data used in the description of the third embodiment.
FIG. 11 is an explanatory diagram showing speech synthesis conditions used in the description of the third embodiment.
FIG. 12 is an explanatory diagram showing a first display example including a voice provision request button for content according to the third embodiment;
FIG. 13 is an explanatory diagram showing a second display example including a voice provision request button for content according to the third embodiment.
FIG. 14 is an explanatory diagram showing a detailed example of a news section in FIG. 13;
FIG. 15 is an explanatory diagram showing a third display example including a voice provision request button for content according to the third embodiment;
FIG. 16 is an explanatory diagram showing a fourth display example including a voice provision request button for content according to the third embodiment.
[Explanation of symbols]
1 registration unit
2 voice conversion unit
3 Web server
4 user terminal
5 content creator terminal
11 control unit
12 access unit
13 program storage unit
14 information storage unit

Claims

A content vocalization providing system that converts content including text data into vocalized data and provides it, the system comprising:
voice synthesis condition capturing means for capturing, in association with the content, an arbitrary voice synthesis condition for the conversion into vocalized data; and
vocalizing means for converting the content to be provided into vocalized data in accordance with the voice synthesis condition captured by the voice synthesis condition capturing means, and transmitting the vocalized data to the terminal requesting the content.
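The two means recited in the claim above can be pictured with a minimal sketch. It is illustrative only: the class and function names, the `voice` and `speed` fields, and the stubbed synthesis step are assumptions that do not appear in the specification, which speaks only of an arbitrary voice synthesis condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisCondition:
    # Hypothetical fields; the claim does not enumerate what a condition contains.
    voice: str = "female"
    speed: float = 1.0


class ConditionStore:
    """Stands in for the voice synthesis condition capturing means."""

    def __init__(self):
        self._by_content = {}

    def capture(self, content_id, condition):
        # Associate the captured condition with one piece of content.
        self._by_content[content_id] = condition

    def lookup(self, content_id):
        # Fall back to default conditions when nothing was registered.
        return self._by_content.get(content_id, SynthesisCondition())


def vocalize(text, condition):
    """Stands in for the vocalizing means; a real system would call a TTS engine here."""
    return {"voice": condition.voice, "speed": condition.speed, "spoken_text": text}


store = ConditionStore()
store.capture("news-001", SynthesisCondition(voice="male", speed=0.9))
voiced = vocalize("Today's headline.", store.lookup("news-001"))
```

In this sketch the returned dictionary merely records which conditions were applied; transmission to the requesting terminal is left out.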

The content vocalization providing system according to claim 1, wherein
the voice synthesis condition capturing means is provided in a device different from the content storage means that stores the content, and
the voice synthesis condition capturing means includes an information storage unit that stores the captured voice synthesis condition in association with identifying information and location information of the content.

The content vocalization providing system according to claim 2, wherein, in response to a request from the terminal requesting the content, the voice synthesis condition capturing means retrieves the corresponding content from the content storage means and supplies it, together with the internally stored voice synthesis condition, to the vocalizing means.

The content vocalization providing system according to claim 2, wherein
in response to a request from the terminal requesting the content, the voice synthesis condition capturing means supplies the identifying information and location information of the content, together with the voice synthesis condition, to the vocalizing means, and
the vocalizing means retrieves the corresponding content from the content storage means and converts it into vocalized data.
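The two preceding dependent claims differ only in which side retrieves the content. The following sketch contrasts the two flows under stated assumptions: the in-memory dictionaries, the example URL, the condition keys, and all function names are hypothetical stand-ins, and synthesis is stubbed.

```python
# Hypothetical content storage means (for example a Web server), keyed by location.
CONTENT_SERVER = {"http://example.com/news/1": "Rain is expected tomorrow."}

# Information storage unit on the separate registration device:
# content ID -> (location, voice synthesis conditions).
REGISTRY = {"news-1": ("http://example.com/news/1", {"voice": "female", "speed": 1.2})}


def fetch(location):
    """Stands in for retrieving content from the content storage means."""
    return CONTENT_SERVER[location]


def synthesize(text, conditions):
    """Placeholder for a TTS engine; it only records what would be spoken and how."""
    return {"spoken_text": text, **conditions}


def registration_side_fetch(content_id):
    # The capturing side retrieves the content and hands content plus conditions over.
    location, conditions = REGISTRY[content_id]
    return synthesize(fetch(location), conditions)


def vocalizer_side_fetch(content_id):
    # The capturing side hands over only ID, location and conditions;
    # the vocalizing side retrieves the content itself.
    location, conditions = REGISTRY[content_id]

    def vocalizer(cid, loc, cond):
        return synthesize(fetch(loc), cond)

    return vocalizer(content_id, location, conditions)
```

Both flows produce the same vocalized result; the choice affects only which device needs access to the content storage means.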

The content vocalization providing system according to claim 1, wherein
the voice synthesis condition capturing means is provided in a device different from the content storage means that stores the content, and
the voice synthesis condition capturing means puts the captured voice synthesis condition into a format that can be embedded in the corresponding content stored in the content storage means, and instructs that it be embedded in that content.

The content vocalization providing system according to claim 5, wherein the terminal requesting the content retrieves, from the content storage means, the content in which the voice synthesis condition is embedded, and supplies it to the vocalizing means for conversion into vocalized data.

The content vocalization providing system according to claim 6, wherein the content in which the voice synthesis condition is embedded includes data for a button icon that requests vocalization, and the terminal requesting the content, upon detecting a click on the button icon while displaying the content retrieved from the content storage means, supplies the content in which the voice synthesis condition is embedded to the vocalizing means.

The content vocalization providing system according to claim 7, wherein the terminal requesting the content retrieves a plurality of contents from the content storage means and supplies to the vocalizing means the content, with its embedded voice synthesis condition, that corresponds to the clicked button icon.
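The embedded-condition arrangement of the preceding four claims can be sketched as follows. The `<!--tts:...-->` comment format, the JSON encoding, and every function name are assumptions made for illustration; the claims require only that the condition be put into a form that can be embedded in the content.

```python
import json
import re

# Hypothetical embedding format: a leading comment carrying JSON-encoded conditions.
MARKER = re.compile(r"<!--tts:(.*?)-->", re.DOTALL)


def embed_conditions(body, conditions):
    """Put the conditions into an embeddable form and prefix the content with them."""
    return f"<!--tts:{json.dumps(conditions, sort_keys=True)}-->{body}"


def extract_conditions(content):
    """Split embedded content back into (conditions, body)."""
    match = MARKER.search(content)
    if match is None:
        return {}, content
    return json.loads(match.group(1)), MARKER.sub("", content, count=1)


def vocalizer(content):
    # Placeholder for the vocalizing means: reads the embedded conditions, stubs synthesis.
    conditions, body = extract_conditions(content)
    return {"spoken_text": body, **conditions}


def on_button_click(contents, clicked_id, vocalize):
    # The terminal holds several contents; only the one whose button was clicked is sent.
    return vocalize(contents[clicked_id])


page = {
    "sports": embed_conditions("The home team won.", {"voice": "male"}),
    "weather": embed_conditions("Sunny all day.", {"voice": "female"}),
}
result = on_button_click(page, "weather", vocalizer)
```

Because the conditions travel inside the content, the vocalizing side in this sketch needs no separate lookup against a registration device.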

The content vocalization providing system according to claim 1, wherein
the voice synthesis condition capturing means is provided in the device that stores the content, and
the voice synthesis condition capturing means includes an information storage unit that stores the captured voice synthesis condition in association with the corresponding content.

The content vocalization providing system according to claim 9, wherein the terminal requesting the content retrieves the content, together with the voice synthesis condition, from the voice synthesis condition capturing means and supplies them to the vocalizing means for conversion into vocalized data.

The content vocalization providing system according to claim 10, wherein the content includes activation display data for requesting vocalization, and the terminal requesting the content, when the activation display data is validly operated while the content retrieved from the content storage means is displayed, supplies the voice synthesis condition and the content to the vocalizing means.

The content vocalization providing system according to claim 11, wherein the terminal requesting the content retrieves a plurality of contents from the voice synthesis condition capturing means and supplies to the vocalizing means, together with its voice synthesis condition, the content associated with the validly operated activation display data.

The content vocalization providing system according to any one of claims 9 to 12, wherein the terminal requesting the content displays the voice synthesis condition supplied from the voice synthesis condition capturing means and accepts correction input for it.
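As a rough illustration of the last group of claims, where conditions are stored on the same device as the content and the requesting terminal may correct them before vocalization, consider the sketch below. The storage layout, the condition keys, and the function names are hypothetical, and synthesis is again stubbed.

```python
# Hypothetical content device: each content is stored together with its conditions.
CONTENT_DEVICE = {
    "column-7": {
        "text": "Welcome back.",
        "conditions": {"voice": "female", "speed": 1.0},
    },
}


def retrieve(content_id):
    """Return the content together with its voice synthesis conditions."""
    entry = CONTENT_DEVICE[content_id]
    # Copy so that terminal-side corrections do not alter the stored conditions.
    return entry["text"], dict(entry["conditions"])


def apply_corrections(conditions, corrections):
    # Corrected values override the stored ones; untouched keys are kept.
    return {**conditions, **corrections}


def synthesize(text, conditions):
    """Placeholder for the vocalizing means."""
    return {"spoken_text": text, **conditions}


text, shown = retrieve("column-7")            # conditions displayed on the terminal
final = apply_corrections(shown, {"speed": 1.5})  # user's correction input
voiced = synthesize(text, final)
```

The correction here applies only to the single request; whether a corrected condition is written back to the information storage unit is not something this sketch takes a position on.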