JP2015049309A

JP2015049309A - Information processing device, speech speed data generation method and program

Info

Publication number: JP2015049309A
Application number: JP2013179783A
Authority: JP
Inventors: 典昭阿瀬見; Noriaki Asemi
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2013-08-30
Filing date: 2013-08-30
Publication date: 2015-03-16

Abstract

PROBLEM TO BE SOLVED: To enable a speech speed to be adjusted so that the content of an utterance in synthesized speech can be easily understood. SOLUTION: In speech speed data generation processing, speech text data TD is acquired (S110), and the acquired speech text data TD is morphologically analyzed (S120). A familiarity is identified for each morpheme (word) found by the morphological analysis, based on word familiarity data stored in an information processing server 10 (S130). Speech speed data is then generated so that the lower the familiarity of a word, the larger the proportion of the time required to read out the entire information that is taken up by the time required to read out that word (S170, S200).

Description

The present invention relates to an information processing apparatus, a speech speed data generation method, and a program that generate speech speed data representing the utterance time of speech output in time with video.

Conventionally, in content that includes video, such as movies and television programs, synthesized speech generated by speech synthesis is output in time with the video.
As a device that outputs synthesized speech in time with video, an information processing apparatus has been proposed that determines an expansion/contraction ratio for the speech so that the utterance duration of the synthesized speech matches the program broadcast time, and converts the speech speed of the synthesized speech based on that ratio (see Patent Document 1).

Patent Document 1: JP 2012-078755 A

When the speech speed is converted by the information processing apparatus described in Patent Document 1, some of the words in a sentence uttered as synthesized speech take a long time to utter, while other words take only a short time.

A person who hears a word uttered in a shorter time than usual may fail to catch that word, and may therefore find it difficult to understand the utterance as a whole.

In other words, the conventional technique has the problem that the speech speed of synthesized speech cannot be adjusted so that the content of the utterance is easy to understand.
Therefore, an object of the present invention is to make it possible to adjust the speech speed of synthesized speech so that the content of the utterance is easy to understand.

The present invention, made to achieve the above object, is an information processing apparatus comprising text acquisition means, analysis means, familiarity acquisition means, and speech speed determination means.
In the present invention, the text acquisition means acquires text data that represents, as a character string, the content of information output as speech in time with video. The analysis means analyzes the text data acquired by the text acquisition means and identifies each word contained in the character string represented by the text data.

The familiarity acquisition means then acquires, from a familiarity database, the familiarity corresponding to each word identified by the analysis means. The familiarity database here is a database that stores familiarity information, and the familiarity information is information in which each word is associated in advance with a familiarity representing the degree to which that word is recognized.

Furthermore, the speech speed determination means generates speech speed data in which the utterance time of each word is adjusted so that the lower the familiarity acquired by the familiarity acquisition means for a word, the larger the proportion of the utterance time of the entire information represented by the text data that is taken up by the utterance time of that word. The speech speed data here is data representing the utterance time of the synthesized speech output by speech synthesis, and specifically the utterance time of each phoneme constituting the character string of the information represented by the text data.
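As a rough illustration of this rule, the sketch below lengthens each word by a factor that grows as its familiarity falls and then compares each word's share of the total utterance time before and after. The 7-point familiarity scale, the linear stretch factor, the `alpha` parameter, and the function names are assumptions made for illustration; the text above does not specify how the adjustment is computed.

```python
def stretch_by_familiarity(base_durations, familiarities,
                           max_familiarity=7.0, alpha=0.5):
    """Lengthen each word by a factor that grows as its familiarity falls.

    The linear factor and the 7-point scale are assumed, not from the source.
    """
    stretched = []
    for duration, familiarity in zip(base_durations, familiarities):
        factor = 1.0 + alpha * (max_familiarity - familiarity) / max_familiarity
        stretched.append(duration * factor)
    return stretched


def shares(durations):
    """Return each word's fraction of the total utterance time."""
    total = sum(durations)
    return [d / total for d in durations]


base = [0.4, 0.4, 0.4]          # seconds per word before adjustment (made up)
familiarity = [6.5, 2.0, 6.5]   # the middle word is the least familiar
adjusted = stretch_by_familiarity(base, familiarity)
before, after = shares(base), shares(adjusted)
print(before[1], after[1])      # the unfamiliar word's share grows
```

Because every word is stretched by a different factor, the least familiar word ends up with a larger fraction of the total, which is the property the paragraph above describes.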

That is, when speech output in time with video contains a word with a low degree of recognition (that is, a low familiarity) and the time spent uttering that word is short, a person who hears the speech may be unable to recognize the content of the information it conveys.

In the information processing apparatus of the present invention, however, the speech speed data is generated so that the lower the familiarity of a word, the larger the proportion of the total utterance time of the information that is taken up by the utterance time of that word.

If the output speed of the synthesized speech is determined based on such speech speed data, the proportion of the total utterance time of the information spent uttering low-familiarity words can be increased in that synthesized speech.

As a result, a person listening to the synthesized speech can more easily catch even low-familiarity words and can recognize the entire content of the information conveyed by the utterance.
In other words, the information processing apparatus of the present invention can adjust the speech speed of synthesized speech so that the content of the utterance is easy to understand.

Note that the utterance time here represents the time required for utterance, and is taken to include speed (speech speed).
The information processing apparatus of the present invention may further comprise word specifying means that specifies, from among the words identified by the analysis means, important words, that is, words corresponding to an important part of speech defined in advance as a part of speech of high importance.

In this case, the speech speed determination means may generate the speech speed data so that the utterance time of the vowels contained in the important words specified by the word specifying means becomes longer.
With this information processing apparatus, speech speed data can be generated so that the utterance time of important words becomes longer.
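The vowel lengthening described above can be sketched as follows. This is only an illustrative sketch: the romanized phoneme symbols, the multiplicative `vowel_gain`, and the function name `lengthen_important_vowels` are assumptions, and the text does not state by how much the vowels are lengthened.

```python
VOWELS = {"a", "i", "u", "e", "o"}  # assumed romanized vowel symbols


def lengthen_important_vowels(phonemes, durations, is_important, vowel_gain=1.3):
    """Return new phoneme durations with vowels lengthened for important words.

    Consonants, and all phonemes of non-important words, are left unchanged.
    vowel_gain is an assumed tuning value.
    """
    if not is_important:
        return list(durations)
    return [d * vowel_gain if p in VOWELS else d
            for p, d in zip(phonemes, durations)]


# Romanized phonemes of a hypothetical important word, with made-up durations.
phonemes = ["h", "a", "r", "e"]
durations = [0.05, 0.10, 0.05, 0.10]
longer = lengthen_important_vowels(phonemes, durations, is_important=True)
unchanged = lengthen_important_vowels(phonemes, durations, is_important=False)
print(longer, unchanged)
```

Stretching only the vowels lengthens the word as a whole while leaving the consonant durations as they were.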

Synthesized speech whose speech speed is adjusted based on speech speed data generated in this way makes the important parts of speech easier to catch, and so makes the content of the utterance easier to understand.

Furthermore, the word specifying means may treat at least one of nouns and verbs as the important part of speech, and specify the words corresponding to each important part of speech as important words.
In information output as speech, nouns and verbs carry a large weight.

For this reason, in the present invention, at least one of nouns and verbs may be treated as the important part of speech, and the words corresponding to each important part of speech may be specified as important words.
With such an information processing apparatus, speech speed data can be generated so that the utterance time of at least one of nouns and verbs becomes longer.

Synthesized speech whose speech speed is adjusted based on speech speed data generated in this way makes at least one of nouns and verbs easier to catch.
In the information processing apparatus of the present invention, updating means may also update the familiarity associated with a word identified by the analysis means, in the familiarity information stored in the familiarity database, so that the degree of recognition of that word becomes higher.

When a word contained in the text data is output as synthesized speech, the degree to which a person who hears that speech recognizes the word can be expected to improve.
For this reason, in the information processing apparatus of the present invention, the updating means may update the familiarity associated with a word in the familiarity information stored in the familiarity database so that the degree of recognition of the words contained in the text data becomes higher.

With such an information processing apparatus, the familiarity information can be kept up to date at all times, and more appropriate speech speed data can be generated.
Furthermore, in the present invention, a plurality of pieces of text data may exist for one video. In this case, the text acquisition means may acquire the pieces of text data sequentially, and each time the text acquisition means acquires a piece of text data, the analysis means may analyze it in the order in which the text data progresses in time with the video.

The updating means may then update the familiarity associated with a word in the familiarity information so that the more often a word identified by the analysis means appears as the text data progresses in time, the higher its familiarity becomes in accordance with the amount of progress through the text data.
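One way to picture this count-based update is the sketch below, which raises a word's familiarity in proportion to how many times it has appeared in the text data processed so far. The additive `step`, the upper bound `max_familiarity`, and the function name `updated_familiarity` are hypothetical choices; the text above gives no concrete update formula.

```python
from collections import Counter


def updated_familiarity(baseline, words_so_far, step=0.1, max_familiarity=7.0):
    """Return a new familiarity table raised according to word counts so far.

    baseline: word -> familiarity before the video started.
    words_so_far: every word seen in the text data processed up to now.
    Words absent from the baseline table are skipped in this sketch.
    step and max_familiarity are assumed tuning values.
    """
    counts = Counter(words_so_far)
    updated = dict(baseline)  # leave the caller's table untouched
    for word, count in counts.items():
        if word in updated:
            updated[word] = min(max_familiarity, updated[word] + step * count)
    return updated


baseline = {"robot": 3.0, "tomorrow": 6.0}   # made-up values
seen = ["robot", "tomorrow", "robot", "robot"]
table = updated_familiarity(baseline, seen)
print(table)
```

Recomputing from the baseline each time, rather than incrementing in place, avoids counting the same occurrence twice when the function is called again after more text data has been processed.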

With such an information processing apparatus, the more often a word appears over the course of the video, the higher its familiarity can be made, and speech speed data suited to that video can be generated.
The information processing apparatus of the present invention may also comprise speech synthesis means that, based on the speech speed data generated by the speech speed determination means, synthesizes and outputs speech so that the utterance time of each phoneme constituting each word matches the utterance time represented by the speech speed data.

With such an information processing apparatus, synthesized speech whose speech speed has been adjusted so that the content of the utterance is easy to understand can be output.
In the present invention, each piece of text data may include a required utterance time, defined in advance as the length of time that can be spent uttering the character string represented by that text data.

In this case, the speech speed determination means may generate, as the speech speed data, data normalized so that the utterance time of the entire information represented by the text data is kept at the required utterance time.
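A minimal sketch of such a normalization, assuming it is done by uniformly rescaling the adjusted per-phoneme durations, is shown below. The uniform scaling and the function name `normalize_to_required_time` are assumptions for illustration; the text only states that the total is kept at the required utterance time.

```python
def normalize_to_required_time(durations, required_time):
    """Rescale durations so that they sum to required_time.

    Uniform scaling keeps the ratios between phonemes, so any extra share
    given to particular words survives the normalization.
    """
    total = sum(durations)
    if total <= 0:
        raise ValueError("durations must have a positive sum")
    scale = required_time / total
    return [d * scale for d in durations]


# Made-up adjusted phoneme durations that overshoot a 2.0 s time slot.
adjusted = [0.3, 0.9, 0.6, 0.6]
normalized = normalize_to_required_time(adjusted, required_time=2.0)
print(normalized, sum(normalized))
```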

With such an information processing apparatus, the length of time taken to utter the content of the information is not changed, so the speech can be uttered at appropriate timing as the video progresses.
The present invention may also be implemented as a speech speed data generation method for generating speech speed data.

The speech speed data generation method of the present invention is characterized by comprising: a text acquisition step of acquiring text data; an analysis step of identifying each word contained in the character string represented by the text data; a familiarity acquisition step of acquiring the familiarity corresponding to each word identified in the analysis step; and a speech speed determination step of generating speech speed data in which the utterance time of each word is adjusted so that the lower the familiarity acquired in the familiarity acquisition step for a word, the larger the proportion of the utterance time of the entire information represented by the text data that is taken up by the utterance time of that word.

Such a speech speed data generation method provides the same effects as the information processing apparatus according to claim 1.
The present invention may also be implemented as a program executed by a computer.

The program of the present invention is characterized by causing a computer to execute: a text acquisition procedure of acquiring text data; an analysis procedure of identifying each word contained in the character string represented by the text data; a familiarity acquisition procedure of acquiring the familiarity corresponding to each identified word; and a speech speed determination procedure of generating speech speed data in which the utterance time of each word is adjusted so that the lower the acquired familiarity of a word, the larger the proportion of the utterance time of the entire information represented by the text data that is taken up by the utterance time of that word.

If the present invention is implemented as such a program, it can be used by loading it from a recording medium into a computer and starting it as needed, or by having the computer obtain it over a communication line and start it as needed. By causing a computer to execute each procedure, that computer can be made to function as the information processing apparatus according to claim 1.

The recording medium here includes, for example, computer-readable electronic media such as a DVD-ROM, a CD-ROM, and a hard disk.

FIG. 1 is a block diagram showing the schematic configuration of an information processing apparatus to which the present invention is applied and of its surroundings.
FIG. 2 is an explanatory diagram illustrating the structure of the speech text data.
FIG. 3 is a flowchart showing the procedure of the speech speed data generation processing.
FIG. 4 is an explanatory diagram illustrating the information generated in the course of the speech speed data generation processing.
FIG. 5 is a flowchart showing the procedure of the familiarity update processing.

Embodiments of the present invention will be described below with reference to the drawings.
<Content viewing system>
The content viewing system 1 shown in FIG. 1 is a system with which a user views content prepared in advance, and comprises an information processing server 10 and at least one information processing apparatus 30.
<Information processing server>
The information processing server 10 is a server that stores various data, and comprises a communication unit 12, a control unit 14, and a storage unit 22.

The various data stored in the information processing server 10 include at least: content data CD containing the video and audio to be output; sound source data SV containing at least speech feature quantities of speech input in advance; and word familiarity data DD in which a familiarity representing the degree of recognition of each word is associated with that word.

The communication unit 12 performs communication between the information processing server 10 and the outside via a communication network. The communication network in this embodiment is, for example, a public wireless communication network or a network line.
The control unit 14 is built around a well-known computer having at least a ROM 16, a RAM 18, and a CPU 20, and controls the communication unit 12 and the storage unit 22.

The ROM 16 stores processing programs and data whose contents must be retained even when the power is turned off. The RAM 18 temporarily stores processing programs and data. The CPU 20 executes various processes according to the processing programs stored in the ROM 16 and the RAM 18.

The storage unit 22 is a non-volatile storage device configured so that its contents can be read and written, for example a hard disk device or flash memory. The storage unit 22 stores the content data CD, the sound source data SV, and the word familiarity data DD.

Of these, the content data CD is data prepared in advance for each piece of content.
The content here is a production in which at least images (video) and audio are output along a time axis. Movies and television programs are examples of such productions.

The content data CD includes video data IM, speech audio data SD, and speech text data TD. The symbol "m" in FIG. 1 denotes the number of pieces of content data CD.

The video data IM is data consisting of the plurality of images that make up the video (moving image) output in the content.
The speech audio data SD is audio data output in time with the video represented by the video data IM, for example dialogue or narration spoken in time with the video. The speech audio data SD in this embodiment may be prepared for each line of dialogue or narration in the video, or for each unit section defined in advance along the time axis of the video.

The speech text data TD is text data representing the content of the speech output in time with the video represented by the video data IM. As shown in FIG. 2, the speech text data TD includes cast information, caption information, and timing information.

Of these, the caption information is the caption (text) output in time with the video. A caption represents the content of dialogue, narration, or the like as a character string. The language of the captions in this embodiment is Japanese.

The cast information identifies the person who should read out each caption, and is defined for each caption. The cast information may be information that identifies the person as such, or information that represents characteristics of the person, such as gender and age.

The timing information specifies, for each caption, a start timing at which output of the caption represented by the caption information begins and an end timing at which that output ends. These start and end timings are associated with the progress of time in the video data IM.

The timing information further includes a required utterance time, defined as the length of time that can be spent reading out the entire character string represented by the caption information included in the speech text data TD.
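The fields described above could be held in a record like the one sketched below. The class name, field names, and the use of seconds as the time unit are hypothetical; FIG. 2 defines the actual layout, which is not reproduced here. The example values, including the required utterance time, are made up.

```python
from dataclasses import dataclass


@dataclass
class SpeechTextData:
    """One piece of speech text data TD (hypothetical field names)."""
    cast: str                       # cast information: who reads the caption
    caption: str                    # caption information: the text to be spoken
    start_time: float               # start timing on the video time axis (s)
    end_time: float                 # end timing on the video time axis (s)
    required_utterance_time: float  # time available for reading the caption (s)


td = SpeechTextData(
    cast="female, 20s",
    caption="明日は晴れですね",
    start_time=12.0,
    end_time=14.5,
    required_utterance_time=2.5,
)
print(td)
```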

The speech text data TD in this embodiment is prepared for each caption output in time with the video.
The sound source data SV is data in which speech parameters and tag data are associated with each other for each sound source. The symbol "n" in FIG. 1 denotes the number of pieces of sound source data SV.

The speech parameters are at least one feature quantity representing the waveform of a sound produced by a person. These feature quantities are speech features used for so-called formant synthesis, and are prepared for each speaker and for each phoneme. The feature quantities in the speech parameters include at least the fundamental frequency F0, the mel-frequency cepstrum (MFCC), the phoneme length, and the power of each phoneme in the uttered speech, together with their time differences.

The tag data is data representing the nature of the sound represented by the speech parameters, and includes at least speaker feature data representing the characteristics of the speaker. The speaker feature data includes, for example, the gender and age of the speaker.

The tag data may further include expression data representing the speaker's expression when the speech was uttered. The expression data represents "expression" as a concept that covers at least feelings, emotions, scenes, and situations, and may include information needed to estimate the speaker's expression.

The sound source data SV, in which these speech parameters and tag data are associated, may for example be generated by a well-known karaoke apparatus executing predefined processing when a song is sung with that apparatus, and then registered in the storage unit 22.

The word familiarity data DD is data in which each word is associated in advance with a familiarity representing the degree to which that word is recognized. The familiarity here takes a larger value the higher the degree of recognition. The word familiarity data DD is thus an example of the familiarity information recited in the claims.

The word familiarity data DD in this embodiment may store the degree of recognition of each word for each user. In this embodiment, the storage unit 22 storing the word familiarity data DD functions as the familiarity database.
<Information processing apparatus>
The information processing apparatus 30 comprises a communication unit 31, an input reception unit 32, a display unit 33, a sound input unit 34, a sound output unit 35, a storage unit 36, and a control unit 40.

The information processing apparatus 30 in this embodiment may be, for example, a well-known portable terminal, or a well-known information processing apparatus such as a personal computer. Portable terminals include well-known electronic book readers and portable information terminals such as mobile phones and tablet terminals.

The communication unit 31 exchanges information with the outside via a communication network. The input reception unit 32 receives information entered via an input device (not shown). The display unit 33 displays images based on signals from the control unit 40.

The sound input unit 34 is a device, for example a microphone, that converts sound into an electrical signal and inputs it to the control unit 40. The sound output unit 35 is a well-known device that outputs sound and comprises, for example, a PCM sound source and a speaker. The storage unit 36 is a non-volatile storage device configured so that its contents can be read and written, and stores various processing programs and various data.

The control unit 40 is built around a well-known computer having at least a ROM 41, a RAM 42, and a CPU 43. The ROM 41 stores processing programs and data whose contents must be retained even when the power is turned off. The RAM 42 temporarily stores processing programs and data. The CPU 43 executes various processes according to the processing programs stored in the ROM 41 and the RAM 42.

That is, based on the content data CD corresponding to designated content, the information processing apparatus 30 displays the video of the designated content on the display unit 33 and outputs audio from the sound output unit 35 in step with the time axis of the video. The designated content here is the content designated by the information received by the input reception unit 32.

When outputting the audio of the designated content, the information processing apparatus 30 synthesizes speech from the Japanese captions (text) represented by the speech text data TD, using the sound source data SV stored in the information processing server 10, and outputs the synthesized speech. In other words, the information processing apparatus 30 of this embodiment is configured to be able to perform voice dubbing.

The ROM 41 of the information processing apparatus 30 stores a processing program with which the control unit 40 executes speech speed data generation processing, which generates speech speed data representing the utterance time of the synthesized speech output by speech synthesis.
<Speech speed data generation processing>
The speech speed data generation processing executed by the control unit 40 of the information processing apparatus 30 starts when a start command is input.

In this speech speed data generation processing, as shown in FIG. 3, once started, the control unit 40 first acquires the Japanese speech text data TD of the designated content (S110). The control unit 40 then performs morphological analysis on the text represented by the speech text data TD acquired in S110 and derives morpheme information (S120). A well-known technique (for example, "MeCab") may be used for the morphological analysis in S120.

The morpheme information includes the morpheme mo(k), the morpheme phoneme count ph_nu(k), the phoneme ph(k, j), and the part-of-speech flag pa(k).
Of these, mo(k) denotes each morpheme mo contained in the text represented by the speech text data TD. The symbol "k" is an index number identifying each morpheme mo contained in the text, and is assigned in order along the time axis of the speech text data TD.

The phonemes ph(k, j) are the phonemes that make up each morpheme mo(k). The index "j" identifies each phoneme within a morpheme mo(k) and is assigned along the time axis of the text.

The morpheme phoneme count ph_nu(k) is the number of phonemes ph that make up each morpheme mo(k).
The part-of-speech flag pa(k) indicates whether the part of speech of each morpheme mo(k) (word) is a noun or a verb. The flag is set to "1" if the part of speech is a noun or a verb, and to "0" otherwise.
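As an illustration only, the morpheme information described above could be held in a structure like the following Python sketch. The class name `MorphemeInfo`, the helper `make_morpheme_info`, the romanized phoneme labels, and the sample segmentation are assumptions made for this example; they are not taken from the embodiment or from FIG. 4.

```python
from dataclasses import dataclass
from typing import List

# Parts of speech for which pa(k) is set to 1 (labels are hypothetical).
NOUN_OR_VERB = {"noun", "verb"}


@dataclass
class MorphemeInfo:
    """Hypothetical container for the morpheme information of one morpheme."""
    mo: str        # morpheme mo(k), surface form
    ph: List[str]  # phonemes ph(k, j), in utterance order
    pa: int        # part-of-speech flag pa(k): 1 for noun or verb, else 0

    @property
    def ph_nu(self) -> int:
        """Morpheme phoneme count ph_nu(k)."""
        return len(self.ph)


def make_morpheme_info(surface: str, phonemes: List[str], pos: str) -> MorphemeInfo:
    """Build the record and derive pa(k) from a part-of-speech label."""
    return MorphemeInfo(mo=surface, ph=list(phonemes),
                        pa=1 if pos in NOUN_OR_VERB else 0)


# Hand-written analysis result; the segmentation is illustrative only.
morphemes = [
    make_morpheme_info("明日", ["a", "sh", "i", "t", "a"], "noun"),
    make_morpheme_info("は", ["w", "a"], "particle"),
]
```

The index k corresponds to the position in the `morphemes` list, and j to the position within each `ph` list.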

For example, when the text represented by the speech text data CD is "明日は晴れですね" ("It will be sunny tomorrow, won't it"), morphological analysis of that text derives morpheme information including the morphemes mo(k) and phonemes ph(k, j) shown in FIG. 4.

Next, in the speech speed data generation process, the control unit 40 acquires, from the storage unit 22 of the information processing server 10, the familiarity corresponding to each morpheme (word) mo(k) contained in the morpheme information derived in S120 (S130).

Subsequently, the control unit 40 determines whether each phoneme ph(k, j) is a vowel and sets a vowel flag vw(k, j) (S140). Specifically, as shown in FIG. 4, in S140 the vowel flag vw(k, j) is set to "1" if the phoneme ph(k, j) of a morpheme mo(k) is a vowel, and to "0" if it is a consonant.
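The vowel determination of S140 may be sketched as follows, assuming that phonemes are given as romanized labels and that the five labels in `VOWELS` are the vowels. Both the representation and the function name `vowel_flags` are assumptions for illustration.

```python
from typing import List

# Assumed romanized vowel labels; the embodiment does not fix a notation.
VOWELS = {"a", "i", "u", "e", "o"}


def vowel_flags(phonemes: List[str]) -> List[int]:
    """Return vw(k, j) for one morpheme: 1 for a vowel, 0 for a consonant."""
    return [1 if p in VOWELS else 0 for p in phonemes]
```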

The control unit 40 then sets the initial value of the phoneme length ratio Ph_lr(k, j) (S150). The phoneme length ratio Ph_lr(k, j) is the proportion of the time length needed to read out the entire text represented by the speech text data CD (the utterance time length) that is taken up by reading out each phoneme ph(k, j).

Specifically, in S150 of the present embodiment, the initial value of the phoneme length ratio Ph_lr(k, j) is set to "1" if the phoneme ph(k, j) is a vowel, and to a "specified value p" if it is a consonant. The specified value p is defined in advance and is greater than "0" and less than "1".
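A minimal sketch of the initialization in S150 follows. The function name `initial_ratios` and the sample value 0.5 for the specified value p are hypothetical; the embodiment only requires that p be greater than 0 and less than 1.

```python
from typing import List


def initial_ratios(vw: List[int], p: float = 0.5) -> List[float]:
    """Initial Ph_lr(k, j): 1 for vowels (vw == 1), p for consonants."""
    if not 0.0 < p < 1.0:
        raise ValueError("the specified value p must satisfy 0 < p < 1")
    return [1.0 if flag == 1 else p for flag in vw]
```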

Subsequently, based on the part-of-speech flags contained in the morpheme information, the control unit 40 identifies important words among the morphemes mo(k) (words) derived in S120 (S160). An important word is a word whose part of speech is one of the important parts of speech, which are defined in advance as parts of speech of high importance. In the present embodiment, the important parts of speech include verbs and nouns.

The control unit 40 then updates the phoneme length ratio Ph_lr(k, j) of each phoneme ph(k, j) that corresponds to a vowel, among the phonemes ph(k, j) making up each morpheme mo(k) identified as an important word in S160 (S170). The update in S170 is performed according to equation (1), and lengthens only the phoneme length ratio Ph_lr(k, j) of the phonemes ph(k, j) corresponding to vowels contained in important words. Here, α in equation (1) is a constant defined in advance.

That is, in S170 of the present embodiment, the time length for uttering a phoneme ph(k, j) whose part-of-speech flag pa(k) is "1" and whose vowel flag vw(k, j) is "1" is multiplied by "1 + α/100".
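The update of S170 may be sketched as below, using the multiplication by 1 + α/100 stated in the description. The function name `lengthen_important_vowels` and the sample value of α used in the usage note are assumptions; the embodiment states only that α is a constant defined in advance.

```python
from typing import List


def lengthen_important_vowels(ratios: List[float], vw: List[int],
                              pa: int, alpha: float) -> List[float]:
    """Multiply Ph_lr(k, j) by (1 + alpha/100) where pa(k) == 1 and vw(k, j) == 1.

    `ratios` and `vw` belong to a single morpheme mo(k); `pa` is its
    part-of-speech flag. All other ratios are returned unchanged.
    """
    if pa != 1:
        return list(ratios)
    factor = 1.0 + alpha / 100.0
    return [r * factor if flag == 1 else r for r, flag in zip(ratios, vw)]
```

For instance, with a hypothetical α of 20, the vowels of a noun or verb are lengthened by 20 percent while its consonants and all phonemes of other words keep their initial ratios.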

Further, the control unit 40 first acquires the familiarity of each morpheme mo(k) from the information processing server 10 and calculates a normalized familiarity nr_fa(k) based on the acquired familiarity (S180). The normalized familiarity nr_fa(k) is the familiarity of each morpheme mo(k) normalized so that, across the morphemes mo(k), the mean of the familiarity is "1" and the variance is "1".

In S180, the control unit 40 further calculates a magnification β(k) according to equation (2), and corrects the phoneme length ratio Ph_lr(k, j) of the vowels contained in each morpheme according to equation (3).

That is, through S180, the phoneme length ratio Ph_lr(k, j) of the vowels of a word with low familiarity is corrected so that the time needed to read out that word takes up a larger share of the time needed to read out the entire information.
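The normalization in S180 may be sketched as follows. Shifting and scaling by the mean and the population standard deviation is one way to obtain a mean of 1 and a variance of 1, and is an assumption here. Equations (2) and (3) are not reproduced in this text, so the sketch takes the magnification β(k) as an input and assumes that the correction multiplies the vowel ratios by β(k); neither the formula for β(k) nor the exact form of the correction is taken from the source.

```python
import statistics
from typing import List


def normalize_familiarity(fa: List[float]) -> List[float]:
    """Rescale familiarities so that their mean is 1 and population variance is 1."""
    mean = statistics.fmean(fa)
    std = statistics.pstdev(fa)
    if std == 0.0:
        # All words equally familiar: nothing to rescale (assumed fallback).
        return [1.0 for _ in fa]
    return [(x - mean) / std + 1.0 for x in fa]


def correct_vowel_ratios(ratios: List[float], vw: List[int],
                         beta: float) -> List[float]:
    """Assumed form of the correction: scale vowel ratios of one morpheme by beta(k)."""
    return [r * beta if flag == 1 else r for r, flag in zip(ratios, vw)]
```

A caller would compute β(k) from nr_fa(k) with whatever rule equation (2) prescribes, choosing larger values for less familiar words, and pass it to `correct_vowel_ratios`.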

Subsequently, the control unit 40 derives the phoneme time length Ph_le(k, j) of each phoneme ph(k, j) so that the utterance time of the entire text represented by the speech text data CD is kept at the required utterance time (S190).

Specifically, in S190 of the present embodiment, the phoneme time length Ph_le(k, j) of each phoneme ph(k, j) is derived according to equation (4).
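Equation (4) itself does not survive in this text. From the description of its denominator, of "tol", and of "N", it can plausibly be reconstructed as follows; the typeset form and the summation index are assumptions, with the sum running over all N phonemes of the text.

```latex
\mathrm{Ph\_le}(k, j) \;=\; \mathrm{tol} \times
  \frac{\mathrm{Ph\_lr}(k, j)}
       {\displaystyle\sum_{n=1}^{N} \mathrm{Ph\_lr}_{n}}
```

Here \mathrm{Ph\_lr}_{n} denotes the phoneme length ratio of the n-th phoneme when all phonemes ph(k, j) are enumerated in utterance order.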

The denominator in equation (4) is the sum of the phoneme length ratios Ph_lr(k, j) of all phonemes ph(k, j) contained in the speech text data CD. The symbol "tol" in equation (4) is the required utterance time, and the symbol "N" is the number of phonemes ph contained in the speech text data CD.

That is, the phoneme time lengths Ph_le(k, j) are normalized so that the total time taken to read out the subtitles represented by the speech text data CD is kept at the required utterance time of that speech text data CD.
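The normalization of S190 amounts to scaling each phoneme length ratio by the required utterance time divided by the sum of all ratios. A minimal sketch follows, in which the function name `phoneme_durations`, the flat list of ratios, and the use of seconds as the unit are assumptions.

```python
from typing import List


def phoneme_durations(ratios: List[float], tol: float) -> List[float]:
    """Ph_le for every phoneme, so that the durations sum to tol.

    `ratios` lists Ph_lr(k, j) for all phonemes of the text in utterance order.
    """
    total = sum(ratios)
    if total <= 0.0:
        raise ValueError("the sum of the phoneme length ratios must be positive")
    return [tol * r / total for r in ratios]
```

Because every duration is divided by the same total, lengthening the vowels of some words automatically shortens the remaining phonemes, and the overall utterance time stays at tol.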

Subsequently, the control unit 40 generates speech speed data in which the phoneme time lengths Ph_le(k, j) derived in S190 are defined as data representing the timing at which each phoneme ph(k, j) of each morpheme mo(k) is read out (S200).
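One possible layout of the speech speed data is a list of start times obtained by accumulating the phoneme time lengths in utterance order. The embodiment does not specify the data layout, so the following is a sketch under that assumption; the function name `phoneme_start_times` is hypothetical.

```python
from itertools import accumulate
from typing import List


def phoneme_start_times(durations: List[float], start: float = 0.0) -> List[float]:
    """Start time of each phoneme, given Ph_le in utterance order."""
    # accumulate(..., initial=start) yields start, start + d0, start + d0 + d1, ...
    boundaries = list(accumulate(durations, initial=start))
    return boundaries[:-1]  # drop the end time of the last phoneme
```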

Further, based on each item of casting information contained in the speech text data CD acquired in S110, the control unit 40 acquires the sound source data SV that best matches each item of casting information (S210).

The control unit 40 then uses the sound source data SV acquired in S210 to synthesize speech for the content of the subtitle information contained in the speech text data CD acquired in S110 (S220). In S220 of the present embodiment, the read-out timing (speed) of each phoneme making up the text represented by the subtitle information is determined based on the speech speed data generated in S200.

In S220 of the present embodiment, the control unit 40 also outputs a control signal to the sound output unit 35 and outputs the synthesized speech generated by the speech synthesis from the sound output unit 35.
The speech speed data generation process then ends. The process is started again in time with the output of the next video data IM along the time axis, and acquires the next speech text data TD along the time axis of that video data IM (S110). S120 to S220 are then executed.

In short, the speech speed data generation process of the present embodiment acquires the speech text data TD of the designated content and performs morphological analysis on it. Then, based on the word familiarity data stored in the information processing server 10, it identifies the familiarity of each morpheme (word) identified by the morphological analysis.

Further, the speech speed data generation process generates the speech speed data so that the lower the familiarity of a word, the larger the share of the time needed to read out the entire information that is taken up by reading out that word.
<Familiarity update process>
Next, the familiarity update process executed by the control unit 14 of the information processing server 10 is described.

The familiarity update process is started in time with the start of the speech speed data generation process.
As shown in FIG. 5, when the familiarity update process starts, the control unit 14 first acquires the Japanese speech text data CD that was acquired in the speech speed data generation process (S310).

Subsequently, the control unit 14 performs morphological analysis on the text represented by the speech text data CD acquired in S310 and derives morpheme information (S320). A well-known method (for example, "MeCab") may be used for the morphological analysis in S320. The morpheme information here includes at least the morphemes mo(k) (words).

The control unit 14 then updates the word familiarity data DD based on the morphemes mo(k) derived in S320 (S330). Specifically, in S330 of the present embodiment, the number of appearances is counted for each distinct morpheme mo, and the word familiarity data DD is updated so that a morpheme mo (word) that has appeared more often up to that point has a higher familiarity.

The familiarity may be updated by adding, to the familiarity before the update, the number of appearances multiplied by a coefficient defined in advance. The update may also be limited to morphemes mo whose part of speech is an independent word (content word), excluding attached words (function words).
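The additive update described above may be sketched as follows. The function name `update_familiarity`, the dictionary representation of the word familiarity data DD, the default coefficient of 0.1, and the starting value of 0 for words not yet in DD are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def update_familiarity(dd: Dict[str, float], morphemes: Iterable[str],
                       coeff: float = 0.1) -> Dict[str, float]:
    """Return a copy of DD with coeff * (number of appearances) added per word."""
    counts = Counter(morphemes)  # appearances of each distinct morpheme mo
    updated = dict(dd)
    for word, n in counts.items():
        # Words missing from DD start from 0.0 (assumed default).
        updated[word] = updated.get(word, 0.0) + coeff * n
    return updated
```

To exclude attached words, the caller would filter `morphemes` by part of speech before passing them in.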

The familiarity update process then ends.
In short, in the familiarity update process of the present embodiment, the control unit 14 updates the word familiarity data DD stored in the storage unit 22 so that, in the content the user is viewing, a morpheme mo (word) that has appeared more often up to that point has a higher familiarity.
[Effects of the embodiment]
As described above, the speech speed data generation process of the present embodiment generates the speech speed data so that the lower the familiarity of a word, the larger the share of the total read-out time taken up by reading out that word.

This is because, if the time taken to read out a word with a low degree of recognition (that is, low familiarity) is short, a person listening to the audio output with the video may be unable to recognize the content of the information conveyed by that audio.

That is, if the start timing of each phoneme in the synthesized speech is determined based on the speech speed data generated by the speech speed data generation process of the present embodiment, the share of the total read-out time of the information that is taken up by reading out words with low familiarity can be increased in that synthesized speech.

As a result, even words with low familiarity become easier for a listener of the synthesized speech to catch, and the listener can recognize the entire content of the information conveyed by the utterance.
In other words, the information processing apparatus 30 can adjust the read-out speed (that is, the speech speed) of the synthesized speech so that the content of the utterance is easier to understand.

In information conveyed by Japanese speech, nouns and verbs usually carry great weight. For this reason, the speech speed data generation process of the present embodiment treats nouns and verbs as important parts of speech, and generates the speech speed data so that the read-out time of the important words corresponding to each important part of speech becomes longer.

Synthesized speech whose speech speed is adjusted based on speech speed data generated in this way makes the important parts of speech easier to catch and the content of the utterance easier to understand.
In addition, the speech speed data generation process of the present embodiment generates, as the speech speed data, data normalized so that the time length needed to read out the entire information represented by one item of speech text data CD is kept at the required utterance time.

The speech speed data generation process can therefore prevent the time taken to read out the subtitles from deviating from the time length defined in advance, so that the subtitles are read out at appropriate timing in step with the progress of the video.

In the present embodiment, the familiarity update process updates the word familiarity data DD so that a morpheme mo (word) that appears more often along the time axis of the speech text data CD has a higher familiarity, in accordance with the amount of progress along the time axis of the speech text data CD.

With this familiarity update process, a word that appears more often along the time axis of the content can be given a higher familiarity.
As a result, the speech speed data generation process of the present embodiment can generate speech speed data suited to the video.
[Other embodiments]
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment and can be implemented in various forms without departing from the gist of the present invention.

For example, in the speech speed data generation process of the above embodiment, both nouns and verbs are treated as important parts of speech, but the important parts of speech may be at least one of nouns and verbs.
Also, in the above embodiment the speech speed data generation process is executed by the control unit 40 of the information processing apparatus 30, but the apparatus that executes it is not limited to the information processing apparatus 30 and may be the information processing server 10.

In that case, when synthesizing speech that reads out the subtitles based on the speech text data TD, the information processing apparatus 30 may acquire the speech speed data from the information processing server 10 and determine the speech speed.

Likewise, in the above embodiment the familiarity update process is executed by the information processing server 10, but the apparatus that executes it is not limited to the information processing server 10 and may be the information processing apparatus 30.

A form in which part of the configuration of the above embodiment is omitted, to the extent that the problem can still be solved, is also an embodiment of the present invention. A form configured by appropriately combining the above embodiment with the modifications is also an embodiment of the present invention. Furthermore, every form conceivable without departing from the essence of the invention specified by the wording of the claims is also an embodiment of the present invention.
[Correspondence between the embodiment and the claims]
Finally, the relationship between the description of the above embodiment and the description of the claims is explained.

The function obtained by executing S110 of the speech speed data generation process or S310 of the familiarity update process corresponds to the text acquisition means recited in the claims, and the function obtained by executing S120 of the speech speed data generation process or S320 of the familiarity update process corresponds to the analysis means. The function obtained by executing S130 of the speech speed data generation process corresponds to the familiarity acquisition means recited in the claims, and the function obtained by executing S140 to S200 of the speech speed data generation process corresponds to the speech speed determination means.

Further, the function obtained by executing S160 of the speech speed data generation process corresponds to the word identification means recited in the claims. The function obtained by executing S330 of the familiarity update process corresponds to the update means recited in the claims.

Description of symbols: 1: content viewing system; 10: information processing server; 12: communication unit; 14: control unit; 16: ROM; 18: RAM; 20: CPU; 22: storage unit; 22: storage device; 30: information processing apparatus; 31: communication unit; 32: input reception unit; 33: display unit; 34: sound input unit; 35: sound output unit; 36: storage unit; 40: control unit; 41: ROM; 42: RAM; 43: CPU

Claims

An information processing apparatus comprising:
text acquisition means for acquiring text data that represents, as a character string, the content of information output as audio in accordance with video;
analysis means for analyzing the text data acquired by the text acquisition means and identifying each word contained in the character string represented by the text data;
familiarity acquisition means for acquiring the familiarity corresponding to each word identified by the analysis means from a familiarity database storing familiarity information in which each word is associated in advance with a familiarity representing the degree of recognition of that word; and
speech speed determination means for generating speech speed data, the speech speed data being data that represents the utterance time of synthesized speech output by speech synthesis and that represents the utterance time of each phoneme making up the character string of the information represented by the text data, wherein the utterance time of each word is adjusted so that the lower the familiarity acquired by the familiarity acquisition means for a word, the larger the share of the utterance time of the entire information represented by the text data that is taken up by the utterance time of that word.

The information processing apparatus according to claim 1, further comprising word identification means for identifying, among the words identified by the analysis means, important words, each being a word corresponding to an important part of speech defined in advance as a part of speech of high importance,
wherein the speech speed determination means generates the speech speed data so that the utterance time of the vowels contained in the important words identified by the word identification means becomes longer.

The information processing apparatus according to claim 2, wherein the word identification means treats at least one of nouns and verbs as the important part of speech and identifies the words corresponding to each important part of speech as the important words.

The information processing apparatus according to any one of claims 1 to 3, further comprising update means for updating the familiarity associated with a word identified by the analysis means, in the familiarity information stored in the familiarity database, so that the degree of recognition of that word becomes higher.

The information processing apparatus according to claim 4, wherein
a plurality of items of the text data exist for one video,
the text acquisition means acquires the items of text data sequentially,
the analysis means analyzes the text data along the time progression of the text data corresponding to the video each time the text acquisition means acquires text data, and
the update means updates the familiarity associated with each word identified by the analysis means, in the familiarity information, so that a word that appears more often in the time progression of the text data has a higher familiarity in accordance with the amount of progress of the text data.

The information processing apparatus according to any one of claims 1 to 5, further comprising speech synthesis means for synthesizing and outputting speech, based on the speech speed data generated by the speech speed determination means, so that the utterance time of each phoneme making up each word equals the utterance time represented by the speech speed data.

The information processing apparatus according to any one of claims 1 to 6, wherein
each item of the text data includes a required utterance time defined in advance as the time length that can be spent uttering the character string represented by that text data, and
the speech speed determination means generates, as the speech speed data, data normalized so that the utterance time of the entire information represented by the text data is kept at the required utterance time.

A speech speed data generation method comprising:
a text acquisition step of acquiring text data that represents, as a character string, the content of information output as audio in accordance with video;
an analysis step of analyzing the text data acquired in the text acquisition step and identifying each word contained in the character string represented by the text data;
a familiarity acquisition step of acquiring the familiarity corresponding to each word identified in the analysis step from a familiarity database storing familiarity information in which each word is associated in advance with a familiarity representing the degree of recognition of that word; and
a speech speed determination step of generating speech speed data, the speech speed data being data that represents the utterance time of synthesized speech output by speech synthesis and that represents the utterance time of each phoneme making up the character string of the information represented by the text data, wherein the utterance time of each word is adjusted so that the lower the familiarity acquired in the familiarity acquisition step for a word, the larger the share of the utterance time of the entire information represented by the text data that is taken up by the utterance time of that word.

A program causing a computer to execute:
a text acquisition procedure of acquiring text data that represents, as a character string, the content of information output as audio in accordance with video;
an analysis procedure of analyzing the text data acquired in the text acquisition procedure and identifying each word contained in the character string represented by the text data;
a familiarity acquisition procedure of acquiring the familiarity corresponding to each word identified in the analysis procedure from a familiarity database storing familiarity information in which each word is associated in advance with a familiarity representing the degree of recognition of that word; and
a speech speed determination procedure of generating speech speed data, the speech speed data being data that represents the utterance time of synthesized speech output by speech synthesis and that represents the utterance time of each phoneme making up the character string of the information represented by the text data, wherein the utterance time of each word is adjusted so that the lower the familiarity acquired in the familiarity acquisition procedure for a word, the larger the share of the utterance time of the entire information represented by the text data that is taken up by the utterance time of that word.