JP2006140707A

JP2006140707A - Method, device and program for processing image and computer-readable recording medium recording program

Info

Publication number: JP2006140707A
Application number: JP2004327739A
Authority: JP
Inventors: Hiroko Konya; 裕子紺家; Tomokazu Yamada; 智一山田; Hidekatsu Kuwano; 秀豪桑野; Katsuhiko Kawazoe; 雄彦川添
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-11-11
Filing date: 2004-11-11
Publication date: 2006-06-01
Anticipated expiration: 2024-11-11
Also published as: JP4272611B2

Abstract

PROBLEM TO BE SOLVED: To provide a technique that takes a video and external audio as input and associates the two with a small workload and simple operations.

SOLUTION: The input audio is speech-recognized to obtain text information for each speech section, and the input video is divided into topic sections, each defined as one coherent unit of video. Based on these results, an editing screen is displayed that lists icons associated with the topic sections, shows the recognized text for each speech section, and shows information indicating the time positions of the topic sections alongside information indicating the time positions of the speech sections. The speech section associated with a topic section selected on the editing screen is identified from the time position of that topic section and the time positions of the speech sections, and the display information for the two associated sections is highlighted.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a video processing method and apparatus that take as input a video and audio that is generated separately from the video and is to be associated with it, and that perform processing to associate the video with the audio; to a video processing program used to implement the method; and to a computer-readable recording medium on which the program is recorded.

In editing moving images such as television and video footage, media recognition techniques that detect shot changes, the presence or absence of speech and music, the presence or absence of captions, and the like are used to make editing more efficient. Index creation processing (dividing the video into scenes, determining a representative image for each scene, and so on) is applied to the input video, and the results are listed on a display in detection-result frames as information that assists the editor's work.

With reference to this index information, the editor uses a moving image editing function that groups an arbitrary section of the video into a topic and attaches information describing the topic's content to it as related information. The editor reads in text information from sources outside the video (scripts, running orders, and the like) and attaches it to topics as related information for display.

Meanwhile, a technique that recognizes the audio in a video and displays it as text is used as auxiliary information for such editing (see, for example, Patent Document 1).

In addition, for cases where a video scenario contains no description of timing, a technique is used that counts the number of characters written for each scene in the scenario and calculates a predicted duration for each scene from the counts and the actual duration of the edited video, thereby reconciling the time differences between the planned video described in the scenario and the video actually edited (see, for example, Patent Document 2).

Patent Document 1: JP 2003-323437 A
Patent Document 2: JP 2004-159107 A
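The character-count approach attributed to Patent Document 2 can be illustrated with a short sketch. The text above does not give the formula, so the proportional allocation used here, and the function name `predict_scene_durations`, are assumptions made purely for illustration.

```python
def predict_scene_durations(char_counts, total_seconds):
    """Split the actual video length across scenes in proportion to
    the number of characters written for each scene (assumed rule)."""
    total_chars = sum(char_counts)
    if total_chars == 0:
        raise ValueError("the scenario contains no characters")
    return [total_seconds * count / total_chars for count in char_counts]


# Three scenes with 100, 300 and 600 characters in a 200-second video.
durations = predict_scene_durations([100, 300, 600], 200.0)
```

Because the prediction depends on scenario text that matches what is spoken in the video, a sketch like this also shows why the approach does not carry over to separately recorded summary speech.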

Against this background, a method is used in which the audio in a video is recognized and converted to text, and the recognized text is attached to topics as related information.

However, speech recognition requires reasonably noise-free speech, whereas the audio contained in television and video footage is, depending on the content, often noisy or made up of many overlapping sounds, and may therefore be unsuitable for speech recognition.

Another kind of audio information related to a video is summarized information generated separately from the video. Such summary information is sometimes better suited as related information for the video.

Against this background, a respeak method is used, in which a dedicated speaker either rephrases what is said while listening to the audio of the television or video footage or reads a script prepared in advance, and the resulting summary speech is recognized and converted to text.

It is therefore conceivable to use this respeak method and attach the text obtained by recognizing the respoken speech to topics as related information.

However, when the respeak method is used, it is difficult to determine from time information alone which scene should be linked to which summary-utterance recognition result (the recognition result of the respoken speech).

That is, when uttering a summary, the speaker watches a scene, gathers the main points, decides what to say, and only then speaks. As a result, the utterance about a scene begins in the latter half of the scene or after it ends, and continues after the scene is over. This is why time information alone is not enough to determine which scene should be linked to which summary-utterance recognition result.

Moreover, the offset between a scene and the corresponding utterance is not constant, because it depends on the video content, the speaker's habits, and so on. A method that simply shifts the utterances earlier by a predetermined fixed amount therefore cannot be used either.

It is also conceivable for the editor to search for and match the information item by item, but this is costly in time.

As described in Patent Document 2, there is also a technique for automatically matching text information with video information. However, because this technique makes the association by comparing the length of the speech sections in the video with the amount of text, it requires text equivalent to the utterances in the video. It cannot be used when audio different from the audio in the video is used, or when the text cannot be prepared in advance.

The present invention has been made in view of these circumstances. Its object is to provide a new video processing technique that, when a video and separately generated audio to be associated with it are taken as input and processing to associate the two is performed, allows the association to be made with a small amount of work and simple operations.

To achieve this object, the video processing apparatus of the present invention takes as input a video and audio that is generated separately from the video and is to be associated with it, and, in order to associate the video with the audio, comprises: (a) speech recognition means for recognizing the input audio to obtain text information for each speech section together with time information for those speech sections; (b) topic partitioning means for dividing the input video into topic sections, each defined as one coherent unit of video, and obtaining time information for those topic sections; (c) editing screen display means for displaying an editing screen that lists icons associated with the topic sections, shows the recognized text for each speech section, and shows information indicating the time positions of the topic sections and information indicating the time positions of the speech sections side by side along a time axis; and (d) highlighting means for identifying the speech section associated with a topic section selected on the editing screen, based on the time position of that topic section and the time position of each speech section, and highlighting the display information for that topic section and for the identified speech section.

With this configuration, the apparatus may further comprise correction means that, when the user designates the speech section to be associated with the selected topic section, corrects the time position of the designated speech section so that it becomes associated with that topic section.

When this correction means is provided, the apparatus may also comprise means for correcting the time positions of one or more speech sections that follow the corrected speech section, using either the time correction amount of the corrected speech section or the average of the time correction amounts of the speech sections corrected so far. Alternatively, it may comprise means that corrects those following speech sections using the time correction amount of the corrected speech section normalized by the duration of the topic section or the speech section, or the average of the correction amounts made so far, each normalized in the same way.

When this correction means is provided, the apparatus may also comprise means that, when the time position of a speech section is corrected, moves the information indicating that speech section's time position on the editing screen accordingly.

The video processing method of the present invention, realized by the operation of the processing means described above, can also be implemented as a computer program. This program is provided on a suitable computer-readable recording medium or over a network, is installed when the invention is put into practice, and realizes the invention by running on control means such as a CPU.

In the present invention configured in this way, when a video and separately generated audio to be associated with it are input, the input audio is speech-recognized to obtain text information for each speech section together with time information for those speech sections. The input video is then divided into topic sections, each defined as one coherent unit of video, and time information for those topic sections is obtained.

An editing screen is then displayed based on these results. The editing screen lists icons associated with the topic sections, shows the recognized text for each speech section, and shows information indicating the time positions of the topic sections and of the speech sections side by side along a time axis.

On this editing screen, the user selects a topic section by clicking its display information. When a topic section is selected, the degree to which the selected topic section and each speech section overlap in time is calculated, the speech section associated with the topic section is identified accordingly, and the display information for the topic section and for the identified speech section is highlighted.

Seeing this highlighting, the user judges whether the correspondence between the selected topic section and the highlighted speech section is the desired one and, if it is, selects the next topic section.

If, on the other hand, the user judges that the correspondence is not the desired one, the user designates the speech section to be associated with the selected topic section by selecting it. When a speech section is designated, its time position is corrected so that it becomes associated with that topic section.

In accordance with this correction, the time positions of one or more speech sections following the corrected speech section are also corrected, using either the time correction amount of the corrected speech section or the average of the time correction amounts made so far. Alternatively, they are corrected using the time correction amount normalized by the duration of the topic section or the speech section, or the average of such normalized correction amounts made so far.
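The propagation of a correction to the following speech sections can be sketched as below. This is a minimal illustration, not the patented implementation: the representation of speech sections as `(start, end)` tuples in seconds, the function name `propagate_correction`, and its parameters are all assumptions. The normalized variants are left out because the text does not specify the normalization formula.

```python
from statistics import mean


def propagate_correction(sections, corrected_index, past_corrections,
                         use_average=False, n_following=1):
    """Shift the speech sections that follow a corrected one.

    sections         -- list of (start, end) tuples in time order
    corrected_index  -- index of the section the user just corrected
    past_corrections -- signed shifts (seconds) applied so far; the last
                        entry is the correction that was just made
    use_average      -- use the mean of all past corrections instead of
                        the most recent one
    n_following      -- how many following sections to shift
    """
    shift = mean(past_corrections) if use_average else past_corrections[-1]
    result = list(sections)
    last = min(len(result), corrected_index + 1 + n_following)
    for i in range(corrected_index + 1, last):
        start, end = result[i]
        result[i] = (start + shift, end + shift)
    return result


sections = [(0.0, 5.0), (6.0, 10.0), (12.0, 15.0)]
# The first section was moved 2 seconds earlier; apply the same shift
# to the two sections that follow it.
shifted = propagate_correction(sections, 0, [-2.0], n_following=2)
```

Using the running average instead of the latest correction smooths out a single unusually large adjustment, at the cost of reacting more slowly when the speaker's delay changes.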

Thus, according to the present invention, when a video and separately generated audio to be associated with it are taken as input and processing to associate the two is performed, the association can be made with a small amount of work and simple operations.

Consequently, according to the present invention, the recognition result of externally supplied audio can be attached to a video automatically as related information.

Furthermore, because the audio information attached is not the recognition result of the audio contained in the video, which is unsuitable for speech recognition, but the recognition result of externally supplied audio that is suitable for it, accurate and appropriate audio information can be attached to the video.

Furthermore, because the video and the audio information associated by time information are highlighted when the audio information is attached, the user can attach audio information to the video while checking it visually, and can therefore do so efficiently.

The present invention is described in detail below by way of embodiments.

FIG. 1 illustrates an embodiment of a moving image editing apparatus 1 incorporating the present invention.

The moving image editing apparatus 1 takes as input video captured by a camera 2 and audio collected by a microphone 3 and performs processing to associate the two. It comprises: a video input unit 10 that inputs the video captured by the camera 2; a video storage unit 11 that stores the video input by the video input unit 10; an external audio input unit 12 that inputs the audio collected by the microphone 3; a speech recognition unit 13 that recognizes the external audio input by the external audio input unit 12 and divides it into speech sections; a recognized speech storage unit 14 that stores the recognition results of the speech recognition unit 13; a timer 15 that keeps time; a topic definition unit 16 that divides the video stored in the video storage unit 11 into topic sections, each defined as one coherent unit of video; a topic information storage unit 17 that stores information on each topic section produced by the topic definition unit 16; a topic editing unit 18 that displays an editing screen and associates topic sections with speech sections; and a display 19 that shows the editing screen and other output.

Next, the processing executed by the video input unit 10, the external audio input unit 12, the speech recognition unit 13, the topic definition unit 16, and the topic editing unit 18 is described.

[1] Processing of the video input unit 10
As shown in the processing flow of FIG. 2, the video input unit 10 first inputs the video captured by the camera 2 in step 10, and then, in step 11, stores the input video in the video storage unit 11 while adding the time information supplied by the timer 15.

Through this processing, the video to be processed is stored in the video storage unit 11.

[2] Processing of the external audio input unit 12 and the speech recognition unit 13
As shown in the processing flow of FIG. 3, the external audio input unit 12 first inputs the external audio collected by the microphone 3 in step 20, and then, in step 21, adds the time information supplied by the timer 15 to the input external audio.

Following this, the speech recognition unit 13 starts operating. As shown in the processing flow of FIG. 3, it first recognizes the input external audio in step 22, generating for each speech section the recognized text and the time information of that section, and then, in step 23, stores the results in the recognized speech storage unit 14.

Through this processing, the recognized speech storage unit 14 holds the text information and time information for each speech section obtained by recognizing the input external audio.
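One possible shape for the records held in the recognized speech storage unit 14 is sketched below. The class name `SpeechSection`, its field names, and the use of seconds for time values are assumptions for illustration; the text only states that text information and time information are stored for each speech section.

```python
from dataclasses import dataclass


@dataclass
class SpeechSection:
    """One recognized speech section (hypothetical record layout)."""
    start: float  # start time in seconds, taken from the timer
    end: float    # end time in seconds
    text: str     # recognized text for this section

    @property
    def duration(self) -> float:
        return self.end - self.start


# A small in-memory stand-in for the recognized speech storage unit,
# kept in time order so that neighbouring sections are easy to find.
store = sorted(
    [
        SpeechSection(42.0, 47.5, "second summary sentence"),
        SpeechSection(12.0, 15.5, "first summary sentence"),
    ],
    key=lambda section: section.start,
)
```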

A unit with a dictation function (a speech recognition function that recognizes every spoken word as faithfully as possible) is used as the speech recognition unit 13. For example, the speech recognition technology described in NTT Technical Journal, December 1999, p. 14, "Development of the VoiceRex speech recognition engine," or NTT Technical Journal, December 1999, p. 22, "Document creation with the VoiceRex speech recognition engine," can be used.

[3] Processing of the topic definition unit 16
As shown in FIG. 4, the topic definition unit 16 defines topic sections for the video stored in the video storage unit 11, thereby dividing the video into topic sections, and stores the video information and time information for each topic section in the topic information storage unit 17.

To define topic sections, the topic definition unit 16 has, for example, an index creation function and a topic definition function, as shown in FIG. 4.

Using the index creation function, the topic definition unit 16 detects shot changes, the presence or absence of speech and music, the presence or absence of captions, and the like, and applies index creation processing (dividing the video into scenes, determining a representative image for each scene, and so on) to the input video. It lists the results in detection-result frames, for example in the browser screen 100 on the right-hand side of the user interface screen shown in FIG. 5.

As shown on the left-hand side of this user interface screen, a playback player 101 for moving images is provided, including moving image playback, an audio waveform display, and a mark display. Using this player, the user defines topic sections, each a coherent unit of video, while viewing the video for a selected index. The topic definition unit 16 thus uses the topic definition function to interact with the user and define topic sections for the video stored in the video storage unit 11, dividing the video into topic sections and storing the video information and time information for each topic section in the topic information storage unit 17.

Taking the input of a baseball broadcast as an example, topic sections are defined as the top of the first inning, the bottom of the first inning, the top of the second inning, and so on, and the video information contained in each topic section and the time information of that section are stored in the topic information storage unit 17.

[4] Processing of the topic editing unit 18
The topic editing unit 18 displays the editing screen and, by interacting with the user through it, associates the topic sections defined by the topic definition unit 16 with the speech sections recognized by the speech recognition unit 13.

Through this processing, the recognition result of the externally supplied audio can be attached automatically, as related information, to each topic section contained in the video.

FIG. 6 illustrates an example of the editing screen displayed by the topic editing unit 18.

As shown in this figure, the editing screen consists of a topic list display section 200 that lists the icons associated with the topic sections, a speech recognition result display section 201 that shows the text recognized from the external audio for each speech section, and a timeline display section 202 that shows bars indicating the time positions of the speech sections and bars indicating the time positions of the topic sections side by side along a time axis.

Although omitted from FIG. 6, the editing screen also provides a playback player that, when the user selects one of the icons shown in the topic list display section 200, plays the video of the topic section that the icon represents.

FIGS. 7 to 9 illustrate an example of the processing flow executed by the topic editing unit 18. The present invention is described in detail below according to this flow.

On receiving a processing request from the user, the topic editing unit 18 first, in step 30, lists the icons of the topic sections in the topic list display section 200 of the editing screen, and then, in step 31, shows the text of each speech section in the speech recognition result display section 201, as shown in the processing flow of FIGS. 7 to 9.

Next, in step 32, it displays bars indicating the time position of each topic section in the timeline display section 202, and then, in step 33, displays bars indicating the time position of each speech section in the same timeline display section 202.

By executing steps 30 to 33 in this way, the topic editing unit 18 displays an editing screen such as that shown in FIG. 6.

The user responds to the editing screen by entering an editing operation. In step 34 the topic editing unit 18 waits for an editing operation, and when it detects one, it proceeds to step 35 and determines whether the operation is a click on a topic section.

That is, to select a topic section for processing, the user clicks either one of the icons in the topic list display section 200 or one of the bars in the timeline display section 202 that indicate the time positions of topic sections, so the unit determines whether the operation is of this kind.

If step 35 determines that the operation is a click on a topic section, the unit proceeds to step 36 and ends any highlighting currently shown on the editing screen, and then, in step 37, extracts the speech sections located near the time position of the clicked topic section.

Next, in step 38, the extracted speech sections are taken as processing targets, and values indicating the temporal overlap between the clicked topic section and each target speech section are acquired. Specifically, these values are the time lengths X and Y shown in FIG. 10 (Y1 and Y2 when the speech section straddles the topic section).

Next, in step 39, the acquired X and Y (Y1, Y2) and a preset threshold Z are used to identify, among the target speech sections, the one to be associated with the clicked topic section.

The identification process executed in step 39 is described next, following the processing flow shown in FIG. 11.

On entering step 39, the topic editing unit 18 selects one speech section from the target speech sections. As shown in the processing flow of FIG. 11, it first determines in step 390 whether at least one of the start time and the end time of the selected speech section lies within the topic section. If so, that is, if the speech section is in a state like speech section α in FIG. 10, the process proceeds to step 391 and determines whether the relation "X/(X+Y) ≥ Z" holds. If the relation holds (the time overlapping the topic section is long), the process proceeds to step 392 and the selected speech section is judged to be associated with the topic section. If the relation does not hold, the process proceeds to step 394 and the selected speech section is judged not to be associated with the topic section.

If step 390 determines that neither the start time nor the end time of the selected speech section lies within the topic section, that is, if the speech section is in a state like speech section β in FIG. 10, the process proceeds to step 393 and determines whether the relation "X/(X+Y1+Y2) ≥ Z" holds. If the relation holds (the time overlapping the topic section is long), the process proceeds to step 392 and the selected speech section is judged to be associated with the topic section. If the relation does not hold, the process proceeds to step 394 and the selected speech section is judged not to be associated with the topic section.

In this way, step 39 uses the X and Y (Y1, Y2) acquired in step 38 and the preset threshold Z to obtain the proportion by which the clicked topic section and each speech section near its time position overlap in time, and on that basis identifies, among those speech sections, the one to be associated with the clicked topic section.
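The decision of steps 390 to 394 can be sketched as below. FIG. 10 is not reproduced in the text, so the sketch assumes that X is the length of the overlap and that Y (or Y1 and Y2) is the part of the speech section lying outside the topic section. The `Section` type and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    """A time interval in seconds (hypothetical helper type)."""
    start: float
    end: float


def overlap_ratio(topic, speech):
    """Proportion of the speech section that overlaps the topic section."""
    # X: overlapping length (assumed meaning of X in FIG. 10).
    x = max(0.0, min(topic.end, speech.end) - max(topic.start, speech.start))
    starts_inside = topic.start <= speech.start <= topic.end
    ends_inside = topic.start <= speech.end <= topic.end
    if starts_inside or ends_inside:
        # Step 391 (speech section alpha): a single outside part Y.
        y = (speech.end - speech.start) - x
        total = x + y
    else:
        # Step 393 (speech section beta): outside parts Y1 and Y2.
        y1 = max(0.0, topic.start - speech.start)
        y2 = max(0.0, speech.end - topic.end)
        total = x + y1 + y2
    return x / total if total > 0 else 0.0


def is_associated(topic, speech, z):
    """Step 392 when True, step 394 when False."""
    return overlap_ratio(topic, speech) >= z


topic = Section(10.0, 20.0)
alpha = Section(8.0, 14.0)   # ends inside the topic section
beta = Section(5.0, 25.0)    # straddles the topic section
```

Here `alpha` overlaps the topic for 4 of its 6 seconds, so it passes a threshold of Z = 0.5, while `beta` overlaps for 10 of its 20 seconds and fails a threshold of Z = 0.6.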

Next, step 40 determines whether the identification process of step 39 was able to identify a speech section associated with the clicked topic section. If it was, the process proceeds to step 41 and highlights, on the editing screen, the display information for the clicked topic section and the display information for the identified speech section.

That is, as shown in FIG. 12, the following are highlighted: the icon of the topic section in the topic list display section 200, the text information of the speech section in the speech recognition result display section 201, and the bars of the topic section and the speech section in the timeline display section 202.

In response to this highlight display, the user inputs whether the association between the topic section and the speech section is the desired one. In the following step 42, the topic editing unit 18 therefore determines whether the user has indicated that the highlighted association is the desired one; if so, it returns to step 34 to process the next topic section.

Thus, when the processing of steps 37 to 39 identifies the desired speech section for the clicked topic section, the process simply returns to step 34 to handle the next topic section.

If, on the other hand, step 42 determines that the user has indicated that the highlighted association is not the desired one, the process proceeds to step 43 and ends the highlight display of the speech section, as shown in FIG. 13.

Next, in step 44, the speech section to be associated with the clicked topic section is selected through interaction with the user on the editing screen, and in the following step 45, the time position of the selected speech section is corrected so that the section becomes associated with the clicked topic section.

This time-position correction may be performed automatically, for example so that the end time position of the selected speech section coincides with the end time position of the clicked topic section. Alternatively, with the slide display mode described below enabled, it may follow the user's move operation on the bar in the timeline display section 202 that indicates the time position of the speech section.

The correction need not continue until the selected speech section is completely contained in the clicked topic section; it suffices to correct until the relation "X/(X+Y) ≥ Z" or "X/(X+Y1+Y2) ≥ Z" described above holds. The correction may nevertheless be carried on until the speech section is completely contained.
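The automatic variant of step 45 mentioned above, in which the end of the selected speech section is made to coincide with the end of the clicked topic section, can be sketched as follows. The `Section` type and the function name are hypothetical; the returned shift corresponds to what the description later calls the correction time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    """A time interval in seconds (hypothetical helper type)."""
    start: float
    end: float


def align_end_to_topic(topic, speech):
    """Shift the speech section so that its end coincides with the topic end.

    Returns the shifted section and the signed shift in seconds."""
    delta = topic.end - speech.end
    return Section(speech.start + delta, speech.end + delta), delta


topic = Section(10.0, 20.0)
selected = Section(22.0, 27.0)
corrected, correction_time = align_end_to_topic(topic, selected)
```

In this example the speech section moves 7 seconds earlier and ends up at 15.0 to 20.0, fully inside the topic section. A speech section longer than the topic section would still overhang after this shift, so whether the threshold relation holds would need to be checked separately.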

Next, in step 46, the time positions of the speech sections following the selected speech section are corrected. The details of this correction are described below. It is not necessary to correct all of the following speech sections; for example, only the one speech section located immediately after the selected speech section may be corrected.

Next, in step 47, now that the speech section associated with the clicked topic section has been fixed, the display information for the selected speech section is highlighted.

Next, step 48 determines whether the slide display mode is set. If it is, the process proceeds to step 49, shifts the display positions of the bars indicating the time positions of the speech sections in the timeline display section 202 to match the corrections made in steps 45 and 46, as shown in FIG. 14, and then returns to step 34 to process the next topic section. If the slide display mode is not set, the process returns to step 34 without performing step 49.

If step 40 determines that no speech section associated with the clicked topic section could be identified, the process proceeds to step 50 and highlights only the display information for the clicked topic section. It then proceeds to steps 44 to 49 in order to identify the associated speech section from the speech section that the user designates.

Thus, steps 44 to 46 apply in two cases: when the automatic processing of steps 37 to 39 identifies a speech section but it is not the desired one, and when no speech section can be identified. In both cases, the speech section designated by the user is selected, its time position is corrected so that it becomes associated with the clicked topic section, and the time positions of the speech sections that follow it are corrected accordingly.

If step 35 determines that the user's editing operation is not a click on a topic section, the process proceeds to step 51 and determines whether the operation instructs the end of processing. If it does not, the process proceeds to step 52, executes the requested editing process, and returns to step 34. If it does, the processing ends.

In this way, the topic editing unit 18 displays the editing screen shown in FIG. 6 and, through interaction with the user on that screen, associates the topic sections defined by the topic definition unit 16 with the speech sections recognized by the speech recognition unit 13.

The correction of speech-section time positions executed in step 46 is described next. In step 46, the topic editing unit 18 corrects the time positions of the speech sections that follow the speech section selected by the user.

FIGS. 15(a) and 15(b) show examples of the processing flow that the topic editing unit 18 executes in step 46.

When the topic editing unit 18 follows the processing flow of FIG. 15(a) to correct the time position of a speech section that follows the user-selected speech section, it first acquires, in step 460A, the correction time of the speech section located immediately before the speech section to be corrected. In the following step 461A, it uses that correction time to correct the time position of the speech section to be corrected.

That is, under the processing flow of FIG. 15(a), the topic editing unit 18 corrects the time position of the speech section to be corrected in the manner shown in FIG. 16(a).

When the topic editing unit 18 instead follows the processing flow of FIG. 15(b), it first acquires, in step 460B, the correction times of the speech sections located before the speech section to be corrected and calculates their average. In the following step 461B, it uses that average correction time to correct the time position of the speech section to be corrected.

That is, under the processing flow of FIG. 15(b), the topic editing unit 18 corrects the time position of the speech section to be corrected in the manner shown in FIG. 16(b).
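The two flows of FIG. 15 can be sketched as follows. The text does not say whether a shift applied to one following section then counts as a "correction time" for the sections after it; the sketch assumes that it does. The function name, the `mode` strings, and the tuple representation of sections are hypothetical.

```python
from statistics import mean


def propagate_corrections(following, prior_corrections, mode="previous"):
    """Shift the (start, end) pairs in `following`, in order.

    prior_corrections: signed correction times (seconds) of the sections
    already corrected, oldest first; the last entry belongs to the section
    the user selected.
    mode="previous": reuse the correction of the immediately preceding
    section (FIG. 15(a)). mode="average": use the mean of all earlier
    corrections (FIG. 15(b))."""
    if not prior_corrections:
        raise ValueError("at least one prior correction is required")
    corrections = list(prior_corrections)
    shifted = []
    for start, end in following:
        if mode == "previous":
            delta = corrections[-1]
        elif mode == "average":
            delta = mean(corrections)
        else:
            raise ValueError(f"unknown mode: {mode}")
        shifted.append((start + delta, end + delta))
        # Assumption: the shift just applied becomes part of the history.
        corrections.append(delta)
    return shifted


following = [(30.0, 35.0), (40.0, 42.0)]
by_previous = propagate_corrections(following, [-2.0, -4.0], mode="previous")
by_average = propagate_corrections(following, [-2.0, -4.0], mode="average")
```

With prior corrections of -2 and -4 seconds, the "previous" mode moves every following section by -4 seconds, while the "average" mode moves them by -3 seconds.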

The processing flows of FIGS. 15(a) and 15(b) do not take the length of the speech sections into account. Instead of the flow of FIG. 15(a), the flow of FIG. 17(a) may be used: the correction time is normalized by the speech-section length, and the time position of the speech section to be corrected is corrected based on the normalized correction time and the length of the speech section to be corrected. Likewise, instead of the flow of FIG. 15(b), the flow of FIG. 17(b) may be used: the correction times are normalized by the speech-section lengths and averaged, and the time position of the speech section to be corrected is corrected based on that average and the length of the speech section to be corrected.

The processing flows of FIGS. 15(a) and 15(b) also do not take the length of the topic sections into account. Instead of the flow of FIG. 15(a), the flow of FIG. 18(a) may be used: the correction time is normalized by the length of the topic section associated with the speech section, and the time position of the speech section to be corrected is corrected based on the normalized correction time and the length of the topic section associated with the speech section to be corrected. Likewise, instead of the flow of FIG. 15(b), the flow of FIG. 18(b) may be used: the correction times are normalized by the lengths of the associated topic sections and averaged, and the time position of the speech section to be corrected is corrected based on that average and the length of the topic section associated with the speech section to be corrected.
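The normalized variants of FIGS. 17 and 18 can be sketched as below. The text does not give the formula, so the sketch assumes one plausible reading: each correction time is divided by a reference length, the resulting rates are averaged, and the result is scaled by the reference length of the section to be corrected. The reference length is the speech-section length for FIG. 17 and the associated topic-section length for FIG. 18. The function name is hypothetical.

```python
def normalized_shift(prior, target_length):
    """Return the shift (seconds) to apply to the section being corrected.

    prior: list of (correction_time, reference_length) pairs for sections
    already corrected. One pair corresponds to the (a) flows, several
    pairs to the averaging (b) flows.
    target_length: reference length of the section to be corrected."""
    if not prior:
        raise ValueError("at least one prior correction is required")
    # Correction per unit of reference length (assumed normalization).
    rates = [delta / length for delta, length in prior]
    return (sum(rates) / len(rates)) * target_length


single = normalized_shift([(-2.0, 4.0)], 6.0)
averaged = normalized_shift([(-2.0, 4.0), (-3.0, 3.0)], 8.0)
```

In the first call, a 2-second shift on a 4-second section gives a rate of -0.5, so a 6-second target is shifted by -3 seconds. In the second call the rates -0.5 and -1.0 average to -0.75, so an 8-second target is shifted by -6 seconds.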

As described above, the moving image editing apparatus 1 of the present invention takes as input a video and external audio that is generated separately from the video and is to be associated with it, and allows the video and the external audio to be associated with a small amount of work and simple operations.

FIG. 1 shows an embodiment of a moving image editing apparatus according to the present invention.
FIG. 2 shows the processing flow executed by the video input unit.
FIG. 3 shows the processing flow executed by the external audio input unit and the speech recognition unit.
FIG. 4 is an explanatory diagram of the processing executed by the topic definition unit.
FIG. 5 is an explanatory diagram of the user interface screen displayed by the topic definition unit.
FIG. 6 is an explanatory diagram of the editing screen displayed by the topic editing unit.
FIG. 7 shows a processing flow executed by the topic editing unit.
FIG. 8 shows a processing flow executed by the topic editing unit.
FIG. 9 shows a processing flow executed by the topic editing unit.
FIG. 10 is an explanatory diagram of the values indicating the temporal overlap between two sections.
FIG. 11 shows a processing flow executed by the topic editing unit.
FIG. 12 is an explanatory diagram of the editing screen displayed by the topic editing unit.
FIG. 13 is an explanatory diagram of the editing screen displayed by the topic editing unit.
FIG. 14 is an explanatory diagram of the processing executed by the topic editing unit.
FIG. 15 shows a processing flow executed by the topic editing unit.
FIG. 16 is an explanatory diagram of the processing executed by the topic editing unit.
FIG. 17 shows a processing flow executed by the topic editing unit.
FIG. 18 shows a processing flow executed by the topic editing unit.

Explanation of symbols

1 Moving image editing apparatus
10 Video input unit
11 Video storage unit
12 External audio input unit
13 Speech recognition unit
14 Recognized speech storage unit
15 Timer
16 Topic definition unit
17 Topic information storage unit
18 Topic editing unit
19 Display

Claims

A video processing method that takes as input a video and audio generated separately from the video and to be associated with the video, and performs a process of associating the video with the audio, the method comprising:
a first step of performing speech recognition on the audio to obtain text information for each speech section and to obtain time information of the speech sections;
a second step of dividing the video into topic sections, each defined as one coherent unit of video, and obtaining time information of the topic sections;
a third step of displaying an editing screen that displays a list of icons associated with the topic sections, displays the text information for each speech section, and displays information indicating the time positions of the topic sections and information indicating the time positions of the speech sections side by side along a time axis; and
a fourth step of identifying, based on the time position of a topic section selected on the editing screen and the time positions of the speech sections, the speech section associated with that topic section, and explicitly displaying the display information for that topic section and the display information for the identified speech section.

The video processing method according to claim 1,
wherein, in the fourth step, the proportion by which the selected topic section and each speech section overlap in time is obtained, and the speech section associated with the topic section is identified accordingly.

The video processing method according to claim 1 or 2,
further comprising a step of, when a speech section to be associated with the selected topic section is designated by the user, correcting the time position of the designated speech section so that it becomes associated with the topic section.

The video processing method according to claim 3,
further comprising a step of taking one or more speech sections following the corrected speech section as processing targets, and correcting the time positions of the target speech sections using the time correction amount of the corrected speech section, or the average of the time correction amounts of the speech sections corrected so far.

The video processing method according to claim 3,
further comprising a step of taking one or more speech sections following the corrected speech section as processing targets, and correcting the time positions of the target speech sections using the time correction amount of the corrected speech section normalized based on the time length of a topic section or a speech section, or the average of the time correction amounts of the speech sections corrected so far, each normalized based on the time length of a topic section or a speech section.

The video processing method according to any one of claims 3 to 5,
further comprising a step of, when the time position of a speech section is corrected, changing accordingly the display position of the information that indicates the time position of that speech section on the editing screen.

A video processing apparatus that takes as input a video and audio generated separately from the video and to be associated with the video, and performs a process of associating the video with the audio, the apparatus comprising:
first means for performing speech recognition on the audio to obtain text information for each speech section and to obtain time information of the speech sections;
second means for dividing the video into topic sections, each defined as one coherent unit of video, and obtaining time information of the topic sections;
third means for displaying an editing screen that displays a list of icons associated with the topic sections, displays the text information for each speech section, and displays information indicating the time positions of the topic sections and information indicating the time positions of the speech sections side by side along a time axis; and
fourth means for identifying, based on the time position of a topic section selected on the editing screen and the time positions of the speech sections, the speech section associated with that topic section, and explicitly displaying the display information for that topic section and the display information for the identified speech section.

The video processing apparatus according to claim 7,
wherein the fourth means obtains the proportion by which the selected topic section and each speech section overlap in time, and identifies the speech section associated with the topic section accordingly.

The video processing apparatus according to claim 7 or 8,
further comprising means for, when a speech section to be associated with the selected topic section is designated by the user, correcting the time position of the designated speech section so that it becomes associated with the topic section.

The video processing apparatus according to claim 9,
further comprising means for taking one or more speech sections following the corrected speech section as processing targets, and correcting the time positions of the target speech sections using the time correction amount of the corrected speech section, or the average of the time correction amounts of the speech sections corrected so far.

The video processing apparatus according to claim 9,
further comprising means for taking one or more speech sections following the corrected speech section as processing targets, and correcting the time positions of the target speech sections using the time correction amount of the corrected speech section normalized based on the time length of a topic section or a speech section, or the average of the time correction amounts of the speech sections corrected so far, each normalized based on the time length of a topic section or a speech section.

The video processing apparatus according to any one of claims 9 to 11,
further comprising means for, when the time position of a speech section is corrected, changing accordingly the display position of the information that indicates the time position of that speech section on the editing screen.

A video processing program for causing a computer to execute the processing used to implement the video processing method according to any one of claims 1 to 6.

A computer-readable recording medium on which is recorded a video processing program for causing a computer to execute the processing used to implement the video processing method according to any one of claims 1 to 6.