JP4175093B2 - Topic boundary determination method and apparatus, and topic boundary determination program

Info

Publication number: JP4175093B2
Application number: JP2002323090A
Authority: JP
Inventors: 克人別所
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-11-06
Filing date: 2002-11-06
Publication date: 2008-11-05
Anticipated expiration: 2022-11-06
Also published as: JP2004157337A

Description

【０００１】
【発明の属する技術分野】
本発明は、トピック境界決定方法及び装置及びトピック境界決定プログラムに係り、特に、映像コンテンツや音声コンテンツをトピック単位に分割するためのトピック境界決定方法及び装置及びトピック境界決定プログラムに関する。
【０００２】
【従来の技術】
従来技術として、テキストをトピック単位に分割するHearst法がある（例えば、非特許文献１、２参照）。Hearst法では、テキストを単語に分割し、不要語を除去した後、各単語境界の前後に一定の単語数の単語列の窓をとり、各窓毎に、窓に含まれる単語の出現頻度ベクトルをとり、前後の窓に対応するベクトル間の余弦測度を当該単語境界の結束度として計算する。結束度が極小となる単語境界あるいは、その直近の文境界をトピック境界と認定する。
【０００３】
また、単語毎に当該単語を検索キーとして、単語とその意味表現であるベクトルの対の集合が格納された概念ベースを検索して、当該単語に対応するベクトルを取得し、窓に対応するベクトルとして、窓に含まれる単語のベクトルの重心をとっている方法が提案されている。
【０００４】
【非特許文献１】
Hearst, M.A.:Multi-Paragraph Segmentation of Expository Text, 32nd Annual Meeting of the Association for Computational Linguistics, pp.9-16(1994).
【非特許文献２】
Hearst, M.A.: TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages, Computational Linguistics, Vol.23, No.1, pp.33-64 (1997).
【０００５】
【発明が解決しようとする課題】
しかしながら、セグメント対象として映像コンテンツや音声コンテンツ中の音声を音声認識により認識したテキストをとった場合、認識誤りの単語を含んでいるため、上記従来技術では、結束度が適切に計算されないという第１の問題がある。
【０００６】
また、音声セグメントはポーズで区切られたものであり、文の途中で別々の音声セグメントに区切られていることも多い。従来技術では、トピック境界と認定した音声セグメント境界が文の中途になることもあり、セグメンテーションの精度が低下するという第２の問題がある。
【０００７】
また、映像コンテンツでは、テロップを音声の補助的情報として用いることも多く、中には、テロップがトピックの見出しのような役割を果している場合もある。映像コンテンツでは、音声とテロップとを合わせて必要十分な情報量になっていることも多く、音声のみのセグメンテーションでは十分な精度が得られないという第３の問題がある。
【０００８】
本発明は、上記の点に鑑みなされたもので、音声認識結果から意味上の境界を正しくかつ精度よく検出することが可能なトピック境界決定方法及び装置及びトピック境界決定プログラムを提供することを目的とする。
【０００９】
【課題を解決するための手段】
図１は、本発明の原理を説明するための図である。
【００１０】
本発明（請求項１）は、映像コンテンツや音声コンテンツに含まれる音声を認識した結果得られたデータをトピック単位に分割するためのトピック境界決定方法において、
各音声セグメントに対して認識スコアの高い順に出力された複数の認識結果テキスト（以下、ＮＢＥＳＴ候補と記す）、該ＮＢＥＳＴ候補に対する単語分割結果、及び、該単語分割結果の各単語に開始時刻情報が付与されているデータからなる音声認識結果データが入力されると、
各セグメントに対して複数のＮＢＥＳＴ候補を採用し、各音声セグメント毎に、採用した複数のＮＢＥＳＴ候補のそれぞれに含まれる単語集合をマージして、マージされた単語集合を、単語の開始時刻情報の順に該単語を昇順にソートした単語列にする単語配列過程（ステップ１）と、
ソートされた単語列から付属語を含む不要語を削除する不要語削除過程（ステップ２）と、
一定の単語数Ｍの単語列中の単語の範囲を窓とし、全音声セグメントの単語列をつなげてできる単語列において、各単語境界に対して、その単語境界の直前のＭ個の単語による窓と、その単語境界の直後のＭ個の単語による窓を指定し、各窓毎に、該窓に含まれる単語の出現頻度ベクトルを含む、該窓の意味を表すベクトルを算出し、前後の窓に対応するベクトル間の、余弦測度を始めとする類似度を当該単語境界の結束度として算出する結束度算出過程（ステップ３）と、
結束度が極小となる単語境界を求め、極小点あるいは、該極小点に直近の音声セグメント境界をトピック境界と認定するトピック境界認定過程（ステップ４）からなる。
【００１１】
このように、ＮＢＥＳＴ候補を複数とることに応じて、結束度計算の窓幅はより長くとる。複数のＮＢＥＳＴ候補において、認識の信頼性の高い単語はより多くのＮＢＥＳＴ候補に出現すると考えられる。従って、窓における出現回数もより多くなるので、窓の意味を表すベクトルは、信頼性の高い単語の影響が大きく、逆に信頼性の低い単語の影響は少なくなる。よって、窓の意味を表すベクトル及び結束度は、従来の技術に比べより適切なものとなる。
【００１２】
また、本発明（請求項２）は、映像コンテンツや音声コンテンツに含まれる音声を音声認識した結果得られたデータをトピック単位に分割するためのトピック境界決定方法において、
各音声セグメントに対して認識スコアの高い順に出力された複数の認識結果テキスト（以下、ＮＢＥＳＴ候補と記す）、該ＮＢＥＳＴ候補に対する単語分割結果、及び、該単語分割結果の各単語に開始時刻情報が付与されているデータからなる音声認識結果データが入力されると、
各音声セグメントに対して１以上のＮＢＥＳＴ候補を採用し、各音声セグメント毎に、採用した１以上のＮＢＥＳＴ候補のそれぞれに含まれる単語集合をマージして、マージされた単語集合を、単語の開始時刻情報の順に該単語を昇順にソートした単語列にする単語配列過程と、
ソートされた単語列から付属語を含む不要語を削除する不要語削除過程と、
一定の単語数Ｍの単語列中の単語の範囲を窓とし、全音声セグメントの単語列をつなげてできる単語列において、各単語境界に対して、その単語境界の直前のＭ個の単語による窓と、その単語境界の直後のＭ個の単語による窓を指定し、各窓毎に、該窓に含まれる単語の出現頻度ベクトルを含む、該窓の意味を表すベクトルを算出し、前後の該窓に対応するベクトル間の、余弦測度を始めとする類似度を当該単語境界の結束度として算出する結束度算出過程と、
結束度が極小となる単語境界を求め、極小点あるいは、該極小点に直近の音声セグメント境界をトピック境界と認定するトピック境界認定過程と、からなり、
認識結果テキスト中の各単語に認識スコア情報があるデータが入力されると、
結束度算出過程において、
各単語毎に該単語を検索キーとして、単語と該単語の意味表現であるベクトルの対の集合が格納された概念ベースを検索して、該単語に対応するベクトルを取得し、
窓の意味を表すベクトルとして、該窓に含まれる単語のベクトルの、認識スコアを重みとする重み付き平均を算出する。
【００１３】
このように、認識スコアを重みとする重み付き平均をとることにより、窓の意味を表すベクトルは、認識スコアの高い単語の影響が大きく、逆に認識スコアの低い単語の影響は少なくなるので、重みなしの重心をとる従来技術と比べて、より適切なものとなる。その結果、より適切な結束度が算出できる。
【００１４】
本発明（請求項３）は、映像コンテンツや音声コンテンツに含まれる音声を音声認識した結果得られたデータをトピック単位に分割するためのトピック境界決定方法において、
各音声セグメントに対して認識スコアの高い順に出力された複数の認識結果テキスト（以下、ＮＢＥＳＴ候補と記す）、該ＮＢＥＳＴ候補に対する単語分割結果、及び、該単語分割結果の各単語に開始時刻情報が付与されているデータからなる音声認識結果データが入力されると、
各音声セグメントの認識結果テキストの末尾の単語列の表記や品詞、句読点を含む情報から、該音声セグメントが文の中途であるかどうかを判断する文中途判断過程と、
各音声セグメントに対して１以上のＮＢＥＳＴ候補を採用し、各音声セグメント毎に、採用した１以上のＮＢＥＳＴ候補のそれぞれに含まれる単語集合をマージして、マージされた単語集合を、単語の開始時刻情報の順に該単語を昇順にソートした単語列にする単語配列過程と、
ソートされた単語列から付属語を含む不要語を削除する不要語削除過程と、
一定の単語数Ｍの単語列中の単語の範囲を窓とし、全音声セグメントの単語列をつなげてできる単語列において、各単語境界に対して、その単語境界の直前のＭ個の単語による窓と、その単語境界の直後のＭ個の単語による窓を指定し、各窓毎に、該窓に含まれる単語の出現頻度ベクトルを含む、該窓の意味を表すベクトルを算出し、前後の該窓に対応するベクトル間の、余弦測度を始めとする類似度を当該単語境界の結束度として算出する結束度算出過程と、
文中途判断過程で文の中途であると判断された音声セグメントの直後の境界を除く音声セグメント境界集合の中で、結束度が極小となる極小点に直近の音声セグメント境界をトピック境界と認定するトピック境界認定過程と、からなる。
【００１５】
これにより、トピック境界は常に文と文の間になり、文の途中となることはないので、セグメンテーションの精度が向上する。
【００１６】
本発明（請求項４）は、映像コンテンツや音声コンテンツに含まれる音声を音声認識した結果得られたデータをトピック単位に分割するためのトピック境界決定方法において、
各音声セグメントに対して認識スコアの高い順に出力された複数の認識結果テキスト（以下、ＮＢＥＳＴ候補と記す）、該ＮＢＥＳＴ候補に対する単語分割結果、及び、該単語分割結果の各単語に開始時刻情報が付与されているデータからなる音声認識結果データと、セグメント対象の映像コンテンツに含まれているテロップを文字認識により認識した結果得られるデータであって、各テロップ認識結果テキストに開始時刻情報を含むデータが入力されると、
各テロップ認識結果テキストを音声セグメント列の中に、開始時刻情報が昇順となるように挿入するテロップ認識結果テキスト挿入過程と、
各テロップ認識結果テキストを単語分割するテロップ認識結果テキスト単語分割過程と、
各音声セグメントに対して１以上のＮＢＥＳＴ候補を採用し、各音声セグメント毎に、採用した１以上のＮＢＥＳＴ候補のそれぞれに含まれる単語集合をマージして、マージされた単語集合を、単語の開始時刻情報の順に該単語を昇順にソートした単語列にする単語配列過程と、
ソートされた単語列から付属語を含む不要語及び、テロップ認識結果テキスト単語分割過程で得られた単語で付属語を含む不要語を削除する不要語削除過程と、
一定の単語数Ｍの単語列中の単語の範囲を窓とし、全音声セグメント及びテロップ中の単語列をつなげてできる単語列において、各単語境界に対して、その単語境界の直前のＭ個の単語による窓と、その単語境界の直後のＭ個の単語による窓を指定し、各窓毎に、該窓に含まれる単語の出現頻度ベクトルを含む、該窓の意味を表すベクトルを算出し、前後の該窓に対応するベクトル間の、余弦測度を始めとする類似度を当該単語境界の結束度として算出する結束度算出過程と、
結束度が極小となる単語境界を求め、極小点あるいは該極小点に直近の音声セグメントまたは、テロップ間の境界をトピック境界と認定するトピック境界認定過程と、
からなる。
【００１７】
このように、テロップを音声とマージさせることにより、テロップがトピックの見出し相当のテキストとなる場合が多い。見出し相当のテキストには、そのトピックを代表するような単語が集中して出現するため、そのテキスト以降の結束度はとりわけ高くなり、見出し相当のテキストの直前の境界において結束度の谷の深さが大きくなり、その地点がトピック境界と認定されやすくなる。このため、セグメンテーションの精度が高くなる。また、見出し相当でないテロップがあっても、音声とテロップとを合わせて必要十分な情報になっていることも多いため、テロップを音声とマージさせることにより、より適切にトピック境界を検出できると考えられる。
【００１８】
図２は、本発明の原理構成図である。
【００１９】
本発明（請求項５）は、映像コンテンツや音声コンテンツに含まれる音声を認識した結果得られたデータをトピック単位に分割するためのトピック境界決定装置であって、
各音声セグメントに対して認識スコアの高い順に出力された複数の認識結果テキスト（以下、ＮＢＥＳＴ候補と記す）、該ＮＢＥＳＴ候補に対する単語分割結果、及び、該単語分割結果の各単語に開始時刻情報が付与されているデータからなる音声認識結果データが入力されると、
各セグメントに対して複数のＮＢＥＳＴ候補を採用し、各音声セグメント毎に、採用した複数のＮＢＥＳＴ候補のそれぞれに含まれる単語集合をマージして、マージされた単語集合を、単語の開始時刻情報の順に該単語を昇順にソートした単語列にする単語配列手段１と、
ソートされた単語列から付属語を含む不要語を削除する不要語除去手段２と、
一定の単語数Ｍの単語列中の単語の範囲を窓とし、全音声セグメントの単語列をつなげてできる単語列において、各単語境界に対して、その単語境界の直前のＭ個の単語による窓と、その単語境界の直後のＭ個の単語による窓を指定し、各窓毎に、該窓に含まれる単語の出現頻度ベクトルを含む、該窓の意味を表すベクトルを算出し、前後の窓に対応するベクトル間の、余弦測度を始めとする類似度を当該単語境界の結束度として算出する結束度算出手段３と、
結束度が極小となる単語境界を求め、極小点あるいは、該極小点に直近の音声セグメント境界をトピック境界と認定するトピック境界認定手段４と、を有する。
【００２０】
本発明（請求項６）は、映像コンテンツや音声コンテンツに含まれる音声を認識した結果得られたデータをトピック単位に分割するためのトピック境界決定装置であって、
各音声セグメントに対して認識スコアの高い順に出力された複数の認識結果テキスト（以下、ＮＢＥＳＴ候補と記す）、該ＮＢＥＳＴ候補に対する単語分割結果、及び、該単語分割結果の各単語に開始時刻情報が付与されているデータからなる音声認識結果データが入力されると、
各音声セグメントに対して１以上のＮＢＥＳＴ候補を採用し、各音声セグメント毎に、採用した１以上のＮＢＥＳＴ候補のそれぞれに含まれる単語集合をマージして、マージされた単語集合を、単語の開始時刻情報の順に該単語を昇順にソートした単語列にする単語配列手段と、
ソートされた単語列から付属語を含む不要語を削除する不要語除去手段と、
一定の単語数Ｍの単語列中の単語の範囲を窓とし、全音声セグメントの単語列をつなげてできる単語列において、各単語境界に対して、その単語境界の直前のＭ個の単語による窓と、その単語境界の直後のＭ個の単語による窓を指定し、各窓毎に、該窓に含まれる単語の出現頻度ベクトルを含む、該窓の意味を表すベクトルを算出し、前後の窓に対応するベクトル間の、余弦測度を始めとする類似度を当該単語境界の結束度として算出する結束度算出手段と、
結束度が極小となる単語境界を求め、極小点あるいは、該極小点に直近の音声セグメント境界をトピック境界と認定するトピック境界認定手段と、
を有し、
結束度算出手段において、
認識結果テキスト中の各単語に認識スコア情報があるデータが入力されると、各単語毎に該単語を検索キーとして、単語と該単語の意味表現であるベクトルの対の集合が格納された概念ベースを検索して、該単語に対応するベクトルを取得する手段と、
窓の意味を表すベクトルとして、該窓に含まれる単語のベクトルの、認識スコアを重みとする重み付き平均を算出する手段と、を有する。
【００２１】
本発明（請求項７）のトピック境界決定装置は、映像コンテンツや音声コンテンツに含まれる音声を認識した結果得られたデータをトピック単位に分割するためのトピック境界決定装置であって、
各音声セグメントに対して認識スコアの高い順に出力された複数の認識結果テキスト（以下、ＮＢＥＳＴ候補と記す）、該ＮＢＥＳＴ候補に対する単語分割結果、及び、該単語分割結果の各単語に開始時刻情報が付与されているデータからなる音声認識結果データが入力されると、
各音声セグメントの認識結果テキストの末尾の単語列の表記や品詞、句読点を含む情報から、該音声セグメントが文の中途であるかどうかを判断する文中途判断手段と、
各音声セグメントに対して１以上のＮＢＥＳＴ候補を採用し、各音声セグメント毎に、採用した１以上のＮＢＥＳＴ候補のそれぞれに含まれる単語集合をマージして、マージされた単語集合を、単語の開始時刻情報の順に該単語を昇順にソートした単語列にする単語配列手段と、
ソートされた単語列から付属語を含む不要語を削除する不要語除去手段と、
一定の単語数Ｍの単語列中の単語の範囲を窓とし、全音声セグメントの単語列をつなげてできる単語列において、各単語境界に対して、その単語境界の直前のＭ個の単語による窓と、その単語境界の直後のＭ個の単語による窓を指定し、各窓毎に、該窓に含まれる単語の出現頻度ベクトルを含む、該窓の意味を表すベクトルを算出し、前後の窓に対応するベクトル間の、余弦測度を始めとする類似度を当該単語境界の結束度として算出する結束度算出手段と、
文中途判断手段で文の中途であると判断された音声セグメントの直後の境界を除く音声セグメント境界集合の中で、結束度が極小となる極小点に直近の音声セグメント境界をトピック境界と認定するトピック境界認定手段と、を有する。
【００２２】
本発明（請求項８）のトピック境界決定装置は、映像コンテンツや音声コンテンツに含まれる音声を認識した結果得られたデータをトピック単位に分割するためのトピック境界決定装置であって、
各音声セグメントに対して認識スコアの高い順に出力された複数の認識結果テキスト（以下、ＮＢＥＳＴ候補と記す）、該ＮＢＥＳＴ候補に対する単語分割結果、及び、該単語分割結果の各単語に開始時刻情報が付与されているデータからなる音声認識結果データとセグメント対象の映像コンテンツに含まれているテロップを文字認識により認識した結果得られるデータであって、各テロップ認識結果テキストに開始時刻情報を含むデータが入力されると、
各テロップ認識結果テキストを音声セグメント列の中に、開始時刻情報が昇順となるように挿入するテロップ認識結果テキスト挿入手段と、
各テロップ認識結果テキストを単語分割するテロップ認識結果テキスト単語分割手段と、
各音声セグメントに対して１以上のＮＢＥＳＴ候補を採用し、各音声セグメント毎に、採用した１以上のＮＢＥＳＴ候補のそれぞれに含まれる単語集合をマージして、マージされた単語集合を、単語の開始時刻情報の順に該単語を昇順にソートした単語列にする単語配列手段と、
ソートされた単語列から付属語を含む不要語及びテロップ認識結果テキスト単語分割手段で得られた単語で付属語を含む不要語を削除する不要語除去手段と、
一定の単語数Ｍの単語列中の単語の範囲を窓とし、全音声セグメント及びテロップ中の単語列をつなげてできる単語列において、各単語境界に対して、その単語境界の直前のＭ個の単語による窓と、その単語境界の直後のＭ個の単語による窓を指定し、各窓毎に、該窓に含まれる単語の出現頻度ベクトルを含む、該窓の意味を表すベクトルを算出し、前後の窓に対応するベクトル間の、余弦測度を始めとする類似度を当該単語境界の結束度として算出する結束度算出手段と、
結束度が極小となる単語境界を求め、極小点あるいは該極小点に直近の音声セグメント境界またはテロップ間の境界をトピック境界と認定するトピック境界認定手段と、を有する。
【００２３】
本発明（請求項９）は、コンピュータを、請求項５乃至８記載のトピック境界決定装置として機能させるプログラムである。
【００２４】
また、本発明のトピック境界決定プログラムは、各音声セグメントの認識結果テキストの末尾の単語列の表記や品詞、句読点を含む情報から、該音声セグメントが文の中途であるかどうかを判断する文中途判断ステップを有し、
トピック境界認定ステップにおいて、文中途判断ステップで文の中途であると判断された音声セグメントの直後の境界を除く音声セグメント境界集合の中で、結束度が極小となる極小点に直近の音声セグメント境界をトピック境界と認定するステップを含む。
【００２５】
また、本発明のトピック境界決定プログラムは、セグメント対象の映像コンテンツに含まれているテロップを文字認識により認識した結果得られるデータであって、各テロップ認識結果テキストに開始時刻情報を含むデータが入力されると、
各テロップ認識結果テキストを音声セグメント列の中に、開始時刻情報が昇順となるように挿入するテロップ認識結果テキスト挿入ステップと、
各テロップ認識結果テキストを単語分割するテロップ認識結果テキスト単語分割ステップを更に有し、
不要語除去ステップにおいて、テロップ認識結果テキスト単語分割ステップで得られた単語で付属語を含む不要語を除去し、
結束度算出ステップで、全音声セグメント及びテロップ中の単語の配列において結束度を計算し、
トピック境界認定ステップにおいて、結束度が極小となる極小点あるいは極小点に直近の音声セグメントまたは、テロップ間の境界をトピック境界と認定する。
【００２６】
【発明の実施の形態】
以下、図面と共に本発明の実施の形態について説明する。
【００２７】
図３は、本発明の第１の実施の形態におけるトピック境界決定装置の構成を示す。
【００２８】
同図に示すトピック境界決定装置は、データ入力部５、単語配列部１、不要語除去部２、結束度算出部３、トピック境界認定部４、トピック境界検出結果出力部６から構成される。
【００２９】
データ入力部５は、図４に示すようなＸＭＬ形式の音声認識結果データを入力する。図４に示すデータにおいて、ＳＥＧＭＥＮＴ要素が１音声セグメントの情報であり、ＳＥＧＭＥＮＴ要素の“begin ”，“end ” 属性が当該音声セグメントの開始時刻、終了時刻を表す。
ＮＢＥＳＴ要素がＮＢＥＳＴ候補であり、その“score ”，“rank”属性は、それぞれ認識スコア、上位何番目の候補かを表す。各音声セグメント毎にＮＢＥＳＴ候補が一般に複数ある。なお、一つの音声認識結果テキストが一つのＮＢＥＳＴ候補に対応しており、音声認識処理のスコアの高い順に得られる認識結果候補のそれぞれをＮＢＥＳＴ候補という。
ＴＥＸＴ要素は、対応するＮＢＥＳＴ候補の音声認識結果テキストであり、ＷＯＲＤ要素は、TEXT要素の内容を構成する単語である。ＷＯＲＤ要素の“begin ”，“end ”，“score ”，“pos ”属性は、当該単語の開始時刻、終了時刻、認識スコア、品詞情報を表す。
【００３０】
単語配列部１は、各音声セグメントに対して所定の個数のＮＢＥＳＴ候補を採用し、各音声セグメント毎に、採用した各ＮＢＥＳＴ候補に含まれる単語集合をマージして単語の開始時刻情報の順に該単語を昇順にソートする。
【００３１】
不要語除去部２は、ソートされた単語の並びである単語列から付属語を含む不要語を除去する。
【００３２】
結束度算出部３は、全音声セグメントの単語列をつなげてできる単語列において、各単語境界の前後に一定の単語数の単語列中の単語の範囲（以下、窓と記す）を指定し、各窓毎に、該窓に含まれる単語の出現頻度ベクトルを含む、該窓の意味を表すベクトルを算出し、前後の該窓に対応するベクトル間の、余弦測度を始めとする類似度を当該単語境界の結束度として算出する。
【００３３】
トピック境界認定部４は、結束度が極小となる単語境界を求め、極小点あるいは、該極小点に直近の音声セグメント境界をトピック境界と認定する。
【００３４】
トピック境界検出結果出力部６は、トピック境界認定部４により認定されたトピック境界を出力する。
【００３５】
次に、上記の構成における動作を説明する。
【００３６】
図５は、本発明の第１の実施の形態における動作のフローチャートである。
【００３７】
ステップ１０１）データ入力部５において、音声認識結果データとして、各音声セグメントに対して認識スコアの高い順に複数のＮＢＥＳＴ候補と、当該ＮＢＥＳＴ候補に対する単語分割結果及び、当該単語分割結果の各単語に開始時刻情報が付与されているデータを入力する。
【００３８】
ステップ１０２）単語配列部１は、入力されたデータの各音声セグメントに対して所定の個数のＮＢＥＳＴ候補を採用する。ここで、所定の個数とは、１以上の整数または、全ＮＢＥＳＴ候補である。そして、各音声セグメント毎に、採用した各ＮＢＥＳＴ候補に含まれる単語集合をマージして単語の開始時刻情報の順に単語をソートする。図６に、上位２個のＮＢＥＳＴ候補を採用したときの単語配列処理の結果を示す。
【００３９】
ステップ１０３）次に、不要語除去部２は、単語列からトピックセグメンテーションに関係がないと考えられる付属語等の不要語を除去する。ここで、入力されたデータの単語情報には、単語表記や品詞の情報があり、この情報から助詞や助動詞などの付属語を抽出する。これらの助詞や助動詞は、トピックセグメンテーションには影響を及ぼさないと考えられ、このような単語を不要語と判断する。不要語除去を実現するために、不要語であると判断するロジックをプログラムとして実現してもよいし、または、外部テーブルとして不要語リスト（不要語とみなす単語表記や、品詞を記述する）を用意して、当該不要語リストを不要語除去処理を行うプログラムが参照してもよい。図６の各ＷＯＲＤ要素のpos 属性の値により、付属語を削除し得られた結果を図７に示す。
【００４０】
ステップ１０４）結束度算出部３は、全音声セグメントの単語列をつなげてできる単語列において、各単語境界の前後に一定の単語数の単語列の窓をとり、各窓毎に、窓に含まれる単語の出現頻度ベクトル等の窓の意味を表すベクトルを算出し、前後の窓に対応するベクトル間の余弦測度等の類似度を当該単語境界の結束度として算出する。この例を図８に示す。ＮＢＥＳＴ候補を複数とることに応じて、結束度計算の窓幅はより長くとる。認識の信頼性の高い単語（例えば、図７中の『調整』、『九州』）は、より多くのＮＢＥＳＴ候補に出現すると考えられるので、各窓に多く出現し、窓の意味を表すベクトルは、信頼性の高い単語の影響が大きく、逆に信頼性の低い単語の影響は少なくなる。よって、窓の意味を表すベクトル及び結束度は、従来の技術に比べより適切なものとなる。
【００４１】
ステップ１０５）トピック境界認定部４は、結束度が極小となる単語境界を求め、当該極小点あるいは当該極小点に直近の音声セグメント境界をトピック境界と認定する。
【００４２】
ステップ１０６）トピック境界検出結果出力部６は、トピック境界認定部４で認定されたトピック境界を出力する。
【００４３】
［第２の実施の形態］
本実施の形態では、結束度算出部３において、概念ベースを用いた場合について説明する。
【００４４】
図９は、本発明の第２の実施の形態におけるトピック境界決定装置の構成を示す。図３の構成と同一部分については同一符号を付し、その説明を省略する。
【００４５】
また、図１０は、本発明の第２の実施の形態における概念ベースの例を示す。
概念ベース１０には、単語と当該単語の意味表現であるベクトルの対の集合が格納されており、ベクトル値が近ければ対応する単語の意味も近いという性質を持っている。なお、概念ベース１０は、データベース等の記憶手段に格納されているものとする。
【００４６】
本実施の形態では、前述の第１の実施の形態の構成に当該概念ベース１０を追加した構成である。これにより、第１の実施の形態における図５のフローチャートのステップ１０４において、結束度算出部３が、図１０に示す単語毎にその意味表現であるベクトルが割り当てられている概念ベース１０を、各単語毎に当該単語を検索キーとして検索し、当該単語に対応するベクトルを取得する。
【００４７】
窓に含まれる単語ベクトルの集合をνr （１≦ｒ≦ｓ）、単語ベクトルνr に対応する単語の認識スコアをｇr とする。窓に対応するベクトルとして、当該窓に含まれる単語のベクトルの、認識スコアを重みとする重み付き平均を算出すると、
【００４８】
【数１】
$$\frac{\sum_{r=1}^{s} g_r \nu_r}{\sum_{r=1}^{s} g_r}$$
をとる。このような計算をすることで、認識スコアの高い単語の影響が大きく、逆に認識スコアの低い単語の影響は少なくなるので、より窓の意味を適切に反映しており、その結果、結束度もより適切なものとなる。
【００４９】
［第３の実施の形態］
図１１は、本発明の第３の実施の形態におけるトピック境界決定装置の構成を示す。同図に示すトピック境界決定装置は、図９の構成に文中途判断部７及び文中途判断ベース２０が付加された構成である。なお、文中途判断ベース２０は、データベース等の記憶手段に格納されている。同図において図９と同一構成部分には同一符号を付し、その説明を省略する。
【００５０】
文中途判断部７は、各音声セグメントの認識結果テキストの末尾の単語列の表記や品詞、句読点等の情報から当該音声セグメントが文の途中であるかどうかを、文中途判断ベース２０を参照して判断する。
【００５１】
これにより、トピック境界認定部４は、文中途判断部７で文の中途と判断された音声セグメントの直後の境界を除く音声セグメント境界集合の中で、結束度が極小となる極小点に直近の音声セグメント境界をトピック境界と認定する。
【００５２】
図１２は、本発明の第３の実施の形態における動作のフローチャートである。同図において、ステップ２０１と図５のステップ１０１は同様であり、また、ステップ２０３〜ステップ２０５及びステップ２０７は、図５のステップ１０２〜ステップ１０４及びステップ１０６と同様であるので、その説明は省略する。
【００５３】
文中途判断部７は、データが入力されると（ステップ２０１）各音声セグメントの例えば最尤（ｒａｎｋ＝“１”）のＮＢＥＳＴ候補の末尾の単語列の表記や品詞、句読点等の情報から、当該音声セグメントが文の途中であるかどうかを判断する。例えば、文の中途と認定できる単語列の情報を文中途判断ベース２０に格納しておき、ＮＢＥＳＴ候補の末尾の単語列が文中途判断ベース２０中のいずれかの単語列とマッチした場合に、当該音声セグメントが文の中途であると判断する（ステップ２０２）。
【００５４】
図１３は、本発明の第３の実施の形態における文中途判断ベースの例を示す。文中途判断ベース２０の各レコードは、「単語表記；品詞情報」の列となっている。例えば、ＮＢＥＳＴ候補の単語列が『台風（名詞）・に（格助詞）・見舞（動詞語幹）・わ（動詞活用語尾）・れ（動詞接尾辞）』であったなら、文中途判断ベース中の２番目のレコードにマッチするので、当該音声セグメントは、文の中途であると判断する。
【００５５】
トピック境界認定部４では、文中途判断部７で文の途中と判断された音声セグメントの直後の境界は文の中途なので、音声セグメント境界集合から外した上で、結束度が極小となる極小点に直近の音声セグメント境界をトピック境界と認定する（ステップ２０６）。これにより、トピック境界は常に文と文の間になり、文の中途となることはないので、セグメンテーションの精度が向上する。
【００５６】
［第４の実施の形態］
図１４は、本発明の第４の実施の形態におけるトピック境界決定装置の構成を示す。同図に示すトピック境界決定装置は、図９の構成に、テロップ認識結果テキスト挿入部８、テロップ認識結果テキスト単語分割部９を付加した構成である。図１４の構成において、図９の構成と同一部分には、同一符号を付しその説明を省略する。
【００５７】
但し、データ入力部５から入力されるデータは、セグメント対象の映像コンテンツに含まれているテロップを文字認識により認識した結果得られるデータであり、各テロップ認識結果テキストに開始時刻情報があるデータも入力される。
【００５８】
テロップ認識結果テキスト挿入部８は、各テロップ認識結果テキストを音声セグメント列の中に、開始時刻情報が昇順となるように挿入する。
【００５９】
テロップ認識結果テキスト単語分割部９は、各テロップ認識結果テキストを単語分割する。
【００６０】
図１５は、本発明の第４の実施の形態における動作のフローチャートである。
ステップ３０１）データ入力部５は、セグメント対象の映像コンテンツに含まれているテロップを文字認識により認識した結果得られるデータ（テロップ認識結果テキスト）を入力とする。入力されるデータの例を図１６に示す。ＴＥＬＯＰ要素が１テロップの情報であり、ＴＥＬＯＰ要素のbegin, end属性が当該テロップの開始時刻、終了時刻である。
【００６１】
ステップ３０２）テロップ認識結果テキスト挿入部８は、各テロップ認識結果テキストを音声セグメント列の中に開始時刻情報が昇順となるように挿入する。図１７に、図１６におけるＴＥＬＯＰ要素を前述の図４の音声認識結果データ中に挿入した結果の例を示す。ＳＥＧＭＥＮＴ要素及びＴＥＬＯＰ要素の“begin ”が昇順となるように、ＳＥＧＭＥＮＴ要素とＴＥＬＯＰ要素が配置されている。図４では、ＳＥＧＭＥＮＴ要素に“begin ”属性があるが、ない場合は、ＳＥＧＭＥＮＴ要素内のＷＯＲＤ要素の“begin ”の最小値をとってもよい。
【００６２】
ステップ３０３）テロップ認識結果テキスト単語分割部９は、各テロップ認識結果テキストを単語分割する。図１８に、ＴＥＬＯＰ要素のテキストを単語分割した結果のデータを示す。ＴＥＬＯＰ要素におけるＷＯＲＤ要素が、単語分割して得られた単語である。ＷＯＲＤ要素には品詞情報である“pos ”属性がある。
【００６３】
ステップ３０４）単語配列部１は、前述の第１の実施の形態と同様の処理を行う。当該処理によって得られた結果を図１９に示す。
【００６４】
ステップ３０５）不要語除去部２は、音声セグメント及びテロップにおける単語列から付属語等の不要語を除去する。図１９に示すＳＥＧＭＥＮＴ要素及びＴＥＬＯＰ要素における各ＷＯＲＤ要素から、“pos ”属性の値に基づいて、付属語のＷＯＲＤ要素を除去して得られた結果を図２０に示す。
【００６５】
ステップ３０６）結束度算出部３は、全ＳＥＧＭＥＮＴ要素及びＴＥＬＯＰ要素中のＷＯＲＤ要素の配列において、結束度を算出する。
【００６６】
ステップ３０７）トピック境界認定部４は、結束度が極小となる極小点をトピック境界と認定する。あるいは、ＳＥＧＭＥＮＴ要素及びＴＥＬＯＰ要素列から、極小点に直近のＳＥＧＭＥＮＴ要素−ＳＥＧＭＥＮＴ要素間、ＳＥＧＭＥＮＴ要素−ＴＥＬＯＰ要素間、ＴＥＬＯＰ要素−ＴＥＬＯＰ要素間の境界をトピック境界と認定する。
【００６７】
ステップ３０８）トピック境界検出結果出力部６は、認定されたトピック境界を出力する。
【００６８】
テロップ「台風の情報」は、トピックの見出し相当のテキストであり、この直前の境界がトピック境界と認定される可能性が高くなる。
【００６９】
本発明は、上記の概念ベース、文中途判断ベースをデータベース等の記憶手段に格納した上で、前述の第１から第４の実施の形態の動作をプログラムとして構築し、トピック境界決定装置として利用されるコンピュータにインストールする、または、ネットワークを介して流通させることも可能である。
【００７０】
また、構築されたプログラムをトピック境界決定装置として利用されるコンピュータに接続されるハードディスク装置や、フレキシブルディスク、ＣＤ−ＲＯＭ等の可搬記憶媒体に格納しておき、本発明を実施する際にインストールし、ＣＰＵ等の制御装置で制御することも可能である。
【００７１】
なお、本発明は、上記の実施の形態に限定されることなく、特許請求の範囲内において、種々変更・応用が可能である。
【００７２】
【発明の効果】
上述のように、本発明によれば、映像コンテンツや音声コンテンツのトピックセグメンテーションにおいて、従来の技術よりも高い精度を実現することができる。
【図面の簡単な説明】
【図１】本発明の原理を説明するための図である。
【図２】本発明の原理構成図である。
【図３】本発明の第１の実施の形態におけるトピック境界決定装置の構成図である。
【図４】本発明の第１の実施の形態における音声認識結果データの例である。
【図５】本発明の第１の実施の形態における動作のフローチャートである。
【図６】本発明の第１の実施の形態における単語配列部が上位２個のＮＢＥＳＴ候補を採用した場合の単語配列処理結果の例である。
【図７】本発明の第１の実施の形態における不要語を除去した例である。
【図８】本発明の第１の実施の形態における結束度算出処理を説明するための図である。
【図９】本発明の第２の実施の形態におけるトピック境界決定装置の構成図である。
【図１０】本発明の第２の実施の形態における概念ベースの例である。
【図１１】本発明の第３の実施の形態におけるトピック境界決定装置の構成図である。
【図１２】本発明の第３の実施の形態における動作のフローチャートである。
【図１３】本発明の第３の実施の形態における文中途判断ベースの例である。
【図１４】本発明の第４の実施の形態におけるトピック境界決定装置の構成図である。
【図１５】本発明の第４の実施の形態における動作のフローチャートである。
【図１６】本発明の第４の実施の形態における入力されるテロップ認識結果テキストの例である。
【図１７】本発明の第４の実施の形態におけるＴＥＬＯＰ要素を音声認識結果データに挿入した例である。
【図１８】本発明の第４の実施の形態におけるＴＥＬＯＰ要素のテキストを単語分割した結果である。
【図１９】本発明の第４の実施の形態における単語配列処理の結果である。
【図２０】本発明の第４の実施の形態における不要語を除去した例である。
【符号の説明】
１単語配列手段、単語配列部
２不要語除去手段、不要語除去部
３結束度算出手段、結束度算出部
４トピック境界認定手段、トピック境界認定部
５データ入力部
６トピック境界検出結果出力部
７文中途判断部
８テロップ認識結果テキスト挿入部
９テロップ認識結果テキスト単語分割部
１０概念ベース
２０文中途判断ベース

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a topic boundary determination method and apparatus and a topic boundary determination program, and more particularly to a topic boundary determination method and apparatus and a topic boundary determination program for dividing video content and audio content into topic units.
[0002]
[Prior art]
As prior art, the Hearst method divides text into topic units (see, for example, Non-Patent Documents 1 and 2). In the Hearst method, the text is split into words and unnecessary words are removed. Then, on each side of every word boundary, a window consisting of a fixed number of words is taken; for each window, the occurrence frequency vector of the words in the window is computed; and the cosine measure between the vectors of the preceding and following windows is calculated as the cohesion degree of that word boundary. A word boundary at which the cohesion degree reaches a local minimum, or the sentence boundary nearest to it, is recognized as a topic boundary.
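The window-based cohesion calculation described above can be illustrated with a minimal Python sketch. It is not the implementation of the cited papers or of this patent: the function names, the toy word list, and the strict "lower than both neighbours" rule for a local minimum are assumptions made for illustration only.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine measure between two sparse frequency vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cohesion_scores(words, window):
    """Cohesion at each word boundary i (between words[i-1] and words[i])
    that has a full window of `window` words on both sides."""
    scores = {}
    for i in range(window, len(words) - window + 1):
        before = Counter(words[i - window:i])
        after = Counter(words[i:i + window])
        scores[i] = cosine(before, after)
    return scores

def local_minima(scores):
    """Boundaries whose cohesion is strictly lower than both neighbours
    (an assumed rule; plateaus are not handled)."""
    return [i for i in scores
            if i - 1 in scores and i + 1 in scores
            and scores[i] < scores[i - 1] and scores[i] < scores[i + 1]]

# Toy input: unnecessary words are assumed to have been removed already.
words = ["typhoon", "rain", "typhoon", "wind", "rain", "typhoon",
         "election", "vote", "party", "election", "vote", "party"]
scores = cohesion_scores(words, window=3)
minima = local_minima(scores)  # boundary 6 separates the two topics
```

In this toy input the two windows around boundary 6 share no words, so the cohesion degree there is zero and the boundary is the only local minimum.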
[0003]
A method has also been proposed in which, for each word, a concept base storing a set of pairs of a word and a vector that is its semantic representation is searched using the word as the search key to obtain the vector corresponding to that word, and the centroid of the vectors of the words in a window is taken as the vector corresponding to the window.
[0004]
[Non-Patent Document 1]
Hearst, M.A.: Multi-Paragraph Segmentation of Expository Text, 32nd Annual Meeting of the Association for Computational Linguistics, pp.9-16 (1994).
[Non-Patent Document 2]
Hearst, M.A.: TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages, Computational Linguistics, Vol.23, No.1, pp.33-64 (1997).
[0005]
[Problems to be solved by the invention]
However, when the segmentation target is text obtained by applying speech recognition to the audio in video content or audio content, the text contains misrecognized words. A first problem is therefore that the prior art described above does not calculate the cohesion degree appropriately.
[0006]
In addition, speech segments are delimited by pauses, and a sentence is often split into separate speech segments partway through. In the prior art, a speech segment boundary recognized as a topic boundary may fall in the middle of a sentence, which gives rise to a second problem: the accuracy of segmentation is lowered.
[0007]
In video content, telops (on-screen captions) are often used as information that supplements the audio, and in some cases a telop plays a role like that of a topic heading. In video content, the audio and the telops together often carry the necessary and sufficient amount of information, so a third problem is that segmentation based on audio alone does not achieve sufficient accuracy.
[0008]
The present invention has been made in view of the above points, and an object of the present invention is to provide a topic boundary determination method and apparatus and a topic boundary determination program capable of detecting semantic boundaries correctly and accurately from speech recognition results.
[0009]
[Means for Solving the Problems]
FIG. 1 is a diagram for explaining the principle of the present invention.
[0010]
  The present invention (Claim 1) is a topic boundary determination method for dividing data obtained as a result of recognizing audio included in video content or audio content into topic units, the method comprising:
  when speech recognition result data is input, the data consisting of a plurality of recognition result texts output for each speech segment in descending order of recognition score (hereinafter referred to as NBEST candidates), a word division result for each NBEST candidate, and start time information assigned to each word of the word division result,
  a word arrangement process (step 1) of adopting a plurality of NBEST candidates for each speech segment, merging, for each speech segment, the word sets contained in each of the adopted NBEST candidates, and turning the merged word set into a word string in which the words are sorted in ascending order of their start time information;
  an unnecessary word deletion process (step 2) of deleting unnecessary words, including attached words, from the sorted word string;
  a cohesion degree calculation process (step 3) of, taking a range of M consecutive words in a word string as a window, designating, for each word boundary in the word string formed by concatenating the word strings of all speech segments, a window of the M words immediately before the word boundary and a window of the M words immediately after it, calculating for each window a vector representing the meaning of the window, such as the occurrence frequency vector of the words in the window, and calculating a similarity, such as the cosine measure, between the vectors of the preceding and following windows as the cohesion degree of that word boundary; and
  a topic boundary recognition process (step 4) of finding the word boundaries at which the cohesion degree reaches a local minimum and recognizing each such minimum point, or the speech segment boundary nearest to it, as a topic boundary.
[0011]
In this way, the window width of the cohesion degree calculation is made longer in accordance with taking a plurality of NBEST candidates. In a plurality of NBEST candidates, it is considered that a word with high recognition reliability appears in more NBEST candidates. Accordingly, since the number of appearances in the window is increased, the vector representing the meaning of the window is greatly influenced by a highly reliable word, and conversely, the influence of a word having low reliability is reduced. Therefore, the vector representing the meaning of the window and the degree of cohesion are more appropriate than those of the conventional technology.
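The word arrangement and unnecessary word deletion steps, and the reason reliable words gain weight when several NBEST candidates are merged, can be sketched as follows. The data layout (tuples of start time, surface form, and part of speech), the part-of-speech tag names, and the sample words are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Assumed part-of-speech tags that mark attached words (unnecessary words).
STOP_POS = {"particle", "auxiliary"}

def merge_candidates(candidates, n_best):
    """Step 1: merge the words of the top n_best candidates of one speech
    segment and sort them in ascending order of start time."""
    merged = [w for cand in candidates[:n_best] for w in cand]
    # sorted() is stable, so words with equal start times keep rank order.
    return sorted(merged, key=lambda w: w[0])

def remove_unnecessary(words):
    """Step 2: drop attached words according to their part of speech."""
    return [w for w in words if w[2] not in STOP_POS]

# One speech segment with two candidates: (start_time, surface, pos).
segment = [
    [(0.0, "typhoon", "noun"), (0.5, "ga", "particle"),
     (0.7, "Kyushu", "noun"), (1.2, "approach", "noun")],   # rank 1
    [(0.0, "typhoon", "noun"), (0.5, "ga", "particle"),
     (0.7, "cushion", "noun"), (1.2, "approach", "noun")],  # rank 2
]

merged = merge_candidates(segment, n_best=2)
content = [surface for _, surface, _ in remove_unnecessary(merged)]
counts = Counter(content)
```

Words that both candidates agree on ("typhoon", "approach") occur twice in the merged string, while the word on which the candidates disagree occurs once per variant, so a frequency vector built over this string leans toward the words that are more likely to be correct.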
[0012]
  The present invention (Claim 2) is also a topic boundary determination method for dividing data obtained as a result of speech recognition of audio included in video content or audio content into topic units, the method comprising:
  when speech recognition result data is input, the data consisting of a plurality of recognition result texts output for each speech segment in descending order of recognition score (hereinafter referred to as NBEST candidates), a word division result for each NBEST candidate, and start time information assigned to each word of the word division result,
  a word arrangement process of adopting one or more NBEST candidates for each speech segment, merging, for each speech segment, the word sets contained in each of the adopted NBEST candidates, and turning the merged word set into a word string in which the words are sorted in ascending order of their start time information;
  an unnecessary word deletion process of deleting unnecessary words, including attached words, from the sorted word string;
  a cohesion degree calculation process of, taking a range of M consecutive words in a word string as a window, designating, for each word boundary in the word string formed by concatenating the word strings of all speech segments, a window of the M words immediately before the word boundary and a window of the M words immediately after it, calculating for each window a vector representing the meaning of the window, such as the occurrence frequency vector of the words in the window, and calculating a similarity, such as the cosine measure, between the vectors of the preceding and following windows as the cohesion degree of that word boundary; and
  a topic boundary recognition process of finding the word boundaries at which the cohesion degree reaches a local minimum and recognizing each such minimum point, or the speech segment boundary nearest to it, as a topic boundary,
  wherein, when data having recognition score information for each word in the recognition result text is input,
  in the cohesion degree calculation process,
  for each word, a concept base storing a set of pairs of a word and a vector that is the semantic representation of the word is searched using the word as the search key to obtain the vector corresponding to the word, and
  as the vector representing the meaning of a window, a weighted average of the vectors of the words in the window is calculated, with the recognition scores as weights.
[0013]
In this way, by taking a weighted average with the recognition score as a weight, the vector representing the meaning of the window is greatly influenced by a word with a high recognition score, and conversely, the influence of a word with a low recognition score is reduced. Compared to the prior art that takes the center of gravity without weight, it is more appropriate. As a result, a more appropriate cohesion degree can be calculated.
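A minimal sketch of the score-weighted window vector follows, assuming the concept base can be modeled as a dictionary from words to fixed-length vectors. The vectors, scores, and the choice to skip words missing from the concept base are illustrative assumptions, not details given in the text.

```python
def weighted_window_vector(words, concept_base):
    """Weighted average of word vectors, sum(g_r * v_r) / sum(g_r), where
    g_r is the recognition score of word r and v_r is its vector.
    `words` is a list of (surface, recognition_score) pairs."""
    dim = len(next(iter(concept_base.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for surface, score in words:
        vec = concept_base.get(surface)
        if vec is None:
            # Assumption: words absent from the concept base are skipped.
            continue
        for k in range(dim):
            total[k] += score * vec[k]
        weight_sum += score
    return [t / weight_sum for t in total] if weight_sum else total

# Hypothetical two-dimensional concept base.
concept_base = {
    "typhoon": [1.0, 0.0],
    "rain": [0.8, 0.2],
    "election": [0.0, 1.0],
}

window = [("typhoon", 0.9), ("rain", 0.9), ("election", 0.2)]
weighted = weighted_window_vector(window, concept_base)

# Setting every score to 1 gives the unweighted centroid of the prior art.
centroid = weighted_window_vector([(w, 1.0) for w, _ in window], concept_base)
```

With these toy numbers the low-score word "election" pulls the unweighted centroid to roughly (0.6, 0.4), whereas the weighted average stays near (0.81, 0.19), closer to the two high-score words.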
[0014]
  The present invention (Claim 3) is a topic boundary determination method for dividing data obtained as a result of speech recognition of audio included in video content or audio content into topic units, the method comprising:
  when speech recognition result data is input, the data consisting of a plurality of recognition result texts output for each speech segment in descending order of recognition score (hereinafter referred to as NBEST candidates), a word division result for each NBEST candidate, and start time information assigned to each word of the word division result,
  a mid-sentence judgment process of judging, from information including the surface form, part of speech, and punctuation of the word string at the end of the recognition result text of each speech segment, whether that speech segment is in the middle of a sentence;
  a word arrangement process of adopting one or more NBEST candidates for each speech segment, merging, for each speech segment, the word sets contained in each of the adopted NBEST candidates, and turning the merged word set into a word string in which the words are sorted in ascending order of their start time information;
  an unnecessary word deletion process of deleting unnecessary words, including attached words, from the sorted word string;
  a cohesion degree calculation process of, taking a range of M consecutive words in a word string as a window, designating, for each word boundary in the word string formed by concatenating the word strings of all speech segments, a window of the M words immediately before the word boundary and a window of the M words immediately after it, calculating for each window a vector representing the meaning of the window, such as the occurrence frequency vector of the words in the window, and calculating a similarity, such as the cosine measure, between the vectors of the preceding and following windows as the cohesion degree of that word boundary; and
  a topic boundary recognition process of recognizing as a topic boundary, from the set of speech segment boundaries excluding the boundary immediately after each speech segment judged in the mid-sentence judgment process to be in the middle of a sentence, the speech segment boundary nearest to a minimum point at which the cohesion degree reaches a local minimum.
[0015]
As a result, the topic boundary is always between sentences and never in the middle of the sentence, so the segmentation accuracy is improved.
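The mid-sentence judgment and the restricted choice of segment boundary can be sketched as below. The pattern list, the part-of-speech tag names, and the representation of boundaries as word positions are assumptions introduced for this example.

```python
# Hypothetical pattern base: trailing part-of-speech sequences that mark an
# unfinished sentence. The tag names are illustrative assumptions.
MID_SENTENCE_PATTERNS = [
    ("case_particle",),
    ("verb_stem", "verb_ending", "verb_suffix"),
]

def is_mid_sentence(pos_tags):
    """True if the tail of the segment matches any mid-sentence pattern."""
    tags = tuple(pos_tags)
    return any(tags[-len(p):] == p for p in MID_SENTENCE_PATTERNS)

def nearest_allowed_boundary(minimum_pos, boundaries, ends_mid_sentence):
    """boundaries[k] is the word position right after segment k, and
    ends_mid_sentence[k] tells whether segment k was judged mid-sentence.
    Returns the allowed boundary nearest to the cohesion minimum, or None."""
    allowed = [b for b, mid in zip(boundaries, ends_mid_sentence) if not mid]
    if not allowed:
        return None
    return min(allowed, key=lambda b: abs(b - minimum_pos))

# Part-of-speech tags of the top-ranked candidate of three segments.
segments_pos = [
    ["noun", "verb_stem", "verb_ending", "period"],
    ["noun", "case_particle", "verb_stem", "verb_ending", "verb_suffix"],
    ["noun", "auxiliary", "period"],
]
boundaries = [4, 9, 15]

flags = [is_mid_sentence(tags) for tags in segments_pos]
chosen = nearest_allowed_boundary(10, boundaries, flags)
```

Here the cohesion minimum at word position 10 is closest to the boundary at position 9, but the second segment ends mid-sentence, so that boundary is excluded and the boundary at position 15 is chosen instead.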
[0016]
  The present invention (Claim 4) is a topic boundary determination method for dividing data obtained as a result of speech recognition of audio included in video content or audio content into topic units, the method comprising:
  when speech recognition result data, consisting of a plurality of recognition result texts output for each speech segment in descending order of recognition score (hereinafter referred to as NBEST candidates), a word division result for each NBEST candidate, and start time information assigned to each word of the word division result, is input together with data obtained by recognizing, through character recognition, the telops included in the video content to be segmented, the latter data including start time information for each telop recognition result text,
  a telop recognition result text insertion process of inserting each telop recognition result text into the sequence of speech segments so that the start time information is in ascending order;
  a telop recognition result text word division process of dividing each telop recognition result text into words;
  a word arrangement process of adopting one or more NBEST candidates for each speech segment, merging, for each speech segment, the word sets contained in each of the adopted NBEST candidates, and turning the merged word set into a word string in which the words are sorted in ascending order of their start time information;
  an unnecessary word deletion process of deleting unnecessary words, including attached words, from the sorted word string and from the words obtained in the telop recognition result text word division process;
  a cohesion degree calculation process of, taking a range of M consecutive words in a word string as a window, designating, for each word boundary in the word string formed by concatenating the word strings of all speech segments and telops, a window of the M words immediately before the word boundary and a window of the M words immediately after it, calculating for each window a vector representing the meaning of the window, such as the occurrence frequency vector of the words in the window, and calculating a similarity, such as the cosine measure, between the vectors of the preceding and following windows as the cohesion degree of that word boundary; and
  a topic boundary recognition process of finding the word boundaries at which the cohesion degree reaches a local minimum and recognizing each such minimum point, or the boundary between speech segments or telops nearest to it, as a topic boundary.
[0017]
Thus, when telops are merged with the audio, a telop often serves as text equivalent to a topic heading. Words representative of the topic appear in concentration in such heading-equivalent text, so the cohesion degree after that text becomes particularly high, the valley in cohesion degree at the boundary immediately before the heading-equivalent text becomes deeper, and that point is more readily recognized as a topic boundary. The accuracy of segmentation therefore increases. Moreover, even when a telop is not equivalent to a heading, the audio and the telops together often carry the necessary and sufficient information, so merging the telops with the audio is considered to allow topic boundaries to be detected more appropriately.
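The telop insertion step amounts to merging two time-stamped sequences. The sketch below assumes each speech segment and each telop is a dictionary with a `begin` time; the key names, times, and texts are hypothetical.

```python
def insert_telops(segments, telops):
    """Place telop recognition results among the speech segments so that
    the start times are in ascending order. sorted() is stable, so a speech
    segment precedes a telop that has the same start time."""
    return sorted(segments + telops, key=lambda item: item["begin"])

# Hypothetical input: three speech segments and one telop.
segments = [
    {"kind": "SEGMENT", "begin": 0.0, "text": "closing remarks on the election"},
    {"kind": "SEGMENT", "begin": 5.2, "text": "that is all for politics"},
    {"kind": "SEGMENT", "begin": 12.0, "text": "the typhoon is approaching"},
]
telops = [
    {"kind": "TELOP", "begin": 11.5, "text": "Typhoon information"},
]

timeline = insert_telops(segments, telops)
kinds = [item["kind"] for item in timeline]
```

The heading-like telop lands immediately before the first speech segment of the new topic, which is where the text above expects the valley in the cohesion degree to deepen.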
[0018]
FIG. 2 is a principle configuration diagram of the present invention.
[0019]
  The present invention (Claim 5) is a topic boundary determination device for dividing data obtained as a result of recognizing audio included in video content or audio content into topic units, the device comprising:
  when speech recognition result data is input, the data consisting of a plurality of recognition result texts output for each speech segment in descending order of recognition score (hereinafter referred to as NBEST candidates), a word division result for each NBEST candidate, and start time information assigned to each word of the word division result,
  word arrangement means 1 for adopting a plurality of NBEST candidates for each speech segment, merging, for each speech segment, the word sets contained in each of the adopted NBEST candidates, and turning the merged word set into a word string in which the words are sorted in ascending order of their start time information;
  unnecessary word removing means 2 for deleting unnecessary words, including attached words, from the sorted word string;
  cohesion degree calculating means 3 for, taking a range of M consecutive words in a word string as a window, designating, for each word boundary in the word string formed by concatenating the word strings of all speech segments, a window of the M words immediately before the word boundary and a window of the M words immediately after it, calculating for each window a vector representing the meaning of the window, such as the occurrence frequency vector of the words in the window, and calculating a similarity, such as the cosine measure, between the vectors of the preceding and following windows as the cohesion degree of that word boundary; and
  topic boundary recognition means 4 for finding the word boundaries at which the cohesion degree reaches a local minimum and recognizing each such minimum point, or the speech segment boundary nearest to it, as a topic boundary.
[0020]
  The present invention (Claim 6) is a topic boundary determination apparatus for dividing, into topic units, data obtained by recognizing the speech contained in video content or audio content. When speech recognition result data is input that consists of a plurality of recognition result texts output for each speech segment in descending order of recognition score (hereinafter referred to as NBEST candidates), a word division result for each NBEST candidate, and start time information assigned to each word of the word division result, the apparatus uses the following means:
  Word arrangement means for adopting one or more NBEST candidates for each speech segment, merging, for each speech segment, the word sets contained in the adopted NBEST candidates, and sorting the merged word set into a word string in ascending order of word start time information;
  Unnecessary word removal means for deleting unnecessary words, including attached words, from the sorted word string;
  Cohesion degree calculation means for taking a range of a fixed number M of words in a word string as a window, designating, for each word boundary in the word string formed by connecting the word strings of all speech segments, a window consisting of the M words immediately before the word boundary and a window consisting of the M words immediately after it, calculating for each window a vector representing the meaning of the window, such as the appearance frequency vector of the words included in the window, and calculating a similarity between the vectors corresponding to the preceding and following windows, such as the cosine measure, as the cohesion degree of that word boundary; and
  Topic boundary recognition means for finding the word boundaries at which the cohesion degree takes a local minimum and recognizing each minimum point, or the speech segment boundary closest to it, as a topic boundary.
  When data having recognition score information for each word in the recognition result text is input, the cohesion degree calculation means has:
  Means for searching, for each word and using the word as a search key, a concept base that stores a set of pairs each consisting of a word and a vector that is the semantic representation of that word, and acquiring the vector corresponding to the word; and
  Means for calculating, as the vector representing the meaning of the window, a weighted average of the vectors of the words included in the window, with the recognition scores as weights.
[0021]
  The present invention (Claim 7) is a topic boundary determination apparatus for dividing, into topic units, data obtained by recognizing the speech contained in video content or audio content. When speech recognition result data is input that consists of a plurality of recognition result texts output for each speech segment in descending order of recognition score (hereinafter referred to as NBEST candidates), a word division result for each NBEST candidate, and start time information assigned to each word of the word division result, the apparatus uses the following means:
  Mid-sentence determination means for determining whether each speech segment is in the middle of a sentence from information including the notation of the word string at the end of the recognition result text of the speech segment, parts of speech, and punctuation marks;
  Word arrangement means for adopting one or more NBEST candidates for each speech segment, merging, for each speech segment, the word sets contained in the adopted NBEST candidates, and sorting the merged word set into a word string in ascending order of word start time information;
  Unnecessary word removal means for deleting unnecessary words, including attached words, from the sorted word string;
  Cohesion degree calculation means for taking a range of a fixed number M of words in a word string as a window, designating, for each word boundary in the word string formed by connecting the word strings of all speech segments, a window consisting of the M words immediately before the word boundary and a window consisting of the M words immediately after it, calculating for each window a vector representing the meaning of the window, such as the appearance frequency vector of the words included in the window, and calculating a similarity between the vectors corresponding to the preceding and following windows, such as the cosine measure, as the cohesion degree of that word boundary; and
  Topic boundary recognition means for recognizing as a topic boundary, within the set of speech segment boundaries that excludes the boundaries immediately after speech segments determined by the mid-sentence determination means to be in the middle of a sentence, the speech segment boundary closest to each minimum point at which the cohesion degree takes a local minimum.
[0022]
  The present invention (Claim 8) is a topic boundary determination apparatus for dividing, into topic units, data obtained by recognizing the speech contained in video content or audio content. The apparatus receives speech recognition result data consisting of a plurality of recognition result texts output for each speech segment in descending order of recognition score (hereinafter referred to as NBEST candidates), a word division result for each NBEST candidate, and start time information assigned to each word of the word division result, together with data that is obtained by recognizing, through character recognition, the telops contained in the video content to be segmented and in which each telop recognition result text carries start time information. When these data are input, the apparatus uses the following means:
  Telop recognition result text insertion means for inserting each telop recognition result text into the speech segment sequence so that the start time information is in ascending order;
  Telop recognition result text word division means for dividing each telop recognition result text into words;
  Word arrangement means for adopting one or more NBEST candidates for each speech segment, merging, for each speech segment, the word sets contained in the adopted NBEST candidates, and sorting the merged word set into a word string in ascending order of word start time information;
  Unnecessary word removal means for deleting unnecessary words, including attached words, from the sorted word string and from the words obtained by the telop recognition result text word division means;
  Cohesion degree calculation means for taking a range of a fixed number M of words in a word string as a window, designating, for each word boundary in the word string formed by connecting the word strings of all speech segments and telops, a window consisting of the M words immediately before the word boundary and a window consisting of the M words immediately after it, calculating for each window a vector representing the meaning of the window, such as the appearance frequency vector of the words included in the window, and calculating a similarity between the vectors corresponding to the preceding and following windows, such as the cosine measure, as the cohesion degree of that word boundary; and
  Topic boundary recognition means for finding the word boundaries at which the cohesion degree takes a local minimum and recognizing, as a topic boundary, each minimum point or the boundary between speech segments or telops closest to that minimum point.
[0023]
  The present invention (Claim 9) is a program that causes a computer to function as the topic boundary determination apparatus according to claims 5 to 8.
[0024]
In addition, the topic boundary determination program of the present invention includes a mid-sentence determination step of determining whether each speech segment is in the middle of a sentence from information including the notation of the word string at the end of the recognition result text of the speech segment, parts of speech, and punctuation marks.
The topic boundary recognition step includes a step of recognizing as the topic boundary, within the set of speech segment boundaries that excludes the boundaries immediately after speech segments determined to be in the middle of a sentence, the speech segment boundary closest to the minimum point at which the cohesion degree takes a local minimum.
[0025]
Further, when data is input that is obtained by recognizing, through character recognition, the telops contained in the video content to be segmented and in which each telop recognition result text carries start time information, the topic boundary determination program of the present invention includes:
A telop recognition result text insertion step of inserting each telop recognition result text into the speech segment sequence so that the start time information is in ascending order; and
A telop recognition result text word division step of dividing each telop recognition result text into words.
In the unnecessary word removal step, unnecessary words, including attached words, are also removed from the words obtained in the telop recognition result text word division step.
In the cohesion degree calculation step, the cohesion degree is calculated over the word strings of all speech segments and telops.
In the topic boundary recognition step, the minimum point at which the cohesion degree takes a local minimum, or the boundary between speech segments or telops closest to that minimum point, is recognized as the topic boundary.
[0026]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0027]
FIG. 3 shows the configuration of the topic boundary determination apparatus according to the first embodiment of the present invention.
[0028]
The topic boundary determination apparatus shown in the figure includes a data input unit 5, a word arrangement unit 1, an unnecessary word removal unit 2, a cohesion degree calculation unit 3, a topic boundary recognition unit 4, and a topic boundary detection result output unit 6.
[0029]
The data input unit 5 receives speech recognition result data in XML format, as shown in FIG. 4. In the data shown in FIG. 4, a SEGMENT element holds the information of one speech segment, and the “begin” and “end” attributes of the SEGMENT element represent the start time and end time of the speech segment.
An NBEST element is one NBEST candidate, and its “score” and “rank” attributes indicate the recognition score and the rank of the candidate, respectively. There are generally multiple NBEST candidates for each speech segment. One speech recognition result text corresponds to one NBEST candidate; the recognition result candidates obtained in descending order of speech recognition score are referred to as NBEST candidates.
The TEXT element is the speech recognition result text of the corresponding NBEST candidate, and the WORD element is a word constituting the content of the TEXT element. The “begin”, “end”, “score”, and “pos” attributes of the WORD element represent the start time, end time, recognition score, and part of speech information of the word.
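The element and attribute names described above can be read with a short parser. The following is a minimal sketch in Python, not the apparatus itself: FIG. 4 is not reproduced here, so the sample document, the `RESULT` root element, the placement of WORD elements directly under NBEST, and the function name `parse_segments` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; only the SEGMENT/NBEST/TEXT/WORD names and their
# attributes come from the description. Root name and nesting are assumed.
SAMPLE = """
<RESULT>
  <SEGMENT begin="0.0" end="2.0">
    <NBEST rank="1" score="0.9">
      <TEXT>taifuu ga</TEXT>
      <WORD begin="0.0" end="1.2" score="0.95" pos="noun">taifuu</WORD>
      <WORD begin="1.2" end="2.0" score="0.80" pos="particle">ga</WORD>
    </NBEST>
    <NBEST rank="2" score="0.6">
      <TEXT>taihou ga</TEXT>
      <WORD begin="0.0" end="1.2" score="0.40" pos="noun">taihou</WORD>
      <WORD begin="1.2" end="2.0" score="0.80" pos="particle">ga</WORD>
    </NBEST>
  </SEGMENT>
</RESULT>
"""

def parse_segments(xml_text):
    """Return a list of segments, each with its NBEST candidates and words."""
    root = ET.fromstring(xml_text)
    segments = []
    for seg in root.iter("SEGMENT"):
        candidates = []
        for nbest in seg.iter("NBEST"):
            # iter() finds WORD elements whether they sit under NBEST or TEXT.
            words = [
                {
                    "surface": (w.text or "").strip(),
                    "begin": float(w.get("begin")),
                    "end": float(w.get("end")),
                    "score": float(w.get("score")),
                    "pos": w.get("pos"),
                }
                for w in nbest.iter("WORD")
            ]
            candidates.append({
                "rank": int(nbest.get("rank")),
                "score": float(nbest.get("score")),
                "words": words,
            })
        candidates.sort(key=lambda c: c["rank"])  # best candidate first
        segments.append({
            "begin": float(seg.get("begin")),
            "end": float(seg.get("end")),
            "nbest": candidates,
        })
    return segments

segments = parse_segments(SAMPLE)
```

The resulting dictionaries are only one convenient in-memory form; any structure that keeps each word's start time, score, and part of speech would serve the later steps.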
[0030]
The word arrangement unit 1 adopts a predetermined number of NBEST candidates for each speech segment, merges, for each speech segment, the word sets contained in the adopted NBEST candidates, and sorts the words in ascending order of word start time information.
[0031]
The unnecessary word removal unit 2 removes unnecessary words including attached words from a word string that is a sequence of sorted words.
[0032]
The cohesion degree calculation unit 3 designates, in the word string formed by connecting the word strings of all speech segments, a range of a fixed number of words (hereinafter referred to as a window) immediately before and immediately after each word boundary, calculates for each window a vector representing the meaning of the window, such as the appearance frequency vector of the words included in the window, and calculates a similarity between the vectors corresponding to the preceding and following windows, such as the cosine measure, as the cohesion degree of that word boundary.
[0033]
The topic boundary recognition unit 4 obtains a word boundary where the degree of cohesion is minimum, and recognizes the minimum point or a speech segment boundary closest to the minimum point as a topic boundary.
[0034]
The topic boundary detection result output unit 6 outputs the topic boundary recognized by the topic boundary recognition unit 4.
[0035]
Next, the operation in the above configuration will be described.
[0036]
FIG. 5 is a flowchart of the operation in the first embodiment of the present invention.
[0037]
Step 101) The data input unit 5 receives, as speech recognition result data, data consisting of a plurality of NBEST candidates output for each speech segment in descending order of recognition score, a word division result for each NBEST candidate, and start time information assigned to each word of the word division result.
[0038]
Step 102) The word arrangement unit 1 adopts a predetermined number of NBEST candidates for each speech segment of the input data. Here, the predetermined number is an integer equal to or greater than 1, or all NBEST candidates. Then, for each speech segment, the word sets contained in the adopted NBEST candidates are merged and the words are sorted in ascending order of word start time information. FIG. 6 shows the result of the word arrangement processing when the top two NBEST candidates are adopted.
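The merge-and-sort of step 102 can be sketched as follows. This is an illustrative sketch only: the dictionary layout, the function name `arrange_words`, and the sample words are assumptions. Duplicate words are kept on purpose, since the description later relies on reliable words occurring more often in a window.

```python
def arrange_words(segment, n_adopt=2):
    """Merge the words of the top n_adopt NBEST candidates of one segment
    and sort them in ascending order of start time."""
    adopted = sorted(segment["nbest"], key=lambda c: c["rank"])[:n_adopt]
    merged = [w for cand in adopted for w in cand["words"]]
    # sorted() is stable, so words with equal start times keep rank order.
    return sorted(merged, key=lambda w: w["begin"])

# Hypothetical segment with three candidates (layout assumed for illustration).
segment = {
    "nbest": [
        {"rank": 1, "words": [{"surface": "typhoon", "begin": 0.0},
                              {"surface": "approaches", "begin": 1.2}]},
        {"rank": 2, "words": [{"surface": "typhoon", "begin": 0.0},
                              {"surface": "coaches", "begin": 1.3}]},
        {"rank": 3, "words": [{"surface": "tycoon", "begin": 0.0}]},
    ]
}

arranged = [w["surface"] for w in arrange_words(segment, n_adopt=2)]
```

With the top two candidates adopted, "typhoon" appears twice in the arranged string while the rank-3 word is left out.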
[0039]
Step 103) Next, the unnecessary word removal unit 2 removes from the word string unnecessary words, such as attached words, that are considered unrelated to topic segmentation. Here, the word information of the input data includes the word notation and part-of-speech information, and attached words such as particles and auxiliary verbs are identified from this information. Particles and auxiliary verbs are considered to have no effect on topic segmentation, so such words are judged to be unnecessary words. Unnecessary word removal may be realized by implementing the logic that judges a word to be unnecessary as a program, or by preparing an unnecessary word list (describing the notation and part of speech of each word regarded as unnecessary) as an external table and having the program that performs the unnecessary word removal processing refer to it. FIG. 7 shows the result obtained by deleting the attached words according to the value of the “pos” attribute of each WORD element.
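Both options mentioned in step 103, part-of-speech logic and an external unnecessary word list, can be combined in a few lines. The sketch below is an assumption-laden illustration: the part-of-speech labels, the `stop_list` format of (notation, part of speech) pairs, and the function name are hypothetical.

```python
# Parts of speech treated as attached words (labels are assumed).
ATTACHED_POS = {"particle", "auxiliary verb"}

def remove_unnecessary_words(words, stop_list=frozenset()):
    """Drop attached words by part of speech, plus any (notation, pos)
    pair listed in an externally supplied unnecessary word list."""
    return [
        w for w in words
        if w["pos"] not in ATTACHED_POS
        and (w["surface"], w["pos"]) not in stop_list
    ]

words = [
    {"surface": "typhoon", "pos": "noun"},
    {"surface": "ga", "pos": "particle"},
    {"surface": "eh", "pos": "interjection"},
    {"surface": "Kyushu", "pos": "noun"},
]
kept = remove_unnecessary_words(words, stop_list={("eh", "interjection")})
```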
[0040]
Step 104) The cohesion degree calculation unit 3 takes, in the word string formed by connecting the word strings of all speech segments, windows of a fixed number of words immediately before and after each word boundary, calculates for each window a vector representing the meaning of the window, such as the appearance frequency vector of the words included in the window, and calculates a similarity between the vectors of the preceding and following windows, such as the cosine measure, as the cohesion degree of that word boundary. An example is shown in FIG. 8. The window width used in the cohesion degree calculation is increased according to the number of NBEST candidates adopted. Words with high recognition reliability (for example, “adjustment” and “Kyushu” in FIG. 7) are expected to be output in more NBEST candidates and therefore occur more frequently in each window, so highly reliable words have a large influence on the vector representing the meaning of the window while unreliable words have a small one. The vector representing the meaning of the window, and hence the cohesion degree, are therefore more appropriate than in the prior art.
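The cohesion degree of step 104 can be illustrated with appearance frequency vectors and the cosine measure. The following is a minimal sketch under stated assumptions: the function names are hypothetical, words are plain strings, and boundaries that have fewer than M words on either side are simply skipped, a choice the description does not specify.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine measure between two sparse frequency vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cohesion_scores(words, m):
    """Map each word boundary i (between words[i-1] and words[i]) to the
    cosine measure of the M-word windows before and after it."""
    scores = {}
    for i in range(m, len(words) - m + 1):
        before = Counter(words[i - m:i])
        after = Counter(words[i:i + m])
        scores[i] = cosine(before, after)
    return scores

# Toy word string: the vocabulary changes completely at boundary 4.
toy = ["a", "b", "a", "b", "x", "y", "x", "y"]
scores = cohesion_scores(toy, m=2)
```

In this toy input the two windows around boundary 4 share no words, so the cohesion degree there is zero, while the boundaries inside each half score higher.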
[0041]
Step 105) The topic boundary recognition unit 4 obtains a word boundary where the degree of cohesion is minimized, and recognizes the minimum point or a speech segment boundary closest to the minimum point as a topic boundary.
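Step 105 can be sketched as a search for local minima followed by an optional snap to the nearest speech segment boundary. The code below is illustrative only: the function names, the handling of plateaus (only the first point of a flat minimum is reported), and the tie-breaking rule of `min` are assumptions, and the scores are made-up values.

```python
def local_minima(scores):
    """Return the word boundaries whose cohesion degree is a local minimum.
    scores maps word-boundary index -> cohesion degree."""
    idx = sorted(scores)
    minima = []
    for j in range(1, len(idx) - 1):
        prev, cur, nxt = scores[idx[j - 1]], scores[idx[j]], scores[idx[j + 1]]
        if cur < prev and cur <= nxt:
            minima.append(idx[j])
    return minima

def snap_to_segment_boundary(minimum, segment_boundaries):
    """Return the speech segment boundary closest to a minimum point.
    On a tie, the boundary listed first wins (an arbitrary choice)."""
    return min(segment_boundaries, key=lambda b: abs(b - minimum))

# Hypothetical cohesion degrees per word boundary, and segment boundaries
# expressed as word-boundary indices.
example_scores = {2: 1.0, 3: 0.5, 4: 0.0, 5: 0.5, 6: 1.0}
minima = local_minima(example_scores)
topic_boundaries = [snap_to_segment_boundary(m, [0, 3, 8]) for m in minima]
```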
[0042]
Step 106) The topic boundary detection result output unit 6 outputs the topic boundary recognized by the topic boundary recognition unit 4.
[0043]
[Second Embodiment]
In the present embodiment, a case where a concept base is used in the cohesion degree calculation unit 3 will be described.
[0044]
FIG. 9 shows the configuration of the topic boundary determination device in the second exemplary embodiment of the present invention. The same parts as those in FIG. 3 are denoted by the same reference numerals, and the description thereof is omitted.
[0045]
FIG. 10 shows an example of a concept base in the second embodiment of the present invention.
The concept base 10 stores a set of pairs each consisting of a word and a vector that is the semantic representation of that word, and has the property that if two vectors are close, the meanings of the corresponding words are close. The concept base 10 is assumed to be stored in storage means such as a database.
[0046]
In the present embodiment, the concept base 10 is added to the configuration of the first embodiment described above. Accordingly, in step 104 of the flowchart of FIG. 5 in the first embodiment, the cohesion degree calculation unit 3 searches the concept base 10 shown in FIG. 10, in which a vector that is a semantic representation is assigned to each word, for each word using that word as the search key, and acquires the vector corresponding to the word.
[0047]
Let νr (1 ≦ r ≦ s) denote the word vectors included in a window, and let gr denote the recognition score of the word corresponding to the word vector νr. When the weighted average of the word vectors included in the window, with the recognition scores as weights, is calculated as the vector corresponding to the window,
[0048]
[Expression 1]
$$\frac{\sum_{r=1}^{s} g_r \nu_r}{\sum_{r=1}^{s} g_r}$$
is taken. With this calculation, words with high recognition scores have a large influence and words with low recognition scores have a small influence, so the meaning of the window is reflected more appropriately and the cohesion degree becomes more appropriate.
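The score-weighted average described above, that is, the sum of each word vector multiplied by its recognition score, divided by the sum of the scores, can be sketched as follows. The concept base is modeled here as a plain dictionary with made-up two-dimensional vectors; the words, vectors, scores, and the function name `window_vector` are all hypothetical.

```python
def window_vector(word_vectors, scores):
    """Weighted average of word vectors with recognition scores as weights."""
    total = sum(scores)
    if total == 0:
        raise ValueError("recognition scores sum to zero")
    dim = len(word_vectors[0])
    return [
        sum(g * v[d] for g, v in zip(scores, word_vectors)) / total
        for d in range(dim)
    ]

# Hypothetical concept base: word -> semantic vector.
concept_base = {"typhoon": [1.0, 0.0], "economy": [0.0, 1.0]}

window_words = ["typhoon", "economy"]
recognition_scores = [3.0, 1.0]  # "typhoon" was recognized more reliably
vectors = [concept_base[w] for w in window_words]
vec = window_vector(vectors, recognition_scores)
```

Here the window vector leans toward "typhoon" because its recognition score is three times that of "economy".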
[0049]
[Third Embodiment]
FIG. 11 shows the configuration of the topic boundary determination apparatus in the third embodiment of the present invention. The topic boundary determination apparatus shown in the figure has a configuration in which a mid-sentence determination unit 7 and a mid-sentence determination base 20 are added to the configuration of FIG. 9. The mid-sentence determination base 20 is stored in storage means such as a database. In the figure, the same components as those in FIG. 9 are denoted by the same reference numerals, and their description is omitted.
[0050]
The mid-sentence determination unit 7 refers to the mid-sentence determination base 20 and determines whether each speech segment is in the middle of a sentence from information such as the notation of the word string at the end of the recognition result text of the speech segment, parts of speech, and punctuation marks.
[0051]
Accordingly, the topic boundary recognition unit 4 recognizes as the topic boundary, within the set of speech segment boundaries that excludes the boundaries immediately after speech segments determined to be in the middle of a sentence, the speech segment boundary closest to the minimum point at which the cohesion degree takes a local minimum.
[0052]
FIG. 12 is a flowchart of the operation in the third embodiment of the present invention. In FIG. 12, step 201 is the same as step 101 in FIG. 5, and steps 203 to 205 and step 207 are the same as steps 102 to 104 and step 106 in FIG. 5, so their description is omitted.
[0053]
When data is input (step 201), the mid-sentence determination unit 7 determines whether each speech segment is in the middle of a sentence using information such as the notation of the word string at the end of, for example, the maximum-likelihood NBEST candidate (rank = “1”) of the speech segment, parts of speech, and punctuation marks. For example, information on word strings that can be regarded as the middle of a sentence is stored in the mid-sentence determination base 20, and when the word string at the end of the NBEST candidate matches one of the word strings in the mid-sentence determination base 20, the speech segment is determined to be in the middle of a sentence (step 202).
[0054]
FIG. 13 shows an example of the mid-sentence determination base in the third embodiment of the present invention. Each record in the mid-sentence determination base 20 is a sequence of “word notation; part-of-speech information” entries. For example, if the word string of the NBEST candidate is “Typhoon (noun), ni (case particle), mimai (verb stem), wa (verb inflection ending), re (verb suffix)”, the word string at its end matches a record in the mid-sentence determination base 20, and the speech segment is therefore determined to be in the middle of a sentence.
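The matching of a candidate's final words against the records of the mid-sentence determination base can be sketched as a suffix comparison. FIG. 13 is not reproduced here, so the records below are hypothetical, as are the part-of-speech labels for the second test sentence and the function name `is_mid_sentence`.

```python
# Hypothetical records: each is a sequence of (notation, part of speech).
MID_SENTENCE_BASE = [
    [("wa", "verb inflection ending"), ("re", "verb suffix")],
    [("ga", "case particle")],
]

def is_mid_sentence(words, base=MID_SENTENCE_BASE):
    """True if the end of the word string matches any record in the base."""
    tail = [(w["surface"], w["pos"]) for w in words]
    return any(
        rec and len(rec) <= len(tail) and tail[-len(rec):] == rec
        for rec in base
    )

candidate = [
    {"surface": "Typhoon", "pos": "noun"},
    {"surface": "ni", "pos": "case particle"},
    {"surface": "mimai", "pos": "verb stem"},
    {"surface": "wa", "pos": "verb inflection ending"},
    {"surface": "re", "pos": "verb suffix"},
]
complete = [
    {"surface": "Typhoon", "pos": "noun"},
    {"surface": "desu", "pos": "auxiliary verb"},
]
```

`is_mid_sentence(candidate)` is true because the candidate ends in the first record, while `is_mid_sentence(complete)` is false.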
[0055]
Because the boundary immediately after a speech segment determined by the mid-sentence determination unit 7 to be in the middle of a sentence lies in the middle of a sentence, the topic boundary recognition unit 4 excludes such boundaries from the set of speech segment boundaries and recognizes, within the remaining set, the speech segment boundary closest to the minimum point of the cohesion degree as the topic boundary (step 206). As a result, a topic boundary always falls between sentences and never in the middle of a sentence, so the accuracy of segmentation is improved.
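Step 206 differs from the first embodiment only in the candidate set that the minimum point is snapped to. A minimal sketch, assuming boundaries are given as word-boundary indices and that the set of boundaries following mid-sentence segments has already been computed (the names and sample values are hypothetical):

```python
def snap_excluding_mid_sentence(minimum, boundaries, mid_sentence_boundaries):
    """Return the speech segment boundary closest to the minimum point,
    ignoring boundaries that directly follow a mid-sentence segment.
    Returns None when no boundary remains."""
    allowed = [b for b in boundaries if b not in mid_sentence_boundaries]
    if not allowed:
        return None
    return min(allowed, key=lambda b: abs(b - minimum))

# Boundary 3 is nearest to the minimum at 4, but it follows a segment
# judged to be in the middle of a sentence, so boundary 6 is chosen.
chosen = snap_excluding_mid_sentence(4, [0, 3, 6, 9], {3})
```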
[0056]
[Fourth Embodiment]
FIG. 14 shows the configuration of a topic boundary determination apparatus in the fourth embodiment of the present invention. The topic boundary determination apparatus shown in the figure has a configuration in which a telop recognition result text insertion unit 8 and a telop recognition result text word division unit 9 are added to a previously described configuration. In FIG. 14, parts identical to those of that configuration are denoted by the same reference numerals, and their description is omitted.
[0057]
In this embodiment, however, the data input unit 5 also receives data that is obtained by recognizing, through character recognition, the telops contained in the video content to be segmented and in which each telop recognition result text carries start time information.
[0058]
The telop recognition result text insertion unit 8 inserts each telop recognition result text into the speech segment sequence so that the start time information is in ascending order.
[0059]
The telop recognition result text word division unit 9 divides each telop recognition result text into words.
[0060]
FIG. 15 is a flowchart of the operation in the fourth embodiment of the present invention.
Step 301) The data input unit 5 receives data (telop recognition result texts) obtained by recognizing, through character recognition, the telops contained in the video content to be segmented. An example of the input data is shown in FIG. 16. A TELOP element holds the information of one telop, and the “begin” and “end” attributes of the TELOP element are the start time and end time of the telop.
[0061]
Step 302) The telop recognition result text insertion unit 8 inserts each telop recognition result text into the speech segment sequence so that the start time information is in ascending order. FIG. 17 shows an example of the result of inserting the TELOP element of FIG. 16 into the speech recognition result data of FIG. 4. The SEGMENT elements and TELOP elements are arranged so that their “begin” values are in ascending order. In FIG. 4, the SEGMENT element has a “begin” attribute. If there is no “begin” attribute, the minimum “begin” value of the WORD elements in the SEGMENT element may be used instead.
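The insertion of step 302, including the fallback to the earliest WORD start time when a SEGMENT has no “begin” attribute, can be sketched as a single sort. The dictionary layout, the function names, and the sample times are assumptions for illustration.

```python
def start_time(element):
    """Use the element's own begin time, or fall back to the earliest
    begin time among its words when the attribute is absent."""
    if element.get("begin") is not None:
        return element["begin"]
    return min(w["begin"] for w in element["words"])

def insert_telops(segments, telops):
    """Interleave SEGMENT and TELOP elements in ascending start time."""
    tagged = [("SEGMENT", s) for s in segments] + [("TELOP", t) for t in telops]
    return sorted(tagged, key=lambda pair: start_time(pair[1]))

segments = [
    {"begin": 0.0, "words": [{"surface": "good evening", "begin": 0.0}]},
    # No begin attribute: the earliest word start time (5.5) is used.
    {"begin": None, "words": [{"surface": "typhoon", "begin": 5.5},
                              {"surface": "Kyushu", "begin": 6.1}]},
]
telops = [{"begin": 5.0, "text": "typhoon information"}]

order = [kind for kind, _ in insert_telops(segments, telops)]
```

The telop starting at 5.0 lands between the two speech segments, immediately before the segment it heads.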
[0062]
Step 303) The telop recognition result text word division unit 9 divides each telop recognition result text into words. FIG. 18 shows data obtained as a result of dividing the text of the TELOP element into words. The WORD element in the TELOP element is a word obtained by word division. The WORD element has a “pos” attribute which is part of speech information.
[0063]
Step 304) The word arrangement unit 1 performs the same processing as in the first embodiment. The result obtained by this processing is shown in FIG. 19.
[0064]
Step 305) The unnecessary word removal unit 2 removes unnecessary words such as attached words from the word strings in the speech segments and the telops. FIG. 20 shows the result obtained by removing the WORD elements of attached words from the WORD elements in the SEGMENT elements and TELOP elements shown in FIG. 19, based on the value of the “pos” attribute.
[0065]
Step 306) The cohesion degree calculation unit 3 calculates the cohesion degree in the array of WORD elements in all SEGMENT elements and TELOP elements.
[0066]
Step 307) The topic boundary recognition unit 4 recognizes the minimum point at which the cohesion degree takes a local minimum as the topic boundary. Alternatively, within the sequence of SEGMENT elements and TELOP elements, the boundary closest to the minimum point among the boundaries between two SEGMENT elements, between a SEGMENT element and a TELOP element, and between two TELOP elements is recognized as the topic boundary.
[0067]
Step 308) The topic boundary detection result output unit 6 outputs the recognized topic boundary.
[0068]
The telop “typhoon information” is a text equivalent to a topic headline, and there is a high possibility that the immediately preceding boundary is recognized as the topic boundary.
[0069]
The present invention can also be implemented by storing the above-described concept base and mid-sentence determination base in storage means such as a database, constructing the operations of the first to fourth embodiments as a program, and installing the program on a computer used as the topic boundary determination apparatus or distributing it via a network.
[0070]
Further, the constructed program may be stored in a hard disk device connected to the computer used as the topic boundary determination apparatus, or on a portable storage medium such as a flexible disk or a CD-ROM, and installed when the present invention is carried out, so that it is executed under the control of a control device such as a CPU.
[0071]
The present invention is not limited to the above-described embodiment, and various modifications and applications can be made within the scope of the claims.
[0072]
[Effects of the Invention]
As described above, according to the present invention, it is possible to achieve higher accuracy in topic segmentation of video content and audio content than in the prior art.
[Brief description of the drawings]
FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a principle configuration diagram of the present invention.
FIG. 3 is a configuration diagram of a topic boundary determination device according to the first embodiment of the present invention.
FIG. 4 is an example of speech recognition result data according to the first embodiment of the present invention.
FIG. 5 is a flowchart of the operation in the first embodiment of the present invention.
FIG. 6 is an example of a word arrangement processing result when the word arrangement unit in the first exemplary embodiment of the present invention employs the top two NBEST candidates.
FIG. 7 is an example of removing unnecessary words in the first embodiment of the present invention.
FIG. 8 is a diagram for explaining cohesion degree calculation processing according to the first embodiment of the present invention.
FIG. 9 is a configuration diagram of a topic boundary determination device according to a second embodiment of the present invention.
FIG. 10 is an example of a concept base in the second exemplary embodiment of the present invention.
FIG. 11 is a configuration diagram of a topic boundary determination device according to a third embodiment of the present invention.
FIG. 12 is a flowchart of the operation in the third embodiment of the present invention.
FIG. 13 is an example of a mid-sentence determination base in the third embodiment of the present invention.
FIG. 14 is a configuration diagram of a topic boundary determination device according to a fourth embodiment of the present invention.
FIG. 15 is a flowchart of the operation according to the fourth embodiment of the present invention.
FIG. 16 is an example of input telop recognition result text in the fourth exemplary embodiment of the present invention.
FIG. 17 is an example of inserting a TELOP element into speech recognition result data in the fourth exemplary embodiment of the present invention.
FIG. 18 is a result of dividing a text of a TELOP element according to a fourth embodiment of the present invention into words.
FIG. 19 shows a result of word arrangement processing in the fourth exemplary embodiment of the present invention.
FIG. 20 is an example of removing unnecessary words in the fourth embodiment of the present invention.
[Explanation of symbols]
1 Word arrangement means, word arrangement unit
2 Unnecessary word removal means, unnecessary word removal unit
3 Cohesion degree calculation means, cohesion degree calculation unit
4 Topic boundary recognition means, topic boundary recognition unit
5 Data input unit
6 Topic boundary detection result output unit
7 Mid-sentence determination unit
8 Telop recognition result text insertion unit
9 Telop recognition result text word division unit
10 Concept base
20 Mid-sentence determination base

Claims

映像コンテンツや音声コンテンツに含まれる音声を認識した結果得られたデータをトピック単位に分割するためのトピック境界決定方法において、
各音声セグメントに対して認識スコアの高い順に出力された複数の認識結果テキスト（以下、ＮＢＥＳＴ候補と記す）、該ＮＢＥＳＴ候補に対する単語分割結果、及び、該単語分割結果の各単語に開始時刻情報が付与されているデータからなる音声認識結果データが入力されると、
A topic boundary determination method for dividing, into topic units, data obtained by recognizing speech contained in video content or audio content, the method comprising,
upon input of speech recognition result data consisting of a plurality of recognition result texts output for each speech segment in descending order of recognition score (hereinafter referred to as NBEST candidates), word segmentation results for the NBEST candidates, and start time information assigned to each word of the word segmentation results:
a word arrangement process of adopting a plurality of NBEST candidates for each segment, merging, for each speech segment, the word sets contained in the adopted NBEST candidates, and arranging the merged word set into a word string in which the words are sorted in ascending order of start time;
an unnecessary word deletion process of deleting unnecessary words, including function words, from the sorted word string;
a cohesion degree calculation process of taking as a window a range of a fixed number M of words in a word string, designating, for each word boundary in the word string formed by concatenating the word strings of all speech segments, a window consisting of the M words immediately before the word boundary and a window consisting of the M words immediately after the word boundary, calculating for each window a vector representing the meaning of the window, including an occurrence frequency vector of the words contained in the window, and calculating a similarity, such as the cosine measure, between the vectors of the preceding and following windows as the cohesion degree of the word boundary; and
a topic boundary recognition process of finding the word boundaries at which the cohesion degree is a local minimum, and recognizing each minimum point, or the speech segment boundary nearest to the minimum point, as a topic boundary.
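
As an illustration only, the four processes recited above can be sketched in Python. This is a minimal sketch under stated assumptions, not the patented implementation: the function names, the `(word, start_time)` tuple layout, the de-duplication of identical pairs during merging, the skipping of boundaries that lack M words on both sides, and the tie-breaking rule for equal cohesion values are all choices made for this example.

```python
import math
from collections import Counter


def merge_segment_words(nbest):
    """Merge the word sets of the adopted NBEST candidates of one segment.

    nbest: list of candidates, each a list of (word, start_time) pairs.
    Treating identical (word, start_time) pairs as one word is an assumption.
    """
    merged = {pair for candidate in nbest for pair in candidate}
    return sorted(merged, key=lambda pair: (pair[1], pair[0]))


def cosine(a, b):
    """Cosine measure between two word-frequency Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cohesion_scores(words, m):
    """Cohesion at every word boundary i (between words[i-1] and words[i])
    that has M words on both sides; skipping the edges is an assumption."""
    return {i: cosine(Counter(words[i - m:i]), Counter(words[i:i + m]))
            for i in range(m, len(words) - m + 1)}


def local_minima(scores):
    """Boundaries whose cohesion is lower than the previous one and not higher than the next."""
    keys = sorted(scores)
    return [k for prev, k, nxt in zip(keys, keys[1:], keys[2:])
            if scores[k] < scores[prev] and scores[k] <= scores[nxt]]


def topic_boundaries(segments, stop_words, m):
    """Return the word indices of the segment boundaries chosen as topic boundaries."""
    words, segment_ends = [], []
    for nbest in segments:
        words += [w for w, _ in merge_segment_words(nbest) if w not in stop_words]
        segment_ends.append(len(words))
    inner = segment_ends[:-1]  # the end of the last segment is not a candidate
    if not inner:
        return []
    minima = local_minima(cohesion_scores(words, m))
    return [min(inner, key=lambda b: abs(b - p)) for p in minima]


# Four toy segments, two NBEST candidates each (made-up data).
segments = [
    [[("rain", 0.0), ("storm", 0.4)], [("rain", 0.0), ("wind", 0.8)]],
    [[("the", 1.0), ("rain", 1.2), ("storm", 1.6)], [("the", 1.0), ("rain", 1.2)]],
    [[("goal", 2.0), ("match", 2.4), ("team", 2.8)], [("goal", 2.0), ("match", 2.4)]],
    [[("match", 3.0), ("goal", 3.4), ("team", 3.8)], [("match", 3.0), ("goal", 3.4)]],
]
print(topic_boundaries(segments, stop_words={"the"}, m=3))  # prints [5]
```

In this toy input the weather words end and the sports words begin after the fifth retained word, which is where the sketch places the single topic boundary.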

A topic boundary determination method for dividing, into topic units, data obtained by speech recognition of the audio contained in video content or audio content, the method comprising,
upon input of speech recognition result data consisting of a plurality of recognition result texts output for each speech segment in descending order of recognition score (hereinafter referred to as NBEST candidates), word segmentation results for the NBEST candidates, and start time information assigned to each word of the word segmentation results:
a word arrangement process of adopting one or more NBEST candidates for each speech segment, merging, for each speech segment, the word sets contained in the adopted NBEST candidates, and arranging the merged word set into a word string in which the words are sorted in ascending order of start time;
an unnecessary word deletion process of deleting unnecessary words, including function words, from the sorted word string;
a cohesion degree calculation process of taking as a window a range of a fixed number M of words in a word string, designating, for each word boundary in the word string formed by concatenating the word strings of all speech segments, a window consisting of the M words immediately before the word boundary and a window consisting of the M words immediately after the word boundary, calculating for each window a vector representing the meaning of the window, including an occurrence frequency vector of the words contained in the window, and calculating a similarity, such as the cosine measure, between the vectors of the preceding and following windows as the cohesion degree of the word boundary; and
a topic boundary recognition process of finding the word boundaries at which the cohesion degree is a local minimum, and recognizing each minimum point, or the speech segment boundary nearest to the minimum point, as a topic boundary,
wherein, when the input data has recognition score information for each word in the recognition result texts,
the cohesion degree calculation process
searches, for each word and using the word as a search key, a concept base storing a set of pairs of a word and a vector that is the semantic representation of the word, and obtains the vector corresponding to the word, and
calculates, as the vector representing the meaning of the window, a weighted average of the vectors of the words contained in the window, with the recognition scores as weights.
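
The score-weighted window vector recited above can be illustrated with a short Python sketch. It is a minimal example under assumptions that are not in the claim: the concept base is modeled as an in-memory dictionary, recognition scores are taken to be non-negative weights, and words missing from the concept base are simply skipped. The name `window_vector` is hypothetical.

```python
def window_vector(window, concept_base):
    """Weighted average of concept-base vectors, weighted by recognition score.

    window:       list of (word, recognition_score) pairs
    concept_base: dict mapping word -> semantic vector (list of floats),
                  standing in for the concept base lookup (an assumption)
    """
    dim = len(next(iter(concept_base.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, score in window:
        vector = concept_base.get(word)
        if vector is None:
            continue  # assumption: words absent from the concept base are ignored
        for i, x in enumerate(vector):
            total[i] += score * x
        weight_sum += score
    # With no usable words the zero vector is returned (also an assumption).
    return [x / weight_sum for x in total] if weight_sum else total


concept_base = {"rain": [1.0, 0.0], "goal": [0.0, 1.0]}
window = [("rain", 0.75), ("goal", 0.25), ("unknown", 0.5)]
print(window_vector(window, concept_base))  # prints [0.75, 0.25]
```

Because a misrecognized word tends to carry a low recognition score, it pulls the window vector less than a confidently recognized word does.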

A topic boundary determination method for dividing, into topic units, data obtained by speech recognition of the audio contained in video content or audio content, the method comprising,
upon input of speech recognition result data consisting of a plurality of recognition result texts output for each speech segment in descending order of recognition score (hereinafter referred to as NBEST candidates), word segmentation results for the NBEST candidates, and start time information assigned to each word of the word segmentation results:
a mid-sentence judgment process of judging, from information including the surface form and part of speech of the word string at the end of the recognition result text of each speech segment and punctuation marks, whether the speech segment ends in the middle of a sentence;
a word arrangement process of adopting one or more NBEST candidates for each speech segment, merging, for each speech segment, the word sets contained in the adopted NBEST candidates, and arranging the merged word set into a word string in which the words are sorted in ascending order of start time;
an unnecessary word deletion process of deleting unnecessary words, including function words, from the sorted word string;
a cohesion degree calculation process of taking as a window a range of a fixed number M of words in a word string, designating, for each word boundary in the word string formed by concatenating the word strings of all speech segments, a window consisting of the M words immediately before the word boundary and a window consisting of the M words immediately after the word boundary, calculating for each window a vector representing the meaning of the window, including an occurrence frequency vector of the words contained in the window, and calculating a similarity, such as the cosine measure, between the vectors of the preceding and following windows as the cohesion degree of the word boundary; and
a topic boundary recognition process of recognizing as a topic boundary, from among the set of speech segment boundaries excluding the boundary immediately after each speech segment judged to be mid-sentence in the mid-sentence judgment process, the speech segment boundary nearest to a minimum point at which the cohesion degree is a local minimum.
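
The boundary filtering recited above can be sketched as follows. This Python fragment is only an illustration: the claim does not fix the judgment rule, so `is_mid_sentence` is a toy heuristic with hypothetical part-of-speech labels, and all function names are assumptions made for the example.

```python
def is_mid_sentence(last_surface, last_pos):
    """Toy rule: sentence-final punctuation closes a sentence; a trailing
    particle or conjunction suggests the sentence continues.
    The POS labels are hypothetical, not those of any particular tagger."""
    if last_surface in {"。", "？", "！"}:
        return False
    return last_pos in {"particle", "conjunction"}


def allowed_boundaries(segment_ends, mid_sentence):
    """Drop the boundary right after each segment judged mid-sentence.
    The end of the final segment is never a candidate."""
    return [end for end, mid in zip(segment_ends[:-1], mid_sentence[:-1]) if not mid]


def snap_to_boundary(minimum, candidates):
    """Nearest remaining segment boundary to a cohesion minimum (None if none remain)."""
    return min(candidates, key=lambda b: abs(b - minimum)) if candidates else None


segment_ends = [3, 5, 8, 11]               # word index at which each segment ends
mid_sentence = [False, True, False, False]  # second segment stops mid-sentence
candidates = allowed_boundaries(segment_ends, mid_sentence)
print(candidates, snap_to_boundary(5, candidates))  # prints [3, 8] 3
```

Here the cohesion minimum sits exactly on the boundary after the second segment, but that boundary falls inside a sentence, so the nearest permitted boundary is chosen instead.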

A topic boundary determination method for dividing, into topic units, data obtained by speech recognition of the audio contained in video content or audio content, the method comprising,
upon input of speech recognition result data consisting of a plurality of recognition result texts output for each speech segment in descending order of recognition score (hereinafter referred to as NBEST candidates), word segmentation results for the NBEST candidates, and start time information assigned to each word of the word segmentation results, together with data that is obtained by recognizing, through character recognition, the telops contained in the video content to be segmented and that includes start time information for each telop recognition result text:
a telop recognition result text insertion process of inserting each telop recognition result text into the speech segment sequence so that the start time information is in ascending order;
a telop recognition result text word segmentation process of segmenting each telop recognition result text into words;
a word arrangement process of adopting one or more NBEST candidates for each speech segment, merging, for each speech segment, the word sets contained in the adopted NBEST candidates, and arranging the merged word set into a word string in which the words are sorted in ascending order of start time;
an unnecessary word deletion process of deleting unnecessary words, including function words, from the sorted word string, and deleting unnecessary words, including function words, from the words obtained in the telop recognition result text word segmentation process;
a cohesion degree calculation process of taking as a window a range of a fixed number M of words in a word string, designating, for each word boundary in the word string formed by concatenating the word strings of all speech segments and telops, a window consisting of the M words immediately before the word boundary and a window consisting of the M words immediately after the word boundary, calculating for each window a vector representing the meaning of the window, including an occurrence frequency vector of the words contained in the window, and calculating a similarity, such as the cosine measure, between the vectors of the preceding and following windows as the cohesion degree of the word boundary; and
a topic boundary recognition process of finding the word boundaries at which the cohesion degree is a local minimum, and recognizing each minimum point, or the boundary between speech segments or telops nearest to the minimum point, as a topic boundary.
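
The telop insertion recited above amounts to merging two time-stamped sequences. The Python sketch below is illustrative only: representing each unit as a `(start_time, words)` pair, using the first word's start time as the start time of a speech segment, and placing speech before a telop when the start times are equal are all assumptions of this example.

```python
def insert_telops(speech_segments, telops):
    """Interleave telop texts with speech segments in ascending start time.

    Both inputs are lists of (start_time, words). sorted() is stable, so on
    equal start times speech stays ahead of the telop (an assumed tie-break).
    """
    tagged = [(t, "speech", words) for t, words in speech_segments]
    tagged += [(t, "telop", words) for t, words in telops]
    return sorted(tagged, key=lambda unit: unit[0])


def unit_boundaries(units):
    """Word indices of the boundaries between consecutive speech/telop units."""
    boundaries, count = [], 0
    for _, _, words in units:
        count += len(words)
        boundaries.append(count)
    return boundaries[:-1]  # the end of the final unit is not a boundary


speech = [(0.0, ["rain", "storm"]), (4.0, ["goal", "match"])]
telops = [(3.5, ["sports", "news"])]
units = insert_telops(speech, telops)
print([source for _, source, _ in units])  # prints ['speech', 'telop', 'speech']
print(unit_boundaries(units))              # prints [2, 4]
```

The concatenated word string then feeds the same window-based cohesion calculation, and a telop acting as a headline adds topic words right at the point where the topic changes.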

A topic boundary determination apparatus for dividing, into topic units, data obtained by recognizing speech contained in video content or audio content, the apparatus comprising,
for operation upon input of speech recognition result data consisting of a plurality of recognition result texts output for each speech segment in descending order of recognition score (hereinafter referred to as NBEST candidates), word segmentation results for the NBEST candidates, and start time information assigned to each word of the word segmentation results:
word arrangement means for adopting a plurality of NBEST candidates for each segment, merging, for each speech segment, the word sets contained in the adopted NBEST candidates, and arranging the merged word set into a word string in which the words are sorted in ascending order of start time;
unnecessary word removal means for deleting unnecessary words, including function words, from the sorted word string;
cohesion degree calculation means for taking as a window a range of a fixed number M of words in a word string, designating, for each word boundary in the word string formed by concatenating the word strings of all speech segments, a window consisting of the M words immediately before the word boundary and a window consisting of the M words immediately after the word boundary, calculating for each window a vector representing the meaning of the window, including an occurrence frequency vector of the words contained in the window, and calculating a similarity, such as the cosine measure, between the vectors of the preceding and following windows as the cohesion degree of the word boundary; and
topic boundary recognition means for finding the word boundaries at which the cohesion degree is a local minimum, and recognizing each minimum point, or the speech segment boundary nearest to the minimum point, as a topic boundary.

A topic boundary determination apparatus for dividing, into topic units, data obtained by recognizing speech contained in video content or audio content, the apparatus comprising,
for operation upon input of speech recognition result data consisting of a plurality of recognition result texts output for each speech segment in descending order of recognition score (hereinafter referred to as NBEST candidates), word segmentation results for the NBEST candidates, and start time information assigned to each word of the word segmentation results:
word arrangement means for adopting one or more NBEST candidates for each segment, merging, for each speech segment, the word sets contained in the adopted NBEST candidates, and arranging the merged word set into a word string in which the words are sorted in ascending order of start time;
unnecessary word removal means for deleting unnecessary words, including function words, from the sorted word string;
cohesion degree calculation means for taking as a window a range of a fixed number M of words in a word string, designating, for each word boundary in the word string formed by concatenating the word strings of all speech segments, a window consisting of the M words immediately before the word boundary and a window consisting of the M words immediately after the word boundary, calculating for each window a vector representing the meaning of the window, including an occurrence frequency vector of the words contained in the window, and calculating a similarity, such as the cosine measure, between the vectors of the preceding and following windows as the cohesion degree of the word boundary; and
topic boundary recognition means for finding the word boundaries at which the cohesion degree is a local minimum, and recognizing each minimum point, or the speech segment boundary nearest to the minimum point, as a topic boundary,
wherein the cohesion degree calculation means includes:
means for, when the input data has recognition score information for each word in the recognition result texts, searching, for each word and using the word as a search key, a concept base storing a set of pairs of a word and a vector that is the semantic representation of the word, and obtaining the vector corresponding to the word; and
means for calculating, as the vector representing the meaning of the window, a weighted average of the vectors of the words contained in the window, with the recognition scores as weights.

A topic boundary determination apparatus for dividing, into topic units, data obtained by recognizing speech contained in video content or audio content, the apparatus comprising,
for operation upon input of speech recognition result data consisting of a plurality of recognition result texts output for each speech segment in descending order of recognition score (hereinafter referred to as NBEST candidates), word segmentation results for the NBEST candidates, and start time information assigned to each word of the word segmentation results:
mid-sentence judgment means for judging, from information including the surface form and part of speech of the word string at the end of the recognition result text of each speech segment and punctuation marks, whether the speech segment ends in the middle of a sentence;
word arrangement means for adopting one or more NBEST candidates for each speech segment, merging, for each speech segment, the word sets contained in the adopted NBEST candidates, and arranging the merged word set into a word string in which the words are sorted in ascending order of start time;
unnecessary word removal means for deleting unnecessary words, including function words, from the sorted word string;
cohesion degree calculation means for taking as a window a range of a fixed number M of words in a word string, designating, for each word boundary in the word string formed by concatenating the word strings of all speech segments, a window consisting of the M words immediately before the word boundary and a window consisting of the M words immediately after the word boundary, calculating for each window a vector representing the meaning of the window, including an occurrence frequency vector of the words contained in the window, and calculating a similarity, such as the cosine measure, between the vectors of the preceding and following windows as the cohesion degree of the word boundary; and
topic boundary recognition means for recognizing as a topic boundary, from among the set of speech segment boundaries excluding the boundary immediately after each speech segment judged to be mid-sentence by the mid-sentence judgment means, the speech segment boundary nearest to a minimum point at which the cohesion degree is a local minimum.

A topic boundary determination apparatus for dividing, into topic units, data obtained by recognizing speech contained in video content or audio content, the apparatus comprising,
for operation upon input of speech recognition result data consisting of a plurality of recognition result texts output for each speech segment in descending order of recognition score (hereinafter referred to as NBEST candidates), word segmentation results for the NBEST candidates, and start time information assigned to each word of the word segmentation results, together with data that is obtained by recognizing, through character recognition, the telops contained in the video content to be segmented and that includes start time information for each telop recognition result text:
telop recognition result text insertion means for inserting each telop recognition result text into the speech segment sequence so that the start time information is in ascending order;
telop recognition result text word segmentation means for segmenting each telop recognition result text into words;
word arrangement means for adopting one or more NBEST candidates for each speech segment, merging, for each speech segment, the word sets contained in the adopted NBEST candidates, and arranging the merged word set into a word string in which the words are sorted in ascending order of start time;
unnecessary word removal means for deleting unnecessary words, including function words, from the sorted word string, and deleting unnecessary words, including function words, from the words obtained by the telop recognition result text word segmentation means;
cohesion degree calculation means for taking as a window a range of a fixed number M of words in a word string, designating, for each word boundary in the word string formed by concatenating the word strings of all speech segments and telops, a window consisting of the M words immediately before the word boundary and a window consisting of the M words immediately after the word boundary, calculating for each window a vector representing the meaning of the window, including an occurrence frequency vector of the words contained in the window, and calculating a similarity, such as the cosine measure, between the vectors of the preceding and following windows as the cohesion degree of the word boundary; and
topic boundary recognition means for finding the word boundaries at which the cohesion degree is a local minimum, and recognizing each minimum point, or the boundary between speech segments or telops nearest to the minimum point, as a topic boundary.

A program that causes a computer to function as the topic boundary determination apparatus according to any one of claims 5 to 8.