JP4256393B2

JP4256393B2 - Voice processing method and program thereof

Info

Publication number: JP4256393B2
Application number: JP2006009913A
Authority: JP
Inventors: 浩太日▲高▼; 理水野; 信弥中嶌
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2001-08-08
Filing date: 2006-01-18
Publication date: 2009-04-22
Anticipated expiration: 2022-08-07
Also published as: JP2006146261A

Abstract

PROBLEM TO BE SOLVED: To extract summary sections by determining the emphasized state of speech in conversation without depending on the speaker.

SOLUTION: A codebook is used that stores, for each code, the appearance probability in the emphasized state of a set of speech feature quantities such as the fundamental frequency, the power, and the temporal variation characteristic of a dynamic feature quantity. Regarding these per-code appearance probabilities as initial state probabilities, the codebook also stores the inter-frame transition probabilities and the per-state-transition output probabilities of a hidden Markov model. For each speech sub-paragraph of the speech signal, that is, a portion containing voiced sections sandwiched between unvoiced sections, the initial state probability of the feature set of the first frame is found, and for the second and subsequent frames the output probability per state transition of the feature set and the inter-frame transition probability are found. The likelihood of the speech sub-paragraph is then obtained from the maximum, or the total sum, over all state transition paths of the product of the initial state probability, the output probabilities, and the transition probabilities.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a method of analyzing a speech signal and extracting the portions of a human utterance that were spoken with emphasis, to a speech processing method used in that method, and to a program therefor.

It has been proposed to extract from a speech signal the portions of the utterance that the speaker emphasizes as important, and to create a summary of the utterance automatically. For example, in Patent Document 1, a speech signal is analyzed to obtain an FFT spectrum, an LPC cepstrum, or the like as speech feature quantities. DP matching is performed between the feature sequence of an arbitrary section and that of another section to obtain the distance between the sequences. When the distance is at or below a predetermined value, the two sections are extracted as phonologically similar, and time position information is added to mark them as important portions. In other words, the method exploits the observation that words appearing repeatedly in speech are often important.

In Patent Document 2, an FFT spectrum, an LPC cepstrum, or the like is obtained as speech feature quantities from a speech signal such as dialogue between speakers. Phoneme segments are recognized from these features to obtain a phoneme-segment symbol sequence, and the distance between two sections is obtained by DP matching of their phoneme-segment sequences. Sections with a small distance, that is, phonologically similar sections, are extracted as important portions, and a thesaurus is further used to estimate the contents of multiple topics.

As a technique for extracting sentence and word units in speech, there is a method that exploits a property common in Japanese: the pitch pattern combining the intonation component and the accent component of a sentence or word unit starts at a low pitch frequency, peaks in the first half, declines gradually through the second half, and drops sharply at the end, where phonation stops. Non-Patent Document 1 is an example.

Patent Document 3 proposes extracting important scenes from video information accompanied by an audio signal by using that audio signal. It discloses analyzing the audio signal to capture audio features such as spectral information and signal levels with a sharp rise and short duration, extracting the portions that are similar or close to the features of a preset condition (for example, the audio when the audience cheers), and splicing those portions together.

Patent Document 1: Japanese Patent Laid-Open No. 10-39890
Patent Document 2: Japanese Patent Laid-Open No. 2000-284793
Patent Document 3: Japanese Patent Laid-Open No. 3-80782
Non-Patent Document 1: Itabashi et al., "A method of speech summarization considering prosodic information," Proceedings of the Acoustical Society of Japan 2000 Spring Meeting, I, pp. 239-240

The technique of Japanese Patent Laid-Open No. 10-39890 has the problem that speech features such as the FFT spectrum and the LPC cepstrum depend on the speaker, so it cannot handle speech from an unspecified speaker or conversation among multiple unspecified speakers. Moreover, because it uses spectral information, it is difficult to apply to natural, unscripted speech and conversation, and difficult to use in an environment where several speakers talk at the same time.

The technique of Japanese Patent Laid-Open No. 2000-284793 recognizes important portions as phoneme-segment symbol sequences. Like the technique of Japanese Patent Laid-Open No. 10-39890, it is therefore difficult to apply to natural, unscripted speech and conversation, and difficult to use where several speakers talk at the same time. It also attempts topic summarization using word recognition results of the summarized speech and a thesaurus, but no quantitative evaluation is performed. In addition, it assumes that important words occur frequently and have long durations, yet it uses no linguistic information, so words unrelated to the topic are extracted.

Furthermore, natural, unscripted speech is often grammatically ill-formed, and the manner of speaking depends on the speaker. The method of Shuichi Itabashi et al., "A method of speech summarization considering prosodic information," Proceedings of the Acoustical Society of Japan 2000 Spring Meeting, I, pp. 239-240, therefore has difficulty extracting from the fundamental frequency the speech paragraphs that serve as units whose meaning can be understood.

The technique of Japanese Patent Laid-Open No. 3-80782 requires the extraction conditions to be set in advance. In addition, the extracted speech sections are short, so when they are cut out and spliced together for reconstruction, the speech characteristics are discontinuous before and after each cut, which makes the result hard to listen to.

The present invention was made in view of these drawbacks of the prior art. Its object is to provide a speech processing method that can determine whether speech is in an emphasized state or a calm state without requiring extraction conditions to be set in advance, and that does so, even for natural, unscripted speech and conversation, independently of the speaker, unaffected by simultaneous utterances of several speakers, and stably in noisy environments. A further object is to provide a speech processing method that uses this determination to extract summary sections of speech automatically, and programs for these methods.

According to the present invention:
a code is associated with each speech feature vector, a speech feature vector being a set of speech feature quantities that includes at least one of six quantities: the fundamental frequency, the power, the temporal variation characteristic of a dynamic feature quantity, and the inter-frame difference of each of these three;
a codebook is created that stores the code appearance probability with which each code appears in the emphasized state, the state transition probability with which each state transitions in the emphasized state, and the transition code appearance probability with which a code appears at a state transition in the emphasized state;
an emphasized-state acoustic model is created using the codebook, consisting of the code appearance probabilities, which serve as the initial state probabilities in the emphasized state, the transition code appearance probabilities for each state transition corresponding to the speech feature vectors in the emphasized state, and the state transition probabilities in the emphasized state corresponding to the state transitions;
(a-1) each frame of the speech signal is determined to be unvoiced or voiced;
(a-2) a portion that is sandwiched between unvoiced sections of at least a predetermined number of frames and contains a voiced section of at least one frame is taken as a speech sub-paragraph;
(a-3) the initial state probability in the emphasized state of the speech feature vector corresponding to the code obtained by quantizing the feature set of the first frame of the speech sub-paragraph is obtained from the codebook; for each of the second and subsequent frames of the speech sub-paragraph, the output probability in the emphasized state for each state transition, corresponding to the speech feature vector of the code obtained by quantizing that frame's feature set, is obtained from the emphasized-state acoustic model, and the transition probability in the emphasized state between frames in the speech sub-paragraph is obtained;
(b) the likelihood that the speech sub-paragraph is in the emphasized state is calculated from the maximum, or the total sum, over all state transition paths in the speech sub-paragraph of the product of the initial state probability, the output probabilities, and the transition probabilities in the emphasized state; and
(c) whether the speech sub-paragraph is in the emphasized state is determined from that likelihood.
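
As a rough illustration of steps (a-3) and (b), the sketch below scores one speech sub-paragraph with a discrete hidden Markov model whose outputs are attached to state transitions. The function name, the nested-list layout of the probability tables, and the convention that every path starts in state 0 are assumptions made for this example, not details given in the text.

```python
def subparagraph_likelihood(codes, init_prob, trans, emit, use_max=True):
    """Score a sequence of per-frame codes over all state-transition paths.

    codes     : quantized code of each frame in the sub-paragraph
    init_prob : init_prob[c] = appearance probability of code c (used for frame 1)
    trans     : trans[s][s2] = state transition probability s -> s2
    emit      : emit[s][s2][c] = probability of code c on transition s -> s2
    use_max   : True -> best single path, False -> total sum over all paths
    """
    n_states = len(trans)
    # score[s] holds the best (or summed) path probability ending in state s.
    score = [0.0] * n_states
    score[0] = init_prob[codes[0]]  # assumption: all paths start in state 0
    for code in codes[1:]:
        new_score = [0.0] * n_states
        for s in range(n_states):
            for s2 in range(n_states):
                p = score[s] * trans[s][s2] * emit[s][s2][code]
                if use_max:
                    new_score[s2] = max(new_score[s2], p)
                else:
                    new_score[s2] += p
        score = new_score
    return max(score) if use_max else sum(score)
```

Calling the function with `use_max=False` gives the total-sum variant of step (b). A decision rule for step (c), for instance a comparison against the score of a calm-state model, would be applied to the returned value.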

As described above, according to the present invention, emphasized states and speech paragraphs can be extracted from natural spoken speech, and the emphasized state of the utterance in each speech sub-paragraph can be determined. Using this method, speech paragraphs that contain emphasized speech sub-paragraphs can be cut out and spliced together to produce summary speech that conveys the important parts of the original. Moreover, neither the utterance-state determination nor the speech summarization depends on the speaker.

A speech processing method for determining the emphasized state of speech according to the present invention, and a speech summarization method using it, are described below with reference to the drawings. Reference examples, parts of which are also used in the embodiments of the invention, are described first.
First Reference Example
FIG. 1 shows the basic procedure of the speech summarization method of this reference example. In step S1, the input speech signal is analyzed to extract speech feature quantities. As is usual in speech processing, the feature set is normalized, and, as described later, it is used as a set of normalized parameters that do not depend on the speaker. In step S2, the speech sub-paragraphs of the input signal, and the speech paragraphs each composed of several speech sub-paragraphs, are extracted. In step S3, the utterance state of the frames making up each speech sub-paragraph is determined to be calm or emphasized. Based on this determination, summary speech is created in step S4.

A first reference example, applied to summarizing natural, unscripted speech and conversation, is described below. The speech feature quantities used are ones that, compared with spectral information and the like, can be obtained stably even in noisy environments and make the utterance-state determination less dependent on the speaker. From the input speech signal, the fundamental frequency f0, the power p, the temporal variation characteristic d of the dynamic feature quantity of the speech, and the unvoiced section T_s are extracted as speech feature quantities. Methods for extracting these are described in, for example, "Acoustic and Speech Engineering" (Sadaoki Furui, Kindai Kagaku-sha, 1992), "Speech Coding" (Takehiro Moriya, IEICE, 1998), "Digital Speech Processing" (Sadaoki Furui, Tokai University Press, 1985), and "Study on Speech Analysis Algorithms Based on the Composite Sinusoidal Model" (Shigeki Sagayama, doctoral dissertation, 1998). The temporal variation of the dynamic feature quantity is a parameter that serves as a measure of speaking rate, and the one described in Japanese Patent No. 2976998 may be used. That is, the temporal variation characteristic of the LPC spectrum coefficients, which reflect the spectral envelope, is obtained, and from it the speaking-rate coefficient, that is, the dynamic feature quantity, is derived. More specifically, the LPC cepstrum coefficients C₁(t), ..., C_K(t) are extracted for each frame, and the dynamic feature quantity d (dynamic measure) at time t is obtained by the following equation (1).
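
The equation itself does not appear in this text. As an assumption, taking the dynamic measure to be the usual regression (delta-cepstrum) form built from the symbols defined here, and not a reproduction of the patent's own figure, it can be read as:

```latex
d(t) = \sum_{k=1}^{K} \left[ \frac{\sum_{f=-F_0}^{F_0} f \, C_k(t+f)}{\sum_{f=-F_0}^{F_0} f^{2}} \right]^{2} \qquad (1)
```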

Here, ±F₀ is the number of speech frames before and after the current frame (it need not be an integer number of frames; a fixed time interval may be used instead), K is the order of the LPC cepstrum, and k = 1, 2, ..., K. As the speaking-rate coefficient, the number of local maxima of the dynamic feature quantity per unit time, or its rate of change per unit time, is used.

In the reference example, one frame is, for example, 100 ms long, and the frame start is shifted by 50 ms at a time. The average fundamental frequency f0' of the input signal is obtained for each frame, and likewise the average power p'. The differences between f0' of the current frame and f0' of the frames i frames before and after it are taken as Δf0'(-i) and Δf0'(i). Similarly, the differences Δp'(-i) and Δp'(i) between p' of the current frame and p' of the frames i frames before and after it are obtained. These values f0', Δf0'(-i), Δf0'(i), p', Δp'(-i), and Δp'(i) are then normalized. For example, f0', Δf0'(-i), and Δf0'(i) are each divided by the average fundamental frequency of the whole speech waveform. Alternatively, they may be divided by the average fundamental frequency of each speech sub-paragraph or speech paragraph (described later), or by the average fundamental frequency over every few seconds or every few minutes. The normalized values are written f0", Δf0"(-i), and Δf0"(i). Likewise, p', Δp'(-i), and Δp'(i) are normalized by dividing by the average power of the whole speech waveform subject to utterance-state determination, or alternatively by the average power of each speech sub-paragraph or speech paragraph, or by the average power over every few seconds or every few minutes. The normalized values are written p", Δp"(-i), and Δp"(i). The value of i is, for example, i = 4.
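
The normalization just described can be sketched as follows for either f0' or p'. This is a minimal example under stated assumptions: the function name is hypothetical, the whole input is used as the normalization span, the sign convention of the differences (current minus earlier, later minus current) is chosen for illustration, and frames near the edges reuse the first or last frame.

```python
def normalized_deltas(values, i=4):
    """Return (v", dv"(-i), dv"(+i)) for every frame.

    values : per-frame averages, e.g. f0' or p', one entry per frame
    i      : frame offset used for the differences (the text suggests i = 4)
    """
    mean = sum(values) / len(values)  # average over the whole waveform
    last = len(values) - 1
    result = []
    for t, v in enumerate(values):
        before = values[max(t - i, 0)]    # assumption: clamp at the first frame
        after = values[min(t + i, last)]  # assumption: clamp at the last frame
        result.append((v / mean, (v - before) / mean, (after - v) / mean))
    return result
```

Dividing by a per-sub-paragraph or per-paragraph mean instead would only change how `mean` is computed.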

The number of dynamic-measure peaks, that is, the number d_p of local maxima of the dynamic feature quantity, is counted within an interval of ±T₁ ms around, for example, the start time of the current frame. The interval has width 2T₁; because T₁ is chosen to be sufficiently longer than the frame length, for example about ten times the frame length, the centre of the interval may be placed at any point in the current frame. The difference Δd_p(-T₂) between this count and d_p within the interval of width 2T₁ centred on the time T₂ ms before the start of the current frame is then obtained. Similarly, the difference Δd_p(T₃) between d_p within ±T₁ ms and d_p within the interval of width 2T₁ centred on the time T₃ ms after the end of the current frame is obtained. T₁, T₂, and T₃ are sufficiently longer than the frame length; here, for example, T₁ = T₂ = T₃ = 450 ms. The lengths of the unvoiced sections before and after the frame are denoted t_SR and t_SF. In step S1, the values of these parameters are extracted for each frame.
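
The peak count d_p and its differences can be illustrated with a short sketch. It works in frame indices rather than milliseconds, and the function names, the strict-rise test used to define a local maximum, and the sign of the difference are assumptions made for the example.

```python
def count_peaks(d, center, half_width):
    """Count local maxima of the dynamic measure d within center +/- half_width."""
    lo = max(center - half_width, 1)
    hi = min(center + half_width, len(d) - 2)
    # A frame is a peak if it rises above its left neighbour and does not
    # fall below its right neighbour.
    return sum(1 for t in range(lo, hi + 1) if d[t - 1] < d[t] >= d[t + 1])


def peak_count_delta(d, center, half_width, shift):
    """Difference between d_p around `center` and d_p around `center + shift`."""
    return count_peaks(d, center, half_width) - count_peaks(d, center + shift, half_width)
```

A negative `shift` corresponds to Δd_p(-T₂) and a positive one to Δd_p(T₃), with T₁, T₂, and T₃ converted to frame counts by the caller.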

FIG. 2 shows an example of the method of extracting speech sub-paragraphs and speech paragraphs from the input speech in step S2. A speech sub-paragraph is the unit on which utterance-state determination is performed, and a speech paragraph is a section that contains at least one speech sub-paragraph and is sandwiched between unvoiced sections of, for example, 400 ms or longer.

In step S201, the unvoiced and voiced sections of the input speech signal are extracted. Voiced/unvoiced determination is usually regarded as equivalent to periodic/aperiodic determination and is often made from the peak value of the autocorrelation function or the modified correlation function. The modified correlation function is the autocorrelation function of the prediction residual obtained by removing the spectral envelope from the short-time spectrum of the input signal. A frame is judged voiced or unvoiced according to whether the peak of the modified correlation function exceeds a predetermined threshold, and the pitch period 1/f0 (fundamental frequency f0) is extracted from the delay time that gives the peak.
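
A minimal sketch of the peak-based voiced/unvoiced decision follows. For brevity it uses the plain normalized autocorrelation of the frame instead of the modified correlation function (the autocorrelation of the prediction residual) described above. The function name, the pitch search range, and the threshold value are assumptions for illustration.

```python
def voicing_and_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0, threshold=0.5):
    """Return (is_voiced, f0_in_hz_or_None) for one frame of samples."""
    n = len(frame)
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return False, None  # an all-zero frame is treated as unvoiced
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), n - 1)
    best_lag, best_r = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Autocorrelation at this lag, normalized by the zero-lag energy.
        r = sum(frame[t] * frame[t + lag] for t in range(n - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    if best_lag is None or best_r < threshold:
        return False, None
    # The lag of the peak is the pitch period in samples.
    return True, sample_rate / best_lag
```

A periodic frame gives a strong peak at its pitch period, so the reciprocal of that lag yields f0, as in the text.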

The above describes analyzing each speech feature quantity from the speech signal frame by frame. However, if the speech signal has already been encoded (that is, analyzed) frame by frame, for example by CELP (Code-Excited Linear Prediction), the speech features represented by the resulting coefficients or codes may be used instead. CELP codes generally contain encoded linear prediction coefficients, gain coefficients, the pitch period, and so on, so these features can be obtained by decoding the CELP code. For example, the absolute value or the square of the decoded gain coefficient can be used as the power, and voiced/unvoiced determination can be based on the ratio of the gain coefficient of the pitch component to that of the aperiodic component. The reciprocal of the decoded pitch period can be used as the pitch frequency, that is, the fundamental frequency. The LPC cepstrum used to compute the dynamic feature quantity of equation (1) can be obtained by converting the decoded LPC coefficients. If the CELP code contains LSP coefficients, they can first be converted to LPC coefficients and the cepstrum derived from those. Because CELP codes thus contain speech features usable in this invention, it suffices to decode the CELP code, take out the required feature set for each frame, and apply the processing below to those feature sets.

In step S202, when the durations t_SR and t_SF of the unvoiced sections on both sides of a voiced section are each at least a predetermined t_s seconds, the portion containing the voiced sections enclosed by those unvoiced sections is taken as a speech sub-paragraph S. The unvoiced duration t_s is, for example, t_s = 400 ms.

In step S203, the average power p of the voiced sections in the speech sub-paragraph, preferably those in its latter half, is compared with β times the average power P_S of the sub-paragraph, where β is a constant. If p < βP_S, the speech sub-paragraph is taken as a final speech sub-paragraph, and the span from the speech sub-paragraph following the previous final speech sub-paragraph up to the currently detected final speech sub-paragraph is determined to be a speech paragraph.
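
Steps S202 and S203 can be sketched as below. The function names are hypothetical, the start and end of the signal are treated as long silences, and "latter half" is taken to mean the last α + 1 voiced sections, with the values α = 3 and β = 0.8 that this reference example uses; all of these are assumptions of the sketch.

```python
def split_subparagraphs(voiced, min_silence_frames):
    """Group frames into sub-paragraphs separated by long unvoiced runs.

    voiced : per-frame truthy/falsy voicing flags
    Returns (start, end) frame index pairs, end exclusive.
    """
    segments = []
    start, last_voiced, silence = None, None, 0
    for t, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = t
            last_voiced, silence = t, 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # The silence is long enough: close the current sub-paragraph.
                segments.append((start, last_voiced + 1))
                start = None
    if start is not None:
        segments.append((start, last_voiced + 1))
    return segments


def is_final_subparagraph(section_powers, mean_power, alpha=3, beta=0.8):
    """True if the last alpha + 1 voiced sections are weak relative to the whole."""
    tail = section_powers[-(alpha + 1):]
    return sum(tail) / len(tail) < beta * mean_power
```

With 50 ms frame shifts, t_s = 400 ms would correspond to `min_silence_frames=8`. Consecutive sub-paragraphs up to and including one flagged by `is_final_subparagraph` then form one speech paragraph.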

FIG. 3 schematically shows voiced sections, speech sub-paragraphs, and speech paragraphs. Speech sub-paragraphs are extracted under the condition given above, namely that the unvoiced sections surrounding the voiced sections last at least t_s seconds. FIG. 3 shows the speech sub-paragraphs S_{j-1}, S_j, and S_{j+1}; S_j is described here. The speech sub-paragraph S_j consists of Q_j voiced sections, and its average power is P_j. The average power of the q-th voiced section V_q (q = 1, 2, ..., Q_j) in S_j is written p_q. Whether S_j is the final speech sub-paragraph of a speech paragraph B is determined from the power of the voiced sections in its latter half. When the mean of the average powers p_q of the voiced sections from q = Q_j - α to Q_j is smaller than β times the average power P_j of S_j, that is, when

$$\frac{1}{\alpha+1}\sum_{q=Q_j-\alpha}^{Q_j} p_q < \beta P_j \qquad (2)$$

is satisfied, S_j is taken to be the final speech sub-paragraph of the speech paragraph B. In equation (2), α and β are constants; α is at most Q_j/2, and β is, for example, about 0.5 to 1.5. These values are determined in advance by experiment so as to optimize speech-paragraph extraction. The average power p_q of a voiced section is the average power over all frames in that section. In this reference example, α = 3 and β = 0.8. In this way, with the final speech sub-paragraphs as delimiters, the group of speech sub-paragraphs between adjacent final speech sub-paragraphs can be determined to be a speech paragraph.

FIG. 4 shows an example of the method of determining the utterance state of a speech sub-paragraph in step S3 of FIG. 1. Here, the utterance state refers to whether the speaker is speaking with emphasis or speaking calmly. In step S301, the feature sets of the input speech sub-paragraph are vector-quantized using a codebook created in advance. As detailed later, the utterance state is determined using a predetermined set of one or more of the speech feature quantities described above: the fundamental frequency f0", its differences Δf0"(-i) and Δf0"(i) from the frames i frames before and after, the average power p", its differences Δp"(-i) and Δp"(i) from the frames i frames before and after, the number of dynamic-measure peaks d_p, and its differences Δd_p(-T) and Δd_p(T). Examples of feature sets are given later. The codebook stores in advance, for each code (index), the values of a quantized set of speech features as a speech feature vector. Among the vectors stored in the codebook, the one closest to the feature set of each frame of the input speech, or of speech that has already been analyzed, is identified. In general, the vector that minimizes the distortion (distance) between the feature set of the input signal and the speech feature vector in the codebook is chosen.

Creation of the Codebook
FIG. 5 shows an example of how this codebook is created. A large amount of training speech is collected from subjects and labeled so that utterances in the calm state and utterances in the emphasized state can be distinguished (S501).

For example, for utterances in Japanese, reasons for judging a subject's utterance to be in the emphasized state include the following:
(a) the voice is loud and nouns or conjunctions are drawn out;
(b) the beginning of the utterance is drawn out to assert a change of topic, or the voice is raised to summarize an opinion;
(c) the voice is made louder and higher to stress important nouns and the like;
(d) the voice is high-pitched but not particularly loud;
(e) the speaker, with a wry smile, glosses over his or her real feelings out of impatience;
(f) the end of the phrase is high-pitched, as if seeking agreement from or putting a question to those around;
(g) the end of the phrase is loud, spoken slowly and forcefully as if to drive the point home;
(h) the voice is loud and high, and the speaker cuts in or asserts something in a louder voice than the other party;
(i) the speaker whispers real feelings or secrets that one would hesitate to say aloud, or a normally loud speaker mumbles something important in a low voice.
In this example, the calm state is taken to be speech that is none of (a) to (i) above and that the subject felt to be calm.

The description above treats utterances as the object of emphasized-state determination, but emphasized states can also be identified in music. For a song, when the emphasized state is to be identified from the singing voice, reasons for perceiving emphasis include the following:
(a') the voice is loud and high;
(b') the voice is forceful;
(c') the voice is high and strongly accented;
(d') the voice is high and the voice quality changes;
(e') the voice is drawn out and loud;
(f') the voice is loud, high, and strongly accented;
(g') the voice is loud, high, and shouting;
(h') the voice is high and the accent changes;
(i') the voice is drawn out and loud, with a high phrase ending;
(j') the voice is high and drawn out;
(k') the voice is drawn out, shouting, and high;
(l') the phrase ending rises forcefully;
(m') the singing is slow and strong;
(n') the melody is irregular;
(o') the melody is irregular and the voice is high.
Emphasized states can also be identified in purely instrumental pieces that contain no voice. Reasons for perceiving emphasis there include the following:
(a") the power increases over the whole emphasized portion;
(b") the difference in pitch is large;
(c") the power increases;
(d") the number of instruments changes;
(e") the melody or tempo changes.
By creating a codebook based on these, songs and instrumental music as well as utterances can be summarized. Accordingly, the term "speech" as used in the claims includes songs and instrumental music.

For each labeled section of the calm state and the emphasized state, speech features are extracted as in step S1 of FIG. 1 (S502), and the set of speech features to be used for state determination is selected (S503). Using these parameters from the calm-state and emphasized-state labeled sections, a codebook is created with the LBG algorithm (S504). The LBG algorithm is described, for example, in Y. Linde, A. Buzo and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. COM-28, pp. 84-95, 1980. The codebook size is variable as 2^m (m is an integer of 1 or more), and quantization vectors corresponding to the m-bit codes C = 00...0 to C = 11...1 are determined in advance. In creating the codebook, it is preferable to standardize the speech features, for example by their mean and standard deviation taken over each speech sub-paragraph, over each suitable longer section, or over the entire training speech, and to generate the 2^m speech feature vectors from the standardized values.
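The standardization and codebook training described above might look like the following sketch. It is a simplified LBG variant under stated assumptions: a fixed number of refinement passes replaces the usual distortion-threshold stopping rule, the split perturbation `eps` is arbitrary, and the training data are random numbers rather than real speech features. The names `standardize`, `centroid`, and `lbg` are hypothetical.

```python
import random
from math import dist
from statistics import mean, pstdev

def standardize(vectors):
    """Z-score each feature; return scaled vectors and the (mean, std) pairs."""
    stats = [(mean(col), pstdev(col) or 1.0) for col in zip(*vectors)]
    scaled = [tuple((x - mu) / sd for x, (mu, sd) in zip(v, stats)) for v in vectors]
    return scaled, stats

def centroid(cluster):
    """Component-wise mean of a list of vectors."""
    return tuple(mean(col) for col in zip(*cluster))

def lbg(vectors, m, eps=0.01, iters=20):
    """Grow a codebook to 2**m vectors by splitting and nearest-neighbour refinement."""
    codebook = [centroid(vectors)]
    for _ in range(m):
        # Split every codeword into a slightly perturbed pair.
        codebook = [tuple(x * (1 + s * eps) + s * eps for x in c)
                    for c in codebook for s in (1, -1)]
        for _ in range(iters):
            clusters = [[] for _ in codebook]
            for v in vectors:
                h = min(range(len(codebook)), key=lambda k: dist(v, codebook[k]))
                clusters[h].append(v)
            # Keep the old codeword when a cell receives no training vector.
            codebook = [centroid(cl) if cl else c for cl, c in zip(clusters, codebook)]
    return codebook

# Synthetic stand-in for training features (three features per frame).
random.seed(0)
training = [(random.gauss(0, 1), random.gauss(5, 2), random.gauss(-3, 0.5))
            for _ in range(200)]
scaled, stats = standardize(training)
codebook = lbg(scaled, m=3)  # 2**3 = 8 code vectors
```

The `stats` pairs are returned so that the same mean and standard deviation can later be applied to input speech before quantization.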

Returning to the utterance state determination process of FIG. 4, in step S301 the speech features obtained for each frame of the input speech sub-paragraph are standardized with the same mean and standard deviation that were used in creating the codebook. The standardized speech features are vector-quantized (encoded) using the codebook, yielding for each frame the code corresponding to the quantization vector. Of the speech feature parameters extracted from the input speech signal, the set used for utterance state determination is the same as the set used for creating the codebook.
To identify speech sub-paragraphs that contain the emphasized state, the likelihood of the utterance state is obtained for both the calm state and the emphasized state, using the codes C (indices of the quantized speech feature vectors) in the speech sub-paragraph. For this purpose, the appearance probability of each code is obtained in advance for the calm state and for the emphasized state, and each appearance probability is stored in the codebook together with its code. An example of how to obtain these appearance probabilities follows. Let n be the number of frames in one labeled section of the training speech used for creating the codebook, and let C_1, C_2, C_3, ..., C_n be the time series of codes of the speech feature vectors obtained from the respective frames. The probability P_Aemp that the labeled section A is in the emphasized state and the probability P_Anrm that it is in the calm state are then expressed by
P_Aemp = P_emp(C_1) P_emp(C_2 | C_1) ... P_emp(C_n | C_1 ... C_{n-1}) (3)
P_Anrm = P_nrm(C_1) P_nrm(C_2 | C_1) ... P_nrm(C_n | C_1 ... C_{n-1}) (4)
where P_emp(C_i | C_1 ... C_{i-1}) is the conditional probability that code C_i appears in the emphasized state following the code sequence C_1 ... C_{i-1}, and P_nrm(C_i | C_1 ... C_{i-1}) is likewise the probability that code C_i appears in the calm state following C_1 ... C_{i-1}. P_emp(C_1) is obtained by quantizing the speech feature vector of every frame of all the training speech with the codebook, counting the total number of occurrences of code C_1 in the portions labeled as the emphasized state, and dividing that count by the total number of codes (= number of frames) in the speech data labeled as the emphasized state. P_nrm(C_1) is the number of occurrences of code C_1 in the portions labeled as the calm state divided by the total number of codes in the speech data labeled as the calm state.
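The counting procedure for the single-appearance probabilities can be sketched as below. The labeled code sequences and the codebook size of 4 are made-up illustration data, and `unigram` is a hypothetical helper name.

```python
from collections import Counter

def unigram(code_sequences, codebook_size):
    """P(C_h) = occurrences of code h in the labeled frames / total labeled frames."""
    counts = Counter(code for seq in code_sequences for code in seq)
    total = sum(counts.values())
    return [counts[h] / total for h in range(codebook_size)]

# Hypothetical labeled sections, each a list of per-frame codes.
emphasized_sections = [[3, 3, 1, 3], [2, 3, 3, 0]]
calm_sections = [[0, 1, 0, 0], [1, 0, 2, 0]]

p_emp = unigram(emphasized_sections, 4)  # single-appearance probabilities, emphasized
p_nrm = unigram(calm_sections, 4)        # single-appearance probabilities, calm
```

A code that never occurs under one label gets probability zero here, which is one reason some form of smoothing is needed before such values are multiplied together.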

To simplify the calculation of these conditional probabilities, this example uses an N-gram model (N < i). The N-gram model approximates the appearance of an event at a given time as depending only on the appearance of the N-1 events immediately preceding it; for example, the probability that code C_i appears in the i-th frame is obtained as P(C_i) = P(C_i | C_{i-N+1} ... C_{i-1}). Applying the N-gram model to the conditional probabilities P_emp(C_i | C_1 ... C_{i-1}) and P_nrm(C_i | C_1 ... C_{i-1}) in equations (3) and (4) gives the approximations
P_emp(C_i | C_1 ... C_{i-1}) = P_emp(C_i | C_{i-N+1} ... C_{i-1}) (5)
P_nrm(C_i | C_1 ... C_{i-1}) = P_nrm(C_i | C_{i-N+1} ... C_{i-1}) (6)
All of the N-gram conditional probabilities P_emp(C_i | C_{i-N+1} ... C_{i-1}) and P_nrm(C_i | C_{i-N+1} ... C_{i-1}) that approximate the conditional probabilities in equations (3) and (4) are obtained from the quantized code sequences of the labeled training speech. However, a code sequence corresponding to the quantized code sequence of the input speech signal may not have occurred in the training speech. For this reason, the conditional probability is obtained by interpolating the high-order (that is, long code sequence) conditional probability with the lower-order conditional probability and the single-appearance probability. Specifically, linear interpolation is applied using the trigram (N = 3), bigram (N = 2), and unigram (N = 1) defined below:
N = 3 (trigram): P_emp(C_i | C_{i-2} C_{i-1}), P_nrm(C_i | C_{i-2} C_{i-1})
N = 2 (bigram): P_emp(C_i | C_{i-1}), P_nrm(C_i | C_{i-1})
N = 1 (unigram): P_emp(C_i), P_nrm(C_i)
Using these three appearance probabilities of C_i in the emphasized state and the three in the calm state, P_emp(C_i | C_{i-2} C_{i-1}) and P_nrm(C_i | C_{i-2} C_{i-1}) are obtained by the linear interpolation formulas
P_emp(C_i | C_{i-2} C_{i-1}) = λ_emp1 P_emp(C_i | C_{i-2} C_{i-1}) + λ_emp2 P_emp(C_i | C_{i-1}) + λ_emp3 P_emp(C_i) (7)
P_nrm(C_i | C_{i-2} C_{i-1}) = λ_nrm1 P_nrm(C_i | C_{i-2} C_{i-1}) + λ_nrm2 P_nrm(C_i | C_{i-1}) + λ_nrm3 P_nrm(C_i) (8)
Here the left-hand sides denote the interpolated values, and the right-hand sides use the trigram, bigram, and unigram obtained from the training speech.
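The effect of the interpolation in equations (7) and (8) can be sketched with a small function. The table layout (dictionaries keyed by code tuples), the name `interpolate`, and the probability values are assumptions for illustration; the weights 0.41, 0.41, 0.08 are the ones the specification reports for its experiments.

```python
def interpolate(c_prev2, c_prev1, c, tri, bi, uni, lambdas):
    """Linear interpolation of trigram, bigram and unigram, as in eq. (7)/(8)."""
    l1, l2, l3 = lambdas
    # Unseen code sequences contribute zero for that order.
    return (l1 * tri.get((c_prev2, c_prev1, c), 0.0)
            + l2 * bi.get((c_prev1, c), 0.0)
            + l3 * uni.get(c, 0.0))

# Hypothetical emphasized-state tables for a tiny codebook.
tri_emp = {(1, 2, 3): 0.5}
bi_emp = {(2, 3): 0.25}
uni_emp = {3: 0.1}
lambdas = (0.41, 0.41, 0.08)

seen = interpolate(1, 2, 3, tri_emp, bi_emp, uni_emp, lambdas)
unseen = interpolate(0, 2, 3, tri_emp, bi_emp, uni_emp, lambdas)
```

For the sequence (0, 2, 3), which has no trigram entry, the bigram and unigram terms still yield a non-zero probability, which is the purpose of the interpolation.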

Let n be the number of frames of training data labeled as the emphasized state for the trigram, and let the codes C_1, C_2, ..., C_n be obtained in time series. The re-estimation formulas for λ_emp1, λ_emp2, and λ_emp3 are then as follows.
[re-estimation equations not reproduced in this text]
λ_nrm1, λ_nrm2, and λ_nrm3 are obtained in the same way.
In this example, when the number of frames in labeled section A is F_A and the codes obtained are C_1, C_2, ..., C_{F_A}, the probability P_Aemp that labeled section A is in the emphasized state and the probability P_Anrm that it is in the calm state are
P_Aemp = P_emp(C_3 | C_1 C_2) ... P_emp(C_{F_A} | C_{F_A-2} C_{F_A-1}) (9)
P_Anrm = P_nrm(C_3 | C_1 C_2) ... P_nrm(C_{F_A} | C_{F_A-2} C_{F_A-1}) (10)
So that this calculation can be carried out, the trigram, bigram, and unigram described above are obtained for every code and stored in the codebook. That is, for each code the codebook stores the set of its speech feature vector, its appearance probability in the emphasized state, and its appearance probability in the calm state. As the appearance probability in the emphasized state, use is made of the probability that the code appears in the emphasized state independently of the codes that appeared in past frames (the single-appearance probability) and/or the conditional probability that the code appears in the emphasized state following each possible sequence of codes in a predetermined number of immediately preceding consecutive frames. Likewise, as the appearance probability in the calm state, use is made of the single-appearance probability that the code appears in the calm state independently of the codes that appeared in past frames and/or the conditional probability that the code appears in the calm state following each possible sequence of codes in a predetermined number of immediately preceding consecutive frames.
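One possible in-memory layout for such a codebook entry is sketched below. The class name `CodebookEntry`, its field names, and all numeric values are hypothetical; the specification only states which quantities are stored per code, not how.

```python
from dataclasses import dataclass, field

@dataclass
class CodebookEntry:
    vector: tuple   # quantized speech feature vector for this code
    p_emp: float    # single-appearance probability, emphasized state
    p_nrm: float    # single-appearance probability, calm state
    # Conditional probabilities keyed by the preceding code(s).
    bigram_emp: dict = field(default_factory=dict)   # {prev: P_emp(code | prev)}
    bigram_nrm: dict = field(default_factory=dict)   # {prev: P_nrm(code | prev)}
    trigram_emp: dict = field(default_factory=dict)  # {(prev2, prev1): P_emp(...)}
    trigram_nrm: dict = field(default_factory=dict)  # {(prev2, prev1): P_nrm(...)}

# Hypothetical entry for code 3 of a small codebook.
codebook = {
    3: CodebookEntry(
        vector=(1.0, 1.0, 1.0),
        p_emp=0.10,
        p_nrm=0.02,
        bigram_emp={2: 0.25},
        bigram_nrm={2: 0.05},
        trigram_emp={(1, 2): 0.50},
    )
}
```

Keeping the emphasized-state and calm-state tables side by side in one entry means a single lookup by code serves both likelihood calculations.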

For example, as shown in FIG. 12, the codebook stores for each code C1, C2, ... its speech feature vector, its single-appearance probabilities for the emphasized state and the calm state, and its conditional probabilities for the emphasized state and the calm state, as a set. Here the codes C1, C2, C3, ... are the codes (indices) corresponding to the speech feature vectors of the codebook and are the m-bit values "00...00", "00...01", "00...10", ... respectively. The h-th code in the codebook is denoted Ch; for example, C1 denotes the first code.
Examples of the unigram and bigram in the emphasized state and the calm state are described below for the case where the parameters f0", p", d_p are used as a preferred set of speech features for this invention and the codebook size (number of speech feature vectors) is 2^5. FIG. 6 shows the unigram. The ordinate is P_emp(Ch) and P_nrm(Ch), and the abscissa is the value of code Ch; for each Ch the left bar is P_emp(Ch) and the right bar is P_nrm(Ch). In this example the unigram of code C17 was
P_emp(C17) = 0.065757
P_nrm(C17) = 0.024974
FIG. 6 shows a significant difference between P_emp(Ch) and P_nrm(Ch) for any Ch, so the unigrams of the codes obtained by vector-quantizing sets of speech features in the emphasized state and in the calm state are separated from each other. FIG. 7 shows the bigram. Some of the values of P_emp(C_i | C_{i-1}) and P_nrm(C_i | C_{i-1}) are shown in FIGS. 14 to 16. Here i is the time-series number corresponding to the frame number, and each code C can take any code Ch. In this example the bigram of code Ch = C27 was as shown in FIG. 8. The ordinate is P_emp(C27 | C_{i-1}) and P_nrm(C27 | C_{i-1}), and the abscissa is the code Ch = 0, 1, ..., 31; for each C_{i-1} the left bar is P_emp(C27 | C_{i-1}) and the right bar is P_nrm(C27 | C_{i-1}). In this example the probabilities of a transition from code C9 to code C27 were
P_emp(C27 | C9) = 0.11009
P_nrm(C27 | C9) = 0.05293
FIG. 8 shows a significant difference between P_emp(C27 | C_{i-1}) and P_nrm(C27 | C_{i-1}) for any code C_{i-1}, and FIGS. 14 to 16 show the same result for any code C_i. The bigrams of the codes obtained by vector-quantizing sets of speech features in the emphasized state and in the calm state therefore take different values and are separated. This ensures that the bigram calculated on the basis of the codebook gives different probabilities for the emphasized state and the calm state.
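One way to read the reported values is as ratios between the emphasized-state and calm-state probabilities. This reading is an interpretation added here for illustration, not a quantity the specification itself computes; the variable names are hypothetical.

```python
# Values reported for the 2^5 codebook example.
p_emp_c17, p_nrm_c17 = 0.065757, 0.024974        # unigram of code C17
p_emp_c27_c9, p_nrm_c27_c9 = 0.11009, 0.05293    # bigram C9 -> C27

# Ratios above 1 mean the code (or transition) is more typical of emphasis.
unigram_ratio = p_emp_c17 / p_nrm_c17
bigram_ratio = p_emp_c27_c9 / p_nrm_c27_c9
```

Both ratios come out above 2, so observing code C17, or the transition from C9 to C27, favors the emphasized state by roughly a factor of two or more.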

In step S302 of FIG. 4, the likelihood of the utterance state is obtained for the calm state and the emphasized state from the probabilities stored in the codebook for the codes of all frames of the input speech sub-paragraph. FIG. 9 is a schematic diagram of a reference example. Of the speech sub-paragraph starting at time t, the first four frames are denoted i to i+3. As stated above, the frame length here is 100 ms and the frame shift is 50 ms. Code C_1 is obtained for frame number i at times t to t+100, code C_2 for frame number i+1 at times t+50 to t+150, code C_3 for frame number i+2 at times t+100 to t+200, and code C_4 for frame number i+3 at times t+150 to t+250. That is, when the codes in frame order are C_1, C_2, C_3, C_4, the trigram can be calculated for frames with frame number i+2 and later. Letting P_Semp be the probability that the speech sub-paragraph S is in the emphasized state and P_Snrm the probability that it is in the calm state, the probabilities up to the fourth frame are
P_Semp = P_emp(C_3 | C_1 C_2) P_emp(C_4 | C_2 C_3) (11)
P_Snrm = P_nrm(C_3 | C_1 C_2) P_nrm(C_4 | C_2 C_3) (12)
In this example, the single-appearance probabilities of C_3 and C_4 in the emphasized state and the calm state, the conditional probabilities that C_3 appears after C_2 in the emphasized state and the calm state, and the conditional probabilities that C_3 appears after the consecutive C_1, C_2 and that C_4 appears after the consecutive C_2, C_3 in the emphasized state and the calm state are obtained from the codebook, giving
P_emp(C_3 | C_1 C_2) = λ_emp1 P_emp(C_3 | C_1 C_2) + λ_emp2 P_emp(C_3 | C_2) + λ_emp3 P_emp(C_3) (13)
P_emp(C_4 | C_2 C_3) = λ_emp1 P_emp(C_4 | C_2 C_3) + λ_emp2 P_emp(C_4 | C_3) + λ_emp3 P_emp(C_4) (14)
P_nrm(C_3 | C_1 C_2) = λ_nrm1 P_nrm(C_3 | C_1 C_2) + λ_nrm2 P_nrm(C_3 | C_2) + λ_nrm3 P_nrm(C_3) (15)
P_nrm(C_4 | C_2 C_3) = λ_nrm1 P_nrm(C_4 | C_2 C_3) + λ_nrm2 P_nrm(C_4 | C_3) + λ_nrm3 P_nrm(C_4) (16)
Using equations (13) to (16), the probability P_Semp of the emphasized state and the probability P_Snrm of the calm state up to the fourth frame, given by equations (11) and (12), are obtained. Here P_emp(C_3 | C_1 C_2) and P_nrm(C_3 | C_1 C_2) can be calculated at frame number i+2.

The above describes the calculation up to the fourth frame i+3. In this example, when the codes obtained from the frames of a speech sub-paragraph S with F_S frames are C_1, C_2, ..., C_{F_S}, the probability P_Semp that the speech sub-paragraph S is in the emphasized state and the probability P_Snrm that it is in the calm state are calculated by
P_Semp = P_emp(C_3 | C_1 C_2) ... P_emp(C_{F_S} | C_{F_S-2} C_{F_S-1}) (17)
P_Snrm = P_nrm(C_3 | C_1 C_2) ... P_nrm(C_{F_S} | C_{F_S-2} C_{F_S-1}) (18)
If P_Semp > P_Snrm, the speech sub-paragraph S is judged to be in the emphasized state; if P_Semp ≤ P_Snrm, it is judged to be in the calm state.
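The products in equations (17) and (18) and the comparison that follows can be sketched as below. Summing logarithms instead of multiplying probabilities is a numerical choice made for this sketch (it avoids underflow on long sub-paragraphs) and is not stated in the specification; the `floor` value and the toy conditional-probability functions are likewise assumptions.

```python
from math import log

def log_likelihood(codes, cond_prob, floor=1e-12):
    """Log of eq. (17)/(18): sum of log P(C_i | C_{i-2} C_{i-1}) for i = 3..F_S."""
    return sum(log(max(cond_prob(a, b, c), floor))
               for a, b, c in zip(codes, codes[1:], codes[2:]))

def is_emphasized(codes, cond_emp, cond_nrm):
    """Emphasized if P_Semp > P_Snrm, otherwise calm (ties count as calm)."""
    return log_likelihood(codes, cond_emp) > log_likelihood(codes, cond_nrm)

# Toy stand-ins for the interpolated trigram probabilities:
# code 3 is typical of emphasis, code 0 is typical of calm speech.
def cond_emp(prev2, prev1, code):
    return 0.6 if code == 3 else 0.1

def cond_nrm(prev2, prev1, code):
    return 0.6 if code == 0 else 0.1

emphasized = is_emphasized([1, 3, 3, 3, 0], cond_emp, cond_nrm)
calm = is_emphasized([0, 0, 0, 0, 3], cond_emp, cond_nrm)
```

A sub-paragraph with fewer than three frames yields two empty sums, so under this sketch it is treated as calm.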

The summary speech in step S4 of FIG. 1 is created by joining together the speech paragraphs that contain a speech sub-paragraph determined to be in the emphasized state in step S302 of FIG. 4.
Using the method of this reference example, a speech summarization experiment was conducted on recordings of in-company meetings held in natural spoken language and conversation, without the use of manuscripts. In this example, the emphasized state was determined and the summary portions were extracted under conditions different from those shown in FIGS. 6 to 8.
An experimental example is described in which the codebook size (number of codes) is 256, one frame is 50 ms, the shift is 50 ms, and the set of speech features making up each speech feature vector stored in the codebook is
[f0", Δf0"(1), Δf0"(-1), Δf0"(4), Δf0"(-4), p", Δp"(1), Δp"(-1), Δp"(4), Δp"(-4), d_p, Δd_p(T), Δd_p(-T)]
The utterance state determination experiment used the speech features of speech sections that subjects had labeled as the emphasized state or the calm state. For the 707 emphasized-state labels and 807 calm-state labels used for creating the codebook, the utterance state was determined from the codes of all frames of each labeled section using equations (9) and (10); this was the close experiment. A close experiment uses speech data that was used for creating the codebook, and an open experiment uses speech data that was not.

For the 173 emphasized-state labels and 193 calm-state labels not used for creating the codebook, the utterance state was likewise determined from the codes of all frames of each labeled section using equations (9) and (10); this was the open experiment.
Evaluation used recall and precision. Recall is the proportion of the correct-answer set defined by the subjects that the method of this reference example determined correctly, and precision is the proportion of the utterance states determined by the method that were correct. The results were:
Close experiment, emphasized state: recall 89%, precision 87%
Close experiment, calm state: recall 88%, precision 90%
Open experiment, emphasized state: recall 84%, precision 91%
Open experiment, calm state: recall 92%, precision 87%
The interpolation weights were
λ_emp1 = λ_nrm1 = 0.41
λ_emp2 = λ_nrm2 = 0.41
λ_emp3 = λ_nrm3 = 0.08
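The two evaluation measures defined above can be computed as in the following sketch. The label strings and the five-section example are invented for illustration and do not reproduce the experimental data; `recall_precision` is a hypothetical helper name.

```python
def recall_precision(truth, predicted, target):
    """Recall and precision for one utterance state (`target`)."""
    hits = sum(t == target and p == target for t, p in zip(truth, predicted))
    n_truth = sum(t == target for t in truth)      # sections labeled `target` by subjects
    n_pred = sum(p == target for p in predicted)   # sections the method judged `target`
    return hits / n_truth, hits / n_pred

# Hypothetical subject labels and method decisions for five labeled sections.
truth = ["emp", "emp", "emp", "nrm", "nrm"]
predicted = ["emp", "emp", "nrm", "emp", "nrm"]

emp_recall, emp_precision = recall_precision(truth, predicted, "emp")
nrm_recall, nrm_precision = recall_precision(truth, predicted, "nrm")
```

Reporting both measures for both states, as the experiment does, shows whether a high recall for one state is bought at the cost of false detections in the other.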

As described above, there are 29 speech features when the range of reference frames before and after is ±i (i = 4) as in the reference example, and the number of possible combinations is Σ C(29, n), where the sum runs over n = 1 to 29 and C(29, n) is the number of combinations of n items taken from 29. A reference example is described below that uses codebooks composed of vectors formed from 18 of these combinations of speech features. In the following, the frame length is again 100 ms and the shift is 50 ms. FIG. 17 shows the combination numbers of the 18 sets of speech features and the speech features in each. The utterance state determination experiment used the speech features of the sections that subjects had labeled as the emphasized state or the calm state. As the close experiment, the utterance state was determined for the 613 emphasized-state labels and 803 calm-state labels used for creating the codebook; as the open experiment, it was determined for the 171 emphasized-state labels and 193 calm-state labels not used for creating the codebook. The codebook size was 128, and the interpolation weights were
λ_emp1 = λ_nrm1 = 0.41
λ_emp2 = λ_nrm2 = 0.41
λ_emp3 = λ_nrm3 = 0.08
FIG. 10 shows the recall of the close experiment and the open experiment for the 18 combinations of speech features. The ordinate is the recall and the abscissa is the combination number of the parameters; circles mark the close experiment and crosses the open experiment. The mean and variance of the recall were
Close experiment: mean 0.94546, variance 0.00013507
Open experiment: mean 0.78788, variance 0.00046283
In FIG. 10, solid lines are drawn at recalls of 0.95 and 0.8, corresponding to the close experiment and the open experiment respectively. For example, to obtain a recall of at least 0.95 in the close experiment and at least 0.8 in the open experiment, any of the speech feature combinations No. 7, No. 11, and No. 18 can be used. All of these include the temporal variation characteristic d_p of the dynamic feature, which shows that it is an important parameter. Sets No. 7 and No. 11 are further characterized by containing the fundamental frequency, the power, the temporal variation characteristic of the dynamic feature, and their inter-frame differences. Set No. 19 falls slightly short of the above condition in the open experiment, but it consists of only three features, the fundamental frequency f0", the power p", and the temporal variation characteristic d_p of the dynamic feature, and so has the advantage of requiring little computation.
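The size of the search space mentioned above, the sum of C(29, n) for n = 1 to 29, can be checked directly. The short calculation below is only a worked illustration of that count; the variable name is arbitrary.

```python
from math import comb

# Number of non-empty subsets of 29 speech features.
total_combinations = sum(comb(29, n) for n in range(1, 30))
```

The sum equals 2^29 - 1, more than half a billion combinations, which is why only a small number of selected combinations were evaluated.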

The results in FIG. 10 show that, by choosing the combination of speech features, a recall of 0.8 or higher can be achieved when determining the utterance state (open experiment) of labels not used for creating the codebook, namely emphasized-state labels that subjects assigned for reasons (a) to (i) above and calm-state labels assigned to utterances that fall under none of (a) to (i) and were judged calm. This also indicates that the codebook in use has been created properly.
An experimental example concerning the codebook-size dependence of the 18th combination of speech features in FIG. 17 is described next. FIG. 11 shows the recall of the close experiment and the open experiment when the codebook size is varied as 2, 4, 8, 16, 32, 64, 128, 256. The ordinate is the recall and the abscissa is n in 2^n; the solid curve is the close experiment and the broken curve is the open experiment. The interpolation weights were
λ_emp1 = λ_nrm1 = 0.41
λ_emp2 = λ_nrm2 = 0.41
λ_emp3 = λ_nrm3 = 0.08
FIG. 11 shows that the recall rises as the codebook size increases; for example, a recall of 0.8 or higher can be achieved by choosing the codebook size (the number of codes stored in the codebook). Even with a codebook size of 2, the recall is 0.5 or higher, which is thought to be due to the use of conditional probabilities. According to this reference example, when a codebook is created by vector-quantizing the sets of speech features of the emphasized state, which subjects assigned for reasons (a) to (i) above, and of the calm state, assigned to utterances that fall under none of (a) to (i) and were judged calm, the appearance probabilities of any given code in the emphasized state and in the calm state are statistically separated, so the utterance state can be determined.

Using the method of this reference example, summary speech was created from one hour of meeting speech held in natural spoken language and conversation, without the use of manuscripts. The summary speech consisted of 23 speech paragraphs, and its duration was 11% of that of the original speech. To evaluate the speech paragraphs, subjects listened to the 23 speech paragraphs and judged that the meaning could be understood for 83% of them. To evaluate the summary speech, minutes prepared by subjects from listening to the summary speech were compared with minutes prepared from listening to the original speech. The recall was 86% and the detection rate was 83%. This shows that the speech summarization method of this invention makes it possible to summarize natural spoken language and conversation held without manuscripts.

Another form of the speech emphasis state determination method of this reference example will now be described. As before, the speech feature values are extracted for each frame of the input speech signal in the same manner as step S1 in FIG. 1. Using, for example, the codebook shown in FIG. 12, the set of speech feature values of each frame is vector-quantized (vector-coded) as described with reference to FIG. 4, and the probability that the resulting code appears in the emphasized state and the probability that it appears in the calm state are obtained from the appearance probabilities stored in the codebook in correspondence with the codes. Here, however, the appearance probability of the code of each frame is obtained as a conditional appearance probability conditioned on the code sequence of the two immediately preceding consecutive frames, and the utterance state, that is, whether or not the frame is in the emphasized state, is determined frame by frame. Specifically, in the utterance state likelihood calculation of step S303 in FIG. 4, when the sets of speech feature values have been vector-coded as shown in FIG. 9, the emphasized-state likelihood P_e(i+2) and the calm-state likelihood P_n(i+2) at frame number i+2 are calculated by
P_e(i+2) = P_emp(C₃ | C₁C₂)
P_n(i+2) = P_nrm(C₃ | C₁C₂)
In this case too, it is preferable to calculate P_emp(C₃ | C₁C₂) by equation (13) and P_nrm(C₃ | C₁C₂) by equation (15). The values P_e(i+2) and P_n(i+2) thus obtained are compared: if P_e(i+2) > P_n(i+2), frame number i+2 is determined to be in the emphasized state; otherwise, the frame is determined not to be in the emphasized state.
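The frame-by-frame decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the reference example: the dictionaries `p_emp` and `p_nrm`, the function name `classify_frames`, and the fallback value `floor` for code triples missing from the tables are all hypothetical, and the interpolated estimates of equations (13) and (15) are not reproduced here.

```python
from typing import Dict, List, Tuple

Trigram = Tuple[int, int, int]  # (C1, C2, C3): two preceding codes, then the current code


def classify_frames(codes: List[int],
                    p_emp: Dict[Trigram, float],
                    p_nrm: Dict[Trigram, float],
                    floor: float = 1e-6) -> List[bool]:
    """Return one flag per frame, starting at the third frame: True if emphasized."""
    flags = []
    for k in range(2, len(codes)):
        tri = (codes[k - 2], codes[k - 1], codes[k])
        p_e = p_emp.get(tri, floor)  # stands in for P_emp(C3 | C1 C2)
        p_n = p_nrm.get(tri, floor)  # stands in for P_nrm(C3 | C1 C2)
        flags.append(p_e > p_n)      # emphasized only when strictly greater
    return flags


# Toy tables with made-up numbers, for illustration only
p_emp = {(1, 2, 3): 0.6, (2, 3, 4): 0.1}
p_nrm = {(1, 2, 3): 0.2, (2, 3, 4): 0.5}
flags = classify_frames([1, 2, 3, 4], p_emp, p_nrm)
```

With these toy tables, the third frame is judged emphasized and the fourth is not. The first two frames receive no decision in this sketch because they lack two preceding codes.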

At the next frame number i+3,
P_e(i+3) = P_emp(C₄ | C₂C₃)
P_n(i+3) = P_nrm(C₄ | C₂C₃)
are calculated, and if P_e(i+3) > P_n(i+3), the frame is determined to be in the emphasized state. Each subsequent frame is judged in turn in the same way.
Next, the product ΠP_e, taken over the speech sub-paragraph, of the conditional appearance probabilities P_e of the frames determined to be in the emphasized state is obtained, together with the product ΠP_n, taken over the speech sub-paragraph, of the conditional probabilities P_n of the frames determined to be in the calm state. If ΠP_e > ΠP_n, the speech sub-paragraph is determined to be in the emphasized state; if ΠP_e ≤ ΠP_n, it is determined to be in the calm state. Alternatively, the sum ΣP_e of P_e over the frames determined to be emphasized and the sum ΣP_n of P_n over the frames determined to be calm may be obtained over the speech sub-paragraph; the sub-paragraph is then determined to be emphasized if ΣP_e > ΣP_n and calm if ΣP_e ≤ ΣP_n. The utterance state of the speech sub-paragraph may also be determined by a weighted comparison of these products or sums of conditional probabilities.
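The sum-based variant of the sub-paragraph decision can be sketched as follows. The function name and the input layout (a list of per-frame pairs) are assumptions made for this illustration. The product-based variant would replace the sums with products; it is not shown.

```python
from typing import List, Tuple


def subparagraph_is_emphasized(frame_probs: List[Tuple[float, float]]) -> bool:
    """frame_probs holds (P_e, P_n) for each frame of one speech sub-paragraph."""
    # Sum P_e over frames judged emphasized and P_n over frames judged calm.
    sum_e = sum(p_e for p_e, p_n in frame_probs if p_e > p_n)
    sum_n = sum(p_n for p_e, p_n in frame_probs if p_e <= p_n)
    return sum_e > sum_n


# Made-up per-frame values for two sub-paragraphs
mostly_emphasized = [(0.6, 0.2), (0.5, 0.3), (0.1, 0.4)]
mostly_calm = [(0.1, 0.5), (0.2, 0.6), (0.4, 0.3)]
```

For the first toy sub-paragraph the emphasized sum is 1.1 against a calm sum of 0.4, so it is judged emphasized; the second is the mirror image and is judged calm.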

In this speech emphasis state determination method as well, the speech feature values used are the same as in the method described above. The appearance probability may be a stand-alone appearance probability or a combination of it with conditional probabilities; when the combination is used, it is preferable to calculate the conditional probabilities by linear interpolation. Also in this method, it is preferable to normalize each speech feature value by its average over each speech sub-paragraph, over a suitable longer section, or over the entire speech signal, to form the set of speech feature values for each frame from the normalized values, and then to perform the processing from the vector quantization of step S301 in FIG. 4 onward. In both the speech emphasis state determination method and the speech summarization method, a set including at least one of f0", p₀", Δf0"(i), Δf0"(-i), Δp"(i), Δp"(-i), d_p, Δd_p(T), and Δd_p(-T) is used as the set of speech feature values.

The speech emphasis state determination device and the speech summarization device according to this reference example will be described with reference to FIG. 13.
A speech signal whose emphasized state is to be determined, or from which a summary is to be extracted, is input to the input unit 11, which also includes a function of converting the input speech signal into a digital signal as necessary. The digitized speech signal is temporarily stored in the storage unit 12 as necessary. The speech feature extraction unit 13 calculates the set of speech feature values described above for each frame. Each calculated speech feature value is normalized by the average of that feature value as necessary, and the quantization unit 14 quantizes the set of speech feature values of each frame with reference to the codebook memory 15 and outputs a code. This code is supplied to the emphasis probability calculation unit 16 and the calm probability calculation unit 17. The codebook memory 15 is, for example, as shown in FIG. 12.

The emphasis probability calculation unit 16 calculates the appearance probability, in the emphasized state, of the code of the quantized set of speech feature values, using the corresponding appearance probabilities stored in the codebook memory 15, for example by equation (13) or (14). Similarly, the calm probability calculation unit 17 calculates the appearance probability of the quantized set of speech feature values in the calm state, using the appearance probabilities of the corresponding speech feature vector stored in the codebook memory 15, for example by equation (15) or (16). The emphasized-state and calm-state appearance probabilities calculated for each frame by the emphasis probability calculation unit 16 and the calm probability calculation unit 17 are stored in the storage unit 12 together with the code and the number of each frame. The emphasized state determination unit 18 compares the calculated emphasized-state appearance probability with the calm-state appearance probability. If the former is larger, it determines that the speech of that frame is in the emphasized state; otherwise, it determines that the frame is not in the emphasized state.

These units operate sequentially under the control of the control unit 19.
The speech summarization device is configured by adding the blocks drawn with broken lines in FIG. 13 to the speech emphasis state determination device drawn with solid-line blocks. The speech feature values of each frame stored in the storage unit 12 are supplied to the unvoiced section determination unit 21 and the voiced section determination unit 22. The unvoiced section determination unit 21 determines for each frame whether it belongs to an unvoiced section, and the voiced section determination unit 22 determines for each frame whether it belongs to a voiced section. Both determination results are input to the speech sub-paragraph determination unit 23.

Based on the unvoiced and voiced section determinations, the speech sub-paragraph determination unit 23 determines, as described in the method embodiment above, that a portion including a voiced section and bounded by unvoiced sections lasting at least a predetermined number of consecutive frames is a speech sub-paragraph. The determination result is written to the storage unit 12 and appended to the speech data sequence stored there, so that a speech sub-paragraph number is assigned to each group of frames bounded by unvoiced sections. The determination result is also input to the final speech sub-paragraph determination unit 24.
The final speech sub-paragraph determination unit 24 detects final speech sub-paragraphs, for example by the technique described with reference to FIG. 3, and inputs the result to the speech paragraph determination unit 25. The speech paragraph determination unit 25 determines that the span from the start of the speech sub-paragraph following each detected final speech sub-paragraph to the end of the next detected final speech sub-paragraph is a speech paragraph. This result is also written to the storage unit 12, and speech paragraph numbers are assigned to the sequence of speech sub-paragraph numbers stored there.
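The sub-paragraph segmentation performed by unit 23 can be sketched as follows. This is an illustration only: the per-frame boolean inputs, the function name, and the parameter `min_gap` (the "predetermined number of frames") are assumptions. The sketch also treats the start and end of the signal as boundaries, which the text does not state. Detection of final sub-paragraphs by unit 24 is not covered.

```python
from typing import List, Tuple


def find_subparagraphs(unvoiced: List[bool], voiced: List[bool],
                       min_gap: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive."""
    n = len(unvoiced)
    # Mark frames that lie inside an unvoiced run of at least min_gap frames.
    boundary = [False] * n
    i = 0
    while i < n:
        if unvoiced[i]:
            j = i
            while j < n and unvoiced[j]:
                j += 1
            if j - i >= min_gap:
                for k in range(i, j):
                    boundary[k] = True
            i = j
        else:
            i += 1
    # Collect the stretches between boundaries that contain a voiced frame.
    segments = []
    start = None
    for k in range(n + 1):
        inside = k < n and not boundary[k]
        if inside and start is None:
            start = k
        elif not inside and start is not None:
            if any(voiced[start:k]):
                segments.append((start, k))
            start = None
    return segments


# Toy flags: U = unvoiced frame, V = voiced frame
pattern = "UUUVVUVVUUUUVVVUUU"
unvoiced = [c == "U" for c in pattern]
voiced = [c == "V" for c in pattern]
segments = find_subparagraphs(unvoiced, voiced, min_gap=3)
```

In the toy pattern, the single unvoiced frame at index 5 is too short to act as a boundary, so frames 3 to 7 form one sub-paragraph and frames 12 to 14 form another.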

When operating as a speech summarization device, the emphasis probability calculation unit 16 and the calm probability calculation unit 17 read from the storage unit 12 the emphasis probability and calm probability of each frame making up each speech sub-paragraph, and the probabilities for each speech sub-paragraph are calculated, for example by equations (17) and (18). The emphasized state determination unit 18 compares these sub-paragraph probability values to determine whether each speech sub-paragraph is in the emphasized state. If a speech paragraph contains even one speech sub-paragraph determined to be in the emphasized state, the summary section extraction unit 26 extracts that speech paragraph as a summary section. Each unit is controlled by the control unit 19.
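The extraction rule applied by the summary section extraction unit 26 can be sketched as follows. The function name and the two parallel input lists are assumptions made for this illustration.

```python
from typing import List


def summary_paragraphs(paragraph_of: List[int], emphasized: List[bool]) -> List[int]:
    """paragraph_of[j] is the speech paragraph number of sub-paragraph j;
    emphasized[j] is the emphasized-state decision for that sub-paragraph."""
    # A paragraph is selected as soon as one of its sub-paragraphs is emphasized.
    selected = {b for b, e in zip(paragraph_of, emphasized) if e}
    return sorted(selected)


# Five sub-paragraphs in three paragraphs (made-up decisions)
chosen = summary_paragraphs([1, 1, 2, 3, 3], [False, True, False, False, False])
```

Here only paragraph 1 is extracted, because it is the only paragraph containing an emphasized sub-paragraph.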

Both the speech emphasis state determination device and the speech summarization device are made to function by having a computer execute a program. In this case, the speech emphasis state determination program or the speech summarization program is downloaded into the program memory 27 from the Internet via a communication line, or from a CD-ROM, magnetic disk, or the like, and the control unit 19, consisting of a CPU or microprocessor, executes the program. The contents of the codebook may likewise be downloaded from the Internet via a communication line.
Second Reference Example
In the speech emphasis state determination method and speech summarization method of the first reference example described above, every speech paragraph that contains even one speech sub-paragraph whose probability of being in the emphasized state exceeds its probability of being in the calm state is extracted as part of the summary, so the speech cannot be summarized at an arbitrary summarization rate (compression rate). The second reference example remedies this and realizes a speech processing method, speech processing device, and speech processing program that can automatically generate a summary of the original speech at an arbitrary summarization rate.

FIG. 18 shows the basic procedure of the speech processing method according to the second reference example.
In step S11, speech emphasis probability calculation is executed to obtain the emphasis probability and calm probability of each speech sub-paragraph.
In step S12, the summary conditions are input. In this summary condition input step S12, the user is, for example, presented with information prompting input of at least one predetermined item among summary time, summarization rate, and compression rate, and enters that value. An input method in which the user selects at least one of several preset summary times, summarization rates, or compression rates may also be adopted.

In step S13, the extraction condition is changed repeatedly to determine an extraction condition that satisfies the summary time, summarization rate, or compression rate entered in the summary condition input step S12.
In step S14, summary extraction is executed. In this summary extraction step S14, the speech paragraphs to be adopted as the summary are determined using the extraction condition determined in the extraction condition changing step S13, and the total duration of those speech paragraphs is calculated.
In step S15, summary playback is executed, and the sequence of speech paragraphs extracted in the summary extraction step S14 is played back.

FIG. 19 shows the details of the speech emphasis probability calculation step S11 shown in FIG. 18.
In step S101, the speech waveform sequence to be summarized is divided into speech sub-paragraphs.
In step S102, speech paragraphs are extracted from the sequence of speech sub-paragraphs obtained in step S101. As described with reference to FIG. 3, a speech paragraph consists of one or more speech sub-paragraphs and is a unit whose meaning most listeners can understand when that portion of the speech is played back. The extraction of speech sub-paragraphs and speech paragraphs in steps S101 and S102 can be performed in the same way as described with reference to FIG. 2.
In steps S103 and S104, for each speech sub-paragraph extracted in step S101, the probability that the sub-paragraph is in the emphasized state (emphasis probability) P_Semp and the probability that it is in the calm state (calm probability) P_Snrm are obtained using the codebook described with reference to FIG. 12 and equations (17), (18), and so on.

In step S105, the emphasis probability P_Semp, the calm probability P_Snrm, and related values obtained for each speech sub-paragraph in steps S103 and S104 are sorted by speech sub-paragraph and stored in the storage means as a speech emphasis probability table.
FIG. 20 shows an example of the speech emphasis probability table stored in the storage means. M1, M2, M3, ... in FIG. 20 denote speech sub-paragraph probability storage sections that record the emphasis probability P_Semp and the calm probability P_Snrm obtained for each speech sub-paragraph. Each of these storage sections M1, M2, M3, ... stores the number B of the speech paragraph to which the speech sub-paragraph S_j belongs, the speech sub-paragraph number j assigned to S_j, the start time (measured from the beginning of the speech to be summarized), the end time, the sub-paragraph emphasis probability, the sub-paragraph calm probability, the number of frames F_S making up the sub-paragraph, and so on.
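One possible in-memory layout for an entry of the speech emphasis probability table is sketched below. The class name, field names, and the choice of seconds as the time unit are assumptions made for illustration; only the list of stored items comes from the description of FIG. 20, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubParagraphEntry:
    """One storage section (M1, M2, ...) of the speech emphasis probability table."""
    paragraph_no: int      # B: speech paragraph the sub-paragraph belongs to
    subparagraph_no: int   # j: number assigned to the sub-paragraph S_j
    start_time: float      # measured from the beginning of the speech (seconds assumed)
    end_time: float        # same unit as start_time
    p_semp: float          # sub-paragraph emphasis probability P_Semp
    p_snrm: float          # sub-paragraph calm probability P_Snrm
    n_frames: int          # F_S: number of frames in the sub-paragraph

    @property
    def duration(self) -> float:
        return self.end_time - self.start_time


table: List[SubParagraphEntry] = [
    SubParagraphEntry(1, 1, 0.0, 2.5, 0.020, 0.030, 250),
    SubParagraphEntry(1, 2, 2.9, 6.0, 0.045, 0.010, 310),
]
```

Keeping the paragraph number in every entry makes it straightforward to go from a selected sub-paragraph to the speech paragraph that must be extracted with it.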

The condition entered in the summary condition input step S12 of FIG. 18 is either the summarization rate r = 1/X (X is a positive integer), indicating that content of total length T_C is to be summarized to the time T_S = T_C/X, or the summary time T_S itself.
Given this summary condition, the extraction condition changing step S13 sets the weighting factor W to the initial value W = 1 and passes it to the summary extraction step S14.
With the weighting factor W = 1, the summary extraction step S14 compares the emphasis probability P_Semp and the calm probability P_Snrm stored in the speech emphasis probability table for each speech sub-paragraph and extracts the speech sub-paragraphs satisfying
W·P_Semp > P_Snrm (19)
It then extracts every speech paragraph that contains at least one extracted speech sub-paragraph and obtains the total duration T_G (seconds) of the extracted speech paragraph sequence.

The total duration T_G of the extracted speech paragraph sequence is compared with the summary time T_S specified by the summary condition. If T_G ≈ T_S (for example, the error of T_G relative to T_S is within about ± a few percent), the extracted speech paragraph sequence is played back as the summary speech as it is.
If the error of T_G relative to the summary time T_S exceeds the specified range and T_G > T_S, the total duration T_G of the extracted speech paragraph sequence is judged to be longer than the summary time T_S, and the extraction condition changing step S13 shown in FIG. 18 is executed again. On receiving the judgment that the total duration T_G obtained with W = 1 is "longer" than T_S, the extraction condition changing step S13 weights the emphasis probability P_Semp by multiplying it by a weighting factor W smaller than the current value. The weighting factor is obtained, for example, as W = 1 - 0.001×L, where L is the number of loop iterations.

That is, in the first loop iteration, the array of emphasis probabilities P_Semp obtained for all speech sub-paragraphs of the speech paragraph sequence read from the speech emphasis probability table is multiplied by the weighting factor W = 0.999 given by W = 1 - 0.001×1. The weighted emphasis probability W·P_Semp of each speech sub-paragraph is compared with its calm probability P_Snrm, and the speech sub-paragraphs satisfying W·P_Semp > P_Snrm are extracted.
Based on this extraction result, the summary extraction step S14 extracts the speech paragraphs containing the extracted speech sub-paragraphs and obtains the summary speech paragraph sequence again. The total duration T_G of this sequence is calculated and compared with the summary time T_S specified by the summary condition. If T_G ≈ T_S, the sequence is adopted as the summary speech and played back.

If the result after the first weighting is still T_G > T_S, the extraction condition changing step is executed as a second loop iteration. The weighting factor is then W = 1 - 0.001×2, so all emphasis probabilities P_Semp are weighted by W = 0.998.
In this example, each repetition of the loop changes the extraction condition so that the weighting factor W gradually decreases, which gradually reduces the number of speech sub-paragraphs satisfying W·P_Semp > P_Snrm. In this way, a state T_G ≈ T_S that satisfies the summary condition can be found.
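The loop over the weighting factor can be sketched as follows. This is a minimal illustration under stated assumptions: the tuple layout, the function names, the tolerance `tol`, and the loop limit `max_loops` are hypothetical, and the sketch covers only the case T_G > T_S, in which W is lowered. The case T_G < T_S is not handled.

```python
from typing import Dict, List, Tuple

SubPara = Tuple[int, float, float]  # (paragraph number, P_Semp, P_Snrm)


def select_paragraphs(subs: List[SubPara], w: float) -> List[int]:
    """Paragraphs containing at least one sub-paragraph with W * P_Semp > P_Snrm."""
    return sorted({b for b, p_e, p_n in subs if w * p_e > p_n})


def fit_summary(subs: List[SubPara], durations: Dict[int, float], t_s: float,
                tol: float = 0.05, step: float = 0.001, max_loops: int = 999):
    """Lower W from 1 until T_G no longer exceeds T_S by more than tol."""
    for loop in range(max_loops + 1):
        w = 1.0 - step * loop                 # loop 0 is the initial W = 1
        chosen = select_paragraphs(subs, w)
        t_g = sum(durations[b] for b in chosen)
        if t_g <= t_s * (1.0 + tol):
            return w, chosen, t_g
    return w, chosen, t_g                     # loop limit reached without a fit


# Made-up table: three paragraphs with one sub-paragraph each
subs = [(1, 0.50, 0.40), (2, 0.300, 0.2995), (3, 0.20, 0.25)]
durations = {1: 30.0, 2: 40.0, 3: 20.0}
w, chosen, t_g = fit_summary(subs, durations, t_s=30.0)
```

With these toy values, paragraphs 1 and 2 are selected at W = 1 and W = 0.999 (70 seconds in total). At W = 0.998 the sub-paragraph of paragraph 2 no longer satisfies the condition, leaving 30 seconds.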

Although T_G ≈ T_S was used above as the convergence condition for the summary time T_G, it is also possible to converge strictly to T_G = T_S. For example, if the summary is 5 seconds short of the summary condition and adding one more speech paragraph would exceed it by 10 seconds, playing back only 5 seconds of that speech paragraph makes the summary match the user's summary condition exactly. These 5 seconds may be the 5 seconds around the speech sub-paragraph judged to be emphasized, or the first 5 seconds of the speech paragraph.
If T_G < T_S is found in the initial state described above, the weighting factor W is made smaller than the current value, for example W = 1 - 0.001×L, and the array of calm probabilities P_Snrm is multiplied by this weighting factor so that the calm probabilities are weighted. As another method, when T_G > T_S is found in the initial state, the weighting factor may be made larger than the current value, W = 1 + 0.001×L, and the array of calm probabilities P_Snrm multiplied by it.

The summary playback step S15 was described as playing back the speech paragraph sequence extracted in the summary extraction step S14. For image information accompanied by speech, the image information corresponding to the speech paragraphs extracted as summary speech can be cut out, joined together, and played back with the speech, thereby summarizing a television broadcast, a movie, or the like.
The above description applied the weighting by directly multiplying either the emphasis probability or the calm probability, obtained for each speech sub-paragraph and stored in the speech emphasis probability table, by the weighting factor W. To detect the emphasized state accurately, however, it is desirable to raise W to the power F, the number of frames making up the speech sub-paragraph, and to weight by W^F. The conditional emphasis probability P_Semp calculated by equations (17) and (18) is obtained by multiplying the per-frame probabilities of the emphasized state over the speech sub-paragraph, and the calm probability P_Snrm is likewise obtained by multiplying the per-frame probabilities of the calm state over the speech sub-paragraph. Therefore, to weight P_Semp, for example, weighting each frame's emphasized-state probability by W and multiplying over the speech sub-paragraph amounts to weighting by W^F.
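The equivalence between weighting every frame by W and weighting the sub-paragraph product by W^F can be checked numerically. The function name and the sample probabilities below are made up for this illustration.

```python
import math
from typing import List


def weighted_product(frame_probs: List[float], w: float) -> float:
    """Multiply every per-frame probability by W, then take the product."""
    result = 1.0
    for p in frame_probs:
        result *= w * p
    return result


frame_probs = [0.5, 0.4, 0.8, 0.25]   # made-up per-frame probabilities, F = 4
w = 0.999
per_frame = weighted_product(frame_probs, w)
whole = (w ** len(frame_probs)) * math.prod(frame_probs)
```

The two quantities `per_frame` and `whole` agree up to floating-point rounding, since the factor W appears once per frame in the product.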

As a result, the effect of the weighting scales with the number of frames F: the more frames a speech sub-paragraph has, that is, the longer its duration, the greater the weight applied to it.
However, if it suffices merely to change the extraction condition for determining the emphasized state, the extraction condition can also be changed just by multiplying the product of the per-frame emphasized-state probabilities, or the product of the per-frame calm-state probabilities, by the weighting factor W. It is therefore not essential to use W^F as the weighting factor.
The means of changing the extraction condition described above weights the emphasis probability P_Semp or the calm probability P_Snrm of each speech sub-paragraph to change the number of speech sub-paragraphs satisfying P_Semp > P_Snrm. As another method, the probability ratio P_Semp/P_Snrm may be computed for every speech sub-paragraph, and the speech paragraphs containing the sub-paragraphs may be accumulated in descending order of this ratio, counting each speech paragraph only once, while the accumulated duration of those speech paragraphs is calculated. When the accumulated duration, that is, the total time of the summary sections, approximately matches the specified summary time, the accumulated speech paragraphs arranged in time order are adopted as the summary and the summary speech is assembled from them.

In this case, if the total duration of the assembled summary speech is longer or shorter than the summary time set by the summary condition, the extraction condition can be changed by changing the decision threshold applied to the probability ratio P_Semp/P_Snrm for judging the emphasized state. Raising the threshold reduces the number of speech sub-paragraphs judged to be emphasized, so fewer speech paragraphs are detected as summary sections and the total summary time becomes shorter; lowering the threshold has the opposite effect. This way of changing the extraction condition has the advantage of simplifying the processing needed to assemble summary speech that satisfies the summary condition.
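The ratio-based alternative can be sketched as follows. The function name and tuple layout are assumptions, as are three simplifications: accumulation stops before the first paragraph that would overshoot T_S, paragraph numbers are taken to follow time order, and every P_Snrm is taken to be greater than zero.

```python
from typing import Dict, List, Tuple

SubPara = Tuple[int, float, float]  # (paragraph number, P_Semp, P_Snrm)


def summarize_by_ratio(subs: List[SubPara], durations: Dict[int, float],
                       t_s: float) -> List[int]:
    """Add paragraphs in descending order of P_Semp / P_Snrm until T_S is reached."""
    ranked = sorted(subs, key=lambda s: s[1] / s[2], reverse=True)
    chosen: List[int] = []
    total = 0.0
    for b, _, _ in ranked:
        if b in chosen:
            continue                 # each speech paragraph is counted only once
        if total + durations[b] > t_s:
            break                    # assumption: stop before overshooting T_S
        chosen.append(b)
        total += durations[b]
    return sorted(chosen)            # time order, assuming numbers follow time


# Made-up sub-paragraph table and paragraph durations (seconds)
subs = [(1, 0.2, 0.4), (2, 0.9, 0.3), (2, 0.5, 0.5), (3, 0.6, 0.3), (4, 0.3, 0.3)]
durations = {1: 10.0, 2: 20.0, 3: 15.0, 4: 30.0}
summary = summarize_by_ratio(subs, durations, t_s=40.0)
```

With these toy values, paragraphs 2 and 3 are accumulated (35 seconds), and paragraph 4 is rejected because it would push the total past 40 seconds. Because the sub-paragraphs are sorted once, no repeated re-weighting of the whole table is needed.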

In the above description, the emphasis probability P_Semp and the calm probability P_Snrm of each speech sub-paragraph were calculated as the product of the per-frame emphasized-state probabilities and the product of the per-frame calm-state probabilities, respectively. As another method, the per-frame probabilities may be averaged within the speech sub-paragraph, and the averages used as the emphasis probability P_Semp and the calm probability P_Snrm of that speech sub-paragraph. When P_Semp and P_Snrm are calculated in this way, the weighting factor W is simply multiplied by P_Semp or P_Snrm as it is.
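The averaging alternative can be sketched as follows. The function name and the input layout are assumptions, and the sketch averages both the emphasized-state and the calm-state per-frame probabilities, which is one reading of the text.

```python
from statistics import fmean
from typing import List, Tuple


def averaged_probabilities(frame_probs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Average the per-frame (emphasized, calm) probabilities of one sub-paragraph."""
    p_semp = fmean(p_e for p_e, _ in frame_probs)
    p_snrm = fmean(p_n for _, p_n in frame_probs)
    return p_semp, p_snrm


# Made-up per-frame values for one sub-paragraph
p_semp, p_snrm = averaged_probabilities([(0.6, 0.2), (0.5, 0.3), (0.4, 0.1)])
w = 0.999
is_emphasized = w * p_semp > p_snrm   # W is applied directly, not as W^F
```

Because an average does not shrink as the number of frames grows, a single factor W affects short and long sub-paragraphs alike, which is why no exponent F is needed here.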

A speech processing apparatus according to this second reference example, in which the summarization rate can be set freely, is described with reference to FIG. 21. This reference example is characterized in that the following are added to the configuration of the emphasized-speech summarizing apparatus shown in FIG. 13: a summary condition input unit 31, a speech emphasis probability table 32, an emphasized sub-paragraph extraction unit 33, an extraction condition changing unit 34, and a provisional summary section determination unit 35. The provisional summary section determination unit 35 contains a total duration calculation unit 35A, which obtains the total duration of the summary speech; a summary section decision unit 35B, which determines whether the error between the total duration calculated by unit 35A and the summary time entered by the user at the summary condition input unit 31 falls within a predetermined range; and a summary speech storage and playback unit 35C, which stores and plays back summary speech that meets the summary condition.

As described with reference to FIG. 13, speech features are obtained from the input speech for each frame, and the emphasis probability calculation unit 16 and the calm probability calculation unit 17 use these features to calculate an emphasis probability and a calm probability for each frame. These probabilities are stored in the storage unit 12 together with the frame number assigned to each frame. In addition, the speech sub-paragraph number j assigned by the speech sub-paragraph determination unit and the number B of the speech paragraph to which that sub-paragraph belongs are attached to each frame number, so that every frame and speech sub-paragraph is given an address.
In the speech processing apparatus of this reference example, the emphasis probability calculation unit 16 and the calm probability calculation unit 17 read the per-frame emphasis and calm probabilities stored in the storage unit 12, obtain from them the emphasis probability P_Semp and the calm probability P_Snrm of each speech sub-paragraph, and store P_Semp and P_Snrm in the speech emphasis probability table 32.

The speech emphasis probability table 32 holds the emphasis and calm probabilities obtained for each speech sub-paragraph of the speech waveforms of various contents, so summarization can be carried out whenever a user requests it. The user enters a summary condition at the summary condition input unit 31. The summary condition here means the name of the content to be summarized and the summarization rate r relative to the full length of that content. The condition may be entered, for example, as a request to summarize the content to 1/10 of its full length, or to summarize it to 10 minutes. If, for example, a summarization rate r = 1/10 is entered, the summary time calculation unit 31A calculates the time equal to 1/10 of the full length of the content and sends this summary time to the summary section decision unit 35B of the provisional summary section determination unit 35.

When a summary condition is entered at the summary condition input unit 31, the control unit 19 starts generating the summary speech. As the first step, the emphasis and calm probabilities for the content requested by the user are read from the speech emphasis probability table 32. They are sent to the emphasized sub-paragraph extraction unit 33, which extracts the numbers of the speech sub-paragraphs judged to be in the emphasized state.
Two methods can be used to change the condition for extracting emphasized speech sub-paragraphs. In the first, the relative weighting coefficient W applied to the emphasis probability P_Semp and the calm probability P_Snrm is changed, the speech sub-paragraphs satisfying W·P_Semp > P_Snrm are extracted, and the summary speech is formed from the speech paragraphs containing them. In the second, the weighted probability ratio W·P_Semp/P_Snrm is calculated and, with the weighting coefficient changed as needed, the durations of the speech paragraphs containing emphasized speech sub-paragraphs are accumulated in descending order of the weighted probability ratio, each speech paragraph being counted only once, to obtain the summary time.

As the initial extraction condition, when the condition is changed by weighting, the initial value of the weighting coefficient W may be set to W = 1. When the emphasized state is judged from the ratio P_Semp/P_Snrm of the emphasis probability to the calm probability obtained for each speech sub-paragraph, the initial condition may be, for example, that a sub-paragraph is judged emphasized when P_Semp/P_Snrm ≥ 1.
Data representing the numbers, start times and end times of the speech sub-paragraphs judged emphasized under this initial setting are sent from the emphasized sub-paragraph extraction unit 33 to the provisional summary section determination unit 35. Unit 35 searches the sequence of speech paragraphs stored in the storage unit 12 and extracts the speech paragraphs that contain those sub-paragraphs. The total duration of the extracted speech paragraphs is calculated by the total duration calculation unit 35A and compared by the summary section decision unit 35B with the summary time entered as the summary condition. Whether the result satisfies the summary condition may be decided, for example, by checking whether the total summary time T_G and the entered summary time T_S satisfy |T_G − T_S| ≤ ΔT for a predetermined tolerance ΔT, or whether they satisfy 0 < |T_G − T_S| < δ for a predetermined positive value δ smaller than 1. If the condition is satisfied, the sequence of speech paragraphs is stored and played back by the summary speech storage and playback unit 35C. For playback, the speech paragraphs are extracted from the numbers of the sub-paragraphs judged emphasized by the extraction unit 33, and the audio or video data of the content is read out by specifying the start and end time of each speech paragraph and sent out as summary audio and summary video data.

If the summary section decision unit 35B determines that the summary condition is not satisfied, it outputs an extraction-condition change command to the extraction condition changing unit 34. The extraction condition changing unit 34 changes the extraction condition and supplies the new condition to the emphasized sub-paragraph extraction unit 33, which again compares the emphasis probability and the calm probability of each speech sub-paragraph stored in the speech emphasis probability table 32 under the new condition.
The emphasized speech sub-paragraphs extracted by unit 33 are sent again to the provisional summary section determination unit 35, which extracts the speech paragraphs containing them. The total duration of the extracted speech paragraphs is calculated, and the summary section decision unit 35B determines whether it satisfies the summary condition. This is repeated until the summary condition is satisfied; the sequence of speech paragraphs that satisfies it is then read from the storage unit 12 as summary audio and summary video data, played back, and delivered to the user terminal.
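The repeat-until-satisfied loop described above might look like the following sketch. The text does not say how the weighting coefficient W is updated on each pass; the doubling-then-bisection rule used here, the `SubParagraph` record, and all function names are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class SubParagraph:
    paragraph_id: int  # speech paragraph this sub-paragraph belongs to
    p_semp: float      # emphasis probability of the sub-paragraph
    p_snrm: float      # calm probability of the sub-paragraph


def select_paragraphs(subs, weight):
    """Speech paragraphs containing a sub-paragraph with W * P_Semp > P_Snrm."""
    return {s.paragraph_id for s in subs if weight * s.p_semp > s.p_snrm}


def summarize(subs, durations, target_time, tolerance, max_iter=60):
    """Adjust W until |T_G - T_S| <= tolerance; return (paragraph ids, W) or None.

    durations maps a paragraph id to its duration. The update rule for W
    (doubling, then bisection) is an assumed strategy, not the patent's.
    """
    lo, hi, weight = 0.0, None, 1.0  # initial extraction condition: W = 1
    for _ in range(max_iter):
        chosen = select_paragraphs(subs, weight)
        total = sum(durations[p] for p in chosen)  # each paragraph counted once
        if abs(total - target_time) <= tolerance:
            return sorted(chosen), weight
        if total > target_time:
            hi = weight  # summary too long: a smaller W is needed
        else:
            lo = weight  # summary too short: a larger W is needed
        weight = (lo + hi) / 2 if hi is not None else weight * 2
    return None  # no W met the condition within max_iter passes
```

Because paragraphs are added in whole units, some target times cannot be met within a small tolerance; the sketch then gives up after `max_iter` passes rather than looping forever.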

The speech processing method of this second reference example is implemented by having a computer execute a program. In this case, the codebook and the processing program may be downloaded over a communication line, or a program stored on a storage medium such as a CD-ROM or magnetic disk may be installed, so that the method of this reference example is executed by a processing device such as the CPU of the computer.
Embodiment
An embodiment of the invention is described below.
In the utterance state determination of step S3 in FIG. 1 described for the first reference example, as explained with reference to FIGS. 4 and 12, the speech of test subjects is analyzed, and the independent appearance probability and conditional appearance probabilities obtained in advance for each speech feature vector in the sections labeled emphasized and calm are stored in the codebook in association with the codes. From the codes of the series of frames of an input speech sub-paragraph, the probability that the sub-paragraph is in the emphasized state and the probability that it is in the calm state are obtained, for example by equations (17) and (18), and the sub-paragraph is judged emphasized or calm according to which is larger. In the embodiment of this invention, described below, the determination instead uses a hidden Markov model (HMM) as the acoustic model.

In this embodiment, for example, an emphasized-state HMM and a calm-state HMM are created in advance from the many sections labeled emphasized and the many sections labeled calm in the training speech data of test subjects. The likelihood of an input speech sub-paragraph under the emphasized-state HMM and its likelihood under the calm-state HMM are obtained, and the utterance state is determined according to which is larger.
An HMM generally consists of the following parameters.
S: finite set of states; S = {S_i}
Y: set of observations; Y = {y_1, ..., y_t}
A: set of state transition probabilities; A = {a_ij}
B: set of output probabilities; B = {b_j(y_t)}
π: set of initial state probabilities; π = {π_i}
FIGS. 22A and 22B show typical examples of an emphasized-state HMM and a calm-state HMM with four states (i = 1, 2, 3, 4). In this embodiment, when the sections labeled emphasized and calm in the training speech data are each modeled with a predetermined number of states, here four, the finite state set of the emphasized-state HMM, S_emp = {S_empi}, consists of S_emp1, S_emp2, S_emp3 and S_emp4, and that of the calm-state HMM, S_nrm = {S_nrmi}, consists of S_nrm1, S_nrm2, S_nrm3 and S_nrm4. The elements {y_1, ..., y_t} of the observation set Y are the quantized sets of speech features of the sections labeled emphasized and calm. As before, the set of speech features used in this embodiment includes at least one of the fundamental frequency, the power, and the temporal variation characteristic of the dynamic feature, and/or at least one of their inter-frame differences. a_empij is the probability of a transition from state S_empi to state S_empj, and b_empj(y_t) is the output probability of emitting y_t on the transition to state S_empj. The initial state probabilities are π_emp(y_1) and π_nrm(y_1). a_empij, a_nrmij, b_empj(y_t) and b_nrmj(y_t) are estimated from the training speech by the EM (Expectation-Maximization) algorithm or the forward-backward algorithm.
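As a rough illustration of how the parameter sets listed above could be held in memory, the sketch below stores one discrete HMM as three tables. The class name `DiscreteHMM`, its field names and the `shape` helper are hypothetical; the one point taken from the text is that the initial state probability is indexed by code (π(y_1)) rather than by state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiscreteHMM:
    """Parameter tables of one discrete HMM (hypothetical container)."""
    initial: List[float]           # pi(Cm): one entry per code, length M
    transition: List[List[float]]  # a[i][j]: state i -> state j, N x N
    output: List[List[float]]      # b[j][m]: code m emitted on entering state j, N x M

    def shape(self):
        """Return (N, M) after checking that the three tables agree in size."""
        n, m = len(self.transition), len(self.initial)
        if any(len(row) != n for row in self.transition):
            raise ValueError("transition table must be N x N")
        if len(self.output) != n or any(len(row) != m for row in self.output):
            raise ValueError("output table must be N x M")
        return n, m
```

One instance would be kept for the emphasized state and one for the calm state, each estimated from its own labeled sections.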

The design of the emphasized-state HMM is outlined below.
Step S1: First, the frames of all sections labeled emphasized or calm in the training speech data are analyzed to obtain a predetermined set of speech features for each frame, and a quantization codebook is created. Here, for example, the set of speech features containing the 13 parameters shown as combination No. 7 in FIG. 17 (described later), which was used in the experiment of the first reference example, is used, and a single codebook of 13-dimensional vectors is created. The size of the quantization codebook is M, the code corresponding to each vector is denoted Cm (m = 1, ..., M), and the codebook stores, for each code, the speech feature vector obtained by training.
Step S2: The sets of speech features of the frames of all sections labeled emphasized and calm in the training speech data are quantized with the quantization codebook, giving for each section labeled emphasized the code sequence Cm_t, t = 1, ..., LN of its speech feature vectors (LN being the number of frames in the section). As described for the first reference example, the appearance probability P_emp(Cm) of each code Cm of the quantization codebook in the emphasized state is obtained, and this becomes the initial state probability π_emp(Cm). Likewise, the appearance probability P_nrm(Cm) in the calm state is obtained and becomes the initial state probability π_nrm(Cm). FIG. 23A shows, as a table, the code numbers Cm and the corresponding initial state probabilities π_emp(Cm) and π_nrm(Cm).
Step S3: The number of states of the emphasized-state HMM may be chosen freely. FIGS. 22A and 22B show the case where the emphasized-state HMM and the calm-state HMM both have four states: the emphasized-state HMM has states S_emp1, S_emp2, S_emp3 and S_emp4, and the calm-state HMM has states S_nrm1, S_nrm2, S_nrm3 and S_nrm4.
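Steps S1 and S2 can be illustrated with a small sketch that quantizes each frame to its nearest code vector and then uses the relative frequency of each code as its initial state probability. How the codebook itself is trained is not shown, and the squared-Euclidean distance and the function names are assumptions made here for illustration.

```python
from collections import Counter


def quantize(vector, codebook):
    """Index of the code vector nearest to `vector` (squared Euclidean distance,
    an assumed distance measure)."""
    return min(
        range(len(codebook)),
        key=lambda m: sum((v - c) ** 2 for v, c in zip(vector, codebook[m])),
    )


def initial_state_probabilities(labeled_sections, codebook):
    """Appearance probability of each code over all frames of the given sections.

    labeled_sections: list of sections, each a list of feature vectors (frames),
    all carrying the same label (emphasized or calm).
    """
    counts = Counter(
        quantize(frame, codebook)
        for section in labeled_sections
        for frame in section
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no frames in the labeled sections")
    return [counts[m] / total for m in range(len(codebook))]
```

Calling `initial_state_probabilities` once on the emphasized sections and once on the calm sections would give the two columns of a table like the one FIG. 23A is described as showing.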

The number of state transitions is counted from the code sequences obtained from the series of frames of the sections labeled emphasized in the training speech data, and on that basis the transition probabilities a_empij, a_nrmij and the output probabilities b_empj(Cm), b_nrmj(Cm) are estimated by maximum likelihood using the EM algorithm and the forward-backward algorithm. These calculations are described, for example, in Baum, L. E., "An Inequality and Associated Maximization Technique in Statistical Estimation of Probabilistic Function of a Markov Process", Inequalities, vol. 3, pp. 1-8 (1972). FIGS. 23B and 23C show the transition probabilities a_empij and a_nrmij provided for the respective states, and FIG. 24 shows, as a table, the output probabilities b_empj(Cm) and b_nrmj(Cm) of each code in each state S_empj of the emphasized-state HMM and each state S_nrmj of the calm-state HMM (j = 1, ..., 4).

The state transition probabilities a_empij, a_nrmij and the code output probabilities b_empj(Cm), b_nrmj(Cm) are stored as tables, for example in the codebook memory 15 of the apparatus of FIG. 13, and are used to determine the utterance state of an input speech signal as described below. The table of output probabilities corresponds to the codebook of the first and second reference examples.
Using the emphasized-state HMM and calm-state HMM designed in this way, the utterance state of an input speech sub-paragraph can be determined as follows.
A sequence of sets of speech features is obtained from the series of frames (FN frames) of the input speech sub-paragraph, and each set is quantized with the quantization codebook to give the code sequence {Cm_1, Cm_2, ..., Cm_FN}. For this code sequence, the probability (likelihood) that the speech sub-paragraph is in the emphasized state is calculated for every possible transition path of the emphasized-state HMM that starts in state S_emp1 and reaches state S_emp4. One such path k is described below. FIG. 25 lists, for each frame of the speech sub-paragraph, the code, the state, the state transition probability and the output probability. When the state sequence S^k_emp of path k in the emphasized-state HMM is S^k_emp = {S^k_emp1, S^k_emp2, ..., S^k_empFN}, the probability P(S^k_emp) of the emphasized state is obtained by the following equation.
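The image for equation (20) is not present in this text. A form consistent with the surrounding description and with claim 1 (an initial state probability for the first frame, then a transition probability and an output probability for each later frame) would be the following. The state index k_t, meaning the state occupied by path k at frame t, is notation assumed here and may differ from the original.

```latex
P(S^{k}_{emp}) \;=\; \pi_{emp}(Cm_1)\,
  \prod_{t=2}^{FN} a_{emp\,k_{t-1}k_t}\; b_{emp\,k_t}(Cm_t)
  \qquad (20)
```

Read this way, each path contributes the product of one table lookup per frame for the code and one per frame-to-frame transition.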

Equation (20) is calculated for all paths k. If the probability P_empHMM that the speech sub-paragraph is in the emphasized state is taken to be, for example, the probability along the maximum-likelihood path, it is expressed as

$$P_{empHMM} = \max_{k} P(S^{k}_{emp}) \qquad (21)$$

Alternatively, it may be obtained as the sum of equation (20) over all paths:

$$P_{empHMM} = \sum_{k} P(S^{k}_{emp}) \qquad (21')$$

Similarly, for the calm-state HMM, when the state sequence S^k_nrm of path k is S^k_nrm = {S^k_nrm1, S^k_nrm2, ..., S^k_nrmFN}, the probability P(S^k_nrm) of the calm state is obtained by equation (22), the calm-state counterpart of equation (20). If the probability P_nrmHMM that the speech sub-paragraph is in the calm state is taken to be the probability along the maximum-likelihood path, it is expressed as

$$P_{nrmHMM} = \max_{k} P(S^{k}_{nrm}) \qquad (23)$$

Alternatively, it may be obtained as the sum of equation (22) over all paths:

$$P_{nrmHMM} = \sum_{k} P(S^{k}_{nrm}) \qquad (23')$$

For each speech sub-paragraph, the emphasized-state probability P_empHMM and the calm-state probability P_nrmHMM are compared; if the former is larger the sub-paragraph is judged to be in the emphasized state, and if the latter is larger it is judged to be in the calm state. Alternatively, the sub-paragraph may be judged emphasized if the probability ratio P_empHMM/P_nrmHMM is larger than a predetermined reference value, and calm if the ratio is equal to or smaller than the reference value.
The HMM-based calculation of the emphasized-state and calm-state probabilities described in this embodiment may also be used in step S11 of FIG. 18 described for the second reference example, which performs speech summarization, and more specifically in the speech emphasis probability calculation of steps S103 and S104 of FIG. 19. That is, instead of obtaining the probabilities P_Semp and P_Snrm by equations (17) and (18), the emphasized-state probability P_empHMM and calm-state probability P_nrmHMM obtained by equations (21) and (23), or by equations (21') and (23'), may be used and stored in the speech emphasis probability table shown in FIG. 20. As in the second reference example, the summarization rate can be changed by changing the reference value with which the probability ratio P_empHMM/P_nrmHMM is compared.
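The likelihood calculation and the decision described above can be sketched with a short dynamic-programming routine, which gives the same result as enumerating every path without listing the paths explicitly. This is an illustrative sketch under stated assumptions: paths start in the first state and must end in the last state, the initial probability is looked up by the first frame's code, the tables are plain nested lists, and all function and parameter names are hypothetical.

```python
def path_likelihood(codes, initial, transition, output, combine=max):
    """Likelihood of a code sequence under one discrete HMM.

    combine=max gives the best-path value (the maximum over paths);
    combine=sum gives the total over all paths.
    initial[c]       : initial probability of code c (first frame)
    transition[i][j] : probability of moving from state i to state j
    output[j][c]     : probability of code c on entering state j
    """
    n = len(transition)
    # score[j]: combined probability of all partial paths now in state j
    score = [0.0] * n
    score[0] = initial[codes[0]]  # assumed: every path starts in the first state
    for code in codes[1:]:
        score = [
            combine(score[i] * transition[i][j] for i in range(n)) * output[j][code]
            for j in range(n)
        ]
    return score[-1]  # assumed: every path must end in the last state


def is_emphasized(codes, emp, nrm, reference=1.0, combine=max):
    """Compare the likelihood ratio P_empHMM / P_nrmHMM with a reference value.

    emp and nrm are (initial, transition, output) tuples for the two models.
    """
    p_emp = path_likelihood(codes, *emp, combine=combine)
    p_nrm = path_likelihood(codes, *nrm, combine=combine)
    if p_nrm == 0.0:
        return p_emp > 0.0
    return p_emp / p_nrm > reference
```

For long sub-paragraphs the products become very small, so a practical version would likely work with log probabilities; that refinement is left out here to keep the correspondence with the product form visible. Raising `reference` makes fewer sub-paragraphs count as emphasized, which is how the summarization rate would be tuned.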

FIG. 1 is a flowchart showing an example of the basic procedure of the speech summarizing method of the first reference example.
FIG. 2 is a flowchart showing an example of the procedure for extracting voiced sections, speech sub-paragraphs and speech paragraphs from the input speech in step S2 of FIG. 1.
FIG. 3 is a diagram explaining the relationship among voiced sections, speech sub-paragraphs and speech paragraphs.
FIG. 4 is a flowchart showing an example of the procedure for determining the utterance state of an input speech sub-paragraph in step S3 of FIG. 1.
FIG. 5 is a flowchart showing an example of the procedure for creating part of the codebook used in this invention.
FIG. 6 shows an example of unigrams of codes obtained by vector-quantizing speech features.
FIG. 7 shows an example of bigrams of codes obtained by vector-quantizing speech features.
FIG. 8 shows the bigrams of code Ch = 27 among the bigrams shown in FIG. 7.
FIG. 9 is a diagram explaining the utterance state likelihood calculation.
FIG. 10 shows the recall of the closed and open experiments conducted with 18 combinations of parameters.
FIG. 11 shows the recall of the closed and open experiments when the codebook size is varied.
FIG. 12 shows an example of how the codebook is stored.
FIG. 13 shows examples of the functional configurations of the speech emphasized-state determination apparatus and the speech summarizing apparatus according to the first reference example.
FIG. 14 shows an example of bigrams obtained by vector-quantizing speech features.
FIG. 15 is a continuation of FIG. 14.
FIG. 16 is a continuation of FIG. 15.
FIG. 17 shows examples of the combinations of speech feature parameters actually used.
FIG. 18 is a flowchart explaining the speech summarizing method of the second reference example.
FIG. 19 is a flowchart showing how the speech emphasis probability table is created.
FIG. 20 is a diagram explaining the speech emphasis probability table.
FIG. 21 is a block diagram showing a configuration example of the speech emphasized-state determination apparatus and the emphasized-speech summarizing apparatus of the second reference example.
FIG. 22A is a diagram explaining the emphasized-state HMM in the embodiment of this invention, and FIG. 22B is a diagram explaining the calm-state HMM in this embodiment.
FIG. 23A shows the initial state probabilities of the emphasized state and of the calm state for each code, FIG. 23B shows a table of the state transition probabilities provided for each transition state in the emphasized state, and FIG. 23C shows a table of the state transition probabilities provided for each transition state in the calm state.
FIG. 24 shows a table of the output probabilities of each code in each transition state of the emphasized state and of the calm state.
FIG. 25 shows a table listing the code sequence obtained from the series of frames of one speech sub-paragraph, one state transition sequence taken by those codes, and the corresponding state transition probabilities and output probabilities.

Claims

フレーム毎の音声特徴量の組に基づき音声の強調状態を判定する音声処理方法であって、
基本周波数、パワー、動的特徴量の時間変化特性、基本周波数のフレーム間差分、パワーのフレーム間差分、動的特徴量の時間変化特性のフレーム間差分の６つのうちの少なくともいずれか１つを含む音声特徴量の組からなる音声特徴量ベクトルにそれぞれのコードを対応させ、
上記強調状態での上記各コードが出現するコード出現確率と、上記強調状態での各状態が遷移する状態遷移確率と、上記強調状態での状態遷移時に上記コードが出現する遷移コード出現確率とを格納した符号帳を作成し、
上記強調状態での初期状態確率に対応する上記コード出現確率と、上記強調状態での上記音声特徴量ベクトルに対応する状態遷移ごとの上記遷移コード出現確率と、状態遷移に対応する強調状態での上記状態遷移確率とからなる強調状態音響モデルを上記符号帳を用いて作成し、
(a-1) フレーム毎の音声信号について、無声区間か有声区間か判定し、
(a-2) 所定フレーム数以上の無声区間に挟まれ、少なくとも１フレーム以上の有声区間を含む部分を音声小段落とし、
(a-3) 上記音声小段落の最初のフレームの上記音声特徴量の組を量子化したコードと対応する音声特徴量ベクトルの強調状態での初期状態確率を上記符号帳から求め、
上記強調状態音響モデルより上記音声小段落の２番目以降の各フレームについて上記音声特徴量の組を量子化したコードと対応する音声特徴量ベクトルに対応する状態遷移ごとの強調状態での出力確率を求め、上記音声小段落内の各フレーム間の強調状態での遷移確率を求めるステップと、
(b) 上記音声小段落における全ての状態遷移経路ごとの上記強調状態での初期状態確率と上記出力確率と上記遷移確率の積の最大値又は上記積の総和に基づき、上記音声小段落が強調状態となる尤度を算出するステップと、
(c) 上記強調状態となる尤度に基づいて上記音声小段落が強調状態であるか否かを判定するステップとを含むことを特徴とする音声処理方法。 A speech processing method for determining an emphasized state of speech based on a set of speech features for each frame, the method comprising:
associating a code with each speech feature vector consisting of a set of speech features that includes at least one of the following six: the fundamental frequency, the power, the temporal variation characteristic of a dynamic feature, the inter-frame difference of the fundamental frequency, the inter-frame difference of the power, and the inter-frame difference of the temporal variation characteristic of the dynamic feature;
creating a codebook that stores, for the emphasized state, a code appearance probability with which each code appears, a state transition probability with which each state transitions, and a transition code appearance probability with which the code appears at a state transition;
creating, using the codebook, an emphasized-state acoustic model consisting of the code appearance probability corresponding to the initial state probability in the emphasized state, the transition code appearance probability for each state transition corresponding to the speech feature vector in the emphasized state, and the state transition probability in the emphasized state corresponding to the state transition;
(a-1) determining, for the speech signal of each frame, whether the frame belongs to an unvoiced section or a voiced section;
(a-2) defining, as a speech sub-paragraph, a portion that is sandwiched between unvoiced sections of at least a predetermined number of frames and that includes a voiced section of at least one frame;
(a-3) obtaining from the codebook the initial state probability, in the emphasized state, of the speech feature vector corresponding to the code obtained by quantizing the set of speech features of the first frame of the speech sub-paragraph,
obtaining from the emphasized-state acoustic model, for each of the second and subsequent frames of the speech sub-paragraph, the output probability in the emphasized state for each state transition corresponding to the speech feature vector that corresponds to the code obtained by quantizing the set of speech features, and obtaining the transition probability in the emphasized state between frames in the speech sub-paragraph;
(b) calculating the likelihood that the speech sub-paragraph is in the emphasized state, based on the maximum, or on the sum, over all state transition paths in the speech sub-paragraph, of the product of the initial state probability, the output probabilities, and the transition probabilities in the emphasized state; and
(c) determining whether or not the speech sub-paragraph is in the emphasized state based on the likelihood of the emphasized state.
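
As an illustration of steps (a-3) and (b), the following Python sketch computes the likelihood of one speech sub-paragraph from its sequence of quantized codes. The names sub_paragraph_likelihood, code_prob, trans and emit are hypothetical, and the sketch assumes that every state transition path starts in state 0, which the claim does not specify. With use_max=True it takes the maximum over all paths (a Viterbi-style recursion); otherwise it sums over all paths (a forward recursion).

```python
from typing import List, Sequence


def sub_paragraph_likelihood(
    codes: Sequence[int],
    code_prob: Sequence[float],
    trans: Sequence[Sequence[float]],
    emit: Sequence[Sequence[Sequence[float]]],
    use_max: bool = True,
) -> float:
    """Likelihood of one sub-paragraph under one acoustic model.

    codes      quantized code of each frame, in time order (non-empty)
    code_prob  code_prob[c]: code appearance probability, used as the
               initial state probability of the first frame
    trans      trans[i][j]: state transition probability from i to j
    emit       emit[i][j][c]: probability that code c appears on the
               transition from state i to state j (output probability)
    """
    n_states = len(trans)
    # score[j] holds the best (or total) path product ending in state j.
    score: List[float] = [0.0] * n_states
    # Assumption of this sketch: all paths start in state 0.
    score[0] = code_prob[codes[0]]

    for c in codes[1:]:
        nxt = [0.0] * n_states
        for j in range(n_states):
            terms = [
                score[i] * trans[i][j] * emit[i][j][c]
                for i in range(n_states)
            ]
            nxt[j] = max(terms) if use_max else sum(terms)
        score = nxt

    return max(score) if use_max else sum(score)
```

For long sub-paragraphs the raw product can underflow, so a practical variant would likely accumulate logarithms instead of probabilities.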

請求項１に記載の方法において、
上記符号帳を、平静状態での上記各コードが出現するコード出現確率と、上記平静状態での各状態が遷移する状態遷移確率と、上記平静状態での状態遷移時に上記コードが出現する遷移コード出現確率とをも格納するようにして作成し、
上記平静状態での初期状態確率に対応する上記コード出現確率と、上記音声特徴量ベクトルに対応する状態遷移ごとの上記遷移コード出現確率と、状態遷移に対応する平静状態での上記状態遷移確率とからなる平静状態音響モデルを上記符号帳を用いて作成し、
上記ステップ(a-3) は、更に上記音声小段落の最初のフレームの上記音声特徴量の組を量子化したコードと対応する音声特徴量ベクトルの平静状態での初期状態確率を求め、
上記平静状態音響モデルより上記音声小段落の２番目以降の各フレームについて上記音声特徴量ベクトルの組の量子化した音声特徴量ベクトルに対応する状態遷移ごとの平静状態での出力確率を求め、上記音声小段落内の各フレーム間の平静状態での遷移確率を求めるステップも含み、
上記ステップ(b) は、更に上記音声小段落における上記平静状態音響モデルの全ての状態遷移経路ごとの上記平静状態での初期状態確率と上記出力確率と上記遷移確率の積の最大値又は上記積の総和に基づき、上記音声小段落が平静状態となる尤度として算出するステップを含み、
上記ステップ(c)における上記音声小段落が上記強調状態であるか否かの判定は、上記強調状態音響モデルを用いて求めた上記強調状態となる尤度と、上記平静状態音響モデルを用いて求めた上記平静状態となる尤度とを比較して判定することを特徴とする音声処理方法。 The method according to claim 1, wherein
the codebook is created so as to also store, for a calm state, a code appearance probability with which each code appears, a state transition probability with which each state transitions, and a transition code appearance probability with which the code appears at a state transition;
a calm-state acoustic model is created using the codebook, the model consisting of the code appearance probability corresponding to the initial state probability in the calm state, the transition code appearance probability for each state transition corresponding to the speech feature vector, and the state transition probability in the calm state corresponding to the state transition;
the step (a-3) further includes obtaining the initial state probability, in the calm state, of the speech feature vector corresponding to the code obtained by quantizing the set of speech features of the first frame of the speech sub-paragraph, and
obtaining from the calm-state acoustic model, for each of the second and subsequent frames of the speech sub-paragraph, the output probability in the calm state for each state transition corresponding to the quantized speech feature vector, and obtaining the transition probability in the calm state between frames in the speech sub-paragraph;
the step (b) further includes calculating the likelihood that the speech sub-paragraph is in the calm state, based on the maximum, or on the sum, over all state transition paths of the calm-state acoustic model in the speech sub-paragraph, of the product of the initial state probability, the output probabilities, and the transition probabilities in the calm state; and
the determination in the step (c) as to whether or not the speech sub-paragraph is in the emphasized state is made by comparing the likelihood of the emphasized state obtained using the emphasized-state acoustic model with the likelihood of the calm state obtained using the calm-state acoustic model.

請求項１又は２に記載の方法において、上記ステップ(a-2) は、更に上記音声小段落の後半部に含まれる１フレーム以上の有声区間の平均パワーがその音声小段落内の平均パワーの定数倍より小さい音声小段落を末尾とする音声小段落群を音声段落と判定するステップを含み、
上記ステップ(c)は、強調状態であると判定することに加えて、上記強調状態と判定された音声小段落を含む上記音声段落を要約区間と判定するステップも含むことを特徴とする音声処理方法。 The method according to claim 1 or 2, wherein the step (a-2) further includes determining, as a speech paragraph, a group of speech sub-paragraphs that ends with a speech sub-paragraph in which the average power of a voiced section of one or more frames included in its latter half is smaller than a constant multiple of the average power within that speech sub-paragraph; and
the step (c) includes, in addition to determining the emphasized state, determining as a summary section the speech paragraph that includes a speech sub-paragraph determined to be in the emphasized state.
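
The segmentation in step (a-2) and the speech paragraph rule of this claim can be sketched in Python as follows. The function names, the parameters min_unvoiced and beta, and the treatment of the signal edges as boundaries are assumptions made for illustration. The sketch also assumes that a sub-paragraph runs from its first to its last voiced frame, and that the "average power within the sub-paragraph" is taken over all of its frames; the claim leaves both points open.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start frame, end frame), end exclusive


def split_sub_paragraphs(voiced: Sequence[bool], min_unvoiced: int) -> List[Span]:
    """Cut the frame sequence at unvoiced runs of >= min_unvoiced frames."""
    spans: List[Span] = []
    start = None        # first voiced frame of the span being built
    last_voiced = -1    # most recent voiced frame seen
    gap = 0             # length of the current unvoiced run
    for t, v in enumerate(voiced):
        if v:
            if start is None:
                start = t
            last_voiced = t
            gap = 0
        else:
            gap += 1
            if start is not None and gap >= min_unvoiced:
                spans.append((start, last_voiced + 1))
                start = None
    if start is not None:  # signal ended inside a sub-paragraph
        spans.append((start, last_voiced + 1))
    return spans


def group_paragraphs(
    spans: Sequence[Span],
    power: Sequence[float],
    voiced: Sequence[bool],
    beta: float,
) -> List[List[Span]]:
    """Group sub-paragraphs; a paragraph ends at a sub-paragraph whose
    voiced latter-half average power is below beta times its overall
    average power."""
    paragraphs: List[List[Span]] = []
    current: List[Span] = []
    for s, e in spans:
        current.append((s, e))
        mid = (s + e) // 2
        tail = [power[t] for t in range(mid, e) if voiced[t]]
        whole_avg = sum(power[s:e]) / (e - s)
        if tail and sum(tail) / len(tail) < beta * whole_avg:
            paragraphs.append(current)
            current = []
    if current:  # assumption: trailing sub-paragraphs form a last paragraph
        paragraphs.append(current)
    return paragraphs
```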

請求項２に記載の方法において、上記ステップ(a-2) は、更に上記音声小段落の後半部に含まれる１フレーム以上の有声区間の平均パワーがその音声小段落内の平均パワーの定数倍より小さい音声小段落を末尾とする音声小段落群を音声段落と判定するステップを含み、
上記ステップ(c) における上記比較して判定することは、
(c-1) 上記音声小段落が強調状態となる尤度と平静状態となる尤度の尤度比を算出するステップと、
(c-2) 上記尤度比を基準値と比較し、基準値より大きければ上記音声小段落が強調状態であると判定するステップを有し、
上記ステップ(c)は上記強調状態であると判定することに加えて、上記強調状態と判定された音声小段落を含む上記音声段落を要約区間と判定するステップ(c-3)も含むことを特徴とする音声処理方法。 The method according to claim 2, wherein the step (a-2) further includes determining, as a speech paragraph, a group of speech sub-paragraphs that ends with a speech sub-paragraph in which the average power of a voiced section of one or more frames included in its latter half is smaller than a constant multiple of the average power within that speech sub-paragraph;
the determination by comparison in the step (c) includes:
(c-1) calculating the likelihood ratio between the likelihood that the speech sub-paragraph is in the emphasized state and the likelihood that it is in the calm state; and
(c-2) comparing the likelihood ratio with a reference value and determining that the speech sub-paragraph is in the emphasized state if the ratio is greater than the reference value; and
the step (c) includes, in addition to determining the emphasized state, a step (c-3) of determining as a summary section the speech paragraph that includes a speech sub-paragraph determined to be in the emphasized state.
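
Steps (c-1) to (c-3) can be illustrated with the short Python sketch below. The names likelihood_ratio and select_summary are hypothetical, and each speech paragraph is represented, as an assumption of the sketch, by the list of indices of the sub-paragraphs it contains.

```python
from typing import List, Sequence


def likelihood_ratio(p_emph: float, p_calm: float) -> float:
    """(c-1) Ratio of emphasized-state to calm-state likelihood."""
    if p_calm == 0.0:
        # Guard against division by zero; this handling is an assumption.
        return float("inf") if p_emph > 0.0 else 0.0
    return p_emph / p_calm


def select_summary(
    paragraphs: Sequence[Sequence[int]],
    p_emph: Sequence[float],
    p_calm: Sequence[float],
    reference: float,
) -> List[int]:
    """Return indices of paragraphs chosen as summary sections.

    paragraphs[k] lists the sub-paragraph indices of paragraph k;
    p_emph[i] and p_calm[i] are the two likelihoods of sub-paragraph i.
    """
    selected: List[int] = []
    for k, members in enumerate(paragraphs):
        # (c-2) a sub-paragraph is emphasized if its ratio exceeds the
        # reference value; (c-3) one emphasized member selects the paragraph.
        if any(likelihood_ratio(p_emph[i], p_calm[i]) > reference for i in members):
            selected.append(k)
    return selected
```

Raising the reference value in this sketch can only shrink the selected set, which is the property the adjustment step of the following claim relies on.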

請求項４に記載の方法において、
(c-4) 上記ステップ(c-3)で得られた要約区間の要約率または要約時間が所定の要約率または要約時間であるか否かを判断し、
所定の要約率または要約時間でない場合は上記基準値を変更してステップ(c-2)に戻るステップを含むことを特徴とする音声処理方法。 The method according to claim 4, further comprising a step of:
(c-4) determining whether the summarization rate or summarization time of the summary sections obtained in the step (c-3) equals a predetermined summarization rate or summarization time, and,
if it does not, changing the reference value and returning to the step (c-2).
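
The loop of step (c-4) can be sketched in Python as follows. The claim does not say how the reference value is changed, so the bisection used here is only one possible choice. The names fit_reference, max_ratio, duration, target_time and tol are hypothetical: max_ratio[k] is assumed to be the largest likelihood ratio among the sub-paragraphs of paragraph k, and duration[k] the length of that paragraph.

```python
from typing import Sequence, Tuple


def summary_time(
    max_ratio: Sequence[float], duration: Sequence[float], reference: float
) -> float:
    """Total duration of paragraphs whose best ratio exceeds the reference."""
    return sum(d for r, d in zip(max_ratio, duration) if r > reference)


def fit_reference(
    max_ratio: Sequence[float],
    duration: Sequence[float],
    target_time: float,
    tol: float,
    max_iter: int = 50,
) -> Tuple[float, float]:
    """Adjust the reference value until the summary time is within tol
    of target_time; returns (reference, summary time reached)."""
    lo, hi = 0.0, max(max_ratio)  # hi selects nothing, lo selects the most
    ref = (lo + hi) / 2
    t = summary_time(max_ratio, duration, ref)
    for _ in range(max_iter):
        if abs(t - target_time) <= tol:
            break
        if t > target_time:
            lo = ref  # summary too long: raise the reference value
        else:
            hi = ref  # summary too short: lower the reference value
        ref = (lo + hi) / 2
        t = summary_time(max_ratio, duration, ref)  # back to step (c-2)
    return ref, t
```

Because the summary time changes in jumps as paragraphs enter or leave the selection, an exact match may not exist; the tolerance and the iteration cap keep the loop from running indefinitely in that case.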

請求項１乃至５のいずれかに記載の音声処理方法の各ステップをコンピュータに実行させる音声処理プログラム。 A speech processing program for causing a computer to execute each step of the speech processing method according to any one of claims 1 to 5.