JP2013011680A - Speaker discrimination device, speaker discrimination program, and speaker discrimination method

Info

Publication number: JP2013011680A
Application number: JP2011143215A
Authority: JP
Inventors: Gei Cho; 霓張; Kazuho Maeda; 一穂前田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2011-06-28
Filing date: 2011-06-28
Publication date: 2013-01-17
Anticipated expiration: 2031-06-28
Also published as: JP5672175B2

Abstract

PROBLEM TO BE SOLVED: To provide a device, a program, and a method for easily and accurately discriminating a speaker. SOLUTION: A speaker discrimination device 50 obtains two sets of audio data from microphones arranged for two respective speakers and divides each set into frames. Based on a first probability model, the speaker discrimination device 50 identifies whether each frame belongs to a voiced sound region or an unvoiced sound region. It then determines whether the identification result of each frame identified as a voiced sound region is valid or invalid. To do so, the speaker discrimination device 50 models the energy ratio of the two sets of audio data as a mixture of a plurality of probability distributions, and makes the determination according to which of those distributions the energy ratio between frames belongs to. Finally, based on a second probability model, the speaker discrimination device 50 identifies speech regions and silence regions in the two sets of audio data from the frame identification results obtained after the validity determination.

Description

The present invention relates to a speaker discrimination device, a speaker discrimination program, and a speaker discrimination method.

Techniques are known for determining which speaker is speaking at each point in a conversation among a plurality of speakers.

One example of a technique that discriminates speakers by threshold determination is a speech recognition device. A microphone is connected to this speech recognition device for each participant. With this configuration, the speech recognition device records, in a predetermined area of a storage unit, the speech signal in the interval from when the power of the signal output by a microphone exceeds a power threshold until it falls below that threshold, and treats the recorded signal as the target of speech recognition. After performing speech recognition on the recorded signal, the device associates the microphone's identification information with the recognition result as data for identifying the speaker, and records the result in a minutes area of the storage unit.

One example of a technique that discriminates speakers by sound source localization is an utterance event separation system. This system uses a microphone array having a plurality of microphones pointed radially in different directions. Using a sound source localization algorithm, the system analyzes the multi-channel audio data recorded by the microphone array and estimates the direction of arrival of sound at each point in time. The system also estimates the range in which each speaker, as a sound source, is located. From the localization result and the estimated speaker ranges, the system then identifies which speaker is speaking at each point in time.

JP 2008-309856 A; JP 2007-233239 A

However, as described below, the conventional techniques above cannot discriminate speakers both easily and accurately.

For example, the speech recognition device described above decides whether a speaker is speaking according to whether the power of the speech signal exceeds a power threshold. The accuracy of speaker discrimination therefore depends on the power threshold, but because human voices differ from person to person, it is difficult to set the threshold to an appropriate value. Consequently, the speech recognition device cannot discriminate speakers accurately.
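The dependence on the power threshold can be illustrated with a minimal sketch. The function name `segments_above_threshold`, the per-frame power values, and the threshold values are all hypothetical and chosen only for illustration; none of them come from the cited device.

```python
def segments_above_threshold(powers, threshold):
    """Return (start, end) index pairs where power stays above threshold.

    `end` is exclusive. This mimics recording a signal from the moment its
    power exceeds the threshold until it falls back below it.
    """
    segments = []
    start = None
    for i, p in enumerate(powers):
        if p > threshold and start is None:
            start = i                      # power rose above the threshold
        elif p <= threshold and start is not None:
            segments.append((start, i))    # power fell back below it
            start = None
    if start is not None:                  # still above at the end of data
        segments.append((start, len(powers)))
    return segments


# Hypothetical per-frame powers: a loud talker and a quiet talker, each
# speaking during frames 2..4.
loud = [0.1, 0.1, 0.9, 0.8, 0.9, 0.1]
quiet = [0.1, 0.1, 0.4, 0.3, 0.4, 0.1]

print(segments_above_threshold(loud, 0.5))   # the loud talker is detected
print(segments_above_threshold(quiet, 0.5))  # the quiet talker is missed
print(segments_above_threshold(quiet, 0.2))  # a lower threshold finds them
```

A threshold that suits one voice can miss another entirely, which is the individual-difference problem described above.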

In the utterance event separation system described above, information such as the number of conference participants must be learned in advance in order to estimate the range in which each speaker is located. In addition, the system must use a complex algorithm to estimate the direction of arrival of sound by sound source localization. The utterance event separation system therefore cannot discriminate speakers easily.

The disclosed technique has been made in view of the above, and its object is to provide a speaker discrimination device, a speaker discrimination program, and a speaker discrimination method capable of discriminating speakers easily and accurately.

The speaker discrimination device disclosed in the present application includes an acquisition unit that acquires two sets of audio data from microphones arranged for two respective speakers. The device further includes a framing unit that divides each of the two sets of audio data acquired by the acquisition unit into frames of a predetermined length. The device further includes a first identification unit that identifies, based on a first probability model, whether each frame produced by the framing unit is a voiced sound region or an unvoiced sound region. The device further includes a determination unit that determines whether the identification result of a frame identified as a voiced sound region by the first identification unit is to be treated as valid or invalid. The determination unit models the energy ratio of the two sets of audio data as a mixture of a plurality of probability distributions, and makes this determination according to which of those probability distributions the energy ratio between frames belongs to. The device further includes a second identification unit that identifies, based on a second probability model, speech regions and silence regions in the two sets of audio data from the frame identification results obtained after the determination unit has determined validity or invalidity.

According to one aspect of the speaker discrimination device disclosed in the present application, speakers can be discriminated easily and accurately.

FIG. 1 is a block diagram illustrating the functional configuration of the conversation analysis device according to the first embodiment.
FIG. 2 is a diagram illustrating an example of voiced and unvoiced sounds.
FIG. 3 is a diagram illustrating an example of speech regions and silence regions.
FIG. 4 is a diagram illustrating an example of a state transition diagram of a hidden Markov model.
FIG. 5A is a diagram illustrating an example of the energy of each set of audio data.
FIG. 5B is a diagram illustrating an example of the identification results (voiced sound V or unvoiced sound U) for each set of audio data.
FIG. 5C is a diagram illustrating an example of the energy ratio between frames.
FIG. 5D is a diagram illustrating the distributions to which the energy ratios between frames belong.
FIG. 5E is a diagram illustrating an example of the identification results (voiced sound V or unvoiced sound U) for each set of audio data after replacement.
FIG. 6 is a diagram illustrating an example of a speaker estimation result using the posterior probability ρ_ij.
FIG. 7 is a diagram illustrating an example of a state transition diagram of a hidden Markov model.
FIG. 8 is a flowchart illustrating the procedure of the conversation analysis process according to the first embodiment.
FIG. 9 is a flowchart illustrating the procedure of the conversation analysis process according to the first embodiment.
FIG. 10 is a flowchart illustrating the procedure of the determination process according to the first embodiment.
FIG. 11 is a diagram for explaining an example of a computer that executes the speaker discrimination program according to the first and second embodiments.

Embodiments of the speaker discrimination device, speaker discrimination program, and speaker discrimination method disclosed in the present application are described below in detail with reference to the drawings. These embodiments do not limit the disclosed technique. The embodiments may be combined as appropriate to the extent that their processing does not conflict.

First, the functional configuration of a conversation analysis device that includes the speaker discrimination device according to the present embodiment is described. FIG. 1 is a block diagram illustrating the functional configuration of the conversation analysis device according to the first embodiment. The conversation analysis device 10 shown in FIG. 1 extracts characteristics of the conversation between speaker A and speaker B from two sets of audio data, collected via close-talking microphones 30A and 30B provided for speaker A and speaker B respectively, and analyzes the conversation style.

Two microphones, the close-talking microphones 30A and 30B, are connected to the conversation analysis device 10. The close-talking microphones 30A and 30B are microphones worn by the speakers. Examples of such close-talking microphones include lapel microphones and headset microphones. In the following, the close-talking microphones 30A and 30B may be referred to collectively as "close-talking microphone 30" when there is no need to distinguish them.

Although the example of FIG. 1 uses close-talking microphones, it is not strictly necessary to keep speakers other than the wearer away from the microphone. For example, directional microphones can be used. In that case, a directional microphone is placed so that its sensitivity in the direction from which speaker A speaks is higher than its sensitivity in other directions, and a directional microphone is used in the same way for speaker B.

The registration unit 31 is a processing unit that registers the audio signals collected by the close-talking microphone 30 in the audio storage unit 11 of the conversation analysis device 10. In one aspect, the registration unit 31 performs A/D (analog/digital) conversion on the analog signal input from the close-talking microphone 30 and registers the resulting digital signal in the audio storage unit 11. In the following, the digital signal obtained by A/D conversion of the analog signal input from the close-talking microphone 30A may be referred to as "first audio data", and the digital signal obtained by A/D conversion of the analog signal input from the close-talking microphone 30B may be referred to as "second audio data".

As shown in FIG. 1, the conversation analysis device 10 includes an audio storage unit 11, an extraction unit 13, and an analysis unit 14. In addition to the functional units shown in FIG. 1, the conversation analysis device 10 may include various functional units found in known computers, such as input devices, audio output devices, and a communication interface that controls communication with other devices.

The audio storage unit 11 is a storage unit that stores audio data. It stores first audio data 12A and second audio data 12B. A semiconductor memory element or a storage device can be used for storage units such as the audio storage unit 11. Examples of semiconductor memory elements include VRAM (video random access memory), RAM (random access memory), ROM (read-only memory), and flash memory. Examples of storage devices include hard disks and optical disks.

The first audio data 12A and the second audio data 12B are digital data obtained by A/D conversion of the audio signals collected by the close-talking microphones 30 worn by speaker A and speaker B. The first audio data 12A may contain the voice of speaker B as well as that of speaker A, but speaker A is closer to the close-talking microphone 30A than speaker B or speaker C is. Therefore, in the first audio data 12A, the voice uttered by speaker A has the highest energy even when speaker A and speaker B speak at the same time. Similarly, in the second audio data 12B, the voice uttered by speaker B has the highest energy.

The voiced and unvoiced sounds uttered by a speaker are described next. FIG. 2 is a diagram illustrating an example of voiced and unvoiced sounds. The example of FIG. 2 shows audio data collected with a close-talking microphone at a sampling frequency of 16 kHz. In FIG. 2, the horizontal axis represents time, the vertical axis represents frequency, and the shading represents the magnitude of the spectral entropy.

As shown in FIG. 2, a voiced sound V has large variation in spectral entropy and a lower frequency than an unvoiced sound U. Examples of voiced sounds include the vowels "a", "i", "u", "e", and "o". An unvoiced sound U has a higher frequency than a voiced sound V. Examples of unvoiced sounds include sounds other than vowels, such as "s", "p", and "h". These characteristics of voiced and unvoiced sounds do not depend on the language spoken and are common to any language, such as Japanese, English, or Chinese.

Next, the relationship between voiced and unvoiced sounds and speech and silence regions is described. A "speech region" here is a region in which the speaker is speaking, and it contains both unvoiced sound regions and voiced sound regions. A "silence region" is a region in which the speaker is not speaking, and it corresponds to every part of the audio data other than the speech regions.

FIG. 3 is a diagram illustrating an example of speech regions and silence regions. The example of FIG. 3 shows the case where a speaker says "WaTaShiWa Chou DeSu". In this example, silence regions 43 and 44 lie between speech region 40 ("WaTaShiWa"), speech region 41 ("Chou"), and speech region 42 ("DeSu"). Speech region 40 contains the unvoiced sound "W", voiced sound "a", unvoiced sound "T", voiced sound "a", unvoiced sound "Sh", voiced sound "i", unvoiced sound "W", and voiced sound "a". Speech region 41 contains the unvoiced sound "Ch" and the voiced sound "ou". Speech region 42 contains the unvoiced sound "D", voiced sound "e", unvoiced sound "S", and voiced sound "u".

Returning to FIG. 1, the conversation analysis device 10 includes a speaker discrimination device 50 that determines which speaker is speaking at each point in a conversation among a plurality of speakers. As shown in FIG. 1, the speaker discrimination device 50 includes an acquisition unit 51, a framing unit 52, a first identification unit 53, a determination unit 54, and a second identification unit 55.

The acquisition unit 51 is a processing unit that acquires the first audio data and the second audio data. In one aspect, the acquisition unit 51 reads the first audio data 12A and the second audio data 12B stored in the audio storage unit 11. In another aspect, the acquisition unit 51 can acquire the first audio data and the second audio data as stream data after A/D conversion by the registration unit 31. In a further aspect, the acquisition unit 51 can acquire the first audio data and the second audio data from an external device (not shown) via a network.

The framing unit 52 is a processing unit that divides the first audio data 12A and the second audio data 12B acquired by the acquisition unit 51 into frames of a predetermined length. In one aspect, the framing unit 52 first compares the lengths of the first audio data 12A and the second audio data 12B. If the difference in length is outside the allowable error range, the framing unit 52 outputs an error message to a display unit or the like (not shown) and aborts the subsequent processing. If the lengths are identical or differ by no more than the allowable error, the framing unit 52 divides the first audio data 12A and the second audio data 12B into frames.

As one example, the framing unit 52 uses Formula A and Formula B below to divide each set of audio data into frames 256 ms long, such that consecutive frames overlap by 128 ms. The frame length and the overlap length given here are merely examples, and any values may be used.
S = floor(Y / X) ... (Formula A)
m = floor((S - 256) / 128) + 1 ... (Formula B)
Here, floor(x) is a function that returns the largest integer less than or equal to x, Y is the data amount (bytes) of each of the first audio data 12A and the second audio data 12B, and X is the length (ms) corresponding to 1 byte of data.
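The frame count in Formula B can be sketched as follows. This is a minimal illustration that assumes S is the total length of the audio data in milliseconds and m is the resulting number of frames, which the text implies but does not state outright. The function names `frame_count` and `frame_bounds` are hypothetical, and the hop of 128 ms is taken to be the frame length minus the overlap.

```python
import math


def frame_count(total_ms, frame_ms=256, hop_ms=128):
    """Number of frames m, following Formula B with S = total_ms."""
    if total_ms < frame_ms:
        return 0                       # too short for even one frame
    return math.floor((total_ms - frame_ms) / hop_ms) + 1


def frame_bounds(total_ms, frame_ms=256, hop_ms=128):
    """(start_ms, end_ms) of each frame; consecutive frames overlap."""
    m = frame_count(total_ms, frame_ms, hop_ms)
    return [(k * hop_ms, k * hop_ms + frame_ms) for k in range(m)]


print(frame_count(1000))    # floor((1000 - 256) / 128) + 1
print(frame_bounds(1000))   # each frame starts 128 ms after the previous one
```

Any audio left over after the last full frame is simply not framed in this sketch; how the device treats such a remainder is not described in the text.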

The following description assumes that this processing yields N frames for each of the first audio data 12A and the second audio data 12B. In the following, the N frames obtained from the first audio data 12A may be referred to as "first frame (1)", "first frame (2)", ..., "first frame (N)". Similarly, the N frames obtained from the second audio data 12B may be referred to as "second frame (1)", "second frame (2)", ..., "second frame (N)".

The first identification unit 53 is a processing unit that identifies, based on the first probability model, whether each frame produced by the framing unit 52 is a voiced sound region or an unvoiced sound region. In one aspect, the first identification unit 53 performs the following processing on first frame (1) through first frame (N) and second frame (1) through second frame (N), separately for each set of audio data. It extracts three features: the number of peaks of the autocorrelation coefficient, the maximum value of those peaks, and the spectral entropy. It then calculates the mean and standard deviation of each of the three features for each set of audio data. Finally, using a hidden Markov model (HMM), which is a probability model, it identifies voiced sound regions and unvoiced sound regions frame by frame for each set of audio data.
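The three features can be sketched in plain Python as follows. The text does not define them precisely, so this sketch makes several assumptions: the autocorrelation is normalized so that lag 0 equals 1, a "peak" is a local maximum at a lag greater than zero, and the spectral entropy is the Shannon entropy of the normalized power spectrum. All function names and the test signals are hypothetical.

```python
import cmath
import math
import random


def autocorrelation(x):
    """Normalized autocorrelation r[k] for lags 0..len(x)-1 (r[0] == 1)."""
    n = len(x)
    mean = sum(x) / n
    c = [v - mean for v in x]
    energy = sum(v * v for v in c)
    if energy == 0:
        return [0.0] * n
    return [sum(c[i] * c[i + k] for i in range(n - k)) / energy
            for k in range(n)]


def autocorr_peaks(r):
    """Count local maxima of r at lag > 0; return (count, max peak value)."""
    peaks = [r[k] for k in range(1, len(r) - 1)
             if r[k] > r[k - 1] and r[k] > r[k + 1]]
    return len(peaks), (max(peaks) if peaks else 0.0)


def spectral_entropy(x):
    """Shannon entropy (bits) of the normalized power spectrum of x."""
    n = len(x)
    power = []
    for f in range(n // 2 + 1):          # naive DFT, fine for short frames
        s = sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        power.append(abs(s) ** 2)
    total = sum(power)
    if total == 0:
        return 0.0
    probs = [p / total for p in power if p > 0]
    return -sum(p * math.log2(p) for p in probs)


# Hypothetical test signals: a periodic tone and seeded random noise.
sine = [math.sin(2 * math.pi * t / 16) for t in range(64)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(64)]

for name, frame in (("sine", sine), ("noise", noise)):
    count, peak = autocorr_peaks(autocorrelation(frame))
    print(name, count, round(peak, 3), round(spectral_entropy(frame), 3))
```

The periodic signal shows a strong autocorrelation peak and a spectrum concentrated in one bin, whereas the noise spreads its power across many bins and so has a higher spectral entropy.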

The method for identifying voiced sound regions and unvoiced sound regions is described next. FIG. 4 is a diagram illustrating an example of a state transition diagram of the hidden Markov model. As shown in FIG. 4, the first identification unit 53 takes the three features above, together with the mean and standard deviation of each feature, as observations, and uses the EM (expectation-maximization) algorithm to calculate the state transition probability P_t.

The state transition probability P_t refers, for example, to the probability of remaining in the voiced state, the probability of transitioning from the voiced state to the unvoiced state, the probability of remaining in the unvoiced state, and the probability of transitioning from the unvoiced state to the voiced state. In the example of FIG. 4, an utterance is assumed to be equally likely to begin with a voiced or an unvoiced sound, so the probabilities of the voiced and unvoiced states at the start of an utterance are both set to 0.5. As initial values of the state transition probability P_t, the probability of remaining in the voiced state is set to 0.95 and the probability of transitioning from the voiced state to the unvoiced state is set to 0.05. Likewise, the probability of remaining in the unvoiced state is set to 0.95 and the probability of transitioning from the unvoiced state to the voiced state is set to 0.05. With these settings, the first identification unit 53 repeats the calculation of the state transition probability P_t a predetermined number of times, which allows P_t to be calculated accurately.

Further, the first identification unit 53 takes the three features above, together with the mean and standard deviation of each feature, as observations, and uses the Viterbi algorithm to calculate the observation probability P_o for each set of audio data. The observation probability P_o refers, for example, to the probability of outputting "observed" from the voiced state, the probability of outputting "not observed" from the voiced state, the probability of outputting "observed" from the unvoiced state, and the probability of outputting "not observed" from the unvoiced state. The observation probability is also called the output (emission) probability.

After calculating the state transition probability P_t and the observation probability P_o, the first identification unit 53 uses the Viterbi algorithm, based on the three features above, to identify whether the sound in each frame in which speech occurs is a voiced sound V or an unvoiced sound U. The first identification unit 53 then treats the regions identified as voiced sounds as voiced sound regions and the regions identified as unvoiced sounds as unvoiced sound regions.
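The Viterbi decoding step can be sketched for a two-state V/U model as follows. The start probabilities (0.5) and the transition probabilities (0.95 and 0.05) are the initial values given in the text. The emission table, the discretized observation symbols `"periodic"` and `"noisy"`, and the function name `viterbi` are assumptions made for illustration. The device described here works from three continuous features, and P_t and P_o would be estimated from the data rather than fixed.

```python
import math

STATES = ("V", "U")
START = {"V": 0.5, "U": 0.5}                   # start probabilities from the text
TRANS = {"V": {"V": 0.95, "U": 0.05},          # initial values from the text
         "U": {"V": 0.05, "U": 0.95}}
# Hypothetical emission probabilities over a discretized observation.
EMIT = {"V": {"periodic": 0.8, "noisy": 0.2},
        "U": {"periodic": 0.3, "noisy": 0.7}}


def viterbi(observations):
    """Most likely V/U state sequence for the observations (log domain)."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
             for s in STATES}
    back = []                                  # back-pointers, one dict per step
    for obs in observations[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            pointers[s] = best
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][obs]))
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):            # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi(["periodic"] * 4 + ["noisy"] * 4))
print(viterbi(["periodic", "periodic", "noisy", "periodic", "periodic"]))
```

Because the self-transition probability is high, a single out-of-place observation does not flip the decoded state, while a sustained change in the observations does.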

In this way, the first identification unit 53 identifies voiced sound regions and unvoiced sound regions using features such as the number of autocorrelation coefficient peaks, the maximum value of those peaks, and the spectral entropy. The first identification unit 53 can therefore suppress the loss of identification accuracy caused by ambient noise. Moreover, because it uses features that are robust to ambient noise, the number of frames can be reduced when the first audio data 12A and the second audio data 12B are divided into frames. The first identification unit 53 can thus identify voiced sound regions and unvoiced sound regions with simpler processing.

The determination unit 54 is a processing unit that determines whether the identification result of a frame identified as a voiced sound region by the first identification unit 53 is to be treated as valid or invalid. In one aspect, the determination unit 54 models the energy ratio of the two sets of audio data as a mixture of a plurality of probability distributions, and then validates or invalidates the identification result of a frame according to which of those probability distributions the energy ratio between frames belongs to.

Specifically, the speaker discrimination device 50 according to this embodiment assumes a mixture of three Gaussian distributions, arranged according to the magnitude of the energy ratio obtained from the two sets of audio data. The description here uses the energy ratio of the first audio data 12A to the second audio data 12B, that is, the energy ratio of speaker A to speaker B, but the ratio of the second audio data 12B to the first audio data 12A may be used instead. In that case, the assumptions are the reverse of those made for the ratio of speaker A to speaker B. The three Gaussian distributions are assumed on the premise that, within the first audio data 12A, the energy of speech uttered by speaker A is higher than that of speech uttered by speaker B.

In one aspect of this Gaussian mixture, the energy ratio is modeled with three distributions: a "first distribution" in which speaker B is speaking, a "second distribution" in which both speaker A and speaker B are speaking, and a "third distribution" in which speaker A is speaking. When speaker B is speaking, the energy of speaker A's voice can be presumed lower than that of speaker B's voice, so the first distribution is assigned to the band of low energy ratios. When both speakers are speaking, the energies of the two voices can be presumed roughly equal, so the second distribution is assigned to the band of intermediate energy ratios. When speaker A is speaking, the energy of speaker A's voice can be presumed higher than that of speaker B's voice, so the third distribution is assigned to the band of high energy ratios.

Under these assumptions, the determination unit 54 uses the existing expectation-maximization method, the so-called EM method, to estimate the probability that the energy ratio of each frame pair belongs to the first distribution, the second distribution, or the third distribution.

In one aspect of this EM method, the determination unit 54 first calculates the energy ratio between corresponding frames of the first audio data 12A and the second audio data 12B. If both speaker A and speaker B are silent, the energies of the two signals are roughly equal, and the frame could be wrongly estimated to belong to the second distribution. To use only voiced sound regions, which are likely to be estimated as utterance regions, the determination unit 54 therefore calculates the energy ratio only for frame pairs in which at least one of the two frames was identified as a voiced sound V. The energy of a frame can be calculated by applying a fast Fourier transform (FFT) to the frame for frequency analysis and then averaging the amplitude values of the frequency components. The energy ratio can be calculated by dividing the energy of the first frame (j) by the energy of the second frame (j). Here "j" is a natural number from 1 to N and denotes the j-th of the N frames.
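The energy and ratio calculation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `frame_energy` and `energy_ratios`, the `"V"`/`"U"` label strings, and the small `eps` guard against division by zero are all assumptions, and a naive DFT stands in for the FFT so that the sketch needs only the standard library.

```python
import cmath

def frame_energy(frame):
    """Mean amplitude over the DFT bins of one frame (naive O(n^2) DFT)."""
    n = len(frame)
    amplitudes = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame))
        amplitudes.append(abs(coeff))
    return sum(amplitudes) / n

def energy_ratios(frames_a, frames_b, labels_a, labels_b, eps=1e-12):
    """Return {j: E_A(j) / E_B(j)} for 1-based frame numbers j.

    Only frame pairs where at least one side is labelled voiced ("V")
    are included; `eps` is an assumed guard for all-zero frames.
    """
    ratios = {}
    pairs = zip(frames_a, frames_b, labels_a, labels_b)
    for j, (fa, fb, la, lb) in enumerate(pairs, start=1):
        if la == "V" or lb == "V":
            ratios[j] = frame_energy(fa) / (frame_energy(fb) + eps)
    return ratios
```

Frame pairs in which both sides are unvoiced are simply absent from the returned dictionary, which mirrors the blank entries described for FIG. 5C.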

The determination unit 54 then computes the logarithm X_j of the energy ratio calculated above, as given by equation (1). The logarithm is taken because two energy ratios that are reciprocals of each other then map to values that are symmetric about zero, equal in magnitude and opposite in sign.
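A small numerical check of this symmetry, as a sketch: the text here does not state the base of the logarithm, so the natural logarithm is an assumption, and `log_energy_ratio` is a hypothetical helper name.

```python
import math

def log_energy_ratio(energy_a, energy_b):
    """X_j = log(E_A(j) / E_B(j)); the natural log is an assumed choice of base."""
    return math.log(energy_a / energy_b)

# Speaker A four times louder than B, and the reverse situation.
x_a_louder = log_energy_ratio(4.0, 1.0)
x_b_louder = log_energy_ratio(1.0, 4.0)

# The raw ratios are 4 and 0.25 (asymmetric around 1), whereas the log
# ratios are equal in magnitude and opposite in sign.
symmetric = math.isclose(x_a_louder, -x_b_louder)
```

Any other base would scale both values by the same constant and preserve the symmetry.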

The determination unit 54 further sets initial values for the quantities used in the EM method. For example, the determination unit 54 sorts the per-frame log energy ratios X_j in ascending order to obtain the sorted log energy ratios X_{j_sorted}. The determination unit 54 then generates a matrix from the sorted log energy ratios X_{j_sorted} to obtain initial values of the posterior probability ρ_ij. The posterior probability ρ_ij is estimated by maximum likelihood through repeated execution of the maximization step (M step) and the expectation step (E step) described below. For this reason, the sorted log energy ratios X_{j_sorted} need not be used, and random values may be used instead. Here "i" indexes the first to third distributions; ρ_ij denotes the probability that the energy ratio of the j-th frame belongs to the i-th distribution. The sorted log energy ratios X_{j_sorted} and the initial values of the posterior probability ρ_ij set in this way are supplied to the M step.

The determination unit 54 then executes the M step, which computes the model consisting of the first, second, and third distributions. In one aspect, the determination unit 54 uses equations (2) to (4) to calculate, and thereby update, the parameters ρ_i, μ_i, and σ_i that define each of the three distributions. On the first pass, immediately after the initial values have been calculated, the sorted log energy ratios X_{j_sorted} and the initial values of the posterior probability ρ_ij are used in the calculation. After an E step has been executed, the posterior probability ρ_ij updated in that E step and the log energy ratios X_j are used instead.

Next, the determination unit 54 executes the E step, which uses the model calculated in the M step to compute the expected value of the model's likelihood. In one aspect, the determination unit 54 substitutes the parameters ρ_i, μ_i, and σ_i calculated in the M step into equations (5) to (7) to calculate the probability density N(x_j; μ_i, σ_i), the mixture likelihood f(x_j), and the posterior probability ρ_ij. The determination unit 54 repeats the M step and the E step a predetermined number of times, for example five times.
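The initialization, M step, and E step can be sketched as below. Equations (2) to (7) are not reproduced in this text, so the sketch substitutes the textbook update rules for a one-dimensional Gaussian mixture; that substitution, the equal-thirds hard assignment used to build the initial posterior matrix, the `min_sigma` floor, and all function names are assumptions rather than the patent's own formulas.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density N(x; mu, sigma) of a one-dimensional Gaussian."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def initial_posteriors(xs, k=3):
    """Assumed initialization: sort the values and hard-assign equal-sized
    groups (low, middle, high) to the first, second, and third distribution."""
    order = sorted(range(len(xs)), key=lambda j: xs[j])
    post = [[0.0] * len(xs) for _ in range(k)]
    for rank, j in enumerate(order):
        post[min(k - 1, rank * k // len(xs))][j] = 1.0
    return post

def m_step(xs, post, min_sigma=1e-3):
    """Update (weight, mean, std) per component from the posteriors."""
    params = []
    for row in post:
        total = sum(row) or 1e-12  # guard against an empty component
        weight = total / len(xs)
        mu = sum(r * x for r, x in zip(row, xs)) / total
        var = sum(r * (x - mu) ** 2 for r, x in zip(row, xs)) / total
        params.append((weight, mu, max(math.sqrt(var), min_sigma)))
    return params

def e_step(xs, params):
    """Recompute the posteriors post[i][j] from the current parameters."""
    post = [[0.0] * len(xs) for _ in params]
    for j, x in enumerate(xs):
        dens = [w * gaussian_pdf(x, mu, s) for w, mu, s in params]
        f = sum(dens)  # mixture likelihood f(x_j)
        for i, d in enumerate(dens):
            post[i][j] = d / f if f > 0.0 else 1.0 / len(params)
    return post

def fit_em(xs, iterations=5):
    """Run the M step and E step a fixed number of times (five in the text)."""
    post = initial_posteriors(xs)
    params = m_step(xs, post)
    for _ in range(iterations):
        params = m_step(xs, post)
        post = e_step(xs, params)
    return params, post
```

For example, `fit_em([-2.1, -1.9, -2.0, 0.1, -0.1, 0.0, 2.0, 1.9, 2.1])` returns three components whose means sit near -2, 0, and 2, with each value assigned almost entirely to its own cluster. A fixed iteration count follows the text; a convergence test on the likelihood would be a common alternative.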

After executing the M step and the E step the predetermined number of times, the determination unit 54 uses the posterior probability ρ_ij calculated by the EM method to determine whether to validate or invalidate the identification result of each frame that the first identification unit 53 identified as a voiced sound V.

In one aspect of this validation or invalidation, the determination unit 54 compares ρ_1j, as calculated by the EM method, with ρ_2j and ρ_3j, and determines whether ρ_2j > ρ_1j or ρ_3j > ρ_1j holds. If ρ_2j > ρ_1j, the energy ratio of the j-th frame is more likely to belong to the second distribution than to the first, so it is more likely that both speaker A and speaker B are speaking than that speaker B is speaking alone. If ρ_3j > ρ_1j, the energy ratio of the j-th frame is more likely to belong to the third distribution than to the first, so it is more likely that speaker A is speaking alone than that speaker B is speaking alone. Accordingly, when the identification results of the first frame (j) and the second frame (j) are both voiced sound V, the determination unit 54 replaces the identification result of the second frame (j), changing it from voiced sound V to unvoiced sound U. This invalidates the identification result of the second frame that the first identification unit 53 identified as a voiced sound V.

Conversely, if neither ρ_2j > ρ_1j nor ρ_3j > ρ_1j holds, that is, if ρ_2j < ρ_1j and ρ_3j < ρ_1j, the energy ratio of the j-th frame is more likely to belong to the first distribution than to the second or third distribution. In this case, it is more likely that speaker B is speaking alone than that speaker A is speaking alone or that both speakers are speaking. Accordingly, when the identification results of the first frame (j) and the second frame (j) are both voiced sound V, the determination unit 54 replaces the identification result of the first frame (j), changing it from voiced sound V to unvoiced sound U. This invalidates the identification result of the first frame that the first identification unit 53 identified as a voiced sound V.
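The relabeling step can be sketched as below, under one stated reading. The prose rule compares ρ_2j and ρ_3j against ρ_1j, whereas the worked example of FIG. 5E and the statement that second-distribution frames keep both labels suggest assigning each frame to its most probable distribution. This sketch assumes the latter reading; `apply_validity`, the label strings, and the dictionary layout are hypothetical.

```python
def apply_validity(labels_a, labels_b, posteriors):
    """Invalidate one side of frame pairs that are both labelled voiced.

    labels_a, labels_b: lists of "V"/"U" for the first and second frames.
    posteriors: {frame_index: (p1, p2, p3)}, the posterior probabilities
    of the first, second, and third distribution for that frame pair.
    Returns new label lists; the inputs are not modified.
    """
    out_a, out_b = list(labels_a), list(labels_b)
    for j, probs in posteriors.items():
        if labels_a[j] != "V" or labels_b[j] != "V":
            continue  # only pairs where both frames are voiced are touched
        best = max(range(3), key=lambda i: probs[i])
        if best == 0:
            out_a[j] = "U"  # first distribution: speaker B alone
        elif best == 2:
            out_b[j] = "U"  # third distribution: speaker A alone
        # best == 1: both speaking, so both labels stay voiced
    return out_a, out_b
```

Under the literal prose rule, the `best == 1` case would also relabel the second frame; only the branch condition would need to change.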

An example of validating or invalidating identification results is now described with reference to FIGS. 5A to 5E. FIG. 5A shows an example of the energy of each set of audio data. FIG. 5B shows an example of the voiced sound V or unvoiced sound U identification results for each set of audio data. FIG. 5C shows an example of the inter-frame energy ratios. FIG. 5D shows the distribution to which each inter-frame energy ratio belongs. FIG. 5E shows an example of the voiced sound V or unvoiced sound U identification results for each set of audio data after replacement.

As an example, suppose that the energies of the frames in the same interval of each set of audio data take the values shown in FIG. 5A, and that the identification results of the first frames and the second frames are as shown in FIG. 5B. In this case, the determination unit 54 calculates the inter-frame energy ratios. As shown in FIG. 5C, the ratio is calculated only for frame pairs in which at least one of the first frame and the second frame was identified as a voiced sound V, that is, for all frames except those whose value is blank in the figure.

The EM method is then used to calculate which of the first, second, and third distributions each inter-frame energy ratio belongs to. In the example of FIG. 5D, frames belonging to the first distribution are shown with a dark fill, frames belonging to the second distribution with a light fill, and frames belonging to the third distribution with hatching. In this case, as shown in FIG. 5E, among the second frames identified as voiced sound V, the identification results of the second frame (2), the second frame (6), and the second frame (8), which were estimated to belong to the third distribution, are replaced with unvoiced sound U. In addition, among the first frames identified as voiced sound V, the identification results of the first frame (13), the first frame (15), the first frame (16), the first frame (18), and the first frame (20), which were estimated to belong to the first distribution, are replaced with unvoiced sound U.

In this way, the determination unit 54 models the energy ratio of the two speakers' audio data as a Gaussian mixture and then validates or invalidates each voiced sound V identification result according to the distribution to which the inter-frame energy ratio belongs. The second identification unit 55 in the subsequent stage then identifies the utterance regions and silence regions of each set of audio data. The speaker can therefore be determined without a threshold comparison between frames of the same interval in the two sets of audio data. Furthermore, unlike the conventional techniques described above, no prior training is required, and no complicated algorithm is needed for speaker discrimination.

In addition, when the inter-frame energy ratio is estimated to belong to the second distribution, the determination unit 54 keeps the identification results of the frames identified as voiced sound V. In the example of FIG. 5E, even if speaker B's utterance is quieter than speaker A's, the identification results of the first frames and second frames from the 10th to the 12th frame remain voiced sound V. Simultaneous speech can therefore be detected even when the two speakers speak at different volumes.

The description above uses the per-frame energy ratio of speaker A to speaker B, but the per-frame energy ratio of speaker B to speaker A may be used in addition. For example, when the posterior probability ρ_ij is calculated using the energy ratio of speaker A to speaker B, the determination unit 54 replaces the identification result of the second frame (j), changing it from voiced sound V to unvoiced sound U, if ρ_2j > ρ_1j or ρ_3j > ρ_1j. When the posterior probability ρ_ij is calculated using the energy ratio of speaker B to speaker A, the determination unit 54 replaces the identification result of the first frame (j), changing it from voiced sound V to unvoiced sound U, if ρ_2j > ρ_1j or ρ_3j > ρ_1j. If the initial values given for the posterior probability ρ_ij differ between the two cases, the posterior probability ρ_ij calculated by the EM method also differs, and so does the speaker estimate derived from it.

FIG. 6 shows an example of the speaker estimation results obtained using the posterior probability ρ_ij. In the example of FIG. 6, frames estimated to be uttered by speaker A are shown with a dark fill, and frames estimated to be uttered by speaker B are shown with hatching. Frames estimated to contain speech are shown with a light fill. Here too, as in the example of FIG. 5E, among the second frames identified as voiced sound V, the identification results of the second frame (2), the second frame (6), and the second frame (8), which were estimated to belong to the third distribution, are replaced with unvoiced sound U. In addition, among the first frames identified as voiced sound V, the identification results of the first frame (13), the first frame (15), the first frame (16), the first frame (18), and the first frame (20), which were estimated to belong to the first distribution, are replaced with unvoiced sound U.

Thus, even when identification results are validated or invalidated using both the energy ratio of speaker A to speaker B and the energy ratio of speaker B to speaker A, the same result as in FIG. 5E can be obtained.

The second identification unit 55 is a processing unit that, based on a second probability model, identifies the utterance regions and silence regions in the two sets of audio data from the frame identification results obtained after the determination unit 54 has validated or invalidated them.

The method of identifying utterance regions and silence regions is now described. FIG. 7 shows an example of a state transition diagram of the hidden Markov model. The state transition probability P_t and the observation probability P_o shown in FIG. 7 are predetermined values. The state transition probability P_t comprises, for example, the probability of remaining in the silence state, the probability of moving from the silence state to the utterance state, the probability of remaining in the utterance state, and the probability of moving from the utterance state to the silence state. In the example of FIG. 7, on the assumption that an utterance begins with a voiced sound or an unvoiced sound with equal probability, the initial probabilities of the silence state and the utterance state are both set to 0.5. For the state transition probability P_t, the probability of remaining in the silence state is set to 0.999 and the probability of moving from the silence state to the utterance state is set to 0.001. Likewise, the probability of remaining in the utterance state is set to 0.999 and the probability of moving from the utterance state to the silence state is set to 0.001.

The observation probability P_o comprises, for example, the probability that an unvoiced sound is detected in the silence state, the probability that a voiced sound is detected in the silence state, the probability that an unvoiced sound is detected in the utterance state, and the probability that a voiced sound is detected in the utterance state. In the example of FIG. 7, the probability that an unvoiced sound is detected in the silence state is set to 0.99 and the probability that a voiced sound is detected in the silence state is set to 0.01. The probability that an unvoiced sound is detected in the utterance state is set to 0.5, and the probability that a voiced sound is detected in the utterance state is also set to 0.5.

The example of FIG. 7 sets both the probability that an unvoiced sound is detected in the utterance state and the probability that a voiced sound is detected in the utterance state to 0.5. During simultaneous speech, however, the number of unvoiced sounds may increase for a speaker whose utterances are quieter than the other speaker's. Setting the probability that an unvoiced sound is detected in the utterance state to a value greater than 0.5 can therefore suppress the effect of this increase in unvoiced sounds for the quieter speaker.

With these settings, the second identification unit 55 uses the Viterbi algorithm to identify the silence regions and utterance regions in each set of audio data from the voiced and unvoiced sounds that remain after validation or invalidation by the determination unit 54. This identifies the utterance regions and silence regions of speaker A in the first audio data and the utterance regions and silence regions of speaker B in the second audio data.
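The second-stage decoding can be sketched as a two-state Viterbi pass using the probabilities quoted for FIG. 7. The state names, the `"V"`/`"U"` observation symbols, and the function name are assumptions; log probabilities are used to avoid underflow on long sequences, and the input is assumed to be non-empty.

```python
import math

STATES = ("silence", "speech")
START = {"silence": 0.5, "speech": 0.5}
TRANS = {"silence": {"silence": 0.999, "speech": 0.001},
         "speech": {"silence": 0.001, "speech": 0.999}}
EMIT = {"silence": {"U": 0.99, "V": 0.01},
        "speech": {"U": 0.5, "V": 0.5}}

def viterbi(observations):
    """Most likely silence/speech state per frame for a V/U label sequence."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
             for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][obs]))
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Because the self-transition probabilities are 0.999, an isolated unvoiced frame inside a voiced run does not break the utterance region: `viterbi(list("VVUVV"))` stays in the speech state for all five frames.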

Returning to the description of the conversation analysis device 10, the extraction unit 13 is a processing unit that extracts conversation characteristics from each set of audio data. In one aspect, the extraction unit 13 calculates, from the utterance regions of speaker A in the first audio data identified by the second identification unit 55, the number of voiced sound regions, the mean length of the voiced sound regions, and the standard deviation of the length of the voiced sound regions. From the same utterance regions, the extraction unit 13 also calculates the number of utterance regions, the mean length of the utterance regions, and the standard deviation of the length of the utterance regions. Furthermore, from the silence regions of speaker A in the first audio data identified by the second identification unit 55, the extraction unit 13 calculates the number of silence regions, the mean length of the silence regions, and the standard deviation of the length of the silence regions.
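The count, mean, and standard deviation of region lengths can be sketched by grouping consecutive frames with the same label. The function name, the `frame_seconds` parameter, and the use of the population standard deviation are assumptions; the text does not say which form of standard deviation is intended.

```python
import statistics
from itertools import groupby

def region_stats(labels, target, frame_seconds=1.0):
    """Return (count, mean length, std of length) of runs equal to `target`.

    labels: per-frame state labels, e.g. "speech" / "silence".
    frame_seconds: assumed duration of one frame, used to scale lengths.
    """
    lengths = [sum(1 for _ in run) * frame_seconds
               for value, run in groupby(labels) if value == target]
    if not lengths:
        return 0, 0.0, 0.0
    return len(lengths), statistics.mean(lengths), statistics.pstdev(lengths)
```

The same helper applies to voiced sound regions, utterance regions, and silence regions by changing the label sequence and `target`.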

The extraction unit 13 also calculates the ratio of speaker A's speaking time to the total duration of the conversation, taking the sum of the lengths of speaker A's utterance regions as speaker A's speaking time. The extraction unit 13 further calculates the ratio of speaker A's speaking time to speaker B's speaking time, and likewise the ratio of speaker A's speaking time to speaker C's speaking time. From speaker A's utterance regions, the extraction unit 13 calculates the standard deviation of the volume and the standard deviation of the spectral entropy, and it calculates the sum of these two standard deviations as the degree of change. Although the extraction of speaker A's conversation characteristics is described here as an example, the conversation characteristics of speaker B are extracted in the same way.

Of the conversation characteristics calculated in this way, the number of voiced sound regions, the mean length of the voiced sound regions, and the standard deviation of their length indicate how long the voiced sounds are. The number of utterance regions, the mean length of the utterance regions, and the standard deviation of their length indicate whether the person habitually speaks at length in conversation or speaks only a little. The number of silence regions, the mean length of the silence regions, and the standard deviation of their length indicate whether the speaker talks continuously for long stretches or talks with many interruptions (silences). The ratio of a person's speaking time to the total duration of the conversation, and the ratio R_t of a person's speaking time to another person's speaking time, indicate the degree of participation in the conversation. The standard deviation of the volume, the standard deviation of the spectral entropy, and the degree of change indicate whether the person is a passionate speaker whose emotions change sharply or a quiet speaker whose emotions change little.

The analysis unit 14 is a processing unit that analyzes conversation style based on the conversation characteristics extracted by the extraction unit 13. In one aspect, when the ratio R_t of a person's speaking time to another person's speaking time is at or above a predetermined value, for example 1.5, the analysis unit 14 concludes that the person talks a lot in the conversation. When the ratio R_t is at or below a predetermined value, for example 0.66, the analysis unit 14 concludes that the person talks little in the conversation and is a listener. When the ratio R_t is greater than, for example, 0.66 and less than 1.5, the analysis unit 14 concludes that the two participate in the conversation on an equal footing.
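The three-way classification by R_t can be sketched as follows, using the example thresholds 1.5 and 0.66 from the text. The function names and the returned label strings are assumptions for illustration.

```python
def speaking_time_ratio(time_person, time_other):
    """R_t: a person's speaking time divided by the other person's."""
    return time_person / time_other

def classify_participation(ratio, high=1.5, low=0.66):
    """Map R_t to a participation label using the example thresholds."""
    if ratio >= high:
        return "talkative"
    if ratio <= low:
        return "listener"
    return "equal"
```

For instance, 90 seconds of speech against 45 seconds gives R_t = 2.0 and the label "talkative", while the reverse gives R_t = 0.5 and the label "listener". The two example thresholds are approximately reciprocal (1/1.5 is about 0.667), so swapping the two speakers generally swaps the labels.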

In another aspect, the analysis unit 14 analyzes as follows when, for a given person, both the ratio of the number of voiced sound regions to the number of utterance regions and the mean length of the utterance regions are greater than the corresponding values for the other person: it concludes that the person tends to speak at length in conversation. The analysis unit 14 also analyzes as follows when the mean length of a person's silence regions is greater than that of the other person and the standard deviation of the length of the person's silence regions is at or above a predetermined value, for example 6.0: it concludes that the person listens to the other party and breaks off their own speech in response to what the other party says, so that the length of their utterances is not constant.

In a further aspect, when the standard deviation of a certain person's volume, the standard deviation of the spectral entropy, or the degree of change is equal to or greater than its corresponding reference value, the analysis unit 14 determines that this person is a passionate speaker whose emotions change sharply. When the standard deviation of the person's volume, the standard deviation of the spectral entropy, or the degree of change is less than its corresponding reference value, the analysis unit 14 determines that this person is a quiet speaker whose emotions change little.

In another aspect, the analysis unit 14 can also analyze the relationship between a certain person and another person. For example, when the ratio R_t of a certain person's utterance time to the other person's utterance time is equal to or greater than a predetermined value, for example 1.0, the certain person often talks to the other person, so the analysis unit 14 can determine that the two are friends or family. On the other hand, when the ratio R_t is less than the predetermined value, for example 1.0, the certain person is trying to listen to the other person, so the analysis unit 14 can determine that the two are company colleagues or business partners.

In a further aspect, when the average length of a certain person's utterance regions in a conversation with another person is equal to or greater than a predetermined value, for example 1.85 (s), the analysis unit 14 can determine that the two are friends or family. This is because the certain person often talks to the other person. On the other hand, when the average length of the certain person's utterance regions is less than the predetermined value, for example 1.85 (s), the analysis unit 14 can determine that the two are company colleagues or business partners.

In another aspect, when the average length of a certain person's silence regions in a conversation with another person is equal to or less than a predetermined value, for example 3.00 (s), the analysis unit 14 can determine, for the same reason, that the two are friends or family. On the other hand, when the average length of the certain person's silence regions is greater than the predetermined value, for example 3.00 (s), the analysis unit 14 can determine that the two are company colleagues or business partners.

In a further aspect, when the degree of change of a certain person in a conversation with another person is equal to or greater than a predetermined value, for example 0.33, the analysis unit 14 can determine, for the same reason, that the two are friends or family. On the other hand, when the degree of change of the certain person is less than the predetermined value, for example 0.33, the analysis unit 14 can determine that the two are company colleagues or business partners.

After performing these analyses, the analysis unit 14 can output the analysis results to a predetermined output destination, for example a display unit of the conversation analysis device 10 or an information processing device used by speaker A and speaker B.

Various integrated circuits and electronic circuits can be used for the speaker discrimination device 50, the extraction unit 13, and the analysis unit 14. Some of the functional units included in the speaker discrimination device 50 can also be implemented as separate integrated circuits or electronic circuits. Examples of integrated circuits include an ASIC (Application Specific Integrated Circuit). Examples of electronic circuits include a CPU (Central Processing Unit) and an MPU (Micro Processing Unit).

Next, the flow of processing of the conversation analysis device according to this embodiment will be described. Here, (1) the conversation analysis process executed by the conversation analysis device 10 is described first, followed by (2) the determination process executed by the speaker discrimination device 50.

(1) Conversation Analysis Process
FIGS. 8 and 9 are flowcharts illustrating the procedure of the conversation analysis process according to the first embodiment. As an example, the conversation analysis process starts when an instruction to execute it is received from an input unit (not shown).

As shown in FIG. 8, the acquisition unit 51 acquires the first audio data 12A and the second audio data 12B (step S101). The framing unit 52 then determines whether the first audio data 12A and the second audio data 12B have the same length (step S102). Here, "the same" includes the case where the difference in length is within an allowable error range.

If the two audio data do not have the same length (No at step S102), the framing unit 52 outputs an error message to a display unit (not shown) (step S103) and ends the process.

On the other hand, if the two audio data have the same length (Yes at step S102), the framing unit 52 divides the first audio data 12A and the second audio data 12B into frames (step S104).
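Steps S102 to S104 can be sketched as follows. This is only an illustration under stated assumptions: the frame length of 256 samples, the use of non-overlapping frames, the `tolerance` parameter, and the use of an exception in place of the error message are all hypothetical choices that the text does not specify.

```python
def frame_audio(samples_a, samples_b, frame_len=256, tolerance=0):
    """Check that both recordings match in length, then cut each into frames."""
    # S102/S103: lengths must agree within the allowed tolerance (assumed, in samples).
    if abs(len(samples_a) - len(samples_b)) > tolerance:
        raise ValueError("audio lengths differ beyond the allowed tolerance")
    # S104: split both channels into the same number of non-overlapping frames.
    n_frames = min(len(samples_a), len(samples_b)) // frame_len
    frames_a = [samples_a[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
    frames_b = [samples_b[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
    return frames_a, frames_b
```

Framing both channels together keeps frame index j aligned across the two recordings, which the later inter-frame energy ratio relies on.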

The first identification unit 53 then extracts three feature quantities from each audio data: the number of autocorrelation coefficient peaks, the maximum value of the autocorrelation coefficient peaks, and the spectral entropy (step S105). The first identification unit 53 then calculates the mean and standard deviation of each of the three feature quantities extracted for each audio data (step S106).
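The three per-frame features named in step S105 could be computed roughly as below. This is a sketch under assumptions, not the patented procedure: the normalization of the autocorrelation, the definition of a peak as a strict local maximum over lags of 1 or more, and the use of a one-sided power spectrum with base-2 entropy are choices made here for illustration.

```python
import cmath
import math


def autocorrelation(frame):
    """Normalized autocorrelation coefficients for lags 0..len(frame)-1."""
    n = len(frame)
    mean = sum(frame) / n
    centered = [x - mean for x in frame]
    energy = sum(x * x for x in centered)
    if energy == 0.0:
        return [0.0] * n
    return [sum(centered[i] * centered[i + lag] for i in range(n - lag)) / energy
            for lag in range(n)]


def autocorr_peaks(frame):
    """Return (number of local maxima, largest local maximum) over lags >= 1."""
    r = autocorrelation(frame)
    peaks = [r[k] for k in range(1, len(r) - 1) if r[k - 1] < r[k] > r[k + 1]]
    return len(peaks), (max(peaks) if peaks else 0.0)


def spectral_entropy(frame):
    """Shannon entropy in bits of the normalized one-sided power spectrum."""
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):
        # Direct DFT; adequate for short frames in a sketch.
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(s) ** 2)
    total = sum(power)
    if total == 0.0:
        return 0.0
    return -sum((p / total) * math.log2(p / total) for p in power if p > 0.0)
```

A periodic, voiced-like frame tends to give strong autocorrelation peaks and low spectral entropy, while a noise-like frame tends to give the opposite, which is why these features are plausible inputs for separating voiced from unvoiced sound.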

Subsequently, the first identification unit 53 sets the variable N to 0 (step S107) and sets an initial state transition probability P_t for the state transitions between voiced and unvoiced sound in the hidden Markov model (step S108).

The first identification unit 53 then increments the value of the variable N by one (step S109). If the value of N is not 5 or more (No at step S110), the first identification unit 53 takes the three feature quantities extracted for each audio data, together with the mean and standard deviation of each feature quantity, as observations, calculates the state transition probability P_t using the EM method (step S111), and increments the value of N by one again.

On the other hand, if the value of the variable N is 5 or more (Yes at step S110), the first identification unit 53 takes the three feature quantities extracted for each audio data, together with the mean and standard deviation of each feature quantity, as observations and calculates the state transition probability P_t using the EM method (step S112).

The first identification unit 53 then takes the three feature quantities extracted for each audio data, together with the mean and standard deviation of each feature quantity, as observations and calculates the observation probability P_o using the Viterbi algorithm (step S113).

Thereafter, based on the three feature quantities extracted for each audio data, the first identification unit 53 uses the Viterbi algorithm to identify whether the sound uttered in each frame in which an utterance is being made is voiced or unvoiced. The first identification unit 53 then treats regions in which voiced sound is detected as voiced sound regions and regions in which unvoiced sound is detected as unvoiced sound regions (step S114).
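The Viterbi decoding used in step S114 can be illustrated with a generic log-domain implementation for a two-state model. This is a sketch only: the per-frame emission probabilities, the transition probabilities of 0.8 and 0.2, and the state numbering (0 for unvoiced, 1 for voiced) are made-up values for illustration, and the text does not say how the device derives emissions from the three features.

```python
import math


def viterbi(log_emit, log_trans, log_init):
    """Most likely state path; all arguments hold log-probabilities.

    log_emit[t][s]: frame t given state s; log_trans[p][s]: move from p to s.
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(log_emit)):
        prev = score
        # Best predecessor for each state at frame t.
        pointers = [max(range(n_states), key=lambda p: prev[p] + log_trans[p][s])
                    for s in range(n_states)]
        score = [prev[pointers[s]] + log_trans[pointers[s]][s] + log_emit[t][s]
                 for s in range(n_states)]
        back.append(pointers)
    # Trace the best final state back to the first frame.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical per-frame probabilities that a frame is voiced.
p_voiced = [0.9, 0.8, 0.4, 0.9, 0.1, 0.2]
log_emit = [[math.log(1.0 - p), math.log(p)] for p in p_voiced]
log_trans = [[math.log(0.8), math.log(0.2)], [math.log(0.2), math.log(0.8)]]
log_init = [math.log(0.5), math.log(0.5)]
path = viterbi(log_emit, log_trans, log_init)  # -> [1, 1, 1, 1, 0, 0]
```

A frame-by-frame decision would label the third frame unvoiced (0.4 < 0.5), whereas the decoded path keeps it voiced because the transition probabilities penalize a brief switch between states.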

Here, the determination unit 54 executes a "determination process" (step S115): it models the energy ratio of the audio data of the two speakers with a Gaussian mixture distribution, and then determines whether to validate or invalidate the identification result of voiced sound V according to the distribution to which the inter-frame energy ratio belongs.

Thereafter, the second identification unit 55 uses the Viterbi algorithm to detect, from the voiced and unvoiced identification results that the determination unit 54 has validated or invalidated, whether the state is a silence state or an utterance state, thereby identifying silence regions and utterance regions (step S116).

Subsequently, as shown in FIG. 9, the extraction unit 13 calculates, from the frames identified as spoken by a certain speaker, the number of voiced sound regions, the average length of the voiced sound regions, and the standard deviation of the length of the voiced sound regions (step S117).

Further, the extraction unit 13 calculates, from the frames identified as spoken by a certain speaker, the number of utterance regions, the average length of the utterance regions, and the standard deviation of the length of the utterance regions (step S118). The extraction unit 13 then calculates, from the frames of a certain speaker's silence regions, the number of silence regions, the average length of the silence regions, and the standard deviation of the length of the silence regions (step S119).
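The region statistics of steps S117 to S119 all follow the same pattern: find runs of consecutive frames with one label and summarize the run lengths. A minimal sketch, assuming a per-frame label sequence as input and the population standard deviation (the text does not say whether the population or sample form is intended); the label characters and the `frame_sec` parameter are hypothetical.

```python
import statistics


def region_stats(labels, target, frame_sec=1.0):
    """Return (count, mean length, std of length) of runs of `target` in `labels`."""
    lengths, run = [], 0
    for label in labels:
        if label == target:
            run += 1
        elif run:
            lengths.append(run * frame_sec)  # a run just ended
            run = 0
    if run:
        lengths.append(run * frame_sec)      # run reaching the last frame
    if not lengths:
        return 0, 0.0, 0.0
    return len(lengths), statistics.mean(lengths), statistics.pstdev(lengths)


# Hypothetical labels: "S" = utterance frame, "U" = silence frame.
count, mean_len, std_len = region_stats(list("SSUUUSSSSU"), "S")
```

Calling the same function with `target="U"` would give the silence-region statistics of step S119.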

The extraction unit 13 then calculates the ratio of a certain speaker's utterance time to the length of the entire conversation (step S120). Further, the extraction unit 13 calculates the ratio of the certain speaker's utterance time to the other speaker's utterance time (step S121).

Subsequently, the extraction unit 13 calculates the standard deviation of the volume and the standard deviation of the spectral entropy from the frames identified as spoken by a certain speaker (step S122). The extraction unit 13 then calculates, as the degree of change, the sum of the standard deviation of the volume and the standard deviation of the spectral entropy calculated from those frames (step S123).
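Steps S120 to S123 reduce to a few arithmetic operations, sketched below. The function and parameter names are illustrative assumptions, as is the use of the population standard deviation and the handling of a silent second speaker.

```python
import statistics


def conversation_ratios(speech_sec_a, speech_sec_b, total_sec):
    """Share of the conversation spoken by A (S120) and R_t of A relative to B (S121)."""
    share = speech_sec_a / total_sec
    # Assumption: report infinity when the other speaker never spoke.
    ratio_rt = speech_sec_a / speech_sec_b if speech_sec_b > 0 else float("inf")
    return share, ratio_rt


def degree_of_change(volumes, entropies):
    """Sum of the volume std and the spectral entropy std (S122, S123)."""
    return statistics.pstdev(volumes) + statistics.pstdev(entropies)


# Hypothetical values: A spoke 30 s, B spoke 60 s, in a 120 s conversation.
share_a, rt_a = conversation_ratios(30.0, 60.0, 120.0)
```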

The processing of steps S117 to S123 is repeated until the conversation characteristics of all speakers, that is, speaker A and speaker B, have been extracted (No at step S124). Once the conversation characteristics of all speakers have been extracted (Yes at step S124), the analysis unit 14 analyzes the conversation style based on the conversation characteristics extracted by the extraction unit 13 (step S125). Finally, the analysis unit 14 outputs the analysis result to a predetermined output destination device (step S126) and ends the process.

(2) Determination Process
FIG. 10 is a flowchart illustrating the procedure of the determination process according to the first embodiment. This determination process corresponds to step S115 shown in FIG. 8 and starts after the voiced sound regions and unvoiced sound regions have been identified.

As shown in FIG. 10, the determination unit 54 calculates the energy ratio between corresponding frames of the first audio data 12A and the second audio data 12B, targeting frames for which at least one of the two audio data was identified as voiced sound V (step S301). The determination unit 54 then calculates the logarithm X_j of the energy ratio calculated in step S301 (step S302).
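Steps S301 and S302 might look like the following sketch. Several details are assumptions not fixed by the text: frame energy is taken as the sum of squared samples, the ratio is first channel over second channel, the logarithm is natural, and a small `eps` guards against division by zero.

```python
import math


def log_energy_ratios(frames_a, frames_b, voiced_a, voiced_b, eps=1e-12):
    """Return {frame index j: log(E_a / E_b)} for frames voiced in either channel."""
    ratios = {}
    for j, (fa, fb) in enumerate(zip(frames_a, frames_b)):
        if not (voiced_a[j] or voiced_b[j]):
            continue  # S301 targets only frames with a voiced result
        energy_a = sum(x * x for x in fa) + eps
        energy_b = sum(x * x for x in fb) + eps
        ratios[j] = math.log(energy_a / energy_b)  # S302
    return ratios
```

Working with the logarithm makes the ratio symmetric around zero: a frame dominated by one speaker and a frame dominated by the other map to values of opposite sign.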

The determination unit 54 then sets initial values for the various items used in the EM method (step S303). For example, the determination unit 54 sorts the logarithms X_j of the energy ratio calculated for each frame into ascending order, obtaining the sorted logarithms X_{j_sorted}. The determination unit 54 then obtains the initial value of the posterior probability ρ_ij by generating a matrix from the sorted logarithms X_{j_sorted}.

Subsequently, the determination unit 54 supplies the sorted logarithms X_{j_sorted} of the energy ratio and the initial value of the posterior probability ρ_ij set in step S303 to the M step described below (step S304).

Then, using equations (2) to (4) above, the determination unit 54 executes the M step, which updates the parameters ρ_i, μ_i, and σ_i that define the first, second, or third distribution by calculating them (step S305). On the first pass, immediately after the initial values are calculated, the sorted logarithms X_{j_sorted} of the energy ratio and the initial value of the posterior probability ρ_ij are used in the calculation.

Subsequently, the determination unit 54 executes the E step, which substitutes the parameters ρ_i, μ_i, and σ_i calculated in the M step into equations (5) to (7) above to calculate the probability density N(x_j; μ_i, σ_i), the likelihood f(x_j) of the mixture, and the posterior probability ρ_ij (step S306).

Until the M step and the E step have been repeated a predetermined number of times (No at step S307), the determination unit 54 supplies the posterior probability ρ_ij updated in the E step and the logarithm X_j of the energy ratio to the M step (step S308) and executes the M step and the E step again.

Thereafter, once the M step and the E step have been executed the predetermined number of times (Yes at step S307), the determination unit 54 uses the posterior probability ρ_ij calculated by the EM method to determine whether to validate or invalidate the identification result of each frame identified as voiced sound V by the first identification unit 53 (step S309), and ends the process.
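The loop of steps S303 to S308 has the shape of textbook EM for a one-dimensional Gaussian mixture, sketched below. Equations (2) to (7) are not reproduced in this part of the description, so the standard update formulas are used here as an assumption. The initialization (hard-assigning the sorted values to three contiguous blocks), the iteration count, and the `min_sigma` floor are likewise hypothetical choices.

```python
import math


def fit_gmm_em(x, k=3, n_iter=20, min_sigma=1e-3):
    """Fit a 1-D Gaussian mixture by EM, running the M step before the E step."""
    xs = sorted(x)
    n = len(xs)
    # Initial posteriors: one-hot assignment of sorted values to k blocks (assumed).
    post = [[1.0 if j * k // n == i else 0.0 for j in range(n)] for i in range(k)]
    for _ in range(n_iter):
        # M step: update weight, mean, and standard deviation of each component.
        weights, means, sigmas = [], [], []
        for i in range(k):
            mass = sum(post[i]) or 1e-12
            mu = sum(p * v for p, v in zip(post[i], xs)) / mass
            var = sum(p * (v - mu) ** 2 for p, v in zip(post[i], xs)) / mass
            weights.append(mass / n)
            means.append(mu)
            sigmas.append(max(math.sqrt(var), min_sigma))
        # E step: recompute the posterior of each component for each value.
        for j, v in enumerate(xs):
            dens = [weights[i] * math.exp(-0.5 * ((v - means[i]) / sigmas[i]) ** 2)
                    / (sigmas[i] * math.sqrt(2.0 * math.pi)) for i in range(k)]
            total = sum(dens) or 1e-300
            for i in range(k):
                post[i][j] = dens[i] / total
    return weights, means, sigmas


def most_likely_component(v, weights, means, sigmas):
    """Index of the component with the highest weighted log-density at v."""
    scores = [math.log(w) - math.log(s) - 0.5 * ((v - m) / s) ** 2
              for w, m, s in zip(weights, means, sigmas)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical log energy ratios forming three clusters.
data = [-3.1, -3.0, -2.9, -0.1, 0.0, 0.1, 2.9, 3.0, 3.1]
weights, means, sigmas = fit_gmm_em(data)
```

Which component corresponds to which speaker depends on the direction of the energy ratio, so any mapping from component index to "first speaker", "both speakers", or "second speaker" would need to be fixed separately.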

As described above, the speaker discrimination device 50 according to this embodiment models the energy ratio of the audio data of the two speakers with a Gaussian mixture distribution and then validates or invalidates the identification result of voiced sound V according to the distribution to which the inter-frame energy ratio belongs. The speaker discrimination device 50 then identifies the utterance regions and silence regions of the two audio data from the identification results of the frames after validation or invalidation. The speaker discrimination device 50 can therefore discriminate the speaker without making a threshold-based determination between frames of the same section of each audio data. In addition, unlike the conventional techniques described above, the speaker discrimination device 50 requires neither prior learning nor a complex algorithm for speaker discrimination. The speaker discrimination device 50 according to this embodiment can therefore discriminate speakers easily and accurately.

Furthermore, the speaker discrimination device 50 according to this embodiment maintains the identification result of a frame identified as voiced sound V when the inter-frame energy ratio is estimated to belong to the second distribution. The speaker discrimination device 50 can therefore also discriminate simultaneous utterances even when the two speakers speak at different volumes.

Embodiments of the disclosed device have been described so far, but the present invention may be implemented in various forms other than the embodiments described above. Other embodiments included in the present invention are therefore described below.

[Distribution and Integration]
The components of each illustrated device need not be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to that shown in the figures, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the speaker discrimination device 50, the extraction unit 13, or the analysis unit 14 may be connected via a network as an external device of the conversation analysis device 10. Alternatively, separate devices may each include the speaker discrimination device 50, the extraction unit 13, or the analysis unit 14, and the functions of the speaker discrimination device described above may be realized by these devices cooperating over a network connection.

[Speaker Discrimination Program]
The various processes described in the above embodiments can be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. An example of a computer that executes a speaker discrimination program having the same functions as the above embodiments is therefore described below with reference to FIG. 11.

FIG. 11 is a diagram for explaining an example of a computer that executes the speaker discrimination program according to the first and second embodiments. As shown in FIG. 11, the computer 100 includes an operation unit 110a, a speaker 110b, a microphone 110c, a display 120, and a communication unit 130. The computer 100 further includes a CPU 150, a ROM 160, an HDD 170, and a RAM 180. These units 110 to 180 are connected via a bus 140.

As shown in FIG. 11, the HDD 170 stores in advance a speaker discrimination program 170a that provides the same functions as the acquisition unit 51, the framing unit 52, the first identification unit 53, the determination unit 54, and the second identification unit 55 described in the first embodiment. Like the acquisition unit 51, the framing unit 52, the first identification unit 53, the determination unit 54, and the second identification unit 55 shown in FIG. 1, the speaker discrimination program 170a may be integrated or separated as appropriate. That is, not all of the data need be stored in the HDD 170 at all times; it suffices that only the data required for processing is stored in the HDD 170.

The CPU 150 then reads the speaker discrimination program 170a from the HDD 170 and loads it into the RAM 180. As a result, as shown in FIG. 11, the speaker discrimination program 170a functions as a speaker discrimination process 180a. The speaker discrimination process 180a loads various data read from the HDD 170 into the area allocated to it on the RAM 180 as appropriate, and executes various processes based on the loaded data. The speaker discrimination process 180a includes the processing executed by the acquisition unit 51, the framing unit 52, the first identification unit 53, the determination unit 54, and the second identification unit 55 shown in FIG. 1, for example the processing shown in FIG. 10. Not all of the processing units virtually realized on the CPU 150 need to operate on the CPU 150 at all times; it suffices that only the processing units required for processing are virtually realized.

The speaker discrimination program 170a need not be stored in the HDD 170 or the ROM 160 from the beginning. For example, each program may be stored on a "portable physical medium" inserted into the computer 100, such as a flexible disk (so-called FD), a CD-ROM, a DVD disk, a magneto-optical disk, or an IC card. The computer 100 may then acquire each program from the portable physical medium and execute it. Alternatively, each program may be stored on another computer or a server device connected to the computer 100 via a public line, the Internet, a LAN, a WAN, or the like, and the computer 100 may acquire each program from there and execute it.

Description of Symbols
10 Conversation analysis device
11 Voice storage unit
12A First audio data
12B Second audio data
30A, 30B Close-talking microphone
31 Registration unit
50 Speaker discrimination device
51 Acquisition unit
52 Framing unit
53 First identification unit
54 Determination unit
55 Second identification unit

Claims

A speaker discrimination device comprising:
an acquisition unit that acquires two sets of audio data from microphones arranged respectively for two speakers;
a framing unit that divides each of the two sets of audio data acquired by the acquisition unit into frames of a predetermined section;
a first identification unit that identifies, based on a first probability model, whether each frame produced by the framing unit is a voiced sound region or an unvoiced sound region;
a determination unit that models the energy ratio of the two sets of audio data as a model in which a plurality of probability distributions are mixed, and determines whether to validate or invalidate the identification result of a frame identified as a voiced sound region by the first identification unit according to which of the plurality of probability distributions the energy ratio between the frames belongs to; and
a second identification unit that identifies, based on a second probability model, utterance regions and silence regions in the two sets of audio data from the identification results of the frames after the determination unit has determined validity or invalidity.

The speaker discrimination device according to claim 1, wherein the determination unit
models the energy ratio of the two sets of audio data, according to its magnitude, as three distributions, namely a first distribution in which a first speaker of the two speakers spoke, a second distribution in which the first speaker and a second speaker different from the first speaker spoke simultaneously, and a third distribution in which the second speaker spoke, and determines whether to validate or invalidate the identification result of a frame identified as the voiced sound region according to whether the energy ratio between the frames belongs to the first distribution, or to the second distribution or the third distribution.

A speaker discrimination program that causes a computer to execute a process comprising:
acquiring two sets of audio data from microphones arranged respectively for two speakers;
dividing each of the two acquired sets of audio data into frames of a predetermined section;
identifying, based on a first probability model, whether each frame is a voiced sound region or an unvoiced sound region;
modeling the energy ratio of the two sets of audio data as a model in which a plurality of probability distributions are mixed, and determining whether to validate or invalidate the identification result of a frame identified as the voiced sound region according to which of the plurality of probability distributions the energy ratio between the frames belongs to; and
identifying, based on a second probability model, utterance regions and silence regions in the two sets of audio data from the identification results of the frames after validity or invalidity has been determined.

A speaker discrimination method in which a computer executes a process comprising:
acquiring two sets of audio data from microphones arranged respectively for two speakers;
dividing each of the two acquired sets of audio data into frames of a predetermined section;
identifying, based on a first probability model, whether each frame is a voiced sound region or an unvoiced sound region;
modeling the energy ratio of the two sets of audio data as a model in which a plurality of probability distributions are mixed, and determining whether to validate or invalidate the identification result of a frame identified as the voiced sound region according to which of the plurality of probability distributions the energy ratio between the frames belongs to; and
identifying, based on a second probability model, utterance regions and silence regions in the two sets of audio data from the identification results of the frames after validity or invalidity has been determined.