JP2012185394A

JP2012185394A - Analysis device, analysis program and analysis method

Info

Publication number: JP2012185394A
Application number: JP2011049476A
Authority: JP
Inventors: Gei Cho; 霓張
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2011-03-07
Filing date: 2011-03-07
Publication date: 2012-09-27
Anticipated expiration: 2031-03-07
Also published as: JP5678732B2

Abstract

PROBLEM TO BE SOLVED: To more easily analyze conversation style. SOLUTION: An analysis device 10 comprises: an acquisition section 14a; a first detection section 14b; a second detection section 14c; an extraction section 14e; and an analysis section 14f. The acquisition section 14a acquires voice data. The first detection section 14b detects a voiced region and a voiceless region from the acquired voice data using a first probability model. The second detection section 14c detects a speech region and a silent region in the voice data based on the detected voiced region and the voiceless region using a second probability model. The extraction section 14e extracts conversation characteristics of the detected speech region and the silent region. The analysis section 14f analyzes a conversation style based on the extracted conversation characteristics.

Description

The present invention relates to an analysis device, an analysis program, and an analysis method.

Techniques are known for measuring the characteristics of each person's speech in a conversation among multiple people and for analyzing conversation styles, dialogue patterns, and the like. Techniques are also known for feeding the analysis results back to people who work in call centers, sales, or marketing, so that they can improve their own work.

Some conventional techniques identify the content of a conversation by identifying the phonemes of the speech uttered by a speaker. Such techniques first identify the content of the conversation and then analyze the speaker's conversation style using sound pressure data and the like.

Japanese Unexamined Patent Application Publication No. 2006-113546 (JP 2006-113546 A)

However, with the conventional technique described above, the content of the conversation must be identified before the speaker's conversation style can be analyzed, and this processing takes time. The conventional technique therefore has the problem that the conversation style cannot be analyzed easily.

The disclosed technology has been made in view of the above, and its object is to provide an analysis device, an analysis program, and an analysis method that can analyze a conversation style more easily.

In one aspect, the analysis device disclosed in the present application includes an acquisition unit, a first detection unit, a second detection unit, an extraction unit, and an analysis unit. The acquisition unit acquires audio data. The first detection unit detects voiced sound regions and unvoiced sound regions from the audio data acquired by the acquisition unit, using a first probability model. The second detection unit detects speech regions and silence regions in the audio data, using a second probability model, based on the voiced sound regions and unvoiced sound regions detected by the first detection unit. The extraction unit extracts conversation characteristics of the speech regions and silence regions detected by the second detection unit. The analysis unit analyzes the conversation style based on the conversation characteristics extracted by the extraction unit.

According to one aspect of the analysis device disclosed in the present application, a conversation style can be analyzed more easily.

FIG. 1 is a diagram illustrating the configuration of the analysis device according to the first embodiment.
FIG. 2 is a diagram for explaining an example of voiced sound and unvoiced sound.
FIG. 3 is a diagram for explaining an example of a speech region, in which speech occurs, and a silence region, in which no speech occurs.
FIG. 4 is a diagram illustrating an example of a state transition diagram of a hidden Markov model.
FIG. 5 is a diagram illustrating an example of a state transition diagram of a hidden Markov model.
FIG. 6 is a diagram for explaining an example of a conversation style analysis method.
FIG. 7 is a diagram for explaining an example of a conversation style analysis method.
FIG. 8 is a diagram for explaining an example of a conversation style analysis method.
FIG. 9 is a diagram illustrating an example of the conversation characteristics extracted when person A converses with each of persons B, C, D, and E.
FIG. 10 is a flowchart illustrating the procedure of the analysis processing according to the first embodiment.
FIG. 11 is a flowchart illustrating the procedure of the analysis processing according to the first embodiment.
FIG. 12 is a flowchart illustrating the procedure of the analysis processing according to the first embodiment.
FIG. 13 is a diagram illustrating a computer that executes the analysis program.

Embodiments of the analysis device, analysis program, and analysis method disclosed in the present application are described in detail below with reference to the drawings. The embodiments do not limit the disclosed technology, and they can be combined as appropriate to the extent that the processing contents do not contradict one another.

[Configuration of the analysis device]
The analysis device according to the first embodiment will be described. FIG. 1 is a diagram illustrating the configuration of the analysis device according to the first embodiment. The analysis device 10 according to the present embodiment acquires audio data and detects voiced sound regions and unvoiced sound regions from the acquired audio data using a first probability model. Based on the detected voiced sound regions and unvoiced sound regions, the analysis device 10 then detects speech regions and silence regions in the audio data using a second probability model. Finally, the analysis device 10 extracts the conversation characteristics of the detected speech regions and silence regions and analyzes the conversation style based on the extracted conversation characteristics. As illustrated in FIG. 1, the analysis device 10 includes an input unit 11, an output unit 12, a storage unit 13, and a control unit 14.

The input unit 11 receives user operations and transmits what it receives to the control unit 14. For example, when the input unit 11 receives an instruction to execute the analysis processing described later, it transmits this instruction to the control unit 14. When the input unit 11 receives the first audio data 13a and the second audio data 13b described later, it transmits these audio data to the control unit 14.

The output unit 12 outputs the information it receives. For example, the output unit 12 displays the analysis result from the analysis unit 14f described later. Examples of devices for the output unit 12 include display devices such as an LCD (Liquid Crystal Display) and a projector.

The storage unit 13 stores the various programs executed by the control unit 14. The storage unit 13 also stores the first audio data 13a and the second audio data 13b.

The first audio data 13a and the second audio data 13b will be described. The first audio data 13a is a conversation among multiple people, for example the two people A and B, converted into audio data by a microphone attached to A. The first audio data 13a contains the speech of both A and B, but A's voice is louder than B's, because A is closer to the microphone than B is. The second audio data 13b is the same conversation between A and B converted into audio data by a microphone attached to B. The second audio data 13b also contains the speech of both A and B, but B's voice is louder than A's, because B is closer to the microphone than A is. When A and B converse over mobile phones, for example, the first audio data 13a and the second audio data 13b can be acquired by providing a microphone in each mobile phone.

Features that are common to any language, such as Japanese, English, or Chinese, are described here. FIG. 2 is a diagram for explaining an example of voiced sound and unvoiced sound. The example of FIG. 2 shows audio data acquired with a close-talking microphone at a sampling frequency of 16 kHz. In FIG. 2, the horizontal axis indicates time, the vertical axis indicates frequency, and the shading indicates the magnitude of the spectral entropy. As shown in FIG. 2, the voiced sound V is the portion of the audio data in which the spectral entropy changes greatly and whose frequency is lower than that of the unvoiced sound U. Here, the voiced sounds are the vowels "a", "i", "u", "e", and "o".

The unvoiced sound U is the portion of the audio data whose frequency is higher than that of the voiced sound V. Here, the unvoiced sounds are sounds other than vowels, for example "s", "p", and "h".

FIG. 3 is a diagram for explaining an example of a speech region, in which speech occurs, and a silence region, in which no speech occurs. A speech region contains unvoiced sound regions and voiced sound regions. The example of FIG. 3 shows the case where the utterance is "WaTaShiWaChouDeSu". In this example, the first speech region contains the unvoiced sound "W", voiced sound "a", unvoiced sound "T", voiced sound "a", unvoiced sound "Sh", voiced sound "i", unvoiced sound "W", and voiced sound "a". The second speech region contains the unvoiced sound "Ch" and the voiced sound "ou". The third speech region contains the unvoiced sound "D", voiced sound "e", unvoiced sound "S", and voiced sound "u". FIG. 3 also shows that silence regions exist between the "WaTaShiWa" speech region and the "Chou" speech region, and between the "Chou" speech region and the "DeSu" speech region.

The storage unit 13 is, for example, a semiconductor memory element such as a flash memory, or a storage device such as a hard disk or an optical disk. The storage unit 13 is not limited to these types of storage device and may be a RAM (Random Access Memory) or a ROM (Read Only Memory).

Returning to the description of FIG. 1, the control unit 14 has an internal memory for storing programs that define various processing procedures and control data, and executes various processes using them. As shown in FIG. 1, the control unit 14 includes an acquisition unit 14a, a first detection unit 14b, a second detection unit 14c, a specifying unit 14d, an extraction unit 14e, and an analysis unit 14f.

The acquisition unit 14a acquires audio data. For example, the acquisition unit 14a acquires the first audio data 13a and the second audio data 13b. The acquisition unit 14a can also acquire, from the input unit 11, the first audio data 13a and the second audio data 13b received by the input unit 11.

The first detection unit 14b detects voiced sound regions and unvoiced sound regions from the audio data acquired by the acquisition unit 14a, using the first probability model. For example, the first detection unit 14b first compares the lengths of the first audio data 13a and the second audio data 13b. If the difference between the two lengths is outside the allowable error range, the data are not suitable for the subsequent processing, so the first detection unit 14b controls the output unit 12 to output an error and performs no further processing. If the two lengths are the same, or their difference is within the allowable error range, the first detection unit 14b performs the subsequent processing. That is, the first detection unit 14b divides the first audio data 13a and the second audio data 13b into frames. As a specific example, the first detection unit 14b uses formulas (1) and (2) below to divide each piece of audio data into frames 256 ms long, with consecutive frames overlapping by 128 ms.
S = floor(Y / X) ... Formula (1)
m = floor((S - 256) / 128) + 1 ... Formula (2)
Here, floor(x) is a function that returns the largest integer less than or equal to x, Y is the data amount (bytes) of each of the first audio data 13a and the second audio data 13b, and X is the length (ms) corresponding to 1 byte of data.
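As a rough illustration of formula (2), the following Python sketch counts the frames for a recording whose total length S (in ms) is already known, and lists the frame boundaries. The function names `frame_count` and `frame_bounds` are hypothetical and not from the source. The sketch assumes that the 128 in formula (2) is the step between frame starts (frame length minus overlap, which here also equals the overlap), and it returns 0 when the data is shorter than one frame, a case the text does not address.

```python
def frame_count(total_ms: int, frame_ms: int = 256, hop_ms: int = 128) -> int:
    """Formula (2): m = floor((S - frame_ms) / hop_ms) + 1.

    Assumption (not in the source): return 0 if S is shorter than one frame.
    """
    if total_ms < frame_ms:
        return 0
    return (total_ms - frame_ms) // hop_ms + 1


def frame_bounds(total_ms: int, frame_ms: int = 256, hop_ms: int = 128) -> list:
    """Return (start_ms, end_ms) for each of the m frames."""
    m = frame_count(total_ms, frame_ms, hop_ms)
    return [(i * hop_ms, i * hop_ms + frame_ms) for i in range(m)]


if __name__ == "__main__":
    print(frame_count(1000))   # number of 256 ms frames in 1000 ms of audio
    print(frame_bounds(1000))  # start and end of each frame in ms
```

With S = 1000 ms the last frame ends at 896 ms, so the trailing 104 ms is not covered by any frame; this is a direct consequence of the floor in formula (2).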

The description below assumes that this processing yields m frames for each of the first audio data 13a and the second audio data 13b. In the following description, the m frames obtained from the first audio data 13a may be written as "first frame (1)", "first frame (2)", ..., "first frame (m)". Similarly, the m frames obtained from the second audio data 13b may be written as "second frame (1)", "second frame (2)", ..., "second frame (m)". The frame length and the length of the overlap between consecutive frames given above are examples, and any values can be used.

The first detection unit 14b then performs the following processing on every frame, that is, first frame (1) through first frame (m) and second frame (1) through second frame (m). For each frame, the first detection unit 14b extracts three features: the number of peaks of the autocorrelation coefficient, the maximum value of the peaks of the autocorrelation coefficient, and the spectral entropy.
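The three per-frame features can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the source does not define how a "peak" of the autocorrelation coefficient is detected or how the spectrum is normalized, so the local-maximum rule, the exclusion of non-positive peaks, the normalization by the lag-0 value, and the naive DFT over bins 1 to n/2 are all choices made here. The function names are hypothetical.

```python
import cmath
import math


def autocorrelation(frame):
    """Autocorrelation normalized by the lag-0 value (assumed definition)."""
    n = len(frame)
    mean = sum(frame) / n
    x = [v - mean for v in frame]
    r0 = sum(v * v for v in x)
    if r0 == 0:
        return [0.0] * n
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) / r0
            for lag in range(n)]


def autocorr_peak_features(frame):
    """Return (number of peaks, maximum peak value) of the autocorrelation.

    A peak is taken to be a positive local maximum at a lag >= 1 (assumption).
    """
    r = autocorrelation(frame)
    peaks = [r[k] for k in range(1, len(r) - 1)
             if r[k] > 0 and r[k] > r[k - 1] and r[k] >= r[k + 1]]
    return len(peaks), (max(peaks) if peaks else 0.0)


def spectral_entropy(frame):
    """Shannon entropy (natural log) of the normalized power spectrum."""
    n = len(frame)
    power = []
    for k in range(1, n // 2 + 1):  # skip the DC bin
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
        power.append(abs(s) ** 2)
    total = sum(power)
    if total == 0:
        return 0.0
    return -sum((p / total) * math.log(p / total) for p in power if p > 0)
```

A periodic frame such as a pure tone gives a few strong autocorrelation peaks and a spectral entropy near zero, whereas a noise-like frame spreads its power over many bins and gives a high entropy, which is why these features help separate the two kinds of sound.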

The first detection unit 14b then calculates the mean and standard deviation of each of the three features extracted for all frames. Using a hidden Markov model (HMM), which is a probability model, the first detection unit 14b then detects the voiced sound regions and the unvoiced sound regions.

A specific example of the method for detecting voiced sound regions and unvoiced sound regions will be described. FIG. 4 is a diagram illustrating an example of a state transition diagram of the hidden Markov model. In the example of FIG. 4, the first detection unit 14b calculates the state transition probability P_t by the EM (Expectation-Maximization) algorithm, using as observations the three features extracted for all frames together with the mean and standard deviation of each feature. Here, the state transition probability P_t consists of, for example, the probability of remaining in the voiced state, the probability of transitioning from the voiced state to the unvoiced state, the probability of remaining in the unvoiced state, and the probability of transitioning from the unvoiced state to the voiced state. In the example of FIG. 4, an utterance is considered equally likely to begin with a voiced sound or an unvoiced sound, so the probabilities of the voiced and unvoiced states at the start of the utterance are both 0.5. As the initial state transition probability P_t, the probability of remaining in the voiced state is set to 0.95 and the probability of transitioning from the voiced state to the unvoiced state is set to 0.05. Likewise, the probability of remaining in the unvoiced state is set to 0.95 and the probability of transitioning from the unvoiced state to the voiced state is set to 0.05. The first detection unit 14b repeats the recalculation of the state transition probability P_t a predetermined number of times, which yields a highly accurate state transition probability P_t.
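The repeated re-estimation of P_t can be illustrated with the standard Baum-Welch (EM) update for HMM transition probabilities. The Python sketch below is not the source's implementation. It assumes, purely for illustration, that each frame's features have already been quantized into a discrete symbol (0 or 1) and that the emission table `emit` is known and held fixed; the source states neither. The function name and all parameter names are hypothetical.

```python
def reestimate_transitions(obs, start, trans, emit, iterations=5):
    """Re-estimate HMM transition probabilities with scaled forward-backward.

    obs:   list of discrete symbols (assumed quantized features)
    start: initial state probabilities, e.g. [0.5, 0.5]
    trans: initial transition matrix, e.g. [[0.95, 0.05], [0.05, 0.95]]
    emit:  emit[state][symbol], held fixed here (assumption)
    """
    n = len(start)
    T = len(obs)
    for _ in range(iterations):
        # Forward pass with per-step scaling to avoid underflow.
        alpha, scale = [], []
        cur = [start[s] * emit[s][obs[0]] for s in range(n)]
        for t in range(T):
            if t > 0:
                cur = [emit[j][obs[t]] *
                       sum(alpha[t - 1][i] * trans[i][j] for i in range(n))
                       for j in range(n)]
            c = sum(cur)
            scale.append(c)
            alpha.append([v / c for v in cur])
        # Backward pass using the same scaling factors.
        beta = [[1.0] * n for _ in range(T)]
        for t in range(T - 2, -1, -1):
            beta[t] = [sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                           for j in range(n)) / scale[t + 1]
                       for i in range(n)]
        # E-step: expected number of i -> j transitions.
        xi = [[0.0] * n for _ in range(n)]
        for t in range(T - 1):
            for i in range(n):
                for j in range(n):
                    xi[i][j] += (alpha[t][i] * trans[i][j] *
                                 emit[j][obs[t + 1]] * beta[t + 1][j] /
                                 scale[t + 1])
        # M-step: normalize each row to obtain the new transition matrix.
        trans = [[xi[i][j] / sum(xi[i]) for j in range(n)] for i in range(n)]
    return trans
```

Starting from the initial values given for FIG. 4 (0.5 for each start state, 0.95 and 0.05 for the transitions), each iteration replaces the transition matrix with the expected transition frequencies under the current model.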

In the example of FIG. 4, the first detection unit 14b also calculates the observation probability P_o by the Viterbi algorithm, using as observations the three features extracted for all frames together with the mean and standard deviation of each feature. Here, the observation probability P_o consists of, for example, the probability of outputting "observed" from the voiced state, the probability of outputting "not observed" from the voiced state, the probability of outputting "observed" from the unvoiced state, and the probability of outputting "not observed" from the unvoiced state. The observation probability is also called the output probability (emission probability).

In the example of FIG. 4, once the state transition probability P_t and the observation probability P_o have been calculated, the first detection unit 14b performs the following processing with the Viterbi algorithm, based on the three features extracted for all frames. That is, for each frame in which speech occurs, the first detection unit 14b detects whether the uttered sound is a voiced sound or an unvoiced sound. The first detection unit 14b then treats the regions where voiced sound is detected as voiced sound regions and the regions where unvoiced sound is detected as unvoiced sound regions.

In this way, the analysis device 10 detects voiced sound regions and unvoiced sound regions using features that are robust to ambient noise: the number of peaks of the autocorrelation coefficient, the maximum value of the peaks of the autocorrelation coefficient, and the spectral entropy. The analysis device 10 can therefore suppress the loss of detection accuracy for voiced and unvoiced sound regions caused by ambient noise. In addition, because noise-robust features are used, fewer frames are needed when the first audio data 13a and the second audio data 13b are divided into frames. The analysis device 10 can therefore detect voiced sound regions and unvoiced sound regions with simpler processing.

The second detection unit 14c detects speech regions and silence regions in the audio data using a hidden Markov model, based on the voiced sound regions and unvoiced sound regions detected by the first detection unit 14b.

A specific example of the method for detecting speech regions and silence regions will be described. FIG. 5 is a diagram illustrating an example of a state transition diagram of the hidden Markov model. In the example of FIG. 5, the state transition probability P_t and the observation probability P_o are predetermined values. The state transition probability P_t consists of, for example, the probability of remaining in the silence state, the probability of transitioning from the silence state to the speech state, the probability of remaining in the speech state, and the probability of transitioning from the speech state to the silence state. In the example of FIG. 5, the probabilities of the silence state and the speech state at the start are both 0.5. As the state transition probability P_t, the probability of remaining in the silence state is set to 0.999 and the probability of transitioning from the silence state to the speech state is set to 0.001. Likewise, the probability of remaining in the speech state is set to 0.999 and the probability of transitioning from the speech state to the silence state is set to 0.001.

In the example of FIG. 5, the observation probability P_o consists of, for example, the probability that an unvoiced sound is detected in the silence state, the probability that a voiced sound is detected in the silence state, the probability that an unvoiced sound is detected in the speech state, and the probability that a voiced sound is detected in the speech state. As the observation probability P_o, the probability that an unvoiced sound is detected in the silence state is set to 0.99 and the probability that a voiced sound is detected in the silence state is set to 0.01. The probability that an unvoiced sound is detected in the speech state is set to 0.5, and the probability that a voiced sound is detected in the speech state is set to 0.5.

In the example of FIG. 5, the second detection unit 14c then uses the Viterbi algorithm to detect, for every frame, whether the frame is in the silence state or the speech state, based on the voiced and unvoiced sounds detected by the first detection unit 14b. The second detection unit 14c treats the regions in the silence state as silence regions and the regions in the speech state as speech regions.
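Because every parameter of the FIG. 5 model is given in the text, the decoding step can be sketched directly. The following Python code is a minimal log-domain Viterbi decoder over the two states, using the stated start, transition, and observation probabilities. The symbol names "U" and "V" for per-frame unvoiced and voiced detections, and the function name `viterbi`, are assumptions for this sketch.

```python
import math

STATES = ("silence", "speech")
START = {"silence": 0.5, "speech": 0.5}
TRANS = {
    "silence": {"silence": 0.999, "speech": 0.001},
    "speech": {"silence": 0.001, "speech": 0.999},
}
# Observation probabilities: "U" = unvoiced detected, "V" = voiced detected.
EMIT = {
    "silence": {"U": 0.99, "V": 0.01},
    "speech": {"U": 0.5, "V": 0.5},
}


def viterbi(observations):
    """Return the most likely silence/speech state for each frame."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
             for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s at this frame.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][obs]))
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

The small transition probability 0.001 makes state changes expensive, so a single voiced frame inside a long unvoiced run is decoded as silence, while a sustained mixture of voiced and unvoiced frames is decoded as one continuous speech region.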

In this way, the analysis device 10 detects silence regions and speech regions using a hidden Markov model. The analysis device 10 can therefore detect silence regions and speech regions accurately even when the utterances of the two speakers overlap.

When the volume of a voiced sound region within a speech region detected by the second detection unit 14c is equal to or greater than a threshold, the specifying unit 14d identifies the person closest to the audio acquisition device as the person who spoke in that speech region. When the volume of a voiced sound region within a speech region detected by the second detection unit 14c is less than the threshold, the specifying unit 14d identifies a person other than the person closest to the audio acquisition device as the person who spoke in that speech region.

For example, for the first audio data 13a, the specifying unit 14d calculates the average volume E_t1 of the frames detected as voiced sound regions within speech regions. Similarly, for the second audio data 13b, the specifying unit 14d calculates the average volume E_t2 of the frames detected as voiced sound regions within speech regions.

The specifying unit 14d then determines, for every frame of the first audio data 13a detected as a voiced sound region within a speech region, whether the volume is equal to or greater than a predetermined threshold. For a frame whose volume is equal to or greater than the threshold, the specifying unit 14d identifies the person closest to the microphone that acquired the first audio data 13a, for example A, as the person who spoke in that frame. This is because A is closer to the microphone than B, so A's voice is louder than B's in the first audio data 13a. For a frame whose volume is less than the threshold, the specifying unit 14d identifies a person other than the person closest to the microphone that acquired the first audio data 13a as the person who spoke in that frame. Examples of the threshold include 0.2 E_t1 and 0.5 E_t1, but the threshold is not limited to these, and any value that allows the speaker to be identified can be used.

The specifying unit 14d likewise determines, for every frame of the second audio data 13b detected as a voiced sound region within a speech region, whether the volume is equal to or greater than a predetermined threshold. For a frame whose volume is equal to or greater than the threshold, the specifying unit 14d identifies the person closest to the microphone that acquired the second audio data 13b, for example B, as the person who spoke in that frame. For a frame whose volume is less than the threshold, the specifying unit 14d identifies a person other than the person closest to the microphone that acquired the second audio data 13b as the person who spoke in that frame. Examples of the threshold include 0.2 E_t2 and 0.5 E_t2, but the threshold is not limited to these, and any value that allows the speaker to be identified can be used.

In other words, for each set of audio data, the specifying unit 14d identifies the known person closest to the corresponding microphone as the speaker in frames whose volume is equal to or greater than the threshold. When the number of people in the conversation is known to be two, the specifying unit 14d identifies the person other than the known person closest to the corresponding microphone as the speaker in frames whose volume is below the threshold. In the above example of identifying a person from the first audio data 13a, for a frame whose volume is below the predetermined threshold, the specifying unit 14d identifies person B, who is not the person closest to the microphone that acquired the first audio data 13a, as the speaker. Likewise, in the above example of identifying a person from the second audio data 13b, for a frame whose volume is below the predetermined threshold, the specifying unit 14d identifies person A, who is not the person closest to the microphone that acquired the second audio data 13b, as the speaker. The method by which the specifying unit 14d identifies a person is not limited to this, and various methods may be used.
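The two-person rule above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `identify_speakers`, its parameters, and the default factor of 0.2 are assumptions, and the reference level is taken to be the mean volume of the voiced frames in utterance regions of one recording (the quantity written E_t1 or E_t2 in the text).

```python
def identify_speakers(volumes, nearest, other, factor=0.2):
    """Label each voiced frame of one recording with a speaker.

    volumes: per-frame volume of the frames detected as voiced regions
             inside utterance regions (hypothetical input format).
    nearest: the known person closest to this recording's microphone.
    other:   the remaining participant (two-person conversation assumed).
    factor:  multiplier applied to the mean volume, e.g. 0.2 or 0.5.
    """
    if not volumes:
        return []
    mean_volume = sum(volumes) / len(volumes)  # plays the role of E_t1 / E_t2
    threshold = factor * mean_volume
    # Loud frames go to the nearest person, quiet frames to the other one.
    return [nearest if v >= threshold else other for v in volumes]


labels = identify_speakers([1.0, 0.1, 0.9, 0.05], nearest="A", other="B")
```

Running the same function on the second recording with `nearest="B"` and `other="A"` gives the labels from the other microphone's point of view.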

The extraction unit 14e extracts conversation characteristics from the audio data. For example, from the frames in which A was identified as the speaker, the extraction unit 14e calculates the number of voiced regions, the mean length of the voiced regions, and the standard deviation of the length of the voiced regions. From the same frames, the extraction unit 14e also calculates the number of utterance regions, the mean length of the utterance regions, and the standard deviation of the length of the utterance regions. From the frames of A's silence regions, the extraction unit 14e calculates the number of silence regions, the mean length of the silence regions, and the standard deviation of the length of the silence regions.

The extraction unit 14e also calculates the ratio of A's speaking time to the total duration of the conversation, using the sum of the lengths of A's utterance regions as A's speaking time. The extraction unit 14e further calculates the ratio of A's speaking time to B's speaking time. From the frames in which A was identified as the speaker, the extraction unit 14e calculates the standard deviation of the volume and the standard deviation of the spectral entropy, and calculates the sum of these two standard deviations as the degree of change.

Likewise, from the frames in which B was identified as the speaker, the extraction unit 14e calculates the number of voiced regions, the mean length of the voiced regions, and the standard deviation of the length of the voiced regions. From the same frames, the extraction unit 14e also calculates the number of utterance regions, the mean length of the utterance regions, and the standard deviation of the length of the utterance regions. From the frames of B's silence regions, the extraction unit 14e calculates the number of silence regions, the mean length of the silence regions, and the standard deviation of the length of the silence regions.

The extraction unit 14e also calculates the ratio of B's speaking time to the total duration of the conversation, using the sum of the lengths of B's utterance regions as B's speaking time. The extraction unit 14e further calculates the ratio of B's speaking time to A's speaking time. From the frames in which B was identified as the speaker, the extraction unit 14e calculates the standard deviation of the volume and the standard deviation of the spectral entropy, and calculates the sum of these two standard deviations as the degree of change.
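The speaking-time ratios and the degree of change described above reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which form of standard deviation is meant.

```python
from statistics import pstdev


def speaking_ratios(own_regions, other_regions, total_sec):
    """Return (share of the whole conversation, ratio R_t to the other speaker).

    own_regions / other_regions: utterance-region lengths in seconds.
    total_sec: duration of the whole conversation in seconds.
    """
    own_time = sum(own_regions)      # speaking time = sum of utterance regions
    other_time = sum(other_regions)
    return own_time / total_sec, own_time / other_time


def degree_of_change(volumes, entropies):
    """Sum of the standard deviations of volume and spectral entropy."""
    return pstdev(volumes) + pstdev(entropies)


share, r_t = speaking_ratios([2.0, 3.0], [10.0], total_sec=20.0)
```

Both functions assume non-empty inputs and a non-zero speaking time for the other speaker; a production version would need to guard those cases.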

The conversation characteristics calculated in this way serve as the following indicators. The number of voiced regions, the mean voiced-region length, and the standard deviation of the voiced-region length indicate how long the voiced sounds are. The number of utterance regions, the mean utterance-region length, and the standard deviation of the utterance-region length indicate whether the person consistently speaks at length in the conversation or speaks only a little. The number of silence regions, the mean silence-region length, and the standard deviation of the silence-region length indicate whether the speaker talks continuously for long stretches or talks with many interruptions (silences). The ratio of a person's speaking time to the total duration of the conversation and the ratio R_t of a person's speaking time to another person's speaking time indicate the degree of participation in the conversation. The standard deviation of the volume, the standard deviation of the spectral entropy, and the degree of change indicate whether the speaker is a passionate speaker whose emotions change sharply or a quiet speaker whose emotions change little.

The analysis unit 14f analyzes the conversation style based on the extracted conversation characteristics. Specific examples follow. When the ratio R_t of a certain person's speaking time to another person's speaking time is equal to or greater than a predetermined value, for example 1.5, the analysis unit 14f analyzes that this person talks a lot in the conversation. When the ratio R_t is equal to or less than a predetermined value, for example 0.66, the analysis unit 14f analyzes that this person does not talk much in the conversation and is a so-called listener. When the ratio R_t is greater than a predetermined value, for example 0.66, and less than 1.5, the analysis unit 14f analyzes that the two people participate in the conversation on an equal footing.
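The R_t rule above can be written as a small classifier. The function name and the returned labels are hypothetical; the thresholds 1.5 and 0.66 are the example values given in the text.

```python
def classify_participation(r_t, high=1.5, low=0.66):
    """Classify a speaker from R_t, the ratio of their speaking time
    to the other person's speaking time."""
    if r_t >= high:
        return "talkative"
    if r_t <= low:
        return "listener"
    return "equal"  # low < r_t < high


# Shares of total conversation time for two speakers: A 25.72 %, B 43.71 %.
style_b = classify_participation(43.71 / 25.72)  # B relative to A
style_a = classify_participation(25.72 / 43.71)  # A relative to B
```

Because the two thresholds are approximately reciprocal (1/1.5 ≈ 0.667), classifying one speaker as talkative usually classifies the other as a listener, although values very close to a boundary can fall on different sides.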

FIG. 6 is a diagram for explaining an example of a conversation style analysis method. In the example of FIG. 6, the ratio of A's speaking time to the total duration of the conversation is 25.72%, the ratio of B's speaking time to the total duration of the conversation is 43.71%, and the ratio R_t of B's speaking time to A's speaking time is 1.70. In this case, the analysis unit 14f analyzes that B talks a lot in the conversation. In addition, because the ratio R_t of A's speaking time to B's speaking time is 0.66 or less, the analysis unit 14f analyzes that A does not talk much and is a so-called listener.

When a certain person's ratio of the number of voiced regions to the number of utterance regions and that person's mean utterance-region length are both greater than the other person's corresponding values, the analysis unit 14f analyzes that the person tends to speak at length in the conversation. When a certain person's mean silence-region length is greater than the other person's mean silence-region length and the standard deviation of that person's silence-region length is equal to or greater than a predetermined value, for example 6.0, the analysis unit 14f analyzes that the person listens to the other party and interrupts his or her own speech in response to what the other party says, so that the length of the person's utterances is not constant.

FIG. 7 is a diagram for explaining an example of a conversation style analysis method. In the example of FIG. 7, A has 83 voiced regions with a mean length of 0.42546 (s) and a standard deviation of 0.5010; 28 utterance regions with a mean length of 1.58 (s) and a standard deviation of 1.7803; and 29 silence regions with a mean length of 4.4055 (s) and a standard deviation of 6.8001. B has 150 voiced regions with a mean length of 0.40416 (s) and a standard deviation of 0.4198; 40 utterance regions with a mean length of 1.8796 (s) and a standard deviation of 1.4928; and 41 silence regions with a mean length of 2.3614 (s) and a standard deviation of 2.7527. In this case, the analysis unit 14f analyzes that B talks a lot in the conversation. In addition, because the ratio R_t of A's speaking time to B's speaking time is 0.66 or less, the analysis unit 14f analyzes that A does not talk much and is a so-called listener. Here, B's ratio of the number of voiced regions to the number of utterance regions and B's mean utterance-region length are both greater than A's corresponding values. The analysis unit 14f therefore analyzes that B tends to speak at length in the conversation. Furthermore, A's mean silence-region length is greater than B's, and the standard deviation of A's silence-region length is equal to or greater than the predetermined value, for example 6.0. The analysis unit 14f therefore analyzes that A listens to the other party and interrupts his or her own speech in response to what the other party says, so that the length of A's utterances is not constant.
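The two comparisons applied to the FIG. 7 values can be checked with a short sketch. The dictionary keys and function names are hypothetical; the numbers are the ones quoted for A and B in the description of FIG. 7, and 6.0 is the example threshold from the text.

```python
A = {"voiced_count": 83, "utterance_count": 28, "utterance_mean": 1.58,
     "silence_mean": 4.4055, "silence_sd": 6.8001}
B = {"voiced_count": 150, "utterance_count": 40, "utterance_mean": 1.8796,
     "silence_mean": 2.3614, "silence_sd": 2.7527}


def talks_at_length(person, other):
    """True when both the voiced/utterance count ratio and the mean
    utterance length of `person` exceed those of `other`."""
    ratio = person["voiced_count"] / person["utterance_count"]
    other_ratio = other["voiced_count"] / other["utterance_count"]
    return ratio > other_ratio and person["utterance_mean"] > other["utterance_mean"]


def pauses_irregularly(person, other, sd_threshold=6.0):
    """True when `person` has the longer mean silence and a silence-length
    standard deviation at or above the threshold."""
    return (person["silence_mean"] > other["silence_mean"]
            and person["silence_sd"] >= sd_threshold)


b_talks_long = talks_at_length(B, A)         # 150/40 = 3.75 vs 83/28 ≈ 2.96
a_pauses_often = pauses_irregularly(A, B)    # 4.4055 > 2.3614 and 6.8001 >= 6.0
```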

When a certain person's standard deviation of volume, standard deviation of spectral entropy, or degree of change is equal to or greater than the corresponding reference value, the analysis unit 14f analyzes that the person is a passionate speaker whose emotions change sharply. When a certain person's standard deviation of volume, standard deviation of spectral entropy, or degree of change is less than the corresponding reference value, the analysis unit 14f analyzes that the person is a quiet speaker whose emotions change little.

FIG. 8 is a diagram for explaining an example of a conversation style analysis method. In the example of FIG. 8, A's degree of change is 0.2824 and B's degree of change is 0.2662. If the reference value for the degree of change is set to, for example, 0.27, the analysis unit 14f analyzes that A is a passionate speaker whose emotions change sharply and that B is a quiet speaker whose emotions change little.
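As a small illustration of the FIG. 8 comparison, the sketch below applies the example reference value 0.27 to the degree of change only. The function name and labels are hypothetical, and the analogous checks on the volume and spectral-entropy standard deviations are omitted because the text gives no reference values for them.

```python
def classify_emotion(change, reference=0.27):
    """Compare a speaker's degree of change with a reference value."""
    return "passionate" if change >= reference else "quiet"


speaker_a = classify_emotion(0.2824)  # A's degree of change in FIG. 8
speaker_b = classify_emotion(0.2662)  # B's degree of change in FIG. 8
```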

The analysis unit 14f can also analyze the relationship between a certain person and another person. For example, when the ratio R_t of a certain person's speaking time to the other person's speaking time is equal to or greater than a predetermined value, for example 1.0, the person talks to the other person a lot, so the analysis unit 14f can analyze that the two are friends or family. On the other hand, when the ratio R_t is less than a predetermined value, for example 1.0, the person is trying to listen to the other person, so the analysis unit 14f can analyze that the two are company colleagues or business partners.

In a conversation between a certain person and another person, when the person's mean utterance-region length is equal to or greater than a predetermined value, for example 1.85 (s), the analysis unit 14f can analyze that the two are friends or family. This is because the person talks to the other person a lot. On the other hand, when the person's mean utterance-region length is less than a predetermined value, for example 1.85 (s), the analysis unit 14f can analyze that the two are company colleagues or business partners.

In a conversation between a certain person and another person, when the person's mean silence-region length is equal to or less than a predetermined value, for example 3.00 (s), the analysis unit 14f can analyze, for the same reason, that the two are friends or family. On the other hand, when the person's mean silence-region length is greater than a predetermined value, for example 3.00 (s), the analysis unit 14f can analyze that the two are company colleagues or business partners.

In a conversation between a certain person and another person, when the person's degree of change is equal to or greater than a predetermined value, for example 0.33, the analysis unit 14f can analyze, for the same reason, that the two are friends or family. On the other hand, when the person's degree of change is less than a predetermined value, for example 0.33, the analysis unit 14f can analyze that the two are company colleagues or business partners.

FIG. 9 is a diagram illustrating an example of the conversation characteristics extracted when person A talks with each of persons B, C, D, and E. In the example of FIG. 9, B is A's colleague, C is A's business partner, D is A's friend, and E is a member of A's family.

In the example of FIG. 9, A's ratio R_t is 0.9 in the conversation with B, 0.8 in the conversation with C, 1.3 in the conversation with D, and 1.5 in the conversation with E. In this case, based on the ratio R_t, the analysis unit 14f analyzes that A's relationship with B and with C is that of colleagues or business partners, and that A's relationship with D and with E is that of friends or family.

In the example of FIG. 9, A's mean utterance-region length is 1.816 (s) in the conversation with B, 1.759 (s) in the conversation with C, 1.926 (s) in the conversation with D, and 1.883 (s) in the conversation with E. In this case, based on A's mean utterance-region length, the analysis unit 14f analyzes that A's relationship with B and with C is that of colleagues or business partners, and that A's relationship with D and with E is that of friends or family.

In the example of FIG. 9, A's mean silence-region length is 3.18027 (s) in the conversation with B, 3.2259 (s) in the conversation with C, 2.53642 (s) in the conversation with D, and 2.75958 (s) in the conversation with E. In this case, based on A's mean silence-region length, the analysis unit 14f analyzes that A's relationship with B and with C is that of colleagues or business partners, and that A's relationship with D and with E is that of friends or family.

In the example of FIG. 9, A's degree of change is 0.316486 in the conversation with B, 0.288189 in the conversation with C, 0.342278 in the conversation with D, and 0.370805 in the conversation with E. In this case, based on A's degree of change, the analysis unit 14f analyzes that A's relationship with B and with C is that of colleagues or business partners, and that A's relationship with D and with E is that of friends or family.
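The four relationship criteria can be applied to the FIG. 9 values in one pass. In this sketch the table layout, key names, and labels are hypothetical; the thresholds (1.0, 1.85 s, 3.00 s, 0.33) are the example values from the text, and each criterion is evaluated independently, as in the description.

```python
# A's characteristics in the conversations with B, C, D, and E (FIG. 9).
FIG9 = {
    "B": {"r_t": 0.9, "utterance_mean": 1.816, "silence_mean": 3.18027, "change": 0.316486},
    "C": {"r_t": 0.8, "utterance_mean": 1.759, "silence_mean": 3.2259, "change": 0.288189},
    "D": {"r_t": 1.3, "utterance_mean": 1.926, "silence_mean": 2.53642, "change": 0.342278},
    "E": {"r_t": 1.5, "utterance_mean": 1.883, "silence_mean": 2.75958, "change": 0.370805},
}

# Each test returns True when the value points to "friends or family".
CRITERIA = {
    "r_t": lambda v: v >= 1.0,
    "utterance_mean": lambda v: v >= 1.85,
    "silence_mean": lambda v: v <= 3.00,
    "change": lambda v: v >= 0.33,
}


def relationship_by_criterion(features):
    """Return the relationship suggested by each criterion separately."""
    return {
        name: "friends/family" if test(features[name]) else "colleague/partner"
        for name, test in CRITERIA.items()
    }


results = {partner: relationship_by_criterion(f) for partner, f in FIG9.items()}
```

For the FIG. 9 values all four criteria agree for every partner. How to resolve a disagreement between criteria is not described in the text, so the sketch simply reports each result.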

The analysis unit 14f then transmits the analysis result to the output device 12. The output device 12 outputs the analysis result, which is thereby fed back to the speakers.

The control unit 14 is an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array), or an electronic circuit such as a CPU (Central Processing Unit) or an MPU (Micro Processing Unit).

[Process flow]
Next, the flow of processing of the analysis device 10 according to the present embodiment will be described. FIGS. 10 to 12 are flowcharts illustrating the procedure of the analysis process according to the first embodiment. The analysis process can be executed at various timings. For example, the analysis process is executed when the control unit 14 receives, from the input unit 11, an instruction to execute the analysis process.

As illustrated in FIG. 10, the acquisition unit 14a acquires the first audio data 13a and the second audio data 13b (step S101). The first detection unit 14b determines whether the first audio data 13a and the second audio data 13b have the same length (step S102). Here, "the same" includes the case where the difference in length is within an allowable error range.

If the lengths are not the same (No at step S102), the first detection unit 14b controls the output unit 12 to output an error (step S103) and ends the process. If the lengths are the same (Yes at step S102), the first detection unit 14b divides the first audio data 13a and the second audio data 13b into frames (step S104).
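Steps S102 to S104 can be sketched as a length check followed by framing. Everything specific here is an assumption for illustration: the function name `frame_pair`, the `tolerance` parameter standing in for the allowable error range, the use of non-overlapping frames, and raising an exception in place of the error output of step S103.

```python
def frame_pair(first, second, frame_len, tolerance=0):
    """Check that two recordings have (nearly) equal length, then frame both.

    first, second: sequences of samples from the two microphones.
    frame_len:     samples per frame (hypothetical; non-overlapping frames).
    tolerance:     allowed difference in length, in samples.
    """
    if abs(len(first) - len(second)) > tolerance:
        # Corresponds to the error output of step S103.
        raise ValueError("recordings differ in length")
    usable = min(len(first), len(second))

    def frames(samples):
        # Keep complete frames only; trailing samples are dropped.
        return [list(samples[i:i + frame_len])
                for i in range(0, usable - frame_len + 1, frame_len)]

    return frames(first), frames(second)


frames_1, frames_2 = frame_pair(list(range(10)), list(range(10)), frame_len=4)
```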

For every frame, the first detection unit 14b extracts three features: the number of peaks of the autocorrelation coefficient, the maximum value of the peaks of the autocorrelation coefficient, and the spectral entropy (step S105). The first detection unit 14b calculates the mean and the standard deviation of each of the three features extracted over all frames (step S106). The first detection unit 14b sets the variable N to 0 (step S107).
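The three per-frame features of step S105 can be illustrated with textbook definitions. These definitions are assumptions and may differ from the ones the patent uses: the autocorrelation is mean-removed and normalized by the lag-0 energy, a peak is a local maximum at a non-zero lag, and the spectral entropy is the Shannon entropy of the normalized power spectrum computed with a plain DFT. All function names are hypothetical.

```python
import cmath
import math


def autocorrelation(frame):
    """Normalized autocorrelation coefficients for lags 0 .. len(frame)-1."""
    n = len(frame)
    mean = sum(frame) / n
    x = [v - mean for v in frame]
    energy = sum(v * v for v in x)
    if energy == 0:
        return [0.0] * n
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
            for lag in range(n)]


def peak_features(acf):
    """Number of local maxima at non-zero lags and the largest such peak."""
    peaks = [acf[i] for i in range(1, len(acf) - 1)
             if acf[i - 1] < acf[i] >= acf[i + 1]]
    return len(peaks), (max(peaks) if peaks else 0.0)


def spectral_entropy(frame):
    """Shannon entropy (natural log) of the normalized power spectrum."""
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(s) ** 2)
    total = sum(power)
    if total == 0:
        return 0.0
    return -sum((p / total) * math.log(p / total) for p in power if p > 0)


# A periodic (voiced-like) frame: a sine with a period of 8 samples.
tone = [math.sin(2 * math.pi * t / 8) for t in range(32)]
peak_count, max_peak = peak_features(autocorrelation(tone))
entropy = spectral_entropy(tone)
```

A strongly periodic frame gives a few large autocorrelation peaks and a low spectral entropy, whereas a noise-like frame gives a flatter spectrum and a higher entropy, which is why these features help separate voiced from unvoiced sound.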

The first detection unit 14b sets an initial state transition probability P_t for the state transitions between voiced and unvoiced sound in the hidden Markov model (step S108). The first detection unit 14b increments the value of the variable N by one (step S109).

The first detection unit 14b determines whether the value of the variable N is 5 or more (step S110). If the value of the variable N is less than 5 (No at step S110), the first detection unit 14b calculates the state transition probability P_t by the EM method, using as observations the three features extracted for all frames together with the mean and standard deviation of each feature (step S111), and returns to step S109.

On the other hand, if the value of the variable N is 5 or more (Yes at step S110), the first detection unit 14b calculates the state transition probability P_t by the EM method, using as observations the three features extracted for all frames together with the mean and standard deviation of each feature (step S112).

The first detection unit 14b calculates the observation probability P_o by the Viterbi algorithm, using as observations the three features extracted for all frames together with the mean and standard deviation of each feature (step S113).

Based on the three features extracted for all frames, the first detection unit 14b uses the Viterbi algorithm to detect, for each frame in which speech is taking place, whether the sound being uttered is voiced or unvoiced. The first detection unit 14b then treats the regions in which voiced sound was detected as voiced regions and the regions in which unvoiced sound was detected as unvoiced regions (step S114).

For all frames, the second detection unit 14c uses the Viterbi algorithm, based on the voiced and unvoiced sounds, to detect whether each frame is in a silence state or an utterance state, and thereby detects the silence regions and the utterance regions (step S115).
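Both detection stages rely on Viterbi decoding, which finds the most likely state sequence of a hidden Markov model. The sketch below is a generic two-state decoder over discrete observations, not the patent's model: the states, the "high"/"low" observation symbols (standing in for quantized features), and all probabilities are made-up values for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence (log-domain Viterbi)."""
    if not observations:
        return []
    # Best log-score of any path ending in each state at the first frame.
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
              for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this frame.
            prev = max(states, key=lambda r: scores[r] + math.log(trans_p[r][s]))
            new_scores[s] = (scores[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][obs]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


STATES = ("voiced", "unvoiced")
START = {"voiced": 0.5, "unvoiced": 0.5}
TRANS = {"voiced": {"voiced": 0.9, "unvoiced": 0.1},
         "unvoiced": {"voiced": 0.1, "unvoiced": 0.9}}
EMIT = {"voiced": {"high": 0.8, "low": 0.2},
        "unvoiced": {"high": 0.2, "low": 0.8}}

path = viterbi(["high", "high", "low", "high", "high"], STATES, START, TRANS, EMIT)
```

With these sticky transition probabilities a single "low" frame between "high" frames does not flip the decoded state, which shows how the model smooths frame-level decisions into regions. The same decoder could be reused for the silence/utterance stage by feeding it the voiced/unvoiced labels as observations, with its own assumed probabilities.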

The specifying unit 14d calculates, for the first audio data 13a, the mean volume E_t1 of the frames detected as voiced regions within utterance regions and, for the second audio data 13b, the mean volume E_t2 of the frames detected as voiced regions within utterance regions (step S116). From the frames of the first audio data 13a detected as voiced regions within utterance regions, the specifying unit 14d selects one frame that has not yet been evaluated and determines whether the volume E_j of the selected frame is equal to or greater than a predetermined threshold, for example 0.2E_t1 (step S117). If the volume of the selected frame is equal to or greater than the threshold (Yes at step S117), the specifying unit 14d identifies the person closest to the microphone that acquired the first audio data 13a as the person who spoke in this frame (step S118). If the volume of the selected frame is below the threshold (No at step S117), the specifying unit 14d identifies a person other than the person closest to the microphone that acquired the first audio data 13a as the person who spoke in this frame (step S119).

The specifying unit 14d determines whether any frame not yet determined in step S118 remains among the frames of the first audio data 13a detected as voiced sound regions within utterance regions (step S120). If an undetermined frame remains (Yes at step S120), the process returns to step S117. If no undetermined frame remains (No at step S120), the specifying unit 14d performs the following processing, as shown in FIG. 11. That is, the specifying unit 14d selects one undetermined frame from among the frames of the second audio data 13b detected as voiced sound regions within utterance regions, and determines whether the volume E_j of the selected frame is equal to or greater than a predetermined threshold, for example 0.2E_t2 (step S121).

If the volume of the selected frame is equal to or greater than the predetermined threshold (Yes at step S121), the specifying unit 14d identifies the person closest to the microphone that acquired the second audio data 13b as the person who spoke in this frame (step S122). If, on the other hand, the volume of the selected frame is less than the predetermined threshold (No at step S121), the specifying unit 14d identifies a person other than the person closest to the microphone that acquired the second audio data 13b as the person who spoke in this frame (step S123).
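The threshold test of steps S116 to S123 can be sketched for one microphone as follows. The function name, the label strings "near" and "other", and the list-based data layout are assumptions made for illustration; only the rule itself (compare each frame's volume with 0.2 times the average volume of the voiced frames within utterance regions) comes from the text.

```python
def identify_speakers(volumes, is_voiced_in_utterance, ratio=0.2):
    """Label each voiced utterance frame of one microphone.

    volumes: per-frame volume values E_j.
    is_voiced_in_utterance: per-frame flags, True for frames detected as
        voiced sound regions within utterance regions.
    Returns "near" (person closest to this microphone), "other"
    (some other person), or None for frames that are not judged.
    """
    selected = [v for v, keep in zip(volumes, is_voiced_in_utterance) if keep]
    if not selected:
        return [None] * len(volumes)
    mean_volume = sum(selected) / len(selected)  # corresponds to E_t1 or E_t2
    threshold = ratio * mean_volume              # e.g. 0.2 * E_t1
    labels = []
    for v, keep in zip(volumes, is_voiced_in_utterance):
        if not keep:
            labels.append(None)
        elif v >= threshold:
            labels.append("near")
        else:
            labels.append("other")
    return labels
```

The same function would be called once with the frames of the first audio data 13a and once with those of the second audio data 13b, which mirrors the two loops in the flowchart.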

The specifying unit 14d determines whether any frame not yet determined in step S121 remains among the frames of the second audio data 13b detected as voiced sound regions within utterance regions (step S124). If an undetermined frame remains (Yes at step S124), the process returns to step S121. If no undetermined frame remains (No at step S124), the extraction unit 14e performs the following processing, as shown in FIG. 12. That is, from the frames identified as spoken by a certain person, the extraction unit 14e calculates the number of voiced sound regions, the average length of the voiced sound regions, and the standard deviation of the lengths of the voiced sound regions (step S125). From the frames identified as spoken by that person, the extraction unit 14e calculates the number of utterance regions, the average length of the utterance regions, and the standard deviation of the lengths of the utterance regions (step S126). From the frames of that person's silence regions, the extraction unit 14e calculates the number of silence regions, the average length of the silence regions, and the standard deviation of the lengths of the silence regions (step S127).
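The region statistics of steps S125 to S127 can be sketched as below, assuming that a region is a maximal run of consecutive frames carrying the same label. The function name, the `frame_sec` parameter, and the use of the population standard deviation are assumptions for illustration; the text does not state which form of the standard deviation is used.

```python
from itertools import groupby
from statistics import mean, pstdev


def region_stats(flags, frame_sec=1.0):
    """Count regions and summarize their lengths.

    flags: per-frame booleans, True where the frame belongs to the kind of
        region being measured (voiced, utterance, or silence).
    frame_sec: duration of one frame; lengths are reported in these units.
    """
    # Each maximal run of True frames is treated as one region.
    lengths = [
        sum(1 for _ in run) * frame_sec
        for value, run in groupby(flags)
        if value
    ]
    if not lengths:
        return {"count": 0, "mean": 0.0, "std": 0.0}
    return {"count": len(lengths), "mean": mean(lengths), "std": pstdev(lengths)}
```

Calling the function three times per person, once each with voiced, utterance, and silence flags, would yield the nine values described in these steps.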

The extraction unit 14e calculates the ratio of the certain person's utterance time to the length of the entire conversation (step S128). The extraction unit 14e calculates the ratio of the certain person's utterance time to the other person's utterance time (step S129). From the frames identified as spoken by the certain person, the extraction unit 14e calculates the standard deviation of the volume and the standard deviation of the spectral entropy (step S130). The extraction unit 14e calculates, as the degree of change, the sum of the standard deviation of the volume and the standard deviation of the spectral entropy calculated from the frames identified as spoken by the certain person (step S131).
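Steps S128 to S131 amount to two ratios and one sum, which can be sketched as follows. The function names, the dictionary keys, and the choice of the population standard deviation are assumptions for illustration; times are expressed as frame counts here, though any consistent unit would do.

```python
from statistics import pstdev


def talk_ratios(own_frames, other_frames, total_frames):
    """Utterance-time ratios for one person (steps S128 and S129)."""
    return {
        # Share of the whole conversation occupied by this person's speech.
        "share_of_conversation": own_frames / total_frames,
        # This person's utterance time relative to the other person's.
        "ratio_to_other": own_frames / other_frames if other_frames else float("inf"),
    }


def degree_of_change(volumes, entropies):
    """Sum of the two standard deviations (steps S130 and S131)."""
    return pstdev(volumes) + pstdev(entropies)
```

Because the two standard deviations are added without rescaling, the result depends on the units of volume and entropy; the text does not mention any normalization, so none is applied in this sketch.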

From the frames identified as spoken by the other person, the extraction unit 14e calculates the number of voiced sound regions, the average length of the voiced sound regions, and the standard deviation of the lengths of the voiced sound regions (step S132). From the frames identified as spoken by the other person, the extraction unit 14e calculates the number of utterance regions, the average length of the utterance regions, and the standard deviation of the lengths of the utterance regions (step S133). From the frames of the other person's silence regions, the extraction unit 14e calculates the number of silence regions, the average length of the silence regions, and the standard deviation of the lengths of the silence regions (step S134).

The extraction unit 14e calculates the ratio of the other person's utterance time to the length of the entire conversation (step S135). The extraction unit 14e calculates the ratio of the other person's utterance time to the certain person's utterance time (step S136). From the frames identified as spoken by the other person, the extraction unit 14e calculates the standard deviation of the volume and the standard deviation of the spectral entropy (step S137). The extraction unit 14e calculates, as the degree of change, the sum of the standard deviation of the volume and the standard deviation of the spectral entropy calculated from the frames identified as spoken by the other person (step S138).

The analysis unit 14f analyzes the conversation style based on the extracted conversation characteristics (step S139). The analysis unit 14f transmits the analysis result to the output device 12 (step S140) and ends the process.

[Effects of the First Embodiment]
As described above, the analysis apparatus 10 according to the present embodiment detects voiced sound regions and unvoiced sound regions from audio data using a first probability model. Based on the detected voiced and unvoiced sound regions, the analysis apparatus 10 detects utterance regions and silence regions in the audio data using a second probability model. The analysis apparatus 10 extracts conversation characteristics of the detected utterance regions and silence regions, and analyzes the conversation style based on the extracted conversation characteristics. Thus, according to the present embodiment, the conversation style of a speaker is analyzed from the conversation characteristics of the utterance and silence regions without identifying the content of the conversation, so the conversation style can be analyzed easily and without time-consuming processing.

Furthermore, according to the present embodiment, the conversation style of a speaker is analyzed without identifying the content of the conversation. The content of the conversation is therefore never disclosed, and the conversation style can be analyzed while protecting the speaker's privacy.

Furthermore, according to the present embodiment, the utterance regions and silence regions are detected using features common to various languages such as Japanese, English, and Chinese, so the conversation style can be analyzed independently of language.

In addition, the analysis apparatus 10 according to the present embodiment extracts feature quantities that are robust against ambient noise, such as the number of autocorrelation coefficient peaks, the maximum value of the autocorrelation coefficient peaks, and the spectral entropy, and uses the extracted feature quantities to detect voiced sound regions and unvoiced sound regions. The analysis apparatus 10 can therefore suppress the loss of accuracy in detecting voiced and unvoiced sound regions that ambient noise would otherwise cause. Moreover, because it uses feature quantities that are robust against ambient noise, the analysis apparatus 10 can reduce the number of frames when dividing the first audio data 13a and the second audio data 13b into frames. The analysis apparatus 10 can therefore detect voiced and unvoiced sound regions with simpler processing.
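The three per-frame feature quantities named above can be sketched in plain Python as follows. The exact definitions here are assumptions for illustration (energy-normalized autocorrelation, peaks as strict local maxima at lags of 1 or more, and Shannon entropy of the normalized power spectrum); the specification may define them differently elsewhere. The direct DFT is used only to keep the sketch free of external libraries and is slow for long frames.

```python
import cmath
import math


def autocorr(frame):
    """Energy-normalized autocorrelation for lags 0..N-1."""
    n = len(frame)
    energy = sum(x * x for x in frame)
    if energy == 0:
        return [0.0] * n
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        for lag in range(n)
    ]


def peak_features(r):
    """Return (number of peaks, maximum peak value) over lags >= 1."""
    peaks = [r[k] for k in range(1, len(r) - 1) if r[k - 1] < r[k] > r[k + 1]]
    return len(peaks), (max(peaks) if peaks else 0.0)


def spectral_entropy(frame):
    """Shannon entropy (natural log) of the normalized power spectrum."""
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):
        # Direct DFT of one frequency bin.
        c = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        power.append(abs(c) ** 2)
    total = sum(power)
    if total == 0:
        return 0.0
    probs = [p / total for p in power if p > 0]
    return -sum(p * math.log(p) for p in probs)
```

A strongly periodic frame, such as a pure tone, gives a high autocorrelation peak and a spectral entropy near zero, whereas an impulse-like or noisy frame spreads its power across frequency bins and gives a high entropy. This contrast is what makes such features usable for separating voiced from unvoiced frames.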

In addition, when the volume in an utterance region is equal to or greater than the threshold, the analysis apparatus 10 according to the present embodiment identifies the person closest to the microphone as the person who spoke in that utterance region. When the volume in an utterance region is less than the threshold, the analysis apparatus 10 identifies a person other than the person closest to the microphone as the person who spoke in that utterance region. The analysis apparatus 10 then analyzes the conversation style for each identified person. Because the speaker can be identified by the simple process of comparing volume levels, the analysis apparatus 10 can analyze the conversation style of each person with simple processing.

In addition, the analysis apparatus 10 according to the present embodiment detects silence regions and utterance regions using a hidden Markov model as the probability model. The analysis apparatus 10 can therefore detect silence regions and utterance regions accurately even when utterances overlap in a conversation between two persons.

Although an embodiment of the disclosed apparatus has been described above, the present invention may be implemented in various forms other than the embodiment described above. Other embodiments included in the present invention are therefore described below.

In addition, the order of the steps of each process described in the embodiments can be changed according to various loads, usage conditions, and the like. For example, when the conversation involves two persons and the disclosed apparatus already knows the attributes (such as the names) of these two persons, the processing of steps S121 to S124 shown in FIG. 11 can be omitted.

The components of each illustrated apparatus are functional and conceptual, and need not be physically configured as illustrated. That is, the specific form of distribution and integration of each apparatus is not limited to the illustrated one, and all or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the extraction unit 14e and the analysis unit 14f shown in FIG. 1 may be integrated.

[Analysis Program]
The various processes of the analysis apparatus described in the above embodiment can also be realized by executing a program prepared in advance on a computer system such as a personal computer or a workstation. An example of a computer that executes an analysis program having the same functions as the analysis apparatus described in the above embodiment is therefore described below with reference to FIG. 13. FIG. 13 is a diagram illustrating a computer that executes the analysis program.

As shown in FIG. 13, the computer 300 in the second embodiment includes a CPU (Central Processing Unit) 310, a ROM (Read Only Memory) 320, an HDD (Hard Disk Drive) 330, and a RAM (Random Access Memory) 340. These units 300 to 340 are connected via a bus 400.

The ROM 320 stores in advance an analysis program 320a that provides the same functions as the acquisition unit 14a, the first detection unit 14b, the second detection unit 14c, the specifying unit 14d, the extraction unit 14e, and the analysis unit 14f described in the first embodiment. The analysis program 320a may be divided as appropriate.

The CPU 310 reads the analysis program 320a from the ROM 320 and executes it.

The HDD 330 holds first audio data and second audio data, which correspond to the first audio data 13a and the second audio data 13b shown in FIG. 1, respectively.

The CPU 310 reads the first audio data and the second audio data and stores them in the RAM 340. The CPU 310 then executes the analysis program using the first audio data and the second audio data stored in the RAM 340. Not all of the data needs to be held in the RAM 340 at all times; it is sufficient that only the data needed for processing is stored in the RAM 340.

Note that the analysis program described above need not be stored in the ROM 320 from the beginning.

For example, the program may be stored on a "portable physical medium" inserted into the computer 300, such as a flexible disk (FD), a CD-ROM, a DVD, a magneto-optical disk, or an IC card. The computer 300 may then read the program from such a medium and execute it.

Alternatively, the program may be stored on "another computer (or server)" connected to the computer 300 via a public line, the Internet, a LAN, a WAN, or the like. The computer 300 may then read the program from that computer or server and execute it.

The following supplementary notes are further disclosed with respect to the embodiments described above and their modifications.

(Supplementary Note 1) An analysis apparatus comprising:
an acquisition unit that acquires audio data;
a first detection unit that detects voiced sound regions and unvoiced sound regions from the audio data acquired by the acquisition unit, using a first probability model;
a second detection unit that detects utterance regions and silence regions in the audio data, using a second probability model, based on the voiced sound regions and unvoiced sound regions detected by the first detection unit;
an extraction unit that extracts conversation characteristics of the utterance regions and silence regions detected by the second detection unit; and
an analysis unit that analyzes a conversation style based on the conversation characteristics extracted by the extraction unit.

(Supplementary Note 2) The analysis apparatus according to Supplementary Note 1, wherein the first detection unit extracts, for each frame of the audio data, the number of autocorrelation coefficient peaks, the maximum value of the autocorrelation coefficient peaks, and the spectral entropy, and detects the voiced sound regions and unvoiced sound regions based on the extracted number of autocorrelation coefficient peaks, maximum value of the autocorrelation coefficient peaks, and spectral entropy.

(Supplementary Note 3) The analysis apparatus according to Supplementary Note 1 or 2, wherein
the acquisition unit acquires audio data of a conversation among a plurality of persons acquired by an audio acquisition device,
the analysis apparatus further comprises a specifying unit that, when the volume in an utterance region detected by the second detection unit is equal to or greater than a threshold, identifies the person closest to the audio acquisition device as the person who spoke in that utterance region, and, when the volume in an utterance region detected by the second detection unit is less than the threshold, identifies a person other than the person closest to the audio acquisition device as the person who spoke in that utterance region, and
the analysis unit analyzes the conversation style for each person identified by the specifying unit.

(Supplementary Note 4) An analysis program that causes a computer to execute a process comprising:
acquiring audio data;
detecting voiced sound regions and unvoiced sound regions from the acquired audio data, using a first probability model;
detecting utterance regions and silence regions in the audio data, using a second probability model, based on the detected voiced sound regions and unvoiced sound regions;
extracting conversation characteristics of the detected utterance regions and silence regions; and
analyzing a conversation style based on the extracted conversation characteristics.

(Supplementary Note 5) The analysis program according to Supplementary Note 4, wherein the process of detecting the voiced sound regions and unvoiced sound regions extracts, for each frame of the audio data, the number of autocorrelation coefficient peaks, the maximum value of the autocorrelation coefficient peaks, and the spectral entropy, and detects the voiced sound regions and unvoiced sound regions based on the extracted number of autocorrelation coefficient peaks, maximum value of the autocorrelation coefficient peaks, and spectral entropy.

(Supplementary Note 6) The analysis program according to Supplementary Note 4 or 5, wherein
the process of acquiring the audio data acquires audio data of a conversation among a plurality of persons acquired by an audio acquisition device,
the program further causes the computer to execute a process of identifying, when the volume in a detected utterance region is equal to or greater than a threshold, the person closest to the audio acquisition device as the person who spoke in that utterance region, and identifying, when the volume in a detected utterance region is less than the threshold, a person other than the person closest to the audio acquisition device as the person who spoke in that utterance region, and
the process of analyzing the conversation style analyzes the conversation style for each identified person.

(Supplementary Note 7) An analysis method executed by a computer, the method comprising:
acquiring audio data;
detecting voiced sound regions and unvoiced sound regions from the acquired audio data, using a first probability model;
detecting utterance regions and silence regions in the audio data, using a second probability model, based on the detected voiced sound regions and unvoiced sound regions;
extracting conversation characteristics of the detected utterance regions and silence regions; and
analyzing a conversation style based on the extracted conversation characteristics.

(Supplementary Note 8) The analysis method according to Supplementary Note 7, wherein the detecting of the voiced sound regions and unvoiced sound regions extracts, for each frame of the audio data, the number of autocorrelation coefficient peaks, the maximum value of the autocorrelation coefficient peaks, and the spectral entropy, and detects the voiced sound regions and unvoiced sound regions based on the extracted number of autocorrelation coefficient peaks, maximum value of the autocorrelation coefficient peaks, and spectral entropy.

(Supplementary Note 9) The analysis method according to Supplementary Note 7 or 8, wherein
the acquiring of the audio data acquires audio data of a conversation among a plurality of persons acquired by an audio acquisition device,
the computer further identifies, when the volume in a detected utterance region is equal to or greater than a threshold, the person closest to the audio acquisition device as the person who spoke in that utterance region, and identifies, when the volume in a detected utterance region is less than the threshold, a person other than the person closest to the audio acquisition device as the person who spoke in that utterance region, and
the analyzing of the conversation style analyzes the conversation style for each identified person.

Description of Symbols
10 Analysis apparatus
11 Input unit
12 Output unit
13 Storage unit
13a First audio data
13b Second audio data
14 Control unit
14a Acquisition unit
14b First detection unit
14c Second detection unit
14d Specifying unit
14e Extraction unit
14f Analysis unit

Claims

An analysis apparatus comprising:
an acquisition unit that acquires audio data;
a first detection unit that detects voiced sound regions and unvoiced sound regions from the audio data acquired by the acquisition unit, using a first probability model;
a second detection unit that detects utterance regions and silence regions in the audio data, using a second probability model, based on the voiced sound regions and unvoiced sound regions detected by the first detection unit;
an extraction unit that extracts conversation characteristics of the utterance regions and silence regions detected by the second detection unit; and
an analysis unit that analyzes a conversation style based on the conversation characteristics extracted by the extraction unit.

The analysis apparatus according to claim 1, wherein the first detection unit extracts, for each frame of the audio data, the number of autocorrelation coefficient peaks, the maximum value of the autocorrelation coefficient peaks, and the spectral entropy, and detects the voiced sound regions and unvoiced sound regions based on the extracted number of autocorrelation coefficient peaks, maximum value of the autocorrelation coefficient peaks, and spectral entropy.

The analysis apparatus according to claim 1 or 2, wherein
the acquisition unit acquires audio data of a conversation among a plurality of persons acquired by an audio acquisition device,
the analysis apparatus further comprises a specifying unit that, when the volume in an utterance region detected by the second detection unit is equal to or greater than a threshold, identifies the person closest to the audio acquisition device as the person who spoke in that utterance region, and, when the volume in an utterance region detected by the second detection unit is less than the threshold, identifies a person other than the person closest to the audio acquisition device as the person who spoke in that utterance region, and
the analysis unit analyzes the conversation style for each person identified by the specifying unit.

An analysis program that causes a computer to execute a process comprising:
acquiring audio data;
detecting voiced sound regions and unvoiced sound regions from the acquired audio data, using a first probability model;
detecting utterance regions and silence regions in the audio data, using a second probability model, based on the detected voiced sound regions and unvoiced sound regions;
extracting conversation characteristics of the detected utterance regions and silence regions; and
analyzing a conversation style based on the extracted conversation characteristics.

An analysis method executed by a computer, the method comprising:
acquiring audio data;
detecting voiced sound regions and unvoiced sound regions from the acquired audio data, using a first probability model;
detecting utterance regions and silence regions in the audio data, using a second probability model, based on the detected voiced sound regions and unvoiced sound regions;
extracting conversation characteristics of the detected utterance regions and silence regions; and
analyzing a conversation style based on the extracted conversation characteristics.