JP2013183755A

JP2013183755A - Detector, detection program and detection method

Info

Publication number: JP2013183755A
Application number: JP2012048629A
Authority: JP
Inventors: Naoji Matsuo; 直司松尾; Shoji Hayakawa; 昭二早川; Kazuya Takeda; 一哉武田; Takatoshi Sanehiro; 貴敏實廣
Original assignee: Nagoya University NUC; Fujitsu Ltd
Current assignee: Nagoya University NUC; Fujitsu Ltd
Priority date: 2012-03-05
Filing date: 2012-03-05
Publication date: 2013-09-19

Abstract

PROBLEM TO BE SOLVED: To suppress a decline in the accuracy of mental state detection results. SOLUTION: A detector 10 obtains voice data of a speaker. From the obtained voice data, the detector 10 computes the fundamental frequency of the voice and an index that can evaluate the flatness of the spectrum representing the high-frequency components of the voice. The detector 10 also computes sets of the fundamental frequency and the index by using a "two-mass model of the vocal cords" with various parameter values. Among these sets of feature quantities, the detector 10 identifies the set whose difference from the feature quantities computed from the voice data is smallest, and determines the parameter values used to compute that set. The detector 10 detects the mental state of the speaker by using the determined parameters.

Description

The present invention relates to a detection device, a detection program, and a detection method.

In recent years, techniques for analyzing speech data to detect a speaker's state, such as emotion, have become known. For example, a method is known that detects the intensity, speed, tempo, and intonation of the intensity change pattern of a speech signal and generates emotional states such as sadness, anger, and joy from the amount of change in each (for example, Patent Document 1). A method is also known that detects emotion by low-pass filtering a speech signal and extracting features such as its intensity and pitch (for example, Patent Document 2). Furthermore, a method is known that extracts a feature amount related to the phoneme spectrum from speech information and determines the emotional state based on a state determination table prepared in advance (for example, Patent Document 3). In addition, a device is known that extracts periodic fluctuations in the amplitude envelope of a speech signal, determines whether the speech was produced in a strained state, and detects the speaker's anger or irritation (for example, Patent Document 4). There is also a technique that detects the psychological state by using the fundamental frequency of the speech and an index that can evaluate the flatness of the spectrum representing the high-frequency components of the speech, both of which relate to the pitch and loudness of the speech (Patent Document 5).

There are also techniques that associate a stress state (reference) with biological information such as voice and estimate the stress state from the biological information. For example, there is a technique that, while a stress sensor is in use, simultaneously acquires human body information such as face image data and voice data and learns the correlation between the stress state and the human body information; when the stress sensor is not in use, the technique estimates the stress state from the human body information alone (for example, Patent Document 6). There is also a technique that registers emotions based on a subject's voice signal in a DB (database) in association with the subject's biological information, measures the subject's biological information, searches the DB for the emotion associated with the measurement result, and outputs the emotion found as the estimation result (for example, Patent Document 7). A further technique for estimating the stress state works as follows. First, based on training speech, the technique creates in advance a codebook that associates each speech feature vector with its appearance probability in the subject's stressed state and its appearance probability in the non-stressed state. At test time, it extracts speech features from the subject's input speech and, based on the extracted features, determines the corresponding speech feature vector in the codebook. It then calculates a stress state likelihood and a non-stress state likelihood from the appearance probabilities of the determined vector in the stressed state and in the non-stressed state. After that, it accumulates this likelihood information for a certain period, estimates from the accumulated likelihoods whether the speaker was in a stressed state, and outputs a stress relief sound corresponding to the estimated stress state (for example, Patent Document 8). Here, speech is produced when the vibration of the vocal cords travels through the vocal tract and is radiated from the mouth.

Patent Document 1: JP 2002-091482 A
Patent Document 2: JP 2003-099084 A
Patent Document 3: JP 2005-352154 A
Patent Document 4: JP 2009-003162 A
Patent Document 5: JP 2011-242755 A
Patent Document 6: JP 2011-167323 A
Patent Document 7: JP 2010-167014 A
Patent Document 8: JP 2007-000366 A

However, the conventional techniques described above have the problem that the accuracy of psychological state detection is low. For example, in these techniques, the feature amounts used to detect the psychological state are affected by the vocal tract, which has only a weak relationship with the psychological state. Detecting the psychological state from feature amounts affected by the vocal tract therefore lowers the detection accuracy.

The disclosed technology has been made in view of the above, and its object is to provide a detection device, a detection program, and a detection method that can suppress a decrease in the accuracy of psychological state detection results.

The detection device disclosed in the present application includes an acquisition unit, a first calculation unit, a second calculation unit, a determination unit, and a detection unit. The acquisition unit acquires voice data of a person who has uttered a voice. The first calculation unit calculates a first feature amount from the voice data acquired by the acquisition unit. The second calculation unit calculates second feature amounts from a model of vocal cord vibration using predetermined parameters. Among the second feature amounts calculated by the second calculation unit, the determination unit identifies the second feature amount whose difference from the first feature amount calculated by the first calculation unit is smallest, and determines the parameters that were used to calculate it. The detection unit detects the psychological state of the person using the parameters determined by the determination unit.

According to one aspect of the detection device disclosed in the present application, a decrease in the accuracy of psychological state detection results can be suppressed.

FIG. 1 is a diagram illustrating an example of the functional configuration of the detection device according to the embodiment.
FIG. 2A is a diagram illustrating an example of the two-mass model of the vocal cords.
FIG. 2B is a diagram illustrating an example of the two-mass model of the vocal cords.
FIG. 3 is a diagram for explaining processing for one frame of speech, as an example of the processing executed by the detection device according to the embodiment.
FIG. 4A is a diagram illustrating an example of the feature amount F0 calculated by the first calculation unit, plotted over time.
FIG. 4B is a diagram illustrating an example of the feature amount F0′ identified by the determination unit, plotted over time.
FIG. 5A is a diagram illustrating an example of the feature amount SFM (Spectral Flatness Measure) calculated by the first calculation unit, plotted over time.
FIG. 5B is a diagram illustrating an example of the feature amount SFM′ identified by the determination unit, plotted over time.
FIG. 6 is a diagram illustrating an example of psychological state detection results obtained with the conventional technique and with the detection device according to the embodiment.
FIG. 7 is a flowchart illustrating the procedure of the detection process according to the embodiment.
FIG. 8 is a diagram illustrating an example of psychological state detection results obtained with the conventional technique and with the detection devices according to the embodiment and a modification of the embodiment.
FIG. 9 is a diagram illustrating a computer that executes a detection program.

Embodiments of the detection device, detection program, and detection method disclosed in the present application are described in detail below with reference to the drawings. The embodiments do not limit the disclosed technology. The embodiments can also be combined as appropriate, as long as the processing contents do not contradict each other.

The detection device according to the embodiment will be described. FIG. 1 is a diagram illustrating an example of the functional configuration of the detection device according to the embodiment.

[Functional configuration of detection device]
As illustrated in FIG. 1, the detection device 10 includes an input unit 11, an output unit 12, a communication unit 13, a storage unit 14, and a control unit 15.

The input unit 11 inputs various information to the control unit 15. For example, the input unit 11 receives from the user an instruction to execute the detection process described later and inputs the received instruction to the control unit 15. The input unit 11 also inputs voice data to the control unit 15. Here, the voice data is data of speech representing what the speaker uttered. Examples of devices for the input unit 11 include a microphone for inputting the speaker's voice data to the control unit 15, and devices such as a mouse and a keyboard that accept user operations.

The output unit 12 outputs various information. For example, the output unit 12 displays the types and values of the parameters determined by the determination unit 15d described later. An example of a device for the output unit 12 is a liquid crystal display.

The communication unit 13 is an interface for communication between devices. For example, the communication unit 13 is connected to a server (not shown). An example of such a server is one that, on receiving the types and values of the parameters determined by the determination unit 15d described later, uses the received parameters to perform the same processing as the detection unit 15e described later and detects the speaker's psychological state. When the communication unit 13 receives the parameter types and values from the control unit 15, it transmits the received parameters to the server.

The storage unit 14 stores various information. For example, the storage unit 14 stores a comparison feature amount 14a. The comparison feature amount 14a is a feature amount that the detection unit 15e, described later, compares with the spring constant parameters k₁ and k_c, also described later. For example, parameters of the "two-mass model of the vocal cords", a model representing the vibration state of the upper and lower portions of the vocal cords, can be adopted as the comparison feature amount 14a. The "two-mass model of the vocal cords" is a known model, described for example in K. Ishizaka and J. L. Flanagan, "Synthesis of voiced sounds from a two-mass model of the vocal cords", Bell. Syst. Tech. Journal, Vol. 51, pp. 1233-1268, 1972. FIGS. 2A and 2B are diagrams illustrating an example of the two-mass model of the vocal cords. The example of FIGS. 2A and 2B shows a model of the vibration state of the vocal cords defined by the following equations (1), (2), and (3).

Here, m₁ and m₂ are parameters indicating the mass of the root portion and the tip portion of the vocal cords, respectively. k₁, k₂, and k_c are parameters indicating the spring constants when the root portion of the vocal cords, the tip portion of the vocal cords, and the vocal cords as a whole, respectively, are regarded as springs. r₁ and r₂ are parameters indicating the viscosity coefficients of the root portion and the tip portion of the vocal cords, respectively. x₁ and x₂ are parameters indicating the displacement from the equilibrium point when the root portion and the tip portion of the vocal cords, respectively, are regarded as springs. s₁ and s₂ are parameters indicating the spring force when the root portion and the tip portion of the vocal cords, regarded as springs, are extended or compressed by x₁ and x₂, respectively, from the equilibrium point.
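For orientation only, the coupled equations of motion commonly given in the literature for the Ishizaka-Flanagan two-mass model can be written with the symbols defined above as in the following sketch. The driving forces F_1 and F_2 (the air pressure acting on each mass) and the nonlinearity coefficients η_1 and η_2 are assumptions taken from the general form of that model, not from this document, so the sketch may not match equations (1) to (3) term by term.

```latex
\begin{aligned}
m_1 \ddot{x}_1 + r_1 \dot{x}_1 + s_1(x_1) + k_c (x_1 - x_2) &= F_1 \\
m_2 \ddot{x}_2 + r_2 \dot{x}_2 + s_2(x_2) + k_c (x_2 - x_1) &= F_2 \\
s_i(x_i) &= k_i \left( x_i + \eta_i x_i^{3} \right), \quad i = 1, 2
\end{aligned}
```

In this form, k_c couples the two masses, which is why a change in k₁ or k_c alters both the oscillation frequency and the shape of the simulated vibration.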

The range of values of the spring constant parameters k₁ and k_c in the daily state, that is, a state in which the speaker is not under stress, can be calculated using the "two-mass model of the vocal cords", and the calculated range of values of k₁ and k_c can be adopted as the comparison feature amount 14a.

The storage unit 14 is, for example, a semiconductor memory element such as a flash memory, or a storage device such as a hard disk or an optical disk. The storage unit 14 is not limited to these types of storage device and may be a RAM (Random Access Memory) or a ROM (Read Only Memory).

The control unit 15 has an internal memory for storing programs that define various processing procedures and control data, and executes various processes using them. As illustrated in FIG. 1, the control unit 15 includes an acquisition unit 15a, a first calculation unit 15b, a second calculation unit 15c, a determination unit 15d, and a detection unit 15e.

The acquisition unit 15a acquires the speaker's voice data. For example, when voice data is input from the input unit 11, the acquisition unit 15a acquires the input voice data. When previously recorded voice data is stored in the storage unit 14, the acquisition unit 15a can also acquire the voice data stored in the storage unit 14. FIG. 3 is a diagram for explaining an example of processing executed by the detection device according to the embodiment.

The first calculation unit 15b calculates, from the voice data acquired by the acquisition unit 15a, the fundamental frequency F0 of the speech and an index (SFM; Spectral Flatness Measure) that can evaluate the flatness of the spectrum representing the high-frequency components of the speech. For example, the first calculation unit 15b performs linear prediction analysis (LPC) on each frame (for example, 64 msec) of the speech sections of the voice data and extracts the prediction residual waveform. The first calculation unit 15b then calculates the fundamental frequency F0 by identifying the fundamental frequency of the speech. The first calculation unit 15b also calculates the index SFM using a known technique, for example, the technique described in JP 2011-242755 A. In this way, the first calculation unit 15b calculates the fundamental frequency F0 and the index SFM for each frame.

FIG. 3 is a diagram for explaining processing for one frame of speech, as an example of the processing executed by the detection device according to the embodiment. As shown in the example of FIG. 3, the first calculation unit 15b performs linear prediction analysis on each frame of the speech sections of the voice data acquired by the acquisition unit 15a and extracts the prediction residual waveform. The first calculation unit 15b then calculates the fundamental frequency F0 and the index SFM as feature amounts.
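The per-frame feature extraction can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method of the embodiment: it estimates F0 from the autocorrelation peak of the raw frame rather than from an LPC prediction residual, and it uses the generic spectral flatness definition (geometric mean divided by arithmetic mean of the power spectrum), which may differ from the index defined in JP 2011-242755 A. The function names, the F0 search range, and the band limits are hypothetical.

```python
import cmath
import math


def estimate_f0(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate F0 (Hz) as the autocorrelation peak within [f_min, f_max].

    Assumption: the raw frame is used directly; no LPC residual is computed.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [v - mean for v in frame]
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), n - 1)
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        val = sum(x[i] * x[i + lag] for i in range(n - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag


def spectral_flatness(frame, sample_rate, f_low, f_high):
    """Geometric mean / arithmetic mean of the power spectrum in [f_low, f_high].

    Values near 1 indicate a flat spectrum; values near 0 a peaky one.
    """
    n = len(frame)
    powers = []
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        if f_low <= freq <= f_high:
            # Naive DFT bin; adequate for a single short frame.
            s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            powers.append(abs(s) ** 2 + 1e-12)  # small floor avoids log(0)
    if not powers:
        raise ValueError("no DFT bins fall inside the requested band")
    geometric = math.exp(sum(math.log(p) for p in powers) / len(powers))
    arithmetic = sum(powers) / len(powers)
    return geometric / arithmetic
```

At a sampling rate of 8 kHz, a 64 msec frame corresponds to 512 samples, so a frame of that length can be passed to both functions directly.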

The second calculation unit 15c uses the "two-mass model of the vocal cords" with various parameters such as k₁ and k_c to calculate a fundamental frequency F0′ of speech and an index (SFM′) that can evaluate the flatness of the spectrum representing the high-frequency components of speech. For example, the second calculation unit 15c simulates vocal cord vibration with the "two-mass model of the vocal cords" using various parameters such as k₁ and k_c. The second calculation unit 15c then extracts one frame (for example, 64 msec) of the glottal flow, which is the vocal cord vibration obtained from the simulation, and calculates the fundamental frequency F0′ by identifying the fundamental frequency. The second calculation unit 15c also calculates the index SFM′ using a known technique, for example, the technique described in JP 2011-242755 A, in the same way as the processing performed by the first calculation unit 15b.

In the example of FIG. 3, the second calculation unit 15c simulates vocal cord vibration with the "two-mass model of the vocal cords" and calculates the fundamental frequency F0′ and the index SFM′ as feature amounts.

The determination unit 15d performs the following processing on the set of calculated feature amounts F0′ and SFM′ whose difference from the calculated feature amounts F0 and SFM is smallest. Namely, the determination unit 15d determines the values of the parameters k₁ and k_c that were used to calculate that set of F0′ and SFM′. The determination unit 15d performs this processing for each frame and thus determines the values of k₁ and k_c for each frame.

For example, for a given frame, the determination unit 15d calculates the evaluation value c given by the following Expression (4), using the feature amounts F0 and SFM calculated by the first calculation unit 15b and the feature amounts F0′ and SFM′ calculated by the second calculation unit 15c.
c = α(F0 − F0′)² + β(SFM − SFM′)² ... Expression (4)
Here, α and β are constants for weighting the terms "(F0 − F0′)²" and "(SFM − SFM′)²".

The determination unit 15d then identifies the smallest of the evaluation values c calculated for multiple combinations of values of the parameters k₁ and k_c, and identifies the set of feature amounts F0′ and SFM′ that gave that evaluation value. Next, the determination unit 15d determines the values of k₁ and k_c by identifying the values that were used to calculate the identified set of F0′ and SFM′. The determination unit 15d then transmits the types of the parameters k₁ and k_c and their determined values to the communication unit 13, so that the types and values of k₁ and k_c are transmitted to the server (not shown). The determination unit 15d performs this processing for each frame, determines the values of k₁ and k_c for each frame, and transmits them to the communication unit 13.
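The selection step described above can be sketched as follows, assuming a plain exhaustive search over a list of candidate (k₁, k_c) pairs. The names `evaluation_value`, `select_parameters`, and `toy_simulate` are hypothetical, and `toy_simulate` is only a linear stand-in used to make the example runnable; it is not a vocal cord model.

```python
def evaluation_value(f0, sfm, f0_model, sfm_model, alpha=1.0, beta=1.0):
    """Expression (4): weighted squared error between measured and modelled features."""
    return alpha * (f0 - f0_model) ** 2 + beta * (sfm - sfm_model) ** 2


def select_parameters(f0, sfm, candidates, simulate, alpha=1.0, beta=1.0):
    """Return (k1, kc, c) for the candidate pair that gives the smallest c.

    `simulate(k1, kc)` must return the modelled pair (F0', SFM').
    """
    best = None
    for k1, kc in candidates:
        f0_model, sfm_model = simulate(k1, kc)
        c = evaluation_value(f0, sfm, f0_model, sfm_model, alpha, beta)
        if best is None or c < best[0]:
            best = (c, k1, kc)
    return best[1], best[2], best[0]


def toy_simulate(k1, kc):
    """Hypothetical stand-in for the two-mass simulation (illustration only)."""
    return 100.0 + 0.5 * k1, 0.1 + 0.001 * kc


# Example: search a 3 x 3 grid of candidate values for one frame.
grid = [(k1, kc) for k1 in (100, 200, 300) for kc in (100, 200, 300)]
k1_best, kc_best, c_best = select_parameters(200.0, 0.3, grid, toy_simulate)
```

Because F0 is measured in hertz while SFM lies between 0 and 1, the weights α and β would in practice need to balance the two terms; the defaults of 1.0 here are placeholders.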

In the example of FIG. 3, for a given frame, the determination unit 15d calculates the evaluation value c given by Expression (4) above, using the feature amounts F0 and SFM calculated by the first calculation unit 15b and the feature amounts F0′ and SFM′ calculated by the second calculation unit 15c. The determination unit 15d then identifies the set of F0′ and SFM′ that gave the smallest of the evaluation values c calculated for multiple combinations of values of the parameters k₁ and k_c. Next, the determination unit 15d identifies and transmits the values of k₁ and k_c that were used to calculate the identified set of F0′ and SFM′. In this way, for each frame, the determination unit 15d evaluates the error between the feature amounts F0 and SFM and each of the multiple sets of F0′ and SFM′, determines the optimal values of k₁ and k_c, and transmits them to the communication unit 13. As a method of searching for the optimal values while updating the values of k₁ and k_c, a known technique such as the Nelder-Mead simplex method (J. A. Nelder and R. Mead, "A Simplex Method for Function Minimization," Computer Journal, Vol. 7, pp. 308-313, 1965) can be used.
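A simplified Nelder-Mead search is sketched below to illustrate how the values of k₁ and k_c could be updated iteratively. It is a minimal variant (reflection, expansion, inside contraction, and shrink with the usual coefficients), written for illustration and not taken from the embodiment. The function name `nelder_mead`, the starting point, and the quadratic `toy_cost`, which stands in for the evaluation value c obtained from a real simulation, are assumptions.

```python
def nelder_mead(f, start, step=1.0, max_iter=500, tol=1e-10):
    """Minimise f over a list of floats with a basic Nelder-Mead simplex search."""
    n = len(start)
    # Initial simplex: the start point plus one point offset along each axis.
    simplex = [list(start)]
    for i in range(n):
        p = list(start)
        p[i] += step
        simplex.append(p)
    values = [f(p) for p in simplex]
    for _ in range(max_iter):
        order = sorted(range(n + 1), key=lambda i: values[i])
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        if abs(values[-1] - values[0]) < tol:
            break
        # Centroid of every point except the worst one.
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]
        worst = simplex[-1]

        def along(t):
            # Point on the line through the centroid and the worst point.
            return [c + t * (w - c) for c, w in zip(centroid, worst)]

        reflected = along(-1.0)
        fr = f(reflected)
        if fr < values[0]:
            expanded = along(-2.0)
            fe = f(expanded)
            if fe < fr:
                simplex[-1], values[-1] = expanded, fe
            else:
                simplex[-1], values[-1] = reflected, fr
        elif fr < values[-2]:
            simplex[-1], values[-1] = reflected, fr
        else:
            contracted = along(0.5)
            fc = f(contracted)
            if fc < values[-1]:
                simplex[-1], values[-1] = contracted, fc
            else:
                # Shrink every point halfway towards the best one.
                best = simplex[0]
                for i in range(1, n + 1):
                    simplex[i] = [b + 0.5 * (p - b) for b, p in zip(best, simplex[i])]
                    values[i] = f(simplex[i])
    best_index = min(range(n + 1), key=lambda i: values[i])
    return simplex[best_index], values[best_index]


def toy_cost(params):
    """Hypothetical evaluation value c as a function of (k1, kc); minimum at (200, 200)."""
    k1, kc = params
    return 0.25 * (k1 - 200.0) ** 2 + 0.01 * (kc - 200.0) ** 2


best_params, best_c = nelder_mead(toy_cost, [150.0, 150.0], step=20.0)
```

Compared with evaluating a fixed grid of combinations, a simplex search needs no gradient of the simulation and typically calls the model far fewer times, which matters when each evaluation requires a full vocal cord simulation.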

FIG. 4A is a diagram illustrating an example of the feature amount F0 calculated by the first calculation unit, plotted over time. FIG. 4B is a diagram illustrating an example of the feature amount F0′ identified by the determination unit, plotted over time. FIGS. 4A and 4B show that the feature amount F0 in FIG. 4A and the feature amount F0′ in FIG. 4B are substantially the same. FIG. 5A is a diagram illustrating an example of the feature amount SFM calculated by the first calculation unit, plotted over time. FIG. 5B is a diagram illustrating an example of the feature amount SFM′ identified by the determination unit, plotted over time. FIGS. 5A and 5B show that the feature amount SFM in FIG. 5A and the feature amount SFM′ in FIG. 5B are substantially the same.

The detection unit 15e detects the speaker's psychological state using the parameters k₁ and k_c determined by the determination unit 15d. For example, the detection unit 15e first acquires the comparison feature amount 14a stored in the storage unit 14. The detection unit 15e then determines whether the parameter k₁ determined by the determination unit 15d falls within the range of values of the spring constant parameter k₁ in the daily state, that is, the unstressed state, indicated by the comparison feature amount 14a. If the determined k₁ does not fall within that range, the detection unit 15e detects that the speaker is in a non-daily state, such as a stressed state.

On the other hand, if the parameter k₁ determined by the determination unit 15d falls within the range of values of the spring constant parameter k₁ in the daily state, the detection unit 15e performs the following processing. Namely, the detection unit 15e determines whether the parameter k_c determined by the determination unit 15d falls within the range of values of the spring constant parameter k_c in the daily state indicated by the comparison feature amount 14a. If the determined k_c does not fall within that range, the detection unit 15e detects that the speaker is in a non-daily state, such as a stressed state. If the determined k_c falls within that range, the detection unit 15e detects that the speaker is in the daily state. The detection unit 15e can detect the speaker's state, such as the psychological state, for each frame. The detection unit 15e can also detect the speaker's state by comparing the average values of the parameters k₁ and k_c determined by the determination unit 15d over a plurality of frames with the ranges of values of k₁ and k_c indicated by the comparison feature amount 14a.
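The two-stage range check described above can be sketched as follows. The function names, the dictionary layout of the daily-state ranges, and the numeric values in the example are hypothetical; the document does not give concrete ranges for k₁ and k_c.

```python
def in_range(value, bounds):
    """True if value lies inside the closed interval (low, high)."""
    low, high = bounds
    return low <= value <= high


def detect_state(k1, kc, daily_ranges):
    """Return "daily" only if k1 and then kc fall inside their daily-state ranges."""
    if not in_range(k1, daily_ranges["k1"]):
        return "non-daily"
    if not in_range(kc, daily_ranges["kc"]):
        return "non-daily"
    return "daily"


def detect_state_over_frames(frame_params, daily_ranges):
    """Compare the averages of k1 and kc over several frames with the ranges."""
    n = len(frame_params)
    mean_k1 = sum(k1 for k1, _ in frame_params) / n
    mean_kc = sum(kc for _, kc in frame_params) / n
    return detect_state(mean_k1, mean_kc, daily_ranges)


# Hypothetical daily-state ranges standing in for the comparison feature amount 14a.
daily = {"k1": (50.0, 100.0), "kc": (20.0, 40.0)}
per_frame = detect_state(80.0, 30.0, daily)
averaged = detect_state_over_frames([(60.0, 25.0), (100.0, 35.0)], daily)
```

Averaging over several frames trades time resolution for robustness, since a single frame with a poorly fitted k₁ or k_c no longer decides the result on its own.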

図６は、従来の技術による心理状態の検出結果と、実施例に係る検出装置による心理状態の検出結果との一例を示す図である。図６の例では、「Ｓｐｅａｋｅｒ１」〜「Ｓｐｅａｋｅｒ４」の４人の男性が発話者である場合の検出結果を示す。図６の例が示す検出結果は、次のことを示す。すなわち、基本周波数Ｆ０、および、指標ＳＦＭを用いて発話者の心理状態を検出する場合に比べて、特徴量ｋ_１、ｋ_ｃを用いて発話者の心理状態を検出する場合の方が検出率が高いことを示す。 FIG. 6 is a diagram illustrating an example of a psychological state detection result obtained by a conventional technique and a psychological state detection result obtained by the detection device according to the embodiment. The example of FIG. 6 shows the detection results when four men, "Speaker 1" to "Speaker 4", are the speakers. The detection results in the example of FIG. 6 show that the detection rate is higher when the psychological state of the speaker is detected using the feature amounts k₁ and k_c than when it is detected using the fundamental frequency F0 and the index SFM.

なお、実験によれば、日常状態では、パラメータｋ_１は、小さい値となり、パラメータｋ_ｃは、大きい値となる。また、非日常状態では、パラメータｋ_１は、男性では大きく、女性では小さい値となり、パラメータｋ_ｃは、男女ともに小さい値となる。 According to experiments, in the daily state, the parameter k ₁ has a small value and the parameter k _c has a large value. In an extraordinary state, the parameter k ₁ is large for men and small for women, and the parameter k _c is small for both men and women.

このように、検出装置１０は、心理状態との関係が強く、声帯を動かす筋肉の張力に関係し、声帯の振動に影響を与えるバネ定数のパラメータｋ_１、ｋ_ｃを、発話者の心理状態を検出する際に用いる。すなわち、検出装置１０は、発話者の心理状態との関係が弱い声道の影響が抑制された特徴量ｋ_１、ｋ_ｃを用いて心理状態を検出する。したがって、検出装置１０によれば、心理状態の検出結果の精度の低下を抑制することができる。 Thus, when detecting the psychological state of the speaker, the detection device 10 uses the spring constant parameters k₁ and k_c, which are strongly related to the psychological state, are related to the tension of the muscles that move the vocal cords, and affect the vibration of the vocal cords. That is, the detection device 10 detects the psychological state using the feature amounts k₁ and k_c, in which the influence of the vocal tract, which is only weakly related to the speaker's psychological state, is suppressed. Therefore, the detection device 10 can suppress a decrease in the accuracy of the psychological state detection result.

制御部１５は、ＡＳＩＣ（Application Specific Integrated Circuit）やＦＰＧＡ（Field Programmable Gate Array）などの集積回路またはＣＰＵ（Central Processing Unit）やＭＰＵ（Micro Processing Unit）などの電子回路である。 The control unit 15 is an integrated circuit such as an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA) or an electronic circuit such as a central processing unit (CPU) or a micro processing unit (MPU).

［処理の流れ］ [Process flow]
次に、本実施例に係る検出装置１０の処理の流れについて説明する。図７は、実施例に係る検出処理の手順を示すフローチャートである。検出処理は、例えば、入力部１１から、検出処理を実行するための指示を制御部１５が受け付けたタイミングで実行される。 Next, the process flow of the detection device 10 according to the present embodiment will be described. FIG. 7 is a flowchart illustrating the procedure of the detection process according to the embodiment. The detection process is executed, for example, when the control unit 15 receives an instruction to execute the detection process from the input unit 11.

図７に示すように、取得部１５ａは、発話者の音声データを取得する（Ｓ１０１）。続いて、第一の算出部１５ｂは、音声データから1フレーム取得し（Ｓ１０２）、ＬＰＣ残差波形を抽出する（Ｓ１０３）。そして、第一の算出部１５ｂは、基本周波数Ｆ０および指標ＳＦＭを算出する（Ｓ１０４）。 As shown in FIG. 7, the acquisition unit 15a acquires the voice data of the speaker (S101). Subsequently, the first calculation unit 15b acquires one frame from the audio data (S102), and extracts an LPC residual waveform (S103). Then, the first calculation unit 15b calculates the fundamental frequency F0 and the index SFM (S104).
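As a rough illustration of the flatness index computed in S104, the sketch below uses a common textbook definition of spectral flatness: the ratio of the geometric mean to the arithmetic mean of the power spectrum. This is an assumption. The patent applies its SFM index to the LPC residual and to high-frequency components, and its exact definition may differ. The naive DFT and the function names are illustrative only.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_flatness(frame, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum.

    eps is a small floor that keeps the logarithm finite for empty bins.
    Values near 1 indicate a flat (noise-like) spectrum; values near 0
    indicate a peaky (tonal) spectrum.
    """
    power = [p + eps for p in power_spectrum(frame)]
    geometric = math.exp(sum(math.log(p) for p in power) / len(power))
    arithmetic = sum(power) / len(power)
    return geometric / arithmetic
```

A single impulse has a perfectly flat spectrum and scores close to 1, while a pure sine wave concentrates its energy in one bin and scores close to 0.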

続いて、第二の算出部１５ｃは、ｋ_１、ｋ_ｃなどの各種パラメータを用いた「声帯の２質量モデル」により、声帯振動をシミュレーションする（Ｓ１０５）。そして、第二の算出部１５ｃは、合成された声門流から特徴量として、音声の基本周波数Ｆ０´、および、指標ＳＦＭ´を算出する（Ｓ１０６）。その後、第二の算出部１５ｃは、特徴量Ｆ０およびＳＦＭと、特徴量Ｆ０´およびＳＦＭ´を用いて、式（４）が示す誤差評価値ｃの値を算出する（Ｓ１０７）。 Subsequently, the second calculation unit 15c simulates vocal fold vibration using a “two-mass model of vocal folds” using various parameters such as k ₁ and k _c (S105). Then, the second calculation unit 15c calculates the fundamental frequency F0 ′ and the index SFM ′ of the speech as feature amounts from the synthesized glottal flow (S106). Thereafter, the second calculation unit 15c calculates the value of the error evaluation value c indicated by the equation (4) using the feature amounts F0 and SFM and the feature amounts F0 ′ and SFM ′ (S107).

決定部１５ｄは、評価値ｃが最小であるかを調べ（Ｓ１０８）、そうでない場合（Ｓ１０８否定）には、識別用特徴量として用いる２質量モデルのパラメータｋ_１、ｋ_ｃの値を更新する（Ｓ１０９）。評価値ｃが最小である場合（Ｓ１０８肯定）、特定した特徴量Ｆ０´およびＳＦＭ´の組を算出した場合のパラメータｋ_１、ｋ_ｃの値を特定することで、パラメータｋ_１、ｋ_ｃの値を決定し、出力する（Ｓ１１０）。このように、パラメータｋ_１、ｋ_ｃの値を更新しながら、最適な値を探索する方法としては、公知の技術であるNelder-Meadシンプレックス法などを利用することが可能である。 The determination unit 15d checks whether the evaluation value c is at its minimum (S108). If it is not (No in S108), the determination unit 15d updates the values of the parameters k₁ and k_c of the two-mass model, which are used as the identification feature amounts (S109). If the evaluation value c is at its minimum (Yes in S108), the determination unit 15d determines the values of the parameters k₁ and k_c by identifying the values that were used when the corresponding set of feature amounts F0′ and SFM′ was calculated, and outputs them (S110). A known technique such as the Nelder-Mead simplex method can be used to search for the optimum values while updating the parameters k₁ and k_c in this way.
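The search loop of S105 to S110 can be sketched with a small, simplified Nelder-Mead implementation. Everything specific here is an assumption made for illustration: `toy_model` is a made-up linear stand-in for the two-mass vocal fold simulation, and `error_c` assumes a sum of squared relative errors because the actual form of equation (4) is not reproduced in this section.

```python
def toy_model(k1, kc):
    """Hypothetical stand-in for the two-mass simulation (S105-S106)."""
    f0_sim = 100.0 + 50.0 * k1 + 10.0 * kc
    sfm_sim = 0.20 - 0.05 * k1 + 0.10 * kc
    return f0_sim, sfm_sim


def error_c(params, f0, sfm):
    """Assumed evaluation value c: sum of squared relative errors (S107)."""
    f0_sim, sfm_sim = toy_model(*params)
    return ((f0 - f0_sim) / f0) ** 2 + ((sfm - sfm_sim) / sfm) ** 2


def nelder_mead(f, start, step=0.5, tol=1e-14, max_iter=1000):
    """Simplified Nelder-Mead: reflection, expansion, inside contraction, shrink."""
    n = len(start)
    simplex = [list(start)]
    for i in range(n):
        vertex = list(start)
        vertex[i] += step
        simplex.append(vertex)
    for _ in range(max_iter):
        simplex.sort(key=f)
        best, second_worst, worst = simplex[0], simplex[-2], simplex[-1]
        if f(worst) - f(best) < tol:
            break
        # Centroid of every vertex except the worst one.
        centroid = [sum(v[i] for v in simplex[:-1]) / n for i in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        if f(reflected) < f(best):
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(second_worst):
            simplex[-1] = reflected
        else:
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # Shrink every other vertex halfway toward the best vertex.
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, v)]
                    for v in simplex[1:]
                ]
    return min(simplex, key=f)


def fit_frame(f0, sfm, start=(0.5, 0.5)):
    """Find the (k1, kc) pair whose simulated features best match one frame."""
    return nelder_mead(lambda p: error_c(p, f0, sfm), start)
```

With the toy model, `fit_frame(155.0, 0.20)` recovers values close to k₁ = 1.0 and k_c = 0.5, the pair that generated those features. In practice a library optimizer would normally replace the hand-written loop.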

そして、決定部１５ｄは、音声データの中に未処理の次のフレームがあるか否かを判定する（Ｓ１１１）。未処理の次のフレームがある場合（Ｓ１１１肯定）には、未処理の次のフレームを処理対象のフレームとして、Ｓ１０２に戻り、再び、上述した処理を行う。 Then, the determination unit 15d determines whether there is an unprocessed next frame in the audio data (S111). If there is an unprocessed next frame (Yes at S111), the process returns to S102 with the unprocessed next frame as a processing target frame, and the above-described processing is performed again.

一方、未処理の次のフレームがない場合（Ｓ１１１否定）には、検出部１５ｅは、決定部１５ｄにより決定された入力音声全体に対する識別用特徴量パラメータｋ_１、ｋ_ｃの分布を用いて発話者の心理状態を検出する（Ｓ１１２）。具体的には一般的な判別分析手法や識別学習が利用できる。そして、処理を終了する。 On the other hand, when there is no unprocessed next frame (No in S111), the detection unit 15e detects the psychological state of the speaker using the distribution, over the entire input speech, of the identification feature parameters k₁ and k_c determined by the determination unit 15d (S112). Specifically, a general discriminant analysis method or discriminative learning can be used. Then, the process ends.

上述してきたように、実施例に係る検出装置１０は、心理状態との関係が強く、声帯を動かす筋肉の張力に関係し、声帯の振動に影響を与えるバネ定数のパラメータｋ_１、ｋ_ｃを、発話者の心理状態を検出する際に用いる。すなわち、検出装置１０は、発話者の心理状態との関係が弱い声道の影響が抑制された特徴量ｋ_１、ｋ_ｃを用いて心理状態を検出する。したがって、検出装置１０によれば、心理状態の検出結果の精度の低下を抑制することができる。 As described above, when detecting the psychological state of the speaker, the detection device 10 according to the embodiment uses the spring constant parameters k₁ and k_c, which are strongly related to the psychological state, are related to the tension of the muscles that move the vocal cords, and affect the vibration of the vocal cords. That is, the detection device 10 detects the psychological state using the feature amounts k₁ and k_c, in which the influence of the vocal tract, which is only weakly related to the speaker's psychological state, is suppressed. Therefore, the detection device 10 can suppress a decrease in the accuracy of the psychological state detection result.

さて、これまで開示の装置に関する実施例について説明したが、本発明は上述した実施例以外にも、種々の異なる形態にて実施されてよいものである。そこで、以下では、本発明に含まれる他の実施例を説明する。 Although the embodiments related to the disclosed apparatus have been described above, the present invention may be implemented in various different forms other than the above-described embodiments. Therefore, another embodiment included in the present invention will be described below.

例えば、上記の実施例では、発話者の状態を検出する際に、バネ定数のパラメータｋ_１、ｋ_ｃを用いる場合について説明したが、開示の装置は、これに限定されない。例えば、開示の装置は、発話者の状態を検出する際に、バネ定数のパラメータｋ_１、ｋ_ｃに加え、声門下圧のパラメータＰｓを用いるようにしてもよい。ここで、声門下圧とは、声帯下の気管での空気圧を指す。また、パラメータＰｓは、上記の式（１）、式（２）において、Ｆ_１、Ｆ_２に関係するパラメータである。なお、実験によれば、日常状態では、パラメータＰｓは、大きい値となり、非日常状態では、パラメータＰｓは、とても小さい値となる。この場合、開示の装置における比較用特徴量１４ａには、上述したパラメータｋ_１、ｋ_ｃの値の範囲に加え、声門加圧についても同様の値の範囲が含まれる。すなわち、ストレスを受けていない状態である日常状態における声門下圧のパラメータＰｓの値の範囲を「声帯の２質量モデル」を用いて算出し、算出した声門下圧のパラメータＰｓの値の範囲を比較量特徴量１４ａとして採用する。 For example, the above embodiment describes the case where the spring constant parameters k₁ and k_c are used to detect the state of the speaker, but the disclosed device is not limited to this. For example, the disclosed device may use a subglottic pressure parameter Ps in addition to the spring constant parameters k₁ and k_c when detecting the state of the speaker. Here, subglottic pressure refers to the air pressure in the trachea below the vocal cords. The parameter Ps is related to F₁ and F₂ in equations (1) and (2) above. According to experiments, the parameter Ps takes a large value in the daily state and a very small value in the extraordinary state. In this case, the comparison feature amount 14a in the disclosed device includes, in addition to the ranges of values of the parameters k₁ and k_c described above, a corresponding range of values for the subglottic pressure. That is, the range of values of the subglottic pressure parameter Ps in the daily state, in which the speaker is not under stress, is calculated using the "two-mass model of vocal folds", and the calculated range is adopted as the comparison feature amount 14a.

そして、第一の算出部１５ｂは、基本周波数Ｆ０および指標ＳＦＭを算出する。また、第二の算出部１５ｃは、ｋ_１、ｋ_ｃ、Ｐｓの各種パラメータの値を変更して、フレームごとに、音声の基本周波数Ｆ０´、および、指標ＳＦＭ´を算出する。 Then, the first calculation unit 15b calculates the fundamental frequency F0 and the index SFM. Further, the second calculation unit 15c changes the values of various parameters of k ₁ , k _c , and Ps, and calculates the voice fundamental frequency F0 ′ and the index SFM ′ for each frame.

また、決定部１５ｄは、特徴量Ｆ０およびＳＦＭと、特徴量Ｆ０´およびＳＦＭ´を用いて、上記の式（４）が示す評価値ｃの値を算出する。 Further, the determination unit 15d calculates the value of the evaluation value c indicated by the above equation (4) using the feature amounts F0 and SFM and the feature amounts F0 ′ and SFM ′.

そして、決定部１５ｄは、ｋ_１、ｋ_ｃ、Ｐｓの各種パラメータの複数の値に対して算出したそれぞれの評価値ｃのうち、最も値が小さい評価値ｃを特定し、特定した評価値ｃの場合における特徴量Ｆ０´およびＳＦＭ´の組を特定し、次のような処理を行う。すなわち、決定部１５ｄは、特定した特徴量Ｆ０´およびＳＦＭ´の組を算出した場合のパラメータｋ_１、ｋ_ｃ、Ｐｓの値を特定することで、パラメータｋ_１、ｋ_ｃ、Ｐｓの値を決定する。続いて、決定部１５ｄは、パラメータｋ_１、ｋ_ｃ、Ｐｓの種類および決定したパラメータｋ_１、ｋ_ｃ、Ｐｓの値を通信部１３に送信する。その後、検出部１５ｅは、決定部１５ｄにより決定されたパラメータｋ_１、ｋ_ｃ、Ｐｓを用いて、パラメータｋ_１、ｋ_ｃを用いる場合と同様の方法で発話者の心理状態を検出する。 The determination unit 15d then identifies the smallest of the evaluation values c calculated for the plurality of values of the parameters k₁, k_c, and Ps, identifies the set of feature amounts F0′ and SFM′ corresponding to that evaluation value c, and performs the following processing. That is, the determination unit 15d determines the values of the parameters k₁, k_c, and Ps by identifying the values that were used when the identified set of feature amounts F0′ and SFM′ was calculated. Subsequently, the determination unit 15d transmits the types of the parameters k₁, k_c, and Ps and their determined values to the communication unit 13. Thereafter, the detection unit 15e detects the psychological state of the speaker using the parameters k₁, k_c, and Ps determined by the determination unit 15d, in the same manner as when the parameters k₁ and k_c are used.
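Selecting the smallest evaluation value c over a set of candidate values for k₁, k_c, and Ps amounts to an exhaustive search, which might look like the sketch below. As before, `toy_model` is an invented linear stand-in for the two-mass model, the error form in `error_c` is an assumption because equation (4) is not reproduced here, and the candidate grids are arbitrary.

```python
from itertools import product


def toy_model(k1, kc, ps):
    """Hypothetical stand-in for the two-mass simulation with subglottic pressure."""
    f0_sim = 100.0 + 50.0 * k1 + 10.0 * kc + 2.0 * ps
    sfm_sim = 0.20 - 0.05 * k1 + 0.10 * kc + 0.01 * ps
    return f0_sim, sfm_sim


def error_c(f0, sfm, f0_sim, sfm_sim):
    """Assumed evaluation value c: sum of squared relative errors."""
    return ((f0 - f0_sim) / f0) ** 2 + ((sfm - sfm_sim) / sfm) ** 2


def grid_search(f0, sfm, k1_values, kc_values, ps_values):
    """Evaluate c for every parameter combination and keep the smallest.

    Returns a tuple (c, k1, kc, ps); on ties the first combination wins.
    """
    best = None
    for k1, kc, ps in product(k1_values, kc_values, ps_values):
        c = error_c(f0, sfm, *toy_model(k1, kc, ps))
        if best is None or c < best[0]:
            best = (c, k1, kc, ps)
    return best
```

Because two observed features are matched with three free parameters, different combinations can in general give similar values of c. This sketch simply keeps the first minimum it encounters.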

図８は、従来の技術による心理状態の検出結果と、実施例および実施例の変形例に係る各検出装置による心理状態の検出結果との一例を示す図である。図８の例では、３人の男性および３人の女性が発話者である場合の検出結果を示す。図８の例が示す検出結果は、次のことを示す。すなわち、基本周波数Ｆ０、および、指標ＳＦＭを用いて発話者の心理状態を検出する場合や、特徴量ｋ_１、ｋ_ｃを用いて発話者の心理状態を検出する場合と比べて、特徴量ｋ_１、ｋ_ｃ、Ｐｓを用いて発話者の心理状態を検出する場合の方が、検出率が高いことを示す。 FIG. 8 is a diagram illustrating an example of a psychological state detection result obtained by a conventional technique and psychological state detection results obtained by the detection devices according to the embodiment and a modified example of the embodiment. The example of FIG. 8 shows the detection results when three men and three women are the speakers. The detection results in the example of FIG. 8 show that the detection rate is higher when the psychological state of the speaker is detected using the feature amounts k₁, k_c, and Ps than when it is detected using the fundamental frequency F0 and the index SFM, or using the feature amounts k₁ and k_c.

このように、変形例における検出装置は、心理状態との関係が強く、声帯を動かす筋肉の張力に関係し、声帯の振動に影響を与えるパラメータｋ_１、ｋ_ｃ、Ｐｓを、発話者の心理状態を検出する際に用いる。すなわち、検出装置は、発話者の心理状態との関係が弱い声道の影響が抑制された特徴量ｋ_１、ｋ_ｃ、Ｐｓを用いて心理状態を検出する。したがって、検出装置によれば、心理状態の検出結果の精度の低下を抑制することができる。 Thus, when detecting the psychological state of the speaker, the detection device in the modified example uses the parameters k₁, k_c, and Ps, which are strongly related to the psychological state, are related to the tension of the muscles that move the vocal cords, and affect the vibration of the vocal cords. That is, the detection device detects the psychological state using the feature amounts k₁, k_c, and Ps, in which the influence of the vocal tract, which is only weakly related to the speaker's psychological state, is suppressed. Therefore, the detection device can suppress a decrease in the accuracy of the psychological state detection result.

また、実施例において説明した各処理のうち、自動的に行われるものとして説明した処理の全部または一部を手動的に行うこともできる。また、本実施例において説明した各処理のうち、手動的に行われるものとして説明した処理の全部または一部を公知の方法で自動的に行うこともできる。 In addition, among the processes described in the embodiments, all or a part of the processes described as being automatically performed can be manually performed. In addition, among the processes described in this embodiment, all or a part of the processes described as being performed manually can be automatically performed by a known method.

また、各種の負荷や使用状況などに応じて、各実施例において説明した各処理の各ステップでの処理を任意に細かくわけたり、あるいはまとめたりすることができる。また、ステップを省略することもできる。 In addition, the processing at each step of each processing described in each embodiment can be arbitrarily finely divided or combined according to various loads and usage conditions. Also, the steps can be omitted.

また、各種の負荷や使用状況などに応じて、各実施例において説明した各処理の各ステップでの処理の順番を変更できる。 Further, the order of processing at each step of each processing described in each embodiment can be changed according to various loads and usage conditions.

また、図示した各装置の各構成要素は機能概念的なものであり、必ずしも物理的に図示の如く構成されていることを要しない。すなわち、各装置の分散・統合の具体的状態は図示のものに限られず、その全部または一部を、各種の負荷や使用状況などに応じて、任意の単位で機能的または物理的に分散・統合して構成することができる。 Further, each component of each illustrated apparatus is functionally conceptual, and does not necessarily need to be physically configured as illustrated. In other words, the specific state of distribution / integration of each device is not limited to the one shown in the figure, and all or a part thereof may be functionally or physically distributed or arbitrarily distributed in arbitrary units according to various loads or usage conditions. Can be integrated and configured.

［検出プログラム］ [Detection program]
また、上記の実施例で説明した検出装置１０の各種の処理は、あらかじめ用意されたプログラムをパーソナルコンピュータやワークステーションなどのコンピュータシステムで実行することによって実現することもできる。そこで、以下では、図９を用いて、上記の実施例で説明した検出装置１０と同様の機能を有する検出プログラムを実行するコンピュータの一例を説明する。図９は、検出プログラムを実行するコンピュータを示す図である。 The various processes of the detection device 10 described in the above embodiment can also be realized by executing a program prepared in advance on a computer system such as a personal computer or a workstation. In the following, an example of a computer that executes a detection program having the same functions as the detection device 10 described in the above embodiment will be described with reference to FIG. 9. FIG. 9 is a diagram illustrating a computer that executes the detection program.

図９に示すように、コンピュータ３００は、ＣＰＵ３１０、ＲＯＭ３２０、ＨＤＤ３３０、ＲＡＭ３４０を有する。 As illustrated in FIG. 9, the computer 300 includes a CPU 310, a ROM 320, an HDD 330, and a RAM 340.

ＲＯＭ３２０には、ＯＳなどの基本プログラムが記憶されている。また、ＨＤＤ３３０には、上記の実施例で示す取得部１５ａと、第一の算出部１５ｂと、第二の算出部１５ｃと、決定部１５ｄと、検出部１５ｅと同様の機能を発揮する検出プログラム３３０ａが予め記憶される。なお、検出プログラム３３０ａについては、適宜分離しても良い。また、ＨＤＤ３３０には、比較用特徴量が設けられる。この比較用特徴量は、上述した比較用特徴量１４ａに対応する。 The ROM 320 stores basic programs such as an OS. In addition, the HDD 330 has a detection program that exhibits the same functions as the acquisition unit 15a, the first calculation unit 15b, the second calculation unit 15c, the determination unit 15d, and the detection unit 15e described in the above embodiment. 330a is stored in advance. Note that the detection program 330a may be separated as appropriate. Further, the HDD 330 is provided with a comparison feature amount. This comparison feature amount corresponds to the comparison feature amount 14a described above.

そして、ＣＰＵ３１０が、検出プログラム３３０ａをＨＤＤ３３０から読み出して実行する。 Then, the CPU 310 reads the detection program 330a from the HDD 330 and executes it.

そして、ＣＰＵ３１０は、比較用特徴量を読み出してＲＡＭ３４０に格納する。さらに、ＣＰＵ３１０は、ＲＡＭ３４０に格納された比較用特徴量を用いて、検出プログラム３３０ａを実行する。なお、ＲＡＭ３４０に格納される各データは、常に全てのデータがＲＡＭ３３０に格納されなくともよい。処理に用いられるデータがＲＡＭ３４０に格納されれば良い。 Then, the CPU 310 reads the comparison feature amount and stores it in the RAM 340. Further, the CPU 310 executes the detection program 330a using the comparison feature amount stored in the RAM 340. Note that not all of the data needs to be held in the RAM 340 at all times; it suffices that the data used for processing is stored in the RAM 340.

なお、上記した検出プログラム３３０ａについては、必ずしも最初からＨＤＤ３３０に記憶させておく必要はない。 Note that the above-described detection program 330a is not necessarily stored in the HDD 330 from the beginning.

例えば、コンピュータ３００に挿入されるフレキシブルディスク（ＦＤ）、ＣＤ−ＲＯＭ、ＤＶＤディスク、光磁気ディスク、ＩＣカードなどの「可搬用の物理媒体」にプログラムを記憶させておく。そして、コンピュータ３００がこれらからプログラムを読み出して実行するようにしてもよい。 For example, the program is stored in a “portable physical medium” such as a flexible disk (FD), a CD-ROM, a DVD disk, a magneto-optical disk, or an IC card inserted into the computer 300. Then, the computer 300 may read and execute the program from these.

さらには、公衆回線、インターネット、ＬＡＮ、ＷＡＮなどを介してコンピュータ３００に接続される「他のコンピュータ（またはサーバ）」などにプログラムを記憶させておく。そして、コンピュータ３００がこれらからプログラムを読み出して実行するようにしてもよい。 Furthermore, the program is stored in “another computer (or server)” connected to the computer 300 via a public line, the Internet, a LAN, a WAN, or the like. Then, the computer 300 may read and execute the program from these.

１０検出装置 10 Detection device
１４記憶部 14 Storage unit
１４ａ比較用特徴量 14a Comparison feature amount
１５制御部 15 Control unit
１５ａ取得部 15a Acquisition unit
１５ｂ第一の算出部 15b First calculation unit
１５ｃ第二の算出部 15c Second calculation unit
１５ｄ決定部 15d Determination unit
１５ｅ検出部 15e Detection unit

Claims

音声を発した人物の音声データを取得する取得部と、
前記取得部により取得された音声データから第一の特徴量を算出する第一の算出部と、
所定のパラメータを用いた声帯振動のモデルから第二の特徴量を算出する第二の算出部と、
前記第二の算出部により算出された第二の特徴量のうち、前記第一の算出部により算出された第一の特徴量との差分が最小となる場合の前記第二の特徴量について、該第二の特徴量を算出したときに用いられたパラメータを決定する決定部と、
前記決定部により決定されたパラメータを用いて前記人物の心理状態を検出する検出部と、
を有することを特徴とする検出装置。 A detection device comprising:
an acquisition unit that acquires voice data of a person who has uttered a voice;
a first calculation unit that calculates a first feature amount from the voice data acquired by the acquisition unit;
a second calculation unit that calculates second feature amounts from a model of vocal cord vibration using predetermined parameters;
a determination unit that determines, for the second feature amount whose difference from the first feature amount calculated by the first calculation unit is smallest among the second feature amounts calculated by the second calculation unit, the parameters used when that second feature amount was calculated; and
a detection unit that detects the psychological state of the person using the parameters determined by the determination unit.

前記第二の算出部は、バネ定数のパラメータを用いた前記声帯振動のモデルから前記第二の特徴量を算出し、
前記決定部は、前記バネ定数のパラメータを決定する
ことを特徴とする請求項１に記載の検出装置。 The second calculation unit calculates the second feature amount from a model of the vocal cord vibration using a parameter of a spring constant,
The detection device according to claim 1, wherein the determination unit determines a parameter of the spring constant.

前記第二の算出部は、バネ定数のパラメータおよび声門下圧のパラメータを用いた前記声帯振動のモデルから前記第二の特徴量を算出し、
前記決定部は、前記バネ定数のパラメータおよび前記声門下圧のパラメータを決定する
ことを特徴とする請求項１に記載の検出装置。 The second calculation unit calculates the second feature amount from the vocal cord vibration model using a spring constant parameter and a subglottic pressure parameter,
The detection device according to claim 1, wherein the determination unit determines a parameter of the spring constant and a parameter of the subglottic pressure.

コンピュータに、
音声を発した人物の音声データを取得し、
取得された音声データから第一の特徴量を算出し、
所定のパラメータを用いた声帯振動のモデルから第二の特徴量を算出し、
算出された前記第二の特徴量のうち、算出された前記第一の特徴量との差分が最小となる場合の前記第二の特徴量について、該第二の特徴量を算出したときに用いられたパラメータを決定し、
決定された前記パラメータを用いて前記人物の心理状態を検出する、
各処理を実行させることを特徴とする検出プログラム。 A detection program that causes a computer to execute processes comprising:
acquiring voice data of a person who has uttered a voice;
calculating a first feature amount from the acquired voice data;
calculating second feature amounts from a model of vocal cord vibration using predetermined parameters;
determining, for the second feature amount whose difference from the calculated first feature amount is smallest among the calculated second feature amounts, the parameters used when that second feature amount was calculated; and
detecting the psychological state of the person using the determined parameters.

コンピュータが、
音声を発した人物の音声データを取得し、
取得された音声データから第一の特徴量を算出し、
所定のパラメータを用いた声帯振動のモデルから第二の特徴量を算出し、
算出された前記第二の特徴量のうち、算出された前記第一の特徴量との差分が最小となる場合の前記第二の特徴量について、該第二の特徴量を算出したときに用いられたパラメータを決定し、
決定された前記パラメータを用いて前記人物の心理状態を検出する、
各処理を実行することを特徴とする検出方法。 A detection method in which a computer executes processes comprising:
acquiring voice data of a person who has uttered a voice;
calculating a first feature amount from the acquired voice data;
calculating second feature amounts from a model of vocal cord vibration using predetermined parameters;
determining, for the second feature amount whose difference from the calculated first feature amount is smallest among the calculated second feature amounts, the parameters used when that second feature amount was calculated; and
detecting the psychological state of the person using the determined parameters.