JPH07334192A

JPH07334192A - Audio signal processor and speech discrimination circuit

Info

Publication number: JPH07334192A
Application number: JP6131529A
Authority: JP
Inventors: Masaki Haranishi; 正樹原西
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1994-06-14
Filing date: 1994-06-14
Publication date: 1995-12-22

Abstract

PURPOSE: To discriminate whether an audio signal input from a microphone is speech. CONSTITUTION: The audio signal input from the microphone 10 is fed through a preamplifier 12 to a level detecting circuit 14. The level detecting circuit 14 compares the input audio signal with a specific reference level and outputs the portion of the signal exceeding the reference level to an A/D converter 16. The A/D converter 16 converts the analog output of the level detecting circuit 14 into a digital signal, and a specific period's worth of its output is stored in an audio data storage circuit 18. A speech discrimination circuit 20 detects periodicity through an autocorrelation operation on the audio data stored in the audio data storage circuit 18 and decides from the detected fundamental period whether the input signal is a human voice.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to an audio signal processing device and a voice discrimination circuit.

[0002]

[Description of the Related Art] No technique has yet been established for discriminating whether an input audio signal is human voice or noise. Therefore, for example, in a video conference system in which a microphone is assigned to each participant and the camera is steered toward the speaker in response to voice input, the system judges that voice input has occurred when the level of the input audio signal reaches or exceeds a predetermined reference level, and controls the camera (its direction and zoom) so as to photograph the person associated with the microphone that received the input.

[0003]

[Problems to be Solved by the Invention] However, when the camera (its orientation) is controlled solely by the level of the input audio signal in this way, the camera is also driven by noise other than voice, and it ends up pointing in a direction unrelated to the speaker.

[0004] An object of the present invention is to provide an audio signal processing device and a voice discrimination circuit that eliminate this inconvenience.

[0005]

[Means for Solving the Problems] An audio signal processing device according to the present invention comprises: level detection means for detecting the level of an input audio signal and outputting the portion at or above a predetermined level; A/D conversion means for converting the analog audio signal output from the level detection means into a digital signal; audio signal storage means for storing the digital audio signal output from the A/D conversion means; and voice discrimination means for detecting the periodicity of the digital audio signal stored in the audio signal storage means and discriminating whether the signal is human voice according to whether the detected period lies within a predetermined range.

[0006] A voice discrimination circuit according to the present invention comprises: autocorrelation calculation means for calculating the autocorrelation of input audio data; maximum detection means for detecting predetermined maximum points from the autocorrelation function obtained by the autocorrelation calculation means; centroid calculation means for calculating a centroid value within a predetermined period from the times and correlation values of the maximum points detected by the maximum detection means; and discrimination means for discriminating whether the input audio data is voice from the time component and the correlation-value component of the centroid value obtained by the centroid calculation means.

[0007]

[Embodiment] An embodiment of the present invention will be described in detail below with reference to the drawings.

[0008] FIG. 1 is a block diagram of the signal processing device of the present invention. Reference numeral 10 denotes a microphone; 12, a preamplifier that amplifies the output of the microphone 10; 14, a level detection circuit that detects the level of the audio signal output from the preamplifier 12 and outputs the input signal exceeding a predetermined level; 16, an A/D converter that converts the analog output of the level detection circuit 14 into a digital signal; 18, an audio data storage circuit that stores the digital audio data output from the A/D converter 16; 20, a voice discrimination circuit that discriminates whether the audio data output from the audio data storage circuit 18 is voice data; and 22, an output terminal that outputs the discrimination result of the voice discrimination circuit 20 to the outside.

[0009] The operation of the circuit shown in FIG. 1 will now be described. The audio signal output from the microphone 10 is amplified by the preamplifier 12 and input to the level detection circuit 14. The level detection circuit 14 compares the input audio signal with a predetermined reference level and outputs the signal portion at or above that reference level to the A/D converter 16. The A/D converter 16 converts the analog output of the level detection circuit 14 into a digital signal, and a predetermined period's worth of that output is stored in the audio data storage circuit 18. The voice discrimination circuit 20 detects the periodicity of the audio data stored in the audio data storage circuit 18, discriminates from the detected fundamental period whether the data is human voice, and outputs the discrimination result to the output terminal 22.
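The gating performed by the level detection circuit 14 can be pictured in software roughly as follows. This is only a minimal sketch: the function name `gate_by_level` and the per-sample comparison are assumptions made for illustration, and an actual analog circuit would more plausibly compare a smoothed envelope of the signal against the reference level.

```python
def gate_by_level(samples, reference_level):
    """Return the contiguous runs of samples whose magnitude reaches reference_level.

    Hypothetical software stand-in for the level detection circuit 14;
    the per-sample comparison is a simplification.
    """
    segments = []
    current = []
    for s in samples:
        if abs(s) >= reference_level:
            current.append(s)          # sample is loud enough, keep it
        elif current:
            segments.append(current)   # a loud run just ended
            current = []
    if current:
        segments.append(current)       # flush a run that reaches the end
    return segments
```

Only the segments returned here would be passed on for A/D conversion and storage; quieter portions are discarded before any voice discrimination takes place.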

[0010] FIGS. 2 and 3 together form a flowchart showing the flow of the voice discrimination processing of the voice discrimination circuit 20. First, the audio data stored in the audio data storage circuit 18 is cut into blocks of time width T (S1), and each block is further cut into sections of time width τ (S2). The relationship between T and τ is shown in FIG. 4. Hereinafter, a section of time width τ is called a frame, and a section of time width T is called a block.
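The block and frame segmentation of S1 and S2 might be sketched as below. The function name and the choice of non-overlapping frames are assumptions; FIG. 4 is not reproduced here, so whether frames overlap in the original embodiment cannot be confirmed from the text. Lengths are given in samples rather than seconds.

```python
def split_into_blocks_and_frames(samples, block_len, frame_len):
    """Cut samples into blocks (width T) and each block into frames (width tau).

    block_len and frame_len are sample counts; frames are assumed
    not to overlap, which is an illustrative choice.
    """
    blocks = []
    for b in range(0, len(samples), block_len):
        block = samples[b:b + block_len]
        frames = [block[f:f + frame_len] for f in range(0, len(block), frame_len)]
        blocks.append(frames)
    return blocks
```

For instance, ten samples with `block_len=6` and `frame_len=3` give one full block of two frames followed by a shorter trailing block.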

[0011] The first frame of the first block is extracted from the audio data stored in the storage circuit 18 (S3), and linear prediction is performed on the audio data in that frame (S4). That is, letting the original signal be $S_t$ and the predicted signal be $S_{tp}$, linear prediction using the past N samples is expressed by the following equation.

[0012]

[Equation 1]
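The expression for Equation 1 is not reproduced in this text. A conventional form of N-th order linear prediction, given here as an assumption rather than as the patent's own expression, predicts each sample from a weighted sum of the N preceding samples. The coefficient symbol $\alpha_i$ is hypothetical and does not appear in the source.

```latex
S_{tp} = \sum_{i=1}^{N} \alpha_i \, S_{t-i}
```

Under this form, the prediction residual of the next step is simply the part of $S_t$ that the N past samples fail to explain.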

[0013] Next, the difference $E_t$ between the original signal $S_t$ and the predicted signal $S_{tp}$ is obtained (S5). That is,

[0014]

[Equation 2] $E_t = S_t - S_{tp}$

Next, autocorrelation processing is performed to examine the periodicity of the original signal (S6). In this embodiment, to express how much of the components up to period τ are present, the autocorrelation function is defined as follows. That is,

[0015]

[Equation 3]

[0016] The autocorrelation function obtained in S6 is normalized (S7). That is, letting the normalized autocorrelation function be $R_n$, the following equation is evaluated.

[0017]

[Equation 4] $R_n(\tau) = R(\tau) / R(0)$

This autocorrelation function is shown in FIG. 5.
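The normalization of S7 can be illustrated with the short sketch below. Because Equation 3 is not reproduced in this text, the unnormalized autocorrelation used here is the textbook sum of lagged products, which is an assumption; the text also does not settle whether the original signal or the residual $E_t$ is the input, so the function simply accepts any frame. The function name is hypothetical.

```python
def normalized_autocorrelation(frame, max_lag):
    """Return [Rn(0), Rn(1), ..., Rn(max_lag)] with Rn(lag) = R(lag) / R(0).

    R(lag) is the plain sum of lagged products (an assumed definition).
    max_lag is expected to be smaller than len(frame).
    """
    n = len(frame)
    r = [
        sum(frame[t] * frame[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    ]
    if r[0] == 0:
        # An all-zero frame has no energy; report zero correlation throughout.
        return [0.0] * (max_lag + 1)
    return [value / r[0] for value in r]
```

With this normalization $R_n(0)$ is always 1, so a single threshold can be applied to frames of different loudness.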

[0018] It is determined whether the correlation values normalized in S7 have peak values exceeding an empirically determined threshold, and those peak values are extracted (S8). In the example shown in FIG. 5, this processing extracts the portions marked by arrows.
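The peak extraction of S8 might look like the following sketch. The local-maximum test, the function name, and the `lag_ms` parameter (the duration of one lag step in milliseconds) are assumptions for illustration; the source only states that peaks above an empirically determined threshold are extracted.

```python
def find_peaks_above(rn, threshold, lag_ms):
    """Return (time_ms, correlation) pairs for local maxima of rn above threshold.

    Lag 0 is skipped because Rn(0) is always 1 and carries no pitch information.
    """
    peaks = []
    for lag in range(1, len(rn) - 1):
        is_local_max = rn[lag] > rn[lag - 1] and rn[lag] >= rn[lag + 1]
        if is_local_max and rn[lag] > threshold:
            peaks.append((lag * lag_ms, rn[lag]))
    return peaks
```

Each returned pair corresponds to one of the arrow-marked points: a time (lag) and the correlation value at that time.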

[0019] The above is the processing of the first frame of the first block. Next, the second frame of the first block is extracted, the same processing as for the first frame (S10 to S14) is executed, and peak values are extracted. The centroid of the peak values is then obtained from the peak values extracted in the first frame and those extracted in the second frame.

[0020] The peak-value centroid extraction processing will be described with reference to FIG. 6. FIG. 6(a) shows the peak values of the autocorrelation function obtained in the first frame, and FIG. 6(b) shows the peak values of the autocorrelation function obtained in the second frame. The result of integrating these peak values (S15) is as shown in FIG. 6(c). As shown in FIG. 6(c), let the times be $t_1$, $t_2$ and $t_3$ in ascending order, and let the correlation values at those times be $p(t_1)$, $p(t_2)$ and $p(t_3)$; the centroid value is then obtained as follows. That is, letting the zeroth-order moment of the autocorrelation function be $m_{op}$,

[0021]

[Equation 5]

[0022] letting the zeroth-order moment of time be $m_{ot}$,

[0023]

[Equation 6]

[0024] and letting the first-order moment be $m_1$,

[0025]

[Equation 7]

[0026] the time centroid value $t_g$ is

[0027]

[Equation 8] $t_g = m_1 / m_{op}$

[0028] The correlation-value centroid $p_g$, on the other hand, is obtained by calculating

[0029]

[Equation 9] $p_g = m_1 / m_{ot}$

The resulting centroid value is as shown in FIG. 6(d).
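The centroid computation can be sketched as follows. Equations 5 to 7 are not reproduced in this text, so the moment definitions used here are assumptions chosen to be consistent with Equations 8 and 9: $m_{op} = \sum_i p(t_i)$, $m_{ot} = \sum_i t_i$, and $m_1 = \sum_i t_i \, p(t_i)$. The function name is hypothetical.

```python
def peak_centroid(peaks):
    """Return (t_g, p_g) for a list of (time, correlation) peaks, or None.

    Assumed moments: m_op = sum of p, m_ot = sum of t, m_1 = sum of t * p.
    Then t_g = m_1 / m_op (Equation 8) and p_g = m_1 / m_ot (Equation 9).
    """
    m_op = sum(p for _, p in peaks)
    m_ot = sum(t for t, _ in peaks)
    m_1 = sum(t * p for t, p in peaks)
    if m_op == 0 or m_ot == 0:
        return None  # no usable peaks, so no centroid can be formed
    return m_1 / m_op, m_1 / m_ot
```

Under these assumed definitions, $t_g$ is the correlation-weighted mean of the peak times and $p_g$ is the time-weighted mean of the peak correlations, so two peaks at 4 ms and 8 ms with equal correlation 0.5 yield a centroid at 6 ms with correlation 0.5.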

[0030] It is determined whether the time component $t_g$ of the centroid value thus obtained lies within the range of 3 msec to 15 msec, which is the range in which the human pitch period exists (S17). When this condition (S17) is satisfied, it is further determined whether the correlation-value component $p_g$ of the centroid exceeds an empirically determined threshold (S18). When this condition (S18) is also satisfied, the input signal is judged to have been voice, and the entire signal in the first block currently being processed is judged to be voice (S19).
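The two-stage test of S17 and S18 reduces to a short predicate. The function and parameter names are hypothetical; the 3 msec and 15 msec limits come from the text, while the threshold for $p_g$ is described only as empirically determined and must be supplied by the caller.

```python
def is_voice(t_g_ms, p_g, p_threshold, t_min_ms=3.0, t_max_ms=15.0):
    """Apply S17 (pitch-period range) and then S18 (correlation threshold)."""
    in_pitch_range = t_min_ms <= t_g_ms <= t_max_ms   # S17
    return in_pitch_range and p_g > p_threshold       # S18
```

A period of 3 msec to 15 msec corresponds to a fundamental frequency of roughly 67 Hz to 333 Hz.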

[0031] When the signal is not judged to be voice, the third frame of the first block is extracted next, the same processing as for the second frame (S10 to S14) is executed, and peak values are extracted. Using the peak values extracted here and the centroid of the peak values of the first and second frames, a further centroid of the peak values is obtained. Specifically, the new centroid is obtained by substituting the centroid obtained from the first and second frames for the first-frame peak values of FIG. 6(a), and the third-frame peak values for the second-frame peak values of FIG. 6(b).

[0032] The centroid thus obtained is the centroid up to the third frame. When this centroid satisfies the conditions for human voice (S17 and S18), the audio signal of the first block is judged to be voice, and processing moves on to the second block. If it is again not judged to be voice, the fourth frame of the first block is extracted and the same processing is executed. This yields the centroid of the peak values up to the fourth frame of the first block.

[0033] In this way, the frames of the first block are processed one after another until the signal is judged to be human voice. Once it is judged to be voice, the centroid value accumulated so far is initialized, and centroid extraction processing is started afresh for the second block.

[0034] If the signal has not been judged to be voice by the last frame of the first block (that is, if the centroid never satisfied the conditions for human voice), the audio signal of the first block is finally judged not to have been voice, the centroid value of the peak values obtained so far is initialized, and the same centroid extraction processing as for the first block is executed for the second and subsequent blocks.
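The per-block loop described in paragraphs [0019] through [0034] might be organized as in the sketch below. It takes the peaks of each frame as already extracted, treats the running centroid as a single peak when the next frame is merged in, and stops at the first frame for which the voice conditions hold. The function names, the handling of frames without peaks, and the moment definitions inside `_centroid` are assumptions made for illustration.

```python
def _centroid(peaks):
    # Assumed moments: m_op = sum p, m_ot = sum t, m_1 = sum t * p.
    m_op = sum(p for _, p in peaks)
    m_ot = sum(t for t, _ in peaks)
    m_1 = sum(t * p for t, p in peaks)
    if m_op == 0 or m_ot == 0:
        return None
    return m_1 / m_op, m_1 / m_ot


def classify_block(frames_peaks, p_threshold, t_min_ms=3.0, t_max_ms=15.0):
    """Return True as soon as the running centroid satisfies S17 and S18.

    frames_peaks is a list with one entry per frame, each a list of
    (time_ms, correlation) peaks. Returning False means the last frame
    was reached without the block being judged voice.
    """
    carried = []  # peaks of frame 1, later the running centroid as one peak
    for index, peaks in enumerate(frames_peaks):
        pooled = carried + list(peaks)
        if index == 0:
            carried = pooled          # the first frame only supplies peaks
            continue
        centroid = _centroid(pooled)
        if centroid is None:
            carried = pooled          # nothing to evaluate yet
            continue
        t_g, p_g = centroid
        if t_min_ms <= t_g <= t_max_ms and p_g > p_threshold:
            return True               # S19: the whole block is judged voice
        carried = [centroid]          # the centroid stands in for earlier frames
    return False
```

Because `carried` is a local variable, each call starts from an empty state, which mirrors the initialization of the centroid value before the next block is processed.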

[0035] By providing the audio signal processing device of the above embodiment for each microphone and controlling a camera according to the voice discrimination result output by each audio signal processing device, the invention can be used as a camera control device for video conferencing. FIG. 7 is a schematic block diagram of such a configuration. Components identical to those in FIG. 1 bear the same reference numerals. Reference numeral 30 denotes an audio signal processing device having the configuration shown in FIG. 1; 32, a camera unit; and 34, a camera control circuit that, according to the output of the voice discrimination circuit 20 of each audio signal processing device 30, directs the camera unit 32 toward the person using the microphone to which voice was input. The camera control circuit 34 naturally also controls the camera unit 32 according to camera control signals from a system control circuit or the like (not shown).

[0036] FIG. 8 is a flowchart of the camera control for microphone #1 shown in FIG. 7. The voice discrimination circuit 20 of the audio signal processing device 30 determines whether the input is voice (S31). Depending on the result, if it is voice (S32), the camera control circuit 34 sets the microphone #1 flag in the camera control field (S34); if it is not voice (S32), it clears the microphone #1 flag in the camera control field (S33).

[0037] After S33 or S34, the camera control circuit 34 checks whether the microphone #1 flag in the camera control field is set (S35) and, if it is, controls the pan head so that the camera unit 32 faces the user of microphone #1 (S36).
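The flag handling of FIG. 8 can be sketched as follows. The dictionary standing in for the camera control field and the `point_camera_at` callback standing in for the pan head drive are both hypothetical; the source does not describe how the field or the pan head interface is implemented.

```python
def update_camera(mic_id, voice_detected, flags, point_camera_at):
    """Run steps S32 to S36 for one microphone.

    flags is a dict acting as the camera control field; point_camera_at
    is a caller-supplied function that steers the camera (assumed interface).
    Returns True if the camera was steered.
    """
    flags[mic_id] = bool(voice_detected)   # S34 sets the flag, S33 clears it
    if flags[mic_id]:                      # S35: is the flag set?
        point_camera_at(mic_id)            # S36: face the user of this microphone
        return True
    return False
```

Keeping the flag in a shared field rather than steering the camera directly from the discrimination result leaves room for the camera control circuit to combine it with other control signals.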

[0038]

[Effects of the Invention] As can be readily understood from the above description, the present invention makes it possible to determine accurately whether an input audio signal is voice. As a result, in a video conference, for example, the camera can be prevented from being controlled by noise.

[Brief Description of the Drawings]

FIG. 1 is a schematic block diagram of an embodiment of the present invention.

FIG. 2 is a part of a flowchart of the voice discrimination processing of the embodiment.

FIG. 3 is a part of a flowchart of the voice discrimination processing of the embodiment.

FIG. 4 is a diagram showing the relationship between frames (time width τ) and blocks (time width T) of the audio data accumulated in the storage circuit 18.

FIG. 5 is an example of an autocorrelation function.

FIG. 6 is a diagram showing the integration of the peak values of the autocorrelation function and the time component of the centroid of those peak values.

FIG. 7 is a schematic block diagram of an embodiment applied to camera control.

FIG. 8 is a flowchart of camera control for input from microphone #1 in the embodiment shown in FIG. 7.

[Explanation of Symbols]

10: Microphone
12: Preamplifier
14: Level detection circuit
16: A/D converter
18: Audio data storage circuit
20: Voice discrimination circuit
22: Output terminal
30: Audio signal processing device
32: Camera unit
34: Camera control circuit

Claims

[Claims]

1. An audio signal processing device comprising: level detection means for detecting the level of an input audio signal and outputting the portion at or above a predetermined level; A/D conversion means for converting the analog audio signal output from the level detection means into a digital signal; audio signal storage means for storing the digital audio signal output from the A/D conversion means; and voice discrimination means for detecting the periodicity of the digital audio signal stored in the audio signal storage means and discriminating whether the signal is human voice according to whether the detected period lies within a predetermined range.

2. The audio signal processing device according to claim 1, wherein the voice discrimination means cuts the data stored in the audio signal storage means into blocks of time width T, further divides each block into frames of time width τ, obtains the periodicity in each frame of each block in turn by autocorrelation processing, detects maximum points satisfying a predetermined condition in the resulting autocorrelation function, calculates from the times and correlation values of the detected maximum points a centroid value that is a representative value within the block, and discriminates the signal as voice when the time component of the centroid value is within a predetermined range and the correlation-value component of the centroid value is larger than a predetermined threshold.

3. The audio signal processing device according to claim 2, wherein the predetermined range of the time component of the centroid value is 3 msec or more and 15 msec or less.

4. A voice discrimination circuit comprising: autocorrelation calculation means for calculating the autocorrelation of input audio data; maximum detection means for detecting predetermined maximum points from the autocorrelation function obtained by the autocorrelation calculation means; centroid calculation means for calculating a centroid value within a predetermined period from the times and correlation values of the maximum points detected by the maximum detection means; and discrimination means for discriminating whether the input audio data is voice from the time component and the correlation-value component of the centroid value obtained by the centroid calculation means.

5. The voice discrimination circuit according to claim 4, wherein the discrimination means discriminates the input audio data as voice when the time component of the centroid value is within a predetermined range and the correlation-value component of the centroid value is larger than a predetermined threshold.