JP5879813B2

JP5879813B2 - Multiple sound source identification device and information processing device linked to multiple sound sources

Info

Publication number: JP5879813B2
Application number: JP2011178223A
Authority: JP
Inventors: 茂出木　敏雄; 敏雄茂出木; 智子小堀; 志保平山; 由愛田中; 綾子宮澤; 玲美佐藤
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2011-08-17
Filing date: 2011-08-17
Publication date: 2016-03-08
Anticipated expiration: 2031-08-17
Also published as: JP2013041128A

Description

The present invention relates to a technique for identifying sound sources that emit acoustic signals, such as acoustic instruments, natural sounds, sounds produced by humans and other living things, and man-made machines other than electronic musical instruments, and in particular to a technique for identifying a plurality of sound sources.

One known way to use a musical instrument as the user interface of a digital information terminal is to use an electronic instrument that complies with the MIDI (Musical Instrument Digital Interface) standard. MIDI-compliant keyboard instruments (keyboards, player pianos), silent string instruments, silent wind instruments, silent drums, and the like have already been developed, and an information terminal can be operated by playing such an instrument through the MIDI interface (see, for example, Patent Document 1).

Patent Document 1: Japanese Patent No. 3717640
Patent Document 2: Japanese Patent No. 3935745
Patent Document 3: JP 2011-107265 A
Patent Document 4: JP 2007-325201 A
Patent Document 5: Japanese Patent No. 4156252

However, this approach is limited to electronic instruments that comply with the MIDI standard, whereas a nearly countless variety of acoustic instruments exists. One conceivable alternative is to use an encoding technique, such as that disclosed in Patent Document 2, which converts the acoustic signal output from an acoustic instrument into MIDI data. The invention of Patent Document 2 can recognize the rhythm and pitch being played and pass them to an information terminal in a MIDI-compliant form. It cannot recognize timbre, however, so it is difficult to identify which instrument was played, and more difficult still to identify each instrument when several are sounded at the same time.

As a sound source separation technique for this purpose, methods using multiple microphones or a microphone array have been proposed (see Patent Document 4), but the apparatus becomes large and costly, and separation is difficult when the sound sources are physically close together, as with musical instruments. The applicant has also proposed a method that uniformly encodes into MIDI format a single acoustic signal in which several instrument performances were mixed and recorded, and then separates the encoded MIDI data into multiple channels using a predefined timbre management table (see Patent Document 5). However, that method separates sources accurately only when their timbres differ markedly, as with piano and vocals or piano and guitar; a combination such as a grand piano and an electric piano is difficult to separate.

In recent years, voice recognition input has also begun to be introduced in places as a user interface for information terminals such as digital signage, as an alternative to existing digital input devices such as keyboards, mice, and touch panels. In the crowded places where signage is installed, however, voice input accuracy is very poor and the output audio cannot be heard clearly, so voice input and output have not become widespread in digital signage. There has thus been no example of a user interface that uses sound input other than voice, such as instrument sounds, let alone any means of substituting a plurality of acoustic instruments for a user interface.

Accordingly, an object of the present invention is to provide a multiple sound source identification device capable of identifying each sound source from the sound emitted simultaneously by a plurality of sound sources such as acoustic instruments, natural sounds, and humans and other living things, and an information processing device linked to multiple sound sources.

To solve the above problem, a first aspect of the present invention provides a multiple sound source identification device that acquires sound emitted from a plurality of sound sources and identifies each sound source, the device comprising: a sound source database in which, for each sound source, registered feature data expressing its features is registered in association with identification information specifying that sound source; acoustic signal acquisition means for recording the sound emitted from the plurality of sound sources and acquiring it as a digital acoustic signal; feature data generation means for performing frequency analysis on the acoustic signal and generating search feature data based on a search-side average spectrum, which is a spectrum averaged over time; sound source database search means for calculating the correlation between the generated search feature data and each item of registered feature data in the sound source database and specifying, by the identification information, the sound source corresponding to the registered feature data that has the largest of the obtained correlation values, provided that this value is at or above a predetermined threshold; and feature data correction means for correcting the search feature data using the registered feature data corresponding to the sound source specified by the sound source database search means, thereby generating corrected feature data. The sound source database search means executes the sound source specifying process using the generated corrected feature data and the registered feature data in the sound source database, excluding any sound source already specified. The feature data correction means uses the registered feature data registered in the sound source database for the specified sound source to create a registration-side average spectrum, which is the time-averaged spectrum from which that registered feature data was created. For the frequency number that gives the largest component of the registration-side average spectrum, it divides the corresponding value of the search-side average spectrum by that largest component; then, for every frequency number, including the one giving the largest component, it multiplies the registration-side average spectrum of the specified sound source by the resulting quotient and subtracts the product from the search-side average spectrum, thereby correcting the search-side average spectrum at each frequency number. The sound source database search means and the feature data correction means repeat this processing until a predetermined condition is satisfied.
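
The search-and-correct loop of the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the device's actual implementation: the feature data is taken to be the average spectrum itself, the correlation is assumed to be a normalized inner product, negative values after subtraction are clamped to zero, and the threshold, the `max_sources` stopping condition, and all function names are hypothetical.

```python
import math

def correlation(a, b):
    """Normalized inner product (an assumed stand-in for the correlation calculation)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def subtract_source(vq, vd):
    """Remove the identified source's registration-side spectrum vd from the search-side spectrum vq."""
    jmax = max(range(len(vd)), key=lambda j: vd[j])  # frequency number of vd's largest component
    if vd[jmax] == 0:
        return list(vq)
    scale = vq[jmax] / vd[jmax]
    # Clamping at zero is an added assumption; the text only describes the subtraction.
    return [max(q - scale * d, 0.0) for q, d in zip(vq, vd)]

def identify_sources(vq, database, threshold=0.5, max_sources=4):
    """database maps a source ID to its registration-side average spectrum."""
    found, remaining = [], dict(database)
    while remaining and len(found) < max_sources:
        best_id = max(remaining, key=lambda sid: correlation(vq, remaining[sid]))
        if correlation(vq, remaining[best_id]) < threshold:
            break  # no remaining candidate is similar enough
        found.append(best_id)
        # Correct the search-side spectrum and exclude the identified source.
        vq = subtract_source(vq, remaining.pop(best_id))
    return found
```

For instance, with two registered spectra `[4, 1, 0, 0]` and `[0, 0, 1, 3]`, a mixture `[4, 1, 0.5, 1.5]` first matches the former; after subtraction the residual `[0, 0, 0.5, 1.5]` matches the latter.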

According to the first aspect of the present invention, frequency analysis is performed on an acoustic signal obtained by digitizing the sound emitted from the sound sources, search feature data is generated based on the time-averaged spectrum, and the registered feature data with the highest correlation is specified by comparison with each item of registered feature data in the sound source database. The search feature data is then corrected using the specified registered feature data to obtain corrected feature data, and the process of comparing the corrected feature data with each item of registered feature data in the sound source database and specifying the one with the highest correlation is repeated. As a result, each sound source can be identified from the sound emitted simultaneously by a plurality of sound sources. In addition, according to the first aspect, the registration-side average spectrum Vd(j), from which the registered feature data is created, is registered in advance. For the frequency number jmax that maximizes the registration-side average spectrum Vdmax(j) of the specified sound source, the corresponding value Vq(jmax) of the search-side average spectrum is divided by the largest component Vdmax(jmax); then, for every frequency number j, including jmax, Vdmax(j) multiplied by this quotient is subtracted from the search-side average spectrum Vq(j). This corrects Vq(j) at every frequency number j, and hence the search feature data. The portion corresponding to the registration-side average spectrum of the specified sound source is thereby removed from the search-side average spectrum of the acquired acoustic signal, so the corrected search-side average spectrum represents the sound sources that remain after the specified first sound source is excluded, and the second and subsequent sounds can be identified accurately.

A second aspect of the present invention provides a multiple sound source identification device that acquires sound emitted from a plurality of sound sources and identifies each sound source, the device comprising: a sound source database in which, for each sound source, registered feature data expressing its features is registered in association with identification information specifying that sound source; acoustic signal acquisition means for recording the sound emitted from the plurality of sound sources and acquiring it as a digital acoustic signal; feature data generation means for performing frequency analysis on the acoustic signal and generating search feature data based on values obtained by multiplying each frequency component of the search-side average spectrum, which is a spectrum averaged over time, by a weight that increases as the frequency value increases; sound source database search means for calculating the correlation between the generated search feature data and each item of registered feature data in the sound source database and specifying, by the identification information, the sound source corresponding to the registered feature data that has the largest of the obtained correlation values, provided that this value is at or above a predetermined threshold; and feature data correction means for correcting the search feature data using the registered feature data corresponding to the sound source specified by the sound source database search means, thereby generating corrected feature data. The sound source database search means executes the sound source specifying process using the generated corrected feature data and the registered feature data in the sound source database, excluding any sound source already specified, and the sound source database search means and the feature data correction means repeat this processing until a predetermined condition is satisfied.

According to the second aspect of the present invention, frequency analysis is performed on an acoustic signal obtained by digitizing the sound emitted from the sound sources, search feature data is generated based on the time-averaged spectrum, and the registered feature data with the highest correlation is specified by comparison with each item of registered feature data in the sound source database. The search feature data is then corrected using the specified registered feature data to obtain corrected feature data, and the process of comparing the corrected feature data with each item of registered feature data in the sound source database and specifying the one with the highest correlation is repeated. As a result, each sound source can be identified from the sound emitted simultaneously by a plurality of sound sources. In addition, according to the second aspect, the feature data is generated by multiplying each frequency component of the search-side average spectrum, which is a spectrum averaged over time, by a weight that increases as the frequency value increases. Even when sound sources differ only slightly in their high-frequency components, the difference becomes clear, so the features of sounds emitted by acoustic instruments, natural sounds, humans and other living things, and the like can be clearly distinguished and the correct sound source can be identified.
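
The effect of the frequency-dependent weight can be illustrated with a small sketch. The weight `j + 1` for frequency number `j` is only one assumed choice of a weight that grows with frequency, the correlation is again assumed to be a normalized inner product, and the two four-component spectra are made-up values, not measured data.

```python
import math

def weight_by_frequency(spectrum):
    """Multiply component j by (j + 1), a hypothetical weight that grows with frequency."""
    return [(j + 1) * v for j, v in enumerate(spectrum)]

def correlation(a, b):
    """Normalized inner product (an assumed stand-in for the correlation calculation)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

# Two illustrative spectra that differ only in their small high-frequency components.
source_a = [10.0, 5.0, 1.0, 0.5]
source_b = [10.0, 5.0, 0.5, 1.0]

plain = correlation(source_a, source_b)
weighted = correlation(weight_by_frequency(source_a), weight_by_frequency(source_b))
```

Here `weighted` comes out lower than `plain`, meaning the two sources look less alike once the high-frequency components are emphasized, which is the separation effect the paragraph describes.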

In a third aspect of the present invention, the feature data correction means in the multiple sound source identification device of the second aspect obtains a correlation value V1 between the registered feature data corresponding to the sound source specified by the sound source database search means and the search feature data, and generates the corrected feature data by subtracting from the search feature data the specified registered feature data multiplied by the correlation value V1.

According to the third aspect of the present invention, the correlation value V1 between the registered feature data corresponding to the specified sound source and the search feature data is obtained, and the corrected feature data is generated by subtracting from the search feature data the specified registered feature data multiplied by V1. Corrected feature data that excludes the features of the first sound source can therefore be generated with high accuracy while the component ratio of the first sound source in the search feature data is calculated quantitatively, and the second and subsequent sound sources can likewise be identified with high accuracy while their component ratios are calculated in the same way.
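
A minimal sketch of this correction follows. It assumes that the registered feature data is normalized to unit length and that the correlation value V1 is the plain inner product with it, so that V1 acts as the amount of the identified source contained in the search feature data; the section quoted here does not define the correlation formula, and the function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length (assumed preprocessing of registered feature data)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def correct_by_correlation(query, registered):
    """Return (V1, corrected), where corrected = query - V1 * registered."""
    v1 = sum(q * d for q, d in zip(query, registered))
    corrected = [q - v1 * d for q, d in zip(query, registered)]
    return v1, corrected

registered = normalize([3.0, 4.0])   # unit vector [0.6, 0.8]
query = [2.0, 1.0]                   # equals 2 * registered + [0.8, -0.6]
v1, corrected = correct_by_correlation(query, registered)
```

Under these assumptions the corrected vector has no remaining component along the registered feature data, so the next search is not pulled back toward the source that was already identified.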

In a fourth aspect of the present invention, the feature data generation means in the multiple sound source identification device of the first aspect generates the search feature data based on values obtained by multiplying each frequency component of the search-side average spectrum by a weight that increases as the frequency value increases.

According to the fourth aspect of the present invention, the feature data is generated by multiplying each frequency component of the search-side average spectrum, which is a spectrum averaged over time, by a weight that increases as the frequency value increases. Even when sound sources differ only slightly in their high-frequency components, the difference becomes clear, so the features of sounds emitted by acoustic instruments, natural sounds, humans and other living things, and the like can be clearly distinguished and the correct sound source can be identified.

In a fifth aspect of the present invention, the feature data generation means in the multiple sound source identification device of any one of the second to fourth aspects generates, as the search feature data, a deviation vector obtained by taking the values produced by multiplying each frequency component by the weight that increases as the frequency value increases and subtracting from them the average of those weighted values.

According to the fifth aspect of the present invention, a deviation vector is generated as the search feature data by subtracting the average value from the frequency components each multiplied by a different weight, so characteristic frequency components stand out more clearly and each sound source becomes easier to identify.
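
The deviation vector can be sketched in a few lines. As before, the weight `j + 1` is an assumed example of a weight that increases with frequency, and the function name is hypothetical.

```python
def deviation_vector(spectrum):
    """Weight each component by (j + 1), then subtract the mean of the weighted values."""
    weighted = [(j + 1) * v for j, v in enumerate(spectrum)]
    mean = sum(weighted) / len(weighted)
    return [w - mean for w in weighted]

features = deviation_vector([4.0, 1.0, 2.0])  # weighted: [4, 2, 6], mean 4
```

Because the resulting components sum to zero, a normalized inner product between two such vectors is the Pearson correlation coefficient of the weighted spectra, which is one way to read the "correlation calculation" used in the search.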

In a sixth aspect of the present invention, the feature data generation means in the multiple sound source identification device of any one of the first to fifth aspects identifies silent sections in which the amplitude of the acoustic signal acquired by the acoustic signal acquisition means stays below a predetermined value continuously for at least a predetermined time, performs a correction that deletes the identified silent sections and thereby shortens the acoustic signal in time, and performs the frequency analysis on the corrected acoustic signal.

According to the sixth aspect of the present invention, the silent sections of the acoustic signal are deleted before the frequency analysis is performed, so even if the spacing between silent sections and sounding sections fluctuates, its effect on the spectrum underlying the feature data, and hence on sound source identification, can be prevented.
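
Silent section deletion can be sketched as follows, assuming the signal is a list of integer samples. The parameter names `amp_threshold` (the predetermined value) and `min_run` (the predetermined time, expressed in samples) are hypothetical, as are the example numbers.

```python
def remove_silence(samples, amp_threshold, min_run):
    """Drop every run of at least min_run consecutive samples with |amplitude| < amp_threshold."""
    out, run = [], []
    for s in samples:
        if abs(s) < amp_threshold:
            run.append(s)            # candidate silent sample
        else:
            if len(run) < min_run:   # quiet stretch too short to count as silence: keep it
                out.extend(run)
            run = []
            out.append(s)
    if len(run) < min_run:           # handle a quiet stretch at the end of the signal
        out.extend(run)
    return out

shortened = remove_silence([5, 0, 0, 0, 6, 0, 7], amp_threshold=1, min_run=3)
```

In this example the run of three zeros is removed while the single zero between 6 and 7 is kept, so `shortened` is `[5, 6, 0, 7]`.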

In a seventh aspect of the present invention, the feature data generation means in the multiple sound source identification device of any one of the first to sixth aspects performs the frequency analysis by dividing the acoustic signal into units of a predetermined section, applying a predefined weight function of the same length as the section to the waveform data of each section, performing a Fourier transform on the weighted waveform data, converting the result for each section into a real-valued spectrum by taking the mean square value of the real and imaginary parts, and obtaining the search-side average spectrum by averaging these real-valued spectra over all sections.

According to the seventh aspect of the present invention, the frequency analysis divides the acoustic signal into units of a predetermined section, performs a Fourier transform on the waveform data of each section, and averages the real-valued spectra obtained for each section over all sections, so search feature data that characterizes the entire signal can be generated without reducing the time resolution of the frequency analysis.
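
The section-wise analysis and averaging can be sketched with a naive discrete Fourier transform. Several choices here are assumptions: a Hann window stands in for the predefined weight function, the real-valued spectrum is taken as the magnitude sqrt(re^2 + im^2), only the lower N/2 bins are kept, and the function names are hypothetical. The O(N^2) transform is for illustration only.

```python
import cmath
import math

def hann(n):
    """Hann window, used here only as a placeholder weight function."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def magnitude_spectrum(frame):
    """Naive DFT; returns the magnitudes of bins 0 .. N/2 - 1."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
            for k in range(n // 2)]

def average_spectrum(samples, n, hop=None):
    """Average the windowed magnitude spectra of all complete sections of length n."""
    hop = hop or n
    window = hann(n)
    spectra = [magnitude_spectrum([s * w for s, w in zip(samples[start:start + n], window)])
               for start in range(0, len(samples) - n + 1, hop)]
    return [sum(col) / len(spectra) for col in zip(*spectra)]

# A sine wave with exactly 3 cycles per 16-sample section.
tone = [math.sin(2 * math.pi * 3 * i / 16) for i in range(64)]
avg = average_spectrum(tone, 16)
```

For this test tone the averaged spectrum has 8 components and its largest one is at frequency number 3.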

In an eighth aspect of the present invention, in the multiple sound source identification device of the seventh aspect, adjacent sections overlap each other by a time width of one half of the section length, and two weight functions, each asymmetric along the time axis, are defined; one weight function is applied to odd-numbered sections and the other to even-numbered sections.

According to the eighth aspect of the present invention, adjacent sections overlap by one half of the section length, two weight functions that are asymmetric along the time axis are defined, and different weight functions are applied to odd-numbered and even-numbered sections. This makes it possible to distinguish a reversed waveform, obtained by playing the acoustic signal backward, from the sound of the original acoustic signal played normally.
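
The sketch below illustrates why asymmetry matters, under stated assumptions. With a weight function that is symmetric in time, a section and its time-reversed copy give identical magnitude spectra, so forward and backward playback cannot be told apart; with an asymmetric one they generally differ. The linear ramp is a hypothetical asymmetric shape, not the weight functions of FIG. 11, and `frame_plan` only shows the half-overlap layout with alternating window labels.

```python
import cmath
import math

def magnitudes(frame):
    """Naive DFT magnitudes for bins 0 .. N/2 - 1."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
            for k in range(n // 2)]

def apply_window(frame, window):
    return [x * w for x, w in zip(frame, window)]

def frame_plan(num_samples, n):
    """(start index, window label) pairs: sections overlap by n/2 and alternate windows."""
    return [(start, "A" if i % 2 == 0 else "B")
            for i, start in enumerate(range(0, num_samples - n + 1, n // 2))]

n = 8
symmetric = [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
ramp = [(i + 1) / n for i in range(n)]   # hypothetical asymmetric weight function
decay = [0.5 ** i for i in range(n)]     # envelope resembling a struck note
backward = decay[::-1]                   # the same section played in reverse

sym_fwd = magnitudes(apply_window(decay, symmetric))
sym_bwd = magnitudes(apply_window(backward, symmetric))
asym_fwd = magnitudes(apply_window(decay, ramp))
asym_bwd = magnitudes(apply_window(backward, ramp))
```

Here `sym_fwd` and `sym_bwd` agree to within rounding error, while `asym_fwd` and `asym_bwd` differ clearly, already at frequency number 0.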

A ninth aspect of the present invention provides an information processing device linked to multiple sound sources, which uses the identification information specified by the multiple sound source identification device of any one of the first to eighth aspects. The device stores programs that perform different processing for each item of identification information, starts the program corresponding to the identification information output from the multiple sound source identification device, and executes processing in accordance with that program.

According to the ninth aspect of the present invention, the information processing device executes processing based on the identification information specified by the multiple sound source identification device, so simply pointing a microphone at the emitted sound causes the processing corresponding to that sound source to be executed.
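
The link between identification information and programs can be pictured as a lookup table. The source IDs and actions below are invented for illustration, and skipping IDs with no registered program is an assumption.

```python
def run_for_sources(source_ids, programs):
    """Start the program registered for each identified source ID; skip unknown IDs."""
    return [programs[sid]() for sid in source_ids if sid in programs]

# Hypothetical table: source ID -> program to start.
programs = {
    "bell": lambda: "show the menu",
    "whistle": lambda: "play the video",
}

results = run_for_sources(["whistle", "drum", "bell"], programs)
```

With this table, a recording in which a whistle, a drum, and a bell are identified starts the two registered programs in the order the sources were identified.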

According to the present invention, each sound source can be identified from the sound emitted simultaneously by a plurality of sound sources such as acoustic instruments, natural sounds, and humans and other living things.

FIG. 1 is a hardware configuration diagram of the registered feature data generation device.
FIG. 2 is a functional block diagram of the registered feature data generation device.
FIG. 3 is a flowchart showing the processing operation of the device shown in FIGS. 1 and 2.
FIG. 4 is a diagram for explaining the deletion of silent sections.
FIG. 5 is a diagram showing the spectrum of a musical instrument multiplied by weights proportional to the frequency value.
FIG. 6 is a hardware configuration diagram of the multiple sound source identification device according to the present invention.
FIG. 7 is a functional block diagram of the multiple sound source identification device according to the present invention.
FIG. 8 is a flowchart showing the processing operation of the device shown in FIGS. 6 and 7.
FIG. 9 is a flowchart showing details of the sound source search process in S230 of FIG. 8.
FIG. 10 is a diagram showing the amplitude change of a sound and its reversed waveform.
FIG. 11 is a diagram showing the window functions used in the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
(1. Preparation of sound source database)
In the present invention, the sound emitted from a sound source is subjected to frequency analysis to generate a spectrum for each sound source. Feature data expressing the features of the sound source is generated from that spectrum and recorded in the sound source database together with the identification information of the sound source and other data. The sound to be identified is then acquired, frequency analysis is performed on it to generate feature data, and the sound source is identified by comparing and matching this feature data against the feature data in the database.

A sound source database storing feature data for each sound source must therefore be prepared first. The feature data to be registered in the sound source database is generated by a registered feature data generation device. FIG. 1 is a hardware configuration diagram of the registered feature data generation device. The device can be implemented on a general-purpose computer and, as shown in FIG. 1, comprises a CPU 1 (Central Processing Unit); a RAM 2 (Random Access Memory) serving as the computer's main memory; a large-capacity storage device 3 (for example, a hard disk or flash memory) for storing data and the programs executed by the CPU; a key input I/F (interface) 4 for a keyboard, mouse, and the like; a data input/output I/F (interface) 5 for data communication with external devices (data storage media and the like); and a display output I/F (interface) 6 for sending information to a display device (a liquid crystal display or the like). These components are connected to one another via a bus.

FIG. 2 is a functional block diagram showing the configuration of the registered feature data generation device. In FIG. 2, 10 is silent section deletion means, 20 is acoustic frame reading means, 30 is feature data generation means, 40 is feature data registration means, and 50 is a sound source database.

The silent section deletion means 10 deletes from the digital acoustic signal any section judged to be silent (a silent section). The acoustic frame reading means 20 sequentially reads a predetermined number N of samples at a time, as one acoustic frame, from a digital acoustic signal in which a given sound has been recorded as source material. The feature data generation means 30 performs frequency analysis on the acoustic frames read in and generates feature data expressing the features of the acoustic signal for that sound source. This feature data expresses the features of an acoustic signal with a small amount of data and, as described later, is a vector of at most N/2 dimensions. In the present invention, a sound source is identified by comparing the registered feature data in the sound source database with search feature data generated from an acoustic signal acquired by recording sound. The registered feature data and the search feature data must therefore have the same format, and both are generated by the same method. In this specification, feature data that has been registered in the sound source database is called registered feature data, and feature data generated from an acoustic signal acquired by recording sound is called search feature data. The feature data registration means 40 registers the generated feature data in the sound source database 50 as registered feature data, in association with related information about the sound source corresponding to the original acoustic signal and with a sound source ID, which is identification information for specifying the sound source. As noted above, a sound source is an origin of sound such as a musical instrument, a natural object, a living thing, or a natural phenomenon. Each of the means shown in FIG. 2 is implemented by installing a dedicated program on the hardware configuration shown in FIG. 1.

図１の記憶装置３には、ＣＰＵ１を動作させ、コンピュータを、登録特徴データ生成装置として機能させるための専用のプログラムが実装されている。この専用のプログラムを実行することにより、ＣＰＵ１は、無音区間削除手段１０、音響フレーム読込手段２０、特徴データ生成手段３０、特徴データ登録手段４０としての機能を実現することになる。また、記憶装置３は、処理に必要な様々なデータを記憶する。 In the storage device 3 of FIG. 1, a dedicated program for operating the CPU 1 and causing the computer to function as a registered feature data generation device is installed. By executing this dedicated program, the CPU 1 realizes functions as the silent section deleting unit 10, the acoustic frame reading unit 20, the feature data generating unit 30, and the feature data registering unit 40. The storage device 3 stores various data necessary for processing.

次に、図１、図２に示した登録特徴データ生成装置の処理動作について図３のフローチャートに従って説明する。まず、登録特徴データ生成装置は、事前に準備された識別情報がわかっている音源を録音した音響信号ファイルから、デジタルの音響信号を読み込む。このデジタル音響信号は、アナログ音響信号に対して、ＰＣＭ等の手法によりサンプリングを行うことにより得られたものである。例えば、サンプリング周波数が４４．１ｋＨｚの場合、１秒当たり４４１００のサンプルとして得られることになる。 Next, the processing operation of the registered feature data generation apparatus shown in FIGS. 1 and 2 will be described with reference to the flowchart of FIG. First, the registered feature data generation device reads a digital acoustic signal from an acoustic signal file in which a sound source with known identification information prepared in advance is recorded. This digital sound signal is obtained by sampling an analog sound signal by a technique such as PCM. For example, when the sampling frequency is 44.1 kHz, 44100 samples are obtained per second.

In the registered feature data generation apparatus, after the digital acoustic signal has been read, the silent section deleting means 10 deletes the silent sections (S101). Specifically, when samples whose values are below a predetermined threshold continue for a predetermined number of samples, that run is judged to be a silent section and is deleted. The per-sample threshold and the required number of consecutive silent samples can be set as appropriate. In this embodiment, assuming a sampling frequency of 44.1 kHz and 16-bit quantization (−32768 to +32767 in decimal), the threshold on the absolute value of each sample is 1000 (decimal) and the number of consecutive silent samples is 2048. At a sampling frequency of 44.1 kHz, 2048 consecutive samples correspond to about 0.046 seconds.

FIG. 4 is a diagram for explaining the deletion of silent sections. In FIG. 4, the sample values of the acoustic signal are connected by line segments and shown as a waveform. As shown in FIG. 4, a silent section is deleted not by setting the sample values in that section to 0 but by removing the samples themselves and shifting the samples of the following sounding section forward. As a result of the deletion in S101, the total number of samples in the acoustic signal decreases, and the playback time if the signal were played back as is becomes shorter. Depending on the characteristics of the sound source, silent sections do not necessarily have to be deleted. However, for sounds in which short sounding sections occur intermittently, such as percussion instruments, the average spectrum from start to end changes greatly with the ratio of silent sections to sounding sections (the performance rhythm). For example, the drum sound A prepared when registering feature data as in FIGS. 2 and 3 and the drum sound B captured by the microphone 60 shown in FIG. 7, described later, generally come from different players and have different rhythms. By executing the silent section deletion both at registration (FIGS. 2 and 3) and at identification (FIGS. 7 and 8), the difference in rhythm between drum sound A and drum sound B is absorbed, and the two can be judged to come from the same sound source. For this reason, deleting silent sections is effective for sound sources in which short sounding sections occur intermittently.
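The silent section deletion of S101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `remove_silence` is hypothetical, and it assumes that a whole run of below-threshold samples is removed once the run reaches the required length, which the text does not spell out. The defaults are the embodiment's values (absolute-value threshold 1000, run length 2048).

```python
def remove_silence(samples, threshold=1000, min_run=2048):
    """Drop every run of at least `min_run` consecutive samples whose absolute
    value is below `threshold`; the samples that follow move forward.

    Assumption: the entire qualifying run is removed, not just its tail.
    """
    kept = []
    run = []  # current run of below-threshold samples, not yet classified
    for s in samples:
        if abs(s) < threshold:
            run.append(s)
        else:
            if len(run) < min_run:
                kept.extend(run)  # too short to count as a silent section
            run = []
            kept.append(s)
    if len(run) < min_run:
        kept.extend(run)  # trailing short run is kept as well
    return kept
```

Because the samples are removed rather than zeroed, the returned list is shorter than the input, which is what makes the later time average independent of the performance rhythm.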

無音区間の削除を終えたら、続いて、音響フレーム読込手段２０が、音響信号から、所定数のサンプルを１音響フレームとして読み込む。音響フレーム読込手段２０が読み込む１音響フレームのサンプル数は、適宜設定することができるが、サンプリング周波数が４４．１ｋＨｚの場合、４０９６サンプル程度とすることが望ましい。これは、約０．０９３秒に相当する。ただし、後述する周波数変換における窓関数の利用により値が減少するサンプルを考慮して、音響フレームは、所定数分のサンプルを重複させて読み込むことにしている。本実施形態では、音響フレームの区間長のちょうど１／２となる２０４８サンプルを重複させている。したがって、先頭の音響フレームはサンプル１〜４０９６、２番目の音響フレームはサンプル２０４９〜６１４４、３番目の音響フレームはサンプル４０９７〜８１９２というように、順次読み込まれていくことになる。 When the deletion of the silent section is finished, the acoustic frame reading means 20 reads a predetermined number of samples as one acoustic frame from the acoustic signal. The number of samples of one sound frame read by the sound frame reading means 20 can be set as appropriate, but is desirably about 4096 samples when the sampling frequency is 44.1 kHz. This corresponds to about 0.093 seconds. However, in consideration of samples whose values decrease due to the use of a window function in frequency conversion described later, the acoustic frame is read by overlapping a predetermined number of samples. In this embodiment, 2048 samples that are exactly ½ of the section length of the acoustic frame are overlapped. Therefore, the first acoustic frame is sequentially read as samples 1 to 4096, the second acoustic frame is samples 2049 to 6144, and the third acoustic frame is samples 4097 to 8192.
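The overlapped frame reading described above can be sketched as follows. The function names are hypothetical, and dropping a trailing partial frame is an assumption, since the text does not say how the end of the signal is handled. Offsets are 0-based here, whereas the text numbers samples from 1.

```python
def frame_starts(total_samples, frame_len=4096, hop=2048):
    """Return the 0-based start offset of every complete frame.

    hop = frame_len / 2 gives the 50% overlap used in the embodiment.
    Assumption: a trailing partial frame is dropped.
    """
    if total_samples < frame_len:
        return []
    return list(range(0, total_samples - frame_len + 1, hop))


def read_frames(samples, frame_len=4096, hop=2048):
    """Slice `samples` into overlapping acoustic frames."""
    return [samples[s:s + frame_len]
            for s in frame_starts(len(samples), frame_len, hop)]
```

For a signal of 8192 samples this yields three frames starting at offsets 0, 2048, and 4096, that is, samples 1 to 4096, 2049 to 6144, and 4097 to 8192 in the 1-based numbering of the text.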

続いて、特徴データ生成手段３０は、読み込んだ各音響フレームに対して、周波数変換を行って、その音響フレームのスペクトルであるフレームスペクトルを得る（Ｓ１０２）。具体的には、音響フレーム読込手段２０が読み込んだ音響フレームについて、窓関数を利用して周波数変換を行う。周波数変換としては、フーリエ変換、ウェーブレット変換その他公知の種々の手法を用いることができる。本実施形態では、フーリエ変換を用いた場合を例にとって説明する。 Subsequently, the feature data generation unit 30 performs frequency conversion on each read sound frame to obtain a frame spectrum that is a spectrum of the sound frame (S102). Specifically, frequency conversion is performed on the acoustic frame read by the acoustic frame reading means 20 using a window function. As frequency conversion, Fourier transform, wavelet transform, and other various known methods can be used. In the present embodiment, a case where Fourier transform is used will be described as an example.

Here, the window function used for the Fourier transform in this embodiment is described. In general, to apply a Fourier transform to a signal, the signal has to be cut into segments of a predetermined length; if the Fourier transform is applied to such a segment as is, spurious components appear in the high-frequency range. For this reason, a Fourier transform is generally performed by first modifying the signal values with a window function called a Hanning window and then transforming the modified values.

When the Fourier transform is performed in S102, specifically, the Hanning window function W(i) = 0.5 − 0.5 cos(2πi/N) is applied to the value X(i) (i = 0, …, N−1) of sample i, and processing according to [Formula 1] below is carried out. This is executed for each acoustic frame g (g = 0, …, G−1) to obtain the real part A(g, j) and the imaginary part B(g, j) at each frequency of each acoustic frame g.

[Formula 1]
$$A(g, j) = \sum_{i=0}^{N-1} W(i)\, X(i) \cos(2\pi i j / N)$$
$$B(g, j) = \sum_{i=0}^{N-1} W(i)\, X(i) \sin(2\pi i j / N)$$
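As an illustration of [Formula 1], the following sketch evaluates the windowed sums directly for one acoustic frame. The function names are hypothetical, and the direct O(N²) summation is chosen only for readability; a practical implementation would normally use an FFT routine, which is an assumption on our part and not something the text prescribes.

```python
import math


def hann(n_samples):
    """Hanning window W(i) = 0.5 - 0.5 cos(2*pi*i/N) for i = 0..N-1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n_samples)
            for i in range(n_samples)]


def frame_spectrum(frame):
    """Return (A, B) for frequency numbers j = 1..N/2 of one frame.

    List index k holds frequency number j = k + 1.
    """
    n = len(frame)
    w = hann(n)
    xw = [w[i] * frame[i] for i in range(n)]  # windowed samples W(i) * X(i)
    a, b = [], []
    for j in range(1, n // 2 + 1):
        a.append(sum(xw[i] * math.cos(2 * math.pi * i * j / n) for i in range(n)))
        b.append(sum(xw[i] * math.sin(2 * math.pi * i * j / n) for i in range(n)))
    return a, b
```

For a pure cosine that completes exactly 8 cycles in a 64-sample frame, the magnitude of (A, B) peaks at j = 8, with smaller contributions at j = 7 and j = 9 introduced by the window.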

続いて、特徴データ生成手段３０は、時間方向の平均化を行う（Ｓ１０３）。具体的には、以下の〔数式２〕に従った処理を行い、各周波数におけるＧ個の音響フレームの時間平均スペクトルＶ（ｊ）を得る。 Subsequently, the feature data generating unit 30 performs averaging in the time direction (S103). Specifically, processing according to the following [Equation 2] is performed to obtain time-average spectra V (j) of G sound frames at each frequency.

[Formula 2]
$$V(j) = \left[ \frac{1}{G} \sum_{g=0}^{G-1} \left\{ A(g, j)^2 + B(g, j)^2 \right\} \right]^{1/2}$$

In [Formula 1] and [Formula 2] above, i is a serial number assigned to the N samples in each acoustic frame and takes the integer values i = 0, 1, 2, …, N−1. j is a serial number assigned to the frequency values in ascending order; of the computed spectrum, only the lower half, which excludes the aliased (mirrored) components, is used, so j takes the integer values j = 1, 2, …, N/2. With a sampling frequency of 44.1 kHz and N = 4096, adjacent values of j differ in frequency by about 10.8 Hz. As a result of the processing according to [Formula 2], the time-average spectrum V(j) at each frequency, averaged over all G acoustic frames, is obtained.
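The time averaging of [Formula 2] and the frequency spacing quoted above can be checked with a short sketch. The function names and the nested-list layout (`real_parts[g][k]` for frame g and list index k) are assumptions made for illustration.

```python
import math


def time_average_spectrum(real_parts, imag_parts):
    """Root-mean-square magnitude over G frames, per [Formula 2].

    real_parts[g][k] and imag_parts[g][k] hold A and B for frame g.
    """
    g_count = len(real_parts)
    n_bins = len(real_parts[0])
    return [
        math.sqrt(
            sum(real_parts[g][k] ** 2 + imag_parts[g][k] ** 2
                for g in range(g_count)) / g_count
        )
        for k in range(n_bins)
    ]


def bin_spacing_hz(sample_rate=44100, frame_len=4096):
    """Frequency difference between adjacent frequency numbers j."""
    return sample_rate / frame_len


def bin_frequency_hz(j, sample_rate=44100, frame_len=4096):
    """Frequency corresponding to frequency number j."""
    return j * sample_rate / frame_len
```

With the embodiment's values, `bin_spacing_hz()` is 44100 / 4096, about 10.77 Hz, which the text rounds to 10.8 Hz.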

Subsequently, the feature data generating means 30 calculates the deviation vector (S104). Specifically, each frequency component of the time-average spectrum V(j), averaged over all acoustic frames, is multiplied by the weight j, which increases with frequency, and the average over all frequencies is then subtracted from each weighted component. In practice, the feature data generating means 30 first executes processing according to [Formula 3] below to calculate the average value Av.

[Formula 3]
$$Av = \frac{1}{f_2 - f_1 + 1} \sum_{j=f_1}^{f_2} V(j) \cdot j$$

In [Formula 3] above, the average value Av of the time-average spectrum V(j) multiplied by the frequency number j is calculated over the frequencies from frequency number f1 to f2. The frequency numbers f1 and f2 can be set as appropriate according to the frequency band of the acoustic signals from which the registered feature data in the sound source database were generated; in this embodiment, f1 = 27 (corresponding to about 300 Hz) and f2 = 743 (corresponding to about 8000 Hz).

そして、特徴データ生成手段３０は、以下の〔数式４〕に従った処理を実行し、偏差ベクトルδＶ（ｊ）を特徴データとして算出する。 Then, the feature data generation unit 30 executes processing according to the following [Equation 4] to calculate the deviation vector δV (j) as feature data.

[Formula 4]
$$\delta V(j) = V(j) \cdot j - Av$$
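[Formula 3] and [Formula 4] together can be sketched as one small function. The name `deviation_vector` and the choice to index `v` directly by the frequency number j (a dict, or a list whose index 0 is unused) are assumptions made for illustration; the defaults f1 = 27 and f2 = 743 are the embodiment's values.

```python
def deviation_vector(v, f1=27, f2=743):
    """Return (delta, av) for the time-average spectrum `v`.

    `v[j]` must give V(j) for every frequency number j in f1..f2.
    delta[k] is the deviation for frequency number j = f1 + k.
    """
    weighted = [v[j] * j for j in range(f1, f2 + 1)]  # V(j) * j
    av = sum(weighted) / (f2 - f1 + 1)                # [Formula 3]
    delta = [x - av for x in weighted]                # [Formula 4]
    return delta, av
```

A spectrum that falls off exactly as 1/j becomes flat after weighting, so its deviation vector is zero everywhere; only departures from that falloff remain in the feature data.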

Here, the reason for multiplying by the weight j, the frequency number, when using the deviation vector as feature data is explained. The spectrum of a sound source, on which the feature data is based, has a harmonic structure, and, as shown in FIGS. 5(a) and 5(b), its signal components tend to decrease in inverse proportion to frequency. When the sound source is a musical instrument in particular, the characteristics of the timbre often appear in the higher harmonics, so the differences between sources lie in high-frequency components that are small and hard to see. In this embodiment, therefore, a spectrum weighted by a value proportional to frequency is generated as the feature data, which emphasizes the differences in the high-frequency components and makes the sound sources easier to distinguish. For example, multiplying the spectra of instrument A and instrument B shown in FIGS. 5(a) and 5(b) by a weight proportional to frequency gives the weighted spectra shown in FIGS. 5(c) and 5(d). Because larger weights are applied at higher frequencies, the difference in the high-frequency components stands out, as is clear from comparing the waveforms in FIGS. 5(c) and 5(d).

In this way, the feature data δV(j), which is a deviation vector, is generated from the acoustic signal of a sound source. This feature data δV(j) is a deviation vector of at most N/2 dimensions (2048 dimensions when N = 4096); with f1 and f2 set as described above, it has (f2 − f1 + 1) dimensions, that is, 717 dimensions when f1 = 27 and f2 = 743. Once the feature data δV(j) has been generated, the feature data registering means 40 registers the feature data δV(j) and the time-average spectrum V(j) in the sound source database 50 in association with separately entered related information such as the sound source ID and the sound source name. Both the feature data δV(j) and the time-average spectrum V(j) are needed in the multiple sound source search process described later. However, since δV(j) is itself calculated from V(j) according to [Formula 3] and [Formula 4], in practice it is sufficient to store only the time-average spectrum V(j) in a predetermined area of the storage device 3. Conversely, since V(j) can be recovered from δV(j) and the average value Av by rearranging [Formula 4], the feature data δV(j) and the average value Av may be stored in a predetermined area of the storage device 3 instead. To distinguish them from the feature data and time-average spectrum generated from the acoustic signal recorded at identification time, the feature data δV(j) and time-average spectrum V(j) registered in the sound source database are hereinafter called the registered feature data δVd(j) and the registration-side average spectrum Vd(j), respectively.
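The storage alternative mentioned above relies on rearranging [Formula 4] into V(j) = [δV(j) + Av] / j. A minimal sketch, with the hypothetical name `recover_spectrum` and a dict keyed by frequency number as an assumed return layout:

```python
def recover_spectrum(delta, av, f1=27):
    """Invert [Formula 4]: V(j) = (deltaV(j) + Av) / j.

    `delta[k]` is the deviation for frequency number j = f1 + k.
    Returns a dict mapping each frequency number j to V(j).
    """
    return {f1 + k: (d + av) / (f1 + k) for k, d in enumerate(delta)}
```

Because j starts at f1 ≥ 1, the division is always defined, so storing δV(j) together with Av loses no information relative to storing V(j) over the same range.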

(2. Identification of sound sources)
Next, the multiple sound source identification device according to the present invention is described. FIG. 6 is a hardware configuration diagram of the multiple sound source identification device according to the present invention. Like the registered feature data generation apparatus, the multiple sound source identification device can be realized with a general-purpose computer. As shown in FIG. 6, it comprises a CPU 1a (Central Processing Unit); a RAM 2a (Random Access Memory), which is the main memory of the computer; a large-capacity storage device 3a (for example, a hard disk or flash memory) for storing the programs executed by the CPU 1a and data; a key input I/F (interface) 4a for a keyboard, mouse, and the like; a data input/output I/F (interface) 5a for data communication with external devices (data storage media and the like); a display output I/F (interface) 6a for sending information to a display device; and an audio input I/F (interface) 7a connected to a microphone. These components are connected to one another via a bus.

FIG. 7 is a functional block diagram showing the configuration of the multiple sound source identification device according to the present invention. In FIG. 7, 10 is the silent section deleting means, 20 is the acoustic frame reading means, 30 is the feature data generating means, 31 is the feature data correcting means, 50 is the sound source database, 60 is the microphone, 70 is the acoustic signal acquiring means, 80 is the sound source database searching means, and 90 is the sound source information outputting means. Components with the same reference numerals as in FIG. 2 have the same functions as in FIG. 2, so their description is omitted. The processing of FIG. 7 is started by an instruction from the user and, so that the sound to be identified that the user produces is captured at the right moment, is executed repeatedly until the user gives an instruction to stop. That is, after the final output of sound source information, processing returns to the acquisition of the acoustic signal and the same steps are executed again.

The microphone 60 does not need to capture a wide range of sound faithfully from low to high frequencies; an inexpensive consumer microphone of the kind built into smartphones, other portable terminals, and portable game machines can be used. Specifically, as shown in [Formula 3] above, feature data is computed only over the frequency range from frequency number f1 = 27 (about 300 Hz) to f2 = 743 (about 8000 Hz), so it is sufficient for the microphone 60 to cover this range (microphones sold as commercial products cover at least this frequency range). The acoustic signal acquiring means 70 has a function of digitizing, by A/D conversion, the analog acoustic signal captured by the microphone 60.

The sound source database searching means 80 has a function of collating the generated feature data with the registered feature data registered in the sound source database 50. The sound source information outputting means 90 has a function of extracting from the sound source database 50, and outputting, the identification information (a code number identifying the instrument, a MIDI program number, or the like) of the sound source that the collation by the sound source database searching means 80 finds most similar to the features of the recorded acoustic signal. Each of the means shown in FIG. 7 is realized by installing a dedicated program on the hardware configuration shown in FIG. 6.

A dedicated program for operating the CPU 1a and causing the computer to function as the multiple sound source identification device is installed in the storage device 3a of FIG. 6. By executing this dedicated program, the CPU 1a realizes the functions of the silent section deleting means 10, the acoustic frame reading means 20, the feature data generating means 30, the feature data correcting means 31, the acoustic signal acquiring means 70, the sound source database searching means 80, and the sound source information outputting means 90. The storage device 3a also stores the various data required for processing.

Next, the processing operation of the multiple sound source identification device shown in FIGS. 6 and 7 is described with reference to the flowchart of FIG. 8. First, the user instructs the multiple sound source identification device to start. When the device is realized with a general-purpose computer, this can be done by pressing a predetermined key on the keyboard or by clicking a predetermined location on the screen with the mouse. After starting the device, the user produces the multiple sounds to be identified for a suitable duration (for example, 5 seconds) at any time; for example, several instruments are played together for 5 seconds. When the instruction is entered, the multiple sound source identification device captures the sound arriving at the microphone 60 for a fixed interval (for example, 2 seconds) and records it to obtain a digital acoustic signal (S210). Specifically, the acoustic signal input from the microphone 60 is digitized by the acoustic signal acquiring means 70.

デジタル音響信号が得られたら、このデジタル音響信号から特徴データを生成する（Ｓ２２０）。具体的には、無音区間削除手段１０、音響フレーム読込手段２０、特徴データ生成手段３０が、図３に示したＳ１０１〜Ｓ１０４の処理を実行する。Ｓ２２０における特徴データ生成処理の結果、上記〔数式４〕に示したような特徴データδＶ（ｊ）が得られる。上述のように、特徴データの形式、生成手法は、登録特徴データδＶｄ（ｊ）と同一であるが、登録特徴データδＶｄ（ｊ）、登録側平均スペクトルＶｄ（ｊ）と区別するため、以下、複数音源の識別装置において生成された特徴データδＶ（ｊ）、時間平均スペクトルＶ（ｊ）をそれぞれ検索特徴データδＶｑ（ｊ）、検索側平均スペクトルＶｑ（ｊ）と表現する。 If the digital sound signal is obtained, feature data is generated from the digital sound signal (S220). Specifically, the silent section deleting unit 10, the acoustic frame reading unit 20, and the feature data generating unit 30 execute the processes of S101 to S104 shown in FIG. As a result of the feature data generation process in S220, the feature data δV (j) as shown in [Formula 4] is obtained. As described above, the format and generation method of the feature data is the same as the registered feature data δVd (j), but in order to distinguish it from the registered feature data δVd (j) and the registration-side average spectrum Vd (j), The feature data δV (j) and the time average spectrum V (j) generated in the multiple sound source identification device are expressed as search feature data δVq (j) and search side average spectrum Vq (j), respectively.

次に生成された特徴データを用いて音源の検索を行う（Ｓ２３０）。この音源の検索処理の詳細については後述する。音源の検索の結果、生成された特徴データに対応する音源の特定に成功した場合には（Ｓ２４０）、特徴データの補正を行う（Ｓ２５０）。具体的には、特定された音源に対応する登録特徴データを用いて特徴データを補正し、補正特徴データを得る。この特徴データの補正処理の詳細についても後述する。特徴データの補正後、Ｓ２３０に戻って、得られた補正特徴データを用い、既に特定された音源を検索対象から除外して音源の検索を行う。このようにして、特徴データの補正（Ｓ２５０）および音源の検索（Ｓ２３０）を繰り返して実行し、音源の特定に失敗した場合に処理を終了する。 Next, a sound source is searched using the generated feature data (S230). Details of the sound source search process will be described later. As a result of the sound source search, if the sound source corresponding to the generated feature data is successfully identified (S240), the feature data is corrected (S250). Specifically, the feature data is corrected using the registered feature data corresponding to the specified sound source to obtain corrected feature data. Details of the correction process of the feature data will be described later. After correcting the feature data, the process returns to S230, and using the obtained corrected feature data, the already identified sound source is excluded from the search target and the sound source is searched. In this way, the feature data correction (S250) and the sound source search (S230) are repeatedly executed, and the process ends when the sound source identification fails.

In this embodiment, whether a sound source was successfully identified is judged in S240, and identification is repeated until it fails. The condition for ending identification is not limited to that of this embodiment, however, and may be changed. For example, the number of sound sources to be identified may be fixed in advance, and processing may end when that number is reached.
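The control flow of S230 to S250, including the alternative stopping rule, can be sketched as a loop. Everything here is illustrative: `identify_all`, `search`, `correct`, and `max_sources` are hypothetical names, and the search and correction steps are passed in as callables because their details are described separately in the specification.

```python
def identify_all(query, database, search, correct, max_sources=None):
    """Repeat search (S230) and correction (S250) until search fails (S240).

    search(query, database, excluded) returns a source id or None.
    correct(query, source_id) returns the corrected query feature data.
    max_sources, if given, stops the loop after that many identifications.
    """
    found = []
    while max_sources is None or len(found) < max_sources:
        source_id = search(query, database, set(found))  # skip known sources
        if source_id is None:
            break  # identification failed: end of processing
        found.append(source_id)
        query = correct(query, source_id)
    return found
```

Passing the already identified sources to `search` mirrors the statement that identified sound sources are excluded from later searches, which also guarantees that the loop ends once the database is exhausted.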

Ｓ２３０における音源の検索処理について、図９のフローチャートを用いて説明する。取得されたデジタル音響信号から特徴データδＶ（ｊ）が得られたら、音源データベース検索手段８０が、音源データベース５０内の各登録特徴データとの照合を行う。まず、音源データベース５０から１つの登録特徴データを抽出し、特徴データとの相関演算を行う（Ｓ２３１）。具体的には、音源データベース検索手段８０は、以下の〔数式５〕に従った処理を実行し、相関値ＲＥを算出する。 The sound source search process in S230 will be described with reference to the flowchart of FIG. When the feature data δV (j) is obtained from the acquired digital acoustic signal, the sound source database search means 80 collates with each registered feature data in the sound source database 50. First, one registered feature data is extracted from the sound source database 50, and a correlation calculation with the feature data is performed (S231). Specifically, the sound source database search means 80 executes processing according to the following [Equation 5] to calculate the correlation value RE.

上述のように、音源データベース５０に登録された登録特徴データと、取得されたデジタル音響信号から生成された検索特徴データは、いずれも同じ形式の最大Ｎ／２次元（本例では７１７次元）の偏差ベクトルδＶ（ｊ）である。 As described above, the registered feature data registered in the sound source database 50 and the search feature data generated from the acquired digital acoustic signal are both of the maximum N / 2 dimensions (717 dimensions in this example) of the same format. Deviation vector δV (j).

[Formula 5]
$$RE = \frac{\delta V_q \cdot \delta V_d}{|\delta V_q|\,|\delta V_d|} = \frac{\sum_{j=f_1}^{f_2} \delta V_q(j)\, \delta V_d(j)}{\left\{ \sum_{j=f_1}^{f_2} \delta V_q(j)^2 \right\}^{1/2} \left\{ \sum_{j=f_1}^{f_2} \delta V_d(j)^2 \right\}^{1/2}}$$

Any value may be used as the correlation value as long as it allows the correlation between the registered feature data δVd(j) and the search feature data δVq(j) to be evaluated; in this embodiment, the inner product of the two, both of which are deviation vectors, is used as the basis of the correlation value RE. As shown in [Formula 5] above, the inner product is divided by the magnitude of each vector (the square root of the sum of squares of δVq(j) and of δVd(j)) to give the correlation value RE, which removes the difference in loudness between the sounds underlying the two sets of feature data.
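[Formula 5] is a normalized inner product, which can be sketched as follows. The function name `correlation` is hypothetical, and returning 0.0 when either vector has zero magnitude is an assumption, since the text does not cover that case.

```python
import math


def correlation(dq, dd):
    """Correlation value RE of two equal-length deviation vectors."""
    dot = sum(q * d for q, d in zip(dq, dd))
    norm = math.sqrt(sum(q * q for q in dq)) * math.sqrt(sum(d * d for d in dd))
    # Assumption: treat an all-zero vector as uncorrelated with everything.
    return dot / norm if norm else 0.0
```

Scaling either vector by a positive constant leaves RE unchanged, which is how the normalization removes the loudness difference between the registered and recorded sounds.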

When the correlation value RE has been obtained for one set of registered feature data, it is judged whether processing has finished for the registered feature data of all sound sources in the sound source database 50 (S232); if not, the correlation calculation (S231) is repeated until the correlation value RE has been obtained for all sound sources.

全音源について相関値ＲＥが得られたら、相関値ＲＥが最大となった音源を特定する。この際、相関値ＲＥについてしきい値を設定しておき、相関値ＲＥがしきい値より大きいものに限り選出するようにする。したがって、相関値ＲＥのうち最大のものがしきい値以下の場合には、音源は特定されない。その場合、発せられていた音と類似するものが音源データベース５０内に存在しなかったということになり、音源の特定は失敗となる。 When the correlation value RE is obtained for all sound sources, the sound source having the maximum correlation value RE is specified. At this time, a threshold value is set for the correlation value RE, and only those having a correlation value RE larger than the threshold value are selected. Therefore, when the maximum correlation value RE is equal to or less than the threshold value, no sound source is specified. In that case, it means that there is no sound similar to the sound that has been emitted in the sound source database 50, and the sound source identification fails.
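The selection rule above (largest RE, accepted only if it exceeds the threshold) can be sketched as follows. The names `best_match` and `excluded` and the dict-based database layout are hypothetical, and the default threshold of 0.5 is purely illustrative, since the text gives no value.

```python
import math


def correlation(dq, dd):
    """Normalized inner product of two deviation vectors ([Formula 5])."""
    dot = sum(q * d for q, d in zip(dq, dd))
    norm = math.sqrt(sum(q * q for q in dq)) * math.sqrt(sum(d * d for d in dd))
    return dot / norm if norm else 0.0


def best_match(query, database, threshold=0.5, excluded=()):
    """Return (source_id, RE) for the best source above `threshold`, else None.

    `database` maps a source id to its registered deviation vector.
    `threshold=0.5` is an illustrative value, not one from the text.
    """
    best_id, best_re = None, threshold
    for source_id, registered in database.items():
        if source_id in excluded:
            continue  # already identified in an earlier pass
        re_value = correlation(query, registered)
        if re_value > best_re:  # strictly greater than the threshold
            best_id, best_re = source_id, re_value
    return None if best_id is None else (best_id, best_re)
```

Returning `None` corresponds to the case where nothing in the sound source database resembles the recorded sound, so identification fails.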

When the largest correlation value RE exceeds the threshold, the sound source is identified from the sound source identification information recorded in the sound source database 50 in association with the registered feature data for which that correlation value RE was calculated (S233). Once the sound source has been identified, the sound source information outputting means 90 outputs the identified sound source identification information in a predetermined format (S234). Various output formats are possible, such as displaying content related to the sound source identification information or transmitting the sound source identification information as data to another information device, and the output is produced by the device corresponding to the format that has been set. For example, if details of instruments, explanatory content, and programs for presenting that content are stored in a separately prepared multimedia content database in association with sound source identification information, the explanatory content is launched and shown on the display device via the display output I/F 6a. Alternatively, by transmitting the sound source identification information to another information terminal via a USB or wireless interface, that terminal can be made to launch a predetermined program corresponding to the sound source identification information. In this embodiment, as shown in S234 of FIG. 9, the sound source identification information is output each time a sound source is identified. Thus the sound source identification information is output to the other information device or the like each time a sound source is identified, and, after identification fails in S240 of FIG. 8, the information device that received the sound source identification information produces its output based on the multiple pieces of sound source identification information together. The invention is not limited to this embodiment; the sound source identification information may instead be output to the other information device or the like all at once after all sound sources have been identified.

音源識別情報を出力する場合、コンピュータ等の情報処理装置に出力すれば、情報処理装置は、音源識別情報に応じた処理を行うことができる。例えば、コンピュータに音源識別情報ごとに異なる処理を行うプログラムを搭載しておくことにより、異なる音源にマイクロフォンを向けるだけで、異なる処理が行われることになる。 When outputting sound source identification information, if the information is output to an information processing apparatus such as a computer, the information processing apparatus can perform processing according to the sound source identification information. For example, by installing a program for performing different processing for each sound source identification information on a computer, different processing is performed only by directing a microphone toward a different sound source.
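As a rough illustration of "different processing for each sound source identification information", the receiving computer could keep a table that maps each ID to a handler. Everything named here (`HANDLERS`, the ID strings, and the handler functions) is hypothetical and chosen only for the sketch.

```python
from typing import Callable, Dict


def show_piano_guide() -> str:
    # Placeholder for launching commentary content about the piano.
    return "launch: piano commentary"


def show_flute_guide() -> str:
    # Placeholder for launching commentary content about the flute.
    return "launch: flute commentary"


# Hypothetical mapping from sound source identification information to a program.
HANDLERS: Dict[str, Callable[[], str]] = {
    "piano": show_piano_guide,
    "flute": show_flute_guide,
}


def on_source_identified(source_id: str) -> str:
    """Run the program registered for the received ID, if there is one."""
    handler = HANDLERS.get(source_id)
    return handler() if handler is not None else "no program registered"
```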

（３．特徴データ補正の詳細）
次に、Ｓ２５０の特徴データの補正について説明する。まず、特徴データ補正手段３１は、特定された音源に対応付けて記録された登録側平均スペクトルＶｄ（ｊ）をＶｄｍａｘ（ｊ）とし、Ｖｄｍａｘ（ｊ）（ｊ＝ｆ１，ｆ１＋１，・・・，ｆ２−１，ｆ２）のうち、Ｖｄｍａｘ（ｊ）の値を最大にするｊを特定し、そのときのｊをｊｍａｘとする。そして、特徴データ補正手段３１は、以下の〔数式６〕に従った処理を実行することにより、ｊ＝ｆ１，ｆ１＋１，・・・，ｆ２−１，ｆ２の検索側平均スペクトルＶｑ（ｊ）の各値を補正した補正平均スペクトルＶｑ´（ｊ）を算出する。なお、音源データベース５０に登録側平均スペクトルＶ（ｊ）を登録しておかない実施形態とすることも可能であるが、その場合は、音源データベース５０に特徴データδＶ（ｊ）に対応付けてスペクトルの平均値Ａｖを登録しておく。そして、特徴データ補正手段３１は、上記〔数式４〕を変形した数式Ｖ（ｊ）＝ [δＶ（ｊ）＋Ａｖ]／ｊに従った処理を実行し、登録側平均スペクトルＶｄ（ｊ）を算出する処理を行う。 (3. Details of feature data correction)
Next, the correction of the feature data in S250 will be described. First, the feature data correction unit 31 takes the registration-side average spectrum Vd(j) recorded in association with the specified sound source as Vdmax(j), identifies the j that maximizes Vdmax(j) over j = f1, f1+1, ..., f2−1, f2, and sets that j as jmax. The feature data correction unit 31 then executes the processing of [Formula 6] below to calculate a corrected average spectrum Vq′(j), in which each value of the search-side average spectrum Vq(j) for j = f1, f1+1, ..., f2−1, f2 has been corrected. An embodiment in which the registration-side average spectrum V(j) is not registered in the sound source database 50 is also possible; in that case, the spectrum average value Av is registered in the sound source database 50 in association with the feature data δV(j). The feature data correction unit 31 then calculates the registration-side average spectrum Vd(j) according to the formula V(j) = [δV(j) + Av] / j, obtained by rearranging [Formula 4] above.

〔数式６〕
Ｖｑ´（ｊ）＝Ｖｑ（ｊ）−Ｖｄｍａｘ（ｊ）・Ｖｑ（ｊｍａｘ）／Ｖｄｍａｘ（ｊｍａｘ）
Ｖｑ´（ｊ）＜０の場合、Ｖｑ´（ｊ）＝０に設定 [Formula 6]
Vq ′ (j) = Vq (j) −Vdmax (j) · Vq (jmax) / Vdmax (jmax)
If Vq '(j) <0, set Vq' (j) = 0
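[Formula 6] can be sketched in a few lines. The function name and the list-based representation are assumptions made for illustration; list index 0 stands for frequency number f1, and the guard against a non-positive peak is an addition that the text does not discuss.

```python
from typing import List


def correct_average_spectrum(vq: List[float], vd_max: List[float]) -> List[float]:
    """Subtract the identified source's registered spectrum, scaled at its
    peak bin, from the search-side average spectrum ([Formula 6])."""
    if len(vq) != len(vd_max):
        raise ValueError("spectra must cover the same frequency numbers")
    # jmax: the bin where the registration-side spectrum is largest.
    jmax = max(range(len(vd_max)), key=lambda j: vd_max[j])
    if vd_max[jmax] <= 0.0:
        raise ValueError("registered spectrum has no positive component")
    scale = vq[jmax] / vd_max[jmax]
    # Negative results are clamped to 0, as the formula requires.
    return [max(0.0, q - d * scale) for q, d in zip(vq, vd_max)]
```

By construction the corrected spectrum is always 0 at jmax, since the scale factor is chosen so that the two spectra match there.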

上記〔数式６〕において、Ｖｑ（ｊｍａｘ）、Ｖｄｍａｘ（ｊｍａｘ）はいずれもｊ＝ｊｍａｘで固定されているため、固定値である。そのため、Ｖｑ（ｊｍａｘ）／Ｖｄｍａｘ（ｊｍａｘ）とベクトルＶｄｍａｘ（ｊ）との間の演算子“・”は、乗算を意味する。上記〔数式６〕に示すように、補正平均スペクトルＶｑ´（ｊ）は、特定された音源に対応する配列Ｖｄｍａｘ（ｊ）を最大にする周波数番号ｊｍａｘについて、検索側のＶｑ（ｊｍａｘ）を特定されたＶｄｍａｘ（ｊｍａｘ）で除算し、その値を配列Ｖｄｍａｘ（ｊ）に乗じたものを検索側平均スペクトルＶｑ（ｊ）から減じることにより算出される。すなわち、特定された音源の最大成分の周波数ｊｍａｘについての検索側との比を、特定された音源に対応する配列Ｖｄｍａｘ（ｊ）に乗じて検索側平均スペクトルＶｑ（ｊ）から減じることにより、元の検索側のスペクトルの各値から、特定された音源に対応するスペクトルの各値の寄与分を除去した補正平均スペクトルＶｑ´（ｊ）が得られる。Ｖｑ´（ｊ）が０未満となった場合に、強制的に“０”に設定するのは、Ｖｑ´（ｊ）として負の値は物理的に有り得ないためである。 In [Formula 6] above, Vq(jmax) and Vdmax(jmax) are both scalar constants because j is fixed at jmax. The operator "·" between Vq(jmax)/Vdmax(jmax) and the vector Vdmax(j) therefore denotes scalar multiplication. As [Formula 6] shows, the corrected average spectrum Vq′(j) is calculated as follows: for the frequency number jmax that maximizes the array Vdmax(j) of the specified sound source, the search-side value Vq(jmax) is divided by Vdmax(jmax), the array Vdmax(j) is multiplied by that quotient, and the product is subtracted from the search-side average spectrum Vq(j). In other words, Vdmax(j) is scaled by the ratio between the search side and the specified sound source at the frequency jmax of the source's largest component and then subtracted, which yields a corrected average spectrum Vq′(j) in which the contribution of the specified sound source has been removed from each value of the original search-side spectrum. Vq′(j) is forced to 0 whenever it falls below 0 because a negative value of Vq′(j) is physically impossible.

そして、特徴データ補正手段３１は、この補正平均スペクトルＶｑ´（ｊ）を用いて以下の〔数式７〕に従った処理を実行することにより、偏差ベクトルである補正特徴データδＶｑ´（ｊ）を算出する。 Then, using this corrected average spectrum Vq′(j), the feature data correction unit 31 executes the processing of [Formula 7] below to calculate the corrected feature data δVq′(j), which is a deviation vector.

〔数式７〕
Ａｖ´＝Σ_j=f1,…,f2Ｖｑ´（ｊ）・ｊ／（ｆ２-ｆ１＋１）
δＶｑ´（ｊ）＝Ｖｑ´（ｊ）・ｊ−Ａｖ´ [Formula 7]
Av′ = [Σ_{j=f1,…,f2} Vq′(j) · j] / (f2 − f1 + 1)
δVq′(j) = Vq′(j) · j − Av′
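A minimal sketch of [Formula 7], assuming the corrected spectrum is held as a list whose index 0 corresponds to frequency number f1 (the function name and this layout are illustrative only):

```python
from typing import List


def deviation_vector(vq_corrected: List[float], f1: int) -> List[float]:
    """Weight each component by its frequency number j, then subtract the
    mean of the weighted values ([Formula 7])."""
    if not vq_corrected:
        raise ValueError("spectrum must not be empty")
    # Vq'(j) * j for j = f1, ..., f2, where f2 = f1 + len - 1.
    weighted = [v * (f1 + k) for k, v in enumerate(vq_corrected)]
    av = sum(weighted) / len(weighted)  # Av'
    return [w - av for w in weighted]
```

Because the mean is subtracted, the components of the returned deviation vector always sum to zero.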

上記〔数式７〕の第１式、第２式は、それぞれ上記〔数式３〕〔数式４〕におけるＶ（ｊ）をＶｑ´（ｊ）、ＡｖをＡｖ´に置き換えたものである。上記〔数式７〕に従って算出された補正特徴データδＶｑ´（ｊ）を用いて、Ｓ２３０の音源の検索処理を行い、２番目に主要な音源を特定する。２番目に主要な音源の特定に成功したら、補正平均スペクトルＶｑ´（ｊ）を新たな検索側平均スペクトルＶｑ（ｊ）とし、上記〔数式６〕〔数式７〕を用いてＳ２５０において新たな補正特徴データδＶｑ´（ｊ）を求める。音源の特定に失敗するまで、Ｓ２３０〜Ｓ２５０の処理は、特徴データ補正手段３１、音源データベース検索手段８０により繰り返し実行される。 The first and second formulas of [Formula 7] are obtained by replacing V (j) with Vq ′ (j) and Av with Av ′ in [Formula 3] and [Formula 4], respectively. Using the corrected feature data δVq ′ (j) calculated according to the above [Equation 7], the sound source search process of S230 is performed, and the second main sound source is specified. When the second main sound source is successfully identified, the corrected average spectrum Vq ′ (j) is set as a new search-side average spectrum Vq (j), and a new correction is made in S250 using the above [Expression 6] and [Expression 7]. Characteristic data δVq ′ (j) is obtained. Until the identification of the sound source fails, the processing of S230 to S250 is repeatedly executed by the feature data correction unit 31 and the sound source database search unit 80.
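The repetition of S230 to S250 can be sketched as a loop that identifies one source, subtracts it, and searches again. This is an illustration under stated assumptions: the correlation value RE is defined earlier in the document and is replaced here by a plain cosine similarity, and all function names, the toy database, and the threshold are hypothetical.

```python
import math
from typing import Dict, List


def deviation(v: List[float], f1: int) -> List[float]:
    # Frequency-weighted deviation vector ([Formula 3]/[Formula 4] form).
    w = [x * (f1 + k) for k, x in enumerate(v)]
    av = sum(w) / len(w)
    return [x - av for x in w]


def cosine(a: List[float], b: List[float]) -> float:
    # Stand-in for the correlation value RE (an assumption, see lead-in).
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def subtract_source(vq: List[float], vd: List[float]) -> List[float]:
    # [Formula 6]: scale the registered spectrum at its peak bin and subtract.
    jmax = max(range(len(vd)), key=lambda j: vd[j])
    scale = vq[jmax] / vd[jmax]
    return [max(0.0, q - d * scale) for q, d in zip(vq, vd)]


def identify_all(vq: List[float], database: Dict[str, List[float]],
                 f1: int, threshold: float) -> List[str]:
    """Repeat search (S230) and correction (S250) until the search fails."""
    found: List[str] = []
    remaining = dict(database)  # already identified sources are excluded
    while remaining:
        dq = deviation(vq, f1)
        scores = {name: cosine(dq, deviation(vd, f1))
                  for name, vd in remaining.items()}
        best = max(scores, key=scores.get)
        if scores[best] <= threshold:
            break  # identification failed: stop iterating
        found.append(best)
        vq = subtract_source(vq, remaining.pop(best))
    return found
```

In the toy case used for checking this sketch, the second source scores below the threshold on the first pass and only clears it after the first source has been subtracted, which is the effect the correction step is meant to achieve.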

上記〔数式６〕に示したように、特定された音源に対応する登録側平均スペクトルを最大にする周波数番号ｊｍａｘについて、対応する検索側平均スペクトルの成分Ｖｑ（ｊｍａｘ）を特定された登録側平均スペクトルの成分Ｖｄｍａｘ（ｊｍａｘ）で除算し、その値を登録側平均スペクトルＶｄｍａｘ（ｊ）に乗じたものを検索側平均スペクトルＶｑ（ｊ）から減じることにより、検索側平均スペクトルを補正し、検索特徴データを補正するようにしたので、取得された音響信号の検索側平均スペクトルから特定された音源の登録側平均スペクトルに相当する分を除去することができ、補正された検索側平均スペクトルは、特定された第１音源を除外した音源を表現したものとなるので、第２音以降を正確に特定することが可能になる。 As shown in [Formula 6] above, for the frequency number jmax that maximizes the registration-side average spectrum of the specified sound source, the corresponding search-side component Vq(jmax) is divided by the registration-side component Vdmax(jmax), the registration-side average spectrum Vdmax(j) is multiplied by that quotient, and the product is subtracted from the search-side average spectrum Vq(j). The search-side average spectrum, and with it the search feature data, is corrected in this way. The portion corresponding to the registration-side average spectrum of the specified sound source can thus be removed from the search-side average spectrum of the acquired acoustic signal. Because the corrected search-side average spectrum represents the sound with the specified first sound source excluded, the second and subsequent sound sources can be specified accurately.

（４．特徴データ補正の変形例）
次に、Ｓ２５０の特徴データの補正の変形例について説明する。上記の例では、検索特徴データδＶｑ（ｊ）の基になる検索側平均スペクトルＶｑ（ｊ）を補正した後、補正特徴データδＶｑ´（ｊ）を得るようにしたが、変形例では、検索特徴データδＶｑ（ｊ）が複数種の音源に対応する登録特徴データδＶｄ（ｊ）の線形結合で表されると仮定して、最も主要な音源のレベルを推定し、２番目以降に主要な音源を高精度に判定する。ただし、説明を簡単にするため、検索対象の音源を２音源に限定する。 (4. Modification of feature data correction)
Next, a modification of the feature data correction in S250 will be described. In the example above, the corrected feature data δVq′(j) is obtained after correcting the search-side average spectrum Vq(j) on which the search feature data δVq(j) is based. In this modification, the search feature data δVq(j) is instead assumed to be a linear combination of the registered feature data δVd(j) of several kinds of sound sources; the level of the most dominant sound source is estimated, and the second and subsequent sound sources are then determined with high accuracy. To keep the description simple, the sound sources to be searched are limited to two.

変形例では、最も主要な音源のレベルをＶ１、２番目に主要な音源のレベルをＶ２とした場合、検索特徴データδＶｑ（ｊ）が以下の〔数式８〕により表現できると仮定する。〔数式８〕において、右辺の演算子“・”は、スカラー値であるＶ１、Ｖ２とベクトルとの演算であるため、乗算を意味する。 In the modification, it is assumed that the search feature data δVq (j) can be expressed by the following [Equation 8] when the level of the most main sound source is V1, and the level of the second main sound source is V2. In [Formula 8], the operator “·” on the right side means an operation of scalar values V1 and V2 and a vector, and means multiplication.

〔数式８〕
δＶｑ（ｊ）＝Ｖ１・δＶｄｍａｘ（ｊ）＋Ｖ２・δＶｄ２ｍａｘ（ｊ） [Formula 8]
δVq (j) = V1 · δVdmax (j) + V2 · δVd2max (j)

検索特徴データδＶｑ（ｊ）と最も主要な音源の登録特徴データδＶｄｍａｘ（ｊ）の内積は、以下の〔数式９〕により表現される。〔数式９〕において、左辺の演算子“・”と右辺の｛｝内の演算子“・”は、ベクトル同士の演算であるため、内積を意味し、右辺の｛｝外の演算子“・”は、スカラー値との演算であるため、乗算を意味する。 The inner product of the search feature data δVq(j) and the registered feature data δVdmax(j) of the most dominant sound source is expressed by [Formula 9] below. In [Formula 9], the operator "·" on the left side and the operator "·" inside the braces {} on the right side act on two vectors and therefore denote an inner product, whereas the operators "·" outside the braces on the right side involve a scalar value and therefore denote multiplication.

〔数式９〕
δＶｑ（ｊ）・δＶｄｍａｘ（ｊ）＝Ｖ１・｜δＶｄｍａｘ（ｊ）｜²＋Ｖ２・｛δＶｄ２ｍａｘ（ｊ）・δＶｄｍａｘ（ｊ）｝ [Formula 9]
δVq (j) · δVdmax (j) = V1 · | δVdmax (j) | ² + V2 · {δVd2max (j) · δVdmax (j)}

ここで、２番目に主要な音源の登録特徴データδＶｄ２ｍａｘ（ｊ）が最も主要な音源の登録特徴データδＶｄｍａｘ（ｊ）に比べて顕著に異なれば、線形独立と仮定でき、上記〔数式９〕右辺第２項の内積値｛δＶｄ２ｍａｘ（ｊ）・δＶｄｍａｘ（ｊ）｝は０に近似できるため、最も主要な音源のレベルＶ１は、以下の〔数式１０〕により表現される。 Here, if the registered feature data δVd2max (j) of the second main sound source is significantly different from the registered feature data δVdmax (j) of the most main sound source, it can be assumed to be linearly independent. Since the inner product value {δVd2max (j) · δVdmax (j)} of the second term can be approximated to 0, the level V1 of the most important sound source is expressed by the following [Equation 10].

〔数式１０〕
Ｖ１＝｛δＶｑ（ｊ）・δＶｄｍａｘ（ｊ）｝／｜δＶｄｍａｘ（ｊ）｜² [Formula 10]
V1 = {δVq (j) · δVdmax (j)} / | δVdmax (j) | ²
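The size of the error introduced by this approximation can be made explicit. Substituting [Formula 8] into the right-hand side of [Formula 10] gives the following, written here in LaTeX with the same symbols (the hat on the estimate is notation added for this note only):

```latex
\hat{V}_1
  = \frac{\delta V_q \cdot \delta V_{d\max}}{\lvert \delta V_{d\max} \rvert^{2}}
  = V_1 + V_2 \,
    \frac{\delta V_{d2\max} \cdot \delta V_{d\max}}{\lvert \delta V_{d\max} \rvert^{2}}
```

The estimate therefore equals V1 exactly when the two registered feature vectors are orthogonal, and otherwise deviates from it in proportion to V2 and to the inner product of the two vectors.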

したがって、特徴データ補正手段３１は、上記〔数式１０〕に従った処理を実行し、最も主要な音源のレベルＶ１を算出する。さらに、特徴データ補正手段３１は、上記主要な音源のレベルＶ１を用いて以下の〔数式１１〕に従った処理を実行することにより、偏差ベクトルである補正特徴データδＶｑ´（ｊ）を算出する。 Therefore, the feature data correcting unit 31 executes the process according to the above [Equation 10] and calculates the level V1 of the most main sound source. Further, the feature data correcting unit 31 calculates the corrected feature data δVq ′ (j), which is a deviation vector, by executing processing according to the following [Equation 11] using the level V1 of the main sound source. .

〔数式１１〕
δＶｑ´（ｊ）＝δＶｑ（ｊ）−Ｖ１・δＶｄｍａｘ（ｊ） [Formula 11]
δVq ′ (j) = δVq (j) −V1 · δVdmax (j)

変形例による特徴データの補正によれば、上記〔数式１０〕に示したように、特定された音源に対応する登録特徴データδＶｄｍａｘ（ｊ）と、検索特徴データδＶｑ（ｊ）の相関値を音源のレベルＶ１として求め、さらに上記〔数式１１〕に示したように、相関値Ｖ１を特定された登録特徴データδＶｄｍａｘ（ｊ）に乗じたものを検索特徴データδＶｑ（ｊ）から減じることにより補正特徴データδＶｑ´（ｊ）を生成するようにしたので、定量的に検索特徴データから第１音源の特徴を除外した補正特徴データを生成でき、第２音源以降の特定を高精度に行うことが可能になる。補正特徴データδＶｑ´（ｊ）を用いて、２番目に主要な音源の登録特徴データδＶｄ２ｍａｘ（ｊ）との内積を計算することにより、２番目に主要な音源のレベルＶ２を算出できる。即ち、〔数式９〕〔数式１０〕を以下〔数式９ａ〕〔数式１０ａ〕のように変形することにより算出できる。 With the feature data correction of this modification, as shown in [Formula 10] above, the correlation value between the registered feature data δVdmax(j) of the specified sound source and the search feature data δVq(j) is obtained as the sound source level V1. Then, as shown in [Formula 11] above, the corrected feature data δVq′(j) is generated by multiplying the specified registered feature data δVdmax(j) by the correlation value V1 and subtracting the product from the search feature data δVq(j). Corrected feature data that quantitatively excludes the features of the first sound source can thus be generated from the search feature data, so the second and subsequent sound sources can be specified with high accuracy. The level V2 of the second most dominant sound source can be calculated by taking the inner product of the corrected feature data δVq′(j) and the registered feature data δVd2max(j) of that sound source. That is, V2 is obtained by rewriting [Formula 9] and [Formula 10] as [Formula 9a] and [Formula 10a] below.

〔数式９ａ〕
δＶｑ´（ｊ）・δＶｄ２ｍａｘ（ｊ）＝Ｖ２・｜δＶｄ２ｍａｘ（ｊ）｜² [Formula 9a]
δVq ′ (j) · δVd2max (j) = V2 · | δVd2max (j) | ²

〔数式１０ａ〕
Ｖ２＝｛δＶｑ´（ｊ）・δＶｄ２ｍａｘ（ｊ）｝／｜δＶｄ２ｍａｘ（ｊ）｜² [Formula 10a]
V2 = {δVq ′ (j) · δVd2max (j)} / | δVd2max (j) | ²
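[Formula 10], [Formula 11], and [Formula 10a] amount to projecting the search feature data onto each registered feature vector in turn. The sketch below assumes plain Python lists for the vectors; the function names are illustrative and not taken from the patent.

```python
from typing import List, Tuple


def level(dq: List[float], dd: List[float]) -> float:
    """Projection coefficient {dq . dd} / |dd|^2 ([Formula 10] / [Formula 10a])."""
    norm_sq = sum(x * x for x in dd)
    if norm_sq == 0.0:
        raise ValueError("registered feature data must be non-zero")
    return sum(q * d for q, d in zip(dq, dd)) / norm_sq


def remove_source(dq: List[float], dd: List[float]) -> Tuple[float, List[float]]:
    """Return the level V1 and the corrected feature data ([Formula 11])."""
    v1 = level(dq, dd)
    return v1, [q - v1 * d for q, d in zip(dq, dd)]
```

For example, with the orthogonal vectors `[1, 0, -1]` and `[1, -2, 1]` mixed at levels 2.0 and 0.5, `remove_source` recovers V1 = 2.0 and a second call to `level` on the residual recovers V2 = 0.5. When the registered vectors are not orthogonal, the recovered levels are only approximate.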

（５．窓関数の変形例）
上記実施形態では、音源を時系列方向の平均スペクトルを基礎とした特徴データを生成した。しかし、一般に音は、図１０（ａ）に示すように、音の立ち上がり部（アタックとディケイ）、定常部（サステイン）、立下り部（リリース）の４相でスペクトルが動的に変化するため、その特徴を単一な平均スペクトルを基礎として表現するのは適切ではない。例えば、ある音響信号の時間軸を逆転させた逆回し波形は平均スペクトルに変化はないが、図１０（ｂ）（ｃ）に示すように、波形が時間軸方向で反対になるため、音色が劇的に変化する。 (5. Modified examples of window functions)
In the above embodiment, the feature data for a sound source is generated from its spectrum averaged along the time axis. In general, however, as shown in FIG. 10(a), the spectrum of a sound changes dynamically across four phases: the rising part (attack and decay), the steady part (sustain), and the falling part (release). Expressing the features of a sound with a single average spectrum is therefore not strictly appropriate. For example, a reversed waveform obtained by inverting the time axis of an acoustic signal has the same average spectrum as the original, but as shown in FIGS. 10(b) and 10(c), the waveform runs in the opposite direction along the time axis, so the timbre changes dramatically.
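The statement that time reversal leaves the magnitude spectrum unchanged can be checked numerically. The sketch below uses a direct DFT written with the standard library so that it runs without extra packages; the short decaying test signal is made up for the illustration, and a single unwindowed frame is used rather than the framed averaging of the embodiment.

```python
import cmath
from typing import List


def magnitude_spectrum(x: List[float]) -> List[float]:
    """Magnitude of the discrete Fourier transform, computed directly (O(n^2))."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def same_magnitude(x: List[float], tol: float = 1e-9) -> bool:
    """True when x and its time-reversed copy have equal magnitude spectra."""
    forward = magnitude_spectrum(x)
    backward = magnitude_spectrum(x[::-1])
    return all(abs(a - b) < tol for a, b in zip(forward, backward))
```

Only the phase differs between the two spectra, which is why a feature built purely from magnitudes cannot separate a sound from its reversed version.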

上記実施形態では、フーリエ変換を行う場合、全ての音響フレームに対して、一律にハニング窓である窓関数Ｗ（ｉ）を用いているが、ハニング窓は、左右対称な形状であるため、逆回し波形を識別できないという問題がある。逆回し波形とは、ある音響信号の時間軸を逆転させたものである。例えば、ピアノの音色の音響信号を逆方向から再生した場合、オルガン風音色となるが、左右対称（正確には時間軸方向に前後対称）な上記窓関数Ｗ（ｉ）では、両者の違いを識別することができない。 In the above embodiment, the window function W(i), a Hanning window, is applied uniformly to every acoustic frame when the Fourier transform is performed. Because the Hanning window is left-right symmetric, it cannot distinguish a reversed waveform, that is, an acoustic signal whose time axis has been inverted. For example, when the acoustic signal of a piano tone is played backward it sounds like an organ, but the window function W(i), which is left-right symmetric (more precisely, symmetric front-to-back along the time axis), cannot tell the two apart.

このような逆回し波形にも対応可能とするため、本発明では、図１１（ａ）に示すような上記窓関数Ｗ（ｉ）に代えて左右非対称な２つの窓関数Ｗ（１，ｉ）、Ｗ（２，ｉ）を用いるようにすることもできる。窓関数Ｗ（１，ｉ）は、図１１（ｂ）に示すように所定のサンプル番号ｉの位置において、最大値１をとり、後部においては、最小値０をとるように設定されている。どのサンプル番号の場合に最大値をとるかについては、窓関数Ｗ（１，ｉ）の設計によって異なってくるが、本実施形態では、以下の〔数式１２〕で定義される。また、窓関数Ｗ（２，ｉ）は、図１１（ｃ）に示すように、所定のサンプル番号ｉの位置において、最大値１をとり、前部においては、最小値０をとるように設定されている。どのサンプル番号の場合に最大値をとるかについては、窓関数Ｗ（２，ｉ）の設計によって異なってくるが、本実施形態では、以下の〔数式１３〕で定義される。これらの窓関数は、特許文献３にも開示されているように公知の窓関数である。 To handle such reversed waveforms, the present invention can replace the window function W(i) shown in FIG. 11(a) with two left-right asymmetric window functions, W(1, i) and W(2, i). As shown in FIG. 11(b), the window function W(1, i) is set to take its maximum value 1 at predetermined sample numbers i and its minimum value 0 in the rear part. Which sample numbers give the maximum depends on the design of W(1, i); in this embodiment it is defined by [Formula 12] below. As shown in FIG. 11(c), the window function W(2, i) is set to take its maximum value 1 at predetermined sample numbers i and its minimum value 0 in the front part. Which sample numbers give the maximum depends on the design of W(2, i); in this embodiment it is defined by [Formula 13] below. Both are known window functions, as also disclosed in Patent Document 3.

〔数式１２〕
ｉ≦Ｎ／８のとき、Ｗ（１，ｉ）＝０．０
Ｎ／８＜ｉ≦３Ｎ／８のとき、Ｗ（１，ｉ）＝０．５−０．５ｃｏｓ（４π（ｉ−Ｎ／８）／Ｎ）
３Ｎ／８＜ｉ≦１１Ｎ／１６のとき、Ｗ（１，ｉ）＝１．０
１１Ｎ／１６＜ｉ≦１３Ｎ／１６のとき、Ｗ（１，ｉ）＝０．５＋０．５ｃｏｓ（８π（ｉ−１１Ｎ／１６）／Ｎ）
ｉ＞１３Ｎ／１６のとき、Ｗ（１，ｉ）＝０．０ [Formula 12]
When i ≦ N / 8, W (1, i) = 0.0
When N / 8 <i ≦ 3N / 8, W (1, i) = 0.5−0.5 cos (4π (i−N / 8) / N)
When 3N / 8 <i ≦ 11N / 16, W (1, i) = 1.0
When 11N / 16 <i ≦ 13N / 16, W (1, i) = 0.5 + 0.5 cos (8π (i-11N / 16) / N)
When i> 13N / 16, W (1, i) = 0.0

〔数式１３〕
ｉ≦３Ｎ／１６のとき、Ｗ（２，ｉ）＝０．０
３Ｎ／１６＜ｉ≦５Ｎ／１６のとき、Ｗ（２，ｉ）＝０．５−０．５ｃｏｓ（８π（ｉ−３Ｎ／１６）／Ｎ）
５Ｎ／１６＜ｉ≦５Ｎ／８のとき、Ｗ（２，ｉ）＝１．０
５Ｎ／８＜ｉ≦７Ｎ／８のとき、Ｗ（２，ｉ）＝０．５＋０．５ｃｏｓ（４π（ｉ−５Ｎ／８）／Ｎ）
ｉ＞７Ｎ／８のとき、Ｗ（２，ｉ）＝０．０ [Formula 13]
When i ≦ 3N / 16, W (2, i) = 0.0
When 3N / 16 <i ≦ 5N / 16, W (2, i) = 0.5−0.5 cos (8π (i−3N / 16) / N)
When 5N / 16 <i ≦ 5N / 8, W (2, i) = 1.0
When 5N / 8 <i ≦ 7N / 8, W (2, i) = 0.5 + 0.5 cos (4π (i−5N / 8) / N)
When i> 7N / 8, W (2, i) = 0.0
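[Formula 12] and [Formula 13] translate directly into piecewise functions. The sketch below follows the breakpoints as written; the function names `w1` and `w2` and the choice to pass the frame length N as an argument are assumptions for illustration.

```python
import math


def w1(i: float, n: float) -> float:
    """Window W(1, i) of [Formula 12]: slow rise, flat top, fast fall."""
    if i <= n / 8:
        return 0.0
    if i <= 3 * n / 8:
        return 0.5 - 0.5 * math.cos(4 * math.pi * (i - n / 8) / n)
    if i <= 11 * n / 16:
        return 1.0
    if i <= 13 * n / 16:
        return 0.5 + 0.5 * math.cos(8 * math.pi * (i - 11 * n / 16) / n)
    return 0.0


def w2(i: float, n: float) -> float:
    """Window W(2, i) of [Formula 13]: fast rise, flat top, slow fall."""
    if i <= 3 * n / 16:
        return 0.0
    if i <= 5 * n / 16:
        return 0.5 - 0.5 * math.cos(8 * math.pi * (i - 3 * n / 16) / n)
    if i <= 5 * n / 8:
        return 1.0
    if i <= 7 * n / 8:
        return 0.5 + 0.5 * math.cos(4 * math.pi * (i - 5 * n / 8) / n)
    return 0.0
```

Reading the breakpoints side by side, W(2, i) works out to be the mirror image of W(1, i) about the frame centre, that is, W(2, i) = W(1, N − i), which is consistent with the two windows responding differently to a sound and its reversed version.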

このように、左右非対称な２つの窓関数を用いることにより、通常再生の場合と逆回し波形との識別を行うことが可能となる。ただし、逆回し波形は品質上の問題から使用される頻度は少なく、実際のピアノ音色とオルガン音色の波形は互いに時間軸反転させたような単純な形状ではないため、上記Ｗ（ｉ）のような対象は窓関数を用いても、通常の音源の識別は可能である。 Using two left-right asymmetric window functions in this way makes it possible to distinguish normal playback from a reversed waveform. However, reversed waveforms are rarely used because of their quality, and the waveforms of actual piano and organ tones are not simply time-reversed copies of each other, so ordinary sound sources can still be identified even with a symmetric window function such as W(i) above.

以上、本発明の好適な実施形態について説明したが、本発明は上記実施形態に限定されず、種々の変形が可能である。例えば、上記実施形態では、特徴データとして、各周波数に重みを乗じたものから平均値を減じた偏差ベクトルを算出するようにしたが、平均値を減じない状態のベクトルを特徴データとしても良い。 The preferred embodiments of the present invention have been described above. However, the present invention is not limited to these embodiments, and various modifications are possible. For example, in the above embodiment a deviation vector is calculated as the feature data by multiplying each frequency component by a weight and then subtracting the average value, but the weighted vector before the average value is subtracted may also be used as the feature data.

１、１ａ・・・ＣＰＵ
２、２ａ・・・ＲＡＭ
３、３ａ・・・記憶装置
４、４ａ・・・キー入力Ｉ／Ｆ
５、５ａ・・・データ入出力Ｉ／Ｆ
６、６ａ・・・表示出力Ｉ／Ｆ
７ａ・・・音声入力Ｉ／Ｆ
１０・・・無音区間削除手段
２０・・・音響フレーム読込手段
３０・・・特徴データ生成手段
３１・・・特徴データ補正手段
４０・・・特徴データ登録手段
５０・・・音源データベース
６０・・・マイクロフォン
７０・・・音響信号取得手段
８０・・・音源データベース検索手段
９０・・・音源情報出力手段
DESCRIPTION OF SYMBOLS
1, 1a ... CPU
2, 2a ... RAM
3, 3a ... Storage device
4, 4a ... Key input I/F
5, 5a ... Data input/output I/F
6, 6a ... Display output I/F
7a ... Audio input I/F
10 ... Silent section deletion means
20 ... Acoustic frame reading means
30 ... Feature data generation means
31 ... Feature data correction means
40 ... Feature data registration means
50 ... Sound source database
60 ... Microphone
70 ... Acoustic signal acquisition means
80 ... Sound source database search means
90 ... Sound source information output means

Claims

複数の音源から発せられる音を取得して、それぞれの音源を識別する装置であって、
各音源について、その特徴を表現した登録特徴データと、各音源を特定する識別情報が対応付けて登録された音源データベースと、
前記複数の音源から発せられる音を録音してデジタルの音響信号として取得する音響信号取得手段と、
前記音響信号に対して周波数解析を行い、時間的に平均化したスペクトルである検索側平均スペクトルに基づいて検索特徴データを生成する特徴データ生成手段と、
前記生成された検索特徴データと前記音源データベースに登録されている登録特徴データの各々と相関計算を行い、得られた相関値の中で、最大の相関値をもち、かつ当該相関値が所定のしきい値以上を満たす登録特徴データに対応する音源を前記識別情報により特定する音源データベース検索手段と、
前記音源データベース検索手段により特定された音源に対応する登録特徴データを用いて、前記検索特徴データを補正し、補正特徴データを生成する特徴データ補正手段と、を有し、
前記音源データベース検索手段は、前記生成された補正特徴データと、既に特定された音源を除外した前記音源データベースに登録されている登録特徴データを用いて、前記音源を特定する処理を実行し、
前記特徴データ補正手段は、前記音源データベース検索手段により特定された音源に対応付けて前記音源データベースに登録されている前記登録特徴データを用いて、前記登録特徴データを作成する基になる時間的に平均化したスペクトルである登録側平均スペクトルを作成し、作成された登録側平均スペクトルのうち最大の成分を与える周波数番号について、検索側平均スペクトルの対応する値を登録側平均スペクトルのうち最大の成分で除算し、その値を、前記最大の成分を与える周波数番号を含む各周波数番号に対して、前記特定された音源に対応する登録側平均スペクトルに乗じたものを前記検索側平均スペクトルから減じることにより、各周波数番号に対応する前記検索側平均スペクトルを補正するものであり、
前記音源データベース検索手段、前記特徴データ補正手段は、所定の条件を満たすまで、繰り返し処理を実行することを特徴とする複数音源の識別装置。 A device for acquiring sounds emitted from a plurality of sound sources and identifying each sound source,
a sound source database in which, for each sound source, registered feature data expressing its features and identification information specifying the sound source are registered in association with each other;
acoustic signal acquisition means for recording the sounds emitted from the plurality of sound sources and acquiring them as a digital acoustic signal;
feature data generation means for performing frequency analysis on the acoustic signal and generating search feature data based on a search-side average spectrum, which is a spectrum averaged over time;
sound source database search means for performing a correlation calculation between the generated search feature data and each item of registered feature data registered in the sound source database, and specifying, by the identification information, the sound source corresponding to the registered feature data that has the maximum correlation value among the obtained correlation values and whose correlation value is equal to or greater than a predetermined threshold; and
feature data correction means for correcting the search feature data using the registered feature data corresponding to the sound source specified by the sound source database search means, and generating corrected feature data,
wherein the sound source database search means executes the process of specifying the sound source using the generated corrected feature data and the registered feature data registered in the sound source database, excluding sound sources already specified,
wherein the feature data correction means uses the registered feature data registered in the sound source database in association with the sound source specified by the sound source database search means to create a registration-side average spectrum, which is the time-averaged spectrum on which the registered feature data is based; for the frequency number that gives the maximum component of the created registration-side average spectrum, divides the corresponding value of the search-side average spectrum by that maximum component; and corrects the search-side average spectrum at each frequency number, including the frequency number giving the maximum component, by subtracting from the search-side average spectrum the product of that quotient and the registration-side average spectrum corresponding to the specified sound source, and
wherein the sound source database search means and the feature data correction means repeat their processing until a predetermined condition is satisfied.

複数の音源から発せられる音を取得して、それぞれの音源を識別する装置であって、
各音源について、その特徴を表現した登録特徴データと、各音源を特定する識別情報が対応付けて登録された音源データベースと、
前記複数の音源から発せられる音を録音してデジタルの音響信号として取得する音響信号取得手段と、
前記音響信号に対して周波数解析を行い、時間的に平均化したスペクトルである検索側平均スペクトルの各周波数成分に、それぞれ周波数値が大きくなるのに伴って大きくなる重みを乗算した値に基づいて検索特徴データを生成する特徴データ生成手段と、
前記生成された検索特徴データと前記音源データベースに登録されている登録特徴データの各々と相関計算を行い、得られた相関値の中で、最大の相関値をもち、かつ当該相関値が所定のしきい値以上を満たす登録特徴データに対応する音源を前記識別情報により特定する音源データベース検索手段と、
前記音源データベース検索手段により特定された音源に対応する登録特徴データを用いて、前記検索特徴データを補正し、補正特徴データを生成する特徴データ補正手段と、を有し、
前記音源データベース検索手段は、前記生成された補正特徴データと、既に特定された音源を除外した前記音源データベースに登録されている登録特徴データを用いて、前記音源を特定する処理を実行し、
前記音源データベース検索手段、前記特徴データ補正手段は、所定の条件を満たすまで、繰り返し処理を実行することを特徴とする複数音源の識別装置。 A device for acquiring sounds emitted from a plurality of sound sources and identifying each sound source,
a sound source database in which, for each sound source, registered feature data expressing its features and identification information specifying the sound source are registered in association with each other;
acoustic signal acquisition means for recording the sounds emitted from the plurality of sound sources and acquiring them as a digital acoustic signal;
feature data generation means for performing frequency analysis on the acoustic signal and generating search feature data based on values obtained by multiplying each frequency component of a search-side average spectrum, which is a spectrum averaged over time, by a weight that increases as the frequency value increases;
sound source database search means for performing a correlation calculation between the generated search feature data and each item of registered feature data registered in the sound source database, and specifying, by the identification information, the sound source corresponding to the registered feature data that has the maximum correlation value among the obtained correlation values and whose correlation value is equal to or greater than a predetermined threshold; and
feature data correction means for correcting the search feature data using the registered feature data corresponding to the sound source specified by the sound source database search means, and generating corrected feature data,
wherein the sound source database search means executes the process of specifying the sound source using the generated corrected feature data and the registered feature data registered in the sound source database, excluding sound sources already specified, and
wherein the sound source database search means and the feature data correction means repeat their processing until a predetermined condition is satisfied.

請求項２において、
前記特徴データ補正手段は、前記音源データベース検索手段により特定された音源に対応する登録特徴データと、前記検索特徴データの相関値を求め、当該相関値を前記特定された登録特徴データに乗じたものを前記検索特徴データから減じることにより前記補正特徴データを生成するものであることを特徴とする複数音源の識別装置。 In claim 2 ,
The multiple sound source identification device, wherein the feature data correction means obtains a correlation value between the registered feature data corresponding to the sound source specified by the sound source database search means and the search feature data, and generates the corrected feature data by multiplying the specified registered feature data by the correlation value and subtracting the product from the search feature data.

請求項１において、
前記特徴データ生成手段は、前記検索側平均スペクトルの各周波数成分に、それぞれ周波数値が大きくなるのに伴って大きくなる重みを乗算した値に基づいて前記検索特徴データを生成するものであることを特徴とする複数音源の識別装置。 In claim 1,
The multiple sound source identification device, wherein the feature data generation means generates the search feature data based on values obtained by multiplying each frequency component of the search-side average spectrum by a weight that increases as the frequency value increases.

請求項２から請求項４のいずれか一項において、
前記特徴データ生成手段は、前記検索特徴データとして、前記各周波数成分に、前記周波数値が大きくなるのに伴って大きくなる重みを乗算した値から、さらに、それぞれ周波数値が大きくなるのに伴って大きくなる重みを乗算した値の平均値を減じて偏差ベクトルを生成することを特徴とする複数音源の識別装置。 In any one of Claims 2-4 ,
The multiple sound source identification device, wherein the feature data generation means generates, as the search feature data, a deviation vector by taking the values obtained by multiplying each frequency component by a weight that increases as the frequency value increases, and subtracting from them the average of those weighted values.

請求項１から請求項５のいずれか一項において、
前記特徴データ生成手段は、前記音響信号取得手段により取得された音響信号の振幅が所定の値未満で所定の時間以上連続する無音区間を特定し、当該特定された無音区間を削除して音響信号を時間的に短縮する補正を実行し、当該補正された音響信号に対して、前記周波数解析を行うことを特徴とする複数音源の識別装置。 In any one of Claims 1-5,
The multiple sound source identification device, wherein the feature data generation means identifies silent sections in which the amplitude of the acoustic signal acquired by the acoustic signal acquisition means stays below a predetermined value for at least a predetermined time, deletes the identified silent sections to shorten the acoustic signal in time, and performs the frequency analysis on the corrected acoustic signal.

請求項１から請求項６のいずれか一項において、
前記特徴データ生成手段は、前記周波数解析として、前記音響信号に対して、所定の区間単位に分割し、分割した各区間の波形データに同区間長にあらかじめ定義された重み関数を重畳した波形データに対してフーリエ変換を行い、各区間ごとに実部と虚部の２乗平均値をとって実数化したスペクトルを全区間に渡って平均化することにより前記検索側平均スペクトルを得ることを特徴とする複数音源の識別装置。 In any one of Claims 1-6,
The multiple sound source identification device, wherein, as the frequency analysis, the feature data generation means divides the acoustic signal into sections of a predetermined length, applies a predefined weight function of the same section length to the waveform data of each section, performs a Fourier transform on the weighted waveform data, converts the result for each section to a real-valued spectrum by taking the mean square value of the real and imaginary parts, and averages these spectra over all sections to obtain the search-side average spectrum.

請求項７において、
前記各区間は隣接する区間どうしで区間長の１／２の時間幅だけ重複しており、
前記重み関数は時間軸方向に対して非対称な形状で２種類定義されており、奇数番目の区間に対しては、一方の重み関数を重畳し、偶数番目の区間に対しては、他方の重み関数を重畳するようにしていることを特徴とする複数音源の識別装置。 In claim 7,
The multiple sound source identification device, wherein adjacent sections overlap each other by a time width of 1/2 of the section length, and
two weight functions, each asymmetric along the time axis, are defined, one being applied to odd-numbered sections and the other to even-numbered sections.

請求項１から請求項８のいずれか一項に記載の複数音源の識別装置により特定された前記識別情報を用いる装置であって、前記識別情報ごとに異なる処理を行うプログラムを記憶しておき、前記複数音源の識別装置から出力された識別情報に対応する前記プログラムを起動し、当該プログラムに従った処理を実行することを特徴とする複数音源に連動する情報処理装置。 An information processing device linked to multiple sound sources, which uses the identification information specified by the multiple sound source identification device according to any one of claims 1 to 8, wherein the information processing device stores programs that perform different processing for each item of identification information, launches the program corresponding to the identification information output from the multiple sound source identification device, and executes processing according to that program.

請求項１から請求項８のいずれか一項に記載の複数音源の識別装置としてコンピュータを機能させるためのプログラム。 A program for causing a computer to function as the multiple sound source identification device according to any one of claims 1 to 8.

請求項９に記載の複数音源に連動する情報処理装置としてコンピュータを機能させるためのプログラム。 A program for causing a computer to function as an information processing apparatus linked to a plurality of sound sources according to claim 9.