JP2871204B2 - Music transcription device

Info

Publication number: JP2871204B2
Application number: JP3208367A
Authority: JP
Inventors: 直樹柴多
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1991-08-21
Filing date: 1991-08-21
Publication date: 1999-03-17
Anticipated expiration: 2014-03-17
Also published as: JPH0546164A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a music transcription device, and in particular to a musical-tone transcription device that transcribes instruments or singing whose pitch fluctuates.

[0002]

[Prior Art] Conventionally, for example in Interface magazine (CQ Publishing), February 1991 issue, pp. 211-229 (hereinafter Reference 1), the pitch of a musical tone or singing voice was obtained from autocorrelation coefficients and simply compared against the ranges of pitch categories on an absolute pitch scale to determine the note. The power of the input sound was used only for segmenting the sound and was not used in the pitch calculation.

[0003]

[Problem to Be Solved by the Invention] However, when transcribing with the above method, the ranges of the pitch categories were fixed, so when the pitch fluctuated strongly, as with vibrato, a single tone was sometimes broken up into a series of fine pitch segments.

[0004] An object of the present invention is to provide a musical-tone transcription device that can determine pitch stably and perform transcription.

[0005]

[Means for Solving the Problem] The first invention is an automatic transcription device that creates a musical score from an instrumental performance or singing, characterized by comprising: a bandpass filter bank unit capable of obtaining the power spectrum of the performance or singing over a plurality of channels; a competitive recall neural network unit that receives as input the per-channel outputs obtained from the bandpass filter bank unit; a pitch buffer unit that reads the output of the competitive recall neural network unit; a pitch storage unit that stores the pitch data output by the pitch buffer unit; and a read timing generation unit that generates the timing at which the pitch buffer unit reads the output of the competitive recall neural network unit.

[0006] The second invention is an automatic transcription device that creates a musical score from an instrumental performance or singing, characterized by comprising: a bandpass filter bank unit capable of obtaining the power spectrum of the performance or singing over a plurality of channels; a competitive recall neural network unit that receives as input the per-channel outputs obtained from the bandpass filter bank unit; a feedforward neural network unit that receives as input the output of the competitive recall neural network unit; a pitch buffer unit that reads the output of the feedforward neural network unit; a pitch storage unit that stores the pitch data output by the pitch buffer unit; and a read timing generation unit that generates the timing at which the pitch buffer unit reads the output of the feedforward neural network unit.

[0007]

[Operation] The first basic principle of the present invention is to transcribe pitch stably, when transcribing musical tones or singing whose pitch fluctuates finely, by exploiting the hysteresis in the input-output characteristics of a recurrent neural network. The second basic principle is to assign note names to the transcribed pitches by using the associative-memory property of a feedforward neural network.

[0008] Conventionally, for example in Reference 1, the distance between the fundamental frequency of a musical tone, obtained by fast Fourier transform (FFT) or from autocorrelation coefficients, and the frequency of each note name was measured, and the note name giving the shortest distance was taken as the pitch. The segment over which a pitch was determined was decided by thresholding the volume.
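
The nearest-note rule attributed to Reference 1 can be sketched as follows. This is only an illustration, not code from Reference 1: the function name `nearest_note`, the use of twelve-tone equal temperament with A4 = 440 Hz, and measuring distance on a logarithmic (semitone) axis are all assumptions made here.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def nearest_note(freq_hz, a4_hz=440.0):
    """Return the equal-tempered note name closest to freq_hz (assumed A4 = 440 Hz)."""
    # Distance from A4 in semitones, rounded to the nearest pitch category.
    semitones = round(12.0 * math.log2(freq_hz / a4_hz))
    midi = 69 + semitones  # MIDI numbering is used only as a convenient index
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)

# A pitch that wobbles across a fixed category boundary is split into two notes.
print([nearest_note(f) for f in (440.0, 450.0, 455.0, 450.0)])
```

With fixed category boundaries, a small excursion such as 455 Hz falls into the neighbouring category, which is the fragmentation the invention sets out to avoid.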

[0009] However, these methods alone could not always handle the case, common in actual performance or singing, where the pitch changes without a break in volume. The temporal change of the pitch was therefore examined, and a new note was assigned at the point where the change became large. Determining the pitch segments from the temporal change of the pitch had the side effect that the segments became too fine when processing material produced by playing or singing techniques such as vibrato.

[0010] The present invention uses a recurrent neural network model in which the region that fires is constrained by past inputs, in order to absorb those pitch fluctuations that are small in width and relatively short in period. A feedforward neural network model is then used to give note names to the pitch categories from which the fluctuation has been absorbed.

[0011] The model described below is used as the recurrent neural network model.

[0012] Consider one-dimensional layers u₁ and u₂, each consisting of one type of neuron. A coordinate x is introduced to specify position within a layer, and u(x, t) denotes the internal state of the node at position x. Here u is taken to be the average over the nodes around position x, and fine fluctuations of individual nodes are ignored. The field is further assumed to be isotropic with respect to x: nodes at short distances are coupled excitatorily, nodes at long distances are coupled inhibitorily, and the coupling weights are uniform over the field. The output function f(u) of a node is given by the following equation.

[0013]

(Equation 1)

[0014] Further, assuming that the connection from u₂ to u₁ is extremely narrow and that u₂ has no self-connection, the equations describing the states of u₁ and u₂ can be written as follows.

[0015]

(Equation 2)

[0016]

(Equation 3)

[0017] Here s_i and h_i are, respectively, the level of uniform external stimulus to u_i and the threshold of u_i; w_ij is the coupling weight from u_j to u_i; and τ_i is the time constant of the state change of u_i.

[0018] Under this setting, the following three kinds of internal state can appear in the field.
・No node is in the firing state.
・The whole field is uniformly in the firing state.
・A localized excitation region of finite size is in the firing state.

[0019] Which of these appears depends on the relative magnitudes of s (the level of uniform external stimulus), h (the threshold) and w(x′−x) (the shape of the coupling weights). In what follows, the cases where no node fires and where the whole field fires are excluded.

[0020] When a localized excitation region exists, the field has the following properties.
・When a weak, non-uniform, stationary input s(x) is given, the localized excitation region moves in the direction in which s(x) increases and stops at the position of a local maximum.
・Two localized excitations either attract or repel each other, with a distance x_d determined by w₁₁(x′−x) as the boundary: they attract if their separation is x_d or less and repel if it is x_d or more.
・In a field that admits multiple localized excitations, a certain number of them can coexist stably through their interaction.

[0021] The neural field described above is treated in detail in "Mathematics of Neural Networks" (神経回路網の数理, Shun-ichi Amari, Sangyo Tosho; hereinafter Reference 2). Although the field is described there as continuous in position, when it is simulated on a computer it can be approximated to good accuracy by taking a fine mesh for the discrete representation of the field.
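
A discretized field of this kind can be sketched as below. Because Equations 1 to 3 are not reproduced in this text, everything specific here is an assumption: a single layer with lateral inhibition stands in for the u₁/u₂ pair, the output function is taken to be a step at u > 0, and the kernel shape, threshold, time constant and all numeric values are illustrative only.

```python
import math

def step(u):
    # Assumed output function: a node fires (1) when its internal state is positive.
    return 1.0 if u > 0.0 else 0.0

def kernel(d, r_exc=3, r_inh=10, a_exc=0.3, a_inh=0.2):
    # Assumed coupling: excitatory for near nodes, inhibitory for farther ones.
    if d <= r_exc:
        return a_exc
    if d <= r_inh:
        return -a_inh
    return 0.0

def simulate(u, stimulus, h=0.5, tau=1.0, dt=0.1, steps=200):
    """Euler-integrate tau * du/dt = -u + sum_j w(|i-j|) f(u_j) + s_i - h on a ring."""
    n = len(u)
    for _ in range(steps):
        out = [step(v) for v in u]
        new_u = []
        for i in range(n):
            lateral = 0.0
            for j in range(n):
                d = min(abs(i - j), n - abs(i - j))  # distance on the ring
                lateral += kernel(d) * out[j]
            new_u.append(u[i] + dt / tau * (-u[i] + lateral + stimulus[i] - h))
        u = new_u
    return u

n = 60
bump = [1.5 * math.exp(-((i - 30) ** 2) / 18.0) for i in range(n)]
u_on = simulate([0.0] * n, bump)      # stimulus present
u_off = simulate(u_on, [0.0] * n)     # stimulus removed
firing_on = [i for i, v in enumerate(u_on) if v > 0.0]
firing_off = [i for i, v in enumerate(u_off) if v > 0.0]
print(firing_on, firing_off)
```

Under these assumed parameters a localized group of nodes around the stimulus fires, and a narrower group keeps firing after the stimulus is removed, which illustrates how past input can constrain the firing region.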

[0022] Next, a method of categorizing pitch using a combination of u₁ and u₂ as described above is explained. A pair of layers coupled excitatorily and inhibitorily like u₁ and u₂, whose time constants τ₁ and τ₂ are equal, is hereinafter denoted U for convenience.

[0023] Of two U's with different time constants, the one with the longer time constant is called U₁ and the one with the shorter time constant U₂. Input to and output from each U both act on its u₁. U₁ and U₂ are connected bidirectionally with random weights. In a discretely represented system, this corresponds to full connection between the grid points of the two meshes.

[0024] The output of a bandpass filter (BPF) bank is input to U₂. The center frequencies of the BPFs are spaced at a constant frequency ratio narrower than a semitone (minor second) of the score, and their outputs are fed to U₂ at equally spaced positions. Instrument sounds generally contain harmonic components in addition to the fundamental. Suppose that a tone containing the fundamental frequency f and its integer-multiple overtones 2f, 3f and 4f is input to the BPF bank. Depending on the magnitude of the input to U₂, the regions on U₂ corresponding to f, 2f, 3f and 4f fire, and depending on the state of the connections between U₁ and U₂, some regions on U₁ may also fire. When both ends of a connection have fired, its weight is increased; when only the U₂ side has fired, its weight is decreased. This kind of neural network learning is described in detail in "Parallel Distributed Processing", vol. 1, pp. 151-193 (the MIT Press). By modifying the weights in this way over the entire frequency range of the BPF bank, one eventually obtains a neural network model in which the regions on U₁ corresponding to the four partials presented on U₂ are activated.
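
The weight-change rule described above (strengthen a connection when both ends fire, weaken it when only the U₂ side fires) might be sketched as follows. The function name `update_weights`, the fixed step `rate`, and the representation of firing as boolean lists are assumptions for illustration; the source does not state the size or form of the weight change.

```python
def update_weights(w, fired_u1, fired_u2, rate=0.1):
    """Return updated weights; w[i][j] links U2 node j to U1 node i."""
    new_w = [row[:] for row in w]
    for i, post in enumerate(fired_u1):
        for j, pre in enumerate(fired_u2):
            if pre and post:
                new_w[i][j] += rate      # both ends fired: strengthen
            elif pre and not post:
                new_w[i][j] -= rate      # only the U2 side fired: weaken
    return new_w

w = [[0.5, 0.5], [0.5, 0.5]]
print(update_weights(w, fired_u1=[True, False], fired_u2=[True, False]))
```

Connections whose U₂ end did not fire are left unchanged in this sketch, since the text only specifies the two cases above.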

[0025] When sound actually enters the trained network, the internal states of the regions on U₁ and U₂, that is, u₁ of each of U₁ and U₂, change according to their time constants, and the regions where u₁ > 0 fire. A firing region on U₁ activates the corresponding region on U₂ through the bidirectional connections, and the activated region may itself fire if the input is large enough. Likewise, a firing region on U₂ activates a region on U₁, but unless the input to that region on U₁ is sufficiently strong the region does not fire and only its internal state rises. However, because of the difference in time constants between U₁ and U₂, the internal state of U₁ is retained for a while after the stimulus has ceased.
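
The effect of the two time constants can be illustrated with a single leaky node per layer. This is a deliberately reduced sketch: the first-order form tau * du/dt = -u + s, the values of `tau`, and the helper name `leaky_trace` are assumptions, and lateral coupling is omitted.

```python
def leaky_trace(tau, stimulus, dt=1.0):
    """Integrate tau * du/dt = -u + s with forward Euler and return the trajectory."""
    u, trace = 0.0, []
    for s in stimulus:
        u += dt / tau * (-u + s)
        trace.append(u)
    return trace

stimulus = [1.0] * 20 + [0.0] * 10                # input present, then removed
slow = leaky_trace(tau=20.0, stimulus=stimulus)   # stands in for U1 (long time constant)
fast = leaky_trace(tau=2.0, stimulus=stimulus)    # stands in for U2 (short time constant)
print(round(slow[-1], 3), round(fast[-1], 3))
```

Ten steps after the input stops, the slow node still holds a substantial internal state while the fast node has decayed almost to zero.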

[0026] Furthermore, an input containing the partials f, 2f, 3f and 4f activates on U₁ not only the region corresponding to the pitch whose fundamental is f but also the regions corresponding to each of the overtones 2f, 3f and 4f. Regions that have each of these frequencies as an overtone are also activated. As a result, u(x) is activated as shown in FIG. 3. The frequency ratios of the pitch categories in the twelve-tone scale used in ordinary music are roughly as shown in FIG. 5. In other words, the regions activated by one tone and its overtones are the regions corresponding to relative pitches in the twelve-tone scale.
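
The claim that overtones land on positions of the twelve-tone scale can be checked with a short calculation. FIG. 5 is not available in this text, so the sketch below simply assumes twelve-tone equal temperament; the function name `harmonic_offsets` is hypothetical.

```python
import math

def harmonic_offsets(harmonics=(1, 2, 3, 4)):
    """Semitone distance of each harmonic k*f above the fundamental f (12-TET assumed)."""
    return [12.0 * math.log2(k) for k in harmonics]

for k, semis in zip((1, 2, 3, 4), harmonic_offsets()):
    print(k, round(semis, 2), "-> pitch class offset", round(semis) % 12)
```

The second and fourth harmonics fall exactly on octaves, and the third lies about 19.02 semitones up, within two hundredths of a semitone of an octave plus a perfect fifth.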

[0027] Because an activated region fires more readily than its surroundings, it tends to fire when subsequent input to U₂ increases the input to that region. Even if the pitch obtained from the musical tone or singing input to U₂ is unstable, particularly at the onset of power, the existing distribution of activation on U₁ means that the neighborhood of the region corresponding to a pitch category on the relative scale determined by the earlier inputs is activated and fires, so the pitch category can be determined stably on U₁.

[0028] When the pitch of the input to U₂ that has caused firing on U₁ fluctuates slightly, feedback through the mutual connections between the firing region on U₁ and the region on U₂ that drives it keeps the firing region on U₁ from moving as fast as the firing region on U₂. The pitch category can therefore be held stably on U₁.

[0029] When a pitch appears that does not fit the relative pitches input so far, the region corresponding to the input is first activated and fires. After this input disappears, however, the firing region moves toward the activation peak corresponding to a relative pitch that was input earlier. The reason is property 1 described above (the first property listed in [0020]). Consequently, a pitch input that did not fit the relative pitch categories ends up, through the movement of the firing region, classified on U₁ into the same relative pitch categories as before.

[0030] In the end, the position of the firing region that appears on U₁ gives a pitch class that takes account of the overtones of the tone input to U₂. For tones that follow absolute pitch, that is, tones from a tuned instrument, where for example the A in a given octave is tuned to 440 Hz, the pitch that has appeared on U₁ can be determined from the positions of the firing regions that appear when known pitches are input in advance. Where only relative pitch holds, as in unaccompanied singing, the tonic of the key must be determined before the music can be notated.

[0031] In the present invention, a feedforward neural network is used to determine the tonic of the key.

[0032] The backpropagation learning method used to train this network is described in detail in "Parallel Distributed Processing", Vol. 3, MIT Press (1987), pp. 121-159.

[0033] The model is generally composed hierarchically of three kinds of layers, as shown in FIG. 4, called the input layer, the intermediate (hidden) layer and the output layer. Each layer contains processing elements called units; each unit receives input from the units of the adjacent layer on the input side and sends its output to the adjacent layer on the output side. The input-output relation of each unit is given as follows.

[0034]

(Equation 4)

[0035]

(Equation 5)

[0036]

(Equation 6)

[0037] Here x, y and θ denote the input to a unit, the output from the unit, and the unit's threshold, respectively; the superscript is an integer indicating the layer counted from the input layer, and the subscript is the number of the unit within the layer. Also,

[0038]

(Equation 7)

[0039] is the weight representing the connection from unit i of layer (n−1) to the adjacent layer on the output side, and f(x) is the input-output response function of a unit, as in Equation 9 below.

[0040] When data is presented to the input layer of this model, it propagates successively from lower layers (those nearer the input layer) to the adjacent upper layers (those nearer the output layer). The resulting output of the output layer is the inference result for the given input data. In the present invention, the model is constructed so that when u₁ on U₁ is presented to the input layer, the output-layer node representing the corresponding tonic fires.
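
Layer-by-layer propagation of this kind can be sketched as follows. Equations 4 to 6 are not reproduced in this text, so the unit model used here (weighted sum minus threshold, passed through a logistic sigmoid) is an assumption, as are the names `forward` and `sigmoid` and the toy weights.

```python
import math

def sigmoid(x):
    # Assumed response function; the patent's own f(x) is not reproduced in this text.
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, inputs):
    """Propagate inputs through a list of (weights, thresholds) pairs, one per layer."""
    y = list(inputs)
    for weights, thresholds in layers:
        # Each unit sums weighted outputs of the previous layer and subtracts its threshold.
        y = [sigmoid(sum(w * v for w, v in zip(row, y)) - theta)
             for row, theta in zip(weights, thresholds)]
    return y

hidden = ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])
output = ([[2.0, -2.0]], [0.0])
print(forward([hidden, output], [1.0, 0.0]))
```

In the device, the input vector would be the sampled u₁ of U₁ and the output layer would have one node per candidate tonic position.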

[0041] Next, the backpropagation learning method, which determines the coupling weights w between units so that the model performs the desired inference, is described. The training data are time series of symbols representing melodies of the same meter, the same key and the same section length, extracted from various pieces of music. These data are presented to the input layer, the output layer is presented with time series of symbols representing the chords assigned to those melodies with harmonic rhythm taken into account, and backpropagation learning is carried out repeatedly. In backpropagation learning, the desired inference result for the input data is given as a teacher signal, and the coupling weights between units are corrected repeatedly so as to reduce the difference between the network model's current output (inference result) and the teacher signal. With

[0042]

(Equation 8)

[0043] denoting the output of the i-th unit of the output layer and T_i the corresponding teacher signal, this is equivalent to finding the inter-unit weights that minimize the value of the error function given by Equation 9.

[0044]

(Equation 9)

[0045] Algorithms that minimize the value of an error function such as Equation 9 in a model of this kind are described in detail in the reference cited above ("Parallel Distributed Processing", Vol. 3, MIT Press (1987), pp. 121-159).

[0046] The input to this network is u₁ of U₁ immediately after the input of the musical tones to be transcribed has ended. This u₁ retains activation that reflects the pitches that occurred in the transcribed passage. The output layer of the network is given as many nodes as the per-octave resolution of the device implementing U₁. For training, u₁ of U₁ obtained by playing known pieces at absolute pitch is input, the teacher signal is set to 1 for the node corresponding to the tonic and 0 for all others, and backpropagation learning is performed.
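
The teacher signal described above, and the error it is compared against, might look like this. Equation 9 is not reproduced in this text, so the half-sum-of-squares error is an assumption, and the names `tonic_target` and `squared_error` are hypothetical.

```python
def tonic_target(tonic_index, n_nodes):
    """Teacher signal: 1 at the node for the tonic, 0 elsewhere."""
    return [1.0 if i == tonic_index else 0.0 for i in range(n_nodes)]

def squared_error(output, target):
    # Assumed error function: half the sum of squared differences.
    return 0.5 * sum((t - y) ** 2 for y, t in zip(output, target))

target = tonic_target(tonic_index=2, n_nodes=12)
guess = [0.0] * 12
guess[2] = 0.8
print(squared_error(guess, target))
```

Here twelve output nodes are used only to keep the example short; in the text the number of output nodes equals the per-octave resolution of U₁.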

[0047] The procedure for finding the tonic with the trained feedforward neural network is as follows. If the resolution of one octave on U₁ is N, a semitone of the twelve-tone scale corresponds to N/12. The u₁ of U₁ at the time the input for transcription ended is fed to the feedforward network with position offsets in the range ±N/24. This operation yields N/12 outputs of the feedforward network, denoted V_i, i = 1, …, N/12; each V_i has dimension N. If j is the position of the node that gives the maximum among the values in the V_i, the tonic is obtained as the j found when the output of U₁ is read with a shift of i.
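
The offset search can be sketched as below. The function name `find_tonic`, the circular shift used to apply an offset, and the stand-in `toy_network` are assumptions for illustration; a real system would pass the trained feedforward network instead.

```python
def find_tonic(u1, network, n_per_octave):
    """Try N/12 read offsets within about +/- N/24 and return (offset, node) of the peak."""
    half = n_per_octave // 24
    best = None
    for offset in range(-half, n_per_octave // 12 - half):
        # Read u1 with a position offset (circular shift assumed here).
        shifted = [u1[(k + offset) % len(u1)] for k in range(len(u1))]
        v = network(shifted)
        j = max(range(len(v)), key=lambda idx: v[idx])
        if best is None or v[j] > best[0]:
            best = (v[j], offset, j)
    return best[1], best[2]

def toy_network(u):
    # Hypothetical stand-in: responds only at even (grid-aligned) positions.
    return [u[k] if k % 2 == 0 else 0.0 for k in range(len(u))]

u1 = [0.0] * 24
u1[5] = 1.0                      # activation peak sitting off the semitone grid
print(find_tonic(u1, toy_network, n_per_octave=24))
```

With N = 24 there are two offsets to try, and only the shifted reading lines the peak up with a node the stand-in network responds to.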

[0048] Since j corresponds to a note name, the tonic of the input musical tones or singing can be found by obtaining from j the position on U₁ that corresponds to the pitch and then shifting it by i. Once the position of the tonic is known, for singing and the like that is not based on absolute pitch, the solfège names (scale-degree names) can be determined from positions relative to the tonic's position on U₁. In this case there is no note name corresponding to the tonic, so it is sufficient to obtain the frequency of the tonic and the solfège names relative to the tonic.
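
Assigning solfège names from positions relative to the tonic can be sketched as follows. The chromatic movable-do syllable table, the function name `solfege_name`, and the example positions are assumptions; the source only states that names follow from the position relative to the tonic.

```python
SOLFEGE = ["do", "di", "re", "ri", "mi", "fa", "fi", "sol", "si", "la", "li", "ti"]

def solfege_name(position, tonic_position, n_per_octave):
    """Map a position on U1 to a movable-do syllable relative to the tonic position."""
    step = n_per_octave / 12.0                       # nodes per semitone
    semitones = round((position - tonic_position) / step)
    return SOLFEGE[semitones % 12]

print(solfege_name(position=62, tonic_position=20, n_per_octave=72))
```

With 72 nodes per octave, a position 42 nodes above the tonic is seven semitones up, which this table names "sol".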

[0049]

[Embodiment] FIG. 1 is a block diagram showing an embodiment of the transcription device according to the first invention. This transcription device has a pitch storage unit 11, a pitch buffer unit 12, a competitive recall neural network unit 13, a bandpass filter bank unit 14, and a read timing generation unit 15. The bandpass filter bank unit 14 is a set of bandpass filters whose center frequencies are spaced at equal intervals narrower than a semitone (minor second) of the twelve-tone scale. An external audio signal is converted by this bandpass filter bank unit into a power envelope for each frequency channel. The competitive recall neural network unit 13 obtains the pitch category from the power envelopes obtained from the bandpass filter bank unit 14. The pitch buffer unit 12 holds the pitch category output by the competitive recall neural network unit 13. The read timing generation unit 15 generates the pitch sampling interval needed for transcription. The pitch storage unit 11 takes in pitch data from the pitch buffer unit 12 at the timing output by the read timing generation unit 15 and records it together with the read time.
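
The spacing of the filter bank's center frequencies can be sketched as follows, taking the constant frequency ratio stated in paragraph [0024]. The choice of 36 channels per octave (a third of a semitone), the lowest frequency of 110 Hz, and the function name `center_frequencies` are assumptions; the source only requires a spacing narrower than a semitone.

```python
def center_frequencies(f_low, n_octaves, channels_per_octave=36):
    """Centre frequencies spaced at a constant ratio narrower than a semitone."""
    ratio = 2.0 ** (1.0 / channels_per_octave)   # 36 per octave = a third of a semitone
    count = n_octaves * channels_per_octave + 1
    return [f_low * ratio ** k for k in range(count)]

freqs = center_frequencies(f_low=110.0, n_octaves=2)
print(len(freqs), round(freqs[36], 2), round(freqs[-1], 2))
```

Each channel's power envelope would then be fed to an equally spaced position on U₂.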

[0050] In the transcription device configured as above, the read timing generation unit 15 generates pitch read timings at intervals sufficiently shorter than the note lengths to be transcribed. To convert the result to a score, the note lengths must be determined from the read times, the overall length, and the performance tempo of the piece.
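
One way to turn timed pitch readings into note lengths is sketched below. The source does not specify this procedure; merging consecutive equal readings, the function name `readings_to_notes`, and expressing tempo as `beat_seconds` (seconds per beat) are all assumptions.

```python
def readings_to_notes(readings, beat_seconds):
    """Merge consecutive equal pitch readings into (pitch, length_in_beats) notes.

    readings is a list of (time_seconds, pitch) sampled at a fixed interval."""
    if len(readings) < 2:
        return []
    interval = readings[1][0] - readings[0][0]
    notes = []
    start, current = readings[0]
    for t, pitch in readings[1:]:
        if pitch != current:
            notes.append((current, (t - start) / beat_seconds))
            start, current = t, pitch
    end = readings[-1][0] + interval          # the last reading covers one more interval
    notes.append((current, (end - start) / beat_seconds))
    return notes

readings = [(0.0, "C"), (0.25, "C"), (0.5, "D"), (0.75, "D"), (1.0, "D"), (1.25, "D")]
print(readings_to_notes(readings, beat_seconds=0.5))
```

At 120 beats per minute (0.5 s per beat), the readings above become a one-beat C followed by a two-beat D.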

[0051] FIG. 2 is a block diagram showing an embodiment of the transcription device according to the second invention. This transcription device has a pitch storage unit 21, a pitch buffer unit 22, a feedforward neural network unit 23, a competitive recall neural network unit 24, a bandpass filter bank unit 25, and a read timing generation unit 26. The bandpass filter bank unit 25 is a set of bandpass filters whose center frequencies are spaced at equal intervals narrower than a semitone (minor second) of the twelve-tone scale. An external audio signal is converted by this bandpass filter bank unit into a power envelope for each frequency channel. The competitive recall neural network unit 24 obtains the pitch category from the power envelopes obtained from the bandpass filter bank unit 25. The feedforward neural network unit 23 takes the pitch category output by the competitive recall neural network unit 24 as input, searches for the tonic, determines the position offset to be used when reading the output of the competitive recall neural network unit 24, and outputs that offset. The pitch buffer unit 22 uses the offset output by the feedforward neural network unit 23 to obtain, from the output of the competitive recall neural network unit 24, the pitch category as a pitch relative to the tonic, and holds it. The read timing generation unit 26 generates the pitch sampling interval needed for transcription. The pitch storage unit 21 takes in pitch data from the pitch buffer unit 22 at the timing output by the read timing generation unit 26 and records it together with the read time.

[0052] In the transcription device configured as above, the read timing generation unit 26 generates pitch read timings at intervals sufficiently shorter than the note lengths to be transcribed. To convert the result to a score, the note lengths must be determined from the read times, the overall length, and the performance tempo of the piece.

[0053] The device of FIG. 6 is a concrete example implementing the competitive recall neural network units 13 and 24, the feedforward neural network unit 23, and the pitch storage units 11 and 21.

[0054] This device has a microprocessor 51, a ROM 52, a RAM 53, an output FIFO 54, an output magnetic disk 55, and an interrupt signal line 56.

[0055] The ROM 52 stores a program that computes the state transitions of the field given by Equations 2 and 3, a control program for writing data to the output FIFO 54 and the output magnetic disk 55, and a program that performs the feedforward computation.

[0056] The output magnetic disk 55 implements the pitch storage units 11 and 21. The output FIFO 54 implements the pitch buffer units 12 and 22.

[0057] The RAM 53 is used to hold the state of the field in the competitive recall neural network units 13 and 24 and the state of the nodes in the feedforward neural network unit 23.

[0058] In operation, the microprocessor 51 computes the state transitions of the field according to Equations 2 and 3, based on the output of the bandpass filter bank. The interrupt signal line 56 delivers an external read signal to the microprocessor 51. On receiving the interrupt signal, the microprocessor 51 reads the state transition of the field from the RAM 53 and outputs it to the output FIFO unit 54.

[0059] Thereafter, in one embodiment, the output FIFO unit 54 determines the position of the firing region of U₁ and outputs it to the output magnetic disk 55.
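As an illustration of locating a firing region, the following sketch treats the field state U₁ as a one-dimensional sequence of activations and returns the centre of the widest contiguous run above a threshold. The threshold of zero, the choice of the widest run, and the name `firing_region_position` are assumptions. The specification does not state how the position is computed.

```python
def firing_region_position(u, threshold=0.0):
    """Return the centre index of the widest contiguous run where u > threshold.

    u: sequence of field activations (assumed one value per field position).
    Returns a float index, or None if no position is firing.
    """
    best_start, best_len = None, 0
    start = None
    # A sentinel equal to the threshold closes a run that reaches the end.
    for i, value in enumerate(list(u) + [threshold]):
        if value > threshold:
            if start is None:
                start = i  # a new firing run begins here
        elif start is not None:
            run_len = i - start
            if run_len > best_len:
                best_start, best_len = start, run_len
            start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) / 2.0
```

For the field state `[-1, -1, 0.2, 0.9, 0.3, -1, 0.1, -1]` the widest firing run covers indices 2 to 4, so the function returns 3.0.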

[0060] In another embodiment, the output FIFO unit 54 evaluates the equations, further determines the offset for reading U₁, and thereafter uses that offset to output to the output magnetic disk 55 the solfège name (the scale degree relative to the tonic) computed by subtracting the offset.
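The offset subtraction can be sketched as follows, under the assumption that field positions are expressed in semitone steps and that names wrap every twelve semitones. The table of chromatic movable-do syllables and the function name `scale_degree_name` are illustrative choices, not taken from the specification.

```python
# Ascending chromatic movable-do syllables, one per semitone above the tonic
# (an assumed naming table for illustration).
SOLFEGE = ["do", "di", "re", "ri", "mi", "fa", "fi", "sol", "si", "la", "li", "ti"]


def scale_degree_name(position, tonic_offset):
    """Map a firing-region position to a solfège name relative to the tonic.

    position: pitch position in semitone steps (assumed unit).
    tonic_offset: position of the tonic, as found by the tonic search.
    """
    # Subtract the offset and wrap into one octave of twelve semitones.
    semitones_above_tonic = (position - tonic_offset) % 12
    return SOLFEGE[semitones_above_tonic]
```

Under these assumptions, a position seven semitones above the tonic yields "sol" regardless of the absolute pitch at which the tonic lies, which is the property that allows transcription without an absolute pitch reference.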

[0061]

[Effects of the Invention] As described above, according to the present invention, when an instrument or singing voice with unstable pitch is transcribed, effects such as vibrato are absorbed. Furthermore, pitches can be determined stably and transcribed even from a performance or singing given under conditions in which absolute pitch cannot be secured, such as when unaccompanied.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of the transcription apparatus of the first invention.

FIG. 2 is a block diagram of one embodiment of the transcription apparatus of the second invention.

FIG. 3 is a diagram showing an example of how the field of a recurrent neural network is activated.

FIG. 4 is a diagram showing an example of a feedforward neural network.

FIG. 5 is a diagram showing the frequency ratios of the pitches of the equal-tempered twelve-tone scale, which is one of the twelve-tone scales.

FIG. 6 is a block diagram of one embodiment of the competitive recall neural network units 13 and 24, the feedforward neural network unit 23, and the pitch storage units 11 and 21 in the first and second inventions.

[Explanation of Reference Numerals]

11, 21 pitch storage unit
12, 22 pitch buffer unit
13, 24 competitive recall neural network unit
14, 25 bandpass filter bank unit
15, 26 read timing generation unit
23 feedforward neural network unit
51 microprocessor
52 ROM
53 RAM
54 FIFO
55 output magnetic disk
56 interrupt signal line

Claims

(57) [Claims]

[Claim 1] In an automatic transcription apparatus that creates a musical score from an instrumental performance or singing, a musical sound transcription apparatus comprising:
a bandpass filter bank unit capable of obtaining the power spectrum of the instrumental performance or singing for a plurality of channels;
a competitive recall neural network unit that takes as input the output of each channel obtained from the bandpass filter bank unit;
a pitch buffer unit that reads the output of the competitive recall neural network unit;
a pitch storage unit that stores the pitch data output by the pitch buffer unit; and
a read timing generation unit that generates the timing at which the pitch buffer unit reads the output of the competitive recall neural network unit.

[Claim 2] In an automatic transcription apparatus that creates a musical score from an instrumental performance or singing, a musical sound transcription apparatus comprising:
a bandpass filter bank unit capable of obtaining the power spectrum of the instrumental performance or singing for a plurality of channels;
a competitive recall neural network unit that takes as input the output of each channel obtained from the bandpass filter bank unit;
a feedforward neural network unit that takes as input the output of the competitive recall neural network unit;
a pitch buffer unit that reads the output of the feedforward neural network unit;
a pitch storage unit that stores the pitch data output by the pitch buffer unit; and
a read timing generation unit that generates the timing at which the pitch buffer unit reads the output of the feedforward neural network unit.