JP2011064997A - Feature quantity collation device and program

Info

Publication number: JP2011064997A
Application number: JP2009216618A
Authority: JP
Inventors: Noriaki Asemi; 典昭阿瀬見; Seiji Kurokawa; 誠司黒川
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2009-09-18
Filing date: 2009-09-18
Publication date: 2011-03-31

Abstract

PROBLEM TO BE SOLVED: To improve the likelihood that an output result is what the user intended, in a technique that executes processing using the result of collating feature quantities.

SOLUTION: In the feature quantity collation processing, the pitch transition of the input voice is derived (S250), and a smoothed transition is derived by smoothing that pitch transition (S260). The relative pitch and relative note length between pitch extrema in the smoothed transition are then derived as voice feature quantities (S280). A collation feature quantity is created by converting the voice feature quantities into words (S290) and is collated against comparison vertex data (S330). The more consecutively the collation feature quantities match the comparison vertex data along the time progression of the input voice, the larger the derived feature quantity matching degree (S350). The piece of music corresponding to the maximum value of an integrated matching degree, which reflects the derived feature quantity matching degree, is identified as the piece the user is expected to have intended.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a feature quantity collation device and a program that execute processing using feature quantities extracted from pitch transitions.

Conventionally, music search devices are known that search a plurality of songs prepared in advance for the song a user intends, according to voice input by the user's utterance.
In this type of music search device, musical score data representing the score of each song (that is, data representing both the pitch and the note value of each note) is prepared in advance for each song. The voice input to the music search device is transcribed, and each resulting note (that is, both the pitch and the note value of each sound) is collated against the score data in the order in which the voice was input. As a result of the collation, the song corresponding to score data whose matching degree is higher than a reference value is estimated to be the song the user intended (see, for example, Patent Document 1).

That is, the music search device described in Patent Document 1 uses the transcription result (that is, notes defined by the pitch and length of each sound) as the feature quantity to search the songs prepared in advance for the song the user intended.

Patent Document 1: JP 2002-157255 A

Generally, when a person without professional training (hereinafter, a general user) sings, it is difficult for that person to accurately produce every pitch he or she has in mind. Consequently, when a general user sings, a large share of the singing is produced at pitches that do not match the pitches represented in the score data of the intended song (hereinafter, score pitches).

Under such circumstances, the music search device described in Patent Document 1 transcribes the voice at pitches that do not match the score pitches. The notes in the transcription result therefore include notes that do not match the score data, so that the transcription result ends up corresponding to a song different from the one the user intended.

In other words, when the notes obtained by transcription are used as the feature quantity, the output result is not what the user intended unless the input matches down to the fine details of the pitch transition.
An object of the present invention is therefore to improve the likelihood that the output result is what the user intended, in a technique that executes processing using the result of collating feature quantities.

In the feature quantity collation device of the present invention, made to achieve the above object, pitch transition deriving means derives, from input voice input continuously over time, a pitch transition representing how the pitch changes, and smoothing means derives a smoothed pitch transition by smoothing the derived pitch transition. In addition, extremum detecting means detects, from the smoothed pitch transition, smoothed pitch extrema, which are the pitch extrema of the smoothed pitch transition, and feature quantity deriving means derives, based on the detected smoothed pitch extrema, voice feature quantities, which are the transition feature quantities of the smoothed pitch extrema. Here, a pitch extremum is an extremum of the pitch change in a pitch transition, and a transition feature quantity is the pitch difference and the ratio of time lengths between pitch extrema that are consecutive over time.

Furthermore, in the feature quantity collation device of the present invention, feature quantity collating means collates each voice feature quantity derived by the feature quantity deriving means against each melody feature quantity, thereby deriving for each song a feature quantity matching degree whose value increases as the voice feature quantities and the melody feature quantities match more closely. Result output means then outputs, based at least on the feature quantity matching degrees, the input-corresponding song, which is the song corresponding to the largest feature quantity matching degree. Here, a melody feature quantity is a transition feature quantity of the pitch extrema in a smoothed melody, prepared in advance for each song, in which the pitch transition of the notes constituting the song has been smoothed.

That is, the feature quantity collation device of the present invention uses, as the voice feature quantity, the pitch difference and the ratio of time lengths between pitch extrema in the smoothed pitch transition, that is, a relative pitch and a relative note length. Moreover, the device derives the smoothed pitch transition by smoothing the pitch transition.

The voice feature quantity derived by the feature quantity collation device of the present invention therefore represents the overall tendency of the pitch transition in the input voice; as a result, it is a feature quantity that ignores fine pitch movements.

Accordingly, by taking as the input-corresponding song the song with the largest feature quantity matching degree derived by collating such voice feature quantities against the melody feature quantities, the likelihood that the output result is what the user intended can be improved. In other words, the feature quantity collation device of the present invention improves the likelihood that the song the user intended is found and output.

The feature quantity collating means in the feature quantity collation device of the present invention may also be configured, as described in claim 2, to derive a larger feature quantity matching degree the more melody feature quantities are matched consecutively by the voice feature quantities along the time progression.

In a feature quantity collation device configured this way, the derived feature quantity matching degree does not become large merely because a single voice feature quantity derived during the time progression happens to match a melody feature quantity. The device can therefore prevent an erroneous result that the user did not intend.
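As an illustration of this property, the following Python sketch scores a sequence of voice feature quantities against melody feature quantities so that consecutive matches count for more than isolated ones. The function name `run_weighted_match`, the position-by-position alignment, and the rule that the k-th match in a run adds k are assumptions made for this example; the text only requires that longer runs of consecutive matches yield a larger matching degree.

```python
def run_weighted_match(voice, melody):
    """Return a score that grows faster when matches are consecutive.

    Both arguments are sequences of feature quantities compared position
    by position. The weighting rule is hypothetical, not taken from the
    patent text.
    """
    score = 0
    run = 0  # length of the current run of consecutive matches
    for v, m in zip(voice, melody):
        if v == m:
            run += 1
            score += run  # the k-th match in a run contributes k
        else:
            run = 0  # a mismatch ends the run
    return score


melody = ["a", "b", "c", "d"]
consecutive = run_weighted_match(["a", "b", "x", "y"], melody)  # two matches in a row
scattered = run_weighted_match(["a", "x", "c", "y"], melody)    # two isolated matches
```

With this rule the two inputs contain the same number of matches, but the consecutive one receives the larger score, so a single chance match has little effect.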

In the feature quantity collation device of the present invention, the smoothing executed by the smoothing means may, as described in claim 3, be performed by at least one of calculating the median of all pitches contained in each unit interval and calculating a moving average.

Such smoothing removes fine pitch fluctuations that the user did not intend, as well as noise, when the smoothed pitch transition is derived from the pitch transition.
In particular, executing both the median calculation and the moving average calculation removes fine pitch fluctuations and noise more reliably.

In the feature quantity collation device of the present invention, the output executed by the feature quantity collating means may, as described in claim 4, be displaying the input-corresponding song as an image, announcing the input-corresponding song by sound, or a combination of these.

Such a feature quantity collation device allows its user to recognize the input-corresponding song.
The image here is something shown on a display device and includes a screen (for example, one composed of character strings) shown on the display device.

Furthermore, as described in claim 5, the feature quantity collation device of the present invention may be configured so that note conversion means converts the input voice, according to the pitch transition, into note data representing the pitch and note value of the input voice, and note collating means derives a note matching degree by collating, song by song, each piece of note data produced by the note conversion means against reference note data that is prepared in advance for each song and represents the pitch and note value of each note constituting the song.

In this configuration, however, the collation result output means must be configured to compute, from the feature quantity matching degree and the note matching degree, a value that increases as both degrees increase, and to output the song corresponding to the largest such value as the input-corresponding song.
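One simple way to obtain a value that grows as both matching degrees grow is to multiply them. The Python sketch below does this for a small set of songs; the function name `pick_song`, the use of a product, and the non-negative scores in the example are assumptions for illustration, since the text does not state how the two degrees are combined.

```python
def pick_song(feature_scores, note_scores):
    """Combine two per-song matching degrees and return the best song.

    feature_scores and note_scores map a song identifier to a
    non-negative matching degree. The product used here is a
    hypothetical combination rule.
    """
    combined = {
        song: feature_scores[song] * note_scores.get(song, 0.0)
        for song in feature_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical scores for two songs.
best, combined = pick_song(
    {"song_a": 0.8, "song_b": 0.6},
    {"song_a": 0.5, "song_b": 0.9},
)
```

Here neither degree alone decides the result: a song must score reasonably on both to come out on top.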

Such a feature quantity collation device searches for and outputs the input-corresponding song using two feature quantities that respond differently to pitch transitions. The device can therefore return the song the user intended as the output result (that is, the input-corresponding song) for songs with a wide variety of pitch transition tendencies, that is, for a larger number of songs.

The present invention may also be embodied as a program executed by a computer.
In that case, as described in claim 6, the program must cause the computer to execute: a pitch transition deriving procedure that derives a pitch transition from the input voice; a smoothing procedure that derives a smoothed pitch transition by smoothing the derived pitch transition; an extremum detecting procedure that detects the smoothed pitch extrema of the derived smoothed pitch transition; a feature quantity deriving procedure that derives the voice feature quantities of the input voice based on the smoothed pitch extrema detected in the extremum detecting procedure; a feature quantity collating procedure that derives a feature quantity matching degree for each song by collating each voice feature quantity derived in the feature quantity deriving procedure against each melody feature quantity; and a result output procedure that outputs the input-corresponding song based on the feature quantity matching degrees derived in the feature quantity collating procedure.

Such a program can be recorded on a computer-readable recording medium (for example, a DVD-ROM, a CD-ROM, a hard disk, or a flash memory) and used by loading it into a computer and starting it as needed, or by having the computer obtain it over a communication line and start it as needed. Having the computer execute each procedure makes the computer function as the feature quantity collation device described in claim 1.

FIG. 1 is a block diagram showing the schematic configuration of the music search system.
FIG. 2 is a schematic diagram of the guide melody.
FIG. 3 is an explanatory drawing of the procedure for creating the vertex data.
FIG. 4 is an explanatory drawing of the procedure for creating the vertex data.
FIG. 5 is an explanatory drawing showing an outline of the vertex data.
FIG. 6 is a flowchart showing the procedure of the music search processing.
FIG. 7 is a flowchart showing the procedure of the feature quantity collation search processing.
FIG. 8 is a flowchart showing the procedure of the note collation search processing.

An embodiment of the present invention is described below with reference to the drawings.
FIG. 1 is a block diagram showing the schematic configuration of a music search system to which the present invention is applied.
<About the music search system>
The music search system 1 searches, based on input voice produced by a user's utterance, for the song that the user is presumed to have intended when inputting that voice (hereinafter, the expected intended song). The expected intended song output as the search result corresponds to the input-corresponding song of the present invention.

To this end, as shown in FIG. 1, the music search system 1 includes a server 40 that stores song data prepared in advance for each song, and a voice processing device 20 that extracts feature quantities from the input voice and outputs the input-corresponding song by collating the extracted feature quantities against the song data. The voice processing device 20 is connected to the server 40 via a network (for example, a dedicated line or a WAN).

The server 40 is a well-known service server device built around an information processing device that includes a storage device 41 storing the song data and a well-known microcomputer 42 having at least a ROM, a RAM, and a CPU.
<About the song data>
Next, the song data stored in the storage device 41 is described.

The song data has song information, which is data for identifying the song, and time information indicating the time required from the start to the end of a performance of the song. The song data also has a guide melody, which is data on the melody of the song, and vertex data, which represents the characteristics of how the pitch of the song changes along the time axis.

The song information includes at least song number data for identifying the song and song title data indicating the title of the song.
As shown in FIG. 2, the guide melody is well-known data representing the pitch and note value of each constituent note (no1, no2, ..., no9 in FIG. 2) that forms the main melody of the song (hereinafter, the reference melody), and corresponds to the reference note data of the present invention. Specifically, the length of a constituent note in this embodiment is represented by a sound output start time and a sound output end time. The sound output start time is the time from the start of the performance of the complete song until output of the constituent note begins, and the sound output end time is the time from the start of the performance until output of the constituent note ends. The time between the sound output start time and the sound output end time is therefore the length of the constituent note.
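The relationship between the two times and the note length can be sketched as a small Python data structure. The class name `GuideNote`, its field names, the use of seconds, and the integer pitch numbers are assumptions for this example; the text only states that each constituent note carries a pitch, a sound output start time, and a sound output end time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuideNote:
    """One constituent note of a guide melody (hypothetical layout)."""

    pitch: int    # pitch of the note, e.g. a MIDI-style note number (assumption)
    start: float  # sound output start time, measured from the start of the performance
    end: float    # sound output end time, measured from the start of the performance

    @property
    def length(self) -> float:
        """Note length: the time between output start and output end."""
        return self.end - self.start


# A two-note guide melody with made-up values.
guide_melody = [GuideNote(60, 0.0, 0.5), GuideNote(62, 0.5, 1.5)]
lengths = [note.length for note in guide_melody]
```

Storing absolute start and end times, rather than durations, also preserves any rest between consecutive notes.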
<About the vertex data>
Next, the vertex data included in the song data is created in advance and consists of the transition feature quantities of a smoothed melody obtained by smoothing the reference melody represented by the guide melody. A transition feature quantity is the relative pitch and relative note length between extrema that are adjacent over time in a smoothed transition, which is obtained by smoothing the change of pitch over time (hereinafter, the pitch transition). For the vertex data, the reference melody corresponds to the pitch transition and the smoothed melody corresponds to the smoothed transition. Hereinafter, the transition feature quantities of the smoothed melody (that is, of the reference melody) are called melody transition feature quantities.

The relative pitch and relative note length are explained with reference to FIGS. 3 and 4.
FIG. 3(A) is a simplified illustration of the smoothed melody, and FIG. 3(B) is a simplified illustration of the extrema of the smoothed melody (hereinafter, smoothed pitch extrema).

To derive the relative pitch and relative note length, all pitches contained in each of the time spans (hereinafter, unit intervals) defined so as to follow one another along the time progression of the reference melody are first obtained. The unit intervals are defined on the reference melody so that they overlap one another. The median of all pitches obtained for each unit interval is then calculated, and the moving average of the calculated medians is taken, which smooths the reference melody. Distributing the result of executing this smoothing a specified number of times along the time progression yields a smoothed melody such as that shown in FIG. 3(A).
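The smoothing steps above can be sketched in Python as a median over overlapping windows followed by a moving average, repeated a set number of times. This is a minimal sketch under stated assumptions: the pitch transition is taken to be a list of pitch values sampled at equal time steps, each unit interval is a window centred on one sample, windows are shortened at the edges, and the names `median_filter`, `moving_average`, `smooth`, `width`, and `passes` are invented for the example.

```python
from statistics import median


def median_filter(pitches, width):
    """Median of the pitches in an overlapping window around each sample."""
    half = width // 2
    return [
        median(pitches[max(0, i - half): i + half + 1])
        for i in range(len(pitches))
    ]


def moving_average(values, width):
    """Mean of the values in a window around each sample."""
    half = width // 2
    averaged = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        averaged.append(sum(window) / len(window))
    return averaged


def smooth(pitches, width=3, passes=1):
    """Apply median filtering followed by a moving average `passes` times."""
    result = list(pitches)
    for _ in range(passes):
        result = moving_average(median_filter(result, width), width)
    return result


# A steady pitch with one brief spike; the median step discards the spike.
smoothed = smooth([60, 60, 72, 60, 60])
```

The median step suppresses short outliers such as the spike in the example, while the moving average rounds off the remaining steps so that extrema can be located afterwards.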

Next, smoothed pitch extrema such as those shown in FIG. 3(B) (ex_g1, ex_g2, ..., ex_g6 in FIG. 3(B)) are detected from the result of applying smoothing differentiation to the derived smoothed melody. Then, as shown in FIG. 4, a pitch difference dp_g (in FIG. 4, the first pitch difference dp_g1,2, the second pitch difference dp_g2,3, ..., the fifth pitch difference dp_g5,6) is derived from each pair of detected smoothed pitch extrema ex_g that are adjacent over time (in FIG. 4, the pair ex_g1 and ex_g2, the pair ex_g2 and ex_g3, ..., the pair ex_g5 and ex_g6). In addition, the time length t_g (in FIG. 4, the first time length t_g1,2, the second time length t_g2,3, ..., the fifth time length t_g5,6) of each span between smoothed pitch extrema ex_g that are adjacent over time (hereinafter, target section) is derived. From the derived time lengths t_g, the ratio dt_g between the time lengths t_g of target sections that are adjacent over time (hereinafter, time ratio) is then derived.
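The derivation of dp, t, and dt from the extrema can be sketched as follows. Two simplifications are assumed here: extrema are found where the sign of the first difference changes, which stands in for the smoothing differentiation named in the text, and each time ratio is taken as the later section length divided by the earlier one, a direction the text does not specify. The function names are hypothetical.

```python
def find_extrema(times, pitches):
    """Return (time, pitch) points where the pitch slope changes sign."""
    extrema = []
    prev_sign = 0
    for i in range(1, len(pitches)):
        diff = pitches[i] - pitches[i - 1]
        sign = (diff > 0) - (diff < 0)
        if sign != 0:
            if prev_sign != 0 and sign != prev_sign:
                # The previous sample is a local maximum or minimum.
                extrema.append((times[i - 1], pitches[i - 1]))
            prev_sign = sign
    return extrema


def transition_features(extrema):
    """Pitch differences, section lengths, and time ratios of adjacent extrema."""
    pairs = list(zip(extrema, extrema[1:]))
    dps = [b[1] - a[1] for a, b in pairs]  # pitch difference per adjacent pair
    ts = [b[0] - a[0] for a, b in pairs]   # time length of each target section
    # Ratio of each section length to the preceding one (assumed direction).
    dts = [later / earlier for earlier, later in zip(ts, ts[1:])]
    return dps, ts, dts


times = [0, 1, 2, 3, 4, 5, 6, 7]
pitches = [60, 62, 64, 62, 60, 61, 63, 62]
extrema = find_extrema(times, pitches)
dps, ts, dts = transition_features(extrema)
```

Because only differences and ratios are kept, the result does not change if the whole input is transposed or sung at a different tempo.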

Next, the derived pitch differences dp_g and time ratios dt_g are simplified; each simplified pitch difference dp_g is taken as a relative pitch, and each simplified time ratio dt_g as a relative note length.
The simplification in this embodiment follows classification criteria that define preset ranges for the pitch difference dp and for the time ratio dt.

Specifically, in this embodiment, when the pitch difference dp is simplified, the relative pitch is set to "1" if the absolute value of dp is less than a predefined first specific value. If the absolute value of dp is at least the first specific value and less than a second specific value (where the second specific value > the first specific value), the relative pitch is set to "2"; if it is at least the second specific value, the relative pitch is set to "3". Furthermore, in this embodiment, if dp is negative, a minus sign is placed before the relative pitch.

When the time ratio dt is simplified, on the other hand, the relative note length is set to "1" if dt is less than a predefined third specific value (where the third specific value < 1). If dt is at least a second specific value (where the second specific value > 1), the relative note length is set to "3"; if dt is at least the first specific value and less than the second specific value, the relative note length is set to "2".
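The two classification rules can be sketched in Python as below. The threshold values passed in the example are invented, the function and parameter names are hypothetical, and the minus sign is represented by a negative integer. For the time ratio the sketch assumes one lower threshold below 1 and one upper threshold above 1, with everything in between mapped to "2".

```python
def relative_pitch(dp, first, second):
    """Classify a pitch difference into 1, 2, or 3, negated when dp < 0."""
    size = abs(dp)
    if size < first:
        level = 1
    elif size < second:
        level = 2
    else:
        level = 3
    return -level if dp < 0 else level


def relative_length(dt, lower, upper):
    """Classify a time ratio into 1 (shorter), 2 (similar), or 3 (longer)."""
    if dt < lower:
        return 1
    if dt >= upper:
        return 3
    return 2


# Example thresholds (assumptions): pitch in semitones, ratios around 1.
pitch_codes = [relative_pitch(dp, first=2, second=5) for dp in (1, -4, 7)]
length_codes = [relative_length(dt, lower=0.5, upper=2.0) for dt in (0.4, 1.0, 2.0)]
```

Coarse classes like these tolerate moderate errors in the sung interval or rhythm, since nearby values fall into the same class.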

Associating each relative pitch and relative note length identified this way (that is, each melody transition feature quantity) with a data number indicating its position in the sequence of melody transition feature quantities along the time progression produces vertex data such as that shown in FIG. 5.

The subscript g attached to the smoothed pitch extremum ex_g, pitch difference dp_g, time length t_g, and time ratio dt_g indicates that these values were derived from the reference melody.
<About the voice processing device>
Next, the voice processing device 20 is described.

Returning to FIG. 1, the voice processing device 20 includes a communication unit 21, a display unit 22, an operation receiving unit 23, a microphone 24, a voice input unit 25, a voice output unit 26, a speaker 27, a storage unit 28, and a control unit 30.

The communication unit 21 is a communication interface that connects the voice processing device 20 to a network (for example, a dedicated line or a WAN) and communicates with the outside (that is, the server 40) via that network.

The display unit 22 is a well-known display device such as a liquid crystal display. The operation receiving unit 23 consists of well-known input devices such as a keyboard and a pointing device (for example, a mouse).

The microphone 24 is a well-known device for inputting voice. The voice input unit 25 is configured as an AD converter that samples the voice (an analog signal) input via the microphone 24 and inputs the sampled values to the control unit 30. Hereinafter, the whole of the input voice converted into sampled values by the voice input unit 25 is called voice data.

The voice output unit 26 is configured to output to the speaker 27 a control signal based on a command from the control unit 30. The speaker 27 is configured to convert the control signal from the voice output unit 26 into sound and emit it.

The storage unit 28 is a storage device (for example, a hard disk drive) that retains its contents even when power is cut and allows its contents to be read and written. It stores programs, song data obtained from the server via the communication unit 21, and the like.

The control unit 30 is built around a well-known microcomputer having at least a ROM 31, a RAM 32, and a CPU 33.
The ROM 31 stores programs and data whose contents must be retained even when power is cut. The RAM 32 temporarily stores programs and data; processing programs are transferred from the storage unit 28 and stored in it.

The CPU 33 executes each process according to the processing programs stored in the ROM 31 and the RAM 32, and controls the units 21, 22, 23, 25 (24), 26 (27), and 28 that constitute the voice processing device 20.

In this embodiment, the processing programs executed by the control unit 30 (more precisely, the CPU 33) include one for executing music search processing, which searches the songs prepared in advance for the input-corresponding song based on the input voice that the user inputs via the microphone 24.

The music search processing includes feature quantity collation processing, which derives from the input voice the voice feature quantities, that is, the transition feature quantities of the input voice, and searches for the expected intended song based on the result of collating the derived voice feature quantities against the vertex data. The music search processing also includes note collation processing, which searches for the expected intended song based on the result of collating a transcription of the input voice against the note data.

That is, the voice processing device 20 functions as the feature amount matching device of the present invention by executing the music search process.
The subscript i is attached to the smoothed pitch extremum ex, the pitch difference dp, the time length t, and the time ratio dt that are derived while deriving the voice feature amount in the feature amount matching process described in detail later. The subscript i indicates that the index was derived from the input voice entered into the voice processing device 20, and distinguishes it from the corresponding index derived from the reference melody. The smoothed pitch extremum ex_i, pitch difference dp_i, time length t_i, and time ratio dt_i are derived by the same processing as the smoothed pitch extremum ex_g, pitch difference dp_g, time length t_g, and time ratio dt_g, except that the input from which they are derived differs (that is, the input voice rather than the reference melody).
<Music search process>
Next, the music search process executed by the control unit 30 will be described.

FIG. 6 is a flowchart showing the procedure of the music search process.
The music search process is started when an activation command is received via the operation reception unit 23 after at least one piece of voice data based on input voice entered via the microphone 24 has been stored in the storage unit 28. The input voice here should preferably continue without interruption for at least a certain length of time.

As shown in FIG. 6, when the music search process is started, first, in S110, the feature amount matching process is executed: a voice feature amount is derived from one piece of the voice data stored in the storage unit 28, and the derived voice feature amount is collated with the vertex data of each song. As the result of this collation, a feature matching degree, whose value is larger the better the voice feature amount matches the vertex data, is derived for each song.

Next, in S120, the note collation process is executed: voice note data, which is the result of transcribing the input voice corresponding to one piece of the voice data stored in the storage unit 28, is derived from that voice data, and the derived voice note data is collated with the guide melody of each song. As the result of this collation, a note matching degree, whose value is larger the better the voice note data matches the guide melody, is derived for each song.

Then, in S130, based on the feature matching degree derived in S110 and the note matching degree derived in S120, an integrated matching degree is derived for each song; its value is larger the larger both the feature matching degree and the note matching degree for the same song are.

Next, in S140, the song corresponding to the largest of the integrated matching degrees derived in S130 is specified as the predicted intended song. Then, in S150, the song title data for the predicted intended song specified in S140 is acquired, and the corresponding song title is displayed on the display unit 22 and output as voice from the speaker 27. That is, the title of the predicted intended song is reported to the user.
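As an illustration of S130 and S140, the following is a minimal sketch only. The text states that the integrated matching degree grows when both per-song degrees are large but does not give the formula, so the product used here is an assumption, as are the function names and the sample scores.

```python
def integrate_scores(feature_scores, note_scores):
    """S130 sketch: combine the two per-song degrees.

    The product is an assumed combination rule; the source only requires
    that the result be larger when both inputs are larger.
    """
    songs = feature_scores.keys() & note_scores.keys()
    return {song: feature_scores[song] * note_scores[song] for song in songs}


def pick_intended_song(integrated):
    """S140 sketch: the song with the largest integrated matching degree."""
    return max(integrated, key=integrated.get)


# Hypothetical per-song results of S110 and S120.
feature_scores = {"song_a": 8.0, "song_b": 2.0, "song_c": 4.0}
note_scores = {"song_a": 4.0, "song_b": 8.0, "song_c": 1.0}

integrated = integrate_scores(feature_scores, note_scores)
best = pick_intended_song(integrated)
```

With these sample values, `song_a` scores well on both measures and is selected.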

The music search process then ends.
<Feature amount matching process>
Next, the feature amount matching process started in S110 of the music search process will be described.

FIG. 7 is a flowchart showing the procedure of the feature amount matching process.
As shown in FIG. 7, when the feature amount matching process is started in S110 of the music search process, first, in S210, one piece of voice data is acquired from the voice data stored in the storage unit 28.

Next, in S220, known normalization and known noise removal are executed as preprocessing on the voice data acquired in S210.
Then, in S230, frequency analysis is performed on the voice data preprocessed in S220. In the present embodiment, this frequency analysis applies an FFT (Fast Fourier Transform) to a predetermined number of samples of the voice data. This yields the amplitude spectrum (that is, the distribution of frequency components) of the voice within the analysis period corresponding to that number of samples. The frequency analysis is repeated from the start to the end of the voice data, each time taking the predetermined number of samples so that successive analysis periods partly overlap in time.
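The overlapping frame analysis of S230 can be sketched as follows. This is an illustrative sketch, not the embodiment's implementation: it uses a naive discrete Fourier transform in place of an FFT so that it needs no external library (the resulting amplitude spectrum is the same), and the sample rate, frame size, hop size, and test tone are all assumed values.

```python
import cmath
import math


def frames(samples, frame_size, hop):
    """Yield analysis periods; hop < frame_size makes them overlap in time."""
    for start in range(0, len(samples) - frame_size + 1, hop):
        yield samples[start:start + frame_size]


def amplitude_spectrum(frame):
    """Amplitude of each frequency bin up to Nyquist (naive DFT, O(n^2))."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc))
    return spectrum


sample_rate = 800          # Hz, assumed
frame_size, hop = 64, 32   # 50 % overlap, assumed
tone = [math.sin(2 * math.pi * 100 * t / sample_rate) for t in range(256)]

spectra = [amplitude_spectrum(f) for f in frames(tone, frame_size, hop)]
peak_bin = max(range(len(spectra[0])), key=spectra[0].__getitem__)
peak_hz = peak_bin * sample_rate / frame_size
```

Each element of `spectra` corresponds to one analysis period; for the 100 Hz test tone the largest bin of the first spectrum sits at 100 Hz.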

Next, in S240, the pitch (voice fundamental frequency f0) in each analysis period is estimated by a known method based on the amplitude spectrum derived in S230. In the present embodiment, the fundamental frequency f0 is detected using the autocorrelation value of the amplitude spectrum along the frequency axis.

The autocorrelation value along the frequency axis is the sum of products of the amplitude at each frequency component of an amplitude spectrum and the amplitude at the frequency component higher by a specified frequency width. Consequently, when the spectrum is displaced by a frequency width at which the fundamental frequency component and its harmonic components line up, the autocorrelation value becomes large. The autocorrelation value along the frequency axis is thus the strength of the correlation between the amplitude spectrum and a copy of itself displaced along the frequency axis in steps of the specified frequency width, and it represents the likelihood of the fundamental frequency component.

Accordingly, in the present embodiment, the displacement width itself is treated as a frequency, and the frequency at the position where the autocorrelation value along the frequency axis is largest is estimated as the voice fundamental frequency f0.
Next, in S250, based on the pitch (voice fundamental frequency f0) estimated in S240, the transition of the pitch over time (that is, the pitch transition, hereinafter referred to as the voice pitch transition) is derived.
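The f0 estimation of S240 and the pitch transition of S250 can be sketched as below. This is a minimal sketch under stated assumptions: the spectra are synthetic (four harmonics with amplitude 1/h), the bin width `bin_hz` is an assumed value, and the search starts at a shift of one bin because a shift of zero would trivially give the largest sum.

```python
def spectral_autocorrelation(spectrum, shift):
    """Sum of products of each bin with the bin `shift` positions higher."""
    return sum(a * b for a, b in zip(spectrum, spectrum[shift:]))


def estimate_f0(spectrum, bin_hz, min_shift=1):
    """Treat the shift with the largest autocorrelation as f0 (in Hz)."""
    shifts = range(min_shift, len(spectrum))
    best = max(shifts, key=lambda s: spectral_autocorrelation(spectrum, s))
    return best * bin_hz


def harmonic_spectrum(f0_bin, size=40):
    """Synthetic amplitude spectrum with harmonics at multiples of f0_bin."""
    spectrum = [0.0] * size
    for harmonic in range(1, 5):
        spectrum[f0_bin * harmonic] = 1.0 / harmonic
    return spectrum


bin_hz = 25.0                      # assumed width of one frequency bin
f0_bins_per_frame = [4, 4, 5, 6]   # hypothetical f0 bin of each analysis period

# S250 sketch: the estimated pitches arranged in time order.
pitch_transition = [estimate_f0(harmonic_spectrum(b), bin_hz)
                    for b in f0_bins_per_frame]
```

Shifting by the fundamental aligns every harmonic with the next one, so that shift collects more product terms than any multiple of it.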

Then, in S260, the voice pitch transition derived in S250 is smoothed. In the present embodiment, all pitches contained in a unit section (that is, a section of a specified time length) of the voice pitch transition are acquired, their median is calculated, and the moving average of the calculated medians is taken. The result of performing this smoothing a specified number of times, arranged along the passage of time, is derived as the smoothed transition (corresponding to the smoothed pitch transition of the present invention). The unit sections are defined repeatedly along the voice pitch transition so that they are consecutive and overlap one another in time.
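The smoothing of S260 can be sketched as a median over overlapping unit sections followed by a moving average of those medians, repeated a fixed number of times. The section half-width, the number of passes, and the sample pitch values (here in semitone-like units) are assumptions for illustration.

```python
from statistics import median


def window(values, center, half_width):
    """Values of the unit section centered on `center`, clipped at the ends."""
    lo = max(0, center - half_width)
    hi = min(len(values), center + half_width + 1)
    return values[lo:hi]


def smooth_once(pitches, half_width):
    """One pass: median per unit section, then moving average of the medians."""
    medians = [median(window(pitches, i, half_width))
               for i in range(len(pitches))]
    return [sum(w) / len(w)
            for w in (window(medians, i, half_width)
                      for i in range(len(medians)))]


def smooth(pitches, half_width=1, passes=2):
    """Repeat the smoothing pass a specified number of times."""
    for _ in range(passes):
        pitches = smooth_once(pitches, half_width)
    return pitches


raw = [60, 60, 72, 60, 60, 60, 62, 62, 62, 62]   # 72 is an isolated outlier
smoothed = smooth(raw)
```

The median step removes the isolated outlier, and the moving average softens the remaining step from 60 to 62.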

Further, in S270, smoothing differentiation is applied to the smoothed transition derived in S260, and the smoothed pitch extrema ex_i of the smoothed pitch transition are detected from the result.
Next, in S280, the relative pitch and relative note length of the input voice are derived from the smoothed pitch extrema ex_i detected in S270 that are adjacent in time. Specifically, each pitch difference dp_i and each time ratio dt_i is derived from temporally adjacent smoothed pitch extrema ex_i. The derived pitch differences dp_i and time ratios dt_i are then simplified according to classification criteria; the simplified pitch difference dp_i is taken as the relative pitch and the simplified time ratio dt_i as the relative note length. That is, in S280, the relative pitch and the relative note length are derived as the voice feature amount.
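The extremum detection of S270 and the feature derivation of S280 can be sketched as below. Several details are assumptions because this passage does not fix them: a plain first difference stands in for the smoothing differentiation, the time ratio is taken as the ratio of one extremum-to-extremum time length to the preceding one, and the classification thresholds (`large`, 0.75, 1.5) and class labels are hypothetical.

```python
def find_extrema(curve):
    """Indices where the first difference changes sign (local max or min)."""
    extrema = []
    for i in range(1, len(curve) - 1):
        before = curve[i] - curve[i - 1]
        after = curve[i + 1] - curve[i]
        if before * after < 0:
            extrema.append(i)
    return extrema


def quantize_pitch(dp, large=5):
    """Hypothetical classes: +/-1 for a small step, +/-2 for a large one."""
    sign = 1 if dp > 0 else -1
    return sign * (2 if abs(dp) >= large else 1)


def quantize_ratio(ratio):
    """Hypothetical classes: shorter (S), about equal (E), longer (L)."""
    if ratio < 0.75:
        return "S"
    if ratio > 1.5:
        return "L"
    return "E"


def relative_features(curve, frame_sec=1.0):
    """(relative pitch, relative note length) pairs between adjacent extrema."""
    idx = find_extrema(curve)
    dp = [curve[b] - curve[a] for a, b in zip(idx, idx[1:])]
    t = [(b - a) * frame_sec for a, b in zip(idx, idx[1:])]
    dt = [cur / prev for prev, cur in zip(t, t[1:])]   # assumed definition
    return [(quantize_pitch(p), quantize_ratio(r)) for p, r in zip(dp[1:], dt)]


curve = [60, 62, 64, 62, 60, 59, 61, 63, 62, 61, 60, 59, 58, 57, 58]
features = relative_features(curve)
```

For this curve the extrema fall at indices 2, 5, 7, and 13, giving two simplified feature pairs.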

Next, in S290, the voice feature amounts derived in S280 are grouped into words, each consisting of a prescribed feature count of voice feature amounts that are consecutive in the time progression of the input voice. The grouping is performed so that adjacent words share some of their voice feature amounts. Hereinafter, each such word of voice feature amounts is referred to as a collation voice feature amount.
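The word grouping of S290 amounts to taking overlapping runs of consecutive features. In the sketch below the shift of one feature between adjacent words is an assumption (the text only requires that words partly overlap), and the feature labels and the prescribed feature count of 3 are placeholders.

```python
def wordize(features, word_size):
    """Overlapping words of `word_size` consecutive features, shifted by one."""
    return [tuple(features[i:i + word_size])
            for i in range(len(features) - word_size + 1)]


features = ["a", "b", "c", "d", "e"]   # placeholder feature labels
words = wordize(features, 3)           # prescribed feature count assumed to be 3
```

The same function could be applied to the melody transition feature amounts of a song to obtain its comparison vertex data.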

Further, in S300, one collation song, that is, a song whose vertex data the collation voice feature amounts are to be collated with, is selected from the songs corresponding to the song data acquired from the server 40 and stored in the storage unit 28.

Next, in S310, one collation voice feature amount is acquired from all the collation voice feature amounts generated in S290. The one acquired is the collation voice feature amount made up of the voice feature amounts closest to the start of the input voice in its time progression.

Then, in S320, from all the melody transition feature amounts forming the vertex data of the collation song selected in S300, a prescribed feature count of melody transition feature amounts that are consecutive in the time progression of the reference melody is grouped into a word and acquired. This grouping starts from the melody transition feature amounts derived from the smoothed pitch extrema closest to the start of the reference melody. Hereinafter, the prescribed feature count of melody transition feature amounts grouped in S320 is referred to as comparison vertex data.

Next, in S330, the collation voice feature amount acquired in S310 is collated with the comparison vertex data acquired in S320. If the collation voice feature amount matches the comparison vertex data (S340: YES), the process proceeds to S350.

In S350, the feature matching degree is derived and stored in association with a data number, and the process then proceeds to S360. The data number associated with the feature matching degree is the one assigned to the first melody transition feature amount, in the time progression of the reference melody, among the melody transition feature amounts forming the comparison vertex data.

On the other hand, if the collation voice feature amount does not match the comparison vertex data (S340: NO), the process proceeds to S360.
In S360, it is determined whether all the melody transition feature amounts have been grouped into words and the collation voice feature amount acquired in S310 has been collated with every resulting word (that is, with all comparison vertex data). If not, the process returns to S320. In that S320, a prescribed feature count of melody transition feature amounts is grouped into a word so that it partly overlaps, in the time progression of the reference melody, the melody transition feature amounts grouped in the previous S320. That is, new comparison vertex data is generated, and the process proceeds to S330.

In this way, S320 to S350 are executed repeatedly until one collation voice feature amount has been collated with all the comparison vertex data of one song.
If it is determined in S360 that the collation voice feature amount has been collated with all the comparison vertex data, the process proceeds to S370. In S370, it is determined whether every collation voice feature amount has been acquired and collated with the comparison vertex data.

If it is determined in S370 that not every collation voice feature amount has been collated with the comparison vertex data, the process returns to S310. In that S310, one collation voice feature amount is acquired from those not yet collated with the comparison vertex data; among them, the one made up of the voice feature amounts closest to the start of the input voice is acquired.

Thereafter, steps S310 to S370 are repeated until an affirmative determination is made in S370. Hereinafter, one pass through S310 to S370 is referred to as a different-feature collation cycle. Within a different-feature collation cycle, one pass through S320 to S360, performed between acquiring one collation voice feature amount and acquiring the next, is referred to as a same-feature collation cycle.

When an affirmative determination is made in S340 while the same-feature collation cycle is being repeated, the process proceeds to S350. In that S350, it is determined whether the comparison vertex data that matched the collation voice feature amount in the current different-feature collation cycle is consecutive, in the time progression of the reference melody, with comparison vertex data that matched the collation voice feature amount in the previous different-feature collation cycle (hereinafter referred to as the continuity determination). Specifically, the result of the continuity determination is affirmative if the data numbers associated with feature matching degrees in the previous different-feature collation cycle include the data number that immediately precedes, in the time progression of the collation song, the data number of the melody transition feature amounts forming the comparison vertex data judged to match the collation voice feature amount this time.

If the result of the continuity determination is affirmative, the feature matching degree is derived as a predetermined initial value raised to a power whose exponent is the number of different-feature collation cycles with consecutive affirmative determinations. If the result is negative, the initial value itself is derived as the feature matching degree.
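The collation loop and the continuity rule can be sketched as follows. This is an illustrative sketch, not the embodiment's code: words are compared by simple equality, data numbers are taken to be list indices, and the initial value of 2.0 is an assumption (the rule only rewards consecutive matches when the initial value exceeds 1).

```python
def song_matching_degree(query_words, reference_words, initial=2.0):
    """Largest feature matching degree reached for one collation song."""
    previous_runs = {}   # data number -> consecutive match count, previous cycle
    best = 0.0
    for query in query_words:                    # different-feature cycle
        current_runs = {}
        for number, reference in enumerate(reference_words):   # same-feature cycle
            if query != reference:
                continue
            # Continuity determination: did the previous query word match
            # the comparison data immediately before this one?
            run = previous_runs.get(number - 1, 0) + 1
            current_runs[number] = run
            best = max(best, initial ** run)
        previous_runs = current_runs
    return best


reference = ["A", "B", "C", "D", "B"]            # placeholder comparison data
consecutive = song_matching_degree(["B", "C", "D"], reference)
scattered = song_matching_degree(["B", "D"], reference)
```

Three query words that match three consecutive reference positions reach `initial ** 3`, whereas matches that are not consecutive stay at the initial value.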

In other words, the feature matching degree becomes larger the longer the run of collation voice feature amounts, taken along the time progression of the input voice, that match comparison vertex data taken along the time progression of the song.
If an affirmative determination is made in S370, the process proceeds to S380.

In S380, the largest of the feature matching degrees for the collation song determined in the earlier S310 is stored in the storage unit 28 in association with the song title data of that collation song. That is, the feature matching degree associated with the song title data in S380 is the largest of all the feature matching degrees derived over the repeated different-feature collation cycles for one collation song.

Next, in S390, it is determined whether every song corresponding to the song data stored in the storage unit 28 has been selected as the collation song. If not, the process returns to S300. In that S300, a new song is selected as the collation song from the songs not yet selected, and the process proceeds to S310. That is, steps S300 to S390 are repeated until the collation voice feature amounts have been collated with the vertex data in all the song data stored in the storage unit 28.

If it is determined in S390 that every song stored in the storage unit 28 has been selected as the collation song, the process returns to the music search process and proceeds to S120 of that process.
In summary, the feature amount matching process of the present embodiment derives, as the feature amount representing the characteristics of the pitch transition of the input voice (that is, the voice feature amount), the pitch difference dp_i and the time ratio dt_i between pitch extrema in the smoothed transition, that is, the relative pitch and note length. Collation voice feature amounts are then generated by grouping the derived voice feature amounts into words, and each generated collation voice feature amount is collated with all comparison vertex data. The longer the run of collation voice feature amounts along the time progression of the input voice that match the comparison vertex data, the larger the feature matching degree that is derived.
<Note collation process>
Next, the note collation process started in S120 of the music search process will be described.

FIG. 8 is a flowchart showing the procedure of the note collation process.
As shown in FIG. 8, when the note collation process is started in S120 of the music search process, first, in S410, the same voice data as that acquired in the earlier S210 is acquired from the voice data stored in the storage unit 28.

Next, in S420, preprocessing is executed on the voice data acquired in S410. Then, in S430, the voice data preprocessed in S420 is frequency-analyzed for each analysis period. Next, in S440, the pitch (voice fundamental frequency f0) in each analysis period is estimated from the result of the frequency analysis in S430. The processing in S420, S430, and S440 is the same as that in S210, S230, and S240 of the feature amount matching process, so its description is omitted here.

Next, in S450, a known note period estimation process is executed, which estimates note periods, each of which can be regarded as one note, from the variation in sound pressure of the voice data acquired in S410. Specifically, within a section in which the sound pressure of the voice data increases monotonically, the first analysis period whose rate of increase is at or above a specified value is taken as the note start timing. After the note start timing, within a section in which the sound pressure decreases monotonically, the first analysis period whose rate of decrease is at or above a specified value is taken as the note end timing. Each period between a note start timing and the following note end timing is taken as a note period.
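The note period estimation of S450 can be sketched as a small state machine over per-analysis-period sound pressure values. The sketch assumes that the rate of change is the difference between successive analysis periods; the thresholds `rise` and `fall` and the sample pressure values are hypothetical.

```python
def note_periods(pressure, rise=3.0, fall=3.0):
    """Return (start, end) analysis-period indices of each estimated note."""
    periods, start = [], None
    for i in range(1, len(pressure)):
        change = pressure[i] - pressure[i - 1]
        if start is None and change >= rise:
            start = i                      # note start timing
        elif start is not None and -change >= fall:
            periods.append((start, i))     # note end timing
            start = None
    return periods


# Hypothetical sound pressure per analysis period: two notes.
pressure = [0, 1, 6, 7, 7, 6, 1, 0, 5, 6, 5, 1]
periods = note_periods(pressure)
```

Small fluctuations below the thresholds neither start nor end a note, so each sharp rise is paired with the next sharp fall.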

Further, in S460, the note pitch representing the pitch of each note period estimated in S450 is estimated from the pitches (voice fundamental frequencies f0) of all the analysis periods belonging to that note period. Specifically, the pitch corresponding to the fundamental frequency that accounts for the largest share of the analysis periods in the note period is estimated as the note pitch. In this way, voice note data is generated that associates, for each note period along the time progression of the input voice, the time length of the note period (that is, the note value) with the pitch in that note period. That is, in S460, the input voice is converted into notes.
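The note pitch estimation of S460 picks the most frequent pitch within each note period. In the sketch below the per-analysis-period pitches are assumed to be already quantized (here to MIDI-like note numbers, with 0 standing for silence), and the note value is expressed as a count of analysis periods; both are assumptions for illustration.

```python
from collections import Counter


def note_pitch(frame_pitches):
    """Pitch that occupies the largest share of the analysis periods."""
    return Counter(frame_pitches).most_common(1)[0][0]


def transcribe(frame_pitches, periods):
    """Voice note data: (note pitch, note value in analysis periods) per note."""
    return [(note_pitch(frame_pitches[start:end]), end - start)
            for start, end in periods]


# Hypothetical per-analysis-period pitches and note periods (start, end).
frame_pitches = [0, 0, 60, 60, 61, 60, 0, 0, 64, 64, 65, 0]
periods = [(2, 6), (8, 11)]
voice_notes = transcribe(frame_pitches, periods)
```

A brief pitch wobble inside a note (61 or 65 above) does not change the note pitch, because the dominant pitch still holds the largest share.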

Next, in S470, the voice note data generated in S460 is grouped into words, each consisting of a prescribed note count of notes that are consecutive in the time progression of the input voice. The grouping is performed so that adjacent words share part of the voice note data. Hereinafter, each such word of voice note data is referred to as word note data.

Further, in S480, one song whose reference note data (that is, guide melody) the word note data is to be collated with (hereinafter referred to as the note collation song) is selected from the songs corresponding to the song data acquired from the server 40 and stored in the storage unit 28.

Next, in S490, one piece of word note data is acquired from all the word note data generated in S470. The one acquired is the word note data containing the voice note data closest to the start of the input voice in its time progression.

Then, in S500, from the pitches and note values of the constituent notes forming the reference note data of the note collation song selected in S480, a prescribed note count of temporally consecutive pitches and note values is grouped into a word and acquired. This grouping starts from the pitches and note values of the constituent notes closest to the start of the reference melody. Hereinafter, the pitches and note values of the prescribed note count of constituent notes grouped and acquired in S500 are referred to as comparison note data.

Next, in S510, the word note data acquired in S490 is collated with the comparison note data acquired in S500. If the word note data matches the comparison note data (S520: YES), the process proceeds to S530.

In S530, the note matching degree is derived and stored in association with a constituent note number, and the process then proceeds to S540. The constituent note number associated with the note matching degree is the one assigned to the first constituent note, in the time progression of the reference melody, among the prescribed note count of constituent notes forming the comparison note data.

On the other hand, if the collation in S510 shows that the word note data does not match the comparison note data (S520: NO), the process proceeds to S540.
In S540, it is determined whether the pitches and note values of all the constituent notes have been grouped into words and the word note data acquired in S490 has been collated with all the resulting comparison note data. If not, the process returns to S500. In that S500, a prescribed note count of pitches and note values of constituent notes is grouped into a word so that it partly overlaps, in the time progression of the reference melody, the pitches and note values grouped in the previous S500. That is, new comparison note data is generated, and the process proceeds to S510.

In this way, S500 to S540 are executed repeatedly until one piece of word note data has been collated with the pitches and note values of all the constituent notes of one song.
If it is determined in S540 that the pitches and note values of all the constituent notes have been grouped into words and the word note data has been collated with all the resulting comparison note data, the process proceeds to S550. In S550, it is determined whether every piece of word note data has been acquired and collated with the comparison note data.

If it is determined in S550 that not every piece of word note data has been collated with the comparison note data, the process returns to S490. In that S490, one piece of word note data is acquired from those not yet collated with the comparison note data; among them, the one made up of the voice note data closest to the start of the input voice is acquired.

その後、Ｓ５５０にて肯定判定されるまで、Ｓ４９０〜Ｓ５５０までのステップを繰り返す。以下、Ｓ４９０〜Ｓ５５０までの一回の流れを、別音符照合サイクルと称す。また、別音符照合サイクルにて、単語音符データを取得してから新たな単語音符データを取得するまでの、Ｓ５００〜Ｓ５４０の一回の流れを、同一音符照合サイクルと称す。 Thereafter, the steps from S490 to S550 are repeated until an affirmative determination is made in S550. Hereinafter, one pass through S490 to S550 is referred to as a separate note collation cycle. Within a separate note collation cycle, one pass through S500 to S540, performed between acquiring one word note data and acquiring the next, is referred to as a same note collation cycle.

この同一音符照合サイクルを繰り返す過程の中で、Ｓ５２０にて肯定判定されると、Ｓ５３０へと進む。そのようにして移行したＳ５３０では、今回の別音符照合サイクルにて単語音符データと一致した比較音符データが、前回の別音符照合サイクルにて単語音符データと一致した比較音符データと、基準旋律の時間進行上連続するものであるか否かを判定（以下、音符接続判定とする）する。具体的には、前回の別音符照合サイクルにて音符一致度に対応付けられた構成音の番号の中に、今回Ｓ５３０へと進んだ際に、単語音符データに一致したと判定された比較音符データを形成する構成音の番号よりも、基準旋律における時間進行上１つ前の構成音であることを示す番号があれば、音符接続判定における判定結果が肯定されたものとする。 If an affirmative determination is made in S520 while the same note collation cycle is being repeated, the process proceeds to S530. In S530, it is determined whether the comparison note data that matched the word note data in the current separate note collation cycle is continuous, along the time progression of the reference melody, with the comparison note data that matched the word note data in the previous separate note collation cycle (hereinafter referred to as note connection determination). Specifically, the note connection determination is affirmative if the constituent sound numbers associated with the note matching degree in the previous separate note collation cycle include a number indicating the constituent sound that immediately precedes, in the time progression of the reference melody, the constituent sounds forming the comparison note data determined to match the word note data on this entry to S530.

その音符接続判定の判定結果が肯定であれば、連続して肯定判定された別音符照合サイクルの回数を「べき指数」として、初期規定値を累乗した値を音符一致度として導出する。一方、音符接続判定の判定結果が否定であれば、初期規定値そのものを音符一致度として導出する。 If the result of the note connection determination is affirmative, the note matching degree is derived by raising the initial prescribed value to a power whose exponent is the number of separate note collation cycles in which affirmative determinations have been made consecutively. If the result of the note connection determination is negative, the initial prescribed value itself is derived as the note matching degree.

つまり、音符一致度は、入力音声の時間進行に沿った単語音符データが連続して、音符照合楽曲の基準旋律における時間進行に沿った比較音符データに一致するほど、大きな値となる。 That is, the degree of note coincidence increases as the word note data along the time progression of the input speech continuously matches the comparison note data along the time progression in the reference melody of the note collation music.
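The scoring described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the embodiment's actual implementation: the function names, the word size of two notes, the initial prescribed value of 2.0, the one-note shift between successive words, and the choice to count an isolated match as a run of length 1 are all hypothetical.

```python
def make_words(notes, size):
    """Group (pitch, note_value) pairs into overlapping words of `size` notes."""
    return [tuple(notes[i:i + size]) for i in range(len(notes) - size + 1)]


def note_matching_degree(voice_notes, reference_notes, size=2, base=2.0):
    """Return the largest note matching degree for one song (0.0 if nothing matches).

    `base` stands in for the initial prescribed value (assumed to exceed 1).
    """
    comparison = make_words(reference_notes, size)    # comparison note data
    previous_runs = {}  # reference position -> run length in the previous cycle
    best = 0.0
    for word in make_words(voice_notes, size):        # separate note collation cycle
        runs = {}
        for pos, candidate in enumerate(comparison):  # same note collation cycle
            if candidate != word:                     # S520: no match here
                continue
            # S530: connected if the previous word matched one position earlier.
            runs[pos] = previous_runs.get(pos - 1, 0) + 1
            best = max(best, base ** runs[pos])
        previous_runs = runs
    return best


if __name__ == "__main__":
    reference = [(60, 1), (62, 1), (64, 2), (65, 1), (67, 2)]
    hummed = [(60, 1), (62, 1), (64, 2), (65, 1)]
    print(note_matching_degree(hummed, reference))
```

With this scoring, a single accidental match contributes only the base value, while an unbroken run of matches grows exponentially with its length.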

なお、Ｓ５５０にて肯定判定されると、Ｓ５６０へと進む。そのＳ５６０では、先のＳ４８０にて決定された音符照合楽曲に対する音符一致度の中で、値が最大のものを、その音符照合楽曲に対応する曲名データと対応付けて、記憶部２８に記憶する。つまり、Ｓ５６０にて曲名データと対応付けられる音符一致度は、一つの音符照合楽曲に対する別音符照合サイクルの繰り返しにて導出された全音符一致度の中で、値が最大のものである。 If an affirmative determination is made in S550, the process proceeds to S560. In S560, the largest of the note matching degrees for the note collation music determined in S480 is stored in the storage unit 28 in association with the song name data corresponding to that note collation music. That is, the note matching degree associated with the song name data in S560 is the largest of all the note matching degrees derived by repeating the separate note collation cycle for one note collation music.

続く、Ｓ５７０では、記憶部２８に記憶されている楽曲データに対応する全ての楽曲を、音符照合楽曲として決定済みであるか否かを判定する。その判定の結果、全ての楽曲を音符照合楽曲として決定済みでなければ、Ｓ４８０へと戻る。そのようにして移行したＳ４８０では、音符照合楽曲として未決定の楽曲の中から、新たな楽曲を音符照合楽曲として決定して、Ｓ４９０へと進む。つまり、Ｓ４９０からＳ５７０までのステップを、記憶部２８に記憶されている全ての楽曲データ中の基準音符データに、単語音符データの照合が完了するまで繰り返す。 Subsequently, in S570, it is determined whether or not all the music corresponding to the music data stored in the storage unit 28 has been determined as the note collation music. As a result of the determination, if all music pieces have not been determined as note collation music pieces, the process returns to S480. In S480 thus shifted, a new musical piece is determined as the musical note collation music from the music pieces that have not been determined as musical note collation music, and the process proceeds to S490. That is, the steps from S490 to S570 are repeated until the collation of the word note data is completed for the reference note data in all the music data stored in the storage unit 28.

なお、Ｓ５７０での判定の結果、記憶部２８に記憶されている全ての楽曲を音符照合楽曲として決定済みであれば、楽曲検索処理へと戻り、その楽曲検索処理のＳ１３０へと進む。 Note that, as a result of the determination in S570, if all the music stored in the storage unit 28 has been determined as the note collation music, the process returns to the music search process and proceeds to S130 of the music search process.

つまり、本実施形態の音符照合処理では、入力音声を音符化した音声音符データを生成し、その音声音符データを、楽曲毎に予め用意された基準音符データに照合する。そして、その照合結果として、入力音声の時間進行に沿って連続する音声音符データが、音符照合楽曲の基準旋律における時間進行に沿って連続して一致する比較音符データの数が多いほど、大きな値の音符一致度を導出している。 That is, in the note collation process of this embodiment, voice note data obtained by converting the input voice into notes is generated, and the voice note data is collated with reference note data prepared in advance for each music piece. As a result of the collation, the greater the number of comparison note data in which the voice note data continuous along the time progression of the input voice matches continuously along the time progression in the reference melody of the note collation music, the larger the value. The degree of note coincidence is derived.

そして、本実施形態の楽曲検索処理では、特徴量一致度と、音声一致度とに基づいて導出した統合一致度が最も高いものに対応する楽曲を、意図予想曲として検出している。
［実施形態の効果］
以上説明したように、本実施形態の音声処理装置２０では、音声特徴量として、平滑化推移における極値の間の相対音高及び相対音長を導出している。しかも、本実施形態の音声処理装置２０にて導出される平滑化推移は、入力音声の音高推移を平滑化したものである。 In the music search process of the present embodiment, the song corresponding to the highest integrated matching degree, which is derived from the feature amount matching degree and the voice matching degree, is detected as the intended expected song.
[Effect of the embodiment]
As described above, the voice processing device 20 of the present embodiment derives, as voice feature amounts, the relative pitch and relative tone length between extreme values in the smoothed transition. Moreover, the smoothed transition derived by the voice processing device 20 is obtained by smoothing the pitch transition of the input voice.
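As a rough illustration of deriving relative pitch and relative tone length between extreme values, the sketch below works on a list of smoothed pitch samples taken at a fixed frame period. The function names are hypothetical, plateaus are ignored for simplicity, and the relative tone length is assumed here to be the ratio of each inter-extremum interval to the preceding one; this section does not define the time ratio dt, so that choice is an assumption.

```python
def pitch_extrema(smoothed):
    """Return the indices of strict local maxima and minima."""
    indices = []
    for i in range(1, len(smoothed) - 1):
        slope_in = smoothed[i] - smoothed[i - 1]
        slope_out = smoothed[i + 1] - smoothed[i]
        if slope_in * slope_out < 0:  # direction changes at sample i
            indices.append(i)
    return indices


def transition_features(smoothed):
    """Return (relative_pitch, relative_length) for successive extrema.

    relative_length is the current interval divided by the preceding
    interval (an assumed definition of the time ratio).
    """
    extrema = pitch_extrema(smoothed)
    features = []
    for prev, start, end in zip(extrema, extrema[1:], extrema[2:]):
        relative_pitch = smoothed[end] - smoothed[start]
        relative_length = (end - start) / (start - prev)
        features.append((relative_pitch, relative_length))
    return features


if __name__ == "__main__":
    print(transition_features([0, 1, 2, 1, 0, 1, 2, 3, 2]))
```

Because both values are relative, the features do not depend on the absolute key or tempo at which the user sings.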

よって、本実施形態にて導出される音声特徴量は、入力音声における音高推移の全体的な傾向を表すものになると共に、細かな音高の推移を無視したものとすることができる。
したがって、このような音声特徴量を旋律推移特徴量に照合することで導出された特徴量一致度を反映した結果から意図予想曲を特定することで、その特定結果（即ち、入力対応曲）が、利用者が意図したものに一致することになる。このため、本実施形態の音声処理装置２０によれば、採譜結果のみから意図予想曲を特定する場合に比べて、より多くの楽曲に対して、利用者が意図した曲が正しく検索される可能性を向上させることができる。 Therefore, the voice feature amount derived in the present embodiment represents the overall tendency of the pitch transition in the input voice, and can ignore the fine pitch transition.
Therefore, by identifying the intended expected song from a result that reflects the feature amount matching degree, derived by collating such voice feature amounts against the melody transition feature amounts, the identified result (that is, the input corresponding song) matches what the user intended. For this reason, compared with identifying the intended expected song from the transcription result alone, the voice processing device 20 of the present embodiment can increase, for a larger number of songs, the likelihood that the song intended by the user is retrieved correctly.

特に、本実施形態の楽曲検索処理では、特徴量一致度と、音符一致度という、音高の推移に対して傾向が異なる２つの特徴量を用いて検索した結果から、意図予想曲を特定している。このため、本実施形態の楽曲検索処理によれば、意図予想曲が、利用者が意図したものに正しく一致することを、様々な音高推移の傾向を有した曲、即ち、より多くの曲に対して実現できる。 In particular, the music search process of the present embodiment identifies the intended expected song from the results of searching with two feature amounts that respond differently to pitch transitions: the feature amount matching degree and the note matching degree. The music search process can therefore make the intended expected song correctly match the song the user intended for songs with a wide variety of pitch transition tendencies, that is, for a larger number of songs.
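One way to picture how the two matching degrees combine is the sketch below. The text only requires that the integrated matching degree grow as both inputs grow; the plain sum used here, the function name, and the song titles are assumptions for illustration.

```python
def intended_song(feature_degrees, note_degrees):
    """Pick the song with the largest integrated matching degree.

    Both arguments map a song title to a matching degree. The integrated
    degree is assumed to be a simple sum of the two.
    """
    integrated = {
        title: feature_degrees[title] + note_degrees.get(title, 0.0)
        for title in feature_degrees
    }
    best_title = max(integrated, key=integrated.get)
    return best_title, integrated


if __name__ == "__main__":
    feature = {"Song A": 4.0, "Song B": 8.0, "Song C": 2.0}
    note = {"Song A": 8.0, "Song B": 2.0, "Song C": 2.0}
    print(intended_song(feature, note))
```

A product or a weighted sum would satisfy the same requirement; the weighting would decide how strongly each measure influences the result.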

なお、本実施形態の特徴量照合処理、及び音符照合処理では、音声特徴量または音声音符データがそれぞれの時間進行に沿って連続して、旋律推移特徴量または基準音符データに一致する数が多いほど、大きな値の特徴量一致度または音符一致度を導出している。 In the feature amount collation process and the note collation process of the present embodiment, the more voice feature amounts or voice note data items that, in sequence along their time progression, consecutively match the melody transition feature amounts or the reference note data, the larger the feature amount matching degree or note matching degree that is derived.

このため、本実施形態の楽曲検索処理では、時間進行の中で導出された１つの音声特徴量または音声音符データが、旋律推移特徴量または基準音符データに偶発的に一致しただけでは、特徴量一致度または音符一致度の値は大きなものとならない。よって、本実施形態の楽曲検索処理によれば、誤って、利用者が意図しない楽曲が、意図予想曲として特定されることを低減できる。
［その他の実施形態］
以上、本発明の実施形態について説明したが、本発明は上記実施形態に限定されるものではなく、本発明の要旨を逸脱しない範囲において、様々な態様にて実施することが可能である。 For this reason, in the music search process of the present embodiment, the feature amount matching degree or note matching degree does not become large merely because a single voice feature amount or voice note data item derived during the time progression happens to match the melody transition feature amount or the reference note data. The music search process of the present embodiment can therefore reduce the chance that a song the user did not intend is mistakenly identified as the intended expected song.
[Other Embodiments]
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment and can be implemented in various forms without departing from the gist of the invention.

例えば、上記実施形態における楽曲検索処理では、意図予想曲を検索する際に、特徴量照合処理と、音符照合処理との両方の処理結果に基づいて検索を実行していたが、楽曲検索処理における意図予想曲の検索は、特徴量照合処理にて導出された特徴量一致度にのみ基づくものでもよい。つまり、楽曲検索処理では、値が最大である特徴量一致度に対応する楽曲を意図予想曲としてもよい。この場合、楽曲検索処理のＳ１２０は、実行しなくとも良く、楽曲検索処理のＳ１３０では、特徴量一致度そのものを統合一致度として導出すれば良い。 For example, in the music search process in the above embodiment, when searching for the expected song, the search is executed based on the processing results of both the feature amount matching process and the note matching process. The search for the expected predicted song may be based only on the feature amount matching degree derived by the feature amount matching process. In other words, in the music search process, the music corresponding to the feature value matching degree having the maximum value may be set as the intended expected music. In this case, S120 of the music search process does not have to be executed, and in S130 of the music search process, the feature amount matching degree itself may be derived as the integrated matching degree.

ところで、上記実施形態における楽曲検索処理では、信号処理する対象の音声データを、記憶部２８に記憶された音声データとしていたが、楽曲検索処理にて信号処理の対象とする音声データは、音声入力部２５にてサンプリングされた直後の音声データであっても良い。つまり、楽曲検索処理では、マイクロホン２４を介して入力された音声をリアルタイムに処理しても良い。 By the way, in the music search process in the above embodiment, the audio data to be signal-processed is the audio data stored in the storage unit 28. However, the audio data to be signal-processed in the music search process is the audio input. The audio data immediately after being sampled by the unit 25 may be used. That is, in the music search process, sound input via the microphone 24 may be processed in real time.

また、上記実施形態では、特徴量照合処理において、音声特徴量が照合される頂点データを、楽曲データの一部として予め用意していたが、頂点データは、これに限るものではない。例えば、特徴量照合処理を実行する過程にて生成しても良い。 In the above embodiment, the vertex data against which the voice feature amounts are collated in the feature amount collation process is prepared in advance as part of the music data, but the vertex data is not limited to this. For example, it may be generated while the feature amount collation process is being executed.

なお、上記実施形態では、頂点データを形成する相対音長として、時間比率ｄｔを導出していたが、相対音長は、時間比率ｄｔに限るものではなく、例えば、対象区間の時間長そのもの、即ち、極値間の時間差であっても良い。 In the above embodiment, the time ratio dt is derived as the relative sound length forming the vertex data. However, the relative sound length is not limited to the time ratio dt. For example, the time length itself of the target section, That is, it may be a time difference between extreme values.

また、上記実施形態では、推移特徴量を導出する過程にて実行する音高推移の平滑化として、中央値の算出と、移動平均値の算出とを組み合わせて実行していたが、音高推移の平滑化は、これに限るものではなく、例えば、中央値の算出と、移動平均値の算出とのいずれか一方のみを実行することでなされても良い。さらに言えば、中央値の算出または移動平均の導出以外の周知の方法にて平滑化されていてもよい。 In the above embodiment, the smoothing of the pitch transition performed in the process of deriving the transition feature value is performed by combining the calculation of the median value and the calculation of the moving average value. The smoothing is not limited to this. For example, the smoothing may be performed by executing only one of the median calculation and the moving average calculation. Furthermore, it may be smoothed by a known method other than the calculation of the median value or the derivation of the moving average.
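The combined smoothing mentioned above (a median followed by a moving average) might look like the sketch below. The section length, the window length, the use of non-overlapping unit sections, and the trailing form of the moving average are all assumptions; the function name is hypothetical.

```python
from statistics import median


def smooth_pitch(pitches, section=5, window=2):
    """Smooth a pitch sequence with per-section medians and a moving average.

    `section` is the number of samples per unit section and `window` is the
    number of medians averaged together (both assumed values).
    """
    # Step 1: one median per contiguous unit section suppresses outliers.
    medians = [
        median(pitches[i:i + section])
        for i in range(0, len(pitches), section)
    ]
    # Step 2: a trailing moving average over the medians removes fine wobble.
    smoothed = []
    for i in range(len(medians)):
        chunk = medians[max(0, i - window + 1):i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


if __name__ == "__main__":
    raw = [60, 61, 90, 60, 60, 62, 62, 62, 30, 62, 64, 64, 64, 64, 64]
    print(smooth_pitch(raw))
```

The median step discards isolated pitch detection errors such as the 90 and 30 in the sample input, which a moving average alone would smear into neighboring values.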

また、上記実施形態における特徴量照合処理、及び音符照合処理では、音声特徴量または音声音符データを、旋律推移特徴量または基準音符データに照合する際に、単語化して照合していたが、これらを照合する際には、音声特徴量または音声音符データを単語化することなく、旋律推移特徴量または基準音符データに照合しても良い。 Further, in the feature amount collation processing and the note collation processing in the above embodiment, the speech feature amount or the voice note data is collated as a word when collating with the melodic transition feature amount or the reference note data. May be compared with the melodic transition feature quantity or the reference note data without converting the voice feature quantity or the voice note data into words.

なお、上記実施形態における音声処理装置２０は、スピーカ２７と音声出力部２６とを備えていなくとも良い。
ところで、上記実施形態では、音声処理装置２０にて楽曲検索処理を実行していたが、楽曲検索処理は、サーバ４０にて実行されていても良い。 Note that the audio processing device 20 in the above embodiment may not include the speaker 27 and the audio output unit 26.
In the above embodiment, the music search process is executed by the voice processing device 20, but it may instead be executed by the server 40.

逆に、楽曲検索システム１は、音声処理装置２０のみから構成されていても良い。この場合、楽曲データは、予め記憶部２８に記憶されている必要がある。
［実施形態と特許請求の範囲との対応関係］
最後に、上記実施形態の記載と、特許請求の範囲の記載との関係を説明する。 Conversely, the music search system 1 may be configured only from the voice processing device 20. In this case, the music data needs to be stored in the storage unit 28 in advance.
[Correspondence between Embodiment and Claims]
Finally, the relationship between the description of the above embodiment and the description of the scope of claims will be described.

上記実施形態の特徴量照合処理におけるＳ２５０を実行することで得られる機能が、本発明の音高推移導出手段に相当し、Ｓ２６０を実行することで得られる機能が、本発明の平滑化手段に相当する。さらに、特徴量照合処理におけるＳ２７０を実行することで得られる機能が、本発明の極値検出手段に相当し、Ｓ２８０を実行することで得られる機能が、本発明の特徴量導出手段に相当する。また、特徴量照合処理におけるＳ２９０〜Ｓ３９０を実行することで得られる機能が、本発明の特徴量照合手段に相当する。 The function obtained by executing S250 in the feature amount matching process of the above embodiment corresponds to the pitch transition deriving unit of the present invention, and the function obtained by executing S260 is the smoothing unit of the present invention. Equivalent to. Furthermore, the function obtained by executing S270 in the feature amount matching process corresponds to the extreme value detecting means of the present invention, and the function obtained by executing S280 corresponds to the feature value deriving means of the present invention. . The function obtained by executing S290 to S390 in the feature amount matching process corresponds to the feature amount matching means of the present invention.

そして、上記実施形態の楽曲検索処理におけるＳ１３０〜Ｓ１５０を実行することで得られる機能が、本発明の結果出力手段に相当する。
なお、上記実施形態の音符照合処理におけるＳ４３０〜Ｓ４６０を実行することで得られる機能が、本発明の音符化手段に相当し、Ｓ４７０〜Ｓ５７０を実行することで得られる機能が、本発明の音符照合手段に相当する。 And the function obtained by performing S130-S150 in the music search process of the said embodiment is equivalent to the result output means of this invention.
The function obtained by executing S430 to S460 in the note collation process of the above embodiment corresponds to the note conversion means of the present invention, and the function obtained by executing S470 to S570 corresponds to the note collation means of the present invention.

１…楽曲検索システム２０…音声処理装置２１…通信部２２…表示部２３…操作受付部２４…マイクロホン２５…音声入力部２６…音声出力部２７…スピーカ２８…記憶部３０…制御部３１…ＲＯＭ３２…ＲＡＭ３３…ＣＰＵ４０…サーバ４１…記憶装置４２…マイクロコンピュータ DESCRIPTION OF SYMBOLS 1 ... Music search system, 20 ... Voice processing device, 21 ... Communication unit, 22 ... Display unit, 23 ... Operation reception unit, 24 ... Microphone, 25 ... Voice input unit, 26 ... Voice output unit, 27 ... Speaker, 28 ... Storage unit, 30 ... Control unit, 31 ... ROM, 32 ... RAM, 33 ... CPU, 40 ... Server, 41 ... Storage device, 42 ... Microcomputer

Claims

時間進行に沿って連続して入力された入力音声から、音高の推移を表す音高推移を導出する音高推移導出手段と、
前記音高推移導出手段で導出された音高推移を平滑化した平滑音高推移を導出する平滑化手段と、
前記音高推移における音高変化の極値を音高極値とし、前記平滑化手段で導出された平滑音高推移から、その平滑音高推移についての前記音高極値である平滑音高極値を検出する極値検出手段と、
時間進行に沿って連続する前記音高極値の間での音高差及び時間長の比を推移特徴量とし、前記極値検出手段で検出された平滑音高極値に基づいて、その平滑音高極値についての前記推移特徴量である音声特徴量を導出する特徴量導出手段と、
曲毎に予め用意され、かつ曲を構成する構成音の音高推移が平滑化された平滑化旋律における前記音高極値についての前記推移特徴量を旋律特徴量とし、前記特徴量導出手段で導出された音声特徴量それぞれを前記旋律特徴量それぞれに照合することで、前記音声特徴量と前記旋律特徴量との一致度が高いほど大きな値となる特徴量一致度を前記曲毎に導出する特徴量照合手段と、
少なくとも、前記特徴量照合手段で導出された特徴量一致度に基づき、前記特徴量一致度の中で、値が最大の前記特徴量一致度に対応する曲である入力対応曲を出力する結果出力手段と
を備えることを特徴とする特徴量照合装置。 A feature amount matching device comprising:
pitch transition deriving means for deriving, from input voice input continuously along a time progression, a pitch transition representing the transition of pitch;
smoothing means for deriving a smooth pitch transition obtained by smoothing the pitch transition derived by the pitch transition deriving means;
extreme value detection means for detecting, with an extreme value of the pitch change in a pitch transition defined as a pitch extreme value, smooth pitch extreme values, which are the pitch extreme values of the smooth pitch transition derived by the smoothing means;
feature amount deriving means for deriving, with the pitch difference and the ratio of time lengths between pitch extreme values that are consecutive along the time progression defined as a transition feature amount, voice feature amounts, which are the transition feature amounts for the smooth pitch extreme values detected by the extreme value detection means;
feature amount matching means for deriving, for each song, a feature amount matching degree that takes a larger value as the degree of matching between the voice feature amounts and melody feature amounts is higher, by collating each of the voice feature amounts derived by the feature amount deriving means against each of the melody feature amounts, the melody feature amounts being the transition feature amounts for the pitch extreme values in a smoothed melody that is prepared in advance for each song and in which the pitch transition of the constituent sounds of the song is smoothed; and
result output means for outputting, based at least on the feature amount matching degrees derived by the feature amount matching means, an input corresponding song, which is the song corresponding to the largest of the feature amount matching degrees.

前記特徴量照合手段は、
時間進行に沿った前記音声特徴量それぞれが連続して一致する前記旋律特徴量が多いほど、大きな値の前記特徴量一致度を導出することを特徴とする請求項１に記載の特徴量照合装置。 The feature amount matching device according to claim 1, wherein the feature amount matching means derives a larger feature amount matching degree as more of the melody feature amounts are matched consecutively by the voice feature amounts taken in order along the time progression.

前記平滑化手段は、
前記音高推移の時間進行に対して連続するように規定された時間長それぞれを単位区間とし、前記単位区間それぞれに含まれる全音高の中央値の算出、及び移動平均値の算出の少なくとも一方を前記平滑化として実行することを特徴とする請求項１または請求項２に記載の特徴量照合装置。 The feature amount matching device according to claim 1 or claim 2, wherein the smoothing means takes, as unit sections, time lengths defined so as to be contiguous along the time progression of the pitch transition, and executes, as the smoothing, at least one of calculating the median of all the pitches included in each unit section and calculating a moving average.

前記結果出力手段は、
前記入力対応曲を画像にて表示、及び前記入力対応曲を音声にて通知することの少なくとも一方を前記出力として実行することを特徴とする請求項１ないし請求項３のいずれか一項に記載の特徴量照合装置。 The feature amount matching device according to any one of claims 1 to 3, wherein the result output means executes, as the output, at least one of displaying the input corresponding song as an image and announcing the input corresponding song by voice.

前記音高推移導出手段で導出された音高推移に従って、前記入力音声の音高及び音価を表す音符データに変換する音符化手段と、
曲毎に予め用意され、かつ曲を構成する構成音それぞれの音高及び音価を表すデータを基準音符データとし、前記音符化手段にて変換された音符データそれぞれを前記基準音符データに曲毎に照合することで音符一致度を導出する音符照合手段と
を備え、
前記照合結果出力手段は、
前記特徴量照合手段にて導出された特徴量一致度、及び前記音符照合手段で導出された音符一致度に基づいて、前記特徴量一致度及び前記音符一致度の両方が大きいほど大きな値となるように演算した結果、最も大きな値に対応する曲を前記入力対応曲として出力することを特徴とする請求項１ないし請求項４の何れか一項に記載の特徴量照合装置。 The feature amount matching device according to any one of claims 1 to 4, further comprising:
note conversion means for converting the input voice into note data representing its pitches and note values in accordance with the pitch transition derived by the pitch transition deriving means; and
note matching means for deriving a note matching degree by collating, song by song, each item of note data converted by the note conversion means against reference note data, the reference note data being data that is prepared in advance for each song and represents the pitch and note value of each constituent sound of the song,
wherein the collation result output means performs a calculation on the feature amount matching degree derived by the feature amount matching means and the note matching degree derived by the note matching means such that the result is larger as both are larger, and outputs the song corresponding to the largest result as the input corresponding song.

時間進行に沿って連続して入力された入力音声から、音高の推移を表す音高推移を導出する音高推移導出手順と、
前記音高推移導出手順で導出された音高推移を平滑化した平滑音高推移を導出する平滑化手順と、
前記音高推移における音高変化の極値を音高極値とし、前記平滑化手順で導出された平滑音高推移から、その平滑音高推移についての前記音高極値である平滑音高極値を検出する極値検出手順と、
時間進行に沿って連続する前記音高極値の間での音高差及び時間長の比を推移特徴量とし、前記極値検出手順で検出された平滑音高極値に基づいて、その平滑音高極値についての前記推移特徴量である音声特徴量を導出する特徴量導出手順と、
曲毎に予め用意され、かつ曲を構成する構成音の音高推移が平滑化された平滑化旋律における前記音高極値についての前記推移特徴量を旋律特徴量とし、前記特徴量導出手順で導出された音声特徴量それぞれを前記旋律特徴量それぞれに照合することで、前記音声特徴量と前記旋律特徴量との一致度が高いほど大きな値となる特徴量一致度を前記曲毎に導出する特徴量照合手順と、
少なくとも、前記特徴量照合手順で導出された特徴量一致度に基づき、前記特徴量一致度の中で、値が最大の前記特徴量一致度に対応する曲である入力対応曲を出力する結果出力手順と
をコンピュータに実行させることを特徴とするプログラム。 A program causing a computer to execute:
a pitch transition deriving procedure of deriving, from input voice input continuously along a time progression, a pitch transition representing the transition of pitch;
a smoothing procedure of deriving a smooth pitch transition obtained by smoothing the pitch transition derived in the pitch transition deriving procedure;
an extreme value detection procedure of detecting, with an extreme value of the pitch change in a pitch transition defined as a pitch extreme value, smooth pitch extreme values, which are the pitch extreme values of the smooth pitch transition derived in the smoothing procedure;
a feature amount deriving procedure of deriving, with the pitch difference and the ratio of time lengths between pitch extreme values that are consecutive along the time progression defined as a transition feature amount, voice feature amounts, which are the transition feature amounts for the smooth pitch extreme values detected in the extreme value detection procedure;
a feature amount matching procedure of deriving, for each song, a feature amount matching degree that takes a larger value as the degree of matching between the voice feature amounts and melody feature amounts is higher, by collating each of the voice feature amounts derived in the feature amount deriving procedure against each of the melody feature amounts, the melody feature amounts being the transition feature amounts for the pitch extreme values in a smoothed melody that is prepared in advance for each song and in which the pitch transition of the constituent sounds of the song is smoothed; and
a result output procedure of outputting, based at least on the feature amount matching degrees derived in the feature amount matching procedure, an input corresponding song, which is the song corresponding to the largest of the feature amount matching degrees.