JP2013152337A

JP2013152337A - Note string setting device

Info

Publication number: JP2013152337A
Application number: JP2012012888A
Authority: JP
Inventors: Keijiro Saino; 慶二郎才野; Keiichi Tokuda; 恵一徳田; Keiichiro Oura; 圭一郎大浦
Original assignee: Yamaha Corp; Nagoya Institute of Technology NUC
Current assignee: Yamaha Corp; Nagoya Institute of Technology NUC
Priority date: 2012-01-25
Filing date: 2012-01-25
Publication date: 2013-08-08
Anticipated expiration: 2032-01-25
Also published as: JP5943618B2

Abstract

PROBLEM TO BE SOLVED: To realize flexible and versatile note division for a character string of lyrics. SOLUTION: A note string setting section 24 sets a specific note string M suited to a specified character string X in which a plurality of sound units x[n] are arranged in time series. An analysis processing section 30 applies a probability model generated by prior machine learning to the specified character string X, thereby generating a connection information string Y in which connection information y[n], specifying for every sound unit x[n] the presence or absence of connection between successive sound units x[n] in the specified character string X, is arranged in time series. The probability model is set by machine learning using a plurality of learning data, each including a learning character string XL in which a plurality of sound units x[n] are arranged in time series and a connection information string YL in which a plurality of pieces of connection information y[n] are arranged in time series. A note string acquiring section 40 acquires the specific note string M, which has a plurality of notes corresponding to the respective note division units z[m] obtained by applying the presence or absence of connection specified by the connection information string Y to the sound units x[n] of the specified character string X.

Description

The present invention relates to a technique for analyzing a character string, and is particularly suited to associating a character string of lyrics with a note string.

Techniques for generating a note string (melody) suited to lyrics specified by a user have been proposed. For example, Patent Document 1 discloses a technique that generates a note string by giving each syllable (character) of the lyrics a pitch corresponding to the intonation of each word constituting the lyrics. Non-Patent Document 1 discloses a technique that generates a note string whose pitch varies in accordance with the prosody of the lyrics.

Patent Document 1: JP 2002-149179 A

Non-Patent Document 1: Fukayama and six others, "Orpheus: Automatic Composition System Based on Prosody of Lyrics", Information Processing Society of Japan Research Report [Music Information Science], 2008(78), p.179-184, July 30, 2008

In practice, however, the correspondence between lyrics and notes is highly diverse. In some songs each syllable of the lyrics corresponds one-to-one to a note of the note string, while in many others a plurality of syllables of the lyrics correspond to a single note. The tendency for a plurality of syllables to correspond to one note is especially pronounced in fields such as rap music. The techniques of Patent Document 1 and Non-Patent Document 1, however, can generate only monotonous note strings in which each note corresponds one-to-one to each syllable of the lyrics.

On the other hand, the relationship between the number of syllables and the number of notes could be determined in advance for each word of the lyrics so that one note corresponds to two or more syllables, but the relationship between each word and its number of notes then becomes uniform. For example, under a rule that the whole of the word "ない" (nai, "not exist"; two syllables) corresponds to one note, the word "ない" is always given one note regardless of the surrounding content of the lyrics, and a note string that assigns "な" (na) and "い" (i) to separate notes is never generated. The aforementioned problem that diverse note strings cannot be generated for lyrics is therefore not fundamentally solved. In view of these circumstances, an object of the present invention is to realize flexible and diverse note division for a character string of lyrics.

The means adopted by the present invention to solve the above problems will now be described. To facilitate understanding of the invention, the following description notes in parentheses the correspondence between each element of the invention and the elements of the embodiments described later, but this is not intended to limit the scope of the invention to the illustrated embodiments.

A note string setting device of the present invention is a device that sets a note string (for example, specific note string M) corresponding to a designated character string (for example, designated character string X) in which a plurality of sound units are arranged in time series. The device comprises: analysis processing means (for example, analysis processing unit 30) that generates a connection information string (for example, connection information string Y), in which connection information (for example, connection information y[n]) for each sound unit in the designated character string is arranged in time series, by applying to the designated character string a probability model generated by machine learning using a plurality of learning data (for example, learning data L), each of which includes a learning character string (for example, learning character string XL) in which a plurality of sound units are arranged in time series and a learning connection information string (for example, learning connection information string YL) in which connection information specifying, for each sound unit, the presence or absence of connection between successive sound units in the learning character string is arranged in time series; and note string acquisition means (for example, note string acquisition unit 40) that acquires a note string in which a plurality of notes corresponding to the respective note division units (for example, note division units z[m]), obtained by applying the presence or absence of connection specified by the connection information string to the sound units of the designated character string, are arranged in time series.

In the above configuration, a connection information string specifying the presence or absence of connection for each sound unit of the designated character string is generated. Therefore, compared with the techniques of Patent Document 1 and Non-Patent Document 1, in which each syllable of the lyrics is assigned one-to-one to a note of the note string, diverse note division is realized in which the number of sound units corresponding to each note is variable. Moreover, because the connection information string is generated by applying the probability model to the designated character string X, more flexible and diverse note division is possible than with, for example, a configuration in which the relationship between the number of syllables and the number of notes is determined in advance for each word of the lyrics.

Note that a note string means a time series of a plurality of notes with specified pitches (a pitch sequence). The pitch of each note is typically selected from the 12 semitones of equal temperament, but may be chosen arbitrarily. For example, the pitches of any existing scale (for example, a pentatonic scale), or pitches chosen independently of existing scales (for example, pitches chosen at arbitrary intervals), may be specified for the notes of the note string. Specifying a duration for each note of the note string is not essential.

A note string setting device according to a preferred aspect of the present invention comprises storage means (for example, storage device 14) that stores a plurality of note strings (for example, note strings M[k]), each being a plurality of notes arranged in time series, and the note string acquisition means selects, from the plurality of note strings, a note string whose number of notes corresponds to the number of note division units obtained by applying the presence or absence of connection specified by the connection information string to the sound units of the designated character string. In this aspect, a note string whose number of notes corresponds to the number of note division units is selected from the plurality of note strings in the storage means. Compared with, for example, a configuration in which the note corresponding to each note division unit is selected automatically by a predetermined rule, this has the advantage that a natural note string comparable to existing music can be set for the designated character string.

In a preferred aspect of the present invention, the note string acquisition means includes: first selection means (for example, first selection unit 41) that selects, from the plurality of note strings, a plurality of candidate note strings (for example, candidate note strings MC) whose number of notes corresponds to the number of note division units; and second selection means (for example, second selection unit 42) that calculates, for each of the plurality of candidate note strings, an error index value (for example, error index value E) according to the difference between a reference length (for example, reference length TZ) that depends on the number of sound units constituting each note division unit and the duration (for example, duration TM) of the note corresponding to that note division unit in the candidate note string, and selects one candidate note string according to the error index values of the candidate note strings. In this aspect, because the candidate note string is selected according to an error index value reflecting the difference between the reference length, which depends on the number of sound units constituting each note division unit, and the duration of each note, there is the advantage that a natural note string can be set in which the sound units of the designated character string and the notes of the note string correspond without strain.
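The two-stage selection described above can be sketched roughly as follows. This is only an illustration under stated assumptions: this passage does not give the exact form of the reference length TZ or of the error index value E, so the sketch assumes that TZ is proportional to the number of sound units in a note division unit (the hypothetical parameter `unit_len`), that E is the sum of absolute differences |TZ - TM|, and that the candidate with the smallest E is chosen. The names `NoteString`, `error_index`, and `select_note_string` are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NoteString:
    name: str
    durations: list[float]  # duration TM of each note, in beats (assumed unit)


def error_index(counts: list[int], durations: list[float], unit_len: float) -> float:
    """Assumed error index E: sum over notes of |TZ - TM|.

    counts[m] is the number of sound units in note division unit z[m];
    TZ is assumed to be unit_len * counts[m].
    """
    return sum(abs(unit_len * c - tm) for c, tm in zip(counts, durations))


def select_note_string(
    counts: list[int], stored: list[NoteString], unit_len: float = 0.5
) -> NoteString:
    # First selection: keep stored note strings whose number of notes
    # equals the number of note division units.
    candidates = [m for m in stored if len(m.durations) == len(counts)]
    if not candidates:
        raise ValueError("no stored note string has a matching number of notes")
    # Second selection (assumed rule): pick the candidate with the smallest E.
    return min(candidates, key=lambda m: error_index(counts, m.durations, unit_len))


stored = [
    NoteString("a", [1.0, 0.5]),
    NoteString("b", [0.5, 1.0]),
    NoteString("c", [1.0, 1.0, 1.0]),
]
# Two note division units containing 2 and 1 sound units respectively.
print(select_note_string([2, 1], stored).name)  # a
```

With `unit_len = 0.5`, the reference lengths are 1.0 and 0.5 beats, so note string "a" matches exactly while "b" has an error of 1.0; "c" is excluded in the first selection because it has three notes.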

In a preferred aspect of the present invention, the probability model is a conditional random field probability model defined by a plurality of features. Considering general note division tendencies that hold for a large number of songs (rap music in particular), the plurality of features defining the probability model are chosen to include at least one of the following: a feature that fires when the sound unit is a vowel and the connection information specifies connection (for example, feature f1); a feature that fires when the sound unit is a moraic nasal (hatsuon) and the connection information specifies connection (for example, feature f2); a feature that fires when the sound unit is a long sound and the connection information specifies connection (for example, feature f3); a feature that fires when the sound unit is a geminate consonant (sokuon) and the connection information specifies connection (for example, feature f4); a feature that fires when the sound unit constitutes a specific part of speech and the connection information specifies connection (for example, feature f5); and a feature that fires when the sound unit is a devoiced sound and the connection information specifies connection (for example, feature f6). This aspect has the advantage that a connection information string fully reflecting the note division tendencies can be generated. It is also possible to apply to the probability model a feature that fires when a glissando is applied to the immediately preceding sound unit (more specifically, when the immediately preceding sound unit is the first of a plurality of sound units assigned to a note to which a glissando is applied) and the connection information specifies connection with the immediately preceding sound unit (for example, feature f7), or a feature that fires when the immediately preceding sound unit is accented and the connection information specifies connection with the immediately preceding sound unit (for example, feature f8). The glissando-related feature may be divided into a feature that fires when a glissando in the direction of rising pitch is applied to the immediately preceding sound unit (more specifically, when the immediately preceding sound unit is the first of a plurality of sound units assigned to a note to which a rising glissando is applied) and the connection information specifies connection with the immediately preceding sound unit (for example, feature f7a), and a feature that fires when a glissando in the direction of falling pitch is applied to the immediately preceding sound unit (more specifically, when the immediately preceding sound unit is the first of a plurality of sound units assigned to a note to which a falling glissando is applied) and the connection information specifies connection with the immediately preceding sound unit (for example, feature f7b).
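As a rough illustration of what "firing" means for features f1 to f4, each can be written as a binary function of a sound unit and its connection information. The sketch below makes several assumptions: sound units are katakana strings, connection is encoded as the value 0 (the encoding used in the first embodiment), and the helper names `make_feature` and `feature_vector` are hypothetical. Features f5 to f8 are omitted because they need part-of-speech, devoicing, glissando, or accent information that a plain string does not carry.

```python
CONNECT = 0  # connection information value meaning "join to the preceding sound unit"

VOWELS = set("アイウエオ")


def make_feature(predicate):
    """Build a binary feature that fires (returns 1) only when the
    sound unit satisfies `predicate` and connection is specified."""
    def feature(x_n: str, y_n: int) -> int:
        return 1 if predicate(x_n) and y_n == CONNECT else 0
    return feature


f1 = make_feature(lambda u: u in VOWELS)  # sound unit is a vowel
f2 = make_feature(lambda u: u == "ン")     # moraic nasal (hatsuon)
f3 = make_feature(lambda u: u == "ー")     # long sound
f4 = make_feature(lambda u: u == "ッ")     # geminate consonant (sokuon)


def feature_vector(x_n: str, y_n: int) -> list[int]:
    return [f(x_n, y_n) for f in (f1, f2, f3, f4)]


print(feature_vector("ー", 0))  # [0, 0, 1, 0]
print(feature_vector("ー", 1))  # [0, 0, 0, 0]
```

In a conditional random field, each such feature is paired with a learned weight, so a feature that fires often in the learning data for connected sound units raises the probability of connection for similar sound units.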

A note string setting device according to a preferred aspect of the present invention comprises character string acquisition means (for example, character string acquisition unit 22) that divides a character string to be processed (for example, character string X0) to generate a plurality of designated character strings, and for each of the plurality of designated character strings, generation of a connection information string by the analysis processing means and acquisition of a note string by the note string acquisition means are executed. In this aspect, because the character string to be processed is divided into a plurality of designated character strings and the connection information string is generated and the note string acquired for each designated character string, there is the advantage that an appropriate note string can be set even when the character string is sufficiently long. A specific example of this aspect is described later as, for example, the fourth embodiment.

The note string setting device according to each of the above aspects may be realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to setting note strings, or by cooperation between a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) and a program. A program according to the present invention is a program for setting a note string corresponding to a designated character string in which a plurality of sound units are arranged in time series, and causes a computer to execute: an analysis process of generating a connection information string, in which connection information for each sound unit in the designated character string is arranged in time series, by applying to the designated character string a probability model generated by machine learning using a plurality of learning data, each of which includes a learning character string in which a plurality of sound units are arranged in time series and a learning connection information string in which connection information specifying, for each sound unit, the presence or absence of connection between successive sound units in the learning character string is arranged in time series; and a note string acquisition process of acquiring a note string in which a plurality of notes corresponding to the respective note division units, obtained by applying the presence or absence of connection specified by the connection information string to the sound units of the designated character string, are arranged in time series. This program achieves the same operation and effects as the note string setting device according to the present invention. The program of the present invention may be provided stored on a computer-readable recording medium and installed on a computer, or provided by distribution over a communication network and installed on a computer.

FIG. 1 is a block diagram of a speech synthesizer according to a first embodiment of the present invention.
FIG. 2 is a schematic diagram of a designated character string, a connection information string, and a note division character string.
FIG. 3 is a block diagram of a note string setting unit.
FIG. 4 is an explanatory diagram of the correspondence (note division) between each note division unit of a note division character string and each note of a note string.
FIG. 5 is a schematic diagram of learning data.
FIG. 6 is a block diagram of a note string acquisition unit.
FIG. 7 is a musical score notating the singing sound of rap music.

<First Embodiment>
FIG. 1 is a block diagram of a speech synthesizer 100 according to the first embodiment of the present invention. The speech synthesizer 100 is a signal processing device that sets a note string suitable as a singing melody for a character string X0 of lyrics specified by a user and generates a speech signal V of the singing sound of that note string. It is realized by a computer system comprising an arithmetic processing device 12, a storage device 14, an input device 16, and a sound emitting device 18. The following description assumes that the singing sound of rap music is synthesized.

The storage device 14 stores a program PGM executed by the arithmetic processing device 12 and various information used by the arithmetic processing device 12 (for example, note strings M[1] to M[K] and probability model information Q). Any known recording medium such as a semiconductor or magnetic recording medium, or a combination of several types of recording media, may be employed as the storage device 14. The storage device 14 of the first embodiment stores K note strings M[1] to M[K] representing mutually different melodies (K is a natural number of 2 or more). Each note string M[k] (k = 1 to K) is a time series of a plurality of notes with specified pitch and duration. Specifically, a section of predetermined length (for example, one bar) extracted from an existing song is stored in advance in the storage device 14 as a note string M[k]. The K note strings M[1] to M[K] include two or more note strings M[k] composed of different numbers of notes. Each note string M[k] is described, for example, as time-series data in MIDI (Musical Instrument Digital Interface) format, in which event data that specify the pitch of each note and instruct sound generation or silencing are arranged together with timing data that specify when each piece of event data is to be processed.

The input device 16 is a device that accepts instructions from the user to the speech synthesizer 100 and includes, for example, a plurality of controls operated by the user. By operating the input device 16 appropriately, the user can specify a desired character string X0 as lyrics. The character string X0 is specified, for example, in kanji and kana. A microphone for entering instructions to the speech synthesizer 100 by voice may also be employed as the input device 16.

By executing the program PGM stored in the storage device 14, the arithmetic processing device 12 realizes a plurality of functions (character string acquisition unit 22, note string setting unit 24, and speech synthesis unit 26) for generating a speech signal V corresponding to the character string X0 specified by the user. A configuration in which the functions of the arithmetic processing device 12 are distributed over a plurality of devices, or in which a dedicated electronic circuit (DSP) realizes each function, may also be adopted. The sound emitting device 18 (for example, a speaker or headphones) reproduces sound waves corresponding to the speech signal V generated by the arithmetic processing device 12.

The character string acquisition unit 22 in FIG. 1 generates a designated character string X from the character string X0 specified by the user. The designated character string X is a time series of a plurality of sound units x[n] (x[1], x[2], x[3], ...). One sound unit x[n] in the first embodiment corresponds to one mora (beat). One mora is a segmental unit of speech spanning a specific length of time (the time corresponding to one short syllable). In Japanese, the long sound "ー", the geminate consonant (sokuon) "ッ", and the moraic nasal (hatsuon) "ン" each count as one mora, whereas small kana such as "ョ" and "ェ" do not constitute a mora on their own; they form one mora together with the immediately preceding character (a contracted sound, yoon), as in "キョ" (kyo) and "シェ" (she).

The character string acquisition unit 22 of the first embodiment converts the character string X0, in which kanji and kana are mixed, into kana (katakana) and, as shown in part (A) of FIG. 2, generates the designated character string X by dividing the converted character string X0 into sound units x[n]. Any known natural language processing, including morphological analysis, may be employed for the generation of the designated character string X by the character string acquisition unit 22.
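The division of a katakana string into mora-sized sound units can be sketched as follows. This is a minimal illustration that assumes the conversion to katakana has already been done; the function name `split_morae` and the particular set of small kana are assumptions made for the example, not details taken from the embodiment, which relies on known natural language processing.

```python
# Small kana that merge with the preceding character to form one mora.
# The sokuon "ッ" is deliberately excluded: it counts as a mora by itself,
# as do the long sound "ー" and the moraic nasal "ン".
SMALL_KANA = set("ャュョァィゥェォヮ")


def split_morae(katakana: str) -> list[str]:
    """Split a katakana string into sound units x[n], one mora each."""
    units: list[str] = []
    for ch in katakana:
        if ch in SMALL_KANA and units:
            units[-1] += ch  # attach small kana to the preceding character
        else:
            units.append(ch)
    return units


print(split_morae("キョーワ"))  # ['キョ', 'ー', 'ワ']
```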

The note string setting unit 24 in FIG. 1 selects, from the K note strings M[1] to M[K] stored in the storage device 14, one note string M[k] suitable as a melody for singing the designated character string X (hereinafter "specific note string M"). The speech synthesis unit 26 generates (synthesizes) a speech signal V of a singing sound in which the designated character string X is sung to the melody of the specific note string M. Any known technique may be employed for the speech synthesis performed by the speech synthesis unit 26.

The specific configuration and operation of the note string setting unit 24 are described below. FIG. 3 is a block diagram of the note string setting unit 24. As shown in FIG. 3, the note string setting unit 24 includes an analysis processing unit 30 and a note string acquisition unit 40, whose configurations and operations are detailed in turn below.

<Analysis processing unit 30>
The analysis processing unit 30 generates a connection information string Y that specifies the correspondence (note division) between each sound unit x[n] in the designated character string X and each note. FIG. 4 illustrates the note division of the designated character string X in part (A) of FIG. 2. As shown in FIG. 4, one or more sound units x[n] are assigned to each note of the specific note string M. The connection information string Y generated by the analysis processing unit 30 is information that specifies the range of one or more sound units x[n] in the designated character string X that are assigned to one note in the specific note string M. As shown in FIG. 4, a plurality of notes connected by ties in a note string M[k] are treated as one note.

As illustrated in part (B) of FIG. 2, the connection information string Y of the first embodiment is a time series of a plurality of pieces of connection information y[n] (y[1], y[2], y[3], ...) corresponding to the respective sound units x[n] in the designated character string X. The connection information y[n] corresponding to any one sound unit x[n] is information (a flag) specifying whether that sound unit x[n] is connected to the immediately preceding sound unit x[n-1] and assigned together with it to one note. Specifically, a value of 0 for the connection information y[n] means that the sound unit x[n] is connected to the immediately preceding sound unit x[n-1], and a value of 1 means that the sound unit x[n] is not connected to the immediately preceding sound unit x[n-1].

For example, because the connection information y[2] is 0 in the connection information string Y illustrated in part (B) of FIG. 2, the character string "キョー" (that is, two sound units x[n]), formed by connecting the sound unit x[2] "ー" corresponding to the connection information y[2] to the immediately preceding sound unit x[1] "キョ", is assigned to one note in the specific note string M, as shown in FIG. 4. Likewise, because the connection information y[7] is 0 in the example of part (B) of FIG. 2, the character string "デス", formed by connecting the sound unit x[7] "ス" corresponding to the connection information y[7] to the immediately preceding sound unit x[6] "デ", is assigned to one note in the specific note string M, as shown in FIG. 4. Three or more sound units x[n] may also be connected according to the connection information y[n]. On the other hand, for each of the sound units x[3] to x[5] illustrated in FIG. 2, both its own connection information y[n] and the immediately following connection information y[n+1] are 1, so, as shown in FIG. 4, each of these sound units x[n] is assigned on its own to one note in the specific note string M.

As understood from the above description, connecting the successive sound units x[n] in the designated character string X according to each piece of connection information y[n] in the connection information string Y specifies a note division character string Z, in which a plurality of note division units z[m] (z[1], z[2], z[3], ...) are arranged in time series, as illustrated in part (C) of FIG. 2 and in FIG. 4. Each note division unit z[m] in the note division character string Z is a unit assigned to one note in the specific note string M, and corresponds either to one sound unit x[n] in the designated character string X or to a combination of a plurality of successive sound units x[n]. For example, the note division unit z[1] illustrated in part (C) of FIG. 2 corresponds to the character string “キョー”, obtained by connecting the sound unit x[1] “キョ” and the sound unit x[2] “ー” according to the connection information y[2], and the note division unit z[2] corresponds to the single sound unit x[3] “ワ” (wa).
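The mapping from X and Y to Z described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `apply_connections` is hypothetical, the values of x[4] and x[5] (“ハ”, “レ”) are assumed from the lyric “キョーワハレデス”, and y[1] is assumed to be 1 because the first sound unit has no predecessor.

```python
def apply_connections(units, flags):
    """Group sound units x[n] into note division units z[m].

    flags[n] == 0: connect units[n] to the preceding unit (same note).
    flags[n] == 1: start a new note division unit.
    """
    if len(units) != len(flags):
        raise ValueError("units and flags must have the same length")
    groups = []
    for unit, flag in zip(units, flags):
        if flag == 0 and groups:
            groups[-1] += unit   # y[n] = 0: join the preceding unit
        else:
            groups.append(unit)  # y[n] = 1 (or first unit): new unit
    return groups


X = ["キョ", "ー", "ワ", "ハ", "レ", "デ", "ス"]  # x[1]..x[7]; x[4], x[5] assumed
Y = [1, 0, 1, 1, 1, 1, 0]                        # y[2] = y[7] = 0, as in the text
Z = apply_connections(X, Y)
print(Z)
```

With these inputs the result has five note division units, matching z[1] to z[5] in the description.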

Whether successive sound units x[n] in the designated character string X are connected (that is, whether each sound unit x[n] is assigned to one note on its own or is connected to its neighbors and assigned to one note) is determined in consideration of the note division tendency observed in the singing of existing songs (hereinafter the “note division tendency”). The analysis processing unit 30 generates the connection information string Y by applying, to the designated character string X, a probability model in which the note division tendency has been reflected by prior machine learning. The probability model applied by the analysis processing unit 30 of the first embodiment is a log-linear model using conditional random fields (CRF). The CRF probability model defines, by Equation (1), the conditional probability P(Y|X) that the connection information string Y (Y = {y[1], y[2], y[3], ...}) occurs given that the designated character string X (X = {x[1], x[2], x[3], ...}) has been observed.
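The image of Equation (1) is not reproduced in this text. Based on the symbols defined later in the description (the weights θ_f, the firing counts φ_f(X,Y), an exponential function, and a denominator that sums over all connection information strings), it presumably has the standard log-linear CRF form below. The primed symbol Y' for the summation variable is notation assumed for this reconstruction.

```latex
P(Y \mid X) =
  \frac{\exp\Bigl(\sum_{f} \theta_f \, \phi_f(X, Y)\Bigr)}
       {\sum_{Y'} \exp\Bigl(\sum_{f} \theta_f \, \phi_f(X, Y')\Bigr)}
\tag{1}
```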

The processing by which the analysis processing unit 30 identifies the connection information string Y that is optimal for the designated character string X under the note division tendency corresponds, as expressed by Equation (2), to an operation that identifies the connection information string Y maximizing the conditional probability P(Y|X) for the designated character string X.
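Equation (2) is likewise not reproduced in this text. From the sentence above it can be reconstructed as a maximization over Y; the hat notation for the selected string is an assumption of this reconstruction.

```latex
\hat{Y} = \operatorname*{arg\,max}_{Y} \; P(Y \mid X)
\tag{2}
```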

The denominator of Equation (1) is a normalization term that restricts the conditional probability P(Y|X) to a positive number no greater than 1 (a probability value). Because it is a sum over all possible connection information strings Y, it does not depend on the connection information string Y being chosen in Equation (2). The probability model that identifies the optimal connection information string Y for the designated character string X is therefore expressed by Equation (3), obtained by transforming Equation (2).
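Equation (3) is not reproduced in this text either. Since the denominator of Equation (1) is the same for every Y and the exponential function is monotonically increasing, a form consistent with the description is the one below. The original may retain the exponential, which would not change the maximizing Y; the hat notation is again an assumption.

```latex
\hat{Y} = \operatorname*{arg\,max}_{Y} \; \sum_{f} \theta_f \, \phi_f(X, Y)
\tag{3}
```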

As described above, the analysis processing unit 30 of the first embodiment selects, from among a plurality of (for example, all possible) connection information strings Y, the connection information string Y for which the conditional probability P(Y|X) for the designated character string X is maximal.

The probability model of Equation (3) (Equation (1)) is defined by a plurality of features f. Each feature f is a function that defines a relationship between the designated character string X and the connection information string Y. Specifically, a function that is set to 1 when a sound unit x[n] of the designated character string X and the corresponding connection information y[n] of the connection information string Y satisfy a predetermined condition, and to 0 otherwise, is adopted as a feature f. In other words, each feature f is a function that detects that a sound unit x[n] and its connection information y[n] satisfy a predetermined condition. In the following description, the state in which the condition for a feature f holds and the feature f is set to 1 (the feature f detects that the condition holds) may be expressed as “the feature f fires”. The firing condition differs from feature to feature. The features f are chosen so that they fire frequently on the songs used for machine learning (that is, so that they match the note division tendency). Specifically, in consideration of the tendency for vowels, the moraic nasal (ン), the long vowel (ー), and the geminate consonant (ッ) to be connected to the immediately preceding syllable and uttered within one note, the features f applied to the probability model of the first embodiment include the following four features f1 to f4.
Feature f1: fires when the sound unit x[n] is a vowel and the connection information y[n] is 0.
Feature f2: fires when the sound unit x[n] is a moraic nasal and the connection information y[n] is 0.
Feature f3: fires when the sound unit x[n] is a long vowel and the connection information y[n] is 0.
Feature f4: fires when the sound unit x[n] is a geminate consonant and the connection information y[n] is 0.
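The four features can be written as simple indicator functions. The following is a minimal sketch under stated assumptions: sound units are katakana moras, a “vowel” is taken to be a mora consisting of a bare vowel (ア, イ, ウ, エ, オ), and all function and variable names are hypothetical.

```python
VOWELS = {"ア", "イ", "ウ", "エ", "オ"}  # assumed definition of a vowel mora


def f1(x, y):
    """Fires (returns 1) for a vowel connected to the preceding unit."""
    return int(x in VOWELS and y == 0)


def f2(x, y):
    """Fires for the moraic nasal connected to the preceding unit."""
    return int(x == "ン" and y == 0)


def f3(x, y):
    """Fires for the long vowel mark connected to the preceding unit."""
    return int(x == "ー" and y == 0)


def f4(x, y):
    """Fires for the geminate consonant connected to the preceding unit."""
    return int(x == "ッ" and y == 0)


FEATURES = {"f1": f1, "f2": f2, "f3": f3, "f4": f4}


def fired(x, y):
    """Names of the features that fire for one pair (x[n], y[n])."""
    return [name for name, f in FEATURES.items() if f(x, y) == 1]


print(fired("ー", 0))  # the long vowel joined to its predecessor
print(fired("ー", 1))  # the same sound unit left unconnected
```

No feature fires when y[n] is 1, which reflects that every listed condition requires connection to the preceding sound unit.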

The above description exemplifies features f that define the relationship between a sound unit x[n] and its connection information y[n] (observation features), but features f that define the relationship between successive pieces of connection information y[n] in the connection information string Y (transition features) may also be reflected in the probability model. However, considering that it is difficult to find a specific tendency in the time series of connection information y[n] itself, observation features such as those exemplified above are preferably applied to the probability model used to generate the connection information string Y.

The symbol φ_f(X,Y) in Equation (1) is a function that counts the number of times one feature f fires under the relationship between the designated character string X and the connection information string Y (that is, the number of times a sound unit x[n] and its connection information y[n] satisfy the predetermined condition). The symbol θ_f in Equation (1) denotes the weight (importance) of one feature f. Accordingly, the part Σ_f θ_f φ_f(X,Y) of Equation (3), which sums the product of the weight θ_f and the firing count φ_f(X,Y) over all features f, corresponds to the confidence in the connection information string Y for the designated character string X (its likelihood under the note division tendency). The exponential function (e) is introduced in Equation (1) as a convenient way of restricting the conditional probability P(Y|X) to positive values (a probability distribution).
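The firing count, the weighted sum, and the normalized probability can be illustrated as follows. This is a sketch only: the weights in `THETA` are invented for illustration (the real values come from machine learning), the vowel set and all names are hypothetical, and the normalization is done by brute-force enumeration of all 2^N strings Y, which is practical only for short inputs.

```python
import itertools
import math

VOWELS = {"ア", "イ", "ウ", "エ", "オ"}  # assumed definition of a vowel mora
CONDITIONS = {  # sound-unit condition of each feature (all also require y[n] == 0)
    "f1": lambda x: x in VOWELS,
    "f2": lambda x: x == "ン",
    "f3": lambda x: x == "ー",
    "f4": lambda x: x == "ッ",
}
THETA = {"f1": 0.8, "f2": 1.2, "f3": 1.5, "f4": 1.0}  # hypothetical weights


def phi(name, X, Y):
    """Firing count of one feature over the whole pair (X, Y)."""
    cond = CONDITIONS[name]
    return sum(1 for x, y in zip(X, Y) if y == 0 and cond(x))


def score(X, Y):
    """Sum over all features of theta_f * phi_f(X, Y)."""
    return sum(THETA[name] * phi(name, X, Y) for name in CONDITIONS)


def probability(X, Y):
    """P(Y|X), normalized by brute force over all 2**N strings Y."""
    norm = sum(math.exp(score(X, list(c)))
               for c in itertools.product((0, 1), repeat=len(X)))
    return math.exp(score(X, Y)) / norm


X = ["キョ", "ー", "ワ", "ハ", "レ", "デ", "ス"]  # x[4], x[5] assumed
Y_joined = [1, 0, 1, 1, 1, 1, 1]  # "ー" connected to "キョ"
Y_split = [1, 1, 1, 1, 1, 1, 1]   # no connections at all
print(score(X, Y_joined), score(X, Y_split))
print(probability(X, Y_joined) > probability(X, Y_split))
```

Because the normalization term is identical for both candidates, comparing the scores alone gives the same ranking as comparing the probabilities.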

The machine learning that reflects the note division tendency in the probability model expressed by Equation (3) is a process of selecting the weight θ_f of each feature f so that the conditional probability P(Y|X) of Equation (1) takes a large value for a large number of learning data L created in advance from existing songs. Over the learning data L, a feature f is given a larger weight θ_f (the numerical range and sign are not restricted) the higher its firing ratio, that is, the ratio of the number of times the connection information y[n] in the learning connection information string YL is 0 to the number of occurrences, in the learning character string XL, of the sound unit x[n] that the feature f specifies (for example, a vowel, moraic nasal, long vowel, or geminate consonant). In other words, a larger weight is given to a feature f that can detect with high accuracy the ratio of 0 values of the connection information y[n] for a particular sound unit x[n] in the learning character strings XL. As shown in FIG. 5, each item of learning data L includes a learning character string XL and a learning connection information string YL. The learning character string XL is, like the designated character string X, a time series of a plurality of sound units x[n] (moras in the first embodiment), and the learning connection information string YL is, like the connection information string Y, a time series of a plurality of pieces of connection information y[n].

Specifically, each item of learning data L is created according to the correspondence (that is, the note division) between the notes of a note string of predetermined length (for example, one measure) extracted from a large number of existing songs (vocal songs) and the sound units into which the singing is divided. For example, as shown in FIG. 5, consider the learning character string XL “キ|ノ|ー|ワ|ア|メ|デ|シ|タ” (“|” denotes a boundary between sound units x[n]), a time series of nine sound units x[1] to x[9] extracted from the singing of an existing song. If, in the existing song used for learning, the character string “ノー” (nō), which connects the sound unit x[2] “ノ” and the sound unit x[3] “ー”, is uttered within one note, and the character string “デシ” (deshi), which connects the sound unit x[7] “デ” and the sound unit x[8] “シ”, is uttered within one note, a learning connection information string YL is generated in which the connection information y[3] and y[8] are set to 0 (the value meaning connection to the immediately preceding sound unit x[n-1]) and the remaining connection information y[n] is set to 1. Each item of learning data L can be created, for example, by the provider of the speech synthesizer 100 analyzing a large number of songs.
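The construction of one item of learning data from an observed note division can be sketched as below. The input representation (a list of notes, each holding the sound units sung within it) and the function name `connection_flags` are assumptions made for illustration.

```python
def connection_flags(notes):
    """Derive (XL, YL) from sound units grouped by the note they are sung in.

    The first unit of each note gets 1 (not connected to its predecessor);
    every further unit within the same note gets 0 (connected).
    """
    XL, YL = [], []
    for note in notes:
        for i, unit in enumerate(note):
            XL.append(unit)
            YL.append(1 if i == 0 else 0)
    return XL, YL


# Observed note division of the example in FIG. 5: "ノー" and "デシ" share a note.
notes = [["キ"], ["ノ", "ー"], ["ワ"], ["ア"], ["メ"], ["デ", "シ"], ["タ"]]
XL, YL = connection_flags(notes)
print(XL)
print(YL)
```

In the resulting YL, only the third and eighth entries (y[3] and y[8]) are 0, as in the description.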

The weight θ_f of each feature f is determined in advance by machine learning in which the learning character string XL and the learning connection information string YL of each item of learning data L described above are applied as the designated character string X and the connection information string Y of Equation (1), and the weights θ_f are stored in the storage device 14 as probability model information Q that defines the probability model of Equation (3). The analysis processing unit 30 generates the connection information string Y by applying, to the designated character string X, the probability model of Equation (3) defined by the weights θ_f of the probability model information Q. The connection information string Y that is optimal for the designated character string X under the note division tendency across the many songs used for learning (a connection information string Y reflecting that tendency) is thereby identified. The arithmetic processing unit 12 of the speech synthesizer 100 may generate the probability model information Q from the learning data L and store it in the storage device 14; a configuration in which probability model information Q generated by an external device is provided to the speech synthesizer 100 via a portable recording medium or a communication line and stored in the storage device 14 is also suitable. The above is the specific configuration and operation of the analysis processing unit 30.

＜Note string acquisition unit 40＞
The note string acquisition unit 40 in FIG. 3 selects, as the specific note string M, one note string M[k] from the K note strings M[1] to M[K] in the storage device 14, according to the designated character string X acquired by the character string acquisition unit 22 and the connection information string Y generated by the analysis processing unit 30. The note string acquisition unit 40 of the first embodiment acquires, as the specific note string M, a note string M[k] whose number of notes corresponds to the number of note division units z[m] in the note division character string Z (part (C) of FIG. 2), which is obtained by applying the presence or absence of connection specified by each piece of connection information y[n] to the sound units x[n] of the designated character string X. FIG. 6 is a block diagram of the note string acquisition unit 40 of the first embodiment. As shown in FIG. 6, the note string acquisition unit 40 includes a first selection unit 41 and a second selection unit 42.

The first selection unit 41 generates the note division character string Z, in which a plurality of note division units z[m] are arranged in time series, according to the designated character string X and the connection information string Y, and identifies, among the K note strings M[1] to M[K] stored in the storage device 14, every note string M[k] that consists of the same number of notes as there are note division units z[m] in the note division character string Z (hereinafter a “candidate note string MC”). For example, in part (C) of FIG. 2 and in FIG. 4 the note division character string Z consists of five note division units z[1] to z[5], so each note string M[k] consisting of five notes is selected as a candidate note string MC. That is, the first selection unit 41 selects candidate note strings MC whose notes correspond one-to-one to the note division units z[m] of the note division character string Z.
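The filtering performed by the first selection unit 41 amounts to comparing lengths. In the sketch below, each stored note string is represented only by its note durations in sixteenth notes; this representation, the sample repertoire, and the name `select_candidates` are all hypothetical.

```python
def select_candidates(note_strings, Z):
    """Keep every stored note string with as many notes as Z has units."""
    return [m for m in note_strings if len(m) == len(Z)]


# Hypothetical stored note strings M[1]..M[4], as durations in sixteenth notes.
stored = [
    [1, 1, 1, 1, 1],
    [2, 2, 2, 2, 2],
    [4, 4, 8],
    [2, 2, 2, 2, 4, 4],
]
Z = ["キョー", "ワ", "ハ", "レ", "デス"]  # five note division units
candidates = select_candidates(stored, Z)
print(candidates)
```

Only the two five-note strings survive, so each remaining candidate pairs its notes one-to-one with z[1] to z[5].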

The note division character string Z may contain note division units z[m] made up of different numbers of sound units x[n], and each candidate note string MC may contain notes of different durations. Under these circumstances, if a note of short duration is assigned to a note division unit z[m] made up of many sound units x[n], the result may be an unnatural note division in which many sound units x[n] are crammed into a single note. In view of this, and in order to achieve a natural note division in which each sound unit x[n] corresponds comfortably to a note, the second selection unit 42 selects, as the specific note string M, a candidate note string MC in which notes of long duration correspond to note division units z[m] made up of many sound units x[n].

Specifically, the second selection unit 42 calculates an error index value E for each of the candidate note strings MC selected by the first selection unit 41. The error index value E indicates the degree to which the reference length TZ, which depends on the number of sound units x[n] constituting each note division unit z[m] in the note division character string Z, differs from the duration TM of the note corresponding to that note division unit z[m] in the candidate note string MC. Specifically, as expressed by Equation (4), the error index value E is the sum (or average), over the pairs of mutually corresponding note division units z[m] and notes, of the absolute value of the difference between the reference length TZ of the note division unit z[m] and the duration TM of the note.
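The image of Equation (4) is not reproduced in this text. From the verbal definition above, the summed variant can be reconstructed as follows; the index notation T_Z[m] and T_M[m] for the m-th note division unit and its note is an assumption of this reconstruction.

```latex
E = \sum_{m} \bigl|\, T_Z[m] - T_M[m] \,\bigr|
\tag{4}
```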

The reference length TZ of one note division unit z[m] in the note division character string Z is a value that depends on the number Nx of sound units x[n] constituting that note division unit z[m]. Specifically, the reference length TZ is calculated as the predetermined length T0 multiplied by the number Nx of sound units x[n] (TZ = Nx·T0). The predetermined length T0 is set, statistically or empirically, to a time length that can reasonably be expected as the time length (note value) over which one sound unit x[n] is uttered. In rap music, for example, one sound unit (mora) x[n] tends to be uttered over the length of a sixteenth note, so the predetermined length T0 is set to the length of a sixteenth note.

For example, the note division unit z[1] “キョー” in the note division character string Z illustrated in FIG. 4 consists of two sound units, x[1] “キョ” and x[2] “ー”, so its reference length TZ is set to twice the predetermined length T0 (2T0). The reference length TZ of the note division unit z[5] “デス” is likewise set to 2T0. By contrast, the reference length TZ of each note division unit z[m] consisting of one sound unit x[n] (z[2], z[3], z[4]) is set to T0. As understood from the above description, the reference length TZ corresponds to a time length suitable for uttering one note division unit z[m]. The duration TM, for its part, is the time length of a note expressed in units of the predetermined length T0. That is, the duration TM of a sixteenth note is equal to the predetermined length T0 (TM = T0), and the duration TM of an eighth note is twice the predetermined length (TM = 2T0).

As understood from the above description, the error index value E is an index of the degree of divergence between the utterance length expected for each note division unit z[m] in the note division character string Z (the reference length TZ) and the duration TM of the note corresponding to that note division unit z[m] in each candidate note string MC. That is, the error index value E is large for a candidate note string MC in which a note of short duration TM corresponds to a note division unit z[m] made up of many sound units x[n], or in which a note of long duration TM corresponds to a note division unit z[m] made up of few sound units x[n]. In view of this, the second selection unit 42 of the first embodiment selects, as the specific note string M, the one candidate note string MC whose error index value E is smallest among the candidate note strings MC.

For example, as illustrated in FIG. 4, suppose that for a note division character string Z consisting of five note division units z[1] to z[5], the first selection unit 41 has selected a candidate note string MC1 consisting of five sixteenth notes and a candidate note string MC2 consisting of five eighth notes. Assuming, as in the example above, that the predetermined length T0 is a sixteenth note, the error index value E1 of the candidate note string MC1 and the error index value E2 of the candidate note string MC2 are as follows.
E1 = |2−1| + |1−1| + |1−1| + |1−1| + |2−1| = 2
E2 = |2−2| + |1−2| + |1−2| + |1−2| + |2−2| = 3
In these calculations, the predetermined length T0 common to the reference length TZ and the duration TM is omitted. The value “2” in the expressions for E1 and E2 therefore means the time length 2T0 of two sixteenth notes (one eighth note), and the value “1” means the time length T0 of one sixteenth note. In this example the error index value E1 is smaller than the error index value E2, so the second selection unit 42 selects the candidate note string MC1 as the specific note string M. The above is the specific configuration and operation of the note string acquisition unit 40.
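The worked example above can be checked with a short script. This sketch assumes that each note division unit is kept as a list of its sound units (so that the mora count Nx is simply the list length) and that durations are expressed in multiples of T0; the function names are hypothetical.

```python
T0 = 1  # predetermined length: one sixteenth note, used as the time unit


def reference_lengths(Z_units):
    """TZ = Nx * T0 for each note division unit (Nx = number of sound units)."""
    return [len(unit) * T0 for unit in Z_units]


def error_index(Z_units, durations):
    """Sum of |TZ - TM| over corresponding note division units and notes."""
    return sum(abs(tz - tm)
               for tz, tm in zip(reference_lengths(Z_units), durations))


def select_note_string(Z_units, candidates):
    """Candidate note string with the smallest error index value E."""
    return min(candidates, key=lambda c: error_index(Z_units, c))


# z[1]..z[5], each kept as its sound units (x[4], x[5] assumed).
Z_units = [["キョ", "ー"], ["ワ"], ["ハ"], ["レ"], ["デ", "ス"]]
MC1 = [1, 1, 1, 1, 1]  # five sixteenth notes
MC2 = [2, 2, 2, 2, 2]  # five eighth notes
E1 = error_index(Z_units, MC1)
E2 = error_index(Z_units, MC2)
print(E1, E2)  # 2 3
print(select_note_string(Z_units, [MC1, MC2]) == MC1)  # True
```

Keeping the sound units separate matters here: counting the characters of the string “キョー” would give 3 rather than the 2 moras the reference length is based on.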

As a result of the above processing by the note string setting unit 24 (the analysis processing unit 30 and the note string acquisition unit 40), a specific note string M is identified whose notes correspond one-to-one to the note division units z[m] of the note division character string Z. The speech synthesis unit 26 in FIG. 1 generates a speech signal V of a voice (singing voice) in which each note division unit z[m] of the note division character string Z is uttered at the pitch and duration of the corresponding note in the specific note string M.

As described above, in the first embodiment the connection information string Y, which specifies whether each sound unit x[n] in the designated character string X is connected, is generated by applying the designated character string X to a probability model generated by learning processing so as to reflect the note division tendency across the learning data L. Compared with the techniques of Patent Document 1 and Non-Patent Document 1, in which each syllable of the lyrics is assigned one-to-one to a note of the note string, a variety of note divisions are therefore achieved in which the number of sound units x[n] assigned to each note is variable. Moreover, because a probability model generated by learning processing using learning data L created from existing songs is used to generate the connection information string Y, there is the further advantage that a natural note division reflecting the note division tendency of existing songs can be achieved. By using learning data L from the songs of a particular singer in the learning processing, it is also possible to achieve a note division that reflects the tendency specific to that singer.

In the first embodiment, moreover, the connection information string Y is generated by applying the probability model to the designated character string X as a whole, so whether each sound unit x[n] is connected is controlled in diverse ways according to the entire designated character string X. For example, when the designated character string X contains the word “キョー” (kyō, “today”), “キョ” and “ー” may be assigned separately to different notes, or “キョー” may be assigned as a whole to one note, depending on the sound units x[n] that precede and follow it in the designated character string X. A more flexible and diverse note division is therefore possible than with, for example, a configuration in which the relationship between the number of syllables and the number of notes is determined in advance for each word of the lyrics.

In the first embodiment, furthermore, the candidate note string MC whose error index value E is smallest among the candidate note strings MC is selected as the specific note string M, which has the advantage of achieving a natural note division in which each sound unit x[n] of the designated character string X corresponds comfortably to a note.

＜Second Embodiment＞
A second embodiment of the present invention is described below. For elements in each of the modes exemplified below whose operation and function are equivalent to those of the first embodiment, the reference signs used in the description of the first embodiment are reused and detailed descriptions are omitted as appropriate.

In addition to dividing the character string X0 into a plurality of sound units x[n] to generate the designated character string X, the character string acquisition unit 22 of the second embodiment determines the part of speech of the word to which each sound unit x[n] belongs in the character string X0 and whether each sound unit x[n] is devoiced. Any known technique may be used to determine the part of speech and the presence or absence of devoicing of a sound unit x[n].

In consideration of the tendency for a sound unit x[n] of a particular part of speech (for example, a noun) or a devoiced sound unit x[n] to be connected to the immediately preceding syllable and uttered within one note, the features f that define the probability model of the second embodiment include, in addition to the same four features f1 to f4 as in the first embodiment, the two features f (f5 and f6) exemplified below.
Feature f5: fires when the part of speech of the word containing the sound unit x[n] is a noun and the connection information y[n] is 0.
Feature f6: fires when the sound unit x[n] is devoiced and the connection information y[n] is 0.

The second embodiment achieves the same effects as the first embodiment. In addition, because a probability model reflecting a note division tendency that takes into account the part of speech and devoicing of the sound units x[n] is applied to generate the connection information string Y, the second embodiment has the advantage over the first of achieving diverse note divisions that more faithfully reflect actual note division tendencies. Although the second embodiment exemplifies a probability model defined by a plurality of features f including the features f1 to f6, a probability model defined by a plurality of features f including at least one of the features f1 to f6 may also be applied to generate the connection information string Y.

<Third Embodiment>
The speech synthesizer 100 of the third embodiment synthesizes the singing sound of rap music, as in the first embodiment. FIG. 7 is a score that represents the singing sound of a specific piece of rap music (lyrics 「キョーワハレデス」, "Kyō wa hare desu") in a notation based on the following conditions, which were determined in consideration of the tendencies of a large body of rap music.
Condition 1: The basic unit of the note value (duration) of each note is a sixteenth note. However, each note of a triplet whose note value is an eighth note or longer may also serve as the basic unit of the note value.
Condition 2: The scale that defines the notes consists of five pitches in total: a predetermined root (basic pitch) plus two steps above it and two steps below it. For example, as illustrated in FIG. 7, a minor pentatonic scale is adopted in which the pitch differences from the root, in semitones, are "−5", "−2", "0" (the root itself), "+3", and "+5".
Condition 3: One note may contain one or more morae.
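Conditions 1 and 2 can be sketched numerically as follows. The semitone offsets are those listed in Condition 2. Expressing pitches as MIDI note numbers, measuring durations in quarter-note beats, and the helper names `scale_pitches` and `quantize` are assumptions made only for illustration, and the triplet exception of Condition 1 is not modeled.

```python
from typing import List

# Semitone offsets from the root given in Condition 2 (minor pentatonic).
SCALE_OFFSETS = (-5, -2, 0, 3, 5)

# Condition 1: the basic note value is a sixteenth note.
# Assumption: durations are measured in quarter-note beats.
SIXTEENTH = 0.25


def scale_pitches(root: int) -> List[int]:
    """Return the five scale pitches around `root` (assumed MIDI note number)."""
    return [root + offset for offset in SCALE_OFFSETS]


def quantize(beats: float) -> float:
    """Snap a duration to a whole number of sixteenth notes (at least one)."""
    steps = max(1, round(beats / SIXTEENTH))
    return steps * SIXTEENTH


print(scale_pitches(60))
print(quantize(0.6))
```

With a root of 60 the scale spans the notes 55 to 65, which keeps every note within five semitones of the root.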

In FIG. 7, glissando and accent (stress) in the singing sound are depicted schematically using the symbols shown in the legend. Glissando is a singing technique in which the sung pitch rises or falls starting from the target pitch. In the example of FIG. 7, a glissando extends over two sound units, x[1] 「キョ」 and x[2] 「ー」, and an accent is placed on the sound unit x[1] 「キョ」. As can be seen from FIG. 7, a sound unit x[n] sung with a glissando or given an accent tends to be connected to the immediately following syllable and uttered within a single note. In FIG. 7, the symbol "'" is added to the devoiced sound unit x[n] 「ス」.

In consideration of the above tendency, the plurality of features f that define the probability model of the third embodiment include, in addition to the same four features f1 to f4 as in the first embodiment, the following two features f (f7, f8).
Feature f7: fires when a glissando is given to the immediately preceding sound unit x[n-1] (more precisely, when x[n-1] is the first of the plurality of sound units assigned to a note to which a glissando is given) and the connection information y[n] is 0.
Feature f8: fires when the immediately preceding sound unit x[n-1] is accented and the connection information y[n] is 0.

The feature f7 may also be split into the following features f7a and f7b according to the direction in which the glissando changes the pitch.
Feature f7a: fires when a rising glissando is given to the immediately preceding sound unit x[n-1] (more precisely, when x[n-1] is the first of the plurality of sound units assigned to a note to which a rising glissando is given) and the connection information y[n] is 0.
Feature f7b: fires when a falling glissando is given to the immediately preceding sound unit x[n-1] (more precisely, when x[n-1] is the first of the plurality of sound units assigned to a note to which a falling glissando is given) and the connection information y[n] is 0.
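The glissando and accent features can be sketched as binary functions of the preceding sound unit and the current connection information. The `SoundUnit` record, the `glissando` field holding "up", "down", or `None`, and the convention that the field is set only on the first unit of a glissando note are hypothetical choices made for this sketch. The glissando direction used in the example is likewise illustrative and is not taken from FIG. 7.

```python
from dataclasses import dataclass
from typing import Optional

CONNECT = 0  # assumption: y[n] == 0 designates connection


@dataclass
class SoundUnit:
    """Hypothetical record for one sound unit x[n]."""
    text: str
    accent: bool = False
    # "up", "down", or None. Set only on the first unit of a glissando note.
    glissando: Optional[str] = None


def f7(prev: SoundUnit, y: int) -> int:
    """Fires when x[n-1] leads a glissando note and y[n] is 0."""
    return int(prev.glissando is not None and y == CONNECT)


def f7a(prev: SoundUnit, y: int) -> int:
    """Rising-glissando variant of f7."""
    return int(prev.glissando == "up" and y == CONNECT)


def f7b(prev: SoundUnit, y: int) -> int:
    """Falling-glissando variant of f7."""
    return int(prev.glissando == "down" and y == CONNECT)


def f8(prev: SoundUnit, y: int) -> int:
    """Fires when x[n-1] is accented and y[n] is 0."""
    return int(prev.accent and y == CONNECT)


# Illustrative preceding unit: accented, leading a falling glissando.
kyo = SoundUnit("キョ", accent=True, glissando="down")
print(f7(kyo, 0), f7a(kyo, 0), f7b(kyo, 0), f8(kyo, 0))
```

Because f7a and f7b partition the cases in which f7 fires, a model would normally use either f7 alone or the pair f7a and f7b.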

The character string acquisition unit 22 of the third embodiment divides the character string X0 into a plurality of sound units x[n] to generate the designated character string X, and also determines, for each sound unit x[n], whether it is accented and whether a glissando is given to it. Any known technique (natural language processing such as morphological analysis) may be used to determine the presence or absence of an accent. The accent referred to by the feature f8 covers both stress accent, which concerns the loudness of the voice, and pitch accent, which concerns its pitch; either can be identified by natural language processing such as morphological analysis. In a large body of rap music there is a rough tendency that, when the sound unit (mora) x[n] immediately following an accent is a long vowel or a devoiced sound, x[n] is uttered with a glissando from the immediately preceding sound unit x[n-1]. In consideration of this tendency, the character string acquisition unit 22 of the third embodiment estimates that a glissando is given to a sound unit x[n] in the character string X0 when x[n] itself is a long vowel or a devoiced sound and the immediately preceding sound unit x[n-1] is accented. The analysis processing unit 30 generates the connection information string Y by applying the designated character string X, in which the presence or absence of accent and glissando has been determined for each sound unit x[n] by the above method, to the probability model of Equation (3) defined by a plurality of features f including the features f7 (f7a, f7b) and f8 described above.
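The glissando estimation rule described above can be sketched as a single pass over the sound units. The `SoundUnit` record and its field names are hypothetical, and the accent, long-vowel, and devoicing flags in the example are assigned by hand rather than produced by morphological analysis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SoundUnit:
    """Hypothetical record for one sound unit x[n]."""
    text: str
    accent: bool = False
    long_vowel: bool = False
    devoiced: bool = False


def estimate_glissando(units: List[SoundUnit]) -> List[bool]:
    """Mark x[n] when it is a long vowel or devoiced and x[n-1] is accented."""
    flags = [False] * len(units)
    for n in range(1, len(units)):
        current, previous = units[n], units[n - 1]
        flags[n] = (current.long_vowel or current.devoiced) and previous.accent
    return flags


# Hand-annotated example lyric (flags are illustrative assumptions).
lyric = [
    SoundUnit("キョ", accent=True),
    SoundUnit("ー", long_vowel=True),
    SoundUnit("ワ"),
    SoundUnit("ハ"),
    SoundUnit("レ"),
    SoundUnit("デ"),
    SoundUnit("ス", devoiced=True),
]
print(estimate_glissando(lyric))
```

In this example only the long vowel following the accented first unit is marked; the devoiced final unit is not, because the unit before it carries no accent.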

The third embodiment achieves the same effects as the first embodiment. In addition, because a probability model reflecting note-division tendencies that take into account the presence or absence of accent and glissando in each sound unit x[n] is applied to the generation of the connection information string Y, the third embodiment has the advantage over the first embodiment that it realizes diverse note divisions that faithfully reflect actual note-division tendencies (in particular those of rap music). Although the third embodiment illustrates a probability model defined by a plurality of features f including the features f7 and f8, a probability model defined by a plurality of features f including only one of the features f7 and f8 may also be applied to the generation of the connection information string Y. The features f5 and f6 illustrated in the second embodiment and the features f7 and f8 illustrated in the third embodiment may also be applied together. As understood from the above description, a preferred probability model of the present invention is defined by a plurality of features f including at least one of the features f1 to f8 illustrated above.

<Fourth Embodiment>
In the first embodiment, a note string M[k] composed of the same number of notes as the note-division units z[m] in the note-division character string Z generated from the designated character string X is selected as a candidate note string MC. However, the user may designate a designated character string X so long that the number of note-division units z[m] in the note-division character string Z exceeds the maximum number of notes among the note strings M[1] to M[K].

In view of the above, the character string acquisition unit 22 of the fourth embodiment generates a plurality of designated character strings X by dividing the character string X0 designated by the user. Specifically, each designated character string X is delimited so that the total number of sound units x[n] constituting it falls within a predetermined range. This range is chosen so that the total number of note-division units z[m] in the note-division character string Z generated from each designated character string X does not exceed the maximum number of notes among the note strings M[k] in the storage device 14. Then, for each of the plurality of designated character strings X generated by the character string acquisition unit 22, the acquisition of the specific note string M by the note string setting unit 24 (generation of the connection information string Y by the analysis processing unit 30 and generation of the specific note string M by the note string acquisition unit 40) and the synthesis of the voice signal V by the speech synthesis unit 26 are executed as in the first embodiment.

The character string acquisition unit 22 may divide the character string X0 by any method; one suitable method is to divide X0 into a plurality of designated character strings X at linguistic boundaries detected by natural language processing. For example, X0 is divided in units of phrases such as accent phrases (units each containing one accent) or bunsetsu (文節). It is also possible to divide X0 at boundaries selected, from among the plurality of boundaries identified in X0 by natural language processing, so that the number of sound units x[n] constituting each resulting designated character string X approximates a predetermined reference value. The reference value for the number of sound units x[n] is set, for example, to a value that depends on the number of notes in each note string M[k] (for example, the average or the maximum of the numbers of notes over the K note strings M[1] to M[K]). A configuration in which the user designates the boundaries of each designated character string X by operating the input device 16, so that the number of sound units x[n] in each designated character string X falls within a predetermined range, is also suitable.
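One possible way to divide X0 at linguistic boundaries while bounding the number of sound units per designated character string is a greedy merge of adjacent phrases, sketched below. The greedy strategy, the function name `split_into_designated_strings`, and the phrase segmentation used in the example are assumptions for illustration; the description leaves the division method open. Because connecting sound units can only reduce the count, bounding the number of sound units by the maximum number of notes also bounds the number of note-division units.

```python
from typing import List, Sequence


def split_into_designated_strings(
    phrases: Sequence[Sequence[str]], max_units: int
) -> List[List[str]]:
    """Greedily merge adjacent phrases so each result has at most `max_units`
    sound units. A single phrase longer than `max_units` is kept whole."""
    segments: List[List[str]] = []
    current: List[str] = []
    for phrase in phrases:
        # Close the current segment if adding this phrase would exceed the bound.
        if current and len(current) + len(phrase) > max_units:
            segments.append(current)
            current = []
        current = current + list(phrase)
    if current:
        segments.append(current)
    return segments


# Hypothetical phrase boundaries, as if detected by natural language processing.
phrases = [
    ["キョ", "ー", "ワ"],
    ["ハ", "レ", "デ", "ス"],
    ["ア", "シ", "タ", "モ"],
]
for segment in split_into_designated_strings(phrases, max_units=7):
    print(len(segment), "".join(segment))
```

Here `max_units` plays the role of the upper end of the predetermined range, for example the maximum number of notes over the stored note strings.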

The fourth embodiment achieves the same effects as the first embodiment. In addition, because the processing by the note string setting unit 24 and the speech synthesis unit 26 is executed for each of the plurality of designated character strings X obtained by dividing the character string X0, note division and speech synthesis can be carried out appropriately even when the user designates a long character string X0. The configuration of the second embodiment (features f5 and f6) and the configuration of the third embodiment (features f7 and f8) may also be adopted in the fourth embodiment.

<Modifications>
Each of the above embodiments may be modified in various ways. Specific modifications are illustrated below. Two or more modes freely selected from the following examples may be combined as appropriate.

(1) In each of the above embodiments, the specific note string M is selected by selecting a plurality of candidate note strings MC composed of the same number of notes as the note-division units z[m] of the note-division character string Z (first selection unit 41) and by calculating and comparing the error index value E of each candidate note string MC. However, the method by which the note string acquisition unit 40 selects the specific note string M may be changed as appropriate. For example, a configuration may be adopted in which one candidate note string MC is selected, for example at random, as the specific note string M from among the plurality of candidate note strings MC composed of the same number of notes as the note-division units z[m] of the note-division character string Z. When the number of notes differs for every note string M[k] stored in the storage device 14, the single note string M[k] composed of the same number of notes as the number of note-division units z[m] of the note-division character string Z is selected as the specific note string M.
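The random-selection variant of modification (1) can be sketched as a filter on the number of notes followed by a random choice. Representing a note as a (pitch, duration) tuple, the function name `pick_specific_note_string`, and the returned `None` when no stored note string matches are assumptions made for this sketch.

```python
import random
from typing import List, Optional, Sequence, Tuple

Note = Tuple[int, float]  # assumed representation: (pitch, duration in beats)


def pick_specific_note_string(
    note_strings: Sequence[List[Note]],
    num_division_units: int,
    rng: Optional[random.Random] = None,
) -> Optional[List[Note]]:
    """Pick at random one stored note string whose number of notes equals
    the number of note-division units; return None if there is none."""
    candidates = [m for m in note_strings if len(m) == num_division_units]
    if not candidates:
        return None
    chooser = rng if rng is not None else random
    return chooser.choice(candidates)


# Hypothetical stored note strings M[1] to M[3].
library = [
    [(60, 0.5), (63, 0.5)],
    [(60, 0.25), (58, 0.25), (55, 0.5)],
    [(65, 0.5), (63, 0.25), (60, 0.25)],
]
print(pick_specific_note_string(library, 3, random.Random(0)))
```

Passing a seeded `random.Random` makes the choice reproducible, which is convenient when comparing this variant with selection by error index value.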

In each of the above embodiments, the note string acquisition unit 40 selects one of the K note strings M[1] to M[K] stored in advance in the storage device 14 as the specific note string M. However, the method by which the note string acquisition unit 40 acquires the specific note string M is not limited to this example (selection from note strings M[k] prepared in advance). Specifically, the note string acquisition unit 40 may generate the specific note string M (automatic composition) by sequentially assigning notes to the note-division units z[m] of the note-division character string Z according to a predetermined rule. For example, a configuration may be adopted in which a time series of notes whose pitches and durations are set according to the intonation of each note-division unit z[m] is generated as the specific note string M. As understood from the above description, the note string acquisition unit 40 is broadly understood as an element that acquires (for example, selects or generates) a note string whose number of notes corresponds to the number of note-division units z[m] of the note-division character string Z.

(2) The sound units x[n] constituting the designated character string X are not limited to the morae illustrated in the above embodiments. For example, when the character string X0 is designated in English, a configuration that generates the designated character string X with each syllable of X0 as a sound unit x[n] is suitable.

(3) Each of the above embodiments illustrates a configuration in which the user designates the character string X0, but the character string acquisition unit 22 may acquire the character string X0 by any method. For example, a configuration in which the character string acquisition unit 22 processes a character string X0 received from an external device via a communication network, or a character string X0 stored in the storage device 14 or another recording medium, may also be adopted. A configuration in which the user directly designates the designated character string X (each sound unit x[n]) may also be adopted. In a configuration in which the designated character string X is designated directly, the analysis of the character string X0 by the character string acquisition unit 22 is omitted.

(4) In each of the above embodiments, each piece of connection information y[n] in the connection information string Y designates whether the sound unit x[n] is connected to the immediately preceding sound unit x[n-1]. However, each piece of connection information y[n] may instead designate whether the sound unit x[n] is connected to the immediately following sound unit x[n+1]. In other words, the connection information y[n] is broadly understood as information designating the presence or absence of connection between adjacent sound units x[n] in the designated character string X or the learning character string XL.
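The two conventions in modification (4) can be compared with a small sketch that applies a connection information string to the sound units to form note-division units. It assumes that y[n] = 0 designates connection and that any other value designates no connection. The function name `divide` and the `link_to` parameter are hypothetical, and the connection values in the example are chosen by hand rather than produced by a probability model.

```python
from typing import List, Sequence


def divide(units: Sequence[str], y: Sequence[int], link_to: str = "previous") -> List[str]:
    """Group sound units into note-division units.

    link_to="previous": y[n] == 0 joins x[n] to x[n-1].
    link_to="next":     y[n] == 0 joins x[n] to x[n+1].
    """
    flags = list(y)
    if link_to == "next":
        # A link from x[n] to x[n+1] equals a link from x[n+1] to x[n],
        # so shift the flags by one; the first unit has no predecessor.
        flags = [1] + flags[:-1]
    groups: List[str] = []
    for unit, flag in zip(units, flags):
        if flag == 0 and groups:
            groups[-1] += unit   # connect to the preceding unit
        else:
            groups.append(unit)  # start a new note-division unit
    return groups


units = ["キョ", "ー", "ワ", "ハ", "レ", "デ", "ス"]
print(divide(units, [1, 0, 1, 1, 1, 1, 0], link_to="previous"))
print(divide(units, [0, 1, 1, 1, 1, 0, 1], link_to="next"))
```

Both calls describe the same grouping, which illustrates that the choice of convention changes only how the connection information is indexed, not the resulting note-division units.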

(5) Each of the above embodiments illustrates a probability model in the form of a conditional random field, but the form of the probability model may be changed as appropriate. For example, a known probability model such as a hidden Markov model (HMM) may also be used to generate the connection information string Y.

(6) Each of the above embodiments illustrates the speech synthesizer 100 including the speech synthesis unit 26, but the present invention may also be realized as a note string setting device (automatic composition device) that sets a note string suited to the designated character string X. That is, the speech synthesis unit 26 illustrated in the above embodiments may be omitted. The present invention may also be realized as a character string analysis device that generates the connection information string Y by analyzing the designated character string X, or as a character string analysis device that generates, by analyzing the designated character string X, a note-division character string Z suited to note division for a note string. The character string analysis device of the present invention consists of the analysis processing unit 30 of the above embodiments, and the note string acquisition unit 40 is omitted.

DESCRIPTION OF SYMBOLS: 100 ... speech synthesizer, 12 ... arithmetic processing unit, 14 ... storage device, 16 ... input device, 18 ... sound emitting device, 22 ... character string acquisition unit, 24 ... note string setting unit, 26 ... speech synthesis unit, 30 ... analysis processing unit, 40 ... note string acquisition unit, 41 ... first selection unit, 42 ... second selection unit, Q ... probability model information, X ... designated character string, x[n] ... sound unit, Y ... connection information string, y[n] ... connection information, Z ... note-division character string, z[m] ... note-division unit, M[k] (M[1] to M[K]) ... note string, M ... specific note string, V ... voice signal.

Claims

A note string setting device that sets a note string corresponding to a designated character string in which a plurality of sound units are arranged in time series, the device comprising:
analysis processing means for generating a connection information string, in which connection information for each sound unit in the designated character string is arranged in time series, by applying to the designated character string a probability model generated by machine learning using a plurality of pieces of learning data, each of which includes a learning character string in which a plurality of sound units are arranged in time series and a learning connection information string in which connection information designating, for each sound unit, the presence or absence of connection between adjacent sound units in the learning character string is arranged in time series; and
note string acquisition means for acquiring a note string in which a plurality of notes, corresponding to the respective note-division units obtained by applying the presence or absence of connection designated by the connection information string to the sound units of the designated character string, are arranged in time series.

The note string setting device according to claim 1, further comprising storage means for storing a plurality of note strings in each of which a plurality of notes are arranged in time series,
wherein the note string acquisition means selects, from the plurality of note strings, a note string whose number of notes corresponds to the number of note-division units obtained by applying the presence or absence of connection designated by the connection information string to the sound units of the designated character string.

The note string setting device according to claim 2, wherein the note string acquisition means includes:
first selection means for selecting, from the plurality of note strings, a plurality of candidate note strings whose number of notes corresponds to the number of note-division units; and
second selection means for calculating, for each of the plurality of candidate note strings, an error index value that depends on the difference between a reference length corresponding to the number of sound units constituting each note-division unit and the duration of the note corresponding to that note-division unit in the candidate note string, and for selecting one candidate note string according to the error index values of the candidate note strings.

The note string setting device according to any one of claims 1 to 3, wherein the probability model is a conditional random field probability model defined by a plurality of features, and
the plurality of features include at least one of:
a feature that fires when the sound unit is a vowel and the connection information designates connection;
a feature that fires when the sound unit is a moraic nasal (撥音) and the connection information designates connection;
a feature that fires when the sound unit is a long vowel (長音) and the connection information designates connection;
a feature that fires when the sound unit is a geminate consonant (促音) and the connection information designates connection;
a feature that fires when the sound unit constitutes a specific part of speech and the connection information designates connection; and
a feature that fires when the sound unit is a devoiced sound and the connection information designates connection.

The note string setting device according to any one of claims 1 to 4, further comprising character string acquisition means for generating a plurality of designated character strings by dividing a character string to be processed,
wherein, for each of the plurality of designated character strings, the generation of the connection information string by the analysis processing means and the acquisition of the note string by the note string acquisition means are executed.