WO2022074754A1 - Information processing method, information processing system, and program - Google Patents

Publication number
WO2022074754A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
time
user
editing
instruction
Prior art date
Application number
PCT/JP2020/037966
Other languages
French (fr)
Japanese (ja)
Inventor
竜之介 大道
慶二郎 才野
正宏 清水
Original Assignee
ヤマハ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ヤマハ株式会社
Priority to JP2022555020A (published as JPWO2022074754A1, ja)
Priority to CN202080105738.8A (published as CN116324965A, en)
Priority to PCT/JP2020/037966 (published as WO2022074754A1, en)
Publication of WO2022074754A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10G REPRESENTATION OF MUSIC; RECORDING MUSIC IN NOTATION FORM; ACCESSORIES FOR MUSIC OR MUSICAL INSTRUMENTS NOT OTHERWISE PROVIDED FOR, e.g. SUPPORTS
    • G10G1/00 Means for the representation of music
    • G10G1/04 Transposing; Transcribing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser

Definitions

  • This disclosure relates to the processing of time series data.
  • Patent Document 1 discloses a technique for synthesizing a singing voice that pronounces a note sequence instructed by a user on an editing screen.
  • the editing screen is a piano-roll screen on which a time axis and a pitch axis are set.
  • the user specifies a phoneme (pronounced character), a pitch, and a pronunciation period for each note constituting a musical piece.
  • the information processing method uses the first time-series data representing the time-series of the feature amount of the sound in which the symbol string is pronounced in the first pronunciation style.
  • the information processing system edits the first time-series data representing the time-series of the feature amount of the sound that pronounces the symbol string in the first pronunciation style according to the first instruction from the user. Then, the second time series data representing the time series of the feature amount of the sound that pronounced the symbol string in the second pronunciation style different from the first pronunciation style is edited according to the second instruction from the user.
  • the editing processing unit edits the first time-series data
  • the first history data corresponding to the edited first time-series data is saved as new version data
  • the second time-series data is edited.
  • each time it is provided with an information management unit that saves the second history data corresponding to the edited second time-series data as new version data, and the information management unit has the saved different versions.
  • the information processing system edits the first time-series data representing the time-series of the feature amount of the sound that pronounces the symbol string in the first pronunciation style according to the first instruction from the user. Then, the second time series data representing the time series of the feature amount of the sound that pronounced the symbol string in the second pronunciation style different from the first pronunciation style is edited according to the second instruction from the user. For each editing of the editing processing unit and the first time-series data, the first history data corresponding to the edited first time-series data is saved as new version data, and the second time-series data is stored.
  • a program that causes the computer system to function as an information management unit that saves the second history data corresponding to the edited second time-series data as new version data for each edit, and the information management unit is a program.
  • the second history data the second time-series data corresponding to the second history data according to the instruction from the user is acquired.
  • FIG. 1 is a block diagram illustrating the configuration of the information processing system 100 according to the first embodiment of the present disclosure.
  • the information processing system 100 is an acoustic processing system that generates an acoustic signal Z.
  • the acoustic signal Z is a signal in the time domain representing the waveform of the synthetic sound.
  • the synthetic sound is, for example, a musical instrument sound produced by a virtual performer playing a musical instrument, or a singing sound produced by, for example, a virtual singer singing a song.
  • the information processing system 100 is realized by a computer system including a control device 11, a storage device 12, a sound emitting device 13, a display device 14, and an operating device 15.
  • the information processing system 100 is realized by, for example, an information device such as a smartphone, a tablet terminal, or a personal computer.
  • the information processing system 100 is realized not only by a single device but also by a plurality of devices (for example, a client-server system) configured as separate bodies from each other.
  • the control device 11 is a single processor or a plurality of processors that control each element of the information processing system 100. Specifically, the control device 11 is configured by one or more types of processors such as a CPU (Central Processing Unit), an SPU (Sound Processing Unit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit).
  • the control device 11 executes various processes for generating the acoustic signal Z.
  • the storage device 12 is a single or a plurality of memories for storing a program executed by the control device 11 and various data used by the control device 11.
  • the storage device 12 is composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium.
  • the storage device 12 may be composed of a combination of a plurality of types of recording media.
  • a portable recording medium attached to and detached from the information processing system 100, or a recording medium capable of writing and reading via a communication network (for example, cloud storage) may be used as the storage device 12.
  • the sound emitting device 13 reproduces the synthetic sound represented by the acoustic signal Z generated by the control device 11.
  • the sound emitting device 13 is, for example, a speaker or headphones.
  • the D/A converter that converts the acoustic signal Z from digital to analog and the amplifier that amplifies the acoustic signal Z are not shown for convenience. Further, FIG. 1 illustrates a configuration in which the sound emitting device 13 is mounted on the information processing system 100, but a sound emitting device 13 separate from the information processing system 100 may be connected to the information processing system 100 by wire or wirelessly.
  • the display device 14 displays an image under the control of the control device 11.
  • the display device 14 is composed of a display panel such as a liquid crystal panel or an organic EL (ElectroLuminescence) panel.
  • the operation device 15 is an input device that receives instructions from the user.
  • the operation device 15 is, for example, a plurality of controls operated by the user or a touch panel for detecting contact by the user.
  • the user can instruct the condition of the synthesized sound by operating the operation device 15.
  • the display device 14 displays an image (hereinafter referred to as “editing screen”) G referred to by the user for instructing the condition of the synthetic sound.
  • FIG. 2 is a schematic diagram of the edit screen G.
  • the editing screen G includes a plurality of editing areas E (En, Ef, and Ew).
  • a common time axis (horizontal axis) is set in the plurality of editing areas E.
  • the section of the synthetic sound displayed on the edit screen G is changed according to the instruction from the user to the operation device 15.
  • a time series (hereinafter referred to as "note sequence") N of a plurality of notes constituting the score of the synthesized sound is displayed.
  • a coordinate plane defined by a time axis and a pitch axis (vertical axis) is set in the editing area En.
  • An image representing each note constituting the note sequence N is arranged in the editing area En.
  • a pitch (for example, a note number) and a pronunciation period are specified for each note in the note sequence N.
  • the phoneme is specified for each note.
  • performance symbols such as crescendo, forte, and decrescendo are also displayed.
  • the user can give an edit instruction Qn to the edit area En by operating the operation device 15.
  • the edit instruction Qn is an instruction to edit the note string N.
  • the edit instruction Qn is, for example, an instruction to add or delete a note in the note sequence N, an instruction to change the condition (pitch, pronunciation period, or phoneme) of a note, or an instruction to change a performance symbol.
  • a time series (hereinafter referred to as "feature column") F of the feature amount of the synthetic sound is displayed.
  • the feature amount is an acoustic feature amount of the synthetic sound.
  • the feature column F (that is, the temporal transition of the fundamental frequency) is displayed in the editing area Ef with the fundamental frequency (pitch) of the synthesized sound as the feature amount.
  • the user can give an edit instruction Qf to the edit area Ef by operating the operation device 15.
  • the edit instruction Qf is an instruction to edit the feature column F.
  • the editing instruction Qf is, for example, an instruction for changing the time change of the feature amount in the desired section of the feature column F displayed in the editing area Ef.
  • the waveform W of the synthesized sound on the time axis is displayed.
  • the user can give an edit instruction Qw to the edit area Ew by operating the operation device 15.
  • the edit instruction Qw is an instruction to edit the waveform W.
  • the editing instruction Qw is an instruction to change the waveform in the user's desired section of the waveform W displayed in the editing area Ew.
  • the editing screen G includes, in addition to the plurality of editing areas E exemplified above, a plurality of operating areas (Gn, Gf and Gw) corresponding to different editing areas E, and an operating image B1 (playback).
  • the operation image B1 is a software button that can be operated by the user using the operation device 15.
  • the operation image B1 is an operation element for the user to instruct the reproduction of the synthesized sound.
  • the synthetic sound of the waveform W displayed in the editing area Ew is reproduced from the sound emitting device 13.
  • the operation area Gn is an area related to the note string N. Specifically, the note string version number Vn, the operation image Gn1 and the operation image Gn2 are displayed in the operation area Gn.
  • the note string version number Vn is a number representing the version of the note string N displayed in the editing area En.
  • the note string version number Vn is incremented by 1 each time the note string N is edited according to the edit instruction Qn. Further, the user can change the note string version number Vn in the operation area Gn to an arbitrary numerical value by operating the operation device 15.
  • the note string N of the version corresponding to the note string version number Vn changed by the user is displayed in the editing area En.
  • the operation image Gn1 and the operation image Gn2 are software buttons that can be operated by the user using the operation device 15.
  • the operation image Gn1 is an operator for the user to instruct that the note string N be returned to the state before the immediately preceding edit (Undo). That is, when the user operates the operation image Gn1, the note string version number Vn is changed to the immediately preceding numerical value, and the note string N of the version corresponding to the changed note string version number Vn is displayed in the edit area En. Therefore, the operation image Gn1 can also be described as an operator for moving the note string version number Vn back to the immediately preceding numerical value (that is, canceling the immediately preceding edit of the note string N).
  • the operation image Gn2 is an operator for the user to instruct that the edit canceled by the operation on the operation image Gn1 be performed again (Redo).
  • the operation area Gf is an area related to the feature column F. Specifically, the feature column version number Vf, the operation image Gf1, and the operation image Gf2 are displayed in the operation area Gf.
  • the feature column version number Vf is a number representing the version of the feature column F displayed in the editing area Ef.
  • the feature column version number Vf is incremented by 1 each time the feature column F is edited according to the edit instruction Qf. Further, the user can change the feature column version number Vf in the operation area Gf to an arbitrary numerical value by operating the operation device 15.
  • the feature column F of the version corresponding to the feature column version number Vf changed by the user is displayed in the editing area Ef.
  • the operation image Gf1 and the operation image Gf2 are software buttons that can be operated by the user using the operation device 15.
  • the operation image Gf1 is an operator for the user to instruct that the feature column F be returned to the state before the immediately preceding edit (Undo). That is, when the user operates the operation image Gf1, the feature column version number Vf is changed to the immediately preceding numerical value, and the feature column F of the version corresponding to the changed feature column version number Vf is displayed in the edit area Ef. Therefore, the operation image Gf1 can also be described as an operator for moving the feature column version number Vf back to the immediately preceding numerical value (that is, canceling the immediately preceding edit of the feature column F).
  • the operation image Gf2 is an operator for the user to instruct that the edit canceled by the operation on the operation image Gf1 be performed again (Redo).
  • the operation area Gw is an area related to the waveform W. Specifically, the waveform version number Vw, the operation image Gw1 and the operation image Gw2 are displayed in the operation area Gw.
  • the waveform version number Vw is a number representing the version of the waveform W displayed in the editing area Ew.
  • the waveform version number Vw is incremented by 1 each time the waveform W is edited according to the edit instruction Qw. Further, the user can change the waveform version number Vw in the operation area Gw to an arbitrary numerical value by operating the operation device 15.
  • the version of the waveform W corresponding to the waveform version number Vw changed by the user is displayed in the editing area Ew.
  • the operation image Gw1 and the operation image Gw2 are software buttons that can be operated by the user using the operation device 15.
  • the operation image Gw1 is an operator for the user to instruct that the waveform W be returned to the state before the immediately preceding edit (Undo). That is, when the user operates the operation image Gw1, the waveform version number Vw is changed to the immediately preceding value, and the waveform W of the version corresponding to the changed waveform version number Vw is displayed in the editing area Ew. Therefore, the operation image Gw1 can also be described as an operator for moving the waveform version number Vw back to the immediately preceding value (that is, canceling the immediately preceding edit of the waveform W).
  • the operation image Gw2 is an operator for the user to instruct that the edit canceled by the operation on the operation image Gw1 be performed again (Redo).
  • a plurality of version numbers V (Vn, Vf, Vw) are used.
  • An increase in each version number (increment) means the progress of the editing work, and a decrease in each version number (decrement) means a recession in the editing work.
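  • the behavior of one version number V under edit, Undo, and Redo operations can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the class name `VersionCounter`, its methods, the rule that a new edit discards redoable versions, and the clamping of a user-entered value are all assumptions made for this sketch.

```python
class VersionCounter:
    """Tracks one version number V (hypothetical helper, not from the disclosure)."""

    def __init__(self):
        self.current = 0  # version currently shown in the editing area
        self.latest = 0   # highest version that Redo can still reach

    def edit(self):
        # Each edit increments the version number by 1.
        # Assumption: an edit made after Undo discards the redoable versions.
        self.current += 1
        self.latest = self.current
        return self.current

    def undo(self):
        # Move back to the immediately preceding value, if one exists.
        if self.current > 0:
            self.current -= 1
        return self.current

    def redo(self):
        # Re-apply an edit that was canceled by undo().
        if self.current < self.latest:
            self.current += 1
        return self.current

    def jump(self, value):
        # The user may enter an arbitrary version number; clamping it to the
        # saved range [0, latest] is an assumption of this sketch.
        self.current = max(0, min(value, self.latest))
        return self.current
```

  • one such counter would be kept for each of Vn, Vf, and Vw; the editing area then displays the data of the version the counter currently points to.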
  • FIG. 3 is a block diagram illustrating a functional configuration of the information processing system 100.
  • the control device 11 executes a program stored in the storage device 12 to realize a plurality of functions (display control unit 20, editing processing unit 30, and information management unit 40) for editing the synthetic sound conditions and generating the acoustic signal Z.
  • the display control unit 20 causes the display device 14 to display an image under the control of the control device 11.
  • the display control unit 20 causes the display device 14 to display the editing screen G illustrated in FIG. 2.
  • the display control unit 20 updates the edit screen G in response to an instruction (Qn, Qf or Qw) from the user.
  • the editing processing unit 30 in FIG. 3 edits the synthetic sound conditions (note sequence N, feature sequence F, and waveform W) according to an instruction (Qn, Qf, or Qw) from the user.
  • the editing processing unit 30 includes a first editing unit 31, a first generation unit 32, a second editing unit 33, a second generation unit 34, and a third editing unit 35.
  • the first editing unit 31 edits the note string data Dn.
  • the note string data Dn is time-series data representing the note sequence N of the synthesized sound.
  • the first editing unit 31 edits the note string data Dn according to the editing instruction Qn from the user for the editing area En.
  • the display control unit 20 displays the note string N represented by the note string data Dn edited by the first editing unit 31 in the editing area En.
  • the first generation unit 32 generates the feature sequence data Df from the note sequence data Dn edited by the first editing unit 31.
  • the feature sequence data Df is time-series data representing the feature sequence F of the synthesized sound.
  • to generate the feature amount at each time point on the time axis, among the plurality of feature amounts constituting the feature sequence F, the data of the note at that time point and of at least one of the notes before and after it is used. That is, the feature sequence data Df is generated according to the context of the note sequence N represented by the note sequence data Dn.
  • the first generation unit 32 generates the feature column data Df using the first generation model M1.
  • the first generative model M1 is a statistical inference model that inputs the note sequence data Dn and outputs the feature sequence data Df.
  • the first generative model M1 is a trained model that has learned the relationship between the note sequence N and the feature sequence F.
  • the first generative model M1 is composed of, for example, a deep neural network (DNN).
  • an arbitrary form of deep neural network, such as a convolutional neural network (CNN) or a recurrent neural network (RNN), is used as the first generative model M1.
  • additional elements such as long short-term memory (LSTM: Long Short-Term Memory) or Self-Attention may be mounted on the first generation model M1.
  • the first generation model M1 is realized by a combination of a program that causes the control device 11 to execute an operation for generating the feature sequence data Df from the note sequence data Dn, and a plurality of variables (specifically, weighted values and biases) applied to the operation.
  • the plurality of variables defining the first generation model M1 are preset and stored in the storage device 12 by machine learning using the plurality of first training data.
  • Each of the plurality of first training data includes the note sequence data Dn and the feature sequence data Df (correct answer value).
  • in the machine learning, the plurality of variables of the first generation model M1 are iteratively updated so as to reduce the error between the feature sequence data Df that the provisional first generation model M1 outputs for the note sequence data Dn of each first training data and the feature sequence data Df of that first training data.
  • the first generative model M1 therefore outputs statistically valid feature column data Df for unknown note sequence data Dn, under the latent tendency between the note sequence N and the feature sequence F in the plurality of first training data.
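  • the idea of iteratively updating a model's variables (weights and biases) so that the error against the correct answer values decreases can be illustrated with a deliberately tiny stand-in. The sketch below fits a single weight and bias by gradient descent on the mean squared error; it is not a deep neural network and not the disclosed first generation model M1, and the function names, learning rate, and toy training pairs are all assumptions chosen for illustration.

```python
def mse(pairs, w, b):
    """Mean squared error between predictions w*x + b and correct values y."""
    return sum(((w * x + b) - y) ** 2 for x, y in pairs) / len(pairs)


def train(pairs, steps=2000, lr=0.05):
    """Iteratively update (w, b) to reduce the error on the training pairs."""
    w, b = 0.0, 0.0  # provisional model variables
    n = len(pairs)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in pairs:
            err = (w * x + b) - y       # provisional output minus correct value
            grad_w += 2 * err * x / n   # derivative of the MSE w.r.t. w
            grad_b += 2 * err / n       # derivative of the MSE w.r.t. b
        w -= lr * grad_w                # step against the gradient
        b -= lr * grad_b
    return w, b


# Toy training data (invented): inputs x with correct answers y = 2x + 1.
training_pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train(training_pairs)
```

  • a real model of this kind would have many more variables and would map whole note sequences to feature sequences, but the update loop has the same shape: compute the error of the provisional output, then adjust the variables to shrink it.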
  • the second editing unit 33 edits the feature column data Df generated by the first generation unit 32. Specifically, the second editing unit 33 edits the feature column data Df according to the editing instruction Qf from the user for the editing area Ef.
  • the display control unit 20 displays, in the editing area Ef, the feature column F represented by the feature column data Df generated by the first generation unit 32 or the feature column F represented by the feature column data Df edited by the second editing unit 33.
  • the second generation unit 34 generates waveform data Dw from the note sequence data Dn and the feature sequence data Df.
  • the waveform data Dw is time-series data representing the waveform W of the synthesized sound. That is, the waveform data Dw is composed of a time series of a plurality of samples representing the acoustic signal Z.
  • the acoustic signal Z is generated by D / A conversion and amplification for the waveform data Dw.
  • the feature sequence data Df immediately after being generated by the first generation unit 32 (that is, the feature sequence data Df not edited by the second editing unit 33) may be used for generating the waveform data Dw.
  • the second generation unit 34 generates waveform data Dw using the second generation model M2.
  • the second generative model M2 is a statistical inference model that outputs waveform data Dw by inputting a set of note sequence data Dn and feature sequence data Df (hereinafter referred to as “input data Din”).
  • the second generative model M2 is a trained model in which the relationship between the set of the note sequence N and the feature sequence F and the waveform W is learned.
  • the second generative model M2 is composed of, for example, a deep neural network.
  • an arbitrary form of deep neural network such as a convolutional neural network or a recurrent neural network is used as the second generative model M2.
  • additional elements such as long short-term memory (LSTM) or self-attention may be mounted on the second generative model M2.
  • the second generation model M2 is realized by a combination of a program that causes the control device 11 to execute an operation of generating the waveform data Dw from the input data Din including the note string data Dn and the feature sequence data Df, and a plurality of variables (specifically, weighted values and biases) applied to the operation.
  • the plurality of variables defining the second generation model M2 are preset and stored in the storage device 12 by machine learning using the plurality of second training data. Each of the plurality of second training data includes input data Din and waveform data Dw (correct answer value).
  • in the machine learning, the plurality of variables of the second generative model M2 are iteratively updated so as to reduce the error between the waveform data Dw that the provisional second generative model M2 outputs for the input data Din of each second training data and the waveform data Dw of that second training data. Therefore, the second generative model M2 outputs statistically valid waveform data Dw for unknown input data Din, under the latent tendency between the set of the note sequence N and the feature sequence F and the waveform W in the plurality of second training data.
  • the third editing unit 35 edits the waveform data Dw generated by the second generation unit 34. Specifically, the third editing unit 35 edits the waveform data Dw according to the editing instruction Qw from the user for the editing area Ew.
  • the display control unit 20 displays the waveform W represented by the waveform data Dw generated by the second generation unit 34 or the waveform W represented by the waveform data Dw edited by the third editing unit 35 in the editing area Ew. Further, when the operation image B1 (playback) is operated by the user, the acoustic signal Z corresponding to the waveform data Dw generated by the second generation unit 34 or the waveform data Dw edited by the third editing unit 35 is supplied to the sound emitting device 13, whereby the synthesized sound is reproduced.
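  • the three-layer flow (note data Dn, then feature data Df, then waveform data Dw, with a user edit possible at each layer) can be sketched with simple stand-ins for the two generation models. In this sketch, `generate_features` and `generate_waveform` are hypothetical placeholders for M1 and M2 (a fixed pitch per note and a sine oscillator, not trained models), and the 5 ms frame period, the 8 kHz sampling rate, and the note tuple layout are assumptions chosen to keep the example small.

```python
import math

FRAME_SEC = 0.005    # assumed feature frame period (5 ms)
SAMPLE_RATE = 8000   # assumed sampling rate, kept low for illustration


def generate_features(notes):
    """Stand-in for M1: notes (number, start_sec, dur_sec) -> F0 in Hz per frame."""
    total = max(start + dur for _, start, dur in notes)
    f0 = [0.0] * round(total / FRAME_SEC)
    for number, start, dur in notes:
        hz = 440.0 * 2 ** ((number - 69) / 12)  # MIDI note number to frequency
        first = round(start / FRAME_SEC)
        last = round((start + dur) / FRAME_SEC)
        for i in range(first, last):
            f0[i] = hz
    return f0


def generate_waveform(notes, f0):
    """Stand-in for M2: (Dn, Df) -> samples. This toy version uses only Df."""
    per_frame = round(FRAME_SEC * SAMPLE_RATE)
    samples = []
    phase = 0.0
    for hz in f0:
        for _ in range(per_frame):
            phase += 2 * math.pi * hz / SAMPLE_RATE
            samples.append(math.sin(phase) if hz > 0 else 0.0)
    return samples


dn = [(69, 0.0, 0.5), (71, 0.5, 0.5)]   # note data: A4 then B4
df = generate_features(dn)               # generated feature data
# An edit in the style of instruction Qf: raise the pitch by 1% in frames 20-39.
df_edited = [hz * 1.01 if 20 <= i < 40 else hz for i, hz in enumerate(df)]
dw = generate_waveform(dn, df_edited)    # waveform regenerated from edited Df
```

  • editing the feature layer leaves the note data untouched and only requires regenerating the waveform layer beneath it, which mirrors the dependency between the editing units and generation units described above.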
  • the information management unit 40 manages versions of each of the note sequence data Dn, the feature sequence data Df, and the waveform data Dw. Specifically, the information management unit 40 manages the note sequence version number Vn, the feature sequence version number Vf, and the waveform version number Vw.
  • the information management unit 40 stores different versions of data (hereinafter referred to as “history data”) for each of the note sequence data Dn, the feature sequence data Df, and the waveform data Dw in the storage device 12.
  • a history area and a work area are set in the storage device 12.
  • the history area is a storage area in which the history of editing related to the synthetic sound condition is stored.
  • the work area is a storage area in which the note sequence data Dn, the feature sequence data Df, and the waveform data Dw are temporarily stored in the process of editing using the edit screen G.
  • the information management unit 40 saves the edited note sequence data Dn as the first history data Hn [Vn, Vf, Vw] in the history area for each edit of the note sequence N in response to the edit instruction Qn. That is, the new version of the note string data Dn is stored in the storage device 12 as the first history data Hn [Vn, Vf, Vw].
  • the information management unit 40 saves the second history data Hf [Vn, Vf, Vw] corresponding to the edited feature column data Df according to the edit instruction Qf in the history area as new version data.
  • the second history data Hf [Vn, Vf, Vw] of the first embodiment is data showing how the feature column data Df was edited according to the edit instruction Qf (that is, the time series of the edit instruction Qf).
  • the second history data Hf [Vn, Vf, Vw] is also referred to as data representing the difference between the feature column data Df before and after editing.
  • the information management unit 40 saves the third history data Hw [Vn, Vf, Vw] corresponding to the edited waveform data Dw according to the edit instruction Qw in the history area as new version data.
  • the third history data Hw [Vn, Vf, Vw] of the first embodiment is data showing how the waveform data Dw was edited according to the editing instruction Qw (that is, the time series of the editing instruction Qw).
  • the third history data Hw [Vn, Vf, Vw] is also referred to as data representing the difference between the waveform data Dw before and after editing.
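  • storing history as a sequence of edit instructions, rather than as full copies of the data, can be sketched as follows. The dictionary keyed by (Vn, Vf, Vw), the edit tuple `(start, end, offset)`, and the helper names are hypothetical choices for this sketch; the disclosure only states that the history data represents how the data was edited.

```python
def apply_edit(df, edit):
    """Apply one edit instruction: add `offset` to frames in [start, end)."""
    start, end, offset = edit
    return [v + offset if start <= i < end else v for i, v in enumerate(df)]


def save_feature_edit(history, prev_key, new_key, edit):
    """Save the new version as the previous version's edits plus this edit."""
    history[new_key] = history.get(prev_key, []) + [edit]


def restore_features(history, key, generated_df):
    """Rebuild a version by replaying its stored edits on the generated data."""
    df = list(generated_df)
    for edit in history.get(key, []):
        df = apply_edit(df, edit)
    return df


history_f = {}                 # (Vn, Vf, Vw) -> list of edit instructions
generated = [100.0] * 6        # feature data as generated from the notes
save_feature_edit(history_f, (1, 0, 0), (1, 1, 0), (0, 2, 5.0))
save_feature_edit(history_f, (1, 1, 0), (1, 2, 0), (1, 4, -2.0))
version_2 = restore_features(history_f, (1, 2, 0), generated)
```

  • because only the small edit records are stored, any saved version can be rebuilt by regenerating the data from the layer above and replaying the edits recorded under that version's key.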
  • FIGS. 4 to 6 are flowcharts illustrating a specific procedure of the editing process Sa (Sa1, Sa2 and Sa3) for editing the condition of the synthetic sound according to the editing instruction Q (Qn, Qf or Qw) from the user.
  • FIG. 4 is a flowchart of the first editing process Sa1 relating to the editing of the note string N.
  • the first editing process Sa1 is started with the editing instruction Qn for the note string N as a trigger.
  • the first editing unit 31 edits the current note string data Dn according to the editing instruction Qn (Sa101).
  • the information management unit 40 increases the note string version number Vn by "1" (Sa102).
  • when the note string N is newly created rather than edited, the note string data Dn is newly generated (Sa101), and the note string version number Vn is initialized to "0" (Sa102).
  • the information management unit 40 initializes the feature column version number Vf to "0" (Sa103) and initializes the waveform version number Vw to "0" (Sa104).
  • the first generation unit 32 generates the feature sequence data Df by supplying the note sequence data Dn edited by the first editing unit 31 to the first generation model M1 (Sa106).
  • the feature sequence data Df generated by the first generation unit 32 is stored in the work area of the storage device 12.
  • the second generation unit 34 supplies the input data Din including the note sequence data Dn edited by the first editing unit 31 and the feature sequence data Df generated by the first generation unit 32 to the second generation model M2, thereby generating the waveform data Dw (Sa107).
  • the waveform data Dw generated by the second generation unit 34 is stored in the work area of the storage device 12.
  • the note string data Dn requires one data for each note.
  • the feature sequence data Df is composed of one sample every several milliseconds to several tens of milliseconds in order to represent the change in pitch within each note. Since the waveform data Dw represents the waveform of each note, it is composed of one sample for each sampling period (for example, 1/50 kHz = 20 μs).
  • the amount of data of the feature sequence data Df generated from one note sequence data Dn is several hundred to several thousand times the amount of data of the note sequence data Dn, and the amount of data of the waveform data Dw generated from one feature sequence data Df is several hundred to several thousand times the amount of data of the feature column data Df.
  • because the lower-layer data (the feature column data Df and the waveform data Dw) has a large amount of data as described above, only the difference from the upper-layer data is stored as history data. According to this configuration, the amount of data stored in the storage device 12 for the hierarchical data can be significantly reduced compared with a configuration in which the data itself is stored.
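  • the relative data amounts of the three layers can be checked with a short calculation. The 50 kHz sampling rate is the example given above; the note length of 1 second and the feature period of 5 ms are assumed values picked from within the stated ranges, so the resulting ratios are only indicative.

```python
note_sec = 1.0          # assumed duration of one note
frame_sec = 0.005       # assumed feature period (within "several ms")
sample_rate = 50_000    # sampling rate from the example (20 microsecond period)

# One note is a single record in the note data Dn.
features_per_note = round(note_sec / frame_sec)      # feature samples for that note
samples_per_note = round(note_sec * sample_rate)     # waveform samples for that note
samples_per_feature = samples_per_note // features_per_note

print(features_per_note, samples_per_note, samples_per_feature)
```

  • with these assumptions one note expands to 200 feature samples, and each feature sample corresponds to 250 waveform samples, which is why saving full copies of Df and Dw for every version would be costly.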
  • the display control unit 20 updates the edit screen G (Sa108-Sa110). Specifically, the display control unit 20 displays the note string N represented by the note string data Dn edited by the first editing unit 31 in the editing area En (Sa108). Further, the display control unit 20 displays the feature column F represented by the current feature column data Df stored in the work area in the edit area Ef (Sa109). Similarly, the display control unit 20 displays the waveform W represented by the current waveform data Dw stored in the work area in the edit area Ew (Sa110).
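  • the order of operations in the first editing process Sa1 can be sketched as a single function. The `state` dictionary, the callable parameters, and the stand-in models in the usage lines are hypothetical; the step labels in the comments follow the description above, and the history save is placed after the version numbers are set, which is an assumption about where that step falls. The display update (Sa108 to Sa110) is omitted.

```python
def first_editing_process(state, edit_notes, model_m1, model_m2):
    """Sketch of Sa1: edit the notes, reset lower versions, regenerate lower layers."""
    state["Dn"] = edit_notes(state["Dn"])              # Sa101: edit note data
    state["Vn"] += 1                                   # Sa102: advance note version
    state["Vf"] = 0                                    # Sa103: reset feature version
    state["Vw"] = 0                                    # Sa104: reset waveform version
    key = (state["Vn"], state["Vf"], state["Vw"])
    state["history_n"][key] = list(state["Dn"])        # save Hn[Vn, Vf, Vw] (assumed position)
    state["Df"] = model_m1(state["Dn"])                # Sa106: regenerate features
    state["Dw"] = model_m2(state["Dn"], state["Df"])   # Sa107: regenerate waveform
    return state


# Usage with trivial stand-ins for the edit and for the two models.
state = {"Dn": [60], "Df": [], "Dw": [], "Vn": 0, "Vf": 3, "Vw": 5, "history_n": {}}
first_editing_process(
    state,
    edit_notes=lambda dn: dn + [62],                   # edit instruction: add a note
    model_m1=lambda dn: [float(n) for n in dn],        # placeholder for M1
    model_m2=lambda dn, df: [v * 2 for v in df],       # placeholder for M2
)
```

  • the point of the sketch is the cascade: an edit at the top layer invalidates both lower layers, so their version numbers restart from 0 and their data is regenerated.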
  • FIG. 5 is a flowchart of the second editing process Sa2 relating to the editing of the feature column F.
  • the second editing process Sa2 is started with the editing instruction Qf for the feature column F as a trigger.
  • the second editing unit 33 edits the current feature column data Df according to the editing instruction Qf (Sa201).
  • the second generation unit 34 generates waveform data Dw by supplying input data Din including the current note sequence data Dn and the feature sequence data Df edited by the second editing unit 33 to the second generation model M2. (Sa206).
  • the waveform data Dw generated by the second generation unit 34 is stored in the work area of the storage device 12.
  • the display control unit 20 updates the edit screen G (Sa207 and Sa208). Specifically, the display control unit 20 displays the feature column F represented by the feature column data Df edited by the second editing unit 33 in the editing area Ef (Sa207). Further, the display control unit 20 displays the waveform W represented by the current waveform data Dw stored in the work area in the edit area Ew (Sa208). In the second editing process Sa2, the note string N in the editing area En is not updated.
  • FIG. 6 is a flowchart of the third editing process Sa3 relating to the editing of the waveform W.
  • the third editing process Sa3 is started with the editing instruction Qw for the waveform W as a trigger.
  • the third editing unit 35 edits the current waveform data Dw according to the editing instruction Qw (Sa301).
  • the information management unit 40 increases the waveform version number Vw by "1" (Sa302). Further, the information management unit 40 maintains the note sequence version number Vn at the current value Cn (Sa303), and also maintains the feature sequence version number Vf at the current value Cf (Sa304). Then, the information management unit 40 saves the third history data Hw [Vn, Vf, Vw] representing the editing instruction Qw this time in the history area as new version data (Sa305).
  • step Sa303 and step Sa304 may be omitted.
  • the display control unit 20 displays the waveform W represented by the waveform data Dw edited by the third editing unit 35 in the editing area Ew (Sa306).
  • the note string N in the editing area En and the feature string F in the editing area Ef are not updated.
  • FIG. 7 is an explanatory diagram of the data structure in the history area of the storage device 12.
  • a plurality of third history data Hw [Vn, Vf, Vw] corresponding to different versions of the waveform W under the common feature sequence F are stored in the history area.
  • the hierarchical relationship is established in which the note sequence N is located above the feature sequence F and the feature sequence F is located above the waveform W.
  • the feature column version number Vf is increased, and the waveform version number Vw corresponding to the lower layer is initialized to "0", while the note string version number Vn corresponding to the upper layer is maintained.
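  • The way the three version numbers move when one layer is edited can be sketched as follows. The `Version` class and `after_edit` function are hypothetical names; the increment of "1" for Vn and Vf and the reset of Vw to "0" on a note string edit are assumptions made for the example, since the text only states them explicitly for some of the cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    vn: int = 0  # note string version number Vn (upper layer)
    vf: int = 0  # feature column version number Vf (middle layer)
    vw: int = 0  # waveform version number Vw (lower layer)


def after_edit(v: Version, layer: str) -> Version:
    """Return the version numbers after one edit of the given layer.

    The edited layer is increased, upper layers are maintained, and
    lower layers are initialized to 0.
    """
    if layer == "note":
        return Version(v.vn + 1, 0, 0)
    if layer == "feature":
        return Version(v.vn, v.vf + 1, 0)
    if layer == "waveform":
        return Version(v.vn, v.vf, v.vw + 1)
    raise ValueError(f"unknown layer: {layer}")


v = Version()
for layer in ("note", "feature", "waveform", "waveform", "feature"):
    v = after_edit(v, layer)
```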
  • FIGS. 8 to 10 are flowcharts illustrating a specific procedure of the management process Sb (Sb1, Sb2 and Sb3) that manages the version according to the instruction from the user.
  • FIG. 8 is a flowchart of the first management process Sb1 regarding the version of the note string N.
  • the first management process Sb1 is started with the instruction to change the note string version number Vn.
  • the numerical value of the note string version number Vn after the change according to the instruction from the user is referred to as "set value Xn" below.
  • the changed numerical value, that is, the numerical value specified by the user
  • the information management unit 40 changes the note string version number Vn from the current value Cn to the set value Xn (Sb101).
  • the information management unit 40 sets the feature column version number Vf to the latest value Yf corresponding to the set value Xn of the note string N (Sb102).
  • the latest value Yf is the number of the latest version among the plurality of versions of the feature string F generated for each edit instruction Qf under the note string N of the version corresponding to the set value Xn.
  • the information management unit 40 sets the waveform version number Vw to the latest value Yw corresponding to the set value Xn of the note string N (Sb103).
  • the latest value Yw is the number of the latest version among a plurality of versions of the waveform W generated for each edit instruction Qw under the note string N of the version corresponding to the set value Xn.
  • it is data representing the time series of the edit instructions Qf up to the Yf-th among the one or more edit instructions Qf sequentially given by the user under the note string N whose note sequence version number Vn is the set value Xn.
  • it is data representing the time series of the edit instructions Qw up to the Yw-th.
  • the feature column data Df is sequentially edited according to the editing instruction Qf represented by (Sb106).
  • the feature sequence data Df edited according to the edit instruction Qf up to the Yf th is generated under the note sequence N corresponding to the set value Xn.
  • the editing by the second editing unit 33 applies to only a small part of the feature sequence data Df, which spans a plurality of notes. For example, only a very small part of the whole piece of music is edited, such as the attack part of a specific note or the first two notes of the third phrase.
  • the waveform data Dw is generated by supplying the data Din to the second generation model M2 (Sb107).
  • the waveform data Dw is sequentially edited according to the editing instruction Qw represented by (Sb108).
  • the waveform data Dw edited according to the edit instructions Qw up to the Yw-th is generated under the note string N corresponding to the set value Xn and the feature string F corresponding to the latest value Yf.
  • the waveform data Dw is not edited in step Sb108, and the waveform data Dw is determined as final data.
  • the display control unit 20 displays the feature column F represented by the feature column data Df edited by the second editing unit 33 in the edit area Ef, and updates the display of the feature column version number Vf in the operation area Gf to the latest value Yf (Sb110). That is, the feature column F corresponding to the set value Xn and the latest value Yf is displayed in the editing area Ef. Similarly, the display control unit 20 displays the waveform W represented by the waveform data Dw edited by the third editing unit 35 in the editing area Ew, and updates the display of the waveform version number Vw in the operation area Gw to the latest value Yw (Sb111).
  • the waveform W corresponding to the set value Xn, the latest value Yf, and the latest value Yw is displayed in the editing area Ew.
  • the user can give an editing instruction (Qn, Qf or Qw) for each of the note sequence N, the feature sequence F and the waveform W.
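  • The first management process Sb1 can be sketched as a replay of stored edit instructions. This is a minimal sketch under stated assumptions, not the implementation of the described system: `m1` and `m2` are toy stand-ins for the generation models M1 and M2, an edit instruction is modeled as an (index, value) pair, the history is modeled as plain dictionaries, and the latest value Yw is looked up under the pair (Xn, Yf).

```python
from typing import Dict, List, Tuple

Edit = Tuple[int, float]  # (sample index, new value): a toy edit instruction


def m1(notes: List[int]) -> List[float]:
    """Toy stand-in for the first generation model M1 (notes -> features)."""
    return [float(n) for n in notes for _ in range(4)]


def m2(notes: List[int], features: List[float]) -> List[float]:
    """Toy stand-in for the second generation model M2 (notes, features -> waveform)."""
    return [f / 100.0 for f in features for _ in range(2)]


def apply_edits(data: List[float], edits: List[Edit]) -> List[float]:
    """Apply edit instructions in the order in which they were given."""
    out = list(data)
    for index, value in edits:
        out[index] = value
    return out


def first_management(notes_by_vn: Dict[int, List[int]],
                     qf_by_vn: Dict[int, List[Edit]],
                     qw_by_vn_vf: Dict[Tuple[int, int], List[Edit]],
                     xn: int):
    """Restore the state for note string version Xn with the latest Vf and Vw."""
    qf = qf_by_vn.get(xn, [])
    yf = len(qf)                                   # latest value Yf (Sb102)
    qw = qw_by_vn_vf.get((xn, yf), [])
    yw = len(qw)                                   # latest value Yw (Sb103)
    notes = notes_by_vn[xn]
    features = apply_edits(m1(notes), qf)          # replay Qf (Sb106)
    waveform = m2(notes, features)                 # generate Dw (Sb107)
    waveform = apply_edits(waveform, qw)           # replay Qw (Sb108)
    return (xn, yf, yw), features, waveform


notes_by_vn = {1: [60, 62], 2: [60, 64]}
qf_by_vn = {1: [(0, 61.0), (1, 61.5)]}
qw_by_vn_vf = {(1, 2): [(0, 0.0)]}
version, features, waveform = first_management(notes_by_vn, qf_by_vn, qw_by_vn_vf, 1)
```

  • Because only the edit instructions are replayed on top of regenerated data, the history does not need to hold the feature or waveform data itself.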
  • FIG. 9 is a flowchart of the second management process Sb2 regarding the version of the feature column F.
  • the second management process Sb2 is started with the instruction to change the feature column version number Vf.
  • the numerical value of the feature column version number Vf after the change according to the instruction from the user is referred to as "set value Xf" below.
  • the changed numerical value, that is, the numerical value specified by the user
  • the information management unit 40 changes the feature column version number Vf from the current value Cf to the set value Xf (Sb201). Further, the information management unit 40 maintains the note string version number Vn at the current value Cn (Sb202), and changes the waveform version number Vw from the current value Cw to the latest value Yw (Sb203).
  • the latest value Yw of the waveform version number Vw is the number of the latest version among the plurality of versions of the waveform W generated for each edit instruction Qw under the feature column F of the version corresponding to the set value Xf.
  • it is data representing the time series of the edit instructions Qw up to the Yw-th.
  • the feature column data Df is sequentially edited according to the editing instruction Qf represented by (Sb206). That is, the feature sequence data Df edited according to the edit instruction Qf up to the Xf th is generated under the note sequence N corresponding to the current value Cn.
  • the waveform data Dw is generated by supplying the data Din to the second generation model M2 (Sb207).
  • the waveform data Dw is sequentially edited according to the editing instruction Qw represented by (Sb208). That is, the waveform data Dw edited according to the edit instructions Qw up to the Yw-th is generated under the note string N corresponding to the current value Cn and the feature string F corresponding to the set value Xf.
  • the display control unit 20 updates the edit screen G (Sb209-Sb210). Specifically, the display control unit 20 displays the feature column F represented by the feature column data Df edited by the second editing unit 33 in the edit area Ef, and updates the display of the feature column version number Vf in the operation area Gf to the set value Xf (Sb209). That is, the feature column F corresponding to the current value Cn and the set value Xf is displayed in the editing area Ef. Further, the display control unit 20 displays the waveform W represented by the waveform data Dw edited by the third editing unit 35 in the editing area Ew, and updates the display of the waveform version number Vw in the operation area Gw to the latest value Yw (Sb210).
  • the waveform W corresponding to the current value Cn, the set value Xf, and the latest value Yw is displayed in the editing area Ew.
  • the user can give an editing instruction (Qn, Qf or Qw) for each of the note sequence N, the feature sequence F and the waveform W.
  • FIG. 10 is a flowchart of the third management process Sb3 regarding the version of the waveform W.
  • the third management process Sb3 is started with the instruction to change the waveform version number Vw.
  • the numerical value of the waveform version number Vw after the change according to the instruction from the user is referred to as "set value Xw" below.
  • the changed numerical value, that is, the numerical value specified by the user
  • the information management unit 40 changes the waveform version number Vw from the current value Cw to the set value Xw (Sb301). Further, the information management unit 40 maintains the note sequence version number Vn at the current value Cn (Sb302) and the feature sequence version number Vf at the current value Cf (Sb303).
  • it is data representing the time series of the edit instructions Qf up to the Cf-th among the one or more edit instructions Qf sequentially given by the user under the note string N whose note sequence version number Vn is the current value Cn.
  • the feature column data Df is sequentially edited according to the editing instruction Qf represented by (Sb306). That is, the feature sequence data Df edited according to the edit instruction Qf up to the Cf th is generated under the note sequence N corresponding to the current value Cn.
  • the waveform data Dw is generated by supplying the data Din to the second generation model M2 (Sb307).
  • the waveform data Dw is sequentially edited according to the editing instruction Qw represented by (Sb308). That is, the waveform data Dw edited according to the edit instruction Qw up to the Xwth is generated under the note string N corresponding to the current value Cn and the feature string F corresponding to the current value Cf.
  • the display control unit 20 updates the edit screen G (Sb309). Specifically, the display control unit 20 displays the waveform W represented by the waveform data Dw edited by the third editing unit 35 in the editing area Ew, and updates the display of the waveform version number Vw in the operation area Gw to the set value Xw. That is, the waveform W corresponding to the current value Cn, the current value Cf, and the set value Xw is displayed in the editing area Ew.
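  • The three management processes differ only in which version number the user sets and which ones fall back to the latest or current value. The sketch below summarizes that selection. `target_version` and its lookup tables are hypothetical names, and looking up the latest value Yw under the pair (Vn, Vf) is an assumption made for the example.

```python
from typing import Dict, Optional, Tuple


def target_version(current: Tuple[int, int, int],
                   latest_vf: Dict[int, int],
                   latest_vw: Dict[Tuple[int, int], int],
                   xn: Optional[int] = None,
                   xf: Optional[int] = None,
                   xw: Optional[int] = None) -> Tuple[int, int, int]:
    """Return (Vn, Vf, Vw) after the user sets exactly one version number."""
    cn, cf, _ = current
    if sum(x is not None for x in (xn, xf, xw)) != 1:
        raise ValueError("specify exactly one of xn, xf, xw")
    if xn is not None:
        # Sb1: Vn is set to Xn; Vf and Vw take the latest values Yf and Yw.
        yf = latest_vf.get(xn, 0)
        return (xn, yf, latest_vw.get((xn, yf), 0))
    if xf is not None:
        # Sb2: Vn stays at Cn; Vf is set to Xf; Vw takes the latest value Yw.
        return (cn, xf, latest_vw.get((cn, xf), 0))
    # Sb3: Vn and Vf stay at Cn and Cf; Vw is set to Xw.
    return (cn, cf, xw)


latest_vf = {1: 2, 2: 0}
latest_vw = {(1, 1): 3, (1, 2): 1}
current = (1, 2, 1)
```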
  • the note sequence data Dn and the feature sequence data Df are edited according to the instructions (editing instruction Qn and editing instruction Qf) from the user. Therefore, it is possible to generate waveform data Dw that precisely reflects the instruction from the user, as compared with the configuration in which only the note string data Dn is edited in response to the instruction from the user.
  • when the note string data Dn is edited, the note string version number Vn is increased and the numerical value of the feature string version number Vf is initialized; when the feature string data Df is edited, the numerical value of the feature column version number Vf is increased while the numerical value of the note string version number Vn is maintained. Then, the waveform data Dw is generated by using at least one of the first history data Hn [Vn, Vf, Vw] corresponding to the set value Xn according to the instruction from the user among the plurality of numerical values of the note string version number Vn, and the second history data Hf [Vn, Vf, Vw] corresponding to the set value Xf according to the instruction from the user among the plurality of numerical values of the feature column version number Vf. Therefore, the user can instruct the editing of the note sequence data Dn and the feature sequence data Df while generating the waveform data Dw by trial and error for different combinations of the note sequence version number Vn and the feature sequence version number Vf.
  • FIG. 11 is a schematic diagram of the editing screen G in the second embodiment.
  • the operation image B2 is added to the same elements as those of the first embodiment.
  • the operation image B2 is an image (specifically, a pull-down menu) for the user to select the pronunciation style of the synthetic sound.
  • the user can select a desired pronunciation style from a plurality of pronunciation styles by operating the operation device 15.
  • Pronunciation style means a feature related to how to pronounce.
  • the pronunciation style is a characteristic of how the musical instrument is played.
  • the pronunciation style is a feature (singing manner) regarding how to sing the music.
  • a suitable pronunciation method for each music genre such as pop / rock / rap, is exemplified as a pronunciation style.
  • the musical expression of playing or singing such as bright / quiet / violent, is also exemplified as a pronunciation style.
  • FIG. 12 is a block diagram illustrating a functional configuration of the control device 11 in the second embodiment.
  • the pronunciation style s selected by the user in the operation on the operation image B2 is instructed to the first generation unit 32 and the second generation unit 34 of the second embodiment.
  • the first generation unit 32 generates the feature sequence data Df from the note sequence data Dn and the pronunciation style s.
  • the feature sequence data Df is time series data representing a time series of feature quantities (for example, fundamental frequency) related to a synthetic sound obtained by reproducing the note sequence N represented by the note sequence data Dn in the pronunciation style s.
  • the first generation unit 32 generates the feature column data Df using the first generation model M1.
  • the first generative model M1 is a statistical inference model that outputs feature sequence data Df by inputting note sequence data Dn and pronunciation style s. Similar to the first embodiment, the first generative model M1 is composed of a deep neural network having an arbitrary structure such as a convolutional neural network or a recurrent neural network.
  • the first generation model M1 is realized by a combination of a program that causes the control device 11 to execute an operation for generating feature sequence data Df from the note string data Dn and the pronunciation style s, and a plurality of variables applied to the operation.
  • a plurality of variables defining the first generation model M1 are set in advance by machine learning using a plurality of first training data and stored in the storage device 12.
  • Each of the plurality of first training data includes a set of the note sequence data Dn and the pronunciation style s, and the feature sequence data Df (correct answer value).
  • in the machine learning, the plurality of variables of the first generation model M1 are iteratively updated so that the error between the feature sequence data Df output by the provisional first generation model M1 for the note sequence data Dn and the pronunciation style s of each first training data and the feature sequence data Df of that first training data is reduced. Therefore, the first generative model M1 outputs statistically valid feature sequence data Df for an unknown combination of note sequence data Dn and pronunciation style s under a tendency latent in the plurality of first training data.
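  • The iterative update of model variables so that the error is reduced can be illustrated with a deliberately small example. The sketch below swaps the deep neural network for a linear model with one weight and one bias per pronunciation style, trained by stochastic gradient descent on synthetic data; the model form, the data, the learning rate, and all names are assumptions made for the illustration only.

```python
from typing import List, Tuple

Sample = Tuple[float, int, float]  # (note value, style index, correct feature value)


def predict(w: float, b: List[float], note: float, style: int) -> float:
    """Toy model: one shared weight plus one bias per pronunciation style."""
    return w * note + b[style]


def mean_squared_error(w: float, b: List[float], data: List[Sample]) -> float:
    return sum((predict(w, b, n, s) - y) ** 2 for n, s, y in data) / len(data)


def train(data: List[Sample], n_styles: int,
          lr: float = 0.1, epochs: int = 2000) -> Tuple[float, List[float]]:
    """Iteratively update the variables so that the squared error is reduced."""
    w, b = 0.0, [0.0] * n_styles
    for _ in range(epochs):
        for note, style, target in data:
            err = predict(w, b, note, style) - target
            w -= lr * err * note   # gradient step for the shared weight
            b[style] -= lr * err   # gradient step for the style bias
    return w, b


# Synthetic training data: the correct value depends on the note and the style.
data = [(x / 4, s, 2.0 * (x / 4) + (0.5 if s == 0 else -0.5))
        for x in range(5) for s in (0, 1)]
error_before = mean_squared_error(0.0, [0.0, 0.0], data)
w, b = train(data, n_styles=2)
error_after = mean_squared_error(w, b, data)
```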
  • the second generation unit 34 generates waveform data Dw from the note sequence data Dn, the feature sequence data Df, and the pronunciation style s.
  • the waveform data Dw is time-series data representing the waveform of the synthetic sound obtained by pronouncing the note sequence N represented by the note sequence data Dn in the pronunciation style s.
  • the second generation unit 34 generates the waveform data Dw using the second generation model M2.
  • the second generative model M2 is a statistical inference model that outputs waveform data Dw by inputting note sequence data Dn, feature sequence data Df, and pronunciation style s. Similar to the first embodiment, the second generative model M2 is composed of a deep neural network having an arbitrary structure such as a convolutional neural network or a recurrent neural network. Specifically, the second generation model M2 is realized by a combination of a program that causes the control device 11 to execute an operation of generating waveform data Dw from the note string data Dn, the feature sequence data Df, and the pronunciation style s, and a plurality of variables applied to the operation.
  • a plurality of variables defining the second generation model M2 are set in advance by machine learning using a plurality of second training data and stored in the storage device 12.
  • Each of the plurality of second training data includes a set of the note sequence data Dn, the feature sequence data Df, and the pronunciation style s, and the waveform data Dw (correct answer value).
  • in the machine learning, the plurality of variables of the second generation model M2 are iteratively updated so that the error between the waveform data Dw output by the tentative second generative model M2 for the note sequence data Dn, the feature sequence data Df, and the pronunciation style s of each second training data and the waveform data Dw of that second training data is reduced. Therefore, the second generative model M2 outputs statistically valid waveform data Dw for an unknown combination of the note sequence data Dn, the feature sequence data Df, and the pronunciation style s under the tendency latent in the plurality of second training data.
  • in step Sa201 of the second editing process Sa2, the second editing unit 33 edits, according to the edit instruction Qf, the feature sequence data Df representing the feature sequence F of the synthetic sound in which the note sequence N is pronounced in the pronunciation style s selected by the user. Further, in step Sa205 of the second editing process Sa2, the information management unit 40 saves the second history data Hf [Vn, Vf, Vw] corresponding to the edited feature column data Df in the history area of the storage device 12 for each version of the feature column data Df.
  • the feature sequence data Df corresponding to the pronunciation style s and the waveform data Dw corresponding to the pronunciation style s are generated under the specific note sequence N.
  • the note sequence N is not affected by the pronunciation style s. Therefore, as illustrated in FIG. 13, for the first history data Hn [Vn, Vf, Vw] (note string data Dn) corresponding to one note sequence N, a plurality of second history data Hf [Vn, Vf, Vw] corresponding to different feature sequences F and a plurality of third history data Hw [Vn, Vf, Vw] corresponding to different waveforms W are saved in the history area of the storage device 12 for each pronunciation style s.
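  • One way to picture a history area in which the note string is shared across pronunciation styles while the feature and waveform histories are kept per style is sketched below. The class `HistoryArea`, its method names, and the use of strings as edit instructions are hypothetical and chosen only for the illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple


class HistoryArea:
    """Toy history area: notes are style independent, lower layers are per style."""

    def __init__(self) -> None:
        # Vn -> note string data (one entry regardless of pronunciation style)
        self.hn: Dict[int, List[int]] = {}
        # (Vn, style) -> edit instructions Qf given under that note string and style
        self.hf: DefaultDict[Tuple[int, str], List[str]] = defaultdict(list)
        # (Vn, style, Vf) -> edit instructions Qw given under that feature version
        self.hw: DefaultDict[Tuple[int, str, int], List[str]] = defaultdict(list)

    def save_notes(self, vn: int, notes: List[int]) -> None:
        self.hn[vn] = list(notes)

    def save_feature_edit(self, vn: int, style: str, edit: str) -> int:
        """Store one Qf and return the resulting feature version number Vf."""
        self.hf[(vn, style)].append(edit)
        return len(self.hf[(vn, style)])

    def save_waveform_edit(self, vn: int, style: str, vf: int, edit: str) -> int:
        """Store one Qw and return the resulting waveform version number Vw."""
        self.hw[(vn, style, vf)].append(edit)
        return len(self.hw[(vn, style, vf)])


area = HistoryArea()
area.save_notes(1, [60, 62, 64])
area.save_feature_edit(1, "pop", "raise attack of note 1")
vf_pop = area.save_feature_edit(1, "pop", "flatten vibrato of note 2")
vf_rock = area.save_feature_edit(1, "rock", "deepen vibrato of note 3")
vw_pop = area.save_waveform_edit(1, "pop", vf_pop, "trim release")
```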
  • the feature sequence data Df representing the feature sequence F of the synthetic sound that pronounces the note sequence N in the pronunciation style s is generated by the first processing unit (Sa106), and the waveform data Dw representing the waveform W of the synthetic sound is generated by the second processing unit (Sa107).
  • the second editing unit 33 edits the feature sequence data Df according to the pronunciation style s according to the editing instruction Qf from the user.
  • the information management unit 40 saves the second history data Hf [Vn, Vf, Vw] corresponding to the edited feature column data Df in the history area for each edit of the feature column data Df (that is, for each version of the feature column data Df).
  • the third editing unit 35 edits the waveform data Dw according to the pronunciation style s according to the editing instruction Qw from the user.
  • the information management unit 40 saves the third history data Hw [Vn, Vf, Vw] corresponding to the edited waveform data Dw in the history area for each edit of the waveform data Dw (that is, for each version of the waveform data Dw).
  • in the state where the pronunciation style s is selected, the first management process Sb1 is started with the instruction to change the note string version number Vn.
  • in the state where the pronunciation style s is selected, the second management process Sb2 is started with the instruction to change the feature column version number Vf.
  • the "feature column F corresponding to the pronunciation style s" is a feature column F corresponding to the note sequence version number Vn (set value Xn), the pronunciation style s, and the feature sequence version number Vf (latest value Yf).
  • the "waveform W corresponding to the pronunciation style s" includes a note string version number Vn (set value Xn), a pronunciation style s, a feature string version number Vf (latest value Yf), and a waveform version.
  • the feature sequence data Df of the feature sequence F corresponding to the pronunciation style s and the waveform data Dw of the waveform W corresponding to the pronunciation style s are generated.
  • the "feature sequence F corresponding to the pronunciation style s" is a feature sequence F corresponding to the note sequence version number Vn (current value Cn), the pronunciation style s, and the feature sequence version number Vf (set value Xf).
  • the "waveform W corresponding to the pronunciation style s" is a waveform W corresponding to the note string version number Vn (current value Cn), the pronunciation style s, the feature string version number Vf (set value Xf), and the waveform version number Vw (latest value Yw).
  • the third management process Sb3 is started with the instruction to change the waveform version number Vw.
  • the feature sequence data Df of the feature sequence F corresponding to the pronunciation style s and the waveform data Dw of the waveform W corresponding to the pronunciation style s are generated.
  • the "feature sequence F corresponding to the pronunciation style s" is a feature sequence F corresponding to the note sequence version number Vn (current value Cn), the pronunciation style s, and the feature sequence version number Vf (current value Cf).
  • the "waveform W corresponding to the pronunciation style s" is, specifically, a waveform W corresponding to the note string version number Vn (current value Cn), the pronunciation style s, the feature string version number Vf (current value Cf), and the waveform version number Vw (set value Xw).
  • the pronunciation style s1 and the pronunciation style s2 are different pronunciation styles s.
  • the pronunciation style s1 is an example of the "first pronunciation style".
  • the pronunciation style s2 is an example of the "second pronunciation style".
  • the second editing unit 33 edits the feature sequence data Df corresponding to the pronunciation style s1 according to the editing instruction Qf from the user. Then, each time the feature column data Df is edited, the information management unit 40 saves the second history data Hf [Vn, Vf, Vw] corresponding to the edited feature column data Df in the history area. Similarly, in the third editing process Sa3, the third editing unit 35 edits the waveform data Dw according to the pronunciation style s1 according to the editing instruction Qw from the user.
  • the information management unit 40 saves the third history data Hw [Vn, Vf, Vw] corresponding to the edited waveform data Dw in the history area.
  • the feature sequence data Df or waveform data Dw generated when the pronunciation style s1 is selected is an example of "first time series data".
  • the editing instruction Qf or the editing instruction Qw given by the user with the pronunciation style s1 selected is an example of the "first instruction".
  • the feature sequence data Df of the feature sequence F corresponding to the pronunciation style s1 and the waveform data Dw of the waveform W corresponding to the pronunciation style s1 are generated. That is, among the plurality of history data H (Hn, Hf, Hw) corresponding to the pronunciation style s1, the feature sequence data Df and the waveform data Dw corresponding to the history data H that corresponds to the instruction (Xn, Xf, or Xw) from the user are generated.
  • the second editing unit 33 edits the feature sequence data Df corresponding to the pronunciation style s2 according to the editing instruction Qf from the user. Then, each time the feature column data Df is edited, the information management unit 40 saves the second history data Hf [Vn, Vf, Vw] corresponding to the edited feature column data Df in the history area. Similarly, in the third editing process Sa3, the third editing unit 35 edits the waveform data Dw corresponding to the pronunciation style s2 according to the editing instruction Qw from the user.
  • the information management unit 40 saves the third history data Hw [Vn, Vf, Vw] corresponding to the edited waveform data Dw in the history area.
  • the feature sequence data Df or waveform data Dw generated when the pronunciation style s2 is selected is an example of "second time series data".
  • the editing instruction Qf or the editing instruction Qw given by the user with the pronunciation style s2 selected is an example of the "second instruction".
  • the feature sequence data Df of the feature sequence F corresponding to the pronunciation style s2 and the waveform data Dw of the waveform W corresponding to the pronunciation style s2 are generated. That is, among the plurality of history data H (Hn, Hf, and Hw) corresponding to the pronunciation style s2, the feature sequence data Df and the waveform data Dw corresponding to the history data H that corresponds to the instruction (Xn, Xf, or Xw) from the user are generated.
  • the editing processing unit 30 in the second embodiment acquires the feature sequence data Df and the waveform data Dw corresponding to the pronunciation style s1, or the feature sequence data Df and the waveform data Dw corresponding to the pronunciation style s2, according to a common version of the note string data Dn.
  • the editing history of the feature sequence data Df and the waveform data Dw corresponding to the pronunciation style s1 is stored in the storage device 12, and the editing history of the feature sequence data Df and the waveform data Dw corresponding to the pronunciation style s2 is also stored in the storage device 12. Therefore, the editing of the feature sequence data Df or the waveform data Dw corresponding to the pronunciation style s1 and the editing of the feature sequence data Df or the waveform data Dw corresponding to the pronunciation style s2 can each be executed by trial and error according to the instruction from the user.
  • the display control unit 20 causes the display device 14 to display the comparison screen U of FIG.
  • the comparison screen U includes a first region U1, an operation image U1a (call), an operation image U1b (reproduction), a second region U2, an operation image U2a (call), and an operation image U2b (reproduction).
  • the hierarchical relationship among the first history data Hn [Vn, Vf, Vw], the second history data Hf [Vn, Vf, Vw], and the third history data Hw [Vn, Vf, Vw] is displayed.
  • the user can select desired historical data H for each of the first region U1 and the second region U2.
  • the user selects the desired history data H for each of the first region U1 and the second region U2 by designating the pronunciation style s and each version number (Vn, Vf, Vw).
  • the control device 11 uses each history data H acquired from the history area to generate the feature sequence data Df of the feature sequence F and the waveform data Dw of the waveform W corresponding to the pronunciation style s and the version numbers (Vn, Vf, Vw).
  • the display screen G including the above is displayed on the display device 14.
  • the control device 11 reproduces the synthetic sound by supplying the sound emitting device 13 with the acoustic signal Z corresponding to the waveform data Dw generated in the above procedure for the first region U1.
  • the control device 11 acquires the history data H selected in the second region U2 from the storage device 12, and displays the editing screen G corresponding to the history data H on the display device 14.
  • the control device 11 generates, by the same procedure as described above for the first region U1, the feature sequence data Df and the waveform data Dw corresponding to the pronunciation style s and each version number (Vn, Vf, Vw) specified by the user for the second region U2.
  • the display screen G including the above is displayed on the display device 14.
  • the control device 11 reproduces the synthetic sound by supplying the sound emitting device 13 with the acoustic signal Z corresponding to the waveform data Dw generated in the above procedure for the second region U2.
  • the user can adjust the note sequence N, the feature sequence F, the waveform W, and the pronunciation style s while comparing the combination of the version and the pronunciation style s selected in the first region U1 with the combination of the version and the pronunciation style s selected in the second region U2.
  • FIG. 15 is an explanatory diagram of the synthetic sound in the third embodiment.
  • the synthetic sound of the third embodiment is composed of a plurality of tracks T (T1, T2, ...) parallel to each other on the time axis.
  • each performance part corresponds to the track T.
  • when a singing sound composed of a plurality of singing parts is used as the synthetic sound, each singing part corresponds to the track T.
  • Each of the plurality of tracks T includes a plurality of sections (hereinafter referred to as "unit intervals") R that do not overlap each other on the time axis.
  • Each of the plurality of unit intervals R is an interval (region) including the note string N on the time axis. That is, a unit interval R is set for each note sequence N, with a set of a plurality of notes that are close to each other on the time axis as a note sequence N.
  • the time length of each unit interval R is a variable length according to the total number of notes in the note sequence N, the continuation length of each note, and the like.
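  • Grouping notes that are close to each other on the time axis into non-overlapping unit intervals can be sketched as follows. The text does not state how closeness is decided, so the threshold `max_gap`, the (onset, duration) note representation, and the function name are assumptions made for this example.

```python
from typing import List, Tuple

Note = Tuple[float, float]  # (onset, duration) in seconds


def unit_intervals(notes: List[Note],
                   max_gap: float = 0.25) -> List[Tuple[float, float, List[Note]]]:
    """Group notes into unit intervals (start, end, member notes).

    A note joins the current interval when the silence between the end of
    that interval and the note onset does not exceed max_gap; otherwise it
    opens a new interval.
    """
    intervals: List[Tuple[float, float, List[Note]]] = []
    for onset, duration in sorted(notes):
        end = onset + duration
        if intervals and onset - intervals[-1][1] <= max_gap:
            start, prev_end, members = intervals[-1]
            intervals[-1] = (start, max(prev_end, end), members + [(onset, duration)])
        else:
            intervals.append((onset, end, [(onset, duration)]))
    return intervals


track = [(0.0, 0.5), (0.5, 0.5), (3.0, 1.0), (4.25, 0.25)]
result = unit_intervals(track)
```

  • The length of each resulting interval varies with the number of notes it contains and with their durations, in line with the variable-length unit interval R described above.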
  • FIG. 16 is a schematic diagram of the editing screen G in the third embodiment.
  • Information (note sequence N, feature sequence F, or waveform W) on one unit interval R selected by the user, among the plurality of unit intervals R of one track T selected by the user from the plurality of tracks T of the synthetic sound, is displayed on the edit screen G.
  • the operation area Gt and the operation area Gr are added to the same elements as those of the first embodiment.
  • the operation area Gt is an area related to the track T of the synthetic sound. Specifically, the track version number Vt, the operation image Gt1 and the operation image Gt2 are displayed in the operation area Gt.
  • the track version number Vt is a number representing the version of the track T displayed on the edit screen G.
  • The track version number Vt is incremented by 1 each time the information on the track T displayed on the edit screen G (note sequence N, feature sequence F, or waveform W) is edited. Further, the user can change the track version number Vt in the operation area Gt to an arbitrary numerical value by operating the operation device 15.
  • the operation image Gt1 and the operation image Gt2 are software buttons that can be operated by the user using the operation device 15.
  • The operation image Gt1 is an operator with which the user instructs that the information (note sequence N, feature sequence F, or waveform W) on the track T be returned to its state before the immediately preceding edit (Undo).
  • The operation image Gt2 is an operator with which the user instructs that the edit canceled by operating the operation image Gt1 be executed again (Redo).
  • the operation area Gr is an area related to the unit interval R of the synthetic sound. Specifically, the section version number Vr, the operation image Gr1 and the operation image Gr2 are displayed in the operation area Gr.
  • the section version number Vr is a number representing the version of the unit section R displayed on the edit screen G.
  • The section version number Vr is incremented by 1 each time the information on the unit interval R displayed on the edit screen G (note sequence N, feature sequence F, or waveform W) is edited. Further, the user can change the section version number Vr in the operation area Gr to an arbitrary numerical value by operating the operation device 15.
  • the operation image Gr1 and the operation image Gr2 are software buttons that can be operated by the user using the operation device 15.
  • The operation image Gr1 is an operator with which the user instructs that the information (note sequence N, feature sequence F, or waveform W) on the unit interval R be returned to its state before the immediately preceding edit (Undo).
  • The operation image Gr2 is an operator with which the user instructs that the edit canceled by operating the operation image Gr1 be executed again (Redo).
  • the editing process Sa (Sa1-Sa3) or the management process Sb (Sb1-Sb3) is executed for each of the plurality of unit intervals R in one track T displayed on the editing screen G.
  • the information management unit 40 increases the track version number Vt and the section version number Vr by one. Further, when the user operates the operation image (Gn1, Gf1, Gw1, Gn2, Gf2 or Gw2), the information management unit 40 similarly increases the track version number Vt and the section version number Vr by one.
  • For each of the plurality of unit intervals R on the time axis, the user can instruct edits to the note sequence data Dn, the feature sequence data Df, and the waveform data Dw while generating the waveform data Dw by trial and error.
  • The note sequence data Dn of each version is stored in the history area as the first history data Hn [Vn, Vf, Vw], but the matters represented by the first history data Hn [Vn, Vf, Vw] and the format of the first history data Hn [Vn, Vf, Vw] are not limited to the above examples.
  • the first history data Hn [Vn, Vf, Vw] indicating how the note string data Dn is edited may be saved.
  • the first history data Hn [Vn, Vf, Vw] is comprehensively expressed as data corresponding to the edited note sequence N.
  • the second history data Hf [Vn, Vf, Vw] indicating how the feature column data Df is edited (that is, the time series of the edit instruction Qf) is stored in the history area.
  • the matters represented by the second history data Hf [Vn, Vf, Vw] and the format of the second history data Hf [Vn, Vf, Vw] are not limited to the above examples.
  • the feature column data Df after editing according to the editing instruction Qf may be saved in the history area as the second history data Hf [Vn, Vf, Vw].
  • the second history data Hf [Vn, Vf, Vw] is comprehensively represented as data corresponding to the edited feature column data Df.
  • the third history data Hw [Vn, Vf, Vw] indicating how the waveform data Dw is edited (that is, the time series of the edit instruction Qw) is saved in the history area.
  • the matters represented by the third history data Hw [Vn, Vf, Vw] and the format of the third history data Hw [Vn, Vf, Vw] are not limited to the above examples.
  • the waveform data Dw after editing according to the editing instruction Qw may be saved in the history area as the third history data Hw [Vn, Vf, Vw].
  • the third history data Hw [Vn, Vf, Vw] is comprehensively expressed as data corresponding to the edited waveform data Dw.
  • the feature sequence F having the fundamental frequency of the synthesized sound as the feature quantity is illustrated, but the feature quantity represented by the feature sequence data Df is not limited to the fundamental frequency.
  • For example, time-series data representing a time series (feature sequence F) of a feature amount such as the frequency spectrum of the synthetic sound in the frequency domain (for example, the intensity spectrum) or the sound pressure level on the time axis may be used as the feature sequence data Df.
  • the feature sequence data Df is comprehensively represented as time series data representing a time series (feature sequence F) of the feature amount of the note sequence data Dn.
  • The second generation unit 34 generates the waveform data Dw from the note sequence data Dn and the feature sequence data Df, but the second generation unit 34 may instead generate the waveform data Dw from the note sequence data Dn alone.
  • Alternatively, the second generation unit 34 may generate the waveform data Dw from the feature sequence data Df alone. That is, the second generation unit 34 is specified as an element that generates the waveform data Dw from at least one of the note sequence data Dn and the feature sequence data Df.
  • The first generation model M1 that outputs the feature sequence data Df in response to an input including the pronunciation style s has been exemplified, but the configuration by which the first generation unit 32 generates the feature sequence data Df corresponding to the pronunciation style s is not limited to the above example.
  • the feature sequence data Df may be generated by selectively using a plurality of first generation models M1 corresponding to different pronunciation styles s.
  • the first generation model M1 corresponding to each pronunciation style s is constructed by machine learning using a plurality of first training data prepared for the pronunciation style s.
  • The first generation unit 32 generates the feature sequence data Df by inputting the note sequence data Dn into the first generation model M1 that corresponds to the pronunciation style s selected by the user among the plurality of first generation models M1.
  • The second generation model M2 that outputs the waveform data Dw in response to an input including the pronunciation style s has been exemplified, but the configuration by which the second generation unit 34 generates the waveform data Dw corresponding to the pronunciation style s is not limited to the above example.
  • the waveform data Dw may be generated by selectively using a plurality of second generation models M2 corresponding to different pronunciation styles s.
  • The second generation model M2 corresponding to each pronunciation style s is constructed by machine learning using a plurality of second training data prepared for the pronunciation style s.
  • The second generation unit 34 generates the waveform data Dw by inputting the note sequence data Dn and the feature sequence data Df (input data Din) into the second generation model M2 that corresponds to the pronunciation style s selected by the user among the plurality of second generation models M2.
  • The waveform W of the acoustic signal Z is displayed in the editing area Ew of the edit screen G, but the time series of the frequency spectrum of the acoustic signal Z (that is, a spectrogram) may be displayed on the edit screen G together with the waveform W.
  • the editing screen G illustrated in FIG. 17 includes an editing area Ew1 and an editing area Ew2.
  • In the editing area Ew1, the waveform W is displayed in the same manner as in the editing area Ew of each of the above-described forms.
  • In the editing area Ew2, the time series of the frequency spectrum of the acoustic signal Z is displayed.
  • the user can give the editing instruction Qw for the frequency spectrum in the editing area Ew2 by operating the operation device 15.
  • Note string data Dn is time-series data representing a note sequence N having a plurality of notes on the time axis as elements.
  • the feature sequence data Df is time-series data representing the feature sequence F having a plurality of feature quantities on the time axis as elements.
  • the waveform data Dw is time-series data representing a waveform W having a plurality of samples on the time axis as elements.
  • the note sequence data Dn, the feature sequence data Df, and the waveform data Dw are comprehensively represented as time series data representing a time series of a plurality of elements.
  • the deep neural network is exemplified as the first generation model M1 and the second generation model M2, but the configurations of the first generation model M1 and the second generation model M2 are arbitrary.
  • a statistical inference model of another structure such as HMM (Hidden Markov Model) may be used as the first generation model M1 or the second generation model M2.
  • In each of the above-described forms, the synthesis of a synthetic sound corresponding to the note sequence N has been exemplified, but each of the above-described forms can be applied to any situation in which time-series data representing a time series of a plurality of elements is processed.
  • the upper layer corresponds to the note sequence N
  • the middle layer corresponds to the feature sequence F
  • the lower layer corresponds to the waveform W.
  • Each layer in the scene is a combination illustrated below.
  • the note strings constituting the melody correspond to the upper layer
  • the time series of the chords in the melody corresponds to the middle layer
  • the accompaniment sound that harmonizes with the melody corresponds to the lower layer.
  • In a speech synthesis scene in which speech corresponding to a character string is synthesized, the character string corresponds to the upper layer, the pronunciation style of the speech corresponds to the middle layer, and the waveform of the speech corresponds to the lower layer.
  • the waveform of the signal corresponds to the upper layer
  • the time series of the feature amount of the signal corresponds to the middle layer
  • the time series of the parameters related to the processing for the signal corresponds to the lower layer.
  • the lower layer data is expressed as "lower data”.
  • the lower-level data is data representing the content actually used by the user (for example, the waveform W in each of the above-mentioned forms).
  • each note constituting the note string N in each of the above-mentioned forms and each character constituting the character string in speech synthesis are comprehensively expressed as symbols indicating sounds. Further, the note string N and the character string are comprehensively represented as a symbol string in which a plurality of symbols are arranged in time series.
  • the functions of the acoustic processing system exemplified above are realized by the cooperation of the single or a plurality of processors constituting the control device 11 and the program stored in the storage device 12.
  • the program according to the present disclosure may be provided and installed in a computer in a form stored in a computer-readable recording medium.
  • The recording medium is, for example, a non-transitory recording medium. An optical recording medium (optical disc) such as a CD-ROM is a good example, but any known form of recording medium, such as a semiconductor recording medium or a magnetic recording medium, is also included.
  • The non-transitory recording medium includes any recording medium other than a transitory, propagating signal, and volatile recording media are not excluded. Further, in a configuration in which a distribution device distributes the program via a communication network, the storage device 12 that stores the program in the distribution device corresponds to the above-mentioned non-transitory recording medium.
  • In an information processing method according to one aspect of the present disclosure, first time-series data representing a time series of a feature amount of a sound produced by pronouncing a symbol string in a first pronunciation style is edited in accordance with a first instruction from the user, and, for each edit of the first time-series data, first history data corresponding to the edited first time-series data is saved as data of a new version. Second time-series data representing a time series of a feature amount of a sound produced by pronouncing the symbol string in a second pronunciation style different from the first pronunciation style is edited in accordance with a second instruction from the user, and, for each edit of the second time-series data, second history data corresponding to the edited second time-series data is saved as data of a new version. Then, either first time-series data corresponding to the first history data that accords with an instruction from the user among the saved first history data of different versions, or second time-series data corresponding to the second history data that accords with an instruction from the user among the saved second history data of different versions, is acquired.
  • the history of editing the first time-series data corresponding to the first pronunciation style is saved, and the history of editing the second time-series data corresponding to the second pronunciation style is saved. Therefore, the editing of the first time-series data corresponding to the first pronunciation style and the editing of the second time-series data corresponding to the second pronunciation style are executed by trial and error according to the instruction from the user.
  • the "symbol string" is, for example, a musical note string or a character string.
  • the symbol string is a note sequence including a plurality of notes arranged in a time series.
  • The note sequence data representing the note sequence is edited in accordance with an instruction from the user, and the first time-series data and the second time-series data are generated from the note sequence data of a common version.
  • Either the first history data from before the immediately preceding edit, among the plurality of first history data, or the second history data from before the immediately preceding edit, among the plurality of second history data, is acquired.
  • According to the above configuration, the first history data or the second history data from before the execution of the immediately preceding edit (that is, the state in which the edit is canceled) can be acquired.
  • Either the first history data of the version designated by the user among the plurality of first history data, or the second history data of the version designated by the user among the plurality of second history data, is acquired. According to the above configuration, it is possible to acquire the first history data or the second history data corresponding to any version in accordance with an instruction from the user.
  • An information processing system according to one aspect of the present disclosure includes an editing processing unit and an information management unit. The editing processing unit edits, in accordance with a first instruction from the user, first time-series data representing a time series of a feature amount of a sound produced by pronouncing a symbol string in a first pronunciation style, and edits, in accordance with a second instruction from the user, second time-series data representing a time series of a feature amount of a sound produced by pronouncing the symbol string in a second pronunciation style different from the first pronunciation style.
  • The information management unit saves, for each edit of the first time-series data, first history data corresponding to the edited first time-series data as data of a new version, and saves, for each edit of the second time-series data, second history data corresponding to the edited second time-series data as data of a new version.
  • The information management unit acquires either first time-series data corresponding to the first history data that accords with an instruction from the user among the saved first history data of different versions, or second time-series data corresponding to the second history data that accords with an instruction from the user among the saved second history data of different versions.
  • the program according to one aspect of the present disclosure causes the computer system to function as the above information processing system.
  • 100 ... Information processing system, 11 ... Control device, 12 ... Storage device, 13 ... Sound emitting device, 14 ... Display device, 15 ... Operation device, 20 ... Display control unit, 30 ... Editing processing unit, 31 ... First editing unit, 32 ... First generation unit, 33 ... Second editing unit, 34 ... Second generation unit, 35 ... Third editing unit, M1 ... First generation model, M2 ... Second generation model.
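As an illustration of the third embodiment described above, the following is a minimal Python sketch of how notes might be grouped into non-overlapping unit intervals R and how the track version number Vt and section version number Vr might advance together on each edit. The function name `split_into_unit_intervals`, the class `TrackVersions`, the `max_gap` threshold, and the starting version value of 0 are all hypothetical and are not specified in the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple

Note = Tuple[float, float]  # (start, end) of a note on the time axis, in seconds


def split_into_unit_intervals(notes: List[Note], max_gap: float = 0.5) -> List[List[Note]]:
    """Group notes that are close in time into non-overlapping unit intervals R.

    max_gap is a hypothetical threshold: a note opens a new unit interval
    when it begins more than max_gap seconds after the current group ends.
    """
    intervals: List[List[Note]] = []
    group_end = 0.0
    for note in sorted(notes):
        if intervals and note[0] - group_end <= max_gap:
            intervals[-1].append(note)
            group_end = max(group_end, note[1])
        else:
            intervals.append([note])
            group_end = note[1]
    return intervals


@dataclass
class TrackVersions:
    """Version counters for one track T and its unit intervals R."""
    section_versions: List[int]  # Vr, one counter per unit interval
    track_version: int = 0       # Vt (starting at 0 is an assumption)

    def record_edit(self, interval_index: int) -> None:
        """An edit inside one unit interval advances both Vr and Vt by one."""
        self.section_versions[interval_index] += 1
        self.track_version += 1
```

In this sketch the length of each unit interval follows from the notes it contains, which mirrors the variable-length intervals described for the third embodiment.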

Abstract

This invention includes: editing, in accordance with a first instruction from a user, first time-series data representing a time series of a feature amount of a sound produced by pronouncing a symbol string with a first pronunciation style; saving, as a new version of data, first historical data corresponding to the edited first time-series data, for each edit of the first time-series data; editing, in accordance with a second instruction from the user, second time-series data representing a time series of a feature amount of a sound produced by pronouncing the symbol string with a second pronunciation style which is different from the first pronunciation style; saving, as a new version of data, second historical data corresponding to the edited second time-series data, for each edit of the second time-series data; and acquiring first time-series data corresponding to a first historical data piece, among a plurality of saved first historical data pieces of different versions, that corresponds to an instruction from the user, or second time-series data corresponding to a second historical data piece, among a plurality of saved second historical data pieces of different versions, that corresponds to an instruction from the user.

Description

Information processing method, information processing system, and program
The present disclosure relates to the processing of time-series data.
Various speech synthesis techniques for synthesizing speech with arbitrary phonemes have been proposed. For example, Patent Document 1 discloses a technique for synthesizing a singing voice that pronounces a note sequence specified by a user on an editing screen. The editing screen is a piano-roll screen on which a time axis and a pitch axis are set. For each note constituting a piece of music, the user specifies a phoneme (pronounced character), a pitch, and a pronunciation period.
Japanese Unexamined Patent Publication No. 2016-90916
To synthesize speech that accurately reflects the user's intention, the user must proceed by trial and error, repeatedly editing the speech synthesis conditions (for example, various parameters) and listening to the resulting speech. A configuration is conceivable that permits a process of canceling, in reverse order, the most recent of a plurality of edits that the user has instructed in sequence (undo), or a process of re-executing a canceled edit (redo). With simple undo or redo alone, however, it is in practice difficult for the user to instruct edits by trial and error while comparing the results of various edits with one another. Although the above description takes speech synthesis as an example, similar problems are expected in various situations in which time-series data is generated. In view of the above circumstances, an object of the present disclosure is to make it easier to generate time-series data that matches the user's intention.
To solve the above problems, an information processing method according to one aspect of the present disclosure includes: editing, in accordance with a first instruction from a user, first time-series data representing a time series of a feature amount of a sound produced by pronouncing a symbol string in a first pronunciation style; saving, for each edit of the first time-series data, first history data corresponding to the edited first time-series data as data of a new version; editing, in accordance with a second instruction from the user, second time-series data representing a time series of a feature amount of a sound produced by pronouncing the symbol string in a second pronunciation style different from the first pronunciation style; saving, for each edit of the second time-series data, second history data corresponding to the edited second time-series data as data of a new version; and acquiring either first time-series data corresponding to the first history data that accords with an instruction from the user among the plurality of saved first history data of different versions, or second time-series data corresponding to the second history data that accords with an instruction from the user among the plurality of saved second history data of different versions.
An information processing system according to one aspect of the present disclosure includes: an editing processing unit that edits, in accordance with a first instruction from a user, first time-series data representing a time series of a feature amount of a sound produced by pronouncing a symbol string in a first pronunciation style, and that edits, in accordance with a second instruction from the user, second time-series data representing a time series of a feature amount of a sound produced by pronouncing the symbol string in a second pronunciation style different from the first pronunciation style; and an information management unit that saves, for each edit of the first time-series data, first history data corresponding to the edited first time-series data as data of a new version, and that saves, for each edit of the second time-series data, second history data corresponding to the edited second time-series data as data of a new version. The information management unit acquires either first time-series data corresponding to the first history data that accords with an instruction from the user among the plurality of saved first history data of different versions, or second time-series data corresponding to the second history data that accords with an instruction from the user among the plurality of saved second history data of different versions.
A program according to one aspect of the present disclosure causes a computer system to function as: an editing processing unit that edits, in accordance with a first instruction from a user, first time-series data representing a time series of a feature amount of a sound produced by pronouncing a symbol string in a first pronunciation style, and that edits, in accordance with a second instruction from the user, second time-series data representing a time series of a feature amount of a sound produced by pronouncing the symbol string in a second pronunciation style different from the first pronunciation style; and an information management unit that saves, for each edit of the first time-series data, first history data corresponding to the edited first time-series data as data of a new version, and that saves, for each edit of the second time-series data, second history data corresponding to the edited second time-series data as data of a new version. The information management unit acquires either first time-series data corresponding to the first history data that accords with an instruction from the user among the plurality of saved first history data of different versions, or second time-series data corresponding to the second history data that accords with an instruction from the user among the plurality of saved second history data of different versions.
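The save-per-edit and acquire-by-version behavior summarized above can be illustrated with a minimal Python sketch. It assumes that each time series is a plain list of feature values (for example, fundamental frequencies) and that an edit overwrites a single value; the names `StyleHistory` and `HistoryManager`, and the choice to store full copies rather than edit instructions, are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StyleHistory:
    """Saved versions of the feature time series for one pronunciation style."""
    versions: List[List[float]] = field(default_factory=list)

    def save(self, series: List[float]) -> int:
        """Store a copy of the edited series as a new version; return its number."""
        self.versions.append(list(series))
        return len(self.versions) - 1

    def acquire(self, version: int) -> List[float]:
        """Return a copy of the series saved under the given version number."""
        return list(self.versions[version])


class HistoryManager:
    """Keeps a separate version history for each pronunciation style."""

    def __init__(self) -> None:
        self._histories: Dict[str, StyleHistory] = {}

    def edit(self, style: str, series: List[float], index: int, value: float) -> int:
        """Apply one edit to a copy of the series and save it as a new version."""
        edited = list(series)
        edited[index] = value
        return self._histories.setdefault(style, StyleHistory()).save(edited)

    def acquire(self, style: str, version: int) -> List[float]:
        """Fetch the series for the style and version the user asked for."""
        return self._histories[style].acquire(version)
```

Because each style has its own history, a user could edit the series for the first style, edit the series for the second style, and later return to any saved version of either one for comparison, which goes beyond a single linear undo or redo stack.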
FIG. 1 is a block diagram illustrating the configuration of an information processing system according to a first embodiment.
FIG. 2 is a schematic diagram of an edit screen.
FIG. 3 is a block diagram illustrating the functional configuration of the information processing system.
FIG. 4 is a flowchart illustrating the procedure of a first editing process.
FIG. 5 is a flowchart illustrating the procedure of a second editing process.
FIG. 6 is a flowchart illustrating the procedure of a third editing process.
FIG. 7 is an explanatory diagram of the data structure in a history area.
FIG. 8 is a flowchart illustrating the procedure of a first management process.
FIG. 9 is a flowchart illustrating the procedure of a second management process.
FIG. 10 is a flowchart illustrating the procedure of a third management process.
FIG. 11 is a schematic diagram of the edit screen in a second embodiment.
FIG. 12 is a block diagram illustrating the functional configuration of the information processing system in the second embodiment.
FIG. 13 is an explanatory diagram of the data structure in the history area in the second embodiment.
FIG. 14 is a schematic diagram of a comparison screen.
FIG. 15 is an explanatory diagram of the synthetic sound in a third embodiment.
FIG. 16 is a schematic diagram of the edit screen in the third embodiment.
FIG. 17 is a schematic diagram of the edit screen in a modification.
A: First Embodiment
FIG. 1 is a block diagram illustrating the configuration of the information processing system 100 according to the first embodiment of the present disclosure. The information processing system 100 is an acoustic processing system that generates an acoustic signal Z. The acoustic signal Z is a time-domain signal representing the waveform of a synthetic sound. The synthetic sound is, for example, an instrumental sound produced by a virtual performer playing a musical instrument, or a singing sound produced by a virtual singer singing a piece of music.
The information processing system 100 is realized by a computer system including a control device 11, a storage device 12, a sound emitting device 13, a display device 14, and an operation device 15. The information processing system 100 is realized by, for example, an information device such as a smartphone, a tablet terminal, or a personal computer. The information processing system 100 may be realized as a single device or as a plurality of devices configured separately from one another (for example, a client-server system).
The control device 11 is one or more processors that control each element of the information processing system 100. Specifically, the control device 11 is constituted by one or more types of processors such as a CPU (Central Processing Unit), an SPU (Sound Processing Unit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit). The control device 11 executes various processes for generating the acoustic signal Z.
The storage device 12 is one or more memories that store the program executed by the control device 11 and the various data used by the control device 11. The storage device 12 is constituted by a known recording medium such as a magnetic recording medium or a semiconductor recording medium. The storage device 12 may be constituted by a combination of a plurality of types of recording media. A portable recording medium that can be attached to and detached from the information processing system 100, or a recording medium that can be written to and read from via a communication network (for example, cloud storage), may also be used as the storage device 12.
The sound emitting device 13 reproduces the synthetic sound represented by the acoustic signal Z generated by the control device 11. The sound emitting device 13 is, for example, a speaker or headphones. The D/A converter that converts the acoustic signal Z from digital to analog and the amplifier that amplifies the acoustic signal Z are omitted from the drawing for convenience. Although FIG. 1 illustrates a configuration in which the sound emitting device 13 is mounted in the information processing system 100, a sound emitting device 13 separate from the information processing system 100 may be connected to the information processing system 100 by wire or wirelessly.
The display device 14 displays images under the control of the control device 11. The display device 14 is constituted by a display panel such as a liquid crystal panel or an organic EL (electroluminescence) panel. The operation device 15 is an input device that receives instructions from the user. The operation device 15 is, for example, a plurality of controls operated by the user, or a touch panel that detects contact by the user. By operating the operation device 15, the user can specify the conditions of the synthesized sound. The display device 14 displays an image G (hereinafter "editing screen") that the user refers to when specifying the conditions of the synthesized sound.
FIG. 2 is a schematic diagram of the editing screen G. The editing screen G includes a plurality of editing areas E (En, Ef, and Ew). A common time axis (horizontal axis) is set for the plurality of editing areas E. The section of the synthesized sound displayed on the editing screen G is changed in response to instructions from the user to the operation device 15.
The editing area En displays a time series N of the notes that make up the score of the synthesized sound (hereinafter "note sequence"). A coordinate plane defined by the time axis and a pitch axis (vertical axis) is set in the editing area En, and an image representing each note of the note sequence N is placed in it. A pitch (for example, a note number) and a sounding period are specified for each note of the note sequence N. When the synthesized sound is a singing voice, a phoneme is also specified for each note. Performance symbols such as crescendo, forte, or decrescendo are also displayed in the editing area En. By operating the operation device 15, the user can give an editing instruction Qn for the editing area En. The editing instruction Qn is an instruction to edit the note sequence N. Specifically, the editing instruction Qn is an instruction to add or delete a note of the note sequence N, an instruction to change a condition of a note (pitch, sounding period, or phoneme), or an instruction to change a performance symbol.
The editing area Ef displays a time series F of a feature amount of the synthesized sound (hereinafter "feature sequence"). The feature amount is an acoustic feature of the synthesized sound. Specifically, the fundamental frequency (pitch) of the synthesized sound is used as the feature amount, and the feature sequence F (that is, the temporal transition of the fundamental frequency) is displayed in the editing area Ef. By operating the operation device 15, the user can give an editing instruction Qf for the editing area Ef. The editing instruction Qf is an instruction to edit the feature sequence F. Specifically, the editing instruction Qf is, for example, an instruction to change the temporal variation of the feature amount within a section of the displayed feature sequence F chosen by the user.
The editing area Ew displays the waveform W of the synthesized sound on the time axis. By operating the operation device 15, the user can give an editing instruction Qw for the editing area Ew. The editing instruction Qw is an instruction to edit the waveform W. Specifically, the editing instruction Qw is an instruction to change the waveform within a section of the displayed waveform W chosen by the user.
In addition to the editing areas E described above, the editing screen G includes a plurality of operation areas (Gn, Gf, and Gw), each corresponding to a different editing area E, and an operation image B1 (Play). The operation image B1 is a software button that the user can operate using the operation device 15. Specifically, the operation image B1 is a control with which the user instructs playback of the synthesized sound: when the user operates the operation image B1, the synthesized sound of the waveform W displayed in the editing area Ew is reproduced from the sound emitting device 13.
The operation area Gn relates to the note sequence N. Specifically, the operation area Gn displays a note sequence version number Vn, an operation image Gn1, and an operation image Gn2. The note sequence version number Vn is a number representing the version of the note sequence N displayed in the editing area En. The note sequence version number Vn increases by 1 each time the note sequence N is edited in response to an editing instruction Qn. The user can also change the note sequence version number Vn in the operation area Gn to any value by operating the operation device 15. Of the versions of the note sequence N generated during past editing, the version corresponding to the note sequence version number Vn as changed by the user is displayed in the editing area En.
The operation image Gn1 and the operation image Gn2 are software buttons that the user can operate using the operation device 15. The operation image Gn1 is a control with which the user instructs that the note sequence N be returned to its state before the most recent edit (Undo). That is, when the user operates the operation image Gn1, the note sequence version number Vn is changed to the immediately preceding value, and the version of the note sequence N corresponding to the changed note sequence version number Vn is displayed in the editing area En. The operation image Gn1 can therefore also be described as a control for stepping the note sequence version number Vn back to the preceding value (that is, for cancelling the most recent edit of the note sequence N). The operation image Gn2, on the other hand, is a control with which the user instructs that an edit cancelled by operating the operation image Gn1 be executed again (Redo).
The operation area Gf relates to the feature sequence F. Specifically, the operation area Gf displays a feature sequence version number Vf, an operation image Gf1, and an operation image Gf2. The feature sequence version number Vf is a number representing the version of the feature sequence F displayed in the editing area Ef. The feature sequence version number Vf increases by 1 each time the feature sequence F is edited in response to an editing instruction Qf. The user can also change the feature sequence version number Vf in the operation area Gf to any value by operating the operation device 15. Of the versions of the feature sequence F generated during past editing, the version corresponding to the feature sequence version number Vf as changed by the user is displayed in the editing area Ef.
The operation image Gf1 and the operation image Gf2 are software buttons that the user can operate using the operation device 15. The operation image Gf1 is a control with which the user instructs that the feature sequence F be returned to its state before the most recent edit (Undo). That is, when the user operates the operation image Gf1, the feature sequence version number Vf is changed to the immediately preceding value, and the version of the feature sequence F corresponding to the changed feature sequence version number Vf is displayed in the editing area Ef. The operation image Gf1 can therefore also be described as a control for stepping the feature sequence version number Vf back to the preceding value (that is, for cancelling the most recent edit of the feature sequence F). The operation image Gf2, on the other hand, is a control with which the user instructs that an edit cancelled by operating the operation image Gf1 be executed again (Redo).
The operation area Gw relates to the waveform W. Specifically, the operation area Gw displays a waveform version number Vw, an operation image Gw1, and an operation image Gw2. The waveform version number Vw is a number representing the version of the waveform W displayed in the editing area Ew. The waveform version number Vw increases by 1 each time the waveform W is edited in response to an editing instruction Qw. The user can also change the waveform version number Vw in the operation area Gw to any value by operating the operation device 15. Of the versions of the waveform W generated during past editing, the version corresponding to the waveform version number Vw as changed by the user is displayed in the editing area Ew.
The operation image Gw1 and the operation image Gw2 are software buttons that the user can operate using the operation device 15. The operation image Gw1 is a control with which the user instructs that the waveform W be returned to its state before the most recent edit (Undo). That is, when the user operates the operation image Gw1, the waveform version number Vw is changed to the immediately preceding value, and the version of the waveform W corresponding to the changed waveform version number Vw is displayed in the editing area Ew. The operation image Gw1 can therefore also be described as a control for stepping the waveform version number Vw back to the preceding value (that is, for cancelling the most recent edit of the waveform W). The operation image Gw2, on the other hand, is a control with which the user instructs that an edit cancelled by operating the operation image Gw1 be executed again (Redo).
As illustrated above, the first embodiment uses a plurality of version numbers V (Vn, Vf, and Vw). An increment of a version number means that the editing work has advanced, and a decrement of a version number means that the editing work has stepped back.
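The behaviour of a single version number under editing, Undo, Redo, and direct entry of a value can be sketched as follows. This is a minimal illustration for one layer only, not the embodiment's implementation: the class name `VersionedLayer` and its methods are hypothetical, and discarding undone versions when a new edit follows an Undo is an assumption that the text does not specify.

```python
class VersionedLayer:
    """One editable layer (note sequence, feature sequence, or waveform)."""

    def __init__(self, initial):
        self.versions = [initial]  # versions[v] holds the state for version number v
        self.number = 0            # version number currently displayed

    def edit(self, new_state):
        # Assumption: an edit made after Undo discards the undone versions.
        del self.versions[self.number + 1:]
        self.versions.append(new_state)
        self.number += 1           # each edit increases the version number by 1

    def undo(self):
        if self.number > 0:
            self.number -= 1       # step back to the preceding version

    def redo(self):
        if self.number < len(self.versions) - 1:
            self.number += 1       # re-apply the edit cancelled by undo

    def jump(self, number):
        # Corresponds to the user typing an arbitrary version number.
        if not 0 <= number < len(self.versions):
            raise ValueError("no such version")
        self.number = number

    @property
    def current(self):
        return self.versions[self.number]


notes = VersionedLayer(initial=[])
notes.edit(["C4"])                # version 1
notes.edit(["C4", "E4"])          # version 2
notes.undo()                      # back to version 1
state_after_undo = (notes.number, notes.current)
notes.redo()                      # forward to version 2 again
state_after_redo = (notes.number, notes.current)
```

The sketch treats each layer in isolation; how an edit in one layer affects the version numbers of the other layers is a separate matter.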
FIG. 3 is a block diagram illustrating the functional configuration of the information processing system 100. By executing the program stored in the storage device 12, the control device 11 realizes a plurality of functions (a display control unit 20, an editing processing unit 30, and an information management unit 40) for editing the conditions of the synthesized sound and generating the acoustic signal Z. The display control unit 20 causes the display device 14 to display images under the control of the control device 11. For example, the display control unit 20 causes the display device 14 to display the editing screen G illustrated in FIG. 2. The display control unit 20 also updates the editing screen G in response to an instruction (Qn, Qf, or Qw) from the user.
The editing processing unit 30 in FIG. 3 edits the conditions of the synthesized sound (the note sequence N, the feature sequence F, and the waveform W) in response to an instruction (Qn, Qf, or Qw) from the user. The editing processing unit 30 includes a first editing unit 31, a first generation unit 32, a second editing unit 33, a second generation unit 34, and a third editing unit 35.
The first editing unit 31 edits note sequence data Dn. The note sequence data Dn is time-series data representing the note sequence N of the synthesized sound. Specifically, the first editing unit 31 edits the note sequence data Dn in response to an editing instruction Qn from the user for the editing area En. The display control unit 20 displays, in the editing area En, the note sequence N represented by the note sequence data Dn as edited by the first editing unit 31.
The first generation unit 32 generates feature sequence data Df from the note sequence data Dn as edited by the first editing unit 31. The feature sequence data Df is time-series data representing the feature sequence F of the synthesized sound. To generate the feature amount at each point on the time axis, the first generation unit 32 uses not only the data of the note at that point but also the data of at least one of the note preceding it and the note following it. That is, the feature sequence data Df is generated according to the context of the note sequence N represented by the note sequence data Dn.
Specifically, the first generation unit 32 generates the feature sequence data Df using a first generative model M1. The first generative model M1 is a statistical estimation model that receives the note sequence data Dn as input and outputs the feature sequence data Df. More specifically, the first generative model M1 is a trained model that has learned the relationship between the note sequence N and the feature sequence F. The first generative model M1 is constituted by, for example, a deep neural network (DNN). Any form of deep neural network, such as a convolutional neural network (CNN) or a recurrent neural network (RNN), may be used as the first generative model M1. Additional elements such as long short-term memory (LSTM) or self-attention may also be incorporated into the first generative model M1.
The first generative model M1 is realized by a combination of a program that causes the control device 11 to execute the operation of generating the feature sequence data Df from the note sequence data Dn, and a plurality of variables (specifically, weights and biases) applied to that operation. The variables that define the first generative model M1 are set in advance by machine learning using a plurality of first training data and are stored in the storage device 12. Each of the first training data includes note sequence data Dn and feature sequence data Df (the ground truth). In the machine learning of the first generative model M1, the variables of the model are updated iteratively so as to reduce the error between the feature sequence data Df that the provisional first generative model M1 outputs for the note sequence data Dn of each first training data and the feature sequence data Df of that first training data. The first generative model M1 therefore outputs statistically plausible feature sequence data Df for unknown note sequence data Dn, based on the tendency latent between the note sequence N and the feature sequence F in the first training data.
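The iterative update of weights and biases so as to reduce the error against the ground truth can be illustrated with a deliberately tiny example. The sketch below is a toy, not the first generative model M1: it assumes a single weight and a single bias, a mean squared error, and plain gradient descent, none of which is stated in the text. The function name `train` and the training pairs are hypothetical.

```python
def train(pairs, steps=2000, lr=0.05):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0                      # provisional variables before training
    n = len(pairs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, target in pairs:
            error = (w * x + b) - target # model output minus ground truth
            grad_w += 2.0 * error * x / n
            grad_b += 2.0 * error / n
        w -= lr * grad_w                 # update the variables so the error shrinks
        b -= lr * grad_b
    return w, b


# Hypothetical training pairs (input, ground truth) that follow y = 2x + 1.
training_pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train(training_pairs)
prediction = w * 4.0 + b                 # output for an input not in the training data
```

A deep neural network has far more variables and a nonlinear operation, but the same loop structure (compute the output, measure the error, adjust the variables) applies.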
The second editing unit 33 edits the feature sequence data Df generated by the first generation unit 32. Specifically, the second editing unit 33 edits the feature sequence data Df in response to an editing instruction Qf from the user for the editing area Ef. The display control unit 20 displays, in the editing area Ef, either the feature sequence F represented by the feature sequence data Df generated by the first generation unit 32 or the feature sequence F represented by the feature sequence data Df as edited by the second editing unit 33.
The second generation unit 34 generates waveform data Dw from the note sequence data Dn and the feature sequence data Df. The waveform data Dw is time-series data representing the waveform W of the synthesized sound. That is, the waveform data Dw consists of a time series of samples representing the acoustic signal Z. The acoustic signal Z is generated by D/A conversion and amplification of the waveform data Dw. The feature sequence data Df immediately after generation by the first generation unit 32 (that is, feature sequence data Df that has not been edited by the second editing unit 33) may also be used to generate the waveform data Dw.
The second generation unit 34 generates the waveform data Dw using a second generative model M2. The second generative model M2 is a statistical estimation model that receives the pair of the note sequence data Dn and the feature sequence data Df (hereinafter "input data Din") as input and outputs the waveform data Dw. More specifically, the second generative model M2 is a trained model that has learned the relationship between the pair of the note sequence N and the feature sequence F, and the waveform W. The second generative model M2 is constituted by, for example, a deep neural network. Any form of deep neural network, such as a convolutional neural network or a recurrent neural network, may be used as the second generative model M2. Additional elements such as long short-term memory or self-attention may also be incorporated into the second generative model M2.
The second generative model M2 is realized by a combination of a program that causes the control device 11 to execute the operation of generating the waveform data Dw from the input data Din, which includes the note sequence data Dn and the feature sequence data Df, and a plurality of variables (specifically, weights and biases) applied to that operation. The variables that define the second generative model M2 are set in advance by machine learning using a plurality of second training data and are stored in the storage device 12. Each of the second training data includes input data Din and waveform data Dw (the ground truth). In the machine learning of the second generative model M2, the variables of the model are updated iteratively so as to reduce the error between the waveform data Dw that the provisional second generative model M2 outputs for the input data Din of each second training data and the waveform data Dw of that second training data. The second generative model M2 therefore outputs statistically plausible waveform data Dw for unknown input data Din, based on the tendency latent between the pair of the note sequence N and the feature sequence F, and the waveform W, in the second training data.
The third editing unit 35 edits the waveform data Dw generated by the second generation unit 34. Specifically, the third editing unit 35 edits the waveform data Dw in response to an editing instruction Qw from the user for the editing area Ew. The display control unit 20 displays, in the editing area Ew, either the waveform W represented by the waveform data Dw generated by the second generation unit 34 or the waveform W represented by the waveform data Dw as edited by the third editing unit 35. When the user operates the operation image B1 (Play), the acoustic signal Z corresponding to the waveform data Dw generated by the second generation unit 34, or to the waveform data Dw as edited by the third editing unit 35, is supplied to the sound emitting device 13, and the synthesized sound is reproduced.
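The flow of data through the editing processing unit 30 (note sequence data Dn to feature sequence data Df to waveform data Dw) can be sketched with simple stand-ins. The two functions below are not the trained models M1 and M2: as an assumption for illustration only, `m1_stand_in` emits a constant fundamental frequency per note and `m2_stand_in` renders a sine oscillator that ignores the note data. The 5 ms frame period and the `(MIDI note number, duration)` note format are likewise hypothetical.

```python
import math

FRAME_SEC = 0.005       # assumed feature frame period (5 ms)
SAMPLE_RATE = 50_000    # samples per second (20 microsecond sampling period)


def m1_stand_in(dn):
    """Stand-in for M1: note sequence data Dn -> fundamental frequency per frame (Hz)."""
    df = []
    for midi_note, duration_sec in dn:
        hz = 440.0 * 2.0 ** ((midi_note - 69) / 12.0)   # MIDI note number to Hz
        df.extend([hz] * round(duration_sec / FRAME_SEC))
    return df


def m2_stand_in(dn, df):
    """Stand-in for M2: input data Din = (Dn, Df) -> waveform samples Dw."""
    samples_per_frame = round(SAMPLE_RATE * FRAME_SEC)
    phase, dw = 0.0, []
    for hz in df:
        for _ in range(samples_per_frame):
            phase += 2.0 * math.pi * hz / SAMPLE_RATE   # phase-continuous oscillator
            dw.append(math.sin(phase))
    return dw


dn = [(60, 0.5), (64, 0.5)]   # two notes: C4 and E4, half a second each
df = m1_stand_in(dn)          # role of the first generation unit 32
dw = m2_stand_in(dn, df)      # role of the second generation unit 34
```

Because Df sits between the two stages, an edit applied to `df` before calling `m2_stand_in` changes the waveform without touching the notes, which mirrors the role of the second editing unit 33.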
The information management unit 40 manages versions for each of the note sequence data Dn, the feature sequence data Df, and the waveform data Dw. Specifically, the information management unit 40 manages the note sequence version number Vn, the feature sequence version number Vf, and the waveform version number Vw.
The information management unit 40 also saves, in the storage device 12, data of different versions (hereinafter "history data") for each of the note sequence data Dn, the feature sequence data Df, and the waveform data Dw. A history area and a work area are set in the storage device 12. The history area is a storage area in which the history of edits to the conditions of the synthesized sound is stored. The work area is a storage area in which the note sequence data Dn, the feature sequence data Df, and the waveform data Dw are temporarily held during editing with the editing screen G.
Specifically, each time the note sequence N is edited in response to an editing instruction Qn, the information management unit 40 saves the edited note sequence data Dn in the history area as first history data Hn[Vn,Vf,Vw]. That is, the new version of the note sequence data Dn is saved in the storage device 12 as the first history data Hn[Vn,Vf,Vw].
The information management unit 40 also saves, in the history area, second history data Hf[Vn,Vf,Vw] corresponding to the feature sequence data Df as edited in response to an editing instruction Qf, as data of a new version. The second history data Hf[Vn,Vf,Vw] of the first embodiment is data indicating how the feature sequence data Df was edited in response to the editing instruction Qf (that is, the time series of editing instructions Qf). In other words, the second history data Hf[Vn,Vf,Vw] is data representing the difference in the feature sequence data Df before and after editing.
Similarly, the information management unit 40 saves, in the history area, third history data Hw[Vn,Vf,Vw] corresponding to the waveform data Dw as edited in response to an editing instruction Qw, as data of a new version. The third history data Hw[Vn,Vf,Vw] of the first embodiment is data indicating how the waveform data Dw was edited in response to the editing instruction Qw (that is, the time series of editing instructions Qw). In other words, the third history data Hw[Vn,Vf,Vw] is data representing the difference in the waveform data Dw before and after editing.
FIGS. 4 to 6 are flowcharts illustrating the specific procedures of the editing processes Sa (Sa1, Sa2, and Sa3), which edit the conditions of the synthesized sound in response to an editing instruction Q (Qn, Qf, or Qw) from the user. FIG. 4 is a flowchart of the first editing process Sa1, which concerns editing of the note sequence N. The first editing process Sa1 is triggered by an editing instruction Qn for the note sequence N. When the first editing process Sa1 starts, the first editing unit 31 edits the current note sequence data Dn in response to the editing instruction Qn (Sa101).
The information management unit 40 increases the note sequence version number Vn by 1 (Sa102). When the editing instruction Qn is given for the first time, the note sequence data Dn is newly generated (Sa101) and the note sequence version number Vn is initialized to 0 (Sa102). The information management unit 40 also initializes the feature sequence version number Vf to 0 (Sa103) and the waveform version number Vw to 0 (Sa104). The information management unit 40 then saves the note sequence data Dn as edited by the first editing unit 31 in the history area of the storage device 12 as the first history data Hn[Vn,Vf=0,Vw=0] of the note sequence N (Sa105).
As understood from the above description, each time the note sequence data Dn is edited in response to an editing instruction Qn, the edited version of the note sequence data Dn is saved in the history area as the first history data Hn[Vn,Vf=0,Vw=0] (Sa105), the note sequence version number Vn is increased (Sa102), and the feature sequence version number Vf and the waveform version number Vw are initialized (Sa103 and Sa104).
The first generation unit 32 generates the feature sequence data Df by supplying the note sequence data Dn as edited by the first editing unit 31 to the first generative model M1 (Sa106). The feature sequence data Df generated by the first generation unit 32 is saved in the work area of the storage device 12. The second generation unit 34 generates the waveform data Dw by supplying the second generative model M2 with input data Din that includes the note sequence data Dn as edited by the first editing unit 31 and the feature sequence data Df generated by the first generation unit 32 (Sa107). The waveform data Dw generated by the second generation unit 34 is saved in the work area of the storage device 12.
Note that the note sequence data Dn requires one data item per note. The feature sequence data Df represents the variation of pitch within each note and therefore consists of one sample every several milliseconds to several tens of milliseconds. The waveform data Dw represents the waveform of each note and therefore consists of one sample per sampling period (for example, 1/50 kHz = 20 microseconds). As these examples show, the amount of feature sequence data Df generated from one item of note sequence data Dn is several hundred to several thousand times the amount of that note sequence data Dn, and the amount of waveform data Dw generated from one item of feature sequence data Df is several hundred to several thousand times the amount of that feature sequence data Df. In view of this, in the first embodiment the upper-layer data (the note sequence data Dn) is saved as-is as the first history data Hn[Vn,Vf=0,Vw=0]. For the lower-layer data (the feature sequence data Df and the waveform data Dw), on the other hand, only the difference from the upper-layer data is saved as history data, because the data volume is large as described above. Compared with a configuration in which the lower-layer data itself is also saved, this configuration has the advantage of greatly reducing the amount of data stored in the storage device 12.
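The relative data volumes can be checked with a short calculation. The 20 microsecond sampling period is taken from the text; the 5 ms frame period (within "several milliseconds to several tens of milliseconds") and the one-second note duration are assumptions chosen only to make the arithmetic concrete.

```python
SAMPLE_PERIOD_SEC = 20e-6   # from the text: 1 / 50 kHz
FRAME_PERIOD_SEC = 5e-3     # assumption: one feature sample every 5 ms
NOTE_DURATION_SEC = 1.0     # assumption: a single note sounding for one second

items_dn = 1                                                  # one data item per note
samples_df = round(NOTE_DURATION_SEC / FRAME_PERIOD_SEC)     # feature samples for the note
samples_dw = round(NOTE_DURATION_SEC / SAMPLE_PERIOD_SEC)    # waveform samples for the note

ratio_df_to_dn = samples_df // items_dn
ratio_dw_to_df = samples_dw // samples_df

print(items_dn, samples_df, samples_dw)    # 1 200 50000
print(ratio_df_to_dn, ratio_dw_to_df)      # 200 250
```

Under these assumptions each layer is a few hundred times larger than the one above it, which is at the low end of the range stated in the text; shorter frame periods or longer notes push the ratios higher.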
The display control unit 20 updates the editing screen G (Sa108 to Sa110). Specifically, the display control unit 20 displays, in the editing area En, the note sequence N represented by the note sequence data Dn edited by the first editing unit 31 (Sa108). The display control unit 20 also displays, in the editing area Ef, the feature sequence F represented by the current feature sequence data Df stored in the work area (Sa109). Similarly, the display control unit 20 displays, in the editing area Ew, the waveform W represented by the current waveform data Dw stored in the work area (Sa110).
FIG. 5 is a flowchart of the second editing process Sa2, which concerns editing of the feature sequence F. The second editing process Sa2 is triggered by an editing instruction Qf for the feature sequence F. When the second editing process Sa2 starts, the second editing unit 33 edits the current feature sequence data Df in accordance with the editing instruction Qf (Sa201).
The information management unit 40 increments the feature sequence version number Vf by 1 (Sa202). The information management unit 40 also maintains the note sequence version number Vn at its current value Cn (Sa203) and initializes the waveform version number Vw to 0 (Sa204). The information management unit 40 then saves second history data Hf[Vn,Vf,Vw=0], which represents the current editing instruction Qf, in the history area as data of a new version (Sa205).
As understood from the above description, each time the feature sequence data Df is edited in accordance with an editing instruction Qf, second history data Hf[Vn,Vf,Vw=0] corresponding to the edited feature sequence data Df is saved in the history area (Sa205), the feature sequence version number Vf is incremented (Sa202) while the note sequence version number Vn is maintained (Sa203), and the waveform version number Vw is initialized (Sa204). Step Sa203 may be omitted.
The second generation unit 34 generates waveform data Dw by supplying input data Din, which includes the current note sequence data Dn and the feature sequence data Df edited by the second editing unit 33, to the second generation model M2 (Sa206). The waveform data Dw generated by the second generation unit 34 is stored in the work area of the storage device 12.
The display control unit 20 updates the editing screen G (Sa207 and Sa208). Specifically, the display control unit 20 displays, in the editing area Ef, the feature sequence F represented by the feature sequence data Df edited by the second editing unit 33 (Sa207). The display control unit 20 also displays, in the editing area Ew, the waveform W represented by the current waveform data Dw stored in the work area (Sa208). In the second editing process Sa2, the note sequence N in the editing area En is not updated.
FIG. 6 is a flowchart of the third editing process Sa3, which concerns editing of the waveform W. The third editing process Sa3 is triggered by an editing instruction Qw for the waveform W. When the third editing process Sa3 starts, the third editing unit 35 edits the current waveform data Dw in accordance with the editing instruction Qw (Sa301).
The information management unit 40 increments the waveform version number Vw by 1 (Sa302). The information management unit 40 also maintains the note sequence version number Vn at its current value Cn (Sa303) and maintains the feature sequence version number Vf at its current value Cf (Sa304). The information management unit 40 then saves third history data Hw[Vn,Vf,Vw], which represents the current editing instruction Qw, in the history area as data of a new version (Sa305).
As understood from the above description, each time the waveform data Dw is edited in accordance with an editing instruction Qw, third history data Hw[Vn,Vf,Vw] corresponding to the edited waveform data Dw is saved in the history area (Sa305), and the waveform version number Vw is incremented (Sa302) while the note sequence version number Vn and the feature sequence version number Vf are maintained (Sa303 and Sa304). Steps Sa303 and Sa304 may be omitted.
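The version-number rules of the editing processes can be summarized in a minimal sketch. The class and function names are hypothetical, and the rule for a note sequence edit (Vn incremented, Vf and Vw reset to 0) is inferred from the form of the first history data Hn[Vn,Vf=0,Vw=0] rather than quoted from the flowchart steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versions:
    """Hypothetical container for the three version numbers (Vn, Vf, Vw)."""
    vn: int = 0
    vf: int = 0
    vw: int = 0


def on_note_edit(v: Versions) -> Versions:
    # Assumed rule for an editing instruction Qn: Vn + 1, lower layers reset.
    return Versions(v.vn + 1, 0, 0)


def on_feature_edit(v: Versions) -> Versions:
    # Sa202 to Sa204: Vf + 1, Vn maintained, Vw initialized to 0.
    return Versions(v.vn, v.vf + 1, 0)


def on_waveform_edit(v: Versions) -> Versions:
    # Sa302 to Sa304: Vw + 1, Vn and Vf maintained.
    return Versions(v.vn, v.vf, v.vw + 1)


v = on_note_edit(Versions())     # first note sequence edit
v = on_feature_edit(v)           # one feature sequence edit
v = on_waveform_edit(v)          # one waveform edit
print(v)
```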
The display control unit 20 displays, in the editing area Ew, the waveform W represented by the waveform data Dw edited by the third editing unit 35 (Sa306). In the third editing process Sa3, the note sequence N in the editing area En and the feature sequence F in the editing area Ef are not updated.
FIG. 7 is an explanatory diagram of the data structure in the history area of the storage device 12. The history area stores a plurality of first history data Hn[Vn,Vf=0,Vw=0] (note sequence data Dn) corresponding to different versions of the note sequence N. For each of the plurality of first history data Hn[Vn,Vf=0,Vw=0], a plurality of second history data Hf[Vn,Vf,Vw=0] corresponding to different versions of the feature sequence F under a common note sequence N are stored in the history area. Likewise, for each of the plurality of second history data Hf[Vn,Vf,Vw=0], a plurality of third history data Hw[Vn,Vf,Vw] corresponding to different versions of the waveform W under a common feature sequence F are stored in the history area. As these examples show, a hierarchical relationship holds in which the note sequence N is positioned above the feature sequence F, and the feature sequence F is positioned above the waveform W. When the feature sequence F is edited, the feature sequence version number Vf is incremented, the note sequence version number Vn of the upper layer is maintained, and the waveform version number Vw of the lower layer is initialized to 0.
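One way to picture the history area of FIG. 7 is a mapping keyed by the triple (Vn, Vf, Vw). This is only a sketch: the dictionary layout, the `layer_of` helper, and the string payloads are assumptions for illustration, not the storage format of the embodiment.

```python
from typing import Any, Dict, Tuple

Key = Tuple[int, int, int]  # (Vn, Vf, Vw)


def layer_of(key: Key) -> str:
    """Classify a history entry by which version numbers are zero."""
    _vn, vf, vw = key
    if vf == 0 and vw == 0:
        return "Hn"   # first history data: the note sequence data Dn itself
    if vw == 0:
        return "Hf"   # second history data: an editing instruction Qf
    return "Hw"       # third history data: an editing instruction Qw


# Hypothetical contents of the history area.
history: Dict[Key, Any] = {
    (1, 0, 0): "note sequence data Dn, version 1",
    (1, 1, 0): "editing instruction Qf no. 1 under Vn=1",
    (1, 1, 1): "editing instruction Qw no. 1 under Vn=1, Vf=1",
    (2, 0, 0): "note sequence data Dn, version 2",
}

for key in sorted(history):
    print(key, layer_of(key))
```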
FIGS. 8 to 10 are flowcharts illustrating specific procedures of the management process Sb (Sb1, Sb2, and Sb3), which manages versions in accordance with instructions from the user. FIG. 8 is a flowchart of the first management process Sb1, which concerns the version of the note sequence N. The first management process Sb1 is triggered by an instruction to change the note sequence version number Vn.
In the following, the value of the note sequence version number Vn after a change made in response to an instruction from the user is denoted as the "set value Xn". When the user directly changes the note sequence version number Vn in the operation area Gn, the changed value (that is, the value specified by the user) is the set value Xn. When the user operates the operation image Gn1, the value immediately preceding the current value Cn of the note sequence version number Vn (= Cn-1) is the set value Xn. When the user operates the operation image Gn2, the value immediately following the current value Cn of the note sequence version number Vn (= Cn+1) is the set value Xn.
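The three ways of determining the set value can be sketched as a small function. The operation labels ("direct", "previous", "next") and the function name are hypothetical; the same rule applies by analogy to the set values of the other version numbers.

```python
from typing import Optional


def set_value(current: int, operation: str, typed: Optional[int] = None) -> int:
    """Return the set value X for a version number whose current value is C."""
    if operation == "direct":      # the user typed a number into the operation area
        if typed is None:
            raise ValueError("a direct change needs the value the user specified")
        return typed
    if operation == "previous":    # e.g. operation image Gn1: X = C - 1
        return current - 1
    if operation == "next":        # e.g. operation image Gn2: X = C + 1
        return current + 1
    raise ValueError(f"unknown operation: {operation}")


print(set_value(5, "previous"), set_value(5, "next"), set_value(5, "direct", typed=2))
```

Range checks (for example, rejecting a version number that has no history data) are omitted here because the text does not describe them.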
When the first management process Sb1 starts, the information management unit 40 changes the note sequence version number Vn from the current value Cn to the set value Xn (Sb101).
The information management unit 40 sets the feature sequence version number Vf to the latest value Yf corresponding to the set value Xn of the note sequence N (Sb102). The latest value Yf is the number of the latest version among the plurality of versions of the feature sequence F generated, one per editing instruction Qf, under the note sequence N of the version corresponding to the set value Xn.
The information management unit 40 sets the waveform version number Vw to the latest value Yw corresponding to the set value Xn of the note sequence N (Sb103). The latest value Yw is the number of the latest version among the plurality of versions of the waveform W generated, one per editing instruction Qw, under the note sequence N of the version corresponding to the set value Xn.
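Steps Sb101 to Sb103 can be sketched as lookups over the keys of the history area. This assumes the history is keyed by (Vn, Vf, Vw) triples, and it takes Yw as the latest waveform version under the pair (Xn, Yf), consistent with the third history data acquired in the subsequent step; the helper names are hypothetical.

```python
from typing import Iterable, Tuple

Key = Tuple[int, int, int]  # (Vn, Vf, Vw)


def latest_yf(keys: Iterable[Key], xn: int) -> int:
    """Latest feature sequence version under note sequence version xn (0 if never edited)."""
    return max((vf for vn, vf, vw in keys if vn == xn and vw == 0), default=0)


def latest_yw(keys: Iterable[Key], xn: int, yf: int) -> int:
    """Latest waveform version under (xn, yf) (0 if the waveform was never edited)."""
    return max((vw for vn, vf, vw in keys if vn == xn and vf == yf), default=0)


def select_note_version(keys: Iterable[Key], xn: int) -> Key:
    """Sb101 to Sb103: Vn = Xn, Vf = Yf, Vw = Yw."""
    keys = list(keys)
    yf = latest_yf(keys, xn)
    return (xn, yf, latest_yw(keys, xn, yf))


# Hypothetical history: two feature edits and two waveform edits under Vn=1.
keys = [(1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 2, 0), (1, 2, 1), (1, 2, 2), (2, 0, 0)]
print(select_note_version(keys, 1), select_note_version(keys, 2))
```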
The information management unit 40 acquires, from the history area of the storage device 12, the first history data Hn[Vn=Xn,Vf=0,Vw=0] of the note sequence N, the second history data Hf[Vn=Xn,Vf=1,Vw=0] to Hf[Vn=Xn,Vf=Yf,Vw=0] of the feature sequence F, and the third history data Hw[Vn=Xn,Vf=Yf,Vw=1] to Hw[Vn=Xn,Vf=Yf,Vw=Yw] of the waveform W (Sb104). The second history data Hf[Vn=Xn,Vf=1,Vw=0] to Hf[Vn=Xn,Vf=Yf,Vw=0] is actually acquired only when the feature sequence F has been edited, and is not acquired when the feature sequence F has not been edited. The first history data Hn[Vn=Xn,Vf=0,Vw=0] of the note sequence N is the note sequence data Dn representing the version of the note sequence N whose note sequence version number Vn is the set value Xn. The second history data Hf[Vn=Xn,Vf=1,Vw=0] to Hf[Vn=Xn,Vf=Yf,Vw=0] of the feature sequence F is data representing the time series of the editing instructions Qf up to and including the Yf-th, among the one or more editing instructions Qf given sequentially by the user under the note sequence N whose note sequence version number Vn is the set value Xn. The third history data Hw[Vn=Xn,Vf=Yf,Vw=1] to Hw[Vn=Xn,Vf=Yf,Vw=Yw] of the waveform W is data representing the time series of the editing instructions Qw up to and including the Yw-th, among the one or more editing instructions Qw given sequentially by the user under the version of the note sequence N whose note sequence version number Vn is the set value Xn and the version of the feature sequence F whose feature sequence version number Vf is the latest value Yf.
The first generation unit 32 generates feature sequence data Df by supplying the first history data Hn[Vn=Xn,Vf=0,Vw=0] (note sequence data Dn) acquired by the information management unit 40 to the first generation model M1 (Sb105). The second editing unit 33 sequentially edits the feature sequence data Df in accordance with the editing instructions Qf represented by the one or more second history data Hf[Vn=Xn,Vf=1,Vw=0] to Hf[Vn=Xn,Vf=Yf,Vw=0] acquired by the information management unit 40 (Sb106). That is, feature sequence data Df edited in accordance with the editing instructions Qf up to the Yf-th is generated under the note sequence N corresponding to the set value Xn. Note that the editing by the second editing unit 33 covers only a small part of the feature sequence data Df, which spans a plurality of notes. For example, only a very small portion of the piece as a whole is edited, such as the attack portion of a specific note in the piece or the first two notes of the third phrase in the piece.
The second generation unit 34 generates waveform data Dw by supplying input data Din, which includes the first history data Hn[Vn=Xn,Vf=0,Vw=0] (note sequence data Dn) acquired by the information management unit 40 and the edited feature sequence data Df, to the second generation model M2 (Sb107). The third editing unit 35 sequentially edits the waveform data Dw in accordance with the editing instructions Qw represented by the one or more third history data Hw[Vn=Xn,Vf=Yf,Vw=1] to Hw[Vn=Xn,Vf=Yf,Vw=Yw] acquired by the information management unit 40 (Sb108). That is, waveform data Dw edited in accordance with the editing instructions Qw up to the Yw-th is generated under the note sequence N corresponding to the set value Xn and the feature sequence F corresponding to the latest value Yf. When the second history data Hf[Vn=Xn,Vf=1,Vw=0] to Hf[Vn=Xn,Vf=Yf,Vw=0] does not exist, the third history data Hw[Vn=Xn,Vf=Yf,Vw=1] to Hw[Vn=Xn,Vf=Yf,Vw=Yw] is not acquired. In that case, the waveform data Dw is not edited in step Sb108 and is finalized as is. When an edit that moves the waveform W along the time axis is instructed, only the editing instruction Qw itself, for example "move the section from time point 1 to time point 2 by X milliseconds", is saved as the third history data Hw[Vn=Xn,Vf=Yf,Vw=1] to Hw[Vn=Xn,Vf=Yf,Vw=Yw]. Compared with a configuration that stores the sample data of the moved waveform W over the entire piece, the amount of data stored in the storage device 12 can therefore be significantly reduced. The same applies to volume edits and filter edits of the waveform W. For a volume edit of the waveform W, the transition of the volume change within the edited section is saved, and for a filter edit of the waveform W, the filter parameters within the edited section are saved.
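The idea of saving only the editing instruction and replaying it on regenerated waveform data can be sketched as follows. The instruction format (a dictionary with an "op" field), the use of sample indices instead of milliseconds, the zero-filling of the vacated section, and the constant gain standing in for a volume transition are all assumptions made for illustration.

```python
from typing import Dict, List


def apply_time_shift(samples: List[float], start: int, end: int, shift: int) -> List[float]:
    """Move samples[start:end] by `shift` positions; the vacated positions become silence."""
    out = list(samples)
    segment = samples[start:end]
    for i in range(start, end):
        out[i] = 0.0
    for offset, value in enumerate(segment):
        j = start + shift + offset
        if 0 <= j < len(out):          # samples moved past either end are dropped
            out[j] = value
    return out


def replay(samples: List[float], instructions: List[Dict]) -> List[float]:
    """Apply saved editing instructions Qw in the order in which they were given."""
    for ins in instructions:
        if ins["op"] == "shift":
            samples = apply_time_shift(samples, ins["start"], ins["end"], ins["shift"])
        elif ins["op"] == "gain":      # simplified volume edit: constant factor over a section
            samples = [s * ins["factor"] if ins["start"] <= i < ins["end"] else s
                       for i, s in enumerate(samples)]
        else:
            raise ValueError(f"unknown instruction: {ins['op']}")
    return samples


generated = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # stand-in for regenerated Dw
saved_qw = [
    {"op": "shift", "start": 2, "end": 4, "shift": 3},
    {"op": "gain", "start": 0, "end": 2, "factor": 0.5},
]
print(replay(generated, saved_qw))
```

Each saved instruction occupies a handful of numbers regardless of the length of the piece, whereas the edited waveform itself would need one value per sample.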
The display control unit 20 updates the editing screen G (Sb109 to Sb111). Specifically, the display control unit 20 displays, in the editing area En, the note sequence N represented by the first history data Hn[Vn=Xn,Vf=0,Vw=0] (note sequence data Dn) acquired by the information management unit 40, and updates the display of the note sequence version number Vn in the operation area Gn to the set value Xn (Sb109). That is, the note sequence N as edited by the Xn-th editing instruction Qn is displayed in the editing area En.
The display control unit 20 also displays, in the editing area Ef, the feature sequence F represented by the feature sequence data Df edited by the second editing unit 33, and updates the display of the feature sequence version number Vf in the operation area Gf to the latest value Yf (Sb110). That is, the feature sequence F corresponding to the set value Xn and the latest value Yf is displayed in the editing area Ef. Similarly, the display control unit 20 displays, in the editing area Ew, the waveform W represented by the waveform data Dw edited by the third editing unit 35, and updates the display of the waveform version number Vw in the operation area Gw to the latest value Yw (Sb111). That is, the waveform W corresponding to the set value Xn, the latest value Yf, and the latest value Yw is displayed in the editing area Ew. In this state, the user can give an editing instruction (Qn, Qf, or Qw) for each of the note sequence N, the feature sequence F, and the waveform W.
FIG. 9 is a flowchart of the second management process Sb2, which concerns the version of the feature sequence F. The second management process Sb2 is triggered by an instruction to change the feature sequence version number Vf.
In the following, the value of the feature sequence version number Vf after a change made in response to an instruction from the user is denoted as the "set value Xf". When the user directly changes the feature sequence version number Vf in the operation area Gf, the changed value (that is, the value specified by the user) is the set value Xf. When the user operates the operation image Gf1, the value immediately preceding the current value Cf of the feature sequence version number Vf (= Cf-1) is the set value Xf. When the user operates the operation image Gf2, the value immediately following the current value Cf of the feature sequence version number Vf (= Cf+1) is the set value Xf.
When the second management process Sb2 starts, the information management unit 40 changes the feature sequence version number Vf from the current value Cf to the set value Xf (Sb201). The information management unit 40 also maintains the note sequence version number Vn at the current value Cn (Sb202) and changes the waveform version number Vw from the current value Cw to the latest value Yw (Sb203). The latest value Yw of the waveform version number Vw is the number of the latest version among the plurality of versions of the waveform W generated, one per editing instruction Qw, under the feature sequence F of the version corresponding to the set value Xf.
The information management unit 40 acquires, from the history area of the storage device 12, the first history data Hn[Vn=Cn,Vf=0,Vw=0] of the note sequence N, the second history data Hf[Vn=Cn,Vf=1,Vw=0] to Hf[Vn=Cn,Vf=Xf,Vw=0] of the feature sequence F, and the third history data Hw[Vn=Cn,Vf=Xf,Vw=1] to Hw[Vn=Cn,Vf=Xf,Vw=Yw] of the waveform W (Sb204). The first history data Hn[Vn=Cn,Vf=0,Vw=0] of the note sequence N is the note sequence data Dn representing the current version of the note sequence N. The second history data Hf[Vn=Cn,Vf=1,Vw=0] to Hf[Vn=Cn,Vf=Xf,Vw=0] of the feature sequence F is data representing the time series of the editing instructions Qf up to and including the Xf-th, among the one or more editing instructions Qf given sequentially by the user under the current version of the note sequence N. The third history data Hw[Vn=Cn,Vf=Xf,Vw=1] to Hw[Vn=Cn,Vf=Xf,Vw=Yw] of the waveform W is data representing the time series of the editing instructions Qw up to and including the Yw-th, among the one or more editing instructions Qw given sequentially by the user under the version of the note sequence N whose note sequence version number Vn is the current value Cn and the version of the feature sequence F whose feature sequence version number Vf is the set value Xf.
The first generation unit 32 generates feature sequence data Df by supplying the first history data Hn[Vn=Cn,Vf=0,Vw=0] (note sequence data Dn) acquired by the information management unit 40 to the first generation model M1 (Sb205). The second editing unit 33 sequentially edits the feature sequence data Df in accordance with the editing instructions Qf represented by the one or more second history data Hf[Vn=Cn,Vf=1,Vw=0] to Hf[Vn=Cn,Vf=Xf,Vw=0] acquired by the information management unit 40 (Sb206). That is, feature sequence data Df edited in accordance with the editing instructions Qf up to the Xf-th is generated under the note sequence N corresponding to the current value Cn.
The second generation unit 34 generates waveform data Dw by supplying input data Din, which includes the first history data Hn[Vn=Cn,Vf=0,Vw=0] (note sequence data Dn) acquired by the information management unit 40 and the edited feature sequence data Df, to the second generation model M2 (Sb207). The third editing unit 35 sequentially edits the waveform data Dw in accordance with the editing instructions Qw represented by the one or more third history data Hw[Vn=Cn,Vf=Xf,Vw=1] to Hw[Vn=Cn,Vf=Xf,Vw=Yw] acquired by the information management unit 40 (Sb208). That is, waveform data Dw edited in accordance with the editing instructions Qw up to the Yw-th is generated under the note sequence N corresponding to the current value Cn and the feature sequence F corresponding to the set value Xf.
The display control unit 20 updates the editing screen G (Sb209 and Sb210). Specifically, the display control unit 20 displays, in the editing area Ef, the feature sequence F represented by the feature sequence data Df edited by the second editing unit 33, and updates the display of the feature sequence version number Vf in the operation area Gf to the set value Xf (Sb209). That is, the feature sequence F corresponding to the current value Cn and the set value Xf is displayed in the editing area Ef. The display control unit 20 also displays, in the editing area Ew, the waveform W represented by the waveform data Dw edited by the third editing unit 35, and updates the display of the waveform version number Vw in the operation area Gw to the latest value Yw (Sb210). That is, the waveform W corresponding to the current value Cn, the set value Xf, and the latest value Yw is displayed in the editing area Ew. In this state, the user can give an editing instruction (Qn, Qf, or Qw) for each of the note sequence N, the feature sequence F, and the waveform W.
FIG. 10 is a flowchart of the third management process Sb3, which concerns the version of the waveform W. The third management process Sb3 is triggered by an instruction to change the waveform version number Vw.
In the following, the value of the waveform version number Vw after a change made in response to an instruction from the user is denoted as the "set value Xw". When the user directly changes the waveform version number Vw in the operation area Gw, the changed value (that is, the value specified by the user) is the set value Xw. When the user operates the operation image Gw1, the value immediately preceding the current value Cw of the waveform version number Vw (= Cw-1) is the set value Xw. When the user operates the operation image Gw2, the value immediately following the current value Cw of the waveform version number Vw (= Cw+1) is the set value Xw.
When the third management process Sb3 starts, the information management unit 40 changes the waveform version number Vw from the current value Cw to the set value Xw (Sb301). The information management unit 40 also maintains the note sequence version number Vn at the current value Cn (Sb302) and maintains the feature sequence version number Vf at the current value Cf (Sb303).
The information management unit 40 acquires, from the history area of the storage device 12, the first history data Hn[Vn=Cn,Vf=0,Vw=0] of the note sequence N, the second history data Hf[Vn=Cn,Vf=1,Vw=0] to Hf[Vn=Cn,Vf=Cf,Vw=0] of the feature sequence F, and the third history data Hw[Vn=Cn,Vf=Cf,Vw=1] to Hw[Vn=Cn,Vf=Cf,Vw=Xw] of the waveform W (Sb304). The first history data Hn[Vn=Cn,Vf=0,Vw=0] of the note sequence N is the note sequence data Dn representing the current version of the note sequence N. The second history data Hf[Vn=Cn,Vf=1,Vw=0] to Hf[Vn=Cn,Vf=Cf,Vw=0] of the feature sequence F is data representing the time series of the editing instructions Qf up to and including the Cf-th, among the one or more editing instructions Qf given sequentially by the user under the note sequence N whose note sequence version number Vn is the current value Cn. The third history data Hw[Vn=Cn,Vf=Cf,Vw=1] to Hw[Vn=Cn,Vf=Cf,Vw=Xw] of the waveform W is data representing the time series of the editing instructions Qw up to and including the Xw-th, among the one or more editing instructions Qw given sequentially by the user under the current version of the note sequence N and the current version of the feature sequence F.
The first generation unit 32 generates feature sequence data Df by supplying the first history data Hn[Vn=Cn,Vf=0,Vw=0] (note sequence data Dn) acquired by the information management unit 40 to the first generation model M1 (Sb305). The second editing unit 33 sequentially edits the feature sequence data Df in accordance with the editing instructions Qf represented by the one or more second history data Hf[Vn=Cn,Vf=1,Vw=0] to Hf[Vn=Cn,Vf=Cf,Vw=0] acquired by the information management unit 40 (Sb306). That is, feature sequence data Df edited in accordance with the editing instructions Qf up to the Cf-th is generated under the note sequence N corresponding to the current value Cn.
The second generation unit 34 generates waveform data Dw by supplying input data Din, which includes the first history data Hn[Vn=Cn,Vf=0,Vw=0] (note sequence data Dn) acquired by the information management unit 40 and the edited feature sequence data Df, to the second generation model M2 (Sb307). The third editing unit 35 sequentially edits the waveform data Dw in accordance with the editing instructions Qw represented by the one or more third history data Hw[Vn=Cn,Vf=Cf,Vw=1] to Hw[Vn=Cn,Vf=Cf,Vw=Xw] acquired by the information management unit 40 (Sb308). That is, waveform data Dw edited in accordance with the editing instructions Qw up to the Xw-th is generated under the note sequence N corresponding to the current value Cn and the feature sequence F corresponding to the current value Cf.
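All three management processes share the same reconstruction order: generate Df from Dn, replay the saved editing instructions Qf, generate Dw, then replay the saved editing instructions Qw. The sketch below shows that order with toy stand-ins. The functions `toy_model1`, `toy_model2`, `toy_edit_feature`, and `toy_edit_waveform` are hypothetical placeholders and bear no relation to the actual generation models M1 and M2 or to the real editing operations.

```python
from typing import Any, Callable, Dict, List, Tuple

Key = Tuple[int, int, int]  # (Vn, Vf, Vw)


def rebuild(history: Dict[Key, Any], vn: int, vf: int, vw: int,
            model1: Callable, model2: Callable,
            edit_feature: Callable, edit_waveform: Callable):
    """Reconstruct Df and Dw for the version triple (vn, vf, vw) from the history area."""
    notes = history[(vn, 0, 0)]                 # first history data: Dn itself
    features = model1(notes)                    # generate Df from Dn
    for k in range(1, vf + 1):                  # replay Qf no. 1 .. no. vf
        features = edit_feature(features, history[(vn, k, 0)])
    waveform = model2(notes, features)          # generate Dw from Dn and Df
    for k in range(1, vw + 1):                  # replay Qw no. 1 .. no. vw
        waveform = edit_waveform(waveform, history[(vn, vf, k)])
    return features, waveform


# Toy stand-ins, for illustration only.
def toy_model1(notes: List[int]) -> List[float]:
    return [float(n) for n in notes]

def toy_model2(notes: List[int], features: List[float]) -> List[float]:
    return [2.0 * f for f in features]

def toy_edit_feature(features: List[float], qf: Tuple[int, float]) -> List[float]:
    index, delta = qf                           # Qf: add delta to one feature sample
    return [f + delta if i == index else f for i, f in enumerate(features)]

def toy_edit_waveform(waveform: List[float], qw: float) -> List[float]:
    return [qw * s for s in waveform]           # Qw: scale the whole waveform


history = {
    (1, 0, 0): [60, 62],        # Dn
    (1, 1, 0): (0, 1.0),        # Qf no. 1
    (1, 2, 0): (1, -2.0),       # Qf no. 2
    (1, 2, 1): 0.5,             # Qw no. 1 under Vf=2
}
print(rebuild(history, 1, 2, 1, toy_model1, toy_model2, toy_edit_feature, toy_edit_waveform))
```

Because only Dn and the instructions are stored, any reachable version triple can be rebuilt by choosing how many instructions to replay at each layer.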
The display control unit 20 updates the editing screen G (Sb309). Specifically, the display control unit 20 displays, in the editing area Ew, the waveform W represented by the waveform data Dw edited by the third editing unit 35, and updates the waveform version number Vw shown in the operation area Gw to the set value Xw. That is, the waveform W corresponding to the current value Cn, the current value Cf, and the set value Xw is displayed in the editing area Ew.
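The regeneration performed in steps Sb305 to Sb308 can be sketched roughly as follows. This is a minimal illustration, not the implementation of the embodiment: `first_model` and `second_model` are hypothetical stand-ins for the generative models M1 and M2, and each editing instruction is assumed to be representable as a function applied to a time series.

```python
from typing import Callable, List, Sequence, Tuple

Series = List[float]
Edit = Callable[[Series], Series]  # assumed form of an editing instruction Qf or Qw


def first_model(notes: Sequence[int]) -> Series:
    """Hypothetical stand-in for M1: maps each note number to a frequency in Hz."""
    return [440.0 * 2 ** ((n - 69) / 12) for n in notes]


def second_model(notes: Sequence[int], features: Series) -> Series:
    """Hypothetical stand-in for M2: derives one toy 'sample' per feature value."""
    return [f / 1000.0 for f in features]


def rebuild_waveform(
    notes: Sequence[int],
    feature_edits: Sequence[Edit],
    waveform_edits: Sequence[Edit],
    cf: int,
    xw: int,
) -> Tuple[Series, Series]:
    df = first_model(notes)           # Sb305: generate Df from Dn
    for edit in feature_edits[:cf]:   # Sb306: replay Qf for Vf = 1..Cf
        df = edit(df)
    dw = second_model(notes, df)      # Sb307: generate Dw from Dn and edited Df
    for edit in waveform_edits[:xw]:  # Sb308: replay Qw for Vw = 1..Xw
        dw = edit(dw)
    return df, dw


notes = [69, 72]
feature_edits = [lambda d: [x * 2 for x in d], lambda d: [x + 1 for x in d]]
waveform_edits = [lambda w: [x * 0.5 for x in w], lambda w: [-x for x in w]]
df, dw = rebuild_waveform(notes, feature_edits, waveform_edits, cf=1, xw=1)
```

With `cf=1` and `xw=1`, only the first feature edit and the first waveform edit are replayed, which corresponds to selecting an earlier version rather than the latest one.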
As described above, in the first embodiment, the note sequence data Dn and the feature sequence data Df are edited in accordance with instructions from the user (the editing instruction Qn and the editing instruction Qf). Therefore, compared with a configuration in which only the note sequence data Dn is edited in accordance with instructions from the user, it is possible to generate waveform data Dw that reflects the user's instructions more precisely.
Further, when the note sequence data Dn is edited, the note sequence version number Vn increases and the value of the feature sequence version number Vf is initialized. When the feature sequence data Df is edited, the value of the feature sequence version number Vf increases while the value of the note sequence version number Vn is maintained. The waveform data Dw is then generated using at least one of the first history data Hn[Vn,Vf,Vw] corresponding to the set value Xn, which is chosen from among the values of the note sequence version number Vn in accordance with an instruction from the user, and the second history data Hf[Vn,Vf,Vw] corresponding to the set value Xf, which is chosen from among the values of the feature sequence version number Vf in accordance with an instruction from the user. Therefore, the user can instruct editing of the note sequence data Dn and the feature sequence data Df while generating waveform data Dw by trial and error for different combinations of the note sequence version number Vn and the feature sequence version number Vf.
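The version-numbering rule stated above can be illustrated with a small sketch. The class name, the initial value of Vn, and the resetting of Vw are assumptions made for illustration; the text only states that an edit of Dn increases Vn and initializes Vf, and that an edit of Df increases Vf while keeping Vn.

```python
from dataclasses import dataclass


@dataclass
class Versions:
    """Hypothetical holder for the version numbers Vn, Vf, and Vw."""

    vn: int = 1  # assumed initial note sequence version
    vf: int = 0
    vw: int = 0

    def edit_notes(self) -> None:
        # Editing Dn: Vn increases and Vf is initialized.
        self.vn += 1
        self.vf = 0
        self.vw = 0  # assumption: Vw is initialized as well

    def edit_features(self) -> None:
        # Editing Df: Vf increases while Vn is maintained.
        self.vf += 1
        self.vw = 0  # assumption: Vw is initialized as well

    def edit_waveform(self) -> None:
        # Editing Dw: only Vw increases.
        self.vw += 1


v = Versions()
v.edit_features()
v.edit_features()
v.edit_waveform()
state_before = (v.vn, v.vf, v.vw)
v.edit_notes()
state_after = (v.vn, v.vf, v.vw)
```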
B: Second Embodiment
The second embodiment will be described. In each of the aspects exemplified below, elements whose functions are the same as in the first embodiment are denoted by the same reference signs as used in the description of the first embodiment, and detailed description of each is omitted as appropriate.
FIG. 11 is a schematic diagram of the editing screen G in the second embodiment. In the editing screen G of the second embodiment, an operation image B2 is added to the same elements as in the first embodiment. The operation image B2 is an image (specifically, a pull-down menu) with which the user selects the pronunciation style of the synthetic sound. By operating the operation device 15, the user can select a desired pronunciation style from among a plurality of pronunciation styles.
A pronunciation style means a characteristic of the manner in which sound is produced. For example, when the synthetic sound is a musical instrument sound, the pronunciation style is a characteristic of the manner of playing the instrument. When the synthetic sound is a singing sound, the pronunciation style is a characteristic of the manner of singing a piece of music (singing expression). Specifically, a manner of sound production suited to each music genre, such as pop, rock, or rap, is an example of a pronunciation style. A musical expression of a performance or of singing, such as bright, quiet, or intense, is also an example of a pronunciation style.
FIG. 12 is a block diagram illustrating the functional configuration of the control device 11 in the second embodiment. The pronunciation style s selected by the user through an operation on the operation image B2 is indicated to the first generation unit 32 and the second generation unit 34 of the second embodiment.
The first generation unit 32 generates the feature sequence data Df from the note sequence data Dn and the pronunciation style s. The feature sequence data Df is time-series data representing a time series of a feature quantity (for example, the fundamental frequency) of the synthetic sound obtained by sounding, in the pronunciation style s, the note sequence N represented by the note sequence data Dn.
Specifically, the first generation unit 32 generates the feature sequence data Df using the first generative model M1. The first generative model M1 is a statistical estimation model that receives the note sequence data Dn and the pronunciation style s as input and outputs the feature sequence data Df. As in the first embodiment, the first generative model M1 is a deep neural network of any structure, such as a convolutional neural network or a recurrent neural network. Specifically, the first generative model M1 is realized by a combination of a program that causes the control device 11 to execute an operation for generating the feature sequence data Df from the note sequence data Dn and the pronunciation style s, and a plurality of variables applied to that operation.
The plurality of variables that define the first generative model M1 are set in advance by machine learning using a plurality of pieces of first training data and are stored in the storage device 12. Each piece of first training data includes a pair of note sequence data Dn and a pronunciation style s, together with feature sequence data Df (ground truth). In the machine learning of the first generative model M1, the variables of the first generative model M1 are updated iteratively so as to reduce the error between the feature sequence data Df that the provisional first generative model M1 outputs for the note sequence data Dn and the pronunciation style s of each piece of first training data and the feature sequence data Df of that piece of first training data. Accordingly, the first generative model M1 outputs statistically valid feature sequence data Df for an unknown combination of note sequence data Dn and pronunciation style s, based on the tendency latent in the plurality of pieces of first training data.
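The iterative update described above can be illustrated with a deliberately tiny example. The sketch below is not a deep neural network: it assumes a toy model with one (gain, offset) pair per pronunciation style, scalar inputs in place of note sequence data, and plain stochastic gradient descent on the squared error. All names and numbers are hypothetical.

```python
def predict(params, note, style):
    # Toy stand-in for M1: the output depends on both the note value and the style.
    gain, offset = params[style]
    return gain * note + offset


def mean_squared_error(params, samples):
    return sum((predict(params, n, s) - t) ** 2 for n, s, t in samples) / len(samples)


def train(samples, lr=0.1, epochs=500):
    # One (gain, offset) pair per style plays the role of the model variables.
    params = {s: [0.0, 0.0] for _, s, _ in samples}
    for _ in range(epochs):
        for note, style, target in samples:
            error = predict(params, note, style) - target
            # Gradient step that reduces the squared error for this sample.
            params[style][0] -= lr * error * note
            params[style][1] -= lr * error
    return params


# Each sample: (note value, pronunciation style, ground-truth feature value).
samples = [
    (0.0, "pop", 1.0),
    (1.0, "pop", 3.0),
    (0.0, "rock", 2.0),
    (1.0, "rock", 2.5),
]
initial = {"pop": [0.0, 0.0], "rock": [0.0, 0.0]}
trained = train(samples)
```

After training, the same note value yields a different prediction for each style, which is the property the style input is meant to provide.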
The second generation unit 34 generates the waveform data Dw from the note sequence data Dn, the feature sequence data Df, and the pronunciation style s. The waveform data Dw is time-series data representing the waveform of the synthetic sound obtained by sounding, in the pronunciation style s, the note sequence N represented by the note sequence data Dn.
Specifically, the second generation unit 34 generates the waveform data Dw using the second generative model M2. The second generative model M2 is a statistical estimation model that receives the note sequence data Dn, the feature sequence data Df, and the pronunciation style s as input and outputs the waveform data Dw. As in the first embodiment, the second generative model M2 is a deep neural network of any structure, such as a convolutional neural network or a recurrent neural network. Specifically, the second generative model M2 is realized by a combination of a program that causes the control device 11 to execute an operation for generating the waveform data Dw from the note sequence data Dn, the feature sequence data Df, and the pronunciation style s, and a plurality of variables applied to that operation.
The plurality of variables that define the second generative model M2 are set in advance by machine learning using a plurality of pieces of second training data and are stored in the storage device 12. Each piece of second training data includes a set of note sequence data Dn, feature sequence data Df, and a pronunciation style s, together with waveform data Dw (ground truth). In the machine learning of the second generative model M2, the variables of the second generative model M2 are updated iteratively so as to reduce the error between the waveform data Dw that the provisional second generative model M2 outputs for the note sequence data Dn, the feature sequence data Df, and the pronunciation style s of each piece of second training data and the waveform data Dw of that piece of second training data. Accordingly, the second generative model M2 outputs statistically valid waveform data Dw for an unknown combination of note sequence data Dn, feature sequence data Df, and pronunciation style s, based on the tendency latent in the plurality of pieces of second training data.
In step Sa201 of the second editing process Sa2, the second editing unit 33 edits, in accordance with the editing instruction Qf from the user, the feature sequence data Df representing the feature sequence F of the synthetic sound obtained by sounding the note sequence N in the pronunciation style s selected by the user. In step Sa205 of the second editing process Sa2, the information management unit 40 saves the second history data Hf[Vn,Vf,Vw] corresponding to the edited feature sequence data Df in the history area of the storage device 12 for each version of the feature sequence data Df.
As understood from the above description, under a specific note sequence N, feature sequence data Df corresponding to the pronunciation style s and waveform data Dw corresponding to that pronunciation style s are generated. The note sequence N, on the other hand, is not affected by the pronunciation style s. Therefore, as illustrated in FIG. 13, for the first history data Hn[Vn,Vf,Vw] (note sequence data Dn) corresponding to one note sequence N, a plurality of pieces of second history data Hf[Vn,Vf,Vw] corresponding to mutually different feature sequences F and a plurality of pieces of third history data Hw[Vn,Vf,Vw] corresponding to mutually different waveforms W are saved in the history area of the storage device 12 for each pronunciation style s.
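One way to picture this arrangement is a store in which note history is keyed by Vn alone, while feature and waveform history carry the pronunciation style in their keys. The sketch below is only an illustration under that assumption; the class, its methods, and the string descriptions of the edits are hypothetical and do not reflect the actual layout of the history area.

```python
class StyleHistory:
    """Hypothetical in-memory stand-in for the history area of the storage device 12."""

    def __init__(self):
        self.note_history = {}      # Vn -> note sequence data Dn (shared by all styles)
        self.feature_history = {}   # (style, Vn, Vf) -> editing instruction Qf
        self.waveform_history = {}  # (style, Vn, Vf, Vw) -> editing instruction Qw

    def save_notes(self, vn, dn):
        self.note_history[vn] = dn

    def save_feature_edit(self, style, vn, vf, qf):
        self.feature_history[(style, vn, vf)] = qf

    def save_waveform_edit(self, style, vn, vf, vw, qw):
        self.waveform_history[(style, vn, vf, vw)] = qw

    def feature_versions(self, style, vn):
        # Feature sequence versions recorded for one style under one note sequence version.
        return sorted(vf for (s, n, vf) in self.feature_history if s == style and n == vn)


store = StyleHistory()
store.save_notes(1, [60, 62, 64])
store.save_feature_edit("pop", 1, 1, "raise pitch in bar 2")
store.save_feature_edit("pop", 1, 2, "add vibrato")
store.save_feature_edit("rock", 1, 1, "flatten pitch curve")
store.save_waveform_edit("rock", 1, 1, 1, "fade out")
```

Here a single note sequence version carries two independent feature histories, one per style, so switching styles does not disturb the edits made under the other style.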
Next, a specific example of the operation in the second embodiment will be described. In the first editing process Sa1, the first processing unit generates feature sequence data Df representing the feature sequence F of the synthetic sound that sounds the note sequence N in the pronunciation style s (Sa106), and the second processing unit generates waveform data Dw representing the waveform W of that synthetic sound (Sa107).
In the second editing process Sa2, the second editing unit 33 edits the feature sequence data Df corresponding to the pronunciation style s in accordance with the editing instruction Qf from the user. For each edit of the feature sequence data Df (that is, for each version of the feature sequence data Df), the information management unit 40 saves the second history data Hf[Vn,Vf,Vw] corresponding to the edited feature sequence data Df in the history area.
Similarly, in the third editing process Sa3, the third editing unit 35 edits the waveform data Dw corresponding to the pronunciation style s in accordance with the editing instruction Qw from the user. For each edit of the waveform data Dw (that is, for each version of the waveform data Dw), the information management unit 40 saves the third history data Hw[Vn,Vf,Vw] corresponding to the edited waveform data Dw in the history area.
In the second embodiment, with a pronunciation style s selected, the first management process Sb1 is started in response to an instruction to change the note sequence version number Vn. In step Sb104 of the first management process Sb1, the information management unit 40 acquires from the history area the first history data Hn[Vn=Xn,Vf=0,Vw=0] of the note sequence N, the second history data Hf[Vn=Xn,Vf=1,Vw=0] to Hf[Vn=Xn,Vf=Yf,Vw=0] of the feature sequence F corresponding to the pronunciation style s, and the third history data Hw[Vn=Xn,Vf=Yf,Vw=1] to Hw[Vn=Xn,Vf=Yf,Vw=Yw] of the waveform W corresponding to that pronunciation style s. In steps Sb105 to Sb108 of the first management process Sb1, the feature sequence data Df of the feature sequence F corresponding to the pronunciation style s and the waveform data Dw of the waveform W corresponding to the pronunciation style s are generated.
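The set of history entries read in step Sb104 follows a regular pattern, which the sketch below enumerates as (Vn, Vf, Vw) index triples. The function name and the tuple representation are assumptions for illustration; for the first management process the arguments would be the set value Xn and the latest values Yf and Yw.

```python
def history_keys(vn, vf, vw):
    """Enumerate the (Vn, Vf, Vw) indices of the history data to acquire.

    Hypothetical helper: returns the single Hn index, the Hf indices for
    Vf = 1..vf, and the Hw indices for Vw = 1..vw.
    """
    hn = (vn, 0, 0)
    hf = [(vn, f, 0) for f in range(1, vf + 1)]
    hw = [(vn, vf, w) for w in range(1, vw + 1)]
    return hn, hf, hw


# Example: Xn = 2, latest feature version Yf = 2, latest waveform version Yw = 1.
hn, hf, hw = history_keys(2, 2, 1)
```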
In steps Sb105 to Sb108 of the first management process Sb1, the "feature sequence F corresponding to the pronunciation style s" is, specifically, the feature sequence F corresponding to the note sequence version number Vn (set value Xn), the pronunciation style s, and the feature sequence version number Vf (latest value Yf). The "waveform W corresponding to the pronunciation style s" is, specifically, the waveform W corresponding to the note sequence version number Vn (set value Xn), the pronunciation style s, the feature sequence version number Vf (latest value Yf), and the waveform version number Vw (latest value Yw).
In the second embodiment, with a pronunciation style s selected, the second management process Sb2 is started in response to an instruction to change the feature sequence version number Vf. In step Sb204 of the second management process Sb2, the information management unit 40 acquires from the history area the first history data Hn[Vn=Cn,Vf=0,Vw=0] of the note sequence N, the second history data Hf[Vn=Cn,Vf=1,Vw=0] to Hf[Vn=Cn,Vf=Xf,Vw=0] of the feature sequence F corresponding to the pronunciation style s, and the third history data Hw[Vn=Cn,Vf=Xf,Vw=1] to Hw[Vn=Cn,Vf=Xf,Vw=Yw] of the waveform W corresponding to that pronunciation style s. In steps Sb205 to Sb208 of the second management process Sb2, the feature sequence data Df of the feature sequence F corresponding to the pronunciation style s and the waveform data Dw of the waveform W corresponding to the pronunciation style s are generated. Here, the "feature sequence F corresponding to the pronunciation style s" is, specifically, the feature sequence F corresponding to the note sequence version number Vn (current value Cn), the pronunciation style s, and the feature sequence version number Vf (set value Xf). The "waveform W corresponding to the pronunciation style s" is, specifically, the waveform W corresponding to the note sequence version number Vn (current value Cn), the pronunciation style s, the feature sequence version number Vf (set value Xf), and the waveform version number Vw (latest value Yw).
In the second embodiment, with a pronunciation style s selected, the third management process Sb3 is started in response to an instruction to change the waveform version number Vw. In step Sb304 of the third management process Sb3, the information management unit 40 acquires from the history area the first history data Hn[Vn=Cn,Vf=0,Vw=0] of the note sequence N, the second history data Hf[Vn=Cn,Vf=1,Vw=0] to Hf[Vn=Cn,Vf=Cf,Vw=0] of the feature sequence F corresponding to the pronunciation style s, and the third history data Hw[Vn=Cn,Vf=Cf,Vw=1] to Hw[Vn=Cn,Vf=Cf,Vw=Xw] of the waveform W corresponding to that pronunciation style s. In steps Sb305 to Sb308 of the third management process Sb3, the feature sequence data Df of the feature sequence F corresponding to the pronunciation style s and the waveform data Dw of the waveform W corresponding to the pronunciation style s are generated. Here, the "feature sequence F corresponding to the pronunciation style s" is, specifically, the feature sequence F corresponding to the note sequence version number Vn (current value Cn), the pronunciation style s, and the feature sequence version number Vf (current value Cf). The "waveform W corresponding to the pronunciation style s" is, specifically, the waveform W corresponding to the note sequence version number Vn (current value Cn), the pronunciation style s, the feature sequence version number Vf (current value Cf), and the waveform version number Vw (set value Xw).
Here, attention is directed to a pronunciation style s1 and a pronunciation style s2 that the user can select from among the plurality of pronunciation styles s. The pronunciation style s1 and the pronunciation style s2 are mutually different pronunciation styles s. The pronunciation style s1 is an example of a "first pronunciation style", and the pronunciation style s2 is an example of a "second pronunciation style".
First, assume that the pronunciation style s1 is selected. In the second editing process Sa2, the second editing unit 33 edits the feature sequence data Df corresponding to the pronunciation style s1 in accordance with the editing instruction Qf from the user. Each time the feature sequence data Df is edited, the information management unit 40 saves the second history data Hf[Vn,Vf,Vw] corresponding to the edited feature sequence data Df in the history area. Similarly, in the third editing process Sa3, the third editing unit 35 edits the waveform data Dw corresponding to the pronunciation style s1 in accordance with the editing instruction Qw from the user. Each time the waveform data Dw is edited, the information management unit 40 saves the third history data Hw[Vn,Vf,Vw] corresponding to the edited waveform data Dw in the history area. The feature sequence data Df or the waveform data Dw generated with the pronunciation style s1 selected is an example of "first time-series data". The editing instruction Qf or the editing instruction Qw given by the user with the pronunciation style s1 selected is an example of a "first instruction".
When the pronunciation style s1 is selected, in step Sb104 of the first management process Sb1, step Sb204 of the second management process Sb2, and step Sb304 of the third management process Sb3, the feature sequence data Df of the feature sequence F corresponding to the pronunciation style s1 and the waveform data Dw of the waveform W corresponding to the pronunciation style s1 are generated. That is, from among the plurality of pieces of history data H (Hn, Hf, Hw) corresponding to the pronunciation style s1, the feature sequence data Df and the waveform data Dw corresponding to the history data H designated by the instruction from the user (Xn, Xf, Xw) are generated.
Next, assume that the pronunciation style s2 is selected. In the second editing process Sa2, the second editing unit 33 edits the feature sequence data Df corresponding to the pronunciation style s2 in accordance with the editing instruction Qf from the user. Each time the feature sequence data Df is edited, the information management unit 40 saves the second history data Hf[Vn,Vf,Vw] corresponding to the edited feature sequence data Df in the history area. Similarly, in the third editing process Sa3, the third editing unit 35 edits the waveform data Dw corresponding to the pronunciation style s2 in accordance with the editing instruction Qw from the user. Each time the waveform data Dw is edited, the information management unit 40 saves the third history data Hw[Vn,Vf,Vw] corresponding to the edited waveform data Dw in the history area. The feature sequence data Df or the waveform data Dw generated with the pronunciation style s2 selected is an example of "second time-series data". The editing instruction Qf or the editing instruction Qw given by the user with the pronunciation style s2 selected is an example of a "second instruction".
When the pronunciation style s2 is selected, in step Sb104 of the first management process Sb1, step Sb204 of the second management process Sb2, and step Sb304 of the third management process Sb3, the feature sequence data Df of the feature sequence F corresponding to the pronunciation style s2 and the waveform data Dw of the waveform W corresponding to the pronunciation style s2 are generated. That is, from among the plurality of pieces of history data H (Hn, Hf, and Hw) corresponding to the pronunciation style s2, the feature sequence data Df and the waveform data Dw corresponding to the history data H designated by the instruction from the user (Xn, Xf, or Xw) are generated.
As understood from the above examples, the editing processing unit 30 in the second embodiment acquires either the feature sequence data Df and the waveform data Dw corresponding to the pronunciation style s1, or the feature sequence data Df and the waveform data Dw corresponding to the pronunciation style s2, in accordance with a common version of the note sequence data Dn.
As exemplified above, in the second embodiment, the editing history of the feature sequence data Df and the waveform data Dw corresponding to the pronunciation style s1 is saved in the storage device 12, and the editing history of the feature sequence data Df and the waveform data Dw corresponding to the pronunciation style s2 is also saved in the storage device 12. Therefore, editing of the feature sequence data Df or the waveform data Dw corresponding to the pronunciation style s1 and editing of the feature sequence data Df or the waveform data Dw corresponding to the pronunciation style s2 can each be carried out by trial and error in accordance with instructions from the user.
For example, when the user instructs a comparison between pronunciation styles s by operating the operation device 15, the display control unit 20 causes the display device 14 to display the comparison screen U of FIG. 14. The comparison screen U includes a first area U1, an operation image U1a (recall), an operation image U1b (playback), a second area U2, an operation image U2a (recall), and an operation image U2b (playback).
Each of the first area U1 and the second area U2 displays the hierarchical relationship among the first history data Hn[Vn,Vf,Vw], the second history data Hf[Vn,Vf,Vw], and the third history data Hw[Vn,Vf,Vw]. By operating the operation device 15, the user can select desired history data H for each of the first area U1 and the second area U2. Specifically, the user selects the desired history data H for each of the first area U1 and the second area U2 by designating a pronunciation style s and the version numbers (Vn, Vf, Vw).
When the user selects the operation image U1a (recall), the control device 11 acquires the history data H selected in the first area U1 from the storage device 12 and causes the display device 14 to display the editing screen G corresponding to that history data H. Specifically, in accordance with the pronunciation style s and the version numbers (Vn, Vf, Vw) of the history data H selected for the first area U1, the control device 11 acquires from the history area the first history data Hn[Vn=Xn,Vf=0,Vw=0] of the note sequence N, the second history data Hf[Vn=Xn,Vf=1,Vw=0] to Hf[Vn=Xn,Vf=Xf,Vw=0] of the feature sequence F corresponding to the pronunciation style s, and the third history data Hw[Vn=Xn,Vf=Xf,Vw=1] to Hw[Vn=Xn,Vf=Xf,Vw=Xw] of the waveform W corresponding to the pronunciation style s. Using the history data H acquired from the history area, the control device 11 generates the feature sequence data Df of the feature sequence F and the waveform data Dw of the waveform W that correspond to the pronunciation style s and the version numbers (Vn, Vf, Vw). The control device 11 then causes the display device 14 to display the editing screen G including the note sequence indicated by the first history data Hn[Vn=Xn,Vf=0,Vw=0], the feature sequence F indicated by the feature sequence data Df, and the waveform W indicated by the waveform data Dw. When the user selects the operation image U1b (playback), the control device 11 reproduces the synthetic sound by supplying the sound emitting device 13 with the acoustic signal Z corresponding to the waveform data Dw generated for the first area U1 by the above procedure.
Similarly, when the user selects the operation image U2a (recall), the control device 11 acquires the history data H selected in the second area U2 from the storage device 12 and causes the display device 14 to display the editing screen G corresponding to that history data H. Specifically, by the same procedure as described above for the first area U1, the control device 11 generates the feature sequence data Df and the waveform data Dw corresponding to the pronunciation style s and the version numbers (Vn, Vf, Vw) that the user designated for the second area U2. The control device 11 then causes the display device 14 to display the editing screen G including the note sequence indicated by the first history data Hn[Vn=Xn,Vf=0,Vw=0], the feature sequence F indicated by the feature sequence data Df, and the waveform W indicated by the waveform data Dw. When the user selects the operation image U2b (playback), the control device 11 reproduces the synthetic sound by supplying the sound emitting device 13 with the acoustic signal Z corresponding to the waveform data Dw generated for the second area U2 by the above procedure.
As understood from the above examples, the user can adjust the note sequence N, the feature sequence F, the waveform W, and the pronunciation style s while comparing the combination of version and pronunciation style s selected in the first region U1 with the combination of version and pronunciation style s selected in the second region U2.
C: Third Embodiment
FIG. 15 is an explanatory diagram of the synthetic sound in the third embodiment. The synthetic sound of the third embodiment is composed of a plurality of tracks T (T1, T2, ...) that run in parallel on the time axis. For example, when the synthetic sound is an instrumental sound composed of a plurality of performance parts, each performance part corresponds to a track T. Likewise, when the synthetic sound is a singing sound composed of a plurality of singing parts, each singing part corresponds to a track T.
Each of the plurality of tracks T includes a plurality of sections (hereinafter "unit intervals") R that do not overlap one another on the time axis. Each unit interval R is a section (region) on the time axis that contains a note sequence N. That is, a set of notes that are close to one another on the time axis is treated as one note sequence N, and a unit interval R is set for each note sequence N. The time length of each unit interval R is variable and depends on, for example, the total number of notes in the note sequence N and the duration of each note.
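The grouping of temporally close notes into note sequences, each with its own variable-length unit interval, can be sketched as follows. This is a minimal illustration only: the `Note` type, the function names, and the `max_gap` threshold that decides when two notes count as "close" are assumptions introduced here and do not appear in the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    onset: float     # start time in seconds (assumed unit)
    duration: float  # length in seconds (assumed unit)
    pitch: int       # MIDI note number (assumed encoding)


def split_into_unit_intervals(notes, max_gap=1.0):
    """Group notes separated by at most max_gap seconds into note sequences.

    max_gap is a hypothetical threshold; the source only says the notes of
    one sequence are close to one another on the time axis.
    """
    sequences = []
    current = []
    current_end = 0.0
    for note in sorted(notes, key=lambda n: n.onset):
        if current and note.onset - current_end > max_gap:
            sequences.append(current)
            current = []
        end = note.onset + note.duration
        current_end = end if not current else max(current_end, end)
        current.append(note)
    if current:
        sequences.append(current)
    return sequences


def interval_bounds(sequence):
    """Return (start, end) of the unit interval covering one note sequence."""
    start = min(n.onset for n in sequence)
    end = max(n.onset + n.duration for n in sequence)
    return start, end


track = [Note(0.0, 0.5, 60), Note(0.5, 0.5, 62), Note(3.0, 1.0, 64), Note(4.0, 0.5, 65)]
regions = split_into_unit_intervals(track)
```

With these inputs the track splits into two unit intervals of different lengths, and the first ends before the second begins, which matches the non-overlap property stated above.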
FIG. 16 is a schematic diagram of the editing screen G in the third embodiment. The editing screen G displays information (the note sequence N, the feature sequence F, or the waveform W) on one unit interval R selected by the user from among the plurality of unit intervals R of one track T, which the user has in turn selected from the plurality of tracks T of the synthetic sound. In the editing screen G of the third embodiment, an operation area Gt and an operation area Gr are added to the same elements as in the first embodiment.
The operation area Gt relates to the track T of the synthetic sound. Specifically, the operation area Gt displays a track version number Vt, an operation image Gt1, and an operation image Gt2. The track version number Vt is a number representing the version of the track T displayed on the editing screen G. The track version number Vt is incremented by 1 each time the information on the track T displayed on the editing screen G (the note sequence N, the feature sequence F, or the waveform W) is edited. The user can also change the track version number Vt in the operation area Gt to an arbitrary value by operating the operation device 15.
The operation images Gt1 and Gt2 are software buttons that the user can operate with the operation device 15. The operation image Gt1 is a control with which the user instructs that the information on the track T (the note sequence N, the feature sequence F, or the waveform W) be returned to its state before the immediately preceding edit (Undo). The operation image Gt2 is a control with which the user instructs that an edit canceled by operating the operation image Gt1 be executed again (Redo).
The operation area Gr relates to the unit interval R of the synthetic sound. Specifically, the operation area Gr displays a section version number Vr, an operation image Gr1, and an operation image Gr2. The section version number Vr is a number representing the version of the unit interval R displayed on the editing screen G. The section version number Vr is incremented by 1 each time the information on the unit interval R displayed on the editing screen G (the note sequence N, the feature sequence F, or the waveform W) is edited. The user can also change the section version number Vr in the operation area Gr to an arbitrary value by operating the operation device 15.
The operation images Gr1 and Gr2 are software buttons that the user can operate with the operation device 15. The operation image Gr1 is a control with which the user instructs that the information on the unit interval R (the note sequence N, the feature sequence F, or the waveform W) be returned to its state before the immediately preceding edit (Undo). The operation image Gr2 is a control with which the user instructs that an edit canceled by operating the operation image Gr1 be executed again (Redo).
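The Undo and Redo behavior described for the operation images Gt1/Gt2 and Gr1/Gr2 can be illustrated with a conventional two-stack history. The sketch below is a generic technique, not necessarily the version-based mechanism of the embodiments: the class `UndoRedoHistory`, its method names, and the rule that a fresh edit discards previously canceled edits are all assumptions made for illustration.

```python
class UndoRedoHistory:
    """Linear edit history with Undo and Redo for one piece of data."""

    def __init__(self, initial):
        self._undo = [initial]  # states up to and including the current one
        self._redo = []         # states canceled by undo, most recent last

    @property
    def current(self):
        return self._undo[-1]

    def edit(self, new_state):
        self._undo.append(new_state)
        self._redo.clear()      # assumption: a new edit invalidates canceled edits

    def undo(self):
        # Return to the state before the immediately preceding edit, if any.
        if len(self._undo) > 1:
            self._redo.append(self._undo.pop())
        return self.current

    def redo(self):
        # Re-execute the edit most recently canceled by undo, if any.
        if self._redo:
            self._undo.append(self._redo.pop())
        return self.current


history = UndoRedoHistory(("C4", "D4"))
history.edit(("C4", "E4"))
history.edit(("C4", "E4", "G4"))
after_undo = history.undo()
after_redo = history.redo()
```

One such history per track and one per unit interval would give the two independent levels of Undo/Redo that the two operation areas expose.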
The editing process Sa (Sa1-Sa3) or the management process Sb (Sb1-Sb3) is executed for each of the plurality of unit intervals R in the one track T displayed on the editing screen G. In the editing process Sa, each time any of the note sequence N, the feature sequence F, and the waveform W is edited, the information management unit 40 increments the track version number Vt and the section version number Vr by 1 each. Likewise, when the user operates an operation image (Gn1, Gf1, Gw1, Gn2, Gf2, or Gw2), the information management unit 40 increments the track version number Vt and the section version number Vr by 1 each.
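The joint update of the two counters can be sketched as follows. This is a minimal illustration under assumptions: the classes `UnitInterval` and `Track` and the method `bump` are hypothetical names, and both counters are assumed to start at 0.

```python
from dataclasses import dataclass


@dataclass
class UnitInterval:
    vr: int = 0  # section version number Vr (assumed to start at 0)


@dataclass
class Track:
    intervals: list
    vt: int = 0  # track version number Vt (assumed to start at 0)

    def bump(self, index):
        """Advance Vt and the Vr of the affected unit interval by one each."""
        self.vt += 1
        self.intervals[index].vr += 1
        return self.vt, self.intervals[index].vr


track = Track(intervals=[UnitInterval(), UnitInterval()])
track.bump(0)  # e.g. a note edit in the first unit interval
track.bump(0)  # e.g. an operation image such as Gn1 used in the same interval
track.bump(1)  # e.g. a waveform edit in the second unit interval
```

After these three operations Vt is 3 while the two unit intervals hold Vr values of 2 and 1, so the track counter orders all changes and each section counter orders only its own.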
The third embodiment achieves the same effects as the first embodiment. In addition, in the third embodiment, the user can instruct editing of each of the note sequence data Dn, the feature sequence data Df, and the waveform data Dw while generating the waveform data Dw by trial and error for each of the plurality of unit intervals R on the time axis.
D: Modifications
Specific modifications that may be added to each of the aspects exemplified above are described below. Two or more aspects arbitrarily selected from the following examples may be combined as appropriate to the extent that they do not contradict one another.
(1) In each of the above embodiments, each version of the note sequence data Dn is stored in the history area as the first history data Hn[Vn, Vf, Vw], but what the first history data Hn[Vn, Vf, Vw] represents and its format are not limited to this example. For example, first history data Hn[Vn, Vf, Vw] representing how the note sequence data Dn is edited (that is, the time series of the editing instructions Qn) may be stored instead. As understood from the above description, the first history data Hn[Vn, Vf, Vw] is comprehensively expressed as data corresponding to the edited note sequence N.
(2) In each of the above embodiments, second history data Hf[Vn, Vf, Vw] representing how the feature sequence data Df is edited (that is, the time series of the editing instructions Qf) is stored in the history area, but what the second history data Hf[Vn, Vf, Vw] represents and its format are not limited to this example. For example, the feature sequence data Df as edited according to the editing instruction Qf may be stored in the history area as the second history data Hf[Vn, Vf, Vw]. As understood from the above examples, the second history data Hf[Vn, Vf, Vw] is comprehensively expressed as data corresponding to the edited feature sequence data Df.
(3) In each of the above embodiments, third history data Hw[Vn, Vf, Vw] representing how the waveform data Dw is edited (that is, the time series of the editing instructions Qw) is stored in the history area, but what the third history data Hw[Vn, Vf, Vw] represents and its format are not limited to this example. For example, the waveform data Dw as edited according to the editing instruction Qw may be stored in the history area as the third history data Hw[Vn, Vf, Vw]. As understood from the above examples, the third history data Hw[Vn, Vf, Vw] is comprehensively expressed as data corresponding to the edited waveform data Dw.
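Modifications (1) to (3) contrast two ways of recording history: storing the edited data itself for each version, or storing the time series of editing instructions and replaying it. The following sketch suggests that both can recover the same data. The `(index, new_value)` instruction format and the names `apply_instruction` and `restore` are assumptions chosen for illustration; the source does not specify how an editing instruction is encoded.

```python
def apply_instruction(data, instruction):
    """Apply one hypothetical edit instruction (index, new_value) to a copy."""
    index, value = instruction
    edited = list(data)
    edited[index] = value
    return edited


initial = [220.0, 220.0, 247.0]          # e.g. a short feature sequence (Hz)
instructions = [(1, 233.0), (2, 262.0)]  # editing instructions in order

# Representation 1: store the edited data itself for every version.
snapshots = [initial]
for instruction in instructions:
    snapshots.append(apply_instruction(snapshots[-1], instruction))


# Representation 2: store only the instruction time series and replay it.
def restore(version):
    data = initial
    for instruction in instructions[:version]:
        data = apply_instruction(data, instruction)
    return data
```

Snapshots make retrieval of any version immediate at the cost of storage, whereas an instruction log is compact but requires replay; either satisfies the general description of history data as "data corresponding to the edited" data.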
(4) In each of the above embodiments, the feature sequence F uses the fundamental frequency of the synthetic sound as its feature quantity, but the feature quantity represented by the feature sequence data Df is not limited to the fundamental frequency. For example, the feature sequence data Df may be time-series data representing the time series (feature sequence F) of a feature quantity such as the frequency spectrum of the synthetic sound in the frequency domain (for example, an intensity spectrum) or the sound pressure level on the time axis. The feature sequence data Df is comprehensively expressed as time-series data representing the time series (feature sequence F) of the feature quantity of the note sequence data Dn.
(5) In each of the above embodiments, the second generation unit 34 generates the waveform data Dw from the note sequence data Dn and the feature sequence data Df, but a configuration in which the second generation unit 34 generates the waveform data Dw from the note sequence data Dn alone, or from the feature sequence data Df alone, is also conceivable. That is, the second generation unit 34 is specified as an element that generates the waveform data Dw from at least one of the note sequence data Dn and the feature sequence data Df.
(6) The second embodiment exemplifies a first generation model M1 that outputs the feature sequence data Df in response to an input including the pronunciation style s, but the configuration by which the first generation unit 32 generates the feature sequence data Df according to the pronunciation style s is not limited to this example. For example, the feature sequence data Df may be generated by selectively using a plurality of first generation models M1 corresponding to different pronunciation styles s. The first generation model M1 corresponding to each pronunciation style s is constructed by machine learning using a plurality of first training data prepared for that pronunciation style s. The first generation unit 32 generates the feature sequence data Df by inputting the note sequence data Dn to the first generation model M1 that corresponds, among the plurality of first generation models M1, to the pronunciation style s selected by the user.
The second embodiment also exemplifies a second generation model M2 that outputs the waveform data Dw in response to an input including the pronunciation style s, but the configuration by which the second generation unit 34 generates the waveform data Dw according to the pronunciation style s is not limited to this example. For example, the waveform data Dw may be generated by selectively using a plurality of second generation models M2 corresponding to different pronunciation styles s. The second generation model M2 corresponding to each pronunciation style s is constructed by machine learning using a plurality of second training data prepared for that pronunciation style s. The second generation unit 34 generates the waveform data Dw by inputting the note sequence data Dn and the feature sequence data Df (input data Din) to the second generation model M2 that corresponds, among the plurality of second generation models M2, to the pronunciation style s selected by the user.
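The selective use of one model per pronunciation style can be sketched as a lookup keyed by style. The sketch is illustrative only: the "models" are plain stand-in functions rather than trained networks, and the names `make_stub_model`, `FIRST_MODELS`, `generate_features`, and the style keys are assumptions introduced here.

```python
def make_stub_model(offset_semitones):
    """Return a stand-in 'model' mapping MIDI pitches to frequencies in Hz.

    A real first generation model M1 would be trained per style; the fixed
    pitch offset here only makes the two stand-ins distinguishable.
    """
    def model(note_pitches):
        return [440.0 * 2 ** ((p + offset_semitones - 69) / 12) for p in note_pitches]
    return model


# One hypothetical model per pronunciation style.
FIRST_MODELS = {
    "style_a": make_stub_model(0),
    "style_b": make_stub_model(12),
}


def generate_features(note_pitches, style):
    """Pick the model registered for the user-selected style and run it."""
    try:
        model = FIRST_MODELS[style]
    except KeyError:
        raise ValueError(f"no model registered for style {style!r}") from None
    return model(note_pitches)


features_a = generate_features([69], "style_a")
features_b = generate_features([69], "style_b")
```

Compared with a single model that takes the style as an input, this arrangement keeps each model simpler but requires one trained model per style; the same lookup pattern would apply to the second generation models M2.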
(7) In each of the above embodiments, the waveform W of the acoustic signal Z is displayed in the editing area Ew of the editing screen G, but the time series of the frequency spectrum of the acoustic signal Z (that is, a spectrogram) may be displayed on the editing screen G together with the waveform W. For example, the editing screen G illustrated in FIG. 17 includes an editing area Ew1 and an editing area Ew2. The editing area Ew1 displays the waveform W in the same way as the editing area Ew in each of the above embodiments. The editing area Ew2, on the other hand, displays the time series of the frequency spectrum of the acoustic signal Z. By operating the operation device 15, the user can give an editing instruction Qw for the frequency spectrum in the editing area Ew2 as well as an editing instruction Qw for the waveform in the editing area Ew1.
(8) The note sequence data Dn is time-series data representing a note sequence N whose elements are a plurality of notes on the time axis. The feature sequence data Df is time-series data representing a feature sequence F whose elements are a plurality of feature quantities on the time axis. The waveform data Dw is time-series data representing a waveform W whose elements are a plurality of samples on the time axis. As understood from the above examples, the note sequence data Dn, the feature sequence data Df, and the waveform data Dw are comprehensively expressed as time-series data representing a time series of a plurality of elements.
(9) In each of the above embodiments, a deep neural network is exemplified as the first generation model M1 and the second generation model M2, but the configurations of the first generation model M1 and the second generation model M2 are arbitrary. For example, a statistical estimation model of another structure, such as an HMM (Hidden Markov Model), may be used as the first generation model M1 or the second generation model M2.
(10) Each of the above embodiments exemplifies synthesis of a synthetic sound corresponding to the note sequence N, but the embodiments are applicable to any situation in which time-series data representing a time series of a plurality of elements is processed. For example, in each of the above embodiments the upper layer corresponds to the note sequence N, the middle layer to the feature sequence F, and the lower layer to the waveform W; in situations other than synthesis of a synthetic sound, the layers are combined as exemplified below.
For example, in automatic composition, which generates a melody, the note sequence constituting the melody corresponds to the upper layer, the time series of chords in the melody corresponds to the middle layer, and the note sequence of an accompaniment that harmonizes with the melody corresponds to the lower layer. In speech synthesis, which synthesizes speech corresponding to a character string, the character string corresponds to the upper layer, the pronunciation style of the speech corresponds to the middle layer, and the waveform of the speech corresponds to the lower layer. In signal processing, which processes various signals, the waveform of the signal corresponds to the upper layer, the time series of feature quantities of the signal corresponds to the middle layer, and the time series of parameters relating to the processing applied to the signal corresponds to the lower layer. In any of the forms exemplified above, the data of the upper layer is expressed as "upper data", the data of the middle layer as "middle data", and the data of the lower layer as "lower data". The lower data is data representing the content that the user actually uses (for example, the waveform W in each of the above embodiments).
Each note constituting the note sequence N in each of the above embodiments and each character constituting the character string in speech synthesis are comprehensively expressed as symbols indicating sounds. Likewise, the note sequence N and the character string are comprehensively expressed as a symbol string in which a plurality of symbols are arranged in a time series.
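The upper/middle/lower layering described in modification (10) can be sketched as two generation steps chained together, with the lower data derived from the upper and middle data. The sketch uses the automatic-composition example with toy stand-ins: the functions `first_generate`, `second_generate`, and `run_layers`, and the rules they apply (pitch class as "chord root", a fixed base pitch of 48 for the accompaniment), are assumptions for illustration and not methods taken from the source.

```python
def first_generate(upper):
    """Toy stand-in: derive middle data (chord roots) from the melody notes."""
    return [pitch % 12 for pitch in upper]


def second_generate(upper, middle):
    """Toy stand-in: derive lower data (accompaniment notes) from both layers."""
    assert len(upper) == len(middle)
    return [48 + root for root in middle]


def run_layers(upper):
    """Generate middle data from upper data, then lower data from both."""
    middle = first_generate(upper)
    lower = second_generate(upper, middle)
    return middle, lower


melody = [60, 64, 67]                 # upper data: melody as MIDI pitches
chords, accompaniment = run_layers(melody)
```

Swapping the two stand-in functions for a text-to-style and a style-to-waveform generator would give the speech-synthesis arrangement with the same control flow.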
(11) As described above, the functions of the acoustic processing system exemplified above are realized by cooperation between the one or more processors constituting the control device 11 and the program stored in the storage device 12. The program according to the present disclosure may be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium, a good example being an optical recording medium (optical disc) such as a CD-ROM, but it also encompasses any known type of recording medium such as a semiconductor recording medium or a magnetic recording medium. A non-transitory recording medium includes any recording medium other than a transitory, propagating signal, and volatile recording media are not excluded. In a configuration in which a distribution device distributes the program via a communication network, the storage device 12 that stores the program in the distribution device corresponds to the non-transitory recording medium described above.
E: Supplementary Notes
From the embodiments exemplified above, the following configurations, for example, are derived.
An information processing method according to one aspect (Aspect 1) of the present disclosure includes: editing, in response to a first instruction from a user, first time-series data representing a time series of feature quantities of a sound in which a symbol string is pronounced in a first pronunciation style; each time the first time-series data is edited, saving first history data corresponding to the edited first time-series data as data of a new version; editing, in response to a second instruction from the user, second time-series data representing a time series of feature quantities of a sound in which the symbol string is pronounced in a second pronunciation style different from the first pronunciation style; each time the second time-series data is edited, saving second history data corresponding to the edited second time-series data as data of a new version; and acquiring either first time-series data corresponding to the first history data that, among the plurality of saved first history data of different versions, accords with an instruction from the user, or second time-series data corresponding to the second history data that, among the plurality of saved second history data of different versions, accords with an instruction from the user.
According to this aspect, the editing history of the first time-series data corresponding to the first pronunciation style is saved, and the editing history of the second time-series data corresponding to the second pronunciation style is saved. The editing of the first time-series data corresponding to the first pronunciation style and the editing of the second time-series data corresponding to the second pronunciation style can therefore each be carried out by trial and error in response to instructions from the user. The "symbol string" is, for example, a note sequence or a character string.
In a specific example of Aspect 1 (Aspect 2), the symbol string is a note sequence including a plurality of notes arranged in a time series. In a specific example of Aspect 2 (Aspect 3), note sequence data representing the note sequence is edited in response to an instruction from the user, and the first time-series data and the second time-series data are generated from a common version of the note sequence data.
In a specific example of any of Aspects 1 to 3 (Aspect 4), the acquisition obtains either the first history data following the immediately preceding edit among the plurality of first history data, or the second history data following the immediately preceding edit among the plurality of second history data. With this configuration, the first history data or the second history data as it was before the most recent edit (that is, with that edit canceled) can be acquired.
In a specific example of any of Aspects 1 to 3 (Aspect 5), the acquisition obtains either the first history data of the version specified by the user among the plurality of first history data, or the second history data of the version specified by the user among the plurality of second history data. With this configuration, the first history data or the second history data corresponding to any version the user specifies can be acquired.
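Aspects 1, 4, and 5 can be sketched together as one versioned history per pronunciation style, with two ways of retrieving data: the state before the most recent edit, or an arbitrary version chosen by the user. The sketch is illustrative only; it stores full snapshots, numbers the initial data as version 0, and uses the hypothetical names `StyleHistory`, `save_edit`, `get`, and `get_before_last_edit`, none of which come from the source.

```python
class StyleHistory:
    """Versioned history of time-series data for one pronunciation style."""

    def __init__(self, initial):
        self._versions = [list(initial)]  # list index doubles as version number

    @property
    def latest_version(self):
        return len(self._versions) - 1

    def save_edit(self, edited):
        """Save the edited time-series data as a new version (Aspect 1)."""
        self._versions.append(list(edited))
        return self.latest_version

    def get(self, version):
        """Return the data of a user-specified version (Aspect 5)."""
        if not 0 <= version <= self.latest_version:
            raise IndexError(f"unknown version {version}")
        return list(self._versions[version])

    def get_before_last_edit(self):
        """Return the data as it was before the most recent edit (Aspect 4)."""
        return self.get(max(self.latest_version - 1, 0))


# Separate histories for the first and second pronunciation styles.
histories = {
    "style_1": StyleHistory([100.0, 110.0]),
    "style_2": StyleHistory([200.0, 210.0]),
}
histories["style_1"].save_edit([100.0, 120.0])
histories["style_1"].save_edit([105.0, 120.0])
histories["style_2"].save_edit([200.0, 190.0])
```

Because each style keeps its own list of versions, canceling an edit for one style leaves the history of the other style untouched, which is what allows the two styles to be refined independently.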
An information processing system according to one aspect of the present disclosure includes: an editing processing unit that edits, in response to a first instruction from a user, first time-series data representing a time series of feature quantities of a sound in which a symbol string is pronounced in a first pronunciation style, and that edits, in response to a second instruction from the user, second time-series data representing a time series of feature quantities of a sound in which the symbol string is pronounced in a second pronunciation style different from the first pronunciation style; and an information management unit that, each time the first time-series data is edited, saves first history data corresponding to the edited first time-series data as data of a new version, and that, each time the second time-series data is edited, saves second history data corresponding to the edited second time-series data as data of a new version. The information management unit acquires either first time-series data corresponding to the first history data that, among the plurality of saved first history data of different versions, accords with an instruction from the user, or second time-series data corresponding to the second history data that, among the plurality of saved second history data of different versions, accords with an instruction from the user. A program according to one aspect of the present disclosure causes a computer system to function as the above information processing system.
100 ... information processing system, 11 ... control device, 12 ... storage device, 13 ... sound emitting device, 14 ... display device, 15 ... operation device, 20 ... display control unit, 30 ... editing processing unit, 31 ... first editing unit, 32 ... first generation unit, 33 ... second editing unit, 34 ... second generation unit, 35 ... third editing unit, M1 ... first generation model, M2 ... second generation model.

Claims (7)

  1.  An information processing method implemented by a computer system, the method comprising:
      editing, in response to a first instruction from a user, first time-series data representing a time series of feature quantities of a sound in which a symbol string is pronounced in a first pronunciation style;
      each time the first time-series data is edited, saving first history data corresponding to the edited first time-series data as data of a new version;
      editing, in response to a second instruction from the user, second time-series data representing a time series of feature quantities of a sound in which the symbol string is pronounced in a second pronunciation style different from the first pronunciation style;
      each time the second time-series data is edited, saving second history data corresponding to the edited second time-series data as data of a new version; and
      acquiring either first time-series data corresponding to the first history data that, among the plurality of saved first history data of different versions, accords with an instruction from the user, or second time-series data corresponding to the second history data that, among the plurality of saved second history data of different versions, accords with an instruction from the user.
  2.  The information processing method according to claim 1, wherein the symbol string is a note sequence including a plurality of notes arranged in a time series.
  3.  The information processing method according to claim 2, wherein
      note sequence data representing the note sequence is edited in response to an instruction from the user, and
      the first time-series data and the second time-series data are generated from a common version of the note sequence data.
  4.  The information processing method according to any one of claims 1 to 3, wherein the acquiring acquires either the first history data after the immediately preceding edit, among the plurality of first history data, or the second history data after the immediately preceding edit, among the plurality of second history data.
  5.  The information processing method according to any one of claims 1 to 3, wherein the acquiring acquires either the first history data of a version designated by the user, among the plurality of first history data, or the second history data of a version designated by the user, among the plurality of second history data.
  6.  An information processing system comprising:
     an editing processing unit that edits, in accordance with a first instruction from a user, first time-series data representing a time series of feature amounts of a sound in which a symbol string is pronounced in a first pronunciation style, and edits, in accordance with a second instruction from the user, second time-series data representing a time series of feature amounts of a sound in which the symbol string is pronounced in a second pronunciation style different from the first pronunciation style; and
     an information management unit that saves, each time the first time-series data is edited, first history data corresponding to the edited first time-series data as a new version of data, and saves, each time the second time-series data is edited, second history data corresponding to the edited second time-series data as a new version of data,
     wherein the information management unit acquires either first time-series data corresponding to first history data that accords with an instruction from the user, among the saved plurality of first history data of different versions, or second time-series data corresponding to second history data that accords with an instruction from the user, among the saved plurality of second history data of different versions.
  7.  A program that causes a computer system to function as:
     an editing processing unit that edits, in accordance with a first instruction from a user, first time-series data representing a time series of feature amounts of a sound in which a symbol string is pronounced in a first pronunciation style, and edits, in accordance with a second instruction from the user, second time-series data representing a time series of feature amounts of a sound in which the symbol string is pronounced in a second pronunciation style different from the first pronunciation style; and
     an information management unit that saves, each time the first time-series data is edited, first history data corresponding to the edited first time-series data as a new version of data, and saves, each time the second time-series data is edited, second history data corresponding to the edited second time-series data as a new version of data,
     wherein the information management unit acquires either first time-series data corresponding to first history data that accords with an instruction from the user, among the saved plurality of first history data of different versions, or second time-series data corresponding to second history data that accords with an instruction from the user, among the saved plurality of second history data of different versions.
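
The editing, version-saving, and acquisition steps recited in the claims can be illustrated with a short sketch. This is only one possible reading and not the applicant's implementation. The class names `StyleHistory` and `Editor`, the style keys `style_1` and `style_2`, and the pitch-like numbers are all hypothetical. The sketch also assumes that the history data is a full snapshot of the edited time series and that the unedited series is stored as version 0; the claims require neither.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StyleHistory:
    """Version history for the time-series data of one pronunciation style."""
    versions: List[List[float]] = field(default_factory=list)

    def save(self, series: List[float]) -> int:
        """Store a snapshot of the series as a new version and return its number."""
        self.versions.append(list(series))
        return len(self.versions) - 1

    def get(self, version: Optional[int] = None) -> List[float]:
        """Return a copy of the given version (the latest one when version is None)."""
        index = len(self.versions) - 1 if version is None else version
        if not 0 <= index < len(self.versions):
            raise IndexError(f"no such version: {version}")
        return list(self.versions[index])


class Editor:
    """Keeps one working series and one independent history per style."""

    def __init__(self, initial: Dict[str, List[float]]) -> None:
        self.current = {style: list(series) for style, series in initial.items()}
        self.histories = {style: StyleHistory() for style in initial}
        # Assumption: the unedited series is kept as version 0 of each style.
        for style, series in self.current.items():
            self.histories[style].save(series)

    def edit(self, style: str, index: int, value: float) -> int:
        """Apply a user edit to one style and save the result as a new version."""
        self.current[style][index] = value
        return self.histories[style].save(self.current[style])

    def acquire(self, style: str, version: Optional[int] = None) -> List[float]:
        """Restore the latest or a user-designated version of one style."""
        self.current[style] = self.histories[style].get(version)
        return list(self.current[style])


# Hypothetical feature values (for example, pitch in Hz) for the same symbol string.
editor = Editor({
    "style_1": [440.0, 440.0, 494.0],
    "style_2": [440.0, 466.0, 494.0],
})
editor.edit("style_1", 1, 450.0)  # saved as version 1 of style_1
editor.edit("style_1", 2, 500.0)  # saved as version 2 of style_1
editor.edit("style_2", 0, 430.0)  # saved as version 1 of style_2
restored = editor.acquire("style_1", version=1)
```

Because each style has its own history, restoring an earlier version of one style leaves the other style's versions untouched. Calling `acquire` without a version number returns the most recently saved version for that style.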
PCT/JP2020/037966 2020-10-07 2020-10-07 Information processing method, information processing system, and program WO2022074754A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2022555020A JPWO2022074754A1 (en) 2020-10-07 2020-10-07
CN202080105738.8A CN116324965A (en) 2020-10-07 2020-10-07 Information processing method, information processing system, and program
PCT/JP2020/037966 WO2022074754A1 (en) 2020-10-07 2020-10-07 Information processing method, information processing system, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/037966 WO2022074754A1 (en) 2020-10-07 2020-10-07 Information processing method, information processing system, and program

Publications (1)

Publication Number Publication Date
WO2022074754A1 (en) 2022-04-14

Family

ID=81125769

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/037966 WO2022074754A1 (en) 2020-10-07 2020-10-07 Information processing method, information processing system, and program

Country Status (3)

Country Link
JP (1) JPWO2022074754A1 (en)
CN (1) CN116324965A (en)
WO (1) WO2022074754A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019239971A1 (en) * 2018-06-15 2019-12-19 ヤマハ株式会社 Information processing method, information processing device and program

Also Published As

Publication number Publication date
JPWO2022074754A1 (en) 2022-04-14
CN116324965A (en) 2023-06-23

Similar Documents

Publication Publication Date Title
JP6547878B1 (en) Electronic musical instrument, control method of electronic musical instrument, and program
JP6610715B1 (en) Electronic musical instrument, electronic musical instrument control method, and program
JP6610714B1 (en) Electronic musical instrument, electronic musical instrument control method, and program
JP3102335B2 (en) Formant conversion device and karaoke device
US7094962B2 (en) Score data display/editing apparatus and program
US5939654A (en) Harmony generating apparatus and method of use for karaoke
JP6728754B2 (en) Pronunciation device, pronunciation method and pronunciation program
JP2022116335A (en) Electronic musical instrument, method, and program
JP6784022B2 (en) Speech synthesis method, speech synthesis control method, speech synthesis device, speech synthesis control device and program
JP7180587B2 (en) Electronic musical instrument, method and program
CN111696498A (en) Keyboard musical instrument and computer-implemented method of keyboard musical instrument
JP4274272B2 (en) Arpeggio performance device
JP5136128B2 (en) Speech synthesizer
JPH10319947A (en) Pitch extent controller
JP6835182B2 (en) Electronic musical instruments, control methods for electronic musical instruments, and programs
WO2022074754A1 (en) Information processing method, information processing system, and program
WO2022074753A1 (en) Information processing method, information processing system, and program
JP6801766B2 (en) Electronic musical instruments, control methods for electronic musical instruments, and programs
JP6819732B2 (en) Electronic musical instruments, control methods for electronic musical instruments, and programs
CN115349147A (en) Sound signal generation method, estimation model training method, sound signal generation system, and program
JP5106437B2 (en) Karaoke apparatus, control method therefor, and control program therefor
JP4240099B2 (en) Electronic musical instrument and electronic musical instrument control program
WO2004025306A1 (en) Computer-generated expression in music production
JP7276292B2 (en) Electronic musical instrument, electronic musical instrument control method, and program
JP5953743B2 (en) Speech synthesis apparatus and program

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20956702

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022555020

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20956702

Country of ref document: EP

Kind code of ref document: A1