CN112466266A

CN112466266A - Control system and control method

Info

Publication number: CN112466266A
Application number: CN202010876140.0A
Authority: CN
Inventors: 前泽阳
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2019-09-06
Filing date: 2020-08-27
Publication date: 2021-03-09
Anticipated expiration: 2040-08-27
Also published as: CN112466266B; JP2021043258A; JP7383943B2

Abstract

Provided are a control system and a control method capable of estimating the timing of occurrence of an event based on the motion of a face. The disclosed device is provided with: an acquisition unit that acquires image information; a determination unit that detects, based on the image information, a movement of a face portion and a direction of a line of sight in a captured image indicated by the image information, and determines, using a result of the detection, whether or not a preliminary operation associated with a motion of a sign indicating a timing at which an event occurs has been performed; an estimating unit that estimates a timing of causing an event to occur based on the image information when the determining unit determines that the preliminary operation has been performed; and an output unit that outputs the estimation result estimated by the estimation unit.

Description

Control system and control method

Technical Field

The present invention relates to a control system and a control method.

Background

Conventionally, a score alignment technique has been proposed in which a current position of a performance (hereinafter referred to as "performance position") in a music piece is estimated by analyzing a sound used to perform the music piece (for example, patent document 1).

Documents of the prior art

Patent document

Patent document 1: japanese laid-open patent publication No. 2015-79183

Disclosure of Invention

Problems to be solved by the invention

In addition, in an ensemble system in which a player and an automatic playing musical instrument or the like perform ensemble, for example, the following processing is performed: based on the estimation result of the position on the score of the performance performed by the player, the timing of the event at which the automatic playing musical instrument makes the next sound is envisioned. However, in actual human and human ensemble, timing may be matched by a secret sign operation such as eye contact when matching the start of a musical composition, the recovery of an extension symbol (fermata), the final sound of a musical composition, and the like.

The present invention has been made in view of such a situation, and an object thereof is to provide a control system and a control method capable of estimating the timing of occurrence of an event based on the motion of a face.

Means for solving the problems

In order to solve the above problem, one aspect of the present invention is a control system including: an acquisition unit that acquires image information including a user captured over time; a determination unit configured to determine whether or not a preliminary operation has been performed, based on the movement of the face of the user detected from the image information and the direction of the line of sight; an estimation unit that estimates a timing at which an event occurs when it is determined that the preliminary operation has been performed; and an output unit that outputs the estimation result estimated by the estimation unit.

In order to solve the above problem, one aspect of the present invention is a control system including: an acquisition unit that acquires image information; a determination unit that detects, based on the image information, a movement of a face portion and a direction of a line of sight in a captured image indicated by the image information, and determines, using a result of the detection, whether or not a preliminary operation associated with a motion of a sign indicating a timing at which an event occurs has been performed; an estimating unit that estimates a timing of occurrence of an event from the motion of the logo based on the image information when the determining unit determines that the preliminary motion is performed; and an output unit that outputs the estimation result estimated by the estimation unit.

In addition, an aspect of the present invention is a control method in which an acquisition unit acquires image information; a determination unit that detects, based on the image information, a movement of a face portion and a direction of a line of sight in a captured image indicated by the image information, and determines, using a result of the detection, whether or not a preliminary operation associated with a motion of a sign indicating a timing at which an event occurs has been performed; an estimating unit that estimates a timing of occurrence of an event from the motion of the logo based on the image information when the determining unit determines that the preliminary motion is performed; the output section outputs the estimation result estimated by the estimation section.

Effects of the invention

According to the present invention, the timing of causing an event to occur can be estimated based on the motion of the face.

Drawings

Fig. 1 is a block diagram of an automatic playing system according to an embodiment of the present invention.

Fig. 2 is an explanatory diagram of the combination motion and the performance position.

Fig. 3 is an explanatory diagram of image synthesis performed by the image synthesis unit.

Fig. 4 is an explanatory diagram of a relationship between a performance position of a performance object song and an instruction position of an automatic performance.

Fig. 5 is an explanatory diagram of the relationship between the position of the combination action and the start point of the performance target track.

Fig. 6 is an explanatory diagram of a performance image.

Fig. 7 is an explanatory diagram of a performance image.

Fig. 8 is a flowchart of the operation of the control device.

Fig. 9 is a block diagram of an analysis processing unit in embodiment 2.

Fig. 10 is an explanatory diagram of an operation of the analysis processing unit in embodiment 2.

Fig. 11 is a flowchart of the operation of the analysis processing unit in embodiment 2.

Fig. 12 is a block diagram of the automatic playing system.

Fig. 13 is a simulation result of the sounding timing of the player and the sounding timing of the accompaniment part.

Fig. 14 is the evaluation result of the automatic playing system.

Fig. 15 is a block diagram of the detection processing unit 524 in embodiment 3.

Fig. 16 is a flowchart of the operation of the detection processing unit 524 in embodiment 3.

Description of the reference symbols

100 … automatic musical performance system, 12 … control device, 22 … recording device, 222 … camera, 52 … dark sign detector, 522 … image synthesizer, 524 … detection processor, 5240 … acquirer, 5241 … determiner, 5242 … estimator, 5243 … output, 5244 … face extraction model, 5245 … dark sign motion estimator model

Detailed Description

< embodiment 1 >

Fig. 1 is a block diagram of an automatic playing system 100 according to embodiment 1 of the present invention. The automatic playing system 100 is a computer system that is installed in a space such as a concert hall where a plurality of players P play musical instruments, and executes automatic playing of music to be played (hereinafter referred to as "music to be played") in parallel with the playing of music by the plurality of players P. Note that the player P is typically a player of a musical instrument, but a singer of a music track to be played may be the player P. That is, the "performance" in the present application includes not only the performance of musical instruments but also singing. Further, a person who is not actually responsible for the performance of the musical instrument (for example, a conductor at a concert, or a sound monitor at a recording) may be included in the player P.

As illustrated in fig. 1, the automatic playing system 100 of the present embodiment includes a control device 12, a storage device 14, a recording device 22, an automatic playing device 24, and a display device 26. The control device 12 and the storage device 14 are implemented by an information processing device such as a personal computer, for example.

The control device 12 is a Processing circuit such as a CPU (Central Processing Unit), and integrally controls each element of the automatic playing system 100. The storage device 14 is configured by a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or a combination of a plurality of types of recording media, and stores a program executed by the control device 12 and various data used by the control device 12. Further, a storage device 14 (for example, cloud storage) separate from the automatic playing system 100 may be prepared, and the control device 12 may execute writing and reading to and from the storage device 14 via a communication network such as a mobile communication network or the internet. That is, the storage device 14 may be omitted from the automatic playing system 100.

The storage device 14 of the present embodiment stores music data M. The music data M specifies performance contents of the performance object track of the automatic performance. As the music data M, for example, a File (SMF: Standard MIDI File) in a form complying with the MIDI (Musical Instrument Digital Interface) Standard is preferable. Specifically, the music data M is time-series data in which instruction data indicating performance contents and time data indicating the occurrence time of the instruction data are arranged. The indicating data indicates a pitch (note number) and intensity (velocity) and indicates various events such as sound production and sound attenuation. The time data specifies, for example, an interval (time difference) of the indicating data immediately before and after.

The automatic playing device 24 of fig. 1 executes an automatic performance of a performance target song based on the control of the control device 12. Specifically, of the plurality of playing sections constituting the performance target music, a playing section different from the playing sections (for example, stringed musical instruments) of the plurality of players P is automatically played by the automatic playing device 24. The automatic playing device 24 of the present embodiment is a keyboard musical instrument (i.e., an automatic player piano) provided with a driving mechanism 242 and a sound generating mechanism 244. The sound generating mechanism 244 is a string striking mechanism that sounds strings (i.e., sound generating bodies) in conjunction with the displacement of each key of the keyboard, as in a piano of a natural musical instrument. Specifically, the sound generation mechanism 244 includes, for each key, an operation mechanism including a hammer capable of striking a string and a plurality of transmission members (for example, a wippen (wihippen), a jack (jack), and a repetition lever (repetition lever)) for transmitting the displacement of the key to the hammer. The driving mechanism 242 performs the automatic performance of the performance target song by driving the sound emission mechanism 244. Specifically, the driving mechanism 242 includes a plurality of drivers (for example, actuators such as solenoids) for displacing the keys, and a driving circuit for driving the drivers. The driving mechanism 242 drives the sound emission mechanism 244 in accordance with an instruction from the control device 12, thereby realizing automatic performance of the performance target track. The control device 12 or the storage device 14 may be mounted on the automatic playing device 24.

The recording device 22 records a situation where a plurality of players P perform the performance on the performance target tracks. As illustrated in fig. 1, the recording device 22 of the present embodiment includes a plurality of imaging devices 222 and a plurality of sound pickup devices 224. The imaging device 222 is provided for each player P, and generates an image signal V0 by imaging the player P. The image signal V0 is a signal representing a moving image of the player P. The sound pickup device 224 is provided for each player P, and picks up sound (for example, musical instrument sound or singing sound) generated by the performance (for example, musical instrument performance or singing) of the player P to generate an acoustic signal a 0. The acoustic signal a0 is a signal representing the waveform of sound. As understood from the above description, the plurality of image signals V0 photographing different players P and the plurality of acoustic signals a0 picking up sounds played by different players P are recorded. In addition, the acoustic signal a0 output from an electronic musical instrument such as an electronic stringed instrument may be used. Therefore, the sound pickup device 224 may be omitted.

The control device 12 executes the program stored in the storage device 14, thereby realizing a plurality of functions (the combination detection unit 52, the performance analysis unit 54, the performance control unit 56, and the display control unit 58) for realizing the automatic performance of the music piece to be performed. Further, the functions of the control device 12 may be realized by a set of a plurality of devices (i.e., a system), or a part or all of the functions of the control device 12 may be realized by a dedicated electronic circuit. Further, a server device located apart from a space such as a concert hall where the recording device 22, the automatic playing device 24, and the display device 26 are installed may realize a part or all of the functions of the control device 12.

Each player P executes an action of a trumpet (hereinafter referred to as "trumpet action") that becomes a performance of the performance target track. The motion of a secret sign is a motion (gesture) indicating 1 time point on the time axis. For example, a motion of the player P lifting up his own musical instrument or a motion of the player P moving his own body is a preferable example of the combination motion. For example, as illustrated in fig. 2, the specific player P who is dominating the performance of the performance target track performs the dark sign operation at a time Q that is ahead of the start point at which the performance of the performance target track should start by a predetermined period (hereinafter referred to as "preparation period") B. The preparation period B is, for example, a period corresponding to 1 beat of the performance target music. Therefore, the duration of the preparation period B varies according to the performance tempo (beat) of the performance target music. For example, the faster the performance speed, the shorter the preparation period B is. The player P executes the trumpet motion at a time point advanced by the preparation period B corresponding to 1 beat from the start point of the performance target track based on the assumed performance speed of the performance target track, thereby starting the performance of the performance target track at the arrival of the start point. The combination act is used as a trigger for the automatic performance of the automatic playing device 24 in addition to a trigger for the performance of another player P. The length of the preparation period B is arbitrary, and may be set to a length corresponding to a plurality of beats, for example.

The medal detecting section 52 of fig. 1 detects the medal movement of the player P. Specifically, the combination detector 52 analyzes the image of the performer P photographed by each of the photographing devices 222 to detect the combination operation. As illustrated in fig. 1, the sign detection unit 52 of the present embodiment includes an image combining unit 522 and a detection processing unit 524. The image synthesizing unit 522 synthesizes the plurality of image signals V0 generated by the plurality of imaging devices 222 to generate an image signal V. As illustrated in fig. 3, the image signal V is a signal indicating an image in which a plurality of moving images (#1, #2, #3, … …) indicated by the image signal V0 are arranged. That is, the image signal V representing the moving images of the plurality of players P is supplied from the image synthesizing unit 522 to the detection processing unit 524.

The detection processing unit 524 analyzes the image signal V generated by the image synthesizing unit 522 to detect a signature motion of one of the players P. For the detection of the signature operation by the detection processing unit 524, a known image analysis technique including an image recognition process of extracting an element (for example, a body or a musical instrument) moved when the player P performs the signature operation from an image and a moving body detection process of detecting the movement of the element can be used. Further, a recognition model such as a neural network or a multi-branch tree may be used for detecting the signature operation. For example, machine learning (e.g., deep learning) of the recognition model is performed in advance using feature quantities extracted from image signals obtained by imaging performances of a plurality of players P as given learning data. The detection processing unit 524 detects a motion of a trumpet by applying a feature amount extracted from the image signal V in a scene in which the automatic performance is actually performed to the recognition model after machine learning.

The performance analysis unit 54 in fig. 1 sequentially estimates the positions (hereinafter referred to as "performance positions") T at which the plurality of players P are currently performing in the performance target track in parallel with the performance of each player P. Specifically, the performance analysis unit 54 estimates the performance position T by analyzing the sound collected by each of the plurality of sound collection devices 224. As illustrated in fig. 1, the performance analysis unit 54 of the present embodiment includes an acoustic mixing unit 542 and an analysis processing unit 544. The acoustic mixing unit 542 generates an acoustic signal a by mixing the acoustic signals a0 generated by the acoustic pickup devices 224. That is, the acoustic signal a is a signal representing a mixed sound of a plurality of sounds represented by different acoustic signals a 0.

The analysis processing unit 544 analyzes the acoustic signal a generated by the acoustic mixing unit 542 to estimate the performance position T. For example, the analysis processing unit 544 specifies the performance position T by matching the sound indicated by the acoustic signal a with the performance content of the performance target song indicated by the music data M. The analysis processing unit 544 according to the present embodiment estimates a playing speed (tempo) R of the performance target music through analysis of the acoustic signal a. For example, the analysis processing unit 544 specifies the performance tempo R from the temporal change of the performance position T (i.e., the change of the performance position T in the time axis direction). Note that, for estimation of the performance position T and the performance tempo R by the analysis processing unit 544, a known acoustic analysis technique (score alignment) may be arbitrarily employed. For example, the analysis technique disclosed in patent document 1 can be used for estimation of the performance position T and the performance tempo R. Further, a recognition model such as a neural network or a multi-branch tree may be used for estimation of the performance position T and the performance tempo R. For example, machine learning (e.g., deep learning) for generating a recognition model is performed before an automatic performance using, as given learning data, feature amounts extracted from an acoustic signal a in which performances of a plurality of players P are picked up. The analysis processing unit 544 estimates the performance position T and the performance tempo R by applying the feature quantities extracted from the acoustic signal a in the scene in which the automatic performance is actually performed to the recognition model generated by the machine learning.

The detection of the trumpet motion by the trumpet detecting section 52 and the estimation of the performance position T and the performance tempo R by the performance analyzing section 54 are performed in parallel with the performance of the performance target tracks of the plurality of players P in real time. For example, the detection of the motion of the combination and the estimation of the performance position T and the performance tempo R are repeated in a predetermined cycle. However, regardless of the difference between the period of detection of the motion of the combination and the periods of estimation of the performance position T and the performance tempo R.

The performance control unit 56 in fig. 1 causes the automatic playing device 24 to execute the automatic playing of the performance target music piece so that the signature motion detected by the signature detection unit 52 and the performance position T estimated by the performance analysis unit 54 are synchronized. Specifically, the performance control unit 56 instructs the automatic playing device 24 to start the automatic playing by using the detection of the combination motion by the combination detection unit 52 as a trigger, and instructs the automatic playing device 24 to specify the performance content specified by the music data M at the time point corresponding to the performance position T in the performance target song. That is, the performance control unit 56 is a sequencer (sequencer) that sequentially supplies the instruction data included in the music data M of the performance target music to the automatic playing device 24. The automatic playing device 24 executes the automatic playing of the performance target song in accordance with the instruction from the playing control section 56. Since the performance position T moves rearward in the performance target track as the performance of the plurality of players P progresses, the automatic performance of the performance target track by the automatic playing apparatus 24 progresses together with the movement of the performance position T. As understood from the above description, the performance control section 56 instructs the automatic playing device 24 to perform the automatic performance so that the tempo of the performance and the timing of each tone are synchronized with the performance of the plurality of players P while maintaining the music performance such as the intensity of each sound of the performance target music or the expression of phrase (phrase) as the content specified by the music data M. Therefore, for example, when music data M indicating the performance of a specific player (for example, a past player who has died) is used, it is possible to create an atmosphere in which the player and a plurality of players P who are actually present cooperate in a harmonious manner with each other in breathing while faithfully reproducing the musical expression unique to the player through automatic performance.

Further, it takes about several hundred milliseconds from the time when the performance control section 56 instructs the automatic playing device 24 to perform automatic playing by the output of the instruction data until the automatic playing device 24 actually sounds (for example, the hammer of the sound emitting mechanism 244 strikes a string). That is, the actual sound production of the automatic playing device 24 is inevitably delayed with respect to the instruction from the performance control section 56. Consequently, in the configuration in which the performance control section 56 instructs the automatic playing device 24 to perform the performance at the performance position T itself estimated by the performance analysis section 54 on the performance target track, the sound emission of the automatic playing device 24 is delayed with respect to the performance of the plurality of players P.

Therefore, as illustrated in fig. 2, the performance control unit 56 of the present embodiment instructs the automatic playing device 24 to perform the performance at the time TA after the performance position T estimated by the performance analysis unit 54 in the target music piece. That is, the performance control section 56 preliminarily reads the instruction data in the music data M of the performance target track so that the delayed sounding is synchronized with the performance of the plurality of players P (for example, a specific note of the performance target track is played substantially simultaneously between the automatic playing device 24 and each player P).

Fig. 4 is an explanatory diagram of temporal changes in the performance position T. The variation amount of the performance position T per unit time (the slope of the straight line in fig. 4) corresponds to the performance speed R. In fig. 4, for the sake of simplicity, a case where the performance tempo R is maintained constant is illustrated.

As illustrated in fig. 4, the performance control unit 56 instructs the automatic playing device 24 to perform the performance of the music piece to be played at the time TA after the adjustment amount α with respect to the playing position T. The adjustment amount α is variably set based on the delay amount D from the instruction of the automatic performance by the performance control section 56 until the automatic performance apparatus 24 actually sounds and the performance tempo R estimated by the performance analysis section 54. Specifically, the performance control unit 56 sets the length of the section of performance progress of the performance target music for the delay amount D based on the performance tempo R as the adjustment amount α. Therefore, the adjustment amount α becomes a larger value as the performance speed R is faster (the gradient of the straight line in fig. 4 is steeper). In fig. 4, it is assumed that the performance tempo R is maintained constant throughout the entire section of the performance target music, but the performance tempo R may actually vary. Therefore, the adjustment amount α can vary with time in conjunction with the performance speed R.

The delay amount D is set in advance to a predetermined value (for example, from several tens to several hundreds of milliseconds) corresponding to the measurement result of the automatic playing apparatus 24. In addition, in the actual automatic playing device 24, the delay amount D may be different depending on the pitch or intensity of the performance. Therefore, the delay amount D (further, the adjustment amount α depending on the delay amount D) may be variably set according to the pitch or intensity of the note that is the subject of the automatic playing.

The performance control unit 56 instructs the automatic playing device 24 to start the automatic performance of the performance target music piece in accordance with the combination of the. Fig. 5 is an explanatory diagram of the relationship between the combination action and the automatic performance. As illustrated in fig. 5, the performance control section 56 starts an instruction for the automatic performance of the automatic performance device 24 at a time QA when a time period δ has elapsed from a time Q when the motion of the combination is detected. The time length δ is a time length obtained by subtracting the delay amount D of the automatic performance from the time length τ corresponding to the preparation period B. The time length τ of the preparation period B varies according to the performance speed R of the performance target music. Specifically, the faster the performance speed R (the steeper the slope of the straight line in fig. 5), the shorter the period τ of the preparation period B becomes. However, at the time QA of the trumpet action, the performance of the target song is not started yet, and therefore the performance speed R is not estimated. Therefore, the performance control unit 56 calculates the time length τ of the preparation period B from the standard performance tempo (standard tempo) R0 assumed for the performance target track. The performance tempo R0 is specified in the music data M, for example. However, the tempo that the plurality of players P recognize with respect to the performance target track (for example, the tempo assumed during the performance practice) may be set to the performance tempo R0.

As described above, the performance control unit 56 instructs the start of the automatic performance at the time QA when the time period δ (δ τ -D) has elapsed from the time QA when the trumpet is operated. Therefore, the sounding of the automatic playing device 24 is started at the time QB (i.e., the time when the plurality of players P start playing) when the preparatory period B has elapsed from the time Q of the trumpet action. That is, the automatic performance by the automatic playing device 24 is started substantially simultaneously with the start of the performance target tracks of the plurality of players P. The control of the automatic performance by the performance control unit 56 according to the present embodiment is as described above as an example.

The display control section 58 of fig. 1 causes an image (hereinafter referred to as "performance image") G visually representing the progress of the automatic performance apparatus 24 to be displayed on the display device 26. Specifically, the display control unit 58 generates image data representing the performance image G and outputs the image data to the display device 26, thereby displaying the performance image G on the display device 26. The display device 26 displays the performance image G instructed from the display control section 58. Such as a liquid crystal display panel or a projector, is a preferred example of the display device 26. The plurality of players P can visually confirm the performance images G displayed on the display device 26 at any time in parallel with the performance of the performance target tracks.

The display control unit 58 of the present embodiment displays a moving image that dynamically changes in conjunction with the automatic performance of the automatic performance device 24 on the display device 26 as a performance image G. Fig. 6 and 7 show examples of the performance image G. As illustrated in fig. 6 and 7, the musical performance image G is a stereoscopic image in which a display body (object) 74 is disposed in a virtual space 70 having a bottom surface 72. As illustrated in fig. 6, the display body 74 is a substantially spherical solid that floats in the virtual space 70 and descends at a predetermined speed. A shadow 75 of the display 74 is displayed on the bottom surface 72 of the virtual space 70, and the shadow 75 approaches the display 74 on the bottom surface 72 as the display 74 descends. As illustrated in fig. 7, when the sound emission of the automatic playing device 24 is started, the display 74 rises to a predetermined height in the virtual space 70, and the shape of the display 74 is irregularly deformed while the sound emission is continuing. When the sound emission of the automatic musical performance is stopped (muffled), the irregular deformation of the display 74 is stopped and returns to the original shape (spherical shape) of fig. 6, and the display 74 moves to a state of descending at a predetermined speed. The above operations (rise and deformation) of the display 74 are repeated for each sound of the automatic performance. For example, the display 74 is lowered before the start of the performance target track, and the direction of movement of the display 74 is changed from being lowered to being raised when the note at the start point of the performance target track is sounded by the automatic performance. Therefore, the player P who visually confirms the performance image G displayed on the display device 26 can grasp the timing of sounding of the automatic performance apparatus 24 by the transition from the fall to the rise of the display body 74.

The display control section 58 of the present embodiment controls the display device 26 to display the performance image G exemplified above. Further, the delay from when the display control unit 58 instructs the display device 26 to display or change the image until the instruction is reflected on the display image of the display device 26 is sufficiently smaller than the delay amount D of the automatic performance device 24. Therefore, the display control unit 58 causes the display device 26 to display the performance image G corresponding to the performance content of the performance position T itself estimated by the performance analysis unit 54 in the performance target song. Therefore, as described above, the performance image G dynamically changes in synchronization with the actual sound emission of the automatic performance apparatus 24 (at a time delayed by the delay amount D from the instruction of the performance control section 56). That is, at the point when the automatic playing apparatus 24 actually starts sounding each note of the performance target song, the movement of the display 74 of the performance image G is shifted from descending to ascending. Therefore, each player P can visually confirm the timing at which each note of the performance target song is sounded by the automatic playing device 24.

Fig. 8 is a flowchart illustrating the operation of the control device 12 of the automatic playing system 100. For example, the process of fig. 8 is started in parallel with the performance of the target music tracks of the plurality of players P, with the interrupt signal generated at a predetermined cycle as a trigger. When the processing of fig. 8 is started, the control device 12 (the signature detection unit 52) analyzes the plurality of image signals V0 supplied from the plurality of imaging devices 222, and determines whether or not there is a signature operation from an arbitrary player P (SA 1). The control device 12 (performance analysis unit 54) analyzes the plurality of acoustic signals a0 supplied from the plurality of sound pickup devices 224 to estimate a performance position T and a performance tempo R (SA 2). In addition, the order of detection of the motion of the combination (SA1) and estimation of the performance position T and performance tempo R (SA2) may be reversed.

The control device 12 (performance control section 56) instructs the automatic playing device 24 on the automatic playing corresponding to the playing position T and the playing tempo R (SA 3). Specifically, the automatic playing device 24 is caused to execute the automatic playing of the performance target music so that the signature motion detected by the signature detection section 52 and the performance position T estimated by the performance analysis section 54 are synchronized. Further, the control device 12 (display control section 58) causes the display device 26 to display a performance image G for expressing the progress of the automatic performance (SA 4).

In the above illustrated embodiment, the automatic performance of the automatic playing device 24 is performed so as to synchronize the combination movements of the player P and the progress of the performance position T, while the performance image G for indicating the progress of the automatic performance of the automatic playing device 24 is displayed on the display device 26. Therefore, the player P can visually confirm the progress of the automatic performance of the automatic playing device 24 to be reflected on the performance of the player P. That is, a natural ensemble in which the performances of the plural players P and the automatic performance of the automatic playing device 24 interact with each other is realized. In the present embodiment, in particular, there are advantages as follows: since the performance image G dynamically changing in accordance with the performance contents of the automatic performance is displayed on the display device 26, the player P can visually and intuitively grasp the progress of the automatic performance.

In the present embodiment, the performance contents at the time point TA later in time than the performance position T estimated by the performance analysis unit 54 are indicated to the automatic performance device 24. Therefore, even in the case where the actual sounding of the automatic playing device 24 is delayed with respect to the instruction of the performance by the performance control section 56, the performance of the player P and the automatic performance can be synchronized with high accuracy. Further, the automatic playing device 24 is instructed to play at a time TA after the playing position T by the variable adjustment amount α corresponding to the playing speed R estimated by the playing analyzing unit 54. Therefore, even when the performance tempo R fluctuates, for example, the performance of the player and the automatic performance can be synchronized with high accuracy.

< embodiment 2 >

Embodiment 2 of the present invention will be described. In the following embodiments, elements having the same functions or functions as those of embodiment 1 are not described in detail along with the reference numerals used in the description of embodiment 1.

Fig. 9 is a block diagram illustrating the configuration of the analysis processing unit 544 according to embodiment 2. As illustrated in fig. 9, the analysis processing unit 544 according to embodiment 2 includes a likelihood calculation unit 82 and a position estimation unit 84. Fig. 10 is an explanatory diagram of the operation of the likelihood calculation unit 82.

The likelihood calculation unit 82 calculates the observation likelihood L at each of the plurality of time points t in the performance target track in parallel with the performance of the performance target track by the plurality of players P. That is, the distribution of the observation likelihoods L at a plurality of points t within the musical performance target song (hereinafter referred to as "observation likelihood distribution") is calculated. The observation likelihood distribution is calculated for the acoustic signal a for each unit section (frame) divided on the time axis. The observation likelihood L at an arbitrary 1 time point t in the observation likelihood distribution calculated for 1 unit section of the acoustic signal a is an index of the accuracy with which the sound represented by the acoustic signal a in the unit section is uttered at the time point t in the musical performance target song. The observation likelihood L is also an index of the accuracy with which the plurality of players P perform the performance at each time point t in the performance target track. That is, it is highly likely that the time point t at which the observation likelihood L calculated for an arbitrary 1 unit section is high matches the sound emission position of the sound indicated by the acoustic signal a in the unit section. The unit sections immediately before and after may overlap each other on the time axis.

As illustrated in fig. 9, the likelihood calculation unit 82 according to embodiment 2 includes a1 st calculation unit 821, a2 nd calculation unit 822, and a3 rd calculation unit 823. The 1 st arithmetic unit 821 calculates the 1 st likelihood L1(a), and the 2 nd arithmetic unit 822 calculates the 2 nd likelihood L2 (C). The 3 rd arithmetic unit 823 calculates the distribution of the observation likelihood L by multiplying the 1 st likelihood L1(a) calculated by the 1 st arithmetic unit 821 and the 2 nd likelihood L2(C) calculated by the 2 nd arithmetic unit 822. That is, the observation likelihood L is expressed by a product of the 1 st likelihood L1(a) and the 2 nd likelihood L2(C) (L ═ L1(a) L2 (C)).

The 1 st arithmetic unit 821 compares the acoustic signal a in each unit section with the music data M of the performance target track, and calculates the 1 st likelihood L1(a) for each of the plurality of time points t in the performance target track. That is, as illustrated in fig. 10, the distribution of the 1 st likelihood L1(a) at a plurality of time points t within the performance target track is calculated for each unit section. The 1 st likelihood L1(a) is a likelihood calculated by analysis of the acoustic signal a. The 1 st likelihood L1(a) calculated for an arbitrary 1 time point t by analyzing 1 unit section of the acoustic signal a is an index of the accuracy with which the sound represented by the acoustic signal a in the unit section is uttered at the time point t in the musical performance target song. At a time point t with a high possibility of matching the performance position of 1 unit section of the acoustic signal a among the plurality of time points t on the time axis, there is a peak of the 1 st likelihood L1 (a). As for the method of calculating the 1 st likelihood L1(a) from the acoustic signal a, for example, the technique of japanese patent application laid-open No. 2014-178395 can be preferably utilized.

The 2 nd arithmetic unit 822 in fig. 9 calculates the 2 nd likelihood L2(C) corresponding to the detection of the presence or absence of the motion of the signature. Specifically, the 2 nd likelihood L2(C) is calculated based on a variable C indicating the presence or absence of a beacon operation. The variable C is notified from the sign detection unit 52 to the likelihood calculation unit 82. The variable C is set to 1 when the combination detector 52 detects the combination operation, and is set to 0 when the combination detector 52 does not detect the combination operation. The value of the variable C is not limited to 2 values, i.e., 0 and 1. For example, the variable C when the sign operation is not detected may be set to a predetermined positive number (however, a numerical value smaller than the numerical value of the variable C when the sign operation is detected).

As illustrated in fig. 10, a plurality of reference points a are specified on the time axis of the performance target track. The reference point a is, for example, the start time of the music piece or the time when the performance is restarted after a long-time pause indicated by an extension symbol or the like. For example, the respective timings of the plurality of reference points a within the performance target track are specified by the music data M.

As illustrated in fig. 10, the 2 nd likelihood L2(C) is maintained at 1 in a unit interval (C is 0) in which no secret sign operation is detected. On the other hand, in the unit section (C ═ 1) in which the secret sign operation is detected, the 2 nd likelihood L2(C) is set to 0 (exemplified by the 2 nd value) in the period ρ extending from each reference point a to the front side on the time axis by a predetermined length (hereinafter referred to as "reference period"), and is set to 1 (exemplified by the 1 st value) in the periods other than each reference period ρ. The reference period ρ is set to a period of time of about 1 beat corresponding amount to 2 beats corresponding amount of the performance target music, for example. As described above, the observation likelihood L is calculated by the product of the 1 st likelihood L1(a) and the 2 nd likelihood L2 (C). Therefore, when the motion of the combination is detected, the observation likelihood L in the reference period ρ in front of each of the plurality of reference points a specified for the performance target track is reduced to 0. On the other hand, when no sign operation is detected, the 2 nd likelihood L2(C) is maintained at 1, and therefore the 1 st likelihood L1(a) is calculated as the observation likelihood L.

The position estimating unit 84 in fig. 9 estimates the musical performance position T based on the observation likelihood L calculated by the likelihood calculating unit 82. Specifically, the position estimating unit 84 calculates the posterior distribution of the performance position T from the observation likelihood L, and estimates the performance position T from the posterior distribution. The posterior distribution of the performance position T is a probability distribution of posterior probabilities of the position T in the target music piece when the sound is emitted in the unit section under the condition that the acoustic signal a in the unit section is observed. In the calculation of the posterior distribution using the observation likelihood L, for example, as disclosed in japanese patent application laid-open No. 2015-79183, a known statistical process such as bayesian estimation using a Hidden Semi Markov Model (HSMM) is used.

As described above, since the observation likelihood L is set to 0 in the reference period ρ in front of the reference point a corresponding to the secret sign operation, the posterior distribution becomes effective in the section following the reference point a. Therefore, the time after the reference point a corresponding to the motion of the trumpet is estimated as the performance position T. Further, the position estimating section 84 determines the performance tempo R from the temporal change of the performance position T. The configuration and operation other than the analysis processing unit 544 are the same as those of embodiment 1.

Fig. 11 is a flowchart illustrating the contents of the process (step SA2 of fig. 8) in which the analysis processing unit 544 estimates the performance position T and the performance tempo R. The processing of fig. 11 is executed for each unit section on the time axis in parallel with the performance of the performance target tracks of the plurality of players P.

The 1 st arithmetic unit 821 analyzes the acoustic signal a in the unit section, and calculates the 1 st likelihood L1(a) at each of the plurality of time points t in the target musical performance track (SA 21). Further, the 2 nd arithmetic unit 822 calculates the 2 nd likelihood L2(C) corresponding to the presence or absence of the motion of detecting the combination (SA 22). In addition, the order of the calculation of the 1 st likelihood L1(a) (SA21) by the 1 st arithmetic unit 821 and the calculation of the 2 nd likelihood L2(C) (SA22) by the 2 nd arithmetic unit 822 may be reversed. The 3 rd arithmetic unit 823 calculates the distribution of the observed likelihood L by multiplying the 1 st likelihood L1(a) calculated by the 1 st arithmetic unit 821 and the 2 nd likelihood L2(C) calculated by the 2 nd arithmetic unit 822 (SA 23).

The position estimating unit 84 estimates the performance position T from the observation likelihood distribution calculated by the likelihood calculating unit 82 (SA 24). Further, the position estimating section 84 calculates the performance tempo R from the temporal change of the performance position T (SA 25).

As described above, in embodiment 2, the result of detecting the motion of the combination in addition to the analysis result of the acoustic signal a is considered in the estimation of the performance position T, and therefore, for example, the performance position T can be estimated with higher accuracy than a configuration in which only the analysis result of the acoustic signal a is considered. The performance position T is also estimated with high accuracy, for example, at the start time of the music piece or at the time of resuming the performance after the rest. In embodiment 2, when a signature motion is detected, the observation likelihood L in the reference period ρ corresponding to the reference point a at which the signature motion is detected among the plurality of reference points a specified for the performance target track decreases. That is, the detection timing of the motion of the dark sign other than the reference period ρ is not reflected in the estimation of the performance timing T. Therefore, there is an advantage that erroneous estimation of the performance time T in the case where the motion of the trumpet is erroneously detected can be suppressed.

< modification example >

The above illustrated aspects may be variously modified. Specific modifications are exemplified below. The 2 or more arbitrarily selected from the following examples can be appropriately combined within a range not inconsistent with each other.

(1) In the above-described embodiment, the automatic performance of the performance target track is started with the cue motion detected by the cue detection unit 52 as a trigger, but the cue motion may be used for the control of the automatic performance at a point in time during the performance target track. For example, when the performance is resumed after the end of a pause of a long time in the performance target song, the automatic performance of the performance target song is resumed in response to the motion of the trumpet as in the above-described embodiments. For example, as in the operation described with reference to fig. 5, the specific player P performs the hash operation at a time Q that is earlier than the preparation period B with respect to the time at which the performance of the target music piece resumes after the pause. Then, at the time point when the time period δ corresponding to the delay amount D and the performance tempo R has elapsed from this time point Q, the performance control section 56 resumes the instruction for the automatic performance by the automatic playing device 24. Since the performance tempo R is already estimated at a point in the middle of the performance target music, the performance tempo R estimated by the performance analysis unit 54 is applied to the setting of the time length δ.

In addition, the period in which the combination action is executed in the performance target track can be grasped in advance according to the performance content of the performance target track. Therefore, the combination detecting unit 52 may monitor whether or not the combination operation is performed for a specific period (hereinafter, referred to as a "monitoring period") in which there is a possibility that the combination operation is performed in the performance target song. For example, section designation data for designating a start point and an end point for each of a plurality of monitoring periods assumed for the performance target track is stored in the storage device 14. The section specification data may be included in the music data M. The cue detection unit 52 monitors the cue operation when the performance position T is present in each monitoring period specified by the section specification data for the performance target track, and stops the monitoring of the cue operation when the performance position T is outside the monitoring period. According to the above configuration, since the motion of the trumpet is detected for the performance target song limited to the monitoring period, there is an advantage that the processing load of the trumpet detecting unit 52 is reduced compared to the configuration in which the presence or absence of the motion of the trumpet is monitored for the entire section of the performance target song. Further, it is possible to reduce the possibility of false detection of the combination motion during a period in which the combination motion is not actually possible in the performance target song.

(2) In the above-described embodiment, the dark sign operation is detected by analyzing the entire image represented by the image signal V (fig. 3), but the dark sign detection unit 52 may monitor the presence or absence of the dark sign operation for a specific area (hereinafter referred to as "monitoring area") in the image represented by the image signal V. For example, the medal detecting unit 52 selects a range in which a specific player P who is scheduled to perform a medal motion is included in the image represented by the image signal V as a monitoring area, and detects the medal motion with the monitoring area as an object. The range outside the monitoring area is excluded from the monitoring objects by the secret number detecting unit 52. According to the above configuration, since the motion of the dark sign is detected while being limited to the monitoring area, there is an advantage that the processing load of the dark sign detecting unit 52 is reduced as compared with a configuration in which the presence or absence of the motion of the dark sign is monitored over the entire image represented by the image signal V. Further, it is possible to reduce the possibility that the action of the player P who does not actually perform the trumpet action is erroneously determined to be the trumpet action.

As illustrated in the modification (1), when it is assumed that the medal motion is performed over a plurality of times during the performance of the performance target track, the performer P who performs the medal motion may be changed for each medal motion. For example, the player P1 performs a preceding-to-start medal motion of the performance target song, while the player P2 performs a halfway medal motion of the performance target song. Therefore, it is also preferable to change the position (or size) of the monitoring area within the image represented by the image signal V over time. Since the player P who performs the combination action is decided before the performance, area specifying data for specifying the position of the monitoring area in time series, for example, is stored in the storage device 14 in advance. The darkness detection unit 52 monitors the darkness operation for each monitoring area specified by the area specifying data in the image represented by the image signal V, and excludes the area other than the monitoring area from the monitoring target of the darkness operation. According to the above configuration, even when the player P who performs the trumpet motion is changed as the music progresses, the trumpet motion can be appropriately detected.

(3) In the above-described embodiment, the plurality of players P are photographed by the plurality of photographing devices 222, but the plurality of players P (for example, the entire stage where the plurality of players P are located) may be photographed by 1 photographing device 222. Similarly, sound played by a plurality of players P may be picked up by the 1 sound pickup device 224. The image combination unit 522 may be omitted because the darkness detection unit 52 may monitor whether or not there is a darkness operation for each of the plurality of image signals V0.

(4) In the above-described embodiment, the motion of the dark mark is detected by analyzing the image signal V captured by the image capturing device 222, but the method of detecting the motion of the dark mark by the dark mark detecting unit 52 is not limited to the above-described example. For example, the medal detecting unit 52 may detect the medal movement of the player P by analyzing a detection signal of a detector (for example, various sensors such as an acceleration sensor) worn on the body of the player P. However, according to the configuration of the embodiment described above in which the dark sign operation is detected by analyzing the image captured by the imaging device 222, there is an advantage in that the dark sign operation can be detected with less influence on the performance operation of the player P, as compared with the case where the detector is worn on the body of the player P.

(5) In the above-described embodiment, the performance position T and the performance tempo R are estimated by analyzing the acoustic signal a in which the plurality of acoustic signals a0 representing sounds of different musical instruments are mixed, but the performance position T and the performance tempo R may be estimated by analyzing each of the acoustic signals a 0. For example, the performance analysis unit 54 estimates a tentative performance position T and performance tempo R for each of the plurality of acoustic signals a0 by the same method as in the above-described embodiment, and determines a specific performance position T and performance tempo R from the estimation result for each acoustic signal a 0. For example, representative values (for example, average values) of the performance position T and the performance tempo R estimated from the respective acoustic signals a0 are calculated as the determined performance position T and the performance tempo R. As understood from the above description, the acoustic mixing unit 542 of the performance analysis unit 54 may be omitted.

(6) As exemplified in the foregoing embodiment, the automatic playing system 100 is realized by the cooperation of the control device 12 and the program. A program according to a preferred embodiment of the present invention causes a computer to function as: a medal detecting section 52 for detecting the medal movement of the player P who plays the performance target music, a performance analyzing section 54 for analyzing the acoustic signal a representing the sound to be played in parallel with the performance and sequentially estimating the performance position T in the performance target music, a performance control section 56 for causing the automatic playing device 24 to execute the automatic performance of the performance target music so as to synchronize the medal movement detected by the medal detecting section 52 with the performance position T estimated by the performance analyzing section 54, and a display control section 58 for displaying the performance image G representing the progress of the automatic performance on the display device 26. That is, the program according to a preferred embodiment of the present invention is a program for causing a computer to execute the music data processing method according to a preferred embodiment of the present invention. The program exemplified above may be provided to be installed in a computer in a form of being stored in a computer-readable recording medium. The recording medium is preferably a non-transitory (non-transitory) recording medium, and an optical recording medium (optical disc) such as a CD-ROM, but may include any known recording medium such as a semiconductor recording medium or a magnetic recording medium. Further, the program may be distributed to the computer by a distribution method via a communication network.

(7) A preferred embodiment of the present invention may be determined as the operation method (automatic playing method) of the automatic playing system 100 according to the above-described embodiment. For example, in the automatic playing method according to the preferred embodiment of the present invention, the computer system (a single computer or a system composed of a plurality of computers) detects the signature motion of the player P who plays the performance target song (SA1), analyzes the acoustic signal a representing the played sound in parallel with the performance, sequentially estimates the performance position T within the performance target song (SA2), causes the automatic playing device 24 to execute the automatic performance of the performance target song so as to synchronize the signature motion with the performance position T (SA3), and causes the display device 26 to display the performance image G representing the progress of the automatic performance (SA 4).

(8) In the above-described exemplary embodiment, the following configuration is understood, for example.

[ means A1]

In a performance analysis method according to a preferred aspect of the present invention (aspect a1), a signature motion of a performer who performs a music is detected, a distribution of observation likelihoods as an index of accuracy in accordance with a performance position at each time in the music is calculated by analyzing an acoustic signal representing a sound of the music, the performance position is estimated from the distribution of observation likelihoods, and the observation likelihood is decreased in a period before a reference point specified on a time axis with respect to the music when the signature motion is detected in the calculation of the distribution of observation likelihoods. In the above aspect, in addition to the analysis result of the acoustic signal, the detection result of the motion of the signature is also considered in the estimation of the performance position, and therefore, the performance position can be estimated with higher accuracy than, for example, a configuration in which only the analysis result of the acoustic signal is considered.

[ means A2]

In a preferred example (mode a2) of the mode a1, in the calculation of the distribution of the observation likelihoods, a1 st likelihood that is an index of accuracy at each time point in the music piece corresponding to a performance position is calculated from the acoustic signal, the 1 st likelihood is set to a1 st value in a state where the motion of the signature is not detected, and when the motion of the signature is detected, a2 nd likelihood that is set to a2 nd value smaller than the 1 st value in a period before the reference point is calculated, and the observation likelihoods are calculated by multiplying the 1 st likelihood and the 2 nd likelihood. In the above aspect, there is an advantage in that the observation likelihood can be calculated easily by multiplying the 1 st likelihood calculated from the acoustic signal by the 2 nd likelihood corresponding to the detection result of the sign operation.

[ means A3]

In a preferred example of the mode a2 (mode A3), the 1 st value is 1, and the 2 nd value is 0. According to the above manner, the observation likelihood can be clearly distinguished between the case where the secret sign action is detected and the case where it is not detected.

[ means A4]

In the automatic playing method according to the preferred embodiment of the present invention (mode a4), a combination of a motion of a trumpet of a performer playing a music is detected, estimating a performance position within the piece of music by analysis of acoustic signals representing sounds of performing the piece of music, causing an automatic performance apparatus to perform an automatic performance of the piece of music so as to be synchronized with progress of the performance position, in the estimation of the performance position, a distribution of observation likelihoods as an index of accuracy at each time point in the music piece matching the performance position is calculated by analysis of the acoustic signal, and the performance position is estimated from the distribution of the observation likelihoods, in the calculation of the distribution of the observation likelihoods, when the motion of the combination is detected, the observation likelihoods in a period before a reference point specified on a time axis with respect to the music are decreased. In the above aspect, in addition to the analysis result of the acoustic signal, the detection result of the motion of the signature is also considered in the estimation of the performance position, and therefore, the performance position can be estimated with higher accuracy than, for example, a configuration in which only the analysis result of the acoustic signal is considered.

[ means A5]

In a preferred example (mode a5) of the mode a4, in the calculation of the distribution of the observation likelihoods, a1 st likelihood that is an index of accuracy at each time point in the music piece corresponding to a performance position is calculated from the acoustic signal, the 1 st likelihood is set to a1 st value in a state where the motion of the signature is not detected, and when the motion of the signature is detected, a2 nd likelihood that is set to a2 nd value smaller than the 1 st value in a period before the reference point is calculated, and the observation likelihoods are calculated by multiplying the 1 st likelihood and the 2 nd likelihood. In the above aspect, there is an advantage in that the observation likelihood can be calculated easily by multiplying the 1 st likelihood calculated from the acoustic signal by the 2 nd likelihood corresponding to the detection result of the sign operation.

[ means A6]

In a preferred example (mode a6) of the mode a4 or the mode a5, the automatic playing device is caused to execute an automatic performance in accordance with music data representing performance contents of the music, and the plurality of reference points are specified by the music data. In the above aspect, the reference points are specified by the music data for instructing the automatic playing device to perform automatically, and the configuration and the processing are advantageous in that the configuration and the processing are simplified as compared with the configuration in which a plurality of reference points are specified separately from the music data.

[ means A7]

In any preferable example (mode a7) of the mode a4 through the mode a6, the display device is caused to display an image indicating the progress of the automatic performance. According to the above aspect, the player can visually confirm the progress of the automatic performance of the automatic playing apparatus and reflect the progress on his or her own performance. That is, a natural performance in which the performance of the player and the automatic performance of the automatic playing apparatus are interactively coordinated is realized.

[ means A8]

An automatic playing system according to a preferred embodiment of the present invention (mode A8) includes: a cipher detection unit for detecting a cipher operation of a player who plays a music piece; an analysis processing unit that estimates a performance position within the music piece based on an analysis of an acoustic signal representing a sound of performing the music piece; and a performance control unit that causes the automatic playing device to execute an automatic performance of a music piece so as to synchronize the execution of the trumpet motion detected by the trumpet detection unit and the performance position estimated by the performance analysis unit, wherein the analysis processing unit includes: a likelihood calculation unit that calculates a distribution of observation likelihoods as an index of accuracy at each time point in the music piece corresponding to a performance position by analysis of the acoustic signal; and a position estimation unit that estimates the performance position from a distribution of the observation likelihoods, wherein the likelihood calculation unit decreases the observation likelihoods for a period before a reference point specified on a time axis with respect to the music when the motion of the signature is detected. In the above aspect, in addition to the analysis result of the acoustic signal, the detection result of the secret symbol motion is also considered in the estimation of the performance position, and therefore, compared to a configuration in which only the analysis result of the acoustic signal is considered, for example, the performance position can be estimated with high accuracy

(9) The automatic playing system exemplified in the above-described embodiment has, for example, the following configuration.

[ means B1]

An automatic playing system according to a preferred embodiment of the present invention (mode B1) includes: a cipher detection unit for detecting a cipher operation of a player who plays a music piece; a performance analysis unit that sequentially estimates performance positions within a music piece by analyzing acoustic signals representing performed sounds in parallel with the performance; and a performance control section for causing the automatic performance apparatus to execute an automatic performance of the music piece so as to synchronize the execution of the medal motion detected by the medal detecting section and the performance position estimated by the performance analyzing section; and a display control unit that causes the display device to display an image showing the progress of the automatic performance. In the above configuration, the automatic performance of the automatic playing device is performed so as to synchronize the progression of the player's combination movements and performance positions, while the image showing the progression of the automatic performance of the automatic playing device is displayed on the display device. Therefore, the player can visually confirm the progress of the automatic performance of the automatic playing apparatus to be reflected on his own performance. That is, a natural performance in which the performance of the player and the automatic performance of the automatic playing apparatus are interactively coordinated is realized.

[ means B2]

In a preferred example of the mode B1 (mode B2), the performance control unit instructs the automatic performance device to perform a performance at a later time in the music piece than the performance position estimated by the performance analysis unit. In the above aspect, the performance contents at the time point later in time with respect to the performance position estimated by the performance analysis unit are instructed to the automatic performance device. Therefore, even in the case where the actual sounding of the automatic playing device is delayed with respect to the instruction of the performance by the performance control section, the performance of the player and the automatic performance can be synchronized with high accuracy.

[ means B3]

In a preferred example of the mode B2 (mode B3), the performance analysis unit estimates a performance tempo from the analysis of the acoustic signal, and the performance control unit instructs the automatic performance device to perform a performance at a time point in the music piece that is later than the performance position estimated by the performance analysis unit by an adjustment amount corresponding to the performance tempo. In the above aspect, the automatic playing device is instructed to play at a time point after the playing position by the variable adjustment amount according to the playing speed estimated by the playing analysis unit. Therefore, even when the performance speed varies, for example, the performance of the player and the automatic performance can be synchronized with high accuracy.

[ means B4]

In any one of the preferred embodiments of the method B1 to the method B3 (method B4), the secret symbol detector detects the secret symbol motion by analyzing the image of the performer captured by the imaging device. In the above aspect, the motion of the trumpet is detected by analyzing the image captured by the imaging device, and thus there is an advantage that the motion of the trumpet can be detected while reducing the influence on the performance of the performer, as compared to a case where the motion of the trumpet is detected by wearing a detector on the body of the performer, for example.

[ means B5]

In any one of the preferred modes (mode B5) of B1 through B4, the display control unit causes the display device to display an image that dynamically changes in accordance with the performance content of the automatic performance. In the above aspect, since the image that dynamically changes according to the performance content of the automatic performance is displayed on the display device, there is an advantage in that the player can visually and intuitively grasp the progress of the automatic performance.

[ means B6]

In the automatic playing method according to the preferred embodiment of the present invention (mode B6), the computer system detects the trumpet motion of a performer who performs a music piece, analyzes an acoustic signal representing the performed sound in parallel with the performance to sequentially estimate a performance position within the music piece, causes the automatic playing device to perform the automatic playing of the music piece so as to synchronize the trumpet motion with the performance position, and causes the display device to display an image representing the progress of the automatic playing.

< detailed description >

Preferred embodiments of the present invention can be expressed as follows.

1. Premise(s)

An automatic playing system is a system in which a machine generates an accompaniment in cooperation with a human performance. Here, the description is given of the automatic playing system and the automatic playing system of the score representation that humans should play respectively like the classical music. Such an automatic playing system has been widely used for practice assistance in musical performance, music expansion performance in which an electronic musical instrument is driven in accordance with a player, and the like. In addition, hereinafter, a part played by the ensemble engine is referred to as an "accompaniment part". In order to perform an ensemble musically integrated, it is necessary to appropriately control the performance timing of the accompaniment parts. In the appropriate timing control, there are 4 requirements described below.

[ claim 1] in principle, the automatic playing system needs to play the place where the human player plays. Therefore, the automatic playing system needs to coordinate the position of the played music piece with the human player. In particular, in classical music, since suppression of the playing speed (tempo) is important in music expression, it is necessary to follow the tempo change of the player. In order to perform the follow-up with higher accuracy, it is preferable to obtain the habit of the player by analyzing the practice (rehearsal) of the player.

[ claim 2] an automatic playing system generates a musical integrated performance. That is, it is necessary to follow the performance of a human within a range in which the musicality of the accompaniment part is maintained.

[ claim 3] the degree of accompaniment part matching the player (master-slave relationship) can be changed according to the background (context) of the music. In a musical composition, even if the musicality is somewhat impaired, the musicality of the accompaniment part may be kept even if the musicality is somewhat impaired. Therefore, the balance of "followability" and "musicality" described in the requirement 1 and the requirement 2, respectively, changes depending on the background of the music piece. For example, there is a tendency for rhythmically (rhythm) unclear parts to follow parts with sharper rhythms.

[ claim 4] according to the player's instruction, the master-slave relationship can be changed immediately. The trade-off between followability and musicality of the automatic playing system (trade off) is often adjusted through conversation among people in the deck. In addition, when such adjustment is performed, the adjusted portion is rebounded to confirm the adjustment result. Therefore, an automatic playing system capable of setting a follow-up behavior is required for a color bar.

In order to satisfy these requirements at the same time, it is necessary to generate an accompaniment part so as not to cause musical breaks while following the position where the player plays. To achieve this, the automatic playing system requires three elements: (1) a model for predicting a position of a player, (2) a timing generation model for generating an accompaniment part of music, and (3) a model for correcting a performance timing based on a master-slave relationship. In addition, these elements need to be able to operate or learn independently. However, it has been difficult to independently process these elements. Therefore, in the following description, it is considered that (1) a performance timing generation process by a player, (2) a performance timing generation process expressing a range in which an automatic performance system can perform musically, and (3) a process for combining the performance timings of the automatic performance system and the player, which is used for the automatic performance system to match the player while maintaining the master-slave relationship, are modeled independently and integrated. By expressing independently, each element can be learned or operated independently. When the system is used, the timing generation process of the performer is estimated, the range of the timing capable of being performed by the automatic playing system is estimated, and the accompaniment part is played so as to coordinate the ensemble and the timing of the performer. Thus, the automatic playing system can cooperate with human beings and play ensemble without breaking in music.

2. Correlation technique

In the conventional automatic playing system, the performance timing of the player is estimated by using score tracking. On this basis, in order to coordinate the ensemble engine and the human, roughly two methods can be used. First, it is proposed to obtain an average behavior or a behavior of a temporal change in music by regressing the relationship of performance timings with respect to a player and an ensemble engine by a large number of color bars. In such a method, since the result itself of the ensemble is regressed, the musicality of the accompaniment part and the followability of the accompaniment part can be obtained as a result at the same time. On the other hand, it is considered that it is difficult to independently operate the followability or the musicality in the rehearsal because it is difficult to express the timing prediction of the player, the generation process of the ensemble engine, and the degree of fitting divisionally. In addition, in order to obtain the music followability, since ensemble data between human beings needs to be separately analyzed, a cost is required for the preparation of contents. Second, there is a method of setting a restriction on a beat trajectory by using a dynamic system described by a small number of parameters. In this method, a tempo trajectory or the like of a player is learned by color ranking on the basis of advance information such as tempo continuity. In addition, the accompaniment part may also be capable of learning sounding timing of the accompaniment part. Since they describe the tempo trajectory using few parameters, accompaniment parts or human "habits" can be easily rewritten manually in a rehearsal. However, it is difficult to independently operate the following performance, which is indirectly obtained from the deviation of the sounding timing when the player and the ensemble engine are independently played. In order to improve the explosive power in the line of color, it is considered effective to alternately perform learning of the automatic playing system and dialogue between the automatic playing system and the player. Therefore, in order to independently operate the followability, a method of adjusting the ensemble playing logic itself is proposed. In the present method, based on such an idea, a mathematical model such as "matching method", "performance timing of accompaniment part", "performance timing of performer" can be controlled independently and interactively is considered.

3. Overview of the System

The structure of the automatic playing system is shown in fig. 12. In the method, in order to follow the position of the player, music score following is performed based on the acoustic signal and the camera image. In addition, the position of the player is predicted based on the generation process of the position where the player is playing based on statistical information obtained from the posterior distribution followed by the score. In order to determine the sounding timing of the accompaniment part, the timing of the accompaniment part is generated by combining a model for predicting the timing of the performer and a generation process of the timing of the accompaniment part.

4. Music score tracking

In order to estimate the position in the music piece currently being played by the player, score following is used. In the score following method of the present system, a discrete state space model that expresses both the position of the score and the tempo of performance is considered. The observation sound is modeled as a hidden Markov process (HMM) over the state space, and the posterior distribution of the state space is estimated in turn using a delayed-decision type forward-backward algorithm. The forward-backward algorithm of the delay-decision type means that the forward algorithm is sequentially executed, and a posterior distribution for a state several frames ahead of the current time is calculated by running the backward algorithm whose current time is regarded as the end of data. The laplacian approximation of the posterior distribution is output at the point where the MAP value of the posterior distribution passes through the position regarded as on (onset) on the score.

The discussion is made with respect to the construction of the state space. First, a music piece is divided into R sections, and each section is set to one state. In the r-th section, the number of frames n required to pass through the section and the current passing frame 0 ≦ 1 < n for each n are held as state variables. That is, n corresponds to the tempo of a certain section, and the combination of r and l corresponds to the position on the score. Such a transition on the state space is expressed as a markov process as follows.

[ number 1]

(1) From (r, n, l) to itself: p is a radical of

(2) From (r, n, l < n) to (r, n, l +1):1-p

(3) From (r, n, n-1) to (r +1, n', ()):

this model has both the advantages of an explicit-duration HMM (explicit period HMM) and a left-to-right HMM (left-to-right HMM). That is, the duration in the section can be determined approximately by the selection of n, and minute beat variations in the section can be absorbed by the self-transition probability p. The length of the section or the probability of transition of the section is obtained by analyzing the music data. Specifically, note information such as a beat instruction or an extension symbol is used.

Next, the observed likelihood of such a model is defined. Each state (r, n, l) corresponds to a position-s (r, n, l) in a certain music piece. In addition, for an arbitrary position s in the music piece, except for the observed Constant Q Transform (CQT) and the average value/. about.c of Δ CQT_s ²And/delta. to c_s ²In addition, the accuracy κ is assigned respectively_s ^(c)And

(symbol/represents vector, symbol-represents the upper line within the mathematical expression). Based on this, CQT and c are observed at time t_t、ΔCQT、Δc_tIs corresponding to the state (r)_t，n_t，l_t) Is defined as follows.

Number 2

Here, vMF (x | μ, κ) refers to the von Mises-Fisher distribution, specifically normalized to x ∈ S^D(SD: D-1 dimensional unit sphere) is expressed by the following equation.

[ number 3]

In determining —, or Δ —, models of a piano roller (piano roll) expressed by a musical score and a CQT assumed for each tone are used. First, a specific index i is assigned to a pair of a pitch and a musical instrument name existing on a musical score. In addition, for the ith tone, an average observation CQT ω is assigned_if. At a position s on the score, let h be the intensity of the ith sound_siThen, c is given as follows_s,f. Delta-c is obtained by pairing-c in the s direction_s,fTaking the difference of the first time,and half-wave rectification is performed.

[ number 4]

When music starts from a silent state, visual information becomes more important. Therefore, in the present system, as described above, the motion (cue) of the dark sign detected from the camera disposed in front of the player is used. In the method, unlike a method of controlling an automatic playing system from top to bottom, the presence or absence of a trumpet motion is directly reflected in the observation likelihood, and the sound signal and the trumpet motion are processed uniformly. Therefore, firstly, extracting the place { ^ q requiring the password action from the music score information_i}。^q_iIncluding the starting point of the music piece or the position of the extension symbol. When the gesture of the dark sign is detected in the process of executing the music score following, the gesture of the dark sign is matched with the position U [ ^ q ] on the music score_i－Τ，^q_i]The observation likelihood of the corresponding state is set to 0, and the posterior distribution is guided after the position of the signal operation. With score following, the ensemble engine accepts, after several frames from the position of sound switching on the score, the distribution approximation of the current estimated position or beat as a normal distribution. That is, if the score tracking engine detects switching of the nth sound present on the music piece data (hereinafter referred to as "on event"), a time stamp (timestamp) t of the time at which the on event was detected is set_nAnd the estimated average position mu on the score_nAnd its variance σ_n ²The timing of ensemble generation unit is notified. In addition, since the estimation of delayed-decision type is performed, the notification itself causes a delay of 100 ms.

5. Playing timing combination model

Ensemble engine follows the notified information with score (t)_n，μ_n，σ_n ²) On the basis, the playing position of the appropriate ensemble engine is calculated. The ensemble engine preferably performs (1) a process of generating a timing at which the player performs and (2) a timing at which the accompaniment part performs in order to match the playerGeneration process, (3) process of listening to the performance of the accompaniment part by the player, these 3 processes are independently modeled. Using such a model, the timing of performance of the accompaniment part and the predicted position of the performer are considered to be generated, and the final timing of the accompaniment part is generated.

5.1 Performance timing Generation Process for Player

To express the player's performance timing, it is assumed that the player is at t_nAnd t_n+1At a velocity v_n ^(p)The position on the score is linearly moved. I.e. mixing x_n ^(p)As player at t_nPosition on the music score being played, will be epsilon_n ^(p)As noise for the velocity or the position on the score, the following generation process is considered. Here, the value is set to Δ T_m,n＝t_m－t_n。

[ number 5]

Noise epsilon_n ^(p)In addition to beat variations, pseudo-tones or sounding timing errors are also included. To show the former, consider that at t, the timing of utterance changes with the change in tempo_nAnd t_n-1In accordance with the variance ψ²The normal distribution of the generated acceleration. Then, if ∈ will_n ^(p)Is set as h ═ Δ T_n,n-1 ²/2,ΔT_n,n-1]Then give ∑_n ^(p)＝ψ²h' h, the beat changes become correlated with the sounding timing changes. In addition, to express the latter, the standard deviation σ is considered_n ^(p)White noise of (1) to sigma_n,0,0 ^(p)Plus sigma_n ^(p). Therefore, if it is to be paired with ∑_n,0,0 ^(p)Plus sigma_n ^(p)The latter matrix is set to ∑_n ^(p)Then give ε_n ^(p)～N(0,Σ_n ^(p)). N (a, b) means a normal distribution of the mean a and the variance b.

Next, consider the history of performance timing/μ of the user reported by the score tracking system_n＝[μ_n,μ_n-1,…,μ_n-In]And/sigma_n ²＝[σ_n,σ_n-1,…,σ_n-In]In combination with (3) or (4). Here, I_nIs the length of the history considered, and is set to include the ratio t_nAn event 1 beat ahead. Such a/. mu._nOr/sigma_n ²The generation process of (2) is specified as follows.

[ number 6]

Here,/W_nIs used according to x_n ^(p)And v_n ^(p)To predict observation/. mu._nThe regression coefficient of (2). Here,/W_nThe definition is as follows.

[ number 7]

Not using the latest mu as in the past_nBy using the history before this as an observation value, it is considered that even if the score tracking fails partially, the action is unlikely to be broken. In addition, it is considered that/W can be obtained by color arrangement_nAnd a playing method that depends on a tendency of a long time, such as a pattern that can follow an increase or decrease in tempo. Such a model is used in the sense that the correlation between the tempo and the position change in the score is clearly recordedThis is equivalent to applying the concept of trajectory hmm (trajectory hmm) to a continuous state space.

5.2 Performance timing Generation procedure of accompaniment parts

As described above, it is possible to infer the internal state [ x ] of the player from the history that the score follows the reported position by using the timing model of the player_n ^(p),v_n ^(p)]. The automatic playing system coordinates such inference with the habit of how the accompaniment part "wants to play" and infers the final sound emission timing. Therefore, a generation process of performance timing in an accompaniment part of how to "want to pop" the accompaniment part is considered here.

At the performance timing of the accompaniment part, a process of being performed with a tempo trajectory within a certain range from a given tempo trajectory is considered. The tempo trajectory given means performance data considering the use of a performance expression attaching system or a human being. The predicted value ^ x of which position on the music piece to play when the automatic playing system receives the nth turn-on event_n ^(a)And its relative velocity ^ v_n ^(a)The behavior is as follows.

[ number 8]

Here, v_n ^(a)Is at time t_nThe previously given tempo at position n on the reported score is substituted into the previously given tempo trajectory. In addition,. epsilon^(a)A departure range permitted for performance timing generated from a tempo trajectory given in advance is determined. According to such parameters, a range of musically natural performance is determined as the accompaniment part. Beta is epsilon [0,1 ∈]Means an item indicating how strongly a given beat is to be pulled back, having the beat trace pulled back v_n ^(a)The effect of (1). Thus, the device is provided withThe model of (2) has a certain effect in audio alignment, and thus it is explained that the generation process as the timing of playing the same music is appropriate. In addition, since (β ═ 1) and ^ v follow the wiener process without such a restriction, the tempo diverges, and extremely fast or slow performance may be generated.

5.3 Play timing integration procedure of Player and accompaniment part

Up to this point, the sounding timing of the player and the sounding timing of the accompaniment part are modeled independently. Here, based on their generation process, a process of "fitting" accompaniment parts while listening to a player is described. Thus, consider describing the following behavior: when the accompaniment part is matched with a person, the error between the predicted value of the position to be played currently of the accompaniment part and the predicted value of the current position of the player is slowly corrected. Such a variable describing the degree of correcting the error is hereinafter referred to as a "combination coefficient". The combination coefficient is influenced by the master-slave relationship of the accompaniment part and the player. For example, in a case where the rhythm of the player is clearer than the accompaniment part, the accompaniment part is often closely fitted to the player. In addition, in the color bar, when the master-slave relationship is instructed from the player, it is necessary to change the matching method in accordance with the instruction. That is, the combination coefficient varies depending on the background of the music or the dialogue with the player. Thus, given the received t_nBinding coefficient gamma in the spectral position of time_n∈[0,1]The process of the accompaniment part cooperating with the player is described below.

[ number 9 ]

In this model, the degree of following is based on γ_nMay vary in size. For example, let us say γ_nWhen 0, the accompaniment part is completely out of cooperation with the player at γ_nWhen 1, the accompaniment partIs in perfect fit with players. In this model, the accompaniment part will play a performance ^ x_n ^(a)Variance of (2) and performance timing x of the player_n ^(p)The prediction error in (2) is also weighted by the combining coefficient. Thus, x^(a)Or v^(a)The variance of (1) is a result of the performance timing probability process itself of the player being coordinated with the performance timing probability process itself of the accompaniment part. Therefore, it is found that the tempo trajectories "desired to be generated" by both the player and the automatic playing system can be naturally unified.

The simulation of the present model when β is 0.9 is shown in fig. 13. It is known that by changing γ in this way, it is possible to complement between the tempo trajectory of the accompaniment part (sine wave) and the tempo trajectory of the player (step function). Further, it is found that the generated tempo trajectory is closer to the tempo trajectory targeted for the accompaniment part than the tempo trajectory of the player by the influence of β. Namely, the following effects are considered: in player ratio v^(a)The player is "pulled" in the case of fast, and "urged" in the case of slow.

5.4 method for calculating binding coefficient γ

Setting the binding coefficient γ by several factors_nThe degree of synchronization between the players is indicated. First, the master-slave relationship is influenced by the background in the music piece. For example, the guide ensemble is mostly a part with an easily understandable rhythm. In addition, the master-slave relationship is sometimes changed by a dialog. In order to set a master-slave relationship according to the background of a music piece, the density of sound phi is calculated from score information_nMoving average of note density for accompaniment part, moving average of note density for performer part]. Since a beat trajectory is easily determined by a portion where the number of tones is large, it is considered that a combination coefficient can be approximately extracted by using such a feature amount. At this time, in the case where the accompaniment part is not played (phi)_n,00), the position prediction of the ensemble is completely governed by the player, and in the section where the player does not perform the performance (phi)_n,10), the position prediction of the desired ensemble is completely neglected from the behavior of the player. Therefore, γ is determined as follows_n。

[ number 10 ]

Here,. epsilon.0 is set to a sufficiently small value. And the complete unilateral master-slave relationship (gamma) is difficult to occur in human-to-human ensemble _n0 or gamma_n1), the heuristic step (heiristic) as the above equation does not have a completely unilateral master-slave relationship when both the player and the accompaniment part are playing. A completely unilateral master-slave relationship, which occurs only in the case where one of the players/ensemble engines is temporarily silent, is instead optimal.

In addition, in a color line or the like, a player or an operator can rewrite γ as necessary_n. It is considered that the following desirable characteristics are obtained when humans are rewritten to appropriate values in a color chart: gamma ray_nIs finite and its behavior under boundary conditions is self-explanatory, or relative to gamma_nThe behavior is continuously variable.

5.5 Online inference

When the automatic playing system is in use, it receives (t)_n,μ_n,σ_n ²) The posterior distribution of the performance timing model is updated at the timing of (1). The proposed method enables efficient inference using a kalman filter. Is informing of (t)_n,μ_n,σ_n ²) The prediction and update steps of the kalman filter are performed, as shown below, to predict the position at which the accompaniment part should be played at time t.

[ number 11 ]

Here τ^(s)Refers to input-output delay in the automatic playing system. In addition, in the system, the sound is updated when the accompaniment part soundsA state variable. That is, as described above, in addition to the prediction/update step performed based on the score following result, only the prediction step is performed at the timing when the accompaniment part is sounded, and the obtained prediction value is substituted into the state variable.

6. Evaluation experiment

In order to evaluate the present system, the position estimation accuracy of the player is first evaluated. With regard to the timing generation of the ensemble, the effectiveness of β, which is an item to pull back the tempo of the ensemble to a predetermined value, or γ, which is an index of how much the accompaniment parts fit the player, is evaluated by listening to the player.

6.1 score-following assessment

In order to evaluate the score following accuracy, the following accuracy for the practice music of Bergmuller (bugmuir) was evaluated. As evaluation data, the spectrum surface following accuracy was evaluated using data of 14 (No. 1, No. 4-10, No. 14, No. 15, No. 19, No. 20, No. 22, No. 23) recorded in the practice music piece (op.100) played by the pianist. In addition, no camera input was used in this experiment. Evaluation on a scale with reference to MIREX, Total precision was evaluated. Total precision (overall accuracy) indicates the accuracy of the entire corpus (corpus) when an error of alignment converges to a certain threshold τ as a correct solution.

First, to verify validity regarding the estimation of the delayed-decision type, total precision (τ of 300ms) for the amount of delayed frames in the delayed-decision forward back algorithm was evaluated. The results are shown in FIG. 14. It is known that the accuracy is improved by utilizing the posterior distribution of the results several frames before. It is also known that the accuracy gradually decreases when the delay amount exceeds 2 frames. In the case of 2 frames, the total precision is 82% when τ is 100ms, and 64% when τ is 50 ms.

6.2 verification of Performance timing bond model

The verification of the performance timing in conjunction with the model is performed by listening to the player. As a feature of the present model, there is an ensemble engineTo pull back the β and the gamma of the assumed beat, the validity of both is verified. First, in order to eliminate the influence of the coupling coefficient, it is prepared to set formula (4) to v_n ^(p)＝βv_n-1 ^(p)+(1－β)～v_n ^(a)Is set to x_n ^(a)＝x_n ^(p)，v_n ^(a)＝v_n ^(p)The system of (1). That is, consider the ensemble engine as follows: while assuming that the expected value of tempo is ^ v, the result of filtering the result of score-following is directly used for performance timing generation of accompaniment while the variance is controlled by β. First, after the automatic playing system in the case where 6 pianists are allowed to set β to 0 is used for one day, listening is made with a sense of use. The object music is selected from a wide variety of music such as classical music, romantic music, popular music and the like. In listening, the dominant dissatisfaction is that if a person wants to cooperate with an ensemble, the accompaniment parts are also in time with the person, with extremely slow or fast beats. Such a phenomenon is caused by an improper setting of τ in the formula (12)^(s)This occurs when the response of the system is subtly uncoordinated with the player. For example, in the case where the system responds slightly earlier than expected, the user speeds up the tempo in order to match the system returning slightly earlier. As a result, the response is returned earlier by the system following its beat, and the beat continues to accelerate.

Next, experiments were performed on the same song under the condition of β ═ 0.1 by the other 5 pianists and 1 pianist who also participated in the experiment of β ═ 0. The same question content was listened to with β being 0, but the problem of beat divergence was not heard. In addition, the pianist who assisted the experiment also under the condition of β ═ 0 also published the suggestion that the followability was improved. However, the system produces a delay/speed-up opinion when the player has a large divergence between the tempo assumed by a certain tune and the tempo that the system wants to pull back. This tendency occurs when an unknown piece of music is played, i.e., when the player does not know the "common sense" tempo. This means that by the system trying to introduce the effect of a certain tempo, the divergence of the tempo is protected from the chance, and on the other hand, when the interpretations relating to the accompaniment part and the tempo are extremely different, the accompaniment part gives an impression of incitation. In addition, it is also shown that the followability is preferably changed depending on the background of the music. The reason is that, depending on the characteristics of the music, opinions about the degree of the matching method, such as "pull back is preferable" and "fit is desired" are basically consistent.

Finally, the system in which γ is fixed to 0 and the system in which γ is adjusted according to the background of performance are used in professional string quartet, and the latter behavior is better observed, which indicates the effectiveness thereof. However, in this verification, since the subject knows that the latter system is an improved system, it is necessary to perform additional verification using the AB method or the like as appropriate. In addition, since there are several scenarios in which γ is changed according to the dialogue in the color line, it is indicated that it is useful to change the combination coefficient in the color line.

7. Learning process in advance

To obtain the player's "habit", the MAP state ^ s based on the time t calculated by the score following_tAnd its input feature sequence c_t}^T _t＝1Estimate h_siAnd ω_ifAnd a beat trajectory. Here, their estimation methods are briefly described. At h_siAnd ω_ifIn the estimation of (2), posterior distribution is estimated in consideration of an Informed NMF (notification NMF) model of a Poisson-Gamma system as shown below.

[ number 12 ]

The hyper-parameters appearing here are suitably calculated from a database of instrument sounds or piano roller shutters represented by a score. The posterior distribution is approximated by a variational bayes method. Specifically, the posterior distribution p (h, ω | c) is approximated in the form of q (h) q (w), and the KL distance between the posterior distribution and q (h) q (w) is minimized while introducing auxiliary variables. From the posterior distribution thus estimated, the MAP estimation of the parameter ω corresponding to the tone color of the instrument sound is stored and used in the subsequent system operation. In addition, h corresponding to the strength of the piano roller blind can also be used.

Next, the length of the section (i.e., the beat trajectory) where the player performs each piece of music is estimated. Estimating the tempo trajectory enables restoration of the tempo expression specific to the player, and hence the prediction of the position of the player is improved. On the other hand, when the number of color ranks is small, the beat trajectory may be estimated erroneously due to an estimation error or the like, and the accuracy of the position prediction may deteriorate conversely. Therefore, when changing the tempo trajectory, it is considered to first provide advance information on the tempo trajectory and change only the tempo of a portion of the tempo trajectory of the player which is always out of the advance information. First, how much the tempo of the player deviates is calculated. If the number of color ranks is small, the estimated value of the degree of deviation itself becomes unstable, and therefore the distribution of the tempo trajectory of the player itself has a preliminary distribution. Let us mean μ of beats of the player in a position s in the piece of music_s ^(p)Sum variance λ_s ^(p)Following N (μ)_s ^(p)|m₀,b₀λ_s ^(p)-1)Gamma(λ_s ^(p)-1|a₀ ^λ,b₀ ^λ). Thus, if the average of beats obtained from K performances is μ_s ^(R)Precision (variance) of λ_s ^(R)-1The posterior distribution of beats is given as follows.

[ number 13 ]

If the posterior distribution obtained in this way is found asIs a beat distribution N (mu) that can be obtained from a position s in a musical piece_s ^S,λ_s ^S-1) The posterior distribution of the generated distribution is shown as the average value thereof as follows.

[ number 14 ]

Based on the tempo calculated in this way, the average value of ∈ used in expression (3) or expression (4) is updated.

< embodiment 3 >

Embodiment 3 of the present invention will be described. In the present embodiment, the automatic playing system 100 recognizes the combination of the actions of the player P and performs the performance. Note that, regarding elements having the same functions or functions as those of embodiment 1 in the following embodiments, detailed descriptions thereof will be omitted as appropriate along the reference numerals used in the description of embodiment 1.

The combination action in the present embodiment is premised on an action performed by the movement of the face of the player P. The cipher operation in the present embodiment represents the timing of occurrence of an event by an action. The events here are various behaviors in the performance, and are, for example, timings indicating the start and end of a sound emission, the cycle of a beat, and the like. The sign operation in the present embodiment is, for example, an operation of projecting a line of sight to the partner who delivers the sign, nodding the head, accompanying sound, raising the head to gently inhale the head, or the like.

Fig. 15 is a block diagram showing an example of the configuration of the detection processing unit 524 according to embodiment 3. The detection processing unit 524 includes, for example, an acquisition unit 5240, a determination unit 5241, an estimation unit 5242, an output unit 5243, a face portion extraction model 5244, and a signature motion estimation model 5245.

The acquisition unit 5240 acquires image information. The image information is information of an image in which the performance of the performer P is photographed, and includes, for example, the image signal V generated by the image synthesizing unit 522.

In the present embodiment, the image information is information including depth information. The depth information is information indicating a distance from a predetermined position (for example, a photographing position) to an object for each pixel in an image. In this case, the plurality of photographing devices 222 in the recording device 22 includes at least one depth camera. The depth camera is a distance measuring sensor that measures a distance to an object, and measures the distance to the object based on a time required until receiving reflected light of the irradiated light reflected by the object, for example, by irradiating the light with light such as infrared light. Alternatively, the plurality of image capturing devices 222 may include a stereo camera. The stereo camera photographs a subject from a plurality of different directions to calculate a depth value (depth information) up to the subject.

The acquisition unit 5240 repeatedly acquires image information at predetermined time intervals. The predetermined time interval here is arbitrary, and may be periodic, random, or a mixture thereof. The acquisition unit 5240 outputs the acquired image information to the determination unit 5241.

The determination unit 5241 extracts a face portion (hereinafter, referred to as a face portion) including the eyes of a person from an image (hereinafter, referred to as a photographed image) indicated by the image information based on the image information acquired from the acquisition unit 5240.

Specifically, the determination unit 5241 first separates the background from the captured image. The determination unit 5241 determines, for example, pixels whose distance to the subject is greater than a predetermined threshold value as the background using the depth information of the pixels, and separates the background from the captured image by extracting an area whose distance to the subject is less than the predetermined threshold value. In this case, the determination unit 5241 may determine, as the background, a region having an area smaller than a predetermined threshold value even in a region having a distance to the subject smaller than the predetermined threshold value.

Next, the determination unit 5241 extracts a face portion using the image from which the background is separated and the face portion extraction model 5244. The face portion extraction model 5244 is a learning-done model created by causing a learning model to learn teacher data. The learning model is, for example, CNN (Convolutional Neural Network). The teacher data is data (data set) in which a learning image in which a face portion including eyes of a person is photographed and a determination result in which the face portion of the person in the learning image is determined are associated with each other. By learning teacher data, the face portion extraction model 5244 becomes a model as follows: a face portion of a person in an input image is estimated from the image, and the estimation result is output. The determination unit 5241 extracts a face portion based on an output obtained by inputting the image information acquired from the acquisition unit 5240 to the face portion extraction model 5244.

Next, the determination unit 5241 detects the movement of the face portion based on the image of the face portion extracted from the captured image (hereinafter, referred to as an extracted image). The determination unit 5241 detects the movement of the face portion by comparing the extracted images in time series, for example. The determination unit 5241 extracts, for example, feature points in the extracted image, and detects the motion of the face portion based on a temporal change in the position coordinates of the extracted feature points. The feature points here are points indicating characteristic portions of the face portion, such as the corners of the eyes and the tips of the eyebrows. If the extracted image includes a portion other than the eyes, the mouth angle and the like may be extracted as the feature point.

The determination unit 5241 detects the direction of the line of sight based on the extracted image. The determination unit 5241 extracts the region of the eye in the extracted image. The method of extracting the eye region may be arbitrary, and for example, a learned model similar to the face portion extraction model 5244 may be used, or another image processing method may be used. For example, the determination unit 5241 determines the direction of the line of sight based on the orientation of the face. Generally, this is because it is considered that the player P looks at the opposite party performing the trumpet with the face directed toward the opposite party performing the trumpet. The determination unit 5241 determines the direction of the face in the left-right direction based on the depth information of a portion, such as the left and right eyes or eyebrows, that is bilaterally symmetric with respect to the center line of the face in the up-down direction. For example, when the distance difference between the left and right eyes is smaller than a predetermined threshold value and the left and right eyes are substantially equidistant from the depth camera, the determination unit 5241 determines that the front face of the face faces the depth camera and the direction of the line of sight is in the front direction. The vertical direction can be determined by the same method.

The determination unit 5241 determines whether or not a preliminary operation related to a beacon operation indicating the timing of an event has been performed, using the result of the detection. The preparatory operation is a part of the combination operation or an operation linked to the combination operation and is performed before the timing such as the start of the utterance indicated in the combination operation. For example, when the sign operation is performed by nodding, the preparatory operation is an operation (hereinafter, also referred to as "cue-down") for lowering the face, which is performed before an operation (hereinafter, also referred to as "cue-up") for raising the face. Alternatively, when the head-up performs the secret sign operation by lightly inhaling, the preparatory operation is an air discharge operation performed before the face is lifted up.

The determination unit 5241 determines that the preliminary operation has been performed, for example, when the movement of the face portion is along a direction (an example of a "1 st direction") indicating the up-and-down direction of the nodding and the direction of the line of sight is a direction (an example of a "2 nd direction") of the opposite party performing the dark sign. The determination unit 5241 outputs the determination result of the preliminary operation to the estimation unit 5242.

The estimation unit 5242 estimates the timing of the occurrence of an event from the image indicating the preliminary operation based on the determination result of the determination unit 5241. The estimation unit 5242 estimates the timing of occurrence of an event, for example, using an image group indicating a flow of a series of operations including a preliminary operation and the signature operation estimation model 5245. The secret sign motion estimation model 5245 is a learning model created by causing a learning model to learn teacher data. The learning model is, for example, LSTM (Long Short-Term Memory). The teacher data is data (data set) in which a time-series image of a face portion including eyes of a person is captured and a determination result of a motion of a dark mark in the image for learning are associated with each other. The sign operation herein may include various operations for determining the sign operation, and examples thereof include a sign operation (cue-up), a preparatory operation (cue-down), and an operation in which the line of sight is or is not in a specific direction. By learning teacher data, the secret sign motion estimation model 5245 becomes a model as follows: from the input time-series image group, the motion shown by the next image in the series of images is estimated, and the estimation result is output. The determination unit 5241 estimates the timing of occurrence of an event based on an output obtained by inputting an image group representing a flow of a series of operations including a preparatory operation into the face portion extraction model 5244.

The output unit 5243 outputs information indicating the timing of occurrence of the event estimated by the estimation unit 5242.

The face portion extraction model 5244 is a model that learns a data set in which a learning image in which a face portion including the eyes of a person is photographed and a determination result of determining the face portion of the person in the learning image are associated with each other as teacher data, and outputs the face portion of the person in the input image.

The secret sign motion estimation model 5245 is a model that learns, as teacher data, a data set in which a learning image in which a face portion including the eyes of a person is photographed and a result of determination that a secret sign motion in the learning image is determined are associated with each other, and outputs whether or not the secret sign motion is performed in an input image.

Fig. 16 is a flowchart showing the flow of processing performed by the detection processing unit 524.

The acquisition unit 5240 acquires image information. The acquisition unit 5240 outputs the acquired image information to the determination unit 5241 (step S10).

The determination unit 5241 extracts an area in which a face portion in an image is photographed based on image information (step S11), and detects the movement of the face portion and the direction of the line of sight based on the extracted image. The determination unit 5241 determines whether the movement of the face portion is in a predetermined direction based on the detection result (step S12). The determination unit 5241 determines whether or not the direction of the line of sight is a specific direction (camera direction in fig. 16) (step S13). The determination unit 5241 determines whether or not the image is an image on which a preliminary motion related to a motion of a face portion is performed, based on the motion of the face portion and the direction of the line of sight, and outputs the determination result to the estimation unit 5242.

The estimation unit 5242 estimates the timing of occurrence of the event based on the image information of the image determined by the determination unit 5241 as having performed the preliminary operation (step S14). The estimation unit 5242 estimates the timing of occurrence of an event by estimating the next operation using, for example, a series of time-series image groups including the preliminary operation and the signature operation estimation model 5245. The estimation unit 5242 outputs the estimation result to the output unit 5243.

The output unit 5243 outputs the estimation result estimated by the estimation unit 5242. The output unit 5243 outputs a performance start signal corresponding to the estimated timing of occurrence of the event, for example (step S15).

As described above, the automatic playing system 100 (control system) according to embodiment 3 includes the acquisition unit 5240, the determination unit 5241, the estimation unit 5242, and the output unit 5243. The acquisition unit 5240 acquires image information. The determination unit 5241 detects the movement of a face portion including the eyes of a person and the direction of the line of sight of the person when the face portion is photographed in a photographic image shown in image information based on the image information, and determines whether or not a preliminary operation associated with a secret sign operation indicating the timing at which an event occurs is performed, using the detection result. When the determination unit 5241 determines that the preliminary operation has been performed, the estimation unit 5242 estimates the timing of causing the event to occur based on the image information. The output unit 5243 outputs the estimation result estimated by the estimation unit 5242.

Thus, the automatic playing system 100 of embodiment 3 can estimate the timing of causing an event to occur based on the motion of the face. That is, in a situation where a dark sign based on eye contact is assumed, such as the timing of starting the sounding during the performance of the music, the timing of resuming the symbol by extension, and the timing of sounding and stopping the last sound of the music, the player P can control the performance of the automatic playing system 100 based on the movement of the face and the movement of the dark sign indicated by the line-of-sight direction.

In embodiment 3, the estimation is performed using an image in which a face portion including eyes is photographed. Therefore, even when a part of the face of the player P is hidden (occluded) by the musical instrument or the like in the image of the player P having the wind instrument or the like photographed, the motion of the dark mark can be recognized using the peripheral portion of the eyes where occlusion is hard to occur during the performance, and the timing of causing the event to occur can be estimated. Therefore, even when various operations are performed during the performance, the estimation can be reliably performed.

In embodiment 3, estimation is performed using both the motion of the face portion and the direction of the line of sight. Therefore, since it is possible to distinguish between a motion in which the player P moves the face or body with a large focus on the performance and a motion in which the face or body moves, the accuracy of estimation can be improved as compared with a case in which estimation is performed only by the motion of the face portion.

In the automatic playing system 100 according to embodiment 3, the estimating unit 5242 estimates the timing of occurrence of an event using the signature motion estimation model 5245. This makes it possible to estimate the model by a simple method of inputting an image to the model without performing complicated image processing. Therefore, it is possible to expect reduction in processing load and reduction in processing time as compared with the case of performing complicated image processing. Further, by using the teacher data learned by the secret symbol motion estimation model 5245, the timing of various events such as the start of utterance and the cycle of tempo can be estimated, and any event can be handled.

In the automatic playing system 100 according to embodiment 3, the determination unit 5241 determines that the preliminary action has been performed when the movement of the face portion is in the vertical direction (specific 1 st direction) along the nodding and the direction of the line of sight is in the opposite direction (specific 2 nd direction) of the opposite side of the sign, based on the image information. This makes it possible to perform determination based on the movement in a specific direction, which is characteristic in the motion of the combination, and the direction of the line of sight, and thus to improve the accuracy.

Further, in the automatic playing system 100 of embodiment 3, the determination section 5241 detects the movement of the face portion using the face portion extraction model 5244. This can achieve the same effects as those described above.

In the automatic playing system 100 according to embodiment 3, the image information includes depth information indicating a distance from the subject for each pixel in the image, and the determination unit 5241 separates the background in the captured image based on the depth information to extract the face portion in the image. The region of the eyes shown by the face is a relatively narrow region, and therefore the region of the eyes extracted from the image has fewer pixels than other regions. In addition, the shape or color of the eye is more complex than other parts. Therefore, even when the region of the eye can be accurately extracted, noise is likely to be mixed in compared with other regions. Therefore, even if the orientation of the face is detected by performing image processing on the image of the region from which the eyes are extracted, it is difficult to extract the face with high accuracy. In contrast, in the present embodiment, depth information is used. The depth information does not change in a complicated manner as compared with color information or the like even in the periphery of the eye. Therefore, the direction of the face can be detected with high accuracy based on the depth information (depth information) of the periphery of the eye. Further, the approximate distance from the photographing device 222 to the player P can be grasped in advance. Therefore, if depth information is used, the player P can be easily extracted by separating the background without performing complicated image processing such as contour extraction. By excluding the background pixels from the analysis target, not only can the processing speed be increased, but also the false detection can be expected to be reduced.

In the above description, the case where the direction of the line of sight is detected based on the image information is described as an example, but the present invention is not limited to this. For example, the direction of the line of sight may be detected based on the relative positional relationship between the cornea and the pupil detected by reflected infrared light emitted to the eyeball using eye tracking or the like.

Furthermore, an agent (agent) for ensemble may be reacted to using the automatic playing system 100 according to embodiment 3. For example, if the player P is watching a robot equipped with a camera, the robot may be operated to look at the player P. Further, when the player P performs a cue motion (e.g., cue-up) or a preparatory motion (e.g., cue-down), the robot performs an attachment in accordance with the motion. Thereby, the performance synchronized with the player P by the automatic playing system 100 can be performed.

Some embodiments of the present invention have been described, but these embodiments are presented as examples and are not intended to limit the scope of the invention. These embodiments can be implemented in other various ways, and various omissions, substitutions, and changes can be made without departing from the spirit of the invention. These embodiments and modifications thereof are included in the invention described in the claims and the equivalent range thereof, as well as included in the scope and gist of the invention.

Claims

1. A control system is provided with:

an acquisition unit that acquires image information including a user captured over time;

a determination unit configured to determine whether or not a preliminary operation has been performed, based on the movement of the face of the user detected from the image information and the direction of the line of sight;

an estimation unit that estimates a timing at which an event occurs when it is determined that the preliminary operation has been performed; and

an output unit that outputs the estimation result estimated by the estimation unit.

2. A control system is provided with:

an acquisition unit that acquires image information;

a determination unit that detects, based on the image information, a movement of a face portion and a direction of a line of sight in a captured image indicated by the image information, and determines, using a result of the detection, whether or not a preliminary operation associated with a motion of a sign indicating a timing at which an event occurs has been performed;

an estimating unit that estimates a timing of occurrence of an event from the motion of the logo based on the image information when the determining unit determines that the preliminary motion is performed; and

3. The control system of claim 1 or claim 2,

the estimating unit estimates the timing of occurrence of an event using an output result of a signature motion estimation model, the signature motion estimation model being a model including: a learning image in which a face portion including eyes of a person is photographed and a data set in which a determination result of a secret sign operation indicating a timing at which an event occurs in the learning image is associated are learned as teacher data, and whether or not the secret sign operation is performed in an input image is output.

4. The control system according to any one of claim 1 to claim 3,

an event represented by a hash action representing the timing of causing the event to occur is the start of an utterance,

the estimation unit estimates the timing of the start of utterance by using a sign motion estimation model indicating a learning result of learning a relationship between an image and the sign motion, the sign motion being a motion of a face portion including eyes of a person indicating the start of utterance.

5. The control system of any one of claim 1 to claim 4,

an event represented by a shaded action representing the timing at which the event occurs is the period of a beat in a performance,

the estimating section estimates, as the secret sign motion, a motion of a face portion including human eyes, which represents a cycle of a beat in performance, and estimates a timing representing the cycle of the beat in performance, using a secret sign motion estimation model representing a learning result in which a relationship between an image and the secret sign motion is learned.

6. The control system of any one of claim 1 to claim 5,

the determination unit determines that the preparatory movement has been performed when the movement of the face portion including the eyes of the person is in a specific 1 st direction and the direction of the line of sight is in a specific 2 nd direction, based on the image information.

7. The control system of any one of claim 1 to claim 6,

the determination section extracts the face portion in the photographic image shown in the image information using an output result of a face portion extraction model, and detects a motion of the face portion based on an image of the extracted face portion, the face portion extraction model being a model of: a learning image in which a face portion including eyes of a person is photographed and a data set in which a determination result of the face portion in the learning image is determined are associated are learned as teacher data, and the face portion of the person in the input image is output.

8. The control system according to any one of claim 1 to claim 7,

the image information contains depth information indicating a distance to an object for each pixel in an image,

the determination unit separates a background in the captured image indicated by the image information based on the depth information, and extracts a face portion including human eyes in the image based on the image from which the background is separated.

9. A method for controlling a power supply of a vehicle,

the acquisition unit acquires the image information and outputs the image information,

a determination unit that detects, based on the image information, a movement of a face portion and a direction of a line of sight in a captured image indicated by the image information, and determines, using the detected result, whether or not a preliminary operation associated with a motion of a sign indicating a timing at which an event occurs has been performed;

the output section outputs the estimation result estimated by the estimation section.