NL2018758B1 - Optical music recognition (OMR) assembly for converting sheet music - Google Patents

Optical music recognition (OMR) assembly for converting sheet music

Info

Publication number
NL2018758B1
Authority
NL
Netherlands
Prior art keywords
representation
neural network
music
rnn
assembly
Prior art date
Application number
NL2018758A
Other languages
Dutch (nl)
Inventor
Eelco Jan Van Der Wel
Karin Ullrich
Original Assignee
Univ Amsterdam
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Amsterdam filed Critical Univ Amsterdam
Priority to NL2018758A priority Critical patent/NL2018758B1/en
Priority to PCT/NL2018/050250 priority patent/WO2018194456A1/en
Application granted granted Critical
Publication of NL2018758B1 publication Critical patent/NL2018758B1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00Details of electrophonic musical instruments
    • G10H1/0033Recording/reproducing or transmission of music for electrophonic musical instruments
    • G10H1/0041Recording/reproducing or transmission of music for electrophonic musical instruments in coded form
    • G10H1/0058Transmission between separate instruments or between individual components of a musical system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10GREPRESENTATION OF MUSIC; RECORDING MUSIC IN NOTATION FORM; ACCESSORIES FOR MUSIC OR MUSICAL INSTRUMENTS NOT OTHERWISE PROVIDED FOR, e.g. SUPPORTS
    • G10G1/00Means for the representation of music
    • G10G1/04Transposing; Transcribing
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/066Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for pitch analysis as part of wider processing for musical purposes, e.g. transcription, musical performance evaluation; Pitch recognition, e.g. in polyphonic sounds; Estimation or use of missing fundamental
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/076Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction of timing, tempo; Beat detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/081Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for automatic key or tonality recognition, e.g. using musical rules or a knowledge base
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/086Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for transcription of raw audio or music data to a displayed or printed staff representation or to displayable MIDI-like note-oriented data, e.g. in pianoroll format
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2220/00Input/output interfacing specifically adapted for electrophonic musical tools or instruments
    • G10H2220/005Non-interactive screen display of musical or status data
    • G10H2220/015Musical staff, tablature or score displays, e.g. for score reading during a performance.
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2220/00Input/output interfacing specifically adapted for electrophonic musical tools or instruments
    • G10H2220/155User input interfaces for electrophonic musical instruments
    • G10H2220/441Image sensing, i.e. capturing images or optical patterns for musical purposes or musical control purposes
    • G10H2220/451Scanner input, e.g. scanning a paper document such as a musical score for automated conversion into a musical file format
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/171Transmission of musical instrument data, control or status information; Transmission, remote access or control of music data for electrophonic musical instruments
    • G10H2240/281Protocol or standard connector for transmission of analog or digital data to or from an electrophonic musical instrument
    • G10H2240/311MIDI transmission
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/311Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention provides an optical music recognition (OMR) assembly for converting sheet music, representing a music part as a first temporal representation, into a machine-processable representation of said piece of music that represents at least a pitch and duration of notes that are graphically represented in said sheet music and form said music part as a second temporal representation, said assembly comprising a data processor system and software which, when running on said data processor system: - retrieves a machine-processable representation of said sheet music; - generates a series of time slices of said sheet music, by applying a sliding window over said machine-processable representation of at least part of said sheet music; - defines a sequence-to-sequence system, said sequence-to-sequence system comprising: * provide a convolutional neural network (CNN) for converting said time slices into a sequence of third representations of said sheet music, said CNN comprising an input layer and an output layer; * provide a first, encoder recurrent neural network (RNN) as an encoder on said sequence of third representations, for providing a hidden representation of said sheet music, said first RNN having an input layer that is functionally coupled to said output layer of said CNN, and an output layer; * provide a second, decoder recurrent neural network (RNN) as a decoder to said hidden representation, for converting said hidden representation into said machine-processable representation, said second RNN having an input layer that is functionally coupled to said output layer of said first RNN, and an output for providing said machine-processable representation.

Description

Field of the invention
The invention relates to an assembly and method for converting data, in particular to an optical music recognition (OMR) assembly for converting sheet music.
Background of the invention
There are many applications that require conversion of data into another representation, in particular conversion of data that has a temporal relation. This conversion, which may comprise a transcription or translation, for instance relates to the fields of automatic music transcription, music information retrieval, optical music recognition, and the like.
Since the 1960s, attempts have been made to manufacture optical music recognition systems. However, there has yet to be a system that deals with the complexity and ambiguity of sheet music in a satisfactory way. Accuracy scores of these systems are generally too low to use them without human supervision. Classical optical music recognition systems are segmented and generally consist of the following parts: staff isolation, staff line removal, symbol segmentation and symbol classification. It was found that each individual part of such a system poses a difficult challenge, resulting in a low overall reliability.
In recent years, there have been substantial advances in sequence recognition methods using deep learning neural networks. Examples of these are machine translation and image captioning. A difference between these systems and the segmented optical music recognition systems, is that they try to capture the full recognition process into a single learning algorithm. Similar methods have been applied in optical character recognition and small optical music recognition tasks, showing promising results.
US6297439 for instance according to its abstract describes “A system and method are disclosed for automatically generating music on the basis of an initial sequence of input notes, and in particular to such a system and method utilizing a recursive artificial neural network (RANN) architecture. The aforementioned system includes a score interpreter interpreting an initial input sequence, a rhythm production RANN for generating a subsequent note duration, a note generation RANN for generating a subsequent note, and feedback means for feeding the pitch and duration of the subsequent note back to the rhythm generation and note generation RANNs, the subsequent note thereby becoming the current note for a following iteration.”
US8494257 according to its abstract describes “Data set generation and data set presentation for image processing are described. The processing determines a location for each of one or more musical artifacts [..] in the image and identifies a corresponding label for each of the musical artifacts, generating a training file that associates the identified labels and determined locations of the musical artifacts with the image, and presenting the training file to a neural network for training.”
US9123315 according to its abstract describes “A method for transcoding music, according to various aspects of the present invention, includes in any practical order: (a) reading indicia of a plurality of notes, each note having pitch and duration; (b) selecting a reference pitch; (c) determining indicia of tone from the reference pitch and the pitch of each note; and (d) outputting for use by an engraving engine, indicia of an apposite staff and indicia of tones and durations corresponding to the plurality of notes.”
Chase Dwayne Carthen: Rewind: A Music Transcription Method, 1 May 2016 (2016-05-01), pages 1-54, Reno, USA (Master thesis) in its abstract states: “Music is commonly recorded, played, and shared through digital audio formats such as wav, mp3, and various others. These formats are easy to use, but they lack the symbolic information that musicians, bands, and other artists need to retrieve important information out of a given piece. There have been recent advances in the Music Information Retrieval (MIR) field for converting from a digital audio format to a symbolic format. This problem is called Music Transcription and the systems built to solve this problem are called Automatic Music Transcription (AMT) systems. The recent advances in the MIR field have yielded more accurate algorithms using different types of neural networks from deep learning and iterative approaches. Rewind's approach is similar but boasts a new method using an encoder-decoder network where the encoder and decoder both consist of a gated recurrent unit and a linear layer. The encoder layer of Rewind is a single layer autoencoder that captures the temporal dependencies of a song and produces a temporal encoding. In other words, Rewind is a web app that utilizes a deep learning method to allow users to transcribe, listen to, and see their music.”
JP2871204B2 in its abstract states: “PURPOSE: To prevent variation in pitch from being decomposed into fine variation in the fine interval of a short note by absorbing the variation in pitch by the hysteresis that a recurrent network has when a musical sound which varies in pitch like singing and vibrato performance is put on a score. CONSTITUTION: This device has a band-pass filter bank part 14 which converts an external audio signal into power envelopes by frequency channels, a conflict recollection neural network part 13 which finds pitch categories from the power envelopes by the channels obtained from the band-pass filter bank part 14, an interval buffer part 12 which holds the pitch categories outputted by the conflict recollection neural network part 13, a readout timing generation part 15 which generates transcription intervals of musical intervals required for transcription and a musical interval storage part 11 which inputs and records pitch data from the interval buffer part 12 according to the timing outputted by the readout timing generation part 15.”
WO2008101126A1 in its abstract states: “Methods, systems, and devices are described for collaborative handling of music contributions over a network. Embodiments of the invention provide a portal, the portal being accessible over the network by a plurality of workstations and configured to provide a set of editing capabilities for editing music elements. Music contributions may be received at the portal. At least a portion of the music contributions include music elements. In certain embodiments, the music elements have been deconstructed from an audio signal or a score image. A number of collaboration requests may be received at the portal over the network. Some collaboration requests may originate from a first workstation, while other collaboration requests may originate from a second workstation. In response to at least one of the collaboration requests, at least a portion of the music elements may be edited using the editing capabilities of the portal.”
US2016099010A1 in its abstract states: “Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying the language of a spoken utterance. One of the methods includes receiving input features of an utterance; and processing the input features using an acoustic model that comprises one or more convolutional neural network (CNN) layers, one or more long short-term memory network (LSTM) layers, and one or more fully connected neural network layers to generate a transcription for the utterance.”
CN105678300A in its abstract states: “The invention relates to the image and text identification field, and specifically relates to a complex image and text sequence identification method. The complex image and text sequence identification method includes the steps: utilizing a sliding sampling box to perform sliding sampling on an image and text sequence to be identified; extracting the characteristics from the sub images obtained through sampling by means of a CNN and outputting the characteristics to an RNN, wherein the RNN successively identifies the front part of each character, the back part of each character, numbers, letters, punctuation, or blank according to the input signal; and successively recording and integrating the identification results for the RNN at each moment and acquiring the complete identification result, wherein the input signal for each moment for the RNN also includes the output signal of a recursion neural network for the last moment. The complex image and text sequence identification method can overcome the cutting problem of a complex image and text sequence, and can significantly improve the identification efficiency and accuracy for images and text.”
Currently, converting for instance sheet music to a digital format is cumbersome, if possible at all. Often, human interference or interpretation is required.
Summary of the invention
It is an aspect of the invention to provide an alternative method or assembly for converting a temporal representation into another temporal representation.
The current invention provides an optical music recognition (OMR) assembly for converting sheet music, representing a music part as a first temporal representation, into a machine-processable representation of said piece of music that represents at least a pitch and duration of notes that are graphically represented in said sheet music and form said music part as a second temporal representation, said assembly comprising a data processor system and software which, when running on said data processor system:
- retrieves a machine-processable representation of said sheet music;
- generates a series of time slices of said sheet music, by applying a sliding window over said machine-processable representation of at least part of said sheet music;
- defines a sequence-to-sequence system, said sequence-to-sequence system comprising:
* provide a convolutional neural network (CNN) for converting said time slices into a sequence of third representations of said sheet music, said CNN comprising an input layer and an output layer;
* provide a first, encoder recurrent neural network (RNN) as an encoder on said sequence of third representations, for providing a hidden representation of said sheet music, said first RNN having an input layer that is functionally coupled to said output layer of said CNN, and an output layer;
* provide a second, decoder recurrent neural network (RNN) as a decoder to said hidden representation, for converting said hidden representation into said machine-processable representation, said second RNN having an input layer that is functionally coupled to said output layer of said first RNN, and an output for providing said machine-processable representation.
The assembly provides an end-to-end trainable sequential model. This model can be trained as one pipeline by offering input at an input end and retrieving output at an output end.
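Purely by way of illustration, the coupled pipeline of CNN, encoder RNN and decoder RNN can be sketched as follows. This is a minimal sketch assuming the PyTorch library; the layer sizes, the class name OMRSeq2Seq, the output vocabulary and the greedy decoding loop are illustrative assumptions and are not prescribed by the invention.

# Minimal sketch of the CNN -> encoder RNN -> decoder RNN pipeline
# (assumption: PyTorch; all sizes, names and the decoding scheme are illustrative).
import torch
import torch.nn as nn

class OMRSeq2Seq(nn.Module):
    def __init__(self, slice_height=64, slice_width=32, hidden=256, vocab=128):
        super().__init__()
        # CNN: converts each image time slice into a fixed-size "third representation".
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.Sigmoid(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.Sigmoid(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * (slice_height // 4) * (slice_width // 4), hidden),
        )
        # Encoder LSTM: consumes the sequence of slice representations; its final
        # state is the hidden representation of the sheet music.
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        # Decoder LSTM: unrolled from the encoder state to emit output tokens,
        # e.g. pitch/duration symbols of a machine-processable representation.
        self.decoder = nn.LSTM(vocab, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)
        self.vocab = vocab

    def forward(self, slices, max_steps=32):
        # slices: (batch, time, 1, slice_height, slice_width)
        b, t = slices.shape[:2]
        feats = self.cnn(slices.flatten(0, 1)).view(b, t, -1)  # per-slice representations
        _, state = self.encoder(feats)                         # hidden representation
        token = torch.zeros(b, 1, self.vocab)                  # start-of-sequence input
        outputs = []
        for _ in range(max_steps):                             # greedy decoding sketch
            dec_out, state = self.decoder(token, state)
            logits = self.out(dec_out)
            outputs.append(logits)
            token = torch.zeros_like(token).scatter_(2, logits.argmax(2, keepdim=True), 1.0)
        return torch.cat(outputs, dim=1)                       # (batch, max_steps, vocab)

model = OMRSeq2Seq()
dummy = torch.rand(2, 20, 1, 64, 32)   # two staves, twenty sliding-window slices each
print(model(dummy).shape)              # torch.Size([2, 32, 128])

In this sketch, the final hidden state of the encoder LSTM is handed to the decoder LSTM as its starting state, which corresponds to the encoder-decoder structure described further below.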
Examples of a first temporal representation are for instance graphical music representation, sound recording.
In an embodiment, the first temporal representation relates to sheet music. Sheet music in its most general form relates to a graphical representation of music information. In a specific embodiment, it relates to a form of music notation, primarily used to notate western music. It is a way of writing down sequential musical information in a compact way, readable for human performers. In western music, musical symbols are notated on a staff, a group of five evenly spaced horizontal lines, used to differentiate between the pitches in written music. The higher a note is on a staff, the higher the pitch. A page of music typically consists of multiple staff lines, much like a piece of written text consists of multiple lines. The horizontal position of a note determines the order of the written musical sequence: sheet music is read from left to right. Figures 1-6 are examples of such a music notation.
Figure 1, for instance, shows one staff line, with fifteen ascending notes from the scale of C major. The final symbol is called a rest. It is similar to a musical note in every way, but instead of indicating a pitch it indicates a period of silence.
In another embodiment, the first temporal representation relates to a sound recording. Sound can be recorded, and for instance transformed into a digital format. Sound comprises music, spoken text in a language, and the like. Sound can be recorded and subsequently compressed using a lossless or lossy compression. Examples of such digital formats are MP3, MP4, but in fact any type of lossy compression or lossless compression of sound recordings can be used.
A number of lossless audio compression formats exist. Shorten was an early lossless format. Newer ones include Free Lossless Audio Codec (FLAC), Apple's Apple Lossless (ALAC), MPEG-4 ALS, Microsoft's Windows Media Audio 9 Lossless (WMA Lossless), Monkey's Audio, TTA, and WavPack. Many more lossless codecs exist.
Some audio formats feature a combination of a lossy format and a lossless correction; this allows stripping the correction to easily obtain a lossy file. Such formats include MPEG-4 SLS (Scalable to Lossless), WavPack, and OptimFROG DualStream.
Examples of a second temporal representation are, for instance, sound recordings, music files, and graphical music representations like sheet music. The sound may be music, or spoken text (for instance the same text as in the first representation) in a language or accent different from the spoken text in the first temporal representation.
In an embodiment, the second temporal representation relates to a digital music format. Music can be stored in various digital formats. An example of such a digital format is the MIDI protocol (see MIDI Association, The Official MIDI Specifications, 1996, URL: https://www.midi.org/specifications), which is widely used by musical sequencers and notation software. It is the de-facto standard for exchanging digital musical information. A notation format much closer to actual sheet music is ABC notation. Digital music notation formats have numerous advantages over their optical counterparts. Images consist of pixel data, and do not give any direct information about the represented musical content. As a result, computational musical analysis cannot be performed on these image formats directly. While an increasing amount of digitized sheet music is available, a large portion of available sheet music is still only accessible as images.
The Musical Instrument Digital Interface (MIDI) standard defines a system of communication between digital musical devices. It is a very compact method to define musical information, as all events are represented on byte level. The MIDI file is a way of storing MIDI information, and is typically only a few kilobytes in size. Since the 1980s, MIDI has been the standard way of playing digital music, favoured because of its ease of use and expandability. The MIDI standard provides a way of expressing a variety of musical events. Timing in MIDI is handled differently than in sheet music. A MIDI event has a tick property, which defines the number of ticks between the previous and current event. The duration of one tick is defined in the header of the MIDI file by the Pulses Per Quarter note (PPQ). This value informs the musical sequencer how many ticks one quarter note contains. Typically, as the duration of most musical notes can be divided by either three or four, the PPQ of MIDI is a multiple of twelve. Most MIDI sequencers take a standard PPQ of 480, but for simpler music lower values are possible too. Timing is the only structural component in a MIDI file. Events like barlines are not present in MIDI. A NoteOnEvent defines a note with a starting tick, pitch and velocity. If the defined velocity is zero, it can be considered the same as a NoteOffEvent, which signals the end of a musical note. As such, a musical note can be defined by two NoteOnEvents: the first event defines the starting tick and velocity of the note, the second event defines the end tick of the note. Other events are the KeySignatureEvent, which defines the key signature of a piece, and EndOfTrack, defining the end of a MIDI track. Contrary to sheet music, where the key signature changes the notation of accidentals in a piece, in MIDI the key signature does not influence the notes in a file in any way. It is only used for meta information.
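As an illustration of these MIDI concepts only, the following sketch writes a single quarter note as two NoteOn events, the second with velocity zero, at a PPQ of 480. It assumes the third-party Python library mido; the chosen pitch, key signature and file name are arbitrary.

# Sketch of a one-note MIDI file (assumption: the 'mido' library; values are arbitrary).
from mido import Message, MetaMessage, MidiFile, MidiTrack

mid = MidiFile(ticks_per_beat=480)            # PPQ: 480 ticks per quarter note
track = MidiTrack()
mid.tracks.append(track)

track.append(MetaMessage('key_signature', key='D', time=0))      # meta information only
track.append(Message('note_on', note=62, velocity=64, time=0))   # note starts (D4)
track.append(Message('note_on', note=62, velocity=0, time=480))  # velocity 0 ends the note
track.append(MetaMessage('end_of_track', time=0))

mid.save('example_quarter_note.mid')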
In general, a Convolutional Neural Network (CNN) is a type of artificial neural network (ANN) designed for the processing of spatial data. In embodiments of the current invention, known techniques and implementations can be used. In an embodiment, for instance, Max Pooling is used.
In an embodiment, convolutional layers are a functionality of CNNs. These convolutional layers replace the weights of a traditional feed forward neural network with trainable convolution filters. Filters can learn simple operations like edge, blob and corner detection. When stacked in multiple layers, the recognition of complicated spatial structures can be learned. In an embodiment, a single convolutional layer consists of multiple filters, where each filter operates on small areas of the layer input, producing a feature map. Feature maps represent the convolution between a filter and input of a layer. After the convolutional layer, in an embodiment a non-linearity and possible pooling operation is applied.
Pooling operations perform down-sampling on a feature map. A popular type of pooling is max pooling, where each pooling region outputs its maximum value. A pooling operation has a region size and stride associated with it. The region size defines the width and height of the pooling region, the stride defines the number of steps between pooling regions.
After the convolutional and pooling operations, the resulting feature maps are in an embodiment reshaped to a column vector and one or more fully connected layers are applied. These layers are the same as the layers in regular feed-forward neural networks. As in the rest of the network, sigmoid activations can for instance be used.
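As an illustration only, the effect of a convolutional layer followed by max pooling on the feature-map shapes can be sketched as follows, assuming PyTorch; the filter count, region size and stride are arbitrary.

# Sketch of convolution + max pooling shapes (assumption: PyTorch; sizes are arbitrary).
import torch
import torch.nn as nn

x = torch.rand(1, 1, 64, 32)                    # one greyscale image slice
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)    # pooling region 2x2, stride 2
fmap = torch.sigmoid(conv(x))                   # 8 feature maps of 64 x 32 each
print(fmap.shape)                               # torch.Size([1, 8, 64, 32])
print(pool(fmap).shape)                         # torch.Size([1, 8, 32, 16])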
ANNs with recurrent connections are an addition to the standard ANN model described in the previous section. These models belong to the family of Recurrent Neural Networks (RNNs). In addition to predicting from current input data, they are able to predict from past inputs as well. This property adds a new functionality to a network: the ability to work with sequential data.
Traditional RNNs can have trouble learning long-term dependencies. When the distance between two time steps gets too long, the information passed through the recurrent connections can be ’forgotten’. In an embodiment, a recurrent architecture called the Long Short Term Memory (LSTM) is used. A neuron in an LSTM network has multiple gates that control the remembering and forgetting of data from past time steps, improving the ability to model long dependencies.
A method to expand RNNs to perform tasks is to map sequences to sequences, possibly with different lengths and different orders. The architecture is called a sequence-to-sequence network, and is based on an encoder-decoder structure. This structure first encodes an input to a hidden representation, and subsequently decodes the output from this hidden representation. In an embodiment, both the encoder (RNN) and decoder (RNN) are LSTM networks. The hidden representation from the encoder is passed as the last hidden state of the encoder LSTM to the starting hidden state of the decoder LSTM.
In an embodiment, the convolutional neural network (CNN), said first, encoder recurrent neural network (RNN) and said second, decoder recurrent neural network (RNN) are functionally coupled, forming said sequence-to-sequence system, and said sequence-to-sequence system is trained using a training dataset of first temporal representations and known, resulting second temporal representations.
In an embodiment, in said training, said first temporal representations are provided as input to said convolutional neural network (CNN), the output of said second, decoder recurrent neural network (RNN) is compared to said second temporal representation, and parameters of said convolutional neural network (CNN), said first, encoder recurrent neural network (RNN) and said second, decoder recurrent neural network (RNN) are modified.
In an embodiment, the first RNN comprises a Long Short term memory (LSTM) architecture.
In an embodiment, the second RNN comprises a Long Short term memory (LSTM) architecture.
In an embodiment, the first temporal representation is a graphical representation, in particular a digital image; more in particular, said graphical representation is a representation of music, in particular sheet music.
In an embodiment, the second digital representation is a digital file comprising temporal instructions for actuating a device, in particular a music file for actuating or controlling a music instrument, more in particular selected from MIDI and MusicXML.
In an embodiment, the series of time slices comprise digital images obtained by sliding a window over a graphical representation, in particular over a digital image.
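By way of illustration only, such a sliding window can be sketched as follows, assuming NumPy and a greyscale staff image stored as a two-dimensional array; the window width and stride are arbitrary.

# Sketch of a sliding window over a staff image (assumption: NumPy; sizes are arbitrary).
import numpy as np

def time_slices(staff_image, window_width=32, stride=16):
    # Return partially overlapping vertical slices, ordered from left to right.
    height, width = staff_image.shape
    slices = []
    for x in range(0, width - window_width + 1, stride):
        slices.append(staff_image[:, x:x + window_width])
    return np.stack(slices)            # (num_slices, height, window_width)

staff = np.random.rand(64, 400)        # stand-in for a scanned staff line
print(time_slices(staff).shape)        # (24, 64, 32)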
The invention further relates to an optical music recognition (OMR) assembly for converting sheet music into a digital representation, said sheet music comprising a graphical, time sequential representation of a series of notes forming at least part of a music part, said OMR assembly comprising an assembly as described above, and wherein
- a sliding window is applied over a digital image of at least part of said sheet music, providing said time slices;
- said second, decoder recurrent neural network providing said digital representation that represents at least a pitch and duration of said notes that are graphically represented in said sheet music and form said music part.
In an embodiment of the OMR assembly, the sheet music comprises a graphical representation comprising a series of staff lines and notes, in particular said sheet music comprises a visual representation on a carrier, in particular a written or printed representation on paper.
In an embodiment of the OMR assembly, the convolutional neural network, said first, encoder recurrent neural network, and said second, decoder recurrent neural network have been trained as said sequence-to-sequence system using a training dataset comprising a series of sheet music samples and for each sheet music sample a resulting digital representation.
The invention further relates to an automatic music transcription (AMT) assembly comprising the assembly, wherein said first temporal representation of a signal is a sound recording. In an embodiment, the sound recording is a digital sound recording.
In an embodiment of the AMT assembly, it further comprises a spectral converter, said spectral converter allowing conversion of said sound recording into a series of spectral representations of said sound recording.
In an embodiment of the AMT assembly, the software, when running on said data processing system, defines a spectral converter for converting said sound recording into a series of magnitude spectrograms providing said time slices, wherein a time window having a time window size is shifted over said sound recording and a magnitude spectrogram is calculated for each time window.
In an embodiment of the AMT assembly, the time windows have a window overlap.
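By way of illustration only, such a spectral converter can be sketched as follows, assuming NumPy; the sample rate, the window size and the hop between windows (and thus the window overlap) are arbitrary.

# Sketch of converting a sound recording into magnitude spectrogram time slices
# (assumption: NumPy; sample rate, window size and overlap are arbitrary).
import numpy as np

fs = 16000                                                          # sample rate in Hz
samples = np.sin(2 * np.pi * 440.0 * np.arange(0, 2.0, 1.0 / fs))   # stand-in recording (A4)

window_size, hop = 1024, 768           # hop larger than half the window: overlap below 50%
window = np.hanning(window_size)
spectrograms = []
for start in range(0, len(samples) - window_size + 1, hop):
    frame = samples[start:start + window_size] * window
    spectrograms.append(np.abs(np.fft.rfft(frame)))   # magnitude spectrogram of this window
spectrograms = np.stack(spectrograms)  # (num_time_windows, freq_bins)
print(spectrograms.shape)              # (41, 513)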
The invention further pertains to a method for converting sheet music into a digital representation using the OMR assembly, comprising:
- retrieving a digital image of at least part of said sheet music;
- applying a sliding window over said digital image of said sheet music, said sliding window having a time width and a stride and providing said time slices comprising a time sequence of partially overlapping input images of at least part of said sheet music;
- applying said convolutional neural network (CNN) to said input images for converting said input images into a sequence of numerical representations of said input images;
- applying said first, recurrent neural network (RNN) as said encoder on said sequence of numerical representations for providing a hidden output data set;
- applying said second, decoder recurrent neural network (RNN) as said decoder to said hidden output data set for converting said hidden output data into said digital representation that represents at least a pitch and duration of said notes that are graphically represented in said sheet music and form said music part.
The invention further relates to an assembly for producing an optical music recognition (OMR) assembly for converting sheet music into a digital representation, said sheet music comprising a graphical, time sequential representation of a series of notes forming a music part, said OMR assembly comprising a data processor and software which, when running on said data processor: -provide a training dataset comprising a series of sheet music samples and for each sheet music sample a resulting digital representation;
- provide a neural network assembly comprising :
(a) a convolutional neural network (CNN) adapted for receiving training input images and converting said input images into a sequence of numerical representations of said input images;
(b) a first, encoder recurrent neural network (RNN) as an encoder adapted for receiving said sequence of numerical representations for providing a hidden output data set;
(c) a second, decoder recurrent neural network (RNN) as a decoder adapted for receiving said hidden output data set for converting said hidden output data into said digital representation that represents at least a pitch and duration of said notes that are graphically represented in said sheet music and form said music part;
- train said neural network assembly using said training dataset.
The invention further relates to a method for producing the assembly, wherein:
- a training dataset is provided, said training dataset comprising a series of first temporal representations of signals, each having a resulting second temporal representation of each signal;
- said convolutional neural network, said first recurrent neural network and said second recurrent neural network are provided, where said neural networks are coupled;
- applies said convolutional neural network to said time slices for converting said time slices into a sequence of third representations of said first temporal representation;
- applies a first, trained, recurrent neural network (RNN) as an encoder on said sequence of third representations, input as one data entry, for providing a hidden representation of said first temporal representation;
- applies a second, trained, decoder recurrent neural network (RNN) as a decoder to said hidden representation, input as one data entry, for converting said hidden representation into a calculated second temporal representation of said signal;
- adjusts parameters in at least one selected from said convolutional neural network, said first, trained, recurrent neural network (RNN) and said second, trained, decoder recurrent neural network (RNN) based upon a difference between said resulting second temporal representation of said signal and said calculated second temporal representation of each signal.
In an embodiment of this method, the neural networks are trained using back propagation of said training dataset.
In an embodiment of this method, the parameters in said neural networks are adjusted based upon gradient descent optimization.
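By way of illustration only, such parameter adjustment by back propagation and gradient descent can be sketched as follows, assuming PyTorch; the tiny stand-in network, the loss function and the optimiser settings are arbitrary and are not part of the claimed method.

# Sketch of adjusting parameters by backpropagation and gradient descent
# (assumption: PyTorch; the stand-in network and data are illustrative only).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 32), nn.Sigmoid(), nn.Linear(32, 4))
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)   # gradient descent optimization
loss_fn = nn.CrossEntropyLoss()

# One training pair: a first temporal representation (here random features) and
# the known, resulting second temporal representation (here a class label).
inputs = torch.rand(8, 10)
targets = torch.randint(0, 4, (8,))

for epoch in range(5):
    optimizer.zero_grad()
    predictions = net(inputs)             # forward pass through the coupled network
    loss = loss_fn(predictions, targets)  # difference between calculated and known output
    loss.backward()                       # back propagation of the error
    optimizer.step()                      # adjust parameters along the gradient
    print(epoch, loss.item())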
The invention further pertains to an assembly for converting a first temporal representation of a signal into a second temporal representation of said signal, said assembly comprising a data processor system and software which, when running on said data processor system:
- retrieves a series of time slices of said first temporal representation;
- defines a sequence-to-sequence system, said sequence-to-sequence system comprising:
* provide a convolutional neural network (CNN) for converting said time slices into a sequence of third representations of said first temporal representation, said CNN comprising an input layer and an output layer;
* provide a first, encoder recurrent neural network (RNN) as an encoder on said sequence of third representations, for providing a hidden representation of said first temporal representation, said first RNN having an input layer that is functionally coupled to said output layer of said CNN, and an output layer;
* provide a second, decoder recurrent neural network (RNN) as a decoder to said hidden representation, for converting said hidden representation into said second temporal representation, said second RNN having an input layer that is functionally coupled to said output layer of said first RNN, and an output for providing said second temporal representation.
This assembly in an embodiment can comprise all the features of the dependent claims.
In an embodiment, that assembly provides an optical music recognition (OMR) assembly for converting sheet music into a digital representation, said sheet music comprising a graphical, time sequential representation of a series of notes forming at least part of a music part, said OMR assembly comprising an assembly according to any one of the preceding claims, wherein
- a sliding window is applied over a digital image of at least part of said sheet music, providing said time slices;
- said second, decoder recurrent neural network providing said digital representation that represents at least a pitch and duration of said notes that are graphically represented in said sheet music and form said music part.
In an embodiment, the assembly is an automatic music transcription (AMT) assembly, wherein said first temporal representation of a signal is a sound recording.
In an embodiment of the AMT assembly, it further comprises a spectral converter, said spectral converter allowing conversion of said sound recording into a series of spectral representations of said sound recording.
In an embodiment of the AMT assembly, said software, when running on said data processing system, defines a spectral converter for converting said sound recording into a series of magnitude spectrograms providing said time slices, wherein a time window having a time window size is shifted over said sound recording and a magnitude spectrogram is calculated for each time window.
In an embodiment of the AMT assembly, the time windows have a window overlap, in particular said window overlap is less than said time window width, more in particular less than 50% of said time window width.
The invention further pertains to a method for producing the assembly, wherein:
- a training dataset is provided, said training dataset comprising a series of first temporal representations of signals, each having a resulting second temporal representation of each signal;
- said convolutional neural network, said first recurrent neural network and said second recurrent neural network are provided, where said neural networks are coupled;
- applies said convolutional neural network to said time slices for converting said time slices into a sequence of third representations of said first temporal representation;
- applies a first, trained, recurrent neural network (RNN) as an encoder on said sequence of third representations, input as one data entry, for providing a hidden representation of said first temporal representation;
- applies a second, trained, decoder recurrent neural network (RNN) as a decoder to said hidden representation, input as one data entry, for converting said hidden representation into a calculated second temporal representation of said signal;
- adjusts parameters in at least one selected from said convolutional neural network, said first, trained, recurrent neural network (RNN) and said second, trained, decoder recurrent neural network (RNN) based upon a difference between said resulting second temporal representation of said signal and said calculated second temporal representation of each signal.
In an embodiment of that method the neural networks are trained using back propagation of said training dataset.
In an embodiment of that method the parameters in said neural networks are adjusted based upon gradient descent optimization.
In implementations of the neural networks of the current invention, the neural networks are defined through software running on computer systems that comprise one or more so-called graphics cards, which are commonly used for driving display devices. The logical structure of these graphics cards makes them suited for implementing neural networks. This is well known to a skilled person. The neural networks may also be implemented on specially designed computer devices or computer systems. Once a neural network is trained, and the parameters and structure are known, this structure may also be extracted and implemented on a general-purpose computer system. The training and/or implementation of one or more of the neural networks may be through software, or partially or completely based upon a hardware implementation.
The terms “upstream” and “downstream” relate to an arrangement of items or features relative to the propagation of the light from a light generating means (here especially the first light source), wherein relative to a first position within a beam of light from the light generating means, a second position in the beam of light closer to the light generating means is “upstream”, and a third position within the beam of light further away from the light generating means is “downstream”. In a neural network, data is also passed through the network from upstream to downstream.
The term “substantially” herein, such as in “substantially consists”, will be understood by the person skilled in the art. The term “substantially” may also include embodiments with “entirely”, “completely”, “all”, etc. Hence, in embodiments the adjective substantially may also be removed. Where applicable, the term “substantially” may also relate to 90% or higher, such as 95% or higher, especially 99% or higher, even more especially 99.5% or higher, including 100%. The term “comprise” also includes embodiments wherein the term “comprises” means “consists of”.
The term “functionally” will be understood by, and be clear to, a person skilled in the art. The term “substantially” as well as “functionally” may also include embodiments with “entirely”, “completely”, “all”, etc. Hence, in embodiments the adjective functionally may also be removed. When used, for instance in “functionally parallel”, a skilled person will understand that the adjective “functionally” includes the term substantially as explained above. Functionally in particular is to be understood to include a configuration of features that allows these features to function as if the adjective “functionally” was not present. The term “functionally” is intended to cover variations in the feature to which it refers, and which variations are such that in the functional use of the feature, possibly in combination with other features it relates to in the invention, that combination of features is able to operate or function. For instance, if an antenna is functionally coupled or functionally connected to a communication device, electromagnetic signals that are received by the antenna can be used by the communication device. The word “functionally” as for instance used in “functionally parallel” is used to cover exactly parallel, but also the embodiments that are covered by the word “substantially” explained above. For instance, “functionally parallel” relates to embodiments that in operation function as if the parts are for instance parallel. This covers embodiments for which it is clear to a skilled person that it operates within its intended field of use as if it were parallel.
Furthermore, the terms first, second, third and the like when used in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein.
The devices or apparatus herein are amongst others described during operation. As will be clear to the person skilled in the art, the invention is not limited to methods of operation or devices in operation.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb to comprise and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article a or an preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device or apparatus claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
The invention further applies to an apparatus or device comprising one or more of the characterising features described in the description and/or shown in the attached drawings. The invention further pertains to a method or process comprising one or more of the characterising features described in the description and/or shown in the attached drawings.
The various aspects discussed in this patent can be combined in order to provide additional advantages. Furthermore, some of the features can form the basis for one or more divisional applications.
Brief description of the drawings
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying schematic drawings in which corresponding reference symbols indicate corresponding parts, and in which:
Figures 1-6 provide some illustration of sheet music and its challenges, in which
Figure 1 depicts an example of a staff line with the scale of C in ascending order;
Figure 2 depicts an example of a fragment of sheet music with mixed durations;
Figure 3 depicts an example of a fragment with two sharps as key signature and multiple accidentals;
Figure 4 depicts an example of a fragment of polyphonic music;
Figure 5 depicts an example of a fragment containing two quarter triplets;
Figure 6 depicts an example of a double staff line, or 'Grand staff', with ambiguous notation;
Figure 7 illustrates in an abstract manner a Recurrent Neural Network (RNN);
Figure 8 illustrates an RNN in unfolded manner;
Figure 9 illustrates a Long Short term memory (LSTM) for use in an RNN;
Figure 10 illustrates a sliding window over (part) of a staff line;
Figure 11 illustrates processing of the data provided using the approach of figure 10;
Figure 12 illustrates a complete, schematic overview of the algorithm, here with the first 4 time steps and 4 decoder outputs, and
Figure 13 illustrates an approach for an AMT application.
The drawings are not necessarily to scale.
Description of preferred embodiments
In the description of embodiments below, two different applications will be discussed as examples of applying the current invention.
The first example relates to optical music recognition or OMR. In such an application, as explained above, a graphical representation of music is converted into a digital music representation, in particular a digital representation that comprises machine instructions for playing music. Often, sheet music, in particular classical sheet music, is converted into one or more MIDI or MusicXML files.
The second example relates to automatic music transcription or AMT. In such an application, music is converted into a symbolic representation of the music. Often, a digital recording, for instance in the “mp3” file format, is converted into classical sheet music.
First, some examples and challenges of the most common sheet music notation will be discussed with reference to figures 1-6. Next, a current solution for converting sheet music to a digital format will be discussed.
Figure 1 is an example of one staffline, with fifteen ascending notes from the scale of C major. The final symbol is called a rest. It is similar to a musical note in every way, but instead of indicating a pitch it indicates a period of silence. Sheet music is typically divided in time by bars. Each bar in music is of a fixed duration, indicated by the time signature. In figure 1, the time signature can be read at the start of the staff line: 4/4. Each following bar will be of that particular time signature, until it is changed. The start and end of a bar is indicated by barlines. These are the vertical lines between each group of four notes in figure 1.
While the horizontal position dictates the order of the sequence, it does not say anything about the duration of a note or rest. These are indicated by the shape of the musical symbol. Figure 1 consists entirely of notes of the same shape, and thus of the same duration.
Figure 2 is slightly more complex and combines a mix of durations and pitches. Additionally, as shown in this example, two separate notes can be grouped together with a tie, like the two centre notes in the figure. Tied notes are played as one single note, where the duration is the sum of the durations of the tied notes. Finally, there are the concepts of key signature and accidentals. These indicate long and short term dependencies in sheet music to raise or lower the notated pitch of a note. For this, three different symbols are used: the sharp (♯), the flat (♭) and the natural (♮). The key signature of a piece indicates a global key and raises or lowers all indicated notes. Accidentals are more local, and only change the pitch of a note for the duration of the bar.
Figure 3 shows an example of these concepts. The key signature, indicated at the start of the staff line, is two sharps. A sharp is added to the pitches on the horizontal lines on which the sharps of the key signature are located. Within each bar, multiple accidentals can be observed, changing the pitches of the notes pertaining to the accidental for the duration of the bar.
A key difference between written text and sheet music is the possibility of polyphony. A polyphonic piece of music will contain multiple concurrent sequences, possibly notated in one staff line. Figure 4 displays an example excerpt of a polyphonic score. Polyphony adds a layer of complexity to sheet music, and has to be taken into consideration when designing systems that have to be able to deal with these problems.
There are a few considerations and difficulties of sheet music in a machine learning context.
First, as already referred to above, key signatures and time signatures are examples of long dependencies in sheet music. In some notation styles, the time signature is only indicated at the start of a piece or in case of a time signature change. As a result, there is a possibility of long term dependencies over multiple pages of sheet music. The same is the case for the key signature. Both key and time signature can be changed in the middle of a piece of sheet music, changing the long term dependencies of upcoming bars.
Furthermore, there can be bidirectional dependencies. Structures like tuplets have the possibility of a dependency to both previous and next notes. This is referred to as a bidirectional dependency. An example of such a tuplet is the quarter triplet shown in figure 5. The number '3' in the figure indicates that the notes before, under and after the number are part of a triplet structure, which has an altered duration compared to regular quarter notes.
Yet another challenge is ambiguity. Sheet music has the possibility of ambiguous notation. The most common example of this happens when music is represented on multiple staff lines, a common notation for piano music, as in figure 6. Both the top and bottom staff do not have complete measures, as two beats are missing. Only by interpreting the relative positions of the notes in the top and bottom staff can the musical content be extracted.
Yet another challenge is contextual musical symbols. Aside from notes and rests, there are some more categories of musical symbols. Chords, lyrics, dynamics, articulation symbols, and textual instructions can all be part of a musical score. This can create difficulty for OMR systems, as a lot of these symbols are not fully standardized. For example, for textual instructions or lyrics an additional OCR system is needed.
Considerations for a computer program product
Artificial neural networks (ANNs) with recurrent connections are an addition to a standard ANN model. These ANNs with recurrent connections belong to the family of Recurrent Neural Networks (RNNs). In addition to predicting from current input data 2, they are able to predict from past inputs 2 as well. This property adds a new functionality to a network: the ability to work with sequential data. To make this RNN a recurrent architecture, a cyclical component 5 is added to the hidden layers 3, as shown in figure 7. In this architecture, the hidden layers 3 output information to the output layer 4 and to themselves. The information from the cyclical connection 5 is used the next time the network does a forward pass.
The recurrent model 1 in general, as depicted in figure 7, can be unrolled into multiple time steps 2, together forming input 2, as shown in figure 8, to make it visually easier to understand. It shows the network at different steps in time, connected by the recurrent connections 5.
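Purely for illustration, the unrolled forward pass of such a recurrent model can be sketched as follows; the vanilla RNN with tanh activation and the weight sizes used here are illustrative assumptions, not part of the embodiment:

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, h0):
    """Unrolled forward pass of a plain RNN: each step sees the current
    input and the hidden state fed back from the previous step."""
    h = h0
    outputs = []
    for x in xs:                              # one step per time slice
        h = np.tanh(x @ W_xh + h @ W_hh)      # recurrent (cyclical) connection
        outputs.append(h @ W_hy)              # output layer at this step
    return outputs, h

# illustrative sizes only: 8-dimensional inputs, 16 hidden units, 5 outputs
rng = np.random.default_rng(0)
xs = [rng.normal(size=8) for _ in range(4)]   # inputs A, B, C, D
W_xh = rng.normal(size=(8, 16)) * 0.1
W_hh = rng.normal(size=(16, 16)) * 0.1
W_hy = rng.normal(size=(16, 5)) * 0.1
ys, h_last = rnn_forward(xs, W_xh, W_hh, W_hy, np.zeros(16))
```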
The inventor further selected, in an embodiment, a special kind of recurrent architecture called the Long Short-Term Memory (LSTM). A neuron 6 in an LSTM network has multiple gates that control the remembering and forgetting of data from past time steps, improving the ability to model long-term dependencies. In an LSTM cell/neuron, the output ht is different from the cell state Ct, allowing for better passthrough of information. This is a key difference compared to the traditional recurrent connections. A schematic of the LSTM cell/neuron is shown in figure 9, with input xt and output ht.
In the schematic representation of figure 9, the output and next hidden state ht are split for clarity, but their values are the same. The σ and tanh are the sigmoid and tanh activation functions, each with their own weight matrix. The x and + symbols are element-wise multiplication and addition. σ1 acts as a forget gate. It decides, based on the previous hidden state ht-1 and the current input xt, what information is kept in the cell state Ct. The forget gate is denoted with ft, as shown in formula 1 below.
Next, σ2 is the input gate, or it (formula 2). It decides how much of the computed state is passed through. The tanh1 creates new candidate memories gt for Ct. These two are added to the previous cell state Ct-1, modulated by the forget gate, to calculate the new cell state Ct (formula 5).
σ3 is the output gate of the LSTM cell, formulated in formula 3. If the cell state contains relevant information, the information is passed to the next hidden state ht (formula 6) and thus to the output of the LSTM cell.
ft = σ(ht-1 Wf,h + xt Wf,x) (1)
it = σ(ht-1 Wi,h + xt Wi,x) (2)
ot = σ(ht-1 Wo,h + xt Wo,x) (3)
gt = tanh(ht-1 Wg,h + xt Wg,x) (4)
Ct = ft * Ct-1 + it * gt (5)
ht = tanh(Ct) * ot (6)
In the above formulas, W denotes the weight matrix for a specific gate and a specific input. For example: Wf,x is the weight matrix for the forget gate and the input xt.
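As a non-limiting sketch, formulas (1)-(6) can be implemented directly. The absence of bias terms mirrors the formulas as given; the weight shapes, dictionary key names and zero initial states are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM time step following formulas (1)-(6).
    W is a dict of weight matrices, e.g. W['f_h'] stands for Wf,h."""
    f_t = sigmoid(h_prev @ W['f_h'] + x_t @ W['f_x'])   # (1) forget gate
    i_t = sigmoid(h_prev @ W['i_h'] + x_t @ W['i_x'])   # (2) input gate
    o_t = sigmoid(h_prev @ W['o_h'] + x_t @ W['o_x'])   # (3) output gate
    g_t = np.tanh(h_prev @ W['g_h'] + x_t @ W['g_x'])   # (4) candidate memory
    c_t = f_t * c_prev + i_t * g_t                      # (5) new cell state
    h_t = np.tanh(c_t) * o_t                            # (6) new hidden state / output
    return h_t, c_t

# illustrative dimensions: 8-dimensional input, 16 LSTM units
rng = np.random.default_rng(1)
W = {f'{g}_h': rng.normal(size=(16, 16)) * 0.1 for g in 'fiog'}
W.update({f'{g}_x': rng.normal(size=(8, 16)) * 0.1 for g in 'fiog'})
h, c = np.zeros(16), np.zeros(16)
h, c = lstm_step(rng.normal(size=8), h, c, W)
```

Library implementations, such as the Keras LSTM layer referred to later in this document, perform essentially the same computation, with bias terms added.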
The current architecture is referred to as a sequence-to-sequence network, and is based on an encoder-decoder structure. This structure first encodes an input to a hidden representation, and subsequently decodes the output from this hidden representation. In the case of the sequence-to-sequence architecture, both the encoder and decoder are LSTM networks. The hidden representation from the encoder is passed as the last hidden state of the encoder LSTM to the starting hidden state of the decoder LSTM. An example sequence-to-sequence network is shown in figure 11. In this figure, the input {A, B, C, D} is encoded to a hidden representation, which is mapped to the output {X, Y, Z}.
An attention mechanism is added that allows the decoder to look back into previous time steps of the encoder. A fixed-length hidden representation can be a bottleneck in the network; adding such a 'search function' to the decoder can reduce this bottleneck.
For each decoder hidden state dt, an attention vector a't is calculated from all encoder states (h1..hn). The attention is calculated as follows:
at,i = softmax(vT tanh(W1 hi + W2 dt))
d't = Σi at,i hi
where v, W1 and W2 are the trained weights.
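A minimal sketch of this attention step, assuming illustrative dimensionalities and random placeholder states for the encoder and decoder:

```python
import numpy as np

def attention(encoder_states, d_t, W1, W2, v):
    """Additive attention: score each encoder state hi against the decoder
    state dt, softmax the scores, and return the weighted context vector d't."""
    scores = np.array([v @ np.tanh(W1 @ h_i + W2 @ d_t) for h_i in encoder_states])
    a_t = np.exp(scores - scores.max())
    a_t = a_t / a_t.sum()                          # softmax over encoder positions
    context = sum(a * h_i for a, h_i in zip(a_t, encoder_states))
    return a_t, context

rng = np.random.default_rng(2)
H = [rng.normal(size=256) for _ in range(20)]      # encoder states h1..hn
d_t = rng.normal(size=256)                         # current decoder state
W1 = rng.normal(size=(64, 256)) * 0.1
W2 = rng.normal(size=(64, 256)) * 0.1
v = rng.normal(size=64) * 0.1
weights, d_prime_t = attention(H, d_t, W1, W2, v)
```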
In the current embodiments, the sheet music 7 is transformed into time slices 8 (indicated A, B, C, D, ....) and input into the system. Taking example from recurrent convolutional networks, a sliding window 9 over the input sheet music 7 is used, as illustrated in figure 10. This method effectively transforms the input to a sequence of image patches or time slices 8, and makes the translation between sheet music and MIDI a sequence-to-sequence problem. Additionally, the translation exhibits similar traits to neural machine translation: the input 8 and output 50 sequences are possibly of different lengths, with alignment problems.
In addition to the input being sequential, it is also image based. Feeding the raw pixel vectors into a sequence-to-sequence architecture would result in both loss of spatial context and very high input dimensionalities. For this reason, a convolutional network is used before each time step in the encoder. The combined architecture is a convolutional sequence-to-sequence model, as depicted in figure 11. In this model, {A, B, C, D} are image patches of sheet music, and {X, Y, Z} are MIDI events. In the figure the attention mechanism is omitted for clarity, but it is present in the model. Of course, in the real model the lengths of the input and output are many times longer than shown in the figure. Figure 12 depicts in fact the same model as figure 11, in a somewhat different way. Here, the overlap of the sliding window 9 is smaller, only about 5-10%.
Experimental example OMR
Convolutional neural network.
To extract relevant features from the image patches 8, each patch is fed into a Convolutional Neural Network (CNN) 40. In this research, we keep the architecture of the CNN 40 the same between different experiments, to ensure a fair comparison. First a max-pooling operation of 3x3 is applied on the input patch for dimensionality reduction. Then, a convolutional layer of 32 5x5 kernels is applied, followed by a relu activation and 2x2 max-pooling operation. This layer is repeated, and a fully-connected layer of 256 units with relu activation is applied, so each input for the encoder will be a vector of size 256.
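A Keras sketch of this patch encoder could look as follows; the patch size and the 'same' padding are assumptions, while the layer sequence follows the description above:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_patch_cnn(patch_height=230, patch_width=70):
    """CNN 40 applied to every image patch 8: 3x3 max-pool, then twice
    (5x5 conv with 32 kernels + relu + 2x2 max-pool), then a dense layer of 256."""
    inp = layers.Input(shape=(patch_height, patch_width, 1))
    x = layers.MaxPooling2D(pool_size=(3, 3))(inp)
    x = layers.Conv2D(32, (5, 5), activation='relu', padding='same')(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = layers.Conv2D(32, (5, 5), activation='relu', padding='same')(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = layers.Flatten()(x)
    features = layers.Dense(256, activation='relu')(x)
    return tf.keras.Model(inp, features, name='patch_cnn')

patch_cnn = build_patch_cnn()
```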
Sequence-to-sequence network.
After extracting a vector description of each image patch, the sequence of vectors is fed into a sequence-to-sequence network. This architecture consists of two RNNs 10, 20. The first RNN 10, the encoder 10, encodes the full input sequence 8 to a fixed-size hidden representation. The second RNN 20, the decoder 20, produces a sequence of outputs 51 from the encoded hidden representation, together forming output 50. In the case of an OMR task, this sequence of outputs 51 is the sequence of pitches and durations generated from the MusicXML files. For both encoder and decoder, a single-layer Long Short-Term Memory RNN is used with 256 units. To predict both the pitch and duration, the output of the decoder is split into two separate output layers with a softmax activation and categorical cross-entropy loss.
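A sketch of such an encoder-decoder with a split pitch/duration output is shown below, in Keras. The vocabulary sizes and the use of teacher-forced previous outputs as decoder input are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_seq2seq(feature_dim=256, units=256, n_pitches=108, n_durations=48):
    # encoder 10: consumes the sequence of CNN feature vectors (time slices 8)
    enc_in = layers.Input(shape=(None, feature_dim))
    _, state_h, state_c = layers.LSTM(units, return_state=True)(enc_in)

    # decoder 20: starts from the encoder's final hidden state
    dec_in = layers.Input(shape=(None, n_pitches + n_durations))  # teacher-forced previous outputs (assumption)
    dec_out = layers.LSTM(units, return_sequences=True)(dec_in, initial_state=[state_h, state_c])

    # output 50: split into a pitch head and a duration head, each with softmax
    pitch = layers.Dense(n_pitches, activation='softmax', name='pitch')(dec_out)
    duration = layers.Dense(n_durations, activation='softmax', name='duration')(dec_out)

    model = tf.keras.Model([enc_in, dec_in], [pitch, duration])
    model.compile(optimizer='adam',
                  loss={'pitch': 'categorical_crossentropy',
                        'duration': 'categorical_crossentropy'})
    return model

seq2seq = build_seq2seq()
```

At inference time the decoder would instead be run step by step, feeding back its own predictions; that loop is omitted here for brevity.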
The dataset used in this research is compiled from monophonic MusicXML scores from the MuseScore sheet music archive. The archive is made up of user-generated scores, and is very diverse in both content and purpose. As a result, the dataset is varied in types of music, key signatures, time signatures, clefs, and notation style. This diversity will aid in the training of our model. To generate the dataset, each score is checked for monophonicity, and dynamics, expressions, chord symbols, and textual elements are removed. This process produces a dataset of about 17,000 MusicXML scores. For training and evaluation, these scores are split into three different subsets: 60% is used for training, 15% for testing and 25% for the evaluation of the models.
To the relatively clean image data of sheet music, several distortions were added, for instance additive white Gaussian noise (AWGN), additive Perlin noise (APN), small scale elastic transformations (ET-s), large scale elastic transformations (ET-l), and combinations thereof.
Sliding window input
The image input of the algorithm is defined as a sequence of image patches 8, generated by applying a sliding window 9 over the original input score 7. The implementation has two separate parameters: the window width w and window stride s. By varying w, the amount of information per window can be increased or decreased; s defines how much redundancy exists between adjacent windows. Increasing the value of w or decreasing the value of s provides the model with more information about the score, but will raise the computational complexity of the algorithm. Thus, when determining the optimal parameters, a balance has to be struck between complexity and input coverage. As a rule of thumb, we use a w that is approximately twice the width of a notehead, and an s of half the value of w. This ensures that each musical object is shown fully at least once in an input window.
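The sliding window itself can be sketched in a few lines; the concrete pixel values (a 70-pixel window for a roughly 35-pixel notehead) are assumptions chosen only to illustrate the rule of thumb:

```python
import numpy as np

def sliding_window_patches(score_image, w=70, s=35):
    """Cut the score image 7 into a sequence of image patches 8 by sliding
    a window 9 of width w with stride s; s = w/2 gives 50% overlap, so every
    musical object appears fully in at least one patch."""
    height, width = score_image.shape
    patches = []
    for x0 in range(0, max(width - w, 0) + 1, s):
        patches.append(score_image[:, x0:x0 + w])
    return patches

score = np.ones((230, 1200))             # placeholder staff-line image
patches = sliding_window_patches(score)  # sequence A, B, C, D, ... fed to the CNN
```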
The network is trained using backpropagation with gradient descent optimization. An example of this is the ADAM optimizer.
Six separate models are trained, one on each of the proposed augmented data sets: no augmentations, AWGN, APN, small ET, large ET and all augmentations. All models are trained with a batch size of 64 using the ADAM optimizer, with an initial learning rate of 8·10^-4 and a constant learning rate decay which halves the rate every ten epochs. Each model is trained to convergence, taking about 25 epochs on the non-augmented dataset. A single Nvidia Titan X Maxwell is used for training, which trains a single model in approximately 30 hours.
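Expressed as a Keras training sketch, with the schedule following the numbers above; the model and data names are placeholders from the earlier sketches:

```python
import tensorflow as tf

INITIAL_LR = 8e-4

def halve_every_ten_epochs(epoch):
    """Constant decay: the initial rate of 8e-4 is halved every ten epochs."""
    return INITIAL_LR * (0.5 ** (epoch // 10))

callbacks = [tf.keras.callbacks.LearningRateScheduler(halve_every_ten_epochs)]

# `seq2seq`, `train_inputs` and `train_targets` are placeholders for the
# compiled model and the (augmented) training data described above:
# seq2seq.fit(train_inputs, train_targets,
#             batch_size=64, epochs=25, callbacks=callbacks)
```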
By using an end-to-end trainable sequential model, we perform the full OMR pipeline in a single step. By incorporating sequence-to-sequence models into OMR, there are many new possibilities for obtaining development data. We view this aspect as the largest advantage the proposed method has over segmented models, as the acquisition of quality training data is considered one of the major roadblocks in OMR. The proposed model is shown to be robust to noisy input, an important quality for any OMR model. Additionally, the experiments show that it can deal with the large scale elastic transformations that essentially change the musical font.
Application of the current method or assembly for automatic music transcription (AMT)
Automatic music transcription (AMT) is a challenging problem for humans and machines. The task at hand is to find a mapping f : x → y that translates an audio sequence x to a symbolic representation y of that sequence. The difficulty is no surprise, because in the most general case, polyphonic AMT, separating the sources of sound alone, e.g. one key stroke on a piano from another, is already a highly underdetermined problem. Thus, any sufficient model needs to learn strong priors over the audio sequences it receives as input in order to perform well. Even if a model does learn these priors sufficiently, it cannot be guaranteed that the task at hand is well defined. For example, the harmonics of two distinct notes of possibly different instruments can have complex interactions. Furthermore, noise or recording technique may limit the prior assumptions that can be made. However, the fact that machine performance lags behind human performance is a strong indicator of the room for improvement for these models. Furthermore, we already know that music follows (probabilistic) rules with respect to tempo, harmony or timbre. Hence, a lot of prior assumptions can be made to simplify the problem. It has been the subject of several studies to work in this prior knowledge without restricting the flexibility of a model too much.
Notice that AMT falls in the regime of perceptual tasks. Within this field, deep learning has been contributing remarkable improvements on several tasks, initially mainly in computer vision (CV) and later also in several other domains such as natural language processing. There is reason to believe that music information retrieval (MIR) tasks are more challenging than CV tasks, for example due to the ambiguity of annotation even to human perceivers. However, several pioneering studies in deep learning have shown significant improvement in various MIR tasks such as onset and structural boundary detection, piano transcription, genre classification or sound generation, to name just a few. This gives reason to believe in the power of such techniques.
Within the deep learning domain there are two popular models: the convolutional neural network and the recurrent neural network. Convolutional neural networks have had enormous success in classification tasks such as image recognition. They seem to break the curse of dimensionality by learning locally low-dimensional representations of their input. By stacking many of these representations in a hierarchical manner, a global understanding of the input as a whole can be achieved. The other popular models are recurrent neural networks for sequence modelling. These models can be understood as a generalized version of hidden Markov models. They are used for language modelling such as text generation or language translation. For the latter example, sequence-to-sequence models, a subclass of recurrent neural networks, are well known. Here a sequence of, for example, English is fed into a neural network to output a hidden state that contains all the information of the sequence, i.e., a sufficient statistic. This hidden state is then fed into another model that generates the sequence with the same meaning but in a different language. This model is superior to others because it can, in theory, deal with differences in grammatical structure such as word order.
In music translation tasks such as optical music recognition or music transcription, we are often faced with the same problem: dependencies need to be 'kept in mind' and later placed at a different position in the sequence. For example, when translating sheet music to a pitch representation, the model needs the capacity to remember the key signature. Additionally, these kinds of models can be trained with less effort, since one only needs the entire song and its sheet music to train, rather than a temporally accurate annotation. However, sequences of music, usually given in a spectral representation, are too high-dimensional to be processed by a recurrent model directly; this is why we propose to learn a low-dimensional feature representation.
This feature representation is learned by a convolutional neural network. The models can be trained jointly and can thus benefit from each other.
The following steps were taken in this example.
For each audio sequence, we compute a magnitude spectrogram 7' with a window 9 size of 46.6 ms (2048 samples at 44.1 kHz) and 50% overlap. We apply an equivalent rectangular bandwidth filterbank of 200 triangular filters from 27.5 Hz to 16 kHz. The entire preprocessing pipeline was realized with Essentia. As input to the model, we split the sequence X = {xt}Tt=1 into excerpts of N frames, with 50% overlap. We coupled a convolutional neural network (CNN) 40 with a sequence-to-sequence model as explained earlier. The CNN 40 represents the automated feature extractor: for each excerpt xt it extracts meaningful information from the spectral representation and compresses it. This low-dimensional representation x't is then the input to the recurrent model 10 that encodes the sequence X' = {x't}Tt=1 into a hidden state H that ideally contains all information of the sequence, much like a sufficient statistic. Subsequently, the information is 'translated' to the symbolic space with another recurrent neural network, the decoder 20, producing the output sequence Y = {yt}Tt=1. The model is illustrated in Figure 13.
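A rough numpy/scipy stand-in for this preprocessing is sketched below; the original pipeline uses Essentia, the logarithmically spaced triangular filters only approximate the equivalent rectangular bandwidth spacing, and the excerpt length of 64 frames is an assumption:

```python
import numpy as np
from scipy.signal import stft

def preprocess(audio, sr=44100, n_fft=2048, n_filters=200,
               f_min=27.5, f_max=16000.0, excerpt_frames=64):
    """Magnitude spectrogram (2048-sample window, 50% overlap) mapped onto
    200 triangular band filters, then cut into excerpts of N frames with 50% overlap."""
    freqs, _, spec = stft(audio, fs=sr, nperseg=n_fft, noverlap=n_fft // 2)
    mag = np.abs(spec)                                     # magnitude spectrogram 7'

    # triangular filterbank; log-spaced centres stand in for the ERB spacing
    centres = np.geomspace(f_min, f_max, n_filters + 2)
    fb = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = centres[i], centres[i + 1], centres[i + 2]
        fb[i] = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                   (hi - freqs) / (hi - mid)), 0.0, None)
    filtered = fb @ mag                                    # (n_filters, n_frames)

    # excerpts xt of N frames with 50% overlap, later fed to the CNN 40
    step = excerpt_frames // 2
    excerpts = [filtered[:, s:s + excerpt_frames]
                for s in range(0, filtered.shape[1] - excerpt_frames + 1, step)]
    return excerpts

excerpts = preprocess(np.random.randn(44100 * 5))          # 5 s of placeholder audio
```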
The input 7' is a series of spectrogram excerpts of N frames. Each frame 9 is passed through the convolutional network. The representation is then passed on to the first RNN 10, which computes a hidden state that can be interpreted as a sufficient statistic. Based on this hidden state, a second RNN 20 generates an output sequence 50. This sequence 50 is twofold: it indicates at which time which pitch is turned on or off. This function is entirely deterministic. It can, furthermore, be trained end-to-end. We use the Adam optimizer (for back propagation with gradient descent optimization) with standard hyperparameter settings. We apply 15% dropout to the inputs and 50% in the convolutional network. We train for several epochs. The model is implemented in Keras, and a single model is trained on an Nvidia GTX Titan X graphics card.
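For completeness, a sketch of how the stated dropout rates could be wired into the excerpt encoder; the layer and kernel sizes are assumptions, only the dropout placement follows the description above:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_excerpt_cnn(n_filters=200, excerpt_frames=64):
    """CNN 40 for one spectrogram excerpt: 15% dropout on the input,
    50% dropout inside the convolutional network, 256-dim feature output."""
    inp = layers.Input(shape=(n_filters, excerpt_frames, 1))
    x = layers.Dropout(0.15)(inp)                       # 15% dropout on the inputs
    x = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Dropout(0.5)(x)                          # 50% dropout in the conv network
    x = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Flatten()(x)
    return tf.keras.Model(inp, layers.Dense(256, activation='relu')(x))

excerpt_cnn = build_excerpt_cnn()
```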
It will also be clear that the above description and drawings are included to illustrate some embodiments of the invention, and not to limit the scope of protection. Starting from this disclosure, many more embodiments will be evident to a skilled person. These embodiments are within the scope of protection and the essence of this invention and are obvious combinations of prior art techniques and the disclosure of this patent.

Claims (15)

1. An optical music recognition (OMR) assembly for converting sheet music that represents a piece of music as a first time series representation into a machine-processable representation of the piece of music that represents at least a pitch and duration of the notes graphically represented in the sheet music and forming the piece of music as a second time series representation, the assembly comprising a data processing assembly and software which, when executed on the data processing assembly:
- receives a machine-processable representation of the sheet music;
- generates time slices of the sheet music by applying a sliding window over the machine-processable representation of at least a part of the sheet music;
- defines a sequence-to-sequence assembly, the sequence-to-sequence assembly comprising:
* a convolutional neural network (CNN) for converting the time slices into third representations of the sheet music, the CNN comprising an input layer and an output layer;
* a first, encoding recurrent neural network (RNN) as an encoder for the series of third representations, for providing a hidden representation of the sheet music, the first RNN provided with an input layer functionally coupled to the output layer of the CNN, and an output layer;
* a second, decoding recurrent neural network (RNN) as a decoder for the hidden representation, for converting the hidden representation into the machine-processable representation, the second RNN provided with an input layer functionally coupled to the output layer of the first RNN, and an output for providing the machine-processable representation.
2. The assembly according to claim 1, wherein the convolutional neural network (CNN), the first, encoding recurrent neural network (RNN) and the second, decoding recurrent neural network (RNN) are functionally coupled to form the sequence-to-sequence assembly, and the sequence-to-sequence assembly has been trained using a training data set of sheet music and known, resulting machine-processable representations.
3. The assembly according to claim 2, wherein during training the first time series representation is provided as input to the convolutional neural network (CNN), the output of the second, decoding recurrent neural network (RNN) is compared with the second time series representation, and parameters of the convolutional neural network (CNN), the first, encoding recurrent neural network (RNN) and the second, decoding recurrent neural network (RNN) are adjusted.
4. The assembly according to any one of the preceding claims, wherein the first RNN comprises a Long Short-Term Memory (LSTM) architecture.
5. The assembly according to any one of the preceding claims, wherein the second RNN comprises a Long Short-Term Memory (LSTM) architecture.
6. The assembly according to any one of the preceding claims, wherein the sheet music is a graphical representation, in particular a digital image.
7. The assembly according to any one of the preceding claims, wherein the second digital representation is a digital file comprising a time series of instructions for operating a device, in particular a music file for operating or controlling a musical instrument, more in particular selected from MIDI and MusicXML.
8. The assembly according to any one of the preceding claims, wherein the time slices comprise digital images obtained by sliding a window over a graphical representation, in particular over a digital image.
9. The assembly according to any one of the preceding claims, wherein the sheet music comprises a graphical representation comprising staff lines and notes; in particular, the sheet music comprises a visual representation on a carrier, in particular a written or printed representation on paper.
10. The assembly according to any one of the preceding claims, wherein the convolutional neural network, the first, encoding recurrent neural network and the second, decoding recurrent neural network are trained as the sequence-to-sequence assembly using a training data set comprising sheet music samples and, for each sheet music sample, a resulting digital representation.
11. A method for converting sheet music into a digital representation using the OMR assembly according to any one of claims 1-10, comprising:
- obtaining a digital image of at least a part of the sheet music;
- applying a sliding window over the digital image of the sheet music, the sliding window having a time width and a stride that provide the time slices as a time series of partially overlapping input images of at least part of the sheet music;
- applying the convolutional neural network (CNN) to the input images for converting the input images into a time series of numerical representations of the input images;
- applying the first, encoding recurrent neural network (RNN) as the encoder to the time series of numerical representations for providing a hidden output data set;
- applying the second, decoding recurrent neural network (RNN) as the decoder to the hidden output data set for converting the hidden output data into the digital representation representing at least a pitch and duration of the notes that are graphically represented in the sheet music and that form the piece of music.
12. An assembly for producing an optical music recognition (OMR) assembly for converting sheet music into a digital representation, the sheet music comprising a graphical, time series representation of notes forming a piece of music, the OMR assembly comprising a data processing device and software which, when running on the data processor:
- provides a training data set comprising sheet music samples and, for each sheet music sample, a resulting digital representation;
- provides a neural network assembly comprising:
(a) a convolutional neural network (CNN) adapted to receive training input images and to convert the input images into a series of numerical representations of the input images;
(b) a first, encoding recurrent neural network (RNN) as an encoder adapted to receive the series of numerical representations for providing a hidden output data set;
(c) a second, decoding recurrent neural network (RNN) as a decoder adapted to receive the hidden output data set for converting the hidden output data into the digital representation representing at least the pitch and duration of the notes that are graphically represented in the sheet music and that form the piece of music;
- trains the neural network assembly using the training data set.
13. A method for producing the assembly according to any one of the preceding claims, wherein:
- a training data set is provided, the training data set comprising first time series representations of signals, each with a resulting second time series representation of each signal;
- the convolutional neural network, the first recurrent neural network and the second recurrent neural network are provided, the neural networks being coupled;
- the time slices are provided for converting the time slices into a series of third representations of the first time series representation;
- a first, trained recurrent neural network (RNN) is applied as an encoder to the series of third representations, input as a single data input, for providing a hidden representation of the first time series representation;
- a second, trained, decoding recurrent neural network (RNN) is applied as a decoder to the hidden representation, input as a single data input, for converting the hidden representation into a calculated second time series representation of the signal;
- parameters are adjusted in at least one selected from the convolutional neural network, the first, trained recurrent neural network (RNN) and the second, trained, decoding recurrent neural network (RNN), on the basis of a difference between the resulting second time series representation of the signal and the calculated second time series representation of each signal.
14. The method according to claim 13, wherein the neural networks are trained using back propagation of the training data set.
15. The method according to claim 14, wherein the parameters in the neural networks are adjusted on the basis of gradient descent optimization.
-o-o-o-o-o-
NL2018758A 2017-04-20 2017-04-20 Optical music recognition (OMR) assembly for converting sheet music NL2018758B1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
NL2018758A NL2018758B1 (en) 2017-04-20 2017-04-20 Optical music recognition (OMR) assembly for converting sheet music
PCT/NL2018/050250 WO2018194456A1 (en) 2017-04-20 2018-04-20 Optical music recognition omr : converting sheet music to a digital format

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
NL2018758A NL2018758B1 (en) 2017-04-20 2017-04-20 Optical music recognition (OMR) assembly for converting sheet music

Publications (1)

Publication Number Publication Date
NL2018758B1 true NL2018758B1 (en) 2018-11-05

Family

ID=59521604

Family Applications (1)

Application Number Title Priority Date Filing Date
NL2018758A NL2018758B1 (en) 2017-04-20 2017-04-20 Optical music recognition (OMR) assembly for converting sheet music

Country Status (2)

Country Link
NL (1) NL2018758B1 (en)
WO (1) WO2018194456A1 (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109448683A (en) * 2018-11-12 2019-03-08 平安科技(深圳)有限公司 Music generating method and device neural network based
CN109801645B (en) * 2019-01-21 2021-11-26 深圳蜜蜂云科技有限公司 Musical tone recognition method
CN110189800B (en) * 2019-05-06 2021-03-30 浙江大学 Furnace oxygen content soft measurement modeling method based on multi-granularity cascade cyclic neural network
CN110442706B (en) * 2019-07-17 2023-02-03 华南师范大学 Text abstract generation method, system, equipment and storage medium
CN110580458A (en) * 2019-08-25 2019-12-17 天津大学 music score image recognition method combining multi-scale residual error type CNN and SRU
CN110852181A (en) * 2019-10-18 2020-02-28 天津大学 Piano music score difficulty identification method based on attention mechanism convolutional neural network
CN111063327A (en) * 2019-12-30 2020-04-24 咪咕文化科技有限公司 Audio processing method and device, electronic equipment and storage medium
US11620475B2 (en) 2020-03-25 2023-04-04 Ford Global Technologies, Llc Domain translation network for performing image translation
CN111554255B (en) * 2020-04-21 2023-02-14 华南理工大学 MIDI playing style automatic conversion system based on recurrent neural network
CN112669796A (en) * 2020-12-29 2021-04-16 西交利物浦大学 Method and device for converting music into music book based on artificial intelligence
CN113066456B (en) * 2021-03-17 2023-09-29 平安科技(深圳)有限公司 Method, device, equipment and storage medium for generating melody based on Berlin noise
CN113065432A (en) * 2021-03-23 2021-07-02 内蒙古工业大学 Handwritten Mongolian recognition method based on data enhancement and ECA-Net
CN112951239B (en) * 2021-03-24 2023-07-28 平安科技(深圳)有限公司 Buddha music generation method, device, equipment and storage medium based on attention model
CN112906872B (en) * 2021-03-26 2023-08-15 平安科技(深圳)有限公司 Method, device, equipment and storage medium for generating conversion of music score into sound spectrum
CN113299318B (en) * 2021-05-24 2024-02-23 百果园技术(新加坡)有限公司 Audio beat detection method and device, computer equipment and storage medium
CN113539294A (en) * 2021-05-31 2021-10-22 河北工业大学 Method for collecting and identifying sound of abnormal state of live pig
US11923899B2 (en) 2021-12-01 2024-03-05 Hewlett Packard Enterprise Development Lp Proactive wavelength synchronization
CN114171053B (en) * 2021-12-20 2024-04-05 Oppo广东移动通信有限公司 Training method of neural network, audio separation method, device and equipment
CN114550675A (en) * 2022-03-01 2022-05-27 哈尔滨理工大学 Piano transcription method based on CNN-Bi-LSTM network
CN114580279B (en) * 2022-03-02 2024-05-31 广西大学 Low-orbit satellite communication self-adaptive coding method based on LSTM
CN115146649A (en) * 2022-06-24 2022-10-04 厦门大学 Method and device for identifying music book on drum

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPP547898A0 (en) 1998-08-26 1998-09-17 Canon Kabushiki Kaisha System and method for automatic music generation
US8494257B2 (en) 2008-02-13 2013-07-23 Museami, Inc. Music score deconstruction
US9123315B1 (en) 2014-06-30 2015-09-01 William R Bachand Systems and methods for transcoding music notation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2871204B2 (en) * 1991-08-21 1999-03-17 日本電気株式会社 Music transcription device
WO2008101126A1 (en) * 2007-02-14 2008-08-21 Museami, Inc. Web portal for distributed audio file editing
US20160099010A1 (en) * 2014-10-03 2016-04-07 Google Inc. Convolutional, long short-term memory, fully connected deep neural networks
CN105678300A (en) * 2015-12-30 2016-06-15 成都数联铭品科技有限公司 Complex image and text sequence identification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHASE DWAYNE CARTHEN: "Rewind: A Music Transcription Method", 1 May 2016 (2016-05-01), Reno, USA, pages 1 - 54, XP055433730, Retrieved from the Internet <URL:https://media.proquest.com/media/pq/classic/doc/4136250351/fmt/ai/rep/NPDF?cit:auth=Carthen,+Chase+Dwayne&cit:title=Rewind:+A+Music+Transcription+Method&cit:pub=ProQuest+Dissertations+and+Theses&cit:vol=&cit:iss=&cit:pg=&cit:date=2016&ic=true&cit:prod=ProQuest+Dissertations+&+Theses+Global:+The+Scie> [retrieved on 20171211] *

Also Published As

Publication number Publication date
WO2018194456A1 (en) 2018-10-25

Similar Documents

Publication Publication Date Title
NL2018758B1 (en) Optical music recognition (OMR) assembly for converting sheet music
Benetos et al. Automatic music transcription: An overview
Kong et al. High-resolution piano transcription with pedals by regressing onset and offset times
Briot et al. Deep learning techniques for music generation--a survey
Humphrey et al. Moving beyond feature design: Deep architectures and automatic feature learning in music informatics.
Huzaifah et al. Deep generative models for musical audio synthesis
Dongmei Design of English text-to-speech conversion algorithm based on machine learning
Kim Singing voice analysis/synthesis
Sarkar et al. Raga identification from Hindustani classical music signal using compositional properties
Van Balen Audio description and corpus analysis of popular music
CN117216008A (en) Knowledge graph-based archive multi-mode intelligent compiling method and system
Ranjan et al. Using a bi-directional lstm model with attention mechanism trained on midi data for generating unique music
Jadhav et al. Transfer Learning for Audio Waveform to Guitar Chord Spectrograms Using the Convolution Neural Network
Kher Music Composer Recognition from MIDI Representation using Deep Learning and N-gram Based Methods
Kim Automatic Music Transcription in the Deep Learning Era: Perspectives on Generative Neural Networks
El Achkar Music encoding and deep learning for music transcription and classification based on visually represented audio features
Pauwels Exploiting prior knowledge during automatic key and chord estimation from musical audio
Simonetta Music interpretation analysis. A multimodal approach to score-informed resynthesis of piano recordings
Shi Computational Analysis and Modeling of Expressive Timing in Music Performance
Wang Unsupervised Bayesian Musical Key and Chord Recognition
Ebert Transcribing Solo Piano Performances from Audio to MIDI Using a Neural Network
Zhang Harmonizing Innovation: Chorus Work Creation and Innovation through Deep Learning and Generative Models
Le et al. Real-time Sound Visualization via Multidimensional Clustering and Projections
Xie Automatic symbolic melody generation from lyrics
Homenda Breaking accessibility barriers: Computational intelligence in music processing for blind people

Legal Events

Date Code Title Description
MM Lapsed because of non-payment of the annual fee

Effective date: 20200501