CN112528631A - Intelligent accompaniment system based on deep learning algorithm - Google Patents

Intelligent accompaniment system based on deep learning algorithm

Info

Publication number
CN112528631A
Authority
CN
China
Prior art keywords
accompaniment
midi
user
semantic
fragment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011411392.2A
Other languages
Chinese (zh)
Other versions
CN112528631B (en)
Inventor
计紫豪
陈世哲
张振林
高振华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Gujun Education Technology Co ltd
Original Assignee
Shanghai Gujun Education Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Gujun Education Technology Co ltd filed Critical Shanghai Gujun Education Technology Co ltd
Priority to CN202011411392.2A priority Critical patent/CN112528631B/en
Publication of CN112528631A publication Critical patent/CN112528631A/en
Application granted granted Critical
Publication of CN112528631B publication Critical patent/CN112528631B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00Details of electrophonic musical instruments
    • G10H1/36Accompaniment arrangements
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The invention discloses an intelligent accompaniment system based on a deep learning algorithm, which comprises a user side and a server side in communication connection with each other. The user side provides a visual interface for the user, receives midi clips uploaded remotely by the user, and displays and plays the accompaniment clips generated from the midi clips, one accompaniment clip being generated per request. The server side uses an accompaniment generator constructed with a deep learning algorithm to generate an accompaniment clip from the midi clip and sends it to the user side for display. The accompaniment generator stored on the server side can automatically analyze the melody track of the midi clip and generate an accompaniment clip, which is visually displayed and played on the user side for the user to audition and select. Because the accompaniment clips are generated automatically, non-professionals who do not know music theory can also create music, which improves the universality and efficiency of music creation.

Description

Intelligent accompaniment system based on deep learning algorithm
Technical Field
The invention belongs to the field of intelligent music creation, and particularly relates to an intelligent accompaniment UI system based on a deep learning algorithm.
Background
In recent years, with the development of artificial intelligence technology, the creativity of computers in the art field has greatly improved. In the field of computer music, machine learning and deep learning methods have been used to intelligently create musical works comparable to those of human composers. For example, the DeepBach model published by Hadjeres et al. can imitate and generate polyphonic music in the style of Bach, and the deepjazz model published by Ji-Sung Kim et al. can generate multi-instrument music with jazz rhythm patterns and melody lines. Intelligent accompaniment is one application scenario of this computer-aided creative capability and can empower the music creation of any music enthusiast. If the user is a professional musician, intelligent accompaniment can provide rich arranging inspiration and greatly shorten arranging time; if the user is a non-professional musician, the intelligent accompaniment function can directly create the user's first complete original song.
Intelligent accompaniment is a promising field, and in recent years more and more academic papers have studied it intensively, such as MuseGAN by Hao-Wen Dong et al. and MuseNet by OpenAI. None of them, however, provides a UI system that meets the needs of ordinary users. MuseGAN only places a few audio demos on its project homepage, without any means of user interaction. MuseNet offers a simple UI with selectable parameters such as music style, intro, instruments and length, but it cannot let the user upload a melody and generate an accompaniment for it from scratch. Similar attempts have been made by commercial software. The Captain Plugins suite provides a series of intelligent composing and arranging plug-ins, such as Captain Chords and Captain Melody, with which the user can automatically generate a melody or chord progression by adjusting the parameters specified in the plug-in. However, this approach is still very demanding for non-professional musicians, since all the parameters are music-related and only users with a composition background can appreciate their meaning. Second, the number of parameters to adjust is large, and the user must experiment repeatedly to obtain a satisfactory clip. The greatest defect of these plug-ins, however, is that they are not linked to one another: the user must ensure the harmony between the generated melody and chords himself to obtain a complete piece, and the plug-ins cannot analyze an existing melody clip and fit it with a proper instrumental arrangement.
In summary, there is no UI system currently available on the market that can intelligently generate an accompaniment for a given melody.
Disclosure of Invention
In view of the above, the present invention provides an intelligent accompaniment system based on a deep learning algorithm, which can interact with the user and automatically generate accompaniment clips for midi clips uploaded by the user.
In order to achieve the above object, the invention provides the following technical solution:
An intelligent accompaniment system based on a deep learning algorithm comprises a user side and a server side in communication connection with each other. The user side provides a visual interface for the user, receives midi clips uploaded remotely by the user, and displays and plays the accompaniment clips generated from the midi clips, one accompaniment clip being generated per request;
the server side utilizes an accompaniment generator constructed based on the deep learning algorithm to generate accompaniment clips from the midi clips and sends them to the user side for display.
Compared with the prior art, the invention has at least the following beneficial effects:
In the intelligent accompaniment system based on the deep learning algorithm, the accompaniment generator stored on the server side can automatically analyze the melody track of the midi clip and generate an accompaniment clip, which is visually displayed and played on the user side for the user to audition and select. Because the accompaniment clips are generated automatically, non-professionals who do not know music theory can also create music, which improves the universality and efficiency of music creation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic structural diagram of an intelligent accompaniment system based on a deep learning algorithm according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the training framework provided in an embodiment of the present invention;
Fig. 3, Fig. 4 and Fig. 5 are schematic views of the visual interface in three application scenarios of the user side according to an embodiment of the present invention;
Fig. 6 is an eight-bar midi clip provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a schematic structural diagram of an intelligent accompaniment system based on a deep learning algorithm according to an embodiment of the present invention. As shown in fig. 1, the intelligent accompaniment system comprises a user side and a server side in communication connection with each other. The user side provides a visual interface for the user, receives midi clips uploaded remotely by the user, and displays and plays the accompaniment clips generated from the midi clips (one clip per request); the server side uses an accompaniment generator constructed based on a deep learning algorithm to generate an accompaniment clip from the midi clip and sends it to the user side for display.
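To make the user-side/server-side round trip concrete, the following is a minimal sketch of how such a request could be served over HTTP. It assumes a Flask server; the endpoint name /generate_accompaniment, the payload fields and the placeholder generator function are illustrative assumptions, not details from the patent.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_accompaniment_generator(midi_bytes: bytes) -> bytes:
    # Placeholder standing in for the deep-learning accompaniment generator
    # described later in this document.
    return midi_bytes

@app.route("/generate_accompaniment", methods=["POST"])  # hypothetical endpoint name
def generate_accompaniment():
    midi_clip = request.files["midi_clip"].read()            # midi clip uploaded remotely
    accompaniment = run_accompaniment_generator(midi_clip)   # server-side generation
    return jsonify({"accompaniment_midi": accompaniment.hex()})  # returned for display/playback

if __name__ == "__main__":
    app.run()
```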
Specifically, the user side provides a midi clip adjustment operation and an intelligent accompaniment trigger operation for the user through the visual interface; the user adjusts the start time of the midi melody through the midi clip adjustment operation and issues a request to generate an intelligent accompaniment for the midi clip through the intelligent accompaniment trigger operation.
The user side further provides accompaniment clip selection and playing operations and a deletion operation for the user through the visual interface; the user selects and plays accompaniment clips through the selection and playing operations and deletes accompaniment clips through the deletion operation.
The user side also provides an accompaniment clip audition operation for the user through the visual interface; the user auditions the selected accompaniment clip through this operation, the playing time and total duration are displayed during audition, and the accompaniment clip returns to the unplayed state after the audition is stopped or finished.
The server side comprises an accompaniment generator for generating accompaniment clips from midi clips. The construction method of the accompaniment generator comprises the following steps:
(a) Constructing a sample set, wherein each sample consists of a midi clip and an accompaniment clip. The specific construction process is as follows: MIDI data sets with genre labels are collected from the Internet using a crawler, the genres comprising pop, country and jazz; after melody extraction, track compression, data filtering, whole-song segmentation and chord identification, MIDI clips are obtained and shuffled, yielding the sample sets corresponding to each genre label.
(b) Constructing a training framework, which comprises an encoding unit, a base unit, a semantic representation unit and a domain adversarial unit, as shown in FIG. 2.
The encoding unit encodes the input MIDI clips into tokens using a MuMIDI model whose network parameters are already determined; the MuMIDI model is therefore not optimized during training and is not shown in fig. 2. It encodes the discrete symbolic-music sequence into tokens. The tokens are divided into target tokens and conditional tokens according to the task: the only difference is that the conditional tokens are known while the target tokens are unknown and are predicted from the conditional tokens. In the invention, the tokens corresponding to the known midi clip are used as the conditional tokens, and the tokens corresponding to the accompaniment clip to be generated are used as the target tokens. The dimensions of the encoding include: bar ordinal, note position, track ordinal, note attributes (pitch, duration, loudness), chord and meta attributes; this encoding allows the model to learn the relative dependence of notes across different tracks, thereby improving the overall harmony of the generated music. A "genre" symbol is added to the meta attributes to encode the genre information of the data set. Specifically, only three genres are considered: pop, country and jazz, which are assigned the genre symbols 0, 1 and 2 respectively. This information is encoded as one of the meta attributes.
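The multi-dimensional token described above can be pictured as a record with one field per encoding dimension. The sketch below is an illustrative assumption, not the MuMIDI implementation; the field names and value ranges are hypothetical, but the genre meta attribute uses the symbols 0, 1 and 2 for pop, country and jazz as stated in the text.

```python
from dataclasses import dataclass

GENRE_IDS = {"pop": 0, "country": 1, "jazz": 2}  # genre symbols 0, 1, 2 from the description

@dataclass
class NoteToken:                # hypothetical container; not the MuMIDI data structure
    bar: int                    # bar ordinal
    position: int               # note position within the bar
    track: int                  # track ordinal (e.g. 0 = melody, 1 = piano, ...)
    pitch: int                  # note attribute: MIDI pitch 0-127
    duration: int               # note attribute: quantized duration index
    loudness: int               # note attribute: loudness / MIDI velocity
    chord: int                  # chord class index
    genre: int                  # meta attribute: 0 = pop, 1 = country, 2 = jazz

def encode_note(bar, position, track, pitch, duration, loudness, chord, genre_name):
    """Pack one symbolic-music event into a multi-dimensional token."""
    return NoteToken(bar, position, track, pitch, duration, loudness, chord,
                     GENRE_IDS[genre_name])

# Example: the first note of a pop melody clip
print(encode_note(bar=0, position=0, track=0, pitch=60, duration=4,
                  loudness=90, chord=0, genre_name="pop"))
```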
The semantic representation unit comprises a first representation branch and a second representation branch. The first representation branch performs semantic representation on the input conditional tokens and outputs the corresponding global semantic probability; specifically, it comprises a recursive encoder, a hidden layer for the conditional tokens and a linear layer connected in sequence, where the input of the recursive encoder is the conditional tokens and the output of the linear layer is the global semantic probability. The second representation branch performs semantic representation on the input target tokens and outputs the corresponding global semantic probability; specifically, it comprises a recursive reference encoder and a multi-head semantic attention layer connected in sequence, where the input of the recursive reference encoder is the target tokens and the output of the multi-head semantic attention layer is the global semantic probability. The recursive reference encoder has the same structure as the recursive encoder but independent hyper-parameters and gradients, and the multi-head semantic attention layer extracts the semantic information contained in the target tokens that have passed through the recursive reference encoder and finally outputs a global semantic logit value. Although there is no target-token input at the inference stage, it is still desirable to retain semantic information, so the conditional tokens are encoded with a linear layer whose output dimension is the same as that of the multi-head semantic attention layer.
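A rough PyTorch sketch of the two representation branches follows. The use of a GRU as the recursive (reference) encoder, the number of learned semantic queries and the layer sizes are assumptions made for illustration; only the overall structure (encoder plus linear layer versus reference encoder plus multi-head semantic attention, with matching output dimensions) follows the description.

```python
import torch
import torch.nn as nn

HIDDEN = 256          # hidden size quoted later in the description
NUM_SEMANTICS = 128   # assumed dimension of the global semantic output

class ConditionalBranch(nn.Module):
    """First branch: recursive encoder -> conditional-token hidden states -> linear layer."""
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, HIDDEN)
        self.encoder = nn.GRU(HIDDEN, HIDDEN, batch_first=True)   # assumed recursive encoder
        self.linear = nn.Linear(HIDDEN, NUM_SEMANTICS)            # matches the other branch's output dim

    def forward(self, cond_tokens):             # cond_tokens: (batch, seq) of token ids
        h, _ = self.encoder(self.embed(cond_tokens))
        return self.linear(h[:, -1])            # global semantic logits

class TargetBranch(nn.Module):
    """Second branch: recursive reference encoder -> multi-head semantic attention layer."""
    def __init__(self, vocab_size, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, HIDDEN)
        self.ref_encoder = nn.GRU(HIDDEN, HIDDEN, batch_first=True)  # same structure, own parameters
        self.semantic_queries = nn.Parameter(torch.randn(NUM_SEMANTICS, HIDDEN))
        self.attn = nn.MultiheadAttention(HIDDEN, heads, batch_first=True)
        self.out = nn.Linear(HIDDEN, NUM_SEMANTICS)

    def forward(self, target_tokens):           # target_tokens: (batch, seq) of token ids
        h, _ = self.ref_encoder(self.embed(target_tokens))
        q = self.semantic_queries.unsqueeze(0).expand(h.size(0), -1, -1)
        pooled, _ = self.attn(q, h, h)          # extract semantic information from the targets
        return self.out(pooled.mean(dim=1))     # global semantic logits
```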
The base unit decodes the encoded conditional tokens together with the global semantic probability corresponding to the target tokens and outputs the language model probability. Specifically, the base unit comprises a recursive encoder, a hidden layer for the conditional tokens, a fusion operation and a recursive decoder connected in sequence; the input of the recursive encoder is the conditional tokens, the fusion operation fuses the hidden-layer output of the conditional tokens with the global semantic probability corresponding to the target tokens, and the fusion result is input to the recursive decoder, which outputs the language model probability after decoding. The base unit also adds a recursion mechanism, meaning that the encoder can cache the token hidden states of the previous time step and concatenate them with the token hidden states of the current time step.
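The base unit's data flow can be sketched in the same style. Fusing by broadcast addition and caching the encoder hidden state as the recursion memory are simplifying assumptions; the patent does not specify these details.

```python
import torch
import torch.nn as nn

class BaseUnit(nn.Module):
    """Encode conditional tokens, fuse with the targets' global semantics, decode to LM logits."""
    def __init__(self, vocab_size, hidden=256, num_semantics=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.sem_proj = nn.Linear(num_semantics, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.lm_head = nn.Linear(hidden, vocab_size)
        self.memory = None                        # cached hidden state from the previous step

    def forward(self, cond_tokens, global_semantics):
        enc, mem = self.encoder(self.embed(cond_tokens), self.memory)
        self.memory = mem.detach()                # recursion: keep the last step's hidden state
        fused = enc + self.sem_proj(global_semantics).unsqueeze(1)   # assumed fusion by addition
        dec, _ = self.decoder(fused)
        return self.lm_head(dec)                  # language-model logits over the token vocabulary
```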
The MIDI accompaniment usually contains much genre-related semantic information, which causes a conflict at the inference stage between the genre meta attribute and the genre information in the predicted global semantic logits, resulting in confusion in genre prediction. A domain adversarial unit is therefore designed to resolve this confusion: it performs feature mapping on the global semantic probability corresponding to the input target tokens and outputs a domain probability value. Specifically, the domain adversarial unit comprises a gradient reversal layer, a linear layer, a one-dimensional batch normalization layer, a ReLU activation function and a linear layer connected in sequence; the input of the gradient reversal layer is the global semantic probability corresponding to the target tokens, and the output of the last linear layer is the domain probability value.
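A gradient reversal layer is a standard construction in domain-adversarial training and can be written as a custom autograd function, as in the hedged sketch below; the layer widths and semantic dimension are assumptions, and only the layer ordering follows the description.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None       # reverse the gradient flowing to the semantics

class DomainAdversarialUnit(nn.Module):
    def __init__(self, num_semantics=128, hidden=256, num_genres=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_semantics, hidden),
            nn.BatchNorm1d(hidden),               # one-dimensional batch normalization layer
            nn.ReLU(),
            nn.Linear(hidden, num_genres),        # domain (genre) logits
        )

    def forward(self, global_semantics):          # (batch, num_semantics)
        return self.net(GradReverse.apply(global_semantics))
```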
Specifically, in the training framework, the number of attention layers is 4 with 8 attention heads, the number of encoder layers is 4 with 8 encoder heads, and the number of decoder layers is 8 with 8 decoder heads; the encoder parameters of the base unit are the same as those of the encoder in the semantic representation unit, and the two share gradients. All hidden layer sizes are 256, the token embedding dimension is 256, and the length and memory length of the training input tokens are both 512.
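For reference, these hyper-parameters can be collected into a single configuration object; this is a convenience sketch, and the key names are not from the patent.

```python
TRAIN_CONFIG = {
    "attention_layers": 4, "attention_heads": 8,
    "encoder_layers": 4,   "encoder_heads": 8,
    "decoder_layers": 8,   "decoder_heads": 8,
    "hidden_size": 256,
    "token_embedding_dim": 256,
    "input_length": 512,
    "memory_length": 512,
}
```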
(c) Constructing the training loss function of the training framework, which comprises a global semantic loss function, a genre loss function and a language model loss function. The global semantic loss function is the cross entropy between the two global semantic probabilities output by the semantic representation unit; the genre loss function is the cross entropy between the domain probability value output by the domain adversarial unit and the genre label of the target tokens; and the language model loss function is calculated from the language model probability output by the base unit.
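The three loss terms can be combined as sketched below. Equal weighting of the terms and the use of soft-target cross entropy for the global semantic loss (available in PyTorch 1.10 and later) are assumptions; the patent only states which quantities enter each cross entropy.

```python
import torch.nn.functional as F

def training_loss(cond_semantics, target_semantics, domain_logits, genre_labels,
                  lm_logits, target_tokens):
    # Global semantic loss: cross entropy between the two branches' global semantic outputs
    # (soft probability targets require PyTorch >= 1.10).
    sem_loss = F.cross_entropy(cond_semantics, target_semantics.softmax(dim=-1))
    # Genre loss: cross entropy between the domain probability values and the genre labels.
    genre_loss = F.cross_entropy(domain_logits, genre_labels)
    # Language-model loss computed from the base unit's output over the target tokens.
    lm_loss = F.cross_entropy(lm_logits.transpose(1, 2), target_tokens)
    return sem_loss + genre_loss + lm_loss        # equal weighting is an assumption
```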
(d) Inputting the sample data into the training framework and optimizing the network parameters of the training framework with convergence of the training loss function as the objective. After training is finished, the encoding unit with determined parameters, the representation branch corresponding to the target tokens in the semantic representation unit, and the base unit are extracted to form the accompaniment generator.
After the accompaniment generator is constructed, it is deployed on the server side. When the accompaniment generator generates an accompaniment clip from a midi clip, the midi clip is encoded by the encoding unit into conditional tokens, which are input to the base unit; meanwhile, the global semantic probability value corresponding to the chosen genre label, determined by optimization during training of the training framework, is input as the target tokens to the representation branch corresponding to the target tokens in the semantic representation unit. After processing by that representation branch and the base unit, the output language model probability is taken as the accompaniment clip.
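Put together, the inference path reads roughly as follows; every function and argument name here is illustrative, standing in for the components named in the text, and greedy argmax decoding is an assumption.

```python
import torch

def generate_accompaniment_clip(midi_clip, genre_id, encoding_unit,
                                genre_semantics, base_unit):
    """Hedged sketch of the deployed generator; all names here are illustrative."""
    cond_tokens = encoding_unit(midi_clip)                 # encoding unit -> conditional tokens
    global_semantics = genre_semantics[genre_id].unsqueeze(0)  # value fixed during training
    lm_logits = base_unit(cond_tokens, global_semantics)   # language-model probabilities
    return torch.argmax(lm_logits, dim=-1)                 # decoded accompaniment tokens
```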
In an embodiment, the server side further includes a midi clip judgment module, which judges the length of the received midi clip; when the midi clip exceeds 8 bars, a prompt message is generated and sent to the user side for display.
In an embodiment, the server side further comprises an accompaniment clip judgment module for judging the length of the generated accompaniment clip. When the length of the accompaniment clip is smaller than the minimum accompaniment-length threshold, the accompaniment clip is discarded and the accompaniment generator regenerates it from the midi clip;
when the length of the accompaniment clip is greater than the maximum accompaniment-length threshold, a waiting prompt is generated and sent to the user side for display.
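The two judgment modules amount to simple length checks, sketched below with the 8-bar limit from the midi clip check and the 5-second and 25-second thresholds quoted in the embodiment later in this description; the prompt strings follow that embodiment.

```python
def check_midi_clip(num_bars: int):
    """Return a user-facing prompt if the clip is too long, otherwise None."""
    if num_bars > 8:
        return "The MIDI clip you selected exceeds 8 bars, please re-select!"
    return None

def check_accompaniment(duration_s: float, min_s: float = 5.0, max_s: float = 25.0):
    """Map the generated clip's duration to the server's follow-up action."""
    if duration_s < min_s:
        return "regenerate"   # discard silently and let the generator try again
    if duration_s > max_s:
        return "Generating a richly arranged clip for you, please wait patiently!"
    return "ok"
```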
In an embodiment, each generated accompaniment clip contains at most 5 instrument tracks, namely piano, drum, bass, guitar and strings, and every generated accompaniment clip is stored in a list.
The following describes in detail the specific application process of the intelligent accompaniment system based on the deep learning algorithm when it is carried on a fully independently developed domestic digital music workstation:
(1) The user opens the track editor.
After the user opens the music workstation, the main music-editing interface is displayed, as shown in fig. 3. The box area represents the midi clip entered by the user, which is placed in a track and can be slid left and right to select a suitable melody start time.
(2) The user selects the melody track for which an accompaniment is to be generated, and the intelligent accompaniment button is displayed.
When the user moves the mouse over a clip, an intelligent accompaniment button floats over the clip; when the mouse moves away, the intelligent accompaniment button is hidden.
(3) The user clicks the intelligent accompaniment button, and a list of accompaniment results is generated and displayed.
By clicking the intelligent accompaniment button on the midi clip, the user triggers a request to generate accompaniment clips, which is sent to the server side; the accompaniment generator on the server side generates an accompaniment clip based on the request and the selected midi clip, containing at most five instruments, namely piano, drum, bass, guitar and strings. In this digital music workstation, the human-computer interaction takes the following form: after the intelligent accompaniment button is clicked, the accompaniment results are shown on the right side of the editor:
a) One accompaniment result is generated per request.
b) No accompaniment is generated for a selected MIDI clip longer than 8 bars; if it is longer, the system prompts: "The MIDI clip you selected exceeds 8 bars, please re-select!"
c) If the duration of the generated accompaniment is less than 5 seconds, it is generally a low-quality clip; the algorithm automatically regenerates it and the system shows no prompt.
d) If the duration of the generated accompaniment is longer than 25 seconds, it is usually a clip with more notes and a richer arrangement, and the system prompts: "Generating a richly arranged clip for you, please wait patiently!"
e) Each accompaniment result is a MIDI file containing at most 5 instrument tracks.
f) Each generated accompaniment result is stored in a list.
g) Clicking the delete button deletes a single accompaniment; clicking the clear-all button clears all accompaniments.
Fig. 4 shows a result list in which five accompaniments have been generated, the first four generated from MIDI clip 1 and the fifth from MIDI clip 2. Fig. 5 is a visual interface diagram of the accompaniment-deletion scenario, in which a single accompaniment clip can be deleted or all accompaniment clips can be cleared.
(4) The user can click an accompaniment to audition it; if the user is not satisfied, the intelligent accompaniment button can be clicked again to regenerate.
After the intelligent accompaniment button is clicked, the accompaniment results are shown on the right side of the editor:
a) Clicking the play button on the right auditions a single accompaniment.
b) During audition, the button changes to a stop button, and the playing time and total duration are displayed.
c) When the stop button is clicked, playback stops and the clip returns to the unplayed state.
d) After the audition ends, the clip returns to the unplayed state.
e) During audition, if the play button of another accompaniment is clicked, the current accompaniment stops and the new accompaniment starts to play.
(5) As shown in fig. 6, the user drags the accompaniment clip he finally likes into the main editing interface.
a) The accompaniment is dragged into the editor; after the mouse button is released, the 5 instrument clips contained in the accompaniment are added in turn to the 5 MIDI tracks from the current track downward.
b) When adding a MIDI clip to a track:
when the MIDI track does not exist, a corresponding MIDI track is automatically created;
when the track is a MIDI track, the clip is added to that MIDI track;
when the track is an audio track, it is skipped and the clip is added to the next track.
c) The positions of the added MIDI clips are automatically aligned with the start position of the MIDI clip selected for generating the accompaniment.
The above embodiments are intended to illustrate the technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments of the present invention and are not intended to limit the present invention; any modification, supplement or equivalent replacement made within the scope of the principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. An intelligent accompaniment system based on a deep learning algorithm, comprising a user side and a server side in communication connection with each other, characterized in that the user side provides a visual interface for the user, receives midi clips uploaded remotely by the user, and displays and plays the accompaniment clip generated based on the midi clips, one accompaniment clip being generated at a time;
the server side utilizes an accompaniment generator constructed based on the deep learning algorithm to generate accompaniment clips according to the midi clips and sends the accompaniment clips to the user side for display.
2. The intelligent accompaniment system based on the deep learning algorithm according to claim 1, wherein the user side provides the user with a midi clip adjustment operation and an intelligent accompaniment trigger operation through the visual interface; the user adjusts the start time of the midi melody through the midi clip adjustment operation and issues a request to generate an intelligent accompaniment for the midi clip through the intelligent accompaniment trigger operation.
3. The intelligent accompaniment system based on the deep learning algorithm according to claim 1, wherein the user side provides accompaniment clip selection and playing operations and a deletion operation for the user through the visual interface; the user selects and plays accompaniment clips through the selection and playing operations and deletes accompaniment clips through the deletion operation.
4. The intelligent accompaniment system based on the deep learning algorithm according to claim 1, wherein the user side provides an accompaniment clip audition operation for the user through the visual interface; the user auditions the selected accompaniment clip through the audition operation, the playing time and total duration are displayed during audition, and the accompaniment clip returns to the unplayed state after the audition is stopped or finished.
5. The intelligent accompaniment system based on the deep learning algorithm according to claim 1, wherein the accompaniment generator is constructed by:
constructing samples, each consisting of a midi clip and an accompaniment clip;
constructing a training framework comprising an encoding unit, a base unit, a semantic representation unit and a domain adversarial unit, wherein the encoding unit encodes the samples, encoding the midi clip into conditional tokens and the accompaniment clip into target tokens; the semantic representation unit comprises two representation branches, which perform semantic extraction on the input conditional tokens and target tokens respectively and output the two corresponding global semantic probabilities; the base unit decodes the encoded conditional tokens together with the global semantic probability corresponding to the target tokens and outputs the language model probability; the domain adversarial unit performs feature mapping on the global semantic probability corresponding to the input target tokens and outputs a domain probability value;
constructing the training loss function of the training framework, which comprises a global semantic loss function, a genre loss function and a language model loss function, wherein the global semantic loss function is the cross entropy between the two global semantic probabilities output by the semantic representation unit, the genre loss function is the cross entropy between the domain probability value output by the domain adversarial unit and the genre label of the target tokens, and the language model loss function is calculated from the language model probability output by the base unit;
inputting sample data into the training framework and optimizing the network parameters of the training framework with convergence of the training loss function as the objective; after training is finished, extracting the encoding unit with determined parameters, the representation branch corresponding to the target tokens in the semantic representation unit, and the base unit to form the accompaniment generator.
6. The intelligent accompaniment system based on the deep learning algorithm according to claim 5, wherein the encoding unit employs a MuMIDI model;
the first representation branch included in the semantic representation unit performs semantic representation on the input conditional tokens and outputs the corresponding global semantic probability, and comprises a recursive encoder, a hidden layer for the conditional tokens and a linear layer connected in sequence, wherein the input of the recursive encoder is the conditional tokens and the output of the linear layer is the global semantic probability;
the second representation branch included in the semantic representation unit performs semantic representation on the input target tokens and outputs the corresponding global semantic probability, and comprises a recursive reference encoder and a multi-head semantic attention layer connected in sequence, wherein the input of the recursive reference encoder is the target tokens and the output of the multi-head semantic attention layer is the global semantic probability;
the domain adversarial unit comprises a gradient reversal layer, a linear layer, a one-dimensional batch normalization layer, a ReLU activation function and a linear layer connected in sequence, wherein the input of the gradient reversal layer is the global semantic probability corresponding to the target tokens and the output of the last linear layer is the domain probability value;
the base unit comprises a recursive encoder, a hidden layer for the conditional tokens, a fusion operation and a recursive decoder connected in sequence, wherein the input of the recursive encoder is the conditional tokens, the fusion operation fuses the hidden-layer output of the conditional tokens with the global semantic probability corresponding to the target tokens, and the fusion result is input to the recursive decoder, which outputs the language model probability after decoding.
7. The intelligent accompaniment system based on the deep learning algorithm according to claim 5, wherein when the accompaniment generator generates an accompaniment clip from a midi clip, the midi clip is encoded by the encoding unit into conditional tokens, which are input to the base unit; meanwhile, the global semantic probability value corresponding to the chosen genre label, determined by optimization during training of the training framework, is input as the target tokens to the representation branch corresponding to the target tokens in the semantic representation unit; after processing by that representation branch and the base unit, the output language model probability is taken as the accompaniment clip.
8. The intelligent accompaniment system based on the deep learning algorithm according to claim 1, wherein the server side further comprises a midi clip judgment module, which judges the length of the received midi clip and, when the midi clip exceeds 8 bars, generates a prompt message and sends it to the user side for display.
9. The intelligent accompaniment system based on the deep learning algorithm according to claim 1, wherein the server side further comprises an accompaniment clip judgment module for judging the length of the generated accompaniment clip; when the length of the accompaniment clip is smaller than the minimum accompaniment-length threshold, the accompaniment clip is discarded and the accompaniment generator regenerates it from the midi clip;
when the length of the accompaniment clip is greater than the maximum accompaniment-length threshold, a waiting prompt is generated and sent to the user side for display.
10. The intelligent accompaniment system based on the deep learning algorithm according to any one of claims 1 to 9, wherein each generated accompaniment clip contains at most 5 instrument tracks, namely piano, drum, bass, guitar and strings, and each generated accompaniment clip is stored in a list.
CN202011411392.2A 2020-12-03 2020-12-03 Intelligent accompaniment system based on deep learning algorithm Active CN112528631B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011411392.2A CN112528631B (en) 2020-12-03 2020-12-03 Intelligent accompaniment system based on deep learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011411392.2A CN112528631B (en) 2020-12-03 2020-12-03 Intelligent accompaniment system based on deep learning algorithm

Publications (2)

Publication Number Publication Date
CN112528631A true CN112528631A (en) 2021-03-19
CN112528631B CN112528631B (en) 2022-08-09

Family

ID=74997720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011411392.2A Active CN112528631B (en) 2020-12-03 2020-12-03 Intelligent accompaniment system based on deep learning algorithm

Country Status (1)

Country Link
CN (1) CN112528631B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301857A (en) * 2016-04-15 2017-10-27 青岛海青科创科技发展有限公司 A kind of method and system to melody automatically with accompaniment
CN109902798A (en) * 2018-05-31 2019-06-18 华为技术有限公司 The training method and device of deep neural network
US20200066240A1 (en) * 2018-08-27 2020-02-27 Artsoft LLC. Method and apparatus for music generation
CN110991649A (en) * 2019-10-28 2020-04-10 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) Deep learning model building method, device, equipment and storage medium
CN111554255A (en) * 2020-04-21 2020-08-18 华南理工大学 MIDI playing style automatic conversion system based on recurrent neural network
CN111653256A (en) * 2020-08-10 2020-09-11 浙江大学 Music accompaniment automatic generation method and system based on coding-decoding network

Also Published As

Publication number Publication date
CN112528631B (en) 2022-08-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant