CN110222226B - Method, device and storage medium for generating rhythm by words based on neural network


Info

Publication number: CN110222226B (application number CN201910307611.3A)
Authority: CN (China)
Prior art keywords: time, lyrics, neural network, rhythm, music
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN110222226A
Inventors: 曹靖康 (Cao Jingkang), 王义文 (Wang Yiwen), 王健宗 (Wang Jianzong)
Current assignee: Ping An Technology (Shenzhen) Co., Ltd.
Original assignee: Ping An Technology (Shenzhen) Co., Ltd.
Filing date / priority date: 2019-04-17
Application filed by Ping An Technology (Shenzhen) Co., Ltd.
Priority to CN201910307611.3A
Priority to PCT/CN2019/102189 (published as WO2020211237A1)
Publication of CN110222226A (application publication)
Application granted; publication of CN110222226B (grant publication)


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval of audio data
    • G06F16/68: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683: Retrieval using metadata automatically derived from the content
    • G06F16/685: Retrieval using an automatically derived transcript of audio data, e.g. lyrics
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • Y02B20/40: Control techniques providing energy savings, e.g. smart controller or presence detection


Abstract

The invention relates to the technical field of artificial intelligence and discloses a neural-network-based method for generating a rhythm from lyrics, comprising the following steps: converting the lyrics of given music into a vector set according to a preset lyric encoding rule; inputting the vector set into a pre-constructed neural network model to obtain the time-series distribution of the lyrics; and performing connectionist temporal classification (CTC) on the lyrics using the time-series distribution to obtain the target rhythm. The invention also provides a neural-network-based rhythm-generation device and a computer-readable storage medium. By applying a deep learning network to music rhythm generation, the invention obtains reliable results, so that the generated music conforms to the conventions of the original music.

Description

Method, device and storage medium for generating rhythm by words based on neural network
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a neural-network-based method and device for generating a rhythm from lyrics, and a computer-readable storage medium.
Background
Musical rhythm is a key component of automatic music generation algorithms: it regularizes the distribution of the lyrics, constrains pitch, melody and the like, and serves as the bridge between lyrics and music. Traditional speech recognition and music models are built by state modeling, in which a phoneme or word is artificially divided into several states without physical meaning and the output distribution of each state is then described by a discrete or continuous Gaussian model. This modeling approach requires the boundaries of the modeling units to be divided in advance within the continuous sequence so that the input and output distributions are aligned, which makes computation slow.
Much work has combined deep neural networks with various fields. In music generation, probabilistic generation algorithms and Markov chains can reproduce an original musical rhythm accurately, but the melodies they generate are too simple; the structure of the Long Short-Term Memory (LSTM) model is overly complex and its training time is long; and the recurrent neural network (Recurrent Neural Network, RNN) is prone to vanishing gradients when processing elements that are far apart in a sequence. How to apply a deep learning network to music rhythm generation and obtain reliable results, so that the generated music conforms to the conventions of the original music and the system is robust, is therefore a problem to be solved.
Disclosure of Invention
The invention provides a method, a device and a computer-readable storage medium for generating a rhythm from lyrics based on a neural network, with the main aim of providing a technical scheme that applies a deep learning network to music rhythm generation.
In order to achieve the above object, the present invention provides a neural-network-based method for generating a rhythm from lyrics, including:
converting the lyrics of given music into a vector set according to a preset lyric encoding rule, wherein the preset lyric encoding rule specifies that a single character in the lyrics is encoded as 1, a single punctuation mark as 0, and that 0 is used as padding between characters;
pre-constructing a neural network model, wherein the pre-constructed neural network model comprises a three-layer spatio-temporal convolutional network and one layer of bidirectional gated recurrent units;
inputting the vector set of the lyrics of the given music into the three-layer spatio-temporal convolutional network and extracting a feature vector;
performing an aggregation operation on the feature vector using the bidirectional gated recurrent unit to obtain time steps; and
performing a linear transformation on each time step to obtain a time-series distribution;
performing connectionist temporal classification on the lyrics using the time-series distribution to obtain a target rhythm.
Optionally, each layer of the spatio-temporal convolutional network computes its output as:
y_{ijk} = σ(w̃ · x̃_{ijk} + b)
where y denotes the output of the layer; σ denotes the activation function; i, j, k denote the coordinates of the corresponding position on the sample; x̃_{ijk} denotes the local region of the layer input at (i, j, k) whose size matches that of the corresponding convolution kernel; w̃ denotes the weight matrix of the convolution kernel; and b denotes the bias value of the corresponding convolution kernel.
Optionally, performing connectionist temporal classification on the lyrics using the time-series distribution to obtain a target rhythm includes:
adding a blank label to the lyric alphabet V of the given music to obtain the extended alphabet Ṽ, i.e. Ṽ = V ∪ {blank};
defining a function B: Ṽ* → V*, which maps a string over Ṽ to a string over V by: 1) merging consecutive identical symbols; 2) removing blank symbols;
for a string sequence y ∈ V*, defining:
p(y|z) = Σ_{u ∈ B^{-1}(y), |u| = T} p(u_1, …, u_T | z)
where the sequences u over Ṽ are called paths; B^{-1}(y) s.t. |u| = T denotes the set of paths of length T that the function B transforms into the string y; p(y|z) is the sum of the probabilities of all such paths corresponding to the target lyric sequence y; z is the feature vector output by the three-layer spatio-temporal convolutional network; T is the total number of time steps; and p(u_1, …, u_T | z) is the time-series distribution over all T time steps;
computing, from the input feature vector z, the sequence with the maximum probability sum to obtain the target lyric sequence h(x) corresponding to the input sequence, i.e. the rhythm generated for the target lyrics under the given music:
h(x) = argmax_{y ∈ V*} p(y|z)
optionally, the bidirectional gating cycle unit obtains the time step by adopting the following formula:
r t =σ(W r ·[h t-1 ,z]);
u t =σ(W u ·[h t-1 ,z]);
wherein: u (u) t And r t Update gate and reset gate, respectively, []Representing that the two vectors are connected, representing the multiplication of the matrix elements, σ is a sigmoid function, z= { z 1 ,…,z t Is the three-layer space-time convolution network outputFeature vector, W of (2) r And W is u The weights of the reset gate and the update gate respectively,representing candidate states at time t->Representation->Weight of (h) t The output state at the time t is the output state at the time t,
the mapping of the two directions of the bidirectional gating circulation unit is respectively as follows:
the time step at time t is thus obtained as:
optionally, the calculation formula of the time series distribution is:
p(u t ,…,u T |z)=∏ 1≤t≤T p(u t |z),
where t is the time step, p (u t |z)=softmax(mlp(h t ;W mlp ) The softmax is a normalized exponential function, mlp is a weighted value W mlp Z is a characteristic vector output by the three-layer space-time convolution network, and T is the number of all time steps.
In addition, to achieve the above object, the present invention further provides a neural-network-based rhythm-generation device. The device includes a memory and a processor, the memory storing a rhythm-generation program executable on the processor; when executed by the processor, the program implements a neural-network-based method for generating a rhythm from lyrics, the method comprising:
converting the lyrics of given music into a vector set according to a preset lyric encoding rule, wherein the preset lyric encoding rule specifies that a single character in the lyrics is encoded as 1, a single punctuation mark as 0, and that 0 is used as padding between characters;
pre-constructing a neural network model, wherein the pre-constructed neural network model comprises a three-layer spatio-temporal convolutional network and one layer of bidirectional gated recurrent units;
inputting the vector set of the lyrics of the given music into the three-layer spatio-temporal convolutional network and extracting a feature vector;
performing an aggregation operation on the feature vector using the bidirectional gated recurrent unit to obtain time steps; and
performing a linear transformation on each time step to obtain a time-series distribution;
performing connectionist temporal classification on the lyrics using the time-series distribution to obtain a target rhythm.
Optionally, each layer of the spatio-temporal convolutional network computes its output as:
y_{ijk} = σ(w̃ · x̃_{ijk} + b)
where y denotes the output of the layer; σ denotes the activation function; i, j, k denote the coordinates of the corresponding position on the sample; x̃_{ijk} denotes the local region of the layer input at (i, j, k) whose size matches that of the corresponding convolution kernel; w̃ denotes the weight matrix of the convolution kernel; and b denotes the bias value of the corresponding convolution kernel.
Optionally, performing connectionist temporal classification on the lyrics using the time-series distribution to obtain a target rhythm includes:
adding a blank label to the lyric alphabet V of the given music to obtain the extended alphabet Ṽ, i.e. Ṽ = V ∪ {blank};
defining a function B: Ṽ* → V*, which maps a string over Ṽ to a string over V by: 1) merging consecutive identical symbols; 2) removing blank symbols;
for a string sequence y ∈ V*, defining:
p(y|z) = Σ_{u ∈ B^{-1}(y), |u| = T} p(u_1, …, u_T | z)
where the sequences u over Ṽ are called paths; B^{-1}(y) s.t. |u| = T denotes the set of paths of length T that the function B transforms into the string y; p(y|z) is the sum of the probabilities of all such paths corresponding to the target lyric sequence y; z is the feature vector output by the three-layer spatio-temporal convolutional network; T is the total number of time steps; and p(u_1, …, u_T | z) is the time-series distribution over all T time steps;
computing, from the input feature vector z, the sequence with the maximum probability sum to obtain the target lyric sequence h(x) corresponding to the input sequence, i.e. the rhythm generated for the target lyrics under the given music:
h(x) = argmax_{y ∈ V*} p(y|z)
optionally, the bidirectional gating cycle unit obtains the time step by adopting the following formula:
r t =σ(W r ·[h t-1 ,z]);
u t =σ(W u ·[h t-1 ,z]);
wherein: u (u) t And r t Update gate and reset gate, respectively, []Representing that the two vectors are connected, representing the multiplication of the matrix elements, σ is a sigmoid function, z= { z 1 ,…,z t The characteristic vector W is output by the three-layer space-time convolution network r And W is u The weights of the reset gate and the update gate respectively,representing candidate states at time t->Representation->Weight of (h) t The output state at the time t is the output state at the time t,
the mapping of the two directions of the bidirectional gating circulation unit is respectively as follows:
the time step at time t is thus obtained as:
in addition, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a word-generating rhythm program executable by one or more processors to implement the steps of the neural network-based rhythm method for generating words as described above.
The invention provides a method, a device and a computer-readable storage medium for generating a rhythm from lyrics based on a neural network: the lyrics of given music are converted into a vector set according to a preset lyric encoding rule; the vector set is input into a pre-constructed neural network model to obtain the time-series distribution of the lyrics; and connectionist temporal classification is performed on the lyrics using the time-series distribution to obtain a target rhythm. The invention thus applies a deep learning network to music rhythm generation and obtains reliable results, so that the generated music conforms to the conventions of the original music.
Drawings
FIG. 1 is a schematic flowchart of a neural-network-based method for generating a rhythm from lyrics according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the bidirectional gated recurrent unit in the neural-network-based method for generating a rhythm from lyrics according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the data flow in the neural-network-based method for generating a rhythm from lyrics according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the internal structure of a neural-network-based rhythm-generation device according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the program modules of the rhythm-generation program in the neural-network-based rhythm-generation device according to an embodiment of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a neural-network-based method for generating a rhythm from lyrics. Referring to FIG. 1, which is a schematic flowchart of a neural-network-based method for generating a rhythm from lyrics according to an embodiment of the present invention, the method may be performed by an apparatus, and the apparatus may be implemented by software and/or hardware.
In this embodiment, the neural-network-based method for generating a rhythm from lyrics includes:
s10, converting the lyrics of the given music into a vector set according to a preset lyrics coding rule.
In a preferred embodiment of the present invention, the preset lyric encoding rule includes: the single character in the lyrics is specified as 1, the single punctuation mark is 0, and the characters are filled with 0.
In the preferred embodiment of the present invention, the vector is generated in the form of x i =[time,1,1,channel]. Where "time" is BCD (Binary-Coded Decimal) of the time when lyrics appear in music, and "1" refers to the height and width of an image, and in music, one character corresponds to one pixel, so that the width and height are set to be 1, "channel" is the lyrics code described above, the channel value of a single character is 1, the channel value of a single punctuation mark is 0, and so on. Thus, lyrics in a given music may be converted to a set of vectors of x= { X 1 ,…x i ,…x t }。
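As an illustrative sketch of this encoding rule (not part of the patent text: the function name, the punctuation set and the use of raw seconds instead of BCD-coded time are assumptions), the conversion could look as follows in Python:

import string
import numpy as np

# Punctuation set is an assumption; extend as needed for the target corpus.
PUNCTUATION = set(string.punctuation) | set("，。！？、；：")

def encode_lyrics(lyrics, onset_times):
    """Convert characters plus their onset times into [time, 1, 1, channel] vectors.
    The patent stores time as BCD; raw seconds are used here for simplicity."""
    vectors = []
    for ch, t in zip(lyrics, onset_times):
        channel = 0 if ch in PUNCTUATION else 1  # character -> 1, punctuation -> 0
        vectors.append([t, 1, 1, channel])       # height = width = 1: one char, one pixel
    return np.asarray(vectors, dtype=np.float32)

X = encode_lyrics("你好，世界", [0.0, 0.5, 1.0, 1.5, 2.0])
print(X.shape)  # (5, 4): five [time, 1, 1, channel] vectors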
S20, inputting the vector set of the lyrics of the given music into a pre-constructed neural network model to obtain the time-series distribution of the lyrics.
The pre-constructed neural network model of the present invention comprises a three-layer spatio-temporal convolutional network (STCNN) and one layer of bidirectional gated recurrent units (Bi-GRU, Bidirectional Gated Recurrent Unit).
Convolutional neural networks (CNNs) are feed-forward neural networks that stack convolution operations over image space and help improve the performance of computer vision tasks. Spatio-temporal convolutional networks (STCNNs) extend this by convolving over both the temporal and the spatial dimensions, and can therefore process audio and video data.
The computation of each layer of the spatio-temporal convolutional network (STCNN) from input to output is:
y_{ijk} = σ(w̃ · x̃_{ijk} + b)
where y denotes the output of the layer; σ denotes the activation function; i, j, k denote the coordinates of the corresponding position on the sample; x̃_{ijk} denotes the local region of the layer input at (i, j, k) whose size matches that of the corresponding convolution kernel; w̃ denotes the weight matrix of the convolution kernel; and b denotes the bias value of the corresponding convolution kernel.
In the preferred embodiment of the invention, the convolution kernels of the three-layer STCNN are four-dimensional, the four dimensions being time, height, width and number of features respectively.
After training, inputting the vector set X of the lyrics of the given music into the three-layer spatio-temporal convolutional network extracts the feature vector z.
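A minimal PyTorch sketch of such a three-layer STCNN feature extractor is shown below; the channel widths, kernel sizes, padding and activation are assumptions, since the patent fixes only the layer count and the four kernel dimensions (time, height, width, features):

import torch
import torch.nn as nn

class STCNN(nn.Module):
    """Three stacked 3-D convolutions over (time, height, width); since each lyric
    'pixel' is 1x1, the kernels only really mix information along the time axis."""
    def __init__(self, in_channels=1, feat_dim=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.ReLU(),
            nn.Conv3d(64, feat_dim, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.ReLU(),
        )

    def forward(self, x):
        # x: (batch, channels, time, 1, 1), assembled from the lyric vectors
        z = self.layers(x)                                # (batch, feat_dim, time, 1, 1)
        return z.squeeze(-1).squeeze(-1).transpose(1, 2)  # (batch, time, feat_dim)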
Further, the invention uses the Bi-GRU to further aggregate the feature vector z extracted by the STCNN, thereby obtaining the time steps.
In the preferred embodiment of the invention, a layer of bidirectional gated recurrent units (Bi-GRU) is connected after the STCNN. The GRU is a variant of the recurrent neural network (RNN) whose repeating unit has two gates: an update gate u_t and a reset gate r_t. The update gate controls how much of the state information from the previous time step is carried into the current state; the larger its value, the more of the previous state is carried over. The reset gate controls how much of the previous state information is ignored; the smaller its value, the more is ignored. The main feature of the Bi-GRU is that it adds the ability to learn from future context, overcoming the limitation of only being able to process historical information. A Bi-GRU decomposes an ordinary GRU into two directions, one running forward in sequence order and one running backward in time order, with both GRUs connected to the same input and output layers; the structure is shown in FIG. 2. In the preferred embodiment of the present invention, the number of neurons of the Bi-GRU is 256.
The Bi-GRU obtains the time steps using the following formulas:
r_t = σ(W_r · [h_{t-1}, z]);
u_t = σ(W_u · [h_{t-1}, z]);
h̃_t = tanh(W_h̃ · [r_t ⊙ h_{t-1}, z]);
h_t = (1 - u_t) ⊙ h_{t-1} + u_t ⊙ h̃_t;
where u_t and r_t are the update gate and the reset gate respectively, [ ] denotes the concatenation of two vectors, ⊙ denotes element-wise multiplication, σ is the sigmoid function, z = {z_1, …, z_t} is the input of the Bi-GRU (i.e. the output features of the STCNN), W_r and W_u are the weights of the reset gate and the update gate respectively, h̃_t denotes the candidate state at time t, W_h̃ denotes the weight of h̃_t, and h_t is the output state at time t.
The mappings of the two directions of the Bi-GRU are respectively:
h_t^fwd = GRU_fwd(z_1, …, z_t) and h_t^bwd = GRU_bwd(z_T, …, z_t);
so that the time step at time t is obtained as:
h_t = [h_t^fwd, h_t^bwd].
further, the present invention provides for each time step h t And performing linear transformation to obtain time sequence distribution.
For parameterizing the sequence distribution, the invention, for each time step t, makes p (u t |z)=softmax(mlp(h t ;W mlp ) Softmax is a normalized exponential function, mlp is a weighted value W mlp Then defining a time series distribution:
p(u t ,…,u T |z)=∏ 1≤t≤T p(u t |z),
in this model z is the input of the GRU, i.e. the output of STCNNs. That is, when the input is z, the output state at the time t is reversely transmitted, and classification of the state at each time t is obtained. Finally, the time sequence distribution p of all time step numbers T (namely the vector length of z) is obtained according to definition.
S30, performing connectionist temporal classification (CTC) on the lyrics using the time-series distribution to obtain a target rhythm.
CTC is an output layer designed for sequence learning with RNNs; it eliminates the step of aligning the input with the target output.
In a preferred embodiment of the present invention, the main procedure for performing CTC on the lyrics is as follows:
1) A blank label is added to the lyric alphabet V of the given music to obtain the extended alphabet Ṽ, i.e. Ṽ = V ∪ {blank};
2) A function B: Ṽ* → V* is defined, which maps a string over Ṽ to a string over V by: 1) merging consecutive identical symbols; 2) removing blank symbols;
for a string sequence y ∈ V*, define:
p(y|z) = Σ_{u ∈ B^{-1}(y), |u| = T} p(u_1, …, u_T | z)
where the sequences u over Ṽ are called paths; B^{-1}(y) s.t. |u| = T denotes the set of paths of length T that the function B transforms into the string y; p(y|z) is the sum of the probabilities of all such paths corresponding to the target lyric sequence y; z is the feature vector output by the three-layer spatio-temporal convolutional network; T is the total number of time steps; and p(u_1, …, u_T | z) is the time-series distribution over all T time steps;
3) From the input feature vector z, the sequence with the maximum probability sum is computed to obtain the target lyric sequence h(x) corresponding to the input sequence, i.e. the rhythm generated for the target lyrics under the given music:
h(x) = argmax_{y ∈ V*} p(y|z)
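To make the function B and the decoding step concrete: the sketch below implements B (merge consecutive repeats, then drop blanks) and a greedy best-path approximation of h(x); an exact maximization over all paths would use CTC prefix or beam search instead. Treating label 0 as the blank is an assumption:

from itertools import groupby
import numpy as np

BLANK = 0

def B(path):
    """Merge consecutive identical symbols, then remove blank symbols."""
    merged = [k for k, _ in groupby(path)]
    return [s for s in merged if s != BLANK]

def greedy_decode(log_probs):
    """log_probs: (T, n_labels) array of log p(u_t | z).
    Returns the collapsed labelling of the single most probable path."""
    best_path = np.argmax(log_probs, axis=-1)
    return B(best_path.tolist())

print(B([1, 1, 0, 1, 2, 2, 0]))  # -> [1, 1, 2]: repeats merged, blanks removed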
in summary, referring to fig. 3, the data flow of the present invention is as follows: for a piece of music, the preferred embodiment of the invention converts lyrics in the music into vectors and transmits the vectors to the constructed neural network to obtain a time sequence, wherein the neural network comprises three layers of space-time convolutional networks and one layer of bidirectional gating circulating units; inputting the obtained time sequence into the connection time sequence classification, and simultaneously inputting target lyrics, and finally obtaining a corresponding sequence of target lyrics, wherein the sequence of target lyrics is the rhythm of the target lyrics corresponding to the piece of music.
The invention also provides a neural-network-based device for generating a rhythm from lyrics. Referring to FIG. 4, a schematic diagram of the internal structure of a neural-network-based rhythm-generation device according to an embodiment of the present invention is shown.
In this embodiment, the neural-network-based rhythm-generation device 1 may be a PC (Personal Computer) or a terminal device such as a smartphone, a tablet computer or a portable computer. The rhythm-generation device 1 comprises at least a memory 11, a processor 12, a communication bus 13 and a network interface 14.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memories (e.g. SD or DX memory), magnetic memories, magnetic disks, optical disks, etc. The memory 11 may in some embodiments be an internal storage unit of the rhythm-generation device 1, for example a hard disk of the device 1. In other embodiments the memory 11 may also be an external storage device of the device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the device 1. The memory 11 may be used not only to store application software installed in the device 1 and various types of data, for example the code of the rhythm-generation program 01, but also to temporarily store data that has been or will be output.
The processor 12 may in some embodiments be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor or other data processing chip, and is used to run the program code or process the data stored in the memory 11, for example to execute the rhythm-generation program 01.
The communication bus 13 is used to enable connection communication between these components.
The network interface 14 may optionally comprise a standard wired interface, a wireless interface (e.g. WI-FI interface), typically used to establish a communication connection between the apparatus 1 and other electronic devices.
Optionally, the device 1 may further comprise a user interface, which may include a display (Display) and an input unit such as a keyboard (Keyboard); the optional user interface may also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to as a display screen or display unit, and is used to display the information processed in the device 1 and to display a visual user interface.
FIG. 4 shows only the rhythm-generation device 1 with the components 11-14 and the rhythm-generation program 01. It will be appreciated by those skilled in the art that the structure shown in FIG. 4 does not constitute a limitation of the device 1; the device may include fewer or more components than illustrated, combine certain components, or have a different arrangement of components.
In the embodiment of the device 1 shown in FIG. 4, the memory 11 stores the rhythm-generation program 01; when the processor 12 executes the rhythm-generation program 01 stored in the memory 11, the following steps are performed:
step one, converting lyrics of given music into a vector set according to a preset lyrics coding rule.
In a preferred embodiment of the present invention, the preset lyric encoding rule includes: the single character in the lyrics is specified as 1, the single punctuation mark is 0, and the characters are filled with 0.
In the preferred embodiment of the present invention, the vector is generated in the form of x i =[time,1,1,channel]. Where "time" is BCD (Binary-Coded Decimal) of the time when lyrics appear in music, and "1" refers to the height and width of an image, and in music, one character corresponds to one pixel, so that the width and height are set to be 1, "channel" is the lyrics code described above, the channel value of a single character is 1, the channel value of a single punctuation mark is 0, and so on. Thus, lyrics in a given music may be converted to a set of vectors of x= { X 1 ,…x i ,…x t }。
Step two, inputting the vector set of the lyrics of the given music into a pre-constructed neural network model to obtain the time-series distribution of the lyrics.
The pre-constructed neural network model of the present invention comprises a three-layer spatio-temporal convolutional network (STCNN) and one layer of bidirectional gated recurrent units (Bi-GRU, Bidirectional Gated Recurrent Unit).
Convolutional neural networks (CNNs) are feed-forward neural networks that stack convolution operations over image space and help improve the performance of computer vision tasks. Spatio-temporal convolutional networks (STCNNs) extend this by convolving over both the temporal and the spatial dimensions, and can therefore process audio and video data.
The computation of each layer of the spatio-temporal convolutional network (STCNN) from input to output is:
y_{ijk} = σ(w̃ · x̃_{ijk} + b)
where y denotes the output of the layer; σ denotes the activation function; i, j, k denote the coordinates of the corresponding position on the sample; x̃_{ijk} denotes the local region of the layer input at (i, j, k) whose size matches that of the corresponding convolution kernel; w̃ denotes the weight matrix of the convolution kernel; and b denotes the bias value of the corresponding convolution kernel.
In the preferred embodiment of the invention, the convolution kernels of the three-layer STCNN are four-dimensional, the four dimensions being time, height, width and number of features respectively.
After training, inputting the vector set X of the lyrics of the given music into the three-layer spatio-temporal convolutional network extracts the feature vector z.
Further, the invention uses the Bi-GRU to further aggregate the feature vector z extracted by the STCNN, thereby obtaining the time steps.
In the preferred embodiment of the invention, a layer of bidirectional gated recurrent units (Bi-GRU) is connected after the STCNN. The GRU is a variant of the recurrent neural network (RNN) whose repeating unit has two gates: an update gate u_t and a reset gate r_t. The update gate controls how much of the state information from the previous time step is carried into the current state; the larger its value, the more of the previous state is carried over. The reset gate controls how much of the previous state information is ignored; the smaller its value, the more is ignored. The main feature of the Bi-GRU is that it adds the ability to learn from future context, overcoming the limitation of only being able to process historical information. A Bi-GRU decomposes an ordinary GRU into two directions, one running forward in sequence order and one running backward in time order, with both GRUs connected to the same input and output layers; the structure is shown in FIG. 2. In the preferred embodiment of the present invention, the number of neurons of the Bi-GRU is 256.
The Bi-GRU obtains the time steps using the following formulas:
r_t = σ(W_r · [h_{t-1}, z]);
u_t = σ(W_u · [h_{t-1}, z]);
h̃_t = tanh(W_h̃ · [r_t ⊙ h_{t-1}, z]);
h_t = (1 - u_t) ⊙ h_{t-1} + u_t ⊙ h̃_t;
where u_t and r_t are the update gate and the reset gate respectively, [ ] denotes the concatenation of two vectors, ⊙ denotes element-wise multiplication, σ is the sigmoid function, z = {z_1, …, z_t} is the input of the Bi-GRU (i.e. the output features of the STCNN), W_r and W_u are the weights of the reset gate and the update gate respectively, h̃_t denotes the candidate state at time t, W_h̃ denotes the weight of h̃_t, and h_t is the output state at time t.
The mappings of the two directions of the Bi-GRU are respectively:
h_t^fwd = GRU_fwd(z_1, …, z_t) and h_t^bwd = GRU_bwd(z_T, …, z_t);
so that the time step at time t is obtained as:
h_t = [h_t^fwd, h_t^bwd].
further, the present invention provides for each time step h t And performing linear transformation to obtain time sequence distribution.
For parameterizing the sequence distribution, the invention, for each time step t, makes p (u t |z)=softmax(mlp(h t ;W mlp ) Softmax is a normalized exponential function, mlp is a weighted value W mlp Then defining a time series distribution:
p(u t ,…,u T |z)=∏ 1≤t≤T p(u t |z),
in this model z is the input of the GRU, i.e. the output of STCNNs. That is, when the input is z, the output state at the time t is reversely transmitted, and classification of the state at each time t is obtained. Finally, the time sequence distribution p of all time step numbers T (namely the vector length of z) is obtained according to definition.
Step three, performing connectionist temporal classification (CTC) on the lyrics using the time-series distribution to obtain a target rhythm.
CTC is an output layer designed for sequence learning with RNNs; it eliminates the step of aligning the input with the target output.
In a preferred embodiment of the present invention, the main procedure for performing CTC on the lyrics is as follows:
1) A blank label is added to the lyric alphabet V of the given music to obtain the extended alphabet Ṽ, i.e. Ṽ = V ∪ {blank};
2) A function B: Ṽ* → V* is defined, which maps a string over Ṽ to a string over V by: 1) merging consecutive identical symbols; 2) removing blank symbols;
for a string sequence y ∈ V*, define:
p(y|z) = Σ_{u ∈ B^{-1}(y), |u| = T} p(u_1, …, u_T | z)
where the sequences u over Ṽ are called paths; B^{-1}(y) s.t. |u| = T denotes the set of paths of length T that the function B transforms into the string y; p(y|z) is the sum of the probabilities of all such paths corresponding to the target lyric sequence y; z is the feature vector output by the three-layer spatio-temporal convolutional network; T is the total number of time steps; and p(u_1, …, u_T | z) is the time-series distribution over all T time steps;
3) From the input feature vector z, the sequence with the maximum probability sum is computed to obtain the target lyric sequence h(x) corresponding to the input sequence, i.e. the rhythm generated for the target lyrics under the given music:
h(x) = argmax_{y ∈ V*} p(y|z)
alternatively, in other embodiments, the generating the rhythm program by the word may be further divided into one or more modules, where one or more modules are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present invention, and the modules referred to herein refer to a series of instruction segments of the computer program capable of performing a specific function, for describing the execution of the generating the rhythm program by the word in the generating rhythm device based on the neural network.
For example, referring to fig. 5, a schematic diagram of a program module of a word-generating rhythm program in an embodiment of the neural network-based device for generating rhythms with words according to the present invention is shown, where the word-generating rhythm program 01 may be divided into a lyric conversion module 10, a model calculation module 20, and a rhythm generation module 30, which are illustrated as follows:
the lyrics conversion module 10 is used for: and converting the lyrics of the given music into a vector set according to a preset rule.
Optionally, the preset lyric coding rule includes: the single character in the lyrics is specified as 1, the single punctuation mark is 0, and the characters are filled with 0.
The model calculation module 20 is used to input the vector set of the lyrics of the given music into a pre-constructed neural network model to obtain the time-series distribution of the lyrics.
Optionally, the pre-constructed neural network model comprises a three-layer spatio-temporal convolutional network (STCNN) and one layer of bidirectional gated recurrent units (Bi-GRU, Bidirectional Gated Recurrent Unit).
Optionally, inputting the vector set X of the lyrics of the given music into the pre-constructed neural network model to obtain the time-series distribution of the lyrics includes:
inputting the vector set X of the lyrics of the given music into the three-layer spatio-temporal convolutional network and extracting a feature vector z;
performing an aggregation operation on the extracted feature vector z using the bidirectional gated recurrent unit to obtain the time steps;
performing a linear transformation on each time step h_t to obtain the time-series distribution.
Optionally, the bidirectional gated recurrent unit obtains the time steps using the following formulas:
r_t = σ(W_r · [h_{t-1}, z]);
u_t = σ(W_u · [h_{t-1}, z]);
h̃_t = tanh(W_h̃ · [r_t ⊙ h_{t-1}, z]);
h_t = (1 - u_t) ⊙ h_{t-1} + u_t ⊙ h̃_t;
where u_t and r_t are the update gate and the reset gate respectively, [ ] denotes the concatenation of two vectors, ⊙ denotes element-wise multiplication, σ is the sigmoid function, z = {z_1, …, z_t} is the input of the Bi-GRU (i.e. the output features of the STCNN), W_r and W_u are the weights of the reset gate and the update gate respectively, h̃_t denotes the candidate state at time t, W_h̃ denotes the weight of h̃_t, and h_t is the output state at time t.
The mappings of the two directions of the bidirectional gated recurrent unit are respectively:
h_t^fwd = GRU_fwd(z_1, …, z_t) and h_t^bwd = GRU_bwd(z_T, …, z_t);
so that the time step at time t is obtained as:
h_t = [h_t^fwd, h_t^bwd].
optionally, the calculation formula of the time series distribution is:
p(u t ,…,u T |z)=∏ 1≤t≤T p(u t |z),
wherein t is a time step, p (u) t |z)=softmax(mlp(h t ;W mlp ) The softmax is a normalized exponential function, mlp is a weighted value W mlp Z is the output of the three-layer space-time convolution network, and T is the number of all time steps.
The rhythm generation module 30 is used to perform connectionist temporal classification (CTC) on the lyrics using the time-series distribution to obtain a target rhythm.
Optionally, performing connectionist temporal classification on the lyrics includes:
adding a blank label to the lyric alphabet V of the given music to obtain the extended alphabet Ṽ, i.e. Ṽ = V ∪ {blank};
defining a function B: Ṽ* → V*, which maps a string over Ṽ to a string over V by: 1) merging consecutive identical symbols; 2) removing blank symbols;
for a string sequence y ∈ V*, defining:
p(y|z) = Σ_{u ∈ B^{-1}(y), |u| = T} p(u_1, …, u_T | z)
where the sequences u over Ṽ are called paths; B^{-1}(y) s.t. |u| = T denotes the set of paths of length T that the function B transforms into the string y; p(y|z) is the sum of the probabilities of all such paths corresponding to the target lyric sequence y; z is the feature vector output by the three-layer spatio-temporal convolutional network; T is the total number of time steps; and p(u_1, …, u_T | z) is the time-series distribution over all T time steps;
computing, from the input feature vector z, the sequence with the maximum probability sum to obtain the target lyric sequence h(x) corresponding to the input sequence, i.e. the rhythm generated for the target lyrics under the given music:
h(x) = argmax_{y ∈ V*} p(y|z)
the functions or operation steps implemented when the program modules of the lyric conversion module 10, the model calculation module 20, the tempo generation module 30 and the like are substantially the same as those of the above embodiment, and will not be described herein.
In addition, an embodiment of the present invention further proposes a computer-readable storage medium having stored thereon a rhythm-generation program executable by one or more processors to implement the following operations:
converting the lyrics of given music into a vector set according to a preset rule;
inputting the vector set of the lyrics of the given music into a pre-constructed neural network model to obtain the time-series distribution of the lyrics;
performing connectionist temporal classification on the lyrics using the time-series distribution to obtain a target rhythm.
The embodiment of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the neural-network-based rhythm-generation device and method described above, and is not described in detail here.
It should be noted that the numbering of the above embodiments of the present invention is for description only and does not indicate the relative merits of the embodiments. The terms "comprises", "comprising" and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, apparatus, article or method that comprises that element.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods of the embodiments of the present invention.
The foregoing describes only preferred embodiments of the present invention and does not limit the scope of the invention; any equivalent structure or equivalent process derived from the description and drawings of the present invention, whether used directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (5)

1. A neural-network-based method for generating a rhythm from lyrics, the method comprising:
converting the lyrics of given music into a vector set according to a preset lyric encoding rule, wherein the preset lyric encoding rule specifies that a single character in the lyrics is encoded as 1, a single punctuation mark as 0, and that 0 is used as padding between characters;
pre-constructing a neural network model, wherein the pre-constructed neural network model comprises a three-layer spatio-temporal convolutional network and one layer of bidirectional gated recurrent units;
inputting the vector set of the lyrics of the given music into the three-layer spatio-temporal convolutional network and extracting a feature vector;
performing an aggregation operation on the feature vector using the bidirectional gated recurrent unit to obtain time steps; and
performing a linear transformation on each time step to obtain a time-series distribution;
performing connectionist temporal classification on the lyrics using the time-series distribution to obtain a target rhythm;
the step of classifying the lyrics in a connection time sequence by using the time sequence distribution to obtain a target rhythm comprises the following steps:
blank labels are added on the basis of lyrics V of given music to obtain character stringsI.e. < ->
Defining a function B:wherein V is * Is->The following operations are performed to obtain: 1) Merging consecutive identical symbols; 2) Removing blank characters; for a string sequence y E V * Definition:
wherein V is * All elements of (a) are called paths, V * Is the set of all paths, p (y|z) represents the sum of the probabilities of the paths corresponding to the target lyric set V, z is the eigenvector output by the three-layer space-time convolution network, T is the number of all time steps, and p (u) t ,…,u T I z) is a time series distribution of all time step numbers T, s.t i u i=t is a conditional function, expressing that u is the condition among all time steps T, B is required to be satisfied -1(y)s.t.|u|=T Representing the length T and converting the length T into a set of character strings y through a function B;
according to the input characteristic vector z, calculating the maximum probability sum to obtain a target lyric sequence h (x) corresponding to the input sequence, namely, the rhythm generated by the target lyrics under given music:
the calculation formula of the time sequence distribution is as follows:
p(u t ,…,u T |z)=Π 1≤t≤T p(u t |z),
where t is the time step, p (u t |z)=softmax(mlp(h t ;W mlp ) The softmax is a normalized exponential function, mlp is a weighted value W mlp Z is a characteristic vector output by the three-layer space-time convolution network, and T is the number of all time steps.
2. The neural-network-based method for generating a rhythm from lyrics of claim 1, wherein each layer of the spatio-temporal convolutional network computes its output as:
y_{ijk} = σ(w̃ · x̃_{ijk} + b)
where y denotes the output of the layer, σ denotes the activation function, i, j, k denote the coordinates of the corresponding position on the sample, x̃_{ijk} denotes the local region of the layer input at (i, j, k) whose size matches that of the corresponding convolution kernel, w̃ denotes the weight matrix of the convolution kernel, and b denotes the bias value of the corresponding convolution kernel.
3. The neural-network-based method for generating a rhythm from lyrics of claim 1, wherein the bidirectional gated recurrent unit obtains the time steps using the following formulas:
r_t = σ(W_r · [h_{t-1}, z]);
u_t = σ(W_u · [h_{t-1}, z]);
h̃_t = tanh(W_h̃ · [r_t ⊙ h_{t-1}, z]);
h_t = (1 - u_t) ⊙ h_{t-1} + u_t ⊙ h̃_t;
where u_t and r_t are the update gate and the reset gate respectively, [ ] denotes the concatenation of two vectors, ⊙ denotes element-wise multiplication, σ is the sigmoid function, z = {z_1, …, z_t} is the feature vector output by the three-layer spatio-temporal convolutional network, W_r and W_u are the weights of the reset gate and the update gate respectively, h̃_t denotes the candidate state at time t, W_h̃ denotes the weight of h̃_t, and h_t is the output state at time t;
the mappings of the two directions of the bidirectional gated recurrent unit are respectively:
h_t^fwd = GRU_fwd(z_1, …, z_t) and h_t^bwd = GRU_bwd(z_T, …, z_t);
so that the time step at time t is obtained as:
h_t = [h_t^fwd, h_t^bwd].
4. A neural-network-based device for generating a rhythm from lyrics, implementing the neural-network-based method for generating a rhythm from lyrics of any one of claims 1 to 3, wherein the device comprises a memory and a processor, the memory storing a rhythm-generation program executable on the processor, the rhythm-generation program, when invoked by the processor, performing the following steps:
converting the lyrics of given music into a vector set according to a preset lyric encoding rule, wherein the preset lyric encoding rule specifies that a single character in the lyrics is encoded as 1, a single punctuation mark as 0, and that 0 is used as padding between characters;
pre-constructing a neural network model, wherein the pre-constructed neural network model comprises a three-layer spatio-temporal convolutional network and one layer of bidirectional gated recurrent units;
inputting the vector set of the lyrics of the given music into the three-layer spatio-temporal convolutional network and extracting a feature vector;
performing an aggregation operation on the feature vector using the bidirectional gated recurrent unit to obtain time steps; and
performing a linear transformation on each time step to obtain a time-series distribution;
performing connectionist temporal classification on the lyrics using the time-series distribution to obtain a target rhythm.
5. A computer-readable storage medium having stored thereon a rhythm-generation program executable by one or more processors to implement the neural-network-based method for generating a rhythm from lyrics of any one of claims 1 to 3.
CN201910307611.3A (priority date 2019-04-17, filed 2019-04-17): Method, device and storage medium for generating rhythm by words based on neural network. Status: Active. Granted as CN110222226B (en).

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910307611.3A CN110222226B (en) 2019-04-17 2019-04-17 Method, device and storage medium for generating rhythm by words based on neural network
PCT/CN2019/102189 WO2020211237A1 (en) 2019-04-17 2019-08-23 Neural network-based method and apparatus for generating rhythm from lyrics, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910307611.3A CN110222226B (en) 2019-04-17 2019-04-17 Method, device and storage medium for generating rhythm by words based on neural network

Publications (2)

Publication Number Publication Date
CN110222226A (en) 2019-09-10
CN110222226B (en) 2024-03-12 (grant)

Family

ID=67822589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910307611.3A Active CN110222226B (en) 2019-04-17 2019-04-17 Method, device and storage medium for generating rhythm by words based on neural network

Country Status (2)

Country Link
CN (1) CN110222226B (en)
WO (1) WO2020211237A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110853604A * 2019-10-30 2020-02-28 Xi'an Jiaotong University Automatic generation method for Chinese folk songs with a specific regional style based on a variational autoencoder
CN113066457B * 2021-03-17 2023-11-03 Ping An Technology (Shenzhen) Co., Ltd. Fan-exclamation music generation method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717856A * 2018-06-16 2018-10-30 Taizhou University Speech emotion recognition method based on a multi-scale deep convolutional recurrent neural network
CN109166564A * 2018-07-19 2019-01-08 Ping An Technology (Shenzhen) Co., Ltd. Method, apparatus and computer-readable storage medium for generating a melody from lyric text
CN109346045A * 2018-10-26 2019-02-15 Ping An Technology (Shenzhen) Co., Ltd. Counterpoint generation method and device based on long short-term memory neural networks
CN109471951A * 2018-09-19 2019-03-15 Ping An Technology (Shenzhen) Co., Ltd. Neural-network-based lyrics generation method, device, equipment and storage medium
KR101934057B1 * 2017-09-08 2019-04-08 Hansung University Industry-Academic Cooperation Foundation Method and recording medium for automatic composition using hierarchical artificial neural networks

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8626328B2 (en) * 2011-01-24 2014-01-07 International Business Machines Corporation Discrete sampling based nonlinear control system
US10606548B2 (en) * 2017-06-16 2020-03-31 Krotos Ltd Method of generating an audio signal
CN108509534B * 2018-03-15 2022-03-25 South China University of Technology Personalized music recommendation system based on deep learning and implementation method thereof
CN109637509B * 2018-11-12 2023-10-03 Ping An Technology (Shenzhen) Co., Ltd. Automatic music generation method and device, and computer-readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101934057B1 * 2017-09-08 2019-04-08 Hansung University Industry-Academic Cooperation Foundation Method and recording medium for automatic composition using hierarchical artificial neural networks
CN108717856A * 2018-06-16 2018-10-30 Taizhou University Speech emotion recognition method based on a multi-scale deep convolutional recurrent neural network
CN109166564A * 2018-07-19 2019-01-08 Ping An Technology (Shenzhen) Co., Ltd. Method, apparatus and computer-readable storage medium for generating a melody from lyric text
CN109471951A * 2018-09-19 2019-03-15 Ping An Technology (Shenzhen) Co., Ltd. Neural-network-based lyrics generation method, device, equipment and storage medium
CN109346045A * 2018-10-26 2019-02-15 Ping An Technology (Shenzhen) Co., Ltd. Counterpoint generation method and device based on long short-term memory neural networks

Also Published As

Publication number Publication date
CN110222226A (en) 2019-09-10
WO2020211237A1 (en) 2020-10-22


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant