CN110222226B - Method, device and storage medium for generating rhythm by words based on neural network


Info

Publication number: CN110222226B (application number CN201910307611.3A)
Authority: CN (China)
Prior art keywords: time, lyrics, neural network, rhythm, music
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN110222226A
Inventors: 曹靖康 (Cao Jingkang), 王义文 (Wang Yiwen), 王健宗 (Wang Jianzong)
Current assignee: Ping An Technology (Shenzhen) Co., Ltd.
Original assignee: Ping An Technology (Shenzhen) Co., Ltd.
Filing date / priority date: 2019-04-17
Application filed by Ping An Technology (Shenzhen) Co., Ltd.
Priority to CN201910307611.3A
Priority to PCT/CN2019/102189 (published as WO2020211237A1)
Publication of CN110222226A (application publication)
Application granted; publication of CN110222226B (grant publication)


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval of audio data
    • G06F16/68: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683: Retrieval using metadata automatically derived from the content
    • G06F16/685: Retrieval using an automatically derived transcript of audio data, e.g. lyrics
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • Y02B20/40: Control techniques providing energy savings, e.g. smart controller or presence detection


Abstract

The invention relates to the technical field of artificial intelligence and discloses a neural-network-based method for generating a rhythm from lyrics, comprising the following steps: converting the lyrics of given music into a vector set according to a preset lyric encoding rule; inputting the vector set into a pre-constructed neural network model to obtain the time-series distribution of the lyrics; and performing connectionist temporal classification (CTC) on the lyrics using the time-series distribution to obtain the target rhythm. The invention also provides a neural-network-based rhythm-generation device and a computer-readable storage medium. By applying a deep learning network to music rhythm generation, the invention obtains reliable results, so that the generated music conforms to the conventions of the original music.

Description

Method, device and storage medium for generating rhythm by words based on neural network
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a neural-network-based method and device for generating a rhythm from lyrics, and a computer-readable storage medium.
Background
Musical rhythm is a key component of automatic music generation algorithms: it regularizes the distribution of the lyrics, constrains pitch, melody and the like, and serves as the bridge between lyrics and music. Traditional speech recognition and music models are built by state modeling, in which a phoneme or word is artificially divided into several states without physical meaning and the output distribution of each state is then described by a discrete or continuous Gaussian model. This modeling approach requires the boundaries of the modeling units to be divided in advance within the continuous sequence so that the input and output distributions are aligned, which makes computation slow.
Much work has combined deep neural networks with various fields. In music generation, probabilistic generation algorithms and Markov chains can reproduce an original musical rhythm accurately, but the melodies they generate are too simple; the structure of the Long Short-Term Memory (LSTM) model is overly complex and its training time is long; and the recurrent neural network (Recurrent Neural Network, RNN) is prone to vanishing gradients when processing elements that are far apart in a sequence. How to apply a deep learning network to music rhythm generation and obtain reliable results, so that the generated music conforms to the conventions of the original music and the system is robust, is therefore a problem to be solved.
Disclosure of Invention
The invention provides a method, a device and a computer-readable storage medium for generating a rhythm from lyrics based on a neural network, with the main aim of providing a technical scheme that applies a deep learning network to music rhythm generation.
In order to achieve the above object, the present invention provides a neural-network-based method for generating a rhythm from lyrics, including:
converting the lyrics of given music into a vector set according to a preset lyric encoding rule, wherein the preset lyric encoding rule specifies that a single character in the lyrics is encoded as 1, a single punctuation mark as 0, and that 0 is used as padding between characters;
pre-constructing a neural network model, wherein the pre-constructed neural network model comprises a three-layer spatio-temporal convolutional network and one layer of bidirectional gated recurrent units;
inputting the vector set of the lyrics of the given music into the three-layer spatio-temporal convolutional network and extracting a feature vector;
performing an aggregation operation on the feature vector using the bidirectional gated recurrent unit to obtain time steps; and
performing a linear transformation on each time step to obtain a time-series distribution;
performing connectionist temporal classification on the lyrics using the time-series distribution to obtain a target rhythm.
Optionally, each layer of the spatio-temporal convolutional network computes its output as:
y_{ijk} = σ(w̃ · x̃_{ijk} + b)
where y denotes the output of the layer; σ denotes the activation function; i, j, k denote the coordinates of the corresponding position on the sample; x̃_{ijk} denotes the local region of the layer input at (i, j, k) whose size matches that of the corresponding convolution kernel; w̃ denotes the weight matrix of the convolution kernel; and b denotes the bias value of the corresponding convolution kernel.
Optionally, performing connectionist temporal classification on the lyrics using the time-series distribution to obtain a target rhythm includes:
adding a blank label to the lyric alphabet V of the given music to obtain the extended alphabet Ṽ, i.e. Ṽ = V ∪ {blank};
defining a function B: Ṽ* → V*, which maps a string over Ṽ to a string over V by: 1) merging consecutive identical symbols; 2) removing blank symbols;
for a string sequence y ∈ V*, defining:
p(y|z) = Σ_{u ∈ B^{-1}(y), |u| = T} p(u_1, …, u_T | z)
where the sequences u over Ṽ are called paths; B^{-1}(y) s.t. |u| = T denotes the set of paths of length T that the function B transforms into the string y; p(y|z) is the sum of the probabilities of all such paths corresponding to the target lyric sequence y; z is the feature vector output by the three-layer spatio-temporal convolutional network; T is the total number of time steps; and p(u_1, …, u_T | z) is the time-series distribution over all T time steps;
computing, from the input feature vector z, the sequence with the maximum probability sum to obtain the target lyric sequence h(x) corresponding to the input sequence, i.e. the rhythm generated for the target lyrics under the given music:
h(x) = argmax_{y ∈ V*} p(y|z)
optionally, the bidirectional gating cycle unit obtains the time step by adopting the following formula:
r t =σ(W r ·[h t-1 ,z]);
u t =σ(W u ·[h t-1 ,z]);
wherein: u (u) t And r t Update gate and reset gate, respectively, []Representing that the two vectors are connected, representing the multiplication of the matrix elements, σ is a sigmoid function, z= { z 1 ,…,z t Is the three-layer space-time convolution network outputFeature vector, W of (2) r And W is u The weights of the reset gate and the update gate respectively,representing candidate states at time t->Representation->Weight of (h) t The output state at the time t is the output state at the time t,
the mapping of the two directions of the bidirectional gating circulation unit is respectively as follows:
the time step at time t is thus obtained as:
optionally, the calculation formula of the time series distribution is:
p(u t ,…,u T |z)=∏ 1≤t≤T p(u t |z),
where t is the time step, p (u t |z)=softmax(mlp(h t ;W mlp ) The softmax is a normalized exponential function, mlp is a weighted value W mlp Z is a characteristic vector output by the three-layer space-time convolution network, and T is the number of all time steps.
In addition, to achieve the above object, the present invention further provides a neural-network-based rhythm-generation device. The device includes a memory and a processor, the memory storing a rhythm-generation program executable on the processor; when executed by the processor, the program implements a neural-network-based method for generating a rhythm from lyrics, the method comprising:
converting the lyrics of given music into a vector set according to a preset lyric encoding rule, wherein the preset lyric encoding rule specifies that a single character in the lyrics is encoded as 1, a single punctuation mark as 0, and that 0 is used as padding between characters;
pre-constructing a neural network model, wherein the pre-constructed neural network model comprises a three-layer spatio-temporal convolutional network and one layer of bidirectional gated recurrent units;
inputting the vector set of the lyrics of the given music into the three-layer spatio-temporal convolutional network and extracting a feature vector;
performing an aggregation operation on the feature vector using the bidirectional gated recurrent unit to obtain time steps; and
performing a linear transformation on each time step to obtain a time-series distribution;
performing connectionist temporal classification on the lyrics using the time-series distribution to obtain a target rhythm.
Optionally, each layer of the spatio-temporal convolutional network computes its output as:
y_{ijk} = σ(w̃ · x̃_{ijk} + b)
where y denotes the output of the layer; σ denotes the activation function; i, j, k denote the coordinates of the corresponding position on the sample; x̃_{ijk} denotes the local region of the layer input at (i, j, k) whose size matches that of the corresponding convolution kernel; w̃ denotes the weight matrix of the convolution kernel; and b denotes the bias value of the corresponding convolution kernel.
Optionally, performing connectionist temporal classification on the lyrics using the time-series distribution to obtain a target rhythm includes:
adding a blank label to the lyric alphabet V of the given music to obtain the extended alphabet Ṽ, i.e. Ṽ = V ∪ {blank};
defining a function B: Ṽ* → V*, which maps a string over Ṽ to a string over V by: 1) merging consecutive identical symbols; 2) removing blank symbols;
for a string sequence y ∈ V*, defining:
p(y|z) = Σ_{u ∈ B^{-1}(y), |u| = T} p(u_1, …, u_T | z)
where the sequences u over Ṽ are called paths; B^{-1}(y) s.t. |u| = T denotes the set of paths of length T that the function B transforms into the string y; p(y|z) is the sum of the probabilities of all such paths corresponding to the target lyric sequence y; z is the feature vector output by the three-layer spatio-temporal convolutional network; T is the total number of time steps; and p(u_1, …, u_T | z) is the time-series distribution over all T time steps;
computing, from the input feature vector z, the sequence with the maximum probability sum to obtain the target lyric sequence h(x) corresponding to the input sequence, i.e. the rhythm generated for the target lyrics under the given music:
h(x) = argmax_{y ∈ V*} p(y|z)
optionally, the bidirectional gating cycle unit obtains the time step by adopting the following formula:
r t =σ(W r ·[h t-1 ,z]);
u t =σ(W u ·[h t-1 ,z]);
wherein: u (u) t And r t Update gate and reset gate, respectively, []Representing that the two vectors are connected, representing the multiplication of the matrix elements, σ is a sigmoid function, z= { z 1 ,…,z t The characteristic vector W is output by the three-layer space-time convolution network r And W is u The weights of the reset gate and the update gate respectively,representing candidate states at time t->Representation->Weight of (h) t The output state at the time t is the output state at the time t,
the mapping of the two directions of the bidirectional gating circulation unit is respectively as follows:
the time step at time t is thus obtained as:
in addition, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a word-generating rhythm program executable by one or more processors to implement the steps of the neural network-based rhythm method for generating words as described above.
The invention provides a method, a device and a computer-readable storage medium for generating a rhythm from lyrics based on a neural network: the lyrics of given music are converted into a vector set according to a preset lyric encoding rule; the vector set is input into a pre-constructed neural network model to obtain the time-series distribution of the lyrics; and connectionist temporal classification is performed on the lyrics using the time-series distribution to obtain a target rhythm. The invention thus applies a deep learning network to music rhythm generation and obtains reliable results, so that the generated music conforms to the conventions of the original music.
Drawings
FIG. 1 is a schematic flowchart of a neural-network-based method for generating a rhythm from lyrics according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the bidirectional gated recurrent unit in the neural-network-based method for generating a rhythm from lyrics according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the data flow in the neural-network-based method for generating a rhythm from lyrics according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the internal structure of a neural-network-based rhythm-generation device according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the program modules of the rhythm-generation program in the neural-network-based rhythm-generation device according to an embodiment of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a neural-network-based method for generating a rhythm from lyrics. Referring to FIG. 1, which is a schematic flowchart of a neural-network-based method for generating a rhythm from lyrics according to an embodiment of the present invention, the method may be performed by an apparatus, and the apparatus may be implemented by software and/or hardware.
In this embodiment, the neural-network-based method for generating a rhythm from lyrics includes:
s10, converting the lyrics of the given music into a vector set according to a preset lyrics coding rule.
In a preferred embodiment of the present invention, the preset lyric encoding rule includes: the single character in the lyrics is specified as 1, the single punctuation mark is 0, and the characters are filled with 0.
In the preferred embodiment of the present invention, the vector is generated in the form of x i =[time,1,1,channel]. Where "time" is BCD (Binary-Coded Decimal) of the time when lyrics appear in music, and "1" refers to the height and width of an image, and in music, one character corresponds to one pixel, so that the width and height are set to be 1, "channel" is the lyrics code described above, the channel value of a single character is 1, the channel value of a single punctuation mark is 0, and so on. Thus, lyrics in a given music may be converted to a set of vectors of x= { X 1 ,…x i ,…x t }。
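As an illustrative sketch of this encoding rule (not part of the patent text: the function name, the punctuation set and the use of raw seconds instead of BCD-coded time are assumptions), the conversion could look as follows in Python:

import string
import numpy as np

# Punctuation set is an assumption; extend as needed for the target corpus.
PUNCTUATION = set(string.punctuation) | set("，。！？、；：")

def encode_lyrics(lyrics, onset_times):
    """Convert characters plus their onset times into [time, 1, 1, channel] vectors.
    The patent stores time as BCD; raw seconds are used here for simplicity."""
    vectors = []
    for ch, t in zip(lyrics, onset_times):
        channel = 0 if ch in PUNCTUATION else 1  # character -> 1, punctuation -> 0
        vectors.append([t, 1, 1, channel])       # height = width = 1: one char, one pixel
    return np.asarray(vectors, dtype=np.float32)

X = encode_lyrics("你好，世界", [0.0, 0.5, 1.0, 1.5, 2.0])
print(X.shape)  # (5, 4): five [time, 1, 1, channel] vectors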
S20, inputting the vector set of the lyrics of the given music into a pre-constructed neural network model to obtain the time-series distribution of the lyrics.
The pre-constructed neural network model of the present invention comprises a three-layer spatio-temporal convolutional network (STCNN) and one layer of bidirectional gated recurrent units (Bi-GRU, Bidirectional Gated Recurrent Unit).
Convolutional neural networks (CNNs) are feed-forward neural networks that stack convolution operations over image space and help improve the performance of computer vision tasks. Spatio-temporal convolutional networks (STCNNs) extend this by convolving over both the temporal and the spatial dimensions, and can therefore process audio and video data.
The computation of each layer of the spatio-temporal convolutional network (STCNN) from input to output is:
y_{ijk} = σ(w̃ · x̃_{ijk} + b)
where y denotes the output of the layer; σ denotes the activation function; i, j, k denote the coordinates of the corresponding position on the sample; x̃_{ijk} denotes the local region of the layer input at (i, j, k) whose size matches that of the corresponding convolution kernel; w̃ denotes the weight matrix of the convolution kernel; and b denotes the bias value of the corresponding convolution kernel.
In the preferred embodiment of the invention, the convolution kernels of the three-layer STCNN are four-dimensional, the four dimensions being time, height, width and number of features respectively.
After training, inputting the vector set X of the lyrics of the given music into the three-layer spatio-temporal convolutional network extracts the feature vector z.
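A minimal PyTorch sketch of such a three-layer STCNN feature extractor is shown below; the channel widths, kernel sizes, padding and activation are assumptions, since the patent fixes only the layer count and the four kernel dimensions (time, height, width, features):

import torch
import torch.nn as nn

class STCNN(nn.Module):
    """Three stacked 3-D convolutions over (time, height, width); since each lyric
    'pixel' is 1x1, the kernels only really mix information along the time axis."""
    def __init__(self, in_channels=1, feat_dim=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.ReLU(),
            nn.Conv3d(64, feat_dim, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.ReLU(),
        )

    def forward(self, x):
        # x: (batch, channels, time, 1, 1), assembled from the lyric vectors
        z = self.layers(x)                                # (batch, feat_dim, time, 1, 1)
        return z.squeeze(-1).squeeze(-1).transpose(1, 2)  # (batch, time, feat_dim)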
Further, the invention uses the Bi-GRU to further aggregate the feature vector z extracted by the STCNN, thereby obtaining the time steps.
In the preferred embodiment of the invention, a layer of bidirectional gated recurrent units (Bi-GRU) is connected after the STCNN. The GRU is a variant of the recurrent neural network (RNN) whose repeating unit has two gates: an update gate u_t and a reset gate r_t. The update gate controls how much of the state information from the previous time step is carried into the current state; the larger its value, the more of the previous state is carried over. The reset gate controls how much of the previous state information is ignored; the smaller its value, the more is ignored. The main feature of the Bi-GRU is that it adds the ability to learn from future context, overcoming the limitation of only being able to process historical information. A Bi-GRU decomposes an ordinary GRU into two directions, one running forward in sequence order and one running backward in time order, with both GRUs connected to the same input and output layers; the structure is shown in FIG. 2. In the preferred embodiment of the present invention, the number of neurons of the Bi-GRU is 256.
The Bi-GRU obtains the time steps using the following formulas:
r_t = σ(W_r · [h_{t-1}, z]);
u_t = σ(W_u · [h_{t-1}, z]);
h̃_t = tanh(W_h̃ · [r_t ⊙ h_{t-1}, z]);
h_t = (1 - u_t) ⊙ h_{t-1} + u_t ⊙ h̃_t;
where u_t and r_t are the update gate and the reset gate respectively, [ ] denotes the concatenation of two vectors, ⊙ denotes element-wise multiplication, σ is the sigmoid function, z = {z_1, …, z_t} is the input of the Bi-GRU (i.e. the output features of the STCNN), W_r and W_u are the weights of the reset gate and the update gate respectively, h̃_t denotes the candidate state at time t, W_h̃ denotes the weight of h̃_t, and h_t is the output state at time t.
The mappings of the two directions of the Bi-GRU are respectively:
h_t^fwd = GRU_fwd(z_1, …, z_t) and h_t^bwd = GRU_bwd(z_T, …, z_t);
so that the time step at time t is obtained as:
h_t = [h_t^fwd, h_t^bwd].
further, the present invention provides for each time step h t And performing linear transformation to obtain time sequence distribution.
For parameterizing the sequence distribution, the invention, for each time step t, makes p (u t |z)=softmax(mlp(h t ;W mlp ) Softmax is a normalized exponential function, mlp is a weighted value W mlp Then defining a time series distribution:
p(u t ,…,u T |z)=∏ 1≤t≤T p(u t |z),
in this model z is the input of the GRU, i.e. the output of STCNNs. That is, when the input is z, the output state at the time t is reversely transmitted, and classification of the state at each time t is obtained. Finally, the time sequence distribution p of all time step numbers T (namely the vector length of z) is obtained according to definition.
S30, performing connectionist temporal classification (CTC) on the lyrics using the time-series distribution to obtain a target rhythm.
CTC is an output layer designed for sequence learning with RNNs; it eliminates the step of aligning the input with the target output.
In a preferred embodiment of the present invention, the main procedure for performing CTC on the lyrics is as follows:
1) A blank label is added to the lyric alphabet V of the given music to obtain the extended alphabet Ṽ, i.e. Ṽ = V ∪ {blank};
2) A function B: Ṽ* → V* is defined, which maps a string over Ṽ to a string over V by: 1) merging consecutive identical symbols; 2) removing blank symbols;
for a string sequence y ∈ V*, define:
p(y|z) = Σ_{u ∈ B^{-1}(y), |u| = T} p(u_1, …, u_T | z)
where the sequences u over Ṽ are called paths; B^{-1}(y) s.t. |u| = T denotes the set of paths of length T that the function B transforms into the string y; p(y|z) is the sum of the probabilities of all such paths corresponding to the target lyric sequence y; z is the feature vector output by the three-layer spatio-temporal convolutional network; T is the total number of time steps; and p(u_1, …, u_T | z) is the time-series distribution over all T time steps;
3) From the input feature vector z, the sequence with the maximum probability sum is computed to obtain the target lyric sequence h(x) corresponding to the input sequence, i.e. the rhythm generated for the target lyrics under the given music:
h(x) = argmax_{y ∈ V*} p(y|z)
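To make the function B and the decoding step concrete: the sketch below implements B (merge consecutive repeats, then drop blanks) and a greedy best-path approximation of h(x); an exact maximization over all paths would use CTC prefix or beam search instead. Treating label 0 as the blank is an assumption:

from itertools import groupby
import numpy as np

BLANK = 0

def B(path):
    """Merge consecutive identical symbols, then remove blank symbols."""
    merged = [k for k, _ in groupby(path)]
    return [s for s in merged if s != BLANK]

def greedy_decode(log_probs):
    """log_probs: (T, n_labels) array of log p(u_t | z).
    Returns the collapsed labelling of the single most probable path."""
    best_path = np.argmax(log_probs, axis=-1)
    return B(best_path.tolist())

print(B([1, 1, 0, 1, 2, 2, 0]))  # -> [1, 1, 2]: repeats merged, blanks removed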
in summary, referring to fig. 3, the data flow of the present invention is as follows: for a piece of music, the preferred embodiment of the invention converts lyrics in the music into vectors and transmits the vectors to the constructed neural network to obtain a time sequence, wherein the neural network comprises three layers of space-time convolutional networks and one layer of bidirectional gating circulating units; inputting the obtained time sequence into the connection time sequence classification, and simultaneously inputting target lyrics, and finally obtaining a corresponding sequence of target lyrics, wherein the sequence of target lyrics is the rhythm of the target lyrics corresponding to the piece of music.
The invention also provides a neural-network-based device for generating a rhythm from lyrics. Referring to FIG. 4, a schematic diagram of the internal structure of a neural-network-based rhythm-generation device according to an embodiment of the present invention is shown.
In this embodiment, the neural-network-based rhythm-generation device 1 may be a PC (Personal Computer) or a terminal device such as a smartphone, a tablet computer or a portable computer. The rhythm-generation device 1 comprises at least a memory 11, a processor 12, a communication bus 13 and a network interface 14.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memories (e.g. SD or DX memory), magnetic memories, magnetic disks, optical disks, etc. The memory 11 may in some embodiments be an internal storage unit of the rhythm-generation device 1, for example a hard disk of the device 1. In other embodiments the memory 11 may also be an external storage device of the device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the device 1. The memory 11 may be used not only to store application software installed in the device 1 and various types of data, for example the code of the rhythm-generation program 01, but also to temporarily store data that has been or will be output.
The processor 12 may in some embodiments be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor or other data processing chip, and is used to run the program code or process the data stored in the memory 11, for example to execute the rhythm-generation program 01.
The communication bus 13 is used to enable connection communication between these components.
The network interface 14 may optionally comprise a standard wired interface, a wireless interface (e.g. WI-FI interface), typically used to establish a communication connection between the apparatus 1 and other electronic devices.
Optionally, the device 1 may further comprise a user interface, which may include a display (Display) and an input unit such as a keyboard (Keyboard); the optional user interface may also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to as a display screen or display unit, and is used to display the information processed in the device 1 and to display a visual user interface.
FIG. 4 shows only the rhythm-generation device 1 with the components 11-14 and the rhythm-generation program 01. It will be appreciated by those skilled in the art that the structure shown in FIG. 4 does not constitute a limitation of the device 1; the device may include fewer or more components than illustrated, combine certain components, or have a different arrangement of components.
In the embodiment of the device 1 shown in FIG. 4, the memory 11 stores the rhythm-generation program 01; when the processor 12 executes the rhythm-generation program 01 stored in the memory 11, the following steps are performed:
step one, converting lyrics of given music into a vector set according to a preset lyrics coding rule.
In a preferred embodiment of the present invention, the preset lyric encoding rule includes: the single character in the lyrics is specified as 1, the single punctuation mark is 0, and the characters are filled with 0.
In the preferred embodiment of the present invention, the vector is generated in the form of x i =[time,1,1,channel]. Where "time" is BCD (Binary-Coded Decimal) of the time when lyrics appear in music, and "1" refers to the height and width of an image, and in music, one character corresponds to one pixel, so that the width and height are set to be 1, "channel" is the lyrics code described above, the channel value of a single character is 1, the channel value of a single punctuation mark is 0, and so on. Thus, lyrics in a given music may be converted to a set of vectors of x= { X 1 ,…x i ,…x t }。
Step two, inputting the vector set of the lyrics of the given music into a pre-constructed neural network model to obtain the time-series distribution of the lyrics.
The pre-constructed neural network model of the present invention comprises a three-layer spatio-temporal convolutional network (STCNN) and one layer of bidirectional gated recurrent units (Bi-GRU, Bidirectional Gated Recurrent Unit).
Convolutional neural networks (CNNs) are feed-forward neural networks that stack convolution operations over image space and help improve the performance of computer vision tasks. Spatio-temporal convolutional networks (STCNNs) extend this by convolving over both the temporal and the spatial dimensions, and can therefore process audio and video data.
The computation of each layer of the spatio-temporal convolutional network (STCNN) from input to output is:
y_{ijk} = σ(w̃ · x̃_{ijk} + b)
where y denotes the output of the layer; σ denotes the activation function; i, j, k denote the coordinates of the corresponding position on the sample; x̃_{ijk} denotes the local region of the layer input at (i, j, k) whose size matches that of the corresponding convolution kernel; w̃ denotes the weight matrix of the convolution kernel; and b denotes the bias value of the corresponding convolution kernel.
In the preferred embodiment of the invention, the convolution kernels of the three-layer STCNN are four-dimensional, the four dimensions being time, height, width and number of features respectively.
After training, inputting the vector set X of the lyrics of the given music into the three-layer spatio-temporal convolutional network extracts the feature vector z.
Further, the invention uses the Bi-GRU to further aggregate the feature vector z extracted by the STCNN, thereby obtaining the time steps.
In the preferred embodiment of the invention, a layer of bidirectional gated recurrent units (Bi-GRU) is connected after the STCNN. The GRU is a variant of the recurrent neural network (RNN) whose repeating unit has two gates: an update gate u_t and a reset gate r_t. The update gate controls how much of the state information from the previous time step is carried into the current state; the larger its value, the more of the previous state is carried over. The reset gate controls how much of the previous state information is ignored; the smaller its value, the more is ignored. The main feature of the Bi-GRU is that it adds the ability to learn from future context, overcoming the limitation of only being able to process historical information. A Bi-GRU decomposes an ordinary GRU into two directions, one running forward in sequence order and one running backward in time order, with both GRUs connected to the same input and output layers; the structure is shown in FIG. 2. In the preferred embodiment of the present invention, the number of neurons of the Bi-GRU is 256.
The Bi-GRU obtains the time steps using the following formulas:
r_t = σ(W_r · [h_{t-1}, z]);
u_t = σ(W_u · [h_{t-1}, z]);
h̃_t = tanh(W_h̃ · [r_t ⊙ h_{t-1}, z]);
h_t = (1 - u_t) ⊙ h_{t-1} + u_t ⊙ h̃_t;
where u_t and r_t are the update gate and the reset gate respectively, [ ] denotes the concatenation of two vectors, ⊙ denotes element-wise multiplication, σ is the sigmoid function, z = {z_1, …, z_t} is the input of the Bi-GRU (i.e. the output features of the STCNN), W_r and W_u are the weights of the reset gate and the update gate respectively, h̃_t denotes the candidate state at time t, W_h̃ denotes the weight of h̃_t, and h_t is the output state at time t.
The mappings of the two directions of the Bi-GRU are respectively:
h_t^fwd = GRU_fwd(z_1, …, z_t) and h_t^bwd = GRU_bwd(z_T, …, z_t);
so that the time step at time t is obtained as:
h_t = [h_t^fwd, h_t^bwd].
further, the present invention provides for each time step h t And performing linear transformation to obtain time sequence distribution.
For parameterizing the sequence distribution, the invention, for each time step t, makes p (u t |z)=softmax(mlp(h t ;W mlp ) Softmax is a normalized exponential function, mlp is a weighted value W mlp Then defining a time series distribution:
p(u t ,…,u T |z)=∏ 1≤t≤T p(u t |z),
in this model z is the input of the GRU, i.e. the output of STCNNs. That is, when the input is z, the output state at the time t is reversely transmitted, and classification of the state at each time t is obtained. Finally, the time sequence distribution p of all time step numbers T (namely the vector length of z) is obtained according to definition.
Step three, performing connectionist temporal classification (CTC) on the lyrics using the time-series distribution to obtain a target rhythm.
CTC is an output layer designed for sequence learning with RNNs; it eliminates the step of aligning the input with the target output.
In a preferred embodiment of the present invention, the main procedure for performing CTC on the lyrics is as follows:
1) A blank label is added to the lyric alphabet V of the given music to obtain the extended alphabet Ṽ, i.e. Ṽ = V ∪ {blank};
2) A function B: Ṽ* → V* is defined, which maps a string over Ṽ to a string over V by: 1) merging consecutive identical symbols; 2) removing blank symbols;
for a string sequence y ∈ V*, define:
p(y|z) = Σ_{u ∈ B^{-1}(y), |u| = T} p(u_1, …, u_T | z)
where the sequences u over Ṽ are called paths; B^{-1}(y) s.t. |u| = T denotes the set of paths of length T that the function B transforms into the string y; p(y|z) is the sum of the probabilities of all such paths corresponding to the target lyric sequence y; z is the feature vector output by the three-layer spatio-temporal convolutional network; T is the total number of time steps; and p(u_1, …, u_T | z) is the time-series distribution over all T time steps;
3) From the input feature vector z, the sequence with the maximum probability sum is computed to obtain the target lyric sequence h(x) corresponding to the input sequence, i.e. the rhythm generated for the target lyrics under the given music:
h(x) = argmax_{y ∈ V*} p(y|z)
alternatively, in other embodiments, the generating the rhythm program by the word may be further divided into one or more modules, where one or more modules are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present invention, and the modules referred to herein refer to a series of instruction segments of the computer program capable of performing a specific function, for describing the execution of the generating the rhythm program by the word in the generating rhythm device based on the neural network.
For example, referring to fig. 5, a schematic diagram of a program module of a word-generating rhythm program in an embodiment of the neural network-based device for generating rhythms with words according to the present invention is shown, where the word-generating rhythm program 01 may be divided into a lyric conversion module 10, a model calculation module 20, and a rhythm generation module 30, which are illustrated as follows:
the lyrics conversion module 10 is used for: and converting the lyrics of the given music into a vector set according to a preset rule.
Optionally, the preset lyric coding rule includes: the single character in the lyrics is specified as 1, the single punctuation mark is 0, and the characters are filled with 0.
The model calculation module 20 is used to input the vector set of the lyrics of the given music into a pre-constructed neural network model to obtain the time-series distribution of the lyrics.
Optionally, the pre-constructed neural network model comprises a three-layer spatio-temporal convolutional network (STCNN) and one layer of bidirectional gated recurrent units (Bi-GRU, Bidirectional Gated Recurrent Unit).
Optionally, inputting the vector set X of the lyrics of the given music into the pre-constructed neural network model to obtain the time-series distribution of the lyrics includes:
inputting the vector set X of the lyrics of the given music into the three-layer spatio-temporal convolutional network and extracting a feature vector z;
performing an aggregation operation on the extracted feature vector z using the bidirectional gated recurrent unit to obtain the time steps;
performing a linear transformation on each time step h_t to obtain the time-series distribution.
Optionally, the bidirectional gated recurrent unit obtains the time steps using the following formulas:
r_t = σ(W_r · [h_{t-1}, z]);
u_t = σ(W_u · [h_{t-1}, z]);
h̃_t = tanh(W_h̃ · [r_t ⊙ h_{t-1}, z]);
h_t = (1 - u_t) ⊙ h_{t-1} + u_t ⊙ h̃_t;
where u_t and r_t are the update gate and the reset gate respectively, [ ] denotes the concatenation of two vectors, ⊙ denotes element-wise multiplication, σ is the sigmoid function, z = {z_1, …, z_t} is the input of the Bi-GRU (i.e. the output features of the STCNN), W_r and W_u are the weights of the reset gate and the update gate respectively, h̃_t denotes the candidate state at time t, W_h̃ denotes the weight of h̃_t, and h_t is the output state at time t.
The mappings of the two directions of the bidirectional gated recurrent unit are respectively:
h_t^fwd = GRU_fwd(z_1, …, z_t) and h_t^bwd = GRU_bwd(z_T, …, z_t);
so that the time step at time t is obtained as:
h_t = [h_t^fwd, h_t^bwd].
optionally, the calculation formula of the time series distribution is:
p(u t ,…,u T |z)=∏ 1≤t≤T p(u t |z),
wherein t is a time step, p (u) t |z)=softmax(mlp(h t ;W mlp ) The softmax is a normalized exponential function, mlp is a weighted value W mlp Z is the output of the three-layer space-time convolution network, and T is the number of all time steps.
The rhythm generation module 30 is used to perform connectionist temporal classification (CTC) on the lyrics using the time-series distribution to obtain a target rhythm.
Optionally, performing connectionist temporal classification on the lyrics includes:
adding a blank label to the lyric alphabet V of the given music to obtain the extended alphabet Ṽ, i.e. Ṽ = V ∪ {blank};
defining a function B: Ṽ* → V*, which maps a string over Ṽ to a string over V by: 1) merging consecutive identical symbols; 2) removing blank symbols;
for a string sequence y ∈ V*, defining:
p(y|z) = Σ_{u ∈ B^{-1}(y), |u| = T} p(u_1, …, u_T | z)
where the sequences u over Ṽ are called paths; B^{-1}(y) s.t. |u| = T denotes the set of paths of length T that the function B transforms into the string y; p(y|z) is the sum of the probabilities of all such paths corresponding to the target lyric sequence y; z is the feature vector output by the three-layer spatio-temporal convolutional network; T is the total number of time steps; and p(u_1, …, u_T | z) is the time-series distribution over all T time steps;
computing, from the input feature vector z, the sequence with the maximum probability sum to obtain the target lyric sequence h(x) corresponding to the input sequence, i.e. the rhythm generated for the target lyrics under the given music:
h(x) = argmax_{y ∈ V*} p(y|z)
the functions or operation steps implemented when the program modules of the lyric conversion module 10, the model calculation module 20, the tempo generation module 30 and the like are substantially the same as those of the above embodiment, and will not be described herein.
In addition, an embodiment of the present invention further proposes a computer-readable storage medium having stored thereon a rhythm-generation program executable by one or more processors to implement the following operations:
converting the lyrics of given music into a vector set according to a preset rule;
inputting the vector set of the lyrics of the given music into a pre-constructed neural network model to obtain the time-series distribution of the lyrics;
performing connectionist temporal classification on the lyrics using the time-series distribution to obtain a target rhythm.
The embodiment of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the neural-network-based rhythm-generation device and method described above, and is not described in detail here.
It should be noted that the numbering of the above embodiments of the present invention is for description only and does not indicate the relative merits of the embodiments. The terms "comprises", "comprising" and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, apparatus, article or method that comprises that element.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods of the embodiments of the present invention.
The foregoing describes only preferred embodiments of the present invention and does not limit the scope of the invention; any equivalent structure or equivalent process derived from the description and drawings of the present invention, whether used directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (5)

1. A neural-network-based method for generating a rhythm from lyrics, the method comprising:
converting the lyrics of given music into a vector set according to a preset lyric encoding rule, wherein the preset lyric encoding rule specifies that a single character in the lyrics is encoded as 1, a single punctuation mark as 0, and that 0 is used as padding between characters;
pre-constructing a neural network model, wherein the pre-constructed neural network model comprises a three-layer spatio-temporal convolutional network and one layer of bidirectional gated recurrent units;
inputting the vector set of the lyrics of the given music into the three-layer spatio-temporal convolutional network and extracting a feature vector;
performing an aggregation operation on the feature vector using the bidirectional gated recurrent unit to obtain time steps; and
performing a linear transformation on each time step to obtain a time-series distribution;
performing connectionist temporal classification on the lyrics using the time-series distribution to obtain a target rhythm;
the step of classifying the lyrics in a connection time sequence by using the time sequence distribution to obtain a target rhythm comprises the following steps:
blank labels are added on the basis of lyrics V of given music to obtain character stringsI.e. < ->
Defining a function B:wherein V is * Is->The following operations are performed to obtain: 1) Merging consecutive identical symbols; 2) Removing blank characters; for a string sequence y E V * Definition:
wherein V is * All elements of (a) are called paths, V * Is the set of all paths, p (y|z) represents the sum of the probabilities of the paths corresponding to the target lyric set V, z is the eigenvector output by the three-layer space-time convolution network, T is the number of all time steps, and p (u) t ,…,u T I z) is a time series distribution of all time step numbers T, s.t i u i=t is a conditional function, expressing that u is the condition among all time steps T, B is required to be satisfied -1(y)s.t.|u|=T Representing the length T and converting the length T into a set of character strings y through a function B;
according to the input characteristic vector z, calculating the maximum probability sum to obtain a target lyric sequence h (x) corresponding to the input sequence, namely, the rhythm generated by the target lyrics under given music:
the calculation formula of the time sequence distribution is as follows:
p(u t ,…,u T |z)=Π 1≤t≤T p(u t |z),
where t is the time step, p (u t |z)=softmax(mlp(h t ;W mlp ) The softmax is a normalized exponential function, mlp is a weighted value W mlp Z is a characteristic vector output by the three-layer space-time convolution network, and T is the number of all time steps.
2. The neural-network-based method for generating a rhythm from lyrics of claim 1, wherein each layer of the spatio-temporal convolutional network computes its output as:
y_{ijk} = σ(w̃ · x̃_{ijk} + b)
where y denotes the output of the layer, σ denotes the activation function, i, j, k denote the coordinates of the corresponding position on the sample, x̃_{ijk} denotes the local region of the layer input at (i, j, k) whose size matches that of the corresponding convolution kernel, w̃ denotes the weight matrix of the convolution kernel, and b denotes the bias value of the corresponding convolution kernel.
3. The neural-network-based method for generating a rhythm from lyrics of claim 1, wherein the bidirectional gated recurrent unit obtains the time steps using the following formulas:
r_t = σ(W_r · [h_{t-1}, z]);
u_t = σ(W_u · [h_{t-1}, z]);
h̃_t = tanh(W_h̃ · [r_t ⊙ h_{t-1}, z]);
h_t = (1 - u_t) ⊙ h_{t-1} + u_t ⊙ h̃_t;
where u_t and r_t are the update gate and the reset gate respectively, [ ] denotes the concatenation of two vectors, ⊙ denotes element-wise multiplication, σ is the sigmoid function, z = {z_1, …, z_t} is the feature vector output by the three-layer spatio-temporal convolutional network, W_r and W_u are the weights of the reset gate and the update gate respectively, h̃_t denotes the candidate state at time t, W_h̃ denotes the weight of h̃_t, and h_t is the output state at time t;
the mappings of the two directions of the bidirectional gated recurrent unit are respectively:
h_t^fwd = GRU_fwd(z_1, …, z_t) and h_t^bwd = GRU_bwd(z_T, …, z_t);
so that the time step at time t is obtained as:
h_t = [h_t^fwd, h_t^bwd].
4. A neural-network-based device for generating a rhythm from lyrics, implementing the neural-network-based method for generating a rhythm from lyrics of any one of claims 1 to 3, wherein the device comprises a memory and a processor, the memory storing a rhythm-generation program executable on the processor, the rhythm-generation program, when invoked by the processor, performing the following steps:
converting the lyrics of given music into a vector set according to a preset lyric encoding rule, wherein the preset lyric encoding rule specifies that a single character in the lyrics is encoded as 1, a single punctuation mark as 0, and that 0 is used as padding between characters;
pre-constructing a neural network model, wherein the pre-constructed neural network model comprises a three-layer spatio-temporal convolutional network and one layer of bidirectional gated recurrent units;
inputting the vector set of the lyrics of the given music into the three-layer spatio-temporal convolutional network and extracting a feature vector;
performing an aggregation operation on the feature vector using the bidirectional gated recurrent unit to obtain time steps; and
performing a linear transformation on each time step to obtain a time-series distribution;
performing connectionist temporal classification on the lyrics using the time-series distribution to obtain a target rhythm.
5. A computer-readable storage medium having stored thereon a rhythm-generation program executable by one or more processors to implement the neural-network-based method for generating a rhythm from lyrics of any one of claims 1 to 3.
CN201910307611.3A (priority date 2019-04-17, filed 2019-04-17): Method, device and storage medium for generating rhythm by words based on neural network. Status: Active. Granted as CN110222226B (en).

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910307611.3A CN110222226B (en) 2019-04-17 2019-04-17 Method, device and storage medium for generating rhythm by words based on neural network
PCT/CN2019/102189 WO2020211237A1 (en) 2019-04-17 2019-08-23 Neural network-based method and apparatus for generating rhythm from lyrics, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910307611.3A CN110222226B (en) 2019-04-17 2019-04-17 Method, device and storage medium for generating rhythm by words based on neural network

Publications (2)

Publication Number Publication Date
CN110222226A (en) 2019-09-10
CN110222226B (en) 2024-03-12 (grant)

Family

ID=67822589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910307611.3A Active CN110222226B (en) 2019-04-17 2019-04-17 Method, device and storage medium for generating rhythm by words based on neural network

Country Status (2)

Country Link
CN (1) CN110222226B (en)
WO (1) WO2020211237A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110853604A * 2019-10-30 2020-02-28 Xi'an Jiaotong University Automatic generation method for Chinese folk songs with a specific regional style based on a variational autoencoder
CN113066457B * 2021-03-17 2023-11-03 Ping An Technology (Shenzhen) Co., Ltd. Fan-exclamation music generation method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717856A * 2018-06-16 2018-10-30 Taizhou University Speech emotion recognition method based on a multi-scale deep convolutional recurrent neural network
CN109166564A * 2018-07-19 2019-01-08 Ping An Technology (Shenzhen) Co., Ltd. Method, apparatus and computer-readable storage medium for generating a melody from lyric text
CN109346045A * 2018-10-26 2019-02-15 Ping An Technology (Shenzhen) Co., Ltd. Counterpoint generation method and device based on long short-term memory neural networks
CN109471951A * 2018-09-19 2019-03-15 Ping An Technology (Shenzhen) Co., Ltd. Neural-network-based lyrics generation method, device, equipment and storage medium
KR101934057B1 * 2017-09-08 2019-04-08 Hansung University Industry-Academic Cooperation Foundation Method and recording medium for automatic composition using hierarchical artificial neural networks

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8626328B2 (en) * 2011-01-24 2014-01-07 International Business Machines Corporation Discrete sampling based nonlinear control system
US10606548B2 (en) * 2017-06-16 2020-03-31 Krotos Ltd Method of generating an audio signal
CN108509534B * 2018-03-15 2022-03-25 South China University of Technology Personalized music recommendation system based on deep learning and implementation method thereof
CN109637509B * 2018-11-12 2023-10-03 Ping An Technology (Shenzhen) Co., Ltd. Automatic music generation method and device, and computer-readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101934057B1 * 2017-09-08 2019-04-08 Hansung University Industry-Academic Cooperation Foundation Method and recording medium for automatic composition using hierarchical artificial neural networks
CN108717856A * 2018-06-16 2018-10-30 Taizhou University Speech emotion recognition method based on a multi-scale deep convolutional recurrent neural network
CN109166564A * 2018-07-19 2019-01-08 Ping An Technology (Shenzhen) Co., Ltd. Method, apparatus and computer-readable storage medium for generating a melody from lyric text
CN109471951A * 2018-09-19 2019-03-15 Ping An Technology (Shenzhen) Co., Ltd. Neural-network-based lyrics generation method, device, equipment and storage medium
CN109346045A * 2018-10-26 2019-02-15 Ping An Technology (Shenzhen) Co., Ltd. Counterpoint generation method and device based on long short-term memory neural networks

Also Published As

Publication number Publication date
CN110222226A (en) 2019-09-10
WO2020211237A1 (en) 2020-10-22


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant