CN116189663A - Training method and device of prosody prediction model, and man-machine interaction method and device

Info

Publication number: CN116189663A
Authority: CN (China)
Application number: CN202310202425.XA
Priority/filing date: 2023-02-23
Publication date: 2023-05-30
Legal status: Pending
Prior art keywords: prosody, prediction model, corpus, module, prosody prediction
Other languages: Chinese (zh)
Inventors: 顾艳梅, 王志铭
Assignee (current and original): Alipay Hangzhou Information Technology Co Ltd

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/063 Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 17/04 Training, enrolment or model building (speaker identification or verification techniques)
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of this specification provide a training method and device for a prosody prediction model, and a human-computer interaction method and device. The training method comprises: obtaining a text corpus in a target service scenario; normalizing the text corpus; obtaining a sample corpus, where the sample corpus is obtained by marking the normalized text corpus with prosody labels, and a prosody label indicates a pause duration; initializing the model structure and parameters of the prosody prediction model with the model structure and parameters of a trained punctuation prediction model; and inputting the sample corpus carrying the prosody labels into the initialized prosody prediction model to train the prosody prediction model. The method of the embodiments makes the speech played by a machine device more natural and easier for users to understand.

Description

Training method and device of prosody prediction model, and man-machine interaction method and device
Technical Field
One or more embodiments of this specification relate to network communication technology, and in particular to a training method and device for a prosody prediction model, and a human-computer interaction method and device.
Background
Human-computer interaction services are increasingly common: a machine device broadcasts a reply according to the content of the user's side of the conversation. In a real intelligent dialogue, the speech cannot be synthesized on the machine device in advance; the device must generate and broadcast the reply speech in real time, based on the dialogue content the user produces in real time.
When humans converse, their speech carries prosody, for example which word a pause should follow and how long that pause should last. In human-computer interaction, to make the speech broadcast by the machine device more natural and easier to understand, the device likewise needs to apply prosody processing to the speech content it generates in real time, and to play the speech according to the processed prosody.
In the prior art, prosody processing in human-computer interaction mainly works as follows: after the machine device generates a piece of speech content to be broadcast, it performs word segmentation on the content, sets a prosody label after each word according to the segmentation result, and pauses at each prosody label position while broadcasting.
This prior-art prosody processing makes the speech played by the machine device sound unnatural and hard to follow, which reduces the practicality of human-computer interaction.
Disclosure of Invention
One or more embodiments of this specification describe a training method and device for a prosody prediction model, and a human-computer interaction method and device, which make the speech played by a machine device more natural and easier for users to understand.
According to a first aspect, there is provided a training method for a prosody prediction model, the method comprising:
obtaining a text corpus in a target service scenario;
normalizing the text corpus;
obtaining a sample corpus; the sample corpus is obtained by marking the normalized text corpus with prosody labels; a prosody label indicates a pause duration;
initializing the model structure and parameters of the prosody prediction model with the model structure and parameters of a trained punctuation prediction model;
and inputting the sample corpus carrying the prosody labels into the initialized prosody prediction model to train the prosody prediction model.
Wherein the normalization includes at least one of: removing non-Chinese-character symbols in the text corpus that do not affect semantic understanding, and converting non-Chinese-character symbols in the text corpus that do affect semantic understanding into Chinese characters with the corresponding semantics.
Wherein the prosody label is further used to indicate the intonation of the speech.
Wherein the prosody labels include at least one of: word segmentation, small pause, large pause, and sentence end.
The initializing of the model structure and parameters of the prosody prediction model with those of the trained punctuation prediction model further comprises:
removing the fully connected layer in the network structure of the punctuation prediction model; and setting the categories output by the model to the categories included in the prosody labels.
Wherein the prosody prediction model includes a text preprocessing module, a Word2Vec module, and a BiLSTM module, wherein:
the text preprocessing module is used to perform the normalization on the text corpus to obtain the sample corpus;
the Word2Vec module is used to obtain the embedding value of each word in the sample corpus;
the BiLSTM module is used to output a prediction result according to the embedding value of each word in the sample corpus.
According to a second aspect, there is provided a human-computer interaction method, the method comprising:
obtaining interactive content to be played to a user;
inputting the interactive content into a pre-trained prosody prediction model to obtain interactive content with prosody labels output by the prosody prediction model;
and playing the interactive content to the user as prosodic speech according to the prosody labels carried in the interactive content.
According to a third aspect, there is provided a training device for a prosody prediction model, the device comprising:
a sample acquisition module configured to obtain a text corpus in the target service scenario; normalize the text corpus; and obtain a sample corpus, where the sample corpus is obtained by marking the normalized text corpus with prosody labels, and a prosody label indicates a pause duration;
a transfer learning module configured to initialize the model structure and parameters of the prosody prediction model with the model structure and parameters of the trained punctuation prediction model;
and a training execution module configured to input the sample corpus carrying the prosody labels into the initialized prosody prediction model to train the prosody prediction model.
According to a fourth aspect, there is provided a human-computer interaction device, the device comprising:
an interactive content obtaining module configured to obtain interactive content to be played to a user;
a prosody processing module configured to input the interactive content into a pre-trained prosody prediction model to obtain interactive content with prosody labels output by the prosody prediction model, where the prosody prediction model is trained by the training device of the embodiments of this specification;
and a playing module configured to play the interactive content to the user as prosodic speech according to the prosody labels carried in the interactive content.
According to a fifth aspect, there is provided a computing device comprising a memory and a processor, the memory storing executable code, and the processor, when executing the executable code, implementing a method as described in any embodiment of this specification.
The training method and device for a prosody prediction model and the human-computer interaction method and device provided by the embodiments of this specification train the prosody prediction model with the help of a punctuation prediction model. A punctuation prediction model adds punctuation to text content, and punctuation in practice correlates strongly with prosody: the places where speech should pause are precisely the places where punctuation is needed. By exploiting this capability of the punctuation prediction model during training, the trained prosody prediction model can, in subsequent prediction services, mark prosodic pauses at the positions in a text where the various punctuation marks would be placed, and thereby acquires the capability of prosody prediction.
Drawings
To describe the embodiments of this specification and the prior-art technical solutions more clearly, the drawings required by the embodiments are briefly introduced below. The drawings described below are merely some embodiments of this specification; a person skilled in the art could obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a system architecture to which one embodiment of the present description applies.
FIG. 2 is a flow chart of a method of training a prosody prediction model in one embodiment of the present specification.
FIG. 3 is a flow chart of a human-machine interaction method in one embodiment of the present description.
FIG. 4 is a schematic structural diagram of a training device for a prosody prediction model in an embodiment of this specification.
FIG. 5 is a schematic structural diagram of a human-computer interaction device in an embodiment of this specification.
Detailed Description
As described above, in the prior art the machine device sets prosody labels according to word-segmentation results to meet the prosody requirements of broadcast speech. In human conversation, however, pauses do not simply follow every word. Especially when a sentence to be broadcast contains no punctuation and its text is long, pausing only at word boundaries makes the broadcast speech sound unnatural and mechanical, and in severe cases impairs the listener's understanding. For example, if the machine device pauses separately after each of the four words "today", "weather", "particularly" and "good", the broadcast sounds mechanical and unnatural, and the longer the sentence to be broadcast, the more prominent the problem becomes. Following human language habits, the machine should instead pause in the manner "today the weather" / "is particularly good".
Therefore, the prosody processing method in the prior art cannot meet the prosody requirements of speech playback in human-computer interaction, which greatly reduces satisfaction with the interaction.
The following describes the scheme provided in the present specification with reference to the drawings.
It is first noted that the terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" herein merely describes an association between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
For ease of understanding the methods provided in this specification, a description of the system architecture to which this specification relates and applies is first provided. As shown in fig. 1, the system architecture mainly includes a user and a machine device that perform man-machine interaction.
Machine devices that perform human-computer interaction with a user may include, but are not limited to: intelligent mobile terminals, smart home devices, network devices, wearable devices, smart medical devices, PCs (personal computers), and the like. Smart mobile devices may include mobile phones, tablets, notebooks, PDAs (personal digital assistants), Internet-connected vehicles, etc. Smart home devices may include smart televisions, smart air conditioners, smart water heaters, smart refrigerators, smart air purifiers and the like, as well as smart door locks, smart sockets, smart lights, smart cameras, etc. Network devices may include switches, wireless APs, servers, etc. Wearable devices may include smart watches, smart glasses, smart bracelets, virtual reality devices, augmented reality devices, mixed reality devices (i.e. devices that support both virtual reality and augmented reality), and so on. Smart medical devices may include smart thermometers, smart blood pressure monitors, smart blood glucose meters, and the like.
It should be understood that the number of users and machine devices in fig. 1 is merely illustrative. Any number may be selected and deployed as desired for implementation.
In the embodiment of the present specification, the prosody prediction model is trained first, and then the human-computer interaction is performed by using the prosody prediction model.
The following first describes a training method of the prosody prediction model in the embodiment of the present specification.
FIG. 2 is a flow chart of a method of training a prosody prediction model in one embodiment of this specification. The execution subject of the method is a training device for a prosody prediction model. It will be appreciated that the method may be performed by any apparatus, device, platform, or device cluster with computing and processing capabilities. Referring to FIG. 2, the method includes:
step 201: and obtaining text corpus in the target business scene.
Step 203: and carrying out normalization processing on the text corpus.
Step 205: obtaining sample corpus; the sample corpus is obtained by marking the normalized text corpus with prosody labels; the prosody tag is used to indicate a pause duration.
Step 207: and initializing the model structure and parameters of the prosody prediction model by using the model structure and parameters of the trained punctuation mark prediction model.
Step 209: and inputting the sample corpus with the prosody tags into the initialized prosody prediction model to train the prosody prediction model.
As FIG. 2 shows, the training method of this embodiment uses the punctuation prediction model to train the prosody prediction model. A punctuation prediction model adds punctuation to text content, and punctuation in practice correlates strongly with prosody: the places where speech should pause are precisely the places where punctuation is needed. Trained with the help of this capability, the prosody prediction model can, in subsequent prediction services, mark prosody labels at the positions in a text where the various punctuation marks would be placed, and thereby acquires the capability of prosody prediction.
Manually annotating a massive text corpus with prosody to obtain massive training samples would be extremely inefficient and costly. Because the method of FIG. 2 reuses the model structure and parameters of the punctuation prediction model, training the prosody prediction model does not require massive manually annotated prosody samples; a prosody prediction model that meets the service requirements can be trained from a small number of samples.
Moreover, in the method of FIG. 2 the prosody prediction model is not trained on a per-word basis: during training it learns which characters or words are followed by punctuation and which are actually followed by a prosodic pause. Even when a sentence to be broadcast is long and contains no punctuation, the trained model can place accurate prosodic pauses within the long text, so the broadcast speech sounds natural and the listener's understanding is not impaired.
Each step of the process shown in FIG. 2 is described below.
First, step 201: obtain a text corpus in the target service scenario.
For example, if the target service scenario is a customer service scenario, such as handling customer complaints, a text corpus related to customer service can be obtained in step 201 for the subsequent generation of training samples.
Next, step 203: normalize the text corpus.
In step 203, normalization is performed so that the subsequent machine processing can recognize the text corpus. The normalization may include at least one of:
removing non-Chinese-character symbols in the text corpus that do not affect semantic understanding;
converting non-Chinese-character symbols in the text corpus that do affect semantic understanding into Chinese characters with the corresponding semantics, for example converting a currency symbol such as "¥" into the Chinese word "人民币" (renminbi).
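As a minimal sketch of this normalization step (the symbol set, the mapping, and the character ranges kept below are illustrative assumptions, not rules specified by this specification), in Python:

    import re

    # Hypothetical mapping of semantically meaningful symbols to Chinese words.
    SYMBOL_TO_HANZI = {
        "¥": "人民币",  # currency symbol read out as "renminbi"
    }

    def normalize(text: str) -> str:
        # Convert symbols that affect semantic understanding into Chinese words.
        for symbol, hanzi in SYMBOL_TO_HANZI.items():
            text = text.replace(symbol, hanzi)
        # Remove remaining non-Chinese-character symbols that do not affect
        # semantics (here: keep CJK ideographs, digits, and Latin letters).
        return re.sub(r"[^\u4e00-\u9fff0-9A-Za-z]", "", text)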
Next, step 205: obtain a sample corpus; the sample corpus is obtained by marking the normalized text corpus with prosody labels, where a prosody label indicates a pause duration.
In the embodiments of this specification, a prosody label indicates a pause duration. For example, the prosody label "#1" indicates word segmentation (corresponding to the minimum pause duration), "#2" indicates a small pause (e.g. 0.2 seconds), "#3" indicates a large pause (e.g. 0.5 seconds), and "#4" indicates the end of a sentence (corresponding to the maximum pause duration, e.g. 1 second).
A prosody label may further indicate the intonation of the speech. For example, the prosody label "#5" indicates playing with a rising tone, and "#6" indicates playing with a falling tone.
In step 205, the normalized text corpus is manually marked with prosody labels. For example, for a long piece of text with no punctuation in the middle, annotators mark after which characters a prosody label should appear: which character in the long text should be followed by "#1", which by "#2", which by "#5", and so on. An example of this labeling scheme appears after this paragraph.
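For illustration, the labeling scheme above can be captured as a small mapping; the pause durations are the example values given in this description, and the labeled sample text is hypothetical:

    # Example pause durations per prosody label, using the illustrative
    # values from the description; a real system would tune these.
    PAUSE_SECONDS = {
        "#1": 0.1,  # word segmentation: minimal pause
        "#2": 0.2,  # small pause
        "#3": 0.5,  # large pause
        "#4": 1.0,  # sentence end: maximal pause
    }
    INTONATION = {"#5": "rising", "#6": "falling"}  # optional intonation labels

    # A manually labeled sample might look like this (hypothetical text):
    labeled_sample = "今天天气#2特别好#4"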
As described above, because the prosody prediction model is trained on the basis of the punctuation prediction model, massive training samples are not needed; only a small text corpus needs to be manually annotated, yielding a small number of training samples, for example a few hundred.
Next, step 207: initialize the model structure and parameters of the prosody prediction model with the model structure and parameters of the trained punctuation prediction model.
Like the punctuation prediction model, the prosody prediction model includes a text preprocessing module, a Word2Vec module, and a BiLSTM module.
In the prosody prediction model, the text preprocessing module performs the normalization on the text corpus to obtain the sample corpus;
the Word2Vec module obtains the embedding value of each word in the sample corpus;
the BiLSTM module outputs a prediction result according to the embedding value of each word in the sample corpus. The categories of prosody labels output by the BiLSTM module can be determined from the manual annotations.
In one embodiment of this specification, during step 207 the prosody prediction model may be adapted at the initialization stage according to the differences between its task and that of the punctuation prediction model. For example, if the punctuation prediction model outputs 4 categories of punctuation labels while the prosody prediction model outputs 5 categories of prosody labels, the initialization stage further includes: removing the fully connected layer in the network structure of the punctuation prediction model, and setting the categories output by the model to the categories included in the prosody labels.
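Under the assumption that the punctuation model shares the ProsodyTagger structure sketched above, this initialization might look as follows: copy the shared weights, drop the old fully connected head, and attach a freshly initialized head sized to the prosody label set:

    def init_from_punctuation_model(punct_model: ProsodyTagger,
                                    num_prosody_tags: int) -> ProsodyTagger:
        # Rebuild the same structure, but with one output per prosody label.
        vocab_size, emb_dim = punct_model.embedding.weight.shape
        model = ProsodyTagger(vocab_size, emb_dim,
                              punct_model.bilstm.hidden_size, num_prosody_tags)
        # Copy every trained weight except the old fully connected layer;
        # the new classifier head stays freshly initialized.
        shared = {k: v for k, v in punct_model.state_dict().items()
                  if not k.startswith("classifier")}
        model.load_state_dict(shared, strict=False)
        return model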
Next, step 209: input the sample corpus carrying the prosody labels into the initialized prosody prediction model to train it.
The initialized prosody prediction model already has punctuation prediction capability. After the sample corpus carrying the manually marked prosody labels is input, the model can learn the correspondence between punctuation marks and prosody labels. For example, if the model determines that a comma should follow a certain character in the sample corpus, and the manually marked prosody label at that position is "#2", then the model learns to output the prosody label "#2" wherever a comma would be added. At the same time, the model learns from the manual annotations how to mark prosody directly on the text.
Thus the prosody prediction model acquires two capabilities: predicting punctuation marks first and then deriving prosody labels from them; and predicting prosody labels directly from the text content.
After multiple such training iterations, the training of the prosody prediction model is complete.
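The training step itself is then a standard sequence-labeling loop. The sketch below assumes the model above, with characters and labels already converted to integer id tensors (all names here are illustrative):

    import torch.nn as nn
    import torch.optim as optim

    def train_prosody_model(model, batches, epochs: int = 10, lr: float = 1e-3):
        """`batches` yields (token_ids, tag_ids) LongTensors of shape (B, T)."""
        optimizer = optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for token_ids, tag_ids in batches:
                logits = model(token_ids)  # (B, T, num_tags)
                loss = loss_fn(logits.flatten(0, 1), tag_ids.flatten())
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()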
After the prosody prediction model is trained, it can be used to apply prosody processing to the speech to be played in subsequent human-computer interaction, i.e. to mark the content to be spoken with prosody labels, so that the machine device can pause according to the marked labels while playing the speech. Referring to FIG. 3, the human-computer interaction method includes:
Step 301: obtain the interactive content to be played to the user.
Step 303: input the interactive content into the pre-trained prosody prediction model to obtain interactive content with prosody labels output by the prosody prediction model.
Step 305: play the interactive content to the user as prosodic speech according to the prosody labels carried in the interactive content.
For example, if the interactive content with prosody labels is "xx#1xxxx#2xxxxxx#2xxxx#3 ...", then after playing the text before "#1" the machine device pauses for 0.1 seconds before continuing, after playing the text before each "#2" it pauses for 0.2 seconds, and so on. The machine device thereby plays speech with prosody similar to that of a person speaking.
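As a sketch of this playback step (the splitting pattern and pause table follow the examples above; `speak` is a caller-supplied placeholder for the device's text-to-speech playback, not a real API):

    import re
    import time

    PAUSE_SECONDS = {"#1": 0.1, "#2": 0.2, "#3": 0.5, "#4": 1.0}

    def play_with_prosody(tagged_text: str, speak) -> None:
        """Play text like "xx#1xxxx#2...", pausing at each prosody label."""
        # Split into alternating text fragments and prosody labels;
        # intonation labels (#5/#6) are omitted here for brevity.
        for piece in re.split(r"(#[1-4])", tagged_text):
            if piece in PAUSE_SECONDS:
                time.sleep(PAUSE_SECONDS[piece])  # pause per label
            elif piece:
                speak(piece)  # synthesize and play this fragment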
In an embodiment of this specification, a training device for a prosody prediction model is further provided. Referring to FIG. 4, the device includes:
a sample acquisition module 401 configured to obtain a text corpus in the target service scenario; normalize the text corpus; and obtain a sample corpus, where the sample corpus is obtained by marking the normalized text corpus with prosody labels, and a prosody label indicates a pause duration;
a transfer learning module 402 configured to initialize the model structure and parameters of the prosody prediction model with the model structure and parameters of the trained punctuation prediction model;
and a training execution module 403 configured to input the sample corpus carrying the prosody labels into the initialized prosody prediction model to train the prosody prediction model.
In the device embodiment shown in FIG. 4, the normalization includes at least one of: removing non-Chinese-character symbols in the text corpus that do not affect semantic understanding, and converting non-Chinese-character symbols in the text corpus that do affect semantic understanding into Chinese characters with the corresponding semantics.
In the device embodiment shown in FIG. 4, the prosody label is further used to indicate the intonation of the speech.
In the device embodiment shown in FIG. 4, the prosody labels include at least one of: word segmentation, small pause, large pause, and sentence end.
In the device embodiment shown in FIG. 4, the transfer learning module 402 is further configured to:
remove the fully connected layer in the network structure of the punctuation prediction model; and set the categories output by the model to the categories included in the prosody labels.
In the device embodiment shown in FIG. 4, the prosody prediction model includes a text preprocessing module, a Word2Vec module, and a BiLSTM module, wherein:
the text preprocessing module is used to perform the normalization on the text corpus to obtain the sample corpus;
the Word2Vec module is used to obtain the embedding value of each word in the sample corpus;
the BiLSTM module is used to output a prediction result according to the embedding value of each word in the sample corpus.
An embodiment of this specification further provides a human-computer interaction device, which is deployed in a machine device that performs human-computer interaction with a user. Referring to FIG. 5, the device includes:
an interactive content obtaining module 501 configured to obtain interactive content to be played to a user;
a prosody processing module 502 configured to input the interactive content into a pre-trained prosody prediction model to obtain interactive content with prosody labels output by the prosody prediction model, where the prosody prediction model is trained by the training device of the embodiments of this specification;
and a playing module 503 configured to play the interactive content to the user as prosodic speech according to the prosody labels carried in the interactive content.
The above devices are typically implemented on the server side; they may be deployed in separate servers, or some or all of them may be combined in the same server. The server may be a single server or a cluster of servers, and may be a cloud server (also called a cloud computing server or cloud host, a host product in a cloud computing service system). The above devices may also be implemented in a computer terminal with sufficient computing power.
An embodiment of the present specification provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of the embodiments of the specification.
An embodiment of the present specification provides a computing device including a memory having executable code stored therein and a processor that, when executing the executable code, performs a method of any of the embodiments of the present specification.
It should be understood that the structures illustrated in the embodiments of this specification do not constitute a specific limitation on the devices of the embodiments. In other embodiments, a device may include more or fewer components than illustrated, combine certain components, split certain components, or arrange components differently. The illustrated components may be implemented in hardware, software, or a combination of the two.
The embodiments in this specification are described progressively; identical or similar parts of the embodiments can be referred to one another, and each embodiment focuses on its differences from the others. In particular, the device embodiments, being substantially similar to the method embodiments, are described relatively simply; for relevant parts, refer to the description of the method embodiments.
Those skilled in the art will appreciate that in one or more of the examples above, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on, or transmitted as one or more instructions or code over, a computer-readable medium.
The foregoing specific embodiments further describe the objectives, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and do not limit its protection scope; any modification, equivalent replacement or improvement made on the basis of the technical solutions of the present invention shall fall within its protection scope.

Claims (10)

1. A training method for a prosody prediction model, wherein the method comprises:
obtaining a text corpus in a target service scenario;
normalizing the text corpus;
obtaining a sample corpus; the sample corpus is obtained by marking the normalized text corpus with prosody labels; a prosody label indicates a pause duration;
initializing the model structure and parameters of the prosody prediction model with the model structure and parameters of a trained punctuation prediction model;
and inputting the sample corpus carrying the prosody labels into the initialized prosody prediction model to train the prosody prediction model.
2. The method of claim 1, wherein the normalization comprises at least one of: removing non-Chinese-character symbols in the text corpus that do not affect semantic understanding, and converting non-Chinese-character symbols in the text corpus that do affect semantic understanding into Chinese characters with the corresponding semantics.
3. The method of claim 1, wherein the prosody label is further used to indicate an intonation of the speech.
4. The method of claim 1, wherein the prosody labels include at least one of: word segmentation, small pause, large pause, and sentence end.
5. The method of claim 1, wherein initializing the model structure and parameters of the prosody prediction model with the model structure and parameters of the trained punctuation prediction model further comprises:
removing the fully connected layer in the network structure of the punctuation prediction model; and setting the categories output by the model to the categories included in the prosody labels.
6. The method of claim 5, wherein the prosody prediction model comprises a text preprocessing module, a Word2Vec module, and a BiLSTM module, wherein:
the text preprocessing module is used to perform the normalization on the text corpus to obtain the sample corpus;
the Word2Vec module is used to obtain the embedding value of each word in the sample corpus;
the BiLSTM module is used to output a prediction result according to the embedding value of each word in the sample corpus.
7. A human-computer interaction method, wherein the method comprises:
obtaining interactive content to be played to a user;
inputting the interactive content into a pre-trained prosody prediction model to obtain interactive content with prosody labels output by the prosody prediction model; wherein the prosody prediction model is trained using the method of any one of claims 1 to 6;
and playing the interactive content to the user as prosodic speech according to the prosody labels carried in the interactive content.
8. A training device for a prosody prediction model, wherein the device comprises:
a sample acquisition module configured to obtain a text corpus in the target service scenario; normalize the text corpus; and obtain a sample corpus, where the sample corpus is obtained by marking the normalized text corpus with prosody labels, and a prosody label indicates a pause duration;
a transfer learning module configured to initialize the model structure and parameters of the prosody prediction model with the model structure and parameters of the trained punctuation prediction model;
and a training execution module configured to input the sample corpus carrying the prosody labels into the initialized prosody prediction model to train the prosody prediction model.
9. A human-computer interaction device, wherein the device comprises:
an interactive content obtaining module configured to obtain interactive content to be played to a user;
a prosody processing module configured to input the interactive content into a pre-trained prosody prediction model to obtain interactive content with prosody labels output by the prosody prediction model; wherein the prosody prediction model is trained using the device of claim 8;
and a playing module configured to play the interactive content to the user as prosodic speech according to the prosody labels carried in the interactive content.
10. A computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method of any one of claims 1 to 7.
CN202310202425.XA 2023-02-23 2023-02-23 Training method and device of prosody prediction model, and man-machine interaction method and device Pending CN116189663A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310202425.XA CN116189663A (en) 2023-02-23 2023-02-23 Training method and device of prosody prediction model, and man-machine interaction method and device

Publications (1)

Publication Number Publication Date
CN116189663A true CN116189663A (en) 2023-05-30

Family

ID=86442202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310202425.XA Pending CN116189663A (en) 2023-02-23 2023-02-23 Training method and device of prosody prediction model, and man-machine interaction method and device

Country Status (1)

Country Link
CN (1) CN116189663A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012178A (en) * 2023-07-31 2023-11-07 支付宝(杭州)信息技术有限公司 Prosody annotation data generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination