CN114302217A - Voice information generation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number: CN114302217A (granted as CN114302217B)
Authority: CN (China)
Application number: CN202111633258.1A
Other languages: Chinese (zh)
Inventor: 何思远
Assignee (current and original): Guangzhou Fanxing Huyu IT Co Ltd
Prior art keywords: target user, preset, voice, voice information
Legal status: Granted; Active

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An embodiment of the invention provides a voice information generation method and apparatus, an electronic device, and a storage medium. The method includes: after receiving a trigger instruction from a first target user triggering a preset event, acquiring preset text information corresponding to the preset event; inputting the preset text information into a pre-trained sound simulation model to generate corresponding first voice information; and sending the first voice information to the terminal of the first target user so that the terminal plays it. With this method, different preset text information can be defined for different events and different first target users, so that users entering the live room of a second target user receive more personalized voice information without disturbing the second target user's live broadcast, enriching the ways users communicate with one another.

Description

Voice information generation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to a method and an apparatus for generating voice information, an electronic device, and a storage medium.
Background
At present, users of application software have limited ways to exchange information. For example, when an anchor user of live-video software is streaming, events such as another user entering the anchor's live room or following the anchor's account are usually acknowledged either by the anchor welcoming the user aloud or by an automatic plug-in posting a welcome caption. However, welcoming every user aloud happens too frequently to be practical for the anchor and interferes with the live broadcast, while caption-based welcomes are monotonous for the other users and degrade their viewing experience.
Disclosure of Invention
The embodiment of the invention aims to provide a method and a device for generating voice information, electronic equipment and a storage medium, which are used for enriching information communication modes among users of application software.
In a first aspect, an embodiment of the present invention provides a method for generating voice information, including:
after receiving a trigger instruction from a first target user triggering a preset event, acquiring preset text information corresponding to the preset event, wherein the preset event is an event in which a watch-live option, a follow option, a comment option, a barrage option, or a like option of a live room of a second target user is triggered;
inputting the preset text information into a pre-trained sound simulation model to generate corresponding first voice information, wherein the sound simulation model is obtained by training based on sample text information and sound characteristics of the second target user;
and sending the first voice information to a terminal where a first target user is located so that the terminal can play the first voice information.
Optionally, the training method of the sound simulation model includes:
inputting sample text information into a sound simulation model to be trained, and outputting corresponding voice information;
extracting sound features from the output voice information;
determining a feature difference value between the extracted sound features and the sound features of the second target user;
if the feature difference value is smaller than a preset difference threshold, determining the current sound simulation model to be trained as the trained sound simulation model;
and if the feature difference value is not smaller than the preset difference threshold, adjusting parameters of the sound simulation model to be trained, and returning to the step of inputting the sample text information into the sound simulation model to be trained.
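The training procedure above can be sketched as a simple loop. The model interface, feature extractor, and distance function below are hypothetical placeholders, since the embodiment does not prescribe concrete implementations:

```python
def train_sound_simulation_model(model, sample_texts, target_features,
                                 extract_features, feature_distance,
                                 diff_threshold=0.1, max_rounds=1000):
    """Train until the synthesized sound features are within diff_threshold
    of the second target user's sound features (illustrative sketch)."""
    for _ in range(max_rounds):
        speech = model.synthesize(sample_texts)            # output voice information
        features = extract_features(speech)                # extract sound features
        diff = feature_distance(features, target_features) # feature difference value
        if diff < diff_threshold:                          # close enough: training done
            return model
        model.adjust_parameters(diff)                      # otherwise adjust and repeat
    return model
```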
Optionally, before the inputting the preset text information into a pre-trained sound simulation model, the method further includes:
acquiring a user name of the first target user;
generating target text information according to the user name and the preset text information;
the inputting the preset text information into a pre-trained sound simulation model to generate corresponding first voice information includes:
and inputting the target text information into a sound simulation model obtained by pre-training to generate corresponding first voice information.
Optionally, before the obtaining of the preset text information corresponding to the preset event, the method further includes:
acquiring the intimacy between the first target user and the second target user;
the acquiring of the preset text information corresponding to the preset event includes:
judging whether the intimacy is higher than a preset intimacy threshold value or not;
if the intimacy is higher than the preset intimacy threshold, searching for preset voice information corresponding to the first target user, and sending the preset voice information to the terminal of the first target user so that the terminal plays it, wherein the preset voice information is recorded in advance by the second target user for the first target user;
and if the intimacy is not higher than the preset intimacy threshold value, acquiring preset text information corresponding to the preset event.
Optionally, before the sending the first voice information to the terminal where the first target user is located, the method further includes:
determining whether an instruction, sent by the second target user, disapproving the first voice information has been received;
if so, acquiring second voice information recorded by the second target user;
calculating the similarity between the second voice information and the first voice information;
the sending the first voice message to the terminal where the first target user is located includes:
judging whether the similarity is greater than a preset similarity threshold value or not;
and if the similarity is greater than a preset similarity threshold, sending the second voice information to a terminal where the first target user is located.
Optionally, the preset event is that the first target user triggers a live watching option of a live broadcast room of the second target user;
before the terminal plays the first voice message, the method further comprises the following steps:
determining whether the second target user is in a state of speaking in a live room;
if so, adjusting the speaking volume of the second target user played by the terminal where the first target user is located to a preset volume, wherein the preset volume is smaller than the volume of the first voice information played by the terminal.
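A minimal sketch of this volume adjustment, assuming normalized volumes in [0, 1]. The function name and ducking ratio are illustrative assumptions; the embodiment only requires the anchor's speaking volume to be set below the volume of the first voice information:

```python
def playback_volumes(anchor_is_speaking, greeting_volume=0.8, duck_ratio=0.5):
    """Return (anchor_volume, greeting_volume) for the first target user's
    terminal. If the anchor is speaking, duck the anchor below the greeting."""
    if anchor_is_speaking:
        anchor_volume = greeting_volume * duck_ratio  # preset volume < greeting volume
    else:
        anchor_volume = greeting_volume               # no adjustment needed
    return anchor_volume, greeting_volume
```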
In a second aspect, an embodiment of the present invention provides an apparatus for generating voice information, including:
the first text information acquisition module is used for acquiring preset text information corresponding to a preset event after receiving a trigger instruction of a first target user for triggering the preset event, wherein the preset event is as follows: an event that a live watching option, an attention option, a comment option, a barrage option or a like option of a live broadcast room of a second target user is triggered;
the voice information generation module is used for inputting the preset text information into a pre-trained voice simulation model and generating corresponding first voice information, wherein the voice simulation model is obtained by training based on sample text information and the voice characteristics of the second target user;
and the voice information sending module is used for sending the first voice information to a terminal where a first target user is located so that the terminal can play the first voice information.
Optionally, the apparatus further includes:
the model training module is used for: inputting the sample text information into a sound simulation model to be trained and outputting corresponding voice information; extracting sound features from the output voice information; determining a feature difference value between the extracted sound features and the sound features of the second target user; if the feature difference value is smaller than a preset difference threshold, determining the current sound simulation model to be trained as the trained sound simulation model; and if the feature difference value is not smaller than the preset difference threshold, adjusting parameters of the sound simulation model to be trained and returning to the step of inputting the sample text information into the sound simulation model to be trained.
Optionally, the apparatus further includes:
the second text information acquisition module is used for acquiring the user name of the first target user; generating target text information according to the user name and the preset text information;
the voice information generation module is specifically configured to input the target text information into a pre-trained sound simulation model, and generate corresponding first voice information.
Optionally, the apparatus further includes:
the intimacy acquiring module is used for acquiring intimacy between the first target user and the second target user;
the first text information acquisition module is specifically used for judging whether the intimacy is higher than a preset intimacy threshold; if the intimacy is higher than the preset intimacy threshold, searching for preset voice information corresponding to the first target user and sending the preset voice information to the terminal of the first target user so that the terminal plays it, wherein the preset voice information is recorded in advance by the second target user for the first target user; and if the intimacy is not higher than the preset intimacy threshold, acquiring preset text information corresponding to the preset event.
Optionally, the apparatus further includes:
the similarity calculation module is used for determining whether an instruction which is sent by the second target user and does not approve the first voice message is received; if so, acquiring second voice information recorded by the second target user; calculating the similarity between the second voice information and the first voice information;
the voice information sending module is specifically configured to determine whether the similarity is greater than a preset similarity threshold; and if the similarity is greater than a preset similarity threshold, sending the second voice information to a terminal where the first target user is located.
Optionally, the preset event is that the first target user triggers a live watching option of a live broadcast room of the second target user;
the device further comprises:
the volume adjusting module is used for determining whether the second target user is in a state of speaking in a live broadcast room; if so, adjusting the speaking volume of the second target user played by the terminal where the first target user is located to a preset volume, wherein the preset volume is smaller than the volume of the first voice information played by the terminal.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor adapted to perform the method steps of any of the above first aspects when executing a program stored in the memory.
In a fourth aspect, the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method steps of any one of the above first aspects.
The embodiment of the invention has the following beneficial effects:
With the method provided by the embodiment of the invention, after a trigger instruction from a first target user triggering a preset event is received, preset text information corresponding to the preset event is acquired, the preset text information is input into a pre-trained sound simulation model to generate corresponding first voice information, and the first voice information is sent to the terminal of the first target user so that the terminal plays it. The sound simulation model is trained in advance on sample text information and the sound features of the second target user, so it converts text information into voice information carrying the second target user's sound features; in other words, voice information imitating the second target user's voice can be sent to a first target user entering the second target user's live room. Moreover, the preset text information corresponding to each preset event can be customized, that is, different preset text information can be defined for different events and different first target users. As a result, users entering the second target user's live room receive more personalized voice information without disturbing the second target user's live broadcast, enriching the ways users communicate with one another.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
To describe the embodiments of the present invention and the prior art more clearly, the drawings needed in their description are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other embodiments from them.
Fig. 1 is a flowchart of a method for generating voice information according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for training a sound simulation model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a voice message generating apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived from the embodiments given herein by one of ordinary skill in the art, are within the scope of the invention.
In order to enrich the information communication mode among users of application software, the embodiment of the invention provides a method and a device for generating voice information, electronic equipment, a computer readable storage medium and a computer program product.
First, a method for generating voice information according to an embodiment of the present invention will be described below. The method for generating the voice information provided by the embodiment of the invention can be applied to any electronic equipment with a video live broadcast function, and is not particularly limited here.
Fig. 1 is a flowchart of a method for generating voice information according to an embodiment of the present invention, as shown in fig. 1, the method includes:
step 101, after receiving a trigger instruction for triggering a preset event by a first target user, acquiring preset text information corresponding to the preset event.
Wherein the preset event is an event in which a watch-live option, a follow option, a comment option, a barrage option, or a like option of the second target user's live room is triggered.
Step 102, inputting preset text information into a pre-trained sound simulation model, and generating corresponding first voice information.
The sound simulation model is trained based on the sample text information and the sound features of the second target user.
Step 103, sending the first voice message to the terminal where the first target user is located, so that the terminal plays the first voice message.
With the method provided by the embodiment of the invention, after a trigger instruction from a first target user triggering a preset event is received, preset text information corresponding to the preset event is acquired, the preset text information is input into a pre-trained sound simulation model to generate corresponding first voice information, and the first voice information is sent to the terminal of the first target user so that the terminal plays it. The sound simulation model is trained in advance on sample text information and the sound features of the second target user, so it converts text information into voice information carrying the second target user's sound features; in other words, voice information imitating the second target user's voice can be sent to a first target user entering the second target user's live room. Moreover, the preset text information corresponding to each preset event can be customized, that is, different preset text information can be defined for different events and different first target users. As a result, users entering the second target user's live room receive more personalized voice information without disturbing the second target user's live broadcast, enriching the ways users communicate with one another.
In the embodiment of the present invention, the second target user may be an anchor user of the live-video application software, and the first target user may be a user entering the second target user's live room to watch the live broadcast.
Preset events triggered by the first target user include, but are not limited to: the first target user entering the second target user's live room to watch the live broadcast, the first target user following the second target user's account, the first target user entering the live room and commenting on the broadcast, the first target user entering the live room and sending a barrage about the broadcast, or the first target user entering the live room and liking the broadcast.
For example, user A and user B are both users of live-video application software X, user B is an anchor user, and user A does not yet follow user B's account in X. When user B streams in a live room, user A sees on the display interface of user A's terminal a "follow" option, a "comment" option, a "barrage" option, a "watch live" option, a "like" option, and a "send gift" option for user B's live broadcast. Clicking "follow" means user A follows user B's account; clicking "comment" means user A can enter user B's live room and comment on the broadcast; clicking "barrage" means user A can enter the live room and send a barrage about the broadcast; clicking "watch live" means user A can enter the live room and watch user B's broadcast; clicking "like" means user A can enter the live room and like the broadcast; and clicking "send gift" means user A can enter the live room and tip user B with a gift.
If user A triggers the follow, comment, barrage, watch-live, like, or send-gift option for user B on the display interface of user A's terminal, the terminal receives a trigger instruction triggering the corresponding preset event and sends it to the electronic device executing the voice information generation method. After receiving the trigger instruction, the electronic device acquires the preset text information corresponding to the preset event.
In the embodiment of the present invention, text information corresponding to each preset event may be set in advance, and different preset events may correspond to different preset text information. For example: if user A triggers the "follow" option for user B, the corresponding text may be "From now on we are family"; if user A triggers the "comment" option, the corresponding text may be "Thanks for your comment"; if user A triggers the "like" option, the corresponding text may be "Thanks for the like"; if user A triggers the "barrage" option, the corresponding text may be "Thanks for the barrage"; if user A triggers the "watch live" option, the corresponding text may be "Follow me so you don't get lost"; and if user A triggers the "send gift" option and tips user B, the corresponding text may be "Thanks for the gift".
If user A already follows user B, that is, user A is a fan of user B, whose user name is XXX, then when user A enters user B's live room the corresponding text may be "Family is back to see XXX"; and when user A enters the live room and tips user B, the corresponding text may be "Thanks, family, for the gift".
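The event-to-text examples above amount to a lookup table. A minimal sketch follows; the event keys and the default message are assumptions rather than part of the embodiment:

```python
# Hypothetical mapping from preset events to preset text information,
# mirroring the examples given in the description.
PRESET_TEXTS = {
    "follow": "From now on we are family",
    "comment": "Thanks for your comment",
    "like": "Thanks for the like",
    "barrage": "Thanks for the barrage",
    "watch_live": "Follow me so you don't get lost",
    "gift": "Thanks for the gift",
}

def preset_text_for(event):
    """Look up the preset text for a triggered event (default is assumed)."""
    return PRESET_TEXTS.get(event, "Welcome to the live room")
```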
In a possible embodiment, before the preset text information is input into the pre-trained sound simulation model, the method further includes the following steps A1-A3:
step a1, obtain the user name of the first target user.
And step A2, generating target text information according to the user name and the preset text information.
The method includes inputting preset text information into a pre-trained sound simulation model to generate corresponding first voice information, and specifically may be: and inputting the target text information into a sound simulation model obtained by pre-training to generate corresponding first voice information.
Specifically, target text information including the user name of the first target user may be generated from the user name and the preset text information. For example, suppose the first target user already follows the second target user and has the user name "XXYY". When the first target user enters the second target user's live room and sends a gift, the text corresponding to the preset event may be "Thanks for the gift", and the target text information "Thanks for the gift, XXYY" may be generated from the user name "XXYY" and the preset text "Thanks for the gift".
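Steps A1-A2 can be sketched as simple string composition. The comma-joined pattern below is an assumption, since the embodiment only gives the example above:

```python
def build_target_text(user_name, preset_text):
    """Combine the first target user's name with the preset text (step A2).
    The joining pattern is illustrative, not prescribed by the patent."""
    return f"{preset_text}, {user_name}"
```

The resulting target text would then be fed to the sound simulation model in place of the bare preset text.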
In a possible implementation manner, before the obtaining of the preset text information corresponding to the preset event, the intimacy between the first target user and the second target user may also be obtained.
Specifically, the following information may be obtained: whether the first target user follows the second target user, the number of comments the first target user has sent in the second target user's live room, the number of barrages sent there, the value of gifts given there, the live-watching duration there, and the number of likes given there. The intimacy between the first target user and the second target user may then be determined using the following formula:
Q(A,B)=a*y1+b*y2+c*y3+d*y4+e*y5+f*y6
wherein y1 indicates whether the first target user follows the second target user (y1 is 1 if so and 0 if not), and a is the preset weight for y1; y2 is the number of comments the first target user has sent in the second target user's live room, and b is the preset weight for y2; y3 is the number of barrages sent there, and c is the preset weight for y3; y4 is the value of gifts given there, and d is the preset weight for y4; y5 is the live-watching duration there, and e is the preset weight for y5; y6 is the number of likes there, and f is the preset weight for y6. The weights a, b, c, d, e, and f may be set according to the actual application scenario and are not specifically limited here.
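The intimacy formula is a plain weighted sum and can be computed directly; any concrete weight values are application-chosen assumptions:

```python
def intimacy(y, weights):
    """Q(A, B) = a*y1 + b*y2 + c*y3 + d*y4 + e*y5 + f*y6.
    y = (y1..y6): follow flag, comment count, barrage count, gift value,
    watch duration, like count; weights = (a..f) are application-chosen."""
    return sum(w * v for w, v in zip(weights, y))
```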
The acquiring of the preset text information corresponding to the preset event may specifically include the following steps B1-B3:
and step B1, judging whether the intimacy degree is higher than a preset intimacy degree threshold value.
The preset intimacy threshold may be set according to an actual application scenario, and is not specifically limited herein.
Step B2, if the intimacy degree is higher than the preset intimacy degree threshold value, searching preset voice information corresponding to the first target user, and sending the preset voice information to the terminal where the first target user is located, so that the terminal plays the preset voice information.
The preset voice information is recorded in advance by the second target user for the first target user. For example, the second target user may record in advance, for each user whose intimacy is higher than the preset intimacy threshold, a voice message such as "Thanks for your support, family, don't get lost", and store it in correspondence with that user's identity information. When the obtained intimacy between the first target user and the second target user is higher than the preset intimacy threshold, the preset voice information corresponding to the first target user can be looked up and sent to the first target user's terminal, so that the terminal plays it.
And step B3, if the intimacy degree is not higher than the preset intimacy degree threshold value, acquiring preset text information corresponding to the preset event.
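Steps B1-B3 reduce to a threshold check with a lookup. The interfaces below (a dict of pre-recorded clips and a synthesize callback) are illustrative assumptions:

```python
def greeting_for(user_id, intimacy_score, intimacy_threshold,
                 recorded_voice_by_user, preset_text, synthesize):
    """Above the threshold, use the voice the second target user pre-recorded
    for this user; otherwise synthesize from the preset text (steps B1-B3)."""
    if intimacy_score > intimacy_threshold:
        recorded = recorded_voice_by_user.get(user_id)
        if recorded is not None:
            return recorded             # pre-recorded personal greeting
    return synthesize(preset_text)      # fall back to the sound simulation model
```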
With the method provided by this embodiment of the invention, the second target user can customize exclusive preset voice information for the first target user according to their intimacy, which makes the voice information more targeted and improves the interaction effect. This in turn improves the first target user's live-viewing experience and increases the engagement of the second target user's fans.
In a possible implementation manner, before the sending the first voice information to the terminal where the first target user is located, the following steps C1-C3 are further included:
step C1, it is determined whether an instruction sent by the second target user not to approve the first voice message is received.
And step C2, if yes, acquiring second voice information recorded by the second target user.
And step C3, calculating the similarity between the second voice information and the first voice information.
In this embodiment of the invention, if the second target user is not satisfied with the first voice information generated by the sound simulation model, the second target user may at any time send an instruction disapproving the first voice information to the electronic device through his or her terminal, together with re-recorded second voice information.
The sending the first voice information to the terminal where the first target user is located may specifically include: judging whether the similarity is greater than a preset similarity threshold value or not; and if the similarity is greater than a preset similarity threshold, sending the second voice information to the terminal where the first target user is located.
The preset similarity threshold may be set to 80% or 90% according to the actual application, and is not specifically limited herein.
For example, suppose the first voice information is a synthesized "thank you for the gift" and the second voice information is a re-recorded "thank you for the gift". The similarity between the two may be calculated; if it is greater than the preset similarity threshold, the second voice information is sent to the terminal where the first target user is located, and if it is not greater than the preset similarity threshold, the first voice information is sent instead.
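Steps C1-C3 and the sending decision can be sketched as below. A real system would compare audio features; here text transcripts stand in, compared with `difflib` from the Python standard library. The function name and threshold value are illustrative assumptions.

```python
# Hypothetical sketch of steps C1-C3 plus the sending decision.
# Text transcripts stand in for audio; difflib provides the similarity.
import difflib
from typing import Optional

SIMILARITY_THRESHOLD = 0.8  # e.g. 80%, as in the embodiment

def choose_message(first: str, second: Optional[str]) -> str:
    """Send the re-recorded second message only if it is similar enough
    to the generated first message; otherwise keep the generated one."""
    if second is None:  # step C1: no disapproval instruction received
        return first
    # step C3: similarity between the two messages
    similarity = difflib.SequenceMatcher(None, first, second).ratio()
    return second if similarity > SIMILARITY_THRESHOLD else first
```

The similarity gate keeps a hastily re-recorded message from replacing the generated one unless it actually says roughly the same thing.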
With the method provided by this embodiment of the invention, the second target user can adjust the voice information for the first target user at any time according to the voice effect produced by the sound simulation model, which improves the first target user's experience of watching the live broadcast and increases the fan activity of the second target user.
In one possible implementation, if the preset event is that the first target user triggers the live watching option of the live broadcast room of the second target user, the following steps D1-D2 may further be included before the first voice information is played at the terminal where the first target user is located:
step D1, it is determined whether the second target user is in a state of speaking in the live room.
And D2, if yes, adjusting the speaking volume of the second target user played by the terminal where the first target user is located to a preset volume.
The preset volume is smaller than the volume at which the terminal plays the first voice information; its value can be set according to the actual scene.
Specifically, after the first target user triggers a preset event, the terminal where the first target user is located may send an event request to the electronic device and receive the first voice information the electronic device generates. If the second target user is speaking in the live broadcast room at that moment, that speech may interfere with playback of the first voice information; the speaking volume of the second target user played by the terminal can therefore be adjusted to the preset volume, so that it is lower than the volume of the first voice information and information interference is avoided.
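The ducking in steps D1-D2 can be sketched as follows. The constants and function name are illustrative assumptions; the patent only requires that the preset volume be lower than the greeting's volume.

```python
# Hypothetical sketch of steps D1-D2: lower (duck) the streamer's speech
# on the viewer's terminal while the generated greeting plays.

DUCKED_VOLUME = 0.2    # the preset volume, below the greeting's volume
NORMAL_VOLUME = 1.0
GREETING_VOLUME = 0.8  # volume at which the first voice information plays

def playback_volumes(streamer_speaking: bool):
    """Return (streamer_volume, greeting_volume) for the viewer's terminal."""
    if streamer_speaking:
        # step D2: duck the streamer below the greeting to avoid interference
        return DUCKED_VOLUME, GREETING_VOLUME
    # step D1 negative branch: nothing to duck
    return NORMAL_VOLUME, GREETING_VOLUME
```

This is the same "audio ducking" pattern used by navigation prompts over music playback: the foreground message wins while it plays, and normal volume resumes afterwards.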
With the method provided by this embodiment of the invention, information interference can be avoided by adjusting the speaking volume of the second target user played by the terminal where the first target user is located, which further improves the first target user's experience of watching the live broadcast and increases the fan activity of the second target user.
In a possible implementation, if the first target user is a fan of the second target user, then when the number of gifts the first target user sends while watching the second target user's live content reaches a preset number, the first target user may leave a message for the second target user by recording voice or video, enhancing the information interaction between the first target user and the second target user. The preset number may be set to 100 or 1000, and the like, and is not limited here.
In a possible implementation manner, fig. 2 is a flowchart of a training method of a sound simulation model according to an embodiment of the present invention, where the training method of the sound simulation model may include:
step 201, inputting the sample text information into the sound simulation model to be trained, and outputting the corresponding voice information.
The sound simulation model to be trained may be a vocoder or another model that converts text information into corresponding voice information.
Step 202, extracting the sound features of the voice information.
Specifically, the speech rate feature and the timbre feature of the voice information may be extracted.
Step 203, determining a feature difference value between the sound feature and the sound feature of the second target user.
The sound features of the second target user include the timbre feature and the speech rate feature of the second target user.
In this embodiment of the invention, a voice audio sample of the second target user can be collected in advance, and its timbre feature and speech rate feature extracted as the sound features of the second target user.
Step 204, if the feature difference value is smaller than a preset difference threshold value, determining the current sound simulation model to be trained as a sound simulation model obtained by training;
the preset difference threshold may be set according to practical applications, and is not specifically limited herein.
Step 205, if the feature difference value is not smaller than the preset difference threshold, adjusting parameters of the sound simulation model to be trained, and returning to the step of inputting the sample text information into the sound simulation model to be trained.
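Steps 201-205 describe a threshold-based training loop. A schematic sketch follows; it is purely illustrative, since the patent does not specify the model architecture, the feature extractor, or the parameter-update rule, all of which are stand-ins here.

```python
# Schematic sketch of the training loop in steps 201-205.
# Model, feature extractor, and update rule are illustrative stand-ins.

def extract_features(audio):
    """Stand-in for step 202: e.g. a (speech_rate, timbre) vector."""
    return audio  # identity mapping, for illustration only

def feature_difference(a, b):
    """Step 203: a scalar distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def train(model_params, target_features, diff_threshold=0.1,
          step=0.5, max_iters=100):
    for _ in range(max_iters):
        audio = model_params                # step 201: synthesize (stand-in)
        feats = extract_features(audio)     # step 202: extract sound features
        diff = feature_difference(feats, target_features)  # step 203
        if diff < diff_threshold:           # step 204: close enough -> done
            return model_params
        # step 205: adjust parameters toward the target and loop again
        model_params = [p + step * (t - p)
                        for p, t in zip(model_params, target_features)]
    return model_params
```

Each pass halves the gap to the target features, so the loop terminates once the feature difference drops below the preset threshold, mirroring the convergence test of step 204.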
With the method provided by this embodiment of the invention, a sound simulation model can be trained in advance from the sample text information and the sound features of the second target user. The model converts text information into voice information bearing the second target user's sound features; in other words, voice information imitating the second target user's own voice can be sent to a first target user entering the second target user's live broadcast room. Moreover, since the preset text information corresponding to a preset event can be customized per event, different preset text information can be defined for different entering users. This provides more personalized voice information to users entering the second target user's live broadcast room without interrupting the second target user's live broadcast, and enriches the ways in which users exchange information.
Corresponding to the method for generating the voice information, the embodiment of the invention also provides a device for generating the voice information. The following describes a speech information generating apparatus according to an embodiment of the present invention. As shown in fig. 3, an apparatus for generating voice information, the apparatus comprising:
the first text information obtaining module 301 is configured to, after receiving a trigger instruction that a first target user triggers a preset event, obtain preset text information corresponding to the preset event, where the preset event is: an event that a live watching option, an attention option, a comment option, a barrage option or a like option of a live broadcast room of a second target user is triggered;
the voice information generating module 302 is configured to input the preset text information into a pre-trained voice simulation model, and generate corresponding first voice information, where the voice simulation model is obtained by training based on sample text information and voice features of the second target user;
a voice message sending module 303, configured to send the first voice message to a terminal where a first target user is located, so that the terminal plays the first voice message.
With the apparatus provided by this embodiment of the invention, after a trigger instruction of a first target user triggering a preset event is received, preset text information corresponding to the preset event is acquired and input into the pre-trained sound simulation model to generate corresponding first voice information, which is sent to the terminal where the first target user is located so that the terminal plays it. Because the sound simulation model is trained in advance from sample text information and the sound features of the second target user, it converts text information into voice information bearing the second target user's sound features; that is, voice information imitating the second target user's own voice can be sent to a first target user entering the second target user's live broadcast room. Since the preset text information corresponding to a preset event can be customized per event, different preset text information can be defined for different entering users, providing more personalized voice information without interrupting the second target user's live broadcast and enriching the ways in which users exchange information.
Optionally, the apparatus further comprises:
a model training module (not shown in the figure) for inputting the sample text information into the sound simulation model to be trained and outputting corresponding voice information; extracting sound features of the voice information; determining a feature difference value between the sound features and the sound features of the second target user; if the feature difference value is smaller than a preset difference threshold, determining the current sound simulation model to be trained as the sound simulation model obtained by training; and if the feature difference value is not smaller than the preset difference threshold, adjusting parameters of the sound simulation model to be trained and returning to the step of inputting the sample text information into the sound simulation model to be trained.
Optionally, the apparatus further comprises:
a second text information obtaining module (not shown in the figure) for obtaining the user name of the first target user; generating target text information according to the user name and the preset text information;
the voice information generating module 302 is specifically configured to input the target text information into a pre-trained sound simulation model, and generate corresponding first voice information.
Optionally, the apparatus further comprises:
an affinity obtaining module (not shown in the figure) for obtaining an affinity between the first target user and the second target user;
the first text information obtaining module 301 is specifically configured to determine whether the intimacy degree is higher than a preset intimacy degree threshold; if the intimacy is higher than the preset intimacy threshold value, searching preset voice information corresponding to the first target user, and sending the preset voice information to a terminal where the first target user is located, so that the terminal plays the preset voice information; and if the intimacy is not higher than the preset intimacy threshold value, acquiring preset text information corresponding to the preset event, wherein the preset voice information is recorded by the second target user aiming at the first target user in advance.
Optionally, the apparatus further comprises:
a similarity calculation module (not shown in the figure) for determining whether an instruction for disapproval of the first voice message sent by the second target user is received; if so, acquiring second voice information recorded by the second target user; calculating the similarity between the second voice information and the first voice information;
the voice information sending module 303 is specifically configured to determine whether the similarity is greater than a preset similarity threshold; and if the similarity is greater than a preset similarity threshold, sending the second voice information to a terminal where the first target user is located.
Optionally, the preset event is that the first target user triggers a live watching option of a live broadcast room of the second target user;
the device further comprises:
a volume adjustment module (not shown) for determining whether the second target user is in a state of speaking in a live room; if so, adjusting the speaking volume of the second target user played by the terminal where the first target user is located to a preset volume, wherein the preset volume is smaller than the volume of the first voice information played by the terminal.
An embodiment of the present invention further provides an electronic device, as shown in fig. 4, including a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 complete mutual communication through the communication bus 404,
a memory 403 for storing a computer program;
the processor 401 is configured to implement the steps of any of the above-described voice information generating methods when executing the program stored in the memory 403.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.
In still another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any one of the above-mentioned voice information generating methods.
In a further embodiment, the present invention also provides a computer program product containing instructions, which when run on a computer, causes the computer to execute any of the above-described methods for generating speech information.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, the electronic device, the computer-readable storage medium, and the computer program product embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and in relation to them, reference may be made to the partial description of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. A method for generating voice information, comprising:
after a trigger instruction of a first target user for triggering a preset event is received, acquiring preset text information corresponding to the preset event, wherein the preset event is as follows: an event that a live watching option, an attention option, a comment option, a barrage option or a like option of a live broadcast room of a second target user is triggered;
inputting the preset text information into a pre-trained sound simulation model to generate corresponding first voice information, wherein the sound simulation model is obtained by training based on sample text information and sound characteristics of the second target user;
and sending the first voice information to a terminal where a first target user is located so that the terminal can play the first voice information.
2. The method of claim 1, wherein the method of training the acoustic simulation model comprises:
inputting sample text information into a to-be-trained sound simulation model, and outputting corresponding voice information;
extracting sound characteristics of the voice information;
determining a feature difference value between the voice feature and the voice feature of the second target user;
if the characteristic difference value is smaller than a preset difference threshold value, determining the current sound simulation model to be trained as the sound simulation model obtained by training;
and if the characteristic difference value is not smaller than a preset difference threshold value, adjusting parameters of the sound simulation model to be trained, and returning to the step of inputting the sample text information into the sound simulation model to be trained.
3. The method of claim 1, prior to inputting the predetermined text information into a pre-trained acoustic simulation model, further comprising:
acquiring a user name of the first target user;
generating target text information according to the user name and the preset text information;
the inputting the preset text information into a pre-trained sound simulation model to generate corresponding first voice information includes:
and inputting the target text information into a sound simulation model obtained by pre-training to generate corresponding first voice information.
4. The method according to claim 1, further comprising, before the obtaining of the preset text information corresponding to the preset event:
acquiring the intimacy between the first target user and the second target user;
the acquiring of the preset text information corresponding to the preset event includes:
judging whether the intimacy is higher than a preset intimacy threshold value or not;
if the intimacy is higher than the preset intimacy threshold value, searching preset voice information corresponding to the first target user, and sending the preset voice information to a terminal where the first target user is located so that the terminal can play the preset voice information, wherein the preset voice information is recorded by the second target user aiming at the first target user in advance;
and if the intimacy is not higher than the preset intimacy threshold value, acquiring preset text information corresponding to the preset event.
5. The method according to claim 1, wherein before sending the first voice message to the terminal where the first target user is located, further comprising:
determining whether an instruction sent by the second target user and not approving the first voice message is received;
if so, acquiring second voice information recorded by the second target user;
calculating the similarity between the second voice information and the first voice information;
the sending the first voice message to the terminal where the first target user is located includes:
judging whether the similarity is greater than a preset similarity threshold value or not;
and if the similarity is greater than a preset similarity threshold, sending the second voice information to a terminal where the first target user is located.
6. The method of claim 1, wherein the preset event is that the first target user triggers a live watching option of a live room of the second target user;
before the terminal plays the first voice message, the method further comprises the following steps:
determining whether the second target user is in a state of speaking in a live room;
if so, adjusting the speaking volume of the second target user played by the terminal where the first target user is located to a preset volume, wherein the preset volume is smaller than the volume of the first voice information played by the terminal.
7. An apparatus for generating speech information, comprising:
the first text information acquisition module is used for acquiring preset text information corresponding to a preset event after receiving a trigger instruction of a first target user for triggering the preset event, wherein the preset event is as follows: an event that a live watching option, an attention option, a comment option, a barrage option or a like option of a live broadcast room of a second target user is triggered;
the voice information generation module is used for inputting the preset text information into a pre-trained voice simulation model and generating corresponding first voice information, wherein the voice simulation model is obtained by training based on sample text information and the voice characteristics of the second target user;
and the voice information sending module is used for sending the first voice information to a terminal where a first target user is located so that the terminal can play the first voice information.
8. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1-6 when executing a program stored in the memory.
9. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 6.
CN202111633258.1A 2021-12-29 2021-12-29 Voice information generation method and device, electronic equipment and storage medium Active CN114302217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111633258.1A CN114302217B (en) 2021-12-29 2021-12-29 Voice information generation method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN114302217A true CN114302217A (en) 2022-04-08
CN114302217B CN114302217B (en) 2024-01-05

Family

ID=80970710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111633258.1A Active CN114302217B (en) 2021-12-29 2021-12-29 Voice information generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114302217B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106878820A (en) * 2016-12-09 2017-06-20 北京小米移动软件有限公司 Living broadcast interactive method and device
CN108337568A (en) * 2018-02-08 2018-07-27 北京潘达互娱科技有限公司 A kind of information replies method, apparatus and equipment
CN109286822A (en) * 2018-10-19 2019-01-29 广州虎牙科技有限公司 Interactive approach, device, equipment and storage medium based on live video identification
WO2020134841A1 (en) * 2018-12-28 2020-07-02 广州市百果园信息技术有限公司 Live broadcast interaction method and apparatus, and system, device and storage medium
WO2021008223A1 (en) * 2019-07-15 2021-01-21 北京字节跳动网络技术有限公司 Information determination method and apparatus, and electronic device
CN112818674A (en) * 2021-01-29 2021-05-18 广州繁星互娱信息科技有限公司 Live broadcast information processing method, device, equipment and medium
CN113766253A (en) * 2021-01-04 2021-12-07 北京沃东天骏信息技术有限公司 Live broadcast method, device, equipment and storage medium based on virtual anchor


Also Published As

Publication number Publication date
CN114302217B (en) 2024-01-05

Similar Documents

Publication Publication Date Title
CN110446115B (en) Live broadcast interaction method and device, electronic equipment and storage medium
CN104869467B (en) Information output method, device and system in media play
CN109788345B (en) Live broadcast control method and device, live broadcast equipment and readable storage medium
CN109979450B (en) Information processing method and device and electronic equipment
CN108696765B (en) Auxiliary input method and device in video playing
US12022136B2 (en) Techniques for providing interactive interfaces for live streaming events
CN111294606B (en) Live broadcast processing method and device, live broadcast client and medium
CN111629253A (en) Video processing method and device, computer readable storage medium and electronic equipment
WO2021169432A1 (en) Data processing method and apparatus of live broadcast application, electronic device and storage medium
JP2023515897A (en) Correction method and apparatus for voice dialogue
WO2019047850A1 (en) Identifier displaying method and device, request responding method and device
WO2021184794A1 (en) Method and apparatus for determining skill domain of dialogue text
JP6998452B2 (en) Server systems and methods for non-linear content presentation and experience management
US11967338B2 (en) Systems and methods for a computerized interactive voice companion
CN113301352B (en) Automatic chat during video playback
CN114302217B (en) Voice information generation method and device, electronic equipment and storage medium
CN113301362B (en) Video element display method and device
WO2023060759A1 (en) Video pushing method, device, and storage medium
US20220051669A1 (en) Information processing device, information processing method, computer program, and interaction system
JP7497002B1 (en) Systems and methods for streaming - Patents.com
JP7469769B1 (en) Method, computer device and computer-readable recording medium for helping streamers interact with their viewers
US11295724B2 (en) Sound-collecting method, device and computer storage medium
US20240146979A1 (en) System, method and computer-readable medium for live streaming recommendation
CN113613033B (en) Live broadcast interaction method and device for audience and anchor, electronic equipment and medium
CN111901622B (en) Control method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant