CN114302217B - Voice information generation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114302217B
Authority
CN
China
Prior art keywords
target user
preset
voice information
user
information
Prior art date
Legal status
Active
Application number
CN202111633258.1A
Other languages
Chinese (zh)
Other versions
CN114302217A (en)
Inventor
何思远
Current Assignee
Guangzhou Fanxing Huyu IT Co Ltd
Original Assignee
Guangzhou Fanxing Huyu IT Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Fanxing Huyu IT Co Ltd
Priority to CN202111633258.1A
Publication of CN114302217A
Application granted
Publication of CN114302217B
Legal status: Active

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An embodiment of the invention provides a voice information generation method and apparatus, an electronic device, and a storage medium. The method includes: after receiving a trigger instruction indicating that a first target user has triggered a preset event, acquiring preset text information corresponding to the preset event; inputting the preset text information into a pre-trained sound simulation model to generate corresponding first voice information; and sending the first voice information to the terminal where the first target user is located, so that the terminal plays the first voice information. With this method, different preset text information can be defined for different first target users, so that relatively personalized voice information is provided to users entering the live room of a second target user without affecting the second target user's broadcast, enriching the modes of information exchange among users.

Description

Voice information generation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to a method and apparatus for generating voice information, an electronic device, and a storage medium.
Background
At present, the modes of information exchange among users of application software are limited. For example, when an anchor user of live video software is broadcasting and other users enter the anchor's live room or follow the anchor's account, the anchor usually welcomes them either by speaking directly or by having an automated plug-in send welcome subtitles. However, welcoming users directly by speech is mechanical and too frequent for the anchor and interferes with the broadcast, while welcoming by subtitles is dull for the other users and degrades their viewing experience.
Disclosure of Invention
The embodiment of the invention aims to provide a voice information generation method and apparatus, an electronic device, and a storage medium, so as to enrich the modes of information exchange among users of application software.
In a first aspect, an embodiment of the present invention provides a method for generating voice information, including:
after a trigger instruction indicating that a first target user has triggered a preset event is received, acquiring preset text information corresponding to the preset event, wherein the preset event is: an event in which a watch-live option, a follow option, a comment option, a bullet-screen option, or a like option of a live room of a second target user is triggered;
inputting the preset text information into a pre-trained sound simulation model to generate corresponding first voice information, wherein the sound simulation model is obtained by training based on sample text information and sound characteristics of the second target user;
and sending the first voice information to a terminal where the first target user is located, so that the terminal plays the first voice information.
Optionally, the training method of the sound simulation model includes:
inputting the sample text information into a sound simulation model to be trained, and outputting corresponding voice information;
extracting sound characteristics of the voice information;
determining a feature difference value between the sound feature and the sound feature of the second target user;
if the characteristic difference value is smaller than a preset difference threshold value, determining a current sound simulation model to be trained as the sound simulation model obtained through training;
and if the characteristic difference value is not smaller than a preset difference threshold value, adjusting parameters of the sound simulation model to be trained, and returning to the step of inputting the sample text information into the sound simulation model to be trained.
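The training loop in the steps above can be sketched as follows. This is a minimal, hypothetical illustration: a toy one-parameter "model" stands in for the sound simulation model, and a scalar stands in for the extracted sound characteristics, since the patent does not specify the model architecture, the feature extractor, or the threshold value.

```python
class ToySoundModel:
    """Toy stand-in for the sound simulation model to be trained:
    one scalar parameter plays the role of the model's sound
    characteristics (the real model is not specified by the patent)."""
    def __init__(self, feature=0.0, lr=0.5):
        self.feature = feature
        self.lr = lr

    def synthesize(self, text):
        # A real model would synthesize audio from text; here we just
        # expose the scalar feature of the generated "voice".
        return self.feature

    def adjust(self, target):
        # Stand-in for adjusting the model parameters.
        self.feature += self.lr * (target - self.feature)


def train(model, sample_text, target_feature, diff_threshold=1e-3, max_iters=100):
    """Training method loop: synthesize voice information, extract its
    sound characteristics, compare with the second target user's
    characteristics, and stop once the feature difference value is
    smaller than the preset difference threshold."""
    for _ in range(max_iters):
        feature = model.synthesize(sample_text)   # output voice info, extract features
        diff = abs(feature - target_feature)      # feature difference value
        if diff < diff_threshold:
            return model                          # training complete
        model.adjust(target_feature)              # adjust parameters, repeat
    return model

trained = train(ToySoundModel(), "sample text", target_feature=1.0)
```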
Optionally, before the inputting the preset text information into the pre-trained sound simulation model, the method further includes:
acquiring a user name of the first target user;
generating target text information according to the user name and the preset text information;
the inputting the preset text information into a pre-trained sound simulation model to generate corresponding first voice information includes:
and inputting the target text information into a pre-trained sound simulation model to generate corresponding first voice information.
Optionally, before the acquiring the preset text information corresponding to the preset event, the method further includes:
acquiring an affinity between the first target user and the second target user;
the acquiring the preset text information corresponding to the preset event includes the following steps:
judging whether the affinity is higher than a preset affinity threshold;
if the affinity is higher than the preset affinity threshold, searching for preset voice information corresponding to the first target user and sending the preset voice information to the terminal where the first target user is located, so that the terminal plays the preset voice information, wherein the preset voice information is recorded in advance by the second target user for the first target user;
and if the affinity is not higher than the preset affinity threshold, acquiring the preset text information corresponding to the preset event.
Optionally, before the sending the first voice information to the terminal where the first target user is located, the method further includes:
determining whether an instruction disapproving the first voice information, sent by the second target user, is received;
if yes, obtaining second voice information recorded by the second target user;
calculating the similarity between the second voice information and the first voice information;
the sending the first voice information to the terminal where the first target user is located includes:
judging whether the similarity is larger than a preset similarity threshold value or not;
and if the similarity is larger than a preset similarity threshold, sending the second voice information to a terminal where the first target user is located.
Optionally, the preset event is that the first target user triggers a watch-live option of the live room of the second target user;
before the terminal plays the first voice information, the method further comprises:
determining whether the second target user is in a state of speaking in a live room;
if so, adjusting the speaking volume of the second target user played by the terminal where the first target user is located to a preset volume, wherein the preset volume is smaller than the volume of the first voice information played by the terminal.
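The volume adjustment described above can be sketched as follows; the function, the volume scale, and the default values are hypothetical, since the claim only requires that the ducked speaking volume be lower than the playback volume of the first voice information.

```python
def playback_volumes(anchor_speaking: bool,
                     greeting_volume: float = 0.8,
                     ducked_volume: float = 0.3):
    """Return (anchor_volume, greeting_volume) for the first target
    user's terminal while the generated greeting is playing.
    Volume values on a 0.0-1.0 scale are illustrative placeholders."""
    if anchor_speaking:
        # Duck the anchor's live audio to a preset volume that is
        # smaller than the volume of the first voice information.
        assert ducked_volume < greeting_volume
        return ducked_volume, greeting_volume
    return greeting_volume, greeting_volume
```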
In a second aspect, an embodiment of the present invention provides a device for generating voice information, including:
the first text information acquisition module is used for acquiring preset text information corresponding to a preset event after receiving a trigger instruction indicating that a first target user has triggered the preset event, wherein the preset event is: an event in which a watch-live option, a follow option, a comment option, a bullet-screen option, or a like option of a live room of a second target user is triggered;
the voice information generation module is used for inputting the preset text information into a pre-trained sound simulation model to generate corresponding first voice information, wherein the sound simulation model is obtained by training based on sample text information and sound characteristics of the second target user;
and the voice information sending module is used for sending the first voice information to a terminal where the first target user is located so that the terminal plays the first voice information.
Optionally, the apparatus further includes:
the model training module is used for inputting the sample text information into the sound simulation model to be trained and outputting corresponding voice information; extracting sound characteristics of the voice information; determining a feature difference value between the sound feature and the sound feature of the second target user; if the characteristic difference value is smaller than a preset difference threshold value, determining a current sound simulation model to be trained as the sound simulation model obtained through training; and if the characteristic difference value is not smaller than a preset difference threshold value, adjusting parameters of the sound simulation model to be trained, and returning to the step of inputting the sample text information into the sound simulation model to be trained.
Optionally, the apparatus further includes:
the second text information acquisition module is used for acquiring the user name of the first target user; generating target text information according to the user name and the preset text information;
the voice information generation module is specifically configured to input the target text information into a pre-trained sound simulation model to generate corresponding first voice information.
Optionally, the apparatus further includes:
the affinity acquisition module is used for acquiring the affinity between the first target user and the second target user;
the first text information acquisition module is specifically configured to judge whether the affinity is higher than a preset affinity threshold; if the affinity is higher than the preset affinity threshold, search for preset voice information corresponding to the first target user and send the preset voice information to the terminal where the first target user is located, so that the terminal plays the preset voice information; and if the affinity is not higher than the preset affinity threshold, acquire the preset text information corresponding to the preset event, wherein the preset voice information is recorded in advance by the second target user for the first target user.
Optionally, the apparatus further includes:
the similarity calculation module is used for determining whether an instruction which is sent by the second target user and does not approve the first voice information is received; if yes, obtaining second voice information recorded by the second target user; calculating the similarity between the second voice information and the first voice information;
the voice information sending module is specifically configured to determine whether the similarity is greater than a preset similarity threshold; and if the similarity is larger than a preset similarity threshold, sending the second voice information to a terminal where the first target user is located.
Optionally, the preset event is that the first target user triggers a watch-live option of the live room of the second target user;
the apparatus further comprises:
a volume adjustment module for determining whether the second target user is in a state of speaking in a live room; if so, adjusting the speaking volume of the second target user played by the terminal where the first target user is located to a preset volume, wherein the preset volume is smaller than the volume of the first voice information played by the terminal.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of the above first aspects when executing a program stored on a memory.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program stored therein, which when executed by a processor, implements the method steps of any of the first aspects described above.
The embodiment of the invention has the beneficial effects that:
After receiving a trigger instruction indicating that a first target user has triggered a preset event, the method provided by the embodiment of the invention acquires preset text information corresponding to the preset event, inputs the preset text information into a pre-trained sound simulation model to generate corresponding first voice information, and sends the first voice information to the terminal where the first target user is located, so that the terminal plays the first voice information. The sound simulation model is trained in advance on sample text information and the sound characteristics of the second target user, so it can convert text information into voice information carrying the second target user's sound characteristics; in other words, voice imitating the second target user can be played to a first target user entering the second target user's live room. Because the preset text information corresponding to a preset event can be customized per event, different preset text information can be defined for different first target users, so that relatively personalized voice information is provided to users entering the second target user's live room without affecting the second target user's broadcast, enriching the modes of information exchange among users.
Of course, it is not necessary for any one product or method of practicing the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely some embodiments of the invention; those skilled in the art may obtain other drawings from these drawings without creative effort.
FIG. 1 is a flowchart of a method for generating voice information according to an embodiment of the present invention;
FIG. 2 is a flowchart of a training method of an acoustic simulation model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a voice message generating apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are merely some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
In order to enrich information exchange modes among users of application software, the embodiment of the invention provides a method, a device, electronic equipment, a computer readable storage medium and a computer program product for generating voice information.
The following first describes a method for generating voice information provided by the embodiment of the present invention. The method for generating voice information provided by the embodiment of the invention can be applied to any electronic equipment with a video live broadcast function, and is not particularly limited herein.
Fig. 1 is a flowchart of a method for generating voice information according to an embodiment of the present invention, where, as shown in fig. 1, the method includes:
step 101, after receiving a trigger instruction of triggering a preset event by a first target user, acquiring preset text information corresponding to the preset event.
The preset event is as follows: an event in which a watch-live option, a follow option, a comment option, a bullet-screen option, or a like option of the live room of the second target user is triggered.
Step 102, inputting preset text information into a pre-trained sound simulation model to generate corresponding first voice information.
The sound simulation model is obtained by training based on the sample text information and sound characteristics of the second target user.
Step 103, the first voice information is sent to the terminal where the first target user is located, so that the terminal plays the first voice information.
After receiving a trigger instruction indicating that a first target user has triggered a preset event, the method provided by the embodiment of the invention acquires preset text information corresponding to the preset event, inputs the preset text information into a pre-trained sound simulation model to generate corresponding first voice information, and sends the first voice information to the terminal where the first target user is located, so that the terminal plays the first voice information. The sound simulation model is trained in advance on sample text information and the sound characteristics of the second target user, so it can convert text information into voice information carrying the second target user's sound characteristics; in other words, voice imitating the second target user can be played to a first target user entering the second target user's live room. Because the preset text information corresponding to a preset event can be customized per event, different preset text information can be defined for different first target users, so that relatively personalized voice information is provided to users entering the second target user's live room without affecting the second target user's broadcast, enriching the modes of information exchange among users.
In the embodiment of the invention, the second target user may be an anchor user of live video application software, and the first target user may be a user who enters the second target user's live room to watch the broadcast.
The first target user triggering a preset event includes, but is not limited to: the first target user entering the live room of the second target user to watch the broadcast; the first target user entering the live room and sending a comment on the broadcast; the first target user entering the live room and sending a bullet-screen message on the broadcast; or the first target user entering the live room and liking the broadcast.
For example, user A and user B are both users of live video application software X, user B is an anchor user, and user A has not followed user B's account in software X. When user B is broadcasting in the live room, user A can see, on the display interface of user A's terminal, a "follow" option, a "comment" option, a "bullet screen" option, a "watch live" option, a "like" option, and a "send gift" option for user B's broadcast. Clicking the "follow" option means user A follows user B's account; clicking the "comment" option means user A can enter user B's live room and comment on the broadcast; clicking the "bullet screen" option means user A can enter the live room and send bullet-screen messages on the broadcast; clicking the "watch live" option means user A can enter the live room and watch the broadcast; clicking the "like" option means user A can enter the live room and indicate approval of the broadcast content; and clicking the "send gift" option means user A can enter the live room and reward user B.
If user A triggers the "follow", "comment", "bullet screen", "watch live", "like", or "send gift" option on the display interface of user A's terminal, the terminal receives a trigger instruction for triggering the corresponding preset event and sends the trigger instruction to the electronic device executing the voice information generation method; after receiving the trigger instruction, the electronic device can acquire the preset text information corresponding to the preset event.
In the embodiment of the invention, text information may be preset for each preset event, and different preset events may correspond to different preset text information. For example: if user A triggers the "follow" option for user B, the corresponding text may be "From now on we are family"; if user A triggers the "comment" option, the text may be "Thanks for your comment"; if user A triggers the "like" option, the text may be "Thanks, family, for the like"; if user A triggers the "bullet screen" option, the text may be "Thanks, family, for the bullet comments"; if user A triggers the "watch live" option, the text may be "Tap follow so you never get lost"; and if user A triggers the "send gift" option and rewards user B, the text may be "Thanks, family, for your support".
If user A has followed user B in advance, that is, user A is a fan of user B, and user B's user name is XXX, then when user A enters user B's live room, the text corresponding to the preset event may be "Welcome back, family, to XXX's live room"; if user A enters user B's live room and rewards user B, the text corresponding to the preset event may be "Thanks, family, for the gift".
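The per-event lookup described in the examples above can be sketched as a simple mapping; the event keys and greeting strings below are illustrative paraphrases, not text fixed by the patent.

```python
# Hypothetical mapping from preset events to preset text information.
PRESET_TEXTS = {
    "follow": "From now on we are family",
    "comment": "Thanks for your comment",
    "like": "Thanks, family, for the like",
    "bullet_screen": "Thanks, family, for the bullet comments",
    "watch_live": "Tap follow so you never get lost",
    "send_gift": "Thanks, family, for your support",
}

def preset_text_for(event: str) -> str:
    """Acquire the preset text information corresponding to the event."""
    return PRESET_TEXTS[event]
```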
In one possible implementation manner, before the preset text information is input into the pre-trained sound simulation model, the method further includes the following steps A1-A2:
and A1, acquiring a user name of a first target user.
And A2, generating target text information according to the user name and the preset text information.
In this case, the inputting the preset text information into the pre-trained sound simulation model to generate the corresponding first voice information may specifically be: inputting the target text information into the pre-trained sound simulation model to generate the corresponding first voice information.
Specifically, target text information including the user name of the first target user may be generated from the user name and the preset text information. For example, suppose the first target user has followed the second target user in advance and has the user name "XXYY". If the first target user enters the second target user's live room and rewards the second target user, the text corresponding to the preset event may be "Thanks, family, for the gift", and the target text information "Thanks, XXYY, for the gift" may be generated from the user name "XXYY" and that preset text information.
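The generation of target text information from the user name and the preset text can be sketched as follows; the `{name}` placeholder convention is an assumption, since the description only states that the target text is generated from the two inputs.

```python
def build_target_text(username: str, preset_text: str) -> str:
    """Generate target text information containing the first target
    user's name. The '{name}' placeholder in the preset text is a
    hypothetical convention, not specified by the patent."""
    return preset_text.format(name=username)

target = build_target_text("XXYY", "Thanks, {name}, for the gift")
```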
In one possible implementation manner, the affinity between the first target user and the second target user may also be acquired before the preset text information corresponding to the preset event is acquired.
Specifically, the following information may be acquired: whether the first target user follows the second target user; the number of comments sent by the first target user in the live room of the second target user; the number of bullet-screen messages sent by the first target user in the live room; the value of the gifts the first target user has given in the live room; the length of time the first target user has watched the broadcast in the live room; and the number of likes given by the first target user in the live room. The affinity between the first target user and the second target user may then be determined using the following formula:
Q(A,B) = a*y1 + b*y2 + c*y3 + d*y4 + e*y5 + f*y6
wherein y1 indicates whether the first target user follows the second target user, with y1 = 1 if the first target user follows the second target user and y1 = 0 otherwise, and a is a preset weight for y1; y2 is the number of comments sent by the first target user in the live room of the second target user, and b is a preset weight for y2; y3 is the number of bullet-screen messages sent by the first target user in the live room, and c is a preset weight for y3; y4 is the value of the gifts the first target user has given in the live room, and d is a preset weight for y4; y5 is the length of time the first target user has watched the broadcast in the live room, and e is a preset weight for y5; and y6 is the number of likes given by the first target user in the live room, and f is a preset weight for y6. The weights a, b, c, d, e, and f may be set according to the actual application scenario and are not specifically limited herein.
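The affinity formula can be transcribed directly; the default weight values in the signature are illustrative only, since the patent leaves them to the application scenario.

```python
def affinity(follows: bool, comments: int, bullets: int, gift_value: float,
             watch_time: float, likes: int,
             a=1.0, b=0.2, c=0.1, d=0.5, e=0.05, f=0.1) -> float:
    """Q(A,B) = a*y1 + b*y2 + c*y3 + d*y4 + e*y5 + f*y6.

    y1 is 1 if the first target user follows the second target user
    and 0 otherwise; the units of watch_time and the weight values
    are hypothetical placeholders.
    """
    y1 = 1 if follows else 0
    return (a * y1 + b * comments + c * bullets
            + d * gift_value + e * watch_time + f * likes)
```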
The obtaining the preset text information corresponding to the preset event may specifically include the following steps B1-B3:
and B1, judging whether the intimacy is higher than a preset intimacy threshold.
The preset affinity threshold may be set according to an actual application scenario, which is not specifically limited herein.
And B2, if the affinity is higher than a preset affinity threshold, searching preset voice information corresponding to the first target user, and sending the preset voice information to a terminal where the first target user is located, so that the terminal plays the preset voice information.
The preset voice information is recorded for the second target user aiming at the first target user in advance. For example, if the affinity is higher than the preset affinity threshold, the second target user may pre-record a piece of voice "thank the home for the past support, how much, and store the preset voice information corresponding to the identity information of the user in advance for each user whose affinity is higher than the preset affinity threshold. When the acquired intimacy between the first target user and the second target user is higher than the preset intimacy threshold, the preset voice information corresponding to the first target user can be searched and sent to the terminal where the first target user is located, so that the terminal where the first target user is located plays the preset voice information.
And step B3, if the affinity is not higher than a preset affinity threshold, acquiring preset text information corresponding to the preset event.
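Steps B1-B3 can be sketched as a simple branch; the function name, the return convention, and the data shapes are hypothetical.

```python
def choose_greeting(affinity_score: float, threshold: float,
                    prerecorded: dict, user_id: str, preset_text: str):
    """Steps B1-B3: return the anchor's pre-recorded audio for users
    whose affinity exceeds the preset threshold; otherwise fall back
    to the preset text that will be fed to the sound simulation model.
    All names here are illustrative placeholders."""
    if affinity_score > threshold:            # B1 -> B2
        return "audio", prerecorded[user_id]  # anchor's own recording
    return "text", preset_text                # B3: synthesize later
```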
By adopting the method provided by the embodiment of the invention, the second target user can customize, according to the affinity with the first target user, preset voice information exclusive to the first target user, making the voice information more targeted and improving the interaction effect, thereby improving the first target user's live-viewing experience and increasing the activity of the second target user's fans.
In a possible implementation manner, before the first voice information is sent to the terminal where the first target user is located, the method further includes the following steps C1-C3:
step C1, determining whether an instruction of disapproval of the first voice information sent by the second target user is received.
And step C2, if yes, acquiring second voice information recorded by a second target user.
And step C3, calculating the similarity between the second voice information and the first voice information.
In the embodiment of the invention, if the second target user is not satisfied with the first voice information generated by the sound simulation model, the second target user can, at any time, send an instruction disapproving the first voice information to the electronic device through his or her terminal, together with re-recorded second voice information. After receiving the instruction and the second voice information, the electronic device can calculate the similarity between the second voice information and the first voice information. Specifically, a text processing tool can be used to compare the cosine similarity, Euclidean distance, or Manhattan distance between the text corresponding to the second voice information and the text corresponding to the first voice information, and the result can be used as the similarity between the two pieces of voice information.
In this case, sending the first voice information to the terminal where the first target user is located may specifically include: determining whether the similarity is greater than a preset similarity threshold; and if the similarity is greater than the preset similarity threshold, sending the second voice information to the terminal where the first target user is located.
The preset similarity threshold may be set to 80% or 90% according to practical application conditions, and is not specifically limited herein.
For example, suppose the first voice information is "thank you, family, for the gift" and the re-recorded second voice information is "thank you so much, family, for the gift". The similarity between the two is calculated. If the similarity is greater than the preset similarity threshold, the second voice information is sent to the terminal where the first target user is located; if it is not greater than the preset similarity threshold, the first voice information is sent instead.
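The comparison described above can be sketched as follows. This is a minimal illustration of the cosine-similarity option mentioned in this embodiment; the whitespace tokenization, bag-of-words vectors, 0.8 threshold, and function names are assumptions for illustration, not part of the patent:

```python
from collections import Counter
import math

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts over word-count vectors."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

SIMILARITY_THRESHOLD = 0.8  # the "80%" example threshold from the text

def choose_message(first_text: str, second_text: str) -> str:
    """Send the re-recorded message only if it stays close to the original."""
    if cosine_similarity(first_text, second_text) > SIMILARITY_THRESHOLD:
        return second_text  # re-recorded second voice information
    return first_text       # fall back to the generated first voice information
```

A production system would more likely compare transcripts with an embedding-based similarity, but the threshold logic would be the same.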
By adopting the method provided by this embodiment of the invention, the second target user can adjust, at any time, the voice information directed at the first target user according to the voice effect generated by the sound simulation model, further improving the first target user's live-viewing experience and enhancing the second target user's fan engagement.
In one possible implementation, if the preset event is that the first target user triggers the live watching option of the second target user's live broadcast room, then before the terminal where the first target user is located plays the first voice information, the method may further include the following steps D1-D2:
Step D1, determining whether the second target user is currently speaking in the live broadcast room.
Step D2, if so, adjusting the volume at which the terminal where the first target user is located plays the second target user's speech to a preset volume.
The preset volume is smaller than the volume at which the terminal plays the first voice information; its specific value can be set according to the actual scenario.
Specifically, after the first target user triggers the preset event, the terminal where the first target user is located initiates an event request to the electronic device and receives the first voice information generated by the electronic device. If the second target user happens to be speaking in the live broadcast room at this time, that speech may interfere with playback of the first voice information. The terminal therefore lowers the playback volume of the second target user's speech to the preset volume, so that it is quieter than the first voice information and information interference is avoided.
By adopting the method provided by this embodiment of the invention, information interference can be avoided by adjusting the volume at which the terminal where the first target user is located plays the second target user's speech, further improving the first target user's live-viewing experience and enhancing the second target user's fan engagement.
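Steps D1-D2 amount to a simple ducking rule on the viewer's terminal. The sketch below assumes the terminal mixes two audio streams with a per-stream gain; the 0.3 ducked level and the function name are illustrative choices, not values from the patent:

```python
def playback_volumes(anchor_is_speaking: bool,
                     message_volume: float = 1.0,
                     ducked_volume: float = 0.3) -> tuple[float, float]:
    """Return (anchor_volume, message_volume) for the viewer's terminal.

    Step D1: check whether the anchor (second target user) is speaking.
    Step D2: if so, duck the anchor's speech below the generated message,
    so the preset volume stays smaller than the message playback volume.
    """
    if anchor_is_speaking:
        anchor_volume = min(ducked_volume, message_volume)
    else:
        anchor_volume = 1.0  # no message interference: play the anchor normally
    return anchor_volume, message_volume
```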
In one possible implementation, if the first target user is a fan of the second target user, then when the number of gifts the first target user has given while watching the second target user's live content reaches a preset number, the first target user may leave a message for the second target user by recording voice or video, enhancing the information interaction between the two users. The preset number may be set to 100, 1000, or the like, and is not specifically limited herein.
In a possible implementation, fig. 2 is a flowchart of a training method of the sound simulation model according to an embodiment of the present invention, where the training method may include:
step 201, inputting the sample text information into the to-be-trained sound simulation model, and outputting the corresponding voice information.
The sound simulation model to be trained may be a vocoder or another model capable of converting text information into corresponding voice information.
Step 202, extracting sound characteristics of the voice information.
Specifically, the speech speed feature and the tone feature of the voice information can be extracted.
Step 203, determining a feature difference value between the extracted sound features and the sound features of the second target user.
Wherein the sound characteristics of the second target user include tone characteristics and speech rate characteristics of the second target user.
In the embodiment of the invention, the voice audio sample of the second target user can be collected in advance, and the tone characteristic and the speech speed characteristic of the second target user can be extracted as the voice characteristic of the second target user.
Step 204, if the feature difference value is smaller than the preset difference threshold, determining the current sound simulation model to be trained as the trained sound simulation model.
The preset difference threshold may be set according to practical applications, and is not specifically limited herein.
Step 205, if the feature difference value is not smaller than the preset difference threshold, adjusting the parameters of the sound simulation model to be trained and returning to the step of inputting the sample text information into the sound simulation model to be trained.
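Steps 201-205 describe a generic feedback training loop. The sketch below uses a toy stand-in for the sound simulation model to show the control flow only; the method names (`synthesize`, `extract_features`, `adjust_parameters`), the two-number feature vector, and the fixed-step update rule are illustrative assumptions, not the patent's actual model:

```python
def feature_distance(a, b):
    """Step 203: mean absolute difference between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

class ToyVoiceModel:
    """Toy stand-in for the sound simulation model to be trained.

    Its 'features' are just two numbers standing in for pitch and speech
    rate; a real model would be a neural text-to-speech system or vocoder.
    """
    def __init__(self):
        self.params = [0.0, 0.0]

    def synthesize(self, text):
        return self.params  # toy 'audio': the parameters themselves

    def extract_features(self, audio):
        return audio

    def adjust_parameters(self, mean_diff):
        self.params = [p + 0.1 for p in self.params]  # crude fixed-step update

def train_voice_model(model, sample_texts, target_features,
                      diff_threshold=0.05, max_iters=1000):
    """Feedback loop of steps 201-205."""
    for _ in range(max_iters):
        diffs = []
        for text in sample_texts:
            audio = model.synthesize(text)                             # step 201
            features = model.extract_features(audio)                   # step 202
            diffs.append(feature_distance(features, target_features))  # step 203
        mean_diff = sum(diffs) / len(diffs)
        if mean_diff < diff_threshold:                                 # step 204
            return model                                               # converged
        model.adjust_parameters(mean_diff)                             # step 205
    return model
```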
By adopting the method provided by this embodiment of the invention, a sound simulation model can be trained in advance from the sample text information and the sound characteristics of the second target user. The model converts text information into voice information carrying the second target user's sound characteristics; that is, voice information imitating the second target user's voice can be sent to a first target user who enters the second target user's live broadcast room. Because the preset text information corresponds to the preset event, different preset text information can be defined for different events triggered by first target users. In this way, relatively personalized voice information is provided to users entering the second target user's live broadcast room without affecting the second target user's live broadcast, enriching the modes of information exchange among users.
Corresponding to the above method for generating voice information, an embodiment of the present invention further provides a device for generating voice information, which is described below. As shown in fig. 3, the device includes:
the first text information obtaining module 301 is configured to obtain, after receiving a trigger instruction that a first target user triggers a preset event, preset text information corresponding to the preset event, where the preset event is: an event in which a live watching option, a focus option, a comment option, a bullet screen option or a praise option of the live broadcast room of a second target user is triggered;
the voice information generating module 302 is configured to input the preset text information into a pre-trained voice simulation model, and generate corresponding first voice information, where the voice simulation model is obtained by training based on sample text information and voice features of the second target user;
and the voice information sending module 303 is configured to send the first voice information to a terminal where the first target user is located, so that the terminal plays the first voice information.
After receiving a trigger instruction of a first target user triggering a preset event, the device provided by this embodiment of the invention acquires preset text information corresponding to the preset event, inputs the preset text information into a pre-trained sound simulation model to generate corresponding first voice information, and sends the first voice information to the terminal where the first target user is located, so that the terminal plays the first voice information. The sound simulation model is trained in advance from the sample text information and the sound characteristics of the second target user, and converts text information into voice information carrying the second target user's sound characteristics; that is, voice information imitating the second target user's voice can be sent to a first target user entering the second target user's live broadcast room. Because the preset text information corresponds to the preset event, different preset text information can be defined for different events triggered by first target users. In this way, relatively personalized voice information is provided to users entering the second target user's live broadcast room without affecting the second target user's live broadcast, enriching the modes of information exchange among users.
Optionally, the apparatus further includes:
the model training module (not shown in the figure) is used for inputting sample text information into the to-be-trained sound simulation model and outputting corresponding voice information; extracting sound characteristics of the voice information; determining a feature difference value between the sound feature and the sound feature of the second target user; if the characteristic difference value is smaller than a preset difference threshold value, determining a current sound simulation model to be trained as the sound simulation model obtained through training; and if the characteristic difference value is not smaller than a preset difference threshold value, adjusting parameters of the sound simulation model to be trained, and returning to the step of inputting the sample text information into the sound simulation model to be trained.
Optionally, the apparatus further includes:
a second text information obtaining module (not shown in the figure) for obtaining a user name of the first target user; generating target text information according to the user name and the preset text information;
the voice information generating module 302 is specifically configured to input the target text information into a pre-trained acoustic simulation model, and generate corresponding first voice information.
Optionally, the apparatus further includes:
an intimacy acquisition module (not shown in the figure) for acquiring the intimacy between the first target user and the second target user;
the first text information obtaining module 301 is specifically configured to determine whether the affinity is higher than a preset affinity threshold; if the intimacy is higher than the preset intimacy threshold, searching preset voice information corresponding to the first target user, and sending the preset voice information to a terminal where the first target user is located, so that the terminal plays the preset voice information; and if the intimacy is not higher than the preset intimacy threshold, acquiring preset text information corresponding to the preset event, wherein the preset voice information is recorded by the second target user in advance for the first target user.
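The dispatch performed by this module and the first text information obtaining module can be sketched as follows. The 0.7 intimacy threshold, the dictionary lookup for pre-recorded clips (including falling back to synthesis when no clip exists), and the `synthesize` callable are illustrative assumptions; the patent leaves the threshold and storage details open:

```python
INTIMACY_THRESHOLD = 0.7  # illustrative value; the patent leaves this open

def pick_greeting(intimacy: float, prerecorded: dict, viewer_id: str,
                  preset_text: str, synthesize):
    """High-intimacy fans get the anchor's pre-recorded clip for them;
    everyone else gets the preset text rendered by the voice model."""
    if intimacy > INTIMACY_THRESHOLD and viewer_id in prerecorded:
        return prerecorded[viewer_id]  # preset voice information for this fan
    return synthesize(preset_text)     # generated first voice information
```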
Optionally, the apparatus further includes:
a similarity calculation module (not shown in the figure) for determining whether an instruction of disapproval of the first voice information sent by the second target user is received; if yes, obtaining second voice information recorded by the second target user; calculating the similarity between the second voice information and the first voice information;
The voice information sending module 303 is specifically configured to determine whether the similarity is greater than a preset similarity threshold; and if the similarity is larger than a preset similarity threshold, sending the second voice information to a terminal where the first target user is located.
Optionally, the preset event is that the first target user triggers a live watching option of a live broadcasting room of the second target user;
the apparatus further comprises:
a volume adjustment module (not shown) for determining whether the second target user is currently speaking in the live broadcast room; and if so, adjusting the volume at which the terminal where the first target user is located plays the second target user's speech to a preset volume, where the preset volume is smaller than the volume at which the terminal plays the first voice information.
The embodiment of the invention also provides an electronic device, as shown in fig. 4, which comprises a processor 401, a communication interface 402, a memory 403 and a communication bus 404, wherein the processor 401, the communication interface 402 and the memory 403 complete communication with each other through the communication bus 404,
a memory 403 for storing a computer program;
the processor 401 is configured to implement any of the steps of the method for generating voice information when executing the program stored in the memory 403.
The communication bus of the above electronic device may be a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus, an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is drawn in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include a random access memory (Random Access Memory, RAM) or a non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, there is also provided a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the steps of any of the above-described methods of generating speech information.
In yet another embodiment of the present invention, a computer program product containing instructions that, when run on a computer, cause the computer to perform the method of generating speech information of any of the above embodiments is also provided.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another, for example by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), a semiconductor medium (e.g., solid state disk (SSD)), or the like.
It is noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
In this specification, the embodiments are described in a progressive manner; identical or similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, for the apparatus, electronic device, computer-readable storage medium, and computer program product embodiments, the description is relatively simple because they are substantially similar to the method embodiments; for relevant points, refer to the partial description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (8)

1. A method for generating voice information, comprising:
after a triggering instruction of triggering a preset event by a first target user is received, acquiring intimacy between the first target user and a second target user, wherein the first target user is a user watching live broadcast, the second target user is a host user, and the preset event is: an event that a live watching option, a focus option, a comment option, a bullet screen option or a praise option of a live broadcasting room of the second target user is triggered;
judging whether the intimacy is higher than a preset intimacy threshold;
if the intimacy is higher than the preset intimacy threshold, searching preset voice information corresponding to the first target user, and sending the preset voice information to a terminal where the first target user is located so that the terminal plays the preset voice information, wherein the preset voice information is recorded by the second target user in advance for the first target user;
If the intimacy is not higher than the preset intimacy threshold, acquiring preset text information corresponding to the preset event;
inputting the preset text information into a pre-trained sound simulation model to generate corresponding first voice information, wherein the sound simulation model is obtained by training based on sample text information and sound characteristics of the second target user;
and sending the first voice information to a terminal where a first target user is located, so that the terminal plays the first voice information.
2. The method of claim 1, wherein the training method of the acoustic simulation model comprises:
inputting the sample text information into a sound simulation model to be trained, and outputting corresponding voice information;
extracting sound characteristics of the voice information;
determining a feature difference value between the sound feature and the sound feature of the second target user;
if the characteristic difference value is smaller than a preset difference threshold value, determining a current sound simulation model to be trained as the sound simulation model obtained through training;
and if the characteristic difference value is not smaller than a preset difference threshold value, adjusting parameters of the sound simulation model to be trained, and returning to the step of inputting the sample text information into the sound simulation model to be trained.
3. The method of claim 1, further comprising, prior to said inputting said pre-set text information into a pre-trained acoustic simulation model:
acquiring a user name of the first target user;
generating target text information according to the user name and the preset text information;
inputting the preset text information into a pre-trained sound simulation model to generate corresponding first voice information, wherein the method comprises the following steps:
and inputting the target text information into a pre-trained sound simulation model to generate corresponding first voice information.
4. The method of claim 1, further comprising, prior to said transmitting said first voice information to a terminal at which a first target user is located:
determining whether an instruction which is sent by the second target user and does not approve the first voice information is received;
if yes, obtaining second voice information recorded by the second target user;
calculating the similarity between the second voice information and the first voice information;
the sending the first voice information to the terminal where the first target user is located includes:
judging whether the similarity is larger than a preset similarity threshold value or not;
And if the similarity is larger than a preset similarity threshold, sending the second voice information to a terminal where the first target user is located.
5. The method of claim 1, wherein the preset event triggers a view live option of a live room of the second target user for the first target user;
before the terminal plays the first voice information, the method further comprises:
determining whether the second target user is in a state of speaking in a live room;
if so, adjusting the speaking volume of the second target user played by the terminal where the first target user is located to a preset volume, wherein the preset volume is smaller than the volume of the first voice information played by the terminal.
6. A voice information generating apparatus, comprising:
the intimacy acquisition module is configured to acquire, after receiving a trigger instruction that a first target user triggers a preset event, an intimacy between the first target user and a second target user, where the first target user is a user watching live broadcast, the second target user is a host user, and the preset event is: an event in which a live watching option, a focus option, a comment option, a bullet screen option or a praise option of the live broadcast room of the second target user is triggered;
The first text information acquisition module is used for judging whether the intimacy is higher than a preset intimacy threshold; if the intimacy is higher than the preset intimacy threshold, searching preset voice information corresponding to the first target user, and sending the preset voice information to a terminal where the first target user is located, so that the terminal plays the preset voice information; if the intimacy is not higher than the preset intimacy threshold, acquiring preset text information corresponding to the preset event, wherein the preset voice information is recorded by the second target user in advance for the first target user;
the voice information generation module is used for inputting the preset text information into a pre-trained voice simulation model to generate corresponding first voice information, wherein the voice simulation model is obtained by training based on sample text information and voice characteristics of the second target user;
and the voice information sending module is used for sending the first voice information to a terminal where the first target user is located so that the terminal plays the first voice information.
7. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
A memory for storing a computer program;
a processor for carrying out the method steps of any one of claims 1-5 when executing a program stored on a memory.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-5.
CN202111633258.1A 2021-12-29 2021-12-29 Voice information generation method and device, electronic equipment and storage medium Active CN114302217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111633258.1A CN114302217B (en) 2021-12-29 2021-12-29 Voice information generation method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN114302217A CN114302217A (en) 2022-04-08
CN114302217B true CN114302217B (en) 2024-01-05

Family

ID=80970710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111633258.1A Active CN114302217B (en) 2021-12-29 2021-12-29 Voice information generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114302217B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106878820A (en) * 2016-12-09 2017-06-20 北京小米移动软件有限公司 Living broadcast interactive method and device
CN108337568A (en) * 2018-02-08 2018-07-27 北京潘达互娱科技有限公司 A kind of information replies method, apparatus and equipment
CN109286822A (en) * 2018-10-19 2019-01-29 广州虎牙科技有限公司 Interactive approach, device, equipment and storage medium based on live video identification
WO2020134841A1 (en) * 2018-12-28 2020-07-02 广州市百果园信息技术有限公司 Live broadcast interaction method and apparatus, and system, device and storage medium
WO2021008223A1 (en) * 2019-07-15 2021-01-21 北京字节跳动网络技术有限公司 Information determination method and apparatus, and electronic device
CN112818674A (en) * 2021-01-29 2021-05-18 广州繁星互娱信息科技有限公司 Live broadcast information processing method, device, equipment and medium
CN113766253A (en) * 2021-01-04 2021-12-07 北京沃东天骏信息技术有限公司 Live broadcast method, device, equipment and storage medium based on virtual anchor


Also Published As

Publication number Publication date
CN114302217A (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN110446115B (en) Live broadcast interaction method and device, electronic equipment and storage medium
CN104869467B (en) Information output method, device and system in media play
CN109788345B (en) Live broadcast control method and device, live broadcast equipment and readable storage medium
CN112753227B (en) Method, computer readable medium and system for extracting metadata from a depiction of an event
CN107612815B (en) Information sending method, device and equipment
CN111294606B (en) Live broadcast processing method and device, live broadcast client and medium
CN108322791B (en) Voice evaluation method and device
CN111246285A (en) Method for separating sound in comment video and method and device for adjusting volume
WO2019047850A1 (en) Identifier displaying method and device, request responding method and device
CN112073294A (en) Voice playing method and device of notification message, electronic equipment and medium
CN111031329A (en) Method, apparatus and computer storage medium for managing audio data
JP6998452B2 (en) Server systems and methods for non-linear content presentation and experience management
CN114302217B (en) Voice information generation method and device, electronic equipment and storage medium
CN113301362B (en) Video element display method and device
CN113938723B (en) Bullet screen playing method, bullet screen playing device and bullet screen playing equipment
CN113934870B (en) Training method, device and server of multimedia recommendation model
CN115278352A (en) Video playing method, device, equipment and storage medium
CN108766486B (en) Control method and device and electronic equipment
JP2023526285A (en) Test method and apparatus for full-duplex voice interaction system
US20220051669A1 (en) Information processing device, information processing method, computer program, and interaction system
US11295724B2 (en) Sound-collecting method, device and computer storage medium
CN110516043A (en) Answer generation method and device for question answering system
JP7497002B1 (en) Systems and methods for streaming - Patents.com
JP7469769B1 (en) Method, computer device and computer-readable recording medium for helping streamers interact with their viewers
JP7355309B1 (en) Systems, methods, and computer-readable media for contribution prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant