CN113870828A - Audio synthesis method and device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN113870828A
Authority
CN
China
Prior art keywords
reference vector
audio
information
audio data
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111148956.2A
Other languages
Chinese (zh)
Inventor
蒋微
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN202111148956.2A priority Critical patent/CN113870828A/en
Publication of CN113870828A publication Critical patent/CN113870828A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses an audio synthesis method, an audio synthesis device, electronic equipment and a readable storage medium, and belongs to the technical field of speech synthesis. The method comprises the following steps: acquiring target information; acquiring prosodic characteristic parameters of a target speaker, wherein the prosodic characteristic parameters comprise a speech speed reference vector, a pause length reference vector and a style vector; determining acoustic characteristic information according to the target information and the prosodic characteristic parameters; and converting the acoustic characteristic information to generate target audio data corresponding to the target information.

Description

Audio synthesis method and device, electronic equipment and readable storage medium
Technical Field
The application belongs to the technical field of speech synthesis, and particularly relates to an audio synthesis method, an audio synthesis device, electronic equipment and a readable storage medium.
Background
Text To Speech (TTS) technology refers to technology that converts text information into speech information. Personalized TTS is a speech synthesis technique that collects some speech segments of a person through a recording device and, based on TTS technology, synthesizes speech that conforms to the speaking manner of that specific person.
However, in the current speech synthesis technology, the synthesized speech cannot reflect the vocal characteristics of different users, and the synthesis effect is poor.
Disclosure of Invention
Embodiments of the present application provide an audio synthesis method, an audio synthesis apparatus, an electronic device, and a readable storage medium, which can solve the problem in existing speech synthesis technology that synthesized speech cannot reflect the vocal characteristics of different users and that the synthesis effect is poor.
In a first aspect, an embodiment of the present application provides an audio synthesis method, including:
acquiring target information;
acquiring prosodic characteristic parameters of a target speaker, wherein the prosodic characteristic parameters comprise a speech speed reference vector, a pause length reference vector and a style vector;
determining acoustic characteristic information according to the target information and the prosodic characteristic parameters;
and converting the acoustic characteristic information to generate target audio data corresponding to the target information.
In a second aspect, an embodiment of the present application provides an audio synthesizing apparatus, including:
the first acquisition module is used for acquiring target information;
the second acquisition module is used for acquiring prosodic characteristic parameters of the target speaker, wherein the prosodic characteristic parameters comprise a speech speed reference vector, a pause length reference vector and a style vector;
the first determining module is used for determining acoustic characteristic information according to the target information and the prosody characteristic parameter;
and the generating module is used for converting the acoustic characteristic information and generating target audio data corresponding to the target information.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, and when executed by the processor, the program or instructions implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In the embodiment of the application, the target information and the prosodic characteristic parameters of the target speaker are obtained, the acoustic characteristic information is determined according to the target information and the prosodic characteristic parameters of the target speaker, the acoustic characteristic information is converted, and the target audio data corresponding to the target information is generated.
Drawings
Fig. 1 is a schematic flowchart of an audio synthesizing method provided by an embodiment of the present application;
fig. 2 is a schematic diagram of a style vector coding/decoding model according to an embodiment of the present application;
fig. 3 is a schematic diagram of a synthesis process of target audio data according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an audio synthesizing apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
fig. 6 is a schematic diagram of a hardware structure of an electronic device implementing the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present disclosure.
The terms "first", "second" and the like in the description and claims of the present application are used to distinguish between similar elements and are not necessarily used to describe a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances, so that the embodiments of the application can be practiced in orders other than those illustrated or described herein. The terms "first", "second" and the like are generally used in a generic sense and do not limit the number of objects; for example, a first object may be one object or more than one object. In addition, "and/or" in the specification and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the objects before and after it.
The audio synthesis method provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.
Please refer to fig. 1, which illustrates an audio synthesizing method applied to an electronic device according to an embodiment of the present application, and the method may include steps 1100 to 1400, which are described in detail below.
In step 1100, target information is obtained.
In the present embodiment, the target information is information input by the user that needs to be converted into audio data. The target information may be text information, for example, a sentence input by the user through a text input means. The target information may also be voice information, such as a sentence recorded by the user.
Step 1200, obtaining prosodic characteristic parameters of the target speaker, wherein the prosodic characteristic parameters comprise a speech speed reference vector, a pause length reference vector and a style vector.
In this embodiment, the prosodic characteristic parameters of the target speaker may reflect the prosodic characteristics of the target speaker's speech. The prosodic characteristics may be the cadence and rhythm of reading; specifically, they may include the positions of long pauses, short pauses and breaths during sentence reading, and may also include the speed and stress of reading, and so on.
The prosodic characteristic parameters may include a speech rate reference vector, a pause length reference vector, and a style vector.
The speech rate reference vector may represent the speech rate of the target speaker. For different speakers, the speed of reading the sentence is different, and the speech speed reference vector is different.
In some optional embodiments, obtaining the speech rate reference vector may further include: step 2100-step 2300.
Step 2100, obtaining historical audio data of the target speaker.
In this embodiment, the historical audio data of the target speaker may be audio data of the target speaker stored by the electronic device. Such as pre-recorded audio data of the target speaker. Also for example, a voice chat recording of the target speaker stored in the instant messaging application.
In a specific implementation, a first input is received from the target speaker and, in response to the first input, the historical audio data of the target speaker is obtained. The first input may be an input to a historical audio data storage directory. It should be noted that, after the historical audio data of the target speaker is obtained, the historical audio data can be screened so that only data meeting the requirements is retained; audio data that does not meet the signal-to-noise ratio requirement or that contains multiple speakers can thus be filtered out, which improves the efficiency of obtaining the prosodic characteristics.
Step 2200, determining a first average speech rate of the target speaker according to the historical audio data.
The first average pace may be an average pronunciation duration for each phoneme. For Chinese, a phoneme may be, for example, an initial or a final in a sentence.
The first average speech rate may be determined according to the duration of the target sentence and the number of phonemes included in the target sentence. In specific implementation, a target sentence is obtained from historical audio data, and the ratio of the duration of the target sentence to the number of phonemes contained in the target sentence is used as a first average speech speed.
Step 2300, determining the speech rate reference vector according to the first average speech rate and a preset average speech rate.
The preset average speech rate may reflect a reading rhythm consistent with that of most users. Illustratively, the preset average speech rate may be an average speech rate determined from big data, which may include, for example, audio data of a plurality of users. In a specific implementation, the speech rate reference vector A is determined according to the first average speech rate S and the preset average speech rate S', for example as the ratio of the first average speech rate S to the preset average speech rate S'.
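The computation in steps 2200 and 2300 can be illustrated with a minimal sketch. The code below assumes that sentence durations (in seconds) and phoneme counts are already available, for example from a forced aligner; all names and the preset rate of 0.11 s/phoneme are illustrative assumptions, not values taken from the patent.

```python
def first_average_speech_rate(sentences):
    """Step 2200: average pronunciation duration per phoneme over the historical audio."""
    total_duration = sum(duration for duration, _ in sentences)
    total_phonemes = sum(num_phonemes for _, num_phonemes in sentences)
    return total_duration / total_phonemes

def speech_rate_reference_vector(first_avg_rate, preset_avg_rate):
    """Step 2300: ratio of the speaker's average rate S to the preset average rate S'."""
    return first_avg_rate / preset_avg_rate

# Hypothetical usage: two target sentences given as (duration in seconds, phoneme count).
sentences = [(2.4, 20), (3.1, 26)]
S = first_average_speech_rate(sentences)
A = speech_rate_reference_vector(S, preset_avg_rate=0.11)
```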
In the embodiment, historical audio data of the target speaker stored in the electronic device is used to obtain the speech speed reference vector of the target speaker, more audio data can be used to obtain the personalized prosody characteristic parameters of the target speaker, and audio data more conforming to the speaking habit of the target speaker can be generated by combining with the subsequent steps. In addition, the speech speed reference vector of the target speaker can be extracted through the electronic equipment, so that user data leakage can be avoided, and the interaction safety can be improved.
In this embodiment, the pause length reference vector may represent the pausing habits of the target speaker when reading sentences. Different speakers pause at different positions when reading, so their pause length reference vectors differ. For example, when reading a sentence, some speakers are used to pausing for breath every two or three words, while others pause for breath only after reading a complete sentence.
In some optional embodiments, obtaining the pause length reference vector may further include: step 3100-step 3300.
Step 3100, obtaining historical audio data of the target speaker.
In this embodiment, the historical audio data of the target speaker may be audio data of the target speaker stored by the electronic device. Such as pre-recorded audio data of the target speaker. Also for example, a voice chat recording of the target speaker stored in the instant messaging application.
Step 3200, determining pause probabilities corresponding to different syllable lengths according to the historical audio data.
In this embodiment, for Chinese, the syllable length may be the number of Chinese characters spoken. The pause probabilities for different syllable lengths can be shown in the following table.
[Table: pause probabilities corresponding to different syllable lengths; presented as an image in the original publication.]
Step 3300, determining the pause length reference vector according to the pause probabilities corresponding to the different syllable lengths.
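A rough sketch of steps 3200 and 3300 is given below. It assumes that each utterance in the historical audio has already been split into pause-free segments whose lengths are counted in syllables (Chinese characters); the segmentation step, the maximum syllable length, and all names are assumptions for illustration rather than details from the patent.

```python
from collections import Counter

def pause_length_reference_vector(utterances, max_syllables=10):
    """Estimate, for each syllable run length, how often the speaker pauses after it,
    and collect those probabilities into the pause length reference vector B."""
    segment_counts = Counter()
    pause_counts = Counter()
    for segments in utterances:
        for i, length in enumerate(segments):
            length = min(length, max_syllables)
            segment_counts[length] += 1
            if i < len(segments) - 1:      # every segment but the last is followed by a pause
                pause_counts[length] += 1
    return [pause_counts[n] / segment_counts[n] if segment_counts[n] else 0.0
            for n in range(1, max_syllables + 1)]

# Hypothetical usage: two utterances split into pause-free segments of 3, 2, 4 and 5, 2 syllables.
B = pause_length_reference_vector([[3, 2, 4], [5, 2]])
```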
In the embodiment, historical audio data of the target speaker stored in the electronic device is used for obtaining the pause length reference vector of the target speaker, more audio data can be used for obtaining the personalized prosody characteristic parameters of the target speaker, and the audio data more conforming to the speaking habit of the target speaker can be generated by combining the subsequent steps. In addition, the pause length reference vector of the target speaker can be extracted through the electronic equipment, so that user data leakage can be avoided, and the interaction safety can be improved.
In the present embodiment, the style vector may represent the prosodic style of a speaker, for example a natural conversational style, a public-speaking style, or a broadcast style. The style vector may be obtained by performing cluster analysis on a plurality of speakers; speakers whose style vectors are close to each other have similar prosodic pronunciation styles.
In this embodiment, the style vector of the target speaker may be obtained based on a coding/decoding model. Taking the style vector coding/decoding model shown in fig. 2 as an example, an audio feature X and text feature parameters are extracted from the historical audio data of the target speaker; the audio feature X is input into an encoder (Encoder) 401 to obtain a style vector C; the style vector C and the text feature parameters are then input into a decoder (Decoder) 402, which outputs an audio feature X'; the parameters of each module are then optimized so that the difference between the output audio feature X' and the input audio feature X is smaller than a preset threshold. Based on the optimized style vector coding/decoding model, a style vector Ci is obtained for each piece of audio data, and the style vectors Ci of the individual pieces of audio data are used to obtain the style vector C of the target speaker.
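The encoder/decoder arrangement of fig. 2 can be sketched as a small reconstruction model. The following PyTorch code is only an illustrative stand-in: the layer types, feature dimensions, and training loop are assumptions, and the patent does not specify the network architecture.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Encoder 401: maps an audio-feature sequence X to a fixed-length style vector C."""
    def __init__(self, feat_dim=80, style_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, style_dim, batch_first=True)

    def forward(self, x):                       # x: (batch, frames, feat_dim)
        _, h = self.rnn(x)
        return h[-1]                            # (batch, style_dim)

class StyleDecoder(nn.Module):
    """Decoder 402: reconstructs audio features X' from text features plus style vector C."""
    def __init__(self, text_dim=128, style_dim=64, feat_dim=80):
        super().__init__()
        self.rnn = nn.GRU(text_dim + style_dim, 256, batch_first=True)
        self.proj = nn.Linear(256, feat_dim)

    def forward(self, text_feats, style):       # text_feats: (batch, frames, text_dim)
        style = style.unsqueeze(1).expand(-1, text_feats.size(1), -1)
        out, _ = self.rnn(torch.cat([text_feats, style], dim=-1))
        return self.proj(out)

# Optimize both modules until the reconstruction error |X' - X| is small enough.
enc, dec = StyleEncoder(), StyleDecoder()
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
x = torch.randn(4, 200, 80)                     # audio features X (random placeholder batch)
t = torch.randn(4, 200, 128)                    # aligned text feature parameters
for _ in range(10):
    c = enc(x)                                  # style vector C
    x_hat = dec(t, c)                           # reconstructed audio features X'
    loss = nn.functional.l1_loss(x_hat, x)
    opt.zero_grad(); loss.backward(); opt.step()
```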
In the embodiment, the style vector of the target speaker is obtained by using the historical audio data of the target speaker stored in the electronic device, so that more audio data are used to obtain the personalized prosody characteristic parameters of the target speaker, and the audio data more conforming to the pronunciation characteristics of the target speaker can be generated by combining the subsequent steps. In addition, in this way, the prosodic characteristic parameters of the target speaker can be extracted off line, the extraction process of the prosodic characteristic parameters and the training process of the acoustic model can be independently carried out, and the training efficiency can be improved.
After step 1200, step 1300 is executed to determine acoustic feature information according to the target information and the prosodic characteristic parameter.
The acoustic feature information may be feature information that is input into a vocoder to generate audio data. Different types of acoustic feature information may be selected according to the requirements of the vocoder, for example a mel spectrogram, pitch, mel-generalized cepstral (MGC) features, and so on.
In the present embodiment, the acoustic feature information is related not only to the text content but also to the reading habits and reading style of the speaker; that is, the acoustic feature information is related both to the textual prosody of the text content itself and to the prosodic characteristics of the speaker. Based on this, acoustic feature information is determined according to the target information and the prosodic characteristic parameters of the target speaker, and the target audio is generated from the acoustic feature information, so that the target audio is closer to the pronunciation characteristics of the target speaker.
In some embodiments of the present application, the determining acoustic feature information according to the target information and the prosodic characteristic parameter includes: step 4100-step 4500.
Step 4100, analyzing the target information to obtain text characteristic parameters, where the text characteristic parameters include a first phoneme sequence and a text prosody.
In this embodiment, since the acoustic feature information is related to the text content, based on this, the text feature parameter of the target information needs to be obtained, so as to combine the text feature parameter of the target information and the prosodic feature parameter of the target speaker, and generate the acoustic feature information that conforms to the speaking characteristics of the target speaker.
The text feature parameters may include a first phoneme sequence and a text prosody. The first phoneme sequence may be determined according to the word boundaries of the target information, and is determined based on the correlations within the text content of the target information. The text feature parameters may also include tone sequences, stress, and the like.
In a specific implementation, when the target information is text information, the text information is input into a text analysis module, which outputs the text characteristic parameters. The text analysis module may use traditional pattern classification algorithms such as decision trees and maximum entropy (ME) models, or may use neural network algorithms such as BiLSTM, BERT, and TCN, to perform the sequence labeling task and obtain the final labeling result, i.e., the text characteristic parameters.
It should be noted that, in the case that the target information is voice information, the voice information may be recognized to obtain text information corresponding to the target information, and the text information may be further subjected to text analysis to obtain text characteristic parameters.
Step 4200, generating a second phoneme sequence according to the text prosody, the pause length reference vector and the first phoneme sequence.
In some optional embodiments, the generating a second phoneme sequence according to the text prosody, the pause length reference vector, and the first phoneme sequence may further include: generating corrected prosody information according to the text prosody and the pause length reference vector; and generating a second phoneme sequence according to the corrected prosody information and the first phoneme sequence.
In a specific implementation, the prosody probability corresponding to the text prosody is used as the node probability, the pause length reference vector is used as the path probability, and a dynamic programming algorithm is used to find the optimal path, namely the corrected prosody information. The corrected prosody information is then combined with the first phoneme sequence to generate the second phoneme sequence, i.e., a phoneme sequence that includes prosody information.
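A hedged sketch of this dynamic-programming step is shown below. The formulation is simplified to two pause levels per word boundary (0 = no pause, 1 = pause); how the patent actually parameterizes the node and path probabilities is not specified, so the scoring scheme here is an assumption.

```python
import numpy as np

def corrected_prosody(node_prob, pause_prob):
    """node_prob[i, s]: prosody probability of pause level s at boundary i (from text prosody).
    pause_prob[i]: prior probability of a pause at boundary i derived from the
    pause length reference vector B. Returns the best pause-level sequence."""
    n, levels = node_prob.shape
    log_best = np.full((n, levels), -np.inf)
    back = np.zeros((n, levels), dtype=int)
    log_best[0] = np.log(node_prob[0] + 1e-9)
    for i in range(1, n):
        for s in range(levels):
            path_score = np.log(pause_prob[i] + 1e-9) if s == 1 else np.log(1.0 - pause_prob[i] + 1e-9)
            candidates = log_best[i - 1] + path_score + np.log(node_prob[i, s] + 1e-9)
            back[i, s] = int(np.argmax(candidates))
            log_best[i, s] = candidates[back[i, s]]
    path = [int(np.argmax(log_best[-1]))]       # backtrack the optimal path
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]

# Hypothetical usage with four word boundaries.
node_prob = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]])
pause_prob = np.array([0.1, 0.3, 0.6, 0.2])
print(corrected_prosody(node_prob, pause_prob))
```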
Step 4300, determining a first audio feature according to the second phoneme sequence and the speech rate reference vector.
In some optional embodiments, the determining the first audio feature according to the second phoneme sequence and the speech rate reference vector may further include: carrying out duration prediction on the basis of the second phoneme sequence to obtain a first phoneme duration; adjusting the first phoneme duration according to the speech speed reference vector to obtain a second phoneme duration; and expanding the second phoneme sequence according to the second phoneme duration to obtain a first audio feature.
The first phone duration may be a pronunciation duration of each phone predicted from the second phone sequence. The second phoneme duration may be a pronunciation duration of each phoneme considering a speech rate of the target speaker.
In a specific implementation, the second phoneme sequence is input into a duration prediction module, which predicts the first phoneme durations; the first phoneme durations are then adjusted according to the speech rate reference vector to obtain the second phoneme durations; and the second phoneme sequence is then expanded according to the second phoneme durations to obtain the first audio feature with the expanded number of frames.
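The duration adjustment and frame expansion can be sketched as follows; the frame shift, the array shapes, and the use of a scalar speech rate reference vector are assumptions for illustration.

```python
import numpy as np

def expand_phoneme_sequence(phoneme_encodings, first_durations, speech_rate_vector, frame_shift=0.0125):
    """phoneme_encodings: (num_phonemes, dim) encoded second phoneme sequence.
    first_durations: (num_phonemes,) predicted first phoneme durations in seconds.
    speech_rate_vector: scaling factor A derived from the speech rate reference vector."""
    second_durations = first_durations * speech_rate_vector            # adjusted durations L'
    frames = np.maximum(1, np.round(second_durations / frame_shift)).astype(int)
    return np.repeat(phoneme_encodings, frames, axis=0)                # frame-expanded first audio feature

# Hypothetical usage: three phonemes with 16-dimensional encodings.
X = expand_phoneme_sequence(np.random.randn(3, 16), np.array([0.08, 0.12, 0.10]), speech_rate_vector=1.2)
```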
Step 4400, determining a second audio feature according to the first audio feature and the style vector.
The second audio feature may be the audio feature obtained by applying the influence of the style vector to the first audio feature; the second audio feature is therefore more consistent with the speaking style of the target speaker.
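Step 4400's combination of the style vector with the first audio feature is described later (with fig. 3) as superposition or splicing; the sketch below shows both options. Broadcasting the style vector across all frames is an assumption made for illustration.

```python
import numpy as np

def apply_style(first_audio_feature, style_vector, mode="concat"):
    """Combine the style vector C with every frame of the first audio feature X."""
    tiled = np.tile(style_vector, (first_audio_feature.shape[0], 1))
    if mode == "concat":                                   # splicing
        return np.concatenate([first_audio_feature, tiled], axis=-1)
    return first_audio_feature + tiled                     # superposition (requires matching dimensions)

# Hypothetical usage: 240 frames of 80-dimensional features plus a 64-dimensional style vector.
X2 = apply_style(np.random.randn(240, 80), np.random.randn(64))
```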
Step 4500, determining the acoustic feature information according to the second audio feature based on an acoustic prediction model, where the acoustic prediction model is used to obtain acoustic feature information from the second audio feature.
In some embodiments of the application, before determining the acoustic feature information according to the second audio feature based on the acoustic prediction model, the method further includes: acquiring first audio data of the target speaker, wherein the first audio data is audio data of a preset text read by the target speaker; and performing model training based on the first audio data to obtain the acoustic prediction model.
After step 1300, step 1400 is executed to convert the acoustic feature information and generate target audio data corresponding to the target information.
In a specific implementation, the acoustic feature information is input into a vocoder, which converts it into the target audio data. Different vocoders can be selected according to different deployment scenarios and service requirements. Illustratively, conventional vocoders such as LPC or WORLD vocoders may be used; alternatively, a neural network vocoder may be used, such as an LPCNet, WaveNet, WaveRNN, HiFi-GAN, or MelGAN vocoder.
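As a hedged example of this conversion step, the snippet below inverts a mel spectrogram with the classical Griffin-Lim method via librosa; in practice a neural vocoder such as LPCNet or HiFi-GAN would take its place. The sample rate, FFT parameters, and the random placeholder features are assumptions, not values from the patent.

```python
import numpy as np
import librosa
import soundfile as sf

mel = np.abs(np.random.randn(80, 200)) ** 2     # placeholder acoustic feature information (power mel spectrogram)
waveform = librosa.feature.inverse.mel_to_audio(mel, sr=22050, n_fft=1024, hop_length=256)
sf.write("target_audio.wav", waveform, 22050)   # target audio data
```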
Please refer to fig. 3, which is a schematic diagram of the process of synthesizing target audio data according to an embodiment of the present application, taking the case where the target information is text information as an example. Specifically, the text information is input into a text analysis module 301, which outputs a first phoneme sequence and a text prosody; the text prosody and the first phoneme sequence are input into a personalized user acoustic model 302; corrected prosody information is then generated using the text prosody and the pause length reference vector B; the corrected prosody information is combined with the first phoneme sequence to generate a phoneme sequence containing the prosody information (the second phoneme sequence); duration prediction is then carried out on the second phoneme sequence to predict the pronunciation duration of each phoneme (the first phoneme duration L); the first phoneme duration L is adjusted using the speech rate reference vector A to obtain an adjusted second phoneme duration L', and the second phoneme sequence is expanded according to the adjusted second phoneme duration L' to obtain a first audio feature X with the expanded number of frames; the style vector C of the target speaker is then superposed on or spliced to the first audio feature X, and a second audio feature X' is output; finally, the second audio feature X' is input into the acoustic prediction model for acoustic prediction, which outputs the acoustic feature Y; the acoustic feature Y is input into the vocoder 303, which outputs the target audio data.
In the embodiments of the present application, the target information and the prosodic characteristic parameters of the target speaker are obtained, acoustic feature information is determined according to the target information and the prosodic characteristic parameters of the target speaker, and the acoustic feature information is converted to generate target audio data corresponding to the target information. In addition, the audio synthesis method and apparatus provided in this embodiment may be applied to screen reading on an electronic device, the timbre of a voice assistant, the timbre of an audio device, and the like; they are widely applicable and provide a good user experience.
It should be noted that, in the audio synthesis method provided in the embodiments of the present application, the execution subject may be an audio synthesis apparatus, or a control module in the audio synthesis apparatus for executing the audio synthesis method. In the embodiments of the present application, an audio synthesis apparatus executing the audio synthesis method is taken as an example to describe the audio synthesis apparatus provided in the embodiments of the present application.
Referring to fig. 4, an embodiment of the present application further provides an audio synthesis apparatus 400, where the audio synthesis apparatus 400 includes a first obtaining module 401, a second obtaining module 402, a first determining module 403, and a generating module 404.
The first obtaining module 401 is configured to obtain target information;
the second obtaining module 402 is configured to obtain prosodic characteristic parameters of the target speaker, where the prosodic characteristic parameters include a speech rate reference vector, a pause length reference vector, and a style vector;
the first determining module 403 is configured to determine acoustic feature information according to the target information and the prosody characteristic parameter;
the generating module 404 is configured to convert the acoustic feature information to generate target audio data corresponding to the target information.
Optionally, the first determining module includes: the text analysis unit is used for analyzing the target information to obtain text characteristic parameters, and the text characteristic parameters comprise a first phoneme sequence and text prosody; a first generating unit, configured to generate a second phoneme sequence according to the text prosody, the pause length reference vector, and the first phoneme sequence; a first determining unit, configured to determine a first audio feature according to the second phoneme sequence and the speech rate reference vector; a second determining unit, configured to determine a second audio feature according to the first audio feature and the style vector; a third determining unit, configured to determine the acoustic feature information according to the second audio feature based on an acoustic prediction model.
Optionally, the first determining unit is specifically configured to: generating corrected prosody information according to the text prosody and the pause length reference vector; and generating a second phoneme sequence according to the corrected prosody information and the first phoneme sequence.
Optionally, the second determining unit is specifically configured to: carrying out duration prediction on the basis of the second phoneme sequence to obtain a first phoneme duration; adjusting the first phoneme duration according to the speech speed reference vector to obtain a second phoneme duration; and expanding the second phoneme sequence according to the second phoneme duration to obtain a first audio feature.
Optionally, the apparatus further comprises: the third acquisition module is used for acquiring first audio data of the target speaker, wherein the first audio data is audio data of a preset text read by the target speaker; and the training module is used for carrying out model training based on the first audio data to obtain the acoustic prediction model, wherein the acoustic prediction model is used for obtaining acoustic characteristic information according to second audio characteristics.
Optionally, the prosodic characteristic parameter includes a speech rate reference vector, and the second obtaining module includes: the first acquisition unit is used for acquiring historical audio data of the target speaker; the fourth determining unit is used for determining the first average speech speed of the target speaker according to the historical audio data; and a fifth determining unit, configured to determine the speech rate reference vector according to the first average speech rate and a preset average speech rate.
Optionally, the prosodic characteristic parameter includes a pause length reference vector, and the second obtaining module includes: the second acquisition unit is used for acquiring historical audio data of the target speaker; a sixth determining unit, configured to determine pause probabilities corresponding to different syllable lengths according to the historical audio data; and the seventh determining unit is used for determining the pause length reference vector according to the pause probabilities corresponding to the different syllable lengths.
In the embodiment of the application, the target information and the prosodic characteristic parameters of the target speaker are obtained, the acoustic characteristic information is determined according to the target information and the prosodic characteristic parameters of the target speaker, the acoustic characteristic information is converted, and the target audio data corresponding to the target information is generated.
The audio synthesis apparatus in the embodiment of the present application may be an apparatus, or may be a component, an integrated circuit, or a chip in a terminal. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.
The audio synthesizing apparatus in the embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, and the embodiments of the present application are not specifically limited.
The audio synthesis apparatus provided in the embodiment of the present application can implement each process implemented in the method embodiment of fig. 1, and is not described here again to avoid repetition.
Optionally, as shown in fig. 5, an electronic device 500 is further provided in this embodiment of the present application, and includes a processor 501, a memory 502, and a program or an instruction stored in the memory 502 and executable on the processor 501, where the program or the instruction is executed by the processor 501 to implement each process of the foregoing audio synthesis method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.
Fig. 6 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 600 includes, but is not limited to: a radio frequency unit 601, a network module 602, an audio output unit 603, an input unit 604, a sensor 605, a display unit 606, a user input unit 607, an interface unit 608, a memory 609, a processor 610, and the like.
Those skilled in the art will appreciate that the electronic device 600 may further comprise a power source (e.g., a battery) for supplying power to the various components, and the power source may be logically connected to the processor 610 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system. The electronic device structure shown in fig. 6 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than those shown, or combine some components, or arrange different components, and thus, the description is omitted here.
Wherein, the processor 610 is configured to: acquiring target information; acquiring prosodic characteristic parameters of a target speaker, wherein the prosodic characteristic parameters comprise a speech speed reference vector, a pause length reference vector and a style vector; determining acoustic characteristic information according to the target information and the prosodic characteristic parameters; and converting the acoustic characteristic information to generate target audio data corresponding to the target information.
Optionally, the processor 610, when determining the acoustic feature information according to the target information and the prosody characteristic parameter, is configured to: analyzing the target information to obtain text characteristic parameters, wherein the text characteristic parameters comprise a first phoneme sequence and text prosody; generating a second phoneme sequence according to the text prosody, the pause length reference vector and the first phoneme sequence; determining a first audio characteristic according to the second phoneme sequence and the speech rate reference vector; determining a second audio feature according to the first audio feature and the style vector; determining the acoustic feature information according to the second audio feature based on an acoustic prediction model.
Optionally, the processor 610, when generating a second phoneme sequence according to the text prosody, the pause length reference vector and the first phoneme sequence, is configured to: generating corrected prosody information according to the text prosody and the pause length reference vector; and generating a second phoneme sequence according to the corrected prosody information and the first phoneme sequence.
Optionally, the processor 610, when determining the first audio feature according to the second phoneme sequence and the speech rate reference vector, is configured to: carrying out duration prediction on the basis of the second phoneme sequence to obtain a first phoneme duration; adjusting the first phoneme duration according to the speech speed reference vector to obtain a second phoneme duration; and expanding the second phoneme sequence according to the second phoneme duration to obtain a first audio feature.
Optionally, the processor 610, before determining the acoustic feature information according to the second audio feature based on the acoustic prediction model, is further configured to: acquiring first audio data of the target speaker, wherein the first audio data is audio data of a preset text read by the target speaker; performing model training based on the first audio data to obtain the acoustic prediction model; and the acoustic prediction model is used for obtaining acoustic characteristic information according to the second audio characteristic.
Optionally, the prosodic characteristic parameters include a speech rate reference vector, and the processor 610, when obtaining the prosodic characteristic parameters of the target speaker, includes: acquiring historical audio data of the target speaker; determining a first average speech speed of the target speaker according to the historical audio data; and determining the speech rate reference vector according to the first average speech rate and a preset average speech rate.
Optionally, the prosodic characteristic parameters include a pause length reference vector, and the processor 610, in obtaining the prosodic characteristic parameters of the target speaker, is configured to: acquiring historical audio data of the target speaker; determining pause probabilities corresponding to different syllable lengths according to the historical audio data; and determining a pause length reference vector according to the pause probabilities corresponding to the different syllable lengths.
In the embodiment of the application, the target information and the prosodic characteristic parameters of the target speaker are obtained, the acoustic characteristic information is determined according to the target information and the prosodic characteristic parameters of the target speaker, the acoustic characteristic information is converted, and the target audio data corresponding to the target information is generated.
It is to be understood that, in the embodiment of the present application, the input unit 604 may include a Graphics Processing Unit (GPU) 6041 and a microphone 6042, and the Graphics Processing Unit 6041 processes image data of a still picture or a video obtained by an image capturing apparatus (such as a camera) in a video capturing mode or an image capturing mode. The display unit 606 may include a display panel 6061, and the display panel 6061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 607 includes a touch panel 6071 and other input devices 6072. The touch panel 6071 is also referred to as a touch screen, and may include two parts: a touch detection device and a touch controller. The other input devices 6072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein. The memory 609 may be used to store software programs as well as various data including, but not limited to, application programs and an operating system. The processor 610 may integrate an application processor, which primarily handles the operating system, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 610.
The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the above-mentioned audio synthesis method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement each process of the above-mentioned audio synthesis method embodiment, and can achieve the same technical effect, and is not described here again to avoid repetition.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-chip, system-on-chip or system-on-chip, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved, e.g., the methods described may be performed in an order different than that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (16)

1. A method for audio synthesis, the method comprising:
acquiring target information;
acquiring prosodic characteristic parameters of a target speaker, wherein the prosodic characteristic parameters comprise a speech speed reference vector, a pause length reference vector and a style vector;
determining acoustic characteristic information according to the target information and the prosodic characteristic parameters;
and converting the acoustic characteristic information to generate target audio data corresponding to the target information.
2. The method of claim 1, wherein determining acoustic feature information based on the target information and the prosodic characteristic parameters comprises:
analyzing the target information to obtain text characteristic parameters, wherein the text characteristic parameters comprise a first phoneme sequence and text prosody;
generating a second phoneme sequence according to the text prosody, the pause length reference vector and the first phoneme sequence;
determining a first audio characteristic according to the second phoneme sequence and the speech rate reference vector;
determining a second audio feature according to the first audio feature and the style vector;
determining the acoustic feature information according to the second audio feature based on an acoustic prediction model.
3. The method of claim 2, wherein generating a second phoneme sequence based on the text prosody, the pause length reference vector, and the first phoneme sequence comprises:
generating corrected prosody information according to the text prosody and the pause length reference vector;
and generating a second phoneme sequence according to the corrected prosody information and the first phoneme sequence.
4. The method of claim 2, wherein determining a first audio feature from the second phone sequence and the speech rate reference vector comprises:
carrying out duration prediction on the basis of the second phoneme sequence to obtain a first phoneme duration;
adjusting the first phoneme duration according to the speech speed reference vector to obtain a second phoneme duration;
and expanding the second phoneme sequence according to the second phoneme duration to obtain a first audio feature.
5. The method of claim 2, wherein before determining the acoustic feature information based on the acoustic prediction model from the second audio feature, the method further comprises:
acquiring first audio data of the target speaker, wherein the first audio data is audio data of a preset text read by the target speaker;
performing model training based on the first audio data to obtain the acoustic prediction model;
and the acoustic prediction model is used for obtaining acoustic characteristic information according to the second audio characteristic.
6. The method of claim 1, wherein the prosodic feature parameters include a speech rate reference vector, and wherein the obtaining prosodic feature parameters for the target speaker comprises:
acquiring historical audio data of the target speaker;
determining a first average speech speed of the target speaker according to the historical audio data;
and determining the speech rate reference vector according to the first average speech rate and a preset average speech rate.
7. The method of claim 1, wherein the prosodic feature parameters include a pause length reference vector, and wherein the obtaining prosodic feature parameters for the target speaker comprises:
acquiring historical audio data of the target speaker;
determining pause probabilities corresponding to different syllable lengths according to the historical audio data;
and determining a pause length reference vector according to the pause probabilities corresponding to the different syllable lengths.
8. An audio synthesizing apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring target information;
the second acquisition module is used for acquiring prosodic characteristic parameters of the target speaker, wherein the prosodic characteristic parameters comprise a speech speed reference vector, a pause length reference vector and a style vector;
the first determining module is used for determining acoustic characteristic information according to the target information and the prosody characteristic parameter;
and the generating module is used for converting the acoustic characteristic information and generating target audio data corresponding to the target information.
9. The apparatus of claim 8, wherein the first determining module comprises:
the text analysis unit is used for analyzing the target information to obtain text characteristic parameters, and the text characteristic parameters comprise a first phoneme sequence and text prosody;
a first generating unit, configured to generate a second phoneme sequence according to the text prosody, the pause length reference vector, and the first phoneme sequence;
a first determining unit, configured to determine a first audio feature according to the second phoneme sequence and the speech rate reference vector;
a second determining unit, configured to determine a second audio feature according to the first audio feature and the style vector;
a third determining unit, configured to determine the acoustic feature information according to the second audio feature based on an acoustic prediction model.
10. The apparatus according to claim 9, wherein the first determining unit is specifically configured to:
generating corrected prosody information according to the text prosody and the pause length reference vector;
and generating a second phoneme sequence according to the corrected prosody information and the first phoneme sequence.
11. The apparatus according to claim 9, wherein the second determining unit is specifically configured to:
carrying out duration prediction on the basis of the second phoneme sequence to obtain a first phoneme duration;
adjusting the first phoneme duration according to the speech speed reference vector to obtain a second phoneme duration;
and expanding the second phoneme sequence according to the second phoneme duration to obtain a first audio feature.
12. The apparatus of claim 9, further comprising:
the third acquisition module is used for acquiring first audio data of the target speaker, wherein the first audio data is audio data of a preset text read by the target speaker;
and the training module is used for carrying out model training based on the first audio data to obtain the acoustic prediction model, wherein the acoustic prediction model is used for obtaining acoustic characteristic information according to second audio characteristics.
13. The apparatus of claim 8, wherein the prosodic characteristic parameter comprises a speech rate reference vector, and the second obtaining module comprises:
the first acquisition unit is used for acquiring historical audio data of the target speaker;
the fourth determining unit is used for determining the first average speech speed of the target speaker according to the historical audio data;
and a fifth determining unit, configured to determine the speech rate reference vector according to the first average speech rate and a preset average speech rate.
14. The apparatus of claim 8, wherein the prosodic characteristic parameters include a pause length reference vector, and the second obtaining module comprises:
the second acquisition unit is used for acquiring historical audio data of the target speaker;
a sixth determining unit, configured to determine pause probabilities corresponding to different syllable lengths according to the historical audio data;
and the seventh determining unit is used for determining the pause length reference vector according to the pause probabilities corresponding to the different syllable lengths.
15. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions when executed by the processor implementing the steps of the audio synthesis method of any of claims 1 to 7.
16. A readable storage medium, on which a program or instructions are stored, which when executed by a processor, carry out the steps of the audio synthesis method according to any one of claims 1 to 7.
CN202111148956.2A 2021-09-28 2021-09-28 Audio synthesis method and device, electronic equipment and readable storage medium Pending CN113870828A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111148956.2A CN113870828A (en) 2021-09-28 2021-09-28 Audio synthesis method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111148956.2A CN113870828A (en) 2021-09-28 2021-09-28 Audio synthesis method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113870828A true CN113870828A (en) 2021-12-31

Family

ID=78992368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111148956.2A Pending CN113870828A (en) 2021-09-28 2021-09-28 Audio synthesis method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113870828A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination