CN112885327A - Speech synthesis method, apparatus, device and storage medium - Google Patents

Speech synthesis method, apparatus, device and storage medium

Info

Publication number
CN112885327A
Authority
CN
China
Prior art keywords
task
tasks
gpu
batch processing
voice synthesis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110082850.0A
Other languages
Chinese (zh)
Other versions
CN112885327B (en)
Inventor
陈小建
陈闽川
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110082850.0A priority Critical patent/CN112885327B/en
Priority claimed from CN202110082850.0A external-priority patent/CN112885327B/en
Publication of CN112885327A publication Critical patent/CN112885327A/en
Application granted granted Critical
Publication of CN112885327B publication Critical patent/CN112885327B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to the field of artificial intelligence and discloses a speech synthesis method, apparatus, device and storage medium. The method comprises the following steps: receiving speech synthesis tasks from a user terminal, the number of speech synthesis tasks being at least one; adding the speech synthesis tasks into a task queue, and packing all speech synthesis tasks in the task queue according to a preset task packing rule to generate a batch processing task; acquiring the working state of each GPU card in the GPU server stored in a GPU management queue, selecting one GPU card according to the working state, sending the batch processing task to the selected GPU card for processing, and acquiring a batch processing result; and splitting the batch processing result according to the speech synthesis tasks before packing, obtaining at least one speech synthesis result corresponding to the at least one speech synthesis task. In addition, the invention also relates to blockchain technology: information related to the speech synthesis tasks can be stored in a blockchain.

Description

Speech synthesis method, apparatus, device and storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a speech synthesis method, apparatus, device, and storage medium.
Background
Speech synthesis is a technology that converts text into speech. With the development of science and technology it has gradually been applied in a variety of scenarios: it can help people with reading impairments to read, increase the accessibility of text, and offer more possibilities for interaction between machines and people. Accordingly, highly human-like speech synthesis with real-time performance has gradually come into public view.
In the prior art, to give synthesized audio a higher level of personification, algorithms such as deep neural networks are generally used for synthesis. Such methods involve a large number of floating-point matrix operations: if an ordinary CPU performs the computation, the waiting time during synthesis becomes too long to meet real-time requirements; if GPU cards are used, each GPU card can only process one concurrent task at a time, so the resource utilization rate of each card is low, the number of GPU cards needed increases, and cost rises.
Disclosure of Invention
The invention mainly aims to solve the technical problem of low GPU card resource utilization during speech synthesis in existing speech synthesis technology.
The invention provides a voice synthesis method in a first aspect, which comprises the following steps:
receiving voice synthesis tasks from a user terminal, wherein the number of the voice synthesis tasks is at least one;
adding the voice synthesis tasks into a task queue, and packaging all the voice synthesis tasks in the task queue according to a preset task packaging rule to generate batch processing tasks;
acquiring the working state of each GPU card in a GPU server stored in a GPU management queue, selecting one GPU card according to the working state, sending the batch processing tasks to the selected GPU card to perform parallel processing on all voice synthesis tasks in the batch processing tasks, and obtaining batch processing results;
and splitting the batch processing result according to the speech synthesis tasks before packing to obtain at least one speech synthesis result corresponding to the at least one speech synthesis task.
Optionally, in a first implementation manner of the first aspect of the present invention, the adding the speech synthesis task into a task queue, and packing all speech synthesis tasks in the task queue according to a preset task packing rule to generate a batch processing task includes:
adding the voice synthesis task into a task queue, controlling the value of a counter to add 1, and controlling a timer to start timing;
judging whether the numerical value in the counter reaches a preset counting threshold value or not;
judging whether the timing time in the timer reaches a preset time or not;
if the numerical value in the counter reaches a counting threshold value and/or the timing time in the timer reaches preset time, resetting the timer and the counter, and packaging the voice synthesis tasks in the task queue to generate batch processing tasks.
Optionally, in a second implementation manner of the first aspect of the present invention, the adding the speech synthesis task into a task queue, and packing all speech synthesis tasks in the task queue according to a preset task packing rule to generate a batch processing task further includes:
and if the numerical value in the counter does not reach the counting threshold value and the timing time in the timer does not reach the preset time, continuing to add the received voice synthesis task into a task queue for temporary storage.
Optionally, in a third implementation manner of the first aspect of the present invention, before adding the speech synthesis task into a task queue, controlling a value of a counter to add 1, and controlling a timer to start timing, the method further includes:
acquiring a computing capacity parameter of each GPU card;
obtaining the maximum task number which can be processed by each GPU card according to the computing capacity parameter;
and setting a counting threshold value according to the maximum task number.
Optionally, in a fourth implementation manner of the first aspect of the present invention, before adding the at least one speech synthesis task into the task queue, controlling a value of a counter to add 1, and controlling a timer to start timing, the method further includes:
acquiring the maximum waiting time of a service and the maximum synthesis time of a single batch processing task;
and setting the preset time of a timer according to the maximum service waiting time and the maximum synthesis time of the single batch processing task, wherein the preset time of the timer is the difference value of the maximum service waiting time and the maximum synthesis time of the single batch processing task.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the obtaining a working state of each GPU card in a GPU server stored in a GPU management queue, selecting one GPU card according to the working state, sending the batch processing task to the selected GPU card, and performing parallel processing on all speech synthesis tasks in the batch processing task, where obtaining a batch processing result includes:
the method comprises the steps of obtaining the working state of each GPU card in the GPU server in advance, and storing the working states of the GPU cards into a GPU management queue, wherein the working state comprises "idle" and "working";
acquiring the working state of each GPU card in the GPU management queue, and selecting one GPU card with an idle working state in the GPU management queue;
sending the batch processing tasks to a selected GPU card, marking the working state of the selected GPU card stored in the GPU management queue as 'working', and utilizing the selected GPU card to perform parallel processing on all voice synthesis tasks in the batch processing tasks;
and after the processing is finished, marking the working state of the selected GPU card stored in the GPU management queue as idle, and outputting a batch processing result.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the performing, by using the selected GPU card, parallel processing on all speech synthesis tasks in the batch processing task includes:
acquiring text contents contained in all voice synthesis tasks in the batch processing task;
and performing voice conversion according to the text content by using a pre-established voice synthesis model to obtain a batch processing result.
A second aspect of the present invention provides a speech synthesis apparatus comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for receiving voice synthesis tasks from a user terminal, and the number of the voice synthesis tasks is at least one;
the task packing module is used for adding the voice synthesis tasks into a task queue, packing all the voice synthesis tasks in the task queue according to a preset task packing rule and generating batch processing tasks;
the batch task processing module is used for acquiring the working state of each GPU card in the GPU server stored in the GPU management queue, selecting one GPU card according to the working state, sending the batch processing task to the selected GPU card, and performing parallel processing on all the voice synthesis tasks in the batch processing task to obtain a batch processing result;
and the result generation module is used for splitting the batch processing result according to the voice synthesis task before packaging to obtain a voice synthesis result corresponding to the voice synthesis task.
Optionally, in a first implementation manner of the second aspect of the present invention, the task packing module includes:
the task receiving unit is used for adding the voice synthesis task into a task queue, controlling the value of a counter to be added by 1 and controlling a timer to start timing;
the counting unit is used for judging whether the numerical value in the counter reaches a preset counting threshold value or not;
the timing unit is used for judging whether the timing time in the timer reaches the preset time or not;
and the packing unit is used for resetting the timer and the counter if the numerical value in the counter reaches a counting threshold value and/or the timing time in the timer reaches preset time, packing the voice synthesis tasks in the task queue and generating batch processing tasks.
Optionally, in a second implementation manner of the second aspect of the present invention, the packing unit is further specifically configured to:
and if the numerical value in the counter does not reach the counting threshold value and the timing time in the timer does not reach the preset time, continuing to add the received voice synthesis task into a task queue for temporary storage.
Optionally, in a third implementation manner of the second aspect of the present invention, the task packing module further includes:
a counting threshold value setting unit, configured to obtain a computing capability parameter of each GPU card; obtaining the maximum task number which can be processed by each GPU card according to the computing capacity parameter; and setting a counting threshold value according to the maximum task number.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the task packing module further includes:
the timing time setting unit is used for acquiring the maximum waiting time of the service and the maximum synthesis time of the single batch processing task; and setting the preset time of a timer according to the maximum service waiting time and the maximum synthesis time of the single batch processing task, wherein the preset time of the timer is the difference value of the maximum service waiting time and the maximum synthesis time of the single batch processing task.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the batch task processing module includes:
a working state obtaining unit, configured to obtain a working state of each GPU card in the GPU server in advance, and store the working state of each GPU card into a GPU management queue, where the working state includes "idle" and "working";
the batch task processing unit is used for acquiring the working state of each GPU card in the GPU management queue, selecting one GPU card with an idle working state in the GPU management queue, sending the batch processing tasks to the selected GPU card, marking the working state of the selected GPU card stored in the GPU management queue as 'working', and utilizing the selected GPU card to perform parallel processing on all the voice synthesis tasks in the batch processing tasks;
and the batch result generating unit is used for marking the working state of the selected GPU card stored in the GPU management queue as idle and outputting a batch processing result.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the batch task processing unit includes:
a task text acquiring subunit, configured to acquire text contents included in all speech synthesis tasks in the batch processing task;
and the voice synthesis subunit is used for performing voice conversion according to the text content by using a pre-established voice synthesis model to obtain a batch processing result.
A third aspect of the present invention provides a speech synthesis device comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the speech synthesis device to perform the steps of the speech synthesis method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the above-described speech synthesis method.
In the technical scheme provided by the invention, a plurality of voice synthesis tasks from a user terminal are received and added into a task queue, and all the voice synthesis tasks in the task queue are packed according to a preset task packing rule to generate batch processing tasks; selecting a GPU card according to the working state of each GPU card in the GPU management queue, and sending the batch processing tasks to the selected GPU card to perform parallel processing on all the voice synthesis tasks in the batch processing tasks to obtain batch processing results; and splitting the batch processing result according to the voice synthesis task before packaging to obtain a voice synthesis result corresponding to the voice synthesis task. The voice synthesis method provided by the embodiment of the invention improves the resource utilization rate of the GPU card during voice synthesis in the voice synthesis technology, thereby correspondingly reducing the number of needed GPU cards and reducing the cost.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of another embodiment of a speech synthesis method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of another embodiment of a speech synthesis method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another embodiment of a speech synthesis method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of a speech synthesis apparatus according to the present invention;
FIG. 6 is a schematic diagram of another embodiment of a speech synthesis apparatus according to an embodiment of the present invention;
fig. 7 is a schematic diagram of an embodiment of a speech synthesis apparatus in an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a voice synthesis method, a device, equipment and a storage medium, and in the technical scheme provided by the invention, voice synthesis tasks from a user terminal are received, and the number of the voice synthesis tasks is at least one; adding the voice synthesis tasks into a task queue, and packaging all the voice synthesis tasks in the task queue according to a preset task packaging rule to generate batch processing tasks; acquiring the working state of each GPU card in a GPU server stored in a GPU management queue, selecting one GPU card according to the working state, sending the batch processing tasks to the selected GPU card, and performing parallel processing on all voice synthesis tasks in the batch processing tasks to obtain batch processing results; and splitting the batch processing result according to the voice synthesis task before packaging to obtain a voice synthesis result corresponding to the voice synthesis task. The voice synthesis method provided by the embodiment of the invention improves the resource utilization rate of the GPU card during voice synthesis in the voice synthesis technology, and reduces the cost.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For understanding, a specific flow of an embodiment of the present invention is described below, and referring to fig. 1, an embodiment of a speech synthesis method in an embodiment of the present invention includes:
101. receiving a voice synthesis task from a user terminal;
it is to be understood that the executing subject of the present invention may be a speech synthesis apparatus, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.
This embodiment mainly concerns a real-time speech synthesis system. Speech synthesis, also called Text-To-Speech (TTS), is a technology in which a computer and dedicated devices are used to produce artificial speech that imitates a human voice. The real-time speech synthesis system referred to in the present application specifically converts text content input by the user into speech with a specified timbre and tone in real time.
Specifically, before a synthesis operation is performed, the user inputs the text to be synthesized through a user terminal. The user terminal generates a speech synthesis task from the text according to a certain task generation rule and sends the generated task to a distribution server, where the number of speech synthesis tasks is at least one. The user terminal may be a computer, a mobile phone, or another electronic device or server capable of text input and task requests, and the distribution server may be a Central Processing Unit (CPU) server.
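As a purely illustrative sketch (the disclosure does not specify a task format or transport), a speech synthesis task generated at the user terminal might look like the following; all field names here are assumptions.
```python
# Hypothetical sketch of a speech synthesis task as a user terminal might
# generate it before sending it to the distribution (CPU) server. The field
# names and the idea of a task id are illustrative assumptions, not part of
# the disclosure.
from dataclasses import dataclass, field
import time
import uuid

@dataclass
class SpeechSynthesisTask:
    text: str                                            # text to synthesize
    task_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created_at: float = field(default_factory=time.monotonic)

task = SpeechSynthesisTask(text="Text the user wants converted to speech.")
# The user terminal would send `task` to the distribution server, e.g. over
# HTTP or RPC; the transport mechanism is not specified in this disclosure.
```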
102. Adding the voice synthesis tasks into a task queue, and packaging all the voice synthesis tasks in the task queue according to a preset task packaging rule to generate batch processing tasks;
in this step, a task packing rule is first pre-established. For example, the rule may pack a certain number of speech synthesis tasks to generate a batch processing task, or it may pack the speech synthesis tasks temporarily stored in the task queue within a period of time to generate a batch processing task.
After a task packing rule is established, when a distribution server receives a voice synthesis task from a user terminal, the voice synthesis task is added into a task queue for temporary storage, and according to the task packing rule, all voice synthesis tasks temporarily stored in the task queue are packed according to a certain quantity or within a period of time to obtain batch processing tasks.
Because the speech synthesis results to be obtained in the present proposal are required to be highly anthropomorphic, a real-time speech synthesis system built on a deep neural network is adopted to process the speech synthesis tasks; computing with a deep neural network algorithm involves a large number of floating-point matrix operations.
GPU cards, which have many computing cores and can perform a large number of floating-point matrix operations in parallel, are therefore used for synthesis; for such a card, the computation of a single speech synthesis task in a real-time system is comparatively small. However, each GPU card can only process one concurrent task at a time. If every single speech synthesis task were treated as one concurrent task, a multitask system would need as many GPU cards as concurrent tasks, and to preserve the real-time performance of speech synthesis the number of cards in the GPU server would have to grow, raising cost. Moreover, a card processing a single speech synthesis task has a large computing margin left over, which wastes computing power.
Therefore, in this step, exploiting the GPU card's ability to perform a large number of floating-point matrix operations in parallel, the packing rule is preset according to the characteristics of the cards in use: a certain number of speech synthesis tasks are packed into a batch processing task, each batch processing task is distributed to one GPU card as a single concurrent task, and that card processes the speech synthesis tasks within the batch in parallel. This maximizes the use of one card's computing power and reduces the number of GPU cards required.
103. Acquiring the working state of each GPU card in a GPU server stored in a GPU management queue, selecting one GPU card according to the working state, sending the batch processing tasks to the selected GPU card, and performing parallel processing on all voice synthesis tasks in the batch processing tasks to obtain batch processing results;
the distribution server in this embodiment further includes a GPU management queue, where the GPU management queue stores the operating states of the GPU cards included in the GPU server, where the GPU server in this proposal includes a plurality of GPU cards. The GPU management queue selects one GPU card according to the stored working state of the GPU card, and sends the batch processing tasks generated in the previous step to the currently selected GPU card for processing, wherein each batch processing task uses one GPU card for processing, when a plurality of batch processing tasks exist, a plurality of GPU cards in the GPU server are used for processing simultaneously, and the real-time performance of voice synthesis is met to a certain extent.
Specifically, parallel processing of the speech synthesis tasks in the batch processing task means performing speech synthesis on the text in the batch processing task with a speech synthesis model established based on the deep neural network, thereby obtaining the batch processing result.
The speech synthesis is carried out according to the speech synthesis model established based on the deep neural network, so that the speech synthesis result generated in the proposal has highly anthropomorphic tone and timbre.
104. And splitting the batch processing result according to the voice synthesis task before packaging to obtain a voice synthesis result corresponding to the voice synthesis task.
After the batch processing result is obtained in the previous step, the voice files in the batch processing result are split according to the voice synthesis task before being packaged in the batch processing task, and the voice synthesis results corresponding to the voice synthesis tasks one to one are obtained.
Specifically, when the speech synthesis tasks in the task queue are packed in step 102, a task separator is added between tasks to delimit each speech synthesis task in the batch processing task. During speech synthesis, the separators are carried into the batch processing result at their original positions; after the GPU card has processed the batch in parallel, the batch processing result is split at the separators, and the separators are then deleted, yielding speech synthesis results in one-to-one correspondence with the speech synthesis tasks.
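A minimal sketch of this separator mechanism follows, under the assumption that the separator is a marker token the synthesis pipeline preserves at its original position; the token value and the data layout are illustrative, not taken from the disclosure.
```python
# Sketch of separator-based packing and splitting. SEP is an assumed marker
# token; a real system would use whatever delimiter its synthesis pipeline
# passes through unchanged.
SEP = "<|task_sep|>"

def pack_with_separators(texts):
    """Join the queued tasks' texts into one batch input."""
    return SEP.join(texts)

def split_batch_result(segments):
    """Partition a batch result at the separator markers and drop them,
    yielding one result per original speech synthesis task."""
    results, current = [], []
    for seg in segments:
        if seg == SEP:
            results.append(current)
            current = []
        else:
            current.append(seg)
    results.append(current)
    return results

batch_input = pack_with_separators(["first text", "second text"])
# After synthesis the separator reappears between the two tasks' audio
# segments, so the result splits back into two per-task results:
split_batch_result(["audio_a1", "audio_a2", SEP, "audio_b1"])
# -> [["audio_a1", "audio_a2"], ["audio_b1"]]
```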
In the speech synthesis method described above, received speech synthesis tasks are packed into batch processing tasks according to the preset rule; exploiting the GPU card's capacity for large-scale parallel floating-point matrix operations, each batch processing task is sent to one GPU card for synthesis to obtain a batch processing result, which is then split according to the original speech synthesis tasks to obtain the speech synthesis results corresponding to them.
In the embodiment of the invention, the resource utilization rate of the GPU card during voice synthesis in the voice synthesis technology is improved, and the cost is reduced.
Referring to fig. 2, another embodiment of the speech synthesis method according to the embodiment of the present invention includes:
201. receiving a voice synthesis task from a user terminal;
the specific content in this step is the same as that in step 101 in the foregoing embodiment, and is not described here again.
202. Adding the voice synthesis tasks into a task queue, and packaging all the voice synthesis tasks in the task queue according to a preset task packaging rule to generate batch processing tasks;
the specific content in this step is the same as that in step 102 in the foregoing embodiment, and is not described here again.
203. The method comprises the steps of obtaining the working state of each GPU card in a GPU server in advance, storing the working state of each GPU card into a GPU management queue, obtaining the working state of each GPU card in the GPU management queue, and selecting one GPU card with an idle working state from the GPU management queue;
the distribution server in this embodiment further includes a GPU management queue, and after the system is started, obtains a working state of each GPU card in the GPU server in advance, and stores the obtained working state of each GPU card into the GPU management queue, where the working state includes "idle" and "in-work". After the batch processing task generated in the above step is obtained, the distribution server in this embodiment obtains the working state of each GPU card in the GPU management queue, and selects one GPU card with an "idle" working state from the GPU cards.
When the GPU management queue selects a GPU card with a working state of "idle", the GPU card is selected from the GPU cards with a working state of "idle" according to a preset selection sequence, and specifically, the preset selection sequence may be random selection or polling selection.
In addition, the GPU management queue can also count the number of GPU cards in various working states in the current GPU server, and if the number of GPU cards in an "idle" state in the GPU server is 0, it is prompted that the number of added GPU cards is insufficient, and the number of GPU cards needs to be increased.
Further, the working state of a GPU card also includes an "abnormal" state. When the selected GPU card fails while synthesizing a batch processing task, processing-failure information is sent to the GPU management queue, which marks the state of the currently selected card as "abnormal". The GPU management queue also traverses the GPU cards in the GPU server every 3 seconds and sends query messages; if a card marked "abnormal" has returned to normal, its state mark stored in the GPU management queue is updated to "idle".
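As a sketch of this bookkeeping (the class shape, method names, and the injected health probe are illustrative assumptions, with state kept in memory for simplicity):
```python
import threading

class GPUManagementQueue:
    """Sketch of the GPU management queue: per-card states "idle",
    "working", "abnormal"; an idle card is chosen for each batch; a failed
    card is marked abnormal and re-probed periodically."""

    def __init__(self, num_cards):
        self.states = {card: "idle" for card in range(num_cards)}
        self.lock = threading.Lock()

    def select_idle_card(self):
        with self.lock:
            for card, state in self.states.items():  # polling selection
                if state == "idle":
                    self.states[card] = "working"
                    return card
        return None  # no idle card: more GPU cards need to be added

    def release(self, card):
        with self.lock:
            self.states[card] = "idle"

    def mark_abnormal(self, card):
        with self.lock:
            self.states[card] = "abnormal"

    def health_check(self, probe):
        """Run every 3 seconds; `probe(card)` is an assumed callable that
        queries whether an abnormal card has returned to normal."""
        with self.lock:
            for card, state in self.states.items():
                if state == "abnormal" and probe(card):
                    self.states[card] = "idle"
```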
204. Sending the batch processing tasks to the selected GPU cards, marking the working states of the selected GPU cards stored in a GPU management queue as 'working', utilizing the selected GPU cards to perform parallel processing on all the voice synthesis tasks in the batch processing tasks, marking the working states of the selected GPU cards stored in the GPU management queue as 'idle' after the processing is finished, and outputting batch processing results;
and after the currently selected GPU card is determined, sending the batch processing task generated in the previous step to the GPU card, acquiring text contents contained in the batch processing task, calculating the text contents by using a pre-established speech synthesis model through the GPU card, generating a speech file corresponding to the text contents contained in the current batch processing task, and obtaining a batch processing result.
Each time the distribution server generates a batch processing task, the GPU management queue selects a GPU card whose working state is "idle" to synthesize it, and the selected card's working state is changed to "working"; after the card completes the synthesis, it sends the batch processing result to the distribution server, and the distribution server changes the card's working state from "working" back to "idle".
In addition, the speech synthesis model is pre-established before the GPU card processes batch processing tasks. It can be built as follows: collect a variety of audio files read by humans together with the text content corresponding to each file; label the audio files with tags such as scene and emotion; form a speech generation training set from the labeled audio files and their corresponding text contents; and train a deep neural network algorithm with this training set to obtain the speech synthesis model.
The tone and timbre most suitable for the current text can be computed by the speech synthesis model, so that a speech file with a high degree of personification is generated.
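For illustration only, the training-set assembly described above might be organized as follows; the file layout, label vocabulary, and training call are assumptions, since the disclosure does not fix a model architecture.
```python
# Assumed layout of the speech generation training set: human-read audio
# files paired with their text contents and labeled with scene/emotion tags.
training_set = [
    {
        "audio_path": "recordings/sample_0001.wav",
        "text": "The text content read aloud in this recording.",
        "labels": {"scene": "announcement", "emotion": "neutral"},
    },
    # ... more labeled (audio, text) pairs
]

def train_speech_synthesis_model(dataset):
    """Placeholder: train a deep-neural-network TTS model on
    (text, labels) -> audio pairs. The actual network and loss are not
    specified in the disclosure."""
    model = object()  # stand-in for a real model object
    for example in dataset:
        _ = (example["text"], example["labels"], example["audio_path"])
    return model
```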
205. And splitting the batch processing result according to the voice synthesis task before packaging to obtain a voice synthesis result corresponding to the voice synthesis task.
The specific content in this step is the same as that in step 104 in the previous embodiment, and is not described here again.
In the embodiment of the invention, the resource utilization rate of the GPU card during voice synthesis in the voice synthesis technology is improved while the voice synthesis result is ensured to be highly anthropomorphic and meets the requirement of real-time performance, so that the cost is relatively reduced.
Referring to fig. 3, another embodiment of the speech synthesis method according to the embodiment of the present invention includes:
301. receiving a voice synthesis task from a user terminal;
the specific content in this step is the same as that in step 101 in the foregoing embodiment, and is not described here again.
302. Adding the voice synthesis task into a task queue, controlling the value of a counter to add 1, and controlling a timer to start timing;
and after receiving the voice synthesis task from the user terminal, the distribution server adds the voice synthesis task into the task queue for temporary storage.
The preset packing rule adopted in this embodiment uses a counter to count the number of speech synthesis tasks in the task queue and a timer to bound the maximum waiting time of speech synthesis tasks in the queue. A counter and a timer are therefore set in the distribution server in advance, with both initial values set to 0. The counter is given a preset count threshold, which determines the maximum number of speech synthesis tasks that a single batch processing task may contain; the timer's preset time determines the maximum time a speech synthesis task may wait in the task queue.
After the system is initialized, the distribution server receives the voice synthesis tasks from the user terminal, controls the timer to start timing, and controls the value of the counter to be increased by 1 when one voice synthesis task is added into the task queue each time.
303. Judging whether the numerical value in the counter reaches a preset counting threshold value or not; judging whether the timing time in the timer reaches the preset time or not; if the numerical value in the counter reaches the counting threshold value and/or the timing time in the timer reaches the preset time, resetting the timer and the counter, packaging the voice synthesis tasks in the task queue, and generating a batch processing task;
judging whether the numerical value in the counter reaches a preset counting threshold value or not; and simultaneously judging whether the timing time in the timer reaches the preset time.
If the numerical value in the counter reaches the counting threshold value, packaging the voice synthesis task temporarily stored in the current task queue to generate a batch processing task; at the same time, the values of the counter and the timer are reset to initial values, i.e., the values of the counter and the timer are set to 0.
If the timing time in the timer reaches the preset time, whether any speech synthesis task is temporarily stored in the task queue is checked. If so, the tasks temporarily stored in the queue are packed into a batch processing task, and the values of the timer and counter are reset to their initial values; if the current task queue holds no temporarily stored task, only the timer is reset to its initial value, and the received speech synthesis tasks continue to be added to the task queue for temporary storage.
And if the numerical value in the counter does not reach the counting threshold value and the timing time in the timer does not reach the preset time, continuing to add at least one received voice synthesis task into the task queue for temporary storage and waiting for the packaging operation of the tasks.
And limiting the maximum number of voice synthesis tasks in the batch processing tasks by using a counter with a set counting threshold, and packing the voice synthesis tasks temporarily stored in the task queue to generate the batch processing tasks when the voice synthesis tasks temporarily stored in the task queue reach the maximum number of voice synthesis tasks which can be contained in the batch processing tasks. The received voice synthesis tasks are packed according to a certain number in the mode, so that a single GPU card can be used for processing in the follow-up process, and the number of GPU cards needed at the same time is saved.
In this step, at least one received speech synthesis task is packaged to generate a batch processing task. The number of voice synthesis tasks contained in each batch processing task is determined by setting a counter and a timer according to the number of single tasks which can be simultaneously processed by the GPU card adopted in the proposal and the waiting time of the voice synthesis tasks in a task queue; the voice synthesis tasks temporarily stored in the task queue are packed to generate batch processing tasks in such a way, so that a single GPU card can be used for processing subsequently, the resource utilization rate of each GPU card is improved, and the number of GPU cards required to be used subsequently is reduced.
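For concreteness, the counter/timer rule of this step can be sketched as follows; the class shape, method names, and use of a monotonic clock are illustrative assumptions, and a production server would also call poll() periodically so the timer can fire between task arrivals.
```python
import time

class TaskPacker:
    """Minimal single-threaded sketch of the counter/timer packing rule."""

    def __init__(self, count_threshold, preset_time):
        self.count_threshold = count_threshold  # max tasks per batch
        self.preset_time = preset_time          # max queue waiting time (s)
        self.queue = []
        self.counter = 0
        self.timer_start = None                 # None means timer is reset

    def add_task(self, task):
        if self.timer_start is None:
            self.timer_start = time.monotonic()  # timer starts timing
        self.queue.append(task)                  # temporary storage
        self.counter += 1                        # counter value + 1
        return self.poll()

    def poll(self):
        timed_out = (self.timer_start is not None and
                     time.monotonic() - self.timer_start >= self.preset_time)
        if self.counter >= self.count_threshold or timed_out:
            if not self.queue:           # timer fired with an empty queue:
                self.timer_start = None  # reset the timer only
                return None
            batch = self.queue           # pack queued tasks into a batch
            self.queue, self.counter, self.timer_start = [], 0, None
            return batch                 # one batch processing task
        return None                      # keep waiting
```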
304. Acquiring the working state of each GPU card in a GPU server stored in a GPU management queue, selecting one GPU card according to the working state, sending the batch processing tasks to the selected GPU card, and performing parallel processing on all voice synthesis tasks in the batch processing tasks to obtain batch processing results;
the distribution server in this embodiment further includes a GPU management queue, where the GPU management queue stores the operating states of the GPU cards included in the GPU server, where the GPU server in this proposal includes a plurality of GPU cards. The GPU management queue selects one GPU card according to the stored working state of the GPU card, and sends the batch processing tasks generated in the previous step to the currently selected GPU card for processing, wherein each batch processing task uses one GPU card for processing, when a plurality of batch processing tasks exist, a plurality of GPU cards in the GPU server are used for processing simultaneously, and the real-time performance of voice synthesis is met to a certain extent.
Specifically, parallel processing of all the speech synthesis tasks in the batch processing task means performing speech synthesis on the text contents of all those tasks with a speech synthesis model established based on a deep neural network, thereby obtaining the batch processing result.
The speech synthesis is carried out by utilizing the speech synthesis model established based on the deep neural network, so that the speech synthesis result generated in the proposal has highly anthropomorphic tone and timbre.
305. And splitting the batch processing result according to the voice synthesis task before packaging to obtain a voice synthesis result corresponding to the voice synthesis task.
After the batch processing result is obtained in the previous step, the voice files in the batch processing result are split according to the voice synthesis task before being packaged in the batch processing task, and the voice synthesis results corresponding to the voice synthesis tasks one to one are obtained.
Specifically, when the speech synthesis tasks in the task queue are packed in the foregoing steps, a task separator is added between tasks to delimit each speech synthesis task in the batch processing task. During speech synthesis, the separators are carried into the batch processing result at their original positions; after the GPU card has processed the batch in parallel, the batch processing result is split at the separators, and the separators are then deleted, yielding speech synthesis results in one-to-one correspondence with the speech synthesis tasks.
In the embodiment of the invention, a counter and a timer are arranged in the distribution server; the speech synthesis tasks temporarily stored in the task queue are packed once they reach a certain number or have waited a certain time, and the packed batch processing tasks are processed by GPU cards, so that the resource utilization rate of each GPU card is improved, the number of GPU cards needed is relatively reduced, and cost is lowered.
Referring to fig. 4, another embodiment of the speech synthesis method according to the embodiment of the present invention includes:
401. receiving a voice synthesis task from a user terminal;
the specific content in this step is the same as that in step 101 in the foregoing embodiment, and is not described here again.
402. Acquiring a computing capacity parameter of each GPU card, acquiring the maximum task number which can be processed by each GPU card according to the computing capacity parameter, and setting a counting threshold value according to the maximum task number;
in this step, the computing capability parameters of each GPU card currently in use are first obtained; these include the number of computing cores, the memory size, and the operating frequency. From these parameters and the computation amount of the speech synthesis tasks in this proposal, the maximum number of speech synthesis tasks each GPU card can process in parallel is calculated.
A count threshold is then set according to this maximum task number, with the threshold not exceeding it. To ensure that, when speech synthesis requests are frequent, each GPU card operates at its maximum processing capacity, so that its computing power is fully used and fewer GPUs are needed, the count threshold is set to the maximum number of speech synthesis tasks processable in parallel. Furthermore, because a GPU server may in practice contain GPU cards with different computing capability parameters, the minimum of the cards' maximum speech synthesis task numbers must be obtained, and the counter's count threshold must not exceed that minimum; the count threshold can later be adjusted according to actual test results.
Through this step, the counter is given a count threshold, which determines the maximum number of speech synthesis tasks included in each generated batch processing task.
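A sketch of the threshold rule: the threshold is the minimum, over the GPU cards in the server, of each card's maximum parallel task number, itself derived from the card's computing capability parameters by calculation or testing. The numbers below are made up for illustration.
```python
def set_count_threshold(max_tasks_per_card):
    """max_tasks_per_card: for each GPU card, the maximum number of speech
    synthesis tasks it can process in parallel, derived from its computing
    capability parameters (cores, memory size, operating frequency) by
    calculation or testing."""
    # With heterogeneous cards, the threshold must not exceed the weakest
    # card's maximum; it can be tuned downward later from test results.
    return min(max_tasks_per_card)

set_count_threshold([30, 30, 24])  # -> 24 (illustrative numbers)
```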
403. Acquiring the maximum waiting time of a service and the maximum synthesis time of a single batch processing task; setting the preset time of a timer according to the maximum waiting time of the service and the maximum synthesis time of the single batch processing task;
in this step, the maximum service waiting time and the maximum synthesis time of a single batch processing task are first obtained. The latter depends on the computing capability parameters of each GPU card and the computation amount of the speech synthesis tasks, and can be obtained by calculation or testing. The maximum service waiting time, i.e. the maximum time from when a speech synthesis task is submitted to when its speech synthesis result is obtained, is determined from the maximum single-batch synthesis time and the task synthesis speed this proposal is expected to achieve.
In order to meet the real-time requirement of the speech synthesis task, the preset time of the timer is set to be the difference between the maximum service waiting time and the maximum synthesis time of a single batch processing task.
Under some conditions speech synthesis requests are not frequent: a long time may pass before the number of received tasks reaches the counter's count threshold, or the number of tasks sent by current users may simply be below the threshold. If no timer were set, or its preset time were set too long, too much time would elapse between the first speech synthesis task entering the task queue and the queued tasks being packed into a batch processing task for distribution, making speech generation too slow and degrading the user experience. Therefore, in this embodiment a timer is set in the distribution server to ensure that, when requests are infrequent and the number of tasks stored in the queue has not reached the count threshold, the queued speech synthesis tasks are still packed into a batch processing task after a certain duration, reducing the time speech synthesis takes when requests are sparse.
As a specific example, suppose the maximum service waiting time is set to 1 second and the GPU card's maximum synthesis time for a single request in this proposal is 0.8 seconds. To guarantee that a speech synthesis task is processed within 1 second even when the counter never reaches its threshold, the maximum waiting time of tasks already in the task queue is 0.2 seconds: even if the tasks in the queue do not reach the counter's threshold, the queued speech synthesis tasks are packed and distributed every 0.2 seconds.
The maximum service waiting time can be set by a developer, when the service waiting time is set to be longer, the time spent by voice generation is longer under the same task receiving frequency, and the real-time performance is influenced.
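The preset time then follows directly from the relation above; a one-line sketch using the example's figures:
```python
def timer_preset(max_service_wait, max_batch_synthesis_time):
    # preset time = maximum service waiting time
    #               - maximum synthesis time of a single batch processing task
    return max_service_wait - max_batch_synthesis_time

timer_preset(1.0, 0.8)  # -> 0.2 seconds, as in the example above
```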
404. Adding the voice synthesis task into a task queue, controlling the value of a counter to add 1, and controlling a timer to start timing;
the specific content in this step is the same as that in step 302 in the previous embodiment, and is not described herein again.
405. Judging whether the numerical value in the counter reaches a preset counting threshold value or not; judging whether the timing time in the timer reaches the preset time or not; if the numerical value in the counter reaches the counting threshold value and/or the timing time in the timer reaches the preset time, resetting the timer and the counter, packaging the voice synthesis tasks in the task queue, and generating a batch processing task;
judging whether the numerical value in the counter reaches a preset counting threshold value or not; and simultaneously judging whether the timing time in the timer reaches the preset time.
If the numerical value in the counter reaches the counting threshold value, packaging the voice synthesis task temporarily stored in the current task queue to generate a batch processing task; at the same time, the values of the counter and the timer are reset to initial values, i.e., the values of the counter and the timer are set to 0.
If the timing time in the timer reaches the preset time, whether any speech synthesis task is temporarily stored in the task queue is checked. If so, the tasks temporarily stored in the queue are packed into a batch processing task, and the values of the timer and counter are reset to their initial values; if the current task queue holds no temporarily stored task, only the timer is reset to its initial value, and the received speech synthesis tasks continue to be added to the task queue for temporary storage.
And if the numerical value in the counter does not reach the counting threshold value and the timing time in the timer does not reach the preset time, continuing to add at least one received voice synthesis task into the task queue for temporary storage and waiting for the packaging operation of the tasks.
And limiting the maximum number of voice synthesis tasks in the batch processing tasks by using a counter with a set counting threshold, and packing the voice synthesis tasks temporarily stored in the task queue to generate the batch processing tasks when the voice synthesis tasks temporarily stored in the task queue reach the maximum number of voice synthesis tasks which can be contained in the batch processing tasks. The received voice synthesis tasks are packed according to a certain number in the mode, so that a single GPU card can be used for processing in the follow-up process, and the number of GPU cards needed at the same time is saved.
In this step, at least one received speech synthesis task is packaged to generate a batch processing task. The number of voice synthesis tasks contained in each batch processing task is determined by setting a counter and a timer according to the number of single tasks which can be simultaneously processed by the GPU card adopted in the proposal and the waiting time of the voice synthesis tasks in a task queue; the voice synthesis tasks temporarily stored in the task queue are packed to generate batch processing tasks in such a way, so that a single GPU card can be used for processing subsequently, the resource utilization rate of each GPU card is improved, and the number of GPU cards required to be used subsequently is reduced.
406. The method comprises the steps of obtaining the working state of each GPU card in a GPU server in advance, storing the working state of each GPU card into a GPU management queue, obtaining the working state of each GPU card in the GPU management queue, and selecting one GPU card with an idle working state from the GPU management queue;
the distribution server in this embodiment further includes a GPU management queue, and after the system is started, obtains a working state of each GPU card in the GPU server in advance, and stores the obtained working state of each GPU card into the GPU management queue, where the working state includes "idle" and "in-work". After the batch processing task generated in the above step is obtained, the distribution server in this embodiment obtains the working state of each GPU card in the GPU management queue, and selects one GPU card with an "idle" working state from the GPU cards.
When the GPU management queue selects a GPU card with a working state of "idle", the GPU card is selected from the GPU cards with a working state of "idle" according to a preset selection sequence, and specifically, the preset selection sequence may be random selection or polling selection.
In addition, the GPU management queue can also count the number of GPU cards in various working states in the current GPU server, and if the number of GPU cards in an "idle" state in the GPU server is 0, it is prompted that the number of added GPU cards is insufficient, and the number of GPU cards needs to be increased.
Further, the working state of a GPU card also includes an "abnormal" state. When the selected GPU card fails while synthesizing a batch processing task, processing-failure information is sent to the GPU management queue, which marks the state of the currently selected card as "abnormal". The GPU management queue also traverses the GPU cards in the GPU server every 3 seconds and sends query messages; if a card marked "abnormal" has returned to normal, its state mark stored in the GPU management queue is updated to "idle".
407. Sending the batch processing tasks to the selected GPU cards, marking the working states of the selected GPU cards stored in a GPU management queue as 'working', utilizing the selected GPU cards to perform parallel processing on all the voice synthesis tasks in the batch processing tasks, marking the working states of the selected GPU cards stored in the GPU management queue as 'idle' after the processing is finished, and outputting batch processing results;
and after the currently selected GPU card is determined, sending the batch processing task generated in the previous step to the GPU card, acquiring text contents contained in the batch processing task, calculating the text contents by using a pre-established speech synthesis model through the GPU card, generating a speech file corresponding to the text contents contained in the current batch processing task, and obtaining a batch processing result.
Each time the distribution server generates a batch processing task, the GPU management queue selects a GPU card whose working state is "idle" to synthesize it, and the selected card's working state is changed to "working"; after the card completes the synthesis, it sends the batch processing result to the distribution server, and the distribution server changes the card's working state from "working" back to "idle".
As a specific example, suppose calculation and testing show that the GPU card used in this embodiment can process 30 speech synthesis tasks simultaneously as one concurrent task, and that processing one concurrent task takes 0.8 seconds; suppose further that 200 speech synthesis tasks arrive uniformly from user terminals in the first second and 10 arrive uniformly in the next second. Since a GPU card can process only one concurrent task at a time, without packing (as in the prior art) 200 GPU cards would have to be deployed in the GPU server to meet the computation demand; the time from receiving the first task to producing the result of the last would be about 2.8 seconds, and each card, handling only one speech synthesis task, would leave a large amount of computing power unused. With the solution of this embodiment, with the counter threshold set to 30 and the timer set to 0.2 seconds, the counter triggers its threshold every 0.15 seconds in the first second, 6 times in total, packing the first 180 received tasks into 6 batch processing tasks, each distributed to one GPU card. Since the tasks in the second second arrive 0.1 seconds apart, the timer triggers generation of the 7th batch processing task at about 2.1 seconds; as the first batch finishes at about 0.95 seconds, the card that processed it can handle the 7th batch, and for batch processing tasks generated from subsequently received tasks there is likewise always a card in the "idle" state in the server. The time from receiving the first task to producing the last result is then about 3.0 seconds: at the cost of 0.2 seconds of extra waiting, only 6 GPU cards are needed, greatly reducing the number of GPU cards required.
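A quick back-of-envelope check of this example's batching arithmetic (the rates are those assumed in the example):
```python
arrival_rate = 200        # tasks per second during the first second
count_threshold = 30      # counter threshold
batch_time = 0.8          # seconds per concurrent (batch) task on one card

fire_interval = count_threshold / arrival_rate   # 0.15 s between batches
full_batches = arrival_rate // count_threshold   # 6 batches of 30 tasks
cards_needed = full_batches                      # vs. 200 without packing,
                                                 # since cards free up in time
print(fire_interval, full_batches, cards_needed) # 0.15 6 6
```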
In addition, the speech synthesis model is established before the GPU card processes any batch processing task. The model can be established as follows: collect audio files read aloud by human speakers together with the text content corresponding to each audio file; annotate the audio files with labels such as scene and emotion; form a speech generation training set from the labeled audio files and their corresponding text contents; and train a deep neural network algorithm with the speech generation training set to obtain the speech synthesis model.
The tone and timbre most suitable for the current text can be calculated through the speech synthesis model, so that a speech file with a high degree of personification is generated.
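Purely as an illustration of how such a training set might be assembled (the patent prescribes neither a file format nor a specific deep neural network; the CSV manifest layout below is an assumption):

import csv

def build_training_set(manifest_path):
    """Pair each human-read audio file with its text and scene/emotion labels.

    Assumes a hypothetical CSV manifest with columns: audio_path, text, scene, emotion.
    """
    samples = []
    with open(manifest_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            samples.append({
                "audio": row["audio_path"],   # recording read aloud by a human speaker
                "text": row["text"],          # corresponding text content
                "labels": {"scene": row["scene"], "emotion": row["emotion"]},
            })
    return samples

# Training itself is framework-specific and outside this sketch,
# e.g. model = train(build_training_set("manifest.csv")).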
408. Splitting the batch processing result according to the voice synthesis tasks before packaging to obtain the voice synthesis result corresponding to each voice synthesis task.
The specific content of this step is the same as that of step 305 in the foregoing embodiment and is not repeated here. In the embodiment of the invention, a counter and a timer are arranged in the distribution server, so that the voice synthesis tasks temporarily stored in the task queue are packed once their number reaches a set count or their waiting time reaches a set duration, and the packed batch processing task is processed by a GPU card, which improves the resource utilization rate of the GPU card. The scheme retains the highly anthropomorphic tone and timbre of the voice synthesis result, increases the voice synthesis speed to meet real-time requirements, and, because the resource utilization rate of the GPU cards is improved, relatively reduces the number of GPU cards required and thus reduces the cost.
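One straightforward way to realize the split (illustrative only; the patent requires only that the batch processing result be divided along the pre-packing task boundaries) is to record the task identifiers in packing order and index the batch result by position:

def split_batch_result(batch_result, task_ids):
    """Map each synthesized audio back to the voice synthesis task it came from.

    Assumes the GPU card returns one result per task, in packing order.
    """
    assert len(batch_result) == len(task_ids)
    return {task_id: audio for task_id, audio in zip(task_ids, batch_result)}

# Example: results for tasks "t1".."t3" that were packed in that order.
outputs = split_batch_result([b"wav1", b"wav2", b"wav3"], ["t1", "t2", "t3"])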
The speech synthesis method in the embodiment of the present invention is described above. A speech synthesis apparatus in the embodiment of the present invention is described below with reference to fig. 5; one embodiment of the speech synthesis apparatus comprises:
an obtaining module 501, configured to receive a voice synthesis task from a user terminal, where the number of the voice synthesis tasks is at least one;
a task packing module 502, configured to add the speech synthesis tasks into a task queue, and pack all the speech synthesis tasks in the task queue according to a preset task packing rule to generate batch processing tasks;
the batch task processing module 503 is configured to obtain a working state of each GPU card in the GPU server stored in the GPU management queue, select one GPU card according to the working state, send the batch task to the selected GPU card, and perform parallel processing on all speech synthesis tasks in the batch task to obtain a batch processing result;
and a result generating module 504, configured to split the batch processing result according to the speech synthesis task before packaging, so as to obtain a speech synthesis result corresponding to the speech synthesis task.
The voice synthesis apparatus in the embodiment of the invention improves the resource utilization rate of each GPU card and reduces the cost.
Referring to fig. 6, another embodiment of the speech synthesis apparatus according to the embodiment of the present invention includes:
an obtaining module 501, configured to receive a voice synthesis task from a user terminal, where the number of the voice synthesis tasks is at least one;
a task packing module 502, configured to add the speech synthesis tasks into a task queue, and pack all the speech synthesis tasks in the task queue according to a preset task packing rule to generate batch processing tasks;
the batch task processing module 503 is configured to obtain a working state of each GPU card in the GPU server stored in the GPU management queue, select one GPU card according to the working state, send the batch task to the selected GPU card, and perform parallel processing on all speech synthesis tasks in the batch task to obtain a batch processing result;
and a result generating module 504, configured to split the batch processing result according to the speech synthesis task before packaging, so as to obtain a speech synthesis result corresponding to the speech synthesis task.
Optionally, the task packing module 502 includes:
a task receiving unit 5021, configured to add the voice synthesis task to a task queue, control a counter to add 1, and control a timer to start timing;
a counting unit 5022, configured to determine whether a value in the counter reaches a preset counting threshold;
a timing unit 5023, configured to determine whether the timing time in the timer reaches a preset time;
and a packing unit 5024, configured to reset the timer and the counter if the value in the counter reaches the count threshold and/or the timing time in the timer reaches the preset time, and to pack the voice synthesis tasks in the task queue to generate a batch processing task.
Optionally, the packing unit 5024 is specifically further configured to: and if the numerical value in the counter does not reach the counting threshold value and the timing time in the timer does not reach the preset time, continuing to add the received voice synthesis task into a task queue for temporary storage.
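The cooperation of units 5021 to 5024 can be sketched as follows. This is a simplified, single-threaded illustration under the assumption that the timer starts with the first task of each packing window; a production server would need locking, and all names here are hypothetical:

import time

class TaskPacker:
    def __init__(self, count_threshold, preset_time):
        self.count_threshold = count_threshold  # counting unit 5022's threshold
        self.preset_time = preset_time          # timing unit 5023's preset time
        self.queue = []                         # task queue (temporary storage)
        self.deadline = None                    # timer: when the current window expires

    def add_task(self, task):
        """Task receiving unit 5021: enqueue, count +1, start timer with the first task."""
        if not self.queue:
            self.deadline = time.monotonic() + self.preset_time
        self.queue.append(task)
        if len(self.queue) >= self.count_threshold:
            return self._pack()                 # counter reached the count threshold
        return None

    def poll(self):
        """Packing unit 5024: fire when the timer reaches the preset time."""
        if self.queue and time.monotonic() >= self.deadline:
            return self._pack()
        return None                             # neither condition met: keep temporary storage

    def _pack(self):
        batch, self.queue, self.deadline = self.queue, [], None  # reset counter and timer
        return batch                            # the generated batch processing task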
Optionally, the task packing module 502 further includes:
a count threshold setting unit 5025, configured to obtain the computing capability parameter of each GPU card, obtain the maximum number of tasks each GPU card can process according to the computing capability parameter, and set the count threshold according to that maximum number of tasks.
Optionally, the task packing module 502 further includes:
a timing time setting unit 5026, configured to obtain the maximum service waiting time and the maximum synthesis time of a single batch processing task, and to set the preset time of the timer as the difference between the maximum service waiting time and the maximum synthesis time of a single batch processing task.
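Units 5025 and 5026 amount to two small calculations. The snippet below uses hypothetical figures consistent with the earlier example (30 tasks per card and 0.8 seconds per batch; the 1.0-second service waiting limit is an assumption inferred from the 0.2-second timer used there):

max_tasks_per_card = 30      # from the GPU card's computing capability parameter
count_threshold = max_tasks_per_card              # unit 5025's setting

max_service_wait = 1.0       # maximum waiting time the service tolerates (s)
single_batch_time = 0.8      # maximum synthesis time of a single batch processing task (s)
timer_preset = max_service_wait - single_batch_time   # ≈ 0.2 s, per unit 5026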
Optionally, the batch task processing module 503 includes:
a working state obtaining unit 5031, configured to obtain in advance a working state of each GPU card in the GPU server, and store the working state of each GPU card into a GPU management queue, where the working state includes "idle" and "working";
a batch task processing unit 5032, configured to obtain the working state of each GPU card in the GPU management queue, select a GPU card whose working state is "idle" in the GPU management queue, send the batch processing task to the selected GPU card, mark the working state of the selected GPU card stored in the GPU management queue as "working", and perform parallel processing on all voice synthesis tasks in the batch processing task by using the selected GPU card;
a batch result generating unit 5033, configured to mark the working state of the selected GPU card stored in the GPU management queue as "idle" and output a batch processing result.
Optionally, the batch task processing unit 5032 includes:
a task text acquiring subunit, configured to acquire text content included in the batch processing task;
and the voice synthesis subunit is used for performing voice conversion according to the text content by using a pre-established voice synthesis model to obtain a batch processing result.
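The two subunits could be combined as in the following sketch (the model.synthesize call is a placeholder; the patent does not fix an interface for the speech synthesis model):

def process_batch(batch_tasks, model):
    """Task text acquiring subunit plus voice synthesis subunit, sketched together."""
    texts = [task["text"] for task in batch_tasks]   # acquire the text content of each task
    # The pre-established speech synthesis model converts each text to audio;
    # model.synthesize is a hypothetical per-text interface.
    return [model.synthesize(text) for text in texts]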
The voice synthesis device provided by the embodiment of the invention improves the resource utilization rate of GPU cards in highly anthropomorphic real-time voice synthesis compared with the prior art, and reduces the cost.
Fig. 5 and fig. 6 describe the speech synthesis apparatus in the embodiment of the present invention in detail from the perspective of modular functional entities; the speech synthesis apparatus in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 7 is a schematic structural diagram of a speech synthesis apparatus 700 according to an embodiment of the present invention. The speech synthesis apparatus 700 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 710, a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) storing applications 733 or data 732. The memory 720 and the storage medium 730 may be transient storage or persistent storage. The program stored in the storage medium 730 may include one or more modules (not shown), each of which may include a series of instruction operations on the speech synthesis apparatus 700. Further, the processor 710 may be configured to communicate with the storage medium 730 to execute the series of instruction operations in the storage medium 730 on the speech synthesis apparatus 700.
The speech synthesis apparatus 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input-output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc. Those skilled in the art will appreciate that the speech synthesis apparatus configuration shown in fig. 7 does not constitute a limitation of the speech synthesis apparatus and may include more or fewer components than those shown, or some of the components may be combined, or a different arrangement of components.
The present invention also provides a speech synthesis apparatus, which includes a memory and a processor, wherein the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the speech synthesis method in the above embodiments.
The blockchain referred to in the invention is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the speech synthesis method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A speech synthesis method, characterized in that the speech synthesis method comprises:
receiving voice synthesis tasks from a user terminal, wherein the number of the voice synthesis tasks is at least one;
adding the voice synthesis tasks into a task queue, and packaging all the voice synthesis tasks in the task queue according to a preset task packaging rule to generate batch processing tasks;
acquiring the working state of each GPU card in a GPU server stored in a GPU management queue, selecting one GPU card according to the working state, sending the batch processing tasks to the selected GPU card to perform parallel processing on all voice synthesis tasks in the batch processing tasks, and obtaining batch processing results;
and splitting the batch processing result according to the voice synthesis task before packaging to obtain a voice synthesis result corresponding to the voice synthesis task.
2. The speech synthesis method according to claim 1, wherein the adding the speech synthesis tasks into a task queue, and packing all the speech synthesis tasks in the task queue according to a preset task packing rule to generate batch processing tasks comprises:
adding the voice synthesis task into a task queue, controlling the value of a counter to add 1, and controlling a timer to start timing;
judging whether the numerical value in the counter reaches a preset counting threshold value or not;
judging whether the timing time in the timer reaches a preset time or not;
if the numerical value in the counter reaches the counting threshold value and/or the timing time in the timer reaches the preset time, resetting the timer and the counter, and packing the voice synthesis tasks in the task queue to generate batch processing tasks.
3. The speech synthesis method according to claim 2, wherein the adding the speech synthesis tasks into a task queue, and packing all the speech synthesis tasks in the task queue according to a preset task packing rule to generate batch processing tasks further comprises:
and if the numerical value in the counter does not reach the counting threshold value and the timing time in the timer does not reach the preset time, continuing to add the received voice synthesis task into a task queue for temporary storage.
4. The speech synthesis method according to claim 3, wherein before the adding the speech synthesis task to the task queue, controlling the value of the counter to add 1, and controlling the timer to start timing, further comprising:
acquiring a computing capacity parameter of each GPU card;
obtaining the maximum task number which can be processed by each GPU card according to the computing capacity parameter;
and setting a counting threshold value according to the maximum task number.
5. The speech synthesis method according to claim 3, wherein before the adding the speech synthesis task to the task queue, controlling the value of the counter to add 1, and controlling the timer to start timing, further comprising:
acquiring the maximum waiting time of a service and the maximum synthesis time of a single batch processing task;
and setting the preset time of a timer according to the maximum service waiting time and the maximum synthesis time of the single batch processing task, wherein the preset time of the timer is the difference between the maximum service waiting time and the maximum synthesis time of the single batch processing task.
6. The speech synthesis method according to claim 3, wherein the obtaining of the working state of each GPU card in the GPU server stored in the GPU management queue, the selecting of one GPU card according to the working state, the sending of the batch processing tasks to the selected GPU card for parallel processing of all speech synthesis tasks in the batch processing tasks, and the obtaining of the batch processing result comprises:
the method comprises the steps of obtaining working states of GPU cards in the GPU server in advance, and storing the working states of the GPU cards into a GPU management queue, wherein the working states comprise "idle" and "working";
acquiring the working state of each GPU card in the GPU management queue, and selecting one GPU card with an idle working state in the GPU management queue;
sending the batch processing tasks to a selected GPU card, marking the working state of the selected GPU card stored in the GPU management queue as 'working', and utilizing the selected GPU card to perform parallel processing on all voice synthesis tasks in the batch processing tasks;
and after the processing is finished, marking the working state of the selected GPU card stored in the GPU management queue as idle, and outputting a batch processing result.
7. The method according to claim 6, wherein the parallel processing of all the speech synthesis tasks in the batch processing task using the selected GPU card comprises:
acquiring text contents contained in all voice synthesis tasks in the batch processing task;
and performing voice conversion according to the text content by using a pre-established voice synthesis model to obtain a batch processing result.
8. A speech synthesis apparatus, characterized in that the speech synthesis apparatus comprises:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for receiving voice synthesis tasks from a user terminal, and the number of the voice synthesis tasks is at least one;
the task packing module is used for adding the voice synthesis tasks into a task queue, packing all the voice synthesis tasks in the task queue according to a preset task packing rule and generating batch processing tasks;
the batch task processing module is used for acquiring the working state of each GPU card in the GPU server stored in the GPU management queue, selecting one GPU card according to the working state, sending the batch processing task to the selected GPU card, and performing parallel processing on all the voice synthesis tasks in the batch processing task to obtain a batch processing result;
and the result generation module is used for splitting the batch processing result according to the voice synthesis task before packaging to obtain a voice synthesis result corresponding to the voice synthesis task.
9. A speech synthesis apparatus characterized by comprising: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invoking the instructions in the memory to cause the speech synthesis apparatus to perform the steps of the speech synthesis method of any of claims 1-7.
10. A computer-readable storage medium having instructions stored thereon, which when executed by a processor implement the steps of a method of speech synthesis according to any one of claims 1-7.
CN202110082850.0A 2021-01-21 Speech synthesis method, device, equipment and storage medium Active CN112885327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110082850.0A CN112885327B (en) 2021-01-21 Speech synthesis method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110082850.0A CN112885327B (en) 2021-01-21 Speech synthesis method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112885327A true CN112885327A (en) 2021-06-01
CN112885327B CN112885327B (en) 2024-07-09

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110337807A (en) * 2017-04-07 2019-10-15 英特尔公司 The method and system of camera apparatus is used for depth channel and convolutional neural networks image and format
CN107578004A (en) * 2017-08-30 2018-01-12 苏州清睿教育科技股份有限公司 Learning method and system based on image recognition and interactive voice
CN109300467A (en) * 2018-11-30 2019-02-01 四川长虹电器股份有限公司 Phoneme synthesizing method and device
CN109495496A (en) * 2018-12-11 2019-03-19 泰康保险集团股份有限公司 Method of speech processing, device, electronic equipment and computer-readable medium
CN111383159A (en) * 2018-12-28 2020-07-07 英特尔公司 Scalar engine clustering to accelerate intersections in leaf nodes
CN112151003A (en) * 2019-06-27 2020-12-29 百度在线网络技术(北京)有限公司 Parallel speech synthesis method, device, equipment and computer readable storage medium
CN110597858A (en) * 2019-08-30 2019-12-20 深圳壹账通智能科技有限公司 Task data processing method and device, computer equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115080263A (en) * 2022-05-12 2022-09-20 吉林省吉林祥云信息技术有限公司 Batch processing scale method in real-time GPU service
CN115080263B (en) * 2022-05-12 2023-10-27 吉林省吉林祥云信息技术有限公司 Batch processing scale selection method in real-time GPU service
CN116483544A (en) * 2023-06-15 2023-07-25 阿里健康科技(杭州)有限公司 Task processing method, device, computer equipment and storage medium
CN116483544B (en) * 2023-06-15 2023-09-19 阿里健康科技(杭州)有限公司 Task processing method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107659433B (en) Cloud resource scheduling method and equipment
CN106897206A (en) A kind of service test method and device
US11403151B2 (en) Auto-scale performance assurance system and auto-scale performance assurance method
CN106815254A (en) A kind of data processing method and device
CN109213758A (en) Data access method, device, equipment and computer readable storage medium
CN113608751A (en) Operation method, device and equipment of reasoning service platform and storage medium
CN109697105A (en) A kind of container cloud environment physical machine selection method and its system, virtual resource configuration method and moving method
CN112885327A (en) Speech synthesis method, apparatus, device and storage medium
CN112885327B (en) Speech synthesis method, device, equipment and storage medium
CN108228334B (en) Container cluster expansion method and device
CN115437781B (en) GPU resource management method and system
CN111277626A (en) Server upgrading method and device, electronic equipment and medium
CN111158889A (en) Batch task processing method and system
CN103310002B (en) For the Web service method for packing and system of weather forecast computing system MM5
CN109788061A (en) Calculating task dispositions method and device
CN110035126A (en) A kind of document handling method, calculates equipment and storage medium at device
CN110968420A (en) Scheduling method and device for multi-crawler platform, storage medium and processor
CN117149399A (en) Data processing method, device, equipment and readable storage medium
CN112612427B (en) Vehicle stop data processing method and device, storage medium and terminal
CN111143148B (en) Model parameter determining method, device and storage medium
CN108804640B (en) Data grouping method, device, storage medium and equipment based on maximized IV
CN112905223A (en) Method, device and equipment for generating upgrade package
CN110659111A (en) Data processing method and system
CN111177530A (en) Method and device for pushing hot content based on big data
CN112835931A (en) Method and device for determining data acquisition frequency

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant