CN115188363A - Voice processing method, system, device and storage medium - Google Patents

Voice processing method, system, device and storage medium

Info

Publication number
CN115188363A
Authority
CN
China
Prior art keywords
data
hidden state
layer
state data
convolution
Prior art date
Legal status
Pending
Application number
CN202210820858.7A
Other languages
Chinese (zh)
Inventor
郭洋
王健宗
程宁
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210820858.7A
Publication of CN115188363A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - ... using predictive techniques
    • G10L19/16 - ... Vocoder architecture
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - ... characterised by the type of extracted parameters
    • G10L25/24 - ... the extracted parameters being the cepstrum
    • G10L25/27 - ... characterised by the analysis technique
    • G10L25/30 - ... using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to artificial intelligence, and provides a voice processing method, a system, a device and a storage medium, wherein the method comprises the following steps: acquiring a voice signal and a speaker vector, wherein the voice signal comprises a time domain resolution; obtaining Mel spectrum data according to the voice signal; introducing the Mel spectrum data into a first convolution layer in a preset vocoder network structure for extraction processing to obtain initial hidden state data, wherein the vocoder network structure comprises the first convolution layer, an up-sampling layer, a residual error layer and a second convolution layer, and the number of channels of the first convolution layer is different from that of the second convolution layer; under the condition that dimension reduction hidden state data are obtained by performing up-sampling processing on the initial hidden state data through the up-sampling layer, introducing the speaker vector and the dimension reduction hidden state data into the residual error layer for synthesis processing to obtain mixed data, wherein the sequence length of the dimension reduction hidden state data is consistent with the time domain resolution; and importing the mixed data into the second convolution layer for dimensionality reduction processing to obtain a voice waveform.

Description

Voice processing method, system, device and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technology, and in particular, to a method, system, device, and storage medium for speech processing.
Background
At present, Text To Speech (TTS) technology involves multiple disciplines such as acoustics, linguistics, digital signal processing and multimedia technology, and is a leading technology in the field of Chinese information processing. Speech synthesis is the process of converting text into speech and outputting it, and the process is divided into three parts: a text front end, an acoustic model and a vocoder. The text front end converts the text into phoneme, tone and intonation control information, the acoustic model converts that information into a spectrogram, and the vocoder converts the spectrogram into speech; the vocoder is thus the back end of the TTS pipeline.
The vocoder plays an important role, and its quality often determines the quality of the whole speech processing system. Given a large amount of training data from multiple speakers, vocoder implementations in the related art can synthesize high-naturalness speech for each speaker in the training data set. However, for speakers outside the training data set, and when the amount of data is insufficient, the vocoder implementations of the related art yield poor synthesis naturalness. In general, recording as much speaker data as possible to improve synthesis naturalness requires substantial manpower and effort, so how to improve the naturalness of speaker speech synthesis has become a technical problem that urgently needs to be solved.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
Embodiments of the present invention provide a speech processing method, system, device, and storage medium, which can improve the naturalness of speech synthesis of a speaker in the case of insufficient data amount.
In a first aspect, an embodiment of the present invention provides a speech processing method, where the method includes:
acquiring a voice signal and a speaker vector, wherein the voice signal comprises a time domain resolution;
obtaining Mel spectrum data according to the voice signal;
introducing the Mel spectrum data into a first convolutional layer in a preset vocoder network structure for extraction processing to obtain initial hidden state data, wherein the vocoder network structure comprises the first convolutional layer, an up-sampling layer, a residual error layer and a second convolutional layer, and the number of channels of the first convolutional layer is different from the number of channels of the second convolutional layer;
under the condition that dimension reduction hidden state data are obtained by performing upsampling processing on the initial hidden state data through the upsampling layer, importing the speaker vector and the dimension reduction hidden state data into the residual error layer for synthesis processing to obtain mixed data, wherein the sequence length of the dimension reduction hidden state data is consistent with the time domain resolution;
and importing the mixed data into the second convolution layer for dimension reduction processing to obtain a voice waveform.
The voice processing method provided by the embodiment of the invention has at least the following beneficial effects. The time domain resolution can be obtained from the voice signal, and processing the voice signal yields the Mel spectrum data. The Mel spectrum data is then imported into the preset vocoder network structure, where extraction by the first convolution layer produces the initial hidden state data corresponding to the Mel spectrum data. The up-sampling layer in the vocoder network structure up-samples the initial hidden state data, reducing its feature dimension and producing dimension reduction hidden state data whose sequence length is consistent with the time domain resolution of the voice signal, which improves the naturalness of the subsequent speech synthesis. The dimension reduction hidden state data and the speaker vector are then imported into the residual error layer for synthesis, establishing the correlation within the speech and producing mixed data, after which the second convolution layer performs dimension reduction on the mixed data to obtain the required voice waveform. In short, through the preset vocoder network structure, the method uses the up-sampling layer to raise the time domain resolution of the dimension reduction hidden state data corresponding to the Mel spectrum data, uses the residual error layer to model the local correlation of speech on the time scale, and introduces the speaker vector into the residual error layer, thereby improving the naturalness of speaker speech synthesis when synthesizing speech for speakers outside the data set with scarce data.
According to some embodiments of the present invention, in the above speech processing method, the importing the speaker vector and the dimensionality reduction hidden state data into the residual error layer for synthesis processing to obtain mixed data includes:
calculating the speaker vector and the dimensionality reduction hidden state data according to a preset activation function to obtain comprehensive mapping data;
and obtaining mixed data according to the comprehensive mapping data and the dimensionality reduction hidden state data.
A weighted summation calculation is performed on the inputs of the residual error layer, namely the speaker vector and the dimensionality reduction hidden state data, through a preset activation function, and the comprehensive mapping data is obtained by this mapping. Superposing the comprehensive mapping data with the dimensionality reduction hidden state data then yields the mixed data; introducing the speaker vector into the residual error layer improves the speech naturalness of the synthesized specific speaker.
According to some embodiments of the present invention, in the above-mentioned speech processing method, the calculating the speaker vector and the dimensionality reduction hidden state data according to a preset activation function to obtain comprehensive mapping data includes:
superposing the dimensionality reduction hidden state data and the speaker convolution quantity to obtain initial mixed data;
importing the initial mixed data into a preset first activation function for calculation to obtain first mapping data; importing the initial mixed data into a preset second activation function for calculation to obtain second mapping data;
performing matrix dot product calculation on the first mapping data and the second mapping data to obtain comprehensive mapping data;
the speaker convolution quantity is characterized by a numerical value obtained by one-dimensional convolution calculation of the speaker vector.
The dimensionality reduction hidden state data and the speaker convolution quantity are superposed to obtain the initial mixed data, which mixes the voice signal with the speaker information. The initial mixed data is then imported into a preset first activation function and a preset second activation function respectively: the first activation function performs a weighted summation calculation on the initial mixed data to obtain the first mapping data, and the second activation function performs a weighted summation calculation on the initial mixed data to obtain the second mapping data. Computing the initial mixed data with multiple activation functions effectively improves the speech synthesis naturalness for a specific speaker.
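As an illustrative sketch of this two-activation-function computation (in PyTorch; the tensor names, shapes and channel counts are assumptions for illustration, not values taken from the application):

```python
import torch

# Assumed shapes: batch 1, 256 channels, 100 time steps.
h = torch.randn(1, 256, 100)   # dimensionality reduction hidden state data
s = torch.randn(1, 256, 100)   # speaker convolution quantity (speaker vector after a 1-D convolution)

initial_mixed = h + s          # superposition of hidden state and speaker information

first_mapping = torch.tanh(initial_mixed)       # preset first activation function (Tanh in the Fig. 8 embodiment)
second_mapping = torch.sigmoid(initial_mixed)   # preset second activation function (Sigmoid)

comprehensive_mapping = first_mapping * second_mapping  # element-wise (matrix dot) product
```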
According to some embodiments of the present invention, in the above speech processing method, obtaining mixed data according to the comprehensive mapping data and the dimensionality reduction hidden state data includes:
superposing a numerical value obtained by one-dimensional convolution calculation of the comprehensive mapping data and the dimensionality reduction hidden state convolution quantity to obtain mixed data;
and the dimensionality reduction hidden state volume is characterized by a numerical value obtained by performing one-dimensional convolution calculation on the dimensionality reduction hidden state data.
Because the voice signal is a one-dimensional signal, in order to improve the voice waveform synthesis naturalness, the value obtained by performing one-dimensional convolution calculation on the comprehensive mapping data and the value obtained by performing one-dimensional convolution calculation on the dimensionality reduction hidden state data are superposed, so that the dimensionality of the comprehensive mapping data and the dimensionality reduction hidden state data can be normalized, superposition processing is facilitated, and the accuracy of mixed data is improved.
According to some embodiments of the present invention, in the above speech processing method, there are provided a plurality of the upsampling layers and the residual error layers, where the number of the upsampling layers corresponds to the number of the residual error layers one to one, and the upsampling layers and the residual error layers are sequentially connected;
under the condition that dimension reduction hidden state data are obtained by performing upsampling processing on the initial hidden state data through the upsampling layer, the speaker vector and the dimension reduction hidden state data are led into the residual error layer to be synthesized, and mixed data are obtained, wherein the process comprises the following steps:
under the condition that the sequence length of the dimensionality reduction hidden state data is inconsistent with the time domain resolution, introducing the mixed data into a next upsampling layer for upsampling to obtain new dimensionality reduction hidden state data;
and importing the speaker vector and the new dimension reduction hidden state data into a next residual error layer for synthesis processing to obtain new mixed data until the sequence length of the new dimension reduction hidden state data is consistent with the time domain resolution.
The sequence length of the dimension reduction hidden state data is compared with the time domain resolution. When the two are inconsistent, using the current dimension reduction hidden state data for speech synthesis would yield a voice waveform lacking naturalness, so the dimension reduction hidden state data must be up-sampled again: its feature dimension is reduced, its corresponding time domain resolution is raised, and its sequence length increases. After this up-sampling, the new dimension reduction hidden state data and the speaker vector are imported into the next residual error layer for synthesis processing, producing new mixed data; this continues until the sequence length of the new dimension reduction hidden state data is consistent with the time domain resolution, thereby improving the naturalness of speech synthesis, as shown in the sketch below.
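A minimal sketch of this repeat-until-matched control flow (in PyTorch-style Python; the function and argument names, the pairing of layers, and the length check are assumptions for illustration):

```python
def run_upsampling_stack(hidden, speaker_vec, upsample_layers, residual_layers, target_length):
    """Alternately apply up-sampling and residual layers until the sequence length
    of the hidden state matches the time domain resolution of the speech signal."""
    for upsample, residual in zip(upsample_layers, residual_layers):
        hidden = upsample(hidden)                # raise time domain resolution, reduce feature dimension
        hidden = residual(hidden, speaker_vec)   # synthesize with the speaker vector to get mixed data
        if hidden.shape[-1] == target_length:    # sequence length now matches the time domain resolution
            break
    return hidden
```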
According to some embodiments of the present invention, in the above speech processing method, the first convolutional layer comprises a plurality of one-dimensional convolutional layers, a size of a convolution kernel of the one-dimensional convolutional layer is 7, a step size is 1, and a number of channels is 512.
Performing one-dimensional convolution extraction on the Mel spectrum data with a convolution kernel of size 7, stride 1 and 512 channels extracts the initial hidden state data corresponding to the Mel spectrum data while raising its feature dimension, which facilitates subsequent processing and improves the naturalness of speech synthesis. A minimal sketch of such a layer is given below.
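A PyTorch sketch of the first convolution layer (the 80-channel Mel input and the padding choice are assumptions; the application only fixes kernel size 7, stride 1 and 512 channels):

```python
import torch
import torch.nn as nn

# Kernel size 7, stride 1, 512 channels as specified; padding=3 (assumed) keeps the sequence length unchanged.
first_conv = nn.Conv1d(in_channels=80, out_channels=512, kernel_size=7, stride=1, padding=3)

mel = torch.randn(1, 80, 200)   # hypothetical 80-dimensional Mel spectrum with 200 frames
hidden = first_conv(mel)
print(hidden.shape)             # torch.Size([1, 512, 200]) - feature dimension raised to 512
```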
According to some embodiments of the present invention, in the above speech processing method, the obtaining mel-spectrum data from the speech signal includes:
performing short-time Fourier transform calculation processing on the voice signal to obtain a voice amplitude spectrum;
and filtering the voice amplitude spectrum by using an 80-dimensional mel filter bank to obtain mel spectrum data.
Computing a short-time Fourier transform of the voice signal yields a voice amplitude spectrum, but its large frequency and amplitude ranges make it unsuitable for direct speech synthesis. Therefore, in order to obtain speech features of an appropriate size, the voice amplitude spectrum is filtered by an 80-dimensional mel filter bank, producing mel spectrum data suitable for speech synthesis processing.
In a second aspect, an embodiment of the present invention provides a speech processing system, including:
the data acquisition module is used for acquiring a voice signal and a speaker vector, wherein the voice signal comprises time domain resolution;
the Mel spectrum calculating module is used for obtaining Mel spectrum data according to the voice signals;
the convolution extraction module is used for introducing the Mel spectrum data into a first convolution layer in a preset vocoder network structure for extraction processing to obtain initial hidden state data, wherein the vocoder network structure comprises the first convolution layer, an up-sampling layer, a residual error layer and a second convolution layer, and the number of channels of the first convolution layer is different from the number of channels of the second convolution layer;
the sampling calculation module is used for leading the speaker vector and the dimensionality reduction hidden state data into the residual error layer for synthesis processing under the condition that the dimensionality reduction hidden state data are obtained by the upsampling processing of the upsampling layer of the initial hidden state data, so as to obtain mixed data, wherein the sequence length of the dimensionality reduction hidden state data is consistent with the time domain resolution;
and the dimension reduction output module is used for leading the mixed data into the second convolution layer to carry out dimension reduction processing to obtain a voice waveform.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the voice processing method according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the speech processing method according to the first aspect.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification; they illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention without limiting it.
FIG. 1 is a flow chart of a method of speech processing provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a specific implementation process of step S400 in FIG. 1;
FIG. 3 is a schematic diagram of a specific implementation process of step S410 in FIG. 2;
FIG. 4 is a schematic diagram of a specific implementation process of step S420 in FIG. 2;
FIG. 5 is a schematic diagram of a specific implementation process of step S400 in FIG. 1;
FIG. 6 is a schematic diagram of a specific implementation process of step S200 in FIG. 1;
FIG. 7 is a schematic diagram of a vocoder network structure model provided by an embodiment of the present invention;
FIG. 8 is a schematic diagram of a residual layer processing model according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a speech processing system according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It is noted that although a division of functional blocks is depicted in the block diagrams and a logical order is depicted in the flowcharts, in some cases the steps shown or described may be performed in an order different from that of the blocks in the block diagram or the flowchart. The terms "first," "second," and the like in the description and in the drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Directly modeling the distribution of speech data is a challenging task, because speech has ultra-high resolution in the time domain and correlations on both long and short time scales. Vocoders are therefore often used to model low-resolution speech features, and the mapping relationships between speech features are used to simplify the problem of modeling the speech data distribution.
A Generative Adversarial Network (GAN)-based vocoder in the related art models the mapping relationship between the mel spectrum and speech using a non-autoregressive feed-forward convolutional network structure. Given a large amount of training data from multiple speakers, the GAN-based vocoder model is able to synthesize highly natural speech for each speaker in the training data set. However, when a speaker is not in the training data set and the amount of data is insufficient, the speech waveform synthesized by the GAN-based vocoder model lacks naturalness. In the related art, the only remedy is to collect as much data as possible from the few available speakers to improve synthesis naturalness, which costs substantial time and labor.
The embodiments of the invention can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
The invention relates to artificial intelligence, and provides a voice processing method, a system, a device and a storage medium. The voice processing method comprises the steps of acquiring a voice signal and a speaker vector, wherein the voice signal comprises a time domain resolution; obtaining Mel spectrum data according to the voice signal; introducing the Mel spectrum data into a first convolution layer in a preset vocoder network structure for extraction processing to obtain initial hidden state data, wherein the vocoder network structure comprises the first convolution layer, an up-sampling layer, a residual error layer and a second convolution layer, and the number of channels of the first convolution layer is different from that of the second convolution layer; under the condition that dimension reduction hidden state data are obtained by performing up-sampling processing on the initial hidden state data through the up-sampling layer, introducing the speaker vector and the dimension reduction hidden state data into the residual error layer for synthesis processing to obtain mixed data, wherein the sequence length of the dimension reduction hidden state data is consistent with the time domain resolution; and importing the mixed data into the second convolution layer for dimensionality reduction processing to obtain a voice waveform.
In a first aspect, referring to fig. 1, fig. 1 shows a flowchart of a speech processing method provided in an embodiment of the present invention, where the speech processing method includes, but is not limited to, the following steps:
step S100, acquiring a voice signal and a speaker vector, wherein the voice signal comprises a time domain resolution;
step S200, obtaining Mel spectrum data according to the voice signal;
step S300, introducing the Mel spectrum data into a first convolution layer in a preset vocoder network structure for extraction processing to obtain initial hidden state data, wherein the vocoder network structure comprises the first convolution layer, an up-sampling layer, a residual error layer and a second convolution layer, and the number of channels of the first convolution layer is different from the number of channels of the second convolution layer;
step S400, under the condition that dimension reduction hidden state data are obtained by performing upsampling processing on the initial hidden state data through the upsampling layer, introducing the speaker vector and the dimension reduction hidden state data into the residual error layer for synthesis processing to obtain mixed data, wherein the sequence length of the dimension reduction hidden state data is consistent with the time domain resolution;
and step S500, introducing the mixed data into the second convolution layer for dimension reduction processing to obtain a voice waveform.
It is understood that a speech signal to be synthesized and a speaker vector are obtained, where the speech signal may be a text signal awaiting speech synthesis and the speaker vector may be data from outside the training data set. Processing the speech signal yields the corresponding Mel spectrum data, which is the amplitude spectrum on each Mel scale after the signal frequencies are converted to the Mel scale. After the Mel spectrum data is obtained, the Mel spectrum data and the speaker vector are respectively introduced into a preset vocoder network structure for speech synthesis to obtain the required voice waveform.

The preset vocoder network structure comprises a first convolution layer, an up-sampling layer, a residual error layer and a second convolution layer, connected in that order: the output of the first convolution layer feeds the up-sampling layer, the output of the up-sampling layer feeds the residual error layer, and the output of the residual error layer feeds the second convolution layer, which finally outputs the voice waveform. The Mel spectrum data is first imported into the first convolution layer, which performs convolution extraction on it and produces the initial hidden state data corresponding to the Mel spectrum data.

The initial hidden state data output by the first convolution layer is then input into the up-sampling layer. Up-sampling reduces the feature dimension of the initial hidden state data, yielding dimension reduction hidden state data, and raises the time domain resolution corresponding to the data, so that the sequence length of the dimension reduction hidden state data becomes consistent with the time domain resolution required for speech synthesis, which improves the naturalness of the subsequent synthesis. Because the speech signal includes the time domain resolution required for speech synthesis, that resolution can be extracted directly from the speech signal. Once dimension reduction hidden state data whose sequence length matches the time domain resolution is obtained, it is imported together with the speaker vector into the residual error layer of the vocoder network structure, which synthesizes the two and establishes the correlation between them to obtain mixed data. Since the voice signal and the voice waveform are both one-dimensional signals, the mixed data is finally input into the second convolution layer for dimension reduction, lowering its feature dimension until the voice waveform, a one-dimensional signal, is obtained.
In this way, the preset vocoder network structure raises the time domain resolution of the hidden state data corresponding to the Mel spectrum of the speech signal, introduces a speaker vector from outside the training data set into the residual error layer, and models the local correlation of speech on the time scale to obtain the voice waveform. Since only the speech signal to be synthesized and a small amount of data for the speaker vector outside the training data set are required, the naturalness of speaker speech synthesis can be improved even when synthesizing speech for speakers outside the training data set with scarce data.
Referring to fig. 2, step S400 in the embodiment shown in fig. 1 includes, but is not limited to, the following steps:
step S410, calculating the speaker vector and the dimensionality reduction hidden state data according to a preset activation function to obtain comprehensive mapping data;
and step S420, obtaining mixed data according to the comprehensive mapping data and the dimensionality reduction hidden state data.
It is understood that an activation function (AF) is a function applied at a neuron of a neural network that maps the neuron's inputs to its output. An activation function is preset in the residual error layer of the vocoder network structure, and the inputs of the residual error layer are computed through it: the speaker vector and the dimensionality reduction hidden state data undergo a weighted summation calculation through the activation function, the correlation between them is modeled, and the comprehensive mapping data is obtained by this mapping. The activation function captures local information in the dimensionality reduction hidden state data, and introducing the speaker vector into the residual error layer improves the speech naturalness for a specific speaker. After the comprehensive mapping data is obtained, it is mixed with the dimensionality reduction hidden state data to obtain the mixed data. Thus, by introducing a speaker vector from outside the training data set into the residual error layer, speech synthesis can be performed without a large amount of data while improving the naturalness of the synthesized speech for the specific speaker.
Referring to FIG. 3, step S410 in the embodiment shown in FIG. 2 includes, but is not limited to, the following steps:
step S411, carrying out superposition processing on dimension reduction hidden state data and speaker convolution quantity to obtain initial mixed data;
step S412, importing the initial mixed data into a preset first activation function for calculation to obtain first mapping data; importing the initial mixed data into a preset second activation function for calculation to obtain second mapping data;
step S413, perform matrix dot product calculation on the first mapping data and the second mapping data to obtain comprehensive mapping data.
It can be understood that, because both the speech signal and the target voice waveform are one-dimensional signals, a one-dimensional convolution calculation is applied to the speaker vector to reduce its feature dimension, which improves the accuracy of the voice waveform and the naturalness of speech synthesis, and also facilitates the mixing calculation between the speaker vector and the dimensionality reduction hidden state data. The value obtained from the speaker vector after this one-dimensional convolution is the speaker convolution quantity. The dimensionality reduction hidden state data and the speaker convolution quantity are superposed to obtain the initial mixed data, which mixes the speech signal with the speaker information. The initial mixed data is then imported into a preset first activation function and a preset second activation function respectively: the first activation function performs a weighted summation calculation on the initial mixed data to obtain the first mapping data, and the second activation function performs a weighted summation calculation on the initial mixed data to obtain the second mapping data. Computing the initial mixed data with multiple activation functions effectively improves the speech synthesis naturalness for a specific speaker. Finally, a matrix dot product of the first mapping data and the second mapping data yields the comprehensive mapping data.
The first activation function and the second activation function may be the same or different; for example, they may be any two of a Sigmoid function, a Tanh function, a ReLU function, a LeakyReLU function, an ELU function, a PReLU function, a Softmax function, and a Softplus function, or both may be the same one of these.
Referring to fig. 4, step S420 in the embodiment shown in fig. 2 includes, but is not limited to, the following steps:
and step S421, performing superposition processing on a numerical value obtained by one-dimensional convolution calculation of the comprehensive mapping data and the dimensionality reduction hidden state convolution quantity to obtain mixed data.
It can be understood that, since both the speech signal and the target voice waveform are one-dimensional signals, in order to improve the naturalness of voice waveform synthesis, after the matrix dot product of the first mapping data and the second mapping data yields the comprehensive mapping data, a one-dimensional convolution calculation is applied to the comprehensive mapping data to obtain the comprehensive mapping convolution quantity. Likewise, a one-dimensional convolution calculation is applied to the dimensionality reduction hidden state data to obtain the dimensionality reduction hidden state convolution quantity. The feature dimension of the comprehensive mapping convolution quantity then corresponds to that of the dimensionality reduction hidden state convolution quantity, normalizing the two so that they can be superposed, which improves the accuracy of the mixed data calculation.
Referring to fig. 5, step S400 in the embodiment shown in fig. 1 includes, but is not limited to, the following steps:
step S430, under the condition that the sequence length of the dimensionality reduction hidden state data is inconsistent with the time domain resolution, importing the mixed data into a next upsampling layer for upsampling to obtain new dimensionality reduction hidden state data;
step S440, importing the speaker vector and the new dimension-reducing hidden state data into a next residual error layer for synthesis processing to obtain new mixed data until the sequence length of the new dimension-reducing hidden state data is consistent with the time domain resolution.
It is understood that a plurality of up-sampling layers and residual error layers may be provided in the vocoder network structure, with the number of up-sampling layers equal to the number of residual error layers; for example, if 4 up-sampling layers are provided, then 4 residual error layers are provided as well. When there are multiple up-sampling layers and residual error layers, they are connected in alternating sequence. For example, with 3 up-sampling layers (a first, a second and a third sampling layer) and 3 corresponding residual error layers (a first, a second and a third residual error layer): the output of the first convolution layer feeds the first sampling layer, whose output feeds the first residual error layer; the output of the first residual error layer feeds the second sampling layer, whose output feeds the second residual error layer; the output of the second residual error layer feeds the third sampling layer, whose output feeds the third residual error layer; and the output of the third residual error layer feeds the second convolution layer. The vocoder network structure uses multiple up-sampling layers and residual error layers to progressively reduce the feature dimension of the Mel spectrum hidden state data, i.e. of the initial hidden state data and then of the dimension reduction hidden state data, and to raise the time domain resolution of the data until the sequence length of the dimension reduction hidden state data is consistent with the time domain resolution of the speech signal. If speech synthesis were performed with dimension reduction hidden state data whose sequence length does not match the time domain resolution of the speech signal, the resulting voice waveform would have the wrong time domain resolution and lack naturalness. It is therefore necessary, before speech synthesis, to check whether the sequence length of the dimension reduction hidden state data is consistent with the time domain resolution of the speech signal.
Therefore, when the sequence length of the dimensionality reduction hidden state data is inconsistent with the time domain resolution of the speech signal, the data must be up-sampled again, i.e. processed by the next up-sampling layer, which reduces its feature dimension and raises its corresponding time domain resolution, producing new dimensionality reduction hidden state data. The convolution kernel size of successive up-sampling layers can decrease gradually, as can the kernel stride and the number of channels. For example, in the case where 4 up-sampling layers are provided, the convolution kernel sizes of the respective up-sampling layers may be 16, 16, 4 and 4, the step sizes may be 8, 8, 2 and 2, and the numbers of channels may be 256, 128, 64 and 32, respectively. The feature dimension of the data is thus reduced step by step across the up-sampling layers while the corresponding time domain resolution increases.
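Since each up-sampling layer stretches the sequence by its stride, the total growth is the product of the strides; a quick arithmetic check (the frame count is a made-up example):

```python
strides = [8, 8, 2, 2]                 # step sizes of the four up-sampling layers

total_upsampling = 1
for s in strides:
    total_upsampling *= s
print(total_upsampling)                # 256: each Mel frame expands to 256 waveform samples

mel_frames = 200                       # hypothetical number of Mel spectrum frames
print(mel_frames * total_upsampling)   # 51200 samples after the fourth up-sampling layer
```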
After the dimension reduction hidden state data has been up-sampled, the new dimension reduction hidden state data and the speaker vector are imported into the next residual error layer for synthesis processing, which computes the local information among the Mel spectrum hidden state data and models the correlation within the speech, producing new mixed data; this repeats until the sequence length of the new dimension reduction hidden state data is consistent with the time domain resolution, achieving the effect of improving speech synthesis naturalness. Note that even when, after an up-sampling layer, the sequence length of the new dimension reduction hidden state data already matches the time domain resolution of the speech signal, the data is still imported into the next residual error layer and synthesized with the speaker vector to obtain new mixed data and improve the naturalness of speech synthesis.
It can be understood that, in order to improve data accuracy in the vocoder network structure, the first convolutional layer comprises a plurality of one-dimensional convolutional layers whose convolution kernels each have size 7, stride 1, and 512 channels, so that the feature dimension of the Mel spectrum data can be raised to 512, which aids subsequent convolution calculations and improves the naturalness of speech synthesis. In addition, the second convolutional layer may be a one-dimensional convolutional layer with kernel size 7, stride 1, and 1 channel, so that the feature dimension of the mixed data can be reduced to 1 and a voice waveform can be output; the kernel size and stride of the second convolutional layer match those of the first convolutional layer, which also helps improve the naturalness of speech synthesis.
Referring to fig. 6, step S200 in the embodiment shown in fig. 1 includes, but is not limited to, the following steps:
step S210, carrying out short-time Fourier transform calculation processing on the voice signal to obtain a voice amplitude spectrum;
step S220, filtering the voice amplitude spectrum by using an 80-dimensional Mel filter bank to obtain Mel spectrum data.
It can be understood that a sound signal is a one-dimensional signal from which only time domain information, not frequency information, can be read directly; that is, the time domain resolution can be extracted straight from the voice signal. A Fourier Transform (FT) can move the voice signal to the frequency domain, but the time domain information is then lost and the time-frequency relationship cannot be seen. A Short-Time Fourier Transform (STFT) is therefore used to compute the voice amplitude spectrum: the voice signal is framed and windowed, a Fourier transform is applied to each frame, and the per-frame results are stacked along another dimension to obtain the voice amplitude spectrum, with a frame length of 50 ms and a frame shift of 12.5 ms.
Owing to the characteristics of human hearing, people are more sensitive to low-frequency sounds than to high-frequency ones: as the frequency of a sound increases linearly, it becomes harder and harder to distinguish. The core of Mel spectrum data is the Mel scale, a logarithmic scale on which the perception of frequency varies linearly. Converting the voice signal into Mel spectrum data for synthesis, i.e. converting frequency to the Mel scale, therefore makes human frequency perception linear, which improves the naturalness of the synthesized voice waveform and the intelligibility of the speech. The voice amplitude spectrum is filtered with an 80-dimensional Mel filter bank to obtain Mel spectrum data suitable for speech synthesis processing.
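A minimal sketch of this Mel extraction using librosa (the 16 kHz sampling rate and the file name are assumptions; the application specifies only the 50 ms frame length, the 12.5 ms frame shift, and the 80-dimensional Mel filter bank):

```python
import numpy as np
import librosa

sr = 16000                                # assumed sampling rate; not specified by the application
win_length = int(0.050 * sr)              # 50 ms frame length -> 800 samples
hop_length = int(0.0125 * sr)             # 12.5 ms frame shift -> 200 samples

y, _ = librosa.load("speech.wav", sr=sr)  # hypothetical input file

# Short-time Fourier transform -> voice amplitude spectrum
stft = librosa.stft(y, n_fft=win_length, win_length=win_length, hop_length=hop_length)
magnitude = np.abs(stft)                  # shape: (1 + win_length // 2, number_of_frames)

# Filter the amplitude spectrum with an 80-dimensional Mel filter bank
mel_fb = librosa.filters.mel(sr=sr, n_fft=win_length, n_mels=80)
mel_spectrum = mel_fb @ magnitude         # shape: (80, number_of_frames)
```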
Referring to fig. 7, fig. 7 is a schematic diagram illustrating a vocoder network structure model according to an embodiment of the present invention.
It can be understood that the vocoder network structure model is composed of convolution layers, which up-sample the Mel spectrum data corresponding to the voice signal until the sequence length corresponding to the Mel spectrum data is consistent with the time domain resolution of the required voice waveform, and then output the voice waveform. The vocoder network structure model comprises a first convolution layer, up-sampling layers, residual error layers and a second convolution layer. Four up-sampling layers and four residual error layers are provided, connected alternately. The up-sampling layers are all one-dimensional transposed convolutions; their convolution kernel sizes are 16, 16, 4 and 4, their step sizes are 8, 8, 2 and 2, and their channel numbers are 256, 128, 64 and 32, respectively. The model thus uses the up-sampling layers to gradually reduce the feature dimension of the data and raise the corresponding time domain resolution; after the fourth up-sampling layer, the sequence length of the dimension reduction hidden state data is consistent with the time domain resolution of the voice waveform. Each residual error layer takes the speaker vector and the output of the preceding layer, i.e. the preceding up-sampling layer, as input and computes the mixed data. Introducing the speaker vector into the residual error layers improves the speech naturalness for the specific speaker.
In addition, the first convolution layer comprises a plurality of one-dimensional convolution layers, the convolution kernel sizes of the one-dimensional convolution layers in the first convolution layer are all 7, the step lengths are all 1, and the channel number is 512, so that after the Mel-spectrum data is subjected to convolution extraction of the first convolution layer to obtain a hidden state, the characteristic dimension of the obtained initial hidden state data is increased to 512. And the second convolution layer is a one-dimensional convolution layer, the convolution kernel size of the second convolution layer is 7, the step length is 1, and the number of channels is 1, so that the characteristic dimension of the mixed data can be reduced to 1, and a voice waveform is output.
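Putting the FIG. 7 layout together, a minimal PyTorch sketch of the generator (the padding choices, the tanh output activation, the simplified additive residual placeholder, and all tensor shapes are assumptions not fixed by the application; the gated, speaker-conditioned residual layer of FIG. 8 is sketched separately below):

```python
import torch
import torch.nn as nn

class VocoderGenerator(nn.Module):
    """FIG. 7 layout: first conv -> 4 x (up-sampling + residual) -> second conv."""

    def __init__(self, n_mels: int = 80):
        super().__init__()
        # First convolution layer: kernel 7, stride 1, 512 channels.
        self.first_conv = nn.Conv1d(n_mels, 512, kernel_size=7, stride=1, padding=3)

        kernels = [16, 16, 4, 4]            # convolution kernel sizes of the up-sampling layers
        strides = [8, 8, 2, 2]              # step sizes
        channels = [512, 256, 128, 64, 32]  # channel numbers before/after each layer
        self.upsamples = nn.ModuleList([
            nn.ConvTranspose1d(channels[i], channels[i + 1], kernel_size=kernels[i],
                               stride=strides[i], padding=(kernels[i] - strides[i]) // 2)
            for i in range(4)
        ])
        # Placeholder residual layers; speaker conditioning is omitted in this sketch.
        self.residuals = nn.ModuleList([
            nn.Conv1d(channels[i + 1], channels[i + 1], kernel_size=7, padding=3)
            for i in range(4)
        ])
        # Second convolution layer: kernel 7, stride 1, 1 channel -> one-dimensional waveform.
        self.second_conv = nn.Conv1d(32, 1, kernel_size=7, stride=1, padding=3)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        h = self.first_conv(mel)                # initial hidden state data
        for up, res in zip(self.upsamples, self.residuals):
            h = up(h)                           # dimension reduction hidden state data
            h = h + res(h)                      # simplified residual synthesis
        return torch.tanh(self.second_conv(h))  # voice waveform (tanh squashing is assumed)

mel = torch.randn(1, 80, 200)                   # hypothetical Mel input
wave = VocoderGenerator()(mel)
print(wave.shape)                               # torch.Size([1, 1, 51200]): 200 frames x 256
```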
Referring to fig. 8, fig. 8 is a schematic diagram illustrating a residual layer processing model according to an embodiment of the present invention.
It can be understood that the four up-sampling layers and four residual error layers are connected alternately, and each residual error layer takes the output of the up-sampling layer above it together with the speaker vector as input and outputs mixed data. Two activation functions are provided in the residual error layer, namely a Tanh function and a Sigmoid function. The mixed data in the residual error layer can be calculated by formula (1):
z = w_x * x + w_z * [tanh(x + w_y * y) ⊙ sigmoid(x + w_y * y)]    (1)
where x is the dimensionality reduction hidden state data output by the preceding up-sampling layer; y is the speaker vector; z is the mixed data; and w_n * n denotes a one-dimensional convolution calculation with a kernel size of 1 applied to n, so that w_x * x is the dimensionality reduction hidden state convolution quantity and w_y * y is the speaker convolution quantity.
The dimensionality reduction hidden state data output by the preceding up-sampling layer and the speaker vector are superposed to obtain the initial mixed data. The initial mixed data is then computed with the Tanh function and the Sigmoid function respectively: the Tanh function yields the first mapping data and the Sigmoid function yields the second mapping data. A matrix dot product of the first mapping data and the second mapping data gives the comprehensive mapping data. A one-dimensional convolution calculation is applied to the output of the preceding up-sampling layer, i.e. to the dimensionality reduction hidden state data, to obtain the dimensionality reduction hidden state convolution quantity, which is then superposed with the result of a one-dimensional convolution of the comprehensive mapping data to obtain the mixed data. By introducing the speaker vector into the residual error layer, the naturalness of the synthesized speech can be effectively improved compared with conventional methods that do not introduce a speaker vector, including for speakers outside the training data set with scarce data.
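A minimal PyTorch sketch of a residual layer implementing formula (1) (the channel counts, the speaker vector shape, and the broadcast along time are assumptions for illustration):

```python
import torch
import torch.nn as nn

class GatedResidualLayer(nn.Module):
    """Formula (1): z = w_x*x + w_z*[tanh(x + w_y*y) (.) sigmoid(x + w_y*y)],
    where each w_n* is a one-dimensional convolution with kernel size 1."""

    def __init__(self, channels: int, speaker_dim: int):
        super().__init__()
        self.w_x = nn.Conv1d(channels, channels, kernel_size=1)     # -> dimensionality reduction hidden state convolution quantity
        self.w_y = nn.Conv1d(speaker_dim, channels, kernel_size=1)  # -> speaker convolution quantity
        self.w_z = nn.Conv1d(channels, channels, kernel_size=1)     # projects the comprehensive mapping data

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: dimensionality reduction hidden state data, shape (batch, channels, time)
        # y: speaker vector, shape (batch, speaker_dim, 1), broadcast along time
        initial_mixed = x + self.w_y(y)                # superposition -> initial mixed data
        comprehensive = torch.tanh(initial_mixed) * torch.sigmoid(initial_mixed)  # matrix dot product
        return self.w_x(x) + self.w_z(comprehensive)   # mixed data z

layer = GatedResidualLayer(channels=256, speaker_dim=128)  # hypothetical sizes
x = torch.randn(1, 256, 1600)
y = torch.randn(1, 128, 1)
print(layer(x, y).shape)                                   # torch.Size([1, 256, 1600])
```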
In a second aspect, referring to fig. 9, fig. 9 is a schematic structural diagram of a speech processing system 900 according to an embodiment of the present invention.
The speech processing system 900 includes a data acquisition module 910, a mel-spectrum calculation module 920, a convolution extraction module 930, a sample calculation module 940, and a dimensionality reduction output module 950.
A data obtaining module 910, configured to obtain a speech signal and a speaker vector, where the speech signal includes a time domain resolution.
And a mel spectrum calculating module 920, configured to obtain mel spectrum data according to the voice signal.
The convolution extraction module 930 is configured to introduce the mel spectrum data into a first convolution layer in a preset vocoder network structure for extraction processing, so as to obtain initial hidden state data, where the vocoder network structure includes the first convolution layer, an upsampling layer, a residual error layer, and a second convolution layer, where the number of channels of the first convolution layer is different from the number of channels of the second convolution layer.
And a sampling calculation module 940, configured to, in a case that the initial hidden state data is subjected to upsampling processing by an upsampling layer to obtain dimension-reduced hidden state data, introduce the speaker vector and the dimension-reduced hidden state data into a residual error layer for synthesis processing to obtain mixed data, where a sequence length of the dimension-reduced hidden state data is consistent with a time domain resolution.
And a dimension reduction output module 950, configured to import the mixed data into the second convolution layer for performing dimension reduction processing, so as to obtain a voice waveform.
The sampling calculation module 940 further includes a function mapping module 941 and a mixing processing module 942.
A function mapping module 941, configured to calculate speaker vectors and dimension reduction hidden state data according to a preset activation function to obtain comprehensive mapping data;
the hybrid processing module 942 is configured to obtain hybrid data according to the comprehensive mapping data and the dimensionality reduction hidden state data.
In addition, the function mapping module 941 is further configured to perform superposition processing on the dimensionality reduction hidden state data and the speaker convolution quantity to obtain initial mixed data; importing the initial mixed data into a preset first activation function for calculation to obtain first mapping data; importing the initial mixed data into a preset second activation function for calculation to obtain second mapping data; and performing matrix dot product calculation on the first mapping data and the second mapping data to obtain comprehensive mapping data.
In addition, the hybrid processing module 942 is further configured to perform superposition processing on a value obtained by performing one-dimensional convolution calculation on the comprehensive mapping data and the dimensionality reduction hidden state convolution amount to obtain hybrid data.
The sampling calculation module 940 further includes a resolution determination module 943. There may be a plurality of upsampling layers and residual layers, in which case the upsampling layers correspond one-to-one with the residual layers, and the upsampling layers and the residual layers are connected in sequence.
The resolution determination module 943 is configured to, when the sequence length of the dimension-reduced hidden state data is inconsistent with the time domain resolution, import the mixed data into the next upsampling layer for upsampling to obtain new dimension-reduced hidden state data.
The mixing processing module 942 is further configured to import the speaker vector and the new dimension-reduced hidden state data into the next residual layer for synthesis processing to obtain new mixed data, until the sequence length of the new dimension-reduced hidden state data is consistent with the time domain resolution.
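The loop implied by the resolution determination module might look like the following sketch, under the assumption that each layer is a callable; `upsample_layers` and `residual_layers` are hypothetical sequences of modules connected as described above.

```python
def synthesize(hidden, speaker, upsample_layers, residual_layers,
               time_domain_resolution):
    # Iterate over the paired upsampling and residual layers until the
    # sequence length of the mixed data matches the time domain resolution.
    mixed = hidden  # initial hidden state data from the first convolution layer
    for up, res in zip(upsample_layers, residual_layers):
        reduced = up(mixed)            # new dimension-reduced hidden state data
        mixed = res(reduced, speaker)  # new mixed data from the residual layer
        if mixed.shape[-1] == time_domain_resolution:
            break  # sequence length is now consistent with the resolution
    return mixed  # ready for the second convolution layer
```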
In addition, the mel spectrum calculation module 920 further includes a voice spectrum calculation module 921 and a mel filtering module 922.
The voice spectrum calculation module 921 is configured to perform short-time Fourier transform calculation on the voice signal to obtain a voice magnitude spectrum.
The mel filtering module 922 is configured to filter the voice magnitude spectrum with an 80-dimensional mel filter bank to obtain mel spectrum data.
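For reference, this two-step computation (short-time Fourier transform, then an 80-dimensional mel filter bank) could be written with librosa as in the sketch below; the sampling rate, FFT size, and hop length are assumptions for the example, while the 80 mel bands come from this document.

```python
import numpy as np
import librosa

def mel_spectrum(wav: np.ndarray, sr: int = 22050,
                 n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
    # short-time Fourier transform -> complex spectrogram
    stft = librosa.stft(wav, n_fft=n_fft, hop_length=hop_length)
    magnitude = np.abs(stft)  # voice magnitude spectrum
    # 80-dimensional mel filter bank applied to the magnitude spectrum
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=80)
    return mel_basis @ magnitude  # mel spectrum data, shape (80, n_frames)
```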
In a third aspect, referring to fig. 10, fig. 10 is a schematic structural diagram of an electronic device 1000 according to an embodiment of the present invention. The electronic device 1000 includes a memory 1010, a processor 1020, and a computer program stored in the memory 1010 and executable on the processor 1020; the processor 1020 implements the voice processing method in the above embodiment when executing the computer program.
The memory 1010, as a non-transitory computer-readable storage medium, is used to store non-transitory software programs and non-transitory computer-executable programs, such as the program implementing the voice processing method in the above embodiment of the present invention. The processor 1020 implements the voice processing method in the above embodiment by running the non-transitory software programs and instructions stored in the memory 1010.
The memory 1010 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data necessary to execute the voice processing method in the above embodiment. Further, the memory 1010 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. It is to be appreciated that the memory 1010 can alternatively comprise memory located remotely from the processor 1020, and such remote memory can be coupled to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the voice processing method in the above-described embodiments are stored in the memory, and when executed by one or more processors, perform the voice processing method in the above-described embodiments, for example, method steps S100 to S500 in fig. 1, method steps S410 to S420 in fig. 2, method steps S411 to S413 in fig. 3, method step S421 in fig. 4, method steps S430 to S440 in fig. 5, and method steps S210 to S220 in fig. 6.
In a fourth aspect, the present invention also provides a computer-readable storage medium storing computer-executable instructions for causing a computer to execute the voice processing method as in the above-described embodiments, for example, to execute the above-described method steps S100 to S500 in fig. 1, method steps S410 to S420 in fig. 2, method steps S411 to S413 in fig. 3, method step S421 in fig. 4, method steps S430 to S440 in fig. 5, and method steps S210 to S220 in fig. 6.
The above described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media as known to those skilled in the art.
It should be noted that the server may be an independent server, or may be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), and a big data and artificial intelligence platform.
It should be noted that all or some of the steps of the above-disclosed methods may be used in any number of general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (10)

1. A voice processing method, the method comprising:
acquiring a voice signal and a speaker vector, wherein the voice signal has a time domain resolution;
obtaining mel spectrum data from the voice signal;
importing the mel spectrum data into a first convolution layer in a preset vocoder network structure for extraction processing to obtain initial hidden state data, wherein the vocoder network structure comprises the first convolution layer, an upsampling layer, a residual layer and a second convolution layer, and the number of channels of the first convolution layer is different from the number of channels of the second convolution layer;
when dimension-reduced hidden state data is obtained by upsampling the initial hidden state data through the upsampling layer, importing the speaker vector and the dimension-reduced hidden state data into the residual layer for synthesis processing to obtain mixed data, wherein the sequence length of the dimension-reduced hidden state data is consistent with the time domain resolution;
and importing the mixed data into the second convolution layer for dimension reduction processing to obtain a voice waveform.
2. The voice processing method according to claim 1, wherein importing the speaker vector and the dimension-reduced hidden state data into the residual layer for synthesis processing to obtain mixed data comprises:
calculating the speaker vector and the dimension-reduced hidden state data according to a preset activation function to obtain comprehensive mapping data;
and obtaining mixed data from the comprehensive mapping data and the dimension-reduced hidden state data.
3. The voice processing method according to claim 2, wherein calculating the speaker vector and the dimension-reduced hidden state data according to a preset activation function to obtain comprehensive mapping data comprises:
superposing the dimension-reduced hidden state data and the speaker convolution value to obtain initial mixed data;
importing the initial mixed data into a preset first activation function for calculation to obtain first mapping data; importing the initial mixed data into a preset second activation function for calculation to obtain second mapping data;
performing matrix dot product calculation on the first mapping data and the second mapping data to obtain comprehensive mapping data;
wherein the speaker convolution value is the value obtained by one-dimensional convolution calculation on the speaker vector.
4. The voice processing method according to claim 2, wherein obtaining mixed data from the comprehensive mapping data and the dimension-reduced hidden state data comprises:
superposing the value obtained by one-dimensional convolution calculation on the comprehensive mapping data with the dimension-reduced hidden state convolution value to obtain mixed data;
wherein the dimension-reduced hidden state convolution value is the value obtained by one-dimensional convolution calculation on the dimension-reduced hidden state data.
5. The voice processing method according to claim 1, wherein there are a plurality of upsampling layers and a plurality of residual layers, the upsampling layers correspond one-to-one with the residual layers, and the upsampling layers and the residual layers are connected in sequence;
and wherein, when dimension-reduced hidden state data is obtained by upsampling the initial hidden state data through the upsampling layer, importing the speaker vector and the dimension-reduced hidden state data into the residual layer for synthesis processing to obtain mixed data comprises:
when the sequence length of the dimension-reduced hidden state data is inconsistent with the time domain resolution, importing the mixed data into the next upsampling layer for upsampling to obtain new dimension-reduced hidden state data;
and importing the speaker vector and the new dimension-reduced hidden state data into the next residual layer for synthesis processing to obtain new mixed data, until the sequence length of the new dimension-reduced hidden state data is consistent with the time domain resolution.
6. The voice processing method according to claim 1, wherein the first convolution layer comprises a plurality of one-dimensional convolution layers, each one-dimensional convolution layer having a convolution kernel of size 7, a stride of 1 and 512 channels.
7. The voice processing method according to claim 1, wherein obtaining mel spectrum data from the voice signal comprises:
performing short-time Fourier transform calculation on the voice signal to obtain a voice magnitude spectrum;
and filtering the voice magnitude spectrum with an 80-dimensional mel filter bank to obtain mel spectrum data.
8. A voice processing system, comprising:
a data acquisition module, configured to acquire a voice signal and a speaker vector, wherein the voice signal has a time domain resolution;
a mel spectrum calculation module, configured to obtain mel spectrum data from the voice signal;
a convolution extraction module, configured to import the mel spectrum data into a first convolution layer in a preset vocoder network structure for extraction processing to obtain initial hidden state data, wherein the vocoder network structure comprises the first convolution layer, an upsampling layer, a residual layer and a second convolution layer, and the number of channels of the first convolution layer is different from the number of channels of the second convolution layer;
a sampling calculation module, configured to, when dimension-reduced hidden state data is obtained by upsampling the initial hidden state data through the upsampling layer, import the speaker vector and the dimension-reduced hidden state data into the residual layer for synthesis processing to obtain mixed data, wherein the sequence length of the dimension-reduced hidden state data is consistent with the time domain resolution;
and a dimension reduction output module, configured to import the mixed data into the second convolution layer for dimension reduction processing to obtain a voice waveform.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the voice processing method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the voice processing method according to any one of claims 1 to 7.
CN202210820858.7A 2022-07-13 2022-07-13 Voice processing method, system, device and storage medium Pending CN115188363A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210820858.7A CN115188363A (en) 2022-07-13 2022-07-13 Voice processing method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210820858.7A CN115188363A (en) 2022-07-13 2022-07-13 Voice processing method, system, device and storage medium

Publications (1)

Publication Number Publication Date
CN115188363A true CN115188363A (en) 2022-10-14

Family

ID=83518414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210820858.7A Pending CN115188363A (en) 2022-07-13 2022-07-13 Voice processing method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN115188363A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994595A (en) * 2023-08-04 2023-11-03 中煤科工机器人科技有限公司 Coal mine robot voice interaction system
CN116994595B (en) * 2023-08-04 2024-06-07 中煤科工机器人科技有限公司 Coal mine robot voice interaction system

Similar Documents

Publication Publication Date Title
Luo et al. Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation
CN110503976A (en) Audio separation method, device, electronic equipment and storage medium
WO2018218081A1 (en) System and method for voice-to-voice conversion
CN108847249A (en) Sound converts optimization method and system
CN110335587A (en) Phoneme synthesizing method, system, terminal device and readable storage medium storing program for executing
CN110600014B (en) Model training method and device, storage medium and electronic equipment
CN111445900A (en) Front-end processing method and device for voice recognition and terminal equipment
Eskimez et al. Adversarial training for speech super-resolution
CA3195582A1 (en) Audio generator and methods for generating an audio signal and training an audio generator
KR20200088263A (en) Method and system of text to multiple speech
CN114927122A (en) Emotional voice synthesis method and synthesis device
CN115188363A (en) Voice processing method, system, device and storage medium
CN108369803A (en) The method for being used to form the pumping signal of the parameter speech synthesis system based on glottal model
CN114863905A (en) Voice category acquisition method and device, electronic equipment and storage medium
CN113744715A (en) Vocoder speech synthesis method, device, computer equipment and storage medium
WO2024055752A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
Reimao Synthetic speech detection using deep neural networks
US12039994B2 (en) Audio processing method, method for training estimation model, and audio processing system
CN115910091A (en) Method and device for separating generated voice by introducing fundamental frequency clues
CN113345416B (en) Voice synthesis method and device and electronic equipment
CN113409756B (en) Speech synthesis method, system, device and storage medium
CN113066472B (en) Synthetic voice processing method and related device
CN112164387A (en) Audio synthesis method and device, electronic equipment and computer-readable storage medium
CN115132204B (en) Voice processing method, equipment, storage medium and computer program product
CN117953854B (en) Multi-dialect voice synthesis method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination