CN116741149A - Cross-language voice conversion method, training method and related device - Google Patents

Cross-language voice conversion method, training method and related device

Info

Publication number
CN116741149A
CN116741149A (application number CN202310676661.5A / CN202310676661A)
Authority
CN
China
Prior art keywords
voice
media file
target
language
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310676661.5A
Other languages
Chinese (zh)
Other versions
CN116741149B (en)
Inventor
彭瑞达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiarui Technology Co ltd
Original Assignee
Beijing Jiarui Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiarui Technology Co ltd filed Critical Beijing Jiarui Technology Co ltd
Priority to CN202310676661.5A priority Critical patent/CN116741149B/en
Publication of CN116741149A publication Critical patent/CN116741149A/en
Application granted granted Critical
Publication of CN116741149B publication Critical patent/CN116741149B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/04: Training, enrolment or model building
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a cross-language voice conversion method, a training method and a related device. The method comprises: inputting a first media file into a cross-language speech conversion model, wherein the first media file includes background features and a first source speech uttered by a first character in a first language, and outputting a second media file based on the cross-language speech conversion model. Based on the cross-language speech conversion model, the target speech content and the target speech corresponding to the source speech are generated according to the background features of the media file, the target audience, and the speech content of the source speech, so that a user can understand the content of the media file more easily, the quality of cross-language speech conversion is improved, and the user experience is effectively improved.

Description

Cross-language voice conversion method, training method and related device
Technical Field
The present application relates to the field of natural language processing, and in particular, to a cross-language speech conversion method, a training method, and a related apparatus.
Background
With the rapid development of artificial intelligence (AI) technology, natural language processing (applicable to fields such as education, communication, and entertainment) has received attention as a mode of human-machine interaction. Among its applications, cross-language speech conversion has significant potential. For example, a source speech uttered in a first voice in a media file (in a first language, such as English) may be converted into a semantically equivalent target speech uttered in a second voice (in a second language, such as Chinese), enabling people to communicate across languages and cultures.
A prior-art cross-language speech conversion system typically includes a feature extractor and a speech synthesizer. For example, the feature extractor performs feature extraction on the source speech uttered in the first voice to obtain feature data including the background and the first speech content; the first speech content in the first language is translated into second speech content in the second language; and the speech synthesizer synthesizes the background of the source speech and the second speech content with the voiceprint information of the second voice, thereby generating target speech with the timbre of the second voice.
However, the conversion speed of prior-art cross-language speech conversion systems is relatively slow, and factors such as cultural differences and/or timbre differences may keep the quality of the converted speech low. For example, in media products such as movies, the converted cross-language speech used for dubbing is not natural and fluent enough, and may not even allow the user to understand the contextual information in the speech, thereby degrading the user experience. As another example, in audio, timbre differences and vocal-range differences between different human voices may even cause the converted sound to be unnatural or blurred.
Disclosure of Invention
The embodiments of the present application provide a cross-language speech conversion apparatus that can realize cross-language speech conversion based on the background features of a media file, the target audience, the speech content of the source speech and voiceprint features, and effectively improve the quality of cross-language speech conversion.
In a first aspect, a cross-language speech conversion method is provided, the method comprising: inputting a first media file into a cross-language speech conversion model, wherein the first media file comprises background features and a first source speech uttered by a first character in a first language, and the cross-language speech conversion model is obtained by training on historical media files based on a neural network; and outputting a second media file based on the cross-language speech conversion model; wherein the second media file is synthesized from the background features of the first media file and a first target speech in a second language; the first target speech is generated based on the background features, a target audience of the first media file, and the speech content of the first source speech; and the voiceprint features of the first target speech depend on the target audience of the media file and/or the voiceprint features of the first source speech.
Optionally, the relationship between the duration t2 of the first target voice and the duration t1 of the first source voice satisfies: |t1 - t2| ≤ a first duration preset value.
With reference to the first aspect, in one possible implementation manner, the voiceprint similarity between the first source voice and the first target voice is smaller than a similarity preset value.
Optionally, the target voiceprint is the same as the voiceprint of the first character.
With reference to the first aspect, in one possible implementation manner, the voiceprint of the first target voice depends on a target audience of the media file and a character feature of the first character.
With reference to the first aspect, in a possible implementation manner of the first aspect, the first target speech is generated based on scene features of a movie file included in the media file, the cognitive level and cultural features of the target audience, and the speech content of the first source speech. Alternatively, the first target speech is generated in the second language based on background sounds of an audio file included in the media file, the cognitive level and cultural features of the target audience, and the speech content of the first source speech. Specifically, the first target speech is generated based on first target speech content in the second language and the voiceprint features of the first target speech, wherein the first target speech content is obtained by translating the speech content of the first source speech using amplification (addition), omission (subtraction) and/or free translation according to the scene features and the cognitive level and cultural features of the target audience, so that the relationship between the duration t2 of the first target speech and the duration t1 of the first source speech satisfies: |t1 - t2| ≤ the first duration preset value.
With reference to the first aspect, in a possible implementation manner of the first aspect, the first media file further includes a second source voice of the first language sent by the second character, and a relationship between a duration t3 of the second source voice and a duration t4 of a second target voice corresponding to the second source voice satisfies: the t3-t4 is less than or equal to a second duration preset value, and the relation of the first target voice and the second target voice in pitch depends on the relation of the first source voice and the second source voice in pitch.
Optionally, in the case that the pitch difference between the first source speech and the second source speech is a consonant (harmonic) interval, the pitch difference between the first target speech and the second target speech is also a consonant interval.
With reference to the first aspect, in a possible implementation manner of the first aspect, the cross-language speech conversion model includes a Generative Pre-trained Transformer (GPT) neural network, where the GPT neural network includes an embedding layer, N decoding modules, and an output layer, each of the N decoding modules includes a masked attention layer, a first normalization layer, a feed-forward layer, and a second normalization layer connected in sequence, and N is an integer greater than or equal to 1. The step of outputting the second media file based on the cross-language speech conversion model may include: performing position coding on the first media file to obtain a position vector, and inputting the first media file and the voiceprint features of the first target speech into the embedding layer to obtain an embedding vector; inputting the position vector and the embedding vector into the masked attention layer of the first decoding module of the N decoding modules; and outputting the second media file at the output layer after passing through the N decoding modules.
Optionally, N is greater than or equal to 5.
With reference to the first aspect, in a possible implementation manner of the first aspect, the cross-language speech conversion model includes a classifier and a Generative Pre-trained Transformer (GPT) neural network, where the GPT neural network includes an embedding layer, N decoding modules, and an output layer, each of the N decoding modules includes a masked attention layer, a first normalization layer, a feed-forward layer, and a second normalization layer connected in sequence, and N is an integer greater than or equal to 1 (e.g., N is greater than or equal to 5). Outputting the second media file based on the cross-language speech conversion model includes: obtaining classification information of the first media file using the classifier (e.g., inputting the first media file into the classifier), wherein the classification information of the first media file includes the media type of the first media file, the audience group, the cognitive level of the target audience, cultural features, and/or character features of the first character; performing position coding on the first media file to obtain a position vector, and inputting the first media file into the embedding layer to obtain an embedding vector; inputting the position vector and the embedding vector into the masked attention layer of the first decoding module of the N decoding modules; and, after passing through the N decoding modules, outputting the second media file at the output layer based on the classification information of the first media file, wherein the classification information of the first media file is used for generating the first target speech of the second media file.
Optionally, the classifier is connected to the embedding layer, wherein the output of the classifier is part of the input of the embedding layer; or the classifier is connected to the output layer, wherein the output of the classifier is part of the input of the output layer; or the classifier is connected to the masked attention layer of the first decoding module, wherein the output of the classifier is part of the input of the masked attention layer of the first decoding module.
Optionally, the audience group is categorized based on age, gender, and/or occupation; the cognitive level is categorized based on educational level and/or social living environment; the cultural features are categorized based on country, ethnicity, and/or language; the media type is categorized based on the format of the media file; or the character features of the first character are categorized based on appearance, behavior, personality, and personal qualities.
Optionally, each decoding module adopts a residual structure, in which the input and output of the masked attention layer of each decoding module are taken as the input of the first normalization layer, and the input and output of the feed-forward layer are taken as the input of the second normalization layer.
Optionally, before inputting the first media file into the cross-language speech conversion model, the method may further comprise: dividing the first media file into a plurality of segments; wherein the size of each of the plurality of segments is less than a first data threshold, the first data threshold is on the order of 10^7 or less, and the size of the data output through the output layer for each segment is approximately equal to the size of the data input at the embedding layer.
In a second aspect, a method for training a cross-language speech conversion model is provided, the method comprising: inputting at least one historical media file in a sample set of historical media files into the cross-language speech conversion model, wherein the historical media file includes background features and source speech uttered in a first language; outputting a target media file based on the cross-language speech conversion model, wherein the target media file is synthesized from the background features of the historical media file and target speech in a second language, the target speech is generated based on the background features, the target audience of the historical media file, and the speech content of the source speech, and the voiceprint features of the target speech depend on the target audience of the historical media file and/or the voiceprint features of the source speech; determining a loss function according to the target media file and the real file corresponding to the historical media file, wherein the loss function is determined according to the voiceprint features of the target speech and the voiceprint features of the speech of the real file, and according to the speech content of the target speech and the speech content of the real file; and adjusting parameters of the cross-language speech conversion model and performing the next iteration of training based on the sample set of historical media files until the loss function converges.
Optionally, the loss function is a mean square error loss function.
Optionally, the loss function may also be determined based on a duration of the target voice of the target media file and a voice duration of the real file.
Optionally, the loss function may further include a prosodic feature loss determined by the prosodic features of the target speech of the first target media file and the prosodic features of the speech of the real file. For example, the prosodic features may be intonation, duration, pitch, or scale.
In a third aspect, a cross-language speech conversion apparatus is provided, the apparatus comprising a processing unit and a storage unit, the processing unit being configured to input a first media file into a cross-language speech conversion model stored by the storage unit, and to output a second media file based on the cross-language speech conversion model. The first media file comprises background features and a first source speech uttered by a first character in a first language, and the cross-language speech conversion model is obtained by training on historical media files based on a neural network; the second media file is synthesized from the background features of the first media file and a first target speech in a second language; the first target speech is generated based on the background features, a target audience of the first media file, and the speech content of the first source speech; and the voiceprint features of the first target speech depend on the target audience of the media file and/or the voiceprint features of the first source speech.
Optionally, the relationship between the duration t2 of the first target voice and the duration t1 of the first source voice satisfies: |t1 - t2| ≤ a first duration preset value.
With reference to the third aspect, in a possible implementation manner of the third aspect, a voiceprint similarity between the first source voice and the first target voice is smaller than a similarity preset value; optionally, the target voiceprint is the same as the voiceprint of the first character; alternatively, the voiceprint of the first target voice is dependent upon a target audience of the media file and a character characteristic of the first character.
With reference to the third aspect, in a possible implementation manner of the third aspect, the processing unit is specifically configured to: generate the first target speech based on scene features of a movie file included in the media file, the cognitive level and cultural features of the target audience, and the speech content of the first source speech; or generate the first target speech based on background sounds of an audio file included in the media file, the cognitive level and cultural features of the target audience, and the speech content of the first source speech.
Optionally, the speech content of the first source speech is translated to obtain first target speech content using amplification (addition), omission (subtraction) and/or free translation according to the scene features and the cognitive level and cultural features of the target audience; and the first target speech is generated based on the first target speech content in the second language and the voiceprint features of the first target speech, so that the relationship between the duration t2 of the first target speech and the duration t1 of the first source speech satisfies: |t1 - t2| ≤ a first duration preset value.
With reference to the third aspect, in a possible implementation manner of the third aspect, the first media file further includes a second source speech in the first language uttered by a second character, and the relationship between the duration t3 of the second source speech and the duration t4 of a second target speech corresponding to the second source speech satisfies: |t3 - t4| ≤ a second duration preset value; the pitch relationship between the first target speech and the second target speech depends on the pitch relationship between the first source speech and the second source speech. Optionally, in the case that the pitch difference between the first source speech and the second source speech is a consonant interval, the pitch difference between the first target speech and the second target speech is also a consonant interval.
With reference to the third aspect, in a possible implementation manner of the third aspect, the cross-language speech conversion model stored by the storage unit includes a Generative Pre-trained Transformer (GPT) neural network, where the GPT neural network includes an embedding layer, N decoding modules, and an output layer, each of the N decoding modules includes a masked attention layer, a first normalization layer, a feed-forward layer, and a second normalization layer connected in sequence, and N is an integer greater than 1. The processing unit may be specifically configured to: perform position coding on the first media file to obtain a position vector, and input the first media file and the voiceprint features of the first target speech into the embedding layer to obtain an embedding vector; input the position vector and the embedding vector into the masked attention layer of the first decoding module of the N decoding modules; and output the second media file at the output layer after passing through the N decoding modules.
With reference to the third aspect, in a possible implementation manner of the third aspect, the cross-language speech conversion model stored by the storage unit includes a classifier and a Generative Pre-trained Transformer (GPT) neural network, where the GPT neural network includes an embedding layer, N decoding modules and an output layer, each of the N decoding modules includes a masked attention layer, a first normalization layer, a feed-forward layer and a second normalization layer connected in sequence, and N is an integer greater than 1. The processing unit may be specifically configured to: obtain classification information of the first media file using the classifier, wherein the classification information of the first media file includes the media type of the first media file, the audience group, the cognitive level of the target audience, cultural features and/or character features of the first character; perform position coding on the first media file to obtain a position vector, and input the first media file into the embedding layer to obtain an embedding vector; input the position vector and the embedding vector into the masked attention layer of the first decoding module of the N decoding modules; and output the second media file at the output layer after passing through the N decoding modules, wherein the classification information of the first media file is used for generating the first target speech of the second media file.
Optionally, the classifier is connected to the embedding layer, wherein the output of the classifier is part of the input of the embedding layer; or the classifier is connected to the output layer, wherein the output of the classifier is part of the input of the output layer; or the classifier is connected to the masked attention layer of the first decoding module, wherein the output of the classifier is part of the input of the masked attention layer of the first decoding module.
With reference to the third aspect, in a possible implementation manner of the third aspect, the processing unit is further configured to divide the first media file into a plurality of segments; wherein the size of each of the plurality of segments is less than a first data threshold, the first data threshold is on the order of 10^7 or less, and the size of the data output through the output layer for each segment is approximately equal to the size of the data input at the embedding layer.
In a fourth aspect, a training apparatus for cross-language speech conversion is provided, the apparatus comprising a processing unit and a storage unit, the processing unit being configured to obtain a sample set of historical media files stored in the storage unit, input at least one historical media file in the sample set into the cross-language speech conversion model, output a target media file based on the cross-language speech conversion model, determine a loss function according to the target media file and the real file corresponding to the historical media file, and adjust parameters of the cross-language speech conversion model and perform the next iteration of training according to the sample set of historical media files until the loss function converges; wherein the historical media file includes background features and source speech in a first language, and the target media file is synthesized from the background features of the historical media file and target speech in a second language; the target speech is generated based on the background features, the target audience of the historical media file, and the speech content of the source speech, and the voiceprint features of the target speech depend on the target audience of the historical media file and/or the voiceprint features of the source speech; the loss function is determined based on the voiceprint features of the target speech and the voiceprint features of the speech of the real file, and on the speech content of the target speech and the speech content of the real file.
Optionally, the loss function is a mean square error loss function.
Optionally, the loss function may also be determined based on a duration of the target voice of the target media file and a voice duration of the real file.
Optionally, the loss function may further include a prosodic feature loss determined by the prosodic features of the target speech of the first target media file and the prosodic features of the speech of the real file. For example, the prosodic features may be intonation, duration, pitch, or scale.
In a fifth aspect, a cross-language speech conversion apparatus is provided, the apparatus comprising a processor and a memory, the processor being coupled to the memory, the processor being configured to read and execute instructions in the memory to implement the method of any one of the possible implementations of the first aspect.
In a sixth aspect, there is provided a training apparatus for cross-language speech conversion, the apparatus comprising a processor and a memory, the processor being coupled to the memory, the processor being operable to read and execute instructions in the memory to implement the method of any one of the possible implementations of the second aspect described above.
In a seventh aspect, a computer program product is provided, comprising computer program code which, when executed, implements the method of any one of the possible implementations of the first aspect.
In an eighth aspect, a computer program product is provided, comprising computer program code which, when executed, implements the method of any one of the possible implementations of the second aspect described above.
In an embodiment of the application, a first media file is input into a cross-language speech conversion model, wherein the first media file comprises background features and a first source speech uttered by a first character in a first language, and a second media file is output based on the cross-language speech conversion model. Based on the cross-language speech conversion model, the target speech content and the target speech corresponding to the source speech are generated according to the background features of the media file, the target audience, and the speech content of the source speech, so that a user can more easily understand the content of the media file; furthermore, the target speech and the source speech can have similar or identical durations, so that the converted cross-language speech is more natural and fluent, which improves the quality of cross-language speech conversion and effectively improves the user experience.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings in which:
Fig. 1 is a schematic flow chart of a cross-language voice conversion method according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a network structure of a cross-language speech conversion model according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a network structure of another cross-language speech conversion model according to an embodiment of the present application.
Fig. 4 is a schematic diagram of a network structure of another cross-language speech conversion model according to an embodiment of the present application.
FIG. 5 is a schematic flow chart of a process of a training method of a cross-language speech conversion model provided by an embodiment of the present application.
Fig. 6 is a schematic block diagram of a cross-language voice conversion device according to an embodiment of the present application.
Fig. 7 is a schematic block diagram of another cross-language voice conversion device according to an embodiment of the present application.
Fig. 8 is a schematic block diagram of a training device for a cross-language speech conversion model according to an embodiment of the present application.
FIG. 9 is a schematic block diagram of another training apparatus for a cross-language speech conversion model according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is evident that the described embodiments are only some, rather than all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.
It should be understood that the terms "first" and "second" in the embodiments of the present application are only used for distinction and should not be construed as limiting the present application in any way. It should also be understood that, in the various embodiments of the present application, the sequence numbers of the processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
It should further be noted that "and/or", which describes an association relationship between associated objects, indicates that three relationships may exist. For example, "A and/or B" may indicate: A exists alone, both A and B exist, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The application can be applied to various fields such as education, communication and entertainment, and the media files in the embodiments of the application may be, for example, movies, television, music, news, audiobooks, games, advertisements, and the like.
The technical scheme provided by the application will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a cross-language voice conversion method according to an embodiment of the present application.
101, inputting a first media file into a cross-language speech conversion model, wherein the first media file comprises background features and a first source speech uttered by a first character in a first language, and the cross-language speech conversion model is obtained by training on a set of historical media files based on a neural network.
102, outputting a second media file based on the cross-language voice conversion model; wherein the second media file is synthesized from the background feature of the first media file and a first target speech in a second language, the first target speech being generated based on the background feature, a target audience of the first media file, and speech content of the first source speech; and the voiceprint characteristics of the first target voice are dependent on the target audience of the media file and/or the voiceprint characteristics of the first source voice.
Optionally, the relationship between the duration t2 of the first target speech and the duration t1 of the first source speech satisfies: |t1 - t2| ≤ a first duration preset value (where the first duration preset value is greater than or equal to 0, and may be set, for example, to one tenth, one hundredth or one thousandth of t1).
In the embodiment of the application, based on the cross-language speech conversion model, the target speech content and the target speech corresponding to the source speech are generated from the speech content of the source speech with respect to the background features of the media file and the target audience, so that the user can more easily understand the content of the media file. Further, the target speech and the source speech have similar or identical durations, so that the converted cross-language speech is more natural and fluent. The quality of cross-language speech conversion is thereby improved, which improves the user experience.
In some embodiments, the voiceprint similarity of the first source speech to the first target speech may be less than a similarity preset value. For example, the first target voiceprint is the same as the voiceprint of the first character.
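For illustration only (not part of the original disclosure), the following Python sketch shows one way such a voiceprint-similarity condition could be evaluated, assuming voiceprints are represented as fixed-length embedding vectors; the function names and the preset value are hypothetical.

```python
# Illustrative sketch only: comparing two voiceprints represented as embedding
# vectors. The similarity preset value below is a hypothetical placeholder.
import numpy as np

SIMILARITY_PRESET = 0.8  # hypothetical preset value

def voiceprint_similarity(vp_source: np.ndarray, vp_target: np.ndarray) -> float:
    """Cosine similarity between two voiceprint embeddings."""
    return float(np.dot(vp_source, vp_target) /
                 (np.linalg.norm(vp_source) * np.linalg.norm(vp_target) + 1e-8))

def satisfies_voiceprint_condition(vp_source: np.ndarray, vp_target: np.ndarray) -> bool:
    # Condition as stated above: the similarity is smaller than the preset value.
    return voiceprint_similarity(vp_source, vp_target) < SIMILARITY_PRESET
```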
In some embodiments, the voiceprint of the first target voice may depend on the target audience of the media file and the character features of the first character. For example, in a movie soundtrack or an audiobook, the voiceprint of the target voice may be selected according to the target audience: in an animated movie or a children's story whose audience group is children, it may conform to children's linguistic expressions/speaking habits and to the character's features (including, but not limited to, performance characteristics, mood, personality, identity, etc.).
In the application to video files, according to some embodiments of the application, the first target speech may be generated based on scene features of the video file, the cognitive level and cultural features of the target audience, and the speech content of the first source speech. Specifically, the speech content of the first target speech is generated by combining the speech content of the first source speech with the movie scene (such as environments like a campus, desert, seaside, home, forest, grassland, starry sky, mountain or vehicle, and/or emotional atmospheres such as joy, sadness, horror or anger), the cognitive level of the audience group (such as the cognitive level of children for an animated film) and the ethnic and cultural features (related idioms, slang and colloquialisms may be adopted), and the first target speech is then generated based on the selected voiceprint features.
It should be understood that the above video file example is described with the audience group being children and the media file being an animated movie; the application is not limited thereto, and also applies to video files of other genres, including but not limited to literary, suspense, science and technology, horror, and the like.
In the application to audio files, according to some embodiments of the application, the first target speech may be generated based on the background sound of the audio file, the cognitive level and cultural features of the target audience, and the speech content of the first source speech. Specifically, the cross-language speech model of the application generates the speech content of the first target speech from the speech content of the first source speech in combination with the background features of the audio file (such as the background sound, the story content expressed by the audio, and the emotional atmosphere), the cognitive level of the audience group, and the ethnic and cultural features, that is, lyric creation in different languages is realized, and the first target speech is then generated based on the selected voiceprint features.
Optionally, the conversion of speech content between different languages is achieved based on the cross-language speech model, and the speech content of the first source speech can be translated using amplification (addition), omission (subtraction) and/or free translation according to the scene features and the cognitive level and cultural features of the target audience, so that the relationship between the duration t2 of the first target speech and the duration t1 of the first source speech satisfies: |t1 - t2| ≤ the first duration preset value.
In this way, the content of the media file after cross-language speech conversion is more accessible and easy to understand, vivid and coherent, and matches the cognitive level of the target audience, thereby improving the user experience.
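For illustration only, a minimal Python sketch of how the duration condition |t1 - t2| ≤ preset could be checked and how a translation candidate satisfying it might be chosen; the helpers `candidate_translations` and `estimate_duration` are hypothetical placeholders, not functions defined by the patent.

```python
# Illustrative sketch: checking the duration condition and picking a candidate
# translation (e.g., amplified, condensed, or freely translated) that satisfies it.
def duration_condition_met(t1: float, t2: float, preset: float) -> bool:
    return abs(t1 - t2) <= preset

def pick_target_content(source_content: str, t1: float, preset: float,
                        candidate_translations, estimate_duration) -> str:
    """Return the first candidate translation whose estimated spoken duration
    stays within the preset tolerance of the source duration t1."""
    candidates = list(candidate_translations(source_content))  # assumed non-empty
    for candidate in candidates:
        if duration_condition_met(t1, estimate_duration(candidate), preset):
            return candidate
    # Fall back to the candidate whose duration is closest to t1.
    return min(candidates, key=lambda c: abs(t1 - estimate_duration(c)))

# Example preset, matching the values mentioned above: one tenth of t1.
# preset = 0.1 * t1
```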
In some embodiments, the first media file may contain multiple characters, where the source speech uttered by each character has the same or a similar duration as the corresponding target speech, and the relationship between the different target speeches may be determined according to the relationship between the source speeches uttered by the different characters. Specifically, the relationship between the first target speech and the second target speech (e.g., the pitch difference is also a consonant interval) is determined according to the relationship between the first source speech in the first language uttered by the first character and the second source speech in the first language uttered by the second character (e.g., the pitch difference is a consonant interval). For example, a chord in the original audio file remains a chord, rather than becoming noise, in the converted audio file, thereby improving the user experience.
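For illustration only, a short sketch of how a pitch relationship between two voices could be measured in semitones and checked for consonance; the set of consonant intervals below is a common music-theory convention used here as an assumption, not a definition from the patent.

```python
# Illustrative sketch: measuring the interval between two pitches and checking
# whether it is a consonant (harmonic) interval.
import math

CONSONANT_INTERVALS = {0, 3, 4, 5, 7, 8, 9}  # unison, 3rds, 4th, 5th, 6ths (mod 12)

def interval_semitones(pitch_a_hz: float, pitch_b_hz: float) -> int:
    """Absolute interval between two pitches, rounded to whole semitones."""
    return round(abs(12 * math.log2(pitch_a_hz / pitch_b_hz)))

def is_consonant(pitch_a_hz: float, pitch_b_hz: float) -> bool:
    return interval_semitones(pitch_a_hz, pitch_b_hz) % 12 in CONSONANT_INTERVALS
```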
In some embodiments of the present application, the cross-language speech conversion model may be obtained by training (e.g., one by one or in batches) on a set of historical media files based on a Generative Pre-trained Transformer (GPT) neural network. Optionally, the selected training material, i.e. the set of historical media files, includes at least one media file in a first language and at least one corresponding real file in a second language; the speech of the real file satisfies the conditions of the target speech described in the above embodiments (including speech content, voiceprint features, speech duration, audience group, etc.), and the loss function used in training includes a speech content loss, a voiceprint feature loss and a speech duration loss, and may additionally include a prosodic feature loss.
The GPT neural network comprises an embedding layer, N decoding modules and an output layer, wherein each of the N decoding modules comprises a masked attention layer, a first normalization layer, a feed-forward layer and a second normalization layer connected in sequence. N is an integer greater than 1; optionally, N may be greater than or equal to 5.
Alternatively, the N decoding modules may employ a residual structure. For example, the residual structure takes the input and output of the masked attention layer of each decoding module as the input of the first normalization layer, and takes the input and output of the feed-forward layer as the input of the second normalization layer. Alternatively, the normalization layer may be implemented using a normalized exponential function (softmax). Alternatively, the output layer may employ a tansig function and a linear function.
In some embodiments, in step 102, the first media file may be position-coded to obtain a position vector, and the first media file and the voiceprint features of the first target speech are input into the embedding layer to obtain an embedding vector; the position vector and the embedding vector are input into the masked attention layer of the first decoding module of the N decoding modules (i.e., the sum of the position vector and the embedding vector is projected to obtain Q, K and V, where Q denotes the query vector, K the key vector, and V the value vector), and after passing through the N decoding modules, the second media file is output at the output layer. Alternatively, the voiceprint features of the first target speech may be selected based on the target audience of the first media file, or the same voiceprint features as those of the source speech may be selected.
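For illustration only, a minimal PyTorch sketch of a decoding module and a stack of N such modules as described above (masked attention, first normalization, feed-forward, second normalization, with residual connections, plus an embedding layer, position coding and an output layer). The layer sizes, the token-based input representation and the number of modules are assumptions, not values given in the patent.

```python
# Illustrative sketch of the GPT-style decoder stack described above.
import torch
import torch.nn as nn

class DecodingModule(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.masked_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal (masked) self-attention: Q, K and V are projections of x.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=x.device), diagonal=1)
        attn_out, _ = self.masked_attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)    # residual: attention input + output
        x = self.norm2(x + self.ff(x))  # residual: feed-forward input + output
        return x

class CrossLanguageGPT(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_modules=6, max_len=4096):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)   # embedding layer
        self.position = nn.Embedding(max_len, d_model)       # position coding
        self.decoders = nn.ModuleList([DecodingModule(d_model) for _ in range(n_modules)])
        self.output = nn.Linear(d_model, vocab_size)         # output layer

    def forward(self, tokens):
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embedding(tokens) + self.position(pos)      # embedding + position vector
        for module in self.decoders:
            x = module(x)
        return self.output(x)
```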
In some embodiments, the cross-language speech conversion model may also include one or more classifiers to which the media files are input to obtain classification information for the media files.
For example, as exemplarily shown in fig. 2, the output of the classifier may be connected to the embedding layer. In step 102, the first media file may be position-coded to obtain a position vector, the first media file (together with, e.g., the classification information used for generating the first target speech of the second media file) is input into the embedding layer to obtain an embedding vector, and after passing through the N decoding modules, the second media file is output at the output layer, where N is an integer greater than 1 (e.g., N may be greater than or equal to 5).
As another example, as exemplarily shown in fig. 3, the classifier may be connected to the masked attention layer of the first decoding module, the output of the classifier forming part of the input of the masked attention layer of the first decoding module.
Of course, the classifier may also be connected to the output layer, with the output of the classifier being part of the input of the output layer, as exemplarily shown in fig. 4. In step 102, the first media file may be position-coded to obtain a position vector, the first media file is input into the embedding layer to obtain an embedding vector, and after passing through the N decoding modules, the second media file is output at the output layer based on the classification information output by the classifier.
It should be understood that the structural diagrams of figs. 2-4 are intended to explain the technical solution of the present application more clearly and intuitively and do not limit the present application; for example, the classifier may also be arranged between different decoding modules, and the output layer of the GPT neural network may be a fully connected layer or the like.
The classification information includes, but is not limited to, the media type of the media file, the audience group, the cognitive level of the target audience, cultural features, and/or character features of the characters in the media file. Alternatively, the audience group may be categorized based on age (e.g., children, adolescents, middle-aged and elderly), gender, and/or occupation; the cognitive level may be categorized based on educational level and/or social living environment; cultural features are categorized based on country, ethnicity, and/or language; media types may be categorized based on the format of the media file (e.g., video and audio); and/or the character features of the first character may be categorized based on appearance (e.g., height, weight, clothing, and facial features), behavior (e.g., habitual actions and habitual expressions), personality, and personal qualities. It should be understood that the above examples of classification information are merely exemplary and are not intended to limit the present application.
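For illustration only, a sketch of the variant of fig. 2 in which the classifier output becomes part of the embedding-layer input. The simple linear classifier head, the feature dimensions and the flat class vector are assumptions made for the sketch; the patent does not prescribe this particular classifier.

```python
# Illustrative sketch: a classifier whose output is concatenated into the
# embedding-layer input (the fig. 2 arrangement).
import torch
import torch.nn as nn

class MediaClassifier(nn.Module):
    def __init__(self, feat_dim=512, n_classes=16):
        super().__init__()
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, media_features):            # (batch, time, feat_dim)
        pooled = media_features.mean(dim=1)       # pool over time
        return self.head(pooled).softmax(dim=-1)  # classification information

class EmbeddingWithClassification(nn.Module):
    """Concatenates the classification information with each input frame before
    projecting to the model dimension, so the classifier output forms part of
    the embedding layer's input."""
    def __init__(self, feat_dim=512, n_classes=16, d_model=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim + n_classes, d_model)

    def forward(self, media_features, class_info):
        class_info = class_info.unsqueeze(1).expand(-1, media_features.size(1), -1)
        return self.proj(torch.cat([media_features, class_info], dim=-1))
```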
In some embodiments, the media file may also be preprocessed before step 101, e.g. the first media file may be divided into a plurality of segments, wherein the size of each of the plurality of segments may be set to be less than a first data threshold, the first data threshold being on the order of 10^7 or less (e.g., 10^6), and the size of the data output through the output layer for each segment is approximately equal to the size of the data input at the embedding layer.
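For illustration only, a minimal sketch of this optional segmentation step; the concrete threshold value and the byte-oriented representation of the media file are assumptions.

```python
# Illustrative sketch: splitting a media file into segments below a data threshold.
FIRST_DATA_THRESHOLD = 10**6  # hypothetical value, order of magnitude <= 10^7

def split_media(data: bytes, threshold: int = FIRST_DATA_THRESHOLD):
    """Yield consecutive segments of the media file, each smaller than the threshold."""
    for start in range(0, len(data), threshold):
        yield data[start:start + threshold]
```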
In the embodiment of the application, the GPT neural network does not require a feature extractor to perform feature extraction, so the conversion speed can be improved; and a classifier is used to obtain the classification information of the media file, so that cross-language speech conversion based on the classification information of the media file is realized, which effectively improves the conversion quality of cross-language speech and improves the user experience.
FIG. 5 is a schematic flow chart of a process of a training method of a cross-language speech conversion model provided by an embodiment of the present application.
At 501, at least one historical media file in a sample set of historical media files is input into the cross-language speech conversion model, wherein the historical media file includes background features and source speech uttered in a first language.
502, outputting a target media file based on the cross-language speech conversion model; wherein the target media file is synthesized from the background features of the historical media file and target speech in a second language; the target speech is generated based on the background features, the target audience of the historical media file, and the speech content of the source speech, and the voiceprint features of the target speech depend on the target audience of the historical media file and/or the voiceprint features of the source speech.
503, a loss function is determined based on the target media file and the real file corresponding to the historical media file.
504, parameters of the cross-language speech conversion model are adjusted and a next iterative training is performed based on the set of historical media file samples until the loss function converges.
Wherein the loss function (e.g., mean square error) is determined based on the voiceprint feature of the target voice and the voiceprint feature of the voice of the real file, and the voice content of the target voice and the voice content of the real file.
Optionally, the loss function may also be determined based on a duration of the target voice of the target media file and a voice duration of the real file.
Optionally, the loss function may further include a prosodic feature loss determined by the prosodic features of the target speech of the first target media file and the prosodic features of the speech of the real file. For example, the prosodic features may be intonation, duration, pitch, or scale.
That is, the loss function may include a loss of speech content, a loss of voiceprint features, and a loss of speech duration, and may further include a loss of prosodic features, where the loss function may be a weighted sum of the above losses, and the loss function is used to measure a difference between the generated media file and the real media file, and construct a cross-language speech model through continuous iterative learning.
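For illustration only, a sketch of such a weighted-sum loss with mean-squared-error terms for content, voiceprint, duration and (optionally) prosody; the weights and the dictionary-of-tensors interface are assumptions made for the sketch.

```python
# Illustrative sketch: weighted sum of the losses described above.
import torch
import torch.nn.functional as F

def conversion_loss(pred, real, w_content=1.0, w_voiceprint=1.0,
                    w_duration=0.5, w_prosody=0.5):
    """pred and real are dicts of tensors: 'content', 'voiceprint', 'duration',
    and optionally 'prosody'."""
    loss = (w_content * F.mse_loss(pred["content"], real["content"])
            + w_voiceprint * F.mse_loss(pred["voiceprint"], real["voiceprint"])
            + w_duration * F.mse_loss(pred["duration"], real["duration"]))
    if "prosody" in pred and "prosody" in real:   # optional prosodic feature loss
        loss = loss + w_prosody * F.mse_loss(pred["prosody"], real["prosody"])
    return loss
```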
Alternatively, the set of historical media file samples may be partitioned into a training subset, a verification subset, and a test subset, wherein the training subset may be used for training to build a cross-language speech model, the verification subset may be used for adjusting hyper-parameters of the neural network during training, and the test subset may be used for evaluating generalization of the neural network training model.
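For illustration only, a sketch of such a partition into training, verification and test subsets; the 80/10/10 split ratio is a common convention assumed here, not a ratio specified in the patent.

```python
# Illustrative sketch: partitioning the historical media file sample set.
import random

def split_samples(samples, train=0.8, val=0.1, seed=42):
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_train = int(len(samples) * train)
    n_val = int(len(samples) * val)
    return (samples[:n_train],                 # training subset
            samples[n_train:n_train + n_val],  # verification subset
            samples[n_train + n_val:])         # test subset
```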
The specific network structure of the cross-language speech model may refer to the embodiments of fig. 1-4 and is not repeated here.
The cross-language speech model obtained by training based on the neural network can be used for cross-language speech conversion. Further, the cross-language speech model trained on the GPT neural network does not require a feature extractor to perform feature extraction, so the conversion speed can be improved; and a classifier is used to obtain the classification information of the media file, so that cross-language speech conversion based on the classification information of the media file is realized, which effectively improves the conversion quality of cross-language speech and improves the user experience.
Fig. 6 is a schematic block diagram of a cross-language voice conversion device according to an embodiment of the present application. The apparatus 600 comprises a processing unit 601 and a storage unit 602.
The processing unit 601 is configured to input a first media file into the cross-language speech conversion model stored in the storage unit 602, and output a second media file based on the cross-language speech conversion model. The first media file comprises background features and a first source speech uttered by a first character in a first language, and the cross-language speech conversion model is obtained by training on historical media files based on a neural network; the second media file is synthesized from the background features of the first media file and a first target speech in a second language; the first target speech is generated based on the background features, a target audience of the first media file, and the speech content of the first source speech; the relationship between the duration t2 of the first target speech and the duration t1 of the first source speech satisfies: |t1 - t2| ≤ a first duration preset value, and the voiceprint features of the first target speech depend on the target audience of the media file and/or the voiceprint features of the first source speech.
In some embodiments, the voiceprint similarity of the first source speech and the first target speech is less than a similarity preset value; optionally, the target voiceprint is the same as the voiceprint of the first character; alternatively, the voiceprint of the first target voice is dependent upon a target audience of the media file and a character characteristic of the first character.
In some embodiments, the processing unit 601 may be specifically configured to: generate the first target speech based on scene features of a movie file included in the media file, the cognitive level and cultural features of the target audience, and the speech content of the first source speech; or generate the first target speech based on background sounds of an audio file included in the media file, the cognitive level and cultural features of the target audience, and the speech content of the first source speech.
Optionally, the processing unit 601 may be specifically configured to translate the speech content of the first source speech to obtain first target speech content using amplification (addition), omission (subtraction) and/or free translation according to the scene features and the cognitive level and cultural features of the target audience; and to generate the first target speech based on the first target speech content in the second language and the voiceprint features of the first target speech, so that the relationship between the duration t2 of the first target speech and the duration t1 of the first source speech satisfies: |t1 - t2| ≤ a first duration preset value.
In some embodiments, the first media file further includes a second source speech in the first language uttered by the second character, the pitch relationship between the first target speech and the second target speech depending on the pitch relationship between the first source speech and the second source speech. Optionally, in the case that the pitch difference between the first source speech and the second source speech is a consonant interval, the pitch difference between the first target speech and the second target speech is also a consonant interval.
In some embodiments, the cross-language speech conversion model stored by the storage unit 602 includes a Generative Pre-trained Transformer (GPT) neural network, the GPT neural network including an embedding layer, N decoding modules, and an output layer, wherein each of the N decoding modules includes a masked attention layer, a first normalization layer, a feed-forward layer, and a second normalization layer connected in sequence, and N is an integer greater than 1. The processing unit may be specifically configured to: perform position coding on the first media file to obtain a position vector, and input the first media file and the voiceprint features of the first target speech into the embedding layer to obtain an embedding vector; input the position vector and the embedding vector into the masked attention layer of the first decoding module of the N decoding modules; and output the second media file at the output layer after passing through the N decoding modules.
In some embodiments, the cross-language speech conversion model stored by the storage unit 602 includes a classifier and a Generative Pre-trained Transformer (GPT) neural network, the GPT neural network including an embedding layer, N decoding modules, and an output layer, wherein each of the N decoding modules includes a masked attention layer, a first normalization layer, a feed-forward layer, and a second normalization layer connected in sequence, and N is an integer greater than 1. The processing unit 601 may be specifically configured to: obtain classification information of the first media file using the classifier, wherein the classification information of the first media file includes the media type of the first media file, the audience group, the cognitive level of the target audience, cultural features and/or character features of the first character; perform position coding on the first media file to obtain a position vector, and input the first media file into the embedding layer to obtain an embedding vector; input the position vector and the embedding vector into the masked attention layer of the first decoding module of the N decoding modules; and output the second media file at the output layer after passing through the N decoding modules, wherein the classification information of the first media file is used for generating the first target speech of the second media file.
Optionally, the classifier is connected to the embedding layer, wherein the output of the classifier is part of the input of the embedding layer; or the classifier is connected to the output layer, wherein the output of the classifier is part of the input of the output layer; or the classifier is connected to the masked attention layer of the first decoding module, wherein the output of the classifier is part of the input of the masked attention layer of the first decoding module.
The processing unit 601 may be further configured to divide the first media file into a plurality of segments, wherein the size of each of the plurality of segments is less than a first data threshold, the first data threshold is on the order of 10^7 or less, and the size of the data output through the output layer for each segment is approximately equal to the size of the data input at the embedding layer.
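A minimal sketch of this segmentation step is shown below, assuming the media file is handled as raw bytes and taking the first data threshold to be 10^7 bytes; the patent only bounds the threshold's order of magnitude, so both assumptions are for illustration.

```python
FIRST_DATA_THRESHOLD = 10 ** 7  # assumed unit: bytes; order of magnitude per the text

def split_media_file(data: bytes, threshold: int = FIRST_DATA_THRESHOLD) -> list[bytes]:
    """Split raw media data into consecutive segments, each smaller than `threshold`."""
    segment_size = threshold - 1          # keep every segment strictly below the threshold
    return [data[i:i + segment_size] for i in range(0, len(data), segment_size)]

# Each segment is converted independently; the data output at the output layer
# for a segment is expected to be approximately the size of the segment itself.
```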
Alternatively, the cross-language voice conversion device may be a server, and user equipment may communicate with the server and send requests to it to implement cross-language voice conversion. The cross-language voice conversion device may also be embedded in the user equipment. The user equipment may be a mobile terminal such as a mobile phone or a computer with a mobile terminal, for example a portable, pocket-sized, hand-held, computer-built-in, or vehicle-mounted mobile device, including but not limited to a cell phone or smart phone, a personal computer, or a tablet such as a PAD or iPad.
The apparatus 600 shown in fig. 6 may be used to perform the methods and steps referred to in fig. 1-4, and the specific processes of each unit performing the corresponding steps described above are described in detail in the above method embodiments, which are not repeated herein for brevity.
The device can perform cross-language voice conversion based on the cross-language voice conversion model, generating the target voice content and the target voice corresponding to the source voice according to the background features of the media file, the target audience, and the voice content of the source voice. Because the target voice and the source voice have a similar or identical duration, the user can understand the content of the media file more easily, the converted speech is more natural and fluent, the quality of cross-language voice conversion is improved, and the user experience is effectively enhanced.
Fig. 7 is a schematic block diagram of another cross-language voice conversion device according to an embodiment of the present application. As shown in fig. 7, an apparatus 700 includes one or more processors 701 and one or more memories 702, where the processor 701 is coupled to the memory 702 to read and execute the instructions (or computer programs) stored therein, so that the apparatus 700 can perform the corresponding processes and/or operations in the method embodiments of the present application.
The apparatus 700 shown in fig. 7 may be used to perform the methods and steps referred to in fig. 1-4, which are not described in detail here for brevity.
Fig. 8 is a schematic diagram of a training device for a cross-language speech conversion model according to an embodiment of the present application. The apparatus 800 comprises a processing unit 801 and a storage unit 802.
The processing unit 801 is configured to input at least one historical media file from the historical media file sample set stored in the storage unit 802 into the cross-language voice conversion model and output a target media file based on the cross-language voice conversion model; determine a loss function according to the target media file and the real file corresponding to the historical media file; and adjust the parameters of the cross-language speech conversion model and perform the next training iteration based on the historical media file sample set until the loss function converges.
The historical media file includes background features and source speech uttered in a first language; the target media file is synthesized from the background features of the historical media file and target speech in a second language; and the target speech is generated based on the background features, a target audience of the historical media file, and the speech content of the source speech.
The loss function (for example, a mean square error) is determined from the duration of the target voice in the target media file and the duration of the voice in the real file, the voiceprint features of the target voice and of the voice in the real file, and the voice content of the target voice and of the real file.
Optionally, the loss function may further include a prosodic feature loss, which is determined from the prosodic features of the target speech in the target media file and the prosodic features of the speech in the real file. The prosodic features may include, for example, intonation, duration, pitch, or scale.
That is, the loss function may include a speech content loss, a voiceprint feature loss, and a speech duration loss, and may further include a prosodic feature loss. The loss function may be a weighted sum of these losses and is used to measure the difference between the generated media file and the real media file. Through continuous iterative learning, the processing unit 801 builds the cross-language speech model based on the historical media file sample set stored in the storage unit 802.
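A sketch of such a weighted loss is shown below. The particular loss terms (cross-entropy for content, mean squared error for voiceprint, duration, and prosody) and the weights are assumptions chosen for illustration; the text only requires a weighted sum of these components.

```python
import torch.nn.functional as F

def total_loss(pred: dict, real: dict,
               w_content: float = 1.0, w_voiceprint: float = 1.0,
               w_duration: float = 0.5, w_prosody: float = 0.5):
    """Weighted sum of content, voiceprint, duration, and prosody losses.
    `pred` holds model outputs, `real` the corresponding ground-truth tensors
    from the real file (shapes assumed compatible with each loss)."""
    content_loss    = F.cross_entropy(pred["content_logits"], real["content_ids"])
    voiceprint_loss = F.mse_loss(pred["voiceprint"], real["voiceprint"])
    duration_loss   = F.mse_loss(pred["duration"], real["duration"])
    prosody_loss    = F.mse_loss(pred["prosody"], real["prosody"])
    return (w_content * content_loss + w_voiceprint * voiceprint_loss
            + w_duration * duration_loss + w_prosody * prosody_loss)
```

Training then repeats the usual cycle: forward pass over historical media files, compute the weighted loss, backpropagate, update the parameters, and continue until the loss converges.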
Alternatively, the historical media file sample set may be partitioned into a training subset, a verification subset, and a test subset, where the training subset may be used to train and build the cross-language speech model, the verification subset may be used to adjust the hyper-parameters of the neural network during training, and the test subset may be used to evaluate the generalization of the trained model. For the specific network structure of the cross-language speech model, reference may be made to the embodiments of fig. 1-5, which are not repeated here.
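A simple way to realize this partition is sketched below, assuming an 8:1:1 split ratio; the ratio and the shuffling seed are illustrative assumptions, since the patent does not specify them.

```python
import random

def split_samples(samples, train_frac: float = 0.8, val_frac: float = 0.1, seed: int = 0):
    """Shuffle and partition the historical media file sample set into
    training, verification, and test subsets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_train = int(len(samples) * train_frac)
    n_val = int(len(samples) * val_frac)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])
```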
The cross-language speech model obtained by this neural-network-based training device for the cross-language voice conversion model can be used for cross-language voice conversion. Furthermore, because the model is trained on a GPT neural network without a separate feature extractor for feature extraction, training can be faster; and because a classifier is used to acquire the classification information of the media files so that the conversion is performed based on that information, the quality of cross-language voice conversion is effectively improved, thereby enhancing the user experience.
Fig. 9 is a schematic block diagram of another training apparatus for a cross-language speech conversion model provided by an embodiment of the present application. As shown in fig. 9, the apparatus 900 includes one or more processors 901 and one or more memories 902, where the processor 901 is coupled to the memory 902 to read and execute the instructions (or computer programs) stored therein, so that the apparatus 900 can perform the corresponding processes and/or operations in the method embodiments of the present application.
The apparatus 900 shown in fig. 9 may be used to perform the methods and steps referred to in fig. 5, which are not described here again for brevity.
It should be noted that the processor in the embodiments of the present application may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above method embodiments may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a DSP (digital signal processor), an ASIC (application-specific integrated circuit), an FPGA (field-programmable gate array) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logical blocks disclosed in the embodiments of the present application may be implemented or performed by such a processor. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It will be appreciated that the memory in the embodiments of the application may be volatile memory or nonvolatile memory, or may include both. The nonvolatile memory may be a ROM (read-only memory), a PROM (programmable ROM), an EPROM (erasable programmable ROM), an EEPROM (electrically erasable programmable ROM), or a flash memory, among others. The volatile memory may be a RAM (random access memory), which acts as an external cache. It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The present application also provides a computer-readable medium storing program code which, when executed, implements the methods performed by the cross-language voice conversion apparatus and the training apparatus for the cross-language voice conversion model in the above embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application. It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division into units is merely a logical functional division, and there may be other ways of dividing them in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices, or units, and may be electrical, mechanical, or in other forms. The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive (U-disk), a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A method for cross-language speech conversion, comprising:
inputting a first media file into a cross-language voice conversion model, wherein the first media file comprises background features and a first source voice uttered by a first character in a first language, and the cross-language voice conversion model is obtained by training on a historical media file set based on a neural network; and
outputting a second media file based on the cross-language voice conversion model; wherein the second media file is synthesized from the background features of the first media file and a first target speech in a second language; the first target speech is generated based on the background feature, a target audience for the first media file, and speech content of the first source speech; and the voiceprint characteristics of the first target voice are dependent on the target audience of the media file and/or the voiceprint characteristics of the first source voice.
2. The method of claim 1, wherein,
the voiceprint similarity of the first source voice and the first target voice is smaller than a similarity preset value; or
the voiceprint of the first target voice is the same as the voiceprint of the first character; or
The voiceprint of the first target voice is dependent upon a target audience of the media file and a character characteristic of the first character.
3. The method of claim 2, wherein the first target speech is generated based on the background feature, a target audience for the first media file, and speech content of the first source speech, comprising:
the first target voice is generated based on scene features of a movie file included in the media file, the cognitive level and cultural features of a target audience, and the voice content of the first source voice; or
The first target speech is generated based on background sounds of an audio file included in the media file, cognitive levels and cultural characteristics of a target audience, and speech content of the first source speech.
4. The method of claim 3, wherein the first target speech is generated based on scene features of a video file included in the media file, cognitive level and cultural features of a target audience, and speech content of the first source speech, comprising:
the first target voice is generated based on first target voice content of a second language and voiceprint features of the first target voice, wherein the first target voice content is obtained by translating voice content of the first source voice by adopting an addition method, an subtraction method and/or a translation method according to the scene features, the cognitive level and the cultural features of the target audience, so that the relation between the duration t2 of the first target voice and the duration t1 of the first source voice is as follows: the I t1-t 2I is less than or equal to a first time length preset value;
alternatively, the first media file further includes a second source speech in the first language uttered by a second character; the duration t3 of the second source voice and the duration t4 of the second target voice corresponding to the second source voice satisfy: |t3-t4| is less than or equal to a second duration preset value; and the relation of the first target voice and the second target voice in pitch depends on the relation of the first source voice and the second source voice in pitch, or, in the case that the difference in pitch between the first source voice and the second source voice is a harmony interval, the difference in pitch between the first target voice and the second target voice is a harmony interval.
5. The method of any of claims 1-4, wherein the cross-language speech conversion model comprises a classifier and a generative pre-trained transformer (GPT) neural network, the GPT neural network comprising an embedding layer, N decoding modules, and an output layer, wherein each decoding module of the N decoding modules comprises a masked attention layer, a first normalization layer, a feed-forward layer, and a second normalization layer connected in sequence, and N is an integer greater than 1;
the outputting the second media file based on the cross-language voice conversion model comprises:
inputting the first media file into the classifier to obtain classification information of the first media file, wherein the classification information of the first media file comprises a media type of the first media file, an audience group, a cognitive level and cultural features of the target audience, and/or character features of the first character;
performing positional encoding on the first media file to obtain a position vector, and inputting the first media file into the embedding layer to obtain an embedding vector;
inputting the position vector and the embedding vector into the masked attention layer of a first decoding module of the N decoding modules; and
outputting the second media file at the output layer after passing through the N decoding modules, wherein the classification information of the first media file is used for generating the first target voice of the second media file;
wherein the classifier is connected with the embedding layer, and the output of the classifier serves as a part of the input of the embedding layer; or the classifier is connected with the output layer, and the output of the classifier serves as a part of the input of the output layer; or the classifier is connected to the masked attention layer of the first decoding module, and the output of the classifier serves as a part of the input of the masked attention layer of the first decoding module.
6. The method of claim 5, wherein,
each decoding module adopts a residual structure, wherein the residual structure comprises taking the input and the output of the masked attention layer of each decoding module as the input of the first normalization layer, and taking the input and the output of the feed-forward layer as the input of the second normalization layer; and/or the first media file is split into a plurality of segments prior to being input into the cross-language speech conversion model, wherein the size of each of the plurality of segments is less than a first data threshold, the first data threshold is on the order of 10^7 or less, and the size of the data output through the output layer for each segment is approximately equal to the size of the data input at the embedding layer.
7. A method for training a cross-language speech conversion model, comprising:
inputting at least one historical media file in a sample set of historical media files into the cross-language speech conversion model, wherein the historical media file includes background features and source speech uttered in a first language;
outputting a target media file based on the cross-language voice conversion model, wherein the target media file is synthesized from the background features of the historical media file and a target voice in a second language, the target voice is generated based on the background features, a target audience of the historical media file, and the voice content of the source voice, and the voiceprint features of the target voice depend on the target audience of the historical media file and/or the voiceprint features of the source voice;
determining a loss function according to the target media file and the real file corresponding to the historical media file, wherein the loss function is determined according to the voiceprint features of the target voice and the voiceprint features of the voice of the real file, and the voice content of the target voice and the voice content of the real file; and
adjusting parameters of the cross-language voice conversion model and performing a next training iteration based on the historical media file sample set until the loss function converges.
8. A cross-language voice conversion device is characterized by comprising a processing unit and a storage unit,
the processing unit is configured to input a first media file into a cross-language voice conversion model stored in the storage unit and to output a second media file based on the cross-language voice conversion model;
wherein the first media file comprises background features and a first source voice uttered by a first character in a first language, and the cross-language voice conversion model is obtained by training on a historical media file set based on a neural network; and
wherein the second media file is synthesized from the background features of the first media file and a first target speech in a second language; the first target speech is generated based on the background feature, a target audience for the first media file, and speech content of the first source speech; and the voiceprint characteristics of the first target voice are dependent on the target audience of the media file and/or the voiceprint characteristics of the first source voice.
9. The apparatus of claim 8, wherein,
the voiceprint similarity of the first source voice and the first target voice is smaller than a similarity preset value; or alternatively
The target voiceprint is the same as the voiceprint of the first character; or alternatively, the process may be performed,
the voiceprint of the first target voice is dependent upon a target audience of the media file and a character characteristic of the first character.
10. The apparatus of claim 9, wherein the processing unit is specifically configured to: generate the first target voice based on scene features of a movie file included in the media file, the cognitive level and cultural features of a target audience, and the voice content of the first source voice; or generate the first target speech based on background sounds of an audio file included in the media file, the cognitive level and cultural features of a target audience, and the speech content of the first source speech;
alternatively, the processing unit is specifically configured to: translate the voice content of the first source voice into first target voice content using an addition method, a subtraction method and/or a translation method according to the scene features, the cognitive level, and the cultural features of the target audience; and generate the first target voice based on the first target voice content in the second language and the voiceprint features of the first target voice, so that the duration t2 of the first target voice and the duration t1 of the first source voice satisfy: |t1-t2| is less than or equal to a first duration preset value.
11. The apparatus of claim 10, wherein,
the first media file further includes a second source voice in the first language uttered by a second character, and the duration t3 of the second source voice and the duration t4 of the second target voice corresponding to the second source voice satisfy: |t3-t4| is less than or equal to a second duration preset value; and
the relation of the first target voice and the second target voice in pitch depends on the relation of the first source voice and the second source voice in pitch, or in the case that the difference between the pitches of the first source voice and the second source voice is a harmony interval, the difference between the pitches of the first target voice and the second target voice is a harmony interval.
12. The apparatus of any of claims 8-11, wherein the cross-language speech conversion model stored by the storage unit comprises a classifier and a generative pre-trained transformer (GPT) neural network, the GPT neural network comprising an embedding layer, N decoding modules, and an output layer, wherein each decoding module of the N decoding modules comprises a masked attention layer, a first normalization layer, a feed-forward layer, and a second normalization layer connected in sequence, and N is an integer greater than 1;
the processing unit is specifically configured to: input the first media file into the classifier to obtain classification information of the first media file, wherein the classification information of the first media file comprises a media type of the first media file, an audience group, a cognitive level and cultural features of the target audience, and/or character features of the first character; perform positional encoding on the first media file to obtain a position vector, and input the first media file into the embedding layer to obtain an embedding vector; input the position vector and the embedding vector into the masked attention layer of a first decoding module of the N decoding modules; and output the second media file at the output layer after passing through the N decoding modules, wherein the classification information of the first media file is used for generating the first target voice of the second media file;
wherein the classifier is connected with the embedding layer, and the output of the classifier serves as a part of the input of the embedding layer; or the classifier is connected with the output layer, and the output of the classifier serves as a part of the input of the output layer; or the classifier is connected to the masked attention layer of the first decoding module, and the output of the classifier serves as a part of the input of the masked attention layer of the first decoding module.
13. The apparatus of claim 12, wherein the processing unit is further configured to: divide the first media file into a plurality of segments, wherein the size of each of the plurality of segments is less than a first data threshold, the first data threshold is on the order of 10^7 or less, and the size of the data output through the output layer for each segment is approximately equal to the size of the data input at the embedding layer.
14. A training apparatus for a cross-language speech conversion model, comprising: a processor and a memory, the processor being coupled to the memory, the processor being configured to read and execute instructions in the memory to implement the method of claim 7.
15. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed, implements the method according to any of claims 1-7.
CN202310676661.5A 2023-06-08 2023-06-08 Cross-language voice conversion method, training method and related device Active CN116741149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310676661.5A CN116741149B (en) 2023-06-08 2023-06-08 Cross-language voice conversion method, training method and related device


Publications (2)

Publication Number Publication Date
CN116741149A true CN116741149A (en) 2023-09-12
CN116741149B CN116741149B (en) 2024-05-14

Family

ID=87914485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310676661.5A Active CN116741149B (en) 2023-06-08 2023-06-08 Cross-language voice conversion method, training method and related device

Country Status (1)

Country Link
CN (1) CN116741149B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200380949A1 (en) * 2018-07-25 2020-12-03 Tencent Technology (Shenzhen) Company Limited Voice synthesis method, model training method, device and computer device
US20200380952A1 (en) * 2019-05-31 2020-12-03 Google Llc Multilingual speech synthesis and cross-language voice cloning
CN111667811A (en) * 2020-06-15 2020-09-15 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and medium
KR20220130863A (en) * 2021-03-19 2022-09-27 주식회사 웨인힐스브라이언트에이아이 Apparatus for Providing Multimedia Conversion Content Creation Service Based on Voice-Text Conversion Video Resource Matching
CN113539239A (en) * 2021-07-12 2021-10-22 网易(杭州)网络有限公司 Voice conversion method, device, storage medium and electronic equipment
US20230013777A1 (en) * 2021-07-16 2023-01-19 Google Llc Robust Direct Speech-to-Speech Translation
CN113870835A (en) * 2021-09-27 2021-12-31 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium based on artificial intelligence
WO2023082916A1 (en) * 2021-11-10 2023-05-19 北京有竹居网络技术有限公司 Training method, speech translation method, device and computer-readable medium
CN114783409A (en) * 2022-03-29 2022-07-22 北京百度网讯科技有限公司 Training method of speech synthesis model, speech synthesis method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117423327A (en) * 2023-10-12 2024-01-19 北京家瑞科技有限公司 Voice synthesis method and device based on GPT neural network
CN117423327B (en) * 2023-10-12 2024-03-19 北京家瑞科技有限公司 Voice synthesis method and device based on GPT neural network

Also Published As

Publication number Publication date
CN116741149B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
EP3803846B1 (en) Autonomous generation of melody
CN111741326B (en) Video synthesis method, device, equipment and storage medium
Ahn et al. Cross-corpus speech emotion recognition based on few-shot learning and domain adaptation
CN112767910B (en) Audio information synthesis method, device, computer readable medium and electronic equipment
CN104115221A (en) Audio human interactive proof based on text-to-speech and semantics
CN116741149B (en) Cross-language voice conversion method, training method and related device
CN117349675B (en) Multi-mode large model construction system for multiple information sources
CN112837669B (en) Speech synthesis method, device and server
CN112687258B (en) Speech synthesis method, apparatus and computer storage medium
CN112185363B (en) Audio processing method and device
CN113327580A (en) Speech synthesis method, device, readable medium and electronic equipment
CN111382257A (en) Method and system for generating dialog context
EP4235485A1 (en) Method for converting text data into acoustic feature, electronic device, and storage medium
CN114329041A (en) Multimedia data processing method and device and readable storage medium
CN113205793A (en) Audio generation method and device, storage medium and electronic equipment
CN115171731A (en) Emotion category determination method, device and equipment and readable storage medium
CN113240115A (en) Training method for generating face change image model and related device
CN114038484B (en) Voice data processing method, device, computer equipment and storage medium
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN112185340A (en) Speech synthesis method, speech synthesis device, storage medium and electronic apparatus
WO2023142590A1 (en) Sign language video generation method and apparatus, computer device, and storage medium
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN116821781A (en) Classification model training method, text analysis method and related equipment
CN116708951B (en) Video generation method and device based on neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant