WO2021134592A1 - Voice processing method, apparatus, device, and storage medium - Google Patents

Voice processing method, apparatus, device, and storage medium

Info

Publication number
WO2021134592A1
WO2021134592A1 PCT/CN2019/130767 CN2019130767W WO2021134592A1 WO 2021134592 A1 WO2021134592 A1 WO 2021134592A1 CN 2019130767 W CN2019130767 W CN 2019130767W WO 2021134592 A1 WO2021134592 A1 WO 2021134592A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
tone
voice data
voice
text information
Prior art date
Application number
PCT/CN2019/130767
Other languages
English (en)
French (fr)
Inventor
杨林举
Original Assignee
深圳市欢太科技有限公司
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市欢太科技有限公司 and Oppo广东移动通信有限公司
Priority to PCT/CN2019/130767 (WO2021134592A1)
Priority to CN201980101004.XA (CN114467141A)
Publication of WO2021134592A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • The embodiments of the present application relate to the field of voice processing technology, and in particular to a voice processing method, apparatus, device, and storage medium.
  • the existing simultaneous interpretation system can convert the voice information spoken by the user into the translated text information corresponding to another language, and use the speech synthesis model to synthesize the translated text information into the corresponding voice for display.
  • the simultaneous interpretation system can be used not only in international conferences, product launches and other conferences, but also in people's daily lives. For example, in work, you can use the simultaneous interpretation system for technology sharing or video conferencing; in life, you can use the simultaneous interpretation system to meet related needs in social or travel scenarios.
  • However, the interpretation mode of the existing system is fixed and limited, and its accuracy is low.
  • the embodiments of the present application expect to provide a voice processing method, device, device, and storage medium.
  • In a first aspect, an embodiment of the present application provides a voice processing method, which includes:
  • obtaining first voice data to be processed; recognizing tone information and voice text information corresponding to the first voice data;
  • translating the voice text information to obtain translated text information; wherein the language corresponding to the translated text information is different from the language corresponding to the first voice data;
  • determining a matching relationship between the translated text information and the tone information based on a matching relationship between the voice text information and the tone information;
  • generating second voice data according to the matching relationship between the translated text information and the tone information; wherein the language corresponding to the second voice data is different from the language corresponding to the first voice data, and the second voice data is used for presentation on the client when the first voice data is played.
  • an embodiment of the present application provides a voice processing device, which includes an acquisition unit, a recognition unit, a translation unit, a matching unit, and a generation unit, wherein:
  • the obtaining unit is configured to obtain the first voice data to be processed
  • a recognition unit configured to recognize tone information and voice text information corresponding to the first voice data
  • a translation unit configured to translate the voice text information to obtain translated text information; wherein the language type corresponding to the translated text information is different from the language type corresponding to the first voice data;
  • a matching unit configured to determine the matching relationship between the translated text information and the tone information based on the matching relationship between the voice text information and the tone information;
  • the generating unit is configured to generate second voice data according to the matching relationship between the translated text information and the tone information; wherein the language corresponding to the second voice data is different from the language corresponding to the first voice data, and the second voice data is used for presentation on the client when the first voice data is played.
  • an embodiment of the present application provides a device, which includes a memory and a processor, where:
  • a memory for storing a computer program that can run on the processor
  • the processor is configured to execute the method described in the first aspect when the computer program is running.
  • an embodiment of the present application provides a computer storage medium that stores a voice processing program that implements the method described in the first aspect when the voice processing program is executed by at least one processor.
  • The embodiments of the present application provide a voice processing method, apparatus, device, and storage medium. First voice data to be processed is obtained; tone information and voice text information corresponding to the first voice data are recognized; the voice text information is translated to obtain translated text information, wherein the language corresponding to the translated text information is different from the language corresponding to the first voice data; the matching relationship between the translated text information and the tone information is determined based on the matching relationship between the voice text information and the tone information; and second voice data is generated according to the matching relationship between the translated text information and the tone information, wherein the language corresponding to the second voice data is different from the language corresponding to the first voice data, and the second voice data is presented on the client when the first voice data is played. In this way, for the first voice data to be processed, not only the text information but also the tone information is recognized.
  • With the technical solution of the embodiments of the present application, the user can conveniently obtain the speaker's tone changes in real time, which effectively avoids confusion or errors in the user's judgment of the speaker's tone changes in the simultaneous interpretation system and improves the accuracy of the simultaneous interpretation system. A minimal end-to-end sketch of this pipeline is given below.
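  • To make the above flow concrete, the following is a minimal, non-authoritative Python sketch of the five steps (obtain, recognize, translate, match, synthesize). The data shapes and the injected recognizer, translator, and synthesizer callables are illustrative assumptions and are not defined by this disclosure.

```python
# A minimal, hypothetical sketch of the claimed flow. The recognizers,
# translator and synthesizer are passed in as callables because the patent
# does not fix any particular implementation; the dummy lambdas at the bottom
# only demonstrate the data flow.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class ToneInfo:            # "first tone information": a tone type at a tone time
    tone_type: str         # e.g. "gentle", "brisk", "high-pitched", "low"
    start: float           # tone time, in seconds into the first voice data
    end: float

@dataclass
class Word:                # "first text" / "first translated text" with timing
    text: str
    start: float
    end: float

def match_tone(word: Word, tones: List[ToneInfo]) -> ToneInfo:
    """Pick the tone whose time span covers the word's start time."""
    for tone in tones:
        if tone.start <= word.start < tone.end:
            return tone
    return ToneInfo("gentle", word.start, word.end)   # default tone

def process(first_voice_data: bytes,
            recognize_tone: Callable[[bytes], List[ToneInfo]],
            recognize_text: Callable[[bytes], List[Word]],
            translate: Callable[[List[Word]], List[Word]],
            synthesize: Callable[[List[Tuple[Word, ToneInfo]]], bytes]) -> bytes:
    tones = recognize_tone(first_voice_data)        # tone recognition unit
    words = recognize_text(first_voice_data)        # voice recognition unit
    translated = translate(words)                   # translation unit (other language)
    # carry the word <-> tone matching over to the translated words
    pairs = [(tw, match_tone(w, tones)) for w, tw in zip(words, translated)]
    return synthesize(pairs)                        # target synthesis model

# Dummy demonstration of the data flow only:
second_voice_data = process(
    b"raw pcm",
    recognize_tone=lambda pcm: [ToneInfo("brisk", 0.0, 1.0)],
    recognize_text=lambda pcm: [Word("你好", 0.0, 0.5)],
    translate=lambda ws: [Word("hello", w.start, w.end) for w in ws],
    synthesize=lambda pairs: b"synthesized audio",
)
print(second_voice_data)
```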
  • Figure 1 is a schematic diagram of the system architecture of a simultaneous interpretation system application provided by related technical solutions
  • Figure 2 is a schematic diagram of the work flow of a simultaneous interpretation system provided by related technical solutions
  • FIG. 3 is a schematic flowchart of a voice processing method provided by an embodiment of this application.
  • FIG. 4 is a schematic flowchart of another voice processing method provided by an embodiment of this application.
  • FIG. 5 is a schematic flowchart of another voice processing method provided by an embodiment of this application.
  • FIG. 6 is a schematic flowchart of yet another voice processing method provided by an embodiment of this application.
  • FIG. 7 is a schematic flowchart of yet another voice processing method provided by an embodiment of this application.
  • FIG. 8 is a detailed flowchart of a voice processing method provided by an embodiment of this application.
  • FIG. 9 is a detailed flowchart of another voice processing method provided by an embodiment of this application.
  • FIG. 10 is a schematic diagram of the composition structure of a voice processing device provided by an embodiment of this application.
  • FIG. 11 is a schematic diagram of the composition structure of another voice processing device provided by an embodiment of this application.
  • FIG. 12 is a schematic diagram of a specific hardware structure of a device provided by an embodiment of the application.
  • The terms "first" and "second" are used only to distinguish similar objects and do not denote a specific order of those objects. It should be understood that, where permitted, "first" and "second" may be interchanged in a specific order or sequence, so that the embodiments of the present application described herein can be implemented in an order other than the one illustrated or described herein.
  • FIG. 1 shows a schematic diagram of the system architecture of the simultaneous interpretation system application in the related technology.
  • the system may include: a machine simultaneous interpretation server 110, a voice processing server 120, a viewer mobile terminal 130, a personal computer (PC, Personal Computer) client 140, and a display screen 150.
  • the lecturer can perform conference lectures through the PC client 140.
  • the PC client 140 collects the lecturer’s voice data and sends the collected voice data to the machine simultaneous interpretation server 110.
  • The machine simultaneous interpretation server 110 recognizes the voice data through the voice processing server 120 and obtains a recognition result (the recognition result may be the voice text information in the same language as the voice data, or the translated text information in another language obtained after translating the voice text information). The machine simultaneous interpretation server 110 may send the recognition result to the PC client 140, which projects the recognition result onto the display screen 150; it may also send the recognition result to the viewer mobile terminal 130 (specifically, sending the recognition result in the language required by the user), which displays the recognition result for the user, so that the speaker's speech content is translated into the language required by the user and displayed.
  • During use, the system can convert the voice information spoken by the user into text corresponding to another language, and the text is displayed on the display device.
  • the displayed text represents the user's voice content.
  • the simultaneous interpretation system can translate the content of the speaker's speech and output the translation result on the display device or display screen at the same time.
  • FIG. 2 shows a schematic diagram of a work flow of a simultaneous interpretation system provided by related technical solutions.
  • the workflow can include:
  • S202 The simultaneous interpretation system performs voice recognition on the content of the user's speech to obtain voice text information
  • S203 The simultaneous interpretation system performs machine translation processing on the voice text information to obtain the translated text information
  • S204 The simultaneous interpretation system outputs the translated text information to the display device for display;
  • However, the simultaneous interpretation system only displays the user's speech content or the corresponding translated content on the display device; the tone information of the user's speech is not displayed. For example, when a user listens to a speech using the simultaneous interpretation system, by the time the system outputs the speaker's speech content or the corresponding translated content, the speaker has already finished speaking the corresponding words. As a result, the output the user sees from the simultaneous interpretation system is not synchronized with the tone the user hears in the speaker's speech, so the user's judgment of changes in the speaker's speaking tone is prone to confusion or error. In particular, when the user cannot hear the speaker clearly or cannot hear the speaker's pronunciation at all, the user's judgment of changes in the speaker's speaking tone is even more likely to be confused or wrong, which makes the user experience worse.
  • In view of this, an embodiment of the application provides a voice processing method: after first voice data to be processed is acquired, tone information and voice text information corresponding to the first voice data are recognized; the voice text information is translated to obtain translated text information, wherein the language corresponding to the translated text information is different from the language corresponding to the first voice data; the matching relationship between the translated text information and the tone information is determined based on the matching relationship between the voice text information and the tone information; and second voice data is generated according to the matching relationship between the translated text information and the tone information, wherein the language corresponding to the second voice data is different from the language corresponding to the first voice data, and the second voice data is presented on the client when the first voice data is played. In this way, for the first voice data to be processed, not only the text information but also the tone information is recognized; therefore, in the process of obtaining the second voice data, not only the translated text information but also the tone information is taken into account, so that the second voice data can also reflect the change of tone during playback.
  • FIG. 3 shows a schematic flowchart of a voice processing method provided by an embodiment of the present application.
  • the method may include:
  • the voice processing method of the embodiment of the present application can be applied to a simultaneous interpretation system, and the execution subject of the method is a voice processing device.
  • the voice processing device may be located on the server side or the terminal device side.
  • the server may be a server with voice processing functions, such as a file server, a database server, etc.;
  • the terminal device may be a terminal with voice processing functions; for example, the terminal device may include smartphones, tablet computers, notebook computers, handheld computers, personal digital assistants (PDA), portable media players (PMP), navigation devices, digital TVs, desktop computers, and the like, which is not specifically limited in the embodiments of this application.
  • the first voice data to be processed can be obtained by voice collection by the voice collection unit.
  • the obtaining the first voice data to be processed may include: performing voice collection by a voice collection unit to obtain the first voice data to be processed.
  • the voice processing device may include a voice collection unit.
  • the voice collection unit may be a microphone or a microphone in a terminal device. In this way, after the first voice data to be processed is collected by the voice collection unit, subsequent voice recognition processing or tone recognition processing can be performed on the first voice data.
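  • The disclosure only states that a voice collection unit (for example, a microphone) provides the first voice data. As a hedged stand-in, the sketch below loads PCM samples from a WAV file with Python's standard library; the file name is illustrative, and microphone capture would replace it in practice.

```python
# Stand-in for the voice collection unit: load the "first voice data to be
# processed" from a WAV file (file name is hypothetical) instead of a microphone.
import wave

def obtain_first_voice_data(path: str = "speech.wav") -> tuple:
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        pcm = wav.readframes(wav.getnframes())   # raw PCM bytes
    return pcm, sample_rate

if __name__ == "__main__":
    pcm, sr = obtain_first_voice_data()
    print(f"collected {len(pcm)} bytes at {sr} Hz")
```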
  • recognition processing may be performed on the first voice data. Specifically, on the one hand, the tone information corresponding to the first voice data can be recognized, and on the other hand, the voice text information corresponding to the first voice data can be recognized.
  • the recognizing tone information and voice text information corresponding to the first voice data may include:
  • Specifically, the first voice data may be recognized by a tone recognition unit to obtain the tone information corresponding to the first voice data; the first voice data may also be recognized by a voice recognition unit to obtain the voice recognition result corresponding to the first voice data, that is, the voice text information; the language of the voice text information is the same as the language of the first voice data.
  • The input of the tone recognition unit is the first voice data, and the output is the tone information. The tone information may include first tone information, where the first tone information represents any one item of tone information in the tone information. Different items of tone information correspond to different tone times and different tone types; for example, the tone type may be a gentle tone type, a high-pitched tone type, a low tone type, a brisk tone type, and so on. That is, the first tone information includes a first tone type and a first tone time, meaning that at the first tone time the user's speaking tone is the first tone type.
  • The input of the voice recognition unit is the first voice data, and the output is the voice text information. The voice text information represents the text information in the language corresponding to the first voice data. The voice text information includes a first text, where the first text is part of the voice recognition result obtained by the voice recognition unit, and the first text represents any character or word in the voice text information.
  • the voice and text information can be translated by the translation unit to obtain the translation result, that is, the translated text information.
  • the language corresponding to the translated text information is different from the language corresponding to the first voice data.
  • the translated text information includes the first translated text
  • the first translated text may be a translation result obtained by the translation unit, and the language corresponding to the first translated text is different from the language corresponding to the first text.
  • In this way, the first voice data to be processed can be recognized and translated, so that the tone information and the translated text information corresponding to the first voice data can be obtained, which facilitates subsequently selecting a suitable display strategy according to the tone information to display the translated text information.
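  • The entities introduced above (tone information carrying a tone type and a tone time, the first text in the voice text information, and the first translated text) can be captured in a small data model. The structure below is an illustrative assumption, not a format defined by the disclosure.

```python
# Hypothetical data model for the entities described above.
from dataclasses import dataclass

@dataclass
class ToneInfo:
    tone_type: str        # first tone type: "gentle", "brisk", "high-pitched", "low", ...
    tone_time: float      # first tone time: when this tone starts, in seconds
    duration: float       # how long the tone lasts (also usable as display duration)

@dataclass
class FirstText:
    text: str             # a character or word from the voice text information
    start: float          # word timestamp inside the first voice data
    end: float

@dataclass
class FirstTranslatedText:
    text: str             # the translation of FirstText in the target language
    source: FirstText     # which recognized word it came from

@dataclass
class MatchingPair:
    text: str             # first text or first translated text
    tone: ToneInfo        # the tone information matched to it

# Example instance: the word "hello" spoken briskly at t = 1.2 s.
pair = MatchingPair(text="hello", tone=ToneInfo("brisk", 1.2, 0.6))
print(pair)
```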
  • S304 Determine the matching relationship between the translated text information and the tone information based on the matching relationship between the voice text information and the tone information;
  • Specifically, the voice text information and the tone information may be matched to determine a matching pair between the first text in the voice text information and the first tone information in the tone information; then, according to the matching pair of the first text and the first tone information, combined with the first translated text corresponding to the first text, a matching pair of the first translated text and the first tone information can be obtained, and the matching relationship between the translated text information and the tone information is thereby established.
  • In this way, assuming that the voice recognition result (i.e., the voice text information) needs to be displayed through the display unit later, then, according to the matching pair of the first text and the first tone information, the tone information can be reflected by the display strategy while the voice text information is displayed; or, assuming that the translation result (i.e., the translated text information) needs to be displayed through the display unit later, then, according to the matching pair of the first translated text and the first tone information, the tone information can be reflected by the display strategy while the translated text information is displayed.
  • S305 Generate second voice data according to the matching relationship between the translated text information and the tone information.
  • the language corresponding to the second voice data is different from the language corresponding to the first voice data, but the language corresponding to the second voice data is the same as the language corresponding to the translated text information; in addition, the second voice data is based on the translated text information and The tone information is obtained by speech synthesis, and the second speech data is used for presentation on the client when the first speech data is played.
  • the generating second voice data according to the matching relationship between the translated text information and the tone information may include:
  • a target synthesis model is used to perform speech synthesis on the translated text information and the tone information to obtain the second speech data.
  • the target synthesis model is a model that characterizes the speech synthesis of translated text information and tone information.
  • That is to say, the target synthesis model can be used to obtain the second voice data according to the matching relationship between the translated text information and the tone information; in this way, since the tone change can also be reflected when the second voice data is presented, the accuracy of the simultaneous interpretation system can be improved.
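  • The disclosure only says that a target synthesis model turns the translated text information and the tone information into the second voice data. The sketch below is a hedged illustration: it maps each tone type to prosody parameters and delegates waveform generation to a placeholder engine; the `TtsEngine` class and the parameter values are assumptions, not a named library API.

```python
# Hedged sketch of generating the second voice data: each (translated text,
# tone) pair is rendered with tone-dependent prosody. `TtsEngine` is a
# hypothetical stand-in for the target synthesis model.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Prosody:
    rate: float         # relative speaking rate
    volume: float       # relative loudness
    pitch_shift: float  # semitones

# Illustrative mapping from tone type to prosody parameters.
TONE_TO_PROSODY = {
    "gentle":       Prosody(rate=1.0, volume=1.0, pitch_shift=0.0),
    "brisk":        Prosody(rate=1.2, volume=1.1, pitch_shift=1.0),
    "high-pitched": Prosody(rate=1.0, volume=1.2, pitch_shift=3.0),
    "low":          Prosody(rate=0.9, volume=0.8, pitch_shift=-2.0),
}

class TtsEngine:
    """Placeholder target synthesis model; returns fake audio bytes."""
    def synthesize(self, text: str, prosody: Prosody) -> bytes:
        return f"<audio text={text!r} rate={prosody.rate}>".encode()

def generate_second_voice_data(pairs: List[Tuple[str, str]], engine: TtsEngine) -> bytes:
    """pairs: (translated text, tone type) in utterance order."""
    audio = b""
    for translated_text, tone_type in pairs:
        prosody = TONE_TO_PROSODY.get(tone_type, TONE_TO_PROSODY["gentle"])
        audio += engine.synthesize(translated_text, prosody)
    return audio

print(generate_second_voice_data([("hello", "brisk"), ("everyone", "gentle")], TtsEngine()))
```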
  • Further, in some embodiments, the method may further include: obtaining a display result corresponding to the translated text information according to the matching relationship between the translated text information and the tone information, where the display result indicates that the translated text information is presented on the client according to the tone information when the first voice data is played.
  • The display strategy may include a color distinction strategy, a text distinction strategy, a font size distinction strategy, a position distinction strategy, a style distinction strategy, an icon distinction strategy, a graphics distinction strategy, an image distinction strategy, and the like, which is not specifically limited in this embodiment of the application.
  • the corresponding relationship between the tone information and the display strategy can be pre-stored in the voice processing device; that is, different tone information will select different display strategies.
  • the tone information corresponding to the translated text information can be obtained, and the corresponding relationship between the pre-stored tone information and the display strategy can be combined to determine the display strategy corresponding to the translated text information. Then the obtained display result is to present the translated text information on the client according to the corresponding display strategy.
  • In some embodiments, obtaining the display result corresponding to the translated text information according to the matching relationship between the translated text information and the tone information may include: determining a display strategy according to the tone information, and obtaining the display result corresponding to the translated text information, wherein the display result indicates that, when the first voice data is played, the translated text information is presented on the client according to the determined display strategy.
  • Taking the font size distinction strategy as an example: if the tone information is a gentle tone type and the corresponding translated text information is text A, text A will be displayed on the client using regular size-four font as the display strategy; if the tone information is a brisk tone type and the corresponding translated text information is text B, text B will be displayed on the client using bold size-four font as the display strategy; if the tone information is a high-pitched tone type and the corresponding translated text information is text C, text C will be displayed on the client using regular size-three font as the display strategy; if the tone information is a low tone type and the corresponding translated text information is text D, text D will be displayed on the client using regular size-five font as the display strategy.
  • Taking the color distinction strategy as another example: if the tone information is a gentle tone type and the corresponding translated text information is text A, text A will be displayed on the client in blue; if the tone information is a brisk tone type and the corresponding translated text information is text B, text B will be displayed on the client in green; if the tone information is a high-pitched tone type and the corresponding translated text information is text C, text C will be displayed on the client in red; if the tone information is a low tone type and the corresponding translated text information is text D, text D will be displayed on the client in black. This embodiment of the application does not specifically limit this.
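  • The two examples above (font size distinction and color distinction) amount to a lookup from tone type to display attributes. The tables below simply restate those examples as data; the structure and helper function are illustrative assumptions.

```python
# The font-size and color examples above expressed as lookup tables.
FONT_STRATEGY = {   # font size distinction strategy
    "gentle":       {"size": "four",  "weight": "regular"},
    "brisk":        {"size": "four",  "weight": "bold"},
    "high-pitched": {"size": "three", "weight": "regular"},
    "low":          {"size": "five",  "weight": "regular"},
}

COLOR_STRATEGY = {  # color distinction strategy
    "gentle": "blue",
    "brisk": "green",
    "high-pitched": "red",
    "low": "black",
}

def display_attributes(tone_type: str, strategy: str = "font"):
    """Return the display attributes for a tone type under the chosen strategy."""
    table = FONT_STRATEGY if strategy == "font" else COLOR_STRATEGY
    return table.get(tone_type, table["gentle"])   # fall back to the gentle style

print(display_attributes("brisk"))            # {'size': 'four', 'weight': 'bold'}
print(display_attributes("low", "color"))     # black
```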
  • the translated text information can be presented on the client according to the display strategy at this time.
  • In this way, when watching, the user can obtain not only the translated text information but also the tone information; that is, the technical solution of the embodiment of the present application allows the user to conveniently obtain the speaker's tone change information in real time, thereby avoiding confusion or errors in the user's judgment of changes in the speaker's speaking tone; and even when the user cannot hear clearly or cannot hear the speaker's pronunciation, the user's access to the speaker's speaking tone information is not affected.
  • In some embodiments, the method may further include: obtaining a display result corresponding to the voice text information according to the matching relationship between the voice text information and the tone information, wherein the display result indicates that, when the first voice data is played, the voice text information is presented on the client according to the determined display strategy.
  • That is to say, when the displayed text information is the voice text information, the display strategy corresponding to the first text in the voice text information can be determined, and the first text is then displayed according to the determined display strategy; or, when the displayed text information is the translated text information, the display strategy corresponding to the first translated text in the translated text information can be determined, and the first translated text is then displayed according to the determined display strategy.
  • It should be noted that the voice processing method can be applied not only to simultaneous interpretation systems but also to other audio and video systems, for example, to video conferencing systems; that is, when a user conducts a meeting through the video conferencing system, the display device (or display unit) of the video conferencing system can display the speech content or translated content of the speaker, and different words can also be displayed with different display strategies according to the different speaking tones corresponding to those words.
  • In short, the speaker's voice data is converted into voice text, and the voice text is then translated into translated text in another language. In this process, the speaker's tone information can be added to the translated text synchronously and in real time, which effectively avoids the situation in which the output result of the simultaneous interpretation system that the user sees is not synchronized with the tone contained in the speaker's speech that the user hears, a situation that would otherwise lead to confusion or errors in the user's judgment of the speaker's tone change. In this way, a listener of another language can correctly understand the speaker's tone information when the current sentence is expressed.
  • the voice processing device of the embodiment of the present application also includes a tone recognition unit.
  • When the user uses the simultaneous interpretation system provided by the embodiment of the present application, the simultaneous interpretation system can not only display the speaker's speech content (that is, the voice recognition result) or the corresponding translated content (that is, the translation result) on the display unit, but, for different words (including characters or words) in the speech content or the corresponding translated content, different display strategies can also be selected for display according to the different tone information corresponding to each word. For example, when a user uses the simultaneous interpretation system to listen to a speech, and the system outputs the speaker's speech content or the corresponding translated content, the words whose tone has changed in that content can be displayed in a distinctive way. In this way, when watching, the user can obtain not only the speaker's speech content or the corresponding translated content but also the tone information of the speaker's speech; this avoids the situation in which the output result of the simultaneous interpretation system that the user sees is out of sync with the tone contained in the speaker's speech that the user hears, which would lead to confusion or errors in the user's judgment of the speaker's tone change; and even when the user cannot hear clearly or cannot hear the speaker's pronunciation, the user's access to the speaker's tone information is not affected.
  • This embodiment provides a voice processing method: obtaining first voice data to be processed; recognizing tone information and voice text information corresponding to the first voice data; translating the voice text information to obtain translated text information, wherein the language corresponding to the translated text information is different from the language corresponding to the first voice data; determining the matching relationship between the translated text information and the tone information based on the matching relationship between the voice text information and the tone information; and generating second voice data based on the matching relationship between the translated text information and the tone information, wherein the language corresponding to the second voice data is different from the language corresponding to the first voice data, and the second voice data is presented on the client when the first voice data is played. In this way, for the first voice data to be processed, not only the text information but also the tone information is recognized; therefore, in the process of obtaining the second voice data, not only the translated text information but also the tone information is taken into account, so that the second voice data can also reflect the change of tone during playback; that is, while the text information is displayed, the text whose tone has changed can also be displayed with a different display strategy.
  • FIG. 4 shows a schematic flowchart of another voice processing method provided in an embodiment of the present application.
  • the method may include:
  • S402 Determine whether to perform tone recognition on the first voice data through the first information
  • the first information is used to characterize the spectral characteristics of the sound in the first voice data.
  • the spectral characteristic information may include audio spectrum, energy spectrum, LOG energy spectrum, and Mel spectrum.
  • That is to say, the tone recognition of the first voice data to be processed may be performed on the first voice data through the first information, or through other information (such as the second information or the third information), which is not specifically limited in the embodiment of this application.
  • S404 Perform tone recognition on the first voice data according to the first information, and determine the tone information corresponding to the first voice data;
  • In this way, the first information can be extracted from the first voice data, tone recognition can then be performed on the first voice data according to the first information, and the tone information corresponding to the first voice data is determined.
  • the performing tone recognition on the first voice data according to the first information and determining the tone information corresponding to the first voice data may include:
  • the first information is input into a preset recognition model, and the tone information corresponding to the first voice data is output through the preset recognition model.
  • That is to say, a model can be used at this time: the first information is input into a preset recognition model, and the tone information corresponding to the first voice data is then output by the preset recognition model.
  • the preset recognition model may be a model trained in advance using a machine learning algorithm.
  • machine learning algorithms such as decision trees, support vector machines, neural networks, and deep neural networks can be used to perform model training with training samples to obtain a preset recognition model.
  • The training samples may be a plurality of pieces of sample information extracted from sample first voice data. Training on the plurality of pieces of sample information can continuously optimize the preset recognition model, so that the recognition result of the preset recognition model is more accurate and the recognition accuracy of the tone information is improved.
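  • The preset recognition model is described only as a model trained in advance with a machine learning algorithm (decision trees, support vector machines, neural networks, and so on) on spectral features. As one hedged possibility, the sketch below trains a scikit-learn SVM on coarse FFT band energies; the feature choice, label set, and library are assumptions rather than requirements of the disclosure.

```python
# Hedged sketch: train a "preset recognition model" for tone types from
# spectral features, using an SVM (one of the algorithms named above).
# The feature choice (mean FFT magnitudes per band) and labels are assumptions.
import numpy as np
from sklearn.svm import SVC

def spectral_features(frame: np.ndarray, n_bands: int = 16) -> np.ndarray:
    """First information: a coarse magnitude spectrum of one audio frame."""
    spectrum = np.abs(np.fft.rfft(frame))
    bands = np.array_split(spectrum, n_bands)
    return np.array([band.mean() for band in bands])

# Toy training data: random frames standing in for labelled sample voice data.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 1024))                 # 200 sample frames
labels = rng.choice(["gentle", "brisk", "high-pitched", "low"], size=200)
X = np.vstack([spectral_features(f) for f in frames])

preset_recognition_model = SVC()                      # could also be a tree or neural net
preset_recognition_model.fit(X, labels)

# Inference: input the first information, output the tone information.
new_frame = rng.normal(size=1024)
print(preset_recognition_model.predict([spectral_features(new_frame)]))
```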
  • S406 Perform tone recognition on the first voice data according to the second information, and determine the tone information corresponding to the first voice data;
  • In this way, the second information can be extracted from the first voice data, and tone recognition can then be performed on the first voice data based on the second information to determine the tone information corresponding to the first voice data.
  • the second information represents the energy level of the sound in the first voice data.
  • Understandably, sound has energy. For example, the application of ultrasound reflects the energy of sound. Usually, the intensity of sound can be used to reflect the magnitude of this energy; sound intensity refers to the sound energy that a sound wave carries through a unit area perpendicular to the direction of propagation per unit time.
  • the tone information can be recognized through the energy information of the sound in the first voice data, and the tone change in the first voice data can be reflected by the energy level.
  • the method may further include:
  • In other embodiments, the third information can also be extracted from the first voice data, and tone recognition is then performed on the first voice data based on the third information to determine the tone information corresponding to the first voice data.
  • the third information represents the volume of the sound in the first voice data.
  • the volume is also called loudness and sound intensity, which mainly refers to the subjective feeling of the human ear on the strength of the sound heard, and its objective evaluation criterion is the amplitude of the sound.
  • the tone information can be identified by the volume information of the sound in the first voice data, and the tone change in the first voice data can also be reflected by the volume; in addition, since the volume information is closer to the subjective feeling of the human ear, it can be Make the recognized tone information closer to the real tone information.
  • In other words, when the judgment result is that tone recognition is not performed on the first voice data through the first information (such as spectral feature information), tone recognition can be performed on the first voice data according to the second information (such as energy information) at this time.
  • the tone recognition of the first voice data can also be performed based on the third information (such as volume information).
  • Specifically, by comparing the second information or the third information with a preset information threshold, the tone information corresponding to the first voice data can be determined according to the result of the comparison.
  • In some embodiments, performing tone recognition on the first voice data according to the second information to determine the tone information corresponding to the first voice data includes: when the duration for which the second information is greater than the preset information threshold exceeds a preset time threshold, determining that the first voice data corresponds to a first type of tone information; when the second information is less than or equal to the preset information threshold, or the duration for which the second information is greater than the preset information threshold does not exceed the preset time threshold, determining that the first voice data corresponds to a second type of tone information.
  • Correspondingly, performing tone recognition on the first voice data according to the third information to determine the tone information corresponding to the first voice data may include: when the duration for which the third information is greater than the preset information threshold exceeds the preset time threshold, determining that the first voice data corresponds to the first type of tone information; when the third information is less than or equal to the preset information threshold, or the duration for which the third information is greater than the preset information threshold does not exceed the preset time threshold, determining that the first voice data corresponds to the second type of tone information.
  • It should be noted that the tone information can also be divided into a first type of tone information and a second type of tone information, where the first type and the second type are different, and both the first type of tone information and the second type of tone information are judged based on the second information or the third information. For example, the first type of tone information may be heavy (accented) tone information, and the second type of tone information may be light tone information.
  • the voice processing device applied to the simultaneous interpretation system includes a tone recognition unit.
  • the input of the tone recognition unit is the first voice data to be processed, and the output is the tone information of the speaker (speaker).
  • the tone recognition unit may judge the tone change in the first voice data through spectral feature information, and may also judge the tone change in the first voice data through volume information or energy information.
  • When judging the tone in the first voice data through spectral feature information, a model can be used: the spectral feature information is input into the preset recognition model, and the corresponding tone information is output by the preset recognition model. When judging the tone in the first voice data through volume information or energy information, the judgment can be made based on how the volume or energy information changes; specifically, when the volume information or energy information continuously exceeds the preset information threshold for a period of time, an accented tone can be output, for example, the tone information at this time is heavy tone information; otherwise, other tone information is output, for example, the tone information at this time is light tone information.
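  • The rule just described (output heavy tone information when the energy or volume stays above the preset information threshold for longer than the preset time threshold, and light tone information otherwise) can be written as a running check over per-frame values. The frame length and threshold values below are illustrative assumptions.

```python
# Hedged sketch of the threshold rule: output heavy tone information when the
# per-frame energy (second information) or volume (third information) stays
# above a preset information threshold for longer than a preset time threshold.
import numpy as np

def frame_energy(samples: np.ndarray, frame_len: int = 400) -> np.ndarray:
    """Second information: mean squared amplitude per frame."""
    n = len(samples) // frame_len * frame_len
    frames = samples[:n].reshape(-1, frame_len)
    return (frames ** 2).mean(axis=1)

def tone_per_frame(values: np.ndarray,
                   info_threshold: float,
                   time_threshold_frames: int):
    tones, run = [], 0
    for v in values:
        run = run + 1 if v > info_threshold else 0   # consecutive frames above threshold
        tones.append("heavy" if run > time_threshold_frames else "light")
    return tones

# Toy signal: quiet, then a sustained loud stretch, then quiet again.
sr, frame_len = 16000, 400
signal = np.concatenate([0.05 * np.ones(sr), 0.8 * np.ones(sr // 2), 0.05 * np.ones(sr)])
energy = frame_energy(signal, frame_len)
print(tone_per_frame(energy, info_threshold=0.1, time_threshold_frames=5))
```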
  • the tone information can include tone type and tone time; here, the tone type represents the speaking tone of the speaker, and the tone time represents the tone change time of the speaker. In this way, according to the tone time, the matching relationship between the speech recognition result and the tone information can be better obtained.
  • FIG. 5 shows a schematic flowchart of yet another voice processing method provided by an embodiment of the present application.
  • the method may include:
  • S501 Input the first voice data to be processed
  • S502 Perform voice recognition on the first voice data, and determine voice text information corresponding to the first voice data, where the voice text information includes the first text;
  • S503 Match the voice text information and the tone information according to the tone information input by the tone recognition unit;
  • S504 Output a matching pair of the first text in the voice text information and the first tone information in the tone information.
  • the tone information includes first tone information
  • the first tone information includes the first tone type and the first tone time.
  • In some embodiments, matching the voice text information and the tone information may include: matching the voice text information and the tone information, determining the first text in the voice text information according to the first tone time in the first tone information, and determining the first tone type in the first tone information according to the first tone time; and obtaining the matching pair of the first text and the first tone information according to the first text, the first tone time, and the first tone type.
  • In this way, according to the first tone time, the first text in the voice text information (i.e., the voice recognition result) can be determined more accurately, so that the matching relationship between the voice text information and the tone information can be better obtained; in addition, when displaying on the display unit, the first tone time can determine how long the first text is displayed under the corresponding display strategy, so that the speaker's tone change information can be better obtained and confusion or errors in the user's judgment of the speaker's tone changes can be avoided.
  • the speech processing device applied to the simultaneous interpretation system also includes a speech recognition unit.
  • Here, the speech recognition unit and the tone recognition unit work in parallel; that is, the speech recognition and the tone recognition are performed simultaneously. After the voice recognition unit obtains the voice recognition result, the voice recognition result can be matched against the tone information obtained by the tone recognition unit, and the first tone information corresponding to the first text (a character or word) in the voice text information can be obtained, thereby forming a matching pair of the first text and the first tone information. In this way, when the voice text information is subsequently displayed on the display unit, not only can the text information be displayed, but the text whose tone has changed can also be displayed with a different display strategy, so that when watching, the user can obtain not only the text information but also the tone information corresponding to the text, avoiding confusion or errors in the user's judgment of the speaker's tone change.
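  • Steps S501 to S504 match each recognized word (the first text) with the tone information by comparing the word's timestamps against the first tone time. The sketch below assumes timestamped recognition output, which is an illustrative simplification.

```python
# Hedged sketch of S503/S504: match each recognized word (first text) to the
# tone information whose time span covers the word's start time.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ToneInfo:
    tone_type: str
    start: float   # first tone time
    end: float

@dataclass
class Word:
    text: str
    start: float
    end: float

def match_words_to_tones(words: List[Word], tones: List[ToneInfo],
                         default: str = "gentle") -> List[Tuple[Word, str]]:
    pairs = []
    for word in words:
        matched = default
        for tone in tones:
            if tone.start <= word.start < tone.end:   # word begins inside this tone span
                matched = tone.tone_type
                break
        pairs.append((word, matched))
    return pairs

words = [Word("大家", 0.0, 0.4), Word("注意", 0.5, 0.9), Word("安全", 1.0, 1.4)]
tones = [ToneInfo("heavy", 0.45, 1.5)]          # the speaker stresses the later words
for word, tone_type in match_words_to_tones(words, tones):
    print(word.text, "->", tone_type)
```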
  • FIG. 6 shows a schematic flowchart of yet another voice processing method provided by the embodiment of the present application.
  • the method may include:
  • S601 Input voice and text information
  • S602 Perform translation processing on the voice text information, and determine the translated text information corresponding to the first voice data; wherein the translated text information includes the first translated text;
  • S603 Match the translated text information and the tone information according to the matching pair of the first text in the voice text information and the first tone information in the tone information;
  • S604 Output a matching pair of the first translated text in the translated text information and the first tone information in the tone information.
  • the first translated text is obtained after the first text is translated.
  • In this way, according to the matching pair of the first text in the voice text information and the first tone information, a matching pair of the first translated text and the first tone information can be obtained.
  • the speech processing device applied to the simultaneous interpretation system further includes a translation unit.
  • the translation unit works after the speech recognition unit, that is, the translation unit and the speech recognition unit work in series.
  • After the translation unit obtains the translated text information, it can match the translated text information according to the matching pair between the first text in the voice text information and the first tone information in the tone information, and the first tone information corresponding to the first translated text (a character or word) in the translated text information can be obtained, thereby forming a matching pair of the first translated text and the first tone information.
  • In this way, when the translated text information is subsequently displayed on the display unit, not only can the text information be displayed, but the translated text whose tone has changed can also be displayed with a different display strategy, so that when watching, the user can obtain not only the text information but also the tone information corresponding to the text, which avoids confusion or errors in the user's judgment of the speaker's tone change.
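  • Because the translation unit runs after the speech recognition unit, the matching pair obtained for the first text can be propagated to the first translated text through a source-to-target correspondence. The word-level alignment assumed below (each translated word knowing its source word) is an illustrative simplification.

```python
# Hedged sketch of S603/S604: reuse the (first text, first tone information)
# matching pair to tag each first translated text with the same tone.
# Assumes the translation step keeps a word-level source alignment.
from typing import Dict, List, Tuple

def match_translated_text(text_tone_pairs: Dict[str, str],
                          translated: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """
    text_tone_pairs: first text -> tone type (from the voice text matching step)
    translated: (first translated text, source first text) pairs from the translation unit
    returns: (first translated text, tone type) matching pairs
    """
    return [(target, text_tone_pairs.get(source, "gentle")) for target, source in translated]

text_tone_pairs = {"大家": "gentle", "注意": "heavy", "安全": "heavy"}
translated = [("everyone", "大家"), ("pay attention to", "注意"), ("safety", "安全")]
print(match_translated_text(text_tone_pairs, translated))
```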
  • FIG. 7 shows a schematic flowchart of yet another voice processing method provided in an embodiment of the present application.
  • the method may include:
  • S701 Input the text information to be displayed and the matching relationship between the text information and the tone information;
  • S702 Determine a display strategy corresponding to the text information to be displayed according to the matching relationship between the text information and the tone information;
  • S703 According to the determined display strategy, output the text information to be displayed.
  • the text information to be displayed may be the first text in the voice text information, or the first translated text in the translated text information, which is not specifically limited in the embodiment of the present application.
  • Here, the display strategy may include a color distinction strategy, a text distinction strategy, a font size distinction strategy, a position distinction strategy, a style distinction strategy, an icon distinction strategy, a graphics distinction strategy, an image distinction strategy, and the like, which is not specifically limited in this embodiment of the application.
  • the corresponding relationship between the tone information and the display strategy may be stored in advance, that is, different tone information will select different display strategies.
  • the tone information corresponding to the translated text information can be obtained, and the corresponding relationship between the pre-stored tone information and the display strategy can be combined to determine the display strategy corresponding to the translated text information. Then the obtained display result is to present the translated text information on the client according to the corresponding display strategy.
  • It should also be noted that, on the one hand, the tone time in the tone information makes it possible to better obtain the matching relationship between the first text in the voice text information and the first tone information; on the other hand, when displaying on the display unit, the tone time can determine how long the first text or the first translated text is displayed under the corresponding display strategy, so that the speaker's tone change information can be better obtained and confusion or errors in the user's judgment of the speaker's tone changes can be avoided.
  • the voice processing device applied to the simultaneous interpretation system includes a display unit, which can display the content of the speaker's speech (ie, voice text information) or the corresponding translation content (ie, translated text information); this It needs to be based on the matching relationship between the text information to be displayed and the tone information, such as the matching pair between the first text in the voice text information and the first tone information, or the matching pair between the first translated text in the translated text information and the first tone information ; In this way, for the different voices of different characters, you can choose different display strategies, such as distinguishing display according to text, color, location, font size, style, icon, graphics, or image.
  • That is to say, when the display unit displays the speech content of the speaker or the corresponding translated content, different words in that content can be displayed with different display strategies according to the different tone information corresponding to each word, as the sketch after this paragraph illustrates.
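  • Steps S701 to S703 take the text to be displayed together with its matched tone information, pick a display strategy from the pre-stored correspondence, and output the styled text, optionally for the duration given by the tone time. The markup format and strategy table in this hedged sketch are assumptions.

```python
# Hedged sketch of S701-S703: choose a display strategy from the pre-stored
# tone -> strategy correspondence and output the text to be displayed.
# The markup format and strategy table are illustrative assumptions.
from typing import Optional

STRATEGY_TABLE = {               # pre-stored correspondence between tone and display strategy
    "gentle": {"color": "blue",  "weight": "normal"},
    "brisk":  {"color": "green", "weight": "normal"},
    "heavy":  {"color": "red",   "weight": "bold"},
}

def render(text: str, tone_type: str, duration: Optional[float] = None) -> str:
    strategy = STRATEGY_TABLE.get(tone_type, STRATEGY_TABLE["gentle"])   # S702
    span = f'<span style="color:{strategy["color"]};font-weight:{strategy["weight"]}">{text}</span>'
    if duration is not None:                     # keep the style for the tone's duration
        span += f"  <!-- display for {duration:.1f}s -->"
    return span                                  # S703: output for the client / display unit

for text, tone, secs in [("everyone", "gentle", 0.4), ("pay attention", "heavy", 1.0)]:
    print(render(text, tone, secs))
```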
  • FIG. 8 shows a detailed flowchart of a voice processing method provided by an embodiment of the present application. As shown in Figure 8, the detailed process may include:
  • S802 Perform tone recognition on the first voice data, and determine tone information corresponding to the first voice data, where the tone information includes the first tone information;
  • S803 Perform voice recognition on the first voice data, and determine voice text information corresponding to the first voice data, where the voice text information includes the first text;
  • Specifically, the first voice data may be recognized by the tone recognition unit to obtain the tone information corresponding to the first voice data; the first voice data may also be recognized by the voice recognition unit to obtain the voice recognition result (i.e., the voice text information) corresponding to the first voice data. Here, the tone recognition unit and the voice recognition unit work in parallel.
  • The first tone information includes the first tone type and the first tone time; that is, at the first tone time, the user's speaking tone is the first tone type. The voice text information includes the first text, so that after the voice text information is obtained, the matching pair of the first text in the voice text information and the first tone information can be determined.
  • S804 Match the voice text information and the tone information, and determine the matching pair of the first text in the voice text information and the first tone information in the tone information;
  • S805 Determine a display strategy corresponding to the first text according to the matching pair of the first text in the voice text information and the first tone information in the tone information;
  • S806 Present the first text on the client according to the determined display strategy.
  • the matching of the voice text information and the tone information may include: matching the voice text information and the tone information, and determining the voice according to the first tone time in the first tone information The first character in the text information, and the first tone type in the first tone information is determined according to the first tone time; according to the first character, the first tone time, and the first tone type, it is obtained The matching pair of the first text and the first tone information.
  • the display unit when the display unit needs to display the content of the speaker's speech (ie, voice text information), at this time, after determining the matching pair of the first text in the voice text information and the first tone information in the tone information, it can be based on A matching pair of the first text in the voice text information and the first tone information in the tone information determines the display strategy corresponding to the first text, and then displays the first text in the voice text information according to the determined display strategy.
  • That is to say, when the client (or display unit) displays the speech content of the speaker (i.e., the voice text information), different words in the speech content can be displayed with different display strategies selected according to the different tone information corresponding to each word, so that when watching, the user can better obtain the information of the speaker's tone change, which avoids confusion or errors in the user's judgment of the speaker's tone change.
  • FIG. 9 shows a detailed flowchart of another voice processing method provided by an embodiment of the present application. As shown in Figure 9, the detailed process may include:
  • S902 Perform tone recognition on the first voice data, and determine tone information corresponding to the first voice data, where the tone information includes the first tone information;
  • S903 Perform voice recognition on the first voice data, and determine voice text information corresponding to the first voice data, where the voice text information includes the first text;
  • Specifically, the first voice data may be recognized by the tone recognition unit to obtain the tone information corresponding to the first voice data; the first voice data may also be recognized by the voice recognition unit to obtain the voice text information corresponding to the first voice data. Here, the tone recognition unit and the voice recognition unit work in parallel.
  • S904 Match the voice text information with the tone information, and determine a matching pair between the first text in the voice text information and the first tone information in the tone information;
  • S905 Perform translation processing on the voice text information, and determine translated text information corresponding to the first voice data; wherein the translated text information includes a first translated text corresponding to the first text;
  • the voice text information can be translated by the translation unit to obtain the translated text information corresponding to the first voice data.
  • both the translation unit and the speech recognition unit work serially, and the translation unit is processed after the speech recognition unit.
  • the first translated text is obtained after the first text is translated.
  • the matching pair of the first translated text and the first tone information can be obtained subsequently.
  • S906 Match the translated text information and the tone information according to the matching pair of the first text and the first tone information, and determine the matching pair of the first translated text in the translated text information and the first tone information;
  • S907 Determine a display strategy corresponding to the first translated text according to the matching pair of the first translated text and the first tone information
  • S908 Present the first translated text on the client according to the determined display strategy.
  • When the display unit needs to display the translated content (i.e., the translated text information) corresponding to the speaker's speech, after the matching pair of the first translated text in the translated text information and the first tone information is determined, the display strategy corresponding to the first translated text is determined according to that matching pair, and the first translated text in the translation result is then displayed according to the determined display strategy. That is to say, when the client (or display unit) displays the translated content corresponding to the speaker's speech, different texts in the translated content can be displayed according to the different tone information corresponding to each word, so that when watching, the user can better obtain the information of the speaker's tone change and avoid confusion or errors in the user's judgment of the speaker's tone change.
  • Briefly, the voice processing device applied to the simultaneous interpretation system adds a tone recognition unit, and the tone recognition unit can perform tone recognition through spectral feature information, or through volume information or energy information. Then, according to the output result of the tone recognition unit, the voice text information is matched, and the matching pair of the first text (a character or word) in the voice text information and the first tone information in the tone information can be obtained; next, according to the matching pair of the first text in the voice text information and the first tone information, the translated text information is matched, and the matching pair of the first translated text (a character or word) in the translated text information and the first tone information can be obtained; finally, according to the matching relationship between the word to be displayed in the display unit and the tone information, the corresponding display strategy is selected, and the corresponding word is then displayed in the display unit according to that display strategy.
  • in the process of generating the second voice data, not only the translated text information but also the tone information is considered, so that the second voice data can reflect the tone changes during playback; that is, while the text information is displayed, the text whose tone has changed can be displayed with a different display strategy, so that the user obtains not only the text information but also the tone information corresponding to the text. In this way, the speaker's tone-change information can be obtained conveniently and in real time, which avoids confusion or errors in the user's judgment of changes in the speaker's speaking tone; and when the user cannot hear the speaker's pronunciation clearly, or cannot hear it at all, the user can still obtain the speaker's tone information, which improves the accuracy of the simultaneous interpretation system.
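  • as a minimal sketch of this overall flow (Python; the tone_unit, speech_unit, translation_unit, matching_unit and synthesis_unit callables merely stand in for the units described above and are not names from the original disclosure), the second voice data could be produced roughly as follows:

      def process_first_voice_data(first_voice_data, tone_unit, speech_unit,
                                   translation_unit, matching_unit, synthesis_unit):
          # Tone recognition and speech recognition run in parallel in the
          # described system; they are simply called one after the other here.
          tone_info = tone_unit(first_voice_data)
          voice_text = speech_unit(first_voice_data)

          # Translation into a language different from the first voice data.
          translated_text = translation_unit(voice_text)

          # Matching: first text <-> first tone info, then propagated to the
          # first translated text <-> first tone info.
          text_tone_pairs, translated_tone_pairs = matching_unit(
              voice_text, translated_text, tone_info)

          # Speech synthesis that also reflects the tone information.
          second_voice_data = synthesis_unit(translated_text, translated_tone_pairs)
          return second_voice_data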
  • FIG. 10 shows a schematic diagram of the composition structure of a voice processing apparatus 100 provided by an embodiment of the present application.
  • the speech processing device 100 may include: an acquisition unit 1001, a recognition unit 1002, a translation unit 1003, a matching unit 1004, and a generating unit 1005, where:
  • the obtaining unit 1001 is configured to obtain the first voice data to be processed
  • the recognition unit 1002 is configured to recognize tone information and voice text information corresponding to the first voice data
  • the translation unit 1003 is configured to translate the voice text information to obtain translated text information; wherein the language type corresponding to the translated text information is different from the language type corresponding to the first voice data;
  • the matching unit 1004 is configured to determine the matching relationship between the translated text information and the tone information based on the matching relationship between the voice text information and the tone information;
  • the generating unit 1005 is configured to generate second voice data according to the matching relationship between the translated text information and the tone information; wherein the language corresponding to the second voice data is different from the language corresponding to the first voice data, and the second voice data is used for presentation on the client when the first voice data is played.
  • the recognition unit 1002 further includes a tone recognition unit 1021 and a speech recognition unit 1022, wherein,
  • the tone recognition unit 1021 is configured to perform tone recognition on the first voice data, and determine the tone information corresponding to the first voice data;
  • the voice recognition unit 1022 is configured to perform voice recognition on the first voice data, and determine the voice text information corresponding to the first voice data.
  • the voice processing device 100 may further include an extracting unit 1006 configured to extract first information from the first voice data, where the first information represents the spectral characteristics of the sound in the first voice data;
  • the tone recognition unit 1021 is configured to input the first information into a preset recognition model, and output tone information corresponding to the first voice data through the preset recognition model.
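  • a minimal sketch of this spectral-feature path, assuming the librosa and numpy libraries are available and treating preset_model as a placeholder for the pre-trained recognition model (the feature choice and the model interface are assumptions, not part of the disclosure):

      import numpy as np
      import librosa

      def recognize_tone_from_spectrum(first_voice_data, sample_rate, preset_model):
          # First information: spectral characteristics of the sound, here a
          # log-Mel spectrogram averaged over time into one feature vector.
          mel = librosa.feature.melspectrogram(y=first_voice_data, sr=sample_rate)
          first_information = np.mean(librosa.power_to_db(mel), axis=1)
          # The preset recognition model maps the features to tone information,
          # e.g. a label such as "emphatic" or "soft".
          return preset_model.predict(first_information[np.newaxis, :])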
  • the voice processing device 100 may further include a comparing unit 1007, where:
  • the extraction unit 1006 is further configured to extract second information from the first voice data, where the second information represents the energy level of the sound in the first voice data;
  • the comparing unit 1007 is configured to compare the second information with a preset information threshold
  • the tone recognition unit 1021 is further configured to determine the tone information corresponding to the first voice data according to the result of the comparison.
  • the tone recognition unit 1021 is specifically configured to: determine that the first voice data corresponds to the first type of tone information when the second information is greater than the preset information threshold and the duration for which the second information remains greater than the preset information threshold exceeds the preset time threshold; or determine that the first voice data corresponds to the second type of tone information when the second information is less than or equal to the preset information threshold, or the duration for which the second information remains greater than the preset information threshold does not exceed the preset time threshold; the first type is different from the second type.
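  • a minimal sketch of this threshold rule, assuming the second information is given as a per-frame energy sequence; the frame length and the threshold values below are purely illustrative:

      def classify_tone_by_energy(frame_energies, frame_seconds=0.02,
                                  energy_threshold=0.5, time_threshold=0.3):
          # Second information: per-frame sound energy of the first voice data.
          above = [energy > energy_threshold for energy in frame_energies]
          longest_run, run = 0, 0
          for flag in above:
              run = run + 1 if flag else 0
              longest_run = max(longest_run, run)
          # First type (e.g. an emphatic tone) only when the energy stays above
          # the threshold for longer than the preset time threshold.
          if longest_run * frame_seconds > time_threshold:
              return "first_type_tone"
          return "second_type_tone"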
  • the matching unit 1004 is further configured to match the voice text information and the tone information, and determine a matching pair between the first text in the voice text information and the first tone information in the tone information, wherein the voice text information includes the first text and the tone information includes the first tone information; and to obtain the matching relationship between the voice text information and the tone information according to the matching pair of the first text and the first tone information.
  • the matching unit 1004 is further configured to determine a matching pair between the first translated text in the translated text information and the first tone information according to the matching pair between the first text and the first tone information, wherein the translated text information includes the first translated text corresponding to the first text; and to obtain the matching relationship between the translated text information and the tone information according to the matching pair of the first translated text and the first tone information.
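  • a minimal sketch of this second matching step, assuming a word-level alignment between the source text and the translation is available (how that alignment is obtained is not specified in the disclosure):

      def match_translation_to_tone(text_tone_pairs, word_alignment):
          # text_tone_pairs: (first text, first tone information) pairs, e.g.
          #   [("重要", {"type": "emphatic", "time": 3.2})]
          # word_alignment: source word -> translated word mapping, assumed to
          #   be supplied by the translation step.
          translated_pairs = []
          for first_text, first_tone_info in text_tone_pairs:
              first_translated_text = word_alignment.get(first_text)
              if first_translated_text is not None:
                  translated_pairs.append((first_translated_text, first_tone_info))
          return translated_pairs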
  • the speech processing device 100 may further include a determining unit 1008 configured to determine a target synthesis model according to the translated text information; wherein the target synthesis model characterizes a model for performing speech synthesis on the translated text information and the tone information;
  • the generating unit 1005 is specifically configured to perform speech synthesis on the translated text information and the tone information by using a target synthesis model according to the matching relationship between the translated text information and the tone information to obtain the second speech data.
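  • a minimal sketch of this synthesis step; target_synthesis_model and its synthesize interface are placeholders, since the disclosure does not fix a concrete speech synthesis engine:

      def generate_second_voice_data(translated_text, translated_tone_pairs,
                                     target_synthesis_model):
          # Collect the translated words whose tone information marks a change,
          # so the synthesizer can render them with, e.g., extra stress.
          emphasized_words = {word for word, tone in translated_tone_pairs
                              if tone.get("type") == "emphatic"}
          # Placeholder interface for the model that performs speech synthesis
          # on the translated text information and the tone information.
          return target_synthesis_model.synthesize(text=translated_text,
                                                   emphasized_words=emphasized_words)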
  • the voice processing device 100 may further include a display unit 1009, where:
  • the obtaining unit 1001 is further configured to obtain the preset correspondence relationship between the tone information and the display strategy;
  • the determining unit 1008 is further configured to determine a display strategy corresponding to the tone information according to the obtained corresponding relationship between the tone information and the display strategy;
  • the display unit 1009 is configured to obtain a display result corresponding to the translated text information according to the determined display strategy and the matching relationship between the translated text information and the tone information; wherein the display result indicates that, when the first voice data is played, the translated text information is presented on the client according to the determined display strategy.
  • the display unit 1009 is further configured to obtain a display result corresponding to the voice text information according to the determined display strategy and the matching relationship between the voice text information and the tone information; wherein the display result indicates that, when the first voice data is played, the voice text information is presented on the client according to the determined display strategy.
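  • a minimal sketch of the display-strategy lookup; the tone-type names and style values below are only an illustrative preset correspondence, in the spirit of the font-size and color examples in the description:

      # Illustrative preset correspondence between tone information and display strategy.
      DISPLAY_STRATEGIES = {
          "gentle":   {"font_size": 12, "color": "blue"},
          "brisk":    {"font_size": 12, "color": "green", "bold": True},
          "emphatic": {"font_size": 14, "color": "red"},
          "low":      {"font_size": 10, "color": "black"},
      }

      def build_display_result(translated_pairs, default_style=None):
          default_style = default_style or {"font_size": 12, "color": "black"}
          result = []
          for word, tone in translated_pairs:
              style = DISPLAY_STRATEGIES.get(tone.get("type"), default_style)
              result.append({"text": word, "style": style})
          return result

  • the result list can then be rendered by the client, word by word, so that words whose tone changed are shown with a different style for the duration indicated by the corresponding tone time.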
  • a "unit" may be a part of a circuit, a part of a processor, a part of a program or software, etc., of course, it may also be a module, or it may also be non-modular.
  • the various components in this embodiment may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be realized in the form of hardware or software function module.
  • if the integrated unit is implemented in the form of a software function module and is not sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of this embodiment, in essence the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the method described in this embodiment.
  • the aforementioned storage media include: U disk, mobile hard disk, read only memory (Read Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk and other media that can store program codes.
  • this embodiment provides a computer storage medium that stores a voice processing program that implements the method described in any one of the foregoing embodiments when the voice processing program is executed by at least one processor.
  • the device 120 may include: a communication interface 1201, a memory 1202, and a processor 1203; the various components are coupled together through a bus system 1204. It can be understood that the bus system 1204 is used to implement connection and communication between these components. In addition to the data bus, the bus system 1204 also includes a power bus, a control bus, and a status signal bus. However, for the sake of clarity, the various buses are all marked as the bus system 1204 in FIG. 12, wherein:
  • the communication interface 1201 is used for receiving and sending signals in the process of sending and receiving information with other external network elements;
  • the memory 1202 is used to store a computer program that can run on the processor 1203;
  • the processor 1203 is configured to execute, when the computer program is running:
  • obtaining the first voice data to be processed;
  • recognizing the tone information and the voice text information corresponding to the first voice data;
  • translating the voice text information to obtain translated text information; wherein the language type corresponding to the translated text information is different from the language type corresponding to the first voice data;
  • determining the matching relationship between the translated text information and the tone information based on the matching relationship between the voice text information and the tone information;
  • generating second voice data according to the matching relationship between the translated text information and the tone information; wherein the language corresponding to the second voice data is different from the language corresponding to the first voice data, and the second voice data is used for presentation on the client when the first voice data is played.
  • the device 120 in the embodiment of the present application may be a terminal device or a server.
  • the memory 1202 may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory can be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or a flash memory.
  • the volatile memory may be a random access memory (Random Access Memory, RAM), which is used as an external cache.
  • by way of example and not limitation, many forms of RAM are available, such as static random access memory (Static RAM, SRAM), dynamic random access memory (Dynamic RAM, DRAM), synchronous dynamic random access memory (Synchronous DRAM, SDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (Synchronous link DRAM, SLDRAM), and direct Rambus random access memory (Direct Rambus RAM, DRRAM).
  • the processor 1203 may be an integrated circuit chip with signal processing capabilities. In the implementation process, the steps of the foregoing method can be completed by an integrated logic circuit of hardware in the processor 1203 or instructions in the form of software.
  • the processor 1203 may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium mature in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or registers.
  • the storage medium is located in the memory 1202, and the processor 1203 reads the information in the memory 1202 and completes the steps of the foregoing method in combination with its hardware.
  • the embodiments described in this application can be implemented by hardware, software, firmware, middleware, microcode, or a combination thereof.
  • the processing unit can be implemented in one or more application specific integrated circuits (Application Specific Integrated Circuits, ASIC), digital signal processors (Digital Signal Processing, DSP), digital signal processing devices (DSP Device, DSPD), programmable logic devices (Programmable Logic Device, PLD), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units used to perform the functions described in this application, or a combination thereof.
  • the technology described in this application can be implemented through modules (for example, procedures, functions, etc.) that perform the functions described in this application.
  • the software codes can be stored in the memory and executed by the processor.
  • the memory can be implemented in the processor or external to the processor.
  • the processor 1203 is further configured to execute the method described in any one of the foregoing embodiments when the computer program is running.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Child & Adolescent Psychology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

一种语音处理方法、装置(100)、设备(120)以及存储介质,该方法包括:获取待处理的第一语音数据(S301);识别第一语音数据对应的语气信息以及语音文本信息(S302);对语音文本信息进行翻译,得到翻译文本信息(S303);其中,翻译文本信息对应的语种不同于第一语音数据对应的语种;基于语音文本信息和语气信息的匹配关系,确定翻译文本信息和语气信息的匹配关系(S304);根据翻译文本信息和语气信息的匹配关系,生成第二语音数据(S305);其中,第二语音数据对应的语种不同于第一语音数据对应的语种,且第二语音数据用于在播放第一语音数据时在客户端进行呈现。

Description

语音处理方法、装置、设备以及存储介质 技术领域
本申请实施例涉及语音处理技术领域,尤其涉及一种语音处理方法、装置、设备以及存储介质。
背景技术
现有的同传***,可以将用户说话的语音信息转换为另一种语言对应的翻译文字信息,并且利用语音合成模型将该翻译文字信息合成相应的语音进行显示。其中,同传***不仅可以应用于国际会议、产品发布会等会议中,还可以应用于人们的日常生活中。例如,在工作中,可以利用同传***进行技术分享或视频会议;在生活中,可以利用同传***满足社交或旅游场景中的相关需求。然而,在现有的同传过程中,传译方式固定且单一,而且准确度偏低。
发明内容
为解决相关技术问题,本申请实施例期望提供一种语音处理方法、装置、设备以及存储介质。
本申请实施例的技术方案可以如下实现:
第一方面,本申请实施例提供了一种语音处理方法,该方法包括:
获取待处理的第一语音数据;
识别所述第一语音数据对应的语气信息以及语音文本信息;
对所述语音文本信息进行翻译,得到翻译文本信息;其中,所述翻译文本信息对应的语种不同于所述第一语音数据对应的语种;
基于所述语音文本信息和所述语气信息的匹配关系,确定所述翻译文本信息和所述语气信息的匹配关系;
根据所述翻译文本信息和所述语气信息的匹配关系,生成第二语音数据;其中,所述第二语音数据对应的语种不同于所述第一语音数据对应的语种,且所述第二语音数据用于在播放所述第一语音数据时在客户端进行呈现。
第二方面,本申请实施例提供了一种语音处理装置,该语音处理装置包括获取单元、识别单元、翻译单元、匹配单元和生成单元,其中,
获取单元,配置为获取待处理的第一语音数据;
识别单元,配置为识别所述第一语音数据对应的语气信息以及语音文本信息;
翻译单元,配置为对所述语音文本信息进行翻译,得到翻译文本信息;其中,所述翻译文本信息对应的语种不同于所述第一语音数据对应的语种;
匹配单元,配置为基于所述语音文本信息和所述语气信息的匹配关系,确定所述翻译文本信息和所述语气信息的匹配关系;
生成单元,配置为根据所述翻译文本信息和所述语气信息的匹配关系,生成第二语音数据;其中,所述第二语音数据对应的语种不同于所述第一语音数据对应的语种,且所述第二语音数据用于在播放所述第一语音数据时在客户端进行呈现。
第三方面,本申请实施例提供了一种设备,该设备包括存储器和处理器,其中,
存储器,用于存储能够在所述处理器上运行的计算机程序;
处理器,用于在运行所述计算机程序时,执行如第一方面所述的方法。
第四方面,本申请实施例提供了一种计算机存储介质,该计算机存储介质存储有语音处理程序,所述语音处理程序被至少一个处理器执行时实现如第一方面所述的方法。
本申请实施例提供了一种语音处理方法、装置、设备以及存储介质,其中,获取待处理的第一语音数据;识别第一语音数据对应的语气信息以及语音文本信息;对语音文本信息进行翻译,得到翻译文本信息;其中,翻译文本信息对应的语种不同于第一语音数据对应的语种;基于语音文本信息和语气信息的匹配关系,确定翻译文本信息和语气信息的匹配关系;根据所述翻译文本信息和所述语气信息的匹配关系,生成第二语音数据;其中,所述第二语音数据对应的语种不同于所述第一语音数据对应的语种,且所述第二语音数据用于在播放所述第一语音数据时在客户端进行呈现;这样,针对待处理的第一语音数据,不仅能够识别文本信息,还能够识别语气信息,如此在得到第二语音数据的过程中,不仅考虑了翻译文本信息,而且还考虑了语气信息,使得第二语音数据在播放的过程中还能够体现语气变化;也就是说,本申请实施例的技术方案可以方便、实时地获取演讲者的语气变化信息,从而有效避免了用户对同传***中演讲者说话语气变化的判断发生混乱或者错误现象,提高了同传***的准确性。
附图说明
图1为相关技术方案提供的一种同传***应用的***架构示意图
图2为相关技术方案提供的一种同传***的工作流程示意图;
图3为本申请实施例提供的一种语音处理方法的流程示意图;
图4为本申请实施例提供的另一种语音处理方法的流程示意图;
图5为本申请实施例提供的又一种语音处理方法的流程示意图;
图6为本申请实施例提供的再一种语音处理方法的流程示意图;
图7为本申请实施例提供的再一种语音处理方法的流程示意图;
图8为本申请实施例提供的一种语音处理方法的详细流程示意图;
图9为本申请实施例提供的另一种语音处理方法的详细流程示意图;
图10为本申请实施例提供的一种语音处理装置的组成结构示意图;
图11为本申请实施例提供的另一种语音处理装置的组成结构示意图;
图12为本申请实施例提供的一种设备的具体硬件结构示意图。
具体实施方式
为了能够更加详尽地了解本申请实施例的特点与技术内容,下面结合附图对本申请实施例的实现进行详细阐述,所附附图仅供参考说明之用,并非用来限定本申请实施例。
应当理解,在本申请实施例中,使用用于表示元件的诸如“模块”、“部件”或“单元”的后缀仅为了有利于本申请实施例的说明,其本身没有特定的意义。因此,“模块”、“部件”或“单元”可以混合地使用。
另外,在后续的描述中,所涉及的术语“第一\第二”仅仅是区别类似的对象,不代表针对对象的特定排序,可以理解地,“第一\第二”在允许的情况下可以互换特定的顺序或先后次序,以使这里描述的本申请实施例能够以除了在这里图示或描述的以外的顺序实施。
参见图1,其示出了相关技术中同传***应用的***架构示意图。如图1所示,所述***可以包括:机器同传服务端110、语音处理服务器120、观众移动端130、个人电 脑(PC,Personal Computer)客户端140和显示屏幕150。
实际应用中,演讲者可以通过PC客户端140进行会议演讲,在进行会议演讲的过程中,PC客户端140采集演讲者的语音数据,将采集的语音数据发送给机器同传服务端110,所述机器同传服务端110通过语音处理服务器120对语音数据进行识别,得到识别结果(所述识别结果可以是与语音数据相同语种的语音文本信息,也可以是对语音文本信息进行翻译后所得到的其它语种的翻译文本信息);机器同传服务端110可以将识别结果发送给PC客户端140,由PC客户端140将识别结果投屏到显示屏幕150上;还可以将识别结果发送给观众移动端130(具体依据用户所需的语种,对应发送相应语种的识别结果),为用户展示识别结果,从而实现将演讲者的演讲内容翻译成用户需要的语种并进行展示。
也就是说,针对相关技术中的同传***,用户在使用的过程中,可以将用户说话的语音信息转换为另一种语言对应的文字,而且文字通过显示装置进行显示,其中,显示装置上所显示的文字代表了用户的语音内容。例如,用户通过同传***进行演讲时,该同传***可以将演讲者的发言内容进行翻译,同时将翻译结果输出在显示装置或显示屏幕上。具体地,参见图2,其示出了相关技术方案提供的一种同传***的工作流程示意图。如图2所示,该工作流程可以包括:
S201:启动同传***;
S202:同传***对用户的发言内容进行语音识别,得到语音文本信息;
S203:同传***对语音文本信息进行机器翻译处理,得到翻译文本信息;
S204:同传***将翻译文本信息输出到显示装置进行显示;
S205:关闭同传***。
在上述的工作流程中,同传***只是将用户的发言内容或者对应的翻译内容在显示装置上显示,但是用户说话的语气信息没有显示出来。例如,用户在使用同传***聆听演讲时,同传***输出演讲者的发言内容或者对应的翻译内容,这时候演讲者已经说完对应的话语,导致用户看到同传***的输出结果与用户听到的演讲者说话中包含的语气并不同步,使得用户对演讲者的说话语气变化的判断容易发生混乱或者错误;尤其是在同传***中,当用户听不清或者听不到演讲者的发音时,这时候用户对演讲者的说话语气变化的判断更容易发生混乱或者错误,使得用户体验变差。
本申请实施例提供了一种语音处理方法,在获取待处理的第一语音数据后,识别第一语音数据对应的语气信息以及语音文本信息;对语音文本信息进行翻译,得到翻译文本信息;其中,翻译文本信息对应的语种不同于第一语音数据对应的语种;基于语音文本信息和语气信息的匹配关系,确定翻译文本信息和语气信息的匹配关系;根据翻译文本信息和语气信息的匹配关系,生成第二语音数据;其中,第二语音数据对应的语种不同于第一语音数据对应的语种,且第二语音数据用于在播放第一语音数据时在客户端进行呈现;这样,针对待处理的第一语音数据,不仅能够识别文本信息,还能够识别语气信息,如此在得到第二语音数据的过程中,不仅考虑了翻译文本信息,而且还考虑了语气信息,使得第二语音数据在播放的过程中还能够体现语气变化;也即在显示文本信息的同时,还可以将语气发生变化的文字以不同的显示策略进行显示;也就是说,用户可以方便、实时地获取演讲者的语气变化信息,从而有效避免了用户对演讲者说话语气变化的判断发生混乱或者错误现象;而且当用户听不清或者听不到演讲者的发音时,也不会影响用户获取演讲者的说话语气信息,提高了同传***的准确性。
下面将结合附图对本申请各实施例进行详细描述。
本申请的一实施例中,参见图3,其示出了本申请实施例提供的一种语音处理方法的流程示意图。如图3所示,该方法可以包括:
S301:获取待处理的第一语音数据;
需要说明的是,本申请实施例的语音处理方法可以应用于同传***,而且该方法的执 行主体是语音处理装置。其中,该语音处理装置可以位于服务器侧或终端设备侧,服务器可以为具有语音处理功能的服务器,诸如文件服务器、数据库服务器等;终端设备可以为具有语音处理功能的终端,比如,终端设备可以包括智能手机、平板电脑、笔记本电脑、掌上电脑、个人数字助理(Personal Digital Assistant,PDA)、便捷式媒体播放器(Portable Media Player,PMP)、导航装置、数字电视和台式计算机等,本申请实施例不作具体限定。
还需要说明的是,针对待处理的第一语音数据,可以通过语音采集单元进行语音采集得到的。具体地,所述获取待处理的第一语音数据,可以包括:通过语音采集单元进行语音采集,获得待处理的第一语音数据。
也就是说,语音处理装置中可以包括有语音采集单元。其中,该语音采集单元可以是终端设备中的麦克风或话筒。这样,通过语音采集单元在采集得到待处理的第一语音数据之后,可以对该第一语音数据进行后续的语音识别处理或者语气识别处理。
S302:识别第一语音数据对应的语气信息以及语音文本信息;
需要说明的是,在得到第一语音数据之后,可以对该第一语音数据进行识别处理。具体地,一方面可以识别出第一语音数据对应的语气信息,另一方面还可以识别出第一语音数据对应的语音文本信息。
在一些实施例中,所述识别所述第一语音数据对应的语气信息以及语音文本信息,可以包括:
对所述第一语音数据进行语气识别,确定所述第一语音数据对应的语气信息;
对所述第一语音数据进行语音识别,确定所述第一语音数据对应的语音文本信息。
这里,可以通过语气识别单元对该第一语音数据进行语气识别,以得到第一语音数据对应的语气信息;还可以通过语音识别单元对该第一语音数据进行语音识别,以得到第一语音数据对应的语音识别结果,即语音文本信息;其中,语音文本信息的语种与第一语音数据的语种是相同的。
具体地,语气识别单元的输入为第一语音数据,输出为语气信息;这里,语气信息可以包括第一语气信息,该第一语气信息表示语气信息中的任意一种语气信息。由于不同的语气信息所对应的语气时间是不同的,而且语气类型也是不同的;比如语气类型可以是温柔语气类型、高亢语气类型、低沉语气类型和轻快语气类型等;也就是说,第一语气信息包括有第一语气类型和第一语气时间,即在第一语气时间,此时用户的说话语气为第一语气类型。
语音识别单元的输入为第一语音数据,输出为语音文本信息;这里,语音文本信息表示了第一语音数据所对应语种的文本信息;其中,语音文本信息包括第一文字,第一文字可以是通过语音识别单元所得到的语音识别结果,而且该第一文字表示为语音文本信息中的任意文字。
S303:对所述语音文本信息进行翻译,得到翻译文本信息;
需要说明的是,在得到语音文本信息之后,可以通过翻译单元对语音文本信息进行翻译处理,以得到翻译结果,即翻译文本信息。其中,翻译文本信息对应的语种不同于第一语音数据对应的语种。
还需要说明的是,翻译文本信息包括有第一翻译文字,第一翻译文字可以是通过翻译单元所得到的翻译结果,而且该第一翻译文字对应的语种与第一文字对应的语种不同。
这样,在获取到待处理的第一语音数据之后,可以通过对该第一语音数据的识别以及翻译处理,从而能够得到该第一语音数据对应的语气信息和翻译文本信息,方便后续根据语气信息和翻译文本信息的匹配关系,以选取合适的显示策略对翻译文本信息进行显示。
S304:基于所述语音文本信息和所述语气信息的匹配关系,确定所述翻译文本信息和所述语气信息的匹配关系;
需要说明的是,在识别出第一语音数据对应的语气信息以及语音文本信息之后,可以 对语音文本信息和语气信息进行匹配,以确定出语音文本信息中的第一文字与语气信息中的第一语气信息的匹配对;然后还可以根据第一文字与第一语气信息的匹配对,结合与第一文字对应的第一翻译文字,从而可以得到第一翻译文字与第一语气信息的匹配对,也就建立了翻译文本信息与语气信息的匹配关系。
这样,假定后续需要通过显示单元来显示语音识别结果(即语音文本信息),那么可以根据第一文字与第一语气信息的匹配对,针对语音识别结果在显示语音文本信息的同时,还需要根据显示策略体现出语气信息;或者,假定后续需要通过显示单元来显示翻译结果(即翻译文本信息),那么可以根据第一翻译文字与第一语气信息的匹配对,针对翻译结果在显示翻译文本信息的同时,也需要根据显示策略体现出语气信息。
S305:根据所述翻译文本信息和所述语气信息的匹配关系,生成第二语音数据。
这里,第二语音数据对应的语种不同于所述第一语音数据对应的语种,但是第二语音数据对应的语种与翻译文本信息对应的语种相同;另外,第二语音数据是根据翻译文本信息和语气信息进行语音合成得到的,而且第二语音数据用于在播放第一语音数据时在客户端进行呈现。
在一些实施例中,对于S305来说,所述根据所述翻译文本信息和所述语气信息的匹配关系,生成第二语音数据,可以包括:
根据所述翻译文本信息,确定目标合成模型;
根据所述翻译文本信息和所述语气信息的匹配关系,利用目标合成模型对所述翻译文本信息和所述语气信息进行语音合成,得到所述第二语音数据。
需要说明的是,目标合成模型为表征对翻译文本信息和语气信息进行语音合成的模型。这样,在得到翻译文本信息和语气信息之后,可以根据翻译文本信息和语气信息的匹配关系,利用目标合成模型得到第二语音数据;如此,由于在呈现第二语音数据时还可以体现语气变化,从而能够提高同传***的准确性。
进一步地,在一些实施例中,对于S305来说,在所述生成第二语音数据之后,该方法还可以包括:
根据所述翻译文本信息和所述语气信息的匹配关系,获得所述翻译文本信息对应的显示结果。
本申请实施例中,显示结果表示在播放第一语音数据时根据所述语气信息在客户端呈现所述翻译文本信息。
需要说明的是,显示策略可以包括有颜色区分策略、文字区分策略、字体大小区分策略、位置区分策略、风格区分策略、图标区分策略、图形区分策略和图像区分策略等等,本申请实施例不作具体限定。
还需要说明的是,在语音处理装置中,可以预先存储有语气信息与显示策略的对应关系;也就是说,不同的语气信息将选择不同的显示策略。如此,根据翻译文本信息和语气信息的匹配关系,可以得到翻译文本信息所对应的语气信息,结合预先存储的语气信息与显示策略的对应关系,从而可以确定出翻译文本信息所对应的显示策略,然后所获得的显示结果即为按照对应的显示策略在客户端呈现翻译文本信息。
在一些实施例中,所述根据所述翻译文本信息和所述语气信息的匹配关系,获得所述翻译文本信息对应的显示结果,可以包括:
获取预设的语气信息与显示策略的对应关系;
根据所获取的语气信息与显示策略的对应关系,确定所述语气信息对应的显示策略;
根据所确定的显示策略以及所述翻译文本信息和所述语气信息的匹配关系,获得所述翻译文本信息对应的显示结果;其中,所述显示结果表示在播放所述第一语音数据时按照所确定的显示策略在客户端呈现所述翻译文本信息。
示例性地,假定显示策略以字体大小区分策略为例,如果语气信息为温柔语气类型, 对应的翻译文本信息为文字A,那么文字A在客户端将以常规四号字体作为显示策略进行显示;如果语气信息为轻快语气类型,对应的翻译文本信息为文字B,那么文字B在客户端将以加粗四号字体作为显示策略进行显示;如果语气信息为高亢语气类型,对应的翻译文本信息为文字C,那么文字C在客户端将以常规三号字体作为显示策略进行显示;如果语气信息为低沉语气类型,对应的翻译文本信息为文字D,那么文字D在客户端将以常规五号字体作为显示策略进行显示。或者,假定显示策略以颜色区分策略为例,如果语气信息为温柔语气类型,对应的翻译文本信息为文字A,那么文字A在客户端将以蓝色作为显示策略进行显示;如果语气信息为轻快语气类型,对应的翻译文本信息为文字B,那么文字B在客户端将以绿色作为显示策略进行显示;如果语气信息为高亢语气类型,对应的翻译文本信息为文字C,那么文字C在客户端将以红色作为显示策略进行显示;如果语气信息为低沉语气类型,对应的翻译文本信息为文字D,那么文字D在客户端将以黑色作为显示策略进行显示;本申请实施例对此不作具体限定。
这样,在确定出翻译文本信息对应的显示策略之后,这时候可以按照该显示策略在客户端呈现翻译文本信息。如此,使得用户在观看的时候,不仅可以获得翻译文本信息,还可以获得语气信息;也就是说,本申请实施例的技术方案使得用户可以方便、实时地获取演讲者的语气变化信息,从而避免了用户对演讲者说话语气变化的判断发生混乱或者错误现象;而且当用户听不清或者听不到演讲者的发音时,也不会影响用户获取演讲者的说话语气信息。
进一步地,当需要在客户端呈现语音文本信息时,这时候可以根据所确定的显示策略以及语音文本信息和语气信息的匹配关系,获得所述语音文本信息对应的显示结果。因此,在一些实施例中,在确定所述语气信息对应的显示策略之后,该方法还可以包括:
根据所确定的显示策略以及所述语音文本信息和所述语气信息的匹配关系,获得所述语音文本信息对应的显示结果;其中,所述显示结果表示在播放所述第一语音数据时按照所确定的显示策略在客户端呈现所述语音文本信息。
也就是说,当所显示的文本信息为语音文本信息时,这时候可以确定出语音文本信息中第一文字对应的显示策略,然后按照所确定的显示策略对第一文字进行显示;或者当所显示的文本信息为翻译文本信息时,这时候可以确定出翻译文本信息中第一翻译文字对应的显示策略,然后按照所确定的显示策略对第一翻译文字进行显示。如此,可以使得用户方便、及时地获取演讲者的语气变化信息,避免了用户对演讲者说话语气变化的判断发生混乱或者错误现象。
在本申请实施例中,该语音处理方法不仅可以应用于同传***,还可以应用于其他音视频***中,例如,还可以应用到视频会议***中;即当用户通过视频会议***进行会议时,该视频会议***的显示装置(或显示单元)可以将演讲者的发言内容或者翻译内容显示出来,而且不同文字还可以根据该文字所对应的不同说话语气以不同的显示策略进行显示。如此,在同传***的应用场景中,演讲者的语音数据转为语音文字,然后语音文字翻译成另一语种的翻译文字,由于不同语种所表达语气的方式不同,以及同传***本身的时间延迟,这样,另一语种的聆听者在观看所显示的翻译文字时,将无法同步地感受到说话人的语气信息,这时候可以在翻译文字中实时加入语气信息,有效避免了因用户看到同传***的输出结果与用户听到的演讲者说话中包含的语气不同步而导致用户对演讲者说话语气变化的判断发生混乱或者错误现象,使得另一语种的聆听者能够正确理解到演讲者在表达当前语句时的语气信息。
由于本申请实施例的语音处理装置除了包括语音识别单元、翻译单元和显示单元外,该语音识别装置中还包括有语气识别单元。这样,用户使用本申请实施例所提供的同传***,该同传***不仅可以将演讲者的发言内容(即语音识别结果)或者对应的翻译内容(即翻译结果)在显示单元进行显示,其中,针对发言内容或者对应的翻译内容中的不同文字 (包括不同的字或词)还可以根据该字或词对应的不同语气信息选择不同的显示策略进行显示。例如,用户使用同传***聆听演讲,当同传***输出演讲者的发言内容或者对应的翻译内容时,在发言内容或者对应的翻译内容中说话语气发生变化的词或者字可以以不同的显示方式进行显示;这样,用户在观看的时候,不仅可以获得演讲者的发言内容或者对应的翻译内容,还可以获得演讲者说话的语气信息;从而避免了因用户看到同传***的输出结果与用户听到的演讲者说话中包含的语气不同步而导致用户对演讲者说话语气变化的判断发生混乱或者错误现象;而且当用户听不清或者听不到演讲者的发音时,也不会影响用户获取演讲者说话语气信息。
本实施例提供了一种语音处理方法,获取待处理的第一语音数据;识别第一语音数据对应的语气信息以及语音文本信息;对语音文本信息进行翻译,得到翻译文本信息;其中,翻译文本信息对应的语种不同于第一语音数据对应的语种;基于语音文本信息和语气信息的匹配关系,确定翻译文本信息和语气信息的匹配关系;根据翻译文本信息和语气信息的匹配关系,生成第二语音数据;其中,第二语音数据对应的语种不同于第一语音数据对应的语种,且第二语音数据用于在播放第一语音数据时在客户端进行呈现;这样,针对待处理的第一语音数据,不仅识别出了文本信息,还识别出了语气信息;如此在得到第二语音数据的过程中,不仅考虑了翻译文本信息,而且还考虑了语气信息,使得第二语音数据在播放的过程中还能够体现语气变化;也即在显示文本信息的同时,还可以将语气发生变化的文字以不同的显示策略进行显示;如此,本申请实施例的技术方案可以方便、及时地获取演讲者的语气变化信息,从而避免了用户对演讲者说话语气变化的判断发生混乱或者错误现象;而且当用户听不清或者听不到演讲者的发音时,也不会影响用户获取演讲者的说话语气信息,提高了同传***的准确性。
本申请的另一实施例中,从语气识别单元的角度,参见图4,其示出了本申请实施例提供的另一种语音处理方法的流程示意图。如图4所示,该方法可以包括:
S401:输入待处理的第一语音数据;
S402:判断是否通过第一信息对所述第一语音数据进行语气识别;
需要说明的是,第一信息用于表征第一语音数据中声音的频谱特征,比如谱特征信息,可以包括音频谱、能量谱、LOG能量谱和Mel谱等。其中,针对待处理的第一语音数据进行语气识别,可以是通过第一信息来对第一语音数据进行语气识别,也可以通过其他信息(比如第二信息或第三信息)来对第一语音数据进行语气识别,本申请实施例不作具体限定。
S403:当判断结果为通过第一信息对所述第一语音数据进行语气识别时,从所述第一语音数据中提取第一信息;
S404:根据所述第一信息对所述第一语音数据进行语气识别,确定第一语音数据对应的语气信息;
需要说明的是,如果判断结果为通过第一信息对第一语音数据进行语气识别,那么可以从第一语音数据中提取第一信息,然后根据第一信息对第一语音数据进行语气识别,以确定出第一语音数据对应的语气信息。
具体地,在一些实施例中,所述根据所述第一信息对所述第一语音数据进行语气识别,确定所述第一语音数据对应的语气信息,可以包括:
将第一信息输入预设识别模型,通过预设识别模型输出所述第一语音数据对应的语气信息。
也就是说,当根据第一信息对第一语音数据进行语气识别时,这时候可以采用模型的方式,即将第一信息输入预设识别模型,然后通过该预设识别模型输出第一语音数据对应的语气信息。
在本申请实施例中,预设识别模型可以是预先采用机器学习算法训练得到的模型。具 体地,可以采用决策树、支持向量机、神经网络、深度神经网络等机器学习算法,利用训练样本进行模型训练,以得到预设识别模型。其中,训练样本可以是从样本第一语音数据中提取出的多个样本信息,根据多个样本信息进行训练,可以不断优化预设识别模型,使得预设识别模型的识别结果更准确,从而能够提高语气信息的识别准确性。
S405:当判断结果为不通过第一信息对所述第一语音数据进行语气识别时,从所述第一语音数据中提取第二信息;
S406:根据所述第二信息对所述第一语音数据进行语气识别,确定第一语音数据对应的语气信息;
S407:输出第一语音数据对应的语气信息。
需要说明的是,如果判断结果为不通过第一信息对第一语音数据进行语气识别,那么可以从第一语音数据中提取第二信息,然后根据第二信息对第一语音数据进行语气识别,以确定出第一语音数据对应的语气信息。
这里,第二信息表征第一语音数据中声音的能量大小。其中,声音具有能量,例如超声波的应用就体现声音的能量;通常可以利用声强来反映能量的大小;其中,声强是指单位时间内声波通过垂直于传播方向单位面积的声能量。这样,可以通过第一语音数据中声音的能量信息来识别出语气信息,而且通过能量大小就可以反映第一语音数据中的语气变化。
进一步地,在一些实施例中,对于S405来说,当判断结果为不通过第一信息对所述第一语音数据进行语气识别时,该方法还可以包括:
从所述第一语音数据中提取第三信息;
根据所述第三信息对所述第一语音数据进行语气识别,确定第一语音数据对应的语气信息。
需要说明的是,如果判断结果为不通过第一信息对第一语音数据进行语气识别,那么还可以从第一语音数据中提取第三信息,然后根据第三信息对第一语音数据进行语气识别,以确定出第一语音数据对应的语气信息。
这里,第三信息表征第一语音数据中声音的音量大小。其中,音量又称为响度、音强,主要是指人耳对所听到的声音大小强弱的主观感受,其客观评价标准为声音的振幅大小。这样,可以通过第一语音数据中声音的音量信息来识别出语气信息,而且通过音量大小也可以反映第一语音数据中的语气变化;另外,由于音量信息更贴近于人耳的主观感受,可以使得所识别的语气信息更贴近于真实语气信息。
也就是说,当判断结果为不通过第一信息(比如谱特征信息)对第一语音数据进行语气识别时,这时候可以根据第二信息(比如能量信息)对第一语音数据进行语气识别,也可以根据第三信息(比如音量信息)对第一语音数据进行语气识别。另外,对于第二信息或者第三信息来说,可以通过将其与预设信息阈值进行比较,根据比较的结果确定出第一语音数据对应的语气信息。
具体地,在一些实施例中,所述根据所述第二信息对所述第一语音数据进行语气识别,确定所述第一语音数据对应的语气信息,包括:
将第二信息与预设信息阈值进行比较;
当第二信息大于预设信息阈值且所述第二信息大于预设信息阈值的持续时间超过预设时间阈值时,确定第一语音数据对应第一类型的语气信息;
当第二信息小于或等于预设信息阈值或者所述第二信息大于预设信息阈值的持续时间没有超过预设时间阈值时,确定第一语音数据对应第二类型的语气信息。
具体地,在一些实施例中,所述根据所述第三信息对所述第一语音数据进行语气识别,确定第一语音数据对应的语气信息,可以包括:
将第三信息与预设信息阈值进行比较;
当第三信息大于预设信息阈值且所述第三信息大于预设信息阈值的持续时间超过预设时间阈值时,确定第一语音数据对应第一类型的语气信息;
当第三信息小于或等于预设信息阈值或者所述第三信息大于预设信息阈值的持续时间没有超过预设时间阈值时,确定第一语音数据对应第二类型的语气信息。
需要说明的是,语气信息还可以划分为第一类型的语气信息和第二类型的语气信息,且第一类型和第二类型不同。这里,第一类型的语气信息和第一类型的语气信息是根据第二信息或者第三信息来判断的。
示例性地,以第二信息为例,假定第一类型的语气信息可以是重语气信息,第二类型的语气信息可以是轻语气信息;那么当第二信息大于预设信息阈值且第二信息大于预设信息阈值的持续时间超过预设时间阈值时,这时候第一语音数据的语气信息确定为重语气信息;当第二信息小于或等于预设信息阈值或者第二信息大于预设信息阈值的持续时间没有超过预设时间阈值时,这时候第一语音数据的语气信息确定为轻语气信息。
本申请实施例中,应用于同传***的语音处理装置中包括有语气识别单元,该语气识别单元的输入为待处理的第一语音数据,输出为演讲者(说话人)的语气信息。这样,当显示单元上显示演讲者的发言内容(即语音文本信息)或者对应的翻译内容(即翻译文本信息)时,根据所识别出的语气信息可以将发言内容或者对应的翻译内容中的不同文字根据该字或词对应的不同语气信息选择不同的显示策略进行显示。也就是说,语气识别单元可以通过谱特征信息对第一语音数据中的语气变化进行判断,也可以通过音量信息或能量信息对第一语音数据中的语气变化进行判断。其中,通过谱特征信息对第一语音数据中的语气进行判断时,可以使用模型的方式,将谱特征信息输入预设识别模型,通过该预设识别模型输出对应的语气信息;而通过音量信息或者能量信息等对第一语音数据中的语气进行判断时,此时可以通过音量信息或者能量信息的变化规律进行判断;具体地,当音量信息或者能量信息连续在一段时间内超过预设信息阈值时,这时候可以输出加重的语气,比如这时候的语气信息为重语气信息;否则,输出其他语气,比如这时候的语气信息为轻语气信息。由于语气识别模块的输出结果为语气信息,且语气信息可以包括有语气类型和语气时间;这里,语气类型表示了演讲者的说话语气,语气时间表示了演讲者的语气变化时间。如此,根据该语气时间可以更好地得到语音识别结果与语气信息的匹配关系。
本申请的又一实施例中,从语音识别单元的角度,参见图5,其示出了本申请实施例提供的又一种语音处理方法的流程示意图。如图5所示,该方法可以包括:
S501:输入待处理的第一语音数据;
S502:对第一语音数据进行语音识别,确定第一语音数据对应的语音文本信息,所述语音文本信息包括第一文字;
S503:根据语气识别单元所输入的语气信息,将语音文本信息和语气信息进行匹配;
S504:输出语音文本信息中的第一文字与语气信息中的第一语气信息的匹配对。
需要说明的是,语气信息中包括有第一语气信息,而第一语气信息包括第一语气类型和第一语气时间。具体地,针对将所述语音文本信息和所述语气信息进行匹配,可以包括:
将所述语音文本信息和所述语气信息进行匹配,根据所述第一语气信息中的第一语气时间确定所述语音文本信息中的第一文字,并根据所述第一语气时间确定所述第一语气信息中的第一语气类型;
根据所述第一文字、所述第一语气时间以及所述第一语气类型,得到所述第一文字与所述第一语气信息的匹配对。
也就是说,针对第一语气信息中所包括的第一语气时间,可以较准确地确定语音文本信息(即语音识别结果)中的第一文字,从而根据该第一语气时间可以更好地得到语音文本信息与语气信息的匹配关系;另外,该第一语气时间还可以在显示单元进行显示时,能够确定第一文字在对应的显示策略下所需要持续的显示时间,从而能够更好地获取演讲者 的语气变化信息,避免用户对演讲者说话语气变化的判断发生混乱或者错误现象。
本申请实施例中,应用于同传***的语音处理装置中还包括有语音识别单元,这里,语音识别单元和语气识别单元是并行工作的;也就是说,语音识别结果和语气信息是同步执行的。在语音识别单元得到语音识别结果之后,可以根据语气识别单元所得到的语气信息,对语音识别结果进行匹配,能够得到语音文本信息中的第一文字(字或词)对应的第一语气信息,从而形成了第一文字与第一语气信息的匹配对。如此,在后续显示单元中对语音文本信息进行显示的时候,不仅可以显示文本信息,还可以将语气发生变化的文字以不同的显示策略进行显示,使得用户在观看的时候,不仅可以获得文字信息,还可以获得该文字对应的语气信息,避免了用户对演讲者说话语气变化的判断发生混乱或者错误现象。
本申请的再一实施例中,从翻译单元的角度,参见图6,其示出了本申请实施例提供的再一种语音处理方法的流程示意图。如图5所示,该方法可以包括:
S601:输入语音文本信息;
S602:对语音文本信息进行翻译处理,确定第一语音数据对应的翻译文本信息;其中,翻译文本信息包括第一翻译文字;
S603:根据语音文本信息中的第一文字与语气信息中的第一语气信息的匹配对,对翻译文本信息和语气信息进行匹配;
S604:输出翻译文本信息中的第一翻译文字与语气信息中的第一语气信息的匹配对。
需要说明的是,第一文字与第一翻译文字之间具有对应关系,即第一文字经过翻译处理后得到第一翻译文字。这样,根据语音文本信息中的第一文字与第一语气信息的匹配对,可以得到第一翻译文字与第一语气信息的匹配对。
本申请实施例中,应用于同传***的语音处理装置中还包括有翻译单元,这里,翻译单元是在语音识别单元之后工作的,即翻译单元和语音识别单元为串行工作。在翻译单元得到翻译文本信息之后,还可以根据语音文本信息中的第一文字与语气信息中的第一语气信息的匹配对,对翻译文本信息进行匹配,能够得到翻译文本信息中的第一翻译文字(字或词)对应的第一语气信息,从而形成了第一翻译文字与第一语气信息的匹配对。如此,在后续显示单元中对翻译文本信息进行显示的时候,不仅可以显示文本信息,还可以将语气发生变化的翻译文字以不同的显示策略进行显示,使得用户在观看的时候,不仅可以获得文字信息,还可以获得该文字对应的语气信息,避免了用户对演讲者说话语气变化的判断发生混乱或者错误现象。
本申请的再一实施例中,从显示单元的角度,参见图7,其示出了本申请实施例提供的再一种语音处理方法的流程示意图。如图7所示,该方法可以包括:
S701:输入待显示的文本信息以及文本信息与语气信息的匹配关系;
S702:根据文本信息与语气信息的匹配关系,确定待显示的文本信息对应的显示策略;
S703:按照所确定的显示策略,输出待显示的文本信息。
需要说明的是,待显示的文本信息可以是语音文本信息中的第一文字,也可以是翻译文本信息中的第一翻译文字,本申请实施例不作具体限定。
还需要说明的是,显示策略可以包括有颜色区分策略、文字区分策略、字体大小区分策略、位置区分策略、风格区分策略、图标区分策略、图形区分策略和图像区分策略等等,本申请实施例不作具体限定。
另外,在语音处理装置中,可以预先存储有语气信息与显示策略的对应关系,即不同的语气信息将选择不同的显示策略。如此,根据翻译文本信息和语气信息的匹配关系,可以得到翻译文本信息所对应的语气信息,结合预先存储的语气信息与显示策略的对应关系,从而可以确定出翻译文本信息所对应的显示策略,然后所获得的显示结果即为按照对应的显示策略在客户端呈现翻译文本信息。还需要注意的是,针对语气信息中的语气时间,一方面可以是根据该语气时间,能够更好地得到语音文本信息中的第一文字与第一语气信息 的匹配关系;另一方面还可以是在显示单元进行显示时,能够确定第一文字或第一翻译文字在对应的显示策略下所需要持续的显示时间,从而能够更好地获取演讲者的语气变化信息,避免用户对演讲者说话语气变化的判断发生混乱或者错误现象。
本申请实施例中,应用于同传***的语音处理装置中包括有显示单元,该显示单元可以显示演讲者的发言内容(即语音文本信息)或者对应的翻译内容(即翻译文本信息);这时候需要根据待显示的文本信息与语气信息的匹配关系,比如语音文本信息中的第一文字与第一语气信息的匹配对,或者翻译文本信息中的第一翻译文字与第一语气信息的匹配对;这样,针对不同文字的不同说话语气,可以选择不同的显示策略,比如可以是按照文字、颜色、位置、字体大小、风格、图标、图形或图像等进行区分显示。如此,在显示单元将演讲者的发言内容或者对应的翻译内容进行显示时,可以将发言内容或者对应的翻译内容中的不同文字根据该字或词对应的不同语气信息选择不同的显示策略进行显示,从而使得用户在观看的时候,不仅可以获得文字信息,还可以获得该文字对应的语气信息,避免了用户对演讲者说话语气变化的判断发生混乱或者错误现象。
下面将通过两个详细流程示例对该语音处理方法的具体应用场景进行详细描述。
基于前述实施例相同的发明构思,参见图8,其示出了本申请实施例提供的一种语音处理方法的详细流程示意图。如图8所示,该详细流程可以包括:
S801:获取待处理的第一语音数据;
S802:对第一语音数据进行语气识别,确定第一语音数据对应的语气信息,所述语气信息包括第一语气信息;
S803:对第一语音数据进行语音识别,确定第一语音数据对应的语音文本信息,所述语音文本信息包括第一文字;
需要说明的是,可以通过语气识别单元对该第一语音数据进行语气识别,以得到第一语音数据对应的语气信息;还可以通过语音识别单元对该第一语音数据进行语音识别,以得到第一语音数据对应的语音识别结果(即语音文本信息);这里,语气识别单元和语音识别单元两者是并行工作的。
还需要说明的是,第一语气信息包括有第一语气类型和第一语气时间,即在第一语气时间,此时用户的说话语气为第一语气类型;语音文本信息中包括有第一文字,这样,在获得语音文本信息之后,可以确定语音文本信息中的第一文字与第一语气信息的匹配对。
S804:将语音文本信息和语气信息进行匹配,确定所述语音文本信息中的第一文字与所述语气信息中的第一语气信息的匹配对;
S805:根据所述语音文本信息中的第一文字与所述语气信息中的第一语气信息的匹配对,确定所述第一文字对应的显示策略;
S806:将所述第一文字按照所确定的显示策略在客户端进行呈现。
需要说明的是,由于语音信息中包括有第一语气信息,而第一语气信息包括第一语气类型和第一语气时间。具体地,针对将所述语音文本信息和所述语气信息进行匹配,可以包括:将语音文本信息和所述语气信息进行匹配,根据所述第一语气信息中的第一语气时间确定所述语音文本信息中的第一文字,并根据所述第一语气时间确定所述第一语气信息中的第一语气类型;根据所述第一文字、所述第一语气时间以及所述第一语气类型,得到所述第一文字与所述第一语气信息的匹配对。
也就是说,当显示单元需要显示演讲者的发言内容(即语音文本信息)时,这时候在确定出语音文本信息中的第一文字与语气信息中的第一语气信息的匹配对之后,可以根据语音文本信息中的第一文字与语气信息中的第一语气信息的匹配对,确定出第一文字对应的显示策略,然后按照所确定的显示策略对语音文本信息中的第一文字进行显示。也就是说,在客户端(或显示单元)将演讲者的发言内容进行显示时,可以将发言内容中的不同文字根据该字或词对应的不同语气信息选择不同的显示策略进行显示,从而使得用户在观 看的时候,能够更好地获取演讲者的语气变化信息,避免了用户对演讲者说话语气变化的判断发生混乱或者错误现象。
通过上述实施例,对前述实施例的具体实现进行了详细阐述,从中可以看出,针对待处理的第一语音数据,不仅识别出了文本信息(比如语音文本信息),还识别出了语气信息;如此在得到第二语音数据的过程中,不仅考虑了翻译文本信息,而且还考虑了语气信息,使得第二语音数据在播放的过程中还能够体现语气变化;也即在显示文字的同时,还可以将语气发生变化的文字以不同的显示策略进行显示,使得用户在观看的时候,不仅可以获得文字信息,还可以获得该文字对应的语气信息;如此,可以方便、及时地获取演讲者的语气变化信息,从而避免了用户对演讲者说话语气变化的判断发生混乱或者错误现象;而且当用户听不清或者听不到演讲者的发音时,也不会影响用户获取演讲者的说话语气信息。
基于前述实施例相同的发明构思,参见图9,其示出了本申请实施例提供的另一种语音处理方法的详细流程示意图。如图9所示,该详细流程可以包括:
S901:获取待处理的第一语音数据;
S902:对第一语音数据进行语气识别,确定第一语音数据对应的语气信息,所述语气信息包括第一语气信息;
S903:对第一语音数据进行语音识别,确定第一语音数据对应的语音文本信息,所述语音文本信息包括第一文字;
需要说明的是,可以通过语气识别单元对该第一语音数据进行语气识别,以得到第一语音数据对应的语气信息;还可以通过语音识别单元对该第一语音数据进行语音识别,以得到第一语音数据对应的语音文本信息;这里,语气识别单元和语音识别单元两者是并行工作的。
S904:将语音文本信息和语气信息进行匹配,确定所述语音文本信息中的第一文字与所述语气信息中的第一语气信息的匹配对;
S905:对所述语音文本信息进行翻译处理,确定所述第一语音数据对应的翻译文本信息;其中,所述翻译文本信息包括与所述第一文字对应的第一翻译文字;
需要说明的是,可以通过翻译单元对该语音文本信息进行翻译处理,以得到第一语音数据对应的翻译文本信息。这里,翻译单元和语音识别单元两者是串行工作的,而且翻译单元是在语音识别单元之后进行处理。
还需要说明的是,第一文字与第一翻译文字之间具有对应关系,即第一文字经过翻译处理后得到第一翻译文字。这样,根据语音文本信息中的第一文字与第一语气信息的匹配对,后续还可以得到第一翻译文字与第一语气信息的匹配对。
S906:根据第一文字与第一语气信息的匹配对,对翻译文本信息和语气信息进行匹配,确定所述翻译文本信息中的第一翻译文字与所述第一语气信息的匹配对;
S907:根据所述第一翻译文字与所述第一语气信息的匹配对,确定所述第一翻译文字对应的显示策略;
S908:将所述第一翻译文字按照所确定的显示策略在客户端进行呈现。
需要说明的是,当显示单元需要显示演讲者发言对应的翻译内容(即翻译文本信息)时,这时候在确定出翻译文本信息中的第一翻译文字与第一语气信息的匹配对之后,然后根据翻译文本信息中的第一翻译文字与语气信息中的第一语气信息的匹配对,确定出第一翻译文字对应的显示策略,然后按照所确定的显示策略对翻译结果中的第一翻译文字进行显示。也就是说,在客户端(或显示单元)将演讲者发言对应的翻译内容进行显示时,可以将翻译内容中的不同文字根据该字或词对应的不同语气信息选择不同的显示策略进行显示,从而使得用户在观看的时候,能够更好地获取演讲者的语气变化信息,避免了用户对演讲者说话语气变化的判断发生混乱或者错误现象。
也就是说,在本申请实施例中,应用于同传***的语音处理装置新增加了语气识别单元,而且该语气识别单元可以通过谱特征信息进行语气识别,也可以通过音量信息或能量信息进行语气识别;然后根据语气识别单元的输出结果,对语音文本信息进行匹配,可以得到语音文本信息中的第一文字(包括字或词)与语气信息中的第一语气信息的匹配对;再根据语音文本信息中的第一文字与第一语气信息的匹配对,对翻译文本信息进行匹配,可以得到翻译文本信息中第一翻译文字(包括字或词)与第一语气信息的匹配对;最后根据显示单元中待显示的字或词与语气信息的匹配关系,选择与之对应的显示策略,然后在显示单元中按照该显示策略将对应的字或词进行显示。
通过上述实施例,对前述实施例的具体实现进行了详细阐述,从中可以看出,针对待处理的第一语音数据,不仅识别出了文本信息,还识别出了语气信息;如此在得到第二语音数据的过程中,不仅考虑了翻译文本信息,而且还考虑了语气信息,使得第二语音数据在播放的过程中还能够体现语气变化;也即在显示文本信息的同时,还可以将语气发生变化的文字以不同的显示策略进行显示,使得用户在观看的时候,不仅可以获得文字信息,还可以获得该文字对应的语气信息;如此,可以方便、实时地获取演讲者的语气变化信息,从而避免了用户对演讲者说话语气变化的判断发生混乱或者错误现象;而且当用户听不清或者听不到演讲者的发音时,也不会影响用户获取演讲者的说话语气信息,提高了同传***的准确性。
基于前述实施例相同的发明构思,参见图10,其示出了本申请实施例提供的一种语音处理装置100的组成结构示意图。如图10所示,该语音处理装置100可以包括:获取单元1001、识别单元1002、、翻译单元1003、匹配单元1004和生成单元1005,其中,
获取单元1001,配置为获取待处理的第一语音数据;
识别单元1002,配置为识别所述第一语音数据对应的语气信息以及语音文本信息;
翻译单元1003,配置为对所述语音文本信息进行翻译,得到翻译文本信息;其中,翻译文本信息对应的语种不同于所述第一语音数据对应的语种;
匹配单元1004,配置为基于所述语音文本信息和所述语气信息的匹配关系,确定所述翻译文本信息和所述语气信息的匹配关系;
生成单元1005,配置为根据所述翻译文本信息和所述语气信息的匹配关系,生成第二语音数据;其中,所述第二语音数据对应的语种不同于所述第一语音数据对应的语种,且所述第二语音数据用于在播放所述第一语音数据时在客户端进行呈现。
在上述方案中,参见图11,所述识别单元1002还包括有语气识别单元1021和语音识别单元1022,其中,
语气识别单元1021,配置为对所述第一语音数据进行语气识别,确定所述第一语音数据对应的语气信息;
语音识别单元1022,配置为对所述第一语音数据进行语音识别,确定所述第一语音数据对应的语音文本信息。
在上述方案中,参见图11,语音处理装置100还可以包括提取单元1006,配置为从所述第一语音数据中提取第一信息,所述第一信息表征所述第一语音数据中声音的频谱特征;
语气识别单元1021,配置为将所述第一信息输入预设识别模型,通过所述预设识别模型输出所述第一语音数据对应的语气信息。
在上述方案中,参见图11,语音处理装置100还可以包括比较单元1007,其中,
提取单元1006,还配置为从所述第一语音数据中提取第二信息,所述第二信息表征所述第一语音数据中声音的能量大小;
比较单元1007,配置为将所述第二信息与预设信息阈值进行比较;
语气识别单元1021,还配置为根据比较的结果确定所述第一语音数据对应的语气信息。
在上述方案中,语气识别单元1021,具体配置为当所述第二信息大于所述预设信息阈 值且所述第二信息大于所述预设信息阈值的持续时间超过预设时间阈值时,确定所述第一语音数据对应第一类型的语气信息;或者,当所述第二信息小于或等于所述预设信息阈值或者所述第二信息大于所述预设信息阈值的持续时间没有超过预设时间阈值时,确定所述第一语音数据对应第二类型的语气信息,所述第一类型与所述第二类型不同。
在上述方案中,匹配单元1004,还配置为对所述语音文本信息和所述语气信息进行匹配,确定所述语音文本信息中的第一文字和所述语气信息中的第一语气信息的匹配对;其中,所述语音文本信息包括所述第一文字,所述语气信息包括所述第一语气信息;以及根据所述第一文字与所述第一语气信息的匹配对,获得所述语音文本信息和所述语气信息的匹配关系。
在上述方案中,匹配单元1004,还配置为根据所述第一文字与所述第一语气信息的匹配对,确定所述翻译文本信息中的第一翻译文字与所述第一语气信息的匹配对;其中,所述翻译文本信息包括与所述第一文字对应的第一翻译文字;以及根据所述第一翻译文字与所述第一语气信息的匹配对,获得所述翻译文本信息和所述语气信息的匹配关系。
在上述方案中,参见图11,语音处理装置100还可以包括确定单元1008,配置为根据所述翻译文本信息,确定目标合成模型;其中,所述目标合成模型表征对所述翻译文本信息和所述语气信息进行语音合成的模型;
生成单元1005,具体配置为根据所述翻译文本信息和所述语气信息的匹配关系,利用目标合成模型对所述翻译文本信息和所述语气信息进行语音合成,得到所述第二语音数据。
在上述方案中,参见图11,语音处理装置100还可以包括显示单元1009,其中,
获取单元1001,还配置为获取预设的语气信息与显示策略的对应关系;
确定单元1008,还配置为根据所获取的语气信息与显示策略的对应关系,确定所述语气信息对应的显示策略;
显示单元1009,配置为根据所确定的显示策略以及所述翻译文本信息和所述语气信息的匹配关系,获得所述翻译文本信息对应的显示结果;其中,所述显示结果表示在播放所述第一语音数据时按照所确定的显示策略在客户端呈现所述翻译文本信息。
在上述方案中,显示单元1009,还配置为根据所确定的显示策略以及所述语音文本信息和所述语气信息的匹配关系,获得所述语音文本信息对应的显示结果;其中,所述显示结果表示在播放所述第一语音数据时按照所确定的显示策略在客户端呈现所述语音文本信息。
可以理解地,在本实施例中,“单元”可以是部分电路、部分处理器、部分程序或软件等等,当然也可以是模块,还可以是非模块化的。而且在本实施例中的各组成部分可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。
所述集成的单元如果以软件功能模块的形式实现并非作为独立的产品进行销售或使用时,可以存储在一个计算机可读取存储介质中,基于这样的理解,本实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或processor(处理器)执行本实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
因此,本实施例提供了一种计算机存储介质,该计算机存储介质存储有语音处理程序,所述语音处理程序被至少一个处理器执行时实现前述实施例中任一项所述的方法。
参见图12,其示出了本申请实施例提供的一种设备120的具体硬件结构示意图。如图 12所示,该设备120可以包括:通信接口1201、存储器1202和处理器1203;各个组件通过总线***1204耦合在一起。可理解,总线***1204用于实现这些组件之间的连接通信。总线***1204除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图12中将各种总线都标为总线***1204。其中,
通信接口1201,用于在与其他外部网元之间进行收发信息过程中,信号的接收和发送;
存储器1202,用于存储能够在处理器1203上运行的计算机程序;
处理器1203,用于在运行所述计算机程序时,执行:
获取待处理的第一语音数据;
识别所述第一语音数据对应的语气信息以及语音文本信息;
对所述语音文本信息进行翻译,得到翻译文本信息;其中,所述翻译文本信息对应的语种不同于所述第一语音数据对应的语种;
基于所述语音文本信息和所述语气信息的匹配关系,确定所述翻译文本信息和所述语气信息的匹配关系;
根据所述翻译文本信息和所述语气信息的匹配关系,生成第二语音数据;其中,所述第二语音数据对应的语种不同于所述第一语音数据对应的语种,且所述第二语音数据用于在播放所述第一语音数据时在客户端进行呈现。
可以理解,本申请实施例中的设备120可以是终端设备,也可以是服务器。这里,存储器1202可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步链动态随机存取存储器(Synchronous link DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DRRAM)。而处理器1203可能是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器1203中的硬件的集成逻辑电路或者软件形式的指令完成。该处理器1203可以是通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。其中,软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1202,处理器1203读取存储器1202中的信息,结合其硬件完成上述方法的步骤。
还可以理解,本申请描述的这些实施例可以用硬件、软件、固件、中间件、微码或其组合来实现。其中,对于硬件实现,处理单元可以实现在一个或多个专用集成电路(Application Specific Integrated Circuits,ASIC)、数字信号处理器(Digital Signal Processing,DSP)、数字信号处理设备(DSP Device,DSPD)、可编程逻辑设备(Programmable Logic Device,PLD)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)、通用处理器、控制器、微控制器、微处理器、用于执行本申请所述功能的其它电子单元或其组合中。对于软件实现,可通过执行本申请所述功能的模块(例如过程、函数等)来实现本申请所述的技术。软件代码可存储在存储器中并通过处理器执行。存储器可以在处理器中或在处理器外部实现。
可选地,作为另一个实施例,处理器1203还配置为在运行所述计算机程序时,执行前述实施例中任一项所述的方法。
需要说明的是,在本申请中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者装置中还存在另外的相同要素。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
本申请所提供的几个方法实施例中所揭露的方法,在不冲突的情况下可以任意组合,得到新的方法实施例。
本申请所提供的几个产品实施例中所揭露的特征,在不冲突的情况下可以任意组合,得到新的产品实施例。
本申请所提供的几个方法或设备实施例中所揭露的特征,在不冲突的情况下可以任意组合,得到新的方法实施例或设备实施例。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (13)

  1. 一种语音处理方法,所述方法包括:
    获取待处理的第一语音数据;
    识别所述第一语音数据对应的语气信息以及语音文本信息;
    对所述语音文本信息进行翻译,得到翻译文本信息;其中,所述翻译文本信息对应的语种不同于所述第一语音数据对应的语种;
    基于所述语音文本信息和所述语气信息的匹配关系,确定所述翻译文本信息和所述语气信息的匹配关系;
    根据所述翻译文本信息和所述语气信息的匹配关系,生成第二语音数据;其中,所述第二语音数据对应的语种不同于所述第一语音数据对应的语种,且所述第二语音数据用于在播放所述第一语音数据时在客户端进行呈现。
  2. 根据权利要求1所述的方法,其中,所述根据所述翻译文本信息和所述语气信息的匹配关系,生成第二语音数据,包括:
    根据所述翻译文本信息,确定目标合成模型;其中,所述目标合成模型表征对所述翻译文本信息和所述语气信息进行语音合成的模型;
    根据所述翻译文本信息和所述语气信息的匹配关系,利用目标合成模型对所述翻译文本信息和所述语气信息进行语音合成,得到所述第二语音数据。
  3. 根据权利要求1所述的方法,其中,所述识别所述第一语音数据对应的语气信息以及语音文本信息,包括:
    对所述第一语音数据进行语气识别,确定所述第一语音数据对应的语气信息;
    对所述第一语音数据进行语音识别,确定所述第一语音数据对应的语音文本信息。
  4. 根据权利要求3所述的方法,其中,所述对所述第一语音数据进行语气识别,确定所述第一语音数据对应的语气信息,包括:
    从所述第一语音数据中提取第一信息,所述第一信息表征所述第一语音数据中声音的频谱特征;
    将所述第一信息输入预设识别模型,通过所述预设识别模型输出所述第一语音数据对应的语气信息。
  5. 根据权利要求3所述的方法,其中,所述对所述第一语音数据进行语气识别,确定所述第一语音数据对应的语气信息,包括:
    从所述第一语音数据中提取第二信息,所述第二信息表征所述第一语音数据中声音的能量大小;
    将所述第二信息与预设信息阈值进行比较,根据比较的结果确定所述第一语音数据对应的语气信息。
  6. 根据权利要求5所述的方法,其中,所述根据比较的结果,确定所述第一语音数据对应的语气信息,包括:
    当所述第二信息大于所述预设信息阈值且所述第二信息大于所述预设信息阈值的持续时间超过预设时间阈值时,确定所述第一语音数据对应第一类型的语气信息;
    或者,
    当所述第二信息小于或等于所述预设信息阈值或者所述第二信息大于所述预设信息阈值的持续时间没有超过预设时间阈值时,确定所述第一语音数据对应第二类型的语气信息,所述第一类型与所述第二类型不同。
  7. 根据权利要求1所述的方法,其中,在所述识别所述第一语音数据对应的语气信息以及语音文本信息之后,所述方法还包括:
    对所述语音文本信息和所述语气信息进行匹配,确定所述语音文本信息中的第一文字和所述语气信息中的第一语气信息的匹配对;其中,所述语音文本信息包括所述第一文字,所述语气信息包括所述第一语气信息;
    根据所述第一文字与所述第一语气信息的匹配对,获得所述语音文本信息和所述语气信息的匹配关系。
  8. 根据权利要求7所述的方法,其中,所述基于所述语音文本信息和所述语气信息的匹配关系,确定所述翻译文本信息和所述语气信息的匹配关系,包括:
    根据所述第一文字与所述第一语气信息的匹配对,确定所述翻译文本信息中的第一翻译文字与所述第一语气信息的匹配对;其中,所述翻译文本信息包括与所述第一文字对应的第一翻译文字;
    根据所述第一翻译文字与所述第一语气信息的匹配对,获得所述翻译文本信息和所述语气信息的匹配关系。
  9. 根据权利要求1至8任一项所述的方法,其中,在所述生成第二语音数据之后,所述方法还包括:
    获取预设的语气信息与显示策略的对应关系;
    根据所获取的语气信息与显示策略的对应关系,确定所述语气信息对应的显示策略;
    基于所确定的显示策略以及所述翻译文本信息和所述语气信息的匹配关系,获得所述翻译文本信息对应的显示结果;其中,所述显示结果表示在播放所述第一语音数据时按照所确定的显示策略在客户端呈现所述翻译文本信息。
  10. 根据权利要求9所述的方法,其中,在所述确定所述语气信息对应的显示策略之后,所述方法还包括:
    基于所确定的显示策略以及所述语音文本信息和所述语气信息的匹配关系,获得所述语音文本信息对应的显示结果;其中,所述显示结果表示在播放所述第一语音数据时按照所确定的显示策略在客户端呈现所述语音文本信息。
  11. 一种语音处理装置,所述语音处理装置包括获取单元、识别单元、翻译单元、匹配单元和生成单元,其中,
    所述获取单元,配置为获取待处理的第一语音数据;
    所述识别单元,配置为识别所述第一语音数据对应的语气信息以及语音文本信息;
    所述翻译单元,配置为对所述语音文本信息进行翻译,得到翻译文本信息;其中,所述翻译文本信息对应的语种不同于所述第一语音数据对应的语种;
    所述匹配单元,配置为基于所述语音文本信息和所述语气信息的匹配关系,确定所述翻译文本信息和所述语气信息的匹配关系;
    所述生成单元,配置为根据所述翻译文本信息和所述语气信息的匹配关系,生成第二语音数据;其中,所述第二语音数据对应的语种不同于所述第一语音数据对应的语种,且所述第二语音数据用于在播放所述第一语音数据时在客户端进行呈现。
  12. 一种设备,所述设备包括存储器和处理器,其中,
    所述存储器,用于存储能够在所述处理器上运行的计算机程序;
    所述处理器,用于在运行所述计算机程序时,执行如权利要求1至10任一项所述的方法。
  13. 一种计算机存储介质,其中,所述计算机存储介质存储有语音处理程序,所述语音处理程序被至少一个处理器执行时实现如权利要求1至10任一项所述的方法。
PCT/CN2019/130767 2019-12-31 2019-12-31 语音处理方法、装置、设备以及存储介质 WO2021134592A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/130767 WO2021134592A1 (zh) 2019-12-31 2019-12-31 语音处理方法、装置、设备以及存储介质
CN201980101004.XA CN114467141A (zh) 2019-12-31 2019-12-31 语音处理方法、装置、设备以及存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/130767 WO2021134592A1 (zh) 2019-12-31 2019-12-31 语音处理方法、装置、设备以及存储介质

Publications (1)

Publication Number Publication Date
WO2021134592A1 true WO2021134592A1 (zh) 2021-07-08

Family

ID=76687221

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/130767 WO2021134592A1 (zh) 2019-12-31 2019-12-31 语音处理方法、装置、设备以及存储介质

Country Status (2)

Country Link
CN (1) CN114467141A (zh)
WO (1) WO2021134592A1 (zh)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102196100A (zh) * 2010-03-04 2011-09-21 深圳富泰宏精密工业有限公司 通话即时翻译***及方法
CN101937431A (zh) * 2010-08-18 2011-01-05 华南理工大学 情感语音翻译装置及处理方法
KR20160138613A (ko) * 2015-05-26 2016-12-06 한국전자통신연구원 이모티콘을 이용한 자동통역 방법 및 이를 이용한 장치
CN107992485A (zh) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 一种同声传译方法及装置
US10565994B2 (en) * 2017-11-30 2020-02-18 General Electric Company Intelligent human-machine conversation framework with speech-to-text and text-to-speech
CN108447486A (zh) * 2018-02-28 2018-08-24 科大讯飞股份有限公司 一种语音翻译方法及装置
CN108831436A (zh) * 2018-06-12 2018-11-16 深圳市合言信息科技有限公司 一种模拟说话者情绪优化翻译后文本语音合成的方法
CN109033423A (zh) * 2018-08-10 2018-12-18 北京搜狗科技发展有限公司 同传字幕显示方法及装置、智能会议方法、装置及***
CN110008481A (zh) * 2019-04-10 2019-07-12 南京魔盒信息科技有限公司 翻译语音生成方法、装置、计算机设备和存储介质
CN110401671A (zh) * 2019-08-06 2019-11-01 董玉霞 一种同传翻译***及同传翻译终端

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913857A (zh) * 2022-06-23 2022-08-16 中译语通科技股份有限公司 基于多语言会议***的实时转写方法、***、设备及介质

Also Published As

Publication number Publication date
CN114467141A (zh) 2022-05-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19958690

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 01.12.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19958690

Country of ref document: EP

Kind code of ref document: A1