WO2021062757A1 - Simultaneous interpretation method and apparatus, and server and storage medium - Google Patents


Info

Publication number
WO2021062757A1
Authority
WO
WIPO (PCT)
Prior art keywords: data, voice data, translated text, language, text
Application number: PCT/CN2019/109677
Other languages: French (fr), Chinese (zh)
Inventor: 郝杰
Original Assignee: 深圳市欢太科技有限公司; Oppo广东移动通信有限公司
Application filed by 深圳市欢太科技有限公司 and Oppo广东移动通信有限公司
Priority to PCT/CN2019/109677 (WO2021062757A1)
Priority to CN201980099995.2A (CN114341866A)
Publication of WO2021062757A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions

Definitions

  • This application relates to simultaneous interpretation technology, and in particular to a simultaneous interpretation method, apparatus, server, and storage medium.
  • Machine simultaneous interpretation is a speech translation product for conference scenarios that has emerged in recent years. It combines automatic speech recognition (ASR, Automatic Speech Recognition) technology and machine translation (MT, Machine Translation) technology to provide multilingual subtitle display for a conference speaker's speech content, replacing manual simultaneous interpretation services.
  • In related technologies, the speech content is usually translated and displayed as text, but the displayed content alone cannot enable users to truly understand the content of the speech.
  • embodiments of the present application provide a simultaneous interpretation method, device, server and storage medium.
  • the embodiment of the application provides a simultaneous interpretation method, which is applied to a server, and includes: obtaining first to-be-processed data; translating first voice data in the first to-be-processed data to obtain a first translated text; generating second voice data according to the first translated text; and performing at least one of the following: obtaining a typeset document according to the first voice data and the first translated text; performing image word processing on first image data in the first to-be-processed data to obtain an image processing result;
  • the first image data includes at least a display document corresponding to the first voice data;
  • the language corresponding to the first voice data is different from the language corresponding to the typeset document; the language corresponding to the first voice data is different from the language corresponding to the first translated text; the language corresponding to the first voice data is different from the language corresponding to the second voice data; the language of the text displayed in the first image data is different from the language of the text included in the image processing result; the first translated text, typeset document, second voice data, and image processing result are used for presentation on the client when the first voice data is played.
  • the embodiment of the present application also provides a simultaneous interpretation device, including:
  • An obtaining unit configured to obtain the first to-be-processed data
  • the first processing unit is configured to translate the first voice data in the first to-be-processed data to obtain a first translated text
  • the second processing unit is configured to generate second voice data according to the first translated text
  • the third processing unit is configured to perform at least one of the following:
  • the first image data includes at least a display document corresponding to the first voice data;
  • the language corresponding to the first voice data is different from the language corresponding to the typeset document; the language corresponding to the first voice data is different from the language corresponding to the first translated text; the language corresponding to the first voice data is different from the language corresponding to the second voice data; the language of the text displayed in the first image data is different from the language of the text included in the image processing result; the first translated text, typeset document, second voice data, and image processing result are used for presentation on the client when the first voice data is played.
  • the embodiment of the present application further provides a server, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor.
  • the processor implements the steps of any one of the above simultaneous interpretation methods when executing the program.
  • the embodiment of the present application also provides a storage medium on which computer instructions are stored, and when the instructions are executed by a processor, the steps of any of the aforementioned simultaneous interpretation methods are implemented.
  • the simultaneous interpretation method, device, server, and storage medium provided in the embodiments of this application obtain the first to-be-processed data; translate the first voice data in the first to-be-processed data to obtain the first translated text; generate second voice data according to the first translated text; and perform at least one of the following: obtain a typeset document according to the first voice data and the first translated text; perform image word processing on the first image data in the first to-be-processed data to obtain an image processing result;
  • the first image data includes at least a display document corresponding to the first voice data; wherein the language corresponding to the first voice data is different from the language corresponding to the typeset document; the language corresponding to the first voice data is different from the language corresponding to the first translated text; the language corresponding to the first voice data is different from the language corresponding to the second voice data; the language of the text displayed in the first image data is different from the language of the text included in the image processing result; the first translated text, typeset document, second voice data, and image processing result are used for presentation on the client when the first voice data is played.
  • Figure 1 is a schematic diagram of the system architecture to which the simultaneous interpretation method in the related technology is applied;
  • FIG. 2 is a schematic flowchart of a simultaneous interpretation method according to an embodiment of the application
  • FIG. 3 is a schematic diagram of a system architecture applied by the simultaneous interpretation method according to an embodiment of the application
  • FIG. 4 is a schematic diagram of the composition structure of the simultaneous interpretation device according to an embodiment of the application.
  • FIG. 5 is a schematic diagram of the composition structure of a server in an embodiment of the application.
  • Figure 1 is a schematic diagram of the system architecture to which the simultaneous interpretation method in the related technology is applied; as shown in Figure 1, the system may include: a machine simultaneous interpretation server, a speech recognition server, a translation server, a mobile terminal delivery server, a viewer mobile terminal, a PC (Personal Computer) client, and a display screen.
  • the lecturer can give a conference lecture through the PC client, project the displayed documents, such as presentation (PPT, PowerPoint) documents, onto the display screen, and show them to users through the display screen.
  • the PC client collects the speaker’s audio and sends the collected audio to the machine simultaneous interpretation server.
  • the machine simultaneous interpretation server recognizes the audio data through the speech recognition server to obtain recognized text, and then translates the recognized text through the translation server to obtain the translation result; the machine simultaneous interpretation server sends the translation result to the PC client, and sends the translation result to the viewer's mobile terminal through the mobile terminal delivery server, so as to display the translation result to the user.
  • the speech content of the speaker can be translated into the language required by the user and displayed.
  • the solutions in related technologies can display speech content (i.e., translation results) in different languages, but only perform simultaneous interpretation for the speaker's oral content, without translating the documents presented by the speaker, making it difficult for users of different languages to understand the content of those documents.
  • the current machine simultaneous interpretation technology is mostly a visual display of text content. During the speaker's speech, displaying large amounts of text does not help the user understand the speech content well; the above problems lead to a poor user experience.
  • In the embodiments of this application, the speech content is translated to obtain a translation result (which may include translated speech and text); the translation result is sorted (such as abstract extraction and typesetting) to obtain a typeset document and an abstract document, and the displayed document is translated; the translation result, the sorted documents, and the translated display document are sent to the audience's mobile terminals for display, to help users understand the content of the speech and also to make it convenient for users to summarize it.
  • FIG. 2 is a schematic flowchart of a simultaneous interpretation method according to an embodiment of the application; as shown in FIG. 2, the method includes:
  • Step 201: Obtain first to-be-processed data.
  • the first data to be processed includes: first voice data and first image data.
  • the first image data includes at least a display document corresponding to the first voice data.
  • the display document may be a Word document, a PPT document or a document in other forms, which is not limited here.
  • the first voice data and first image data may be collected by the first terminal and sent to the server.
  • the first terminal may be a terminal such as a PC or a tablet computer.
  • the first terminal may be provided with or connected to a voice collection module, such as a microphone, through which voice collection is performed to obtain first voice data.
  • the first terminal may be provided with or connected to an image acquisition module (which can be implemented by a stereo camera, a binocular camera, or a structured light camera), and the image acquisition module can be used to photograph the displayed document to obtain the first image data.
  • the first terminal may have a screenshot function, and the first terminal may take a screenshot of a document displayed on its display screen, and use the screenshot result as the first image data.
  • When the speaker is giving a speech, the first terminal (such as a PC) uses the voice collection module to collect the speech content to obtain the first voice data; when the speaker displays a document related to the speech content (such as a PPT document), the first terminal uses the image acquisition module to photograph the displayed PPT document, or takes a screenshot of the PPT document on its own display screen, to obtain the first image data.
  • a communication connection is established between the first terminal and the server.
  • the first terminal sends the acquired first voice data and first image data as the first to-be-processed data to the server, and the server can acquire the first to-be-processed data.
  • Step 202: Translate the first voice data in the first to-be-processed data to obtain a first translated text.
  • the translating the first voice data in the first to-be-processed data to obtain the first translated text includes:
  • the server may use voice recognition technology to perform voice recognition on the first voice data to obtain recognized text.
  • the server may use a preset translation model to translate the recognized text to obtain the first translated text.
  • the translation model is used to translate a text in a first language into at least one text in a second language; the first language is different from the second language.
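As a rough illustration of this two-stage recognize-then-translate flow, the Python sketch below wires a stand-in recognizer and a toy translation table together. `recognize`, `translate`, `get_first_translated_text`, and the lexicon entries are hypothetical stand-ins for the speech recognition and translation services, not part of this application.

```python
# Minimal sketch of Step 202's recognize-then-translate flow.
# The ASR and MT stages are toy stand-ins; only the control flow
# mirrors the described method.

def recognize(first_voice_data: bytes) -> str:
    """Stand-in ASR: pretend the audio decodes to this text."""
    return "hello everyone"

# Toy first-language -> second-language lexicon (hypothetical).
TOY_MODEL = {
    ("hello everyone", "zh"): "大家好",
    ("hello everyone", "fr"): "bonjour à tous",
}

def translate(recognized_text: str, target_lang: str) -> str:
    """Stand-in preset translation model; the first language
    differs from the second language."""
    return TOY_MODEL[(recognized_text, target_lang)]

def get_first_translated_text(first_voice_data: bytes, target_lang: str) -> str:
    recognized = recognize(first_voice_data)   # speech recognition
    return translate(recognized, target_lang)  # machine translation
```

In a real deployment the two stages would be calls to the speech recognition server and the translation server, respectively.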
  • Step 203: Generate second voice data according to the first translated text, and perform at least one of the following: obtain a typeset document according to the first voice data and the first translated text; perform image word processing on the first image data in the first to-be-processed data to obtain an image processing result.
  • the language corresponding to the first voice data is different from the language corresponding to the typeset document; the language corresponding to the first voice data is different from the language corresponding to the first translated text; the language corresponding to the first voice data is different from the language corresponding to the second voice data; the language of the text displayed in the first image data is different from the language of the text included in the image processing result;
  • the first translated text, typeset document, second voice data, and image processing result are used for presentation on the client when the first voice data is played.
  • the first translated text, the typeset document, and the second voice data are used to send to the client, so as to display the content corresponding to the first voice data on the client when the first voice data is played.
  • the image processing result is used to send to the client to display the content corresponding to the display document included in the first image data on the client when the first voice data is played.
  • In addition to the above method, the server may also use a preset speech translation model to translate the first voice data to obtain the second voice data corresponding to the first voice data, and then perform voice recognition on the second voice data to obtain the first translated text.
  • typesetting can be performed on the content of the first voice data to obtain a typeset document. A concise and clear document layout can help users read and understand intuitively. In addition, it is also convenient for the user to later summarize and organize the content of the first voice data.
  • the determining the typeset document according to the first voice data and the first translated text includes: performing voice activity detection (VAD, Voice Activity Detection) on the first voice data to determine at least one silence point in the first voice data; segmenting the first translated text according to the at least one silence point to obtain at least one paragraph; and typesetting the at least one paragraph to obtain the typeset document.
  • the server may perform voice activity detection on the first voice data, determine the silence periods in the first voice data, and record the silence duration of each silence period. When the silence duration meets a condition (for example, the silence duration exceeds a preset duration), the determined silence period is used as a silence point in the first voice data.
  • the server can then segment according to the silence points. Specifically, the first translated text corresponding to the first voice data is pre-segmented to obtain at least one pre-segmented paragraph; the context corresponding to each silence point in the first translated text can be obtained, semantic analysis is performed on the context by using natural language processing (NLP, Natural Language Processing) technology, and whether to segment by the pre-segmented paragraphs is determined according to the semantic analysis result.
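The silence-point detection described above can be sketched as a single pass over per-frame energies. The function name, threshold, and frame values below are illustrative assumptions; a real VAD works on richer acoustic features.

```python
# Illustrative silence-point detection over frame energies: a run of
# low-energy frames that lasts at least `min_silence_frames` becomes
# a silence point (all values here are made up for illustration).

def find_silence_points(energies, threshold=0.1, min_silence_frames=3):
    """Return the frame index where each qualifying silence period ends."""
    points, run = [], 0
    for i, e in enumerate(energies):
        if e < threshold:
            run += 1            # inside a silence period
        else:
            if run >= min_silence_frames:
                points.append(i)  # silence period ended just before frame i
            run = 0
    if run >= min_silence_frames:   # trailing silence at the end
        points.append(len(energies))
    return points

frames = [0.9, 0.8, 0.0, 0.0, 0.0, 0.0, 0.7, 0.9, 0.0, 0.0, 0.6]
print(find_silence_points(frames))  # -> [6]
```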
  • the generating second voice data according to the first translated text includes: segmenting the first translated text, and converting each segment into segmented voice by using text-to-speech (TTS, Text To Speech) technology to obtain the second voice data.
  • segmenting the first translated text may include: performing semantic recognition on the first translated text, and segmenting the first translated text according to the semantic recognition result.
  • a combination of voice activity detection technology and semantic recognition technology can also be used to perform segmentation. The specific description has been described in the above-mentioned determining the typeset document based on the first voice data and the first translated text, and will not be repeated here.
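A minimal sketch of this segment-then-synthesize step: the first translated text is split at sentence-final punctuation (a crude stand-in for semantic recognition), then each segment is passed through a stand-in `synthesize` function in place of a real TTS engine.

```python
# Sketch of generating the second voice data from the first translated
# text. Punctuation-based splitting stands in for semantic recognition;
# `synthesize` is a hypothetical stand-in that encodes the segment
# length instead of producing real audio.
import re

def segment_text(first_translated_text: str) -> list[str]:
    parts = re.split(r"(?<=[.!?])\s+", first_translated_text.strip())
    return [p for p in parts if p]

def synthesize(segment: str) -> bytes:
    """Stand-in TTS: 2-byte big-endian length instead of audio."""
    return len(segment).to_bytes(2, "big")

def generate_second_voice_data(first_translated_text: str) -> list[bytes]:
    return [synthesize(s) for s in segment_text(first_translated_text)]

segs = segment_text("Hello everyone. Welcome to the talk!")
# segs == ["Hello everyone.", "Welcome to the talk!"]
```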
  • the first translated text can be abstracted, which can help the user to summarize the content of the first voice data, so that the user can better understand the first voice data.
  • the method may further include:
  • Abstract extraction is performed on the first translated text to obtain a summary document for the first translated text; the summary document is used for presentation on the client when the first voice data is played.
  • the NLP technology is used to perform automatic summary (Automatic Summarization) extraction on the first translated text to obtain a summary document for the first translated text.
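As an illustration only, the snippet below implements a crude frequency-based extractive summarizer in place of the NLP automatic summarization mentioned above; real summarization systems are far richer, and the function name and scoring are assumptions.

```python
# Toy extractive summarizer: score each sentence by the total corpus
# frequency of its words and keep the top-k sentences in their
# original order. A stand-in for real automatic summarization.
from collections import Counter
import re

def summarize(text: str, k: int = 1) -> list[str]:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(re.findall(r"\w+", text.lower()))
    # score = sum of word frequencies in the sentence
    scored = [(sum(freq[w] for w in re.findall(r"\w+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:k]
    return [s for _, i, s in sorted(top, key=lambda t: t[1])]
```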
  • the performing image word processing on the first image data to obtain an image processing result includes: performing character recognition on the first image data by using optical character recognition (OCR, Optical Character Recognition) technology to extract the text in the first image data; translating the extracted text; and generating the image processing result according to the translated text.
  • the OCR technology is a technology that performs character recognition on an image to convert the characters in the image into text.
  • Use interface positioning technology to determine the position corresponding to the text.
  • the translating the extracted text includes: using a preset translation model to translate the text.
  • the translation model is used to translate text in a first language into at least one text in a second language; the first language is different from the second language.
  • the generating of the image processing result according to the translated text includes at least one of the following: generating second image data according to the translated text and the first image data, and using the second image data as the image processing result; generating a second translated text by using the translated text, and using the second translated text as the image processing result.
  • the image processing result may include at least one of the following: second image data and second translated text.
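As a rough sketch of this OCR-translate-regenerate path, the snippet below models OCR output as (text, position) boxes, translates each box with a toy lexicon, and keeps the translated text at its original position. The `ocr` stub, `LEXICON`, and `process_image_words` are hypothetical stand-ins, not the application's actual services.

```python
# Sketch of image word processing: OCR -> translate -> regenerate.
# Boxes pair recognized text with its position so the translated
# text can be written back at the same place (second image data).

ToyBox = tuple[str, tuple[int, int]]            # (text, (x, y))
LEXICON = {"Agenda": "议程", "Summary": "总结"}  # toy translation table

def ocr(first_image_data: bytes) -> list[ToyBox]:
    """Stand-in OCR: pretend these boxes were recognized."""
    return [("Agenda", (10, 20)), ("Summary", (10, 200))]

def process_image_words(first_image_data: bytes):
    boxes = ocr(first_image_data)                          # extract text
    translated = [(LEXICON[t], pos) for t, pos in boxes]   # translate it
    second_translated_text = "\n".join(t for t, _ in translated)
    second_image_data = translated   # translated text kept at its position
    return second_image_data, second_translated_text
```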
  • the simultaneous interpretation data obtained by using the first to-be-processed data corresponds to at least one language; the method may further include:
  • the simultaneous interpretation data corresponding to at least one language is stored in different databases according to the language.
  • the simultaneous interpretation data includes: the first translated text and the second voice data, and also includes at least one of a typeset document, an image processing result, and an abstract document.
  • the simultaneous interpretation data corresponding to at least one language can be stored in different databases according to the language: the first translated text and the second voice data of the same language, together with at least one of the corresponding typeset document, image processing result, and abstract document, are stored in the same database, and the database corresponds to a language identifier.
  • Sending the simultaneous interpretation data to each client is executed as a serial service. In order to ensure the timeliness of sending simultaneous interpretation data to multiple clients at the same time, the server can directly obtain the corresponding result from a cache, which ensures high timeliness of simultaneous interpretation data delivery and also protects the server's computing resources.
  • the method may further include:
  • the simultaneous interpretation data corresponding to at least one language is classified and cached according to the language.
  • the server may predetermine the preset language of each client in at least one client, and obtain the simultaneous interpretation data corresponding to the preset language from the database for caching.
  • In this way, when a client selects a language, the simultaneous interpretation data of the corresponding language can be directly obtained from the cache, thereby improving timeliness and protecting computing resources.
  • It is possible that a client selects another language different from the preset language, in which case the simultaneous interpretation data of that other language may not be cached.
  • When the server determines that a client has sent an acquisition request selecting another language different from its preset language, the server may also cache the simultaneous interpretation data of that other language; when another client selects the same language, the corresponding simultaneous interpretation data can be directly obtained from the cache, thereby improving timeliness and protecting computing resources.
  • the simultaneous interpretation data corresponding to the target language can be obtained according to the acquisition request sent by the user through the client.
  • the method may further include:
  • the client may be provided with a human-computer interaction interface through which the user can select a language.
  • the client generates an acquisition request containing the target language according to the user's selection and sends the acquisition request to the server, so that the server receives the acquisition request.
  • the client can be installed on a mobile phone; here, considering that most users carry their mobile phones with them, the simultaneous interpretation data is sent to the client installed on the mobile phone, without adding other devices to receive and display the simultaneous interpretation data, which can save costs and is easy to operate.
  • the first to-be-processed data corresponds to simultaneous interpretation data in at least one language; the simultaneous interpretation data includes the first translated text and the second voice data, and also includes at least one of the following: the typeset document, the image processing result, and the abstract document. That is, the first to-be-processed data corresponds to the first translated text in at least one language, the second voice data in at least one language, and at least one of the following: a typeset document in at least one language, an image processing result in at least one language, and an abstract document in at least one language.
  • the corresponding simultaneous interpretation data can be obtained according to the target time sent by the client.
  • the acquisition request may include a target time; when the simultaneous interpretation data corresponding to the target language is acquired from the cached content, the method further includes:
  • the simultaneous interpretation data corresponding to the target time is obtained from the cache; the time correspondence represents the time relationship between the various data in the simultaneous interpretation data.
  • the user can also select the time through the human-computer interaction interface, and the client generates an acquisition request containing the target time according to the user's selection.
  • the simultaneous interpretation method is applied to a meeting; the user selects a time point in the meeting as the target time.
  • The time relationship between the data in the simultaneous interpretation data refers to the time relationship between the first translated text, the second voice data, and at least one of the typeset document, the image processing result, and the abstract document.
  • the time correspondence is generated in advance according to the time axis of the first voice data and the time point when the first image data is acquired.
  • The scheme in which the acquisition request contains the target language and the scheme in which the acquisition request contains the target time can be implemented separately (in which case the target language can be preset by the client), or in the same scheme (that is, the acquisition request includes both the target language and the target time, and the server obtains the simultaneous interpretation data of the target time corresponding to the target language).
  • the correspondence relationship between the data in the simultaneous interpretation data can be generated in advance. Based on the correspondence relationship, when a certain piece of data in the simultaneous interpretation data is acquired, the corresponding other data can be obtained at the same time. For example, when the first translated text is obtained, the second voice data, summary document, and typeset document corresponding to the first translated text can be obtained correspondingly, as can the image processing result corresponding to the display document.
  • the method further includes:
  • the respective data in the simultaneous interpretation data are correspondingly saved using the time correspondence relationship.
  • When the server receives the first voice data, it determines the receiving time, determines the end time according to the duration of the first voice data, and generates the first time axis for the first voice data according to the receiving time and the end time. In another embodiment, the first terminal determines the start time and the duration of the first voice data when collecting the first voice data and sends them to the server, and the server determines the first time axis of the first voice data accordingly.
  • the corresponding time point when the server obtains the first image data may be used as the time point for obtaining the first image data.
  • the first terminal determines the corresponding time point when capturing the first image data and sends the determined time point together with the first image data to the server; the server receives the time point and the first image data, and uses the time point as the time point for acquiring the first image data.
  • In this way, the time relationship between the first voice data and the first image data can be determined. Since the first translated text, the second voice data, the typeset document, and the summary document in the simultaneous interpretation data are all generated on the basis of the first voice data, the time relationship between each of them and the first voice data can also be determined. Based on this, the time correspondence between the data in the simultaneous interpretation data can be generated.
  • the time correspondence relationship may be embodied in the form of a time axis, that is, a second time axis is generated; the second time axis may be based on the time axis of the second voice data, and the start time point and end time point corresponding to each segmented voice in the second voice data are marked on it.
  • the time corresponding to each paragraph in the first translated text is marked on the second time axis; the time may specifically be the time point, on the second time axis, of the segmented voice in the second voice data corresponding to each paragraph.
  • the time corresponding to the typeset document is marked on the second time axis.
  • the time point of the segmented voice in the second voice data corresponding to the typeset document on the second time axis may be used.
  • the time corresponding to the summary document is marked on the second time axis.
  • the time point of the segmented voice in the second voice data corresponding to the summary document on the second time axis may be used.
  • the time corresponding to the image processing result is marked on the second time axis.
  • the relationship between the time corresponding to the image processing result and the second time axis may be determined according to the relationship between the first time axis and the time point of the first image data.
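The second-time-axis bookkeeping above can be sketched with a sorted list of segment start times and a binary search to resolve a client's target time; the segment times, paragraph names, and function name are made-up illustrations.

```python
# Sketch of the second time axis: each segmented voice carries a
# (start, end) marker, and a target time from an acquisition request
# is resolved to the covering segment with a binary search.
import bisect

segments = [          # (start, end) of each segmented voice, in seconds
    (0.0, 4.5),
    (5.0, 9.0),
    (9.5, 14.0),
]
starts = [s for s, _ in segments]   # sorted start time points

def segment_at(target_time: float) -> int:
    """Index of the segmented voice covering (or preceding) target_time."""
    return max(0, bisect.bisect_right(starts, target_time) - 1)

# Each paragraph of the first translated text maps to a segment.
paragraph_for_segment = ["para-0", "para-1", "para-2"]
print(paragraph_for_segment[segment_at(6.2)])  # -> para-1
```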
  • sending the first translated text and the second voice data in the simultaneous interpretation data to the client includes:
  • At least one paragraph in the first translated text and the segmented speech corresponding to the paragraph are sent to the client; the segmented speech is used for playing when the client displays the paragraph corresponding to the segmented speech.
  • the paragraph and the segmented voice corresponding to the paragraph are sent to the client together, and when the paragraph is displayed by the client, the client can play the segmented voice corresponding to the paragraph at the same time.
  • the method may further include: generating a target document in a preset format according to the typeset document and the summary document; the target document is used for presenting on the client when the first voice data is played.
  • According to the typeset document and the abstract document, the server generates a target document containing the content of both, and the target document can display the extracted abstract and the typeset content together.
  • the method provided in the embodiments of this application can be applied to simultaneous interpretation scenarios, such as simultaneous interpretation in conferences.
  • the translation of conference presentation documents allows users to understand the speaker's speech content more clearly in combination with the presentation documents; typesetting and abstract extraction of the speech content (that is, the first voice data) help users better summarize and retrieve it; and sending at least one paragraph in the first translated text together with the segmented voice corresponding to that paragraph to the client helps users better absorb dense translated text content.
  • the simultaneous interpretation method obtains the first to-be-processed data; translates the first voice data in the first to-be-processed data to obtain the first translated text; generates the second voice data according to the first translated text; and performs at least one of the following: obtaining a typeset document according to the first voice data and the first translated text; performing image word processing on the first image data in the first to-be-processed data to obtain an image processing result;
  • the first image data includes at least a display document corresponding to the first voice data; wherein the language corresponding to the first voice data is different from the language corresponding to the typeset document; the language corresponding to the first voice data is different from the language corresponding to the first translated text; the language corresponding to the first voice data is different from the language corresponding to the second voice data; the language of the text displayed in the first image data is different from the language of the text included in the image processing result; the first translated text, the typeset document, the second voice data, and the image processing result are used for presentation on the client when the first voice data is played, providing the user with text translation results and voice translation results related to the first voice data, the typeset document corresponding to the text translation results, and the translation results related to the display document corresponding to the first voice data.
  • the content of the first voice data of the speech can be more intuitive and comprehensive Intuitive display enables users to understand the summary of the speech content and the content of the displayed document through the client, helps users better accept the content of the speech, and enhances the user experience; it also facilitates the user's subsequent summary of the content of the speech.
  • FIG. 3 is a schematic diagram of the system architecture to which the simultaneous interpretation method of the embodiment of this application is applied.
  • As shown in FIG. 3, the system is applied to conference simultaneous interpretation and includes: a machine simultaneous interpretation server, a speech recognition server, a translation server, an audience mobile terminal, a PC client, a display screen, conference management equipment, a TTS (Text To Speech) server, an OCR (Optical Character Recognition) server, and an NLP (Natural Language Processing) server.
  • Each function can also be implemented on multiple servers; that is, speech recognition, translation, TTS, OCR, NLP, and conference management can be implemented separately on the speech recognition server, translation server, TTS server, OCR server, NLP server, conference management equipment, and so on, so as to improve the efficiency of simultaneous interpretation and ensure high timeliness.
  • The PC client is used to collect the audio of the lecturer's speech content in the conference, that is, to collect the first voice data; the document to be displayed is projected onto the display screen, and the display screen displays the document to the other participants of the conference; the PC client also collects first image data for the document.
  • the document may be a PPT document, a Word document, and so on.
  • the PC client is also used to send the collected first voice data and first image data to the machine simultaneous interpretation server.
  • The PC client may also have a screenshot function, so that the document currently displayed by the speaker can be obtained in real time through a screenshot operation on the screen, that is, the first image data is collected;
  • the time corresponding to the first image data can be recorded accordingly, and the first image data together with its corresponding time can be sent to the machine simultaneous interpretation server.
  • The machine simultaneous interpretation server is used to send the first voice data to the speech recognition server; the speech recognition server recognizes the first voice data using speech recognition technology to obtain recognized text and sends it back to the machine simultaneous interpretation server; and
  • the recognized text is translated by the translation server using a preset translation model to obtain the first translated text, which is sent back to the machine simultaneous interpretation server;
  • the machine simultaneous interpretation server is also used to send the first translated text and the first image data to the conference management device.
  • The first translated text and the first image data respectively carry their corresponding time information, where the time information corresponding to the translation result may include the time information of each segment of the first voice data to which each paragraph in the translation result corresponds.
  • The conference management device is configured to receive the first translated text and the first image data, and to send the first translated text to the NLP server, where at least one of a typeset document and an abstract document is obtained according to the first translated text;
  • the first image data is sent to the OCR server; the OCR server receives the first image data, extracts the text in the first image data, and determines the position of the text; the extracted text and its position are sent to the conference management equipment;
  • the conference management device sends the extracted text to the translation server, receives the translation result sent by the translation server, and generates an image processing result according to the translation result and the position of the extracted text; here, the translation server receives the extracted text, translates it, and sends the translation result to the conference management equipment; and
  • the first translated text is sent to the TTS server; the TTS server receives the first translated text, generates second voice data according to it, and sends the second voice data to the conference management device.
  • the conference management device is also used to send the first translated text, the second voice data, the typeset document, the summary document, and the image processing result to the mobile terminal of the audience.
  • the OCR server is used to obtain the extractable text in the display document corresponding to the first image data and the interface positioning information corresponding to the text through OCR technology and interface positioning technology;
  • The extracted content is translated into multiple languages; according to the interface positioning information, the translated content of each language is merged into the picture to obtain the image processing result; the image processing result is stored in the corresponding server according to its language.
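The OCR flow above can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation: OCR returns (text, position) pairs, each text is translated into several target languages, and the translations are re-associated with their positions so the client can overlay them on the slide image. The `TRANSLATIONS` table is a hypothetical stand-in for the real translation server.

```python
# Hypothetical lookup standing in for the MT server; purely illustrative.
TRANSLATIONS = {
    ("Hello", "fr"): "Bonjour",
    ("World", "fr"): "Monde",
}

def build_image_results(ocr_items, languages):
    """ocr_items: list of (text, (x, y, w, h)) pairs from OCR.
    Returns {language: [(translated_text, position), ...]}, i.e. one
    image processing result per language, keyed for per-language storage."""
    results = {}
    for lang in languages:
        results[lang] = [
            # fall back to the source text if no translation is known
            (TRANSLATIONS.get((text, lang), text), pos)
            for text, pos in ocr_items
        ]
    return results
```

A real system would rasterize each translated string back onto the image at its recorded position; here the per-language result keeps the (text, position) pairs so that rendering stays a separate concern.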
  • the NLP server is configured to generate at least one of a typeset document and an abstract document according to the first translated text.
  • the NLP server is used to use NLP technology and VAD technology to generate a typeset document based on the first voice data and the first translated text; and generate a summary document based on the first translated text.
  • Here, VAD refers to Voice Activity Detection.
  • Specifically, according to the silent points detected in the first voice data, the first translated text corresponding to the first voice data can be pre-segmented to obtain at least one pre-segmented paragraph;
  • then semantic analysis is performed on the context using NLP technology to determine whether to segment at each pre-segmented paragraph boundary, and finally at least one paragraph is determined.
  • The NLP server is also used to apply NLP abstract extraction technology to organize the abstract content of the first translated text into an abstract document.
  • the conference management device is also used to fill in preset forms according to typeset documents in different languages and abstract documents in different languages to generate a target form.
  • The preset table may adopt the format of Table 1 below and may include: meeting name, meeting time, meeting topic name, topic time, speaker, recognition content, translation content of language A, translation content of language B, abstract content of language A, and abstract content of language B.
  • the simultaneous interpretation account, meeting name, meeting time, meeting topic name, topic time, and speaker can be filled in by the user in advance according to the actual situation.
  • For typeset documents in at least one language, the conference management device correspondingly fills in the translated content of language one, the translated content of language two, ..., the translated content of language N in Table 1 according to the language; and for abstract documents in at least one language, it fills in the abstract content of language one, the abstract content of language two, ..., the abstract content of language N, so as to realize the sorting and summarization of the conference content.
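Filling the preset table can be sketched as below. This is a hedged illustration only: the field names (`recognized_content`, `translation_<lang>`, `abstract_<lang>`) are assumed for the sketch and do not come from the patent, which leaves the concrete schema to Table 1.

```python
def fill_target_form(meta, recognized, translations, abstracts):
    """meta: user-filled fields (meeting name, time, topic, speaker, ...).
    translations / abstracts: {language: text} per available language.
    Returns one flat record mirroring the preset table described above."""
    row = dict(meta)                       # user-provided fields pass through
    row["recognized_content"] = recognized
    for lang, text in translations.items():
        row[f"translation_{lang}"] = text  # one column per language
    for lang, text in abstracts.items():
        row[f"abstract_{lang}"] = text
    return row
```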
  • The TTS server provides a TTS simultaneous interpretation service; specifically, the TTS server is used to take the first translated text in different languages, call the TTS service, and synthesize audio content in the different languages, that is, to obtain the second voice data.
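The per-language synthesis step can be sketched as follows. The `tts_service` callable and `fake_tts` placeholder are assumptions standing in for a real TTS backend; a real service would return synthesized speech audio rather than a tag string.

```python
def synthesize_languages(translated_texts, tts_service):
    """translated_texts: {language: first translated text in that language}.
    tts_service: callable (text, language) -> audio bytes.
    Returns {language: second voice data}."""
    return {lang: tts_service(text, lang)
            for lang, text in translated_texts.items()}

def fake_tts(text, lang):
    # Placeholder "audio" for illustration; not a real synthesis call.
    return f"<audio:{lang}:{len(text)}>".encode()
```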
  • the conference management device is also used to store the first translated text, the second voice data, and at least one of the typeset document, the abstract document, and the image processing result in the database of the corresponding language according to the time-corresponding relationship .
  • The time correspondence relationship can be implemented using a time axis; the specific implementation has been described in the method shown in FIG. 2 and will not be repeated here.
  • When the mobile terminal pulls the corresponding first translated text according to the time axis, it can obtain the corresponding second voice data together with it; it can also obtain at least one of the corresponding typeset document, abstract document, and image processing result.
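The time-axis pull described above can be sketched as a lookup over per-language segments. This is an illustrative sketch under stated assumptions: the payload is a dict bundling the translated text, second voice data, and any typeset/abstract documents for one segment, and "covering segment" is taken to mean the latest segment starting at or before the target time.

```python
import bisect

def store_segments(db, language, segments):
    """Save simultaneous interpretation data per language.
    segments: list of (start_time, payload) pairs."""
    db[language] = sorted(segments, key=lambda s: s[0])

def fetch_at(db, language, target_time):
    """Return the payload of the segment covering target_time (the latest
    segment starting at or before it), or None if there is none."""
    segments = db.get(language, [])
    starts = [start for start, _ in segments]
    i = bisect.bisect_right(starts, target_time) - 1
    return segments[i][1] if i >= 0 else None
```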
  • In the embodiment of this application, simultaneous interpretation of PPT documents is added through the OCR server, typesetting and summarization of the meeting content are provided through the NLP server, and a "listening" service for machine simultaneous interpretation is added through the TTS server. This improves the user's sensory experience in the meeting, helps users better understand the content of speeches and documents, and facilitates the audience in summarizing and organizing the meeting content.
  • Fig. 4 is a schematic diagram of the composition structure of the simultaneous interpretation device according to the embodiment of the application; as shown in Fig. 4, the simultaneous interpretation device includes:
  • the obtaining unit 41 is configured to obtain the first to-be-processed data
  • the first processing unit 42 is configured to translate the first voice data in the first to-be-processed data to obtain a first translated text
  • the second processing unit 43 is configured to generate second voice data according to the first translated text
  • the third processing unit 44 is configured to perform at least one of the following:
  • the first image data includes at least a display document corresponding to the first voice data;
  • The language corresponding to the first voice data is different from the language corresponding to the typeset document; the language corresponding to the first voice data is different from the language corresponding to the first translated text; the language corresponding to the first voice data is different from the language corresponding to the second voice data; the language of the text displayed in the first image data is different from the language of the text included in the image processing result; and the first translated text, the typeset document, the second voice data, and the image processing result are used for presentation on the client when the first voice data is played.
  • The third processing unit 44 is configured to perform voice activity detection on the first voice data, determine silent points in the first voice data, segment the first translated text according to the silent points to obtain at least one paragraph, and typeset the at least one paragraph to obtain the typeset document.
  • the second processing unit 43 is configured to segment the first translated text to obtain at least one paragraph in the first translated text
  • the second processing unit 43 segments the first translated text, and the same segmentation method as the first processing unit 42 can be used.
  • The third processing unit 44 is configured to extract a summary of the first translated text to obtain a summary document for the first translated text; the summary document is used for presentation at the client when the first voice data is played.
  • In this way, the user can summarize and review the content of the first voice data and better absorb it.
  • The third processing unit 44 is configured to determine the characters in the first image data and the positions corresponding to the characters, and to generate the image processing result according to the characters and the positions.
  • the image processing result may include at least one of the following: second image data and second translated text.
  • the third processing unit 44 is configured to execute at least one of the following to generate the image processing result:
  • the translated text is used to generate a second translated text, and the second translated text is used as the image processing result.
  • the simultaneous interpretation data obtained by using the first to-be-processed data corresponds to at least one language
  • the device further includes: a storage unit; the storage unit is configured to classify and cache the simultaneous interpretation data corresponding to at least one language type according to the language type.
  • the device further includes: a communication unit; the communication unit is configured to receive an acquisition request sent by the client; the acquisition request is used to acquire simultaneous interpretation data; the acquisition request includes at least: a target Language
  • the acquisition request further includes a target time
  • the communication unit is further configured to, when obtaining the simultaneous interpretation data corresponding to the target language from the cached content, obtain the simultaneous interpretation data corresponding to the target time from the cache according to a preset time correspondence relationship;
  • The time correspondence relationship represents the time relationship between the respective data in the simultaneous interpretation data; the time correspondence relationship is generated in advance according to the time axis of the first voice data and the time points at which the first image data is acquired.
  • the storage unit is further configured to determine a first time axis corresponding to the first voice data and a time point for acquiring the first image data;
  • the respective data in the simultaneous interpretation data are correspondingly saved using the time correspondence relationship.
  • the communication unit is further configured to send at least one paragraph in the first translated text and the segmented speech corresponding to the paragraph to the client; when the paragraph is displayed by the client, The segmented voice corresponding to the paragraph is played by the client.
  • The obtaining unit 41 can be implemented through a communication interface; the first processing unit 42, the second processing unit 43, and the third processing unit 44 can all be implemented by a processor in the server, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Microcontroller Unit (MCU), or a Field-Programmable Gate Array (FPGA);
  • the communication unit can be implemented by a communication interface in the server.
  • It should be noted that when the device provided in the above embodiment performs simultaneous interpretation, the division of the above program modules is only used as an example for illustration. In practical applications, the above processing can be allocated to different program modules as needed; that is, the internal structure of the server is divided into different program modules to complete all or part of the processing described above.
  • the device provided in the foregoing embodiment and the embodiment of the simultaneous interpretation method belong to the same concept. For the specific implementation process, please refer to the method embodiment, which will not be repeated here.
  • FIG. 5 is a schematic diagram of the hardware composition structure of the server according to an embodiment of the present application.
  • As shown in FIG. 5, the server 50 includes a processor 52, a memory 53, and a computer program stored in the memory 53 and capable of running on the processor 52; when the processor 52 of the server executes the program, the method provided by one or more of the server-side technical solutions is implemented.
  • Specifically, when the processor 52 in the server 50 executes the program, the following is realized: obtaining the first to-be-processed data; translating the first voice data in the first to-be-processed data to obtain the first translated text; generating second voice data according to the first translated text; and performing at least one of the following: obtaining a typeset document according to the first voice data and the first translated text;
  • Image word processing is performed on the first image data in the first to-be-processed data to obtain an image processing result;
  • the first image data includes at least a display document corresponding to the first voice data;
  • The first translated text, the typeset document, the second voice data, and the image processing result are used for presentation on the client when the first voice data is played.
  • the server further includes a communication interface 51; various components in the server are coupled together through the bus system 54.
  • the bus system 54 is configured to implement connection and communication between these components.
  • In addition to the data bus, the bus system 54 also includes a power bus, a control bus, and a status signal bus.
  • the memory 53 in this embodiment may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memory.
  • The non-volatile memory can be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferromagnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory can be a magnetic disk memory or a magnetic tape memory.
  • the volatile memory may be a random access memory (RAM, Random Access Memory), which is used as an external cache.
  • By way of example and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM).
  • the memories described in the embodiments of the present application are intended to include, but are not limited to, these and any other suitable types of memories.
  • the method disclosed in the foregoing embodiments of the present application may be applied to the processor 52 or implemented by the processor 52.
  • the processor 52 may be an integrated circuit chip with signal processing capabilities. In the implementation process, the steps of the foregoing method may be completed by an integrated logic circuit of hardware in the processor 52 or instructions in the form of software.
  • the aforementioned processor 52 may be a general-purpose processor, a DSP, or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, and the like.
  • the processor 52 may implement or execute various methods, steps, and logical block diagrams disclosed in the embodiments of the present application.
  • the general-purpose processor may be a microprocessor or any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application can be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium, and the storage medium is located in a memory.
  • the processor 52 reads the information in the memory and completes the steps of the foregoing method in combination with its hardware.
  • the embodiments of the present application also provide a storage medium, which is specifically a computer storage medium, and more specifically, a computer-readable storage medium.
  • Computer instructions, that is, a computer program, are stored thereon; when the computer instructions are executed by a processor, the method provided by one or more of the server-side technical solutions is implemented.
  • the disclosed method and smart device can be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • The division of the units is only a logical function division, and there may be other division manners in actual implementation, for example: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units; Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • The functional units in the embodiments of the present application may all be integrated into one processing unit, or each unit may be used individually as a unit, or two or more units may be integrated into one unit;
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units.
  • The foregoing program can be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
  • Alternatively, if the above-mentioned integrated unit of this application is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in the various embodiments of this application.
  • The aforementioned storage media include: removable storage devices, ROMs, RAMs, magnetic disks, optical discs, and other media that can store program code.


Abstract

A simultaneous interpretation method and apparatus, and a server (50) and a storage medium. The method comprises: obtaining first data to be processed (201); interpreting first speech data in the first data to be processed, so as to obtain first interpreted text (202); and generating second speech data according to the first interpreted text, and executing at least one of the following: obtaining a typeset document according to the first speech data and the first interpreted text, and performing image word processing on first image data in the first data to be processed, so as to obtain an image processing result (203), wherein the first image data includes at least a display document corresponding to the first speech data, and the first interpreted text, the typeset document, the second speech data and the image processing result are presented on a client when the first speech data is played.

Description

Simultaneous interpretation method, device, server and storage medium

Technical Field

This application relates to simultaneous interpretation technology, and in particular to a simultaneous interpretation method, device, server and storage medium.

Background

Machine simultaneous interpretation technology is a speech translation product for conference scenarios that has appeared in recent years. It combines Automatic Speech Recognition (ASR) technology and Machine Translation (MT) technology to provide multilingual subtitle display for the speech content of conference speakers, replacing manual simultaneous interpretation services.

In related machine simultaneous interpretation technology, the speech content is usually translated and displayed as text, but the displayed content does not enable users to truly grasp the content of the speech.
Summary of the Invention

In order to solve the related technical problems, the embodiments of this application provide a simultaneous interpretation method, device, server and storage medium.

An embodiment of this application provides a simultaneous interpretation method, applied to a server, including:

obtaining first data to be processed;

translating first voice data in the first data to be processed to obtain a first translated text;

generating second voice data according to the first translated text; and performing at least one of the following:

obtaining a typeset document according to the first voice data and the first translated text;

performing image word processing on first image data in the first data to be processed to obtain an image processing result; the first image data includes at least a display document corresponding to the first voice data; wherein,

the language corresponding to the first voice data is different from the language corresponding to the typeset document; the language corresponding to the first voice data is different from the language corresponding to the first translated text; the language corresponding to the first voice data is different from the language corresponding to the second voice data; the language of the text displayed in the first image data is different from the language of the text included in the image processing result; and the first translated text, the typeset document, the second voice data, and the image processing result are used for presentation on the client when the first voice data is played.
An embodiment of this application also provides a simultaneous interpretation device, including:

an obtaining unit configured to obtain first data to be processed;

a first processing unit configured to translate first voice data in the first data to be processed to obtain a first translated text;

a second processing unit configured to generate second voice data according to the first translated text;

a third processing unit configured to perform at least one of the following:

obtaining a typeset document according to the first voice data and the first translated text;

performing image word processing on first image data in the first data to be processed to obtain an image processing result; the first image data includes at least a display document corresponding to the first voice data; wherein,

the language corresponding to the first voice data is different from the language corresponding to the typeset document; the language corresponding to the first voice data is different from the language corresponding to the first translated text; the language corresponding to the first voice data is different from the language corresponding to the second voice data; the language of the text displayed in the first image data is different from the language of the text included in the image processing result; and the first translated text, the typeset document, the second voice data, and the image processing result are used for presentation on the client when the first voice data is played.

An embodiment of this application further provides a server, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor; the processor implements the steps of any of the above simultaneous interpretation methods when executing the program.

An embodiment of this application also provides a storage medium on which computer instructions are stored; when the instructions are executed by a processor, the steps of any of the above simultaneous interpretation methods are implemented.
The simultaneous interpretation method, device, server, and storage medium provided in the embodiments of this application obtain first data to be processed; translate first voice data in the first data to be processed to obtain a first translated text; generate second voice data according to the first translated text; and perform at least one of the following: obtaining a typeset document according to the first voice data and the first translated text; performing image word processing on first image data in the first data to be processed to obtain an image processing result, the first image data including at least a display document corresponding to the first voice data. The language corresponding to the first voice data is different from the language corresponding to the typeset document, from the language corresponding to the first translated text, and from the language corresponding to the second voice data; the language of the text displayed in the first image data is different from the language of the text included in the image processing result. The first translated text, the typeset document, the second voice data, and the image processing result are presented on the client when the first voice data is played, providing the user with the text translation result and voice translation result related to the first voice data, the typeset document corresponding to the text translation result, and the translation result related to the display document corresponding to the first voice data. In this way, the content of the first voice data of the speech can be displayed more intuitively and comprehensively, so that the user can directly understand the outline of the first voice content and the content of the display document through the client, which helps the user understand the first voice data and the display document and improves the user experience.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the architecture of a system to which a simultaneous interpretation method in the related art is applied;
FIG. 2 is a schematic flowchart of a simultaneous interpretation method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the architecture of a system to which the simultaneous interpretation method according to an embodiment of the present application is applied;
FIG. 4 is a schematic diagram of the composition structure of a simultaneous interpretation apparatus according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the composition structure of a server according to an embodiment of the present application.
Detailed Description
Before the technical solutions of the embodiments of the present application are described in detail, the system to which the simultaneous interpretation method in the related art is applied is first briefly described.
FIG. 1 is a schematic diagram of the architecture of a system to which a simultaneous interpretation method in the related art is applied. As shown in FIG. 1, the system may include: a machine simultaneous interpretation server, a speech recognition server, a translation server, a mobile-terminal delivery server, audience mobile terminals, a personal computer (PC) client, and a display screen.
In practical applications, a speaker can give a conference presentation through the PC client and project the presented documents, such as presentation software (PPT, PowerPoint) documents, onto the display screen for the audience. During the presentation, the PC client captures the speaker's audio and sends it to the machine simultaneous interpretation server; the machine simultaneous interpretation server recognizes the audio data through the speech recognition server to obtain recognized text, and then translates the recognized text through the translation server to obtain a translation result. The machine simultaneous interpretation server sends the translation result to the PC client and, through the mobile-terminal delivery server, to the audience mobile terminals for display, so that the speaker's speech content is translated into the languages required by the users and displayed.
The solution in the related art can display speech content (i.e., translation results) in different languages, but it performs simultaneous interpretation only for the speaker's spoken content and does not translate the documents the speaker presents, making it difficult for users of other languages to understand those documents; the presentation of the speech content therefore still has shortcomings. Moreover, the speech content is mostly displayed as raw translated text, without helping the user with typesetting or abstract extraction. In addition, whereas manual simultaneous interpretation is primarily aural, current machine simultaneous interpretation is mostly a visual display of text; while the speaker is talking, an excess of displayed text does not help the user understand the speech well. These problems lead to a poor sensory experience for the user.
On this basis, in the various embodiments of the present application, the speech content is translated to obtain a translation result (which may include translated speech and text); the translation result is organized (e.g., by abstract extraction and typesetting) to obtain a typeset document and a summary document; and the presented document is translated. The translation result, the organized documents, and the translated presentation document are sent to the audience mobile terminals for display, helping the users grasp the speech content and facilitating their subsequent summarization of it.
The present application is described in further detail below with reference to the accompanying drawings and specific embodiments.
An embodiment of the present application provides a simultaneous interpretation method applied to a server. FIG. 2 is a schematic flowchart of the simultaneous interpretation method according to an embodiment of the present application. As shown in FIG. 2, the method includes:
Step 201: obtain first to-be-processed data.
Here, the first to-be-processed data includes first voice data and first image data.
The first image data contains at least a presentation document corresponding to the first voice data. The presentation document may be a Word document, a PPT document, or a document in another format, which is not limited here.
In practical applications, the first voice data and the first image data may be captured by a first terminal and sent to the server. The first terminal may be a terminal such as a PC or a tablet computer.
The first terminal may be provided with, or connected to, a voice capture module such as a microphone, through which sound is captured to obtain the first voice data.
In practical applications, the first terminal may be provided with, or connected to, an image capture module (which may be implemented by a stereo camera, a binocular camera, or a structured-light camera), through which the presentation document can be photographed to obtain the first image data. In another embodiment, the first terminal may have a screenshot function, with which it captures the presentation document on its own display screen and uses the screenshot as the first image data.
For example, in the simultaneous interpretation scenario of a conference, while the speaker is talking, the first terminal (e.g., a PC) captures the speech content with the voice capture module to obtain the first voice data; when the speaker presents a document related to the speech (e.g., a PPT document), the first terminal photographs the presented PPT document with the image capture module or takes a screenshot of the PPT document on its own display screen to obtain the first image data.
A communication connection is established between the first terminal and the server. The first terminal sends the acquired first voice data and first image data to the server as the first to-be-processed data, and the server thus obtains the first to-be-processed data.
Step 202: translate the first voice data in the first to-be-processed data to obtain a first translated text.
In an embodiment, translating the first voice data in the first to-be-processed data to obtain the first translated text includes:
performing speech recognition on the first voice data to obtain recognized text; and
translating the recognized text to obtain the first translated text.
Here, the server may perform speech recognition on the first voice data using speech recognition technology to obtain the recognized text.
The server may translate the recognized text using a preset translation model to obtain the first translated text.
The translation model is used to translate text in a first language into text in at least one second language, the first language being different from the second language.
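The two-stage pipeline just described (speech recognition followed by machine translation into each target language) can be sketched as follows. This is an illustrative outline only: `recognize` and the per-language translator callables are hypothetical placeholders for real ASR and MT services, not interfaces named in this application.

```python
# Sketch of the ASR -> MT pipeline of step 202. The recognize and
# translator callables are placeholders for real services.

def interpret(voice_data, recognize, translators):
    """Return {language: translated_text} for one utterance.

    voice_data  -- raw audio bytes of the first voice data
    recognize   -- callable: audio bytes -> recognized source text
    translators -- {target_language: callable(source_text) -> text}
    """
    recognized = recognize(voice_data)           # speech recognition
    return {lang: mt(recognized)                 # one MT pass per target language
            for lang, mt in translators.items()}

# Toy stand-ins so the sketch runs end to end.
demo = interpret(
    b"\x00\x01",
    recognize=lambda audio: "hello world",
    translators={"zh": lambda t: "(zh) " + t, "fr": lambda t: "(fr) " + t},
)
```

A real deployment would replace the lambdas with network calls to the speech recognition server and the translation server shown in FIG. 1.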
Step 203: generate second voice data according to the first translated text, and perform at least one of the following:
obtaining a typeset document according to the first voice data and the first translated text; and
performing image word processing on the first image data in the first to-be-processed data to obtain an image processing result;
where the language corresponding to the first voice data differs from the language of the typeset document, from the language of the first translated text, and from the language of the second voice data, and the language of the text displayed in the first image data differs from the language of the text included in the image processing result; and
the first translated text, the typeset document, the second voice data, and the image processing result are used for presentation on the client while the first voice data is played.
Specifically, the first translated text, the typeset document, and the second voice data are sent to the client so that, while the first voice data is played, the client displays the content corresponding to the first voice data; the image processing result is sent to the client so that, while the first voice data is played, the client displays the content corresponding to the presentation document contained in the first image data.
In practical applications, to obtain the first translated text and the second voice data corresponding to the first voice data, the server may, instead of the above method, first translate the first voice data using a preset speech translation model to obtain the second voice data, and then perform speech recognition on the second voice data to obtain the first translated text. The second voice data may be in at least one language, and the first translated text may be in at least one language.
In practical applications, the content of the first voice data can be typeset to obtain a typeset document. A concise, well-organized typeset document helps the user read and understand intuitively, and also facilitates the user's subsequent summarization of the content of the first voice data.
In an embodiment, determining the typeset document according to the first voice data and the first translated text includes:
performing voice activity detection (VAD) on the first voice data to determine silence points in the first voice data;
acquiring the context corresponding to each silence point in the first translated text;
segmenting the first translated text according to the silence points and the semantics of the context to obtain at least one paragraph; and
typesetting the at least one paragraph to obtain the typeset document.
Here, the server may perform voice activity detection on the first voice data, determine silence periods in the first voice data, and record the duration of each silence period; when the duration meets a condition (e.g., exceeds a preset duration), the silence period is taken as a silence point in the first voice data.
Since the first translated text is obtained by translating the first voice data, its content corresponds to the content of the first voice data. The server can therefore pre-segment the first translated text according to the silence points to obtain at least one pre-segmented paragraph, acquire the context corresponding to each silence point in the first translated text, perform semantic analysis on the context using natural language processing (NLP) technology, and decide according to the analysis result whether to segment at the pre-segmented boundaries. In this way the first translated text is finally segmented into at least one paragraph.
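The silence-point pre-segmentation can be sketched as below, under the simplifying assumption that each translated sentence is already aligned with the pause that follows it in the first voice data; the threshold value and the data layout are invented for illustration, and a real implementation would additionally run the semantic check on the context before accepting each break.

```python
# Illustrative segmentation: a silence point is a pause whose duration
# exceeds a preset threshold; the translated text is split into
# paragraphs at sentences followed by such a pause.

SILENCE_THRESHOLD_MS = 600  # assumed preset duration

def split_paragraphs(sentences):
    """sentences: list of (translated_sentence, silence_after_ms)."""
    paragraphs, current = [], []
    for text, silence_ms in sentences:
        current.append(text)
        if silence_ms >= SILENCE_THRESHOLD_MS:  # silence point: paragraph break
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

paras = split_paragraphs([
    ("First point.", 200),
    ("Still the first point.", 800),   # long pause: break here
    ("Second point.", 150),
])
```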
Specifically, generating the second voice data according to the first translated text includes:
segmenting the first translated text to obtain at least one paragraph of the first translated text;
generating at least one segmented speech clip according to the at least one paragraph of the first translated text; and
synthesizing, from the at least one segmented speech clip, the second voice data corresponding to the first translated text.
Here, text-to-speech (TTS) technology is used to convert each paragraph into a corresponding segmented speech clip.
In an embodiment, segmenting the first translated text may include performing semantic recognition on the first translated text and segmenting it according to the recognition result. In another embodiment, a combination of voice activity detection and semantic recognition may be used for segmentation, as described above for determining the typeset document from the first voice data and the first translated text, which is not repeated here.
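A minimal sketch of the paragraph-by-paragraph synthesis follows: each paragraph is rendered by a TTS engine into one segmented speech clip, and the clips are concatenated in order into the second voice data. The `tts` callable is a hypothetical stand-in; a real engine would return audio samples rather than encoded text.

```python
# Sketch of step 203's speech generation: one TTS clip per paragraph,
# concatenated into the second voice data.

def synthesize_second_voice(paragraphs, tts):
    """Return (segment_list, full_audio) for the translated paragraphs."""
    segments = [tts(p) for p in paragraphs]   # one clip per paragraph
    return segments, b"".join(segments)       # concatenate in order

segments, audio = synthesize_second_voice(
    ["Paragraph one.", "Paragraph two."],
    tts=lambda text: text.encode("utf-8"),    # toy stand-in for a TTS engine
)
```

Keeping the per-paragraph clips (rather than only the concatenated stream) is what later allows each paragraph's start and end points to be marked on the second time axis.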
In practical applications, abstract extraction can be performed, which helps the user summarize the content of the first voice data and understand it better, and also facilitates the user's subsequent organization of that content.
On this basis, in an embodiment, the method may further include:
performing abstract extraction on the first translated text to obtain a summary document for the first translated text, the summary document being used for presentation on the client while the first voice data is played.
Here, automatic summarization is performed on the first translated text using NLP technology to obtain the summary document for the first translated text.
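As one concrete (and deliberately simple) instance of automatic summarization, the extractive sketch below scores sentences by word frequency and keeps the top-scoring ones in their original order. This is only a stand-in for the NLP summarization the application refers to, not the method it prescribes.

```python
# Frequency-based extractive summarization sketch: score each sentence
# by the corpus frequency of its words, keep the top-k in source order.
import re
from collections import Counter

def extract_summary(text, k=1):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w.lower() for s in sentences for w in re.findall(r"\w+", s))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -sum(freq[w.lower()] for w in re.findall(r"\w+", sentences[i])),
    )
    keep = sorted(ranked[:k])                  # restore original order
    return " ".join(sentences[i] for i in keep)

summary = extract_summary(
    "Machine interpretation helps users. Users like subtitles. Cats sleep.", k=1)
```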
Specifically, performing image word processing on the first image data to obtain the image processing result includes:
determining the text in the first image data and the position corresponding to the text;
extracting the text in the first image data and translating the extracted text; and
generating the image processing result according to the translated text.
Here, optical character recognition (OCR) technology is used to determine the text in the first image data; OCR is a technology that performs character recognition on an image to convert the characters in the image into text. Interface positioning technology is used to determine the position corresponding to the text.
Translating the extracted text includes translating the text using a preset translation model.
Here, the translation model is used to translate text in a first language into text in at least one second language, the first language being different from the second language.
Specifically, generating the image processing result according to the translated text includes at least one of the following:
replacing, with the translated text, the text at the corresponding position in the first image data to obtain second image data, and taking the second image data as the image processing result; and
generating a second translated text from the translated text, and taking the second translated text as the image processing result.
Here, the image processing result may include at least one of the second image data and the second translated text.
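The image word processing step can be sketched as follows: given OCR regions (text plus bounding box), either paint translated text back at the same positions to form the second image data, or join the translations into a second translated text. OCR, translation, and rendering are all stubbed out here; the region format is an assumption made for illustration.

```python
# Sketch of image word processing: translate each OCR region and either
# re-render it in place (second image data) or emit plain text
# (second translated text).

def process_image(regions, translate, render):
    """regions: list of (text, bbox); returns (second_image, second_text)."""
    translated = [(translate(text), bbox) for text, bbox in regions]
    second_image = render(translated)                  # replace text in place
    second_text = "\n".join(t for t, _ in translated)  # plain-text result
    return second_image, second_text

img, txt = process_image(
    regions=[("你好", (0, 0, 40, 12)), ("世界", (0, 20, 40, 32))],
    translate={"你好": "hello", "世界": "world"}.get,  # toy dictionary "MT"
    render=lambda items: [(t, b) for t, b in items],   # toy "renderer"
)
```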
In practical applications, to help the user understand the first voice data and the presentation document, there may be multiple kinds of simultaneous interpretation data, so that the corresponding documents (including the first translated text and the second voice data, and at least one of the typeset document, the image processing result, and the summary document) can be provided according to the user's needs.
On this basis, in an embodiment, the simultaneous interpretation data obtained from the first to-be-processed data corresponds to at least one language, and the method may further include:
saving the simultaneous interpretation data corresponding to the at least one language in different databases by language.
Here, the simultaneous interpretation data includes the first translated text and the second voice data, and further includes at least one of the typeset document, the image processing result, and the summary document.
Here, whether to provide the typeset document, the image processing result, and the summary document can be determined according to the user's needs. For example, the user sends a request through the client to tell the server whether the typeset document, the image processing result, and the summary document are needed. As another example, if it is determined that the user needs an overview of the first voice data, the simultaneous interpretation data may include, besides the first translated text and the second voice data, the typeset document and the summary document; if it is determined that the user needs to understand the content of the presentation document, the simultaneous interpretation data may further include the image processing result.
In practical applications, the simultaneous interpretation data corresponding to the at least one language may be saved in different databases by language: the first translated text and second voice data of one language, together with at least one of the typeset document, image processing result, and summary document of that language, are saved in the same database, and each database carries a language identifier.
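One possible in-memory layout for this per-language storage is sketched below; the class and record fields are invented for illustration, and a production system would back each language's store with its own database as described above.

```python
# Illustrative per-language store: one keyed collection per language
# tag, each holding records with the translated text, synthesized
# speech, and any optional documents for that language.

class InterpretationStore:
    def __init__(self):
        self._by_language = {}  # language tag -> list of records

    def save(self, language, record):
        self._by_language.setdefault(language, []).append(record)

    def load(self, language):
        return self._by_language.get(language, [])

store = InterpretationStore()
store.save("en", {"translated_text": "hello", "voice": b"...",
                  "typeset_doc": "hello\n", "summary": "hello"})
```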
In this embodiment, considering that the simultaneous interpretation data is served to multiple clients and that sending it to each client executes one serial service, caching may be used to guarantee the timeliness of sending the simultaneous interpretation data to multiple clients at the same time. When data needs to be sent, the server fetches the corresponding result directly from the cache, which ensures highly timely delivery of the simultaneous interpretation data and also conserves the server's computing resources.
On this basis, in an embodiment, the method may further include:
classifying and caching the simultaneous interpretation data corresponding to the at least one language by language.
In practical applications, the server may determine in advance the preset language of each of at least one client, and fetch the simultaneous interpretation data corresponding to the preset languages from the databases for caching.
With this caching operation, when a client selects a language other than its preset language, the simultaneous interpretation data of that language can be fetched directly from the cache, improving timeliness and conserving computing resources.
In practical applications, when a client selects a language different from the preset languages, the simultaneous interpretation data of that language may not yet be cached. When the server determines that a client has sent an acquisition request for a language other than its preset language, it may also cache the simultaneous interpretation data of that language; when another client later selects the same language, the corresponding simultaneous interpretation data can be fetched directly from the cache, improving timeliness and conserving computing resources.
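The caching policy described above can be sketched as follows: preset languages are cached ahead of time, and when a client requests an uncached language the result is fetched from the database once and then cached for later clients. `fetch_from_db` stands in for the per-language database read; the counter exists only to make the behavior visible.

```python
# Sketch of the per-language cache: warm preset languages, fill other
# languages on first request, serve repeats from the cache.

class LanguageCache:
    def __init__(self, fetch_from_db, preset_languages):
        self._fetch = fetch_from_db
        self._cache = {lang: fetch_from_db(lang) for lang in preset_languages}
        self.db_reads = len(preset_languages)   # instrumentation for the demo

    def get(self, language):
        if language not in self._cache:         # cache miss: fill once
            self._cache[language] = self._fetch(language)
            self.db_reads += 1
        return self._cache[language]

cache = LanguageCache(lambda lang: f"data[{lang}]", preset_languages=["en"])
first = cache.get("fr")    # miss: one database read
second = cache.get("fr")   # hit: served from cache
```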
In practical applications, to provide simultaneous interpretation data in the language the user needs, the simultaneous interpretation data of a target language can be obtained according to an acquisition request sent by the user through the client.
On this basis, in an embodiment, the method may further include:
receiving an acquisition request sent by the client, the acquisition request being used to acquire simultaneous interpretation data and including at least a target language;
acquiring the simultaneous interpretation data corresponding to the target language from the cached simultaneous interpretation data; and
sending the acquired simultaneous interpretation data corresponding to the target language to the client.
Here, the client may have a human-machine interaction interface through which the user can select a language; the client generates an acquisition request containing the target language according to the user's selection and sends it to the server, so that the server receives the acquisition request.
The client may be installed on a mobile phone. Considering that the vast majority of users today carry a mobile phone, sending the simultaneous interpretation data to a client installed on the phone requires no additional device to receive and display it, which saves cost and is convenient to operate.
Here, the first to-be-processed data corresponds to simultaneous interpretation data in at least one language; the simultaneous interpretation data includes the first translated text and the second voice data, and further includes at least one of the typeset document, the image processing result, and the summary document. That is, the first to-be-processed data corresponds to a first translated text in at least one language and second voice data in at least one language, as well as at least one of the following: a typeset document in at least one language, an image processing result in at least one language, and a summary document in at least one language.
In practical applications, to help the user quickly obtain the simultaneous interpretation data at a given point in time, the corresponding simultaneous interpretation data can be obtained according to a target time sent by the client.
On this basis, in an embodiment, the acquisition request may contain a target time; when the simultaneous interpretation data corresponding to the target language is acquired from the cached content, the method further includes:
acquiring, from the cache, the simultaneous interpretation data corresponding to the target time according to a preset time correspondence, where the time correspondence represents the time relationship among the items of the simultaneous interpretation data.
Here, the user can also select a time through the human-machine interaction interface, and the client generates an acquisition request containing the target time according to the user's selection. For example, when the simultaneous interpretation method is applied to a conference, the user selects a point in time during the conference as the target time.
Here, the time relationship among the items of the simultaneous interpretation data refers to the time relationship between the first translated text and the second voice data, and at least one of the typeset document, the image processing result, and the summary document.
Specifically, the time correspondence is generated in advance according to the time axis of the first voice data and the time point at which the first image data was acquired.
It should be noted that the case where the acquisition request contains the target language and the case where it contains the target time can be implemented as separate solutions (when the acquisition request contains only the target time, the target language may be the preset language configured on the client), or in the same solution (i.e., when the acquisition request contains both the target language and the target time, the server acquires the simultaneous interpretation data of the target time in the target language).
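Handling an acquisition request that carries a target language and, optionally, a target time can be sketched as below: the data is looked up in the cache per language, and the time correspondence maps a requested instant to the matching entry. The request and entry formats are invented for illustration.

```python
# Sketch of request handling: per-language lookup, with an optional
# target time resolved through (start, end) spans of the entries.

def handle_request(cache, request):
    """request: {"language": str, "time": float (optional)}."""
    entries = cache[request["language"]]        # list of (start, end, payload)
    t = request.get("time")
    if t is None:
        return [p for _, _, p in entries]       # full stream for this language
    return [p for start, end, p in entries if start <= t < end]

cache = {"en": [(0.0, 5.0, "intro"), (5.0, 9.0, "body")]}
hit = handle_request(cache, {"language": "en", "time": 6.2})
```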
In practical applications, the correspondence among the items of the simultaneous interpretation data can be generated in advance; based on this correspondence, when one item of the simultaneous interpretation data is acquired, the other corresponding items can be obtained at the same time. For example, when the first translated text is acquired, the corresponding second voice data, summary document, and typeset document, as well as the image processing result corresponding to the presentation document, can be obtained accordingly.
On this basis, in an embodiment, the method further includes:
determining a first time axis corresponding to the acquisition of the first voice data, and the time point at which the first image data was acquired;
generating, according to the first time axis and the time point, the time correspondence among the items of the simultaneous interpretation data; and
saving the items of the simultaneous interpretation data in correspondence with one another using the time correspondence.
In an embodiment, when the server receives the first voice data, it determines the reception time, determines the end time according to the duration of the first voice data, and generates the first time axis for the first voice data from the reception time and the end time. In another embodiment, the first terminal determines the start time of capturing the first voice data and its duration and sends them to the server, and the server determines the first time axis of the first voice data from the start time and the duration.
In an embodiment, the time point at which the server acquires the first image data may be used as the time point of acquiring the first image data. In another embodiment, the first terminal determines the corresponding time point when it captures the first image data and sends that time point together with the first image data to the server; the server receives the time point and the first image data and takes that time point as the time point of acquiring the first image data.
这里,根据所述第一时间轴和所述时间点,可以确定第一语音数据 与第一图像数据之间的时间关系;而所述同声传译数据中的第一翻译文本、第二语音数据、排版文档、摘要文档均是在第一语音数据的基础上生成的,因此可以确定第一翻译文本、第二语音数据、排版文档、摘要文档分别与第一语音数据之间的时间关系。基于此,可以生成所述同声传译数据中各数据之间的时间对应关系。Here, according to the first time axis and the time point, the time relationship between the first voice data and the first image data can be determined; and the first translated text and the second voice data in the simultaneous interpretation data The typeset document and the summary document are all generated on the basis of the first voice data, so the time relationship between the first translated text, the second voice data, the typeset document, and the summary document respectively and the first voice data can be determined. Based on this, the time correspondence between the data in the simultaneous interpretation data can be generated.
Specifically, the time correspondence may be embodied in the form of a time axis, that is, a second time axis is generated. The second time axis may be based on the time axis of the second voice data, and is marked with the start time point and end time point of each segmented voice in the second voice data.
For the first translated text, the second time axis is marked with the time corresponding to each paragraph of the first translated text; this time may specifically be the time point, on the second time axis, of the segmented voice in the second voice data that corresponds to the paragraph.
For the typeset document, the second time axis is marked with the time corresponding to the typeset document, which may specifically be the time point, on the second time axis, of the segmented voice in the second voice data that corresponds to the typeset document.
For the summary document, the second time axis is marked with the time corresponding to the summary document, which may specifically be the time point, on the second time axis, of the segmented voice in the second voice data that corresponds to the summary document.
For the image processing result, the second time axis is marked with the time corresponding to the image processing result. Here, the relationship between the time corresponding to the image processing result and the second time axis may be determined according to the relationship between the first time axis and the time point of the first image data.
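The marking scheme above can be sketched as a small index keyed by the second voice data's timeline. This is only an illustrative model, not part of the application; the names `Segment` and `TimeAxis` and their fields are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """One segmented voice of the second voice data, with its markers."""
    start: float          # start time point on the second time axis (seconds)
    end: float            # end time point on the second time axis
    paragraph_id: int     # paragraph of the first translated text it voices

@dataclass
class TimeAxis:
    """Second time axis: segments plus times marked for derived documents."""
    segments: list = field(default_factory=list)
    marks: dict = field(default_factory=dict)   # document name -> time point

    def add_segment(self, start, end, paragraph_id):
        self.segments.append(Segment(start, end, paragraph_id))

    def mark(self, name, time_point):
        # e.g. the typeset document, summary document, or image processing
        # result, anchored to a segment's time point on this axis
        self.marks[name] = time_point

    def segment_at(self, t):
        """Return the segment whose span covers time t, if any."""
        for seg in self.segments:
            if seg.start <= t < seg.end:
                return seg
        return None

axis = TimeAxis()
axis.add_segment(0.0, 4.2, paragraph_id=0)
axis.add_segment(4.2, 9.8, paragraph_id=1)
axis.mark("typeset_document", 0.0)
axis.mark("image_processing_result", 4.2)
print(axis.segment_at(5.0).paragraph_id)  # → 1
```

With such an index, fetching any item at a given time reduces to one lookup on the second time axis, which is what makes the synchronized presentation described below possible.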
In practical applications, to help users understand the first voice data and better absorb the speech content, each paragraph of the first translated text is sent to the client together with its corresponding segmented voice, so that when a user views a paragraph of a translated document, the corresponding voice can be played at the same time.
Based on this, in an embodiment, sending the first translated text and the second voice data in the simultaneous interpretation data to the client includes:
sending at least one paragraph of the first translated text and the segmented voice corresponding to the paragraph to the client, where the segmented voice is played when the client displays the paragraph corresponding to that segmented voice.
Here, the paragraph and its corresponding segmented voice are sent to the client together; when the paragraph is displayed by the client, the client can simultaneously play the segmented voice corresponding to the paragraph.
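One way to bundle a paragraph with its segmented voice is a single message per paragraph. The following is a minimal sketch; the message shape and field names are assumptions, and real audio would normally be base64-encoded or carried on a binary channel rather than counted in bytes:

```python
import json

def build_paragraph_message(paragraph_index, paragraph_text, audio_bytes,
                            language):
    """Bundle one paragraph of the first translated text with its
    segmented speech, so the client can play the audio while the
    paragraph is displayed. Field names are illustrative."""
    return {
        "type": "paragraph_with_audio",
        "language": language,
        "paragraph_index": paragraph_index,
        "paragraph_text": paragraph_text,
        # placeholder for the audio payload; here only its size is recorded
        "audio_size_bytes": len(audio_bytes),
    }

msg = build_paragraph_message(0, "Welcome to the conference.",
                              b"\x00" * 1024, language="en")
print(json.dumps(msg, indent=2))
```

Keeping text and audio in one message means the client never has to re-associate them, which is the behavior the embodiment above describes.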
Specifically, the method may further include: generating a target document in a preset format according to the typeset document and the summary document, where the target document is used for presentation on the client when the first voice data is played.
That is, from the typeset document and the summary document, the server generates a target document that contains the content of both, so that the extracted abstract and the typeset content can be displayed together.
The method provided in the embodiments of this application can be applied to simultaneous interpretation scenarios, such as simultaneous interpretation of a conference. In such a scenario, translating the conference presentation document lets users understand the speaker's speech more clearly in combination with the presentation; typesetting the speech content (i.e., the first voice data) and extracting its abstract helps users summarize and retrieve it; and sending at least one paragraph of the first translated text together with its corresponding segmented voice to the client helps users absorb dense translated text.
It should be understood that the order in which the steps are described in the above embodiments (such as obtaining the first translated text, generating the second voice data, obtaining the typeset document, obtaining the summary document, and obtaining the image processing result) does not imply an execution order; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
The simultaneous interpretation method provided by the embodiments of the present invention obtains first to-be-processed data; translates the first voice data in the first to-be-processed data to obtain a first translated text; generates second voice data according to the first translated text; and performs at least one of the following: obtaining a typeset document according to the first voice data and the first translated text; performing image and text processing on the first image data in the first to-be-processed data to obtain an image processing result, where the first image data contains at least a display document corresponding to the first voice data. The language of the first voice data differs from the language of the typeset document, from the language of the first translated text, and from the language of the second voice data; the language of the text shown in the first image data differs from the language of the text included in the image processing result. The first translated text, typeset document, second voice data, and image processing result are presented on the client while the first voice data is played, providing the user with the text translation result, the voice translation result, the typeset document corresponding to the text translation result, and the translation result of the display document corresponding to the first voice data. In this way, the content of the speech can be displayed more intuitively and comprehensively, so that users can understand, through the client, both the outline of the speech and the content of the display document; this helps users absorb the speech content, improves the user experience, and also facilitates the users' subsequent organization of the speech content.
An embodiment of this application also provides an application of the simultaneous interpretation method. FIG. 3 is a schematic diagram of the system architecture of this application; as shown in FIG. 3, the system is applied to simultaneous interpretation in a conference and includes: a machine simultaneous interpretation server, a speech recognition server, a translation server, audience mobile terminals, a PC client, a display screen, a conference management device, a TTS server, an OCR server, and an NLP server. If a single server performed all of the simultaneous interpretation, the performance requirements on that server would be high; to improve data-processing efficiency and guarantee timeliness, the functions can instead be distributed across multiple servers, namely the speech recognition server, translation server, TTS server, OCR server, NLP server, conference management device, and so on.
The PC client is used to collect the audio of the speaker's speech in the conference, i.e., to collect the first voice data; to project the document to be displayed onto the display screen, which presents the document to the other users attending the conference; and to collect the first image data of the document. Here, the document may be a PPT document, a Word document, or the like.
The PC client is also used to send the collected first voice data and first image data to the machine simultaneous interpretation server.
Here, in addition to voice collection, screen projection, and control functions, the PC client may also provide a screenshot function, so that by capturing the screen it can obtain the document the speaker is currently displaying in real time, i.e., collect the first image data. It can also record the time corresponding to the first image data and send the first image data together with that time to the machine simultaneous interpretation server.
The machine simultaneous interpretation server is used to send the first voice data to the speech recognition server; the speech recognition server recognizes the first voice data using speech recognition technology, obtains the recognized text, and sends it to the machine simultaneous interpretation server. The machine simultaneous interpretation server is further used
to send the recognized text to the translation server; the translation server translates the recognized text using a preset translation model, obtains the first translated text, and sends it to the machine simultaneous interpretation server.
The machine simultaneous interpretation server is also used to send the first translated text and the first image data to the conference management device.
Here, the first translated text and the first image data each carry their corresponding time information; the time information corresponding to the translation result may include the time information of the segmented voice, in the first voice data, that corresponds to each paragraph of the translation result.
The conference management device is configured to receive the first translated text and the first image data;
to send the first translated text to the NLP server, which obtains, from the first translated text, at least one of a typeset document and a summary document;
to send the first image data to the OCR server, which receives it, extracts the text in the first image data, determines the position of the text, and sends the extracted text and its position to the conference management device;
to send the received extracted text to the translation server, receive the translation result sent by the translation server, and generate the image processing result from the translation result and the position of the extracted text (here, the translation server receives the extracted text, translates it, and sends the translation result to the conference management device); and
to send the first translated text to the TTS server, which receives it, generates the second voice data from the first translated text, and sends the second voice data to the conference management device.
The conference management device is also used to send the first translated text, the second voice data, the typeset document, the summary document, and the image processing result to the audience mobile terminals.
Specifically, the OCR server is used to obtain, via OCR and interface-positioning technology, the extractable text in the display document corresponding to the first image data and the interface positioning information of that text; to translate the extracted content into multiple languages using machine translation; to merge the translated content of each language back into the picture according to the interface positioning information, obtaining the image processing result; and to save the image processing result, per language, in the corresponding server.
Specifically, the NLP server is configured to generate at least one of a typeset document and a summary document from the first translated text.
Here, the NLP server uses NLP and VAD technology to generate the typeset document from the first voice data and the first translated text, and generates the summary document from the first translated text. Specifically, Voice Activity Detection (VAD) is applied to the first voice data to find its silent periods; when the duration of a silent period exceeds a preset duration threshold, that silent period is taken as a silence point in the first voice data. The first translated text corresponding to the first voice data can then be pre-segmented according to the silence points to obtain at least one pre-segmented paragraph; the context around each silence point in the first translated text is obtained and semantically analyzed with NLP to decide whether the pre-segmented paragraph should indeed be split, finally determining at least one paragraph.
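The pre-segmentation step can be sketched as follows. This is a simplified stand-in: a real system would use a trained VAD model rather than the frame-energy threshold assumed here, and the semantic check over the silence point's context is not shown:

```python
def find_silence_points(frame_energies, frame_ms, energy_threshold,
                        min_silence_ms):
    """Return frame indices where a silent run of at least min_silence_ms
    ends, i.e. candidate pre-segmentation points for the translated text.
    The energy threshold is a simple stand-in for a real VAD model."""
    points, run_start = [], None
    for i, energy in enumerate(frame_energies):
        if energy < energy_threshold:
            if run_start is None:
                run_start = i          # a silent run begins
        else:
            if run_start is not None:
                # silent run ended at frame i; keep it only if long enough
                if (i - run_start) * frame_ms >= min_silence_ms:
                    points.append(i)
                run_start = None
    return points

# 20 ms frames: speech, then 100 ms of silence, then speech again
energies = [0.9] * 10 + [0.01] * 5 + [0.8] * 10
print(find_silence_points(energies, frame_ms=20, energy_threshold=0.1,
                          min_silence_ms=100))  # → [15]
```

Each returned index marks where a pre-segmented paragraph may end; the NLP semantic analysis described above would then confirm or reject each split.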
The NLP server is also used to apply NLP abstract-extraction technology to the first translated text to organize its abstract content and obtain the summary document.
Specifically, the conference management device is also used to fill in a preset table according to the typeset documents in different languages and the summary documents in different languages, so as to generate a target table.
The preset table may adopt the format of Table 1 below and may include: meeting name, meeting time, topic name, topic time, speaker, recognized content, translated content in language A, translated content in language B, abstract content in language A, and abstract content in language B.
Among these, the simultaneous interpretation account, meeting name, meeting time, topic name, topic time, and speaker can be filled in by the user in advance according to the actual situation.
The conference management device fills in the translated content of language 1, language 2, ..., language N in Table 1 according to the typeset document of at least one language or at least one image processing result, and fills in the abstract content of language 1, language 2, ..., language N according to the summary document of at least one language, thereby organizing and summarizing the conference content.
Table 1 (provided as an image in the original filing; its columns are those listed above)
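The table-filling step above can be sketched as merging per-language columns into one row. The function and field names below are illustrative only, not the application's actual schema:

```python
def fill_meeting_table(meta, translations_by_lang, abstracts_by_lang):
    """Fill one row of the preset table: fixed meeting fields from meta,
    then one translated-content and one abstract-content column per
    language, mirroring the 'language 1 ... language N' columns above."""
    row = dict(meta)
    for lang, text in translations_by_lang.items():
        row[f"translated_content_{lang}"] = text
    for lang, text in abstracts_by_lang.items():
        row[f"abstract_content_{lang}"] = text
    return row

row = fill_meeting_table(
    {"meeting_name": "Dev Summit", "speaker": "Alice"},
    {"en": "full translation...", "fr": "traduction..."},
    {"en": "summary...", "fr": "résumé..."},
)
print(sorted(row))
```

Adding a language then only adds columns, so the same row layout serves any number of target languages.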
The TTS server provides the TTS simultaneous interpretation service; specifically, it takes the first translated text in each language, calls the TTS service, and synthesizes the audio content of that language, i.e., obtains the second voice data.
Specifically, the conference management device is also used to save the first translated text, the second voice data, and at least one of the typeset document, the summary document, and the image processing result in the database of the corresponding language according to the time correspondence. The time correspondence can be implemented with a time axis; the specific implementation has been described for the method shown in FIG. 1 and is not repeated here.
Through the time correspondence, when a mobile terminal pulls the corresponding first translated document according to the time axis, it can also fetch the corresponding second voice data, as well as at least one of the corresponding typeset document, summary document, and image processing result.
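A pull of this kind reduces to a lookup keyed by language and time. The cache layout below (language mapped to `(start, end, kind, payload)` entries) is an assumption for illustration, not the application's storage format:

```python
def fetch_at(cache, language, target_time):
    """Return every cached item of the requested language whose
    [start, end) span on the time axis covers target_time."""
    return [
        (kind, payload)
        for start, end, kind, payload in cache.get(language, [])
        if start <= target_time < end
    ]

cache = {
    "en": [
        (0.0, 10.0, "translated_text", "paragraph 1"),
        (0.0, 10.0, "second_voice", "audio-segment-1"),
        (4.0, 10.0, "image_processing_result", "slide-1-en.png"),
    ],
}
print(fetch_at(cache, "en", 5.0))
```

Because every derived item shares the same time axis, one `fetch_at` call returns the text, audio, and slide that belong together at that moment.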
In the embodiments of this application, the OCR server adds simultaneous interpretation of PPT documents, the NLP server provides typesetting and summarization of the conference content, and the TTS server adds a "listening" service to machine simultaneous interpretation. Together these improve the users' sensory experience in the conference, help them better understand the speech and the documents, and also make it easier for the audience to organize the conference content afterwards.
To implement the simultaneous interpretation method of the embodiments of this application, an embodiment of this application further provides a simultaneous interpretation apparatus. FIG. 4 is a schematic diagram of the composition of the simultaneous interpretation apparatus; as shown in FIG. 4, the apparatus includes:
an acquisition unit 41, configured to obtain first to-be-processed data;
a first processing unit 42, configured to translate the first voice data in the first to-be-processed data to obtain a first translated text;
a second processing unit 43, configured to generate second voice data according to the first translated text; and
a third processing unit 44, configured to perform at least one of the following:
obtaining a typeset document according to the first voice data and the first translated text;
performing image and text processing on the first image data in the first to-be-processed data to obtain an image processing result, where the first image data contains at least a display document corresponding to the first voice data; wherein
the language of the first voice data differs from the language of the typeset document; the language of the first voice data differs from the language of the first translated text; the language of the first voice data differs from the language of the second voice data; the language of the text shown in the first image data differs from the language of the text included in the image processing result; and the first translated text, typeset document, second voice data, and image processing result are used for presentation on the client when the first voice data is played.
In an embodiment, the third processing unit 44 is configured to perform voice activity detection on the first voice data and determine the silence points in the first voice data;
obtain the context corresponding to the silence points in the first translated text;
segment the first translated text according to the silence points and the semantics of the context to obtain at least one paragraph; and
typeset the at least one paragraph to obtain the typeset document.
In an embodiment, the second processing unit 43 is configured to segment the first translated text to obtain at least one paragraph of the first translated text;
generate at least one segmented voice according to the at least one paragraph of the first translated text; and
synthesize, from the at least one segmented voice, the second voice data corresponding to the first translated text.
Here, when segmenting the first translated text, the second processing unit 43 may use the same segmentation method as the first processing unit 42.
In an embodiment, the third processing unit 44 is configured to extract an abstract from the first translated text to obtain a summary document for the first translated text; the summary document is used for presentation on the client when the first voice data is played.
Here, providing the user with a summary document helps the user summarize the content of the first voice data and better absorb it.
In an embodiment, the third processing unit 44 is configured to determine the text in the first image data and the position corresponding to the text;
extract the text in the first image data and translate the extracted text; and
generate the image processing result from the translated text.
Specifically, the image processing result may include at least one of the following: second image data, second translated text.
The third processing unit 44 is configured to perform at least one of the following to generate the image processing result:
replacing, according to the translated text, the text at the corresponding position in the first image data to obtain second image data, and taking the second image data as the image processing result; and
generating a second translated text from the translated text, and taking the second translated text as the image processing result.
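The in-place replacement can be sketched as a plan of draw instructions: each extracted string, the box where OCR located it, and the translated string to draw back at that box. A real implementation would rasterize this plan with an imaging library; the data layout here is an assumption for illustration:

```python
def build_second_image_plan(ocr_items, translations):
    """Pair each extracted string with its translation and the box where
    the translated text should be drawn back into the image. ocr_items is
    a list of (text, box) pairs as an OCR step might report them."""
    plan = []
    for text, box in ocr_items:
        plan.append({"box": box, "erase": text, "draw": translations[text]})
    return plan

# (left, top, right, bottom) boxes for two strings on a slide
ocr_items = [("欢迎", (10, 10, 120, 40)), ("日程", (10, 60, 120, 90))]
translations = {"欢迎": "Welcome", "日程": "Agenda"}
plan = build_second_image_plan(ocr_items, translations)
for step in plan:
    print(step["erase"], "->", step["draw"], "at", step["box"])
```

Applying the plan to the first image data yields the second image data; collecting only the `draw` strings yields the second translated text.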
In the embodiments of this application, the simultaneous interpretation data obtained from the first to-be-processed data corresponds to at least one language.
The apparatus further includes a storage unit, configured to classify and cache, by language, the simultaneous interpretation data corresponding to the at least one language.
In an embodiment, the apparatus further includes a communication unit, configured to receive an acquisition request sent by a client, where the acquisition request is for obtaining simultaneous interpretation data and includes at least a target language;
obtain the simultaneous interpretation data corresponding to the target language from the cached simultaneous interpretation data; and
send the obtained simultaneous interpretation data corresponding to the target language to the client.
In an embodiment, the acquisition request further includes a target time;
the communication unit is further configured to, when obtaining the simultaneous interpretation data corresponding to the target language from the cached content, obtain from the cache the simultaneous interpretation data corresponding to the target time according to a preset time correspondence. The time correspondence characterizes the time relationship between the items of the simultaneous interpretation data, and is generated in advance according to the time axis of the first voice data and the time point at which the first image data was acquired.
In an embodiment, the storage unit is further configured to determine the first time axis corresponding to the first voice data and the time point at which the first image data was acquired;
generate, according to the first time axis and the time point, the time correspondence between the items of the simultaneous interpretation data; and
save the items of the simultaneous interpretation data in association with one another using the time correspondence.
In an embodiment, the communication unit is further configured to send at least one paragraph of the first translated text and the segmented voice corresponding to the paragraph to the client; when the paragraph is displayed by the client, the client plays the segmented voice corresponding to the paragraph.
In practice, the acquisition unit 41 can be implemented through a communication interface; the first processing unit 42, the second processing unit 43, and the third processing unit 44 can each be implemented by a processor in the server, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Microcontroller Unit (MCU), or a Field-Programmable Gate Array (FPGA); and the communication unit can be implemented by a communication interface in the server.
It should be noted that when the apparatus provided in the above embodiment performs simultaneous interpretation, the division into the above program modules is only an example; in practical applications, the above processing can be allocated to different program modules as needed, i.e., the internal structure of the apparatus can be divided into different program modules to complete all or part of the processing described above. In addition, the apparatus provided in the above embodiment and the embodiments of the simultaneous interpretation method belong to the same concept; for the specific implementation process, refer to the method embodiments, which is not repeated here.
基于上述设备的硬件实现,本申请实施例还提供了一种服务器,图5为本申请实施例的服务器的硬件组成结构示意图,如图5所示,服务器50包括存储器53、处理器52及存储在存储器53上并可在处理器52上运行的计算机程序;位于服务器的处理器52执行所述程序时实现上述服务器侧一个或多个技术方案提供的方法。Based on the hardware implementation of the above device, an embodiment of the present application also provides a server. FIG. 5 is a schematic diagram of the hardware composition structure of the server according to an embodiment of the present application. As shown in FIG. A computer program that is on the memory 53 and can run on the processor 52; when the processor 52 located on the server executes the program, the method provided by one or more technical solutions on the server side is implemented.
Specifically, when the processor 52 of the server 50 executes the program, the following is implemented: obtaining first to-be-processed data; translating first voice data in the first to-be-processed data to obtain a first translated text; generating second voice data according to the first translated text; and performing at least one of the following:
obtaining a typeset document according to the first voice data and the first translated text;
performing image-text processing on first image data in the first to-be-processed data to obtain an image processing result, the first image data including at least a presentation document corresponding to the first voice data. The first translated text, the typeset document, the second voice data, and the image processing result are used for presentation on a client while the first voice data is played.
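The server-side flow above can be summarized in a minimal sketch. All helper functions (`translate`, `synthesize`, `typeset`, `process_image`) are hypothetical placeholders standing in for the ASR, machine translation, TTS, typesetting, and image-text components; they are not APIs defined in this application.

```python
def translate(voice_data, target_language):
    # Placeholder: ASR + machine translation of the first voice data.
    return f"[{target_language} translation of {voice_data}]"

def synthesize(text):
    # Placeholder: text-to-speech for the first translated text.
    return f"[speech for: {text}]"

def typeset(voice_data, translated_text):
    # Placeholder: paragraph segmentation and layout.
    return {"paragraphs": [translated_text]}

def process_image(image_data):
    # Placeholder: image-text processing of the presentation document.
    return f"[translated image of {image_data}]"

def handle(first_pending_data, target_language="en"):
    first_voice = first_pending_data["voice"]
    first_image = first_pending_data.get("image")

    first_translated_text = translate(first_voice, target_language)
    second_voice = synthesize(first_translated_text)
    typeset_doc = typeset(first_voice, first_translated_text)
    image_result = process_image(first_image) if first_image else None
    # All four artifacts are returned for presentation on the client
    # while the first voice data is played.
    return first_translated_text, second_voice, typeset_doc, image_result
```

The sketch only fixes the data flow between the four outputs; each placeholder would be backed by a real ASR/MT/TTS/OCR service in practice.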
It should be noted that the specific steps implemented when the processor 52 of the server 50 executes the program have been described in detail above and will not be repeated here.
It can be understood that the server further includes a communication interface 51, and the components in the server are coupled together through a bus system 54. It can be understood that the bus system 54 is configured to implement connection and communication between these components. In addition to a data bus, the bus system 54 also includes a power bus, a control bus, and a status signal bus.
It can be understood that the memory 53 in this embodiment may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferromagnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a magnetic disk memory or a magnetic tape memory. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memories described in the embodiments of the present application are intended to include, but are not limited to, these and any other suitable types of memory.
The method disclosed in the foregoing embodiments of the present application may be applied to the processor 52 or implemented by the processor 52. The processor 52 may be an integrated circuit chip with signal processing capability. In an implementation process, the steps of the foregoing method may be completed by an integrated logic circuit of hardware in the processor 52 or by instructions in the form of software. The processor 52 may be a general-purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 52 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium, the storage medium being located in the memory; the processor 52 reads the information in the memory and completes the steps of the foregoing method in combination with its hardware.
The embodiments of the present application further provide a storage medium, specifically a computer storage medium, and more specifically a computer-readable storage medium. Computer instructions, that is, a computer program, are stored thereon; when the computer instructions are executed by a processor, the method provided by one or more of the foregoing server-side technical solutions is implemented.
In the several embodiments provided in this application, it should be understood that the disclosed method and smart device may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a division by logical function; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may all be integrated into one second processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
A person of ordinary skill in the art can understand that all or part of the steps of the foregoing method embodiments may be completed by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
Alternatively, if the above integrated unit of the present application is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or as the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
It should be noted that "first", "second", and the like are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
In addition, the technical solutions described in the embodiments of the present application may be combined arbitrarily, provided there is no conflict.
The above are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application.

Claims (14)

  1. A simultaneous interpretation method, applied to a server, comprising:
    obtaining first to-be-processed data;
    translating first voice data in the first to-be-processed data to obtain a first translated text;
    generating second voice data according to the first translated text; and performing at least one of the following:
    obtaining a typeset document according to the first voice data and the first translated text;
    performing image-text processing on first image data in the first to-be-processed data to obtain an image processing result, the first image data including at least a presentation document corresponding to the first voice data; wherein,
    a language corresponding to the first voice data is different from a language corresponding to the typeset document; the language corresponding to the first voice data is different from a language corresponding to the first translated text; the language corresponding to the first voice data is different from a language corresponding to the second voice data; a language of text displayed in the first image data is different from a language of text included in the image processing result; and the first translated text, the typeset document, the second voice data, and the image processing result are used for presentation on a client while the first voice data is played.
  2. The method according to claim 1, wherein determining the typeset document according to the first voice data and the first translated text comprises:
    performing voice activity detection on the first voice data to determine a mute point in the first voice data;
    acquiring a context corresponding to the mute point in the first translated text;
    segmenting the first translated text according to the mute point and semantics of the context to obtain at least one paragraph;
    typesetting the at least one paragraph to obtain the typeset document.
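As a minimal sketch of the segmentation and typesetting steps, assume the detected mute points have already been aligned to character offsets in the translated text (the actual voice-activity detection and semantic context analysis are abstracted away; both helper names are hypothetical):

```python
def segment_by_mute_points(translated_text, mute_offsets):
    # Split the first translated text at positions aligned with the
    # detected mute (silence) points; each slice becomes one paragraph.
    paragraphs, start = [], 0
    for off in sorted(mute_offsets):
        if start < off <= len(translated_text):
            paragraphs.append(translated_text[start:off].strip())
            start = off
    tail = translated_text[start:].strip()
    if tail:
        paragraphs.append(tail)
    return [p for p in paragraphs if p]

def typeset_paragraphs(paragraphs):
    # Trivial "typesetting": one indented block per paragraph.
    return "\n\n".join("    " + p for p in paragraphs)
```

A real implementation would also consult the semantics of the context around each mute point before committing to a paragraph break, rather than splitting on offsets alone.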
  3. The method according to claim 1, wherein generating the second voice data according to the first translated text comprises:
    segmenting the first translated text to obtain at least one paragraph in the first translated text;
    generating at least one segmented speech according to the at least one paragraph in the first translated text;
    synthesizing, by using the at least one segmented speech, the second voice data corresponding to the first translated text.
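The per-paragraph synthesis can be sketched as follows, with `synthesize_segment` as a hypothetical stand-in for a text-to-speech engine (here it just returns the paragraph's bytes so the concatenation step is visible):

```python
def synthesize_segment(paragraph):
    # Hypothetical per-paragraph TTS returning raw audio bytes.
    return paragraph.encode("utf-8")

def synthesize_second_voice(paragraphs):
    # Generate one segmented speech per paragraph, then concatenate
    # the segments into the second voice data. The individual segments
    # are kept so they can later be paired with their paragraphs.
    segments = [synthesize_segment(p) for p in paragraphs]
    return b"".join(segments), segments
```

Keeping the per-paragraph segments alongside the concatenated audio is what later allows a client to play exactly the segment that matches the paragraph it is displaying.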
  4. The method according to claim 1, wherein the method further comprises:
    performing abstract extraction on the first translated text to obtain a summary document for the first translated text, the summary document being used for presentation on the client while the first voice data is played.
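A deliberately naive extractive sketch of this step: keep the leading sentences of the translated text as the summary. Real abstract extraction would score sentences by relevance; this placeholder only illustrates the input/output shape.

```python
def extract_summary(translated_text, max_sentences=3):
    # Naive extractive "abstract extraction": keep the first few
    # sentences of the first translated text as the summary document.
    sentences = [s.strip() for s in translated_text.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + "."
```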
  5. The method according to claim 1, wherein performing image-text processing on the first image data to obtain the image processing result comprises:
    determining text in the first image data and a position corresponding to the text;
    extracting the text in the first image data, and translating the extracted text;
    generating the image processing result according to the translated text.
  6. The method according to claim 5, wherein generating the image processing result according to the translated text comprises at least one of the following:
    replacing, according to the translated text, the text corresponding to the position in the first image data to obtain second image data, and using the second image data as the image processing result;
    generating a second translated text by using the translated text, and using the second translated text as the image processing result.
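Both output variants can be sketched together, assuming text detection has already produced (text, position) pairs and a translation function is supplied; the image itself is modelled as a position-to-text dict rather than pixel data, since the rendering step is outside this sketch:

```python
def process_first_image(image_texts, translate):
    # image_texts: (text, position) pairs detected in the first image
    # data; translate: hypothetical translation callable.
    # Variant 1: replace the text at each position to obtain the
    # "second image data" (modelled here as a position -> text dict).
    second_image = {pos: translate(text) for text, pos in image_texts}
    # Variant 2: collect the translations into a second translated text.
    second_translated_text = "\n".join(translate(t) for t, _ in image_texts)
    return second_image, second_translated_text
```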
  7. The method according to any one of claims 1 to 6, wherein simultaneous interpretation data obtained by using the first to-be-processed data corresponds to at least one language; and the method further comprises:
    classifying and caching, by language, the simultaneous interpretation data corresponding to the at least one language.
  8. The method according to claim 7, wherein the method further comprises:
    receiving an acquisition request sent by a client, the acquisition request being used to acquire simultaneous interpretation data and including at least a target language;
    acquiring simultaneous interpretation data corresponding to the target language from the cached simultaneous interpretation data;
    sending the acquired simultaneous interpretation data corresponding to the target language to the client.
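The language-keyed cache and the serve-by-target-language lookup amount to a small data structure, sketched here (class and method names are illustrative, not defined by this application):

```python
from collections import defaultdict

class InterpretationCache:
    # Cache simultaneous interpretation data classified by language,
    # and serve it on request by target language.
    def __init__(self):
        self._by_language = defaultdict(list)

    def store(self, language, data):
        # Classify and cache one piece of interpretation data.
        self._by_language[language].append(data)

    def fetch(self, target_language):
        # Return cached data for the requested target language
        # (empty list if nothing is cached for that language).
        return list(self._by_language.get(target_language, []))
```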
  9. The method according to claim 8, wherein the acquisition request further includes a target time; and when the simultaneous interpretation data corresponding to the target language is acquired from the cached content, the method further comprises:
    acquiring, from the cache according to a preset time correspondence, the simultaneous interpretation data corresponding to the target time, the time correspondence representing a time relationship between the data in the simultaneous interpretation data, and the time correspondence being generated in advance according to a time axis of the first voice data and a time point at which the first image data is acquired.
  10. The method according to claim 9, wherein the method further comprises:
    determining a first time axis corresponding to the acquisition of the first voice data and a time point at which the first image data is acquired;
    generating, according to the first time axis and the time point, the time correspondence between the data in the simultaneous interpretation data;
    saving the data in the simultaneous interpretation data correspondingly by using the time correspondence.
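One plausible way to build such a time correspondence is to pair each image-capture time point with the voice segment whose start time most recently precedes it on the first time axis. The representation below (sorted start times plus segment identifiers) is an assumption for illustration, not a structure specified by this application:

```python
import bisect

def build_time_correspondence(voice_timeline, image_time_points):
    # voice_timeline: sorted list of (start_time, segment_id) on the
    # first time axis; image_time_points: capture times of image data.
    starts = [t for t, _ in voice_timeline]
    mapping = {}
    for tp in image_time_points:
        # Find the last voice segment starting at or before this point.
        idx = bisect.bisect_right(starts, tp) - 1
        if idx >= 0:
            mapping[tp] = voice_timeline[idx][1]
    return mapping
```

Saving the interpretation data keyed by this mapping lets a later target-time lookup return the voice, text, and image artifacts that belong together.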
  11. The method according to claim 8, wherein sending the first translated text and the second voice data in the simultaneous interpretation data to the client comprises:
    sending at least one paragraph in the first translated text and the segmented speech corresponding to the paragraph to the client, wherein when the paragraph is displayed by the client, the segmented speech corresponding to the paragraph is played by the client.
  12. A simultaneous interpretation apparatus, comprising:
    an acquisition unit configured to obtain first to-be-processed data;
    a first processing unit configured to translate first voice data in the first to-be-processed data to obtain a first translated text;
    a second processing unit configured to generate second voice data according to the first translated text; and
    a third processing unit configured to perform at least one of the following:
    obtaining a typeset document according to the first voice data and the first translated text;
    performing image-text processing on first image data in the first to-be-processed data to obtain an image processing result, the first image data including at least a presentation document corresponding to the first voice data; wherein,
    a language corresponding to the first voice data is different from a language corresponding to the typeset document; the language corresponding to the first voice data is different from a language corresponding to the first translated text; the language corresponding to the first voice data is different from a language corresponding to the second voice data; a language of text displayed in the first image data is different from a language of text included in the image processing result; and the first translated text, the typeset document, the second voice data, and the image processing result are used for presentation on a client while the first voice data is played.
  13. A server, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 11.
  14. A storage medium having computer instructions stored thereon, wherein the instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 11.
PCT/CN2019/109677 2019-09-30 2019-09-30 Simultaneous interpretation method and apparatus, and server and storage medium WO2021062757A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/109677 WO2021062757A1 (en) 2019-09-30 2019-09-30 Simultaneous interpretation method and apparatus, and server and storage medium
CN201980099995.2A CN114341866A (en) 2019-09-30 2019-09-30 Simultaneous interpretation method, device, server and storage medium


Publications (1)

Publication Number Publication Date
WO2021062757A1 true WO2021062757A1 (en) 2021-04-08

Family

ID=75336728


Country Status (2)

Country Link
CN (1) CN114341866A (en)
WO (1) WO2021062757A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114023317A (en) * 2021-11-04 2022-02-08 五华县昊天电子科技有限公司 Voice translation system based on cloud platform

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818747A (en) * 2022-04-21 2022-07-29 语联网(武汉)信息技术有限公司 Computer-aided translation method and system of voice sequence and visual terminal

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060143681A1 (en) * 2004-12-29 2006-06-29 Delta Electronics, Inc. Interactive entertainment center
CN101714140A (en) * 2008-10-07 2010-05-26 英业达股份有限公司 Instant translation system with multimedia display and method thereof
CN109614628A (en) * 2018-11-16 2019-04-12 广州市讯飞樽鸿信息技术有限公司 A kind of interpretation method and translation system based on Intelligent hardware
CN109696748A (en) * 2019-02-14 2019-04-30 郑州诚优成电子科技有限公司 A kind of augmented reality subtitle glasses for synchronous translation
CN110121097A (en) * 2019-05-13 2019-08-13 深圳市亿联智能有限公司 Multimedia playing apparatus and method with accessible function

Also Published As

Publication number Publication date
CN114341866A (en) 2022-04-12

Similar Documents

Publication Publication Date Title
CN111883123B (en) Conference summary generation method, device, equipment and medium based on AI identification
US10282162B2 (en) Audio book smart pause
CN108683937B (en) Voice interaction feedback method and system for smart television and computer readable medium
WO2021109678A1 (en) Video generation method and apparatus, electronic device, and storage medium
CN111050201B (en) Data processing method and device, electronic equipment and storage medium
CN108012173B (en) Content identification method, device, equipment and computer storage medium
CN104735468A (en) Method and system for synthesizing images into new video based on semantic analysis
CN112653902A (en) Speaker recognition method and device and electronic equipment
GB2535861A (en) Data lookup and operator for excluding unwanted speech search results
WO2021062757A1 (en) Simultaneous interpretation method and apparatus, and server and storage medium
CN104994404A (en) Method and device for obtaining keywords for video
WO2021087665A1 (en) Data processing method and apparatus, server, and storage medium
CN112581965A (en) Transcription method, device, recording pen and storage medium
CN110992960A (en) Control method, control device, electronic equipment and storage medium
WO2021102754A1 (en) Data processing method and device and storage medium
KR20220130863A (en) Apparatus for Providing Multimedia Conversion Content Creation Service Based on Voice-Text Conversion Video Resource Matching
US20230300429A1 (en) Multimedia content sharing method and apparatus, device, and medium
CN111161710A (en) Simultaneous interpretation method and device, electronic equipment and storage medium
US11874867B2 (en) Speech to text (STT) and natural language processing (NLP) based video bookmarking and classification system
CN112818708B (en) System and method for processing voice translation of multi-terminal multi-language video conference in real time
CN111580766B (en) Information display method and device and information display system
WO2021120174A1 (en) Data processing method, apparatus, electronic device, and storage medium
CN114503546A (en) Subtitle display method, device, electronic equipment and storage medium
CN111161737A (en) Data processing method and device, electronic equipment and storage medium
WO2023273667A1 (en) Data processing method and apparatus, server, client, medium, and product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19947810

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19947810

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 071022)
