CN111738023A - Automatic image-text audio translation method and system - Google Patents

Automatic image-text audio translation method and system

Info

Publication number
CN111738023A
CN111738023A (application number CN202010587361.6A)
Authority
CN
China
Prior art keywords
data
language
target language
model
translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010587361.6A
Other languages
Chinese (zh)
Inventor
宋万利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202010587361.6A
Publication of CN111738023A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/005 Language recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an automatic image-text audio translation method which includes: acquiring one or more of voice data, image data and action data; performing preliminary processing on the acquired voice data and inputting the preliminary processing result into a pre-trained translation model to obtain corresponding target language data; recognizing feature information in the acquired image data and translating the recognition result to obtain corresponding target language data; formatting the acquired action data and inputting the formatted data into a pre-constructed LSTM neural network model to obtain action expression data; and either outputting the obtained target language data or action expression data directly, or performing similarity matching on the obtained target language data and/or action expression data to obtain fused language data. The invention can recognize and translate separately on the basis of the acquired voice data, image data and action data, thereby improving translation precision, and can handle varied translation scenarios, giving it better adaptability.

Description

Automatic image-text audio translation method and system
Technical Field
The invention relates to the technical field of translation, in particular to an automatic image-text audio translation method and an automatic image-text audio translation system.
Background
Human beings speak and write many different languages. Even people who have learned a foreign language often still have to seek out on-site interpretation, and books and publications still have to be translated through various laborious processes. Meanwhile, the translation software, speech-and-text inter-translation tools and APP translation platforms currently available on Internet websites suffer from delays in real-time performance, and, especially for simultaneous interpretation, the manpower, facilities and equipment that must be arranged behind such platforms are complex and require heavy investment.
In recent years the quality of machine translation has risen steadily with the spread of neural machine translation (NMT) technology, but spoken expression is constrained by the particular language, local accents and expression habits (such as body language), so translation results are often unsatisfactory. For example, parallel corpora between a major language such as English and minor languages, minority languages or dialects are very scarce, so some pronunciations cannot be matched to corresponding vocabulary for output; or the speaker conveys part of the message with body movements during expression, which speech capture alone cannot recognize or translate.
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an automatic image-text audio translation method and system, which obtain translation results from image, text and audio information and improve translation accuracy.
In order to achieve this purpose, the technical solution of the invention is as follows:
in one aspect, the invention provides a method for automatically translating image-text audio, which comprises the following steps:
acquiring one or more of voice data, image data and motion data as source data;
performing primary processing on the acquired voice data, and inputting a primary processing result into a pre-trained translation model to obtain corresponding target language data;
identifying the characteristic information of the acquired image data, and translating the identified characteristic information result to obtain corresponding target language data;
formatting the acquired action data, and inputting the formatting result into a pre-constructed LSTM neural network model to obtain action expression data;
when the source data is one type, directly outputting the obtained target language data or action expression data; and when the source data are various, performing similarity matching on the obtained target language data and/or action expression data to obtain fusion language data.
Further, in the above method for automatically translating image-text and audio, the performing preliminary processing on the obtained speech data and inputting the preliminary processing result into a pre-trained translation model to obtain corresponding target language data includes:
carrying out rate conversion on the acquired voice data to obtain standard sampling rate voice data;
inputting standard sampling rate data into a language identification model to obtain a language judgment result;
inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and performing in the translation model:
outputting language identification information of a source language from the voice data through a pre-trained deep neural network,
and converting the language identification information of the source language through a cross-language labeling model, and outputting target language data.
Further, in the above automatic image-text audio translation method, the cross-language labeling model is obtained by pre-construction and training as follows:
classifying speech by language, and dividing each language into subclasses according to standard pronunciation and accents;
generating cross-language correspondence relations between the sub-languages of all languages in a pairwise manner, the cross-language correspondence relation converting any source language into the corresponding data of a target language with words, phrases and sentences as the conversion units;
and performing association labeling between any two languages according to the conversion unit, and training on the cross-language correspondence relations to form the cross-language labeling model.
Further, in the above method for automatically translating image-text and audio, the identifying the feature information of the acquired image data and translating the identified result of the feature information to obtain corresponding target language data includes
Identifying characteristic information in the image through a pre-trained convolutional neural network, and determining a character and/or a graphic object contained in the image;
and translating the recognized text and/or graphic objects to obtain corresponding target language data.
Further, in the above method for automatically translating graphics, text and audio, formatting the acquired motion data, and inputting the result of formatting into a pre-constructed LSTM neural network model to obtain motion expression data, including
Acquiring motion data through a posture sensor worn by a limb;
denoising and feature extraction are carried out on the motion data to obtain formatted input data, and the formatted input data are input into a pre-constructed LSTM neural network model to obtain motion expression data;
and outputting the action expression data as target language data.
In a second aspect, the invention also provides an automatic translation system for image-text and audio, which comprises
The data input module is used for acquiring one or more of voice data, image data and motion data as source data;
the voice data processing module is used for carrying out primary processing on the acquired voice data and inputting a primary processing result into a pre-trained translation model so as to obtain corresponding target language data;
the image data processing module is used for identifying the characteristic information of the acquired image data and translating the identified characteristic information result to obtain corresponding target language data;
the action data processing module is used for formatting the acquired action data and inputting the formatting result into a pre-constructed LSTM neural network model to obtain action expression data;
the data output module is used for directly outputting the obtained target language data or action expression data when the source data is of one type; and, when the source data are of several types, for performing similarity matching on the obtained target language data and/or action expression data to obtain and output fused language data.
Further, in the above system for automatically translating graphics, text and audio, the voice data processing module specifically executes:
carrying out rate conversion on the acquired voice data to obtain standard sampling rate voice data;
inputting standard sampling rate data into a language identification model to obtain a language judgment result;
inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and performing in the translation model:
outputting language identification information of a source language from the voice data through a pre-trained deep neural network,
and converting the language identification information of the source language through a cross-language labeling model, and outputting target language data.
Further, in the above automatic image-text audio translation system, the cross-language labeling model is obtained by pre-construction and training as follows:
classifying speech by language, and dividing each language into subclasses according to standard pronunciation and accents;
generating cross-language correspondence relations between the sub-languages of all languages in a pairwise manner, the cross-language correspondence relation converting any source language into the corresponding data of a target language with words, phrases and sentences as the conversion units;
and performing association labeling between any two languages according to the conversion unit, and training on the cross-language correspondence relations to form the cross-language labeling model.
Further, in the above system for automatically translating graphics, text and audio, the image data processing module specifically executes:
identifying characteristic information in the image through a pre-trained convolutional neural network, and determining a character and/or a graphic object contained in the image;
and translating the recognized text and/or graphic objects to obtain corresponding target language data.
Further, in the above system for automatically translating graphics, text and audio, the motion data processing module specifically executes:
acquiring motion data through a posture sensor worn by a limb;
denoising and feature extraction are carried out on the motion data to obtain formatted input data, and the formatted input data are input into a pre-constructed LSTM neural network model to obtain motion expression data;
and outputting the action expression data as target language data.
Compared with the prior art, the invention has the beneficial effects that:
according to the image-text audio automatic translation method and system, recognition and translation can be respectively carried out on the basis of the acquired voice data, image data and action data, and similarity matching is carried out by fusing translation results of different source data when multiple source data are acquired for translation, so that translation precision is improved; furthermore, when the cross-language voice is translated, the language labeling relation between languages and accents is perfected through the cross-language corresponding relation, and the language labeling resource of the translation model is perfected, so that the translation result can be obtained when the acquired voice data is of a small language or with local accents; the system can correspondingly express random (such as containing limb actions) translation scenes, and has better adaptability.
Drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
FIG. 1 is a flow chart of a method for automatically translating graphics, text and audio according to the present invention;
FIG. 2 is a flow chart of the processing of voice data as shown in FIG. 1;
FIG. 3 is a flow chart of the image data processing as shown in FIG. 1;
FIG. 4 is a logic block diagram of an automatic translation system for graphics, text and audio according to the present invention;
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.
Example 1
As shown in fig. 1, in one aspect, the present invention provides an automatic image-text audio translation method, including the following steps:
s1, acquiring one or more of voice data, image data and action data as source data;
S2A, performing primary processing on the acquired voice data, and inputting a primary processing result into a pre-trained translation model to obtain corresponding target language data;
S2B, identifying the characteristic information of the acquired image data, and translating the identified characteristic information result to obtain corresponding target language data;
S2C, formatting the acquired action data, and inputting a formatting result into a pre-constructed LSTM neural network model to obtain action expression data;
s3, when the source data is one type, directly outputting the obtained target language data or action expression data; and when the source data are various, performing similarity matching on the obtained target language data and/or action expression data to obtain fusion language data.
With this method, source data such as voice data, image data and action data can be recognized and translated independently, and when several types of source data are obtained they can be fused for output, which solves the problem of missing translation results that arises when body movements replace speech in spoken communication.
In a specific implementation provided by the invention, in step S1 one or more of voice data, image data and motion data are acquired as source data. The voice data can be in a major language such as Chinese, English, Russian or Japanese, or in a minor language such as French, German, Spanish, Arabic, Vietnamese or Lao; the image data is a photo taken on the local device or a picture imported from a third-party platform, and the photo or picture can contain text and graphics, for example common objects needing translation such as signs, buses, buildings, animals and plants, or dishes of food; the motion data is data describing movements made by the limbs.
The source data can be acquired independently or captured simultaneously. The captured source data enters different processing flows according to the types of the source data.
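By way of illustration only, this routing step can be pictured with a minimal Python sketch (not part of the claimed subject matter; the handler and fusion functions are hypothetical placeholders for the processing flows described below):

    from typing import Callable, Dict, List

    def translate(source: Dict[str, object],
                  handlers: Dict[str, Callable[[object], str]],
                  fuse: Callable[[List[str]], str]) -> str:
        """Route each captured modality ('voice', 'image', 'motion') to its own flow."""
        results = [handlers[kind](data)
                   for kind, data in source.items() if data is not None]
        if len(results) == 1:
            return results[0]      # a single source type: output directly (step S3)
        return fuse(results)       # several source types: similarity fusion (step S3)

    # Usage with trivial placeholder handlers:
    demo_handlers = {"voice": lambda d: "hello", "motion": lambda d: "handshake"}
    print(translate({"voice": b"...", "motion": [0.1, 0.2]},
                    demo_handlers, fuse=lambda r: " / ".join(r)))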
When the acquired source data is voice data, as shown in fig. 2, step S2A is performed: preliminary processing is carried out on the acquired voice data, and the preliminary processing result is input into a pre-trained translation model to obtain the corresponding target language data:
S2A1, performing rate conversion on the acquired voice data to obtain standard sampling rate voice data;
Generally, the standard sampling rate voice data is set to a sampling rate of 22 kHz with a sample bit width of 16 bits.
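As a sketch of this conversion step (assuming the librosa and soundfile packages; any equivalent resampling tool could be used), audio of arbitrary sampling rate can be converted to the 22 kHz / 16-bit format mentioned above:

    import librosa
    import soundfile as sf

    STANDARD_SR = 22050       # standard sampling rate (~22 kHz)
    SAMPLE_FORMAT = "PCM_16"  # 16-bit sample width

    def to_standard_rate(in_path: str, out_path: str) -> str:
        # librosa resamples to the requested rate while loading
        samples, _ = librosa.load(in_path, sr=STANDARD_SR, mono=True)
        sf.write(out_path, samples, STANDARD_SR, subtype=SAMPLE_FORMAT)
        return out_path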
S2A2, inputting standard sampling rate data into a language identification model to obtain a language judgment result;
In this embodiment, the language identification model may be a currently mature model built around a Hidden Markov Model (HMM) and an N-gram model; its details are not repeated here.
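A much-simplified stand-in for such a language identification model is sketched below: instead of the HMM/N-gram pipeline named above, it scores MFCC features of the input against one Gaussian mixture model per language (the corpus file paths and language tags are hypothetical):

    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def mfcc_features(wav_path, sr=22050):
        y, _ = librosa.load(wav_path, sr=sr)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # frames x 13

    def train_language_models(corpus):
        """corpus maps a language tag (e.g. 'zh', 'en') to a list of wav files."""
        models = {}
        for lang, files in corpus.items():
            feats = np.vstack([mfcc_features(f) for f in files])
            models[lang] = GaussianMixture(n_components=16,
                                           covariance_type="diag").fit(feats)
        return models

    def identify_language(wav_path, models):
        feats = mfcc_features(wav_path)
        # choose the language whose model assigns the highest average log-likelihood
        return max(models, key=lambda lang: models[lang].score(feats))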
S2A3, inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and performing in the translation model:
outputting language identification information of the source language from the voice data through a pre-trained deep neural network; current deep neural network models are already able to recognize accented speech data and convert its lexical content into computer-readable input. In this step, speech recognition is performed on the source data whose language has been determined, and the recognized language data then needs to be further translated into the target language.
In this embodiment, the language conversion is implemented by the cross-language labeling model, after which the target language data is output. The cross-language labeling model is obtained by pre-construction and training as follows:
Classifying speech by language, and dividing each language into subclasses according to standard pronunciation and accents; each language can be split into a number of accents by region, for example Chinese is divided into Mandarin, Henan dialect, Cantonese, Southern Min and the like according to local accents.
Cross-language correspondence relations are then generated between the sub-languages of all languages in a pairwise manner; a cross-language correspondence relation converts any source language into the corresponding data of a target language, with words, phrases and sentences as the conversion units.
Finally, association labeling is performed between any two languages according to the conversion unit, and the cross-language correspondence relations are trained to form the cross-language labeling model.
For example, in the cross-language correspondence, with the word as the conversion unit, the English word "tomorrow" is labeled against the Mandarin Chinese "明天" and against the Cantonese "聽日" (literally "hear day"), and the Cantonese "聽日" is in turn labeled against the Mandarin "明天"; in this way spoken translations such as English into Chinese (Mandarin), English into Cantonese, and Cantonese into Mandarin can be realized.
Association labeling can also be performed with the sentence as the conversion unit, based on common everyday expressions.
With the cross-language labeling model, translation between any two languages can be refined down to the level of the conversion unit, so that minor languages and local accents can be recognized, converted and output correspondingly, and the parallel corpora from major languages to minor languages, minority languages or dialects are enriched; this improves the completeness of the translation system's corpora and the accuracy of the translation results.
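The pairwise correspondence described above can be pictured as a lookup table keyed by (source sub-language, target sub-language); the toy Python sketch below reuses the "tomorrow" example from the preceding paragraphs, with all entries given purely for illustration:

    from typing import Dict, Optional, Tuple

    SubLang = Tuple[str, str]   # (language, accent), e.g. ("zh", "cantonese")

    cross_lang_table: Dict[Tuple[SubLang, SubLang], Dict[str, str]] = {
        (("en", "standard"), ("zh", "mandarin")):  {"tomorrow": "明天"},
        (("en", "standard"), ("zh", "cantonese")): {"tomorrow": "聽日"},
        (("zh", "cantonese"), ("zh", "mandarin")): {"聽日": "明天"},
    }

    def convert(unit: str, src: SubLang, dst: SubLang) -> Optional[str]:
        """Look up one conversion unit (word, phrase or sentence) pairwise."""
        return cross_lang_table.get((src, dst), {}).get(unit)

    print(convert("tomorrow", ("en", "standard"), ("zh", "cantonese")))  # 聽日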
When the acquired source data is image data, step S2B is performed: feature information recognition is performed on the acquired image data, and the recognized feature-information result is translated to obtain the corresponding target language data; as shown in fig. 3, this includes:
S2B1, identifying characteristic information in the image through a pre-trained convolutional neural network, and determining a character and/or a graphic object contained in the image;
S2B2, translating the recognized character and/or graphic object to obtain corresponding target language data.
The pre-trained convolutional neural network (CNN) can identify the text and graphics contained in a given picture, for example which kind of animal, plant or vehicle the picture shows, and the recognized text and graphic information is directly transcribed into text data in the target language for display or broadcast. For example, suppose a photo of a dumpling is taken and recognized by the convolutional neural network: if the target language is Chinese, the output target-language text is "饺子", which the hardware can also play back as speech; if the target language is English, the output target-language text is "dumpling".
This step can automatically recognize and translate a photo or picture of an unknown object to obtain the desired target language data, as sketched below.
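As an illustrative sketch (assuming PyTorch/torchvision and a generic pre-trained classifier; the label-to-language table is a hypothetical placeholder, and a stock ImageNet label set will not cover every object such as specific dishes), the recognize-then-transcribe flow can look like this:

    import torch
    from PIL import Image
    from torchvision.models import resnet18, ResNet18_Weights

    # Hypothetical label-to-target-language table; a real system covers the full label set.
    LABEL_TRANSLATIONS = {"zh": {"dumpling": "饺子"}, "en": {"dumpling": "dumpling"}}

    weights = ResNet18_Weights.DEFAULT
    model = resnet18(weights=weights).eval()    # pre-trained CNN classifier
    preprocess = weights.transforms()
    categories = weights.meta["categories"]     # human-readable class names

    def recognize_and_translate(image_path: str, target_lang: str) -> str:
        img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            class_idx = model(img).argmax(dim=1).item()
        label = categories[class_idx]
        # fall back to the recognized label itself if no translation entry exists
        return LABEL_TRANSLATIONS.get(target_lang, {}).get(label, label)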
Further, in the above automatic image-text audio translation method, when the acquired source data is motion data, step S2C is performed: the acquired motion data is formatted, and the formatted result is input into a pre-constructed LSTM neural network model to obtain motion expression data:
S2C1, acquiring motion data through a posture sensor worn by a limb;
the attitude sensor comprises an inclination angle sensor, a three-axis gyroscope, a three-axis linear accelerometer and the like, is worn on the limb of an actor, acquires accurate dynamic precision and provides real-time motion measurement. When many people are in oral communication, along with gesture actions such as interview and call, handshake, way directing and the like, the action parameters can be used as auxiliary translation data for fusion of subsequent translation data, and a speaker (namely the actor) can be supplemented to replace default speech translation caused by oral vocabulary with actions; or may be translated directly as some commonly used sign language data.
S2C2, denoising and feature extraction are carried out on the motion data to obtain formatted input data, and the formatted input data are input into a pre-constructed LSTM neural network model to obtain motion expression data;
the LSTM neural network model is constructed by acquiring enough motion samples and training. And then denoising and characteristic extraction processing are carried out on the data after the action data are obtained to obtain formatted input data, the formatted input data enter a trained LSTM neural network model, and action expression data, namely what the action expression means, are output.
S2C3, translating the action expression data into target language data according to requirements and outputting the target language data.
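A minimal PyTorch sketch of such an LSTM classifier follows; the feature dimension, label set and sequence length are hypothetical, and in practice the model must first be trained on a sufficient number of labeled motion samples:

    import torch
    import torch.nn as nn

    class MotionLSTM(nn.Module):
        """Map a sequence of denoised posture-sensor feature frames to a motion label."""
        def __init__(self, n_features: int = 9, hidden: int = 64, n_classes: int = 8):
            super().__init__()
            # e.g. 9 features per frame: 3-axis gyroscope + 3-axis accelerometer + tilt angles
            self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            _, (h_n, _) = self.lstm(x)     # x: (batch, time, n_features)
            return self.head(h_n[-1])      # classify from the final hidden state

    MOTION_LABELS = ["handshake", "wave", "point_direction", "nod",
                     "beckon", "clap", "bow", "other"]   # hypothetical label set

    model = MotionLSTM(n_classes=len(MOTION_LABELS))
    frames = torch.randn(1, 50, 9)                       # 50 formatted feature frames
    print(MOTION_LABELS[model(frames).argmax(dim=1).item()])  # untrained: arbitrary label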
S3, when the source data is one type, directly outputting the obtained target language data or action expression data; and when the source data are various, performing similarity matching on the obtained target language data and/or action expression data to obtain fusion language data.
The similarity matching can be based on DRCN and DIIN similarity matching models: feature extraction and similarity matching are performed on the processing results of the voice data, image data and action data (i.e. the target language data and/or action expression data). For example, if after recognition the feature content of the voice data includes "hello" and the feature content of the action data includes "handshake", the two are deeply associated features and the situation can be determined to be a greeting; the action data is then used as an auxiliary parameter to the voice data, the two are fused into fused language data, and the final target language data "hello" is output accurately, improving the accuracy of the translated data.
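The fusion decision can be pictured with the following much-simplified stand-in (the association table is a hypothetical placeholder; a trained DRCN/DIIN matcher would learn such relations rather than enumerate them):

    # Hypothetical table of deeply associated feature contents across modalities.
    ASSOCIATIONS = {
        ("hello", "handshake"): "greeting",
        ("goodbye", "wave"): "farewell",
    }

    def fuse(voice_text: str, motion_label: str) -> str:
        """If the voice result and the motion result are known to be associated, treat
        the motion as auxiliary confirmation and keep the voice translation; otherwise
        append the motion expression so no information is lost."""
        for (speech_cue, motion_cue), _scene in ASSOCIATIONS.items():
            if speech_cue in voice_text and motion_cue == motion_label:
                return voice_text                  # e.g. confirmed greeting: "hello"
        return f"{voice_text} [{motion_label}]"

    print(fuse("hello", "handshake"))              # -> hello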
Example 2
In a second aspect, as shown in fig. 4, the invention further provides an automatic translation system for graphics, text and audio, comprising
The data input module is used for acquiring one or more of voice data, image data and motion data as source data;
the voice data processing module is used for carrying out primary processing on the acquired voice data and inputting a primary processing result into a pre-trained translation model so as to obtain corresponding target language data;
the image data processing module is used for identifying the characteristic information of the acquired image data and translating the identified characteristic information result to obtain corresponding target language data;
the action data processing module is used for formatting the acquired action data and inputting the formatting result into a pre-constructed LSTM neural network model to obtain action expression data;
the data output module is used for directly outputting the obtained target language data or action expression data when the source data is of one type; and, when the source data are of several types, for performing similarity matching on the obtained target language data and/or action expression data to obtain and output fused language data.
Further, in the above system for automatically translating graphics, text and audio, the voice data processing module specifically executes:
carrying out rate conversion on the acquired voice data to obtain standard sampling rate voice data;
inputting standard sampling rate data into a language identification model to obtain a language judgment result;
inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and performing in the translation model:
outputting language identification information of a source language from the voice data through a pre-trained deep neural network,
and converting the language identification information of the source language through a cross-language labeling model, and outputting target language data.
Further, in the above automatic image-text audio translation system, the cross-language labeling model is obtained by pre-construction and training as follows:
classifying speech by language, and dividing each language into subclasses according to standard pronunciation and accents;
generating cross-language correspondence relations between the sub-languages of all languages in a pairwise manner, the cross-language correspondence relation converting any source language into the corresponding data of a target language with words, phrases and sentences as the conversion units;
and performing association labeling between any two languages according to the conversion unit, and training on the cross-language correspondence relations to form the cross-language labeling model.
Further, in the above system for automatically translating graphics, text and audio, the image data processing module specifically executes:
identifying characteristic information in the image through a pre-trained convolutional neural network, and determining a character and/or a graphic object contained in the image;
and translating the recognized text and/or graphic objects to obtain corresponding target language data.
Further, in the above system for automatically translating graphics, text and audio, the motion data processing module specifically executes:
acquiring motion data through a posture sensor worn by a limb;
denoising and feature extraction are carried out on the motion data to obtain formatted input data, and the formatted input data are input into a pre-constructed LSTM neural network model to obtain motion expression data;
and outputting the action expression data as target language data.
The system of the present invention is used to implement the method in the above embodiment 1 of the present invention, and therefore, the working principle of each module of the system may refer to the related description of the above embodiment 1, and is not described again.
Implementations of the invention and all of the functional operations provided herein may be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the present disclosure may be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, data processing apparatus.
A computer program (also known as a program, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims (10)

1. An automatic image-text audio translation method, characterized by comprising the following steps:
acquiring one or more of voice data, image data and motion data as source data;
performing primary processing on the acquired voice data, and inputting a primary processing result into a pre-trained translation model to obtain corresponding target language data;
identifying the characteristic information of the acquired image data, and translating the identified characteristic information result to obtain corresponding target language data;
formatting the acquired action data, and inputting the formatting result into a pre-constructed LSTM neural network model to obtain action expression data;
when the source data is one type, directly outputting the obtained target language data or action expression data; and when the source data are various, performing similarity matching on the obtained target language data and/or action expression data to obtain fusion language data.
2. The automatic image-text audio translation method according to claim 1, wherein performing preliminary processing on the acquired voice data and inputting the preliminary processing result into a pre-trained translation model to obtain corresponding target language data comprises:
carrying out rate conversion on the acquired voice data to obtain standard sampling rate voice data;
inputting standard sampling rate data into a language identification model to obtain a language judgment result;
inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and performing in the translation model:
outputting language identification information of a source language from the voice data through a pre-trained deep neural network,
and converting the language identification information of the source language through a cross-language labeling model, and outputting target language data.
3. The automatic image-text audio translation method according to claim 2, wherein the cross-language labeling model is obtained by pre-construction and training as follows:
classifying speech by language, and dividing each language into subclasses according to standard pronunciation and accents;
generating cross-language correspondence relations between the sub-languages of all languages in a pairwise manner, the cross-language correspondence relation converting any source language into the corresponding data of a target language with words, phrases and sentences as the conversion units;
and performing association labeling between any two languages according to the conversion unit, and training on the cross-language correspondence relations to form the cross-language labeling model.
4. The automatic image-text audio translation method according to claim 1, wherein identifying the feature information of the acquired image data and translating the recognized feature-information result to obtain corresponding target language data comprises:
Identifying characteristic information in the image through a pre-trained convolutional neural network, and determining a character and/or a graphic object contained in the image;
and translating the recognized text and/or graphic objects to obtain corresponding target language data.
5. The automatic image-text audio translation method according to claim 1, wherein formatting the acquired motion data and inputting the formatted result into a pre-constructed LSTM neural network model to obtain motion expression data comprises:
Acquiring motion data through a posture sensor worn by a limb;
denoising and feature extraction are carried out on the motion data to obtain formatted input data, and the formatted input data are input into a pre-constructed LSTM neural network model to obtain motion expression data;
and outputting the action expression data as target language data.
6. An automatic image-text audio translation system, characterized by comprising:
The data input module is used for acquiring one or more of voice data, image data and motion data as source data;
the voice data processing module is used for carrying out primary processing on the acquired voice data and inputting a primary processing result into a pre-trained translation model so as to obtain corresponding target language data;
the image data processing module is used for identifying the characteristic information of the acquired image data and translating the identified characteristic information result to obtain corresponding target language data;
the action data processing module is used for formatting the acquired action data and inputting the formatting result into a pre-constructed LSTM neural network model to obtain action expression data;
the data output module is used for directly outputting the obtained target language data or action expression data when the source data is of one type; and, when the source data are of several types, for performing similarity matching on the obtained target language data and/or action expression data to obtain and output fused language data.
7. The automatic image-text audio translation system according to claim 6, wherein the voice data processing module specifically executes:
carrying out rate conversion on the acquired voice data to obtain standard sampling rate voice data;
inputting standard sampling rate data into a language identification model to obtain a language judgment result;
inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and performing in the translation model:
outputting language identification information of a source language from the voice data through a pre-trained deep neural network,
and converting the language identification information of the source language through a cross-language labeling model, and outputting target language data.
8. The automatic image-text audio translation system according to claim 7, wherein the cross-language labeling model is obtained by pre-construction and training as follows:
classifying speech by language, and dividing each language into subclasses according to standard pronunciation and accents;
generating cross-language correspondence relations between the sub-languages of all languages in a pairwise manner, the cross-language correspondence relation converting any source language into the corresponding data of a target language with words, phrases and sentences as the conversion units;
and performing association labeling between any two languages according to the conversion unit, and training on the cross-language correspondence relations to form the cross-language labeling model.
9. The automatic image-text audio translation system according to claim 6, wherein the image data processing module specifically executes:
identifying characteristic information in the image through a pre-trained convolutional neural network, and determining a character and/or a graphic object contained in the image;
and translating the recognized text and/or graphic objects to obtain corresponding target language data.
10. The automatic image-text audio translation system according to claim 6, wherein the action data processing module specifically executes:
acquiring motion data through a posture sensor worn by a limb;
denoising and feature extraction are carried out on the motion data to obtain formatted input data, and the formatted input data are input into a pre-constructed LSTM neural network model to obtain motion expression data;
and outputting the action expression data as target language data.
CN202010587361.6A 2020-06-24 2020-06-24 Automatic image-text audio translation method and system Pending CN111738023A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010587361.6A CN111738023A (en) 2020-06-24 2020-06-24 Automatic image-text audio translation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010587361.6A CN111738023A (en) 2020-06-24 2020-06-24 Automatic image-text audio translation method and system

Publications (1)

Publication Number Publication Date
CN111738023A true CN111738023A (en) 2020-10-02

Family

ID=72651323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010587361.6A Pending CN111738023A (en) 2020-06-24 2020-06-24 Automatic image-text audio translation method and system

Country Status (1)

Country Link
CN (1) CN111738023A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382275A (en) * 2020-11-04 2021-02-19 北京百度网讯科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN114047981A (en) * 2021-12-24 2022-02-15 珠海金山数字网络科技有限公司 Project configuration method and device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1742273A (en) * 2002-12-10 2006-03-01 国际商业机器公司 Multimodal speech-to-speech language translation and display
US20060293874A1 (en) * 2005-06-27 2006-12-28 Microsoft Corporation Translation and capture architecture for output of conversational utterances
US20110153324A1 (en) * 2009-12-23 2011-06-23 Google Inc. Language Model Selection for Speech-to-Text Conversion
CN106057023A (en) * 2016-06-03 2016-10-26 北京光年无限科技有限公司 Intelligent robot oriented teaching method and device for children
CN107832309A (en) * 2017-10-18 2018-03-23 广东小天才科技有限公司 A kind of method, apparatus of language translation, wearable device and storage medium
CN108052510A (en) * 2017-10-23 2018-05-18 成都铅笔科技有限公司 A kind of multilingual translation device for supporting gesture identification and voice pickup
CN108399427A (en) * 2018-02-09 2018-08-14 华南理工大学 Natural interactive method based on multimodal information fusion
CN108766414A (en) * 2018-06-29 2018-11-06 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for voiced translation
CN108960126A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Method, apparatus, equipment and the system of sign language interpreter
CN109472035A (en) * 2018-11-12 2019-03-15 深圳市友杰智新科技有限公司 Switch the control system and method for interpretive scheme
CN110931042A (en) * 2019-11-14 2020-03-27 北京欧珀通信有限公司 Simultaneous interpretation method and device, electronic equipment and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1742273A (en) * 2002-12-10 2006-03-01 国际商业机器公司 Multimodal speech-to-speech language translation and display
US20060293874A1 (en) * 2005-06-27 2006-12-28 Microsoft Corporation Translation and capture architecture for output of conversational utterances
US20110153324A1 (en) * 2009-12-23 2011-06-23 Google Inc. Language Model Selection for Speech-to-Text Conversion
CN106057023A (en) * 2016-06-03 2016-10-26 北京光年无限科技有限公司 Intelligent robot oriented teaching method and device for children
CN107832309A (en) * 2017-10-18 2018-03-23 广东小天才科技有限公司 A kind of method, apparatus of language translation, wearable device and storage medium
CN108052510A (en) * 2017-10-23 2018-05-18 成都铅笔科技有限公司 A kind of multilingual translation device for supporting gesture identification and voice pickup
CN108399427A (en) * 2018-02-09 2018-08-14 华南理工大学 Natural interactive method based on multimodal information fusion
CN108766414A (en) * 2018-06-29 2018-11-06 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for voiced translation
CN108960126A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Method, apparatus, equipment and the system of sign language interpreter
CN109472035A (en) * 2018-11-12 2019-03-15 深圳市友杰智新科技有限公司 Switch the control system and method for interpretive scheme
CN110931042A (en) * 2019-11-14 2020-03-27 北京欧珀通信有限公司 Simultaneous interpretation method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王建华 (WANG Jianhua) et al.: "Research on audio-visual multimodal texts with eye-tracking technology: applications, status and prospects" (眼动技术下的视听多模态文本研究:应用、现状与展望), Foreign Languages and Cultures (《外国语言与文化》), vol. 4, no. 1, pages 133-142 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382275A (en) * 2020-11-04 2021-02-19 北京百度网讯科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112382275B (en) * 2020-11-04 2023-08-15 北京百度网讯科技有限公司 Speech recognition method, device, electronic equipment and storage medium
CN114047981A (en) * 2021-12-24 2022-02-15 珠海金山数字网络科技有限公司 Project configuration method and device

Similar Documents

Publication Publication Date Title
CN109255113B (en) Intelligent proofreading system
US8498857B2 (en) System and method for rapid prototyping of existing speech recognition solutions in different languages
US11043213B2 (en) System and method for detection and correction of incorrectly pronounced words
JP2017058674A (en) Apparatus and method for speech recognition, apparatus and method for training transformation parameter, computer program and electronic apparatus
CN110797010A (en) Question-answer scoring method, device, equipment and storage medium based on artificial intelligence
CN110010136B (en) Training and text analysis method, device, medium and equipment for prosody prediction model
CN109376360A (en) A kind of method and apparatus of assisted learning language
CN111738023A (en) Automatic image-text audio translation method and system
CN110717341A (en) Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN110648654A (en) Speech recognition enhancement method and device introducing language vectors
CN112530404A (en) Voice synthesis method, voice synthesis device and intelligent equipment
CN112489634A (en) Language acoustic model training method and device, electronic equipment and computer medium
Azim et al. Large vocabulary Arabic continuous speech recognition using tied states acoustic models
Lu et al. Implementation of embedded unspecific continuous English speech recognition based on HMM
KR20160106363A (en) Smart lecture system and method
CN114519358A (en) Translation quality evaluation method and device, electronic equipment and storage medium
CN110858268B (en) Method and system for detecting unsmooth phenomenon in voice translation system
CN109446537B (en) Translation evaluation method and device for machine translation
Tits et al. Flowchase: a Mobile Application for Pronunciation Training
US11817079B1 (en) GAN-based speech synthesis model and training method
CN116229994B (en) Construction method and device of label prediction model of Arabic language
CN116386637B (en) Radar flight command voice instruction generation method and system
CN113362803B (en) ARM side offline speech synthesis method, ARM side offline speech synthesis device and storage medium
CN113555006B (en) Voice information identification method and device, electronic equipment and storage medium
RKDMP et al. Real-Time Sign Language Translator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20240517