CN111738023A - Automatic image-text audio translation method and system - Google Patents
- Publication number
- CN111738023A (application number CN202010587361.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- language
- target language
- model
- translation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
- G06N3/045: Combinations of networks
- G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08: Learning methods
- G10L15/005: Language recognition
- G10L15/063: Training
- G10L15/16: Speech classification or search using artificial neural networks
Abstract
The invention discloses an automatic image-text audio translation method comprising: acquiring one or more of voice data, image data and action data; performing preliminary processing on the acquired voice data and inputting the result into a pre-trained translation model to obtain corresponding target language data; recognizing feature information in the acquired image data and translating the recognition result to obtain corresponding target language data; formatting the acquired action data and inputting it into a pre-constructed LSTM neural network model to obtain action expression data; and either directly outputting the obtained target language data or action expression data, or performing similarity matching on the obtained target language data and/or action expression data to obtain fused language data. The invention can recognize and translate the acquired voice data, image data and action data separately, improving translation accuracy, and can express arbitrary translation scenarios with good adaptability.
Description
Technical Field
The invention relates to the technical field of translation, and in particular to an automatic image-text audio translation method and an automatic image-text audio translation system.
Background
Because human beings speak and write in many different languages, people who have not learned a foreign language still depend on on-site interpreters, and books and publications still require laborious manual translation of various kinds. Meanwhile, the translation software, speech-and-text inter-translation software and APP translation platforms currently found on internet platforms suffer from delays that prevent genuine real-time use anywhere and anytime, and the staffing, facilities and equipment behind such platforms, especially for simultaneous interpretation, represent a complex and enormous investment of manpower and material resources.
In recent years, the popularity of neural machine translation (NMT) has steadily raised the quality of machine translation, but linguistic expression is still constrained by the language variety, local accent and expressive habits (such as body language) of the speaker, so translation results remain unsatisfactory. For example, parallel corpora between a major language such as English and a minor language, minority language or dialect are very scarce, so some pronunciations have no corresponding vocabulary to output; or a speaker may express something with body movements while talking, which voice capture alone cannot recognize or translate.
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an automatic image-text audio translation method and system that obtain translation results from image, text and audio information and improve translation accuracy.
To achieve this purpose, the technical solution of the invention comprises the following:
in one aspect, the invention provides a method for automatically translating image-text audio, which comprises the following steps:
acquiring one or more of voice data, image data and motion data as source data;
performing primary processing on the acquired voice data, and inputting a primary processing result into a pre-trained translation model to obtain corresponding target language data;
identifying the characteristic information of the acquired image data, and translating the identified characteristic information result to obtain corresponding target language data;
formatting the acquired action data, and inputting the formatting result into a pre-constructed LSTM neural network model to obtain action expression data;
when the source data is one type, directly outputting the obtained target language data or action expression data; and when the source data are various, performing similarity matching on the obtained target language data and/or action expression data to obtain fusion language data.
Further, in the above method for automatically translating image-text and audio, the performing preliminary processing on the obtained speech data and inputting the preliminary processing result into a pre-trained translation model to obtain corresponding target language data includes:
carrying out rate conversion on the acquired voice data to obtain standard sampling rate voice data;
inputting standard sampling rate data into a language identification model to obtain a language judgment result;
inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and performing in the translation model:
outputting language identification information of a source language from the voice data through a pre-trained deep neural network,
and converting the language identification information of the source language through a cross-language labeling model, and outputting target language data.
Further, in the above method for automatically translating graphics, text and audio, the cross-language labeling model is obtained by pre-constructing and training:
classifying languages according to languages, and dividing subclasses according to standard pronunciation and accent under the languages;
generating cross-language corresponding relations of the sub-languages in all languages according to any pairwise corresponding mode,
the cross-language correspondence converts data in any source language into its corresponding target-language data, taking words, phrases and sentences as conversion units;
and carrying out association labeling between any two languages according to the conversion unit, and training the cross-language corresponding relation to form the cross-language labeling model.
Further, in the above method for automatically translating image-text and audio, the identifying the feature information of the acquired image data and translating the identified result of the feature information to obtain corresponding target language data includes
Identifying characteristic information in the image through a pre-trained convolutional neural network, and determining a character and/or a graphic object contained in the image;
and translating the recognized text and/or graphic objects to obtain corresponding target language data.
Further, in the above method for automatically translating graphics, text and audio, formatting the acquired motion data, and inputting the result of formatting into a pre-constructed LSTM neural network model to obtain motion expression data, including
Acquiring motion data through a posture sensor worn by a limb;
denoising and feature extraction are carried out on the motion data to obtain formatted input data, and the formatted input data are input into a pre-constructed LSTM neural network model to obtain motion expression data;
and outputting the action expression data as target language data.
In a second aspect, the invention also provides an automatic translation system for image-text and audio, which comprises
The data input module is used for acquiring one or more of voice data, image data and motion data as source data;
the voice data processing module is used for carrying out primary processing on the acquired voice data and inputting a primary processing result into a pre-trained translation model so as to obtain corresponding target language data;
the image data processing module is used for identifying the characteristic information of the acquired image data and translating the identified characteristic information result to obtain corresponding target language data;
the action data processing module is used for formatting the acquired action data and inputting the formatting result into a pre-constructed LSTM neural network model to obtain action expression data;
the data output module is used for directly outputting the obtained target language data or the obtained action expression data when the source data is one type; and when the source data are various, performing similarity matching on the obtained target language data and/or action expression data to obtain fusion language data output.
Further, in the above system for automatically translating graphics, text and audio, the voice data processing module specifically executes:
carrying out rate conversion on the acquired voice data to obtain standard sampling rate voice data;
inputting standard sampling rate data into a language identification model to obtain a language judgment result;
inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and performing in the translation model:
outputting language identification information of a source language from the voice data through a pre-trained deep neural network,
and converting the language identification information of the source language through a cross-language labeling model, and outputting target language data.
Further, in the above system for automatically translating graphics, text and audio, the cross-language labeling model is obtained by pre-constructing and training:
classifying languages according to languages, and dividing subclasses according to standard pronunciation and accent under the languages;
generating cross-language corresponding relations of the sub-languages in all languages according to any pairwise corresponding mode,
the cross-language correspondence converts data in any source language into its corresponding target-language data, taking words, phrases and sentences as conversion units;
and carrying out association labeling between any two languages according to the conversion unit, and training the cross-language corresponding relation to form the cross-language labeling model.
Further, in the above system for automatically translating graphics, text and audio, the image data processing module specifically executes:
identifying characteristic information in the image through a pre-trained convolutional neural network, and determining a character and/or a graphic object contained in the image;
and translating the recognized text and/or graphic objects to obtain corresponding target language data.
Further, in the above system for automatically translating graphics, text and audio, the motion data processing module specifically executes:
acquiring motion data through a posture sensor worn by a limb;
denoising and feature extraction are carried out on the motion data to obtain formatted input data, and the formatted input data are input into a pre-constructed LSTM neural network model to obtain motion expression data;
and outputting the action expression data as target language data.
Compared with the prior art, the invention has the beneficial effects that:
according to the image-text audio automatic translation method and system, recognition and translation can be respectively carried out on the basis of the acquired voice data, image data and action data, and similarity matching is carried out by fusing translation results of different source data when multiple source data are acquired for translation, so that translation precision is improved; furthermore, when the cross-language voice is translated, the language labeling relation between languages and accents is perfected through the cross-language corresponding relation, and the language labeling resource of the translation model is perfected, so that the translation result can be obtained when the acquired voice data is of a small language or with local accents; the system can correspondingly express random (such as containing limb actions) translation scenes, and has better adaptability.
Drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
FIG. 1 is a flow chart of a method for automatically translating graphics, text and audio according to the present invention;
FIG. 2 is a flow chart of the processing of voice data as shown in FIG. 1;
FIG. 3 is a flow chart of the image data processing as shown in FIG. 1;
FIG. 4 is a logic block diagram of an automatic translation system for graphics, text and audio according to the present invention;
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.
Example 1
As shown in fig. 1, in one aspect, the present invention provides an automatic image-text audio translation method, including the following steps:
S1, acquiring one or more of voice data, image data and action data as source data;
S2A, performing primary processing on the acquired voice data, and inputting a primary processing result into a pre-trained translation model to obtain corresponding target language data;
S2B, identifying the characteristic information of the acquired image data, and translating the identified characteristic information result to obtain corresponding target language data;
S2C, formatting the acquired action data, and inputting a formatting result into a pre-constructed LSTM neural network model to obtain action expression data;
S3, when the source data is one type, directly outputting the obtained target language data or action expression data; and when the source data are various, performing similarity matching on the obtained target language data and/or action expression data to obtain fusion language data.
In the method, source data such as voice data, image data and action data can each be recognized and translated independently, and fused output can be produced when multiple kinds of source data are obtained, which solves the problem of missing translation output that arises when body movements substitute for speech in spoken communication.
In a specific implementation provided by the invention, in step S1, one or more of voice data, image data and motion data are acquired as source data. The voice data may be in a major language such as Chinese, English, Russian or Japanese, or in a minor language such as French, German, Spanish, Arabic, Vietnamese or Lao. The image data is a photo taken on the local device or a picture imported from a third-party platform; the photo or picture may contain characters and graphics, such as common objects needing translation: signs, buses, buildings, animals and plants, food, and so on. The motion data is data describing movements made by the limbs.
The source data can be acquired independently or captured simultaneously. The captured source data enters different processing flows according to the types of the source data.
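The routing just described, one processing branch per source-data type with direct output for a single source and fusion for several, can be sketched as follows. All function names here are hypothetical placeholders standing in for the patent's models, not part of the patent itself:

```python
# Hypothetical dispatcher: routes each captured source-data item to the
# processing branch of steps S2A-S2C, then outputs directly or fuses (S3).

def translate_speech(audio):
    return f"speech->{audio}"      # stands in for S2A (pre-processing + translation model)

def translate_image(image):
    return f"image->{image}"       # stands in for S2B (CNN recognition + translation)

def translate_motion(motion):
    return f"motion->{motion}"     # stands in for S2C (formatting + LSTM model)

BRANCHES = {"speech": translate_speech, "image": translate_image, "motion": translate_motion}

def fuse(results):
    # stand-in for the similarity-matching fusion described in step S3
    return " + ".join(results)

def translate(sources: dict) -> str:
    results = [BRANCHES[kind](payload) for kind, payload in sources.items()]
    if len(results) == 1:
        return results[0]          # single source type: output directly
    return fuse(results)           # several source types: fuse the results
```

A call such as `translate({"speech": "hi", "motion": "wave"})` exercises the multi-source path.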
When the acquired source data is voice data, as shown in fig. 2, performing step s2a. to perform preliminary processing on the acquired voice data, and inputting a result of the preliminary processing into a pre-trained translation model to obtain corresponding target language data:
S2A1, performing rate conversion on the acquired voice data to obtain standard sampling rate voice data;
Generally, the standard-sampling-rate voice data uses a 22 kHz sampling rate with a 16-bit sample width.
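A minimal sketch of this rate-conversion step, assuming plain linear interpolation (a production system would use a proper polyphase resampling filter); the function name and its 22 kHz default are illustrative:

```python
def resample_linear(samples, src_rate, dst_rate=22000):
    """Resample a mono PCM sample sequence to dst_rate by linear interpolation.

    A toy stand-in for the patent's rate-conversion step; real pipelines
    would apply anti-aliasing filtering as well.
    """
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        v = samples[lo] * (1 - frac) + samples[hi] * frac
        # clamp to the signed 16-bit range matching the 16-bit sample width
        out.append(max(-32768, min(32767, int(round(v)))))
    return out
```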
S2A2, inputting standard sampling rate data into a language identification model to obtain a language judgment result;
In this embodiment, the language identification model may be a currently mature model built around a Hidden Markov Model (HMM) and an N-gram model, the details of which are not repeated here.
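To hint at the n-gram side of such a language identification model, here is a toy character-bigram identifier. The tiny training strings and the overlap score are illustrative stand-ins; a real system scores acoustic-model outputs with trained HMM/N-gram statistics:

```python
from collections import Counter

# Toy character-bigram language identifier (illustrative only).

def bigram_profile(text):
    t = f" {text.lower()} "                       # pad so word boundaries form bigrams
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

TRAINING = {  # hypothetical miniature corpora, one per language
    "en": "the quick brown fox jumps over the lazy dog this is english text",
    "de": "der schnelle braune fuchs springt ueber den faulen hund deutscher text",
}
PROFILES = {lang: bigram_profile(txt) for lang, txt in TRAINING.items()}

def identify_language(text):
    """Return the training language whose bigram profile overlaps the probe most."""
    probe = bigram_profile(text)
    def overlap(profile):
        return sum(min(count, profile[gram]) for gram, count in probe.items())
    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))
```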
S2A3, inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and performing in the translation model:
outputting language identification information of the source language from the voice data through a pre-trained deep neural network; current deep neural network models can already recognize accented speech data and convert its lexical content into computer-readable input. This step performs content recognition on source data whose language has already been determined; the recognized language data must then be further translated into the target language.
In this embodiment, the language conversion is implemented by the cross-language markup model, and then the target language data is output. Wherein the cross-language labeling model is obtained by pre-constructing and training:
classifying languages by language, and dividing subclasses under each language by standard pronunciation and accent; each language can be divided into several accents by region. For example, Chinese divides by local accent into Mandarin, Henan dialect, Cantonese, Minnan and so on.
Generating cross-language corresponding relations of the sub-languages in all languages according to any pairwise corresponding mode,
the cross-language correspondence converts data in any source language into its corresponding target-language data, taking words, phrases and sentences as conversion units;
and carrying out association labeling between any two languages according to the conversion unit, and training the cross-language corresponding relation to form the cross-language labeling model.
For example, in the cross-language correspondence, using the word as the conversion unit, the association between English and Chinese Mandarin for the word "tomorrow" is labeled "tomorrow" → "明天", the association between English and Chinese Cantonese is labeled "tomorrow" → "听日" (literally "hear day"), and the association between Chinese Cantonese and Mandarin is labeled "听日" → "明天", so that spoken translations such as English to Chinese (Mandarin), English to Cantonese, and Cantonese to Mandarin can all be realized.
According to daily common expressions, association labeling of a sentence as a conversion unit can also be performed.
Through the cross-language labeling model, translation between any two languages can be refined down to the conversion unit, so that minor languages and local accents can be recognized and correspondingly converted for output. This enriches the parallel corpora between major languages and minor languages, minority languages or dialects, improving the corpus completeness of the translation system and the accuracy of translation results.
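The pairwise correspondence at the word conversion unit can be pictured as a generated lookup table. The romanized entries and language tags below are illustrative, and a real labeling model would be trained rather than hand-coded:

```python
from itertools import permutations

# Illustrative concept lexicon: concept -> surface form per (sub)language.
LEXICON = {
    "TOMORROW": {"en": "tomorrow", "zh-mandarin": "mingtian", "zh-cantonese": "tingyat"},
}

def build_pairwise(lexicon):
    """Generate the 'any pairwise' correspondence:
    (src_lang, dst_lang, src_word) -> dst_word, covering every language pair."""
    table = {}
    for forms in lexicon.values():
        for src, dst in permutations(forms, 2):
            table[(src, dst, forms[src])] = forms[dst]
    return table

TABLE = build_pairwise(LEXICON)

def convert(word, src, dst):
    # fall back to the input word when no labeled correspondence exists
    return TABLE.get((src, dst, word), word)
```

Three sub-languages yield six directed pairs, matching the "any pairwise corresponding mode" in the text.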
When the acquired source data is image data, step S2B is performed: feature information recognition is applied to the acquired image data, and the recognized feature information is translated to obtain corresponding target language data, as shown in fig. 3, including:
S2B1, identifying characteristic information in the image through a pre-trained convolutional neural network, and determining a character and/or a graphic object contained in the image;
S2B2, translating the recognized character and/or graphic object to obtain corresponding target language data.
The pre-trained convolutional neural network (CNN) can recognize the characters and graphics contained in a given picture, for example which kind of animal, plant or vehicle it shows, and the recognized character and graphic information is transcribed directly into text data in the target language for display or broadcast. For example, given a photo of a dumpling, after CNN recognition, if the target language is Chinese the output text is the Chinese word for dumpling (饺子), which the hardware can play back as speech; if the target language is English, the output text is "dumpling".
This step can automatically recognize and translate a photo or picture of an unknown object to obtain the desired target language data.
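Once the CNN has produced a class label, turning it into target-language text is essentially a lexicon lookup. The sketch below stubs out the recognizer (a real system would run the trained CNN there) and uses an illustrative two-entry lexicon:

```python
# Illustrative label -> target-language lexicon; entries are examples only.
LABEL_LEXICON = {
    "dumpling": {"en": "dumpling", "zh": "jiaozi"},
    "bus":      {"en": "bus",      "zh": "gonggongqiche"},
}

def recognize(image_bytes):
    # Stub: a real implementation runs the pre-trained CNN here and
    # returns its top-scoring class label for the picture.
    return "dumpling"

def translate_image(image_bytes, target_lang):
    """Recognize the object in the picture, then look up its name
    in the requested target language for display or broadcast."""
    label = recognize(image_bytes)
    return LABEL_LEXICON[label][target_lang]
```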
Further, in the above method, when the acquired source data is motion data, step S2C is performed: the acquired motion data is formatted, and the formatted result is input into a pre-constructed LSTM neural network model to obtain action expression data:
S2C1, acquiring motion data through a posture sensor worn by a limb;
the attitude sensor comprises an inclination angle sensor, a three-axis gyroscope, a three-axis linear accelerometer and the like, is worn on the limb of an actor, acquires accurate dynamic precision and provides real-time motion measurement. When many people are in oral communication, along with gesture actions such as interview and call, handshake, way directing and the like, the action parameters can be used as auxiliary translation data for fusion of subsequent translation data, and a speaker (namely the actor) can be supplemented to replace default speech translation caused by oral vocabulary with actions; or may be translated directly as some commonly used sign language data.
S2C2, denoising and feature extraction are carried out on the motion data to obtain formatted input data, and the formatted input data are input into a pre-constructed LSTM neural network model to obtain motion expression data;
the LSTM neural network model is constructed by acquiring enough motion samples and training. And then denoising and characteristic extraction processing are carried out on the data after the action data are obtained to obtain formatted input data, the formatted input data enter a trained LSTM neural network model, and action expression data, namely what the action expression means, are output.
S2C3, translating the action expression data into target language data according to requirements and outputting the target language data.
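A minimal single-cell LSTM forward pass in NumPy illustrates the kind of model that maps formatted sensor feature sequences to an action-expression state. The weights here are random and untrained, so this shows only the mechanics, not a working recognizer:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. x: input feature vector; h, c: previous hidden
    and cell state; W, U, b: stacked input/forget/cell/output gate parameters."""
    z = W @ x + U @ h + b                       # stacked pre-activations, shape (4*hid,)
    hid = h.shape[0]
    i = 1 / (1 + np.exp(-z[0*hid:1*hid]))       # input gate
    f = 1 / (1 + np.exp(-z[1*hid:2*hid]))       # forget gate
    g = np.tanh(z[2*hid:3*hid])                 # candidate cell update
    o = 1 / (1 + np.exp(-z[3*hid:4*hid]))       # output gate
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def run_sequence(xs, hid):
    """Run a (timesteps, features) array through the cell; the final hidden
    state would feed a classifier head producing the action expression."""
    rng = np.random.default_rng(0)              # untrained illustrative weights
    n_in = xs.shape[1]
    W = rng.normal(0, 0.1, (4 * hid, n_in))
    U = rng.normal(0, 0.1, (4 * hid, hid))
    b = np.zeros(4 * hid)
    h, c = np.zeros(hid), np.zeros(hid)
    for x in xs:
        h, c = lstm_step(x, h, c, W, U, b)
    return h
```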
S3, when the source data is one type, directly outputting the obtained target language data or action expression data; and when the source data are various, performing similarity matching on the obtained target language data and/or action expression data to obtain fusion language data.
The similarity matching can be based on DRCN and DIIN similarity-matching models: feature extraction and similarity matching are performed on the processing results of the voice data, image data and action data (i.e. the target language data and/or action expression data). For example, if after recognition the feature content of the voice data includes "hello" and the feature content of the action data includes "handshake", the two are deeply associated feature contents and can be identified as a greeting situation; the action data is then used as an auxiliary parameter to the voice data, the two are fused into fused language data, and the final target language data "hello" is output accurately, improving the accuracy of the translated data.
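As a toy stand-in for that matching step, the sketch below fuses a speech result with an action-expression result using bag-of-words cosine similarity; the threshold, the fusion rule, and the function names are illustrative assumptions, far simpler than DRCN/DIIN models:

```python
import math

def bow(text):
    """Bag-of-words counts for a short result string."""
    counts = {}
    for w in text.lower().split():
        counts[w] = counts.get(w, 0) + 1
    return counts

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def fuse(speech_text, action_text, threshold=0.3):
    """If the action result is similar enough to the speech result, it merely
    corroborates the speech output; otherwise it is appended as auxiliary data."""
    if cosine(bow(speech_text), bow(action_text)) >= threshold:
        return speech_text
    return f"{speech_text} [{action_text}]"
```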
Example 2
In a second aspect, as shown in fig. 4, the invention further provides an automatic translation system for graphics, text and audio, comprising
The data input module is used for acquiring one or more of voice data, image data and motion data as source data;
the voice data processing module is used for carrying out primary processing on the acquired voice data and inputting a primary processing result into a pre-trained translation model so as to obtain corresponding target language data;
the image data processing module is used for identifying the characteristic information of the acquired image data and translating the identified characteristic information result to obtain corresponding target language data;
the action data processing module is used for formatting the acquired action data and inputting the formatting result into a pre-constructed LSTM neural network model to obtain action expression data;
the data output module is used for directly outputting the obtained target language data or the obtained action expression data when the source data is one type; and when the source data are various, performing similarity matching on the obtained target language data and/or action expression data to obtain fusion language data output.
Further, in the above system for automatically translating graphics, text and audio, the voice data processing module specifically executes:
carrying out rate conversion on the acquired voice data to obtain standard sampling rate voice data;
inputting standard sampling rate data into a language identification model to obtain a language judgment result;
inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and performing in the translation model:
outputting language identification information of a source language from the voice data through a pre-trained deep neural network,
and converting the language identification information of the source language through a cross-language labeling model, and outputting target language data.
Further, in the above automatic image-text audio translation system, the cross-language labeling model is pre-constructed and trained as follows:
classifying speech by language, and dividing each language into subclasses according to standard pronunciation and accent;
generating pairwise cross-language correspondences between the sub-languages of all languages,
where each cross-language correspondence converts data from any source language into the corresponding target language using words, phrases, and sentences as conversion units;
and performing association labeling between any two languages according to the conversion units, and training the cross-language correspondences to form the cross-language labeling model.
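The pairwise generation of cross-language correspondences over sub-languages might be skeletonized as below; the taxonomy shown (language codes and sub-language names) is invented for illustration:

```python
from itertools import permutations

# Hypothetical taxonomy: each language divided into sub-languages by
# standard pronunciation and accent, as the description suggests.
SUB_LANGUAGES = {
    "zh": ["zh-standard", "zh-accented"],
    "en": ["en-standard", "en-accented"],
    "fr": ["fr-standard"],
}

def build_pairwise_correspondences(taxonomy):
    """Enumerate every ordered (source, target) sub-language pair across
    different languages -- the skeleton over which word-, phrase-, and
    sentence-level association labels would then be trained."""
    all_subs = [(lang, sub) for lang, subs in taxonomy.items() for sub in subs]
    return [
        (src_sub, tgt_sub)
        for (src_lang, src_sub), (tgt_lang, tgt_sub) in permutations(all_subs, 2)
        if src_lang != tgt_lang   # keep only cross-language pairs
    ]
```

With the toy taxonomy above (five sub-languages), this yields sixteen ordered cross-language pairs, each a direction in which conversion-unit labels would be collected.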
Further, in the above automatic image-text audio translation system, the image data processing module specifically executes:
identifying feature information in the image through a pre-trained convolutional neural network, and determining the text and/or graphic objects contained in the image;
and translating the recognized text and/or graphic objects to obtain corresponding target language data.
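A minimal sketch of this image branch follows; `recognize_objects` stands in for the pre-trained convolutional neural network, and the toy lexicon replaces a real translation model (all names and entries are invented for illustration):

```python
def recognize_objects(image):
    """Placeholder for the pre-trained CNN: returns the text and graphic
    labels found in the image. `image` is an opaque handle here; a real
    system would run OCR plus an object detector."""
    return {"text": ["exit"], "objects": ["door"]}

# Toy English-to-French lexicon standing in for the translation step.
TARGET_LEXICON = {"exit": "sortie", "door": "porte"}

def translate_image(image):
    """Recognize text/graphic objects, then map each label into the
    target language, falling back to the label itself when unknown."""
    features = recognize_objects(image)
    labels = features["text"] + features["objects"]
    return [TARGET_LEXICON.get(label, label) for label in labels]
```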
Further, in the above automatic image-text audio translation system, the motion data processing module specifically executes:
acquiring motion data through posture sensors worn on the limbs;
denoising and performing feature extraction on the motion data to obtain formatted input data, and inputting the formatted input data into the pre-constructed LSTM neural network model to obtain motion expression data;
and outputting the motion expression data as target language data.
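The denoising and feature-extraction step might be sketched as below; the moving-average filter and the [timesteps, channels] layout are illustrative assumptions about the formatted input the pre-constructed LSTM would consume:

```python
def denoise(channel, window=3):
    """Moving-average smoothing of one posture-sensor channel (a stand-in
    for whatever denoising filter the description intends)."""
    half = window // 2
    out = []
    for i in range(len(channel)):
        seg = channel[max(0, i - half): i + half + 1]
        out.append(sum(seg) / len(seg))
    return out

def extract_features(channels):
    """Smooth each sensor channel, then transpose the channel-major data
    into per-timestep feature tuples -- the [timesteps, channels] sequence
    an LSTM layer would take as formatted input."""
    smoothed = [denoise(c) for c in channels]
    return list(zip(*smoothed))
```

A two-channel, five-sample recording thus becomes five feature tuples, with a single-sample spike damped by the averaging window before reaching the model.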
The system of the present invention implements the method of embodiment 1 above; the working principle of each module may therefore refer to the related description of embodiment 1 and is not repeated here.
Implementations of the invention and all of the functional operations provided herein may be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the present disclosure may be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, data processing apparatus.
A computer program (also known as a program, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code). A computer program can be deployed to be executed on one computer, or on multiple computers at one site, or distributed across multiple sites and interconnected by a communication network.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described therein may still be modified, or some or all of their technical features equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention and should be construed as falling within the scope of the claims.
Claims (10)
1. A method for automatic image-text audio translation, characterized by comprising the following steps:
acquiring one or more of voice data, image data, and motion data as source data;
performing preliminary processing on the acquired voice data, and inputting the preliminary processing result into a pre-trained translation model to obtain corresponding target language data;
identifying feature information in the acquired image data, and translating the identified feature information to obtain corresponding target language data;
formatting the acquired motion data, and inputting the formatted result into a pre-constructed LSTM neural network model to obtain motion expression data;
directly outputting the obtained target language data or motion expression data when the source data is of a single type; and performing similarity matching on the obtained target language data and/or motion expression data to obtain fused language data when the source data is of multiple types.
2. The method for automatic image-text audio translation according to claim 1, wherein performing preliminary processing on the acquired voice data and inputting the preliminary processing result into a pre-trained translation model to obtain corresponding target language data comprises:
performing rate conversion on the acquired voice data to obtain voice data at a standard sampling rate;
inputting the standard-sampling-rate data into a language identification model to obtain a language judgment result;
inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and, within the translation model:
outputting language identification information of the source language from the voice data through a pre-trained deep neural network,
and converting the language identification information of the source language through a cross-language labeling model to output target language data.
3. The method for automatic image-text audio translation according to claim 2, wherein the cross-language labeling model is pre-constructed and trained as follows:
classifying speech by language, and dividing each language into subclasses according to standard pronunciation and accent;
generating pairwise cross-language correspondences between the sub-languages of all languages,
where each cross-language correspondence converts data from any source language into the corresponding target language using words, phrases, and sentences as conversion units;
and performing association labeling between any two languages according to the conversion units, and training the cross-language correspondences to form the cross-language labeling model.
4. The method for automatic image-text audio translation according to claim 1, wherein identifying feature information in the acquired image data and translating the identified feature information to obtain corresponding target language data comprises:
identifying feature information in the image through a pre-trained convolutional neural network, and determining the text and/or graphic objects contained in the image;
and translating the recognized text and/or graphic objects to obtain corresponding target language data.
5. The method for automatic image-text audio translation according to claim 1, wherein formatting the acquired motion data and inputting the formatted result into a pre-constructed LSTM neural network model to obtain motion expression data comprises:
acquiring motion data through posture sensors worn on the limbs;
denoising and performing feature extraction on the motion data to obtain formatted input data, and inputting the formatted input data into the pre-constructed LSTM neural network model to obtain motion expression data;
and outputting the motion expression data as target language data.
6. An automatic image-text audio translation system, characterized by comprising:
a data input module, used for acquiring one or more of voice data, image data, and motion data as source data;
a voice data processing module, used for performing preliminary processing on the acquired voice data and inputting the preliminary processing result into a pre-trained translation model to obtain corresponding target language data;
an image data processing module, used for identifying feature information in the acquired image data and translating the identified feature information to obtain corresponding target language data;
a motion data processing module, used for formatting the acquired motion data and inputting the formatted result into a pre-constructed LSTM neural network model to obtain motion expression data;
and a data output module, used for directly outputting the obtained target language data or motion expression data when the source data is of a single type, and for performing similarity matching on the obtained target language data and/or motion expression data to output fused language data when the source data is of multiple types.
7. The automatic image-text audio translation system according to claim 6, wherein the voice data processing module specifically executes:
performing rate conversion on the acquired voice data to obtain voice data at a standard sampling rate;
inputting the standard-sampling-rate data into a language identification model to obtain a language judgment result;
inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and, within the translation model:
outputting language identification information of the source language from the voice data through a pre-trained deep neural network,
and converting the language identification information of the source language through a cross-language labeling model to output target language data.
8. The automatic image-text audio translation system according to claim 7, wherein the cross-language labeling model is pre-constructed and trained as follows:
classifying speech by language, and dividing each language into subclasses according to standard pronunciation and accent;
generating pairwise cross-language correspondences between the sub-languages of all languages,
where each cross-language correspondence converts data from any source language into the corresponding target language using words, phrases, and sentences as conversion units;
and performing association labeling between any two languages according to the conversion units, and training the cross-language correspondences to form the cross-language labeling model.
9. The automatic image-text audio translation system according to claim 6, wherein the image data processing module specifically executes:
identifying feature information in the image through a pre-trained convolutional neural network, and determining the text and/or graphic objects contained in the image;
and translating the recognized text and/or graphic objects to obtain corresponding target language data.
10. The automatic image-text audio translation system according to claim 6, wherein the motion data processing module specifically executes:
acquiring motion data through posture sensors worn on the limbs;
denoising and performing feature extraction on the motion data to obtain formatted input data, and inputting the formatted input data into the pre-constructed LSTM neural network model to obtain motion expression data;
and outputting the motion expression data as target language data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010587361.6A CN111738023A (en) | 2020-06-24 | 2020-06-24 | Automatic image-text audio translation method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111738023A true CN111738023A (en) | 2020-10-02 |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN112382275A (en) * | 2020-11-04 | 2021-02-19 | 北京百度网讯科技有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN112382275B (en) * | 2020-11-04 | 2023-08-15 | 北京百度网讯科技有限公司 | Speech recognition method, device, electronic equipment and storage medium |
CN114047981A (en) * | 2021-12-24 | 2022-02-15 | 珠海金山数字网络科技有限公司 | Project configuration method and device |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1742273A (en) * | 2002-12-10 | 2006-03-01 | International Business Machines Corporation | Multimodal speech-to-speech language translation and display |
US20060293874A1 (en) * | 2005-06-27 | 2006-12-28 | Microsoft Corporation | Translation and capture architecture for output of conversational utterances |
US20110153324A1 (en) * | 2009-12-23 | 2011-06-23 | Google Inc. | Language Model Selection for Speech-to-Text Conversion |
CN106057023A (en) * | 2016-06-03 | 2016-10-26 | 北京光年无限科技有限公司 | Intelligent robot oriented teaching method and device for children |
CN107832309A (en) * | 2017-10-18 | 2018-03-23 | 广东小天才科技有限公司 | A kind of method, apparatus of language translation, wearable device and storage medium |
CN108052510A (en) * | 2017-10-23 | 2018-05-18 | 成都铅笔科技有限公司 | A kind of multilingual translation device for supporting gesture identification and voice pickup |
CN108399427A (en) * | 2018-02-09 | 2018-08-14 | 华南理工大学 | Natural interactive method based on multimodal information fusion |
CN108766414A (en) * | 2018-06-29 | 2018-11-06 | 北京百度网讯科技有限公司 | Method, apparatus, equipment and computer readable storage medium for voiced translation |
CN108960126A (en) * | 2018-06-29 | 2018-12-07 | 北京百度网讯科技有限公司 | Method, apparatus, equipment and the system of sign language interpreter |
CN109472035A (en) * | 2018-11-12 | 2019-03-15 | 深圳市友杰智新科技有限公司 | Switch the control system and method for interpretive scheme |
CN110931042A (en) * | 2019-11-14 | 2020-03-27 | 北京欧珀通信有限公司 | Simultaneous interpretation method and device, electronic equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Wang Jianhua et al., "Research on Audio-Visual Multimodal Texts with Eye-Tracking Technology: Applications, Current Status and Prospects", Foreign Languages and Cultures, vol. 4, no. 1, pp. 133-142 *
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| AD01 | Patent right deemed abandoned | Effective date of abandoning: 20240517 |