CN111738023A - Automatic image-text audio translation method and system - Google Patents

Automatic image-text audio translation method and system

Info

Publication number
CN111738023A
CN111738023A (application number CN202010587361.6A)
Authority
CN
China
Prior art keywords
data
language
target language
model
translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010587361.6A
Other languages
Chinese (zh)
Inventor
宋万利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202010587361.6A
Publication of CN111738023A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/005 Language recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an automatic image-text audio translation method which includes: acquiring one or more of voice data, image data and action data; performing preliminary processing on the acquired voice data and inputting the preliminary processing result into a pre-trained translation model to obtain corresponding target language data; recognizing feature information in the acquired image data and translating the recognition result to obtain corresponding target language data; formatting the acquired action data and inputting the formatted data into a pre-constructed LSTM neural network model to obtain action expression data; and either outputting the obtained target language data or action expression data directly, or performing similarity matching on the obtained target language data and/or action expression data to obtain fused language data. The invention can recognize and translate separately on the basis of the acquired voice data, image data and action data, thereby improving translation precision, and can handle varied translation scenarios, giving it better adaptability.

Description

Automatic image-text audio translation method and system
Technical Field
The invention relates to the technical field of translation, in particular to an automatic image-text audio translation method and an automatic image-text audio translation system.
Background
Human beings speak and write many different languages. Even people who have learned a foreign language often still have to seek out on-site interpretation, and books and publications still have to be translated through various laborious processes. Meanwhile, the translation software, speech-and-text inter-translation tools and APP translation platforms currently available on Internet websites suffer from delays in real-time performance, and, especially for simultaneous interpretation, the manpower, facilities and equipment that must be arranged behind such platforms are complex and require heavy investment.
In recent years the quality of machine translation has risen steadily with the spread of neural machine translation (NMT) technology, but spoken expression is constrained by the particular language, local accents and expression habits (such as body language), so translation results are often unsatisfactory. For example, parallel corpora between a major language such as English and minor languages, minority languages or dialects are very scarce, so some pronunciations cannot be matched to corresponding vocabulary for output; or the speaker conveys part of the message with body movements during expression, which speech capture alone cannot recognize or translate.
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an automatic image-text audio translation method and system, which obtain translation results from image, text and audio information and improve translation accuracy.
In order to achieve this purpose, the technical solution of the invention is as follows:
in one aspect, the invention provides a method for automatically translating image-text audio, which comprises the following steps:
acquiring one or more of voice data, image data and motion data as source data;
performing primary processing on the acquired voice data, and inputting a primary processing result into a pre-trained translation model to obtain corresponding target language data;
identifying the characteristic information of the acquired image data, and translating the identified characteristic information result to obtain corresponding target language data;
formatting the acquired action data, and inputting the formatting result into a pre-constructed LSTM neural network model to obtain action expression data;
when the source data is one type, directly outputting the obtained target language data or action expression data; and when the source data are various, performing similarity matching on the obtained target language data and/or action expression data to obtain fusion language data.
Further, in the above method for automatically translating image-text and audio, the performing preliminary processing on the obtained speech data and inputting the preliminary processing result into a pre-trained translation model to obtain corresponding target language data includes:
carrying out rate conversion on the acquired voice data to obtain standard sampling rate voice data;
inputting standard sampling rate data into a language identification model to obtain a language judgment result;
inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and performing in the translation model:
outputting language identification information of a source language from the voice data through a pre-trained deep neural network,
and converting the language identification information of the source language through a cross-language labeling model, and outputting target language data.
Further, in the above automatic image-text audio translation method, the cross-language labeling model is obtained by pre-construction and training as follows:
classifying speech by language, and dividing each language into subclasses according to standard pronunciation and accents;
generating cross-language correspondence relations between the sub-languages of all languages in a pairwise manner, the cross-language correspondence relation converting any source language into the corresponding data of a target language with words, phrases and sentences as the conversion units;
and performing association labeling between any two languages according to the conversion unit, and training on the cross-language correspondence relations to form the cross-language labeling model.
Further, in the above method for automatically translating image-text and audio, the identifying the feature information of the acquired image data and translating the identified result of the feature information to obtain corresponding target language data includes
Identifying characteristic information in the image through a pre-trained convolutional neural network, and determining a character and/or a graphic object contained in the image;
and translating the recognized text and/or graphic objects to obtain corresponding target language data.
Further, in the above method for automatically translating graphics, text and audio, formatting the acquired motion data, and inputting the result of formatting into a pre-constructed LSTM neural network model to obtain motion expression data, including
Acquiring motion data through a posture sensor worn by a limb;
denoising and feature extraction are carried out on the motion data to obtain formatted input data, and the formatted input data are input into a pre-constructed LSTM neural network model to obtain motion expression data;
and outputting the action expression data as target language data.
In a second aspect, the invention also provides an automatic translation system for image-text and audio, which comprises
The data input module is used for acquiring one or more of voice data, image data and motion data as source data;
the voice data processing module is used for carrying out primary processing on the acquired voice data and inputting a primary processing result into a pre-trained translation model so as to obtain corresponding target language data;
the image data processing module is used for identifying the characteristic information of the acquired image data and translating the identified characteristic information result to obtain corresponding target language data;
the action data processing module is used for formatting the acquired action data and inputting the formatting result into a pre-constructed LSTM neural network model to obtain action expression data;
the data output module is used for directly outputting the obtained target language data or action expression data when the source data is of one type; and, when the source data are of several types, for performing similarity matching on the obtained target language data and/or action expression data to obtain and output fused language data.
Further, in the above system for automatically translating graphics, text and audio, the voice data processing module specifically executes:
carrying out rate conversion on the acquired voice data to obtain standard sampling rate voice data;
inputting standard sampling rate data into a language identification model to obtain a language judgment result;
inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and performing in the translation model:
outputting language identification information of a source language from the voice data through a pre-trained deep neural network,
and converting the language identification information of the source language through a cross-language labeling model, and outputting target language data.
Further, in the above automatic image-text audio translation system, the cross-language labeling model is obtained by pre-construction and training as follows:
classifying speech by language, and dividing each language into subclasses according to standard pronunciation and accents;
generating cross-language correspondence relations between the sub-languages of all languages in a pairwise manner, the cross-language correspondence relation converting any source language into the corresponding data of a target language with words, phrases and sentences as the conversion units;
and performing association labeling between any two languages according to the conversion unit, and training on the cross-language correspondence relations to form the cross-language labeling model.
Further, in the above system for automatically translating graphics, text and audio, the image data processing module specifically executes:
identifying characteristic information in the image through a pre-trained convolutional neural network, and determining a character and/or a graphic object contained in the image;
and translating the recognized text and/or graphic objects to obtain corresponding target language data.
Further, in the above system for automatically translating graphics, text and audio, the motion data processing module specifically executes:
acquiring motion data through a posture sensor worn by a limb;
denoising and feature extraction are carried out on the motion data to obtain formatted input data, and the formatted input data are input into a pre-constructed LSTM neural network model to obtain motion expression data;
and outputting the action expression data as target language data.
Compared with the prior art, the invention has the beneficial effects that:
according to the image-text audio automatic translation method and system, recognition and translation can be respectively carried out on the basis of the acquired voice data, image data and action data, and similarity matching is carried out by fusing translation results of different source data when multiple source data are acquired for translation, so that translation precision is improved; furthermore, when the cross-language voice is translated, the language labeling relation between languages and accents is perfected through the cross-language corresponding relation, and the language labeling resource of the translation model is perfected, so that the translation result can be obtained when the acquired voice data is of a small language or with local accents; the system can correspondingly express random (such as containing limb actions) translation scenes, and has better adaptability.
Drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
FIG. 1 is a flow chart of a method for automatically translating graphics, text and audio according to the present invention;
FIG. 2 is a flow chart of the processing of voice data as shown in FIG. 1;
FIG. 3 is a flow chart of the image data processing as shown in FIG. 1;
FIG. 4 is a logic block diagram of an automatic translation system for graphics, text and audio according to the present invention;
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.
Example 1
As shown in fig. 1, in one aspect, the present invention provides an automatic image-text audio translation method, including the following steps:
s1, acquiring one or more of voice data, image data and action data as source data;
S2A, performing primary processing on the acquired voice data, and inputting a primary processing result into a pre-trained translation model to obtain corresponding target language data;
S2B, identifying the characteristic information of the acquired image data, and translating the identified characteristic information result to obtain corresponding target language data;
S2C, formatting the acquired action data, and inputting a formatting result into a pre-constructed LSTM neural network model to obtain action expression data;
s3, when the source data is one type, directly outputting the obtained target language data or action expression data; and when the source data are various, performing similarity matching on the obtained target language data and/or action expression data to obtain fusion language data.
With this method, source data such as voice data, image data and action data can be recognized and translated independently, and when several types of source data are obtained they can be fused for output, which solves the problem of missing translation results that arises when body movements replace speech in spoken communication.
In a specific implementation provided by the invention, in step S1 one or more of voice data, image data and motion data are acquired as source data. The voice data can be in a major language such as Chinese, English, Russian or Japanese, or in a minor language such as French, German, Spanish, Arabic, Vietnamese or Lao; the image data is a photo taken on the local device or a picture imported from a third-party platform, and the photo or picture can contain text and graphics, for example common objects needing translation such as signs, buses, buildings, animals and plants, or dishes of food; the motion data is data describing movements made by the limbs.
The source data can be acquired independently or captured simultaneously. The captured source data enters different processing flows according to the types of the source data.
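By way of illustration only, this routing step can be pictured with a minimal Python sketch (not part of the claimed subject matter; the handler and fusion functions are hypothetical placeholders for the processing flows described below):

    from typing import Callable, Dict, List

    def translate(source: Dict[str, object],
                  handlers: Dict[str, Callable[[object], str]],
                  fuse: Callable[[List[str]], str]) -> str:
        """Route each captured modality ('voice', 'image', 'motion') to its own flow."""
        results = [handlers[kind](data)
                   for kind, data in source.items() if data is not None]
        if len(results) == 1:
            return results[0]      # a single source type: output directly (step S3)
        return fuse(results)       # several source types: similarity fusion (step S3)

    # Usage with trivial placeholder handlers:
    demo_handlers = {"voice": lambda d: "hello", "motion": lambda d: "handshake"}
    print(translate({"voice": b"...", "motion": [0.1, 0.2]},
                    demo_handlers, fuse=lambda r: " / ".join(r)))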
When the acquired source data is voice data, as shown in fig. 2, step S2A is performed: preliminary processing is carried out on the acquired voice data, and the preliminary processing result is input into a pre-trained translation model to obtain the corresponding target language data:
S2A1, performing rate conversion on the acquired voice data to obtain standard sampling rate voice data;
Generally, the standard sampling rate voice data is set to a sampling rate of 22 kHz with a sample bit width of 16 bits.
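As a sketch of this conversion step (assuming the librosa and soundfile packages; any equivalent resampling tool could be used), audio of arbitrary sampling rate can be converted to the 22 kHz / 16-bit format mentioned above:

    import librosa
    import soundfile as sf

    STANDARD_SR = 22050       # standard sampling rate (~22 kHz)
    SAMPLE_FORMAT = "PCM_16"  # 16-bit sample width

    def to_standard_rate(in_path: str, out_path: str) -> str:
        # librosa resamples to the requested rate while loading
        samples, _ = librosa.load(in_path, sr=STANDARD_SR, mono=True)
        sf.write(out_path, samples, STANDARD_SR, subtype=SAMPLE_FORMAT)
        return out_path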
S2A2, inputting standard sampling rate data into a language identification model to obtain a language judgment result;
In this embodiment, the language identification model may be a currently mature model built around a Hidden Markov Model (HMM) and an N-gram model; its details are not repeated here.
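A much-simplified stand-in for such a language identification model is sketched below: instead of the HMM/N-gram pipeline named above, it scores MFCC features of the input against one Gaussian mixture model per language (the corpus file paths and language tags are hypothetical):

    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def mfcc_features(wav_path, sr=22050):
        y, _ = librosa.load(wav_path, sr=sr)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # frames x 13

    def train_language_models(corpus):
        """corpus maps a language tag (e.g. 'zh', 'en') to a list of wav files."""
        models = {}
        for lang, files in corpus.items():
            feats = np.vstack([mfcc_features(f) for f in files])
            models[lang] = GaussianMixture(n_components=16,
                                           covariance_type="diag").fit(feats)
        return models

    def identify_language(wav_path, models):
        feats = mfcc_features(wav_path)
        # choose the language whose model assigns the highest average log-likelihood
        return max(models, key=lambda lang: models[lang].score(feats))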
S2A3, inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and performing in the translation model:
outputting language identification information of the source language from the voice data through a pre-trained deep neural network; current deep neural network models are already able to recognize accented speech data and convert its lexical content into computer-readable input. In this step, speech recognition is performed on the source data whose language has been determined, and the recognized language data then needs to be further translated into the target language.
In this embodiment, the language conversion is implemented by the cross-language labeling model, after which the target language data is output. The cross-language labeling model is obtained by pre-construction and training as follows:
Classifying speech by language, and dividing each language into subclasses according to standard pronunciation and accents; each language can be split into a number of accents by region, for example Chinese is divided into Mandarin, Henan dialect, Cantonese, Southern Min and the like according to local accents.
Cross-language correspondence relations are then generated between the sub-languages of all languages in a pairwise manner; a cross-language correspondence relation converts any source language into the corresponding data of a target language, with words, phrases and sentences as the conversion units.
Finally, association labeling is performed between any two languages according to the conversion unit, and the cross-language correspondence relations are trained to form the cross-language labeling model.
For example, in the cross-language correspondence, with the word as the conversion unit, the English word "tomorrow" is labeled against the Mandarin Chinese "明天" and against the Cantonese "聽日" (literally "hear day"), and the Cantonese "聽日" is in turn labeled against the Mandarin "明天"; in this way spoken translations such as English into Chinese (Mandarin), English into Cantonese, and Cantonese into Mandarin can be realized.
Association labeling can also be performed with the sentence as the conversion unit, based on common everyday expressions.
With the cross-language labeling model, translation between any two languages can be refined down to the level of the conversion unit, so that minor languages and local accents can be recognized, converted and output correspondingly, and the parallel corpora from major languages to minor languages, minority languages or dialects are enriched; this improves the completeness of the translation system's corpora and the accuracy of the translation results.
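The pairwise correspondence described above can be pictured as a lookup table keyed by (source sub-language, target sub-language); the toy Python sketch below reuses the "tomorrow" example from the preceding paragraphs, with all entries given purely for illustration:

    from typing import Dict, Optional, Tuple

    SubLang = Tuple[str, str]   # (language, accent), e.g. ("zh", "cantonese")

    cross_lang_table: Dict[Tuple[SubLang, SubLang], Dict[str, str]] = {
        (("en", "standard"), ("zh", "mandarin")):  {"tomorrow": "明天"},
        (("en", "standard"), ("zh", "cantonese")): {"tomorrow": "聽日"},
        (("zh", "cantonese"), ("zh", "mandarin")): {"聽日": "明天"},
    }

    def convert(unit: str, src: SubLang, dst: SubLang) -> Optional[str]:
        """Look up one conversion unit (word, phrase or sentence) pairwise."""
        return cross_lang_table.get((src, dst), {}).get(unit)

    print(convert("tomorrow", ("en", "standard"), ("zh", "cantonese")))  # 聽日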
When the acquired source data is image data, step S2B is performed: feature information recognition is performed on the acquired image data, and the recognized feature-information result is translated to obtain the corresponding target language data; as shown in fig. 3, this includes:
S2B1, identifying characteristic information in the image through a pre-trained convolutional neural network, and determining a character and/or a graphic object contained in the image;
S2B2, translating the recognized character and/or graphic object to obtain corresponding target language data.
The pre-trained convolutional neural network (CNN) can identify the text and graphics contained in a given picture, for example which kind of animal, plant or vehicle the picture shows, and the recognized text and graphic information is directly transcribed into text data in the target language for display or broadcast. For example, suppose a photo of a dumpling is taken and recognized by the convolutional neural network: if the target language is Chinese, the output target-language text is "饺子", which the hardware can also play back as speech; if the target language is English, the output target-language text is "dumpling".
This step can automatically recognize and translate a photo or picture of an unknown object to obtain the desired target language data, as sketched below.
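As an illustrative sketch (assuming PyTorch/torchvision and a generic pre-trained classifier; the label-to-language table is a hypothetical placeholder, and a stock ImageNet label set will not cover every object such as specific dishes), the recognize-then-transcribe flow can look like this:

    import torch
    from PIL import Image
    from torchvision.models import resnet18, ResNet18_Weights

    # Hypothetical label-to-target-language table; a real system covers the full label set.
    LABEL_TRANSLATIONS = {"zh": {"dumpling": "饺子"}, "en": {"dumpling": "dumpling"}}

    weights = ResNet18_Weights.DEFAULT
    model = resnet18(weights=weights).eval()    # pre-trained CNN classifier
    preprocess = weights.transforms()
    categories = weights.meta["categories"]     # human-readable class names

    def recognize_and_translate(image_path: str, target_lang: str) -> str:
        img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            class_idx = model(img).argmax(dim=1).item()
        label = categories[class_idx]
        # fall back to the recognized label itself if no translation entry exists
        return LABEL_TRANSLATIONS.get(target_lang, {}).get(label, label)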
Further, in the above automatic image-text audio translation method, when the acquired source data is motion data, step S2C is performed: the acquired motion data is formatted, and the formatted result is input into a pre-constructed LSTM neural network model to obtain motion expression data:
S2C1, acquiring motion data through a posture sensor worn by a limb;
the attitude sensor comprises an inclination angle sensor, a three-axis gyroscope, a three-axis linear accelerometer and the like, is worn on the limb of an actor, acquires accurate dynamic precision and provides real-time motion measurement. When many people are in oral communication, along with gesture actions such as interview and call, handshake, way directing and the like, the action parameters can be used as auxiliary translation data for fusion of subsequent translation data, and a speaker (namely the actor) can be supplemented to replace default speech translation caused by oral vocabulary with actions; or may be translated directly as some commonly used sign language data.
S2C2, denoising and feature extraction are carried out on the motion data to obtain formatted input data, and the formatted input data are input into a pre-constructed LSTM neural network model to obtain motion expression data;
the LSTM neural network model is constructed by acquiring enough motion samples and training. And then denoising and characteristic extraction processing are carried out on the data after the action data are obtained to obtain formatted input data, the formatted input data enter a trained LSTM neural network model, and action expression data, namely what the action expression means, are output.
S2C3, translating the action expression data into target language data according to requirements and outputting the target language data.
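A minimal PyTorch sketch of such an LSTM classifier follows; the feature dimension, label set and sequence length are hypothetical, and in practice the model must first be trained on a sufficient number of labeled motion samples:

    import torch
    import torch.nn as nn

    class MotionLSTM(nn.Module):
        """Map a sequence of denoised posture-sensor feature frames to a motion label."""
        def __init__(self, n_features: int = 9, hidden: int = 64, n_classes: int = 8):
            super().__init__()
            # e.g. 9 features per frame: 3-axis gyroscope + 3-axis accelerometer + tilt angles
            self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            _, (h_n, _) = self.lstm(x)     # x: (batch, time, n_features)
            return self.head(h_n[-1])      # classify from the final hidden state

    MOTION_LABELS = ["handshake", "wave", "point_direction", "nod",
                     "beckon", "clap", "bow", "other"]   # hypothetical label set

    model = MotionLSTM(n_classes=len(MOTION_LABELS))
    frames = torch.randn(1, 50, 9)                       # 50 formatted feature frames
    print(MOTION_LABELS[model(frames).argmax(dim=1).item()])  # untrained: arbitrary label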
S3, when the source data is one type, directly outputting the obtained target language data or action expression data; and when the source data are various, performing similarity matching on the obtained target language data and/or action expression data to obtain fusion language data.
The similarity matching can be based on DRCN and DIIN similarity matching models: feature extraction and similarity matching are performed on the processing results of the voice data, image data and action data (i.e. the target language data and/or action expression data). For example, if after recognition the feature content of the voice data includes "hello" and the feature content of the action data includes "handshake", the two are deeply associated features and the situation can be determined to be a greeting; the action data is then used as an auxiliary parameter to the voice data, the two are fused into fused language data, and the final target language data "hello" is output accurately, improving the accuracy of the translated data.
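The fusion decision can be pictured with the following much-simplified stand-in (the association table is a hypothetical placeholder; a trained DRCN/DIIN matcher would learn such relations rather than enumerate them):

    # Hypothetical table of deeply associated feature contents across modalities.
    ASSOCIATIONS = {
        ("hello", "handshake"): "greeting",
        ("goodbye", "wave"): "farewell",
    }

    def fuse(voice_text: str, motion_label: str) -> str:
        """If the voice result and the motion result are known to be associated, treat
        the motion as auxiliary confirmation and keep the voice translation; otherwise
        append the motion expression so no information is lost."""
        for (speech_cue, motion_cue), _scene in ASSOCIATIONS.items():
            if speech_cue in voice_text and motion_cue == motion_label:
                return voice_text                  # e.g. confirmed greeting: "hello"
        return f"{voice_text} [{motion_label}]"

    print(fuse("hello", "handshake"))              # -> hello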
Example 2
In a second aspect, as shown in fig. 4, the invention further provides an automatic translation system for graphics, text and audio, comprising
The data input module is used for acquiring one or more of voice data, image data and motion data as source data;
the voice data processing module is used for carrying out primary processing on the acquired voice data and inputting a primary processing result into a pre-trained translation model so as to obtain corresponding target language data;
the image data processing module is used for identifying the characteristic information of the acquired image data and translating the identified characteristic information result to obtain corresponding target language data;
the action data processing module is used for formatting the acquired action data and inputting the formatting result into a pre-constructed LSTM neural network model to obtain action expression data;
the data output module is used for directly outputting the obtained target language data or action expression data when the source data is of one type; and, when the source data are of several types, for performing similarity matching on the obtained target language data and/or action expression data to obtain and output fused language data.
Further, in the above system for automatically translating graphics, text and audio, the voice data processing module specifically executes:
carrying out rate conversion on the acquired voice data to obtain standard sampling rate voice data;
inputting standard sampling rate data into a language identification model to obtain a language judgment result;
inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and performing in the translation model:
outputting language identification information of a source language from the voice data through a pre-trained deep neural network,
and converting the language identification information of the source language through a cross-language labeling model, and outputting target language data.
Further, in the above automatic image-text audio translation system, the cross-language labeling model is obtained by pre-construction and training as follows:
classifying speech by language, and dividing each language into subclasses according to standard pronunciation and accents;
generating cross-language correspondence relations between the sub-languages of all languages in a pairwise manner, the cross-language correspondence relation converting any source language into the corresponding data of a target language with words, phrases and sentences as the conversion units;
and performing association labeling between any two languages according to the conversion unit, and training on the cross-language correspondence relations to form the cross-language labeling model.
Further, in the above system for automatically translating graphics, text and audio, the image data processing module specifically executes:
identifying characteristic information in the image through a pre-trained convolutional neural network, and determining a character and/or a graphic object contained in the image;
and translating the recognized text and/or graphic objects to obtain corresponding target language data.
Further, in the above system for automatically translating graphics, text and audio, the motion data processing module specifically executes:
acquiring motion data through a posture sensor worn by a limb;
denoising and feature extraction are carried out on the motion data to obtain formatted input data, and the formatted input data are input into a pre-constructed LSTM neural network model to obtain motion expression data;
and outputting the action expression data as target language data.
The system of the present invention is used to implement the method in the above embodiment 1 of the present invention, and therefore, the working principle of each module of the system may refer to the related description of the above embodiment 1, and is not described again.
Implementations of the invention and all of the functional operations provided herein may be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the present disclosure may be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, data processing apparatus.
A computer program (also known as a program, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims (10)

1. An automatic image-text audio translation method, characterized by comprising the following steps:
acquiring one or more of voice data, image data and motion data as source data;
performing primary processing on the acquired voice data, and inputting a primary processing result into a pre-trained translation model to obtain corresponding target language data;
identifying the characteristic information of the acquired image data, and translating the identified characteristic information result to obtain corresponding target language data;
formatting the acquired action data, and inputting the formatting result into a pre-constructed LSTM neural network model to obtain action expression data;
when the source data is one type, directly outputting the obtained target language data or action expression data; and when the source data are various, performing similarity matching on the obtained target language data and/or action expression data to obtain fusion language data.
2. The automatic image-text audio translation method according to claim 1, wherein performing preliminary processing on the acquired voice data and inputting the preliminary processing result into a pre-trained translation model to obtain corresponding target language data comprises:
carrying out rate conversion on the acquired voice data to obtain standard sampling rate voice data;
inputting standard sampling rate data into a language identification model to obtain a language judgment result;
inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and performing in the translation model:
outputting language identification information of a source language from the voice data through a pre-trained deep neural network,
and converting the language identification information of the source language through a cross-language labeling model, and outputting target language data.
3. The automatic image-text audio translation method according to claim 2, wherein the cross-language labeling model is obtained by pre-construction and training as follows:
classifying speech by language, and dividing each language into subclasses according to standard pronunciation and accents;
generating cross-language correspondence relations between the sub-languages of all languages in a pairwise manner, the cross-language correspondence relation converting any source language into the corresponding data of a target language with words, phrases and sentences as the conversion units;
and performing association labeling between any two languages according to the conversion unit, and training on the cross-language correspondence relations to form the cross-language labeling model.
4. The automatic image-text audio translation method according to claim 1, wherein identifying the feature information of the acquired image data and translating the recognized feature-information result to obtain corresponding target language data comprises:
Identifying characteristic information in the image through a pre-trained convolutional neural network, and determining a character and/or a graphic object contained in the image;
and translating the recognized text and/or graphic objects to obtain corresponding target language data.
5. The automatic image-text audio translation method according to claim 1, wherein formatting the acquired motion data and inputting the formatted result into a pre-constructed LSTM neural network model to obtain motion expression data comprises:
Acquiring motion data through a posture sensor worn by a limb;
denoising and feature extraction are carried out on the motion data to obtain formatted input data, and the formatted input data are input into a pre-constructed LSTM neural network model to obtain motion expression data;
and outputting the action expression data as target language data.
6. An automatic image-text audio translation system, characterized by comprising:
The data input module is used for acquiring one or more of voice data, image data and motion data as source data;
the voice data processing module is used for carrying out primary processing on the acquired voice data and inputting a primary processing result into a pre-trained translation model so as to obtain corresponding target language data;
the image data processing module is used for identifying the characteristic information of the acquired image data and translating the identified characteristic information result to obtain corresponding target language data;
the action data processing module is used for formatting the acquired action data and inputting the formatting result into a pre-constructed LSTM neural network model to obtain action expression data;
the data output module is used for directly outputting the obtained target language data or action expression data when the source data is of one type; and, when the source data are of several types, for performing similarity matching on the obtained target language data and/or action expression data to obtain and output fused language data.
7. The automatic image-text audio translation system according to claim 6, wherein the voice data processing module specifically executes:
carrying out rate conversion on the acquired voice data to obtain standard sampling rate voice data;
inputting standard sampling rate data into a language identification model to obtain a language judgment result;
inputting the acquired voice data into a pre-trained translation model according to the language judgment result, and performing in the translation model:
outputting language identification information of a source language from the voice data through a pre-trained deep neural network,
and converting the language identification information of the source language through a cross-language labeling model, and outputting target language data.
8. The automatic image-text audio translation system according to claim 7, wherein the cross-language labeling model is obtained by pre-construction and training as follows:
classifying speech by language, and dividing each language into subclasses according to standard pronunciation and accents;
generating cross-language correspondence relations between the sub-languages of all languages in a pairwise manner, the cross-language correspondence relation converting any source language into the corresponding data of a target language with words, phrases and sentences as the conversion units;
and performing association labeling between any two languages according to the conversion unit, and training on the cross-language correspondence relations to form the cross-language labeling model.
9. The automatic image-text audio translation system according to claim 6, wherein the image data processing module specifically executes:
identifying characteristic information in the image through a pre-trained convolutional neural network, and determining a character and/or a graphic object contained in the image;
and translating the recognized text and/or graphic objects to obtain corresponding target language data.
10. The automatic image-text audio translation system according to claim 6, wherein the action data processing module specifically executes:
acquiring motion data through a posture sensor worn by a limb;
denoising and feature extraction are carried out on the motion data to obtain formatted input data, and the formatted input data are input into a pre-constructed LSTM neural network model to obtain motion expression data;
and outputting the action expression data as target language data.
CN202010587361.6A 2020-06-24 2020-06-24 Automatic image-text audio translation method and system Pending CN111738023A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010587361.6A CN111738023A (en) 2020-06-24 2020-06-24 Automatic image-text audio translation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010587361.6A CN111738023A (en) 2020-06-24 2020-06-24 Automatic image-text audio translation method and system

Publications (1)

Publication Number Publication Date
CN111738023A true CN111738023A (en) 2020-10-02

Family

ID=72651323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010587361.6A Pending CN111738023A (en) 2020-06-24 2020-06-24 Automatic image-text audio translation method and system

Country Status (1)

Country Link
CN (1) CN111738023A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382275A (en) * 2020-11-04 2021-02-19 北京百度网讯科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN114047981A (en) * 2021-12-24 2022-02-15 珠海金山数字网络科技有限公司 Project configuration method and device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1742273A (en) * 2002-12-10 2006-03-01 国际商业机器公司 Multimodal speech-to-speech language translation and display
US20060293874A1 (en) * 2005-06-27 2006-12-28 Microsoft Corporation Translation and capture architecture for output of conversational utterances
US20110153324A1 (en) * 2009-12-23 2011-06-23 Google Inc. Language Model Selection for Speech-to-Text Conversion
CN106057023A (en) * 2016-06-03 2016-10-26 北京光年无限科技有限公司 Intelligent robot oriented teaching method and device for children
CN107832309A (en) * 2017-10-18 2018-03-23 广东小天才科技有限公司 A kind of method, apparatus of language translation, wearable device and storage medium
CN108052510A (en) * 2017-10-23 2018-05-18 成都铅笔科技有限公司 A kind of multilingual translation device for supporting gesture identification and voice pickup
CN108399427A (en) * 2018-02-09 2018-08-14 华南理工大学 Natural interactive method based on multimodal information fusion
CN108766414A (en) * 2018-06-29 2018-11-06 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for voiced translation
CN108960126A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Method, apparatus, equipment and the system of sign language interpreter
CN109472035A (en) * 2018-11-12 2019-03-15 深圳市友杰智新科技有限公司 Switch the control system and method for interpretive scheme
CN110931042A (en) * 2019-11-14 2020-03-27 北京欧珀通信有限公司 Simultaneous interpretation method and device, electronic equipment and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1742273A (en) * 2002-12-10 2006-03-01 国际商业机器公司 Multimodal speech-to-speech language translation and display
US20060293874A1 (en) * 2005-06-27 2006-12-28 Microsoft Corporation Translation and capture architecture for output of conversational utterances
US20110153324A1 (en) * 2009-12-23 2011-06-23 Google Inc. Language Model Selection for Speech-to-Text Conversion
CN106057023A (en) * 2016-06-03 2016-10-26 北京光年无限科技有限公司 Intelligent robot oriented teaching method and device for children
CN107832309A (en) * 2017-10-18 2018-03-23 广东小天才科技有限公司 A kind of method, apparatus of language translation, wearable device and storage medium
CN108052510A (en) * 2017-10-23 2018-05-18 成都铅笔科技有限公司 A kind of multilingual translation device for supporting gesture identification and voice pickup
CN108399427A (en) * 2018-02-09 2018-08-14 华南理工大学 Natural interactive method based on multimodal information fusion
CN108766414A (en) * 2018-06-29 2018-11-06 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for voiced translation
CN108960126A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Method, apparatus, equipment and the system of sign language interpreter
CN109472035A (en) * 2018-11-12 2019-03-15 深圳市友杰智新科技有限公司 Switch the control system and method for interpretive scheme
CN110931042A (en) * 2019-11-14 2020-03-27 北京欧珀通信有限公司 Simultaneous interpretation method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王建华 (WANG Jianhua) et al.: "Research on audio-visual multimodal texts with eye-tracking technology: applications, status and prospects" (眼动技术下的视听多模态文本研究:应用、现状与展望), Foreign Languages and Cultures (《外国语言与文化》), vol. 4, no. 1, pages 133-142 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382275A (en) * 2020-11-04 2021-02-19 北京百度网讯科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112382275B (en) * 2020-11-04 2023-08-15 北京百度网讯科技有限公司 Speech recognition method, device, electronic equipment and storage medium
CN114047981A (en) * 2021-12-24 2022-02-15 珠海金山数字网络科技有限公司 Project configuration method and device

Similar Documents

Publication Publication Date Title
CN109255113B (en) Intelligent proofreading system
US8498857B2 (en) System and method for rapid prototyping of existing speech recognition solutions in different languages
US11043213B2 (en) System and method for detection and correction of incorrectly pronounced words
JP2017058674A (en) Apparatus and method for speech recognition, apparatus and method for training transformation parameter, computer program and electronic apparatus
CN110797010A (en) Question-answer scoring method, device, equipment and storage medium based on artificial intelligence
CN110010136B (en) Training and text analysis method, device, medium and equipment for prosody prediction model
CN109376360A (en) A kind of method and apparatus of assisted learning language
CN111738023A (en) Automatic image-text audio translation method and system
CN110717341A (en) Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN110648654A (en) Speech recognition enhancement method and device introducing language vectors
CN112530404A (en) Voice synthesis method, voice synthesis device and intelligent equipment
CN112489634A (en) Language acoustic model training method and device, electronic equipment and computer medium
Azim et al. Large vocabulary Arabic continuous speech recognition using tied states acoustic models
Lu et al. Implementation of embedded unspecific continuous English speech recognition based on HMM
KR20160106363A (en) Smart lecture system and method
CN114519358A (en) Translation quality evaluation method and device, electronic equipment and storage medium
CN110858268B (en) Method and system for detecting unsmooth phenomenon in voice translation system
CN109446537B (en) Translation evaluation method and device for machine translation
Tits et al. Flowchase: a Mobile Application for Pronunciation Training
US11817079B1 (en) GAN-based speech synthesis model and training method
CN116229994B (en) Construction method and device of label prediction model of Arabic language
CN116386637B (en) Radar flight command voice instruction generation method and system
CN113362803B (en) ARM side offline speech synthesis method, ARM side offline speech synthesis device and storage medium
CN113555006B (en) Voice information identification method and device, electronic equipment and storage medium
RKDMP et al. Real-Time Sign Language Translator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20240517