CN106156012A

CN106156012A - Method and device for generating captions

Info

Publication number: CN106156012A
Application number: CN201610490947.4A
Authority: CN
Inventors: 唐熊
Original assignee: LeTV Holding Beijing Co Ltd; LeTV Mobile Intelligent Information Technology Beijing Co Ltd
Current assignee: LeTV Holding Beijing Co Ltd; LeTV Mobile Intelligent Information Technology Beijing Co Ltd
Priority date: 2016-06-28
Filing date: 2016-06-28
Publication date: 2016-11-23

Abstract

The present invention provides a method and device for generating captions, relating to the field of multimedia processing. The method first extracts audio information, then recognizes the first language in the audio information, converts the first language into a second language to obtain second-language text information, and finally loads the second-language text information at the position corresponding to the audio information, thereby generating video captions. The method can automatically generate captions in other languages for a video: for a video file that lacks captions in the required language, captions in that language can be generated in this way, so that the user can watch the video more easily. The method can automatically generate multilingual captions and can serve as an additional function of players on smart devices such as mobile phones. It overcomes the drawbacks of prior-art caption generation, which requires manual translation, consumes considerable time and effort, and is slow, and it simplifies the caption production workflow so that captions can be generated quickly and conveniently.

Description

Method and device for generating captions

Technical field

The present invention relates to the field of multimedia signal processing, and in particular to a method and device for generating captions.

Background art

With the rapid development of information technology, people encounter more and more audio and video information in daily life. However, because different countries use different languages, sharing audio and video information raises the problem of handling the local language. To make a video accessible to speakers of other languages, captions in those languages are usually loaded onto the video so that viewers can understand it. For example, when foreign films and similar video resources are shown domestically, the sound is left untranslated to preserve the original audio, and the English dialogue is instead translated into Chinese and displayed at the bottom of the screen to help the audience understand.

In the course of realizing the present invention, the inventor found that captions for the language in a video are currently generated by first manually translating the original language into the required language and then adding the translated text to the video at the corresponding positions. First, manual translation takes considerable time. Second, because the world has many languages, translating into every language to form captions would greatly increase the production cost of the video, while leaving a language untranslated means that its speakers cannot use the video resource. Translating the language in a video and forming captions quickly and conveniently has therefore become a key technique for improving video utilization.

Summary of the invention

Accordingly, embodiments of the present invention provide a method and device for generating captions, to solve the problem that caption generation in the prior art requires manual translation, consumes considerable time and effort, and is slow.

According to one aspect of the embodiments of the present invention, a method for generating captions is provided. The method includes: extracting audio information; recognizing the first language in the audio information and generating first-language text information; converting the first-language text information into second-language text information; and loading the second-language text information at the corresponding position of the corresponding audio information.

Further, converting the first-language text information into second-language text information includes: obtaining multiple translation tools; using each translation tool to translate the first-language text information into candidate second-language text information; and counting the candidate second-language text information and taking the candidate that occurs most frequently as the translated second-language text information.

Further, recognizing the first language in the audio information and generating first-language text information includes: extracting the speech information from the audio information; cutting out each speech segment in the speech information; and recognizing the speech information in each speech segment to obtain the first-language text information.

Further, the method also includes: obtaining revised text submitted by multiple users for part of the second-language text information; and replacing the corresponding content in the second-language text information with the revised text that occurs most frequently.

Further, the second language is one or more languages.

According to another aspect of the embodiments of the present invention, a caption generation device is provided. The device includes: an audio extraction unit, configured to extract audio information; a first language recognition unit, configured to recognize the first language in the audio information and generate first-language text information; a second language conversion unit, configured to convert the first-language text information into second-language text information; and a caption generation unit, configured to load the second-language text information at the corresponding position of the corresponding audio information.

Further, the second language conversion unit includes: a translation tool selection subunit, configured to obtain multiple translation tools; a translation subunit, configured to use each translation tool to translate the first-language text information into candidate second-language text information; and a translation confirmation subunit, configured to count the candidate second-language text information and take the candidate that occurs most frequently as the translated second-language text information.

Further, the first language recognition unit includes: a speech information extraction subunit, configured to extract the speech information from the audio information; a speech segment obtaining subunit, configured to cut out each speech segment in the speech information; and a speech recognition subunit, configured to recognize the speech information in each speech segment to obtain the first-language text information.

Further, the device also includes: a revised-text obtaining subunit, configured to obtain revised text submitted by multiple users for part of the second-language text information; and an optimization unit, configured to replace the corresponding content in the second-language text information with the revised text that occurs most frequently.

Further, the second language is one or more languages.

The technical solutions of the embodiments of the present invention have the following advantages:

1. The embodiments of the present invention provide a method and device for generating captions. The method first extracts audio information, then recognizes the first language in the audio information, converts the first language into a second language to obtain second-language text information, and finally loads the second-language text information at the corresponding position of the corresponding audio information, thereby generating the captions of the video. The method can automatically generate captions in other languages for a video: for a video file that lacks captions in the required language, captions in that language can be generated in this way, so that the user can watch the video more easily. The method can automatically generate multilingual captions and can serve as an additional function of players on smart devices such as mobile phones. It overcomes the drawbacks of prior-art caption generation, which requires manual translation, consumes considerable time and effort, and is slow, and it simplifies the caption production workflow so that captions can be generated quickly and conveniently.

2. In the method and device for generating captions described in the embodiments of the present invention, multiple translation tools are used together when translating into the second language, and the translation result on which the great majority of them agree is selected as the second-language translation, which improves translation accuracy.

3. The method and device for generating captions described in the embodiments of the present invention also include a step of optimizing the second-language text information: revised text submitted by multiple users for part of the second-language text information is first obtained, and the revised text that occurs most frequently then replaces the corresponding content in the second-language text information. Users can thus suggest revisions to the translated captions. Once the number of suggestions reaches a certain level, the translation recommended by most users, determined by counting, replaces the original translated text. This further optimizes the translation result and makes it more accurate through crowdsourcing.

Brief description of the drawings

To describe the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the accompanying drawings needed for describing them are briefly introduced below. The drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a specific example of the caption generation method in Embodiment 1 of the present invention;

Fig. 2 is a structural block diagram of a specific example of the caption generation device in Embodiment 2 of the present invention.

Detailed description of the embodiments

The technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

The terms "first", "second", and "third" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. In addition, the technical features involved in the different embodiments of the present invention described below may be combined with one another as long as they do not conflict.

Embodiment 1

This embodiment provides a method for generating captions, applied to a video file to generate captions in other languages for the language spoken in the video so that users can enjoy the video more easily. The method can be used in players on devices such as mobile phones and computers. The flowchart of the method is shown in Fig. 1, and the method includes the following steps:

S1: Extract the audio information.

First, the audio information is extracted from the video. It can be extracted directly from the video signal or recorded with separate recording software, forming a standalone audio file. This audio file contains background music, noise, and the speech information of the dialogue; this solution processes the speech information.
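As an illustration of direct extraction, the sketch below builds a command line for the ffmpeg tool that writes a video's audio track to a standalone WAV file. The patent does not name any extraction tool; ffmpeg, the function name `build_extract_command`, and the 16 kHz mono output format are assumptions chosen for this example.

```python
from typing import List


def build_extract_command(video_path: str, audio_path: str,
                          sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that saves the audio track as mono 16-bit PCM WAV.

    Assumption: ffmpeg is installed; the patent does not prescribe a tool.
    """
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", video_path,          # input video file
        "-vn",                     # discard the video stream
        "-acodec", "pcm_s16le",    # encode audio as 16-bit little-endian PCM
        "-ar", str(sample_rate),   # output sample rate in Hz
        "-ac", "1",                # downmix to a single channel
        audio_path,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to perform the extraction.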

S2: Recognize the first language in the audio information and generate first-language text information.

This step obtains the speech information in the audio information and recognizes it. Optionally, this step may include:

First, the speech information in the audio information is extracted. Human speech has its own characteristics and is easily distinguished from background music, noise, and the like, so the speech information in the audio information can be obtained by speech extraction.

Then, each speech segment in the speech information is cut out. This step first obtains the starting position of the speech information and cuts out speech segments from that starting position. The speech information in a video file consists of multiple segments, each of which is a series of related words or sentences, so each speech segment must first be obtained and then processed separately.

Next, the speech information in each speech segment is recognized to obtain the first-language text information. Speech recognition is used here, and existing speech recognition technology can be applied. Speech recognition, also called automatic speech recognition (ASR), aims to convert the lexical content of human speech into computer-readable input such as keystrokes, binary codes, or character strings; approaches include neural networks and adaptive methods. Through speech recognition, the lexical information in a speech segment is recognized and converted into text, yielding the first-language text information.
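The segment-cutting step above can be pictured with a simple energy threshold. The patent does not specify how segment boundaries are found, so the following is only a sketch under that assumption: the function name `find_speech_segments` and the parameters `frame_len`, `threshold`, and `min_silence_frames` are hypothetical, and a real system would likely use a more robust voice activity detector.

```python
from typing import List, Sequence, Tuple


def find_speech_segments(samples: Sequence[float], frame_len: int,
                         threshold: float,
                         min_silence_frames: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of runs of high-energy frames.

    A segment starts at the first frame whose mean energy exceeds the
    threshold and ends once min_silence_frames quiet frames follow it.
    """
    segments: List[Tuple[int, int]] = []
    start = None      # sample index where the current segment began
    silence = 0       # number of consecutive quiet frames seen so far
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold:
            if start is None:
                start = i * frame_len
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # The segment ended where the quiet run began.
                segments.append((start, (i - silence + 1) * frame_len))
                start = None
                silence = 0
    if start is not None:
        # Audio ended while a segment was still open.
        segments.append((start, (n_frames - silence) * frame_len))
    return segments
```

Each returned index pair can then be sliced out of the audio and passed to the speech recognizer separately, and the start index doubles as the timing reference for the later caption placement.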

S3: Convert the first-language text information into second-language text information.

In this step, the first language originally used in the video is translated into the second language, with the first-language text information recognized from the audio as the input. Converting the first-language text information into second-language text information includes:

First, multiple translation tools are obtained. The translation tools here can be various translation software packages, such as Youdao translation software and Kingsoft PowerWord translation software, and can also include web-based translation tools such as Google Translate and Baidu Translate.

Then, each translation tool is used to translate the first-language text information into candidate second-language text information. Each segment of first-language text is translated with every translation tool, which yields multiple translated texts; these translated texts serve as the candidate second-language text information.

Finally, the candidate second-language text information is counted, and the candidate that occurs most frequently is taken as the translated second-language text information.

In this step, the translated texts obtained from all the translations are counted, and the translated text that occurs most frequently can be taken as the second-language text information. Alternatively, similarity matching can be used during counting to find the translation with the highest similarity, which is taken as the second-language text information; or clustering can be used, and the translated text of the largest cluster is taken as the second-language text information. Selecting the best among the results of multiple translation tools yields a better translation result.

In this embodiment, multiple translation tools are used together when translating into the second language, and the translation result on which the great majority of them agree is selected as the second-language translation, which improves translation accuracy.
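The selection among candidate translations can be sketched as a frequency count with a similarity fallback. The code below is one possible reading, not the patented implementation: the function name `pick_translation` is hypothetical, the candidate strings stand in for the outputs of the translation tools, and using `difflib.SequenceMatcher` as the similarity measure is an assumption, since the patent does not say how similarity is computed.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import List


def pick_translation(candidates: List[str]) -> str:
    """Choose one translation from the outputs of several translation tools."""
    if not candidates:
        raise ValueError("at least one candidate translation is required")
    normalized = [c.strip() for c in candidates]

    # Frequency statistics: prefer a translation produced by more than one tool.
    best, freq = Counter(normalized).most_common(1)[0]
    if freq > 1:
        return best

    # No two tools agree exactly: fall back to similarity matching and pick
    # the candidate that is, in total, most similar to all the others.
    def total_similarity(i: int) -> float:
        return sum(
            SequenceMatcher(None, normalized[i], other).ratio()
            for j, other in enumerate(normalized)
            if j != i
        )

    return normalized[max(range(len(normalized)), key=total_similarity)]
```

With the inputs `["Hello world", "Hello world", "Hi world"]`, the exact-match count already decides the result; the similarity fallback only matters when every tool returns a different string.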

S4: Load the second-language text information at the corresponding position of the corresponding audio information.

In this step, the translated second-language text information is loaded at the appropriate position according to the video position that corresponds to the original audio information, so that the original audio information is matched with its translated text, which makes the video easier for the user to watch.
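One common way to attach translated text to positions in a video is a timed subtitle file. The patent does not name a caption format, so the sketch below assumes the SubRip (SRT) text format purely for illustration; the function names `format_timestamp` and `to_srt` and the cue tuple layout (start seconds, end seconds, text) are hypothetical.

```python
from typing import List, Tuple


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues: List[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) cues as SRT blocks separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text}\n"
        )
    return "\n".join(blocks)
```

The start and end times would come from the positions of the speech segments in the original audio, so each translated line appears while the corresponding speech is playing.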

The method of this embodiment can automatically generate captions in other languages for a video: for a video file that lacks captions in the required language, captions in that language can be generated in this way, so that the user can watch the video more easily. The method can automatically generate multilingual captions and can serve as an additional function of players on smart devices such as mobile phones. It overcomes the drawbacks of prior-art caption generation, which requires manual translation, consumes considerable time and effort, and is slow, and it simplifies the caption production workflow so that captions can be generated quickly and conveniently.

In a preferred implementation, a user watching the video can also annotate the captions with suggested revisions, and this embodiment can optimize the translated captions, that is, the second-language text information, according to those suggestions. The specific steps are as follows:

First, revised text submitted by multiple users for part of the second-language text information is obtained. Because users can annotate the captions with revised text, the more users watch the video and offer their opinions, the more reference information becomes available.

Then, the revised text that occurs most frequently replaces the corresponding content in the second-language text information. In this process, the revised texts provided by users are analyzed statistically. If several users propose revised text for the same translated caption position and several of those revisions are identical, the most frequent revised text is considered the more accurate translation and replaces the original translated text.

In this preferred implementation, users can suggest revisions to the translated captions. Once the number of suggestions reaches a certain level, the translation recommended by most users, determined by counting, replaces the original translated text. This further optimizes the translation result and makes it more accurate through crowdsourcing.
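The optimization step can be sketched as a vote count per caption line. This is an illustrative reading only: the function name `apply_corrections`, the representation of a suggestion as a (caption index, revised text) pair, and the `min_votes` threshold standing in for "a certain level" are all assumptions not taken from the patent.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def apply_corrections(subtitles: List[str],
                      suggestions: Iterable[Tuple[int, str]],
                      min_votes: int = 2) -> List[str]:
    """Replace caption lines with the most frequently suggested revision.

    A line is replaced only if its top suggestion was submitted at least
    min_votes times; other lines are returned unchanged.
    """
    votes: Dict[int, Counter] = defaultdict(Counter)
    for index, text in suggestions:
        votes[index][text.strip()] += 1

    result = list(subtitles)  # copy so the input list is not modified
    for index, counter in votes.items():
        text, count = counter.most_common(1)[0]
        if 0 <= index < len(result) and count >= min_votes:
            result[index] = text
    return result
```

Requiring a minimum number of identical suggestions keeps a single user's edit from overwriting the machine translation, which matches the idea that a revision is adopted only when several users agree on it.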

Optionally, the second language in this embodiment can be one language or multiple languages, so this solution can translate the speech information in a video into one or more languages and display it as captions. The user can set the number of target languages and which specific languages to translate into, and captions are generated for those languages.

Embodiment 2:

This embodiment provides a caption generation device, whose structural block diagram is shown in Fig. 2. The device can be used in players on devices such as mobile phones and computers to generate captions in other languages for the language spoken in a video so that users can enjoy the video more easily. The device includes:

an audio extraction unit 01, configured to extract audio information;

a first language recognition unit 02, configured to recognize the first language in the audio information and generate first-language text information;

a second language conversion unit 03, configured to convert the first-language text information into second-language text information; and

a caption generation unit 04, configured to load the second-language text information at the corresponding position of the corresponding audio information.

The caption generation device in this embodiment obtains the audio information through the audio extraction unit 01, generates the first-language text information through the first language recognition unit 02, converts the first-language text information into second-language text information through the second language conversion unit 03, and finally generates captions through the caption generation unit 04. It thereby achieves real-time translation of captions, so that the user can watch the video file more easily. The device can automatically generate multilingual captions. It overcomes the drawbacks of prior-art caption generation, which requires manual translation, consumes considerable time and effort, and is slow, and it simplifies the caption production workflow so that captions can be generated quickly and conveniently.
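The way the four units hand data to one another can be sketched as a small pipeline. This is a structural illustration only: the class name `CaptionGenerator`, the representation of each unit as a callable, and the `Segment` tuple (start seconds, end seconds, text) are assumptions made for the example, and the units themselves are left to whatever extraction, recognition, and translation components are available.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# (start time in seconds, end time in seconds, text)
Segment = Tuple[float, float, str]


@dataclass
class CaptionGenerator:
    extract_audio: Callable[[str], Any]         # stands in for unit 01
    recognize: Callable[[Any], List[Segment]]   # stands in for unit 02
    translate: Callable[[str], str]             # stands in for unit 03

    def generate(self, video_path: str) -> List[Segment]:
        """Run the units in order and return timed second-language captions."""
        audio = self.extract_audio(video_path)
        recognized = self.recognize(audio)
        # Role of unit 04: keep each segment's timing, swap in translated text.
        return [(start, end, self.translate(text))
                for start, end, text in recognized]
```

Supporting several second languages would amount to calling `generate` once per target language with a different `translate` callable.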

In a specific implementation, the second language conversion unit 03 includes: a translation tool selection subunit, configured to obtain multiple translation tools; a translation subunit, configured to use each translation tool to translate the first-language text information into candidate second-language text information; and a translation confirmation subunit, configured to count the candidate second-language text information and take the candidate that occurs most frequently as the translated second-language text information. In this implementation, multiple translation tools are used together, and the translation result on which most of them agree is selected as the second-language translation, which improves translation accuracy.

In a specific implementation, the first language recognition unit 02 includes: a speech information extraction subunit, configured to extract the speech information from the audio information; a speech segment obtaining subunit, configured to cut out each speech segment in the speech information; and a speech recognition subunit, configured to recognize the speech information in each speech segment to obtain the first-language text information. Through speech recognition, the lexical information in a speech segment is recognized and converted into text, yielding the first-language text information.

In a preferred implementation, to optimize the translated captions, the caption generation device also includes: a revised-text obtaining subunit, configured to obtain the revisions suggested by multiple users for part of the second-language text information; and an optimization unit, configured to replace the corresponding content in the second-language text information with the revised text that occurs most frequently. In this implementation, the translation recommended by most users, determined by counting, replaces the original translated text. This further optimizes the translation result and makes it more accurate through crowdsourcing.

In other implementations, the second language is one or more languages: the first-language text information can be translated into multiple sets of second-language text information, which are loaded as captions to meet the needs of different users.

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process. The instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Obviously, the above embodiments are merely examples given for clarity of description and do not limit the implementations. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to list all implementations exhaustively here. Obvious changes or variations derived from the above remain within the scope of protection of the present invention.

Claims

1. A method for generating captions, characterized by including:

extracting audio information;

recognizing the first language in the audio information and generating first-language text information;

converting the first-language text information into second-language text information; and

loading the second-language text information at the corresponding position of the corresponding audio information.

2. The method according to claim 1, characterized in that converting the first-language text information into second-language text information includes:

obtaining multiple translation tools;

using each translation tool to translate the first-language text information into candidate second-language text information; and

counting the candidate second-language text information and taking the candidate that occurs most frequently as the translated second-language text information.

3. The method according to claim 1, characterized in that recognizing the first language in the audio information and generating first-language text information includes:

extracting the speech information from the audio information;

cutting out each speech segment in the speech information; and

recognizing the speech information in each speech segment to obtain the first-language text information.

4. The method according to any one of claims 1-3, characterized in that the method also includes:

obtaining revised text submitted by multiple users for part of the second-language text information; and

replacing the corresponding content in the second-language text information with the revised text that occurs most frequently.

5. The method according to any one of claims 1-3, characterized in that the second language is one or more languages.

6. A caption generation device, characterized by including:

an audio extraction unit, configured to extract audio information;

a first language recognition unit, configured to recognize the first language in the audio information and generate first-language text information;

a second language conversion unit, configured to convert the first-language text information into second-language text information; and

a caption generation unit, configured to load the second-language text information at the corresponding position of the corresponding audio information.

7. The device according to claim 6, characterized in that the second language conversion unit includes:

a translation tool selection subunit, configured to obtain multiple translation tools;

a translation subunit, configured to use each translation tool to translate the first-language text information into candidate second-language text information; and

a translation confirmation subunit, configured to count the candidate second-language text information and take the candidate that occurs most frequently as the translated second-language text information.

8. The device according to claim 6, characterized in that the first language recognition unit includes:

a speech information extraction subunit, configured to extract the speech information from the audio information;

a speech segment obtaining subunit, configured to cut out each speech segment in the speech information; and

a speech recognition subunit, configured to recognize the speech information in each speech segment to obtain the first-language text information.

9. The device according to any one of claims 6-8, characterized by also including:

a revised-text obtaining unit, configured to obtain revised text submitted by multiple users for part of the second-language text information; and

an optimization unit, configured to replace the corresponding content in the second-language text information with the revised text that occurs most frequently.

10. The device according to any one of claims 6-8, characterized in that the second language is one or more languages.