CN108063970A - Method and apparatus for processing a live stream

Info

Publication number: CN108063970A
Application number: CN201711172649.1A
Authority: CN
Inventors: 洪巨成; 项东涛
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2017-11-22
Filing date: 2017-11-22
Publication date: 2018-05-22

Abstract

An embodiment of the present invention provides a method and apparatus for processing a live stream. The method includes: decoding an original live stream into original audio data and original video data; performing speech recognition on the original audio data to generate text characters corresponding to the original audio data; delaying the original video data according to a first duration consumed by the speech recognition; adding the text characters to the delayed video data to generate target video data; synchronously combining the target video data with the original audio data to generate a target live stream; and playing the target live stream. Embodiments of the present invention make it possible to play live video with subtitles.

Description

A method and apparatus for processing a live stream

Technical field

The present invention relates to the field of computer technology, and in particular to a method and apparatus for processing a live stream.

Background

Thanks to the diversity of its content, live video is increasingly popular with users. Live video usually does not display subtitles synchronized with the video.

When the sound of a live video is unclear, for example because of audio interference, or because a person in the live video pronounces words inaccurately or speaks too fast, the user cannot fully understand the program content from the sound alone, which degrades the viewing experience.

Summary of the invention

Embodiments of the present invention aim to provide a method and apparatus for processing a live stream, so that live video can be played with subtitles. The specific technical solutions are as follows:

In one aspect of the present invention, a method for processing a live stream is provided. The method includes:

decoding an original live stream into original audio data and original video data;

performing speech recognition on the original audio data to generate text characters corresponding to the original audio data;

delaying the original video data according to a first duration consumed by the speech recognition;

adding the text characters to the delayed video data to generate target video data;

synchronously combining the target video data with the original audio data to generate a target live stream; and

playing the target live stream.

Optionally, the step of decoding the original live stream into original audio data and original video data includes:

decoding an original live stream of a preset duration into original audio data and original video data.

Optionally, the step of decoding the original live stream into original audio data and original video data includes:

determining, in the original live stream within a preset duration interval, a time point at which the speech pauses; and

decoding the not-yet-decoded live stream segment that precedes the time point in the original live stream into original audio data and original video data.

Optionally, the step of delaying the original video data according to the first duration consumed by the speech recognition includes:

determining the first duration consumed by the speech recognition; and

delaying the timestamps of the original video data by the first duration.

Optionally, after the step of performing speech recognition on the original audio data to generate the text characters corresponding to the original audio data, the method further includes:

translating the text characters into a preset language and generating a second duration, the second duration being the time consumed by translating the text characters into the preset language.

The step of delaying the timestamps of the original video data by the first duration then includes:

delaying the timestamps of the original video data by the sum of the first duration and the second duration.

The step of adding the text characters to the delayed video data to generate target video data then includes:

adding the translated text characters to the delayed video data to generate the target video data.

Optionally, after the step of translating the text characters into the preset language, the method further includes:

performing error correction on the translated text characters; and

determining a third duration consumed by the error correction.

The step of delaying the timestamps of the original video data by the sum of the first duration and the second duration then includes:

delaying the timestamps of the original video data by the sum of the first duration, the second duration and the third duration.

The translated and corrected text characters are then added to the delayed video data to generate the target video data.

Optionally, the step of synchronously combining the target video data with the original audio data to generate the target live stream includes:

based on a preset reference time axis, synchronously combining the target video data with the original audio data according to the timestamps of the video frames in the target video data and the timestamps of the audio frames in the original audio data, to generate the target live stream.

In another aspect of the present invention, an apparatus for processing a live stream is further provided. The apparatus includes:

a decoding unit, configured to decode an original live stream into original audio data and original video data;

a recognition unit, configured to perform speech recognition on the original audio data to generate text characters corresponding to the original audio data;

a delay unit, configured to delay the original video data according to a first duration consumed by the speech recognition;

an adding unit, configured to add the text characters to the delayed video data to generate target video data;

a synthesis unit, configured to synchronously combine the target video data with the original audio data to generate a target live stream; and

a playing unit, configured to play the target live stream.

Optionally, the decoding unit is specifically configured to decode an original live stream of a preset duration into original audio data and original video data.

Optionally, the decoding unit includes a first determination subunit and a decoding subunit;

the first determination subunit is configured to determine, in the original live stream within a preset duration interval, a time point at which the speech pauses; and

the decoding subunit is configured to decode the not-yet-decoded live stream segment that precedes the time point in the original live stream into original audio data and original video data.

Optionally, the delay unit includes a second determination subunit and a delay subunit;

the second determination subunit is configured to determine the first duration consumed by the speech recognition; and

the delay subunit is configured to delay the timestamps of the original video data by the first duration.

Optionally, the apparatus further includes:

a translation unit, configured to translate the text characters into a preset language and generate a second duration, the second duration being the time consumed by translating the text characters into the preset language;

the delay subunit is specifically configured to delay the timestamps of the original video data by the sum of the first duration and the second duration; and

the adding unit is specifically configured to add the translated text characters to the delayed video data to generate the target video data.

Optionally, the apparatus further includes:

an error correction unit, configured to perform error correction on the translated text characters; and

a determination unit, configured to determine a third duration consumed by the error correction;

the delay subunit is specifically configured to delay the timestamps of the original video data by the sum of the first duration, the second duration and the third duration; and

the adding unit is specifically configured to add the translated and corrected text characters to the delayed video data to generate the target video data.

Optionally, the synthesis unit is specifically configured to, based on a preset reference time axis, synchronously combine the target video data with the original audio data according to the timestamps of the video frames in the target video data and the timestamps of the audio frames in the original audio data, to generate the target live stream.

In another aspect of the present invention, a computer-readable storage medium is further provided. Instructions are stored in the computer-readable storage medium and, when run on a computer, cause the computer to perform any of the methods for processing a live stream described above.

In another aspect of the present invention, a computer program product containing instructions is further provided which, when run on a computer, causes the computer to perform any of the methods for processing a live stream described above.

In the method and apparatus for processing a live stream provided by the embodiments of the present invention, an original live stream of a preset duration is first decoded into original audio data and original video data. Speech recognition is then performed on the original audio data to generate the corresponding text characters, and the original video data is delayed according to the first duration consumed by the speech recognition. Next, the text characters are added to the delayed video data to generate target video data. Finally, the target video data and the original audio data are synchronously combined to generate a target live stream, and the target live stream is played.

In this way, by adding the text characters corresponding to the audio data to the live video, the embodiments of the present invention play synchronized subtitles while the live video is playing, which helps the user understand the content of the live video and improves the viewing experience. Of course, a product or method implementing the present invention does not necessarily need to achieve all of the above advantages at the same time.

Description of the drawings

To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below.

Fig. 1 is a flowchart of a method for processing a live stream according to an embodiment of the present invention;

Fig. 2 is another flowchart of a method for processing a live stream according to an embodiment of the present invention;

Fig. 3 is yet another flowchart of a method for processing a live stream according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of a system for processing a live stream according to an embodiment of the present invention;

Fig. 5 is a structural diagram of an apparatus for processing a live stream according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of an electronic device according to an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described below with reference to the drawings in the embodiments of the present invention.

At present, thanks to the diversity of its content, live video is increasingly popular with users. Live video usually does not display subtitles synchronized with the video.

When a user watches live video on a terminal device and the sound is unclear, for example because of audio interference, or because a person in the live video pronounces words inaccurately or speaks too fast, no subtitles synchronized with the video are available, so the user cannot fully understand the program content from the sound alone, which degrades the viewing experience.

To solve the above problem, embodiments of the present invention provide a method and apparatus for processing a live stream. By adding the text characters corresponding to the audio data to the live video, synchronized subtitles are played while the live video is playing, which helps the user understand the content of the live video and improves the viewing experience.

An embodiment of the present invention provides a method for processing a live stream. Referring to Fig. 1, which is a flowchart of a method for processing a live stream according to an embodiment of the present invention, the method includes the following steps:

Step 101: decode the original live stream into original audio data and original video data.

In this step, the original live stream can be decoded into original audio data and original video data so that subtitles can then be added to it.

It should be noted that a live stream is time-sensitive and that too much delay harms the viewing experience. Therefore, a segment of the original live stream can first be obtained, decoded and given subtitles; once that segment has been processed, the resulting live video can be played, and processing then continues with the subsequent original live stream.

In one implementation, step 101 may include:

decoding an original live stream of a preset duration into original audio data and original video data.

In a specific implementation, besides first obtaining a segment of the original live stream for subtitle processing, an original live stream of a preset duration can be obtained first; after it is decoded, subtitles are added, the processed live video is played, and processing then continues with the subsequent original live stream. The preset duration can be set according to the actual situation.
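The fixed-duration segmentation described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `(timestamp, payload)` frame representation and the function name `split_by_duration` are assumptions introduced here.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical frame representation: (timestamp in seconds, payload).
Frame = Tuple[float, bytes]


def split_by_duration(frames: Iterable[Frame],
                      preset_duration: float) -> Iterator[List[Frame]]:
    """Group timestamped frames into consecutive segments of preset_duration seconds."""
    segment: List[Frame] = []
    segment_start: Optional[float] = None
    for ts, payload in frames:
        if segment_start is None:
            segment_start = ts
        # Close the current segment once a frame falls outside its time window.
        if ts - segment_start >= preset_duration:
            yield segment
            segment = []
            segment_start = ts
        segment.append((ts, payload))
    if segment:
        # Emit whatever is left at the end of the stream.
        yield segment
```

Each yielded segment can then be decoded, subtitled and played before the next one is taken, which keeps the added latency bounded by the segment length plus the processing time.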

In another implementation, step 101 may include:

determining, in the original live stream within a preset duration interval, a time point at which the speech pauses; and

decoding the not-yet-decoded live stream segment that precedes the time point in the original live stream into original audio data and original video data.

Specifically, a duration interval can be preset, for example 30 to 40 seconds. When the speech in the original live stream pauses within the preset duration interval, the time point of the pause is determined; the not-yet-decoded live stream segment before that time point is then decoded into original audio data and original video data. This preserves the integrity of the speech captured from the original live stream and avoids cutting a complete sentence in two, which both makes subtitles easier to add and gives the user a better viewing experience.
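One way to locate such a pause is sketched below. The text does not say how a pause is detected, so the approach here is an assumption: the audio is summarized as one loudness value per fixed-length frame, and the first frame inside the 30 to 40 second interval whose loudness falls below a threshold is treated as the pause. The names `find_pause_time`, `levels` and `silence_threshold` are hypothetical.

```python
from typing import Optional, Sequence


def find_pause_time(levels: Sequence[float],
                    frame_seconds: float,
                    window_start: float = 30.0,
                    window_end: float = 40.0,
                    silence_threshold: float = 0.01) -> Optional[float]:
    """Return the time of the first quiet frame inside [window_start, window_end].

    levels[i] is an assumed loudness measure (for example RMS amplitude) of the
    audio frame that starts at i * frame_seconds. Returns None if no frame in
    the interval is quieter than silence_threshold.
    """
    for index, level in enumerate(levels):
        t = index * frame_seconds
        if t < window_start:
            continue  # pauses before the interval are ignored
        if t > window_end:
            break     # past the interval: no pause found
        if level < silence_threshold:
            return t
    return None
```

What to do when no pause occurs inside the interval is not covered by the text; a caller of this sketch would have to choose a fallback, such as cutting at the end of the interval.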

Step 102: perform speech recognition on the original audio data to generate the text characters corresponding to the original audio data.

In this step, speech recognition can be performed on the original audio data decoded in step 101 to generate the corresponding text characters, which are saved so that subsequent steps can add them to the live stream and the live video can be played with subtitles.

Step 103: delay the original video data according to the first duration consumed by the speech recognition.

In this step, because the speech recognition in step 102 takes a certain amount of time, the original video data needs to be delayed according to the first duration consumed by the speech recognition. Only when the recognized text characters are added to the delayed video data are the subtitles synchronized with the video.

Optionally, step 103 may include:

determining the first duration consumed by the speech recognition; and

delaying the timestamps of the original video data by the first duration.

Specifically, the first duration consumed by the speech recognition in step 102 can be obtained first. Typically, when the preset duration of the original live stream obtained in step 101 is 30 seconds, the first duration consumed by the speech recognition may be about 5 seconds. The timestamps of the original video data are then delayed by the obtained first duration, so that the added subtitles are synchronized with the video.
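Delaying the timestamps amounts to adding the first duration to the timestamp of every video frame. The following sketch shows this under the assumption that a frame is a simple record with a `timestamp` field in seconds; the `VideoFrame` class and `delay_frames` function are illustrative names, not part of the described method.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class VideoFrame:
    """Hypothetical decoded video frame."""
    timestamp: float  # presentation time in seconds
    data: bytes       # raw picture data


def delay_frames(frames: List[VideoFrame], delay_seconds: float) -> List[VideoFrame]:
    """Return copies of the frames with every timestamp shifted later by delay_seconds."""
    return [replace(f, timestamp=f.timestamp + delay_seconds) for f in frames]
```

With the figures from the text, a 30 second segment whose recognition took about 5 seconds would be passed through `delay_frames(frames, 5.0)`.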

Step 104: add the text characters to the delayed video data to generate target video data.

In this step, the text characters generated in step 102 are added to the delayed video data produced in step 103 to generate target video data with subtitles. It can be understood that the video and the subtitles in the target video data are synchronized.
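A sketch of how recognized text might be matched to delayed frames is given below. It rests on several assumptions not stated in the text: the recognizer yields cues of the form `(start, end, text)` timed on the original audio time axis, frames are `(timestamp, data)` pairs whose timestamps have already been delayed, and the caption is attached as metadata instead of being drawn into the picture.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical cue format: (start, end, text), timed on the original audio time axis.
Cue = Tuple[float, float, str]


@dataclass
class SubtitledFrame:
    timestamp: float        # delayed presentation timestamp in seconds
    data: bytes             # raw picture data
    caption: Optional[str]  # text to show with this frame, if any


def add_subtitles(frames: List[Tuple[float, bytes]],
                  cues: List[Cue],
                  delay: float) -> List[SubtitledFrame]:
    """Attach to each delayed frame the cue that covers its original time."""
    result = []
    for ts, data in frames:
        original_ts = ts - delay  # map the delayed timestamp back to the audio time axis
        caption = next(
            (text for start, end, text in cues if start <= original_ts < end),
            None,
        )
        result.append(SubtitledFrame(ts, data, caption))
    return result
```

A real pipeline would render the caption onto the frame before encoding; that step depends on the video library in use and is left out here.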

Step 105: synchronously combine the target video data with the original audio data to generate a target live stream.

In this step, the target video data generated in step 104 is synchronously combined with the original audio data generated in step 101 to generate the target live stream, so that the live stream carries synchronized subtitles.

Optionally, step 105 may include:

based on a preset reference time axis, synchronously combining the target video data with the original audio data according to the timestamps of the video frames in the target video data and the timestamps of the audio frames in the original audio data, to generate the target live stream.

Specifically, taking the preset reference time axis as the reference, the target video data and the original audio data can be synchronously combined according to the timestamps of the video frames in the target video data and the timestamps of the audio frames in the original audio data, to generate the target live stream.
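Timestamp-based combination can be illustrated by interleaving the two streams in timestamp order. The sketch below assumes that both streams are already expressed on the same reference time axis and are each sorted by timestamp; the `Packet` layout and the `mux` function are hypothetical, and a real container format would add its own packaging on top.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# Hypothetical packet layout: (timestamp on the reference time axis, kind, payload).
Packet = Tuple[float, str, bytes]


def mux(video: Iterable[Tuple[float, bytes]],
        audio: Iterable[Tuple[float, bytes]]) -> Iterator[Packet]:
    """Interleave video and audio frames into one stream ordered by timestamp."""
    video_packets = ((ts, "video", data) for ts, data in video)
    audio_packets = ((ts, "audio", data) for ts, data in audio)
    # heapq.merge lazily merges inputs that are each already sorted by the key.
    return heapq.merge(video_packets, audio_packets, key=lambda p: p[0])
```

Because the merge is lazy, packets can be emitted as soon as both inputs have advanced far enough, which suits a stream that is still being produced.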

Step 106: play the target live stream.

In this step, the target live stream generated in step 105 is played. The target live stream is live video with subtitles, and the subtitles synchronized with the live video help the user understand its content and improve the viewing experience.

It can be seen that the method for processing a live stream provided by this embodiment of the present invention adds the text characters corresponding to the audio data to the live video, so that synchronized subtitles are played while the live video is playing, which helps the user understand the content of the live video and improves the viewing experience.

For foreign-language live video, translated subtitles can also be added to make viewing easier for the user. For this application scenario, in one implementation, an embodiment of the present invention further provides a method for processing a live stream. Referring to Fig. 2, which is another flowchart of a method for processing a live stream according to an embodiment of the present invention, the method includes the following steps:

Step 201: decode an original live stream of a preset duration into original audio data and original video data.

For the specific process and technical effect of this step, refer to step 101 of the method for processing a live stream shown in Fig. 1; details are not repeated here.

Step 202: perform speech recognition on the original audio data to generate the text characters corresponding to the original audio data.

For the specific process and technical effect of this step, refer to step 102 of the method for processing a live stream shown in Fig. 1; details are not repeated here.

Step 203: determine the first duration consumed by the speech recognition.

In this step, the first duration consumed by the speech recognition in step 202 can be obtained.

Typically, when the preset duration of the original live stream obtained in step 201 is 60 seconds, the first duration consumed by the speech recognition may be about 10 seconds. When the preset duration of the original live stream obtained in step 201 is 120 seconds, the first duration consumed by the speech recognition may be about 20 seconds.

Step 204: translate the text characters into a preset language and generate a second duration.

The second duration is the time consumed by translating the text characters into the preset language.

In this step, the text characters corresponding to the original audio data generated in step 202 can be translated into a preset language according to actual needs, and the second duration consumed by the translation is calculated.

For example, for English live video, the text characters corresponding to the original audio data are also in English; the English text characters can then be translated into Chinese to make viewing easier for domestic users.

In this embodiment, in addition to the speech recognition performed on the original audio data, the generated text characters are also translated. The original video data therefore needs to be delayed by the total of the first duration consumed by the speech recognition and the second duration consumed by the translation. Only when the recognized text characters are added to the delayed video data are the subtitles synchronized with the video.

Step 205: delay the timestamps of the original video data by the sum of the first duration and the second duration.

In this step, the original video data is delayed according to the first duration calculated in step 203 and the second duration calculated in step 204, so that the subtitles are synchronized with the video.
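The two durations can be obtained by timing each processing stage and summing the results. The sketch below is only an illustration: `fake_recognize` and `fake_translate` are stand-ins invented here for a real speech recognizer and translator, and `run_timed_stages` is a hypothetical helper.

```python
import time
from typing import Any, Callable, List, Sequence, Tuple


def run_timed_stages(value: Any,
                     stages: Sequence[Callable[[Any], Any]]) -> Tuple[Any, List[float]]:
    """Run the stages in order, returning the final result and each stage's duration in seconds."""
    durations: List[float] = []
    for stage in stages:
        start = time.perf_counter()
        value = stage(value)
        durations.append(time.perf_counter() - start)
    return value, durations


def fake_recognize(audio: bytes) -> str:
    """Placeholder for speech recognition (assumption: the audio bytes are UTF-8 text)."""
    return audio.decode("utf-8")


def fake_translate(text: str) -> str:
    """Placeholder for translation (assumption: upper-casing stands in for translating)."""
    return text.upper()


subtitle, (first_duration, second_duration) = run_timed_stages(
    b"hello", [fake_recognize, fake_translate]
)
# The video timestamps would be delayed by this total.
total_delay = first_duration + second_duration
```

Appending an error-correction stage to the list would yield a third duration in the same way, and the total delay would become the sum of all three.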

Regarding the execution order of steps 202 to 204, it should be noted that step 202 may be performed before step 204, or step 204 may be performed before step 202.

Step 206: add the translated text characters to the delayed video data to generate target video data.

In this step, the translated text characters generated in step 204 are added to the delayed video data produced in step 205 to generate target video data with translated subtitles. It can be understood that the video and the translated subtitles in the target video data are synchronized.

Step 207: synchronously combine the target video data with the original audio data to generate a target live stream.

In this step, the target video data generated in step 206 is synchronously combined with the original audio data generated in step 201 to generate the target live stream, so that the live stream carries synchronized translated subtitles.

Step 208: play the target live stream.

In this step, the target live stream generated in step 207 is played. The target live stream is live video with translated subtitles, and the translated subtitles synchronized with the live video help the user better understand its content, especially for foreign-language live video, and improve the viewing experience.

It can be seen that, for foreign-language live video, the method for processing a live stream provided by this embodiment of the present invention adds the translated text characters corresponding to the audio data to the live video, so that synchronized translated subtitles are played while the live video is playing, which helps the user understand the content of the foreign-language live video and greatly improves the viewing experience.

To improve the accuracy of the translated subtitles, the translated subtitles can also be corrected. For this application scenario, in another implementation, an embodiment of the present invention further provides a method for processing a live stream. Referring to Fig. 3, which is yet another flowchart of a method for processing a live stream according to an embodiment of the present invention, the method includes the following steps:

Step 301: determine, in the original live stream within a preset duration interval, a time point at which the speech pauses.

Step 302: decode the not-yet-decoded live stream segment that precedes the time point in the original live stream into original audio data and original video data.

For the specific process and technical effect of steps 301 and 302, refer to the related description under step 101 of the method for processing a live stream shown in Fig. 1; details are not repeated here.

Step 303: perform speech recognition on the original audio data to generate the text characters corresponding to the original audio data.

Step 304: determine the first duration consumed by the speech recognition.

For the specific process and technical effect of this step, refer to step 203 of the method for processing a live stream shown in Fig. 2; details are not repeated here.

Step 305: translate the text characters into a preset language and generate a second duration.

For the specific process and technical effect of this step, refer to step 204 of the method for processing a live stream shown in Fig. 2; details are not repeated here.

Step 306: perform error correction on the translated text characters.

In this step, error correction can be performed on the translated text characters generated in step 305 to ensure the accuracy of the subtitles.

The error correction can be performed manually or by a machine.

Step 307: determine the third duration consumed by the error correction.

In this step, the third duration consumed by the error correction performed on the translated characters in step 306 is calculated. In this embodiment, in addition to the speech recognition performed on the original audio data, the generated text characters are also translated and corrected. The original video data can therefore be delayed by the total of the first duration consumed by the speech recognition, the second duration consumed by the translation and the third duration consumed by the error correction. Only when the recognized text characters are added to the delayed video data are the subtitles synchronized with the video.

Step 308: delay the timestamps of the original video data by the sum of the first duration, the second duration and the third duration.

In this step, the original video data is delayed according to the first duration calculated in step 303, the second duration calculated in step 305 and the third duration calculated in step 307, so that the subtitles are synchronized with the video.

Step 309: add the translated and corrected text characters to the delayed video data to generate target video data.

In this step, the translated and corrected text characters generated in step 306 are added to the delayed video data produced in step 308 to generate target video data with accurately translated subtitles.

Step 310: synchronously combine the target video data with the original audio data to generate a target live stream.

In this step, the target video data generated in step 309 is synchronously combined with the original audio data generated in step 301 to generate the target live stream, so that the live stream carries synchronized, accurately translated subtitles.

Step 311: play the target live stream.

In this step, the target live stream generated in step 310 is played. The target live stream is live video with accurately translated subtitles. Translated subtitles that are both accurate and synchronized with the live video help the user understand its content easily and correctly and, especially for foreign-language live video, can greatly improve the viewing experience.

It can be seen that the method for processing a live stream provided by this embodiment of the present invention adds the translated text characters corresponding to the audio data to the live video and corrects the translated text characters, so that synchronized and accurate translated subtitles are played while the live video is playing. This helps the user understand the content of live video, especially foreign-language live video, easily and correctly, and gives the user a better viewing experience.

An embodiment of the present invention further provides a system for processing a live stream. Referring to Fig. 4, Fig. 4 is a schematic diagram of a system for processing a live stream according to an embodiment of the present invention.

As shown in Fig. 4, the system for processing a live stream includes a stream-pulling module 401, a decoding module 402, a subtitle acquisition module 403, an audio encoding module 404, a video encoding module 405, a packaging module 406 and a stream-pushing module 407. The subtitle acquisition module 403 includes a speech recognition submodule 4031, a translation submodule 4032 and a manual correction submodule 4033. The video encoding module 405 includes a video data buffer delay submodule 4051, a subtitle overlay submodule 4052 and a video encoding submodule 4053.

The workflow of the system for processing a live stream is as follows:

First step: the stream-pulling module 401 obtains and downloads an original live stream of a preset duration from a server.

In practice, the live video resources to be processed are usually stored on the server of a multimedia website.

Second step: the decoding module 402 decodes the original live stream of the preset duration into original audio data and original video data.

Specifically, the decoding module 402 can decode a segment of the original live stream of the preset duration into original audio data and original video data, so that subtitles can then be added to that segment.

Third step: the original audio data is duplicated; one copy is sent to the subtitle acquisition module 403 and the other to the audio encoding module 404.

Specifically, of the two copies of the original audio data, one is sent to the subtitle acquisition module 403 to generate the subtitles corresponding to the original audio, and the other is sent to the audio encoding module 404.

4th step, audio coding module 404 are compressed original audio data processing.

Specifically, since the data volume of original audio data is larger, using audio coding module 404 to original sound Frequency is handled according into overcompression, in order to network transmission.

5th step, the speech recognition submodule 4031 in subtitle acquisition module 403 carry out voice knowledge to original audio data Not, the corresponding text character of original audio data is generated.

Specifically, speech recognition submodule 4031 can carry out speech recognition to original audio data, corresponding text is generated This character text character to be added in live TV stream by subsequent step, realizes the net cast for playing and treating subtitle.

Step 6: the translation submodule 4032 in the subtitle acquisition module 403 translates the text characters into a preset language.

Specifically, the translation submodule 4032 may, according to actual requirements, translate the text characters corresponding to the original audio data into the preset language.

For example, the foreign-language text characters corresponding to the original audio data of a foreign-language live video may be translated into Chinese. In this way, users who like watching foreign-language live video but whose foreign-language proficiency is limited can also understand the content through the translated subtitles, which improves the user experience.

Step 7: the manual correction submodule 4033 in the subtitle acquisition module 403 performs error correction on the translated text characters.

Specifically, the manual correction submodule 4033 may perform error correction on the translated text characters to ensure the accuracy of the subtitles.

Step 8: the video data buffering delay submodule 4051 in the video encoding module 405 delays the original video data according to the time consumed by the fifth through seventh steps.

Specifically, the video data buffering delay submodule 4051 may delay the original video data by the total of the time consumed by speech recognition, the time consumed by translation and the time consumed by error correction. Only when the recognized text characters are added to the video data delayed in this way can the subtitles be kept synchronized with the video.
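The delay calculation can be illustrated with a short sketch. It assumes that each video frame carries a presentation timestamp in milliseconds and that the three processing durations have already been measured; the type `VideoFrame`, the functions and the sample values are hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class VideoFrame:
    pts_ms: int   # presentation timestamp in milliseconds (assumed unit)
    data: bytes   # raw frame payload

def total_delay_ms(recognition_ms, translation_ms=0, correction_ms=0):
    """Sum the time spent on recognition, translation and error correction."""
    return recognition_ms + translation_ms + correction_ms

def delay_frames(frames, delay_ms):
    """Return copies of the frames with every timestamp shifted by delay_ms."""
    return [replace(f, pts_ms=f.pts_ms + delay_ms) for f in frames]

# Hypothetical measurements for one segment.
frames = [VideoFrame(0, b"f0"), VideoFrame(40, b"f1")]
delay = total_delay_ms(recognition_ms=800, translation_ms=300, correction_ms=1500)
delayed = delay_frames(frames, delay)
```

Translation and error correction default to zero here so that the same helper covers the variant in which only speech recognition is performed.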

Step 9: the subtitle overlay submodule 4052 in the video encoding module 405 adds the translated and corrected text characters to the delayed video data to generate target video data.

Specifically, the subtitle overlay submodule 4052 may add the translated and corrected text characters to the delayed video data, generating target video data that carries accurately translated subtitles.
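One way to picture the overlay step is to attach each subtitle to the frames whose timestamps fall inside the subtitle's display interval. The sketch below makes that assumption; the `Subtitle` and `Frame` types and the function `overlay_subtitles` are hypothetical, and a real implementation would render the text into the picture rather than store it in a field.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Subtitle:
    start_ms: int  # first instant at which the text is shown
    end_ms: int    # instant at which the text disappears (exclusive)
    text: str

@dataclass
class Frame:
    pts_ms: int                    # timestamp of the delayed video frame
    caption: Optional[str] = None  # text attached to the frame, if any

def overlay_subtitles(frames, subtitles):
    """Attach to each frame the first subtitle whose interval contains it."""
    for frame in frames:
        for sub in subtitles:
            if sub.start_ms <= frame.pts_ms < sub.end_ms:
                frame.caption = sub.text
                break
    return frames

# Hypothetical sample: only the middle frame lies inside the subtitle interval.
frames = [Frame(0), Frame(1000), Frame(2000)]
overlay_subtitles(frames, [Subtitle(500, 1500, "hello")])
```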

Step 10: the video encoding submodule 4053 in the video encoding module 405 compresses the target video data.

Specifically, because the volume of the target video data is large, the video encoding submodule 4053 compresses it to facilitate network transmission.

Step 11: the packaging module 406 synchronously combines the compressed target video data with the compressed original audio data to generate a target live stream.

Specifically, the packaging module 406 may synchronously combine the compressed target video data with the compressed original audio data to generate the target live stream, so that the live stream carries synchronized, accurately translated subtitles.

Step 12: the stream pushing module 407 plays the target live stream.

Specifically, the stream pushing module 407 plays the target live stream, which is a live video with accurately translated subtitles. Subtitles that are accurate and synchronized with the live video help users understand the content easily and correctly and, for foreign-language live video in particular, can greatly improve the viewing experience.

It can be seen that the system for processing a live stream provided by the embodiment of the present invention adds the translated text characters corresponding to the audio data to the live video and performs error correction on the translated text characters, so that synchronized and accurate subtitles are played together with the live video. This helps users understand the content of live video, especially foreign-language live video, easily and correctly, and gives them a better viewing experience.

An embodiment of the present invention further provides a device for processing a live stream. Referring to Fig. 5, which is a structural diagram of the device for processing a live stream according to an embodiment of the present invention, the device includes:

A decoding unit 501, configured to decode an original live stream into original audio data and original video data;

A recognition unit 502, configured to perform speech recognition on the original audio data and generate text characters corresponding to the original audio data;

A delay unit 503, configured to delay the original video data according to a first duration consumed by the speech recognition;

An adding unit 504, configured to add the text characters to the delayed video data and generate target video data;

A synthesis unit 505, configured to synchronously combine the target video data with the original audio data and generate a target live stream;

A playing unit 506, configured to play the target live stream.

Optionally, the decoding unit 501 is specifically configured to decode an original live stream of a preset duration into original audio data and original video data.

Optionally, the decoding unit 501 includes a first determination subunit and a decoding subunit;

The first determination subunit is configured to determine a time point of a speech pause in the original live stream within a preset duration interval;

The decoding subunit is configured to decode the live stream segment that precedes the time point in the original live stream and has not yet been decoded into original audio data and original video data.
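The cooperation of the first determination subunit and the decoding subunit can be sketched as follows. The embodiment does not say how a speech pause is detected; this sketch assumes that the audio has already been reduced to one energy value per short window and treats any window below a threshold as a pause. The function names, the threshold, the choice to include the pause window in the segment and the fallback used when no pause is found are all assumptions.

```python
def last_pause_index(energies, threshold):
    """Return the index of the last window quieter than threshold, or None."""
    for i in range(len(energies) - 1, -1, -1):
        if energies[i] < threshold:
            return i
    return None

def next_segment(energies, decoded_upto, window_len, threshold):
    """Choose the undecoded span [decoded_upto, cut) that ends at a pause.

    Only the next window_len energy windows (the preset duration interval)
    are searched for a pause.
    """
    window = energies[decoded_upto:decoded_upto + window_len]
    pause = last_pause_index(window, threshold)
    if pause is None:
        # Assumed fallback: no pause found, so take the whole interval.
        cut = decoded_upto + len(window)
    else:
        # Include the pause window itself so that each call makes progress.
        cut = decoded_upto + pause + 1
    return decoded_upto, cut

# Hypothetical per-window energies; values below 0.1 count as silence.
energies = [0.9, 0.8, 0.05, 0.7, 0.6, 0.02, 0.9, 0.8]
first = next_segment(energies, decoded_upto=0, window_len=7, threshold=0.1)
second = next_segment(energies, decoded_upto=first[1], window_len=7, threshold=0.1)
```

Cutting at a pause rather than at a fixed interval avoids splitting a spoken sentence across two recognition requests.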

Optionally, the delay unit 503 includes a second determination subunit and a delay subunit;

The second determination subunit is configured to determine the first duration spent on the speech recognition;

The delay subunit is configured to delay the timestamps of the original video data by the first duration.
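A minimal sketch of how the first duration might be determined and applied is given below. It assumes that the duration is obtained by timing the recognition call with a monotonic clock; `timed_ms` and `recognize_stub` are hypothetical names, and the stub merely stands in for a real speech recognizer.

```python
import time

def timed_ms(func, *args):
    """Run func(*args) and return (result, elapsed time in whole milliseconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed_ms = round((time.perf_counter() - start) * 1000)
    return result, elapsed_ms

def recognize_stub(audio):
    """Placeholder for a real speech recognizer (hypothetical)."""
    time.sleep(0.01)  # simulate processing time
    return "hello"

# Measure the first duration, then shift every video timestamp by it.
text, first_duration_ms = timed_ms(recognize_stub, b"\x00\x01")
video_pts_ms = [0, 40, 80]
delayed_pts_ms = [pts + first_duration_ms for pts in video_pts_ms]
```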

Optionally, the device further includes:

A translation unit, configured to translate the text characters into a preset language and generate a second duration, the second duration being the time spent translating the text characters into the preset language;

The delay subunit is specifically configured to delay the timestamps of the original video data by the sum of the first duration and the second duration;

The adding unit is specifically configured to add the translated text characters to the delayed video data and generate target video data.

Optionally, the device further includes:

An error correction unit, configured to perform error correction on the translated text characters;

A determination unit, configured to determine a third duration spent on the error correction;

The delay subunit is specifically configured to delay the timestamps of the original video data by the sum of the first duration, the second duration and the third duration;

The adding unit is specifically configured to add the translated and corrected text characters to the delayed video data and generate target video data.

Optionally, the synthesis unit 505 is specifically configured to, based on a preset reference time axis, synchronously combine the target video data with the original audio data according to the timestamps of the video frames in the target video data and the timestamps of the audio frames in the original audio data, and generate the target live stream.
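The timestamp-based synthesis can be illustrated by interleaving two packet sequences on a common time axis. The sketch assumes that video and audio packets are each already sorted by a timestamp expressed on the same reference time axis; the `(timestamp, label)` tuple layout and the function `mux` are hypothetical.

```python
import heapq

def mux(video_packets, audio_packets):
    """Interleave two timestamp-sorted packet lists into one ordered stream."""
    # heapq.merge keeps the output ordered by timestamp; for equal timestamps
    # the packet from the first argument (video) is emitted first.
    return list(heapq.merge(video_packets, audio_packets, key=lambda p: p[0]))

# Hypothetical packets: (timestamp in ms on the reference time axis, label).
video = [(0, "V0"), (40, "V1"), (80, "V2")]
audio = [(0, "A0"), (23, "A1"), (46, "A2"), (69, "A3")]
stream = mux(video, audio)
```

A merge of this kind never reorders packets within either input, so each elementary stream stays in its original order inside the combined stream.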

It can be seen that the device for processing a live stream provided by the embodiment of the present invention adds the text characters corresponding to the audio data to the live video, so that synchronized subtitles are played together with the live video. This helps users understand the content of the live video and improves their viewing experience.

An embodiment of the present invention further provides an electronic device. Referring to Fig. 6, which is a schematic diagram of the electronic device according to an embodiment of the present invention, the electronic device includes a processor 601, a communication interface 602, a memory 603 and a communication bus 604, wherein the processor 601, the communication interface 602 and the memory 603 communicate with one another through the communication bus 604;

The memory 603 is configured to store a computer program;

The processor 601 is configured to implement the following steps when executing the program stored in the memory 603:

Decoding an original live stream into original audio data and original video data;

Performing speech recognition on the original audio data and generating text characters corresponding to the original audio data;

Delaying the original video data according to a first duration consumed by the speech recognition;

Adding the text characters to the delayed video data and generating target video data;

Synchronously combining the target video data with the original audio data and generating a target live stream;

Playing the target live stream.

The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus or the like. The communication bus may be divided into an address bus, a data bus, a control bus and so on. For ease of representation, it is shown as a single thick line in the figure, but this does not mean that there is only one bus or only one type of bus.

The communication interface is used for communication between the above electronic device and other devices.

The memory may include random access memory (RAM) and may also include non-volatile memory, for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP) and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In another embodiment provided by the present invention, a computer-readable storage medium is further provided. Instructions are stored in the computer-readable storage medium and, when run on a computer, cause the computer to perform the method for processing a live stream described in any of the above embodiments.

In another embodiment provided by the present invention, a computer program product containing instructions is further provided which, when run on a computer, causes the computer to perform the method for processing a live stream described in any of the above embodiments.

The above embodiments may be implemented wholly or partly by software, hardware, firmware or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the flows or functions described in the embodiments of the present invention are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired means (such as coaxial cable, optical fiber or digital subscriber line (DSL)) or wireless means (such as infrared, radio or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk or magnetic tape), an optical medium (for example, a DVD) or a semiconductor medium (for example, a solid state disk (SSD)).

It should be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.

The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the description of the method embodiment.

Claims

1. A method for processing a live stream, characterized in that the method comprises:

Decoding an original live stream into original audio data and original video data;

Performing speech recognition on the original audio data to generate text characters corresponding to the original audio data;

Delaying the original video data according to a first duration consumed by the speech recognition;

Adding the text characters to the delayed video data to generate target video data;

Synchronously combining the target video data with the original audio data to generate a target live stream;

Playing the target live stream.

2. The method according to claim 1, characterized in that the step of decoding the original live stream into original audio data and original video data comprises:

Decoding an original live stream of a preset duration into original audio data and original video data.

3. The method according to claim 1, characterized in that the step of decoding the original live stream into original audio data and original video data comprises:

Determining a time point of a speech pause in the original live stream within a preset duration interval;

Decoding the live stream segment that precedes the time point in the original live stream and has not yet been decoded into original audio data and original video data.

4. The method according to claim 1, characterized in that the step of delaying the original video data according to the first duration consumed by the speech recognition comprises:

Determining the first duration spent on the speech recognition;

Delaying the timestamps of the original video data by the first duration.

5. The method according to claim 4, characterized in that,

After the step of performing speech recognition on the original audio data to generate the text characters corresponding to the original audio data, the method further comprises:

Translating the text characters into a preset language and generating a second duration, the second duration being the time spent translating the text characters into the preset language;

The step of delaying the timestamps of the original video data by the first duration comprises:

Delaying the timestamps of the original video data by the sum of the first duration and the second duration;

The step of adding the text characters to the delayed video data to generate target video data comprises:

Adding the translated text characters to the delayed video data to generate target video data.

6. The method according to claim 5, characterized in that,

After the step of translating the text characters into the preset language, the method further comprises:

Performing error correction on the translated text characters;

Determining a third duration spent on the error correction;

The step of delaying the timestamps of the original video data by the sum of the first duration and the second duration comprises:

Delaying the timestamps of the original video data by the sum of the first duration, the second duration and the third duration;

The step of adding the text characters to the delayed video data to generate target video data comprises:

Adding the translated and corrected text characters to the delayed video data to generate target video data.

7. The method according to claim 1, characterized in that the step of synchronously combining the target video data with the original audio data to generate the target live stream comprises:

Based on a preset reference time axis, synchronously combining the target video data with the original audio data according to the timestamps of the video frames in the target video data and the timestamps of the audio frames in the original audio data, to generate the target live stream.

8. A device for processing a live stream, characterized in that the device comprises:

A decoding unit, configured to decode an original live stream into original audio data and original video data;

A recognition unit, configured to perform speech recognition on the original audio data and generate text characters corresponding to the original audio data;

A delay unit, configured to delay the original video data according to a first duration consumed by the speech recognition;

An adding unit, configured to add the text characters to the delayed video data and generate target video data;

A synthesis unit, configured to synchronously combine the target video data with the original audio data and generate a target live stream;

A playing unit, configured to play the target live stream.

9. The device according to claim 8, characterized in that,

The decoding unit is specifically configured to decode an original live stream of a preset duration into original audio data and original video data.

10. The device according to claim 8, characterized in that,

The decoding unit comprises a first determination subunit and a decoding subunit;

The first determination subunit is configured to determine a time point of a speech pause in the original live stream within a preset duration interval;

The decoding subunit is configured to decode the live stream segment that precedes the time point in the original live stream and has not yet been decoded into original audio data and original video data.

11. The device according to claim 8, characterized in that,

The delay unit comprises a second determination subunit and a delay subunit;

The second determination subunit is configured to determine the first duration spent on the speech recognition;

The delay subunit is configured to delay the timestamps of the original video data by the first duration.

12. The device according to claim 11, characterized in that the device further comprises:

A translation unit, configured to translate the text characters into a preset language and generate a second duration, the second duration being the time spent translating the text characters into the preset language;

The delay subunit is specifically configured to delay the timestamps of the original video data by the sum of the first duration and the second duration;

The adding unit is specifically configured to add the translated text characters to the delayed video data and generate target video data.

13. The device according to claim 12, characterized in that the device further comprises:

An error correction unit, configured to perform error correction on the translated text characters;

A determination unit, configured to determine a third duration spent on the error correction;

The delay subunit is specifically configured to delay the timestamps of the original video data by the sum of the first duration, the second duration and the third duration;

The adding unit is specifically configured to add the translated and corrected text characters to the delayed video data and generate target video data.

14. The device according to claim 8, characterized in that,

The synthesis unit is specifically configured to, based on a preset reference time axis, synchronously combine the target video data with the original audio data according to the timestamps of the video frames in the target video data and the timestamps of the audio frames in the original audio data, and generate the target live stream.

15. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;

The memory is configured to store a computer program;

The processor is configured to implement the method steps of any one of claims 1-7 when executing the program stored in the memory.