CN111010614A - Method, device, server and medium for displaying live caption - Google Patents

Method, device, server and medium for displaying live caption

Info

Publication number
CN111010614A
CN111010614A (Application CN201911364479.6A)
Authority
CN
China
Prior art keywords
data
video frame
audio data
frame data
live broadcast
Prior art date
Legal status
Pending
Application number
CN201911364479.6A
Other languages
Chinese (zh)
Inventor
孙鹏飞
张涛
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201911364479.6A priority Critical patent/CN111010614A/en
Publication of CN111010614A publication Critical patent/CN111010614A/en
Pending legal-status Critical Current

Classifications

    • H04N: Pictorial communication, e.g. television
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/2187: Live feed
    • H04N21/233: Processing of audio elementary streams
    • H04N21/234336: Reformatting of video signals by media transcoding, e.g. audio is converted into text
    • H04N21/4307: Synchronising the rendering of multiple content streams or additional data on devices
    • H04N21/4884: Data services, e.g. news ticker, for displaying subtitles
    • H04N21/8547: Content authoring involving timestamps for synchronizing content
    • H04N21/8586: Linking data to content by using a URL

Abstract

The embodiment of the invention provides a method, a device, a server and a medium for displaying live subtitles, relating to the technical field of information processing. The scheme of the embodiment of the application comprises the following steps: receiving a live broadcast task instruction sent by a live broadcast management system; receiving and caching, based on the live broadcast task instruction, audio data and video frame data sent by the live broadcast management system; performing voice recognition on the audio data to obtain text data corresponding to the audio data; superimposing, according to the timestamp of the text data and the timestamp of the video frame data, the text data onto the video frame data having the same timestamp as the text data, to obtain video frame data carrying subtitle information; and outputting a media stream composed of the video frame data carrying the subtitle information and the audio data. By adopting this method, subtitles can be displayed for live video.

Description

Method, device, server and medium for displaying live caption
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to a method, an apparatus, a server, and a medium for displaying live subtitles.
Background
At present, most video content has subtitles, for example Spring Festival Gala broadcasts, television dramas, and movies, and the subtitles are displayed synchronously during playback. The subtitles in such videos are produced manually offline, that is, only after the video has been recorded are the subtitles added manually, and only then can the subtitles be displayed synchronously during playback.
Live video, however, must be played while it is being recorded, so subtitles cannot be recorded manually before playback. As a result, subtitles for live video currently cannot be displayed synchronously during the live broadcast.
Disclosure of Invention
An object of the embodiments of the present invention is to provide a method, an apparatus, an electronic device, and a medium for displaying live subtitles, so as to solve the problem that subtitles cannot be displayed synchronously during live video. The specific technical solutions are as follows:
in a first aspect, an embodiment of the present application provides a method for displaying live subtitles, where the method is applied to a server, and the method includes:
receiving a live broadcast task instruction sent by a live broadcast management system;
receiving and caching audio data and video frame data sent by the live broadcast management system based on the live broadcast task instruction, and performing voice recognition on the audio data to obtain text data corresponding to the audio data;
superimposing, according to the timestamp of the text data and the timestamp of the video frame data, the text data onto the video frame data having the same timestamp as the text data, to obtain video frame data carrying subtitle information;
and outputting the media stream consisting of the video frame data carrying the subtitle information and the audio data.
In a possible implementation manner, the receiving and caching audio data and video data sent by the live broadcast management system based on the live broadcast task instruction, and performing voice recognition on the audio data to obtain text data corresponding to the audio data includes:
analyzing the live broadcast task instruction;
and if the live broadcast task instruction carries a voice recognition parameter, receiving and caching audio data and video data sent by the live broadcast management system, and simultaneously carrying out voice recognition on the audio data to obtain text data corresponding to the audio data.
In a possible implementation manner, the performing speech recognition on the audio data to obtain text data corresponding to the audio data includes:
resampling the audio data to obtain Pulse Code Modulation (PCM) data with a specified sampling rate, and adding a time stamp to the PCM data based on the time stamp of the audio data;
assembling the PCM data into PCM packets of a specified size;
performing voice recognition on PCM data in a PCM packet at intervals of preset time to obtain text data corresponding to the PCM data in the PCM packet;
and caching text data corresponding to the PCM data in the PCM packet to a text processing queue.
In a possible implementation manner, the superimposing, according to the timestamp of the text data and the timestamp of the video frame data, the text data onto the video frame data having the same timestamp as the text data to obtain video frame data carrying subtitle information includes:
and when the video frame data cache time-out exists, acquiring text data with the same time stamp as the video frame data from the text processing queue, and overlapping the acquired text data serving as subtitle information to the video frame data to obtain the video frame data with the subtitle information.
In a possible implementation manner, after the obtained text data is superimposed to the video frame data as subtitle information to obtain video frame data carrying subtitle information, the method further includes:
and carrying out audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data carrying the subtitle information.
In a possible implementation manner, the performing audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data carrying the subtitle information includes:
inputting the video frame data carrying the caption information and the audio data with the same timestamp as the video frame data into a cache region, and performing audio and video synchronization processing on the video frame data carrying the caption information and the audio data with the same timestamp as the video frame data carrying the caption information in the cache region;
the outputting the media stream composed of the video frame data carrying the caption information and the audio data comprises:
and when the buffering duration, in the cache region, of the video frame data carrying the subtitle information and the audio data subjected to audio and video synchronization processing reaches a specified duration, outputting a media stream consisting of the video frame data carrying the subtitle information and the audio data.
In a second aspect, an embodiment of the present application provides an apparatus for displaying live subtitles, where the apparatus is applied to a server, and the apparatus includes:
the receiving module is used for receiving a live broadcast task instruction sent by a live broadcast management system; receiving and caching audio data and video frame data sent by the live broadcast management system based on the live broadcast task instruction;
the voice recognition module is used for carrying out voice recognition on the audio data to obtain text data corresponding to the audio data;
the superposition module is used for superposing the text data to the video frame data with the same timestamp as the text data according to the timestamp of the text data and the timestamp of the video frame data to obtain the video frame data with the subtitle information;
and the output module is used for outputting the media stream consisting of the video frame data carrying the subtitle information and the audio data.
In a possible implementation manner, the receiving module is specifically configured to parse the live task instruction; if the live broadcast task instruction carries a voice recognition parameter, receiving and caching audio data and video data sent by the live broadcast management system;
the voice recognition module is configured to perform voice recognition on the audio data while the receiving module caches the audio data and the video data, so as to obtain text data corresponding to the audio data.
In a possible implementation manner, the speech recognition module is specifically configured to:
resampling the audio data to obtain Pulse Code Modulation (PCM) data with a specified sampling rate, and adding a time stamp to the PCM data based on the time stamp of the audio data;
assembling the PCM data into PCM packets of a specified size;
performing voice recognition on PCM data in a PCM packet at intervals of preset time to obtain text data corresponding to the PCM data in the PCM packet;
and caching text data corresponding to the PCM data in the PCM packet to a text processing queue.
In a possible implementation manner, the superimposing module is specifically configured to, when there is a video frame data cache timeout, obtain text data with a timestamp that is the same as that of the video frame data from the text processing queue, and superimpose the obtained text data as subtitle information on the video frame data to obtain video frame data with subtitle information.
In one possible implementation, the apparatus further includes:
and the synchronization module is used for carrying out audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data carrying the subtitle information.
In a possible implementation manner, the synchronization module is specifically configured to input the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data into a buffer area, and perform audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data carrying the subtitle information in the buffer area;
the output module is specifically configured to output a media stream composed of the video frame data carrying the subtitle information and the audio data when the buffering duration of the video frame data carrying the subtitle information and the audio data in the buffer area, which are subjected to audio and video synchronization processing, reaches a specified duration.
In a third aspect, an embodiment of the present application provides a server, which includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
a processor configured to implement the method steps of the first aspect when executing the program stored in the memory.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements the method steps in the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
By adopting the method for displaying the live caption, after a live task instruction sent by a live management system is received, audio data and video frame data sent by the live management system can be received based on the live task instruction, voice recognition is carried out on the audio data, text data corresponding to the audio data is obtained, then the text data is superposed on the video frame data with the same timestamp as the text data according to the timestamp of the text data and the timestamp of the video frame data, video frame data carrying caption information is obtained, and a media stream consisting of the video frame data carrying the caption information and the audio data is output. Because in the live broadcast process, voice recognition can be carried out on the audio data in real time to obtain the text data corresponding to the audio data, and the text data is already superposed into the video frame data before the media stream is output, the application can be adopted to realize synchronous display of the subtitles in the live broadcast process.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a method for displaying live subtitles according to an embodiment of the present application;
fig. 2 is a flowchart of another method for displaying live subtitles according to an embodiment of the present application;
fig. 3 is a flowchart of another method for displaying live subtitles according to an embodiment of the present application;
fig. 4 is a flowchart of another method for displaying live subtitles according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a live broadcast system according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an apparatus for displaying live subtitles according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of another apparatus for displaying live subtitles according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to display subtitles in a live broadcast process, an embodiment of the present application provides a method for displaying live subtitles, where as shown in fig. 1, the method is applied to a server, and includes:
s101, receiving a live broadcast task instruction sent by a live broadcast management system.
S102, receiving and caching audio data and video frame data sent by a live broadcast management system based on a live broadcast task instruction, and carrying out voice recognition on the audio data to obtain text data corresponding to the audio data.
And S103, superimposing, according to the timestamp of the text data and the timestamp of the video frame data, the text data onto the video frame data having the same timestamp as the text data, to obtain video frame data carrying the caption information.
And S104, outputting the media stream consisting of the video frame data carrying the subtitle information and the audio data.
By adopting the method for displaying the live caption, after a live task instruction sent by a live management system is received, audio data and video frame data sent by the live management system can be received based on the live task instruction, voice recognition is carried out on the audio data, text data corresponding to the audio data is obtained, then the text data is superposed on the video frame data with the same timestamp as the text data according to the timestamp of the text data and the timestamp of the video frame data, video frame data carrying caption information is obtained, and a media stream consisting of the video frame data carrying the caption information and the audio data is output. Because in the live broadcast process, voice recognition can be carried out on the audio data in real time to obtain the text data corresponding to the audio data, and the text data is already superposed into the video frame data before the media stream is output, the application can be adopted to realize synchronous display of the subtitles in the live broadcast process.
In S101, the live broadcast management system may send a live broadcast task instruction to one of the servers based on the load of each server. The live broadcast task instruction carries a live broadcast channel ID and live broadcast related parameters, wherein the live broadcast related parameters comprise a code rate, a frame rate, a coding type, a live broadcast type and the like of a live broadcast video.
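By way of illustration only (the class and field names below are assumptions of this description, not part of the patent disclosure), a live broadcast task instruction carrying these parameters might be modeled as a small Python structure:

    # Hypothetical model of a live broadcast task instruction; all names are illustrative.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LiveTaskInstruction:
        channel_id: str                      # live broadcast channel ID
        bitrate_kbps: int                    # code rate of the live video
        frame_rate: int                      # frame rate
        codec: str                           # encoding type, e.g. "h264"
        live_type: str                       # live broadcast type
        asr_language: Optional[str] = None   # voice recognition parameter; None means no subtitles

    instruction = LiveTaskInstruction(channel_id="channel-001", bitrate_kbps=2500,
                                      frame_rate=25, codec="h264", live_type="event",
                                      asr_language="zh")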
In the above S102, the live transcoding service process of the server may receive and cache the audio data and the video data sent by the live management system, and transmit the audio data to the voice recognition service process, and the voice recognition service process performs voice recognition on the audio data, so as to convert the audio data into text data.
In S103, the live transcoding service process may obtain text data identified by the speech recognition service process, and superimpose the text data on video frame data with the same timestamp as the text data.
In S104, the server may output the media stream to a Content Delivery Network (CDN) server, so that the CDN server provides live video with subtitles for each playback terminal.
Specifically, the server may perform transcoding service on live video frame data according to information such as a code rate, a frame rate, an encoding type, and a live broadcast type in the live broadcast related parameters, and then output the media stream to the corresponding CDN server according to the live broadcast channel ID.
In an implementation manner of the embodiment of the application, as shown in fig. 2, the step S102 of receiving, based on a live task instruction, audio data and video frame data sent by a live broadcast management system, and performing voice recognition on the audio data to obtain text data corresponding to the audio data specifically includes the following steps:
and S1021, analyzing the live broadcast task instruction.
And S1022, if the live broadcast task instruction carries the voice recognition parameter, receiving and caching the audio data and the video data sent by the live broadcast management system, and simultaneously performing voice recognition on the audio data to obtain text data corresponding to the audio data.
If the live broadcast task instruction carries an Automatic Speech Recognition (ASR) parameter, it indicates that a caption needs to be displayed in the live broadcast process, and then the server can simultaneously start a Speech Recognition service process and a live broadcast transcoding service process, and create a cache directory and a cache file for the live broadcast task instruction.
If the live broadcast task instruction does not carry the ASR parameter, voice recognition on the audio data is not needed, and the live broadcast transcoding service process of the server can directly output the received audio and video data after transcoding the received audio and video data.
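As a minimal sketch of the branch described in S1021 and S1022 (the directory layout, field names, and returned structure are assumptions for illustration; the actual transcoding and speech recognition processes are not started here), the decision could look like this in Python:

    # Parse the task instruction and prepare the subtitle pipeline only when an ASR
    # parameter is present; otherwise the stream would be transcoded and output directly.
    import os
    import tempfile

    def start_live_task(instruction: dict) -> dict:
        has_asr = bool(instruction.get("asr_language"))          # S1021: parse the instruction
        cache_dir = None
        if has_asr:                                               # S1022: subtitles are required
            cache_dir = tempfile.mkdtemp(prefix="live_" + instruction["channel_id"] + "_")
            os.makedirs(os.path.join(cache_dir, "av_cache"), exist_ok=True)
            # In the described system the live transcoding service process and the speech
            # recognition service process would both be started here; this sketch only
            # records the decision and creates the cache directory.
        return {"speech_recognition_enabled": has_asr, "cache_dir": cache_dir}

    print(start_live_task({"channel_id": "channel-001", "asr_language": "zh"}))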
In another implementation manner of the embodiment of the application, as shown in fig. 3, the S1022 stated above, performing voice recognition on the audio data to obtain text data corresponding to the audio data specifically includes the following steps:
s301, resampling the audio data to obtain Pulse Code Modulation (PCM) data with a specified sampling rate, and adding a time stamp to the PCM data based on the time stamp of the audio data.
Alternatively, the specified sampling rate may be a 16K sampling rate. The voice recognition system can recognize PCM data with a specified sampling rate, so that the live transcoding service process can convert the audio data into the PCM data with the specified sampling rate, stamp the PCM data, and push the PCM data to a receiving port of the voice recognition service process through a message queue according to the frequency of the input audio data.
S302, the PCM data are combined into PCM packets with the specified size.
To facilitate the processing of the speech recognition service process, the speech recognition service process may combine the received PCM data into PCM packets of a specified size, and one PCM packet may include a plurality of pieces of PCM data.
S303, performing voice recognition on the PCM data in one PCM packet at intervals of preset time to obtain text data corresponding to the PCM data in the PCM packet.
The speech recognition process may process one PCM packet at preset time intervals, and for the PCM data in the PCM packet, the PCM data are processed in the order from the early to the late of the time stamp of each PCM data.
S304, caching text data corresponding to the PCM data in the PCM packet to a text processing queue.
It is understood that, during the processing in the order from the early to the late according to the time stamp of each PCM data, the data obtained after the speech recognition may be buffered in the text processing queue in order.
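The following self-contained Python sketch illustrates S301 to S304 under stated assumptions: 16 kHz 16-bit mono PCM as the specified format, one second of audio per PCM packet, and a placeholder standing in for the real speech recognition service (audioop is the standard-library resampler, deprecated in recent Python versions):

    # Illustrative sketch of S301-S304; the data layout and the fake recognizer are assumptions.
    import audioop
    from collections import deque

    TARGET_RATE = 16000        # specified sampling rate (S301)
    PACKET_SAMPLES = 16000     # specified PCM packet size: here one second of audio (S302)

    text_queue = deque()       # text processing queue (S304)

    def resample(pcm_bytes, src_rate, timestamp):
        """S301: convert 16-bit mono PCM to the specified sampling rate, keep the timestamp."""
        converted, _ = audioop.ratecv(pcm_bytes, 2, 1, src_rate, TARGET_RATE, None)
        return {"pcm": converted, "ts": timestamp}

    def assemble_packets(chunks, packet_samples=PACKET_SAMPLES):
        """S302: group timestamped PCM chunks into packets of a specified size."""
        packet, count = [], 0
        for chunk in chunks:
            packet.append(chunk)
            count += len(chunk["pcm"]) // 2        # 2 bytes per 16-bit sample
            if count >= packet_samples:
                yield packet
                packet, count = [], 0
        if packet:
            yield packet

    def recognize_packet(packet):
        """S303: stand-in for speech recognition over the PCM data of one packet."""
        return [{"ts": chunk["ts"], "text": "<recognized text>"} for chunk in packet]

    # Two dummy one-second chunks of silent 48 kHz audio run through S301-S304.
    raw = [{"pcm": b"\x00\x00" * 48000, "rate": 48000, "ts": t} for t in (0.0, 1.0)]
    resampled = [resample(c["pcm"], c["rate"], c["ts"]) for c in raw]
    for packet in assemble_packets(resampled):
        text_queue.extend(recognize_packet(packet))  # S304: cached in timestamp order

    print(len(text_queue))   # 2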
Based on the method flow shown in fig. 3, as shown in fig. 4, in step S103, according to the time stamp of the text data and the time stamp of the video frame data, the text data is superimposed on the video frame data with the same time stamp as the text data, so as to obtain the video frame data carrying the subtitle information, and the method specifically may be implemented as follows:
and S1031, when the video frame data cache time-out exists, acquiring text data with the same time stamp as the video frame data from the text processing queue, and overlapping the acquired text data serving as subtitle information to the video frame data to obtain the video frame data with the subtitle information.
As an example, the buffering timeout duration of the video data may be 0.5 seconds, and may be set to other values, which is not limited in this embodiment of the application.
The embodiment of the application sets this cache timeout so that the voice recognition service process has enough time to perform voice recognition on the audio data before the video frame data reaches the timeout. The voice recognition is completed within the period during which the audio data and the video frame data are cached, so when the video frame data cache times out, the text data obtained by voice recognition is already available and can be used as the subtitle information of the video frame data.
By adopting the method, time is provided for voice recognition through short-time caching of the audio data and the video frame data, so that the text information obtained after the voice recognition can be synchronously played with the video frame data and the audio data.
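A minimal sketch of S1031, assuming a 0.5-second cache timeout, simple dictionaries for frames and recognized text, and an exact-match timestamp lookup (all of these are illustrative assumptions):

    # When a cached video frame reaches the timeout, take the text with the same
    # timestamp from the text processing queue and attach it as subtitle information.
    import time
    from collections import deque

    CACHE_TIMEOUT = 0.5   # example buffering timeout for video frame data, in seconds

    def attach_subtitles(video_cache, text_queue, now=None):
        now = time.monotonic() if now is None else now
        ready = []
        while video_cache and now - video_cache[0]["cached_at"] >= CACHE_TIMEOUT:
            frame = video_cache.popleft()
            text = next((t["text"] for t in text_queue if t["ts"] == frame["ts"]), None)
            if text is not None:
                frame["subtitle"] = text   # the frame now carries subtitle information
            ready.append(frame)
        return ready

    video_cache = deque([{"ts": 1.0, "cached_at": 0.0, "subtitle": None}])
    text_queue = deque([{"ts": 1.0, "text": "recognized caption"}])
    print(attach_subtitles(video_cache, text_queue, now=0.6))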
In another implementation manner of this embodiment, in step S1031, after the obtained text data is superimposed to the video frame data as the subtitle information to obtain the video frame data carrying the subtitle information, the method further includes:
and carrying out audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same time stamp as the video frame data carrying the subtitle information.
The step can be specifically realized as follows: and inputting video frame data carrying subtitle information and audio data with the same timestamp as the video frame data into a buffer area, and performing audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data carrying the subtitle information in the buffer area.
Optionally, the buffering duration of the video frame data and the audio data carrying the subtitle information in the buffer area can be adjusted in real time, that is, the video frame data and the audio data can be output from the buffer area after the audio and video synchronization processing is completed.
Alternatively, the buffering duration of the video frame data and audio data carrying the subtitle information in the buffer area may be set to a fixed value according to actual requirements, for example 0.5 second or 1 second; other values may also be used, which is not limited in this embodiment of the application. A longer buffering time gives the voice recognition service process a higher fault tolerance, but it also increases the live broadcast delay and degrades the user experience, so the buffering time preferably does not exceed 3 seconds.
Correspondingly, when the buffering time of the video frame data and the audio data which are subjected to the audio and video synchronization processing in the buffer area reaches the designated time, the media stream composed of the video frame data and the audio frame data is output.
By adopting this method, the server can complete audio and video synchronization within the buffering duration, so that the live video plays video frames, audio, and subtitles synchronously. If a voice recognition result is not delivered to the live transcoding service process in time, for example because of network jitter, the server still has buffer time to wait for it. Through this delayed playing mechanism, the problem of audio, video frames, and subtitles falling out of synchronization due to late delivery of voice recognition results can be avoided; the delay is short, for example 1 second, so the user is essentially unaware of it, and the live video playing effect is improved.
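A sketch of the delayed-output buffer described above, assuming a 1-second buffer duration and that matching video frame data and audio data arrive already paired by timestamp (both are assumptions for illustration):

    # Hold subtitle-carrying video frames and same-timestamp audio in a buffer region
    # and emit them only once the buffered span reaches the specified duration.
    BUFFER_DURATION = 1.0   # example value; the description suggests staying under 3 seconds

    class SyncBuffer:
        def __init__(self, duration=BUFFER_DURATION):
            self.duration = duration
            self.pairs = []   # list of (timestamp, video_frame, audio_chunk)

        def push(self, video_frame, audio_chunk):
            assert video_frame["ts"] == audio_chunk["ts"], "a pair must share one timestamp"
            self.pairs.append((video_frame["ts"], video_frame, audio_chunk))
            self.pairs.sort(key=lambda p: p[0])   # keep presentation order for A/V sync

        def pop_ready(self):
            """Return all buffered pairs once the buffered span covers the specified duration."""
            if not self.pairs or self.pairs[-1][0] - self.pairs[0][0] < self.duration:
                return []
            ready, self.pairs = self.pairs, []
            return [(video, audio) for _, video, audio in ready]

    buf = SyncBuffer()
    for ts in (0.0, 0.5, 1.0):
        buf.push({"ts": ts, "subtitle": "..."}, {"ts": ts, "pcm": b""})
    print(len(buf.pop_ready()))   # 3: the buffered span has reached 1.0 second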
As shown in fig. 5, the method provided in the embodiment of the present application may be applied to the live broadcast system shown in fig. 5, where the system includes a live broadcast management system, a plurality of service nodes connected to a background control system, and a database.
The live broadcast management system comprises a command analysis module, a data storage module and a scheduling and distributing module.
And the live broadcast management system is used for providing a background display interface.
In one implementation, an administrator can trigger a live broadcast management system to start a method for displaying live broadcast subtitles by triggering a live broadcast real-time subtitle button in a background management system.
Alternatively, in another implementation, the live management system may receive a Uniform Resource Locator (URL) request, such as http://xxxx/start?asr=zh, and then parse the URL request, thereby initiating the method for displaying live subtitles.
Specifically, the command parsing module is configured to parse the command triggered by the button or parse the URL request, generate an encoding service control command, and send the encoding service control command to the scheduling and distributing module.
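By way of illustration, such a URL request could be parsed with the Python standard library; the resulting command fields are assumptions, not the system's actual command format:

    # Turn a start-live URL such as http://xxxx/start?asr=zh into a simple
    # encoding service control command (the field names are illustrative).
    from urllib.parse import urlparse, parse_qs

    def parse_start_request(url):
        query = parse_qs(urlparse(url).query)
        return {"action": "start_encoding", "asr_language": query.get("asr", [None])[0]}

    print(parse_start_request("http://xxxx/start?asr=zh"))
    # {'action': 'start_encoding', 'asr_language': 'zh'}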
And the scheduling and distributing module is used for selecting a proper service node according to the resource condition of each service node after receiving the coded service control command, and then sending a live broadcast task instruction to the selected service node through the message queue.
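A small sketch of the scheduling decision, assuming each service node reports a load figure between 0 and 1 and that an in-memory queue stands in for the real message queue (both assumptions for illustration):

    # Pick the service node with the lowest load and hand it the live task instruction.
    import queue

    def dispatch_task(node_loads, instruction):
        target = min(node_loads, key=node_loads.get)   # least-loaded service node
        message_queue = queue.Queue()
        message_queue.put((target, instruction))       # stand-in for the real message queue
        return target, message_queue

    node, mq = dispatch_task({"node-a": 0.7, "node-b": 0.2}, {"channel_id": "channel-001"})
    print(node)   # node-b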
And the scheduling and distributing module is also used for storing the channel ID corresponding to the live broadcast task and the related parameters of the live broadcast video into a database through the data storage module.
And the data storage module is also used for storing the subsequently received audio data and video data to a database.
The service node, namely the server in the method flow, is configured to receive a live broadcast task instruction sent by the live broadcast management system, and execute the live broadcast task instruction according to the method flow in the method embodiment.
The service nodes in fig. 5 are deployed in a distributed manner, and each service node supports automatic invocation. After the live broadcast management system automatically discovers a service node, it can establish a connection with that node, and all service nodes are scheduled uniformly by the live broadcast management system.
The service node comprises a live broadcast encoding module, a live broadcast transcoding service module and a voice recognition service module. The configuration of each service node is the same in fig. 5, and fig. 5 exemplarily shows the structure of one service node.
The live broadcast coding module is used for receiving a live broadcast task instruction sent by a live broadcast management system, analyzing the live broadcast task instruction, starting a live broadcast transcoding service process of the live broadcast transcoding service module and a voice recognition service process of the voice recognition service module if the live broadcast task instruction carries ASR parameters, and creating a cache directory and a cache file used by the live broadcast transcoding service process and the voice recognition service process.
And the live broadcast transcoding service module is used for receiving the audio data and the video data sent by the live broadcast management system, caching the audio data and the video data in a single-thread mode, resampling the audio data, stamping a timestamp on the audio data, and pushing the audio data to a receiving port of the voice recognition module through a message queue according to audio input frequency.
After receiving the audio data, the voice recognition service module assembles the audio data into PCM packets of a specified size, runs the voice recognition algorithm at a specified time interval to convert the audio data into text data, and then stores the text data in the text processing queue in order.
When the video frame data cache time-out exists in the live transcoding service module, text data with the same time stamp as the video frame data is obtained from the text processing queue, subtitle data is added to the video frame data according to the text data, and after audio and video synchronization processing, media streams are output.
In the embodiment of the application, the live transcoding service module caches the audio data and the video data by adopting a single-thread cache mechanism, so that compared with a multi-thread model, the overhead is reduced, the development efficiency is improved, and the development difficulty and risk are reduced. In addition, the embodiment of the application adopts a plug-in deployment mode to realize the function of adding subtitles to the video, and has low coupling with the existing system, convenient subsequent upgrading and deployment and strong stability.
Based on the same technical concept, an embodiment of the present application further provides an apparatus for displaying live subtitles, as shown in fig. 6, where the apparatus is applied to a server, and the apparatus includes: a receiving module 601, a speech recognition module 602, an overlay module 603, and an output module 604.
A receiving module 601, configured to receive a live broadcast task instruction sent by a live broadcast management system; receiving and caching audio data and video frame data sent by a live broadcast management system based on a live broadcast task instruction;
the voice recognition module 602 is configured to perform voice recognition on the audio data to obtain text data corresponding to the audio data;
the superimposing module 603 is configured to superimpose the text data onto the video frame data with the same timestamp as the text data according to the timestamp of the text data and the timestamp of the video frame data, so as to obtain video frame data carrying subtitle information;
the output module 604 is configured to output a media stream composed of video frame data and audio data carrying subtitle information.
Optionally, the receiving module 601 is specifically configured to parse a live task instruction; if the live broadcast task instruction carries the voice recognition parameter, receiving and caching audio data and video data sent by the live broadcast management system;
the voice recognition module 602 is configured to perform voice recognition on the audio data while the receiving module caches the audio data and the video data, so as to obtain text data corresponding to the audio data.
Optionally, the speech recognition module 602 is specifically configured to:
resampling the audio data to obtain Pulse Code Modulation (PCM) data with a specified sampling rate, and adding a time stamp to the PCM data based on the time stamp of the audio data;
assembling the PCM data into PCM packets of a specified size;
performing voice recognition on PCM data in one PCM packet at intervals of preset time to obtain text data corresponding to the PCM data in the PCM packet;
and caching text data corresponding to the PCM data in the PCM packet to a text processing queue.
Optionally, the superimposing module 603 is specifically configured to, when there is a cache timeout of video frame data, obtain text data with the same timestamp as the video frame data from the text processing queue, and superimpose the obtained text data as subtitle information onto the video frame data to obtain video frame data with subtitle information.
Optionally, as shown in fig. 7, the apparatus further includes: a synchronization module 701.
And the synchronization module 701 is configured to perform audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data carrying the subtitle information.
Optionally, the synchronization module 701 is specifically configured to input video frame data carrying subtitle information and audio data having the same timestamp as the video frame data into a buffer area, and perform audio and video synchronization processing on the video frame data carrying subtitle information and the audio data having the same timestamp as the video frame data carrying subtitle information in the buffer area;
the output module 604 is specifically configured to output a media stream composed of video frame data carrying subtitle information and audio data when the buffering duration of the video frame data carrying subtitle information and the audio data which are subjected to audio and video synchronization processing in the buffer area reaches a specified duration.
By adopting the device for displaying the live caption, after receiving the live task instruction sent by the live management system, the audio data and the video frame data sent by the live management system can be received based on the live task instruction, voice recognition is carried out on the audio data, text data corresponding to the audio data is obtained, and then according to the timestamp of the text data and the timestamp of the video frame data, the text data is superposed on the video frame data with the same timestamp of the text data, video frame data carrying caption information is obtained, and then a media stream consisting of the video frame data carrying the caption information and the audio data is output. Because in the live broadcast process, voice recognition can be carried out on the audio data in real time to obtain the text data corresponding to the audio data, and the text data is already superposed into the video frame data before the media stream is output, the application can be adopted to realize synchronous display of the subtitles in the live broadcast process.
The embodiment of the present invention further provides a server, as shown in fig. 8, including a processor 801, a communication interface 802, a memory 803, and a communication bus 804, where the processor 801, the communication interface 802, and the memory 803 complete mutual communication through the communication bus 804,
a memory 803 for storing a computer program;
the processor 801 is configured to implement the method steps performed by the server in the above method embodiments when executing the program stored in the memory 803.
The communication bus mentioned in the above server may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the server and other devices.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any one of the above methods for displaying live subtitles.
In another embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any one of the above-described methods for displaying live subtitles.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for displaying live subtitles, the method being applied to a server and comprising:
receiving a live broadcast task instruction sent by a live broadcast management system;
receiving and caching audio data and video frame data sent by the live broadcast management system based on the live broadcast task instruction, and performing voice recognition on the audio data to obtain text data corresponding to the audio data;
according to the time stamp of the text data and the time stamp of the video frame data, the text data is superposed to the video frame data with the time stamp same as that of the text data, and the video frame data with the caption information is obtained;
and outputting the media stream consisting of the video frame data carrying the subtitle information and the audio data.
2. The method of claim 1, wherein the receiving and caching audio data and video data sent by the live broadcast management system based on the live broadcast task instruction, and performing voice recognition on the audio data to obtain text data corresponding to the audio data comprises:
analyzing the live broadcast task instruction;
and if the live broadcast task instruction carries a voice recognition parameter, receiving and caching audio data and video data sent by the live broadcast management system, and simultaneously carrying out voice recognition on the audio data to obtain text data corresponding to the audio data.
3. The method according to claim 1 or 2, wherein the performing speech recognition on the audio data to obtain text data corresponding to the audio data comprises:
resampling the audio data to obtain Pulse Code Modulation (PCM) data with a specified sampling rate, and adding a time stamp to the PCM data based on the time stamp of the audio data;
assembling the PCM data into PCM packets of a specified size;
performing voice recognition on PCM data in a PCM packet at intervals of preset time to obtain text data corresponding to the PCM data in the PCM packet;
and caching text data corresponding to the PCM data in the PCM packet to a text processing queue.
4. The method of claim 3, wherein the superimposing, according to the timestamp of the text data and the timestamp of the video frame data, the text data onto the video frame data having the same timestamp as the text data to obtain the video frame data carrying the caption information comprises:
and when the video frame data cache time-out exists, acquiring text data with the same time stamp as the video frame data from the text processing queue, and overlapping the acquired text data serving as subtitle information to the video frame data to obtain the video frame data with the subtitle information.
5. The method according to claim 4, wherein after superimposing the acquired text data as subtitle information on the video frame data to obtain video frame data carrying subtitle information, the method further comprises:
and carrying out audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data carrying the subtitle information.
6. The method of claim 5, wherein the performing audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data carrying the subtitle information comprises:
inputting the video frame data carrying the caption information and the audio data with the same timestamp as the video frame data into a cache region, and performing audio and video synchronization processing on the video frame data carrying the caption information and the audio data with the same timestamp as the video frame data carrying the caption information in the cache region;
the outputting the media stream composed of the video frame data carrying the caption information and the audio data comprises:
and when the buffering duration, in the cache region, of the video frame data carrying the subtitle information and the audio data subjected to audio and video synchronization processing reaches a specified duration, outputting a media stream consisting of the video frame data carrying the subtitle information and the audio data.
7. An apparatus for displaying live subtitles, the apparatus being applied to a server, the apparatus comprising:
the receiving module is used for receiving a live broadcast task instruction sent by a live broadcast management system; receiving and caching audio data and video frame data sent by the live broadcast management system based on the live broadcast task instruction;
the voice recognition module is used for carrying out voice recognition on the audio data to obtain text data corresponding to the audio data;
the superposition module is used for superposing the text data to the video frame data with the same timestamp as the text data according to the timestamp of the text data and the timestamp of the video frame data to obtain the video frame data with the subtitle information;
and the output module is used for outputting the media stream consisting of the video frame data carrying the subtitle information and the audio data.
8. The apparatus of claim 7,
the receiving module is specifically used for analyzing the live broadcast task instruction; if the live broadcast task instruction carries a voice recognition parameter, receiving and caching audio data and video data sent by the live broadcast management system;
the voice recognition module is configured to perform voice recognition on the audio data while the receiving module caches the audio data and the video data, so as to obtain text data corresponding to the audio data.
9. A server is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1-6 when executing a program stored in the memory.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 6.
CN201911364479.6A 2019-12-26 2019-12-26 Method, device, server and medium for displaying live caption Pending CN111010614A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911364479.6A CN111010614A (en) 2019-12-26 2019-12-26 Method, device, server and medium for displaying live caption


Publications (1)

Publication Number Publication Date
CN111010614A true CN111010614A (en) 2020-04-14

Family

ID=70118680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911364479.6A Pending CN111010614A (en) 2019-12-26 2019-12-26 Method, device, server and medium for displaying live caption

Country Status (1)

Country Link
CN (1) CN111010614A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014097048A1 (en) * 2012-12-18 2014-06-26 Sony Mobile Communications Ab System and method for generating a second screen experience using video subtitle data
US20190199964A1 (en) * 2016-12-22 2019-06-27 T-Mobile Usa, Inc. Systems and methods for improved video call handling
CN107222792A (en) * 2017-07-11 2017-09-29 成都德芯数字科技股份有限公司 A kind of caption superposition method and device
CN108184135A (en) * 2017-12-28 2018-06-19 泰康保险集团股份有限公司 Method for generating captions and device, storage medium and electric terminal
CN108401192A (en) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 Video stream processing method, device, computer equipment and storage medium
CN108600773A (en) * 2018-04-25 2018-09-28 腾讯科技(深圳)有限公司 Caption data method for pushing, subtitle methods of exhibiting, device, equipment and medium
CN108833810A (en) * 2018-06-21 2018-11-16 珠海金山网络游戏科技有限公司 The method and device of subtitle is generated in a kind of live streaming of three-dimensional idol in real time

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111416994A (en) * 2020-03-27 2020-07-14 上海依图网络科技有限公司 Method and device for synchronously presenting video stream and tracking information and electronic equipment
CN111479124A (en) * 2020-04-20 2020-07-31 北京捷通华声科技股份有限公司 Real-time playing method and device
CN111654658B (en) * 2020-06-17 2022-04-15 平安科技(深圳)有限公司 Audio and video call processing method and system, coder and decoder and storage device
CN111654658A (en) * 2020-06-17 2020-09-11 平安科技(深圳)有限公司 Audio and video call processing method and system, coder and decoder and storage device
CN111966839A (en) * 2020-08-17 2020-11-20 北京奇艺世纪科技有限公司 Data processing method and device, electronic equipment and computer storage medium
CN111966839B (en) * 2020-08-17 2023-07-25 北京奇艺世纪科技有限公司 Data processing method, device, electronic equipment and computer storage medium
CN112188241A (en) * 2020-10-09 2021-01-05 上海网达软件股份有限公司 Method and system for real-time subtitle generation of live stream
CN112507147A (en) * 2020-11-30 2021-03-16 广州酷狗计算机科技有限公司 Text data display method, device, equipment and storage medium
CN112954374B (en) * 2021-01-28 2023-05-23 广州虎牙科技有限公司 Video data processing method and device, electronic equipment and storage medium
CN112954374A (en) * 2021-01-28 2021-06-11 广州虎牙科技有限公司 Video data processing method and device, electronic equipment and storage medium
CN113301428A (en) * 2021-05-14 2021-08-24 上海樱帆望文化传媒有限公司 Live caption device for electric competition events
CN113573114A (en) * 2021-07-28 2021-10-29 北京奇艺世纪科技有限公司 Screen projection method and device, electronic equipment and storage medium
CN113626598A (en) * 2021-08-11 2021-11-09 平安国际智慧城市科技股份有限公司 Video text generation method, device, equipment and storage medium
CN113660501A (en) * 2021-08-11 2021-11-16 云知声(上海)智能科技有限公司 Method and device for matching subtitles
CN113852835A (en) * 2021-09-22 2021-12-28 北京百度网讯科技有限公司 Live broadcast audio processing method and device, electronic equipment and storage medium
CN114040220A (en) * 2021-11-25 2022-02-11 京东科技信息技术有限公司 Live broadcasting method and device
WO2023093322A1 (en) * 2021-11-25 2023-06-01 京东科技信息技术有限公司 Live broadcast method and device
CN114598936A (en) * 2022-02-18 2022-06-07 深圳盛显科技有限公司 Caption batch generating and managing method, system, device and storage medium
CN114598936B (en) * 2022-02-18 2023-12-01 深圳盛显科技有限公司 Subtitle batch generation and management method, system, device and storage medium
WO2024056022A1 (en) * 2022-09-14 2024-03-21 北京字跳网络技术有限公司 Subtitle processing method and device
WO2024087732A1 (en) * 2022-10-25 2024-05-02 上海哔哩哔哩科技有限公司 Livestreaming data processing method and system

Similar Documents

Publication Publication Date Title
CN111010614A (en) Method, device, server and medium for displaying live caption
US11627351B2 (en) Synchronizing playback of segmented video content across multiple video playback devices
CN110933449B (en) Method, system and device for synchronizing external data and video pictures
CN105991962B (en) Connection method, information display method, device and system
CN103200461B (en) A kind of multiple stage playback terminal synchronous playing system and player method
WO2017063399A1 (en) Video playback method and device
CN102263959B (en) Direct broadcast transfer method and system
US10863247B2 (en) Receiving device and data processing method
CN109714622B (en) Video data processing method and device and electronic equipment
CN106998485B (en) Video live broadcasting method and device
Boronat et al. HbbTV-compliant platform for hybrid media delivery and synchronization on single-and multi-device scenarios
CN111447455A (en) Live video stream playback processing method and device and computing equipment
CN112616065B (en) Screen image initiating method, device and system and readable storage medium
CN109803151A (en) Multi-medium data stream switching method, device, storage medium and electronic device
CN103517135A (en) Method, system and television capable of playing MP4-format video files continuously
WO2015154549A1 (en) Data processing method and device
CN107547517B (en) Audio and video program recording method, network equipment and computer device
CN110113298B (en) Data transmission method, device, signaling server and computer readable medium
CN112822435A (en) Security method, device and system allowing user to easily access
CN106303754A (en) A kind of audio data play method and device
US20200053427A1 (en) Reception apparatus, transmission apparatus, and data processing method
CN110545447B (en) Audio and video synchronization method and device
CN112532719B (en) Information stream pushing method, device, equipment and computer readable storage medium
CN114679593B (en) Live broadcast transcoding processing method, device and system
CN113411634A (en) Video stream operation method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200414