CN111010614A - Method, device, server and medium for displaying live caption - Google Patents

Method, device, server and medium for displaying live caption

Info

Publication number
CN111010614A
CN111010614A (Application CN201911364479.6A)
Authority
CN
China
Prior art keywords
data
video frame
audio data
frame data
live broadcast
Prior art date
Legal status
Pending
Application number
CN201911364479.6A
Other languages
Chinese (zh)
Inventor
孙鹏飞
张涛
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201911364479.6A priority Critical patent/CN111010614A/en
Publication of CN111010614A publication Critical patent/CN111010614A/en
Pending legal-status Critical Current

Classifications

    • H04N: Pictorial communication, e.g. television
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/2187: Live feed
    • H04N21/233: Processing of audio elementary streams
    • H04N21/234336: Reformatting of video signals by media transcoding, e.g. audio is converted into text
    • H04N21/4307: Synchronising the rendering of multiple content streams or additional data on devices
    • H04N21/4884: Data services, e.g. news ticker, for displaying subtitles
    • H04N21/8547: Content authoring involving timestamps for synchronizing content
    • H04N21/8586: Linking data to content by using a URL

Abstract

The embodiment of the invention provides a method, a device, a server and a medium for displaying live subtitles, relating to the technical field of information processing. The scheme of the embodiment of the application comprises the following steps: receiving a live broadcast task instruction sent by a live broadcast management system; receiving and caching, based on the live broadcast task instruction, audio data and video frame data sent by the live broadcast management system; performing voice recognition on the audio data to obtain text data corresponding to the audio data; superimposing, according to the timestamp of the text data and the timestamp of the video frame data, the text data onto the video frame data having the same timestamp as the text data, to obtain video frame data carrying subtitle information; and outputting a media stream composed of the video frame data carrying the subtitle information and the audio data. By adopting this method, subtitles can be displayed for live video.

Description

Method, device, server and medium for displaying live caption
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to a method, an apparatus, a server, and a medium for displaying live subtitles.
Background
At present, most video content has subtitles, for example Spring Festival Gala broadcasts, television dramas, and movies, and the subtitles are displayed synchronously during playback. The subtitles in such videos are produced manually offline, that is, only after the video has been recorded are the subtitles added manually, and only then can the subtitles be displayed synchronously during playback.
Live video, however, must be played while it is being recorded, so subtitles cannot be recorded manually before playback. As a result, subtitles for live video currently cannot be displayed synchronously during the live broadcast.
Disclosure of Invention
An object of the embodiments of the present invention is to provide a method, an apparatus, an electronic device, and a medium for displaying live subtitles, so as to solve the problem that subtitles cannot be displayed synchronously during live video. The specific technical solutions are as follows:
in a first aspect, an embodiment of the present application provides a method for displaying live subtitles, where the method is applied to a server, and the method includes:
receiving a live broadcast task instruction sent by a live broadcast management system;
receiving and caching audio data and video frame data sent by the live broadcast management system based on the live broadcast task instruction, and performing voice recognition on the audio data to obtain text data corresponding to the audio data;
superimposing, according to the timestamp of the text data and the timestamp of the video frame data, the text data onto the video frame data having the same timestamp as the text data, to obtain video frame data carrying subtitle information;
and outputting the media stream consisting of the video frame data carrying the subtitle information and the audio data.
In a possible implementation manner, the receiving and caching audio data and video data sent by the live broadcast management system based on the live broadcast task instruction, and performing voice recognition on the audio data to obtain text data corresponding to the audio data includes:
analyzing the live broadcast task instruction;
and if the live broadcast task instruction carries a voice recognition parameter, receiving and caching audio data and video data sent by the live broadcast management system, and simultaneously carrying out voice recognition on the audio data to obtain text data corresponding to the audio data.
In a possible implementation manner, the performing speech recognition on the audio data to obtain text data corresponding to the audio data includes:
resampling the audio data to obtain Pulse Code Modulation (PCM) data with a specified sampling rate, and adding a time stamp to the PCM data based on the time stamp of the audio data;
assembling the PCM data into PCM packets of a specified size;
performing voice recognition on PCM data in a PCM packet at intervals of preset time to obtain text data corresponding to the PCM data in the PCM packet;
and caching text data corresponding to the PCM data in the PCM packet to a text processing queue.
In a possible implementation manner, the superimposing, according to the timestamp of the text data and the timestamp of the video frame data, the text data onto the video frame data having the same timestamp as the text data to obtain video frame data carrying subtitle information includes:
and when the video frame data cache time-out exists, acquiring text data with the same time stamp as the video frame data from the text processing queue, and overlapping the acquired text data serving as subtitle information to the video frame data to obtain the video frame data with the subtitle information.
In a possible implementation manner, after the obtained text data is superimposed to the video frame data as subtitle information to obtain video frame data carrying subtitle information, the method further includes:
and carrying out audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data carrying the subtitle information.
In a possible implementation manner, the performing audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data carrying the subtitle information includes:
inputting the video frame data carrying the caption information and the audio data with the same timestamp as the video frame data into a cache region, and performing audio and video synchronization processing on the video frame data carrying the caption information and the audio data with the same timestamp as the video frame data carrying the caption information in the cache region;
the outputting the media stream composed of the video frame data carrying the caption information and the audio data comprises:
and when the buffering duration, in the cache region, of the video frame data carrying the subtitle information and the audio data subjected to audio and video synchronization processing reaches a specified duration, outputting a media stream consisting of the video frame data carrying the subtitle information and the audio data.
In a second aspect, an embodiment of the present application provides an apparatus for displaying live subtitles, where the apparatus is applied to a server, and the apparatus includes:
the receiving module is used for receiving a live broadcast task instruction sent by a live broadcast management system; receiving and caching audio data and video frame data sent by the live broadcast management system based on the live broadcast task instruction;
the voice recognition module is used for carrying out voice recognition on the audio data to obtain text data corresponding to the audio data;
the superposition module is used for superposing the text data to the video frame data with the same timestamp as the text data according to the timestamp of the text data and the timestamp of the video frame data to obtain the video frame data with the subtitle information;
and the output module is used for outputting the media stream consisting of the video frame data carrying the subtitle information and the audio data.
In a possible implementation manner, the receiving module is specifically configured to parse the live task instruction; if the live broadcast task instruction carries a voice recognition parameter, receiving and caching audio data and video data sent by the live broadcast management system;
the voice recognition module is configured to perform voice recognition on the audio data while the receiving module caches the audio data and the video data, so as to obtain text data corresponding to the audio data.
In a possible implementation manner, the speech recognition module is specifically configured to:
resampling the audio data to obtain Pulse Code Modulation (PCM) data with a specified sampling rate, and adding a time stamp to the PCM data based on the time stamp of the audio data;
assembling the PCM data into PCM packets of a specified size;
performing voice recognition on PCM data in a PCM packet at intervals of preset time to obtain text data corresponding to the PCM data in the PCM packet;
and caching text data corresponding to the PCM data in the PCM packet to a text processing queue.
In a possible implementation manner, the superimposing module is specifically configured to, when there is a video frame data cache timeout, obtain text data with a timestamp that is the same as that of the video frame data from the text processing queue, and superimpose the obtained text data as subtitle information on the video frame data to obtain video frame data with subtitle information.
In one possible implementation, the apparatus further includes:
and the synchronization module is used for carrying out audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data carrying the subtitle information.
In a possible implementation manner, the synchronization module is specifically configured to input the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data into a buffer area, and perform audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data carrying the subtitle information in the buffer area;
the output module is specifically configured to output a media stream composed of the video frame data carrying the subtitle information and the audio data when the buffering duration of the video frame data carrying the subtitle information and the audio data in the buffer area, which are subjected to audio and video synchronization processing, reaches a specified duration.
In a third aspect, an embodiment of the present application provides a server, which includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
a processor configured to implement the method steps of the first aspect when executing the program stored in the memory.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements the method steps in the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
By adopting the method for displaying the live caption, after a live task instruction sent by a live management system is received, audio data and video frame data sent by the live management system can be received based on the live task instruction, voice recognition is carried out on the audio data, text data corresponding to the audio data is obtained, then the text data is superposed on the video frame data with the same timestamp as the text data according to the timestamp of the text data and the timestamp of the video frame data, video frame data carrying caption information is obtained, and a media stream consisting of the video frame data carrying the caption information and the audio data is output. Because in the live broadcast process, voice recognition can be carried out on the audio data in real time to obtain the text data corresponding to the audio data, and the text data is already superposed into the video frame data before the media stream is output, the application can be adopted to realize synchronous display of the subtitles in the live broadcast process.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a method for displaying live subtitles according to an embodiment of the present application;
fig. 2 is a flowchart of another method for displaying live subtitles according to an embodiment of the present application;
fig. 3 is a flowchart of another method for displaying live subtitles according to an embodiment of the present application;
fig. 4 is a flowchart of another method for displaying live subtitles according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a live broadcast system according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an apparatus for displaying live subtitles according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of another apparatus for displaying live subtitles according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to display subtitles in a live broadcast process, an embodiment of the present application provides a method for displaying live subtitles, where as shown in fig. 1, the method is applied to a server, and includes:
s101, receiving a live broadcast task instruction sent by a live broadcast management system.
S102, receiving and caching audio data and video frame data sent by a live broadcast management system based on a live broadcast task instruction, and carrying out voice recognition on the audio data to obtain text data corresponding to the audio data.
And S103, superimposing, according to the timestamp of the text data and the timestamp of the video frame data, the text data onto the video frame data having the same timestamp as the text data, to obtain video frame data carrying the caption information.
And S104, outputting the media stream consisting of the video frame data carrying the subtitle information and the audio data.
By adopting the method for displaying the live caption, after a live task instruction sent by a live management system is received, audio data and video frame data sent by the live management system can be received based on the live task instruction, voice recognition is carried out on the audio data, text data corresponding to the audio data is obtained, then the text data is superposed on the video frame data with the same timestamp as the text data according to the timestamp of the text data and the timestamp of the video frame data, video frame data carrying caption information is obtained, and a media stream consisting of the video frame data carrying the caption information and the audio data is output. Because in the live broadcast process, voice recognition can be carried out on the audio data in real time to obtain the text data corresponding to the audio data, and the text data is already superposed into the video frame data before the media stream is output, the application can be adopted to realize synchronous display of the subtitles in the live broadcast process.
In S101, the live broadcast management system may send a live broadcast task instruction to one of the servers based on the load of each server. The live broadcast task instruction carries a live broadcast channel ID and live broadcast related parameters, wherein the live broadcast related parameters comprise a code rate, a frame rate, a coding type, a live broadcast type and the like of a live broadcast video.
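By way of illustration only (the class and field names below are assumptions of this description, not part of the patent disclosure), a live broadcast task instruction carrying these parameters might be modeled as a small Python structure:

    # Hypothetical model of a live broadcast task instruction; all names are illustrative.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LiveTaskInstruction:
        channel_id: str                      # live broadcast channel ID
        bitrate_kbps: int                    # code rate of the live video
        frame_rate: int                      # frame rate
        codec: str                           # encoding type, e.g. "h264"
        live_type: str                       # live broadcast type
        asr_language: Optional[str] = None   # voice recognition parameter; None means no subtitles

    instruction = LiveTaskInstruction(channel_id="channel-001", bitrate_kbps=2500,
                                      frame_rate=25, codec="h264", live_type="event",
                                      asr_language="zh")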
In the above S102, the live transcoding service process of the server may receive and cache the audio data and the video data sent by the live management system, and transmit the audio data to the voice recognition service process, and the voice recognition service process performs voice recognition on the audio data, so as to convert the audio data into text data.
In S103, the live transcoding service process may obtain text data identified by the speech recognition service process, and superimpose the text data on video frame data with the same timestamp as the text data.
In S104, the server may output the media stream to a Content Delivery Network (CDN) server, so that the CDN server provides live video with subtitles for each playback terminal.
Specifically, the server may perform transcoding service on live video frame data according to information such as a code rate, a frame rate, an encoding type, and a live broadcast type in the live broadcast related parameters, and then output the media stream to the corresponding CDN server according to the live broadcast channel ID.
In an implementation manner of the embodiment of the application, as shown in fig. 2, the step S102 of receiving, based on a live task instruction, audio data and video frame data sent by a live broadcast management system, and performing voice recognition on the audio data to obtain text data corresponding to the audio data specifically includes the following steps:
and S1021, analyzing the live broadcast task instruction.
And S1022, if the live broadcast task instruction carries the voice recognition parameter, receiving and caching the audio data and the video data sent by the live broadcast management system, and simultaneously performing voice recognition on the audio data to obtain text data corresponding to the audio data.
If the live broadcast task instruction carries an Automatic Speech Recognition (ASR) parameter, it indicates that a caption needs to be displayed in the live broadcast process, and then the server can simultaneously start a Speech Recognition service process and a live broadcast transcoding service process, and create a cache directory and a cache file for the live broadcast task instruction.
If the live broadcast task instruction does not carry the ASR parameter, voice recognition on the audio data is not needed, and the live broadcast transcoding service process of the server can directly output the received audio and video data after transcoding the received audio and video data.
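As a minimal sketch of the branch described in S1021 and S1022 (the directory layout, field names, and returned structure are assumptions for illustration; the actual transcoding and speech recognition processes are not started here), the decision could look like this in Python:

    # Parse the task instruction and prepare the subtitle pipeline only when an ASR
    # parameter is present; otherwise the stream would be transcoded and output directly.
    import os
    import tempfile

    def start_live_task(instruction: dict) -> dict:
        has_asr = bool(instruction.get("asr_language"))          # S1021: parse the instruction
        cache_dir = None
        if has_asr:                                               # S1022: subtitles are required
            cache_dir = tempfile.mkdtemp(prefix="live_" + instruction["channel_id"] + "_")
            os.makedirs(os.path.join(cache_dir, "av_cache"), exist_ok=True)
            # In the described system the live transcoding service process and the speech
            # recognition service process would both be started here; this sketch only
            # records the decision and creates the cache directory.
        return {"speech_recognition_enabled": has_asr, "cache_dir": cache_dir}

    print(start_live_task({"channel_id": "channel-001", "asr_language": "zh"}))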
In another implementation manner of the embodiment of the application, as shown in fig. 3, the S1022 stated above, performing voice recognition on the audio data to obtain text data corresponding to the audio data specifically includes the following steps:
s301, resampling the audio data to obtain Pulse Code Modulation (PCM) data with a specified sampling rate, and adding a time stamp to the PCM data based on the time stamp of the audio data.
Alternatively, the specified sampling rate may be a 16K sampling rate. The voice recognition system can recognize PCM data with a specified sampling rate, so that the live transcoding service process can convert the audio data into the PCM data with the specified sampling rate, stamp the PCM data, and push the PCM data to a receiving port of the voice recognition service process through a message queue according to the frequency of the input audio data.
S302, the PCM data are combined into PCM packets with the specified size.
To facilitate the processing of the speech recognition service process, the speech recognition service process may combine the received PCM data into PCM packets of a specified size, and one PCM packet may include a plurality of pieces of PCM data.
S303, performing voice recognition on the PCM data in one PCM packet at intervals of preset time to obtain text data corresponding to the PCM data in the PCM packet.
The speech recognition process may process one PCM packet at preset time intervals, and for the PCM data in the PCM packet, the PCM data are processed in the order from the early to the late of the time stamp of each PCM data.
S304, caching text data corresponding to the PCM data in the PCM packet to a text processing queue.
It is understood that, during the processing in the order from the early to the late according to the time stamp of each PCM data, the data obtained after the speech recognition may be buffered in the text processing queue in order.
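The following self-contained Python sketch illustrates S301 to S304 under stated assumptions: 16 kHz 16-bit mono PCM as the specified format, one second of audio per PCM packet, and a placeholder standing in for the real speech recognition service (audioop is the standard-library resampler, deprecated in recent Python versions):

    # Illustrative sketch of S301-S304; the data layout and the fake recognizer are assumptions.
    import audioop
    from collections import deque

    TARGET_RATE = 16000        # specified sampling rate (S301)
    PACKET_SAMPLES = 16000     # specified PCM packet size: here one second of audio (S302)

    text_queue = deque()       # text processing queue (S304)

    def resample(pcm_bytes, src_rate, timestamp):
        """S301: convert 16-bit mono PCM to the specified sampling rate, keep the timestamp."""
        converted, _ = audioop.ratecv(pcm_bytes, 2, 1, src_rate, TARGET_RATE, None)
        return {"pcm": converted, "ts": timestamp}

    def assemble_packets(chunks, packet_samples=PACKET_SAMPLES):
        """S302: group timestamped PCM chunks into packets of a specified size."""
        packet, count = [], 0
        for chunk in chunks:
            packet.append(chunk)
            count += len(chunk["pcm"]) // 2        # 2 bytes per 16-bit sample
            if count >= packet_samples:
                yield packet
                packet, count = [], 0
        if packet:
            yield packet

    def recognize_packet(packet):
        """S303: stand-in for speech recognition over the PCM data of one packet."""
        return [{"ts": chunk["ts"], "text": "<recognized text>"} for chunk in packet]

    # Two dummy one-second chunks of silent 48 kHz audio run through S301-S304.
    raw = [{"pcm": b"\x00\x00" * 48000, "rate": 48000, "ts": t} for t in (0.0, 1.0)]
    resampled = [resample(c["pcm"], c["rate"], c["ts"]) for c in raw]
    for packet in assemble_packets(resampled):
        text_queue.extend(recognize_packet(packet))  # S304: cached in timestamp order

    print(len(text_queue))   # 2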
Based on the method flow shown in fig. 3, as shown in fig. 4, in step S103, according to the time stamp of the text data and the time stamp of the video frame data, the text data is superimposed on the video frame data with the same time stamp as the text data, so as to obtain the video frame data carrying the subtitle information, and the method specifically may be implemented as follows:
and S1031, when the video frame data cache time-out exists, acquiring text data with the same time stamp as the video frame data from the text processing queue, and overlapping the acquired text data serving as subtitle information to the video frame data to obtain the video frame data with the subtitle information.
As an example, the buffering timeout duration of the video data may be 0.5 seconds, and may be set to other values, which is not limited in this embodiment of the application.
The embodiment of the application sets this cache timeout so that the voice recognition service process has enough time to perform voice recognition on the audio data before the video frame data reaches the timeout. The voice recognition is completed within the period during which the audio data and the video frame data are cached, so when the video frame data cache times out, the text data obtained by voice recognition is already available and can be used as the subtitle information of the video frame data.
By adopting the method, time is provided for voice recognition through short-time caching of the audio data and the video frame data, so that the text information obtained after the voice recognition can be synchronously played with the video frame data and the audio data.
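A minimal sketch of S1031, assuming a 0.5-second cache timeout, simple dictionaries for frames and recognized text, and an exact-match timestamp lookup (all of these are illustrative assumptions):

    # When a cached video frame reaches the timeout, take the text with the same
    # timestamp from the text processing queue and attach it as subtitle information.
    import time
    from collections import deque

    CACHE_TIMEOUT = 0.5   # example buffering timeout for video frame data, in seconds

    def attach_subtitles(video_cache, text_queue, now=None):
        now = time.monotonic() if now is None else now
        ready = []
        while video_cache and now - video_cache[0]["cached_at"] >= CACHE_TIMEOUT:
            frame = video_cache.popleft()
            text = next((t["text"] for t in text_queue if t["ts"] == frame["ts"]), None)
            if text is not None:
                frame["subtitle"] = text   # the frame now carries subtitle information
            ready.append(frame)
        return ready

    video_cache = deque([{"ts": 1.0, "cached_at": 0.0, "subtitle": None}])
    text_queue = deque([{"ts": 1.0, "text": "recognized caption"}])
    print(attach_subtitles(video_cache, text_queue, now=0.6))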
In another implementation manner of this embodiment, in step S1031, after the obtained text data is superimposed to the video frame data as the subtitle information to obtain the video frame data carrying the subtitle information, the method further includes:
and carrying out audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same time stamp as the video frame data carrying the subtitle information.
The step can be specifically realized as follows: and inputting video frame data carrying subtitle information and audio data with the same timestamp as the video frame data into a buffer area, and performing audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data carrying the subtitle information in the buffer area.
Optionally, the buffering duration of the video frame data and the audio data carrying the subtitle information in the buffer area can be adjusted in real time, that is, the video frame data and the audio data can be output from the buffer area after the audio and video synchronization processing is completed.
Alternatively, the buffering duration of the video frame data and audio data carrying the subtitle information in the buffer area may be set to a fixed value according to actual requirements, for example 0.5 second or 1 second; other values may also be used, which is not limited in this embodiment of the application. A longer buffering time gives the voice recognition service process a higher fault tolerance, but it also increases the live broadcast delay and degrades the user experience, so the buffering time preferably does not exceed 3 seconds.
Correspondingly, when the buffering time of the video frame data and the audio data which are subjected to the audio and video synchronization processing in the buffer area reaches the designated time, the media stream composed of the video frame data and the audio frame data is output.
By adopting this method, the server can complete audio and video synchronization within the buffering duration, so that the live video plays video frames, audio, and subtitles synchronously. If a voice recognition result is not delivered to the live transcoding service process in time, for example because of network jitter, the server still has buffer time to wait for it. Through this delayed playing mechanism, the problem of audio, video frames, and subtitles falling out of synchronization due to late delivery of voice recognition results can be avoided; the delay is short, for example 1 second, so the user is essentially unaware of it, and the live video playing effect is improved.
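A sketch of the delayed-output buffer described above, assuming a 1-second buffer duration and that matching video frame data and audio data arrive already paired by timestamp (both are assumptions for illustration):

    # Hold subtitle-carrying video frames and same-timestamp audio in a buffer region
    # and emit them only once the buffered span reaches the specified duration.
    BUFFER_DURATION = 1.0   # example value; the description suggests staying under 3 seconds

    class SyncBuffer:
        def __init__(self, duration=BUFFER_DURATION):
            self.duration = duration
            self.pairs = []   # list of (timestamp, video_frame, audio_chunk)

        def push(self, video_frame, audio_chunk):
            assert video_frame["ts"] == audio_chunk["ts"], "a pair must share one timestamp"
            self.pairs.append((video_frame["ts"], video_frame, audio_chunk))
            self.pairs.sort(key=lambda p: p[0])   # keep presentation order for A/V sync

        def pop_ready(self):
            """Return all buffered pairs once the buffered span covers the specified duration."""
            if not self.pairs or self.pairs[-1][0] - self.pairs[0][0] < self.duration:
                return []
            ready, self.pairs = self.pairs, []
            return [(video, audio) for _, video, audio in ready]

    buf = SyncBuffer()
    for ts in (0.0, 0.5, 1.0):
        buf.push({"ts": ts, "subtitle": "..."}, {"ts": ts, "pcm": b""})
    print(len(buf.pop_ready()))   # 3: the buffered span has reached 1.0 second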
As shown in fig. 5, the method provided in the embodiment of the present application may be applied to the live broadcast system shown in fig. 5, where the system includes a live broadcast management system, a plurality of service nodes connected to a background control system, and a database.
The live broadcast management system comprises a command analysis module, a data storage module and a scheduling and distributing module.
And the live broadcast management system is used for providing a background display interface.
In one implementation, an administrator can trigger a live broadcast management system to start a method for displaying live broadcast subtitles by triggering a live broadcast real-time subtitle button in a background management system.
Alternatively, in another implementation, the live management system may receive a Uniform Resource Locator (URL) request, such as http://xxxx/start?asr=zh, and then parse the URL request, thereby initiating the method for displaying live subtitles.
Specifically, the command parsing module is configured to parse the command triggered by the button or parse the URL request, generate an encoding service control command, and send the encoding service control command to the scheduling and distributing module.
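By way of illustration, such a URL request could be parsed with the Python standard library; the resulting command fields are assumptions, not the system's actual command format:

    # Turn a start-live URL such as http://xxxx/start?asr=zh into a simple
    # encoding service control command (the field names are illustrative).
    from urllib.parse import urlparse, parse_qs

    def parse_start_request(url):
        query = parse_qs(urlparse(url).query)
        return {"action": "start_encoding", "asr_language": query.get("asr", [None])[0]}

    print(parse_start_request("http://xxxx/start?asr=zh"))
    # {'action': 'start_encoding', 'asr_language': 'zh'}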
And the scheduling and distributing module is used for selecting a proper service node according to the resource condition of each service node after receiving the coded service control command, and then sending a live broadcast task instruction to the selected service node through the message queue.
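A small sketch of the scheduling decision, assuming each service node reports a load figure between 0 and 1 and that an in-memory queue stands in for the real message queue (both assumptions for illustration):

    # Pick the service node with the lowest load and hand it the live task instruction.
    import queue

    def dispatch_task(node_loads, instruction):
        target = min(node_loads, key=node_loads.get)   # least-loaded service node
        message_queue = queue.Queue()
        message_queue.put((target, instruction))       # stand-in for the real message queue
        return target, message_queue

    node, mq = dispatch_task({"node-a": 0.7, "node-b": 0.2}, {"channel_id": "channel-001"})
    print(node)   # node-b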
And the scheduling and distributing module is also used for storing the channel ID corresponding to the live broadcast task and the related parameters of the live broadcast video into a database through the data storage module.
And the data storage module is also used for storing the subsequently received audio data and video data to a database.
The service node, namely the server in the method flow, is configured to receive a live broadcast task instruction sent by the live broadcast management system, and execute the live broadcast task instruction according to the method flow in the method embodiment.
The service nodes in fig. 5 are deployed in a distributed manner, and each service node supports automatic invocation. After the live broadcast management system automatically discovers a service node, it can establish a connection with that node, and all service nodes are scheduled uniformly by the live broadcast management system.
The service node comprises a live broadcast encoding module, a live broadcast transcoding service module and a voice recognition service module. The configuration of each service node is the same in fig. 5, and fig. 5 exemplarily shows the structure of one service node.
The live broadcast coding module is used for receiving a live broadcast task instruction sent by a live broadcast management system, analyzing the live broadcast task instruction, starting a live broadcast transcoding service process of the live broadcast transcoding service module and a voice recognition service process of the voice recognition service module if the live broadcast task instruction carries ASR parameters, and creating a cache directory and a cache file used by the live broadcast transcoding service process and the voice recognition service process.
And the live broadcast transcoding service module is used for receiving the audio data and the video data sent by the live broadcast management system, caching the audio data and the video data in a single-thread mode, resampling the audio data, stamping a timestamp on the audio data, and pushing the audio data to a receiving port of the voice recognition module through a message queue according to audio input frequency.
After receiving the audio data, the voice recognition service module assembles the audio data into PCM packets of a specified size, runs the voice recognition algorithm at a specified time interval to convert the audio data into text data, and then stores the text data in the text processing queue in order.
When the video frame data cache time-out exists in the live transcoding service module, text data with the same time stamp as the video frame data is obtained from the text processing queue, subtitle data is added to the video frame data according to the text data, and after audio and video synchronization processing, media streams are output.
In the embodiment of the application, the live transcoding service module caches the audio data and the video data by adopting a single-thread cache mechanism, so that compared with a multi-thread model, the overhead is reduced, the development efficiency is improved, and the development difficulty and risk are reduced. In addition, the embodiment of the application adopts a plug-in deployment mode to realize the function of adding subtitles to the video, and has low coupling with the existing system, convenient subsequent upgrading and deployment and strong stability.
Based on the same technical concept, an embodiment of the present application further provides an apparatus for displaying live subtitles, as shown in fig. 6, where the apparatus is applied to a server, and the apparatus includes: a receiving module 601, a speech recognition module 602, an overlay module 603, and an output module 604.
A receiving module 601, configured to receive a live broadcast task instruction sent by a live broadcast management system; receiving and caching audio data and video frame data sent by a live broadcast management system based on a live broadcast task instruction;
the voice recognition module 602 is configured to perform voice recognition on the audio data to obtain text data corresponding to the audio data;
the superimposing module 603 is configured to superimpose the text data onto the video frame data with the same timestamp as the text data according to the timestamp of the text data and the timestamp of the video frame data, so as to obtain video frame data carrying subtitle information;
the output module 604 is configured to output a media stream composed of video frame data and audio data carrying subtitle information.
Optionally, the receiving module 601 is specifically configured to parse a live task instruction; if the live broadcast task instruction carries the voice recognition parameter, receiving and caching audio data and video data sent by the live broadcast management system;
the voice recognition module 602 is configured to perform voice recognition on the audio data while the receiving module caches the audio data and the video data, so as to obtain text data corresponding to the audio data.
Optionally, the speech recognition module 602 is specifically configured to:
resampling the audio data to obtain Pulse Code Modulation (PCM) data with a specified sampling rate, and adding a time stamp to the PCM data based on the time stamp of the audio data;
assembling the PCM data into PCM packets of a specified size;
performing voice recognition on PCM data in one PCM packet at intervals of preset time to obtain text data corresponding to the PCM data in the PCM packet;
and caching text data corresponding to the PCM data in the PCM packet to a text processing queue.
Optionally, the superimposing module 603 is specifically configured to, when there is a cache timeout of video frame data, obtain text data with the same timestamp as the video frame data from the text processing queue, and superimpose the obtained text data as subtitle information onto the video frame data to obtain video frame data with subtitle information.
Optionally, as shown in fig. 7, the apparatus further includes: a synchronization module 701.
And the synchronization module 701 is configured to perform audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data carrying the subtitle information.
Optionally, the synchronization module 701 is specifically configured to input video frame data carrying subtitle information and audio data having the same timestamp as the video frame data into a buffer area, and perform audio and video synchronization processing on the video frame data carrying subtitle information and the audio data having the same timestamp as the video frame data carrying subtitle information in the buffer area;
the output module 604 is specifically configured to output a media stream composed of video frame data carrying subtitle information and audio data when the buffering duration of the video frame data carrying subtitle information and the audio data which are subjected to audio and video synchronization processing in the buffer area reaches a specified duration.
By adopting the device for displaying the live caption, after receiving the live task instruction sent by the live management system, the audio data and the video frame data sent by the live management system can be received based on the live task instruction, voice recognition is carried out on the audio data, text data corresponding to the audio data is obtained, and then according to the timestamp of the text data and the timestamp of the video frame data, the text data is superposed on the video frame data with the same timestamp of the text data, video frame data carrying caption information is obtained, and then a media stream consisting of the video frame data carrying the caption information and the audio data is output. Because in the live broadcast process, voice recognition can be carried out on the audio data in real time to obtain the text data corresponding to the audio data, and the text data is already superposed into the video frame data before the media stream is output, the application can be adopted to realize synchronous display of the subtitles in the live broadcast process.
The embodiment of the present invention further provides a server, as shown in fig. 8, including a processor 801, a communication interface 802, a memory 803, and a communication bus 804, where the processor 801, the communication interface 802, and the memory 803 complete mutual communication through the communication bus 804,
a memory 803 for storing a computer program;
the processor 801 is configured to implement the method steps performed by the server in the above method embodiments when executing the program stored in the memory 803.
The communication bus mentioned in the above server may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the server and other devices.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any one of the above methods for displaying live subtitles.
In another embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any one of the above-described methods for displaying live subtitles.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for displaying live subtitles, the method being applied to a server and comprising:
receiving a live broadcast task instruction sent by a live broadcast management system;
receiving and caching audio data and video frame data sent by the live broadcast management system based on the live broadcast task instruction, and performing voice recognition on the audio data to obtain text data corresponding to the audio data;
according to the time stamp of the text data and the time stamp of the video frame data, the text data is superposed to the video frame data with the time stamp same as that of the text data, and the video frame data with the caption information is obtained;
and outputting the media stream consisting of the video frame data carrying the subtitle information and the audio data.
2. The method of claim 1, wherein the receiving and caching audio data and video data sent by the live broadcast management system based on the live broadcast task instruction, and performing voice recognition on the audio data to obtain text data corresponding to the audio data comprises:
analyzing the live broadcast task instruction;
and if the live broadcast task instruction carries a voice recognition parameter, receiving and caching audio data and video data sent by the live broadcast management system, and simultaneously carrying out voice recognition on the audio data to obtain text data corresponding to the audio data.
3. The method according to claim 1 or 2, wherein the performing speech recognition on the audio data to obtain text data corresponding to the audio data comprises:
resampling the audio data to obtain Pulse Code Modulation (PCM) data with a specified sampling rate, and adding a time stamp to the PCM data based on the time stamp of the audio data;
assembling the PCM data into PCM packets of a specified size;
performing voice recognition on PCM data in a PCM packet at intervals of preset time to obtain text data corresponding to the PCM data in the PCM packet;
and caching text data corresponding to the PCM data in the PCM packet to a text processing queue.
4. The method of claim 3, wherein the superimposing, according to the timestamp of the text data and the timestamp of the video frame data, the text data onto the video frame data having the same timestamp as the text data to obtain the video frame data carrying the caption information comprises:
and when the video frame data cache time-out exists, acquiring text data with the same time stamp as the video frame data from the text processing queue, and overlapping the acquired text data serving as subtitle information to the video frame data to obtain the video frame data with the subtitle information.
5. The method according to claim 4, wherein after superimposing the acquired text data as subtitle information on the video frame data to obtain video frame data carrying subtitle information, the method further comprises:
and carrying out audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data carrying the subtitle information.
6. The method of claim 5, wherein the performing audio and video synchronization processing on the video frame data carrying the subtitle information and the audio data with the same timestamp as the video frame data carrying the subtitle information comprises:
inputting the video frame data carrying the caption information and the audio data with the same timestamp as the video frame data into a cache region, and performing audio and video synchronization processing on the video frame data carrying the caption information and the audio data with the same timestamp as the video frame data carrying the caption information in the cache region;
the outputting the media stream composed of the video frame data carrying the caption information and the audio data comprises:
and when the buffering duration, in the cache region, of the video frame data carrying the subtitle information and the audio data subjected to audio and video synchronization processing reaches a specified duration, outputting a media stream consisting of the video frame data carrying the subtitle information and the audio data.
7. An apparatus for displaying live subtitles, the apparatus being applied to a server, the apparatus comprising:
the receiving module is used for receiving a live broadcast task instruction sent by a live broadcast management system; receiving and caching audio data and video frame data sent by the live broadcast management system based on the live broadcast task instruction;
the voice recognition module is used for carrying out voice recognition on the audio data to obtain text data corresponding to the audio data;
the superposition module is used for superposing the text data to the video frame data with the same timestamp as the text data according to the timestamp of the text data and the timestamp of the video frame data to obtain the video frame data with the subtitle information;
and the output module is used for outputting the media stream consisting of the video frame data carrying the subtitle information and the audio data.
8. The apparatus of claim 7,
the receiving module is specifically used for analyzing the live broadcast task instruction; if the live broadcast task instruction carries a voice recognition parameter, receiving and caching audio data and video data sent by the live broadcast management system;
the voice recognition module is configured to perform voice recognition on the audio data while the receiving module caches the audio data and the video data, so as to obtain text data corresponding to the audio data.
9. A server is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1-6 when executing a program stored in the memory.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 6.
CN201911364479.6A 2019-12-26 2019-12-26 Method, device, server and medium for displaying live caption Pending CN111010614A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911364479.6A CN111010614A (en) 2019-12-26 2019-12-26 Method, device, server and medium for displaying live caption


Publications (1)

Publication Number Publication Date
CN111010614A true CN111010614A (en) 2020-04-14

Family

ID=70118680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911364479.6A Pending CN111010614A (en) 2019-12-26 2019-12-26 Method, device, server and medium for displaying live caption

Country Status (1)

Country Link
CN (1) CN111010614A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014097048A1 (en) * 2012-12-18 2014-06-26 Sony Mobile Communications Ab System and method for generating a second screen experience using video subtitle data
US20190199964A1 (en) * 2016-12-22 2019-06-27 T-Mobile Usa, Inc. Systems and methods for improved video call handling
CN107222792A (en) * 2017-07-11 2017-09-29 成都德芯数字科技股份有限公司 A kind of caption superposition method and device
CN108184135A (en) * 2017-12-28 2018-06-19 泰康保险集团股份有限公司 Method for generating captions and device, storage medium and electric terminal
CN108401192A (en) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 Video stream processing method, device, computer equipment and storage medium
CN108600773A (en) * 2018-04-25 2018-09-28 腾讯科技(深圳)有限公司 Caption data method for pushing, subtitle methods of exhibiting, device, equipment and medium
CN108833810A (en) * 2018-06-21 2018-11-16 珠海金山网络游戏科技有限公司 The method and device of subtitle is generated in a kind of live streaming of three-dimensional idol in real time

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111416994A (en) * 2020-03-27 2020-07-14 上海依图网络科技有限公司 Method and device for synchronously presenting video stream and tracking information and electronic equipment
CN111479124A (en) * 2020-04-20 2020-07-31 北京捷通华声科技股份有限公司 Real-time playing method and device
CN111654658B (en) * 2020-06-17 2022-04-15 平安科技(深圳)有限公司 Audio and video call processing method and system, coder and decoder and storage device
CN111654658A (en) * 2020-06-17 2020-09-11 平安科技(深圳)有限公司 Audio and video call processing method and system, coder and decoder and storage device
CN111966839A (en) * 2020-08-17 2020-11-20 北京奇艺世纪科技有限公司 Data processing method and device, electronic equipment and computer storage medium
CN111966839B (en) * 2020-08-17 2023-07-25 北京奇艺世纪科技有限公司 Data processing method, device, electronic equipment and computer storage medium
CN112188241A (en) * 2020-10-09 2021-01-05 上海网达软件股份有限公司 Method and system for real-time subtitle generation of live stream
CN112507147A (en) * 2020-11-30 2021-03-16 广州酷狗计算机科技有限公司 Text data display method, device, equipment and storage medium
CN112954374B (en) * 2021-01-28 2023-05-23 广州虎牙科技有限公司 Video data processing method and device, electronic equipment and storage medium
CN112954374A (en) * 2021-01-28 2021-06-11 广州虎牙科技有限公司 Video data processing method and device, electronic equipment and storage medium
CN113301428A (en) * 2021-05-14 2021-08-24 上海樱帆望文化传媒有限公司 Live caption device for electric competition events
CN113573114A (en) * 2021-07-28 2021-10-29 北京奇艺世纪科技有限公司 Screen projection method and device, electronic equipment and storage medium
CN113626598A (en) * 2021-08-11 2021-11-09 平安国际智慧城市科技股份有限公司 Video text generation method, device, equipment and storage medium
CN113660501A (en) * 2021-08-11 2021-11-16 云知声(上海)智能科技有限公司 Method and device for matching subtitles
CN113852835A (en) * 2021-09-22 2021-12-28 北京百度网讯科技有限公司 Live broadcast audio processing method and device, electronic equipment and storage medium
CN114040220A (en) * 2021-11-25 2022-02-11 京东科技信息技术有限公司 Live broadcasting method and device
WO2023093322A1 (en) * 2021-11-25 2023-06-01 京东科技信息技术有限公司 Live broadcast method and device
CN114598936A (en) * 2022-02-18 2022-06-07 深圳盛显科技有限公司 Caption batch generating and managing method, system, device and storage medium
CN114598936B (en) * 2022-02-18 2023-12-01 深圳盛显科技有限公司 Subtitle batch generation and management method, system, device and storage medium
WO2024056022A1 (en) * 2022-09-14 2024-03-21 北京字跳网络技术有限公司 Subtitle processing method and device
WO2024087732A1 (en) * 2022-10-25 2024-05-02 上海哔哩哔哩科技有限公司 Livestreaming data processing method and system

Similar Documents

Publication Publication Date Title
CN111010614A (en) Method, device, server and medium for displaying live caption
US11627351B2 (en) Synchronizing playback of segmented video content across multiple video playback devices
CN110933449B (en) Method, system and device for synchronizing external data and video pictures
CN105991962B (en) Connection method, information display method, device and system
CN103200461B (en) A kind of multiple stage playback terminal synchronous playing system and player method
WO2017063399A1 (en) Video playback method and device
CN102263959B (en) Direct broadcast transfer method and system
US10863247B2 (en) Receiving device and data processing method
CN109714622B (en) Video data processing method and device and electronic equipment
CN106998485B (en) Video live broadcasting method and device
Boronat et al. HbbTV-compliant platform for hybrid media delivery and synchronization on single-and multi-device scenarios
CN111447455A (en) Live video stream playback processing method and device and computing equipment
CN112616065B (en) Screen image initiating method, device and system and readable storage medium
CN109803151A (en) Multi-medium data stream switching method, device, storage medium and electronic device
CN103517135A (en) Method, system and television capable of playing MP4-format video files continuously
WO2015154549A1 (en) Data processing method and device
CN107547517B (en) Audio and video program recording method, network equipment and computer device
CN110113298B (en) Data transmission method, device, signaling server and computer readable medium
CN112822435A (en) Security method, device and system allowing user to easily access
CN106303754A (en) A kind of audio data play method and device
US20200053427A1 (en) Reception apparatus, transmission apparatus, and data processing method
CN110545447B (en) Audio and video synchronization method and device
CN112532719B (en) Information stream pushing method, device, equipment and computer readable storage medium
CN114679593B (en) Live broadcast transcoding processing method, device and system
CN113411634A (en) Video stream operation method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200414