CN113613025A

CN113613025A - Method and device for real-time voice conversion subtitle data synchronous processing and picture synthesis live broadcasting

Info

Publication number: CN113613025A
Application number: CN202010369348.3A
Authority: CN
Inventors: 高爱平
Original assignee: Anhui Wenhui Technology Co ltd
Current assignee: Anhui Wenhui Technology Co ltd
Priority date: 2020-05-05
Filing date: 2020-05-05
Publication date: 2021-11-05

Abstract

The invention provides a method for real-time voice-to-subtitle conversion, synchronous subtitle data processing, and picture composition for live broadcasting. It belongs to the technical field of image processing and solves the problem of outputting sound, picture, and subtitles in synchronization. The method comprises the following steps: a first sound pickup collects the real-time on-site sound and transmits it to a first camera and a data processing host; the real-time image data collected by the first camera and the sound transmitted by the pickup are synchronously synthesized into live streaming media data, which are transmitted to the data processing host; the data processing host separates the sound and the video of the acquired streaming media data, and recognizes and converts the real-time sound data collected on site by the first sound pickup into text subtitles in real time; the data processing host directly and synchronously superimposes and composites the text subtitles onto the live pictures, and the synchronized live pictures containing the text subtitles are output via HDMI/VGA/SDI signals, RTMP streaming media, and the like.

Description

Method and device for real-time voice conversion subtitle data synchronous processing and picture synthesis live broadcasting

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a method and a device for real-time voice-to-subtitle conversion, synchronous subtitle data processing, and picture composition for live broadcasting.

Background

With the popularity of live-broadcast formats such as live news programs, online live streaming, and live open classes for high-quality courses, the speech of the on-site lecturer needs to be converted in real time into displayable subtitles and superimposed on the real-time picture, so that deaf and hard-of-hearing viewers, foreign viewers, and other participants can follow the lecturer's content more directly.

To achieve this effect, the sound and the picture must be transmitted in real time from the sound pickup and the camera to a data processing host, which synchronizes and converts them, then superimposes the text subtitles on the synchronized picture and outputs the final synchronized live video containing the text subtitles.

Disclosure of Invention

The invention provides a method and a device for real-time voice-to-subtitle conversion, synchronous subtitle data processing, and picture composition for live broadcasting, in order to solve the above problems.

The invention provides a method for real-time voice-to-subtitle conversion, synchronous subtitle data processing, and picture composition for live broadcasting. The method comprises the following steps: a first sound pickup collects the real-time on-site sound and transmits it to a first camera and a data processing host, and the first camera collects real-time on-site image data together with the sound transmitted by the pickup; the sound and the picture are synchronously processed and synthesized into live streaming media data by video encoding, and the generated data are transmitted to the data processing host over a network or over physical cables such as HDMI/SDI/VGA; after receiving the data, the data processing host separates the sound and the video of the streaming media data by the corresponding decoding, feeds the real-time sound data collected on site by the first sound pickup to a speech recognition engine in real time, and recognizes and converts it into text subtitles in various styles for on-screen display; the data processing host encodes and compares the sound separated from the streaming media data with the on-site real-time sound collected by the first sound pickup to synchronize them, and uses the resulting timestamps to directly encode and synchronize the video data with the preprocessed text subtitle information; the data processing host directly outputs the picture synchronized with the text subtitles, the preprocessed text subtitles are adjusted and superimposed on this output picture to composite a live picture that contains the text subtitles in synchronization, and the composited picture is re-encoded into streaming media data and output as HDMI/VGA/SDI signals, RTMP streaming media, and the like.
The invention also provides a device for real-time voice-to-subtitle conversion, synchronous subtitle data processing, and picture composition for live broadcasting. The device comprises a sound pickup, a camera, and a data processing host device, where both the sound pickup and the camera are connected to the data processing host device. The sound pickup collects real-time on-site sound; the camera collects real-time on-site images; and the data processing host device performs decoding, comparison and synchronization, composition, encoding, and output processing on the data transmitted by the sound pickup and the camera. When the on-site pickup collects the real-time sound and transmits it both to the data processing host device and to the camera, and the camera collects real-time on-site images and transmits them, combined with the sound, to the data processing host device, the data processing host device converts the sound into text subtitles, synchronously superimposes them on the real-time live picture, and outputs video pictures with synchronized subtitles through HDMI/VGA/SDI signals or RTMP streaming media signals.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flowchart of a method for real-time voice-to-subtitle conversion, synchronous subtitle data processing, and picture composition for live broadcasting according to a preferred embodiment of the present invention;

FIG. 2 is a block diagram of a device for real-time voice-to-subtitle conversion, synchronous subtitle data processing, and picture composition for live broadcasting according to a preferred embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

FIG. 1 is a flowchart of a method for real-time voice-to-subtitle conversion, synchronous subtitle data processing, and picture composition for live broadcasting according to a preferred embodiment of the present invention. As shown in FIG. 1, the method of this preferred embodiment includes steps 101 to 109.

Step 101: the first sound pickup collects real-time on-site sound;

The first sound pickup is fixed in the on-site area and collects the real-time sound there.

Step 102: the sound is transmitted to the first camera and the data processing host;

The real-time sound collected by the first sound pickup in the on-site area is transmitted to the first camera and to the data processing host.
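Step 102 relies on the pickup delivering the same sound to two destinations. The patent does not say how the audio is framed or time-stamped; the sketch below assumes 48 kHz PCM cut into 20 ms chunks, each stamped from its sample index, and uses in-process queues as hypothetical stand-ins for the camera and host inputs. Because both copies carry identical timestamps derived from one capture clock, the host can later line them up.

```python
from dataclasses import dataclass
from queue import Queue

SAMPLE_RATE = 48_000   # assumed capture rate in Hz
CHUNK_SAMPLES = 960    # assumed chunk size: 20 ms at 48 kHz

@dataclass(frozen=True)
class AudioChunk:
    pts_ms: float      # presentation timestamp relative to capture start
    samples: tuple     # PCM samples belonging to this chunk

def chunk_with_timestamps(pcm, sample_rate=SAMPLE_RATE, chunk=CHUNK_SAMPLES):
    """Split a PCM sample sequence into chunks stamped from their sample index."""
    for start in range(0, len(pcm), chunk):
        pts_ms = start * 1000.0 / sample_rate
        yield AudioChunk(pts_ms, tuple(pcm[start:start + chunk]))

def fan_out(chunks, *sinks):
    """Deliver every chunk, unchanged, to each sink (camera input, host input)."""
    for c in chunks:
        for sink in sinks:
            sink.put(c)

# Hypothetical stand-ins for the first camera's audio input and the host's input.
camera_in, host_in = Queue(), Queue()
fan_out(chunk_with_timestamps([0] * 4800), camera_in, host_in)
```

With 4800 samples each queue receives five chunks stamped 0, 20, 40, 60 and 80 ms.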

Step 103: the first camera collects real-time on-site image data and the sound transmitted by the first sound pickup;

While collecting real-time image data of the on-site area, the first camera also receives the real-time sound transmitted from the first sound pickup.

Step 104: the sound and the picture are synchronously processed and synthesized into live streaming media data by video encoding, and transmitted to the data processing host;

After obtaining the real-time image data and the real-time sound data, the first camera encodes them into a video with an audio track, processes this video synchronously into streaming media data, and transmits it to the data processing host.
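In Step 104 the camera interleaves its encoded video and the received audio into one stream. Real cameras do this inside a container format (for example MPEG-TS or FLV); the sketch below ignores the container and shows only the ordering rule, merging two timestamp-ordered packet lists so that the combined stream stays ordered by presentation timestamp. The `Packet` type and the 25 fps / 20 ms packet spacing are illustrative assumptions.

```python
import heapq
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    stream: str     # "audio" or "video"
    pts_ms: float   # presentation timestamp on the shared capture clock
    payload: bytes  # encoded data, opaque in this sketch

def mux(video_packets, audio_packets):
    """Interleave two pts-ordered packet lists into one pts-ordered stream."""
    return list(heapq.merge(video_packets, audio_packets, key=lambda p: p.pts_ms))

video = [Packet("video", t, b"V") for t in (0.0, 40.0, 80.0)]            # 25 fps
audio = [Packet("audio", t, b"A") for t in (0.0, 20.0, 40.0, 60.0, 80.0)]  # 20 ms chunks
stream = mux(video, audio)
```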

Step 105: the data processing host separates the sound and the video by the corresponding decoding;

When the data processing host receives the video streaming media data from the first camera, it separates the sound and the video from the transmitted streaming media data using the corresponding decoding method.
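Step 105 is the inverse of that interleaving. In practice a demuxer such as the one in FFmpeg's libavformat does this work; the hypothetical helper below only illustrates the idea, routing each packet to a per-stream list while keeping its original timestamp, so that the audio and video tracks remain comparable afterwards.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    stream: str     # "audio" or "video"
    pts_ms: float   # timestamp assigned by the camera
    payload: bytes  # encoded data, opaque in this sketch

def demux(packets):
    """Split an interleaved packet stream into per-stream lists, keeping pts."""
    tracks = defaultdict(list)
    for p in packets:
        tracks[p.stream].append(p)
    return dict(tracks)

interleaved = [Packet("video", 0.0, b"V"), Packet("audio", 0.0, b"A"),
               Packet("audio", 20.0, b"A"), Packet("video", 40.0, b"V")]
tracks = demux(interleaved)
```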

Step 106: the real-time sound data collected on site by the first sound pickup is fed to a speech recognition engine in real time, recognized, and converted into text subtitles in various styles for on-screen display;

The real-time on-site sound data collected by the first sound pickup is fed to the speech recognition engine in real time; the engine recognizes the sound data and converts it into text information, which is refreshed in real time and displayed on the screen as subtitles.
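Step 106 says the recognized text is refreshed in real time. Many streaming speech recognition engines emit interim hypotheses that are later replaced by a final result; assuming an engine of that kind (the callback names `on_partial` and `on_final` are hypothetical), a small buffer like the one below can keep the on-screen subtitle stable: interim text overwrites the pending hypothesis, final text is committed, and only the last few wrapped lines are rendered. Wrapping is by character count, which suits Chinese text; other languages would need word-aware wrapping.

```python
from dataclasses import dataclass, field

@dataclass
class CaptionBuffer:
    max_chars: int = 16   # assumed characters per on-screen subtitle line
    max_lines: int = 2    # assumed number of subtitle lines shown at once
    committed: list = field(default_factory=list)  # finalized recognition results
    pending: str = ""     # latest interim hypothesis from the recognizer

    def on_partial(self, text):
        """Interim result: replace (not append to) the pending hypothesis."""
        self.pending = text

    def on_final(self, text):
        """Final result: commit it and clear the pending hypothesis."""
        self.committed.append(text)
        self.pending = ""

    def render(self):
        """Return the lines to draw: newest text, cut into max_chars pieces."""
        text = "".join(self.committed) + self.pending
        lines = [text[i:i + self.max_chars] for i in range(0, len(text), self.max_chars)]
        return lines[-self.max_lines:]
```

Calling `render()` after every recognizer callback yields the subtitle lines to redraw on screen.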

Step 107: the data processing host uses the processed timestamps to directly encode and synchronize the video data with the preprocessed text subtitle information;

After separating the sound and the video of the streaming media data, the data processing host uses timestamps to synchronize the video data with the text information obtained by recognizing and converting the sound data, i.e., the preprocessed text.
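The timestamp matching in Step 107 can be pictured as looking up, for every video frame, the subtitle cue whose time span covers that frame's timestamp. The sketch assumes sorted, non-overlapping cues with half-open intervals, and a single hypothetical `offset_ms` that maps the pickup's clock onto the stream's clock (positive when the stream lags the pickup); none of these details are fixed by the patent.

```python
import bisect
from dataclasses import dataclass

@dataclass(frozen=True)
class Cue:
    start_ms: float   # start on the pickup (recognition) clock
    end_ms: float     # exclusive end on the same clock
    text: str

def cue_for_frame(cues, frame_pts_ms, offset_ms=0.0):
    """Return the text of the cue active at a video frame, or '' if none.

    cues must be sorted by start_ms and must not overlap. offset_ms is the
    assumed delay of the stream clock relative to the pickup clock."""
    t = frame_pts_ms - offset_ms            # frame time expressed on the pickup clock
    starts = [c.start_ms for c in cues]
    i = bisect.bisect_right(starts, t) - 1  # last cue starting at or before t
    if i >= 0 and t < cues[i].end_ms:
        return cues[i].text
    return ""

cues = [Cue(0, 1500, "Good morning"), Cue(2000, 3500, "everyone")]
# Frames at 25 fps are 40 ms apart; here the stream lags the pickup by 500 ms.
labels = [cue_for_frame(cues, pts, offset_ms=500) for pts in (400, 1960, 2040, 2480, 2520)]
```

Precomputing `starts` once per cue list would avoid rebuilding it for every frame; it is rebuilt here only to keep the function self-contained.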

Step 108: the data processing host superimposes the preprocessed text subtitles on the output picture synchronized with them, compositing a synchronized live picture that contains the text subtitles;

After the data processing host has synchronized the video data and the text information, it superimposes them as picture layers. The layered result can be displayed directly in real time or composited using the layer mechanism of a player; alternatively, the data processing host can perform the composition by encoding the data and outputting the composited result.
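The layer superposition in Step 108 amounts to blending subtitle pixels over the video frame. The sketch below works on a grayscale frame stored as nested lists, with `None` marking transparent subtitle pixels and a single opacity `alpha`; a real system would rasterize the text with a font renderer and blend in RGB or YUV, so this only illustrates the per-pixel rule out = alpha * subtitle + (1 - alpha) * frame.

```python
def overlay(frame, glyphs, x, y, alpha=1.0):
    """Alpha-blend a subtitle bitmap onto a copy of a grayscale frame.

    frame and glyphs are 2-D lists of 0-255 values; glyphs uses None for
    transparent pixels. (x, y) is the bitmap's top-left corner in the frame.
    Subtitle pixels falling outside the frame are clipped."""
    out = [row[:] for row in frame]            # leave the input frame untouched
    height, width = len(out), len(out[0])
    for dy, glyph_row in enumerate(glyphs):
        for dx, g in enumerate(glyph_row):
            fy, fx = y + dy, x + dx
            if g is None or not (0 <= fy < height and 0 <= fx < width):
                continue                       # transparent or off-frame pixel
            out[fy][fx] = round(alpha * g + (1 - alpha) * out[fy][fx])
    return out
```

For example, a half-transparent white caption strip would use `alpha=0.5` and bright glyph values near 255.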

Step 109: the data processing host encodes and outputs the composited picture;

After composition is completed, the data processing host encodes the processed video together with the superimposed subtitles. The encoded video can be pushed to an RTMP server and played by a streaming media player at a remote network location, or output directly for display through the HDMI, VGA, or SDI outputs of the data processing host.
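For the RTMP path in Step 109, one common tool is the FFmpeg command-line program. The helper below only builds an argument list and does not run anything; the input file name, server URL, bitrates and keyframe interval are placeholders chosen for illustration, not values from the patent. It re-encodes to H.264 video and AAC audio and muxes the result as FLV, the container RTMP carries.

```python
import shlex

def rtmp_push_command(input_url, rtmp_url, video_bitrate="3000k", fps=25):
    """Build an ffmpeg argument list that re-encodes input and pushes it over RTMP."""
    return [
        "ffmpeg",
        "-re",                  # read input at its native rate (live pacing)
        "-i", input_url,
        "-c:v", "libx264",      # H.264 video
        "-preset", "veryfast",  # favor encoding speed for live use
        "-b:v", video_bitrate,
        "-g", str(fps * 2),     # one keyframe every 2 seconds
        "-c:a", "aac",          # AAC audio
        "-b:a", "128k",
        "-f", "flv",            # RTMP carries an FLV-muxed stream
        rtmp_url,
    ]

# Hypothetical input file and server URL.
cmd = rtmp_push_command("composited.ts", "rtmp://example.com/live/streamkey")
print(shlex.join(cmd))
```

On a machine with FFmpeg installed, passing the list to `subprocess.run(cmd, check=True)` would start the push.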

The following example is given for illustration:

In a live broadcast event, a data processing host 206, a camera 202, and a sound pickup 201 are deployed on site. The sound pickup 201 collects the on-site real-time sound and transmits it synchronously to the camera 202, which synchronously processes and synthesizes the on-site real-time image data with the sound transmitted by the sound pickup and transmits the resulting live streaming media data to the data processing host 206. The data processing host 206 separates the sound and the video by the corresponding decoding. The on-site real-time sound data collected by the sound pickup 201 is fed to a speech recognition engine in real time and recognized and converted into text subtitles in various styles for on-screen display. The data processing host 206 encodes and compares the sound separated from the streaming media data with the on-site real-time sound collected by the sound pickup 201 to synchronize them, and uses the resulting timestamps to directly encode and synchronize the video data with the preprocessed text subtitle information. The data processing host 206 directly outputs the picture synchronized with the text subtitles; the preprocessed text subtitles are adjusted and superimposed on this picture to composite a live picture that contains the text subtitles in synchronization; and the composited picture is re-encoded into streaming media data and output as HDMI/VGA/SDI signals, RTMP streaming media, and the like.
The corresponding device comprises a sound pickup 201, a video camera 202, and a data processing host 206; the sound pickup 201 and the video camera 202 are each connected to the data processing host 206. The sound pickup 201 collects real-time on-site sound; the camera 202 collects real-time on-site images; and the data processing host 206 performs decoding, comparison and synchronization, composition, encoding, and output processing on the real-time pictures and sounds 203 transmitted by the sound pickup 201 and the camera 202. When the on-site sound pickup 201 collects real-time sound and transmits it to the data processing host 206, and the camera 202 collects real-time on-site images and sound and transmits the combined stream to the data processing host 206, the data processing host 206 converts the sound into text subtitles, synchronously superimposes them on the real-time live picture, and outputs video pictures with synchronized subtitles, i.e., the composited output pictures and sounds 204, which can be sent to the display 207 or to other display and playback devices 208 via HDMI/VGA/SDI signals or RTMP streaming media signals.
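The example states that the sound separated from the stream is compared with the sound taken directly from the sound pickup 201, but not how. One standard way to measure the delay between two recordings of the same audio is cross-correlation: slide one signal against the other and keep the lag with the highest normalized correlation. The brute-force sketch below assumes equal sample rates, a stream path that is never ahead of the pickup path (non-negative lags only), and synthetic test signals; it illustrates the idea and is not presented as the patented procedure.

```python
import random

def estimate_offset(reference, delayed, max_lag):
    """Return the lag (in samples) at which `delayed` best matches `reference`.

    A positive result means `delayed` lags `reference` by that many samples.
    Brute-force normalized cross-correlation over lags 0..max_lag."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        n = min(len(reference), len(delayed) - lag)  # overlap length at this lag
        if n <= 0:
            break
        a = reference[:n]
        b = delayed[lag:lag + n]
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
        score = num / den if den else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

rng = random.Random(42)
mic = [rng.uniform(-1.0, 1.0) for _ in range(400)]  # synthetic stand-in for pickup audio
stream = [0.0] * 12 + mic                           # same audio delayed by 12 samples
lag = estimate_offset(mic, stream, max_lag=50)      # → 12
offset_ms = lag * 1000 / 8000                       # at an assumed 8 kHz sample rate
```

Converting the lag to milliseconds gives the kind of clock offset the timestamp synchronization of Step 107 needs. The loop costs O(n × max_lag); FFT-based correlation is the usual choice for longer windows.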

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for real-time voice-to-subtitle conversion, synchronous subtitle data processing, and picture composition for live broadcasting, characterized by comprising the following steps:

a first sound pickup collects real-time on-site sound and transmits it to a first camera and a data processing host, and the first camera collects real-time on-site image data and the sound transmitted by the pickup;

the sound and the picture are synchronously processed and synthesized into live streaming media data by video encoding, and the generated data are transmitted to the data processing host over a network or over physical cables such as HDMI/SDI/VGA; after receiving the data, the data processing host separates the sound and the video of the streaming media data by the corresponding decoding, feeds the real-time sound data collected on site by the first sound pickup to a speech recognition engine in real time, and recognizes and converts it into text subtitles in various styles for on-screen display; the data processing host encodes and compares the sound separated from the streaming media data with the on-site real-time sound collected by the first sound pickup to synchronize them, and uses the resulting timestamps to directly encode and synchronize the video data with the preprocessed text subtitle information; the data processing host directly outputs the picture synchronized with the text subtitles, the preprocessed text subtitles are adjusted and superimposed on this output picture to composite a live picture that contains the text subtitles in synchronization, and the composited picture is re-encoded into streaming media data and output as HDMI/VGA/SDI signals, RTMP streaming media, and the like.

2. The method according to claim 1, wherein the data processing host separates the audio and the video of the acquired streaming media data, compares the separated audio in real time with the on-site real-time audio data collected by the first sound pickup to synchronize timestamps, and performs timestamp comparison and synchronization on the result of processing the on-site real-time audio data collected by the first sound pickup with the speech-to-text recognition engine.

3. The method according to claim 1, wherein, after the on-site sound and picture are input synchronously, the preprocessed text subtitles are adjusted and superimposed on the picture output in synchronization with them, and the data processing host captures the desktop showing the superimposed picture and composites it, by means of encoding software, into a live picture that contains the text subtitles in synchronization.

4. The method according to claim 1, wherein, after the on-site sound and picture are input synchronously, the preprocessed text subtitles are adjusted and superimposed on the picture output in synchronization with them, and the data processing host captures the desktop showing the superimposed picture and composites it, by means of a hardware capture card, into a live picture that contains the text subtitles in synchronization.

5. The method according to claim 1, wherein, after the on-site sound and picture are input synchronously, the preprocessed text subtitles are adjusted and superimposed on the picture output in synchronization with them, and the data processing host captures the desktop showing the superimposed picture and composites it, by means of a hardware capture encoder, into a live picture that contains the text subtitles in synchronization.

6. A device for real-time voice-to-subtitle conversion, synchronous subtitle data processing, and picture composition for live broadcasting, comprising a sound pickup, a camera, and a data processing host device, wherein the sound pickup and the camera are each connected to the data processing host device; the pickup collects real-time on-site sound, the camera collects real-time on-site images, and the data processing host device decodes, compares, synchronizes, composites, encodes, and outputs the data transmitted by the pickup and the camera; when the on-site pickup collects real-time sound, the sound is transmitted to the data processing host device and simultaneously to the camera, and the camera collects the real-time on-site images together with the sound; the data processing host device converts the sound into text subtitles, synchronously superimposes them on the real-time live pictures, and outputs video pictures with synchronized subtitles through HDMI/VGA/SDI signals or RTMP streaming media signals.