CN115150631A - Subtitle processing method, subtitle processing device, electronic equipment and storage medium - Google Patents

Subtitle processing method, subtitle processing device, electronic equipment and storage medium

Info

Publication number
CN115150631A
Authority
CN
China
Prior art keywords
text
live video
video stream
subtitle file
audio stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110282757.4A
Other languages
Chinese (zh)
Inventor
李秋平 (Li Qiuping)
刘坚 (Liu Jian)
李磊 (Li Lei)
王明轩 (Wang Mingxuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110282757.4A
Publication of CN115150631A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; operations thereof
    • H04N21/21: Server components or server architectures
    • H04N21/218: Source of audio or video content, e.g. local disk arrays
    • H04N21/2187: Live feed
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; operations thereof
    • H04N21/47: End-user applications
    • H04N21/488: Data services, e.g. news ticker
    • H04N21/4884: Data services, e.g. news ticker, for displaying subtitles
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/005: Language recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Embodiments of the present disclosure disclose a subtitle processing method, a subtitle processing apparatus, an electronic device, and a storage medium. The method includes: acquiring a live video stream and the audio stream within it; performing speech recognition on the audio stream to obtain one or more pieces of text information corresponding to the audio stream, together with time information corresponding to each piece of text information; and generating a subtitle file according to each piece of text information and its time information. The technical solution provided by the disclosure prepares a live video stream for subtitling, and the whole process can be completed by a single electronic device, keeping the procedure simple and the overall subtitle processing highly reliable. Because the generated subtitle file stores text rather than pictures, no picture background needs to be removed when the subtitle file is later merged with the live video stream; subtitle styles are convenient to configure, the available styles can be enriched, and subtitle display clarity improves.

Description

Subtitle processing method, subtitle processing device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of information technology, and in particular to a subtitle processing method, a subtitle processing apparatus, an electronic device, and a storage medium.
Background
With the continuous development of live video technology, user demand for live video streams keeps growing. However, existing live video streams carry no subtitles, so viewers cannot clearly follow the content of the live video stream, which degrades the user experience.
Disclosure of Invention
To solve, or at least partially solve, this technical problem, embodiments of the present disclosure provide a subtitle processing method, a subtitle processing apparatus, an electronic device, and a storage medium.
The embodiment of the disclosure provides a subtitle processing method, which includes:
acquiring a live video stream, and acquiring an audio stream in the live video stream;
performing speech recognition on the audio stream to obtain one or more pieces of text information corresponding to the audio stream, and time information corresponding to each piece of text information;
and generating a subtitle file according to each piece of text information and the time information corresponding to each piece of text information.
An embodiment of the present disclosure further provides a subtitle processing apparatus, including:
an acquisition module configured to acquire a live video stream and to acquire an audio stream in the live video stream;
a speech recognition module configured to perform speech recognition on the audio stream to obtain one or more pieces of text information corresponding to the audio stream, and time information corresponding to each piece of text information; and
a generation module configured to generate a subtitle file according to each piece of text information and the time information corresponding to each piece of text information.
An embodiment of the present disclosure further provides an electronic device, which includes:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the subtitle processing method described above.
The disclosed embodiments also provide a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the subtitle processing method as described above.
The embodiments of the present disclosure further provide a computer program product including a computer program or instructions which, when executed by a processor, implement the subtitle processing method described above.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has at least the following advantages:
the technical scheme provided by the embodiment of the disclosure includes acquiring a live video stream and acquiring an audio stream in the live video stream; performing voice recognition on the audio stream to obtain one or more text messages corresponding to the audio stream and time information corresponding to each text message in the one or more text messages; and generating a subtitle file according to each text message and the time information corresponding to each text message, so that preparation can be made for adding subtitles to the live video stream, and a user watching the live video stream can be clear of the content of the live video stream. The whole process can be completed by one electronic device without being completed by matching of a plurality of devices, the whole process is simple, the cost for allocating the subtitles to the live video stream is low, and the whole subtitle processing process has higher reliability.
In the technical solution provided by the embodiments of the present disclosure, the generated subtitle file is stored as text rather than as pictures. When the subtitle file is later merged with the live video stream, no picture background needs to be removed, which makes subtitle styles convenient to configure, enriches the available styles, and improves subtitle display clarity. Here, the subtitle style includes font size, font color, subtitle display mode, and the like; the display mode may be sentence-by-sentence or word-by-word.
The technical solution provided by the embodiments of the present disclosure is particularly suitable for scenarios such as large conferences, event live streams, industry and academic conference live streams, celebrity entertainment live streams, e-commerce live streams, and the like.
Drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and components are not necessarily drawn to scale.
Fig. 1 is a schematic diagram of a usage scenario of a subtitle processing method according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a usage scenario of another subtitle processing method according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a usage scenario of another subtitle processing method according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a usage scenario of another subtitle processing method according to an embodiment of the present disclosure;
fig. 5 is a flowchart of a subtitle processing method according to an embodiment of the present disclosure;
fig. 6 is a flowchart of another subtitle processing method according to an embodiment of the present disclosure;
fig. 7 is a screenshot at one moment during synchronized playback of a subtitle file and a live video stream according to an embodiment of the present disclosure;
fig. 8 is a schematic diagram of a display interface for displaying each piece of text information and its corresponding target text according to an embodiment of the present disclosure;
fig. 9 is a flowchart of another subtitle processing method according to an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of a subtitle processing apparatus according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of an electronic device in an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features, and advantages of the present disclosure may be more clearly understood, aspects of the disclosure are further described below. It should be noted that, in the absence of conflict, the embodiments of the present disclosure and the features in those embodiments may be combined with each other.
Numerous specific details are set forth in the following description to facilitate a thorough understanding of the present disclosure, but the disclosure may also be practiced in ways other than those described here. Obviously, the embodiments in this specification are only a subset of the embodiments of the present disclosure, not all of them.
Fig. 1 is a schematic diagram of a usage scenario of a subtitle processing method according to an embodiment of the present disclosure. The subtitle processing method provided by the present disclosure can be applied in the environment shown in fig. 1. Referring to fig. 1, the subtitle processing system includes a terminal 1, a server 2, a server 3, a server 6, and a terminal 7, where terminal 1 and server 2, server 2 and server 3, server 3 and server 6, and server 6 and terminal 7 are each communicatively connected over a network. Terminal 1 is the anchor's (streamer's) terminal; it uploads the live video stream, which at this point carries no subtitles, to server 2. Server 3 pulls the live video stream from server 2, executes the subtitle processing method provided by the present disclosure to obtain a subtitle file, and then pushes the subtitle file and the subtitle-free live video stream to server 6. Server 6 may forward both to terminal 7, which plays them synchronously, effectively displaying a subtitled live video stream. Alternatively, server 6 may merge the subtitle file into the subtitle-free live video stream and send the resulting subtitled live video stream to terminal 7, which then plays it. Terminal 7 is the terminal of a viewer watching the live broadcast.
Alternatively, the subtitle processing system may omit server 6, with server 3 and terminal 7 communicatively connected over a network. Server 3 pulls the live video stream from server 2, executes the subtitle processing method provided by the present disclosure to obtain a subtitle file, and then pushes the subtitle file and the subtitle-free live video stream to terminal 7. Terminal 7 plays the subtitle file and the subtitle-free live video stream synchronously, effectively displaying a subtitled live video stream.
Fig. 2 is a schematic diagram of a usage scenario of another subtitle processing method according to an embodiment of the present disclosure. The subtitle processing method provided by the present disclosure can be applied in the environment shown in fig. 2. Compared with fig. 1, server 2 is absent from fig. 2, and terminal 1 and server 3 are communicatively connected over a network. Terminal 1 uploads the live video stream, still without subtitles, to server 3, and server 3 executes the subtitle processing method provided by the present disclosure on the subtitle-free live video stream acquired from terminal 1.
Fig. 3 is a schematic diagram of a usage scenario of another subtitle processing method according to an embodiment of the present disclosure. The subtitle processing method provided by the present disclosure may be applied in the environment shown in fig. 3. In fig. 3, server 3 is replaced with terminal 4; the other connections are the same as in fig. 1. That is, terminal 4 pulls the subtitle-free live video stream from server 2, executes the subtitle processing method provided by the present disclosure to obtain the subtitle file, and then pushes the subtitle file and the subtitle-free live video stream to server 6.
Fig. 4 is a schematic diagram of a usage scenario of another subtitle processing method according to an embodiment of the present disclosure. The subtitle processing method provided by the present disclosure may be applied in the environment shown in fig. 4. Compared with fig. 1, fig. 4 further includes a terminal 5, communicatively connected with server 3 over a network. Server 3 obtains the subtitle file using the subtitle processing method provided by the present disclosure. Terminal 5 fetches the subtitle file from server 3 for manual proofreading and returns the proofread subtitle file to server 3. Server 3 then pushes the proofread subtitle file and the subtitle-free live video stream to server 6.
Alternatively, server 2 in fig. 1, server 3 in fig. 2, server 2 in fig. 3, and server 2 in fig. 4 are live servers, also referred to as origin servers, and the server 6 in each of figs. 1 to 4 is a CDN (Content Delivery Network) cache server. In essence, the live video stream is distributed from terminal 1 to terminal 7 using CDN technology, which can further reduce the time the stream takes to travel from terminal 1 to terminal 7.
In each of the above usage scenarios, each server may be implemented as an independent server or as a server cluster composed of multiple servers. The terminals may include, but are not limited to, smartphones, palmtop computers, tablet computers, wearable devices with display screens, desktop computers, notebook computers, all-in-one machines, and the like.
Fig. 5 is a flowchart of a subtitle processing method according to an embodiment of the present disclosure. The method is applicable where subtitles need to be added to a live video stream and may be executed by a subtitle processing apparatus. The apparatus may be implemented in software and/or hardware and may be configured in an electronic device, such as a terminal, including but not limited to a smartphone, palmtop computer, tablet computer, wearable device with a display screen, desktop computer, notebook computer, all-in-one machine, and the like. Alternatively, the embodiment is applicable where a server processes the live video stream, in which case the apparatus, likewise implemented in software and/or hardware, may be configured in an electronic device such as a server. As shown in fig. 5, the method may specifically include:
s110, acquiring a live video stream, and acquiring an audio stream in the live video stream.
There are many ways to implement this step, and the present disclosure is not limited in this respect. Illustratively, one implementation includes: acquiring the live video stream according to its address information (i.e., the source stream address of the live video stream), and acquiring the audio stream in the live video stream. Optionally, the address of the live video stream may be a URL (Uniform Resource Locator).
Illustratively, referring to fig. 1, when a user needs the server 3 to execute the subtitle processing method provided by the present disclosure, the user enters the address information of the live video stream into the server 3 (optionally in a format such as RTMP). The server 3 obtains the live video stream from the server 2 according to that address information, then decodes and separates the live video stream to obtain the audio stream within it. Optionally, decoding and separation may be performed in one pass, or in two steps, i.e., decoding first and then separating the decoded live stream.
Separating the live video stream yields an audio stream and a video stream. The audio stream contains only audio information, whereas the video stream contains both audio and picture information; the separated video stream is the same as the live video stream before separation, and the audio in the separated audio stream is the same as the audio in the live video stream before separation. The essence of this step is therefore to extract the audio information in the live video stream to form an audio stream.
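For illustration only, a minimal sketch of this extraction step follows. It is not part of the disclosed embodiments: it assumes the ffmpeg command-line tool is available, and the stream URL and output parameters (16 kHz mono 16-bit PCM, a common input format for speech recognition) are illustrative assumptions.

```python
import subprocess

def extract_audio_stream(source_url: str, out_path: str) -> None:
    """Pull a stream by its source address and keep only the decoded
    audio track; the video remains untouched at the source. For a live
    stream, ffmpeg keeps writing until the stream ends."""
    subprocess.run(
        [
            "ffmpeg",
            "-i", source_url,        # e.g. an rtmp:// source stream address
            "-vn",                   # drop video, keep audio only
            "-acodec", "pcm_s16le",  # decode to 16-bit PCM
            "-ar", "16000",          # 16 kHz sample rate
            "-ac", "1",              # mono
            out_path,
        ],
        check=True,
    )

# Hypothetical usage:
# extract_audio_stream("rtmp://live.example.com/app/stream", "audio.wav")
```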
S120, performing speech recognition on the audio stream to obtain one or more pieces of text information corresponding to the audio stream and time information corresponding to each piece of text information.
One piece of text information is the speech recognition result of one sentence or one paragraph. The time information corresponding to a piece of text information, i.e., its timestamp, includes the start time and the end time of the text information. The start time is the time at which the audio corresponding to the text information starts within the live video stream or the audio stream; the end time is the time at which that audio ends.
There are many ways to implement this step, and the present disclosure is not limited in this respect. Optionally, if the live video stream includes a plurality of video frames, the separated audio stream includes a plurality of audio frames, with video frames and audio frames in one-to-one correspondence and corresponding frames carrying the same timestamp, then this step may be implemented as follows: divide the audio stream into a plurality of units to be recognized, taking N (N being a positive integer) audio frames as one unit; recognize each unit separately to form a plurality of text units; splice all the text units and break them into sentences; integrate the text units into one or more pieces of text information, taking a sentence or a paragraph as a unit; and determine the time information corresponding to each piece of text information based on the timestamps of its corresponding audio frames.
Exemplarily, assume that speech recognition takes 100 audio frames as one unit to be recognized, so that a certain audio stream is divided into three units to be recognized: unit 1, unit 2, and unit 3. The recognition result of a unit (i.e., a text unit) may contain half a sentence, a sentence and a half, two sentences, and so on. Illustratively, unit 1 yields "In the March of spring, with all flowers blooming, the spring flower-art show of the Pujiang suburb park", a sentence and a half; unit 2 yields "opens today. The 430,000-square-meter wonderland garden is filled with spring. The wondrous flower carpet", another sentence and a half; and unit 3 yields "made of tens of thousands of tulips of various colors looks like a peacock fanning its tail, simply beautiful", completing the final sentence. After the texts of all the units are spliced and broken into sentences, the result reads: "In the March of spring, all flowers are blooming. The spring flower-art show of the Pujiang suburb park opens today. The 430,000-square-meter wonderland garden is filled with spring. The wondrous flower carpet made of tens of thousands of tulips of various colors looks like a peacock fanning its tail, simply beautiful." The recognition results of the three units can then be integrated, sentence by sentence, into four pieces of text information: text information 1 is the first sentence, text information 2 the second, text information 3 the third, and text information 4 the fourth. Finally, determine which audio frames each piece of text information corresponds to, take the timestamp of its first corresponding audio frame as its start time, and take the timestamp of its last corresponding audio frame as its end time. Exemplarily, suppose text information 1 corresponds to audio frames 1 through 53: the timestamp of frame 1 becomes its start time and the timestamp of frame 53 its end time.
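A minimal sketch of the chunk, recognize, splice, and timestamp pipeline described above follows. The recognizer and the sentence splitter are placeholders, and mapping sentences back to frames proportionally is an assumption; a real ASR engine would report per-word timings itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class AudioFrame:
    timestamp: float  # same timestamp as the corresponding video frame
    samples: bytes    # PCM payload of this frame

@dataclass
class TextInfo:
    text: str
    start_time: float  # timestamp of the first corresponding audio frame
    end_time: float    # timestamp of the last corresponding audio frame

def recognize_stream(
    frames: Sequence[AudioFrame],
    recognize: Callable[[Sequence[AudioFrame]], str],  # placeholder ASR call
    split_sentences: Callable[[str], List[str]],       # placeholder splitter
    n: int = 100,                                      # frames per unit
) -> List[TextInfo]:
    if not frames:
        return []
    # 1. Divide the audio stream into units of N frames and recognize each.
    units = [frames[i:i + n] for i in range(0, len(frames), n)]
    # 2. Splice the per-unit text units, then break them into sentences.
    spliced = "".join(recognize(u) for u in units)
    sentences = split_sentences(spliced)
    # 3. Map each sentence back onto a span of audio frames to obtain its
    #    start/end timestamps (proportional mapping is an assumption).
    results: List[TextInfo] = []
    pos, total = 0, max(len(spliced), 1)
    for s in sentences:
        i0 = int(pos / total * len(frames))
        pos += len(s)
        i1 = max(min(int(pos / total * len(frames)), len(frames)) - 1, i0)
        results.append(TextInfo(s, frames[i0].timestamp, frames[i1].timestamp))
    return results
```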
Further, there are various ways to ensure that corresponding video and audio frames carry the same timestamp, and the present application is not limited in this respect. Illustratively, timestamps for video frames may be generated while the live video stream is being formed, increasing continuously as live video frames are produced; then, when the audio stream is extracted from the live video stream, each audio frame is given the same timestamp as its corresponding video frame.
Alternatively, corresponding video and audio frames may be given the same timestamp as follows: when the audio stream is extracted from the live video stream, timestamps are added simultaneously to each video frame and each audio frame of the separated live video stream, with corresponding frames receiving identical timestamps that increase continuously in playback order.
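A minimal sketch of this second approach, assuming demuxed frame objects that accept a timestamp attribute and an illustrative fixed frame rate:

```python
def stamp_demuxed_frames(video_frames, audio_frames, fps=25.0):
    """Stamp each corresponding video/audio frame pair with one shared,
    monotonically increasing timestamp right after demuxing. The 25 fps
    frame rate is an illustrative assumption."""
    for i, (v, a) in enumerate(zip(video_frames, audio_frames)):
        v.timestamp = a.timestamp = i / fps
```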
And S130, generating a subtitle file according to each text message and the time information corresponding to each text message.
The text information in the subtitle file is stored in the form of text, not in the form of pictures.
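To make the "stored as text" point concrete, the following sketch serializes pieces of text information and their time information into the SRT subtitle format. SRT is purely an illustrative choice; the disclosure does not mandate any particular text format.

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm notation used by SRT."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(entries, path: str) -> None:
    """entries: iterable of (text, start_seconds, end_seconds) tuples,
    i.e. one entry per piece of text information."""
    with open(path, "w", encoding="utf-8") as f:
        for i, (text, start, end) in enumerate(entries, 1):
            f.write(f"{i}\n")
            f.write(f"{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n")
            f.write(f"{text}\n\n")
```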
According to the above technical solution, a live video stream and the audio stream within it are acquired; speech recognition is performed on the audio stream to obtain one or more pieces of text information and the time information corresponding to each piece; and a subtitle file is generated from each piece of text information and its time information. This prepares the live video stream for subtitling, so that viewers can clearly follow its content. The whole process can be completed by a single electronic device without coordinating multiple devices, which keeps the procedure simple, keeps the cost of subtitling a live video stream low, and makes the overall subtitle processing highly reliable.
Alternatively, in practice, when S130 is executed, a single subtitle file may be generated from all the text information, or one subtitle file may be generated for each piece of text information. The present application is not limited in this respect.
According to the above technical solution, the generated subtitle file is stored as text rather than as pictures, so when it is later merged with the live video stream, no picture background needs to be removed. This makes subtitle styles convenient to configure, enriching the available styles and improving subtitle display clarity. Here, the subtitle style includes font size, font color, subtitle display mode, and the like; the display mode may be sentence-by-sentence or word-by-word.
The technical solution is particularly suitable for scenarios such as large conferences, event live streams, industry and academic conference live streams, celebrity entertainment live streams, e-commerce live streams, and the like.
Optionally, on the basis of the above technical solution, after S130 the method further includes: synchronously sending the subtitle file and the live video stream to one or more first devices, where each first device synchronously forwards the subtitle file and the live video stream to a user terminal, and the user terminal plays them synchronously. The essence of this arrangement is to merge the subtitle file with the live video stream at the user terminal. On the premise that the subtitle file and the live video stream play synchronously, the whole subtitle processing process needs no involvement from a broadcast director, which further reduces the probability of live-broadcast accidents.
There are various methods for the user terminal to play the subtitle file and the live video stream synchronously. For example, if one subtitle file is generated per piece of text information, with the text's start time as the subtitle file's start time and its end time as the subtitle file's end time, the user terminal may proceed as follows: throughout playback of the live video stream, read the timestamp of the next video frame to be played, taking the current moment as reference; check whether a subtitle file exists whose start time equals that timestamp, and if so, play that subtitle file synchronously while the next video frame plays; while that subtitle file is playing, keep reading the timestamp of each next video frame to be played; and when a frame's timestamp matches the end time of the currently playing subtitle file, stop displaying that subtitle file after the frame has played.
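A minimal sketch of that playback loop follows. It assumes subtitle start and end times were taken directly from frame timestamps, as described above, so an exact-match lookup suffices; the rendering callbacks are placeholders for a real player.

```python
def play_with_subtitles(video_frames, subtitles, render_frame, show, hide):
    """video_frames: iterable of objects with a .timestamp attribute.
    subtitles: dict mapping a start timestamp -> (text, end_timestamp).
    render_frame/show/hide stand in for the real player's calls."""
    active_end = None
    for frame in video_frames:
        ts = frame.timestamp  # timestamp of the next frame to be played
        if ts in subtitles:   # a subtitle file starts at this frame?
            text, active_end = subtitles[ts]
            show(text)
        render_frame(frame)
        # Stop displaying the current subtitle once its end time is reached.
        if active_end is not None and ts >= active_end:
            hide()
            active_end = None
```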
Optionally, the first device may specifically be a server or a director console, so the solution can support both pushing to a director console and pushing to multiple terminals. Multi-terminal pushing means that the first device is a server: the subtitle file and the live video stream are synchronously sent to a plurality of servers, and each server pushes both to its corresponding terminals.
Optionally, on the basis of the above technical solution, after S130 the method further includes: synchronously sending the subtitle file and the live video stream to one or more first devices, where each first device merges them into a subtitled live video stream according to the subtitle file and the live video stream and sends the subtitled live video stream to a user terminal. The essence of this arrangement is to merge the subtitle file with the live video stream at the first device. On the premise that the subtitle file and the live video stream play synchronously, the whole subtitle processing process needs no involvement from a broadcast director, which further reduces the probability of live-broadcast accidents. In addition, because the subtitles are burned into the live video stream, the performance demands on the terminal player are low and platform adaptability is high.
There are various ways to obtain a subtitled live video stream from the subtitle file and the live video stream. For example, if the start time of the text information serves as the start time of the subtitle file and its end time as the end time, one implementation includes: aligning the start and end times of each subtitle file with the timestamps of the video frames in the live video stream to obtain the video frames corresponding to each subtitle file; and adding each subtitle file into its corresponding video frames to obtain a merged, subtitled live video stream.
Illustratively, if a certain subtitle file has a given start time and end time, the video frames whose timestamps fall between those two times are the frames corresponding to that subtitle file.
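As a sketch, the alignment step can be expressed as a simple filter over frame timestamps (frame objects with a .timestamp attribute are an assumption carried over from the earlier sketches):

```python
def frames_for_subtitle(video_frames, start: float, end: float):
    """Return the video frames whose timestamps fall within the subtitle
    file's [start, end] interval; these are the frames the subtitle text
    is drawn onto before the stream is re-encoded."""
    return [f for f in video_frames if start <= f.timestamp <= end]
```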
The "adding any one subtitle file to a video frame corresponding to any one subtitle file" may be to add the subtitle file to a video frame corresponding to any one subtitle file by using a superimposition technology.
Similarly, the first device may specifically be a server or a director console, so the solution can support both pushing to a director console and pushing to multiple terminals. Multi-terminal pushing means that the first device is a server: the subtitle file and the live video stream are synchronously sent to a plurality of servers, each server merges them into a subtitled live video stream, and each server then pushes the subtitled live video stream to its corresponding terminals.
Fig. 6 is a flowchart of another subtitle processing method according to an embodiment of the present disclosure. Fig. 6 is a specific example of fig. 5. Referring to fig. 6, the method may specifically include:
s210, acquiring a live video stream, and acquiring an audio stream in the live video stream.
S220, performing speech recognition on the audio stream to obtain one or more pieces of text information corresponding to the audio stream and time information corresponding to each piece of text information.
And S230, translating each piece of text information into a target text of a target language.
The target language may be English, Korean, Chinese, Japanese, or the like.
Exemplarily, if the anchor speaks Chinese during the live broadcast, the speech in both the live video stream and the audio stream is Chinese. If the target language is English, the text information in S220 is presented in Chinese and the target text in English.
And S240, generating a subtitle file according to each piece of text information, the target text corresponding to each piece of text information, and the time information corresponding to each piece of text information.
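A minimal sketch of assembling bilingual entries follows; translate is a placeholder for any machine-translation call, not a specific API, and the resulting entries can be serialized by a writer such as the write_srt sketch above.

```python
def bilingual_entries(text_infos, translate):
    """text_infos: iterable of (text, start, end) tuples. Each resulting
    entry stacks the recognized source text above its target-language
    translation, keeping the original time information."""
    entries = []
    for text, start, end in text_infos:
        target = translate(text)  # e.g. Chinese -> English (placeholder)
        entries.append((text + "\n" + target, start, end))
    return entries
```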
Illustratively, fig. 7 is a screenshot at one moment during synchronized playback of a subtitle file and a live video stream according to an embodiment of the present disclosure. Referring to fig. 7, the screenshot shows the picture of the live video stream together with the subtitle file: a person is speaking in the picture, the subtitle file presents what the person is saying, and the user can understand the video content with the help of the subtitle file.
The above technical solution adds the step of translating each piece of text information into a target text of a target language and generates the subtitle file from each piece of text information, its corresponding target text, and its time information. This can satisfy the viewing needs of users of different languages, giving the live video a wider audience.
On the basis of the above technical solution, optionally, after S230 the method further includes: displaying each piece of text information and the target text corresponding to it; and, in response to a text information modification instruction and/or a target text modification instruction, modifying the text information and/or the target text the instruction points to. This applies when the subject executing the subtitle processing method is a terminal with a display screen, e.g., the terminal 4 in fig. 3. The text information modification instruction and/or the target text modification instruction are generated based on user operations. The essence is that the text information recognized in S220 and the target text translated in S230 are manually proofread, improving the accuracy of the text information and/or the target text.
Fig. 8 is a schematic diagram of a display interface for displaying each piece of text information and its corresponding target text according to an embodiment of the present disclosure. Illustratively, referring to fig. 8, the display interface includes four regions: region 1 displays the text information recognized in S220; region 2 displays the target text translated in S230; region 3 displays the playback progress of the audio stream or the live video stream; and region 4 configures the relevant parameters of the live process. The relevant parameters include, but are not limited to, the recognition language, the translation language, the source stream address of the live video stream, and the push stream address of the live video stream. Optionally, the relevant parameters may also include the live delay, the subtitle style, and the like.
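Grouped as a configuration object, the parameters of region 4 might look like the sketch below; every field name and default is an illustrative assumption, not part of the disclosed interface.

```python
from dataclasses import dataclass

@dataclass
class LiveCaptionConfig:
    recognition_language: str = "zh"   # language spoken in the stream
    translation_language: str = "en"   # target subtitle language
    source_stream_url: str = ""        # source stream address (e.g. rtmp)
    push_stream_url: str = ""          # push stream address
    live_delay_seconds: float = 0.0    # optional live delay
    subtitle_style: str = "sentence"   # e.g. "sentence" or "word-by-word"
```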
On the basis of the above technical solution, optionally, after S230 the method further includes: sending each piece of text information and its corresponding target text to one or more second devices; and receiving modified text information and/or modified target text from the one or more second devices. This applies when the subject executing the subtitle processing method is a server without a display screen and the second device is an electronic device with one, e.g., the server 3 and the terminal 5 in fig. 4. The essence is to facilitate manual proofreading of the text information recognized in S220 and the target text translated in S230, so as to improve their accuracy.
On the basis of the above technical solutions, optionally, if the text information recognized in S220 and the target text translated in S230 are manually proofread, S240 may include: generating the subtitle file according to the confirmed text information, the modified text information, the confirmed target text, the modified target text, and the time information corresponding to each piece of text information. Confirmed text information is text information judged correct by manual proofreading; modified text information is text information corrected after proofreading judged the original to be wrong. Likewise, the confirmed target text is target text judged correct by proofreading, and the modified target text is target text corrected after proofreading judged the original to be wrong. This arrangement ensures that the subtitles displayed during subsequent playback are correct.
Fig. 9 is a flowchart of another subtitle processing method according to an embodiment of the present disclosure. Fig. 9 is a specific example of fig. 5. Referring to fig. 9, the method may specifically include:
s310, acquiring a live video stream, and acquiring an audio stream in the live video stream.
S320, performing speech recognition on a plurality of consecutive audio frames in the audio stream to obtain text information corresponding to the non-silent frames among the plurality of audio frames and time information corresponding to that text information, where the time information includes the timestamps of the non-silent frames corresponding to the text information.
A mute frame is an audio frame corresponding to a video frame generated while the anchor is silent; it therefore contains no content that needs speech recognition.
Optionally, this step may be implemented as follows: divide the audio stream into a plurality of units to be recognized, taking N audio frames as one unit; remove the mute frames in each unit to obtain the units after removal; recognize each resulting unit separately to form a plurality of text units; splice all the text units and break them into sentences; integrate the text units into one or more pieces of text information, taking a sentence or a paragraph as a unit; and determine the time information corresponding to each piece of text information based on the timestamps of its corresponding audio frames.
The essence of this step is to extract the non-silent frames in the audio stream and perform speech recognition on the extracted non-silent frames.
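A minimal sketch of removing mute frames before recognition follows. Treating a frame as silent when the RMS energy of its PCM samples falls below an empirical threshold is an assumption; a dedicated voice-activity detector could equally stand in.

```python
import math
import struct

def frame_rms(samples: bytes) -> float:
    """RMS energy of 16-bit little-endian PCM samples."""
    n = len(samples) // 2
    if n == 0:
        return 0.0
    values = struct.unpack("<%dh" % n, samples[: n * 2])
    return math.sqrt(sum(v * v for v in values) / n)

def drop_silent_frames(frames, threshold: float = 500.0):
    """Keep only non-silent frames (objects with a .samples bytes
    attribute, as in the AudioFrame sketch earlier), so the recognizer
    never sees the anchor's silent periods."""
    return [f for f in frames if frame_rms(f.samples) >= threshold]
```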
And S330, generating a subtitle file according to the text information corresponding to the non-silent frames among the plurality of audio frames and the time information corresponding to the text information.
According to the above technical solution, speech recognition is performed on consecutive audio frames in the audio stream to obtain text information corresponding to the non-silent frames and its time information, where the time information includes the timestamps of the corresponding non-silent frames. In essence, the non-silent frames in the audio stream are extracted and only they are recognized, which reduces the time spent on speech recognition of the whole audio stream and, in turn, the overall processing time of the live video stream.
On the basis of the above technical solutions, optionally, after the audio stream in the live video stream is acquired, the subtitle processing method further includes: converting the format of the audio stream into a format recognizable by automatic speech recognition (ASR) technology. This ensures that speech recognition of the audio stream succeeds and, in turn, that the subtitle processing method provided by the present disclosure can be executed successfully.
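For illustration, such a conversion might be done as follows; the use of the pydub library and the 16 kHz mono 16-bit WAV target are assumptions, chosen only because most ASR engines accept that format.

```python
from pydub import AudioSegment  # assumption: pydub is installed

def to_asr_format(in_path: str, out_path: str) -> None:
    """Convert an arbitrary audio container/encoding into 16 kHz mono
    16-bit WAV so a downstream ASR engine can recognize it."""
    audio = AudioSegment.from_file(in_path)
    audio = audio.set_frame_rate(16000).set_channels(1).set_sample_width(2)
    audio.export(out_path, format="wav")
```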
Fig. 10 is a schematic structural diagram of a subtitle processing apparatus in an embodiment of the present disclosure. The subtitle processing apparatus provided by the embodiments of the present disclosure may be configured in a client, or may be configured in a server, and specifically includes:
an obtaining module 410, configured to obtain a live video stream, and obtain an audio stream in the live video stream;
a speech recognition module 420, configured to perform speech recognition on the audio stream to obtain one or more pieces of text information corresponding to the audio stream and time information corresponding to each piece of text information;
and a generating module 430, configured to generate a subtitle file according to each piece of text information and the time information corresponding to each piece of text information.
Further, the apparatus further includes a stream pushing module configured to, after the subtitle file is generated from each piece of text information and its time information, synchronously send the subtitle file and the live video stream to one or more first devices, where each first device synchronously forwards both to a user terminal and the user terminal plays them synchronously.
Alternatively, the stream pushing module is configured to, after the subtitle file is generated, synchronously send the subtitle file and the live video stream to one or more first devices, where each first device merges them into a subtitled live video stream according to the subtitle file and the live video stream and sends the subtitled live video stream to a user terminal.
Further, the apparatus further includes a translation module configured to, after speech recognition yields the one or more pieces of text information and the time information corresponding to each piece, translate each piece of text information into a target text of a target language;
a generating module 430, configured to generate the subtitle file according to each piece of text information, the target text corresponding to each piece of text information, and the time information corresponding to each piece of text information.
Further, the device further comprises a first proofreading module, configured to display each piece of text information and a target text corresponding to each piece of text information after each piece of text information is translated into the target text in the target language;
and in response to a text information modification instruction and/or a target text modification instruction, modifying text information pointed by the text information modification instruction and/or modifying target text pointed by the target text modification instruction.
Further, the apparatus further includes a second proofreading module configured to, after each piece of text information is translated into the target text of the target language, send each piece of text information and its corresponding target text to one or more second devices;
modified text information and/or modified target text is received from the one or more second devices.
Further, the generating module 430 is configured to generate the subtitle file according to the confirmed text information, the modified text information, the confirmed target text, the modified target text, and the time information corresponding to each text information.
Further, the speech recognition module 420 is configured to perform speech recognition on a plurality of consecutive audio frames in the audio stream to obtain text information corresponding to the non-silent frames among them and time information corresponding to that text information, where the time information includes the timestamps of the non-silent frames corresponding to the text information.
Further, the obtaining module 410 is configured to obtain the live video stream according to the address information of the live video stream, and obtain an audio stream in the live video stream.
Further, the device also comprises a conversion module, which is used for converting the format of the audio stream into a format which can be recognized by an automatic speech recognition technology after the audio stream in the live video stream is acquired.
The subtitle processing apparatus provided by the embodiments of the present disclosure can perform the steps performed by the client or the server in the subtitle processing method provided by the embodiments of the present disclosure; the steps and their beneficial effects are not repeated here.
Fig. 11 is a schematic structural diagram of an electronic device in an embodiment of the present disclosure. Referring now specifically to fig. 11, a schematic diagram of an electronic device 1000 suitable for use in implementing embodiments of the present disclosure is shown. The electronic device 1000 in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet), a PMP (portable multimedia player), a vehicle-mounted terminal (e.g., a car navigation terminal), a wearable electronic device, and the like, and fixed terminals such as a digital TV, a desktop computer, a smart home device, and the like. The electronic device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 11, the electronic device 1000 may include a processing means (e.g., a central processing unit, a graphic processor, etc.) 1001 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage means 1008 into a Random Access Memory (RAM) 1003 to implement the subtitle processing method according to the embodiments described in the present disclosure. In the RAM 1003, various programs and information necessary for the operation of the electronic apparatus 1000 are also stored. The processing device 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Generally, the following devices may be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 1007 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 1008 including, for example, magnetic tape, hard disk, and the like; and a communication device 1009. The communications apparatus 1009 may allow the electronic device 1000 to communicate wirelessly or by wire with other devices to exchange information. While fig. 11 illustrates an electronic device 1000 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be alternatively implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flow chart, thereby implementing the subtitle processing method as described above. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 1009, or installed from the storage means 1008, or installed from the ROM 1002. The computer program, when executed by the processing device 1001, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may include an information signal propagated in baseband or as part of a carrier wave, in which computer readable program code is carried. Such a propagated information signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital information communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
acquiring a live video stream, and acquiring an audio stream in the live video stream;
performing speech recognition on the audio stream to obtain one or more pieces of text information corresponding to the audio stream and time information corresponding to each piece of text information;
and generating a subtitle file according to each piece of text information and the time information corresponding to each piece of text information.
Optionally, when the one or more programs are executed by the electronic device, the electronic device may further perform other steps described in the above embodiments.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the remote-computer case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flow charts and block diagrams in the figures, the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure are illustrated. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a unit does not, in some cases, constitute a limitation on the unit itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely a description of exemplary embodiments of the present disclosure, provided to enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the present disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (13)

1. A method for processing subtitles, the method comprising:
acquiring a live video stream, and acquiring an audio stream in the live video stream;
performing voice recognition on the audio stream to obtain one or more pieces of text information corresponding to the audio stream and time information corresponding to each of the one or more pieces of text information;
and generating a subtitle file according to each piece of text information and the time information corresponding to each piece of text information.
2. The method according to claim 1, wherein after generating the subtitle file according to each piece of text information and the time information corresponding to each piece of text information, the method further comprises:
and synchronously sending the subtitle file and the live video stream to one or more first devices, wherein each first device is used for synchronously sending the subtitle file and the live video stream to a user terminal, and the user terminal is used for synchronously playing the subtitle file and the live video stream.
3. The method according to claim 1, wherein after generating the subtitle file according to each piece of text information and the time information corresponding to each piece of text information, the method further comprises:
and synchronously sending the subtitle file and the live video stream to one or more first devices, wherein each first device is used for obtaining the live video stream added with the subtitles according to the subtitle file and the live video stream, and sending the live video stream added with the subtitles to a user terminal.
4. The method according to any one of claims 1-3, wherein after performing voice recognition on the audio stream to obtain the one or more pieces of text information corresponding to the audio stream and the time information corresponding to each piece of text information, the method further comprises:
translating each piece of text information into target text in a target language;
correspondingly, generating a subtitle file according to each piece of text information and the time information corresponding to each piece of text information comprises:
and generating the subtitle file according to each piece of text information, the target text corresponding to each piece of text information and the time information corresponding to each piece of text information.
5. The method of claim 4, wherein after translating each piece of text information into the target text in the target language, the method further comprises:
displaying each piece of text information and a target text corresponding to each piece of text information;
and in response to a text information modification instruction and/or a target text modification instruction, modifying text information pointed by the text information modification instruction and/or modifying target text pointed by the target text modification instruction.
6. The method of claim 4, wherein after translating each piece of text information into the target text in the target language, the method further comprises:
sending each piece of text information and the target text corresponding to each piece of text information to one or more second devices;
and receiving modified text information and/or modified target text from the one or more second devices.
7. The method according to claim 5 or 6, wherein generating the subtitle file according to each piece of text information, the target text corresponding to each piece of text information, and the time information corresponding to each piece of text information comprises:
and generating the subtitle file according to the confirmed text information, the modified text information, the confirmed target text, the modified target text, and the time information corresponding to each piece of text information.
8. The method of claim 1, wherein performing voice recognition on the audio stream to obtain one or more pieces of text information corresponding to the audio stream and time information corresponding to each of the one or more pieces of text information comprises:
and performing voice recognition on a plurality of continuous audio frames in the audio stream to obtain text information corresponding to non-silent frames in the audio frames and time information corresponding to the text information, wherein the time information corresponding to the text information comprises time stamps of the non-silent frames corresponding to the text information.
9. The method of claim 1, wherein obtaining a live video stream and obtaining an audio stream in the live video stream comprises:
and acquiring the live video stream according to the address information of the live video stream, and acquiring an audio stream in the live video stream.
10. The method of claim 1 or 9, wherein after acquiring the audio stream in the live video stream, the method further comprises:
converting the format of the audio stream into a format recognizable by an automatic speech recognition technology.
11. A subtitle processing apparatus, comprising:
the acquisition module is used for acquiring a live video stream and acquiring an audio stream in the live video stream;
the voice recognition module is used for performing voice recognition on the audio stream to obtain one or more pieces of text information corresponding to the audio stream and time information corresponding to each of the one or more pieces of text information;
and the generating module is used for generating a subtitle file according to each piece of text information and the time information corresponding to each piece of text information.
12. An electronic device, characterized in that the electronic device comprises:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-10.
13. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-10.
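Claims 2 and 3 above describe two delivery paths: forwarding the subtitle file alongside the live video stream, or having a first device merge them before forwarding. For the second path, a minimal sketch of burning the generated subtitle file into the stream with ffmpeg (assuming an ffmpeg build with libass for the subtitles filter; the stream URLs are hypothetical):

import subprocess

def burn_in(stream_url: str, srt_path: str, out_url: str) -> None:
    # Render the subtitle file onto the video (the "live video stream
    # added with subtitles" of claim 3) and forward the result downstream.
    subprocess.run(
        ["ffmpeg", "-i", stream_url, "-vf", f"subtitles={srt_path}",
         "-c:a", "copy", "-f", "flv", out_url],
        check=True,
    )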
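For the bilingual subtitles of claims 4-7, the sketch below reuses the Segment and to_srt_time helpers from the earlier sketch; translate and review are hypothetical placeholders for the machine-translation step and the correction step of claims 5-6, which the disclosure leaves open:

def translate(text: str, target_lang: str) -> str:
    # Hypothetical machine-translation call; tags the text so the sketch
    # stays self-contained instead of invoking a real MT service.
    return f"[{target_lang}] {text}"

def review(source: str, target: str) -> tuple[str, str]:
    # Stand-in for the correction step: display the pair and accept
    # modified text information and/or modified target text. Here both
    # are confirmed unchanged.
    return source, target

def write_bilingual_srt(segments, target_lang: str, srt_path: str) -> None:
    # Each subtitle cue carries a piece of text information followed by
    # its corresponding target text (cf. claim 4).
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            src, tgt = review(seg.text, translate(seg.text, target_lang))
            f.write(f"{i}\n{to_srt_time(seg.start)} --> {to_srt_time(seg.end)}\n"
                    f"{src}\n{tgt}\n\n")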
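Claim 8's timestamps for non-silent frames can be illustrated with a simple energy-based voice activity check. The threshold on mean absolute amplitude over 20 ms frames of 16-bit mono PCM is an illustrative assumption, not a method the disclosure mandates; a production system would use a trained voice-activity detector:

import wave
from array import array

FRAME_MS = 20            # fixed frame length; illustrative choice
ENERGY_THRESHOLD = 500   # mean |amplitude| cutoff; tune per input

def non_silent_spans(wav_path: str):
    """Yield (start_sec, end_sec) spans of consecutive non-silent frames,
    assuming 16-bit mono PCM such as the WAV written by extract_audio."""
    with wave.open(wav_path, "rb") as w:
        rate = w.getframerate()
        samples = array("h", w.readframes(w.getnframes()))
    frame_len = rate * FRAME_MS // 1000
    span_start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(abs(s) for s in frame) / max(len(frame), 1)
        t = i / rate
        if energy >= ENERGY_THRESHOLD:
            if span_start is None:
                span_start = t                 # first non-silent frame of a span
        elif span_start is not None:
            yield span_start, t                # span ended on a silent frame
            span_start = None
    if span_start is not None:
        yield span_start, len(samples) / rate  # audio ended mid-span

The recognized text for each span would then carry the span's timestamps as its time information.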
CN202110282757.4A 2021-03-16 2021-03-16 Subtitle processing method, subtitle processing device, electronic equipment and storage medium Pending CN115150631A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110282757.4A CN115150631A (en) 2021-03-16 2021-03-16 Subtitle processing method, subtitle processing device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115150631A true CN115150631A (en) 2022-10-04

Family

ID=83403516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110282757.4A Pending CN115150631A (en) 2021-03-16 2021-03-16 Subtitle processing method, subtitle processing device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115150631A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1867990A (en) * 2003-11-10 2006-11-22 三星电子株式会社 Storage medium storing text-based subtitle data including style information, and apparatus and method of playing back the storage medium
WO2019037615A1 (en) * 2017-08-24 2019-02-28 北京搜狗科技发展有限公司 Video processing method and device, and device for video processing
CN108063970A (en) * 2017-11-22 2018-05-22 北京奇艺世纪科技有限公司 A kind of method and apparatus for handling live TV stream
CN108184135A (en) * 2017-12-28 2018-06-19 泰康保险集团股份有限公司 Method for generating captions and device, storage medium and electric terminal
CN108401192A (en) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 Video stream processing method, device, computer equipment and storage medium
CN108600773A (en) * 2018-04-25 2018-09-28 腾讯科技(深圳)有限公司 Caption data method for pushing, subtitle methods of exhibiting, device, equipment and medium
CN109257659A (en) * 2018-11-16 2019-01-22 北京微播视界科技有限公司 Subtitle adding method, device, electronic equipment and computer readable storage medium
CN111901615A (en) * 2020-06-28 2020-11-06 北京百度网讯科技有限公司 Live video playing method and device
CN111565330A (en) * 2020-07-13 2020-08-21 北京美摄网络科技有限公司 Synchronous subtitle adding method and device, electronic equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117707394A (en) * 2023-07-12 2024-03-15 荣耀终端有限公司 Text display method, storage medium and electronic device

Similar Documents

Publication Publication Date Title
CN108401192B (en) Video stream processing method and device, computer equipment and storage medium
US11227620B2 (en) Information processing apparatus and information processing method
US20210014574A1 (en) Using Text Data in Content Presentation and Content Search
CN111064987B (en) Information display method and device and electronic equipment
CN112601101B (en) Subtitle display method and device, electronic equipment and storage medium
US20070130611A1 (en) Triggerless interactive television
CN112616062B (en) Subtitle display method and device, electronic equipment and storage medium
CN114205665B (en) Information processing method, device, electronic equipment and storage medium
CN112437337A (en) Method, system and equipment for realizing live broadcast real-time subtitles
CN112073753B (en) Method, device, equipment and medium for publishing multimedia data
CN110545472B (en) Video data processing method and device, electronic equipment and computer readable medium
CN112073307A (en) Mail processing method and device, electronic equipment and computer readable medium
CN114040255A (en) Live caption generating method, system, equipment and storage medium
CN106331763B (en) Method for seamlessly playing fragmented media file and device for implementing method
CN113891168A (en) Subtitle processing method, subtitle processing device, electronic equipment and storage medium
CN115150631A (en) Subtitle processing method, subtitle processing device, electronic equipment and storage medium
CN115086753A (en) Live video stream processing method and device, electronic equipment and storage medium
CN110753259B (en) Video data processing method and device, electronic equipment and computer readable medium
CN113886612A (en) Multimedia browsing method, device, equipment and medium
CN110149528B (en) Process recording method, device, system, electronic equipment and storage medium
CN113923530B (en) Interactive information display method and device, electronic equipment and storage medium
CN113891108A (en) Subtitle optimization method and device, electronic equipment and storage medium
CN115474065B (en) Subtitle processing method and device, electronic equipment and storage medium
CN113139090A (en) Interaction method, interaction device, electronic equipment and computer-readable storage medium
CN114125331A (en) Subtitle adding system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination