CN114363535A - Video subtitle extraction method, apparatus, and computer-readable storage medium - Google Patents

Video subtitle extraction method, apparatus, and computer-readable storage medium

Info

Publication number
CN114363535A
Authority
CN
China
Prior art keywords: target, caption, subtitle, initial, clause
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111595067.0A
Other languages
Chinese (zh)
Inventor
张悦
黄均昕
董治
姜涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202111595067.0A priority Critical patent/CN114363535A/en
Publication of CN114363535A publication Critical patent/CN114363535A/en
Pending legal-status Critical Current

Landscapes

  • Television Signal Processing For Recording (AREA)

Abstract

The embodiments of the present application disclose a video subtitle extraction method, a video subtitle extraction apparatus, and a computer-readable storage medium. The method includes: acquiring a target video, and determining a target subtitle region of the target video based on a plurality of image frames in the target video; determining a subtitle region of each image frame in the target video from the target subtitle region, and obtaining an initial subtitle text of the target video based on the subtitle region of each image frame, the initial subtitle text including a plurality of initial subtitle clauses in an arrangement order; obtaining, from the plurality of initial subtitle clauses, a plurality of target subtitle clauses in an arrangement order based on the degree of similarity between adjacent initial subtitle clauses, where the characters of adjacent target subtitle clauses differ from each other; and finally obtaining the target subtitle text of the target video based on the plurality of target subtitle clauses in the arrangement order. The method and apparatus solve the problem of repeated subtitle extraction, are simple to operate, extract subtitles well, and have broad applicability.

Description

Video subtitle extraction method, apparatus, and computer-readable storage medium
Technical Field
The present application relates to the field of image processing, and in particular, to a method and apparatus for extracting video subtitles, and a computer-readable storage medium.
Background
With the development of internet technology, karaoke applications have become an indispensable part of people's leisure and entertainment, letting them sing, record videos, upload videos, and the like. A karaoke application may allow a user to upload music-related videos within the application, such as a video of the user recording a song they have sung. In order to push videos associated with songs that a user prefers (such as cover videos) and thus optimize the video push strategy of the karaoke application, song names need to be located by automatically recognizing the lyric subtitles in the videos uploaded by users, so that each video can be associated with the identified song name.
The inventors of the present application found that in the prior art, when lyric subtitles in a video are recognized by capturing image frames at intervals within the video and recognizing the characters in them, some lyrics are extracted repeatedly and the lyric extraction effect is poor. As a result, the success rate of locating song names is low, related videos cannot be pushed accurately, and the user experience suffers.
Disclosure of Invention
The embodiments of the present application provide a video subtitle extraction method, a video subtitle extraction apparatus, and a computer-readable storage medium, which solve the problem of repeated subtitle extraction, are simple to operate, extract subtitles well, and have broad applicability.
In a first aspect, an embodiment of the present application provides a method for extracting a video subtitle, where the method includes:
acquiring a target video, and determining a target subtitle area of the target video based on a plurality of image frames in the target video;
determining a subtitle area of each image frame in the target video based on the target subtitle area, and obtaining an initial subtitle text of the target video based on the subtitle area of each image frame, wherein the initial subtitle text has a plurality of initial subtitle clauses arranged in sequence;
obtaining a plurality of target caption clauses with an arrangement sequence from the plurality of initial caption clauses based on the similarity degree between the adjacent initial caption clauses, wherein the characters between the adjacent target caption clauses are different from each other;
and obtaining the target caption text of the target video based on the plurality of target caption clauses with the arrangement sequence.
In one possible implementation manner, the determining a target caption area of the target video based on a plurality of image frames in the target video includes:
acquiring a plurality of image frames in the target video, and inputting the image frames into a pixel point classification model to obtain the probability that each pixel point in each image frame output by the pixel point classification model belongs to a subtitle region;
acquiring subtitle positioning information of the target video based on the probability that each pixel point in each image frame belongs to a subtitle region;
and determining a target subtitle area of the target video based on the subtitle positioning information.
In a possible implementation manner, the obtaining subtitle positioning information of the target video based on the probability that each pixel point in each image frame belongs to a subtitle region includes:
obtaining the number of times that each pixel point is determined as a subtitle region in each image frame based on the probability that each pixel point in each image frame belongs to the subtitle region;
and determining a target pixel point from the pixel points based on the number of times that the pixel points are determined as the subtitle areas in the image frames, and acquiring subtitle positioning information of the target video based on the target pixel point.
In a possible implementation manner, the determining a target pixel point from the pixel points based on the number of times that the pixel points are determined as the subtitle region in the image frames, and acquiring the subtitle positioning information of the target video based on the target pixel point includes:
determining, from the pixel points, pixel points whose number of times of being determined as a subtitle region in the image frames is greater than or equal to a count threshold as target pixel points, to obtain a plurality of target pixel points, wherein the plurality of target pixel points form one or more pixel point sets;
selecting one or more target pixel point sets with the number of the target pixel points larger than a preset number from the one or more pixel point sets;
and selecting a target pixel point set in a specified display area from the one or more target pixel point sets, and acquiring subtitle positioning information of the target video based on the selected target pixel point set.
In a possible implementation manner, the subtitle positioning information includes a region start coordinate, a region length, and a region width; the determining the target caption area of the target video based on the caption positioning information includes:
determining a preset region extension size based on the region length and the region width included in the subtitle positioning information, and determining the target region start coordinate, the target region length and the target region width based on the region start coordinate, the region length, the region width and the preset region extension size;
and determining an area consisting of the initial coordinates of the target area, the length of the target area and the width of the target area as the target subtitle area.
In one possible implementation manner, the determining a subtitle region for each image frame in the target video based on the target subtitle region and obtaining an initial subtitle text of the target video based on the subtitle region for each image frame includes:
acquiring a plurality of image frames of the target video based on a preset frame extraction frequency, and determining a subtitle area of each image frame from the plurality of image frames based on the target subtitle area;
intercepting target sub-images of each image frame from each image frame based on the subtitle area of each image frame to obtain a plurality of target sub-images, and sequencing the target sub-images according to the appearance sequence of the image frame to which each target sub-image belongs in the target video;
and generating an image to be recognized based on the sequenced target sub-images, and performing character recognition on the image to be recognized to obtain an initial subtitle text of the target video.
In a possible implementation manner, the similarity degree includes an edit distance and a length distance; the obtaining a plurality of target caption clauses having an arrangement order from the plurality of initial caption clauses based on the degree of similarity between the respective adjacent initial caption clauses includes:
detecting the minimum target editing operation times required for converting any initial subtitle clause into an adjacent previous initial subtitle clause, and obtaining the editing distance of any initial subtitle clause based on the target editing operation times;
detecting the difference in the number of characters between the any initial caption clause and the adjacent previous initial caption clause, and obtaining the length distance of the any initial caption clause based on that character-count difference;
and determining a target caption clause from the plurality of initial caption clauses based on the edit distance and the length distance of each of the initial caption clauses.
In one possible implementation, the determining a target caption clause from the plurality of initial caption clauses based on the edit distance and the length distance of each of the initial caption clauses includes:
determining a first initial caption clause in the plurality of initial caption clauses as an alternative caption clause;
for any initial caption clause after the first initial caption clause, if the difference between its editing distance and its length distance is not smaller than a preset threshold value, determining both the any initial caption clause and the adjacent previous initial caption clause as alternative caption clauses; otherwise, removing the adjacent previous initial caption clause of the any initial caption clause and determining the any initial caption clause as an alternative caption clause, so as to determine the alternative caption clauses from the plurality of initial caption clauses; wherein the adjacent previous initial caption clause of the second initial caption clause of the plurality of initial caption clauses is the first initial caption clause;
and determining all the determined alternative subtitle clauses as target subtitle clauses.
In a second aspect, an embodiment of the present application provides a computer device, where the computer device includes: a processor, a memory, and a network interface;
the processor is connected to a memory and a network interface, wherein the network interface is used for providing a data communication function, the memory is used for storing program codes, and the processor is used for calling the program codes to execute the method according to the first aspect of the embodiment of the present application.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on the terminal device, the terminal device is enabled to execute the video subtitle extracting method provided in the first aspect and/or any one of the possible implementation manners of the first aspect, and beneficial effects of the method provided in the first aspect can also be achieved.
In a fourth aspect, an embodiment of the present application provides a video subtitle extracting apparatus, including:
the acquisition module is used for acquiring a target video;
a target caption area generation module, configured to determine a target caption area of the target video based on the plurality of image frames in the target video acquired by the acquisition module;
an initial caption text generating module, configured to determine a caption area of each image frame in the target video based on a target caption area of the target caption area generating module, and obtain an initial caption text of the target video based on the caption area of each image frame, where the initial caption text includes a plurality of initial caption clauses in an arrangement order;
a target caption text generation module, configured to obtain a plurality of target caption clauses with an arrangement order from the plurality of initial caption clauses based on a similarity degree between each adjacent initial caption clause of the initial caption text generation module, where characters between adjacent target caption clauses are different from each other;
the target caption text generation module is further configured to obtain a target caption text of the target video based on the plurality of target caption clauses with the arrangement order.
In the embodiments of the present application, a target subtitle region is obtained from a target video, giving the position distribution of the target subtitle text within each image frame of the video, and an initial subtitle text (including a plurality of initial subtitle clauses in an arrangement order) is obtained from the image frames of the target video through the target subtitle region. Target subtitle clause extraction is then performed on the plurality of initial subtitle clauses based on the degree of similarity between adjacent initial subtitle clauses to obtain a plurality of target subtitle clauses in an arrangement order; initial subtitle clauses that are incomplete, or that repeat subtitles already present in the target video, are removed during this extraction, so that the target subtitle text can be obtained from the plurality of target subtitle clauses. This solves the problem of repeated subtitle extraction, is simple to operate, extracts subtitles well, and has broad applicability.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application;
fig. 2 is a schematic flowchart of a video subtitle extraction method according to an embodiment of the present application;
fig. 3 is a schematic view of a scene of a video subtitle extraction method according to an embodiment of the present application;
fig. 4 is a schematic view of another scene of a video subtitle extraction method according to an embodiment of the present application;
fig. 5 is a schematic view of another scene of a video subtitle extraction method according to an embodiment of the present application;
fig. 6 is a schematic view of another scene of a video subtitle extraction method according to an embodiment of the present application;
FIG. 7a is a schematic of text detection for optical character recognition;
FIG. 7b is a schematic of text recognition for optical character recognition;
fig. 8 is a schematic view of another scene of a video subtitle extraction method according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a video subtitle extracting apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The video subtitle extraction method provided by the embodiments of the present application is suitable for applications with video viewing and/or uploading functions (such as video applications, live-streaming applications, short-video applications, karaoke applications, and the like). A video containing subtitles in such an application (for convenience of description, a target video is taken as an example) can be processed to obtain its subtitle text (the subtitle text may be a target subtitle text, i.e., a subtitle text fully consistent with the subtitles in the target video). The categories of the target video include, but are not limited to, movies, news, variety shows, training courses, and user generated content (UGC) videos, among others. The subtitles may be the title, credits, lyrics, dialog, introductions to characters or background, place names, and time-period information of a movie, and may be determined according to the requirements of the actual application scenario, which is not limited here. The target videos in the application can be further processed based on the target subtitle text to achieve a better user experience. For example, a target video may be associated with relevant category labels based on the target subtitle text, so that it is classified and delivered to the corresponding video channel, making it easier for users to find the videos they like on video channels of different categories. Alternatively, the target video may be checked for content compliance based on the target subtitle text: the developer of an application (e.g., a video application or a live-streaming application) may obtain the target subtitle text of the target video to evaluate whether its content is compliant, and may remove a target video with unqualified content (e.g., sensitive information or violating information) from the platform to ensure the legality of the application.
In addition, during use of a related application (such as a karaoke application), a user can sing a favorite song and upload a target video (such as a UGC video) to the karaoke application. The UGC video may be a video shot and edited by the user with a device that has a recording function, such as a mobile phone, digital camera, or tablet computer, or a video obtained by collecting related footage from the internet and editing it. For example, user A records a target video of singing song B (the target video contains the lyrics of song B) with a related device (mobile phone, camera, etc.); the corresponding target subtitle text can be obtained from the target video, the name of song B can be obtained from the target subtitle text, and the cover video of song B uploaded by user A can be associated with song B (for convenience of description, the song associated with the target video is called the target song). Then, when other users use the karaoke application (for example, open an application page related to song B), they can see the cover video of song B uploaded by user A on the relevant application page, so the target video is pushed more effectively and the user experience is improved. When the song name is obtained from the target subtitle text, if the target subtitle text differs greatly from the lyrics of the corresponding target song or contains repeated lyrics, the accuracy of locating the song name is affected. Therefore, in the process of extracting the target subtitle text from the target video, further processing (for example, deduplication) of the recognized subtitle text to obtain a target subtitle text that is consistent with the lyrics of the corresponding song and contains no repetition is key to accurately locating and associating the corresponding song. For convenience of description, the following takes as an example the extraction of the target subtitle text (i.e., the complete lyrics of the corresponding song) from a UGC video uploaded by a user during the development and/or use of a karaoke application.
In the video subtitle extraction method provided by the embodiments of the present application, a target subtitle region is obtained from the target video, giving the position distribution of the target subtitle text within each image frame of the video, and an initial subtitle text (containing a plurality of initial subtitle clauses) is obtained from the image frames of the target video through the target subtitle region. Target subtitle clauses are then extracted from the plurality of initial subtitle clauses based on the degree of similarity between adjacent initial subtitle clauses to obtain a plurality of target subtitle clauses; initial subtitle clauses that differ greatly from the lyrics of the corresponding song (such as incomplete sentences) or that are repeated are removed in this extraction process, so that the target subtitle text can be obtained from the target subtitle clauses. This solves the problem of repeated subtitle extraction, is simple to operate, extracts subtitles well, and has broad applicability.
Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture provided in an embodiment of the present application. As shown in fig. 1, the system architecture may include a service server 100 and a terminal cluster, where the terminal cluster may include: terminal devices such as terminal device 200a, terminal device 200b, terminal devices 200c, … …, and terminal device 200 n. The service server 100 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud database, a cloud service, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal device (including the terminal device 200a, the terminal device 200b, the terminal devices 200c, … …, and the terminal device 200n) may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a palm computer, a Mobile Internet Device (MID), a wearable device (e.g., a smart watch, a smart bracelet, etc.), a smart computer, a smart car-mounted smart terminal, and the like. The service server 100 may establish a communication connection with each terminal device in the terminal cluster, and a communication connection may also be established between each terminal device in the terminal cluster. In other words, the service server 100 may establish a communication connection with each of the terminal device 200a, the terminal device 200b, the terminal devices 200c, … …, and the terminal device 200n, for example, a communication connection may be established between the terminal device 200a and the service server 100. A communication connection may be established between the terminal device 200a and the terminal device 200b, and a communication connection may also be established between the terminal device 200a and the terminal device 200 c. The communication connection is not limited to a connection manner, and may be directly or indirectly connected through a wired communication manner, or may be directly or indirectly connected through a wireless communication manner, and the like, and may be determined according to an actual application scenario, and the present application is not limited herein.
It should be understood that each terminal device in the terminal cluster shown in fig. 1 may be installed with an application client, and when the application client runs in each terminal device, data interaction may be performed between the application client and the service server 100 shown in fig. 1, respectively, so that the service server 100 may receive service data from each terminal device (for example, UGC video uploaded by a user through the terminal device). The application client can be an application client having a function of displaying data information such as text, images, audio and video, such as a social application, an instant messaging application, a live application, a game application, a short video application, a music application, a karaoke application, a shopping application, a novel application, a payment application, and the like, and can be specifically determined according to the requirements of an actual application scene, and is not limited herein. The application client may be an independent client, or may be an embedded sub-client integrated in a certain client (e.g., an instant messaging client, a social client, etc.), which may be determined specifically according to an actual application scenario and is not limited herein. Taking the karaoke application as an example, in the process that the user uses the karaoke application through the terminal device, the user can not only sing a favorite song, but also upload a UGC video (such as a song singing video finished by the user through singing and recording) related to music to the service server 100 through the terminal device, and the service server 100 receives the UGC video sent by the user. Each terminal device (terminal device 200a, terminal device 200b, terminal devices 200c, … …, and terminal device 200n) that establishes a communication connection with the service server 100 can view the UGC video transmitted by the user through the karaoke application. In addition, after receiving the UGC video sent by the user through the K song application, the service server 100 may extract video subtitles from the video, and associate a target song based on the extracted subtitles to better perform video push. The method provided in the embodiment of the present application may be executed by the service server 100 shown in fig. 1, or may be executed by a terminal device (any one of the terminal device 200a, the terminal devices 200b, … …, and the terminal device 200n shown in fig. 1), or may be executed by both the terminal device and the service server, which may be determined according to an actual application scenario, and is not limited herein.
In some possible embodiments, the terminal device 200a (installed with the karaoke application) may transmit the target video to the service server 100, and the service server 100 performs video caption extraction based on the received target video to obtain caption text (which may be referred to as target caption text) of the target video. The terminal device 200a may send a target video (e.g., a singing video of song B recorded by the user a) to the service server 100 through the karaoke application, and the service server 100 obtains a target subtitle area based on the received target video, so as to obtain a position distribution of a target subtitle text in each image frame of the video. An initial caption text (including a plurality of initial caption clauses) is obtained from each image frame of a target video through the target caption area, the target caption clauses are extracted based on the similarity degree between each adjacent clause in each initial caption clause to obtain a plurality of target caption clauses, the initial caption clauses with larger difference or repetition with lyrics of a corresponding target song can be removed in the target caption clause extraction process, and therefore the service server 100 obtains the target caption text completely consistent with the lyrics of the song (namely the target song) corresponding to the target video through the plurality of target caption clauses. The problem of repeated extraction of the subtitles is solved, the operation is simple, the subtitle extraction effect is good, and the applicability is strong.
For convenience of description, a service server is taken as an execution subject of the method provided by the embodiment of the present application, and an implementation manner of video subtitle extraction by the service server is specifically described by an embodiment.
Referring to fig. 2, fig. 2 is a schematic flow chart of a video subtitle extracting method according to an embodiment of the present application.
As shown in fig. 2, the method comprises the steps of:
s101, acquiring a target video, and determining a target subtitle area of the target video based on a plurality of image frames in the target video.
In some possible embodiments, the service server may obtain a target video (which may be UGC video, and the UGC video may be a singing video of a target song) sent by an application client (such as a karaoke application) installed by a terminal device (such as the terminal device 200a), and determine a target subtitle region of the target video (i.e., a region where a target subtitle text is located in the target video) based on a plurality of image frames in the received target video, so that the target subtitle text (i.e., a subtitle text consistent with lyrics of the target song) may be obtained from the target video based on the target subtitle region. Specifically, the service server may obtain a plurality of image frames in the target video, where the plurality of image frames may be a plurality of image frames extracted from the target video and carrying the initial caption clause. Referring to fig. 3, fig. 3 is a scene schematic diagram of a video subtitle extraction method according to an embodiment of the present application. As shown in fig. 3, the image frame 30 in the figure may be one of a plurality of image frames extracted from the target video, the image frame 30 includes an initial caption clause 31 and other video pictures, and the content of the initial caption clause 31 is: "I am used to bury deeply in fog and feel stiff. Here, the region where the initial caption clause 31 is located in the image frame 30 is the caption region of the image frame, and the target caption region of the target video can be acquired based on the caption region of the image frame and the caption regions of the other extracted image frames.
In some possible embodiments, the obtained multiple image frames may be input into a pixel point classification model (which may be a pixel point classification model based on a VGG16 network architecture), and through the model, the probability that each pixel point in each image frame belongs to the subtitle region of each image frame may be obtained, so that the target subtitle region of the target video may be obtained based on the probability that each pixel point in each image frame belongs to the subtitle region of each image frame. Here, the caption area of the above-described image frame may include a text area (text portion in the caption area) and a non-text area (other area than text in the caption area, such as a gap between the text). Specifically, for the image frame of each input pixel point classification model, a first probability map of the probability that each pixel point in each image frame belongs to a text region and a second probability map of the probability that each pixel point belongs to a non-text region may be obtained (the size of the probability maps may be the same as that of each image frame), and based on the first probability map and the second probability map of each image frame, a caption region binary map of each image frame may be obtained, where the caption region binary map of each image frame reflects the specific position of the caption region of each image frame.
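The patent does not give code for this step; purely as an illustrative sketch (the fusion rule, the 0.5 cut-off, and the array shapes below are assumptions), the conversion from the two per-pixel probability maps to a per-frame subtitle-region binary map could look like this:

```python
import numpy as np

def caption_binary_map(text_prob: np.ndarray, non_text_prob: np.ndarray) -> np.ndarray:
    """Combine the first probability map (text region) and the second probability
    map (non-text region inside the caption area, e.g. gaps between characters)
    into a binary map marking the frame's subtitle region.

    Both inputs are H x W arrays of probabilities in [0, 1] with the same size
    as the image frame; the output is an H x W array of 0/1 values.
    """
    # Assumed fusion rule: a pixel belongs to the subtitle region if either map
    # considers it more likely than not to lie inside the caption area.
    subtitle_prob = np.maximum(text_prob, non_text_prob)
    return (subtitle_prob >= 0.5).astype(np.uint8)
```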
Referring to fig. 4, fig. 4 is a schematic view of another scene of a video subtitle extracting method according to an embodiment of the present application. As shown in fig. 4, the image frame 30 in the figure may be one of a plurality of image frames extracted from the target video, the image frame is input to the pixel point classification model to obtain a first probability map and a second probability map, and the caption area binary map 40 of the image frame 30 may be obtained based on the first probability map and the second probability map. The caption area binary map 40 includes the caption area 41 (i.e., the white area in the caption area binary map) of the image frame 30 obtained based on the first probability map and the second probability map, and it is understood that other positions in the caption area binary map 40 may be identified as caption areas (not shown in fig. 4).
In some possible embodiments, inter-frame integration may be performed based on the binary images of the plurality of caption regions, that is, the sum of the times that each pixel point is determined as the caption region (including the text region and the non-text region) in each image frame may be counted, a target pixel point may be determined based on the sum of the times that each pixel point is determined as the caption region, and caption positioning information of the target video may be obtained based on the target pixel point. For example, if the determination threshold is 2, all the pixels determined as the subtitle region with the frequency greater than or equal to 2 are used as target pixels to obtain a plurality of target pixels, and the plurality of target pixels may form one or more pixel sets. It is to be understood that the one or more pixel point sets may be one or more target caption regions of the target video, and if there are multiple target caption regions, a target caption region related to the target caption text may be extracted from the multiple target caption regions. And obtaining a target caption area binary image based on the obtained target pixel points, wherein the target caption area binary image is the synthesis of the caption area binary images and reflects the target caption area of the target video as a whole.
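A minimal sketch of this inter-frame integration step, assuming the per-frame binary maps from the previous sketch and using connected components to group target pixel points into pixel point sets (the SciPy call and the threshold-of-2 / 150-pixel defaults mirror the examples in the text but are otherwise assumptions):

```python
import numpy as np
from scipy import ndimage  # used here only to group target pixels into sets

def target_pixel_mask(binary_maps: list[np.ndarray], count_threshold: int = 2) -> np.ndarray:
    """Count, per pixel, how many frames marked it as subtitle region and keep
    pixels whose count reaches the threshold (the target pixel points)."""
    counts = np.sum(np.stack(binary_maps, axis=0), axis=0)
    return (counts >= count_threshold).astype(np.uint8)

def candidate_caption_regions(mask: np.ndarray, min_pixels: int = 150) -> list[tuple[int, int, int, int]]:
    """Group target pixel points into connected pixel point sets and keep the
    sets containing more than the preset number of pixels; each kept set is
    returned as a bounding box (x, y, width, height)."""
    labels, num_sets = ndimage.label(mask)
    regions = []
    for label in range(1, num_sets + 1):
        ys, xs = np.nonzero(labels == label)
        if xs.size >= min_pixels:
            regions.append((int(xs.min()), int(ys.min()),
                            int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1)))
    return regions
```

Selecting, among the returned regions, the one lying in the specified display area (for lyrics, typically the lowest region of the frame) then yields the subtitle positioning information described next.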
Referring to fig. 5, fig. 5 is a schematic view of another scene of a video subtitle extracting method according to an embodiment of the present application. As shown in fig. 5, the inter-frame comprehensive graph 50 in fig. 5 includes a caption area 51 and a caption area 52 (i.e., the sum of the times that each pixel in the caption area is determined as the caption area in each image frame is greater than or equal to 2), where a caption area 511 in the caption area 51 indicates that the sum of the times that each pixel in the area is determined as the caption area in each image frame is 4 times, a caption area 512 indicates that the sum of the times that each pixel in the area is determined as the caption area in each image frame is 6 times, and a caption area 513 indicates that the sum of the times that each pixel in the area is determined as the caption area in each image frame is 8 times. The total number of times that each pixel point in the caption area 52 is determined as the caption area in each image frame is 6 times. If the number of times that a pixel point is determined as a caption region in each image frame reaches 2 times or more, the pixel point belongs to a target pixel point in a binary image of the target caption region, a plurality of target pixel points can form one or more pixel point sets, and each pixel point set can be a target caption region. The binary map of the target caption area is shown as a binary map 60 in fig. 5, and the binary map 60 includes a target caption area 61 and a target caption area 62. It can be understood that, among the pixels determined to belong to the target caption region in the binary image of the target caption region, some pixels do not belong to the target caption text (for example, other captions except the lyric caption in the target video, such as the target caption region 62 in fig. 5). One or more target caption regions including target pixels with a number greater than a preset number (for example, greater than 150 pixels) may be extracted from a plurality of target caption regions (or pixel point sets) in the binary image of the target caption region. And acquiring a target caption region in a designated display region (for example, the lowest part of the display region of the image frame, that is, the position of lyrics in the target caption is lower) from the one or more target caption regions, and extracting the positioning information of the target video based on the target caption region, thereby effectively avoiding that part of pixel points which do not belong to the target caption text are brought into the target caption region, and further improving the accuracy of the target caption region. Referring to fig. 5 again, after the binary image 60 in fig. 5 is extracted from the target caption area, a binary image 70 may be obtained, where the binary image 70 includes a target caption area 71, that is, the target caption area 62 unrelated to the target caption text in the binary image 60 is removed, and the positioning information of the target video may be obtained more accurately based on the binary image.
In some possible embodiments, the positioning information of the target video may be obtained based on the binary map of the target subtitle region, and the target subtitle region of the target video may be determined from the positioning information. Specifically, the positioning information includes a region start coordinate, a region length, and a region width. When the subtitle region of each image frame is obtained through a target subtitle region determined directly from the positioning information, characters in the cropped subtitle region could be incomplete, affecting extraction of the target subtitle text. To avoid this, a preset region extension size is determined based on the region length and the region width in the positioning information, and the target region start coordinate, the target region length, and the target region width are then determined based on the region start coordinate, the region length, the region width, and the preset region extension size. The target subtitle region determined from the target region start coordinate, the target region length, and the target region width thus prevents incomplete characters when the subtitle region is cropped from each image frame, improving the accuracy of target subtitle text extraction. For example, the positioning information may be obtained based on the binary map of the target subtitle region (e.g., binary map 70 in fig. 5): region start coordinates (x, y), region length w, and region width h. The preset region extension size may be determined as follows: the left and right sides of the subtitle region determined from the positioning information are each extended by 0.1×w, and the upper and lower sides are each extended by 0.1×h. That is, the target start coordinates (x-0.1×w, y-0.1×h), target region length 1.2×w, and target region width 1.2×h are determined. Extending the target subtitle region by the preset region extension size avoids incomplete characters in the cropped subtitle region when the subtitle region of each image frame is acquired based on the target subtitle region, so that the target subtitle text is extracted more completely and accurately.
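A short sketch of the region-extension arithmetic described above; the clamping to the frame boundary is an added assumption (the example coordinates could otherwise become negative):

```python
def extend_subtitle_region(x: float, y: float, w: float, h: float,
                           frame_w: int, frame_h: int,
                           ratio_w: float = 0.1, ratio_h: float = 0.1):
    """Extend the located subtitle region by 0.1*w on the left and right and by
    0.1*h on the top and bottom, as in the example, so that characters are not
    clipped when the region is cropped from each image frame."""
    new_x = max(x - ratio_w * w, 0)
    new_y = max(y - ratio_h * h, 0)
    new_w = min((1 + 2 * ratio_w) * w, frame_w - new_x)
    new_h = min((1 + 2 * ratio_h) * h, frame_h - new_y)
    return new_x, new_y, new_w, new_h  # target region start coordinate and size
```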
S102, determining a subtitle area of each image frame in the target video based on the target subtitle area, and obtaining an initial subtitle text of the target video based on the subtitle area of each image frame.
In some possible embodiments, the service server may acquire a plurality of image frames of the target video (until the target video ends) based on a preset frame extraction frequency (for example, one image frame is acquired every 60 seconds), and determine a subtitle region of the plurality of image frames by using the target subtitle region to intercept a target sub-image of each image frame from the image frame, where the target sub-image includes an initial subtitle clause in each image frame, so that an initial subtitle text (which may include a plurality of initial subtitle clauses) may be acquired based on the acquired plurality of target sub-images.
In some possible embodiments, since the target subtitle clauses in the target subtitle text are arranged in a certain order (i.e., the order in which the target subtitle clauses appear in the target video, which is also their order in the target song), the target sub-images are sorted by the appearance order, in the target video, of the image frame to which each target sub-image belongs, and a target subtitle text consistent with the lyrics of the target song can then be obtained from the sorted target sub-images (an image to be recognized can be generated from them, and its characters recognized by Optical Character Recognition (OCR) technology). Referring to fig. 6, fig. 6 is a schematic view of another scene of a video subtitle extraction method according to an embodiment of the present application. As shown in fig. 6, fig. 6 includes a portion of consecutive image frames, namely image frame 80, image frame 81, and image frame 82, among a plurality of image frames acquired from the target video (e.g., one image frame acquired every 60 seconds). The image frame 80 includes a target sub-image 801 of that image frame (the subtitle region of the image frame is determined by the target subtitle region; for example, the region start coordinates are (x1-0.1×w1, y1-0.1×h1), the region length is 1.2×w1, and the region width is 1.2×h1, and the target sub-image of the image frame is cropped based on that subtitle region), and the target sub-image 801 contains the initial subtitle clause of the image frame, "i's habit is deeply buried". Similarly, the target sub-image 811 in image frame 81 contains the initial subtitle clause of that image frame, "i's habit is deeply buried in fog", and the target sub-image 821 in image frame 82 contains the initial subtitle clause of that image frame, "i's habit is more powerful in deeply buried fog". After the target sub-images are acquired, they are sorted according to the appearance order, in the target video, of the image frame to which each target sub-image belongs, and the sorted target sub-images can be stitched into an image to be recognized 83. The image to be recognized 83 includes the target sub-images of the above-mentioned image frame 80, image frame 81, and image frame 82 arranged in this order; character recognition is performed on the image to be recognized 83, so that a plurality of initial subtitle clauses ("i's habit is deeply buried", "i's habit is deeply buried in fog", and "i's habit is more powerful in deeply buried fog") can be obtained from it, in the order of the subtitle clauses. Similarly, the same method can be adopted for the other image frames in the target video to obtain the target sub-images of all the image frames; the target sub-images are sorted according to the appearance order, in the target video, of the image frame to which each target sub-image belongs to obtain the image to be recognized, and the initial subtitle text of the target video is finally obtained through OCR technology. The operation is simple, and the initial subtitle text is acquired efficiently.
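As an illustrative sketch only: the patent describes its own text detection and recognition networks (discussed below), so the OpenCV frame sampling and the Tesseract OCR call here are stand-ins, and the assumption that the OCR output yields one line per stacked target sub-image is also mine:

```python
import cv2                    # assumed here for frame extraction and cropping
import numpy as np
import pytesseract            # stand-in OCR engine, not the networks described in the patent

def extract_initial_subtitle_clauses(video_path: str,
                                     region: tuple[float, float, float, float],
                                     sample_every_s: float = 60.0) -> list[str]:
    """Sample image frames at a preset frequency, crop the target subtitle
    region from each sampled frame, stack the crops in playback order into one
    image to be recognized, and OCR it into ordered initial subtitle clauses."""
    x, y, w, h = (int(round(v)) for v in region)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(round(fps * sample_every_s)))
    crops, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            crops.append(frame[y:y + h, x:x + w])   # target sub-image of this frame
        index += 1
    cap.release()
    stacked = np.vstack(crops)                      # the image to be recognized
    text = pytesseract.image_to_string(cv2.cvtColor(stacked, cv2.COLOR_BGR2RGB),
                                       lang="chi_sim")
    return [line.strip() for line in text.splitlines()]  # ordered initial subtitle clauses
```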
In some possible implementations, the initial subtitle text extraction performed on the image to be recognized through OCR technology may include text detection and text recognition. Specifically, please refer to fig. 7a, which is a schematic diagram of text detection in optical character recognition. As shown in fig. 7a, to obtain a text detection result from an input image to be detected (i.e., the image to be detected 90 in fig. 7a, which contains the text "test word"), text detection is performed through a feature extraction network composed of compact addition blocks and a feature enhancement network composed of a plurality of recurrent neural networks (RNNs), and a text detection result is finally obtained through box generation and edge reference networks (as shown by the text detection box 91 in fig. 7a). Referring to fig. 7b, fig. 7b is a schematic diagram of text recognition in optical character recognition. As shown in fig. 7b, based on the text detection result obtained in fig. 7a, the text 92 to be recognized in fig. 7b is input into the text recognition model. Horizontally asymmetric convolutions and multi-scale receptive fields are added to the text recognition model to strengthen the network's support for fonts of multiple scales, and a fine convolution method from fine-grained recognition effectively enhances image feature extraction for similar characters, blurred characters, and the like. The text "test word" in the text 92 to be recognized can thus be recognized by the text recognition model. Through this text detection and text recognition, the initial subtitle text of the target video can be acquired from the image to be recognized obtained by ordering the target sub-images. The operation is simple, the recognition accuracy is high, and the initial subtitle text is acquired well.
And S103, obtaining a plurality of target caption clauses with an arrangement order from the plurality of initial caption clauses based on the similarity degree between the adjacent initial caption clauses.
In some possible embodiments, the service server may detect the degree of similarity of each initial subtitle clause in the initial subtitle text, where the degree of similarity includes an edit distance and a length distance, and may, based on the degree of similarity, perform target subtitle clause extraction on the plurality of initial subtitle clauses in the initial subtitle text to obtain a plurality of target subtitle clauses with an arrangement order, where the characters of any two adjacent clauses among the plurality of target subtitle clauses differ from each other, so that no repeated target subtitle clause remains in the final target subtitle text. Specifically, the service server may detect the minimum number of target editing operations (editing operations counted in units of single characters) required to convert each initial subtitle clause in the initial subtitle text into its adjacent previous initial subtitle clause; this minimum number of target editing operations may be referred to as the edit distance of that initial subtitle clause. Meanwhile, the difference in the number of characters between each initial subtitle clause and its adjacent previous initial subtitle clause can be detected, and the length distance of each initial subtitle clause is obtained from that character-count difference. The edit distance and the length distance both reflect the degree of similarity between an initial subtitle clause and its adjacent previous initial subtitle clause: the smaller the edit distance and the length distance, the greater the degree of similarity, and if both the edit distance and the length distance of an initial subtitle clause are 0, the clause is the same as its adjacent previous initial subtitle clause. The plurality of initial subtitle clauses may then be filtered based on the edit distance and the length distance to obtain a plurality of target subtitle clauses with an arrangement order.
The following takes the acquisition of the edit distance and the length distance as a specific example. Referring to fig. 8, fig. 8 is a schematic view of another scene of a video subtitle extraction method according to an embodiment of the present application. As shown in fig. 8, the initial subtitle text 100 in fig. 8 is obtained by cropping target sub-images from a plurality of image frames of the target video, generating an image to be recognized from them in order, and performing character recognition on that image. The initial subtitle text 100 includes 14 initial subtitle clauses (where the 8th initial subtitle clause has no content, i.e., the target sub-image corresponding to that clause contains no characters). Taking the 2nd initial subtitle clause, "i si shi di ziqi buries deeply in fog", as an example, its adjacent previous initial subtitle clause is "i shi di ziqi buries deeply". It is detected that the minimum number of target editing operations required to convert the 2nd initial subtitle clause into its adjacent previous initial subtitle clause is 4, comprising 3 deletion operations (deleting "fog", "interior", and "self") and 1 replacement operation (replacing "department" with "learning"), so the edit distance of the 2nd initial subtitle clause is 4. Meanwhile, it is detected that the 2nd initial subtitle clause has 8 characters and its adjacent previous initial subtitle clause has 5 characters, a difference of 3 characters, so the length distance of the 2nd initial subtitle clause is 3. The edit distance and the length distance of the 2nd initial subtitle clause are thus obtained. In the same way, the edit distance and the length distance of each clause from the 2nd to the 14th initial subtitle clause can be obtained (since the 8th initial subtitle clause has no content, it has no edit distance or length distance); the specific process is not repeated here. Referring to table 101 in fig. 8, the edit distances and length distances of the 2nd and 3rd initial subtitle clauses are 4 and 3 respectively, the edit distances and length distances of the 4th, 6th, 7th, 12th, and 14th initial subtitle clauses are all 0, the edit distances and length distances of the 5th and 13th initial subtitle clauses are 11 and 7 respectively, the edit distance and length distance of the 9th initial subtitle clause are 6 and 2 respectively, and the edit distance and length distance of the 11th initial subtitle clause are both 2.
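Under the above definitions, the edit distance is the standard Levenshtein distance over single characters and the length distance is the character-count difference, so the values in table 101 can be reproduced with a conventional dynamic-programming sketch like the following (not code from the patent):

```python
def edit_distance(clause: str, previous: str) -> int:
    """Minimum number of single-character insert / delete / replace operations
    needed to convert `clause` into its adjacent previous clause (Levenshtein)."""
    dp = list(range(len(previous) + 1))
    for i, ch_a in enumerate(clause, start=1):
        prev_diag, dp[0] = dp[0], i
        for j, ch_b in enumerate(previous, start=1):
            prev_diag, dp[j] = dp[j], min(dp[j] + 1,                   # delete ch_a
                                          dp[j - 1] + 1,               # insert ch_b
                                          prev_diag + (ch_a != ch_b))  # replace / keep
    return dp[-1]

def length_distance(clause: str, previous: str) -> int:
    """Difference in character count between a clause and its adjacent previous clause."""
    return abs(len(clause) - len(previous))
```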
In some possible embodiments, the service server may determine candidate subtitle clauses from the plurality of initial subtitle clauses based on the edit distance and the length distance of each initial subtitle clause, and determine all the determined candidate subtitle clauses as the target subtitle clauses to obtain the target subtitle text. Specifically, the first initial subtitle clause of the plurality of initial subtitle clauses is determined as a candidate subtitle clause. For any initial subtitle clause after the first one, if the difference between its edit distance and its length distance is not smaller than a preset threshold, both that initial subtitle clause and its adjacent previous initial subtitle clause are determined as candidate subtitle clauses; otherwise, the adjacent previous initial subtitle clause is removed and only that initial subtitle clause is determined as a candidate subtitle clause. With this determination rule, no repeated clauses exist among the finally obtained candidate subtitle clauses, so the candidate subtitle clauses can form the target subtitle text. The method is simple to operate, extracts the target subtitle text well, and has broad applicability.
Referring again to fig. 8, the extraction of the plurality of target subtitle clauses from the plurality of initial subtitle clauses in the initial subtitle text 100 of fig. 8 is described as an example, with the preset threshold set to 3. First, the 1st initial subtitle clause, "I'm used to deeply bury", is determined as a candidate subtitle clause. For the 2nd initial subtitle clause, the difference between its edit distance and length distance is 1, which is smaller than the preset threshold 3, so the 1st initial subtitle clause is removed and the 2nd initial subtitle clause is determined as the candidate subtitle clause. The difference between the edit distance and the length distance of the 3rd initial subtitle clause is also smaller than the preset threshold, so the 2nd initial subtitle clause is removed and the 3rd initial subtitle clause is determined as the candidate subtitle clause. The difference between the edit distance and the length distance of the 4th initial subtitle clause is likewise smaller than the preset threshold, so the 3rd initial subtitle clause is removed and the 4th initial subtitle clause is determined as the candidate subtitle clause. That is, after the above process, the current candidate subtitle clause is only the 4th initial subtitle clause. The difference between the edit distance and the length distance of the 5th initial subtitle clause is greater than the preset threshold, so the 5th initial subtitle clause is additionally determined as a candidate subtitle clause while the 4th is retained; that is, the current candidate subtitle clauses are the 4th and 5th initial subtitle clauses. Candidate subtitle clauses are then extracted from the 14 initial subtitle clauses in the initial subtitle text 100 by the same method, and the final candidate subtitle clauses are determined to be the following of the 14 initial subtitle clauses: the 4th, 7th, 12th, and 14th initial subtitle clauses. The target subtitle clauses 102 of the initial subtitle text 100 are obtained from these candidate subtitle clauses: "I's habit is to bury deeply in fog and feel harder by oneself", "fetch separates first", "I likes to bury in fog and is lifted up separately", and "wake up again brightly". In this way, candidate subtitle clauses are determined from the initial subtitle clauses based on the edit distance and the length distance of each initial subtitle clause, yielding a plurality of target subtitle clauses that differ from one another.
In some possible embodiments, the plurality of target subtitle clauses may be stored in a designated storage space of the service server, and the designated storage space may be a text list. Before the plurality of target subtitle clauses are extracted based on the plurality of initial subtitle clauses, the text list may be empty. In the process of extracting the alternative subtitle clauses based on the editing distance and the length distance of each initial subtitle clause, the first initial subtitle clause may first be placed into the first storage unit of the text list, that is, the first initial subtitle clause is determined as the first alternative subtitle clause. Then, for any initial subtitle clause after the first one, if the difference between its editing distance and its length distance is not smaller than the preset threshold, this initial subtitle clause is added into the latest vacant storage unit of the text list, which represents that the adjacent previous initial subtitle clause remains determined as an alternative subtitle clause. Otherwise, the adjacent previous initial subtitle clause is removed from the text list and this initial subtitle clause is added into the latest vacant storage unit, which represents that the adjacent previous initial subtitle clause is removed and only this initial subtitle clause is determined as an alternative subtitle clause. Through the above process, the initial subtitle clauses contained in the text list are the finally extracted alternative subtitle clauses, and these alternative subtitle clauses form the plurality of target subtitle clauses. Initial subtitle clauses that deviate from the lyrics of the corresponding target song or that repeat other clauses are removed in the process of extracting the target subtitle clauses, so the problem of repeated subtitle extraction is solved; the method is simple to operate, achieves a good subtitle extraction effect, and has strong applicability.
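As a minimal sketch of the text-list variant just described, the snippet below lets an ordinary Python list stand in for the designated storage space; the distance callables and the threshold are the same assumptions as in the earlier sketch.

```python
def build_text_list(clauses, edit_distance, length_distance, threshold=3):
    """Text-list variant: the list always holds the current alternative subtitle clauses."""
    text_list = []                                     # designated storage space, empty at first
    for curr in clauses:
        if not text_list:
            text_list.append(curr)                     # first clause fills the first storage unit
        elif edit_distance(curr, text_list[-1]) - length_distance(curr, text_list[-1]) >= threshold:
            text_list.append(curr)                     # previous clause stays; current one is appended
        else:
            text_list[-1] = curr                       # previous near-duplicate removed, current kept
    return text_list                                   # the final alternative (target) subtitle clauses
```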
And S104, obtaining the target caption text of the target video based on the plurality of target caption clauses with the arrangement sequence.
In some possible embodiments, the service server may obtain the target subtitle text of the target video based on the plurality of target subtitle clauses with the arrangement order. For example, referring to fig. 8 again, a plurality of target subtitle clauses 102 with an arrangement order are extracted from the plurality of initial subtitle clauses in the initial subtitle text 100 of fig. 8: "I's habit is to bury deeply in fog and feel harder by oneself", "fetch separates first", "I likes to bury in fog and is lifted up separately", and "wake up again brightly". These target subtitle clauses are concatenated in order to form the target subtitle text of the target video. Initial subtitle clauses that are incomplete or that repeat the subtitles in the target video are removed in the process of obtaining the target subtitle text, so the problem of repeated subtitle extraction is solved; the method is simple to operate, achieves a good subtitle extraction effect, and has strong applicability.
In some possible embodiments, the service server may further obtain the name of the target song based on the target subtitle text, and associate the target video with the name of the target song so as to push the target video more effectively. For example, user A records a target video of a complete cover performance of song B (the video contains the lyrics of song B) through a related device (a mobile phone, a camera, etc.) and uploads the target video to the service server through the karaoke application. The service server receives the target video and obtains the corresponding target subtitle text, which is consistent with the lyrics of song B. The service server may then obtain the name of song B based on the target subtitle text and associate the cover video of song B uploaded by user A with song B. Other users can then view the cover video of song B uploaded by user A on a related page of the karaoke application (for example, when opening the application page related to song B), so that the target video is pushed more effectively and the user experience is improved.
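The embodiment does not spell out how the song name is located from the target subtitle text. Purely as an illustrative assumption, the sketch below matches the recognized text against a hypothetical lyrics database using a length-normalized edit distance; the database, the function names, and the scoring are not taken from the embodiment.

```python
def match_song_name(target_caption_text, lyrics_db, edit_distance):
    """Return the name of the song whose lyrics best match the recognized caption text.

    lyrics_db: hypothetical mapping {song_name: lyrics_text} (an assumption, not from the embodiment)
    """
    if not lyrics_db:
        return None

    def normalized_distance(lyrics):
        d = edit_distance(target_caption_text, lyrics)
        return d / max(len(target_caption_text), len(lyrics), 1)   # normalize so lengths are comparable

    return min(lyrics_db, key=lambda name: normalized_distance(lyrics_db[name]))
```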
In this embodiment of the application, the service server may obtain a target video (which may be sent by an application client, such as a karaoke application, installed on the terminal device) and obtain a plurality of image frames from the target video, where the plurality of image frames may be image frames extracted from the target video that carry initial subtitle clauses. The obtained image frames are input into a pixel point classification model (which may be a pixel point classification model based on the VGG16 network architecture), and the model outputs, for each image frame, the probability that each pixel point belongs to the subtitle region of that image frame. The number of times each pixel point is determined as belonging to a subtitle region across the image frames is then counted, and target pixel points are determined based on these counts: if the number of times a pixel point is determined as a subtitle region in the image frames reaches a determination threshold (for example, 2 times or more), the pixel point is a target pixel point in the binary image of the target subtitle region. The plurality of target pixel points may form one or more pixel point sets, and each pixel point set may correspond to one target subtitle region. From the target subtitle regions in the binary image, one or more target subtitle regions whose number of target pixel points is greater than a preset number (for example, greater than 150 pixel points) may be extracted, the target subtitle region located at the lowest position may be selected from them, and the positioning information of the target video may be extracted based on this target subtitle region. In this way, pixel points that do not belong to the target subtitle text are effectively prevented from being included in the target subtitle region, which further improves the accuracy of the target subtitle region. Here, the positioning information includes a region start coordinate, a region length, and a region width. To avoid incomplete characters in the intercepted subtitle region (which would affect the extraction of the target subtitle text) when the subtitle region of each image frame is obtained through the target subtitle region determined by the positioning information, a preset region extension size is determined based on the region length and the region width in the positioning information, and the target region start coordinate, the target region length, and the target region width are then determined based on the region start coordinate, the region length, the region width, and the preset region extension size. The target subtitle region determined by the target region start coordinate, the target region length, and the target region width prevents incomplete characters from occurring when the subtitle region of each image frame is intercepted, improving the accuracy of target subtitle text extraction.
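The following sketch illustrates the voting and localization steps described above, assuming the per-frame subtitle probability maps have already been produced by the pixel point classification model. The 0.5 probability cut-off, the use of OpenCV connected components, and the fixed padding used as the region extension size are illustrative assumptions, not values fixed by the embodiment.

```python
import numpy as np
import cv2  # OpenCV, assumed available for connected-component analysis

def locate_target_caption_region(prob_maps, vote_threshold=2, min_pixels=150, pad=10):
    """prob_maps: list of HxW arrays, each giving the per-pixel subtitle probability of one frame.

    Returns (x, y, w, h) of the lowest sufficiently large caption region, already extended
    by `pad` pixels on each side (clipped to the image), or None if no region qualifies.
    """
    votes = sum((p >= 0.5).astype(np.uint8) for p in prob_maps)   # times each pixel was judged caption
    binary = (votes >= vote_threshold).astype(np.uint8)           # binary image of target pixels
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    boxes = [stats[i] for i in range(1, n) if stats[i][cv2.CC_STAT_AREA] > min_pixels]
    if not boxes:
        return None
    x, y, w, h, _ = max(boxes, key=lambda s: s[cv2.CC_STAT_TOP])  # lowest region = largest top coordinate
    H, W = binary.shape
    x0, y0 = max(x - pad, 0), max(y - pad, 0)                     # preset region extension size
    x1, y1 = min(x + w + pad, W), min(y + h + pad, H)
    return int(x0), int(y0), int(x1 - x0), int(y1 - y0)
```

Choosing the connected component with the largest top coordinate implements the "lowest subtitle region" selection, which helps keep titles or watermarks near the top of the frame out of the target subtitle region.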
The service server may also obtain a plurality of image frames of the target video based on a preset frame extraction frequency, determine the subtitle region of each image frame through the target subtitle region, intercept the target sub-image of each image frame accordingly, sort the target sub-images according to the order in which the image frames they belong to appear in the target video, and obtain the initial subtitle text of the target video from the sorted target sub-images. The service server may then detect the similarity degree of each initial subtitle clause in the initial subtitle text, where the similarity degree includes an editing distance and a length distance. The first initial subtitle clause in the plurality of initial subtitle clauses with an arrangement order is determined as an alternative subtitle clause. For any initial subtitle clause after the first one, if the difference between its editing distance and its length distance is not smaller than the preset threshold, this initial subtitle clause and the adjacent previous initial subtitle clause are both determined as alternative subtitle clauses; otherwise, the adjacent previous initial subtitle clause is removed and only this initial subtitle clause is determined as an alternative subtitle clause. With this determination rule, no repeated clauses exist among the finally obtained alternative subtitle clauses, so the alternative subtitle clauses can form the target subtitle text. The method is simple to operate, achieves a good subtitle extraction effect, and has strong applicability.
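A minimal end-to-end sketch of the frame-sampling and recognition steps is given below, assuming OpenCV for decoding and cropping and an arbitrary OCR callable. For simplicity it recognizes each cropped sub-image separately rather than stitching the sorted sub-images into a single image to be recognized as the embodiment describes; the sampling interval and parameter names are assumptions.

```python
import cv2  # assumed available for video decoding and frame cropping

def extract_initial_caption_clauses(video_path, region, ocr, sample_every_n_frames=15):
    """Crop the target caption region from sampled frames and recognize them in order.

    video_path: path of the target video
    region:     (x, y, w, h) target caption region, e.g. from locate_target_caption_region()
    ocr:        callable(image) -> recognized text; any OCR engine can be plugged in here
    """
    x, y, w, h = region
    cap = cv2.VideoCapture(video_path)
    clauses, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % sample_every_n_frames == 0:           # preset frame-extraction frequency
            sub_image = frame[y:y + h, x:x + w]           # caption area of this image frame
            text = ocr(sub_image).strip()
            if text:
                clauses.append(text)                      # kept in the order of appearance
        index += 1
    cap.release()
    return clauses                                        # initial caption clauses with arrangement order
```

The returned clauses can then be fed to select_alternative_clauses() above to obtain the target subtitle clauses.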
An embodiment of the present application further provides a video subtitle extracting apparatus. Referring to fig. 9, fig. 9 is a schematic structural diagram of the video subtitle extracting apparatus provided in the embodiment of the present application. In this embodiment, the apparatus may include the following modules:
an obtaining module 21, configured to obtain a target video;
a target caption area obtaining module 22, configured to determine a target caption area of the target video based on a plurality of image frames in the target video obtained by the obtaining module 21;
an initial caption text generating module 23, configured to determine a subtitle region of each image frame in the target video based on the target subtitle region determined by the target caption area obtaining module 22, and obtain an initial subtitle text of the target video based on the subtitle region of each image frame, where the initial subtitle text includes a plurality of initial subtitle clauses with an arrangement order;
a target caption text generating module 24, configured to obtain a plurality of target subtitle clauses with an arrangement order from the plurality of initial subtitle clauses based on the similarity degree between adjacent initial subtitle clauses in the initial subtitle text obtained by the initial caption text generating module 23, where characters between adjacent target subtitle clauses are different from each other;
the target caption text generating module 24 is further configured to obtain a target caption text of the target video based on the plurality of target caption clauses with the arrangement order.
In some possible embodiments, the target subtitle region obtaining module 22 is further configured to:
acquiring a plurality of image frames in the target video, and inputting the image frames into a pixel point classification model;
outputting the probability that each pixel point in each image frame belongs to the subtitle region based on the pixel point classification model, and acquiring the subtitle positioning information of the target video based on the probability that each pixel point in each image frame belongs to the subtitle region;
and determining a target subtitle area of the target video based on the subtitle positioning information.
In some possible embodiments, the target subtitle region obtaining module 22 is further configured to:
obtaining the number of times that each pixel point is determined as a subtitle region in each image frame based on the probability that each pixel point in each image frame belongs to the subtitle region;
and determining a target pixel point from the pixel points based on the number of times that the pixel points are determined as the subtitle areas in the image frames, and acquiring subtitle positioning information of the target video based on the target pixel point.
In some possible embodiments, the target subtitle region obtaining module 22 is further configured to:
determining pixel points which are determined as subtitle areas in the image frames and have the times larger than or equal to a time threshold value from the pixel points as target pixel points to obtain a plurality of target pixel points, wherein the target pixel points form one or more pixel point sets;
extracting one or more target pixel point sets with the number of the target pixel points larger than a preset number based on the one or more pixel point sets;
and selecting a target pixel point set in a specified display area from the one or more target pixel point sets, and acquiring subtitle positioning information of the target video based on the selected target pixel point set.
In some possible embodiments, the target subtitle region obtaining module 22 is further configured to:
determining a preset region extension size based on the region length and the region width included in the subtitle positioning information, and determining the target region start coordinate, the target region length and the target region width based on the region start coordinate, the region length, the region width and the preset region extension size;
and determining an area consisting of the initial coordinates of the target area, the length of the target area and the width of the target area as the target subtitle area.
In some possible embodiments, the initial subtitle text generating module 23 is further configured to:
acquiring a plurality of image frames of the target video based on a preset frame extraction frequency, and determining a subtitle area of each image frame from the plurality of image frames based on the target subtitle area;
intercepting target sub-images of each image frame from each image frame based on the subtitle area of each image frame to obtain a plurality of target sub-images, and sequencing the target sub-images according to the appearance sequence of the image frame to which each target sub-image belongs in the target video;
and generating an image to be recognized based on the sequenced target sub-images, and performing character recognition on the image to be recognized to obtain an initial subtitle text of the target video.
In some possible embodiments, the target subtitle text generating module 24 is further configured to:
detecting the minimum number of target editing operations required to convert any initial subtitle clause into the adjacent previous initial subtitle clause, and obtaining the editing distance of the any initial subtitle clause based on the number of target editing operations;
detecting the difference in the number of characters between the any initial subtitle clause and the adjacent previous initial subtitle clause, and obtaining the length distance of the any initial subtitle clause based on the character-count difference (a minimal sketch of these two distances follows this module description);
and determining a target caption clause from the plurality of initial caption clauses based on the edit distance and the length distance of each of the initial caption clauses.
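For concreteness, one standard way to compute the two distances used by this module is sketched below. The dynamic-programming Levenshtein implementation and the absolute character-count difference are ordinary textbook choices assumed here, and they match the distance callables assumed in the earlier sketches.

```python
def edit_distance(curr, prev):
    """Minimum number of single-character insertions, deletions and substitutions
    needed to convert clause `curr` into the adjacent previous clause `prev`."""
    m, n = len(curr), len(prev)
    dp = list(range(n + 1))                  # dp[j]: distance between curr[:0] and prev[:j]
    for i in range(1, m + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, n + 1):
            prev_diag, dp[j] = dp[j], min(
                dp[j] + 1,                                    # delete curr[i-1]
                dp[j - 1] + 1,                                # insert prev[j-1]
                prev_diag + (curr[i - 1] != prev[j - 1]),     # substitute (free on a match)
            )
    return dp[n]

def length_distance(curr, prev):
    """Difference in the number of characters between the two clauses."""
    return abs(len(curr) - len(prev))
```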
In some possible embodiments, the target subtitle text generating module 24 is further configured to:
determining a first initial caption clause in the plurality of initial caption clauses as an alternative caption clause;
if the difference between the editing distance and the length distance of any initial subtitle clause after the first initial subtitle clause is not smaller than a preset threshold, determining the any initial subtitle clause and the adjacent previous initial subtitle clause as alternative subtitle clauses; otherwise, removing the adjacent previous initial subtitle clause of the any initial subtitle clause and determining the any initial subtitle clause as an alternative subtitle clause, so as to determine the alternative subtitle clauses from the plurality of initial subtitle clauses; wherein the previous initial subtitle clause adjacent to the second initial subtitle clause in the plurality of initial subtitle clauses is the first initial subtitle clause;
and determining all the determined alternative subtitle clauses as target subtitle clauses.
In the embodiment of the application, the video subtitle extracting apparatus can obtain the target subtitle region from the target video to obtain the position distribution of the target subtitle text in each image frame of the video, and obtain the initial subtitle text (including a plurality of initial subtitle clauses with an arrangement order) from the image frames of the target video through the target subtitle region. Target subtitle clause extraction is then performed on the plurality of initial subtitle clauses with the arrangement order based on the similarity degree between adjacent initial subtitle clauses, to obtain a plurality of target subtitle clauses with an arrangement order; initial subtitle clauses that are incomplete or that repeat subtitles in the target video can be removed in the extraction process, so that the target subtitle text can be obtained from the plurality of target subtitle clauses. The problem of repeated subtitle extraction can thus be solved; the operation is simple, the subtitle extraction effect is good, and the applicability is strong.
In the embodiment of the present application, the modules in the apparatus shown in fig. 9 may be respectively or entirely combined into one or several other modules to form the apparatus, or some of the modules may be further split into multiple functionally smaller modules to form the apparatus, which may implement the same operation without affecting implementation of technical effects of the embodiment of the present application. The modules are divided based on logic functions, and in practical application, the functions of one module can be realized by a plurality of modules, or the functions of a plurality of modules can be realized by one module. In other possible implementations of the present application, the apparatus may also include other modules, and in practical applications, the functions may also be implemented by being assisted by other modules, and may be implemented by cooperation of a plurality of modules, which is not limited herein.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 10, the computer device 1000 may be the service server in the embodiment corresponding to fig. 2. The computer device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005, and the computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and a keyboard (Keyboard), and the optional user interface 1003 may also include standard wired and wireless interfaces. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory), and may optionally be at least one storage device located remotely from the processor 1001. As shown in fig. 10, the memory 1005, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
The network interface 1004 in the computer device 1000 may also be connected to the terminal 200a in the embodiment corresponding to fig. 2 through a network, and the optional user interface 1003 may further include a display screen (Display) and a keyboard (Keyboard). In the computer device 1000 shown in fig. 10, the network interface 1004 may provide a network communication function, the user interface 1003 provides an input interface for a user (or developer), and the processor 1001 may be configured to call the device control application stored in the memory 1005 to implement the video subtitle extraction method in the embodiment corresponding to fig. 2.
It should be understood that the computer device 1000 described in this embodiment of the present application may perform the video subtitle extraction method described in the embodiment corresponding to fig. 2, which is not repeated here. In addition, the beneficial effects of adopting the same method are not described in detail again.
Moreover, it should be noted that an embodiment of the present application further provides a computer-readable storage medium, and the computer-readable storage medium stores a computer program executed by the above-mentioned video subtitle extracting apparatus. The computer program includes program instructions, and when the processor executes the program instructions, the video subtitle extraction method described in the embodiment corresponding to fig. 2 can be performed, which is not repeated here. In addition, the beneficial effects of adopting the same method are not described in detail again. For technical details not disclosed in the embodiments of the computer-readable storage medium referred to in the present application, reference is made to the description of the method embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present application and is not intended to limit the scope of the present application; equivalent variations and modifications made in accordance with the present application shall still fall within its scope.

Claims (10)

1. A method for extracting a video subtitle, the method comprising:
acquiring a target video, and determining a target subtitle area of the target video based on a plurality of image frames in the target video;
determining a subtitle region of each image frame in the target video based on the target subtitle region, and obtaining an initial subtitle text of the target video based on the subtitle region of each image frame, wherein the initial subtitle text comprises a plurality of initial subtitle clauses with an arrangement sequence;
obtaining a plurality of target caption clauses with an arrangement sequence from the plurality of initial caption clauses based on the similarity degree between the adjacent initial caption clauses, wherein characters between the adjacent target caption clauses are different from each other;
and obtaining the target caption text of the target video based on the plurality of target caption clauses with the arrangement sequence.
2. The method of claim 1, wherein the determining a target caption region of the target video based on a plurality of image frames in the target video comprises:
acquiring a plurality of image frames in the target video, and inputting the image frames into a pixel point classification model to obtain the probability that each pixel point in each image frame output by the pixel point classification model belongs to a subtitle region;
acquiring subtitle positioning information of the target video based on the probability that each pixel point in each image frame belongs to a subtitle region;
and determining a target subtitle area of the target video based on the subtitle positioning information.
3. The method according to claim 2, wherein the obtaining subtitle location information of the target video based on the probability that each pixel point in each image frame belongs to a subtitle region comprises:
obtaining the number of times that each pixel point is determined as a subtitle region in each image frame based on the probability that each pixel point in each image frame belongs to the subtitle region;
and determining a target pixel point from the pixel points based on the number of times that the pixel points are determined as subtitle regions in the image frames, and acquiring subtitle positioning information of the target video based on the target pixel point.
4. The method according to claim 3, wherein the determining a target pixel point from the pixel points based on the number of times the pixel points are determined as the caption area in the image frames, and obtaining the caption positioning information of the target video based on the target pixel point comprises:
determining pixel points which are determined as subtitle areas in the image frames and have the times larger than or equal to a time threshold value from the pixel points as target pixel points to obtain a plurality of target pixel points, wherein the target pixel points form one or more pixel point sets;
selecting one or more target pixel point sets with the number of the target pixel points larger than a preset number from the one or more pixel point sets;
and selecting a target pixel point set in a designated display area from the one or more target pixel point sets, and acquiring subtitle positioning information of the target video based on the selected target pixel point set.
5. The method of claim 4, wherein the subtitle positioning information comprises region start coordinates, region length, and region width; the determining a target caption area of the target video based on the caption positioning information includes:
determining a preset region extension size based on the region length and the region width included in the subtitle positioning information, and determining a target region starting coordinate, a target region length and a target region width based on the region starting point coordinate, the region length, the region width and the preset region extension size;
and determining a region consisting of the starting coordinates of the target region, the length of the target region and the width of the target region as the target subtitle region.
6. The method of claim 1, wherein the determining a caption region for each image frame in the target video based on the target caption region and obtaining an initial caption text for the target video based on the caption region for each image frame comprises:
acquiring a plurality of image frames of the target video based on a preset frame extraction frequency, and determining a subtitle area of each image frame from the plurality of image frames based on the target subtitle area;
intercepting target sub-images of each image frame from each image frame based on the subtitle area of each image frame to obtain a plurality of target sub-images, and sequencing the target sub-images according to the appearance sequence of the image frame to which each target sub-image belongs in the target video;
and generating an image to be recognized based on the sequenced target sub-images, and performing character recognition on the image to be recognized to obtain an initial subtitle text of the target video.
7. The method of claim 1, wherein the similarity measure comprises an edit distance and a length distance; the obtaining a plurality of target caption clauses with an arrangement order from the plurality of initial caption clauses based on the similarity degree between the adjacent initial caption clauses comprises:
detecting the minimum number of target editing operations required to convert any initial subtitle clause into the adjacent previous initial subtitle clause, and obtaining the editing distance of the any initial subtitle clause based on the number of target editing operations;
detecting the difference in the number of characters between the any initial subtitle clause and the adjacent previous initial subtitle clause, and obtaining the length distance of the any initial subtitle clause based on the character-count difference;
and determining a target subtitle clause from the plurality of initial subtitle clauses based on the editing distance and the length distance of each initial subtitle clause.
8. The method of claim 7, wherein said determining a target caption clause from said plurality of initial caption clauses based on an edit distance and a length distance of each of said initial caption clauses comprises:
determining a first initial subtitle clause in the plurality of initial subtitle clauses as an alternative subtitle clause;
if the difference between the editing distance and the length distance of any initial subtitle clause after the first initial subtitle clause is not smaller than a preset threshold, determining the any initial subtitle clause and the adjacent previous initial subtitle clause as alternative subtitle clauses; otherwise, removing the adjacent previous initial subtitle clause of the any initial subtitle clause and determining the any initial subtitle clause as an alternative subtitle clause; wherein the previous initial subtitle clause adjacent to the second initial subtitle clause in the plurality of initial subtitle clauses is the first initial subtitle clause;
and determining all the determined alternative subtitle clauses as target subtitle clauses.
9. A computer device, comprising: a processor, a memory, and a network interface;
the processor is coupled to the memory and the network interface, wherein the network interface is configured to provide data communication functionality, the memory is configured to store program code, and the processor is configured to invoke the program code to perform the method of any of claims 1-8.
10. A computer-readable storage medium, in which a computer program is stored which is adapted to be loaded by a processor and to carry out the method of any one of claims 1 to 8.
CN202111595067.0A 2021-12-20 2021-12-20 Video subtitle extraction method, apparatus, and computer-readable storage medium Pending CN114363535A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111595067.0A CN114363535A (en) 2021-12-20 2021-12-20 Video subtitle extraction method, apparatus, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111595067.0A CN114363535A (en) 2021-12-20 2021-12-20 Video subtitle extraction method, apparatus, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN114363535A (en) 2022-04-15

Family

ID=81101318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111595067.0A Pending CN114363535A (en) 2021-12-20 2021-12-20 Video subtitle extraction method, apparatus, and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN114363535A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013097072A1 (en) * 2011-12-26 2013-07-04 华为技术有限公司 Method and apparatus for recognizing a character of a video
US20190114486A1 (en) * 2016-08-08 2019-04-18 Tencent Technology (Shenzhen) Company Limited Subtitle extraction method and device, storage medium
CN108304839A (en) * 2017-08-31 2018-07-20 腾讯科技(深圳)有限公司 A kind of image processing method and device
CN111582241A (en) * 2020-06-01 2020-08-25 腾讯科技(深圳)有限公司 Video subtitle recognition method, device, equipment and storage medium
CN112749696A (en) * 2020-09-01 2021-05-04 腾讯科技(深圳)有限公司 Text detection method and device
CN112135108A (en) * 2020-09-27 2020-12-25 苏州科达科技股份有限公司 Video stream subtitle detection method, system, device and storage medium
CN112738640A (en) * 2020-12-28 2021-04-30 出门问问(武汉)信息科技有限公司 Method and device for determining subtitles of video stream and readable storage medium
CN113762038A (en) * 2021-04-27 2021-12-07 腾讯科技(深圳)有限公司 Video text recognition method and device, storage medium and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHAO X: "Text from corners: A novel approach to detect text and caption in videos", IEEE TRANSACTIONS ON IMAGE PROCESSING, 31 December 2011 (2011-12-31) *
ZHAO XINGCHI: "Video text detection technology based on deep learning", China Master's Theses Full-text Database, 15 August 2019 (2019-08-15) *
CHEN WENTING; LI LEI; YANG YINGYUN: "Detection and localization of karaoke subtitles", Journal of Communication University of China (Natural Science Edition), no. 02, 30 June 2008 (2008-06-30) *

Similar Documents

Publication Publication Date Title
CN109117777B (en) Method and device for generating information
CN110837579A (en) Video classification method, device, computer and readable storage medium
CN109919244B (en) Method and apparatus for generating a scene recognition model
JP7394809B2 (en) Methods, devices, electronic devices, media and computer programs for processing video
CN109033261B (en) Image processing method, image processing apparatus, image processing device, and storage medium
CN111191067A (en) Picture book identification method, terminal device and computer readable storage medium
CN113569037A (en) Message processing method and device and readable storage medium
CN109408672B (en) Article generation method, article generation device, server and storage medium
CN110347866B (en) Information processing method, information processing device, storage medium and electronic equipment
CN111241340A (en) Video tag determination method, device, terminal and storage medium
CN112084756B (en) Conference file generation method and device and electronic equipment
CN113010698B (en) Multimedia interaction method, information interaction method, device, equipment and medium
WO2019146466A1 (en) Information processing device, moving-image retrieval method, generation method, and program
CN113407775B (en) Video searching method and device and electronic equipment
CN112052352B (en) Video ordering method, device, server and storage medium
CN109685079B (en) Method and device for generating characteristic image category information
CN114697762B (en) Processing method, processing device, terminal equipment and medium
CN116049490A (en) Material searching method and device and electronic equipment
CN114363535A (en) Video subtitle extraction method, apparatus, and computer-readable storage medium
CN114697741B (en) Multimedia information playing control method and related equipment
CN113705154A (en) Video-based content interaction method and device, computer equipment and storage medium
CN113707179A (en) Audio identification method, device, equipment and medium
CN112165626A (en) Image processing method, resource acquisition method, related device and medium
CN111797765A (en) Image processing method, image processing apparatus, server, and storage medium
CN108182191B (en) Hotspot data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination