CN115129933A - Video text extraction method, device, equipment, medium and computer program product - Google Patents

Video text extraction method, device, equipment, medium and computer program product

Info

Publication number
CN115129933A
Authority
CN
China
Prior art keywords
text
video
video frame
frame
key
Prior art date
Legal status
Pending
Application number
CN202210364170.2A
Other languages
Chinese (zh)
Inventor
宋浩
黄珊
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210364170.2A
Publication of CN115129933A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a video text extraction method, apparatus, device, medium and computer program product, and the related embodiments can be applied to scenes such as artificial intelligence, intelligent transportation and automatic driving. The method comprises the following steps: performing text feature extraction processing on each video frame in a target video to obtain the text feature of each video frame; generating one or more key frames according to the text features of the video frames, wherein the similarity between the text features of any two key frames is smaller than a first similarity threshold; performing text recognition processing on each key frame to obtain a recognition text of each key frame and a text region where the recognition text is located; and acquiring the text content of the recognition text of each key frame, and merging the recognition texts of the key frames according to the text content of the recognition text of each key frame and the corresponding text region to obtain a merged text, the merged text being taken as the video text of the target video. The method can increase the rate at which text is extracted from video.

Description

Video text extraction method, device, equipment, medium and computer program product
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a video text extraction method, apparatus, device, medium, and computer program product.
Background
Video text extraction aims at extracting the text contained in a video. In existing video text extraction methods, the text contained in each video frame of a video is usually extracted, and the extracted repeated texts are merged to obtain the video text of the video. However, when different video frames of a video contain the same text, repeatedly extracting that text from each of those video frames increases time consumption, resulting in a low rate of extracting the text contained in the video.
Disclosure of Invention
The embodiments of the application provide a video text extraction method, apparatus, device, medium and computer program product, which can improve the speed of extracting text from videos.
In one aspect, an embodiment of the present application provides a method for extracting a video text, including:
performing text feature extraction processing on each video frame in a target video to obtain text features of each video frame;
generating one or more key frames according to the text characteristics of the video frames; the similarity between the text features of any two key frames is smaller than a first similarity threshold value;
performing text recognition processing on each key frame to obtain a recognition text of each key frame and a text area where the recognition text is located;
acquiring the text content of the recognition text of each key frame, and merging the recognition texts of the key frames according to the text content of the recognition text of each key frame and the corresponding text areas to obtain a merged text; wherein the merged text is taken as the video text of the target video.
In one aspect, an embodiment of the present application provides a video text extraction apparatus, including:
the key frame generating unit is used for extracting text characteristics of each video frame in the target video to obtain the text characteristics of each video frame;
the key frame generating unit is further used for generating one or more key frames according to the text features of the video frames; the similarity between the text features of any two key frames is smaller than a first similarity threshold;
the text recognition unit is used for performing text recognition processing on each key frame to obtain a recognition text of each key frame and a text area where the recognition text is located;
the video text generation unit is used for acquiring the text content of the recognition text of each key frame, and merging the recognition texts of the key frames according to the text content of the recognition text of each key frame and the corresponding text areas to obtain a merged text; wherein the merged text is taken as the video text of the target video.
In one aspect, an embodiment of the present application provides a video text extraction device, where the video text extraction device includes an input interface and an output interface, and further includes:
a processor adapted to implement one or more instructions; and
a computer storage medium having stored thereon one or more instructions adapted to be loaded by the processor and to perform the above-described video text extraction method.
In one aspect, an embodiment of the present application provides a computer storage medium, where computer program instructions are stored in the computer storage medium, and when the computer program instructions are executed by a processor, they are configured to perform the above video text extraction method.
In one aspect, embodiments of the present application provide a computer program product, which includes a computer program, the computer program being stored in a computer storage medium; the processor of the video text extraction device reads the computer program from the computer storage medium, and the processor executes the computer program, so that the video text extraction device executes the video text extraction method.
In the embodiment of the application, text feature extraction processing can be performed on each video frame in the target video to obtain the text feature of each video frame; one or more key frames are generated according to the text features of the video frames, wherein the similarity between the text features of any two key frames is smaller than a first similarity threshold; text recognition processing is performed on each key frame to obtain a recognition text of each key frame and a text region where the recognition text is located; the text content of the recognition text of each key frame is acquired, and the recognition texts of the key frames are merged according to the text content of the recognition text of each key frame and the corresponding text region to obtain a merged text, which is taken as the video text of the target video. Since one or more key frames are generated according to the text features of the video frames in the target video, and the video text of the target video is then obtained based on text recognition processing of the key frames only, the time consumed by performing text recognition processing on every video frame in the target video is reduced, and the speed of extracting text from the video can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a trained text mask twin network according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a video text extraction method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a feature extraction area when text feature extraction is performed according to an embodiment of the present application;
FIG. 4 is a diagram illustrating merging two video frames into a key frame according to an embodiment of the present application;
FIG. 5 is a schematic diagram of obtaining a merged text according to an embodiment of the present application;
fig. 6 is a schematic flowchart of another video text extraction method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a masking process for an initial feature map of a video frame according to an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating a determination of similarity between text features of two video frames based on a trained text mask twin network according to an embodiment of the present disclosure;
FIG. 9 is a diagram illustrating merging a plurality of video frames into a key frame according to an embodiment of the present application;
FIG. 10 is a diagram of another embodiment of the present application for merging multiple video frames into a key frame;
FIG. 11 is a diagram of another embodiment of the present application for merging multiple video frames into a key frame;
fig. 12 is a schematic diagram illustrating a text detection area where text included in a key frame is determined according to an embodiment of the present application;
fig. 13 is a schematic diagram of a text detection result of a text detection area according to an embodiment of the present application;
FIG. 14 is a flowchart illustrating a training method for a text mask twin network according to an embodiment of the present disclosure;
fig. 15 is a schematic structural diagram of a video text extraction apparatus according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of a video text extraction device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly includes several directions such as Computer Vision (CV) technology, speech processing technology, natural language processing technology, and Machine Learning (ML)/Deep Learning (DL).
Computer vision is a science that studies how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to identify, track and measure targets, and further performs image processing so that the processed image is more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional (3D) techniques, three-dimensional object reconstruction, virtual reality, augmented reality, simultaneous localization and mapping, and the like.
Based on the video processing technology within computer vision, the present application provides a video text extraction scheme that can improve the rate of extracting text from videos. The video text extraction scheme comprises: performing text feature extraction processing on each video frame in a target video to obtain the text feature of each video frame; generating one or more key frames according to the text features of the video frames; performing text recognition processing on each key frame to obtain a recognition text of each key frame and a text region where the recognition text is located; and merging the recognition texts of the key frames according to the text content of the recognition text of each key frame and the corresponding text region to obtain a merged text; wherein the similarity between the text features of any two key frames is smaller than a first similarity threshold, and the merged text is taken as the video text of the target video.
In a specific implementation, the video text extraction scheme provided by the application can be executed by a video text extraction device, and the video text extraction device can be a terminal device or a server; terminal devices herein may include, but are not limited to: the system comprises a computer, a smart phone, a tablet computer, a notebook computer, intelligent voice interaction equipment, intelligent household appliances, a vehicle-mounted terminal, intelligent wearable equipment and the like; the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
The video text extraction scheme provided by the application can be implemented based on a trained text mask twin network, and the trained text mask twin network can be used for determining the text feature similarity between any two video frames (where the text feature similarity between any two video frames is defined as the similarity between the text features of those two video frames). The trained text mask twin network includes two feature extraction modules and a similarity determination module; the two feature extraction modules are twin modules, have the same structure and share parameters. The two feature extraction modules are used for respectively performing text feature extraction processing on any two video frames to obtain the text features of the two video frames, and the similarity determination module is used for determining the similarity between the text features of the two video frames. Specifically, in the video text extraction scheme provided by the application, the process of performing text feature extraction on each video frame in the target video to obtain the text feature of each video frame can be implemented by a feature extraction module in the trained text mask twin network. Optionally, only one of the two feature extraction modules included in the trained text mask twin network may be retained, so that when determining the similarity between the text features of any two video frames, this single feature extraction module performs text feature extraction processing on the two video frames respectively to obtain their text features, and the similarity between the text features of the two video frames is then determined by the similarity determination module.
Referring to fig. 1, which is a schematic structural diagram of the trained text mask twin network provided in an embodiment of the present application; the trained text mask twin network shown in fig. 1 includes two feature extraction modules and a similarity determination module. Each feature extraction module may include a first convolution sub-module for performing convolution operations and a text localization sub-module for localizing the text contained in a video frame. The first convolution sub-module may be a deep residual network (ResNet), such as a ResNet18 network, a ResNet50 network, or the like; the first convolution sub-module is subsequently described as a ResNet18 network in the embodiments of the present application. Referring to fig. 1, the first convolution sub-module may include one convolution layer with a convolution kernel size of 7 × 7 and 64 convolution kernels, two convolution layers with a convolution kernel size of 3 × 3 and 64 convolution kernels, a convolution layer with a convolution kernel size of 3 × 3 and 128 convolution kernels, and a convolution layer with a convolution kernel size of 3 × 3 and 256 convolution kernels. The similarity determination module may include a second convolution sub-module for performing convolution operations, an Inception-A sub-module, an average pooling layer for pooling, and a fully connected layer (FC layer); referring to fig. 1, the second convolution sub-module may include two convolution layers with a convolution kernel size of 3 × 3 and 512 convolution kernels.
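As a minimal PyTorch sketch of the twin arrangement described above (illustrative only: the text localization sub-module and the Inception-A block are omitted, and the layer list only loosely follows the Fig. 1 channel sizes):

```python
# Minimal sketch of the twin arrangement: one shared feature extractor is applied to
# both frames, their text features are merged, and a similarity head scores them.
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(            # first convolution sub-module (ResNet18-like)
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, frame):                      # frame: (B, 3, H, W) -> text features
        return self.backbone(frame)

class SimilarityHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(                 # second convolution sub-module
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)        # average pooling layer
        self.fc = nn.Linear(512, 1)                # fully connected layer -> similarity score

    def forward(self, feat_a, feat_b):
        x = torch.cat([feat_a, feat_b], dim=1)     # merge the text features of the two frames
        x = self.conv(x)
        x = self.pool(x).flatten(1)
        return torch.sigmoid(self.fc(x))           # similarity in [0, 1]

extractor = FeatureExtractor()                     # one shared ("twin") extractor
head = SimilarityHead()
frame_a, frame_b = torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)
similarity = head(extractor(frame_a), extractor(frame_b))
```

Sharing one extractor for both inputs is what makes the two branches "twin": they necessarily have the same structure and parameters.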
In one embodiment, a feature extraction module in the trained text mask twin network can be used for performing text feature extraction processing on each video frame in a target video to obtain the text features of each video frame; the similarity between the text features of any two adjacent video frames is determined by the similarity determination module in the trained text mask twin network, and one or more key frames are generated based on the determined similarity between the text features of any two adjacent video frames. Further, text recognition processing can be performed on each key frame to obtain a recognition text of each key frame and a text area where the recognition text is located; the recognition texts of the key frames are merged according to the text content of the recognition text of each key frame and the corresponding text areas to obtain a merged text; wherein the similarity between the text features of any two key frames is smaller than a first similarity threshold, and the merged text is taken as the video text of the target video.
It should be particularly noted that, in the specific implementation manner of the present application, data related to users, for example, in the case that the target video is a video generated by a user, when the embodiment of the present application is applied to a specific product or technology, user permission or consent needs to be obtained, and the collection, use and processing of the related data need to comply with local laws and regulations and standards.
Based on the video text extraction scheme, the embodiment of the application provides a video text extraction method. Referring to fig. 2, a schematic flow chart of a video text extraction method provided in the embodiment of the present application is shown. The video text extraction method shown in fig. 2 may be performed by a video text extraction apparatus. The video text extraction method shown in fig. 2 may include the following steps:
s201, text feature extraction processing is carried out on each video frame in the target video, and text features of each video frame are obtained.
In one embodiment, the target video may be any legal video, and the target video may include a plurality of video frames; the video text extraction equipment performs text feature extraction processing on one video frame in the target video in order to extract the features of the text contained in the video frame; when text is contained in the video frame, the text exists in some area in the video frame, and in other words, the text feature extraction process may be performed on the video frame in order to extract features in the area where the text contained in the video frame exists. Furthermore, the idea of image segmentation can be introduced to perform image segmentation processing on the video frame to obtain a plurality of image areas of the video frame; when the text feature extraction processing is performed on the video frame, the feature of the image area where the text included in the video frame is located may be extracted, that is, the feature of the image area including the text in each image area may be extracted. For example, refer to fig. 3, which is a schematic diagram of a feature extraction area when performing text feature extraction according to an embodiment of the present application; the image segmentation processing may be performed on the video frame to obtain 4 × 4 image regions of the video frame, and if the image region in which the text included in the video frame is located is the image region indicated by row 1 and column 1 as denoted by 301, the image region as denoted by 301 is a feature extraction region when performing text feature extraction, that is, when performing the text feature extraction processing on the video frame, the feature in the image region as denoted by 301 may be extracted.
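As an illustrative sketch (assuming a NumPy frame and the 4 × 4 grid of the Fig. 3 example), a video frame can be segmented into image regions like this:

```python
# Illustrative sketch: split a frame into a grid of image regions (here 4x4), so that
# text features can later be taken only from the regions that actually contain text.
import numpy as np

def split_into_regions(frame: np.ndarray, rows: int = 4, cols: int = 4):
    """Return a rows x cols list of image regions covering the whole frame."""
    h, w = frame.shape[:2]
    regions = []
    for r in range(rows):
        row_regions = []
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            row_regions.append(frame[y0:y1, x0:x1])
        regions.append(row_regions)
    return regions

frame = np.zeros((720, 1280, 3), dtype=np.uint8)   # placeholder video frame
regions = split_into_regions(frame)                 # regions[0][0] is row 1, column 1
```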
S202, one or more key frames are generated according to the text characteristics of each video frame.
The similarity between the text features of any two key frames is smaller than a first similarity threshold, and the first similarity threshold may be set according to specific requirements.
In one embodiment, the video text extraction device generates one or more key frames according to text features of each video frame, and the similarity between the text features of any two key frames is smaller than a first similarity threshold, so as to merge video frames containing the same text in each video frame of a target video into one key frame to obtain key frames containing different texts; in specific implementation, the video frames with the similarity between the corresponding text features meeting a certain condition can be determined as the video frames containing the same text, and then combined into a key frame.
For example, referring to fig. 4, a schematic diagram of merging two video frames into one key frame is provided for the embodiment of the present application; if the video frame is divided into 4 × 4 image areas, the text included in the video frame marked by 401 is "a", and the image area where the text is located is the image area indicated by the 1 st row and 1 st column marked by 402, specifically, in the upper left corner area of the image area marked by 402, as shown by 403; as shown by 404, the text included in the video frame is "a", and the image area where the text is located is the image area indicated by line 1 and column 1 as shown by 405, specifically in the lower left corner area of the image area as shown by 405, as shown by 406; the key frame obtained by merging the video frame indicated by the 401 label and the video frame indicated by the 404 label can be indicated by a 407 label, and at this time, the video frame indicated by the 404 label is directly used as the key frame.
S203, performing text recognition processing on each key frame to obtain a recognition text of each key frame and a text area where the recognition text is located.
In an embodiment, when the video text extraction device performs text recognition processing on each key frame to obtain the recognition text of each key frame and the text region where the recognition text is located, this can be implemented by using an existing Optical Character Recognition (OCR) algorithm.
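As a hedged illustration only (the embodiment does not name a specific OCR engine), the open-source pytesseract wrapper can return each recognized text together with the region where it is located:

```python
# Hedged illustration: use pytesseract (an assumption, not the engine of the embodiment)
# to obtain each recognized text and the region (x, y, width, height) where it is located.
import pytesseract
from PIL import Image

key_frame = Image.open("key_frame.png")              # assumed key-frame image file
data = pytesseract.image_to_data(key_frame, output_type=pytesseract.Output.DICT)

recognized = []
for text, x, y, w, h in zip(data["text"], data["left"], data["top"],
                            data["width"], data["height"]):
    if text.strip():                                  # keep non-empty recognition results
        recognized.append({"text": text, "region": (x, y, w, h)})
```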
S204, acquiring the text content of the recognition text of each key frame, and merging the recognition texts of the key frames according to the text content of the recognition text of each key frame and the corresponding text area to obtain a merged text; wherein the merged text is taken as the video text of the target video.
In one embodiment, the merging, by the video text extraction device, of the recognition texts of the key frames according to the text content of the recognition text of each key frame and the corresponding text region to obtain a merged text may include: merging recognition texts of the key frames that have the same text region and the same text content into one text, and retaining the remaining recognition texts, to obtain the merged text. Optionally, text regions being the same may mean that the text regions are located at the same position; further, the position of a text region may be represented by a specified position within the text region, for example, the position of the center point of the text region may be used to represent the position of the text region.
For example, refer to fig. 5, which is a schematic diagram of obtaining a merged text according to an embodiment of the present application. Suppose the target video corresponds to 3 key frames: key frame 1 marked as 501, key frame 2 marked as 502, and key frame 3 marked as 503 in fig. 5. Key frame 1 corresponds to one recognition text, namely "A"; key frame 2 corresponds to two recognition texts, namely "A" and "B"; key frame 3 corresponds to three recognition texts, namely "A", "B" and "C". The text regions corresponding to the recognition text "A" of key frame 1, the recognition text "A" of key frame 2 and the recognition text "A" of key frame 3 are the same, and the text regions corresponding to the recognition text "B" of key frame 2 and the recognition text "B" of key frame 3 are the same. When merging the recognition texts of the key frames, the recognition text "A" of key frame 1, the recognition text "A" of key frame 2 and the recognition text "A" of key frame 3 may be merged into one text, the recognition text "B" of key frame 2 and the recognition text "B" of key frame 3 may be merged into one text, and the recognition text "C" of key frame 3 may be retained; the resulting merged text is "A", "B", "C".
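As a sketch of this merge rule (assuming each recognition result carries its text and region as produced in S203, and representing a text region by its centre point rounded to a small tolerance):

```python
# Sketch of the merge rule: recognition texts whose content and text-region position
# (represented here by the region's centre point, rounded to a tolerance) are the same
# are merged into one entry; all other recognition texts are kept.
def merge_recognized_texts(per_frame_texts, tolerance=5):
    merged, seen = [], set()
    for frame_texts in per_frame_texts:
        for item in frame_texts:
            x, y, w, h = item["region"]
            center = (round((x + w / 2) / tolerance), round((y + h / 2) / tolerance))
            key = (item["text"], center)
            if key not in seen:
                seen.add(key)
                merged.append(item["text"])
    return merged

# Reproduces the Fig. 5 example: key frames containing "A", "A"/"B", "A"/"B"/"C"
frames = [
    [{"text": "A", "region": (10, 10, 40, 20)}],
    [{"text": "A", "region": (10, 10, 40, 20)}, {"text": "B", "region": (10, 40, 40, 20)}],
    [{"text": "A", "region": (10, 10, 40, 20)}, {"text": "B", "region": (10, 40, 40, 20)},
     {"text": "C", "region": (10, 70, 40, 20)}],
]
print(merge_recognized_texts(frames))   # ['A', 'B', 'C']
```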
In the embodiment of the application, text feature extraction processing can be performed on each video frame in the target video to obtain the text feature of each video frame; one or more key frames are generated according to the text features of the video frames, wherein the similarity between the text features of any two key frames is smaller than a first similarity threshold; text recognition processing is performed on each key frame to obtain a recognition text of each key frame and a text region where the recognition text is located; the text content of the recognition text of each key frame is acquired, and the recognition texts of the key frames are merged according to the text content of the recognition text of each key frame and the corresponding text region to obtain a merged text, which is taken as the video text of the target video. Since one or more key frames are generated according to the text features of the video frames in the target video, and the video text of the target video is then obtained based on text recognition processing of the key frames only, the time consumed by performing text recognition processing on every video frame in the target video is reduced, and the speed of extracting text from the video can be improved.
Based on the above embodiments, the present application provides another video text extraction method. Referring to fig. 6, a schematic flow chart of another video text extraction method provided in the embodiment of the present application is shown. The video text extraction method illustrated in fig. 6 may be performed by a video text extraction apparatus. The video text extraction method shown in fig. 6 may include the following steps:
s601, text feature extraction processing is carried out on each video frame in the target video, and text features of each video frame are obtained.
In one embodiment, when the video text extraction device performs text feature extraction processing on each video frame in the target video to obtain text features of each video frame, image segmentation processing can be performed on any video frame in each video frame to obtain a plurality of image areas of any video frame; performing convolution operation on a plurality of image areas of any video frame to obtain an initial characteristic diagram of any video frame; and performing text prediction on each image area of any video frame according to the initial characteristic diagram, and generating text characteristics of any video frame according to a text prediction result and the initial characteristic diagram.
In an embodiment, the number of image regions of any video frame may be obtained by performing image segmentation processing on any video frame according to specific requirements, for example, the number of image regions of any video frame may be set to be 4 × 4, 14 × 14, 16 × 16, and the like. The video text extraction equipment performs convolution operation on a plurality of image areas of any video frame to obtain an initial feature map of any video frame, and can be realized by calling a first convolution sub-module in a feature extraction module; the number of the obtained initial feature maps of any video frame is related to the network structure of the first convolution submodule, and if the first convolution submodule is the ResNet18 network shown in fig. 1, the number of the initial feature maps of any video frame is 512. The video text extraction device performs text prediction on each image area of any video frame according to the initial feature map, generates text features of any video frame according to the text prediction result and the initial feature map, and can be realized by calling a text positioning sub-module in the feature extraction module.
In one embodiment, the video text extraction device performs text prediction on each image region of any video frame according to the initial feature map, and generates text features of any video frame according to the text prediction result and the initial feature map, and the method may include: merging the plurality of initial characteristic graphs to obtain a reference characteristic graph; predicting text prediction probability of texts contained in each image area of any video frame based on the characteristic information indicated by the reference characteristic graph; wherein the text prediction probability is used as a text prediction result; and respectively carrying out text positioning processing on each initial characteristic graph based on the text prediction probability to obtain the text characteristics of any video frame.
In a specific implementation, the merging, by the video text extraction device, the multiple initial feature maps to obtain the reference feature map may include: determining the attention degree of any initial feature map based on an attention mechanism, and taking the attention degree as the feature weight of any initial feature map; carrying out weighting and merging processing on the corresponding initial characteristic graphs by adopting the characteristic weights to obtain reference characteristic graphs; this process may be implemented by a channel attention module incorporated in the text localization sub-module. In a possible embodiment, each initial feature map may be converted into a vector as an initial feature vector, and the processing on each initial feature map may be converted into the processing on each initial feature vector; namely, each initial feature map can be converted into a vector as an initial feature vector; then, carrying out feature coding processing on each initial feature vector to obtain a coding vector corresponding to each initial feature vector; normalizing the coding vectors corresponding to the initial feature vectors to respectively obtain attention degrees aiming at the initial feature vectors (namely attention degrees aiming at the initial feature maps); taking the attention degree aiming at each initial feature vector as the feature weight of each initial feature vector, and performing weighting and merging processing on the corresponding initial feature vectors by adopting the feature weight to obtain reference feature vectors; and converting the reference feature vector into a feature map as a reference feature map. Optionally, each initial feature map may be converted into a 256-dimensional vector as an initial feature vector; the coding vector corresponding to each initial feature vector may be normalized by an exponential normalization function (i.e., softmax function).
If the number of the initial feature maps is I, and the initial feature vector corresponding to the i-th initial feature map among the I initial feature maps is f_i, then performing feature encoding processing on the i-th initial feature vector to obtain the encoded vector corresponding to the i-th initial feature vector can be expressed by the following formula (1):

e_i = W_i · f_i + b_i    (1)

where e_i represents the encoded vector corresponding to the i-th initial feature vector, and W_i and b_i are the model parameters used in the channel attention module for feature encoding of each initial feature vector.

Normalizing the encoded vector corresponding to the i-th initial feature vector among the encoded vectors corresponding to the initial feature vectors gives the attention degree for the i-th initial feature vector (i.e., the attention degree for the i-th initial feature map), as shown by the following formula (2):

α_i = exp(e_i) / Σ_{j=1}^{I} exp(e_j)    (2)

where α_i represents the attention degree for the i-th initial feature vector (i.e., the attention degree for the i-th initial feature map), e_i represents the encoded vector corresponding to the i-th initial feature vector, and e_j represents the encoded vector corresponding to the j-th initial feature vector.

Weighting and merging the initial feature vectors with the feature weights to obtain the reference feature vector, denoted f_attn, can be expressed by the following formula (3):

f_attn = Σ_{i=1}^{I} α_i · f_i    (3)
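The channel attention computation of formulas (1)-(3) can be sketched as follows (a minimal PyTorch sketch under the simplifying assumption that each encoded value e_i is reduced to a scalar per initial feature map, so the softmax in formula (2) yields one attention weight per map; the layer shapes are illustrative):

```python
# Minimal sketch of equations (1)-(3): each initial feature map is flattened to an
# initial feature vector f_i, encoded as e_i = W_i·f_i + b_i, softmax-normalised into
# an attention weight alpha_i, and the reference feature is the weighted sum of the f_i.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, num_maps: int, dim: int = 256):
        super().__init__()
        # one linear encoder per initial feature vector (the W_i and b_i of equation 1)
        self.encoders = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_maps)])

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_maps, dim) -- one flattened vector per initial feature map
        e = torch.cat([enc(f.unsqueeze(0)) for enc, f in zip(self.encoders, feats)], dim=0)
        alpha = torch.softmax(e.squeeze(-1), dim=0)          # equation (2)
        return (alpha.unsqueeze(-1) * feats).sum(dim=0)      # equation (3): f_attn

attn = ChannelAttention(num_maps=8, dim=256)
f_attn = attn(torch.rand(8, 256))                            # reference feature vector
```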
furthermore, the video text extraction equipment can call a text positioning sub-module in the feature extraction module, and predict the text prediction probability of texts contained in each image area of any video frame based on the feature information indicated by the reference feature map; and respectively carrying out text positioning processing on each initial characteristic graph based on the text prediction probability to obtain the text characteristics of any video frame. When the video text extraction device performs text positioning processing on each initial feature map based on the text prediction probability to obtain the text features of any video frame, the method may include: taking the image area of which the corresponding text prediction probability is less than or equal to the probability threshold value in each image area as a mask area; for any initial feature map, carrying out masking treatment on the position of the mask area in any initial feature map to obtain a target feature map corresponding to any initial feature map; taking the target characteristic graph corresponding to each initial characteristic graph as the text characteristic of any video frame; wherein, the probability threshold can be set according to specific requirements.
For example, referring to fig. 7, a schematic diagram of a masking process performed on an initial feature map of a video frame according to an embodiment of the present application is shown; if the video frame is divided into 4 × 4 image regions, an initial feature map of the video frame is marked by 701, and text prediction probabilities of texts contained in the image regions of the video frame obtained by prediction are respectively marked by 702; if the probability threshold is set to 0.9, the image area, of the image areas, whose corresponding text prediction probability is less than or equal to the probability threshold of 0.9, is taken as a mask area, and the mask area may be marked as 703; the position of the mask region in the initial feature map is masked, and the obtained target feature map corresponding to the initial feature map may be denoted by 704.
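A minimal NumPy sketch of this masking step (assuming one text prediction probability per image region of the grid, and the 0.9 threshold of the Fig. 7 example):

```python
# Sketch of the masking step: image regions whose text prediction probability is at or
# below the threshold are zeroed out in every initial feature map, and the masked maps
# serve as the text features of the video frame.
import numpy as np

def mask_feature_maps(feature_maps: np.ndarray, region_probs: np.ndarray,
                      threshold: float = 0.9) -> np.ndarray:
    # feature_maps: (num_maps, rows, cols); region_probs: (rows, cols) text probabilities
    keep = (region_probs > threshold).astype(feature_maps.dtype)   # 1 where text is likely
    return feature_maps * keep                                     # broadcast mask over maps

maps = np.random.rand(512, 4, 4)          # e.g. 512 initial feature maps over a 4x4 grid
probs = np.random.rand(4, 4)              # predicted text probability per image region
text_features = mask_feature_maps(maps, probs)
```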
S602, one or more key frames are generated according to the text characteristics of each video frame.
And the similarity between the text features of any two key frames is smaller than a first similarity threshold value.
In an embodiment, taking the generation of a key frame as an example, the video text extraction device generates a key frame according to the text features of each video frame, and may include: determining a related video frame set from each video frame according to the similarity between the text characteristics of any two video frames; the associated video frame set comprises video frames of which the similarity between at least two corresponding text features is greater than or equal to a second similarity threshold; and merging each video frame in the associated video frame set to obtain a key frame. Further, the video frames that are not combined in the video frames of the target video may be used as key frames. The second similarity threshold may be set according to specific requirements, for example, the second similarity threshold may be the same as the first similarity threshold. In specific implementation, the video text extraction device may first invoke the similarity determination module to determine the similarity between the text features of any two video frames in each video frame of the target video; adding video frames of which the similarity between corresponding text features is greater than or equal to a second similarity threshold value in each video frame to the associated video frame set; merging each video frame in the associated video frame set to obtain a key frame; and the video frames which are not combined in the video frames of the target video are used as key frames. In a feasible implementation manner, when the video text extraction device merges the video frames in the associated video frame set to obtain a key frame, any video frame in the associated video frame set can be used as the key frame.
Referring to fig. 8, a schematic diagram for determining similarity between text features of two video frames based on a trained text mask twin network provided in an embodiment of the present application is shown; the trained text mask twin network comprises a feature extraction module and a similarity determination module, wherein the feature extraction module comprises a first convolution submodule and a text positioning submodule, the text positioning submodule comprises a channel attention module, and the similarity determination module comprises a second convolution submodule, an inclusion-A submodule, an average pooling layer and a full connection layer; aiming at any one of two video frames, image segmentation processing can be carried out on the video frame to obtain a plurality of image areas of the video frame; performing convolution operation on a plurality of image areas of the video frame through a first convolution sub-module to obtain a plurality of initial feature maps of the video frame; inputting a plurality of initial feature maps of the video frame into a channel attention module in a text positioning sub-module, determining the attention degree aiming at any initial feature map based on an attention mechanism, and taking the attention degree as the feature weight of any initial feature map; and carrying out weighting and merging processing on the corresponding initial characteristic graphs by adopting the characteristic weights to obtain reference characteristic graphs. Predicting text prediction probability of texts contained in each image area of the video frame based on the characteristic information indicated by the reference characteristic graph; taking the image area of which the corresponding text prediction probability is less than or equal to the probability threshold value in each image area as a mask area, and masking the position of the mask area in any initial characteristic image aiming at any initial characteristic image to obtain a target characteristic image corresponding to any initial characteristic image; and taking the target characteristic graph corresponding to each initial characteristic graph as the text characteristic of the video frame. Furthermore, the text features of the two video frames may be merged, and the merged text features may be input to the similarity determination module to obtain the similarity between the text features of the two video frames.
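A minimal sketch of this grouping step is given below; `similarity(a, b)` stands in for the trained text mask twin network (an assumption of the example), the grouping is a greedy simplification in which each frame is compared with one representative of each existing set, and taking any member of an associated video frame set as its key frame follows the possible implementation described above.

```python
# Sketch of grouping frames into associated video frame sets and keeping one
# representative per set as the key frame.
def generate_key_frames(frames, similarity, second_threshold: float = 0.9):
    groups = []
    for frame in frames:
        for group in groups:
            if similarity(group[0], frame) >= second_threshold:
                group.append(frame)               # frame contains the same text: same set
                break
        else:
            groups.append([frame])                # start a new associated video frame set
    return [group[-1] for group in groups]        # any member can serve as the key frame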
Referring to fig. 9, which is a schematic diagram of merging a plurality of video frames into key frames provided in this embodiment of the application; suppose the target video includes 7 video frames: video frame 1 indicated by the 901 mark, where the text included in video frame 1 is "A"; video frame 2 indicated by the 902 mark, where the text included in video frame 2 is "A"; video frame 3 indicated by the 903 mark, where the text included in video frame 3 is "A" and "B"; video frame 4 indicated by the 904 mark, where the text included in video frame 4 is "A" and "B"; video frame 5 indicated by the 905 mark, where the text included in video frame 5 is "A", "B" and "C"; video frame 6 indicated by the 906 mark, where the text included in video frame 6 is "A"; and video frame 7 indicated by the 907 mark, where the text included in video frame 7 is "A". Then video frame 1 (901), video frame 2 (902), video frame 6 (906) and video frame 7 (907) may be merged into one key frame as key frame 1, as indicated by the 908 mark; video frame 3 (903) and video frame 4 (904) may be merged into one key frame as key frame 2, as indicated by the 909 mark; and video frame 5 (905) may serve as key frame 3, as indicated by the 910 mark.
In one embodiment, the video frames of the target video may be sequentially arranged according to the corresponding playing order in the target video; taking the generation of a key frame as an example, the video text extraction device generates a key frame according to the text features of each video frame, and may include: determining one or more associated video frame sets from each video frame according to the arrangement sequence of each video frame and the similarity between the text characteristics of any two adjacent video frames; any associated video frame set comprises video frames of which the similarity between at least two corresponding text features is greater than or equal to a second similarity threshold; merging each video frame in each associated video frame set to obtain one or more intermediate key frames; taking the video frames which are not combined in all the video frames of the target video as intermediate key frames; and merging a plurality of intermediate key frames of which the similarity between corresponding text features is greater than or equal to a second similarity threshold value into one key frame in each intermediate key frame, and taking the intermediate key frames which are not merged in each intermediate key frame as the key frames. The second similarity threshold may be set according to specific requirements, for example, the second similarity threshold may be greater than the first similarity threshold. In a possible implementation manner, when the video text extraction device merges a plurality of intermediate key frames, of which the similarity between corresponding text features is greater than or equal to the second similarity threshold, into one key frame, any one of the plurality of intermediate key frames, of which the similarity between corresponding text features is greater than or equal to the second similarity threshold, may be used as the key frame.
Referring to fig. 10, which is another schematic diagram of merging a plurality of video frames into key frames provided in the embodiment of the present application; if the target video includes 7 video frames, and the 7 video frames included in the target video are the same as the 7 video frames shown in fig. 9, then adjacent video frame 1 and video frame 2 may be merged into one intermediate key frame, denoted by the 1001 mark, as intermediate key frame 1; video frame 3 and video frame 4 may be merged into one intermediate key frame as intermediate key frame 2, as indicated by the 1002 mark; video frame 5 may be taken as intermediate key frame 3, as indicated by the 1003 mark; and video frame 6 and video frame 7 may be merged into one intermediate key frame as intermediate key frame 4, as indicated by the 1004 mark. Further, intermediate key frame 1 and intermediate key frame 4 may be merged into one key frame as key frame 1, as indicated by the 1005 mark; intermediate key frame 2 may be taken as key frame 2, as indicated by the 1006 mark; and intermediate key frame 3 may be taken as key frame 3, as indicated by the 1007 mark.
In one embodiment, under the condition that the video frames of the target video are sequentially arranged according to the corresponding playing sequence in the target video, each obtained intermediate key frame can be directly used as a key frame; then, taking the generation of a key frame as an example, the video text extraction device generates a key frame according to the text feature of each video frame, which may include: determining a related video frame set from each video frame according to the arrangement sequence of each video frame and the similarity between the text characteristics of any two adjacent video frames; the associated video frame set comprises video frames of which the similarity between at least two corresponding text features is greater than or equal to a second similarity threshold; and merging each video frame in the associated video frame set to obtain a key frame. Further, the video frames that are not combined in the video frames of the target video may be used as key frames. The second similarity threshold may be set according to specific requirements, for example, the second similarity threshold may be greater than the first similarity threshold. In specific implementation, the video text extraction device may first invoke the similarity determination module to determine the similarity between the text features of any two adjacent video frames in each video frame of the target video; adding adjacent video frames of which the similarity between corresponding text features is greater than or equal to a second similarity threshold value in each video frame into an associated video frame set; merging each video frame in the associated video frame set to obtain a key frame; and the video frames which are not combined in the video frames of the target video are used as key frames. In a possible implementation manner, when the video text extraction device performs merging processing on each video frame in the associated video frame set to obtain one key frame, any video frame in the associated video frame set may be used as the key frame.
Referring to fig. 11, which is another schematic diagram of merging a plurality of video frames into key frames provided in the embodiment of the present application; if the target video includes 7 video frames, and the 7 video frames included in the target video correspond to the 7 video frames shown in fig. 9, then adjacent video frame 1 and video frame 2 may be merged into one key frame, as key frame 1, as denoted by the 1101 mark; video frame 3 and video frame 4 may be merged into one key frame as key frame 2, as indicated by the 1102 mark; video frame 5 may be taken as key frame 3, as indicated by the 1103 mark; and video frame 6 and video frame 7 may be merged into one key frame as key frame 4, as indicated by the 1104 mark.
In one embodiment, when the trained text mask twin network is invoked to determine the text features of adjacent video frames and generate one or more key frames based on those text features, taking the generation of one key frame as an example, the video text extraction device generating one key frame according to the text features of the video frames may include: inputting two adjacent video frames respectively into the two feature extraction modules of the trained text mask twin network to extract the text features of the two adjacent video frames, and then determining the similarity between the text features of the two adjacent video frames through the similarity determination module of the trained text mask twin network; if the similarity between the text features of the two adjacent video frames is greater than or equal to the second similarity threshold, taking the video frame that is later in the arrangement order of the two adjacent video frames as a key frame; then, when the pair consisting of that later video frame and its next video frame is processed in the same way, this amounts to judging whether the key frame just obtained and the next video frame need to be merged, and the key frames of the target video can thus be determined. For example, if the video frames included in the target video are the video frames shown in fig. 9, and the arrangement order of video frame 1, video frame 2, video frame 3 and video frame 4 is: video frame 1, video frame 2, video frame 3, video frame 4; then video frame 1 and video frame 2 may be input into the trained text mask twin network to obtain the similarity between the text features of video frame 1 and video frame 2, and if the similarity is greater than the second similarity threshold, video frame 2 is taken as a key frame; video frame 2 and video frame 3 are then input into the trained text mask twin network to obtain the similarity between their text features, and if that similarity is smaller than the second similarity threshold, video frame 3 and video frame 4 are input into the trained text mask twin network to obtain the similarity between their text features; if that similarity is greater than the second similarity threshold, video frame 4 is taken as a key frame; and so on, until all the video frames of the target video have been processed and the key frames of the target video are obtained.
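A minimal sketch of this sequential variant follows; again, `similarity` stands in for the trained text mask twin network and is an assumption of the example.

```python
# Sketch of the sequential variant walked through above: adjacent frames are compared in
# playing order; while similarity stays at or above the second threshold, the later frame
# keeps replacing the running key-frame candidate; when it drops below the threshold, the
# candidate is emitted and a new one starts.
def sequential_key_frames(frames, similarity, second_threshold: float = 0.9):
    if not frames:
        return []
    key_frames, candidate = [], frames[0]
    for frame in frames[1:]:
        if similarity(candidate, frame) >= second_threshold:
            candidate = frame                 # same text: later frame becomes the candidate
        else:
            key_frames.append(candidate)      # text changed: close the previous key frame
            candidate = frame
    key_frames.append(candidate)
    return key_frames
```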
S603, aiming at any key frame, performing text detection processing on any key frame to obtain a text detection area where a text contained in any key frame is located.
In an embodiment, when the video text extraction device performs text detection processing on any key frame to obtain the text detection area where the text contained in that key frame is located, an existing text detection algorithm may be used; for example, a Connectionist Text Proposal Network (CTPN), an Efficient and Accurate Scene Text detector (EAST), a text detection algorithm based on pixel linking (PixelLink), and the like may be used.
In an embodiment, the performing, by the video text extraction device, text detection processing on any key frame to obtain a text detection area where a text included in any key frame is located may include: predicting whether the pixel attribute of each pixel is a positive pixel for constituting the text or not based on the pixel feature of each pixel in any key frame; predicting, for any one of the respective pixels, whether or not a connection between the any pixel and each of the adjacent pixels is a positive connection based on a pixel attribute of the any pixel and a pixel attribute of each of the adjacent pixels adjacent to the any pixel; positive connection is used to indicate: the pixel attributes of the two correspondingly connected pixels are both positive pixels, or the pixel attribute of one pixel in the two correspondingly connected pixels is a positive pixel, and the pixel attribute of one pixel is a negative pixel which is not used for forming a text; determining the rotation angle of each pixel based on the rotation angle of any key frame; determining a text detection area where a text contained in any key frame is located based on a connected domain formed by a plurality of pixels, and the rotation angle of each pixel, wherein the pixel attribute of each pixel is a positive pixel and the pixels are correspondingly connected into a positive connection; each adjacent pixel adjacent to any pixel may be each adjacent pixel adjacent to any pixel by 8; the correlation process for determining the text detection area may be implemented based on a convolutional neural network. Referring to fig. 12, in order to provide a schematic diagram for determining a text detection area where text included in a key frame is located according to an embodiment of the present application, a rotation angle of the key frame may be determined by a partial convolution layer in a convolutional neural network and an angular attention mechanism, and the determined text detection area may be as indicated by 1201.
In a feasible implementation manner, when the video text extraction device determines the text detection area where the text included in any key frame is located, an initial text detection area where the text included in the key frame is located may first be determined based on a connected domain formed by pixels whose pixel attributes are positive pixels and whose corresponding connections are positive connections; rotation processing is then performed on the initial text detection area based on the rotation angle of each pixel in the initial text detection area to obtain the text detection area. Optionally, when the video text extraction device determines the initial text detection area where the text included in any key frame is located, the connected domain composed of pixels whose pixel attributes are positive pixels and whose corresponding connections are positive connections may be determined as the initial text detection area; alternatively, a circumscribed area of that connected domain may be determined as the initial text detection area, where the circumscribed area of the connected domain may be set according to specific requirements, for example, a circumscribed rectangular area of the connected domain, a circumscribed circular area of the connected domain, and the like. In another possible implementation manner, the determined initial text detection area where the text included in any key frame is located may also be used directly as the text detection area.
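As an illustration of how positive pixels, positive connections, connected domains, and circumscribed rotated regions fit together, the sketch below groups predicted positive pixels into connected domains and fits a circumscribed rotated rectangle to each. It is a simplified stand-in, not the patented detector: it assumes per-pixel text probabilities and 8-neighbour connection probabilities have already been predicted by a convolutional network, it approximates the connection-based grouping by requiring each retained pixel to have at least one positive connection, and scipy and OpenCV are used purely as illustrative tooling.

```python
import numpy as np
import cv2
from scipy import ndimage

def detect_text_regions(pixel_prob, link_prob, pixel_thresh=0.5, link_thresh=0.5):
    """Group positive pixels into connected domains and fit a rotated box.

    pixel_prob: (H, W) array, probability that each pixel is a positive (text) pixel.
    link_prob:  (H, W, 8) array, probability that the connection between a pixel
                and each of its 8 neighbours is a positive connection.
    Returns a list of rotated rectangles ((cx, cy), (w, h), angle).
    """
    positive_pixels = pixel_prob >= pixel_thresh
    # Keep a pixel only if at least one of its 8 connections is positive
    # (a simplification of growing regions along individual positive connections).
    positive_links = (link_prob >= link_thresh).any(axis=-1)
    text_mask = positive_pixels & positive_links

    # Connected domains over the 8-neighbourhood.
    labels, num_regions = ndimage.label(text_mask, structure=np.ones((3, 3)))

    boxes = []
    for region_id in range(1, num_regions + 1):
        ys, xs = np.nonzero(labels == region_id)
        points = np.stack([xs, ys], axis=1).astype(np.float32)
        # Circumscribed rotated rectangle of the connected domain.
        boxes.append(cv2.minAreaRect(points))
    return boxes
```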
S604, intercepting the text detection area, and performing text extraction processing on the text detection area to obtain a text extraction result.
In an embodiment, when the video text extraction device performs text extraction processing on the text detection area to obtain a text extraction result, an existing text extraction method may be adopted, for example, a text extraction method based on CTC (Connectionist Temporal Classification) or a text extraction method based on an Attention mechanism may be adopted.
In an embodiment, the video text extraction device performs text extraction processing on the text detection area to obtain a text extraction result, and may include: performing visual feature coding processing on the text detection area to obtain visual coding features corresponding to the text detection area; predicting a first character prediction probability that each reference character in the plurality of reference characters is used as the nth character in the text extraction result based on the visual coding feature corresponding to the text detection area and the decoding reference character; n is a positive integer, when n is equal to 1, the decoding reference character is a special decoding character, and when n is greater than 1, the decoding reference character is the (n-1) th character in the text extraction result; predicting second character prediction probability taking each reference character as an nth character based on visual coding features corresponding to the text detection region and a sequence prediction algorithm; an nth character is determined from the respective reference characters based on the first character prediction probability and the second character prediction probability.
The related process of performing the visual feature coding processing on the text detection area can be implemented based on a convolutional neural network. When the first character prediction probability taking each reference character as the nth character is predicted, it may be obtained by decoding the visual coding feature corresponding to the text detection area and the decoding reference character based on an Attention mechanism; the plurality of reference characters may be characters in a preset dictionary, and the preset dictionary may be an existing dictionary used in the field of text extraction; the special decoding character used when n is equal to 1 may be set according to specific requirements, and is used to prompt the start of the decoding processing performed on the visual coding feature corresponding to the text detection area and the special decoding character, so as to predict the 1st character in the text extraction result. The sequence prediction algorithm used in predicting the second character prediction probability taking each reference character as the nth character may be a CTC algorithm. Further, when the video text extraction device determines the nth character from the reference characters based on the first character prediction probability and the second character prediction probability, the reference character with the largest sum of the first character prediction probability and the second character prediction probability among the reference characters may be determined as the nth character. Extracting text by combining CTC and Attention can effectively combine the good prediction performance and stability of the CTC-based text extraction method with the character positioning capability of the Attention-based text extraction method, which alleviates the difficulty, in CTC-based text extraction, of extracting effective features when character positions are uncertain, and can therefore improve the accuracy of text extraction.
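The character-by-character fusion of the Attention branch and the CTC branch can be sketched as a greedy decoding loop. In the sketch below, `attention_step` and `ctc_step` are hypothetical helpers assumed to return the first and second character prediction probabilities over a preset dictionary as numpy arrays; the start and end tokens are likewise illustrative assumptions, not part of the original disclosure.

```python
import numpy as np

def decode_hybrid(visual_features, dictionary, attention_step, ctc_step,
                  start_token="<s>", end_token="</s>", max_len=32):
    """Greedy decoding that fuses an Attention decoder and a CTC head.

    attention_step(visual_features, previous_char) and ctc_step(visual_features, n)
    are each assumed to return a probability vector over `dictionary`
    (the first / second character prediction probabilities described above).
    """
    result = []
    previous_char = start_token          # special decoding character for n = 1
    for n in range(max_len):
        p_attention = attention_step(visual_features, previous_char)
        p_ctc = ctc_step(visual_features, n)
        # The reference character with the largest summed probability
        # becomes the n-th character of the text extraction result.
        best_index = int(np.argmax(p_attention + p_ctc))
        char = dictionary[best_index]
        if char == end_token:
            break
        result.append(char)
        previous_char = char             # the (n-1)-th character feeds step n
    return "".join(result)
```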
Fig. 13 is a schematic diagram of obtaining a text extraction result from a text detection area according to an embodiment of the present application; the text detection area may be marked as 1301, the related process of performing visual feature coding processing on the text detection area may be marked as 1302, the obtained visual coding feature corresponding to the text detection area may be marked as 1303, the related process of predicting, based on the visual coding feature corresponding to the text detection area and a CTC algorithm, the second character prediction probability taking each reference character as the nth character may be marked as 1304, and the second character prediction probability may be predicted based on a plurality of classifiers; the related process of decoding, based on an Attention mechanism, the visual coding feature corresponding to the text detection area and the decoding reference character to predict the first character prediction probability taking each reference character as the nth character may be marked as 1305; the reference character with the largest sum of the first character prediction probability and the second character prediction probability among the reference characters may be determined as the nth character in the text extraction result.
S605, using the text extraction result as the identification text of any key frame, and using the text detection area as the text area where the identification text of any key frame is located.
S606, acquiring the text content of the identification text of each key frame, and combining the identification texts of each key frame according to the text content of the identification text of each key frame and the corresponding text area to obtain a combined text; wherein the merged text is taken as the video text of the target video.
Step S606 is the same as step S204, and is not described herein again.
In the embodiment of the application, for any video frame among the video frames of a target video, image segmentation processing can be performed on the video frame to obtain a plurality of image areas of the video frame; a convolution operation is performed on the plurality of image areas of the video frame to obtain an initial feature map of the video frame; text prediction is performed on each image area of the video frame according to the initial feature map, and the text features of the video frame are generated according to the text prediction result and the initial feature map; one or more key frames can then be generated according to the text features of each video frame; text recognition processing is performed on each key frame to obtain the recognition text of each key frame and the text area where the recognition text is located; the recognition texts of the key frames are merged according to their text content and the corresponding text areas to obtain a merged text, which is taken as the video text of the target video. Based on the text prediction result obtained by performing text prediction on each image area of each video frame, the method can increase the attention paid to the image areas that contain text, and thus can improve the accuracy of generating one or more key frames according to the text features of each video frame in the target video; when the video text of the target video is subsequently obtained based on the text recognition processing performed on each key frame, the time consumed by performing text recognition processing on every video frame in the target video is reduced, and the speed of extracting text from the video can be improved.
The video text extraction method described above can be implemented based on a trained text mask twin network, where the trained text mask twin network is obtained by optimally training a text mask twin network whose model structure is the same as that of the trained text mask twin network but whose model parameters differ. On this basis, the embodiment of the present application provides a training method for the text mask twin network to explain the related process of obtaining the trained text mask twin network. Referring to fig. 14, which is a schematic flowchart of a training method for a text mask twin network provided in an embodiment of the present application, the training method may be executed by the video text extraction device, or by any electronic device capable of implementing the optimization training of the text mask twin network; the embodiment of the present application is described below by taking the video text extraction device as an example. The training method of the text mask twin network shown in fig. 14 may include the following steps:
S1401, acquiring a training sample group; the training sample group includes two sample video frames, a sample label of each sample video frame, and a similarity truth value between the text features of the two sample video frames.
In one embodiment, a sample label may be used to indicate whether text is contained in the corresponding sample video frame; further, when the idea of image segmentation is adopted to segment the sample video frame into a plurality of image regions, the sample label may be used to indicate whether each image region of the corresponding sample video frame contains text. For example, if the sample video frame is segmented into 4 × 4 image regions, and the image region in the 1st row and 1st column contains text while the remaining image regions do not, the sample label may be represented by a matrix in which the element in the 1st row and 1st column is 1 and the remaining elements are 0. The similarity truth value between the text features of two sample video frames is used to indicate whether the two sample video frames are similar.
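For the 4 × 4 example just described, the sample label can be written down directly as a small matrix; the snippet below is only that illustration in numpy (the index [0, 0] corresponds to the 1st row and 1st column).

```python
import numpy as np

# 4 x 4 image regions; only the region in row 1, column 1 contains text,
# so the sample label is a matrix with a single 1 and 0 elsewhere.
sample_label = np.zeros((4, 4), dtype=np.float32)
sample_label[0, 0] = 1.0

# A similarity truth value of 1 marks the two sample frames as similar,
# 0 marks them as dissimilar.
similarity_truth = 1.0
```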
S1402, inputting each sample video frame into a text mask twin network, obtaining a prediction label of each sample video frame and a text feature of each sample video frame, and obtaining a prediction similarity between the text features of the two sample video frames based on the text feature of each sample video frame.
Wherein the prediction tag can be used for indicating whether a prediction result of text is contained in a corresponding sample video frame; further, when the sample video frame is divided into a plurality of image regions by using the image division idea, the prediction label may be used to indicate whether a prediction result of text is contained in each image region of the corresponding sample video frame, that is, a text prediction probability of text contained in each image region of the corresponding sample video frame. In specific implementation, when the video text extraction device inputs each sample video frame into a text mask twin network to obtain a prediction label of each sample video frame and text features of each sample video frame, and obtains prediction similarity between the text features of two sample video frames based on the text features of each sample video frame, taking any sample video frame as an example, image segmentation processing can be performed on any sample video frame to obtain a plurality of image areas of any sample video frame; calling a feature extraction module in a text mask twin network to carry out convolution operation on a plurality of image areas of any sample video frame to obtain a plurality of initial feature maps of any sample video frame; merging a plurality of initial feature maps of any sample video frame to obtain a reference feature map of any sample video frame; predicting text prediction probability of texts contained in each image area of any sample video frame based on feature information indicated by a reference feature map of any sample video frame; wherein the text prediction probability is used as a prediction label of any sample video frame; and respectively carrying out text positioning processing on each initial feature map of any sample video frame based on the prediction label of any sample video frame to obtain the text features of any sample video frame. Further, a similarity determination module in the text mask twin network may be invoked to determine a predicted similarity between text features of two sample video frames.
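A minimal sketch of one branch of such a twin network and of the similarity computation is given below, assuming PyTorch. The layer sizes, the 4 × 4 region grid, the cosine-similarity head, and the 0.5 masking threshold are all illustrative assumptions rather than the parameters of the disclosed network; the sketch only mirrors the sequence "convolve image regions → predict per-region text probability (prediction label) → mask the feature maps → compare text features".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextMaskBranch(nn.Module):
    """One feature-extraction branch: convolution over the frame,
    per-region text-probability prediction, and masking of the feature maps."""

    def __init__(self, channels=64, grid=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Predicts, per image region, the probability that it contains text.
        self.text_head = nn.Conv2d(channels, 1, 1)
        self.grid = grid

    def forward(self, frame):
        feature_maps = self.backbone(frame)                       # initial feature maps
        region_logits = F.adaptive_avg_pool2d(self.text_head(feature_maps),
                                              (self.grid, self.grid))
        text_prob = torch.sigmoid(region_logits)                  # prediction label
        # Up-sample the region probabilities and mask non-text positions.
        mask = F.interpolate(text_prob, size=feature_maps.shape[-2:], mode="nearest")
        text_features = feature_maps * (mask > 0.5).float()       # masked text features
        return text_prob, text_features


def predicted_similarity(features_a, features_b):
    """Similarity head: cosine similarity of pooled text features, mapped to [0, 1]."""
    a = F.adaptive_avg_pool2d(features_a, 1).flatten(1)
    b = F.adaptive_avg_pool2d(features_b, 1).flatten(1)
    return (F.cosine_similarity(a, b, dim=1) + 1) / 2
```

Running the same branch on both sample video frames and feeding the two text features to `predicted_similarity` yields the prediction similarity used in the loss described below.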
S1403, performing optimization training on the text mask twin network based on the sample label of each sample video frame, the prediction label of each sample video frame, the similarity truth value, and the prediction similarity to obtain the trained text mask twin network.
In one embodiment, the sample label of each sample video frame, the prediction label of each sample video frame, the similarity truth value, and the prediction similarity may be used as function parameters of a loss function of the text mask twin network to obtain a loss function value; the model parameters of the text mask twin network are then optimized and adjusted in the direction of reducing the loss function value, so as to obtain the trained text mask twin network. The loss function of the text mask twin network may include: a first loss function composed of the sample label of one of the two sample video frames and the corresponding prediction label, a second loss function composed of the sample label of the other of the two sample video frames and the corresponding prediction label, and a third loss function composed of the similarity truth value and the prediction similarity. If one of the two sample video frames is denoted by $x_1$ and the other sample video frame by $x_2$, the loss function of the text mask twin network can be expressed by the following equation 4.1:

$$L(x_1, x_2) = \alpha \cdot L_{text\_mask1}(x_1) + \alpha \cdot L_{text\_mask2}(x_2) + \beta \cdot L_{sim}(x_1, x_2) \quad (4.1)$$

where $L_{text\_mask1}(x_1)$ denotes the first loss function composed of the sample label of one sample video frame and the corresponding prediction label, $L_{text\_mask2}(x_2)$ denotes the second loss function composed of the sample label of the other sample video frame and the corresponding prediction label, and $L_{sim}(x_1, x_2)$ denotes the third loss function composed of the similarity truth value and the prediction similarity; $\alpha$ and $\beta$ are weight parameters. The first loss function and the second loss function may be L2 norms, and the third loss function may be a cross entropy loss function. If the sample label of the sample video frame $x_1$ is denoted by $y_1$ and its prediction label by $\hat{y}_1$, the sample label of the sample video frame $x_2$ is denoted by $y_2$ and its prediction label by $\hat{y}_2$, the similarity truth value between the text features of the two sample video frames is denoted by $y(x_1, x_2)$, and the prediction similarity between the text features of the two sample video frames is denoted by $p(x_1, x_2)$, the loss function of the text mask twin network can be expressed by the following equation 4.2:

$$L(x_1, x_2) = \alpha \cdot \lVert \hat{y}_1 - y_1 \rVert_2^2 + \alpha \cdot \lVert \hat{y}_2 - y_2 \rVert_2^2 - \beta \cdot \big[\, y(x_1, x_2)\log p(x_1, x_2) + (1 - y(x_1, x_2))\log(1 - p(x_1, x_2)) \,\big] \quad (4.2)$$
In the embodiment of the present application, the text mask twin network is optimally trained based on the training sample group that includes two sample video frames, the sample label of each sample video frame, and the similarity truth value between the text features of the two sample video frames; the L2 norm is used to constrain the sample labels of the sample video frames against the corresponding prediction labels obtained by the text mask twin network, and the cross entropy loss function is used to constrain the similarity truth value between the text features of the two sample video frames against the prediction similarity obtained by the text mask twin network. By training the text mask twin network in this multi-task joint training manner, the trained text mask twin network can accurately extract the text features of video frames and accurately predict the similarity between the text features of any two video frames.
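Under the L2-norm and cross-entropy choices described above, the joint loss of equations 4.1 and 4.2 can be sketched as follows (PyTorch assumed; `alpha` and `beta` stand for the weight parameters α and β, and the similarity inputs are probabilities in [0, 1]).

```python
import torch
import torch.nn.functional as F

def text_mask_twin_loss(pred_label_1, sample_label_1,
                        pred_label_2, sample_label_2,
                        pred_similarity, similarity_truth,
                        alpha=1.0, beta=1.0):
    """Joint loss of equation 4.1: two L2 mask-label terms plus a
    cross-entropy similarity term, as spelled out in equation 4.2."""
    mask_loss_1 = torch.sum((pred_label_1 - sample_label_1) ** 2)
    mask_loss_2 = torch.sum((pred_label_2 - sample_label_2) ** 2)
    # pred_similarity and similarity_truth are float tensors with values in [0, 1].
    sim_loss = F.binary_cross_entropy(pred_similarity, similarity_truth)
    return alpha * mask_loss_1 + alpha * mask_loss_2 + beta * sim_loss
```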
Based on the related embodiments of the video text extraction method, the embodiment of the application provides a video text extraction apparatus. Referring to fig. 15, which is a schematic structural diagram of a video text extraction apparatus according to an embodiment of the present application, the video text extraction apparatus may include a key frame generation unit 1501, a text recognition unit 1502, and a video text generation unit 1503. The video text extraction apparatus shown in fig. 15 may operate as follows:
a key frame generating unit 1501, configured to perform text feature extraction processing on each video frame in a target video to obtain a text feature of each video frame;
the key frame generating unit 1501 is further configured to generate one or more key frames according to text features of the video frames; the similarity between the text features of any two key frames is smaller than a first similarity threshold;
a text recognition unit 1502, configured to perform text recognition processing on each key frame to obtain a recognition text of each key frame and a text region where the recognition text is located;
the video text generation unit 1503 is configured to acquire text contents of the identification texts of the key frames, and merge the identification texts of the key frames according to the text contents of the identification texts of the key frames and corresponding text regions to obtain a merged text; wherein the merged text is taken as the video text of the target video.
In one embodiment, the video frames are arranged in sequence according to the corresponding playing sequence in the target video;
when the key frame generating unit 1501 generates a key frame according to the text feature of each video frame, the following operations are specifically performed:
determining a related video frame set from each video frame according to the arrangement sequence of each video frame and the similarity between the text characteristics of any two adjacent video frames; the associated video frame set comprises video frames of which the similarity between at least two corresponding text features is greater than or equal to a second similarity threshold;
and merging each video frame in the associated video frame set to obtain the key frame.
In one embodiment, when the key frame generating unit 1501 performs text feature extraction processing on each video frame in a target video to obtain a text feature of each video frame, the following operations are specifically performed:
performing image segmentation processing on any video frame in each video frame to obtain a plurality of image areas of the any video frame;
performing convolution operation on a plurality of image areas of any video frame to obtain an initial feature map of any video frame;
and performing text prediction on each image area of any video frame according to the initial feature map, and generating text features of any video frame according to a text prediction result and the initial feature map.
In one embodiment, the number of the initial feature maps is multiple;
the key frame generation unit 1501 specifically performs the following operations when performing text prediction on each image region of the video frame according to the initial feature map and generating a text feature of the video frame according to a text prediction result and the initial feature map:
merging the plurality of initial feature maps to obtain a reference feature map;
predicting text prediction probability of texts contained in each image area of any video frame based on the feature information indicated by the reference feature map; wherein the text prediction probability is taken as the text prediction result;
and respectively carrying out text positioning processing on each initial feature map based on the text prediction probability to obtain the text features of any video frame.
In an embodiment, when the key frame generating unit 1501 merges a plurality of initial feature maps to obtain a reference feature map, the following operations are specifically performed:
determining the attention degree of any initial feature map based on an attention mechanism, and taking the attention degree as the feature weight of any initial feature map;
and carrying out weighting and merging processing on the corresponding initial feature maps by adopting the feature weights to obtain the reference feature map.
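The attention-based merging of the initial feature maps into a reference feature map can be sketched as below; the pooling-plus-linear scoring used here to obtain the feature weights is an illustrative assumption, not the specific attention mechanism of the disclosure.

```python
import torch
import torch.nn as nn

class AttentionMerge(nn.Module):
    """Merge several initial feature maps into one reference feature map,
    weighting each map by an attention-derived feature weight."""

    def __init__(self, num_maps):
        super().__init__()
        # One attention score per initial feature map.
        self.score = nn.Linear(num_maps, num_maps)

    def forward(self, initial_maps):
        # initial_maps: (B, num_maps, H, W)
        pooled = initial_maps.mean(dim=(2, 3))               # per-map summary, (B, num_maps)
        weights = torch.softmax(self.score(pooled), dim=1)   # feature weight of each map
        weights = weights.unsqueeze(-1).unsqueeze(-1)        # (B, num_maps, 1, 1)
        reference_map = (initial_maps * weights).sum(dim=1, keepdim=True)
        return reference_map
```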
In an embodiment, the key frame generating unit 1501 specifically performs the following operations when performing text positioning processing on each initial feature map based on the text prediction probability to obtain a text feature of any video frame:
taking the image area of which the corresponding text prediction probability is less than or equal to a probability threshold value in each image area as a mask area;
for any initial feature map, performing masking processing on the position of the mask region in the any initial feature map to obtain a target feature map corresponding to the any initial feature map;
and taking the target feature map corresponding to each initial feature map as the text feature of any video frame.
In an embodiment, the video text extraction apparatus further includes a model training unit 1504, where the text feature of each video frame is obtained by invoking a trained text mask twin network, and when the model training unit 1504 is configured to obtain the trained text mask twin network, the following operations are specifically performed:
acquiring a training sample group; the training sample set comprises two sample video frames, a sample label of each sample video frame and a similarity true value between text features of the two sample video frames, wherein the sample label is used for indicating whether text is contained in the corresponding sample video frame;
inputting each sample video frame into a text mask twin network to obtain a prediction label of each sample video frame and text characteristics of each sample video frame, and obtaining prediction similarity between the text characteristics of the two sample video frames based on the text characteristics of each sample video frame; the prediction tag is used for indicating whether a prediction result of text is contained in a corresponding sample video frame;
and performing optimization training on the text mask twin network based on the sample label of each sample video frame, the prediction label of each sample video frame, the similarity truth value and the prediction similarity to obtain the trained text mask twin network.
In an embodiment, when the text recognition unit 1502 performs text recognition processing on each key frame to obtain a recognition text and a text region of each key frame, the following operations are specifically performed:
performing text detection processing on any key frame to obtain a text detection area where a text contained in the any key frame is located;
intercepting the text detection area, and performing text extraction processing on the text detection area to obtain a text extraction result;
and taking the text extraction result as the identification text of any key frame, and taking the text detection area as the text area where the identification text of any key frame is located.
In an embodiment, when the text recognition unit 1502 performs text detection processing on any one of the key frames to obtain a text detection area where a text included in the any one of the key frames is located, the following operations are specifically performed:
predicting whether the pixel attribute of each pixel is a positive pixel for constituting a text or not based on the pixel feature of each pixel in any key frame;
predicting, for any one of the respective pixels, whether the connection between the pixel and each adjacent pixel is a positive connection based on the pixel attribute of the pixel and the pixel attribute of each adjacent pixel; the positive connection is used to indicate: the pixel attributes of the two correspondingly connected pixels are both positive pixels, or the pixel attribute of one of the two correspondingly connected pixels is a positive pixel and the pixel attribute of the other pixel is a negative pixel which is not used for forming text;
determining a rotation angle of each pixel based on the rotation angle of any key frame;
and determining a text detection area where the text contained in any one key frame is located based on a connected domain formed by a plurality of pixels of which the pixel attributes are positive pixels and which are correspondingly connected as positive connections and the rotation angle of each pixel.
In an embodiment, the text recognition unit 1502 performs text extraction on the text detection area, and when a text extraction result is obtained, the following operations are specifically performed:
performing visual feature coding processing on the text detection area to obtain visual coding features corresponding to the text detection area;
predicting a first character prediction probability that each reference character in a plurality of reference characters is used as the nth character in the text extraction result based on the visual coding feature corresponding to the text detection area and the decoding reference character; n is a positive integer, when n is equal to 1, the decoding reference character is a special decoding character, and when n is greater than 1, the decoding reference character is the (n-1) th character in the text extraction result;
predicting a second character prediction probability taking each reference character as the nth character based on visual coding features corresponding to the text detection area and a sequence prediction algorithm;
and determining the nth character from the reference characters based on the first character prediction probability and the second character prediction probability.
According to an embodiment of the present application, the steps involved in the video text extraction methods shown in fig. 2, fig. 6, and fig. 14 may be performed by units in the video text extraction apparatus shown in fig. 15. For example, steps S201 to S202 shown in fig. 2 may be performed by the key frame generation unit 1501 in the video text extraction device shown in fig. 15, step S203 shown in fig. 2 may be performed by the text recognition unit 1502 in the video text extraction device shown in fig. 15, and step S204 shown in fig. 2 may be performed by the video text generation unit 1503 in the video text extraction device shown in fig. 15. As another example, steps S601 to S602 shown in fig. 6 may be performed by the key frame generating unit 1501 in the video text extracting apparatus shown in fig. 15, steps S603 to S605 shown in fig. 6 may be performed by the text recognizing unit 1502 in the video text extracting apparatus shown in fig. 15, and step S606 shown in fig. 6 may be performed by the video text generating unit 1503 in the video text extracting apparatus shown in fig. 15. For another example, when the video text extraction device shown in fig. 15 further includes the model training unit 1504, steps S1401 to S1403 shown in fig. 14 may be performed by the model training unit 1504 in the video text extraction device.
According to another embodiment of the present application, the units in the video text extraction apparatus shown in fig. 15 may be respectively or entirely combined into one or several other units to form the video text extraction apparatus, or some unit(s) thereof may be further split into multiple units with smaller functions to form the video text extraction apparatus, which may achieve the same operation without affecting the achievement of the technical effects of the embodiments of the present application. The units are divided based on logic functions, and in practical application, the functions of one unit can be realized by a plurality of units, or the functions of a plurality of units can be realized by one unit. In other embodiments of the present application, the video text extraction apparatus based on logical function division may also include other units, and in practical applications, these functions may also be implemented by being assisted by other units, and may be implemented by cooperation of multiple units.
According to another embodiment of the present application, the video text extraction apparatus as shown in fig. 15 may be constructed by running a computer program (including program codes) capable of executing the steps involved in the respective methods as shown in fig. 2, fig. 6, and fig. 14 on a general-purpose computing device such as a computer including a processing element such as a Central Processing Unit (CPU), a random access storage medium (RAM), a read-only storage medium (ROM), and a storage element, and implementing the video text extraction method according to the embodiment of the present application. The computer program may be embodied on a computer-readable storage medium, for example, and loaded into and executed by the above-described computing apparatus via the computer-readable storage medium.
In the embodiment of the application, text feature extraction processing can be performed on each video frame in the target video to obtain the text feature of each video frame; generating one or more key frames according to the text characteristics of each video frame; the similarity between the text features of any two key frames is smaller than a first similarity threshold; performing text recognition processing on each key frame to obtain a recognition text of each key frame and a text area where the recognition text is located; acquiring text contents of the identification texts of the key frames, and combining the identification texts of the key frames according to the text contents of the identification texts of the key frames and the corresponding text areas to obtain combined texts; the merged text is used as the video text of the target video; one or more key frames are generated according to the text characteristics of each video frame in the target video, and then the video text of the target video is obtained based on the text recognition processing of each key frame, so that the time consumption for performing the text recognition processing on each video frame in the target video is reduced, and the speed for extracting the text in the video can be improved.
Based on the related embodiment of the video text extraction method and the embodiment of the video text extraction device, the application also provides video text extraction equipment. Referring to fig. 16, a schematic structural diagram of a video text extraction device provided in the embodiment of the present application is shown. The video text extraction device shown in fig. 16 can include at least a processor 1601, an input interface 1602, an output interface 1603, and a computer storage medium 1604. The processor 1601, the input interface 1602, the output interface 1603, and the computer storage medium 1604 may be connected by a bus or other means.
A computer storage medium 1604 may be stored in the memory of the video text extraction device, the computer storage medium 1604 for storing a computer program, the computer program comprising program instructions, the processor 1601 for executing the program instructions stored by the computer storage medium 1604. The processor 1601 (or CPU) is a computing core and a control core of the video text extraction device, and is adapted to implement one or more instructions, and specifically, is adapted to load and execute the one or more instructions so as to implement the above-mentioned video text extraction method flow or corresponding functions.
The embodiment of the application also provides a computer storage medium (Memory), which is a Memory device in the video text extraction device and is used for storing programs and data. It is understood that the computer storage medium herein may include a built-in storage medium in the terminal, and may also include an extended storage medium supported by the terminal. The computer storage medium provides a storage space that stores an operating system of the terminal. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code) suitable for loading and execution by processor 1601. It should be noted that the computer storage medium herein may be a Random Access Memory (RAM) memory, or a non-volatile memory (non-volatile memory), such as at least one disk memory; and optionally at least one computer storage medium located remotely from the processor.
In one embodiment, one or more instructions stored in a computer storage medium may be loaded and executed by processor 1601 to implement the corresponding steps of the method in the embodiments of the video text extraction method described above with respect to fig. 2, 6, and 14, and in particular, the one or more instructions stored in the computer storage medium may be loaded and executed by processor 1601 to implement the following steps:
performing text feature extraction processing on each video frame in a target video to obtain text features of each video frame;
generating one or more key frames according to the text characteristics of the video frames; the similarity between the text features of any two key frames is smaller than a first similarity threshold;
performing text recognition processing on each key frame to obtain a recognition text of each key frame and a text area where the recognition text is located;
acquiring text contents of the identification texts of the key frames, and combining the identification texts of the key frames according to the text contents of the identification texts of the key frames and the corresponding text areas to obtain combined texts; wherein the merged text is taken as the video text of the target video.
In one embodiment, the video frames are sequentially arranged according to a corresponding playing sequence in the target video;
when the processor 1601 generates a key frame according to the text feature of each video frame, the following operations are specifically executed:
determining a related video frame set from each video frame according to the arrangement sequence of each video frame and the similarity between the text characteristics of any two adjacent video frames; the associated video frame set comprises video frames with the similarity between at least two corresponding text features larger than or equal to a second similarity threshold;
and merging each video frame in the associated video frame set to obtain the key frame.
In an embodiment, the processor 1601 is configured to perform text feature extraction processing on each video frame in a target video, and when obtaining a text feature of each video frame, specifically perform the following operations:
performing image segmentation processing on any video frame in each video frame to obtain a plurality of image areas of the any video frame;
performing convolution operation on a plurality of image areas of any video frame to obtain an initial feature map of any video frame;
and performing text prediction on each image area of any video frame according to the initial feature map, and generating text features of any video frame according to a text prediction result and the initial feature map.
In one embodiment, the number of the initial feature maps is plural;
the processor 1601 specifically performs the following operations when performing text prediction on each image region of the video frame according to the initial feature map and generating a text feature of the video frame according to a text prediction result and the initial feature map:
merging the plurality of initial feature maps to obtain a reference feature map;
predicting text prediction probability of texts contained in each image area of any video frame based on the feature information indicated by the reference feature map; wherein the text prediction probability is taken as the text prediction result;
and respectively carrying out text positioning processing on each initial feature map based on the text prediction probability to obtain the text features of any video frame.
In an embodiment, the processor 1601 is configured to perform merging processing on a plurality of initial feature maps to obtain a reference feature map, and specifically perform the following operations:
determining the attention degree of any initial feature map based on an attention mechanism, and taking the attention degree as the feature weight of any initial feature map;
and carrying out weighting and merging processing on the corresponding initial feature maps by adopting the feature weights to obtain the reference feature map.
In an embodiment, the processor 1601 is configured to perform text positioning processing on each initial feature map based on the text prediction probability, and when obtaining a text feature of any video frame, specifically perform the following operations:
taking the image area of which the corresponding text prediction probability is less than or equal to a probability threshold value in each image area as a mask area;
for any initial feature map, performing masking processing on the position of the mask region in the any initial feature map to obtain a target feature map corresponding to the any initial feature map;
and taking the target feature map corresponding to each initial feature map as the text feature of any video frame.
In an embodiment, the text feature of each video frame is obtained by invoking a trained text mask twin network, and when the processor 1601 is configured to obtain the trained text mask twin network, the following operations are specifically performed:
acquiring a training sample group; the training sample set comprises two sample video frames, a sample label of each sample video frame and a similarity true value between text features of the two sample video frames, wherein the sample label is used for indicating whether text is contained in the corresponding sample video frame;
inputting each sample video frame into a text mask twin network to obtain a prediction label of each sample video frame and text characteristics of each sample video frame, and obtaining prediction similarity between the text characteristics of the two sample video frames based on the text characteristics of each sample video frame; the prediction tag is used for indicating whether a prediction result of text is contained in a corresponding sample video frame;
and performing optimization training on the text mask twin network based on the sample label of each sample video frame, the prediction label of each sample video frame, the similarity truth value and the prediction similarity to obtain the trained text mask twin network.
In an embodiment, the processor 1601 performs text recognition processing on each key frame, and when obtaining a recognition text and a text region of each key frame, specifically performs the following operations:
performing text detection processing on any key frame to obtain a text detection area where a text contained in the any key frame is located;
intercepting the text detection area, and performing text extraction processing on the text detection area to obtain a text extraction result;
and taking the text extraction result as the identification text of any key frame, and taking the text detection area as the text area where the identification text of any key frame is located.
In an embodiment, the processor 1601 performs text detection processing on any one of the key frames, and when a text detection area where a text included in the any one of the key frames is located is obtained, specifically performs the following operations:
predicting whether the pixel attribute of each pixel is a positive pixel for constituting a text or not based on the pixel feature of each pixel in any key frame;
predicting, for any one of the respective pixels, whether the connection between the pixel and each adjacent pixel is a positive connection based on the pixel attribute of the pixel and the pixel attribute of each adjacent pixel; the positive connection is used to indicate: the pixel attributes of the two correspondingly connected pixels are both positive pixels, or the pixel attribute of one of the two correspondingly connected pixels is a positive pixel and the pixel attribute of the other pixel is a negative pixel which is not used for forming text;
determining a rotation angle of each pixel based on the rotation angle of any key frame;
and determining a text detection area where the text contained in any one key frame is located based on a connected domain formed by a plurality of pixels of which the pixel attributes are positive pixels and which are correspondingly connected as positive connections and the rotation angle of each pixel.
In one embodiment, the processor 1601 performs text extraction on the text detection area, and when a text extraction result is obtained, the following operations are specifically performed:
performing visual feature coding processing on the text detection area to obtain visual coding features corresponding to the text detection area;
predicting a first character prediction probability that each reference character in a plurality of reference characters is used as the nth character in the text extraction result based on the visual coding feature corresponding to the text detection area and the decoding reference character; n is a positive integer, when n is equal to 1, the decoding reference character is a special decoding character, and when n is greater than 1, the decoding reference character is the (n-1) th character in the text extraction result;
predicting a second character prediction probability taking each reference character as the nth character based on the visual coding feature corresponding to the text detection region and a sequence prediction algorithm;
and determining the nth character from the reference characters based on the first character prediction probability and the second character prediction probability.
Embodiments of the present application provide a computer program product, which includes a computer program, the computer program being stored in a computer storage medium; the processor of the video text extraction device reads the computer program from the computer storage medium, and the processor executes the computer program, so that the video text extraction device executes the method embodiments shown in fig. 2, fig. 6, and fig. 14. The computer-readable storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method for extracting video text, comprising:
performing text feature extraction processing on each video frame in a target video to obtain text features of each video frame;
generating one or more key frames according to the text characteristics of the video frames; the similarity between the text features of any two key frames is smaller than a first similarity threshold;
performing text recognition processing on each key frame to obtain a recognition text of each key frame and a text area where the recognition text is located;
acquiring text contents of the identification texts of the key frames, and combining the identification texts of the key frames according to the text contents of the identification texts of the key frames and the corresponding text areas to obtain combined texts; wherein the merged text is taken as the video text of the target video.
2. The method of claim 1, wherein the video frames are arranged in sequence according to a corresponding playing order in the target video;
according to the text characteristics of each video frame, the mode for generating a key frame comprises the following steps:
determining a related video frame set from each video frame according to the arrangement sequence of each video frame and the similarity between the text characteristics of any two adjacent video frames; the associated video frame set comprises video frames with the similarity between at least two corresponding text features larger than or equal to a second similarity threshold;
and merging each video frame in the associated video frame set to obtain the key frame.
3. The method of claim 1, wherein the performing text feature extraction processing on each video frame in the target video to obtain the text feature of each video frame comprises:
performing image segmentation processing on any video frame in each video frame to obtain a plurality of image areas of the any video frame;
performing convolution operation on a plurality of image areas of any video frame to obtain an initial feature map of any video frame;
and performing text prediction on each image area of any video frame according to the initial feature map, and generating text features of any video frame according to a text prediction result and the initial feature map.
4. The method of claim 3, wherein the number of the initial feature maps is plural;
the performing text prediction on each image area of any video frame according to the initial feature map, and generating text features of any video frame according to a text prediction result and the initial feature map includes:
merging the plurality of initial feature maps to obtain a reference feature map;
predicting text prediction probability of texts contained in each image area of any video frame based on feature information indicated by the reference feature map; wherein the text prediction probability is taken as the text prediction result;
and respectively carrying out text positioning processing on each initial feature map based on the text prediction probability to obtain the text features of any video frame.
5. The method of claim 4, wherein said merging the plurality of initial feature maps to obtain the reference feature map comprises:
determining the attention degree of any initial feature map based on an attention mechanism, and taking the attention degree as the feature weight of any initial feature map;
and carrying out weighting and merging processing on the corresponding initial feature maps by adopting the feature weights to obtain the reference feature map.
6. The method of claim 4, wherein the performing text localization processing on each initial feature map based on the text prediction probability to obtain the text feature of any video frame comprises:
taking the image area of which the corresponding text prediction probability is less than or equal to a probability threshold value in each image area as a mask area;
for any initial feature map, carrying out masking processing on the position of the mask region in the initial feature map to obtain a target feature map corresponding to the initial feature map;
and taking the target feature map corresponding to each initial feature map as the text feature of any video frame.
7. The method of claim 1, wherein the text features of each video frame are obtained by invoking a trained text mask twin network, the obtaining of the trained text mask twin network comprising:
acquiring a training sample group; the training sample set comprises two sample video frames, a sample label of each sample video frame and a similarity true value between text features of the two sample video frames, wherein the sample label is used for indicating whether text is contained in the corresponding sample video frame;
inputting each sample video frame into a text mask twin network to obtain a prediction label of each sample video frame and text characteristics of each sample video frame, and obtaining prediction similarity between the text characteristics of the two sample video frames based on the text characteristics of each sample video frame; the prediction tag is used for indicating whether a prediction result of text is contained in a corresponding sample video frame;
and performing optimization training on the text mask twin network based on the sample label of each sample video frame, the prediction label of each sample video frame, the similarity truth value and the prediction similarity to obtain the trained text mask twin network.
8. The method according to claim 1, wherein the performing text recognition processing on each key frame to obtain the recognized text of each key frame and the text region thereof comprises:
performing text detection processing on any key frame to obtain a text detection area where a text contained in the any key frame is located;
intercepting the text detection area, and performing text extraction processing on the text detection area to obtain a text extraction result;
and taking the text extraction result as the identification text of any key frame, and taking the text detection area as the text area where the identification text of any key frame is located.
9. The method of claim 8, wherein the performing text detection processing on any one of the key frames to obtain a text detection area where a text included in any one of the key frames is located comprises:
predicting whether the pixel attribute of each pixel is a positive pixel for constituting a text or not based on the pixel feature of each pixel in any key frame;
predicting, for any one of the respective pixels, whether the connection between the pixel and each adjacent pixel is a positive connection based on the pixel attribute of the pixel and the pixel attribute of each adjacent pixel; the positive connection is used to indicate: the pixel attributes of the two correspondingly connected pixels are both positive pixels, or the pixel attribute of one of the two correspondingly connected pixels is a positive pixel and the pixel attribute of the other pixel is a negative pixel which is not used for forming text;
determining a rotation angle of each pixel based on the rotation angle of any key frame;
and determining the text detection area where the text contained in any key frame is located based on the rotation angle of each pixel and a connected domain formed by a plurality of pixels, among the respective pixels, whose pixel attributes are positive pixels and whose corresponding connections are positive connections.
10. The method of claim 8, wherein the performing text extraction processing on the text detection area to obtain a text extraction result comprises:
performing visual feature coding processing on the text detection area to obtain visual coding features corresponding to the text detection area;
predicting a first character prediction probability that each reference character in a plurality of reference characters is used as the nth character in the text extraction result based on the visual coding feature corresponding to the text detection area and the decoding reference character; n is a positive integer, when n is equal to 1, the decoding reference character is a special decoding character, and when n is greater than 1, the decoding reference character is the (n-1) th character in the text extraction result;
predicting a second character prediction probability taking each reference character as the nth character based on the visual coding feature corresponding to the text detection region and a sequence prediction algorithm;
and determining the nth character from the reference characters based on the first character prediction probability and the second character prediction probability.
11. A video text extraction apparatus, comprising:
the key frame generating unit is used for extracting text characteristics of each video frame in the target video to obtain the text characteristics of each video frame;
the key frame generating unit is further used for generating one or more key frames according to the text characteristics of the video frames; the similarity between the text features of any two key frames is smaller than a first similarity threshold;
the text recognition unit is used for performing text recognition processing on each key frame to obtain a recognition text of each key frame and a text area where the recognition text is located;
the video text generation unit is used for acquiring the text content of the identification text of each key frame and merging the identification text of each key frame according to the text content of the identification text of each key frame and the corresponding text area to obtain a merged text; wherein the merged text is taken as the video text of the target video.
12. A video text extraction device, comprising an input interface and an output interface, and further comprising:
a processor adapted to implement one or more instructions; and the number of the first and second groups,
a computer storage medium having stored thereon one or more instructions adapted to be loaded by the processor and to perform the video text extraction method according to any of claims 1-10.
13. A computer storage medium having computer program instructions stored therein, which when executed by a processor, is configured to perform the video text extraction method of any one of claims 1-10.
14. A computer program product, characterized in that the computer program product comprises a computer program which, when being executed by a processor, is adapted to load and carry out a video text extraction method according to any one of claims 1-10.