CN117877016A - Video text extraction method, device, equipment and storage medium


Info

Publication number
CN117877016A
Authority
CN
China
Prior art keywords
text
video
image
video frame
text information
Legal status
Pending
Application number
CN202410024908.XA
Other languages
Chinese (zh)
Inventor
王小山 (Wang Xiaoshan)
Current Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Application filed by Ping An Property and Casualty Insurance Company of China Ltd
Priority application: CN202410024908.XA
Publication: CN117877016A


Classifications

    • G06V 20/635 — Scenes; scene-specific elements: text, e.g. overlay text or embedded captions in a TV program
    • G06V 20/46 — Scenes; scene-specific elements in video content: extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 30/1801 — Character recognition; extraction of features or characteristics of the image: detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G06V 30/19093 — Character recognition; recognition using electronic means: proximity measures, i.e. similarity or distance measures
    • G06V 30/19173 — Character recognition; design or setup of recognition systems or techniques: classification techniques


Abstract

The application relates to the field of financial technology and discloses a video text extraction method, apparatus, device and storage medium. The video text extraction method comprises: determining a target video to be subjected to text extraction, and acquiring each video frame corresponding to the target video, wherein the video frames comprise a target object; identifying mouth shape changes of the target object in each video frame to obtain first text information; screening key video frames from the video frames, wherein the text similarity of any two key video frames is smaller than a preset value; identifying text features of text regions in the key video frames to obtain second text information; and generating a video text corresponding to the target video according to the first text information and the second text information, so that the text in the video can be accurately extracted.

Description

Video text extraction method, device, equipment and storage medium
Technical Field
The present application relates to the field of financial technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting video text.
Background
Video text extraction aims to extract the text contained in a video. In the related art, when a user wants to extract the text in a video, the text contained in each video frame of the video is typically extracted, and the duplicated extracted text is merged to obtain the video text of the video.
However, in the related art, the background color or graphics of a video frame may degrade the text recognition effect, so the accuracy of the text extracted from the video is low.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a video text extraction method, apparatus, device and storage medium, with the aim of accurately extracting the text in a video.
In a first aspect, an embodiment of the present application provides a method for extracting video text, where the method includes:
determining a target video to be subjected to text extraction, and acquiring each video frame corresponding to the target video, wherein the video frame comprises a target object;
identifying mouth shape changes of the target object in each of the video frames to obtain first text information;
screening key video frames from the video frames, wherein the text similarity of any two key video frames is smaller than a preset value;
identifying text features of text regions in the key video frames to obtain second text information;
and generating a video text corresponding to the target video according to the first text information and the second text information.
In a second aspect, an embodiment of the present application further provides a video text extraction apparatus, including:
the frame extraction module is used for determining a target video to be subjected to text extraction and acquiring each video frame corresponding to the target video, wherein the video frames comprise target objects;
the mouth shape recognition module is used for recognizing mouth shape changes of the target object in each video frame to obtain first text information;
the frame screening module is used for screening key video frames from the video frames, wherein the text similarity of any two key video frames is smaller than a preset value;
the text extraction module is used for identifying text features of the text regions in the key video frames to obtain second text information;
and the text generation module is used for generating a video text corresponding to the target video according to the first text information and the second text information.
In a third aspect, embodiments of the present application further provide a video text extraction device, comprising a processor, a memory, a computer program stored on the memory and executable by the processor, and a data bus for enabling connection and communication between the processor and the memory, wherein the computer program, when executed by the processor, implements the steps of the video text extraction method provided by any embodiment of the present application.
In a fourth aspect, embodiments of the present application further provide a storage medium for computer readable storage, where the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps of a video text extraction method as provided in any of the embodiments of the present application.
The embodiments of the present application provide a video text extraction method, apparatus, device and storage medium, wherein the video text extraction method is applied to a video text extraction device and comprises: determining a target video to be subjected to text extraction, and acquiring each video frame corresponding to the target video, wherein the video frames comprise a target object; identifying mouth shape changes of the target object in each video frame to obtain first text information; screening key video frames from the video frames, wherein the text similarity of any two key video frames is smaller than a preset value; identifying text features of text regions in the key video frames to obtain second text information; and generating a video text corresponding to the target video according to the first text information and the second text information.
According to the method, the video frames are screened to obtain key video frames, the text features and the mouth shape changes of the target object are extracted from the key video frames, and the second text information extracted from the key video frames is fused with the first text information extracted from the mouth shape changes, so that the obtained video text is more accurate. Moreover, in the text extraction process, the key video frames are obtained by screening the video frames, so text extraction and mouth shape recognition do not need to be performed on all video frames, which effectively improves the efficiency of obtaining the video text.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; a person skilled in the art may derive other drawings from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of steps of a video text extraction method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an application scenario of an electronic device according to an embodiment of the present application;
fig. 3 is a schematic diagram of a scene in which a target video is divided into a plurality of video frames according to an embodiment of the present application;
fig. 4 is a schematic diagram of the position of a text region in a video frame according to an embodiment of the present application;
fig. 5 is a schematic diagram of a scene in which key video frames are extracted from video frames according to an embodiment of the present application;
fig. 6 is a schematic diagram of a scene in which a target text is obtained according to first text information and second text information according to an embodiment of the present application;
fig. 7 is a schematic block diagram of a video text extraction device according to an embodiment of the present application;
fig. 8 is a schematic block diagram of a video text extraction apparatus according to an embodiment of the present application.
Detailed Description
The following clearly and fully describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without inventive effort shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative; not all of the elements and operations/steps are necessarily included, nor must they be performed in the order described. For example, some operations/steps may be further divided, combined, or partially merged, so the actual execution order may change according to the actual situation.
It is to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
Video text extraction aims to extract the text contained in a video. In the related art, when a user wants to extract the text in a video, the text contained in each video frame of the video is typically extracted, and the duplicated extracted text is merged to obtain the video text of the video.
However, in the related art, the background color or graphics of a video frame may degrade the text recognition effect, so the accuracy of the text extracted from the video is low.
Based on the above, the present application provides a video text extraction method, apparatus, device and storage medium, with the aim of accurately extracting the text in a video. The method determines a target video to be subjected to text extraction and acquires each video frame corresponding to the target video, wherein the video frames comprise a target object; identifies mouth shape changes of the target object in each video frame to obtain first text information; screens key video frames from the video frames, wherein the text similarity of any two key video frames is smaller than a preset value; identifies text features of text regions in the key video frames to obtain second text information; and generates a video text corresponding to the target video according to the first text information and the second text information.
According to the method, the video frames are screened to obtain key video frames, the text features and the mouth shape changes of the target object are extracted from the key video frames, and the second text information extracted from the key video frames is fused with the first text information extracted from the mouth shape changes, so that the obtained video text is more accurate. Moreover, in the text extraction process, the key video frames are obtained by screening the video frames, so text extraction and mouth shape recognition do not need to be performed on all video frames, which effectively improves the efficiency of obtaining the video text.
The video text extraction method may be executed by a video text extraction device, which may be a terminal device or a server. The terminal device may include, but is not limited to: a computer, a smart phone, a tablet computer, a notebook computer, an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, a smart wearable device, and the like. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data and artificial intelligence platforms.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
It should be specifically noted that the embodiments of the present application involve data related to a user, for example, where the target video is a video generated by a user. When the embodiments of the present application are applied to a specific product or technology, the permission or consent of the user needs to be obtained, and the collection, use and processing of the related data need to comply with local laws, regulations and standards.
Referring to fig. 1, fig. 1 is a schematic flow chart of steps of a video text extraction method according to an embodiment of the present application.
As shown in fig. 1, the video text extraction method includes steps S101 to S105.
Step S101: determining a target video to be subjected to text extraction, and acquiring each video frame corresponding to the target video, wherein the video frames comprise a target object.
By way of example, the target video may be any lawfully usable video, such as, but not limited to, a financial product related video or a financial business related video, wherein financial product related videos include financial product explanation videos and financial product promotion videos. For example, a financial product explanation video may be an insurance business explanation video, such as an auto insurance explanation or a financial insurance explanation.
For example, where the target video is a financial product explanation video: during staff induction training, a financial company usually lets staff quickly understand the product attributes of various financial products through explanation videos presented by target objects (such as product managers or senior staff), so that customers can be offered high-quality solutions during consultation and sales of the financial products are promoted.
During training, customer service personnel generally need to quickly and accurately acquire and record the key information in a video to deepen their impression of the product attributes. This information can be recorded in various forms, usually as text notes, for efficient later reading or review. Accurate and efficient extraction of the video text therefore facilitates the training of customer service personnel.
As shown in fig. 2, the videos to be subjected to text extraction are placed in a preset storage location, for example, the local memory of the video text extraction device. Through a preset display interface 101 of the video text extraction device 10, a display window 102 in the display interface 101 presents the videos available for text extraction to the user; for example, the videos available for text extraction in the display window 102 include a first video, a second video, a third video and a fourth video.
In response to the user selecting a target video from the videos to be subjected to text extraction through a preset operation on the display interface 101, the video text extraction device 10 may extract the plurality of video frames constituting the target video using preset video processing software. For example, after the user selects the first video in the display window 102 and clicks the OK button, the first video is selected as the target video.
After the target video is determined, the video text extraction device 10 acquires, through preset video frame extraction software, the plurality of video frames constituting the target video, each of which includes the target object. As shown in fig. 3, the target video is split into N video frames.
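By way of illustration only, and not as part of the original disclosure, splitting a target video into its constituent frames could be sketched in Python with OpenCV as follows; the file name, the sampling interval and the helper name are assumptions:

```python
# Illustrative sketch only: split a target video into its video frames.
# "target_video.mp4" and the sampling interval are hypothetical.
import cv2

def extract_frames(video_path: str, every_nth: int = 1) -> list:
    """Return the frames of the video at video_path, keeping every n-th one."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:          # end of stream
            break
        if index % every_nth == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames

frames = extract_frames("target_video.mp4")
print(f"Split the target video into {len(frames)} video frames")
```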
A text region is typically included in a video frame for displaying text information, such as the speech content of the target object in the video frame. As shown in fig. 4, the video frame includes a text region and an image display area, and the text region is located within the image display area; that is, the text is displayed at a preset position in the image display area, the target object is displayed in the image display area, and the corresponding explanation content of the target object is displayed in the text region during the explanation.
Step S102: identifying mouth shape changes of the target object in each video frame to obtain first text information.
Illustratively, the target object in each video frame is identified, mouth shape deformation features of the target object are extracted according to the time sequence corresponding to the video frames, and the words spoken by the target object are determined according to the mouth shape deformation features to obtain the first text information.
In some embodiments, the identifying mouth shape changes of the target object in each video frame to obtain the first text information comprises:
acquiring the playing time of each key video frame, and extracting the mouth shape image of the target object in each key video frame;
and inputting each mouth shape image into a preset lip language recognition model in the order of the playing times to obtain the first text information.
Illustratively, the acquired target video includes N video frames, that is, N video images. A mouth shape image of the target object is extracted from each of the N video images, each extracted mouth shape image is assigned a corresponding timestamp according to the play order of the frames in the target video, and the mouth shape images are input into the lip language recognition model in timestamp order to obtain the first text information corresponding to the target object in the target video.
For example, a first mouth shape image is acquired from the first video frame of the target video, a second mouth shape image from the second video frame, a third mouth shape image from the third video frame, and so on until an N-th mouth shape image is acquired from the N-th video frame. Each mouth shape image is given a timestamp according to the time order of its video frame, so that the order of the target object's mouth shape changes is accurately identified, and the mouth shape images acquired from the first to the N-th video frames are input into the lip language recognition model in timestamp order to obtain the first text information corresponding to the target object in the target video.
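A minimal sketch of this timestamp-ordered lip-reading step is given below for illustration; it assumes OpenCV's stock Haar face detector, approximates the mouth region as the lower third of the detected face box, and treats the lip language recognition model as an opaque callable, since the patent does not specify one:

```python
# Illustrative sketch only: crop a mouth image from each frame, tag it with
# the frame's play time, and decode the time-ordered sequence with a
# lip-reading model. The detector heuristic and lip_model are stand-ins.
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_mouth(frame):
    """Heuristic mouth crop: lower third of the first detected face box."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    return frame[y + 2 * h // 3: y + h, x: x + w]

def lip_read(frames, fps, lip_model):
    """Assign timestamps in play order and feed the crops to the model."""
    timed = []
    for i, frame in enumerate(frames):
        mouth = crop_mouth(frame)
        if mouth is not None:
            timed.append((i / fps, mouth))   # (timestamp, mouth image)
    timed.sort(key=lambda pair: pair[0])     # timestamp order
    return lip_model([mouth for _, mouth in timed])  # -> first text information
```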
Step S103: screening key video frames from the video frames, wherein the text similarity of any two key video frames is smaller than a preset value.
As shown in fig. 4, a text region is typically included in a video frame for displaying text information, such as the speech content of the target object in the video frame.
Among the plurality of video frames, there may be video frames whose text similarity exceeds the preset value; in general, the content displayed in such video frames is essentially the same. Therefore, if text extraction were performed on all video frames constituting the target video, the same text content could be extracted from different video frames, a large amount of text deduplication would then be required during the later text merging, and the text extraction efficiency would ultimately suffer.
In the method, the plurality of video frames extracted from the target video are subjected to screening or deduplication to obtain the key video frames of the target video, where the text similarity of any two key video frames is smaller than the preset value; that is, the similarity of the image features corresponding to the text regions of any two key video frames is smaller than the preset value. The preset value is set as required, for example, according to empirical values from previous text extraction processes.
In some embodiments, the screening key video frames from the video frames comprises:
sorting the video frames according to their playing times, and classifying the sorted video frames into a plurality of associated video frame sets according to the text similarity between adjacent video frames;
and determining a key video frame from each associated video frame set to obtain the key video frames corresponding to the target video.
Referring to fig. 5, illustratively, the acquired video frames are sorted according to the playing order, that is, the order of their playing times; then, by comparing the text similarity between adjacent video frames, adjacent video frames whose similarity exceeds the preset value are classified into one set, so that the video frames can be divided into a plurality of associated video frame sets. At this point, the texts of the video frames within each associated video frame set can be considered approximately, or even exactly, the same, and a corresponding key frame can therefore be extracted from each associated video frame set.
For example, the target video is divided into N video frames, and the N video frames are sorted in time order. The text region of each video frame can be identified by a model, and the image similarity of the text regions of adjacent video frames is compared to determine whether the adjacent video frames belong to the same associated video frame set. If the similarity of the text regions of the first video frame and the second video frame exceeds the preset value, the first and second video frames form one associated video frame set; if the similarity of the text regions of the third, fourth and fifth video frames exceeds the preset value, these three frames form another associated video frame set. In this way, M associated video frame sets are obtained from the N video frames, where N > M.
Since the similarity of the text regions of the video frames in each associated video frame set exceeds the preset value, the text regions of the video frames in one associated video frame set can be considered the same. Therefore, one video frame is arbitrarily selected from each associated video frame set as the key video frame, or the video frames in the associated video frame set are merged to obtain one key video frame.
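For illustration, the grouping of play-order-adjacent frames into associated video frame sets and the selection of one key video frame per set could look like the following sketch; the text-region box, the similarity measure (mean absolute gray difference) and the threshold value are assumptions:

```python
# Illustrative sketch only: group adjacent frames whose text regions are
# nearly identical into associated sets, then keep one key frame per set.
import cv2
import numpy as np

def region_similarity(frame_a, frame_b, box):
    """Similarity in [0, 1] between the text regions of two frames."""
    x, y, w, h = box
    a = cv2.cvtColor(frame_a[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    b = cv2.cvtColor(frame_b[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    diff = np.abs(a.astype(np.float32) - b.astype(np.float32)) / 255.0
    return 1.0 - float(diff.mean())

def screen_key_frames(frames, box, threshold=0.95):
    """Split frames (in play order) into associated sets; return key frames."""
    if not frames:
        return []
    key_frames, current_set = [], [frames[0]]
    for prev, curr in zip(frames, frames[1:]):
        if region_similarity(prev, curr, box) > threshold:
            current_set.append(curr)           # same associated set
        else:
            key_frames.append(current_set[0])  # one representative per set
            current_set = [curr]
    key_frames.append(current_set[0])
    return key_frames
```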
Step S104: identifying text features of the text regions in each key video frame to obtain second text information.
Illustratively, text features of the text regions in the key video frames are identified by OCR (Optical Character Recognition) technology to obtain the second text information. Alternatively, each key video frame is input into a corresponding text recognition model, the text features of the text region in each key video frame are identified by the text recognition model, and the text features are sorted according to the time order of the key video frames to obtain the second text information.
In some embodiments, the identifying text features of the text regions in the key video frames to obtain second text information comprises:
marking the text regions in the key video frames using bounding boxes;
and identifying the text features in the marked text regions to obtain the second text information.
Illustratively, the text region in the key video frame is marked with a bounding box by a marking model, the text features of the text region in the key video frame are identified using OCR text recognition technology, and the text features are then sorted according to the time order of the key video frames to obtain the second text information.
Alternatively, the text region in the key video frame is marked with a bounding box, the marked text region is extracted to obtain a text image, and the text images are input into a preset text recognition model in time order, so that the text features in the text regions are recognized by the text recognition model to obtain the second text information.
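The following sketch illustrates this marking-and-recognition step under stated assumptions: the text-region box is known in advance, and the pytesseract OCR binding (with a Chinese language pack installed) stands in for the unspecified text recognition model:

```python
# Illustrative sketch only: mark the text region of each key frame with a
# bounding box, extract the marked region, and OCR the crops in time order.
import cv2
import pytesseract

def recognize_text_regions(key_frames, box):
    """Return the second text information: OCR results in key-frame order."""
    x, y, w, h = box
    lines = []
    for frame in key_frames:
        crop = frame[y:y + h, x:x + w].copy()           # extract text region
        cv2.rectangle(frame, (x, y), (x + w, y + h),
                      (0, 255, 0), 2)                   # mark it for display
        text = pytesseract.image_to_string(crop, lang="chi_sim")
        lines.append(text.strip())
    return "\n".join(lines)
```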
In some embodiments, the identifying the text features within the marked text region to obtain second text information comprises:
extracting the text region marked by the bounding box from the key video frame to obtain a first text image;
performing a preset image processing operation on the first text image to obtain a second text image;
and identifying character features in the second text image to obtain the second text information;
wherein the preset image processing operation comprises at least one of: extracting a foreground image of the first text image, eliminating a background image of the first text image, and replacing a background image of the first text image.
Optionally, the preset image processing operation is replacing the background image of the first text image, and the performing the preset image processing operation on the first text image to obtain a second text image comprises:
acquiring gray features of the text information in the first text image, and acquiring a background replacement image according to the gray features;
and replacing the background image of the first text image with the background replacement image to obtain the second text image.
Illustratively, the text information is typically displayed over a background image in the video frame, and when the background color is similar to the text color, the recognition of the text in the text region may be affected. Moreover, since the text information is usually displayed in the foreground image of the video frame, extracting the foreground image can effectively improve the text recognition effect for the text region.
Therefore, in the method, after the text region of the key video frame is extracted to obtain the first text image, the foreground image of the first text image is extracted and/or the background image of the first text image is replaced, which effectively reduces the interference of the background with the text features of the text region during text extraction and improves the recognition effect for the text region.
For example, if the text in the text region of the key video frame is displayed in a white font (gray value 255), replacing the background of the text region with black or gray before the text features are extracted enhances the display of the text, effectively highlighting the text features of the text region, improving the text recognition accuracy, and making the acquired second text information more accurate.
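A minimal sketch of such a background replacement is shown below; the use of Otsu thresholding to separate text pixels from background pixels, and the heuristic that the text occupies the minority of pixels, are assumptions not taken from the disclosure:

```python
# Illustrative sketch only: replace the background of a text image with a
# color that contrasts with the text's gray level.
import cv2
import numpy as np

def replace_background(first_text_image):
    """Return a second text image whose background contrasts with the text."""
    gray = cv2.cvtColor(first_text_image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Heuristic: the text usually covers fewer pixels than the background.
    if (binary == 255).sum() < binary.size / 2:
        text_mask = binary == 255
    else:
        text_mask = binary == 0
    text_gray = float(gray[text_mask].mean()) if text_mask.any() else 255.0
    # Light text gets a black background; dark text gets a white one.
    background = 0 if text_gray > 127 else 255
    second = np.full_like(gray, background)
    second[text_mask] = gray[text_mask]      # keep only the text pixels
    return second
```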
Step S105: generating a video text corresponding to the target video according to the first text information and the second text information.
In general, OCR (Optical Character Recognition) technology has a high text recognition rate, but the recognition accuracy for some characters depends on the clarity of the text in the text region. Therefore, for characters that are prone to misrecognition, an error word library can be established: the misrecognized text in the second text information is identified through the error word library, the corresponding replacement text is obtained from the first text information, and the misrecognized text is replaced with the replacement text, so that the obtained video text has high accuracy.
In some embodiments, the generating the video text corresponding to the target video according to the second text information and the first text information comprises:
judging whether the second text information has a text recognition error;
when the second text information has a text recognition error, acquiring the misrecognition position corresponding to the misrecognized text in the second text information, and acquiring a replacement text from the first text information according to the misrecognition position;
and replacing the misrecognized text at the misrecognition position with the replacement text to obtain the video text corresponding to the target video.
As shown in fig. 6, the video text extraction device is provided with an error word library for text recognition, and the error word library is compared against the second text information to judge whether misrecognized text exists in the second text information. When misrecognized text exists in the second text information, the misrecognition position of the misrecognized text in the second text information is determined, and the corresponding replacement text is acquired from the first text information according to the misrecognition position, so that the misrecognized text in the second text information is replaced with the replacement text to obtain the video text corresponding to the target video.
In some embodiments, the generating the video text corresponding to the target video according to the second text information and the first text information comprises:
judging whether the second text information has a text recognition error;
when the second text information has a text recognition error, acquiring the misrecognition position corresponding to the misrecognized text in the second text information, and evaluating the text semantics of the misrecognized text in the second text information;
acquiring a replacement text from the first text information according to the misrecognition position and the text semantics;
and replacing the misrecognized text at the misrecognition position with the replacement text to obtain the video text corresponding to the target video.
Illustratively, the video text extraction device is provided with an error word library for text recognition, and the error word library is compared against the second text information to judge whether misrecognized text exists in the second text information. When misrecognized text exists in the second text information, the misrecognition position of the misrecognized text in the second text information is determined, and the text semantics of the misrecognized text are evaluated in combination with the context of the misrecognized text.
The replacement text area in the first text information is then located according to the misrecognition position, and text whose semantic similarity to the evaluated text semantics exceeds a preset value is acquired from the replacement text area as the corresponding replacement text, so that the misrecognized text in the second text information is replaced with the replacement text to obtain the video text corresponding to the target video.
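For illustration, the simpler position-based variant of this replacement could be sketched as follows; the character-level alignment between the lip-reading text and the OCR text is a simplifying assumption, and the semantic-similarity check described above is omitted:

```python
# Illustrative sketch only: replace entries from an error word library in the
# OCR text (second text information) with the lip-reading text (first text
# information) at the same character position.
def fuse_texts(first_text: str, second_text: str, error_words: set) -> str:
    """Return the video text with misrecognized words replaced."""
    video_text = second_text
    for word in error_words:
        pos = video_text.find(word)
        while pos != -1:
            end = pos + len(word)
            if end <= len(first_text):
                # Take the replacement text at the misrecognition position.
                video_text = video_text[:pos] + first_text[pos:end] + video_text[end:]
            pos = video_text.find(word, end)
    return video_text
```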
Referring to fig. 7, fig. 7 is a schematic block diagram of a video text extraction apparatus according to an embodiment of the present application.
As shown in fig. 7, the video text extraction apparatus 200 includes a frame extraction module 201, a mouth shape recognition module 202, a frame screening module 203, a text extraction module 204, and a text generation module 205. The frame extraction module 201 is configured to determine a target video to be subjected to text extraction and obtain each video frame corresponding to the target video, where the video frames include a target object. The mouth shape recognition module 202 is configured to recognize mouth shape changes of the target object in each video frame to obtain first text information. The frame screening module 203 is configured to screen key video frames from the video frames, where the text similarity of any two key video frames is smaller than a preset value. The text extraction module 204 is configured to identify text features of the text regions in the key video frames to obtain second text information. The text generation module 205 is configured to generate the video text corresponding to the target video according to the first text information and the second text information.
In some embodiments, the identifying mouth shape changes of the target object in each video frame to obtain the first text information comprises:
acquiring the playing time of each key video frame, and extracting the mouth shape image of the target object in each key video frame;
and inputting each mouth shape image into a preset lip language recognition model in the order of the playing times to obtain the first text information.
In some embodiments, the screening key video frames from the video frames comprises:
sorting the video frames according to their playing times, and classifying the sorted video frames into a plurality of associated video frame sets according to the text similarity between adjacent video frames;
and determining the corresponding key video frames according to the associated video frame sets to obtain the key video frames corresponding to the target video.
In some embodiments, the identifying text features of the text regions in the key video frames to obtain second text information comprises:
marking the text regions in the key video frames using bounding boxes;
and identifying the text features in the marked text regions to obtain the second text information.
In some embodiments, the identifying the text features within the marked text region to obtain second text information comprises:
extracting the text region marked by the bounding box from the key video frame to obtain a first text image;
performing a preset image processing operation on the first text image to obtain a second text image;
and identifying character features in the second text image to obtain the second text information;
wherein the preset image processing operation comprises at least one of: extracting a foreground image of the first text image, eliminating a background image of the first text image, and replacing a background image of the first text image.
In some embodiments, the preset image processing operation is replacing the background image of the first text image, and the performing the preset image processing operation on the first text image to obtain a second text image comprises:
acquiring gray features of the text information in the first text image, and acquiring a background replacement image according to the gray features;
and replacing the background image of the first text image with the background replacement image to obtain the second text image.
In some embodiments, the generating the video text corresponding to the target video according to the second text information and the first text information comprises:
judging whether the second text information has a text recognition error;
when the second text information has a text recognition error, acquiring the misrecognition position corresponding to the misrecognized text in the second text information, and evaluating the text semantics of the misrecognized text in the second text information;
acquiring a replacement text from the first text information according to the misrecognition position and the text semantics;
and replacing the misrecognized text at the misrecognition position with the replacement text to obtain the video text corresponding to the target video.
It will be appreciated that the video text extraction apparatus 200 may be applied to a video text extraction device and used to perform the steps of the video text extraction method provided in any of the embodiments of the present application.
It should be noted that, for convenience and brevity of description, the specific working process of the video text extraction apparatus 200 described above may refer to the corresponding process in the foregoing embodiment of the video text extraction method, and will not be described herein again.
Referring to fig. 8, fig. 8 is a schematic block diagram of a video text extraction apparatus according to an embodiment of the present application.
As shown in fig. 8, the video text extraction apparatus 10 includes a processor 11 and a memory 12, the processor 11 and the memory 12 being connected by a bus 13, such as an I2C (Inter-integrated Circuit) bus.
In particular, the processor 11 is operative to provide computing and control capabilities supporting the operation of the entire video text extraction device 10. The processor 11 may be a central processing unit (Central Processing Unit, CPU), and the processor 11 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. Wherein the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Specifically, the memory 12 may be a Flash chip, a Read-Only Memory (ROM), a magnetic disk, an optical disc, a USB flash drive, a removable hard disk, or the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 8 is merely a block diagram of a portion of the structure associated with the embodiments of the present application and does not limit the video text extraction device to which the embodiments of the present application are applied; a particular video text extraction device may include more or fewer components than shown, combine some components, or have a different arrangement of components.
The processor 11 is configured to execute a computer program stored in the memory, and implement any one of the video text extraction methods provided in the embodiments of the present application when the computer program is executed.
In some embodiments, the processor 11 is configured to run a computer program stored in a memory and to implement the following steps when executing the computer program:
determining a target video to be subjected to text extraction, and acquiring each video frame corresponding to the target video, wherein the video frame comprises a target object;
identifying mouth shape changes of the target object in each of the video frames to obtain first text information;
screening key video frames from the video frames, wherein the text similarity of any two key video frames is smaller than a preset value;
identifying text features of text regions in the key video frames to obtain second text information;
and generating a video text corresponding to the target video according to the first text information and the second text information.
In some embodiments, the identifying mouth shape changes of the target object in each video frame to obtain the first text information comprises:
acquiring the playing time of each key video frame, and extracting the mouth shape image of the target object in each key video frame;
and inputting each mouth shape image into a preset lip language recognition model in the order of the playing times to obtain the first text information.
In some embodiments, the screening key video frames from the video frames comprises:
sorting the video frames according to their playing times, and classifying the sorted video frames into a plurality of associated video frame sets according to the text similarity between adjacent video frames;
and determining the corresponding key video frames according to the associated video frame sets to obtain the key video frames corresponding to the target video.
In some embodiments, the identifying text features of the text regions in the key video frames to obtain second text information comprises:
marking the text regions in the key video frames using bounding boxes;
and identifying the text features in the marked text regions to obtain the second text information.
In some embodiments, the identifying the text features within the marked text region to obtain second text information comprises:
extracting the text region marked by the bounding box from the key video frame to obtain a first text image;
performing a preset image processing operation on the first text image to obtain a second text image;
and identifying character features in the second text image to obtain the second text information;
wherein the preset image processing operation comprises at least one of: extracting a foreground image of the first text image, eliminating a background image of the first text image, and replacing a background image of the first text image.
In some embodiments, the preset image processing operation is replacing the background image of the first text image, and the performing the preset image processing operation on the first text image to obtain a second text image comprises:
acquiring gray features of the text information in the first text image, and acquiring a background replacement image according to the gray features;
and replacing the background image of the first text image with the background replacement image to obtain the second text image.
In some embodiments, the generating the video text corresponding to the target video according to the second text information and the first text information comprises:
judging whether the second text information has a text recognition error;
when the second text information has a text recognition error, acquiring the misrecognition position corresponding to the misrecognized text in the second text information, and evaluating the text semantics of the misrecognized text in the second text information;
acquiring a replacement text from the first text information according to the misrecognition position and the text semantics;
and replacing the misrecognized text at the misrecognition position with the replacement text to obtain the video text corresponding to the target video.
It should be noted that, for convenience and brevity of description, specific working processes of the video text extraction apparatus described above may refer to corresponding processes in the foregoing embodiments of the video text extraction method, and are not repeated herein.
The embodiments of the present application also provide a storage medium for computer readable storage, where the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps of the video text extraction method as provided in any of the embodiments of the present application.
The storage medium may be an internal storage unit of the video text extraction apparatus of the foregoing embodiment, for example, a hard disk or a memory of the video text extraction apparatus. The storage medium may also be an external storage device of the video text extraction device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the video text extraction device.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware embodiment, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
It should be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article or system that comprises that element.
The foregoing embodiment numbers of the present application are for description only and do not represent the advantages or disadvantages of the embodiments. The foregoing is merely a specific implementation of the present application, but the protection scope of the present application is not limited thereto; any equivalent modification or substitution readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for extracting video text, the method comprising:
determining a target video to be subjected to text extraction, and acquiring each video frame corresponding to the target video, wherein the video frame comprises a target object;
identifying mouth shape changes of the target object in each of the video frames to obtain first text information;
screening key video frames from the video frames, wherein the text similarity of any two key video frames is smaller than a preset value;
identifying text features of text regions in the key video frames to obtain second text information;
and generating a video text corresponding to the target video according to the first text information and the second text information.
2. The method of claim 1, wherein the identifying mouth shape changes of the target object in each of the video frames to obtain first text information comprises:
acquiring the playing time of each key video frame, and extracting the mouth shape image of the target object in each key video frame;
and inputting each mouth shape image into a preset lip language recognition model in the order of the playing times to obtain the first text information.
3. The method of claim 1, wherein the screening key video frames from each of the video frames comprises:
sorting the video frames according to their playing times, and classifying the sorted video frames into a plurality of associated video frame sets according to the text similarity between adjacent video frames;
and determining the corresponding key video frames according to the associated video frame sets to obtain the key video frames corresponding to the target video.
4. The method of claim 1, wherein the identifying text features of text regions in the key video frames to obtain second text information comprises:
marking the text regions in the key video frames using bounding boxes;
and identifying the text features in the marked text regions to obtain the second text information.
5. The method of claim 4, wherein the identifying the text features within the marked text region to obtain second text information comprises:
extracting the text region marked by the bounding box from the key video frame to obtain a first text image;
performing a preset image processing operation on the first text image to obtain a second text image;
and identifying character features in the second text image to obtain the second text information;
wherein the preset image processing operation comprises at least one of: extracting a foreground image of the first text image, eliminating a background image of the first text image, and replacing a background image of the first text image.
6. The method of claim 5, wherein the preset image processing operation is replacing the background image of the first text image, and the performing the preset image processing operation on the first text image to obtain a second text image comprises:
acquiring gray features of the text information in the first text image, and acquiring a background replacement image according to the gray features;
and replacing the background image of the first text image with the background replacement image to obtain the second text image.
7. The method of any of claims 1 to 6, wherein the generating the video text corresponding to the target video according to the second text information and the first text information comprises:
judging whether the second text information has a text recognition error;
when the second text information has a text recognition error, acquiring the misrecognition position corresponding to the misrecognized text in the second text information, and evaluating the text semantics of the misrecognized text in the second text information;
acquiring a replacement text from the first text information according to the misrecognition position and the text semantics;
and replacing the misrecognized text at the misrecognition position with the replacement text to obtain the video text corresponding to the target video.
8. A video text extraction apparatus, comprising:
the frame extraction module is used for determining a target video to be subjected to text extraction and acquiring each video frame corresponding to the target video, wherein the video frames comprise target objects;
the mouth shape recognition module is used for recognizing mouth shape changes of the target object in each video frame to obtain first text information;
the frame screening module is used for screening key video frames from the video frames, wherein the text similarity of any two key video frames is smaller than a preset value;
the text extraction module is used for identifying text features of the text regions in the key video frames to obtain second text information;
and the text generation module is used for generating a video text corresponding to the target video according to the first text information and the second text information.
9. A video text extraction device, comprising a processor, a memory, a computer program stored on the memory and executable by the processor, and a data bus for enabling connection and communication between the processor and the memory, wherein the computer program, when executed by the processor, implements the steps of the video text extraction method according to any one of claims 1 to 7.
10. A storage medium for computer readable storage, wherein the storage medium stores one or more programs executable by one or more processors to implement the steps of the video text extraction method of any one of claims 1 to 7.

Priority Applications (1)

CN202410024908.XA — priority date 2024-01-03, filing date 2024-01-03 — Video text extraction method, device, equipment and storage medium

Publications (1)

CN117877016A — published 2024-04-12

Family ID: 90591174

Country Status (1)

CN: CN117877016A (pending)


Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination