CN114401419B - Video-based content generation method and device, electronic equipment and storage medium

Info

Publication number
CN114401419B
Authority
CN
China
Prior art keywords
picture
text information
text
key
video
Prior art date
Legal status
Active
Application number
CN202111616214.8A
Other languages
Chinese (zh)
Other versions
CN114401419A (en)
Inventor
黄焱晖
卞东海
蔡远俊
彭卫华
徐伟建
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111616214.8A
Publication of CN114401419A
Application granted
Publication of CN114401419B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N 21/2343 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N 21/234336 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by media transcoding, e.g. video is transformed into a slideshow of still pictures or audio is converted into text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N 21/23418 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/4402 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N 21/440236 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 Monomedia components thereof
    • H04N 21/8146 Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present disclosure discloses a video-based content generation method, which relates to the technical field of image processing, in particular to the fields of natural language processing, image recognition, optical character recognition, and the like. The specific implementation scheme is as follows: frame cutting processing is performed on a video to obtain a picture sequence contained in the video; key pictures and text information contained in the video are determined according to the picture sequence; the text information contained in each picture is fused to generate text content; and the key pictures are inserted into the text content to generate target content corresponding to the video. Therefore, the video content is converted into image-text content, which enriches content material, improves the readability of the video content, and provides conditions for saving the time a user spends reading the material.

Description

Video-based content generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, in particular to the fields of natural language processing, image recognition, optical character recognition, and the like, and more particularly to a video-based content generation method and apparatus, an electronic device, and a storage medium.
Background
In the internet, there is a large amount of video material, but reading of the video material takes a lot of time. Therefore, how to generate content that can be read quickly based on video is an urgent problem to be solved.
Disclosure of Invention
The present disclosure provides a video-based content generation method and apparatus, an electronic device, and a storage medium.
According to an aspect of the present disclosure, there is provided a video-based content generating method including:
performing frame cutting processing on a video to obtain a picture sequence contained in the video;
performing character recognition on each picture in the picture sequence to determine text information contained in each picture and the position of the text information in the picture;
determining key pictures contained in the video according to text information contained in each picture and/or the position of the text information in the picture;
according to the sequence of the pictures contained in the picture sequence, fusing text information contained in each picture to generate text content;
and inserting the key pictures into the text content according to the positions of the text information contained in the key pictures in the text content to generate target content corresponding to the video.
According to another aspect of the present disclosure, there is provided a video-based content generating apparatus including:
the frame cutting module is used for carrying out frame cutting processing on the video so as to obtain a picture sequence contained in the video;
the recognition module is used for carrying out character recognition on each picture in the picture sequence so as to determine text information contained in each picture and the position of the text information in the picture;
the determining module is used for determining key pictures contained in the video according to text information contained in each picture and/or the position of the text information in the picture;
the generating module is used for fusing the text information contained in each picture according to the sequence of the pictures contained in the picture sequence to generate text content;
the generating module is further configured to insert the key pictures into the text content according to positions of text information included in the key pictures in the text content, so as to generate target content corresponding to the video.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the above embodiments.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to the above-described embodiments.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method of the above embodiment.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart of a video-based content generation method according to an embodiment of the present disclosure;
fig. 2 is a schematic flow chart of another video-based content generation method provided by the embodiment of the present disclosure;
fig. 3 is a schematic flow chart of another video-based content generation method provided by the embodiment of the present disclosure;
fig. 4 is a schematic process diagram of a video-based content generation method provided by an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a video-based content generating apparatus according to an embodiment of the present disclosure;
fig. 6 is a block diagram of an electronic device for implementing a video-based content generation method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
NLP (Natural Language Processing) is an important direction in the fields of computer science and artificial intelligence, and the content of NLP research includes but is not limited to the following branch fields: text classification, information extraction, automatic summarization, intelligent question answering, topic recommendation, machine translation, subject word recognition, knowledge base construction, deep text representation, named entity recognition, text generation, text analysis (lexical, syntactic, grammatical, etc.), speech recognition and synthesis, and the like.
Image recognition refers to a technique in which a computer processes, analyzes, and understands images in order to recognize targets and objects in various patterns; it is a practical application of deep learning algorithms. At present, image recognition technology is generally divided into face recognition and commodity recognition. Face recognition is mainly applied to security inspection, identity verification, and mobile payment; commodity recognition is mainly applied to the commodity circulation process, in particular to the field of unmanned retail such as unmanned shelves and intelligent retail cabinets.
OCR (Optical Character Recognition) refers to a process in which an electronic device (e.g., a scanner or a digital camera) examines characters printed on paper, determines their shapes by detecting dark and light patterns, and then translates the shapes into computer text by a character recognition method. That is, for printed characters, the characters in a paper document are converted optically into an image file with a black-and-white dot matrix, and the characters in the image are converted into a text format by recognition software for further editing and processing by word processing software. How to improve recognition accuracy through debugging or the use of auxiliary information is the most important issue of OCR, and the term ICR (Intelligent Character Recognition) arose accordingly. The main indicators for measuring the performance of an OCR system are the rejection rate, the false recognition rate, the recognition speed, user interface friendliness, product stability, usability, feasibility, and the like.
A video-based content generation method, apparatus, electronic device, and storage medium according to embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a video-based content generation method according to an embodiment of the present disclosure.
As shown in fig. 1, the method includes:
step 101, performing frame cutting processing on the video to obtain a picture sequence contained in the video.
The picture sequence may include a plurality of pictures arranged in time sequence, which is not limited in this disclosure.
In the present disclosure, a user may upload the URL corresponding to a video, and the server side may acquire the video based on the video URL after receiving it. Alternatively, the user may directly upload the video to the server. The present disclosure is not limited in this respect. After the server side obtains the video, the frame cutting processing can be carried out on the video by using any video processing software.
It can be understood that after the frame-cutting processing is performed on the video by the server, a plurality of pictures corresponding to the video can be obtained, and then the plurality of pictures can be stored according to the time sequence in the video, or the pictures can be numbered according to the time sequence in the video and then stored, so that a picture sequence corresponding to the video can be determined.
In the present disclosure, two video frames may be cut per second, so that the obtained picture sequence contains all the content information in the video while the redundancy of the picture sequence is reduced.
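By way of a non-limiting illustration, the frame cutting step described above could be sketched as follows using a general-purpose video library such as OpenCV; the function name and the sampling rate of two frames per second are assumptions made for illustration, and the disclosure does not prescribe a particular library.

```python
import cv2

def cut_frames(video_path: str, frames_per_second: int = 2):
    """Cut a video into a time-ordered picture sequence (assumed ~2 frames per second)."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or frames_per_second
    step = max(int(round(fps / frames_per_second)), 1)  # keep every `step`-th frame
    pictures = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            # store (frame-cutting time in seconds, picture) so the sequence keeps video order
            pictures.append((index / fps, frame))
        index += 1
    capture.release()
    return pictures
```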
Step 102, performing character recognition on each picture in the picture sequence to determine text information contained in each picture and the position of the text information in the picture.
In the disclosure, the character information in the picture and the position of the character information in the picture can be recognized by using an OCR recognition technology.
The text information included in each picture may include subtitles, bullet screens, and information related to the type of the video content, such as the column name corresponding to the video, which is not limited in this disclosure.
It can be understood that subtitles, bullet screens, and column names appear at different positions in a video image. In general, the subtitle is at the lower part of the video image, the column name is at the upper left corner of the video image, and the bullet screen is in the upper half of the video image. Therefore, the type of the text information can be determined according to the position of the text information.
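As a minimal sketch of determining the type of text information from its position, the following illustrative rules may be used; the region assignments (lower part for subtitles, upper-left corner for column names, upper half for bullet screens) follow the description above, while the exact thresholds are assumptions.

```python
def classify_text_by_position(box, picture_height, picture_width):
    """Guess whether an OCR text box is a subtitle, column name, or bullet screen.

    `box` is (x, y, w, h) with the origin at the top-left corner; the thresholds
    below are assumed for illustration only.
    """
    x, y, w, h = box
    if y + h >= picture_height * 0.8:                           # lower part of the image
        return "subtitle"
    if x <= picture_width * 0.2 and y <= picture_height * 0.2:  # upper-left corner
        return "column_name"
    if y + h <= picture_height * 0.5:                           # upper half of the image
        return "bullet_screen"
    return "other"
```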
And 103, determining key pictures contained in the video according to the text information contained in each picture and/or the position of the text information in the picture.
In the present disclosure, the key picture may be a picture that is important for understanding the content of the video, for example, a picture where the content that the user pays attention to is located, or a picture with a changed content plot. In general, when the content scenario changes, the barrage text information will change accordingly. Therefore, whether the picture is a key picture can be determined according to whether the bullet screen text information appears in the picture. Therefore, the key picture can be effectively determined, and the information content of the key picture can be further improved.
In the present disclosure, the bullet screen text information included in the adjacent pictures can be compared, and when the bullet screen text information changes, the picture corresponding to the bullet screen text information can be determined as the key picture.
Optionally, under the condition that the text information included in any picture is at the preset position in the picture, it may be determined that any picture is a key picture.
For example, when the bullet screen text information appears at the bullet screen position of a certain picture and the previous picture does not contain the bullet screen text information, or the picture is the first picture in the picture sequence, it may be determined that the picture is the key picture.
Optionally, under the condition that the bullet screen text information included in the multiple adjacent pictures is the same, it may be determined that any picture in the multiple adjacent pictures is a key picture.
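The rules above could be illustrated by the following sketch, which marks a picture as a key picture when its bullet screen text differs from that of the previous picture (so that only one picture is kept for a run of adjacent pictures with the same bullet screen text); the input data structure is an assumption made only for illustration.

```python
def find_key_pictures(pictures):
    """Mark pictures whose bullet-screen text changes as key pictures.

    `pictures` is assumed to be a list of dicts like
    {"image": ..., "bullet_text": str or None, "time": float}, in video order.
    """
    key_pictures = []
    previous_bullet = None
    for picture in pictures:
        bullet = picture.get("bullet_text")
        # a new or changed bullet screen usually indicates a change in the content plot
        if bullet and bullet != previous_bullet:
            key_pictures.append(picture)
        previous_bullet = bullet
    return key_pictures
```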
And 104, fusing the text information contained in each picture according to the sequence of the pictures contained in the picture sequence to generate text content.
In the present disclosure, after the server acquires the text information corresponding to each picture, the subtitle text information may be spliced according to the picture sequence corresponding to the text information to generate the text content corresponding to the video.
Optionally, every two adjacent pieces of subtitle text information may be input into a preset network model, so as to determine the type of the punctuation mark between the two pieces of subtitle text information according to the output of the network model. The subtitle text information contained in each picture is then fused based on the type of punctuation mark between every two adjacent pieces of subtitle text information to generate the text content.
The network model may be any natural language processing model such as a Knowledge and semantic information fusion model (ERNIE). The model may be trained through the set of video subtitle text data tagged with punctuation marks to obtain a network model for predicting punctuation marks between adjacent subtitle texts.
In the disclosure, after the type of the punctuation mark between two pieces of subtitle text information is determined based on the preset network model, the subtitle text information in each picture can be spliced according to the sequence of the pictures corresponding to the subtitle text information. In the splicing process, if a punctuation mark exists between two adjacent pieces of subtitle text information, the punctuation mark can be inserted between the two pieces of subtitle text information, so that the readability of the text content is improved.
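A minimal sketch of this model-based fusion is given below; predict_punctuation stands in for the preset network model (for example, an ERNIE-based classifier) and is a hypothetical interface rather than an API defined by the disclosure.

```python
def fuse_subtitles_with_model(subtitles, predict_punctuation):
    """Splice subtitle texts in picture order, inserting model-predicted punctuation.

    `predict_punctuation(left, right)` is assumed to return a punctuation mark
    (e.g. "," or ".") or an empty string for two adjacent subtitle texts.
    """
    parts = []
    for i, text in enumerate(subtitles):
        parts.append(text)
        if i + 1 < len(subtitles):
            parts.append(predict_punctuation(text, subtitles[i + 1]))
    return "".join(parts)
```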
And 105, inserting the key pictures into the text content according to the positions of the text information contained in the key pictures in the text content to generate target content corresponding to the video.
In the disclosure, the position of the caption text information in each key picture in the text content may be determined, and then the key picture may be inserted into the position of the caption text information included in the key picture, so that the image-text content corresponding to the video may be generated. And the image-text content is the target content corresponding to the video.
In the disclosure, the content scenario may change before and after the key picture, and therefore, the text content may be segmented according to the key picture to increase readability of the text content.
Optionally, the text content may be segmented according to the key pictures, and the key pictures are inserted into corresponding segmentation positions to generate target content corresponding to the video.
For example, the position of the subtitle text information contained in the key picture in the text content is first determined; the text content may then be segmented at that position, and the key picture inserted after the position of the corresponding subtitle text information. Therefore, the text content can be reasonably segmented, which is favorable for improving the readability of the text content.
Optionally, when the key pictures do not include the subtitle information, the key pictures may be inserted into the text content according to the sequence corresponding to the key pictures in the picture sequence, so as to generate the target content corresponding to the video. For example, a certain key picture does not contain subtitle information, the position of the subtitle text information contained in the previous picture of the key picture in the picture sequence in the text content can be determined, then, segmentation can be performed after the position, and the key picture is inserted into the segmentation position.
Optionally, the text information of each subtitle and the key picture may be fused according to the sequence of the pictures included in the picture sequence to generate the text content.
For example, the subtitle text information and the key pictures may be spliced according to the order of the pictures corresponding to each subtitle text information and the order of each key picture to generate the target content. In addition, when any key picture contains subtitle text information, the sequence of the picture corresponding to the subtitle text information is the sequence of the corresponding key picture, and at this time, the key picture can be placed behind the subtitle text information.
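The insertion of key pictures into the text content could be sketched as follows; the placeholder markup for pictures and the fallback of appending a key picture without subtitle text at the end are simplifying assumptions for illustration.

```python
def insert_key_pictures(text_content, key_pictures):
    """Insert each key picture after the position of its subtitle text in the text content.

    `key_pictures` is assumed to be a list of dicts like
    {"subtitle": str or None, "path": str}, ordered as in the picture sequence.
    Each picture is represented by a placeholder line such as "[picture: xxx.jpg]".
    """
    for picture in key_pictures:
        subtitle = picture.get("subtitle")
        marker = "\n[picture: {}]\n".format(picture["path"])  # also starts a new paragraph
        if subtitle and subtitle in text_content:
            position = text_content.index(subtitle) + len(subtitle)
            text_content = text_content[:position] + marker + text_content[position:]
        else:
            # simplified fallback for key pictures without subtitle text
            text_content += marker
    return text_content
```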
According to the method, the video is subjected to frame cutting processing to obtain a picture sequence contained in the video. Character recognition is then performed on each picture in the picture sequence to determine the text information contained in each picture and the position of the text information in the picture. The key pictures contained in the video are determined according to the text information contained in each picture and/or the position of the text information in the picture. The text information contained in each picture is then fused according to the sequence of the pictures contained in the picture sequence to generate text content, and the key pictures are inserted into the text content according to the position of the text information contained in the key pictures in the text content to generate target content corresponding to the video. Therefore, the video content is converted into image-text content, which enriches content material, improves the readability of the video content, and provides conditions for saving the time a user spends reading the material.
Fig. 2 is a schematic flowchart of a video-based content generation method according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes:
step 201, performing frame cutting processing on the video to obtain a picture sequence included in the video.
Step 202, performing character recognition on each picture in the picture sequence to determine text information contained in each picture and a position of the text information in the picture.
Step 203, determining key pictures contained in the video according to the text information contained in each picture and/or the position of the text information in the picture.
In the present disclosure, for a specific implementation process of step 201 to step 203, reference may be made to the detailed description of the above embodiments, and details are not repeated here.
And 204, carrying out face recognition on the key picture to determine whether a face region exists in the key picture and the definition of the face region.
In the disclosure, there may be a picture including a face in the key pictures, and when the definition of the picture including the face is low, the readability of the image-text content is greatly affected. Therefore, the key pictures with low definition containing the human faces can be deleted, so that the readability of the image-text content is ensured.
In the present disclosure, face recognition may be performed on a key picture based on a face recognition technology to determine whether a face region exists in the key picture and the definition of the face region.
In step 205, the key picture is retained when the key picture does not include the face region.
In the disclosure, after the key picture is subjected to face recognition, the key picture not including the face region can be reserved to ensure the richness of the key picture.
And step 206, under the condition that the key picture comprises the face region and the definition of the face region is greater than or equal to the threshold value, the key picture is reserved.
The threshold value can be preset in the system according to the requirement on the definition of the picture.
In the disclosure, when the key picture includes a face region, the definition of the face region can be compared with a preset threshold, and the key picture can be retained under the condition that the definition is greater than or equal to the threshold. Therefore, the definition of the key pictures in the image-text content can be ensured, and the readability of the image-text content is improved.
Step 207, discarding the key picture when the key picture includes the face region and the definition of the face region is smaller than the threshold.
In the disclosure, when the key picture includes a face region, the definition of the face region can be compared with a preset threshold, and the key picture can be discarded under the condition that the definition is smaller than the threshold. Therefore, the definition of the key pictures in the image-text content can be ensured, and the readability of the image-text content is improved.
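Steps 204 to 207 could be sketched as follows; the Haar cascade face detector and the Laplacian-variance sharpness measure are assumed choices, and the disclosure does not prescribe a particular face recognition algorithm or threshold value.

```python
import cv2

def filter_key_pictures(key_pictures, clarity_threshold=100.0):
    """Keep key pictures with no face region, or with a sufficiently clear face region."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    kept = []
    for image in key_pictures:
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray)
        if len(faces) == 0:
            kept.append(image)   # no face region: keep the key picture
            continue
        x, y, w, h = faces[0]
        face = gray[y:y + h, x:x + w]
        clarity = cv2.Laplacian(face, cv2.CV_64F).var()  # assumed sharpness measure
        if clarity >= clarity_threshold:
            kept.append(image)   # face region is clear enough: keep the key picture
        # otherwise the key picture is discarded
    return kept
```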
It should be noted that, in practical use, the above steps 205-207 may be executed in parallel, or may also be executed in other sequences, such as executing 206 and 207 first, then executing 205, and so on, which is not limited by the present disclosure.
And step 208, fusing the text information contained in each picture according to the sequence of the pictures contained in the picture sequence to generate text content.
And step 209, inserting the key pictures into the text content according to the positions of the text information contained in the key pictures in the text content to generate target content corresponding to the video.
In the present disclosure, the detailed implementation process of step 208 to step 209 may refer to the detailed description of the above embodiments, and is not described herein again.
In the present disclosure, after determining the key picture, face recognition may be performed on the picture to determine whether a face region exists in the key picture and the definition of the face region. And if the key picture does not contain the face area, keeping the key picture. And under the condition that the key picture contains the face region and the definition of the face region is greater than or equal to the threshold value, reserving the key picture. And discarding the key picture under the condition that the key picture comprises the face region and the definition of the face region is less than a threshold value. Therefore, the definition of the key pictures in the image-text content can be ensured, and the readability of the image-text content is improved.
Fig. 3 is a schematic flowchart of a video-based content generation method according to an embodiment of the present disclosure.
As shown in fig. 3, the method includes:
step 301, performing frame cutting processing on the video to obtain a picture sequence included in the video.
Step 302, performing character recognition on each picture in the picture sequence to determine text information contained in each picture and a position of the text information in the picture.
In the present disclosure, the specific implementation process of steps 301 to 302 may refer to the detailed description of the above embodiments, and is not described herein again.
Step 303, performing deduplication processing on the text information contained in each picture to obtain the text information to be fused.
In the disclosure, the subtitle text information in adjacent pictures may be repeated, so the subtitle text information can be deduplicated to ensure the readability of the text content.
In the present disclosure, the edit distance may be calculated for two adjacent pieces of text information, and when the edit distance is equal to or less than 2, the two pieces of text information may be determined to be the same, and then any one of the two pieces of text information may be deleted.
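A minimal sketch of this deduplication is given below; the edit distance is computed with a plain dynamic-programming implementation, and the threshold of 2 follows the description above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two subtitle texts."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def deduplicate(subtitles, threshold=2):
    """Drop a subtitle when it is (nearly) identical to the previous one."""
    result = []
    for text in subtitles:
        if result and edit_distance(result[-1], text) <= threshold:
            continue  # treat the two adjacent texts as the same and keep only one
        result.append(text)
    return result
```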
And step 304, determining semantic association degree between every two adjacent text messages.
It can be understood that each punctuation mark has a different context of use, and the degree of semantic association between before and after the punctuation mark is different. For example, a period generally indicates that a sentence has ended, followed by the beginning of a new sentence, and thus, the two sentences before and after the period are semantically related to a lesser degree. Commas generally represent a short pause, but the semantics of two sentences before and after the comma are strongly linked. Therefore, the type of the corresponding punctuation mark can be determined according to the semantic association degree between two adjacent text messages.
In the disclosure, the semantic association degree between two adjacent subtitle text information can be determined based on a semantic analysis technology.
And 305, determining the type of punctuation marks between every two adjacent text messages according to the semantic association degree and the time interval between two pictures corresponding to the two text messages.
In the present disclosure, each picture corresponds to one frame cutting time in the picture sequence, and therefore, the time interval between two pieces of text information can be determined according to the frame cutting times of the pictures corresponding to the two pieces of text information.
Further, in the picture sequence, there may be a picture that does not contain subtitle information, and thus, the time interval between adjacent two subtitle text information may be different. The degree of semantic conversion may be different for different time intervals. For example, when the time interval between two subtitle text information is long, it can be considered that there is a large semantic transition, and when the time interval between two subtitle text information is short, it can be considered that the semantic transition is small. When the semantic conversion is large, a long pause may exist between two subtitle text information, and when the semantic conversion is small, a short pause may exist between two subtitle text information. Therefore, the type of punctuation between each adjacent two text messages can be determined according to the time interval between two subtitle text messages.
For example, when the time interval between two pieces of subtitle text information is 3 seconds or more, a period may be added between them; when the time interval is 0 to 1 second, no punctuation mark may be added; and when the time interval is 1 to 3 seconds, a comma may be added.
In the present disclosure, the type of punctuation can be determined according to the semantic association between two subtitle text messages. For example, when the semantic association between two subtitle text messages is low, the corresponding punctuation mark may be determined as a period. Or, when the semantic association degree between two subtitle text messages is low and the previous sentence is a query context, the corresponding punctuation mark can be determined to be a question mark.
And step 306, fusing the text information contained in each picture based on the type of the punctuation mark between every two adjacent text information to generate text content.
In the present disclosure, the subtitle text information in each picture may be spliced according to the order of the pictures corresponding to the subtitle text information. In the splicing process, if a punctuation mark exists between two adjacent pieces of subtitle text information, the punctuation mark can be inserted between the two pieces of subtitle text information. Therefore, the reasonability of sentence segmentation of the text content is improved, and the readability of the text content is improved.
In the present disclosure, when only punctuation determined based on the semantic relevance exists between two pieces of text information, or punctuation determined according to a time interval exists between two pieces of text information, the punctuation may be directly determined as punctuation between the two pieces of text information. When punctuation marks determined according to the semantic association degree and punctuation marks determined according to the time interval exist between two pieces of text information at the same time, the punctuation marks determined according to the time interval can be determined to be punctuation marks between the two pieces of text information.
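The combination of the semantic association degree and the time interval could be sketched as follows; semantic_association stands for any semantic analysis scorer, the thresholds are illustrative assumptions, and the interval-based punctuation takes precedence as described above.

```python
def punctuation_between(left, right, interval_seconds, semantic_association):
    """Choose the punctuation mark between two adjacent subtitle texts.

    `semantic_association(left, right)` is an assumed scorer in [0, 1];
    all thresholds below are illustrative, not values fixed by the disclosure.
    """
    # punctuation suggested by the time interval between the two pictures
    if interval_seconds >= 3:
        by_interval = "。"   # full-width punctuation assumed for Chinese subtitles
    elif interval_seconds > 1:
        by_interval = "，"
    else:
        by_interval = ""

    # punctuation suggested by the semantic association degree
    score = semantic_association(left, right)
    if score < 0.3:
        by_semantics = "。"  # low association: end the sentence (a question mark could be used for query contexts)
    elif score < 0.7:
        by_semantics = "，"
    else:
        by_semantics = ""

    # when both rules propose a mark, the interval-based one takes precedence
    if by_interval and by_semantics:
        return by_interval
    return by_interval or by_semantics
```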
Step 307, fusing the text information included in each picture according to the sequence of the pictures included in the picture sequence to generate text content.
And 308, inserting the key pictures into the text content according to the positions of the text information contained in the key pictures in the text content to generate target content corresponding to the video.
In the present disclosure, the specific implementation process of steps 307 to 308 may refer to the detailed description of the above embodiments, and is not described herein again.
In the disclosure, after determining the text information contained in each picture and the position of the text information in the picture, the text information contained in each picture may be deduplicated to obtain the text information to be fused, then, the semantic association degree between each two adjacent text information is determined, the type of the punctuation mark between each two adjacent text information is determined according to the semantic association degree and the time interval between the two pictures corresponding to the two text information, and then, the text information contained in each picture is fused based on the type of the punctuation mark between each two adjacent text information to generate the text content. Therefore, the method is beneficial to improving the rationality of the sentence division of the text content, and further improves the readability of the text content.
For ease of understanding, the procedure of the video-based content generation method in the present disclosure is explained below with reference to fig. 4. Fig. 4 is a process diagram of a video-based content generation method according to an embodiment of the present disclosure. As shown in fig. 4, after the video is subjected to frame cutting processing to obtain a picture sequence included in the video, character recognition may be performed on each picture in the picture sequence to determine text information included in each picture and a position of the text information in the picture. And then, a subtitle list can be determined according to the text information contained in each picture, and semantic discrimination can be performed on two adjacent subtitles by utilizing a natural language processing technology so as to determine punctuation marks between the two adjacent subtitles. Or, the subtitle list can be subjected to deduplication processing according to the editing distance between two adjacent subtitles to determine the text content corresponding to the video. When the text content is determined, the key pictures contained in the video can be determined according to the text information contained in each picture and/or the position of the text information in the picture, and face recognition can be performed on the key pictures, so that the key pictures with lower definition containing the face region can be removed based on the definition of the face region. After the key pictures and the text content are determined, the subtitle text information in the key pictures and the text content can be fused according to the sequence of the pictures corresponding to the subtitle text information in the key pictures and the text content, so that target content corresponding to the video is generated. Therefore, the readability of the image-text content is improved while the image-text content is generated based on the video.
In order to implement the foregoing embodiments, the embodiments of the present disclosure further provide a content generating apparatus based on video. Fig. 5 is a schematic structural diagram of a video-based content generating apparatus according to an embodiment of the present disclosure.
As shown in fig. 5, the video-based content generating apparatus 500 includes: a frame cutting module 510, an identification module 520, a determination module 530, and a generation module 540.
A frame cutting module 510, configured to perform frame cutting processing on a video to obtain a picture sequence included in the video;
an identifying module 520, configured to perform character identification on each picture in the sequence of pictures to determine text information included in each picture and a position of the text information in the picture;
a determining module 530, configured to determine, according to text information included in each of the pictures and/or a position of the text information in the picture, a key picture included in the video;
a generating module 540, configured to fuse text information included in each picture according to an order of pictures included in the picture sequence to generate text content;
the generating module 540 is further configured to insert the key pictures into the text content according to positions of the text information included in the key pictures in the text content, so as to generate target content corresponding to the video.
In a possible implementation manner of the embodiment of the present disclosure, the determining module 530 is specifically configured to:
determining any picture in a plurality of adjacent pictures as a key picture under the condition that text information contained in the adjacent pictures is the same;
or determining any picture as a key picture under the condition that the text information contained in the any picture is at the preset position in the picture.
In a possible implementation manner of the embodiment of the present disclosure, the identifying module 520 is further configured to:
performing face recognition on the key picture to determine whether a face region exists in the key picture and the definition of the face region;
under the condition that the key picture does not contain a face region, reserving the key picture;
when the key picture comprises a face region and the definition of the face region is greater than or equal to a threshold value, the key picture is reserved;
and under the condition that the key picture comprises a face region and the definition of the face region is smaller than the threshold value, discarding the key picture.
In a possible implementation manner of the embodiment of the present disclosure, the generating module 540 is further configured to:
and carrying out duplication elimination processing on the text information contained in each picture to obtain the text information to be fused.
In a possible implementation manner of the embodiment of the present disclosure, the generating module 540 is specifically configured to:
inputting every two adjacent text messages into a preset network model, and determining the type of punctuation marks between the two text messages according to the output of the network model;
and fusing the text information contained in each picture based on the type of punctuation marks between every two adjacent text information to generate the text content.
In a possible implementation manner of the embodiment of the present disclosure, the generating module 540 is specifically configured to:
determining semantic association degree between every two adjacent text messages;
determining the type of punctuation marks between every two adjacent text messages according to the semantic association degree and the time interval between two pictures corresponding to the two text messages;
and fusing the text information contained in each picture based on the type of punctuation marks between every two adjacent text information to generate the text content.
It should be noted that the explanation of the embodiment of the video-based content generation method is also applicable to the apparatus of the embodiment, and therefore, the description thereof is omitted here.
According to the apparatus, the video is subjected to frame cutting processing to obtain a picture sequence contained in the video. Character recognition is then performed on each picture in the picture sequence to determine the text information contained in each picture and the position of the text information in the picture. The key pictures contained in the video are determined according to the text information contained in each picture and/or the position of the text information in the picture. The text information contained in each picture is then fused according to the sequence of the pictures contained in the picture sequence to generate text content, and the key pictures are inserted into the text content according to the position of the text information contained in the key pictures in the text content to generate target content corresponding to the video. Therefore, the video content is converted into image-text content, which enriches content material, improves the readability of the video content, and provides conditions for saving the time a user spends reading the material.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes in accordance with a computer program stored in a ROM (Read-Only Memory) 602 or a computer program loaded from a storage unit 608 into a RAM (Random Access Memory) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The calculation unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An I/O (Input/Output) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing Unit 601 include, but are not limited to, a CPU (Central Processing Unit), a GPU (graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing Units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable Processor, controller, microcontroller, and the like. The calculation unit 601 performs the respective methods and processes described above, such as a video-based content generation method. For example, in some embodiments, the method for video-based content generation may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by the computing unit 601, one or more steps of the video-based content generation method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the video-based content generation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be realized in digital electronic circuitry, integrated circuitry, FPGAs (Field Programmable Gate arrays), ASICs (Application-Specific Integrated circuits), ASSPs (Application Specific Standard products), SOCs (System On Chip, system On a Chip), CPLDs (Complex Programmable Logic devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Electrically Programmable Read-Only-Memory) or flash Memory, an optical fiber, a CD-ROM (Compact Disc Read-Only-Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a Display device (e.g., a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LAN (Local Area Network), WAN (Wide Area Network), internet and blockchain networks.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server may be a cloud Server, which is also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service expansibility in a conventional physical host and a VPS (Virtual Private Server). The server may also be a server of a distributed system, or a server incorporating a blockchain.
According to an embodiment of the present disclosure, the present disclosure further provides a computer program product. When instructions in the computer program product are executed by a processor, the video-based content generation method proposed by the above-mentioned embodiments of the present disclosure is performed.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (12)

1. A video-based content generation method, comprising:
performing frame cutting processing on a video to obtain a picture sequence contained in the video;
performing character recognition on each picture in the picture sequence to determine text information contained in each picture and the position of the text information in the picture;
determining key pictures contained in the video according to text information contained in each picture and/or the position of the text information in the picture;
according to the sequence of the pictures contained in the picture sequence, fusing text information contained in each picture to generate text content;
inserting the key pictures into the text content according to the positions of the text information contained in the key pictures in the text content to generate target content corresponding to the video;
wherein fusing the text information contained in each picture to generate the text content comprises:
determining a degree of semantic association between every two adjacent pieces of text information;
determining a type of punctuation mark between every two adjacent pieces of text information according to the degree of semantic association and the time interval between the two pictures corresponding to the two pieces of text information;
when a punctuation mark determined according to the degree of semantic association and a punctuation mark determined according to the time interval between the two pictures both exist between the two pieces of text information, taking the punctuation mark determined according to the time interval between the two pictures as the punctuation mark between them;
and fusing the text information contained in each picture based on the types of punctuation marks between every two adjacent pieces of text information, to generate the text content.
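The fusion step recited in claim 1 can be illustrated with a short, non-authoritative sketch. The segment structure, the comma/period label set, the thresholds, and the injected semantic-relatedness scorer below are assumptions made for illustration; the claim itself only fixes the rule that, when both the semantic association degree and the inter-frame time interval yield a punctuation mark, the time-interval mark prevails.

```python
# Minimal sketch of claim 1's fusion rule (illustrative names and thresholds).
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class OcrSegment:
    text: str          # text information recognized in one picture
    timestamp: float   # timestamp of that picture, in seconds

def fuse_segments(
    segments: List[OcrSegment],
    semantic_relatedness: Callable[[str, str], float],  # assumed external scorer
    relatedness_threshold: float = 0.5,
    long_gap_seconds: float = 3.0,
) -> str:
    """Join adjacent segments with commas or periods (hypothetical rule set)."""
    if not segments:
        return ""
    parts = [segments[0].text]
    for prev, cur in zip(segments, segments[1:]):
        # Mark suggested by semantic association: related neighbours share a sentence.
        by_semantics = "," if semantic_relatedness(prev.text, cur.text) >= relatedness_threshold else "."
        # Mark suggested by the time interval between the two source pictures.
        by_time_gap: Optional[str] = "." if (cur.timestamp - prev.timestamp) >= long_gap_seconds else None
        # Claim 1: when both marks exist, the time-interval mark is the one used.
        mark = by_time_gap if by_time_gap is not None else by_semantics
        parts.append(mark + " " + cur.text)
    return "".join(parts) + "."
```

A real system would replace the `semantic_relatedness` callable with whatever association model the disclosure actually uses.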
2. The method of claim 1, wherein determining the key pictures contained in the video according to the text information contained in each picture and/or the position of the text information in the picture comprises:
determining any one of a plurality of adjacent pictures as a key picture under the condition that the text information contained in the adjacent pictures is the same;
or determining any picture as a key picture under the condition that the text information contained in that picture is at a preset position in the picture.
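As an illustration of the two alternative conditions in claim 2, the sketch below treats a picture as key either when its text repeats across adjacent pictures or when the text's bounding box falls inside a preset region of the frame; the band coordinates and function names are assumptions, not values taken from the disclosure.

```python
# Illustrative check of claim 2's two key-picture conditions.
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) of a recognized text line

def is_key_by_repetition(texts_of_adjacent_pictures: List[str]) -> bool:
    """Condition 1: all adjacent pictures carry the same non-empty text information."""
    return len(set(texts_of_adjacent_pictures)) == 1 and texts_of_adjacent_pictures[0] != ""

def is_key_by_position(box: Box, frame_height: int,
                       band: Tuple[float, float] = (0.05, 0.25)) -> bool:
    """Condition 2: the text lies inside a preset vertical band (hypothetical title band)."""
    _, y, _, h = box
    center = (y + h / 2) / frame_height
    return band[0] <= center <= band[1]
```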
3. The method of claim 2, wherein, after determining the key pictures contained in the video, the method further comprises:
performing face recognition on each key picture to determine whether a face region exists in the key picture and the definition of the face region;
retaining the key picture under the condition that the key picture does not contain a face region;
retaining the key picture under the condition that the key picture contains a face region and the definition of the face region is greater than or equal to a threshold value;
and discarding the key picture under the condition that the key picture contains a face region and the definition of the face region is smaller than the threshold value.
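A hedged sketch of the face-based filtering in claim 3: OpenCV's stock Haar cascade stands in for the face recognizer, and Laplacian variance stands in for the definition (sharpness) score; both choices, the threshold value, and the keep-if-any-face-is-sharp policy are assumptions for illustration only.

```python
# Illustrative key-picture filter for claim 3 (detector, score, and threshold are assumed).
import cv2
import numpy as np

_face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def keep_key_picture(bgr_frame: np.ndarray, definition_threshold: float = 100.0) -> bool:
    gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
    faces = _face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return True  # no face region: retain the key picture
    for (x, y, w, h) in faces:
        roi = gray[y:y + h, x:x + w]
        # Variance of the Laplacian as a crude definition (sharpness) score.
        if cv2.Laplacian(roi, cv2.CV_64F).var() >= definition_threshold:
            return True  # a sufficiently sharp face region: retain
    return False  # face region present but below the threshold: discard
```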
4. The method of claim 1, wherein, before fusing the text information contained in each picture to generate the text content, the method further comprises:
performing deduplication processing on the text information contained in each picture to obtain the text information to be fused.
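Claim 4's deduplication can be sketched as collapsing consecutive identical OCR results before fusion; the exact-match criterion and order-preserving behaviour below are assumptions, since the claim does not fix how duplicates are detected.

```python
# Minimal, order-preserving deduplication of per-picture text information (claim 4 sketch).
from typing import List

def deduplicate(texts: List[str]) -> List[str]:
    result: List[str] = []
    for text in texts:
        if not result or text != result[-1]:  # drop consecutive repeats
            result.append(text)
    return result
```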
5. The method according to any one of claims 1-4, wherein fusing the text information contained in each picture to generate the text content comprises:
inputting every two adjacent pieces of text information into a preset network model, and determining the type of punctuation mark between the two pieces of text information according to the output of the network model;
and fusing the text information contained in each picture based on the types of punctuation marks between every two adjacent pieces of text information, to generate the text content.
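The "preset network model" of claim 5 is not specified further in this section; as one possible reading, the sketch below uses a small PyTorch classifier that takes vector encodings of two adjacent pieces of text information and predicts a punctuation class. The label set, encoder, and layer sizes are illustrative assumptions.

```python
# Hypothetical punctuation classifier standing in for claim 5's preset network model.
import torch
from torch import nn

PUNCTUATION_CLASSES = [",", ".", ""]  # assumed label set: comma, period, no mark

class PunctuationClassifier(nn.Module):
    def __init__(self, embedding_dim: int = 128, hidden_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, len(PUNCTUATION_CLASSES)),
        )

    def forward(self, left_vec: torch.Tensor, right_vec: torch.Tensor) -> torch.Tensor:
        # Concatenate the encodings of the two adjacent pieces of text and score each class.
        return self.mlp(torch.cat([left_vec, right_vec], dim=-1))

# Usage sketch: random vectors stand in for a real text encoder.
model = PunctuationClassifier()
left, right = torch.randn(1, 128), torch.randn(1, 128)
predicted_mark = PUNCTUATION_CLASSES[model(left, right).argmax(dim=-1).item()]
```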
6. A video-based content generation apparatus, comprising:
a frame cutting module configured to perform frame cutting processing on a video to obtain a picture sequence contained in the video;
a recognition module configured to perform character recognition on each picture in the picture sequence to determine text information contained in each picture and the position of the text information in the picture;
a determining module configured to determine key pictures contained in the video according to the text information contained in each picture and/or the position of the text information in the picture;
a generating module configured to fuse the text information contained in each picture, according to the order of the pictures in the picture sequence, to generate text content;
wherein the generating module is further configured to insert the key pictures into the text content according to the positions, in the text content, of the text information contained in the key pictures, to generate target content corresponding to the video;
and the generating module is specifically configured to:
determine a degree of semantic association between every two adjacent pieces of text information;
determine a type of punctuation mark between every two adjacent pieces of text information according to the degree of semantic association and the time interval between the two pictures corresponding to the two pieces of text information;
when a punctuation mark determined according to the degree of semantic association and a punctuation mark determined according to the time interval between the two pictures both exist between the two pieces of text information, take the punctuation mark determined according to the time interval between the two pictures as the punctuation mark between them;
and fuse the text information contained in each picture based on the types of punctuation marks between every two adjacent pieces of text information, to generate the text content.
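For the frame cutting module and recognition module recited in claim 6, one possible sketch is shown below: OpenCV samples the video at a fixed interval, and an off-the-shelf OCR engine returns each text line with its bounding box. The one-frame-per-second rate and the use of pytesseract are assumptions; the claim does not name an OCR engine or a sampling strategy.

```python
# Illustrative frame cutting and character recognition (claim 6 sketch, engine assumed).
import cv2
import pytesseract

def cut_frames(video_path: str, every_seconds: float = 1.0):
    """Yield (timestamp_in_seconds, frame) pairs sampled from the video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if FPS metadata is missing
    step = max(1, int(round(fps * every_seconds)))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield index / fps, frame
        index += 1
    cap.release()

def recognize_text(frame):
    """Return (text, (x, y, w, h)) for each recognized word and its position in the picture."""
    data = pytesseract.image_to_data(frame, output_type=pytesseract.Output.DICT)
    words = []
    for text, x, y, w, h in zip(data["text"], data["left"], data["top"],
                                data["width"], data["height"]):
        if text.strip():
            words.append((text, (x, y, w, h)))
    return words
```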
7. The apparatus of claim 6, wherein the determining module is specifically configured to:
determine any one of a plurality of adjacent pictures as a key picture under the condition that the text information contained in the adjacent pictures is the same;
or determine any picture as a key picture under the condition that the text information contained in that picture is at a preset position in the picture.
8. The apparatus of claim 7, wherein the recognition module is further configured to:
perform face recognition on each key picture to determine whether a face region exists in the key picture and the definition of the face region;
retain the key picture under the condition that the key picture does not contain a face region;
retain the key picture under the condition that the key picture contains a face region and the definition of the face region is greater than or equal to a threshold value;
and discard the key picture under the condition that the key picture contains a face region and the definition of the face region is smaller than the threshold value.
9. The apparatus of claim 6, wherein the generating module is further configured to:
perform deduplication processing on the text information contained in each picture to obtain the text information to be fused.
10. The apparatus according to any one of claims 6-9, wherein the generating module is specifically configured to:
input every two adjacent pieces of text information into a preset network model, and determine the type of punctuation mark between the two pieces of text information according to the output of the network model;
and fuse the text information contained in each picture based on the types of punctuation marks between every two adjacent pieces of text information, to generate the text content.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.
CN202111616214.8A 2021-12-27 2021-12-27 Video-based content generation method and device, electronic equipment and storage medium Active CN114401419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111616214.8A CN114401419B (en) 2021-12-27 2021-12-27 Video-based content generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111616214.8A CN114401419B (en) 2021-12-27 2021-12-27 Video-based content generation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114401419A (en) 2022-04-26
CN114401419B (en) 2023-03-24

Family

ID=81227814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111616214.8A Active CN114401419B (en) 2021-12-27 2021-12-27 Video-based content generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114401419B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20010056901A (en) * 1999-12-17 2001-07-04 구자홍 Method for recognizing a picture document of the digital broadcasting receiver
CN111343496A (en) * 2020-02-21 2020-06-26 北京字节跳动网络技术有限公司 Video processing method and device
CN112738554B (en) * 2020-12-22 2022-12-13 北京百度网讯科技有限公司 Video processing method and device and electronic equipment
CN112733545A (en) * 2020-12-28 2021-04-30 中电金信软件有限公司 Text blocking method and device, computer equipment and storage medium
CN112287916B (en) * 2020-12-28 2021-04-30 平安国际智慧城市科技股份有限公司 Video image text courseware text extraction method, device, equipment and medium

Also Published As

Publication number Publication date
CN114401419A (en) 2022-04-26

Similar Documents

Publication Publication Date Title
CN113807098B (en) Model training method and device, electronic equipment and storage medium
EP3709212A1 (en) Image processing method and device for processing image, server and storage medium
CN112396049A (en) Text error correction method and device, computer equipment and storage medium
CN112541359B (en) Document content identification method, device, electronic equipment and medium
CN112559800A (en) Method, apparatus, electronic device, medium, and product for processing video
CN114861677B (en) Information extraction method and device, electronic equipment and storage medium
CN113642584A (en) Character recognition method, device, equipment, storage medium and intelligent dictionary pen
CN115982376A (en) Method and apparatus for training models based on text, multimodal data and knowledge
CN115098729A (en) Video processing method, sample generation method, model training method and device
CN115017898A (en) Sensitive text recognition method and device, electronic equipment and storage medium
CN113361462B (en) Method and device for video processing and caption detection model
US10261987B1 (en) Pre-processing E-book in scanned format
CN114401419B (en) Video-based content generation method and device, electronic equipment and storage medium
CN114880498B (en) Event information display method and device, equipment and medium
CN114820885B (en) Image editing method and model training method, device, equipment and medium thereof
CN114880520B (en) Video title generation method, device, electronic equipment and medium
CN113791860B (en) Information conversion method, device and storage medium
CN106959945B (en) Method and device for generating short titles for news based on artificial intelligence
CN115238078A (en) Webpage information extraction method, device, equipment and storage medium
CN113033333B (en) Entity word recognition method, entity word recognition device, electronic equipment and storage medium
CN115130437A (en) Intelligent document filling method and device and storage medium
CN114996494A (en) Image processing method, image processing device, electronic equipment and storage medium
US11132500B2 (en) Annotation task instruction generation
CN113221566A (en) Entity relationship extraction method and device, electronic equipment and storage medium
CN109344254B (en) Address information classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant