CN117176979B - Method, device, equipment and storage medium for extracting content frames of multi-source heterogeneous video - Google Patents


Info

  • Publication number: CN117176979B
  • Application number: CN202310445280.6A
  • Authority: CN (China)
  • Prior art keywords: image, original, frame, frame sequence, local
  • Legal status: Active (an assumption, not a legal conclusion; Google has not performed a legal analysis)
  • Other versions: CN117176979A (Chinese)
  • Inventors: 汪昭辰 (Wang Zhaochen), 刘世章 (Liu Shizhang)
  • Current and original assignee: Qingdao Chenyuan Technology Information Co., Ltd. (the listed assignees may be inaccurate)
  • Application filed by Qingdao Chenyuan Technology Information Co., Ltd.; publication of CN117176979A, application granted, publication of CN117176979B

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a content frame extraction method, apparatus, device and storage medium for multi-source heterogeneous video. The method comprises the following steps: de-framing an original video to be processed to obtain an original frame sequence; when a processing trace exists in the original frame sequence, extracting the local target image from each image containing the processing trace to obtain a local frame sequence; performing shot segmentation on the original frame sequence and the local frame sequence respectively to obtain an original shot sequence and a local shot sequence; extracting content frames from each original shot to obtain an original content frame sequence; and extracting content frames from each local shot to obtain a local content frame sequence. The application can normalize multi-source videos with inconsistent resolution, aspect ratio and the like, and can extract the local images of videos that have undergone processing deformation, such as picture-in-picture videos and framed videos. This reduces the interference of processing deformation with image content analysis, enables similarity comparison between videos from different files whose content has changed in various ways, and provides a precondition for image content association analysis.

Description

Method, device, equipment and storage medium for extracting content frames of multi-source heterogeneous video
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting a content frame of a multi-source heterogeneous video.
Background
With the explosive growth of video on the internet, video content appears in many forms, and analysis of video content is no longer limited to the whole frame image: in picture-in-picture videos, mobile phone videos in which only a local image is effective, and black-edge videos, a local area of the whole frame image is itself a complete image. For the large number of video images that have undergone such processing deformation, existing algorithms cannot extract key frames for both the whole frame image and the local image; the two interfere with each other during analysis and calculation, and correct key frames cannot be extracted.
In the prior art, a common scheme uses key frames to represent the content of a shot. The key frame is a concept from video compression technology: when video frames are encoded, they are divided into groups of n frames each, and each group contains exactly one key frame, generally its first frame; the other frames are predicted frames (P) or bidirectional predicted frames (B) computed from the key frame (I). Some image content is present in predicted frames but not in key frames. Therefore, the key frames extracted in the prior art can hardly express the content of a shot completely.
Disclosure of Invention
The embodiments of the invention provide a method, an apparatus, a device and a storage medium for extracting content frames of multi-source heterogeneous video, so as to at least solve the technical problem in the related art that the whole frame image and the local image interfere with each other and correct key frames cannot be extracted.
According to one aspect of the embodiments of the present invention, there is provided a content frame extraction method for multi-source heterogeneous video, including: de-framing an original video to be processed to obtain an original frame sequence; detecting whether a processing trace exists in the original frame sequence and, when one exists, extracting the local target image from each image containing the processing trace to obtain a local frame sequence; performing shot segmentation on the original frame sequence and the local frame sequence respectively to obtain an original shot sequence and a local shot sequence; extracting content frames from each original shot to obtain an original content frame sequence; and extracting content frames from each local shot to obtain a local content frame sequence. A content frame is a frame that represents the content of a shot; the content frames of a shot comprise its first frame, its last frame, and N intermediate frames, where N is a natural number. The intermediate frames are obtained by calculating, for every frame of the shot other than the first and last frames, the difference rate against the previous content frame; a frame becomes a content frame when its difference rate exceeds a preset threshold.
According to another aspect of the embodiments of the present invention, there is also provided a content frame extraction apparatus for multi-source heterogeneous video, including: a de-framing module, configured to de-frame an original video to be processed to obtain an original frame sequence; a local image extraction module, configured to detect whether a processing trace exists in the original frame sequence and, when one exists, extract the local target image from each image containing the processing trace to obtain a local frame sequence; a shot segmentation module, configured to perform shot segmentation on the original frame sequence and the local frame sequence respectively to obtain an original shot sequence and a local shot sequence; and a content frame extraction module, configured to extract content frames from each original shot to obtain an original content frame sequence, and to extract content frames from each local shot to obtain a local content frame sequence. A content frame is a frame that represents the content of a shot; the content frames of a shot comprise its first frame, its last frame, and N intermediate frames, where N is a natural number. The intermediate frames are obtained by calculating, for every frame of the shot other than the first and last frames, the difference rate against the previous content frame; a frame becomes a content frame when its difference rate exceeds a preset threshold.
According to still another aspect of the embodiments of the present invention, there is also provided an electronic device including a memory in which a computer program is stored, and a processor configured to execute the content frame extraction method of a multi-source heterogeneous video described above by the computer program.
According to yet another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the content frame extraction method of multi-source heterogeneous video described above when run.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
According to the content frame extraction method for multi-source heterogeneous video provided by the embodiment of the application, for an original video to be processed, a content frame sequence of the original video is extracted on the one hand; on the other hand, when a processing trace exists in the original frame sequence, the local target image is extracted from each image containing the processing trace to obtain a local frame sequence, and the content frames of the local frame sequence are extracted. By extracting and analyzing the whole frame image and the local image separately, mutual interference between them is avoided, the interference of processing deformation with image content analysis is reduced, and the accuracy of content frame extraction is improved, so that video images from different files whose content has changed in various ways can undergo similarity comparison, which provides a precondition for image content association analysis. The content frame differs from the key frame of the prior art: although only part of the video image frames are selected as content frames, they can express all the content of the shot completely, which solves the problem that key frames in the prior art cannot express the content of the shot completely.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a schematic illustration of an application environment of an alternative multi-source heterogeneous video content frame extraction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an application environment of another alternative method for content frame extraction of multi-source heterogeneous video according to an embodiment of the present invention;
FIG. 3 is a flow chart of an alternative method of content frame extraction for multi-source heterogeneous video according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an alternative sequence of content frames according to an embodiment of the invention;
FIG. 5 is a schematic diagram of content frame extraction according to an embodiment of the invention;
FIG. 6 is a schematic diagram of an alignment of multi-source heterogeneous videos according to an embodiment of the present invention;
FIG. 7 is a schematic illustration of a picture-in-picture image in accordance with an embodiment of the invention;
FIG. 8 is a schematic illustration of a framed image according to an embodiment of the invention;
FIG. 9 is a schematic illustration of a black-edge image according to an embodiment of the invention;
FIG. 10 is a schematic illustration of a partially imaged active cell phone video image in accordance with an embodiment of the present invention;
FIG. 11 is a schematic diagram of a multi-source heterogeneous video content frame extraction process according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of a content frame extraction system architecture for multi-source heterogeneous video according to an embodiment of the present invention;
fig. 13 is a schematic structural view of an alternative multi-source heterogeneous video content frame extraction apparatus according to an embodiment of the present invention;
fig. 14 is a schematic structural view of an alternative electronic device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiments of the present invention, there is provided a method for extracting content frames of a multi-source heterogeneous video, which may be applied, as an alternative implementation, in an application environment such as that shown in fig. 1, but is not limited thereto. The application environment includes: a terminal device 102 for human-machine interaction with the user, a network 104, and a server 106. Human-machine interaction can be performed between the user 108 and the terminal device 102, and a content frame extraction application for multi-source heterogeneous video runs in the terminal device 102. The terminal device 102 includes a human-machine interaction screen 1022, a processor 1024 and a memory 1026. The human-machine interaction screen 1022 is used for displaying a sequence of video frames; the processor 1024 is used to obtain the original video to be processed; the memory 1026 is used for storing the original video to be processed.
In addition, the server 106 includes a database 1062 and a processing engine 1064, where the database 1062 is used to store the video to be processed. The processing engine 1064 is configured to: de-framing the original video to be processed to obtain an original frame sequence; detecting whether a processing trace exists in an original frame sequence, and extracting a local target image in an image containing the processing trace when the processing trace exists in the original frame sequence to obtain a local frame sequence; performing shot segmentation on the original frame sequence and the local frame sequence respectively to obtain an original shot sequence and a local shot sequence; extracting a content frame from each original shot to obtain an original content frame sequence; and extracting the content frames from each local shot to obtain a local content frame sequence.
In one or more embodiments, the content frame extraction method of the multi-source heterogeneous video of the present application may be applied to the application environment shown in fig. 2. As shown in fig. 2, a human-machine interaction may be performed between a user 202 and a user device 204. The user device 204 includes a memory 206 and a processor 208. The user equipment 204 in this embodiment may, but is not limited to, extract the content frame with reference to performing the operations performed by the terminal equipment 102.
Optionally, the terminal device 102 and the user device 204 include, but are not limited to, a mobile phone, a tablet computer, a notebook computer, a PC, a vehicle-mounted electronic device, a wearable device, and the like, and the network 104 may include, but is not limited to, a wireless network or a wired network. Wherein the wireless network comprises: WIFI and other networks that enable wireless communications. The wired network may include, but is not limited to: wide area network, metropolitan area network, local area network. The server 106 may include, but is not limited to, any hardware device that may perform calculations. The server may be a single server, a server cluster composed of a plurality of servers, or a cloud server. The above is merely an example, and is not limited in any way in the present embodiment.
At present, for the large number of mobile phone videos in which only a local image is effective and of picture-in-picture videos, existing algorithms cannot extract key frames for both the whole frame image and the local image: the two interfere with each other during analysis and calculation, and correct key frames cannot be extracted. Four types of methods are commonly used for key frame extraction. The first is based on image content: the degree of change of the video content, mainly represented by image features, is the criterion for selecting key frames; this method is common in video coding technology. The second is based on motion analysis. The third detects key frames from the density characteristics of trajectory curve points. The fourth is based on clustering. However, none of these four methods can handle key frame extraction for local image content, and the key frames they extract can hardly express the content of a shot completely.
Moreover, the images of multi-source heterogeneous videos differ in resolution, aspect ratio, color space, etc.; even when the content is the same, these image differences prevent different videos from being associated with each other.
Based on this, the embodiment of the present application provides a method for extracting content frames of a multi-source heterogeneous video, which is described in detail below with reference to fig. 3, and as shown in fig. 3, the method mainly includes the following steps:
S301, de-framing the original video to be processed to obtain an original frame sequence.
In an alternative embodiment, an original video to be processed is first obtained. The original video may be surveillance video from a scene such as a school, factory or park, or a video such as a television series, movie or variety program. Further, the obtained original video is de-framed to obtain the original frame sequence.
S302, detecting whether a processing trace exists in the original frame sequence, and extracting a local target image in an image containing the processing trace when the processing trace exists in the original frame sequence to obtain a local frame sequence.
Videos on the internet are complex and varied. A new video generated by processing and deforming an original video usually changes considerably, as in black-edge videos, picture-in-picture videos, framed videos, and mobile phone videos in which only a local image is effective. Image content analyzed directly on such a video is difficult to associate with the original video, so the effective part of the image must be extracted and the effective local target image analyzed instead.
Specifically, it is detected whether a processing trace exists in the original frame sequence, i.e., whether an image carries a processing trace such as that of a black-edge video, a picture-in-picture video, a framed video, or a mobile phone video in which only a local image is effective. If so, the local target image is extracted from each image containing the processing trace, and the extracted local target images form the local frame sequence.
In an alternative embodiment, detecting whether a processing trace exists in the original frame sequence, and extracting a local target image in an image containing the processing trace when the processing trace exists in the original frame sequence includes:
Whether an image in the original frame sequence is a picture-in-picture image is detected by a prior-art edge detection algorithm. When it is, the image inside the picture-in-picture window is extracted and taken as a first local target image, and the background image of the picture-in-picture image is extracted and taken as a second local target image.
Fig. 7 is a schematic diagram of a picture-in-picture image according to an embodiment of the present application. As shown in fig. 7, the background image of the picture-in-picture image and the image inside the window are two different images. With the local image extraction method of the present application, the image inside the window and the background image can be extracted separately, avoiding mutual interference between them. When the background image is extracted, changes of the image inside the window are ignored: the position of the window in the original frame image is recorded, and the pixels of the window area are skipped and left unprocessed.
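The "record the window position and skip its pixels" rule can be illustrated with a minimal NumPy sketch. It assumes the window rectangle has already been located by the edge detector; the function name, the `(top, bottom, left, right)` tuple layout, and the pixel-level difference measure are inventions for this sketch, not details from the application:

```python
import numpy as np

def background_difference(a, b, window):
    """Difference rate between two frames over the background only.

    `window` is (top, bottom, left, right) of the recorded picture-in-picture
    window; its pixels are masked out so that changes of the inner video do
    not disturb the background comparison.
    """
    mask = np.ones(a.shape, dtype=bool)
    top, bottom, left, right = window
    mask[top:bottom, left:right] = False          # skip the window area
    return float(np.mean(a[mask] != b[mask]))     # fraction of changed pixels

bg1 = np.zeros((10, 10), dtype=np.uint8)
bg2 = bg1.copy()
bg2[2:5, 2:5] = 9                                 # only the inner window changed
print(background_difference(bg1, bg2, (2, 5, 2, 5)))  # 0.0
```

With the window masked, two frames whose background is identical compare as identical even though the inner video keeps changing, which is exactly the behavior the paragraph above describes.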
In an alternative embodiment, detecting whether a processing trace exists in the original frame sequence, and extracting a local target image in an image containing the processing trace when the processing trace exists in the original frame sequence includes:
Whether an image in the original frame sequence contains a frame (border) is detected by a prior-art edge detection algorithm; when it does, the image inside the frame is extracted and taken as the local target image.
Fig. 8 is a schematic view of a framed image. As shown in fig. 8, a border has been added around the edge of the video picture, typically by software editing. Such images require removing the border and restoring the unframed video; therefore the image inside the frame is extracted and taken as the local target image.
In an alternative embodiment, detecting whether a processing trace exists in the original frame sequence, and extracting a local target image in an image containing the processing trace when the processing trace exists in the original frame sequence includes:
Whether an image in the original frame sequence is a black-edge image is detected by a prior-art edge detection algorithm; when it is, the image inside the black edges is extracted and taken as the local target image.
Fig. 9 is a schematic diagram of a black-edge image according to an embodiment of the present invention. As shown in fig. 9, a black-edge image is generally produced by aspect-ratio conversion: for example, to convert a 4:3 video to 16:9 while keeping the image content unchanged, the picture is scaled to 12:9 and black areas are padded on both sides to form a 16:9 image. Such a video requires detecting the black edges and extracting the original 4:3 picture. Therefore, when an image in the original frame sequence is a black-edge image, the image inside the black edges must be extracted and taken as the local target image.
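The black-edge case lends itself to a compact illustration. The sketch below assumes grayscale NumPy frames and uses a simple mean-intensity test instead of a full edge detector — a simplification of the detection step, not the application's actual algorithm:

```python
import numpy as np

def crop_black_edges(frame, thresh=8):
    """Crop near-black border rows/columns from a grayscale frame.

    A row or column is treated as padding if its mean intensity falls
    below `thresh` (an assumed tuning value).
    """
    row_ok = frame.mean(axis=1) > thresh   # rows carrying real content
    col_ok = frame.mean(axis=0) > thresh   # columns carrying real content
    if not row_ok.any() or not col_ok.any():
        return frame                       # entirely black: nothing to crop
    r0, r1 = np.argmax(row_ok), len(row_ok) - np.argmax(row_ok[::-1])
    c0, c1 = np.argmax(col_ok), len(col_ok) - np.argmax(col_ok[::-1])
    return frame[r0:r1, c0:c1]

# Pillarboxed frame: 90x160 (16:9) with content only in the middle 90x120 (12:9)
frame = np.zeros((90, 160), dtype=np.uint8)
frame[:, 20:140] = 128
print(crop_black_edges(frame).shape)  # (90, 120)
```

The cropped 90x120 (i.e. 12:9 = 4:3) region is exactly the original picture that the paragraph above says must be recovered before analysis.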
In an alternative embodiment, detecting whether a processing trace exists in the original frame sequence, and extracting a local target image in an image containing the processing trace when the processing trace exists in the original frame sequence includes:
Whether an image in the original frame sequence is a mobile phone video image in which only a local image is effective is detected by a prior-art edge detection algorithm; when it is, the local image is extracted from the mobile phone video image and taken as the local target image.
Fig. 10 is a schematic diagram of a mobile phone video image in which only a local image is effective. As shown in fig. 10, a normal video edited by software becomes a video with a mobile phone aspect ratio; this is also a special kind of picture-in-picture video. For example, when a video is played in some APP, the video picture may occupy only part of the whole phone page. The local image area of the mobile phone video image is therefore extracted, and the extracted local image is taken as the local target image.
In one embodiment, the processing trace of an image in the original frame sequence may also be a watermark; however, a watermark has little impact on image content analysis, so no local image extraction is performed for watermarked images.
According to the embodiments of the application, local target images are extracted from deformed images such as picture-in-picture images, framed images, black-edge images, and mobile phone video images in which only a local image is effective. Extracting the effective part of the image avoids mutual interference between the whole frame image and the local image during analysis and calculation, and solves the prior-art problem that processing deformation of a video disturbs video content analysis and consistency comparison.
S303, shot segmentation is carried out on the original frame sequence and the local frame sequence respectively, so that the original shot sequence and the local shot sequence are obtained.
Before executing step S303 to perform shot segmentation, normalization processing is further performed on the original frame sequence and the local frame sequence, so as to obtain a normalized original frame sequence and a normalized local frame sequence.
After the original frame sequence is obtained, its aspect ratio, resolution, and color space are normalized to obtain the normalized original frame sequence. Likewise, after the local frame sequence is extracted, its aspect ratio, resolution, and color space are normalized to obtain the normalized local frame sequence.
After videos with different encodings, file formats, resolutions, aspect ratios, etc. are de-framed, normalizing the aspect ratio, resolution, color space, etc. maps the images into an image space of the same dimensions, which makes image comparison, content association analysis, and other functions across different videos possible.
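The normalization step can be sketched as follows. The canonical 72x128 target size, the nearest-neighbour resize, and the RGB-to-grayscale reduction are all assumptions for illustration; a real system would pick production values and likely letterbox to preserve aspect ratio:

```python
import numpy as np

TARGET_H, TARGET_W = 72, 128   # assumed canonical 16:9 size, not from the application

def normalize_frame(frame):
    """Map an RGB frame of arbitrary size into one canonical image space.

    Nearest-neighbour resize to a fixed resolution, then a standard
    luma-weighted RGB -> grayscale reduction as the color-space step.
    """
    h, w, _ = frame.shape
    rows = np.arange(TARGET_H) * h // TARGET_H     # source row per target row
    cols = np.arange(TARGET_W) * w // TARGET_W     # source col per target col
    resized = frame[rows][:, cols]                 # nearest-neighbour resize
    gray = resized.astype(np.float32) @ np.array([0.299, 0.587, 0.114])
    return gray.astype(np.uint8)

a = normalize_frame(np.zeros((480, 640, 3), dtype=np.uint8))    # SD input
b = normalize_frame(np.zeros((1080, 1920, 3), dtype=np.uint8))  # HD input
print(a.shape == b.shape == (72, 128))  # True
```

Frames from heterogeneous sources land in the same (72, 128) grayscale space, so they can be compared pixel-for-pixel, which is the precondition the paragraph above describes.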
Fig. 6 is a schematic diagram of the comparison of multi-source heterogeneous videos. As shown in fig. 6, processing and deforming an original video generates new multi-source heterogeneous videos, including black-edge videos, framed videos, picture-in-picture videos, mobile phone videos in which only a local image is effective, watermarked videos, and the like. Local image extraction is then performed on each deformed video: the image inside the black edges, the image inside the frame, the image inside the picture-in-picture window, and the picture-in-picture background image are each taken as an extracted local target image, while the watermarked image is left unprocessed. The aspect ratio, resolution, and color space of the extracted local target images are then normalized, and the normalized images are compared with the original video, so that an accurate comparison result can be obtained quickly.
In an exemplary scenario, the original video is a movie, and a playing platform processes and deforms the video in order to evade infringement detection, so that existing video detection methods cannot accurately identify the infringing video. Through normalization and local image extraction of multi-source heterogeneous videos on the internet, the application makes subsequent image comparison, content association analysis, and other functions across different videos possible.
Further, shot segmentation is performed on the original frame sequence and the local frame sequence respectively to obtain the original shot sequence and the local shot sequence. The embodiment of the application does not limit the specific shot segmentation algorithm; any prior-art shot segmentation method may be used. A shot is the continuous sequence of pictures captured by a camera from the moment it starts recording to the moment it stops, and is the basic unit of video composition.
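Since the application leaves the segmentation algorithm open, the sketch below shows one classic prior-art choice: cutting wherever the normalized grayscale-histogram difference between adjacent frames exceeds a threshold. The bin count and threshold are assumed tuning values:

```python
import numpy as np

def split_shots(frames, threshold=0.5):
    """Split a frame sequence into shots at abrupt histogram changes.

    `frames` is a list of grayscale uint8 arrays; the adjacent-frame
    histogram difference is normalized into [0, 1].
    """
    shots, start = [], 0
    for i in range(1, len(frames)):
        h1, _ = np.histogram(frames[i - 1], bins=16, range=(0, 256))
        h2, _ = np.histogram(frames[i], bins=16, range=(0, 256))
        diff = np.abs(h1 - h2).sum() / (2 * frames[i].size)  # in [0, 1]
        if diff > threshold:
            shots.append(frames[start:i])   # close the shot before the cut
            start = i
    shots.append(frames[start:])            # the final shot
    return shots

dark = [np.full((8, 8), 10, dtype=np.uint8)] * 3    # one dark "shot"
bright = [np.full((8, 8), 200, dtype=np.uint8)] * 2  # one bright "shot"
print(len(split_shots(dark + bright)))  # 2
```

Histogram differencing only detects abrupt cuts; gradual transitions (fades, dissolves) need the more elaborate methods the prior art also offers.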
S304, extracting a content frame from each original shot to obtain an original content frame sequence; and extracting the content frames from each local shot to obtain a local content frame sequence.
The embodiment of the application performs shot segmentation on the original video frame sequence and the local frame sequence respectively to obtain an original shot sequence and a local shot sequence, and extracts the content frames of the original shots and of the local shots respectively. A content frame is a frame that represents shot content; the content frames of a shot comprise its first frame, its last frame, and N intermediate frames, where N is a natural number. The intermediate frames are obtained by calculating, for every frame of the shot other than the first and last frames, the difference rate against the previous content frame; a frame becomes a content frame when its difference rate exceeds a preset threshold.
Fig. 4 is a schematic diagram of an alternative sequence of content frames according to an embodiment of the invention. As shown in fig. 4, the video content is composed of a sequence of consecutive frames, and the sequence of consecutive frames can be divided into a plurality of groups according to the continuity of the video content, and each group of consecutive frame sequence is a shot.
Further, the content frames corresponding to each shot are extracted. By analyzing the differences of content within a video shot, the embodiment of the application selects a small number of frames from the continuous frame sequence of each shot to represent the shot's content; these frames are the content frames. The content frames include at least the first and last frames of the shot, also called the shot frames, so a shot has at least two content frames.
Fig. 5 is a schematic diagram of content frame extraction according to an embodiment of the present invention. As shown in fig. 5, the first frame is the first content frame. The difference rate of each subsequent frame (the 2nd, 3rd, 4th, and so on) against the most recent content frame is then calculated until it exceeds the preset threshold. For example, if the difference rates of the 5th, 6th, and 7th frames against the previous content frame are smaller than the preset threshold but that of the 8th frame is larger, the 8th frame becomes the next content frame. Proceeding by analogy, the content frames among all frames between the first frame and the last frame are calculated. The last frame is selected directly as the final content frame, without calculating its difference rate against the previous content frame. The difference rate is a calculated rate of difference between two frame images.
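The selection rule above can be sketched as follows. The pixel-level `difference_rate` is an assumed measure (fraction of noticeably changed pixels), since the application does not fix a formula:

```python
import numpy as np

def difference_rate(a, b):
    """Fraction of pixels differing by more than an assumed tolerance."""
    return float(np.mean(np.abs(a.astype(int) - b.astype(int)) > 16))

def extract_content_frames(shot, threshold=0.3):
    """Return indices of the content frames of one shot.

    First frame always; then every frame whose difference rate against the
    previous content frame exceeds `threshold`; last frame always.
    """
    picks = [0]                               # first frame is a content frame
    for i in range(1, len(shot) - 1):
        if difference_rate(shot[i], shot[picks[-1]]) > threshold:
            picks.append(i)                   # becomes the new reference frame
    if len(shot) > 1:
        picks.append(len(shot) - 1)           # last frame, no comparison needed
    return picks

shot = [np.zeros((8, 8), dtype=np.uint8)] * 7 \
     + [np.full((8, 8), 200, dtype=np.uint8)] * 3
print(extract_content_frames(shot))  # [0, 7, 9]
```

Frames 1-6 stay below the threshold relative to frame 0, frame 7 exceeds it and becomes a content frame, and frame 9 is appended as the closing shot frame — matching the walkthrough of fig. 5.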
For example, consider a surveillance video. At night there are few people and vehicles, the video picture changes little, and few content frames result; perhaps only a single-digit number of content frames is extracted over 10 hours. In the daytime there are many people and vehicles, people and objects in the picture change frequently, and the content frames calculated by this method are far more numerous than at night. Unlike key frames, which may lose part of the shot content, the content frames are thus guaranteed not to lose any content information of the shot video. And compared with schemes in which every frame of the video is calculated and considered, content frame selection keeps only part of the video image frames, so the amount of image computation is greatly reduced without losing content.
The application designs a brand-new method for extracting video content frame images, which effectively solves the problems of missing and redundant frames in key frame extraction; the content frames can completely and accurately reflect the original video content. In addition to shot segmentation and content frame extraction on the original video frame sequence, shot segmentation and content frame extraction are also performed on the local target images extracted after video content analysis processing. As a result, picture-in-picture videos, black-edge videos, framed videos, mobile phone videos in which only a local picture is effective, and the like all have multiple groups of shot sequences and content frame sequences, which provides an effective means for video content comparison and association analysis.
To facilitate understanding of the content frame extraction method of the multi-source heterogeneous video provided by the embodiment of the present application, further description is given below with reference to fig. 11. As shown in fig. 11, the method includes:
Firstly, a video to be processed is obtained and decomposed into frames to obtain an original frame sequence, and the aspect ratio, color space, resolution and so on of the original frame sequence are normalized. Shot segmentation is performed on the normalized original frame sequence to obtain an original shot sequence, and content frames are extracted from the original shot sequence to obtain the original content frame sequence.
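A minimal sketch of the normalization step, assuming RGB input frames, letterbox padding for the aspect ratio, nearest-neighbour resampling for the resolution, and a BT.601 luminance conversion for the color space. The patent does not prescribe these particular choices; they stand in for whatever mapping puts all frames into the same image space:

```python
import numpy as np

def normalize_frame(frame: np.ndarray,
                    out_h: int = 360, out_w: int = 640) -> np.ndarray:
    """Map one RGB frame (H, W, 3) into a common image space."""
    h, w = frame.shape[:2]
    target = out_w / out_h
    # --- aspect-ratio normalization: letterbox-pad with black ---
    if w / h < target:                       # too narrow: pad width
        pad = int(round(h * target)) - w
        frame = np.pad(frame, ((0, 0), (pad // 2, pad - pad // 2), (0, 0)))
    elif w / h > target:                     # too wide: pad height
        pad = int(round(w / target)) - h
        frame = np.pad(frame, ((pad // 2, pad - pad // 2), (0, 0), (0, 0)))
    h, w = frame.shape[:2]
    # --- resolution normalization: nearest-neighbour resample ---
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    frame = frame[rows][:, cols]
    # --- color-space normalization: RGB -> BT.601 luminance ---
    return (frame @ np.array([0.299, 0.587, 0.114])).astype(np.uint8)
```

After this step, frames from videos with different encodings, resolutions and aspect ratios are directly comparable pixel-for-pixel.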
On the other hand, whether processing traces exist in the original frame sequence is detected, which includes detecting whether an image in the original frame sequence belongs to a framed video, a black-edge video, a picture-in-picture video, a mobile phone video in which only a local picture is effective, and the like. When an image in the original frame sequence contains processing traces, the local target image in that image is extracted to obtain a local frame sequence. The local frame sequence is normalized and segmented into shots to obtain a local shot sequence, and content frames are then extracted from the local shot sequence to obtain a local content frame sequence. Therefore, picture-in-picture videos, black-edge videos, framed videos, mobile phone videos in which only a local picture is effective, and the like have multiple groups of shot sequences and content frame sequences, which provides an effective means for video content comparison and association analysis.
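The processing-trace branch can be illustrated with the simplest case, a black-edge video. The darkness threshold and the full-row/full-column border heuristic below are assumptions for illustration, not the patent's detection method:

```python
import numpy as np

def extract_local_image(frame: np.ndarray, dark_threshold: float = 10.0):
    """Detect a simple processing trace, near-black borders around the
    effective picture, and return the cropped local target image,
    or None when no such trace is found."""
    gray = frame.mean(axis=2) if frame.ndim == 3 else frame
    row_dark = gray.mean(axis=1) < dark_threshold   # fully dark rows
    col_dark = gray.mean(axis=0) < dark_threshold   # fully dark columns
    if not row_dark.any() and not col_dark.any():
        return None                                 # no processing trace
    rows = np.where(~row_dark)[0]
    cols = np.where(~col_dark)[0]
    if rows.size == 0 or cols.size == 0:
        return None                                 # frame is entirely black
    # crop the bounding box of the non-dark region (the local target image)
    return frame[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```

Framed and picture-in-picture videos would need their own detectors (e.g. edge or window detection), but the output contract is the same: a cropped local image that then flows through normalization, shot segmentation and content frame extraction.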
Fig. 12 is a schematic diagram of a content frame extraction system architecture for multi-source heterogeneous video according to an embodiment of the present invention. As shown in fig. 12, shot segmentation and content frame extraction are performed on multi-source, massive, complex videos. Following this design idea, the system logic framework is divided into four layers: an infrastructure layer, an information production layer, a platform technology layer and an application display layer.
The infrastructure layer is the basis for the reliable, stable and efficient operation of the whole system, and comprises an operating system, application software, hardware for data storage and dynamic management service and a communication facility.
The information production layer stores, normalizes and granulates the access video, carries out sample-free autonomous learning on the video content, realizes the analysis of the video content and provides data support for system application.
The platform technology layer provides an adaptive service interface for diversified applications through image feature space analysis.
The application presentation layer comprises resource allocation and provides content frames after system analysis processing.
According to the content frame extraction method of the multi-source heterogeneous video provided by the embodiment, firstly, videos that differ in encoding, file format, resolution and aspect ratio are decomposed into frames, and then aspect ratio, resolution and color space normalization is performed so that the images are mapped into the same image space, which makes functions such as image comparison and content association analysis across different videos possible.
Furthermore, by extracting and analyzing the whole frame image and the local image separately, the application can avoid mutual interference between them, reduce the interference of processing deformation on the image content, and improve the accuracy of content frame extraction. Therefore, picture-in-picture videos, black-edge videos, framed videos, mobile phone videos in which only a local picture is effective, and the like have multiple groups of shot sequences and content frame sequences, which provides an effective means for video content comparison and association analysis.
Finally, the content frame of the application differs from the key frame in the prior art: only part of the video image frames are selected as content frames, yet they can express the entire content of the shot, which solves the problem that key frames in the prior art cannot express the shot content completely.
According to another aspect of the embodiment of the present invention, there is also provided a content frame extraction apparatus of multi-source heterogeneous video for implementing the above content frame extraction method of multi-source heterogeneous video. As shown in fig. 13, the apparatus includes: a de-framing module 1301, a local image extraction module 1302, a shot segmentation module 1303 and a content frame extraction module 1304.
The de-framing module 1301 is configured to decompose an original video to be processed into frames to obtain an original frame sequence;
The local image extraction module 1302 is configured to detect whether a processing trace exists in the original frame sequence, and extract a local target image in an image containing the processing trace when the processing trace exists in the original frame sequence, so as to obtain a local frame sequence;
The shot segmentation module 1303 is used for respectively performing shot segmentation on the original frame sequence and the local frame sequence to obtain an original shot sequence and a local shot sequence;
A content frame extraction module 1304, configured to extract a content frame for each original shot, to obtain an original content frame sequence; extracting a content frame from each local shot to obtain a local content frame sequence;
The content frame is a frame representing shot content and comprises a first frame, a last frame and N intermediate frames, wherein N is a natural number, and the intermediate frames are obtained by calculating the difference rate between each sub-frame of a shot, other than the first frame and the last frame, and the previous content frame, a sub-frame being taken as an intermediate frame when its difference rate is larger than a preset threshold value.
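Under the assumption that each module is a callable, the four modules can be wired together as in this sketch; the stage functions passed in are placeholders for the module implementations, not the patent's own code:

```python
def content_frame_pipeline(video, deframe, find_local, segment, extract):
    """Wire the four modules together: de-frame the video, look for local
    target images left by processing traces, then run shot segmentation and
    content frame extraction on each resulting frame sequence."""
    original_frames = deframe(video)             # de-framing module
    local_frames = find_local(original_frames)   # local image extraction module
    sequences = {"original": original_frames}
    if local_frames:                             # only when traces were found
        sequences["local"] = local_frames
    # shot segmentation module + content frame extraction module, per sequence
    return {name: [extract(shot) for shot in segment(frames)]
            for name, frames in sequences.items()}
```

Videos without processing traces yield one content frame sequence; videos with traces yield both an original and a local sequence, matching the multi-group output described above.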
It should be noted that when the content frame extraction apparatus for multi-source heterogeneous video provided in the above embodiment performs the content frame extraction method for multi-source heterogeneous video, the division into the above functional modules is merely illustrative; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the content frame extraction apparatus of the multi-source heterogeneous video provided in the above embodiment and the embodiment of the content frame extraction method of the multi-source heterogeneous video belong to the same concept; the detailed implementation process is given in the method embodiment and is not repeated here.
According to still another aspect of the embodiment of the present invention, there is also provided an electronic device for implementing the content frame extraction method of multi-source heterogeneous video, which may be a terminal device or a server as shown in fig. 14. This embodiment is described taking the electronic device as an example. As shown in fig. 14, the electronic device comprises a memory 1405 and a processor 1403, the memory 1405 having a computer program stored therein, and the processor 1403 being arranged to perform the steps of any one of the method embodiments described above by means of the computer program.
Alternatively, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of the computer network.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program: de-framing the original video to be processed to obtain an original frame sequence; detecting whether a processing trace exists in an original frame sequence, and extracting a local target image in an image containing the processing trace when the processing trace exists in the original frame sequence to obtain a local frame sequence; performing shot segmentation on the original frame sequence and the local frame sequence respectively to obtain an original shot sequence and a local shot sequence; extracting a content frame from each original shot to obtain an original content frame sequence; and extracting the content frames from each local shot to obtain a local content frame sequence.
Alternatively, it will be understood by those skilled in the art that the structure shown in fig. 14 is only schematic, and the electronic device may be a smart phone (such as an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Device, MID), a PAD, etc. Fig. 14 does not limit the structure of the above electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in fig. 14, or have a different configuration than shown in fig. 14.
The memory 1405 may be used to store software programs and modules, such as the program instructions/modules corresponding to the content frame extraction method and apparatus of multi-source heterogeneous video in the embodiments of the present invention; the processor 1403 executes the software programs and modules stored in the memory 1405 to perform various functional applications and data processing, that is, to implement the content frame extraction method of multi-source heterogeneous video. The memory 1405 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1405 may further include memory located remotely from the processor 1403, which may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 1405 may be used, but is not limited to, to store information such as content frames. As an example, as shown in fig. 14, the memory 1405 may include, but is not limited to, the de-framing module 1301, the local image extraction module 1302, the shot segmentation module 1303 and the content frame extraction module 1304 of the content frame extraction apparatus of multi-source heterogeneous video described above. In addition, other module units of the content frame extraction apparatus of the multi-source heterogeneous video may also be included, but are not limited thereto, and are not described in detail in this example.
Optionally, the transmission device 1404 is used to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission device 1404 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 1404 is a Radio Frequency (RF) module that is used to communicate wirelessly with the internet.
In addition, the electronic device further includes: a display 1401 for displaying the above-described sequence of content frames; and a connection bus 1402 for connecting the respective module parts in the above-described electronic device.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the content frame extraction method of multi-source heterogeneous video described above, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may be configured to store a computer program for executing the steps of: de-framing the original video to be processed to obtain an original frame sequence; detecting whether a processing trace exists in an original frame sequence, and extracting a local target image in an image containing the processing trace when the processing trace exists in the original frame sequence to obtain a local frame sequence; performing shot segmentation on the original frame sequence and the local frame sequence respectively to obtain an original shot sequence and a local shot sequence; extracting a content frame from each original shot to obtain an original content frame sequence; and extracting the content frames from each local shot to obtain a local content frame sequence.
Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program for instructing a terminal device to execute the steps, where the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic disk or optical disk, etc.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present invention may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method of the various embodiments of the present invention.
In the foregoing embodiments of the present invention, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary; for example, the division of the units is merely a logical functional division, and there may be other manners of division in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed between the components may be through some interfaces, units or modules, and may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.

Claims (10)

1. A method for extracting a content frame of a multi-source heterogeneous video, comprising:
de-framing the original video to be processed to obtain an original frame sequence;
Detecting whether a processing trace exists in the original frame sequence, and extracting a local target image in an image containing the processing trace when the processing trace exists in the original frame sequence to obtain a local frame sequence;
Performing shot segmentation on the original frame sequence and the local frame sequence respectively to obtain an original shot sequence and a local shot sequence;
Extracting a content frame from each original shot to obtain an original content frame sequence; extracting a content frame from each local shot to obtain a local content frame sequence;
the content frame is a frame representing the content of a shot, and comprises a first frame, a last frame and N intermediate frames, wherein N is a natural number, and the intermediate frames are obtained by calculating the difference rate between each sub-frame of a shot, other than the first frame and the last frame, and the previous content frame, a sub-frame being taken as an intermediate frame when its difference rate is larger than a preset threshold value.
2. The method of claim 1, further comprising, prior to shot segmentation of the original frame sequence and the partial frame sequence, respectively:
And respectively carrying out normalization processing on the original frame sequence and the local frame sequence to obtain a normalized original frame sequence and a normalized local frame sequence.
3. The method of claim 2, wherein normalizing the original frame sequence and the partial frame sequence, respectively, comprises:
normalizing the aspect ratio, the resolution and the color space of the original frame sequence to obtain a normalized original frame sequence;
normalizing the aspect ratio, the resolution and the color space of the local frame sequence to obtain a normalized local frame sequence.
4. The method of claim 1, wherein detecting whether a processing trace is present in the original frame sequence, and extracting a local target image from an image containing the processing trace when the processing trace is present in the original frame sequence, comprises:
Detecting whether an image in the original frame sequence is a picture-in-picture image;
When the image in the original frame sequence is the picture-in-picture image, extracting the image in a picture-in-picture window, and taking the image in the picture-in-picture window as a first local target image;
And extracting a background image in the picture-in-picture image, and taking the background image as a second local target image.
5. The method of claim 1, wherein detecting whether a processing trace is present in the original frame sequence, and extracting a local target image from an image containing the processing trace when the processing trace is present in the original frame sequence, comprises:
Detecting whether an image in the original frame sequence is an image containing a frame;
And when the image in the original frame sequence is an image containing a frame, extracting an image in the frame, and taking the image in the frame as the local target image.
6. The method of claim 1, wherein detecting whether a processing trace is present in the original frame sequence, and extracting a local target image from an image containing the processing trace when the processing trace is present in the original frame sequence, comprises:
detecting whether an image in the original frame sequence is a black-edge image;
And when the image in the original frame sequence is the black-edge image, extracting the image within the black edges, and taking that image as the local target image.
7. The method of claim 1, wherein detecting whether a processing trace is present in the original frame sequence, and extracting a local target image from an image containing the processing trace when the processing trace is present in the original frame sequence, comprises:
detecting whether an image in the original frame sequence is a mobile phone video image in which only a local image is effective;
And when the image in the original frame sequence is the mobile phone video image in which only a local image is effective, extracting the local image therein and taking the local image as the local target image.
8. A content frame extraction apparatus for multi-source heterogeneous video, comprising:
the frame-decoding module is used for decoding the original video to be processed to obtain an original frame sequence;
the local image extraction module is used for detecting whether a processing trace exists in the original frame sequence, and extracting a local target image in an image containing the processing trace when the processing trace exists in the original frame sequence to obtain a local frame sequence;
The shot segmentation module is used for respectively carrying out shot segmentation on the original frame sequence and the local frame sequence to obtain an original shot sequence and a local shot sequence;
The content frame extraction module is used for extracting the content frames from each original lens to obtain an original content frame sequence; extracting a content frame from each local shot to obtain a local content frame sequence;
the content frame is a frame representing the content of a shot, and comprises a first frame, a last frame and N intermediate frames, wherein N is a natural number, and the intermediate frames are obtained by calculating the difference rate between each sub-frame of a shot, other than the first frame and the last frame, and the previous content frame, a sub-frame being taken as an intermediate frame when its difference rate is larger than a preset threshold value.
9. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 7 by means of the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program, when run, performs the method of any one of claims 1 to 7.
CN202310445280.6A 2023-04-24 2023-04-24 Method, device, equipment and storage medium for extracting content frames of multi-source heterogeneous video Active CN117176979B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310445280.6A CN117176979B (en) 2023-04-24 2023-04-24 Method, device, equipment and storage medium for extracting content frames of multi-source heterogeneous video

Publications (2)

Publication Number Publication Date
CN117176979A CN117176979A (en) 2023-12-05
CN117176979B true CN117176979B (en) 2024-05-03

Family

ID=88930484

Country Status (1)

Country Link
CN (1) CN117176979B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101162470A (en) * 2007-11-16 2008-04-16 北京交通大学 Video frequency advertisement recognition method based on layered matching
CN107220597A (en) * 2017-05-11 2017-09-29 北京化工大学 A kind of key frame extraction method based on local feature and bag of words human action identification process
CN110688524A (en) * 2019-09-24 2020-01-14 深圳市网心科技有限公司 Video retrieval method and device, electronic equipment and storage medium
CN110766711A (en) * 2019-09-16 2020-02-07 天脉聚源(杭州)传媒科技有限公司 Video shot segmentation method, system, device and storage medium
CN111368656A (en) * 2020-02-21 2020-07-03 华为技术有限公司 Video content description method and video content description device
CN111582116A (en) * 2020-04-29 2020-08-25 腾讯科技(深圳)有限公司 Video erasing trace detection method, device, equipment and storage medium
CN111683269A (en) * 2020-06-12 2020-09-18 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
WO2021154861A1 (en) * 2020-01-27 2021-08-05 Schlumberger Technology Corporation Key frame extraction for underwater telemetry and anomaly detection
CN114363649A (en) * 2021-12-31 2022-04-15 北京字节跳动网络技术有限公司 Video processing method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8467611B2 (en) * 2010-12-10 2013-06-18 Eastman Kodak Company Video key-frame extraction using bi-level sparsity
CN102903128B (en) * 2012-09-07 2016-12-21 北京航空航天大学 The video image content editor's transmission method kept based on Similarity of Local Characteristic Structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang Tao; Xiao Jun; Wu Fei; Zhuang Yueting. Key-frame extraction of motion capture data based on hierarchical curve simplification. Journal of Computer-Aided Design & Computer Graphics, 2006, (11). *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant