CN115471772A - Method, device, equipment and medium for extracting key frame - Google Patents

Method, device, equipment and medium for extracting key frame

Info

Publication number
CN115471772A
Authority
CN
China
Prior art keywords: processed, video frame, key, video, frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211127523.3A
Other languages
Chinese (zh)
Inventor
李振铎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China
Priority to CN202211127523.3A
Publication of CN115471772A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/56 - Extraction of image or video features relating to colour
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 - Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/48 - Matching video sequences


Abstract

The invention discloses a method, a device, equipment and a medium for extracting key frames. The method for extracting key frames comprises the following steps: acquiring a to-be-processed video frame sequence of a moving target, and extracting motion features and color features of the to-be-processed video frame sequence to obtain a multi-dimensional feature extraction result; determining an interframe difference index according to the multi-dimensional feature extraction result and the number of video shots matched with the to-be-processed video frame sequence; determining a key frame threshold according to the interframe difference index and the total frame number of the to-be-processed video frame sequence; and determining target key video frames in the to-be-processed video frame sequence through a target detection model, the interframe difference index and the key frame threshold. The technical scheme of the embodiment of the invention can adaptively set the key frame threshold and flexibly and accurately screen out key frames.

Description

Method, device, equipment and medium for extracting key frame
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a medium for extracting a key frame.
Background
With the development of internet and internet-of-things technology, the number of videos has grown rapidly, and it is estimated that the total amount of video data will increase 50-fold every 10 years. However, video data changes greatly, is large in volume and low in abstraction level, so the key frame extraction technology for videos has attracted more and more attention.
At present, for videos with moving objects (people, vehicles, and the like), key frame extraction is generally performed manually, that is, key frames are extracted by manual identification. However, manual identification of key frames is time-consuming and labor-intensive and is limited by human physiological features, so key frames may be extracted incorrectly or missed. Existing non-manual key frame extraction methods suffer from problems such as inflexible key frame thresholds and insufficient utilization of video frame information, so the extracted key frames are incomplete or inaccurate.
Disclosure of Invention
The invention provides a method, a device, equipment and a medium for extracting a key frame, which can adaptively set a key frame threshold value and flexibly and accurately screen out the key frame.
According to an aspect of the present invention, there is provided a method for extracting a key frame, including:
acquiring a video frame sequence to be processed of a moving target, and extracting motion characteristics and color characteristics of the video frame sequence to be processed to obtain a multi-dimensional characteristic extraction result;
determining an interframe difference index according to a multi-dimensional feature extraction result and the number of video shots matched with a video frame sequence to be processed;
determining a key frame threshold according to the interframe difference index and the total frame number of the video frame sequence to be processed;
and determining a target key video frame in the video frame sequence to be processed through a target detection model, an interframe difference index and a key frame threshold.
According to another aspect of the present invention, there is provided an apparatus for extracting a key frame, including:
the characteristic extraction module is used for acquiring a video frame sequence to be processed of a moving target, and extracting motion characteristics and color characteristics of the video frame sequence to be processed to obtain a multi-dimensional characteristic extraction result;
the interframe difference index determining module is used for determining interframe difference indexes according to the multi-dimensional feature extraction result and the number of video shots matched with the video frame sequence to be processed;
the key frame threshold value determining module is used for determining a key frame threshold value according to the interframe difference index and the total frame number of the video frame sequence to be processed;
and the target key video frame determining module is used for determining a target key video frame in the video frame sequence to be processed through the target detection model, the interframe difference index and the key frame threshold.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the method of extracting a key frame according to any of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the method for extracting a key frame according to any one of the embodiments of the present invention when the computer instructions are executed.
According to the technical scheme of the embodiment of the invention, a multi-dimensional feature extraction result is obtained by obtaining a to-be-processed video frame sequence of a moving target and extracting motion features and color features of the to-be-processed video frame sequence, and then an interframe difference index is determined according to the multi-dimensional feature extraction result and the number of video shots matched with the to-be-processed video frame sequence, so that a key frame threshold value is determined according to the interframe difference index and the total frame number of the to-be-processed video frame sequence, and a target key video frame in the to-be-processed video frame sequence is further determined through a target detection model, the interframe difference index and the key frame threshold value. According to the scheme, potential key features of a moving target in a video frame sequence to be processed are mined through motion feature and color feature extraction, and a key frame threshold value can be generated in a self-adaptive mode according to the difference degree between video frames and the total frame number of the video frame sequence to be processed, so that accurate positioning and capturing of the moving target in the video frame sequence to be processed are achieved, namely automatic flexible accurate extraction of the target key video frame is achieved, the problems that key frame extraction in the prior art is inflexible and inaccurate are solved, the key frame threshold value can be set in a self-adaptive mode, and the key frame can be screened out flexibly and accurately.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a method for extracting a key frame according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for extracting a key frame according to a second embodiment of the present invention;
fig. 3 is a flowchart of another method for extracting a key frame according to the second embodiment of the present invention;
fig. 4 is a schematic structural diagram of an apparatus for extracting a key frame according to a third embodiment of the present invention;
FIG. 5 illustrates a schematic diagram of an electronic device that may be used to implement embodiments of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
It is to be understood that the terms "target" and the like in the description and claims of the present invention and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1 is a flowchart of a method for extracting a key frame according to an embodiment of the present invention. The method is applicable to extracting key frames from a video containing a moving object, and may be executed by a key frame extraction device, where the key frame extraction device may be implemented in the form of hardware and/or software, and the key frame extraction device may be configured in an electronic device. As shown in fig. 1, the method includes:
s110, obtaining a video frame sequence to be processed of the moving target, and extracting motion characteristics and color characteristics of the video frame sequence to be processed to obtain a multi-dimensional characteristic extraction result.
The moving target can be any moving object and can be determined according to actual tracking requirements. Alternatively, the moving object may be a person and/or a vehicle, etc. The video frame sequence to be processed may be a video frame sequence associated with a moving object for extracting a key video frame where the moving object is located. A plurality of video frames may be included in the sequence of video frames to be processed. The motion features may be behavioral features that occupy time and space. The color feature may be a visual feature on a color space. The multi-dimensional feature extraction result may be a feature extraction result obtained after motion feature and color feature extraction is performed on the video frame sequence to be processed.
In the embodiment of the invention, the video can be acquired according to the detection requirement of the key video frame of the video comprising the moving object, the acquired video is converted to obtain the video frame sequence to be processed, and the motion characteristic and the color characteristic of the video frame sequence to be processed are extracted to obtain the multi-dimensional characteristic extraction result.
And S120, determining the interframe difference index according to the multi-dimensional feature extraction result and the number of the video shots matched with the video frame sequence to be processed.
The number of video shots may be the total number of shots contained in the captured video. The inter-frame difference index may be used to characterize the degree of difference between two consecutive video frames to be processed in a sequence of video frames to be processed.
In the embodiment of the present invention, dimensions of the multidimensional feature extraction result may be unified, and then a feature vector is created according to the multidimensional feature extraction result with unified dimensions, and the number of video shots of the obtained video (that is, the number of video shots matched with the sequence of video frames to be processed) is further determined, so that inter-frame difference indexes of two consecutive video frames to be processed (hereinafter, simply referred to as consecutive video frames to be processed) in the sequence of video frames to be processed are sequentially determined according to the feature vector and the number of video shots matched with the sequence of video frames to be processed.
And S130, determining a key frame threshold according to the interframe difference index and the total frame number of the video frame sequence to be processed.
The key frame threshold may be a threshold determined according to the interframe difference index and the total frame number of the to-be-processed video frame sequence, and is used for screening the to-be-processed video frames in the to-be-processed video frame sequence.
In the embodiment of the invention, the mean value of the interframe difference indexes can be determined according to the interframe difference indexes of the continuous video frames to be processed and the total frame number of the video frame sequence to be processed which are sequentially determined, so that the obtained mean value is used as the key frame threshold.
S140, determining a target key video frame in the video frame sequence to be processed through the target detection model, the interframe difference index and the key frame threshold.
Wherein the target detection model may be a network for identifying target objects in the image.
In the embodiment of the invention, the video frames to be processed of the video frame sequence to be processed can be screened for the first time according to the interframe difference index and the key frame threshold, the screened continuous video frames to be processed with the interframe difference index larger than the key frame threshold are input into the target detection model, and then the video frames to be processed input into the target detection model are screened for the second time according to the output result of the target detection model to obtain the target key video frames in the video frame sequence to be processed.
According to the technical scheme of the embodiment of the invention, a multi-dimensional feature extraction result is obtained by obtaining a to-be-processed video frame sequence of a moving target and extracting motion features and color features of the to-be-processed video frame sequence, and then an interframe difference index is determined according to the multi-dimensional feature extraction result and the number of video shots matched with the to-be-processed video frame sequence, so that a key frame threshold value is determined according to the interframe difference index and the total frame number of the to-be-processed video frame sequence, and a target key video frame in the to-be-processed video frame sequence is further determined through a target detection model, the interframe difference index and the key frame threshold value. According to the scheme, potential key features of a moving target in a video frame sequence to be processed are mined through motion feature and color feature extraction, and a key frame threshold value can be generated in a self-adaptive mode according to the difference degree between video frames and the total frame number of the video frame sequence to be processed, so that accurate positioning and capturing of the moving target in the video frame sequence to be processed are achieved, namely automatic flexible accurate extraction of the target key video frame is achieved, the problems that key frame extraction in the prior art is inflexible and inaccurate are solved, the key frame threshold value can be set in a self-adaptive mode, and the key frame can be screened out flexibly and accurately.
Example two
Fig. 2 is a flowchart of a method for extracting a key frame according to a second embodiment of the present invention, which is embodied on the basis of the foregoing embodiment, and shows a specific optional implementation manner for extracting motion features and color features of a video frame sequence to be processed to obtain a result of extracting multidimensional features. As shown in fig. 2, the method includes:
s210, obtaining a video frame sequence to be processed of the moving target, and performing first motion characteristic extraction on the video frame sequence to be processed based on a Gaussian mixture background model to obtain the moving target area of each video frame to be processed.
The Gaussian mixture background model can be a model that describes the change rule of each pixel point through a mixture of Gaussian distributions. Generally, the change of each pixel point in a video frame is regarded as a random process which continuously generates pixel values, and the background representation of the video frame can be realized by the Gaussian mixture background model.
The first motion feature may be a motion feature extracted after background subtraction is performed on the video frame to be processed and the gaussian background model. The moving object area may be an area occupied by the moving object in a background subtraction result of the video frame to be processed and the gaussian background model.
In the embodiment of the invention, after the video frame sequence to be processed of the moving target is obtained, modeling can be performed according to a Gaussian mixture background model, so that background subtraction is performed on the current video frame to be processed and a modeling result, first motion characteristic extraction is performed based on the background subtraction result, the area occupied by the moving target in the background subtraction result is determined, the area of the moving target of the current video frame to be processed is obtained, and by analogy, the area of the moving target of each video frame to be processed can be obtained.
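For illustration, a minimal Python sketch of this step follows. The patent provides no code, so the use of OpenCV's MOG2 subtractor as the Gaussian mixture background model and all function names here are assumptions:

```python
import cv2
import numpy as np

# Assumed stand-in for the patent's Gaussian mixture background model.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)

def moving_target_area(frame: np.ndarray) -> int:
    """Background-subtract one to-be-processed frame against the model and
    return the moving target area A(f_k) as the count of foreground pixels."""
    mask = subtractor.apply(frame)                    # per-pixel foreground mask
    _, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
    return int(np.count_nonzero(mask == 255))         # white (255) = moving target
```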
S220, performing second motion characteristic extraction on the continuous video frames to be processed in the video frame sequence to be processed through a diamond search method to obtain the motion vectors of the continuous video frames to be processed.
The diamond search method may be a search algorithm that determines an area with the highest similarity in two frames of images by using a diamond as a search template in the two frames of images. The second motion feature may be a motion feature in the video frame to be processed extracted based on a diamond search method. The motion vector may be a vector formed between regions of the two images that satisfy a similarity threshold (which may be set as desired).
In the embodiment of the invention, based on a diamond search method, similar region matching is performed on the continuous video frames to be processed in the video frame sequence to be processed, and second motion feature extraction is performed on the regions meeting the similarity threshold, so as to obtain the vectors formed between the regions meeting the similarity threshold in the continuous video frames to be processed, and then the motion vectors of the continuous video frames to be processed are obtained.
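A sketch of diamond-search block matching is shown below. It assumes the classic large/small diamond search patterns and the sum of absolute differences (SAD) as the similarity measure; the patent names only "diamond search" for 16 x 16 blocks, so these details are assumptions:

```python
import numpy as np

LDSP = [(0, 0), (-2, 0), (2, 0), (0, -2), (0, 2),
        (-1, -1), (-1, 1), (1, -1), (1, 1)]           # large diamond pattern
SDSP = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]     # small diamond pattern

def _sad(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def diamond_search(prev, curr, y, x, size=16):
    """Displacement (Y_j, X_j) of the size x size block at (y, x), prev -> curr."""
    block = curr[y:y + size, x:x + size]
    h, w = prev.shape[:2]
    cy, cx = y, x

    def candidates(pattern):
        for dy, dx in pattern:                        # in-bounds candidate blocks
            ny, nx = cy + dy, cx + dx
            if 0 <= ny <= h - size and 0 <= nx <= w - size:
                yield _sad(block, prev[ny:ny + size, nx:nx + size]), dy, dx

    while True:                                       # large-diamond stage
        _, dy, dx = min(candidates(LDSP), key=lambda t: t[0])
        if (dy, dx) == (0, 0):                        # best match at the centre
            break
        cy, cx = cy + dy, cx + dx                     # re-centre and repeat
    _, dy, dx = min(candidates(SDSP), key=lambda t: t[0])  # small-diamond refinement
    return cy + dy - y, cx + dx - x
```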
S230, extracting color features of the video frame sequence to be processed to obtain the color entropy of each video frame to be processed.
Where color entropy can be used to describe the degree of colorfulness in an image.
In the embodiment of the invention, based on the calculation principle of the information entropy, color feature extraction can be performed on each video frame to be processed in the video frame sequence to be processed, so as to obtain the color entropy of each video frame to be processed.
The information entropy is a concept extended from thermodynamics to informatics and represents the occurrence probability of certain specific information, and if one system is more ordered, the information entropy is lower, and conversely, if one system is more disordered, the information entropy is higher.
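A sketch of the colour-entropy computation under these definitions is given below; the weights and histogram bin counts are the exemplary values from the detailed example later in this description, and treating x, y, z as per-channel bin counts is our assumption:

```python
import cv2
import numpy as np

def color_entropy(frame, weights=(0.5, 0.3, 0.2), bins=(22, 17, 12)):
    """Weighted information entropy of the H, S and V histograms of one frame."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    ranges = ((0, 180), (0, 256), (0, 256))           # OpenCV 8-bit HSV value ranges
    entropy = 0.0
    for ch, (w, b, r) in enumerate(zip(weights, bins, ranges)):
        hist = cv2.calcHist([hsv], [ch], None, [b], list(r)).ravel()
        p = hist / hist.sum()                         # normalised h(i), s(i), v(i)
        p = p[p > 0]                                  # skip empty bins before log2
        entropy -= w * float(np.sum(p * np.log2(p)))  # base-2 logarithm
    return entropy
```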
S240, determining a multi-dimensional feature extraction result according to the moving target area of each video frame to be processed, the motion vector of each continuous video frame to be processed and the color entropy of each video frame to be processed.
In the embodiment of the invention, the moving target area of each to-be-processed video frame and the dimension of the color entropy of each to-be-processed video frame can be adjusted according to the dimension of the motion vector of the to-be-processed continuous video frame, so that the feature extraction result with uniform dimension is used as the multi-dimensional feature extraction result.
And S250, determining the interframe difference index according to the multi-dimensional feature extraction result and the number of the video shots matched with the video frame sequence to be processed.
In an optional embodiment of the present invention, determining the inter-frame difference index according to the multidimensional feature extraction result and the number of video shots matched with the sequence of video frames to be processed may include: determining a feature vector between the current continuous frames according to a multi-dimensional feature extraction result matched with the current continuous video frames to be processed; clustering the video frame sequence to be processed to obtain the number of video shots matched with the video frame sequence to be processed; and calculating a target weight coefficient according to the number of the video shots, and determining the interframe difference index of the current continuous video frame to be processed according to the current continuous interframe feature vector and the target weight coefficient.
The feature vector between the current continuous frames can be a vector describing the feature of each dimension of the current continuous video frame to be processed. The target weight coefficient may be a weight coefficient determined by the number of video shots, which matches the number of feature dimensions of the multi-dimensional feature extraction result.
In the embodiment of the invention, the feature extraction results of the current continuous video frames to be processed on each dimension can be determined based on the multi-dimensional feature extraction results of the current continuous video frames to be processed, and then the feature vectors of the current continuous frames to be processed are determined according to the feature extraction results of the current continuous video frames to be processed on each dimension and the sum of the feature extraction results of all the continuous video frames to be processed on each dimension, so that the video frame sequences to be processed are clustered according to the color entropy of the current continuous video frames to be processed, and the number of video shots matched with the video frame sequences to be processed is obtained. Further, according to the number of video shots, determining a weight coefficient matched with the feature dimension number of the multi-dimensional feature extraction result to obtain a target weight coefficient, and performing vector multiplication operation according to the current continuous interframe feature vector and the target weight coefficient to obtain an interframe difference index of the current continuous video frame to be processed.
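The computation just described can be sketched as follows. The per-dimension normalisation follows the description above, but the mapping from the shot number to the weight coefficients appears only as an equation image in the original record, so the uniform weights below are an explicit placeholder, not the patent's formula:

```python
import numpy as np

def interframe_difference_indices(E, A, V, eta):
    """d(f_k, f_{k+1}) for every pair of consecutive to-be-processed frames.

    E, A, V: length M-1 sequences of colour-entropy differences, moving-target
    area differences and motion-vector magnitudes between consecutive frames.
    eta: number of video shots (used by the patent to derive the weights).
    """
    x = np.stack([np.asarray(f, float) / np.sum(f) for f in (E, A, V)],
                 axis=1)                              # continuous interframe feature vectors
    omega = np.full(3, 1.0 / 3.0)                     # placeholder for omega(eta)
    return x @ omega                                  # weighted sum per frame pair
```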
And S260, determining a key frame threshold according to the interframe difference index and the total frame number of the video frame sequence to be processed.
In an optional embodiment of the present invention, determining the key frame threshold according to the inter-frame difference index and the total frame number of the video frame sequence to be processed may include: summing the interframe difference indexes of all continuous video frames to be processed to obtain a target sum value; determining the target number of all continuous video frames to be processed according to the total frame number of the video frame sequence to be processed; the key frame threshold is determined based on the quotient of the target sum and the target number.
Wherein the target sum value may be a sum value of inter-frame difference indices of all consecutive video frames to be processed. The target number may be the number of consecutive video frames to be processed in total.
In the embodiment of the present invention, the interframe difference indexes of all the continuous video frames to be processed may be summed to obtain a target sum value, and then the difference value between the total frame number of the video frame sequence to be processed and 1 is used as the target number of all the continuous video frames to be processed, and further the quotient of the target sum value and the target number is used as the key frame threshold.
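As a small sketch, the adaptive threshold is then just the mean of the M - 1 interframe difference indexes:

```python
def key_frame_threshold(d):
    """Mean of the interframe difference indexes; len(d) equals M - 1."""
    return float(sum(d)) / len(d)
```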
S270, determining a target key video frame in the video frame sequence to be processed through the target detection model, the interframe difference index and the key frame threshold.
In an optional embodiment of the present invention, determining a target key video frame in a to-be-processed video frame sequence by using a target detection model, an inter-frame difference index, and a key frame threshold may include: determining a primary screening key video frame of a video frame sequence to be processed according to a key frame threshold value and an interframe difference index; identifying a moving target in the primary screening key video frame according to the target detection model to obtain moving target associated data of the primary screening key video frame; and determining a target key video frame according to the moving target associated data and the intersection ratio threshold.
The preliminary screening key video frames may be the screening result of the to-be-processed video frame sequence according to the key frame threshold and the interframe difference index, and are used as input to the target detection model. The moving object associated data may be model output data obtained after inputting the preliminary screening key video frame into the object detection model. The moving object associated data may include an object class to which the detection frame identified by the object detection model belongs, an object position, and the like. The intersection ratio threshold may be a preset threshold for screening the preliminary screening key video frames. The intersection ratio (intersection over union) can be understood as the ratio of the area where two detection frames overlap to the area of their union.
In the embodiment of the invention, the interframe difference indexes can be grouped according to the order of the video frames to be processed in the video frame sequence to be processed, and the minimum value within each group is deleted group by group. The grouping and minimum-deletion steps are executed iteratively until all iteration results are greater than the key frame threshold, and the video frames to be processed matched with the iteration results are taken as the primary screening key video frames of the video frame sequence to be processed. The primary screening key video frames are then input into the target detection model, and the moving targets in them are identified through the target detection model to obtain the moving target associated data. Finally, the intersection ratio of the detection frames of the moving targets of two consecutive frames in the primary screening key video frames is calculated according to the moving target associated data, and the target key video frames are determined according to this intersection ratio and the intersection ratio threshold.
In an optional embodiment of the present invention, determining a preliminary screening key video frame of a to-be-processed video frame sequence according to a key frame threshold and an inter-frame difference index may include: grouping the inter-frame difference indexes according to the continuous relation of each continuous video frame to be processed to obtain each current inter-frame difference index grouping; carrying out minimum deletion processing on each current interframe difference index group to obtain each group processing result; and comparing each grouping processing result with the key frame threshold, if the data in at least one grouping processing result is smaller than the key frame threshold, updating each current interframe difference index grouping according to each grouping processing result, and returning to execute the operation of performing minimum deletion processing on each current interframe difference index grouping until each grouping processing result is larger than the key frame threshold.
The current inter-frame difference index grouping may be a grouping result of the inter-frame difference indexes according to a continuous relationship of the continuous video frames to be processed. The grouping processing result may be a result of deleting the minimum value in the current inter-frame difference index grouping.
In the embodiment of the present invention, two consecutive inter-frame difference indexes may be grouped into one group according to a consecutive relationship of each to-be-processed consecutive video frame in the to-be-processed video frame sequence to obtain each current inter-frame difference index group, and further, a minimum value deletion process may be performed on each current inter-frame difference index group, that is, a minimum value in each current inter-frame difference index group is deleted to obtain each group processing result, so that each group processing result is compared with a key frame threshold, and if there is at least one group processing result in which data is less than the key frame threshold, each current inter-frame difference index group is updated according to each group processing result, and an operation of performing the minimum value deletion process on each current inter-frame difference index group is returned until each group processing result is greater than the key frame threshold.
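One possible reading of this grouping-and-deletion loop is sketched below: adjacent indexes are paired, each pair's minimum is dropped (i.e. its maximum kept), and the process repeats until every surviving index exceeds the threshold. Indexes are carried as (k, d(f_k, f_{k+1})) pairs so the surviving frame numbers can be recovered:

```python
def preliminary_screen(d, threshold):
    """Return the frame numbers k + 1 of the surviving difference indexes."""
    pairs = list(enumerate(d, start=1))               # (k, d(f_k, f_{k+1}))
    while any(v <= threshold for _, v in pairs) and len(pairs) > 1:
        kept = [max(pairs[i], pairs[i + 1], key=lambda t: t[1])
                for i in range(0, len(pairs) - 1, 2)] # delete each pair's minimum
        if len(pairs) % 2:                            # odd leftover passes through
            kept.append(pairs[-1])
        pairs = kept
    return [k + 1 for k, v in pairs if v > threshold]
```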
In an optional embodiment of the present invention, determining the target key video frame according to the moving target associated data and the intersection ratio threshold may include: and when the moving target exists in the primary screening key video frame according to the moving target correlation data, removing redundant video frames in the primary screening key video frame according to an NMS algorithm and an intersection ratio threshold value to obtain a target key video frame.
The redundant video frames can be to-be-processed video frames that need to be removed from the primary screening key video frames, namely frames in the primary screening key video frames whose intersection ratio with another frame is greater than the intersection ratio threshold.
In the embodiment of the present invention, whether the target category in the detection frame of the primary screening key video frame is the category of the moving target may be determined according to the moving target associated data, if so, it is determined that the moving target exists in the primary screening key video frame, and then an intersection-comparison threshold of a Non Maximum Suppression (NMS) algorithm is set, so that according to the position of the detection frame where the moving target exists in the moving target associated data and the NMS algorithm, a to-be-processed video frame group (the to-be-processed video frame group includes two consecutive video frames in the primary screening key video frame) whose intersection ratio is smaller than the intersection-comparison threshold in the primary screening key video frame is determined, one to-be-processed video frame in the to-be-processed video frame group whose intersection ratio is smaller than the intersection-comparison threshold is used as a redundant video frame, and then the redundant video frame in the primary screening key video frame is removed, so as to obtain the target key video frame.
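The intersection-ratio test can be sketched as below; for simplicity it assumes a single moving-target detection frame per primary screening key video frame, which is our simplification rather than the patent's setting:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) detection frames."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def remove_redundant_frames(frames, boxes, iou_threshold=0.50):
    """Drop a frame when its box overlaps the last kept frame's box too much."""
    kept = [0]
    for i in range(1, len(frames)):
        if iou(boxes[kept[-1]], boxes[i]) <= iou_threshold:
            kept.append(i)                            # sufficiently different: keep
    return [frames[i] for i in kept]
```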
In a specific example, the key frame extraction step is as follows:
step 1, converting a monitoring video of a moving object into a video frame sequence to be processed.
Step 2, converting each to-be-processed video frame f_i (i = 1, 2, …, M) in the to-be-processed video frame sequence from the RGB (red, green, blue) color space into the HSV (hue, saturation, value) color space, which is better suited to human visual perception. Here, M is the total frame number of the to-be-processed video frame sequence. The range of each HSV component is normalized and a histogram is obtained; the normalized ranges are as follows:
[Equation image not reproduced in the text record: the normalized value ranges of the H, S and V components.]
wherein z > y > x > 0; illustratively, z = 22, y = 17, x = 12. In addition, the normalization in this scheme is calculated as follows:
[Equation image not reproduced in the text record: the normalization formula for the HSV components.]
h represents hue, S represents saturation, and V represents lightness.
The formula for calculating the color entropy is as follows:
E(f_k) = -[λ1 · Σ_i h(i) · log h(i) + λ2 · Σ_i s(i) · log s(i) + λ3 · Σ_i v(i) · log v(i)]

wherein λ1 + λ2 + λ3 = 1. Illustratively, λ1 = 0.5, λ2 = 0.3, λ3 = 0.2. h(i), s(i) and v(i) represent the statistics of the normalized HSV histogram, and log is the logarithm with base 2.
The current to-be-processed video frame f_k is compared with the next to-be-processed video frame f_{k+1}, and the motion vector from f_k to f_{k+1} is extracted. For each 16 x 16 (pixels) search block C_j (j = 1, 2, …, N), the diamond search method is used to obtain the X-axis direction component X_j and the Y-axis direction component Y_j, and the sum of the 1-norms over the coordinate axes is calculated. The motion vector is calculated as:

V(f_k, f_{k+1}) = Σ_{j=1}^{N} (|X_j| + |Y_j|)
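In code, this formula sums the 1-norms of the per-block displacements; the sketch below reuses the illustrative diamond_search function given earlier, which is itself an assumption:

```python
def motion_vector_magnitude(prev, curr, size=16):
    """V(f_k, f_{k+1}): sum of |X_j| + |Y_j| over all 16 x 16 blocks C_j."""
    total = 0
    h, w = curr.shape[:2]
    for y in range(0, h - size + 1, size):            # tile the frame into blocks
        for x in range(0, w - size + 1, size):
            dy, dx = diamond_search(prev, curr, y, x, size)
            total += abs(dx) + abs(dy)                # |X_j| + |Y_j|
    return total
```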
calculating f by mixing the Gaussian background model and the background subtraction method k Area of the moving object. Will f is k And performing background subtraction on the mixed Gaussian background model, marking the pixel point where the moving target is positioned after the background subtraction as 255 white, and marking the background as 0 black. Area of moving object A (f) k ) And counting the number of white pixel points in the image after the background subtraction.
The color entropy and the moving target area are features of each individual to-be-processed video frame, while the motion vector is a difference feature between two consecutive to-be-processed video frames; to keep the dimensions uniform, the HSV color entropy and the moving target area are therefore differenced between consecutive frames:

E(f_k, f_{k+1}) = |E(f_k) - E(f_{k+1})|

A(f_k, f_{k+1}) = |A(f_k) - A(f_{k+1})|

The feature values are normalized and a feature vector is constructed for each pair of consecutive to-be-processed video frames, i.e. the continuous interframe feature vector, in which each dimension is divided by its sum over all consecutive frame pairs:

x(f_k, f_{k+1}) = [ E(f_k, f_{k+1}) / Σ_{m=1}^{M-1} E(f_m, f_{m+1}), A(f_k, f_{k+1}) / Σ_{m=1}^{M-1} A(f_m, f_{m+1}), V(f_k, f_{k+1}) / Σ_{m=1}^{M-1} V(f_m, f_{m+1}) ]
step 3, after obtaining the feature vectors between the continuous frames, further determining the number of the video shots, wherein the specific process is as follows: comparing the similarity of HSV color entropy of each video frame to be processed and the clustering center frame, if the similarity is greater than the threshold value of 0.85, judging that the video frames are two similar frames, merging the video frames to be processed which are compared with the current clustering center frame into a cluster matched with the current clustering center frame, and taking the newly added video frame to be processed of the cluster as a new clustering center frame; and if the cluster center frame similar to the current video frame to be processed cannot be found, establishing a new cluster for the current video frame to be processed, and adding the current video frame to be processed into the new cluster. The HSV color entropy similarity calculation mode is as follows:
Similarity=s1×a1+s2×a2+s3×a3
the method comprises the following steps that S1 is the accumulated sum of the minimum values of a current video frame to be processed and a clustering center frame in the dimension H of an HSV histogram, S2 is the accumulated sum of the minimum values of the current video frame to be processed and the clustering center frame in the dimension S of the HSV histogram, and S3 is the accumulated sum of the minimum values of the current video frame to be processed and the clustering center frame in the dimension V of the HSV histogram. Illustratively, a1 may be set to 0.5, a2 may be set to 0.3, and a3 may be set to 0.2.
And obtaining the video shot number eta according to the similarity and the clustering rule, namely taking the clustering number as the video shot number eta.
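A sketch of this clustering step follows. Reading s1, s2, s3 as histogram intersections (accumulated per-bin minima) of normalised histograms is our interpretation of the similarity formula above:

```python
import numpy as np

def hsv_similarity(ha, hb, a=(0.5, 0.3, 0.2)):
    """Similarity = s1*a1 + s2*a2 + s3*a3 over the (H, S, V) histogram triples."""
    return sum(w * float(np.minimum(x, y).sum())      # per-channel minima sums
               for w, x, y in zip(a, ha, hb))

def count_shots(histograms, threshold=0.85):
    """Incremental clustering by HSV similarity; the cluster count is eta."""
    centres = []                                      # one centre frame per cluster
    for h in histograms:                              # h: normalised (H, S, V) histograms
        for i, c in enumerate(centres):
            if hsv_similarity(h, c) > threshold:      # similar: join this cluster
                centres[i] = h                        # newest member becomes centre
                break
        else:
            centres.append(h)                         # no similar centre: new cluster
    return len(centres)                               # number of video shots eta
```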
Step 4, dynamically weighting and summing each feature of the continuous interframe feature vectors with the corresponding coefficients according to the number eta of video shots to obtain the interframe difference index.
The interframe difference index is calculated as follows:
d(f_k, f_{k+1}) = ω1 · x1(f_k, f_{k+1}) + ω2 · x2(f_k, f_{k+1}) + ω3 · x3(f_k, f_{k+1})

where x1, x2 and x3 are the three components of the continuous interframe feature vector.
wherein the target weight coefficients ω1, ω2 and ω3 are calculated as follows:

[Equation image not reproduced in the text record: the mapping from the video shot number eta to the weight coefficients ω1, ω2, ω3.]
Step 5, arranging the interframe difference indexes in frame-number order as {d(f_1, f_2), d(f_2, f_3), …, d(f_{M-1}, f_M)}, and calculating the key frame threshold T based on the following formula:

T = (1 / (M - 1)) · Σ_{k=1}^{M-1} d(f_k, f_{k+1})
Every two adjacent interframe difference indexes are divided into a group, and the minimum value in each group is deleted; this dividing-and-deleting step is iterated until all remaining interframe difference values are greater than the key frame threshold. For each remaining interframe difference index d(f_k, f_{k+1}), the frame corresponding to subscript k + 1 is a key frame, and the primary screening key video frames are thus obtained.
Step 6, obtaining the category and the position of the target in the primary screening key video frames by using the target detection model.
Step 7, filtering the redundant detection frames through the NMS algorithm; if the number of remaining detection frames is more than half of the number of detection frames before filtering, it is judged that a redundant video frame exists between the two video frames to be processed. Specifically: the moving target associated data in the primary screening key video frames are compared, and if no moving target appears, the process ends; if a moving target appears, the positions and categories of the detection frames of the moving target in the two video frames to be processed are filtered through the NMS algorithm, with the intersection ratio threshold of the NMS algorithm set to 0.50. It is then judged whether the intersection ratio of the two detection frames is greater than the intersection ratio threshold: if it is greater than 0.50, one of the two frames is deleted; if it is smaller than 0.50, both frames are retained. The target key video frames are thus obtained.
To sum up, the schematic diagram of the flow from step 1 to step 7 can be seen in fig. 3.
According to the technical scheme of the embodiment of the invention, a video frame sequence to be processed of a moving target is obtained, a first motion characteristic extraction is carried out on the video frame sequence to be processed on the basis of a Gaussian mixture background model, the motion target area of each video frame to be processed is obtained, a second motion characteristic extraction is carried out on continuous video frames to be processed in the video frame sequence to be processed through a diamond search method, the motion vector of each continuous video frame to be processed is obtained, the color characteristic extraction is further carried out on the video frame sequence to be processed, the color entropy of each video frame to be processed is obtained, the multi-dimensional characteristic extraction result is determined according to the motion target area of each video frame to be processed, the motion vector of each continuous video frame to be processed and the color entropy of each video frame to be processed, the inter-frame difference index is further determined according to the multi-dimensional characteristic extraction result and the number of video shots matched with the video frame sequence to be processed, the key frame threshold is further determined according to the inter-frame difference index and the total number of the video frames in the video frame sequence to be processed through a target detection model, the inter-frame difference index and the key frame threshold. According to the scheme, potential key features of a moving target in a video frame sequence to be processed are mined through motion feature and color feature extraction, and a key frame threshold value can be generated in a self-adaptive mode according to the difference degree between video frames and the total frame number of the video frame sequence to be processed, so that accurate positioning and capturing of the moving target in the video frame sequence to be processed are achieved, namely automatic flexible accurate extraction of the target key video frame is achieved, the problems that key frame extraction in the prior art is inflexible and inaccurate are solved, the key frame threshold value can be set in a self-adaptive mode, and the key frame can be screened out flexibly and accurately.
EXAMPLE III
Fig. 4 is a schematic structural diagram of an apparatus for extracting a key frame according to a third embodiment of the present invention. As shown in fig. 4, the apparatus includes: a feature extraction module 310, an inter-frame difference index determination module 320, a key frame threshold determination module 330, and a target key video frame determination module 340, wherein,
the feature extraction module 310 is configured to obtain a to-be-processed video frame sequence of a moving object, and perform motion feature and color feature extraction on the to-be-processed video frame sequence to obtain a multi-dimensional feature extraction result;
the interframe difference index determining module 320 is configured to determine an interframe difference index according to the multi-dimensional feature extraction result and the number of video shots matched with the to-be-processed video frame sequence;
a key frame threshold determining module 330, configured to determine a key frame threshold according to the interframe difference index and the total frame number of the video frame sequence to be processed;
and the target key video frame determination module 340 is configured to determine a target key video frame in the to-be-processed video frame sequence through the target detection model, the inter-frame difference index, and the key frame threshold.
According to the technical scheme of the embodiment of the invention, a multi-dimensional feature extraction result is obtained by obtaining a to-be-processed video frame sequence of a moving target and extracting motion features and color features of the to-be-processed video frame sequence, and then an interframe difference index is determined according to the multi-dimensional feature extraction result and the number of video shots matched with the to-be-processed video frame sequence, so that a key frame threshold value is determined according to the interframe difference index and the total frame number of the to-be-processed video frame sequence, and a target key video frame in the to-be-processed video frame sequence is further determined through a target detection model, the interframe difference index and the key frame threshold value. According to the scheme, potential key features of a moving target in a video frame sequence to be processed are mined through motion feature and color feature extraction, and a key frame threshold value can be generated in a self-adaptive mode according to the difference degree between video frames and the total frame number of the video frame sequence to be processed, so that accurate positioning and capturing of the moving target in the video frame sequence to be processed are achieved, namely automatic flexible accurate extraction of the target key video frame is achieved, the problems that key frame extraction in the prior art is inflexible and inaccurate are solved, the key frame threshold value can be set in a self-adaptive mode, and the key frame can be screened out flexibly and accurately.
Optionally, the feature extraction module 310 includes a first feature extraction unit, a second feature extraction unit, a third feature extraction unit, and a multi-feature combination unit, where the first feature extraction unit is configured to perform first motion feature extraction on the to-be-processed video frame sequence based on a mixed gaussian background model to obtain a motion target area of each to-be-processed video frame; the second feature extraction unit is used for performing second motion feature extraction on the continuous video frames to be processed in the video frame sequence to be processed by a diamond search method to obtain motion vectors of the continuous video frames to be processed; a third feature extraction unit, configured to perform color feature extraction on the sequence of video frames to be processed to obtain a color entropy of each of the video frames to be processed; and the multi-feature combination unit is used for determining the multi-dimensional feature extraction result according to the moving target area of each video frame to be processed, the motion vector of the continuous video frame to be processed and the color entropy of each video frame to be processed.
Optionally, the interframe difference index determining module 320 includes a feature vector determining unit, a video shot number determining unit, and an interframe difference index calculating unit, where the feature vector determining unit is configured to determine a current continuous interframe feature vector according to the multi-dimensional feature extraction result matched with the current continuous video frame to be processed; the video shot number determining unit is used for clustering the video frame sequence to be processed to obtain the number of video shots matched with the video frame sequence to be processed; and the interframe difference index calculating unit is used for calculating a target weight coefficient according to the number of the video shots and determining the interframe difference index of the current continuous video frame to be processed according to the current continuous interframe feature vector and the target weight coefficient.
Optionally, the key frame threshold determining module 330 includes a target sum value determining unit, a target number determining unit, and a key frame threshold calculating unit, where the target sum value determining unit is configured to sum the inter-frame difference indexes of all the continuous video frames to be processed to obtain a target sum value; a target number determining unit, configured to determine the target number of all the to-be-processed continuous video frames according to the total frame number of the to-be-processed video frame sequence; and the key frame threshold value calculating unit is used for determining the key frame threshold value according to the quotient of the target sum value and the target quantity.
Optionally, the key frame threshold determining module 330 includes a preliminary screening key video frame determining unit, a moving object associated data determining unit, and a target key video frame determining unit, where the preliminary screening key video frame determining unit is configured to determine a preliminary screening key video frame of the to-be-processed video frame sequence according to the key frame threshold and the inter-frame difference index; a moving target associated data determining unit, configured to identify a moving target in the primary screening key video frame according to the target detection model, to obtain moving target associated data of the primary screening key video frame; and the target key video frame determining unit is used for determining a target key video frame according to the moving target associated data and the intersection ratio threshold.
Optionally, the preliminary screening key video frame determining unit is specifically configured to group the inter-frame difference indexes according to a continuous relationship between the to-be-processed continuous video frames to obtain current inter-frame difference index groups; carrying out minimum deletion processing on each current interframe difference index group to obtain each group processing result; and comparing each grouping processing result with the key frame threshold, if data in at least one grouping processing result is smaller than the key frame threshold, updating each current interframe difference index grouping according to each grouping processing result, and returning to execute the operation of performing minimum value deletion processing on each current interframe difference index grouping until each grouping processing result is larger than the key frame threshold.
Optionally, the target key video frame determining unit is specifically configured to, when it is determined that a moving target exists in the primary screened key video frame according to the moving target associated data, remove a redundant video frame in the primary screened key video frame according to a non-maximum suppression NMS algorithm and the cross-over ratio threshold, and obtain the target key video frame.
The key frame extraction device provided by the embodiment of the invention can execute the key frame extraction method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example four
FIG. 5 illustrates a schematic diagram of an electronic device that may be used to implement embodiments of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 5, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as the key frame extraction method.
In some embodiments, the method of extracting key frames may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the key frame extraction method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the key frame extraction method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer program, when executed by the processor, causes the functions/acts specified in the flowcharts and/or block diagrams to be performed. A computer program may execute entirely on a machine; partly on a machine; as a stand-alone software package, partly on a machine and partly on a remote machine; or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS services.
It should be understood that the various forms of flow shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in a different order; no limitation is imposed herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for extracting a key frame, comprising:
acquiring a video frame sequence to be processed of a moving target, and extracting motion features and color features of the video frame sequence to be processed to obtain a multi-dimensional feature extraction result;
determining an interframe difference index according to the multi-dimensional feature extraction result and the number of video shots matched with the video frame sequence to be processed;
determining a key frame threshold according to the interframe difference index and the total frame number of the video frame sequence to be processed;
and determining a target key video frame in the video frame sequence to be processed through a target detection model, the interframe difference index and the key frame threshold.
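Read as a pipeline, the four steps of claim 1 feed one another linearly. The Python skeleton below is one hypothetical wiring of those steps; each stage is passed in as a callable because the patent elaborates the stages only in the dependent claims, and none of the names below come from the patent itself.

```python
def extract_key_frames(frames, extract_features, count_shots,
                       difference_indices, prescreen, detect_and_prune):
    """Hypothetical wiring of the four steps of claim 1; each stage is
    injected as a callable, since the patent defines the stages only
    in the dependent claims."""
    features = extract_features(frames)               # multi-dimensional features
    num_shots = count_shots(frames)                   # shot count via clustering
    diffs = difference_indices(features, num_shots)   # interframe difference indexes
    threshold = sum(diffs) / len(diffs)               # key frame threshold (claim 4)
    candidates = prescreen(frames, diffs, threshold)  # prescreened key frames
    return detect_and_prune(candidates)               # detector plus NMS/IoU pruning
```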
2. The method according to claim 1, wherein the extracting motion features and color features of the video frame sequence to be processed to obtain a multi-dimensional feature extraction result comprises:
performing, based on a Gaussian mixture background model, first motion feature extraction on the video frame sequence to be processed to obtain the moving target area of each video frame to be processed;
performing second motion feature extraction on the consecutive video frames to be processed in the video frame sequence to be processed by a diamond search method to obtain the motion vectors of the consecutive video frames to be processed;
extracting color features of the video frame sequence to be processed to obtain the color entropy of each video frame to be processed;
and determining the multi-dimensional feature extraction result according to the moving target area of each video frame to be processed, the motion vectors of the consecutive video frames to be processed, and the color entropy of each video frame to be processed.
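Two of the three features in claim 2 map onto common OpenCV operations; the sketch below rests on that assumption. The diamond-search motion vectors are omitted because OpenCV's Python API exposes no direct diamond-search call, and "color entropy" is read here as the Shannon entropy of the grey-level histogram, which is one plausible but unconfirmed interpretation of the claim.

```python
import cv2
import numpy as np

# Sketch of two of claim 2's features under stated assumptions:
# moving-target area from OpenCV's Gaussian-mixture background
# subtractor, and colour entropy as grey-level Shannon entropy.

bg_subtractor = cv2.createBackgroundSubtractorMOG2()

def moving_target_area(frame_bgr):
    """Count of foreground pixels flagged by the Gaussian mixture
    background model (shadow pixels, value 127, are excluded)."""
    mask = bg_subtractor.apply(frame_bgr)
    return int(np.count_nonzero(mask == 255))

def color_entropy(frame_bgr, bins=256):
    """Shannon entropy of the frame's grey-level histogram."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [bins], [0, 256]).ravel()
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```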
3. The method according to claim 2, wherein the determining an interframe difference index according to the multi-dimensional feature extraction result and the number of video shots matched with the video frame sequence to be processed comprises:
determining a current consecutive interframe feature vector according to the multi-dimensional feature extraction results matched with the current consecutive video frames to be processed;
clustering the video frame sequence to be processed to obtain the number of video shots matched with the video frame sequence to be processed;
and calculating a target weight coefficient according to the number of video shots, and determining the interframe difference index of the current consecutive video frames to be processed according to the current consecutive interframe feature vector and the target weight coefficient.
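The excerpt does not disclose the exact formulas for the target weight coefficient or the difference index, so any concrete code here is necessarily a guess. The sketch below assumes a Euclidean distance between consecutive feature vectors scaled by a shot-count-derived weight, purely for illustration.

```python
import numpy as np

def interframe_difference_index(feat_prev, feat_next, num_shots, total_frames):
    """Hypothetical difference index: Euclidean distance between
    consecutive feature vectors, scaled by an assumed weight of
    num_shots / total_frames. The patent's actual formulas are not
    given in this excerpt."""
    weight = num_shots / max(total_frames, 1)
    delta = np.asarray(feat_next, dtype=float) - np.asarray(feat_prev, dtype=float)
    return float(weight * np.linalg.norm(delta))
```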
4. The method according to claim 2, wherein the determining a key frame threshold according to the interframe difference index and the total frame number of the video frame sequence to be processed comprises:
summing the interframe difference indexes of all the consecutive video frames to be processed to obtain a target sum value;
determining the target number of all the consecutive video frames to be processed according to the total frame number of the video frame sequence to be processed;
and determining the key frame threshold according to the quotient of the target sum value and the target number.
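Claim 4 amounts to taking the mean difference index: an N-frame sequence has N − 1 consecutive pairs, so the threshold is the target sum divided by N − 1. A direct rendering, with illustrative names:

```python
def key_frame_threshold(diff_indices):
    """Mean interframe difference index: the target sum divided by the
    number of consecutive frame pairs (total frames minus one)."""
    return sum(diff_indices) / len(diff_indices)

# Example: a 6-frame sequence has 5 consecutive pairs; with indices
# [0.2, 0.9, 0.4, 1.3, 1.1] the threshold is 3.9 / 5, i.e. about 0.78.
print(key_frame_threshold([0.2, 0.9, 0.4, 1.3, 1.1]))  # ~0.78
```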
5. The method according to claim 2, wherein the determining a target key video frame in the video frame sequence to be processed through a target detection model, the interframe difference index, and the key frame threshold comprises:
determining a prescreened key video frame of the video frame sequence to be processed according to the key frame threshold and the interframe difference index;
identifying a moving target in the prescreened key video frame according to the target detection model to obtain moving target associated data of the prescreened key video frame;
and determining the target key video frame according to the moving target associated data and an intersection-over-union (IoU) threshold.
6. The method according to claim 5, wherein the determining a prescreened key video frame of the video frame sequence to be processed according to the key frame threshold and the interframe difference index comprises:
grouping the interframe difference indexes according to the continuity relationship among the consecutive video frames to be processed to obtain current interframe difference index groups;
performing minimum-value deletion processing on each current interframe difference index group to obtain a processing result for each group;
and comparing each group's processing result with the key frame threshold; if at least one group's processing result contains data smaller than the key frame threshold, updating each current interframe difference index group according to the processing results and returning to perform the minimum-value deletion processing, until every group's processing result is greater than the key frame threshold.
7. The method according to claim 5, wherein the determining the target key video frame according to the moving target associated data and the intersection-over-union (IoU) threshold comprises:
and when it is determined, according to the moving target associated data, that a moving target exists in the prescreened key video frame, removing redundant video frames from the prescreened key video frame according to a non-maximum suppression (NMS) algorithm and the IoU threshold to obtain the target key video frame.
8. An apparatus for extracting a key frame, comprising:
a feature extraction module, configured to acquire a video frame sequence to be processed of a moving target, and extract motion features and color features of the video frame sequence to be processed to obtain a multi-dimensional feature extraction result;
an interframe difference index determining module, configured to determine an interframe difference index according to the multi-dimensional feature extraction result and the number of video shots matched with the video frame sequence to be processed;
a key frame threshold determining module, configured to determine a key frame threshold according to the interframe difference index and the total frame number of the video frame sequence to be processed;
and a target key video frame determining module, configured to determine a target key video frame in the video frame sequence to be processed through a target detection model, the interframe difference index, and the key frame threshold.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to perform the key frame extraction method according to any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions, wherein the computer instructions, when executed, cause a processor to implement the key frame extraction method according to any one of claims 1-7.
CN202211127523.3A 2022-09-16 2022-09-16 Method, device, equipment and medium for extracting key frame Pending CN115471772A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211127523.3A CN115471772A (en) 2022-09-16 2022-09-16 Method, device, equipment and medium for extracting key frame

Publications (1)

Publication Number Publication Date
CN115471772A (en) 2022-12-13

Family

ID=84333236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211127523.3A Pending CN115471772A (en) 2022-09-16 2022-09-16 Method, device, equipment and medium for extracting key frame

Country Status (1)

Country Link
CN (1) CN115471772A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935272A (en) * 2023-07-12 2023-10-24 天翼爱音乐文化科技有限公司 Video content detection method and device, electronic equipment and storage medium
CN116935272B (en) * 2023-07-12 2024-05-28 天翼爱音乐文化科技有限公司 Video content detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination