CN110826491A - Video key frame detection method based on cascading manual features and depth features

Video key frame detection method based on cascading manual features and depth features

Info

Publication number
CN110826491A
Authority
CN
China
Prior art keywords
video
features
frames
input
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911079839.8A
Other languages
Chinese (zh)
Inventor
毋立芳
赵宽
简萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201911079839.8A priority Critical patent/CN110826491A/en
Publication of CN110826491A publication Critical patent/CN110826491A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/50 Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • G06V10/507 Summing image-intensity values; Histogram projection analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

A video key frame detection method that cascades manual (handcrafted) features and depth features is provided for the field of sports video content analysis. Approaches based purely on depth features are inefficient because their network structures are complex, and broadcast video also contains many other shot types, such as midfield breaks and gradual-transition shots, so that the detection result contains a large number of irrelevant frames. To address these problems, a video key frame detection method combining manual features and depth features is proposed. First, shot boundary detection is performed based on color histogram features to obtain the last frame of each shot. A similarity-clustering method based on histogram similarity is then used to obtain candidate key frames. Finally, the candidate key frames are classified by a deep neural network to obtain the real key frames. Comparison experiments on curling game videos and basketball game videos show that, compared with traditional methods such as background subtraction and optical flow, the method extracts key frames quickly and reliably.

Description

Video key frame detection method based on cascading manual features and depth features
Technical Field
The invention is mainly applied in the field of sports video content analysis, and particularly relates to digital image processing technologies such as image feature extraction, shot segmentation and neural network classification. The method can quickly and effectively detect the key frames in a sports video, enabling further analysis of the video content.
Background
Video key frame extraction automatically extracts the required frames from a video containing thousands of frames. Doing this manually consumes a great deal of time and effort, whereas a key frame detection algorithm can detect key frames quickly and accurately from inter-frame characteristics and the properties of the video itself. Common key frame detection methods include background subtraction, optical flow, and others. Background subtraction detects moving objects by modeling the background and comparing frames against it; it is fast, but strongly affected by the background, and drastic background changes make background modeling difficult and degrade moving-object detection. The optical flow method detects moving targets reliably: it treats each pixel as a motion vector and extracts the optical flow to track the motion of each point, from which the motion state is judged and key frames are detected. Color histogram features summarize the color composition of a frame, and a distance metric between the histograms of two adjacent frames measures their similarity; this is widely used in video key frame detection and shot segmentation. Quyang et al. proposed a method combining HOG and HSV features that achieves a high recall but a low precision, and the HOG feature is computationally heavy and time-consuming.
At present, deep learning is widely applied in the field of image processing, and on typical problems such as image classification it far exceeds traditional handcrafted features. Wu et al. proposed a method combining an FCN and a CNN to detect the key postures in weightlifting videos and achieved good results. However, because the method cascades two deep networks, its processing speed is very low and real-time processing is impossible. It is therefore reasonable to combine manual features with depth features. Yangfei et al. used SIFT features to detect key points and obtain a number of candidate local images, extracted a small number of regions from each image as candidates, and extracted depth features of the candidate regions with AlexNet, achieving good results on the railway videos used in that work; however, because SIFT features are used and every frame must be sent to the neural network for classification, the method cannot run in real time.
Disclosure of Invention
To solve these problems, the invention provides a video key frame detection method that cascades manual features and depth features. First, shot boundary detection is performed on the game video based on color histogram features, and the last frame of each shot is obtained. Candidate key frames are then derived based on histogram similarity. Finally, the candidate key frames are classified by a deep neural network to obtain the real key frames. A curling game video usually lasts two to three hours and contains hundreds of thousands of frames; shot segmentation with the simple and effective color histogram yields about 700 candidate key frames from roughly 200,000 video frames, and the screening method proposed herein filters out a large number of negative samples. The deep-neural-network classification model can then extract image features accurately, ensuring the accuracy of key frame extraction.
The method comprises the following specific steps:
1. Selection of video
A video of a complete game, such as a basketball game video, a curling game video, etc., is selected as an input to the present invention.
2. Shot segmentation
This step performs shot segmentation based on color histogram features: the color histogram feature of each frame of the video is extracted, the distance between the feature vectors of two adjacent frames is compared, a shot change is judged to occur when the distance exceeds a set threshold, and the last frame of each shot is stored.
The color histogram feature records the probability of each color value appearing in the three color channels R, G and B. Each color channel initially gives a 256-dimensional vector; to reduce computational complexity and increase detection speed, the invention quantizes the feature on each color channel into a 16-dimensional vector. For example, the blue channel originally yields a 256-dimensional color vector; every 16 adjacent color bins are summed into one dimension, so the blue channel finally becomes a 16-dimensional vector. The quantization formula is:
Nq(i) = ∑_{j=16i}^{16i+15} No(j)
In the formula, Nq(i) denotes the value of the i-th dimension of the feature after quantization, and No(j) denotes the value of the j-th dimension of the feature before quantization, where 0 ≤ i ≤ 15 and 0 ≤ j ≤ 255.
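As a concrete illustration, the quantization can be sketched in Python with OpenCV as below. This is a minimal sketch under the assumption that frames are BGR images read with cv2; the function name and the final normalization are illustrative choices rather than details specified in the patent.

```python
import cv2
import numpy as np

def quantized_color_histogram(frame, bins=16):
    """Return a 3*bins feature vector: one 16-bin histogram per color channel."""
    features = []
    for channel in range(3):                       # B, G, R channels of an OpenCV frame
        # 256-bin histogram of this channel ...
        hist = cv2.calcHist([frame], [channel], None, [256], [0, 256]).ravel()
        # ... summed over groups of 16 adjacent bins: Nq(i) = No(16i) + ... + No(16i+15)
        features.append(hist.reshape(bins, -1).sum(axis=1))
    feature = np.concatenate(features)
    return feature / feature.sum()                 # normalize so the bins act like probabilities
```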
Common ways to measure the distance between vectors include the Euclidean distance and the cosine distance. The Euclidean distance is sensitive to changes between adjacent frames and has lower computational complexity than the cosine distance: with the thresholds tuned so that shot segmentation detects a similar number of frames, the Euclidean distance takes only 75% of the time of the cosine distance. The invention therefore adopts the Euclidean distance as the inter-frame distance metric. The formula is as follows, where the vectors x and y are the feature vectors of the two frames.
d = ( ∑i (x[i] − y[i])² )^0.5
Gradual-transition frames. Videos contain many gradual transitions such as dissolves, fade-ins/fade-outs and wipes: several consecutive frames still belong to the current shot while the picture of the next shot gradually appears, and the color-feature distance between adjacent frames rises accordingly during the transition. Because the Euclidean distance is sensitive to inter-frame changes, successive gradual-transition frames would all be retained, although only the first of them is actually required. By comparing the lengths of the shots, it is found that 96.2% are longer than 100 frames, 0.05% are shorter than 50 frames, and 3.75% are between 50 and 100 frames. Therefore, once the last frame of a shot is found, the following 50 frames are skipped directly, which both avoids gradual-transition frames and increases speed.
In this step a threshold is set to divide the video into shots. Experiments show that with the Euclidean distance as the distance metric a threshold of 0.2 works well: when the Euclidean distance between the feature vectors of two adjacent frames exceeds 0.2, a shot change is declared. A sketch of this pass is given below.
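The whole shot-segmentation pass (Euclidean distance between adjacent-frame features, cut threshold 0.2, 50-frame skip after a detected cut) can be sketched as follows, reusing the quantized_color_histogram helper from the previous sketch; variable names and the exact way the 50-frame skip is applied are illustrative assumptions.

```python
import cv2
import numpy as np

def detect_shot_last_frames(video_path, threshold=0.2, skip_after_cut=50):
    """Return the last frame of each detected shot (illustrative sketch)."""
    cap = cv2.VideoCapture(video_path)
    last_frames, prev_feature, prev_frame, skip = [], None, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if skip > 0:                                        # jump over possible gradual-transition frames
            skip -= 1
            continue
        feature = quantized_color_histogram(frame)
        if prev_feature is not None:
            dist = np.linalg.norm(feature - prev_feature)   # Euclidean distance between adjacent frames
            if dist > threshold:                            # shot boundary detected
                last_frames.append(prev_frame)              # keep the last frame of the previous shot
                skip = skip_after_cut
                prev_feature, prev_frame = None, None
                continue
        prev_feature, prev_frame = feature, frame
    cap.release()
    return last_frames
```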
3. Candidate key frame screening
The last frames of the shots stored by shot segmentation still contain a large number of negative samples, and for shots such as advertisements abrupt changes can occur at short intervals, which greatly increases the subsequent processing load. In a curling game, most key frames are very similar to one another; to filter out part of the negative samples, a similarity-clustering method is proposed: using the color histogram features already extracted in the shot segmentation step, the distance threshold is reduced and k pictures satisfying this initial threshold are randomly selected from the shot-segmentation results of step 2; then, in order not to screen out true key frames by mistake, the threshold is increased, the feature distances between the remaining pictures and the k selected pictures are traversed, and every picture that satisfies the relaxed threshold is kept. In this way part of the negative samples are filtered out. Among the video frames extracted by shot segmentation, only the key frames have concentrated features, while the features of the other frames are scattered over the whole feature space, so conventional clustering has difficulty selecting the key frames. With the proposed method, a k value of 4 works well for the large number of frames produced by shot segmentation. This step reuses the features extracted earlier, so no repeated extraction is needed and time is saved. The candidate key frame algorithm is shown in the following table.
(The candidate key frame screening algorithm is given as a table image in the original publication and is not reproduced here; an illustrative sketch follows.)
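For illustration, the screening could be implemented along the following lines. This sketch assumes that the k seed pictures must be mutually within the initial threshold and that a picture is kept when it lies within the second (relaxed) threshold of any seed, with k = 4 and thresholds 0.1/0.2 as reported later in the text; function names and the fallback behaviour are assumptions.

```python
import itertools
import random
import numpy as np

def screen_candidates(features, k=4, init_thresh=0.1, second_thresh=0.2, max_tries=1000):
    """features: histogram feature vectors of the frames kept by shot segmentation."""
    n = len(features)
    dist = lambda a, b: np.linalg.norm(features[a] - features[b])
    # 1. Randomly search for k mutually similar seed pictures (all pairwise distances < init_thresh).
    seeds = None
    for _ in range(max_tries):
        sample = random.sample(range(n), k)
        if all(dist(a, b) < init_thresh for a, b in itertools.combinations(sample, 2)):
            seeds = sample
            break
    if seeds is None:
        return list(range(n))          # fallback: keep everything if no seed set is found
    # 2. Keep every picture within the relaxed threshold of at least one seed.
    return [i for i in range(n)
            if i in seeds or any(dist(i, s) < second_thresh for s in seeds)]
```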
4. Neural network based picture classification
In this step a neural network is used to extract the key frames, since the picture data obtained by shot segmentation still contain a large number of irrelevant pictures besides the key frames, i.e. negative samples.
The candidate key frame screening of step 3 filters out a large number of negative samples while keeping the real video key frames, but some negative samples remain among the retained video frames and cannot be filtered out with manual features. Therefore, in this step the video frames are classified with a deep neural network, which offers higher classification accuracy.
The invention designs a neural network comprising 4 convolutional layers, 4 pooling layers and 3 fully connected layers. To prevent overfitting, a regularization loss is added and dropout layers are used after the first two fully connected layers: network nodes are dropped randomly during training and all nodes are kept during testing. Trained on different videos, the network effectively separates video key frames from other negative samples at test time. With the preprocessing of steps 1-3, the number of pictures fed into the network is greatly reduced, which shortens the network processing time (especially when running on a CPU) without harming accuracy. In earlier inventions and papers, either features are extracted (manually or otherwise) from every frame of the video before being sent to a neural network for classification, or every frame is fed directly into the network, which is particularly inefficient without a GPU. The invention sends only about 1000 pictures from a whole video of 200,000 to 300,000 frames into the neural network, greatly reducing the neural-network processing time and improving the final accuracy.
The invention has the following advantages:
By cascading manual features and depth features for key frame extraction, the method fully exploits the speed of manual feature extraction and the classification accuracy of depth features: video processing reaches 120 FPS with 97% accuracy, giving good results in both speed and accuracy. The new clustering-like candidate frame screening algorithm of step 3 filters out a large number of negative samples and reuses the previously extracted features, so its extra feature-extraction time is essentially negligible. Overall, the method detects video key frames well and outperforms other algorithms in both detection quality and real-time performance.
Drawings
FIG. 1 is a flowchart of a video key frame detection algorithm
FIG. 2 is a diagram of the color histogram feature quantization
Detailed Description
The invention provides a method for detecting video key frames by cascading manual features and depth features. The method comprises the following concrete implementation steps:
1. Data set construction
The training set used herein consists of 10 games from major curling events of the last 5 years (world championships and other major competitions). The main roles of the training set are to tune and find suitable threshold parameters for the shot segmentation part, and to provide the positive and negative samples used to train the neural network that filters the key frames.
Each picture in the training set has a size of 1280x720. Key frames serve as positive samples, and a number of non-key frames randomly selected from the videos serve as negative samples; the training set contains 3776 positive samples and 3000 negative samples.
The quantity and richness of the training set are an important factor in how well the network model trains, but the amount of training data here is insufficient. Data augmentation is adopted to address this issue: it increases the amount of training data, improves the generalization ability of the model and effectively reduces the error rate. The augmentation strategies adopted are flipping, contrast adjustment, hue adjustment and saturation adjustment, as sketched below.
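A minimal sketch of such an augmentation pipeline, using torchvision, is given below; the jitter ranges and the flip probability are illustrative assumptions rather than values taken from the patent.

```python
from torchvision import transforms

# Flip, contrast, hue and saturation adjustment, followed by resizing to the
# 100x100 network input size used later in the text.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(contrast=0.2, saturation=0.2, hue=0.05),
    transforms.Resize((100, 100)),
    transforms.ToTensor(),
])
```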
2. Shot segmentation
A key frame of the video is defined as the last frame of a particular shot, so the last frame of each shot is first obtained by comparing color histogram features under a distance metric.
First, the different shots in the video sequence are segmented and the last frame of each shot is extracted. Shot segmentation mainly uses the color histogram feature: the Euclidean distance between the color histogram features of two adjacent frames is compared with a threshold of 0.2. When the distance between the two frames is less than 0.2, they are judged to belong to the same shot; when it is greater than 0.2, they are judged to belong to different shots, and the last frame of the previous shot is stored as the input of the subsequent steps.
The color histogram records the probabilities of the color values in the three channels R, G and B. Each color channel initially gives a 256-dimensional vector; to reduce computational complexity and increase detection speed, the invention quantizes the feature on each channel into a 16-dimensional vector, summing every 16 adjacent color bins into one dimension (for example, the blue channel's original 256-dimensional vector becomes a 16-dimensional vector). Distances between vectors can be measured with the Euclidean distance, the cosine distance and so on. The Euclidean distance is sensitive to changes between adjacent frames and has lower computational complexity than the cosine distance; with the thresholds tuned so that shot segmentation detects a similar number of frames, it needs only 75% of the time of the cosine distance, so the invention adopts the Euclidean distance as the inter-frame distance metric. After the distance between each pair of adjacent frames is obtained, it is compared with the set threshold to decide whether the two frames belong to the same shot and whether the last frame of the previous shot needs to be stored.
3. Candidate key frame screening
The main operation of candidate key frame screening is as follows: using the color histogram features already extracted in the shot segmentation step, reduce the distance threshold and randomly select k pictures satisfying this initial threshold from the results of step 2; then, in order not to screen out true key frames by mistake, increase the threshold, traverse the feature distances between the remaining pictures and the k selected pictures, and keep every picture that satisfies the relaxed threshold. In this way part of the negative samples are filtered out.
In the experiments, if k is chosen too small it is easy for the k selected pictures to all be negative samples; lowering the initial threshold reduces this risk, but experimentally the effect is still not good. If k is too large, the computation time grows; experiments show that a large k improves the final result somewhat but greatly increases the time consumed. We therefore choose values that balance precision and speed, and finally set k to 4, the initial threshold to 0.1 and the second threshold to 0.2. The features used in this step are those already computed for shot segmentation, so they do not need to be extracted again and the time required by this step is greatly reduced.
4. Neural network based picture classification
In this step a neural network classifies the pictures, since the picture data obtained by shot segmentation still contain a large number of irrelevant pictures besides the key frames, i.e. negative samples. The invention designs a neural network comprising 4 convolutional layers, 4 pooling layers and 3 fully connected layers; to prevent overfitting, a regularization loss is added and dropout layers are used after the first two fully connected layers, dropping network nodes randomly during training and keeping all nodes during testing.
Network input. The pictures in the data set are 1280x720; they are uniformly resized to 100x100 as the input of the whole network, i.e. the width and height are reduced by factors of 12.8 and 7.2 respectively. If the input size differs too much from the original image, some details are lost and the accuracy drops. Compared with a 100x100 input, a 256x256 input increases training time by 210% while improving accuracy by only 0.6 percentage points, and it easily causes memory overflow. We therefore finally use 100x100 pictures as the input of the whole network.
Convolutional and pooling layers. The last pooling layer outputs a feature map of size 6x6x128. The first two convolutional layers use 5x5 convolution kernels, which effectively enlarge the receptive field, although an overly large kernel increases the amount of computation and slows down training. Studies have shown that a series of smaller kernels can replace one larger kernel; for example, two stacked 3x3 convolutional layers act similarly to one 5x5 convolutional layer while using fewer parameters (3x3x2 / 5x5 = 72%). The last two convolutional layers therefore use 3x3 kernels, with the number of channels increased to 128. In this network the feature map keeps its spatial size through the convolutional layers, and downsizing is performed only by the pooling layers.
Fully connected layers. The network contains 3 fully connected layers of sizes 1024, 512 and 2; the final output is the pair of probability values for key frame versus non key frame. The parameter quantity of the fully connected layers is (6x6x128)x1024x512x2 ≈ 4.83x10^9, while that of the convolutional layers is 4704, so the fully connected layers account for far more parameters than the convolutional layers.
Convolutional layer 1: convolution kernel 5x5, 32 channels; input 100x100x3, output 100x100x32
Pooling layer 1: input 100x100x32, output 50x50x32
Convolutional layer 2: convolution kernel 5x5, 64 channels; input 50x50x32, output 50x50x64
Pooling layer 2: input 50x50x64, output 25x25x64
Convolutional layer 3: convolution kernel 3x3, 128 channels; input 25x25x64, output 25x25x128
Pooling layer 3: input 25x25x128, output 12x12x128
Convolutional layer 4: convolution kernel 3x3, 128 channels; input 12x12x128, output 12x12x128
Pooling layer 4: input 12x12x128, output 6x6x128
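For illustration, the listed architecture can be sketched in PyTorch as follows; the layer sizes follow the listing above, while the activations, 'same' padding and dropout rate are assumptions not stated in the patent.

```python
import torch.nn as nn

class KeyFrameClassifier(nn.Module):
    """4 conv + 4 pooling + 3 fully connected layers, 100x100x3 input, 2 outputs."""
    def __init__(self, dropout=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(),    # 100x100x32
            nn.MaxPool2d(2),                                          # 50x50x32
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),   # 50x50x64
            nn.MaxPool2d(2),                                          # 25x25x64
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),  # 25x25x128
            nn.MaxPool2d(2),                                          # 12x12x128
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(), # 12x12x128
            nn.MaxPool2d(2),                                          # 6x6x128
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(6 * 6 * 128, 1024), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 2),                 # key frame vs. non key frame
        )

    def forward(self, x):                      # x: (batch, 3, 100, 100)
        return self.classifier(self.features(x))
```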
The quantity and richness of the training set are an important factor in how well the network model trains, but the amount of training data here is insufficient. The invention addresses this problem with data augmentation, which increases the amount of training data, improves the generalization ability of the model and effectively reduces the error rate. The augmentation strategies adopted are flipping, contrast adjustment, hue adjustment and saturation adjustment.
For a complex data set, AlexNet, VGG and ResNet were each tested; the classification effect improves as the network depth increases, so VGG or ResNet can be chosen as the classification network when the data set is complex. When testing on a CPU only, the lighter network designed in step 4 of the invention is recommended, which avoids excessively slow runtimes or memory overflow.
The experimental results are as follows:
Candidate key frames are extracted based on histogram similarity, directly reusing the color histogram features computed during shot segmentation. Table 1 compares the time and results of candidate frame extraction: for the three videos, the candidate key frames amount to 60.9%, 59.9% and 78.3% of the frames produced by shot segmentation, while the time consumed is only 1.7%, 0.87% and 0.86% of the total time. This stage therefore filters out a large number of negative samples at very small cost and relieves the subsequent classification task. In the experiments we observe that this stage relies on the key frames of the whole video being very similar to one another while the other frames differ greatly; if k is chosen too small or the initial threshold is set too large, the wrong k pictures are easily selected and true key frames are filtered out. After a number of experiments we chose k = 4 and an initial threshold of 0.1.
Table 1 candidate key frame extraction
(Note: N/N is the ratio of the number of candidate key frames to the total number of frames extracted by shot segmentation; T/T is the ratio of the candidate-extraction time to the total time.)
As shown in Table 2, after shot segmentation, candidate key frame detection and neural network classification, the average accuracy on the test-set videos reaches 97.2%, with an average detection speed of 122 frames per second.
TABLE 2 test set curling video key frame test results
The invention compares the performance of the proposed method with traditional methods and with a key frame detection method using HOG and HSV, as shown in Table 3; the test uses 209278 video frames. The background subtraction method is faster than the proposed method, but the proposed method surpasses both background subtraction and optical flow in accuracy, reaching 97%. The HOG + HSV method achieves recall and precision both above 90%, but the HOG feature requires image gradients, the computation is heavy, and its FPS is only 10, far below the proposed method. The block and global color histogram method reaches a recall of 97% but a low precision because it uses only color features; the proposed method additionally exploits deep image features through the neural network, so its precision is high.
Table 3 comparison of the method herein with the conventional method

Claims (4)

1. A video key frame detection method for cascading manual features and depth features is characterized by comprising the following steps:
(1) shot segmentation based on color histogram features: the color histogram feature of each frame of the video is extracted and the distance between the feature vectors of two adjacent frames is compared; when the distance is greater than a set threshold, a shot change is judged to occur, and the last frame of each shot is stored;
(2) candidate key frame screening: after shot segmentation using the color histogram features, the candidate key frame screening of this step removes a large number of negative samples and reduces the input of the subsequent step;
(3) the screened pictures are further finely classified by a neural network to obtain the final video key frame result.
2. The method according to claim 1, wherein in step (1), the specific method of shot segmentation based on color histogram features is as follows:
extracting, as the color histogram feature, the probabilities of the color values appearing in the three color channels R, G and B; each color channel initially gives a 256-dimensional vector, and the feature of each color channel is quantized into a 16-dimensional vector;
the Euclidean distance is adopted as the inter-frame distance metric; when the distance is greater than the set threshold of 0.2, a shot change is judged to occur, and the last frame of each shot is stored.
3. The method according to claim 1, wherein in step (2), using the color histogram features extracted in the shot segmentation step, the distance threshold is reduced to 0.1 and 4 pictures satisfying this initial threshold are randomly selected from the segmentation results; the threshold is then increased to 0.2, the feature distances between the remaining pictures and the 4 selected pictures are traversed, and a picture is kept when it satisfies the threshold; in this way part of the negative samples are filtered out.
4. The method according to claim 1, wherein in step (3), a neural network comprising 4 convolutional layers, 4 pooling layers and 3 fully connected layers is designed; to prevent overfitting, a regularization loss is added, dropout layers are used after the first two fully connected layers, network nodes are dropped randomly during training, and all nodes are kept during testing;
the convolutional layer 1: convolution kernel 5x 32, input 100x 3, output 100x 32
A pooling layer 1: input 100x 32, output 50 x 32
And (3) convolutional layer 2: convolution kernel 5x 64, input 50 x 32, output 50 x 64
And (3) a pooling layer 2: input 50 x 64, output 25 x 64
And (3) convolutional layer: convolution kernel 3x 128, input 25 x 64, output 25 x128
A pooling layer 3: input 25 x128, output 12x 128
And (3) convolutional layer: convolution kernel 3x 128, input 12x 128, output 12x 128
A pooling layer 3: input 12x 128, output 6x128
and finally, the key frames of the video are obtained through classification by the neural network.
CN201911079839.8A 2019-11-07 2019-11-07 Video key frame detection method based on cascading manual features and depth features Pending CN110826491A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911079839.8A CN110826491A (en) 2019-11-07 2019-11-07 Video key frame detection method based on cascading manual features and depth features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911079839.8A CN110826491A (en) 2019-11-07 2019-11-07 Video key frame detection method based on cascading manual features and depth features

Publications (1)

Publication Number Publication Date
CN110826491A true CN110826491A (en) 2020-02-21

Family

ID=69553066

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911079839.8A Pending CN110826491A (en) 2019-11-07 2019-11-07 Video key frame detection method based on cascading manual features and depth features

Country Status (1)

Country Link
CN (1) CN110826491A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709301A (en) * 2020-05-21 2020-09-25 哈尔滨工业大学 Method for estimating motion state of curling ball
CN111723713A (en) * 2020-06-09 2020-09-29 上海合合信息科技股份有限公司 Video key frame extraction method and system based on optical flow method
CN111738117A (en) * 2020-06-12 2020-10-02 鞍钢集团矿业有限公司 Method for detecting video key frame of electric bucket tooth based on deep learning
CN113033308A (en) * 2021-02-24 2021-06-25 北京工业大学 Team sports video game lens extraction method based on color features
CN113095295A (en) * 2021-05-08 2021-07-09 广东工业大学 Fall detection method based on improved key frame extraction
CN113221674A (en) * 2021-04-25 2021-08-06 广东电网有限责任公司东莞供电局 Video stream key frame extraction system and method based on rough set reduction and SIFT
CN113239869A (en) * 2021-05-31 2021-08-10 西安电子科技大学 Two-stage behavior identification method and system based on key frame sequence and behavior information
CN113627342A (en) * 2021-08-11 2021-11-09 人民中科(济南)智能技术有限公司 Method, system, device and storage medium for video depth feature extraction optimization

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070214418A1 (en) * 2006-03-10 2007-09-13 National Cheng Kung University Video summarization system and the method thereof
CN106446015A (en) * 2016-08-29 2017-02-22 北京工业大学 Video content access prediction and recommendation method based on user behavior preference
CN106709511A (en) * 2016-12-08 2017-05-24 华中师范大学 Urban rail transit panoramic monitoring video fault detection method based on depth learning
CN107590442A (en) * 2017-08-22 2018-01-16 华中科技大学 A kind of video semanteme Scene Segmentation based on convolutional neural networks
CN109831664A (en) * 2019-01-15 2019-05-31 天津大学 Fast Compression three-dimensional video quality evaluation method based on deep learning
CN110210379A (en) * 2019-05-30 2019-09-06 北京工业大学 A kind of lens boundary detection method of combination critical movements feature and color characteristic

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070214418A1 (en) * 2006-03-10 2007-09-13 National Cheng Kung University Video summarization system and the method thereof
CN106446015A (en) * 2016-08-29 2017-02-22 北京工业大学 Video content access prediction and recommendation method based on user behavior preference
CN106709511A (en) * 2016-12-08 2017-05-24 华中师范大学 Urban rail transit panoramic monitoring video fault detection method based on depth learning
CN107590442A (en) * 2017-08-22 2018-01-16 华中科技大学 A kind of video semanteme Scene Segmentation based on convolutional neural networks
CN109831664A (en) * 2019-01-15 2019-05-31 天津大学 Fast Compression three-dimensional video quality evaluation method based on deep learning
CN110210379A (en) * 2019-05-30 2019-09-06 北京工业大学 A kind of lens boundary detection method of combination critical movements feature and color characteristic

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MINGHUI ZHAO等: "Key Frame Extraction of Assembly Process Based on Deep Learning", 《2018 IEEE 8TH ANNUAL INTERNATIONAL CONFERENCE ON CYBER TECHNOLOGY IN AUTOMATION, CONTROL, AND INTELLIGENT SYSTEMS (CYBER)》 *
YIZE CUI等: "Scene Detection of News Video Using CNN Features", 《2017 10TH INTERNATIONAL CONGRESS ON IMAGE AND SIGNAL PROCESSING, BIOMEDICAL ENGINEERING AND INFORMATICS (CISP-BMEI)》 *
王洋: "Research on sensitive video content recognition based on convolutional neural networks" (基于卷积神经网络的视频敏感内容识别研究), 《广播电视网络》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709301B (en) * 2020-05-21 2023-04-28 哈尔滨工业大学 Curling ball motion state estimation method
CN111709301A (en) * 2020-05-21 2020-09-25 哈尔滨工业大学 Method for estimating motion state of curling ball
CN111723713B (en) * 2020-06-09 2022-10-28 上海合合信息科技股份有限公司 Video key frame extraction method and system based on optical flow method
CN111723713A (en) * 2020-06-09 2020-09-29 上海合合信息科技股份有限公司 Video key frame extraction method and system based on optical flow method
CN111738117A (en) * 2020-06-12 2020-10-02 鞍钢集团矿业有限公司 Method for detecting video key frame of electric bucket tooth based on deep learning
CN111738117B (en) * 2020-06-12 2023-12-19 鞍钢集团矿业有限公司 Deep learning-based detection method for electric bucket tooth video key frame
CN113033308A (en) * 2021-02-24 2021-06-25 北京工业大学 Team sports video game lens extraction method based on color features
CN113221674B (en) * 2021-04-25 2023-01-24 广东电网有限责任公司东莞供电局 Video stream key frame extraction system and method based on rough set reduction and SIFT
CN113221674A (en) * 2021-04-25 2021-08-06 广东电网有限责任公司东莞供电局 Video stream key frame extraction system and method based on rough set reduction and SIFT
CN113095295B (en) * 2021-05-08 2023-08-18 广东工业大学 Fall detection method based on improved key frame extraction
CN113095295A (en) * 2021-05-08 2021-07-09 广东工业大学 Fall detection method based on improved key frame extraction
CN113239869A (en) * 2021-05-31 2021-08-10 西安电子科技大学 Two-stage behavior identification method and system based on key frame sequence and behavior information
CN113239869B (en) * 2021-05-31 2023-08-11 西安电子科技大学 Two-stage behavior recognition method and system based on key frame sequence and behavior information
CN113627342A (en) * 2021-08-11 2021-11-09 人民中科(济南)智能技术有限公司 Method, system, device and storage medium for video depth feature extraction optimization
CN113627342B (en) * 2021-08-11 2024-04-12 人民中科(济南)智能技术有限公司 Method, system, equipment and storage medium for video depth feature extraction optimization

Similar Documents

Publication Publication Date Title
CN110826491A (en) Video key frame detection method based on cascading manual features and depth features
Tian et al. Padnet: Pan-density crowd counting
KR102449841B1 (en) Method and apparatus for detecting target
Deng et al. Image aesthetic assessment: An experimental survey
US7949188B2 (en) Image processing apparatus, image processing method, and program
CN106778854B (en) Behavior identification method based on trajectory and convolutional neural network feature extraction
CN106682108B (en) Video retrieval method based on multi-mode convolutional neural network
Tran et al. Two-stream flow-guided convolutional attention networks for action recognition
US8326042B2 (en) Video shot change detection based on color features, object features, and reliable motion information
CN106446015A (en) Video content access prediction and recommendation method based on user behavior preference
CN110569773B (en) Double-flow network behavior identification method based on space-time significance behavior attention
CN108921130A (en) Video key frame extracting method based on salient region
WO2019007020A1 (en) Method and device for generating video summary
Javed et al. A decision tree framework for shot classification of field sports videos
CN111340019A (en) Grain bin pest detection method based on Faster R-CNN
CN115393788B (en) Multi-scale monitoring pedestrian re-identification method based on global information attention enhancement
Ulges et al. Content-based video tagging for online video portals
CN107341456B (en) Weather sunny and cloudy classification method based on single outdoor color image
Liu et al. Effective feature extraction for play detection in american football video
CN115661618A (en) Training method of image quality evaluation model, image quality evaluation method and device
Nandyal et al. Bird swarm optimization-based stacked autoencoder deep learning for umpire detection and classification
CN112800968A (en) Method for identifying identity of pig in drinking area based on feature histogram fusion of HOG blocks
Çakar et al. Thumbnail Selection with Convolutional Neural Network Based on Emotion Detection
Niu et al. Semantic video shot segmentation based on color ratio feature and SVM
Premaratne et al. A Novel Hybrid Adaptive Filter to Improve Video Keyframe Clustering to Support Event Resolution in Cricket Videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination