CN110097026B - Paragraph association rule evaluation method based on multi-dimensional element video segmentation


Info

Publication number
CN110097026B
Authority
CN
China
Prior art keywords
video
segmentation
audio
paragraph
association rule
Prior art date
Legal status
Active
Application number
CN201910395119.6A
Other languages
Chinese (zh)
Other versions
CN110097026A (en)
Inventor
胡燕祝
田雯嘉
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201910395119.6A
Publication of CN110097026A
Application granted
Publication of CN110097026B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/49 - Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a paragraph association rule evaluation method based on multi-dimensional element video segmentation, comprising the following steps: step one, video parsing; step two, key frame extraction in scene segmentation; step three, scene segmentation based on key frames; step four, audio segmentation of the video; step five, semantic segmentation of the video; step six, a GNN-based paragraph association rule evaluation method for the segmented video; step seven, construction of the association network. After the same video is segmented along multiple dimensions, the corresponding multi-dimensional elements are matched by constructing paragraph association rules. Compared with other paragraph association rule evaluation methods for video segmentation, the method combines the temporal change of pixels in the image sequence with the correlation between adjacent frames to achieve good segmentation of the video in the image dimension while retaining the key information of the video, and thus provides an effective paragraph association rule evaluation method based on multi-dimensional element video segmentation.

Description

Paragraph association rule evaluation method based on multi-dimensional element video segmentation
Technical Field
The invention relates to paragraph association rule evaluation methods, and in particular to a paragraph association rule evaluation method based on multi-dimensional element video segmentation.
Background
At present, most work on video structuring segments the video along the single dimension of the image, and video structuring methods based on multi-dimensional segmentation have received little study. In practice, however, the audio information, text information, and the like contained in a video play an important role in video monitoring. In addition, when moving objects in a video are segmented and key frames are extracted, computational efficiency concerns often lead to taking only a single frame of the video as the key frame, which discards important information, or to selecting key frames by sequentially comparing the visual features of video frames against a preset threshold. Moreover, after the same video is segmented along the three dimensions of scene, sound, and text, segments covering different time periods are obtained; the segments in these three dimensions are not perfectly aligned and therefore intersect. There is consequently a need for a paragraph association rule evaluation method that can completely match the three-dimensional elements of scene, sound, and text.
Video structuring is now very widely applied, for example in fire-fighting facility monitoring systems in public places, in public safety, and in safe-city deployments. With the large-scale deployment of urban video monitoring systems, video monitoring has penetrated every corner of the city, and industries such as intelligent transportation, government supervision, and enterprise operation generate large amounts of monitoring video data. As edge computing, cloud computing, and big data technologies continue to mature, the problems of huge video data volume, difficult storage, and inconvenient retrieval become increasingly prominent. For large-scale real-time monitoring video, the video stream must undergo image processing such as real-time spatio-temporal information labeling, character extraction, feature extraction, target classification, and structured labeling, and be transmitted quickly to central computing for processing. A paragraph association rule evaluation method for multi-dimensional element video segmentation therefore needs to be constructed so that scenes, sounds, and texts can be matched quickly and accurately, providing a real-time and efficient monitoring means for the operation of governments and enterprises in China.
Disclosure of Invention
To solve the problems in the prior art, the present invention provides a method for evaluating paragraph association rules based on multi-dimensional element video segmentation; the flow of the method is shown in FIG. 1.
The technical scheme comprises the following implementation steps:
Step one: video parsing.
The first step of video parsing is data reception; the video then needs to be demultiplexed into an image track, an audio track, and a subtitle track.
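For illustration, a minimal demultiplexing sketch in Python follows. It assumes the ffmpeg command-line tool is installed and that the source carries an AAC audio stream and a text-based subtitle stream; the file names are purely illustrative, and this is a sketch of step one rather than part of the claimed method.

```python
import subprocess

def demux(video_path: str, stem: str) -> None:
    """Demultiplex a video into separate image, audio, and subtitle tracks.

    Uses the ffmpeg CLI: -map 0:v / 0:a / 0:s select the video, audio,
    and subtitle streams of the first input; -c copy avoids re-encoding.
    """
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-map", "0:v",
                    "-c", "copy", f"{stem}_image_track.mp4"], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-map", "0:a",
                    "-c", "copy", f"{stem}_audio_track.m4a"], check=True)
    # Subtitles are converted to SRT (assumes a text subtitle stream).
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-map", "0:s:0",
                    f"{stem}_subtitle_track.srt"], check=True)

demux("traffic_monitoring.mp4", "traffic_monitoring")
```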
Step two: key frame extraction in scene segmentation.
Key frame extraction methods are mainly divided into five categories; the specific methods are shown in FIG. 2.
(1) Key frame extraction based on boundaries. This method directly selects the first and last frames, or the middle frame, of each shot as key frames. It requires little computation and is suitable for shots with little activity or unchanging content.
(2) Key frame extraction based on visual features. This method first selects the first frame as the current key frame; subsequent frames are then compared in turn with the nearest key frame on visual features such as color, motion, edges, shape, and spatial relationships. If the difference between the current frame and the nearest key frame exceeds a predetermined threshold, the current frame is selected as a new key frame (a minimal sketch of this scheme follows the list below).
(3) Key frame extraction based on clustering. Methods of this kind cluster all frames of a shot, select key categories from the resulting clusters according to some criterion such as the number of frames in a category, and then select the frame with the smallest clustering parameter in each key category as the key frame.
(4) Key frame extraction based on multiple modalities. This method simulates human perception to perform simplified video content analysis and generally integrates video, audio, text, and so on. For example, at scene changes in movies or sports videos, the video and audio content often change simultaneously, so a multi-modal extraction method is needed: when the audio and video features at a shot boundary both change greatly at the same time, that shot boundary is a new scene boundary.
(5) Key frame extraction based on the compressed domain. Compressed-domain methods need no decompression of the video stream, or only partial decompression, and extract key frames directly from the MPEG compressed video stream, reducing computational complexity.
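As promised in item (2), the following Python sketch illustrates visual-feature-based extraction using a colour-histogram comparison against the most recent key frame. It assumes OpenCV (cv2) is available; the histogram parameters and the 0.3 threshold are illustrative choices, not values prescribed by the invention.

```python
import cv2

def extract_keyframes(path: str, threshold: float = 0.3) -> list[int]:
    """Keep a frame whenever its colour histogram differs from the
    nearest key frame by more than a preset threshold."""
    cap = cv2.VideoCapture(path)
    keyframes, ref_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        # Bhattacharyya distance: 0 = identical histograms, 1 = maximally different.
        if ref_hist is None or cv2.compareHist(
                ref_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
            keyframes.append(idx)   # current frame becomes the nearest key frame
            ref_hist = hist
        idx += 1
    cap.release()
    return keyframes
```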
Step three: scene segmentation based on keyframes.
Scene segmentation mainly comprises the following three detection approaches:
(1) Detection based on inter-frame difference. The inter-frame difference method obtains the contour of a moving target by taking the difference of two adjacent frames in a video image sequence; it copes well with scenes containing multiple moving targets and with camera motion (a minimal sketch appears after this list).
(2) Detection based on background difference. The background difference method is a general method for motion segmentation of static scenes: it takes the difference between the currently acquired image frame and a background image to obtain a gray-level image of the target motion area, thresholds that gray-level image to extract the motion area, and, to avoid the influence of changes in ambient illumination, updates the background image according to the currently acquired frame. Details are shown in FIG. 3.
(3) Detection based on optical flow. The optical flow method uses the temporal change of pixels in the image sequence and the correlation between adjacent frames to calculate the motion information of objects between adjacent frames from the correspondence between the previous frame and the current frame.
(4) The segmented video can be represented as x1, …, xi, where x denotes a time period of the segmented video and i denotes the number of video segments.
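The sketch referenced in item (1) follows: a rough inter-frame difference segmenter in Python that declares a scene boundary whenever the mean absolute difference between adjacent grey-level frames exceeds a threshold, returning the segments x1, …, xi as (start, end) frame ranges. OpenCV is assumed, and the cut threshold is an illustrative value.

```python
import cv2

def segment_by_frame_difference(path: str,
                                cut_threshold: float = 30.0) -> list[tuple[int, int]]:
    """Split a video into segments at frames where the mean absolute
    grey-level difference from the previous frame spikes above a threshold."""
    cap = cv2.VideoCapture(path)
    ok, prev = cap.read()
    if not ok:
        cap.release()
        return []
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    boundaries, idx = [0], 1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # cv2.absdiff implements the difference operation of the two frames.
        if cv2.absdiff(gray, prev).mean() > cut_threshold:
            boundaries.append(idx)          # a new segment starts here
        prev, idx = gray, idx + 1
    cap.release()
    boundaries.append(idx)
    return list(zip(boundaries[:-1], boundaries[1:]))
```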
Step four: audio segmentation of video.
The audio segmentation method based on empirical mode decomposition (EMD) proceeds as follows (a one-step sifting sketch appears after the list):
(1) Determine all maximum points of the original audio data sequence X(t) and fit them with a cubic spline interpolation function to form the upper envelope of the original data.
(2) Find all minimum points and fit them with a cubic spline interpolation function to form the lower envelope of the data.
(3) Denote the mean of the upper and lower envelopes as m1; subtracting the mean envelope m1 from the original data sequence X(t) gives a new audio data sequence h1, as in the formula:
h1 = X(t) - m1
(4) Cluster and segment the audio data obtained from the EMD decomposition.
(5) The segmented audio can be represented as y1, …, yj, where y denotes a time period of the segmented audio and j denotes the number of audio segments.
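The one-step sifting sketch mentioned before the list shows steps (1)-(3) in Python with SciPy's cubic spline. Boundary handling is deliberately crude (the splines simply extrapolate beyond the outermost extrema), and the function assumes the signal has at least two maxima and two minima.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def emd_sift_once(x: np.ndarray) -> np.ndarray:
    """One EMD sifting step: fit cubic-spline upper and lower envelopes
    through the local extrema and subtract their mean, h1 = X(t) - m1."""
    t = np.arange(len(x))
    max_i = argrelextrema(x, np.greater)[0]   # indices of local maxima
    min_i = argrelextrema(x, np.less)[0]      # indices of local minima
    upper = CubicSpline(max_i, x[max_i])(t)   # upper envelope
    lower = CubicSpline(min_i, x[min_i])(t)   # lower envelope
    m1 = (upper + lower) / 2.0                # mean envelope m1
    return x - m1                             # h1 = X(t) - m1
```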
Step five: and (4) semantic segmentation of the video.
Semantic segmentation of paragraphs mainly involves the following aspects:
(1) Define semantic blocks. A semantic block divides a sentence into several relatively independent semantic units whose length lies above the word level and below the sentence level; it is a preprocessing device linking grammar, semantics, and pragmatics. Semantic blocks are non-recursive, non-nested, and non-overlapping.
(2) Sentence meaning segmentation. Natural language processing typically requires analysis at three levels: grammar, semantics, and pragmatics. Therefore, text word segmentation and part-of-speech tagging are performed statistically first; after word classification is finished, the tagging work is completed, the words are then semantically recombined, and finally sentence meaning segmentation is carried out according to the defined semantic blocks (a rough sketch follows this list).
(3) The segmented text can be represented as z1, …, zk, where z denotes a time period of the segmented text and k denotes the number of text segments.
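The rough sketch promised in item (2): word segmentation and part-of-speech tagging with the jieba library (a common choice for Chinese text, assumed here to be installed, not one mandated by the invention), followed by cutting the tagged stream into flat, non-overlapping blocks at punctuation as a crude stand-in for real semantic-block rules.

```python
import jieba.posseg as pseg  # Chinese word segmentation + POS tagging

def semantic_blocks(sentence: str) -> list[list[tuple[str, str]]]:
    """Segment words, tag parts of speech, and cut the tagged stream into
    non-recursive, non-nested, non-overlapping blocks at punctuation."""
    blocks, current = [], []
    for pair in pseg.cut(sentence):
        if pair.flag == "x":          # jieba commonly tags punctuation as 'x'
            if current:
                blocks.append(current)
            current = []
        else:
            current.append((pair.word, pair.flag))
    if current:
        blocks.append(current)
    return blocks
```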
Step six: a method for judging paragraph association rules of segmented videos of a GNN network.
Graph neural networks (GNNs) are mainly effective at modeling the relationships or interactions between objects in a system. For the same video, segmentation along the three dimensions of scene, sound, and paragraph yields segments covering different time periods, and the segments from the three dimensions cannot be completely aligned, so they intersect. Let t denote each second of video; GNN(t|x), GNN(t|y), and GNN(t|z) denote the feature vectors currently extracted from the segmented video segments in the respective dimensions.
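As a small alignment helper (an illustration of the crossing problem, not the GNN itself), the function below maps a second t to the index of the segment containing it in one dimension; applying it once per dimension shows which x-, y-, and z-segments overlap at t.

```python
def segment_at(t: float, segments: list[tuple[float, float]]) -> int | None:
    """Return the index i of the segment (start_i, end_i) that contains
    second t, or None if t falls inside no segment of this dimension."""
    for i, (start, end) in enumerate(segments):
        if start <= t < end:
            return i
    return None

# Segments in the three dimensions need not align: at t = 5 s the video
# may sit in scene segment 1, audio segment 2, and text segment 1.
```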
Step seven: and constructing the associated network.
The construction of the association network is divided into two steps.
(1) Starting from a single dimension, construct the network association rules within each video segment according to the Euclidean distance or the Hamming distance; the rules comprise the strength and direction of the links between nodes (see the sketch after this list).
(2) Combine the association networks of the three dimensions to form a new directed association network.
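The sketch referenced in step (1): one possible construction of a single dimension's association network with NetworkX, linking segments whose feature vectors lie within a Euclidean distance bound. The weighting scheme is an illustrative assumption; the three per-dimension graphs can then be merged as in step (2).

```python
import numpy as np
import networkx as nx

def build_association_network(features: dict[str, np.ndarray],
                              max_dist: float) -> nx.DiGraph:
    """Treat each segment's feature vector as a node and draw a directed
    edge (earlier -> later) whenever the Euclidean distance between two
    segments falls below max_dist; the edge weight encodes link strength."""
    g = nx.DiGraph()
    names = list(features)
    g.add_nodes_from(names)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = float(np.linalg.norm(features[a] - features[b]))
            if d < max_dist:
                g.add_edge(a, b, weight=1.0 / (1.0 + d))  # strength falls with distance
    return g

# Merging the scene, sound, and text networks into one directed graph:
# combined = nx.compose_all([g_scene, g_sound, g_text])
```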
Compared with the prior art, the invention has the advantages that:
(1) The invention combines the temporal change of pixels in the image sequence, the correlation between adjacent frames, and the correspondence between the previous frame and the current frame to achieve good segmentation of the video in the image dimension while retaining the key information of the video.
(2) After the same video is segmented along the three dimensions of scene, sound, and text, the corresponding scenes, sounds, and texts are matched by constructing paragraph association rules.
Drawings
For a better understanding of the present invention, reference is made to the following further description taken in conjunction with the accompanying drawings.
FIG. 1 is a flowchart of the steps of the paragraph association rule evaluation method based on multi-dimensional element video segmentation;
FIG. 2 is a schematic diagram of a key frame extraction method;
FIG. 3 is a schematic diagram of the background difference detection method.
Detailed Description of the Preferred Embodiments
The present invention will be described in further detail below with reference to examples.
The technical scheme comprises the following implementation steps:
Step one: video parsing.
The first step of video parsing is data reception; the video then needs to be demultiplexed into an image track, an audio track, and a subtitle track.
A traffic monitoring video from a location in Beijing is demultiplexed. The video lasts 1 minute 50 seconds and is decomposed into an image track, an audio track, and a subtitle track; the decomposed audio track and subtitle track also last 1 minute 50 seconds.
Step two: key frame extraction in scene segmentation.
Key frame extraction methods are mainly divided into five categories; the specific methods are shown in FIG. 2.
(1) Key frame extraction based on boundaries. This method directly selects the first and last frames, or the middle frame, of each shot as key frames. It requires little computation and is suitable for shots with little activity or unchanging content.
(2) Key frame extraction based on visual features. This method first selects the first frame as the current key frame; subsequent frames are then compared in turn with the nearest key frame on visual features such as color, motion, edges, shape, and spatial relationships. If the difference between the current frame and the nearest key frame exceeds a predetermined threshold, the current frame is selected as a new key frame.
(3) Key frame extraction based on clustering. Methods of this kind cluster all frames of a shot, select key categories from the resulting clusters according to some criterion such as the number of frames in a category, and then select the frame with the smallest clustering parameter in each key category as the key frame.
(4) Key frame extraction based on multiple modalities. This method simulates human perception to perform simplified video content analysis and generally integrates video, audio, text, and so on. For example, at scene changes in movies or sports videos, the video and audio content often change simultaneously, so a multi-modal extraction method is needed: when the audio and video features at a shot boundary both change greatly at the same time, that shot boundary is a new scene boundary.
(5) Key frame extraction based on the compressed domain. Compressed-domain methods need no decompression of the video stream, or only partial decompression, and extract key frames directly from the MPEG compressed video stream, reducing computational complexity.
In this example, the video is processed with clustering-based key frame extraction, and the frames are clustered into 5 categories.
Step three: scene segmentation based on keyframes.
Scene segmentation mainly comprises the following three detection approaches:
(1) Detection based on inter-frame difference. The inter-frame difference method obtains the contour of a moving target by taking the difference of two adjacent frames in a video image sequence; it copes well with scenes containing multiple moving targets and with camera motion.
(2) Detection based on background difference. The background difference method is a general method for motion segmentation of static scenes: it takes the difference between the currently acquired image frame and a background image to obtain a gray-level image of the target motion area, thresholds that gray-level image to extract the motion area, and, to avoid the influence of changes in ambient illumination, updates the background image according to the currently acquired frame. Details are shown in FIG. 3.
(3) Detection based on optical flow. The optical flow method uses the temporal change of pixels in the image sequence and the correlation between adjacent frames to calculate the motion information of objects between adjacent frames from the correspondence between the previous frame and the current frame.
(4) The segmented video can be represented as x1, …, xi, where x denotes a time period of the segmented video and i denotes the number of video segments.
After the key frames are extracted, the video is segmented using optical flow detection; the segmented video comprises 25 segments, x1, x2, …, x25.
Step four: audio segmentation of video.
The EMD-based audio segmentation method proceeds as follows:
(1) Determine all maximum points of the original audio data sequence X(t) and fit them with a cubic spline interpolation function to form the upper envelope of the original data.
(2) Find all minimum points and fit them with a cubic spline interpolation function to form the lower envelope of the data.
(3) Denote the mean of the upper and lower envelopes as m1; subtracting the mean envelope m1 from the original data sequence X(t) gives a new audio data sequence h1, as in the formula:
h1 = X(t) - m1
(4) Cluster and segment the audio data obtained from the EMD decomposition.
(5) The segmented audio can be represented as y1, …, yj, where y denotes a time period of the segmented audio and j denotes the number of audio segments.
The maximum points of the original audio data sequence X(t) are 2.3, 2.1, 2, 1.9, 1.8, 1.7, 0.9, and 0.8; the minimum points are -1.9, -2.1, -2.6, -3.0, 0, -1.0, and -0.5. The mean of the upper envelope is 1.6875 and the mean of the lower envelope is -1.586. The audio is divided into 25 segments, y1, y2, …, y25.
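A quick check of the stated envelope means, under the reading that they are simple averages of the listed extrema values:

```python
maxima = [2.3, 2.1, 2.0, 1.9, 1.8, 1.7, 0.9, 0.8]
minima = [-1.9, -2.1, -2.6, -3.0, 0.0, -1.0, -0.5]
print(sum(maxima) / len(maxima))            # 1.6875, the stated upper-envelope mean
print(round(sum(minima) / len(minima), 3))  # -1.586, the stated lower-envelope mean
```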
Step five: and (4) semantic segmentation of the video.
Semantic segmentation of paragraphs mainly involves the following aspects:
(1) Define semantic blocks. A semantic block divides a sentence into several relatively independent semantic units whose length lies above the word level and below the sentence level; it is a preprocessing device linking grammar, semantics, and pragmatics. Semantic blocks are non-recursive, non-nested, and non-overlapping.
(2) Sentence meaning segmentation. Natural language processing typically requires analysis at three levels: grammar, semantics, and pragmatics. Therefore, text word segmentation and part-of-speech tagging are performed statistically first; after word classification is finished, the tagging work is completed, the words are then semantically recombined, and finally sentence meaning segmentation is carried out according to the defined semantic blocks.
(3) The segmented text can be represented as z1, …, zk, where z denotes a time period of the segmented text and k denotes the number of text segments.
The text is divided into 25 segments, z1, z2, …, z25; the concrete contents include 'right turn at the crossroad', 'pedestrian stopping', 'serious vehicle congestion', and the like.
Step six: a method for judging paragraph association rules of segmented videos of a GNN network.
Graph neural networks (GNNs) are mainly effective at modeling the relationships or interactions between objects in a system. For the same video, segmentation along the three dimensions of scene, sound, and paragraph yields segments covering different time periods, and the segments from the three dimensions cannot be completely aligned, so they intersect. Let t denote each second of video; GNN(t|x), GNN(t|y), and GNN(t|z) denote the feature vectors currently extracted from the segmented video segments in the respective dimensions.
Extracting the feature vectors of each dimension of the segmented video at the 5 s mark gives the scene feature vector GNN(5 | x1, x2, …, x25), the sound feature vector GNN(5 | y1, y2, …, y25), and the paragraph feature vector GNN(5 | z1, z2, …, z25).
Step seven: and constructing the associated network.
The construction of the association network is divided into two steps.
(1) Starting from a single dimension, construct the network association rules within each video segment according to the Euclidean distance or the Hamming distance; the rules comprise the strength and direction of the links between nodes.
(2) Combine the association networks of the three dimensions to form a new directed association network.

Claims (1)

1. A paragraph association rule evaluation method based on multi-dimensional element video segmentation is characterized by comprising the following steps:
the method comprises the following steps: video analysis:
the first step of video analysis is data reception, and the video needs to be subjected to demultiplexing processing and is decomposed into an image track, an audio track and a subtitle track;
step two: extracting key frames in scene segmentation:
clustering all frames of a shot by using a clustering technology, then selecting key categories from the categories according to a frame number criterion in the categories, and then selecting the frame with the minimum clustering parameter from the key categories as a key frame;
step three: scene segmentation based on keyframes:
segmenting the scene by the optical flow method: the motion information of objects between adjacent frames is calculated from the correspondence between the previous frame and the current frame, using the temporal change of pixels in the image sequence and the correlation between adjacent frames; the segmented video can be expressed as x1, …, xi, where x represents a time period of the segmented video, and i represents the number of video segments;
step four: audio segmentation of video:
the EMD-based audio segmentation method comprises the following specific processes:
(1) determining all maximum value points of the original audio data sequence X (t), and fitting by using a cubic spline interpolation function to form an upper envelope line of the original data;
(2) finding out all minimum value points, and fitting all the minimum value points through a cubic spline interpolation function to form a lower envelope curve of the data;
(3) the mean value of the upper and lower envelopes is denoted as m1, and the mean envelope m1 is subtracted from the original data sequence X(t) to obtain a new audio data sequence h1, as shown in the formula:
h1 = X(t) - m1;
(4) clustering and dividing the audio data subjected to EMD decomposition;
(5) the segmented audio can be represented as y1, …, yj, where y represents a time period of the segmented audio, and j represents the number of audio segments;
step five: semantic segmentation of video:
for semantic segmentation of paragraphs, the following aspects are included:
(1) defining a semantic block: the semantic block is used for dividing a sentence into a plurality of relatively independent semantic units, is a preprocessing means for associating grammar, semantics and pragmatics, and is non-recursive, non-nested and non-overlapping;
(2) sentence meaning segmentation: natural language processing requires analysis at three levels: grammar, semantics and pragmatics; therefore, the statistical processing of text word segmentation and part-of-speech tagging is carried out first, the tagging work is completed after word classification is finished, the words are then semantically recombined, and finally sentence meaning segmentation is carried out according to the defined semantic blocks;
(3) the segmented text can be denoted as z1, …, zk, where z represents a time period of the segmented text, and k represents the number of text segments;
step six: the GNN-based paragraph association rule evaluation method for the segmented video:
the relationships or interactions among objects in a system are modeled with a graph neural network (GNN); for the same video, segmentation along the three dimensions of scene, sound and paragraph yields videos of different time periods, and the videos segmented in the three dimensions cannot be completely aligned and will intersect, so a GNN is adopted to evaluate the relevance of the segmented video paragraphs; t represents the video of each second, GNN(t|x) refers to the feature vector extracted from the currently segmented video segment in the scene dimension, GNN(t|y) refers to the feature vector extracted from the currently segmented video segment in the sound dimension, and GNN(t|z) refers to the feature vector extracted from the currently segmented video segment in the paragraph dimension; on this basis, an association network is constructed for the segmented three-dimensional video segments;
step seven: constructing a correlation network:
the construction of the associated network is divided into 2 steps:
(1) starting from a single dimension, constructing a network association rule in each video segment according to Euclidean distance or Hamming distance, wherein the network association rule comprises the strength and the direction between nodes;
(2) and combining the association networks of the three dimensions together to form a new directed association network.
CN201910395119.6A 2019-05-13 2019-05-13 Paragraph association rule evaluation method based on multi-dimensional element video segmentation Active CN110097026B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910395119.6A CN110097026B (en) 2019-05-13 2019-05-13 Paragraph association rule evaluation method based on multi-dimensional element video segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910395119.6A CN110097026B (en) 2019-05-13 2019-05-13 Paragraph association rule evaluation method based on multi-dimensional element video segmentation

Publications (2)

Publication Number Publication Date
CN110097026A CN110097026A (en) 2019-08-06
CN110097026B 2021-04-27

Family

ID=67447957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910395119.6A Active CN110097026B (en) 2019-05-13 2019-05-13 Paragraph association rule evaluation method based on multi-dimensional element video segmentation

Country Status (1)

Country Link
CN (1) CN110097026B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126262B (en) * 2019-12-24 2023-04-28 中国科学院自动化研究所 Video highlight detection method and device based on graphic neural network
CN111586494B (en) * 2020-04-30 2022-03-11 腾讯科技(深圳)有限公司 Intelligent strip splitting method based on audio and video separation
CN113810782B (en) * 2020-06-12 2022-09-27 阿里巴巴集团控股有限公司 Video processing method and device, server and electronic device
CN111914118B (en) * 2020-07-22 2021-08-27 珠海大横琴科技发展有限公司 Video analysis method, device and equipment based on big data and storage medium
CN113470048B (en) * 2021-07-06 2023-04-25 北京深睿博联科技有限责任公司 Scene segmentation method, device, equipment and computer readable storage medium
CN115665359B (en) * 2022-10-09 2023-04-25 西华县环境监察大队 Intelligent compression method for environment monitoring data
CN115905584B (en) * 2023-01-09 2023-08-11 共道网络科技有限公司 Video splitting method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8229227B2 (en) * 2007-06-18 2012-07-24 Zeitera, Llc Methods and apparatus for providing a scalable identification of digital video sequences
CN106780503A (en) * 2016-12-30 2017-05-31 北京师范大学 Remote sensing images optimum segmentation yardstick based on posterior probability information entropy determines method
CN109344780A (en) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 A kind of multi-modal video scene dividing method based on sound and vision
CN109711379B (en) * 2019-01-02 2022-03-15 电子科技大学 Complex environment traffic signal lamp candidate area extraction and identification method

Also Published As

Publication number Publication date
CN110097026A (en) 2019-08-06

Similar Documents

Publication Publication Date Title
CN110097026B (en) Paragraph association rule evaluation method based on multi-dimensional element video segmentation
CN110197135B (en) Video structuring method based on multi-dimensional segmentation
CN109389055B (en) Video classification method based on mixed convolution and attention mechanism
CN103347167A (en) Surveillance video content description method based on fragments
CN105469425A (en) Video condensation method
CN101971190A (en) Real-time body segmentation system
CN102880692A (en) Retrieval-oriented monitoring video semantic description and inspection modeling method
CN114186069B (en) Depth video understanding knowledge graph construction method based on multi-mode different-composition attention network
CN111738218B (en) Human body abnormal behavior recognition system and method
CN110705412A (en) Video target detection method based on motion history image
CN109583334B (en) Action recognition method and system based on space-time correlation neural network
CN114708555A (en) Forest fire prevention monitoring method based on data processing and electronic equipment
Qin et al. Application of video scene semantic recognition technology in smart video
Zin et al. A probability-based model for detecting abandoned objects in video surveillance systems
CN113936236A (en) Video entity relationship and interaction identification method based on multi-modal characteristics
CN111160099B (en) Intelligent segmentation method for video image target
Hwang et al. Object extraction and tracking using genetic algorithms
CN115188081B (en) Complex scene-oriented detection and tracking integrated method
Jeyabharathi Cut set-based dynamic key frame selection and adaptive layer-based background modeling for background subtraction
Skadins et al. Edge pre-processing of traffic surveillance video for bandwidth and privacy optimization in smart cities
CN112306985A (en) Digital retina multi-modal feature combined accurate retrieval method
CN111246176A (en) Video transmission method for realizing banding
CN116152696A (en) Intelligent security image identification method and system for industrial control system
Khan et al. Segmentation of crowd into multiple constituents using modified mask R-CNN based on mutual positioning of human
Wang et al. Video Smoke Detection Based on Multi-feature Fusion and Modified Random Forest.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant