CN111126262A - Video highlight detection method and device based on graph neural network - Google Patents

Video highlight detection method and device based on graph neural network

Info

Publication number
CN111126262A
Authority
CN
China
Prior art keywords: image, frame, representing, video, objects
Prior art date
Legal status
Granted
Application number
CN201911341937.4A
Other languages
Chinese (zh)
Other versions
CN111126262B (en)
Inventor
徐常胜
高君宇
张莹莹
刘畅
李岩
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201911341937.4A
Publication of CN111126262A
Application granted
Publication of CN111126262B
Current legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/217: Validation; Performance evaluation; Active pattern learning techniques
    • G06F 18/2193: Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of video information, in particular to a video highlight detection method and device based on a graph neural network. To solve the problem of low video highlight detection accuracy in the prior art, the invention provides a method comprising: obtaining image feature information of each frame of image in a video to be detected through a preset image feature extraction model, based on the pre-acquired video to be detected; constructing a space map corresponding to each frame of image based on the image feature information of each frame of image; obtaining the semantic features of the objects in each frame of image through a preset semantic feature extraction model according to the space map corresponding to each frame of image, and constructing a time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image; and obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image. The method improves the detection accuracy of video highlights.

Description

Video highlight detection method and device based on graph neural network
Technical Field
The invention relates to the technical field of video information, in particular to a video highlight detection method and device based on a graph neural network.
Background
With the popularization of wearable devices such as portable cameras and smart glasses, more and more people record their lives on video, and video highlight detection has become increasingly important.
Most existing video highlight detection methods extract global features of the video and do not consider differences in local spatio-temporal features. Because video content is complex, such mixed features degrade the final highlight detection results. Existing models fall mainly into three types: latent-variable-based ranking models, autoencoder-based models, and convolutional-neural-network-based models. Latent-variable-based models address the large amount of noise present in videos and enlarge the range of training samples, but their accuracy is limited because the videos are represented by hand-crafted features. Autoencoder-based models reduce the number of negative samples required in the training data, but the whole process is unsupervised learning, so detection accuracy is low. Convolutional-neural-network-based models use a two-branch network to consider information in the spatial and temporal dimensions of the video and achieve higher detection accuracy, but they do not consider that different frames provide different information, and that, within the same frame, different regions also provide different information.
Therefore, how to accurately detect video highlights is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In order to solve the above-mentioned problems in the prior art, that is, to solve the problem of low video highlight detection precision in the prior art, a first aspect of the present invention provides a video highlight detection method based on a graph neural network, where the method includes:
the method comprises the steps that image feature information of each frame of image in a video to be detected is obtained through a preset image feature extraction model based on the pre-obtained video to be detected, wherein the image feature extraction model is constructed based on a neural network, is trained through a preset first training set and is used for extracting the image feature information of the image, and the image feature information comprises position information of an object in the image;
constructing a space map corresponding to each frame of image based on the image characteristic information of each frame of image, wherein the space map comprises a plurality of first nodes and first connecting lines among the first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relations among the objects;
according to the space map corresponding to each frame of image, obtaining the semantic features of the objects in each frame of image through a preset semantic feature extraction model, and constructing a time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image,
the semantic feature extraction model is constructed based on a neural network, is trained through a preset second training set and is used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object;
and according to a time sequence diagram corresponding to each frame of image, obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model, wherein the video segment detection model is constructed based on a neural network, is trained through a preset third training set and is used for calculating the user interest score of each frame of image in the video.
Preferably, "the semantic features of the objects in each frame of image are obtained through a preset semantic feature extraction model according to the space map corresponding to each frame of image", and the method includes:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
Preferably, "obtaining semantic features of the object in each frame of image" includes:
obtaining the semantic features of the objects in each frame of image according to the method shown in the following formula:
m_i = ∑_{j, j≠i} α_{i,j} M(x_j | e_{i,j})
e_{i,j} = H(x_i | x_j)
wherein m_i represents the semantic feature of the i-th object in each frame of image, α_{i,j} represents the weight parameter between the i-th object and the j-th object, M represents a two-layer fully-connected neural network whose inputs are the node and edge features, x_i represents the image feature information of the i-th object, x_j represents the image feature information of the j-th object, e_{i,j} represents the relation between the i-th object and the j-th object, and H represents a two-layer fully-connected neural network whose inputs are the features of the two nodes adjacent to the edge.
Preferably, after the step of obtaining the semantic features of the objects in each frame of image according to the spatial map corresponding to each frame of image through a preset semantic feature extraction model, before the step of constructing the time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image, the method further includes:
and updating the semantic features of the objects in each frame of image by a residual error connection method according to a method shown in the following formula:
x̂_i = x_i + m_i
wherein x̂_i represents the updated semantic feature of the i-th object, x_i represents the image feature information of the i-th object, and m_i represents the semantic feature of the i-th object in each frame of image.
Preferably, the "obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image" includes:
obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the ranking loss function in the video segment detection model, using the method shown in the following formula:
f(x_i) = W x_i + b
L_cls = CrossEntropy(f(x_i), y_i)
L_rank = ∑_{(x_p, x_n)∈S} max(0, 1 - f(x_p) + f(x_n))
L = λ L_cls + (1 - λ) L_rank + γ‖Θ‖²_F
wherein f(x_i) represents the user interest score of the image corresponding to the i-th frame, x_i represents the image feature information of the i-th frame, W represents a first preset parameter, b represents a second preset parameter, L_cls represents the classification loss of the image, y_i represents the label corresponding to the i-th object, CrossEntropy represents the classification loss function, L_rank represents the ranking loss of the images, S represents the set of training data, x_p represents the image feature information of the p-th object, x_n represents the image feature information of the n-th object, f(x_p) represents the user interest score of the image corresponding to the p-th object, f(x_n) represents the user interest score of the image corresponding to the n-th object, λ represents the weight balancing the classification loss and the ranking loss and ranges from 0 to 1, γ represents the weight of the regularization term, Θ represents all parameters in the model, and ‖·‖_F represents the Frobenius norm used as the regularization term.
A second aspect of the present invention provides a video highlight detection apparatus based on a graph neural network, the apparatus comprising:
the device comprises a first module, a second module and a third module, wherein the first module is used for acquiring image characteristic information of each frame of image in a video to be detected through a preset image characteristic extraction model based on the pre-acquired video to be detected, the image characteristic extraction model is constructed based on a neural network, is trained through a preset first training set and is used for extracting the image characteristic information of the image, and the image characteristic information comprises position information of an object in the image;
the second module is used for constructing a spatial map corresponding to each frame of image based on the image characteristic information of each frame of image, wherein the spatial map comprises a plurality of first nodes and first connecting lines among the first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relations among the objects;
a third module, configured to obtain semantic features of objects in each frame of image through a preset semantic feature extraction model according to a space map corresponding to each frame of image, and construct a timing diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image,
the semantic feature extraction model is constructed based on a neural network, is trained through a preset second training set and is used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object;
and the fourth module is used for acquiring the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image, wherein the video segment detection model is constructed based on a neural network, is trained through a preset third training set and is used for calculating the user interest score of each frame of image in the video.
Preferably, the third module is further configured to:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
Preferably, the third module is further configured to:
obtaining the semantic features of the objects in each frame of image according to the method shown in the following formula:
m_i = ∑_{j, j≠i} α_{i,j} M(x_j | e_{i,j})
e_{i,j} = H(x_i | x_j)
wherein m_i represents the semantic feature of the i-th object in each frame of image, α_{i,j} represents the weight parameter between the i-th object and the j-th object, M represents a two-layer fully-connected neural network whose inputs are the node and edge features, x_i represents the image feature information of the i-th object, x_j represents the image feature information of the j-th object, e_{i,j} represents the relation between the i-th object and the j-th object, and H represents a two-layer fully-connected neural network whose inputs are the features of the two nodes adjacent to the edge.
Preferably, the apparatus further comprises an update module configured to:
and updating the semantic features of the objects in each frame of image by a residual error connection method according to a method shown in the following formula:
x̂_i = x_i + m_i
wherein x̂_i represents the updated semantic feature of the i-th object, x_i represents the image feature information of the i-th object, and m_i represents the semantic feature of the i-th object in each frame of image.
Preferably, the fourth module is further configured to:
obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the ranking loss function in the video segment detection model, using the method shown in the following formula:
f(x_i) = W x_i + b
L_cls = CrossEntropy(f(x_i), y_i)
L_rank = ∑_{(x_p, x_n)∈S} max(0, 1 - f(x_p) + f(x_n))
L = λ L_cls + (1 - λ) L_rank + γ‖Θ‖²_F
wherein f(x_i) represents the user interest score of the image corresponding to the i-th frame, x_i represents the image feature information of the i-th frame, W represents a first preset parameter, b represents a second preset parameter, L_cls represents the classification loss of the image, y_i represents the label corresponding to the i-th object, CrossEntropy represents the classification loss function, L_rank represents the ranking loss of the images, S represents the set of training data, x_p represents the image feature information of the p-th object, x_n represents the image feature information of the n-th object, f(x_p) represents the user interest score of the image corresponding to the p-th object, f(x_n) represents the user interest score of the image corresponding to the n-th object, λ represents the weight balancing the classification loss and the ranking loss and ranges from 0 to 1, γ represents the weight of the regularization term, Θ represents all parameters in the model, and ‖·‖_F represents the Frobenius norm used as the regularization term.
The invention provides a video highlight detection method based on a graph neural network, which comprises: obtaining image feature information of each frame of image in a video to be detected through a preset image feature extraction model, based on the pre-acquired video to be detected; constructing a space map corresponding to each frame of image based on the image feature information of each frame of image; obtaining the semantic features of the objects in each frame of image through a preset semantic feature extraction model according to the space map corresponding to each frame of image, and constructing a time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image; and obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image.
The video highlight detection method based on the graph neural network comprehensively considers the relationships among video segments, uses two graphs, a space graph and a time sequence diagram, to depict the object semantics within and across video frames, and uses the graph neural network to detect video highlights, thereby improving detection accuracy and reducing the loss of temporal memory.
Drawings
FIG. 1 is a first flowchart of a video highlight detection method based on graph neural network according to the present invention;
FIG. 2 is a second flow chart of the video highlight detection method based on graph neural network of the present invention;
fig. 3 is a schematic structural diagram of a video highlight detection device based on a graph neural network according to the present invention.
Detailed Description
In order to make the embodiments, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the embodiments are some, but not all embodiments of the present invention. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
Referring to fig. 1, fig. 1 exemplarily shows a first flowchart of a video highlight detection method based on a graph neural network of the present invention.
The method of the invention comprises the following steps:
step S101, based on a pre-acquired video to be detected, acquiring image characteristic information of each frame of image in the video to be detected through a preset image characteristic extraction model.
The image feature extraction model is constructed based on a neural network, trained through a preset first training set and used for extracting image feature information of the image, and the image feature information comprises position information of an object in the image.
Referring to fig. 2, fig. 2 schematically shows a second flow chart of the video highlight detection method based on the graph neural network of the present invention.
In a possible implementation manner, the image feature extraction model in the embodiment of the present application may use ResNet50 as the feature extractor and, after obtaining the image feature information, use an RPN (Region Proposal Network) and ROI Pooling to obtain the position and the features of each object in the picture.
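For illustration only, the following is a minimal sketch of how such per-frame object extraction could be wired up, assuming torchvision's Faster R-CNN (a ResNet-50 backbone with an RPN and ROI pooling) as a stand-in for the feature extractor described above. The function name, frame format and score threshold are assumptions made for the example rather than part of the disclosure, and exporting the pooled ROI features themselves would additionally require hooking into the detector's roi_heads.

```python
# Sketch only: per-frame object boxes and labels from a ResNet-50 + RPN + ROI-pooling
# detector (torchvision >= 0.13 assumed for the weights argument).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

@torch.no_grad()
def extract_frame_objects(frames, score_thresh=0.5):
    """frames: list of HxWx3 uint8 arrays, one per sampled video frame."""
    per_frame = []
    for frame in frames:
        out = detector([to_tensor(frame)])[0]      # dict with boxes, labels, scores
        keep = out["scores"] > score_thresh        # illustrative confidence cut-off
        per_frame.append({
            "boxes": out["boxes"][keep],           # object positions (x1, y1, x2, y2)
            "labels": out["labels"][keep],
        })
    return per_frame
```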
In practical application, in order to make the extracted image feature information more accurate, the image feature extraction model may be trained in advance on a first training set, where the first training set may be a pre-labeled picture set.
And S102, constructing a space map corresponding to each frame of image based on the image characteristic information of each frame of image.
The space map comprises a plurality of first nodes and a plurality of first connecting lines among the first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relations among the objects.
In one possible implementation manner, after the image feature information of each frame of image is obtained, a spatial map corresponding to each frame of image may be constructed based on the image feature information. Optionally, the spatial map may include a plurality of first nodes and first lines between the plurality of first nodes. The first node is an object in each frame of image, and the first connection line is a relation between the objects.
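As an illustration only, the sketch below builds one such spatial map for a single frame. Connecting every pair of objects and the helper name build_spatial_graph are assumptions made for the example; the disclosure states only that the first connecting lines encode relations between objects.

```python
# Sketch only: a fully connected spatial graph over the objects detected in one frame.
import itertools
import torch

def build_spatial_graph(node_features: torch.Tensor):
    """node_features: (num_objects, feat_dim) tensor of object features for one frame."""
    n = node_features.size(0)
    pairs = [(i, j) for i, j in itertools.product(range(n), repeat=2) if i != j]
    edge_index = (torch.tensor(pairs, dtype=torch.long).t()
                  if pairs else torch.empty((2, 0), dtype=torch.long))
    return {"x": node_features, "edge_index": edge_index}   # nodes + directed edges
```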
Step S103, according to the space image corresponding to each frame of image, obtaining the semantic features of the objects in each frame of image through a preset semantic feature extraction model, and constructing a time sequence corresponding to each frame of image according to the semantic features of the objects in each frame of image.
The semantic feature extraction model is constructed based on a neural network, trained through a preset second training set and used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object.
In a possible implementation manner, the method includes that according to a space map corresponding to each frame of image, semantic features of an object in each frame of image are obtained through a preset semantic feature extraction model, and the method includes:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
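As a rough illustration of this step, the sketch below applies a single graph-convolution layer followed by max pooling to the nodes of a time sequence diagram. The symmetric normalisation, the single layer and the class name TemporalGraphBlock are assumptions; the disclosure specifies only a convolution operation followed by a maximum pooling operation.

```python
# Sketch only: one graph convolution plus max pooling over temporal-graph nodes, where
# nodes are per-frame object features and edges link the same object across frames.
import torch
import torch.nn as nn

class TemporalGraphBlock(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        """x: (num_nodes, in_dim) node features; adj: (num_nodes, num_nodes) 0/1 adjacency."""
        a_hat = adj.float() + torch.eye(adj.size(0))            # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1).pow(-0.5)
        a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
        h = torch.relu(self.lin(a_norm @ x))                    # graph convolution
        return h.max(dim=0).values                              # max pooling over nodes
```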
In order to better describe the semantic relationship carried by the connecting lines between nodes, the representation of an edge can be learned from the features of its source node and target node through a two-layer fully-connected network, specifically by the method shown in formula (1):
formula (1):
e_{i,j} = H(x_i | x_j)
wherein x_i represents the image feature information of the i-th object, x_j represents the image feature information of the j-th object, e_{i,j} represents the relation between the i-th object and the j-th object, and H represents a two-layer fully-connected neural network whose inputs are the features of the two nodes adjacent to the edge.
In practical application, prior-art methods do not consider the influence of object semantic features on video detection; to improve detection accuracy, the semantic features of the objects can be calculated. Specifically, the semantic features of the objects in each frame of image can be calculated with a two-layer fully-connected layer, by the method shown in formula (2):
formula (2):
m_i = ∑_{j, j≠i} α_{i,j} M(x_j | e_{i,j})
wherein m_i represents the semantic feature of the i-th object in each frame of image, α_{i,j} represents the weight parameter between the i-th object and the j-th object, and M represents a two-layer fully-connected neural network whose inputs are the node and edge features.
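The sketch below shows one possible reading of formulas (1) and (2), with H and M implemented as two-layer fully-connected networks. The hidden width, the softmax used to produce the weights α_{i,j}, and the module name SpatialMessagePassing are assumptions not fixed by the disclosure.

```python
# Sketch only: e_{i,j} = H(x_i | x_j) and m_i = sum_{j != i} alpha_{i,j} M(x_j | e_{i,j}).
import torch
import torch.nn as nn

class SpatialMessagePassing(nn.Module):
    def __init__(self, feat_dim, hidden_dim=256):
        super().__init__()
        # H: edge network over the concatenated features of the two end nodes.
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        # M: message network over the concatenated node and edge features.
        self.msg_mlp = nn.Sequential(
            nn.Linear(feat_dim + hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim))
        # Scalar score per (i, j) pair, normalised into alpha_{i,j} by softmax.
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, x):
        """x: (n, feat_dim) object features of one frame, n >= 2; returns m: (n, feat_dim)."""
        n = x.size(0)
        xi = x.unsqueeze(1).expand(n, n, -1)                    # x_i broadcast over j
        xj = x.unsqueeze(0).expand(n, n, -1)                    # x_j broadcast over i
        e = self.edge_mlp(torch.cat([xi, xj], dim=-1))          # e_{i,j}
        pair = torch.cat([xj, e], dim=-1)                       # (x_j | e_{i,j})
        scores = self.attn(pair).squeeze(-1)
        scores = scores.masked_fill(torch.eye(n, dtype=torch.bool), float("-inf"))  # j != i
        alpha = torch.softmax(scores, dim=1)                    # alpha_{i,j}
        msg = self.msg_mlp(pair)                                # M(x_j | e_{i,j})
        return (alpha.unsqueeze(-1) * msg).sum(dim=1)           # m_i
```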
In a possible implementation manner, after the step of obtaining the semantic features of the objects in each frame of image according to the space map corresponding to each frame of image through the preset semantic feature extraction model, before the step of constructing the timing diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image, the method further includes:
updating the semantic features of the objects in each frame of image by a residual connection method, according to the method shown in the following formula (3):
formula (3):
x̂_i = x_i + m_i
wherein x̂_i represents the updated semantic feature of the i-th object, x_i represents the image feature information of the i-th object, and m_i represents the semantic feature of the i-th object in each frame of image.
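Continuing the previous sketch, formula (3) then reduces to a plain residual addition; SpatialMessagePassing is the hypothetical module from the sketch above, not a name used in the disclosure.

```python
# Sketch only: residual update of formula (3), x_hat_i = x_i + m_i.
def residual_update(x, message_passing):
    """x: (n, feat_dim) object features of one frame."""
    m = message_passing(x)      # m_i from formula (2)
    return x + m                # residual connection

# Illustrative usage:
# mp = SpatialMessagePassing(feat_dim=1024)
# x_hat = residual_update(frame_features, mp)
```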
In practical application, an object in one frame may be related to objects in subsequent frames, frames may affect one another, and the relation of the same object may change across frames, so the semantic features of the objects in each frame of image need to be updated. By extracting the image feature information through the graph neural network and splitting the relations between intra-frame objects and inter-frame objects into a space graph and a time sequence diagram for depiction, the scale of each graph is reduced, the amount of computation is reduced, and efficiency is improved.
Step S104, obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model, according to the time sequence diagram corresponding to each frame of image.
The video clip detection model is constructed based on a neural network, trained through a preset third training set and used for calculating the user interest score of each frame of image in the video.
In a possible implementation manner, the method includes that a user interest score of each frame of image in the video to be detected is obtained through a preset video segment detection model according to a time sequence diagram corresponding to each frame of image, and includes:
obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the ranking loss function in the video segment detection model and the method shown in the following formula (4):
formula (4):
f(x_i) = W x_i + b
L_cls = CrossEntropy(f(x_i), y_i)
L_rank = ∑_{(x_p, x_n)∈S} max(0, 1 - f(x_p) + f(x_n))
L = λ L_cls + (1 - λ) L_rank + γ‖Θ‖²_F
wherein f(x_i) represents the user interest score of the image corresponding to the i-th frame, x_i represents the image feature information of the i-th frame, W represents a first preset parameter, b represents a second preset parameter, L_cls represents the classification loss of the image, y_i represents the label corresponding to the i-th object, CrossEntropy represents the classification loss function, L_rank represents the ranking loss of the images, S represents the set of training data, x_p represents the image feature information of the p-th object, x_n represents the image feature information of the n-th object, f(x_p) represents the user interest score of the image corresponding to the p-th object, f(x_n) represents the user interest score of the image corresponding to the n-th object, λ represents the weight balancing the classification loss and the ranking loss and ranges from 0 to 1, γ represents the weight of the regularization term, Θ represents all parameters in the model, and ‖·‖_F represents the Frobenius norm used as the regularization term.
The classification loss is applied to all samples, while the ranking loss is applied to hard samples; combining the advantages of the two allows the model to be optimized better. In this way, the influence of the relations between objects in the video and the relations between frames on video highlight detection is fully considered, and the key factors affecting highlight detection are identified at a fine granularity.
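A possible implementation of the combined loss of formula (4) is sketched below; the hinge margin of 1, the averaging over the sampled pairs and the (1 - λ) weighting of the ranking term are assumptions made to obtain a runnable example, not values fixed by the disclosure.

```python
# Sketch only: classification loss on all samples plus pairwise ranking loss on hard
# positive/negative pairs, with a squared-norm regularisation term over the parameters.
import torch
import torch.nn as nn

class HighlightLoss(nn.Module):
    def __init__(self, lam=0.5, gamma=1e-4):
        super().__init__()
        self.lam, self.gamma = lam, gamma
        self.cross_entropy = nn.CrossEntropyLoss()

    def forward(self, logits, labels, pos_scores, neg_scores, params):
        """logits: (N, C) class scores; labels: (N,) class indices;
        pos_scores / neg_scores: f(x_p), f(x_n) for sampled highlight / non-highlight pairs;
        params: iterable of model parameters for the regularisation term."""
        loss_cls = self.cross_entropy(logits, labels)
        loss_rank = torch.clamp(1.0 - pos_scores + neg_scores, min=0).mean()
        reg = sum(p.pow(2).sum() for p in params)
        return self.lam * loss_cls + (1 - self.lam) * loss_rank + self.gamma * reg
```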
To evaluate the effectiveness of the method of the present application, it was compared with prior-art methods on a video highlight detection dataset; the results are shown in Table 1:
Table 1. Comparison of the method of the present application with prior-art methods

Class        LR     DCA    DCM    AFM    VHG
gymnastics   0.40   0.75   0.52   0.56   0.66
parkour      0.61   0.54   0.71   0.75   0.83
skating      0.62   0.66   0.64   0.68   0.70
skiing       0.36   0.60   0.61   0.64   0.69
surfing      0.61   0.65   0.73   0.78   0.69
dog          0.60   0.58   0.69   0.72   0.67
mAP          0.53   0.63   0.65   0.68   0.69
As can be seen from Table 1, the method of the present application achieves good video highlight detection performance.
Referring to fig. 3, fig. 3 schematically shows a structural diagram of the video highlight detection apparatus based on graph neural network according to the present invention.
A second aspect of the present invention provides a video highlight detection apparatus based on a graph neural network, the apparatus comprising:
the first module 1 is configured to acquire image feature information of each frame of image in a video to be detected through a preset image feature extraction model based on the pre-acquired video to be detected, wherein the image feature extraction model is constructed based on a neural network, is trained through a preset first training set and is used for extracting the image feature information of the image, and the image feature information includes position information of an object in the image;
the second module 2 is configured to construct a spatial map corresponding to each frame of image based on image feature information of each frame of image, where the spatial map includes a plurality of first nodes and first connecting lines between the plurality of first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relationships between the objects;
a third module 3, configured to obtain semantic features of the objects in each frame of image through a preset semantic feature extraction model according to the space map corresponding to each frame of image, and construct a timing diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image,
the semantic feature extraction model is constructed based on a neural network, is trained through a preset second training set and is used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object;
and the fourth module 4 is configured to obtain the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image, where the video segment detection model is constructed based on a neural network, is trained through a preset third training set, and is used to calculate the user interest score of each frame of image in the video.
In a possible implementation manner, the third module 3 is further configured to:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
In a possible implementation manner, the third module 3 is further configured to:
and (3) acquiring the semantic features of the objects in each frame of image according to the methods shown in the formulas (1) and (2).
In one possible implementation manner, the apparatus further includes an update module, and the update module is configured to:
and (4) updating the semantic features of the objects in each frame of image by a residual error connection method according to a method shown in a formula (3).
In a possible implementation manner, the fourth module 4 is further configured to:
and obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the sequencing loss function in the video segment detection model and the method shown in the formula (4).
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working processes of the system, the apparatus and the unit described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In summary, the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A video highlight detection method based on a graph neural network is characterized by comprising the following steps:
the method comprises the steps that image feature information of each frame of image in a video to be detected is obtained through a preset image feature extraction model based on the pre-obtained video to be detected, wherein the image feature extraction model is constructed based on a neural network, is trained through a preset first training set and is used for extracting the image feature information of the image, and the image feature information comprises position information of an object in the image;
constructing a space map corresponding to each frame of image based on the image characteristic information of each frame of image, wherein the space map comprises a plurality of first nodes and first connecting lines among the first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relations among the objects;
according to the space map corresponding to each frame of image, obtaining the semantic features of the objects in each frame of image through a preset semantic feature extraction model, and constructing a time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image,
the semantic feature extraction model is constructed based on a neural network, is trained through a preset second training set and is used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object;
and according to a time sequence diagram corresponding to each frame of image, obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model, wherein the video segment detection model is constructed based on a neural network, is trained through a preset third training set and is used for calculating the user interest score of each frame of image in the video.
2. The method according to claim 1, wherein the semantic features of the objects in each frame of image are obtained through a preset semantic feature extraction model according to the space map corresponding to each frame of image, and the method includes:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
3. The method according to claim 2, wherein the step of obtaining semantic features of the objects in each frame of image comprises:
obtaining the semantic features of the objects in each frame of image according to the method shown in the following formula:
m_i = ∑_{j, j≠i} α_{i,j} M(x_j | e_{i,j})
e_{i,j} = H(x_i | x_j)
wherein m_i represents the semantic feature of the i-th object in each frame of image, α_{i,j} represents the weight parameter between the i-th object and the j-th object, M represents a two-layer fully-connected neural network whose inputs are the node and edge features, x_i represents the image feature information of the i-th object, x_j represents the image feature information of the j-th object, e_{i,j} represents the relation between the i-th object and the j-th object, and H represents a two-layer fully-connected neural network whose inputs are the features of the two nodes adjacent to the edge.
4. The method according to claim 1, wherein after the step of obtaining the semantic features of the objects in each frame of image according to the spatial map corresponding to each frame of image through a preset semantic feature extraction model, before the step of constructing the time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image, the method further comprises:
and updating the semantic features of the objects in each frame of image by a residual error connection method according to a method shown in the following formula:
x̂_i = x_i + m_i
wherein x̂_i represents the updated semantic feature of the i-th object, x_i represents the image feature information of the i-th object, and m_i represents the semantic feature of the i-th object in each frame of image.
5. The method according to claim 1, wherein the method for obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image comprises:
obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the ranking loss function in the video segment detection model and the method shown in the following formula:
f(x_i) = W x_i + b
L_cls = CrossEntropy(f(x_i), y_i)
L_rank = ∑_{(x_p, x_n)∈S} max(0, 1 - f(x_p) + f(x_n))
L = λ L_cls + (1 - λ) L_rank + γ‖Θ‖²_F
wherein f(x_i) represents the user interest score of the image corresponding to the i-th frame, x_i represents the image feature information of the i-th frame, W represents a first preset parameter, b represents a second preset parameter, L_cls represents the classification loss of the image, y_i represents the label corresponding to the i-th object, CrossEntropy represents the classification loss function, L_rank represents the ranking loss of the images, S represents the set of training data, x_p represents the image feature information of the p-th object, x_n represents the image feature information of the n-th object, f(x_p) represents the user interest score of the image corresponding to the p-th object, f(x_n) represents the user interest score of the image corresponding to the n-th object, λ represents the weight balancing the classification loss and the ranking loss and ranges from 0 to 1, γ represents the weight of the regularization term, Θ represents all parameters in the model, and ‖·‖_F represents the Frobenius norm used as the regularization term.
6. An apparatus for detecting highlight of video based on graph neural network, the apparatus comprising:
the device comprises a first module, a second module and a third module, wherein the first module is used for acquiring image characteristic information of each frame of image in a video to be detected through a preset image characteristic extraction model based on the pre-acquired video to be detected, the image characteristic extraction model is constructed based on a neural network, is trained through a preset first training set and is used for extracting the image characteristic information of the image, and the image characteristic information comprises position information of an object in the image;
the second module is used for constructing a spatial map corresponding to each frame of image based on the image characteristic information of each frame of image, wherein the spatial map comprises a plurality of first nodes and first connecting lines among the first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relations among the objects;
a third module, configured to obtain semantic features of objects in each frame of image through a preset semantic feature extraction model according to a space map corresponding to each frame of image, and construct a timing diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image,
the semantic feature extraction model is constructed based on a neural network, is trained through a preset second training set and is used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object;
and the fourth module is used for acquiring the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image, wherein the video segment detection model is constructed based on a neural network, is trained through a preset third training set and is used for calculating the user interest score of each frame of image in the video.
7. The apparatus of claim 6, wherein the third module is further configured to:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
8. The apparatus of claim 7, wherein the third module is further configured to:
obtaining the semantic features of the objects in each frame of image according to the method shown in the following formula:
m_i = ∑_{j, j≠i} α_{i,j} M(x_j | e_{i,j})
e_{i,j} = H(x_i | x_j)
wherein m_i represents the semantic feature of the i-th object in each frame of image, α_{i,j} represents the weight parameter between the i-th object and the j-th object, M represents a two-layer fully-connected neural network whose inputs are the node and edge features, x_i represents the image feature information of the i-th object, x_j represents the image feature information of the j-th object, e_{i,j} represents the relation between the i-th object and the j-th object, and H represents a two-layer fully-connected neural network whose inputs are the features of the two nodes adjacent to the edge.
9. The apparatus of claim 6, further comprising an update module configured to:
and updating the semantic features of the objects in each frame of image by a residual error connection method according to a method shown in the following formula:
x̂_i = x_i + m_i
wherein x̂_i represents the updated semantic feature of the i-th object, x_i represents the image feature information of the i-th object, and m_i represents the semantic feature of the i-th object in each frame of image.
10. The apparatus of claim 6, wherein the fourth module is further configured to:
obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the ranking loss function in the video segment detection model and the method shown in the following formula:
f(x_i) = W x_i + b
L_cls = CrossEntropy(f(x_i), y_i)
L_rank = ∑_{(x_p, x_n)∈S} max(0, 1 - f(x_p) + f(x_n))
L = λ L_cls + (1 - λ) L_rank + γ‖Θ‖²_F
wherein f(x_i) represents the user interest score of the image corresponding to the i-th frame, x_i represents the image feature information of the i-th frame, W represents a first preset parameter, b represents a second preset parameter, L_cls represents the classification loss of the image, y_i represents the label corresponding to the i-th object, CrossEntropy represents the classification loss function, L_rank represents the ranking loss of the images, S represents the set of training data, x_p represents the image feature information of the p-th object, x_n represents the image feature information of the n-th object, f(x_p) represents the user interest score of the image corresponding to the p-th object, f(x_n) represents the user interest score of the image corresponding to the n-th object, λ represents the weight balancing the classification loss and the ranking loss and ranges from 0 to 1, γ represents the weight of the regularization term, Θ represents all parameters in the model, and ‖·‖_F represents the Frobenius norm used as the regularization term.
CN201911341937.4A 2019-12-24 2019-12-24 Video highlight detection method and device based on graphic neural network Active CN111126262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911341937.4A CN111126262B (en) 2019-12-24 2019-12-24 Video highlight detection method and device based on graphic neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911341937.4A CN111126262B (en) 2019-12-24 2019-12-24 Video highlight detection method and device based on graphic neural network

Publications (2)

Publication Number Publication Date
CN111126262A true CN111126262A (en) 2020-05-08
CN111126262B CN111126262B (en) 2023-04-28

Family

ID=70501420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911341937.4A Active CN111126262B (en) 2019-12-24 2019-12-24 Video highlight detection method and device based on graphic neural network

Country Status (1)

Country Link
CN (1) CN111126262B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950425A (en) * 2020-08-06 2020-11-17 北京达佳互联信息技术有限公司 Object acquisition method, device, client, server, system and storage medium
CN113111770A (en) * 2021-04-12 2021-07-13 杭州赛鲁班网络科技有限公司 Video processing method, device, terminal and storage medium
CN113822316A (en) * 2020-06-18 2021-12-21 香港科技大学 Method and equipment for predicting student performance in interactive online question bank
WO2022134576A1 (en) * 2020-12-23 2022-06-30 深圳壹账通智能科技有限公司 Infrared video timing behavior positioning method, apparatus and device, and storage medium
WO2023130326A1 (en) * 2022-01-06 2023-07-13 Huawei Technologies Co., Ltd. Methods and devices for generating customized video segment based on content features
CN116721093A (en) * 2023-08-03 2023-09-08 克伦斯(天津)轨道交通技术有限公司 Subway rail obstacle detection method and system based on neural network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018157746A1 (en) * 2017-02-28 2018-09-07 阿里巴巴集团控股有限公司 Recommendation method and apparatus for video data
CN110097026A (en) * 2019-05-13 2019-08-06 北京邮电大学 A kind of paragraph correlation rule evaluation method based on multidimensional element Video segmentation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018157746A1 (en) * 2017-02-28 2018-09-07 阿里巴巴集团控股有限公司 Recommendation method and apparatus for video data
CN110097026A (en) * 2019-05-13 2019-08-06 北京邮电大学 A kind of paragraph correlation rule evaluation method based on multidimensional element Video segmentation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李鸣晓; 庚琦川; 莫红; 吴威; 周忠: "Video behavior recognition method based on segment key frames" *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822316A (en) * 2020-06-18 2021-12-21 香港科技大学 Method and equipment for predicting student performance in interactive online question bank
CN113822316B (en) * 2020-06-18 2024-01-12 香港科技大学 Method and equipment for predicting student performance in interactive online question bank
CN111950425A (en) * 2020-08-06 2020-11-17 北京达佳互联信息技术有限公司 Object acquisition method, device, client, server, system and storage medium
CN111950425B (en) * 2020-08-06 2024-05-10 北京达佳互联信息技术有限公司 Object acquisition method, device, client, server, system and storage medium
WO2022134576A1 (en) * 2020-12-23 2022-06-30 深圳壹账通智能科技有限公司 Infrared video timing behavior positioning method, apparatus and device, and storage medium
CN113111770A (en) * 2021-04-12 2021-07-13 杭州赛鲁班网络科技有限公司 Video processing method, device, terminal and storage medium
WO2023130326A1 (en) * 2022-01-06 2023-07-13 Huawei Technologies Co., Ltd. Methods and devices for generating customized video segment based on content features
CN116721093A (en) * 2023-08-03 2023-09-08 克伦斯(天津)轨道交通技术有限公司 Subway rail obstacle detection method and system based on neural network
CN116721093B (en) * 2023-08-03 2023-10-31 克伦斯(天津)轨道交通技术有限公司 Subway rail obstacle detection method and system based on neural network

Also Published As

Publication number Publication date
CN111126262B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN111126262A (en) Video highlight detection method and device based on graph neural network
CA3066029A1 (en) Image feature acquisition
CN108280477B (en) Method and apparatus for clustering images
CN111161311A (en) Visual multi-target tracking method and device based on deep learning
CN109214002A (en) A kind of transcription comparison method, device and its computer storage medium
KR102265573B1 (en) Method and system for reconstructing mathematics learning curriculum based on artificial intelligence
WO2023284465A1 (en) Image detection method and apparatus, computer-readable storage medium, and computer device
CN111931859B (en) Multi-label image recognition method and device
CN111783712A (en) Video processing method, device, equipment and medium
CN112364204A (en) Video searching method and device, computer equipment and storage medium
CN111144215A (en) Image processing method, image processing device, electronic equipment and storage medium
CN111506755A (en) Picture set classification method and device
CN115482418B (en) Semi-supervised model training method, system and application based on pseudo-negative labels
CN113642400A (en) Graph convolution action recognition method, device and equipment based on 2S-AGCN
CN110427819A (en) The method and relevant device of PPT frame in a kind of identification image
TWI803243B (en) Method for expanding images, computer device and storage medium
CN115062783B (en) Entity alignment method and related device, electronic equipment and storage medium
CN115098732B (en) Data processing method and related device
Yang et al. Student Classroom Behavior Detection Based on YOLOv7+ BRA and Multi-model Fusion
CN112214639B (en) Video screening method, video screening device and terminal equipment
CN114821140A (en) Image clustering method based on Manhattan distance, terminal device and storage medium
CN115203532A (en) Project recommendation method and device, electronic equipment and storage medium
Qu et al. The foreground detection algorithm combined the temporal–spatial information and adaptive visual background extraction
WO2019212407A1 (en) A system and method for image retrieval
Yang et al. Robust feature mining transformer for occluded person re-identification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant