CN111126262A - Video highlight detection method and device based on graph neural network - Google Patents
- Publication number
- CN111126262A (application CN201911341937.4A)
- Authority
- CN
- China
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
- G06F18/2193—Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to the technical field of video information, and in particular to a video highlight detection method and device based on a graph neural network. To solve the problem of low video highlight detection accuracy in the prior art, the invention provides a method comprising: obtaining image feature information of each frame of image in a video to be detected through a preset image feature extraction model, based on the pre-acquired video to be detected; constructing a space map corresponding to each frame of image based on the image feature information of each frame of image; obtaining the semantic features of the objects in each frame of image through a preset semantic feature extraction model according to the space map corresponding to each frame of image, and constructing a time sequence diagram corresponding to each frame of image according to those semantic features; and obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image. The method improves the accuracy of video highlight detection.
Description
Technical Field
The invention relates to the technical field of video information, in particular to a video highlight detection method and device based on a graph neural network.
Background
With the popularization of wearable devices such as portable cameras and smart glasses, more and more people record their lives on video, and detecting video highlights has become increasingly important.
Most existing video highlight detection methods extract overall features of the video and do not account for differences among spatio-temporal local features. Because video content is complex, such mixed features degrade the final highlight detection. Existing models fall into three main types: latent-variable-based ranking models, autoencoder-based models, and convolutional-neural-network-based models. Latent-variable-based models address the large amount of noise in video and expand the range of training samples, but their accuracy is limited because the video is represented by hand-crafted features. Autoencoder-based models reduce the number of negative samples required in the training data, but the whole process is unsupervised learning, so detection accuracy is low. Convolutional-neural-network-based models use a two-branch network to consider information in the spatial and temporal dimensions of the video and achieve higher detection accuracy, but they do not consider that different frames provide different information, and that within the same frame different regions also provide different information.
Therefore, how to provide a method for accurately detecting the highlight of the video is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In order to solve the above-mentioned problems in the prior art, that is, to solve the problem of low video highlight detection precision in the prior art, a first aspect of the present invention provides a video highlight detection method based on a graph neural network, where the method includes:
the method comprises the steps that image feature information of each frame of image in a video to be detected is obtained through a preset image feature extraction model based on the pre-obtained video to be detected, wherein the image feature extraction model is constructed based on a neural network, is trained through a preset first training set and is used for extracting the image feature information of the image, and the image feature information comprises position information of an object in the image;
constructing a space map corresponding to each frame of image based on the image characteristic information of each frame of image, wherein the space map comprises a plurality of first nodes and first connecting lines among the first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relations among the objects;
according to the space map corresponding to each frame of image, obtaining the semantic features of the objects in each frame of image through a preset semantic feature extraction model, and constructing a time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image,
the semantic feature extraction model is constructed based on a neural network, is trained through a preset second training set and is used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object;
and according to a time sequence diagram corresponding to each frame of image, obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model, wherein the video segment detection model is constructed based on a neural network, is trained through a preset third training set and is used for calculating the user interest score of each frame of image in the video.
Preferably, "the semantic features of the objects in each frame of image are obtained through a preset semantic feature extraction model according to the space map corresponding to each frame of image", and the method includes:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
Preferably, "obtaining semantic features of the object in each frame of image" includes:
obtaining the semantic features of the objects in each frame of image according to the method shown in the following formula:
m_i = Σ_{j, j≠i} α_{i,j} M(x_j | e_{i,j})

e_{i,j} = H(x_i | x_j)

where m_i represents the semantic features of the i-th object in each frame of image, α_{i,j} represents the weight parameter between the i-th object and the j-th object, M represents a two-layer fully-connected neural network whose inputs are the features of the node and the edge, x_i represents the image feature information of the i-th object, x_j represents the image feature information of the j-th object, e_{i,j} represents the relationship between the i-th object and the j-th object, and H represents a two-layer fully-connected neural network whose inputs are the features of the two nodes adjacent to the edge.
Preferably, after the step of obtaining the semantic features of the objects in each frame of image according to the spatial map corresponding to each frame of image through a preset semantic feature extraction model, before the step of constructing the time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image, the method further includes:
and updating the semantic features of the objects in each frame of image by a residual error connection method according to a method shown in the following formula:
wherein ,representing semantic features of the ith object after update, xiImage feature information representing the i-th object, miAnd the semantic features of the ith object in each frame image are represented.
Preferably, the "obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image" includes:
obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the ranking loss function in the video segment detection model, as shown in the following formulas:

f(x_i) = W·x_i + b

L_cls = CrossEntropy(f(x_i), y_i)

L_rank = Σ_{(x_p, x_n)∈P} max(0, 1 − f(x_p) + f(x_n))

L = λ·L_cls + (1 − λ)·L_rank + γ·‖Θ‖_F

where f(x_i) represents the user interest score of the image corresponding to the i-th frame, x_i represents the image feature information of the i-th frame, W represents a first preset parameter, b represents a second preset parameter, L_cls represents the classification loss of the image, y_i represents the label corresponding to the i-th object, CrossEntropy represents the classification loss function, L_rank represents the ranking loss of the images, P represents a set of training data pairs, x_p represents the image feature information of the p-th object, x_n represents the image feature information of the n-th object, f(x_p) represents the user interest score of the image corresponding to the p-th object, f(x_n) represents the user interest score of the image corresponding to the n-th object, λ, ranging from 0 to 1, represents the weight between the classification loss and the ranking loss, γ represents the weight of the regularization term, Θ represents all parameters in the model, and ‖·‖_F represents the Frobenius-norm regularization term.
A second aspect of the present invention provides a video highlight detection apparatus based on a graph neural network, the apparatus comprising:
the device comprises a first module, a second module and a third module, wherein the first module is used for acquiring image characteristic information of each frame of image in a video to be detected through a preset image characteristic extraction model based on the pre-acquired video to be detected, the image characteristic extraction model is constructed based on a neural network, is trained through a preset first training set and is used for extracting the image characteristic information of the image, and the image characteristic information comprises position information of an object in the image;
the second module is used for constructing a spatial map corresponding to each frame of image based on the image characteristic information of each frame of image, wherein the spatial map comprises a plurality of first nodes and first connecting lines among the first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relations among the objects;
a third module, configured to obtain semantic features of objects in each frame of image through a preset semantic feature extraction model according to a space map corresponding to each frame of image, and construct a timing diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image,
the semantic feature extraction model is constructed based on a neural network, is trained through a preset second training set and is used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object;
and the fourth module is used for acquiring the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image, wherein the video segment detection model is constructed based on a neural network, is trained through a preset third training set and is used for calculating the user interest score of each frame of image in the video.
Preferably, the third module is further configured to:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
Preferably, the third module is further configured to:
obtaining the semantic features of the objects in each frame of image according to the method shown in the following formula:
m_i = Σ_{j, j≠i} α_{i,j} M(x_j | e_{i,j})

e_{i,j} = H(x_i | x_j)

where m_i represents the semantic features of the i-th object in each frame of image, α_{i,j} represents the weight parameter between the i-th object and the j-th object, M represents a two-layer fully-connected neural network whose inputs are the features of the node and the edge, x_i represents the image feature information of the i-th object, x_j represents the image feature information of the j-th object, e_{i,j} represents the relationship between the i-th object and the j-th object, and H represents a two-layer fully-connected neural network whose inputs are the features of the two nodes adjacent to the edge.
Preferably, the apparatus further comprises an update module configured to:
and updating the semantic features of the objects in each frame of image by a residual error connection method according to a method shown in the following formula:
wherein ,representing semantic features of the ith object after update, xiImage feature information representing the i-th object, miAnd the semantic features of the ith object in each frame image are represented.
Preferably, the fourth module is further configured to:
obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the ranking loss function in the video segment detection model, as shown in the following formulas:

f(x_i) = W·x_i + b

L_cls = CrossEntropy(f(x_i), y_i)

L_rank = Σ_{(x_p, x_n)∈P} max(0, 1 − f(x_p) + f(x_n))

L = λ·L_cls + (1 − λ)·L_rank + γ·‖Θ‖_F

where f(x_i) represents the user interest score of the image corresponding to the i-th frame, x_i represents the image feature information of the i-th frame, W represents a first preset parameter, b represents a second preset parameter, L_cls represents the classification loss of the image, y_i represents the label corresponding to the i-th object, CrossEntropy represents the classification loss function, L_rank represents the ranking loss of the images, P represents a set of training data pairs, x_p represents the image feature information of the p-th object, x_n represents the image feature information of the n-th object, f(x_p) represents the user interest score of the image corresponding to the p-th object, f(x_n) represents the user interest score of the image corresponding to the n-th object, λ, ranging from 0 to 1, represents the weight between the classification loss and the ranking loss, γ represents the weight of the regularization term, Θ represents all parameters in the model, and ‖·‖_F represents the Frobenius-norm regularization term.
The invention provides a video highlight detection method based on a graph neural network, comprising: obtaining image feature information of each frame of image in a video to be detected through a preset image feature extraction model, based on the pre-acquired video to be detected; constructing a space map corresponding to each frame of image based on the image feature information of each frame of image; obtaining the semantic features of the objects in each frame of image through a preset semantic feature extraction model according to the space map corresponding to each frame of image, and constructing a time sequence diagram corresponding to each frame of image according to those semantic features; and obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image.
The video highlight detection method based on the graph neural network comprehensively considers the relationships among video segments, uses two graph structures, a space map and a time sequence diagram, to characterize object semantics across video frames, and uses a graph neural network to detect video highlights, thereby improving detection accuracy and reducing time and memory overhead.
Drawings
FIG. 1 is a first flowchart of a video highlight detection method based on graph neural network according to the present invention;
FIG. 2 is a second flow chart of the video highlight detection method based on graph neural network of the present invention;
fig. 3 is a schematic structural diagram of a video highlight detection device based on a graph neural network according to the present invention.
Detailed Description
In order to make the embodiments, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the embodiments are some, but not all embodiments of the present invention. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
Referring to fig. 1, fig. 1 exemplarily shows a first flowchart of a video highlight detection method based on a graph neural network of the present invention.
The method of the invention comprises the following steps:
step S101, based on a pre-acquired video to be detected, acquiring image characteristic information of each frame of image in the video to be detected through a preset image characteristic extraction model.
The image feature extraction model is constructed based on a neural network, trained through a preset first training set and used for extracting image feature information of the image, and the image feature information comprises position information of an object in the image.
Referring to fig. 2, fig. 2 schematically shows a second flow chart of the video highlight detection method based on the graph neural network of the present invention.
In a possible implementation manner, the image feature extraction model in the embodiment of the present application may use ResNet50 as a feature extractor, and after obtaining image feature information, use an RPN (Region Proposal Network) and ROI Pooling to obtain the positions and features of the objects in a picture.
In practical application, in order to make the extracted image feature information more accurate, the image feature extraction model may be trained in advance on a first training set, which may be a pre-labeled picture set.
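For illustration only, the ROI pooling operation mentioned above can be sketched in a few lines of NumPy; this is a simplified, single-channel version with integer box coordinates, not the actual ResNet50/RPN pipeline of the embodiment.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=2):
    """Simplified ROI max pooling on a single-channel feature map.

    feature_map: (H, W) array
    roi: (x0, y0, x1, y1) integer box, end-exclusive
    Splits the ROI into out_size x out_size bins and max-pools each bin,
    yielding a fixed-size feature regardless of the box size.
    """
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(36, dtype=float).reshape(6, 6)
pooled = roi_max_pool(fmap, (1, 1, 5, 5))
assert pooled.shape == (2, 2)
```

In the real pipeline the same pooling is applied per channel to each RPN proposal, so every detected object yields a fixed-size feature vector.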
And S102, constructing a space map corresponding to each frame of image based on the image characteristic information of each frame of image.
The space map comprises a plurality of first nodes and a plurality of first connecting lines among the first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relations among the objects.
In one possible implementation manner, after the image feature information of each frame of image is obtained, a spatial map corresponding to each frame of image may be constructed based on the image feature information. Optionally, the spatial map may include a plurality of first nodes and first lines between the plurality of first nodes. The first node is an object in each frame of image, and the first connection line is a relation between the objects.
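The construction of the space map can be sketched as follows; the node and edge containers are illustrative assumptions, with the first nodes holding per-object feature vectors and a directed first connecting line between every pair of distinct objects.

```python
import numpy as np

def build_space_map(object_features):
    """Build the space map of one frame.

    object_features: list of 1-D feature vectors, one per detected object.
    Nodes (first nodes) are the objects; a directed edge (first connecting
    line) joins every ordered pair of distinct objects, to be labelled
    later with the learned relationship e_{i,j}.
    """
    nodes = list(object_features)
    edges = [(i, j)
             for i in range(len(nodes))
             for j in range(len(nodes))
             if i != j]
    return nodes, edges

# Example: a frame with three detected objects, 4-dimensional features each
feats = [np.random.rand(4) for _ in range(3)]
nodes, edges = build_space_map(feats)
assert len(edges) == 3 * 2   # n * (n - 1) directed edges
```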
Step S103, according to the space map corresponding to each frame of image, obtaining the semantic features of the objects in each frame of image through a preset semantic feature extraction model, and constructing a time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image.
The semantic feature extraction model is constructed based on a neural network, trained through a preset second training set and used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object.
In a possible implementation manner, the method includes that according to a space map corresponding to each frame of image, semantic features of an object in each frame of image are obtained through a preset semantic feature extraction model, and the method includes:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
In order to better describe the semantic relationship carried by the connecting line between two nodes, the representation of an edge can be learned by a two-layer fully-connected function from the features of the edge's source and target nodes; specifically, it can be learned by the method shown in formula (1):

Formula (1):

e_{i,j} = H(x_i | x_j)

where x_i represents the image feature information of the i-th object, x_j represents the image feature information of the j-th object, e_{i,j} represents the relationship between the i-th object and the j-th object, and H represents a two-layer fully-connected neural network whose inputs are the features of the two nodes adjacent to the edge.
In practical application, prior art methods do not consider the influence of the semantic features of objects on video detection; to improve detection accuracy, these semantic features can be calculated. Specifically, the semantic features of the objects in each frame of image can be calculated by a two-layer fully-connected network, as shown in formula (2):

Formula (2):

m_i = Σ_{j, j≠i} α_{i,j} M(x_j | e_{i,j})

where m_i represents the semantic features of the i-th object in each frame of image, α_{i,j} represents the weight parameter between the i-th object and the j-th object, and M represents a two-layer fully-connected neural network whose inputs are the features of the node and the edge.
In a possible implementation manner, after the step of obtaining the semantic features of the objects in each frame of image according to the space map corresponding to each frame of image through the preset semantic feature extraction model, before the step of constructing the timing diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image, the method further includes:
updating the semantic features of the objects in each frame of image by a residual connection method, according to the method shown in the following formula (3):

Formula (3):

x̂_i = x_i + m_i

where x̂_i represents the updated semantic features of the i-th object, x_i represents the image feature information of the i-th object, and m_i represents the semantic features of the i-th object in each frame of image.
In practice, an object in one frame of image may be related to objects in subsequent images, images of different frames may affect each other, and the relationship of the same object across different frames may change, so the semantic features of the object in each frame of image need to be updated. Extracting image feature information through the graph neural network and splitting the intra-frame and inter-frame object relationships into a space map and a time sequence diagram reduces the scale of the graphs, reduces the amount of computation, and improves efficiency.
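Formulas (1) to (3) can be sketched together in NumPy as follows; the layer sizes, random weights, and uniform attention weights α are illustrative assumptions, not values from the embodiment.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp2(v, w1, w2):
    """Two-layer fully-connected network with a ReLU, standing in for H and M."""
    return np.maximum(v @ w1, 0.0) @ w2

d, de, n = 4, 6, 3    # feature size, edge-embedding size, object count (assumed)
x = rng.standard_normal((n, d))                 # x_i: per-object features

# H takes the concatenated features of the edge's two endpoint nodes
h1, h2 = rng.standard_normal((2 * d, 8)), rng.standard_normal((8, de))
# M takes a node feature concatenated with an edge representation
m1, m2 = rng.standard_normal((d + de, 8)), rng.standard_normal((8, d))

alpha = np.full((n, n), 1.0 / (n - 1))          # uniform weights (assumption)

m = np.zeros((n, d))
for i in range(n):
    for j in range(n):
        if j == i:
            continue
        e_ij = mlp2(np.concatenate([x[i], x[j]]), h1, h2)                 # formula (1)
        m[i] += alpha[i, j] * mlp2(np.concatenate([x[j], e_ij]), m1, m2)  # formula (2)

x_updated = x + m    # residual connection, formula (3)
assert x_updated.shape == x.shape
```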
And step S104, obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image.
The video clip detection model is constructed based on a neural network, trained through a preset third training set and used for calculating the user interest score of each frame of image in the video.
In a possible implementation manner, the method includes that a user interest score of each frame of image in the video to be detected is obtained through a preset video segment detection model according to a time sequence diagram corresponding to each frame of image, and includes:
obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the ranking loss function in the video segment detection model, by the method shown in the following formula (4):

Formula (4):

f(x_i) = W·x_i + b

L_cls = CrossEntropy(f(x_i), y_i)

L_rank = Σ_{(x_p, x_n)∈P} max(0, 1 − f(x_p) + f(x_n))

L = λ·L_cls + (1 − λ)·L_rank + γ·‖Θ‖_F

where f(x_i) represents the user interest score of the image corresponding to the i-th frame, x_i represents the image feature information of the i-th frame, W represents a first preset parameter, b represents a second preset parameter, L_cls represents the classification loss of the image, y_i represents the label corresponding to the i-th object, CrossEntropy represents the classification loss function, L_rank represents the ranking loss of the images, P represents a set of training data pairs, x_p represents the image feature information of the p-th object, x_n represents the image feature information of the n-th object, f(x_p) represents the user interest score of the image corresponding to the p-th object, f(x_n) represents the user interest score of the image corresponding to the n-th object, λ, ranging from 0 to 1, represents the weight between the classification loss and the ranking loss, γ represents the weight of the regularization term, Θ represents all parameters in the model, and ‖·‖_F represents the Frobenius-norm regularization term.
Classification loss is applied to all samples and ranking loss to hard samples; combining the advantages of both optimizes the model better. In this way, the influence of the relationships between objects within the video and between frames on highlight detection is fully considered, and the key factors affecting highlight detection are identified at a fine granularity.
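As a hedged sketch of formula (4), the combined objective can be implemented as follows; the hinge form of the ranking loss and the sigmoid cross-entropy are standard choices assumed here rather than stated explicitly in the text.

```python
import numpy as np

def interest_score(x, W, b):
    """f(x) = W x + b, the per-frame user interest score."""
    return x @ W + b

def cross_entropy(score, y):
    """Binary cross-entropy on a sigmoid of the score (assumed form)."""
    p = 1.0 / (1.0 + np.exp(-score))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def combined_loss(scores, labels, pairs, lam=0.5, gamma=1e-4, params=()):
    """lam * classification loss + (1 - lam) * ranking loss + gamma * L2 term."""
    cls = np.mean([cross_entropy(s, y) for s, y in zip(scores, labels)])
    # hinge-style ranking loss over (highlight, non-highlight) pairs
    rank = np.mean([max(0.0, 1.0 - (scores[p] - scores[n])) for p, n in pairs])
    reg = sum(np.sum(th ** 2) for th in params)
    return lam * cls + (1.0 - lam) * rank + gamma * reg

scores = np.array([2.0, -1.0, 0.5])   # f(x_i) for three frames
labels = np.array([1, 0, 1])          # highlight labels y_i
pairs = [(0, 1), (2, 1)]              # (p, n) index pairs from the training set
loss = combined_loss(scores, labels, pairs)
assert loss >= 0.0
```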
To evaluate the effectiveness of the method of the present application, it was compared with prior art methods on a video clip detection data set; the results are shown in Table 1:
table 1 comparison of the method of the present application with the prior art method
| Class | LR | DCA | DCM | AFM | VHG |
| --- | --- | --- | --- | --- | --- |
| gymnastics | 0.40 | 0.75 | 0.52 | 0.56 | 0.66 |
| parkour | 0.61 | 0.54 | 0.71 | 0.75 | 0.83 |
| skating | 0.62 | 0.66 | 0.64 | 0.68 | 0.70 |
| skiing | 0.36 | 0.60 | 0.61 | 0.64 | 0.69 |
| surfing | 0.61 | 0.65 | 0.73 | 0.78 | 0.69 |
| dog | 0.60 | 0.58 | 0.69 | 0.72 | 0.67 |
| mAP | 0.53 | 0.63 | 0.65 | 0.68 | 0.69 |
As can be seen from Table 1, the method of the present application achieves good video highlight detection performance.
Referring to fig. 3, fig. 3 schematically shows a structural diagram of the video highlight detection apparatus based on graph neural network according to the present invention.
A second aspect of the present invention provides a video highlight detection apparatus based on a graph neural network, the apparatus comprising:
the first module 1 is configured to acquire image feature information of each frame of image in a video to be detected through a preset image feature extraction model based on the pre-acquired video to be detected, wherein the image feature extraction model is constructed based on a neural network, is trained through a preset first training set and is used for extracting the image feature information of the image, and the image feature information includes position information of an object in the image;
the second module 2 is configured to construct a spatial map corresponding to each frame of image based on image feature information of each frame of image, where the spatial map includes a plurality of first nodes and first connecting lines between the plurality of first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relationships between the objects;
a third module 3, configured to obtain semantic features of the objects in each frame of image through a preset semantic feature extraction model according to the space map corresponding to each frame of image, and construct a timing diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image,
the semantic feature extraction model is constructed based on a neural network, is trained through a preset second training set and is used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object;
and the fourth module 4 is configured to obtain the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image, where the video segment detection model is constructed based on a neural network, is trained through a preset third training set, and is used to calculate the user interest score of each frame of image in the video.
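The four modules form a sequential pipeline. The following hypothetical sketch shows that dataflow; the model callables and graph builders here are stand-ins for the trained networks the patent presupposes, not disclosed implementations.

```python
from typing import Callable, List, Sequence

def detect_highlights(
    frames: Sequence,               # frames of the video to be detected
    extract_features: Callable,     # module 1: image-feature extraction model
    build_spatial_graph: Callable,  # module 2: objects + relations per frame
    build_timing_graph: Callable,   # module 3: same object linked across frames
    score_frame: Callable,          # module 4: video segment detection model
) -> List[float]:
    """Run the four modules in order and return one user-interest
    score per frame of the input video."""
    features = [extract_features(f) for f in frames]           # module 1
    spatial = [build_spatial_graph(x) for x in features]       # module 2
    timing = build_timing_graph(spatial)                       # module 3
    return [score_frame(timing, i) for i in range(len(frames))]  # module 4
```

Highlight segments are then selected from the frames with the highest returned scores.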
In a possible implementation manner, the third module 3 is further configured to:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
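One way to realise the convolution-then-max-pooling step above is a standard graph-convolution layer followed by pooling over nodes. This NumPy sketch assumes that standard layer form; the patent does not disclose the exact operator.

```python
import numpy as np

def graph_conv(H, A, W):
    """One graph-convolution layer over the timing-graph nodes.
    H: (n, d) node features, A: (n, n) adjacency, W: (d, d_out) weights.
    Aggregates neighbour features along the edges, then applies a
    linear transform and a ReLU."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))   # row-normalise the adjacency
    return np.maximum(0.0, D_inv @ A_hat @ H @ W)

def max_pool_nodes(H):
    """Max pooling over the convolved node features, yielding a single
    semantic-feature vector."""
    return H.max(axis=0)
```

The pooled vector serves as the semantic feature the third module passes on to the timing diagram.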
In a possible implementation manner, the third module 3 is further configured to:
acquire the semantic features of the objects in each frame of image according to the methods shown in formulas (1) and (2).
In one possible implementation manner, the apparatus further includes an update module, and the update module is configured to:
update the semantic features of the objects in each frame of image by a residual-connection method, according to the method shown in formula (3).
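The residual-connection update of formula (3) is not legible in this translation; a common form, assumed here, adds the aggregated message back onto the original feature so the original information survives through an identity path.

```python
import numpy as np

def residual_update(x, m, W=None):
    """Residual-connection update of an object's semantic feature:
    the aggregated message m (optionally projected by a weight
    matrix W) is added back onto the original feature x."""
    delta = m if W is None else m @ W
    return x + delta
```

With `W=None` this is a plain identity skip-connection; passing a projection matrix lets the message dimension differ from the feature dimension.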
In a possible implementation manner, the fourth module 4 is further configured to:
obtain the user interest score of each frame of image in the video to be detected according to the classification loss function and the ranking loss function in the video segment detection model, by the method shown in formula (4).
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working processes of the system, the apparatus and the unit described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing over the prior art, or all or part of the technical solution, may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In summary, the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (10)
1. A video highlight detection method based on a graph neural network is characterized by comprising the following steps:
the method comprises the steps that image feature information of each frame of image in a video to be detected is obtained through a preset image feature extraction model based on the pre-obtained video to be detected, wherein the image feature extraction model is constructed based on a neural network, is trained through a preset first training set and is used for extracting the image feature information of the image, and the image feature information comprises position information of an object in the image;
constructing a space map corresponding to each frame of image based on the image characteristic information of each frame of image, wherein the space map comprises a plurality of first nodes and first connecting lines among the first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relations among the objects;
according to the space map corresponding to each frame of image, obtaining the semantic features of the objects in each frame of image through a preset semantic feature extraction model, and constructing a time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image,
the semantic feature extraction model is constructed based on a neural network, is trained through a preset second training set and is used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object;
and according to a time sequence diagram corresponding to each frame of image, obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model, wherein the video segment detection model is constructed based on a neural network, is trained through a preset third training set and is used for calculating the user interest score of each frame of image in the video.
2. The method according to claim 1, wherein the semantic features of the objects in each frame of image are obtained through a preset semantic feature extraction model according to the space map corresponding to each frame of image, and the method includes:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
3. The method according to claim 2, wherein the step of obtaining semantic features of the objects in each frame of image comprises:
obtaining the semantic features of the objects in each frame of image according to the methods shown in the following formulas:

e_{i,j} = H(x_i | x_j)

m_i = Σ_j α_{i,j} M(x_j | e_{i,j})

wherein m_i represents the semantic features of the i-th object in each frame of image; α_{i,j} represents the weight parameter between the i-th object and the j-th object; M represents a two-layer fully-connected network whose inputs are the features of nodes and edges; x_i represents the image feature information of the i-th object; x_j represents the image feature information of the j-th object; e_{i,j} represents the relationship between the i-th object and the j-th object; and H represents a two-layer fully-connected neural network whose inputs are the features of the two nodes adjacent to an edge.
4. The method according to claim 1, wherein after the step of obtaining the semantic features of the objects in each frame of image according to the spatial map corresponding to each frame of image through a preset semantic feature extraction model, before the step of constructing the time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image, the method further comprises:
and updating the semantic features of the objects in each frame of image by a residual-connection method according to the method shown in the following formula:

x'_i = x_i + m_i

wherein x'_i represents the updated semantic feature of the i-th object.
5. The method according to claim 1, wherein the method for obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image comprises:
obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the ranking loss function in the video segment detection model and the method shown in the following formula:

f(x_i) = W x_i + b

L_cls = CrossEntropy(f(x_i), y_i)

L_rank = Σ_{(x_p, x_n) ∈ S} max(0, 1 − (f(x_p) − f(x_n)))

L = λ L_cls + (1 − λ) L_rank + γ ‖Θ‖_F²

wherein f(x_i) represents the user interest score of the image corresponding to the i-th frame; x_i represents the image feature information of the i-th frame; W represents a first preset parameter; b represents a second preset parameter; L_cls represents the classification loss of the image; y_i represents the label corresponding to the i-th object; CrossEntropy represents the classification loss function; L_rank represents the ranking loss of the images; S represents a set of training data pairs; x_p represents the image feature information of the p-th object; x_n represents the image feature information of the n-th object; f(x_p) represents the user interest score of the image corresponding to the p-th object; f(x_n) represents the user interest score of the image corresponding to the n-th object; λ represents the weight balancing the classification loss and the ranking loss and ranges from 0 to 1; γ represents the weight of the regularization term; Θ represents all parameters in the model; and ‖·‖_F denotes the Frobenius norm used in the regularization term.
6. An apparatus for detecting highlight of video based on graph neural network, the apparatus comprising:
the device comprises a first module, a second module and a third module, wherein the first module is used for acquiring image characteristic information of each frame of image in a video to be detected through a preset image characteristic extraction model based on the pre-acquired video to be detected, the image characteristic extraction model is constructed based on a neural network, is trained through a preset first training set and is used for extracting the image characteristic information of the image, and the image characteristic information comprises position information of an object in the image;
the second module is used for constructing a spatial map corresponding to each frame of image based on the image characteristic information of each frame of image, wherein the spatial map comprises a plurality of first nodes and first connecting lines among the first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relations among the objects;
a third module, configured to obtain semantic features of objects in each frame of image through a preset semantic feature extraction model according to a space map corresponding to each frame of image, and construct a timing diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image,
the semantic feature extraction model is constructed based on a neural network, is trained through a preset second training set and is used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object;
and the fourth module is used for acquiring the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image, wherein the video segment detection model is constructed based on a neural network, is trained through a preset third training set and is used for calculating the user interest score of each frame of image in the video.
7. The apparatus of claim 6, wherein the third module is further configured to:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
8. The apparatus of claim 7, wherein the third module is further configured to:
obtaining the semantic features of the objects in each frame of image according to the methods shown in the following formulas:

e_{i,j} = H(x_i | x_j)

m_i = Σ_j α_{i,j} M(x_j | e_{i,j})

wherein m_i represents the semantic features of the i-th object in each frame of image; α_{i,j} represents the weight parameter between the i-th object and the j-th object; M represents a two-layer fully-connected network whose inputs are the features of nodes and edges; x_i represents the image feature information of the i-th object; x_j represents the image feature information of the j-th object; e_{i,j} represents the relationship between the i-th object and the j-th object; and H represents a two-layer fully-connected neural network whose inputs are the features of the two nodes adjacent to an edge.
9. The apparatus of claim 6, further comprising an update module configured to:
and updating the semantic features of the objects in each frame of image by a residual-connection method according to the method shown in the following formula:

x'_i = x_i + m_i

wherein x'_i represents the updated semantic feature of the i-th object.
10. The apparatus of claim 6, wherein the fourth module is further configured to:
obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the ranking loss function in the video segment detection model and the method shown in the following formula:

f(x_i) = W x_i + b

L_cls = CrossEntropy(f(x_i), y_i)

L_rank = Σ_{(x_p, x_n) ∈ S} max(0, 1 − (f(x_p) − f(x_n)))

L = λ L_cls + (1 − λ) L_rank + γ ‖Θ‖_F²

wherein f(x_i) represents the user interest score of the image corresponding to the i-th frame; x_i represents the image feature information of the i-th frame; W represents a first preset parameter; b represents a second preset parameter; L_cls represents the classification loss of the image; y_i represents the label corresponding to the i-th object; CrossEntropy represents the classification loss function; L_rank represents the ranking loss of the images; S represents a set of training data pairs; x_p represents the image feature information of the p-th object; x_n represents the image feature information of the n-th object; f(x_p) represents the user interest score of the image corresponding to the p-th object; f(x_n) represents the user interest score of the image corresponding to the n-th object; λ represents the weight balancing the classification loss and the ranking loss and ranges from 0 to 1; γ represents the weight of the regularization term; Θ represents all parameters in the model; and ‖·‖_F denotes the Frobenius norm used in the regularization term.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911341937.4A CN111126262B (en) | 2019-12-24 | 2019-12-24 | Video highlight detection method and device based on graphic neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911341937.4A CN111126262B (en) | 2019-12-24 | 2019-12-24 | Video highlight detection method and device based on graphic neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111126262A true CN111126262A (en) | 2020-05-08 |
CN111126262B CN111126262B (en) | 2023-04-28 |
Family
ID=70501420
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911341937.4A Active CN111126262B (en) | 2019-12-24 | 2019-12-24 | Video highlight detection method and device based on graphic neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111126262B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111950425A (en) * | 2020-08-06 | 2020-11-17 | 北京达佳互联信息技术有限公司 | Object acquisition method, device, client, server, system and storage medium |
CN113111770A (en) * | 2021-04-12 | 2021-07-13 | 杭州赛鲁班网络科技有限公司 | Video processing method, device, terminal and storage medium |
CN113822316A (en) * | 2020-06-18 | 2021-12-21 | 香港科技大学 | Method and equipment for predicting student performance in interactive online question bank |
WO2022134576A1 (en) * | 2020-12-23 | 2022-06-30 | 深圳壹账通智能科技有限公司 | Infrared video timing behavior positioning method, apparatus and device, and storage medium |
WO2023130326A1 (en) * | 2022-01-06 | 2023-07-13 | Huawei Technologies Co., Ltd. | Methods and devices for generating customized video segment based on content features |
CN116721093A (en) * | 2023-08-03 | 2023-09-08 | 克伦斯(天津)轨道交通技术有限公司 | Subway rail obstacle detection method and system based on neural network |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018157746A1 (en) * | 2017-02-28 | 2018-09-07 | 阿里巴巴集团控股有限公司 | Recommendation method and apparatus for video data |
CN110097026A (en) * | 2019-05-13 | 2019-08-06 | 北京邮电大学 | A kind of paragraph correlation rule evaluation method based on multidimensional element Video segmentation |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018157746A1 (en) * | 2017-02-28 | 2018-09-07 | 阿里巴巴集团控股有限公司 | Recommendation method and apparatus for video data |
CN110097026A (en) * | 2019-05-13 | 2019-08-06 | 北京邮电大学 | A kind of paragraph correlation rule evaluation method based on multidimensional element Video segmentation |
Non-Patent Citations (1)
Title |
---|
李鸣晓; 庚琦川; 莫红; 吴威; 周忠: "Video action recognition method based on segment key frames" * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113822316A (en) * | 2020-06-18 | 2021-12-21 | 香港科技大学 | Method and equipment for predicting student performance in interactive online question bank |
CN113822316B (en) * | 2020-06-18 | 2024-01-12 | 香港科技大学 | Method and equipment for predicting student performance in interactive online question bank |
CN111950425A (en) * | 2020-08-06 | 2020-11-17 | 北京达佳互联信息技术有限公司 | Object acquisition method, device, client, server, system and storage medium |
CN111950425B (en) * | 2020-08-06 | 2024-05-10 | 北京达佳互联信息技术有限公司 | Object acquisition method, device, client, server, system and storage medium |
WO2022134576A1 (en) * | 2020-12-23 | 2022-06-30 | 深圳壹账通智能科技有限公司 | Infrared video timing behavior positioning method, apparatus and device, and storage medium |
CN113111770A (en) * | 2021-04-12 | 2021-07-13 | 杭州赛鲁班网络科技有限公司 | Video processing method, device, terminal and storage medium |
WO2023130326A1 (en) * | 2022-01-06 | 2023-07-13 | Huawei Technologies Co., Ltd. | Methods and devices for generating customized video segment based on content features |
CN116721093A (en) * | 2023-08-03 | 2023-09-08 | 克伦斯(天津)轨道交通技术有限公司 | Subway rail obstacle detection method and system based on neural network |
CN116721093B (en) * | 2023-08-03 | 2023-10-31 | 克伦斯(天津)轨道交通技术有限公司 | Subway rail obstacle detection method and system based on neural network |
Also Published As
Publication number | Publication date |
---|---|
CN111126262B (en) | 2023-04-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111126262A (en) | Video highlight detection method and device based on graph neural network | |
CA3066029A1 (en) | Image feature acquisition | |
CN108280477B (en) | Method and apparatus for clustering images | |
CN111161311A (en) | Visual multi-target tracking method and device based on deep learning | |
CN109214002A (en) | A kind of transcription comparison method, device and its computer storage medium | |
KR102265573B1 (en) | Method and system for reconstructing mathematics learning curriculum based on artificial intelligence | |
WO2023284465A1 (en) | Image detection method and apparatus, computer-readable storage medium, and computer device | |
CN111931859B (en) | Multi-label image recognition method and device | |
CN111783712A (en) | Video processing method, device, equipment and medium | |
CN112364204A (en) | Video searching method and device, computer equipment and storage medium | |
CN111144215A (en) | Image processing method, image processing device, electronic equipment and storage medium | |
CN111506755A (en) | Picture set classification method and device | |
CN115482418B (en) | Semi-supervised model training method, system and application based on pseudo-negative labels | |
CN113642400A (en) | Graph convolution action recognition method, device and equipment based on 2S-AGCN | |
CN110427819A (en) | The method and relevant device of PPT frame in a kind of identification image | |
TWI803243B (en) | Method for expanding images, computer device and storage medium | |
CN115062783B (en) | Entity alignment method and related device, electronic equipment and storage medium | |
CN115098732B (en) | Data processing method and related device | |
Yang et al. | Student Classroom Behavior Detection Based on YOLOv7+ BRA and Multi-model Fusion | |
CN112214639B (en) | Video screening method, video screening device and terminal equipment | |
CN114821140A (en) | Image clustering method based on Manhattan distance, terminal device and storage medium | |
CN115203532A (en) | Project recommendation method and device, electronic equipment and storage medium | |
Qu et al. | The foreground detection algorithm combined the temporal–spatial information and adaptive visual background extraction | |
WO2019212407A1 (en) | A system and method for image retrieval | |
Yang et al. | Robust feature mining transformer for occluded person re-identification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |