CN111126262A - Video highlight detection method and device based on graph neural network - Google Patents

Video highlight detection method and device based on graph neural network

Info

Publication number
CN111126262A
Authority
CN
China
Prior art keywords: image, frame, representing, video, objects
Prior art date
Legal status
Granted
Application number
CN201911341937.4A
Other languages
Chinese (zh)
Other versions
CN111126262B (en)
Inventor
徐常胜
高君宇
张莹莹
刘畅
李岩
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201911341937.4A
Publication of CN111126262A
Application granted
Publication of CN111126262B
Current legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/217: Validation; Performance evaluation; Active pattern learning techniques
    • G06F 18/2193: Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of video information, in particular to a video highlight detection method and device based on a graph neural network. To solve the problem of low video highlight detection accuracy in the prior art, the invention provides a method comprising: obtaining image feature information of each frame of image in a video to be detected through a preset image feature extraction model, based on the pre-acquired video to be detected; constructing a space map corresponding to each frame of image based on the image feature information of each frame of image; obtaining the semantic features of the objects in each frame of image through a preset semantic feature extraction model according to the space map corresponding to each frame of image, and constructing a time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image; and obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image. The method improves the detection accuracy of video highlights.

Description

Video highlight detection method and device based on graph neural network
Technical Field
The invention relates to the technical field of video information, in particular to a video highlight detection method and device based on a graph neural network.
Background
With the popularization of wearable devices such as portable cameras and smart glasses, more and more people record their lives on video, and video highlight detection has become increasingly important.
Most existing video highlight detection methods extract global features of the video and do not consider differences in local spatio-temporal features. Because video content is complex, such mixed features degrade the final highlight detection results. Existing models fall mainly into three types: latent-variable-based ranking models, autoencoder-based models, and convolutional-neural-network-based models. Latent-variable-based models address the large amount of noise present in videos and enlarge the range of training samples, but their accuracy is limited because the videos are represented by hand-crafted features. Autoencoder-based models reduce the number of negative samples required in the training data, but the whole process is unsupervised learning, so detection accuracy is low. Convolutional-neural-network-based models use a two-branch network to consider information in the spatial and temporal dimensions of the video and achieve higher detection accuracy, but they do not consider that different frames provide different information, and that, within the same frame, different regions also provide different information.
Therefore, how to accurately detect video highlights is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In order to solve the above-mentioned problems in the prior art, that is, to solve the problem of low video highlight detection precision in the prior art, a first aspect of the present invention provides a video highlight detection method based on a graph neural network, where the method includes:
the method comprises the steps that image feature information of each frame of image in a video to be detected is obtained through a preset image feature extraction model based on the pre-obtained video to be detected, wherein the image feature extraction model is constructed based on a neural network, is trained through a preset first training set and is used for extracting the image feature information of the image, and the image feature information comprises position information of an object in the image;
constructing a space map corresponding to each frame of image based on the image characteristic information of each frame of image, wherein the space map comprises a plurality of first nodes and first connecting lines among the first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relations among the objects;
according to the space map corresponding to each frame of image, obtaining the semantic features of the objects in each frame of image through a preset semantic feature extraction model, and constructing a time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image,
the semantic feature extraction model is constructed based on a neural network, is trained through a preset second training set and is used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object;
and according to a time sequence diagram corresponding to each frame of image, obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model, wherein the video segment detection model is constructed based on a neural network, is trained through a preset third training set and is used for calculating the user interest score of each frame of image in the video.
Preferably, "the semantic features of the objects in each frame of image are obtained through a preset semantic feature extraction model according to the space map corresponding to each frame of image", and the method includes:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
Preferably, "obtaining semantic features of the object in each frame of image" includes:
obtaining the semantic features of the objects in each frame of image according to the method shown in the following formula:
m_i = ∑_{j, j≠i} α_{i,j} M(x_j | e_{i,j})
e_{i,j} = H(x_i | x_j)
wherein m_i represents the semantic feature of the i-th object in each frame of image, α_{i,j} represents the weight parameter between the i-th object and the j-th object, M represents a two-layer fully-connected neural network whose inputs are the node and edge features, x_i represents the image feature information of the i-th object, x_j represents the image feature information of the j-th object, e_{i,j} represents the relation between the i-th object and the j-th object, and H represents a two-layer fully-connected neural network whose inputs are the features of the two nodes adjacent to the edge.
Preferably, after the step of obtaining the semantic features of the objects in each frame of image according to the spatial map corresponding to each frame of image through a preset semantic feature extraction model, before the step of constructing the time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image, the method further includes:
and updating the semantic features of the objects in each frame of image by a residual error connection method according to a method shown in the following formula:
x̂_i = x_i + m_i
wherein x̂_i represents the updated semantic feature of the i-th object, x_i represents the image feature information of the i-th object, and m_i represents the semantic feature of the i-th object in each frame of image.
Preferably, the "obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image" includes:
obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the ranking loss function in the video segment detection model, using the method shown in the following formula:
f(x_i) = W x_i + b
L_cls = CrossEntropy(f(x_i), y_i)
L_rank = ∑_{(x_p, x_n)∈S} max(0, 1 - f(x_p) + f(x_n))
L = λ L_cls + (1 - λ) L_rank + γ‖Θ‖²_F
wherein f(x_i) represents the user interest score of the image corresponding to the i-th frame, x_i represents the image feature information of the i-th frame, W represents a first preset parameter, b represents a second preset parameter, L_cls represents the classification loss of the image, y_i represents the label corresponding to the i-th object, CrossEntropy represents the classification loss function, L_rank represents the ranking loss of the images, S represents the set of training data, x_p represents the image feature information of the p-th object, x_n represents the image feature information of the n-th object, f(x_p) represents the user interest score of the image corresponding to the p-th object, f(x_n) represents the user interest score of the image corresponding to the n-th object, λ represents the weight balancing the classification loss and the ranking loss and ranges from 0 to 1, γ represents the weight of the regularization term, Θ represents all parameters in the model, and ‖·‖_F represents the Frobenius norm used as the regularization term.
A second aspect of the present invention provides a video highlight detection apparatus based on a graph neural network, the apparatus comprising:
the device comprises a first module, a second module and a third module, wherein the first module is used for acquiring image characteristic information of each frame of image in a video to be detected through a preset image characteristic extraction model based on the pre-acquired video to be detected, the image characteristic extraction model is constructed based on a neural network, is trained through a preset first training set and is used for extracting the image characteristic information of the image, and the image characteristic information comprises position information of an object in the image;
the second module is used for constructing a spatial map corresponding to each frame of image based on the image characteristic information of each frame of image, wherein the spatial map comprises a plurality of first nodes and first connecting lines among the first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relations among the objects;
a third module, configured to obtain semantic features of objects in each frame of image through a preset semantic feature extraction model according to a space map corresponding to each frame of image, and construct a timing diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image,
the semantic feature extraction model is constructed based on a neural network, is trained through a preset second training set and is used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object;
and the fourth module is used for acquiring the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image, wherein the video segment detection model is constructed based on a neural network, is trained through a preset third training set and is used for calculating the user interest score of each frame of image in the video.
Preferably, the third module is further configured to:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
Preferably, the third module is further configured to:
obtaining the semantic features of the objects in each frame of image according to the method shown in the following formula:
m_i = ∑_{j, j≠i} α_{i,j} M(x_j | e_{i,j})
e_{i,j} = H(x_i | x_j)
wherein m_i represents the semantic feature of the i-th object in each frame of image, α_{i,j} represents the weight parameter between the i-th object and the j-th object, M represents a two-layer fully-connected neural network whose inputs are the node and edge features, x_i represents the image feature information of the i-th object, x_j represents the image feature information of the j-th object, e_{i,j} represents the relation between the i-th object and the j-th object, and H represents a two-layer fully-connected neural network whose inputs are the features of the two nodes adjacent to the edge.
Preferably, the apparatus further comprises an update module configured to:
and updating the semantic features of the objects in each frame of image by a residual error connection method according to a method shown in the following formula:
x̂_i = x_i + m_i
wherein x̂_i represents the updated semantic feature of the i-th object, x_i represents the image feature information of the i-th object, and m_i represents the semantic feature of the i-th object in each frame of image.
Preferably, the fourth module is further configured to:
obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the ranking loss function in the video segment detection model, using the method shown in the following formula:
f(x_i) = W x_i + b
L_cls = CrossEntropy(f(x_i), y_i)
L_rank = ∑_{(x_p, x_n)∈S} max(0, 1 - f(x_p) + f(x_n))
L = λ L_cls + (1 - λ) L_rank + γ‖Θ‖²_F
wherein f(x_i) represents the user interest score of the image corresponding to the i-th frame, x_i represents the image feature information of the i-th frame, W represents a first preset parameter, b represents a second preset parameter, L_cls represents the classification loss of the image, y_i represents the label corresponding to the i-th object, CrossEntropy represents the classification loss function, L_rank represents the ranking loss of the images, S represents the set of training data, x_p represents the image feature information of the p-th object, x_n represents the image feature information of the n-th object, f(x_p) represents the user interest score of the image corresponding to the p-th object, f(x_n) represents the user interest score of the image corresponding to the n-th object, λ represents the weight balancing the classification loss and the ranking loss and ranges from 0 to 1, γ represents the weight of the regularization term, Θ represents all parameters in the model, and ‖·‖_F represents the Frobenius norm used as the regularization term.
The invention provides a video highlight detection method based on a graph neural network, which comprises: obtaining image feature information of each frame of image in a video to be detected through a preset image feature extraction model, based on the pre-acquired video to be detected; constructing a space map corresponding to each frame of image based on the image feature information of each frame of image; obtaining the semantic features of the objects in each frame of image through a preset semantic feature extraction model according to the space map corresponding to each frame of image, and constructing a time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image; and obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image.
The video highlight detection method based on the graph neural network comprehensively considers the relationships among video segments, uses two graphs, a space graph and a time sequence diagram, to depict the object semantics within and across video frames, and uses the graph neural network to detect video highlights, thereby improving detection accuracy and reducing the loss of temporal memory.
Drawings
FIG. 1 is a first flowchart of a video highlight detection method based on graph neural network according to the present invention;
FIG. 2 is a second flow chart of the video highlight detection method based on graph neural network of the present invention;
fig. 3 is a schematic structural diagram of a video highlight detection device based on a graph neural network according to the present invention.
Detailed Description
In order to make the embodiments, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the embodiments are some, but not all embodiments of the present invention. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
Referring to fig. 1, fig. 1 exemplarily shows a first flowchart of a video highlight detection method based on a graph neural network of the present invention.
The method of the invention comprises the following steps:
step S101, based on a pre-acquired video to be detected, acquiring image characteristic information of each frame of image in the video to be detected through a preset image characteristic extraction model.
The image feature extraction model is constructed based on a neural network, trained through a preset first training set and used for extracting image feature information of the image, and the image feature information comprises position information of an object in the image.
Referring to fig. 2, fig. 2 schematically shows a second flow chart of the video highlight detection method based on the graph neural network of the present invention.
In a possible implementation manner, the image feature extraction model in the embodiment of the present application may use ResNet50 as the feature extractor and, after obtaining the image feature information, use an RPN (Region Proposal Network) and ROI Pooling to obtain the position and the features of each object in the picture.
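For illustration only, the following is a minimal sketch of how such per-frame object extraction could be wired up, assuming torchvision's Faster R-CNN (a ResNet-50 backbone with an RPN and ROI pooling) as a stand-in for the feature extractor described above. The function name, frame format and score threshold are assumptions made for the example rather than part of the disclosure, and exporting the pooled ROI features themselves would additionally require hooking into the detector's roi_heads.

```python
# Sketch only: per-frame object boxes and labels from a ResNet-50 + RPN + ROI-pooling
# detector (torchvision >= 0.13 assumed for the weights argument).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

@torch.no_grad()
def extract_frame_objects(frames, score_thresh=0.5):
    """frames: list of HxWx3 uint8 arrays, one per sampled video frame."""
    per_frame = []
    for frame in frames:
        out = detector([to_tensor(frame)])[0]      # dict with boxes, labels, scores
        keep = out["scores"] > score_thresh        # illustrative confidence cut-off
        per_frame.append({
            "boxes": out["boxes"][keep],           # object positions (x1, y1, x2, y2)
            "labels": out["labels"][keep],
        })
    return per_frame
```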
In practical application, in order to make the extracted image feature information more accurate, the image feature extraction model may be trained in advance on a first training set, where the first training set may be a pre-labeled picture set.
And S102, constructing a space map corresponding to each frame of image based on the image characteristic information of each frame of image.
The space map comprises a plurality of first nodes and a plurality of first connecting lines among the first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relations among the objects.
In one possible implementation manner, after the image feature information of each frame of image is obtained, a spatial map corresponding to each frame of image may be constructed based on the image feature information. Optionally, the spatial map may include a plurality of first nodes and first lines between the plurality of first nodes. The first node is an object in each frame of image, and the first connection line is a relation between the objects.
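As an illustration only, the sketch below builds one such spatial map for a single frame. Connecting every pair of objects and the helper name build_spatial_graph are assumptions made for the example; the disclosure states only that the first connecting lines encode relations between objects.

```python
# Sketch only: a fully connected spatial graph over the objects detected in one frame.
import itertools
import torch

def build_spatial_graph(node_features: torch.Tensor):
    """node_features: (num_objects, feat_dim) tensor of object features for one frame."""
    n = node_features.size(0)
    pairs = [(i, j) for i, j in itertools.product(range(n), repeat=2) if i != j]
    edge_index = (torch.tensor(pairs, dtype=torch.long).t()
                  if pairs else torch.empty((2, 0), dtype=torch.long))
    return {"x": node_features, "edge_index": edge_index}   # nodes + directed edges
```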
Step S103, according to the space image corresponding to each frame of image, obtaining the semantic features of the objects in each frame of image through a preset semantic feature extraction model, and constructing a time sequence corresponding to each frame of image according to the semantic features of the objects in each frame of image.
The semantic feature extraction model is constructed based on a neural network, trained through a preset second training set and used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object.
In a possible implementation manner, the method includes that according to a space map corresponding to each frame of image, semantic features of an object in each frame of image are obtained through a preset semantic feature extraction model, and the method includes:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
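As a rough illustration of this step, the sketch below applies a single graph-convolution layer followed by max pooling to the nodes of a time sequence diagram. The symmetric normalisation, the single layer and the class name TemporalGraphBlock are assumptions; the disclosure specifies only a convolution operation followed by a maximum pooling operation.

```python
# Sketch only: one graph convolution plus max pooling over temporal-graph nodes, where
# nodes are per-frame object features and edges link the same object across frames.
import torch
import torch.nn as nn

class TemporalGraphBlock(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        """x: (num_nodes, in_dim) node features; adj: (num_nodes, num_nodes) 0/1 adjacency."""
        a_hat = adj.float() + torch.eye(adj.size(0))            # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1).pow(-0.5)
        a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
        h = torch.relu(self.lin(a_norm @ x))                    # graph convolution
        return h.max(dim=0).values                              # max pooling over nodes
```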
In order to better describe the semantic relationship carried by the connecting lines between nodes, the representation of an edge can be learned from the features of its source node and target node through a two-layer fully-connected network, specifically by the method shown in formula (1):
formula (1):
e_{i,j} = H(x_i | x_j)
wherein x_i represents the image feature information of the i-th object, x_j represents the image feature information of the j-th object, e_{i,j} represents the relation between the i-th object and the j-th object, and H represents a two-layer fully-connected neural network whose inputs are the features of the two nodes adjacent to the edge.
In practical application, prior-art methods do not consider the influence of object semantic features on video detection; to improve detection accuracy, the semantic features of the objects can be calculated. Specifically, the semantic features of the objects in each frame of image can be calculated with a two-layer fully-connected layer, by the method shown in formula (2):
formula (2):
m_i = ∑_{j, j≠i} α_{i,j} M(x_j | e_{i,j})
wherein m_i represents the semantic feature of the i-th object in each frame of image, α_{i,j} represents the weight parameter between the i-th object and the j-th object, and M represents a two-layer fully-connected neural network whose inputs are the node and edge features.
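The sketch below shows one possible reading of formulas (1) and (2), with H and M implemented as two-layer fully-connected networks. The hidden width, the softmax used to produce the weights α_{i,j}, and the module name SpatialMessagePassing are assumptions not fixed by the disclosure.

```python
# Sketch only: e_{i,j} = H(x_i | x_j) and m_i = sum_{j != i} alpha_{i,j} M(x_j | e_{i,j}).
import torch
import torch.nn as nn

class SpatialMessagePassing(nn.Module):
    def __init__(self, feat_dim, hidden_dim=256):
        super().__init__()
        # H: edge network over the concatenated features of the two end nodes.
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        # M: message network over the concatenated node and edge features.
        self.msg_mlp = nn.Sequential(
            nn.Linear(feat_dim + hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim))
        # Scalar score per (i, j) pair, normalised into alpha_{i,j} by softmax.
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, x):
        """x: (n, feat_dim) object features of one frame, n >= 2; returns m: (n, feat_dim)."""
        n = x.size(0)
        xi = x.unsqueeze(1).expand(n, n, -1)                    # x_i broadcast over j
        xj = x.unsqueeze(0).expand(n, n, -1)                    # x_j broadcast over i
        e = self.edge_mlp(torch.cat([xi, xj], dim=-1))          # e_{i,j}
        pair = torch.cat([xj, e], dim=-1)                       # (x_j | e_{i,j})
        scores = self.attn(pair).squeeze(-1)
        scores = scores.masked_fill(torch.eye(n, dtype=torch.bool), float("-inf"))  # j != i
        alpha = torch.softmax(scores, dim=1)                    # alpha_{i,j}
        msg = self.msg_mlp(pair)                                # M(x_j | e_{i,j})
        return (alpha.unsqueeze(-1) * msg).sum(dim=1)           # m_i
```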
In a possible implementation manner, after the step of obtaining the semantic features of the objects in each frame of image according to the space map corresponding to each frame of image through the preset semantic feature extraction model, before the step of constructing the timing diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image, the method further includes:
updating the semantic features of the objects in each frame of image by a residual connection method, according to the method shown in the following formula (3):
formula (3):
x̂_i = x_i + m_i
wherein x̂_i represents the updated semantic feature of the i-th object, x_i represents the image feature information of the i-th object, and m_i represents the semantic feature of the i-th object in each frame of image.
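Continuing the previous sketch, formula (3) then reduces to a plain residual addition; SpatialMessagePassing is the hypothetical module from the sketch above, not a name used in the disclosure.

```python
# Sketch only: residual update of formula (3), x_hat_i = x_i + m_i.
def residual_update(x, message_passing):
    """x: (n, feat_dim) object features of one frame."""
    m = message_passing(x)      # m_i from formula (2)
    return x + m                # residual connection

# Illustrative usage:
# mp = SpatialMessagePassing(feat_dim=1024)
# x_hat = residual_update(frame_features, mp)
```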
In practical application, an object in one frame may be related to objects in subsequent frames, frames may affect one another, and the relation of the same object may change across frames, so the semantic features of the objects in each frame of image need to be updated. By extracting the image feature information through the graph neural network and splitting the relations between intra-frame objects and inter-frame objects into a space graph and a time sequence diagram for depiction, the scale of each graph is reduced, the amount of computation is reduced, and efficiency is improved.
Step S104, obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model, according to the time sequence diagram corresponding to each frame of image.
The video clip detection model is constructed based on a neural network, trained through a preset third training set and used for calculating the user interest score of each frame of image in the video.
In a possible implementation manner, the method includes that a user interest score of each frame of image in the video to be detected is obtained through a preset video segment detection model according to a time sequence diagram corresponding to each frame of image, and includes:
obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the ranking loss function in the video segment detection model and the method shown in the following formula (4):
formula (4):
f(x_i) = W x_i + b
L_cls = CrossEntropy(f(x_i), y_i)
L_rank = ∑_{(x_p, x_n)∈S} max(0, 1 - f(x_p) + f(x_n))
L = λ L_cls + (1 - λ) L_rank + γ‖Θ‖²_F
wherein f(x_i) represents the user interest score of the image corresponding to the i-th frame, x_i represents the image feature information of the i-th frame, W represents a first preset parameter, b represents a second preset parameter, L_cls represents the classification loss of the image, y_i represents the label corresponding to the i-th object, CrossEntropy represents the classification loss function, L_rank represents the ranking loss of the images, S represents the set of training data, x_p represents the image feature information of the p-th object, x_n represents the image feature information of the n-th object, f(x_p) represents the user interest score of the image corresponding to the p-th object, f(x_n) represents the user interest score of the image corresponding to the n-th object, λ represents the weight balancing the classification loss and the ranking loss and ranges from 0 to 1, γ represents the weight of the regularization term, Θ represents all parameters in the model, and ‖·‖_F represents the Frobenius norm used as the regularization term.
The classification loss is applied to all samples, while the ranking loss is applied to hard samples; combining the advantages of the two allows the model to be optimized better. In this way, the influence of the relations between objects in the video and the relations between frames on video highlight detection is fully considered, and the key factors affecting highlight detection are identified at a fine granularity.
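A possible implementation of the combined loss of formula (4) is sketched below; the hinge margin of 1, the averaging over the sampled pairs and the (1 - λ) weighting of the ranking term are assumptions made to obtain a runnable example, not values fixed by the disclosure.

```python
# Sketch only: classification loss on all samples plus pairwise ranking loss on hard
# positive/negative pairs, with a squared-norm regularisation term over the parameters.
import torch
import torch.nn as nn

class HighlightLoss(nn.Module):
    def __init__(self, lam=0.5, gamma=1e-4):
        super().__init__()
        self.lam, self.gamma = lam, gamma
        self.cross_entropy = nn.CrossEntropyLoss()

    def forward(self, logits, labels, pos_scores, neg_scores, params):
        """logits: (N, C) class scores; labels: (N,) class indices;
        pos_scores / neg_scores: f(x_p), f(x_n) for sampled highlight / non-highlight pairs;
        params: iterable of model parameters for the regularisation term."""
        loss_cls = self.cross_entropy(logits, labels)
        loss_rank = torch.clamp(1.0 - pos_scores + neg_scores, min=0).mean()
        reg = sum(p.pow(2).sum() for p in params)
        return self.lam * loss_cls + (1 - self.lam) * loss_rank + self.gamma * reg
```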
To evaluate the effectiveness of the method of the present application, it was compared with prior-art methods on a video highlight detection dataset; the results are shown in Table 1:
Table 1. Comparison of the method of the present application with prior-art methods

Class        LR     DCA    DCM    AFM    VHG
gymnastics   0.40   0.75   0.52   0.56   0.66
parkour      0.61   0.54   0.71   0.75   0.83
skating      0.62   0.66   0.64   0.68   0.70
skiing       0.36   0.60   0.61   0.64   0.69
surfing      0.61   0.65   0.73   0.78   0.69
dog          0.60   0.58   0.69   0.72   0.67
mAP          0.53   0.63   0.65   0.68   0.69
As can be seen from Table 1, the method of the present application achieves good video highlight detection performance.
Referring to fig. 3, fig. 3 schematically shows a structural diagram of the video highlight detection apparatus based on graph neural network according to the present invention.
A second aspect of the present invention provides a video highlight detection apparatus based on a graph neural network, the apparatus comprising:
the first module 1 is configured to acquire image feature information of each frame of image in a video to be detected through a preset image feature extraction model based on the pre-acquired video to be detected, wherein the image feature extraction model is constructed based on a neural network, is trained through a preset first training set and is used for extracting the image feature information of the image, and the image feature information includes position information of an object in the image;
the second module 2 is configured to construct a spatial map corresponding to each frame of image based on image feature information of each frame of image, where the spatial map includes a plurality of first nodes and first connecting lines between the plurality of first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relationships between the objects;
a third module 3, configured to obtain semantic features of the objects in each frame of image through a preset semantic feature extraction model according to the space map corresponding to each frame of image, and construct a timing diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image,
the semantic feature extraction model is constructed based on a neural network, is trained through a preset second training set and is used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object;
and the fourth module 4 is configured to obtain the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image, where the video segment detection model is constructed based on a neural network, is trained through a preset third training set, and is used to calculate the user interest score of each frame of image in the video.
In a possible implementation manner, the third module 3 is further configured to:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
In a possible implementation manner, the third module 3 is further configured to:
and (3) acquiring the semantic features of the objects in each frame of image according to the methods shown in the formulas (1) and (2).
In one possible implementation manner, the apparatus further includes an update module, and the update module is configured to:
and (4) updating the semantic features of the objects in each frame of image by a residual error connection method according to a method shown in a formula (3).
In a possible implementation manner, the fourth module 4 is further configured to:
and obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the sequencing loss function in the video segment detection model and the method shown in the formula (4).
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working processes of the system, the apparatus and the unit described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In summary, the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A video highlight detection method based on a graph neural network is characterized by comprising the following steps:
the method comprises the steps that image feature information of each frame of image in a video to be detected is obtained through a preset image feature extraction model based on the pre-obtained video to be detected, wherein the image feature extraction model is constructed based on a neural network, is trained through a preset first training set and is used for extracting the image feature information of the image, and the image feature information comprises position information of an object in the image;
constructing a space map corresponding to each frame of image based on the image characteristic information of each frame of image, wherein the space map comprises a plurality of first nodes and first connecting lines among the first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relations among the objects;
according to the space map corresponding to each frame of image, obtaining the semantic features of the objects in each frame of image through a preset semantic feature extraction model, and constructing a time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image,
the semantic feature extraction model is constructed based on a neural network, is trained through a preset second training set and is used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object;
and according to a time sequence diagram corresponding to each frame of image, obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model, wherein the video segment detection model is constructed based on a neural network, is trained through a preset third training set and is used for calculating the user interest score of each frame of image in the video.
2. The method according to claim 1, wherein the semantic features of the objects in each frame of image are obtained through a preset semantic feature extraction model according to the space map corresponding to each frame of image, and the method includes:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
3. The method according to claim 2, wherein the step of obtaining semantic features of the objects in each frame of image comprises:
obtaining the semantic features of the objects in each frame of image according to the method shown in the following formula:
m_i = ∑_{j, j≠i} α_{i,j} M(x_j | e_{i,j})
e_{i,j} = H(x_i | x_j)
wherein m_i represents the semantic feature of the i-th object in each frame of image, α_{i,j} represents the weight parameter between the i-th object and the j-th object, M represents a two-layer fully-connected neural network whose inputs are the node and edge features, x_i represents the image feature information of the i-th object, x_j represents the image feature information of the j-th object, e_{i,j} represents the relation between the i-th object and the j-th object, and H represents a two-layer fully-connected neural network whose inputs are the features of the two nodes adjacent to the edge.
4. The method according to claim 1, wherein after the step of obtaining the semantic features of the objects in each frame of image according to the spatial map corresponding to each frame of image through a preset semantic feature extraction model, before the step of constructing the time sequence diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image, the method further comprises:
and updating the semantic features of the objects in each frame of image by a residual error connection method according to a method shown in the following formula:
x̂_i = x_i + m_i
wherein x̂_i represents the updated semantic feature of the i-th object, x_i represents the image feature information of the i-th object, and m_i represents the semantic feature of the i-th object in each frame of image.
5. The method according to claim 1, wherein the method for obtaining the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image comprises:
obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the ranking loss function in the video segment detection model and the method shown in the following formula:
f(x_i) = W x_i + b
L_cls = CrossEntropy(f(x_i), y_i)
L_rank = ∑_{(x_p, x_n)∈S} max(0, 1 - f(x_p) + f(x_n))
L = λ L_cls + (1 - λ) L_rank + γ‖Θ‖²_F
wherein f(x_i) represents the user interest score of the image corresponding to the i-th frame, x_i represents the image feature information of the i-th frame, W represents a first preset parameter, b represents a second preset parameter, L_cls represents the classification loss of the image, y_i represents the label corresponding to the i-th object, CrossEntropy represents the classification loss function, L_rank represents the ranking loss of the images, S represents the set of training data, x_p represents the image feature information of the p-th object, x_n represents the image feature information of the n-th object, f(x_p) represents the user interest score of the image corresponding to the p-th object, f(x_n) represents the user interest score of the image corresponding to the n-th object, λ represents the weight balancing the classification loss and the ranking loss and ranges from 0 to 1, γ represents the weight of the regularization term, Θ represents all parameters in the model, and ‖·‖_F represents the Frobenius norm used as the regularization term.
6. An apparatus for detecting highlight of video based on graph neural network, the apparatus comprising:
the device comprises a first module, a second module and a third module, wherein the first module is used for acquiring image characteristic information of each frame of image in a video to be detected through a preset image characteristic extraction model based on the pre-acquired video to be detected, the image characteristic extraction model is constructed based on a neural network, is trained through a preset first training set and is used for extracting the image characteristic information of the image, and the image characteristic information comprises position information of an object in the image;
the second module is used for constructing a spatial map corresponding to each frame of image based on the image characteristic information of each frame of image, wherein the spatial map comprises a plurality of first nodes and first connecting lines among the first nodes, the first nodes are objects in each frame of image, and the first connecting lines are relations among the objects;
a third module, configured to obtain semantic features of objects in each frame of image through a preset semantic feature extraction model according to a space map corresponding to each frame of image, and construct a timing diagram corresponding to each frame of image according to the semantic features of the objects in each frame of image,
the semantic feature extraction model is constructed based on a neural network, is trained through a preset second training set and is used for extracting semantic features of objects in the images, the timing diagram comprises a plurality of second nodes and second connecting lines among the second nodes, the second nodes are the semantic features of the objects in the video to be detected, and the second connecting lines are the time sequence relation of each frame of image containing the same object;
and the fourth module is used for acquiring the user interest score of each frame of image in the video to be detected through a preset video segment detection model according to the time sequence diagram corresponding to each frame of image, wherein the video segment detection model is constructed based on a neural network, is trained through a preset third training set and is used for calculating the user interest score of each frame of image in the video.
7. The apparatus of claim 6, wherein the third module is further configured to:
performing convolution operation on nodes in the time sequence diagram corresponding to each frame of image through a convolution layer in the video segment detection model according to the time sequence diagram corresponding to each frame of image;
and performing maximum pooling operation on the nodes in the time sequence chart after convolution operation through a pooling layer in the video segment detection model to obtain the semantic features of the objects in each frame of image.
8. The apparatus of claim 7, wherein the third module is further configured to:
obtaining the semantic features of the objects in each frame of image according to the method shown in the following formula:
m_i = ∑_{j, j≠i} α_{i,j} M(x_j | e_{i,j})
e_{i,j} = H(x_i | x_j)
wherein m_i represents the semantic feature of the i-th object in each frame of image, α_{i,j} represents the weight parameter between the i-th object and the j-th object, M represents a two-layer fully-connected neural network whose inputs are the node and edge features, x_i represents the image feature information of the i-th object, x_j represents the image feature information of the j-th object, e_{i,j} represents the relation between the i-th object and the j-th object, and H represents a two-layer fully-connected neural network whose inputs are the features of the two nodes adjacent to the edge.
9. The apparatus of claim 6, further comprising an update module configured to:
and updating the semantic features of the objects in each frame of image by a residual error connection method according to a method shown in the following formula:
x̂_i = x_i + m_i
wherein x̂_i represents the updated semantic feature of the i-th object, x_i represents the image feature information of the i-th object, and m_i represents the semantic feature of the i-th object in each frame of image.
10. The apparatus of claim 6, wherein the fourth module is further configured to:
obtaining the user interest score of each frame of image in the video to be detected according to the classification loss function and the ranking loss function in the video segment detection model and the method shown in the following formula:
f(x_i) = W x_i + b
L_cls = CrossEntropy(f(x_i), y_i)
L_rank = ∑_{(x_p, x_n)∈S} max(0, 1 - f(x_p) + f(x_n))
L = λ L_cls + (1 - λ) L_rank + γ‖Θ‖²_F
wherein f(x_i) represents the user interest score of the image corresponding to the i-th frame, x_i represents the image feature information of the i-th frame, W represents a first preset parameter, b represents a second preset parameter, L_cls represents the classification loss of the image, y_i represents the label corresponding to the i-th object, CrossEntropy represents the classification loss function, L_rank represents the ranking loss of the images, S represents the set of training data, x_p represents the image feature information of the p-th object, x_n represents the image feature information of the n-th object, f(x_p) represents the user interest score of the image corresponding to the p-th object, f(x_n) represents the user interest score of the image corresponding to the n-th object, λ represents the weight balancing the classification loss and the ranking loss and ranges from 0 to 1, γ represents the weight of the regularization term, Θ represents all parameters in the model, and ‖·‖_F represents the Frobenius norm used as the regularization term.
CN201911341937.4A 2019-12-24 2019-12-24 Video highlight detection method and device based on graphic neural network Active CN111126262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911341937.4A CN111126262B (en) 2019-12-24 2019-12-24 Video highlight detection method and device based on graphic neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911341937.4A CN111126262B (en) 2019-12-24 2019-12-24 Video highlight detection method and device based on graphic neural network

Publications (2)

Publication Number Publication Date
CN111126262A true CN111126262A (en) 2020-05-08
CN111126262B CN111126262B (en) 2023-04-28

Family

ID=70501420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911341937.4A Active CN111126262B (en) 2019-12-24 2019-12-24 Video highlight detection method and device based on graphic neural network

Country Status (1)

Country Link
CN (1) CN111126262B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950425A (en) * 2020-08-06 2020-11-17 北京达佳互联信息技术有限公司 Object acquisition method, device, client, server, system and storage medium
CN113111770A (en) * 2021-04-12 2021-07-13 杭州赛鲁班网络科技有限公司 Video processing method, device, terminal and storage medium
CN113822316A (en) * 2020-06-18 2021-12-21 香港科技大学 Method and equipment for predicting student performance in interactive online question bank
WO2022134576A1 (en) * 2020-12-23 2022-06-30 深圳壹账通智能科技有限公司 Infrared video timing behavior positioning method, apparatus and device, and storage medium
WO2023130326A1 (en) * 2022-01-06 2023-07-13 Huawei Technologies Co., Ltd. Methods and devices for generating customized video segment based on content features
CN116721093A (en) * 2023-08-03 2023-09-08 克伦斯(天津)轨道交通技术有限公司 Subway rail obstacle detection method and system based on neural network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018157746A1 (en) * 2017-02-28 2018-09-07 阿里巴巴集团控股有限公司 Recommendation method and apparatus for video data
CN110097026A (en) * 2019-05-13 2019-08-06 北京邮电大学 A kind of paragraph correlation rule evaluation method based on multidimensional element Video segmentation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018157746A1 (en) * 2017-02-28 2018-09-07 阿里巴巴集团控股有限公司 Recommendation method and apparatus for video data
CN110097026A (en) * 2019-05-13 2019-08-06 北京邮电大学 A kind of paragraph correlation rule evaluation method based on multidimensional element Video segmentation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李鸣晓; 庚琦川; 莫红; 吴威; 周忠: "Video behavior recognition method based on segment key frames" *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822316A (en) * 2020-06-18 2021-12-21 香港科技大学 Method and equipment for predicting student performance in interactive online question bank
CN113822316B (en) * 2020-06-18 2024-01-12 香港科技大学 Method and equipment for predicting student performance in interactive online question bank
CN111950425A (en) * 2020-08-06 2020-11-17 北京达佳互联信息技术有限公司 Object acquisition method, device, client, server, system and storage medium
CN111950425B (en) * 2020-08-06 2024-05-10 北京达佳互联信息技术有限公司 Object acquisition method, device, client, server, system and storage medium
WO2022134576A1 (en) * 2020-12-23 2022-06-30 深圳壹账通智能科技有限公司 Infrared video timing behavior positioning method, apparatus and device, and storage medium
CN113111770A (en) * 2021-04-12 2021-07-13 杭州赛鲁班网络科技有限公司 Video processing method, device, terminal and storage medium
WO2023130326A1 (en) * 2022-01-06 2023-07-13 Huawei Technologies Co., Ltd. Methods and devices for generating customized video segment based on content features
CN116721093A (en) * 2023-08-03 2023-09-08 克伦斯(天津)轨道交通技术有限公司 Subway rail obstacle detection method and system based on neural network
CN116721093B (en) * 2023-08-03 2023-10-31 克伦斯(天津)轨道交通技术有限公司 Subway rail obstacle detection method and system based on neural network

Also Published As

Publication number Publication date
CN111126262B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN111126262A (en) Video highlight detection method and device based on graph neural network
CA3066029A1 (en) Image feature acquisition
CN108280477B (en) Method and apparatus for clustering images
CN111161311A (en) Visual multi-target tracking method and device based on deep learning
CN109214002A (en) A kind of transcription comparison method, device and its computer storage medium
KR102265573B1 (en) Method and system for reconstructing mathematics learning curriculum based on artificial intelligence
WO2023284465A1 (en) Image detection method and apparatus, computer-readable storage medium, and computer device
CN111931859B (en) Multi-label image recognition method and device
CN111783712A (en) Video processing method, device, equipment and medium
CN112364204A (en) Video searching method and device, computer equipment and storage medium
CN111144215A (en) Image processing method, image processing device, electronic equipment and storage medium
CN111506755A (en) Picture set classification method and device
CN115482418B (en) Semi-supervised model training method, system and application based on pseudo-negative labels
CN113642400A (en) Graph convolution action recognition method, device and equipment based on 2S-AGCN
CN110427819A (en) The method and relevant device of PPT frame in a kind of identification image
TWI803243B (en) Method for expanding images, computer device and storage medium
CN115062783B (en) Entity alignment method and related device, electronic equipment and storage medium
CN115098732B (en) Data processing method and related device
Yang et al. Student Classroom Behavior Detection Based on YOLOv7+ BRA and Multi-model Fusion
CN112214639B (en) Video screening method, video screening device and terminal equipment
CN114821140A (en) Image clustering method based on Manhattan distance, terminal device and storage medium
CN115203532A (en) Project recommendation method and device, electronic equipment and storage medium
Qu et al. The foreground detection algorithm combined the temporal–spatial information and adaptive visual background extraction
WO2019212407A1 (en) A system and method for image retrieval
Yang et al. Robust feature mining transformer for occluded person re-identification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant