CN115630188A - Video recommendation method and device and electronic equipment - Google Patents

Video recommendation method and device and electronic equipment

Info

Publication number
CN115630188A
CN115630188A
Authority
CN
China
Prior art keywords
video
user
sample
interest
embedding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211154166.XA
Other languages
Chinese (zh)
Inventor
高宸
李勇
商宇
金德鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202211154166.XA
Publication of CN115630188A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval of video data
    • G06F 16/73 Querying
    • G06F 16/735 Filtering based on additional data, e.g. user or group profiles
    • G06F 16/738 Presentation of query results
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods


Abstract

The invention provides a video recommendation method, a video recommendation device and electronic equipment, and relates to the technical field of information processing. The method comprises the following steps: dividing each video to be recommended in an acquired video set to be recommended into a preset number of video segments; inputting extracted user features and video segment visual features into a video recommendation model, and obtaining the interest degree, output by the video recommendation model, of the user corresponding to the user attribute information in each video to be recommended; and determining a target recommended video from the video set to be recommended according to the interest degree and outputting the target recommended video to the user. The video recommendation model is trained based on a sample original video corresponding to the user, a sample positive feedback video segment set and a sample negative feedback video segment set, wherein the sample positive feedback video segment set is the set of video segments of sample original videos completely watched by the user, and the sample negative feedback video segment set is the set of video segments in which the user's skip behavior occurs. The technical scheme provided by the invention can improve the accuracy of video recommendation.

Description

Video recommendation method and device and electronic equipment
Technical Field
The invention relates to the technical field of information processing, in particular to a video recommendation method and device and electronic equipment.
Background
Urban computing is a process that addresses the challenges facing cities by continuously acquiring, integrating, and analyzing the heterogeneous big data generated in cities. In urban computing, social media data can be mined for people's preferences and behavior patterns, and, combined with a recommendation system, items of interest can be recommended to users, thereby supporting the daily life of urban residents and promoting the healthy operation of cities.
As a part of urban computing, short videos provide users with life services such as news hotspots, food recommendations, and travel advice. By mining the interaction behavior between users and videos, a user's behavior patterns and interest preferences can be characterized, so that videos of interest can be recommended to the user, which helps the user quickly find content of interest and improves quality of life.
At present, video recommendation models a user's interest preference with the whole video as the unit and evaluates the user's interest in a video accordingly. However, a user's preference often differs across different parts of the same video, and the whole video is not necessarily watched to completion, so videos that do not match the user's real interests are easily recommended and the recommendation accuracy is low.
Disclosure of Invention
The invention provides a video recommendation method, a video recommendation device and electronic equipment, which are used for overcoming the defect of low video recommendation accuracy in the prior art and improving the accuracy of video recommendation.
The invention provides a video recommendation method, which comprises the following steps:
acquiring user attribute information and a video set to be recommended;
dividing each video to be recommended in the video set to be recommended into a preset number of video segments;
respectively performing feature extraction on the user attribute information and the video clips to obtain user characteristics and video clip visual characteristics;
inputting the user characteristics and the video clip visual characteristics into a video recommendation model, and obtaining the interest degree of the user corresponding to the user attribute information output by the video recommendation model on the video to be recommended;
determining a target recommendation video from the video set to be recommended according to the interestingness, and outputting the target recommendation video to the user;
the video recommendation model is obtained by training based on a sample original video, a sample positive feedback video segment set and a sample negative feedback video segment set corresponding to the user, wherein the sample positive feedback video segment set is a set of video segments of the sample original video completely watched by the user, and the sample negative feedback video segment set is a set of video segments where the user skipping action occurs.
According to the video recommendation method provided by the invention, the step of inputting the user characteristics and the video clip visual characteristics into a video recommendation model to obtain the interest degree of the user on the video to be recommended, corresponding to the user attribute information output by the video recommendation model, comprises the following steps:
inputting the user features and the video segment visual features into a graph convolution neural network layer of the video recommendation model to obtain positive interest embedding features, negative interest embedding features and video segment embedding features of the user output by the graph convolution neural network layer, wherein the graph convolution neural network layer is used for embedding and transmitting feature nodes of the user features and the video segment visual features;
and inputting the positive interest embedding feature, the negative interest embedding feature and the video segment embedding feature into a fusion prediction layer of the video recommendation model to obtain the interest level output by the fusion prediction layer, wherein the fusion prediction layer is used for fusing a first interest level of the positive interest embedding feature and a second interest level of the negative interest embedding feature which are obtained by calculation based on the video segment embedding feature.
According to a video recommendation method provided by the present invention, the inputting the user characteristics and the video segment visual characteristics into a graph convolution neural network layer of the video recommendation model to obtain positive interest embedded characteristics, negative interest embedded characteristics and video segment embedded characteristics of the user output by the graph convolution neural network layer comprises:
inputting the user characteristics and the video clip visual characteristics into a user aggregation layer of a graph convolution neural network layer of the video recommendation model to obtain positive interest embedding characteristics and negative interest embedding characteristics of the user, which are output by the user aggregation layer;
inputting the positive interest embedding feature and the negative interest embedding feature into a video aggregation layer of the graph convolution neural network layer to obtain a video segment embedding feature output by the video aggregation layer;
the user aggregation layer is used for clustering positive interests and negative interests of the visual features of the video segments based on the user features; the video aggregation layer is configured to perform embedding propagation of the user to the video segment visual features based on the positive interest embedding features and the negative interest embedding features.
According to a video recommendation method provided by the present invention, the inputting the positive interest embedding feature, the negative interest embedding feature and the video segment embedding feature into a fusion prediction layer of the video recommendation model to obtain the interestingness output by the fusion prediction layer includes:
inputting the video segment embedding features into a first prediction layer of a fusion prediction layer of the video recommendation model to obtain target video segment embedding features output by the first prediction layer, wherein the first prediction layer is used for carrying out weighted combination on the video segment embedding features and the video segment visual features;
inputting the target video segment embedding feature, the positive interest embedding feature and the negative interest embedding feature into a second prediction layer of the fusion prediction layer to obtain the first interest degree and the second interest degree output by the second prediction layer, wherein the second prediction layer is used for splicing the positive interest embedding feature and the negative interest embedding feature with the target video segment embedding feature respectively and then carrying out multilayer perception mapping;
and inputting the first interestingness and the second interestingness into a fusion layer of the fusion prediction layer to obtain the interestingness output by the fusion layer, wherein the fusion layer is used for performing average pooling processing on the first interestingness and the second interestingness.
According to a video recommendation method provided by the present invention, before the inputting the user features and the video clip visual features into the graph convolution neural network layer of the video recommendation model, the method further comprises:
inputting the video clip visual features into a feature enhancement embedding layer of the video recommendation model to obtain video clip enhancement representation features output by the feature enhancement embedding layer, wherein the feature enhancement embedding layer is used for enhancing the video clip visual features based on a transformation matrix;
the inputting the user characteristics and the video clip visual characteristics into a graph convolution neural network layer of the video recommendation model to obtain positive interest embedding characteristics, negative interest embedding characteristics and video clip embedding characteristics of the user output by the graph convolution neural network layer comprises:
inputting the user characteristics and the video segment enhancement representation characteristics into a graph convolution neural network layer of the video recommendation model, and obtaining positive interest embedding characteristics, negative interest embedding characteristics and video segment embedding characteristics of the user output by the graph convolution neural network layer.
According to the video recommendation method provided by the invention, the loss function used in training the video recommendation model is obtained by the weighted summation of a user preference loss function L1 and a video segment loss function L2.

The user preference loss function L1 is:

[formula of L1]

The video segment loss function L2 is:

[formula of L2]

wherein U represents the user attribute information set, u represents user attribute information in U, N_c represents the number of sample video segments into which a sample original video is divided, and j represents the segment index; P_u^j represents the set formed by the j-th video segment of each completely watched sample video in the sample positive feedback video segment set of user u; N_u represents the sample negative feedback video segment set of user u; ω_j is a penalty coefficient, with ω_j = j/N_c; ŷ_{u,c} represents the interest degree of user u in sample video segment c; σ() represents the Sigmoid function; V_u represents the set of sample videos not completely watched by user u, and v represents an element of V_u; ŷ_{u,c_v^+} represents the interest degree of user u in the positively fed-back video segments of the incompletely watched sample video v; and ŷ_{u,c_v^-} represents the interest degree of user u in the negative feedback video segment of the incompletely watched sample video v.
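Illustratively, losses of this kind are commonly written as a segment-level cross-entropy term plus a pairwise ranking term over incompletely watched videos. The following LaTeX rendering is an assumed reconstruction consistent with the symbol definitions above; the exact formulas and the weighting coefficients λ1 and λ2 in the application may differ:

$$L_1 = -\sum_{u \in U}\sum_{j=1}^{N_c}\left[\sum_{c \in \mathcal{P}_u^{j}} \log \sigma\left(\hat{y}_{u,c}\right) + \omega_j \sum_{c \in \mathcal{N}_u} \log\left(1-\sigma\left(\hat{y}_{u,c}\right)\right)\right], \qquad \omega_j = \frac{j}{N_c}$$

$$L_2 = -\sum_{u \in U}\sum_{v \in \mathcal{V}_u} \log \sigma\left(\hat{y}_{u,c_v^{+}} - \hat{y}_{u,c_v^{-}}\right)$$

$$L = \lambda_1 L_1 + \lambda_2 L_2$$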
The present invention also provides a video recommendation apparatus, comprising:
the acquisition module is used for acquiring user attribute information and a video set to be recommended;
the dividing module is used for dividing each video to be recommended in the video set to be recommended into a preset number of video segments;
the characteristic extraction module is used for respectively extracting the characteristics of the user attribute information and the video clip to obtain user characteristics and video clip visual characteristics;
the processing module is used for inputting the user characteristics and the video clip visual characteristics into a video recommendation model and obtaining the interest degree of the user corresponding to the user attribute information output by the video recommendation model on the video to be recommended;
the determining module is used for determining a target recommended video from the video set to be recommended according to the interestingness;
the output module is used for outputting the target recommendation video to the user;
the video recommendation model is obtained by training based on a sample original video, a sample positive feedback video segment set and a sample negative feedback video segment set corresponding to the user, wherein the sample positive feedback video segment set is a set of video segments of the sample original video completely watched by the user, and the sample negative feedback video segment set is a set of video segments where the user skipping action occurs.
The invention further provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the video recommendation method.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a video recommendation method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a video recommendation method as described in any one of the above.
According to the video recommendation method, the video recommendation device and the electronic equipment, each video to be recommended is divided into a preset number of video segments, and the user features extracted from the user attribute information and the visual features extracted from the video segments are input into the video recommendation model, so that the interest degree, output by the video recommendation model, of the user corresponding to the user attribute information in each video to be recommended can be obtained; a target recommended video can then be determined from the video set to be recommended and recommended to the user according to the interest degree, thereby realizing video recommendation based on user interest. The video recommendation model is trained based on a sample original video corresponding to the user, a sample positive feedback video segment set and a sample negative feedback video segment set. The sample positive feedback video segment set is the set of video segments of sample original videos completely watched by the user, and the sample negative feedback video segment set is the set of video segments in which the user's skip behavior occurs. The samples used in model training therefore fully consider the segment-level, fine-grained interaction characteristics between the user and the whole video, and can capture the differences in the user's interest in different parts of the whole video. As a result, the trained video recommendation model can capture the user's interest in a video to be recommended at the fine granularity of the video segment level, the interest degree can be calculated more accurately, videos that better match the user's interests can be recommended when recommendation is performed based on the interest degree output by the video recommendation model, and the accuracy of video recommendation is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a video recommendation method provided by the present invention;
FIG. 2 is a schematic flowchart of a method for obtaining interest degree of a user in a video to be recommended, corresponding to user attribute information, based on a video recommendation model according to the present invention;
FIG. 3 is a schematic structural diagram of a video recommendation model provided by the present invention;
FIG. 4 is a schematic diagram illustrating a training method of a video recommendation model provided by the present invention;
FIG. 5 is a schematic structural diagram of a video recommendation apparatus provided in the present invention;
fig. 6 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A video recommendation system can infer a user's interests from the interaction records between the user and videos and provide the user with a personalized video recommendation list according to those interests; such a system can be applied to a short video platform for video recommendation. Current video recommendation mainly adopts an immersive interface, in which the user watches videos continuously and switches videos by sliding the screen, which reduces the user's cost of selecting and watching content. Unlike the mode in which a video is watched only after the user clicks on it, in this mode the interaction between the user and the video is continuous, and the user's browsing behavior can end at any position of the video content being browsed; that is, the user may choose to skip and stop watching at any video frame. The video recommendation system can characterize the user's behavior patterns and interest preferences by mining interactive behaviors such as watching, liking, commenting and following between the user and videos, screen a candidate set that better matches the user's preferences from a candidate video set according to those preferences, estimate through a parameterized recommendation model the probability that the user will watch a video to completion, then rank the videos by this probability and push the top-ranked videos to the user.
Among the many interactive behaviors between users and videos, watching and skipping are the most important user feedback. On one hand, users watch content they are interested in for longer and choose to skip content they are not interested in; on the other hand, other interactive behaviors such as likes and comments are sparse and few in number, and are not sufficient for learning user interest preferences. Therefore, how to evaluate a user's preference for a video according to the user's watching and skipping behaviors, and to realize personalized video recommendation accordingly, is of great significance for improving video recommendation accuracy and optimizing user experience. The single-column full-screen mode is the mainstream mode of video recommendation; in this mode, the user watches videos continuously and switches video content by sliding up and down. Different from an explicit feedback mode in which the user selects interesting video content and clicks in to watch it, the user often passively watches the video sequence recommended by the platform, so the user's watching behavior is a weaker type of feedback on the user's preferences, which brings new challenges to modeling user interest.
In the related art, video recommendation systems obtain user interest representations by focusing on the dynamics and diversity of user interest and by distinguishing and aggregating the user's historically interacted videos according to content and time order; they may also model user interest using different modalities of the videos, or using the historical interaction sequence together with possible future interaction sequences. These methods adopt coarse-grained positive and negative samples when modeling user interest, that is, they treat the whole video as a single unit and assume by default that the user's content preference is consistent across different parts of the video. However, users often have different preferences for different elements or different segments of a video, and not all frames of a video are viewed to completion, so modeling user interest with the entire video as the unit introduces a certain bias and yields only a coarse-grained user representation. A recommendation system based on coarse-grained user interest therefore recommends more videos that do not match the user's real interests, the accuracy of video recommendation is poor, and the user experience suffers.
Different from the case in which the user selects interesting video content and clicks in to watch it, in the single-column full-screen mode the video content watched by the user is mainly determined by the video recommendation platform, and the user often passively watches the video sequence recommended by the platform. In this process, the interaction between the user and the video is continuous and fine-grained, that is, at the frame level or the segment level. Based on these fine-grained interaction characteristics between the user and the video, if the videos the user interacts with are divided at a fine granularity to capture fine-grained changes in the user's interest, and the video recommendation model is built from these changes, the accuracy of the recommendation model can be improved.
Based on this, the embodiment of the invention provides a video recommendation method, which can divide each video to be recommended in a video set to be recommended into a preset number of video segments, respectively perform feature extraction on user attribute information and the video segments to obtain user features and video segment visual features, then input the user features and the video segment visual features into a video recommendation model to obtain the interest degree of the user attribute information output by the video recommendation model corresponding to the video to be recommended by a user, then determine a target recommended video from the video set to be recommended according to the interest degree, and output the target recommended video to the user; the video recommendation model can be obtained by training based on a sample original video, a sample positive feedback video segment set and a sample negative feedback video segment set corresponding to a user, wherein the sample positive feedback video segment set is a set of video segments of the sample original video completely watched by the user, and the sample negative feedback video segment set is a set of video segments where the skipping behavior of the user occurs.
The video recommendation method of the present invention is described below with reference to fig. 1-4. The video recommendation method can be applied to electronic equipment such as a server, a mobile phone and a computer, and can also be applied to a video recommendation device arranged in the electronic equipment such as the server, the mobile phone and the computer, and the video recommendation device can be realized through software or combination of the software and hardware.
Fig. 1 is a schematic flowchart illustrating a video recommendation method according to an embodiment of the present invention, and referring to fig. 1, the video recommendation method may include the following steps 110 to 150.
Step 110: and acquiring user attribute information and a video set to be recommended.
The user attribute information may include, but is not limited to, user identity information and other characteristic information of the user, such as sex, age, and the like.
Step 120: and respectively dividing each video to be recommended in the video set to be recommended into a preset number of video segments.
The video set to be recommended may include a plurality of videos to be recommended, and each video to be recommended in the set may be first divided into a preset number of video segments, for example, into 20 video segments.
Step 130: and respectively extracting the user attribute information and the video clip to obtain the user characteristics and the video clip visual characteristics.
After the user attribute information is acquired, feature extraction can be performed on the user attribute information to acquire user features representing the user attribute information. For the divided video clips, the visual features of the video clips can be obtained through feature extraction.
For example, for each video segment, video frame sampling may be performed, and visual feature extraction may be performed on each captured frame of video image to obtain a video segment visual feature of the video segment.
For example, a pre-trained Convolutional Neural Network (CNN) model may be used as the visual encoder, such as a visual encoder may be constructed using a depth residual Network model ResNet-50. Specifically, the visual encoder can retain the first 5 CNN layers of the pre-trained ResNet-50 model and add a pooling layer to obtain the feature vectors of the video segments, while adding a fully-connected layer to reduce the dimensionality of the visual features. For each video segment, one fusion layer may be used to aggregate frames in the video segment to obtain the original features of the video segment, which are visual features. Wherein the fusion layer can be used for average pooling.
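As a concrete illustration, the following is a minimal PyTorch-style sketch of such a visual encoder: a truncated pre-trained ResNet-50 followed by a pooling layer and a fully-connected projection, with the sampled frames of a segment aggregated by average pooling. Exactly which ResNet stages are kept and the output dimension are assumptions made here for illustration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SegmentVisualEncoder(nn.Module):
    """Extracts one visual feature vector per video segment from its sampled frames."""
    def __init__(self, out_dim: int = 64):  # out_dim is an assumed value
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Early convolutional stages of the pre-trained ResNet-50 (assumed split).
        self.backbone = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3,
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # added pooling layer
        self.fc = nn.Linear(1024, out_dim)    # added fully-connected layer to reduce dimensionality

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, H, W) sampled from one video segment
        feats = self.backbone(frames)             # (num_frames, 1024, h, w)
        feats = self.pool(feats).flatten(1)       # (num_frames, 1024)
        feats = self.fc(feats)                    # (num_frames, out_dim)
        return feats.mean(dim=0)                  # fusion layer: average pooling over frames
```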
Step 140: and inputting the user characteristics and the visual characteristics of the video clips into the video recommendation model to obtain the interest degree of the user to the video to be recommended corresponding to the user attribute information output by the video recommendation model.
After the user features and the video segment visual features of each video segment are obtained, the user features and all the video segment visual features can be input into the video recommendation model. The video recommendation model can calculate an interest score of each user for each video segment based on the user features; for each user, the scores of the video segments belonging to the same video to be recommended are then fused, for example by weighted fusion, to obtain the user's interest degree in that video to be recommended. In this way, the interest degree of each user characterized by the user features in each video to be recommended can be obtained.
For example, the weighted fusion may be to assign a score smaller than a score threshold to a first weight, assign a score larger than the score threshold to a second weight, and then fuse the weighted scores, for example, average pooling after summing to obtain the interest level of the user in the video to be recommended.
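A minimal sketch of this score fusion, assuming the threshold and the two weights are freely chosen hyperparameters:

```python
def fuse_segment_scores(scores, threshold=0.5, low_weight=0.5, high_weight=1.0):
    """Weighted fusion of per-segment interest scores into one video-level interest degree.

    Scores below the threshold receive the first (lower) weight, scores above it
    receive the second weight; the weighted scores are then averaged. All numeric
    values here are assumed for illustration.
    """
    weighted = [s * (high_weight if s > threshold else low_weight) for s in scores]
    return sum(weighted) / len(weighted)

# Example: interest of one user in a video divided into 5 segments.
video_interest = fuse_segment_scores([0.9, 0.7, 0.2, 0.8, 0.4])
```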
The video recommendation model can be obtained by training based on a sample original video, a sample positive feedback video segment set and a sample negative feedback video segment set corresponding to a user. The sample positive feedback video segment set is a set of video segments of a sample original video completely watched by a user, and the sample negative feedback video segment set is a set of video segments of a moment when a user skips over.
Illustratively, the sample original videos can be obtained by collecting the historical interaction video data of the user corresponding to the user attribute information. The historical interaction video data is the original video file data of the videos the user interacted with; it carries all information of the videos and can support fine-grained division of the videos. When the sample original videos are collected, the user's interaction record information can also be recorded, which may include the user's attribute information, identification information of each sample original video, the user's watching duration for each sample original video, the user's watching behavior on each sample original video, the duration of each sample original video, and other information. From the interaction record information, the user's watching situation for each sample original video can be determined, so that the video segments can be divided and classified into positive and negative sample video segments.
For example, each sample original video may be uniformly divided into 5 video segments. When a user watches sample original video A, if the user chooses to skip while watching the 2nd video segment of sample original video A, the 2nd video segment may be taken as a sample negative feedback video segment; if all 5 video segments of sample original video A are watched, all 5 video segments are taken as sample positive feedback video segments. It can be understood that, for any sample original video, as long as there is an unwatched video segment, none of the 5 video segments of that sample original video is taken as a sample positive feedback video segment; and for any video segment, as long as a skip occurs on that video segment, it is taken as a sample negative feedback video segment. In this way, a sample positive feedback video segment set and a sample negative feedback video segment set can be determined from the sample original videos. A fine-grained training sample is thus obtained that fully considers the segment-level, fine-grained interaction characteristics between the user and the whole sample original video and can capture the differences in the user's interest in different parts of the whole video, so that the trained video recommendation model can capture the user's interest in a video to be recommended at the fine granularity of the video segment level.
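The segment labeling rule just described can be sketched as follows; the segment count and the watch position are assumed inputs in this simplified illustration.

```python
def label_segments(num_segments, skip_segment_index=None):
    """Builds the positive / negative feedback segment lists for one sample video.

    If the user skipped while watching segment `skip_segment_index` (0-based),
    that segment becomes a negative feedback sample and no segment of this video
    is added to the positive set; if the video was watched completely, every
    segment becomes a positive feedback sample.
    """
    positive, negative = [], []
    if skip_segment_index is None:        # watched to completion
        positive = list(range(num_segments))
    else:                                 # skip occurred
        negative = [skip_segment_index]
    return positive, negative

# Video A divided into 5 segments, skip during the 2nd segment (index 1):
pos, neg = label_segments(5, skip_segment_index=1)   # pos = [], neg = [1]
```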
Step 150: and determining a target recommendation video from the video set to be recommended according to the interestingness, and outputting the target recommendation video to the user.
The interest degree can represent the probability that the user watches the video to be recommended, and reflects the interest degree of the user in the video to be recommended.
In an example embodiment, after the interestingness of the user in relation to the videos to be recommended is obtained, for each user, the interestingness of the user in relation to the videos to be recommended may be ranked, for example, ranked from large to small, and then the videos to be recommended corresponding to the top set number of interestingness in the ranking result are determined as the target recommended videos, for example, the videos to be recommended corresponding to the top 10 interestingness are determined as the target recommended videos.
In another example embodiment, after the interestingness of the video to be recommended of the user corresponding to the user attribute information is obtained, for each user, an interestingness greater than an interestingness threshold value may also be determined from the interestingness of the video to be recommended of the user, a target interestingness is obtained, and the video to be recommended corresponding to the target interestingness in the video set to be recommended is determined as the target recommended video.
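Either selection strategy reduces to a few lines; in the sketch below the value of K and the interest threshold are assumed examples.

```python
def select_top_k(interest_by_video, k=10):
    """Ranks videos by interest degree and returns the top-K as target recommended videos."""
    ranked = sorted(interest_by_video.items(), key=lambda kv: kv[1], reverse=True)
    return [video_id for video_id, _ in ranked[:k]]

def select_by_threshold(interest_by_video, threshold=0.8):
    """Returns every video whose interest degree exceeds the threshold."""
    return [v for v, score in interest_by_video.items() if score > threshold]
```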
According to the video recommendation method provided by the embodiment of the invention, each video to be recommended is divided into a preset number of video segments, and the user features extracted from the user attribute information and the visual features of the video segments are input into the video recommendation model, so that the interest degree, output by the video recommendation model, of the user corresponding to the user attribute information in each video to be recommended can be obtained; a target recommended video can then be determined from the video set to be recommended and recommended to the user according to the interest degree, thereby realizing video recommendation based on user interest. The video recommendation model is trained based on a sample original video corresponding to the user, a sample positive feedback video segment set and a sample negative feedback video segment set. The sample positive feedback video segment set is the set of video segments of sample original videos completely watched by the user, and the sample negative feedback video segment set is the set of video segments in which the user's skip behavior occurs. The samples used in model training therefore fully consider the segment-level, fine-grained interaction characteristics between the user and the whole video, and can capture the differences in the user's interest in different parts of the whole video. As a result, the trained video recommendation model can capture the user's interest in a video to be recommended at the fine granularity of the video segment level, the interest degree can be calculated more accurately, videos that better match the user's interests can be recommended when recommendation is performed based on the interest degree output by the video recommendation model, and the accuracy of video recommendation is improved.
Based on the video recommendation method in the embodiment corresponding to fig. 1, fig. 2 exemplarily shows a flowchart of a method for obtaining interest degree of a user to treat a recommended video corresponding to user attribute information based on a video recommendation model according to an embodiment of the present invention, where the method may include the following steps 210 to 220, that is, step 140 may include the following steps 210 to 220.
Step 210: and inputting the user characteristics and the visual characteristics of the video segments into a graph convolution neural network layer of the video recommendation model to obtain positive interest embedding characteristics, negative interest embedding characteristics and video segment embedding characteristics of the user output by the graph convolution neural network layer.
The graph convolutional neural network layer can be used to perform feature-node embedding propagation on the user features and the video segment visual features. The graph convolutional neural network layer may be a graph convolutional network (GCN) model, which learns representations of graph nodes to extract features from graph data. The video segment visual features carry the video image data of the video segments; the information of neighboring nodes can be aggregated with the graph convolutional network model, and the positive interest embedding features and negative interest embedding features of the user can be obtained through the embedding propagation mechanism of the graph convolutional network model, where the positive and negative interest embedding features reflect the user's positive and negative preferences respectively.
Meanwhile, the graph convolution neural network layer can embed and propagate the visual features of the video clips to users based on the positive interest embedding features and the negative interest embedding features so as to mine the similarity features among the video clips, capture high-order information transmission among the video clips and obtain the video clip embedding features.
Illustratively, step 210 may include the steps of: inputting the user characteristics and the visual characteristics of the video segments into a user aggregation layer of a graph-convolution neural network layer of a video recommendation model, and obtaining positive interest embedding characteristics and negative interest embedding characteristics of the user, which are output by the user aggregation layer; and inputting the positive interest embedding characteristics and the negative interest embedding characteristics into a video aggregation layer of the graph convolution neural network layer to obtain video segment embedding characteristics output by the video aggregation layer. The user aggregation layer is used for clustering positive interests and negative interests of visual features of the video segments based on user features, wherein the positive interests represent interesting features of users, and the negative interests represent uninteresting features of the users; the video aggregation layer is used for embedding and transmitting the user to the visual features of the video segments based on the positive interest embedding features and the negative interest embedding features.
Step 220: and inputting the positive interest embedding characteristics, the negative interest embedding characteristics and the video segment embedding characteristics into a fusion prediction layer of the video recommendation model to obtain the interest degree output by the fusion prediction layer.
The fusion prediction layer can be used for fusing a first interest level of a positive interest embedding feature and a second interest level of a negative interest embedding feature which are obtained through calculation based on the video segment embedding feature.
Specifically, for each user, the fusion prediction layer may splice a video segment embedding feature and a positive interest embedding feature of the user together, perform first interest-level calculation on the video segment through a Multilayer perceptron (MLP), simultaneously may splice the video segment embedding feature and a negative interest embedding feature of the user together, perform second interest-level calculation on the video segment through the Multilayer perceptron, and then perform average pooling processing after weighting and summing the first interest-level and the second interest-level, to obtain the interest level output by the fusion prediction layer.
Illustratively, step 220 may be implemented by the following steps: inputting the video segment embedding characteristics into a first prediction layer of a fusion prediction layer of a video recommendation model to obtain target video segment embedding characteristics output by the first prediction layer; inputting the embedding feature, the positive interest embedding feature and the negative interest embedding feature of the target video segment into a second prediction layer of the fusion prediction layer to obtain a first interest degree and a second interest degree output by the second prediction layer; and inputting the first interest degree and the second interest degree into a fusion layer of the fusion prediction layer to obtain the interest degree output by the fusion layer. Wherein the first prediction layer can be used for performing weighted combination on the video segment embedding characteristics and the video segment visual characteristics; the second prediction layer can be used for splicing the positive interest embedded feature and the negative interest embedded feature with the target video segment embedded feature respectively and then carrying out multi-layer perception mapping; the fusion layer may be used to average pool the first and second interestingness.
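A minimal PyTorch-style sketch of such a fusion prediction layer is given below, assuming a two-layer perceptron, equal-dimension embeddings, fixed weighting coefficients, a shared perceptron for the positive and negative branches, and a final Sigmoid; all of these are illustrative assumptions rather than the exact configuration described above.

```python
import torch
import torch.nn as nn

class FusionPredictionLayer(nn.Module):
    """Fuses positive/negative interest embeddings with a segment embedding into one interest degree."""
    def __init__(self, dim: int = 64, alpha0: float = 0.5, alpha1: float = 0.5):
        super().__init__()
        self.alpha0, self.alpha1 = alpha0, alpha1        # weights of the first prediction layer
        self.mlp = nn.Sequential(                        # second prediction layer (multilayer perceptron)
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, seg_embed, seg_visual, pos_interest, neg_interest):
        # First prediction layer: weighted combination of the two segment representations.
        target = self.alpha0 * seg_visual + self.alpha1 * seg_embed
        # Second prediction layer: splice with positive / negative interest, then map through the MLP.
        first = self.mlp(torch.cat([pos_interest, target], dim=-1))
        second = self.mlp(torch.cat([neg_interest, target], dim=-1))
        # Fusion layer: average pooling of the first and second interest degrees.
        return torch.sigmoid((first + second) / 2)
```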
According to the method of the embodiment corresponding to Fig. 2, the positive and negative interest preference features of each user for each video segment can be obtained through the graph convolutional neural network layer, high-order information transmission among the video segments can be captured, and the video segment embedding features can be obtained. The fusion prediction layer can then fuse, at the video segment level, the features output by the graph convolutional neural network layer into the interest degree of each user in each video to be recommended, so that the probability that the user will finish watching the video to be recommended is predicted from the perspective of fine-grained video content, improving prediction accuracy.
In an example embodiment of the present invention, the video recommendation model may further include a feature enhancement embedding layer, which may be used to enhance the video segment visual features based on a transformation matrix. In video recommendation, user interaction behavior may not be determined entirely by visual features; for example, information accompanying a video frame beyond its picture, such as audio and text, may also influence user behavior. For this reason, the extracted visual features of a video segment may be input into the feature enhancement embedding layer and, without changing the embedding dimension, multiplied by a transformation matrix; the transformation matrix allows information such as text and speech to be absorbed into the video segment visual features, so as to obtain an embedding in the user preference space and further improve the accuracy of video recommendation.
Specifically, the visual features of the video segment can be enhanced by using the following feature enhancement formula under the condition that the embedding dimension is not changed. The feature enhancement formula can be expressed as:
E_c = W_t f_c
wherein f_c represents the video segment visual feature of video segment c, E_c represents the video segment enhanced representation feature obtained after enhancing the visual feature f_c, and W_t represents a transformation matrix; W_t is learnable and is obtained through training when the video recommendation model is trained.
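Since the embedding dimension is unchanged, this enhancement is effectively a learnable linear transform; a minimal sketch follows, with the dimension assumed.

```python
import torch.nn as nn

class FeatureEnhancementEmbedding(nn.Module):
    """E_c = W_t f_c: multiplies the segment visual feature by a learnable
    transformation matrix W_t without changing the embedding dimension."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.W_t = nn.Linear(dim, dim, bias=False)   # learnable transformation matrix

    def forward(self, f_c):
        return self.W_t(f_c)
```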
Based on this, before inputting the user characteristics and the visual characteristics of the video segment into the graph convolution neural network layer of the video recommendation model, the video recommendation method provided by the embodiment of the present invention may further include: and inputting the visual features of the video clips into a feature enhancement embedding layer of the video recommendation model to obtain the video clip enhancement representation features output by the feature enhancement embedding layer. Accordingly, step 210 may be inputting the user features and the video segment enhancement representation features into a convolutional neural network layer of the video recommendation model, and obtaining positive interest embedded features, negative interest embedded features and video segment embedded features of the user output by the convolutional neural network layer.
Based on the methods of the foregoing embodiments, fig. 3 exemplarily shows a structural schematic diagram of a video recommendation model provided by an embodiment of the present invention, and the video recommendation model and the video recommendation method described above may be referred to with each other. Referring to fig. 3, the video recommendation model may include a graph convolution neural network layer and a fusion prediction layer, where the graph convolution neural network layer may be configured to perform feature node embedding propagation on user features and video segment visual features to obtain positive interest embedding features, negative interest embedding features, and video segment embedding features of a user; the fusion prediction layer can be used for fusing a first interest degree of the positive interest embedding feature and a second interest degree of the negative interest embedding feature which are obtained based on video segment embedding feature calculation to obtain the interest degree of the user on the video to be recommended.
For example, the graph convolutional neural network layer may include a user aggregation layer and a video aggregation layer, and the user aggregation layer may be configured to perform clustering of positive interests and clustering of negative interests of visual features of video segments based on user features; the video aggregation layer may be used to embed and propagate users to video segment visual features based on positive interest embedding features and negative interest embedding features. The fusion prediction layer may include a first prediction layer, a second prediction layer, and a fusion layer, where the first prediction layer may be configured to perform weighted combination on video segment embedding features and video segment visual features, the second prediction layer may be configured to perform multi-layer perceptual mapping after splicing the positive interest embedding features and the negative interest embedding features with target video segment embedding features, respectively, and the fusion layer may be configured to perform average pooling processing on the first interest level and the second interest level.
For example, the video recommendation model may further include a feature enhancement embedding layer, the video segment visual features may be input into the feature enhancement embedding layer, the feature enhancement embedding layer performs enhancement on the video segment visual features based on the transformation matrix, and then the user features and the video segment enhancement representation features output by the feature enhancement embedding layer are input into the graph convolution neural network layer.
Based on the methods of the foregoing embodiments, fig. 4 illustrates a schematic diagram of a training method of a video recommendation model, and referring to fig. 4, the video recommendation model may include a feature enhancement embedding layer, a graph convolution neural network layer, and a fusion prediction layer.
Before the video recommendation model training is performed, each sample original video may be divided, for example, into N (N is a positive integer greater than 1) sample video segments, each sample video segment is sampled to obtain a video frame corresponding to each sample video segment, then the video frame is visually encoded to extract visual features at a frame level of each sample video segment, and the visual features of the video frames corresponding to each sample video segment may be fused by a time sequence fusion method to obtain a sample video segment visual feature corresponding to each sample video segment. In this way, fine-grained raw features of each sample raw video can be extracted.
The sample original videos can be obtained by collecting the user's historical interaction video data, which is the original video file data of the videos the user interacted with; it carries all information of the videos and can support fine-grained division of the original videos. For example, before the sample original videos are divided, data preprocessing may be performed to filter out invalid sample original videos and the corresponding interaction records, so as to improve the quality of the training samples. The interaction record information represents the user's watching situation for each sample original video.
Meanwhile, based on the watching behavior of the user on the original sample video, the sample video segments interacted by the user can be divided into a sample positive feedback video segment set and a sample negative feedback video segment set according to the playing completion or the skipping, and label information is added respectively. Specifically, the sample positive feedback video segment set is a set of video segments of the sample original video completely watched by the user, and the sample negative feedback video segment set is a set of video segments of the moment when the user skips over. It can be understood that, in the process of acquiring the sample positive feedback video segment set and the sample negative feedback video segment set, the user attribute information corresponding to the sample is obtained.
And then, inputting the extracted visual features of the sample video segments into a feature enhancement embedding layer, inputting a sample positive feedback video segment set and a sample negative feedback video segment set added with label information into a graph convolution neural network layer, and training a video recommendation model by adopting a mixed supervised learning training mode.
Specifically, in the feature enhancement embedding layer, feature enhancement can be performed on the sample video segment visual features based on the feature enhancement formula to obtain the sample video segment enhanced representation features E_c. Suppose that the sample positive feedback video segment set of user u is P_u and the sample negative feedback video segment set of user u is N_u. Based on P_u and N_u, a positive interest interaction relation matrix R_p and a negative interest interaction relation matrix R_n between users and sample video segments can be constructed at the graph convolutional neural network layer. The elements of the two matrices may be defined as indicator values: the element r^p_(i,j) of R_p equals 1 when the j-th sample video segment c_j belongs to P_(u_i), the sample positive feedback video segment set of the i-th user u_i, and equals 0 otherwise; the element r^n_(i,j) of R_n equals 1 when c_j belongs to N_(u_i), the sample negative feedback video segment set of the i-th user u_i, and equals 0 otherwise.
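A small sketch of this construction, assuming users and segments are indexed densely from 0 and the feedback sets are given as Python sets of segment indices:

```python
import numpy as np

def build_interaction_matrices(num_users, num_segments, positive_sets, negative_sets):
    """Builds the positive / negative interest interaction matrices R_p and R_n.

    positive_sets[i] / negative_sets[i] hold the segment indices in the i-th
    user's sample positive / negative feedback video segment sets.
    """
    R_p = np.zeros((num_users, num_segments), dtype=np.float32)
    R_n = np.zeros((num_users, num_segments), dtype=np.float32)
    for i in range(num_users):
        for j in positive_sets[i]:
            R_p[i, j] = 1.0
        for j in negative_sets[i]:
            R_n[i, j] = 1.0
    return R_p, R_n
```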
Further, the adjacency matrix A_p of R_p and the adjacency matrix A_n of R_n can be used to characterize the connection relations between the users and the sample video segments. Specifically, A_p is the bipartite adjacency matrix assembled from R_p and its transpose (R_p)^T, and A_n is assembled from R_n and its transpose (R_n)^T. Further, A_p and A_n can be normalized respectively to obtain the corresponding normalization results, wherein D_p represents the degree matrix of A_p and D_n represents the degree matrix of A_n; the normalized matrices contain the normalization results of R_p and R_n together with their respective transposes.
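A sketch of the adjacency construction and degree-based normalization, assuming the usual symmetric GCN normalization D^(-1/2) A D^(-1/2); the exact normalization used in the application's formulas may differ.

```python
import numpy as np

def build_normalized_adjacency(R):
    """Bipartite adjacency A = [[0, R], [R^T, 0]] with symmetric degree normalization."""
    num_u, num_c = R.shape
    A = np.zeros((num_u + num_c, num_u + num_c), dtype=np.float32)
    A[:num_u, num_u:] = R
    A[num_u:, :num_u] = R.T
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5      # entries of D^(-1/2), safe for isolated nodes
    D_inv_sqrt = np.diag(d_inv_sqrt)
    return D_inv_sqrt @ A @ D_inv_sqrt              # normalized adjacency matrix

# A_p_norm = build_normalized_adjacency(R_p); A_n_norm = build_normalized_adjacency(R_n)
```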
Neighbor node information can then be aggregated with the GCN model based on the normalization results, so as to learn positive and negative interest representations for the user as well as embedded representations for the sample video segments. Through the embedding propagation mechanism of the GCN model, the sample positive interest embedding features $H_u^{p}$ and the sample negative interest embedding features $H_u^{n}$ of the users can be obtained:

$$ H_u^{p}=\sigma\!\left(\hat{R}_p\,E_c\,W_u^{p}\right),\qquad H_u^{n}=\sigma\!\left(\hat{R}_n\,E_c\,W_u^{n}\right) $$

where $E_c$ is the sample video segment enhanced representation feature output by the feature enhancement embedding layer, $\sigma(\cdot)$ represents the Sigmoid function used as the activation function of the graph convolution neural network layer, and $W_u^{p}$ and $W_u^{n}$ are trainable weight matrices. Based on this, the video segment-user propagation in FIG. 4 can be implemented, whereby the positive and negative interest preferences of the user for the sample video segments can be captured.
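Illustratively, one plausible form of this segment-to-user propagation could be sketched as follows; the tensor shapes and the exact propagation rule are assumptions rather than the embodiment's precise equations.

```python
# Minimal sketch: aggregate enhanced segment features into user positive /
# negative interest embeddings with trainable weights and a Sigmoid activation.
import torch
import torch.nn as nn

num_users, num_segments, dim = 4, 6, 16

R_p_hat = torch.rand(num_users, num_segments)  # normalized positive relations
R_n_hat = torch.rand(num_users, num_segments)  # normalized negative relations
E_c = torch.rand(num_segments, dim)            # enhanced segment features

W_u_p = nn.Linear(dim, dim, bias=False)        # trainable weight matrix (positive side)
W_u_n = nn.Linear(dim, dim, bias=False)        # trainable weight matrix (negative side)

H_u_p = torch.sigmoid(W_u_p(R_p_hat @ E_c))    # segment -> user propagation (positive)
H_u_n = torch.sigmoid(W_u_n(R_n_hat @ E_c))    # segment -> user propagation (negative)
print(H_u_p.shape, H_u_n.shape)                # (num_users, dim) each
```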
In addition to the association between users and sample video segments, higher-order information also exists between sample video segments. For example, two sample video segments may be connected by the same user in the positive interest relation graph or the negative interest relation graph, and such segments are likely to have similar characteristics. The relationship between sample video segments can therefore be mined through two-hop embedding propagation in the graph convolution neural network layer, that is, the user-video segment propagation shown in FIG. 4, which improves the training effect of the video recommendation model and the accuracy of video recommendation.
In particular, a first sub-sample video segment embedding feature $H_c^{p}$ and a second sub-sample video segment embedding feature $H_c^{n}$ can be obtained through user-video segment propagation:

$$ H_c^{p}=\sigma\!\left(\hat{R}_p^{\mathsf T}\,H_u^{p}\,W_c^{p}\right),\qquad H_c^{n}=\sigma\!\left(\hat{R}_n^{\mathsf T}\,H_u^{n}\,W_c^{n}\right) $$

where $W_c^{p}$ and $W_c^{n}$ are trainable weight matrices and $\sigma(\cdot)$ represents the Sigmoid function used as the activation function of the graph convolution neural network layer.
The graph convolution neural network layer may then perform average pooling on $H_c^{p}$ and $H_c^{n}$ to obtain the final representation of the sample video segments, i.e., the sample video segment embedding features $\bar{H}_c$. Illustratively, $\bar{H}_c$ may be obtained through the average pooling formula

$$ \bar{H}_c=\mathrm{Mean}\!\left(H_c^{p},\,H_c^{n}\right) $$

where $\mathrm{Mean}(\cdot)$ represents average pooling.
At this point, the sample video segment enhanced representation features $E_c$ and the sample video segment embedding features $\bar{H}_c$ have been obtained. They carry different information about the sample video segments and can be input into the fusion prediction layer to form the final content representation of the sample video segments.
Further, in the fusion prediction layer, $E_c$ and $\bar{H}_c$ can be adaptively combined through coefficient weighting to obtain the target video segment embedding features $H_c$. Specifically, $E_c$ and $\bar{H}_c$ can be combined based on the following weighted combination formula:

$$ H_c=\alpha_0\,E_c+\alpha_1\,\bar{H}_c $$

where $\alpha_0$ and $\alpha_1$ are weighting coefficients.
Meanwhile, the fusion prediction layer receives the sample positive interest embedding features $H_u^{p}$ and the sample negative interest embedding features $H_u^{n}$ output by the graph convolution neural network layer. These two features can each be spliced with the target video segment embedding features $H_c$, and a multilayer perceptron is then used to perform segment-level interest prediction on the positive and negative sides, obtaining a first interest level and a second interest level for each sample video segment. Specifically, the first interest level $\hat{y}^{p}_{u,c_i}$ and the second interest level $\hat{y}^{n}_{u,c_i}$ of user u in the i-th sample video segment $c_i$ of sample video c can be obtained through the following multilayer perceptron mechanism:

$$ \hat{y}^{p}_{u,c_i}=\mathrm{MLP}_p\!\left(\left[h_u^{p}\,\|\,h_{c_i}\right]\right),\qquad \hat{y}^{n}_{u,c_i}=\mathrm{MLP}_n\!\left(\left[h_u^{n}\,\|\,h_{c_i}\right]\right) $$

where $h_u^{p}$ represents the embedding of user u in $H_u^{p}$, $h_u^{n}$ represents the embedding of user u in $H_u^{n}$, $h_{c_i}$ represents the feature of the i-th sample video segment $c_i$ in $H_c$, $[\cdot\,\|\,\cdot]$ represents splicing the two features together, and $\mathrm{MLP}_p(\cdot)$ and $\mathrm{MLP}_n(\cdot)$ represent multilayer perceptron mappings.
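Illustratively, the wiring of the fusion prediction layer described above could be sketched as follows: weighted combination into $H_c$, splicing with the user's positive and negative interest embeddings, and two MLP heads. Layer sizes, coefficient values and tensor shapes are assumptions for illustration.

```python
# Minimal sketch: fuse segment features into H_c and score each segment with
# positive / negative MLP heads over concatenated user and segment features.
import torch
import torch.nn as nn

num_segments, dim = 6, 16
E_c = torch.rand(num_segments, dim)      # enhanced segment features
H_c_bar = torch.rand(num_segments, dim)  # propagated segment embeddings
alpha0, alpha1 = 0.5, 0.5                # weighting coefficients (assumed values)
H_c = alpha0 * E_c + alpha1 * H_c_bar    # target segment embedding

h_u_p = torch.rand(dim)                  # user's positive interest embedding
h_u_n = torch.rand(dim)                  # user's negative interest embedding

mlp_p = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
mlp_n = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

user_p = h_u_p.expand(num_segments, dim)  # repeat the user embedding per segment
user_n = h_u_n.expand(num_segments, dim)
y_p = mlp_p(torch.cat([user_p, H_c], dim=1)).squeeze(-1)  # first interest levels
y_n = mlp_n(torch.cat([user_n, H_c], dim=1)).squeeze(-1)  # second interest levels
print(y_p.shape, y_n.shape)               # (num_segments,) each
```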
Then, the fusion prediction layer can perform average pooling on $\hat{y}^{p}_{u,c_i}$ and $\hat{y}^{n}_{u,c_i}$ to obtain the interest level $\hat{y}_{u,c}$ of user u in sample video c. Specifically, the fusion prediction layer may first fuse $\hat{y}^{p}_{u,c_i}$ and $\hat{y}^{n}_{u,c_i}$ through a set of weighting coefficients $\alpha_p$ and $\alpha_n$ to obtain the final prediction result of user u for the i-th sample video segment $c_i$ of sample video c, which can be recorded as the predicted interest level $\hat{y}_{u,c_i}$. Illustratively, $\hat{y}_{u,c_i}$ can be obtained based on the interest fusion formula

$$ \hat{y}_{u,c_i}=\alpha_p\,\hat{y}^{p}_{u,c_i}-\alpha_n\,\hat{y}^{n}_{u,c_i} $$

A subtraction operation is adopted during fusion, which can reduce the negative influence of the negative interest preference on the final prediction result and improve the prediction accuracy.
Then, the fusion prediction layer can fuse the final prediction results of all sample video segments to obtain the interest level of user u in sample video c. Specifically, the fusion prediction layer may obtain the probability that user u finishes watching sample video c, i.e., the interest level $\hat{y}_{u,c}$ of user u in sample video c, through the following average pooling formula:

$$ \hat{y}_{u,c}=\frac{1}{N_c}\sum_{i=1}^{N_c}\hat{y}_{u,c_i} $$

where $N_c$ is the number of sample video segments into which sample video c is divided.
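A small sketch of the subtraction-based fusion and the average pooling that yields the video-level interest level is given below; the coefficient values are assumed.

```python
# Minimal sketch: fuse segment-level scores by subtraction, then average-pool
# them into a single video-level interest level.
import torch

y_p = torch.rand(6)             # first interest level per segment
y_n = torch.rand(6)             # second interest level per segment
alpha_p, alpha_n = 1.0, 1.0     # weighting coefficients (assumed values)

y_seg = alpha_p * y_p - alpha_n * y_n   # subtraction damps the negative-interest influence
y_video = y_seg.mean()                  # average pooling over the N_c segments
print(float(y_video))
```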
It can be understood that the sample video segments used in training the video recommendation model may come from the historical interaction videos of multiple users; following the above principle, the video recommendation model can learn the interest level of each user in each sample video.
Further, based on the above description of the principle of FIG. 4, the loss function L used during video recommendation model training may be obtained by a weighted summation of the user preference loss function L1 and the video segment loss function L2. Illustratively, the loss function L may be expressed as:

$$ L=\alpha L_1+\beta L_2+\lambda\lVert\theta\rVert^{2} $$

where $\alpha$, $\beta$ and $\lambda$ are weighting coefficients and $\lVert\theta\rVert^{2}$ represents the regularization term.
Specifically, the interest preference of the user can be learned from fine-grained sample video segments based on the user preference loss function L1. Sample videos completely watched by the user reflect the user's positive interest preference to a certain extent, so their sample video segments can be used as positive samples. The sample video segment in which the user's skip action occurs, together with nearby sample video segments, reflects the user's negative interest preference to a certain extent, so the sample video segment where the skip action occurs can be used as a negative sample; for sample video segments after the moment the skip action occurs, the user's interest tendency cannot be determined, so these segments can be excluded. However, this still ignores differences in the strength of positive samples. Based on this, different penalty coefficients can be set for sample video segments at different positions of the same sample video, and the strength of positive samples can be distinguished through the penalty coefficients. Specifically, the penalty coefficient of the j-th sample video segment can be set as $\omega_j=j/N_c$, which indicates that as the user's browsing time increases, the j-th sample video segment can be treated as a positive sample with higher confidence.
In addition, within the same interactive sample video, the sample video segments skipped by the user always reflect more negative interest preference than the part watched by the user. The part watched by the user may consist of several sample video segments, which can be combined as a whole to represent a relatively positive part. Illustratively, a Bayesian Personalized Ranking (BPR) loss may be employed to handle the loss between sample video segments, enabling the video recommendation model to distinguish positive-interest-preference sample video segments from negative-interest-preference sample video segments within the same interactive sample video. That is, for each fine-grained user interaction record (u, v, r) there is a pair of positive and negative samples, where u represents the user, v represents the sample video viewed by user u, r represents the percentage of video v watched by user u, and r < 1.
On this basis, the user preference loss function L1 and the video segment loss function L2 can be defined as:

$$ L_1=-\sum_{u\in U}\left(\sum_{j=1}^{N_c}\omega_j\sum_{c\in\mathcal{P}_u^{j}}\log\sigma\!\left(\hat{y}_{u,c}\right)+\sum_{c\in\mathcal{N}_u}\log\!\left(1-\sigma\!\left(\hat{y}_{u,c}\right)\right)\right) $$

$$ L_2=-\sum_{u\in U}\sum_{v\in\mathcal{V}_u}\log\sigma\!\left(\hat{y}^{+}_{u,v}-\hat{y}^{-}_{u,v}\right) $$

where U represents the user attribute information set, u represents user attribute information in U, $N_c$ represents the number of sample video segments divided from the original video, j represents the segment index, $\mathcal{P}_u^{j}$ represents the set of sample video segments that are the j-th segment of the completely viewed sample videos in the sample positive feedback video segment set of user u, $\mathcal{N}_u$ represents the sample negative feedback video segment set of user u, $\omega_j$ is a penalty coefficient and $\omega_j=j/N_c$, $\hat{y}_{u,c}$ represents the interest level of user u in sample video segment c, $\sigma(\cdot)$ represents the Sigmoid function, $\mathcal{V}_u$ represents the set of sample videos not completely viewed by user u, v represents one element of $\mathcal{V}_u$, $\hat{y}^{+}_{u,v}$ represents the interest level of user u in the positive feedback video segments of the incompletely viewed sample video, and $\hat{y}^{-}_{u,v}$ represents the interest level of user u in the negative feedback video segments of the incompletely viewed sample video.
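Illustratively, the training objective described above could be assembled as in the following sketch; the concrete loss expressions, coefficient values and tensor shapes are assumptions for illustration rather than the exact formulas of the embodiment.

```python
# Minimal sketch: position-weighted positive/negative segment loss (L1),
# BPR-style pairwise loss over incompletely viewed videos (L2), and the
# weighted total loss with a regularization term.
import torch
import torch.nn.functional as F

N_c = 5
y_pos_segments = torch.rand(N_c)   # predictions for the j-th segments of fully watched videos
y_neg_segments = torch.rand(3)     # predictions for skipped segments

omega = torch.arange(1, N_c + 1, dtype=torch.float32) / N_c   # omega_j = j / N_c
L1 = -(omega * F.logsigmoid(y_pos_segments)).sum() \
     - torch.log(1.0 - torch.sigmoid(y_neg_segments)).sum()

# BPR-style term: the watched part of a partially viewed video should outscore
# the skipped part of the same video.
y_watched_part = torch.rand(4)
y_skipped_part = torch.rand(4)
L2 = -F.logsigmoid(y_watched_part - y_skipped_part).sum()

alpha, beta, lam = 1.0, 1.0, 1e-4   # assumed weighting coefficients
theta = torch.rand(10)              # stand-in for the model parameters
L = alpha * L1 + beta * L2 + lam * theta.pow(2).sum()
print(float(L))
```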
In the training process of the video recommendation model, the loss function L can be continuously optimized until its value is minimized.
The following describes the video recommendation apparatus provided by the present invention, and the video recommendation apparatus described below and the video recommendation method described above may be referred to correspondingly.
Fig. 5 is a schematic structural diagram illustrating a video recommendation apparatus according to an embodiment of the present invention, and referring to fig. 5, the video recommendation apparatus 500 may include an obtaining module 510, a dividing module 520, a feature extraction module 530, a processing module 540, a determining module 550, and an output module 560. Wherein: the obtaining module 510 may be configured to obtain user attribute information and a video set to be recommended; the dividing module 520 may be configured to divide each video to be recommended in the video set to be recommended into a preset number of video segments; the feature extraction module 530 may be configured to perform feature extraction on the user attribute information and the video segment, respectively, to obtain a user feature and a video segment visual feature; the processing module 540 may be configured to input the user characteristics and the video clip visual characteristics into the video recommendation model, and obtain the interest degree of the user in the video to be recommended, where the user attribute information output by the video recommendation model corresponds to the user; the determining module 550 may be configured to determine a target recommended video from the set of videos to be recommended according to the interestingness; the output module 560 may be used to output the target recommendation video to the user; the video recommendation model is obtained by training based on a sample original video, a sample positive feedback video segment set and a sample negative feedback video segment set corresponding to a user, wherein the sample positive feedback video segment set is a set of video segments of the sample original video completely watched by the user, and the sample negative feedback video segment set is a set of video segments where the skipping action of the user occurs.
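Illustratively, the cooperation of the modules in Fig. 5 at inference time could be sketched as follows; all function names and the stand-in scoring logic are assumptions for illustration rather than the actual implementation of the apparatus.

```python
# Minimal sketch of the obtain -> divide -> extract -> predict -> select -> output flow.
from typing import Dict, List

def divide(video: str, num_segments: int) -> List[str]:
    """Dividing module: split one candidate video into a preset number of segments."""
    return [f"{video}#seg{i}" for i in range(num_segments)]

def extract_user_features(user_info: Dict) -> List[float]:
    """Feature extraction module (user side); placeholder feature values."""
    return [float(len(str(v))) for v in user_info.values()]

def extract_segment_features(segment: str) -> List[float]:
    """Feature extraction module (segment side); placeholder feature values."""
    return [float(len(segment))]

def predict_interest(user_feat: List[float], seg_feats: List[List[float]]) -> float:
    """Processing module: stand-in for the trained video recommendation model."""
    return sum(sum(f) for f in seg_feats) / (1.0 + sum(user_feat))

def recommend(user_info: Dict, candidates: List[str], num_segments: int = 4) -> str:
    """Determining and output modules: score every candidate and return the best one."""
    user_feat = extract_user_features(user_info)
    scores = {v: predict_interest(user_feat,
                                  [extract_segment_features(s) for s in divide(v, num_segments)])
              for v in candidates}
    return max(scores, key=scores.get)

print(recommend({"age": 30, "city": "Beijing"}, ["v1", "video_b", "long_video_c"]))
```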
In an example embodiment, the processing module 540 may include: the first processing unit can be used for inputting the user characteristics and the video segment visual characteristics into a graph convolution neural network layer of a video recommendation model to obtain positive interest embedded characteristics, negative interest embedded characteristics and video segment embedded characteristics of a user, which are output by the graph convolution neural network layer, and the graph convolution neural network layer is used for embedding and propagating the user characteristics and the video segment visual characteristics in characteristic nodes; the second processing unit may be configured to input the positive interest embedding feature, the negative interest embedding feature, and the video segment embedding feature into a fusion prediction layer of the video recommendation model, to obtain an interest level output by the fusion prediction layer, where the fusion prediction layer is configured to fuse a first interest level of the positive interest embedding feature and a second interest level of the negative interest embedding feature, which are obtained based on the video segment embedding feature calculation.
In an example embodiment, the first processing unit may include: the first processing subunit is configured to input the user characteristics and the video segment visual characteristics into a user aggregation layer of a graph convolution neural network layer of the video recommendation model, and obtain positive interest embedding characteristics and negative interest embedding characteristics of the user output by the user aggregation layer; the second processing subunit is configured to input the positive interest embedding feature and the negative interest embedding feature into a video aggregation layer of the graph convolution neural network layer, and obtain a video segment embedding feature output by the video aggregation layer; the user aggregation layer is used for clustering positive interests and negative interests of visual features of the video segments based on user features; the video aggregation layer is used for embedding and transmitting the user to the visual features of the video segments based on the positive interest embedding features and the negative interest embedding features.
In an example embodiment, the second processing unit may include: the third processing subunit may be configured to input the video segment embedding feature into a first prediction layer of a fusion prediction layer of the video recommendation model, to obtain a target video segment embedding feature output by the first prediction layer, where the first prediction layer is configured to perform weighted combination on the video segment embedding feature and the video segment visual feature; the fourth processing subunit is configured to input the target video segment embedding feature, the positive interest embedding feature, and the negative interest embedding feature into a second prediction layer of the fusion prediction layer, and obtain a first interest level and a second interest level output by the second prediction layer, where the second prediction layer is configured to perform multilayer perceptual mapping after splicing the positive interest embedding feature and the negative interest embedding feature with the target video segment embedding feature, respectively; the fifth processing subunit may be configured to input the first interest level and the second interest level into a fusion layer of the fusion prediction layer to obtain an interest level output by the fusion layer, where the fusion layer is configured to perform average pooling processing on the first interest level and the second interest level.
In an example embodiment, the processing module 540 may further include a third processing unit, and the third processing unit may be configured to input the visual features of the video segment into a feature enhancement embedding layer of the video recommendation model, and obtain video segment enhancement representation features output by the feature enhancement embedding layer, where the feature enhancement embedding layer is configured to enhance the visual features of the video segment based on the transformation matrix; the first processing unit may be specifically configured to input the user features and the video segment enhancement representation features into the graph convolution neural network layer of the video recommendation model, and obtain positive interest embedded features, negative interest embedded features, and video segment embedded features of the user output by the graph convolution neural network layer.
In an example embodiment, the loss function during the training of the video recommendation model is obtained by performing weighted summation on a user preference loss function L1 and a video segment loss function L2; wherein, the user preference loss function L1 is:
$$ L_1=-\sum_{u\in U}\left(\sum_{j=1}^{N_c}\omega_j\sum_{c\in\mathcal{P}_u^{j}}\log\sigma\!\left(\hat{y}_{u,c}\right)+\sum_{c\in\mathcal{N}_u}\log\!\left(1-\sigma\!\left(\hat{y}_{u,c}\right)\right)\right) $$

The video segment loss function L2 is:

$$ L_2=-\sum_{u\in U}\sum_{v\in\mathcal{V}_u}\log\sigma\!\left(\hat{y}^{+}_{u,v}-\hat{y}^{-}_{u,v}\right) $$

where U represents the user attribute information set, u represents user attribute information in U, $N_c$ represents the number of sample video segments divided from the sample original video, j represents the segment index, $\mathcal{P}_u^{j}$ represents the set of sample video segments that are the j-th segment of the completely viewed sample videos in the sample positive feedback video segment set of user u, $\mathcal{N}_u$ represents the sample negative feedback video segment set of user u, $\omega_j$ is a penalty factor and $\omega_j=j/N_c$, $\hat{y}_{u,c}$ represents the interest level of user u in sample video segment c, $\sigma(\cdot)$ represents the Sigmoid function, $\mathcal{V}_u$ represents the set of sample videos not completely viewed by user u, v represents one element of $\mathcal{V}_u$, $\hat{y}^{+}_{u,v}$ represents the interest level of user u in the positive feedback video segments of the incompletely viewed sample video, and $\hat{y}^{-}_{u,v}$ represents the interest level of user u in the negative feedback video segments of the incompletely viewed sample video.
Fig. 6 illustrates a physical structure diagram of an electronic device, and as shown in fig. 6, the electronic device 600 may include: a processor (processor) 610, a Communication Interface (Communication Interface) 620, a memory (memory) 630 and a Communication bus 640, wherein the processor 610, the Communication Interface 620 and the memory 630 complete Communication with each other through the Communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform the video recommendation method provided by the above-described method embodiments, which may include: acquiring user attribute information and a video set to be recommended; dividing each video to be recommended in a video set to be recommended into a preset number of video segments; respectively extracting the user attribute information and the video clip to obtain the user characteristics and the video clip visual characteristics; inputting the user characteristics and the video clip visual characteristics into a video recommendation model to obtain the interest degree of a user to be recommended video corresponding to the user attribute information output by the video recommendation model; determining a target recommendation video from a video set to be recommended according to the interestingness, and outputting the target recommendation video to a user; the video recommendation model is obtained by training based on a sample original video corresponding to a user, a sample positive feedback video segment set and a sample negative feedback video segment set, wherein the sample positive feedback video segment set is a set of video segments of the sample original video completely watched by the user, and the sample negative feedback video segment set is a set of video segments where the skipping action of the user occurs.
In addition, the logic instructions in the memory 630 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention or a part thereof which substantially contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium, and when the computer program is executed by a processor, a computer can execute the video recommendation method provided by the above method embodiments, and the method can include: acquiring user attribute information and a video set to be recommended; dividing each video to be recommended in a video set to be recommended into a preset number of video segments; respectively extracting the user attribute information and the video clip to obtain the user characteristics and the video clip visual characteristics; inputting the user characteristics and the visual characteristics of the video clips into a video recommendation model, and obtaining the interest degree of a user to-be-recommended video corresponding to user attribute information output by the video recommendation model; determining a target recommendation video from a video set to be recommended according to the interestingness, and outputting the target recommendation video to a user; the video recommendation model is obtained by training based on a sample original video, a sample positive feedback video segment set and a sample negative feedback video segment set corresponding to a user, wherein the sample positive feedback video segment set is a set of video segments of the sample original video completely watched by the user, and the sample negative feedback video segment set is a set of video segments where the skipping action of the user occurs.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor is implemented to perform the video recommendation method provided by the above method embodiments, the method may include: acquiring user attribute information and a video set to be recommended; dividing each video to be recommended in a video set to be recommended into a preset number of video segments; respectively extracting the user attribute information and the video clip to obtain the user characteristics and the video clip visual characteristics; inputting the user characteristics and the visual characteristics of the video clips into a video recommendation model, and obtaining the interest degree of a user to-be-recommended video corresponding to user attribute information output by the video recommendation model; determining a target recommendation video from a video set to be recommended according to the interestingness, and outputting the target recommendation video to a user; the video recommendation model is obtained by training based on a sample original video, a sample positive feedback video segment set and a sample negative feedback video segment set corresponding to a user, wherein the sample positive feedback video segment set is a set of video segments of the sample original video completely watched by the user, and the sample negative feedback video segment set is a set of video segments where the skipping action of the user occurs.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for video recommendation, comprising:
acquiring user attribute information and a video set to be recommended;
dividing each video to be recommended in the video set to be recommended into a preset number of video segments;
respectively extracting the user attribute information and the video clip to obtain user characteristics and video clip visual characteristics;
inputting the user characteristics and the video clip visual characteristics into a video recommendation model, and obtaining the interest degree of the user corresponding to the user attribute information output by the video recommendation model on the video to be recommended;
determining a target recommendation video from the video set to be recommended according to the interestingness, and outputting the target recommendation video to the user;
the video recommendation model is obtained by training based on a sample original video corresponding to the user, a sample positive feedback video segment set and a sample negative feedback video segment set, wherein the sample positive feedback video segment set is a set of video segments of the sample original video completely watched by the user, and the sample negative feedback video segment set is a set of video segments where the occurrence moment of the user's skip action is located.
2. The video recommendation method according to claim 1, wherein the inputting the user characteristics and the video clip visual characteristics into a video recommendation model to obtain the user attribute information output by the video recommendation model corresponding to the user's interest level in the video to be recommended comprises:
inputting the user features and the video segment visual features into a graph convolution neural network layer of the video recommendation model to obtain positive interest embedding features, negative interest embedding features and video segment embedding features of the user output by the graph convolution neural network layer, wherein the graph convolution neural network layer is used for embedding and transmitting feature nodes of the user features and the video segment visual features;
and inputting the positive interest embedding feature, the negative interest embedding feature and the video segment embedding feature into a fusion prediction layer of the video recommendation model to obtain the interest level output by the fusion prediction layer, wherein the fusion prediction layer is used for fusing a first interest level of the positive interest embedding feature and a second interest level of the negative interest embedding feature which are obtained by calculation based on the video segment embedding feature.
3. The video recommendation method according to claim 2, wherein said inputting said user features and said video segment visual features into a graph convolution neural network layer of said video recommendation model, and obtaining positive interest embedded features, negative interest embedded features and video segment embedded features of said user output by said graph convolution neural network layer, comprises:
inputting the user characteristics and the video clip visual characteristics into a user aggregation layer of a graph convolution neural network layer of the video recommendation model to obtain positive interest embedding characteristics and negative interest embedding characteristics of the user, which are output by the user aggregation layer;
inputting the positive interest embedding feature and the negative interest embedding feature into a video aggregation layer of the graph convolution neural network layer to obtain a video segment embedding feature output by the video aggregation layer;
the user aggregation layer is used for clustering positive interests and negative interests of the visual features of the video segments based on the user features; the video aggregation layer is used for carrying out embedded propagation on the user to the visual features of the video segments based on the positive interest embedding features and the negative interest embedding features.
4. The video recommendation method according to claim 2 or 3, wherein said inputting said positive interest embedding feature, said negative interest embedding feature and said video segment embedding feature into a fusion prediction layer of said video recommendation model, and obtaining said interestingness output by said fusion prediction layer comprises:
inputting the video segment embedding features into a first prediction layer of a fusion prediction layer of the video recommendation model to obtain target video segment embedding features output by the first prediction layer, wherein the first prediction layer is used for carrying out weighted combination on the video segment embedding features and the video segment visual features;
inputting the target video segment embedding feature, the positive interest embedding feature and the negative interest embedding feature into a second prediction layer of the fusion prediction layer to obtain the first interest degree and the second interest degree output by the second prediction layer, wherein the second prediction layer is used for splicing the positive interest embedding feature and the negative interest embedding feature with the target video segment embedding feature respectively and then carrying out multilayer perception mapping;
and inputting the first interestingness and the second interestingness into a fusion layer of the fusion prediction layer to obtain the interestingness output by the fusion layer, wherein the fusion layer is used for carrying out average pooling processing on the first interestingness and the second interestingness.
5. The video recommendation method according to claim 2 or 3, wherein before said inputting said user features and said video segment visual features into a graph convolution neural network layer of said video recommendation model, said method further comprises:
inputting the video clip visual features into a feature enhancement embedding layer of the video recommendation model to obtain video clip enhancement representation features output by the feature enhancement embedding layer, wherein the feature enhancement embedding layer is used for enhancing the video clip visual features based on a transformation matrix;
the inputting the user characteristics and the video clip visual characteristics into a graph convolution neural network layer of the video recommendation model to obtain positive interest embedding characteristics, negative interest embedding characteristics and video clip embedding characteristics of the user output by the graph convolution neural network layer comprises:
inputting the user characteristics and the video segment enhancement representation characteristics into a graph convolution neural network layer of the video recommendation model, and obtaining positive interest embedding characteristics, negative interest embedding characteristics and video segment embedding characteristics of the user output by the graph convolution neural network layer.
6. The video recommendation method according to any one of claims 1 to 3, wherein the loss function during the video recommendation model training is obtained by performing weighted summation on a user preference loss function L1 and a video segment loss function L2;
the user preference loss function L1 is:

$$ L_1=-\sum_{u\in U}\left(\sum_{j=1}^{N_c}\omega_j\sum_{c\in\mathcal{P}_u^{j}}\log\sigma\!\left(\hat{y}_{u,c}\right)+\sum_{c\in\mathcal{N}_u}\log\!\left(1-\sigma\!\left(\hat{y}_{u,c}\right)\right)\right) $$

the video segment loss function L2 is:

$$ L_2=-\sum_{u\in U}\sum_{v\in\mathcal{V}_u}\log\sigma\!\left(\hat{y}^{+}_{u,v}-\hat{y}^{-}_{u,v}\right) $$

wherein U represents a user attribute information set, u represents user attribute information in U, $N_c$ represents the number of sample video segments divided from the sample original video, j represents a segment index, $\mathcal{P}_u^{j}$ represents the set of sample video segments that are the j-th segment of the completely viewed sample videos in the sample positive feedback video segment set of user u, $\mathcal{N}_u$ represents the sample negative feedback video segment set of user u, $\omega_j$ is a penalty factor and $\omega_j=j/N_c$, $\hat{y}_{u,c}$ represents the interest level of user u in sample video segment c, $\sigma(\cdot)$ represents the Sigmoid function, $\mathcal{V}_u$ represents the set of sample videos not completely viewed by user u, v represents an element of $\mathcal{V}_u$, $\hat{y}^{+}_{u,v}$ represents the interest level of user u in the positive feedback video segments of the incompletely viewed sample video, and $\hat{y}^{-}_{u,v}$ represents the interest level of user u in the negative feedback video segments of the incompletely viewed sample video.
7. A video recommendation apparatus, comprising:
the acquisition module is used for acquiring user attribute information and a video set to be recommended;
the dividing module is used for dividing each video to be recommended in the video set to be recommended into a preset number of video segments;
the characteristic extraction module is used for respectively extracting the characteristics of the user attribute information and the video clip to obtain user characteristics and video clip visual characteristics;
the processing module is used for inputting the user characteristics and the video clip visual characteristics into a video recommendation model and obtaining the interest degree of the user corresponding to the user attribute information output by the video recommendation model on the video to be recommended;
the determining module is used for determining a target recommended video from the video set to be recommended according to the interestingness;
the output module is used for outputting the target recommendation video to the user;
the video recommendation model is obtained by training based on a sample original video corresponding to the user, a sample positive feedback video segment set and a sample negative feedback video segment set, wherein the sample positive feedback video segment set is a set of video segments of the sample original video completely watched by the user, and the sample negative feedback video segment set is a set of video segments where the occurrence moment of the user's skip action is located.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the video recommendation method of any one of claims 1-6 when executing the computer program.
9. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the video recommendation method of any one of claims 1-6.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the video recommendation method of any of claims 1 to 6.
CN202211154166.XA 2022-09-21 2022-09-21 Video recommendation method and device and electronic equipment Pending CN115630188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211154166.XA CN115630188A (en) 2022-09-21 2022-09-21 Video recommendation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211154166.XA CN115630188A (en) 2022-09-21 2022-09-21 Video recommendation method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115630188A true CN115630188A (en) 2023-01-20

Family

ID=84902440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211154166.XA Pending CN115630188A (en) 2022-09-21 2022-09-21 Video recommendation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115630188A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116541608A (en) * 2023-07-04 2023-08-04 深圳须弥云图空间科技有限公司 House source recommendation method and device, electronic equipment and storage medium
CN116541608B (en) * 2023-07-04 2023-10-03 深圳须弥云图空间科技有限公司 House source recommendation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109874053B (en) Short video recommendation method based on video content understanding and user dynamic interest
CN111061946B (en) Method, device, electronic equipment and storage medium for recommending scenerized content
CN107861938B (en) POI (Point of interest) file generation method and device and electronic equipment
CN112163165B (en) Information recommendation method, device, equipment and computer readable storage medium
CN111931062A (en) Training method and related device of information recommendation model
CN110619081B (en) News pushing method based on interactive graph neural network
CN111246256A (en) Video recommendation method based on multi-mode video content and multi-task learning
CN111400591A (en) Information recommendation method and device, electronic equipment and storage medium
US20220171760A1 (en) Data processing method and apparatus, computer-readable storage medium, and electronic device
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN112100504B (en) Content recommendation method and device, electronic equipment and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN113051468B (en) Movie recommendation method and system based on knowledge graph and reinforcement learning
CN115238126A (en) Method, device and equipment for reordering search results and computer storage medium
CN114357201B (en) Audio-visual recommendation method and system based on information perception
CN116977701A (en) Video classification model training method, video classification method and device
CN115618101A (en) Streaming media content recommendation method and device based on negative feedback and electronic equipment
CN112115131A (en) Data denoising method, device and equipment and computer readable storage medium
CN115630188A (en) Video recommendation method and device and electronic equipment
CN116956183A (en) Multimedia resource recommendation method, model training method, device and storage medium
Yan et al. Hybrid CNN-transformer based meta-learning approach for personalized image aesthetics assessment
CN116010696A (en) News recommendation method, system and medium integrating knowledge graph and long-term interest of user
CN114357301A (en) Data processing method, device and readable storage medium
CN113569557B (en) Information quality identification method, device, equipment, storage medium and program product
Liu et al. AGRFNet: Two-stage cross-modal and multi-level attention gated recurrent fusion network for RGB-D saliency detection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination