WO2020215962A1 - Video recommendation method and apparatus, computer device, and storage medium - Google Patents

Video recommendation method and apparatus, computer device, and storage medium

Info

Publication number
WO2020215962A1
WO2020215962A1 (PCT/CN2020/081052)
Authority
WO
WIPO (PCT)
Prior art keywords
video
feature
user
text
recommended
Prior art date
Application number
PCT/CN2020/081052
Other languages
English (en)
French (fr)
Inventor
苏舟
刘书凯
孙振龙
饶君
丘志杰
刘毅
刘祺
王良栋
商甜甜
梁铭霏
陈磊
张博
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Publication of WO2020215962A1 publication Critical patent/WO2020215962A1/zh
Priority to US17/329,928 priority Critical patent/US11540019B2/en

Classifications

    • H04N 21/251: Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N 21/4826: End-user interface for program selection using recommendation lists, e.g. of programs or channels sorted out according to their score
    • G06F 18/251: Fusion techniques of input or preprocessed data
    • G06F 18/253: Fusion techniques of extracted features
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V 10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/806: Fusion of extracted features, i.e. combining data from various sources at the feature extraction level
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H04N 21/23418: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N 21/252: Processing of multiple end-users' preferences to derive collaborative data
    • H04N 21/25891: Management of end-user data being end-user preferences
    • H04N 21/26603: Channel or content management for automatically generating descriptors from content, e.g. when it is not made available by its provider, using content analysis techniques
    • H04N 21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/4532: Management of client data or end-user data involving end-user characteristics, e.g. viewer profile, preferences
    • H04N 21/4666: Learning process for intelligent management characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user
    • H04N 21/4668: Learning process for intelligent management for recommending content, e.g. movies

Definitions

  • the invention relates to the field of machine learning, and in particular to a video recommendation method, device, computer equipment and storage medium.
  • the server can recommend some videos that users may be interested in from the massive video database, so as to better satisfy users' video watching demand.
  • the server can extract the joint feature between any video in the video library and the user based on the Attentive Collaborative Filtering (ACF) model, repeat this step for each video in the video library to obtain multiple joint features corresponding to the multiple videos, and then, based on the Euclidean distances between the multiple joint features, obtain a ranking of all joint features, so that the video corresponding to the highest-ranked joint feature is recommended to the user.
  • ACF: Attentive Collaborative Filtering
  • the embodiment of the present invention provides a video recommendation method, device, computer equipment and storage medium, and a recommended video display method, device, electronic equipment and storage medium.
  • a video recommendation method executed by a computer device, the method including:
  • Input the video into a first feature extraction network, perform feature extraction on at least one continuous video frame in the video through the first feature extraction network, and output video features of the video;
  • according to the recommendation probability, determine whether to recommend the video to the user.
  • a recommended video display method executed by an electronic device, the method including:
  • Display a video display interface, where the video display interface includes at least one first recommended video;
  • when a click operation on any first recommended video is detected, the viewing record of the first recommended video is sent to the server, and the viewing record is used to instruct the server to perform optimization training on the video recommendation model based on the viewing record and to return the video information of at least one second recommended video in real time;
  • the at least one second recommended video is displayed in the video display interface.
  • a video recommendation device which includes:
  • the first output module is configured to input a video into a first feature extraction network, perform feature extraction on at least one continuous video frame in the video through the first feature extraction network, and output video features of the video;
  • the second output module is configured to input user data of the user into a second feature extraction network, perform feature extraction on the discrete user data through the second feature extraction network, and output the user characteristics of the user;
  • the fusion obtaining module is used to perform feature fusion based on the video feature and the user feature to obtain the recommendation probability of recommending the video to the user;
  • the determining recommendation module is used to determine whether to recommend the video to the user according to the recommendation probability.
  • the first output module includes:
  • the convolution extraction unit is configured to input at least one continuous video frame in the video into the temporal convolutional network and the convolutional neural network in the first feature extraction network, respectively, where the temporal convolutional network and the convolutional neural network perform convolution processing on the at least one continuous video frame to extract the video features of the video.
  • the convolution extraction unit includes:
  • the causal convolution subunit is used to input at least one image frame included in at least one continuous video frame in the video into the temporal convolution network in the first feature extraction network, and perform causal convolution on the at least one image frame through the temporal convolution network to obtain the image features of the video;
  • the convolution processing subunit is configured to input at least one audio frame included in the at least one continuous video frame into the convolutional neural network in the first feature extraction network, and perform convolution processing on the at least one audio frame through the convolutional neural network to obtain the audio features of the video;
  • the fusion subunit is used to perform feature fusion between the image feature of the video and the audio feature of the video to obtain the video feature of the video.
  • the fusion subunit is used for:
  • the second output module includes:
  • the first input unit is used to input the user data of the user into the second feature extraction network
  • the first linear combination unit is configured to perform, through the width part of the second feature extraction network, a generalized linear combination on the discrete user data to obtain the width feature of the user;
  • the first embedding convolution unit is used to perform, through the depth part of the second feature extraction network, embedding processing and convolution processing on the discrete user data to obtain the depth feature of the user;
  • the first fusion unit is used to perform feature fusion on the width feature of the user and the depth feature of the user to obtain the user feature of the user.
  • the first fusion unit is used for:
  • the width feature of the user and the depth feature of the user are cascaded through the fully connected layer to obtain the user feature of the user.
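As a non-authoritative illustration of the width-plus-depth structure described for the second feature extraction network, the following PyTorch sketch combines a generalized linear (wide) part over one-hot user data with an embedding-and-convolution (deep) part, and cascades the two through a fully connected layer. All dimensions, field sizes, and layer choices are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class UserFeatureNet(nn.Module):
    """Sketch of a second feature extraction network: a wide (generalized linear)
    part and a deep (embedding + convolution) part, fused by a fully connected
    cascade. Field sizes and dimensions are assumptions."""

    def __init__(self, onehot_dim=512, field_sizes=(2, 100, 500, 50), emb_dim=16, out_dim=128):
        super().__init__()
        # Width part: generalized linear combination of the discrete (one-hot) user data.
        self.wide = nn.Linear(onehot_dim, out_dim)
        # Depth part: embed each discrete field, then convolve across the fields.
        self.embeddings = nn.ModuleList([nn.Embedding(n, emb_dim) for n in field_sizes])
        self.conv = nn.Conv1d(emb_dim, out_dim, kernel_size=len(field_sizes))
        # Fusion: cascade (concatenate) the width and depth features through a fully connected layer.
        self.fuse = nn.Linear(out_dim * 2, out_dim)

    def forward(self, onehot, field_ids):
        wide_feat = self.wide(onehot)                              # (B, out_dim)
        emb = torch.stack([e(field_ids[:, i])                      # (B, emb_dim) per field
                           for i, e in enumerate(self.embeddings)], dim=2)
        deep_feat = self.conv(emb).squeeze(-1)                     # (B, out_dim)
        return self.fuse(torch.cat([wide_feat, deep_feat], dim=-1))

net = UserFeatureNet()
user_feat = net(torch.zeros(4, 512), torch.zeros(4, 4, dtype=torch.long))  # -> (4, 128)
```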
  • the fused module is used to:
  • the device further includes:
  • the third input module is configured to input the text corresponding to the video into a third feature extraction network, and perform feature extraction on the discrete text through the third feature extraction network, and output text features corresponding to the video.
  • the third input module includes:
  • the second input unit is used to input the text into the third feature extraction network
  • the second linear combination unit is used to perform, through the width part of the third feature extraction network, a generalized linear combination on the discrete text to obtain the width feature of the text;
  • the second embedding convolution unit is used to perform, through the depth part of the third feature extraction network, embedding processing and convolution processing on the discrete text to obtain the depth feature of the text;
  • the second fusion unit is used to perform feature fusion on the width feature of the text and the depth feature of the text to obtain the text feature corresponding to the video.
  • the second fusion unit is used for:
  • the width feature of the text and the depth feature of the text are cascaded through the fully connected layer to obtain the text feature corresponding to the video.
  • the second fusion unit is also used to cascade the width feature of the text and the depth feature of the text through the fully connected layer to obtain the text feature corresponding to the video.
  • the fusion obtaining module includes:
  • the third fusion unit is used to perform feature fusion on the video feature and the user feature to obtain the first associated feature between the video and the user;
  • the third fusion unit is also used to perform feature fusion on the text feature and the user feature to obtain a second association feature between the text and the user;
  • the dot product unit is configured to perform dot product processing on the first associated feature and the second associated feature to obtain the recommendation probability of recommending the video to the user.
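A minimal sketch of this fusion path is given below, assuming the video, user, and text features are already extracted and that each feature fusion is realized as a fully connected layer over a concatenated pair (the patent does not fix the fusion operator here); the final sigmoid is likewise an assumption so the dot product can be read as a probability.

```python
import torch
import torch.nn as nn

class RecommendationHead(nn.Module):
    """Sketch: fuse (video, user) into a first association feature, (text, user)
    into a second association feature, then take their dot product as the
    recommendation score."""

    def __init__(self, dim=128):
        super().__init__()
        self.fuse_video_user = nn.Linear(2 * dim, dim)  # assumed fusion operator
        self.fuse_text_user = nn.Linear(2 * dim, dim)   # assumed fusion operator

    def forward(self, video_feat, user_feat, text_feat):
        first = self.fuse_video_user(torch.cat([video_feat, user_feat], dim=-1))
        second = self.fuse_text_user(torch.cat([text_feat, user_feat], dim=-1))
        score = (first * second).sum(dim=-1)   # dot product of the two association features
        return torch.sigmoid(score)            # assumption: squash to (0, 1) to read as a probability
```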
  • the third fusion unit is used for:
  • the third fusion unit is also used to:
  • the determining recommendation module is used for:
  • if the recommendation probability is less than or equal to the probability threshold, it is determined not to recommend the video to the user.
  • the determining recommendation module is used for:
  • if the probability ranking is greater than the target threshold, it is determined not to recommend the video corresponding to that probability ranking to the user.
  • a recommended video display device which includes:
  • the display module is configured to display a video display interface, and the video display interface includes at least one first recommended video;
  • the sending module is used to, when a click operation on any first recommended video is detected, send, in response to the click operation, the viewing record of the first recommended video to the server, where the viewing record is used to instruct the server to perform optimization training on the video recommendation model based on the viewing record and to return the video information of at least one second recommended video in real time;
  • the display module is configured to display the at least one second recommended video in the video display interface based on the video information of the at least one second recommended video when the video information of the at least one second recommended video is received.
  • a computer device including a processor and a memory, where computer-readable instructions are stored in the memory;
  • when the computer-readable instructions are executed by the processor, the processor is caused to execute the steps of the video recommendation method described above.
  • An electronic device including a processor and a memory, where computer-readable instructions are stored in the memory;
  • when the computer-readable instructions are executed by the processor, the processor executes the steps of the recommended video display method described above.
  • a non-volatile computer-readable storage medium that stores computer-readable instructions;
  • when the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the video recommendation method described above, or the steps of the recommended video display method described above.
  • FIG. 1 is a schematic diagram of an implementation environment of a video recommendation method provided by an embodiment of the present invention
  • FIG. 2 is an interaction flowchart of a video recommendation method provided by an embodiment of the present invention;
  • FIG. 3 is a schematic diagram of a video display interface provided by an embodiment of the present invention.
  • FIG. 4 is a flowchart of a video recommendation method provided by an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a temporal convolutional network provided by an embodiment of the present invention.
  • Fig. 6 is a schematic diagram of a temporal convolutional network provided by an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of a second feature extraction network provided by an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of a video recommendation method provided by an embodiment of the present invention.
  • FIG. 9 is a flowchart of a video recommendation method provided by an embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a video recommendation device provided by an embodiment of the present invention.
  • FIG. 11 is a schematic structural diagram of a recommended video display device provided by an embodiment of the present invention.
  • Figure 12 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
  • FIG. 13 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
  • FIG. 1 is a schematic diagram of an implementation environment of a video recommendation method provided by an embodiment of the present invention.
  • the implementation environment may include at least one terminal 101 and server 102, and each terminal 101 and server 102 respectively communicate through a network connection.
  • the at least one terminal 101 is used to browse videos, and the server 102 is used to recommend videos to at least one user corresponding to the at least one terminal 101.
  • an application client may be installed on each of the at least one terminal 101, and the application client may be any client that can provide video browsing services, so that the server 102 can collect sample user data and sample videos based on the user's behavior log on the application client, and the first feature extraction network, the second feature extraction network, and the third feature extraction network are trained based on the sample user data and sample videos.
  • the server 102 can determine, based on the first feature extraction network, the second feature extraction network, and the third feature extraction network, whether to recommend any video to any user. Thus, in some embodiments, the server 102 can screen out at least one video for each user from multiple videos, so that video recommendation to the user can be realized.
  • the server 102 sends the at least one video determined to be recommended to the at least one terminal 101;
  • the at least one terminal 101 can display at least one recommended video based on the video display interface, where the at least one recommended video is the at least one video recommended by the server for the user corresponding to the terminal.
  • FIG. 2 is an interaction flowchart of a video recommendation method provided by an embodiment of the present invention. Referring to FIG. 2, this embodiment is applied to the interaction process between a computer device and an electronic device; the embodiment of the present invention is described only by taking the computer device being a server and the electronic device being a terminal as an example. This embodiment includes:
  • the server inputs the video into a first feature extraction network, performs feature extraction on at least one continuous video frame in the video through the first feature extraction network, and outputs the video feature of the video.
  • the video may be any video in the local video library, the video may also be any video downloaded from the cloud, and the video may include at least one continuous video frame.
  • the server inputs the user data of the user into a second feature extraction network, performs feature extraction on the discrete user data through the second feature extraction network, and outputs the user features of the user.
  • the user may be a user corresponding to any terminal
  • the user data may include user personal information and video preferences
  • the personal information may include at least one of user gender, user age, user location, or user occupation.
  • Personal information may be information authorized by the user to the server.
  • the video preference can be obtained by the server performing data analysis on the user's video viewing behavior log.
  • the user data is discrete.
  • after the discrete user data is input into the second feature extraction network, the second feature extraction network can convert the discrete user data into a continuous feature vector, and the continuous feature vector can reflect the joint features of the discrete user data.
  • the server performs feature fusion based on the video feature and the user feature to obtain a recommendation probability of recommending the video to the user.
  • the server may perform dot product processing on the video feature and the user feature, that is, calculate the inner product of the video feature and the user feature: the values at corresponding positions in the video feature and the user feature are multiplied, the products are summed, and the resulting value is taken as the recommendation probability.
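As a tiny numeric illustration of this inner product (the feature values are invented purely for the arithmetic):

```python
import numpy as np

video_feature = np.array([0.2, 0.5, 0.1])   # hypothetical 3-dimensional video feature
user_feature  = np.array([0.4, 0.3, 0.9])   # hypothetical 3-dimensional user feature

# Multiply values at corresponding positions and sum the products.
recommendation_score = float(np.dot(video_feature, user_feature))
# 0.2*0.4 + 0.5*0.3 + 0.1*0.9 = 0.08 + 0.15 + 0.09 = 0.32
print(recommendation_score)
```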
  • in step S204, the server determines whether to recommend the video to the user according to the recommendation probability.
  • the server can execute the video recommendation process in steps S201-S204 to determine whether to recommend any video to any user.
  • the following step S205 is performed, taking the determination of at least one first recommended video for the same user as an example for description, but for different users, the process is similar, and will not be repeated here.
  • the server repeats the above steps S201-S204, determines at least one first recommended video recommended to the user, and sends video information of the at least one first recommended video to the terminal corresponding to the user.
  • the server may set a recommended quantity threshold for the first recommended video.
  • the recommended quantity threshold may be any value greater than or equal to 1.
  • the recommended quantity threshold may be the same or different.
  • the server may analyze the user's video viewing behavior log so that the recommendation number threshold corresponding to the user is positively correlated with the user's daily video viewing time; that is, the longer the user's daily video viewing time, the greater the number of first recommended videos corresponding to the user. For example, if the user's average daily video viewing time is 1 hour, 2 first recommended videos can be sent to the user's terminal, and if the user's average daily video viewing time is 3 hours, 6 first recommended videos can be sent to the user's terminal.
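A minimal sketch of such a positively correlated rule, assuming the simple rate of 2 recommendations per average daily viewing hour implied by the example above (the function name and rate are assumptions):

```python
def recommendation_quota(avg_daily_hours: float, per_hour: int = 2, minimum: int = 1) -> int:
    """Assumed rule: the recommendation number threshold grows linearly with average daily viewing time."""
    return max(minimum, round(avg_daily_hours * per_hour))

print(recommendation_quota(1.0))  # -> 2 first recommended videos
print(recommendation_quota(3.0))  # -> 6 first recommended videos
```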
  • the terminal receives video information of the at least one first recommended video.
  • the video information may be at least one of a thumbnail, a web page link, or text of the at least one first recommended video.
  • the video information may include the thumbnail, web page link, title, author information, and summary of the first recommended video.
  • the embodiment of the present invention does not specifically limit the content of the video information.
  • the video information may also be the at least one first recommended video itself, thereby avoiding the terminal from frequently sending access requests to the server in the subsequent interaction process.
  • when the terminal detects the user's click operation on the video function entrance, it displays a video display interface, and the video display interface includes at least one first recommended video.
  • the video function entrance may be provided by any application client on the terminal that supports video display, and the video display interface may include at least one user interface (UI) card, where each UI card is used to display one first recommended video.
  • UI user interface
  • the video display interface may also include at least one window, and each window is used to display a first recommended video.
  • the embodiment of the present invention does not specifically limit the form of displaying the first recommended video in the video display interface.
  • the video function entry may be a function option on the main interface of the application client, so that when the terminal detects a user's click operation on the function option, the terminal switches from the main interface of the application client to the video display interface.
  • FIG. 3 is a schematic diagram of a video display interface provided by an embodiment of the present invention. Referring to FIG. 3, the terminal may display multiple first recommended videos on the video display interface.
  • the video function entrance may also be the icon of the application client, so that when the terminal detects a click operation on the icon of the application client, the terminal directly starts the application client and displays the video display interface;
  • that is, the main interface of the application client is the video display interface.
  • the at least one first recommended video is determined based on multiple recommendation probabilities, and one recommendation probability may be the probability obtained by fusing at least one of the video feature of the video to be recommended output by the first feature extraction network, the user feature of the current user output by the second feature extraction network, or the text feature output by the third feature extraction network.
  • in some embodiments, the terminal may display only the video information of the at least one first recommended video in the video display interface, and when a user's click operation on any first recommended video is detected, send an access request through the webpage link of that first recommended video, so that the first recommended video is cached locally and played based on the video display control, which can save storage space of the terminal and improve the processing efficiency of the terminal.
  • the terminal may also, while displaying the video display interface, send an access request to the webpage link corresponding to each first recommended video in the at least one first recommended video and cache the at least one first recommended video locally; when a user's click operation on any first recommended video is detected, the first recommended video is played directly based on the video display control. In this way, the loading of each first recommended video on the interface is completed when the video display interface is displayed, and the first recommended video can be played in time when the user clicks, thereby shortening the time the user waits for the video to load and optimizing the effect of video recommendation.
  • the terminal can also directly automatically play the video with the highest recommended probability after displaying the video display interface, thereby simplifying the video playback process.
  • when the terminal detects a click operation on any first recommended video, the terminal sends, in response to the click operation, the viewing record of the first recommended video to the server, where the viewing record is used to instruct the server to perform optimization training on the video recommendation model based on the viewing record and to return the video information of at least one second recommended video in real time.
  • in some embodiments, in response to the user's click operation on any first recommended video, the terminal sends the viewing record of the first recommended video to the server, and the viewing record may include the exposure duration, the cumulative number of views of the first recommended video, and the like.
  • the server receives the viewing record, optimizes training of the video recommendation model based on the viewing record, determines at least one second recommended video according to the optimized trained video recommendation model, and sends video information of the at least one second recommended video To the terminal.
  • the video recommendation model includes at least one of a first feature extraction network, a second feature extraction network, or a third feature extraction network.
  • the server can collect the viewing records of each user for each first recommended video and, based on the viewing records, mark a first recommended video with an exposure duration longer than the preset duration as a positive example in the optimization training process (that is, mark it as true), and mark a first recommended video with an exposure duration less than or equal to the preset duration as a negative example in the optimization training process (that is, mark it as false).
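A minimal sketch of this exposure-based labeling step; the record field names and the threshold value are assumptions:

```python
def label_viewing_records(records, preset_duration_s=10.0):
    """Mark each first recommended video as a positive example (True) if its exposure
    duration exceeds the preset duration, otherwise as a negative example (False).
    `records` is assumed to be a list of dicts."""
    return [
        {**r, "label": r["exposure_duration_s"] > preset_duration_s}
        for r in records
    ]

samples = label_viewing_records([
    {"video_id": "a", "exposure_duration_s": 42.0},   # -> positive example
    {"video_id": "b", "exposure_duration_s": 3.5},    # -> negative example
])
```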
  • the specific training process is similar to the video recommendation method in the following embodiment, except that the video needs to be replaced with the marked first recommended video, and will not be repeated here; dynamic optimization training of the video recommendation model can be realized through the above step S209.
  • the foregoing process of determining the second recommended video and sending the video information of the second recommended video is similar to the foregoing steps S201-S205, and will not be repeated here.
  • step S210 is similar to steps S206-S207, and will not be repeated here.
  • when the terminal detects a user's click operation on any first recommended video, the terminal sends a viewing record to the server in response to the click operation; the server immediately performs optimization training on each feature extraction network in the video recommendation model and then determines at least one second recommended video for the terminal to display, so that different recommendation results are displayed on the video display interface before and after the user clicks a certain first recommended video.
  • for example, the server originally predicted that a user's probability of liking cat videos is as high as that of liking dog videos, so the 10 determined first recommended videos include 5 cat videos and 5 dog videos; when the user clicks a cat video pushed to the terminal and its exposure duration is longer than the preset duration, the terminal sends the viewing record to the server.
  • after the server marks the cat video as a positive example, it performs optimization training on each feature extraction network in the video recommendation model. Since the number of positive examples of cat videos has increased by one, the server may predict that the user's probability of liking cat videos is greater than the probability of liking dog videos, so the 10 second recommended videos determined in the new round of prediction include 7 cat videos and 3 dog videos.
  • in some embodiments, the server may also not perform the optimization training process immediately after receiving a viewing record, but periodically perform optimization training on each feature extraction network in the video recommendation model. For example, the server may, at midnight each day, perform optimization training based on the one or more viewing records of that day and send the second recommended videos to the terminal, so that the terminal can update the recommended videos displayed in the video display interface. This avoids training each feature extraction network in the video recommendation model once for every additional viewing record, which alleviates performance jitter of the feature extraction networks and increases their stability.
  • in the method provided by the embodiment of the present invention, a video is input into a first feature extraction network, feature extraction is performed on at least one continuous video frame in the video through the first feature extraction network, and the video features of the video are output, so that high-dimensional video features can be extracted in a targeted manner without increasing the computational pressure.
  • the user data of the user is input into the second feature extraction network, feature extraction is performed on the user data, and the user features of the user are output; because user characteristics are varied and low-dimensional, the second feature extraction network can extract low-dimensional user features in a targeted manner, reducing the computational pressure of extracting user features. Feature fusion is then performed based on the video feature and the user feature to obtain the recommendation probability of recommending the video to the user.
  • according to the recommendation probability, it is determined whether to recommend the video to the user. In this way, user features and video features, which differ greatly in nature, are extracted by different networks, avoiding the loss of information in the user features and the video features, alleviating the gradient dispersion problem, and improving the accuracy of video recommendation.
  • a video display interface is displayed on the terminal side, and at least one first recommended video is displayed on the video display interface.
  • when the user clicks any first recommended video, the viewing record of that first recommended video is sent to the server, so that the quality of the first recommended video can be fed back in time; the server can mark the first recommended video as a true or false sample based on the viewing record, and the marked first recommended video is used as a sample video in a new round of optimization training to realize dynamic optimization training of the video recommendation model. The server can also return video information of at least one second recommended video to the terminal according to the optimized video recommendation model.
  • when the terminal receives the video information of the at least one second recommended video, it displays the at least one second recommended video in the video display interface based on that video information, so that as the user clicks, recommended videos with higher recommendation accuracy can be updated and displayed on the video display interface in real time.
  • the foregoing embodiment provides a video recommendation process in which a terminal interacts with a server: after determining any recommended video, the server pushes the recommended video to the terminal, so that the terminal displays the recommended video based on the video display interface, and after the user clicks the recommended video, the recommended video in the video display interface can also be updated.
  • the following describes in detail how to determine the recommended video on the server side; step S206 in the above embodiment can still be executed, and a terminal-side display process similar to S210 will not be repeated in the embodiment of the present invention.
  • Fig. 4 is a flowchart of a video recommendation method provided by an embodiment of the present invention. Referring to Fig. 4, this embodiment is applied to a computer device. The embodiment of the present invention only uses the computer device as a server as an example for description. The method includes:
  • the server inputs at least one image frame included in at least one continuous video frame of the video into the temporal convolution network in the first feature extraction network, and performs causal convolution on the at least one image frame through the temporal convolution network to obtain the image features of the video.
  • the video can be any video in the local video library, the video can also be any video downloaded from the cloud, the video can include at least one continuous video frame, and the at least one continuous video frame can include at least one The image frame and at least one audio frame, that is, each continuous video frame includes one image frame and one audio frame.
  • the at least one image frame may be represented in the form of a sequence, an array, or a linked list, and the embodiment of the present invention does not specifically limit the form of the image frame.
  • the image features of the video may include at least one image frame feature corresponding to the at least one image frame, where one image frame feature is used to represent the image feature of an image frame and the association between that image frame and the image frames before it.
  • the first feature extraction network can include a temporal convolutional network (temporal convolutional networks, TCN) and a convolutional neural network (convolutional neural networks, CNN), where the TCN can be used to extract the image features.
  • TCN temporal convolutional networks
  • CNN convolutional neural networks
  • the CNN can be used to extract audio features.
  • the CNN will be described in detail in the following step 402, which will not be repeated here.
  • the TCN independently extracts the image features of the video and the CNN independently extracts the audio features of the video; feature fusion is then performed on the image features output by the TCN and the audio features output by the CNN to obtain the video features of the video.
  • the TCN may include an input layer, at least one hidden layer, and an output layer.
  • the input layer is used to decode the input image frame
  • the at least one hidden layer is used to perform causal convolution (causal convolutions) on the decoded image frames;
  • the output layer is used to perform non-linear processing and normalization processing on the image frame after causal convolution.
  • the input layer, the at least one hidden layer, and the output layer are serially connected.
  • the serial connection means that the server inputs at least one image frame of the video to the input layer, at least one image frame decoded by the input layer is input to the first hidden layer, at least one feature map output by the first hidden layer is input to the second hidden layer, and so on, until at least one feature map output by the last hidden layer is input to the output layer; the at least one image frame feature output by the output layer is the image feature of the video extracted by the TCN.
  • each hidden layer can include at least one convolution kernel (filter).
  • for any hidden layer, when causal convolution is performed on at least one feature map output by the previous hidden layer: in the traditional CNN framework, one convolution kernel is used to convolve one feature map, whereas in the TCN provided by the embodiment of the present invention, one convolution kernel is used to convolve multiple feature maps, and this convolution is called "causal convolution", where the multiple feature maps may be the feature map at the current moment and at least one feature map corresponding to at least one moment before the current moment.
  • in step S401, the server inputs the at least one image frame into the TCN, performs causal convolution on the at least one image frame through the at least one hidden layer of the TCN, and outputs at least one image frame feature corresponding to the at least one image frame, thereby determining the at least one image frame feature as the image features of the video.
  • for the feature map at any moment, causal convolution is performed, according to the convolution kernel corresponding to that moment in the hidden layer, on at least one feature map output by the previous hidden layer, and the results are superimposed.
  • the “superimposition” mentioned here refers to the direct addition of the values of corresponding positions in the multiple feature maps.
  • FIG. 5 is a schematic diagram of a temporal convolutional network provided by an embodiment of the present invention. See FIG. 5.
  • the T-th convolution kernel in a hidden layer convolves the three image frames at the three moments T, T-1, and T-2 of the input layer to obtain the feature map at moment T in the first hidden layer, where T is any value greater than or equal to 0.
  • in FIG. 5, one convolution kernel is used to convolve three feature maps, but in some embodiments, one convolution kernel in the TCN can convolve any number of feature maps greater than or equal to 2, and FIG. 5 should not constitute a specific limitation on the number of feature maps involved in each causal convolution in the TCN.
  • each image frame feature in the output layer can not only represent the image feature of an image frame, but also represent the association relationship between the image frame and the image frame before the image frame.
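A possible PyTorch sketch of one causal convolution consistent with the FIG. 5 description (a 3-tap kernel whose output at moment T combines the feature maps at moments T, T-1, and T-2, with left-only zero padding so no future frame is used); channel counts and sequence length are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """One causal convolution: the output at moment T is a convolution over the
    feature maps at moments T, T-1, ..., T-(kernel_size-1)."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1                     # pad only on the left (past) side
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)

    def forward(self, x):                              # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))                    # zero-padding keeps output length equal to input length
        return self.conv(x)

frames = torch.randn(1, 64, 30)       # e.g. 30 image-frame feature maps with 64 channels each (assumed sizes)
out = CausalConv1d(64, 128)(frames)   # (1, 128, 30); out[..., T] depends only on frames T, T-1, T-2
```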
  • LSTM long short-term memory
  • the feature map obtained after the causal convolution can include the information of each image frame of the image data in the input layer.
  • at least one feature map output by the previous hidden layer can be zero-padded, and at least one zero-padding layer can be added around each feature map.
  • the number of zero-filled layers can be determined according to the size of the convolution kernel and the step size of the causal convolution, so as to ensure that the size of the feature map output by each hidden layer is consistent with the size of the input feature map.
  • in some embodiments, any convolution kernel in each of the above hidden layers may also be a dilated convolution (dilated convolutions, also called atrous or "hole" convolution) kernel, which refers to a new convolution kernel formed by inserting at least one zero element between adjacent elements of the original convolution kernel. Since the holes of the dilated convolution kernel are always filled with 0, no new convolution kernel parameters are introduced, so that the size of the convolution kernel is effectively enlarged and the receptive field is increased without additional convolution kernel parameters, and a better fitting effect can be obtained. This can further reduce the number of hidden layers in the TCN, reduce the amount of calculation in the TCN training process, and shorten the training time of the TCN.
  • when the convolution kernel is a dilated convolution kernel, the causal convolution operation is still performed, that is, one dilated convolution kernel is also used to convolve multiple feature maps.
  • the multiple feature maps may be feature maps that are adjacent in time series, or feature maps that are not adjacent in time series.
  • when the multiple feature maps are not adjacent in time series, the time sequence intervals between adjacent feature maps among the multiple feature maps may be the same or different; the embodiment of the present invention does not specifically limit whether the time sequence intervals between adjacent feature maps are the same.
  • when the multiple feature maps are not adjacent in time sequence and have the same time sequence interval, an expansion coefficient d greater than or equal to 1 can be set for each hidden layer, where d is a positive integer;
  • the expansion coefficients in different hidden layers can be the same or different.
  • the embodiment of the present invention does not specifically limit the value of the expansion coefficient.
  • the server can also directly set the time sequence interval as a hyperparameter.
  • the embodiment of the present invention also does not specifically limit whether to set the expansion coefficient.
  • for example, the feature maps corresponding to the image frames at moments T-4 and T-8 are causally convolved, which can reduce the number of hidden layers in the TCN, reduce the amount of calculation in the TCN training process, and shorten the training time of the TCN.
  • when each causal convolution uses a dilated convolution kernel, the size of the convolution kernel is effectively enlarged, the receptive field is increased, and a better fitting effect can be obtained.
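Extending the causal convolution sketch above with an expansion (dilation) coefficient shows how, with d = 4 and a 3-tap kernel, the output at moment T can be computed from the feature maps at moments T, T-4, and T-8 without adding kernel parameters or extra layers; all sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    """Causal convolution with holes: zeros sit between the kernel taps, so the
    receptive field grows without adding new kernel parameters."""

    def __init__(self, channels, kernel_size=3, dilation=4):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation        # left-only padding preserves causality and length
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

x = torch.randn(1, 64, 30)
y = DilatedCausalConv1d(64)(x)   # y[..., T] is computed from x[..., T], x[..., T-4], x[..., T-8]
```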
  • in some embodiments, a residual connection may be used between the at least one hidden layer. The residual connection means that, for each hidden layer, any feature map output by the previous hidden layer can be superimposed with the corresponding feature map output by the current hidden layer to obtain a residual block, and the residual block is used as a feature map input to the next hidden layer. This alleviates the degradation problem of the TCN, so that the deeper the TCN is, the better the accuracy of image feature extraction.
  • in some embodiments, a convolution kernel performs a convolution operation on the feature map output by the previous hidden layer to raise or reduce its dimension, thereby ensuring that the two feature maps involved in the superimposition have the same dimensions.
  • FIG. 6 is a schematic diagram of a temporal convolutional network provided by an embodiment of the present invention.
  • in FIG. 6, the first hidden layer causally convolves the image frames at moments T, T-1, and T-2 of the input layer; before the second hidden layer performs causal convolution on the feature maps at moments T, T-1, and T-2, the image frame at moment T is superimposed with the feature map at moment T, the image frame at moment T-1 with the feature map at moment T-1, and the image frame at moment T-2 with the feature map at moment T-2.
  • the “superposition” mentioned here refers to the direct addition of the values of the corresponding positions in any two feature maps.
  • a convolution operation may be performed on the image frame through a convolution kernel with a size of 1 ⁇ 1, so that the image frame has the same dimension as the feature map.
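A sketch of the residual connection described above, including the 1x1 convolution used to raise or reduce the dimension of the input so that the two feature maps being superimposed match; again a PyTorch illustration with assumed channel counts, not the patent's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualCausalBlock(nn.Module):
    """Residual block: the hidden layer's causal convolution output is superimposed
    (added element-wise) with its input; a 1x1 convolution adjusts the input's
    channel dimension so both feature maps have the same shape."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1
        self.causal = nn.Conv1d(in_channels, out_channels, kernel_size)
        self.match = (nn.Conv1d(in_channels, out_channels, kernel_size=1)
                      if in_channels != out_channels else nn.Identity())

    def forward(self, x):                               # x: (batch, in_channels, time)
        h = self.causal(F.pad(x, (self.pad, 0)))        # causal convolution
        return h + self.match(x)                        # superimpose: direct addition of corresponding positions
```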
  • At least one non-linear layer can be introduced between each hidden layer.
  • the non-linear layer is used to perform non-linear processing on the feature map output by the hidden layer.
  • the non-linear layer can be any activation function that can add a non-linear factor; for example, the activation function may be a sigmoid function, a tanh function, or a ReLU function.
  • At least one weight normalization layer may be introduced between each hidden layer, so that the weight of each convolution kernel can be normalized, so that the feature map output by each hidden layer has a similar distribution.
  • the training speed of TCN can be accelerated, and the gradient dispersion problem of TCN can be improved.
  • for example, a weight normalization layer is connected in series after any hidden layer, and a non-linear layer is then connected in series after the weight normalization layer.
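One way the series order described here (hidden layer, then weight normalization, then non-linear layer) could be realized, using PyTorch's built-in weight normalization; the padding-and-trim trick for causality and the choice of ReLU are assumptions:

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

# Hidden (causal) convolution whose kernel weights are normalized, followed by a ReLU non-linearity.
hidden = weight_norm(nn.Conv1d(64, 64, kernel_size=3, padding=2))
activation = nn.ReLU()

x = torch.randn(1, 64, 30)
# Drop the last 2 outputs, which come from the right-side padding, so each kept
# output depends only on the current and past frames and the length matches the input.
y = activation(hidden(x)[..., :-2])
```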
  • the output layer may be an exponential normalization (softmax) layer.
  • softmax exponential normalization
  • each feature map output by the last hidden layer is exponentially normalized based on the softmax function to obtain the image features of the video.
  • the server inputs the at least one audio frame included in the at least one continuous video frame into the convolutional neural network in the first feature extraction network, and performs convolution processing on the at least one audio frame through the convolutional neural network to obtain the audio features of the video.
  • the at least one audio frame may be represented in the form of a sequence, an array, or a linked list, and the embodiment of the present invention does not specifically limit the form of the audio frame.
  • the audio feature of the video may include the audio feature of each audio frame in the at least one audio frame.
  • the CNN in the first feature extraction network is used to extract audio features.
  • the CNN may include an input layer, at least one hidden layer, and an output layer.
  • the input layer is used to decode input audio frames
  • the at least one hidden layer is used to perform convolution processing on the decoded audio frame
  • the output layer is used to perform non-linear processing and normalization processing on the audio frame after the convolution processing.
  • the input layer, the at least one hidden layer, and the output layer are connected in series, which is similar to the connection mode of the TCN in step S401, and will not be repeated here.
  • At least one pooling layer may be introduced between each hidden layer, and the pooling layer is used to compress the feature map output by the previous hidden layer, thereby reducing the size of the feature map.
  • the residual connection may also be used in the CNN, which is similar to the residual connection of the TCN in step S401 above, and will not be repeated here.
  • the CNN may be a VGG (visual geometry group) network.
  • VGG visual geometry group
  • each hidden layer uses a small 3*3 convolution kernel and at most a 2*2 pooling kernel, with residual connections between the hidden layers, so that as the VGG network deepens, the size of the feature map is halved and its depth is doubled after each pooling. This simplifies the structure of the CNN, makes it easy to obtain the spectrogram of at least one audio frame, and facilitates the extraction of high-level audio features.
  • the CNN may be VGG-16 or VGG-19, etc.
  • the embodiment of the present invention does not specifically limit the architecture level of the VGG network.
  • the server may input at least one audio frame of the video into the CNN, perform convolution processing on the at least one audio frame through at least one hidden layer of the CNN, and output at least one audio frame feature corresponding to the at least one audio frame, thereby determining the at least one audio frame feature as the audio feature of the video; a sketch of such a VGG-style audio CNN follows.
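  • Below is a minimal sketch (PyTorch assumed) of a small VGG-style stack for audio spectrograms: each stage uses 3x3 convolutions and 2x2 max pooling, halving the spatial size while the channel depth doubles, as described above. The layer counts and sizes are illustrative and not the patent's exact VGG-16/VGG-19 configuration.

```python
import torch
import torch.nn as nn


def vgg_stage(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),   # spatial size halves
    )


audio_cnn = nn.Sequential(
    vgg_stage(1, 64),      # input: 1-channel spectrogram
    vgg_stage(64, 128),    # channel depth doubles at each stage
    vgg_stage(128, 256),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),          # -> one audio-feature vector per clip
)

spectrogram = torch.randn(4, 1, 96, 64)          # (batch, 1, mel bins, time)
print(audio_cnn(spectrogram).shape)              # torch.Size([4, 256])
```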
  • S403 The server performs bilinear convergence processing on the image feature of the video and the audio feature of the video to obtain the video feature of the video.
  • the server can perform multi-modal compact bilinear pooling (MCB) processing on the image feature and the audio feature.
  • the MCB processing means that the server obtains the tensor product (outer product) of the image feature and the audio feature, and expands the tensor product polynomially through the quadratic term to obtain the video feature. Of course, the server can also expand the tensor product through methods such as Taylor expansion or power series expansion to obtain the video feature.
  • the server may approximate the tensor product by the projection vector between the image feature and the audio feature, thereby reducing the amount of calculation in the bilinear merging process and shortening the duration of the video recommendation process.
  • the server may also perform multi-modal low-rank bilinear pooling (MLB) processing on the image feature and the audio feature.
  • the MLB processing means that the server obtains the projection matrix of the image feature and the projection matrix of the audio feature, obtains the Hadamard product between the two projection matrices, and determines the Hadamard product as the video feature. This improves on the MCB, which is limited by the performance of the graphics processing unit (GPU), reduces the demand on the GPU, and saves the cost of the bilinear convergence processing.
  • GPU graphics processing unit
  • the server may also perform multi-modal factorized bilinear pooling (MFB) processing on the image feature and the audio feature.
  • the MFB processing means that the server obtains the low-rank projection matrix of the image feature and the low-rank projection matrix of the audio feature, obtains the sum pooling between the two low-rank projection matrices, and determines the pooled result as the video feature. This improves on the slow convergence of the MLB, shortens the duration of the bilinear convergence processing, and improves its efficiency.
  • since in the above steps S401-S402 the server obtains the image feature of the video based on the TCN and the audio feature of the video based on the CNN, in step S403 the server can fuse the image feature of the video with the audio feature of the video to obtain the video feature of the video. Because the image feature and the audio feature are extracted separately through different network structures, the correlation between image frames is taken into account, which improves the expressive ability of the image feature, and a simplified network structure is used for the audio feature, which facilitates the extraction of deeper audio features; the two features are then merged to obtain the video feature, which improves the accuracy of the video recommendation process. In addition, the bilinear convergence processing improves the efficiency of feature fusion, ensures sufficient interaction between the image feature and the audio feature, and efficiently reduces the dimensionality of the fused feature.
  • the server may not perform bilinear fusion processing on image features and audio features, but may perform feature fusion by obtaining dot products, obtaining average values, or cascading to further shorten the duration of feature fusion. Reduce the amount of calculation in the feature fusion process.
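  • As one illustration of the bilinear options above, here is a minimal sketch (PyTorch assumed; dimensions are illustrative) of low-rank bilinear fusion in the MLB style: the image feature and the audio feature are projected into a shared space and combined by an element-wise (Hadamard) product, approximating the full outer product at much lower cost.

```python
import torch
import torch.nn as nn


class LowRankBilinearFusion(nn.Module):
    def __init__(self, img_dim, aud_dim, fused_dim):
        super().__init__()
        self.proj_img = nn.Linear(img_dim, fused_dim, bias=False)
        self.proj_aud = nn.Linear(aud_dim, fused_dim, bias=False)

    def forward(self, img_feat, aud_feat):
        # Hadamard product of the two projections stands in for the
        # tensor (outer) product used by the full bilinear model.
        return torch.tanh(self.proj_img(img_feat) * self.proj_aud(aud_feat))


fusion = LowRankBilinearFusion(img_dim=512, aud_dim=256, fused_dim=128)
video_feature = fusion(torch.randn(4, 512), torch.randn(4, 256))
print(video_feature.shape)                       # torch.Size([4, 128])
```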
  • the server inputs the at least one continuous video frame of the video into the temporal convolutional network and the convolutional neural network in the first feature extraction network respectively, and performs convolution processing on the at least one continuous video frame through the temporal convolutional network and the convolutional neural network to extract the video feature of the video.
  • the first feature extraction network includes TCN and CNN.
  • the server may directly input at least one image frame and at least one audio frame of the video into the same TCN or CNN and output the video feature of the video; that is, the server extracts both the image feature and the audio feature through the same TCN or CNN, so there is no need to perform feature fusion on the image feature and the audio feature. The extraction of the video feature can thus be completed with only one convolutional neural network, which reduces the amount of calculation when acquiring the video feature and accelerates its acquisition.
  • the server can also extract only the image feature of the video, or only the audio feature of the video, thereby eliminating the need for feature fusion, which reduces the amount of calculation when acquiring the video feature and accelerates its acquisition.
  • the server inputs the user data of the user into the second feature extraction network.
  • the user may be a user corresponding to any terminal, the user data may include user personal information and video preferences, and the personal information may include at least one of user gender, user age, user location, or user occupation.
  • the personal information may be information authorized by the user to the server, and the video preference may be obtained by the server performing data analysis on the user's video viewing behavior log.
  • any one of the personal information and the various video preferences in the user data is referred to as one user component information, so the user data includes at least one user component information.
  • since each piece of user component information in the user data is usually one or more isolated word vectors, the user data is discrete.
  • the second feature extraction network can convert discrete user data into a continuous feature vector, which can reflect the combined features of discrete user component information.
  • the second feature extraction network may include a width part and a depth part.
  • the second feature extraction network may be a wide and deep model, where the width part is used to perform generalized linear processing on the user data.
  • the width portion may be a generalized linear model, which will be described in detail in the following step S405.
  • the depth part is used to perform embedding processing and convolution processing on the user data; for example, the depth part may be a DNN (deep neural network), which will be described in detail in the following step S406.
  • the server extracts the width part in the network through the second feature, and performs a generalized linear combination on the discrete user data to obtain the width feature of the user.
  • the wide component may be a generalized linear model.
  • the server may perform one-hot encoding on at least one piece of user component information in the user data to obtain at least one original feature of the user data, and input the at least one original feature into the width part of the second feature extraction network, which facilitates the linear combination in the width part and speeds up the acquisition of the user's width feature.
  • the generalized linear model may include a first weight matrix and a bias term, so in the above step S405 the server can weight the at least one original feature based on the first weight matrix and sum the weighted original features and the bias term to obtain the user's width feature, wherein the number of weights in the first weight matrix is greater than or equal to the number of original features.
  • bias bias term
  • the generalized linear model may include a second weight matrix and a bias term, so that the server can obtain at least one pairwise cross feature of the at least one original feature, weight the at least one original feature and the at least one cross feature based on the second weight matrix, and sum the weighted original features, the weighted cross features, and the bias term to obtain the user's width feature. A cross feature represents the product of one original feature and another original feature, and the number of weights in the second weight matrix is greater than or equal to the sum of the number of original features and the number of cross features; a sketch of this width part follows.
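  • The following is a minimal sketch (PyTorch assumed; feature counts are illustrative) of the width part just described: one-hot style original features plus their pairwise cross features, weighted and summed together with a bias term.

```python
import itertools
import torch
import torch.nn as nn


class WidePart(nn.Module):
    def __init__(self, num_original):
        super().__init__()
        num_cross = num_original * (num_original - 1) // 2   # pairwise products
        # The second weight matrix covers originals + crosses; bias included.
        self.linear = nn.Linear(num_original + num_cross, 1, bias=True)

    def forward(self, original):                 # original: (batch, num_original)
        pairs = list(itertools.combinations(range(original.size(1)), 2))
        crosses = torch.stack(
            [original[:, i] * original[:, j] for i, j in pairs], dim=1
        )
        return self.linear(torch.cat([original, crosses], dim=1))


wide = WidePart(num_original=8)
width_feature = wide(torch.randint(0, 2, (4, 8)).float())    # one-hot-style input
print(width_feature.shape)                       # torch.Size([4, 1])
```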
  • S406 The server extracts the depth part in the network through the second feature, and performs embedding processing and convolution processing on the discrete user data to obtain the depth feature of the user.
  • the depth part may be a DNN.
  • the DNN may include an input layer, an embedding layer, at least one hidden layer, and an output layer.
  • the layers are connected in series, wherein the embedding layer is used to convert at least one piece of user component information in the user data into embedding vector form.
  • in step S406, at least one piece of user component information is input into the embedding layer, and the embedding layer performs embedding processing on it, so that the relatively sparse (that is, discrete) user data can be mapped into a low-dimensional space to obtain at least one embedding vector, one embedding vector corresponding to one piece of user component information. The at least one embedding vector is then input into the at least one hidden layer, which performs convolution processing on it and outputs the depth feature of the user, as sketched below.
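  • A minimal sketch (PyTorch assumed; vocabulary sizes and dimensions are illustrative) of the depth part: each discrete user component (e.g. gender, age bucket, region, occupation) is mapped to a low-dimensional embedding, the embeddings are concatenated, and hidden layers produce the user's depth feature.

```python
import torch
import torch.nn as nn


class DeepPart(nn.Module):
    def __init__(self, vocab_sizes, embed_dim=8, hidden=(64, 32)):
        super().__init__()
        self.embeddings = nn.ModuleList(
            nn.Embedding(v, embed_dim) for v in vocab_sizes
        )
        layers, in_dim = [], embed_dim * len(vocab_sizes)
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        self.mlp = nn.Sequential(*layers)

    def forward(self, component_ids):            # (batch, num_components) int ids
        embedded = [emb(component_ids[:, i]) for i, emb in enumerate(self.embeddings)]
        return self.mlp(torch.cat(embedded, dim=1))


deep = DeepPart(vocab_sizes=[2, 10, 50, 20])     # e.g. gender, age, region, job
ids = torch.tensor([[1, 3, 17, 5], [0, 6, 2, 11]])
print(deep(ids).shape)                           # torch.Size([2, 32])
```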
  • the server cascades the width characteristics of the user and the depth characteristics of the user through the fully connected layer to obtain the user characteristics of the user.
  • the server can cascade the width feature of the user and the depth feature of the user through a fully connected (FC) layer.
  • FC fully connected
  • the output user feature is connected to each component of the user's width feature and each component of the user's depth feature.
  • the server performs feature fusion on the width feature of the user and the depth feature of the user to obtain the user feature of the user.
  • the server may not cascade the width feature of the user and the depth feature of the user, but may instead perform feature fusion by obtaining the dot product or the average value, which shortens the duration of feature fusion and reduces the amount of calculation in the feature fusion process.
  • of course, the server can also perform feature fusion between the user's width feature and the user's depth feature through bilinear convergence, which ensures sufficient interaction between the features; the cascade itself is sketched below.
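  • A minimal sketch (PyTorch assumed; dimensions are illustrative) of step S407: the width feature and the depth feature are cascaded (concatenated) and passed through a fully connected layer to give the user feature.

```python
import torch
import torch.nn as nn

width_feature = torch.randn(2, 1)     # from the width part
depth_feature = torch.randn(2, 32)    # from the depth part

fc = nn.Linear(1 + 32, 64)            # fully connected cascade layer
user_feature = fc(torch.cat([width_feature, depth_feature], dim=1))
print(user_feature.shape)             # torch.Size([2, 64])
```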
  • the server inputs the user data of the user into the second feature extraction network, performs feature extraction on the discrete user data through the second feature extraction network, and outputs the user feature of the user. This takes into account the memorization ability of the second feature extraction network through the width part as well as its generalization ability through the depth part, so that the second feature extraction network can express the user feature of the user more accurately.
  • FIG. 7 is a schematic diagram of a second feature extraction network provided by an embodiment of the present invention. Referring to FIG. 7, the left part is the width part, and the right part is the depth part, which will not be repeated here.
  • the server inputs the text corresponding to the video into the third feature extraction network.
  • the text may be the text metadata of the video.
  • the text may be at least one of the title of the video, the tag of the video, the comments on the video, the author of the video, or the summary of the video. The third feature extraction network is similar in architecture to the network in the above step S404, but the parameters of the two networks can be the same or different.
  • since the text metadata, such as the video title, video tag, video comments, video author, or video summary, is usually one or more isolated word vectors, the text is discrete.
  • the discrete text is input into the third feature extraction network, through the function of the third feature extraction network, the discrete text can be converted into a continuous feature vector, which can reflect the joint features of the discrete text.
  • step S408 is similar to the foregoing step S404, and will not be repeated here.
  • the server extracts the width part in the network through the third feature, and performs a generalized linear combination on the discrete text to obtain the width feature of the text.
  • the foregoing step S409 is similar to the foregoing step S405, and will not be repeated here.
  • the server extracts the depth part in the network through the third feature, and performs embedding processing and convolution processing on the discrete text to obtain the depth feature of the text.
  • the foregoing step S410 is similar to the foregoing step S406, and will not be repeated here.
  • S411 The server cascades the width feature of the text and the depth feature of the text through the fully connected layer to obtain the text feature corresponding to the video.
  • the foregoing step S411 is similar to the foregoing step S407, and will not be repeated here.
  • the server performs feature fusion on the width feature of the text and the depth feature of the text to obtain the text feature corresponding to the video.
  • the server may not cascade the width feature of the text and the depth feature of the text, but may instead perform feature fusion by obtaining the dot product or the average value, thereby shortening the duration of feature fusion and reducing the amount of calculation in the fusion process. Of course, the server can also perform feature fusion between the width feature of the text and the depth feature of the text through bilinear convergence, so as to ensure sufficient interaction between the features.
  • the server inputs the text corresponding to the video into the third feature extraction network, performs feature extraction on the discrete text through the third feature extraction network, and outputs the text feature corresponding to the video. In this way, not only the image feature of the video, the audio feature of the video, and the user feature of the user are taken into account, but the role of the video's text metadata is not ignored: the text feature obtained by feature extraction on the text increases the diversity of feature types in the video recommendation process and further improves its accuracy.
  • S412 The server performs bilinear convergence processing on the video feature and the user feature to obtain the first associated feature.
  • the first association feature is used to indicate the feature association relationship between the video and the user.
  • the foregoing step S412 is similar to the foregoing step S403.
  • the server can perform the bilinear convergence processing based on MCB, MLB, or MFB, which improves the efficiency of feature fusion while ensuring sufficient interaction between the video feature and the user feature, and will not be repeated here.
  • the server performs feature fusion on the video feature and the user feature to obtain the first associated feature between the video and the user.
  • the server may not perform bilinear convergence processing on the video feature and the user feature, but may instead perform feature fusion by obtaining the dot product, obtaining the average value, or cascading, so as to further shorten the duration of feature fusion and reduce the amount of calculation in the feature fusion process.
  • S413 The server performs bilinear fusion processing on the text feature and the user feature to obtain a second associated feature.
  • the second association feature is used to indicate the feature association relationship between the text and the user.
  • the above step S413 is similar to the above step S403.
  • the server can perform the bilinear convergence processing based on MCB, MLB, or MFB, which improves the efficiency of feature fusion while ensuring sufficient interaction between the text feature and the user feature, and will not be repeated here.
  • the server performs feature fusion on the text feature and the user feature to obtain the second associated feature between the text and the user.
  • the server may not perform bilinear convergence processing on the text feature and the user feature, but may instead perform feature fusion by obtaining the dot product, obtaining the average value, or cascading, so as to further shorten the duration of feature fusion and reduce the amount of calculation in the feature fusion process.
  • S414 The server performs dot multiplication processing on the first associated feature and the second associated feature to obtain a recommendation probability for the user to recommend the video.
  • the server may perform dot multiplication processing on the first associated feature and the second associated feature, that is, compute the inner product of the first associated feature and the second associated feature: the values at corresponding positions of the two features are multiplied and then summed, and the result is the recommendation probability of the video.
  • the server performs feature fusion based on the video feature and the user feature to obtain the recommendation probability of recommending the video to the user, so that the video can be recommended to the user based on the recommendation probability; see the following step S415 for details.
  • the server may also not perform the above steps S408-S414, that is, not obtain the text feature, but after performing the above step S407, directly perform the dot multiplication processing on the video feature and the user feature to obtain the recommendation probability of recommending the video to the user, thereby avoiding the cumbersome calculation of acquiring the text feature and the subsequent feature fusion and reducing the duration of video recommendation. The dot-product computation is sketched below.
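  • A minimal sketch (PyTorch assumed) of the inner product in step S414: the values at corresponding positions of the first associated feature and the second associated feature are multiplied and summed. The sigmoid that maps the score into [0, 1] is an added assumption for illustration; the text above only specifies the inner product as the recommendation probability.

```python
import torch

video_user = torch.randn(3, 128)      # first associated feature (video ~ user)
text_user = torch.randn(3, 128)       # second associated feature (text ~ user)

# Inner product: multiply values at corresponding positions, then sum.
recommendation_score = (video_user * text_user).sum(dim=1)
recommendation_probability = torch.sigmoid(recommendation_score)  # assumed squashing
print(recommendation_probability)     # one score per candidate video
```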
  • the probability threshold can be any value greater than or equal to 0 and less than or equal to 1.
  • the server compares the recommendation probability with the probability threshold. When the recommendation probability is greater than the probability threshold, it determines to recommend the video to the user; when the recommendation probability is less than or equal to the probability threshold, it determines not to recommend the video to the user.
  • the server determines whether to recommend the video to the user according to the recommendation probability. For different users and different videos, the server can execute the video recommendation process in the above steps S401-S415 to determine whether to recommend any video to any user.
  • the server may also not judge whether to recommend according to the probability threshold, but instead perform the following steps: for each video of the multiple videos, the server repeats the operation of generating a recommendation probability to obtain multiple recommendation probabilities, and ranks each recommendation probability in descending order among the multiple recommendation probabilities. When the probability ranking is less than or equal to the target threshold, it determines to recommend the video to the user; when the probability ranking is greater than the target threshold, it determines not to recommend the video to the user (both decision rules are sketched below).
  • the target threshold may be a value greater than or equal to 1 and less than or equal to the number of the multiple videos.
  • the server can control the number of recommended videos selected by obtaining the probability ranking, and avoid recommending too many videos for users when the probability threshold is small, thereby optimizing the effect of video recommendation.
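  • A minimal sketch of the two decision rules described above: compare each recommendation probability against a fixed probability threshold, or rank the probabilities and keep at most the top target_threshold videos. All values are illustrative.

```python
probabilities = {"video_a": 0.91, "video_b": 0.42, "video_c": 0.77}

# Rule 1: probability threshold.
prob_threshold = 0.5
by_threshold = [v for v, p in probabilities.items() if p > prob_threshold]

# Rule 2: probability ranking, keeping at most `target_threshold` videos.
target_threshold = 2
ranked = sorted(probabilities, key=probabilities.get, reverse=True)
by_ranking = ranked[:target_threshold]

print(by_threshold)   # ['video_a', 'video_c']
print(by_ranking)     # ['video_a', 'video_c']
```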
  • the server can repeat the operations performed in the above steps S401-S415 so as to determine at least one recommended video to recommend to the user and send the video information of the at least one recommended video to the terminal, thereby executing a terminal-side display process similar to steps S206-S210 in the foregoing embodiment, which will not be repeated here.
  • a video is input into a first feature extraction network, feature extraction is performed on at least one continuous video frame of the video through the first feature extraction network, and the video feature of the video is output. Because video features have few types but high dimensionality, high-dimensional video features can be extracted in a targeted manner without increasing the computational pressure. The user data is input into the second feature extraction network, feature extraction is performed on the user data, and the user feature of the user is output. Because user features have many types but low dimensionality, the second feature extraction network can extract low-dimensional user features in a targeted manner, reducing the computational pressure of extracting user features. Feature fusion is then performed based on the video feature and the user feature to obtain the recommendation probability of recommending the video to the user, and whether to recommend the video to the user is determined according to the recommendation probability. In this way, video features and user features of very different natures are extracted by different networks, which avoids the loss of information in the user features and the video features, alleviates the gradient dispersion problem, and improves the accuracy of video recommendation.
  • the image feature of the video is extracted through the TCN, and causal convolution is introduced, so that the image frame feature in the TCN output layer can represent not only the image feature of an image frame but also the temporal relationship between that image frame and the image frames before it.
  • LSTM long short-term memory
  • the feature map obtained after the causal convolution can include the information of each image frame of the image data in the input layer.
  • the audio features of the video are extracted through CNN.
  • when the CNN is a VGG network, as the VGG network deepens, the size of the feature map is halved and its depth is doubled after each pooling, which simplifies the structure of the CNN and facilitates the extraction of high-level audio features.
  • extracting user features through the second feature extraction network not only takes into account the memory ability of the second feature extraction network through the width part, but also takes into account the generalization ability of the second feature extraction network through the depth part, so that the second feature extraction network Can express the user characteristics of the user more accurately.
  • the text features of the video are obtained by extracting the features of the text, so that not only the image features of the video, the audio features of the video, and the user features of the user can be considered, but the effect of the text metadata of the video is not ignored. , Thereby increasing the diversity of feature types in the video recommendation process, and further improving the accuracy of the video recommendation process.
  • extracting text features through the third feature extraction network not only takes into account the memory ability of the third feature extraction network through the width part, but also takes into account the generalization ability of the third feature extraction network through the depth part, so that the third feature extraction network It can express the text features corresponding to the video more accurately.
  • FIG. 8 is a schematic diagram of a video recommendation method provided by an embodiment of the present invention.
  • the server uses different network architectures to extract features of different natures: the video of different modalities, the user data, and the text corresponding to the video are processed by the first feature extraction network, the second feature extraction network, and the third feature extraction network respectively, which reduces the loss of multi-modal fusion information, prevents high-dimensional features from squeezing the expressive ability of low-dimensional features, and reduces the dimensional explosion caused by ineffective fusion.
  • the user’s video viewing preferences and text reading preferences can be characterized from the two dimensions of video features and text features, which enhances the server’s ability to describe and interpret multimodal data.
  • the server uses TCN to extract the image features of the video and CNN to extract the audio features of the video.
  • for the user data, the width part is used to extract the user's width feature and the depth part is used to extract the user's depth feature; for the text, the width part is used to extract the text's width feature and the depth part is used to extract the text's depth feature. Further, features with similar structures are first fused within each class.
  • the image feature and audio feature of the video are fused to obtain the video feature
  • the user’s width feature and the user’s depth feature are fused to obtain the user feature
  • the width feature and the depth feature of the text are fused to obtain the text feature, which reduces the feature dimension and improves fusion efficiency. Features with dissimilar structures are then fused between classes, for example to obtain the first associated feature and the second associated feature, so that the two associated features can be multiplied based on the multi-modal video recommendation method to obtain the recommendation probability. This makes full use of the video feature and the text feature, portrays the video from more dimensions, and expresses the video more accurately, thereby improving the accuracy of video recommendation; a minimal sketch of this end-to-end wiring follows.
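  • The following sketch (PyTorch assumed) ties the pieces of FIG. 8 together: intra-class fusion first (image + audio to video feature, width + depth to user feature and text feature), then inter-class fusion (video with user, text with user), then the dot product giving the recommendation probability. Every module and dimension below is an illustrative stand-in for the networks described above, not the patent's exact configuration.

```python
import torch
import torch.nn as nn


class MultimodalRecommender(nn.Module):
    """Illustrative wiring of FIG. 8; every dimension is an assumption."""

    def __init__(self, d=128):
        super().__init__()
        # Intra-class fusion (stand-ins for the bilinear fusions above).
        self.img_p, self.aud_p = nn.Linear(512, d), nn.Linear(256, d)
        self.user_fc = nn.Linear(1 + 32, d)
        self.text_fc = nn.Linear(1 + 32, d)
        # Inter-class fusion projections.
        self.vu_v, self.vu_u = nn.Linear(d, d), nn.Linear(d, d)
        self.tu_t, self.tu_u = nn.Linear(d, d), nn.Linear(d, d)

    def forward(self, img, aud, u_wide, u_deep, t_wide, t_deep):
        video = torch.tanh(self.img_p(img) * self.aud_p(aud))       # intra-class
        user = self.user_fc(torch.cat([u_wide, u_deep], dim=1))
        text = self.text_fc(torch.cat([t_wide, t_deep], dim=1))
        first = torch.tanh(self.vu_v(video) * self.vu_u(user))      # inter-class
        second = torch.tanh(self.tu_t(text) * self.tu_u(user))
        return torch.sigmoid((first * second).sum(dim=1))           # dot product


model = MultimodalRecommender()
prob = model(torch.randn(2, 512), torch.randn(2, 256),
             torch.randn(2, 1), torch.randn(2, 32),
             torch.randn(2, 1), torch.randn(2, 32))
print(prob.shape)   # torch.Size([2])
```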
  • the first feature extraction network can be obtained by training based on the back-propagation algorithm, and the second feature extraction network and the third feature extraction network can be obtained by training based on the joint wide-and-deep training method.
  • the training process is similar to the foregoing embodiment, except that sample videos, sample user data, and sample text are used, which will not be repeated here.
  • FIG. 9 is a flowchart of a video recommendation method provided by an embodiment of the present invention. See FIG. 9, which will be described in detail below:
  • the server inputs at least one image frame included in at least one continuous video frame of the video into the temporal convolution network in the first feature extraction network, and performs causal convolution on the at least one image frame through the temporal convolution network to obtain the The image characteristics of the video.
  • step S901 is similar to the step S401 in the foregoing embodiment, and will not be repeated here.
  • the server inputs the at least one audio frame included in the at least one continuous video frame into the convolutional neural network in the first feature extraction network, and performs convolution processing on the at least one audio frame through the convolutional neural network to obtain the The audio characteristics of the video.
  • step S902 is similar to the step S402 in the foregoing embodiment, and will not be repeated here.
  • the server performs bilinear convergence processing on the image feature of the video and the audio feature of the video to obtain the video feature of the video.
  • step S903 is similar to the step S403 in the foregoing embodiment, and will not be repeated here.
  • the server inputs the user data of the user into the second feature extraction network.
  • step S904 is similar to the step S404 in the foregoing embodiment, and will not be repeated here.
  • the server extracts the width part in the network through the second feature, and performs a generalized linear combination on the discrete user data to obtain the width feature of the user.
  • step S905 is similar to the step S405 in the foregoing embodiment, and will not be repeated here.
  • S906 The server extracts the depth part in the network through the second feature, and performs embedding processing and convolution processing on the discrete user data to obtain the depth feature of the user.
  • step S906 is similar to the step S406 in the above embodiment, and will not be repeated here.
  • the server cascades the width characteristics of the user and the depth characteristics of the user through the fully connected layer to obtain the user characteristics of the user.
  • step S907 is similar to the step S407 in the foregoing embodiment, and will not be repeated here.
  • S908 The server performs dot multiplication processing on the video feature and the user feature to obtain a recommendation probability of recommending the video to the user.
  • the method of the dot multiplication processing in step S908 is similar to step S414 in the above embodiment, and will not be repeated here.
  • step S909 is similar to the step S415 in the foregoing embodiment, and will not be repeated here.
  • the server can repeatedly perform the operations performed in the above steps S901-S909, so as to determine at least one recommended video recommended to the user, and send the video information of the at least one recommended video to the terminal, thereby executing
  • the terminal side display process similar to steps S206-S210 in the foregoing embodiment will not be repeated here.
  • a video is input into a first feature extraction network, feature extraction is performed on at least one continuous video frame of the video through the first feature extraction network, and the video feature of the video is output. Because video features have few types but high dimensionality, high-dimensional video features can be extracted in a targeted manner without increasing the computational pressure. The user data is input into the second feature extraction network, feature extraction is performed on the user data, and the user feature of the user is output. Because user features have many types but low dimensionality, the second feature extraction network can extract low-dimensional user features in a targeted manner, reducing the computational pressure of extracting user features. Feature fusion is then performed based on the video feature and the user feature to obtain the recommendation probability of recommending the video to the user, and whether to recommend the video to the user is determined according to the recommendation probability. In this way, video features and user features of very different natures are extracted by different networks, which avoids the loss of information in the user features and the video features, alleviates the gradient dispersion problem, and improves the accuracy of video recommendation.
  • FIG. 10 is a schematic structural diagram of a video recommendation device provided by an embodiment of the present invention.
  • the device includes a first output module 1001, a second output module 1002, a fusion obtaining module 1003, and a determining recommendation module 1004, which are described in detail below:
  • the first output module 1001 is configured to input a video into a first feature extraction network, perform feature extraction on at least one continuous video frame in the video through the first feature extraction network, and output video features of the video.
  • the second output module 1002 is configured to input user data of the user into a second feature extraction network, and perform feature extraction on the discrete user data through the second feature extraction network, and output user features of the user.
  • the fusion obtaining module 1003 is used to perform feature fusion based on the video feature and the user feature to obtain a recommendation probability for the user to recommend the video.
  • the determining recommendation module 1004 is configured to determine whether to recommend the video to the user according to the recommendation probability.
  • the device provided by the embodiment of the present invention inputs a video into a first feature extraction network, performs feature extraction on at least one continuous video frame of the video through the first feature extraction network, and outputs the video feature of the video. Because video features have few types but high dimensionality, high-dimensional video features can be extracted in a targeted manner without increasing the computational pressure. The user data is input into the second feature extraction network, feature extraction is performed on the user data, and the user feature of the user is output. Because user features have many types but low dimensionality, the second feature extraction network can extract low-dimensional user features in a targeted manner, reducing the computational pressure of extracting user features. Feature fusion is then performed based on the video feature and the user feature to obtain the recommendation probability of recommending the video to the user, and whether to recommend the video to the user is determined according to the recommendation probability. In this way, video features and user features of very different natures are extracted by different networks, which avoids the loss of information in the user features and the video features, alleviates the gradient dispersion problem, and improves the accuracy of video recommendation.
  • the first output module 1001 includes:
  • the convolution extraction unit is configured to input at least one continuous video frame of the video into the temporal convolutional network and the convolutional neural network in the first feature extraction network respectively, and to perform convolution processing on the at least one continuous video frame through the temporal convolutional network and the convolutional neural network to extract the video feature of the video.
  • the convolution extraction unit includes:
  • the causal convolution subunit is used to input at least one image frame included in at least one continuous video frame in the video into the temporal convolution network in the first feature extraction network, and perform processing on the at least one image frame through the temporal convolution network Causal convolution to obtain the image characteristics of the video.
  • the convolution processing subunit is configured to input at least one audio frame included in the at least one continuous video frame into the convolutional neural network in the first feature extraction network, and convolve the at least one audio frame through the convolutional neural network Process to get the audio characteristics of the video.
  • the fusion subunit is used to perform feature fusion between the image feature of the video and the audio feature of the video to obtain the video feature of the video.
  • the fusion subunit is used to perform bilinear fusion processing on the image feature of the video and the audio feature of the video to obtain the video feature of the video.
  • the second output module 1002 includes:
  • the first input unit is used to input the user data of the user into the second feature extraction network.
  • the first linear combination unit is used to extract the width part of the network through the second feature, and perform a generalized linear combination on the discrete user data to obtain the width feature of the user.
  • the first embedding convolution unit is used to extract the depth part of the network through the second feature, and perform embedding processing and convolution processing on the discrete user data to obtain the depth feature of the user.
  • the first fusion unit is used to perform feature fusion on the width feature of the user and the depth feature of the user to obtain the user feature of the user.
  • the first fusion unit is specifically configured to cascade the width characteristics of the user and the depth characteristics of the user through a fully connected layer to obtain the user characteristics of the user.
  • the fusion obtaining module 1003 is configured to perform dot multiplication processing on the video feature and the user feature to obtain a recommendation probability for the user to recommend the video.
  • the device further includes:
  • the third input module is configured to input the text corresponding to the video into a third feature extraction network, and perform feature extraction on the discrete text through the third feature extraction network, and output text features corresponding to the video.
  • the third input module includes:
  • the second input unit is used to input the text into the third feature extraction network.
  • the second linear combination unit is used to extract the width part of the network through the third feature, and perform generalized linear combination on the discrete text to obtain the width feature of the text.
  • the second embedding convolution unit is used to extract the depth part in the network through the third feature, and perform embedding processing and convolution processing on the discrete text to obtain the depth feature of the text.
  • the second fusion unit is used to perform feature fusion on the width feature of the text and the depth feature of the text to obtain the text feature corresponding to the video.
  • the second fusion unit is specifically configured to cascade the width feature of the text and the depth feature of the text through a fully connected layer to obtain the text feature corresponding to the video.
  • the fusion obtaining module 1003 includes:
  • the third fusion unit is used to perform feature fusion on the video feature and the user feature to obtain the first associated feature between the video and the user.
  • the third fusion unit is also used to perform feature fusion on the text feature and the user feature to obtain a second association feature between the text and the user.
  • the dot product unit is configured to perform dot product processing on the first associated feature and the second associated feature to obtain a recommendation probability for the user to recommend the video.
  • the third fusion unit is specifically configured to perform bilinear fusion processing on the video feature and the user feature to obtain the first associated feature.
  • the third fusion unit is also used to perform bilinear fusion processing on the text feature and the user feature to obtain the second associated feature.
  • the determining recommendation module 1004 is configured to determine to recommend the video to the user when the recommendation probability is greater than the probability threshold, and to determine not to recommend the video to the user when the recommendation probability is less than or equal to the probability threshold.
  • the determining recommendation module 1004 is configured to repeatedly perform the operation of generating a recommendation probability for each video of more than one video to obtain more than one recommendation probability, and to rank each recommendation probability in descending order among the more than one recommendation probabilities. When the probability ranking is less than or equal to the target threshold, it determines to recommend the video corresponding to that probability ranking to the user; when the probability ranking is greater than the target threshold, it determines not to recommend the video corresponding to that probability ranking to the user.
  • when the video recommendation device provided in the above embodiment recommends videos, only the division into the above functional modules is used as an example; in practical applications, the above functions can be allocated to different functional modules as required, that is, the internal structure of the computer device can be divided into different functional modules to complete all or part of the functions described above.
  • the video recommendation device provided in the foregoing embodiment and the video recommendation method embodiment belong to the same concept, and the specific implementation process is detailed in the video recommendation method embodiment, which will not be repeated here.
  • FIG. 11 is a schematic structural diagram of a recommended video display device provided by an embodiment of the present invention.
  • the device includes a display module 1101, a sending module 1102, and a display module 1103, which will be described in detail below:
  • the display module 1101 is configured to display a video display interface, and the video display interface includes at least one first recommended video.
  • the sending module 1102 is configured to, when a click operation on any first recommended video is detected, send the viewing record of the first recommended video to the server in response to the click operation, where the viewing record is used to instruct the server to optimize the training of the video recommendation model based on the viewing record and to return the video information of at least one second recommended video in real time.
  • the display module 1103 is configured to display the at least one second recommended video in the video display interface based on the video information of the at least one second recommended video when the video information of the at least one second recommended video is received.
  • by displaying at least one first recommended video on the video display interface, when a user's click operation on any first recommended video is detected, the viewing record of the recommended video is sent to the server in response to the click operation, so that the quality of the first recommended video can be reported in time, and the server can label the first recommended video based on the viewing record to distinguish true samples from false samples and perform optimization training of the video recommendation model based on the first recommended video. The server can also return the video information of at least one second recommended video to the terminal according to the optimized video recommendation model, and the at least one second recommended video is displayed in the video display interface, so that following the user's click operation the display interface is updated in real time to show recommended videos with higher recommendation accuracy.
  • when the recommended video display device provided in the above embodiment displays recommended videos, only the division into the above functional modules is used as an example; in practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the electronic device can be divided into different functional modules to complete all or part of the functions described above.
  • the interactive embodiments of the recommended video display device and the video recommendation method provided in the above embodiments belong to the same concept. For the specific implementation process, please refer to the video recommendation method embodiment, which will not be repeated here.
  • FIG. 12 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
  • the computer device 1200 may vary greatly due to differences in configuration or performance, and may include one or more processors (central processing units, CPU) 1201 and one or more memories 1202, wherein the memory 1202 stores at least one computer-readable instruction, and the at least one computer-readable instruction is loaded and executed by the processor 1201 to implement the video recommendation method provided by the foregoing video recommendation method embodiments.
  • the computer device may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, and may also include other components for implementing device functions, which will not be repeated here.
  • FIG. 13 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
  • the electronic device 1300 can be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, a moving picture expert compression standard audio layer 3), MP4 (Moving Picture Experts Group Audio Layer IV, a moving picture expert compression standard Audio level 4) Player, laptop or desktop computer.
  • the electronic device 1300 may also be called user equipment, portable electronic device, laptop electronic device, desktop electronic device, and other names.
  • the electronic device 1300 includes a processor 1301 and a memory 1302.
  • the processor 1301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on.
  • the processor 1301 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array).
  • the processor 1301 may also include a main processor and a coprocessor.
  • the main processor is a processor used to process data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor used to process data in the standby state.
  • the processor 1301 may be integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is used to render and draw content that needs to be displayed on the display screen.
  • the processor 1301 may further include an AI (Artificial Intelligence) processor, which is used to process computing operations related to machine learning.
  • AI Artificial Intelligence
  • the memory 1302 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 1302 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • the non-transitory computer-readable storage medium in the memory 1302 is used to store at least one computer-readable instruction, and the at least one computer-readable instruction is used to be executed by the processor 1301 to implement the recommended video display method provided by the method embodiments of the present application.
  • the electronic device 1300 may optionally further include: a peripheral device interface 1303 and at least one peripheral device.
  • the processor 1301, the memory 1302, and the peripheral device interface 1303 may be connected by a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface 1303 through a bus, a signal line, or a circuit board.
  • the peripheral device includes: at least one of a radio frequency circuit 1304, a touch display screen 1305, a camera 1306, an audio circuit 1307, a positioning component 1308, and a power supply 1309.
  • the peripheral device interface 1303 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 1301 and the memory 1302.
  • the processor 1301, the memory 1302, and the peripheral device interface 1303 are integrated on the same chip or circuit board; in some other embodiments, any one of the processor 1301, the memory 1302, and the peripheral device interface 1303 or The two can be implemented on separate chips or circuit boards, which are not limited in this embodiment.
  • the radio frequency circuit 1304 is used for receiving and transmitting RF (Radio Frequency, radio frequency) signals, also called electromagnetic signals.
  • the radio frequency circuit 1304 communicates with a communication network and other communication devices through electromagnetic signals.
  • the radio frequency circuit 1304 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
  • the radio frequency circuit 1304 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a user identity module card, and so on.
  • the radio frequency circuit 1304 can communicate with other electronic devices through at least one wireless communication protocol.
  • the wireless communication protocol includes but is not limited to: metropolitan area network, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area network and/or WiFi (Wireless Fidelity, wireless fidelity) network.
  • the radio frequency circuit 1304 may also include NFC (Near Field Communication) related circuits, which is not limited in this application.
  • the display screen 1305 is used to display UI (User Interface).
  • the UI can include graphics, text, icons, videos, and any combination thereof.
  • the display screen 1305 also has the ability to collect touch signals on or above the surface of the display screen 1305.
  • the touch signal may be input to the processor 1301 as a control signal for processing.
  • the display screen 1305 may also be used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards.
  • the display screen 1305 may be a flexible display screen, which is arranged on the curved surface or the folding surface of the electronic device 1300. Furthermore, the display screen 1305 can also be set as a non-rectangular irregular pattern, that is, a special-shaped screen.
  • the display screen 1305 can be made of materials such as LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode, organic light-emitting diode).
  • the camera assembly 1306 is used to collect images or videos.
  • the camera assembly 1306 includes a front camera and a rear camera.
  • the front camera is set on the front panel of the electronic device, and the rear camera is set on the back of the electronic device.
  • the camera assembly 1306 may also include a flash.
  • the flash can be a single color temperature flash or a dual color temperature flash. A dual color temperature flash refers to a combination of a warm-light flash and a cold-light flash, which can be used for light compensation under different color temperatures.
  • the audio circuit 1307 may include a microphone and a speaker.
  • the microphone is used to collect the sound waves of the user and the environment, and convert the sound waves into electrical signals and input to the processor 1301 for processing, or input to the radio frequency circuit 1304 to realize voice communication.
  • the microphone can also be an array microphone or an omnidirectional acquisition microphone.
  • the speaker is used to convert the electrical signal from the processor 1301 or the radio frequency circuit 1304 into sound waves.
  • the speaker can be a traditional membrane speaker or a piezoelectric ceramic speaker.
  • when the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but also convert electrical signals into sound waves inaudible to humans for purposes such as distance measurement.
  • the audio circuit 1307 may also include a headphone jack.
  • the positioning component 1308 is used to locate the current geographic location of the electronic device 1300 to implement navigation or LBS (Location Based Service, location-based service).
  • the positioning component 1308 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
  • the power supply 1309 is used to supply power to various components in the electronic device 1300.
  • the power source 1309 may be alternating current, direct current, disposable batteries or rechargeable batteries.
  • the rechargeable battery may support wired charging or wireless charging.
  • the rechargeable battery can also be used to support fast charging technology.
  • the electronic device 1300 further includes one or more sensors 1310.
  • the one or more sensors 1310 include, but are not limited to, an acceleration sensor 1311, a gyroscope sensor 1312, a pressure sensor 1313, a fingerprint sensor 1314, an optical sensor 1315, and a proximity sensor 1316.
  • the acceleration sensor 1311 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established by the electronic device 1300.
  • the acceleration sensor 1311 can be used to detect the components of the gravitational acceleration on three coordinate axes.
  • the processor 1301 may control the touch screen 1305 to display the user interface in a horizontal view or a vertical view according to the gravity acceleration signal collected by the acceleration sensor 1311.
  • the acceleration sensor 1311 may also be used for the collection of game or user motion data.
  • the gyroscope sensor 1312 can detect the body direction and rotation angle of the electronic device 1300, and the gyroscope sensor 1312 can cooperate with the acceleration sensor 1311 to collect the user's 3D actions on the electronic device 1300. Based on the data collected by the gyroscope sensor 1312, the processor 1301 can implement the following functions: motion sensing (such as changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
  • the pressure sensor 1313 may be disposed on the side frame of the electronic device 1300 and/or the lower layer of the touch screen 1305.
  • when the pressure sensor 1313 is disposed on the side frame of the electronic device 1300, the processor 1301 performs left-hand/right-hand recognition or shortcut operations according to the holding signal collected by the pressure sensor 1313.
  • when the pressure sensor 1313 is disposed on the lower layer of the touch display screen 1305, the processor 1301 controls the operability controls on the UI interface according to the user's pressure operation on the touch display screen 1305.
  • the operability control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.
  • the fingerprint sensor 1314 is used to collect the user's fingerprint.
  • the processor 1301 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 1314, or the fingerprint sensor 1314 identifies the user's identity according to the collected fingerprint. When it is recognized that the user's identity is a trusted identity, the processor 1301 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings.
  • the fingerprint sensor 1314 may be provided on the front, back or side of the electronic device 1300. When the electronic device 1300 is provided with a physical button or a manufacturer logo, the fingerprint sensor 1314 may be integrated with the physical button or the manufacturer logo.
  • the optical sensor 1315 is used to collect the ambient light intensity.
  • the processor 1301 may control the display brightness of the touch screen 1305 according to the intensity of the ambient light collected by the optical sensor 1315. Specifically, when the ambient light intensity is high, the display brightness of the touch screen 1305 is increased; when the ambient light intensity is low, the display brightness of the touch screen 1305 is decreased.
  • the processor 1301 may also dynamically adjust the shooting parameters of the camera assembly 1306 according to the ambient light intensity collected by the optical sensor 1315.
  • the proximity sensor 1316, also called a distance sensor, is usually arranged on the front panel of the electronic device 1300.
  • the proximity sensor 1316 is used to collect the distance between the user and the front of the electronic device 1300.
  • when the proximity sensor 1316 detects that the distance between the user and the front of the electronic device 1300 gradually decreases, the processor 1301 controls the touch screen 1305 to switch from the on-screen state to the off-screen state; when the proximity sensor 1316 detects that the distance between the user and the front of the electronic device 1300 gradually increases, the processor 1301 controls the touch screen 1305 to switch from the off-screen state to the on-screen state.
  • the structure shown in FIG. 13 does not constitute a limitation on the electronic device 1300; the device may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
  • a non-volatile computer-readable storage medium which stores computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the above-mentioned video recommendation method or the above-mentioned recommended video display method.
  • the computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
  • the program can be stored in a computer-readable storage medium.
  • the storage medium can be read-only memory, magnetic disk or optical disk, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Computer Graphics (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A video recommendation method, comprising: inputting a video into a first feature extraction network, performing feature extraction on at least one continuous video frame in the video, and outputting a video feature of the video; inputting user data of a user into a second feature extraction network, performing feature extraction on the discrete user data, and outputting a user feature of the user; performing feature fusion on the basis of the video feature and the user feature to obtain a recommendation probability of recommending the video to the user; and determining, according to the recommendation probability, whether to recommend the video to the user.

Description

视频推荐方法、装置、计算机设备及存储介质
本申请要求于2019年04月23日提交中国专利局,申请号为201910330212.9、发明名称为“视频推荐方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本发明涉及机器学习领域,特别涉及一种视频推荐方法、装置、计算机设备及存储介质。
背景技术
随着网络技术的发展,越来越多的用户能够通过终端随时地观看视频,服务器可以从海量的视频数据库中为用户推荐一些用户可能感兴趣的视频,从而能够更好的满足用户的视频观看需求。
在推荐过程中,服务器可以基于注意力协同(attentive collaborative filtering,ACT)模型提取视频库中任一视频与用户之间的联合特征,对视频库中的每个视频重复执行上述步骤,获取到与多个视频对应的多个联合特征,进一步根据多个联合特征在两两之间的欧几里得距离,得到所有联合特征的排序,从而将排序靠前的联合特征所对应的视频推荐给用户。
然而,由于用户特征通常种类多、维度低,而视频特征通常种类少、维度高,可见用户特征与视频特征的性质差别巨大,而在上述ACT模型中,由于用户特征与视频特征的性质差别,会容易丢失用户特征和视频特征中的信息,还容易引发ACT模型的梯度弥散,影响了视频推荐的准确度。
发明内容
本发明实施例提供了一种视频推荐方法、装置、计算机设备及存储介质,一种推荐视频展示方法、装置、电子设备和存储介质。
一种视频推荐方法,由计算机设备执行,该方法包括:
将视频输入第一特征提取网络,通过该第一特征提取网络对该视频中的至少一个连续视频帧进行特征提取,输出该视频的视频特征;
将用户的用户数据输入第二特征提取网络,通过该第二特征提取网络对离散的该用户数据进行特征提取,输出该用户的用户特征;
基于该视频特征和该用户特征进行特征融合,得到对该用户推荐该视频的推荐概率;及
根据该推荐概率,确定是否对该用户推荐该视频。
一种推荐视频展示方法,由电子设备执行,该方法包括:
显示视频展示界面,该视频展示界面中包括至少一个第一推荐视频;
当检测到对任一第一推荐视频的点击操作时,响应于该点击操作,将该第一推荐视频的观看记录发送至服务器,该观看记录用于指示该服务器基于该观看记录对视频推荐模型进行优化训练,并实时返回至少一个第二推荐视频的视频信息;及
当接收到该至少一个第二推荐视频的视频信息时,基于该至少一个第二推荐视频的视频信息,在该视频展示界面中展示该至少一个第二推荐视频。
一种视频推荐装置,该装置包括:
第一输出模块,用于将视频输入第一特征提取网络,通过该第一特征提取网络对该视频中的至少一个连续视频帧进行特征提取,输出该视频的视频特征;
第二输出模块,用于将用户的用户数据输入第二特征提取网络,通过该第二特征提取网络对离散的该用户数据进行特征提取,输出该用户的用户特征;
融合得到模块,用于基于该视频特征和该用户特征进行特征融合,得到对该用户推荐该视频的推荐概率;及
确定推荐模块,用于根据该推荐概率,确定是否对该用户推荐该视频。
在其中一个实施例中,该第一输出模块包括:
卷积提取单元,用于将视频中的至少一个连续视频帧分别输入第一特征提取网络中的时间卷积网络和卷积神经网络,通过该时间卷积网络和该卷积神经网络对该至少一个连续视频帧进行卷积处理,提取该视频的视频特征。
在其中一个实施例中,该卷积提取单元包括:
因果卷积子单元,用于将视频中的至少一个连续视频帧所包括的至少一个图像帧输入第一特征提取网络中的时间卷积网络,通过该时间卷积网络对该至少一个图像帧进行因果卷积,得到该视频的图像特征;
卷积处理子单元,用于将该至少一个连续视频帧所包括的至少一个音频帧输入第一特征提取网络中的卷积神经网络,通过该卷积神经网络对该至少一个音频帧进行卷积处理,得到该视频的音频特征;及
融合子单元,用于将该视频的图像特征与该视频的音频特征进行特征融合,得到该视频的视频特征。
在其中一个实施例中,该融合子单元用于:
将该视频的图像特征与该视频的音频特征进行双线性汇合处理,得到该视频的视频特征。
在其中一个实施例中,该第二输出模块包括:
第一输入单元,用于将该用户的用户数据输入第二特征提取网络;
第一线性组合单元,用于通过该第二特征提取网络中的宽度部分,对离散的该用户数据进行广义线性组合,得到该用户的宽度特征;
第一嵌入卷积单元,用于通过该第二特征提取网络中的深度部分,对离散的该用户数据进行嵌入处理和卷积处理,得到该用户的深度特征;及
第一融合单元,用于对该用户的宽度特征和该用户的深度特征进行特征融合,得到该用户的用户特征。
在其中一个实施例中,该第一融合单元用于:
通过全连接层对该用户的宽度特征和该用户的深度特征进行级联,得到该用户的用户特征。
在其中一个实施例中,该融合得到模块用于:
对该视频特征和该用户特征进行点乘处理,得到对该用户推荐该视频的推荐概率。
在其中一个实施例中,该装置还包括:
第三输入模块,用于将与该视频对应的文本输入第三特征提取网络,通过该第三特征提取网络对离散的该文本进行特征提取,输出与该视频对应的文本特征。
在其中一个实施例中,该第三输入模块包括:
第二输入单元,用于将该文本输入第三特征提取网络;
第二线性组合单元,用于通过该第三特征提取网络中的宽度部分,对离散的该文本进行广义线性组合,得到该文本的宽度特征;
第二嵌入卷积单元,用于通过该第三特征提取网络中的深度部分,对离散的该文本进行嵌入处理和卷积处理,得到该文本的深度特征;及
第二融合单元,用于对该文本的宽度特征和该文本的深度特征进行特征融合,得到与该视频对应的文本特征。
在其中一个实施例中,该第二融合单元用于:
通过全连接层对该文本的宽度特征和该文本的深度特征进行级联,得到该与该视频对应的文本特征。
在其中一个实施例中,该第二融合单元还用于通过全连接层对该文本的宽度特征和该文本的深度特征进行级联,得到与该视频对应的文本特征。
在其中一个实施例中,该融合得到模块包括:
第三融合单元,用于对该视频特征和该用户特征进行特征融合,得到该视频与该用户之间的第一关联特征;
该第三融合单元,还用于对该文本特征和该用户特征进行特征融合,得到该文本与该用户之间的第二关联特征;及
点乘单元,用于对该第一关联特征和该第二关联特征进行点乘处理,得到对该用户推荐该视频的推荐概率。
在其中一个实施例中,该第三融合单元用于:
将该视频特征与该用户特征进行双线性汇合处理,得到该视频与该用户之间的第一关联特征;
该第三融合单元还用于:
将该文本特征与该用户特征进行双线性汇合处理,得到该文本与该用户之间的第二关联特征。
在其中一个实施例中,该确定推荐模块用于:
当该推荐概率大于概率阈值时,确定为该用户推荐该视频;及
当该推荐概率小于或等于该概率阈值时,确定不为该用户推荐该视频。
在其中一个实施例中,该确定推荐模块用于:
对多于一个视频中的每个视频,重复执行生成推荐概率的操作,得到多于一个推荐概率;
获取每个推荐概率分别在该多于一个推荐概率中从大到小的概率排序,当该概率排序小于或等于目标阈值时,确定为该用户推荐相应概率排序所对应的该视频;及
当该概率排序大于该目标阈值时,确定不为该用户推荐相应概率排序所对应的该视频。
一种推荐视频展示装置,该装置包括:
显示模块,用于显示视频展示界面,该视频展示界面中包括至少一个第一推荐视频;
发送模块,用于当检测到对任一第一推荐视频的点击操作时,响应于该点击操作,将该第一推荐视频的观看记录发送至服务器,该观看记录用于指示该服务器基于该观看记录对视频推荐模型进行优化训练,并实时返回至少一个第二推荐视频的视频信息;及
展示模块,用于当接收到该至少一个第二推荐视频的视频信息时,基于该至少一个第二推荐视频的视频信息,在该视频展示界面中展示该至少一个第二推荐视频。
一种计算机设备,该计算机设备包括处理器和存储器,该存储器中存储有计算机可读指令,该计算机可读指令被该处理器执行时,使得该处理器执行如上所述的视频推荐方法的步骤。
一种电子设备,该电子设备包括处理器和存储器,该存储器中存储有计算机可读指令,该计算机可读指令被该处理器执行时,使得该处理器执行如上所述的推荐视频展示方法的步骤。
一种非易失性的计算机可读存储介质,存储有计算机可读指令,该计算机可读指令被一个或多个处理器执行时,使得该一个或多个处理器执行如上所述的视频推荐方法的步骤,或,如上所述的推荐视频展示方法的步骤。
附图说明
为了更清楚地说明本发明实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本发明实施例提供的一种视频推荐方法的实施环境示意图;
图2是本发明实施例提供的一种视频推荐方法的交互流程图;
图3是本发明实施例提供的一种视频展示界面的示意图;
图4是本发明实施例提供的一种视频推荐方法的流程图;
图5是本发明实施例提供的一种时间卷积网络的示意图;
图6是本发明实施例提供的一种时间卷积网络的示意图;
图7是本发明实施例提供的一种第二特征提取网络的示意图;
图8是本发明实施例提供的一种视频推荐方法的示意图;
图9是本发明实施例提供的一种视频推荐方法的流程图;
图10是本发明实施例提供的一种视频推荐装置的结构示意图;
图11是本发明实施例提供的一种推荐视频展示装置的结构示意图;
图12是本发明实施例提供的计算机设备的结构示意图;
图13是本发明实施例提供的电子设备的结构示意图。
具体实施方式
为使本发明的目的、技术方案和优点更加清楚,下面将结合附图对本发明实施方式作进一步地详细描述。
图1是本发明实施例提供的一种视频推荐方法的实施环境示意图。参见图1,在该实施环境中可以包括至少一个终端101和服务器102,各终端101和服务器102分别通过网络连接进行通信。
其中,该至少一个终端101用于浏览视频,该服务器102用于向该至少一个终端101所对应的至少一个用户推荐视频。
在一些实施例中,该至少一个终端101中每个终端上都可以安装有应用 客户端,该应用客户端可以是任一能够提供视频浏览服务的客户端,使得服务器102可以基于用户在该应用客户端上的行为日志,收集样本用户数据和样本视频,从而根据该样本用户数据和样本视频,训练得到第一特征提取网络、第二特征提取网络以及第三特征提取网络。
在上述基础上,服务器102能够基于第一特征提取网络、第二特征提取网络以及第三特征提取网络,对任一用户确定是否推荐任一视频,从而在一些实施例中,服务器102能够从多个视频中为每个用户筛选出至少一个视频,从而可以实现对用户进行视频推荐,服务器102将确定推荐的至少一个视频发送至该至少一个终端101之后,该至少一个终端101可以基于视频展示界面来展示至少一个推荐视频,其中,该至少一个推荐视频也即是服务器为该终端所对应的用户推荐的至少一个视频。
图2是本发明实施例提供的一种视频推荐方法的交互流程图,参见图2,该实施例应用于计算机设备和电子设备的交互过程中,本发明仅以该计算机设备为服务器,该电子设备为终端为例进行说明,该实施例包括:
S201、服务器将视频输入第一特征提取网络,通过该第一特征提取网络对该视频中的至少一个连续视频帧进行特征提取,输出该视频的视频特征。
其中,该视频可以是本地视频库中的任一视频,该视频也可以是从云端下载的任一视频,该视频可以包括至少一个连续视频帧。
S202、服务器将用户的用户数据输入第二特征提取网络,通过该第二特征提取网络对离散的该用户数据进行特征提取,输出该用户的用户特征。
其中,该用户可以是任一终端所对应的用户,该用户数据可以包括用户个人信息和视频偏好,该个人信息可以包括用户性别、用户年龄、用户所在地域或者用户职业中的至少一项,该个人信息可以是用户向服务器授权的信息。该视频偏好可以由服务器对用户的视频观看行为日志进行数据分析来得到。
在一些实施例中,由于用户数据中的个人信息、及视频偏好等通常是一个或多个孤立的词向量,因此用户数据是离散的,此时将离散的用户数据输入第二特征提取网络之后,通过第二特征提取网络的作用,能够将离散的用户数据转换为一个连续的特征向量,该特征向量能够体现出离散的各个用户 数据的联合特征。
S203、服务器基于该视频特征和该用户特征进行特征融合,得到对该用户推荐该视频的推荐概率。
在一些实施例中,服务器可以对视频特征和用户特征进行点乘处理,也即是对该视频特征和该用户特征求内积,将该视频特征和该用户特征中对应位置的数值相乘后进行求和所得到的数值获取为推荐概率。
S204、服务器根据该推荐概率,确定是否对该用户推荐该视频。
在步骤S204中,服务器根据该推荐概率,确定是否对该用户推荐该视频,而对于不同的用户以及不同的视频,服务器均可以执行上述步骤S201-S204中的视频推荐流程,从而能够确定是否对任一用户推荐任一视频。在本发明实施例中,执行下述步骤S205,是以对同一个用户确定至少一个第一推荐视频为例进行说明,而对于不同的用户,是类似的过程,这里不再赘述。
S205、服务器重复执行上述步骤S201-S204,确定对该用户推荐的至少一个第一推荐视频,将该至少一个第一推荐视频的视频信息发送至该用户所对应的终端。
在步骤S205中,服务器可以为第一推荐视频设置推荐数量阈值,该推荐数量阈值可以是任一大于或等于1的数值,对于不同的用户,该推荐数量阈值可以相同也可以不同。
在一些实施例中,服务器可以对用户观看视频的行为日志进行分析,使得与该用户对应的推荐数量阈值与该用户的日均视频观看时长成正相关,也即是当用户日均视频观看时长越长时,与该用户对应的第一推荐视频的数量越多。例如,如果用户日均视频观看时长为1小时,可以为用户的终端发送2个第一推荐视频,而如果用户日均视频观看时长为3小时,可以为用户的终端发送6个第一推荐视频。
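One possible reading of the quota rule above, as a sketch only: the 2-videos-per-hour factor is taken from the example, while the rounding and the floor of 1 are assumptions.

```python
def recommendation_quota(daily_watch_hours, per_hour=2):
    """Number of first recommended videos, growing with average daily watch time.
    The factor of 2 per hour matches the example above and is otherwise arbitrary."""
    return max(1, round(daily_watch_hours * per_hour))

recommendation_quota(1)   # -> 2 videos, as in the 1-hour example
recommendation_quota(3)   # -> 6 videos, as in the 3-hour example
```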
S206、终端接收该至少一个第一推荐视频的视频信息。
其中,该视频信息可以是该至少一个第一推荐视频的缩略图、网页链接或者文本中的至少一项。例如,对某一个第一推荐视频而言,该视频信息可以包括该第一推荐视频的缩略图、网页链接、标题、作者信息和摘要,本发明实施例不对该视频信息的内容进行具体限定。当然,该视频信息也可以就是该至少一个第一推荐视频本身,从而避免了终端在后续交互过程中频繁地 向服务器发送访问请求。
S207、当终端检测到用户对视频功能入口的点击操作时,显示视频展示界面,该视频展示界面中包括至少一个第一推荐视频。
其中,该视频功能入口可以是终端上任一支持视频展示的应用客户端所提供的,该视频展示界面上可以包括至少一个用户交互(user interface,UI)卡片,每个用户交互卡片用于展示一个第一推荐视频。当然,该视频展示界面上也可以包括至少一个窗口,每个窗口用于展示一个第一推荐视频,本发明实施例不对在视频展示界面中展示第一推荐视频的形式进行具体限定。
在一些实施例中,该视频功能入口可以是应用客户端的主界面上的一个功能选项,从而当终端检测到用户对该功能选项的点击操作时,从该应用客户端的主界面切换显示该视频展示界面。图3是本发明实施例提供的一种视频展示界面的示意图,参见图3,终端可以在该视频展示界面上展示多个第一推荐视频。
当然,在一些实施例中,该视频功能入口也可以是该应用客户端的图标,使得当终端检测到对该应用客户端的图标的点击操作时,终端直接启动该应用客户端,显示该视频展示界面,这种情况也即是该应用客户端的主界面为该视频展示界面。
其中,该至少一个第一推荐视频基于多个推荐概率确定,一个推荐概率可以是基于第一特征提取网络输出的当前用户的用户特征、第二特征提取网络输出的待推荐视频的视频特征或者第三特征提取网络输出的文本特征中的至少一项进行融合所得到的概率。
在上述步骤S207中,终端可以仅在视频展示界面中展示该至少一个第一推荐视频的视频信息,当检测到用户对任一第一推荐视频的点击操作时,向该第一推荐视频所对应的网页链接发送访问请求,从而在本地缓存该第一推荐视频,基于视频展示控件播放该第一推荐视频,能够节约终端的存储空间,提升终端的处理效率。
在一些实施例中,终端还可以在显示视频展示界面的同时,对该至少一个第一推荐视频中每个第一推荐视频所对应的网页链接均发送访问请求,在本地缓存该至少一个第一推荐视频,当检测到用户对任一第一推荐视频的点 击操作时,直接基于视频展示控件播放该第一推荐视频,从而能够在显示视频展示界面的时候就完成界面上每个第一推荐视频的加载过程,当用户点击时能够及时播放第一推荐视频,从而缩短用户等待视频加载的时长,优化了视频推荐的效果。
当然,如果服务器直接将该至少一个第一推荐视频发送至终端,那么终端还可以在显示该视频展示界面后,直接自动播放推荐概率最高的视频,从而可以简化视频播放的流程。
S208、当终端检测到对任一第一推荐视频的点击操作时,响应于该点击操作,将该第一推荐视频的观看记录发送至服务器,该观看记录用于指示该服务器基于该观看记录对视频推荐模型进行优化训练,并实时返回至少一个第二推荐视频的视频信息。
在上述过程中,终端响应于用户对任一第一推荐视频的点击操作,将该第一推荐视频的观看记录发送至服务器,该观看记录可以包括该第一推荐视频的曝光时长、累计观看次数等。
S209、当服务器接收到该观看记录,基于该观看记录对视频推荐模型进行优化训练,根据优化训练后的视频推荐模型确定至少一个第二推荐视频,将该至少一个第二推荐视频的视频信息发送至终端。
其中,该视频推荐模型包括第一特征提取网络、第二特征提取网络或者第三特征提取网络中的至少一项。
在上述训练过程中,服务器能够收集各个用户对各个第一推荐视频的观看记录,基于该观看记录,将曝光时长大于预设时长的第一推荐视频标记为优化训练过程中的正例(也即是标记为真),将曝光时长小于或等于该预设时长的第一推荐视频标记为优化训练过程中的负例(也即是标记为假),具体训练过程与下述实施例中的视频推荐方法类似,只是需要将视频替换为经过标记后的第一推荐视频,这里不做赘述,通过上述步骤S209能够实现对视频推荐模型的动态优化训练。
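A sketch of the labelling step described above, marking each watched first recommendation as a positive or negative training example by its exposure duration; the record fields and the 30-second preset duration are assumptions for illustration.

```python
def label_watch_records(records, preset_seconds=30):
    """Label each first-recommendation watch record True (positive) if its exposure
    duration exceeds the preset duration, otherwise False (negative)."""
    return [(r["video_id"], r["exposure_seconds"] > preset_seconds) for r in records]

label_watch_records([{"video_id": "a1", "exposure_seconds": 45},
                     {"video_id": "b2", "exposure_seconds": 5}])
# -> [('a1', True), ('b2', False)]
```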
上述确定第二推荐视频以及发送第二推荐视频的视频信息的过程与上述步骤S201-S205类似,这里不做赘述。
S210、当终端接收到该至少一个第二推荐视频的视频信息时,基于该至少一个第二推荐视频的视频信息,在该视频展示界面中展示至少一个第二推 荐视频。
上述步骤S210与步骤S206-S207类似,这里不做赘述。
在上述过程中,当终端检测到用户对任一第一推荐视频的点击操作时,终端响应于点击操作向服务器发送观看记录,而服务器即刻对视频推荐模型中的各个特征提取网络进行优化训练,然后确定至少一个第二推荐视频,从而由终端对各个第二推荐视频进行展示,使得在用户点击某一第一推荐视频之前和点击该第一推荐视频之后,在视频展示界面中会显示不同的推荐结果。
例如,服务器原本预测某一用户喜欢猫的视频的概率和喜欢狗的视频的概率一样大,因此确定的10个第一推荐视频中包括5个猫的视频和5个狗的视频,而当用户点击了终端上推送的猫的视频并且曝光时间大于预设时长时,终端向服务器发送观看记录,服务器将猫的视频标记为正例后,对视频推荐模型中的各个特征提取网络进行优化训练,由于猫的视频正例数量增加了1个,从而可能会使得服务器预测该用户喜欢猫的视频的概率大于喜欢狗的视频的概率,从而在新一轮预测过程中,确定的10个第二推荐视频中包括7个猫的视频和3个狗的视频。
在一些实施例中,服务器还可以在接收到观看记录之后,不即刻执行优化训练的过程,而是定时对视频推荐模型中的各个特征提取网络进行优化训练,例如,服务器在每天的零点根据前一天的一个或多个观看记录进行优化训练,向终端发送第二推荐视频,使得终端对视频展示界面中展示的推荐视频进行更新,从而避免了每增加一个观看记录就对视频推荐模型中的各个特征提取网络训练一次,改善了特征提取网络的性能颠簸问题,增加了特征提取网络的稳定性。
本发明实施例提供的方法,通过将视频输入第一特征提取网络,通过该第一特征提取网络对该视频中的至少一个连续视频帧进行特征提取,输出该视频的视频特征,由于视频特征种类少、维度高,从而在不增加太大的计算压力的情况下,有针对性地提取高维度的视频特征,将用户的用户数据输入第二特征提取网络,通过该第二特征提取网络对离散的用户数据进行特征提取,输出该用户的用户特征,由于用户特征种类多、维度低,从而可以基于第二特征提取网络,有针对性地提取低维度的用户特征,减小了提取用户特 征的计算压力,基于该视频特征和该用户特征进行特征融合,得到对该用户推荐该视频的推荐概率,根据该推荐概率,确定是否对该用户推荐该视频,从而对于性质差别较大的用户特征和视频特征,分别采用不同的网络进行特征提取,避免了丢失用户特征和视频特征中的信息,改善了梯度弥散的问题,提高了视频推荐的准确度。
另一方面,在终端侧显示视频展示界面,在该视频展示界面上展示至少一个第一推荐视频,当检测到用户对任一第一推荐视频的点击操作时,响应于该点击操作,将该推荐视频的观看记录发送至服务器,从而能够及时向用户反馈本次第一推荐视频的质量优劣,使得服务器能够基于该观看记录对该第一推荐视频进行真假样本的区分标记,将该第一推荐视频作为新一轮优化训练中的样本视频,实现了对视频推荐模型的动态优化训练,并且服务器还可以根据优化训练后的视频推荐模型向终端返回至少一个第二推荐视频的视频信息,当终端接收到至少一个第二推荐视频的视频信息时,基于该至少一个第二推荐视频的视频信息,在该视频展示界面中展示该至少一个第二推荐视频,使得随着用户的点击操作,能够在视频展示界面上实时更新展示推荐准确率更高的推荐视频。
上述实施例提供了一种终端与服务器进行交互的视频推荐过程,服务器在确定任一推荐视频后,向终端推送该推荐视频,使得终端基于视频展示界面对该推荐视频进行展示,当用户点击推荐视频后,还能够对视频展示界面中的推荐视频进行更新,在本发明实施例中将从服务器侧如何确定推荐视频进行详述,当确定了推荐视频后仍然可以执行与上述实施例中步骤S206-S210类似的终端侧显示过程,在本发明实施例中不作赘述。
图4是本发明实施例提供的一种视频推荐方法的流程图,参见图4,该实施例应用于计算机设备,本发明实施例仅以该计算机设备为服务器为例进行说明,该方法包括:
S401、服务器将视频的至少一个连续视频帧所包括的至少一个图像帧输入第一特征提取网络中的时间卷积网络,通过该时间卷积网络对该至少一个图像帧进行因果卷积,得到该视频的图像特征。
其中,该视频可以是本地视频库中的任一视频,该视频也可以是从云端 下载的任一视频,该视频可以包括至少一个连续视频帧,而该至少一个连续视频帧中可以包括至少一个图像帧和至少一个音频帧,也即是每个连续视频帧包括一个图像帧和一个音频帧。可以理解,该至少一个图像帧可以表现为序列、数组或者链表等形式,本发明实施例不对图像帧的表现形式进行具体限定。
其中,该视频的图像特征可以包括与该至少一个图像帧所对应的至少一个图像帧特征,一个图像帧特征用于表示一个图像帧的图像特征以及该图像帧和该图像帧之前的图像帧之间的关联关系。
在一些实施例中,在该第一特征提取网络内,可以包括一个时间卷积网络(temporal convolutional networks,TCN)和一个卷积神经网络(convolutional neural networks,CNN),其中,该TCN可以用于提取图像特征,该CNN可以用于提取音频特征,在下述步骤402中将对CNN进行详述,这里不再赘述。
基于上述情况,当服务器将视频的至少一个连续视频帧输入该第一特征提取网络时,对该至少一个连续视频帧中的至少一个图像帧和至少一个音频帧进行分离,分别将该至少一个图像帧输入TCN,TCN独立地提取该视频的图像特征,将该至少一个音频帧输入CNN,CNN独立地提取该视频的音频特征,进一步地,对TCN输出的图像特征和CNN输出的音频特征进行特征融合,从而可以得到该视频的视频特征。
可选地,在TCN中可以包括输入层、至少一个隐藏层和输出层,该输入层用于对输入的图像帧进行解码处理,该至少一个隐藏层用于对经过解码后的图像帧进行因果卷积(causal convolutions),该输出层用于对经过因果卷积后的图像帧进行非线性处理和归一化处理。
在上述TCN中,该输入层、该至少一个隐藏层和该输出层串行连接,在特征提取的过程中上述串行连接也即是:服务器向输入层输入该视频的至少一个图像帧,将输入层解码后的至少一个图像帧输入第一个隐藏层,将第一个隐藏层输出的至少一个特征图(feature map)输入第二个隐藏层,依此类推,直到将最后一个隐藏层输出的至少一个特征图输入至输出层,输出层所输出的至少一个图像帧特征即为TCN提取到的该视频的图像特征。
在上述架构中,每个隐藏层内可以包括至少一个卷积核(filter),对于 任一个隐藏层,在对上一个隐藏层输出的至少一个特征图进行因果卷积时,在传统的CNN框架中,一个卷积核用于对一个特征图进行卷积,而在本发明实施例所提供的TCN中,一个卷积核用于对多个特征图进行卷积,这种卷积即称为“因果卷积”,其中,上述多个特征图可以是当前时刻的特征图,以及与当前时刻之前至少一个时刻所对应的至少一个特征图。
基于上述架构,在步骤S401中,服务器将该至少一个图像帧输入TCN,通过TCN的至少一个隐藏层对该至少一个图像帧进行因果卷积,输出与该至少一个图像帧对应的至少一个图像帧特征,从而将该至少一个图像帧特征确定为该视频的图像特征。
在一些实施例中,进行因果卷积时在任一个隐藏层中,对上一隐藏层输出的至少一个特征图中任一时刻的特征图,根据该隐藏层内与该时刻所对应的卷积核,分别对该时刻的特征图以及该时刻之前的至少一个时刻所对应的至少一个特征图进行卷积,将得到的多个特征图进行叠加后,得到当前隐藏层输出的该时刻的特征图。需要说明的是,这里所说的“叠加”是指将该多个特征图中对应位置的数值直接相加。
例如,图5是本发明实施例提供的一种时间卷积网络的示意图,参见图5,在第一个隐藏层中,当对输入层T时刻的图像帧进行因果卷积时,根据第一个隐藏层中的第T个卷积核,对输入层的T时刻、T-1时刻和T-2时刻这三个时刻的三个图像帧进行卷积,得到第一个隐藏层中T时刻的特征图,其中,T为大于或等于0的任一数值。需要说明的是,在图5所示的TCN框架中,一个卷积核用于对三个特征图进行卷积,但在一些实施例中,TCN中一个卷积核可以对任一大于或等于2的数量的特征图进行卷积,图5不应构成对TCN中每次因果卷积所包含的特征图数量的具体限定。
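As a minimal sketch of the causal convolution just described, with a kernel of size 3 covering times T, T-1 and T-2 as in the Figure 5 example; the channel width, sequence length and PyTorch framing are assumptions rather than the patent's exact TCN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution whose output at time t only sees inputs at t, t-1, ..., t-(k-1)."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1                      # left-only zero padding
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)

    def forward(self, x):                               # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))                     # pad on the left, keep the length
        return self.conv(x)

frames = torch.randn(1, 128, 32)                        # 32 frame embeddings of dim 128 (toy numbers)
out = CausalConv1d(128, 128)(frames)                    # same temporal length, causal receptive field
```

Padding only on the left is what prevents the output at each time step from seeing any future frame.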
通过引入因果卷积操作,相较于传统的CNN框架,TCN的层与层之间具有因果关系,并且可以在当前层考虑到上一层中具有时序关联的图像帧之间的相关性信息,也就使得输出层中的每个图像帧特征既可以表示一个图像帧的图像特征,又可以表示该图像帧与该图像帧之前的图像帧之间的关联关系。进一步地,相较于通常具有较好记忆能力的长短期记忆网络(long short-term memory,LSTM)框架,由于LSTM中包含有遗忘门,在处理过程中无法避免地会遗漏一些历史信息,然而由于TCN中不需要设置遗忘门,也 就避免了造成历史信息的遗漏,并且随着TCN深度的增加,因果卷积后得到的特征图可以包括输入层内图像数据的每一个图像帧的信息。
在一些实施例中,在进行因果卷积时,可以对上一隐藏层所输出的至少一个特征图进行补零(zero padding)处理,在每个特征图的外周添加至少一个零填充层,该零填充层的个数可以根据卷积核的尺寸以及因果卷积的步长来确定,从而能够保证每个隐藏层所输出的特征图与输入的特征图的尺寸是一致的。
在一些实施例中,上述每个隐藏层中的任一卷积核还可以是空洞卷积(dilated convolutions,又称扩张卷积)核,该空洞卷积核是指在原卷积核中相邻的元素之间***至少一个零元素所构成的新卷积核,由于空洞卷积核在空洞处一律填充为0,没有获取新的卷积核参数,从而可以在不额外增加卷积核参数的情况下,有效地扩大了卷积核的尺寸,增大了感受野(receptive field)的尺寸,能够得到更好的拟合效果,进一步地能够减少TCN中隐藏层的层数,减少TCN训练过程的计算量,缩短TCN的训练时长。
需要说明的是,在上述情况中,当卷积核为空洞卷积核时,也同样进行因果卷积操作,也即是一个空洞卷积核也用于对多个特征图进行卷积,可选地,该多个特征图可以是在时序上相邻的特征图,也可以是在时序上不相邻的特征图,当该多个特征图在时序上不相邻时,该多个特征图中相邻的特征图之间所具有的时序间隔可以相同,也可以不同,本发明实施例不对相邻的特征图之间所具有的时序间隔是否相同进行具体限定。
在一些实施例中,当该多个特征图在时序上不相邻,且具有相同的时序间隔时,可以通过为每个隐藏层设置一个大于或等于1的扩张系数d,且d为正整数,将该时序间隔确定为d-1,使得该时序间隔为大于或等于0的正整数,从而能够将时序上相邻(也即是时序间隔d-1=0)的情况视为扩张系数d=1的一种特殊情况。需要说明的是,在不同隐藏层的扩张系数可以相同也可以不同,本发明实施例不对该扩张系数的取值进行具体限定,当然,服务器也可以将时序间隔作为一种超参数直接进行设置,本发明实施例也不对是否设置扩张系数进行具体限定。
基于上述示例,参见图5,在第一个隐藏层中进行因果卷积时,采用了扩张系数d=1的空洞卷积核,对T时刻、T-1时刻和T-2时刻的图像帧进行 因果卷积,能够完整的提取输入层中各个图像帧的特征以及关联关系,而在第二个隐藏层中进行因果卷积时,采用了扩张系数d=2的空洞卷积核,每次因果卷积时选择的相邻特征图之间间隔了1个特征图,对T时刻、T-2时刻和T-4时刻的图像帧所对应的特征图进行因果卷积,而在第三个隐藏层中,采用了扩张系数d=4的空洞卷积核,每次因果卷积时选择的相邻特征图之间间隔了3个特征图,对T时刻、T-4时刻和T-8时刻的图像帧所对应的特征图进行因果卷积,从而可以减少TCN中隐藏层的层数,减少TCN训练过程的计算量,缩短TCN的训练时长,另一方面,在每次进行因果卷积时采用空洞卷积核,有效地扩大了卷积核的尺寸,增大了感受野的尺寸,能够得到更好的拟合效果。
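A sketch of the dilated causal stack from the Figure 5 example above, assuming kernel size 3 and dilation factors d = 1, 2, 4; channel width and sequence length are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    """Causal 1-D convolution with dilation factor d; kernel size 3 as in the example above."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.pad = (3 - 1) * dilation                   # left padding grows with the dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation)

    def forward(self, x):
        return self.conv(F.pad(x, (self.pad, 0)))

# Stacking d = 1, 2, 4 gives each output step a receptive field of 15 input frames.
stack = nn.Sequential(*[DilatedCausalConv1d(128, d) for d in (1, 2, 4)])
y = stack(torch.randn(1, 128, 64))                      # shape preserved: (1, 128, 64)
```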
在一些实施例中,该至少一个隐藏层之间可以采用残差连接,该残差连接也即是:对于每个隐藏层来说,可以将上一隐藏层所输出的任一特征图与当前隐藏层所输出的对应的特征图叠加后得到残差块(residual block),将该残差块作为输入下一隐藏层的一个特征图,从而可以解决TCN的退化问题,使得TCN的深度越深,对图像特征提取的准确度越好。
在一些实施例中,当采用残差连接时,在对特征图进行叠加之前,如果上一隐藏层输出的特征图维度与当前隐藏层输出的特征图维度不同,可以通过一个尺寸为1×1的卷积核对上一隐藏层输出的特征图进行卷积操作,从而对上一隐藏层输出的特征图进行升维或者降维,进而能够保证叠加过程中涉及到的两个特征图维度相同。
例如,图6是本发明实施例提供的一种时间卷积网络的示意图,参见图6,以每个隐藏层的扩张系数d=1为例进行说明,在第一个隐藏层对输入层中T时刻、T-1时刻和T-2时刻的图像帧进行因果卷积,而当在第二个隐藏层对T时刻、T-1时刻和T-2时刻的特征图进行因果卷积之前,将T时刻的图像帧与T时刻的特征图进行叠加,T-1时刻的图像帧与T-1时刻的特征图进行叠加,T-2时刻的图像帧与T-2时刻的特征图进行叠加,需要说明的是,这里所说的“叠加”是指将任意两个特征图中对应位置的数值直接相加。可选地,如果任一个图像帧与对应的特征图维度不同,可以通过一个尺寸为1×1的卷积核对该图像帧进行卷积操作,使得该图像帧与该特征图维度相同。
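One possible shape for the residual connection described above, with a 1×1 convolution used only when the channel counts differ; module names and sizes are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Adds the block input back onto the block output; a 1x1 convolution lifts or lowers
    the input channels when they differ from the output channels (sizes are illustrative)."""
    def __init__(self, in_channels, out_channels, body):
        super().__init__()
        self.body = body
        self.match = (nn.Identity() if in_channels == out_channels
                      else nn.Conv1d(in_channels, out_channels, kernel_size=1))

    def forward(self, x):
        return self.body(x) + self.match(x)             # element-wise addition of feature maps

body = nn.Conv1d(128, 256, kernel_size=3, padding=1)    # stand-in for one hidden layer
block = ResidualBlock(128, 256, body)
y = block(torch.randn(1, 128, 32))                      # (1, 256, 32)
```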
在一些实施例中,各个隐藏层之间还可以引入至少一个非线性层,该非 线性层用于对隐藏层输出的特征图进行非线性处理,该非线性层可以采用任一能够添加非线性因素的激活函数,例如该激活函数可以是sigmoid函数、tanh函数或者ReLU函数等。
在一些实施例中,各个隐藏层之间还可以引入至少一个权重归一化层,从而能够将各个卷积核的权重进行归一化,使得每个隐藏层输出的特征图具有类似的分布,从而能够加快TCN的训练速度,改善TCN的梯度弥散问题。需要说明的是,当TCN中同时具有非线性层和权重归一化层时,在任一隐藏层后先串接一个权重归一化层,进而在该权重归一化层后再串接一个非线性层。
在一些实施例中,该输出层可以是指数归一化(softmax)层,在该输出层中基于softmax函数对最后一个隐藏层所输出的各个特征图进行指数归一化,得到该视频的图像特征。
S402、服务器将该至少一个连续视频帧所包括的至少一个音频帧输入该第一特征提取网络中的卷积神经网络,通过该卷积神经网络对该至少一个音频帧进行卷积处理,得到该视频的音频特征。
其中,该至少一个音频帧可以表现为序列、数组或者链表等形式,本发明实施例不对音频帧的表现形式进行具体限定。其中,该视频的音频特征可以包括该至少一个音频帧中每个音频帧的音频特征。
在一些实施例中,第一特征提取网络中的CNN用于提取音频特征,在CNN中可以包括输入层、至少一个隐藏层和输出层,该输入层用于对输入的音频帧进行解码处理,该至少一个隐藏层用于对经过解码后的音频帧进行卷积处理,该输出层用于对经过卷积处理后的音频帧进行非线性处理和归一化处理。可选地,该输入层、该至少一个隐藏层和该输出层串行连接,与上述步骤S401中TCN的连接方式类似,这里不再赘述。
在一些实施例中,各个隐藏层之间还可以引入至少一个池化层,该池化层用于压缩上一隐藏层输出的特征图,从而减小该特征图的尺寸。在一些实施例中,该CNN中也可以采用残差连接,与上述步骤S401中TCN的残差连接类似,这里不再赘述。
在一些实施例中,该CNN可以是一个VGG(visual geometry group,视觉几何组)网络,在该VGG网络中,每个隐藏层均使用3*3的小型卷积核, 以及2*2的最大池化核,并且各个隐藏层之间采用残差连接,从而随着VGG网络的加深,每次池化后图像的尺寸缩小一半,深度增加一倍,从而简化了CNN的结构,便于获取至少一个音频帧的频谱图,便于提取高层次的音频特征。例如,该CNN可以是VGG-16或VGG-19等,本发明实施例不对该VGG网络的架构层级进行具体限定。
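A toy VGG-style audio branch, assuming the 3×3-convolution / 2×2-max-pooling pattern described above is applied to a spectrogram; the block depths, channel counts and input shape are illustrative, not the patent's VGG-16/VGG-19 configuration.

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    """n_convs 3x3 convolutions followed by 2x2 max pooling, the VGG pattern described above."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2))          # halves the spectrogram size each block
    return nn.Sequential(*layers)

audio_net = nn.Sequential(vgg_block(1, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3))
spectrogram = torch.randn(1, 1, 96, 64)                 # (batch, channel, mel bins, frames) - toy shape
audio_feature = audio_net(spectrogram).flatten(1)       # flattened high-level audio feature
```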
基于上述架构,在上述步骤S402中,服务器可以将视频的至少一个音频帧输入CNN,通过CNN的至少一个隐藏层对该至少一个音频帧进行卷积处理,输出与该至少一个音频帧对应的至少一个音频帧特征,从而将该至少一个音频帧特征确定为该视频的音频特征。可选地,在任一个隐藏层中,对上一隐藏层输出的至少一个特征图中任一时刻的特征图,根据该隐藏层内与该时刻所对应的卷积核,对该时刻的特征图进行卷积处理。
S403、服务器将该视频的图像特征与该视频的音频特征进行双线性汇合处理,得到该视频的视频特征。
在上述过程中,服务器可以对该图像特征与该音频特征进行多模态紧密双线性池化(multi-modal compact bilinear pooling,MCB)处理,MCB处理也即是:服务器获取该图像特征与该音频特征的张量积(outer product),通过二次项对该张量积进行多项式展开,得到该视频特征,当然服务器也可以通过泰勒展开、幂级数展开等方法对张量积进行展开,得到该视频特征。可选地,服务器可以将图像特征与音频特征之间的投影向量来近似表示该张量积,从而能够减少双线性汇合处理过程中的计算量,缩短视频推荐过程所用的时长。
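A simplified sketch of bilinear pooling by outer product as described above; it omits the count-sketch projection that MCB uses to keep the fused dimension manageable, and the feature dimensions are toy values.

```python
import torch

def bilinear_pool(image_feat, audio_feat):
    """Outer product of the two modality features, flattened into one fused vector."""
    fused = torch.einsum('bi,bj->bij', image_feat, audio_feat)    # (batch, d_img, d_aud)
    return fused.flatten(1)                                        # (batch, d_img * d_aud)

video_feature = bilinear_pool(torch.randn(2, 64), torch.randn(2, 32))   # toy dims: 64 x 32 -> 2048
```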
在一些实施例中,服务器还可以对该图像特征与该音频特征进行多模态低阶双线性池化(multi-modal low-rank bilinear pooling,MLB)处理,MLB处理也即是:服务器获取图像特征的投影矩阵,获取音频特征的投影矩阵,获取该图像特征的投影矩阵与该音频特征的投影矩阵之间的哈达玛积(Hadamard product),将该哈达玛积确定为该视频特征,从而能够改善MCB中受图形处理器(graphics processing unit,GPU)性能限制的缺陷,降低了对GPU的需求,节约了双线性汇合处理的成本。
在一些实施例中,服务器还可以对该图像特征与该音频特征进行多模态因式分解双线性池化(multi-modal factorized bilinear pooling,MFB) 处理,MFB处理也即是:服务器获取图像特征的低阶投影矩阵,获取音频特征的低阶投影矩阵,获取该图像特征的低阶投影矩阵与该音频特征的低阶投影矩阵之间的池化和(sum pooling),将该池化和确定为该视频特征,从而能够改善MLB中收敛速度的缺陷,降低了双线性汇合处理的时长,提升了双线性汇合处理的效率。
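A sketch of the factorized (MFB-style) variant just described: project both features into a shared low-rank space, take the Hadamard product, then sum-pool over groups of k. The output dimension and k are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

class FactorizedBilinearPool(nn.Module):
    """MFB-style fusion sketch: low-rank projections, Hadamard product, then sum pooling
    over groups of k (output dimension and k are illustrative choices)."""
    def __init__(self, d_a, d_b, out_dim=128, k=4):
        super().__init__()
        self.proj_a = nn.Linear(d_a, out_dim * k)
        self.proj_b = nn.Linear(d_b, out_dim * k)
        self.out_dim, self.k = out_dim, k

    def forward(self, a, b):
        joint = self.proj_a(a) * self.proj_b(b)                     # Hadamard product in low-rank space
        return joint.view(-1, self.out_dim, self.k).sum(dim=2)      # sum pooling over each group

fused = FactorizedBilinearPool(64, 32)(torch.randn(2, 64), torch.randn(2, 32))   # (2, 128)
```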
由于上述步骤S401-S402中,服务器基于TCN获取视频的图像特征,基于CNN获取视频的音频特征,从而在上述步骤S403中,服务器可以将该视频的图像特征与该视频的音频特征进行特征融合,得到该视频的视频特征,通过不同的网络结构,分别对图像特征和音频特征进行特征提取,在提取图像特征时考虑到图像帧之间的关联关系,提升了图像特征的表达能力,在提取音频特征时采用简化的网络结构,从而有利于提取到更深层次的音频特征,再对两个特征进行融合得到视频特征,提升了视频推荐过程的准确度。另一方面,由于图像特征和音频特征的维度往往比较大,通过双线性汇合处理能够在提升特征融合的效率的基础上,保证了图像特征与音频特征之间的充分交互,还能高效地对融合特征进行降维。
在一些实施例中,服务器还可以不对图像特征和音频特征进行双线性汇合处理,而是可以通过获取点积、获取平均值或者级联等方式进行特征融合,从而进一步缩短特征融合的时长,减少特征融合过程的计算量。
在上述步骤S401-S403中,服务器将该视频中的该至少一个连续视频帧分别输入该第一特征提取网络中的时间卷积网络和卷积神经网络,通过该时间卷积网络和该卷积神经网络对该至少一个连续视频帧进行卷积处理,提取该视频的视频特征,该第一特征提取网络中包括TCN和CNN,在一些实施例中,服务器可以直接将该视频的至少一个图像帧以及至少一个音频帧输入同一个TCN或者CNN,输出该视频的视频特征,也即是服务器通过该同一个TCN或者CNN既提取图像特征,又提取音频特征,也就无需对图像特征和音频特征进行特征融合,从而能够只基于一个卷积神经网络完成对视频特征的提取,减少了获取视频时的计算量,加快了获取视频特征的速度。当然,服务器也可以仅提取视频的图像特征,或者仅提取视频的音频特征,同样无需进行特征融合,减少了获取视频时的计算量,加快了获取视频特征的速度。
S404、服务器将用户的用户数据输入第二特征提取网络。
其中,该用户可以是任一终端所对应的用户,该用户数据可以包括用户个人信息和视频偏好,该个人信息可以包括用户性别、用户年龄、用户所在地域或者用户职业中的至少一项,该个人信息可以是用户向服务器授权的信息,该视频偏好可以由服务器对用户的视频观看行为日志进行数据分析来得到。在本申请中,下文将用户数据中各项个人信息以及各项视频偏好中任一项称为一个用户组分信息,因此该用户数据包括至少一个用户组分信息。
在上述过程中,由于用户数据中各个用户组分信息通常是一个或多个孤立的词向量,因此用户数据是离散的,此时将离散的用户数据输入第二特征提取网络之后,通过第二特征提取网络的作用,能够将离散的用户数据转换为一个连续的特征向量,该特征向量能够体现出离散的各个用户组分信息的联合特征。
在上述过程中,该第二特征提取网络可以包括宽度部分和深度部分,例如,该第二特征提取网络可以是一个宽度与深度联合网络(wide and deep models),其中,该宽度部分用于对用户数据进行广义线性处理,例如,该宽度部分可以是一个广义线性模型,将在下述步骤S405中进行详述,此外该深度部分用于对用户数据进行嵌入处理和卷积处理,例如,该深度部分可以是一个DNN(deep neural network,深度神经网络),将在下述步骤S406中进行详述。
S405、服务器通过该第二特征提取网络中的宽度部分,对离散的该用户数据进行广义线性组合,得到该用户的宽度特征。
其中,该宽度部分(wide component)可以为一个广义线性模型。
基于上述情况,服务器可以该用户数据中的至少一个用户组分信息进行独热(one-hot)编码,从而得到该用户数据的至少一个原始特征,将该至少一个原始特征输入该第二特征提取网络中的宽度部分,方便了在该宽度部分进行线性组合,加快了获取用户的宽度特征的速度。
在一些实施例中,在该广义线性模型中可以包括第一权重矩阵和偏置项(bias),从而在上述步骤S405中,服务器能够基于该第一权重矩阵,对该至少一个原始特征进行加权处理,对加权处理后的各个原始特征以及偏置项进行求和,得到用户的宽度特征,其中,该第一权重矩阵的权项个数大于或等于原始特征的个数。
在一些实施例中,该广义线性模型中可以包括第二权重矩阵和偏置项,从而服务器可以获取该至少一个原始特征在两两之间的至少一个交叉特征,从而基于该第二权重矩阵,对该至少一个原始特征和该至少一个交叉特征进行加权处理,对加权处理后的各个原始特征、各个交叉特征以及偏置项进行求和,得到用户的宽度特征。
其中,一个交叉特征用于表示任一个原始特征与另一个原始特征之间的乘积,该第二权重矩阵的权项个数大于或等于原始特征的个数与交叉特征的个数相加后所得到的数值。
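A small sketch of the wide part as described above: one-hot encode each user field, form the pairwise cross features, and take a weighted sum plus a bias. The field sizes and weights below are made-up placeholders, not values from the patent.

```python
import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Toy user fields: gender (2 values), age bucket (5 values), region (10 values).
raw = np.concatenate([one_hot(1, 2), one_hot(3, 5), one_hot(7, 10)])        # original features
cross = np.outer(raw, raw)[np.triu_indices(len(raw), k=1)]                  # pairwise cross features

features = np.concatenate([raw, cross])
weights = np.random.randn(len(features))                                    # one weight per original/cross feature
bias = 0.1
wide_feature = weights @ features + bias                                    # generalized linear combination
```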
S406、服务器通过该第二特征提取网络中的深度部分,对离散的该用户数据进行嵌入处理和卷积处理,得到该用户的深度特征。
其中,该宽度部分(wide component)可以是一个DNN。
在一些实施例中,DNN中可以包括输入层、嵌入(embedding)层、至少一个隐藏层和输出层,层与层之间采用串行连接的方式,其中,该嵌入层用于将用户数据中的至少一个用户组分信息转换为嵌入向量的形式。
在上述步骤S406中,将至少一个用户组分信息输入嵌入层,通过嵌入层对该至少一个用户组分信息进行嵌入处理,能够将较为稀疏(也即是离散)的用户数据映射到低维空间,得到至少一个嵌入向量,一个嵌入向量对应于一个用户组分信息,从而将该至少一个嵌入向量输入该至少一个隐藏层,通过该至少一个隐藏层对该至少一个嵌入向量进行卷积处理,输出该用户的深度特征。
S407、服务器通过全连接层对该用户的宽度特征和该用户的深度特征进行级联,得到该用户的用户特征。
在上述过程中,服务器可以通过一个全连接(full connected,FC)层对该用户的宽度特征和该用户的深度特征进行级联,在该全连接层中,输出的用户特征与用户的宽度特征和用户的深度特征中的每一个分量都相连。
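A sketch of how the deep part (embeddings plus hidden layers) and the wide output could be cascaded through a fully connected layer; the field sizes, embedding dimension and layer widths are illustrative assumptions rather than the patent's configuration.

```python
import torch
import torch.nn as nn

class UserFeatureNet(nn.Module):
    """Sketch of the second feature-extraction network: a deep part (embeddings + MLP)
    concatenated with a wide part through a fully connected layer."""
    def __init__(self, field_sizes=(2, 5, 10), emb_dim=8, wide_dim=1, user_dim=64):
        super().__init__()
        self.embeddings = nn.ModuleList([nn.Embedding(n, emb_dim) for n in field_sizes])
        self.deep = nn.Sequential(
            nn.Linear(emb_dim * len(field_sizes), 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        self.fc = nn.Linear(32 + wide_dim, user_dim)     # cascades deep and wide features

    def forward(self, field_ids, wide_feature):          # field_ids: (batch, n_fields) int tensor
        emb = torch.cat([e(field_ids[:, i]) for i, e in enumerate(self.embeddings)], dim=1)
        deep = self.deep(emb)
        return self.fc(torch.cat([deep, wide_feature], dim=1))

net = UserFeatureNet()
user_feature = net(torch.tensor([[1, 3, 7]]), torch.tensor([[0.5]]))         # (1, 64) user feature
```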
在上述步骤S407中,服务器对该用户的宽度特征和该用户的深度特征进行特征融合,得到该用户的用户特征,在一些实施例中,服务器还可以不对用户的宽度特征和用户的深度特征进行级联,而是可以通过获取点积或者获取平均值等方式进行特征融合,从而缩短了特征融合的时长,减少特征融合过程的计算量,当然服务器也可以通过双线性汇合进行用户的宽度特征和用 户的深度特征之间的特征融合,从而能够保证特征之间的充分交互。
在上述步骤S404-S407中,服务器将用户的用户数据输入第二特征提取网络,通过该第二特征提取网络对离散的用户数据进行特征提取,输出该用户的用户特征,既通过宽度部分考虑到了第二特征提取网络的记忆能力,也通过深度部分兼顾了第二特征提取网络的泛化能力,使得第二特征提取网络能够更加准确地表达用户的用户特征。图7是本发明实施例提供的一种第二特征提取网络的示意图,参见图7,左侧部分为宽度部分,右侧部分为深度部分,这里不再赘述。
S408、服务器将与该视频对应的文本输入第三特征提取网络。
其中,该文本可以是视频的文本类元数据,例如该文本可以是视频的标题、视频的标签、视频的评论、视频的作者或者视频的摘要中的至少一项,该第三特征提取网络与上述步骤S404中的网络架构类似,但网络的参数可以相同,也可以不同。
在上述过程中,由于文本类元数据、视频的标题、视频的标签、视频的评论、视频的作者或者视频的摘要等信息通常是一个或多个孤立的词向量,因此该文本是离散的,此时将离散的文本输入第三特征提取网络之后,通过第三特征提取网络的作用,能够将离散的文本转换为一个连续的特征向量,该特征向量能够体现出离散的文本的联合特征。
上述步骤S408与上述步骤S404类似,这里不再赘述。
S409、服务器通过该第三特征提取网络中的宽度部分,对离散的该文本进行广义线性组合,得到该文本的宽度特征。
上述步骤S409与上述步骤S405类似,这里不再赘述。
S410、服务器通过该第三特征提取网络中的深度部分,对离散的该文本进行嵌入处理和卷积处理,得到该文本的深度特征。
上述步骤S410与上述步骤S406类似,这里不再赘述。
S411、服务器通过全连接层对该文本的宽度特征和该文本的深度特征进行级联,得到与该视频对应的文本特征。
上述步骤S411与上述步骤S407类似,这里不再赘述。
在上述步骤S411中,服务器对该文本的宽度特征和该文本的深度特征进行特征融合,得到与该视频对应的文本特征。在一些实施例中,服务器还可 以不对文本的宽度特征和文本的深度特征进行级联,而是可以通过获取点积或者获取平均值等方式进行特征融合,从而缩短了特征融合的时长,减少特征融合过程的计算量,当然服务器也可以通过双线性汇合进行文本的宽度特征和文本的深度特征之间的特征融合,从而能够保证特征之间的充分交互。
在上述步骤S408-S411中,服务器将与该视频对应的文本输入第三特征提取网络,通过该第三特征提取网络对离散的该文本进行特征提取,输出与该视频对应的文本特征,从而不仅能够考虑到视频的图像特征、视频的音频特征、用户的用户特征,而且没有忽视视频的文本类元数据所带来的作用,对文本进行特征提取后得到视频的文本特征,从而增加了视频推荐过程的特征种类的多元性,进一步地提升了视频推荐过程的准确度。
S412、服务器将该视频特征与该用户特征进行双线性汇合处理,得到第一关联特征。
其中,该第一关联特征用于表示视频与用户之间特征关联关系。
上述步骤S412与上述步骤S403类似,服务器可以基于MCB、MLB或者MFB等方式进行双线性汇合处理,在提升特征融合的效率的基础上,又保证了视频特征与用户特征之间的充分交互,这里不再赘述。
在上述步骤S412中,服务器对该视频特征和该用户特征进行特征融合,得到该视频与该用户之间的第一关联特征,在一些实施例中,服务器还可以不对视频特征和用户特征进行双线性汇合处理,而是可以通过获取点积、获取平均值或者级联等方式进行特征融合,从而进一步缩短特征融合的时长,减少特征融合过程的计算量。
S413、服务器将该文本特征与该用户特征进行双线性汇合处理,得到第二关联特征。
其中,该第二关联特征用于表示文本与用户之间特征关联关系。
上述步骤S413与上述步骤S403类似,服务器可以基于MCB、MLB或者MFB等方式进行双线性汇合处理,在提升特征融合的效率的基础上,又保证了视频特征与用户特征之间的充分交互,这里不再赘述。
在上述步骤S413中,服务器对该文本特征和该用户特征进行特征融合,得到该文本与该用户之间的第二关联特征,在一些实施例中,服务器还可以不对文本特征和用户特征进行双线性汇合处理,而是可以通过获取点积、获 取平均值或者级联等方式进行特征融合,从而进一步缩短特征融合的时长,减少特征融合过程的计算量。
S414、服务器对该第一关联特征和该第二关联特征进行点乘处理,得到对该用户推荐该视频的推荐概率。
在上述过程中,服务器可以对第一关联特征和第二关联特征进行点乘处理的过程,也即是对该第一关联特征和该第二关联特征求内积的过程,将该第一关联特征和该第二关联特征中对应位置的数值相乘后进行求和所得到的数值即为该视频的推荐概率。
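The inner-product step described above, in miniature; the feature dimension is arbitrary.

```python
import torch

first_assoc = torch.randn(1, 256)                        # video-user association feature (toy dim)
second_assoc = torch.randn(1, 256)                       # text-user association feature (toy dim)

recommend_score = (first_assoc * second_assoc).sum(dim=1)   # multiply matching positions, then sum
```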
在上述步骤S412-S414中,服务器基于该视频特征和该用户特征进行特征融合,得到对该用户推荐该视频的推荐概率,从而能够基于该推荐概率,对用户进行视频推荐,详见下述步骤S415。
在一些实施例中,服务器还可以不执行上述步骤S408-S414,也就是不获取文本特征,而是在执行上述步骤S407后,直接对该视频特征和该用户特征进行点乘处理,得到对该用户推荐该视频的推荐概率,从而避免了获取文本特征以及后续特征融合的繁琐计算流程,减少了推荐视频的时长。
S415、当该推荐概率大于概率阈值时,服务器确定为该用户推荐该视频。
其中,该概率阈值可以是大于或等于0且小于或等于1的任一数值。
上述过程中,服务器将该推荐概率与概率阈值进行数值比较,当该推荐概率大于概率阈值时,确定为用户推荐该视频,当该推荐概率小于或等于该概率阈值时,服务器可以确定不为该用户推荐该视频。
在上述步骤S415中,服务器根据该推荐概率,确定是否对该用户推荐该视频,而对于不同的用户以及不同的视频,服务器均可以执行上述步骤S401-S415中的视频推荐流程,从而能够确定是否对任一用户推荐任一视频。
在一些实施例中,服务器还可以不根据概率阈值判断是否推荐,而是执行下述步骤:对多个视频中的每个视频,服务器重复执行生成推荐概率的操作,得到多个推荐概率;获取该推荐概率在该多个推荐概率中从大到小的概率排序,当该概率排序小于或等于目标阈值时,确定为该用户推荐该视频;当该概率排序大于该目标阈值时,确定不为该用户推荐该视频。其中,该目标阈值可以是大于或等于1且小于或等于该多个视频的个数的数值。
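A minimal sketch of the ranking-based decision described above: sort the recommendation probabilities in descending order and keep the videos whose rank is within the target threshold.

```python
def select_by_rank(probabilities, target_threshold):
    """Return the indices of videos whose descending-order rank is within target_threshold."""
    order = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    return set(order[:target_threshold])

recommended = select_by_rank([0.91, 0.40, 0.77, 0.65], target_threshold=2)   # -> {0, 2}
```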
在上述过程中,服务器通过获取概率排序,从而能够控制选出的推荐视 频的个数,避免了当概率阈值较小时,为用户推荐太多的视频,从而优化了视频推荐的效果。
当然,在执行上述步骤S415之后,服务器可以重复执行上述步骤S401-S415所执行的操作,从而能够确定对用户进行推荐的至少一个推荐视频,向终端发送该至少一个推荐视频的视频信息,从而执行与上述实施例中步骤S206-S210类似的终端侧显示过程,在此不作赘述。
上述所有可选技术方案,可以采用任意结合形成本公开的可选实施例,在此不再一一赘述。
本发明实施例提供的方法,通过将视频输入第一特征提取网络,通过该第一特征提取网络对该视频中的至少一个连续视频帧进行特征提取,输出该视频的视频特征,由于视频特征种类少、维度高,从而在不增加太大的计算压力的情况下,有针对性地提取高维度的视频特征,将用户的用户数据输入第二特征提取网络,通过该第二特征提取网络对离散的用户数据进行特征提取,输出该用户的用户特征,由于用户特征种类多、维度低,从而可以基于第二特征提取网络,有针对性地提取低维度的用户特征,减小了提取用户特征的计算压力,基于该视频特征和该用户特征进行特征融合,得到对该用户推荐该视频的推荐概率,根据该推荐概率,确定是否对该用户推荐该视频,从而对于性质差别较大的用户特征和视频特征,分别采用不同的网络进行特征提取,避免了丢失用户特征和视频特征中的信息,改善了梯度弥散的问题,提高了视频推荐的准确度。
进一步地,通过TCN提取视频的图像特征,引入因果卷积操作,由于相较于传统的CNN框架,TCN的层与层之间具有因果关系,因此可以在当前层考虑到上一层中具有时序关联的图像帧之间的相关性信息,也就使得TCN输出层中的每个图像帧特征既可以表示一个图像帧的图像特征,又可以表示该图像帧与该图像帧之前的图像帧之间的关联关系。进一步地,相较于通常具有较好记忆能力的长短期记忆网络(long short-term memory,LSTM)框架,由于LSTM中包含有遗忘门,在处理过程中无法避免地会遗漏一些历史信息,然而由于TCN中不需要设置遗忘门,也就避免了造成历史信息的遗漏,并且随着TCN深度的增加,因果卷积后得到的特征图可以包括输入层内图像数据 的每一个图像帧的信息。
进一步地,通过CNN提取视频的音频特征,当CNN网络是VGG网络时,随着VGG网络的加深,每次池化后图像的尺寸缩小一半,深度增加一倍,简化了CNN的结构,便于提取高层次的音频特征。
进一步地,由于图像特征和音频特征的维度往往比较大,通过对图像特征和音频特征进行双线性汇合处理,能够在提升特征融合的效率的基础上,保证了图像特征与音频特征之间的充分交互。
进一步地,通过第二特征提取网络提取用户特征,既通过宽度部分考虑到了第二特征提取网络的记忆能力,也通过深度部分兼顾了第二特征提取网络的泛化能力,使得第二特征提取网络能够更加准确地表达用户的用户特征。
进一步地,通过对文本进行特征提取后得到视频的文本特征,从而不仅能够考虑到视频的图像特征、视频的音频特征、用户的用户特征,而且没有忽视视频的文本类元数据所带来的作用,从而增加了视频推荐过程的特征种类的多元性,进一步地提升了视频推荐过程的准确度。
进一步地,通过第三特征提取网络提取文本特征,既通过宽度部分考虑到了第三特征提取网络的记忆能力,也通过深度部分兼顾了第三特征提取网络的泛化能力,使得第三特征提取网络能够更加准确地表达与视频对应的文本特征。
在上述实施例中,图8是本发明实施例提供的一种视频推荐方法的示意图,参见图8,服务器对于不同性质的特征采用不同架构的网络进行提取,也即是对不同模态的视频、用户数据以及与视频对应的文本,分别通过第一特征提取网络、第二特征提取网络以及第三特征提取网络进行特征提取,可以降低多模态融合信息损失,避免高维度特征挤压低维度特征的表达能力,减少了无效的融合所造成的维度***。另一方面,通过新引入文本特征,可以从视频特征和文本特征这两个维度上分别刻画用户的视频观看偏好与文本阅读偏好,增强了服务器对多模态数据的描述能力与可解释性。
另一方面,服务器在第一特征提取网络内,分别采用TCN提取视频的图像特征,采用CNN提取视频的音频特征,在第二特征提取网络内,分别采用宽度部分提取用户的宽度特征,采用深度部分提取用户的深度特征,在第三 特征提取网络内,分别采用宽度部分提取文本的宽度特征,采用深度部分提取文本的深度特征,进一步地,对于相似结构的特征先进行类内特征融合,也即是对视频的图像特征和音频特征进行融合得到视频特征,对用户的宽度特征和用户深度特征进行融合得到用户特征,对文本的宽度特征和文本的深度特征进行融合得到文本特征,从而能够降低特征维度,提高融合效率,然后对不相似结构的特征进行类间融合,例如获取第一联合特征和第二联合特征,从而能够基于多模态的视频推荐方法,对两个联合特征进行点乘得到推荐概率,充分利用了视频特征与文本特征,能够从更多维度的角度上刻画视频,也就能够更加准确地表达视频,从而提升了视频推荐的准确率。
在一些实施例中,服务器在进行视频推荐之前,可以基于反向传播算法训练得到该第一特征提取网络,基于宽度与深度联合训练方法分别得到第二特征提取网络和第三特征提取网络进行训练,训练过程与上述实施例相类似,只不过使用的是样本视频、样本用户数据以及样本文本,这里不再赘述。
上述实施例提供了一种根据视频、用户数据和文本进行视频推荐的方法,可选地,以计算机设备为服务器为例进行说明,服务器还可以不引入文本,而是直接根据视频和用户数据进行视频推荐,图9是本发明实施例提供的一种视频推荐方法的流程图,参见图9,下面进行详述:
S901、服务器将视频的至少一个连续视频帧所包括的至少一个图像帧输入第一特征提取网络中的时间卷积网络,通过该时间卷积网络对该至少一个图像帧进行因果卷积,得到该视频的图像特征。
上述步骤S901与上述实施例中的步骤S401类似,在此不作赘述。
S902、服务器将该至少一个连续视频帧所包括的至少一个音频帧输入该第一特征提取网络中的卷积神经网络,通过该卷积神经网络对该至少一个音频帧进行卷积处理,得到该视频的音频特征。
上述步骤S902与上述实施例中的步骤S402类似,在此不作赘述。
S903、服务器将该视频的图像特征与该视频的音频特征进行双线性汇合处理,得到该视频的视频特征。
上述步骤S903与上述实施例中的步骤S403类似,在此不作赘述。
S904、服务器将用户的用户数据输入第二特征提取网络。
上述步骤S904与上述实施例中的步骤S404类似,在此不作赘述。
S905、服务器通过该第二特征提取网络中的宽度部分,对离散的该用户数据进行广义线性组合,得到该用户的宽度特征。
上述步骤S905与上述实施例中的步骤S405类似,在此不作赘述。
S906、服务器通过该第二特征提取网络中的深度部分,对离散的该用户数据进行嵌入处理和卷积处理,得到该用户的深度特征。
上述步骤S906与上述实施例中的步骤S406类似,在此不作赘述。
S907、服务器通过全连接层对该用户的宽度特征和该用户的深度特征进行级联,得到该用户的用户特征。
上述步骤S907与上述实施例中的步骤S407类似,在此不作赘述。
S908、服务器对该视频特征和该用户特征进行点乘处理,得到对该用户推荐该视频的推荐概率。
上述步骤S908点乘处理的方式与上述实施例中的步骤S414类似,在此不作赘述。
S909、当该推荐概率大于概率阈值时,服务器确定为该用户推荐该视频。
上述步骤S909与上述实施例中的步骤S415类似,在此不作赘述。
当然,在执行上述步骤S909之后,服务器可以重复执行上述步骤S901-S909所执行的操作,从而能够确定对用户进行推荐的至少一个推荐视频,向终端发送该至少一个推荐视频的视频信息,从而执行与上述实施例中步骤S206-S210类似的终端侧显示过程,在此不作赘述。
本发明实施例提供的方法,通过将视频输入第一特征提取网络,通过该第一特征提取网络对该视频中的至少一个连续视频帧进行特征提取,输出该视频的视频特征,由于视频特征种类少、维度高,从而在不增加太大的计算压力的情况下,有针对性地提取高维度的视频特征,将用户的用户数据输入第二特征提取网络,通过该第二特征提取网络对离散的用户数据进行特征提取,输出该用户的用户特征,由于用户特征种类多、维度低,从而可以基于第二特征提取网络,有针对性地提取低维度的用户特征,减小了提取用户特征的计算压力,基于该视频特征和该用户特征进行特征融合,得到对该用户推荐该视频的推荐概率,根据该推荐概率,确定是否对该用户推荐该视频,从而对于性质差别较大的用户特征和视频特征,分别采用不同的网络进行特 征提取,避免了丢失用户特征和视频特征中的信息,改善了梯度弥散的问题,提高了视频推荐的准确度。
图10是本发明实施例提供的一种视频推荐装置的结构示意图,参见图10,该装置包括第一输出模块1001、第二输出模块1002、融合得到模块1003和确定推荐模块1004,下面进行详述:
第一输出模块1001,用于将视频输入第一特征提取网络,通过该第一特征提取网络对该视频中的至少一个连续视频帧进行特征提取,输出该视频的视频特征。
第二输出模块1002,用于将用户的用户数据输入第二特征提取网络,通过该第二特征提取网络对离散的该用户数据进行特征提取,输出该用户的用户特征。
融合得到模块1003,用于基于该视频特征和该用户特征进行特征融合,得到对该用户推荐该视频的推荐概率。
确定推荐模块1004,用于根据该推荐概率,确定是否对该用户推荐该视频。
本发明实施例提供的装置,通过将视频输入第一特征提取网络,通过该第一特征提取网络对该视频中的至少一个连续视频帧进行特征提取,输出该视频的视频特征,由于视频特征种类少、维度高,从而在不增加太大的计算压力的情况下,有针对性地提取高维度的视频特征,将用户的用户数据输入第二特征提取网络,通过该第二特征提取网络对离散的用户数据进行特征提取,输出该用户的用户特征,由于用户特征种类多、维度低,从而可以基于第二特征提取网络,有针对性地提取低维度的用户特征,减小了提取用户特征的计算压力,基于该视频特征和该用户特征进行特征融合,得到对该用户推荐该视频的推荐概率,根据该推荐概率,确定是否对该用户推荐该视频,从而对于性质差别较大的用户特征和视频特征,分别采用不同的网络进行特征提取,避免了丢失用户特征和视频特征中的信息,改善了梯度弥散的问题,提高了视频推荐的准确度。
在一些实施例中,基于图10的装置组成,该第一输出模块1001包括:
卷积提取单元,用于将视频中的至少一个连续视频帧分别输入第一特征 提取网络中的时间卷积网络和卷积神经网络,通过该时间卷积网络和该卷积神经网络对该至少一个连续视频帧进行卷积处理,提取该视频的视频特征。
在一些实施例中,基于图10的装置组成,该卷积提取单元包括:
因果卷积子单元,用于将视频中的至少一个连续视频帧所包括的至少一个图像帧输入第一特征提取网络中的时间卷积网络,通过该时间卷积网络对该至少一个图像帧进行因果卷积,得到该视频的图像特征。
卷积处理子单元,用于将该至少一个连续视频帧所包括的至少一个音频帧输入第一特征提取网络中的卷积神经网络,通过该卷积神经网络对该至少一个音频帧进行卷积处理,得到该视频的音频特征。
融合子单元,用于将该视频的图像特征与该视频的音频特征进行特征融合,得到该视频的视频特征。
在一些实施例中,该融合子单元用于,将该视频的图像特征与该视频的音频特征进行双线性汇合处理,得到该视频的视频特征。
在一些实施例中,基于图10的装置组成,该第二输出模块1002包括:
第一输入单元,用于将该用户的用户数据输入该第二特征提取网络。
第一线性组合单元,用于通过该第二特征提取网络中的宽度部分,对离散的该用户数据进行广义线性组合,得到该用户的宽度特征。
第一嵌入卷积单元,用于通过该第二特征提取网络中的深度部分,对离散的该用户数据进行嵌入处理和卷积处理,得到该用户的深度特征。
第一融合单元,用于对该用户的宽度特征和该用户的深度特征进行特征融合,得到该用户的用户特征。
在一些实施例中,该第一融合单元具体用于,通过全连接层对该用户的宽度特征和该用户的深度特征进行级联,得到该用户的用户特征。
在一些实施例中,该融合得到模块1003用于,对该视频特征和该用户特征进行点乘处理,得到对该用户推荐该视频的推荐概率。
在一些实施例中,基于图10的装置组成,该装置还包括:
第三输入模块,用于将与该视频对应的文本输入第三特征提取网络,通过该第三特征提取网络对离散的该文本进行特征提取,输出与该视频对应的文本特征。
在一些实施例中,基于图10的装置组成,该第三输入模块包括:
第二输入单元,用于将该文本输入该第三特征提取网络。
第二线性组合单元,用于通过该第三特征提取网络中的宽度部分,对离散的该文本进行广义线性组合,得到该文本的宽度特征。
第二嵌入卷积单元,用于通过该第三特征提取网络中的深度部分,对离散的该文本进行嵌入处理和卷积处理,得到该文本的深度特征。
第二融合单元,用于对该文本的宽度特征和该文本的深度特征进行特征融合,得到与该视频对应的文本特征。
在一些实施例中,该第二融合单元具体用于,通过全连接层对该文本的宽度特征和该文本的深度特征进行级联,得到该与该视频对应的文本特征。
在一些实施例中,基于图10的装置组成,该融合得到模块1003包括:
第三融合单元,用于对该视频特征和该用户特征进行特征融合,得到该视频与该用户之间的第一关联特征。
该第三融合单元,还用于对该文本特征和该用户特征进行特征融合,得到该文本与该用户之间的第二关联特征。
点乘单元,用于对该第一关联特征和该第二关联特征进行点乘处理,得到对该用户推荐该视频的推荐概率。
在一些实施例中,该第三融合单元具体用于,将该视频特征与该用户特征进行双线性汇合处理,得到该第一关联特征。
该第三融合单元还用于,将该文本特征与该用户特征进行双线性汇合处理,得到该第二关联特征。
在一些实施例中,该确定推荐模块1004用于,当该推荐概率大于概率阈值时,确定为该用户推荐该视频;及当该推荐概率小于或等于该概率阈值时,确定不为该用户推荐该视频。
在一些实施例中,该确定推荐模块1004用于,对多于一个视频中的每个视频,重复执行生成推荐概率的操作,得到多于一个推荐概率;获取每个推荐概率在该多于一个推荐概率中从大到小的概率排序,当该概率排序小于或等于目标阈值时,确定为该用户推荐相应概率排序所对应的该视频;及当该概率排序大于该目标阈值时,确定不为该用户推荐相应概率排序所对应该视频。
上述所有可选技术方案,可以采用任意结合形成本公开的可选实施例, 在此不再一一赘述。
需要说明的是:上述实施例提供的视频推荐装置在推荐视频时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将计算机设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的视频推荐装置与视频推荐方法实施例属于同一构思,其具体实现过程详见视频推荐方法实施例,这里不再赘述。
图11是本发明实施例提供的一种推荐视频展示装置的结构示意图,参见图11,该装置包括显示模块1101、发送模块1102和展示模块1103,下面进行详述:
显示模块1101,用于显示视频展示界面,该视频展示界面中包括至少一个第一推荐视频。
发送模块1102,用于当检测到对任一第一推荐视频的点击操作时,响应于该点击操作,将该第一推荐视频的观看记录发送至服务器,该观看记录用于指示该服务器基于该观看记录对视频推荐模型进行优化训练,并实时返回至少一个第二推荐视频的视频信息。
展示模块1103,用于当接收到该至少一个第二推荐视频的视频信息时,基于该至少一个第二推荐视频的视频信息,在该视频展示界面中展示该至少一个第二推荐视频。
本发明实施例提供的装置,通过在该视频展示界面上展示至少一个第一推荐视频,当检测到用户对任一第一推荐视频的点击操作时,响应于该点击操作,将该推荐视频的观看记录发送至服务器,从而能够及时向用户反馈本次第一推荐视频的质量优劣,使得服务器能够基于该观看记录对该第一推荐视频进行真假样本的区分标记,将该第一推荐视频作为新一轮优化训练中的样本视频,实现了对视频推荐模型的动态优化训练,并且服务器还可以根据优化训练后的视频推荐模型向终端返回至少一个第二推荐视频的视频信息,当终端接收到至少一个第二推荐视频的视频信息时,基于该至少一个第二推荐视频的视频信息,在该视频展示界面中展示该至少一个第二推荐视频,使 得随着用户的点击操作,能够在视频展示界面上实时更新展示推荐准确率更高的推荐视频。
需要说明的是:上述实施例提供的推荐视频展示装置在展示推荐视频时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将电子设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的推荐视频展示装置与视频推荐方法的交互实施例属于同一构思,其具体实现过程详见视频推荐方法实施例,这里不再赘述。
图12是本发明实施例提供的计算机设备的结构示意图,该计算机设备1200可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上处理器(central processing units,CPU)1201和一个或一个以上的存储器1202,其中,该存储器1202中存储有至少一条计算机可读指令,该至少一条计算机可读指令由该处理器1201加载并执行以实现上述各个视频推荐方法实施例提供的视频推荐方法。当然,该计算机设备还可以具有有线或无线网络接口、键盘以及输入输出接口等部件,以便进行输入输出,该计算机设备还可以包括其他用于实现设备功能的部件,在此不做赘述。
图13是本发明实施例提供的电子设备的结构示意图。该电子设备1300可以是:智能手机、平板电脑、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。电子设备1300还可能被称为用户设备、便携式电子设备、膝上型电子设备、台式电子设备等其他名称。
通常,电子设备1300包括有:处理器1301和存储器1302。
处理器1301可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器1301可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式 来实现。处理器1301也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器1301可以在集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器1301还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器1302可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器1302还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器1302中的非暂态的计算机可读存储介质用于存储至少一个计算机可读指令,该至少一个计算机可读指令用于被处理器1301所执行以实现本申请中方法实施例提供的推荐视频展示方法。
在一些实施例中,电子设备1300还可选包括有:***设备接口1303和至少一个***设备。处理器1301、存储器1302和***设备接口1303之间可以通过总线或信号线相连。各个***设备可以通过总线、信号线或电路板与***设备接口1303相连。具体地,***设备包括:射频电路1304、触摸显示屏1305、摄像头1306、音频电路1307、定位组件1308和电源1309中的至少一种。
***设备接口1303可被用于将I/O(Input/Output,输入/输出)相关的至少一个***设备连接到处理器1301和存储器1302。在一些实施例中,处理器1301、存储器1302和***设备接口1303被集成在同一芯片或电路板上;在一些其他实施例中,处理器1301、存储器1302和***设备接口1303中的任意一个或两个可以在单独的芯片或电路板上实现,本实施例对此不加以限定。
射频电路1304用于接收和发射RF(Radio Frequency,射频)信号,也称电磁信号。射频电路1304通过电磁信号与通信网络以及其他通信设备进行通信。射频电路1304将电信号转换为电磁信号进行发送,或者,将接收到的电磁信号转换为电信号。可选地,射频电路1304包括:天线***、RF收发器、一个或多个放大器、调谐器、振荡器、数字信号处理器、编解码芯片组、 用户身份模块卡等等。射频电路1304可以通过至少一种无线通信协议来与其它电子设备进行通信。该无线通信协议包括但不限于:城域网、各代移动通信网络(2G、3G、4G及5G)、无线局域网和/或WiFi(Wireless Fidelity,无线保真)网络。在一些实施例中,射频电路1304还可以包括NFC(Near Field Communication,近距离无线通信)有关的电路,本申请对此不加以限定。
显示屏1305用于显示UI(User Interface,用户界面)。该UI可以包括图形、文本、图标、视频及其它们的任意组合。当显示屏1305是触摸显示屏时,显示屏1305还具有采集在显示屏1305的表面或表面上方的触摸信号的能力。该触摸信号可以作为控制信号输入至处理器1301进行处理。此时,显示屏1305还可以用于提供虚拟按钮和/或虚拟键盘,也称软按钮和/或软键盘。在一些实施例中,显示屏1305可以为一个,设置电子设备1300的前面板;在另一些实施例中,显示屏1305可以为至少两个,分别设置在电子设备1300的不同表面或呈折叠设计;在再一些实施例中,显示屏1305可以是柔性显示屏,设置在电子设备1300的弯曲表面上或折叠面上。甚至,显示屏1305还可以设置成非矩形的不规则图形,也即异形屏。显示屏1305可以采用LCD(Liquid Crystal Display,液晶显示屏)、OLED(Organic Light-Emitting Diode,有机发光二极管)等材质制备。
摄像头组件1306用于采集图像或视频。可选地,摄像头组件1306包括前置摄像头和后置摄像头。通常,前置摄像头设置在电子设备的前面板,后置摄像头设置在电子设备的背面。在一些实施例中,后置摄像头为至少两个,分别为主摄像头、景深摄像头、广角摄像头、长焦摄像头中的任意一种,以实现主摄像头和景深摄像头融合实现背景虚化功能、主摄像头和广角摄像头融合实现全景拍摄以及VR(Virtual Reality,虚拟现实)拍摄功能或者其它融合拍摄功能。在一些实施例中,摄像头组件1306还可以包括闪光灯。闪光灯可以是单色温闪光灯,也可以是双色温闪光灯。双色温闪光灯是指暖光闪光灯和冷光闪光灯的组合,可以用于不同色温下的光线补偿。
音频电路1307可以包括麦克风和扬声器。麦克风用于采集用户及环境的声波,并将声波转换为电信号输入至处理器1301进行处理,或者输入至射频电路1304以实现语音通信。出于立体声采集或降噪的目的,麦克风可以为多个,分别设置在电子设备1300的不同部位。麦克风还可以是阵列麦克风或全 向采集型麦克风。扬声器则用于将来自处理器1301或射频电路1304的电信号转换为声波。扬声器可以是传统的薄膜扬声器,也可以是压电陶瓷扬声器。当扬声器是压电陶瓷扬声器时,不仅可以将电信号转换为人类可听见的声波,也可以将电信号转换为人类听不见的声波以进行测距等用途。在一些实施例中,音频电路1307还可以包括耳机插孔。
定位组件1308用于定位电子设备1300的当前地理位置,以实现导航或LBS(Location Based Service,基于位置的服务)。定位组件1308可以是基于美国的GPS(Global Positioning System,全球定位***)、中国的北斗***、俄罗斯的格雷纳斯***或欧盟的伽利略***的定位组件。
电源1309用于为电子设备1300中的各个组件进行供电。电源1309可以是交流电、直流电、一次性电池或可充电电池。当电源1309包括可充电电池时,该可充电电池可以支持有线充电或无线充电。该可充电电池还可以用于支持快充技术。
在一些实施例中,电子设备1300还包括有一个或多个传感器1310。该一个或多个传感器1310包括但不限于:加速度传感器1311、陀螺仪传感器1312、压力传感器1313、指纹传感器1314、光学传感器1315以及接近传感器1316。
加速度传感器1311可以检测以电子设备1300建立的坐标系的三个坐标轴上的加速度大小。比如,加速度传感器1311可以用于检测重力加速度在三个坐标轴上的分量。处理器1301可以根据加速度传感器1311采集的重力加速度信号,控制触摸显示屏1305以横向视图或纵向视图进行用户界面的显示。加速度传感器1311还可以用于游戏或者用户的运动数据的采集。
陀螺仪传感器1312可以检测电子设备1300的机体方向及转动角度,陀螺仪传感器1312可以与加速度传感器1311协同采集用户对电子设备1300的3D动作。处理器1301根据陀螺仪传感器1312采集的数据,可以实现如下功能:动作感应(比如根据用户的倾斜操作来改变UI)、拍摄时的图像稳定、游戏控制以及惯性导航。
压力传感器1313可以设置在电子设备1300的侧边框和/或触摸显示屏1305的下层。当压力传感器1313设置在电子设备1300的侧边框时,可以检测用户对电子设备1300的握持信号,由处理器1301根据压力传感器1313采 集的握持信号进行左右手识别或快捷操作。当压力传感器1313设置在触摸显示屏1305的下层时,由处理器1301根据用户对触摸显示屏1305的压力操作,实现对UI界面上的可操作性控件进行控制。可操作性控件包括按钮控件、滚动条控件、图标控件、菜单控件中的至少一种。
指纹传感器1314用于采集用户的指纹,由处理器1301根据指纹传感器1314采集到的指纹识别用户的身份,或者,由指纹传感器1314根据采集到的指纹识别用户的身份。在识别出用户的身份为可信身份时,由处理器1301授权该用户执行相关的敏感操作,该敏感操作包括解锁屏幕、查看加密信息、下载软件、支付及更改设置等。指纹传感器1314可以被设置电子设备1300的正面、背面或侧面。当电子设备1300上设置有物理按键或厂商Logo时,指纹传感器1314可以与物理按键或厂商Logo集成在一起。
光学传感器1315用于采集环境光强度。在一个实施例中,处理器1301可以根据光学传感器1315采集的环境光强度,控制触摸显示屏1305的显示亮度。具体地,当环境光强度较高时,调高触摸显示屏1305的显示亮度;当环境光强度较低时,调低触摸显示屏1305的显示亮度。在另一个实施例中,处理器1301还可以根据光学传感器1315采集的环境光强度,动态调整摄像头组件1306的拍摄参数。
接近传感器1316,也称距离传感器,通常设置在电子设备1300的前面板。接近传感器1316用于采集用户与电子设备1300的正面之间的距离。在一个实施例中,当接近传感器1316检测到用户与电子设备1300的正面之间的距离逐渐变小时,由处理器1301控制触摸显示屏1305从亮屏状态切换为息屏状态;当接近传感器1316检测到用户与电子设备1300的正面之间的距离逐渐变大时,由处理器1301控制触摸显示屏1305从息屏状态切换为亮屏状态。
本领域技术人员可以理解,图13中示出的结构并不构成对电子设备1300的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
在示例性实施例中,还提供了一种非易失性的计算机可读存储介质,存储有计算机可读指令,计算机可读指令被一个或多个处理器执行时,使得一 个或多个处理器执行上述的视频推荐方法的步骤,或,上述的推荐视频展示方法的步骤。例如,该计算机可读存储介质可以是ROM、随机存取存储器(RAM)、CD-ROM、磁带、软盘和光数据存储设备等。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,该程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本发明的较佳实施例,并不用以限制本发明,凡在本发明的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本发明的保护范围之内。

Claims (20)

  1. 一种视频推荐方法,由计算机设备执行,所述方法包括:
    将视频输入第一特征提取网络,通过所述第一特征提取网络对所述视频中的至少一个连续视频帧进行特征提取,输出所述视频的视频特征;
    将用户的用户数据输入第二特征提取网络,通过所述第二特征提取网络对离散的所述用户数据进行特征提取,输出所述用户的用户特征;
    基于所述视频特征和所述用户特征进行特征融合,得到对所述用户推荐所述视频的推荐概率;及
    根据所述推荐概率,确定是否对所述用户推荐所述视频。
  2. 根据权利要求1所述的方法,其特征在于,所述将视频输入第一特征提取网络,通过所述第一特征提取网络对所述视频中的至少一个连续视频帧进行特征提取,输出所述视频的视频特征包括:
    将视频中的至少一个连续视频帧分别输入第一特征提取网络中的时间卷积网络和卷积神经网络,通过所述时间卷积网络和所述卷积神经网络对所述至少一个连续视频帧进行卷积处理,提取所述视频的视频特征。
  3. 根据权利要求2所述的方法,其特征在于,所述将视频中的至少一个连续视频帧分别输入第一特征提取网络中的时间卷积网络和卷积神经网络,通过所述时间卷积网络和所述卷积神经网络对所述至少一个连续视频帧进行卷积处理,提取所述视频的视频特征包括:
    将视频中的至少一个连续视频帧所包括的至少一个图像帧输入第一特征提取网络中的时间卷积网络,通过所述时间卷积网络对所述至少一个图像帧进行因果卷积,得到所述视频的图像特征;
    将所述至少一个连续视频帧所包括的至少一个音频帧输入第一特征提取网络中的卷积神经网络,通过所述卷积神经网络对所述至少一个音频帧进行卷积处理,得到所述视频的音频特征;及
    将所述视频的图像特征与所述视频的音频特征进行特征融合,得到所述视频的视频特征。
  4. 根据权利要求3所述的方法,其特征在于,所述将所述视频的图像特征与所述视频的音频特征进行特征融合,得到所述视频的视频特征包括:
    将所述视频的图像特征与所述视频的音频特征进行双线性汇合处理,得 到所述视频的视频特征。
  5. 根据权利要求1所述的方法,其特征在于,所述将用户的用户数据输入第二特征提取网络,通过所述第二特征提取网络对离散的所述用户数据进行特征提取,输出所述用户的用户特征包括:
    将所述用户的用户数据输入第二特征提取网络;
    通过所述第二特征提取网络中的宽度部分,对离散的所述用户数据进行广义线性组合,得到所述用户的宽度特征;
    通过所述第二特征提取网络中的深度部分,对离散的所述用户数据进行嵌入处理和卷积处理,得到所述用户的深度特征;及
    对所述用户的宽度特征和所述用户的深度特征进行特征融合,得到所述用户的用户特征。
  6. 根据权利要求5所述的方法,其特征在于,所述对所述用户的宽度特征和所述用户的深度特征进行特征融合,得到所述用户的用户特征包括:
    通过全连接层对所述用户的宽度特征和所述用户的深度特征进行级联,得到所述用户的用户特征。
  7. 根据权利要求1所述的方法,其特征在于,所述基于所述视频特征和所述用户特征进行特征融合,得到对所述用户推荐所述视频的推荐概率包括:
    对所述视频特征和所述用户特征进行点乘处理,得到对所述用户推荐所述视频的推荐概率。
  8. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    将与所述视频对应的文本输入第三特征提取网络,通过所述第三特征提取网络对离散的所述文本进行特征提取,输出与所述视频对应的文本特征。
  9. 根据权利要求8所述的方法,其特征在于,所述将与所述视频对应的文本输入第三特征提取网络,通过所述第三特征提取网络对离散的所述文本进行特征提取,输出与所述视频对应的文本特征包括:
    将所述文本输入第三特征提取网络;
    通过所述第三特征提取网络中的宽度部分,对离散的所述文本进行广义线性组合,得到所述文本的宽度特征;
    通过所述第三特征提取网络中的深度部分,对离散的所述文本进行嵌入处理和卷积处理,得到所述文本的深度特征;及
    对所述文本的宽度特征和所述文本的深度特征进行特征融合,得到与所述视频对应的文本特征。
  10. 根据权利要求9所述的方法,其特征在于,所述对所述文本的宽度特征和所述文本的深度特征进行特征融合,得到与所述视频对应的文本特征,包括:
    通过全连接层对所述文本的宽度特征和所述文本的深度特征进行级联,得到与所述视频对应的文本特征。
  11. 根据权利要求8所述的方法,其特征在于,所述基于所述视频特征和所述用户特征进行特征融合,得到对所述用户推荐所述视频的推荐概率包括:
    对所述视频特征和所述用户特征进行特征融合,得到所述视频与所述用户之间的第一关联特征;
    对所述文本特征和所述用户特征进行特征融合,得到所述文本与所述用户之间的第二关联特征;及
    对所述第一关联特征和所述第二关联特征进行点乘处理,得到对所述用户推荐所述视频的推荐概率。
  12. 根据权利要求11所述的方法,其特征在于,所述对所述视频特征和所述用户特征进行特征融合,得到所述视频与所述用户之间的第一关联特征包括:
    将所述视频特征与所述用户特征进行双线性汇合处理,得到所述视频与所述用户之间的第一关联特征;
    所述对所述文本特征和所述用户特征进行特征融合,得到所述文本与所述用户之间的第二关联特征包括:
    将所述文本特征与所述用户特征进行双线性汇合处理,得到所述文本与所述用户之间的第二关联特征。
  13. 根据权利要求1至12中任一项所述的方法,其特征在于,所述根据所述推荐概率,确定是否对所述用户推荐所述视频包括:
    当所述推荐概率大于概率阈值时,确定为所述用户推荐所述视频;及
    当所述推荐概率小于或等于所述概率阈值时,确定不为所述用户推荐所述视频。
  14. 根据权利要求1至12中任一项所述的方法,其特征在于,所述根据所述推荐概率,确定是否对所述用户推荐所述视频包括:
    对多于一个视频中的每个视频,重复执行生成推荐概率的操作,得到多于一个的推荐概率;
    获取每个推荐概率分别在所述多于一个推荐概率中从大到小的概率排序,当所述概率排序小于或等于目标阈值时,确定为所述用户推荐相应概率排序所对应的所述视频;及
    当所述概率排序大于所述目标阈值时,确定不为所述用户推荐相应概率排序所对应的所述视频。
  15. 一种推荐视频展示方法,由电子设备执行,所述方法包括:
    显示视频展示界面,所述视频展示界面中包括至少一个第一推荐视频;
    当检测到对任一第一推荐视频的点击操作时,响应于所述点击操作,将所述第一推荐视频的观看记录发送至服务器,所述观看记录用于指示所述服务器基于所述观看记录对视频推荐模型进行优化训练,并实时返回至少一个第二推荐视频的视频信息;及
    当接收到所述至少一个第二推荐视频的视频信息时,基于所述至少一个第二推荐视频的视频信息,在所述视频展示界面中展示所述至少一个第二推荐视频。
  16. 一种视频推荐装置,其特征在于,所述装置包括:
    第一输出模块,用于将视频输入第一特征提取网络,通过所述第一特征提取网络对所述视频中的至少一个连续视频帧进行特征提取,输出所述视频的视频特征;
    第二输出模块,用于将用户的用户数据输入第二特征提取网络,通过所述第二特征提取网络对离散的所述用户数据进行特征提取,输出所述用户的用户特征;
    融合得到模块,用于基于所述视频特征和所述用户特征进行特征融合,得到对所述用户推荐所述视频的推荐概率;及
    确定推荐模块,用于根据所述推荐概率,确定是否对所述用户推荐所述 视频。
  17. 一种推荐视频展示装置,其特征在于,所述装置包括:
    显示模块,用于显示视频展示界面,所述视频展示界面中包括至少一个第一推荐视频;
    发送模块,用于当检测到对任一第一推荐视频的点击操作时,响应于所述点击操作,将所述第一推荐视频的观看记录发送至服务器,所述观看记录用于指示所述服务器基于所述观看记录对视频推荐模型进行优化训练,并实时返回至少一个第二推荐视频的视频信息;及
    展示模块,用于当接收到所述至少一个第二推荐视频的视频信息时,基于所述至少一个第二推荐视频的视频信息,在所述视频展示界面中展示所述至少一个第二推荐视频。
  18. 一种计算机设备,其特征在于,所述计算机设备包括处理器和存储器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行如权利要求1至14任一所述的视频推荐方法的步骤。
  19. 一种电子设备,其特征在于,所述电子设备包括处理器和存储器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行如权利要求15所述的推荐视频展示方法的步骤。
  20. 一种非易失性的计算机可读存储介质,存储有计算机可读指令,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如权利要求1至14任一所述的视频推荐方法的步骤,或,如权利要求15所述的推荐视频展示方法的步骤。
PCT/CN2020/081052 2019-04-23 2020-03-25 视频推荐方法、装置、计算机设备及存储介质 WO2020215962A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/329,928 US11540019B2 (en) 2019-04-23 2021-05-25 Video recommendation method and device, computer device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910330212.9A CN110149541B (zh) 2019-04-23 2019-04-23 视频推荐方法、装置、计算机设备及存储介质
CN201910330212.9 2019-04-23

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/329,928 Continuation US11540019B2 (en) 2019-04-23 2021-05-25 Video recommendation method and device, computer device and storage medium

Publications (1)

Publication Number Publication Date
WO2020215962A1 true WO2020215962A1 (zh) 2020-10-29

Family

ID=67593940

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/081052 WO2020215962A1 (zh) 2019-04-23 2020-03-25 视频推荐方法、装置、计算机设备及存储介质

Country Status (3)

Country Link
US (1) US11540019B2 (zh)
CN (1) CN110149541B (zh)
WO (1) WO2020215962A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112437349A (zh) * 2020-11-10 2021-03-02 杭州时趣信息技术有限公司 一种视频流推荐方法及相关装置
CN112738555A (zh) * 2020-12-22 2021-04-30 上海哔哩哔哩科技有限公司 视频处理方法及装置
CN112966148A (zh) * 2021-03-05 2021-06-15 安徽师范大学 基于深度学习和特征融合的视频推荐方法和***
CN113239273A (zh) * 2021-05-14 2021-08-10 北京百度网讯科技有限公司 用于生成文本的方法、装置、设备以及存储介质
CN113869272A (zh) * 2021-10-13 2021-12-31 北京达佳互联信息技术有限公司 基于特征提取模型的处理方法、装置、电子设备及介质
CN114501076A (zh) * 2022-02-07 2022-05-13 浙江核新同花顺网络信息股份有限公司 视频生成方法、设备以及介质
US11540019B2 (en) 2019-04-23 2022-12-27 Tencent Technology (Shenzhen) Company Limited Video recommendation method and device, computer device and storage medium
EP4207770A4 (en) * 2020-12-22 2024-03-06 Shanghai Hode Information Technology Co., Ltd. VIDEO PROCESSING METHOD AND APPARATUS

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086709B (zh) * 2018-07-27 2023-04-07 腾讯科技(深圳)有限公司 特征提取模型训练方法、装置及存储介质
US11551688B1 (en) * 2019-08-15 2023-01-10 Snap Inc. Wearable speech input-based vision to audio interpreter
CN110798718B (zh) * 2019-09-02 2021-10-08 腾讯科技(深圳)有限公司 一种视频推荐方法以及装置
CN110598853B (zh) * 2019-09-11 2022-03-15 腾讯科技(深圳)有限公司 一种模型训练的方法、信息处理的方法以及相关装置
CN110609955B (zh) * 2019-09-16 2022-04-05 腾讯科技(深圳)有限公司 一种视频推荐的方法及相关设备
CN110837598B (zh) * 2019-11-11 2021-03-19 腾讯科技(深圳)有限公司 信息推荐方法、装置、设备及存储介质
FR3103601B1 (fr) * 2019-11-25 2023-03-10 Idemia Identity & Security France Procédé de classification d’une empreinte biométrique représentée par une image d’entrée
CN110941727B (zh) * 2019-11-29 2023-09-29 北京达佳互联信息技术有限公司 一种资源推荐方法、装置、电子设备及存储介质
CN111159542B (zh) * 2019-12-12 2023-05-05 中国科学院深圳先进技术研究院 一种基于自适应微调策略的跨领域序列推荐方法
CN112749297B (zh) * 2020-03-03 2023-07-21 腾讯科技(深圳)有限公司 视频推荐方法、装置、计算机设备和计算机可读存储介质
CN111491187B (zh) * 2020-04-15 2023-10-31 腾讯科技(深圳)有限公司 视频的推荐方法、装置、设备及存储介质
CN113573097A (zh) * 2020-04-29 2021-10-29 北京达佳互联信息技术有限公司 视频推荐方法、装置、服务器及存储介质
CN113836390B (zh) * 2020-06-24 2023-10-27 北京达佳互联信息技术有限公司 资源推荐方法、装置、计算机设备及存储介质
CN111967599B (zh) * 2020-08-25 2023-07-28 百度在线网络技术(北京)有限公司 用于训练模型的方法、装置、电子设备及可读存储介质
CN112215095A (zh) * 2020-09-24 2021-01-12 西北工业大学 违禁品检测方法、装置、处理器和安检***
CN114443671A (zh) * 2020-11-04 2022-05-06 腾讯科技(深圳)有限公司 推荐模型的更新方法、装置、计算机设备和存储介质
CN112507216B (zh) * 2020-12-01 2023-07-18 北京奇艺世纪科技有限公司 一种数据对象推荐方法、装置、设备和存储介质
CN114596193A (zh) * 2020-12-04 2022-06-07 英特尔公司 用于确定比赛状态的方法和装置
CN112464857A (zh) * 2020-12-07 2021-03-09 深圳市欢太科技有限公司 视频分类模型训练及视频分类方法、装置、介质和设备
CN112541846B (zh) * 2020-12-22 2022-11-29 山东师范大学 一种基于注意力机制的高校选修课混合推荐方法及***
CN112765480B (zh) * 2021-04-12 2021-06-18 腾讯科技(深圳)有限公司 一种信息推送方法、装置及计算机可读存储介质
US20230081916A1 (en) * 2021-09-14 2023-03-16 Black Sesame Technologies Inc. Intelligent video enhancement system
CN113868466B (zh) * 2021-12-06 2022-03-01 北京搜狐新媒体信息技术有限公司 视频推荐的方法、装置、设备和存储介质
CN114529761A (zh) * 2022-01-29 2022-05-24 腾讯科技(深圳)有限公司 基于分类模型的视频分类方法、装置、设备、介质及产品
CN114245206B (zh) * 2022-02-23 2022-07-15 阿里巴巴达摩院(杭州)科技有限公司 视频处理方法及装置
CN114638285B (zh) * 2022-02-25 2024-04-19 武汉大学 一种对手机惯性传感器数据的多模式识别方法
CN115065872A (zh) * 2022-06-17 2022-09-16 联通沃音乐文化有限公司 一种影音视频的智能推荐方法及***
CN115309975B (zh) * 2022-06-28 2024-06-07 中银金融科技有限公司 基于交互特征的产品推荐方法及***
CN115630173B (zh) * 2022-09-08 2023-08-18 湖北华中电力科技开发有限责任公司 一种基于兴趣度分析的用户数据管理方法
US12020276B1 (en) * 2023-01-31 2024-06-25 Walmart Apollo, Llc Systems and methods for benefit affinity using trained affinity models
CN117253061B (zh) * 2023-09-12 2024-05-28 鲸湾科技(南通)有限公司 数据推荐方法、装置及计算机可读介质

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150193540A1 (en) * 2014-01-06 2015-07-09 Yahoo! Inc. Content ranking based on user features in content
CN106407418A (zh) * 2016-09-23 2017-02-15 Tcl集团股份有限公司 一种基于人脸识别的个性化视频推荐方法及推荐***
US20170061286A1 (en) * 2015-08-27 2017-03-02 Skytree, Inc. Supervised Learning Based Recommendation System
CN107330115A (zh) * 2017-07-12 2017-11-07 广东工业大学 一种信息推荐方法及装置
CN107544981A (zh) * 2016-06-25 2018-01-05 华为技术有限公司 内容推荐方法及装置
CN108243357A (zh) * 2018-01-25 2018-07-03 北京搜狐新媒体信息技术有限公司 一种视频推荐方法及装置
CN108833973A (zh) * 2018-06-28 2018-11-16 腾讯科技(深圳)有限公司 视频特征的提取方法、装置和计算机设备
CN109547814A (zh) * 2018-12-13 2019-03-29 北京达佳互联信息技术有限公司 视频推荐方法、装置、服务器及存储介质
CN110149541A (zh) * 2019-04-23 2019-08-20 腾讯科技(深圳)有限公司 视频推荐方法、装置、计算机设备及存储介质

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103295016B (zh) * 2013-06-26 2017-04-12 天津理工大学 基于深度与rgb信息和多尺度多方向等级层次特征的行为识别方法
CN106682233B (zh) * 2017-01-16 2020-03-10 华侨大学 一种基于深度学习与局部特征融合的哈希图像检索方法
CN107911719B (zh) * 2017-10-30 2019-11-08 中国科学院自动化研究所 视频动态推荐装置
US11055764B2 (en) * 2018-01-29 2021-07-06 Selligent, S.A. Systems and methods for providing personalized online content
US11521044B2 (en) * 2018-05-17 2022-12-06 International Business Machines Corporation Action detection by exploiting motion in receptive fields
CN108764317B (zh) * 2018-05-21 2021-11-23 浙江工业大学 一种基于多路特征加权的残差卷积神经网络图像分类方法
CN109165350A (zh) * 2018-08-23 2019-01-08 成都品果科技有限公司 一种基于深度知识感知的信息推荐方法和***
US11153655B1 (en) * 2018-09-26 2021-10-19 Amazon Technologies, Inc. Content appeal prediction using machine learning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150193540A1 (en) * 2014-01-06 2015-07-09 Yahoo! Inc. Content ranking based on user features in content
US20170061286A1 (en) * 2015-08-27 2017-03-02 Skytree, Inc. Supervised Learning Based Recommendation System
CN107544981A (zh) * 2016-06-25 2018-01-05 华为技术有限公司 内容推荐方法及装置
CN106407418A (zh) * 2016-09-23 2017-02-15 Tcl集团股份有限公司 一种基于人脸识别的个性化视频推荐方法及推荐***
CN107330115A (zh) * 2017-07-12 2017-11-07 广东工业大学 一种信息推荐方法及装置
CN108243357A (zh) * 2018-01-25 2018-07-03 北京搜狐新媒体信息技术有限公司 一种视频推荐方法及装置
CN108833973A (zh) * 2018-06-28 2018-11-16 腾讯科技(深圳)有限公司 视频特征的提取方法、装置和计算机设备
CN109547814A (zh) * 2018-12-13 2019-03-29 北京达佳互联信息技术有限公司 视频推荐方法、装置、服务器及存储介质
CN110149541A (zh) * 2019-04-23 2019-08-20 腾讯科技(深圳)有限公司 视频推荐方法、装置、计算机设备及存储介质

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11540019B2 (en) 2019-04-23 2022-12-27 Tencent Technology (Shenzhen) Company Limited Video recommendation method and device, computer device and storage medium
CN112437349A (zh) * 2020-11-10 2021-03-02 杭州时趣信息技术有限公司 一种视频流推荐方法及相关装置
CN112437349B (zh) * 2020-11-10 2022-09-23 杭州时趣信息技术有限公司 一种视频流推荐方法及相关装置
CN112738555A (zh) * 2020-12-22 2021-04-30 上海哔哩哔哩科技有限公司 视频处理方法及装置
EP4207770A4 (en) * 2020-12-22 2024-03-06 Shanghai Hode Information Technology Co., Ltd. VIDEO PROCESSING METHOD AND APPARATUS
CN112738555B (zh) * 2020-12-22 2024-03-29 上海幻电信息科技有限公司 视频处理方法及装置
CN112966148A (zh) * 2021-03-05 2021-06-15 安徽师范大学 基于深度学习和特征融合的视频推荐方法和***
CN113239273A (zh) * 2021-05-14 2021-08-10 北京百度网讯科技有限公司 用于生成文本的方法、装置、设备以及存储介质
CN113239273B (zh) * 2021-05-14 2023-07-28 北京百度网讯科技有限公司 用于生成文本的方法、装置、设备以及存储介质
CN113869272A (zh) * 2021-10-13 2021-12-31 北京达佳互联信息技术有限公司 基于特征提取模型的处理方法、装置、电子设备及介质
CN114501076A (zh) * 2022-02-07 2022-05-13 浙江核新同花顺网络信息股份有限公司 视频生成方法、设备以及介质

Also Published As

Publication number Publication date
US20210281918A1 (en) 2021-09-09
CN110149541A (zh) 2019-08-20
US11540019B2 (en) 2022-12-27
CN110149541B (zh) 2021-08-03

Similar Documents

Publication Publication Date Title
WO2020215962A1 (zh) 视频推荐方法、装置、计算机设备及存储介质
US11250090B2 (en) Recommended content display method, device, and system
CN109740068B (zh) 媒体数据推荐方法、装置及存储介质
CN108304441B (zh) 网络资源推荐方法、装置、电子设备、服务器及存储介质
CN109918669B (zh) 实体确定方法、装置及存储介质
CN109284445B (zh) 网络资源的推荐方法、装置、服务器及存储介质
CN111897996B (zh) 话题标签推荐方法、装置、设备及存储介质
CN110413837B (zh) 视频推荐方法和装置
CN111291200B (zh) 多媒体资源展示方法、装置、计算机设备及存储介质
CN111737573A (zh) 资源推荐方法、装置、设备及存储介质
WO2022057435A1 (zh) 基于搜索的问答方法及存储介质
CN111831917A (zh) 内容推荐方法、装置、设备及介质
CN114154068A (zh) 媒体内容推荐方法、装置、电子设备及存储介质
CN113987326B (zh) 资源推荐方法、装置、计算机设备及介质
CN114117206B (zh) 推荐模型处理方法、装置、电子设备及存储介质
CN110166275B (zh) 信息处理方法、装置及存储介质
CN111782950B (zh) 样本数据集获取方法、装置、设备及存储介质
CN113886609A (zh) 多媒体资源推荐方法、装置、电子设备及存储介质
CN114691860A (zh) 文本分类模型的训练方法、装置、电子设备及存储介质
CN110297970B (zh) 信息推荐模型训练方法及装置
CN112230822B (zh) 评论信息的显示方法、装置、终端及存储介质
CN113377976B (zh) 资源搜索方法、装置、计算机设备及存储介质
CN115203573A (zh) 画像标签生成方法、模型训练方法、装置、介质及芯片
CN113139614A (zh) 特征提取方法、装置、电子设备及存储介质
CN109635153B (zh) 迁移路径生成方法、装置及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20795486

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20795486

Country of ref document: EP

Kind code of ref document: A1