Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be noted that, for convenience of description, only the portions relevant to the invention are shown in the drawings. The embodiments and features of the embodiments in the present disclosure may be combined with each other provided they do not conflict.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will understand them to mean "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates a schematic diagram of one application scenario in which the method of generating classification information for video of some embodiments of the present disclosure may be applied.
The method for generating the classification information of the video, provided by some embodiments of the present disclosure, may be executed by a terminal device or a server.
It should be noted that the terminal device may be hardware or software. When the terminal device is hardware, it may be any of various electronic devices supporting video processing, including but not limited to a smart phone, a tablet computer, an e-book reader, a laptop computer, a desktop computer, and the like. When the terminal device is software, it can be installed in the electronic devices listed above. It may be implemented, for example, as multiple pieces of software or software modules to provide distributed services, or as a single piece of software or software module. No particular limitation is imposed herein.
The server may also be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or as a single server. When the server is software, it may be implemented, for example, as multiple pieces of software or software modules to provide distributed services, or as a single piece of software or software module. No particular limitation is imposed herein.
In the application scenario 100 shown in fig. 1, the execution subject of the method of generating classification information of a video may be the server 101. The server 101 may extract feature information 103 of the target video 102. The server 101 may also input the target video 102 into a pre-trained user feedback prediction model 104, so as to obtain user feedback information 105 corresponding to the target video 102. The user feedback prediction model 104 may be, for example, a trained artificial neural network. On this basis, the server 101 may fuse the feature information 103 and the user feedback information 105 to obtain fused information 106. Then, the fused information 106 is input into a pre-trained video classification model 107 to obtain classification information 108 corresponding to the target video 102.
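The data flow of this scenario can be sketched as follows. This is only an illustrative sketch: the three callables are hypothetical stand-ins for the feature extraction step, the user feedback prediction model 104, and the video classification model 107, the video is assumed to be a plain list of numbers, and concatenation is assumed as the fusion method.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def generate_classification(
    video: Sequence[float],
    extract_features: Callable[[Sequence[float]], Vector],
    predict_feedback: Callable[[Sequence[float]], Vector],
    classify: Callable[[Vector], str],
) -> str:
    """Extract features, predict user feedback, fuse both, then classify."""
    features = extract_features(video)   # feature information 103
    feedback = predict_feedback(video)   # user feedback information 105
    fused = features + feedback          # fused information 106 (concatenation assumed)
    return classify(fused)               # classification information 108


# Hypothetical toy stand-ins for the trained models, for illustration only.
def toy_extract(video: Sequence[float]) -> Vector:
    return [sum(video) / len(video)]             # e.g. an average intensity value


def toy_feedback(video: Sequence[float]) -> Vector:
    return [0.8 if max(video) > 0.5 else 0.2]    # e.g. a predicted click rate


def toy_classify(fused: Vector) -> str:
    return "popular" if fused[1] > 0.5 else "niche"


label = generate_classification([0.1, 0.9, 0.4], toy_extract, toy_feedback, toy_classify)
```

Passing the models in as callables keeps the sketch independent of any particular network architecture.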
With continued reference to fig. 2, a flow 200 of some embodiments of a method of generating classification information for a video in accordance with the present disclosure is shown. The method for generating the classification information of the video comprises the following steps:
Step 201, extracting feature information of the target video.
In some embodiments, an execution subject of the method of generating classification information of a video may extract feature information of a target video in various ways. The feature information of the target video may be information representing various features of the target video, including but not limited to: the hue of the video, the classes of objects contained in the video, the emotional mood of the video, and so on. As an example, one or more features may be specified in advance, and on this basis, the values corresponding to the specified features are determined as the feature information of the target video. The target video may be any video. The target video may be designated directly or selected by screening according to certain conditions.
In some optional implementation manners, the target video may also be input into a pre-trained video feature extraction model, so as to obtain feature information of the target video. The video feature extraction model is used for representing the corresponding relation between the video and the feature information.
As an example, the video feature extraction model may be a database storing a large number of videos, in which the videos and their corresponding feature information are stored in association. The execution subject can therefore search the database for the target video and obtain a video whose similarity to the target video is greater than a preset threshold. On this basis, the feature information corresponding to that video can be determined as the feature information of the target video.
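A minimal sketch of such a database lookup is given below. It assumes, for illustration only, that each stored video is represented by a hypothetical signature vector, that similarity is measured by cosine similarity, and that a stored video counts as a match when its similarity to the target reaches the threshold. None of these choices is prescribed by the disclosure.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

# Each entry pairs a hypothetical video signature with its stored feature information.
Library = List[Tuple[Sequence[float], Dict[str, str]]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lookup_features(
    target: Sequence[float], library: Library, threshold: float
) -> Optional[Dict[str, str]]:
    """Return the feature information of the most similar stored video,
    or None when no stored video reaches the similarity threshold."""
    best_info = None
    best_sim = threshold
    for signature, info in library:
        sim = cosine(target, signature)
        if sim >= best_sim:
            best_info, best_sim = info, sim
    return best_info


library: Library = [
    ([1.0, 0.0], {"hue": "warm"}),
    ([0.0, 1.0], {"hue": "cool"}),
]
match = lookup_features([0.9, 0.1], library, threshold=0.8)
```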
As an example, the video feature extraction model may be an artificial neural network for extracting video features. The specific structure of the artificial neural network can be determined according to actual needs. For example, it may be a Recurrent Neural Network (RNN). On the basis, the training sample set can be used for training so that the parameter value of the artificial neural network meets the requirement. The training samples in the training sample set may include sample videos and sample feature information corresponding to the sample videos. In practice, as an example, the sample feature information corresponding to the sample video may be obtained by way of manual labeling.
Step 202, inputting the target video into a pre-trained user feedback prediction model to obtain user feedback information corresponding to the target video.
In some embodiments, the executing entity may input the target video into a pre-trained user feedback prediction model, so as to obtain user feedback information corresponding to the target video. The user feedback prediction model is used for representing the corresponding relation between the video and the user feedback information. The user feedback information is used for describing the feedback of the user to the video. In practice, the user feedback information may include, but is not limited to: click rate, forward amount, like amount, comment emotion orientation information, and the like.
As an example, the user feedback prediction model may be obtained by training an untrained or partially trained CTR (click-through rate) prediction model. To this end, a training sample set can be obtained, wherein each training sample comprises a sample video and user feedback information (such as click rate) corresponding to the sample video. The CTR prediction model is then trained based on the training sample set to obtain the user feedback prediction model. Specifically, a training sample may be selected from the training sample set, the sample video of the training sample is used as the input, the user feedback information corresponding to the sample video is used as the expected output, and the CTR prediction model is trained based on a preset loss function and a back propagation algorithm. The loss function can be used to characterize the degree of difference between the output of the CTR prediction model and the sample user feedback information. On this basis, the parameter values of the CTR prediction model can be adjusted through the back propagation (BP) algorithm, which is also called the error back propagation algorithm. The BP algorithm consists of two processes in learning: forward propagation of signals and backward propagation of errors. In a feedforward network, an input signal enters through the input layer, is processed by the hidden layers, and is output through the output layer. The output value is compared with the label value; if there is an error, the error is propagated backward from the output layer toward the input layer, and in this process a gradient descent algorithm (such as stochastic gradient descent) can be used to adjust the neuron weights (that is, the parameter values of the CTR prediction model). Here, the above-mentioned loss function can be used to characterize the error between the output value and the label value.
As an example, when a training stop condition is satisfied, the current CTR prediction model may be determined as the user feedback prediction model. The training stop condition may be, for example, that the number of iterations satisfies a preset condition, or that the error value is smaller than a preset threshold.
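The training procedure described above can be illustrated with a deliberately small sketch. It assumes a single sigmoid unit in place of a full CTR prediction network, a squared-error loss as the preset loss function, and sample videos that are already represented as numeric vectors. All names and hyperparameter values are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (sample video vector, observed click rate)


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Forward propagation: weighted sum followed by a sigmoid."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_ctr_model(
    samples: List[Sample],
    lr: float = 0.5,
    max_epochs: int = 2000,
    loss_threshold: float = 1e-3,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Train with stochastic gradient descent until a stop condition is met."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(max_epochs):              # stop condition 1: iteration limit
        rng.shuffle(order)
        for i in order:
            x, y = samples[i]
            out = predict(weights, bias, x)
            # Backward propagation of the error for the loss 0.5 * (out - y) ** 2.
            delta = (out - y) * out * (1.0 - out)
            weights = [w - lr * delta * xi for w, xi in zip(weights, x)]
            bias -= lr * delta
        loss = sum(0.5 * (predict(weights, bias, x) - y) ** 2 for x, y in samples)
        loss /= len(samples)
        if loss < loss_threshold:            # stop condition 2: error below threshold
            break
    return weights, bias


samples: List[Sample] = [([1.0, 0.0], 0.9), ([0.0, 1.0], 0.1)]
weights, bias = train_ctr_model(samples)
```

In a multi-layer network the same error term would be propagated through each hidden layer in turn; the single unit here only shows the forward pass, the error signal, and the weight update.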
In some optional implementations, the view count of the sample video contained in each training sample in the training sample set is greater than a preset threshold. Selecting videos whose view count is greater than the preset threshold as sample videos allows the parameter values of the model trained on those samples to be more accurate. As a result, the obtained user feedback information is more accurate, and so is the finally obtained classification information.
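As a small illustration of this selection step, the sketch below keeps only candidate samples whose view count exceeds a threshold. The record layout (`video_id`, `views`, `click_rate`) and the threshold value are assumptions made for the example.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[str, int, float]]


def select_training_samples(candidates: List[Record], min_views: int) -> List[Record]:
    """Keep only the records whose view count is strictly greater than min_views."""
    return [record for record in candidates if record["views"] > min_views]


candidates: List[Record] = [
    {"video_id": "a", "views": 12000, "click_rate": 0.31},
    {"video_id": "b", "views": 40, "click_rate": 0.50},
    {"video_id": "c", "views": 5300, "click_rate": 0.12},
]
training_set = select_training_samples(candidates, min_views=1000)
```

Video "b" is dropped here because a click rate measured over only 40 views is a noisy label.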
Step 203, fusing the feature information and the user feedback information, and inputting the fused information into a pre-trained video classification model to obtain classification information corresponding to the target video.
In some embodiments, the execution subject may fuse the feature information and the user feedback information and input the fused information into a pre-trained video classification model, so as to obtain classification information corresponding to the target video. The video classification model is used for representing the correspondence between the information obtained by fusing the feature information and the user feedback information and the classification information of the video. The video classification model may be a trained artificial neural network. Various neural networks supporting classification, such as convolutional neural networks, can be employed as the initial video classification model. On this basis, the initial video classification model is trained using a training sample set to obtain the video classification model. As another example, an open-source video classification network, such as Attention Cluster, may also be used as the initial video classification model.
In some embodiments, the feature information and the user feedback information may be fused in a variety of ways. For example, the intersection or the union of the feature information and the user feedback information may be taken. As another example, the feature information and the user feedback information may be combined by weighted averaging. The specific fusion method can be selected according to actual needs and is not limited herein.
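Two of these fusion options are sketched below under simple assumptions: weighted averaging is applied to two numeric vectors of equal length, and intersection or union is applied to sets of descriptive tags. The function names and the default weight are hypothetical.

```python
from typing import List, Sequence, Set


def weighted_average(
    features: Sequence[float], feedback: Sequence[float], w_features: float = 0.5
) -> List[float]:
    """Element-wise weighted average; the two weights sum to 1."""
    if len(features) != len(feedback):
        raise ValueError("vectors must have the same length")
    w_feedback = 1.0 - w_features
    return [w_features * a + w_feedback * b for a, b in zip(features, feedback)]


def fuse_tags(features: Set[str], feedback: Set[str], mode: str = "union") -> Set[str]:
    """Fuse two tag sets by taking their union or their intersection."""
    if mode == "union":
        return features | feedback
    if mode == "intersection":
        return features & feedback
    raise ValueError("mode must be 'union' or 'intersection'")


averaged = weighted_average([1.0, 0.0], [0.0, 1.0])
common = fuse_tags({"sports", "outdoor"}, {"outdoor", "viral"}, mode="intersection")
```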
In some optional implementations, the feature information is a feature vector, and the user feedback information is a user feedback vector. In this case, fusing the feature information and the user feedback information and inputting the fused information into a pre-trained video classification model includes: concatenating the feature vector and the user feedback vector, and inputting the concatenated result into the pre-trained video classification model. The specific form of concatenation can be determined according to actual needs. For example, two row vectors may be stacked into a two-row matrix. As another example, two vectors may be joined end to end into one long vector.
For example, the feature vector may be [1,0,0,0,0,0,0,0,0,0] and the user feedback vector may be [0,1,0,0,0,0,0,0,0,0]. Then the fused information may be the two-row matrix:
[1,0,0,0,0,0,0,0,0,0]
[0,1,0,0,0,0,0,0,0,0]
Of course, the fused information may also be the single long vector: [1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0].
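The two concatenation forms can be reproduced with plain Python lists, using the example vectors given above. The helper names `stack_rows` and `join_end_to_end` are hypothetical.

```python
from typing import List, Sequence


def stack_rows(features: Sequence[int], feedback: Sequence[int]) -> List[List[int]]:
    """Stack two row vectors of equal length into a two-row matrix."""
    if len(features) != len(feedback):
        raise ValueError("row vectors must have the same length")
    return [list(features), list(feedback)]


def join_end_to_end(features: Sequence[int], feedback: Sequence[int]) -> List[int]:
    """Join two vectors into one long vector."""
    return list(features) + list(feedback)


feature_vector = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
feedback_vector = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

matrix = stack_rows(feature_vector, feedback_vector)
long_vector = join_end_to_end(feature_vector, feedback_vector)
```

Stacking requires the two vectors to have equal length, whereas end-to-end joining does not.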
In the method for generating the classification information of the video provided by some embodiments of the disclosure, because the user feedback information reflects the video content to a certain extent, fusing the feature information with the user feedback information enriches the information available for classification, so that the classification result is more accurate. This in turn helps make the recommendation results more accurate when videos are recommended to users.
With further reference to fig. 3, a flow 300 of further embodiments of methods of generating classification information for a video is illustrated. The process 300 of the method for generating classification information of videos includes the following steps:
Step 301, extracting feature information of the target video.
Step 302, inputting the target video into a pre-trained user feedback prediction model to obtain user feedback information corresponding to the target video.
The user feedback prediction model is used for representing the corresponding relation between the video and the user feedback information.
Step 303, fusing the feature information and the user feedback information, and inputting the fused information into a pre-trained video classification model to obtain classification information corresponding to the target video.
In some embodiments, specific implementations of steps 301 to 303 and technical effects brought by the same may refer to those embodiments corresponding to fig. 2, and are not described herein again.
Step 304, selecting a preset number of videos with the highest matching degree with the target user as recommended videos based on the attribute information of the target user and the classification information of the videos in the target video set.
In some embodiments, the execution subject of the method for generating classification information of videos may select, as recommended videos, a preset number of videos with the highest matching degree with the target user, based on the attribute information of the target user and the classification information corresponding to each video in the target video set. The classification information corresponding to each video can be obtained according to steps 301 to 303.
In some embodiments, the attribute information of the user may be information for describing various attributes of the user, including but not limited to: age, gender, location, category of preferred videos, and so forth. The target user may be any user. The target video set may also be any arbitrary video set. The target users and the target video sets can be determined through specification or screening according to certain conditions. For example, one user currently opening a video viewing class application may be determined as the target user. And determining a set formed by a plurality of videos stored in a server corresponding to the video watching application as a target video set.
Specifically, as an example, for each video in the target video set, the execution subject may calculate the matching degree between the attribute information of the target user and the classification information of that video, thereby obtaining a matching degree for each video. On this basis, a preset number of videos can be selected as recommended videos in descending order of matching degree. The matching degree can be calculated by a cosine similarity algorithm, a Jaccard coefficient, or the like. Taking the Jaccard coefficient as an example, the similarity between the attribute information of the target user and the classification information of a given video in the target video set = (the number of words shared by the attribute information of the target user and the classification information of the video) / (the total number of distinct words contained in the attribute information of the target user and the classification information of the video taken together).
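The Jaccard-based selection can be sketched as follows, assuming that both the user attribute information and each video's classification information are available as sets of words. The identifiers and sample data are hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared words divided by all distinct words; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user_words: Set[str], video_words: Dict[str, Set[str]], k: int) -> List[str]:
    """Return the ids of the k videos with the highest matching degree."""
    ranked = sorted(
        video_words.items(),
        key=lambda item: jaccard(user_words, item[1]),
        reverse=True,  # descending order of matching degree
    )
    return [video_id for video_id, _ in ranked[:k]]


user_words = {"sports", "comedy"}
video_words = {
    "v1": {"sports", "news"},     # 1 shared word out of 3 distinct words
    "v2": {"comedy", "sports"},   # 2 shared words out of 2 distinct words
    "v3": {"cooking"},            # no shared words
}
recommended = recommend(user_words, video_words, k=2)
```

Cosine similarity could be substituted for `jaccard` without changing the ranking logic, provided the inputs are represented as numeric vectors instead of word sets.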
Step 305, pushing the recommended video to a terminal corresponding to the target user.
On the basis of step 304, the executing entity may push the obtained recommended video to a terminal corresponding to the target user.
As can be seen from fig. 3, compared with the description of some embodiments corresponding to fig. 2, the flow 300 of the method for generating classification information of videos in some embodiments corresponding to fig. 3 adds steps of obtaining a recommended video based on the classification information of videos, and recommending the video to the terminal. Therefore, the scheme described by the embodiment can realize more accurate video recommendation.
With further reference to fig. 4, as an implementation of the methods illustrated in the above figures, the present disclosure provides some embodiments of an apparatus for generating classification information of a video. These apparatus embodiments correspond to the method embodiments shown in fig. 2, and the apparatus may be applied in various electronic devices.
As shown in fig. 4, the apparatus 400 for generating classification information of a video of some embodiments includes: an extraction unit 401, a user feedback information generation unit 402, and a classification information generation unit 403. The extraction unit 401 is configured to extract feature information of the target video. The user feedback information generation unit 402 is configured to input the target video into a pre-trained user feedback prediction model to obtain user feedback information corresponding to the target video. The user feedback prediction model is used for representing the correspondence between the video and the user feedback information. The classification information generation unit 403 is configured to fuse the feature information and the user feedback information and input the fused information into a pre-trained video classification model, so as to obtain classification information corresponding to the target video.
In some embodiments, specific implementations of the extracting unit 401, the user feedback information generating unit 402, and the classification information generating unit 403 in the apparatus 400 for generating classification information of a video and technical effects brought by the specific implementations may refer to those embodiments corresponding to fig. 2, which are not described herein again.
In some optional implementations, the user feedback prediction model is trained by: acquiring a training sample set, wherein the training sample comprises a sample video and user feedback information corresponding to the sample video; selecting a training sample from a training sample set, taking a sample video of the training sample as input, taking user feedback information corresponding to the sample video as expected output, and training to obtain a user feedback prediction model.
In some optional implementations, the view count of the sample video contained in each training sample in the training sample set is greater than a preset threshold.
In some optional implementations, the feature information is a feature vector, and the user feedback information is a user feedback vector; and the classification information generation unit 403 may be further configured to: concatenate the feature vector and the user feedback vector, and input the concatenated result into the pre-trained video classification model.
In some optional implementations, the apparatus 400 further comprises: a selecting unit (not shown) and a pushing unit (not shown). The selecting unit is configured to select a preset number of videos with the highest matching degree with the target user as recommended videos based on the attribute information of the target user and the classification information of the videos in the target video set. The pushing unit is configured to push the recommended video to a terminal corresponding to the target user.
In some embodiments, since the user feedback information can reflect the video content to a certain extent, the information richness is increased by fusing the characteristic information and the user feedback information, so that the classification result is more accurate.
Referring now to fig. 5, a block diagram of an electronic device (e.g., server in fig. 1) 500 suitable for use in implementing some embodiments of the present disclosure is shown. The server shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, electronic device 500 may include a processing means (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the electronic apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 5 may represent one device or may represent multiple devices as desired.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network via the communication device 509, or installed from the storage device 508, or installed from the ROM 502. The computer program, when executed by the processing device 501, performs the above-described functions defined in the methods of some embodiments of the present disclosure.
It should be noted that the computer readable medium described in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: extracting characteristic information of a target video; inputting the target video into a pre-trained user feedback prediction model to obtain user feedback information corresponding to the target video, wherein the user feedback prediction model is used for representing the corresponding relation between the video and the user feedback information; and fusing the characteristic information and the user feedback information and inputting the fused information into a pre-trained video classification model to obtain classification information corresponding to the target video.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by software, and may also be implemented by hardware. The described units may also be provided in a processor, which may be described as: a processor includes an extraction unit, a user feedback information generation unit, and a classification information generation unit. Here, the names of these units do not constitute a limitation to the unit itself in some cases, and for example, the extraction unit may also be described as a "unit that extracts feature information of the target video".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
According to one or more embodiments of the present disclosure, there is provided a method of generating classification information of a video, including: extracting characteristic information of a target video; inputting the target video into a pre-trained user feedback prediction model to obtain user feedback information corresponding to the target video, wherein the user feedback prediction model is used for representing the corresponding relation between the video and the user feedback information; and fusing the characteristic information and the user feedback information and inputting the fused information into a pre-trained video classification model to obtain classification information corresponding to the target video.
According to one or more embodiments of the present disclosure, a user feedback prediction model is trained by: acquiring a training sample set, wherein the training sample comprises a sample video and user feedback information corresponding to the sample video; selecting a training sample from a training sample set, taking a sample video of the training sample as input, taking user feedback information corresponding to the sample video as expected output, and training to obtain a user feedback prediction model.
According to one or more embodiments of the present disclosure, the watching amount of the sample video contained in the training samples in the training sample set is larger than a preset threshold.
According to one or more embodiments of the present disclosure, the feature information is a feature vector, and the user feedback information is a user feedback vector; and fusing the feature information and the user feedback information and inputting the fused information into the pre-trained video classification model comprises: splicing (concatenating) the feature vector and the user feedback vector, and inputting the spliced vector into the pre-trained video classification model.
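Splicing can be illustrated with the short sketch below. The dimensions, the function names `splice` and `class_scores`, and the linear scoring head are assumptions chosen for illustration, and the weights are made-up numbers rather than a trained model. The point shown is that the classifier input size equals the sum of the two vector lengths.

```python
from typing import List, Sequence


def splice(feature_vector: Sequence[float], feedback_vector: Sequence[float]) -> List[float]:
    """Concatenate the two vectors end to end; no values are mixed or dropped."""
    return list(feature_vector) + list(feedback_vector)


def class_scores(weights: Sequence[Sequence[float]], spliced: Sequence[float]) -> List[float]:
    """Hypothetical linear classifier head with one weight row per class."""
    for row in weights:
        if len(row) != len(spliced):
            raise ValueError("classifier input size must equal len(feature) + len(feedback)")
    return [sum(w * x for w, x in zip(row, spliced)) for row in weights]


feature_vector = [0.2, 0.7, 0.1]   # assumed 3-dimensional feature vector
feedback_vector = [0.9, 0.05]      # assumed 2-dimensional user feedback vector
spliced = splice(feature_vector, feedback_vector)

weights = [
    [1.0, 0.0, 0.0, 1.0, 0.0],     # illustrative weights for class 0
    [0.0, 1.0, 0.0, 0.0, 1.0],     # illustrative weights for class 1
]
scores = class_scores(weights, spliced)
```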
In accordance with one or more embodiments of the present disclosure, the method further comprises: selecting a preset number of videos with the highest matching degree with the target user as recommended videos based on the attribute information of the target user and the classification information of the videos in the target video set; and pushing the recommended video to a terminal corresponding to the target user.
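The selection step can be sketched as a top-k query. The disclosure does not define how the matching degree is computed, so the sketch assumes, purely for illustration, that the attribute information of the target user is a mapping from category to interest weight, that the classification information of each video is a mapping from category to probability, and that the matching degree is their dot product. The names `match_score` and `recommend` are hypothetical.

```python
import heapq
from typing import Dict, List


def match_score(user_interests: Dict[str, float], video_classes: Dict[str, float]) -> float:
    """Assumed matching degree: dot product over shared categories."""
    return sum(user_interests.get(category, 0.0) * p for category, p in video_classes.items())


def recommend(
    user_interests: Dict[str, float],
    video_set: Dict[str, Dict[str, float]],
    k: int,
) -> List[str]:
    """Return the ids of the k videos with the highest matching degree."""
    return heapq.nlargest(
        k, video_set, key=lambda video_id: match_score(user_interests, video_set[video_id])
    )


user = {"sports": 1.0, "music": 0.2}
videos = {
    "v1": {"sports": 0.9, "music": 0.1},
    "v2": {"music": 1.0},
    "v3": {"sports": 0.5, "cooking": 0.5},
}
recommended = recommend(user, videos, k=2)
```

The resulting list of ids would then be pushed to the terminal corresponding to the target user; the push transport is outside the scope of this sketch.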
According to one or more embodiments of the present disclosure, there is provided an apparatus for generating classification information of a video, including: an extraction unit configured to extract feature information of a target video; a user feedback information generation unit configured to input the target video into a pre-trained user feedback prediction model to obtain user feedback information corresponding to the target video, wherein the user feedback prediction model is used for representing the correspondence between a video and user feedback information; and a classification information generation unit configured to fuse the feature information and the user feedback information and input the fused information into a pre-trained video classification model to obtain classification information corresponding to the target video.
According to one or more embodiments of the present disclosure, the user feedback prediction model is trained by: acquiring a training sample set, wherein each training sample comprises a sample video and user feedback information corresponding to the sample video; and selecting a training sample from the training sample set, taking the sample video of the training sample as input, taking the user feedback information corresponding to the sample video as expected output, and training to obtain the user feedback prediction model.
According to one or more embodiments of the present disclosure, the view count of each sample video contained in the training samples of the training sample set is greater than a preset threshold.
According to one or more embodiments of the present disclosure, the feature information is a feature vector, and the user feedback information is a user feedback vector; and the classification information generation unit is further configured to: splice (concatenate) the feature vector and the user feedback vector, and input the spliced vector into the pre-trained video classification model.
According to one or more embodiments of the present disclosure, the apparatus further comprises: a selecting unit configured to select a preset number of videos with the highest matching degree with a target user as recommended videos based on attribute information of the target user and classification information of videos in a target video set; and a pushing unit configured to push the recommended videos to a terminal corresponding to the target user.
According to one or more embodiments of the present disclosure, there is provided an electronic device including: one or more processors; and a storage device having one or more programs stored thereon, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement any of the methods described above.
According to one or more embodiments of the present disclosure, a computer-readable medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements any of the methods as described above.
The foregoing description covers only some preferred embodiments of the present disclosure and explains the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above-mentioned features, but also encompasses other technical solutions formed by any combination of the above-mentioned features or their equivalents without departing from the inventive concept defined above. One example is a technical solution formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.