CN113704541A - Training data acquisition method, video push method, device, medium and electronic equipment - Google Patents


Info

Publication number: CN113704541A
Authority: CN (China)
Prior art keywords: video, information, target, videos, feature vector
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202110219775.8A
Other languages: Chinese (zh)
Inventors: 李岩, 毛懿荣
Current Assignee: Tencent Technology Shenzhen Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to: CN202110219775.8A
Publication of: CN113704541A

Classifications

    • G06F16/735 — Information retrieval of video data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F16/75 — Information retrieval of video data; clustering; classification
    • G06F16/783 — Information retrieval of video data; retrieval characterised by using metadata automatically derived from the content
    • G06F16/7867 — Information retrieval of video data; retrieval characterised by using manually generated metadata, e.g. tags, keywords, comments, title and artist information
    • G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Embodiments of this application provide a training data acquisition method, a video push method, a corresponding apparatus, a computer-readable medium, and an electronic device. The training data acquisition method includes: acquiring tag information corresponding to each video to be selected, the tag information being edited by the publisher of that video; determining target tag information from the tag information according to quantity information about the tag information; determining target videos from the videos to be selected that correspond to the target tag information, according to quantity information about those videos; determining video features corresponding to each target video according to the video information of that video; and generating target training data corresponding to each target video according to the video features and tag information of that video. This technical solution improves the efficiency of acquiring target training data while ensuring its quality.

Description

Training data acquisition method, video push method, device, medium and electronic equipment
Technical Field
The application relates to the field of computer technology, and in particular to a training data acquisition method, a video push method, a corresponding apparatus, a computer-readable medium, and an electronic device.
Background
With the continuous development of internet technology, online videos have become increasingly abundant; users are no longer limited to watching videos on television and can also search the internet for videos that interest them. In current technical schemes, a video platform classifies videos in advance with a video classification model and pushes each video to corresponding target objects according to the classification result. However, training a video classification model relies on a large amount of training data, and labeling that training data consumes considerable time and resources. How to improve the efficiency of acquiring training data while ensuring its quality has therefore become an urgent technical problem.
Disclosure of Invention
Embodiments of the present application provide a training data acquisition method, a video push method, an apparatus, a medium, and an electronic device, so that the efficiency of acquiring training data can be improved at least to a certain extent while the quality of the training data is ensured.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of the embodiments of the present application, there is provided a method for acquiring training data, the method including:
acquiring tag information corresponding to each video to be selected, the tag information being edited by the publisher of the video to be selected;
determining target tag information from the tag information according to quantity information about the tag information;
determining target videos from the videos to be selected corresponding to the target tag information, according to quantity information about those videos to be selected;
determining video features corresponding to each target video according to the video information of that target video, the video information including at least two of image information, audio information and text information;
and generating target training data corresponding to each target video according to the video features and tag information of that target video.
According to an aspect of the embodiments of the present application, there is provided an apparatus for acquiring training data, the apparatus including:
an acquisition module, configured to acquire tag information corresponding to each video to be selected, the tag information being edited by the publisher of the video to be selected;
a first determining module, configured to determine target tag information from the tag information according to quantity information about the tag information;
a second determining module, configured to determine target videos from the videos to be selected corresponding to the target tag information, according to quantity information about those videos to be selected;
a third determining module, configured to determine video features corresponding to each target video according to the video information of that target video, the video information including at least two of image information, audio information and text information;
and a processing module, configured to generate target training data corresponding to each target video according to the video features and tag information of that target video.
In some embodiments of the present application, based on the foregoing scheme, the first determining module is configured to: count the number of occurrences of each piece of tag information; and identify target tag information from the tag information according to the number of occurrences of each piece of tag information.
In some embodiments of the present application, based on the foregoing scheme, the second determining module is configured to: determine, from the videos to be selected corresponding to each piece of target tag information, the number of those videos and the number of their publishers; perform quantity optimization processing on the videos to be selected corresponding to each piece of target tag information according to those two quantities; and identify the videos to be selected that remain for each piece of target tag information after the quantity optimization processing as target videos.
In some embodiments of the present application, based on the foregoing scheme, the second determining module is configured to: delete videos to be selected corresponding to a piece of target tag information if their number is greater than a first predetermined number; delete the videos to be selected corresponding to a piece of target tag information if the number of publishers corresponding to it is smaller than a second predetermined number; and delete videos to be selected published by a single publisher if, among the videos to be selected corresponding to a piece of target tag information, the number published by that publisher is greater than a third predetermined number.
In some embodiments of the present application, based on the foregoing, the third determining module is configured to: according to different types of video information corresponding to the target video, acquiring to-be-processed characteristic vectors corresponding to the target video and the video information of each type respectively; and generating a target feature vector corresponding to the target video according to the feature vector to be processed corresponding to the target video.
In some embodiments of the present application, based on the foregoing, the third determining module is configured to: and splicing the feature vectors to be processed to generate a target feature vector corresponding to the target video.
In some embodiments of the present application, based on the foregoing, the third determining module is configured to: acquiring vector weights corresponding to the feature vectors to be processed; and generating a target feature vector corresponding to the target video according to each feature vector to be processed and the vector weight corresponding to each feature vector to be processed.
In some embodiments of the present application, based on the foregoing, the third determining module is configured to: acquiring element weights corresponding to elements contained in the feature vectors to be processed; and generating a target feature vector corresponding to the target video according to each feature vector to be processed and the element weight corresponding to each element contained in each feature vector to be processed.
In some embodiments of the present application, based on the foregoing scheme, the video information includes image information, audio information, and text information; the third determination module is configured to: according to the image information of the target video, obtaining an image characteristic vector corresponding to the image information by adopting a local aggregation vector algorithm; acquiring a Mel frequency cepstrum coefficient corresponding to the audio information according to the audio information of the target video; obtaining an audio feature vector corresponding to the target video according to the Mel frequency cepstrum coefficient; and obtaining a character feature vector corresponding to the target video according to the image information, the audio information and the character information of the target video.
In some embodiments of the present application, based on the foregoing, the third determining module is configured to: performing image recognition according to image information corresponding to the target video to acquire first text information to be processed contained in the image information; performing voice recognition according to the audio information corresponding to the target video to acquire second text information to be processed corresponding to the audio information; and generating a character feature vector corresponding to the target video according to the first text information to be processed, the second text information to be processed and the character information of the target video.
In some embodiments of the application, based on the foregoing scheme, before the tag information corresponding to each video to be selected is acquired, the acquisition module is further configured to: acquire, once every predetermined cycle, the videos published within a predetermined time period; filter the published videos according to a predetermined rule; and identify the filtered published videos as videos to be selected.
According to an aspect of an embodiment of the present application, there is provided a video push method, including:
acquiring a video to be pushed;
inputting the video to be pushed into a video classification model to obtain category information, output by the video classification model, corresponding to the video to be pushed, where the training data of the video classification model is acquired by the method described in the above embodiments;
determining a target push object corresponding to the video to be pushed according to the category information corresponding to the video to be pushed;
and pushing the video to be pushed to the target pushing object.
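As an illustrative, non-limiting sketch of the push flow above, the following Python fragment wires the four steps together; the classify, users_interested_in, and push functions are hypothetical stand-ins for the platform's own classifier and push infrastructure, not components defined by this application.

    from typing import List

    def classify(video_features: List[float]) -> str:
        """Stand-in for the trained video classification model's inference call."""
        return "food" if sum(video_features) > 0.5 else "other"

    def users_interested_in(category: str) -> List[str]:
        """Stand-in lookup from category information to target push objects."""
        interest_index = {"food": ["user_a", "user_b"], "other": ["user_c"]}
        return interest_index.get(category, [])

    def push_pending_video(video_features: List[float]) -> None:
        # Steps 1-2: obtain the video to be pushed and its category information.
        category = classify(video_features)
        # Step 3: determine the target push objects corresponding to that category.
        for user in users_interested_in(category):
            # Step 4: push the video to each target push object.
            print(f"pushing a '{category}' video to {user}")

    push_pending_video([0.2, 0.5, 0.1])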
According to an aspect of embodiments of the present application, there is provided a computer-readable medium, on which a computer program is stored, which, when executed by a processor, implements the method as described in the above embodiments.
According to an aspect of an embodiment of the present application, there is provided an electronic device including: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method as described in the embodiments above.
In the technical solutions provided by some embodiments of the present application, tag information corresponding to each video to be selected is acquired, the tag information having been edited by the publisher of the video to be selected; target tag information is determined from the tag information according to quantity information about the tag information; target videos are determined from the videos to be selected corresponding to the target tag information according to quantity information about those videos; video features corresponding to each target video are determined from the video information of that video, the video information including at least two of image information, audio information and text information; and target training data corresponding to each target video is generated from the video features and tag information of that video. Supervised target training data is thus generated from the tag information edited by the publishers of the videos to be selected and the video features of those videos, which avoids the labor of manually labeling videos, improves the efficiency of acquiring the target training data, and ensures its quality.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 shows a schematic diagram of an exemplary system architecture to which aspects of embodiments of the present application may be applied;
FIG. 2 shows a flow diagram of a method of acquiring training data according to an embodiment of the present application;
FIG. 3 shows a schematic flow chart of step S220 of the training data acquisition method of FIG. 2 according to an embodiment of the present application;
FIG. 4 shows a schematic flow chart of step S230 of the training data acquisition method of FIG. 2 according to an embodiment of the present application;
FIG. 5 shows a schematic flow chart of step S240 of the training data acquisition method of FIG. 2 according to an embodiment of the present application;
FIG. 6 shows a schematic flow chart of step S520 in the training data acquisition method of FIG. 5 according to another embodiment of the present application;
FIG. 7 shows a schematic flow chart of step S520 in the training data acquisition method of FIG. 5 according to yet another embodiment of the present application;
FIG. 8 shows a schematic flow chart of step S510 of the training data acquisition method of FIG. 5 according to an embodiment of the present application;
FIG. 9 shows a flowchart of step S840 of the training data acquisition method of FIG. 8 according to an embodiment of the present application;
FIG. 10 shows a schematic flowchart of the process of acquiring videos to be selected that is further included in the training data acquisition method of FIG. 2 according to an embodiment of the present application;
FIG. 11 shows a block flow diagram of a method of training a video classification model according to an embodiment of the present application;
FIG. 12 shows a block diagram of an apparatus for obtaining training data according to an embodiment of the present application;
FIG. 13 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which the technical solution of the embodiments of the present application can be applied.
As shown in fig. 1, the system architecture may include a terminal device 101, a network 102, and a server 103, the network 102 serving as a medium to provide a communication link between the terminal device 101 and the server 103. Network 102 may include various connection types, such as wired communication links, wireless communication links, and so forth.
The terminal device 101 may be one or more of a smartphone, a tablet computer, a laptop computer, and a desktop computer. It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. Such as a server cluster where the server 103 may be comprised of multiple servers, etc.
A user may use the terminal device 101 to interact with the server 103 over the network 102, for example to receive or send messages. The server 103 may be a server that provides various services. For example, a user may publish a video created by the user to the server 103 through the terminal device 101. The server 103 may screen the videos published by users: it acquires the tag information corresponding to each video to be selected, the tag information being edited by the publisher of the video to be selected; determines target tag information from the tag information according to quantity information about the tag information; determines target videos from the videos to be selected corresponding to the target tag information according to quantity information about those videos; determines video features corresponding to each target video according to the video information of that video, the video information including at least two of image information, audio information and text information; and finally generates target training data corresponding to each target video according to the video features and tag information of that video.
It should be noted that the method for acquiring training data provided in the embodiment of the present application is generally executed by the server 103, and accordingly, the apparatus for acquiring training data is generally disposed in the server 103. However, in other embodiments of the present application, the terminal device 101 may also have a similar function as the server 103, so as to execute the scheme of the method for acquiring training data provided in the embodiments of the present application.
The implementation details of the technical solution of the embodiment of the present application are set forth in detail below:
fig. 2 shows a schematic flow diagram of a method of acquiring training data according to an embodiment of the present application. Referring to fig. 2, the method for acquiring training data at least includes steps S210 to S250, which are described in detail as follows:
in step S210, tag information corresponding to each video to be selected is obtained, and the tag information is obtained by editing by a publisher of the video to be selected.
The video to be selected can be a video published on a video platform or the internet by a video publisher.
The tag information may be information for characterizing content included in the video, and when a publisher publishes the video, the publisher may manually edit the tag information corresponding to the video to characterize the content included in the video. For example, when a publisher issues a video including food, tag information of the video may be edited as "food", "dinner", or the like. Therefore, a viewer or a server of the video can intuitively know the content contained in the video through the label information corresponding to the video.
It should be noted that a single video may correspond to one tag information, or may correspond to multiple tag information, for example, any number of two or more, and the tag information corresponding to different videos may be the same or different, and this is not particularly limited in this application. Of course, if the publisher does not edit the video-corresponding tag information, the publisher may also publish the video.
In an exemplary embodiment of the present application, a publisher may edit tag information corresponding to a video when publishing the video, a server may store the tag information and the video in a corresponding manner, and when playing the video, may display the tag information corresponding to the video in a designated area for a user to obtain. For example, the display may be performed in an area directly below or below the left of the video playing interface, and so on.
The server can take a video published on a video platform or the internet as a video to be selected and acquire tag information corresponding to the video to be selected. It should be understood that, since the tag information is obtained by editing by a publisher of the video, the tag information can accurately represent the content contained in the video, and the video does not need to be manually labeled.
In step S220, target tag information is determined from the tag information according to the quantity information of the tag information.
The target tag information may be tag information with a high frequency of occurrence. It should be understood that a higher frequency of occurrence indicates that the content corresponding to the target tag information is more popular with the public and is therefore more meaningful as training data. This avoids generating training data from videos whose content is too niche, and helps ensure the quality of the training data.
In an exemplary embodiment of the application, the server may count the occurrence frequency of each piece of tag information according to the tag information corresponding to each video to be selected, and identify the target tag information according to the occurrence frequency of each piece of tag information, so as to obtain the target tag information that is more popular with the public.
In step S230, according to the quantity information of the videos to be selected corresponding to the target tag information, a target video is determined from the videos to be selected corresponding to the target tag information.
In an exemplary embodiment of the application, the server may count the number of videos to be selected corresponding to each piece of target tag information and determine the target videos from those videos according to that count. For example, if the number of videos to be selected corresponding to a piece of target tag information is too large (say, greater than 2000), the server may extract only a portion of them to serve as target videos, so that a single piece of target tag information does not contribute too many target videos and skew the composition of the training data; if the number is small (say, 500 or fewer), all of the videos to be selected corresponding to that target tag information can be identified as target videos, so that a single piece of target tag information is not under-represented in the training data.
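As an illustrative sketch of this per-tag selection (the 2000/500 figures come from the example above; the random down-sampling and the treatment of counts between those two figures are assumptions made for the sketch, not requirements of this application):

    import random
    from typing import Dict, List

    def select_target_videos(videos_by_tag: Dict[str, List[str]],
                             upper: int = 2000) -> Dict[str, List[str]]:
        """For each piece of target tag information, keep all of its videos to be selected
        when there are few of them, or extract only a part when there are too many."""
        target_videos = {}
        for tag, videos in videos_by_tag.items():
            if len(videos) > upper:
                # Too many videos for one tag: extract a part so the tag does not dominate.
                target_videos[tag] = random.sample(videos, upper)
            else:
                # 500 or fewer (and anything up to the cap): identify them all as target videos.
                target_videos[tag] = list(videos)
        return target_videos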
In step S240, according to video information corresponding to the target video, video feature data corresponding to the target video is determined, where the video information includes at least two of image information, audio information, and text information.
The video information may be content contained in the target video, and the video information may contain multiple categories of information, for example, the video information may contain image, audio, and text (such as subtitles) categories.
The video feature data is used to characterize the content contained in the target video; generating the video feature data from the video information corresponding to the target video allows the features of the target video to be expressed accurately. It should be noted that each target video corresponds to one piece of video feature data, and the video feature data of different target videos may be the same or different.
In an exemplary embodiment of the present application, the server may generate video feature data corresponding to video information of each target video based on the video information. In one example, the server may represent the video feature data in a vector form to facilitate subsequent processing. Specifically, the server may generate feature vectors corresponding to the multiple categories of video information based on the multiple categories of video information corresponding to the target video, integrate the multiple feature vectors to obtain the video feature vector corresponding to the target video, and use the video feature vector as the video feature data of the target video.
In step S250, target training data corresponding to each target video is generated according to the video feature data and the label information corresponding to each target video.
In an exemplary embodiment of the present application, the server may perform associated storage on the video feature data of the target video obtained by the calculation and the tag information corresponding to the target video to obtain target training data corresponding to the target video.
When the video classification model is subsequently trained, the video feature data in the target training data can be used as input so that the video classification model outputs corresponding classification information; the classification information is compared with the tag information associated with that video feature data, and the parameters of the video classification model are adjusted according to the comparison result, thereby ensuring the accuracy of the video classification model.
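As a rough, non-limiting illustration of this training procedure, the following PyTorch-style sketch feeds video feature vectors into a classifier and adjusts its parameters against the associated tag information; the architecture, dimensions, and optimizer settings are assumptions made for the example and are not specified by this application.

    import torch
    import torch.nn as nn

    # Assumed shapes: 1024-dimensional fused video feature vectors and 200 target tags.
    model = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 200))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def train_step(features: torch.Tensor, tag_ids: torch.Tensor) -> float:
        """One parameter-adjustment step: compare model output with the associated tag labels."""
        logits = model(features)           # classification information output by the model
        loss = loss_fn(logits, tag_ids)    # comparison with the associated tag information
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                   # parameter adjustment according to the comparison result
        return loss.item()

    # Example batch: 8 target videos, each with a fused feature vector and a tag index.
    print(train_step(torch.randn(8, 1024), torch.randint(0, 200, (8,))))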
It should be noted that the video classification model may be a model that is pre-constructed by those skilled in the art to classify the content contained in the video. Through the setting of the video classification model, the video classification efficiency of a video platform or a website can be improved, and the purpose of accurate pushing is achieved.
In the embodiment shown in fig. 2, tag information corresponding to each video to be selected is acquired, the tag information having been edited by the publisher of the video to be selected; target tag information is determined from the tag information according to quantity information about the tag information; target videos are determined from the videos to be selected corresponding to the target tag information according to quantity information about those videos; video feature data corresponding to each target video is determined according to its video information; and target training data corresponding to each target video is generated according to the video feature data and tag information of that video. Supervised target training data can thus be generated without manual labeling, which improves the efficiency of acquiring the target training data and ensures its quality.
Based on the embodiment shown in fig. 2, fig. 3 is a schematic flowchart of step S220 in the training data obtaining method of fig. 2 according to an embodiment of the present application. Referring to fig. 3, step S220 at least includes steps S310 to S320, which are described in detail as follows:
in step S310, according to the tag information, the number of occurrences corresponding to each tag information is counted.
In an exemplary embodiment of the application, the server may traverse the tag information corresponding to the videos to be selected and count the number of occurrences of each piece of tag information. Specifically, the occurrence count of each piece of tag information is initialized to 0 in advance; each time a piece of tag information is encountered while traversing the tag information of the videos to be selected, its occurrence count is incremented by one, so that the number of occurrences of each piece of tag information is obtained when the traversal finishes.
In step S320, target tag information is identified from the tag information according to the number of occurrences corresponding to each tag information.
In an exemplary embodiment of the present application, the server may compare the number of occurrences of each piece of tag information with a predetermined threshold; for example, with a threshold of 600, a piece of tag information is identified as target tag information if its number of occurrences is greater than or equal to 600.
In another exemplary embodiment of the present application, the server may sort the tag information in descending order of occurrence count to obtain a tag information sequence. In one example, the server may select the tag information ranked before a predetermined position in the sequence as the target tag information; for example, with a predetermined position of 200, the server selects the top 200 pieces of tag information. In another example, the server may select the tag information ranked within a predetermined proportion of the sequence as the target tag information; for example, with a proportion of 30%, the server selects the top 30% of the tag information sequence. Those skilled in the art can choose the target tag information selection strategy according to actual implementation requirements, and this application does not specifically limit it.
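As an illustrative sketch of the counting and selection logic (the threshold of 600 and the top-200 alternative come from the examples above; the input format is an assumption made for the sketch):

    from collections import Counter
    from typing import Iterable, List

    def select_target_tags(tags_per_video: Iterable[List[str]],
                           min_count: int = 600) -> List[str]:
        """Count how often each publisher-edited tag occurs across the videos to be selected,
        then keep the pieces of tag information whose occurrence count reaches the threshold."""
        counts = Counter(tag for tags in tags_per_video for tag in tags)
        # Alternatives: counts.most_common(200) for a top-200 cutoff, or slicing the
        # sorted sequence at a fixed proportion such as the top 30%.
        return [tag for tag, n in counts.items() if n >= min_count]

    # Example: three videos to be selected, each with publisher-edited tag information.
    print(select_target_tags([["food", "dinner"], ["food"], ["travel"]], min_count=2))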
Therefore, in the embodiment shown in fig. 3, the target tag information is determined from the tag information according to the occurrence frequency of the tag information, so that the audience range of the target tag information is ensured, and the practicability of subsequent training data is further ensured.
Based on the embodiment shown in fig. 2, fig. 4 is a schematic flowchart of step S230 in the training data obtaining method of fig. 2 according to an embodiment of the present application. Referring to fig. 4, step S230 at least includes steps S410 to S430, which are described in detail as follows:
in step S410, according to the video to be selected corresponding to each target tag information, the number of the video to be selected corresponding to each target tag information and the number of the publishers are determined.
In an exemplary embodiment of the application, the server may determine, according to the videos to be selected corresponding to each piece of target tag information, the number of those videos and the number of publishers corresponding to that target tag information. It should be understood that each publisher is counted only once per piece of target tag information, so the publisher count of a piece of target tag information reflects how many different publishers have published videos carrying that tag.
In step S420, according to the number of the videos to be selected corresponding to each target tag information and the number of the publishers, performing number optimization processing on the videos to be selected corresponding to each target tag information.
The quantity optimization processing may be a procedure that balances the number of videos to be selected corresponding to each piece of target tag information. It avoids extreme situations, such as a piece of target tag information having too many or too few videos to be selected, or a single publisher contributing too many of the videos to be selected for a piece of target tag information.
In an exemplary embodiment of the application, the server may perform quantity optimization processing on the videos to be selected corresponding to each piece of target tag information according to the number of those videos and the number of their publishers. For example, when the number of videos to be selected corresponding to a piece of target tag information is too large, some of them are removed; and if a single publisher has contributed too many of the videos to be selected for a piece of target tag information, some of that publisher's videos are removed, so that one publisher's videos do not dominate a tag and narrow the applicability of the target training data.
In step S430, the videos to be selected that remain for each piece of target tag information after the quantity optimization processing are identified as target videos.
In this step, the server may identify, as target videos, the videos to be selected that remain for each piece of target tag information after the quantity optimization processing.
Therefore, in the embodiment shown in fig. 4, the videos to be selected corresponding to each piece of target tag information are quantity-optimized according to the number of those videos and the number of their publishers, which balances the composition of the training data across pieces of target tag information and thereby ensures the practicability and the quality of the target training data.
Based on the embodiments shown in fig. 2 and fig. 4, in an exemplary embodiment of the present application, the performing quantity optimization processing on the to-be-selected video corresponding to each target tag information according to the quantity of the to-be-selected video corresponding to each target tag information and the quantity of publishers includes:
if the number of the videos to be selected corresponding to the target label information is larger than a first preset number, deleting the videos to be selected corresponding to the target label information;
if the number of the publishers corresponding to the target tag information is smaller than a second preset number, deleting the video to be selected corresponding to the target tag information;
and if the number of the videos to be selected issued by a single publisher in the videos to be selected corresponding to the target tag information is larger than a third preset number, deleting the videos to be selected issued by the single publisher.
In this embodiment, the server may compare the number of videos to be selected corresponding to each piece of target tag information with a first predetermined number, which is a threshold for the upper limit of that count and may be set by those skilled in the art according to prior experience. If the number of videos to be selected corresponding to a piece of target tag information is greater than the first predetermined number, there are too many of them, so some are deleted until the count is less than or equal to the first predetermined number.
In one example, when performing the deletion, the server may randomly select videos to be selected corresponding to the target tag information and delete them; in another example, the server may delete the videos with the earliest publication times first, in order of publication time, so that the remaining videos to be selected better reflect current trends.
The server may further compare the number of publishers corresponding to a piece of target tag information with a preset second predetermined number. If the publisher count is smaller than the second predetermined number, too few publishers (for example, only one or two) correspond to that target tag information, indicating that the content it describes is not accepted by most of the public and has only a very small audience. Therefore, to keep the target training data reasonable, the videos to be selected corresponding to that target tag information are deleted.
The server may also count, among the videos to be selected corresponding to each piece of target tag information, the number published by each publisher and compare it with a preset third predetermined number. If that count is greater than the third predetermined number, a single publisher has contributed too many videos for that target tag information, so some of that publisher's videos are deleted until the number is less than or equal to the third predetermined number. The deletion can use the same methods described above, which are not repeated here.
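The three checks above can be sketched as a simple filter over (video, publisher) pairs; the concrete values of the first, second and third predetermined numbers below are illustrative, as is the random down-sampling used for the deletion:

    import random
    from collections import Counter
    from typing import Dict, List, Tuple

    def optimize_quantity(videos_by_tag: Dict[str, List[Tuple[str, str]]],
                          first_n: int = 2000, second_n: int = 3,
                          third_n: int = 100) -> Dict[str, List[Tuple[str, str]]]:
        """Each video is a (video_id, publisher_id) pair grouped by target tag information."""
        optimized = {}
        for tag, videos in videos_by_tag.items():
            publishers = {pub for _, pub in videos}
            # Second check: too few distinct publishers -> delete the tag's videos entirely.
            if len(publishers) < second_n:
                continue
            # Third check: cap how many videos any single publisher contributes.
            kept, per_pub = [], Counter()
            for vid, pub in videos:
                if per_pub[pub] < third_n:
                    kept.append((vid, pub))
                    per_pub[pub] += 1
            # First check: cap the total number of videos for this tag.
            if len(kept) > first_n:
                kept = random.sample(kept, first_n)
            optimized[tag] = kept
        return optimized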
Based on the embodiment shown in fig. 2, fig. 5 is a schematic flowchart of step S240 in the training data obtaining method of fig. 2 according to an embodiment of the present application. Referring to fig. 5, step S240 at least includes steps S510 to S520, which are described in detail as follows:
in step S510, to-be-processed feature vectors corresponding to the target video and the video information of each category are obtained according to the video information of different categories corresponding to the target video.
In an exemplary embodiment of the present application, the video information may include image information, audio information and text information. The server may generate a feature vector to be processed for each category of video information: for example, it may perform feature extraction on the image information to generate an image feature vector, perform feature extraction on the audio information to generate an audio feature vector, and so on. The dimensions of the feature vectors can be preset by those skilled in the art to facilitate the subsequent fusion or splicing of the different feature vectors to be processed.
In step S520, a target feature vector corresponding to the target video is generated according to the to-be-processed feature vector corresponding to the target video.
In an exemplary embodiment of the present application, the server may integrate, for example, splice or fuse, multiple feature vectors to be processed corresponding to the target video, so as to generate a target feature vector corresponding to the target video. It should be understood that the target feature vector includes video features of video information of each category corresponding to the target video, so that the content included in the target video can be accurately represented, and the accuracy of a subsequent training result is ensured.
Based on the embodiments shown in fig. 2 and fig. 5, in an exemplary embodiment of the present application, the generating a target feature vector corresponding to the target video according to the to-be-processed feature vector corresponding to the target video includes:
and splicing the feature vectors to be processed to generate a target feature vector corresponding to the target video.
In this embodiment, the server may splice the feature vectors to be processed in a preset splicing order to generate the target feature vector corresponding to the target video. For example, suppose the target video corresponds to an image feature vector, an audio feature vector and a text feature vector, and the splicing order is text, image, audio: the server places the elements of the text feature vector first, the elements of the image feature vector in the middle, and the elements of the audio feature vector last, thereby obtaining the target feature vector corresponding to the target video.
It should be understood that the target feature vector is generated in a direct splicing manner, so that the generation efficiency of the target feature vector can be improved, and the acquisition efficiency of the training data is further improved.
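As a minimal sketch of this splicing, assuming NumPy vectors and the text-image-audio order used in the example above (the dimensions are illustrative):

    import numpy as np

    def splice_features(text_vec: np.ndarray, image_vec: np.ndarray,
                        audio_vec: np.ndarray) -> np.ndarray:
        """Concatenate the per-category feature vectors in a preset order: text, image, audio."""
        return np.concatenate([text_vec, image_vec, audio_vec])

    # Illustrative dimensions: 128-dim text, 256-dim image, 13-dim audio feature vectors.
    target_vec = splice_features(np.random.rand(128), np.random.rand(256), np.random.rand(13))
    print(target_vec.shape)  # (397,)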
Based on the embodiments shown in fig. 2 and fig. 5, fig. 6 shows a flowchart of step S520 in the training data obtaining method of fig. 5 according to another embodiment of the present application. Referring to fig. 6, step S520 at least includes steps S610 to S620, and is described in detail as follows:
in step S610, a vector weight corresponding to each of the feature vectors to be processed is obtained.
In an exemplary embodiment of the present application, a person skilled in the art may set in advance a vector weight corresponding to a feature vector of each category of video information according to prior experience. It should be understood that the vector weights corresponding to the feature vectors of different classes may be the same or different, and this is not particularly limited in this application. For example, the vector weight corresponding to the image feature vector is 0.3, the vector weight corresponding to the audio feature vector is 0.3, and the vector weight corresponding to the text feature vector is 0.4.
In step S620, a target feature vector corresponding to the target video is generated according to each of the feature vectors to be processed and a vector weight corresponding to each of the feature vectors to be processed.
In an exemplary embodiment of the present application, the server may multiply the elements of each feature vector to be processed by the vector weight of that vector, and then add the weighted feature vectors element-wise, thereby computing a weighted sum of the feature vectors to be processed and obtaining the target feature vector corresponding to the target video. It should be understood that if there are two categories of video information, the weighted sum is taken over two feature vectors to be processed; if there are three or more categories, it is taken over three or more feature vectors to be processed.
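A sketch of the vector-weighted fusion, assuming the feature vectors to be processed share one preset dimension (as the preceding section allows) and using the illustrative 0.3/0.3/0.4 weights from the earlier example:

    import numpy as np
    from typing import List

    def weighted_fuse(vectors: List[np.ndarray], weights: List[float]) -> np.ndarray:
        """Element-wise weighted sum of same-dimension feature vectors to be processed."""
        return sum(w * v for v, w in zip(vectors, weights))

    # Image, audio and text feature vectors of a common dimension, e.g. 256.
    image_v, audio_v, text_v = (np.random.rand(256) for _ in range(3))
    target_vec = weighted_fuse([image_v, audio_v, text_v], [0.3, 0.3, 0.4])
    print(target_vec.shape)  # (256,)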
Based on the embodiments shown in fig. 2 and fig. 5, fig. 7 shows a flowchart of step S520 in the training data obtaining method of fig. 5 according to another embodiment of the present application. Referring to fig. 7, step S520 at least includes steps S710 to S720, which are described in detail as follows:
in step S710, an element weight corresponding to each element included in each of the to-be-processed feature vectors is obtained.
In an exemplary embodiment of the present application, those skilled in the art may preset, according to prior experience, the element weight corresponding to each element of each feature vector to be processed and store the weights for the server to read. It should be noted that the element weights of different elements within the same feature vector to be processed may be the same or different.
In step S720, a target feature vector corresponding to the target video is generated according to each of the to-be-processed feature vectors and the element weights corresponding to the elements included in each of the to-be-processed feature vectors.
In an exemplary embodiment of the present application, the server may multiply each element of each feature vector to be processed by the element weight corresponding to that element, and then add the weighted feature vectors element-wise to obtain the target feature vector corresponding to the target video.
In the embodiment shown in fig. 7, by setting the element weights, the importance degree of each element in the feature vector to be processed can be accurately represented, thereby ensuring the accuracy of the generated target feature vector.
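The element-level variant differs only in that each feature vector to be processed carries a weight vector rather than a single scalar; a sketch under the same shared-dimension assumption:

    import numpy as np
    from typing import List

    def element_weighted_fuse(vectors: List[np.ndarray],
                              element_weights: List[np.ndarray]) -> np.ndarray:
        """Weight every element individually, then add the weighted vectors element-wise."""
        return sum(w * v for v, w in zip(vectors, element_weights))

    dim = 256
    image_v, audio_v, text_v = (np.random.rand(dim) for _ in range(3))
    # Per-element weights for each vector (preset from prior experience in practice; random here).
    weights = [np.random.rand(dim) for _ in range(3)]
    print(element_weighted_fuse([image_v, audio_v, text_v], weights).shape)  # (256,)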
Based on the embodiments shown in fig. 2 and fig. 5, fig. 8 is a flowchart illustrating step S510 in the training data obtaining method of fig. 5 according to an embodiment of the present application. Referring to fig. 8, the video information includes image information, audio information and text information, and step S510 at least includes steps S810 to S840, which are described in detail as follows:
in step S810, according to the image information of the target video, an image feature vector corresponding to the image information is obtained by using a local aggregation vector algorithm.
In an exemplary embodiment of the present application, the server may split the target video into frames to obtain the video frames corresponding to the target video; for example, the server may take one video frame per second. The server may use the obtained video frames as the image information of the target video. The server can extract features from the video frames with a pre-constructed convolutional neural network and obtain the image feature vector corresponding to the image information with a Vector of Locally Aggregated Descriptors (VLAD) algorithm. It should be noted that those skilled in the art may also adopt other existing image feature extraction methods, and this application does not limit this.
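A compact sketch of the VLAD aggregation step, assuming per-frame descriptors have already been extracted by a convolutional neural network and that a codebook of cluster centres has been learned offline (the descriptor dimension and number of centres are illustrative):

    import numpy as np

    def vlad_encode(descriptors: np.ndarray, centers: np.ndarray) -> np.ndarray:
        """Aggregate per-frame descriptors (N x D) against K cluster centers (K x D) into a VLAD vector."""
        # Assign each descriptor to its nearest cluster center.
        dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=-1)
        assignments = dists.argmin(axis=1)
        # Accumulate the residuals per center, then flatten and L2-normalize.
        vlad = np.zeros_like(centers)
        for k in range(centers.shape[0]):
            assigned = descriptors[assignments == k]
            if len(assigned):
                vlad[k] = (assigned - centers[k]).sum(axis=0)
        vlad = vlad.ravel()
        return vlad / (np.linalg.norm(vlad) + 1e-12)

    # 30 one-per-second frame descriptors of dimension 512 and a 16-center codebook.
    image_feature = vlad_encode(np.random.rand(30, 512), np.random.rand(16, 512))
    print(image_feature.shape)  # (8192,)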
In step S820, according to the audio information of the target video, a mel-frequency cepstrum coefficient corresponding to the audio information is obtained.
The mel-frequency cepstrum is a linear transformation of the log energy spectrum based on the nonlinear mel scale of sound frequency. Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up the mel-frequency cepstrum; they are derived from the cepstrum of an audio segment. The mel-frequency cepstrum differs from the ordinary cepstrum in that its frequency bands are equally spaced on the mel scale, which approximates the human auditory system more closely than the linearly spaced bands of the ordinary log cepstrum. This nonlinear representation can describe the sound signal better in a number of domains.
Specifically, the method for obtaining the mel-frequency cepstrum coefficient corresponding to the audio information mainly comprises the following steps:
(1) a speech signal is decomposed into a plurality of frames.
(2) The speech signal is pre-emphasized and passed through a high pass filter.
(3) And performing Fourier transform to transform the signal into a frequency domain.
(4) The obtained spectrum for each frame is passed through a mel filter (triangular overlapping window) to obtain a mel scale.
(5) Logarithmic energy is extracted on each mel-scale.
(6) And performing inverse discrete Fourier transform on the obtained result, and transforming the result into an inverse spectrum domain.
(7) The MFCCs are the amplitudes of the resulting cepstrum. Typically 12 coefficients are used and, together with the frame energy, give a 13-dimensional coefficient vector.
It will be appreciated that the speech signal is assumed to be approximately stationary over a short time span; this application takes that span to be 100 ms, which ensures that a frame contains enough periods without changing too drastically. Each frame is usually multiplied by a smooth window function so that both ends of the frame decay smoothly to zero, which reduces the intensity of side lobes after the Fourier transform and yields a higher-quality spectrum; the width of the window function equals the frame length. Because the characteristics of a signal are usually hard to observe in the time domain, it is transformed into an energy distribution in the frequency domain, where different energy distributions represent the characteristics of different sounds. After multiplication by the window function, a fast Fourier transform is performed on each frame to obtain its energy distribution over the spectrum. Mel-frequency cepstral coefficients take human auditory characteristics into account: the linear spectrum is first mapped onto a mel nonlinear spectrum based on auditory perception and then converted to the cepstrum.
In step S830, an audio feature vector corresponding to the target video is obtained according to the mel-frequency cepstrum coefficient.
In this embodiment, the server may use the resulting 13-dimensional mel-frequency cepstrum coefficients as the audio feature vector corresponding to the target video.
In step S840, a text feature vector corresponding to the target video is obtained according to the image information, the audio information, and the text information of the target video.
In an exemplary embodiment of the present application, it can be understood that both the image information and the audio information may include text information, and the server may obtain text content corresponding to the target video according to the image information, the audio information, and the text information of the target video, so as to generate a text feature vector corresponding to the target video, and ensure the comprehensiveness of the generated text feature vector.
In the embodiment shown in fig. 8, the accuracy of the feature data of the target video is ensured by generating the image feature vector, the audio feature vector and the character feature vector corresponding to the image information, the audio information and the character information, so that the training effect of the subsequent video classification model is improved.
Based on the embodiments shown in fig. 2, fig. 5 and fig. 8, fig. 9 is a flowchart illustrating step S840 of the training data obtaining method of fig. 8 according to an embodiment of the present application. Referring to fig. 9, step S840 includes at least steps S910 to S930, and the detailed description is as follows:
in step S910, image recognition is performed according to image information corresponding to the target video, and first text information to be processed included in the image information is acquired.
In an exemplary embodiment of the present application, the server may perform image Recognition (e.g., Optical Character Recognition (OCR), etc.) on image information corresponding to the target video, so as to recognize text content included in the image information, and use the text content as the first text information to be processed.
In step S920, performing voice recognition according to the audio information corresponding to the target video, and acquiring second text information to be processed corresponding to the audio information.
In an exemplary embodiment of the present application, the server may perform speech recognition on the audio information corresponding to the target video by using an existing speech recognition technology, so as to obtain the text content contained in the audio information, for example, the spoken lines in the audio, and use this text content as the second text information to be processed.
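A hedged sketch of steps S910 and S920 follows; pytesseract is used as an example OCR engine, the frame sampler from the earlier sketch is reused, and `transcribe_audio` is a placeholder for whatever speech recognition service is actually deployed (it is not an API defined by this application).

```python
import pytesseract
from PIL import Image

def ocr_frames(frames):
    """Step S910: first text information to be processed, recognised from the video frames."""
    texts = [pytesseract.image_to_string(Image.fromarray(f)) for f in frames]
    return " ".join(t.strip() for t in texts if t.strip())

def transcribe_audio(audio_path):
    """Step S920: placeholder for the speech recognition service used in production."""
    return ""  # replace with a real ASR call

first_text = ocr_frames(sample_frames("target_video.mp4"))  # frame sampler from the earlier sketch
second_text = transcribe_audio("target_audio.wav")
```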
In step S930, a text feature vector corresponding to the target video is generated according to the first text information to be processed, the second text information to be processed, and the text information of the target video.
In an exemplary embodiment of the application, the server may integrate the acquired first text information to be processed, the acquired second text information to be processed, and the text information of the target video to obtain a complete text description of the target video. The server can then extract features from the complete text description using an existing text feature extraction method, so as to generate the text feature vector corresponding to the target video. In one example, the server may perform feature extraction on the complete text description using a bidirectional LSTM, which can better capture the contextual structure of the text, completes feature extraction along the temporal (sequence) dimension, and ensures the accuracy of the generated text feature vector.
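The following is a minimal sketch of such a bidirectional LSTM text encoder; the vocabulary size, embedding and hidden dimensions, and the stand-in tokenisation are illustrative assumptions rather than parameters fixed by this application.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encode the complete text description with a bidirectional LSTM."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                    # token_ids: (batch, seq_len)
        x = self.embed(token_ids)
        _, (h_n, _) = self.bilstm(x)                 # h_n: (2, batch, hidden_dim)
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2 * hidden_dim)

# The complete text description merges the OCR text, the ASR text and the
# publisher-edited text information; a `tokenize` function mapping it to
# integer ids is assumed and stubbed with random ids here.
encoder = TextEncoder(vocab_size=30000)
token_ids = torch.randint(0, 30000, (1, 64))         # stands in for tokenize(full_text)
text_feature_vector = encoder(token_ids)              # shape: (1, 256)
```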
Based on the embodiment shown in fig. 2, fig. 10 is a schematic flowchart illustrating a process of acquiring a candidate video further included in the training data acquiring method of fig. 2 according to an embodiment of the present application. Referring to fig. 10, acquiring the video to be selected at least includes steps S1010 to S1030, which are described in detail as follows:
in step S1010, published videos within a predetermined time period are acquired every predetermined period.
The published video may be all videos published on a video platform or the internet. It should be appreciated that when a publisher publishes a video, the server may record the publication time of the published video and associate the publication time with the published video.
In an exemplary embodiment of the present application, the server may acquire the published videos within a predetermined time period once every predetermined period, where both the predetermined period and the predetermined time period may be preset by those skilled in the art according to actual implementation needs. For example, the predetermined period may be 24 hours and the predetermined time period 30 days, in which case the server acquires information about the videos published within the last 30 days every 24 hours, and so on.
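Under the example values above (a 24-hour period and a 30-day time window), the acquisition in step S1010 could look like the following sketch; the in-memory video store and the scheduling mechanism are assumptions for illustration.

```python
from datetime import datetime, timedelta

def collect_recent_videos(published_videos, window_days=30, now=None):
    """Return the videos whose recorded publication time falls within the time window."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=window_days)
    return [v for v in published_videos if v["publish_time"] >= cutoff]

# In production this would run on a 24-hour schedule (cron, a task-queue beat, etc.).
```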
In step S1020, the published videos are filtered according to a predetermined rule.
The predetermined rule may be rule information for determining security of the video, and may be preset by a person skilled in the art according to the security specification to remove the published video that does not meet the security specification.
In an exemplary embodiment of the present application, the server may determine, based on the video information of each published video, whether the video contains illegal content, for example, politically sensitive or pornographic content, illegal watermarks, or plagiarized (re-uploaded) material. In addition, to facilitate the acquisition of target training data, the server may determine whether a published video lacks tag information; if so, the published video is discarded, so as to ensure that supervised target training data can be generated.
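A hedged sketch of step S1020 follows. `is_violating` stands in for the platform's actual security auditing (politically sensitive or pornographic content, illegal watermarks, plagiarism), while the tag-information check is the concrete rule needed to keep the training data supervised; `collect_recent_videos` is reused from the previous sketch and `published_videos` is an assumed store.

```python
def is_violating(video):
    """Placeholder for the platform's security audit (politics / pornography / watermark / plagiarism)."""
    return False

def passes_predetermined_rules(video):
    if not video.get("tags"):       # discard published videos without tag information
        return False
    return not is_violating(video)

published_videos = []               # assumed store of published videos with their metadata
candidate_videos = [v for v in collect_recent_videos(published_videos)
                    if passes_predetermined_rules(v)]
```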
In step S1030, the post-filtering video is identified as a video to be selected.
In the embodiment shown in fig. 10, the validity of the determined video to be selected can be ensured by setting the predetermined rule, so that the quality of the subsequently generated target training data is ensured.
According to an embodiment of the present application, there is also provided a video push method, including:
acquiring video information of a video to be pushed, wherein the video information comprises at least two of image information, audio information and character information;
inputting the video information of the video to be pushed into a video classification model to obtain class information which is output by the video classification model and corresponds to the video to be pushed, wherein the training data of the video classification model is obtained by the method for obtaining the training data in the embodiment;
determining a target push object corresponding to the video to be pushed according to the category information corresponding to the video to be pushed;
and pushing the video to be pushed to the target pushing object.
In this embodiment, the video to be pushed may be a video that is newly released, and when the server detects that there is a video that is released by a publisher, the server may obtain video information of the video and use the video as the video to be pushed. And inputting the video information of the video to be pushed into a video classification model which is trained in advance so as to obtain the category information of the video to be pushed, which is output by the video classification model. The training data of the video classification model is obtained by the method for obtaining the training data, so that the efficiency of obtaining the training data is improved, and the quality of the training data is ensured.
The server can compare the category information of the video to be pushed with the user portrait of each user, determine the target push object corresponding to the video to be pushed according to the similarity between the category information and the user portraits, and then push the video to the target push object. In this way, even when a newly published video has not yet accumulated enough user behaviour (such as shares, likes, and comments), its classification can still be completed, thereby improving the accuracy of video pushing.
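One possible way to realise this matching, sketched under the assumption that the category information and each user portrait are vectors in a shared space and that a fixed similarity threshold is used (the threshold value is illustrative):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_push_targets(category_vector, user_portraits, threshold=0.8):
    """user_portraits: mapping of user id -> interest vector in the same space."""
    return [uid for uid, portrait in user_portraits.items()
            if cosine(category_vector, portrait) >= threshold]
```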
Based on the technical solution of the above embodiment, a specific application scenario of the embodiment of the present application is introduced as follows:
FIG. 11 shows a block flow diagram of a method of training a video classification model according to an embodiment of the present application.
As shown in 1110 in fig. 11, the server may obtain the video data published by video accounts within a predetermined time period (i.e., the published videos); 90 days is taken as an example below. After the videos published within 90 days are acquired, the server may perform security filtering according to the video information of the published videos, such as security audit filtering (checking for pornographic or politically sensitive content, etc.), illegal watermark filtering, plagiarized (re-uploaded) content filtering, and filtering of content without a topic. Here, a topic is tag information that the publisher edits when publishing a video.
As shown in 1120 in fig. 11, after the video data has been securely filtered, the occurrence frequency of each piece of tag information is counted according to the tag information of the remaining videos (i.e., the videos to be selected), so that target tag information is identified from the tag information. Then, according to the quantity information corresponding to the target tag information, such as the number of videos to be selected corresponding to the target tag information, the number of authors, and the number of videos to be selected published by a single author, quantity optimization is performed on the videos to be selected corresponding to the target tag information: a per-topic minimum author count limit, a per-topic maximum single-author contribution limit, per-topic maximum and minimum sample count limits, and a topic balance constraint (avoiding a situation where too many or too few videos to be selected correspond to a certain piece of target tag information) are applied, so as to obtain the target videos.
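The limits described above could be applied as in the following sketch; all threshold values, and the `target_tags`/`author_id` fields on each video record, are illustrative assumptions rather than values fixed by this application.

```python
from collections import defaultdict

def balance_topics(candidates, min_authors=5, min_samples=50,
                   max_samples=2000, max_per_author=20):
    by_tag = defaultdict(list)
    for video in candidates:
        for tag in video["target_tags"]:
            by_tag[tag].append(video)

    target_videos = []
    for tag, videos in by_tag.items():
        authors = {v["author_id"] for v in videos}
        if len(authors) < min_authors or len(videos) < min_samples:
            continue                          # per-topic minimum author / sample limits
        per_author = defaultdict(int)
        kept = []
        for v in videos:
            if per_author[v["author_id"]] >= max_per_author:
                continue                      # per-topic maximum single-author contribution
            per_author[v["author_id"]] += 1
            kept.append(v)
        target_videos.extend(kept[:max_samples])  # per-topic maximum sample limit
    return target_videos
```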
As shown in 1130 in fig. 11, according to the determined target video, video information corresponding to the target video is downloaded, and the voice recognition result, the character recognition result, the audio frequency spectrum feature, the video sampling frame, and the like of the video information are correspondingly obtained, so as to integrate the above information to obtain video feature data of the target video. And storing the video characteristic data to obtain target training data, wherein the video characteristic data is stored in a binary form in the embodiment.
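Storing the integrated video feature data in binary form could, for example, use NumPy's compressed container; the sketch below reuses the feature names from the earlier sketches, and the file name and label value are placeholders.

```python
import numpy as np

np.savez_compressed(
    "target_training_data.npz",                 # placeholder file name
    image=image_feature_vector,                 # from the VLAD sketch
    audio=audio_features,                       # from the MFCC sketch
    text=text_feature_vector.detach().numpy(),  # from the BiLSTM sketch
    label="target tag information",             # placeholder label
)
```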
As shown in 1140 in fig. 11, the server may obtain the stored target training data to update and train the video classification model in real time, or the server may identify a newly added hot topic based on the target training data, for example, identify according to an increase rate of a video corresponding to a certain topic.
Therefore, in the embodiment shown in fig. 11, supervised target training data is generated by acquiring the video to be selected with the tag information, manual labeling is not needed, the acquisition efficiency of the target training data is improved, and meanwhile, video feature data of the target video is generated based on video information of multiple categories, so that the quality of the target training data is ensured, and the training effect of a subsequent video classification model is ensured.
Embodiments of the apparatus of the present application are described below, which may be used to perform the method for acquiring training data in the above embodiments of the present application. For details that are not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method for acquiring training data described above in the present application.
Fig. 12 shows a block diagram of an apparatus for acquiring training data according to an embodiment of the present application.
Referring to fig. 12, an apparatus for acquiring training data according to an embodiment of the present application includes:
an obtaining module 1210, configured to obtain tag information corresponding to each video to be selected, where the tag information is obtained by editing by a publisher of the video to be selected;
a first determining module 1220, configured to determine target tag information from the tag information according to the quantity information of the tag information;
the second determining module 1230 is configured to determine a target video from the videos to be selected corresponding to the target tag information according to the quantity information of the videos to be selected corresponding to the target tag information;
a third determining module 1240, configured to determine, according to video information corresponding to the target video, video features corresponding to the target video, where the video information includes at least two of image information, audio information, and text information;
the processing module 1250 is configured to generate target training data corresponding to each target video according to the video features and the label information corresponding to each target video.
In some embodiments of the present application, based on the foregoing scheme, the first determining module 1220 is configured to: counting the occurrence frequency corresponding to each label information according to the label information; and identifying target label information from the label information according to the occurrence frequency corresponding to each label information.
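A minimal sketch of this frequency-based selection, with an illustrative frequency threshold:

```python
from collections import Counter

def determine_target_tags(candidate_videos, min_frequency=100):
    """Keep the tags whose occurrence frequency across candidate videos reaches the threshold."""
    counts = Counter(tag for video in candidate_videos for tag in video["tags"])
    return {tag for tag, freq in counts.items() if freq >= min_frequency}
```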
In some embodiments of the present application, based on the foregoing, the second determining module 1230 is configured to: determining the number of the videos to be selected and the number of the publishers, which correspond to each target tag information, according to the videos to be selected, which correspond to each target tag information; performing quantity optimization processing on the video to be selected corresponding to each target tag information according to the quantity of the video to be selected corresponding to each target tag information and the quantity of the publishers; and identifying the video to be selected corresponding to each target label information after the quantity optimization processing as a target video.
In some embodiments of the present application, based on the foregoing, the second determining module 1230 is configured to: if the number of the videos to be selected corresponding to the target label information is larger than a first preset number, deleting the videos to be selected corresponding to the target label information; if the number of the publishers corresponding to the target tag information is smaller than a second preset number, deleting the video to be selected corresponding to the target tag information; and if the number of the videos to be selected issued by a single publisher in the videos to be selected corresponding to the target tag information is larger than a third preset number, deleting the videos to be selected issued by the single publisher.
In some embodiments of the present application, based on the foregoing scheme, the third determining module 1240 is configured to: according to different types of video information corresponding to the target video, acquiring to-be-processed characteristic vectors corresponding to the target video and the video information of each type respectively; and generating a target feature vector corresponding to the target video according to the feature vector to be processed corresponding to the target video.
In some embodiments of the present application, based on the foregoing scheme, the third determining module 1240 is configured to: and splicing the feature vectors to be processed to generate a target feature vector corresponding to the target video.
In some embodiments of the present application, based on the foregoing scheme, the third determining module 1240 is configured to: acquiring vector weights corresponding to the feature vectors to be processed; and generating a target feature vector corresponding to the target video according to each feature vector to be processed and the vector weight corresponding to each feature vector to be processed.
In some embodiments of the present application, based on the foregoing scheme, the third determining module 1240 is configured to: acquiring element weights corresponding to elements contained in the feature vectors to be processed; and generating a target feature vector corresponding to the target video according to each feature vector to be processed and the element weight corresponding to each element contained in each feature vector to be processed.
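The three fusion strategies handled by the third determining module 1240 (concatenation, per-vector weighting, and per-element weighting) could be sketched as follows; the weights are placeholders for values that would be configured or learned.

```python
import numpy as np

def fuse_concat(vectors):
    """Splice the feature vectors to be processed into one target feature vector."""
    return np.concatenate(vectors)

def fuse_vector_weights(vectors, vector_weights):
    """Weight each feature vector as a whole before concatenation."""
    return np.concatenate([w * v for v, w in zip(vectors, vector_weights)])

def fuse_element_weights(vectors, element_weights):
    """Weight each element of each feature vector before concatenation."""
    return np.concatenate([v * w for v, w in zip(vectors, element_weights)])
```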
In some embodiments of the present application, based on the foregoing scheme, the video information includes image information, audio information, and text information; the third determination module 1240 is configured to: according to the image information of the target video, obtaining an image characteristic vector corresponding to the image information by adopting a local aggregation vector algorithm; acquiring a Mel frequency cepstrum coefficient corresponding to the audio information according to the audio information of the target video; obtaining an audio feature vector corresponding to the target video according to the Mel frequency cepstrum coefficient; and obtaining a character feature vector corresponding to the target video according to the image information, the audio information and the character information of the target video.
In some embodiments of the present application, based on the foregoing scheme, the third determining module 1240 is configured to: performing image recognition according to image information corresponding to the target video to acquire first text information to be processed contained in the image information; performing voice recognition according to the audio information corresponding to the target video to acquire second text information to be processed corresponding to the audio information; and generating a character feature vector corresponding to the target video according to the first text information to be processed, the second text information to be processed and the character information of the target video.
In some embodiments of the present application, based on the foregoing scheme, before the obtaining of the tag information corresponding to each video to be selected, the obtaining module 1210 is further configured to: acquiring a release video in a preset time period every other preset period; filtering the release video according to a preset rule according to the release video; and identifying the filtered release video as a video to be selected.
FIG. 13 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
It should be noted that the computer system of the electronic device shown in fig. 13 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 13, the computer system includes a Central Processing Unit (CPU)1301, which can perform various appropriate actions and processes, such as performing the methods described in the above embodiments, according to a program stored in a Read-Only Memory (ROM) 1302 or a program loaded from a storage portion 1308 into a Random Access Memory (RAM) 1303. In the RAM 1303, various programs and data necessary for system operation are also stored. The CPU 1301, the ROM 1302, and the RAM 1303 are connected to each other via a bus 1304. An Input/Output (I/O) interface 1305 is also connected to bus 1304.
The following components are connected to the I/O interface 1305: an input portion 1306 including a keyboard, a mouse, and the like; an output section 1307 including a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD) display, a speaker, and the like; a storage portion 1308 including a hard disk and the like; and a communication section 1309 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1309 performs communication processing via a network such as the Internet. A drive 1310 is also connected to the I/O interface 1305 as needed. A removable medium 1311 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 1310 as necessary, so that a computer program read out therefrom is installed into the storage portion 1308 as needed.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication section 1309 and/or installed from the removable medium 1311. When executed by the Central Processing Unit (CPU) 1301, the computer program performs the various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with a computer program embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The computer program embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (15)

1. A method for acquiring training data is characterized by comprising the following steps:
acquiring label information corresponding to each video to be selected, wherein the label information is obtained by editing by a publisher of the video to be selected;
determining target label information from the label information according to the quantity information of the label information;
determining a target video from the videos to be selected corresponding to the target label information according to the quantity information of the videos to be selected corresponding to the target label information;
determining video characteristic data corresponding to the target video according to video information corresponding to the target video, wherein the video information comprises at least two of image information, audio information and character information;
and generating target training data respectively corresponding to each target video according to the video characteristic data and the label information corresponding to each target video.
2. The method of claim 1, wherein the determining target tag information from the tag information according to the quantity information of the tag information comprises:
counting the occurrence frequency corresponding to each label information according to the label information;
and identifying target label information from the label information according to the occurrence frequency corresponding to each label information.
3. The method according to claim 1, wherein the determining, according to the quantity information of the videos to be selected corresponding to each piece of the target tag information, a target video from the videos to be selected corresponding to the target tag information includes:
determining the number of the videos to be selected and the number of the publishers, which correspond to each target tag information, according to the videos to be selected, which correspond to each target tag information;
performing quantity optimization processing on the video to be selected corresponding to each target tag information according to the quantity of the video to be selected corresponding to each target tag information and the quantity of the publishers;
and identifying the video to be selected corresponding to each target label information after the quantity optimization processing as a target video.
4. The method according to claim 3, wherein performing quantity optimization processing on the to-be-selected video corresponding to each target tag information according to the quantity of the to-be-selected video corresponding to each target tag information and the quantity of publishers comprises:
if the number of the videos to be selected corresponding to the target label information is larger than a first preset number, deleting the videos to be selected corresponding to the target label information;
if the number of the publishers corresponding to the target tag information is smaller than a second preset number, deleting the video to be selected corresponding to the target tag information;
and if the number of the videos to be selected issued by a single publisher in the videos to be selected corresponding to the target tag information is larger than a third preset number, deleting the videos to be selected issued by the single publisher.
5. The method according to claim 1, wherein the determining the video characteristics corresponding to the target video according to the video information corresponding to the target video comprises:
according to different types of video information corresponding to the target video, acquiring to-be-processed characteristic vectors corresponding to the target video and the video information of each type respectively;
and generating a target feature vector corresponding to the target video according to the feature vector to be processed corresponding to the target video.
6. The method according to claim 5, wherein the generating a target feature vector corresponding to the target video according to the to-be-processed feature vector corresponding to the target video comprises:
and splicing the feature vectors to be processed to generate a target feature vector corresponding to the target video.
7. The method according to claim 5, wherein the generating a target feature vector corresponding to the target video according to the to-be-processed feature vector corresponding to the target video comprises:
acquiring vector weights corresponding to the feature vectors to be processed;
and generating a target feature vector corresponding to the target video according to each feature vector to be processed and the vector weight corresponding to each feature vector to be processed.
8. The method according to claim 5, wherein the generating a target feature vector corresponding to the target video according to the to-be-processed feature vector corresponding to the target video comprises:
acquiring element weights corresponding to elements contained in the feature vectors to be processed;
and generating a target feature vector corresponding to the target video according to each feature vector to be processed and the element weight corresponding to each element contained in each feature vector to be processed.
9. The method of claim 5, wherein the video information comprises image information, audio information, and text information;
the generating the feature vectors to be processed respectively corresponding to the target video and the video information of each category according to the video information of different categories corresponding to the target video comprises:
according to the image information of the target video, obtaining an image characteristic vector corresponding to the image information by adopting a local aggregation vector algorithm;
acquiring a Mel frequency cepstrum coefficient corresponding to the audio information according to the audio information of the target video;
obtaining an audio feature vector corresponding to the target video according to the Mel frequency cepstrum coefficient;
and obtaining a character feature vector corresponding to the target video according to the image information, the audio information and the character information of the target video.
10. The method of claim 9, wherein obtaining the text feature vector corresponding to the target video according to the image information, the audio information, and the text information of the target video comprises:
performing image recognition according to image information corresponding to the target video to acquire first text information to be processed contained in the image information;
performing voice recognition according to the audio information corresponding to the target video to acquire second text information to be processed corresponding to the audio information;
and generating a character feature vector corresponding to the target video according to the first text information to be processed, the second text information to be processed and the character information of the target video.
11. The method according to claim 1, wherein before the obtaining of the tag information corresponding to each video to be selected, the method further comprises:
acquiring a release video in a preset time period every other preset period;
filtering the release video according to a preset rule according to the release video;
and identifying the filtered release video as a video to be selected.
12. A video push method, comprising:
acquiring video information of a video to be pushed, wherein the video information comprises at least two of image information, audio information and character information;
inputting video information of the video to be pushed into a video classification model to obtain class information which is output by the video classification model and corresponds to the video to be pushed, wherein training data of the video classification model is obtained by the method of any one of claims 1 to 11;
determining a target push object corresponding to the video to be pushed according to the category information corresponding to the video to be pushed;
and pushing the video to be pushed to the target pushing object.
13. An apparatus for acquiring training data, comprising:
the acquisition module is used for acquiring label information corresponding to each video to be selected, and the label information is obtained by editing by a publisher of the video to be selected;
the first determining module is used for determining target label information from the label information according to the quantity information of the label information;
the second determining module is used for determining a target video from the videos to be selected corresponding to the target tag information according to the quantity information of the videos to be selected corresponding to the target tag information;
the third determining module is used for determining video characteristics corresponding to the target video according to video information corresponding to the target video, wherein the video information comprises at least two of image information, audio information and character information;
and the processing module is used for generating target training data respectively corresponding to each target video according to the video characteristics and the label information corresponding to each target video.
14. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 12.
15. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method of any one of claims 1 to 12.
CN202110219775.8A 2021-02-26 2021-02-26 Training data acquisition method, video push method, device, medium and electronic equipment Pending CN113704541A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110219775.8A CN113704541A (en) 2021-02-26 2021-02-26 Training data acquisition method, video push method, device, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110219775.8A CN113704541A (en) 2021-02-26 2021-02-26 Training data acquisition method, video push method, device, medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113704541A true CN113704541A (en) 2021-11-26

Family

ID=78647724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110219775.8A Pending CN113704541A (en) 2021-02-26 2021-02-26 Training data acquisition method, video push method, device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113704541A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116186330A (en) * 2023-04-23 2023-05-30 之江实验室 Video deduplication method and device based on multi-mode learning

Similar Documents

Publication Publication Date Title
JP7142737B2 (en) Multimodal theme classification method, device, device and storage medium
JP6967059B2 (en) Methods, devices, servers, computer-readable storage media and computer programs for producing video
US12001474B2 (en) Information determining method and apparatus, computer device, and storage medium
CN111507097B (en) Title text processing method and device, electronic equipment and storage medium
US20170140260A1 (en) Content filtering with convolutional neural networks
US20240212706A1 (en) Audio data processing
CN109582825B (en) Method and apparatus for generating information
CN110069545B (en) Behavior data evaluation method and device
WO2022033534A1 (en) Method for generating target video, apparatus, server, and medium
CN111754267A (en) Data processing method and system based on block chain
CN112153426A (en) Content account management method and device, computer equipment and storage medium
CN111883131B (en) Voice data processing method and device
CN111723289B (en) Information recommendation method and device
CN111897950A (en) Method and apparatus for generating information
US10810211B2 (en) Dynamic expression sticker management
US11636282B2 (en) Machine learned historically accurate temporal classification of objects
CN115801980A (en) Video generation method and device
CN113704541A (en) Training data acquisition method, video push method, device, medium and electronic equipment
CN116567351B (en) Video processing method, device, equipment and medium
CN110516086B (en) Method for automatically acquiring movie label based on deep neural network
CN110675865B (en) Method and apparatus for training hybrid language recognition models
CN110852801A (en) Information processing method, device and equipment
US11934921B2 (en) Dynamic content rating assistant
US20220103872A1 (en) Audio customization in streaming environment
US11395051B2 (en) Video content relationship mapping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination