CN115878849B - Video tag association method and device and electronic equipment - Google Patents

Video tag association method and device and electronic equipment

Info

Publication number
CN115878849B
Authority
CN
China
Prior art keywords
video
tag
target
information
target video
Prior art date
Legal status
Active
Application number
CN202310167854.8A
Other languages
Chinese (zh)
Other versions
CN115878849A (en)
Inventor
周锋 (Zhou Feng)
Current Assignee
Beijing Qishuyouyu Culture Media Co ltd
Original Assignee
Beijing Qishuyouyu Culture Media Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Qishuyouyu Culture Media Co ltd
Priority to CN202310167854.8A
Publication of CN115878849A
Application granted
Publication of CN115878849B

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of intelligent search, and in particular to a video tag association method, a video tag association apparatus and an electronic device. The method comprises the following steps: acquiring video information of a target video; matching the title information against all local video tags in a local tag library to obtain an initial video tag corresponding to the title information; performing feature word analysis on the audio information, the image information and the subtitle information in the video information to obtain the target feature word corresponding to each of them; integrating the target feature word corresponding to the audio information, the target feature word corresponding to the image information and the target feature word corresponding to the subtitle information to obtain a target video tag; and taking the initial video tag and the target video tag as the video tags of the target video. The video tags of the target video are determined automatically from multiple aspects of its video information to complete the association between the target video and the video tags, which greatly improves the efficiency of associating the target video with video tags.

Description

Video tag association method and device and electronic equipment
Technical Field
The application relates to the technical field of intelligent searching, in particular to a video tag association method, a video tag association device and electronic equipment.
Background
Currently, in order to let users understand the content of a video in a short time, a video playing platform usually creates corresponding video tags for an uploaded video. A video tag describes the characteristics of the video and can also be used for video recall, recommendation, search and the like. The richer the video tags are, the better the video can be distributed and consumed, and the friendlier the search and recommendation experience provided to users.
Currently, there are several ways to add tags to a video: one is to let users who watch the video add tags themselves, which is relatively quick for an individual video but yields low overall tagging efficiency; the other is to have managers of the video playing platform add tags manually, which consumes considerable manpower and material resources, and because the number of managers is limited, the efficiency of tagging videos is far too low.
Thus, how to improve the efficiency of associating video with video tags is a problem to be solved by those skilled in the art.
Disclosure of Invention
The purpose of the application is to provide a video tag association method, a video tag association apparatus and an electronic device, so as to solve at least one of the above technical problems.
The above object of the present application is achieved by the following technical solutions:
in a first aspect, the present application provides a video tag association method, which adopts the following technical scheme:
a video tag association method, the method comprising:
acquiring video information of a target video, wherein the video information comprises title information, audio information, image information and subtitle information;
based on the title information, matching the title information with all local video tags in a local tag library to obtain initial video tags corresponding to the title information, wherein a large number of local video tags are stored in the local tag library in advance;
performing feature word analysis based on the audio information, the image information and the subtitle information in the video information to obtain respective corresponding target feature words;
integrating the target feature words corresponding to the audio information, the target feature words corresponding to the image information and the target feature words corresponding to the subtitle information to obtain a target video tag;
and taking the initial video tag and the target video tag as video tags of the target video to complete association of the target video and the video tags.
By adopting the technical scheme, the method comprises the steps of matching the title information with a local tag library, determining an initial video tag, analyzing feature words based on audio information, image information and subtitle information to obtain respective corresponding target feature words, integrating the respective corresponding target feature words to obtain a target video tag, and then using the initial video tag and the target video tag as video tags of target videos to complete association of the target videos and the video tags. The video label of the target video is automatically determined based on the video information of multiple aspects of the target video so as to complete the association of the target video and the video label, and the efficiency of associating the target video and the video label is greatly improved.
The present application may be further configured in a preferred example to: the feature word analysis is performed based on the audio information, the image information and the subtitle information in the video information to obtain respective corresponding target feature words, including:
converting the audio information in the video information into text information, and performing word segmentation processing based on the text information to obtain a plurality of audio word segments;
determining an audio feature word from the plurality of audio tokens based on the frequency of each audio token;
Performing image recognition based on the image information, and determining a plurality of entity objects and expressions corresponding to the entity objects;
aiming at each entity object, determining an object keyword corresponding to the entity object according to the entity object and the expression corresponding to the entity object;
determining image feature words from all object keywords based on the frequency of each object keyword;
extracting semantic characters based on the caption information to obtain a plurality of caption keywords;
determining caption feature words from a plurality of caption keywords based on the frequency of each caption keyword;
the audio feature words are target feature words corresponding to the audio information, the image feature words are target feature words corresponding to the image information, and the caption feature words are target feature words corresponding to the caption information.
By adopting the above technical scheme, for the audio information, the audio information is converted into text information, word segmentation is performed on the text information to obtain a plurality of audio word segments, and the audio word segment with the highest frequency is used as the audio feature word; for the image information, a plurality of entity objects and the expression corresponding to each entity object are determined through image recognition, an object keyword is determined for each entity object, and the object keyword with the highest frequency is used as the image feature word; for the subtitle information, semantic characters are extracted to obtain a plurality of subtitle keywords, and the subtitle keyword with the highest frequency is used as the subtitle feature word. Performing feature word analysis separately on the audio information, image information and subtitle information of the target video makes the resulting target feature words more closely targeted to the target video.
The present application may be further configured in a preferred example to: the integrating processing is performed on the target feature words corresponding to the audio information, the target feature words corresponding to the image information and the target feature words corresponding to the subtitle information to obtain a target video tag, including:
and integrating the audio feature words, the image feature words and the subtitle feature words by using a feature word integration model to obtain a target video tag, wherein the feature word integration model is obtained by training based on a large number of training feature phrases.
By adopting the technical scheme, the audio feature words, the image feature words and the subtitle feature words are integrated by using the feature word integration model, so that the target video tag is obtained, the target video tag determined by integrating a plurality of target feature words can accurately tag the target video, the accuracy of the association of the target video tag and the target video is improved, and the efficiency and the accuracy of determining the target video tag can be improved by using the feature word integration model.
The present application may be further configured in a preferred example to: after the initial video tag and the target video tag are used as the video tags of the target video to complete the association between the target video and the video tags, the method further comprises:
Performing semantic similarity matching based on the initial video tag and the target video tag to obtain a matching result;
and if the matching result is that the matching fails, marking the data item corresponding to the video label of the target video as abnormal.
By adopting the above technical scheme, semantic similarity matching is performed on the initial video tag and the target video tag. If the matching fails, the similarity between the initial video tag and the target video tag is low and the video tags may not match the target video, so the data item corresponding to the video tags of the target video is marked as abnormal. This reminds the manager of the video playing platform to manually check the video tags, further ensuring the accuracy of the association between the target video and the video tags.
The present application may be further configured in a preferred example to: the video tag using the initial video tag and the target video tag as target videos includes:
determining classification items of the video tags corresponding to the initial video tag and the target video tag respectively, wherein the classification items of the video tags comprise a theme tag item, a genre tag item and an applicable state tag item;
Correspondingly, the matching of semantic similarity between the initial video tag and the target video tag to obtain a matching result comprises the following steps:
and when the initial video tag and the target video tag are the same classification item of the video tag, carrying out semantic similarity matching based on the initial video tag and the target video tag to obtain a matching result.
By adopting the technical scheme, the classification items of the video tags corresponding to the initial video tag and the target video tag are respectively determined, and when the initial video tag and the target video tag are the same classification item of the video tag, semantic similarity matching is carried out based on the initial video tag and the target video tag, so that a matching result is obtained. By the method, different classification items are divided for the video tags, so that the target video can be associated with video tags of different dimensions at the same time, the video tags of the target video are enriched, and semantic similarity matching is carried out on the initial video tags in the same classification item of the target video and the target video tags, so that the result of the semantic similarity matching is more accurate.
The present application may be further configured in a preferred example to: after marking the data item corresponding to the video tag of the target video as abnormal, the method further comprises:
When an exception handling instruction is detected, determining an exception handling type based on the exception handling instruction, wherein the exception handling type comprises: updating the video tag and updating the data item status;
if the exception handling type is to update the video tag, acquiring an artificial tag, and updating the video tag based on the artificial tag;
and if the abnormal processing type is the updated data item state, marking the data item corresponding to the video tag of the target video as normal.
By adopting the technical scheme, after the abnormal processing instruction is detected, the abnormal processing type is determined based on the abnormal processing instruction, if the abnormal processing type is the updated video tag, the manual tag is obtained, and the video tag is updated based on the manual tag, so that the video tag of the target video is more attached to the target video; if the abnormal processing type is the updated data item state, the data item corresponding to the video tag of the target video is marked as normal, so that the video tag of the target video is prevented from being manually checked for multiple times, and the workload of manually checking the video tag is reduced.
The present application may be further configured in a preferred example to: further comprises:
Obtaining the searching amount and the playing condition of the target video in a preset time period, wherein the searching amount is the number of times of the target video data appearing on a display interface when a user inputs a search keyword to search, and the playing condition comprises the clicking amount and the playing time length of the target video in the display interface;
scoring the video tags of the target video based on the search amount and the playing condition to obtain tag scores;
and judging whether the label score is smaller than a score threshold value, and if so, modifying the video label of the target video.
By adopting the technical scheme, the search amount and the play condition of the target video in the preset time period are acquired, the video labels of the target video are scored based on the search amount and the play condition, and the video labels of the target video with the label score smaller than the score threshold are modified, so that the association between the target video and the video labels is accurate.
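As an illustration only, the scoring and thresholding described above could look roughly like the Python sketch below; the normalization caps, weights and score threshold are assumptions, since the application does not fix a concrete formula.

```python
def score_video_tag(search_count: int, click_count: int, play_seconds: float,
                    w_search: float = 0.4, w_click: float = 0.4, w_play: float = 0.2) -> float:
    # Illustrative weighted score over the search amount and playing condition;
    # the caps and weights here are assumptions for the sketch only.
    norm = lambda value, cap: min(value / cap, 1.0)
    return (w_search * norm(search_count, 1000)
            + w_click * norm(click_count, 500)
            + w_play * norm(play_seconds, 3600))

def needs_new_tag(score: float, threshold: float = 0.3) -> bool:
    # If the tag score is below the score threshold, the video tags should be modified.
    return score < threshold
```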
The present application may be further configured in a preferred example to: the step of obtaining the initial video tag corresponding to the title information based on the matching of the title information with all the local video tags in the local tag library comprises the following steps:
performing word segmentation processing based on the title information to obtain a plurality of title word segments;
Performing word segmentation cleaning based on the title word segments to obtain target word segments;
aiming at each target word, matching the target word with each local video tag in a local tag library to obtain a matching result corresponding to the target word;
and determining the initial video label corresponding to the title information based on all the matching results.
By adopting the technical scheme, word segmentation processing is carried out based on the title information to obtain a plurality of title words, word segmentation cleaning is carried out to obtain a plurality of target words, and for each target word, the target word is matched with each local video tag in the local tag library to determine an initial video tag corresponding to the title information. In this way, the title word segmentation without actual semantics is removed, so that the initial video tag is more relevant to the target video.
In a second aspect, the present application provides a video tag association apparatus, which adopts the following technical scheme:
a video tag association apparatus comprising:
the video information acquisition module is used for acquiring video information of a target video, wherein the video information comprises title information, audio information, image information and subtitle information;
The initial video tag determining module is used for matching all the local video tags in the local tag library based on the title information to obtain initial video tags corresponding to the title information, wherein a large number of local video tags are stored in the local tag library in advance;
the feature word analysis module is used for carrying out feature word analysis based on the audio information, the image information and the subtitle information in the video information to obtain respective corresponding target feature words;
the integration processing module is used for integrating the target feature words corresponding to the audio information, the target feature words corresponding to the image information and the target feature words corresponding to the subtitle information to obtain a target video tag;
and the integration processing determining module is used for taking the initial video tag and the target video tag as video tags of the target video so as to complete association between the target video and the video tags.
By adopting the technical scheme, the method comprises the steps of matching the title information with a local tag library, determining an initial video tag, analyzing feature words based on audio information, image information and subtitle information to obtain respective corresponding target feature words, integrating the respective corresponding target feature words to obtain a target video tag, and then using the initial video tag and the target video tag as video tags of target videos to complete association of the target videos and the video tags. The video label of the target video is automatically determined based on the video information of multiple aspects of the target video so as to complete the association of the target video and the video label, and the efficiency of associating the target video and the video label is greatly improved.
In a third aspect, the present application provides an electronic device, which adopts the following technical scheme:
at least one processor;
a memory;
at least one application program, wherein the at least one application program is stored in the memory and configured to be executed by the at least one processor, the at least one application program being configured to perform the method described above.
By adopting the technical scheme, the method comprises the steps of matching the title information with a local tag library, determining an initial video tag, analyzing feature words based on audio information, image information and subtitle information to obtain respective corresponding target feature words, integrating the respective corresponding target feature words to obtain a target video tag, and then using the initial video tag and the target video tag as video tags of target videos to complete association of the target videos and the video tags. The video label of the target video is automatically determined based on the video information of multiple aspects of the target video so as to complete the association of the target video and the video label, and the efficiency of associating the target video and the video label is greatly improved.
In a fourth aspect, the present application provides a computer readable storage medium, which adopts the following technical scheme:
A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described above.
By adopting the technical scheme, the method comprises the steps of matching the title information with a local tag library, determining an initial video tag, analyzing feature words based on audio information, image information and subtitle information to obtain respective corresponding target feature words, integrating the respective corresponding target feature words to obtain a target video tag, and then using the initial video tag and the target video tag as video tags of target videos to complete association of the target videos and the video tags. The video label of the target video is automatically determined based on the video information of multiple aspects of the target video so as to complete the association of the target video and the video label, and the efficiency of associating the target video and the video label is greatly improved.
In summary, the present application includes at least one of the following beneficial technical effects:
1. The title information is matched against the local tag library to determine an initial video tag; feature word analysis is performed on the audio information, the image information and the subtitle information to obtain their corresponding target feature words, which are integrated to obtain a target video tag; the initial video tag and the target video tag are then used as the video tags of the target video to complete the association between the target video and the video tags. Because the video tags of the target video are determined automatically from multiple aspects of its video information, the efficiency of associating the target video with video tags is greatly improved;
2. For the audio information, the audio information is converted into text information, word segmentation is performed on the text information to obtain a plurality of audio word segments, and the audio word segment with the highest frequency is used as the audio feature word; for the image information, a plurality of entity objects and the expression corresponding to each entity object are determined through image recognition, an object keyword is determined for each entity object, and the object keyword with the highest frequency is used as the image feature word; for the subtitle information, semantic characters are extracted to obtain a plurality of subtitle keywords, and the subtitle keyword with the highest frequency is used as the subtitle feature word. Performing feature word analysis separately on the audio information, image information and subtitle information of the target video makes the resulting target feature words more closely targeted to the target video.
Drawings
Fig. 1 is a flowchart of a video tag association method according to one embodiment of the present application.
FIG. 2 is a flow chart illustrating feature word analysis according to one embodiment of the present application.
Fig. 3 is a schematic structural diagram of a video tag association apparatus according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application is described in further detail below in conjunction with fig. 1-4.
The present embodiments merely illustrate the present application and are not intended to limit it; after reading this specification, those skilled in the art may make modifications to the embodiments without creative contribution as required, all of which are protected by patent law within the scope of the claims of the present application.
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
In addition, the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. Unless otherwise specified, the character "/" herein generally indicates that the associated objects are in an "or" relationship.
Embodiments of the present application are described in further detail below with reference to the drawings attached hereto.
Currently, there are a variety of ways to tag video: one is to let the user watching the video add labels to the video, which is faster, but whether the added labels are compatible with the video content is not controllable; one is to manually add labels by the manager of the video playing platform, which requires considerable manpower and material resources, and the efficiency of adding labels to the video is too low due to the limited number of manager.
Therefore, in order to solve this technical problem, the present application automatically adds video tags to a target video based on the title information, audio information, image information and subtitle information of the target video, thereby completing the association between the target video and the video tags and improving the efficiency of that association.
The embodiment of the application provides a video tag association method, which is executed by electronic equipment, wherein the electronic equipment can be a server or terminal equipment, and the server can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server for providing cloud computing service. The terminal device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, or the like, but is not limited thereto, and the terminal device and the server may be directly or indirectly connected through a wired or wireless communication manner, which is not limited herein, and as shown in fig. 1, the method includes steps S101, S102, S103, S104, and S105, where:
Step S101: video information of a target video is acquired, wherein the video information comprises title information, audio information, image information and subtitle information.
For the embodiments of the present application, the video information of the target video is acquired, where the video information includes, but is not limited to: title information, audio information, image information and subtitle information. The title information is the title content displayed when the target video is played; the audio information is the background sound and the speech of people in the target video; the image information is the picture content of the target video and includes every frame of the target video; and the subtitle information is the subtitles and annotations appearing in the video pictures of the target video, which can be recognized from the video pictures by OCR (Optical Character Recognition).
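Purely for illustration (the application does not prescribe any code), the four kinds of video information in step S101 could be gathered into a simple container such as the following Python sketch; the class and field names are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VideoInfo:
    """Container for the four kinds of video information described in step S101."""
    title: str            # title text shown when the target video is played
    audio: bytes          # raw audio track (background sound and speech)
    frames: List[bytes]   # every frame picture of the target video
    subtitles: List[str]  # subtitle/annotation text recognized from frames, e.g. via OCR

def acquire_video_info(title: str, audio: bytes, frames: List[bytes],
                       ocr_texts: List[str]) -> VideoInfo:
    # In a real system the audio track, frames and OCR results would come from a
    # demuxer and an OCR engine; here they are simply passed in.
    return VideoInfo(title=title, audio=audio, frames=frames, subtitles=ocr_texts)
```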
Step S102: and matching the title information with all local video tags in a local tag library to obtain initial video tags corresponding to the title information, wherein a large number of local video tags are stored in the local tag library in advance.
For the embodiment of the application, a large number of local video tags are pre-stored in a local tag library, wherein a plurality of modes for acquiring a large number of local video tags in the local tag library are available, in one realizable mode, an existing video tag can be acquired based on a data crawling mode, and the crawled existing video tag is used as the local video tag of the local tag library; in another implementation manner, the labels can be manually added based on a user watching the video or a manager of the video playing platform, and the manually added labels are used as local video labels of a local label library; of course, the video tag can also be automatically generated by performing corresponding processing on the video data, and the automatically generated video tag is used as a local video tag of the local tag library. The method for obtaining the local video tag is not limited in this embodiment.
An initial video tag corresponding to the title information is obtained by matching the title information against all local video tags in the local tag library. Specifically, word segmentation is performed on the title information to divide it into a plurality of title word segments. Preferably, the title word segments are then cleaned to remove words without real semantics, such as pronouns and particles ("you", "me", and the like). Each cleaned title word segment is matched against every local video tag in the local tag library to obtain a matching result, which may include a tag matching degree, and the local video tag with the highest matching degree is used as the initial video tag corresponding to the title information. Alternatively, after the cleaning, a weight can be obtained for each cleaned title word segment; the cleaned title word segments are then matched one by one against all local video tags in the local tag library in descending order of weight, and the successfully matched title word segment is used as the initial video tag corresponding to the title information.
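The sketch below illustrates one possible realization of this matching step, using the third-party jieba segmenter and difflib's similarity ratio as stand-ins for the unspecified segmentation and matching-degree computation; the stop-word list is illustrative only.

```python
from difflib import SequenceMatcher
import jieba  # third-party Chinese word-segmentation library, used as a stand-in

STOPWORDS = {"你", "我", "的", "了", "吗"}  # illustrative "semantic-free" words

def initial_tag_from_title(title: str, local_tags: list[str]) -> str | None:
    # Segment the title and drop words without real semantics.
    segments = [w for w in jieba.lcut(title) if w.strip() and w not in STOPWORDS]
    # Match every cleaned title segment against every local tag and keep the
    # local video tag with the highest matching degree.
    best_tag, best_score = None, 0.0
    for seg in segments:
        for tag in local_tags:
            score = SequenceMatcher(None, seg, tag).ratio()
            if score > best_score:
                best_tag, best_score = tag, score
    return best_tag
```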
Step S103: and analyzing the feature words based on the audio information, the image information and the subtitle information in the video information to obtain the corresponding target feature words.
For the embodiment of the application, feature word analysis is performed based on audio information, image information and subtitle information in video information to obtain respective corresponding target feature words, specifically, for the audio information, the audio information can be converted into text information through an ASR (Automatic Speech Recognition) technology, word segmentation processing is performed on the text information to obtain a plurality of audio segmentation words, and the audio segmentation word with the highest occurrence frequency is used as the target feature word corresponding to the audio information. For image information, an image recognition technology can be adopted to recognize entity objects and expressions corresponding to the entity objects contained in each video frame, wherein the entity objects can comprise entity objects such as characters and animals, and the expressions corresponding to the entity objects can comprise expressions such as happiness, distress and surprise; and after identifying a plurality of entity objects and expressions corresponding to the entity objects contained in the image information, determining target feature words corresponding to the image information based on all the entity objects and expressions corresponding to the entity objects. The caption information is a caption and an annotation appearing in a video picture of the target video, the caption information can be identified from the video picture through an OCR technology, semantic character extraction is carried out based on the obtained caption information, a plurality of caption keywords are obtained, and a target feature word corresponding to the caption information is determined based on the frequency of each caption keyword.
Of course, text information obtained by converting the audio information, caption information obtained by identifying from the video picture, a plurality of entity objects obtained based on the image information and expressions corresponding to the entity objects can be used as texts to be identified, and the corresponding target feature words can be obtained by using the same text processing mode.
Step S104: and integrating the target feature words corresponding to the audio information, the target feature words corresponding to the image information and the target feature words corresponding to the subtitle information to obtain the target video tag.
For the embodiment of the application, the target video tag is obtained by integrating the target feature words corresponding to the audio information, the target feature words corresponding to the image information and the target feature words corresponding to the subtitle information, and the target video tag is determined by integrating a plurality of target feature words, so that the target video tag is attached to the target video.
There are many ways to integrate the target feature words into the target video tag. In one implementation, a feature word integration model is used to integrate the target feature words, where the feature word integration model can be obtained by training based on a large number of training feature words and their corresponding training video tags; further, a convolutional neural network is trained with the training feature words and the corresponding training video tags to obtain the feature word integration model, and the structure of the convolutional neural network can be set by the user according to actual requirements, which is not limited in the embodiments of the present application. In another implementation, the plurality of target feature words are integrated based on a preset feature word integration information table, in which correspondences between groups of target feature words and target video tags are stored in advance, so that when a plurality of target feature words are acquired, the corresponding target video tag can be determined quickly from the preset feature word integration information table.
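A minimal sketch of the second implementation, the preset feature word integration information table; the table contents and the exact-match lookup are assumptions made for illustration.

```python
# Hypothetical preset feature-word integration information table: each entry maps a
# group of target feature words to a pre-stored target video tag.
INTEGRATION_TABLE = {
    frozenset({"生日", "蛋糕", "孩子"}): "儿童生日",
    frozenset({"春节", "烟花", "团圆"}): "春节",
}

def integrate_by_table(audio_word: str, image_word: str, subtitle_word: str) -> str | None:
    # Look up the group of three target feature words; return the pre-stored tag if known.
    key = frozenset({audio_word, image_word, subtitle_word})
    return INTEGRATION_TABLE.get(key)
```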
Step S105: and taking the initial video tag and the target video tag as video tags of the target video to complete association of the target video and the video tags.
For the embodiment of the application, after the initial video tag and the target video tag are determined, the initial video tag and the target video tag are used as the video tags of the target video to form the corresponding relation between the target video and the video tag, and in this way, the association operation of the target video and the video tag is completed. Furthermore, a user watching the video can quickly search out the corresponding video based on the video tag, management work of a video playing platform manager is also facilitated to a certain extent, and efficiency of associating the target video with the video tag is improved.
It can be seen that, in this embodiment of the present application, the initial video tag is determined based on matching of the title information and the local tag library, the feature word analysis is performed based on the audio information, the image information and the subtitle information, so as to obtain respective corresponding target feature words, and the integration processing is performed based on the respective corresponding target feature words, so as to obtain the target video tag, and then the initial video tag and the target video tag are used as video tags of the target video, so as to complete association between the target video and the video tag. The video label of the target video is automatically determined based on the video information of multiple aspects of the target video so as to complete the association of the target video and the video label, and the efficiency of associating the target video and the video label is greatly improved.
Further, in order to make the target feature words fit the target video more closely, in the embodiment of the present application, step S103: performing feature word analysis based on the audio information, the image information and the subtitle information in the video information to obtain the corresponding target feature words, as shown in fig. 2, includes step S1031, step S1032, step S1033, step S1034, step S1035, step S1036 and step S1037, where:
step S1031: converting audio information in the video information into text information, and performing word segmentation processing based on the text information to obtain a plurality of audio word segments;
step S1032: determining an audio feature word from the plurality of audio tokens based on the frequency of each audio token;
For the embodiment of the present application, ASR technology is used to convert the audio information into text information, and word segmentation is then performed on the text information to obtain a plurality of audio word segments, where the word segmentation can be performed by character string matching or by machine learning. Specifically, when word segmentation is performed by character string matching, the text information can be scanned using principles such as forward/reverse maximum matching and longest-word priority, and the words corresponding to the scanned character strings are used as the audio word segments; when word segmentation is performed by machine learning, a sequence labeling model may be used to calculate probability values for the audio words that may appear in the text information, and the audio word segments are determined according to the probability values, where commonly used sequence labeling models include the CRF (Conditional Random Field) model, the HMM (Hidden Markov Model), and the like.
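A minimal sketch of the forward maximum matching idea mentioned above, assuming a plain Python set as the word dictionary; real systems would use a full segmentation toolkit or a sequence labeling model such as CRF or HMM.

```python
def forward_max_match(text: str, vocabulary: set[str], max_len: int = 4) -> list[str]:
    """Forward maximum matching: scan the text left to right and always take the
    longest dictionary word that matches at the current position."""
    tokens, i = [], 0
    while i < len(text):
        match = text[i]  # fall back to a single character if nothing longer matches
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in vocabulary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens
```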
Because the determined audio word segments may include meaningless ones, for example "you", "me", "he", and the like, which are useless as audio feature words, the determined audio word segments are preferably cleaned to remove the meaningless ones; then, based on the cleaned audio word segments and their respective frequencies, the audio word segment with the highest frequency is selected from the cleaned audio word segments as the audio feature word.
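For illustration, the cleaning-then-frequency selection could be written as follows; the list of meaningless segments is an assumption.

```python
from collections import Counter

MEANINGLESS = {"你", "我", "他", "的", "了"}  # illustrative list of semantic-free segments

def pick_audio_feature_word(audio_segments: list[str]) -> str | None:
    # Clean out meaningless segments, then take the most frequent remaining one
    # as the audio feature word.
    cleaned = [w for w in audio_segments if w not in MEANINGLESS]
    if not cleaned:
        return None
    word, _count = Counter(cleaned).most_common(1)[0]
    return word
```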
Step S1033: performing image recognition based on the image information, and determining a plurality of entity objects and expressions corresponding to the entity objects;
step S1034: aiming at each entity object, determining an object keyword corresponding to the entity object according to the entity object and the expression corresponding to the entity object;
step S1035: determining image feature words from all object keywords based on the frequency of each object keyword;
For the embodiment of the present application, image recognition is performed based on the image information to recognize the entity objects contained in each video frame and the expression corresponding to each entity object, where the entity objects can include people, animals and the like, and the corresponding expressions can include happiness, sadness, surprise and the like. After the entity objects contained in the image information and their corresponding expressions are recognized, the object keyword corresponding to each entity object is determined based on the entity object and its expression. For example, if image recognition on a video frame in the image information recognizes that the entity objects are several children and the corresponding expression is happy, the object keyword of that video frame is determined to be "happy children".
And processing each video frame in the image information to obtain all object keywords corresponding to the image information, and then selecting the object keywords with highest frequency from all the object keywords as image feature words based on all the object keywords and the respective corresponding frequencies.
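As an illustration only, the sketch below shows how entity objects and their expressions might be combined into object keywords and the most frequent keyword selected as the image feature word; the keyword format is an assumption.

```python
from collections import Counter

def object_keyword(entity: str, expression: str) -> str:
    # Combine an entity object with its expression into an object keyword,
    # e.g. ("child", "happy") -> "happy child".
    return f"{expression} {entity}"

def pick_image_feature_word(detections: list[tuple[str, str]]) -> str | None:
    """detections: (entity, expression) pairs recognized from every video frame."""
    keywords = [object_keyword(entity, expression) for entity, expression in detections]
    if not keywords:
        return None
    keyword, _count = Counter(keywords).most_common(1)[0]
    return keyword
```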
Step S1036: extracting semantic characters based on the subtitle information to obtain a plurality of subtitle keywords;
step S1037: determining caption feature words from a plurality of caption keywords based on the frequency of each caption keyword; the audio feature words are target feature words corresponding to the audio information, the image feature words are target feature words corresponding to the image information, and the caption feature words are target feature words corresponding to the caption information.
For the embodiment of the application, the caption information is a caption and an annotation appearing in a video picture of a target video, the caption information can be identified from the video picture through an OCR technology, then semantic character extraction is carried out based on the obtained caption information to obtain a plurality of caption keywords, and the caption keyword with the highest frequency is selected from the plurality of caption keywords as a caption feature word based on the frequency of each caption keyword.
It is noted that the execution sequence of step S1031, step S1032 for the audio information, step S1033, step S1034, step S1035 for the image information, and step S1036, step S1037 for the subtitle information is not limited in this application.
It can be seen that, in the embodiment of the present application, for the audio information, the audio information is converted into text information, word segmentation is performed on the text information to obtain a plurality of audio word segments, and the audio word segment with the highest frequency is used as the audio feature word; for the image information, a plurality of entity objects and the expression corresponding to each entity object are determined through image recognition, an object keyword is determined for each entity object, and the object keyword with the highest frequency is used as the image feature word; for the subtitle information, semantic characters are extracted to obtain a plurality of subtitle keywords, and the subtitle keyword with the highest frequency is used as the subtitle feature word. Performing feature word analysis separately on the audio information, image information and subtitle information of the target video makes the resulting target feature words more closely targeted to the target video.
Further, in order to improve accuracy of association between the target video tag and the target video and improve efficiency and accuracy of determining the target video tag, in the embodiment of the present application, the target feature words corresponding to the audio information, the target feature words corresponding to the image information, and the target feature words corresponding to the subtitle information are integrated, so as to obtain the target video tag, including:
And integrating the audio feature words, the image feature words and the subtitle feature words by using a feature word integration model to obtain a target video tag, wherein the feature word integration model is obtained by training based on a large number of training feature phrases.
For the embodiment of the application, the audio feature words, the image feature words and the subtitle feature words are integrated by utilizing a feature word integration model to obtain the target video tag, wherein the feature word integration model is obtained by training based on a large number of training feature phrases. Specifically, the training process of the feature word integration model is as follows: a large number of training feature phrases are acquired, wherein the training feature phrases comprise: the feature words before the integration processing and the feature word after the integration processing can be obtained from a network and a local storage. And then training the convolutional neural network by utilizing a large number of training feature word groups to obtain a feature word integration model for integrating a plurality of feature words to obtain a target video tag. Specifically, acquiring integrated information through a convolutional neural network based on a plurality of feature words before a plurality of groups of integrated processing; aiming at each group of training feature phrases, determining similarity of the integrated information and one feature word after the integrated processing; obtaining loss based on the similarity of the groups of training feature phrases, and back-propagating the loss to train the convolutional neural network; and carrying out weighted summation on each loss of the trained convolutional neural network to obtain total loss, and determining the trained convolutional neural network as a feature word integration model when the total loss meets a set loss threshold range. In the embodiment of the present application, the convolutional neural network may be various convolutional networks, for example, a Resnet network, and a yolov5 network.
Further, in the embodiment of the present application, training the convolutional neural network with a large number of training feature word groups to obtain the feature word integration model may include: training the convolutional neural network with a large number of training feature word groups to obtain a first feature word integration model; testing the first feature word integration model with a large number of test feature word groups to obtain a test result; when the test result meets a preset result threshold, determining the first feature word integration model as the final feature word integration model; when the test result does not meet the preset result threshold, retraining the first feature word integration model with the training feature word groups to obtain a second feature word integration model, testing the second feature word integration model with the test feature word groups, and so on, until a final feature word integration model that meets the preset result threshold is obtained, where the preset result threshold can be set by the user based on the actual situation. Furthermore, after the feature word integration model being trained has been tested with the test feature word groups, the test feature word groups that fail the test can be added to the training feature word groups so as to update them, which can effectively improve the training effect.
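The application does not give the network in code; the PyTorch-style sketch below only illustrates the described idea of a convolutional model that embeds a group of pre-integration feature words, produces integrated information, and is trained with a loss derived from its similarity to the post-integration feature word. The layer sizes and the embedding scheme are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureWordIntegrator(nn.Module):
    """Toy CNN that turns a group of feature-word ids into one integrated vector."""
    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        # word_ids: (batch, group_size) ids of the pre-integration feature words
        x = self.embed(word_ids).transpose(1, 2)     # (batch, dim, group_size)
        x = F.relu(self.conv(x)).mean(dim=2)         # (batch, dim) integrated information
        return x

def train_step(model, optimizer, word_ids, target_tag_ids):
    # The similarity between the integrated information and the post-integration
    # feature word's embedding drives the loss, which is then back-propagated.
    integrated = model(word_ids)
    target_vec = model.embed(target_tag_ids)         # (batch, dim)
    loss = (1.0 - F.cosine_similarity(integrated, target_vec, dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```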
Therefore, in the embodiment of the application, the audio feature words, the image feature words and the subtitle feature words are integrated by using the feature word integration model to obtain the target video tag, the target video tag determined by integrating the plurality of target feature words can accurately tag the target video, the accuracy of association between the target video tag and the target video is improved, and the efficiency and the accuracy of determining the target video tag can also be improved by using the feature word integration model.
Further, in order to further ensure accuracy of association between the target video and the video tag, in this embodiment of the present application, the initial video tag and the target video tag are used as the video tag of the target video, so as to complete association between the target video and the video tag, and then the method further includes:
performing semantic similarity matching based on the initial video tag and the target video tag to obtain a matching result;
if the matching result is that the matching fails, marking the data item corresponding to the video label of the target video as abnormal.
For the embodiment of the present application, the initial video tag is determined based on the title information, while the target video tag is determined based on the audio information, the image information and the subtitle information. In practice the title information is short, so the initial video tag determined from it may deviate somewhat from the target video; therefore, semantic similarity matching is performed between the initial video tag and the target video tag to ensure that the video tags fit the content of the target video more closely. For the semantic similarity matching, an LSTM-DSSM (Long Short Term Memory - Deep Structured Semantic Model, a recall model for semantic retrieval) can be used. The LSTM-DSSM model comprises an input layer, a representation layer and a matching layer: the input layer maps the tags into a vector space and feeds them into the DNNs (Deep Neural Networks) of the representation layer; the representation layer uses an LSTM model to capture the context information of the text and obtain a semantic vector; and the matching layer obtains the matching result by calculating the cosine distance between the two semantic vectors.
If the matching result is a failure, the data item corresponding to the video tags of the target video is marked as abnormal. A failed match means that the similarity between the initial video tag and the target video tag is low; however, the abnormal mark does not necessarily mean the video tags are wrong, because the initial video tag and the target video tag may describe the target video in different dimensions and may each still label it accurately. Marking the data item corresponding to the video tags of the target video as abnormal therefore reminds the manager of the video playing platform to manually check the video tags, further ensuring the accuracy of the association between the target video and the video tags.
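For illustration, the sketch below replaces the LSTM-DSSM with plain cosine similarity over pre-computed semantic vectors, just to show the match-then-mark-abnormal flow; the similarity threshold and record fields are assumptions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def check_tags(initial_vec: np.ndarray, target_vec: np.ndarray,
               record: dict, threshold: float = 0.5) -> dict:
    # Matching fails when the semantic similarity of the two tags is below the
    # threshold; the data item of the record is then marked as abnormal.
    matched = cosine_similarity(initial_vec, target_vec) >= threshold
    record["tag_status"] = "normal" if matched else "abnormal"
    return record
```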
It can be seen that, in the embodiment of the application, semantic similarity matching is performed based on the initial video tag and the target video tag, if matching fails, which indicates that the similarity of the initial video tag and the target video tag is lower, and if there is a possibility that the video tag is not matched with the target video, the data item corresponding to the video tag of the target video is marked as abnormal, so that a manager of the video playing platform can be reminded to manually check the video tag, and the accuracy of association between the target video and the video tag is further ensured.
Further, in order to enrich the video tag of the target video and make the result of matching the semantic similarity more accurate, in the embodiment of the present application, the method for using the initial video tag and the target video tag as the video tag of the target video includes:
determining classification items of the video tags corresponding to the initial video tags and the target video tags respectively, wherein the classification items of the video tags comprise a theme tag item, a genre tag item and an applicable state tag item;
correspondingly, semantic similarity matching is carried out on the basis of the initial video tag and the target video tag, and a matching result is obtained, wherein the matching result comprises the following steps:
when the initial video tag and the target video tag are the same classification item of the video tag, semantic similarity matching is carried out on the basis of the initial video tag and the target video tag, and a matching result is obtained.
For the embodiment of the present application, the classification items of the video tags include a theme tag item, a genre tag item and an applicable-state tag item, where the theme tag item represents the object described by the content, the genre tag item represents the demand point or action corresponding to the content, and the applicable-state tag item represents the period for which the content is applicable to a preset target crowd. Rules or a classification tag library are preset for each classification item of the video tags, so that the classification items of the initial video tag and the target video tag can be determined relatively quickly. For example, the tags corresponding to the theme tag item, presented in the form of themes, may include but are not limited to: birthday, Spring Festival, Christmas, spring outing, mother's love, and the like; the tags corresponding to the genre tag item may include but are not limited to: knowledge popularization, self-description, help seeking, advertisement, fiction, and the like; and the tags corresponding to the applicable-state tag item may include but are not limited to: weeks 5 to 34 of pregnancy, adolescence, and so on.
When the initial video tag and the target video tag belong to the same classification item, they are tags generated in the same dimension, and the video tags of the target video within one classification item should be similar, so semantic similarity matching is performed between the initial video tag and the target video tag to obtain the matching result. If the initial video tag and the target video tag belong to different classification items, no semantic similarity matching is performed between them, because tags generated in different dimensions can normally have low semantic similarity.
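A small sketch of this gating logic, assuming each tag carries its classification item: matching is only attempted when both tags belong to the same classification item.

```python
def maybe_match(initial_tag: dict, target_tag: dict, match_fn) -> bool | None:
    """Only run semantic similarity matching when both tags belong to the same
    classification item (theme / genre / applicable state); otherwise skip."""
    if initial_tag["category"] != target_tag["category"]:
        return None  # different dimensions, so low similarity is expected
    return match_fn(initial_tag["text"], target_tag["text"])
```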
Therefore, in the embodiment of the application, the classification items of the video tags corresponding to the initial video tag and the target video tag are respectively determined, and when the initial video tag and the target video tag are the same classification item of the video tag, semantic similarity matching is performed based on the initial video tag and the target video tag, so that a matching result is obtained. By the method, different classification items are divided for the video tags, so that the target video can be associated with video tags of different dimensions at the same time, the video tags of the target video are enriched, and semantic similarity matching is carried out on the initial video tags in the same classification item of the target video and the target video tags, so that the result of the semantic similarity matching is more accurate.
Further, in order to make the video tag of the target video fit the target video more closely, and to avoid manually checking the video tag of a target video that has been marked as abnormal multiple times, thereby reducing the workload of manually checking video tags, in the embodiment of the present application, after the data item corresponding to the video tag of the target video is marked as abnormal, the method further includes:
when an exception handling instruction is detected, determining an exception handling type based on the exception handling instruction, wherein the exception handling type comprises: updating the video tag and updating the data item status;
if the exception handling type is updating the video tag, acquiring a manual tag and updating the video tag based on the manual tag;
and if the exception handling type is updating the data item status, marking the data item corresponding to the video tag of the target video as normal.
In the embodiment of the present application, after the data item corresponding to the video tag of the target video is marked as abnormal, a manager of the video playing platform processes the video tag marked as abnormal, so as to ensure the accuracy of the association between the target video and the video tag. The manager performs the corresponding exception handling on the display interface for the video tag of the target video marked as abnormal, which generates an exception handling instruction. After detecting the exception handling instruction, the electronic device determines the exception handling type based on the instruction. When the exception handling type is updating the video tag, the tag content corresponding to the video tag deviates from the target video; therefore, the manual tag entered after the manager reviews the target video is acquired, and the video tag is updated based on the manual tag, so that the video tag of the target video fits the target video more closely. When the exception handling type is updating the data item status, the tag content corresponding to the video tag is consistent with the target video; therefore, the data item corresponding to the video tag of the target video is marked as normal, which avoids manually checking the video tag of the target video multiple times and reduces the workload of manually checking video tags.
It can be seen that, in the embodiment of the present application, after an exception handling instruction is detected, the exception handling type is determined based on the instruction; if the exception handling type is updating the video tag, a manual tag is acquired and the video tag is updated based on it, so that the video tag of the target video fits the target video more closely; if the exception handling type is updating the data item status, the data item corresponding to the video tag of the target video is marked as normal, which avoids manually checking the video tag of the target video multiple times and reduces the workload of manually checking video tags.
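A minimal sketch of this exception-handling dispatch is given below. The field names (type, manual_tag, tags, status) and the dictionary-based data item are assumptions for illustration only; a real platform would use its own data model.

```python
from enum import Enum

class ExceptionHandlingType(Enum):
    UPDATE_VIDEO_TAG = "update_video_tag"
    UPDATE_DATA_ITEM_STATUS = "update_data_item_status"

def handle_exception(instruction: dict, data_item: dict) -> None:
    """Dispatch an exception handling instruction issued by an administrator."""
    handling_type = ExceptionHandlingType(instruction["type"])
    if handling_type is ExceptionHandlingType.UPDATE_VIDEO_TAG:
        # The manually reviewed tag replaces the automatically generated one.
        data_item["tags"] = instruction["manual_tag"]
    elif handling_type is ExceptionHandlingType.UPDATE_DATA_ITEM_STATUS:
        # The tag content matches the video after review, so only the flag is cleared.
        data_item["status"] = "normal"
```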
Further, in order to make the association between the target video and the video tag accurate, in the embodiment of the present application, the method further includes:
obtaining the search amount and the play condition of the target video in a preset time period, wherein the search amount is the number of times the target video appears on a display interface when users search with a search keyword, and the play condition comprises the click amount and the play duration of the target video on the display interface;
scoring the video tags of the target video based on the search amount and the play condition to obtain tag scores;
and judging whether the tag score is smaller than a score threshold, and if so, modifying the video tag of the target video.
In the embodiment of the present application, the video tag is used for describing the characteristics of the target video, and is also used when a user searches for videos: the search keyword is matched against the video tags of all videos, and every video that matches successfully is taken as one of the target videos corresponding to the search keyword. However, if, during actual playing, a target video determined from the search keyword shows a poor play condition within the preset time period, this indicates that the target video does not match its video tags, i.e., the video the user wants to watch does not coincide with the target video displayed for the search keyword. Therefore, video tags that do not match the target video need to be modified, so that the association between the target video and the video tags is accurate.
Specifically, the search amount and the play condition of the target video in the preset time period are obtained, where the search amount is the number of times the target video appears on the display interface when users search with a search keyword, and the play condition includes, but is not limited to, the click amount and the play duration of the target video on the display interface. Then, the video tags of the target video are scored based on the search amount and the play condition to obtain tag scores. Specifically, if two target videos have the same search amount, the target video with the better play condition receives the higher tag score, and the play condition is determined comprehensively from the click amount and the play duration of the target video; of course, the user's viewing behaviors, which may include likes, comments, favorites and the like, can also be taken into account. The embodiment of the present application imposes no further limitation here, as long as the degree to which the target video matches the users' search expectations can be determined from the tag score, that is, the higher the tag score, the better the target video matches the users' search expectations. Then, the tag score is compared with a score threshold, which can be set by the user as required. If the tag score is smaller than the score threshold, the target video does not match the video tag, and the video tag of the target video is modified: the video tag may be re-determined from the video information of the target video, or the data item corresponding to the video tag may be marked as abnormal and the video tag of the target video modified by acquiring a manual tag entered by an administrator. If the tag score is not smaller than the score threshold, the target video matches the video tag, and no further operation is performed.
In the embodiment of the present application, the search amount and the play condition of the target video in the preset time period are obtained, the video tags of the target video are scored based on the search amount and the play condition, and any video tag of the target video whose tag score is smaller than the score threshold is modified, so that the association between the target video and the video tags is more accurate.
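The scoring step can be sketched as follows. The click-rate and play-length weighting is an assumed formula chosen only to satisfy the property described above (at equal search amounts, a better play condition yields a higher score), and the 0.3 value is a placeholder for the user-configurable score threshold.

```python
def tag_score(search_count: int, clicks: int, play_seconds: float,
              w_click: float = 0.6, w_play: float = 0.4) -> float:
    """Score the video tags of a target video from the search amount and
    play condition gathered over the preset time period."""
    if search_count == 0:
        return 0.0
    click_rate = clicks / search_count
    avg_play_minutes = (play_seconds / clicks / 60.0) if clicks else 0.0
    return w_click * click_rate + w_play * min(avg_play_minutes, 1.0)

def needs_modification(score: float, threshold: float = 0.3) -> bool:
    # Tags scoring below the threshold are modified, e.g. re-determined
    # from the video information or flagged for manual tagging.
    return score < threshold
```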
Further, in order to make the initial video tag match the target video more closely, in the embodiment of the present application, the matching the title information with all the local video tags in the local tag library to obtain the initial video tag corresponding to the title information includes:
word segmentation processing is carried out based on the title information, so that a plurality of title word segments are obtained;
performing word segmentation cleaning based on the plurality of title word segments to obtain a plurality of target word segments;
aiming at each target word, matching is carried out by utilizing the target word and each local video tag in a local tag library, and a matching result corresponding to the target word is obtained;
and determining an initial video tag corresponding to the title information based on all the matching results.
In the embodiment of the present application, word segmentation processing is performed based on the title information to obtain a plurality of title word segments, where the word segmentation processing may be performed on the title information by a character-string matching or machine learning method. The obtained title word segments may include title word segments without actual semantics, such as "you", "I", "he" and the like; therefore, word segmentation cleaning is performed to remove the title word segments without actual semantics, so that all of the obtained target word segments have actual semantics. Then, each target word segment is matched with each local video tag in the local tag library to obtain a matching result corresponding to the target word segment, where the matching result at least includes a tag matching degree. Based on the tag matching degrees in the matching results corresponding to all the target word segments, the target word segment with the highest tag matching degree is taken as the initial video tag corresponding to the title information.
It can be seen that, in the embodiment of the present application, word segmentation processing is performed based on the title information to obtain a plurality of title word segments, word segmentation cleaning is then performed to obtain a plurality of target word segments, and each target word segment is matched with each local video tag in the local tag library to determine the initial video tag corresponding to the title information. In this way, title word segments without actual semantics are removed, so that the initial video tag is more relevant to the target video.
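A minimal sketch of this title-matching flow is given below, assuming a Chinese tokenizer (jieba here; any segmenter works) and a small stop-word set standing in for the "you", "I", "he" examples of word segments without actual semantics; the character-overlap matching degree is likewise an illustrative stand-in for whatever tag matching degree the platform computes.

```python
from typing import List, Optional

import jieba  # a common Chinese word-segmentation library

# Assumed stop words representing title word segments without actual semantics.
STOP_WORDS = {"你", "我", "他", "的", "了", "吗"}

def initial_video_tag(title: str, local_tags: List[str]) -> Optional[str]:
    """Segment the title, clean out stop words, match every target word
    segment against the local tag library, and keep the target word segment
    with the highest tag matching degree as the initial video tag."""
    target_words = [w for w in jieba.lcut(title)
                    if w.strip() and w not in STOP_WORDS]
    best_word, best_degree = None, 0.0
    for word in target_words:
        for tag in local_tags:
            degree = len(set(word) & set(tag)) / max(len(set(word) | set(tag)), 1)
            if degree > best_degree:
                best_word, best_degree = word, degree
    return best_word
```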
The above embodiments describe a video tag association method from the viewpoint of a method flow, and the following embodiments describe a video tag association apparatus from the viewpoint of a virtual module or a virtual unit, which will be described in detail below.
An embodiment of the present application provides a video tag association apparatus 200, as shown in fig. 3, the video tag association apparatus 200 may specifically include:
a video information acquisition module 210, configured to acquire video information of a target video, where the video information includes title information, audio information, image information, and subtitle information;
the initial video tag determining module 220 is configured to match all local video tags in the local tag library based on the title information to obtain initial video tags corresponding to the title information, where a large number of local video tags are stored in the local tag library in advance;
The feature word analysis module 230 is configured to perform feature word analysis based on the audio information, the image information and the subtitle information in the video information, so as to obtain respective corresponding target feature words;
the integration processing module 240 is configured to integrate the target feature word corresponding to the audio information, the target feature word corresponding to the image information, and the target feature word corresponding to the subtitle information to obtain a target video tag;
the video tag determining module 250 is configured to take the initial video tag and the target video tag as video tags of the target video, so as to complete association between the target video and the video tags.
In the embodiment of the present application, matching is performed based on the title information and the local tag library to determine the initial video tag; feature word analysis is performed based on the audio information, the image information and the subtitle information to obtain the respective corresponding target feature words; integration processing is performed on these target feature words to obtain the target video tag; and the initial video tag and the target video tag are then taken as the video tags of the target video, so that the association between the target video and the video tags is completed. The video tags of the target video are determined automatically based on multiple aspects of its video information, which greatly improves the efficiency of associating the target video with video tags.
In one possible implementation manner of the embodiment of the present application, when performing feature word analysis based on the audio information, the image information and the subtitle information in the video information to obtain the respective corresponding target feature words, the feature word analysis module 230 is configured to:
converting audio information in the video information into text information, and performing word segmentation processing based on the text information to obtain a plurality of audio word segments;
determining an audio feature word from the plurality of audio tokens based on the frequency of each audio token;
performing image recognition based on the image information, and determining a plurality of entity objects and expressions corresponding to the entity objects;
aiming at each entity object, determining an object keyword corresponding to the entity object according to the entity object and the expression corresponding to the entity object;
determining image feature words from all object keywords based on the frequency of each object keyword;
extracting semantic characters based on the subtitle information to obtain a plurality of subtitle keywords;
determining caption feature words from a plurality of caption keywords based on the frequency of each caption keyword;
the audio feature words are target feature words corresponding to the audio information, the image feature words are target feature words corresponding to the image information, and the caption feature words are target feature words corresponding to the caption information.
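The frequency-based selection that the module applies alike to audio word segments, object keywords and subtitle keywords can be sketched as follows. The example token lists are invented purely for illustration, and the upstream speech-to-text, image recognition and subtitle keyword extraction steps are assumed to have run already.

```python
from collections import Counter
from typing import Iterable, Optional

def feature_word(tokens: Iterable[str]) -> Optional[str]:
    """Return the most frequent token as the target feature word."""
    counts = Counter(t for t in tokens if t and t.strip())
    return counts.most_common(1)[0][0] if counts else None

# Illustrative usage with invented token lists.
audio_feature = feature_word(["生日", "蛋糕", "生日", "祝福"])      # -> "生日"
image_feature = feature_word(["孩子-微笑", "蛋糕", "孩子-微笑"])    # -> "孩子-微笑"
caption_feature = feature_word(["生日快乐", "许愿", "生日快乐"])    # -> "生日快乐"
```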
In one possible implementation manner of the embodiment of the present application, when performing the integrating processing on the target feature word corresponding to the audio information, the target feature word corresponding to the image information, and the target feature word corresponding to the subtitle information, the integrating processing module 240 is configured to:
and integrating the audio feature words, the image feature words and the subtitle feature words by using a feature word integration model to obtain a target video tag, wherein the feature word integration model is obtained by training based on a large number of training feature phrases.
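Since the trained feature word integration model itself is not specified here, the sketch below only fixes a plausible interface; the model.predict call and the deduplicating fallback are assumptions made for this sketch, not the claimed implementation.

```python
from typing import List, Optional

def integrate_feature_words(audio_fw: Optional[str],
                            image_fw: Optional[str],
                            caption_fw: Optional[str],
                            model=None) -> List[str]:
    """Combine the three feature words into target video tag candidates."""
    feature_words = [fw for fw in (audio_fw, image_fw, caption_fw) if fw]
    if model is not None:
        # Assumed interface of a feature word integration model trained on
        # a large number of training feature phrases.
        return list(model.predict(feature_words))
    # Fallback used only for this sketch: deduplicate while preserving order.
    return list(dict.fromkeys(feature_words))
```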
In one possible implementation manner of the embodiment of the present application, the video tag association apparatus 200 further includes:
the semantic similarity matching module is used for carrying out semantic similarity matching based on the initial video tag and the target video tag to obtain a matching result;
if the matching result is that the matching fails, marking the data item corresponding to the video label of the target video as abnormal.
In one possible implementation manner of the embodiment of the present application, when taking the initial video tag and the target video tag as the video tags of the target video, the video tag determining module 250 is configured to:
determining classification items of the video tags corresponding to the initial video tags and the target video tags respectively, wherein the classification items of the video tags comprise a theme tag item, a genre tag item and an applicable state tag item;
Correspondingly, when performing semantic similarity matching based on the initial video tag and the target video tag to obtain a matching result, the semantic similarity matching module is configured to:
when the initial video tag and the target video tag are the same classification item of the video tag, semantic similarity matching is carried out on the basis of the initial video tag and the target video tag, and a matching result is obtained.
In one possible implementation manner of the embodiment of the present application, the video tag association apparatus 200 further includes:
the exception handling module is used for determining an exception handling type based on the exception handling instruction after the exception handling instruction is detected, wherein the exception handling type comprises: updating the video tag and updating the data item status;
if the exception handling type is updating the video tag, acquiring a manual tag and updating the video tag based on the manual tag;
and if the exception handling type is updating the data item status, marking the data item corresponding to the video tag of the target video as normal.
In one possible implementation manner of the embodiment of the present application, the video tag association apparatus 200 further includes:
the video tag modifying module is used for obtaining the search amount and the play condition of the target video in a preset time period, wherein the search amount is the number of times the target video appears on the display interface when users search with a search keyword, and the play condition comprises the click amount and the play duration of the target video on the display interface;
scoring the video tags of the target video based on the search amount and the play condition to obtain tag scores;
and judging whether the tag score is smaller than a score threshold, and if so, modifying the video tag of the target video.
In one possible implementation manner of the embodiment of the present application, when performing matching between the title information and all the local video tags in the local tag library to obtain an initial video tag corresponding to the title information, the initial video tag determining module 220 is configured to:
word segmentation processing is carried out based on the title information, so that a plurality of title word segments are obtained;
performing word segmentation cleaning based on the plurality of title word segments to obtain a plurality of target word segments;
aiming at each target word, matching is carried out by utilizing the target word and each local video tag in a local tag library, and a matching result corresponding to the target word is obtained;
and determining an initial video tag corresponding to the title information based on all the matching results.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the video tag association apparatus 200 described above may refer to the corresponding process in the foregoing method embodiment, and will not be described herein again.
In an embodiment of the present application, as shown in fig. 4, an electronic device 300 shown in fig. 4 includes: a processor 301 and a memory 303. Wherein the processor 301 is coupled to the memory 303, such as via a bus 302. Optionally, the electronic device 300 may also include a transceiver 304. It should be noted that, in practical applications, the transceiver 304 is not limited to one, and the structure of the electronic device 300 is not limited to the embodiment of the present application.
The processor 301 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with this disclosure. The processor 301 may also be a combination that implements a computing function, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 302 may include a path for transferring information between the above components. The bus 302 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 302 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 4, but this does not mean that there is only one bus or one type of bus.
The memory 303 may be, but is not limited to, a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 303 is used for storing application program code for executing the solution of the present application, and its execution is controlled by the processor 301. The processor 301 is configured to execute the application program code stored in the memory 303 to implement what is shown in the foregoing method embodiments.
Electronic devices include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and in-vehicle terminals (e.g., in-vehicle navigation terminals), and fixed terminals such as digital TVs and desktop computers; the electronic device may also be a server or the like. The electronic device shown in fig. 4 is only an example and should not impose any limitation on the functionality and scope of use of the embodiments herein.
The present application provides a computer readable storage medium having a computer program stored thereon, which when run on a computer, causes the computer to perform the corresponding method embodiments described above. Compared with the related art, the method and the device have the advantages that the initial video tag is determined based on matching of the title information and the local tag library, feature word analysis is conducted based on the audio information, the image information and the subtitle information, the corresponding target feature words are obtained, integration processing is conducted based on the corresponding target feature words, the target video tag is obtained, and then the initial video tag and the target video tag are used as video tags of target videos, so that association of the target videos and the video tags is completed. The video label of the target video is automatically determined based on the video information of multiple aspects of the target video so as to complete the association of the target video and the video label, and the efficiency of associating the target video and the video label is greatly improved.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
The foregoing is only a partial embodiment of the present application and it should be noted that, for a person skilled in the art, several improvements and modifications can be made without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (8)

1. A method of video tag association, comprising:
acquiring video information of a target video, wherein the video information comprises title information, audio information, image information and subtitle information;
based on the title information, matching the title information with all local video tags in a local tag library to obtain initial video tags corresponding to the title information, wherein a large number of local video tags are stored in the local tag library in advance;
performing feature word analysis based on the audio information, the image information and the subtitle information in the video information to obtain respective corresponding target feature words;
integrating the target feature words corresponding to the audio information, the target feature words corresponding to the image information and the target feature words corresponding to the subtitle information to obtain a target video tag;
taking the initial video tag and the target video tag as video tags of the target video to complete association of the target video and the video tags;
After the initial video tag and the target video tag are used as the video tags of the target video to complete the association between the target video and the video tags, the method further comprises:
performing semantic similarity matching based on the initial video tag and the target video tag to obtain a matching result;
if the matching result is that the matching fails, marking the data item corresponding to the video label of the target video as abnormal;
the taking the initial video tag and the target video tag as the video tags of the target video comprises:
determining classification items of the video tags corresponding to the initial video tag and the target video tag respectively, wherein the classification items of the video tags comprise a theme tag item, a genre tag item and an applicable state tag item;
correspondingly, the matching of semantic similarity between the initial video tag and the target video tag to obtain a matching result comprises the following steps:
and when the initial video tag and the target video tag are the same classification item of the video tag, carrying out semantic similarity matching based on the initial video tag and the target video tag to obtain a matching result.
2. The video tag association method according to claim 1, wherein the performing feature word analysis based on the audio information, the image information, and the subtitle information in the video information to obtain respective corresponding target feature words includes:
converting the audio information in the video information into text information, and performing word segmentation processing based on the text information to obtain a plurality of audio word segments;
determining an audio feature word from the plurality of audio tokens based on the frequency of each audio token;
performing image recognition based on the image information, and determining a plurality of entity objects and expressions corresponding to the entity objects;
aiming at each entity object, determining an object keyword corresponding to the entity object according to the entity object and the expression corresponding to the entity object;
determining image feature words from all object keywords based on the frequency of each object keyword;
extracting semantic characters based on the caption information to obtain a plurality of caption keywords;
determining caption feature words from a plurality of caption keywords based on the frequency of each caption keyword;
the audio feature words are target feature words corresponding to the audio information, the image feature words are target feature words corresponding to the image information, and the caption feature words are target feature words corresponding to the caption information.
3. The method according to claim 2, wherein the integrating the target feature word corresponding to the audio information, the target feature word corresponding to the image information, and the target feature word corresponding to the subtitle information to obtain the target video tag includes:
and integrating the audio feature words, the image feature words and the subtitle feature words by using a feature word integration model to obtain a target video tag, wherein the feature word integration model is obtained by training based on a large number of training feature phrases.
4. The method for associating a video tag according to claim 1, wherein after marking the data item corresponding to the video tag of the target video as abnormal, further comprises:
when an exception handling instruction is detected, determining an exception handling type based on the exception handling instruction, wherein the exception handling type comprises: updating the video tag and updating the data item status;
if the exception handling type is to update the video tag, acquiring an artificial tag, and updating the video tag based on the artificial tag;
and if the exception handling type is to update the data item status, marking the data item corresponding to the video tag of the target video as normal.
5. The video tag association method of claim 1, further comprising:
obtaining the searching amount and the playing condition of the target video in a preset time period, wherein the searching amount is the number of times of the target video data appearing on a display interface when a user inputs a search keyword to search, and the playing condition comprises the clicking amount and the playing time length of the target video in the display interface;
scoring the video tags of the target video based on the search amount and the playing condition to obtain tag scores;
and judging whether the label score is smaller than a score threshold value, and if so, modifying the video label of the target video.
6. The video tag association method according to any one of claims 1 to 5, wherein the obtaining an initial video tag corresponding to the title information based on matching the title information with all local video tags in a local tag library includes:
performing word segmentation processing based on the title information to obtain a plurality of title word segments;
performing word segmentation cleaning based on the title word segments to obtain target word segments;
aiming at each target word, matching the target word with each local video tag in a local tag library to obtain a matching result corresponding to the target word;
And determining the initial video label corresponding to the title information based on all the matching results.
7. A video tag association apparatus, comprising:
the video information acquisition module is used for acquiring video information of a target video, wherein the video information comprises title information, audio information, image information and subtitle information;
the initial video tag determining module is used for matching all the local video tags in the local tag library based on the title information to obtain initial video tags corresponding to the title information, wherein a large number of local video tags are stored in the local tag library in advance;
the feature word analysis module is used for carrying out feature word analysis based on the audio information, the image information and the subtitle information in the video information to obtain respective corresponding target feature words;
the integration processing module is used for integrating the target feature words corresponding to the audio information, the target feature words corresponding to the image information and the target feature words corresponding to the subtitle information to obtain a target video tag;
the video tag determining module is used for taking the initial video tag and the target video tag as video tags of the target video to complete association of the target video and the video tags;
The semantic similarity matching module is used for carrying out semantic similarity matching based on the initial video tag and the target video tag to obtain a matching result;
if the matching result is that the matching fails, marking the data item corresponding to the video label of the target video as abnormal;
the video tag determination module, when taking the initial video tag and the target video tag as the video tags of the target video, is configured to:
determining classification items of the video tags corresponding to the initial video tag and the target video tag respectively, wherein the classification items of the video tags comprise a theme tag item, a genre tag item and an applicable state tag item;
correspondingly, when the semantic similarity matching module performs the semantic similarity matching based on the initial video tag and the target video tag to obtain a matching result, the semantic similarity matching module is configured to:
and when the initial video tag and the target video tag are the same classification item of the video tag, carrying out semantic similarity matching based on the initial video tag and the target video tag to obtain a matching result.
8. An electronic device, comprising:
At least one processor;
a memory;
wherein the memory stores at least one application program configured to be executed by the at least one processor, the at least one application program configured to: performing the method of any one of claims 1-6.