CN113836992B - Label identification method, label identification model training method, device and equipment


Info

Publication number
CN113836992B
Authority
CN
China
Prior art keywords
video
features
feature
label
layer classifier
Prior art date
Legal status
Active
Application number
CN202110662545.9A
Other languages
Chinese (zh)
Other versions
CN113836992A
Inventor
尚焱
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110662545.9A
Publication of CN113836992A
Application granted
Publication of CN113836992B

Links

Classifications

    • G06F 16/5866: Information retrieval of still image data; retrieval characterised by using manually generated metadata, e.g. tags, keywords, comments, location and time information
    • G06F 16/7867: Information retrieval of video data; retrieval characterised by using manually generated metadata, e.g. tags, keywords, comments, title and artist information, user ratings
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24: Pattern recognition; classification techniques
    • G06F 18/253: Pattern recognition; fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A method for identifying a tag, a method for training a tag identification model, and a corresponding device and equipment are provided, relating to the field of video processing in network media. The method includes the following steps: extracting a plurality of modal features of a target image or video; performing feature fusion on the plurality of modal features to obtain a fused feature; based on the fused feature and the intermediate feature of the (i-1)-th layer classifier in an M-layer classifier, obtaining the intermediate feature of the i-th layer classifier, until the intermediate feature of the M-th layer classifier is obtained; outputting a probability distribution feature with the M-th layer classifier based on the intermediate feature of the M-th layer classifier; and determining the label of the target image or video based on the probability distribution feature. By extracting and fusing features of multiple modalities and reusing features across the levels of the M-layer classifier, the method improves the accuracy of the classifier, i.e., the accuracy of the probability distribution feature output by the M-th layer classifier, and thereby the accuracy of the identified label.

Description

Label identification method, label identification model training method, device and equipment
Technical Field
The embodiment of the application relates to the field of video processing of network media, and more particularly relates to a method for identifying a tag, a method for training a tag identification model, a device and equipment.
Background
With the advent of the fifth-generation mobile communication technology (5G) and the development of mobile internet platforms, the volume of video accumulated on internet platforms keeps growing, and the consumption of short videos and images has exploded, so that intelligent understanding of image or video content has become indispensable in every link of visual content processing. The most basic intelligent image understanding task is to label images or videos accurately and richly, which ensures that users or downstream tasks can quickly retrieve the images or videos and improves retrieval quality and efficiency.
At present, the tag of an image or video is typically identified as follows: the image or video is first characterized by visual features, which are then fed into a classifier for classification and tag identification, and the identified tag is output. However, expressing an image or video with visual features alone may lead to insufficient feature expression and, in turn, to low tag recognition accuracy; in addition, the recognition accuracy of the related classifiers is too low to meet the accuracy requirements of scenarios in which images or videos are quickly retrieved.
Therefore, a method of identifying tags is urgently needed in the art to improve tag recognition accuracy and effectiveness, and in particular to meet the recognition accuracy required by scenarios in which images or videos are quickly retrieved, thereby improving user experience.
Disclosure of Invention
The embodiments of the application provide a method of identifying a tag, a method of training a tag identification model, a device and equipment, which can improve tag recognition accuracy and effectiveness, and in particular can meet the recognition accuracy required by scenarios in which images or videos are quickly retrieved, thereby improving user experience.
In one aspect, a method of identifying a tag is provided, comprising:
extracting a plurality of modal features of a target image or video;
feature fusion is carried out on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
based on the fusion feature and the intermediate feature of the (i-1)-th layer classifier in an M-layer classifier, obtaining the intermediate feature of the i-th layer classifier by using the i-th layer classifier, until the intermediate feature of the M-th layer classifier is obtained; where 1 < i ≤ M, the features output by the i-th layer classifier in the M-layer classifier are used for identifying the label of the i-th level among M levels, and the intermediate feature of the first layer classifier in the M-layer classifier is obtained based on the fusion feature;
Based on the intermediate features of the M-th layer classifier, outputting probability distribution features by using the M-th layer classifier;
based on the probability distribution characteristics, a label of the target image or video is determined.
In another aspect, a method of training a tag recognition model is provided, comprising:
acquiring an image or video to be trained;
extracting a plurality of modal characteristics of the image or video to be trained;
feature fusion is carried out on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
acquiring an ith-level labeling label corresponding to the image or video to be trained;
taking the fusion feature, the intermediate feature of the (i-1)-th layer classifier in an M-layer classifier and the labeled label of the i-th level as inputs, and training the i-th layer classifier to obtain a label identification model, where 1 < i ≤ M, the first layer classifier in the M-layer classifier is trained based on the fusion feature and the labeled label of the first level, and the intermediate feature of the first layer classifier in the M-layer classifier is obtained based on the fusion feature.
In another aspect, the present application provides an apparatus for identifying a tag, comprising:
the extraction unit is used for extracting a plurality of modal characteristics of the target image or video;
The fusion unit is used for carrying out feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
the first determining unit is used for obtaining the intermediate feature of the i-th layer classifier by using the i-th layer classifier based on the fusion feature and the intermediate feature of the (i-1)-th layer classifier in an M-layer classifier, until the intermediate feature of the M-th layer classifier is obtained; where 1 < i ≤ M, the features output by the i-th layer classifier in the M-layer classifier are used for identifying the label of the i-th level among M levels, and the intermediate feature of the first layer classifier in the M-layer classifier is obtained based on the fusion feature;
the output unit is used for outputting probability distribution characteristics by utilizing the M-layer classifier based on the middle characteristics of the M-layer classifier;
and a second determining unit for determining the label of the target image or video based on the probability distribution characteristics.
In another aspect, the present application provides an apparatus for training a tag recognition model, comprising:
the first acquisition unit is used for acquiring images or videos to be trained;
the extraction unit is used for extracting a plurality of modal characteristics of the image or video to be trained;
the fusion unit is used for carrying out feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
The second acquisition unit is used for acquiring the ith-level labeling label corresponding to the image or video to be trained;
the training unit is used for taking the fusion feature, the intermediate feature of the (i-1)-th layer classifier in an M-layer classifier and the labeled label of the i-th level as inputs, and training the i-th layer classifier to obtain a label recognition model, where 1 < i ≤ M, the first layer classifier in the M-layer classifier is trained based on the fusion feature and the labeled label of the first level, and the intermediate feature of the first layer classifier in the M-layer classifier is obtained based on the fusion feature.
In another aspect, an embodiment of the present application provides an electronic device, including:
a processor adapted to execute a computer program;
a computer readable storage medium having a computer program stored therein, which when executed by the processor, implements the method of identifying a tag or the method of training a tag identification model described above.
In another aspect, embodiments of the present application provide a computer-readable storage medium storing computer instructions that, when read and executed by a processor of a computer device, cause the computer device to perform the method of identifying a tag or the method of training a tag identification model described above.
Based on the scheme, the plurality of modal features of the target image or video are extracted to perform feature fusion, which is equivalent to performing feature expression by fusing the plurality of modal features of the target image or video, so that the target image or video is more sufficient in terms of feature expression, and the identification accuracy of the tag can be improved.
In addition, based on the fusion characteristic and the intermediate characteristic of the i-1 th layer classifier in the M layer classifier, the intermediate characteristic of the i layer classifier is obtained by utilizing the i layer classifier until the intermediate characteristic of the M layer classifier is obtained, multiplexing of the intermediate characteristic of the front layer classifier by the rear layer classifier can be realized, and equivalently, the identification accuracy of the rear layer classifier to the tag can be improved by multiplexing the intermediate characteristic of the front layer classifier.
In other words, the M-th layer classifier considers the middle characteristics of the previous M-1 layer classifier in a layer-by-layer multiplexing manner, so that the accuracy of probability distribution characteristics output by the M-th layer classifier can be improved, which is equivalent to improving the identification accuracy of the label of the M-th level, namely improving the label accuracy of the identification target image or video.
In addition, because the accuracy of the label of the target image or video is improved, the quality and the efficiency of the retrieval can be further improved by utilizing the label for identifying the image or video to retrieve the image or video on an internet platform; meanwhile, video recommendation is carried out on the user by using the identified tag, so that the user experience of the product can be greatly improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic block diagram of a system framework provided in an embodiment of the present application.
Fig. 2 is a schematic flow chart of a method of identifying a tag provided by an embodiment of the present application.
Fig. 3 is a schematic block diagram of feature fusion provided by an embodiment of the present application.
Fig. 4 is a schematic block diagram of three-layer classifier feature multiplexing provided in an embodiment of the present application.
Fig. 5 is a schematic flow chart of a method for training a tag recognition model provided in an embodiment of the present application.
FIG. 6 is a schematic block diagram of an apparatus for identifying tags provided in an embodiment of the present application.
FIG. 7 is a schematic block diagram of an apparatus for training a tag recognition model provided in an embodiment of the present application.
Fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The solution provided by the present application may relate to artificial intelligence technology.
Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
It should be appreciated that artificial intelligence techniques are a comprehensive discipline involving a wide range of fields, both hardware-level and software-level techniques. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
With research and advancement of artificial intelligence technology, research and application of artificial intelligence technology is being developed in various fields, such as common smart home, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned, automatic driving, unmanned aerial vehicles, robots, smart medical treatment, smart customer service, etc., and it is believed that with the development of technology, artificial intelligence technology will be applied in more fields and with increasing importance value.
The embodiment of the application can relate to Computer Vision (CV) technology in artificial intelligence technology, wherein the CV is a science for researching how to make a machine "see", and further refers to a method for using a camera and a Computer to replace human eyes to recognize, track and measure a target, and further performing graphic processing, so that the Computer is processed into an image more suitable for human eyes to observe or transmit to an instrument to detect. As a scientific discipline, computer vision research-related theory and technology has attempted to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, synchronous positioning, and map construction, among others, as well as common biometric recognition techniques such as face recognition, fingerprint recognition, and others.
The scheme provided by the embodiment of the application also relates to image or video processing technology in the field of network media. Unlike conventional audio and video devices, network media relies on techniques and equipment provided by information technology (IT) device developers to transmit, store and process audio and video signals. The conventional Serial Digital Interface (SDI) transmission mode lacks network switching characteristics in the true sense; much work is required to build, on top of SDI, even part of the network functionality provided by Ethernet and the Internet Protocol (IP). Thus, network media technology in the video industry has emerged. Further, video processing technology for network media may include the transmission, storage and processing of audio and video signals as well as text recognition for audio and video. Among them, automatic speech recognition (ASR) converts human speech into text, its greatest advantage being a more natural and easier-to-use human-machine interface, and optical character recognition (OCR) obtains the text in an image by analyzing the position and type of characters in a scanned or photographed image.
It should be noted that, the device provided in the embodiment of the present application may be integrated in a server, where the server may include an independently operated server or a distributed server, or may include a server cluster or a distributed system formed by a plurality of servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, and basic cloud computing services such as big data and an artificial intelligent platform, where the servers may be directly or indirectly connected through wired or wireless communication modes, and this application is not limited herein.
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.
It should be noted that the scheme of the identification tag provided in the present application may be applied to any scenario that needs intelligent understanding for image or video content. For example, scenes such as picture and video searches, recommendations, audits, etc. In addition, in practical applications, an image or video may be described from different angles, such as a text description of an image or video title, a title drawing expressing the main content of the image or video, a plurality of image frames describing the details of the video, audio depicting the video expression, and the like. The richer the description angle used, the more accurate the representation of the image or video. The system framework provided in the embodiments of the present application is exemplified by extracting visual features, audio features and text features in a target image or video, and of course, in other alternative embodiments, other modal features in the target image or video, for example, time sequence features, etc., may be taken as an example, and the specific manifestations of the multiple modal features are not specifically limited in the present application.
The system framework provided in the present application will be described in detail below with reference to extracting visual, audio, and text features of a target image or video.
Fig. 1 is a schematic block diagram of a system framework 100 provided in an embodiment of the present application.
As shown in fig. 1, the system framework 100 may include a visual feature extraction module 111, an audio feature extraction module 112, a text feature extraction module 113, a feature fusion module 120, a hierarchical classification module 130, a candidate tag processing module 140, and a custom tag processing module 150, wherein the hierarchical classification module 130 may include a multi-layer classifier 131 and a tag identification module 132.
The visual feature extraction module 111 may be used to extract visual features of a target image or video, the audio feature extraction module 112 may be used to extract audio features of the target image or video, and the text feature extraction module 113 may be used to extract text features of the target image or video; the extracted visual, audio and text features are respectively sent to the feature fusion module 120, and in addition the text feature extraction module 113 may further send the extracted text information to the custom tag processing module 150. It should be noted that the visual feature extraction module 111, the audio feature extraction module 112 and the text feature extraction module 113 may be any modules having the corresponding feature extraction functions, which is not limited in this application; for example, the visual feature extraction module 111 may perform visual feature extraction based on the residual network ResNet backbone in the SlowFast channel video classification algorithm, the audio feature extraction module 112 may perform audio feature extraction based on the VGGish framework, and the text feature extraction module 113 may perform text feature extraction using the BERT framework, optionally supplementing the text information with optical character recognition (OCR) or automatic speech recognition (ASR); the specific extraction manner of the multiple modal features is not particularly limited in this application.
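For illustration only, a minimal sketch of extracting the text-modality feature of a video title with a BERT encoder from the Hugging Face transformers library is given below; the checkpoint name, the mean-pooling choice and the feature dimension are assumptions of this sketch rather than requirements of this application. Visual features (e.g., via a SlowFast/ResNet backbone) and audio features (e.g., via VGGish) would be extracted analogously and sent to the feature fusion module 120.

# Sketch: text-modality feature extraction with BERT (assumed checkpoint and pooling).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

def extract_text_feature(title: str) -> torch.Tensor:
    # Tokenize the title and mean-pool the token embeddings into one 768-d vector.
    inputs = tokenizer(title, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

text_feature = extract_text_feature("a husky running in the snow")
# The visual and audio extractors would produce their own feature tensors,
# which are then passed, together with text_feature, to the fusion module.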
The feature fusion module 120 is configured to receive the visual feature extraction module 111, the audio feature extraction module 112, and the text feature extraction module 113, and perform feature fusion on the received visual feature, audio feature, and text feature, to obtain a fused feature, and then send the fused feature to the hierarchical classification module 130. It should be noted that, the feature fusion module 120 may be any module having a feature fusion function, which is not limited in this application; for example, the feature fusion module may be a Transformer framework based feature fusion module.
The multi-layer classifier 131 in the hierarchical classification module 130 is configured to receive the fused features sent by the feature fusion module 120, and the post-layer classifier in the multi-layer classifier 131 is configured to implement multiplexing of the post-layer classifier on the intermediate features of the pre-layer classifier based on the received fused features and the intermediate features of the pre-layer classifier, so as to improve accuracy of the post-layer classifier, until probability distribution features output by a last layer classifier are obtained, and send the probability distribution features output by the last layer classifier to the tag identification module 132 in the hierarchical classification module 130, so that the tag identification module 132 identifies a tag of the image or video according to the probability distribution features output by the last layer classifier, and finally sends the obtained tag of the image or video to the candidate tag processing module 140; it should be noted that the probability distribution feature may be a distribution having a length or dimension N. Each bit or value in the probability distribution feature corresponds to a label, and the label corresponding to the maximum value or a value greater than a certain threshold in the probability distribution feature can be determined to be the label of the image or the video. In other words, the image or video may be labeled with the largest value in the probability distribution characteristics or a value greater than a certain threshold. It should be noted that, the multi-layer classifier 131 may be any multi-layer classifier, which is not limited in this application; for example, the multi-layer classifier may be a multi-layer classifier based on units of multi-layer perceptrons (MLP, multilayer Perceptron).
The custom tag processing module 150 may be configured to receive the text information of the target image or video sent by the text feature extraction module 113, perform word segmentation processing on the text information, match the processed multiple words with the custom tag to obtain a first tag set, and send the first tag set to the candidate tag processing module 140, where it is to be noted that a specific manner of word segmentation on the text information is not specifically limited in this application.
The candidate tag processing module 140 may be configured to receive the tag of the image or video sent by the tag identification module 132 in the hierarchical classification module 130 and the first tag set sent by the custom tag processing module 150, and supplement or deduplicate the tag of the image or video based on the received first tag set, to obtain a final tag of the target image or video.
As can be seen from the above, firstly, the visual feature extraction module 111, the audio feature extraction module 112 and the text feature extraction module 113 extract the visual feature, the audio feature and the text feature of the target image or video respectively, and the visual feature, the audio feature and the text feature of the target image or video are fused by the feature fusion module 120, which is equivalent to more accurate and sufficient feature expression of the target image or video; secondly, the fused features are used as input, and the hierarchical multiplexing of the intermediate features of different layers of classifiers is realized through the multi-layer classifier 131 in the hierarchical classification module 130, so that the accuracy of the last layer of classifier in the multi-layer classifier 131 is improved, namely the accuracy of probability distribution features output by the last layer of classifier is improved; finally, the probability distribution characteristics output by the last layer of classifier are used as the input of a tag identification module 132, and the tag identification module 132 is used for identifying the tag of the target image or video, so that the accuracy of tag identification of the target image or video is improved; in addition, in order to further improve accuracy, the tag identification module 132 sends the obtained tag of the target image or video to the candidate tag processing module 140, the text feature extraction module 113 sends the extracted text information to the custom tag processing module 150, the custom tag processing module 150 performs word segmentation on the received text information, matches the divided words with the tag defined in advance in the database to obtain a first tag set, and sends the first tag set to the candidate tag processing module 140, and the candidate tag processing module 140 uses the received first tag set to perform de-duplication or supplement on the received tag of the target image or video to obtain a final tag of the target image or video, so as to further improve accuracy of the tag of the target image or video.
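As a minimal sketch of the candidate tag processing step described above, the following illustrates matching segmented title words against a predefined custom-tag vocabulary to form the first tag set, and then using it to supplement and de-duplicate the classifier's tags; the word segmentation itself is left as a placeholder because this application does not fix a specific segmentation method.

# Sketch of the custom-tag matching and candidate-tag merging steps (names illustrative).
from typing import Iterable, List, Set

def match_custom_tags(title_words: Iterable[str], custom_tags: Set[str]) -> Set[str]:
    # First tag set: segmented title words that hit the predefined custom-tag vocabulary.
    return {word for word in title_words if word in custom_tags}

def merge_tags(classifier_tags: Iterable[str], first_tag_set: Set[str]) -> List[str]:
    # Supplement the classifier's tags with the first tag set and remove duplicates.
    merged: List[str] = []
    for tag in list(classifier_tags) + sorted(first_tag_set):
        if tag not in merged:
            merged.append(tag)
    return merged

final_tags = merge_tags(["dog", "snow"], match_custom_tags(["husky", "snow"], {"husky"}))
# final_tags == ["dog", "snow", "husky"]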
It should be understood that fig. 1 is only an example of the present application and should not be construed as limiting the present application.
For ease of understanding, the relevant terms in this application are described below.
Identification of image or video tags: image or video tag technology generally refers to high-level semantic description of the content of an image or video, is a basic task in computer vision, has an extremely important role in downstream tasks in the short video age, and has wide application in recommendation systems.
Multimode: multimodal refers in this application to multimedia data, which describes information such as text, video and speech of the same object or object entities with the same semantics in the internet.
Text recognition technology (optical character recognition, OCR): text recognition technology obtains text in an image by analyzing the position and type of characters in a scanned or photographed image.
Automatic speech recognition technology (automatic speech recognition, ASR): automatic speech recognition technology is a process of converting human speech into text by analyzing audio information.
The SlowFast video classification algorithm: two parallel convolutional neural networks, a Slow channel and a Fast channel, are applied to the same video segment; the Slow channel analyzes static content in the video, the Fast channel analyzes dynamic content in the video, both channels use a residual network ResNet model, and convolution operations are run immediately after a number of video frames are captured.
The method of identifying the tag provided in the present application will be described in detail below using fig. 2 to 4.
Fig. 2 is a schematic flow chart of a method 200 of identifying tags provided in an embodiment of the present application.
It should be noted that, the scheme provided in the embodiments of the present application may be executed by any electronic device having data processing capability. For example, the electronic device may be implemented as a server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, a cloud database, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, basic cloud computing services such as big data and an artificial intelligent platform, and the server may be directly or indirectly connected through a wired or wireless communication manner;
as shown in fig. 2, the method 200 may include some or all of the following:
s201, extracting a plurality of modal characteristics of a target image or video;
s202, carrying out feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
S203, based on the fusion feature and the intermediate feature of the (i-1)-th layer classifier in an M-layer classifier, obtaining the intermediate feature of the i-th layer classifier by using the i-th layer classifier, until the intermediate feature of the M-th layer classifier is obtained; where 1 < i ≤ M, the features output by the i-th layer classifier in the M-layer classifier are used for identifying the label of the i-th level among M levels, and the intermediate feature of the first layer classifier in the M-layer classifier is obtained based on the fusion feature;
s204, based on the middle characteristics of the M-th layer classifier, outputting probability distribution characteristics by using the M-th layer classifier;
s205, determining the label of the target image or video based on the probability distribution characteristics.
In other words, the server extracts a plurality of modal features of the target image or video and fuses them to obtain the fused feature, and takes the fused feature as the input of the M-layer classifier, so that the M-layer classifier obtains the intermediate feature of the i-th layer classifier by using the i-th layer classifier based on the fused feature and the intermediate feature of the (i-1)-th layer classifier, until the intermediate feature of the M-th layer classifier is obtained; the intermediate feature of the M-th layer classifier is then taken as the input of the M-th layer classifier, which outputs the probability distribution feature of the target image or video, and the label of the target image or video is determined based on that probability distribution feature. Here 1 < i ≤ M, the features output by the i-th layer classifier are used for identifying the label of the i-th level among M levels, and the intermediate feature of the first layer classifier is obtained based on the fused feature.
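The flow above can be summarized by the following sketch, which assumes the sub-modules are available as callables; all names are illustrative and do not denote a reference implementation of this application.

# Sketch of the S201-S205 flow with illustrative callables.
def identify_label(target, extractors, fuse, classifiers, pick_labels):
    modal_features = [extract(target) for extract in extractors]     # S201: per-modality features
    fused = fuse(modal_features)                                      # S202: feature fusion
    intermediate = classifiers[0].intermediate(fused)                 # layer 1 uses the fused feature
    for layer in classifiers[1:]:                                     # S203: reuse layer i-1's feature
        intermediate = layer.intermediate(fused, intermediate)
    probability_distribution = classifiers[-1].output(intermediate)   # S204: M-th layer output
    return pick_labels(probability_distribution)                      # S205: threshold/argmax selection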
Based on the scheme, the feature fusion is carried out on the plurality of modal features by extracting the plurality of modal features of the target image or video, which is equivalent to considering the feature expression of the plurality of modal features of the fused target image or video, so that the target image or video is more sufficient in the aspect of feature expression, and the identification accuracy of the tag can be improved.
In addition, based on the fusion characteristic and the intermediate characteristic of the i-1 th layer classifier in the M layer classifier, the intermediate characteristic of the i layer classifier is obtained by utilizing the i layer classifier until the intermediate characteristic of the M layer classifier is obtained, multiplexing of the intermediate characteristic of the front layer classifier by the rear layer classifier can be realized, and equivalently, the identification accuracy of the rear layer classifier to the tag can be improved by multiplexing the intermediate characteristic of the front layer classifier.
In other words, the M-th layer classifier considers the middle characteristics of the previous M-1 layer classifier in a layer-by-layer multiplexing manner, so that the accuracy of probability distribution characteristics output by the M-th layer classifier can be improved, which is equivalent to improving the identification accuracy of the label of the M-th level, namely improving the label accuracy of the identification target image or video.
In addition, because the accuracy of the label of the target image or video is improved, the quality and the efficiency of the retrieval can be further improved by utilizing the label identified by the image or video to retrieve the image or video on an internet platform; meanwhile, video recommendation is carried out on the user by using the identified tag, so that the user experience of the product can be greatly improved.
It should be noted that the plurality of modal features may include, but are not limited to, visual features, audio features and text features. The manner of extracting the visual features of the target image or video includes, but is not limited to, visual feature extraction based on the residual network (ResNet) backbone in the SlowFast channel video classification algorithm. The manner of extracting the audio features of the target image or video includes, but is not limited to, audio feature extraction based on the VGGish framework. The manner of extracting the text features of the target image or video includes, but is not limited to, text feature extraction using the BERT framework, optionally supplemented with text information obtained by optical character recognition (OCR) or automatic speech recognition (ASR); for example, the audio in the target video may be separated and the speech text in the audio obtained using ASR, which is not particularly limited in this application. It should also be noted that the manner of fusing the plurality of modal features includes, but is not limited to, feature fusion based on a Transformer framework, which is not particularly limited in this application. In addition, the M-layer classifier in this application is preferably a multi-layer classifier based on multi-layer perceptron (MLP) units, but may also be an M-layer classifier based on another framework, as long as the intermediate features of a preceding-layer classifier can be multiplexed by a succeeding-layer classifier, so as to realize hierarchical multiplexing of the intermediate features of the preceding-layer classifiers. It should be noted that the probability distribution feature may be a distribution with length or dimension N. Each bit or value in the probability distribution feature corresponds to a label, and the label corresponding to the maximum value, or to a value greater than a certain threshold, in the probability distribution feature can be determined as the label of the image or video. In other words, the image or video may be labeled according to the largest value in the probability distribution feature or the values greater than a certain threshold.
To verify the effectiveness of multiple modalities, manually labeled data of 22,328 videos collected from a service are taken as an example; the data are divided into a training set and a test set at a ratio of 7:1, and the improvement in tag recognition accuracy brought by this application can be seen in the experimental results of Table 1.
TABLE 1
Method | Highest classification error rate | Lowest classification error rate
Pure visual features (Baseline) | 61.48% | 28.46%
Visual features + speech features | 59.28% | 27.03%
Visual features + speech features + text features | 55.51% | 22.35%
As shown in Table 1, a comparison experiment was designed in which video frames are input as visual features, the audio of the video as audio features, and the title of the video as text features; the Baseline row gives the highest and lowest classification error rates obtained with pure visual features. The experiment shows that as the number of modalities increases, the classification error rates decrease to different degrees, indicating that multi-modal information helps improve label accuracy.
The feature fusion method provided in the present application will be described in detail below taking the extraction of visual features, audio features, and text features of a target image or video as an example. It should be noted that, in the present application, the visual feature, the audio feature, and the text feature of the target image or the video are taken as examples, but the present application is not limited to the visual feature, the audio feature, and the text feature, which are a plurality of modal features, and of course may also include other modal features, for example, a time sequence feature, etc., that is, the present application does not specifically limit the plurality of modal features, and only the visual feature, the audio feature, and the text feature of the target image or the video are taken as examples for detailed description.
Fig. 3 is a schematic block diagram of feature fusion provided by an embodiment of the present application.
As shown in fig. 3, the block diagram 300 may include a modality feature linear mapping module 301, a modality and position coding module 302; the linear mapping module 301 is configured to map the plurality of modal features into a plurality of first features with the same dimension; the mode and position encoding module 302 is configured to perform mode and position encoding on the plurality of first features to obtain a fusion feature.
It should be noted that the modality and position encoding module 302 may be a module based on a Transformer framework.
In some embodiments of the present application, S202 may include:
mapping the plurality of modal features into a plurality of first features of the same dimension respectively; and carrying out modal and position coding on the plurality of first features to obtain the fusion features.
As an example, the visual features, the audio features and the text features of the target image or the video are mapped into a plurality of first features with the same dimension respectively, so that feature fusion is facilitated to the visual features, the audio features and the text features of the target image or the video, then the plurality of first features with the same dimension are subjected to modal and position coding, that is, feature fusion is performed to the visual features, the audio features and the text features with the same dimension, and the fused features are obtained.
It should be noted that performing modality and position encoding on the plurality of first features, that is, fusing the first features corresponding to the respective modalities, may be done with a feature fusion model based on the Transformer framework, or of course with another feature fusion model.
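A minimal PyTorch sketch of such a fusion module is given below: each modality's features are linearly mapped to a common dimension, learned modality and position embeddings are added, and a Transformer encoder mixes the tokens; the dimensions, token limit and layer counts are assumptions of this sketch.

# Sketch of linear mapping + modality/position encoding + Transformer fusion (sizes assumed).
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, dims=(2048, 128, 768), d_model=512, max_tokens=64):
        super().__init__()
        # Map visual / audio / text features to the same dimension (the "first features").
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in dims)
        self.modality_emb = nn.Embedding(len(dims), d_model)
        self.position_emb = nn.Embedding(max_tokens, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, visual, audio, text):
        # Each input: (batch, seq_len_m, dims[m]).
        tokens, modality_ids = [], []
        for m, (proj, feat) in enumerate(zip(self.proj, (visual, audio, text))):
            tokens.append(proj(feat))
            modality_ids.append(torch.full(feat.shape[:2], m, dtype=torch.long))
        x = torch.cat(tokens, dim=1)
        x = x + self.modality_emb(torch.cat(modality_ids, dim=1))
        x = x + self.position_emb(torch.arange(x.size(1)))
        x = self.encoder(x)                  # cross-modal mixing
        return x.mean(dim=1)                 # one fused feature per sample

fused = MultiModalFusion()(torch.randn(2, 8, 2048), torch.randn(2, 10, 128), torch.randn(2, 16, 768))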
In some implementations, for a jth feature of the plurality of first features, correcting the jth feature based on other first features than the jth feature to obtain a second feature corresponding to the jth feature; the fusion feature is determined based on a plurality of second features corresponding to the plurality of first features, respectively.
As an example, first, for a first feature of a visual feature map of a plurality of first features, the first feature of the visual feature map is corrected based on a first feature of an audio feature map and a first feature of a text feature map other than the first feature of the visual feature map, to obtain a second feature corresponding to the first feature of the visual feature map; similarly, for a first feature of the audio feature map in the plurality of first features, correcting the first feature of the audio feature map based on a first feature of the visual feature map and a first feature of the text feature map other than the first feature of the audio feature map to obtain a second feature corresponding to the first feature of the audio feature map; similarly, for a first feature of the text feature map in the plurality of first features, the first feature of the text feature map is corrected based on a first feature of the visual feature map and a first feature of the audio feature map other than the first feature of the text feature map to obtain a second feature corresponding to the first feature of the text feature map, and then, based on the plurality of second features obtained respectively, the fused feature is obtained.
In the process of fusing the plurality of modal features, not only the relation of the visual features to the audio and text features is considered, but also the relation of the audio features to the visual and text features and the relation of the text features to the visual and audio features; that is, the plurality of second features corresponding to the plurality of first features are obtained through interaction and fusion among the plurality of modal features, which improves the degree of fusion of the second features, i.e., the fusion effect, and in turn the accuracy of tag identification.
It should be noted that the fused features may be obtained by performing feature stitching, feature addition, or feature multiplication based on the plurality of second features, which is not particularly limited in this application. It should be understood that the present application is not limited to the specific form of visual features, audio features, text features. For example, the visual feature, the audio feature, and the text feature may be a vector of a specific dimension, or may be a matrix of a specific dimension, which is not particularly limited in this application.
In some implementations, the weight corresponding to the jth feature is determined based on other first features other than the jth feature; and determining the product of the weight corresponding to the j-th feature and the j-th feature as a second feature corresponding to the j-th feature.
As an example, first, for a first feature of a visual feature map of a plurality of first features, a first weight corresponding to the first feature of the visual feature map is determined based on the first feature of an audio feature map and the first feature of a text feature map other than the first feature of the visual feature map, and then the first feature of the visual feature map is corrected by using the first weight to obtain a second feature corresponding to the first feature of the visual feature map; similarly, for a first feature of the audio feature map in the plurality of first features, first, determining a second weight corresponding to the first feature of the audio feature map based on the first feature of the visual feature map and the first feature of the text feature map other than the first feature of the audio feature map, and then correcting the first feature of the audio feature map by using the second weight to obtain a second feature corresponding to the first feature of the audio feature map; similarly, for a first feature of the text feature map of the plurality of first features, first, a third weight corresponding to the first feature of the text feature map is determined based on the first feature of the visual feature map and the first feature of the audio feature map other than the first feature of the text feature map, and then the first feature of the text feature map is corrected by using the third weight to obtain a second feature corresponding to the first feature of the text feature map.
Determining the weight corresponding to the jth feature based on other first features except the jth feature; and correcting the jth feature based on the weight of the jth feature, which is equivalent to preliminarily fusing other first features except the jth feature before determining the second feature corresponding to the jth feature, so as to improve the fusion degree of the second feature corresponding to the jth feature and the second feature corresponding to the other first features except the jth feature, and correspondingly, improve the accuracy of tag identification.
For example, the second feature corresponding to the first feature of the visual feature map, or the second feature corresponding to the first feature of the audio feature map, or the second feature corresponding to the first feature of the text feature map, may be determined based on the following formula (1):
Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V    (1)
where Q, K, V are the triplet vectors of the attention mechanism, and d_k denotes the dimension of K in the triplet.
Taking the text feature and the audio feature as an example, it is assumed that the text feature is composed of at least one word vector, where each word feature vector is 512-dimensional, the at least one word feature vector can be represented as a matrix, i.e. a third matrix, and the third matrix is mapped to a low-dimensional vector space, e.g. 64-dimensional, by three parameter matrices QM, KM, VM, to obtain a representation of the third matrix in the three low-dimensional vector space, i.e. Q, K, V of the third matrix. For example, the third matrix may be multiplied by QM, KM, VM, respectively, to obtain Q, K, V of the third matrix; assuming that the audio feature is made up of at least one audio feature vector, where each audio feature vector is 128-dimensional, the at least one audio feature vector may be represented as a matrix, i.e. a fourth matrix, which is mapped to a low-dimensional vector space, e.g. 64-dimensional, by means of three parameter matrices QM, KM, VM, to obtain a representation of the fourth matrix in the three low-dimensional vector space, i.e. Q, K, V of the fourth matrix. For example, the fourth matrix may be multiplied by QM, KM, VM, respectively, to obtain Q, K, V of the fourth matrix.
Matrix multiplication is performed on Q of the third matrix and K of the third matrix to obtain a matrix A, matrix multiplication is performed on Q of the fourth matrix and K of the fourth matrix to obtain a matrix B, and the matrices A and B are averaged to obtain a matrix A1; matrix A1 is then scaled, for example each element is divided by the square root of the dimension of the K vector, which prevents the inner product from becoming too large and avoids entering regions where the gradient is 0 during training.
In short, Q of the third matrix is matrix-multiplied with K of the third matrix, Q of the fourth matrix is matrix-multiplied with K of the fourth matrix, and the two multiplication results are respectively normalized and then averaged to obtain the weight corresponding to the first feature of the visual feature mapping; the first feature of the visual feature mapping is then corrected using this weight to obtain the second feature corresponding to the first feature of the visual feature mapping.
Note that Q, K, V of the third matrix or the fourth matrix may be obtained using multi-head attention, where "multi-head" refers to using multiple sets of initialization values when initializing the parameter matrices QM, KM, VM.
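The computation can be illustrated with the following sketch of formula (1); QM, KM and VM are the parameter matrices mentioned above, and the 64-dimensional low-dimensional space and the token counts are assumptions taken from the example.

# Sketch of formula (1): scaled dot-product attention over Q, K, V (shapes illustrative).
import math
import torch

def qkv(features, QM, KM, VM):
    # Map one modality's feature matrix to its Q, K, V representations.
    return features @ QM, features @ KM, features @ VM

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, as in formula (1).
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

text = torch.randn(10, 512)                          # ten 512-d word feature vectors (third matrix)
QM, KM, VM = (torch.randn(512, 64) for _ in range(3))
Q, K, V = qkv(text, QM, KM, VM)
second_features = attention(Q, K, V)                 # (10, 64) corrected features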
In some embodiments of the present application, the M-layer classifier is an M-layer classifier based on a unit of a multi-layer perceptron MLP, S203 may include:
splicing the fusion feature and the middle feature output by the last hidden layer in the i-1 layer classifier to obtain the spliced feature of the i layer classifier; and taking the spliced features of the ith layer classifier as input, and obtaining the intermediate features of the ith layer classifier by using the ith layer classifier.
Obtaining the intermediate feature of the i-th layer classifier by using the i-th layer classifier based on the fusion feature and the intermediate feature of the (i-1)-th layer classifier in the M-layer classifier is equivalent to having the i-th layer classifier take the intermediate features of the preceding i-1 layer classifiers into account in a layer-by-layer multiplexing manner; this improves the tag recognition accuracy of the i-th layer classifier and ultimately that of the M-th layer classifier, i.e., the accuracy of the label of the target image or video.
In some embodiments of the present application, S205 may include:
determining a first numerical value greater than a preset threshold in the probability distribution characteristics based on the probability distribution characteristics; identifying a tag corresponding to the first value in at least one tag; and determining the label corresponding to the first numerical value as the label of the target image or video, wherein the dimension of at least one label is equal to the dimension of the probability distribution characteristic.
It should be appreciated that the probability distribution feature may be a distribution of length or dimension N. And each bit or value in the probability distribution characteristics corresponds to one label, and the label corresponding to the first value larger than the preset threshold in the probability distribution characteristics is determined to be the label of the target image or video. In other words, the target image or video may be labeled with a first value in the probability distribution feature that is greater than a preset threshold.
It should be noted that the preset threshold may be a range of values or a specific value. Of course, the preset thresholds corresponding to tags of different levels may also differ partially or completely; for example, the preset threshold corresponding to an upper-level label may be greater than or equal to the preset threshold corresponding to a lower-level label. For instance, the label "dog" may correspond to a preset threshold of 8 or 9, and the label "Husky" to a preset threshold of 5 or 6. Of course, the specific values above are merely examples, and the present application is not limited thereto. Furthermore, a value in the probability distribution feature may be used to represent the estimated accuracy with which the label corresponding to that value is estimated to be a label of the target image or video. In addition, the labels corresponding to the bits or values of the probability distribution feature may have semantic relationships, such as hypernym-hyponym, similar, or opposite semantic relationships; for example, the labels "dog" and "Husky" have a hypernym-hyponym relationship, the labels "African elephant" and "Asian elephant" have a similar semantic relationship, and the labels "day" and "night" have an opposite semantic relationship.
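As a minimal sketch of this selection step, the following keeps every label whose value in the probability distribution feature exceeds its (possibly label-specific) preset threshold; the label names, thresholds and normalization are illustrative assumptions.

# Sketch of S205: threshold-based label selection (values and names illustrative).
from typing import Dict, List, Sequence

def select_labels(prob_feature: Sequence[float],
                  labels: Sequence[str],
                  thresholds: Dict[str, float],
                  default_threshold: float = 0.5) -> List[str]:
    # The dimension of the label list equals the dimension of the probability distribution feature.
    assert len(prob_feature) == len(labels)
    return [label for value, label in zip(prob_feature, labels)
            if value > thresholds.get(label, default_threshold)]

tags = select_labels([0.91, 0.40, 0.75], ["dog", "cat", "Husky"], {"dog": 0.8, "Husky": 0.6})
# tags == ["dog", "Husky"]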
An example in which the M-layer classifier is a three-layer classifier will be described in further detail below in conjunction with fig. 4, which is a schematic block diagram of three-layer classifier feature multiplexing provided in an embodiment of the present application.
As shown in fig. 4, the block diagram includes a first-layer classifier 410, a second-layer classifier 420, and a third-layer classifier 430. Each layer classifier is based on an MLP, i.e., each layer classifier includes an input layer, at least one hidden layer, and an output layer; the figure is illustrated with a single hidden layer as an example.
As shown in fig. 4, the fused feature is first sent to each of the classifiers. After the input layer of the MLP in the first-layer classifier 410 receives the fused feature, the hidden layer of the MLP in the first-layer classifier 410 outputs the intermediate feature of the first-layer classifier 410. The intermediate feature of the first-layer classifier 410 is then spliced with the fused feature, and the spliced feature is sent to the second-layer classifier 420. After the input layer of the MLP in the second-layer classifier 420 receives the spliced feature, the hidden layer of the MLP in the second-layer classifier 420 outputs the intermediate feature of the second-layer classifier 420. The intermediate feature of the second-layer classifier 420 is then spliced with the fused feature, and the spliced feature is sent to the third-layer classifier 430. After the input layer of the MLP in the third-layer classifier 430 receives the spliced feature, the output layer of the MLP in the third-layer classifier 430 outputs the probability distribution feature of the third-layer classifier 430, and the label of the target image or video is determined based on this probability distribution feature. Of course, the output layer of the MLP in each layer classifier may output a corresponding probability distribution feature, i.e., the probability distribution feature of each layer corresponds to the labels of that level.
Through this design of the multi-layer classifier with the MLP as its basic unit, the feature expression corresponding to the upper-level labels is effectively utilized, i.e., features are multiplexed across layers, which improves the classification performance of the multi-layer classifier and thus the label recognition accuracy.
For ease of understanding, the MLP will be described below.
A multi-layer perceptron (MLP, Multilayer Perceptron), also referred to as an artificial neural network (ANN, Artificial Neural Network), may have multiple hidden layers between its input layer and output layer; the simplest MLP has only one hidden layer, i.e., three layers in total. The MLP does not prescribe the number of hidden layers, so a suitable number can be chosen according to the requirements at hand, and there is likewise no restriction on the number of neurons in the output layer. The layers of the multi-layer perceptron are fully connected, meaning that any neuron of one layer is connected to all neurons of the next layer. For the input layer, if the input is an n-dimensional vector, there are n neurons. Denoting the input by the vector X, the output of the hidden layer is f(W1·X + b1), where W1 is the weight matrix (also called the connection coefficients), b1 is the bias, and the activation function f can be a commonly used sigmoid or tanh function. The mapping from the hidden layer to the output layer can be seen as multi-class logistic regression, i.e., softmax regression, so the output of the output layer is softmax(W2·X1 + b2), where X1 denotes the hidden-layer output f(W1·X + b1). If the softmax function outputs a k-dimensional column vector, the value in each dimension represents the probability of the corresponding class.
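The forward pass just described can be sketched as follows; the weight shapes, the random initialization and the choice of the sigmoid activation are assumptions used only to make the example runnable.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mlp_forward(X, W1, b1, W2, b2):
    X1 = sigmoid(W1 @ X + b1)        # hidden-layer output f(W1·X + b1)
    return softmax(W2 @ X1 + b2)     # output layer: softmax(W2·X1 + b2), a k-dimensional probability vector

# Example with an n=4-dimensional input, 3 hidden neurons and k=2 output classes (shapes are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=4)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
print(mlp_forward(X, W1, b1, W2, b2))   # two probabilities summing to 1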
MLPs are typically trained on mini-batches (Mini-Batch), where a mini-batch is a subset of training data randomly selected from the full training set T. Assuming the training set T contains N samples and each mini-batch has batch size b, the training data can be divided into N/b mini-batches. When the model is trained with SGD, processing one mini-batch is called one step, and after N/b steps the whole training set has been traversed once, which is called one epoch. After one epoch, the training data are randomly shuffled to disturb their order, the above steps are repeated, and training of the next epoch begins; the model is fully trained over multiple epochs.
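The step/epoch bookkeeping described above can be sketched as follows; the function name sgd_step stands for a hypothetical single SGD update and is not part of this application.

import random

def train(dataset, batch_size, num_epochs, sgd_step):
    # One step = one mini-batch; one epoch = N/b steps over the shuffled data.
    # dataset is assumed to be a mutable list of samples.
    n = len(dataset)
    steps_per_epoch = n // batch_size
    for epoch in range(num_epochs):
        random.shuffle(dataset)                          # random shuffling before each epoch
        for step in range(steps_per_epoch):
            mini_batch = dataset[step * batch_size:(step + 1) * batch_size]
            sgd_step(mini_batch)                         # one SGD update on this mini-batch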
In some embodiments of the present application, the method 200 may further comprise:
acquiring text information of the target image or video, the text information including at least one of: text of the target image or video, title of the target image or video, and annotation text of the target image or video; and supplementing or de-duplicating the label of the target image or video based on the text information to obtain the final label of the target image or video.
In this embodiment, based on the text information of the target image or video, the label of the target image or video is supplemented or de-duplicated to obtain the final label of the target image or video, so that the accuracy of the label of the target image or video can be further improved.
In one implementation, the text information is segmented to obtain a plurality of segmented words corresponding to the target image or video; matching the plurality of segmentation words with a custom dictionary to obtain a first tag set of the target image or video; based on the first tag set, the tags of the target image or video are supplemented or de-duplicated.
For example, the text information may first be recognized according to a knowledge graph and segmented to obtain a plurality of entities of the text information; each entity is then matched against a custom dictionary to obtain the first tag set, and finally the labels of the target image or video are supplemented or de-duplicated according to the labels in the first tag set. Of course, the manner of word segmentation of the text information is not particularly limited in the present application.
In one implementation, the labels of the target image or video are supplemented or de-duplicated using the semantic correlation between the labels of the target image or video and the first tag set and/or a de-duplication number threshold, i.e., a threshold on the number of labels retained.
As an example, the semantic relationship between the labels of the target image or video and the labels in the first tag set may be used to supplement or de-duplicate the labels of the target image or video. Taking the hypernym-hyponym relationship as an example, the labels "dog" and "Husky" have such a relationship, and the subordinate (more specific) label may be preferred when supplementing or de-duplicating: if the labels of the target image or video lack the subordinate label, it is supplemented; if both the superordinate and the subordinate label are present, the redundant one is removed, so that the labels of the target image or video become more accurate.
As another example, the labels of the target image or video may be supplemented or de-duplicated using a threshold on the number of labels: a maximum number of labels for the target image or video may be designed manually in advance, and if the number of obtained labels exceeds this threshold, the redundant labels are removed according to the semantic-correlation rule above.
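A hedged sketch of the supplement/de-duplication logic described in the two examples above; the function refine_tags, the mapping hyponym_of and the parameter max_tags are hypothetical names introduced for illustration, and the preference for the subordinate label is one possible reading of the rule.

def refine_tags(model_tags, text_tags, hyponym_of, max_tags):
    # model_tags: labels output by the classifier; text_tags: the first tag set matched from the text;
    # hyponym_of maps a subordinate tag to its superordinate tag, e.g. {"Husky": "dog"}.
    tags = list(dict.fromkeys(model_tags))               # drop exact duplicates, keep order
    for tag in text_tags:
        if tag not in tags:
            tags.append(tag)                             # supplement from the first tag set
    # prefer the subordinate tag: drop a tag if one of its hyponyms is also present
    superordinates = {hyponym_of[t] for t in tags if t in hyponym_of}
    tags = [t for t in tags if t not in superordinates]
    return tags[:max_tags]                               # enforce the number threshold

print(refine_tags(["dog", "dog"], ["Husky"], {"Husky": "dog"}, max_tags=5))  # -> ['Husky']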
It should be understood that the above manner of supplementing or de-duplicating the tag of the target image or video is merely an example of the present application, and of course, the tag of the target image or video may also be supplemented or de-duplicated in other manners, which is not specifically limited in this application.
Obtaining the first tag set based on the text information of the target image or video and using it to supplement or de-duplicate the obtained labels is equivalent to first improving the classification accuracy of the M-layer classifier and then further improving the accuracy of the finally generated labels with the first tag set, so that the resulting labels of the target image or video have higher practical value.
Fig. 5 is a schematic flow chart of a method 500 of training a tag recognition model provided in an embodiment of the present application.
As shown in fig. 5, the method 500 may include some or all of the following:
S501, acquiring an image or video to be trained;
S502, extracting a plurality of modal features of the image or video to be trained;
S503, performing feature fusion on the plurality of modal features to obtain a fusion feature after the plurality of modal features are fused;
S504, acquiring the labeling label of the i-th level corresponding to the image or video to be trained;
S505, training the i-th layer classifier by taking the fusion feature, the intermediate feature of the (i-1)-th layer classifier in the M-layer classifier, and the labeling label of the i-th level as inputs, so as to obtain a label identification model, where i is greater than 1 and less than or equal to M, the first-layer classifier in the M-layer classifier is obtained by training based on the fusion feature and the labeling label of the first level, and the intermediate feature of the first-layer classifier in the M-layer classifier is obtained based on the fusion feature.
Based on this scheme, a plurality of modal features of the image or video to be trained are extracted and fused, which is equivalent to expressing the image or video to be trained by fusing its multiple modal features, so that its feature expression is more sufficient and the recognition accuracy of the model for labels can be improved.
In addition, training the i-th layer classifier with the fusion feature, the intermediate feature of the (i-1)-th layer classifier in the M-layer classifier, and the labeling label of the i-th level as inputs to obtain the label identification model is equivalent to having the later-layer classifier multiplex the intermediate features of the earlier-layer classifiers. On one hand, this improves the label recognition accuracy of the later-layer classifier, i.e., the accuracy of the labels output by the model for the image or video; on the other hand, because the later-layer classifier multiplexes the intermediate features of the earlier-layer classifiers, the convergence rate of the model is increased, the training time is shortened, and the training efficiency is improved.
Moreover, because the accuracy of the labels identified by the model is improved, using the labels output by the label identification model to retrieve images or videos on an internet platform can further improve retrieval quality and efficiency; meanwhile, recommending videos to users based on the identified labels can greatly improve the user experience of the product.
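A minimal training-step sketch of S501-S505, reusing the hypothetical HierarchicalClassifier from the earlier sketch; the multi-label binary cross-entropy loss and the per-level target encoding are assumptions, not requirements of this application.

import torch
import torch.nn as nn

def train_step(model, optimizer, fused_feature, level_targets):
    # fused_feature: (batch, fused_dim); level_targets[i]: multi-hot labels of level i+1.
    # The i-th classifier inside the model reuses the intermediate feature of the (i-1)-th classifier.
    optimizer.zero_grad()
    logits_per_level = model(fused_feature)
    criterion = nn.BCEWithLogitsLoss()
    loss = sum(criterion(logits, target)
               for logits, target in zip(logits_per_level, level_targets))
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (hypothetical): model = HierarchicalClassifier(fused_dim=1024, hidden_dim=512,
#                                                      num_labels_per_level=[20, 100, 500])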
It should be noted that the plurality of modal features may include, but are not limited to, visual features, audio features, and text features. Extracting the visual features of the target image or video may be, but is not limited to, extracting visual features based on a residual network (ResNet) backbone within the SlowFast video classification framework; extracting the audio features may be, but is not limited to, extracting audio features based on the VGGish framework; and extracting the text features may be, but is not limited to, extracting text features using a BERT framework. Text information may also be supplemented using optical character recognition (OCR) or automatic speech recognition (ASR) while using the BERT framework, for example by separating the audio from the target video and obtaining the spoken text in the audio via ASR, which is not particularly limited in the present application. It should also be noted that the feature fusion of the plurality of modal features may be, but is not limited to, feature fusion based on a Transformer framework, which is likewise not particularly limited in the present application. In addition, the label identification model in the present application may be a model built with a multi-layer perceptron (MLP, Multilayer Perceptron) as its basic unit; of course, other frameworks may also be used, as long as the framework enables the later-layer classifier to multiplex the intermediate features of the earlier-layer classifiers, which is not particularly limited in the present application.
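Purely as an illustration of how the three modal features might be assembled before fusion, the following sketch uses extract_visual, extract_audio and extract_text as hypothetical wrappers around the frameworks mentioned above (they are not real library calls), and the attributes frames, audio, title, ocr_text and asr_text are likewise assumed.

def extract_modal_features(video, extract_visual, extract_audio, extract_text):
    # Each extractor is assumed to return a 1-D feature vector for its modality,
    # e.g. visual via a ResNet/SlowFast backbone, audio via VGGish, text via BERT
    # (with the text optionally supplemented by OCR and ASR transcripts).
    return {
        "visual": extract_visual(video.frames),
        "audio": extract_audio(video.audio),
        "text": extract_text(" ".join([video.title, video.ocr_text, video.asr_text])),
    }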
In some embodiments of the present application, S503 may include:
mapping the plurality of modal features into a plurality of first features of the same dimension respectively;
and carrying out modal and position coding on the plurality of first features to obtain the fusion features.
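A minimal sketch of mapping the modal features to a common dimension and adding modal-type and position encodings; the additive embeddings and the use of the modality index as the position index are simplifying assumptions.

import torch
import torch.nn as nn

class ModalEncoding(nn.Module):
    def __init__(self, modal_dims, common_dim, max_positions=16):
        super().__init__()
        # one projection per modality, mapping each modal feature to the same dimension
        self.proj = nn.ModuleList([nn.Linear(d, common_dim) for d in modal_dims])
        self.modal_emb = nn.Embedding(len(modal_dims), common_dim)   # modal-type encoding
        self.pos_emb = nn.Embedding(max_positions, common_dim)       # position encoding

    def forward(self, modal_features):
        # modal_features: list of tensors, one per modality, each of shape (batch, modal_dims[k])
        tokens = []
        for k, feat in enumerate(modal_features):
            first = self.proj[k](feat)                               # "first feature" of modality k
            idx = torch.full((feat.shape[0],), k, dtype=torch.long)
            tokens.append(first + self.modal_emb(idx) + self.pos_emb(idx))
        return torch.stack(tokens, dim=1)                            # (batch, num_modalities, common_dim)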
In some embodiments of the present application, the performing of modal and position coding on the plurality of first features to obtain the fusion feature may include:
for the j-th feature in the plurality of first features, correcting the j-th feature based on the other first features except the j-th feature to obtain a second feature corresponding to the j-th feature;
the fusion feature is determined based on a plurality of second features corresponding to the plurality of first features, respectively.
In some embodiments of the present application, the correcting of the j-th feature based on the other first features except the j-th feature may include:
determining the weight corresponding to the jth feature based on other first features except the jth feature;
and determining the product of the weight corresponding to the j-th feature and the j-th feature as a second feature corresponding to the j-th feature.
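One possible reading of this weighting step is sketched below: the weight of the j-th first feature is computed from the other first features, and the second feature is the elementwise product of that weight and the j-th feature; the sigmoid gating layer and the mean-pooling of the other features are assumptions.

import torch
import torch.nn as nn

class CrossFeatureWeighting(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())   # produces per-dimension weights

    def forward(self, first_features):
        # first_features: (batch, num_modalities, dim), num_modalities >= 2 assumed
        second_features = []
        n = first_features.shape[1]
        for j in range(n):
            others = torch.cat([first_features[:, :j], first_features[:, j + 1:]], dim=1)
            weight = self.gate(others.mean(dim=1))                     # weight of the j-th feature from the other features
            second_features.append(weight * first_features[:, j])      # second feature = weight * j-th feature
        return torch.stack(second_features, dim=1)                     # basis for determining the fusion feature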
In some embodiments of the present application, prior to S505, the method 500 may further include:
acquiring text information of the image or video to be trained, the text information including at least one of: text of the image or video to be trained, a title of the image or video to be trained, and annotation text of the image or video to be trained; and supplementing or de-duplicating the labeling label of the i-th level based on the text information to obtain a final labeling label of the i-th level;
Wherein S505 may include:
and training the ith layer classifier by taking the fusion characteristic, the intermediate characteristic of the ith-1 layer classifier in the M layer classifier and the final labeling label of the ith level as inputs so as to obtain the identification label model.
In some embodiments of the present application, the text information is segmented to obtain a plurality of word segments corresponding to the image or video to be trained; the plurality of word segments are matched against a custom dictionary to obtain a first tag set of the image or video to be trained; and the labeling label of the i-th level is supplemented or de-duplicated based on the first tag set to obtain the final labeling label of the i-th level.
In some embodiments of the present application, the method 500 may further comprise:
and supplementing or de-duplicating the labeling label of the i-th level by utilizing the semantic correlation between the labeling label of the i-th level and the first label set and/or the de-duplication number threshold of the labeling label of the i-th level and the first label set.
It should be noted that the scheme for fusing the plurality of modal features in the method of training the label identification model may be the same as the scheme for fusing the plurality of modal features in the label identification method, and is not repeated here. Likewise, the manner of obtaining the intermediate features of the (i-1)-th layer classifier in the method of training the label identification model may be the same as the manner of obtaining the intermediate features of the (i-1)-th layer classifier in the label identification method.
The preferred embodiments of the present application have been described in detail above with reference to the accompanying drawings, but the present application is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solutions of the present application within the scope of the technical concept of the present application, and all the simple modifications belong to the protection scope of the present application. For example, the specific features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various possible combinations are not described in detail. As another example, any combination of the various embodiments of the present application may be made without departing from the spirit of the present application, which should also be considered as disclosed herein.
It should be further understood that, in the various method embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present application.
The method provided by the embodiment of the application is described above, and the device provided by the embodiment of the application is described below.
Fig. 6 is a schematic block diagram of an apparatus 600 for identifying tags provided by an embodiment of the present application.
As shown in fig. 6, the apparatus 600 may include:
an extracting unit 610 for extracting a plurality of modality features of a target image or video;
a fusion unit 620, configured to perform feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
a first determining unit 630, configured to obtain intermediate features of the ith layer classifier by using the ith layer classifier based on the fusion feature and intermediate features of the ith-1 layer classifier in the M layer classifiers until intermediate features of the M layer classifier are obtained; the i is more than 1 and less than or equal to M, the characteristics output by the ith layer of classifier in the M layers of classifiers are used for identifying the label of the ith level in M levels, and the intermediate characteristics of the first layer of classifier in the M layers of classifiers are obtained based on the fusion characteristics;
an output unit 640 for outputting probability distribution characteristics using the M-th layer classifier based on the intermediate characteristics of the M-th layer classifier;
a second determining unit 650 for determining a label of the target image or video based on the probability distribution characteristics.
In some embodiments of the present application, the fusion unit 620 is configured to:
Mapping the plurality of modal features into a plurality of first features of the same dimension respectively;
and carrying out modal and position coding on the plurality of first features to obtain the fusion features.
In some embodiments of the present application, the fusion unit 620 is configured to:
correcting the jth feature based on other first features except the jth feature aiming at the jth feature in the plurality of first features to obtain a second feature corresponding to the jth feature;
the fusion feature is determined based on a plurality of second features corresponding to the plurality of first features, respectively.
In some embodiments of the present application, the fusion unit 620 is configured to:
determining the weight corresponding to the jth feature based on other first features except the jth feature;
and determining the product of the weight corresponding to the j-th feature and the j-th feature as a second feature corresponding to the j-th feature.
In some embodiments of the present application, the first determining unit 630 is configured to:
based on the fusion feature and the intermediate feature of the ith-1 layer classifier in the M-layer classifier, obtaining the intermediate feature of the ith layer classifier by using the ith layer classifier comprises the following steps:
splicing the fusion feature and the middle feature output by the last hidden layer in the i-1 layer classifier to obtain the spliced feature of the i layer classifier;
And taking the spliced features of the ith layer classifier as input, and obtaining the intermediate features of the ith layer classifier by using the ith layer classifier.
In some embodiments of the present application, the second determining unit 650 is configured to:
determining a first numerical value greater than a preset threshold in the probability distribution characteristics based on the probability distribution characteristics;
identifying a tag corresponding to the first value in at least one tag;
and determining the label corresponding to the first numerical value as the label of the target image or video, wherein the dimension of at least one label is equal to the dimension of the probability distribution characteristic.
In some embodiments of the present application, the extraction unit 610 is further configured to:
acquiring text information of the target image or video, the text information including at least one of: text of the target image or video, title of the target image or video, and annotation text of the target image or video;
and supplementing or de-duplicating the label of the target image or video based on the text information to obtain the final label of the target image or video.
In some embodiments of the present application, the first determining unit 630 is further configured to:
word segmentation is carried out on the text information, and a plurality of word segments corresponding to the target image or video are obtained;
Matching the plurality of segmentation words with a custom dictionary to obtain a first tag set of the target image or video;
based on the first tag set, the tags of the target image or video are supplemented or de-duplicated.
In some embodiments of the present application, the first determining unit 630 is further configured to:
and supplementing or de-duplicating the label of the target image or video by utilizing the semantic correlation of the label of the target image or video and the first label set and/or the de-duplication number threshold of the label of the target image or video and the first label set.
Fig. 7 is a schematic block diagram of an apparatus 700 for training a tag recognition model provided in an embodiment of the present application.
As shown in fig. 7, the apparatus 700 may include:
a first acquiring unit 710, configured to acquire an image or video to be trained;
an extracting unit 720, configured to extract a plurality of modal features of the image or video to be trained;
a fusion unit 730, configured to perform feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
a second obtaining unit 740, configured to obtain an ith level label corresponding to the image or video to be trained;
the training unit 750 is configured to train the ith layer classifier by using the fusion feature, the intermediate feature of the ith-1 layer classifier in the M layer classifier, and the label tag of the ith level as inputs, so as to obtain a tag recognition model, where i is greater than 1 and less than or equal to M, and a first layer classifier in the M layer classifier is obtained by training based on the fusion feature and the label tag of the first level, and the intermediate feature of the first layer classifier in the M layer classifier is obtained based on the fusion feature.
In some embodiments of the present application, the fusion unit 730 is configured to:
mapping the plurality of modal features into a plurality of first features of the same dimension respectively;
and carrying out modal and position coding on the plurality of first features to obtain the fusion features.
In some embodiments of the present application, the fusion unit 730 is configured to:
correcting the jth feature based on other first features except the jth feature aiming at the jth feature in the plurality of first features to obtain a second feature corresponding to the jth feature;
the fusion feature is determined based on a plurality of second features corresponding to the plurality of first features, respectively.
In some embodiments of the present application, the fusion unit 730 is configured to:
determining the weight corresponding to the jth feature based on other first features except the jth feature;
and determining the product of the weight corresponding to the j-th feature and the j-th feature as a second feature corresponding to the j-th feature.
In some embodiments of the present application, the first obtaining unit 710 is further configured to:
acquiring text information of the image or video to be trained, the text information including at least one of: text of the image or video to be trained, a title of the image or video to be trained, and annotation text of the image or video to be trained; and supplementing or de-duplicating the labeling label of the i-th level based on the text information to obtain a final labeling label of the i-th level;
The training unit 750 may specifically be used for:
and training the ith layer classifier by taking the fusion characteristic, the intermediate characteristic of the ith-1 layer classifier in the M layer classifier and the final labeling label of the ith level as inputs so as to obtain the identification label model.
In some embodiments of the present application, the first obtaining unit 710 is further configured to:
word segmentation is carried out on the text information to obtain a plurality of word segments corresponding to the image or video to be trained; the plurality of word segments are matched against a custom dictionary to obtain a first tag set of the image or video to be trained; and the labeling label of the i-th level is supplemented or de-duplicated based on the first tag set to obtain the final labeling label of the i-th level.
In some embodiments of the present application, the first obtaining unit 710 is further configured to:
and supplementing or de-duplicating the labeling label of the i-th level by utilizing the semantic correlation of the labeling label of the i-th level and the first label set and/or the de-duplication number threshold of the labeling label of the i-th level and the first label set.
It should be understood that the apparatus embodiments and the method embodiments correspond to each other, and similar descriptions may refer to the method embodiments; to avoid repetition, they are not repeated here. Specifically, the apparatus 600 may correspond to the subject performing the method 200 in the embodiments of the present application, and the apparatus 700 may correspond to the subject performing the method 500; each unit in the apparatus 600 or the apparatus 700 is respectively for implementing the corresponding flow in the method 200 or the method 500, which, for brevity, is not described here again.
It should also be understood that the units of the apparatus 600 or the apparatus 700 in the embodiments of the present application may be separately or jointly combined into one or several other units, or one (or more) of them may be further split into several functionally smaller units, without affecting the achievement of the technical effects of the embodiments of the present application. The above units are divided on the basis of logical functions; in practical applications, the function of one unit may be implemented by a plurality of units, or the functions of a plurality of units may be implemented by one unit. In other embodiments of the present application, the apparatus 600 or 700 may also include other units, and in practical applications these functions may also be implemented with the assistance of, or in cooperation with, other units. According to another embodiment of the present application, the apparatus 600 or 700 may be constructed, and the method of identifying a label or the method of training a label identification model of the embodiments of the present application may be implemented, by running a computer program (including program code) capable of executing the steps of the corresponding methods on a general-purpose computing device, such as a computer, that includes processing elements such as a central processing unit (CPU) and storage elements such as a random access storage medium (RAM) and a read-only storage medium (ROM). The computer program may be recorded on a computer-readable storage medium, loaded into an electronic device through the computer-readable storage medium, and executed therein to implement the corresponding methods of the embodiments of the present application.
In other words, the units referred to above may be implemented in hardware, by instructions in software, or by a combination of hardware and software. Specifically, each step of the method embodiments in the embodiments of the present application may be implemented by integrated logic circuits of hardware in a processor and/or by instructions in the form of software, and the steps of the methods disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor or by a combination of hardware and software in a decoding processor. Alternatively, the software may reside in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method embodiments in combination with its hardware.
Fig. 8 is a schematic structural diagram of an electronic device 800 provided in an embodiment of the present application.
As shown in fig. 8, the electronic device 800 includes at least a processor 810 and a computer-readable storage medium 820. Wherein the processor 810 and the computer-readable storage medium 820 may be connected by a bus or other means. The computer-readable storage medium 820 is configured to store a computer program 821, the computer program 821 including computer instructions, and the processor 810 is configured to execute the computer instructions stored by the computer-readable storage medium 820. Processor 810 is a computing core and a control core of electronic device 800 that are adapted to implement one or more computer instructions, in particular to load and execute one or more computer instructions to implement a corresponding method flow or a corresponding function.
By way of example, the processor 810 may also be referred to as a central processing unit (Central Processing Unit, CPU). The processor 810 may include, but is not limited to: a general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
By way of example, computer-readable storage medium 820 may be high-speed RAM memory or Non-volatile memory (Non-Volatilememory), such as at least one magnetic disk memory; alternatively, it may be at least one computer-readable storage medium located remotely from the aforementioned processor 810. In particular, computer-readable storage media 820 includes, but is not limited to: volatile memory and/or nonvolatile memory. The nonvolatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable EPROM (EEPROM), or a flash Memory. The volatile memory may be random access memory (Random Access Memory, RAM) which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (Double Data Rate SDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), and Direct memory bus RAM (DR RAM).
In one implementation, the electronic device 800 may be the apparatus 600 for identifying tags shown in FIG. 6 or the apparatus 700 for training a tag identification model shown in FIG. 7; the computer-readable storage medium 820 has stored therein computer instructions; computer instructions stored in computer-readable storage medium 820 are loaded and executed by processor 810 to implement the corresponding steps in the method embodiments shown in fig. 2-5; in particular implementations, computer instructions in the computer-readable storage medium 820 are loaded by the processor 810 and perform the corresponding steps, which are not repeated here.
According to another aspect of the present application, the embodiments of the present application also provide a computer-readable storage medium (Memory), which is a Memory device in the electronic device 800, for storing programs and data. Such as computer-readable storage medium 820. It is understood that the computer readable storage medium 820 herein may include both built-in storage media in the electronic device 800 and extended storage media supported by the electronic device 800. The computer-readable storage medium provides storage space that stores an operating system of the electronic device 800. Also stored in this memory space are one or more computer instructions, which may be one or more computer programs 821 (including program code), adapted to be loaded and executed by the processor 810.
According to another aspect of the present application, there is provided a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium, for example the computer program 821. In this case, the electronic device 800 may be a computer, the processor 810 reads the computer instructions from the computer-readable storage medium 820, and the processor 810 executes the computer instructions, so that the computer performs the method of identifying a label or the method of training a label identification model provided in the various alternative manners described above.
In other words, when implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions of the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
Those of ordinary skill in the art will appreciate that the elements and process steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Finally, it should be noted that the above is only a specific embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about the changes or substitutions within the technical scope of the present application, and the changes or substitutions are covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A method of identifying a tag, comprising:
extracting a plurality of modal features of a target image or video; the plurality of modal features includes visual features, audio features, and text features of the target image or video;
Performing feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
based on the fusion characteristics and the intermediate characteristics of the ith-1 layer classifier in the M layer classifier, obtaining the intermediate characteristics of the ith layer classifier by utilizing the ith layer classifier until obtaining the intermediate characteristics of the M layer classifier; the i is more than 1 and less than or equal to M, the characteristics output by the ith layer of classifier in the M layers of classifiers are used for identifying the label of the ith level in M levels, and the intermediate characteristics of the first layer of classifier in the M layers of classifiers are obtained based on the fusion characteristics;
based on the intermediate features of the M-th layer classifier, outputting probability distribution features by using the M-th layer classifier;
determining a label of the target image or video based on the probability distribution characteristics;
the method further comprises the steps of:
acquiring text information of the target image or video, wherein the text information comprises at least one of the following: the text of the target image or video, the title of the target image or video, and the labeling text of the target image or video;
word segmentation is carried out on the text information, and a plurality of word segments corresponding to the target image or video are obtained;
Matching the plurality of segmentation words with a custom dictionary to obtain a first tag set of the target image or video;
and supplementing or de-duplicating the label of the target image or video based on the first label set.
2. The method of claim 1, wherein the feature fusing the plurality of modal features to obtain fused features of the plurality of modal features comprises:
mapping the plurality of modal features into a plurality of first features with the same dimension respectively;
and carrying out modal and position coding on the plurality of first features to obtain the fusion features.
3. The method of claim 2, wherein the performing modal and position encoding on the plurality of first features to obtain the fused features comprises:
correcting the jth feature based on other first features except the jth feature aiming at the jth feature in the plurality of first features to obtain a second feature corresponding to the jth feature;
and determining the fusion characteristic based on a plurality of second characteristics corresponding to the plurality of first characteristics respectively.
4. A method according to claim 3, wherein said modifying the jth feature based on other first features than the jth feature to obtain a second feature corresponding to the jth feature comprises:
Determining the weight corresponding to the j-th feature based on other first features except the j-th feature;
and determining the product of the weight corresponding to the j-th feature and the j-th feature as a second feature corresponding to the j-th feature.
5. The method of claim 1, wherein the M-layer classifier is an M-layer classifier based on a multi-layer perceptron MLP unit;
the step of obtaining the intermediate feature of the ith layer classifier by using the ith layer classifier based on the fusion feature and the intermediate feature of the ith-1 layer classifier in the M layer classifier comprises the following steps:
splicing the fusion feature and the middle feature output by the last hidden layer in the i-1 layer classifier to obtain the spliced feature of the i layer classifier;
and taking the spliced features of the ith layer classifier as input, and obtaining the intermediate features of the ith layer classifier by using the ith layer classifier.
6. The method of claim 1, wherein the determining the label of the target image or video based on the probability distribution characteristics comprises:
determining a first numerical value greater than a preset threshold in the probability distribution characteristics based on the probability distribution characteristics;
Identifying a label corresponding to the first numerical value in at least one label;
and determining the label corresponding to the first numerical value as the label of the target image or video, wherein the dimension of at least one label is equal to the dimension of the probability distribution characteristic.
7. The method of claim 1, wherein the supplementing or de-duplicating the tags of the target image or video based on the first tag set comprises:
and supplementing or de-duplicating the label of the target image or video by utilizing the semantic correlation of the label of the target image or video and the first label set and/or the de-duplication number threshold of the label of the target image or video and the first label set.
8. A method of training a tag recognition model, comprising:
acquiring an image or video to be trained;
extracting a plurality of modal characteristics of the image or video to be trained; the plurality of modal features includes visual features, audio features, and text features of the image or video to be trained;
performing feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
Acquiring an ith-level labeling label corresponding to the image or video to be trained;
training an ith layer classifier by taking the fusion characteristic, the middle characteristic of an ith-1 layer classifier in an M layer classifier and the labeling label of the ith level as inputs to obtain a label identification model, wherein i is more than 1 and less than or equal to M, a first layer classifier in the M layer classifier is obtained by training based on the fusion characteristic and the labeling label of the first level, and the middle characteristic of the first layer classifier in the M layer classifier is obtained based on the fusion characteristic;
the training the ith layer classifier by taking the fusion feature, the intermediate feature of the ith-1 layer classifier in the M layer classifier and the labeling label of the ith level as inputs to obtain a label identification model comprises the following steps:
acquiring text information of an image or video to be trained, wherein the text information comprises at least one of the following: the text of the image or video to be trained, the title of the image or video to be trained, and the labeling text of the image or video to be trained;
word segmentation is carried out on the text information, and a plurality of word segments corresponding to the image or video to be trained are obtained;
Matching the plurality of word segments with a custom dictionary to obtain a first tag set of the image or video to be trained;
supplementing or de-duplicating the labeling label of the ith level based on the first label set;
and training the ith layer classifier by taking the fusion characteristic, the intermediate characteristic of the ith-1 layer classifier in the M layer classifier and the final labeling label of the ith level as inputs so as to obtain the label identification model.
9. An apparatus for identifying a tag, comprising:
the extraction unit is used for extracting a plurality of modal characteristics of the target image or video; the plurality of modal features includes visual features, audio features, and text features of the target image or video;
the fusion unit is used for carrying out feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
the first determining unit is used for obtaining the intermediate feature of the ith layer classifier by utilizing the ith layer classifier based on the fusion feature and the intermediate feature of the ith-1 layer classifier in the M layer classifier until the intermediate feature of the M layer classifier is obtained; the i is more than 1 and less than or equal to M, the characteristics output by the ith layer of classifier in the M layers of classifiers are used for identifying the label of the ith level in M levels, and the intermediate characteristics of the first layer of classifier in the M layers of classifiers are obtained based on the fusion characteristics;
The output unit is used for outputting probability distribution characteristics by utilizing the M-layer classifier based on the intermediate characteristics of the M-layer classifier;
a second determining unit configured to determine a tag of the target image or video based on the probability distribution characteristics;
the first determining unit is further configured to:
acquiring text information of the target image or video, wherein the text information comprises at least one of the following: the text of the target image or video, the title of the target image or video, and the labeling text of the target image or video;
word segmentation is carried out on the text information, and a plurality of word segments corresponding to the target image or video are obtained;
matching the plurality of segmentation words with a custom dictionary to obtain a first tag set of the target image or video;
and supplementing or de-duplicating the label of the target image or video based on the first label set.
10. An apparatus for training a tag recognition model, comprising:
the first acquisition unit is used for acquiring images or videos to be trained;
the extraction unit is used for extracting a plurality of modal characteristics of the image or video to be trained; the plurality of modal features includes visual features, audio features, and text features of the image or video to be trained;
The fusion unit is used for carrying out feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
the second acquisition unit is used for acquiring the ith-level labeling label corresponding to the image or video to be trained;
the training unit is used for taking the fusion characteristics, the middle characteristics of the ith-1 layer classifier in the M layer classifier and the labeling label of the ith level as inputs, training the ith layer classifier to obtain a label identification model, i is more than 1 and less than or equal to M, a first layer classifier in the M layer classifier is obtained by training based on the fusion characteristics and the labeling label of the first level, and the middle characteristics of the first layer classifier in the M layer classifier are obtained based on the fusion characteristics;
wherein, training unit is specifically used for:
acquiring text information of an image or video to be trained, wherein the text information comprises at least one of the following: the text of the image or video to be trained, the title of the image or video to be trained, and the labeling text of the image or video to be trained;
word segmentation is carried out on the text information, and a plurality of word segments corresponding to the image or video to be trained are obtained;
Matching the plurality of word segments with a custom dictionary to obtain a first tag set of the image or video to be trained;
supplementing or de-duplicating the labeling label of the ith level based on the first label set;
and training the ith layer classifier by taking the fusion characteristic, the intermediate characteristic of the ith-1 layer classifier in the M layer classifier and the final labeling label of the ith level as inputs so as to obtain the label identification model.
11. An electronic device, comprising:
a processor adapted to execute a computer program;
a computer readable storage medium having stored therein a computer program which, when executed by the processor, implements the method of any one of claims 1 to 7 or the method of claim 8.
12. A computer readable storage medium storing a computer program for causing a computer to perform the method of any one of claims 1 to 7 or the method of claim 8.
CN202110662545.9A 2021-06-15 2021-06-15 Label identification method, label identification model training method, device and equipment Active CN113836992B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110662545.9A CN113836992B (en) 2021-06-15 2021-06-15 Label identification method, label identification model training method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110662545.9A CN113836992B (en) 2021-06-15 2021-06-15 Label identification method, label identification model training method, device and equipment

Publications (2)

Publication Number Publication Date
CN113836992A CN113836992A (en) 2021-12-24
CN113836992B true CN113836992B (en) 2023-07-25

Family

ID=78962663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110662545.9A Active CN113836992B (en) 2021-06-15 2021-06-15 Label identification method, label identification model training method, device and equipment

Country Status (1)

Country Link
CN (1) CN113836992B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398973B (en) * 2022-01-07 2024-04-16 腾讯科技(深圳)有限公司 Media content tag identification method, device, equipment and storage medium
CN114821401A (en) * 2022-04-07 2022-07-29 腾讯科技(深圳)有限公司 Video auditing method, device, equipment, storage medium and program product
CN115052201A (en) * 2022-05-17 2022-09-13 阿里巴巴(中国)有限公司 Video editing method and electronic equipment
CN115879473B (en) * 2022-12-26 2023-12-01 淮阴工学院 Chinese medical named entity recognition method based on improved graph attention network
CN117421497B (en) * 2023-11-02 2024-04-26 北京蜂鸟映像电子商务有限公司 Work object processing method and device, readable storage medium and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11176423B2 (en) * 2016-10-24 2021-11-16 International Business Machines Corporation Edge-based adaptive machine learning for object recognition
CN112231275B (en) * 2019-07-14 2024-02-27 阿里巴巴集团控股有限公司 Method, system and equipment for classifying multimedia files, processing information and training models

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007114796A1 (en) * 2006-04-05 2007-10-11 Agency For Science, Technology And Research Apparatus and method for analysing a video broadcast
CN110162669A (en) * 2019-04-04 2019-08-23 腾讯科技(深圳)有限公司 Visual classification processing method, device, computer equipment and storage medium
CN110688461A (en) * 2019-09-30 2020-01-14 中国人民解放军国防科技大学 Online text education resource label generation method integrating multi-source knowledge
CN110737801A (en) * 2019-10-14 2020-01-31 腾讯科技(深圳)有限公司 Content classification method and device, computer equipment and storage medium
CN110866184A (en) * 2019-11-11 2020-03-06 湖南大学 Short video data label recommendation method and device, computer equipment and storage medium
CN111401174A (en) * 2020-03-07 2020-07-10 北京工业大学 Volleyball group behavior identification method based on multi-mode information fusion
CN111222500A (en) * 2020-04-24 2020-06-02 腾讯科技(深圳)有限公司 Label extraction method and device
CN111626251A (en) * 2020-06-02 2020-09-04 Oppo广东移动通信有限公司 Video classification method, video classification device and electronic equipment
CN112069884A (en) * 2020-07-28 2020-12-11 中国传媒大学 Violent video classification method, system and storage medium
CN112347290A (en) * 2020-10-12 2021-02-09 北京有竹居网络技术有限公司 Method, apparatus, device and medium for identifying label
CN112183672A (en) * 2020-11-05 2021-01-05 北京金山云网络技术有限公司 Image classification method, and training method and device of feature extraction network
CN112348111A (en) * 2020-11-24 2021-02-09 北京达佳互联信息技术有限公司 Multi-modal feature fusion method and device in video, electronic equipment and medium
CN112364810A (en) * 2020-11-25 2021-02-12 深圳市欢太科技有限公司 Video classification method and device, computer readable storage medium and electronic equipment
CN112765403A (en) * 2021-01-11 2021-05-07 北京达佳互联信息技术有限公司 Video classification method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Network video classification method based on bidirectional propagation of heterogeneous information; Li Qian; Du Youtian; Xue Jiao; Journal of Computer Applications (Issue 08); full text *
Transductive multi-modal video semantic concept detection based on tensor representation; Wu Fei; Liu Yanan; Zhuang Yueting; Journal of Software (Issue 11); full text *

Also Published As

Publication number Publication date
CN113836992A (en) 2021-12-24

Similar Documents

Publication Publication Date Title
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN110414432B (en) Training method of object recognition model, object recognition method and corresponding device
WO2020238293A1 (en) Image classification method, and neural network training method and apparatus
CN111353076B (en) Method for training cross-modal retrieval model, cross-modal retrieval method and related device
US11238093B2 (en) Video retrieval based on encoding temporal relationships among video frames
CN111062215B (en) Named entity recognition method and device based on semi-supervised learning training
WO2021139191A1 (en) Method for data labeling and apparatus for data labeling
CN110851641B (en) Cross-modal retrieval method and device and readable storage medium
EP3926531B1 (en) Method and system for visio-linguistic understanding using contextual language model reasoners
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN111582409A (en) Training method of image label classification network, image label classification method and device
CN112347284B (en) Combined trademark image retrieval method
CN111475622A (en) Text classification method, device, terminal and storage medium
CN114896434B (en) Hash code generation method and device based on center similarity learning
CN114282055A (en) Video feature extraction method, device and equipment and computer storage medium
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN117453949A (en) Video positioning method and device
CN113704534A (en) Image processing method and device and computer equipment
CN113011320A (en) Video processing method and device, electronic equipment and storage medium
CN117392488A (en) Data processing method, neural network and related equipment
WO2023168818A1 (en) Method and apparatus for determining similarity between video and text, electronic device, and storage medium
CN116109980A (en) Action recognition method based on video text matching
CN115759262A (en) Visual common sense reasoning method and system based on knowledge perception attention network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant