CN113836992B - Label identification method, label identification model training method, device and equipment


Info

Publication number
CN113836992B
Authority
CN
China
Prior art keywords
video
features
feature
label
layer classifier
Prior art date
Legal status
Active
Application number
CN202110662545.9A
Other languages
Chinese (zh)
Other versions
CN113836992A
Inventor
尚焱
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110662545.9A
Publication of CN113836992A
Application granted
Publication of CN113836992B

Links

Classifications

    • G06F 16/5866: Information retrieval of still image data; retrieval characterised by using manually generated metadata, e.g. tags, keywords, comments, location and time information
    • G06F 16/7867: Information retrieval of video data; retrieval characterised by using manually generated metadata, e.g. tags, keywords, comments, title and artist information, user ratings
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24: Pattern recognition; classification techniques
    • G06F 18/253: Pattern recognition; fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A method for identifying a tag, a method for training a tag identification model, and a corresponding device and equipment are provided, relating to the field of video processing in network media. The method includes the following steps: extracting a plurality of modal features of a target image or video; performing feature fusion on the plurality of modal features to obtain a fused feature; based on the fused feature and the intermediate feature of the (i-1)-th layer classifier in an M-layer classifier, obtaining the intermediate feature of the i-th layer classifier, until the intermediate feature of the M-th layer classifier is obtained; outputting a probability distribution feature with the M-th layer classifier based on the intermediate feature of the M-th layer classifier; and determining the label of the target image or video based on the probability distribution feature. By extracting and fusing features of multiple modalities and reusing features across the levels of the M-layer classifier, the method improves the accuracy of the classifier, i.e., the accuracy of the probability distribution feature output by the M-th layer classifier, and thereby the accuracy of the identified label.

Description

Label identification method, label identification model training method, device and equipment
Technical Field
The embodiment of the application relates to the field of video processing of network media, and more particularly relates to a method for identifying a tag, a method for training a tag identification model, a device and equipment.
Background
With the advent of the fifth-generation mobile communication technology (5G) and the development of mobile internet platforms, the volume of video accumulated on internet platforms keeps growing, and the consumption of short videos and images has exploded, so that intelligent understanding of image or video content has become indispensable in every link of visual content processing. The most basic intelligent image understanding task is to label images or videos accurately and richly, which ensures that users or downstream tasks can quickly retrieve the images or videos and improves retrieval quality and efficiency.
At present, the tag of an image or video is typically identified as follows: the image or video is first characterized by visual features, which are then fed into a classifier for classification and tag identification, and the identified tag is output. However, expressing an image or video with visual features alone may lead to insufficient feature expression and, in turn, to low tag recognition accuracy; in addition, the recognition accuracy of the related classifiers is too low to meet the accuracy requirements of scenarios in which images or videos are quickly retrieved.
Therefore, a method of identifying tags is urgently needed in the art to improve tag recognition accuracy and effectiveness, and in particular to meet the recognition accuracy required by scenarios in which images or videos are quickly retrieved, thereby improving user experience.
Disclosure of Invention
The embodiments of the application provide a method of identifying a tag, a method of training a tag identification model, a device and equipment, which can improve tag recognition accuracy and effectiveness, and in particular can meet the recognition accuracy required by scenarios in which images or videos are quickly retrieved, thereby improving user experience.
In one aspect, a method of identifying a tag is provided, comprising:
extracting a plurality of modal features of a target image or video;
feature fusion is carried out on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
based on the fusion feature and the intermediate feature of the (i-1)-th layer classifier in an M-layer classifier, obtaining the intermediate feature of the i-th layer classifier by using the i-th layer classifier, until the intermediate feature of the M-th layer classifier is obtained; where 1 < i ≤ M, the features output by the i-th layer classifier in the M-layer classifier are used for identifying the label of the i-th level among M levels, and the intermediate feature of the first layer classifier in the M-layer classifier is obtained based on the fusion feature;
Based on the intermediate features of the M-th layer classifier, outputting probability distribution features by using the M-th layer classifier;
based on the probability distribution characteristics, a label of the target image or video is determined.
In another aspect, a method of training a tag recognition model is provided, comprising:
acquiring an image or video to be trained;
extracting a plurality of modal characteristics of the image or video to be trained;
feature fusion is carried out on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
acquiring an ith-level labeling label corresponding to the image or video to be trained;
taking the fusion feature, the intermediate feature of the (i-1)-th layer classifier in an M-layer classifier and the labeled label of the i-th level as inputs, and training the i-th layer classifier to obtain a label identification model, where 1 < i ≤ M, the first layer classifier in the M-layer classifier is trained based on the fusion feature and the labeled label of the first level, and the intermediate feature of the first layer classifier in the M-layer classifier is obtained based on the fusion feature.
In another aspect, the present application provides an apparatus for identifying a tag, comprising:
the extraction unit is used for extracting a plurality of modal characteristics of the target image or video;
The fusion unit is used for carrying out feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
the first determining unit is used for obtaining the intermediate feature of the i-th layer classifier by using the i-th layer classifier based on the fusion feature and the intermediate feature of the (i-1)-th layer classifier in an M-layer classifier, until the intermediate feature of the M-th layer classifier is obtained; where 1 < i ≤ M, the features output by the i-th layer classifier in the M-layer classifier are used for identifying the label of the i-th level among M levels, and the intermediate feature of the first layer classifier in the M-layer classifier is obtained based on the fusion feature;
the output unit is used for outputting probability distribution characteristics by utilizing the M-layer classifier based on the middle characteristics of the M-layer classifier;
and a second determining unit for determining the label of the target image or video based on the probability distribution characteristics.
In another aspect, the present application provides an apparatus for training a tag recognition model, comprising:
the first acquisition unit is used for acquiring images or videos to be trained;
the extraction unit is used for extracting a plurality of modal characteristics of the image or video to be trained;
the fusion unit is used for carrying out feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
The second acquisition unit is used for acquiring the ith-level labeling label corresponding to the image or video to be trained;
the training unit is used for taking the fusion feature, the intermediate feature of the (i-1)-th layer classifier in an M-layer classifier and the labeled label of the i-th level as inputs, and training the i-th layer classifier to obtain a label recognition model, where 1 < i ≤ M, the first layer classifier in the M-layer classifier is trained based on the fusion feature and the labeled label of the first level, and the intermediate feature of the first layer classifier in the M-layer classifier is obtained based on the fusion feature.
In another aspect, an embodiment of the present application provides an electronic device, including:
a processor adapted to execute a computer program;
a computer readable storage medium having a computer program stored therein, which when executed by the processor, implements the method of identifying a tag or the method of training a tag identification model described above.
In another aspect, embodiments of the present application provide a computer-readable storage medium storing computer instructions that, when read and executed by a processor of a computer device, cause the computer device to perform the method of identifying a tag or the method of training a tag identification model described above.
Based on the scheme, the plurality of modal features of the target image or video are extracted to perform feature fusion, which is equivalent to performing feature expression by fusing the plurality of modal features of the target image or video, so that the target image or video is more sufficient in terms of feature expression, and the identification accuracy of the tag can be improved.
In addition, based on the fusion characteristic and the intermediate characteristic of the i-1 th layer classifier in the M layer classifier, the intermediate characteristic of the i layer classifier is obtained by utilizing the i layer classifier until the intermediate characteristic of the M layer classifier is obtained, multiplexing of the intermediate characteristic of the front layer classifier by the rear layer classifier can be realized, and equivalently, the identification accuracy of the rear layer classifier to the tag can be improved by multiplexing the intermediate characteristic of the front layer classifier.
In other words, the M-th layer classifier considers the middle characteristics of the previous M-1 layer classifier in a layer-by-layer multiplexing manner, so that the accuracy of probability distribution characteristics output by the M-th layer classifier can be improved, which is equivalent to improving the identification accuracy of the label of the M-th level, namely improving the label accuracy of the identification target image or video.
In addition, because the accuracy of the label of the target image or video is improved, the quality and the efficiency of the retrieval can be further improved by utilizing the label for identifying the image or video to retrieve the image or video on an internet platform; meanwhile, video recommendation is carried out on the user by using the identified tag, so that the user experience of the product can be greatly improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic block diagram of a system framework provided in an embodiment of the present application.
Fig. 2 is a schematic flow chart of a method of identifying a tag provided by an embodiment of the present application.
Fig. 3 is a schematic block diagram of feature fusion provided by an embodiment of the present application.
Fig. 4 is a schematic block diagram of three-layer classifier feature multiplexing provided in an embodiment of the present application.
Fig. 5 is a schematic flow chart of a method for training a tag recognition model provided in an embodiment of the present application.
FIG. 6 is a schematic block diagram of an apparatus for identifying tags provided in an embodiment of the present application.
FIG. 7 is a schematic block diagram of an apparatus for training a tag recognition model provided in an embodiment of the present application.
Fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The solution provided by the present application may relate to artificial intelligence technology.
Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
It should be appreciated that artificial intelligence techniques are a comprehensive discipline involving a wide range of fields, both hardware-level and software-level techniques. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
With research and advancement of artificial intelligence technology, research and application of artificial intelligence technology is being developed in various fields, such as common smart home, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned, automatic driving, unmanned aerial vehicles, robots, smart medical treatment, smart customer service, etc., and it is believed that with the development of technology, artificial intelligence technology will be applied in more fields and with increasing importance value.
The embodiment of the application can relate to Computer Vision (CV) technology in artificial intelligence technology, wherein the CV is a science for researching how to make a machine "see", and further refers to a method for using a camera and a Computer to replace human eyes to recognize, track and measure a target, and further performing graphic processing, so that the Computer is processed into an image more suitable for human eyes to observe or transmit to an instrument to detect. As a scientific discipline, computer vision research-related theory and technology has attempted to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, synchronous positioning, and map construction, among others, as well as common biometric recognition techniques such as face recognition, fingerprint recognition, and others.
The scheme provided by the embodiment of the application also relates to image or video processing technology in the field of network media. Unlike conventional audio and video devices, network media relies on techniques and equipment provided by information technology (IT) device developers to transmit, store and process audio and video signals. The conventional Serial Digital Interface (SDI) transmission mode lacks network switching characteristics in the true sense; much work is required to build, on top of SDI, even part of the network functionality provided by Ethernet and the Internet Protocol (IP). Thus, network media technology in the video industry has emerged. Further, video processing technology for network media may include the transmission, storage and processing of audio and video signals as well as text recognition for audio and video. Among them, automatic speech recognition (ASR) converts human speech into text, its greatest advantage being a more natural and easier-to-use human-machine interface, and optical character recognition (OCR) obtains the text in an image by analyzing the position and type of characters in a scanned or photographed image.
It should be noted that, the device provided in the embodiment of the present application may be integrated in a server, where the server may include an independently operated server or a distributed server, or may include a server cluster or a distributed system formed by a plurality of servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, and basic cloud computing services such as big data and an artificial intelligent platform, where the servers may be directly or indirectly connected through wired or wireless communication modes, and this application is not limited herein.
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.
It should be noted that the scheme of the identification tag provided in the present application may be applied to any scenario that needs intelligent understanding for image or video content. For example, scenes such as picture and video searches, recommendations, audits, etc. In addition, in practical applications, an image or video may be described from different angles, such as a text description of an image or video title, a title drawing expressing the main content of the image or video, a plurality of image frames describing the details of the video, audio depicting the video expression, and the like. The richer the description angle used, the more accurate the representation of the image or video. The system framework provided in the embodiments of the present application is exemplified by extracting visual features, audio features and text features in a target image or video, and of course, in other alternative embodiments, other modal features in the target image or video, for example, time sequence features, etc., may be taken as an example, and the specific manifestations of the multiple modal features are not specifically limited in the present application.
The system framework provided in the present application will be described in detail below with reference to extracting visual, audio, and text features of a target image or video.
Fig. 1 is a schematic block diagram of a system framework 100 provided in an embodiment of the present application.
As shown in fig. 1, the system framework 100 may include a visual feature extraction module 111, an audio feature extraction module 112, a text feature extraction module 113, a feature fusion module 120, a hierarchical classification module 130, a candidate tag processing module 140, and a custom tag processing module 150, wherein the hierarchical classification module 130 may include a multi-layer classifier 131 and a tag identification module 132.
The visual feature extraction module 111 may be used to extract visual features of a target image or video, the audio feature extraction module 112 may be used to extract audio features of the target image or video, and the text feature extraction module 113 may be used to extract text features of the target image or video; the extracted visual, audio and text features are respectively sent to the feature fusion module 120, and in addition the text feature extraction module 113 may further send the extracted text information to the custom tag processing module 150. It should be noted that the visual feature extraction module 111, the audio feature extraction module 112 and the text feature extraction module 113 may be any modules having the corresponding feature extraction functions, which is not limited in this application; for example, the visual feature extraction module 111 may perform visual feature extraction based on the residual network ResNet backbone in the SlowFast channel video classification algorithm, the audio feature extraction module 112 may perform audio feature extraction based on the VGGish framework, and the text feature extraction module 113 may perform text feature extraction using the BERT framework, optionally supplementing the text information with optical character recognition (OCR) or automatic speech recognition (ASR); the specific extraction manner of the multiple modal features is not particularly limited in this application.
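For illustration only, a minimal sketch of extracting the text-modality feature of a video title with a BERT encoder from the Hugging Face transformers library is given below; the checkpoint name, the mean-pooling choice and the feature dimension are assumptions of this sketch rather than requirements of this application. Visual features (e.g., via a SlowFast/ResNet backbone) and audio features (e.g., via VGGish) would be extracted analogously and sent to the feature fusion module 120.

# Sketch: text-modality feature extraction with BERT (assumed checkpoint and pooling).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

def extract_text_feature(title: str) -> torch.Tensor:
    # Tokenize the title and mean-pool the token embeddings into one 768-d vector.
    inputs = tokenizer(title, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

text_feature = extract_text_feature("a husky running in the snow")
# The visual and audio extractors would produce their own feature tensors,
# which are then passed, together with text_feature, to the fusion module.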
The feature fusion module 120 is configured to receive the visual feature extraction module 111, the audio feature extraction module 112, and the text feature extraction module 113, and perform feature fusion on the received visual feature, audio feature, and text feature, to obtain a fused feature, and then send the fused feature to the hierarchical classification module 130. It should be noted that, the feature fusion module 120 may be any module having a feature fusion function, which is not limited in this application; for example, the feature fusion module may be a Transformer framework based feature fusion module.
The multi-layer classifier 131 in the hierarchical classification module 130 is configured to receive the fused features sent by the feature fusion module 120, and the post-layer classifier in the multi-layer classifier 131 is configured to implement multiplexing of the post-layer classifier on the intermediate features of the pre-layer classifier based on the received fused features and the intermediate features of the pre-layer classifier, so as to improve accuracy of the post-layer classifier, until probability distribution features output by a last layer classifier are obtained, and send the probability distribution features output by the last layer classifier to the tag identification module 132 in the hierarchical classification module 130, so that the tag identification module 132 identifies a tag of the image or video according to the probability distribution features output by the last layer classifier, and finally sends the obtained tag of the image or video to the candidate tag processing module 140; it should be noted that the probability distribution feature may be a distribution having a length or dimension N. Each bit or value in the probability distribution feature corresponds to a label, and the label corresponding to the maximum value or a value greater than a certain threshold in the probability distribution feature can be determined to be the label of the image or the video. In other words, the image or video may be labeled with the largest value in the probability distribution characteristics or a value greater than a certain threshold. It should be noted that, the multi-layer classifier 131 may be any multi-layer classifier, which is not limited in this application; for example, the multi-layer classifier may be a multi-layer classifier based on units of multi-layer perceptrons (MLP, multilayer Perceptron).
The custom tag processing module 150 may be configured to receive the text information of the target image or video sent by the text feature extraction module 113, perform word segmentation processing on the text information, match the processed multiple words with the custom tag to obtain a first tag set, and send the first tag set to the candidate tag processing module 140, where it is to be noted that a specific manner of word segmentation on the text information is not specifically limited in this application.
The candidate tag processing module 140 may be configured to receive the tag of the image or video sent by the tag identification module 132 in the hierarchical classification module 130 and the first tag set sent by the custom tag processing module 150, and supplement or deduplicate the tag of the image or video based on the received first tag set, to obtain a final tag of the target image or video.
As can be seen from the above, firstly, the visual feature extraction module 111, the audio feature extraction module 112 and the text feature extraction module 113 extract the visual feature, the audio feature and the text feature of the target image or video respectively, and the visual feature, the audio feature and the text feature of the target image or video are fused by the feature fusion module 120, which is equivalent to more accurate and sufficient feature expression of the target image or video; secondly, the fused features are used as input, and the hierarchical multiplexing of the intermediate features of different layers of classifiers is realized through the multi-layer classifier 131 in the hierarchical classification module 130, so that the accuracy of the last layer of classifier in the multi-layer classifier 131 is improved, namely the accuracy of probability distribution features output by the last layer of classifier is improved; finally, the probability distribution characteristics output by the last layer of classifier are used as the input of a tag identification module 132, and the tag identification module 132 is used for identifying the tag of the target image or video, so that the accuracy of tag identification of the target image or video is improved; in addition, in order to further improve accuracy, the tag identification module 132 sends the obtained tag of the target image or video to the candidate tag processing module 140, the text feature extraction module 113 sends the extracted text information to the custom tag processing module 150, the custom tag processing module 150 performs word segmentation on the received text information, matches the divided words with the tag defined in advance in the database to obtain a first tag set, and sends the first tag set to the candidate tag processing module 140, and the candidate tag processing module 140 uses the received first tag set to perform de-duplication or supplement on the received tag of the target image or video to obtain a final tag of the target image or video, so as to further improve accuracy of the tag of the target image or video.
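As a minimal sketch of the candidate tag processing step described above, the following illustrates matching segmented title words against a predefined custom-tag vocabulary to form the first tag set, and then using it to supplement and de-duplicate the classifier's tags; the word segmentation itself is left as a placeholder because this application does not fix a specific segmentation method.

# Sketch of the custom-tag matching and candidate-tag merging steps (names illustrative).
from typing import Iterable, List, Set

def match_custom_tags(title_words: Iterable[str], custom_tags: Set[str]) -> Set[str]:
    # First tag set: segmented title words that hit the predefined custom-tag vocabulary.
    return {word for word in title_words if word in custom_tags}

def merge_tags(classifier_tags: Iterable[str], first_tag_set: Set[str]) -> List[str]:
    # Supplement the classifier's tags with the first tag set and remove duplicates.
    merged: List[str] = []
    for tag in list(classifier_tags) + sorted(first_tag_set):
        if tag not in merged:
            merged.append(tag)
    return merged

final_tags = merge_tags(["dog", "snow"], match_custom_tags(["husky", "snow"], {"husky"}))
# final_tags == ["dog", "snow", "husky"]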
It should be understood that fig. 1 is only an example of the present application and should not be construed as limiting the present application.
For ease of understanding, the relevant terms in this application are described below.
Identification of image or video tags: image or video tag technology generally refers to high-level semantic description of the content of an image or video, is a basic task in computer vision, has an extremely important role in downstream tasks in the short video age, and has wide application in recommendation systems.
Multimode: multimodal refers in this application to multimedia data, which describes information such as text, video and speech of the same object or object entities with the same semantics in the internet.
Text recognition technology (optical character recognition, OCR): text recognition technology obtains text in an image by analyzing the position and type of characters in a scanned or photographed image.
Automatic speech recognition technology (automatic speech recognition, ASR): automatic speech recognition technology is a process of converting human speech into text by analyzing audio information.
The SlowFast video classification algorithm: two parallel convolutional neural networks, a Slow channel and a Fast channel, are applied to the same video segment; the Slow channel analyzes static content in the video, the Fast channel analyzes dynamic content in the video, both channels use a residual network ResNet model, and convolution operations are run immediately after a number of video frames are captured.
The method of identifying the tag provided in the present application will be described in detail below using fig. 2 to 4.
Fig. 2 is a schematic flow chart of a method 200 of identifying tags provided in an embodiment of the present application.
It should be noted that, the scheme provided in the embodiments of the present application may be executed by any electronic device having data processing capability. For example, the electronic device may be implemented as a server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, a cloud database, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, basic cloud computing services such as big data and an artificial intelligent platform, and the server may be directly or indirectly connected through a wired or wireless communication manner;
as shown in fig. 2, the method 200 may include some or all of the following:
s201, extracting a plurality of modal characteristics of a target image or video;
s202, carrying out feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
S203, based on the fusion feature and the intermediate feature of the (i-1)-th layer classifier in an M-layer classifier, obtaining the intermediate feature of the i-th layer classifier by using the i-th layer classifier, until the intermediate feature of the M-th layer classifier is obtained; where 1 < i ≤ M, the features output by the i-th layer classifier in the M-layer classifier are used for identifying the label of the i-th level among M levels, and the intermediate feature of the first layer classifier in the M-layer classifier is obtained based on the fusion feature;
s204, based on the middle characteristics of the M-th layer classifier, outputting probability distribution characteristics by using the M-th layer classifier;
s205, determining the label of the target image or video based on the probability distribution characteristics.
In other words, the server extracts a plurality of modal features of the target image or video and fuses them to obtain the fused feature, and takes the fused feature as the input of the M-layer classifier, so that the M-layer classifier obtains the intermediate feature of the i-th layer classifier by using the i-th layer classifier based on the fused feature and the intermediate feature of the (i-1)-th layer classifier, until the intermediate feature of the M-th layer classifier is obtained; the intermediate feature of the M-th layer classifier is then taken as the input of the M-th layer classifier, which outputs the probability distribution feature of the target image or video, and the label of the target image or video is determined based on that probability distribution feature. Here 1 < i ≤ M, the features output by the i-th layer classifier are used for identifying the label of the i-th level among M levels, and the intermediate feature of the first layer classifier is obtained based on the fused feature.
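The flow above can be summarized by the following sketch, which assumes the sub-modules are available as callables; all names are illustrative and do not denote a reference implementation of this application.

# Sketch of the S201-S205 flow with illustrative callables.
def identify_label(target, extractors, fuse, classifiers, pick_labels):
    modal_features = [extract(target) for extract in extractors]     # S201: per-modality features
    fused = fuse(modal_features)                                      # S202: feature fusion
    intermediate = classifiers[0].intermediate(fused)                 # layer 1 uses the fused feature
    for layer in classifiers[1:]:                                     # S203: reuse layer i-1's feature
        intermediate = layer.intermediate(fused, intermediate)
    probability_distribution = classifiers[-1].output(intermediate)   # S204: M-th layer output
    return pick_labels(probability_distribution)                      # S205: threshold/argmax selection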
Based on the scheme, the feature fusion is carried out on the plurality of modal features by extracting the plurality of modal features of the target image or video, which is equivalent to considering the feature expression of the plurality of modal features of the fused target image or video, so that the target image or video is more sufficient in the aspect of feature expression, and the identification accuracy of the tag can be improved.
In addition, based on the fusion characteristic and the intermediate characteristic of the i-1 th layer classifier in the M layer classifier, the intermediate characteristic of the i layer classifier is obtained by utilizing the i layer classifier until the intermediate characteristic of the M layer classifier is obtained, multiplexing of the intermediate characteristic of the front layer classifier by the rear layer classifier can be realized, and equivalently, the identification accuracy of the rear layer classifier to the tag can be improved by multiplexing the intermediate characteristic of the front layer classifier.
In other words, the M-th layer classifier considers the middle characteristics of the previous M-1 layer classifier in a layer-by-layer multiplexing manner, so that the accuracy of probability distribution characteristics output by the M-th layer classifier can be improved, which is equivalent to improving the identification accuracy of the label of the M-th level, namely improving the label accuracy of the identification target image or video.
In addition, because the accuracy of the label of the target image or video is improved, the quality and the efficiency of the retrieval can be further improved by utilizing the label identified by the image or video to retrieve the image or video on an internet platform; meanwhile, video recommendation is carried out on the user by using the identified tag, so that the user experience of the product can be greatly improved.
It should be noted that the plurality of modal features may include, but are not limited to, visual features, audio features and text features. The manner of extracting the visual features of the target image or video includes, but is not limited to, visual feature extraction based on the residual network (ResNet) backbone in the SlowFast channel video classification algorithm. The manner of extracting the audio features of the target image or video includes, but is not limited to, audio feature extraction based on the VGGish framework. The manner of extracting the text features of the target image or video includes, but is not limited to, text feature extraction using the BERT framework, optionally supplemented with text information obtained by optical character recognition (OCR) or automatic speech recognition (ASR); for example, the audio in the target video may be separated and the speech text in the audio obtained using ASR, which is not particularly limited in this application. It should also be noted that the manner of fusing the plurality of modal features includes, but is not limited to, feature fusion based on a Transformer framework, which is not particularly limited in this application. In addition, the M-layer classifier in this application is preferably a multi-layer classifier based on multi-layer perceptron (MLP) units, but may also be an M-layer classifier based on another framework, as long as the intermediate features of a preceding-layer classifier can be multiplexed by a succeeding-layer classifier, so as to realize hierarchical multiplexing of the intermediate features of the preceding-layer classifiers. It should be noted that the probability distribution feature may be a distribution with length or dimension N. Each bit or value in the probability distribution feature corresponds to a label, and the label corresponding to the maximum value, or to a value greater than a certain threshold, in the probability distribution feature can be determined as the label of the image or video. In other words, the image or video may be labeled according to the largest value in the probability distribution feature or the values greater than a certain threshold.
To verify the effectiveness of multiple modalities, manually labeled data of 22,328 videos collected from a service are taken as an example; the data are divided into a training set and a test set at a ratio of 7:1, and the improvement in tag recognition accuracy brought by this application can be seen in the experimental results of Table 1.
TABLE 1
Method | Highest classification error rate | Lowest classification error rate
Pure visual features (Baseline) | 61.48% | 28.46%
Visual features + speech features | 59.28% | 27.03%
Visual features + speech features + text features | 55.51% | 22.35%
As shown in Table 1, a comparison experiment was designed in which video frames are input as visual features, the audio of the video as audio features, and the title of the video as text features; the Baseline row gives the highest and lowest classification error rates obtained with pure visual features. The experiment shows that as the number of modalities increases, the classification error rates decrease to different degrees, indicating that multi-modal information helps improve label accuracy.
The feature fusion method provided in the present application will be described in detail below taking the extraction of visual features, audio features, and text features of a target image or video as an example. It should be noted that, in the present application, the visual feature, the audio feature, and the text feature of the target image or the video are taken as examples, but the present application is not limited to the visual feature, the audio feature, and the text feature, which are a plurality of modal features, and of course may also include other modal features, for example, a time sequence feature, etc., that is, the present application does not specifically limit the plurality of modal features, and only the visual feature, the audio feature, and the text feature of the target image or the video are taken as examples for detailed description.
Fig. 3 is a schematic block diagram of feature fusion provided by an embodiment of the present application.
As shown in fig. 3, the block diagram 300 may include a modality feature linear mapping module 301, a modality and position coding module 302; the linear mapping module 301 is configured to map the plurality of modal features into a plurality of first features with the same dimension; the mode and position encoding module 302 is configured to perform mode and position encoding on the plurality of first features to obtain a fusion feature.
It should be noted that the modality and position encoding module 302 may be a module based on a Transformer framework.
In some embodiments of the present application, S202 may include:
mapping the plurality of modal features into a plurality of first features of the same dimension respectively; and carrying out modal and position coding on the plurality of first features to obtain the fusion features.
As an example, the visual features, the audio features and the text features of the target image or the video are mapped into a plurality of first features with the same dimension respectively, so that feature fusion is facilitated to the visual features, the audio features and the text features of the target image or the video, then the plurality of first features with the same dimension are subjected to modal and position coding, that is, feature fusion is performed to the visual features, the audio features and the text features with the same dimension, and the fused features are obtained.
It should be noted that performing modality and position encoding on the plurality of first features, that is, fusing the first features corresponding to the respective modalities, may be done with a feature fusion model based on the Transformer framework, or of course with another feature fusion model.
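A minimal PyTorch sketch of such a fusion module is given below: each modality's features are linearly mapped to a common dimension, learned modality and position embeddings are added, and a Transformer encoder mixes the tokens; the dimensions, token limit and layer counts are assumptions of this sketch.

# Sketch of linear mapping + modality/position encoding + Transformer fusion (sizes assumed).
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, dims=(2048, 128, 768), d_model=512, max_tokens=64):
        super().__init__()
        # Map visual / audio / text features to the same dimension (the "first features").
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in dims)
        self.modality_emb = nn.Embedding(len(dims), d_model)
        self.position_emb = nn.Embedding(max_tokens, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, visual, audio, text):
        # Each input: (batch, seq_len_m, dims[m]).
        tokens, modality_ids = [], []
        for m, (proj, feat) in enumerate(zip(self.proj, (visual, audio, text))):
            tokens.append(proj(feat))
            modality_ids.append(torch.full(feat.shape[:2], m, dtype=torch.long))
        x = torch.cat(tokens, dim=1)
        x = x + self.modality_emb(torch.cat(modality_ids, dim=1))
        x = x + self.position_emb(torch.arange(x.size(1)))
        x = self.encoder(x)                  # cross-modal mixing
        return x.mean(dim=1)                 # one fused feature per sample

fused = MultiModalFusion()(torch.randn(2, 8, 2048), torch.randn(2, 10, 128), torch.randn(2, 16, 768))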
In some implementations, for a jth feature of the plurality of first features, correcting the jth feature based on other first features than the jth feature to obtain a second feature corresponding to the jth feature; the fusion feature is determined based on a plurality of second features corresponding to the plurality of first features, respectively.
As an example, first, for a first feature of a visual feature map of a plurality of first features, the first feature of the visual feature map is corrected based on a first feature of an audio feature map and a first feature of a text feature map other than the first feature of the visual feature map, to obtain a second feature corresponding to the first feature of the visual feature map; similarly, for a first feature of the audio feature map in the plurality of first features, correcting the first feature of the audio feature map based on a first feature of the visual feature map and a first feature of the text feature map other than the first feature of the audio feature map to obtain a second feature corresponding to the first feature of the audio feature map; similarly, for a first feature of the text feature map in the plurality of first features, the first feature of the text feature map is corrected based on a first feature of the visual feature map and a first feature of the audio feature map other than the first feature of the text feature map to obtain a second feature corresponding to the first feature of the text feature map, and then, based on the plurality of second features obtained respectively, the fused feature is obtained.
In the process of fusing the plurality of modal features, not only the relation of the visual features to the audio and text features is considered, but also the relation of the audio features to the visual and text features and the relation of the text features to the visual and audio features; that is, the plurality of second features corresponding to the plurality of first features are obtained through interaction and fusion among the plurality of modal features, which improves the degree of fusion of the second features, i.e., the fusion effect, and in turn the accuracy of tag identification.
It should be noted that the fused features may be obtained by performing feature stitching, feature addition, or feature multiplication based on the plurality of second features, which is not particularly limited in this application. It should be understood that the present application is not limited to the specific form of visual features, audio features, text features. For example, the visual feature, the audio feature, and the text feature may be a vector of a specific dimension, or may be a matrix of a specific dimension, which is not particularly limited in this application.
In some implementations, the weight corresponding to the jth feature is determined based on other first features other than the jth feature; and determining the product of the weight corresponding to the j-th feature and the j-th feature as a second feature corresponding to the j-th feature.
As an example, first, for a first feature of a visual feature map of a plurality of first features, a first weight corresponding to the first feature of the visual feature map is determined based on the first feature of an audio feature map and the first feature of a text feature map other than the first feature of the visual feature map, and then the first feature of the visual feature map is corrected by using the first weight to obtain a second feature corresponding to the first feature of the visual feature map; similarly, for a first feature of the audio feature map in the plurality of first features, first, determining a second weight corresponding to the first feature of the audio feature map based on the first feature of the visual feature map and the first feature of the text feature map other than the first feature of the audio feature map, and then correcting the first feature of the audio feature map by using the second weight to obtain a second feature corresponding to the first feature of the audio feature map; similarly, for a first feature of the text feature map of the plurality of first features, first, a third weight corresponding to the first feature of the text feature map is determined based on the first feature of the visual feature map and the first feature of the audio feature map other than the first feature of the text feature map, and then the first feature of the text feature map is corrected by using the third weight to obtain a second feature corresponding to the first feature of the text feature map.
Determining the weight corresponding to the jth feature based on other first features except the jth feature; and correcting the jth feature based on the weight of the jth feature, which is equivalent to preliminarily fusing other first features except the jth feature before determining the second feature corresponding to the jth feature, so as to improve the fusion degree of the second feature corresponding to the jth feature and the second feature corresponding to the other first features except the jth feature, and correspondingly, improve the accuracy of tag identification.
For example, the second feature corresponding to the first feature of the visual feature map, or the second feature corresponding to the first feature of the audio feature map, or the second feature corresponding to the first feature of the text feature map, may be determined based on the following formula (1):
Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V    (1)
where Q, K, V are the triplet vectors of the attention mechanism, and d_k denotes the dimension of K in the triplet.
Taking the text feature and the audio feature as an example, it is assumed that the text feature is composed of at least one word vector, where each word feature vector is 512-dimensional, the at least one word feature vector can be represented as a matrix, i.e. a third matrix, and the third matrix is mapped to a low-dimensional vector space, e.g. 64-dimensional, by three parameter matrices QM, KM, VM, to obtain a representation of the third matrix in the three low-dimensional vector space, i.e. Q, K, V of the third matrix. For example, the third matrix may be multiplied by QM, KM, VM, respectively, to obtain Q, K, V of the third matrix; assuming that the audio feature is made up of at least one audio feature vector, where each audio feature vector is 128-dimensional, the at least one audio feature vector may be represented as a matrix, i.e. a fourth matrix, which is mapped to a low-dimensional vector space, e.g. 64-dimensional, by means of three parameter matrices QM, KM, VM, to obtain a representation of the fourth matrix in the three low-dimensional vector space, i.e. Q, K, V of the fourth matrix. For example, the fourth matrix may be multiplied by QM, KM, VM, respectively, to obtain Q, K, V of the fourth matrix.
Matrix multiplication is performed on Q of the third matrix and K of the third matrix to obtain a matrix A, matrix multiplication is performed on Q of the fourth matrix and K of the fourth matrix to obtain a matrix B, and the matrices A and B are averaged to obtain a matrix A1; matrix A1 is then scaled, for example each element is divided by the square root of the dimension of the K vector, which prevents the inner product from becoming too large and avoids entering regions where the gradient is 0 during training.
In short, Q of the third matrix is matrix-multiplied with K of the third matrix, Q of the fourth matrix is matrix-multiplied with K of the fourth matrix, and the two multiplication results are respectively normalized and then averaged to obtain the weight corresponding to the first feature of the visual feature mapping; the first feature of the visual feature mapping is then corrected using this weight to obtain the second feature corresponding to the first feature of the visual feature mapping.
Note that Q, K, V of the third matrix or the fourth matrix may be obtained using multi-head attention, where "multi-head" refers to using multiple sets of initialization values when initializing the parameter matrices QM, KM, VM.
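The computation can be illustrated with the following sketch of formula (1); QM, KM and VM are the parameter matrices mentioned above, and the 64-dimensional low-dimensional space and the token counts are assumptions taken from the example.

# Sketch of formula (1): scaled dot-product attention over Q, K, V (shapes illustrative).
import math
import torch

def qkv(features, QM, KM, VM):
    # Map one modality's feature matrix to its Q, K, V representations.
    return features @ QM, features @ KM, features @ VM

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, as in formula (1).
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

text = torch.randn(10, 512)                          # ten 512-d word feature vectors (third matrix)
QM, KM, VM = (torch.randn(512, 64) for _ in range(3))
Q, K, V = qkv(text, QM, KM, VM)
second_features = attention(Q, K, V)                 # (10, 64) corrected features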
In some embodiments of the present application, the M-layer classifier is an M-layer classifier based on a unit of a multi-layer perceptron MLP, S203 may include:
splicing the fusion feature and the middle feature output by the last hidden layer in the i-1 layer classifier to obtain the spliced feature of the i layer classifier; and taking the spliced features of the ith layer classifier as input, and obtaining the intermediate features of the ith layer classifier by using the ith layer classifier.
Obtaining the intermediate feature of the i-th layer classifier by using the i-th layer classifier based on the fusion feature and the intermediate feature of the (i-1)-th layer classifier in the M-layer classifier is equivalent to having the i-th layer classifier take the intermediate features of the preceding i-1 layer classifiers into account in a layer-by-layer multiplexing manner; this improves the tag recognition accuracy of the i-th layer classifier and ultimately that of the M-th layer classifier, i.e., the accuracy of the label of the target image or video.
In some embodiments of the present application, S205 may include:
determining a first numerical value greater than a preset threshold in the probability distribution characteristics based on the probability distribution characteristics; identifying a tag corresponding to the first value in at least one tag; and determining the label corresponding to the first numerical value as the label of the target image or video, wherein the dimension of at least one label is equal to the dimension of the probability distribution characteristic.
It should be appreciated that the probability distribution feature may be a distribution of length or dimension N. And each bit or value in the probability distribution characteristics corresponds to one label, and the label corresponding to the first value larger than the preset threshold in the probability distribution characteristics is determined to be the label of the target image or video. In other words, the target image or video may be labeled with a first value in the probability distribution feature that is greater than a preset threshold.
It should be noted that the preset threshold may be a range of values or a specific value. Of course, the preset thresholds corresponding to tags of different levels may also differ partially or completely; for example, the preset threshold corresponding to an upper-level label may be greater than or equal to the preset threshold corresponding to a lower-level label. For instance, the label "dog" may correspond to a preset threshold of 8 or 9, and the label "Husky" to a preset threshold of 5 or 6. Of course, the specific values above are merely examples, and the present application is not limited thereto. Furthermore, a value in the probability distribution feature may be used to represent the estimated accuracy with which the label corresponding to that value is estimated to be a label of the target image or video. In addition, the labels corresponding to the bits or values of the probability distribution feature may have semantic relationships, such as hypernym-hyponym, similar, or opposite semantic relationships; for example, the labels "dog" and "Husky" have a hypernym-hyponym relationship, the labels "African elephant" and "Asian elephant" have a similar semantic relationship, and the labels "day" and "night" have an opposite semantic relationship.
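As a minimal sketch of this selection step, the following keeps every label whose value in the probability distribution feature exceeds its (possibly label-specific) preset threshold; the label names, thresholds and normalization are illustrative assumptions.

# Sketch of S205: threshold-based label selection (values and names illustrative).
from typing import Dict, List, Sequence

def select_labels(prob_feature: Sequence[float],
                  labels: Sequence[str],
                  thresholds: Dict[str, float],
                  default_threshold: float = 0.5) -> List[str]:
    # The dimension of the label list equals the dimension of the probability distribution feature.
    assert len(prob_feature) == len(labels)
    return [label for value, label in zip(prob_feature, labels)
            if value > thresholds.get(label, default_threshold)]

tags = select_labels([0.91, 0.40, 0.75], ["dog", "cat", "Husky"], {"dog": 0.8, "Husky": 0.6})
# tags == ["dog", "Husky"]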
An example in which the M-layer classifier is a three-layer classifier will be described in further detail below in conjunction with fig. 4, which is a schematic block diagram of three-layer classifier feature multiplexing provided in an embodiment of the present application.
As shown in fig. 4, the block diagram includes a first-layer classifier 410, a second-layer classifier 420, and a third-layer classifier 430. Each layer classifier is based on an MLP, i.e., each layer classifier includes an input layer, at least one hidden layer, and an output layer; the figure is illustrated with a single hidden layer as an example.
As shown in fig. 4, the fused feature is first sent to each of the classifiers. After the input layer of the MLP in the first-layer classifier 410 receives the fused feature, the hidden layer of the MLP in the first-layer classifier 410 outputs the intermediate feature of the first-layer classifier 410. The intermediate feature of the first-layer classifier 410 is then spliced with the fused feature, and the spliced feature is sent to the second-layer classifier 420. After the input layer of the MLP in the second-layer classifier 420 receives the spliced feature, the hidden layer of the MLP in the second-layer classifier 420 outputs the intermediate feature of the second-layer classifier 420. The intermediate feature of the second-layer classifier 420 is then spliced with the fused feature, and the spliced feature is sent to the third-layer classifier 430. After the input layer of the MLP in the third-layer classifier 430 receives the spliced feature, the output layer of the MLP in the third-layer classifier 430 outputs the probability distribution feature of the third-layer classifier 430, and the label of the target image or video is determined based on this probability distribution feature. Of course, the output layer of the MLP in each layer classifier may output a corresponding probability distribution feature, i.e., the probability distribution feature of each layer corresponds to the labels of that level.
Through this design of the multi-layer classifier with the MLP as its basic unit, the feature expression corresponding to the upper-level labels is effectively utilized, i.e., features are multiplexed across layers, which improves the classification performance of the multi-layer classifier and thus the label recognition accuracy.
For ease of understanding, the MLP will be described below.
A multi-layer perceptron (MLP, Multilayer Perceptron), also referred to as an artificial neural network (ANN, Artificial Neural Network), may have multiple hidden layers between its input layer and output layer; the simplest MLP has only one hidden layer, i.e., three layers in total. The MLP does not prescribe the number of hidden layers, so a suitable number can be chosen according to the requirements at hand, and there is likewise no restriction on the number of neurons in the output layer. The layers of the multi-layer perceptron are fully connected, meaning that any neuron of one layer is connected to all neurons of the next layer. For the input layer, if the input is an n-dimensional vector, there are n neurons. Denoting the input by the vector X, the output of the hidden layer is f(W1·X + b1), where W1 is the weight matrix (also called the connection coefficients), b1 is the bias, and the activation function f can be a commonly used sigmoid or tanh function. The mapping from the hidden layer to the output layer can be seen as multi-class logistic regression, i.e., softmax regression, so the output of the output layer is softmax(W2·X1 + b2), where X1 denotes the hidden-layer output f(W1·X + b1). If the softmax function outputs a k-dimensional column vector, the value in each dimension represents the probability of the corresponding class.
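The forward pass just described can be sketched as follows; the weight shapes, the random initialization and the choice of the sigmoid activation are assumptions used only to make the example runnable.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mlp_forward(X, W1, b1, W2, b2):
    X1 = sigmoid(W1 @ X + b1)        # hidden-layer output f(W1·X + b1)
    return softmax(W2 @ X1 + b2)     # output layer: softmax(W2·X1 + b2), a k-dimensional probability vector

# Example with an n=4-dimensional input, 3 hidden neurons and k=2 output classes (shapes are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=4)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
print(mlp_forward(X, W1, b1, W2, b2))   # two probabilities summing to 1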
MLPs are typically trained on mini-batches (Mini-Batch), where a mini-batch is a subset of training data randomly selected from the full training set T. Assuming the training set T contains N samples and each mini-batch has batch size b, the training data can be divided into N/b mini-batches. When the model is trained with SGD, processing one mini-batch is called one step, and after N/b steps the whole training set has been traversed once, which is called one epoch. After one epoch, the training data are randomly shuffled to disturb their order, the above steps are repeated, and training of the next epoch begins; the model is fully trained over multiple epochs.
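The step/epoch bookkeeping described above can be sketched as follows; the function name sgd_step stands for a hypothetical single SGD update and is not part of this application.

import random

def train(dataset, batch_size, num_epochs, sgd_step):
    # One step = one mini-batch; one epoch = N/b steps over the shuffled data.
    # dataset is assumed to be a mutable list of samples.
    n = len(dataset)
    steps_per_epoch = n // batch_size
    for epoch in range(num_epochs):
        random.shuffle(dataset)                          # random shuffling before each epoch
        for step in range(steps_per_epoch):
            mini_batch = dataset[step * batch_size:(step + 1) * batch_size]
            sgd_step(mini_batch)                         # one SGD update on this mini-batch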
In some embodiments of the present application, the method 200 may further comprise:
acquiring text information of the target image or video, the text information including at least one of: text of the target image or video, title of the target image or video, and annotation text of the target image or video; and supplementing or de-duplicating the label of the target image or video based on the text information to obtain the final label of the target image or video.
In this embodiment, based on the text information of the target image or video, the label of the target image or video is supplemented or de-duplicated to obtain the final label of the target image or video, so that the accuracy of the label of the target image or video can be further improved.
In one implementation, the text information is segmented to obtain a plurality of segmented words corresponding to the target image or video; matching the plurality of segmentation words with a custom dictionary to obtain a first tag set of the target image or video; based on the first tag set, the tags of the target image or video are supplemented or de-duplicated.
For example, the text information may first be recognized according to a knowledge graph and segmented to obtain a plurality of entities of the text information; each entity is then matched against a custom dictionary to obtain the first tag set, and finally the labels of the target image or video are supplemented or de-duplicated according to the labels in the first tag set. Of course, the manner of word segmentation of the text information is not particularly limited in the present application.
In one implementation, the labels of the target image or video are supplemented or de-duplicated using the semantic correlation between the labels of the target image or video and the first tag set and/or a de-duplication number threshold, i.e., a threshold on the number of labels retained.
As an example, the semantic relationship between the labels of the target image or video and the labels in the first tag set may be used to supplement or de-duplicate the labels of the target image or video. Taking the hypernym-hyponym relationship as an example, the labels "dog" and "Husky" have such a relationship, and the subordinate (more specific) label may be preferred when supplementing or de-duplicating: if the labels of the target image or video lack the subordinate label, it is supplemented; if both the superordinate and the subordinate label are present, the redundant one is removed, so that the labels of the target image or video become more accurate.
As another example, the labels of the target image or video may be supplemented or de-duplicated using a threshold on the number of labels: a maximum number of labels for the target image or video may be designed manually in advance, and if the number of obtained labels exceeds this threshold, the redundant labels are removed according to the semantic-correlation rule above.
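A hedged sketch of the supplement/de-duplication logic described in the two examples above; the function refine_tags, the mapping hyponym_of and the parameter max_tags are hypothetical names introduced for illustration, and the preference for the subordinate label is one possible reading of the rule.

def refine_tags(model_tags, text_tags, hyponym_of, max_tags):
    # model_tags: labels output by the classifier; text_tags: the first tag set matched from the text;
    # hyponym_of maps a subordinate tag to its superordinate tag, e.g. {"Husky": "dog"}.
    tags = list(dict.fromkeys(model_tags))               # drop exact duplicates, keep order
    for tag in text_tags:
        if tag not in tags:
            tags.append(tag)                             # supplement from the first tag set
    # prefer the subordinate tag: drop a tag if one of its hyponyms is also present
    superordinates = {hyponym_of[t] for t in tags if t in hyponym_of}
    tags = [t for t in tags if t not in superordinates]
    return tags[:max_tags]                               # enforce the number threshold

print(refine_tags(["dog", "dog"], ["Husky"], {"Husky": "dog"}, max_tags=5))  # -> ['Husky']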
It should be understood that the above manner of supplementing or de-duplicating the tag of the target image or video is merely an example of the present application, and of course, the tag of the target image or video may also be supplemented or de-duplicated in other manners, which is not specifically limited in this application.
Obtaining the first tag set based on the text information of the target image or video and using it to supplement or de-duplicate the obtained labels is equivalent to first improving the classification accuracy of the M-layer classifier and then further improving the accuracy of the finally generated labels with the first tag set, so that the resulting labels of the target image or video have higher practical value.
Fig. 5 is a schematic flow chart of a method 500 of training a tag recognition model provided in an embodiment of the present application.
As shown in fig. 5, the method 500 may include some or all of the following:
S501, acquiring an image or video to be trained;
S502, extracting a plurality of modal features of the image or video to be trained;
S503, performing feature fusion on the plurality of modal features to obtain a fusion feature after the plurality of modal features are fused;
S504, acquiring the labeling label of the i-th level corresponding to the image or video to be trained;
S505, training the i-th layer classifier by taking the fusion feature, the intermediate feature of the (i-1)-th layer classifier in the M-layer classifier, and the labeling label of the i-th level as inputs, so as to obtain a label identification model, where i is greater than 1 and less than or equal to M, the first-layer classifier in the M-layer classifier is obtained by training based on the fusion feature and the labeling label of the first level, and the intermediate feature of the first-layer classifier in the M-layer classifier is obtained based on the fusion feature.
Based on this scheme, a plurality of modal features of the image or video to be trained are extracted and fused, which is equivalent to expressing the image or video to be trained by fusing its multiple modal features, so that its feature expression is more sufficient and the recognition accuracy of the model for labels can be improved.
In addition, training the i-th layer classifier with the fusion feature, the intermediate feature of the (i-1)-th layer classifier in the M-layer classifier, and the labeling label of the i-th level as inputs to obtain the label identification model is equivalent to having the later-layer classifier multiplex the intermediate features of the earlier-layer classifiers. On one hand, this improves the label recognition accuracy of the later-layer classifier, i.e., the accuracy of the labels output by the model for the image or video; on the other hand, because the later-layer classifier multiplexes the intermediate features of the earlier-layer classifiers, the convergence rate of the model is increased, the training time is shortened, and the training efficiency is improved.
Moreover, because the accuracy of the labels identified by the model is improved, using the labels output by the label identification model to retrieve images or videos on an internet platform can further improve retrieval quality and efficiency; meanwhile, recommending videos to users based on the identified labels can greatly improve the user experience of the product.
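A minimal training-step sketch of S501-S505, reusing the hypothetical HierarchicalClassifier from the earlier sketch; the multi-label binary cross-entropy loss and the per-level target encoding are assumptions, not requirements of this application.

import torch
import torch.nn as nn

def train_step(model, optimizer, fused_feature, level_targets):
    # fused_feature: (batch, fused_dim); level_targets[i]: multi-hot labels of level i+1.
    # The i-th classifier inside the model reuses the intermediate feature of the (i-1)-th classifier.
    optimizer.zero_grad()
    logits_per_level = model(fused_feature)
    criterion = nn.BCEWithLogitsLoss()
    loss = sum(criterion(logits, target)
               for logits, target in zip(logits_per_level, level_targets))
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (hypothetical): model = HierarchicalClassifier(fused_dim=1024, hidden_dim=512,
#                                                      num_labels_per_level=[20, 100, 500])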
It should be noted that the plurality of modal features may include, but are not limited to, visual features, audio features, and text features. Extracting the visual features of the target image or video may be, but is not limited to, extracting visual features based on a residual network (ResNet) backbone within the SlowFast video classification framework; extracting the audio features may be, but is not limited to, extracting audio features based on the VGGish framework; and extracting the text features may be, but is not limited to, extracting text features using a BERT framework. Text information may also be supplemented using optical character recognition (OCR) or automatic speech recognition (ASR) while using the BERT framework, for example by separating the audio from the target video and obtaining the spoken text in the audio via ASR, which is not particularly limited in the present application. It should also be noted that the feature fusion of the plurality of modal features may be, but is not limited to, feature fusion based on a Transformer framework, which is likewise not particularly limited in the present application. In addition, the label identification model in the present application may be a model built with a multi-layer perceptron (MLP, Multilayer Perceptron) as its basic unit; of course, other frameworks may also be used, as long as the framework enables the later-layer classifier to multiplex the intermediate features of the earlier-layer classifiers, which is not particularly limited in the present application.
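Purely as an illustration of how the three modal features might be assembled before fusion, the following sketch uses extract_visual, extract_audio and extract_text as hypothetical wrappers around the frameworks mentioned above (they are not real library calls), and the attributes frames, audio, title, ocr_text and asr_text are likewise assumed.

def extract_modal_features(video, extract_visual, extract_audio, extract_text):
    # Each extractor is assumed to return a 1-D feature vector for its modality,
    # e.g. visual via a ResNet/SlowFast backbone, audio via VGGish, text via BERT
    # (with the text optionally supplemented by OCR and ASR transcripts).
    return {
        "visual": extract_visual(video.frames),
        "audio": extract_audio(video.audio),
        "text": extract_text(" ".join([video.title, video.ocr_text, video.asr_text])),
    }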
In some embodiments of the present application, S503 may include:
mapping the plurality of modal features into a plurality of first features of the same dimension respectively;
and carrying out modal and position coding on the plurality of first features to obtain the fusion features.
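A minimal sketch of mapping the modal features to a common dimension and adding modal-type and position encodings; the additive embeddings and the use of the modality index as the position index are simplifying assumptions.

import torch
import torch.nn as nn

class ModalEncoding(nn.Module):
    def __init__(self, modal_dims, common_dim, max_positions=16):
        super().__init__()
        # one projection per modality, mapping each modal feature to the same dimension
        self.proj = nn.ModuleList([nn.Linear(d, common_dim) for d in modal_dims])
        self.modal_emb = nn.Embedding(len(modal_dims), common_dim)   # modal-type encoding
        self.pos_emb = nn.Embedding(max_positions, common_dim)       # position encoding

    def forward(self, modal_features):
        # modal_features: list of tensors, one per modality, each of shape (batch, modal_dims[k])
        tokens = []
        for k, feat in enumerate(modal_features):
            first = self.proj[k](feat)                               # "first feature" of modality k
            idx = torch.full((feat.shape[0],), k, dtype=torch.long)
            tokens.append(first + self.modal_emb(idx) + self.pos_emb(idx))
        return torch.stack(tokens, dim=1)                            # (batch, num_modalities, common_dim)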
In some embodiments of the present application, the performing of modal and position coding on the plurality of first features to obtain the fusion feature may include:
for the j-th feature in the plurality of first features, correcting the j-th feature based on the other first features except the j-th feature to obtain a second feature corresponding to the j-th feature;
the fusion feature is determined based on a plurality of second features corresponding to the plurality of first features, respectively.
In some embodiments of the present application, the correcting of the j-th feature based on the other first features except the j-th feature may include:
determining the weight corresponding to the jth feature based on other first features except the jth feature;
and determining the product of the weight corresponding to the j-th feature and the j-th feature as a second feature corresponding to the j-th feature.
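One possible reading of this weighting step is sketched below: the weight of the j-th first feature is computed from the other first features, and the second feature is the elementwise product of that weight and the j-th feature; the sigmoid gating layer and the mean-pooling of the other features are assumptions.

import torch
import torch.nn as nn

class CrossFeatureWeighting(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())   # produces per-dimension weights

    def forward(self, first_features):
        # first_features: (batch, num_modalities, dim), num_modalities >= 2 assumed
        second_features = []
        n = first_features.shape[1]
        for j in range(n):
            others = torch.cat([first_features[:, :j], first_features[:, j + 1:]], dim=1)
            weight = self.gate(others.mean(dim=1))                     # weight of the j-th feature from the other features
            second_features.append(weight * first_features[:, j])      # second feature = weight * j-th feature
        return torch.stack(second_features, dim=1)                     # basis for determining the fusion feature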
In some embodiments of the present application, prior to S505, the method 500 may further include:
acquiring text information of the image or video to be trained, the text information including at least one of: text of the image or video to be trained, a title of the image or video to be trained, and annotation text of the image or video to be trained; and supplementing or de-duplicating the labeling label of the i-th level based on the text information to obtain a final labeling label of the i-th level;
Wherein S505 may include:
and training the ith layer classifier by taking the fusion characteristic, the intermediate characteristic of the ith-1 layer classifier in the M layer classifier and the final labeling label of the ith level as inputs so as to obtain the identification label model.
In some embodiments of the present application, the text information is segmented to obtain a plurality of word segments corresponding to the image or video to be trained; the plurality of word segments are matched against a custom dictionary to obtain a first tag set of the image or video to be trained; and the labeling label of the i-th level is supplemented or de-duplicated based on the first tag set to obtain the final labeling label of the i-th level.
In some embodiments of the present application, the method 500 may further comprise:
and supplementing or de-duplicating the labeling label of the i-th level by utilizing the semantic correlation between the labeling label of the i-th level and the first label set and/or the de-duplication number threshold of the labeling label of the i-th level and the first label set.
It should be noted that the scheme for fusing the plurality of modal features in the method of training the label identification model may be the same as the scheme for fusing the plurality of modal features in the label identification method, and is not repeated here. Likewise, the manner of obtaining the intermediate features of the (i-1)-th layer classifier in the method of training the label identification model may be the same as the manner of obtaining the intermediate features of the (i-1)-th layer classifier in the label identification method.
The preferred embodiments of the present application have been described in detail above with reference to the accompanying drawings, but the present application is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solutions of the present application within the scope of the technical concept of the present application, and all the simple modifications belong to the protection scope of the present application. For example, the specific features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various possible combinations are not described in detail. As another example, any combination of the various embodiments of the present application may be made without departing from the spirit of the present application, which should also be considered as disclosed herein.
It should be further understood that, in the various method embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present application.
The method provided by the embodiment of the application is described above, and the device provided by the embodiment of the application is described below.
Fig. 6 is a schematic block diagram of an apparatus 600 for identifying tags provided by an embodiment of the present application.
As shown in fig. 6, the apparatus 600 may include:
an extracting unit 610 for extracting a plurality of modality features of a target image or video;
a fusion unit 620, configured to perform feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
a first determining unit 630, configured to obtain intermediate features of the ith layer classifier by using the ith layer classifier based on the fusion feature and intermediate features of the ith-1 layer classifier in the M layer classifiers until intermediate features of the M layer classifier are obtained; the i is more than 1 and less than or equal to M, the characteristics output by the ith layer of classifier in the M layers of classifiers are used for identifying the label of the ith level in M levels, and the intermediate characteristics of the first layer of classifier in the M layers of classifiers are obtained based on the fusion characteristics;
an output unit 640 for outputting probability distribution characteristics using the M-th layer classifier based on the intermediate characteristics of the M-th layer classifier;
a second determining unit 650 for determining a label of the target image or video based on the probability distribution characteristics.
In some embodiments of the present application, the fusion unit 620 is configured to:
Mapping the plurality of modal features into a plurality of first features of the same dimension respectively;
and carrying out modal and position coding on the plurality of first features to obtain the fusion features.
In some embodiments of the present application, the fusion unit 620 is configured to:
correcting the jth feature based on other first features except the jth feature aiming at the jth feature in the plurality of first features to obtain a second feature corresponding to the jth feature;
the fusion feature is determined based on a plurality of second features corresponding to the plurality of first features, respectively.
In some embodiments of the present application, the fusion unit 620 is configured to:
determining the weight corresponding to the jth feature based on other first features except the jth feature;
and determining the product of the weight corresponding to the j-th feature and the j-th feature as a second feature corresponding to the j-th feature.
In some embodiments of the present application, the first determining unit 630 is configured to:
based on the fusion feature and the intermediate feature of the ith-1 layer classifier in the M-layer classifier, obtaining the intermediate feature of the ith layer classifier by using the ith layer classifier comprises the following steps:
splicing the fusion feature and the middle feature output by the last hidden layer in the i-1 layer classifier to obtain the spliced feature of the i layer classifier;
And taking the spliced features of the ith layer classifier as input, and obtaining the intermediate features of the ith layer classifier by using the ith layer classifier.
In some embodiments of the present application, the second determining unit 650 is configured to:
determining a first numerical value greater than a preset threshold in the probability distribution characteristics based on the probability distribution characteristics;
identifying a tag corresponding to the first value in at least one tag;
and determining the label corresponding to the first numerical value as the label of the target image or video, wherein the dimension of at least one label is equal to the dimension of the probability distribution characteristic.
In some embodiments of the present application, the extraction unit 610 is further configured to:
acquiring text information of the target image or video, the text information including at least one of: text of the target image or video, title of the target image or video, and annotation text of the target image or video;
and supplementing or de-duplicating the label of the target image or video based on the text information to obtain the final label of the target image or video.
In some embodiments of the present application, the first determining unit 630 is further configured to:
word segmentation is carried out on the text information, and a plurality of word segments corresponding to the target image or video are obtained;
Matching the plurality of segmentation words with a custom dictionary to obtain a first tag set of the target image or video;
based on the first tag set, the tags of the target image or video are supplemented or de-duplicated.
In some embodiments of the present application, the first determining unit 630 is further configured to:
and supplementing or de-duplicating the label of the target image or video by utilizing the semantic correlation of the label of the target image or video and the first label set and/or the de-duplication number threshold of the label of the target image or video and the first label set.
Fig. 7 is a schematic block diagram of an apparatus 700 for training a tag recognition model provided in an embodiment of the present application.
As shown in fig. 7, the apparatus 700 may include:
a first acquiring unit 710, configured to acquire an image or video to be trained;
an extracting unit 720, configured to extract a plurality of modal features of the image or video to be trained;
a fusion unit 730, configured to perform feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
a second obtaining unit 740, configured to obtain an ith level label corresponding to the image or video to be trained;
the training unit 750 is configured to train the ith layer classifier by using the fusion feature, the intermediate feature of the ith-1 layer classifier in the M layer classifier, and the label tag of the ith level as inputs, so as to obtain a tag recognition model, where i is greater than 1 and less than or equal to M, and a first layer classifier in the M layer classifier is obtained by training based on the fusion feature and the label tag of the first level, and the intermediate feature of the first layer classifier in the M layer classifier is obtained based on the fusion feature.
In some embodiments of the present application, the fusion unit 730 is configured to:
mapping the plurality of modal features into a plurality of first features of the same dimension respectively;
and carrying out modal and position coding on the plurality of first features to obtain the fusion features.
In some embodiments of the present application, the fusion unit 730 is configured to:
correcting the jth feature based on other first features except the jth feature aiming at the jth feature in the plurality of first features to obtain a second feature corresponding to the jth feature;
the fusion feature is determined based on a plurality of second features corresponding to the plurality of first features, respectively.
In some embodiments of the present application, the fusion unit 730 is configured to:
determining the weight corresponding to the jth feature based on other first features except the jth feature;
and determining the product of the weight corresponding to the j-th feature and the j-th feature as a second feature corresponding to the j-th feature.
In some embodiments of the present application, the first obtaining unit 710 is further configured to:
acquiring text information of the image or video to be trained, the text information including at least one of: text of the image or video to be trained, a title of the image or video to be trained, and annotation text of the image or video to be trained; and supplementing or de-duplicating the labeling label of the i-th level based on the text information to obtain a final labeling label of the i-th level;
The training unit 750 may specifically be used for:
and training the ith layer classifier by taking the fusion characteristic, the intermediate characteristic of the ith-1 layer classifier in the M layer classifier and the final labeling label of the ith level as inputs so as to obtain the identification label model.
In some embodiments of the present application, the first obtaining unit 710 is further configured to:
word segmentation is carried out on the text information to obtain a plurality of word segments corresponding to the image or video to be trained; the plurality of word segments are matched against a custom dictionary to obtain a first tag set of the image or video to be trained; and the labeling label of the i-th level is supplemented or de-duplicated based on the first tag set to obtain the final labeling label of the i-th level.
In some embodiments of the present application, the first obtaining unit 710 is further configured to:
and supplementing or de-duplicating the labeling label of the i-th level by utilizing the semantic correlation of the labeling label of the i-th level and the first label set and/or the de-duplication number threshold of the labeling label of the i-th level and the first label set.
It should be understood that the apparatus embodiments and the method embodiments correspond to each other, and similar descriptions may refer to the method embodiments; to avoid repetition, they are not repeated here. Specifically, the apparatus 600 may correspond to the subject performing the method 200 in the embodiments of the present application, and the apparatus 700 may correspond to the subject performing the method 500; each unit in the apparatus 600 or the apparatus 700 is respectively for implementing the corresponding flow in the method 200 or the method 500, which, for brevity, is not described here again.
It should also be understood that the units of the apparatus 600 or the apparatus 700 in the embodiments of the present application may be separately or jointly combined into one or several other units, or one (or more) of them may be further split into several functionally smaller units, without affecting the achievement of the technical effects of the embodiments of the present application. The above units are divided on the basis of logical functions; in practical applications, the function of one unit may be implemented by a plurality of units, or the functions of a plurality of units may be implemented by one unit. In other embodiments of the present application, the apparatus 600 or 700 may also include other units, and in practical applications these functions may also be implemented with the assistance of, or in cooperation with, other units. According to another embodiment of the present application, the apparatus 600 or 700 may be constructed, and the method of identifying a label or the method of training a label identification model of the embodiments of the present application may be implemented, by running a computer program (including program code) capable of executing the steps of the corresponding methods on a general-purpose computing device, such as a computer, that includes processing elements such as a central processing unit (CPU) and storage elements such as a random access storage medium (RAM) and a read-only storage medium (ROM). The computer program may be recorded on a computer-readable storage medium, loaded into an electronic device through the computer-readable storage medium, and executed therein to implement the corresponding methods of the embodiments of the present application.
In other words, the units referred to above may be implemented in hardware, by instructions in software, or by a combination of hardware and software. Specifically, each step of the method embodiments in the embodiments of the present application may be implemented by integrated logic circuits of hardware in a processor and/or by instructions in the form of software, and the steps of the methods disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor or by a combination of hardware and software in a decoding processor. Alternatively, the software may reside in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method embodiments in combination with its hardware.
Fig. 8 is a schematic structural diagram of an electronic device 800 provided in an embodiment of the present application.
As shown in fig. 8, the electronic device 800 includes at least a processor 810 and a computer-readable storage medium 820. Wherein the processor 810 and the computer-readable storage medium 820 may be connected by a bus or other means. The computer-readable storage medium 820 is configured to store a computer program 821, the computer program 821 including computer instructions, and the processor 810 is configured to execute the computer instructions stored by the computer-readable storage medium 820. Processor 810 is a computing core and a control core of electronic device 800 that are adapted to implement one or more computer instructions, in particular to load and execute one or more computer instructions to implement a corresponding method flow or a corresponding function.
By way of example, the processor 810 may also be referred to as a central processing unit (Central Processing Unit, CPU). The processor 810 may include, but is not limited to: a general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
By way of example, computer-readable storage medium 820 may be high-speed RAM memory or Non-volatile memory (Non-Volatilememory), such as at least one magnetic disk memory; alternatively, it may be at least one computer-readable storage medium located remotely from the aforementioned processor 810. In particular, computer-readable storage media 820 includes, but is not limited to: volatile memory and/or nonvolatile memory. The nonvolatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable EPROM (EEPROM), or a flash Memory. The volatile memory may be random access memory (Random Access Memory, RAM) which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (Double Data Rate SDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), and Direct memory bus RAM (DR RAM).
In one implementation, the electronic device 800 may be the apparatus 600 for identifying tags shown in FIG. 6 or the apparatus 700 for training a tag identification model shown in FIG. 7; the computer-readable storage medium 820 has stored therein computer instructions; computer instructions stored in computer-readable storage medium 820 are loaded and executed by processor 810 to implement the corresponding steps in the method embodiments shown in fig. 2-5; in particular implementations, computer instructions in the computer-readable storage medium 820 are loaded by the processor 810 and perform the corresponding steps, which are not repeated here.
According to another aspect of the present application, the embodiments of the present application also provide a computer-readable storage medium (Memory), which is a Memory device in the electronic device 800, for storing programs and data. Such as computer-readable storage medium 820. It is understood that the computer readable storage medium 820 herein may include both built-in storage media in the electronic device 800 and extended storage media supported by the electronic device 800. The computer-readable storage medium provides storage space that stores an operating system of the electronic device 800. Also stored in this memory space are one or more computer instructions, which may be one or more computer programs 821 (including program code), adapted to be loaded and executed by the processor 810.
According to another aspect of the present application, there is provided a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium, for example the computer program 821. In this case, the electronic device 800 may be a computer, the processor 810 reads the computer instructions from the computer-readable storage medium 820, and the processor 810 executes the computer instructions, so that the computer performs the method of identifying a label or the method of training a label identification model provided in the various alternative manners described above.
In other words, when implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions of the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
Those of ordinary skill in the art will appreciate that the elements and process steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Finally, it should be noted that the above is only a specific embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about the changes or substitutions within the technical scope of the present application, and the changes or substitutions are covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A method of identifying a tag, comprising:
extracting a plurality of modal features of a target image or video; the plurality of modal features includes visual features, audio features, and text features of the target image or video;
Performing feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
based on the fusion characteristics and the intermediate characteristics of the ith-1 layer classifier in the M layer classifier, obtaining the intermediate characteristics of the ith layer classifier by utilizing the ith layer classifier until obtaining the intermediate characteristics of the M layer classifier; the i is more than 1 and less than or equal to M, the characteristics output by the ith layer of classifier in the M layers of classifiers are used for identifying the label of the ith level in M levels, and the intermediate characteristics of the first layer of classifier in the M layers of classifiers are obtained based on the fusion characteristics;
based on the intermediate features of the M-th layer classifier, outputting probability distribution features by using the M-th layer classifier;
determining a label of the target image or video based on the probability distribution characteristics;
the method further comprises the steps of:
acquiring text information of the target image or video, wherein the text information comprises at least one of the following: the text of the target image or video, the title of the target image or video, and the labeling text of the target image or video;
word segmentation is carried out on the text information, and a plurality of word segments corresponding to the target image or video are obtained;
Matching the plurality of segmentation words with a custom dictionary to obtain a first tag set of the target image or video;
and supplementing or de-duplicating the label of the target image or video based on the first label set.
2. The method of claim 1, wherein the feature fusing the plurality of modal features to obtain fused features of the plurality of modal features comprises:
mapping the plurality of modal features into a plurality of first features with the same dimension respectively;
and carrying out modal and position coding on the plurality of first features to obtain the fusion features.
3. The method of claim 2, wherein the performing modal and position encoding on the plurality of first features to obtain the fused features comprises:
correcting the jth feature based on other first features except the jth feature aiming at the jth feature in the plurality of first features to obtain a second feature corresponding to the jth feature;
and determining the fusion characteristic based on a plurality of second characteristics corresponding to the plurality of first characteristics respectively.
4. A method according to claim 3, wherein said modifying the jth feature based on other first features than the jth feature to obtain a second feature corresponding to the jth feature comprises:
Determining the weight corresponding to the j-th feature based on other first features except the j-th feature;
and determining the product of the weight corresponding to the j-th feature and the j-th feature as a second feature corresponding to the j-th feature.
5. The method of claim 1, wherein the M-layer classifier is an M-layer classifier based on a multi-layer perceptron MLP unit;
the step of obtaining the intermediate feature of the ith layer classifier by using the ith layer classifier based on the fusion feature and the intermediate feature of the ith-1 layer classifier in the M layer classifier comprises the following steps:
splicing the fusion feature and the middle feature output by the last hidden layer in the i-1 layer classifier to obtain the spliced feature of the i layer classifier;
and taking the spliced features of the ith layer classifier as input, and obtaining the intermediate features of the ith layer classifier by using the ith layer classifier.
6. The method of claim 1, wherein the determining the label of the target image or video based on the probability distribution characteristics comprises:
determining a first numerical value greater than a preset threshold in the probability distribution characteristics based on the probability distribution characteristics;
Identifying a label corresponding to the first numerical value in at least one label;
and determining the label corresponding to the first numerical value as the label of the target image or video, wherein the dimension of at least one label is equal to the dimension of the probability distribution characteristic.
7. The method of claim 1, wherein the supplementing or de-duplicating the tags of the target image or video based on the first tag set comprises:
and supplementing or de-duplicating the label of the target image or video by utilizing the semantic correlation of the label of the target image or video and the first label set and/or the de-duplication number threshold of the label of the target image or video and the first label set.
8. A method of training a tag recognition model, comprising:
acquiring an image or video to be trained;
extracting a plurality of modal characteristics of the image or video to be trained; the plurality of modal features includes visual features, audio features, and text features of the image or video to be trained;
performing feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
Acquiring an ith-level labeling label corresponding to the image or video to be trained;
training an ith layer classifier by taking the fusion characteristic, the middle characteristic of an ith-1 layer classifier in an M layer classifier and the labeling label of the ith level as inputs to obtain a label identification model, wherein i is more than 1 and less than or equal to M, a first layer classifier in the M layer classifier is obtained by training based on the fusion characteristic and the labeling label of the first level, and the middle characteristic of the first layer classifier in the M layer classifier is obtained based on the fusion characteristic;
the training the ith layer classifier by taking the fusion feature, the intermediate feature of the ith-1 layer classifier in the M layer classifier and the labeling label of the ith level as inputs to obtain a label identification model comprises the following steps:
acquiring text information of an image or video to be trained, wherein the text information comprises at least one of the following: the text of the image or video to be trained, the title of the image or video to be trained, and the labeling text of the image or video to be trained;
word segmentation is carried out on the text information, and a plurality of word segments corresponding to the image or video to be trained are obtained;
Matching the plurality of word segments with a custom dictionary to obtain a first tag set of the image or video to be trained;
supplementing or de-duplicating the labeling label of the ith level based on the first label set;
and training the ith layer classifier by taking the fusion characteristic, the intermediate characteristic of the ith-1 layer classifier in the M layer classifier and the final labeling label of the ith level as inputs so as to obtain the label identification model.
9. An apparatus for identifying a tag, comprising:
the extraction unit is used for extracting a plurality of modal characteristics of the target image or video; the plurality of modal features includes visual features, audio features, and text features of the target image or video;
the fusion unit is used for carrying out feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
the first determining unit is used for obtaining the intermediate feature of the ith layer classifier by utilizing the ith layer classifier based on the fusion feature and the intermediate feature of the ith-1 layer classifier in the M layer classifier until the intermediate feature of the M layer classifier is obtained; the i is more than 1 and less than or equal to M, the characteristics output by the ith layer of classifier in the M layers of classifiers are used for identifying the label of the ith level in M levels, and the intermediate characteristics of the first layer of classifier in the M layers of classifiers are obtained based on the fusion characteristics;
The output unit is used for outputting probability distribution characteristics by utilizing the M-layer classifier based on the intermediate characteristics of the M-layer classifier;
a second determining unit configured to determine a tag of the target image or video based on the probability distribution characteristics;
the first determining unit is further configured to:
acquiring text information of the target image or video, wherein the text information comprises at least one of the following: the text of the target image or video, the title of the target image or video, and the labeling text of the target image or video;
word segmentation is carried out on the text information, and a plurality of word segments corresponding to the target image or video are obtained;
matching the plurality of segmentation words with a custom dictionary to obtain a first tag set of the target image or video;
and supplementing or de-duplicating the label of the target image or video based on the first label set.
10. An apparatus for training a tag recognition model, comprising:
the first acquisition unit is used for acquiring images or videos to be trained;
the extraction unit is used for extracting a plurality of modal characteristics of the image or video to be trained; the plurality of modal features includes visual features, audio features, and text features of the image or video to be trained;
The fusion unit is used for carrying out feature fusion on the plurality of modal features to obtain fusion features after the plurality of modal features are fused;
the second acquisition unit is used for acquiring the ith-level labeling label corresponding to the image or video to be trained;
the training unit is used for taking the fusion characteristics, the middle characteristics of the ith-1 layer classifier in the M layer classifier and the labeling label of the ith level as inputs, training the ith layer classifier to obtain a label identification model, i is more than 1 and less than or equal to M, a first layer classifier in the M layer classifier is obtained by training based on the fusion characteristics and the labeling label of the first level, and the middle characteristics of the first layer classifier in the M layer classifier are obtained based on the fusion characteristics;
wherein, training unit is specifically used for:
acquiring text information of an image or video to be trained, wherein the text information comprises at least one of the following: the text of the image or video to be trained, the title of the image or video to be trained, and the labeling text of the image or video to be trained;
word segmentation is carried out on the text information, and a plurality of word segments corresponding to the image or video to be trained are obtained;
Matching the plurality of word segments with a custom dictionary to obtain a first tag set of the image or video to be trained;
supplementing or de-duplicating the labeling label of the ith level based on the first label set;
and training the ith layer classifier by taking the fusion characteristic, the intermediate characteristic of the ith-1 layer classifier in the M layer classifier and the final labeling label of the ith level as inputs so as to obtain the label identification model.
11. An electronic device, comprising:
a processor adapted to execute a computer program;
a computer readable storage medium having stored therein a computer program which, when executed by the processor, implements the method of any one of claims 1 to 7 or the method of claim 8.
12. A computer readable storage medium storing a computer program for causing a computer to perform the method of any one of claims 1 to 7 or the method of claim 8.
CN202110662545.9A 2021-06-15 2021-06-15 Label identification method, label identification model training method, device and equipment Active CN113836992B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110662545.9A CN113836992B (en) 2021-06-15 2021-06-15 Label identification method, label identification model training method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110662545.9A CN113836992B (en) 2021-06-15 2021-06-15 Label identification method, label identification model training method, device and equipment

Publications (2)

Publication Number Publication Date
CN113836992A CN113836992A (en) 2021-12-24
CN113836992B true CN113836992B (en) 2023-07-25

Family

ID=78962663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110662545.9A Active CN113836992B (en) 2021-06-15 2021-06-15 Label identification method, label identification model training method, device and equipment

Country Status (1)

Country Link
CN (1) CN113836992B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398973B (en) * 2022-01-07 2024-04-16 腾讯科技(深圳)有限公司 Media content tag identification method, device, equipment and storage medium
CN114821401A (en) * 2022-04-07 2022-07-29 腾讯科技(深圳)有限公司 Video auditing method, device, equipment, storage medium and program product
CN115052201A (en) * 2022-05-17 2022-09-13 阿里巴巴(中国)有限公司 Video editing method and electronic equipment
CN115879473B (en) * 2022-12-26 2023-12-01 淮阴工学院 Chinese medical named entity recognition method based on improved graph attention network
CN117421497B (en) * 2023-11-02 2024-04-26 北京蜂鸟映像电子商务有限公司 Work object processing method and device, readable storage medium and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11176423B2 (en) * 2016-10-24 2021-11-16 International Business Machines Corporation Edge-based adaptive machine learning for object recognition
CN112231275B (en) * 2019-07-14 2024-02-27 阿里巴巴集团控股有限公司 Method, system and equipment for classifying multimedia files, processing information and training models

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007114796A1 (en) * 2006-04-05 2007-10-11 Agency For Science, Technology And Research Apparatus and method for analysing a video broadcast
CN110162669A (en) * 2019-04-04 2019-08-23 腾讯科技(深圳)有限公司 Visual classification processing method, device, computer equipment and storage medium
CN110688461A (en) * 2019-09-30 2020-01-14 中国人民解放军国防科技大学 Online text education resource label generation method integrating multi-source knowledge
CN110737801A (en) * 2019-10-14 2020-01-31 腾讯科技(深圳)有限公司 Content classification method and device, computer equipment and storage medium
CN110866184A (en) * 2019-11-11 2020-03-06 湖南大学 Short video data label recommendation method and device, computer equipment and storage medium
CN111401174A (en) * 2020-03-07 2020-07-10 北京工业大学 Volleyball group behavior identification method based on multi-mode information fusion
CN111222500A (en) * 2020-04-24 2020-06-02 腾讯科技(深圳)有限公司 Label extraction method and device
CN111626251A (en) * 2020-06-02 2020-09-04 Oppo广东移动通信有限公司 Video classification method, video classification device and electronic equipment
CN112069884A (en) * 2020-07-28 2020-12-11 中国传媒大学 Violent video classification method, system and storage medium
CN112347290A (en) * 2020-10-12 2021-02-09 北京有竹居网络技术有限公司 Method, apparatus, device and medium for identifying label
CN112183672A (en) * 2020-11-05 2021-01-05 北京金山云网络技术有限公司 Image classification method, and training method and device of feature extraction network
CN112348111A (en) * 2020-11-24 2021-02-09 北京达佳互联信息技术有限公司 Multi-modal feature fusion method and device in video, electronic equipment and medium
CN112364810A (en) * 2020-11-25 2021-02-12 深圳市欢太科技有限公司 Video classification method and device, computer readable storage medium and electronic equipment
CN112765403A (en) * 2021-01-11 2021-05-07 北京达佳互联信息技术有限公司 Video classification method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Network video classification method based on bidirectional propagation of heterogeneous information; Li Qian; Du Youtian; Xue Jiao; Journal of Computer Applications (Issue 08); full text *
Transductive multi-modal video semantic concept detection based on tensor representation; Wu Fei; Liu Yanan; Zhuang Yueting; Journal of Software (Issue 11); full text *

Also Published As

Publication number Publication date
CN113836992A (en) 2021-12-24

Similar Documents

Publication Publication Date Title
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN110414432B (en) Training method of object recognition model, object recognition method and corresponding device
WO2020238293A1 (en) Image classification method, and neural network training method and apparatus
CN111353076B (en) Method for training cross-modal retrieval model, cross-modal retrieval method and related device
US11238093B2 (en) Video retrieval based on encoding temporal relationships among video frames
CN111062215B (en) Named entity recognition method and device based on semi-supervised learning training
WO2021139191A1 (en) Method for data labeling and apparatus for data labeling
CN110851641B (en) Cross-modal retrieval method and device and readable storage medium
EP3926531B1 (en) Method and system for visio-linguistic understanding using contextual language model reasoners
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN111582409A (en) Training method of image label classification network, image label classification method and device
CN112347284B (en) Combined trademark image retrieval method
CN111475622A (en) Text classification method, device, terminal and storage medium
CN114896434B (en) Hash code generation method and device based on center similarity learning
CN114282055A (en) Video feature extraction method, device and equipment and computer storage medium
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN117453949A (en) Video positioning method and device
CN113704534A (en) Image processing method and device and computer equipment
CN113011320A (en) Video processing method and device, electronic equipment and storage medium
CN117392488A (en) Data processing method, neural network and related equipment
WO2023168818A1 (en) Method and apparatus for determining similarity between video and text, electronic device, and storage medium
CN116109980A (en) Action recognition method based on video text matching
CN115759262A (en) Visual common sense reasoning method and system based on knowledge perception attention network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant