CN118035945B - Label recognition model processing method and related device - Google Patents


Info

Publication number
CN118035945B
CN118035945B · Application CN202410441452.7A
Authority
CN
China
Prior art keywords
tag
content
model
features
tags
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410441452.7A
Other languages
Chinese (zh)
Other versions
CN118035945A
Inventor
杨善明 (Yang Shanming)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202410441452.7A priority Critical patent/CN118035945B/en
Publication of CN118035945A publication Critical patent/CN118035945A/en
Application granted granted Critical
Publication of CN118035945B publication Critical patent/CN118035945B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a processing method and a related device for a tag identification model, applicable to fields such as computer vision, natural language processing, and machine learning. A content sample, a plurality of tags, and a plurality of co-occurrence frequencies are acquired, and feature extraction is performed through an initial tag identification model according to the plurality of co-occurrence frequencies and the plurality of tags to obtain a first fusion tag feature of each tag. Identification is then performed through the initial tag identification model according to the content sample and the first fusion tag features to obtain a first identification tag for the content sample. Model parameters of the initial tag identification model are adjusted according to the difference between the first identification tag and the real tag to obtain a tag identification model, which is used to identify the tags of content. By introducing the semantics of the tags and the associations among them, the tag identification model can better understand content under the guidance of the tags, thereby improving the accuracy with which the model identifies content.

Description

Label recognition model processing method and related device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a processing method and a related device of a label identification model.
Background
With the development of internet technology, users can view various content through content platforms. To facilitate this, content platforms may add tags to content, such as person tags, scene tags, and genre tags. Various downstream tasks, such as content recommendation and content search, can then be accomplished based on the tags.
In the related art, to add a tag to content, the content is identified and the identification result is used as the tag. Taking video content as an example, the video is divided into a plurality of video frames, image recognition is performed on each frame to obtain a tag for that frame, and the tag of the video is finally determined from the tags of its frames.
However, tags determined in this manner have low accuracy, and downstream tasks completed based on such tags are accordingly less accurate. For example, in a content recommendation task, content recommended based on a low-accuracy tag is itself less accurate.
Disclosure of Invention
In order to solve the technical problems, the application provides a processing method and a related device of a tag identification model, which are used for improving the accuracy of determining the tag of the content.
The embodiment of the application discloses the following technical scheme:
in one aspect, an embodiment of the present application provides a method for processing a tag identification model, where the method includes:
Acquiring a content sample, a plurality of tags, and a plurality of co-occurrence frequencies, wherein a co-occurrence frequency identifies the frequency with which at least two of the plurality of tags jointly identify the same content, a tag identifies a characteristic of content, and the content sample is content having a real tag;
According to the plurality of co-occurrence frequencies and the plurality of tags, performing feature extraction through an initial tag identification model to obtain a first fusion tag feature of each tag, wherein the first fusion tag feature identifies the feature of the corresponding tag and the features of the tags that jointly identify the same content with the corresponding tag, and in the first fusion tag feature, the greater the co-occurrence frequency between the corresponding tag and a tag that jointly identifies the same content with it, the greater the influence of that tag's feature on the first fusion tag feature;
according to the content sample and the first fusion tag characteristics, identifying through the initial tag identification model to obtain a first identification tag aiming at the content sample;
And according to the difference between the first identification tag and the real tag, adjusting the model parameters of the initial tag identification model to obtain a tag identification model, wherein the tag identification model is used for identifying the tag of the content.
In another aspect, an embodiment of the present application provides a processing apparatus for a tag identification model, where the apparatus includes: the device comprises an acquisition unit, a feature extraction unit, an identification unit and an adjustment unit;
The obtaining unit is configured to obtain a content sample, a plurality of tags, and a plurality of co-occurrence frequencies, wherein a co-occurrence frequency identifies the frequency with which at least two of the plurality of tags jointly identify the same content, a tag identifies a characteristic of content, and the content sample is content having a real tag;
the feature extraction unit is configured to perform feature extraction through an initial tag identification model according to the plurality of co-occurrence frequencies and the plurality of tags to obtain a first fusion tag feature of each tag, wherein the first fusion tag feature identifies the feature of the corresponding tag and the features of the tags that jointly identify the same content with the corresponding tag, and in the first fusion tag feature, the greater the co-occurrence frequency between the corresponding tag and a tag that jointly identifies the same content with it, the greater the influence of that tag's feature on the first fusion tag feature;
The identification unit is used for identifying through the initial tag identification model according to the content sample and the first fusion tag characteristics to obtain a first identification tag aiming at the content sample;
the adjusting unit is used for adjusting the model parameters of the initial tag identification model according to the difference between the first identification tag and the real tag to obtain a tag identification model, wherein the tag identification model is used for identifying the tag of the content.
In another aspect, an embodiment of the present application provides a computer device including a processor and a memory:
the memory is used for storing a computer program and transmitting the computer program to the processor;
The processor is configured to perform the method of the above aspect according to instructions in the computer program.
In another aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program for executing the method described in the above aspect.
In another aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method described in the above aspect.
According to the technical scheme, not only a content sample but also a plurality of tags and a plurality of co-occurrence frequencies are acquired. A co-occurrence frequency identifies the frequency with which at least two of the plurality of tags jointly identify the same content, so the relationships and dependencies among the tags can be embodied through the co-occurrence frequencies. Feature extraction is performed through the initial tag identification model according to the plurality of tags and the plurality of co-occurrence frequencies to obtain the first fusion tag feature of each tag, so that the first fusion tag feature includes not only the feature of the corresponding tag but also the features of the tags that jointly identify the same content with it. Moreover, in the first fusion tag feature, the greater the co-occurrence frequency between the corresponding tag and a tag that jointly identifies the same content with it, the greater the influence of that tag's feature on the first fusion tag feature. That is, the initial tag identification model can better understand the semantics of each tag and the associations among the tags, and this guides it to purposefully understand the content sample: identification is performed through the initial tag identification model according to the content sample and the first fusion tag features to obtain a first identification tag for the content sample. Model parameters of the initial tag identification model are adjusted according to the difference between the first identification tag and the real tag of the content sample, so that the first identification tag obtained by the model moves ever closer to the real tag, thereby yielding the tag identification model. Because the semantics of the tags and the associations among them are introduced into the identification process, the tag identification model can better understand content under the guidance of the tags, and this information from different sources improves the accuracy with which the model identifies content.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the application, and that a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is an application scenario schematic diagram of a processing method of a tag identification model according to an embodiment of the present application;
fig. 2 is a flowchart illustrating a processing method of a tag identification model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of obtaining a first identification tag according to an embodiment of the present application;
FIG. 4 is a schematic diagram of obtaining a first identification tag according to yet another embodiment of the present application;
FIG. 5 is a schematic diagram of multi-modal feature extraction according to an embodiment of the present application;
fig. 6 is a schematic diagram of an application scenario of a processing method of a tag identification model according to an embodiment of the present application;
fig. 7 is a schematic view of an application scenario corresponding to fig. 6 according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a processing device of a tag identification model according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "includes" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
In the related art, to add a tag to content, the information included in the content is identified, thereby obtaining the tag of the content. But since identification is performed based only on the information included in the content, its accuracy may be low where that information is missing or insufficient. For example, if video A shows the famous basketball star Zhang San playing basketball, the tag identified for video A may be only "Zhang San". If a user then searches for basketball videos, video A is not recommended, so recommendation accuracy is low.
Based on this, to improve the accuracy of subsequent downstream tasks, multiple tags may be added to the content; for example, the tags of video A may be the famous basketball star's name and "basketball". But the multiple tags are still obtained based only on the information included in the content, so this approach may likewise yield low identification accuracy where information is missing or insufficient.
Accordingly, to improve accuracy, the information about the content is enriched by introducing the tags of the content. In the process of introducing tags, it is found that the tags of content are often not independent but carry complex associations and semantics. For example, Zhang San is one of the representative figures of basketball, so the association between the tag "Zhang San" and the tag "basketball" is very close; in other words, the probability of the two tags appearing together is relatively high.
Therefore, to avoid identifying the content in isolation, the processing method and related device for a tag identification model provided by the application introduce a plurality of tags and the co-occurrence frequencies that identify how often those tags co-occur, so that the tag identification model can better understand the associations and semantics among the tags, understands the content better under the guidance of the tags, and thus identifies content more accurately.
The processing method for the tag identification model provided by the application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, intelligent traffic, assisted driving, smart home, smart wearable devices, virtual assistants, smart speakers, intelligent marketing, conversational interaction, intelligent customer service, intelligent retail, content recommendation, content query, and the like. Three scenarios are exemplified below.
Scene one, content recommendation scene.
Firstly, according to the processing method of the tag identification model provided by the application, the tag identification model required by the content recommendation scene is trained. Specifically, a content sample, a plurality of tags (e.g., person tags, genre tags, etc. for video content) of the content recommendation scene, and co-occurrence frequencies of the plurality of tags are obtained. And extracting features according to the co-occurrence frequency of the plurality of tags and the plurality of tags through an initial tag identification model to obtain first fusion tag features of each tag, and identifying through the initial tag identification model according to the content sample and the first fusion tag features to obtain a first identification tag aiming at the content sample. And according to the difference between the first identification tag and the real tag, adjusting model parameters of the initial tag identification model to obtain the tag identification model.
Then, the content of the content recommendation scene is identified according to the trained tag identification model to obtain the tags of each item of content (such as films and short videos), and content that may interest the user is determined based on those tags and the user's preferences and recommended to the user.
It will be appreciated that the specific embodiments of the present application involve user-related data such as user preferences. When the above embodiments are applied to specific products or technologies, the user's individual permission or consent must be obtained, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Scene two, the content inquiry scene.
Firstly, according to the processing method of the tag identification model provided by the application, the tag identification model required by the content query scene is trained. Specifically, a content sample, a plurality of tags (e.g., author tags, genre tags, etc. for text content) of the content query scene, and co-occurrence frequencies of the plurality of tags are obtained. And extracting features according to the co-occurrence frequency of the plurality of tags and the plurality of tags through an initial tag identification model to obtain first fusion tag features of each tag, and identifying through the initial tag identification model according to the content sample and the first fusion tag features to obtain a first identification tag aiming at the content sample. And according to the difference between the first identification tag and the real tag, adjusting model parameters of the initial tag identification model to obtain the tag identification model.
Then, the content of the content query scene is identified according to the trained tag identification model to obtain the tags of each item of content (such as news articles and novels), content matching the user's query keywords is determined based on those tags, and that content is returned to the user.
Scene three, intelligent house scene.
Firstly, according to the processing method of the tag identification model provided by the application, the tag identification model required by the intelligent home scene is trained. Specifically, a content sample, a plurality of tags of the smart home scene (e.g., keyword tags for text content, mood tags, etc.), and co-occurrence frequencies of the plurality of tags are obtained. And extracting features according to the co-occurrence frequency of the plurality of tags and the plurality of tags through an initial tag identification model to obtain first fusion tag features of each tag, and identifying through the initial tag identification model according to the content sample and the first fusion tag features to obtain a first identification tag aiming at the content sample. And according to the difference between the first identification tag and the real tag, adjusting model parameters of the initial tag identification model to obtain the tag identification model.
Then, the content of the smart home scene is identified according to the trained tag identification model to obtain the tags of each item of content (such as speech corresponding to candidate answer texts), the answer text for the user's spoken question is determined based on those tags and the text of the question, and the answer is returned to the user as speech.
It should be noted that the above application scenarios are only examples; the processing method of the tag identification model provided in this embodiment may also be applied to other scenarios, which are not limited here.
The processing method of the tag identification model provided by the embodiment of the application mainly relates to artificial intelligence technology, through which the identification of tags for content is realized automatically. Artificial Intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, giving machines the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, pre-training model technology, operation/interaction systems, mechatronics, and the like. The pre-training model, also called a large model or foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
In the embodiment of the application, the artificial intelligence technology mainly comprises the directions of the computer vision technology, the natural language processing technology, the machine learning technology and the like.
Computer Vision (CV) technology is the science of studying how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to recognize, detect, and measure targets, and further performs graphics processing so that the result becomes an image better suited to human observation or to transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Large-model technology has brought important innovation to the development of computer vision: pre-trained vision models such as the Swin Transformer backbone, the Vision Transformer (ViT), the sparse vision transformer V-MoE, and the masked autoencoder (MAE) can be quickly and widely applied to specific downstream tasks through fine-tuning. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric techniques such as face recognition and fingerprint recognition.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing involves natural language, the language people use in daily life, so it is closely related to linguistics; it also involves important techniques for model training in computer science, mathematics, and artificial intelligence. The pre-training model developed from the large language model (LLM) in the NLP field; through fine-tuning, large language models can be widely applied to downstream tasks. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
Machine Learning (ML) is a multi-field interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied across all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and demonstration learning. The pre-training model is the latest development of deep learning and integrates the above techniques.
The Pre-Training Model (PTM), also called a foundation model or large model, refers to a deep neural network (DNN) with large parameters trained on massive unlabeled data. The PTM extracts common features from the data by exploiting the function-approximation capability of the large-parameter DNN, and is adapted to downstream tasks through techniques such as fine-tuning, parameter-efficient fine-tuning (PEFT), and prompt-tuning. Therefore, the pre-training model can achieve good results in few-shot or zero-shot scenarios. PTMs can be classified by the data modality they process into language models (e.g., ELMo, BERT, GPT), visual models (e.g., Swin Transformer, ViT, V-MoE), speech models (e.g., VALL-E), and multi-modal models (e.g., ViLBERT, CLIP, Flamingo, Gato), where a multi-modal model builds a representation of features of two or more data modalities. The pre-training model is an important tool for producing AI-generated content (AIGC) and can also serve as a general interface connecting multiple specific task models.
In the processing method of the tag identification model provided by the embodiment of the application, a video sub-model can be trained through computer vision technology, a text sub-model can be trained based on natural language processing technology, an identification sub-model can be trained based on machine learning technology, and the tag identification model can be trained based on a pre-training model.
The processing method of the tag identification model provided by the application can be applied to computer equipment with the processing capability of the tag identification model, such as terminal equipment and a server.
The terminal device may be a desktop computer, a notebook computer, a smartphone, a tablet computer, an Internet-of-Things device, or a portable wearable device; the Internet-of-Things device may be a smart speaker, a smart television, a smart air conditioner, or a smart vehicle-mounted device; the smart vehicle-mounted device may be a vehicle-mounted navigation terminal or a vehicle-mounted computer; and the portable wearable device may be a smart watch, a smart bracelet, or a head-mounted device, but is not limited thereto.
The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server or server cluster providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), and basic cloud computing services such as big data and artificial intelligence platforms. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited by the application.
In order to facilitate understanding of the processing method of the tag identification model provided by the embodiment of the present application, an application scenario of the processing method of the tag identification model is described by taking an execution subject of the processing method of the tag identification model as an example of a server.
Referring to fig. 1, it shows an application scenario of a processing method of a tag identification model according to an embodiment of the present application. As shown in fig. 1, the application scenario includes a server 100. The server 100 may be an independent server for training the tag identification model; after training is completed, the trained tag identification model may be deployed on the server or terminal device corresponding to a product to provide a tag identification service for content. The server 100 may also be a server that provides services for various products, where the services may include, for example, recommending content that may interest the user after identifying the tags of the content, or returning content that meets the user's query criteria. The following description takes as an example the case in which the server 100 trains the tag identification model.
The server 100 obtains not only a content sample but also a plurality of tags and a plurality of co-occurrence frequencies. The content sample is content with a real tag, where the content may be video content, text content, and the like. A tag identifies a characteristic of the content, such as a person tag or a genre tag. A co-occurrence frequency identifies the frequency with which at least two of the plurality of tags jointly identify the same content, i.e., the probability of the tags co-occurring; for example, the probability of the tag "Zhang San" and the tag "basketball" appearing together (i.e., as tags of the same content) is relatively high. The relationships and dependencies among the tags can thus be represented through the co-occurrence frequencies.
The server 100 performs feature extraction through the initial tag identification model according to the plurality of tags and the plurality of co-occurrence frequencies to obtain the first fusion tag feature of each tag, so that the first fusion tag feature includes not only the feature of the corresponding tag but also the features of the tags co-occurring with it. Moreover, in the first fusion tag feature, the greater the co-occurrence frequency between the corresponding tag and a co-occurring tag, the greater the influence of the co-occurring tag's feature on the first fusion tag feature. That is, the initial tag identification model is trained to better understand the semantics of each tag and the associations among the tags.
The server 100 guides the initial tag identification model to purposefully understand the content sample according to the semantics of the tags and the associations among them: identification is performed through the initial tag identification model according to the content sample and the first fusion tag features to obtain a first identification tag for the content sample. Model parameters of the initial tag identification model are then adjusted according to the difference between the first identification tag and the real tag of the content sample, so that the first identification tag obtained by the model moves closer to the real tag, thereby yielding the tag identification model.
In the process of identifying the content, the semantic meaning of a plurality of labels and the relevance among the labels are introduced, so that the label identification model can better understand the content based on the guidance of the labels, and the accuracy of the label identification model on the content identification is improved through information of different sources.
The processing method of the tag identification model provided by the embodiment of the application can be executed by a server. However, in other embodiments of the present application, the terminal device may also have a similar function to the server, so as to execute the processing method of the tag identification model provided in the embodiment of the present application, or the terminal device and the server together execute the processing method of the tag identification model provided in the embodiment of the present application, which is not limited in this embodiment.
The following describes a processing method of the tag identification model provided by the application in detail through a method embodiment.
Referring to fig. 2, the flow chart of a processing method of a tag identification model according to an embodiment of the present application is shown. For convenience of description, the following embodiments will be described by taking an execution body of the processing method of the tag identification model as a server as an example. As shown in fig. 2, the processing method of the tag identification model includes the following steps:
s201: a content sample, a plurality of tags, and a plurality of co-occurrence frequencies are obtained.
The content sample is content with a real tag, where the content includes but is not limited to video content, image-text content, audio content, and the like. The real tag is the tag of the content sample and represents the information that the initial tag identification model should learn to predict. Taking a video content sample as an example, the real tag of the content sample may be a person tag describing the protagonist, a category tag describing the video category, or the like.
Still taking a video content sample as an example, video content generally includes a plurality of video frames. Because adjacent video frames are highly similar, they contain considerable redundant information that brings no significant gain when training the tag identification model. To reduce the amount and complexity of computation, frame extraction may be performed on the video content, for example extracting M video frames per second, where M is a positive integer. As a possible implementation, 24 video frames can be extracted per second, which reduces the amount and complexity of computation without reducing the accuracy of the tag identification model.
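As an illustration of this frame-extraction step, the sketch below samples M frames per second from a video with OpenCV; the function name and the use of cv2 are assumptions for illustration, not details from the patent.

```python
import cv2

def sample_frames(video_path: str, frames_per_second: int = 24):
    """Extract roughly M frames per second from a video (M = 24 as in the text)."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or float(frames_per_second)
    # Keep one frame every (native_fps / M) frames of the original video.
    step = max(native_fps / frames_per_second, 1.0)
    frames, next_keep, index = [], 0.0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index >= next_keep:
            frames.append(frame)
            next_keep += step
        index += 1
    cap.release()
    return frames
```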
A tag (label) identifies a characteristic of the content: a character highlighted in the content may be a tag, the core idea of the content may be a tag, the category of the content may be a tag, and so on. The plurality of tags is a collection of the tags of multiple items of content. It should be noted that the plurality of tags may be the tags of all content on the content platform, or the tags of all content within one broad category, such as all content in the news category; the application does not specifically limit this.
The co-occurrence frequency identifies the frequency with which at least two of the plurality of tags jointly identify the same content, i.e., the frequency with which the tags co-occur. Alternatively, the co-occurrence frequency of a group of tags identifies how often that group of tags co-occurs, and a group may consist of two or more tags. Taking a group of two tags as an example, the frequency with which the tag "Zhang San" and the tag "basketball" jointly identify the same content is the co-occurrence frequency of that group (i.e., of the tag "Zhang San" and the tag "basketball"). It should be noted that if the tag "Zhang San" and the tag "basketball" jointly identify video A and also jointly identify video B, the two tags have jointly identified the same content twice. The co-occurrence frequency can be expressed as formula (1):
f_ij = C_ij / (C_i + C_j - C_ij)    (1)

wherein f_ij represents the co-occurrence frequency of the i-th tag and the j-th tag; C_ij represents the number of times the i-th tag and the j-th tag jointly identify the same content; C_i represents the number of occurrences of the i-th tag; and C_j represents the number of occurrences of the j-th tag.
As a possible implementation, a data set composed of a plurality of content samples may be obtained, each content sample having one or more real tags. The union of all real tags of the content samples in the data set is determined as the tag set, and the real tags included in the tag set are the plurality of tags. In addition, the tag set records the association relationships among the tags: tags that identify the same content are associated with one another, which is recorded as a co-occurrence. The co-occurrence frequencies among the tags are thereby determined from the number of occurrences of each tag in the data set and the number of co-occurrences of each tag with the other tags.
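The following sketch shows how the tag set and the co-occurrence frequencies could be derived from such a data set, assuming the normalization of formula (1); all names here are illustrative, not taken from the patent.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(samples):
    """samples: one set of real tags per content sample in the data set."""
    tag_count = Counter()   # C_i: number of occurrences of each tag
    pair_count = Counter()  # C_ij: times a pair jointly identifies the same content
    for tags in samples:
        tag_count.update(tags)
        for a, b in combinations(sorted(tags), 2):
            pair_count[(a, b)] += 1
    tag_set = sorted(tag_count)  # union of all real tags in the data set
    freq = {}
    for (a, b), c_ij in pair_count.items():
        # Formula (1): f_ij = C_ij / (C_i + C_j - C_ij)
        freq[(a, b)] = c_ij / (tag_count[a] + tag_count[b] - c_ij)
    return tag_set, freq

# Example: the tags "Zhang San" and "basketball" co-occur in every sample.
tags, freq = build_cooccurrence([
    {"Zhang San", "basketball"},
    {"Zhang San", "basketball", "sports"},
])
print(freq[("Zhang San", "basketball")])  # 1.0
```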
S202: and extracting features according to the co-occurrence frequencies and the tags through an initial tag identification model to obtain first fusion tag features of each tag.
The initial tag identification model is a tag identification model that has not yet been trained. Taking one target tag among the plurality of tags as an example, feature extraction is performed through the initial tag identification model according to the target tag and the co-occurrence frequencies between the target tag and the other tags, so as to obtain the first fusion tag feature for the target tag; each of the plurality of tags is then taken as the target tag in turn to obtain the first fusion tag feature of each tag. For example, taking tag A among 100 tags, the first fusion tag feature for tag A is obtained by feature extraction through the initial tag identification model according to tag A and the co-occurrence frequencies between tag A and the other 99 tags.
The initial tag identification model extracts features not only from the co-occurrence frequencies but also from the tags themselves, so the obtained first fusion tag features learn the feature of each tag, i.e., the semantics of each tag.
Moreover, the co-occurrence frequencies can represent the association relationships among the tags: for example, the greater the co-occurrence frequency of tag A and tag B, the higher the probability that tag A and tag B jointly identify the same content, and the higher their degree of association. Therefore, by learning the co-occurrence frequencies, the initial tag identification model enables the obtained first fusion tag features to learn the associations among the tags. For example, if there are 100 tags, each tag establishes an association with the other 99 tags; it can be understood that if two tags have never jointly identified the same content, their degree of association is 0, i.e., the two tags have no association.
That is, the first fusion tag feature is the fused tag feature obtained by the initial tag identification model, and it identifies the feature of the corresponding tag and the features of the tags that co-occur with it among the plurality of tags. Moreover, in the fused tag feature, the greater the co-occurrence frequency between the corresponding tag and a co-occurring tag, the greater the influence of the co-occurring tag's feature on the fused tag feature. For example, in the fusion feature for tag A, the greater the co-occurrence frequency of tag A and tag B, the stronger their association, and thus the greater the influence of tag B's feature on tag A's fusion feature; tag A's fusion feature then learns more of tag B's features, so the relationships and mutual influences between tags can be captured more accurately.
As a possible implementation, to facilitate feature extraction over the co-occurrence frequencies and the plurality of tags by the initial tag identification model, the plurality of co-occurrence frequencies and the plurality of tags may be represented as matrices, which is described in detail later herein and not repeated here.
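In matrix form, one plausible realization of this step (an assumption; the patent does not fix the computation) treats the co-occurrence frequencies as a weighted adjacency matrix over the tags and computes each first fusion tag feature as a co-occurrence-weighted aggregation of tag embeddings, so that a tag with a higher co-occurrence frequency has a larger influence:

```python
import torch

def fuse_tag_features(tag_embed: torch.Tensor, cooc: torch.Tensor) -> torch.Tensor:
    """tag_embed: (L, D) semantic embeddings of the L tags.
    cooc: (L, L) symmetric co-occurrence frequency matrix, zero for pairs
    that never jointly identify the same content.
    Returns (L, D) first fusion tag features."""
    # A self-loop keeps each tag's own feature inside its fused feature.
    weights = cooc + torch.eye(cooc.size(0))
    # Row-normalize so each fused feature is a weighted average of tag
    # features: a larger co-occurrence frequency means a larger influence.
    weights = weights / weights.sum(dim=1, keepdim=True)
    return weights @ tag_embed
```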
S203: and identifying through the initial tag identification model according to the content sample and the first fusion tag characteristics to obtain a first identification tag aiming at the content sample.
Since the first fusion tag features include the semantics of the plurality of tags and the associations among them, they can guide the initial tag identification model to purposefully understand the content sample when identification is performed according to the content sample and the first fusion tag features.
For example, in the related art, when a content sample is identified, possibly only tag A is identified for it. With the embodiment of the application, the first fusion tag features guide the initial tag identification model to purposefully understand the content sample: after identifying tag A for the content sample, the model also tries to determine whether the content sample can be given tag B, because the co-occurrence frequency of tag A and tag B is high. This raises the probability that tag B is identified and thus the accuracy of the initial tag identification model.
S204: and according to the difference between the first identification tag and the real tag, adjusting model parameters of the initial tag identification model to obtain the tag identification model.
The first identification tag is obtained through the initial tag identification model and its accuracy is still low, whereas the real tag corresponds to the content sample and has high accuracy. The difference between the first identification tag and the real tag therefore reflects the accuracy of the initial tag identification model, and the model parameters can be adjusted based on this difference so that the first identification tag obtained by the model moves ever closer to the real tag, improving the model's accuracy. Training of the initial tag identification model can end once the number of iterations is reached or the model's identification results converge; the model parameters are then no longer adjusted, and a tag identification model with higher accuracy is obtained, which can identify the tags of content.
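Since a content sample can carry several real tags, the parameter adjustment of S204 can be sketched as one step of multi-label training; the binary cross-entropy objective and the optimizer interface below are illustrative assumptions, not details stated in the patent.

```python
import torch.nn as nn

def train_step(model, optimizer, content, real_tags_multi_hot):
    """One adjustment of the initial tag identification model's parameters.
    content: a batch of content samples; real_tags_multi_hot: (B, L) real tags."""
    criterion = nn.BCEWithLogitsLoss()             # multi-label objective
    logits = model(content)                        # (B, L) score per tag
    loss = criterion(logits, real_tags_multi_hot)  # difference from real tags
    optimizer.zero_grad()
    loss.backward()                                # gradients of model parameters
    optimizer.step()                               # adjust model parameters
    return loss.item()
```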
According to the technical scheme, not only a content sample but also a plurality of tags and a plurality of co-occurrence frequencies are acquired. A co-occurrence frequency identifies the frequency with which at least two of the plurality of tags jointly identify the same content, so the relationships and dependencies among the tags can be embodied through the co-occurrence frequencies. Feature extraction is performed through the initial tag identification model according to the plurality of tags and the plurality of co-occurrence frequencies to obtain the first fusion tag feature of each tag, so that the first fusion tag feature includes not only the feature of the corresponding tag but also the features of the tags that jointly identify the same content with it. Moreover, in the first fusion tag feature, the greater the co-occurrence frequency between the corresponding tag and a tag that jointly identifies the same content with it, the greater the influence of that tag's feature on the first fusion tag feature. That is, the initial tag identification model can better understand the semantics of each tag and the associations among the tags, and this guides it to purposefully understand the content sample: identification is performed through the initial tag identification model according to the content sample and the first fusion tag features to obtain a first identification tag for the content sample. Model parameters of the initial tag identification model are adjusted according to the difference between the first identification tag and the real tag of the content sample, so that the first identification tag obtained by the model moves ever closer to the real tag, thereby yielding the tag identification model. Because the semantics of the tags and the associations among them are introduced into the identification process, the tag identification model can better understand content under the guidance of the tags, and this information from different sources improves the accuracy with which the model identifies content.
The embodiment of the present application does not specifically limit the implementation of S203, i.e., how the first identification tag for the content sample is obtained through the initial tag identification model according to the content sample and the first fusion tag features. Two modes are described below.
Mode one: direct fusion.
A1: and extracting the characteristics of the content sample through the initial tag identification model according to the content sample, so as to obtain the content characteristics corresponding to the content sample.
The content characteristics are used to identify characteristics of the content samples.
The embodiment of the application does not specifically limit the structure of the initial tag identification model; for example, the initial tag identification model has a feature extraction function for content samples, a feature extraction function for tags, a feature fusion function, and an identification function. For convenience of explanation, the description takes the case in which each function corresponds to one sub-model, i.e., the initial tag identification model includes a content feature extraction sub-model, a tag feature extraction sub-model, a first fusion sub-model, and an identification sub-model.
Referring to fig. 3, a schematic diagram of obtaining a first identification tag according to an embodiment of the present application is shown. In fig. 3, a content sample is input into a content feature extraction sub-model in an initial tag identification model, and feature extraction is performed on the content sample through the content feature extraction sub-model, so as to obtain content features for the content sample.
A2: and fusing through the initial tag identification model according to the content features and the first fusion tag features to obtain comprehensive features aiming at the content samples.
The comprehensive features are obtained by fusing the content features and the first fusion tag features. They can represent not only the features of the content sample but also the semantics of each tag and the associations among the plurality of tags.
With continued reference to fig. 3, the plurality of co-occurrence frequencies and the plurality of tags are input into a tag feature extraction sub-model included in the initial tag identification model to obtain a first fused tag feature. Inputting the first fusion tag features and the content features into a first fusion sub-model included in the initial tag identification model, and fusing through the first fusion sub-model to obtain comprehensive features aiming at the content samples.
The embodiment of the application does not particularly limit the fusion mode, which can be set by a person skilled in the art according to actual needs. For example, feature addition, feature concatenation, or feature multiplication may be used to fuse the features. Taking feature multiplication as an example, the corresponding elements of the features are multiplied one by one.
As a possible implementation, the embodiment of the application may use feature multiplication to fuse the features. Feature multiplication enhances co-occurring features while weakening unimportant ones. The operation can be seen as a transformation in feature space that changes how the original features are expressed, potentially allowing the model to better capture the underlying structure and patterns of the data. Information fusion is thus achieved at the feature level, so the initial tag identification model can learn more complex feature combinations. Furthermore, through multiplication the initial tag identification model can better emphasize co-occurring features, because the product of such features is larger and therefore carries more weight in subsequent computation. That is, feature multiplication can enhance the data representation capability of the initial tag identification model and improve its performance.
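A minimal sketch of this feature-multiplication fusion, with illustrative shapes (the patent does not specify them): the content feature is broadcast against each first fusion tag feature, so dimensions where both are strongly activated are amplified and unimportant ones are suppressed.

```python
import torch

def multiply_fuse(content_feat: torch.Tensor, fused_tag_feats: torch.Tensor):
    """content_feat: (B, D); fused_tag_feats: (L, D).
    Returns (B, L, D) comprehensive features, one per sample-tag pair."""
    # Element-wise product: corresponding elements are multiplied one by one.
    return content_feat.unsqueeze(1) * fused_tag_feats.unsqueeze(0)
```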
A3: and according to the comprehensive characteristics, identifying through an initial tag identification model to obtain a first identification tag aiming at the content sample.
With continued reference to fig. 3, the comprehensive features are input into the identification sub-model included in the initial tag identification model, and identification is performed through the identification sub-model to obtain the first identification tag for the content sample.
In this way, the content features of the content sample and the first fusion tag features of the plurality of tags and co-occurrence frequencies are extracted separately, the content features and the first fusion tag features are directly fused into the comprehensive features, and identification is performed on the comprehensive features to obtain the first identification tag for the content sample. This fusion mode is simple and convenient, requires little computation, and can improve the accuracy of the identification result.
Mode two: complex fusion.
B1: and extracting the characteristics of the content sample through the initial tag identification model according to the content sample, so as to obtain the content characteristics corresponding to the content sample.
The content characteristics are used to identify characteristics of the content samples.
The embodiment of the application does not specifically limit the structure of the initial tag identification model; for example, the initial tag identification model has a feature extraction function for content samples, a feature extraction function for tags, a feature fusion function, and an identification function. For convenience of explanation, the description takes the case in which each function corresponds to one sub-model, i.e., the initial tag identification model includes a content feature extraction sub-model, a tag feature extraction sub-model, a first fusion sub-model, a second fusion sub-model, and an identification sub-model.
Referring to fig. 4, a schematic diagram of still another embodiment of the present application for obtaining a first identification tag is shown. In fig. 4, a content sample is input into a content feature extraction sub-model in an initial tag identification model, and feature extraction is performed on the content sample through the content feature extraction sub-model, so as to obtain content features for the content sample.
B2: and fusing through the initial tag identification model according to the content features and the first fusion tag features to obtain comprehensive features.
The comprehensive features are obtained by fusing the content features and the first fusion tag features. They can represent not only the features of the content sample but also the semantics of each tag and the associations among the plurality of tags.
With continued reference to fig. 4, the plurality of co-occurrence frequencies and the plurality of tags are input into a tag feature extraction sub-model included in the initial tag identification model to obtain a first fused tag feature. Inputting the first fusion tag features and the content features into a first fusion sub-model included in the initial tag identification model, and fusing through the first fusion sub-model to obtain comprehensive features aiming at the content samples.
The embodiment of the application is not particularly limited to the fusion mode, and can be set by a person skilled in the art according to actual needs. For example, feature addition, feature stitching, feature multiplication, and the like may be used to achieve fusion between features.
B3: and according to the content characteristics and the comprehensive characteristics, fusing through an initial tag identification model to obtain the enhanced characteristics aiming at the content samples.
The enhanced features are obtained by fusing the content features and the comprehensive features. Compared with the comprehensive features, the enhanced features focus more on the features of the tags corresponding to the content sample, that is, on what is unique to the content sample, and therefore provide a stronger understanding of the content sample.
With continued reference to fig. 4, the content features and the comprehensive features are input into the second fusion sub-model, and fusion is performed through the second fusion sub-model to obtain the enhanced features for the content sample.
B4: and according to the enhancement features, identifying through an initial tag identification model to obtain a first identification tag aiming at the content sample.
With continued reference to fig. 4, the enhanced features are input into an identification sub-model included in the initial tag identification model, and the identification is performed through the identification sub-model, so as to obtain a first identification tag for the content sample.
In this way, the content features of the content sample and the first fused tag features of the plurality of tags and the plurality of co-occurrence frequencies are extracted separately; the content features and the first fused tag features are fused to obtain the comprehensive features, and the comprehensive features and the content features are fused again to obtain the enhanced features. Compared with the comprehensive features, the enhanced features better reflect the features of the tags corresponding to the content sample, so obtaining the first identification tag for the content sample according to the enhanced features can further improve the accuracy of feature extraction and thus the accuracy of tag identification.
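A minimal sketch of this two-stage fusion, assuming, as one possibility the embodiment leaves open, feature multiplication for the first fusion and a residual-style addition for the second:

```python
import torch
import torch.nn as nn

class TwoStageFusion(nn.Module):
    """Sketch of the first and second fusion sub-models (mode two).

    Assumes feature multiplication for the first fusion and a residual
    addition for the second; the embodiment leaves both operators open.
    """
    def forward(self, content: torch.Tensor, tag: torch.Tensor) -> torch.Tensor:
        comprehensive = content * tag        # first fusion sub-model
        enhanced = content + comprehensive   # second fusion sub-model
        return enhanced

fusion = TwoStageFusion()
content = torch.randn(4, 128)    # content features, batch of 4
tag = torch.randn(128)           # first fused tag features
enhanced = fusion(content, tag)  # enhanced features, shape (4, 128)
```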
As one possible implementation, the plurality of tags that jointly identify the same content may be tags that are actually marked on the content, such as a person tag and a scene tag on video content. They may also be tags implicitly carried by the content. For example, video content may be divided into categories; under a given category, a piece of video content may carry only a person tag, yet its video category, although not actually marked on the video content, implicitly acts as a category tag. Correspondingly, the embodiment of the application provides a specific implementation of S201, that is, of acquiring the content sample, the plurality of tags, and the plurality of co-occurrence frequencies. The following description takes a content sample having a single tag as an example.
A content sample and a plurality of tags are obtained, the content sample having a single tag, i.e., only one actually marked tag. The category of the content sample is determined, and the plurality of tags is updated according to the category of the content sample to obtain an updated plurality of tags; that is, the category of the content sample is added to the plurality of tags. It should be noted that if the category of the content sample is a tag already included in the plurality of tags, the number of tags does not increase; if it is not, the number of tags increases by 1. A plurality of co-occurrence frequencies is then obtained from the updated plurality of tags, the co-occurrence frequencies identifying how often the category of the content sample co-occurs with each of the plurality of tags.
In this way, even if the content sample has a single tag, a hidden tag of the content sample, such as its category, can be obtained and used as another tag of the content sample. The plurality of tags is updated accordingly, and the co-occurrence frequencies corresponding to the updated tags are obtained, so that the tag identification model is trained on more accurate co-occurrence frequencies, improving the accuracy of the tag identification model and expanding its application scenarios.
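For illustration, a minimal sketch of such a counting update over a corpus of single-tag samples (the data layout and tag strings are hypothetical; normalizing the counts into frequencies is left to formula (1)):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(samples):
    """samples: (tag, category) pairs for single-tag content samples.

    The category is treated as an implicit second tag of each sample, so
    that the pair counts capture how often two tags identify the same
    content. Normalizing these counts (e.g., per formula (1)) then yields
    the co-occurrence frequencies.
    """
    counts = Counter()
    for tag, category in samples:
        tags = {tag, category}  # add the category as an implicit tag
        for a, b in combinations(sorted(tags), 2):
            counts[(a, b)] += 1
    return counts

counts = cooccurrence_counts([("person:ZhangSan", "sports"),
                              ("growth experience", "sports")])
```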
In addition to being trainable to identify content samples with a single tag, the tag identification model can also be trained to identify multimodal content samples, which expands its application scenarios. Multimodality (Multimodal) refers to a context, system, or technology that utilizes multiple different modalities or sensors simultaneously. These modalities may include vision, hearing, touch, motion, and so on, the aim being to understand and process information by combining multiple sensing channels. Multimodal technology can simulate the natural perception process of human beings, providing more complete information input and a richer interaction experience for fields such as machine learning and human-computer interaction. For example, a multimodal content sample may include a first sub-content belonging to a first category and a second sub-content belonging to a second category, where the first category and the second category are different categories, such as a video category, a text category, a sound category, and the like.
Based on this, the embodiment of the present application provides a specific implementation manner of S203, that is, according to the content sample and the first fusion tag feature, the specific implementation manner of the first identification tag for the content sample is obtained by identifying with the initial tag identification model, specifically see S2031-S2035.
S2031: and obtaining the first sub-content and the second sub-content according to the content sample.
For example, a first content extraction sub-model and a second content extraction sub-model are trained; the content sample is processed through the first content extraction sub-model to obtain the first sub-content, and through the second content extraction sub-model to obtain the second sub-content. The first content extraction sub-model and the second content extraction sub-model are used to extract sub-content of different categories.
S2032: and extracting the characteristics of the first sub-content through the initial tag identification model according to the first sub-content to obtain the characteristics of the first sub-content.
S2033: and extracting features according to the second sub-content through the initial tag identification model to obtain second sub-content features.
As one possible implementation, the initial tag identification model may include a first feature extraction sub-model for extracting features from the first sub-content belonging to the first category and a second feature extraction sub-model for extracting features from the second sub-content belonging to the second category, the two being different sub-models. That is, according to the first sub-content, feature extraction is performed through the first feature extraction sub-model included in the initial tag identification model to obtain the first sub-content features, and according to the second sub-content, feature extraction is performed through the second feature extraction sub-model included in the initial tag identification model to obtain the second sub-content features.
As one possible implementation, if the first category is a video category and the second category is a text category, then taking video content as an example, the first sub-content may be the video sub-content of the video content, and the second sub-content the text sub-content drawn from one or more of the title (i.e., the producer's summary of the entire video content when publishing it), bullet-screen comments, comments, subtitles, and the like. Correspondingly, the first feature extraction sub-model may be a video sub-model and the second feature extraction sub-model a text sub-model. Feature extraction is then performed through the video sub-model included in the initial tag identification model according to the first sub-content to obtain the first sub-content features, and through the text sub-model included in the initial tag identification model according to the second sub-content to obtain the second sub-content features.
Embodiments of the application do not specifically limit the video sub-model and the text sub-model. For example, the video sub-model may be a convolutional neural network (Convolutional Neural Network, CNN) or a Transformer, and the text sub-model may be a recurrent neural network (Recurrent Neural Network, RNN) or a Transformer, etc. As a possible implementation, the video sub-model may be a 3D Swin Transformer (a model operating on three-dimensional data), and the text sub-model may be a Bidirectional Encoder Representations from Transformers (BERT) model, so as to improve the accuracy of feature extraction and thus the accuracy of the tag identification model.
Referring to fig. 5, a schematic diagram of multimodal feature extraction according to an embodiment of the present application is shown. In fig. 5, the first sub-content may be the video sub-content of the video content, and the second sub-content the text sub-content from the title. The video sub-content is input into the 3D Swin Transformer, and feature extraction is performed through the 3D Swin Transformer to obtain the first sub-content features. The text sub-content is input into the BERT model, and feature extraction is performed through the BERT model to obtain the second sub-content features.
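For illustration, a sketch of this two-branch extraction, assuming torchvision's video Swin Transformer for the video branch and the Hugging Face BERT model for the text branch (both library choices are assumptions; any encoders with compatible outputs would serve):

```python
import torch
import torch.nn as nn
from torchvision.models.video import swin3d_t      # assumed available
from transformers import BertModel, BertTokenizer  # assumed available

video_model = swin3d_t()
video_model.head = nn.Identity()  # drop the classification head, keep features

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
text_model = BertModel.from_pretrained("bert-base-chinese")

frames = torch.randn(1, 3, 16, 224, 224)  # (B, C, T, H, W) video clip
first_feat = video_model(frames)          # first sub-content features

tokens = tokenizer("视频标题", return_tensors="pt")
second_feat = text_model(**tokens).pooler_output  # second sub-content features
```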
S2034: and fusing the first sub-content features and the second sub-content features through the initial tag identification model to obtain the content features for the content sample.
With continued reference to fig. 5, the initial tag identification model may further include a third fusion sub-model. According to the first sub-content features and the second sub-content features, fusion is performed through the third fusion sub-model included in the initial tag identification model to obtain the content features for the content sample, which may be expressed as formula (2):
$$F = \mathrm{Fuse}_3\left(F_1, F_2\right), \quad F \in \mathbb{R}^{B \times D} \tag{2}$$

where $F$ represents the content features; $F_1$ represents the first sub-content features; $F_2$ represents the second sub-content features; $\mathrm{Fuse}_3(\cdot)$ denotes the fusion performed by the third fusion sub-model; $B$ represents the batch size (batchsize); and $D$ represents the dimension of the content features.
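As one illustration, the third fusion sub-model could be realized as concatenation followed by a linear projection (an assumption; the embodiment does not fix the fusion operator):

```python
import torch
import torch.nn as nn

class ThirdFusion(nn.Module):
    """Concatenate the two sub-content features and project to D dimensions."""
    def __init__(self, d1: int, d2: int, d_out: int):
        super().__init__()
        self.proj = nn.Linear(d1 + d2, d_out)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([f1, f2], dim=-1))  # (B, D) content features

fuse3 = ThirdFusion(768, 768, 512)
content_features = fuse3(torch.randn(2, 768), torch.randn(2, 768))
```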
S2035: and identifying through the initial tag identification model according to the content characteristics and the first fusion tag characteristics to obtain a first identification tag aiming at the content sample.
The embodiment of the application does not specifically limit the identification mode; it can be implemented in the first mode (A1-A3) or the second mode (B1-B4). Taking the second mode (B1-B4) as an example, it can be expressed as formula (3):
$$C = \mathrm{Fuse}_1\left(F, Z\right), \qquad E = \mathrm{Fuse}_2\left(F, C\right) \tag{3}$$

where $E$ represents the enhanced features; $F$ represents the content features; $Z$ represents the first fused tag features; $C$ represents the comprehensive features; and $\mathrm{Fuse}_1(\cdot)$, $\mathrm{Fuse}_2(\cdot)$ denote the fusion performed by the first and second fusion sub-models, respectively.
For content samples belonging to multiple modalities, the information they contain is more complex and identification is easily inaccurate, which makes understanding the relevance among the plurality of tags important. Taking video content as an example, suppose a video shows only a representative figure of basketball, "Zhang San", who is recounting his growth experience, with no basketball footage at all. If identification were based only on the video sub-content, only the tag "Zhang San" might be obtained, and the accuracy of identification would be lower due to the lack of information. By instead identifying the different categories of sub-content separately, for example by utilizing the visual, text, and audio features of the video medium, the video content can be understood more fully and accurate, comprehensive tag information can be provided: the tag "growth experience" can be obtained by identifying the text sub-content, and the tag "basketball" can be obtained through the plurality of tags and the plurality of co-occurrence frequencies. Therefore, even though multimodal content carries more information, the sub-content of different modalities can be identified separately with the help of the plurality of tags and the plurality of co-occurrence frequencies, further improving identification accuracy and expanding the application scenarios of the tag identification model.
As can be seen from the foregoing, if the first category is a video category and the second category is a text category, feature extraction can be performed according to the first sub-content by using a video sub-model included in the initial tag identification model, so as to obtain the first sub-content feature. And extracting features of the text sub-model included in the initial tag identification model according to the second sub-content to obtain features of the second sub-content.
Because of the differences between the sub-models included in the initial tag identification model, different sub-models behave differently within a single model training: for example, different sub-models may be responsible for different tasks or functions, so their data distributions, learning difficulty, and convergence speed may differ. If all sub-models are updated in the same way, their respective characteristics and requirements may not be accommodated, resulting in reduced model performance or unstable training.
Based on this, different update modes can be adopted for different sub-models. For example, the text sub-model may be updated with smaller hyperparameters, because its understanding capability is greater than that of the video sub-model, while the video sub-model may be updated with larger hyperparameters. Specifically, according to the difference between the first identification tag and the real tag, the model parameters of the video sub-model are adjusted with a first hyperparameter and the model parameters of the text sub-model are adjusted with a second hyperparameter to obtain the tag identification model, the first hyperparameter being larger than the second hyperparameter. In model training, hyperparameters are parameters whose values are set before the learning process begins rather than obtained through training. They define higher-level properties of the model, such as complexity or learning capability, and cannot be learned directly from the data during standard model training but must be predefined.
Thus, the learning process of each sub-model can be controlled more flexibly by adopting different update modes. For example, for simpler sub-models, a larger learning rate may be used for rapid updates; for complex sub-models, a smaller learning rate or a more elaborate optimization strategy may be needed to ensure stable learning. Moreover, adopting different update modes also helps improve the generalization capability of the initial tag identification model: since different sub-models employ different update strategies, they can learn characteristics of the data from different angles and levels, making the overall model (i.e., the initial tag identification model) more robust and generalizable.
The embodiment of the application does not specifically limit how the hyperparameters of different sub-models are adjusted; this can be set by a person skilled in the art according to actual needs. For example, each sub-model included in the initial tag identification model may use fixed hyperparameters throughout the training process. As another example, when training starts, each sub-model included in the initial tag identification model is given an initial hyperparameter, and the corresponding hyperparameters are then continuously adjusted according to the training results of the different sub-models, improving the accuracy of each sub-model and thus the accuracy of the initial tag identification model.
For example, consider the (i+1)-th round of model training, where i is an integer greater than or equal to 1. The first hyperparameter is updated according to the accuracy (ACC) of the video sub-model obtained in the i-th round of training, and the model parameters of the video sub-model are adjusted in the (i+1)-th round according to the updated first hyperparameter. Similarly, the second hyperparameter is updated according to the accuracy of the text sub-model obtained in the i-th round, and the model parameters of the text sub-model are adjusted in the (i+1)-th round according to the updated second hyperparameter. For example, if the accuracy does not increase significantly over a period of time, the model may be stuck in a local optimum or undertrained; at this point one may try adjusting hyperparameters such as increasing the batch size or the regularization strength.
Furthermore, in addition to accuracy, loss functions may also be used. Specifically, a corresponding classification head is added to a sub-model (such as one requiring dynamic adjustment) included in the initial tag identification model; for example, the video sub-model may add a classification head for classifying video sub-content, and the text sub-model a classification head for classifying text sub-content. The loss corresponding to each classification head is then monitored. If a loss remains at a high level, the learning rate may be too low, and one may try increasing the learning-rate hyperparameter. Conversely, if the loss drops rapidly and approaches 0, the learning rate may be too high or the model complexity too great, causing overfitting; the learning rate or the model complexity should then be reduced.
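A sketch of such per-sub-model update control using optimizer parameter groups (PyTorch; the sub-models are stand-ins and the learning-rate values are hypothetical):

```python
import torch
import torch.nn as nn

video_model = nn.Linear(768, 256)  # stand-ins for the video/text sub-models
text_model = nn.Linear(768, 256)

# A larger learning rate (first hyperparameter) for the video sub-model and
# a smaller one (second hyperparameter) for the text sub-model.
optimizer = torch.optim.AdamW([
    {"params": video_model.parameters(), "lr": 1e-4},
    {"params": text_model.parameters(), "lr": 1e-5},
])

def rescale_lr(opt: torch.optim.Optimizer, group_idx: int, factor: float):
    """Between training rounds, rescale one sub-model's learning rate,
    e.g., based on its accuracy or its classification-head loss."""
    opt.param_groups[group_idx]["lr"] *= factor
```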
As a possible implementation manner, the embodiment of the present application provides a specific implementation manner of S202, that is, a specific implementation manner of extracting features by using an initial tag identification model according to a plurality of co-occurrence frequencies and a plurality of tags, so as to obtain first fused tag features of each tag.
To illustrate, consider a graph convolutional neural network comprising n graph convolution layers, where n is a positive integer. Specifically, feature extraction is performed through the n graph convolution layers included in the initial tag identification model according to the plurality of co-occurrence frequencies and the plurality of tags, obtaining the first fused tag features of each tag. Based on the idea that the tag features of each tag are related to all of its neighboring tags, feature extraction is performed successively through the n graph convolution layers, so that each tag can learn the features of the tags that co-occur with it (i.e., its neighboring tags).
The following describes the feature extraction method of the (j+1)-th graph convolution layer, where j is a positive integer less than n.
Specifically, a normalized matrix, a characteristic matrix of a j-th layer graph convolution layer and a weight matrix of the j-th layer graph convolution layer are obtained, and the characteristic matrix of a j+1-th layer graph convolution layer is obtained according to the normalized matrix, the characteristic matrix of the j-th layer graph convolution layer and the weight matrix of the j-th layer graph convolution layer.
For example, the plurality of tags and the plurality of co-occurrence frequencies may be represented by a graph structure. Taking the number of tags as M (a positive integer), the graph-structured data set G comprises M nodes, each node corresponding to a tag. Each node has its own features, and the nodes can be represented as a matrix $X \in \mathbb{R}^{M \times D}$, where D represents the dimension of each node's hidden state, i.e., of the deep features. In addition, the relationships between nodes can be extracted as a relationship matrix $A \in \mathbb{R}^{M \times M}$; this relationship matrix may also be referred to as an adjacency matrix.
Feature extraction may be performed through the n graph convolution layers included in a graph convolutional network (Graph Convolutional Network, GCN) model with the matrix X and the adjacency matrix A as inputs, so that the relationships between nodes and the associations between node features are learned by the GCN model; the adjacency matrix in the GCN is initialized with the co-occurrence frequencies determined from the plurality of tags. The feature matrix of the (j+1)-th graph convolution layer can be expressed as formula (4):
$$H^{(j+1)} = \sigma\left(\hat{A}\, H^{(j)}\, W^{(j)}\right) \tag{4}$$

where $H^{(j+1)}$ represents the feature matrix of the (j+1)-th graph convolution layer; $\sigma(\cdot)$ represents a nonlinear activation function, typically ReLU or tanh; $\hat{A}$ represents the normalized matrix; $H^{(j)}$ represents the feature matrix of the j-th graph convolution layer; and $W^{(j)}$ represents the weight matrix of the j-th graph convolution layer.
It should be noted that the normalized matrix is obtained from the co-occurrence frequencies of the plurality of tags and the identity matrix. For example, according to the co-occurrence frequencies of the plurality of tags, the adjacency matrix $A$ is determined, which records, for each tag, the frequency with which it co-occurs with each of the plurality of tags. The adjacency matrix $A$ and the identity matrix $I$ are summed to obtain the self-degree matrix $\tilde{A} = A + I$. The self-degree matrix is then standardized so that each row is normalized, yielding the normalized matrix $\hat{A} = \tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2}$ (with degree matrix $\tilde{D}_{ii} = \sum_{k} \tilde{A}_{ik}$), i.e., $\hat{A}$ is a symmetric and normalized matrix. The role of the self-degree matrix is mainly reflected in the processing and analysis of the graph.
Therefore, in tasks such as graph neural networks (Graph Neural Network, GNN) and graph embedding, the self-degree matrix is often used together with the adjacency matrix, so that each node attends both to all of its neighbor nodes and to its own features, and so that the influence of different node degrees is balanced when node features are aggregated: nodes of large degree do not excessively dominate the aggregation result, and nodes of small degree are not ignored. Introducing the self-degree matrix therefore helps achieve more balanced and accurate feature aggregation, and addresses the self-propagation problem (ensuring each node's own features are included in aggregation) while preserving the structural information of the graph. The normalization of the adjacency matrix is obtained by multiplying both sides of the adjacency matrix by the inverse of the node degrees; by normalizing the adjacency matrix in this way, nodes with more neighbors are prevented from tending to exert larger influence, improving the accuracy of feature extraction.
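A sketch of this normalization under the formulation above (numpy; the 3-tag co-occurrence values are hypothetical):

```python
import numpy as np

def normalize_adjacency(A: np.ndarray) -> np.ndarray:
    """Compute A_hat = D^{-1/2} (A + I) D^{-1/2} from a co-occurrence
    (adjacency) matrix A, adding self-connections via the identity."""
    A_tilde = A + np.eye(A.shape[0])          # self-degree matrix: A + I
    degrees = A_tilde.sum(axis=1)             # node degrees of A_tilde
    D_inv_sqrt = np.diag(degrees ** -0.5)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization

A = np.array([[0.0, 0.5, 0.2],
              [0.5, 0.0, 0.1],
              [0.2, 0.1, 0.0]])               # hypothetical co-occurrences
A_hat = normalize_adjacency(A)
```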
Research shows that the first fused tag features of each tag can be extracted well through formula (4): with the plurality of tags and the plurality of co-occurrence frequencies as input, after several GCN layers the features of each node change from describing only the tag itself to describing both the tag itself and the tags that jointly identify the same content with it, i.e., the features of its neighbor nodes. But no matter how many layers lie in between, the connection relationships between nodes, i.e., the adjacency matrix A, are shared.
Based on this, in order to realize rapid layer-to-layer propagation, the embodiment of the application provides a method for realizing forward propagation through two graph convolution layers. Specifically, according to the plurality of co-occurrence frequencies and the plurality of tags, feature extraction is performed through the first graph convolution layer included in the initial tag identification model to obtain undetermined tag features of each tag; then, according to the undetermined tag features, feature extraction is performed through the second graph convolution layer included in the initial tag identification model to obtain the first fused tag features of each tag.
As a possible implementation, if the first graph convolution layer uses the activation function ReLU and the second graph convolution layer uses the activation function softmax, the forward propagation procedure can be expressed as formula (5):
$$Z = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A}\, X\, W^{(0)}\right) W^{(1)}\right) \tag{5}$$

where $Z$ represents the first fused tag features; $X$ represents the plurality of tags, i.e., each tag's own features; $\hat{A}$ represents the normalized matrix obtained from the plurality of co-occurrence frequencies (the adjacency matrix $A$) and the self-degree matrix; $W^{(0)}$ represents the weight matrix of the first graph convolution layer, i.e., the weight matrix applied before the first layer's forward-propagation output; and $W^{(1)}$ represents the weight matrix applied to the first graph convolution layer's output.
Therefore, based on the fact that the adjacency matrix A is shared no matter how many layers lie in between, a plurality of graph convolution layers can be converted into two graph convolution layers, thereby realizing rapid propagation between layers and improving the training speed of the tag identification model.
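A minimal sketch of this two-layer forward propagation corresponding to formula (5) (PyTorch; the dimensions are hypothetical, and the identity matrix stands in for the normalized matrix):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    """Two-layer GCN realizing formula (5): Z = softmax(Â ReLU(Â X W0) W1)."""
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.w0 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.w1 = nn.Linear(hidden_dim, out_dim, bias=False)

    def forward(self, x: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
        h = F.relu(a_hat @ self.w0(x))                     # first layer
        return torch.softmax(a_hat @ self.w1(h), dim=-1)   # second layer

M, D = 3, 16                         # 3 tags with 16-dimensional features
x = torch.randn(M, D)                # each tag's own features
a_hat = torch.eye(M)                 # stand-in for the normalized matrix
z = TwoLayerGCN(D, 32, D)(x, a_hat)  # first fused tag features
```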
As one possible implementation manner, after model training is performed based on the initial tag identification model, a trained tag identification model is obtained whose accuracy is greater than or equal to that of the initial tag identification model. The embodiment of the application does not specifically limit how the tag identification model is used; four modes are described below as examples.
In the first mode, the content to be identified, a plurality of tags, and a plurality of co-occurrence frequencies are obtained, where the content to be identified is content whose tags are to be identified, and the plurality of tags and co-occurrence frequencies are those used to train the initial tag identification model. Feature extraction is performed through the tag identification model according to the plurality of co-occurrence frequencies and the plurality of tags to obtain the second fused tag features of each tag. Then, according to the content to be identified and the second fused tag features, identification is performed through the tag identification model to obtain a second identification tag for the content to be identified.
In the second mode, since the plurality of tags and the plurality of co-occurrence frequencies do not change across different contents to be identified in the first mode, the identification speed can be improved by caching. After the second fused tag features are obtained for the first time through the tag identification model based on the plurality of co-occurrence frequencies and the plurality of tags, they are stored. In subsequent use, there is no need to perform feature extraction again through the tag identification model according to the plurality of co-occurrence frequencies and the plurality of tags; instead, the stored second fused tag features are retrieved directly, and identification is performed through the tag identification model according to the second fused tag features and each content to be identified, obtaining the second identification tag corresponding to each content. Directly reusing the second fused tag features rather than repeatedly extracting features from the plurality of co-occurrence frequencies and the plurality of tags improves the identification speed of the tag identification model.
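A sketch of this caching pattern (the class and method names are hypothetical; the encoder interface is an assumption):

```python
import torch

class TagFeatureCache:
    """Compute the fused tag features once, then reuse them at inference."""
    def __init__(self, tag_encoder, tags, cooccurrence):
        self._z = None
        self._encoder = tag_encoder
        self._inputs = (tags, cooccurrence)

    def features(self) -> torch.Tensor:
        if self._z is None:              # first call: extract and store
            with torch.no_grad():
                self._z = self._encoder(*self._inputs)
        return self._z                   # later calls: reuse directly

# usage sketch: z = cache.features(); label = recognizer(content, z)
```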
In the third mode, if the plurality of tags or the plurality of co-occurrence frequencies change, the updated plurality of tags and the updated plurality of co-occurrence frequencies are obtained (it being understood that if the plurality of tags is updated, the co-occurrence frequencies may also be updated). Feature extraction is performed through the tag identification model according to the updated tags and co-occurrence frequencies to obtain the second fused tag features of each tag, and identification is performed through the tag identification model according to the content to be identified and the second fused tag features to obtain the second identification tag for the content to be identified. In this way, if the tags or co-occurrence frequencies change, feature extraction based on the updated data maintains the accuracy of feature extraction and thus of the second identification tag.
In the fourth mode, if the plurality of tags or the plurality of co-occurrence frequencies change, the updated plurality of tags and co-occurrence frequencies are acquired, and the trained tag identification model is used as a new initial tag identification model. Feature extraction is performed through this initial tag identification model according to the updated tags or co-occurrence frequencies to obtain the first fused tag features of each tag (it being understood that if only the co-occurrence frequencies are updated, feature extraction uses the non-updated tags together with the updated co-occurrence frequencies). Identification is then performed through the initial tag identification model according to the content sample and the first fused tag features to obtain the first identification tag for the content sample, and the model parameters of the initial tag identification model are adjusted according to the difference between the first identification tag and the real tag to obtain the tag identification model. Retraining in this way according to the updated tags or co-occurrence frequencies improves the accuracy of the tag identification model.
To facilitate further understanding of the technical solution provided by the embodiments of the present application, the processing method of the tag identification model is described below in an overall, exemplary manner, taking a server as the execution body and a news reading application as the usage scenario.
In news reading applications, a large amount of video content of various categories is produced daily, which raises several problems: how can users quickly find topics or content they are interested in? After quickly finding a topic of interest, can users enter into deep reading? How can related topics or video content be recommended to users accurately so as to improve the user experience? For convenience of explanation, the following takes video content as an example.
These problems can be addressed by accurately identifying the tags corresponding to video content: when the tags of video content are more accurate, matching them against a user's query terms produces more accurate matching results, so that video content the user may be interested in can be recommended quickly and accurately, the user can read deeply, and the user experience is improved.
The training process of the tag recognition model is first described below.
Referring to fig. 6, the application scenario of a processing method of a tag identification model according to an embodiment of the present application is shown. In fig. 6, the initial tag identification model includes a video sub-model, a text sub-model, a graph neural network model, a fusion sub-model, and an identification sub-model.
First, the video sub-content in the content sample is input into the video sub-model, the text sub-content in the content sample is input into the text sub-model, and the plurality of tags together with the co-occurrence frequencies determined based on the plurality of tags are input into the graph neural network model (i.e., the tag feature extraction sub-model described above). Feature extraction is performed through the video sub-model to obtain the first sub-content features, through the text sub-model to obtain the second sub-content features, and through the graph neural network model to obtain the first fused tag features. The first sub-content features, the second sub-content features, and the first fused tag features are fused through the fusion sub-model to obtain the enhanced features, and identification is performed on the enhanced features through the identification sub-model to obtain the first identification tag for the content sample. According to the difference between the first identification tag and the real tag, the model parameters of the initial tag identification model are adjusted to obtain the tag identification model. It can be understood that each adjustment of the model parameters constitutes one round of training, so whether training has finished is determined by whether an iteration condition is met, after which the tag identification model is obtained. The iteration condition may be a preset number of iterations, model convergence, or the like, which the present application does not specifically limit.
Referring to fig. 7, the application scenario diagram corresponding to fig. 6 is provided in the embodiment of the present application. In fig. 7, the fusion sub-model includes a first fusion sub-model, a second fusion sub-model, and a third fusion sub-model.
Inputting the video sub-content in the content sample into a video sub-model, and extracting the characteristics through the video sub-model to obtain the first sub-content characteristics. Inputting the text sub-content in the content sample into a text sub-model, and extracting features through the text sub-model to obtain second sub-content features. Inputting the first sub-content features and the second sub-content features into a third fusion sub-model, and fusing through the third fusion sub-model to obtain content features, wherein the content features are referred to in the formula (2).
According to the plurality of tags, the co-occurrence frequency corresponding to each tag can be determined according to formula (1), and the co-occurrence frequencies are represented in matrix form, i.e., as a co-occurrence frequency matrix. Feature extraction is performed on the plurality of tags to obtain each tag's own features, which can be expressed as a matrix $X \in \mathbb{R}^{M \times D}$, where M is the number of tags and D is the dimension of each tag's deep features. The co-occurrence frequency matrix and the matrix X are input into the graph neural network model, and feature extraction is performed through the graph neural network model to obtain the first fused tag features.
According to the content features and the first fused tag features, fusion is performed through the first fusion sub-model to obtain the comprehensive features; the fusion may be feature multiplication. According to the comprehensive features and the content features, fusion is performed through the second fusion sub-model to obtain the enhanced features, as shown in formula (3).
According to the enhanced features, identification is performed through the identification sub-model to obtain the first identification tag for the content sample. The identification sub-model may be a fully connected (Fully Connected, FC) network, which the present application does not specifically limit.
According to the difference between the first identification tag and the real tag, the model parameters of the initial tag identification model are adjusted to obtain the tag identification model. The difference between the first identification tag and the real tag may be represented in the form of a binary cross-entropy loss function (Binary Cross Entropy Loss, BCELoss), which the present application does not specifically limit.
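A sketch of one such training step under the BCELoss form (PyTorch; the tag count, batch size, and multi-hot targets are stand-ins for the model's actual inputs and outputs):

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()  # binary cross-entropy over per-tag probabilities

# Stand-ins: `logits` plays the role of the identification sub-model's raw
# outputs for 4 samples over 100 candidate tags; `targets` are the real
# (multi-hot) tags. Both are assumptions for illustration.
logits = torch.randn(4, 100, requires_grad=True)
probs = torch.sigmoid(logits)
targets = torch.randint(0, 2, (4, 100)).float()

loss = criterion(probs, targets)
loss.backward()  # the gradients then drive the model-parameter adjustment
```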
After the trained tag identification model is obtained, the content to be identified, a plurality of tags and a plurality of co-occurrence frequencies can be obtained; extracting features according to the co-occurrence frequencies and the tags through a tag identification model to obtain second fusion tag features of each tag; and identifying through a tag identification model according to the content to be identified and the second fusion tag characteristics to obtain a second identification tag aiming at the content to be identified.
Referring to table 1, this table shows a comparison between the mode of the examples of the present application and the mode in which the co-occurrence frequency is not employed.
TABLE 1
In this way, aggregation and ranking can be performed along the tag dimension of videos, which on the one hand helps users actively search for content they are interested in (the content search scenario) and on the other hand helps the platform recommend content to relevant audiences more accurately (the content recommendation scenario), increasing users' consumption time and depth and attracting more users.
The application also provides, for the above processing method of the tag identification model, a corresponding processing apparatus of the tag identification model, so that the processing method can be applied and realized in practice.
Referring to fig. 8, the schematic structural diagram of a processing device for a tag identification model according to an embodiment of the present application is shown. As shown in fig. 8, the processing device 800 of the tag identification model includes: an acquisition unit 801, a feature extraction unit 802, an identification unit 803, and an adjustment unit 804;
The obtaining unit 801 is configured to obtain a content sample, a plurality of tags, and a plurality of co-occurrence frequencies, where the co-occurrence frequencies are used to identify frequencies where at least two of the plurality of tags commonly identify the same content, the tags are used to identify features of the content, and the content sample is a content with a real tag;
The feature extraction unit 802 is configured to perform feature extraction according to a plurality of co-occurrence frequencies and a plurality of the tags through an initial tag identification model, so as to obtain a first fused tag feature of each of the tags, where the first fused tag feature is used to identify a feature of a corresponding tag and a feature of a tag that identifies the same content together with the corresponding tag, and in the first fused tag feature, the greater the co-occurrence frequency of the corresponding tag and the tag that identifies the same content together with the corresponding tag is, the greater the influence of the feature of the tag that identifies the same content together on the first fused tag feature is;
The identifying unit 803 is configured to identify, according to the content sample and the first fusion tag feature, by using the initial tag identification model, to obtain a first identification tag for the content sample;
the adjusting unit 804 is configured to adjust model parameters of the initial tag identification model according to a difference between the first identification tag and the real tag, to obtain a tag identification model, where the tag identification model is used to identify a tag of the content.
As can be seen from the above technical solution, the processing apparatus of the tag identification model includes an acquisition unit, a feature extraction unit, an identification unit, and an adjustment unit. Through the acquisition unit, not only the content sample but also a plurality of tags and a plurality of co-occurrence frequencies are acquired, where the co-occurrence frequencies identify how often at least two of the plurality of tags jointly identify the same content; the relationships and dependencies among the plurality of tags are thus embodied in the co-occurrence frequencies. The feature extraction unit performs feature extraction through the initial tag identification model according to the plurality of tags and the plurality of co-occurrence frequencies to obtain the first fused tag features of each tag, so that the first fused tag features include not only the features of the corresponding tag but also the features of the tags that jointly identify the same content with it. Moreover, in the first fused tag features, the larger the co-occurrence frequency between the corresponding tag and a tag that jointly identifies the same content, the larger the influence of that tag's features on the first fused tag features. That is, the initial tag identification model can better understand the semantics of each tag and the relevance among the plurality of tags, which guides it to understand the content sample purposefully: the identification unit performs identification through the initial tag identification model according to the content sample and the first fused tag features to obtain the first identification tag for the content sample. The adjustment unit then adjusts the model parameters of the initial tag identification model according to the difference between the first identification tag and the real tag corresponding to the content sample, so that the first identification tag obtained through the initial tag identification model comes ever closer to the real tag, yielding the tag identification model. Because the semantics of the plurality of tags and the relevance among them are introduced while identifying content, the tag identification model can better understand content under the guidance of the tags, and the accuracy of content identification is improved through information from different sources.
As a possible implementation manner, the identifying unit 803 is specifically configured to:
according to the content sample, extracting features through the initial tag identification model to obtain content features corresponding to the content sample;
According to the content characteristics and the first fusion tag characteristics, fusing is carried out through the initial tag identification model, and comprehensive characteristics aiming at the content samples are obtained;
and according to the comprehensive characteristics, identifying through the initial tag identification model to obtain a first identification tag aiming at the content sample.
As a possible implementation manner, the identifying unit 803 is specifically configured to:
according to the content sample, extracting features through the initial tag identification model to obtain content features corresponding to the content sample;
According to the content characteristics and the first fusion tag characteristics, fusing is carried out through the initial tag identification model to obtain comprehensive characteristics;
According to the content characteristics and the comprehensive characteristics, fusing through the initial tag identification model to obtain enhanced characteristics aiming at the content samples;
And according to the enhancement features, identifying through the initial tag identification model to obtain a first identification tag aiming at the content sample.
As a possible implementation manner, the obtaining unit 801 is specifically configured to:
acquiring the content sample and a plurality of tags, wherein the content sample has a single tag;
Determining a category of the content sample;
updating a plurality of labels according to the category of the content sample to obtain a plurality of updated labels;
And obtaining a plurality of co-occurrence frequencies according to the updated plurality of labels, wherein the co-occurrence frequencies are used for identifying frequencies of the same content respectively identified by the category and the plurality of labels.
As a possible implementation manner, if the content sample includes a first sub-content belonging to a first category and a second sub-content belonging to a second category, the identifying unit 803 is specifically configured to:
Obtaining the first sub-content and the second sub-content according to the content sample;
according to the first sub-content, extracting features through the initial tag identification model to obtain first sub-content features;
extracting features through the initial tag identification model according to the second sub-content to obtain second sub-content features;
According to the first sub-content features and the second sub-content features, fusing through the initial tag identification model to obtain content features aiming at the content sample;
and identifying through the initial tag identification model according to the content characteristics and the first fusion tag characteristics to obtain a first identification tag aiming at the content sample.
As a possible implementation manner, if the first category is a video category and the second category is a text category, the feature extraction unit 802 is specifically configured to:
according to the first sub-content, extracting features through a video sub-model included in the initial tag identification model to obtain first sub-content features;
According to the second sub-content, extracting features through a text sub-model included in the initial tag identification model to obtain second sub-content features;
The adjusting unit 804 is specifically configured to:
according to the difference between the first identification tag and the real tag, adjusting the model parameters of the video sub-model with a first hyperparameter and adjusting the model parameters of the text sub-model with a second hyperparameter to obtain a tag identification model, wherein the first hyperparameter is larger than the second hyperparameter.
As a possible implementation manner, the adjusting unit 804 performs the (i+1)-th model training process as follows, where i is a positive integer;
updating the first hyperparameter according to the accuracy of the video sub-model obtained in the i-th model training, to obtain an updated first hyperparameter;
adjusting the model parameters of the video sub-model in the (i+1)-th training according to the updated first hyperparameter;
updating the second hyperparameter according to the accuracy of the text sub-model obtained in the i-th model training, to obtain an updated second hyperparameter;
and adjusting the model parameters of the text sub-model in the (i+1)-th training according to the updated second hyperparameter.
As a possible implementation manner, the feature extraction unit 802 is specifically configured to:
according to the co-occurrence frequencies and the labels, extracting features through n graph convolution layers included in the initial label identification model to obtain first fusion label features of the labels, wherein n is a positive integer;
the feature extraction method of the (j+1)-th graph convolution layer is as follows, where j is a positive integer smaller than n:
acquiring a normalized matrix, a characteristic matrix of a j-th layer graph convolution layer and a weight matrix of the j-th layer graph convolution layer, wherein the normalized matrix is obtained according to co-occurrence frequencies and identity matrixes of a plurality of labels;
And obtaining the characteristic matrix of the j+1th layer graph convolution layer according to the normalized matrix, the characteristic matrix of the j layer graph convolution layer and the weight matrix of the j layer graph convolution layer.
As a possible implementation manner, the feature extraction unit 802 is specifically configured to:
Extracting features through a first graph convolution layer included in the initial tag identification model according to the co-occurrence frequencies and the tags to obtain undetermined tag features of the tags;
and extracting features through a second graph convolution layer included in the initial tag identification model according to the features of the undetermined tag to obtain first fusion tag features of the tags.
As a possible implementation manner, the processing apparatus 800 of the tag identification model further includes an application unit configured to:
acquiring content to be identified, a plurality of labels and a plurality of co-occurrence frequencies;
according to the co-occurrence frequencies and the labels, extracting features through the label identification model to obtain second fusion label features of the labels;
And identifying through the tag identification model according to the content to be identified and the second fusion tag characteristics to obtain a second identification tag aiming at the content to be identified.
As a possible implementation manner, the obtaining unit 801 is further configured to obtain updated tags and updated co-occurrence frequencies if the tags or the co-occurrence frequencies change;
The feature extraction unit 802 is specifically configured to perform feature extraction according to the updated plurality of tags or the updated plurality of co-occurrence frequencies through the tag identification model, so as to obtain second fusion tag features of each tag.
The embodiment of the application also provides a computer device which can be a server or a terminal device, and the computer device provided by the embodiment of the application is introduced from the aspect of hardware materialization. Fig. 9 is a schematic structural diagram of a server, and fig. 10 is a schematic structural diagram of a terminal device.
Referring to fig. 9, which is a schematic diagram of a server structure according to an embodiment of the present application, the server 1400 may vary considerably in configuration or performance, and may include one or more processors 1422 (such as central processing units, CPUs), memory 1432, and one or more storage media 1430 (such as one or more mass storage devices) storing application programs 1442 or data 1444. The memory 1432 and storage medium 1430 may be transitory or persistent storage. The program stored in the storage medium 1430 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the processor 1422 may be configured to communicate with the storage medium 1430 to execute the series of instruction operations in the storage medium 1430 on the server 1400.
The server 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 9.
The processor 1422 is configured to perform the following steps:
Acquiring a content sample, a plurality of tags and a plurality of co-occurrence frequencies, wherein the co-occurrence frequencies are used for identifying the frequency of at least two tags in the plurality of tags for commonly identifying the same content, the tags are used for identifying the characteristics of the content, and the content sample is the content with the real tags;
According to the co-occurrence frequencies and the tags, extracting features through an initial tag identification model to obtain first fusion tag features of the tags, wherein the first fusion tag features are used for identifying features of corresponding tags and features of tags which jointly identify the same content with the corresponding tags, and in the first fusion tag features, the larger the co-occurrence frequency of the corresponding tags and the tags which jointly identify the same content with the corresponding tags is, the larger the influence of the features of the tags which jointly identify the same content on the first fusion tag features is;
according to the content sample and the first fusion tag characteristics, identifying through the initial tag identification model to obtain a first identification tag aiming at the content sample;
And according to the difference between the first identification tag and the real tag, adjusting the model parameters of the initial tag identification model to obtain a tag identification model, wherein the tag identification model is used for identifying the tag of the content.
Optionally, the processor 1422 may also perform method steps of any specific implementation of the method for processing a tag identification model in an embodiment of the present application.
Referring to fig. 10, the structure of a terminal device according to an embodiment of the present application is shown. Taking a smart phone as the terminal device, fig. 10 shows a block diagram of part of the structure of the smart phone, which includes: radio frequency (Radio Frequency, RF) circuitry 1510, memory 1520, an input unit 1530, a display unit 1540, sensors 1550, audio circuitry 1560, a wireless fidelity (WiFi) module 1570, a processor 1580, a power supply 1590, and the like. Those skilled in the art will appreciate that the smartphone structure shown in fig. 10 does not limit the smartphone, which may include more or fewer components than shown, combine certain components, or arrange components differently.
The following describes each component of the smart phone in detail with reference to fig. 10:
The RF circuit 1510 may be used for receiving and transmitting signals during a message or a call; in particular, after receiving downlink information from a base station, it passes the information to the processor 1580 for processing, and it sends uplink data to the base station.
The memory 1520 may be used to store software programs and modules, and the processor 1580 implements various functional applications and data processing of the smartphone by running the software programs and modules stored in the memory 1520.
The input unit 1530 may be used to receive input numerical or character information and generate key signal inputs related to user settings and function control of the smart phone. In particular, the input unit 1530 may include a touch panel 1531 and other input devices 1532. The touch panel 1531, also referred to as a touch screen, may collect touch operations on or near the user and drive the corresponding connection device according to a predetermined program. The input unit 1530 may include other input devices 1532 in addition to the touch panel 1531. In particular, other input devices 1532 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, mouse, joystick, etc.
The display unit 1540 may be used to display information input by the user or provided to the user, as well as the various menus of the smartphone. The display unit 1540 may include a display panel 1541, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
The smartphone may also include at least one sensor 1550, such as a light sensor, a motion sensor, and other sensors. Other sensors that may additionally be configured, such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor, are not described in detail herein.
The audio circuitry 1560, speaker 1561, and microphone 1562 may provide an audio interface between the user and the smartphone. The audio circuit 1560 may transmit an electrical signal converted from received audio data to the speaker 1561, which converts it into a sound signal for output; conversely, the microphone 1562 converts collected sound signals into electrical signals, which the audio circuit 1560 receives and converts into audio data; after being processed by the processor 1580, the audio data is transmitted via the RF circuit 1510 to, for example, another smartphone, or output to the memory 1520 for further processing.
Processor 1580 is a control center of the smartphone, connects various parts of the entire smartphone with various interfaces and lines, performs various functions of the smartphone and processes data by running or executing software programs and/or modules stored in memory 1520, and invoking data stored in memory 1520. In the alternative, processor 1580 may include one or more processing units.
The smartphone also includes a power supply 1590 (e.g., a battery) for powering the various components. The power supply may be logically connected to the processor 1580 through a power management system, thereby managing charging, discharging, and power consumption.
Although not shown, the smart phone may further include a camera, a bluetooth module, etc., which will not be described herein.
In an embodiment of the present application, the memory 1520 included in the smart phone may store a computer program and transmit the computer program to the processor.
The processor 1580 included in the smart phone may execute the processing method of the tag identification model provided in the foregoing embodiment according to instructions in the computer program.
The embodiment of the application also provides a computer readable storage medium for storing a computer program for executing the processing method of the tag identification model provided by the embodiment.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the processing method of the tag identification model provided in various alternative implementations of the above aspect.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be completed by program instructions running on related hardware; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium may be at least one of the following media: read-only memory (ROM), RAM, a magnetic disk, an optical disk, or the like.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program that has a predetermined function and works together with other relevant parts to achieve a predetermined objective, and it may be implemented wholly or partly in software, in hardware (such as a processing circuit or a memory), or in a combination of both. Likewise, one processor (or multiple processors, or a processor and a memory) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of several modules or units.
It should be noted that the embodiments in this specification are described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative: components illustrated as separate units may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without inventive effort.
The foregoing is only a specific embodiment of the present application, but the protection scope of the present application is not limited thereto; any change or substitution readily conceivable to those skilled in the art within the technical scope disclosed by the present application shall fall within the protection scope of the present application. Based on the implementations provided in the above aspects, further combinations may be made to provide further implementations. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (15)

1. A method of processing a tag identification model, the method comprising:
Obtaining a content sample, a plurality of tags, and a plurality of co-occurrence frequencies, including: acquiring the content sample and the plurality of tags, wherein the content sample has a single tag; determining a category of the content sample; updating the plurality of tags according to the category of the content sample to obtain a plurality of updated tags; and obtaining the plurality of co-occurrence frequencies according to the plurality of updated tags, wherein the co-occurrence frequencies are used for identifying the frequency with which the category and a tag respectively identify the same content, and for identifying the frequency with which at least two of the plurality of tags jointly identify the same content, a tag is used for identifying a characteristic of content, and the content sample is content having a real tag;
extracting features through an initial tag identification model according to the plurality of co-occurrence frequencies and the plurality of tags to obtain a first fusion tag feature of each tag, wherein the first fusion tag feature is used for identifying the feature of the corresponding tag and the features of the tags that jointly identify the same content with the corresponding tag; in the first fusion tag feature, the greater the co-occurrence frequency between the corresponding tag and a tag that jointly identifies the same content with it, the greater the influence of that tag's feature on the first fusion tag feature; and the first fusion tag features comprise the semantics of the plurality of tags and the correlations among the plurality of tags;
identifying through the initial tag identification model according to the content sample and the first fusion tag features to obtain a first identification tag for the content sample;
adjusting model parameters of the initial tag identification model according to the difference between the first identification tag and the real tag to obtain a tag identification model, wherein the tag identification model is used for identifying tags of content;
wherein the identifying through the initial tag identification model according to the content sample and the first fusion tag features to obtain a first identification tag for the content sample includes:
extracting features through the initial tag identification model according to the content sample to obtain content features corresponding to the content sample;
fusing through the initial tag identification model according to the content features and the first fusion tag features to obtain comprehensive features;
fusing through the initial tag identification model according to the content features and the comprehensive features to obtain enhanced features for the content sample;
identifying through the initial tag identification model according to the enhanced features to obtain the first identification tag for the content sample;
wherein, if the content sample includes a first sub-content belonging to a first category and a second sub-content belonging to a second category, the first category being a video category and the second category being a text category, the identifying through the initial tag identification model according to the content sample and the first fusion tag features to obtain a first identification tag for the content sample includes:
obtaining the first sub-content and the second sub-content from the content sample;
extracting features through a video sub-model included in the initial tag identification model according to the first sub-content to obtain first sub-content features;
extracting features through a text sub-model included in the initial tag identification model according to the second sub-content to obtain second sub-content features;
fusing through the initial tag identification model according to the first sub-content features and the second sub-content features to obtain content features for the content sample;
identifying through the initial tag identification model according to the content features and the first fusion tag features to obtain the first identification tag for the content sample;
and wherein the adjusting of the model parameters of the initial tag identification model according to the difference between the first identification tag and the real tag to obtain a tag identification model includes:
adjusting the model parameters of the video sub-model by using a first hyperparameter and adjusting the model parameters of the text sub-model by using a second hyperparameter according to the difference between the first identification tag and the real tag to obtain the tag identification model, wherein the first hyperparameter is larger than the second hyperparameter.
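As a concrete reading of claim 1's recognition path and differential update, the sketch below encodes video and text sub-contents separately, fuses the content features with the first fusion tag features through a simple attention step to form comprehensive and enhanced features, and assigns the video sub-model a larger learning rate than the text sub-model. The linear stand-ins for the video/text backbones, the attention-style fusion, and the concrete learning rates are all assumptions for illustration, not details fixed by the claim.

```python
import torch
import torch.nn as nn

class MultimodalRecogniser(nn.Module):
    """Illustrative recognition path: video/text sub-models, fusion with the
    first fusion tag features, then multi-label classification."""
    def __init__(self, dim, num_tags):
        super().__init__()
        self.video_sub = nn.Linear(dim, dim)   # stand-in for a video backbone
        self.text_sub = nn.Linear(dim, dim)    # stand-in for a text backbone
        self.classifier = nn.Linear(dim, num_tags)

    def forward(self, video_feats, text_feats, fused_tag_feats):
        # Content features: fusion of the two sub-content features.
        content = self.video_sub(video_feats) + self.text_sub(text_feats)
        # Comprehensive features: content attends to the fused tag features.
        attn = torch.softmax(content @ fused_tag_feats.T, dim=-1)
        comprehensive = attn @ fused_tag_feats
        # Enhanced features for the sample, then identification.
        enhanced = content + comprehensive
        return self.classifier(enhanced)

# Differential update: a larger hyperparameter (here, learning rate) for the
# video sub-model than for the text sub-model, as the claim requires.
model = MultimodalRecogniser(dim=128, num_tags=50)
optimizer = torch.optim.Adam([
    {"params": model.video_sub.parameters(), "lr": 1e-3},   # first hyperparameter
    {"params": model.text_sub.parameters(), "lr": 1e-4},    # second, smaller
    {"params": model.classifier.parameters(), "lr": 1e-3},
])
```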
2. The method according to claim 1, wherein the (i+1)-th adjustment of the model parameters is performed as follows, i being a positive integer:
updating the first hyperparameter according to the accuracy of the video sub-model obtained from the i-th round of model training to obtain an updated first hyperparameter;
adjusting the model parameters of the video sub-model for the (i+1)-th time according to the updated first hyperparameter;
updating the second hyperparameter according to the accuracy of the text sub-model obtained from the i-th round of model training to obtain an updated second hyperparameter;
and adjusting the model parameters of the text sub-model for the (i+1)-th time according to the updated second hyperparameter.
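Claim 2 leaves open how the accuracy from round i maps to the hyperparameters for round i+1; one plausible instantiation is to shrink a sub-model's learning rate as its accuracy rises. A hedged sketch, reusing the optimizer parameter groups from the previous example (the linear scaling rule is an assumption):

```python
def update_hyperparameters(optimizer, video_acc, text_acc,
                           base_video_lr=1e-3, base_text_lr=1e-4):
    """(i+1)-th adjustment: scale each sub-model's learning rate by how far
    it still is from perfect accuracy in the i-th round. Param group 0 is the
    video sub-model and group 1 the text sub-model, matching the optimizer
    built in the previous sketch; both the grouping and the rule are assumed.
    """
    optimizer.param_groups[0]["lr"] = base_video_lr * (1.0 - video_acc)
    optimizer.param_groups[1]["lr"] = base_text_lr * (1.0 - text_acc)
```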
3. The method according to claim 1, wherein the extracting features through an initial tag identification model according to the plurality of co-occurrence frequencies and the plurality of tags to obtain a first fusion tag feature of each tag includes:
extracting features through n graph convolution layers included in the initial tag identification model according to the plurality of co-occurrence frequencies and the plurality of tags to obtain the first fusion tag feature of each tag, n being a positive integer;
wherein feature extraction at the (j+1)-th graph convolution layer is performed as follows, j being a positive integer smaller than n:
acquiring a normalized matrix, the feature matrix of the j-th graph convolution layer, and the weight matrix of the j-th graph convolution layer, wherein the normalized matrix is obtained from the co-occurrence frequencies of the plurality of tags and an identity matrix;
and obtaining the feature matrix of the (j+1)-th graph convolution layer according to the normalized matrix, the feature matrix of the j-th graph convolution layer, and the weight matrix of the j-th graph convolution layer.
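Claim 3's per-layer rule matches the standard graph-convolution update H(j+1) = sigma(A_norm · H(j) · W(j)), where A_norm is the normalized matrix built from the co-occurrence matrix plus the identity matrix. A minimal sketch follows; the symmetric degree normalization and the ReLU non-linearity are assumptions, as the claim fixes neither.

```python
import torch

def normalized_matrix(co_occurrence: torch.Tensor) -> torch.Tensor:
    """Normalized matrix from the tag co-occurrence matrix plus the identity
    matrix (self-loops), using symmetric normalization D^-1/2 (A+I) D^-1/2.
    The exact normalization is an assumption; the claim only says the matrix
    is derived from the co-occurrence frequencies and an identity matrix."""
    a_hat = co_occurrence + torch.eye(co_occurrence.size(0))
    d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-12).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)

def graph_conv_layer(a_norm: torch.Tensor, h_j: torch.Tensor, w_j: torch.Tensor) -> torch.Tensor:
    """Feature matrix of layer j+1 from the normalized matrix, the feature
    matrix of layer j, and the weight matrix of layer j."""
    return torch.relu(a_norm @ h_j @ w_j)
```

Stacking two such layers, with the first layer's output playing the role of the undetermined tag features, gives exactly the two-layer variant of claim 4.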
4. The method according to claim 2, wherein the extracting features through an initial tag identification model according to the plurality of co-occurrence frequencies and the plurality of tags to obtain a first fusion tag feature of each tag includes:
extracting features through a first graph convolution layer included in the initial tag identification model according to the plurality of co-occurrence frequencies and the plurality of tags to obtain undetermined tag features of each tag;
and extracting features through a second graph convolution layer included in the initial tag identification model according to the undetermined tag features to obtain the first fusion tag feature of each tag.
5. The method according to any one of claims 1-4, further comprising:
acquiring content to be identified, the plurality of tags, and the plurality of co-occurrence frequencies;
extracting features through the tag identification model according to the plurality of co-occurrence frequencies and the plurality of tags to obtain second fusion tag features of each tag;
and identifying through the tag identification model according to the content to be identified and the second fusion tag features to obtain a second identification tag for the content to be identified.
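At inference time (claim 5) the trained model repeats the same two stages on new content: recompute the fused tag features, then score the content to be identified against them. A minimal sketch for a single content item, assuming the same hypothetical `tag_encoder`/`recognise` sub-modules and an illustrative 0.5 decision threshold:

```python
import torch

@torch.no_grad()
def identify_tags(model, content, tag_embeddings, co_occurrence, threshold=0.5):
    """Return the indices of the second identification tags for one content
    to be identified. Threshold-based multi-label decoding is an assumption."""
    # Second fusion tag features, extracted by the trained model.
    fused = model.tag_encoder(tag_embeddings, co_occurrence)
    probs = torch.sigmoid(model.recognise(content, fused))
    return (probs > threshold).nonzero(as_tuple=False).squeeze(-1)
```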
6. The method of claim 5, wherein the method further comprises:
if the plurality of tags or the plurality of co-occurrence frequencies change, acquiring the updated plurality of tags and the updated plurality of co-occurrence frequencies;
wherein the extracting features through the tag identification model according to the plurality of co-occurrence frequencies and the plurality of tags to obtain second fusion tag features of each tag includes:
extracting features through the tag identification model according to the updated plurality of tags or the updated plurality of co-occurrence frequencies to obtain the second fusion tag features of each tag.
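Because the second fusion tag features depend only on the tag set and the co-occurrence frequencies, not on any particular content, they can be computed once and refreshed only when claim 6's change condition fires. A sketch of that caching pattern, with hypothetical method names:

```python
import torch

class FusedTagFeatureCache:
    """Keep the second fusion tag features cached; re-extract them only when
    the tag set or the co-occurrence frequencies change (claim 6 sketch)."""

    def __init__(self, model, tag_embeddings, co_occurrence):
        self.model = model
        self.on_tags_changed(tag_embeddings, co_occurrence)

    def on_tags_changed(self, tag_embeddings, co_occurrence):
        # Tags or co-occurrence frequencies changed: recompute once, reuse after.
        with torch.no_grad():
            self.fused = self.model.tag_encoder(tag_embeddings, co_occurrence)

    def identify(self, content):
        # Every content to be identified reuses the cached fused tag features.
        return self.model.recognise(content, self.fused)
```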
7. A processing apparatus for a tag identification model, the apparatus comprising: an obtaining unit, a feature extraction unit, an identification unit, and an adjusting unit;
The obtaining unit is configured to obtain a content sample, a plurality of tags, and a plurality of co-occurrence frequencies, including: acquiring the content sample and the plurality of tags, wherein the content sample has a single tag; determining a category of the content sample; updating the plurality of tags according to the category of the content sample to obtain a plurality of updated tags; and obtaining the plurality of co-occurrence frequencies according to the plurality of updated tags, wherein the co-occurrence frequencies are used for identifying the frequency with which the category and a tag respectively identify the same content, and for identifying the frequency with which at least two of the plurality of tags jointly identify the same content, a tag is used for identifying a characteristic of content, and the content sample is content having a real tag;
The feature extraction unit is configured to extract features through an initial tag identification model according to the plurality of co-occurrence frequencies and the plurality of tags to obtain a first fusion tag feature of each tag, wherein the first fusion tag feature is used for identifying the feature of the corresponding tag and the features of the tags that jointly identify the same content with the corresponding tag; in the first fusion tag feature, the greater the co-occurrence frequency between the corresponding tag and a tag that jointly identifies the same content with it, the greater the influence of that tag's feature on the first fusion tag feature; and the first fusion tag features comprise the semantics of the plurality of tags and the correlations among the plurality of tags;
The identification unit is configured to identify through the initial tag identification model according to the content sample and the first fusion tag features to obtain a first identification tag for the content sample;
The adjusting unit is configured to adjust the model parameters of the initial tag identification model according to the difference between the first identification tag and the real tag to obtain a tag identification model, wherein the tag identification model is used for identifying tags of content;
The identification unit is specifically configured to:
extracting features through the initial tag identification model according to the content sample to obtain content features corresponding to the content sample;
fusing through the initial tag identification model according to the content features and the first fusion tag features to obtain comprehensive features;
fusing through the initial tag identification model according to the content features and the comprehensive features to obtain enhanced features for the content sample;
identifying through the initial tag identification model according to the enhanced features to obtain the first identification tag for the content sample;
wherein, if the content sample includes a first sub-content belonging to a first category and a second sub-content belonging to a second category, the first category being a video category and the second category being a text category, the identification unit is specifically configured to:
obtain the first sub-content and the second sub-content from the content sample;
extract features through a video sub-model included in the initial tag identification model according to the first sub-content to obtain first sub-content features;
extract features through a text sub-model included in the initial tag identification model according to the second sub-content to obtain second sub-content features;
fuse through the initial tag identification model according to the first sub-content features and the second sub-content features to obtain content features for the content sample;
identify through the initial tag identification model according to the content features and the first fusion tag features to obtain the first identification tag for the content sample;
wherein the adjusting unit is specifically configured to:
adjust the model parameters of the video sub-model by using a first hyperparameter and adjust the model parameters of the text sub-model by using a second hyperparameter according to the difference between the first identification tag and the real tag to obtain the tag identification model, wherein the first hyperparameter is larger than the second hyperparameter.
8. The apparatus according to claim 7, wherein the adjusting unit performs the (i+1)-th adjustment of the model parameters as follows, i being a positive integer:
updating the first hyperparameter according to the accuracy of the video sub-model obtained from the i-th round of model training to obtain an updated first hyperparameter;
adjusting the model parameters of the video sub-model for the (i+1)-th time according to the updated first hyperparameter;
updating the second hyperparameter according to the accuracy of the text sub-model obtained from the i-th round of model training to obtain an updated second hyperparameter;
and adjusting the model parameters of the text sub-model for the (i+1)-th time according to the updated second hyperparameter.
9. The apparatus according to claim 7, wherein the feature extraction unit is specifically configured to:
extract features through n graph convolution layers included in the initial tag identification model according to the plurality of co-occurrence frequencies and the plurality of tags to obtain the first fusion tag feature of each tag, n being a positive integer;
wherein feature extraction at the (j+1)-th graph convolution layer is performed as follows, j being a positive integer smaller than n:
acquiring a normalized matrix, the feature matrix of the j-th graph convolution layer, and the weight matrix of the j-th graph convolution layer, wherein the normalized matrix is obtained from the co-occurrence frequencies of the plurality of tags and an identity matrix;
and obtaining the feature matrix of the (j+1)-th graph convolution layer according to the normalized matrix, the feature matrix of the j-th graph convolution layer, and the weight matrix of the j-th graph convolution layer.
10. The apparatus according to claim 8, wherein the feature extraction unit is specifically configured to:
extract features through a first graph convolution layer included in the initial tag identification model according to the plurality of co-occurrence frequencies and the plurality of tags to obtain undetermined tag features of each tag;
and extract features through a second graph convolution layer included in the initial tag identification model according to the undetermined tag features to obtain the first fusion tag feature of each tag.
11. The apparatus according to any of the claims 7-10, further comprising an application unit for:
acquire content to be identified, the plurality of tags, and the plurality of co-occurrence frequencies;
extract features through the tag identification model according to the plurality of co-occurrence frequencies and the plurality of tags to obtain second fusion tag features of each tag;
and identify through the tag identification model according to the content to be identified and the second fusion tag features to obtain a second identification tag for the content to be identified.
12. The apparatus of claim 11, wherein the obtaining unit is further configured to obtain the updated plurality of tags and the updated plurality of co-occurrence frequencies if the plurality of tags or the plurality of co-occurrence frequencies change;
The feature extraction unit is specifically configured to perform feature extraction according to the updated plurality of tags or the updated plurality of co-occurrence frequencies through the tag identification model, so as to obtain second fusion tag features of each tag.
13. A computer device, the computer device comprising a processor and a memory:
the memory is used for storing a computer program and transmitting the computer program to the processor;
The processor is configured to perform the method of any of claims 1-6 according to the computer program.
14. A computer readable storage medium, characterized in that the computer readable storage medium is for storing a computer program for executing the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when run on a computer device, causes the computer device to perform the method of any of claims 1-6.
CN202410441452.7A 2024-04-12 2024-04-12 Label recognition model processing method and related device Active CN118035945B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410441452.7A CN118035945B (en) 2024-04-12 2024-04-12 Label recognition model processing method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410441452.7A CN118035945B (en) 2024-04-12 2024-04-12 Label recognition model processing method and related device

Publications (2)

Publication Number Publication Date
CN118035945A CN118035945A (en) 2024-05-14
CN118035945B true CN118035945B (en) 2024-07-05

Family

ID=90991735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410441452.7A Active CN118035945B (en) 2024-04-12 2024-04-12 Label recognition model processing method and related device

Country Status (1)

Country Link
CN (1) CN118035945B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118228021A (en) * 2024-05-24 2024-06-21 腾讯科技(深圳)有限公司 Training method and related device for recognition model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627447A (en) * 2021-10-13 2021-11-09 腾讯科技(深圳)有限公司 Label identification method, label identification device, computer equipment, storage medium and program product

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761291A (en) * 2021-04-27 2021-12-07 腾讯科技(深圳)有限公司 Processing method and device for label classification
CN114283310A (en) * 2021-08-25 2022-04-05 腾讯科技(深圳)有限公司 Image recognition model acquisition method, image recognition device and medium
CN115344698A (en) * 2022-08-11 2022-11-15 腾讯科技(深圳)有限公司 Label processing method, label processing device, computer equipment, storage medium and program product

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627447A (en) * 2021-10-13 2021-11-09 腾讯科技(深圳)有限公司 Label identification method, label identification device, computer equipment, storage medium and program product

Also Published As

Publication number Publication date
CN118035945A (en) 2024-05-14

Similar Documents

Publication Publication Date Title
Yang et al. Image-text multimodal emotion classification via multi-view attentional network
CN111897964B (en) Text classification model training method, device, equipment and storage medium
CN112182166B (en) Text matching method and device, electronic equipment and storage medium
US12008810B2 (en) Video sequence selection method, computer device, and storage medium
CN110209897B (en) Intelligent dialogue method, device, storage medium and equipment
CN118035945B (en) Label recognition model processing method and related device
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN114332680A (en) Image processing method, video searching method, image processing device, video searching device, computer equipment and storage medium
CN118103834A (en) Information acquisition method and device
CN113761153A (en) Question and answer processing method and device based on picture, readable medium and electronic equipment
CN112085120B (en) Multimedia data processing method and device, electronic equipment and storage medium
CN116935188B (en) Model training method, image recognition method, device, equipment and medium
CN113761887A (en) Matching method and device based on text processing, computer equipment and storage medium
CN116977701A (en) Video classification model training method, video classification method and device
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
CN117556067B (en) Data retrieval method, device, computer equipment and storage medium
Ishmam et al. From image to language: A critical analysis of visual question answering (vqa) approaches, challenges, and opportunities
CN116186197A (en) Topic recommendation method, device, electronic equipment and storage medium
Liu et al. A multimodal approach for multiple-relation extraction in videos
CN116578729B (en) Content search method, apparatus, electronic device, storage medium, and program product
CN116910201A (en) Dialogue data generation method and related equipment thereof
CN116975403A (en) Content retrieval model, content retrieval processing method and device and computer equipment
CN114357203B (en) Multimedia retrieval method and device and computer equipment
CN115269961A (en) Content search method and related device
CN114625986A (en) Method, device and equipment for sorting search results and storage medium

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant