CN118135466B - Data processing method, device, computer, storage medium and program product - Google Patents

Data processing method, device, computer, storage medium and program product

Info

Publication number
CN118135466B
CN118135466B (application CN202410558896.9A)
Authority
CN
China
Prior art keywords
feature
data
video
attention
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410558896.9A
Other languages
Chinese (zh)
Other versions
CN118135466A (en)
Inventor
杨善明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202410558896.9A
Publication of CN118135466A
Application granted
Publication of CN118135466B
Legal status: Active


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a data processing method, a device, a computer, a storage medium and a program product, relating to data transmission technology in the field of big data. The method comprises the following steps: inputting video data and text data associated with the video data into a target joint model, where the target joint model includes an encoder and a decoder; performing feature coding on the video data and the text data through the encoder to obtain a video feature sequence and text data features respectively, and fusing the video feature sequence and the text data features to obtain a composite feature sequence; acquiring tag feature sequences corresponding to N tag data, and performing cross attention processing on the composite feature sequence and the tag feature sequence through the decoder to obtain a cross attention sequence corresponding to the tag feature sequence; and setting target tag data among the N tag data for the video data according to the confidence values output from the cross attention sequence. By adopting the method and the device, the efficiency of multi-label identification for video can be improved.

Description

Data processing method, device, computer, storage medium and program product
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method, apparatus, computer, storage medium, and program product.
Background
With the continuous evolution and innovation of internet technology, video content has gradually become one of the core carriers of information transmission in modern society, and massive video resources are flooding into major video sharing and communication platforms at an unprecedented speed. How to efficiently organize and classify this large volume of video, and to realize accurate content recommendation based on the personalized requirements of users, has become a key problem to be overcome in the industry. In the face of a huge video library, each video may correspond to a plurality of labels. At present, a video can be marked with multiple labels by manual annotation, but manual annotation is low in efficiency, time-consuming and labor-intensive, high in economic cost, and difficult to adapt to a rapidly updated video content ecology. At the same time, the objectivity of manual annotation is low, which affects the label marking of the video and may result in low accuracy of the labels set for the video.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, a computer, a storage medium and a program product, which can improve the accuracy and efficiency of video multi-label identification.
In one aspect, an embodiment of the present application provides a data processing method, where the method includes:
Inputting video data and text data associated with the video data into a target joint model; the target joint model includes an encoder and a decoder;
Performing feature coding on the video data and the text data through an encoder in the target joint model to obtain a video feature sequence corresponding to the video data and text data features corresponding to the text data, and performing fusion processing on the video feature sequence and the text data features to obtain a composite feature sequence;
Acquiring tag feature sequences corresponding to N tag data, and performing cross attention processing on the composite feature sequences and the tag feature sequences through a decoder to obtain cross attention sequences corresponding to the tag feature sequences; each cross attention value in the cross attention sequence is used for indicating the association relationship between video data and one tag data; n is a positive integer;
and outputting confidence values respectively corresponding to the N label data according to the cross attention sequence, and setting target label data in the N label data for the video data according to the confidence values.
In one aspect, an embodiment of the present application provides another data processing method, where the method includes:
Inputting a video sample and a text sample associated with the video sample into an initial joint model; the initial joint model includes an initial encoder and an initial decoder;
Performing feature coding on the video sample and the text sample through an initial encoder in the initial joint model to obtain a video sample feature sequence corresponding to the video sample and text sample features corresponding to the text sample, and performing fusion processing on the video sample feature sequence and the text sample features to obtain a composite sample feature sequence;
Acquiring tag sample feature sequences corresponding to W tag samples, and performing cross attention processing on the composite sample feature sequences and the tag sample feature sequences through an initial decoder in an initial joint model to obtain cross attention sample sequences corresponding to the tag sample feature sequences; each cross-attention value in the sequence of cross-attention samples is used to indicate a degree of association between a video sample and one label sample; w is a positive integer;
and carrying out parameter adjustment on an initial encoder and an initial decoder in the initial joint model according to the cross attention sample sequence to obtain a target joint model, wherein the target joint model is used for carrying out multi-label prediction of the video.
In one aspect, an embodiment of the present application provides a data processing apparatus, including:
The parameter input module is used for inputting video data and text data associated with the video data into the target joint model; the target joint model includes an encoder and a decoder;
The feature processing module is used for carrying out feature coding on the video data and the text data through an encoder in the target joint model to obtain a video feature sequence corresponding to the video data and text data features corresponding to the text data, and carrying out fusion processing on the video feature sequence and the text data features to obtain a composite feature sequence;
The cross attention processing module is used for acquiring tag feature sequences corresponding to the N tag data, and carrying out cross attention processing on the composite feature sequences and the tag feature sequences through the decoder to obtain cross attention sequences corresponding to the tag feature sequences; each cross-attention value in the cross-attention sequence is used to indicate a degree of association between video data and one tag data; n is a positive integer;
The label setting module is used for outputting confidence values corresponding to the N label data respectively according to the cross attention sequence, and setting target label data in the N label data for the video data according to the confidence values.
In one possible implementation, the video data includes a number a of video frames, a being a positive integer; the feature processing module is used for carrying out feature coding on the video data and the text data through an encoder in the target joint model, and when a video feature sequence corresponding to the video data and a text data feature corresponding to the text data are obtained, the feature processing module is specifically used for executing the following operations:
Generating R video blocks based on A video frames in video data, projecting the R video blocks in the video data through an encoder in a target joint model to obtain image projection features corresponding to the R video blocks respectively, and adding video position information to the R image projection features to obtain R image update features; video position information refers to information of the position of a video block in video data; r is a positive integer; the video block is composed of one or more video frames of the A video frames;
performing feature coding on the R image updating features to obtain image data features corresponding to the R video blocks respectively, and generating a video feature sequence corresponding to video data according to the image data features corresponding to the R video blocks respectively;
splitting text data into P data phrases through an encoder in the target joint model; p is a positive integer;
Vector conversion is carried out on the P data word groups to obtain word embedding vectors corresponding to the P data word groups respectively, text position information is added to the P word embedding vectors respectively, and P word updating vectors are obtained; the text position information refers to the information of the position of a data phrase in text data;
And carrying out feature coding on the P word updating vectors to obtain data phrase features corresponding to the P data phrases respectively, and generating text data features corresponding to the text data according to the data phrase features corresponding to the P data phrases respectively.
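As an illustration of such an encoder, the following is a minimal sketch assuming PyTorch; the module names, the flattening of each video block into a single feature vector, and the use of nn.TransformerEncoder in place of the video and text backbones mentioned later in this description are simplifying assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class SimpleVideoTextEncoder(nn.Module):
    """Illustrative encoder: projects R video blocks and P word embeddings,
    adds position information, then feature-codes both with Transformer layers."""
    def __init__(self, block_dim, vocab_size, d_model=256, max_len=512):
        super().__init__()
        self.block_proj = nn.Linear(block_dim, d_model)      # image projection features
        self.word_embed = nn.Embedding(vocab_size, d_model)  # word embedding vectors
        self.video_pos = nn.Embedding(max_len, d_model)      # video position information
        self.text_pos = nn.Embedding(max_len, d_model)       # text position information
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.video_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)

    def forward(self, video_blocks, token_ids):
        # video_blocks: [B, R, block_dim]; token_ids: [B, P]
        r_idx = torch.arange(video_blocks.size(1), device=video_blocks.device)
        p_idx = torch.arange(token_ids.size(1), device=token_ids.device)
        img_update = self.block_proj(video_blocks) + self.video_pos(r_idx)  # R image update features
        word_update = self.word_embed(token_ids) + self.text_pos(p_idx)     # P word update vectors
        video_feature_seq = self.video_encoder(img_update)  # video feature sequence
        text_data_feature = self.text_encoder(word_update)  # text data features
        return video_feature_seq, text_data_feature
```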
In one possible implementation manner, when the cross-attention processing module is configured to obtain tag feature sequences corresponding to N tag data, the cross-attention processing module is specifically configured to perform the following operations:
acquiring N label data, and splitting the N label data respectively to obtain word element sequences corresponding to the N label data respectively; one word element in the word element sequence refers to a minimum basic unit obtained after splitting processing of tag data;
Mapping N word sequences through an encoder in the target joint model to obtain word embedded vector sequences corresponding to the N word sequences respectively;
respectively carrying out weighted average processing on the N word embedded vector sequences to obtain tag data characteristics corresponding to the N tag data respectively;
And splicing the N tag data features to obtain tag feature sequences corresponding to the N tag data.
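A minimal sketch of assembling the tag feature sequence, assuming PyTorch; the whitespace tokenizer, the toy vocabulary, and the uniform averaging used in place of the weighted average are illustrative assumptions.

```python
import torch
import torch.nn as nn

def build_tag_feature_sequence(tag_texts, vocab, embed: nn.Embedding):
    """tag_texts: list of N tag data strings; vocab: token -> id mapping (assumed)."""
    tag_features = []
    for tag in tag_texts:
        token_ids = torch.tensor([vocab[tok] for tok in tag.split()])  # token sequence of one tag
        token_vectors = embed(token_ids)                # word-embedding vector sequence
        tag_features.append(token_vectors.mean(dim=0))  # averaged tag data feature (uniform weights assumed)
    # splice the N tag data features into one tag feature sequence: [N, embedding_dim]
    return torch.stack(tag_features, dim=0)

# usage sketch
vocab = {"food": 0, "production": 1, "fitness": 2, "course": 3}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=256)
tag_feature_seq = build_tag_feature_sequence(["food production", "fitness course"], vocab, embed)
print(tag_feature_seq.shape)  # torch.Size([2, 256])
```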
In one possible implementation manner, the cross-attention processing module is configured to perform cross-attention processing on the composite feature sequence and the tag feature sequence by using a decoder, and when the cross-attention sequence corresponding to the tag feature sequence is obtained, the cross-attention processing module is specifically configured to perform the following operations:
In the decoder, the composite feature sequence and the label feature sequence are subjected to cross attention processing to obtain initial cross attention features;
Normalizing the initial cross attention feature to obtain a cross normalized feature;
And carrying out forward propagation processing on the cross normalized features to obtain cross forward features, fusing the cross normalized features and the cross forward features to obtain cross fusion features, and carrying out normalization processing on the cross fusion features to obtain cross attention sequences corresponding to the tag feature sequences.
In one possible implementation, the decoder includes M decoding subcomponents, where M is a positive integer, and the M decoding subcomponents include a decoding subcomponent S_i, where i is a positive integer less than or equal to M; the cross attention processing module is used for carrying out cross attention processing on the composite feature sequence and the tag feature sequence through the decoder, and when the cross attention sequence corresponding to the tag feature sequence is obtained, the cross attention processing module is specifically used for executing the following operations:
The iteration variable feature and the composite feature sequence corresponding to the decoding subcomponent S_i are used as the iteration input feature of the decoding subcomponent S_i, and cross attention processing is performed on the iteration input feature in the decoding subcomponent S_i to obtain the iteration output feature of the decoding subcomponent S_i; if the decoding subcomponent S_i is the first of the M decoding subcomponents, the iteration variable feature corresponding to the decoding subcomponent S_i is the tag feature sequence;
The iteration output feature of the decoding subcomponent S_i is used as the iteration variable feature corresponding to the decoding subcomponent S_{i+1}; the iteration variable feature and the composite feature sequence corresponding to the decoding subcomponent S_{i+1} are used as the iteration input feature of the decoding subcomponent S_{i+1}, and cross attention processing is performed on the iteration input feature in the decoding subcomponent S_{i+1} to obtain the iteration output feature of the decoding subcomponent S_{i+1}, until the iteration output feature of the last decoding subcomponent among the M decoding subcomponents is obtained, and the iteration output feature of the last decoding subcomponent is determined as the cross attention sequence corresponding to the tag feature sequence; the decoding subcomponent S_{i+1} is the next decoding subcomponent after the decoding subcomponent S_i.
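The iteration over the M decoding subcomponents can be sketched as follows, assuming PyTorch and a subcomponent module whose internals follow the cross attention and normalization operations described in the implementations below; this is an illustrative sketch rather than the patented decoder.

```python
import copy
import torch.nn as nn

class StackedDecoder(nn.Module):
    """Chains M decoding subcomponents: the iteration output feature of S_i becomes the
    iteration variable feature of S_{i+1}; every subcomponent also receives the composite
    feature sequence."""
    def __init__(self, subcomponent: nn.Module, num_layers: int):
        super().__init__()
        self.blocks = nn.ModuleList([copy.deepcopy(subcomponent) for _ in range(num_layers)])

    def forward(self, tag_feature_seq, composite_feature_seq):
        iter_var = tag_feature_seq                       # S_1 starts from the tag feature sequence
        for block in self.blocks:                        # S_1 ... S_M
            iter_var = block(iter_var, composite_feature_seq)
        return iter_var                                  # output of S_M = cross-attention sequence
```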
In one possible implementation manner, the cross-attention processing module is configured to use the iteration variable feature and the composite feature sequence corresponding to the decoding subcomponent S_i as the iteration input feature of the decoding subcomponent S_i, and perform cross-attention processing on the iteration input feature of the decoding subcomponent S_i in the decoding subcomponent S_i, so as to obtain the iteration output feature of the decoding subcomponent S_i, where the cross-attention processing module is specifically configured to perform the following operations:
In the decoding subcomponent S_i, performing cross attention processing on the iteration variable feature and the composite feature sequence to obtain an initial attention feature;
Normalizing the initial attention feature to obtain a normalized feature;
And carrying out forward propagation processing on the normalized features to obtain forward propagation features, fusing the normalized features and the forward propagation features to obtain target fusion features, and carrying out normalization processing on the target fusion features to obtain the iteration output feature of the decoding subcomponent S_i.
In one possible implementation manner, the cross-attention processing module is configured to perform cross-attention processing on the iteration variable feature and the composite feature sequence in the decoding subcomponent S_i, and when obtaining the initial attention feature, the cross-attention processing module is specifically configured to perform the following operations:
In the decoding subcomponent S_i, the iteration variable feature is taken as the query vector in the cross-attention function, the composite feature sequence is taken as the key vector in the cross-attention function, and the composite feature sequence is taken as the value vector in the cross-attention function;
determining a product of the query vector and the key vector as a first fusion feature by a cross-attention function;
Acquiring the dimension number corresponding to the composite feature sequence, and determining the product of the first fusion feature and the reciprocal of the dimension number as a second fusion feature;
converting the second fusion feature to a first activation feature based on an activation sub-function in the cross-attention function;
the product of the first activation feature and the value vector is determined as a third fusion feature and the sum of the third fusion feature and the query vector is determined as an initial attention feature.
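A minimal sketch of this cross-attention computation, assuming PyTorch; softmax is assumed as the activation sub-function, and the scaling follows the wording above (the reciprocal of the dimension number d), whereas standard scaled dot-product attention divides by the square root of d.

```python
import torch

def cross_attention(query, key_value, dim_d):
    """query: tag/iteration features [N, d]; key_value: composite feature sequence [T, d]."""
    q, k, v = query, key_value, key_value          # Q from the labels, K and V from the composite sequence
    first_fusion = q @ k.transpose(-2, -1)         # product of query vector and key vector
    second_fusion = first_fusion * (1.0 / dim_d)   # scaled by the reciprocal of the dimension number
    first_activation = torch.softmax(second_fusion, dim=-1)  # activation sub-function (softmax assumed)
    third_fusion = first_activation @ v            # product of activation feature and value vector
    return third_fusion + q                        # sum with the query vector = initial attention feature

# usage sketch
d = 256
labels = torch.randn(5, d)        # 5 tag data features (query)
composite = torch.randn(7, d)     # composite feature sequence (key and value)
initial_attention = cross_attention(labels, composite, d)
print(initial_attention.shape)    # torch.Size([5, 256])
```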
In one possible implementation manner, the cross-attention processing module is configured to normalize the initial attention feature, and when the normalized feature is obtained, the cross-attention processing module is specifically configured to perform the following operations:
acquiring a characteristic value in the initial attention characteristic, and determining the mean value and the variance of the characteristic value;
Determining a difference between the initial attention feature and the mean as an offset feature;
acquiring a normalization constant, and taking the square root of the sum of the variance and the normalization constant to obtain a scaling value;
The product between the offset feature and the reciprocal of the scaling value is determined as the normalized feature.
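This normalization reads like a standard layer normalization; a minimal sketch assuming PyTorch, with an illustrative value for the normalization constant:

```python
import torch

def normalize(feature, eps=1e-6):
    """Normalizes the initial attention feature: (x - mean) / sqrt(variance + constant)."""
    mean = feature.mean(dim=-1, keepdim=True)                  # mean of the feature values
    var = feature.var(dim=-1, unbiased=False, keepdim=True)    # variance of the feature values
    offset = feature - mean                                    # offset feature
    scale = torch.sqrt(var + eps)                              # scaling value
    return offset * (1.0 / scale)                              # offset feature times reciprocal of scaling value

x = torch.randn(5, 256)
print(normalize(x).mean().item(), normalize(x).std().item())  # approximately 0 and 1
```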
In one possible implementation manner, the target joint model further includes a full connection layer, and the tag setting module is configured to output confidence values corresponding to the N tag data respectively according to the cross attention sequence, and when the target tag data in the N tag data is set for the video data according to the confidence values, the tag setting module is specifically configured to perform the following operations:
Performing confidence normalization processing on the cross attention sequence based on an activation function of the full connection layer to obtain confidence values corresponding to the N tag data respectively;
And determining the confidence values which are greater than or equal to a confidence threshold among the N confidence values as target confidence values, and setting the tag data corresponding to the target confidence values in association with the video data.
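A minimal sketch of the confidence readout and label assignment, assuming PyTorch; the single-output fully connected layer, the sigmoid activation, the 0.5 threshold and the example tag names are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, num_tags = 256, 5
fc = nn.Linear(d_model, 1)                      # full connection layer: one score per tag position
cross_attention_seq = torch.randn(num_tags, d_model)

logits = fc(cross_attention_seq).squeeze(-1)    # one logit per tag data
confidence = torch.sigmoid(logits)              # confidence normalization (sigmoid assumed)

threshold = 0.5
tag_names = ["game news", "cultural news", "sports news", "music news", "movie news"]
target_tags = [name for name, c in zip(tag_names, confidence.tolist()) if c >= threshold]
print(target_tags)  # tag data whose confidence value reaches the threshold
```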
In one possible implementation, the video data includes A video frames, the number of target tag data is H, A and H are both positive integers, and the target tag data includes tag data M_j, where j is a positive integer less than or equal to H; the data processing device further comprises a target video block determining module, wherein the target video block determining module is specifically used for executing the following operations:
based on the A video frames in the video data, generating R video blocks, and acquiring a video block feature sequence corresponding to the R video blocks and a target tag feature corresponding to the tag data M_j; R is a positive integer; a video block is composed of one or more video frames of the A video frames;
performing cross attention processing on the video block feature sequence and the target tag feature through the decoder to obtain a video block attention sequence corresponding to the tag data M_j; each cross attention value in the video block attention sequence is used for indicating the association relationship between the tag data M_j and one video block;
And outputting video block confidence values corresponding to the R video blocks respectively according to the video block attention sequence, acquiring target video blocks associated with the tag data M_j from the R video blocks according to the video block confidence values, and setting the tag data M_j for the target video blocks.
In one aspect, an embodiment of the present application provides another data processing apparatus, including:
the parameter training module is used for inputting the video sample and the text sample associated with the video sample into the initial joint model; the initial joint model includes an initial encoder and an initial decoder;
The feature training module is used for carrying out feature coding on the video sample and the text sample through an initial encoder in the initial joint model to obtain a video sample feature sequence corresponding to the video sample and text sample features corresponding to the text sample, and carrying out fusion processing on the video sample feature sequence and the text sample features to obtain a composite sample feature sequence;
The cross attention training module is used for acquiring tag sample feature sequences corresponding to the W tag samples, and carrying out cross attention processing on the composite sample feature sequences and the tag sample feature sequences through an initial decoder in the initial joint model to obtain cross attention sample sequences corresponding to the tag sample feature sequences; each cross-attention value in the sequence of cross-attention samples is used to indicate a degree of association between a video sample and one label sample; w is a positive integer;
And the parameter adjustment module is used for carrying out parameter adjustment on an initial encoder and an initial decoder in the initial joint model according to the cross attention sample sequence to obtain a target joint model, wherein the target joint model is used for carrying out multi-label prediction of the video.
In one possible implementation, the initial encoder includes a visual feature extraction component and a text feature extraction component; the parameter adjustment module is used for performing parameter adjustment on an initial encoder and an initial decoder in the initial joint model according to the cross attention sample sequence, and when the target joint model is obtained, the parameter adjustment module is specifically used for performing the following operations:
Acquiring standard video features corresponding to the video samples, and carrying out parameter adjustment on the visual feature extraction component according to the standard video features and the video sample feature sequences;
Acquiring standard text features corresponding to the text samples, and performing parameter adjustment on the text feature extraction component according to the standard text features and the text sample features;
Acquiring a sample type label corresponding to a video sample, and generating a label loss value according to the sample type label and a cross attention sample sequence;
and carrying out parameter adjustment on an initial decoder in the initial joint model according to the label loss value to obtain a target joint model.
In one possible implementation manner, the parameter adjustment module is configured to, when generating the tag loss value according to the sample class tag and the cross-attention sample sequence, specifically perform the following operations:
outputting sample confidence values corresponding to the W label samples respectively according to the cross attention sample sequence;
carrying out logarithmic processing on the W sample confidence values respectively to obtain W confidence logarithm values;
Based on the sample category labels and the sample confidence values, determining label determination values corresponding to the W label samples respectively; the label determination value comprises a first value or a second value, wherein the first value is used for indicating that a label identical to the label sample corresponding to the first value exists in the sample category labels, and the second value is used for indicating that no label identical to the label sample corresponding to the second value exists in the sample category labels;
And generating a label loss value according to the label determination values corresponding to the W label samples and the W confidence logarithm values.
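The described loss resembles a cross-entropy assembled from the label determination values and the confidence logarithm values; a minimal sketch assuming PyTorch, in which the negative sign and the averaging over W are assumptions, and a practical implementation would typically also include a log(1 - confidence) term for negative labels.

```python
import torch

def label_loss(sample_confidence, sample_category_labels, label_samples):
    """sample_confidence: [W] confidence values; sample_category_labels: set of ground-truth tag strings;
    label_samples: list of the W label sample strings."""
    log_conf = torch.log(sample_confidence.clamp_min(1e-8))   # W confidence logarithm values
    determination = torch.tensor(                             # 1 if the label sample appears in the category labels, else 0
        [1.0 if tag in sample_category_labels else 0.0 for tag in label_samples])
    return -(determination * log_conf).mean()                 # label loss value (sign and mean are assumed)

conf = torch.tensor([0.82, 0.02, 0.65])
loss = label_loss(conf, {"game news"}, ["game news", "cultural news", "sports news"])
print(loss.item())
```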
In one aspect, the embodiment of the application provides a computer device, which comprises a processor, a memory and an input/output interface;
The processor is respectively connected with the memory and the input/output interface, wherein the input/output interface is used for receiving data and outputting data, the memory is used for storing a computer program, and the processor is used for calling the computer program so as to enable the computer device containing the processor to execute the method in the aspect of the embodiment of the application.
An aspect of an embodiment of the present application provides a computer-readable storage medium storing a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method in the aspect of an embodiment of the present application.
In one aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the methods provided in the various alternatives in an aspect of the embodiments of the application. In other words, the computer instructions, when executed by a processor, implement the methods provided in the various alternatives in one aspect of the embodiments of the present application.
The implementation of the embodiment of the application has the following beneficial effects:
In the embodiment of the application, video data and text data associated with the video data are input into a target joint model; the target joint model includes an encoder and a decoder; feature coding is performed on the video data and the text data through the encoder in the target joint model to obtain a video feature sequence corresponding to the video data and text data features corresponding to the text data, and the video feature sequence and the text data features are fused to obtain a composite feature sequence; tag feature sequences corresponding to N tag data are acquired, and cross attention processing is performed on the composite feature sequence and the tag feature sequence through the decoder to obtain a cross attention sequence corresponding to the tag feature sequence; each cross attention value in the cross attention sequence is used for indicating the association relationship between the video data and one tag data; N is a positive integer; confidence values respectively corresponding to the N tag data are output according to the cross attention sequence, and target tag data among the N tag data are set for the video data according to the confidence values. The application makes full use of the feature extraction and fusion mechanism, so that the video data can be better represented by the composite feature sequence; that is, the features of the video data can be mined more comprehensively and finely through the composite feature sequence, and the tag data associated with the composite feature sequence can be identified more comprehensively and accurately. In addition, the application makes full use of a cross attention mechanism, so that even when facing a complex composite feature sequence and tag feature sequence, the target joint model can still adaptively extract the associated detail features in them, thereby accurately mapping the association relationship between the video data and different tag data, effectively improving the identification accuracy of the target joint model for multi-label prediction and classification, better coping with the multi-label prediction problem under complex input data conditions, and improving the recall rate and stability of multi-label prediction. Moreover, in the process of performing multi-label prediction on video data through the target joint model, no manual label marking is required, which saves label marking cost and improves label marking efficiency.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of a network interaction architecture provided by an embodiment of the present application;
Fig. 2 is a schematic view of a scenario of a data processing method according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for data processing according to an embodiment of the present application;
FIG. 4 is a second flowchart of a data processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a target joint model according to an embodiment of the present application;
FIG. 6 is a third flowchart of a data processing method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of model training provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a second data processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
If data of an object (such as a user) needs to be collected in the application, a prompt interface or a popup window is displayed before and during the collection to prompt the user that certain data is currently being collected, and the relevant data acquisition step is started only after the confirmation operation of the user on the prompt interface or the popup window is obtained; otherwise, the process is ended. The acquired user data is used only in reasonable and legal scenarios, applications, and the like. Optionally, in some scenarios where user data is required but the user has not granted authorization, authorization may be requested from the user, and the user data is used only after the authorization passes.
It will be appreciated that in the specific embodiments of the present application, the user data involved, when the following embodiments of the present application are applied to specific products or technologies, will require user approval or consent, and the collection, use and processing of the relevant data will require compliance with relevant laws and regulations and standards in the relevant areas.
The application can relate to a machine learning technology in the field of artificial intelligence, and training, use and the like of a model are realized through the machine learning technology.
Wherein artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique, and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of sensing, reasoning and decision-making.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and other directions.
Machine learning (Machine Learning, ML) is a multi-domain interdisciplinary subject, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It specifically studies how a computer simulates or implements the learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and a fundamental approach to giving computers intelligence, and it is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, and teaching learning. For example, in the application, the mode feature analysis model, the data analysis model and the like corresponding to each data mode are trained and used, so that the model continuously learns new knowledge or skills and a trained model is obtained for data analysis. For example, the method and the device analyze and learn the association relationship between the composite feature sequence and the tag data features so as to obtain a trained target joint model, so that the target joint model can be used for performing multi-label prediction on video data.
With research and progress of artificial intelligence technology, artificial intelligence technology is being researched and applied in various fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, robotics, smart medical care, smart customer service, the internet of vehicles and smart transportation, and it is believed that with the development of technology, artificial intelligence technology will be applied in more fields and will become increasingly important.
Among them, Bidirectional Encoder Representations from Transformers (BERT), based on the self-attention model (Transformer), is a pre-training model built on the Transformer architecture and plays an important role in the field of natural language processing. Through the bidirectional encoder, a contextual representation can be learned from text, so that not only the meaning of a word itself can be understood, but the relationship of the word to other words in the sentence can also be captured.
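As an illustration of obtaining such contextual representations, a minimal sketch using the Hugging Face transformers library; the checkpoint name and input sentence are examples, not ones prescribed by the application.

```python
# pip install transformers torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("a news headline associated with the video", return_tensors="pt")
outputs = model(**inputs)
# Each token receives a context-dependent vector reflecting its relation to the other tokens.
print(outputs.last_hidden_state.shape)  # [1, sequence_length, 768]
```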
In the embodiment of the present application, please refer to fig. 1, fig. 1 is a network interaction architecture diagram provided in the embodiment of the present application, as shown in fig. 1, a computer device 101 may be a background server corresponding to video application software in a service device, and the computer device 101 may obtain video data, text data associated with the video data, and tag data from the service device, may also obtain video data, text data associated with the video data, and tag data from a memory space of the computer device 101, or may obtain video data, text data associated with the video data, and tag data from the service device and a memory space of the computer device 101 at the same time. wherein the text data may be a title associated with the video data or other descriptive content provided for the video data. The tag data may be total tag data obtained by traversing the video database by the computer device and performing tag summarization on all tags corresponding to all videos in the video database. The number of service devices is one or more, and in fig. 1, the service devices may include a service device 102a, a service device 102b, a service device 102c, and so on. Wherein, each service device can perform data interaction, or each service device can perform data interaction through the computer device 101. The computer device 101 inputs the acquired video data, text data associated with the video data, and tag data into a target joint model, wherein the target joint model comprises an encoder and a decoder, the computer device 101 processes the video data and the text data based on the encoder, acquires a corresponding video feature sequence and text data features, acquires tag feature sequences corresponding to N tag data, and N is a positive integer, wherein the tag feature sequences comprise tag data features corresponding to the N tag data respectively. The target joint model is a deep learning model for performing multi-label prediction of video, the computer device 101 may take video data, text data associated with the video data, and N label data as input data of the target joint model, analyze the input data through the target joint model to obtain a confidence value of an association relationship between the video data and the N label data, determine target label data from the N label data based on the confidence value, thereby setting the target label data for the video data, and send the video data to one or more service devices. The confidence value is used for indicating the matching degree of the tag data and the video data, the higher the confidence value is, the more the tag data is matched with the video data, and the tag data can be considered as target tag data corresponding to the video data, namely, the more the tag data is credible. The target tag data is one or more tag data matched with the video data, i.e., one video data may correspond to a plurality of tag data. When a user corresponding to the service equipment enters video application software in the service equipment to conduct video browsing, tag data interested by the user can be determined based on historical browsing data of the corresponding user, and when the tag data interested by the user comprises target tag data, the service equipment can push the video data with the target tag data set to the corresponding user.
Through the above process, the target joint model is fully utilized to output the confidence values between the tag data and the video data, and the tags of the video data are determined and set based on the target joint model. In the process of performing multi-label prediction on video data through the target joint model, no manual label marking is required, which saves label marking cost and improves label marking efficiency. By acquiring the tag data, when multi-label prediction is performed, comparison processing is carried out directly between the features of the video (including the video feature sequence and the text data features) and the features of different tag data, so that tags can co-occur when the video is classified with multiple labels, as many target tag data as possible can be recalled, and the recall rate and stability of multi-label prediction are improved.
Specifically, referring to fig. 2, fig. 2 is a schematic view of a scenario of a data processing method according to an embodiment of the present application. As shown in fig. 2, in the embodiment of the present application, the video data and the text data may come from the service device 102a, for example, a news application is taken as an example, the video data may be a news video in the news application, and the text data may be a news headline corresponding to the news video. In fig. 2, the computer device 101 may acquire video data and text data in the service device 102a, or may acquire video data and text data in other service devices, where the video data and text data in each service device may be synchronized by a server corresponding to a service application in the service device or network communication, for example, a news video in a news application may be synchronized by a server corresponding to the news application, so as to ensure consistency of news video content in each service device. Taking video data and text data in service equipment 102a as an example, computer equipment 101 respectively performs feature coding on the video data and the text data through an encoder in a target joint model to obtain a video feature sequence corresponding to the video data and text data features corresponding to the text data, and then performs fusion processing on the video feature sequence and the text data features to obtain a composite feature sequence. The computer device 101 obtains N tag data, where the tag data may be a total tag data from a video database summarized by the computer device, for example, in a video database corresponding to a news-type video application, the total tag data may be "international news, domestic news, economic news, scientific news, entertainment news, sports news, social news, environmental news, educational news, health news, travel news, cultural news, fashion news, automotive news, house news, game news, movie news, music news", and the like. The computer equipment performs feature coding on N tag data through an encoder in the target joint model to obtain a tag feature sequence corresponding to the N tag data, wherein the tag feature sequence comprises tag data features corresponding to the N tag data respectively, and further, the computer equipment 101 inputs the composite feature sequence and the tag feature sequence into a decoder of the target joint model, and performs cross attention processing on the composite feature sequence and the tag feature sequence through the decoder to obtain a cross attention sequence corresponding to the tag feature sequence; each cross attention value in the cross attention sequence is used for indicating the association relationship between video data and one tag data; N is a positive integer. The computer device 101 outputs confidence values [ tag 1 (confidence value 1), tag 2 (confidence value 2), …, tag N (confidence value N) ] corresponding to the N tag data, respectively, according to the cross-attention sequence and the N tag data, determines target tag data from the N tag data according to the N confidence values, and sets the target tag data for the video data. 
The confidence value is used to indicate the association relationship or matching degree between tag data and the video data. For example, if there are two tag data with corresponding confidence values [game news (0.82), cultural news (0.02)], the tag data "game news" can be considered to be highly matched with the video data, or to have a strong association relationship with it, and "game news" can be set as target tag data of the video data. The computer device may set a confidence threshold and determine the tag data corresponding to confidence values greater than or equal to the confidence threshold as target tag data; that is, the video data may match one or more tag data as target tag data. Further, the computer device may send the video data with the target tag data added to the service device 102a and other service devices. For example, when the video application is a news application and a user corresponding to the service device 102a enters the news application, the service device 102a may determine tag data that the user is interested in based on the historical browsing data of the user, and when such tag data exists in the target tag data, push the video data to the user for browsing.
Through the above process, the feature extraction and fusion mechanism is fully utilized, so that the video data can be better represented by the composite feature sequence; that is, the features of the video data can be mined more comprehensively and finely through the composite feature sequence, and the tag data associated with the composite feature sequence can be identified more comprehensively and accurately. In addition, the application makes full use of a cross attention mechanism, so that even when facing a complex composite feature sequence and tag feature sequence, the target joint model can adaptively extract the associated detail features in them, thereby accurately mapping the association relationship between the video data and different tag data, effectively improving the identification accuracy of the target joint model for multi-label prediction and classification, recalling as many target tag data as possible, better coping with the multi-label prediction problem under complex input data conditions, and improving the recall rate and stability of multi-label prediction. In addition, in the process of performing multi-label prediction on video data through the target joint model, no manual label marking is required, which saves label marking cost and improves label marking efficiency. Meanwhile, by setting target tag data for the video data, video content corresponding to designated tags that closely fit the interest points of video application users can be provided for those users, which can improve user experience and enhance user stickiness.
It will be understood that the service device mentioned in the embodiment of the present application may also be a computer device, where the computer device in the embodiment of the present application includes, but is not limited to, a terminal device or a server. In other words, the computer device may be a server or a terminal device, or may be a system formed by the server and the terminal device. The above-mentioned terminal device may be an electronic device, including but not limited to a mobile phone, a tablet computer, a desktop computer, a notebook computer, a palm computer, a vehicle-mounted device, an Augmented Reality/Virtual Reality (AR/VR) device, a head-mounted display, a smart TV, a wearable device, a digital camera, a camera, and other Mobile Internet Devices (MID) with network access capability, or a terminal device in a scene such as a train, a ship, or a flight. As shown in fig. 1, the terminal device may be a notebook (as shown by service device 102b), a tablet (as shown by service device 102c), a mobile phone (as shown by service device 102a), etc.; fig. 1 illustrates only a part of the devices, and alternatively, the service device 102a refers to a device located in the vehicle 103. The servers mentioned above may be independent physical servers, or may be server clusters or distributed systems formed by a plurality of physical servers, or may be cloud servers that provide cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, vehicle-road collaboration, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
Optionally, the data related to the embodiment of the present application may be stored in a computer device, or may be stored based on a cloud storage technology or a blockchain network, which is not limited herein.
Further, referring to fig. 3, fig. 3 is a flowchart illustrating a method for data processing according to an embodiment of the present application. As shown in fig. 3, for describing a data processing procedure including the steps of:
Step S301, inputting video data and text data associated with the video data into a target joint model; the target joint model includes an encoder and a decoder.
In an embodiment of the present application, a computer device may acquire video data and text data associated with the video data, where the video data may include video frames, and the video frames may be images that the computer device samples and decimates from video source data indicated by the video data based on a frame-extraction period. For example, the frame extraction period may be 1 second(s), and the computer device may extract 50 frames of images per second as 50 video frames in the video source data, combining one or more video frames exhibiting the same image characteristics into one video block. The text data associated with the video data may be text information added by the service device for the video source data corresponding to the video data, where the text information may be added when the service device publishes the video source data. For example, the text information may be a generalized title describing the video content, or other sentence describing the video content. The computer device inputs the video block and text data into a target joint model, which is a deep learning model for multi-label prediction of video, the target joint model comprising an encoder and a decoder.
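A minimal sketch of such periodic frame extraction, assuming OpenCV; the sampling rate, helper name and file path are illustrative.

```python
import cv2

def extract_frames(video_path, frames_per_second=50):
    """Samples video frames from the video source data at the given rate."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or frames_per_second
    step = max(int(round(native_fps / frames_per_second)), 1)  # keep every `step`-th frame
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)  # one sampled video frame
        index += 1
    cap.release()
    return frames

video_frames = extract_frames("example_video.mp4")  # hypothetical path
```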
Step S302, feature encoding is carried out on the video data and the text data through an encoder in the target joint model, a video feature sequence corresponding to the video data and text data features corresponding to the text data are obtained, and fusion processing is carried out on the video feature sequence and the text data features, so that a composite feature sequence is obtained.
In the embodiment of the application, the computer equipment performs feature coding on video blocks in video data through an encoder in a target joint model, extracts feature information in the video blocks, and obtains a video feature sequence corresponding to the video data, wherein the video feature sequence comprises video block features respectively corresponding to all the video blocks. And carrying out feature coding on the text data through an encoder in the target joint model, extracting feature information in the text data, and obtaining text data features corresponding to the text data. When the computer device performs feature encoding on the video data and the text data, the base network of the encoder used can be different base networks, for example, the computer device can use a three-dimensional shift window attention network (3D Swin Transformer) as the base network of the encoder to perform feature encoding on the video data to obtain a video feature sequence; the computer device may use BERT as the underlying network for the encoder to feature encode the text data to obtain text data features. Further, the computer device may perform feature fusion on the obtained video feature sequence and the text data feature to obtain a composite feature sequence, where one possible feature fusion manner may be referred to in formula ①:
F_fuse = concat(F_v, F_t) ∈ R^(B×T'×D)   ①
As shown in formula ①, F_fuse is used for indicating the composite feature sequence, F_v is used for indicating the video feature sequence, and F_t is used for indicating the text data features; B is used for indicating the batch size, i.e. the number of videos in the video data (for example, B=1 when the video data is one video); T' is used to indicate the sum of the number of video features and text data features (for example, with 3 video features acquired based on the video data and 4 text data features acquired based on the text data, T'=7); and D is the dimension information of the composite feature sequence (for example, D=3 means that each feature in F_fuse has 3 dimensions). The function concat() is used to fuse F_v and F_t, i.e. the two pieces of data are spliced to obtain data of a higher dimension; for example, if F_v is represented as the vector (3, 2) and F_t is represented as the vector (4, 1, 6), then the F_fuse obtained after feature fusion of F_v and F_t can be expressed as [(3, 2), (4, 1, 6)].
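A minimal sketch of the fusion in formula ①, assuming PyTorch tensors; the feature dimension and sequence lengths are illustrative.

```python
import torch

B, D = 1, 256                       # batch size and feature dimension (illustrative values)
video_feats = torch.randn(B, 3, D)  # video feature sequence: 3 video block features
text_feats = torch.randn(B, 4, D)   # text data features: 4 phrase features

# Concatenate along the sequence axis: T' = 3 + 4 = 7
composite = torch.cat([video_feats, text_feats], dim=1)
print(composite.shape)              # torch.Size([1, 7, 256])
```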
Step S303, obtaining tag feature sequences corresponding to N tag data, and performing cross attention processing on the composite feature sequences and the tag feature sequences through a decoder to obtain cross attention sequences corresponding to the tag feature sequences; each cross attention value in the cross attention sequence is used for indicating the association relationship between video data and one tag data; n is a positive integer.
In the embodiment of the application, the computer device can acquire N tag data, and the N tag data are full tag data, namely, all tag data acquired by the computer device in the video database based on all video data. For example, in a video database corresponding to a short video application, the full-scale tag data acquired by the computer device may include "food production, fitness course, travel strategy, fashion wear, music singing, fun piece, hand production, technological front, pet daily, parent-child interaction", etc. The computer equipment performs feature coding on N label data based on an encoder in the target joint model to obtain label data features corresponding to the N label data respectively, and splices the N label data features to obtain a label feature sequence, namely the label feature sequence comprises the label data features corresponding to the N label data respectively. The computer device may use the same encoder as the encoder used when performing feature encoding on the text data in step S302, that is, the BERT is used as a base network of the encoder to perform feature encoding on N tag data, so as to obtain tag data features; alternatively, the computer device may use an encoder different from the encoder used in the feature encoding of the text data in step S302, for example, the computer device may further perform feature encoding on N tag data by using a convolutional neural network (Convolutional Neural Network, CNN), a Long Short-Term Memory (LSTM), or the like as a base network of the encoder, to obtain tag data features.
Further, the computer device may input the composite feature sequence and the tag feature sequence into a decoder in the target joint model, and perform cross attention processing through the decoder with the composite feature sequence as the key vector and the value vector in the cross attention function and the tag feature sequence as the query vector, obtain the cross attention values between the composite feature sequence and the tag feature sequence, and determine the cross attention values as the cross attention sequence corresponding to the tag feature sequence. Specifically, the computer device may perform cross attention processing in the decoder using the composite feature sequence as the key vector and the value vector in the cross attention function and the tag feature sequence as the query vector to obtain an initial cross attention value, and determine the sum of the initial cross attention value and the tag feature sequence as an initial cross-attention feature; normalize the initial cross-attention feature to obtain a cross normalized feature; and perform forward propagation processing on the cross normalized feature to obtain a cross forward feature, fuse the cross normalized feature and the cross forward feature to obtain a cross fusion feature, and normalize the cross fusion feature to obtain the cross attention sequence corresponding to the tag feature sequence. Each cross attention value in the cross attention sequence is used to indicate an association between the video data and one tag data. For example, the cross attention sequence may be denoted as "[2.71, 3.61, 3.32, 4.29, 1.61]", that is, 5 cross attention values are used to indicate the association relationships between the video data and 5 tag data, and the larger the cross attention value, the stronger the match between the corresponding tag data and the video data.
Specifically, in the decoder, the computer device uses the tag feature sequence as the query vector, the composite feature sequence as the key vector in the cross-attention function, and the composite feature sequence as the value vector in the cross-attention function. The product of the query vector and the key vector is determined as a first fusion feature by the cross-attention function; the dimension number corresponding to the composite feature sequence is acquired, and the product of the first fusion feature and the reciprocal of the dimension number is determined as a second fusion feature. The computer device may convert the second fusion feature into a first activation feature based on the activation sub-function in the cross-attention function; the product of the first activation feature and the value vector is determined as a third fusion feature, and the sum of the third fusion feature and the query vector is determined as the initial cross-attention feature.
Further, the computer device obtains the feature values in the initial cross-attention feature and determines the mean and the variance of these feature values; determines the difference between the initial cross-attention feature and the mean of its feature values as an offset feature; acquires a normalization constant, and takes the square root of the sum of the variance of the feature values and the normalization constant to obtain a scaling value; and determines the product of the offset feature and the reciprocal of the scaling value as the cross normalized feature. The computer device then performs forward propagation processing on the cross normalized feature to obtain a cross forward feature, fuses the cross normalized feature and the cross forward feature to obtain a cross fusion feature, and normalizes the cross fusion feature to obtain the cross attention sequence corresponding to the tag feature sequence.
Step S304, the confidence values corresponding to the N label data are output according to the cross attention sequence, and the target label data in the N label data are set for the video data according to the confidence values.
In the embodiment of the present application, the target joint model further includes a full connection layer, the full connection layer is used for outputting confidence values, the computer device outputs confidence values corresponding to the N tag data respectively according to the cross attention sequence, and a specific implementation process for setting the target tag data in the N tag data for the video data according to the confidence values may be: performing confidence degree normalization processing on the cross attention sequence based on an activation function of the full connection layer to obtain confidence degree values corresponding to the N label data respectively; and determining the confidence coefficient value which is larger than or equal to the confidence coefficient threshold value in the N confidence coefficient values as a target confidence coefficient value, and performing association setting on the tag data corresponding to the target confidence coefficient value and the video data.
Specifically, the computer device performs a confidence normalization process on the cross-attention sequence based on the activation function of the full connection layer, converts the cross-attention sequence into a probability distribution, that is, converts each cross-attention value in the cross-attention sequence into a confidence value between 0 and 1, and the sum of all confidence values after conversion is 1, where one possible activation function may be seen in formula ②:
p_i = exp(z_i) / Σ_{j=1..N} exp(z_j)  ②

As shown in formula ②, z_i is used for indicating the i-th cross-attention value in the cross-attention sequence; exp() is the natural exponential function, i.e. the natural constant e (approximately equal to 2.71828) raised to the power z_i, giving the exponential score of the natural exponential function at z_i; the denominator indicates that the exponential scores of every cross-attention value in the cross-attention sequence are summed. For example, the confidence values obtained by performing confidence normalization processing on the cross-attention sequence "[3.12, 2.85, 4.21, 3.67, 2.98]" may be: "[0.185, 0.169, 0.25, 0.218, 0.177]".
After obtaining the confidence values respectively corresponding to the N tag data, the computer device determines the confidence values that are greater than or equal to the confidence threshold among the N confidence values as target confidence values. The confidence threshold may, for example, be set to 0.5, so that a confidence value of at least 0.5 is determined as a target confidence value; the association relationship between the tag data corresponding to a target confidence value and the video data can be considered tight, that is, the tag data is highly associated with the video data, and the tag data corresponding to the target confidence value can be set in association with the video data.
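As an illustrative sketch only (the tensor values and the threshold handling are assumptions), the confidence normalization of formula ② followed by thresholding can be written as:

```python
import torch

# Assumed cross-attention sequence over N = 5 tag data (values from the example above).
cross_attention = torch.tensor([3.12, 2.85, 4.21, 3.67, 2.98])

# Formula ②: softmax over the N cross-attention values yields confidence values that sum to 1.
confidence = torch.softmax(cross_attention, dim=0)

# Tag data whose confidence reaches the threshold are associated with the video data.
confidence_threshold = 0.5  # example threshold from the description above
target_tag_indices = (confidence >= confidence_threshold).nonzero(as_tuple=True)[0]
```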
Through the above process, video frames are obtained by sampling and frame extraction from the video data; since adjacent frames within one second often differ little, extracting a subset of images to form video blocks for video feature extraction reduces the consumption of computing resources without affecting the recognition effect of the video features. The feature extraction method is then fully utilized to determine the features respectively corresponding to the video blocks, the text data associated with the video data, and the N tag data. The fusion mechanism is fully utilized to fuse the video feature sequence and the text data features into the composite feature sequence, so that the video data can be better represented: the features of the video data can be mined more comprehensively and finely through the composite feature sequence, and the tag data associated with it can be identified more comprehensively and accurately. The decoder in the deep learning model (the target joint model) is used to perform cross attention processing, so that even when facing a complex composite feature sequence and tag feature sequence, the target joint model can still adaptively extract the associated detail features in them, thereby accurately mapping the association relationships between the video data and different tag data, effectively improving the identification accuracy of the target joint model for multi-label prediction classification, recalling the target tag data as far as possible, and improving the recall rate and stability of multi-label prediction even when the input data is complex. In addition, in the process of multi-label prediction of video data through the target joint model, labels do not need to be marked manually, which saves label marking cost and improves label marking efficiency. Meanwhile, in the target joint model, powerful encoders and decoders such as the Transformer are fully utilized for joint modeling and feature learning, and potential tag association information in the video data can be effectively captured and integrated, thereby comprehensively improving the model's ability to predict various tags.
Further, referring to fig. 4, fig. 4 is a flowchart illustrating a method for data processing according to an embodiment of the present application. As shown in fig. 4, for describing a data processing procedure including the steps of:
step S401, inputting video data and text data associated with the video data into a target joint model; the target joint model includes an encoder and a decoder.
In the embodiment of the present application, the specific implementation process of step S401 may refer to the specific description process in step S301 shown in fig. 3, which is not described herein.
Step S402, feature encoding is carried out on the video data and the text data through an encoder in the target joint model, a video feature sequence corresponding to the video data and text data features corresponding to the text data are obtained, and fusion processing is carried out on the video feature sequence and the text data features, so that a composite feature sequence is obtained.
In the embodiment of the application, the video data comprises A video frames, A is a positive integer, the computer equipment performs feature coding on the video data, and the specific implementation process for obtaining the video feature sequence corresponding to the video data can be as follows: generating R video blocks based on A video frames in video data, projecting the R video blocks in the video data through an encoder in a target joint model to obtain image projection features corresponding to the R video blocks respectively, and adding video position information to the R image projection features to obtain R image update features; video position information refers to information of the position of a video block in video data; r is a positive integer; a video block consists of one or more of a video frames. And carrying out feature coding on the R image updating features to obtain image data features corresponding to the R video blocks respectively, and generating a video feature sequence corresponding to the video data according to the image data features corresponding to the R video blocks respectively.
Specifically, each video frame can be regarded as an image, and the computer device can classify and combine the A video frames in the video data through the encoder in the target joint model to obtain R video blocks, where each of the R video blocks is composed of one or more of the A video frames that represent the same characteristics. For example, if the image content of 3 video frames is a person playing basketball, that is, the relevant elements such as the person and the basketball exist in all 3 video frames, the 3 video frames can be combined into one video block. Further, the computer device performs image segmentation on the R video blocks, and segments the image in each video block into a plurality of image blocks of the same size, where the image blocks corresponding to a video block can cover the whole video block; then linear projection is performed on the image blocks corresponding to each video block to obtain image block vectors, the image block vectors corresponding to each video block are combined to obtain the image projection features respectively corresponding to the R video blocks, and video position information and image position information are respectively added to the R image projection features to obtain R image update features. The video position information refers to information on the position of one video block in the video data, and the image position information refers to information on the position of one image block in a video block. The computer device may perform image segmentation on the image in the video block to obtain image blocks, map the image corresponding to the whole video block into a coordinate system, calculate for each image block its position coordinates in the image corresponding to the whole video block, generate image position information for each image block based on those position coordinates, and connect each piece of image position information with the image block vector of the corresponding image block, so as to add the image position information to the image projection feature. The image position information may be a fixed-length vector in which each element identifies whether the image block is located at a specific position. For example, if the image in the video block is divided into a 2x2 grid of blocks, the length of the position-encoding vector corresponding to the image position information will be 5: the first element indicates which image in the video block the image block belongs to, and the last four elements indicate whether the image block is located in the corresponding row and column of the image. Taking the 2nd image (video frame) in the video block as an example, the image position information of the image block in the upper left corner of the image is [2,1,0,1,0], indicating its first row and first column in the 2nd image in the video block; the image position information of the image block in the upper right corner is [2,1,0,0,1], representing its first row and second column in the 2nd image in the video block; the image position information of the image block in the lower left corner is [2,0,1,1,0], representing its second row and first column in the 2nd image in the video block; and the image position information of the image block in the lower right corner is [2,0,1,0,1], indicating its second row and second column in the 2nd image in the video block.
The video position information may be a time stamp of the video block in the video data, a block number, or other information that can locate the position of the video block; the computer device may connect the video position information with the image projection feature of the corresponding video block so as to add the video position information to the image projection feature. Further, the computer device performs feature encoding on the R image update features (i.e., extracts features for each image block in each video block) until the image data features respectively corresponding to the R video blocks are obtained, and performs feature fusion processing on the image data features respectively corresponding to the R video blocks, so as to generate the video feature sequence corresponding to the video data. The base network used by the encoder may be the 3D Swin Transformer, and the computer device performs feature fusion operations on the image data features respectively corresponding to the R video blocks in different stages of the Swin Transformer. The feature fusion operations may include multi-layer feature fusion, i.e., the features generated by the Swin Transformer in each network layer are fused so that low-level fine-grained features and high-level abstract features are combined and features of different levels are comprehensively considered. The feature fusion operations may include cross-frame feature fusion, i.e., fusion between features at different points in time, capturing timing information in the video data; for example, the computer device may use a cross-frame attention mechanism to enable the Swin Transformer to focus on relevant features of the corresponding video frames at different points in time. The feature fusion operations may include adaptive feature fusion, i.e., the computer device adaptively adjusts according to the characteristics of the particular task and data; feature fusion may be performed, for example, by increasing or decreasing the number of layers of feature fusion, adjusting the weights of feature fusion, or employing a different fusion strategy. The computer device integrates features from different video frames to obtain the video feature sequence.
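A minimal sketch of the image block projection and the position information described above is given below; the patch size, the 2x2 grid, and the way the position vector is attached (simple concatenation) are assumptions made for illustration:

```python
import torch
import torch.nn as nn

# Assumed sizes: each video block holds F frames, each frame split into a 2x2 grid of patches,
# each patch flattened to `patch_dim` values and linearly projected to `embed_dim`.
F, patch_dim, embed_dim = 3, 16 * 16 * 3, 768
patch_projection = nn.Linear(patch_dim, embed_dim)

def encode_video_block(patches: torch.Tensor) -> torch.Tensor:
    """patches: (F, 4, patch_dim) -- 4 image blocks per frame in a 2x2 grid."""
    tokens = patch_projection(patches)                      # image projection features
    positions = []
    for frame_idx in range(patches.shape[0]):
        for row in range(2):
            for col in range(2):
                # Position vector in the style described above:
                # [frame index, row-1 flag, row-2 flag, col-1 flag, col-2 flag]
                positions.append([frame_idx + 1, 1 - row, row, 1 - col, col])
    pos = torch.tensor(positions, dtype=tokens.dtype).reshape(F, 4, 5)
    return torch.cat([tokens, pos], dim=-1)                 # image update features

block = torch.randn(F, 4, patch_dim)
print(encode_video_block(block).shape)                      # torch.Size([3, 4, 773])
```

For the upper-left image block of the 2nd frame, the generated position vector is [2, 1, 0, 1, 0], matching the example above.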
The specific implementation process by which the computer device performs feature encoding on the text data to obtain the text data features corresponding to the text data may be as follows: splitting the text data into P data phrases through the encoder in the target joint model, where P is a positive integer; performing vector conversion on the P data phrases to obtain word embedding vectors respectively corresponding to the P data phrases, and adding text position information to the P word embedding vectors respectively to obtain P word update vectors, where the text position information is information on the position of one data phrase in the text data; and performing feature encoding on the P word update vectors to obtain data phrase features respectively corresponding to the P data phrases, and generating the text data features corresponding to the text data according to the data phrase features respectively corresponding to the P data phrases.
Specifically, the computer device divides the text data into P data phrases through the encoder in the target joint model according to the words or terms in the text data and their number, and adds special marks at the beginning and the end. For example, for the text data "BERT is a powerful NLP model", the split data phrases may be expressed as {"[CLS]", "BERT", "is", "a", "powerful", "NLP", "model", "[SEP]"}, where the special tag "[CLS]" is used to denote the beginning of the data phrase sequence and the special tag "[SEP]" is added at the end of a sentence to separate different sentences. Further, the computer device performs vector conversion on the P data phrases respectively to obtain word embedding vectors respectively corresponding to the P data phrases, and adds text position information to the P word embedding vectors respectively to obtain P word update vectors. Text position information refers to information on the position of a data phrase in the text data; it may be a position-encoding vector, which is usually fixed or learnable. The computer device inputs the P word update vectors into the base network of the encoder for feature encoding to obtain the data phrase features respectively corresponding to the P data phrases, and performs feature fusion processing on these data phrase features to generate the text data features corresponding to the text data. The base network of the encoder may be BERT, which contains different network layers; a BERT model typically consists of multiple Transformer layers, each containing a self-attention mechanism and a feed-forward neural network. At the different network layers of BERT, the computer device may aggregate the features of each network layer. For example, the hidden states of certain layers may be selected as the final text data features, or the features of different layers may be spliced or weighted-averaged to obtain the final text data features.
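For illustration, a sketch of this text encoding step using the Hugging Face transformers library is shown below; the library, the model checkpoint, and the choice of the last hidden state are assumptions, since the description only requires BERT as the base network of the encoder:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

text = "BERT is a powerful NLP model"
inputs = tokenizer(text, return_tensors="pt")        # adds [CLS] / [SEP] and position ids
with torch.no_grad():
    outputs = bert(**inputs, output_hidden_states=True)

# One option described above: take the last hidden state as the text data features;
# alternatively, several hidden layers could be spliced or weighted-averaged.
text_data_features = outputs.last_hidden_state       # shape: (1, num_tokens, 768)
```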
Further, the specific implementation process of the computer device for performing fusion processing on the video feature sequence and the text data feature to obtain the composite feature sequence may refer to the related description in step S302 shown in fig. 3, which is not described herein.
Step S403, acquiring N tag data, and determining a corresponding tag feature sequence based on the N tag data.
In the embodiment of the application, the computer equipment acquires N label data, and uses a model basic component, such as a word segmentation device (tokenizer), to perform word segmentation processing on the label data, namely respectively splitting the N label data to obtain word element sequences respectively corresponding to the N label data, where one label data corresponds to one word element sequence. A word element in the word element sequence refers to a minimum basic unit obtained by splitting tag data; for example, the minimum basic unit can be a single word. The N word element sequences are mapped through the encoder in the target joint model to obtain word embedded vector sequences corresponding to the N word element sequences. The N word embedded vector sequences are respectively subjected to weighted average processing to obtain the tag data features respectively corresponding to the N tag data. Optionally, the computer device may perform an averaging process on the N word embedded vector sequences to obtain the tag data features respectively corresponding to the N tag data. Further, the N tag data features are spliced to obtain the tag feature sequence corresponding to the N tag data. For example, when one of the tags is basketball, the following code can be referred to: "token_ids = token_encoder("basketball"); token_embedding = embedding_token(token_ids); tag_embedding = averagepooling(token_embedding)", which yields the tag data feature corresponding to basketball. Here "token_ids" corresponds to the word element sequence, "token_embedding" corresponds to the word embedded vector sequence, and "tag_embedding" corresponds to the tag data feature.
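A runnable interpretation of the pseudo-code above is sketched below; the tokenizer and embedding table come from an assumed BERT checkpoint via the transformers library, and the function names are illustrative rather than the patented implementation:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def tag_embedding(tag_text: str) -> torch.Tensor:
    # Word element sequence (token ids) for one tag datum.
    token_ids = tokenizer(tag_text, return_tensors="pt", add_special_tokens=False)["input_ids"]
    # Word embedded vector sequence looked up from the encoder's embedding table.
    token_emb = bert.get_input_embeddings()(token_ids)
    # Average pooling over the tokens gives the tag data feature.
    return token_emb.mean(dim=1)

tags = ["basketball", "food preparation", "travel guides"]
tag_feature_sequence = torch.cat([tag_embedding(t) for t in tags], dim=0)   # shape: (N, 768)
```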
Step S404, the iterative variable feature and the composite feature sequence corresponding to the decoding sub-assembly S i are used as the iterative input feature of the decoding sub-assembly S i, and the iterative input feature of the decoding sub-assembly S i is subjected to cross attention processing in the decoding sub-assembly S i to obtain the iterative output feature of the decoding sub-assembly S i; if the decoding subassembly S i is the first decoding subassembly of the M decoding subassemblies, the iteration variable feature corresponding to the decoding subassembly S i is a tag feature sequence.
In the embodiment of the application, the computer device performs cross attention processing on the iterative variable feature and the composite feature sequence in the decoding subassembly S i to obtain an initial attention feature. And carrying out normalization processing on the initial attention characteristic to obtain a normalized characteristic. And carrying out forward propagation processing on the normalized features to obtain forward propagation features, fusing the normalized features and the forward propagation features to obtain target fusion features, and carrying out normalization processing on the target fusion features to obtain iterative output features of the decoding subassembly S i.
In the decoding subassembly S i, the computer device performs cross attention processing on the iteration variable feature and the composite feature sequence, and a specific implementation process for obtaining the initial attention feature may be: in the decoding subcomponent S i, the iterative variable feature is taken as the query vector in the cross-attention function, the composite feature sequence is taken as the key vector in the cross-attention function, and the composite feature sequence is taken as the value vector in the cross-attention function. The product of the query vector and the key vector is determined as a first fusion feature by a cross-attention function. And acquiring the dimension number corresponding to the composite feature sequence, and determining the product of the first fusion feature and the reciprocal of the dimension number as a second fusion feature. The second fusion feature is converted to a first activation feature based on an activation sub-function in the cross-attention function. The product of the first activation feature and the value vector is determined as a third fusion feature and the sum of the third fusion feature and the query vector is determined as an initial attention feature.
Specifically, the first fusion feature may be represented as QKᵀ, the second fusion feature as QKᵀ/√d, and the activation sub-function in the cross-attention function as softmax(). One possible cross-attention function may be found in formula ③:

Attention(Q, K, V) = softmax(QKᵀ / √d) · V  ③

As shown in formula ③, Q is used to represent the query vector, K is used to represent the key vector, and V is used to represent the value vector; ᵀ denotes the transpose, so Kᵀ is the transposed matrix of K; d is used for representing the dimension number corresponding to the composite feature sequence; softmax(QKᵀ/√d) is the first activation feature, and softmax(QKᵀ/√d)·V is the third fusion feature.
Further, the initial attention feature may be denoted as "Attention(Z, X_c, X_c) + Z", where X_c is the composite feature sequence and Z is the iteration variable feature; if the decoding subassembly S i is the first decoding subassembly of the M decoding subassemblies, the iteration variable feature corresponding to the decoding subassembly S i is the tag feature sequence. For example, referring to fig. 5, fig. 5 is a schematic diagram of a target joint model according to an embodiment of the present application; the target joint model as shown in fig. 5 includes an encoder, a decoder, and a full connection layer, wherein the decoder includes M decoding subassemblies, namely decoding subassembly 1, decoding subassembly 2, …, and decoding subassembly M. When the decoding subassembly S i is the first decoding subassembly of the M decoding subassemblies, i.e., decoding subassembly S i is decoding subassembly 1, the iteration variable feature corresponding to decoding subassembly S i is the tag feature sequence from the encoder. The iteration variable feature of decoding subassembly 2 is the iteration output feature of decoding subassembly 1. When the number of decoding sub-components is 1, the result of "Attention(Z, X_c, X_c) + Z" may also represent the initial cross-attention feature in the corresponding embodiment of fig. 3 described above.
The computer device performs normalization processing on the initial attention feature, and a specific implementation process for obtaining the normalized feature may be: acquiring the feature values in the initial attention feature, and determining the mean and the variance of the feature values; determining the difference between the initial attention feature and the mean as the offset feature; obtaining a normalization constant, and taking the square root of the sum of the variance and the normalization constant to obtain a scaling value; and determining the product of the offset feature and the reciprocal of the scaling value as the normalized feature.
In particular, the normalized feature may be expressed as LN(A), the offset feature as A − μ, and the scaling value as √(σ² + ε). One possible normalization processing formula may be found in formula ④:

LN(A) = (A − μ) / √(σ² + ε)  ④

As shown in formula ④, LN(A) represents the normalization processing of the initial attention feature A, μ is the mean of the feature values in the initial attention feature, σ² is the variance of the feature values in the initial attention feature, and ε is the normalization constant, which is typically used to prevent division by zero and can be a very small positive number, for example 10⁻⁸. When the number of decoding subassemblies is 1, LN(A) may also represent the cross normalized feature described above in the corresponding embodiment of fig. 3.
Further, the computer device performs forward propagation processing on the normalized feature of the decoding subassembly S i to obtain a forward propagation feature, fuses the normalized feature and the forward propagation feature to obtain a target fusion feature, and performs normalization processing on the target fusion feature to obtain an iterative output feature of the decoding subassembly S i.
In particular, the iterative output feature of the decoding subassembly S i may be denoted as Z', and the forward propagation processing function used when the computer device performs forward propagation processing on the normalized feature of the decoding subassembly S i may be denoted as FFN(); this function applies an activation function to perform non-linear processing on the normalized feature over multiple network layers. For example, the forward propagation processing may be two multi-layer perceptrons (Multilayer Perceptron, MLP). One possible iterative output feature calculation formula may be referred to in formula ⑤:

Z' = LN(LN(A) + FFN(LN(A)))  ⑤

As shown in formula ⑤, LN(A) is the normalized feature, FFN(LN(A)) is used for representing the forward propagation feature, and "LN(A) + FFN(LN(A))" is used to denote the target fusion feature. When the number of decoding subassemblies is 1, the result of Z' may also represent the cross attention sequence in the corresponding embodiment of fig. 3 described above.
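A compact sketch of one decoding sub-component (formulas ③–⑤) and of the stack of M such components is given below; it is a simplified single-head illustration with assumed dimensions, not the patented implementation:

```python
import math
import torch
import torch.nn as nn

class DecodingSubComponent(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, eps=eps)   # formula ④
        self.norm2 = nn.LayerNorm(dim, eps=eps)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z: torch.Tensor, x_c: torch.Tensor) -> torch.Tensor:
        # Formula ③: Attention(Q=z, K=x_c, V=x_c) = softmax(Q K^T / sqrt(d)) V
        d = x_c.shape[-1]
        attn = torch.softmax(z @ x_c.transpose(-2, -1) / math.sqrt(d), dim=-1) @ x_c
        a = attn + z                                   # initial attention feature (residual sum)
        a_norm = self.norm1(a)                         # normalized feature
        return self.norm2(a_norm + self.ffn(a_norm))   # formula ⑤: iterative output feature

# Stack of M decoding sub-components (M = 10 matches the example given later).
M, dim, num_tags, t_prime = 10, 768, 5, 7
decoder = nn.ModuleList([DecodingSubComponent(dim) for _ in range(M)])

x_c = torch.randn(1, t_prime, dim)      # composite feature sequence
z = torch.randn(1, num_tags, dim)       # tag feature sequence (initial iteration variable)
for sub_component in decoder:
    z = sub_component(z, x_c)           # iteration output -> next iteration variable
cross_attention_sequence = z
```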
Step S405, taking the iteration output characteristic of the decoding subassembly S i as the iteration variable characteristic corresponding to the decoding subassembly S i+1, taking the iteration variable characteristic and the composite characteristic sequence corresponding to the decoding subassembly S i+1 as the iteration input characteristic of the decoding subassembly S i+1, performing cross attention processing on the iteration input characteristic of the decoding subassembly S i+1 in the decoding subassembly S i+1 to obtain the iteration output characteristic of the decoding subassembly S i+1 until the iteration output characteristic of the last decoding subassembly in M decoding subassemblies is obtained, and determining the iteration output characteristic of the last decoding subassembly as a cross attention sequence corresponding to the tag characteristic sequence; the decoding subcomponent S i+1 is the next decoding subcomponent of the decoding subcomponent S i. For example, as shown in FIG. 5, the iterative variable feature of decoding subassembly M is the iterative output feature of decoding subassembly M-1; the iterative output features of the decoding sub-component M may be determined as a cross-attention sequence corresponding to the tag feature sequence.
In the embodiment of the present application, after obtaining the iterative output feature of the decoding subassembly S i, the computer device further needs to process the data, and uses the iterative output feature of the decoding subassembly S i as the iterative variable feature corresponding to the decoding subassembly S i+1, where the decoding subassembly S i+1 is the next decoding subassembly of the decoding subassembly S i. Further, the iteration variable feature and the composite feature sequence corresponding to the decoding sub-assembly S i+1 are used as iteration input features of the decoding sub-assembly S i+1, and cross attention processing is performed on the iteration input features of the decoding sub-assembly S i+1 in the decoding sub-assembly S i+1, where a specific implementation process may refer to a related process of performing cross attention processing on the iteration input features of the decoding sub-assembly S i in the decoding sub-assembly S i by using the above-mentioned computer device, which is not described herein again. After obtaining the iterative output feature of the last decoding sub-assembly in the M decoding sub-assemblies, the computer equipment determines the iterative output feature of the last decoding sub-assembly as a cross attention sequence corresponding to the tag feature sequence. Where M may be equal to 10, i.e., there are 10 decoding subassemblies, the computer device may determine the iterative output feature of the 10 th decoding subassembly as the cross-attention feature corresponding to the tag feature sequence.
Step S406, the confidence values corresponding to the N label data are output according to the cross attention sequence, and the target label data in the N label data are set for the video data according to the confidence values.
In the embodiment of the present application, the specific implementation process of step S406 may refer to the specific description process in step S304 shown in fig. 3, which is not described herein.
Step S407, acquiring video block feature sequences corresponding to the R video blocks and target tag features corresponding to the tag data M j, determining a target video block based on the video block feature sequences and the target tag features, and setting the tag data M j for the target video block.
In the embodiment of the application, after determining the target tag data, the computer device may further associate the target tag data with the video blocks corresponding to the target tag data in the video data. The video data includes A video frames, the number of target tag data is H, both A and H are positive integers, the target tag data includes tag data M j, and j is a positive integer less than or equal to H. The specific implementation process by which the computer device obtains the video block feature sequence corresponding to the R video blocks and the target tag feature corresponding to the tag data M j, determines a target video block based on the video block feature sequence and the target tag feature, and sets the tag data M j for the target video block may be: generating R video blocks based on the A video frames in the video data, and acquiring the video block feature sequence corresponding to the R video blocks and the target tag feature corresponding to the tag data M j; R is a positive integer; a video block is composed of one or more of the A video frames. The video block feature sequence and the target tag feature are subjected to cross attention processing through the decoder to obtain a video block attention sequence corresponding to the tag data M j; each cross attention value in the video block attention sequence is used to indicate an association between the tag data M j and one video block. Video block confidence values respectively corresponding to the R video blocks are output according to the video block attention sequence, the target video block associated with the tag data M j is acquired from the R video blocks according to the video block confidence values, and the tag data M j is set for the target video block.
Specifically, the computer equipment acquires A video frames in video data, generates R video blocks based on the A video frames, performs feature coding on the R video blocks through an encoder in a target joint model, and acquires video block feature sequences corresponding to the R video blocks, wherein the video block feature sequences comprise video block features corresponding to the R video blocks respectively; and the computer equipment performs feature coding on the tag data M j through an encoder in the target joint model to acquire target tag features corresponding to the tag data M j. Further, the computer device determines a sequence of video block features as query vectors in a cross-attention function, determines target tag features as key vectors in the cross-attention function, and determines target tag features as value vectors in the cross-attention function by which a sequence of video block attention corresponding to tag data M j is determined, each cross-attention value in the sequence of video block attention being used to indicate an association between tag data M j and one video block. Performing confidence normalization processing on the video block attention sequence based on an activation function of the full connection layer to obtain video block confidence values corresponding to R video blocks respectively, determining the video block confidence value with the video block confidence value larger than or equal to a video block confidence threshold value as a target video block confidence value in the R video blocks, and performing association setting on the target video block corresponding to the target video block confidence value and the tag data M j until the association setting of H target tag data and the corresponding video block in the video data is completed. It may be appreciated that one target tag data may be matched to a plurality of video frames in the video data, that is, a target tag feature corresponding to the target tag data, and that the video block feature in the matched video block feature sequence may indicate a feature corresponding to the plurality of video frames, that is, the plurality of video frames may collectively embody one tag content, and the computer device may determine the plurality of video frames as one video block, where the video block may be understood as one video area in the video data, and finally set the corresponding target tag data for the video area in an associated manner. Through the association setting of the target video block and the target tag data, target tag data display can be provided for a user browsing the video data, so that the user can drag a video progress bar according to own interests, directly browse the video area corresponding to the interested target tag data, and the experience of the user is enhanced.
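As an illustration only, the association of one target tag with video blocks can be sketched as below; the shapes, the score computation, and the video block confidence threshold are assumptions:

```python
import math
import torch

R, dim = 6, 768
video_block_features = torch.randn(R, dim)     # video block feature sequence (query side)
target_tag_feature = torch.randn(1, dim)       # target tag feature of tag data M_j (key/value side)

# Scaled dot-product scores between each video block and the target tag feature.
scores = video_block_features @ target_tag_feature.t() / math.sqrt(dim)    # (R, 1)
block_confidence = torch.softmax(scores.squeeze(-1), dim=0)                # video block confidence values

block_threshold = 0.3  # illustrative video block confidence threshold
target_blocks = (block_confidence >= block_threshold).nonzero(as_tuple=True)[0]
```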
Through the above process, video frames are obtained by sampling and frame extraction from the video data to form video blocks, which reduces the consumption of computing resources without affecting the recognition effect of the video features. The feature extraction method is then fully utilized to determine the features respectively corresponding to the video blocks, the text data associated with the video data, and the N tag data. The fusion mechanism is fully utilized to fuse the video feature sequence and the text data features into the composite feature sequence, so that the video data can be better represented: the features of the video data can be mined more comprehensively and finely through the composite feature sequence, and the tag data associated with it can be identified more comprehensively and accurately. The decoding sub-components in the decoder of the deep learning model (the target joint model) are used to perform cross attention processing, so that even when facing a complex composite feature sequence and tag feature sequence, the target joint model can still adaptively extract the associated detail features in them, thereby accurately mapping the association relationships between the video data and different tag data, effectively improving the identification accuracy of the target joint model for multi-label prediction classification, recalling the target tag data as far as possible, and improving the recall rate and stability of multi-label prediction even when the input data is complex. By designing multiple decoding sub-components (i.e., a multi-layer decoder), the target joint model can capture more complex inputs and features; at the same time, multiple decoding sub-components can improve the generalization ability of the target joint model by learning more intermediate representations, which allows the target joint model to better adapt to unseen data or more complex tasks. Finally, the target joint model in the embodiment of the application can maintain an accuracy of 66.29% while improving the recall rate of the target tag data to 53.58%.
Further, referring to fig. 6, fig. 6 is a flowchart of a method for data processing according to an embodiment of the present application. As shown in fig. 6, for describing a data processing procedure including the steps of:
Step S601, inputting a video sample and a text sample associated with the video sample into an initial joint model; the initial joint model includes an initial encoder and an initial decoder.
In an embodiment of the application, a computer device may obtain a video sample and a text sample associated with the video sample, where the video sample may include video sample blocks. For example, the computer device may extract 50 frames of images per second from the video sample as video sample frames, and combine one or more video sample frames that characterize the same features into one video sample block. The text sample associated with the video sample may be, for example, a generalized title describing the video content that the computer device obtains together with the video sample. The computer device inputs the video sample blocks and the text sample into an initial joint model, where the initial joint model is used for performing multi-label prediction on the video sample; parameter adjustment is performed on the initial joint model based on the standard label data corresponding to the video sample so as to obtain the target joint model, a deep learning model. The initial joint model includes an initial encoder and an initial decoder.
Step S602, feature encoding is carried out on the video sample and the text sample through an initial encoder in the initial joint model, a video sample feature sequence corresponding to the video sample and text sample features corresponding to the text sample are obtained, and fusion processing is carried out on the video sample feature sequence and the text sample features, so that a composite sample feature sequence is obtained.
In the embodiment of the present application, the specific implementation process of acquiring the composite sample feature sequence may refer to the related description of acquiring the composite feature sequence in step S402 in fig. 4, which is not described herein.
Step S603, obtaining tag sample feature sequences corresponding to W tag samples, and performing cross attention processing on the composite sample feature sequences and the tag sample feature sequences through an initial decoder in an initial joint model to obtain cross attention sample sequences corresponding to the tag sample feature sequences; each cross-attention value in the sequence of cross-attention samples is used to indicate a degree of association between a video sample and one label sample; w is a positive integer.
In the embodiment of the present application, the computer device performs cross attention processing on the composite sample feature sequence and the tag sample feature sequence through the initial decoder in the initial joint model, so as to obtain a specific implementation process of the cross attention sample sequence corresponding to the tag sample feature sequence, which can be referred to as step S303 in fig. 3, and a description of a related process of acquiring the cross attention sequence as steps S404 and S405 in fig. 4, which are not repeated herein.
Step S604, performing parameter adjustment on an initial encoder and an initial decoder in the initial joint model according to the cross attention sample sequence to obtain a target joint model, wherein the target joint model is used for performing multi-label prediction of the video.
In the embodiment of the application, the initial encoder comprises a visual feature extraction component and a text feature extraction component, and the computer equipment carries out parameter adjustment on the initial encoder and the initial decoder in the initial joint model according to the cross attention sample sequence, so that the specific implementation process of obtaining the target joint model can be as follows: and acquiring standard video features corresponding to the video samples, and carrying out parameter adjustment on the visual feature extraction component according to the standard video features and the video sample feature sequences. And acquiring standard text characteristics corresponding to the text sample, and carrying out parameter adjustment on the text characteristic extraction component according to the standard text characteristics and the text sample characteristics. And obtaining a sample type label corresponding to the video sample, and generating a label loss value according to the sample type label and the cross attention sample sequence. And carrying out parameter adjustment on an initial decoder in the initial joint model according to the label loss value to obtain a target joint model.
Specifically, the computer device may acquire a standard video feature corresponding to the video sample; the standard video feature may be a feature already associated with the video sample when the video sample is acquired, or may be obtained by performing feature encoding on the video sample through a trained visual feature extraction model and determining the resulting feature as the standard video feature. The visual feature extraction component is then parameter-adjusted according to the standard video feature and the video sample feature sequence; for example, the learning rate of the visual feature extraction component may be increased in order to learn some simple patterns faster. The computer device may likewise obtain standard text features corresponding to the text sample and adjust the parameters of the text feature extraction component according to the standard text features and the text sample features; for example, the regularization of the text feature extraction component may be reduced in order to better learn the details of the text sample data.
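For illustration only, the differentiated tuning described here (and summarized further below) can be expressed as optimizer parameter groups; the module names, learning rates, and weight decay values are assumptions, not values taken from the patent:

```python
import torch
import torch.nn as nn

# Toy stand-in for the initial joint model; the attribute names are illustrative.
class ToyJointModel(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.visual_feature_extractor = nn.Linear(dim, dim)
        self.text_feature_extractor = nn.Linear(dim, dim)
        self.decoder = nn.Linear(dim, dim)

model = ToyJointModel()

# Larger learning rate / lighter regularization for the feature extraction components,
# smaller learning rate / stronger regularization for the initial decoder.
optimizer = torch.optim.AdamW([
    {"params": model.visual_feature_extractor.parameters(), "lr": 1e-4, "weight_decay": 1e-4},
    {"params": model.text_feature_extractor.parameters(),   "lr": 1e-4, "weight_decay": 1e-4},
    {"params": model.decoder.parameters(),                  "lr": 1e-5, "weight_decay": 1e-2},
])
```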
The specific implementation process of generating the label loss value by the computer device according to the sample category label and the cross attention sample sequence may be: and outputting sample confidence values corresponding to the W label samples respectively according to the cross attention sample sequence. And carrying out logarithmic processing on the confidence coefficient values of the W samples respectively to obtain W confidence coefficient logarithmic values. Based on the sample category labels and the sample confidence values, determining label determination values corresponding to the W label samples respectively; the label determination value includes a first value or a second value, the first value is used for indicating that labels identical to label samples corresponding to the first value exist in the sample class labels, and the second value is used for indicating that labels identical to label samples corresponding to the second value do not exist in the sample class labels. And generating a label loss value according to the label determination values and the W confidence coefficient logarithmic values respectively corresponding to the W label samples.
In particular, the tag loss value may be represented by Loss, the sample confidence value by p, and the tag determination value by y. For example, when the sample category label corresponding to video sample i includes the label sample c, y_{i,c} takes the first value, e.g. 1; when the sample category label corresponding to video sample i does not include the label sample c, y_{i,c} takes the second value, e.g. 0. One possible label loss value representation method can be found in formula ⑥:

Loss_i = − Σ_{c=1..W} y_{i,c} · log(p_{i,c}),  Loss = (1/L) Σ_{i=1..L} Loss_i  ⑥

As shown in formula ⑥, Loss_i represents the label loss value corresponding to the i-th video sample, L is the number of video samples, and W is the number of label samples; p_{i,c} is used for representing the confidence between video sample i and label sample c, and y_{i,c} is the tag determination value indicating whether the association between video sample i and label sample c is correct, e.g. when y_{i,c} is 0, the label sample c is not considered to be a label corresponding to video sample i, and when y_{i,c} is 1, the label sample c can be regarded as a label corresponding to video sample i. log() is used to denote the logarithmic function, and log(p_{i,c}) is the confidence logarithm value.
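A minimal sketch of the label loss in formula ⑥ is given below; the tensor values are random placeholders, and the clamp is added only to keep the logarithm finite:

```python
import torch

L_samples, W_labels = 2, 5
confidence = torch.rand(L_samples, W_labels).clamp(1e-6, 1.0)    # p_{i,c}: sample confidence values
labels = torch.randint(0, 2, (L_samples, W_labels)).float()      # y_{i,c}: tag determination values

# Formula ⑥: per-sample loss is -sum_c y_{i,c} * log(p_{i,c}); average over the L samples.
loss_per_sample = -(labels * torch.log(confidence)).sum(dim=1)
loss = loss_per_sample.mean()
```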
Through the above process, the characteristics of the video sample can be identified and predicted more accurately by jointly using the video sample, the text sample associated with the video sample, and the label samples. By setting a larger learning rate for the visual feature extraction component and the text feature extraction component and a smaller learning rate for the initial decoder, the feature extraction components learn simple patterns faster, improving feature extraction efficiency, while the initial decoder learns more complex patterns more stably, ensuring the stability and convergence speed of the whole model. By adjusting the regularization terms of the loss function, for example setting a larger regularization coefficient for the initial decoder to prevent overfitting and a smaller regularization coefficient for the visual feature extraction component and the text feature extraction component so that they better learn the details of the data, the model is helped to capture the relevance between different modalities, and the recall rate of the model's recalled tags is significantly enhanced.
Referring further to fig. 7, fig. 7 is a schematic diagram of model training according to an embodiment of the present application. As shown in fig. 7, the initial encoder includes a visual feature extraction component and a text feature extraction component. Through the initial joint model, the computer device inputs the video sample into the visual feature extraction component of the initial encoder to obtain the video sample feature sequence; inputs the text sample into the text feature extraction component to obtain the text sample features; and inputs the label samples into the text feature extraction component to obtain the label sample feature sequence. The video sample feature sequence, the text sample features, and the label sample feature sequence are input into the initial decoder for model training to obtain a model output result, and based on the model output result, parameter adjustment is performed on the visual feature extraction component, the text feature extraction component, and the initial decoder during the model training iterations until the adjustment is finished, thereby completing the model iteration process and obtaining the target joint model.
Further, referring to fig. 8, fig. 8 is a schematic diagram of a data processing apparatus according to an embodiment of the application. The data processing apparatus 800 may be a computer program (including program code, etc.) running in a computer device, for example the data processing apparatus 800 may be an application software; the data processing apparatus 800 may be used to perform the corresponding steps in the method provided by the embodiments of the present application. As shown in fig. 8, the data processing apparatus 800 may be used in the computer device in the embodiment corresponding to fig. 3, fig. 4, or fig. 6, and specifically, the data processing apparatus 800 may include: a parameter input module 11, a feature processing module 12, a cross-attention processing module 13, a tag setting module 14, and a target video block determination module 15.
A parameter input module 11 for inputting video data and text data associated with the video data into a target joint model; the target joint model includes an encoder and a decoder;
The feature processing module 12 is configured to perform feature encoding on the video data and the text data through an encoder in the target joint model to obtain a video feature sequence corresponding to the video data and text data features corresponding to the text data, and to perform fusion processing on the video feature sequence and the text data features, so as to obtain a composite feature sequence;
The cross attention processing module 13 is configured to obtain tag feature sequences corresponding to the N tag data, and perform cross attention processing on the composite feature sequence and the tag feature sequence through a decoder to obtain a cross attention sequence corresponding to the tag feature sequence; each cross-attention value in the cross-attention sequence is used to indicate a degree of association between video data and one tag data; n is a positive integer;
The tag setting module 14 is configured to output confidence values corresponding to the N tag data respectively according to the cross attention sequence, and set target tag data in the N tag data for the video data according to the confidence values.
In one possible implementation, the video data includes a number a of video frames, a being a positive integer; the feature processing module 12 is configured to perform feature encoding on the video data and the text data by using an encoder in the target joint model, so as to obtain a video feature sequence corresponding to the video data and a text data feature corresponding to the text data, where the feature processing module 12 is specifically configured to perform the following operations:
Generating R video blocks based on A video frames in video data, projecting the R video blocks in the video data through an encoder in a target joint model to obtain image projection features corresponding to the R video blocks respectively, and adding video position information to the R image projection features to obtain R image update features; video position information refers to information of the position of a video block in video data; r is a positive integer; the video block is composed of one or more video frames of the A video frames;
performing feature coding on the R image updating features to obtain image data features corresponding to the R video blocks respectively, and generating a video feature sequence corresponding to video data according to the image data features corresponding to the R video blocks respectively;
splitting text data into P data phrases through an encoder in the target joint model; p is a positive integer;
Vector conversion is carried out on the P data word groups to obtain word embedding vectors corresponding to the P data word groups respectively, text position information is added to the P word embedding vectors respectively, and P word updating vectors are obtained; the text position information refers to the information of the position of a data phrase in text data;
And carrying out feature coding on the P word updating vectors to obtain data phrase features corresponding to the P data phrases respectively, and generating text data features corresponding to the text data according to the data phrase features corresponding to the P data phrases respectively.
In one possible implementation manner, when the cross-attention processing module 13 is configured to obtain the tag feature sequences corresponding to the N tag data, the cross-attention processing module 13 is specifically configured to perform the following operations:
acquiring N label data, and splitting the N label data respectively to obtain word element sequences corresponding to the N label data respectively; one word element in the word element sequence refers to a minimum basic unit obtained after splitting processing of tag data;
Mapping N word sequences through an encoder in the target joint model to obtain word embedded vector sequences corresponding to the N word sequences respectively;
respectively carrying out weighted average processing on the N word embedded vector sequences to obtain tag data characteristics corresponding to the N tag data respectively;
And splicing the N tag data features to obtain tag feature sequences corresponding to the N tag data.
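As an illustrative sketch of the tag feature sequence construction above, the following code splits each tag into tokens, embeds the tokens, averages them per tag, and stacks the N tag data features. The toy tokenizer, vocabulary size, and the use of a uniform average in place of a learned weighted average are assumptions.

```python
import torch
import torch.nn as nn

def build_tag_feature_sequence(tags, tokenizer, embedding):
    """Split each tag into tokens, embed them, average per tag,
    and concatenate the N tag data features into one tag feature sequence."""
    tag_features = []
    for tag in tags:
        token_ids = torch.tensor(tokenizer(tag))        # token (word element) sequence
        token_vectors = embedding(token_ids)            # word embedded vector sequence
        tag_features.append(token_vectors.mean(dim=0))  # uniform average stands in for the weighted average
    return torch.stack(tag_features)                    # [N, d_model]

vocab_size, d_model = 1000, 512
embed = nn.Embedding(vocab_size, d_model)
toy_tokenizer = lambda s: [hash(w) % vocab_size for w in s.split()]  # purely illustrative
tag_feature_seq = build_tag_feature_sequence(
    ["funny cat", "cooking tutorial", "travel"], toy_tokenizer, embed)
print(tag_feature_seq.shape)  # torch.Size([3, 512])
```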
In one possible implementation manner, when the cross-attention processing module 13 is configured to perform cross-attention processing on the composite feature sequence and the tag feature sequence by using a decoder to obtain the cross-attention sequence corresponding to the tag feature sequence, the cross-attention processing module 13 is specifically configured to perform the following operations:
In the decoder, the composite feature sequence and the label feature sequence are subjected to cross attention processing to obtain initial cross attention features;
Normalizing the initial cross attention feature to obtain a cross normalized feature;
And carrying out forward propagation processing on the cross normalized features to obtain cross forward features, fusing the cross normalized features and the cross forward features to obtain cross fusion features, and carrying out normalization processing on the cross fusion features to obtain cross attention sequences corresponding to the tag feature sequences.
In one possible implementation, the decoder includes M decoding subassemblies, where M is a positive integer, and M decoding subassemblies include decoding subassemblies S i, and i is a positive integer less than or equal to M; the cross attention processing module 13 is configured to perform cross attention processing on the composite feature sequence and the tag feature sequence by using a decoder, and when the cross attention sequence corresponding to the tag feature sequence is obtained, the cross attention processing module 13 is specifically configured to perform the following operations:
The iterative variable characteristics and the composite characteristic sequences corresponding to the decoding sub-assembly S i are used as the iterative input characteristics of the decoding sub-assembly S i, and the iterative input characteristics of the decoding sub-assembly S i are subjected to cross attention processing in the decoding sub-assembly S i to obtain the iterative output characteristics of the decoding sub-assembly S i; if the decoding subassembly S i is the first decoding subassembly of the M decoding subassemblies, the iteration variable feature corresponding to the decoding subassembly S i is a tag feature sequence;
The iteration output characteristic of the decoding subassembly S i is used as the iteration variable characteristic corresponding to the decoding subassembly S i+1, the iteration variable characteristic and the composite characteristic sequence corresponding to the decoding subassembly S i+1 are used as the iteration input characteristic of the decoding subassembly S i+1, the iteration input characteristic of the decoding subassembly S i+1 is subjected to cross attention processing in the decoding subassembly S i+1, the iteration output characteristic of the decoding subassembly S i+1 is obtained until the iteration output characteristic of the last decoding subassembly in M decoding subassemblies is obtained, and the iteration output characteristic of the last decoding subassembly is determined as a cross attention sequence corresponding to the tag characteristic sequence; the decoding subcomponent S i+1 is the next decoding subcomponent of the decoding subcomponent S i.
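A minimal sketch of the iterative decoding described above is given below: the tag feature sequence initialises the iteration variable feature, each decoding subassembly attends to the composite feature sequence, and the output of the last subassembly is taken as the cross attention sequence. The internal structure of DecodeSubcomponent (standard multi-head attention, layer normalization, and a feed-forward network with residual connections) is a conventional approximation of the steps described in this embodiment, not its exact definition.

```python
import torch
import torch.nn as nn

class DecodeSubcomponent(nn.Module):
    """One decoding subassembly S_i: cross attention over the composite feature sequence,
    normalization, a feed-forward (forward propagation) step, fusion, and normalization."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, iter_var, composite):
        attn_out, _ = self.cross_attn(iter_var, composite, composite)  # query = iteration variable feature
        x = self.norm1(attn_out + iter_var)   # normalized initial attention feature (with residual)
        x = self.norm2(x + self.ffn(x))       # forward propagation, fusion, normalization
        return x

class Decoder(nn.Module):
    def __init__(self, m=6, d_model=512):
        super().__init__()
        self.blocks = nn.ModuleList([DecodeSubcomponent(d_model) for _ in range(m)])

    def forward(self, tag_feature_seq, composite_feature_seq):
        iter_var = tag_feature_seq                           # S_1 starts from the tag feature sequence
        for block in self.blocks:                            # output of S_i becomes the input of S_{i+1}
            iter_var = block(iter_var, composite_feature_seq)
        return iter_var                                      # cross attention sequence

decoder = Decoder(m=6)
out = decoder(torch.randn(2, 10, 512), torch.randn(2, 36, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```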
In one possible implementation manner, the cross-attention processing module 13 is configured to use the iteration variable feature and the composite feature sequence corresponding to the decoding sub-component S i as the iteration input feature of the decoding sub-component S i, and perform cross-attention processing on the iteration input feature of the decoding sub-component S i in the decoding sub-component S i, so as to obtain the iteration output feature of the decoding sub-component S i, where the cross-attention processing module 13 is specifically configured to perform the following operations:
In the decoding subassembly S i, performing cross attention processing on the iterative variable characteristic and the composite characteristic sequence to obtain an initial attention characteristic;
Normalizing the initial attention characteristic to obtain a normalized characteristic;
And carrying out forward propagation processing on the normalized features to obtain forward propagation features, fusing the normalized features and the forward propagation features to obtain target fusion features, and carrying out normalization processing on the target fusion features to obtain iterative output features of the decoding subassembly S i.
In one possible implementation manner, the cross-attention processing module 13 is configured to perform cross-attention processing on the iteration variable feature and the composite feature sequence in the decoding subassembly S i, so as to obtain an initial attention feature, where the cross-attention processing module 13 is specifically configured to perform the following operations:
In the decoding subcomponent S i, the iterative variable feature is taken as a query vector in the cross-attention function, the composite feature sequence is taken as a key vector in the cross-attention function, and the composite feature sequence is taken as a value vector in the cross-attention function;
determining a product of the query vector and the key vector as a first fusion feature by a cross-attention function;
Acquiring the dimension number corresponding to the composite feature sequence, and determining the product of the first fusion feature and the reciprocal of the dimension number as a second fusion feature;
converting the second fusion feature to a first activation feature based on an activation sub-function in the cross-attention function;
the product of the first activation feature and the value vector is determined as a third fusion feature and the sum of the third fusion feature and the query vector is determined as an initial attention feature.
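The cross-attention computation just described can be sketched directly as below. Note that this embodiment scales by the reciprocal of the dimension number, whereas conventional scaled dot-product attention scales by the reciprocal of the square root of the dimension number; the sketch follows the embodiment. The single-head formulation and the tensor shapes are assumptions.

```python
import torch

def cross_attention(iter_var, composite):
    # iter_var:  [N, d] query vector (one row per tag feature)
    # composite: [L, d] key vector and value vector
    d = composite.size(-1)                                   # number of dimensions
    first_fusion = iter_var @ composite.transpose(-2, -1)    # product of query and key, [N, L]
    second_fusion = first_fusion / d                         # product with the reciprocal of d
    first_activation = torch.softmax(second_fusion, dim=-1)  # activation sub-function
    third_fusion = first_activation @ composite              # product with the value vector, [N, d]
    return third_fusion + iter_var                           # sum with the query: initial attention feature

q = torch.randn(5, 512)    # e.g. N = 5 tag features
kv = torch.randn(40, 512)  # e.g. composite feature sequence of length 40
print(cross_attention(q, kv).shape)  # torch.Size([5, 512])
```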
In one possible implementation manner, the cross-attention processing module 13 is configured to normalize the initial attention feature, and when the normalized feature is obtained, the cross-attention processing module 13 is specifically configured to:
acquiring a characteristic value in the initial attention characteristic, and determining the mean value and the variance of the characteristic value;
Determining a difference between the initial attention feature and the mean as an offset feature;
acquiring a normalization constant, and performing square-root processing on the sum of the variance and the normalization constant to obtain a scaling value;
The product between the offset feature and the inverse of the scaled value is determined as the normalized feature.
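The normalization described above reads as the layer-normalization formula; the sketch below assumes a per-feature normalization axis and a small value for the normalization constant.

```python
import torch

def normalize_feature(x, eps=1e-6):
    mean = x.mean(dim=-1, keepdim=True)                # mean of the feature values
    var = x.var(dim=-1, unbiased=False, keepdim=True)  # variance of the feature values
    offset = x - mean                                  # offset feature
    scale = torch.sqrt(var + eps)                      # scaling value: sqrt(variance + normalization constant)
    return offset / scale                              # offset feature times the inverse of the scaling value

x = torch.randn(5, 512)
print(normalize_feature(x).shape)  # torch.Size([5, 512])
```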
In one possible implementation manner, the target joint model further includes a full connection layer, the tag setting module 14 is configured to output confidence values corresponding to the N tag data respectively according to the cross attention sequence, and when the target tag data in the N tag data is set for the video data according to the confidence values, the tag setting module 14 is specifically configured to perform the following operations:
Performing confidence degree normalization processing on the cross attention sequence based on an activation function of the full connection layer to obtain confidence degree values corresponding to the N label data respectively;
And determining the confidence coefficient value which is larger than or equal to the confidence coefficient threshold value in the N confidence coefficient values as a target confidence coefficient value, and performing association setting on the tag data corresponding to the target confidence coefficient value and the video data.
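A sketch of the confidence output and tag setting described above follows; the choice of a sigmoid activation and a 0.5 confidence threshold are assumptions made for illustration.

```python
import torch
import torch.nn as nn

d_model, n_tags = 512, 8
fc = nn.Linear(d_model, 1)                         # full connection layer
cross_attention_seq = torch.randn(n_tags, d_model)

logits = fc(cross_attention_seq).squeeze(-1)       # one score per tag
confidence = torch.sigmoid(logits)                 # confidence normalization
threshold = 0.5                                    # confidence threshold (assumed)
target_indices = (confidence >= threshold).nonzero(as_tuple=True)[0]
print("confidence values:", [round(c, 3) for c in confidence.tolist()])
print("target tag indices:", target_indices.tolist())  # tags to associate with the video data
```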
In one possible implementation, the video data includes A video frames, the number of target tag data is H, A and H are both positive integers, the target tag data includes tag data M j, j is a positive integer less than or equal to H; the data processing apparatus further comprises a target video block determination module 15, the target video block determination module 15 being specifically configured to:
based on A video frames in video data, generating R video blocks, and acquiring a video block characteristic sequence corresponding to the R video blocks and a target tag characteristic corresponding to tag data M j; r is a positive integer; the video block is composed of one or more video frames of the A video frames;
the video block feature sequence and the target tag feature are respectively subjected to cross attention processing through a decoder to obtain a video block attention sequence corresponding to tag data M j; each cross attention value in the video block attention sequence is used for indicating the association relationship between the tag data M j and one video block;
And outputting video block confidence values corresponding to the R video blocks respectively according to the video block attention sequence, acquiring target video blocks associated with the tag data M j from the R video blocks according to the video block confidence values, and setting the tag data M j for the target video blocks.
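As an illustration of the block-level association above for one target tag, the dot-product scoring, sigmoid mapping, and threshold in the sketch below are stand-ins for the decoder-based processing of this embodiment and are assumptions only.

```python
import torch

d_model, r_blocks = 512, 12
video_block_features = torch.randn(r_blocks, d_model)  # features of the R video blocks
target_tag_feature = torch.randn(d_model)               # target tag feature of tag data M_j

scores = video_block_features @ target_tag_feature / d_model  # per-block association scores
block_confidence = torch.sigmoid(scores)                      # video block confidence values
target_blocks = (block_confidence >= 0.5).nonzero(as_tuple=True)[0]
print("video blocks tagged with M_j:", target_blocks.tolist())
```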
Further, referring to fig. 9, fig. 9 is a schematic diagram of a data processing apparatus according to an embodiment of the application. The data processing apparatus 900 may be a computer program (including program code, etc.) running in a computer device; for example, the data processing apparatus 900 may be application software. The data processing apparatus 900 may be used to perform the corresponding steps in the method provided by the embodiments of the present application. As shown in fig. 9, the data processing apparatus 900 may be used in the computer device in the embodiment corresponding to fig. 3, fig. 4, or fig. 6, and specifically, the data processing apparatus 900 may include: a parameter training module 21, a feature training module 22, a cross-attention training module 23, and a parameter adjustment module 24.
A parameter training module 21 for inputting a video sample and a text sample associated with the video sample into an initial joint model; the initial joint model includes an initial encoder and an initial decoder;
The feature training module 22 is configured to perform feature encoding on the video sample and the text sample through an initial encoder in the initial joint model to obtain a video sample feature sequence corresponding to the video sample and a text sample feature corresponding to the text sample, and perform fusion processing on the video sample feature sequence and the text sample feature to obtain a composite sample feature sequence;
The cross attention training module 23 is configured to obtain tag sample feature sequences corresponding to W tag samples, and perform cross attention processing on the composite sample feature sequence and the tag sample feature sequence through an initial decoder in an initial joint model to obtain a cross attention sample sequence corresponding to the tag sample feature sequence; each cross-attention value in the sequence of cross-attention samples is used to indicate a degree of association between a video sample and one label sample; w is a positive integer;
The parameter adjustment module 24 is configured to perform parameter adjustment on an initial encoder and an initial decoder in the initial joint model according to the cross-attention sample sequence, so as to obtain a target joint model, where the target joint model is used for performing multi-label prediction of the video.
In one possible implementation, the initial encoder includes a visual feature extraction component and a text feature extraction component; the parameter adjustment module 24 is configured to perform parameter adjustment on an initial encoder and an initial decoder in an initial joint model according to a cross-attention sample sequence, and when a target joint model is obtained, the parameter adjustment module 24 is specifically configured to:
Acquiring standard video features corresponding to the video samples, and carrying out parameter adjustment on the visual feature extraction component according to the standard video features and the video sample feature sequences;
Acquiring standard text features corresponding to the text samples, and performing parameter adjustment on the text feature extraction component according to the standard text features and the text sample features;
Acquiring a sample type label corresponding to a video sample, and generating a label loss value according to the sample type label and a cross attention sample sequence;
and carrying out parameter adjustment on an initial decoder in the initial joint model according to the label loss value to obtain a target joint model.
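A rough sketch of a single training step combining the three adjustments above is given below. The use of mean-squared error for the two feature-alignment terms, binary cross-entropy for the label term, a single optimizer, and the model interface are all hypothetical assumptions for illustration.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
bce = nn.BCELoss()

def training_step(model, optimizer, batch):
    # model is assumed to return (pooled video sample features, pooled text sample features,
    # per-label-sample confidence values) for one batch; this interface is hypothetical.
    video_feat, text_feat, tag_confidence = model(batch["video"], batch["text"], batch["tags"])
    loss_visual = mse(video_feat, batch["standard_video_features"])  # align visual feature extraction component
    loss_text = mse(text_feat, batch["standard_text_features"])      # align text feature extraction component
    loss_label = bce(tag_confidence, batch["label_determination"])   # label loss for the initial decoder
    loss = loss_visual + loss_text + loss_label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```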
In one possible implementation, when the parameter adjustment module 24 is configured to generate the tag loss value according to the sample class tag and the cross-attention sample sequence, the parameter adjustment module 24 is specifically configured to:
outputting sample confidence values corresponding to the W label samples respectively according to the cross attention sample sequence;
carrying out logarithmic processing on the confidence coefficient values of the W samples respectively to obtain W confidence coefficient logarithmic values;
Based on the sample category labels and the sample confidence values, determining label determination values corresponding to the W label samples respectively; the label determination value comprises a first value or a second value, wherein the first value is used for indicating that the sample category labels contain a label identical to the label sample corresponding to the first value, and the second value is used for indicating that the sample category labels do not contain a label identical to the label sample corresponding to the second value;
And generating a label loss value according to the label determination values and the W confidence coefficient logarithmic values respectively corresponding to the W label samples.
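The label loss above involves the 0/1 label determination values and the logarithms of the W sample confidence values; the exact combination is not spelled out here, so the sketch below uses the standard multi-label binary cross-entropy as one plausible instantiation, under that stated assumption.

```python
import torch

def label_loss(sample_confidence, label_determination):
    # sample_confidence:   [W] confidence value per label sample, in (0, 1)
    # label_determination: [W] 1.0 if the sample class tags contain this label sample, else 0.0
    log_conf = torch.log(sample_confidence)        # W confidence logarithm values
    log_neg = torch.log(1.0 - sample_confidence)
    return -(label_determination * log_conf + (1.0 - label_determination) * log_neg).mean()

conf = torch.tensor([0.9, 0.2, 0.7, 0.05])
truth = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(label_loss(conf, truth))  # small when confidences agree with the sample class tags
```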
Referring to fig. 10, fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the application. As shown in fig. 10, the computer device 1000 in the embodiment of the present application may include: a processor 1001, a network interface 1004, and a memory 1005; in addition, the computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and a keyboard; optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM or a non-volatile memory, such as at least one disk memory. The memory 1005 may also optionally be at least one storage device located remotely from the processor 1001. As shown in fig. 10, the memory 1005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
The network interface 1004 may provide a network communication function; the user interface 1003 is primarily used to provide an input interface for a user; and the processor 1001 may be configured to invoke the device control application stored in the memory 1005 to perform the following operations:
Inputting video data and text data associated with the video data into a target joint model; the target joint model includes an encoder and a decoder;
Performing feature coding on the video data and the text data through an encoder in the target joint model to obtain a video feature sequence corresponding to the video data and text data features corresponding to the text data, and performing fusion processing on the video feature sequence and the text data features to obtain a composite feature sequence;
Acquiring tag feature sequences corresponding to N tag data, and performing cross attention processing on the composite feature sequences and the tag feature sequences through a decoder to obtain cross attention sequences corresponding to the tag feature sequences; each cross-attention value in the cross-attention sequence is used to indicate a degree of association between video data and one tag data; n is a positive integer;
and outputting confidence values respectively corresponding to the N label data according to the cross attention sequence, and setting target label data in the N label data for the video data according to the confidence values.
In one possible implementation, the video data includes A video frames, A being a positive integer; the processor 1001 is configured to perform feature encoding on the video data and the text data by using an encoder in the target joint model, so as to obtain a video feature sequence corresponding to the video data and a text data feature corresponding to the text data, and specifically is configured to perform the following operations:
Generating R video blocks based on A video frames in video data, projecting the R video blocks in the video data through an encoder in a target joint model to obtain image projection features corresponding to the R video blocks respectively, and adding video position information to the R image projection features to obtain R image update features; video position information refers to information of the position of a video block in video data; r is a positive integer; the video block is composed of one or more video frames of the A video frames;
performing feature coding on the R image updating features to obtain image data features corresponding to the R video blocks respectively, and generating a video feature sequence corresponding to video data according to the image data features corresponding to the R video blocks respectively;
splitting text data into P data phrases through an encoder in the target joint model; p is a positive integer;
Vector conversion is carried out on the P data word groups to obtain word embedding vectors corresponding to the P data word groups respectively, text position information is added to the P word embedding vectors respectively, and P word updating vectors are obtained; the text position information refers to the information of the position of a data phrase in text data;
And carrying out feature coding on the P word updating vectors to obtain data phrase features corresponding to the P data phrases respectively, and generating text data features corresponding to the text data according to the data phrase features corresponding to the P data phrases respectively.
In one possible implementation manner, the processor 1001 is configured to obtain tag feature sequences corresponding to the N tag data, and specifically is configured to perform the following operations:
acquiring N label data, and splitting the N label data respectively to obtain word element sequences corresponding to the N label data respectively; one word element in the word element sequence refers to a minimum basic unit obtained after splitting processing of tag data;
Mapping N word sequences through an encoder in the target joint model to obtain word embedded vector sequences corresponding to the N word sequences respectively;
respectively carrying out weighted average processing on the N word embedded vector sequences to obtain tag data characteristics corresponding to the N tag data respectively;
And splicing the N tag data features to obtain tag feature sequences corresponding to the N tag data.
In a possible implementation manner, the processor 1001 is configured to perform cross-attention processing on the composite feature sequence and the tag feature sequence by using a decoder to obtain a cross-attention sequence corresponding to the tag feature sequence, and specifically is configured to perform the following operations:
In the decoder, the composite feature sequence and the label feature sequence are subjected to cross attention processing to obtain initial cross attention features;
Normalizing the initial cross attention feature to obtain a cross normalized feature;
And carrying out forward propagation processing on the cross normalized features to obtain cross forward features, fusing the cross normalized features and the cross forward features to obtain cross fusion features, and carrying out normalization processing on the cross fusion features to obtain cross attention sequences corresponding to the tag feature sequences.
In one possible implementation, the decoder includes M decoding subassemblies, where M is a positive integer, and M decoding subassemblies include decoding subassemblies S i, and i is a positive integer less than or equal to M; the processor 1001 is configured to perform cross attention processing on the composite feature sequence and the tag feature sequence by using a decoder to obtain a cross attention sequence corresponding to the tag feature sequence, and specifically configured to perform the following operations:
The iterative variable characteristics and the composite characteristic sequences corresponding to the decoding sub-assembly S i are used as the iterative input characteristics of the decoding sub-assembly S i, and the iterative input characteristics of the decoding sub-assembly S i are subjected to cross attention processing in the decoding sub-assembly S i to obtain the iterative output characteristics of the decoding sub-assembly S i; if the decoding subassembly S i is the first decoding subassembly of the M decoding subassemblies, the iteration variable feature corresponding to the decoding subassembly S i is a tag feature sequence;
The iteration output characteristic of the decoding subassembly S i is used as the iteration variable characteristic corresponding to the decoding subassembly S i+1, the iteration variable characteristic and the composite characteristic sequence corresponding to the decoding subassembly S i+1 are used as the iteration input characteristic of the decoding subassembly S i+1, the iteration input characteristic of the decoding subassembly S i+1 is subjected to cross attention processing in the decoding subassembly S i+1, the iteration output characteristic of the decoding subassembly S i+1 is obtained until the iteration output characteristic of the last decoding subassembly in M decoding subassemblies is obtained, and the iteration output characteristic of the last decoding subassembly is determined as a cross attention sequence corresponding to the tag characteristic sequence; the decoding subcomponent S i+1 is the next decoding subcomponent of the decoding subcomponent S i.
In one possible implementation manner, the processor 1001 is configured to take the iteration variable feature and the composite feature sequence corresponding to the decoding sub-component S i as the iteration input feature of the decoding sub-component S i, and perform cross-attention processing on the iteration input feature of the decoding sub-component S i in the decoding sub-component S i, so as to obtain an iteration output feature of the decoding sub-component S i, and specifically is configured to perform the following operations:
In the decoding subassembly S i, performing cross attention processing on the iterative variable characteristic and the composite characteristic sequence to obtain an initial attention characteristic;
Normalizing the initial attention characteristic to obtain a normalized characteristic;
And carrying out forward propagation processing on the normalized features to obtain forward propagation features, fusing the normalized features and the forward propagation features to obtain target fusion features, and carrying out normalization processing on the target fusion features to obtain iterative output features of the decoding subassembly S i.
In one possible implementation, the processor 1001 is configured to perform, in the decoding subcomponent S i, cross-attention processing on the iteration variable feature and the composite feature sequence to obtain an initial attention feature, and specifically is configured to perform the following operations:
In the decoding subcomponent S i, the iterative variable feature is taken as a query vector in the cross-attention function, the composite feature sequence is taken as a key vector in the cross-attention function, and the composite feature sequence is taken as a value vector in the cross-attention function;
determining a product of the query vector and the key vector as a first fusion feature by a cross-attention function;
Acquiring the dimension number corresponding to the composite feature sequence, and determining the product of the first fusion feature and the reciprocal of the dimension number as a second fusion feature;
converting the second fusion feature to a first activation feature based on an activation sub-function in the cross-attention function;
the product of the first activation feature and the value vector is determined as a third fusion feature and the sum of the third fusion feature and the query vector is determined as an initial attention feature.
In one possible implementation, the processor 1001 is configured to normalize the initial attention feature to obtain a normalized feature, and is specifically configured to perform the following operations:
acquiring a characteristic value in the initial attention characteristic, and determining the mean value and the variance of the characteristic value;
Determining a difference between the initial attention feature and the mean as an offset feature;
acquiring a normalization constant, and performing square-root processing on the sum of the variance and the normalization constant to obtain a scaling value;
The product between the offset feature and the inverse of the scaled value is determined as the normalized feature.
In a possible implementation manner, the target joint model further includes a full connection layer, and the processor 1001 is configured to output confidence values corresponding to the N tag data respectively according to the cross-attention sequence, and set target tag data in the N tag data for the video data according to the confidence values, specifically configured to perform the following operations:
Performing confidence degree normalization processing on the cross attention sequence based on an activation function of the full connection layer to obtain confidence degree values corresponding to the N label data respectively;
And determining the confidence coefficient value which is larger than or equal to the confidence coefficient threshold value in the N confidence coefficient values as a target confidence coefficient value, and performing association setting on the tag data corresponding to the target confidence coefficient value and the video data.
In one possible implementation, the video data includes A video frames, the number of target tag data is H, A and H are both positive integers, the target tag data includes tag data M j, j is a positive integer less than or equal to H; the processor 1001 is further configured to:
based on A video frames in video data, generating R video blocks, and acquiring a video block characteristic sequence corresponding to the R video blocks and a target tag characteristic corresponding to tag data M j; r is a positive integer; the video block is composed of one or more video frames of the A video frames;
the video block feature sequence and the target tag feature are respectively subjected to cross attention processing through a decoder to obtain a video block attention sequence corresponding to tag data M j; each cross attention value in the video block attention sequence is used for indicating the association relationship between the tag data M j and one video block;
And outputting video block confidence values corresponding to the R video blocks respectively according to the video block attention sequence, acquiring target video blocks associated with the tag data M j from the R video blocks according to the video block confidence values, and setting the tag data M j for the target video blocks.
The processor 1001 is further configured to:
Inputting a video sample and a text sample associated with the video sample into an initial joint model; the initial joint model includes an initial encoder and an initial decoder;
Performing feature coding on the video sample and the text sample through an initial encoder in the initial joint model to obtain a video sample feature sequence corresponding to the video sample and text sample features corresponding to the text sample, and performing fusion processing on the video sample feature sequence and the text sample features to obtain a composite sample feature sequence;
Acquiring tag sample feature sequences corresponding to W tag samples, and performing cross attention processing on the composite sample feature sequences and the tag sample feature sequences through an initial decoder in an initial joint model to obtain cross attention sample sequences corresponding to the tag sample feature sequences; each cross-attention value in the sequence of cross-attention samples is used to indicate a degree of association between a video sample and one label sample; w is a positive integer;
and carrying out parameter adjustment on an initial encoder and an initial decoder in the initial joint model according to the cross attention sample sequence to obtain a target joint model, wherein the target joint model is used for carrying out multi-label prediction of the video.
In one possible implementation, the initial encoder includes a visual feature extraction component and a text feature extraction component; the processor 1001 is configured to perform parameter adjustment on an initial encoder and an initial decoder in an initial joint model according to a cross-attention sample sequence, so as to obtain a target joint model, and specifically is configured to perform the following operations:
Acquiring standard video features corresponding to the video samples, and carrying out parameter adjustment on the visual feature extraction component according to the standard video features and the video sample feature sequences;
Acquiring standard text features corresponding to the text samples, and performing parameter adjustment on the text feature extraction component according to the standard text features and the text sample features;
Acquiring a sample type label corresponding to a video sample, and generating a label loss value according to the sample type label and a cross attention sample sequence;
and carrying out parameter adjustment on an initial decoder in the initial joint model according to the label loss value to obtain a target joint model.
In one possible implementation, the processor 1001 is configured to generate a tag loss value according to the sample class tag and the cross-attention sample sequence, and specifically configured to:
outputting sample confidence values corresponding to the W label samples respectively according to the cross attention sample sequence;
carrying out logarithmic processing on the confidence coefficient values of the W samples respectively to obtain W confidence coefficient logarithmic values;
Based on the sample category labels and the sample confidence values, determining label determination values corresponding to the W label samples respectively; the label determination value comprises a first value or a second value, wherein the first value is used for indicating that the sample category labels contain a label identical to the label sample corresponding to the first value, and the second value is used for indicating that the sample category labels do not contain a label identical to the label sample corresponding to the second value;
And generating a label loss value according to the label determination values and the W confidence coefficient logarithmic values respectively corresponding to the W label samples.
Furthermore, it should be noted here that: embodiments of the present application further provide a computer readable storage medium storing a computer program, where the computer program is adapted to be loaded by the processor and execute the method provided by each step in fig. 3, fig. 4, or fig. 6, and specifically refer to the implementation manner provided by each step in fig. 3, fig. 4, or fig. 6, which is not described herein again. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the embodiments of the computer-readable storage medium according to the present application, please refer to the description of the method embodiments of the present application. As an example, a computer program can be deployed to be executed on one computer device or on multiple computer devices at one site or distributed across multiple sites and interconnected by a communication network.
The computer readable storage medium may be an apparatus provided in any of the foregoing embodiments or an internal storage unit of the computer device, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, etc. provided on the computer device. Further, the computer-readable storage medium may also include both internal storage units and external storage devices of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the various alternatives of fig. 3, fig. 4, or fig. 6; details are therefore not repeated here.
The terms "first", "second" and the like in the description, claims and drawings of embodiments of the application are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the term "include" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article, or device that comprises a list of steps or elements is not limited to those steps or elements, but may, in the alternative, include other steps or elements not listed or inherent to such process, method, apparatus, article, or device.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function and working together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in this description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and related apparatus provided in the embodiments of the present application are described with reference to the flowcharts and/or structural schematic diagrams provided in the embodiments of the present application, and each flow and/or block of the flowcharts and/or structural schematic diagrams, as well as combinations of flows and/or blocks therein, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable apparatus create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the structural schematic diagrams. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, and the instruction means implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the structural schematic diagrams. The computer instructions may also be transmitted over a computer-readable storage medium, for example from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. These computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions executed on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the structural schematic diagrams.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.
The modules in the device of the embodiment of the application can be combined, divided and deleted according to actual needs.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims (18)

1. A method of data processing, the method comprising:
Inputting video data and text data associated with the video data into a target joint model; the target joint model includes an encoder and a decoder;
Performing feature coding on the video data and the text data through the encoder in the target joint model to obtain a video feature sequence corresponding to the video data and text data features corresponding to the text data, and performing fusion processing on the video feature sequence and the text data features to obtain a composite feature sequence;
Acquiring tag feature sequences corresponding to N tag data, and performing cross attention processing on the composite feature sequences and the tag feature sequences through the decoder to obtain cross attention sequences corresponding to the tag feature sequences; each cross attention value in the cross attention sequence is used for indicating the association relation between the video data and one label data, and the number of the cross attention values in the cross attention sequence is N; n is a positive integer; in the cross attention processing process, the composite feature sequence is used as a key vector and a value vector in a cross attention function, and the label feature sequence is used as a query vector;
Outputting confidence values respectively corresponding to the N label data according to the cross attention sequence, and setting target label data in the N label data for the video data according to the confidence values; a confidence value is calculated from the cross-attention value of the corresponding tag data in the cross-attention sequence and the N cross-attention values in the cross-attention sequence.
2. The method according to claim 1, wherein the cross-attention processing, by the decoder, the composite feature sequence and the tag feature sequence to obtain a cross-attention sequence corresponding to the tag feature sequence includes:
in the decoder, performing cross attention processing on the composite feature sequence and the tag feature sequence to obtain an initial cross attention feature;
normalizing the initial cross attention feature to obtain a cross normalized feature;
And carrying out forward propagation processing on the cross normalized features to obtain cross forward features, fusing the cross normalized features and the cross forward features to obtain cross fusion features, and carrying out normalization processing on the cross fusion features to obtain cross attention sequences corresponding to the tag feature sequences.
3. The method of claim 1, wherein the decoder comprises M decoding subassemblies, M being a positive integer, the M decoding subassemblies comprising decoding subassemblies S i, i being a positive integer less than or equal to M; the cross attention processing is performed on the composite feature sequence and the tag feature sequence by the decoder to obtain a cross attention sequence corresponding to the tag feature sequence, including:
The iterative variable characteristics and the composite characteristic sequences corresponding to the decoding sub-assembly S i are used as iterative input characteristics of the decoding sub-assembly S i, and cross attention processing is carried out on the iterative input characteristics of the decoding sub-assembly S i in the decoding sub-assembly S i to obtain iterative output characteristics of the decoding sub-assembly S i; if the decoding subassembly S i is the first decoding subassembly of the M decoding subassemblies, the iteration variable feature corresponding to the decoding subassembly S i is the tag feature sequence;
Taking the iteration output characteristic of the decoding subassembly S i as the iteration variable characteristic corresponding to the decoding subassembly S i+1, taking the iteration variable characteristic corresponding to the decoding subassembly S i+1 and the composite characteristic sequence as the iteration input characteristic of the decoding subassembly S i+1, performing cross attention processing on the iteration input characteristic of the decoding subassembly S i+1 in the decoding subassembly S i+1 to obtain the iteration output characteristic of the decoding subassembly S i+1 until the iteration output characteristic of the last decoding subassembly in the M decoding subassemblies is obtained, and determining the iteration output characteristic of the last decoding subassembly as the cross attention sequence corresponding to the tag characteristic sequence; the decoding subassembly S i+1 is the next decoding subassembly of the decoding subassembly S i.
4. A method according to claim 3, wherein said taking the iteration variable feature and the composite feature sequence corresponding to the decoding subassembly S i as the iteration input feature of the decoding subassembly S i performs cross-attention processing on the iteration input feature of the decoding subassembly S i in the decoding subassembly S i to obtain the iteration output feature of the decoding subassembly S i, and includes:
in the decoding subassembly S i, performing cross attention processing on the iterative variable feature and the composite feature sequence to obtain an initial attention feature;
normalizing the initial attention feature to obtain a normalized feature;
And carrying out forward propagation processing on the normalized features to obtain forward propagation features, fusing the normalized features and the forward propagation features to obtain target fusion features, and carrying out normalization processing on the target fusion features to obtain iterative output features of the decoding subassembly S i.
5. The method according to claim 4, wherein said cross-attention processing of the iterative variable feature and the composite feature sequence in the decoding subassembly S i to obtain an initial attention feature comprises:
In the decoding subcomponent S i, the iterative variable feature is taken as a query vector in a cross-attention function, the composite feature sequence is taken as a key vector in the cross-attention function, and the composite feature sequence is taken as a value vector in the cross-attention function;
determining a product of the query vector and the key vector as a first fusion feature by the cross-attention function;
Acquiring the number of dimensions corresponding to the composite feature sequence, and determining the product of the first fusion feature and the reciprocal of the number of dimensions as a second fusion feature;
converting the second fusion feature to a first activation feature based on an activation sub-function in the cross-attention function;
the product of the first activation feature and the value vector is determined as a third fusion feature and the sum of the third fusion feature and the query vector is determined as an initial attention feature.
6. The method of claim 4, wherein normalizing the initial attention feature to obtain a normalized feature comprises:
acquiring a characteristic value in the initial attention characteristic, and determining the mean value and the variance of the characteristic value;
Determining a difference between the initial attention feature and the mean as an offset feature;
Obtaining a normalization constant, and extracting a square root of the sum of the variance and the normalization constant to obtain a scaling value;
The product between the offset feature and the inverse of the scaled value is determined as a normalized feature.
7. The method according to claim 1, wherein the target joint model further includes a full connection layer, the outputting confidence values respectively corresponding to the N tag data according to the cross-attention sequence, setting target tag data in the N tag data for the video data according to the confidence values, and including:
Performing confidence degree normalization processing on the cross attention sequence based on the activation function of the full connection layer to obtain confidence degree values corresponding to the N label data respectively;
And determining a confidence coefficient value which is larger than or equal to a confidence coefficient threshold value in the N confidence coefficient values as a target confidence coefficient value, and performing association setting on tag data corresponding to the target confidence coefficient value and the video data.
8. The method according to claim 1, wherein the obtaining tag feature sequences corresponding to the N tag data includes:
acquiring N label data, and splitting the N label data respectively to obtain word element sequences corresponding to the N label data respectively; one word element in the word element sequence refers to a minimum basic unit obtained after splitting processing of tag data;
Mapping N word element sequences through the encoder in the target joint model to obtain word embedded vector sequences respectively corresponding to the N word element sequences;
respectively carrying out weighted average processing on the N word embedded vector sequences to obtain tag data characteristics corresponding to the N tag data respectively;
And splicing the N tag data features to obtain tag feature sequences corresponding to the N tag data.
9. The method of claim 1, wherein the video data comprises a video frames, a being a positive integer; the feature encoding is performed on the video data and the text data by the encoder in the target joint model, so as to obtain a video feature sequence corresponding to the video data and a text data feature corresponding to the text data, which comprises the following steps:
Generating R video blocks based on A video frames in the video data, and projecting the R video blocks through the encoder in the target joint model to obtain image projection features corresponding to the R video blocks respectively, and adding video position information to the R image projection features respectively to obtain R image update features; the video position information refers to information of a position of a video block in the video data; r is a positive integer; the video block is composed of one or more video frames of the a video frames;
Performing feature coding on the R image updating features to obtain image data features corresponding to the R video blocks respectively, and generating a video feature sequence corresponding to the video data according to the image data features corresponding to the R video blocks respectively;
Splitting the text data into P data phrases by the encoder in the target joint model; p is a positive integer;
Vector conversion is carried out on the P data word groups to obtain word embedding vectors corresponding to the P data word groups respectively, text position information is added to the P word embedding vectors respectively, and P word updating vectors are obtained; the text position information refers to the information of the position of a data phrase in the text data;
And carrying out feature coding on the P word updating vectors to obtain data phrase features corresponding to the P data phrases respectively, and generating text data features corresponding to the text data according to the data phrase features corresponding to the P data phrases respectively.
10. The method of claim 1, wherein the video data comprises a number of a video frames, the number of the target tag data is H, a and H are both positive integers, the target tag data comprises tag data M j, j is a positive integer less than or equal to H; the method further comprises the steps of:
generating R video blocks based on A video frames in the video data, and acquiring video block feature sequences corresponding to the R video blocks and target tag features corresponding to the tag data M j; r is a positive integer; the video block is composed of one or more video frames of the a video frames;
the decoder is used for carrying out cross attention processing on the video block feature sequence and the target tag feature respectively to obtain a video block attention sequence corresponding to the tag data M j; each cross attention value in the video block attention sequence is used for indicating an association relationship between the tag data M j and one video block;
And outputting video block confidence values corresponding to the R video blocks respectively according to the video block attention sequence, acquiring target video blocks associated with the tag data M j from the R video blocks according to the video block confidence values, and setting the tag data M j for the target video blocks.
11. A method of data processing, the method comprising:
inputting a video sample and a text sample associated with the video sample into an initial joint model; the initial joint model comprises an initial encoder and an initial decoder;
Performing feature coding on the video sample and the text sample through the initial encoder in the initial joint model to obtain a video sample feature sequence corresponding to the video sample and text sample features corresponding to the text sample, and performing fusion processing on the video sample feature sequence and the text sample features to obtain a composite sample feature sequence;
Acquiring tag sample feature sequences corresponding to W tag samples, and performing cross attention processing on the composite sample feature sequences and the tag sample feature sequences through the initial decoder in the initial joint model to obtain cross attention sample sequences corresponding to the tag sample feature sequences; each cross-attention value in the sequence of cross-attention samples is used to indicate a degree of association between the video sample and one label sample, the number of cross-attention values in the sequence of cross-attention samples being W; w is a positive integer; in the cross attention processing process, the composite sample feature sequence is used as a key vector and a value vector in a cross attention function, and the label sample feature sequence is used as a query vector;
Performing parameter adjustment on the initial encoder and the initial decoder in the initial joint model according to the cross attention sample sequence to obtain a target joint model, wherein the target joint model is used for performing multi-label prediction of video; the initial decoder is subjected to parameter adjustment according to a label loss value generated from the label determination values and the W confidence coefficient logarithmic values respectively corresponding to the W label samples; the label determination values respectively corresponding to the W label samples are determined based on the sample category labels corresponding to the video samples and the sample confidence values respectively corresponding to the W label samples; a sample confidence value is calculated from the cross attention value of the corresponding label sample in the cross attention sample sequence and the W cross attention values in the cross attention sample sequence; the W confidence coefficient logarithmic values are obtained by respectively performing logarithmic processing on the W sample confidence values.
12. The method of claim 11, wherein the initial encoder comprises a visual feature extraction component and a text feature extraction component; the parameter adjustment is performed on the initial encoder and the initial decoder in the initial joint model according to the cross attention sample sequence, so as to obtain a target joint model, which comprises the following steps:
Acquiring standard video features corresponding to the video samples, and carrying out parameter adjustment on the visual feature extraction component according to the standard video features and the video sample feature sequences;
Acquiring standard text features corresponding to the text samples, and carrying out parameter adjustment on the text feature extraction component according to the standard text features and the text sample features;
acquiring a sample type label corresponding to the video sample, and generating a label loss value according to the sample type label and the cross attention sample sequence;
And carrying out parameter adjustment on the initial decoder in the initial joint model according to the label loss value to obtain a target joint model.
13. The method of claim 12, wherein the generating a tag loss value from the sample class tag and the sequence of cross-attention samples comprises:
outputting sample confidence values respectively corresponding to the W label samples according to the cross attention sample sequence;
carrying out logarithmic processing on the confidence coefficient values of the W samples respectively to obtain W confidence coefficient logarithmic values;
Determining tag determination values corresponding to the W tag samples respectively based on the sample category tags and the sample confidence values; the label determining value comprises a first value or a second value, wherein the first value is used for indicating that labels which are identical to label samples corresponding to the first value exist in the sample type labels, and the second value is used for indicating that labels which are identical to label samples corresponding to the second value do not exist in the sample type labels;
and generating a label loss value according to the label determination values and the W confidence coefficient logarithmic values respectively corresponding to the W label samples.
14. A data processing apparatus, the apparatus comprising:
A parameter input module for inputting video data and text data associated with the video data into a target joint model; the target joint model includes an encoder and a decoder;
The feature processing module is used for carrying out feature coding on the video data and the text data through the encoder in the target joint model to obtain a video feature sequence corresponding to the video data and text data features corresponding to the text data, and carrying out fusion processing on the video feature sequence and the text data features to obtain a composite feature sequence;
The cross attention processing module is used for acquiring tag feature sequences corresponding to N tag data, and carrying out cross attention processing on the composite feature sequences and the tag feature sequences through the decoder to obtain cross attention sequences corresponding to the tag feature sequences; each cross attention value in the cross attention sequence is used for indicating the association degree between the video data and one label data, and the number of the cross attention values in the cross attention sequence is N; n is a positive integer; in the cross attention processing process, the composite feature sequence is used as a key vector and a value vector in a cross attention function, and the label feature sequence is used as a query vector;
The label setting module is used for outputting confidence values corresponding to the N label data respectively according to the cross attention sequence, and setting target label data in the N label data for the video data according to the confidence values; a confidence value is calculated from the cross-attention value of the corresponding tag data in the cross-attention sequence and the N cross-attention values in the cross-attention sequence.
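At inference time (claim 14), each confidence value is computed from the corresponding cross-attention value together with all N cross-attention values, and target tag data are then set for the video according to these confidences. The sketch below assumes, for illustration only, a softmax normalisation and a fixed confidence threshold; neither is specified by the claim.

```python
import torch

def select_target_labels(cross_attn_values: torch.Tensor,
                         label_names: list,
                         threshold: float = 0.1) -> list:
    """Illustrative label-setting step for claim 14."""
    # Each confidence depends on the corresponding cross-attention value and on
    # all N values; softmax is one normalisation with that property (assumed).
    confidences = torch.softmax(cross_attn_values, dim=-1)   # N confidence values
    return [name for name, c in zip(label_names, confidences) if c.item() >= threshold]

# Example: N = 4 candidate label data for one video.
labels = ["gaming", "music", "sports", "cooking"]
attn = torch.tensor([3.2, 0.1, 2.8, -1.0])
print(select_target_labels(attn, labels))  # ['gaming', 'sports']
```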
15. A data processing apparatus, the apparatus comprising:
The parameter training module is used for inputting a video sample and a text sample associated with the video sample into the initial joint model; the initial joint model comprises an initial encoder and an initial decoder;
the feature training module is used for carrying out feature coding on the video sample and the text sample through the initial encoder in the initial joint model to obtain a video sample feature sequence corresponding to the video sample and text sample features corresponding to the text sample, and carrying out fusion processing on the video sample feature sequence and the text sample features to obtain a composite sample feature sequence;
The cross attention training module is used for acquiring tag sample feature sequences corresponding to W tag samples, and carrying out cross attention processing on the composite sample feature sequences and the tag sample feature sequences through the initial decoder in the initial joint model to obtain cross attention sample sequences corresponding to the tag sample feature sequences; each cross-attention value in the sequence of cross-attention samples is used to indicate a degree of association between the video sample and one label sample, the number of cross-attention values in the sequence of cross-attention samples being W; w is a positive integer; in the cross attention processing process, the composite sample feature sequence is used as a key vector and a value vector in a cross attention function, and the label sample feature sequence is used as a query vector;
The parameter adjustment module is used for carrying out parameter adjustment on the initial encoder and the initial decoder in the initial joint model according to the cross attention sample sequence to obtain a target joint model, the target joint model being used for carrying out multi-label prediction for videos; the parameter adjustment of the initial decoder is performed according to a label loss value generated from the label determination values and the confidence logarithm values respectively corresponding to the W label samples; the label determination values respectively corresponding to the W label samples are determined based on the sample category labels corresponding to the video sample and the sample confidence values respectively corresponding to the W label samples; a sample confidence value is calculated from the cross attention value of the corresponding label sample in the cross attention sample sequence and the W cross attention values in the cross attention sample sequence; the W confidence logarithm values are obtained by performing logarithmic processing on the W sample confidence values respectively.
16. A computer device, comprising a processor, a memory, and an input-output interface;
the processor is connected to the memory and the input-output interface, respectively, wherein the input-output interface is used for receiving data and outputting data, the memory is used for storing a computer program, and the processor is used for calling the computer program to enable the computer device to execute the method of any one of claims 1-13.
17. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method of any of claims 1-13.
18. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the method of any of claims 1-13.
CN202410558896.9A 2024-05-08 2024-05-08 Data processing method, device, computer, storage medium and program product Active CN118135466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410558896.9A CN118135466B (en) 2024-05-08 2024-05-08 Data processing method, device, computer, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410558896.9A CN118135466B (en) 2024-05-08 2024-05-08 Data processing method, device, computer, storage medium and program product

Publications (2)

Publication Number Publication Date
CN118135466A CN118135466A (en) 2024-06-04
CN118135466B true CN118135466B (en) 2024-07-23

Family

ID=91244455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410558896.9A Active CN118135466B (en) 2024-05-08 2024-05-08 Data processing method, device, computer, storage medium and program product

Country Status (1)

Country Link
CN (1) CN118135466B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115905605A (en) * 2021-09-29 2023-04-04 腾讯科技(深圳)有限公司 Data processing method, data processing equipment and computer readable storage medium
CN116304184A (en) * 2023-03-17 2023-06-23 阿里巴巴(中国)有限公司 Video classification model, training method, classification method, apparatus, and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642536B (en) * 2021-10-13 2021-12-24 腾讯科技(深圳)有限公司 Data processing method, computer device and readable storage medium
CN116541556A (en) * 2022-01-07 2023-08-04 腾讯科技(深圳)有限公司 Label determining method, device, equipment and storage medium
US20240020337A1 (en) * 2022-07-12 2024-01-18 Adobe Inc. Multimodal intent discovery system
CN116310643A (en) * 2023-03-17 2023-06-23 北京百度网讯科技有限公司 Video processing model training method, device and equipment
CN116128043B (en) * 2023-04-17 2023-07-18 中国科学技术大学 Training method of video scene boundary detection model and scene boundary detection method
CN116821417B (en) * 2023-08-28 2023-12-12 中国科学院自动化研究所 Video tag sequence generation method and device
CN117972138B (en) * 2024-04-02 2024-07-23 腾讯科技(深圳)有限公司 Training method and device for pre-training model and computer equipment

Also Published As

Publication number Publication date
CN118135466A (en) 2024-06-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant