CN116977701A - Video classification model training method, video classification method and device


Info

Publication number
CN116977701A
Authority
CN
China
Prior art keywords: video, tag, text, image, information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310499277.2A
Other languages
Chinese (zh)
Inventor
汪俊明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310499277.2A
Publication of CN116977701A
Legal status: Pending

Classifications

    • G (Physics); G06 (Computing; calculating or counting); G06V (Image or video recognition or understanding); G06F (Electric digital data processing)
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning; using classification, e.g. of video objects
    • G06F 40/279: Handling natural language data; natural language analysis; recognition of textual entities
    • G06F 40/289: Handling natural language data; natural language analysis; phrasal analysis, e.g. finite state techniques or chunking
    • G06V 10/44: Extraction of image or video features; local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning; using neural networks
    • G06V 20/41: Scenes; scene-specific elements in video content; higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46: Scenes; scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The application provides a video classification model training method, a video classification method and a video classification device, and relates to the field of artificial intelligence. The method for training the video classification model comprises the following steps: acquiring text data and image data from a video sample; acquiring at least one of a single-mode tag and a multi-mode tag according to the text data and the image data; wherein the single-mode tag includes a tag representing video content using text information or image information, and the multi-mode tag includes a tag representing video content using text information and image information; determining a training sample, the training sample comprising a video sample and a sample tag, the sample tag comprising at least one of a single-mode tag and a multi-mode tag; and updating parameters of the video classification model according to the training sample to obtain a trained video classification model. The embodiment of the application can be beneficial to improving the efficiency and quality of video annotation.

Description

Video classification model training method, video classification method and device
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to a video classification model training method, a video classification method and a device.
Background
With the rapid development of internet technology, multimedia applications are becoming more and more widespread and the number of videos is growing rapidly, so that users can browse all kinds of videos through various multimedia platforms. In order to enable users to find videos of interest among a huge amount of videos, video content understanding is typically performed to identify key information in the videos. An important link in video content understanding is extracting information from a video as tags; the tags can then be used to help users search for videos, to help recommendation systems recommend videos, and to assist in commercializing the content.
Video tags are usually obtained by manually labeling videos, and the labeled videos are used to train a classification model so that the model can accurately classify and identify videos. However, the problems of the traditional manual labeling approach are becoming more pronounced. On the one hand, manual labeling has an efficiency bottleneck: efficiency is difficult to improve, and the gap between the time needed to label videos and the speed at which videos are produced easily leads to a backlog of videos, making timeliness hard to guarantee and affecting business efficiency. On the other hand, the quality of manually assigned labels is unstable: it depends heavily on how deeply the cataloguing personnel understand the video content and key figures, and the quality of the results varies from person to person, so consistency is difficult to maintain.
Disclosure of Invention
The application provides a video classification model training method, a video classification method and a video classification device, which can help to improve the efficiency and quality of video annotation.
In a first aspect, an embodiment of the present application provides a method for training a video classification model, including:
acquiring text data and image data from a video sample;
acquiring at least one of a single-mode tag and a multi-mode tag according to the text data and the image data; wherein the single-mode tag includes a tag representing video content using text information or image information, and the multi-mode tag includes a tag representing video content using text information and image information;
determining a training sample, the training sample comprising the video sample and a sample tag, the sample tag comprising at least one of the single-mode tag and the multi-mode tag;
and carrying out parameter updating on the video classification model according to the training sample to obtain the trained video classification model.
In a second aspect, an embodiment of the present application provides a method for classifying video, including:
acquiring image features, text features and voice features of a video to be identified;
Inputting the image features, the text features and the voice features into a video classification model to obtain class labels of the videos to be identified; the video classification model comprises a multi-head attention module and a dynamic graph convolution network; the multi-head attention module is used for inputting the image features, the text features and the voice features and obtaining fusion features; the dynamic graph convolution network is used for inputting the fusion characteristics to obtain the class labels; the video classification model is trained according to the method of the first aspect.
In a third aspect, an embodiment of the present application provides an apparatus for training a video classification model, including:
an acquisition unit for acquiring text data and image data from a video sample;
the acquisition unit is further used for acquiring at least one of a single-mode tag and a multi-mode tag according to the text data and the image data; wherein the single-mode tag includes a tag representing video content using text information or image information, and the multi-mode tag includes a tag representing video content using text information and image information;
a determining unit configured to determine a training sample, where the training sample includes the video sample and a sample tag, and the sample tag includes at least one of the single-mode tag and the multi-mode tag;
And the training unit is used for updating parameters of the video classification model according to the training sample to obtain the trained video classification model.
In a fourth aspect, an embodiment of the present application provides an apparatus for video classification, including:
the acquisition unit is used for acquiring image features, text features and voice features of the video to be identified;
the video classification model is used for inputting the image features, the text features and the voice features to obtain class labels of the videos to be identified; the video classification model comprises a multi-head attention module and a dynamic graph convolution network; the multi-head attention module is used for inputting the image features, the text features and the voice features and obtaining fusion features; the dynamic graph convolution network is used for inputting the fusion characteristics to obtain the class labels; the video classification model is trained according to the method of the first aspect.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: a processor and a memory for storing a computer program, the processor being for invoking and running the computer program stored in the memory for performing the method as in the first or second aspect.
In a sixth aspect, embodiments of the application provide a computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform a method as in the first or second aspect.
In a seventh aspect, embodiments of the present application provide a computer program product comprising computer program instructions for causing a computer to perform the method as in the first or second aspect.
In an eighth aspect, embodiments of the present application provide a computer program that causes a computer to perform the method as in the first or second aspect.
According to the technical scheme, the embodiment of the application obtains the video tags of different modes by obtaining at least one of the single-mode tag and the multi-mode tag according to the text data and the image data, and the video tags of different modes can represent video information from different dimensions of the video, so that the video information of different dimensions are mutually supplemented and cooperated, the video content can be comprehensively and completely understood, the quality of video labeling is improved, and the video classification model can accurately classify and identify the video. In addition, the embodiment of the application can acquire the video tag by fusing the multi-mode information without manually marking the video, thereby being beneficial to improving the video marking efficiency.
Drawings
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present application;
FIG. 2 is a schematic flow chart of a method of video classification model training according to an embodiment of the application;
FIG. 3 is a schematic diagram of a network architecture according to an embodiment of the present application;
FIG. 4 is a schematic diagram of another network architecture according to an embodiment of the present application;
fig. 5 is a specific example of an image tag of image data according to an embodiment of the present application;
FIG. 6 is a schematic diagram of another network architecture according to an embodiment of the present application;
FIG. 7 is a specific example of feature fusion according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a Transformer network according to an embodiment of the present application;
fig. 9 is a specific example of video slicing according to an embodiment of the present application;
FIG. 10 is a specific example of content recommendation according to an embodiment of the present application;
FIG. 11 is a schematic flow chart diagram of a method of video classification according to an embodiment of the application;
FIG. 12 is a schematic block diagram of an apparatus for video classification model training in accordance with an embodiment of the application;
FIG. 13 is a schematic block diagram of an apparatus for video classification according to an embodiment of the present application;
fig. 14 is a schematic block diagram of an electronic device in accordance with an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application.
It should be understood that in embodiments of the present application, "B corresponding to a" means that B is associated with a. In one implementation, B may be determined from a. It should also be understood that determining B from a does not mean determining B from a alone, but may also determine B from a and/or other information.
In the description of the present application, unless otherwise indicated, "at least one" means one or more, and "a plurality" means two or more. In addition, "and/or" describes an association relationship of the association object, and indicates that there may be three relationships, for example, a and/or B may indicate: a alone, a and B together, and B alone, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
It should be further understood that the description of the first, second, etc. in the embodiments of the present application is for illustration and distinction of descriptive objects, and is not intended to represent any limitation on the number of devices in the embodiments of the present application, nor is it intended to constitute any limitation on the embodiments of the present application.
It should also be appreciated that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the application. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiment of the application is applied to the technical field of artificial intelligence.
Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
With research and advancement of artificial intelligence technology, research and application of artificial intelligence technology is being developed in various fields, such as common smart home, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned, automatic driving, unmanned aerial vehicles, robots, smart medical treatment, smart customer service, etc., and it is believed that with the development of technology, artificial intelligence technology will be applied in more fields and with increasing importance value.
Embodiments of the present application may relate to natural language processing (Natural Language Processing, NLP) in artificial intelligence technology, an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e. the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graph techniques, and the like.
The embodiment of the application can relate to Computer Vision (CV) technology in artificial intelligence technology. Computer vision is a science that studies how to make a machine "see"; it refers to using cameras and computers instead of human eyes to recognize, track and measure targets and to perform further graphic processing, so that the processed result becomes an image more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
The embodiment of the application can also relate to Machine Learning (ML) in artificial intelligence technology. ML is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specifically studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental approach to giving computers intelligence, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning, and the like.
Embodiments of the present application may also relate to Multi-modal Learning (Multi-modal Learning) in artificial intelligence techniques, which refers to using a variety of different types of data as input in machine Learning to improve model accuracy and performance. The data may be information from different sources, of different types, of different structures, such as text, images, video, audio, etc. In multimodal learning, different types of data need to be fused and integrated to extract useful feature information and reduce redundant information. By utilizing a variety of different types of data, more comprehensive and accurate information can be obtained, thereby improving the performance and robustness of the model.
At present, the related technology trains a video classification model in a mode of manually labeling videos, so that the video classification model can accurately conduct label prediction on the videos. However, the problem of the traditional manual video labeling method is more obvious, and the problems of efficiency bottleneck and unstable quality exist.
In order to solve the technical problems, the embodiment of the application provides a method for training a video classification model, a video classification method and a device, which can help to improve the efficiency and quality of video annotation.
Specifically, text data and image data may be obtained from a video sample, and at least one of a single-mode tag and a multi-mode tag may be obtained from the text data and the image data. Wherein the single-mode tag includes a tag representing video content using text information or image information, and the multi-mode tag includes a tag representing video content using text information and image information. A training sample is then determined, the training sample comprising a video sample and a sample tag, the sample tag comprising at least one of a single-mode tag and a multi-mode tag. And finally, carrying out parameter updating on the video classification model according to the training sample to obtain a trained video classification model.
According to the embodiment of the application, at least one of the single-mode tag and the multi-mode tag is obtained according to the text data and the image data, and the video tags of different modes are obtained, and as the tags of different modes can represent the video information from different dimensions of the video, the video information of different dimensions are mutually supplemented and cooperated, so that the video content can be comprehensively and completely understood, the quality of video annotation is improved, and the video classification model can accurately classify and identify the video. In addition, the embodiment of the application can acquire the video tag by fusing the multi-mode information without manually marking the video, thereby being beneficial to improving the video marking efficiency.
Fig. 1 shows a schematic diagram of an application scenario according to an embodiment of the present application.
As shown in fig. 1, the application scenario involves a server 1 and a terminal device 2, and the terminal device 2 may communicate data with the server 1 through a communication network. The server 1 may be a background server of the terminal device 2.
The terminal device 2 may be, for example, a device with rich man-machine interaction, internet access capability, various operating systems, and strong processing capability. The terminal device may be a terminal device such as a smart phone, a tablet computer, a portable notebook computer, a desktop computer, a wearable device, a vehicle-mounted device, etc., but is not limited thereto. Optionally, in the embodiment of the present application, the terminal device 2 is installed with a video playing application program or an application program with a video playing function.
The server 1 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligent platforms, and the like. Servers may also become nodes of the blockchain.
The server may be one or more. Where the servers are multiple, there are at least two servers for providing different services and/or there are at least two servers for providing the same service, such as in a load balancing manner, as embodiments of the application are not limited in this respect.
The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application. The present application does not limit the number of servers or terminal devices. The scheme provided by the application can be independently completed by the terminal equipment, can be independently completed by the server, and can be completed by the cooperation of the terminal equipment and the server, and the application is not limited to the scheme.
In the present embodiment, the server 1 is connected to the terminal device 2 via a network. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a global system for mobile communications (Global System of Mobile communication, GSM), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA), a 4G network, a 5G network, bluetooth (Bluetooth), wi-Fi, a telephony network, etc.
It should be understood that fig. 1 is only an exemplary illustration, and does not specifically limit the application scenario of the embodiment of the present application. For example, fig. 1 illustrates one terminal device, one server, and may actually include other numbers of terminal devices and servers, which the present application is not limited to.
The embodiment of the application can be applied to any application scene in which video content needs to be understood, such as video content recommendation, video stripping, video search, video recall and the like. In the embodiment of the application, comprehensive understanding of video content can be achieved by combining the picture scenes, subtitle information, audio content and the like in the video images, and multi-level cataloging results such as a program layer, a segment layer, a scene layer and a shot layer can be output intelligently, which may include content such as an abstract, a cover, a classification, character labels, place labels and scene labels; the granularity and domain coverage of these labels is more comprehensive and complete than that of manually assigned labels. Based on the output multi-level and multi-granularity labels or classifications, accurate searching, recommendation or recall of the associated video content can be realized, or accurate segmentation of the video can be realized.
The following describes the technical scheme of the embodiments of the present application in detail through some embodiments. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 2 is a schematic flow chart of a method 200 for training a video classification model according to an embodiment of the application, the method 200 may be performed by any electronic device having data processing capabilities, for example, the electronic device may be implemented as a server or a terminal device, for example, the server 1 or the terminal device 2 in fig. 1, which is not limited in this regard. As shown in fig. 2, method 200 includes steps 210 through 240.
At 210, text data and image data are obtained from the video samples.
The video samples may include data such as titles, subtitles, audio and images, from which data of various modes can be obtained. Text mode data (i.e., text data) can be obtained from the title, subtitles, audio, etc., and image mode data (i.e., image data) can be obtained from the video frames. The embodiment of the application can generate the video tag based on the text data and the image data.
In some embodiments, the text data may include at least one of a video title, optical character recognition OCR data, and voice recognition data.
In particular, a video title is typically a subjective description of the content of a video presentation by a video publisher, and may typically encompass the high-level semantics that the video is intended to express. The optical character recognition (optical character recognition, OCR) data may include text extracted from the OCR of the video frames, such as at least one of video description, subtitles, captioning, background text, and the like. OCR data can also reflect semantic information of video content. Alternatively, the OCR data may be denoised, for example, filtering text with small offset between adjacent frames and high text content repetition rate.
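As an illustration of the OCR denoising mentioned above, the following is a minimal Python sketch under assumed inputs (each OCR item carries a frame index, a bounding box and a text string); the thresholds and the repetition measure are illustrative assumptions rather than the exact rule of the embodiment.
```python
# A minimal sketch of the OCR denoising heuristic described above. The thresholds
# and the structure of each OCR item are assumptions made only for illustration.
def filter_ocr(ocr_items, max_offset=5.0, max_repeat_ratio=0.8):
    """Drop OCR text that barely moves between adjacent frames and is highly repetitive.

    ocr_items: list of dicts like {"frame": int, "box": (x, y, w, h), "text": str}.
    """
    kept = []
    prev = None
    for item in sorted(ocr_items, key=lambda it: it["frame"]):
        if prev is not None:
            dx = abs(item["box"][0] - prev["box"][0])
            dy = abs(item["box"][1] - prev["box"][1])
            # Character-level overlap ratio between texts of adjacent frames.
            common = len(set(item["text"]) & set(prev["text"]))
            repeat = common / max(len(set(item["text"])), 1)
            if dx <= max_offset and dy <= max_offset and repeat >= max_repeat_ratio:
                prev = item
                continue  # likely a static watermark or a repeated caption
        kept.append(item)
        prev = item
    return kept
```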
In addition, video typically contains audio data, which may be the dialect or parallactic of a person in the video, which also typically reflects semantic information of the video content. The speech recognition data may be obtained by automatic speech recognition (Automatic Speech Recognition, ASR) of the audio data in the video.
In some embodiments, the image data may be one or more frames of images extracted from the video, as the application is not limited in this regard.
220, obtaining at least one of a single-mode tag and a multi-mode tag according to the text data and the image data; wherein the single-mode tag includes a tag representing video content using text information or image information, and the multi-mode tag includes a tag representing video content using text information and image information.
Because the labels of different modes can represent video information from different dimensions of the video, the video information of different dimensions are mutually supplemented and cooperated, so that the comprehensive and complete understanding of video content can be facilitated, and the quality of video annotation is improved.
In some embodiments, keyword information may be obtained from the text data and the text labels determined using the keyword information and the knowledge-graph. Wherein the single-mode tag comprises the text tag. Wherein the keyword information may include at least one of an entity keyword, an abstract keyword, and a search keyword, and the text label may include different granularity classification information of the keyword information and the keyword information.
For example, referring to fig. 3, text data may be input into the text label recognition module 301 to recognize at least one of entity keywords, abstract keywords, and search keywords in the text data. Then, the identified keywords are input into the knowledge-graph module 302 for label classification, so as to obtain at least one text label. Optionally, the knowledge-graph module 302 may also perform tag error correction.
For example, the text label recognition module 301 may extract entity keywords of text data. For example, the text label recognition module 301 may perform named entity recognition on the input text data to obtain entity keywords. Named entities may include people's names, places, institutions, time, etc. Optionally, in order to more precisely describe video content, the embodiment of the present application extends the broad class of named entities, such as events, identities, brands, products, etc.
The text label recognition module 301 may also extract abstract keywords of the text data. For example, the text tag recognition module 301 may intelligently analyze words in text data that have a general description, or high frequency words, which may be output as core keywords (abstract keywords) of the video content, through NLP.
The text label recognition module 301 may also extract search keywords of the text data. For example, the text tag recognition module 301 may support a search function for outputting a tag matching text data as a search keyword by performing a search for a tag library.
Illustratively, the knowledge-graph module 302 describes a large number of entities, entity attributes, and relationships in the objective world in a structured form. Knowledge of a large number of entities, properties, and relationships is organized into a connected network, which may be referred to as a knowledge base. After obtaining the keyword information, the keyword information may be input into a knowledge graph module 302, and attribute information associated with each keyword may be obtained through a knowledge graph (i.e., a knowledge base), where the attribute information may include different granularity classification information of the keyword information.
Alternatively, the different granularity classification information may include first granularity classification information and second granularity classification information, the first granularity classification information including at least one of a person name, a place name, an organization name, a time, an event, an identity, a brand, and a product, and the second granularity classification information including fine-granularity classification information of the first granularity classification information. For example, a person name may be subdivided into literary celebrity, sports star, etc., and a place name may be further subdivided into country, city, province, capital/provincial capital, scenic spot, landmark, etc.
Therefore, the embodiment of the application can be beneficial to better understanding and describing the video content by classifying the keywords with fine granularity on the text data, so that the text label can more accurately and comprehensively embody the video content, thereby improving the label quality.
Optionally, the knowledge-graph module 302 may also perform error correction on the input keyword information. For example, the knowledge-graph module 302 corrects phonetically similar and visually similar words in the keyword information through its label prediction or relationship prediction function.
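To make the flow through the text label recognition module 301 and the knowledge graph module 302 concrete, the following is a minimal sketch in which the keyword extractor and the tiny in-memory "knowledge graph" are placeholders; a real system would use trained named entity recognition models and a full knowledge base.
```python
# A simplified sketch of keyword extraction followed by a knowledge-graph lookup
# that attaches coarse and fine granularity classification information to each
# keyword. The tiny dictionary and the substring-based extractor are placeholders.
TOY_KNOWLEDGE_GRAPH = {
    "ambulance": {"coarse": "product", "fine": "emergency vehicle"},
    "Shenzhen": {"coarse": "place name", "fine": "city"},
}

def extract_keywords(text):
    # Placeholder for named entity recognition / abstract keyword extraction.
    return [w for w in TOY_KNOWLEDGE_GRAPH if w.lower() in text.lower()]

def text_labels(text):
    labels = []
    for kw in extract_keywords(text):
        info = TOY_KNOWLEDGE_GRAPH[kw]
        labels.append({"keyword": kw,
                       "first_granularity": info["coarse"],
                       "second_granularity": info["fine"]})
    return labels

print(text_labels("An ambulance in Shenzhen was stuck in traffic."))
```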
In some embodiments, entity information may also be obtained from the image data and used to determine an image tag. Wherein the single mode tag comprises the image tag. Alternatively, the image tag may include classification information of the entity information and a vertical type tag to which the entity information belongs.
In particular, image tagging is one of the most common tasks in the field of computer vision, for which a large number of public data sets and pre-trained models exist. However, when the tag systems and pre-trained models of these datasets are applied directly to media scenes, their effectiveness metrics are significantly lower than in the original public dataset scenarios. Based on this, the embodiment of the application constructs a vertical label system for broadcast and television media scenes, which can accurately and completely describe the labels with the greatest application value for each specific scene in broadcast and television multimedia scenarios.
For example, referring to fig. 4, image data (such as one or more frames of images) may be input to the image recognition module 401 to recognize entity information in the image data. The identified entity information is then input to the knowledge-graph module 402 for label classification to obtain at least one image label.
For example, the image recognition module 401 may obtain the entity information of the image data by performing face recognition or object detection on the image data. By way of example, the entity information may include at least one of an event, a scene, a person, and an item. After obtaining the entity information, the entity information may be input into the knowledge graph module 402, and attribute information associated with each piece of entity information may be obtained through the knowledge graph, where the attribute information may include classification information and a vertical label of the entity information. Here, the classification information may include at least one of an event, a scene, a person, and an article, and the vertical type tag may include at least one of a news image tag, a process image tag, a converged-media image tag, a sports image tag, a movie image tag, and a general image tag.
Fig. 5 shows a specific example of image tags of image data. (a) The entity information of image 1 includes street, crosswalk, sky, building, multiple people, pedestrian, vehicle, automobile, etc.; the classification labels of street, crosswalk, sky and building are scene labels, the classification labels of multiple people and pedestrian are character labels, and the classification labels of vehicle and automobile are article labels; further, the vertical label of the entity information in image 1 is a general image label. (b) The entity information of image 2 includes meeting, welcome/cheering, handshake, meeting room, indoor, multiple people, etc.; the classification labels of meeting, welcome/cheering and handshake are event labels, the classification labels of meeting room and indoor are scene labels, and the classification label of multiple people is a character label; further, the vertical label of the entity information in image 2 is a news image label. (c) The entity information of image 3 includes drama performance, stage, multiple people, costume, peach, etc.; the classification labels of drama performance and stage performance are event labels, the classification label of stage is a scene label, the classification label of multiple people is a character label, and the classification labels of costume and peach are article labels.
Therefore, the embodiment of the application obtains the classification information and the vertical label of the image data by classifying the image data, which is favorable for better understanding and describing the video content, so that the image label can more accurately and comprehensively embody the video content, thereby improving the label quality.
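The following is a small sketch of how detected entity information could be mapped to classification information and a vertical label, in the spirit of the description above; the entity-to-category table and the vertical-label rule are illustrative assumptions.
```python
# A minimal sketch that turns detected entities into image labels carrying
# classification information plus a vertical-domain tag. The category table and
# the toy rule for choosing the vertical tag are assumptions for illustration.
ENTITY_CATEGORY = {
    "meeting": "event", "handshake": "event",
    "meeting room": "scene", "street": "scene",
    "pedestrian": "person", "car": "item",
}

def image_labels(detected_entities):
    labels = [{"entity": e, "category": ENTITY_CATEGORY.get(e, "unknown")}
              for e in detected_entities]
    # Toy rule: event-heavy frames are treated as news imagery, otherwise general.
    vertical = "news image" if any(l["category"] == "event" for l in labels) else "general image"
    return labels, vertical

print(image_labels(["meeting", "meeting room", "handshake"]))
```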
In some embodiments, a text vector representation may be derived from text data and an image vector representation may be derived from image data. And inputting the text vector representation and the image vector representation into a neural network model to obtain the multi-mode label.
The text vector representation serves as the text feature of the video, and the image vector representation serves as the image feature of the video. The text feature and the image feature of the video can be fused through the neural network model, the related parts and the specific parts of the text mode and the image mode are respectively extracted, and the relevance and complementarity of the two modes are fully utilized to generate the multi-mode label in a generative manner. The multi-mode label is a word or phrase capable of summarizing the core information of the video, and its information is more condensed. Meanwhile, the multi-mode label obtained generatively can break through the limitation of an inherent label system and produce labels that do not exist in the label system.
In some embodiments, the neural network model includes a Transformer network. Referring to fig. 6, a video frame may be input into the image embedding (Embedding) representation module 601 to obtain an image vector representation, text data may be input into the text embedding representation module 602 to obtain a text vector representation, and the image vector representation and the text vector representation are input together into the Transformer network 603 to obtain the multi-mode label.
As one implementation, deriving a text vector representation from text data may be implemented as: obtaining the text vector representation according to the text label constructed from the text data; deriving an image vector representation from image data may be implemented as: obtaining the image vector representation according to the image label constructed from the image data.
For example, the text label may be input to the text embedding representation module to obtain a text vector representation, the image label may be input to the image embedding representation module to obtain an image vector representation, and then the text vector representation and the image vector representation may be feature-fused to obtain the multi-mode label. Fig. 7 shows a specific example of feature fusion. As shown in fig. 7, for one video, a video image, OCR text and ASR text may be extracted, wherein the video image includes at least one video frame; the OCR text is text data obtained by performing OCR text recognition on the video, for example, "A city in a certain province: an ambulance is caught in congestion and two girls step forward to 'open a road'"; the ASR text is text data obtained by ASR recognition of the audio data, for example, "On September 20, an ambulance in a certain city encountered a traffic jam while out on a mission, and two girls walked ahead to clear the way." Image classification is performed on the video frames, for example obtaining at least one image label through CV recognition and the knowledge graph, and text classification is performed on the OCR text and the ASR text, for example obtaining at least one text label through text label recognition and the knowledge graph. Furthermore, the image label is embedded to obtain an image vector representation, the text label is embedded to obtain a text vector representation, and the image vector representation and the text vector representation are feature-fused to obtain the video label, such as "public-welfare mutual assistance".
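The following is a minimal PyTorch sketch of the generative fusion described above: image labels and text labels are embedded, concatenated and fed to a Transformer encoder-decoder that generates multi-mode tag tokens. The vocabulary sizes, dimensions and layer counts are arbitrary assumptions for illustration.
```python
import torch
import torch.nn as nn

# A sketch of label-embedding fusion with a Transformer encoder-decoder.
# Vocabulary sizes and dimensions are invented; this is not the exact model.
class MultiModalTagGenerator(nn.Module):
    def __init__(self, label_vocab=1000, tag_vocab=5000, d_model=256):
        super().__init__()
        self.label_emb = nn.Embedding(label_vocab, d_model)   # shared for image/text labels
        self.tag_emb = nn.Embedding(tag_vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
        self.out = nn.Linear(d_model, tag_vocab)

    def forward(self, image_label_ids, text_label_ids, tag_ids):
        src = torch.cat([self.label_emb(image_label_ids),
                         self.label_emb(text_label_ids)], dim=1)  # (B, L_img+L_txt, d)
        tgt = self.tag_emb(tag_ids)                               # previously generated tokens
        fused = self.transformer(src, tgt)
        return self.out(fused)                                    # logits over the tag vocabulary

model = MultiModalTagGenerator()
logits = model(torch.randint(0, 1000, (2, 4)),   # image label ids
               torch.randint(0, 1000, (2, 6)),   # text label ids
               torch.randint(0, 5000, (2, 3)))   # partially generated tag tokens
print(logits.shape)  # torch.Size([2, 3, 5000])
```
In practice the decoder would be run step by step, feeding back the previously generated tokens, consistent with the encoder-decoder description that follows.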
As one implementation, the Transformer network architecture mainly includes an Encoder and a Decoder. The encoder is used for processing the input and the decoder is used for generating the output. The encoder is formed of multiple identical layers, each layer containing a fully connected feed-forward network and a multi-head attention mechanism. The encoder encodes the word vectors in the input sequence into context vectors, and the encoding of each position is independent of the encoding of other positions, so parallel computation is possible and the Transformer is efficient when processing long sequences. The decoder is also composed of multiple identical layers, each layer including a fully connected feed-forward network and two multi-head attention mechanisms. The input to the decoder is the output of the encoder and the previously generated tokens. Unlike the encoder, the decoder needs to attend to the output of each position of the encoder before generating each token, in order to integrate the relevant information into the currently generated token. Thus, each layer of the decoder contains an additional multi-head attention mechanism for performing attention operations on the encoder output. The encoder and the decoder respectively perform encoding and decoding operations and cooperate with each other to form the core components of the Transformer network, achieving remarkable performance improvements on different tasks.
Referring to fig. 8, the image vector representation and the text vector representation of the video are input to the encoder, which encodes the input vector of each position to obtain a context vector for each position. The output of the encoder and the previously generated tokens of each position (e.g., Chinese costume, girl, sled, snow) are input to the decoder, which blends the context vectors into the currently generated token to obtain an output token for each position (e.g., Chinese costume, girl, sled, snow).
The multi-head attention module parallelizes the computation over the inputs Q, K and V, where Q, K and V are vector representations obtained by multiplying the embedded representation of the text or image by three matrices to be learned, W_q, W_k and W_v. A similarity S is calculated from Q and K, and the probability P of each candidate word is calculated through Softmax. In scaled dot-product attention, the similarity S is divided by a scaling factor before the Softmax is calculated.
The core of the self-attention mechanism is to calculate attention weights from Q and K and then apply them to V to obtain the weighted output. Illustratively, for inputs Q, K and V, the output satisfies the following relationship:
Attention(Q, K, V) = Softmax(QK^T / sqrt(d_k)) V
where Q, K and V are three matrices whose dimensions are d_q, d_k and d_v, respectively.
The probabilities P are applied to V as attention weights to obtain the weighted output. In multi-head attention, the d_model-dimensional vectors are first decomposed into h heads, attention is calculated for each head, the resulting attention vectors are concatenated together, and the concatenation is output through a linear layer (Linear layer). The whole process only requires four linear layers whose input and output dimensions are d_model, while the input of the overall module has shape (batch_size, seq_length, d_model) and the output also has shape (batch_size, seq_length, d_model).
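The following compact sketch writes out the scaled dot-product and multi-head attention computation described above directly with tensor operations, using the (batch_size, seq_length, d_model) convention from the text; the dimensions are arbitrary example values.
```python
import math
import torch

# A tensor-level sketch of scaled dot-product and multi-head attention.
def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # similarity S scaled by sqrt(d_k)
    p = torch.softmax(scores, dim=-1)                  # probabilities P
    return p @ v                                       # weighted sum over V

def multi_head_attention(x, w_q, w_k, w_v, w_o, h=8):
    b, seq_len, d_model = x.shape
    d_head = d_model // h
    def split(t):  # (b, seq, d_model) -> (b, h, seq, d_head)
        return t.view(b, seq_len, h, d_head).transpose(1, 2)
    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    heads = scaled_dot_product_attention(q, k, v)       # attention computed per head
    concat = heads.transpose(1, 2).reshape(b, seq_len, d_model)
    return concat @ w_o                                 # final linear projection

d_model = 64
x = torch.randn(2, 10, d_model)
w = [torch.randn(d_model, d_model) for _ in range(4)]   # the four d_model x d_model linear maps
print(multi_head_attention(x, *w).shape)                # torch.Size([2, 10, 64])
```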
Therefore, the embodiment of the application generates the multi-mode label by fusing the text characteristics and the image characteristics of the video, can fully utilize the relativity and the complementarity of the text data and the image data to generate the generalized label, is beneficial to better understanding and describing the video content, and further improves the label quality.
230, determining a training sample, the training sample comprising a video sample and a sample label, the sample label comprising at least one of a single-mode label and a multi-mode label.
Specifically, the single-mode tag and/or the multi-mode tag obtained in the step 220 may be used as a sample tag of the video, so as to obtain a training sample of the video classification model. In video classification, the tag is very important information, can be used for determining information such as content and category of a video by a video classification model, and is beneficial to accurately classifying and identifying the video classification model. The single-mode tag and/or the multi-mode tag are used as training sample data for training the video classification model, so that the video classification model can be helped to learn the difference and the relation between different types of videos, and automatic classification of the videos is realized.
In the embodiment of the application, the single-mode tag can realize understanding of the low-dimensional information of the video. The low-dimensional information is, for example, 2D dimensions such as scenes, entities and persons, or 3D dimensions such as actions and behaviors. The multi-mode tag enables understanding of the high-dimensional information of a video. The high-dimensional information is, for example, abstract dimensions such as events, emotions, plots and summaries (such as domestic current affairs, financial and fashion information).
And 240, updating parameters of the video classification model according to the training sample to obtain a trained video classification model.
The algorithm implementation of the video classification model considers each dimension of information of the video, such as 2D dimensional information, 3D dimensional information and abstract dimensional information, and fuses the three modes of image, text and speech, so that accurate understanding of video content can be achieved: the advantages of each mode are brought into play and the mode information is organically combined, thereby obtaining better video classification results. In the multi-mode video understanding system, the modes are not independently operating machines whose results are simply superimposed; instead, they complement each other, check each other and cooperate with each other, and finally the core content information for video understanding is output.
Similar to the way humans understand video content, information can be acquired through multiple dimensions such as pictures, sound and subtitles, and processed over the accumulation of the time-domain dimension, so that the video content is completely understood. For example, when viewing video content, even if not every spoken word in the video is heard clearly, the meaning that the video is intended to express can be easily understood when combined with the picture information.
In some embodiments, image features, text features, and voice features of a video sample may be acquired, and the image features, text features, and voice features may be input to a multi-headed attention module to obtain fusion features; and inputting the fusion characteristics into a dynamic graph convolutional network to obtain a prediction type label of the video sample. Then, according to the prediction class label and the sample label, a loss function can be determined, and according to the loss function, parameter updating is carried out on the multi-head attention module and the dynamic graph convolution network, so that a trained video classification model is obtained.
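As a schematic illustration of this training step, the sketch below uses a trivial stand-in fusion model (concatenation plus a linear head) and a standard multi-label binary cross-entropy loss in place of the cascading loss; both simplifications are assumptions made only to show the parameter-update flow.
```python
import torch
import torch.nn as nn

# A schematic training step: predict class labels from the three modality
# features, compute a loss against the sample labels, and update parameters.
class TinyFusionClassifier(nn.Module):
    def __init__(self, dim=128, num_labels=50):
        super().__init__()
        self.head = nn.Linear(3 * dim, num_labels)

    def forward(self, img, txt, speech):
        return self.head(torch.cat([img, txt, speech], dim=-1))

model = TinyFusionClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(image_feat, text_feat, speech_feat, sample_labels):
    logits = model(image_feat, text_feat, speech_feat)           # predicted class labels
    loss = nn.functional.binary_cross_entropy_with_logits(logits, sample_labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                             # parameter update
    return loss.item()

loss = train_step(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128),
                  torch.randint(0, 2, (4, 50)))
print(loss)
```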
Illustratively, extraction of image time-sequence information can be realized through 3D convolution (such as SlowFast), and fusion of the image time-sequence information can be realized through NeXtVLAD to obtain the image features; the text features are extracted through BERT; and the speech features are extracted through PANNs. The image features, the text features and the speech features are input into an attention module, which simulates the adaptive selection and weighting of information when a person receives information, so as to obtain the fusion feature of the three features. The fusion feature is input into an attention-driven dynamic graph convolution network (Attention-Driven Dynamic GCN), which learns the mutual information among the labels (or categories) to obtain the probability of each predicted class label. At the loss level, the learning of class labels is supervised by designing cascading losses based on posterior knowledge between the sample labels and the predicted class labels.
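The modality fusion step can be sketched as follows, treating the image, text and speech features as three tokens fused by a multi-head attention layer; the dimensions and the pooling choice are illustrative assumptions, and the real feature extractors (the 3D convolution network, BERT and PANNs) are not reproduced here.
```python
import torch
import torch.nn as nn

# A sketch of fusing three modality features with multi-head attention, mimicking
# the adaptive selection and weighting of modalities described above.
class ModalityFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_feat, text_feat, speech_feat):
        # (batch, 3, dim): one token per modality
        tokens = torch.stack([image_feat, text_feat, speech_feat], dim=1)
        fused, weights = self.attn(tokens, tokens, tokens)       # modalities attend to each other
        fused = self.norm(fused + tokens)                        # residual connection
        return fused.mean(dim=1), weights                        # pooled fusion feature

fusion = ModalityFusion()
feat, w = fusion(torch.randn(2, 256), torch.randn(2, 256), torch.randn(2, 256))
print(feat.shape, w.shape)  # torch.Size([2, 256]) torch.Size([2, 3, 3])
```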
The Attention-Driven Dynamic GCN is used for processing dynamic graph data. A dynamic graph refers to a graph that changes and evolves over time, where the properties and connectivity of nodes and edges may change over time. The Attention-Driven Dynamic GCN adapts to the changes of the dynamic graph by introducing adaptive mechanisms, thereby improving the expressiveness and accuracy of the GCN on dynamic graph data.
Specifically, the Attention-Driven Dynamic GCN introduces two adaptive mechanisms, namely a multi-layer attention mechanism based on edge weights and an adaptive sampling mechanism based on node degrees. The multi-layer attention mechanism can adaptively learn the importance of the edges in each GCN layer, so as to effectively capture the structural information of the dynamic graph at different time steps. The adaptive sampling mechanism samples based on node degree, so that fewer nodes are sampled in regions with higher node density and more nodes are sampled in sparse node regions, thereby better balancing computational efficiency and model accuracy at different time steps. Therefore, the Attention-Driven Dynamic GCN can use the node and edge changes between different time steps to improve expressiveness and reliability on dynamic graph data, thereby improving the accuracy of the predicted class labels.
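The following rough sketch conveys only the general idea of an attention-driven dynamic graph convolution over label nodes, with edge weights predicted from the fused video feature instead of being fixed; it is an illustrative simplification and does not reproduce the edge-weight attention or degree-based sampling mechanisms described above.
```python
import torch
import torch.nn as nn

# A simplified dynamic label-graph convolution: the adjacency between label nodes
# is produced from the fused feature, then one graph convolution step yields
# per-label logits. All dimensions and the adjacency rule are assumptions.
class DynamicLabelGCN(nn.Module):
    def __init__(self, num_labels=50, dim=256):
        super().__init__()
        self.label_emb = nn.Parameter(torch.randn(num_labels, dim))   # one node per label
        self.edge_attn = nn.Linear(dim, num_labels * num_labels)      # dynamic edge weights
        self.gcn_weight = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim, 1)

    def forward(self, fused_feat):                                    # (batch, dim)
        b, n = fused_feat.size(0), self.label_emb.size(0)
        adj = self.edge_attn(fused_feat).view(b, n, n).softmax(dim=-1)  # attention-driven adjacency
        nodes = self.label_emb.unsqueeze(0).expand(b, -1, -1)           # (batch, n, dim)
        nodes = torch.relu(adj @ self.gcn_weight(nodes))                # one graph convolution step
        return self.classifier(nodes).squeeze(-1)                       # per-label logits (batch, n)

gcn = DynamicLabelGCN()
print(gcn(torch.randn(2, 256)).shape)  # torch.Size([2, 50])
```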
Therefore, according to the embodiment of the application, at least one of the single-mode tag and the multi-mode tag is obtained according to the text data and the image data, and the video tags of different modes are obtained, and as the tags of different modes can represent the video information from different dimensions of the video, the video information of different dimensions are mutually supplemented and cooperated, so that the video content can be comprehensively and completely understood, the quality of video annotation is improved, and the video classification model can accurately classify and identify the video. In addition, the embodiment of the application can acquire the video tag by fusing the multi-mode information without manually marking the video, thereby being beneficial to improving the video marking efficiency.
The embodiment of the application can be applied to any application scene in which video content needs to be understood, such as video content recommendation, video stripping, video search, video recall and the like. In the embodiment of the application, comprehensive understanding of video content can be achieved by combining the picture scenes, subtitle information, audio content and the like in the video images, and multi-level cataloging results such as a program layer, a segment layer, a scene layer and a shot layer can be output intelligently, which may include content such as an abstract, a cover, a classification, character labels, place labels and scene labels; the granularity and domain coverage of these labels is more comprehensive and complete than that of manually assigned labels. Based on the output multi-level and multi-granularity labels or classifications, accurate searching, recommendation or recall of the associated video content can be realized, or accurate segmentation of the video can be realized.
For example, the embodiment of the application supports segment segmentation, scene segmentation and shot segmentation of video. Referring to fig. 9, in (a), a long news video is segmented into segments; specifically, the long news video is precisely segmented, accurate to the frame, by combining visual information and semantic understanding, and filtering of the opening, the ending and each segment is supported. (b) Scene segmentation is performed on a shot sequence (including shots 1 to n), supporting the recognition of core news scenes, such as news broadcasting, person interviews, meetings, news scenes and the like. (c) Shot segmentation is performed on the video, supporting accurate (frame-accurate) segmentation of the shots of the video as well as segmentation of shots with gradual transitions of overlapping pictures, to obtain a shot sequence (including shots 1 to n).
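As a toy illustration of shot segmentation, the sketch below detects shot boundaries from the color-histogram change between consecutive frames; the embodiment combines visual information with semantic understanding, so this heuristic and its threshold are only assumptions showing the basic idea.
```python
import numpy as np

# A very simple shot boundary detector based on histogram change between
# consecutive frames; the threshold and bin count are arbitrary assumptions.
def shot_boundaries(frames, threshold=0.4, bins=32):
    """frames: iterable of HxWx3 uint8 arrays; returns indices where a new shot starts."""
    boundaries = []
    prev_hist = None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 255))
        hist = hist / hist.sum()                       # normalized color histogram
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(i)                       # large histogram change => new shot
        prev_hist = hist
    return boundaries

fake_video = [np.full((8, 8, 3), 10, np.uint8)] * 5 + [np.full((8, 8, 3), 200, np.uint8)] * 5
print(shot_boundaries(fake_video))  # [5]
```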
For another example, the embodiment of the application supports content recommendation to users based on video content. Referring to fig. 10, the content library includes video content information, such as the attributes, descriptions, tags and classifications of videos. The user information may include the user's keywords, tags, preferences and the like; the user's preferences are, for example, the user's scoring, viewing or interaction information. The video content information and the user information may be input to a recommendation engine system, which calculates the degree of association between the user information and the video content information and recommends video content to the most relevant users, for example, content A to user 1, content B to user 4, content C to user 3, and content D to user 2. The recommendation engine system may also calculate the similarity or association between video contents, and when a user browses certain video content, recommend the content most similar or most related to that video content to the user. By recommending content related to the user, or recommending videos related to the current content, the embodiment of the application can increase the user's willingness to continue consuming content and thus improve user activity.
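As a minimal, hypothetical sketch of the association computation performed by such a recommendation engine, the following Python code scores content items by the cosine similarity between a user's tag vector and each content item's tag vector; the tag vocabulary, content library and scoring scheme are illustrative assumptions only.

```python
def tag_vector(tags, vocabulary):
    """One-hot tag vector over a fixed tag vocabulary."""
    return [1.0 if t in tags else 0.0 for t in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(user_tags, content_library, vocabulary, top_k=3):
    """Rank content by tag overlap with the user profile."""
    u = tag_vector(user_tags, vocabulary)
    scored = [(name, cosine(u, tag_vector(tags, vocabulary)))
              for name, tags in content_library.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# illustrative usage
vocab = ["news", "interview", "sports", "finance"]
library = {"content A": ["news", "interview"], "content B": ["sports"]}
print(recommend(["news", "finance"], library, vocab))
```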
Fig. 11 is a schematic flow chart of a method 1100 of video classification according to an embodiment of the present application, where the method 1100 may be performed by any electronic device having data processing capabilities, for example, the electronic device may be implemented as a server or a terminal device, for example, the server 1 or the terminal device 2 in fig. 1, which is not limited in this regard. As shown in fig. 11, method 1100 includes steps 1110 through 1120.
1110, image features, text features and voice features of the video to be identified are acquired.
1120, inputting the image features, the text features and the voice features into a video classification model to obtain the class label of the video to be identified; the video classification model comprises a multi-head attention module and a dynamic graph convolution network; the multi-head attention module is used for inputting the image features, the text features and the voice features and obtaining fusion features; the dynamic graph convolution network is used for inputting the fusion features to obtain the class label. The video classification model is trained by the method 200 described above.
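A minimal sketch of this inference path is shown below, assuming PyTorch: the three modality features are fused by a multi-head self-attention layer and then passed through a single dense graph-propagation step standing in for the dynamic graph convolution network; all dimensions, the adjacency matrix and the pooling scheme are illustrative assumptions, not the embodiment's actual architecture.

```python
import torch
import torch.nn as nn

class VideoClassifier(nn.Module):
    def __init__(self, dim=256, num_classes=100, heads=4):
        super().__init__()
        self.fuse = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)
        self.gcn_weight = nn.Linear(dim, dim)   # stand-in for one dynamic GCN layer
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, image_feat, text_feat, speech_feat, adj):
        # stack the three modality features as a length-3 token sequence
        tokens = torch.stack([image_feat, text_feat, speech_feat], dim=1)  # (B, 3, dim)
        fused, _ = self.fuse(tokens, tokens, tokens)                       # self-attention fusion
        # treat each modality token as a graph node and propagate over adj (B, 3, 3)
        propagated = torch.relu(adj @ self.gcn_weight(fused))
        logits = self.classifier(propagated.mean(dim=1))                   # pool the nodes
        return torch.sigmoid(logits)                                       # multi-label scores

model = VideoClassifier()
B, D = 2, 256
adj = torch.ones(B, 3, 3) / 3.0   # illustrative fully connected modality graph
scores = model(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D), adj)
```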
Specifically, for the image features, the text features, the voice features and the video classification model, reference may be made to the description of step 240 in fig. 2, and details are not repeated here.
Therefore, the embodiment of the application represents the video information from different dimensions of the video through information of different modes, and the video information of different dimensions complements and cooperates with each other, so that the video content can be understood comprehensively and completely and the video can be classified and identified accurately. In addition, the embodiment of the application can obtain the video tag by fusing the multi-mode information without manually annotating the video, which helps improve video annotation efficiency.
The specific embodiments of the present application have been described in detail above with reference to the accompanying drawings, but the present application is not limited to the specific details of the above embodiments, and various simple modifications can be made to the technical solution of the present application within the scope of the technical concept of the present application, and all the simple modifications belong to the protection scope of the present application. For example, the specific features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various possible combinations are not described further. As another example, any combination of the various embodiments of the present application may be made without departing from the spirit of the present application, which should also be regarded as the disclosure of the present application.
It should be further understood that, in the various method embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present application. It is to be understood that the numbers may be interchanged where appropriate such that the described embodiments of the application may be practiced otherwise than as shown or described.
The method embodiments of the present application are described above in detail, and the apparatus embodiments of the present application are described below in detail with reference to fig. 12 to 14.
Fig. 12 is a schematic block diagram of an apparatus 10 for video classification model training in accordance with an embodiment of the application. As shown in fig. 12, the apparatus 10 may include an acquisition unit 11, a determination unit 12, and a training unit 13.
An acquisition unit 11 for acquiring text data and image data from a video sample;
the acquiring unit 11 is further configured to acquire at least one of a single-mode tag and a multi-mode tag according to the text data and the image data; wherein the single-mode tag includes a tag representing video content using text information or image information, and the multi-mode tag includes a tag representing video content using text information and image information;
a determining unit 12, configured to determine a training sample, where the training sample includes the video sample and a sample tag, and the sample tag includes at least one of the single-mode tag and the multi-mode tag;
and the training unit 13 is used for updating parameters of the video classification model according to the training sample to obtain the trained video classification model.
In some embodiments, the obtaining unit 11 is specifically configured to:
acquiring keyword information according to the text data;
determining a text label by utilizing the keyword information and the knowledge graph; wherein the single-mode tag comprises the text tag.
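Purely as an illustration of this flow, the Python sketch below extracts keywords from text data and looks them up in a toy dictionary standing in for the knowledge graph to produce text labels with fine and coarse classification information; the keyword extraction, the graph contents and the label format are all assumptions.

```python
from collections import Counter
import re

# toy stand-in for a knowledge graph: keyword -> (fine label, coarser category)
KNOWLEDGE_GRAPH = {
    "basketball": ("basketball game", "sports"),
    "stock": ("stock market", "finance"),
}

def extract_keywords(text, top_k=5):
    """Naive keyword extraction by word frequency (a real system might use entity
    recognition, abstract keywords or search keywords instead)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w, _ in Counter(words).most_common(top_k)]

def text_labels(text):
    labels = []
    for kw in extract_keywords(text):
        if kw in KNOWLEDGE_GRAPH:
            fine, coarse = KNOWLEDGE_GRAPH[kw]
            labels.append({"keyword": kw, "label": fine, "category": coarse})
    return labels

print(text_labels("Basketball highlights: the basketball final goes to overtime"))
```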
In some embodiments, the obtaining unit 11 is specifically configured to:
acquiring entity information according to the image data;
determining an image tag by using the entity information and the knowledge graph; wherein the single mode tag comprises the image tag.
In some embodiments, the obtaining unit 11 is specifically configured to:
obtaining text vector representation according to the text data;
obtaining an image vector representation according to the image data;
and inputting the text vector representation and the image vector representation into a neural network model to obtain the multi-mode label.
In some embodiments, the neural network model comprises a Transformer network.
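A minimal sketch of such a multi-modal tagger, assuming PyTorch, is given below: the text vector representation and the image vector representation are concatenated as a token sequence, encoded by a Transformer encoder, and pooled into multi-label tag scores. The dimensions, the number of tags and the pooling scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiModalTagger(nn.Module):
    def __init__(self, dim=256, num_tags=500, heads=4, layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.tag_head = nn.Linear(dim, num_tags)

    def forward(self, text_vec, image_vec):
        # text_vec: (B, T, dim) token vectors; image_vec: (B, F, dim) frame vectors
        tokens = torch.cat([text_vec, image_vec], dim=1)   # concatenate the two modalities
        encoded = self.encoder(tokens)
        pooled = encoded.mean(dim=1)                       # simple mean pooling
        return torch.sigmoid(self.tag_head(pooled))        # multi-label tag scores

tagger = MultiModalTagger()
scores = tagger(torch.randn(2, 16, 256), torch.randn(2, 8, 256))
```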
In some embodiments, the text data includes at least one of a video title, optical character recognition OCR data and voice recognition data.
In some embodiments, the keyword information includes at least one of an entity keyword, an abstract keyword, and a search keyword, and the text label includes the keyword information and classification information of the keyword information at different granularities.
In some embodiments, the image tag includes classification information of the entity information and a vertical tag to which the entity information belongs.
In some embodiments, the training unit 13 is specifically configured to:
acquiring image features, text features and voice features of the video sample;
inputting the image features, the text features and the voice features into a multi-head attention module to obtain fusion features;
inputting the fusion features into a dynamic graph convolutional network to obtain a prediction class label of the video sample;
determining a loss function according to the prediction class label and the sample label;
and updating parameters of the multi-head attention module and the dynamic graph convolution network according to the loss function to obtain the trained video classification model.
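The following sketch shows one training step along these lines, reusing the illustrative VideoClassifier class from the sketch after step 1120 (an assumption, not the embodiment's model): it fuses the three modality features, predicts class label scores, computes a multi-label loss against the sample tag, and updates the attention and graph parameters.

```python
import torch

model = VideoClassifier()                      # the illustrative model sketched above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.BCELoss()                 # multi-label loss against the sample tag

def train_step(image_feat, text_feat, speech_feat, adj, sample_tag):
    optimizer.zero_grad()
    pred = model(image_feat, text_feat, speech_feat, adj)   # predicted class label scores
    loss = criterion(pred, sample_tag)                      # loss from prediction vs sample tag
    loss.backward()                                         # back-propagate
    optimizer.step()                                        # update attention + GCN parameters
    return loss.item()

B, D, C = 2, 256, 100
adj = torch.ones(B, 3, 3) / 3.0
sample_tag = torch.randint(0, 2, (B, C)).float()
loss = train_step(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D), adj, sample_tag)
```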
It should be understood that apparatus embodiments and method embodiments may correspond with each other and that similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, the apparatus 10 shown in fig. 12 may perform the above-described method embodiments, and the foregoing and other operations and/or functions of each module in the apparatus 10 are respectively for implementing the corresponding flows in the above-described method 200, which are not described herein for brevity.
Fig. 13 is a schematic block diagram of an apparatus 20 for video classification in accordance with an embodiment of the application. As shown in fig. 13, the apparatus 20 may include an acquisition unit 21 and a video classification model 22. Further, the video classification model 22 includes a multi-head attention module 221 and a dynamic graph convolution network 222.
An acquisition unit 21 for acquiring image features, text features, and voice features of the video to be identified.
The video classification model 22 is used for inputting the image features, the text features and the voice features to obtain the class label of the video to be identified; wherein the video classification model 22 includes a multi-head attention module 221 and a dynamic graph convolution network 222; the multi-head attention module 221 is configured to input the image features, the text features, and the voice features, and obtain fusion features; the dynamic graph convolution network 222 is configured to input the fusion features to obtain the class label; the video classification model is trained according to the method 200 described above.
It should be understood that apparatus embodiments and method embodiments may correspond with each other and that similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, the apparatus 20 shown in fig. 13 may perform the above-described method embodiments, and the foregoing and other operations and/or functions of each module in the apparatus 20 are respectively for implementing the corresponding flow in the above-described method 1100, which is not described herein for brevity.
The apparatus of the embodiments of the present application is described above in terms of functional modules with reference to the accompanying drawings. It should be understood that the functional modules may be implemented in hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, each step of the method embodiments in the embodiments of the present application may be implemented by an integrated logic circuit of hardware in a processor and/or by instructions in software form, and the steps of the methods disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor or by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in a memory, and the processor reads information in the memory and, in combination with its hardware, performs the steps in the above method embodiments.
Fig. 14 is a schematic block diagram of an electronic device 30 provided by an embodiment of the present application.
As shown in fig. 14, the electronic device 30 may include:
a memory 31 and a processor 32, the memory 31 being for storing a computer program and for transmitting the program code to the processor 32. In other words, the processor 32 may call and run a computer program from the memory 31 to implement the method in the embodiment of the present application.
For example, the processor 32 may be configured to perform the above-described method embodiments according to instructions in the computer program.
In some embodiments of the present application, the processor 32 may include, but is not limited to:
a general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
In some embodiments of the present application, the memory 31 includes, but is not limited to:
volatile memory and/or nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
In some embodiments of the present application, the computer program may be divided into one or more modules, which are stored in the memory 31 and executed by the processor 32 to perform the methods provided by the present application. The one or more modules may be a series of computer program instruction segments capable of performing the specified functions, which are used to describe the execution of the computer program in the electronic device.
As shown in fig. 14, the electronic device 30 may further include:
a transceiver 33, the transceiver 33 being connectable to the processor 32 or the memory 31.
The processor 32 may control the transceiver 33 to communicate with other devices, and in particular, may send information or data to other devices or receive information or data sent by other devices. The transceiver 33 may include a transmitter and a receiver. The transceiver 33 may further include one or more antennas.
It will be appreciated that the various components in the electronic device are connected by a bus system that includes, in addition to a data bus, a power bus, a control bus, and a status signal bus.
The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. Alternatively, embodiments of the present application also provide a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of the method embodiments described above.
When implemented in software, the foregoing embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
It will be appreciated that in the specific implementation of the present application, when the above embodiments of the present application are applied to specific products or technologies and relate to data related to user information and the like, user permission or consent needs to be obtained, and the collection, use and processing of the related data needs to comply with the relevant laws and regulations and standards.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. For example, functional modules in various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily appreciate variations or alternatives within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A method of training a video classification model, comprising:
acquiring text data and image data from a video sample;
acquiring at least one of a single-mode tag and a multi-mode tag according to the text data and the image data; wherein the single-mode tag includes a tag representing video content using text information or image information, and the multi-mode tag includes a tag representing video content using text information and image information;
determining a training sample, the training sample comprising the video sample and a sample tag, the sample tag comprising at least one of the single-mode tag and the multi-mode tag;
and carrying out parameter updating on the video classification model according to the training sample to obtain the trained video classification model.
2. The method of claim 1, wherein the obtaining at least one of a single-modality tag and a multi-modality tag from the text data and the image data comprises:
acquiring keyword information according to the text data;
determining a text label by utilizing the keyword information and the knowledge graph; wherein the single-mode tag comprises the text tag.
3. The method of claim 1, wherein the obtaining at least one of a single-modality tag and a multi-modality tag from the text data and the image data comprises:
acquiring entity information according to the image data;
determining an image tag by using the entity information and the knowledge graph; wherein the single mode tag comprises the image tag.
4. The method of claim 3, wherein the obtaining at least one of a single-modality tag and a multi-modality tag from the text data and the image data comprises:
obtaining a text vector representation according to the text data;
obtaining an image vector representation according to the image data;
and inputting the text vector representation and the image vector representation into a neural network model to obtain the multi-mode label.
5. The method of claim 4, wherein the neural network model comprises a Transformer network.
6. The method of claim 1, wherein the text data comprises at least one of video titles, optical character recognition OCR data and speech recognition data.
7. The method of claim 2, wherein the keyword information comprises at least one of an entity keyword, an abstract keyword, and a search keyword, and the text label comprises the keyword information and different granularity classification information of the keyword information.
8. A method according to claim 3, wherein the image tags include classification information of the entity information and a vertical class tag to which the entity information belongs.
9. The method according to claim 1, wherein said updating parameters of said video classification model based on said training samples to obtain said trained video classification model comprises:
acquiring image features, text features and voice features of the video sample;
inputting the image features, the text features and the voice features into a multi-head attention module to obtain fusion features;
inputting the fusion features into a dynamic graph convolutional network to obtain a prediction class label of the video sample;
determining a loss function according to the prediction class label and the sample label;
and updating parameters of the multi-head attention module and the dynamic graph convolution network according to the loss function to obtain the trained video classification model.
10. A method of video classification, comprising:
acquiring image features, text features and voice features of a video to be identified;
inputting the image features, the text features and the voice features into a video classification model to obtain class labels of the videos to be identified; the video classification model comprises a multi-head attention module and a dynamic graph convolution network; the multi-head attention module is used for inputting the image features, the text features and the voice features and obtaining fusion features; the dynamic graph convolution network is used for inputting the fusion characteristics to obtain the class labels; the video classification model is trained according to the method of any one of claims 1-9.
11. An apparatus for training a video classification model, comprising:
an acquisition unit for acquiring text data and image data from a video sample;
the acquisition unit is further used for acquiring at least one of a single-mode tag and a multi-mode tag according to the text data and the image data; wherein the single-mode tag includes a tag representing video content using text information or image information, and the multi-mode tag includes a tag representing video content using text information and image information;
a determining unit configured to determine a training sample, where the training sample includes the video sample and a sample tag, and the sample tag includes at least one of the single-mode tag and the multi-mode tag;
and the training unit is used for updating parameters of the video classification model according to the training sample to obtain the trained video classification model.
12. An apparatus for video classification, comprising:
the acquisition unit is used for acquiring image features, text features and voice features of the video to be identified;
the video classification model is used for inputting the image features, the text features and the voice features to obtain class labels of the videos to be identified; the video classification model comprises a multi-head attention module and a dynamic graph convolution network; the multi-head attention module is used for inputting the image features, the text features and the voice features and obtaining fusion features; the dynamic graph convolution network is used for inputting the fusion characteristics to obtain the class labels; the video classification model is trained according to the method of any one of claims 1-9.
13. An electronic device comprising a processor and a memory, the memory having instructions stored therein that when executed by the processor cause the processor to perform the method of any of claims 1-10.
14. A computer storage medium for storing a computer program, the computer program comprising instructions for performing the method of any one of claims 1-10.
15. A computer program product comprising computer program code which, when run by an electronic device, causes the electronic device to perform the method of any one of claims 1-10.
CN202310499277.2A 2023-05-05 2023-05-05 Video classification model training method, video classification method and device Pending CN116977701A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310499277.2A CN116977701A (en) 2023-05-05 2023-05-05 Video classification model training method, video classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310499277.2A CN116977701A (en) 2023-05-05 2023-05-05 Video classification model training method, video classification method and device

Publications (1)

Publication Number Publication Date
CN116977701A true CN116977701A (en) 2023-10-31

Family

ID=88475628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310499277.2A Pending CN116977701A (en) 2023-05-05 2023-05-05 Video classification model training method, video classification method and device

Country Status (1)

Country Link
CN (1) CN116977701A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117573870A (en) * 2023-11-20 2024-02-20 中国人民解放军国防科技大学 Text label extraction method, device, equipment and medium for multi-mode data
CN117573870B (en) * 2023-11-20 2024-05-07 中国人民解放军国防科技大学 Text label extraction method, device, equipment and medium for multi-mode data

Similar Documents

Publication Publication Date Title
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
US9449271B2 (en) Classifying resources using a deep network
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN113806588B (en) Method and device for searching video
CN113254711B (en) Interactive image display method and device, computer equipment and storage medium
CN113705299A (en) Video identification method and device and storage medium
CN112948708A (en) Short video recommendation method
CN112052333A (en) Text classification method and device, storage medium and electronic equipment
CN112085120B (en) Multimedia data processing method and device, electronic equipment and storage medium
CN113392265A (en) Multimedia processing method, device and equipment
CN116975615A (en) Task prediction method and device based on video multi-mode information
CN114661951A (en) Video processing method and device, computer equipment and storage medium
CN116977701A (en) Video classification model training method, video classification method and device
CN116628345B (en) Content recommendation method and device, electronic equipment and storage medium
Liu et al. A multimodal approach for multiple-relation extraction in videos
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
CN116956183A (en) Multimedia resource recommendation method, model training method, device and storage medium
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
CN117011745A (en) Data processing method, device, computer equipment and readable storage medium
CN111222011B (en) Video vector determining method and device
CN117390219A (en) Video searching method, device, computer equipment and storage medium
CN116150428A (en) Video tag acquisition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication