CN112861580A - Video information processing method and device based on video information processing model - Google Patents

Video information processing method and device based on video information processing model

Info

Publication number
CN112861580A
Authority
CN
China
Prior art keywords
video
target video
information processing
target
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911183241.3A
Other languages
Chinese (zh)
Inventor
徐聪
彭广举
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Beijing Co Ltd
Original Assignee
Tencent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Beijing Co Ltd filed Critical Tencent Technology Beijing Co Ltd
Priority to CN201911183241.3A priority Critical patent/CN112861580A/en
Publication of CN112861580A publication Critical patent/CN112861580A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention provides a video information processing method based on a video information processing model, which comprises the following steps: acquiring a first target video, and analyzing the first target video to acquire video parameters of the first target video; determining basic features matched with the first target video according to the video parameters of the first target video; determining multi-modal features matched with the first target video according to the video parameters of the first target video; and determining, based on the basic features and the multi-modal features, a fusion feature vector matched with the first target video through a first video processing network in the video information processing model. The invention also provides an information processing device, an electronic device and a storage medium. The method can form matched multi-modal video information, fuses the multi-modal features of the first target video, expresses the features of the first target video more accurately, and facilitates subsequent operations on the first target video.

Description

Video information processing method and device based on video information processing model
Technical Field
The present invention relates to information processing technologies, and in particular, to a video information processing method and apparatus based on a video information processing model, an electronic device, and a storage medium.
Background
The vectorization representation of video information is the basis of many machine learning algorithms, and how to accurately represent video information is a research focus in this direction. In the prior art, most videos are compared only on a single plane and are not represented and learned in a structured manner.
Common learning approaches include: 1) directly using video label representations, including video classification, video subject, video distribution source and the like. Through such labels, a video can be roughly divided into, for example, entertainment videos and sports videos, or subdivided into, for example, basketball highlights and film-and-television highlights. However, this representation is relatively coarse, the classification label information needs to be set in advance and updated in time, and its ability to represent content is limited. 2) Text learning, which includes semantic learning over video titles, video description information or video labels. This approach relies heavily on the accuracy of the text information, but many videos lack text information, which makes the video representation inaccurate. The random recommendation strategies produced in such preceding learning processes have simple logic and low accuracy, which seriously affects the user experience.
Disclosure of Invention
In view of this, embodiments of the present invention provide a video information processing method and apparatus based on a video information processing model, an electronic device, and a storage medium. The technical solutions of the embodiments of the present invention are implemented as follows:
the embodiment of the invention provides a video information processing method based on a video information processing model, which comprises the following steps:
acquiring a first target video, and analyzing the first target video to acquire video parameters of the first target video;
determining basic features matched with the first target video according to the video parameters of the first target video;
determining multi-modal characteristics matched with the first target video according to the video parameters of the first target video;
and determining a fusion feature vector matched with the first target video through a first video processing network in the video information processing model based on the basic feature and the multi-modal feature, wherein the fusion feature vector is used for being combined with a second target video fusion feature vector output by a second video processing network in the video information processing model to realize a process matched with the video information processing model.
The embodiment of the invention also provides a processing device based on the video information processing model, which comprises:
the information transmission module is used for acquiring a first target video and analyzing the first target video to acquire video parameters of the first target video;
the information processing module is used for determining basic characteristics matched with the first target video according to the video parameters of the first target video;
the information processing module is used for determining multi-modal characteristics matched with the first target video according to the video parameters of the first target video;
and the information processing module is used for determining a fusion feature vector matched with the first target video through a first video processing network in the video information processing model based on the basic feature and the multi-modal feature, wherein the fusion feature vector is used for combining with a second target video fusion feature vector output by a second video processing network in the video information processing model to realize a process matched with the video information processing model.
In the above scheme,
the information processing module is used for analyzing the first target video to acquire tag information of the first target video;
the information processing module is used for analyzing the video information corresponding to the first target video according to the tag information of the first target video, so as to respectively acquire the video parameters of the first target video in a basic dimension and in a multi-modal dimension.
In the above scheme,
the information processing module is used for, according to the video parameters of the first target video in the basic dimension, determining a category parameter, a video tag parameter and a video publishing source parameter corresponding to the first target video;
the information processing module is used for respectively performing feature extraction on the category parameter, the video tag parameter and the video publishing source parameter corresponding to the first target video to form basic features matched with the first target video.
In the above scheme,
the information processing module is used for, according to the video parameters of the first target video in the multi-modal dimension, determining a title text parameter, an image information parameter and a visual information parameter corresponding to the first target video;
the information processing module is used for respectively performing feature extraction and fusion on the title text parameter, the image information parameter and the visual information parameter corresponding to the first target video to form multi-modal features matched with the first target video.
In the above scheme,
the information processing module is used for processing the basic features through a basic information processing network in the first video processing network to form corresponding basic feature vectors;
the information processing module is used for processing the image features in the multi-modal features through an image processing network in the first video processing network to form corresponding image feature vectors;
the information processing module is used for processing the title text features in the multi-modal features through a text processing network in the first video processing network to form corresponding title text feature vectors;
the information processing module is used for processing the visual features in the multi-modal features through a visual processing network in the first video processing network to form corresponding visual feature vectors;
the information processing module is configured to perform vector fusion through the first video processing network based on the basic feature vector, the image feature vector, the title text feature vector, and the visual feature vector to form a fusion feature vector matched with the first target video.
In the above scheme,
the information processing module is used for acquiring an image to be processed and a target resolution corresponding to a playing interface of the first target video;
and the information processing module is used for responding to the target resolution, performing resolution enhancement processing on the image to be processed through an image processing network in the first video processing network, and acquiring a corresponding image feature vector so as to realize that the image feature vector is matched with the target resolution corresponding to the playing interface of the first target video.
In the above scheme,
the information processing module is used for extracting, through a text processing network, text feature vectors matched with the title text features;
the information processing module is used for determining, through the text processing network, at least one word-level hidden variable corresponding to the title text features according to the text feature vectors;
the information processing module is used for generating, through the text processing network, processing words corresponding to the word-level hidden variables and the selection probabilities of the processing words according to the at least one word-level hidden variable;
and the information processing module is used for selecting at least one processing word according to the selection probabilities of the processing words to form a text processing result corresponding to the title text features.
In the above scheme,
the information processing module is used for determining code rate information matched with the playing environment of the first target video;
the information processing module is configured to adjust, through a visual processing network in the first video processing network, a code rate of the first target video by using the visual features in the multi-modal features, so as to achieve matching between the code rate of the first target video and the code rate information of the playing environment.
In the above scheme,
the information processing module is used for, when the process matched with the video information processing model is a video recommendation process, adjusting parameters of an attention-based recurrent convolutional neural network in the first video processing network according to the second target video fusion feature vector output by the second video processing network in the video information processing model, so that the parameters of the attention-based recurrent convolutional neural network are matched with the fusion feature vector.
In the above scheme,
the information processing module is used for adjusting parameters of a second video processing network in the video information processing model;
determining a new second target video fusion characteristic vector through a second video processing network in the video information processing model after parameter adjustment;
and connecting the new second target video fusion feature vector with the fusion feature vector of the first target video through a classification prediction function matched with the video information processing model so as to determine the association degree of the first target video and the second target video.
An embodiment of the present invention further provides an electronic device, where the electronic device includes:
a memory for storing executable instructions;
and the processor is used for implementing the video information processing method based on the video information processing model described above when executing the executable instructions stored in the memory.
The embodiment of the invention also provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the video information processing method based on the video information processing model described above.
The embodiment of the invention has the following beneficial effects:
A first target video is acquired, and the first target video is analyzed to acquire video parameters of the first target video; basic features matched with the first target video are determined according to the video parameters of the first target video; multi-modal features matched with the first target video are determined according to the video parameters of the first target video; and a fusion feature vector matched with the first target video is determined, based on the basic features and the multi-modal features, through a first video processing network in the video information processing model, wherein the fusion feature vector is used for being combined with a second target video fusion feature vector output by a second video processing network in the video information processing model to implement a process matched with the video information processing model. In this way, the video information of the first target video is processed to form matched multi-modal video information, the multi-modal features of the first target video are fused, the features of the first target video can be expressed more accurately, and subsequent operations on the first target video are facilitated.
Drawings
Fig. 1 is a schematic view of a usage scenario of a video information processing method based on a video information processing model according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a processing apparatus based on a video information processing model according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of an alternative video information processing method based on a video information processing model according to an embodiment of the present invention;
fig. 4 is a schematic flow chart of an alternative video information processing method based on a video information processing model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an alternative structure of a text processing network in an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating a process for determining a word-level hidden variable of the text processing network according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an alternative architecture of an encoder in a text processing network in accordance with an embodiment of the present invention;
FIG. 8 is a schematic diagram of vector concatenation for an encoder in a text processing network according to an embodiment of the present invention;
FIG. 9 is a diagram illustrating an encoding process of an encoder in a text processing network according to an embodiment of the present invention;
FIG. 10 is a diagram illustrating a decoding process of a decoder in a text processing network according to an embodiment of the present invention;
FIG. 11 is a diagram illustrating a decoding process of a decoder in a text processing network according to an embodiment of the present invention;
FIG. 12 is a diagram illustrating a decoding process of a decoder in a text processing network according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of an alternative architecture of an image processing network in an embodiment of the present invention;
FIG. 14 is a schematic diagram of an alternative architecture of an image vision processing network in an embodiment of the present invention;
fig. 15 is a schematic flow chart of an alternative video information processing method based on a video information processing model according to an embodiment of the present invention;
FIG. 16 is a schematic diagram of an application environment of a video information processing method based on a video information processing model according to an embodiment of the present invention;
fig. 17 is a schematic diagram of an operating process of a video information processing method based on a video information processing model according to an embodiment of the present invention;
fig. 18 is a schematic structural diagram of a video information processing apparatus of a video information processing model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Before the embodiments of the present invention are described in further detail, the terms and expressions used in the embodiments of the present invention are explained; these terms and expressions have the meanings given below.
1) In response to: indicates the condition or state on which an executed operation depends. When the dependent condition or state is satisfied, the one or more executed operations may be performed in real time or with a set delay; unless otherwise specified, there is no restriction on the order in which the operations are executed.
2) First target video: video information in various forms available on the Internet, such as video files and multimedia information presented in a client or a smart device.
3) Convolutional Neural Network (CNN): a class of feedforward neural networks that contain convolution computations and have a deep structure, and one of the representative algorithms of deep learning. Convolutional neural networks have representation learning capability and can perform shift-invariant classification of input information according to their hierarchical structure.
4) Model training: multi-class learning on an image data set. The model can be constructed using deep learning frameworks such as TensorFlow or Torch, and a multi-class model is formed by combining multiple neural network layers such as CNN layers. The input of the model is a three-channel or original-channel matrix formed by reading an image with tools such as OpenCV; the output of the model is multi-class probabilities, and the category is finally output through algorithms such as softmax. During training, the model is driven toward the correct result through an objective function such as cross entropy.
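To make the training procedure just described concrete, the following is a minimal sketch of multi-class training with a cross-entropy objective; the network layout, class count, and random data are illustrative assumptions rather than the patent's actual configuration.

```python
import torch
import torch.nn as nn

# Hypothetical multi-class image classifier: a small CNN whose logits are turned
# into class probabilities by softmax.
class SimpleClassifier(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):                 # x: (batch, 3, H, W) image matrix
        h = self.features(x).flatten(1)
        return self.classifier(h)         # raw logits; softmax is applied by the loss

model = SimpleClassifier(num_classes=10)
criterion = nn.CrossEntropyLoss()         # cross-entropy objective drives the model toward correct labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One illustrative training step on random tensors standing in for a real image batch.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))
logits = model(images)
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
probs = torch.softmax(logits, dim=-1)     # multi-class probabilities for inspection
```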
5) Neural Network (NN): an Artificial Neural Network (ANN), neural network for short, is a mathematical or computational model in the fields of machine learning and cognitive science that imitates the structure and function of biological neural networks (the central nervous system of animals, especially the brain) and is used to estimate or approximate functions.
6) Speech Recognition (SR): also known as Automatic Speech Recognition (ASR), Computer Speech Recognition (CSR) or Speech-To-Text (STT), whose goal is to automatically convert human speech content into corresponding text using a computer.
7) Machine Translation (MT): a branch of computational linguistics that studies translating text or speech from one natural language to another using computer programs. Neural Machine Translation (NMT) is a technique that performs machine translation using neural network technology.
8) Encoder-decoder architecture: a network architecture commonly used for machine translation technology. The decoder receives the output result of the encoder as input and outputs a corresponding text sequence of another language.
9) BERT (Bidirectional Encoder Representations from Transformers): a bidirectional attention neural network model proposed by Google.
10) Token: word unit. Before any actual processing of the input text, the text needs to be divided into language units such as words, punctuation, numbers or pure alphanumerics; these units are called tokens.
11) Softmax: the normalized exponential function, a generalization of the logistic function. It can "compress" a K-dimensional vector containing arbitrary real numbers into another K-dimensional real vector, such that each element lies in the range [0, 1] and the sum of all elements is 1.
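A short numeric illustration of this definition (the input values are chosen arbitrarily):

```python
import torch

x = torch.tensor([2.0, 1.0, -1.0])   # arbitrary K-dimensional real vector
p = torch.softmax(x, dim=0)           # each element lies in [0, 1]
print(p, p.sum())                     # approx. tensor([0.7054, 0.2595, 0.0351]), sum = 1.0
```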
12) Word segmentation: segmenting Chinese text with a Chinese word segmentation tool to obtain a set of fine-grained words. Stop words: words that contribute nothing, or negligibly, to the semantics of a text. Cosine similarity: the cosine similarity between the vector representations of two texts.
13) Transformer: a network architecture that relies on an attention mechanism, replacing the traditional encoder-decoder structures that must rely on other neural network patterns. Word vector: a representation of a single word by a distributed vector of fixed dimension. Compound word: a coarser-grained keyword composed of fine-grained keywords; its semantics are richer and more complete than those of the fine-grained keywords.
Fig. 1 is a schematic view of a usage scenario of a video information processing method based on a video information processing model according to an embodiment of the present invention. Referring to fig. 1, the terminals (including the terminal 10-1 and the terminal 10-2) are provided with clients of software capable of displaying the first target video, such as clients or plug-ins for video playing, and a user may obtain and display the first target video (or the first target video and a corresponding second target video) through the corresponding client. The terminals are connected to the server 200 through a network 300; the network 300 may be a wide area network, a local area network, or a combination of the two, and uses wireless links to realize data transmission.
As an example, the server 200 is configured to deploy the processing apparatus based on the video information processing model, so as to implement the video information processing method based on the video information processing model provided by the present invention: a first target video is acquired and analyzed to acquire video parameters of the first target video; basic features matched with the first target video are determined according to the video parameters of the first target video; multi-modal features matched with the first target video are determined according to the video parameters of the first target video; a fusion feature vector matched with the first target video is determined, based on the basic features and the multi-modal features, through a first video processing network in the video information processing model, wherein the fusion feature vector is used for being combined with a second target video fusion feature vector output by a second video processing network in the video information processing model to implement the process matched with the video information processing model; and the first target video and any feature matched with the first target video are displayed and output through the terminal (the terminal 10-1 and/or the terminal 10-2). Of course, the processing apparatus based on the video information processing model provided by the present invention may be applied to video playing, in which first target videos from different data sources are generally processed, and any feature information matched with the corresponding first target video is finally presented together with the first target video on a user interface (UI), so that the accuracy and timeliness of the features of the first target video directly affect the user experience. The background database for video playing receives a large amount of video data from different sources every day, and the obtained text information matched with the first target video can also be called by other application programs.
Of course, the process in which the processing apparatus based on the video information processing model processes the first target video to match the video information processing model specifically includes:
acquiring a first target video, and analyzing the first target video to acquire video parameters of the first target video; determining basic features matched with the first target video according to the video parameters of the first target video; determining multi-modal characteristics matched with the first target video according to the video parameters of the first target video; and determining a fusion feature vector matched with the first target video through a first video processing network in the video information processing model based on the basic feature and the multi-modal feature, wherein the fusion feature vector is used for being combined with a second target video fusion feature vector output by a second video processing network in the video information processing model to realize a process matched with the video information processing model.
As described in detail below, the processing apparatus based on the video information processing model according to the embodiment of the present invention may be implemented in various forms, such as a dedicated terminal with the processing functions of the processing apparatus based on the video information processing model, or a server provided with the processing functions of the processing apparatus based on the video information processing model, for example, the server 200 in the foregoing fig. 1. Fig. 2 is a schematic diagram of the composition structure of the processing apparatus based on the video information processing model according to the embodiment of the present invention. It should be understood that fig. 2 only shows an exemplary structure of the processing apparatus based on the video information processing model, not its entire structure, and part or all of the structure shown in fig. 2 may be implemented as needed.
The processing device based on the video information processing model provided by the embodiment of the invention comprises: at least one processor 201, memory 202, user interface 203, and at least one network interface 204. The various components in the video information processing model-based processing device are coupled together by a bus system 205. It will be appreciated that the bus system 205 is used to enable communications among the components. The bus system 205 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 205 in fig. 2.
The user interface 203 may include, among other things, a display, a keyboard, a mouse, a trackball, a click wheel, a key, a button, a touch pad, or a touch screen.
It will be appreciated that the memory 202 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The memory 202 in embodiments of the present invention is capable of storing data to support operation of the terminal (e.g., 10-1). Examples of such data include: any computer program, such as an operating system and application programs, for operating on a terminal (e.g., 10-1). The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application program may include various application programs.
In some embodiments, the video information processing model-based processing apparatus provided in the embodiments of the present invention may be implemented by a combination of hardware and software, and as an example, the video information processing model-based processing apparatus provided in the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the video information processing model-based video information processing method provided in the embodiments of the present invention. For example, a processor in the form of a hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
As an example of the video information processing model-based processing apparatus provided by the embodiment of the present invention implemented by combining software and hardware, the video information processing model-based processing apparatus provided by the embodiment of the present invention may be directly embodied as a combination of software modules executed by the processor 201, the software modules may be located in a storage medium, the storage medium is located in the memory 202, the processor 201 reads executable instructions included in the software modules in the memory 202, and the video information processing method provided by the embodiment of the present invention is completed by combining necessary hardware (for example, including the processor 201 and other components connected to the bus 205).
By way of example, the Processor 201 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor or the like.
As an example of the video information processing model-based processing apparatus provided by the embodiment of the present invention implemented by hardware, the apparatus provided by the embodiment of the present invention may be implemented by directly using the processor 201 in the form of a hardware decoding processor, for example, by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components, to implement the video information processing model-based video information processing method provided by the embodiment of the present invention.
The memory 202 in the embodiment of the present invention is used to store various types of data to support the operation of the processing apparatus based on the video information processing model. Examples of such data include: any executable instructions for operating on a video information processing model-based processing device, such as executable instructions, a program implementing the video information processing model-based video information processing method according to an embodiment of the present invention may be embodied in the executable instructions.
In other embodiments, the processing apparatus based on the video information processing model according to the embodiments of the present invention may be implemented in software, and fig. 2 illustrates the processing apparatus based on the video information processing model stored in the memory 202, which may be software in the form of programs, plug-ins, and the like, and includes a series of modules, and as an example of the program stored in the memory 202, the processing apparatus based on the video information processing model may include the following software modules:
an information transmission module 2081 and an information processing module 2082. When the software modules in the processing apparatus based on the video information processing model are read into the RAM by the processor 201 and executed, the video information processing method based on the video information processing model provided by the embodiment of the invention is implemented, wherein the functions of each software module in the processing apparatus based on the video information processing model include:
the information transmission module 2081, configured to obtain a first target video, and analyze the first target video to obtain a video parameter of the first target video;
the information processing module 2082 is configured to determine, according to the video parameter of the first target video, a basic feature matched with the first target video;
the information processing module 2082 is configured to determine a multi-modal feature matched with the first target video according to the video parameter of the first target video;
the information processing module 2082 is configured to determine, based on the basic features and the multi-modal features, a fusion feature vector matched with the first target video through a first video processing network in the video information processing model, where the fusion feature vector is used to combine with a second target video fusion feature vector output by a second video processing network in the video information processing model to implement a process matched with the video information processing model.
Referring to fig. 3, fig. 3 is an alternative flow chart of the video information processing method based on the video information processing model according to the embodiment of the present invention, and it is understood that the steps shown in fig. 3 may be executed by various electronic devices running the processing apparatus based on the video information processing model, such as a dedicated terminal, a server or a server cluster with the processing apparatus based on the video information processing model, where the dedicated terminal with the processing apparatus based on the video information processing model may be an electronic device with the processing apparatus based on the video information processing model according to the embodiment shown in the foregoing fig. 2. The following is a description of the steps shown in fig. 3.
Step 301: the method comprises the steps that a processing device based on a video information processing model obtains a first target video, and analyzes the first target video to obtain video parameters of the first target video.
In some embodiments of the present invention, parsing the first target video to obtain the video parameters of the first target video may be implemented by:
analyzing the first target video to acquire tag information of the first target video; and analyzing the video information corresponding to the first target video according to the tag information of the first target video, so as to respectively acquire the video parameters of the first target video in a basic dimension and in a multi-modal dimension. The acquired tag information of the first target video can be used for decomposing the video image frames of the first target video and the corresponding audio files. Since the source of the first target video is uncertain (it may be a video resource on the Internet or a local video file stored by the electronic device), by acquiring the video parameters in the basic dimension and the multi-modal dimension corresponding to the first target video, when the original first target video is stored in a corresponding blockchain network, the video parameters in the basic dimension and the multi-modal dimension corresponding to the first target video are stored in the blockchain network at the same time, so as to implement source tracing of the first target video.
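A minimal sketch of the parsing step described above; the field names and the helper function are hypothetical and only illustrate the split into basic-dimension and multi-modal-dimension parameters.

```python
from typing import Any, Dict

def parse_first_target_video(video: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Split a raw video record into basic-dimension and multi-modal-dimension parameters."""
    tags = video.get("tags", [])                    # tag information of the first target video
    basic = {
        "category": video.get("category"),
        "video_tags": tags,
        "publish_source": video.get("source"),
        "duration": video.get("duration"),
        "publish_time": video.get("publish_time"),
    }
    multimodal = {
        "title_text": video.get("title", ""),
        "cover_image": video.get("cover_image"),     # e.g. a decoded image array
        "frames": video.get("frames", []),           # sampled visual frames
    }
    return {"basic": basic, "multimodal": multimodal}

# Example use with a toy record.
record = {"category": "sports", "tags": ["basketball"], "source": "uploader_a",
          "duration": 120, "publish_time": "2019-11-27", "title": "basketball highlights"}
params = parse_first_target_video(record)
```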
Step 302: and determining the basic characteristics matched with the first target video according to the video parameters of the first target video by the processing device based on the video information processing model.
The description of the video information processing method based on a video information processing model provided by the embodiment of the present invention is continued with reference to fig. 2 and fig. 4. Fig. 4 is an optional schematic flow chart of the video information processing method based on a video information processing model provided by the embodiment of the present invention. It can be understood that the steps shown in fig. 4 may be executed by various electronic devices running the video information processing apparatus of the video information processing model, for example, a dedicated terminal, a server or a server cluster with the video information processing function of the video information processing model, and may be used to determine the basic features and the multi-modal dimensional features matched with the first target video so as to determine the model parameters matched with the video information processing model. The method specifically includes the following steps:
step 401: determining a category parameter, a video tag parameter and a video publishing source parameter corresponding to the first target video according to the video parameter of the first target video in the basic dimension;
step 402: and respectively extracting the category parameter, the video tag parameter and the video publishing source parameter corresponding to the first target video to form a basic feature matched with the first target video.
Step 403: and determining a title text parameter, an image information parameter and a visual information parameter corresponding to the first target video according to the video parameters of the first target video in the multi-modal dimension.
Step 404: and respectively performing feature extraction and fusion on the title text parameter, the image information parameter and the visual information parameter corresponding to the first target video to form multi-modal features matched with the first target video.
In some embodiments of the present invention, the basic features are mainly a definitional description of the video, including multi-level video classification categories, video tags, video publishing sources, video duration, publishing time, and event city. The basic features are a qualitative description of the video and are relatively lacking in content representation information.
In some embodiments of the present invention, the multi-modal features are obtained by performing feature extraction on the title text, picture information and visual information of the video, and are used to describe the content information of the video, where the title and the cover may affect the play click-through rate of the video, and the visual frame image information of the video may affect its play completion rate.
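Under the description in the two preceding paragraphs, the following is a hedged sketch of how the basic features might be assembled from the category, tag and publishing-source parameters; the vocabulary sizes and embedding dimensions are illustrative assumptions, and the multi-modal features are produced by the dedicated sub-networks described later.

```python
import torch
import torch.nn as nn

class BasicFeatureExtractor(nn.Module):
    """Embeds the definitional parameters: category, video tag, and publishing source."""
    def __init__(self, n_categories=100, n_tags=5000, n_sources=1000, dim=32):
        super().__init__()
        self.category = nn.Embedding(n_categories, dim)
        self.tag = nn.Embedding(n_tags, dim)
        self.source = nn.Embedding(n_sources, dim)

    def forward(self, category_id, tag_id, source_id):
        # Concatenate the three embeddings into one basic feature of shape (batch, 3*dim).
        return torch.cat([self.category(category_id),
                          self.tag(tag_id),
                          self.source(source_id)], dim=-1)

extractor = BasicFeatureExtractor()
basic_feat = extractor(torch.tensor([3]), torch.tensor([42]), torch.tensor([7]))
# The multi-modal features (title text, cover image, visual frames) come from the
# text, image and visual sub-networks described below and are fused with this vector.
```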
Step 303: and the processing device based on the video information processing model determines multi-modal characteristics matched with the first target video according to the video parameters of the first target video.
In some embodiments of the present invention, determining, by the first video processing network in the video information processing model, a fused feature vector matching the first target video based on the base features and the multi-modal features may be performed by:
processing the basic features through a basic information processing network in the first video processing network to form corresponding basic feature vectors; processing, by an image processing network in the first video processing network, the image features in the multi-modal features to form corresponding image feature vectors; processing, by a text processing network in the first video processing network, the title text features in the multi-modal features to form corresponding title text feature vectors; processing, by a visual processing network in the first video processing network, the visual features in the multi-modal features to form corresponding visual feature vectors; and performing vector fusion through the first video processing network based on the basic feature vectors, the image feature vectors, the title text feature vectors and the visual feature vectors to form a fusion feature vector matched with the first target video. The video information processing model provided by the invention comprises a first video processing network and a second video processing network, wherein the first video processing network is used for processing a first target video to form a fusion feature vector matched with the first target video, and the second video processing network is used for processing a second target video to form a second target video fusion feature vector matched with the second target video. Further, the first video processing network may be formed of different sub-networks that process the features in the multi-modal features separately.
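A minimal sketch of the vector-fusion step performed by the first video processing network; concatenation followed by a fully connected projection is one plausible reading of the description, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Fuses basic, image, title-text and visual feature vectors into one fusion feature vector."""
    def __init__(self, dims=(96, 256, 256, 256), out_dim=256):
        super().__init__()
        self.proj = nn.Linear(sum(dims), out_dim)

    def forward(self, basic_vec, image_vec, text_vec, visual_vec):
        fused = torch.cat([basic_vec, image_vec, text_vec, visual_vec], dim=-1)
        return torch.relu(self.proj(fused))   # fusion feature vector of the first target video

head = FusionHead()
fusion_vec = head(torch.randn(1, 96), torch.randn(1, 256),
                  torch.randn(1, 256), torch.randn(1, 256))
```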
The following describes the different sub-networks in the first video processing network, respectively.
In some embodiments of the invention, the method further comprises:
extracting, through a text processing network, text feature vectors matched with the title text features; determining, through the text processing network, at least one word-level hidden variable corresponding to the title text features according to the text feature vectors; generating, through the text processing network, processing words corresponding to the word-level hidden variables and the selection probabilities of the processing words according to the at least one word-level hidden variable; and selecting at least one processing word according to the selection probabilities of the processing words to form a text processing result corresponding to the title text features. In this way, the method and the device not only process the title text features of the target text through the text processing network to determine a suitable title of the first target video for display, but also process the title text features in the multi-modal features to form the corresponding title text feature vectors.
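The following is a hedged sketch of selecting processing words from word-level hidden variables according to their selection probabilities; the greedy selection strategy and the toy vocabulary are assumptions for illustration only.

```python
import torch

def select_processing_words(hidden_states: torch.Tensor,
                            output_proj: torch.nn.Linear,
                            vocab: list) -> list:
    """hidden_states: (seq_len, hidden_dim) word-level hidden variables from the text network."""
    logits = output_proj(hidden_states)        # (seq_len, vocab_size)
    probs = torch.softmax(logits, dim=-1)      # selection probability of each candidate word
    picked = probs.argmax(dim=-1)              # greedily keep the most probable word per position
    return [vocab[i] for i in picked.tolist()]

vocab = ["<pad>", "journey", "west", "daughter", "kingdom"]
proj = torch.nn.Linear(16, len(vocab))
words = select_processing_words(torch.randn(3, 16), proj, vocab)
```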
In some embodiments of the present invention, the text processing network may be a bidirectional attention neural network model (BERT, Bidirectional Encoder Representations from Transformers). With continued reference to fig. 5, fig. 5 is a schematic diagram of an optional structure of the text processing network in an embodiment of the present invention, where the encoder includes N = 6 identical layers, each containing two sub-layers. The first sub-layer is a multi-head attention layer, followed by a simple fully connected layer. A residual connection and normalization are added to each sub-layer.
The decoder likewise consists of N = 6 identical layers, but these layers are not identical to those of the encoder: each comprises three sub-layers, namely a self-attention layer, an encoder-decoder attention layer and, finally, a fully connected layer. The first two sub-layers are both based on multi-head attention.
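The layer layout described above follows the standard Transformer encoder layer; a minimal sketch using PyTorch's built-in multi-head attention is given below, with the commonly cited default dimensions assumed.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One of the N = 6 identical encoder layers: multi-head attention, then a feed-forward
    sub-layer, each wrapped with a residual connection and layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention sub-layer
        x = self.norm1(x + attn_out)       # residual connection + normalization
        x = self.norm2(x + self.ff(x))     # fully connected sub-layer, same treatment
        return x

encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
out = encoder(torch.randn(1, 10, 512))     # (batch, m words, 512)
```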
With continued reference to fig. 6, fig. 6 is a schematic diagram illustrating a process for determining a word-level hidden variable of the text processing network in the embodiment of the present invention, where both the encoder and decoder portions include 6 encoder layers and 6 decoder layers. The input to the first encoder combines the word embedding and the positional embedding. After passing through the 6 encoders, the output is fed to each decoder of the decoder portion. For example, the input title "Journey to the West, 1986 edition, Episode 35, Kingdom of Daughters" is processed by the text processing network, and the output word-level hidden-variable result is: "Journey to the West - Kingdom of Daughters".
With continued reference to fig. 7, fig. 7 is an alternative structural diagram of an encoder in a text processing network in an embodiment of the present invention, where the input consists of queries (Q) and keys (K) of dimension d and values (V) of dimension d; the dot products of the query with all keys are computed, and a softmax function is applied to obtain the weights of the values.
With continued reference to FIG. 7, FIG. 7 also shows the vector representation of the encoder in the text processing network in the embodiment of the present invention, where Q, K and V are obtained by multiplying the input vector x of the encoder by W^Q, W^K and W^V, respectively. In the paper, the dimensions of W^Q, W^K and W^V are (512, 64); suppose the dimension of the input is (m, 512), where m represents the number of words. The dimensions of Q, K and V obtained after multiplying the input vector by W^Q, W^K and W^V are then (m, 64).
With continued reference to fig. 8, fig. 8 is a schematic diagram of vector concatenation of an encoder in a text processing network according to an embodiment of the present invention, where Z0 to Z7 are the corresponding 8 parallel heads (each of dimension (m, 64)); concatenating the 8 heads yields dimension (m, 512). After the final multiplication with W^O, an output matrix of dimension (m, 512) is obtained, which is consistent with the dimension entering the next encoder.
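The dimension bookkeeping in the preceding paragraphs can be checked with a short script; the shapes of W^Q, W^K, W^V (512, 64), the 8 heads, and W^O (512, 512) follow the description above, while the scaling by the square root of the head dimension is the standard formulation and is assumed here.

```python
import torch
import math

m, d_model, d_head, n_heads = 10, 512, 64, 8
x = torch.randn(m, d_model)                      # m words, each a 512-dim vector

heads = []
for _ in range(n_heads):
    W_Q, W_K, W_V = (torch.randn(d_model, d_head) for _ in range(3))
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V          # each (m, 64)
    weights = torch.softmax(Q @ K.T / math.sqrt(d_head), dim=-1)  # dot products of queries with all keys
    heads.append(weights @ V)                    # weighted sum of values, (m, 64)

Z = torch.cat(heads, dim=-1)                     # concatenation of 8 heads -> (m, 512)
W_O = torch.randn(d_model, d_model)
out = Z @ W_O                                    # (m, 512), matches the next encoder's input
print(out.shape)                                 # torch.Size([10, 512])
```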
With continued reference to fig. 9, fig. 9 is a schematic diagram of the encoding process of the encoder in the text processing network according to the embodiment of the present invention, in which x1 passes through self-attention to reach the state z1. The tensor that has passed through self-attention then goes through a residual connection and Layer Norm, and the output of the fully connected feed-forward network undergoes the same operations of residual processing and normalization. The tensor finally output can enter the next encoder; this is iterated 6 times, and the result of the iterative processing enters the decoder.
With continued reference to fig. 10, fig. 10 is a schematic diagram of the decoding process of the decoder in the text processing network according to the embodiment of the present invention, where the input, output and decoding process of the decoder are as follows:
Output: the probability distribution of the output word corresponding to position i;
Input: the output of the encoder and the output of the decoder corresponding to position i-1. Therefore, the middle attention is not self-attention: its K and V come from the encoder, while Q comes from the output of the decoder at the previous position.
With continued reference to fig. 11 and fig. 12, fig. 11 is a schematic diagram of the decoding process of the decoder in the text processing network according to the embodiment of the present invention, in which the vector output by the last decoder of the decoder network passes through a Linear layer and a softmax layer. Fig. 12 is a schematic diagram of the decoding process of the decoder in the text processing network according to the embodiment of the present invention, where the Linear layer maps the vector from the decoder portion into a logits vector, the softmax layer then converts the logits vector into probability values, and finally the position of the maximum probability value is found, thereby completing the output of the decoder.
In some embodiments of the invention, the method further comprises:
acquiring an image to be processed and a target resolution corresponding to a playing interface of the first target video;
and responding to the target resolution, performing resolution enhancement processing on the image to be processed through an image processing network in the first video processing network, and acquiring a corresponding image feature vector, so that the image feature vector is matched with the target resolution corresponding to the playing interface of the first target video. In this way, the image to be processed is processed through the image processing network to determine a suitable cover image for the first target video, and the image features in the multi-modal features are processed to form the corresponding image feature vectors.
Referring to fig. 13, fig. 13 is a schematic diagram of an alternative structure of the image processing network according to an embodiment of the present invention. The encoder may include a convolutional neural network; after an image feature vector is input into the encoder, a frame-level image feature vector corresponding to the image feature vector is output. Specifically, the image feature vector is input into the encoder, that is, into the convolutional neural network in the encoder; the frame-level image feature vector corresponding to the image feature vector is extracted through the convolutional neural network, which outputs the extracted frame-level image feature vector as the output of the encoder, and the image feature vector output by the encoder is then used to perform the corresponding image semantic recognition. Alternatively, the encoder may include a convolutional neural network and a recurrent neural network; after the image feature vector is input into the encoder, a frame-level image feature vector carrying timing information corresponding to the image feature vector is output, as shown in the encoder in fig. 13. Specifically, the image feature vector is input into the encoder, that is, into the convolutional neural network (for example, the CNN network in fig. 13) in the encoder; the frame-level image feature vector corresponding to the image feature vector is extracted by the convolutional neural network and then input into the recurrent neural network in the encoder (corresponding to structures such as hi-1 and hi in fig. 13), which extracts and fuses the timing information of the convolutional feature vectors. The recurrent neural network outputs the image feature vector carrying the timing information as the output of the encoder, and the corresponding processing steps are then executed by using the image feature vector output by the encoder.
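A hedged sketch of the encoder variant that combines a convolutional network with a recurrent network to produce frame-level image feature vectors carrying timing information; the backbone, the use of a GRU, and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """A CNN extracts a per-frame feature; a recurrent network fuses timing information across frames."""
    def __init__(self, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(32 * 4 * 4, feat_dim),
        )
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frames):               # frames: (batch, T, 3, H, W)
        b, t = frames.shape[:2]
        per_frame = self.cnn(frames.flatten(0, 1)).view(b, t, -1)  # frame-level image features
        seq_feats, _ = self.rnn(per_frame)                          # features carrying timing information
        return seq_feats                                            # (batch, T, hidden_dim)

enc = FrameEncoder()
video_feats = enc(torch.randn(1, 8, 3, 64, 64))
```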
In some embodiments of the invention, the method further comprises:
determining code rate information matched with the playing environment of the first target video; and adjusting, through a visual processing network in the first video processing network, the code rate of the first target video by using the visual features in the multi-modal features, so that the code rate of the first target video matches the code rate information of the playing environment. In this way, the visual information is processed through the visual processing network to determine a suitable dynamic code rate for the first target video, and the visual features in the multi-modal features are processed to form the corresponding visual feature vectors.
Referring to fig. 14, fig. 14 is an optional structural schematic diagram of the visual processing network in an embodiment of the present invention. The dual-stream long short-term memory network may include a bidirectional vector model, an attention model, fully connected layers, and a sigmoid classifier. The bidirectional vector model performs recursive processing on the different feature vectors in the input visual feature vector set and combines the processed feature vectors into a longer vector; for example, the associated visual feature vectors are combined into a longer vector, and the two combined vectors are combined again into a still longer vector (a local aggregation vector). Two fully connected layers then map the obtained distributed feature representation to the corresponding sample label space to improve the accuracy of the final code rate, and the sigmoid classifier finally determines the probability values of the visual features for the respective labels; the text processing result is integrated to form new text information corresponding to the visual feature information.
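A rough sketch of such a structure is given below, assuming PyTorch; a bidirectional LSTM stands in for the bidirectional vector model, and the hidden sizes and number of code-rate labels are assumptions made only for illustration.

```python
# Hedged sketch of the structure in Fig. 14: bidirectional LSTM, attention pooling,
# two fully connected layers, and a sigmoid classifier over code-rate labels.
import torch
import torch.nn as nn

class VisualRateClassifier(nn.Module):
    def __init__(self, feature_dim=128, hidden_dim=128, num_labels=4):
        super().__init__()
        self.bilstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attention = nn.Linear(2 * hidden_dim, 1)      # attention model
        self.fc1 = nn.Linear(2 * hidden_dim, 256)           # fully connected layers
        self.fc2 = nn.Linear(256, num_labels)

    def forward(self, visual_feats):                         # (batch, time, feature_dim)
        states, _ = self.bilstm(visual_feats)                # (batch, time, 2*hidden_dim)
        weights = torch.softmax(self.attention(states), dim=1)
        pooled = (weights * states).sum(dim=1)               # local aggregation vector
        hidden = torch.relu(self.fc1(pooled))
        return torch.sigmoid(self.fc2(hidden))               # probability value per label
```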
Here, the batch size of the convolutional neural network model is set to 32 or 64, the adaptive optimizer (Adam) is chosen with an initial learning rate of 0.0001, and the dropout rate is set to 0.2. After 100,000 training iterations, the accuracy on both the training set and the test set stabilizes above 90%, indicating that the model fits the task scenario and achieves a satisfactory training effect; all parameters of the convolutional neural network model are then fixed in this state, so that the code rate of the first target video can be adjusted to match the code rate information of the playing environment.
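The following is a minimal training-configuration sketch built around the hyper-parameters quoted above; the model, data, and loss are placeholders, and only the batch size, Adam learning rate, dropout rate, and iteration count come from the text.

```python
# Illustrative training setup: batch size 32 (or 64), Adam with lr=0.0001,
# dropout 0.2, and 100,000 training iterations.
import torch

batch_size = 32
learning_rate = 1e-4
dropout_rate = 0.2
num_iterations = 100_000

# Placeholder model; the real network is the one described in the embodiment.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Dropout(dropout_rate),
    torch.nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
loss_fn = torch.nn.BCEWithLogitsLoss()

for step in range(num_iterations):
    features = torch.randn(batch_size, 128)        # placeholder batch
    labels = torch.randint(0, 2, (batch_size, 1)).float()
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
```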
Step 304: and the processing device based on the video information processing model determines a fusion feature vector matched with the first target video through a first video processing network in the video information processing model based on the basic feature and the multi-modal feature.
The fusion feature vector is used for being combined with a second target video fusion feature vector output by a second video processing network in the video information processing model to realize a process matched with the video information processing model.
The description of the video information processing method based on a video information processing model provided in the embodiment of the present invention continues with reference to fig. 15. Fig. 15 is an optional flowchart of the method, and it can be understood that the steps illustrated in fig. 15 may be executed by various electronic devices running the video information processing apparatus of the video information processing model, for example a dedicated terminal, a server, or a server cluster with the video information processing function of the video information processing model, in order to determine the basic features and the multi-modal features matched with the first target video and thereby determine the model parameters matched with the video information processing model. The method specifically includes the following steps:
step 1501: a video information processing apparatus of a video information processing model determines a type of a process matching the video information processing model.
Step 1502: when the process matched with the video information processing model is a video recommendation process, the video information processing device of the video information processing model adjusts the parameters of the attention-based cyclic convolutional neural network in the first video processing network according to the second target video fusion feature vector output by the second video processing network in the video information processing model.
In this way, the parameters of the attention-based cyclic convolutional neural network are matched to the fusion feature vector.
Step 1503: the video information processing device of the video information processing model adjusts parameters of a second video processing network in the video information processing model.
Step 1504: and the video information processing device of the video information processing model determines a new second target video fusion characteristic vector through a second video processing network in the video information processing model after parameter adjustment.
Step 1505: and the video information processing device of the video information processing model connects the new second target video fusion characteristic vector with the fusion characteristic vector of the first target video through a classification prediction function matched with the video information processing model.
In this way, the degree of association between the first target video and the second target video can be determined.
Further, when the association degree exceeds the corresponding association degree threshold, the first target video may be recommended to the corresponding terminal; otherwise, another video is recommended in place of the current first target video.
The following describes the video information processing method provided by an embodiment of the present invention by taking a video recommendation scene in a short video playing interface as an example. Fig. 16 is a schematic view of the application environment of the video information processing method based on a video information processing model in an embodiment of the present invention. As shown in fig. 16, the short video playing interface may be displayed in a corresponding APP or triggered by a WeChat applet (after training, the video information processing model may be packaged in the corresponding APP or stored in the WeChat applet in plug-in form). As short video application products continue to develop and grow, the information-carrying capacity of video far exceeds that of text, and short videos can be continuously recommended to a user through the corresponding application program. Therefore, given a previously played video (i.e. the second target video referred to in the preceding embodiments), recommending the subsequent related video (i.e. the first target video in the preceding embodiments) is a very important link, and effective recommendation of the subsequent related video can markedly improve the user experience. The previously played video represented by the second target video may be a single video played before the first target video is displayed, or a set of several videos played before the first target video is displayed. In this process, the vectorized representation of video information is the basis of many machine learning algorithms.
In conventional technology, common learning methods include: 1) Representation based on video labels, including video classification, video subject, video distribution source and the like. Through the labels, a video is roughly divided into, for example, entertainment videos and sports videos, or subdivided into basketball highlights and film-and-television clips. However, this kind of representation is relatively coarse, the classification label information needs to be set in advance and updated in time, and its content representation capability is limited. 2) Text-based learning, which includes text semantic learning on video titles, video description information or video labels. This kind of learning depends heavily on the accuracy of the text information, but many videos suffer from missing text information, so the video representation becomes inaccurate; both approaches therefore degrade the user experience.
Fig. 17 is a schematic diagram of the working process of the video information processing method based on a video information processing model according to an embodiment of the present invention, and fig. 18 is a schematic diagram of the structure of the video information processing model according to the embodiment of the present invention. The working process of the video information processing method is described below with reference to the structural diagram shown in fig. 18, and specifically includes the following steps:
step 1701: and acquiring the related video pair annotation in the video data source.
The related video pairs can be obtained from a video data source of a video server by simply calculating text relevance between titles. It should be noted that such related video pairs are not directional; testing personnel therefore need to manually label part of the seed data as directional data pairs, that is, video A can recommend video B but not the other way around. Furthermore, the video data can be processed using the learning and prediction results of the video information processing model, the quality of the video pairs can be spot-checked, and the labeled data pairs can be gradually expanded. After the positive samples are obtained, video pairs of the same magnitude can be randomly sampled as negative examples for training the video information processing model.
Step 1702: and acquiring multi-modal characteristics of the video matched with the first target video.
The video features are summarized into two categories, namely basic features and multi-modal features. The basic features mainly provide a definitional description of the video and include: the multi-level classification category of the video, video tags, video publishing source, video duration, publishing time and event city. The basic features are a qualitative description of the video, but carry relatively little information about the video content.
The multi-modal features are extracted from the title text, the image information, and the visual information of the video. The content information, title and cover map describing the video influence the playing click-through rate of the video, and the visual frame image information influences the playing completion rate of the video.
The title features are extracted with a pre-trained natural language processing model; an optional structure of this pre-trained model is the bidirectional attention neural network BERT (Bidirectional Encoder Representations from Transformers). The video title sentence is fed into the model to obtain a 64-dimensional title feature vector (the dimension can be customized). The BERT model further improves the generalization capability of the word vector model and provides sentence-level representation capability.
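A hedged sketch of such title-feature extraction is shown below; the checkpoint name "bert-base-chinese" and the linear projection down to 64 dimensions are assumptions for illustration, not details fixed by the patent.

```python
# Sketch: 64-dimensional title vector from a pre-trained BERT (Hugging Face transformers).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese").eval()
project = torch.nn.Linear(bert.config.hidden_size, 64)           # customizable dimension

def title_vector(title: str) -> torch.Tensor:
    inputs = tokenizer(title, return_tensors="pt", truncation=True, max_length=32)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state[:, 0]           # [CLS] sentence representation
    return project(hidden)                                        # (1, 64) title feature vector
```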
The cover map features are extracted with a pre-trained convolutional neural network based on the deep residual network ResNet50, which turns the cover image of the video into a 128-dimensional feature vector. ResNet is currently a widely used network for image feature extraction and is well suited to representing cover map information. The cover image strongly attracts the user's attention before the video is watched, and a reasonable, content-relevant cover image can noticeably improve the playing click-through rate of the video.
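A minimal sketch of this step is given below, assuming a recent torchvision; the 2048-to-128 projection layer and the preprocessing pipeline are assumptions added for illustration.

```python
# Sketch: 128-dimensional cover-image features from a pre-trained ResNet50.
import torch
from torchvision import models, transforms
from PIL import Image

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()          # keep the 2048-d pooled features
backbone.eval()
project = torch.nn.Linear(2048, 128)       # assumed projection to the 128-d cover vector

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def cover_vector(path: str) -> torch.Tensor:
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feat = backbone(image)              # (1, 2048)
    return project(feat)                    # (1, 128) cover-map feature vector
```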
The visual features are extracted with NetVLAD (Vector of Locally Aggregated Descriptors) applied to the video frames, which aggregates the video frame images into a 128-dimensional feature vector. During video viewing, the frame information reflects the specific content and quality of the video and correlates directly with the user's viewing duration.
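The sketch below shows one way a NetVLAD-style aggregation of per-frame features into a single 128-dimensional visual vector could look; the per-frame feature dimension, the number of clusters, and the final projection are assumptions, not values given in the text.

```python
# Minimal NetVLAD-style aggregation of per-frame features into one video-level vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    def __init__(self, feature_dim=512, num_clusters=8, out_dim=128):
        super().__init__()
        self.assign = nn.Linear(feature_dim, num_clusters)       # soft-assignment logits
        self.centroids = nn.Parameter(torch.randn(num_clusters, feature_dim))
        self.project = nn.Linear(num_clusters * feature_dim, out_dim)

    def forward(self, frames):                                   # frames: (batch, time, feature_dim)
        soft = F.softmax(self.assign(frames), dim=-1)            # (batch, time, clusters)
        residual = frames.unsqueeze(2) - self.centroids          # (batch, time, clusters, dim)
        vlad = (soft.unsqueeze(-1) * residual).sum(dim=1)        # aggregate over the time axis
        vlad = F.normalize(vlad, dim=-1).flatten(1)              # intra-normalize, then flatten
        return self.project(F.normalize(vlad, dim=-1))           # (batch, 128) visual feature vector
```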
Step 1703: and processing the first target video information through a video information processing model to form matched video multi-modal information.
With continued reference to the schematic structural diagram of the video information processing model shown in fig. 18, the multi-modal information shown in fig. 18 can be divided into four domains, namely basic information, cover image information, title information, and visual information. The ID-class features in the basic information are learned in the video information processing model through embedding layers initialized from one-hot features, while the cover image, title vector and visual information are obtained by the pre-training approaches mentioned in the preceding steps. Each domain is reconstructed by a fully connected layer into a 128-dimensional vector, and the vectors of the four domains are concatenated to generate a 128 × 4 = 512-dimensional vector, which is then passed through two fully connected layers: the first reduces the dimensionality to 256, and the second generates the final 128-dimensional video vector representation.
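A sketch of this fusion head follows; the ReLU between the two fully connected layers is an assumption, while the 128 × 4 → 512 → 256 → 128 dimensions match the description above.

```python
# Sketch of the fusion head: four 128-d domain vectors concatenated to 512-d,
# then reduced to 256 and finally to a 128-d video representation.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, domain_dim=128, num_domains=4):
        super().__init__()
        self.fc1 = nn.Linear(domain_dim * num_domains, 256)   # first fully connected layer
        self.fc2 = nn.Linear(256, 128)                        # final 128-d video vector

    def forward(self, basic, cover, title, visual):           # each: (batch, 128)
        x = torch.cat([basic, cover, title, visual], dim=-1)  # (batch, 512)
        return self.fc2(torch.relu(self.fc1(x)))              # (batch, 128)
```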
Continuing with fig. 18, the attention mechanism shown in fig. 18 will now be described. An attention mechanism over the previously played video is added so that, when generating each feature domain of the related video to be predicted, the weights of the respective vectors are learned. The attention (Attention) flow is as follows:
1) firstly, the weight of the second target video vector to the related video in the domain is calculated and expressed as
$$e_i = Q_{abv} \cdot K_{rel,i}$$
2) Followed by weight normalization using softmax
$$\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}$$
3) Finally, the normalized weights are applied to the corresponding key values to obtain the vector representation under the attention mechanism, $c = \sum_i \alpha_i K_{rel,i}$.
Here $Q_{abv}$ represents the second target video vector and $K_{rel}$ represents the vectors learned by the embedding layer network (embedding) for the related video. After the attention mechanism has been applied to each domain, the learning of the second target video vector and the related video vectors is connected through this weighted combination.
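The per-domain attention step can be sketched as follows; a dot-product score between the query and keys is assumed here, and the keys double as values, since the original formula image only describes the steps in words.

```python
# Sketch of one domain's attention: scores between the second-target-video query Q_abv
# and the related-video keys K_rel, softmax weights, then a weighted sum.
import torch
import torch.nn.functional as F

def domain_attention(q_abv: torch.Tensor, k_rel: torch.Tensor) -> torch.Tensor:
    # q_abv: (dim,) second target video vector; k_rel: (num_items, dim) related-video vectors
    scores = k_rel @ q_abv                     # step 1: unnormalized weights e_i
    weights = F.softmax(scores, dim=0)         # step 2: softmax normalization
    return weights @ k_rel                     # step 3: weighted vector representation
```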
Further, to reflect the difference in information content between the second target video and the related video, the video outputs are re-expressed with different dimensionality. On top of the 128-dimensional video vector, the related video is re-expressed as a new 128-dimensional vector (high embedding), while the second target video only generates a new 64-dimensional vector (low embedding), which reduces the information content of the second target video and relatively increases that of the related video. The two embeddings are then spliced together and split into two 64-dimensional intermediate vectors. Finally, the two intermediate vectors are spliced and passed through a sigmoid classification function to predict whether the second target video can point to the related video.
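One way this asymmetric re-embedding and prediction head could be realized is sketched below; the quoted 128-d and 64-d sizes come from the text, but the specific layer choices for splicing, splitting, and classification are assumptions.

```python
# Sketch of the asymmetric re-embedding head: 128-d "high" embedding for the related
# video, 64-d "low" embedding for the second target video, splice, re-split into two
# 64-d intermediate vectors, splice again, and apply a sigmoid classifier.
import torch
import torch.nn as nn

class DirectionHead(nn.Module):
    def __init__(self, video_dim=128):
        super().__init__()
        self.high = nn.Linear(video_dim, 128)    # related-video re-expression (high embedding)
        self.low = nn.Linear(video_dim, 64)      # second-target-video re-expression (low embedding)
        self.split = nn.Linear(128 + 64, 128)    # produces the two 64-d intermediate vectors
        self.classify = nn.Linear(128, 1)        # sigmoid classification layer

    def forward(self, second_target, related):               # each: (batch, 128)
        joint = torch.cat([self.high(related), self.low(second_target)], dim=-1)
        mid = self.split(joint)                               # (batch, 128) = two 64-d halves
        return torch.sigmoid(self.classify(mid))             # probability that the recommendation holds
```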
Classification prediction sigmoid function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The loss function is a weighted cross-entropy loss:
$$L = -\sum_{k} a_k \left[\, y_k \log p_k + (1 - y_k)\log(1 - p_k) \,\right], \qquad p_k = \sigma(\theta_k)$$
where $\theta_k$ represents the input of the k-th sample, $p_k$ represents the estimated classification of the k-th sample, and $y_k$ represents the actual classification of the k-th sample. $a_k$ is the sample weight, proportional to the number of times the second target video and the related video appear.
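The weighted cross-entropy can be sketched directly from these definitions; the epsilon term is an assumption added only for numerical stability.

```python
# Sketch of the weighted cross-entropy loss, with per-sample weights a_k
# proportional to how often the video pair appears.
import torch

def weighted_cross_entropy(p: torch.Tensor, y: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    # p: predicted probabilities, y: actual labels (0 or 1), a: per-sample weights
    eps = 1e-8
    loss = -(y * torch.log(p + eps) + (1 - y) * torch.log(1 - p + eps))
    return (a * loss).mean()
```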
The beneficial technical effects are as follows:
1) Compared with the prior art, the technical solution provided by this application processes the video information of the first target video into matched multi-modal video information and fuses the multi-modal features of the first target video, so that the characteristics of the first target video can be expressed better and subsequent operations on the first target video are facilitated.
2) When the multi-modal video information is used for video pushing or prediction, the second target video and the related video can be distinguished effectively, along with the different information representations contained in their respective representation dimensions. At the same time, the lack of extensibility of the traditional pure video-label approach is overcome, and directional recommendation can be performed according to temporally related videos that satisfy the relevance criterion, which improves the rationality of the recommended content and the user experience.
The above description is only exemplary of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (15)

1. A video information processing method based on a video information processing model, the method comprising:
acquiring a first target video, and analyzing the first target video to acquire video parameters of the first target video;
determining basic features matched with the first target video according to the video parameters of the first target video;
determining multi-modal characteristics matched with the first target video according to the video parameters of the first target video;
and determining a fusion feature vector matched with the first target video through a first video processing network in the video information processing model based on the basic feature and the multi-modal feature, wherein the fusion feature vector is used for being combined with a second target video fusion feature vector output by a second video processing network in the video information processing model to realize a process matched with the video information processing model.
2. The method of claim 1, wherein the parsing the first target video to obtain the video parameters of the first target video comprises:
analyzing the first target video to acquire label information of the first target video;
and analyzing the video information corresponding to the first target video according to the label information of the first target video so as to respectively acquire the video parameters of the first target video in a basic dimension and a multi-modal dimension.
3. The method of claim 2, wherein determining the base feature matching the first target video according to the video parameters of the first target video comprises:
based on video parameters of the first target video in a base dimension,
determining a category parameter, a video tag parameter and a video publishing source parameter corresponding to the first target video;
and respectively extracting the category parameter, the video tag parameter and the video publishing source parameter corresponding to the first target video to form a basic feature matched with the first target video.
4. The method of claim 2, wherein the determining multi-modal features matching the first target video according to the video parameters of the first target video comprises:
based on video parameters of the first target video in a multi-modal dimension,
determining title character parameters, image information parameters and visual information parameters corresponding to the first target video;
and respectively extracting and fusing the title character parameters, the image information parameters and the visual information parameters corresponding to the first target video to form multi-modal characteristics matched with the first target video.
5. The method according to any one of claims 1-4, wherein determining, by a first video processing network in the video information processing model, a fused feature vector matching the first target video based on the base features and the multi-modal features comprises:
processing the basic features through a basic information processing network in the first video processing network to form corresponding basic feature vectors;
processing, by an image processing network of the first video processing network, image features of the multi-modal features to form corresponding image feature vectors;
processing the title word features in the multi-modal features through a word processing network in the first video processing network to form corresponding title word feature vectors;
processing, by a visual processing network of the first video processing network, visual features of the multi-modal features to form corresponding visual feature vectors;
and performing vector fusion through the first video processing network based on the basic feature vector, the image feature vector, the title character feature vector and the visual feature vector to form a fusion feature vector matched with the first target video.
6. The method of claim 5, further comprising:
acquiring an image to be processed and a target resolution corresponding to a playing interface of the first target video;
and responding to the target resolution, performing resolution enhancement processing on the image to be processed through an image processing network in the first video processing network, and acquiring a corresponding image feature vector to realize that the image feature vector is matched with the target resolution corresponding to the playing interface of the first target video.
7. The method of claim 5, further comprising:
extracting character feature vectors matched with the title character features through a text processing network;
determining at least one word-level hidden variable corresponding to the title character features according to the character feature vector through the text processing network;
generating, by the text processing network, a processing word corresponding to the word-level hidden variable and a selected probability of the processing word according to the at least one word-level hidden variable;
and selecting at least one processing word to form a text processing result corresponding to the title character features according to the selected probability of the processing word.
8. The method of claim 5, further comprising:
determining code rate information matched with the playing environment of the first target video;
and adjusting the code rate of the first target video by using the visual characteristics in the multi-modal characteristics through a visual processing network in the first video processing network so as to realize that the code rate of the first target video is matched with the code rate information of the playing environment.
9. The method of claim 1, further comprising:
when the process matching the video information processing model is a video recommendation process,
and adjusting parameters of a cyclic convolution neural network based on the attention mechanism in the first video processing network according to a second target video fusion feature vector output by a second video processing network in the video information processing model so as to realize that the parameters of the cyclic convolution neural network based on the attention mechanism are matched with the fusion feature vector.
10. The method of claim 9, further comprising:
adjusting parameters of a second video processing network in the video information processing model;
determining a new second target video fusion characteristic vector through a second video processing network in the video information processing model after parameter adjustment;
and connecting the new second target video fusion feature vector with the fusion feature vector of the first target video through a classification prediction function matched with the video information processing model so as to determine the association degree of the first target video and the second target video.
11. A processing apparatus based on a video information processing model, the apparatus comprising:
the information transmission module is used for acquiring a first target video and analyzing the first target video to acquire video parameters of the first target video;
the information processing module is used for determining basic characteristics matched with the first target video according to the video parameters of the first target video;
the information processing module is used for determining multi-modal characteristics matched with the first target video according to the video parameters of the first target video;
and the information processing module is used for determining a fusion feature vector matched with the first target video through a first video processing network in the video information processing model based on the basic feature and the multi-modal feature, wherein the fusion feature vector is used for combining with a second target video fusion feature vector output by a second video processing network in the video information processing model to realize a process matched with the video information processing model.
12. The apparatus of claim 11,
the information processing module is used for analyzing the first target video to acquire the label information of the first target video;
the information processing module is used for analyzing the video information corresponding to the first target video according to the tag information of the first target video so as to respectively acquire the video parameters of the first target video in a basic dimension and a multi-modal dimension.
13. The apparatus of claim 12,
the information processing module is used for processing the first target video according to the video parameters of the first target video in the basic dimension,
the information processing module is used for determining a category parameter, a video tag parameter and a video publishing source parameter corresponding to the first target video;
the information processing module is used for respectively extracting the category parameter, the video tag parameter and the video publishing source parameter corresponding to the first target video to form a basic feature matched with the first target video.
14. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the video information processing method based on the video information processing model according to any one of claims 1 to 10 when executing the executable instructions stored in the memory.
15. A computer-readable storage medium storing executable instructions, wherein the executable instructions, when executed by a processor, implement the video information processing method based on the video information processing model according to any one of claims 1 to 10.
CN201911183241.3A 2019-11-27 2019-11-27 Video information processing method and device based on video information processing model Pending CN112861580A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911183241.3A CN112861580A (en) 2019-11-27 2019-11-27 Video information processing method and device based on video information processing model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911183241.3A CN112861580A (en) 2019-11-27 2019-11-27 Video information processing method and device based on video information processing model

Publications (1)

Publication Number Publication Date
CN112861580A true CN112861580A (en) 2021-05-28

Family

ID=75984868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911183241.3A Pending CN112861580A (en) 2019-11-27 2019-11-27 Video information processing method and device based on video information processing model

Country Status (1)

Country Link
CN (1) CN112861580A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115460433A (en) * 2021-06-08 2022-12-09 京东方科技集团股份有限公司 Video processing method and device, electronic equipment and storage medium
CN115460433B (en) * 2021-06-08 2024-05-28 京东方科技集团股份有限公司 Video processing method and device, electronic equipment and storage medium
WO2023035610A1 (en) * 2021-09-09 2023-03-16 中山大学 Video question-answering method and system based on keyword perception multi-modal attention

Similar Documents

Publication Publication Date Title
CN111191078B (en) Video information processing method and device based on video information processing model
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
US20210256051A1 (en) Theme classification method based on multimodality, device, and storage medium
CN111400591B (en) Information recommendation method and device, electronic equipment and storage medium
CN110795552B (en) Training sample generation method and device, electronic equipment and storage medium
CN110956018B (en) Training method of text processing model, text processing method, text processing device and storage medium
CN111488931B (en) Article quality evaluation method, article recommendation method and corresponding devices
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
CN110704601A (en) Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network
CN111739520B (en) Speech recognition model training method, speech recognition method and device
Li et al. Residual attention-based LSTM for video captioning
CN113886626B (en) Visual question-answering method of dynamic memory network model based on multi-attention mechanism
CN114282047A (en) Small sample action recognition model training method and device, electronic equipment and storage medium
CN111144093A (en) Intelligent text processing method and device, electronic equipment and storage medium
CN112163560A (en) Video information processing method and device, electronic equipment and storage medium
CN111985243A (en) Emotion model training method, emotion analysis device and storage medium
CN110852071B (en) Knowledge point detection method, device, equipment and readable storage medium
CN116977701A (en) Video classification model training method, video classification method and device
CN112861580A (en) Video information processing method and device based on video information processing model
CN115759062A (en) Knowledge injection-based text and image pre-training model processing method and text and image retrieval system
CN111898704A (en) Method and device for clustering content samples
CN116935170A (en) Processing method and device of video processing model, computer equipment and storage medium
CN111125323A (en) Chat corpus labeling method and device, electronic equipment and storage medium
CN114661951A (en) Video processing method and device, computer equipment and storage medium
CN116127080A (en) Method for extracting attribute value of description object and related equipment

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40043886

Country of ref document: HK

SE01 Entry into force of request for substantive examination