CN112036276B - Artificial intelligent video question-answering method - Google Patents

Artificial intelligent video question-answering method

Info

Publication number
CN112036276B
CN112036276B (application CN202010839563.5A)
Authority
CN
China
Prior art keywords
features
feature
attention mechanism
question
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010839563.5A
Other languages
Chinese (zh)
Other versions
CN112036276A (en
Inventor
王田
李嘉锟
李泽贤
张奇鹏
彭泰膺
吕金虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202010839563.5A
Publication of CN112036276A
Application granted
Publication of CN112036276B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an artificial intelligence video question-answering method comprising the following steps: S1, acquiring semantic features; S2, extracting visual features and performing multi-modal fusion of the visual features and the semantic features to obtain fusion features; and S3, generating an answer according to the fusion features and the semantic features. The disclosed method has a small parameter count and a high running speed, and can correctly understand the question, the alternative answers and the logical relationships among the alternative answers, so that the accuracy of the obtained answers is significantly improved.

Description

Artificial intelligent video question-answering method
Technical Field
The invention relates to an artificial intelligence video question-answering method, and belongs to the field of artificial intelligence.
Background
With the rapid development of computer hardware and Internet technology, large-scale video data are being generated, and there is a growing need to use these data to analyse and understand the spatio-temporal scenes in video content.
Meanwhile, natural language is one of the most important tools of human society. When natural language is used to communicate with a computer, and the computer automatically reasons and outputs the corresponding answer for a given video by combining vision and natural language processing, the process is called video question answering; video question answering makes it possible to process video content rapidly.
Video question answering is more demanding than single-purpose video tasks. For example, a model for a human action recognition task generally only needs to recognize people and perform temporal modelling, without recognizing other objects such as vehicles and animals, and such a model therefore has poor adaptability and extensibility.
Video question answering also requires the model to understand the logical relationship between the question and the alternative answers, i.e., some questions provide multiple alternative answers. For such questions, existing work concatenates the question with a single answer to form a statement sentence and, treating this statement as a common and reasonable form of language expression, directly applies a pre-trained language model to understand the logical relationship between the question and the alternative answer. For example, the input to the model is the statement "What color is the car? Black" together with the corresponding video clip; the model then gives a confidence evaluation of the statement based on the combined information, and after the statement sentences formed from all candidate answers have been evaluated, the answer with the highest confidence is selected as the final answer.
However, this common practice ignores the logical relationships among the alternative answers. When the candidate answers contain a distractor that is very similar to the correct answer, such a confidence-evaluation algorithm is likely to give both a high and similar confidence, and is therefore likely to produce an incorrect answer.
Therefore, it is highly desirable to design a high-accuracy artificial intelligence video question-answering method.
Disclosure of Invention
In order to overcome the above problems, the present inventors have conducted intensive research and designed an artificial intelligence video question-answering method, which includes the following steps:
S1, obtaining semantic features;
S2, extracting visual features, and performing multi-modal fusion of the visual features and the semantic features to obtain fusion features;
and S3, generating an answer according to the fusion features and the semantic features.
Specifically, in step S1, the question is processed with the GLoVe word embedding model to obtain word vector representations, the word vectors are input into a two-layer LSTM model in word order, and the output of the LSTM model at the last time step is used as the semantic feature.
In a preferred embodiment, the GLoVe word embedding model is trained with the question together with all the alternative answers as its input.
More preferably, when the question and all the alternative answers are input together to the GLoVe word embedding model, the alternative answers are labeled so that the answers and the question are separated at the semantic level.
In step S2, the visual feature extraction comprises the following sub-steps:
s21, modeling a video image in a spatial dimension to obtain image characteristics;
s22, modeling image features in a time dimension, extracting time sequence features and obtaining visual features;
the image features are features of a scene in one frame of video, and the time sequence features are features of the scene in different frames of video.
Further, in step S21, an FPN model is established to obtain a target level feature, and a ResNet model is established to obtain a global feature of the image;
in step S22, the image features of each frame of image are subjected to time series analysis by the LSTM model, so as to obtain visual features.
According to the invention, in step S2, semantic features and visual feature information are fused through an attention mechanism, and the visual features are weighted according to the semantic features to complete visual feature information fusion.
In the present invention, the attention mechanism is expressed as follows:

α_i = softmax( f(T, V_i) ),  i = 1, 2, …, n

Ṽ = Σ_{i=1}^{n} α_i · V_i

wherein α = [α_1, α_2, …, α_i, …, α_n], V = [V_1, V_2, …, V_i, …, V_n], T is the semantic feature, V is the set of visual-related features, V_i is the i-th visual-related feature, f(·,·) is the fusion function of the semantic feature and the i-th visual-related feature, α_i is the weight of the i-th visual-related feature, Ṽ is the output weighted result, and n is the number of visual-related feature vectors.
Further, the attention mechanism comprises a spatial attention mechanism and a temporal attention mechanism;
after the image features are obtained, the spatial attention mechanism is applied to weight the image features of each frame;
in step S22, the temporal attention mechanism is applied in the LSTM model to weight the different frames;
in the spatial attention mechanism, the visual-related features V refer to different regions of the image features;
in the temporal attention mechanism, the visual-related features V refer to the images of different frames.
Furthermore, the importance of the visual-related features is normalized with a softmax function to obtain the weights α_i in the spatial attention mechanism and the temporal attention mechanism.
In a preferred embodiment of the present invention, a two-layer perceptron comprising two fully-connected layers is used as the fusion algorithm of the attention mechanism, and the spatial attention mechanism and the temporal attention mechanism are applied separately.
The artificial intelligent video question-answering method has the beneficial effects that:
(1) The artificial intelligent video question-answering method provided by the invention has good performance and obviously improved accuracy compared with other methods;
(2) According to the artificial intelligent video question-answering method provided by the invention, the questions, the alternative answers and the logic relationship among the alternative answers can be correctly understood;
(3) The artificial intelligent video question-answering method provided by the invention has the advantages of small parameter quantity and high operation speed.
Drawings
FIG. 1 is a flow diagram illustrating an artificial intelligence video question-answering method in accordance with a preferred embodiment;
FIG. 2 is a flow chart of the artificial intelligent video question-answering method in embodiment 1;
FIG. 3 shows the question forms in example 1.
Detailed Description
The invention is explained in further detail below with reference to the drawings. The features and advantages of the present invention will become more apparent from the description.
The invention provides an artificial intelligence video question-answering method, which comprises the following steps as shown in figure 1:
S1, obtaining semantic features;
S2, extracting visual features, and performing multi-modal fusion of the visual features and the semantic features to obtain fusion features;
and S3, generating an answer according to the fusion features and the semantic features.
In step S1, the questions in video question answering are expressed in natural language, and the semantic features are features that can characterize the question.
In the invention, the GLoVe word embedding model is used to process the question to obtain word vector representations, the word vectors are input into a two-layer LSTM model in word order, and the output of the LSTM model at the last time step is used as the semantic feature.
Compared with models such as ELMo and BERT, the GLoVe word embedding model has a small parameter count and a high computation speed. It is particularly suitable for video question answering, where the natural language consists of short sentences and common words: only a single question sentence with common words needs to be analysed, so the requirements can be met without an excessively large training scale.
Before the GLoVe word embedding model is used, the question and the answers need to be associated, and the model needs to be trained so that it can understand the meaning of the question.
In a preferred embodiment, the question is concatenated with a single candidate answer to form a statement sentence; since there are multiple candidate answers, multiple statement sentences are formed. All the statement sentences and the corresponding video clips are used as the model input, and the correct answer, preferably the serial number of the correct answer, is used as the output to train the GLoVe word embedding model. The trained GLoVe word embedding model can then perform confidence evaluation on the multiple statement sentences, the statement with the highest confidence is selected as the final answer, and the semantic features are obtained.
The inventors found that, with the above approach, when the candidate answers contain a distractor that is very similar to the correct answer, the distractor and the correct answer both receive high and similar confidences, so the error rate of answer generation is high; how to handle such distractors is the key difficulty addressed by the invention.
In a more preferred embodiment, the GLoVe word embedding model is trained with the question together with all the alternative answers as input to the GLoVe word embedding model and the correct answer as output.
Further, when the question and all the alternative answers are used together as the input of the GLoVe word embedding model, the alternative answers are labeled, for example by adding "<aa>" (alternative answer) in front of each alternative answer as an identifier, so that the answers and the question are separated at the semantic level.
By taking the question and all the alternative answers as input and marking the alternative answers, the model can correctly understand the question and the alternative answers and the logical relationship among the alternative answers, so that the correctness of the generated answers is improved.
The LSTM model is a special recurrent neural network model that can learn long-term dependencies. It was first proposed by Hochreiter & Schmidhuber in 1997 and adds several gating structures to the recurrent neural network (RNN), allowing information to persist so that semantically continuous output is produced by the recurrent network.
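As an illustration of this semantic branch, the following Python (PyTorch) sketch builds the "<aa>"-tagged input sequence, looks up pre-trained GLoVe vectors, and takes the last state of a two-layer LSTM as the semantic feature. It is a minimal sketch rather than the patented implementation: the names build_input_tokens and SemanticEncoder, the glove_vectors lookup table, the whitespace tokenisation, and the hidden size of 512 are assumptions introduced for illustration.

```python
# Minimal sketch of the semantic branch (assumptions: helper names, whitespace
# tokenisation, hidden size 512; GloVe vectors must be loaded by the caller).
import torch
import torch.nn as nn


def build_input_tokens(question: str, answers: list) -> list:
    """Concatenate the question with all alternative answers, each prefixed by '<aa>'."""
    tokens = question.lower().split()
    for ans in answers:
        tokens += ["<aa>"] + ans.lower().split()
    return tokens


class SemanticEncoder(nn.Module):
    def __init__(self, glove_vectors: dict, dim: int = 300, hidden: int = 512):
        super().__init__()
        self.glove = glove_vectors            # token -> pre-trained GloVe vector of size dim
        self.unk = torch.zeros(dim)           # fallback for out-of-vocabulary tokens
        self.lstm = nn.LSTM(dim, hidden, num_layers=2, batch_first=True)

    def forward(self, tokens: list) -> torch.Tensor:
        vecs = torch.stack([self.glove.get(t, self.unk) for t in tokens])  # (L, dim)
        _, (h_n, _) = self.lstm(vecs.unsqueeze(0))                         # h_n: (2, 1, hidden)
        return h_n[-1].squeeze(0)             # last-layer state at the last time step
```

For example, build_input_tokens("what does the man do before walking away", ["run", "jump", "sit", "stand", "wave"]) produces the tagged token sequence that the encoder consumes.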
In step S2, visual feature extraction consists of image feature extraction and time-sequence feature extraction, where the image features describe the scene within a single video frame and the time-sequence features describe the scene across different frames. In the present invention, visual feature extraction includes the following sub-steps:
S21, modeling the video images in the spatial dimension to obtain image features;
the image features include target-level features and global features of the image.
Different from traditional video question answering, the image features of the invention contain not only target-level features but also global features, so that the model can understand the video content comprehensively and obtain more complete and effective visual features for more complex question-answering content.
In the invention, modeling is carried out on each frame of image of the video in a space dimension, and the image characteristics of each frame of image are obtained through a model.
Specifically, an FPN model is established to obtain target level features, a ResNet model is established to obtain global features of an image, and further, a data set containing daily life scenes is used as training samples of the FPN model and the ResNet model.
Preferably, the COCO data set is used as the training sample of the FPN model; the COCO data set is a large image database designed by Microsoft for object detection, segmentation, human key-point detection, semantic segmentation and caption generation, and contains more than 200,000 images covering 80 categories of common everyday objects.
Preferably, the ImageNet data set is used as the training sample of the ResNet model; ImageNet is a large visual database for visual object recognition research, containing about 14,000,000 images covering 1,000 categories of common objects, such as various animals and vehicles.
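The following sketch illustrates this two-branch spatial modelling with off-the-shelf torchvision models as stand-ins. The patent does not name specific backbones beyond "ResNet" and "FPN", so the ImageNet-pretrained ResNet-50 and the COCO-pretrained Faster R-CNN FPN backbone used here are illustrative choices, and FrameEncoder is a hypothetical name.

```python
# Minimal sketch: global features from an ImageNet-pretrained ResNet, target-level
# (multi-scale) features from a COCO-pretrained FPN backbone. Stand-in models only.
import torch
import torch.nn as nn
import torchvision


class FrameEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights="DEFAULT")            # ImageNet weights
        self.global_cnn = nn.Sequential(*list(resnet.children())[:-2])     # drop avgpool + fc
        detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")  # COCO weights
        self.fpn_backbone = detector.backbone                              # ResNet-50 + FPN

    @torch.no_grad()
    def forward(self, frames: torch.Tensor):
        # frames: (B, 3, 224, 224), already normalised
        global_feat = self.global_cnn(frames)     # (B, 2048, 7, 7) global feature map
        pyramid = self.fpn_backbone(frames)       # dict of multi-scale target-level maps
        return global_feat, pyramid
```

In practice the pyramid maps would feed a detection head to obtain per-object (target-level) features, while global_feat can be average-pooled into a single 2048-dimensional vector per frame.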
And S22, modeling the image characteristics in a time dimension, extracting time sequence characteristics and obtaining visual characteristics.
The image features obtained in step S21 are based on single frames and capture only spatial-dimension visual information; after the spatial information is obtained, the time dimension of the images needs to be modeled to obtain the visual features.
Preferably, the image features of each frame are subjected to time-series analysis by an LSTM model to obtain the visual features.
The LSTM model takes the current-time input x_t and the previous state h_(t-1) as input and outputs the current state h_t, so the time-series analysis can be completed and the visual features obtained.
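A minimal sketch of this temporal modelling step is shown below; the 60-frame length, the 2048-dimensional per-frame features and the 512-dimensional hidden state are illustrative values only.

```python
# Minimal sketch: an LSTM consumes per-frame features in order; each state h_t
# depends on the current input x_t and the previous state h_(t-1).
import torch
import torch.nn as nn

frame_feats = torch.randn(1, 60, 2048)       # (batch, frames, per-frame feature dim), dummy input
temporal_lstm = nn.LSTM(2048, 512, num_layers=2, batch_first=True)
states, (h_n, c_n) = temporal_lstm(frame_feats)
# `states` holds h_t for every frame, shape (1, 60, 512); these per-frame states
# are what the temporal attention mechanism described below re-weights.
```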
Further, in step S2, the visual features and the semantic features need to be fused in a multi-modal manner: the visual feature information is further screened and interpreted with the semantic features, and visual information weakly associated with the semantic features is removed, thereby obtaining the fusion features.
The difficulty lies in how to fuse the visual features and the semantic features: if the two are simply concatenated and a fully-connected network is used to further extract features and output the answer, the result is poor, because the difference between the two feature vectors is ignored.
In the invention, semantic features and visual feature information are fused through an attention mechanism, and specifically, the visual features are weighted according to the semantic features to complete the fusion of the visual feature information.
Weighting the visual features according to the semantic features realizes information interaction between the visual features and the semantic features and screening of the visual information: the features in the visual information that are related to the question are enhanced, and the unrelated features are weakened.
The attention mechanism is a data-processing method in machine learning that is widely used in many types of machine learning tasks, such as natural language processing, image recognition and speech recognition; it can adjust the attention direction and the weighting model according to the specific task objective, and can be attached to various neural network models.
In the invention, the attention mechanism is used for weighting the features related to the semantic features in the visual feature information more heavily and weighting the features unrelated to the semantic features less heavily.
In a preferred embodiment, the attention mechanism may be expressed as follows:

α_i = softmax( f(T, V_i) ),  i = 1, 2, …, n

Ṽ = Σ_{i=1}^{n} α_i · V_i

wherein α = [α_1, α_2, …, α_i, …, α_n], V = [V_1, V_2, …, V_i, …, V_n], T is the semantic feature, V is the set of visual-related features, V_i is the i-th visual-related feature, f(·,·) is the fusion function of the semantic feature and the i-th visual-related feature, α_i is the weight of the i-th visual-related feature, Ṽ is the output weighted result, and n is the number of visual-related feature vectors.
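The formula can be read directly as code. The sketch below is only an illustration: the dot product used as the fusion function f is a placeholder (the patent's own f is the two-layer perceptron sketched further below), and the function name attend is hypothetical.

```python
# Minimal sketch of the attention formula: score each V_i against T with a fusion
# function f, normalise the scores with softmax, and return the weighted sum.
import torch
import torch.nn.functional as F


def attend(T: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """T: (d,) semantic feature; V: (n, d) visual-related features."""
    scores = V @ T                               # f(T, V_i), here a placeholder dot product
    alpha = F.softmax(scores, dim=0)             # weights alpha_i, summing to 1
    return (alpha.unsqueeze(1) * V).sum(dim=0)   # weighted result, shape (d,)
```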
In a preferred embodiment, the attention mechanism includes a spatial attention mechanism: after the image features are obtained in step S21, the spatial attention mechanism is applied to weight the image features of each frame, so that the image features related to the question in the spatial dimension are selected.
Further, in the spatial attention mechanism, the visual correlation features refer to different regions of the image features, and the different regions are weighted to obtain the image feature expression.
For example, if the image feature output by the ResNet model has 49 regions, then n = 49; the regions are denoted V_1, V_2, …, V_49, and their importance scores are α_1, α_2, …, α_49. The 49 regions are then weighted and combined according to the above formula to obtain the final image feature expression.
In a preferred embodiment, the attention mechanism includes a temporal attention mechanism: in step S22, the temporal attention mechanism is applied in the LSTM model to weight the different frames, so that the visual features related to the question in the time dimension are selected.
Further, in the temporal attention mechanism, the visual related features refer to images of different frames, and the images of different frames are weighted to obtain the visual feature expression.
For example, if the video is 60 frames long, then n = 60; the frames are denoted V_1, V_2, …, V_60, and their importance scores are α_1, α_2, …, α_60. The 60 frames are then weighted and combined according to the above formula to obtain the final visual feature expression.
In a preferred embodiment, a two-layer perceptron comprising two fully-connected layers is used as the fusion algorithm of the attention mechanism, and the spatial attention mechanism and the temporal attention mechanism are applied separately.
Specifically, after the semantic features and the visual features are concatenated, correlation analysis is performed by the two fully-connected layers; the feature dimension of the perceptron's middle layer is set to 256-1024, preferably 512, and the feature dimension of the output layer is set to 1, the output representing the importance of the visual feature.
Further, the importance of the visual-related features is normalized with a softmax function to obtain the weights α_i in the spatial attention mechanism and the temporal attention mechanism, and the visual-related features are weighted and summed according to these weights to obtain the fusion features.
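A minimal sketch of this perceptron-based attention is given below, using the dimensions quoted above (hidden size 512, output size 1). The class name PerceptronAttention, the ReLU between the two layers, and the 2048/512 feature sizes in the example are assumptions made for illustration; the same module can play the role of the spatial attention (n = 49 regions) or the temporal attention (n = 60 frames).

```python
# Minimal sketch: concatenate the semantic feature with each visual-related feature,
# score with a two-layer perceptron (hidden 512 -> 1), softmax, then weighted sum.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerceptronAttention(nn.Module):
    def __init__(self, sem_dim: int, vis_dim: int, hidden: int = 512):
        super().__init__()
        self.fc1 = nn.Linear(sem_dim + vis_dim, hidden)   # first fully-connected layer
        self.fc2 = nn.Linear(hidden, 1)                    # outputs one importance score

    def forward(self, T: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
        # T: (sem_dim,) semantic feature; V: (n, vis_dim) visual-related features
        joint = torch.cat([T.expand(V.size(0), -1), V], dim=1)  # pair T with every V_i
        scores = self.fc2(torch.relu(self.fc1(joint)))          # (n, 1) importance scores
        alpha = F.softmax(scores, dim=0)                        # normalised weights alpha_i
        return (alpha * V).sum(dim=0)                           # fused feature, shape (vis_dim,)


# Example: spatial attention over 49 ResNet regions of a single frame.
spatial_attn = PerceptronAttention(sem_dim=512, vis_dim=2048)
fused_region_feature = spatial_attn(torch.randn(512), torch.randn(49, 2048))
```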
In step S3, the semantic features and the fusion features are concatenated to construct a neural network; preferably, inference is performed with a double-layer fully-connected structure to generate the final answer to the question, thereby completing the video question-answering task.
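The answer-generation step can be sketched as follows; the class name AnswerHead, the hidden size of 512, the ReLU activation and the choice of five candidate answers are illustrative assumptions, not values specified by the patent.

```python
# Minimal sketch: concatenate semantic and fusion features, then score candidate
# answers with two fully-connected layers.
import torch
import torch.nn as nn


class AnswerHead(nn.Module):
    def __init__(self, sem_dim: int = 512, fused_dim: int = 512,
                 hidden: int = 512, num_answers: int = 5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(sem_dim + fused_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_answers),    # one score per alternative answer
        )

    def forward(self, semantic: torch.Tensor, fused: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([semantic, fused], dim=-1))


# Usage: the index of the highest-scoring answer is taken as the final answer.
head = AnswerHead()
scores = head(torch.randn(512), torch.randn(512))
answer_index = scores.argmax().item()
```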
Examples
Example 1
An experiment is carried out on the large-scale public video question-answering data set TGIF-QA. As shown in FIG. 2, in step S1 the question and all the alternative answers are used together as the input of the GLoVe word embedding model, with "<aa>" added in front of each alternative answer as an identifier; the GLoVe word embedding model is trained to obtain word vector representations, the word vectors are input into a double-layer LSTM model in word order, and the output of the LSTM model at the last time step is selected as the final semantic feature.
In step S2, a double-layer perceptron with two fully-connected layers is used as the fusion algorithm of the attention mechanism, with a middle-layer feature dimension of 512 and an output-layer feature dimension of 1. Specifically, a ResNet-50 model pre-trained on ImageNet is used to extract single-frame image features; the semantic features and the image features are concatenated and fed to the fully-connected layers, and the spatial attention mechanism is applied to weight the image features of each frame. The weighted multi-frame image features are input into a double-layer LSTM to further extract time-sequence features, the temporal attention mechanism is applied in the LSTM model, and the semantic features and the time-sequence features are concatenated and fed to the fully-connected layers to output the fusion features.
The importance of the visual-related features is normalized with a softmax function to obtain the spatial attention weights and the temporal attention weights.
In step S3, the original semantic features and the weighted visual features are concatenated, and inference is then performed with a double-layer fully-connected structure to generate the final answer to the question.
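The wiring order of this embodiment can be summarised in the self-contained sketch below (spatial attention per frame, double-layer LSTM, temporal attention, concatenation, two fully-connected layers). All class and function names, the 2048/512 feature sizes, the ReLU activations and the five candidate answers are illustrative assumptions, and the per-frame region features are passed in as dummy tensors rather than computed by a real ResNet-50.

```python
# Compact end-to-end sketch of the pipeline in this embodiment (dummy inputs).
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp_attention(query, values, scorer):
    """values: (n, d); query: (q,); scorer maps (q + d) -> 1 importance score per row."""
    joint = torch.cat([query.expand(values.size(0), -1), values], dim=1)
    alpha = F.softmax(scorer(joint), dim=0)          # softmax-normalised weights
    return (alpha * values).sum(dim=0)               # weighted sum of the values


class VideoQAPipeline(nn.Module):
    def __init__(self, sem_dim=512, region_dim=2048, lstm_dim=512, num_answers=5):
        super().__init__()
        self.spatial_scorer = nn.Sequential(nn.Linear(sem_dim + region_dim, 512),
                                            nn.ReLU(), nn.Linear(512, 1))
        self.temporal_lstm = nn.LSTM(region_dim, lstm_dim, num_layers=2, batch_first=True)
        self.temporal_scorer = nn.Sequential(nn.Linear(sem_dim + lstm_dim, 512),
                                             nn.ReLU(), nn.Linear(512, 1))
        self.answer_head = nn.Sequential(nn.Linear(sem_dim + lstm_dim, 512),
                                         nn.ReLU(), nn.Linear(512, num_answers))

    def forward(self, semantic, frame_regions):
        # semantic: (sem_dim,); frame_regions: (frames, regions, region_dim)
        frames = [mlp_attention(semantic, regions, self.spatial_scorer)
                  for regions in frame_regions]                    # spatial attention per frame
        states, _ = self.temporal_lstm(torch.stack(frames).unsqueeze(0))
        fused = mlp_attention(semantic, states.squeeze(0), self.temporal_scorer)
        return self.answer_head(torch.cat([semantic, fused], dim=-1))


# Dummy run: 60 frames with 49 regions each, 2048-dimensional region features.
model = VideoQAPipeline()
scores = model(torch.randn(512), torch.randn(60, 49, 2048))
print(scores.shape)   # torch.Size([5]) - one score per alternative answer
```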
The TGIF-QA data set contains about 160,000 questions and corresponding videos, divided into four categories by question type: counting questions (Count), frame questions (Frame), action questions (Action) and transition questions (Transition). The question categories and the amounts of data used for training and testing are shown in Table 1.
Table 1

Question type    Training    Testing    Total
Count            26,843      3,554      30,397
Frame            39,392      13,691     53,083
Action           20,475      2,274      22,749
Transition       52,704      6,232      58,936
Total            139,414     25,751     165,165
In the TGIF-QA data set, for a given video, a counting question asks how many times a person or animal performs some action, e.g. "How many times does the man blink his eyes?"; a frame question asks for static information available from a single frame, such as object color or object count, e.g. "What is the color of the cat?"; an action question asks what type of action the person or animal performs, e.g. "What does the girl do 3 times?"; and a transition question asks what action the person or animal takes before or after performing another action, e.g. "What does the man do before walking away?", as shown in FIG. 3.
Comparative example 1
The visual feature extraction of step S2 in example 1 is replaced with the method in the paper Jang Y, Song Y, Yu Y, et al. TGIF-QA: Toward spatio-temporal reasoning in visual question answering [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2758-2766, in which the image features are extracted with a ResNet model and the time-sequence features are the three-dimensional spatio-temporal features of a C3D model.
Comparative example 2
The visual feature extraction of step S2 in example 1 is replaced with the method in the paper Gao J, Ge R, Chen K, et al. Motion-appearance co-memory networks for video question answering [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6576-6585, in which the image features are extracted with a ResNet model and the time-sequence features are extracted with an optical-flow method.
Comparative example 3
The same experiment as in example 1 is performed on the TGIF-QA data set using the model in the paper Gao L, Zeng P, Song J, et al. Structured Two-stream Attention Network for Video Question Answering [C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33.
Comparative example 4
The same experiment as in example 1 is performed on the TGIF-QA data set using the model in the paper Jang Y, Song Y, Yu Y, et al. TGIF-QA: Toward spatio-temporal reasoning in visual question answering [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2758-2766. The same videos and the same questions as in example 1 are selected from the TGIF-QA data set, including counting questions, frame questions, action questions and transition questions.
Comparative example 5
The experiment was performed by the method in example 1 except that in step S2, the visual characteristics were obtained only by the ResNet model and the LSTM model without applying the attention mechanism.
Comparative example 6
The experiment was carried out using the method in example 1, except that in step S2, no spatial attention mechanism was applied.
Comparative example 7
An experiment was performed by the method in example 1 except that in step S2, no time attention mechanism was applied.
Comparative example 8
The experiment was carried out using the method of comparative example 4, except that no attention mechanism was applied.
Comparative example 9
The experiment was carried out using the method of comparative example 4, except that no spatial attention mechanism was applied.
Comparative example 10
The experiment was carried out using the method of comparative example 4, except that no temporal attention mechanism was applied.
Experimental example 1
The frame questions, action questions and transition questions use accuracy as the evaluation index, and the counting questions use mean square error; the higher the accuracy and the lower the mean square error, the better the performance of the method.
The experimental results of example 1, comparative example 1 and comparative example 2 are analysed and shown in Table 2:
Table 2 (the experimental results are presented as an image in the original publication and are not reproduced here)
As can be seen from Table 2, the method of example 1 has a lower mean square error and higher accuracy, which indicates that the method of example 1 is effective and performs well.
Experimental example 2
An ablation experiment was carried out based on the method of example 1. Specifically, the form of the text input in step S1 was varied: before the GLoVe word embedding model was used, the following two input forms were adopted respectively:
taking the question and all the alternative answers together as the input of the GLoVe word embedding model; and
concatenating the question with a single alternative answer to form a statement sentence, forming multiple statement sentences from the multiple alternative answers, and using all the statement sentences and the corresponding video clips as the model input.
The experiments were carried out in these two ways, and the results are shown in Table 3.
Table 3

Text input form                         Action question accuracy (%)    Transition question accuracy (%)
Question + single alternative answer    58.85                           73.62
Question + all alternative answers      86.33                           96.68
As can be seen from Table 3, using the question and all the alternative answers together as the model input greatly improves the model performance.
Experimental example 3
The experimental results of example 1 and comparative examples 5 to 10 were analysed and are shown in Table 4.
Table 4 (the experimental results are presented as an image in the original publication and are not reproduced here)
From the results, it can be seen that the attention mechanism module used in comparative example 4 can reduce the model performance in some cases (compare comparative examples 8 to 10); for example, applying the spatial attention mechanism to the action questions reduces the accuracy by 2.8% (from 60.13% to 57.33%). The attention mechanism module designed in example 1 uses a double-layer perceptron to analyse the correlation between the visual information and the semantic information, so the model performance is improved across all question types, which shows that the attention mechanism design of example 1 is reasonable and effective.
In the description of the present invention, it should be noted that the terms "upper", "lower", "inner" and "outer" indicate the orientation or positional relationship based on the operation state of the present invention, and are only for convenience of description and simplification of description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and be operated, and thus should not be construed as limiting the present invention.
The present invention has been described above in connection with preferred embodiments, which are merely exemplary and illustrative. On the basis of the above, the invention can be subjected to various substitutions and modifications, and the substitutions and the modifications are all within the protection scope of the invention.

Claims (7)

1. An artificial intelligence video question-answering method comprises the following steps:
s1, obtaining semantic features;
s2, visual feature extraction, namely performing multi-mode fusion on the visual features and the semantic features to obtain fusion features;
s3, generating an answer according to the fusion feature and the semantic feature;
in the step S1, a GLoVe word embedding model is adopted to process the problem to obtain word vector expression, the word vector expression is input into a double-layer LSTM model according to the language order, and the output of the last moment state of the LSTM model is used as semantic features;
taking the question and all the alternative answers as the input of a GLoVe word embedding model, and training the GLoVe word embedding model;
when the question and all the alternative answers are used as the input of the GLoVe word embedding model together, the alternative answers are labeled, so that the answer and the question are divided at a semantic level.
2. The artificial intelligence video question-answering method according to claim 1,
in step S2, the visual feature extraction comprises the following sub-steps:
s21, modeling a video image in a spatial dimension to obtain image characteristics;
s22, modeling image features in a time dimension, extracting time sequence features and obtaining visual features;
the image features are features of a scene in one frame of video, and the time sequence features are features of the scene in different frames of video.
3. The artificial intelligence video question-answering method according to claim 2,
in step S21, an FPN model is established to obtain a target level feature, and a ResNet model is established to obtain a global feature of an image;
in step S22, the image features of each frame of image are subjected to time series analysis by the LSTM model, so as to obtain visual features.
4. The artificial intelligence video question-answering method according to claim 1,
in step S2, semantic features and visual feature information are fused through an attention mechanism, and the visual features are weighted according to the semantic features to complete visual feature information fusion.
5. The artificial intelligence video question-answering method according to claim 4,
the attention mechanism is expressed as follows:

α_i = softmax( f(T, V_i) ),  i = 1, 2, …, n

Ṽ = Σ_{i=1}^{n} α_i · V_i

wherein α = [α_1, α_2, …, α_i, …, α_n], V = [V_1, V_2, …, V_i, …, V_n], T is the semantic feature, V_i is the i-th visual-related feature, f(·,·) is the fusion function of the semantic feature and the i-th visual-related feature, α_i is the weight of the i-th visual-related feature, Ṽ is the output weighted result, and n is the number of visual-related feature vectors.
6. The artificial intelligence video question-answering method according to claim 5,
the attention mechanism comprises a space attention mechanism and a time attention mechanism;
after the image features are obtained, the spatial attention mechanism is applied to weight the image features of each frame;
in step S22, a time attention mechanism is applied in the LSTM model to weight different frame images;
in the spatial attention mechanism, the visually relevant features V refer to different regions of an image feature;
in the temporal attention mechanism, the visually relevant features V refer to images of different frames;
the importance of the visual-related features is normalized with a softmax function to obtain the weights α_i in the spatial attention mechanism and the temporal attention mechanism.
7. The artificial intelligence video question-answering method according to claim 4,
a double-layer perceptron comprising two fully-connected layers is adopted as the fusion algorithm of the attention mechanism, and the spatial attention mechanism and the temporal attention mechanism are carried out respectively.
CN202010839563.5A 2020-08-19 2020-08-19 Artificial intelligent video question-answering method Active CN112036276B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010839563.5A CN112036276B (en) 2020-08-19 2020-08-19 Artificial intelligent video question-answering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010839563.5A CN112036276B (en) 2020-08-19 2020-08-19 Artificial intelligent video question-answering method

Publications (2)

Publication Number Publication Date
CN112036276A CN112036276A (en) 2020-12-04
CN112036276B true CN112036276B (en) 2023-04-07

Family

ID=73577605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010839563.5A Active CN112036276B (en) 2020-08-19 2020-08-19 Artificial intelligent video question-answering method

Country Status (1)

Country Link
CN (1) CN112036276B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860847B (en) * 2021-01-19 2022-08-19 中国科学院自动化研究所 Video question-answer interaction method and system
CN113010656B (en) * 2021-03-18 2022-12-20 广东工业大学 Visual question-answering method based on multi-mode fusion and structural control
CN113128415B (en) * 2021-04-22 2023-09-29 合肥工业大学 Environment distinguishing method, system, equipment and storage medium
CN113536952B (en) * 2021-06-22 2023-04-21 电子科技大学 Video question-answering method based on attention network of motion capture
CN113837047B (en) * 2021-09-16 2022-10-28 广州大学 Video quality evaluation method, system, computer equipment and storage medium
CN114707022B (en) * 2022-05-31 2022-09-06 浙江大学 Video question-answer data set labeling method and device, storage medium and electronic equipment
CN117917696A (en) * 2022-10-20 2024-04-23 华为技术有限公司 Video question-answering method and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965705B2 (en) * 2015-11-03 2018-05-08 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering
CN108549658B (en) * 2018-03-12 2021-11-30 浙江大学 Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree
CN111008293A (en) * 2018-10-06 2020-04-14 上海交通大学 Visual question-answering method based on structured semantic representation
CN110727824B (en) * 2019-10-11 2022-04-01 浙江大学 Method for solving question-answering task of object relationship in video by using multiple interaction attention mechanism
CN110889340A (en) * 2019-11-12 2020-03-17 哈尔滨工程大学 Visual question-answering model based on iterative attention mechanism

Also Published As

Publication number Publication date
CN112036276A (en) 2020-12-04

Similar Documents

Publication Publication Date Title
CN112036276B (en) Artificial intelligent video question-answering method
CN110163299B (en) Visual question-answering method based on bottom-up attention mechanism and memory network
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN109919031B (en) Human behavior recognition method based on deep neural network
CN108829677B (en) Multi-modal attention-based automatic image title generation method
CN109783666B (en) Image scene graph generation method based on iterative refinement
Wang et al. Context modulated dynamic networks for actor and action video segmentation with language queries
CN110705490B (en) Visual emotion recognition method
Jing et al. Recognizing american sign language manual signs from rgb-d videos
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN114937066A (en) Point cloud registration system and method based on cross offset features and space consistency
CN113239153B (en) Text and image mutual retrieval method based on example masking
CN114662497A (en) False news detection method based on cooperative neural network
CN111368142A (en) Video intensive event description method based on generation countermeasure network
CN115512191A (en) Question and answer combined image natural language description method
CN112183465A (en) Social relationship identification method based on character attributes and context
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
CN115223021A (en) Visual question-answering-based fruit tree full-growth period farm work decision-making method
CN114241606A (en) Character interaction detection method based on adaptive set learning prediction
Jiang et al. Cross-level reinforced attention network for person re-identification
Ling et al. A facial expression recognition system for smart learning based on YOLO and vision transformer
CN117115911A (en) Hypergraph learning action recognition system based on attention mechanism
CN114511813B (en) Video semantic description method and device
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN116503753A (en) Remote sensing image scene classification method based on multi-mode airspace transformation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant