CN112036276B - Artificial intelligent video question-answering method - Google Patents

Artificial intelligent video question-answering method

Info

Publication number
CN112036276B
CN112036276B (application CN202010839563.5A)
Authority
CN
China
Prior art keywords
features
feature
attention mechanism
question
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010839563.5A
Other languages
Chinese (zh)
Other versions
CN112036276A (en
Inventor
王田
李嘉锟
李泽贤
张奇鹏
彭泰膺
吕金虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202010839563.5A
Publication of CN112036276A
Application granted
Publication of CN112036276B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an artificial intelligence video question-answering method comprising the following steps: S1, acquiring semantic features; S2, extracting visual features and performing multi-modal fusion of the visual features and the semantic features to obtain fusion features; and S3, generating an answer according to the fusion features and the semantic features. The disclosed method has a small parameter count and a high running speed, and can correctly understand the question, the alternative answers and the logical relationships among the alternative answers, so that the accuracy of the obtained answers is significantly improved.

Description

Artificial intelligent video question-answering method
Technical Field
The invention relates to an artificial intelligence video question-answering method, and belongs to the field of artificial intelligence.
Background
With the rapid development of computer hardware and Internet technology, large-scale video data are being generated, and there is a growing need to use these data to analyse and understand the spatio-temporal scenes in video content.
Meanwhile, natural language is one of the most important tools of human society. When natural language is used to communicate with a computer, and the computer automatically reasons and outputs the corresponding answer for a given video by combining vision and natural language processing, the process is called video question answering; video question answering makes it possible to process video content rapidly.
Video question answering is more demanding than single-purpose video tasks. For example, a model for a human action recognition task generally only needs to recognize people and perform temporal modelling, without recognizing other objects such as vehicles and animals, and such a model therefore has poor adaptability and extensibility.
Video question answering also requires the model to understand the logical relationship between the question and the alternative answers, i.e., some questions provide multiple alternative answers. For such questions, existing work concatenates the question with a single answer to form a statement sentence and, treating this statement as a common and reasonable form of language expression, directly applies a pre-trained language model to understand the logical relationship between the question and the alternative answer. For example, the input to the model is the statement "What color is the car? Black" together with the corresponding video clip; the model then gives a confidence evaluation of the statement based on the combined information, and after the statement sentences formed from all candidate answers have been evaluated, the answer with the highest confidence is selected as the final answer.
However, this common practice ignores the logical relationships among the alternative answers. When the candidate answers contain a distractor that is very similar to the correct answer, such a confidence-evaluation algorithm is likely to give both a high and similar confidence, and is therefore likely to produce an incorrect answer.
Therefore, it is highly desirable to design a high-accuracy artificial intelligence video question-answering method.
Disclosure of Invention
In order to overcome the above problems, the present inventors have conducted intensive research and designed an artificial intelligence video question-answering method, which includes the following steps:
S1, obtaining semantic features;
S2, extracting visual features, and performing multi-modal fusion of the visual features and the semantic features to obtain fusion features;
and S3, generating an answer according to the fusion features and the semantic features.
Specifically, in step S1, the question is processed with the GLoVe word embedding model to obtain word vector representations, the word vectors are input into a two-layer LSTM model in word order, and the output of the LSTM model at the last time step is used as the semantic feature.
In a preferred embodiment, the GLoVe word embedding model is trained with the question together with all the alternative answers as its input.
More preferably, when the question and all the alternative answers are input together to the GLoVe word embedding model, the alternative answers are labeled so that the answers and the question are separated at the semantic level.
In step S2, the visual feature extraction comprises the following sub-steps:
s21, modeling a video image in a spatial dimension to obtain image characteristics;
s22, modeling image features in a time dimension, extracting time sequence features and obtaining visual features;
the image features are features of a scene in one frame of video, and the time sequence features are features of the scene in different frames of video.
Further, in step S21, an FPN model is established to obtain a target level feature, and a ResNet model is established to obtain a global feature of the image;
in step S22, the image features of each frame of image are subjected to time series analysis by the LSTM model, so as to obtain visual features.
According to the invention, in step S2, semantic features and visual feature information are fused through an attention mechanism, and the visual features are weighted according to the semantic features to complete visual feature information fusion.
In the present invention, the attention mechanism is expressed as follows:

α_i = softmax( f(T, V_i) ),  i = 1, 2, …, n

Ṽ = Σ_{i=1}^{n} α_i · V_i

wherein α = [α_1, α_2, …, α_i, …, α_n], V = [V_1, V_2, …, V_i, …, V_n], T is the semantic feature, V is the set of visual-related features, V_i is the i-th visual-related feature, f(·,·) is the fusion function of the semantic feature and the i-th visual-related feature, α_i is the weight of the i-th visual-related feature, Ṽ is the output weighted result, and n is the number of visual-related feature vectors.
Further, the attention mechanism comprises a spatial attention mechanism and a temporal attention mechanism;
after the image features are obtained, the spatial attention mechanism is applied to weight the image features of each frame;
in step S22, the temporal attention mechanism is applied in the LSTM model to weight the different frames;
in the spatial attention mechanism, the visual-related features V refer to different regions of the image features;
in the temporal attention mechanism, the visual-related features V refer to the images of different frames.
Furthermore, the importance of the visual-related features is normalized with a softmax function to obtain the weights α_i in the spatial attention mechanism and the temporal attention mechanism.
In a preferred embodiment of the present invention, a two-layer perceptron comprising two fully-connected layers is used as the fusion algorithm of the attention mechanism, and the spatial attention mechanism and the temporal attention mechanism are applied separately.
The artificial intelligent video question-answering method has the beneficial effects that:
(1) The artificial intelligent video question-answering method provided by the invention has good performance and obviously improved accuracy compared with other methods;
(2) According to the artificial intelligent video question-answering method provided by the invention, the questions, the alternative answers and the logic relationship among the alternative answers can be correctly understood;
(3) The artificial intelligent video question-answering method provided by the invention has the advantages of small parameter quantity and high operation speed.
Drawings
FIG. 1 is a flow diagram illustrating an artificial intelligence video question-answering method in accordance with a preferred embodiment;
FIG. 2 is a flow chart of the artificial intelligent video question-answering method in embodiment 1;
FIG. 3 shows the question forms in example 1.
Detailed Description
The invention is explained in further detail below with reference to the drawings. The features and advantages of the present invention will become more apparent from the description.
The invention provides an artificial intelligence video question-answering method, which comprises the following steps as shown in figure 1:
S1, obtaining semantic features;
S2, extracting visual features, and performing multi-modal fusion of the visual features and the semantic features to obtain fusion features;
and S3, generating an answer according to the fusion features and the semantic features.
In step S1, the questions in video question answering are expressed in natural language, and the semantic features are features that can characterize the question.
In the invention, the GLoVe word embedding model is used to process the question to obtain word vector representations, the word vectors are input into a two-layer LSTM model in word order, and the output of the LSTM model at the last time step is used as the semantic feature.
Compared with models such as ELMo and BERT, the GLoVe word embedding model has a small parameter count and a high computation speed. It is particularly suitable for video question answering, where the natural language consists of short sentences and common words: only a single question sentence with common words needs to be analysed, so the requirements can be met without an excessively large training scale.
Before the GLoVe word embedding model is used, the question and the answers need to be associated, and the model needs to be trained so that it can understand the meaning of the question.
In a preferred embodiment, the question is concatenated with a single candidate answer to form a statement sentence; since there are multiple candidate answers, multiple statement sentences are formed. All the statement sentences and the corresponding video clips are used as the model input, and the correct answer, preferably the serial number of the correct answer, is used as the output to train the GLoVe word embedding model. The trained GLoVe word embedding model can then perform confidence evaluation on the multiple statement sentences, the statement with the highest confidence is selected as the final answer, and the semantic features are obtained.
The inventors found that, with the above approach, when the candidate answers contain a distractor that is very similar to the correct answer, the distractor and the correct answer both receive high and similar confidences, so the error rate of answer generation is high; how to handle such distractors is the key difficulty addressed by the invention.
In a more preferred embodiment, the GLoVe word embedding model is trained with the question together with all the alternative answers as input to the GLoVe word embedding model and the correct answer as output.
Further, when the question and all the alternative answers are used together as the input of the GLoVe word embedding model, the alternative answers are labeled, for example by adding "<aa>" (alternative answer) in front of each alternative answer as an identifier, so that the answers and the question are separated at the semantic level.
By taking the question and all the alternative answers as input and marking the alternative answers, the model can correctly understand the question and the alternative answers and the logical relationship among the alternative answers, so that the correctness of the generated answers is improved.
The LSTM model is a special recurrent neural network model that can learn long-term dependencies. It was first proposed by Hochreiter & Schmidhuber in 1997 and adds several gating structures to the recurrent neural network (RNN), allowing information to persist so that semantically continuous output is produced by the recurrent network.
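As an illustration of this semantic branch, the following Python (PyTorch) sketch builds the "<aa>"-tagged input sequence, looks up pre-trained GLoVe vectors, and takes the last state of a two-layer LSTM as the semantic feature. It is a minimal sketch rather than the patented implementation: the names build_input_tokens and SemanticEncoder, the glove_vectors lookup table, the whitespace tokenisation, and the hidden size of 512 are assumptions introduced for illustration.

```python
# Minimal sketch of the semantic branch (assumptions: helper names, whitespace
# tokenisation, hidden size 512; GloVe vectors must be loaded by the caller).
import torch
import torch.nn as nn


def build_input_tokens(question: str, answers: list) -> list:
    """Concatenate the question with all alternative answers, each prefixed by '<aa>'."""
    tokens = question.lower().split()
    for ans in answers:
        tokens += ["<aa>"] + ans.lower().split()
    return tokens


class SemanticEncoder(nn.Module):
    def __init__(self, glove_vectors: dict, dim: int = 300, hidden: int = 512):
        super().__init__()
        self.glove = glove_vectors            # token -> pre-trained GloVe vector of size dim
        self.unk = torch.zeros(dim)           # fallback for out-of-vocabulary tokens
        self.lstm = nn.LSTM(dim, hidden, num_layers=2, batch_first=True)

    def forward(self, tokens: list) -> torch.Tensor:
        vecs = torch.stack([self.glove.get(t, self.unk) for t in tokens])  # (L, dim)
        _, (h_n, _) = self.lstm(vecs.unsqueeze(0))                         # h_n: (2, 1, hidden)
        return h_n[-1].squeeze(0)             # last-layer state at the last time step
```

For example, build_input_tokens("what does the man do before walking away", ["run", "jump", "sit", "stand", "wave"]) produces the tagged token sequence that the encoder consumes.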
In step S2, visual feature extraction consists of image feature extraction and time-sequence feature extraction, where the image features describe the scene within a single video frame and the time-sequence features describe the scene across different frames. In the present invention, visual feature extraction includes the following sub-steps:
S21, modeling the video images in the spatial dimension to obtain image features;
the image features include target-level features and global features of the image.
Different from traditional video question answering, the image features of the invention contain not only target-level features but also global features, so that the model can understand the video content comprehensively and obtain more complete and effective visual features for more complex question-answering content.
In the invention, modeling is carried out on each frame of image of the video in a space dimension, and the image characteristics of each frame of image are obtained through a model.
Specifically, an FPN model is established to obtain target level features, a ResNet model is established to obtain global features of an image, and further, a data set containing daily life scenes is used as training samples of the FPN model and the ResNet model.
Preferably, the COCO data set is used as the training sample of the FPN model; the COCO data set is a large image database designed by Microsoft for object detection, segmentation, human key-point detection, semantic segmentation and caption generation, and contains more than 200,000 images covering 80 categories of common everyday objects.
Preferably, the ImageNet data set is used as the training sample of the ResNet model; ImageNet is a large visual database for visual object recognition research, containing about 14,000,000 images covering 1,000 categories of common objects, such as various animals and vehicles.
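The following sketch illustrates this two-branch spatial modelling with off-the-shelf torchvision models as stand-ins. The patent does not name specific backbones beyond "ResNet" and "FPN", so the ImageNet-pretrained ResNet-50 and the COCO-pretrained Faster R-CNN FPN backbone used here are illustrative choices, and FrameEncoder is a hypothetical name.

```python
# Minimal sketch: global features from an ImageNet-pretrained ResNet, target-level
# (multi-scale) features from a COCO-pretrained FPN backbone. Stand-in models only.
import torch
import torch.nn as nn
import torchvision


class FrameEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights="DEFAULT")            # ImageNet weights
        self.global_cnn = nn.Sequential(*list(resnet.children())[:-2])     # drop avgpool + fc
        detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")  # COCO weights
        self.fpn_backbone = detector.backbone                              # ResNet-50 + FPN

    @torch.no_grad()
    def forward(self, frames: torch.Tensor):
        # frames: (B, 3, 224, 224), already normalised
        global_feat = self.global_cnn(frames)     # (B, 2048, 7, 7) global feature map
        pyramid = self.fpn_backbone(frames)       # dict of multi-scale target-level maps
        return global_feat, pyramid
```

In practice the pyramid maps would feed a detection head to obtain per-object (target-level) features, while global_feat can be average-pooled into a single 2048-dimensional vector per frame.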
And S22, modeling the image characteristics in a time dimension, extracting time sequence characteristics and obtaining visual characteristics.
The image features obtained in step S21 are based on single frames and capture only spatial-dimension visual information; after the spatial information is obtained, the time dimension of the images needs to be modeled to obtain the visual features.
Preferably, the image features of each frame are subjected to time-series analysis by an LSTM model to obtain the visual features.
The LSTM model takes the current-time input x_t and the previous state h_(t-1) as input and outputs the current state h_t, so the time-series analysis can be completed and the visual features obtained.
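A minimal sketch of this temporal modelling step is shown below; the 60-frame length, the 2048-dimensional per-frame features and the 512-dimensional hidden state are illustrative values only.

```python
# Minimal sketch: an LSTM consumes per-frame features in order; each state h_t
# depends on the current input x_t and the previous state h_(t-1).
import torch
import torch.nn as nn

frame_feats = torch.randn(1, 60, 2048)       # (batch, frames, per-frame feature dim), dummy input
temporal_lstm = nn.LSTM(2048, 512, num_layers=2, batch_first=True)
states, (h_n, c_n) = temporal_lstm(frame_feats)
# `states` holds h_t for every frame, shape (1, 60, 512); these per-frame states
# are what the temporal attention mechanism described below re-weights.
```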
Further, in step S2, the visual features and the semantic features need to be fused in a multi-modal manner: the visual feature information is further screened and interpreted with the semantic features, and visual information weakly associated with the semantic features is removed, thereby obtaining the fusion features.
The difficulty lies in how to fuse the visual features and the semantic features: if the two are simply concatenated and a fully-connected network is used to further extract features and output the answer, the result is poor, because the difference between the two feature vectors is ignored.
In the invention, semantic features and visual feature information are fused through an attention mechanism, and specifically, the visual features are weighted according to the semantic features to complete the fusion of the visual feature information.
Weighting the visual features according to the semantic features realizes information interaction between the visual features and the semantic features and screening of the visual information: the features in the visual information that are related to the question are enhanced, and the unrelated features are weakened.
The attention mechanism is a data-processing method in machine learning that is widely used in many types of machine learning tasks, such as natural language processing, image recognition and speech recognition; it can adjust the attention direction and the weighting model according to the specific task objective, and can be attached to various neural network models.
In the invention, the attention mechanism is used for weighting the features related to the semantic features in the visual feature information more heavily and weighting the features unrelated to the semantic features less heavily.
In a preferred embodiment, the attention mechanism may be expressed as follows:

α_i = softmax( f(T, V_i) ),  i = 1, 2, …, n

Ṽ = Σ_{i=1}^{n} α_i · V_i

wherein α = [α_1, α_2, …, α_i, …, α_n], V = [V_1, V_2, …, V_i, …, V_n], T is the semantic feature, V is the set of visual-related features, V_i is the i-th visual-related feature, f(·,·) is the fusion function of the semantic feature and the i-th visual-related feature, α_i is the weight of the i-th visual-related feature, Ṽ is the output weighted result, and n is the number of visual-related feature vectors.
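The formula can be read directly as code. The sketch below is only an illustration: the dot product used as the fusion function f is a placeholder (the patent's own f is the two-layer perceptron sketched further below), and the function name attend is hypothetical.

```python
# Minimal sketch of the attention formula: score each V_i against T with a fusion
# function f, normalise the scores with softmax, and return the weighted sum.
import torch
import torch.nn.functional as F


def attend(T: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """T: (d,) semantic feature; V: (n, d) visual-related features."""
    scores = V @ T                               # f(T, V_i), here a placeholder dot product
    alpha = F.softmax(scores, dim=0)             # weights alpha_i, summing to 1
    return (alpha.unsqueeze(1) * V).sum(dim=0)   # weighted result, shape (d,)
```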
In a preferred embodiment, the attention mechanism includes a spatial attention mechanism: after the image features are obtained in step S21, the spatial attention mechanism is applied to weight the image features of each frame, so that the image features related to the question in the spatial dimension are selected.
Further, in the spatial attention mechanism, the visual correlation features refer to different regions of the image features, and the different regions are weighted to obtain the image feature expression.
For example, if the image feature output by the ResNet model has 49 regions, then n = 49; the regions are denoted V_1, V_2, …, V_49, and their importance scores are α_1, α_2, …, α_49. The 49 regions are then weighted and combined according to the above formula to obtain the final image feature expression.
In a preferred embodiment, the attention mechanism includes a temporal attention mechanism: in step S22, the temporal attention mechanism is applied in the LSTM model to weight the different frames, so that the visual features related to the question in the time dimension are selected.
Further, in the temporal attention mechanism, the visual related features refer to images of different frames, and the images of different frames are weighted to obtain the visual feature expression.
For example, if the video is 60 frames long, then n = 60; the frames are denoted V_1, V_2, …, V_60, and their importance scores are α_1, α_2, …, α_60. The 60 frames are then weighted and combined according to the above formula to obtain the final visual feature expression.
In a preferred embodiment, a two-layer perceptron comprising two fully-connected layers is used as the fusion algorithm of the attention mechanism, and the spatial attention mechanism and the temporal attention mechanism are applied separately.
Specifically, after the semantic features and the visual features are concatenated, correlation analysis is performed by the two fully-connected layers; the feature dimension of the perceptron's middle layer is set to 256-1024, preferably 512, and the feature dimension of the output layer is set to 1, the output representing the importance of the visual feature.
Further, the importance of the visual-related features is normalized with a softmax function to obtain the weights α_i in the spatial attention mechanism and the temporal attention mechanism, and the visual-related features are weighted and summed according to these weights to obtain the fusion features.
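A minimal sketch of this perceptron-based attention is given below, using the dimensions quoted above (hidden size 512, output size 1). The class name PerceptronAttention, the ReLU between the two layers, and the 2048/512 feature sizes in the example are assumptions made for illustration; the same module can play the role of the spatial attention (n = 49 regions) or the temporal attention (n = 60 frames).

```python
# Minimal sketch: concatenate the semantic feature with each visual-related feature,
# score with a two-layer perceptron (hidden 512 -> 1), softmax, then weighted sum.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerceptronAttention(nn.Module):
    def __init__(self, sem_dim: int, vis_dim: int, hidden: int = 512):
        super().__init__()
        self.fc1 = nn.Linear(sem_dim + vis_dim, hidden)   # first fully-connected layer
        self.fc2 = nn.Linear(hidden, 1)                    # outputs one importance score

    def forward(self, T: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
        # T: (sem_dim,) semantic feature; V: (n, vis_dim) visual-related features
        joint = torch.cat([T.expand(V.size(0), -1), V], dim=1)  # pair T with every V_i
        scores = self.fc2(torch.relu(self.fc1(joint)))          # (n, 1) importance scores
        alpha = F.softmax(scores, dim=0)                        # normalised weights alpha_i
        return (alpha * V).sum(dim=0)                           # fused feature, shape (vis_dim,)


# Example: spatial attention over 49 ResNet regions of a single frame.
spatial_attn = PerceptronAttention(sem_dim=512, vis_dim=2048)
fused_region_feature = spatial_attn(torch.randn(512), torch.randn(49, 2048))
```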
In step S3, the semantic features and the fusion features are concatenated to construct a neural network; preferably, inference is performed with a double-layer fully-connected structure to generate the final answer to the question, thereby completing the video question-answering task.
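The answer-generation step can be sketched as follows; the class name AnswerHead, the hidden size of 512, the ReLU activation and the choice of five candidate answers are illustrative assumptions, not values specified by the patent.

```python
# Minimal sketch: concatenate semantic and fusion features, then score candidate
# answers with two fully-connected layers.
import torch
import torch.nn as nn


class AnswerHead(nn.Module):
    def __init__(self, sem_dim: int = 512, fused_dim: int = 512,
                 hidden: int = 512, num_answers: int = 5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(sem_dim + fused_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_answers),    # one score per alternative answer
        )

    def forward(self, semantic: torch.Tensor, fused: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([semantic, fused], dim=-1))


# Usage: the index of the highest-scoring answer is taken as the final answer.
head = AnswerHead()
scores = head(torch.randn(512), torch.randn(512))
answer_index = scores.argmax().item()
```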
Examples
Example 1
An experiment is carried out on the large-scale public video question-answering data set TGIF-QA. As shown in FIG. 2, in step S1 the question and all the alternative answers are used together as the input of the GLoVe word embedding model, with "<aa>" added in front of each alternative answer as an identifier; the GLoVe word embedding model is trained to obtain word vector representations, the word vectors are input into a double-layer LSTM model in word order, and the output of the LSTM model at the last time step is selected as the final semantic feature.
In step S2, a double-layer perceptron with two fully-connected layers is used as the fusion algorithm of the attention mechanism, with a middle-layer feature dimension of 512 and an output-layer feature dimension of 1. Specifically, a ResNet-50 model pre-trained on ImageNet is used to extract single-frame image features; the semantic features and the image features are concatenated and fed to the fully-connected layers, and the spatial attention mechanism is applied to weight the image features of each frame. The weighted multi-frame image features are input into a double-layer LSTM to further extract time-sequence features, the temporal attention mechanism is applied in the LSTM model, and the semantic features and the time-sequence features are concatenated and fed to the fully-connected layers to output the fusion features.
The importance of the visual-related features is normalized with a softmax function to obtain the spatial attention weights and the temporal attention weights.
In step S3, the original semantic features and the weighted visual features are concatenated, and inference is then performed with a double-layer fully-connected structure to generate the final answer to the question.
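The wiring order of this embodiment can be summarised in the self-contained sketch below (spatial attention per frame, double-layer LSTM, temporal attention, concatenation, two fully-connected layers). All class and function names, the 2048/512 feature sizes, the ReLU activations and the five candidate answers are illustrative assumptions, and the per-frame region features are passed in as dummy tensors rather than computed by a real ResNet-50.

```python
# Compact end-to-end sketch of the pipeline in this embodiment (dummy inputs).
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp_attention(query, values, scorer):
    """values: (n, d); query: (q,); scorer maps (q + d) -> 1 importance score per row."""
    joint = torch.cat([query.expand(values.size(0), -1), values], dim=1)
    alpha = F.softmax(scorer(joint), dim=0)          # softmax-normalised weights
    return (alpha * values).sum(dim=0)               # weighted sum of the values


class VideoQAPipeline(nn.Module):
    def __init__(self, sem_dim=512, region_dim=2048, lstm_dim=512, num_answers=5):
        super().__init__()
        self.spatial_scorer = nn.Sequential(nn.Linear(sem_dim + region_dim, 512),
                                            nn.ReLU(), nn.Linear(512, 1))
        self.temporal_lstm = nn.LSTM(region_dim, lstm_dim, num_layers=2, batch_first=True)
        self.temporal_scorer = nn.Sequential(nn.Linear(sem_dim + lstm_dim, 512),
                                             nn.ReLU(), nn.Linear(512, 1))
        self.answer_head = nn.Sequential(nn.Linear(sem_dim + lstm_dim, 512),
                                         nn.ReLU(), nn.Linear(512, num_answers))

    def forward(self, semantic, frame_regions):
        # semantic: (sem_dim,); frame_regions: (frames, regions, region_dim)
        frames = [mlp_attention(semantic, regions, self.spatial_scorer)
                  for regions in frame_regions]                    # spatial attention per frame
        states, _ = self.temporal_lstm(torch.stack(frames).unsqueeze(0))
        fused = mlp_attention(semantic, states.squeeze(0), self.temporal_scorer)
        return self.answer_head(torch.cat([semantic, fused], dim=-1))


# Dummy run: 60 frames with 49 regions each, 2048-dimensional region features.
model = VideoQAPipeline()
scores = model(torch.randn(512), torch.randn(60, 49, 2048))
print(scores.shape)   # torch.Size([5]) - one score per alternative answer
```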
The TGIF-QA data set contains about 160,000 questions and corresponding videos, divided into four categories by question type: counting questions (Count), frame questions (Frame), action questions (Action) and transition questions (Transition). The question categories and the amounts of data used for training and testing are shown in Table 1.
Table 1

Question type    Training    Testing    Total
Count            26,843      3,554      30,397
Frame            39,392      13,691     53,083
Action           20,475      2,274      22,749
Transition       52,704      6,232      58,936
Total            139,414     25,751     165,165
In the TGIF-QA data set, for a given video, a counting question asks how many times a person or animal performs some action, e.g. "How many times does the man blink his eyes?"; a frame question asks for static information available from a single frame, such as object color or object count, e.g. "What is the color of the cat?"; an action question asks what type of action the person or animal performs, e.g. "What does the girl do 3 times?"; and a transition question asks what action the person or animal takes before or after performing another action, e.g. "What does the man do before walking away?", as shown in FIG. 3.
Comparative example 1
The visual feature extraction of step S2 in example 1 is replaced with the method in the paper Jang Y, Song Y, Yu Y, et al. TGIF-QA: Toward spatio-temporal reasoning in visual question answering [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2758-2766, in which the image features are extracted with a ResNet model and the time-sequence features are the three-dimensional spatio-temporal features of a C3D model.
Comparative example 2
The visual feature extraction of step S2 in example 1 is replaced with the method in the paper Gao J, Ge R, Chen K, et al. Motion-appearance co-memory networks for video question answering [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6576-6585, in which the image features are extracted with a ResNet model and the time-sequence features are extracted with an optical-flow method.
Comparative example 3
The same experiment as in example 1 is performed on the TGIF-QA data set using the model in the paper Gao L, Zeng P, Song J, et al. Structured Two-stream Attention Network for Video Question Answering [C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33.
Comparative example 4
The same experiment as in example 1 is performed on the TGIF-QA data set using the model in the paper Jang Y, Song Y, Yu Y, et al. TGIF-QA: Toward spatio-temporal reasoning in visual question answering [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2758-2766. The same videos and the same questions as in example 1 are selected from the TGIF-QA data set, including counting questions, frame questions, action questions and transition questions.
Comparative example 5
The experiment was performed by the method in example 1 except that in step S2, the visual characteristics were obtained only by the ResNet model and the LSTM model without applying the attention mechanism.
Comparative example 6
The experiment was carried out using the method in example 1, except that in step S2, no spatial attention mechanism was applied.
Comparative example 7
An experiment was performed by the method in example 1 except that in step S2, no time attention mechanism was applied.
Comparative example 8
The experiment was carried out using the method of comparative example 4, except that no attention mechanism was applied.
Comparative example 9
The experiment was carried out using the method of comparative example 4, except that no spatial attention mechanism was applied.
Comparative example 10
The experiment was carried out using the method of comparative example 4, except that no temporal attention mechanism was applied.
Experimental example 1
The frame questions, action questions and transition questions use accuracy as the evaluation index, and the counting questions use mean square error; the higher the accuracy and the lower the mean square error, the better the performance of the method.
The experimental results of example 1, comparative example 1 and comparative example 2 are analysed and shown in Table 2:
Table 2 (the experimental results are presented as an image in the original publication and are not reproduced here)
As can be seen from Table 2, the method of example 1 has a lower mean square error and higher accuracy, which indicates that the method of example 1 is effective and performs well.
Experimental example 2
An ablation experiment was carried out based on the method of example 1. Specifically, the form of the text input in step S1 was varied: before the GLoVe word embedding model was used, the following two input forms were adopted respectively:
taking the question and all the alternative answers together as the input of the GLoVe word embedding model; and
concatenating the question with a single alternative answer to form a statement sentence, forming multiple statement sentences from the multiple alternative answers, and using all the statement sentences and the corresponding video clips as the model input.
The experiments were carried out in these two ways, and the results are shown in Table 3.
Table 3

Text input form                         Action question accuracy (%)    Transition question accuracy (%)
Question + single alternative answer    58.85                           73.62
Question + all alternative answers      86.33                           96.68
As can be seen from Table 3, using the question and all the alternative answers together as the model input greatly improves the model performance.
Experimental example 3
The experimental results of example 1 and comparative examples 5 to 10 were analysed and are shown in Table 4.
Table 4 (the experimental results are presented as an image in the original publication and are not reproduced here)
From the results, it can be seen that the attention mechanism module used in comparative example 4 can reduce the model performance in some cases (compare comparative examples 8 to 10); for example, applying the spatial attention mechanism to the action questions reduces the accuracy by 2.8% (from 60.13% to 57.33%). The attention mechanism module designed in example 1 uses a double-layer perceptron to analyse the correlation between the visual information and the semantic information, so the model performance is improved across all question types, which shows that the attention mechanism design of example 1 is reasonable and effective.
In the description of the present invention, it should be noted that the terms "upper", "lower", "inner" and "outer" indicate the orientation or positional relationship based on the operation state of the present invention, and are only for convenience of description and simplification of description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and be operated, and thus should not be construed as limiting the present invention.
The present invention has been described above in connection with preferred embodiments, which are merely exemplary and illustrative. On the basis of the above, the invention can be subjected to various substitutions and modifications, and the substitutions and the modifications are all within the protection scope of the invention.

Claims (7)

1. An artificial intelligence video question-answering method comprises the following steps:
s1, obtaining semantic features;
s2, visual feature extraction, namely performing multi-mode fusion on the visual features and the semantic features to obtain fusion features;
s3, generating an answer according to the fusion feature and the semantic feature;
in the step S1, a GLoVe word embedding model is adopted to process the problem to obtain word vector expression, the word vector expression is input into a double-layer LSTM model according to the language order, and the output of the last moment state of the LSTM model is used as semantic features;
taking the question and all the alternative answers as the input of a GLoVe word embedding model, and training the GLoVe word embedding model;
when the question and all the alternative answers are used as the input of the GLoVe word embedding model together, the alternative answers are labeled, so that the answer and the question are divided at a semantic level.
2. The artificial intelligence video question-answering method according to claim 1,
in step S2, the visual feature extraction comprises the following sub-steps:
s21, modeling a video image in a spatial dimension to obtain image characteristics;
s22, modeling image features in a time dimension, extracting time sequence features and obtaining visual features;
the image features are features of a scene in one frame of video, and the time sequence features are features of the scene in different frames of video.
3. The artificial intelligence video question-answering method according to claim 2,
in step S21, an FPN model is established to obtain a target level feature, and a ResNet model is established to obtain a global feature of an image;
in step S22, the image features of each frame of image are subjected to time series analysis by the LSTM model, so as to obtain visual features.
4. The artificial intelligence video question-answering method according to claim 1,
in step S2, semantic features and visual feature information are fused through an attention mechanism, and the visual features are weighted according to the semantic features to complete visual feature information fusion.
5. The artificial intelligence video question-answering method according to claim 4,
the attention mechanism is expressed as follows:

α_i = softmax( f(T, V_i) ),  i = 1, 2, …, n

Ṽ = Σ_{i=1}^{n} α_i · V_i

wherein α = [α_1, α_2, …, α_i, …, α_n], V = [V_1, V_2, …, V_i, …, V_n], T is the semantic feature, V_i is the i-th visual-related feature, f(·,·) is the fusion function of the semantic feature and the i-th visual-related feature, α_i is the weight of the i-th visual-related feature, Ṽ is the output weighted result, and n is the number of visual-related feature vectors.
6. The artificial intelligence video question-answering method according to claim 5,
the attention mechanism comprises a space attention mechanism and a time attention mechanism;
after the image features are obtained, the spatial attention mechanism is applied to weight the image features of each frame;
in step S22, a time attention mechanism is applied in the LSTM model to weight different frame images;
in the spatial attention mechanism, the visually relevant features V refer to different regions of an image feature;
in the temporal attention mechanism, the visually relevant features V refer to images of different frames;
the importance of the visual-related features is normalized with a softmax function to obtain the weights α_i in the spatial attention mechanism and the temporal attention mechanism.
7. The artificial intelligence video question-answering method according to claim 4,
a double-layer perceptron comprising two fully-connected layers is adopted as the fusion algorithm of the attention mechanism, and the spatial attention mechanism and the temporal attention mechanism are carried out respectively.
CN202010839563.5A 2020-08-19 2020-08-19 Artificial intelligent video question-answering method Active CN112036276B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010839563.5A CN112036276B (en) 2020-08-19 2020-08-19 Artificial intelligent video question-answering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010839563.5A CN112036276B (en) 2020-08-19 2020-08-19 Artificial intelligent video question-answering method

Publications (2)

Publication Number Publication Date
CN112036276A CN112036276A (en) 2020-12-04
CN112036276B true CN112036276B (en) 2023-04-07

Family

ID=73577605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010839563.5A Active CN112036276B (en) 2020-08-19 2020-08-19 Artificial intelligent video question-answering method

Country Status (1)

Country Link
CN (1) CN112036276B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860847B (en) * 2021-01-19 2022-08-19 中国科学院自动化研究所 Video question-answer interaction method and system
CN113010656B (en) * 2021-03-18 2022-12-20 广东工业大学 Visual question-answering method based on multi-mode fusion and structural control
CN113128415B (en) * 2021-04-22 2023-09-29 合肥工业大学 Environment distinguishing method, system, equipment and storage medium
CN113536952B (en) * 2021-06-22 2023-04-21 电子科技大学 Video question-answering method based on attention network of motion capture
CN113837047B (en) * 2021-09-16 2022-10-28 广州大学 Video quality evaluation method, system, computer equipment and storage medium
CN114707022B (en) * 2022-05-31 2022-09-06 浙江大学 Video question-answer data set labeling method and device, storage medium and electronic equipment
CN117917696A (en) * 2022-10-20 2024-04-23 华为技术有限公司 Video question-answering method and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965705B2 (en) * 2015-11-03 2018-05-08 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering
CN108549658B (en) * 2018-03-12 2021-11-30 浙江大学 Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree
CN111008293A (en) * 2018-10-06 2020-04-14 上海交通大学 Visual question-answering method based on structured semantic representation
CN110727824B (en) * 2019-10-11 2022-04-01 浙江大学 Method for solving question-answering task of object relationship in video by using multiple interaction attention mechanism
CN110889340A (en) * 2019-11-12 2020-03-17 哈尔滨工程大学 Visual question-answering model based on iterative attention mechanism

Also Published As

Publication number Publication date
CN112036276A (en) 2020-12-04

Similar Documents

Publication Publication Date Title
CN112036276B (en) Artificial intelligent video question-answering method
CN110163299B (en) Visual question-answering method based on bottom-up attention mechanism and memory network
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN109919031B (en) Human behavior recognition method based on deep neural network
CN108829677B (en) Multi-modal attention-based automatic image title generation method
CN109783666B (en) Image scene graph generation method based on iterative refinement
Wang et al. Context modulated dynamic networks for actor and action video segmentation with language queries
CN110705490B (en) Visual emotion recognition method
Jing et al. Recognizing american sign language manual signs from rgb-d videos
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN114937066A (en) Point cloud registration system and method based on cross offset features and space consistency
CN113239153B (en) Text and image mutual retrieval method based on example masking
CN114662497A (en) False news detection method based on cooperative neural network
CN111368142A (en) Video intensive event description method based on generation countermeasure network
CN115512191A (en) Question and answer combined image natural language description method
CN112183465A (en) Social relationship identification method based on character attributes and context
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
CN115223021A (en) Visual question-answering-based fruit tree full-growth period farm work decision-making method
CN114241606A (en) Character interaction detection method based on adaptive set learning prediction
Jiang et al. Cross-level reinforced attention network for person re-identification
Ling et al. A facial expression recognition system for smart learning based on YOLO and vision transformer
CN117115911A (en) Hypergraph learning action recognition system based on attention mechanism
CN114511813B (en) Video semantic description method and device
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN116503753A (en) Remote sensing image scene classification method based on multi-mode airspace transformation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant