CN110851760B - Human-computer interaction system for integrating visual question answering in web3D environment - Google Patents


Info

Publication number
CN110851760B
CN110851760B (application CN201911099861.9A)
Authority
CN
China
Prior art keywords
model
education
visual question
information
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911099861.9A
Other languages
Chinese (zh)
Other versions
CN110851760A (en)
Inventor
谢宁
孔文喆
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201911099861.9A
Publication of CN110851760A
Application granted
Publication of CN110851760B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00 Electrically-operated educational appliances
    • G09B5/08 Electrically-operated educational appliances providing for individual presentation of information to a plurality of student stations
    • G09B5/14 Electrically-operated educational appliances providing for individual presentation of information to a plurality of student stations with provision for individual teacher-student communication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of VR education and discloses a human-computer interaction system that integrates visual question answering into a web3D environment, solving the single interaction mode, low interactivity, and lack of engagement of traditional multimedia teaching. The system comprises a web end and a server end connected through a socket. The web end uses WebGL technology: an education scene model made with a modeling tool is imported through a model loader and the scene is rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end acquires the user's question and picture input, transmits them to the server end, and displays the server end's answer in the 3D education scene together with interactive animations. The server end hosts a visual question-answering model for the education system; after receiving the question and picture input transmitted by the web end, it obtains the corresponding answer with the visual question-answering model and feeds it back to the web end.

Description

Human-computer interaction system for integrating visual question answering in web3D environment
Technical Field
The invention relates to the field of VR education, in particular to a human-computer interaction system for integrating visual question answering in a web3D environment.
Background
Thanks to the development of educational informatization and intelligent hardware, human-computer interactive learning systems based on multimedia technology have been widely applied in teaching. However, current multimedia teaching has the following disadvantages: (1) the interaction mode is single, as students can only operate through a mouse and keyboard; (2) the interactive interface is monotonously designed, lacks appeal, and struggles to attract students' attention.
Research by educators and psychologists at home and abroad shows that, apart from intelligence, students' interest in learning is one of the most important factors affecting their learning outcomes. Therefore, if teaching is improved from the human-computer interaction side, students' interest can be raised, which in turn improves learning efficiency.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: provide a human-computer interaction system that integrates visual question answering in a web3D environment, solving the single interaction mode, low interactivity, and lack of engagement of traditional multimedia teaching.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A human-computer interaction system for integrating visual question answering in a web3D environment comprises a web end and a server end connected through a socket. The web end uses WebGL technology: an education scene model made with a modeling tool is imported through a model loader and the scene is rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end acquires the user's question and picture input, transmits them to the server end, and displays the server end's answer in the 3D education scene together with interactive animations. The server end hosts a visual question-answering model for the education system; after receiving the question and picture input transmitted by the web end, it obtains the corresponding answer with the visual question-answering model and feeds it back to the web end.
As a further optimization, the web end acquires the user's question through a microphone and the picture input information through a camera.
As a further optimization, the visual question-answering model comprises a question encoding module, an image feature extraction module, a feature fusion module, and a decoder module. The question encoding module encodes the question with an LSTM network; the image feature extraction module extracts the picture's feature information with a Faster R-CNN model; the feature fusion module fuses and reasons over the question encoding and the picture features based on an attention mechanism; and the decoder module decodes the fused and reasoned information output by the feature fusion module and outputs the corresponding answer.
As a further optimization, the visual question-answering model is trained with picture and text information from the education process, so that it has visual perception and semantic recognition capabilities for education scenes.
As a further optimization, the 3D education scenes displayed by the web end can be switched at will.
As a further optimization, the image feature extraction module extracts the picture's feature information with a Faster R-CNN model, specifically: the Fast R-CNN model first performs convolution over the whole image, then fuses the candidate regions recommended by the selective search algorithm with the feature map computed by the convolutional network through a region-of-interest pooling layer to obtain the feature vectors corresponding to the candidate regions, and then classifies and regresses the feature vectors.
As a further optimization, the feature fusion module adopts a soft attention mechanism to fuse and reason over the question encoding and the picture features. The soft attention mechanism is as follows:

Let $X = [x_1, x_2, x_3, \ldots, x_N]$ denote the input information, and let the attention variable $z \in [1, N]$ index the positions, where the value of z indicates which piece of information is selected. Given q and X, the probability $a_i$ that position i is selected is computed as:

$$a_i = p(z = i \mid X, q) = \frac{\exp(s(x_i, q))}{\sum_{j=1}^{N} \exp(s(x_j, q))}$$

where $s(x_i, q)$ is a scoring function that computes the attention value at that position.
The invention has the beneficial effects that:
(1) The invention combines visual question answering with Web3D technology to design an AI for a Web3D application system for future education: an intelligent tutor that fuses visual recognition and semantic understanding across multiple modalities and can react responsively and communicate appropriately when interacting with learners.
(2) The adopted visual question-answering model is based on the education system: its data set covers most of the picture and text information arising in the education process, so the trained model fits the education domain closely.
(3) Building the three-dimensional education scene at the web end with WebGL lets users browse the required content more intuitively and gives them a sense of being personally on the scene, further enhancing the liveliness and interactivity of the project.
Therefore, compared with traditional multimedia teaching, the human-computer interaction realized by the invention first improves interactivity: learners can communicate one-to-one with the system, which raises their attention and immersion. Second, the system automatically performs visual and semantic understanding of the images and language given by the learner, so it can feed back insights and answers about different knowledge, improving learning efficiency.
Drawings
FIG. 1 is a schematic diagram of the human-computer interaction of the present invention integrating visual question answering in a web3D environment;
FIG. 2 is a flow diagram of the Fast R-CNN algorithm;
FIG. 3 is a diagram of the LSTM model memory cell structure.
Detailed Description
The invention aims to provide a human-computer interaction system that integrates visual question answering in a web3D environment, solving the single interaction mode, low interactivity, and lack of engagement of traditional multimedia teaching. Using WebGL technology, a scene model made with a modeling tool is imported in the browser at the web end through a model loader and the scene is rendered by a renderer, finally displaying a large-scale VR world. A visual question-answering model suited to the education system is deployed at the server end; it performs feature fusion and reasoning with an attention mechanism, obtains and answers learners' questions based on the model, and feeds the answers back into the three-dimensional education scene at the web end, thus realizing intelligent web3D + AI human-computer interaction.
The human-computer interaction system for integrating visual question answering in a web3D environment comprises a web end and a server end connected through a socket. The web end uses WebGL technology: an education scene model made with a modeling tool is imported through a model loader and the scene is rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end acquires the user's question and picture input information, transmits them to the server end, and displays the server end's answer in the 3D education scene together with interactive animations. The server end hosts a visual question-answering model for the education system; after receiving the question and picture input information transmitted by the web end, it obtains the corresponding answer with the visual question-answering model and feeds it back to the web end. In a specific implementation, the visual question-answering model includes a question encoding module, an image feature extraction module, a feature fusion module, and a decoder module: the question encoding module encodes the question with an LSTM network; the image feature extraction module extracts the picture's feature information with a Faster R-CNN model; the feature fusion module fuses and reasons over the question encoding and the picture features based on an attention mechanism; and the decoder module decodes the fused and reasoned information output by the feature fusion module and outputs the corresponding answer.
To establish 3D visual perception and interaction, the core of the system is a VQA (visual question answering) model oriented to the education system. The VQA model takes as input a picture and a free-form, open-ended natural-language question about that picture, and generates a natural-language answer as output. The image feature extraction, question encoding, and feature fusion and reasoning parts of the model are implemented as follows:
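As an illustrative, non-limiting sketch of how the four modules connect, the following PyTorch code (an assumption of this description rather than the patent's actual implementation; the layer sizes and the concatenation-based attention scorer are hypothetical) encodes the question with an LSTM, attends over precomputed Faster R-CNN region features, and decodes a single-word answer by classification. Sections A, B, and C below detail each part.

```python
import torch
import torch.nn as nn

class EduVQA(nn.Module):
    """Minimal sketch of the four-module VQA pipeline: question encoder (LSTM),
    precomputed Faster R-CNN region features, attention-based fusion, and a
    classifier-style answer decoder. All sizes are illustrative assumptions."""
    def __init__(self, vocab_size, n_answers, emb_dim=300, hid_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)  # question encoding module
        self.score = nn.Linear(feat_dim + hid_dim, 1)            # attention scorer s(x_i, q)
        self.decoder = nn.Sequential(                            # decoder (single-word VQA)
            nn.Linear(feat_dim + hid_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, n_answers))

    def forward(self, question_ids, region_feats):
        # question_ids: (B, T) token ids; region_feats: (B, N, feat_dim) from Faster R-CNN
        _, (h, _) = self.lstm(self.embed(question_ids))
        q = h[-1]                                                # (B, hid_dim) question code
        q_tiled = q.unsqueeze(1).expand(-1, region_feats.size(1), -1)
        s = self.score(torch.cat([region_feats, q_tiled], -1))   # (B, N, 1) attention scores
        a = torch.softmax(s, dim=1)                              # soft attention weights a_i
        v = (a * region_feats).sum(dim=1)                        # attended image feature
        return self.decoder(torch.cat([v, q], dim=-1))           # answer logits
```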
A. Image feature extraction:
the Faster and better target detection tool is realized by the Faster and better target detection tool realized by anyone of the Microsoft institute of research, such as Neplus, hosiemin, ross Girshick and Sunware on the basis of Fast R-CNN. The Faster R-CNN algorithm realizes the real end-to-end target detection and calculation process, and is mainly divided into three parts: 1) A convolutional neural network; 2) Regional recommendation Network (RPN); 3) Fast R-CNN target detection network. The algorithm still continues the idea that the R-CNN firstly carries out regional recommendation and then carries out classification, but the regional recommendation task by the convolutional neural network is successfully realized, and no additional algorithm is needed to be used for independently operating. The RPN and Fast R-CNN share a convolution neural network to carry out feature extraction, so that the convolution calculation times are reduced, and the speed of the whole algorithm is improved. Fast R-CNN is that convolution calculation is carried out on the whole image, then a candidate area recommended by a selective search algorithm is fused with a feature mapping image calculated by a convolution network through a region-of-interest Pooling Layer (RoI Pooling Layer), so that a feature vector corresponding to the candidate area is obtained, and the frequency of convolution calculation is greatly reduced by the operation of sharing convolution calculation. And the dimensions of the feature vectors are uniform, so that the subsequent classification work is facilitated.
Fast R-CNN was inspired by SPP-Net: its region-of-interest pooling layer fuses the convolution features with the candidate-region borders and pools them to obtain the feature vectors of the corresponding regions, which is equivalent to a special case of the SPP-Net spatial pyramid pooling layer with only one pyramid level. In addition, to train more effectively, Fast R-CNN uses several methods to increase speed, two of which are important: multi-task training and mini-batch sampling.
1) Region of interest pooling layer:
the region-of-interest pooling layer transforms the features in each active candidate region into a feature vector of fixed size W x H using a max pooling operation. The region of interest here refers to a rectangular window in the convolution signature, and in Fast R-CNN is the segmented region computed by the selective search algorithm. Each region of interest is represented by a quaternion vector (x, y, w, h), where (x, y) represents the coordinates of the upper left corner and (w, h) represents the height and width of the rectangular window. The region of interest pooling layer divides the window of interest of size W H into W x H mesh sub-windows, each of which is about (W/W) x (H/H), and then pools the eigenvalues in each sub-window maximally into the corresponding output mesh. This applies to each feature channel, as is the standard max pooling operation.
2) Multi-task training:
the multi-task training refers to that the target classification and the regression calculation of the frame of the candidate region are simultaneously used as two parallel output layers, and the training and the regression calculation of the classifier SVM are not divided into different stages in the R-CNN. The first task outputs the probability distribution of each interested area on the K +1 class (wherein K is the category of the data set, and the category is added with background), and the probability is calculated by using a softmax function; the second task is to compute t for the bounding box k =(t x k ,t y k ,t h k ,t w k ) The regression offset, parameter K, represents the K classes. Using k which is still defined in R-CNN t . Combining two tasks, and performing combined training classification and candidate region frame regression calculation by using a multi-task loss function:
L(p,u,tu,v)=L dx (p,u)+λ[u≥1]L loc (t u ,v)
the parameter u marks the actual category of the candidate region content as the target, normally u is larger than or equal to 1, and if u =0, the region content is a background; l is cls (p,u)=-logp u Is the probability log-loss function for class u; l is loc (t u V) is the frame position loss function calculated by smoothing the L1 loss function, t u Is the bounding box for class u prediction; aves brackets [ u is not less than 1]If the mark meets the condition u is more than or equal to 1 in the square brackets, the mark is 1, otherwise, the mark is 0; the parameter λ controls the balance between the two loss functions, and is therefore set to 1 in all experiments, since the contribution of the two loss functions is equally important.
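A minimal PyTorch sketch of this multi-task loss; the tensor shapes and the helper name are assumptions of this illustration:

```python
import torch
import torch.nn.functional as F

def multitask_loss(class_logits, box_deltas, labels, box_targets, lam=1.0):
    """L(p, u, t^u, v) = L_cls(p, u) + lam * [u >= 1] * L_loc(t^u, v).
    Assumed shapes: class_logits (R, K+1); box_deltas (R, (K+1)*4);
    labels u (R,); box_targets v (R, 4)."""
    loss_cls = F.cross_entropy(class_logits, labels)       # -log p_u over K+1 classes
    fg = labels >= 1                                       # Iverson bracket [u >= 1]
    if fg.any():
        n = int(fg.sum())
        # pick the predicted box t^u of each foreground RoI's true class u
        t_u = box_deltas[fg].view(n, -1, 4)[torch.arange(n), labels[fg]]
        loss_loc = F.smooth_l1_loss(t_u, box_targets[fg])  # smooth L1 position loss
    else:
        loss_loc = loss_cls.new_zeros(())                  # no foreground: box loss is 0
    return loss_cls + lam * loss_loc                       # lam = 1: equal contributions
```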
3) Mini-batch learning:
the least batch of learning is to achieve better gradient back-propagation effects. The convolutional neural network uses a gradient descent method for back propagation of the parameters. In the training process, the whole data set can be completely sent to a network model for training and learning, and the network calculates an iteratively updated gradient value by using all samples, which is a traditional gradient descent method. Alternatively, the updated gradient values of the network can be calculated using only one sample at a time, which is called a stochastic gradient descent method, also called an online learning method. Learning using the entire data set can more accurately converge to the location where the extreme value is located, and the number of iterations is small, but the time taken for each iteration is long. With equal computational effort, the convergence rate of training using the entire data set is slower than with a small number of samples. Although the convergence speed is high by using a small number of samples for training, the correction is carried out towards the gradient direction pointed by the current sample in each iteration process, the correction respectively faces different directions, and the correction respectively turns into halves, so that a large amount of noise is brought to cause performance reduction, and the convergence is difficult to achieve. Therefore, a mediocre approach, minimal batch learning, is used to seek a balance point between the two extremes. The minimum batch size set by Fast R-CNN is 128, and 64 interesting regions are respectively collected from two original images to be trained and learned together.
The algorithm flow of Fast R-CNN is shown in FIG. 2: the convolution features are processed by the region-of-interest pooling layer, and the resulting features are sent to two parallel computing tasks, classification and localization regression, for training. With these methods and the improved framework, Fast R-CNN achieves better results than R-CNN with shorter training and testing times.
B. Text feature extraction:
for text selection, we use the LSTM network. The result of each learning of the ordinary recurrent neural network is related to the data at the current moment and the data at the previous moment. The special structure of the recurrent neural network makes full use of historical data, so that the recurrent neural network has obvious advantages in processing sequence problems. However, the recurrent neural network model has a gradient vanishing problem, that is, the farther the data belongs to the moment, the smaller the influence of the data on the weight change is, and finally, the training result is often dependent on the data at the closer moment, that is, the long-term memory of the historical data is lacked. Hochereiter et al originally proposed a long-short term memory network, LSTM is an optimized model of RNN, inherits most characteristics of RNN, and solves the problem of disappearance of gradients generated in the reverse transfer process, and then LSTM is further improved and popularized by a. LSTM has three more sets of controllers in long-short term memory compared to the original RNN: forget gate, input gate, output gate. An LSTM model memory cell structure is shown in FIG. 3.
The forget gate selectively discards information from the cell state C; its selection process is computed with the following formula:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

Taking a language model as an example: when predicting a new word, if the cell state contains a property of the subject of the previous sentence, that property needs to be forgotten under the new sentence structure. After the forget gate, the cell must decide how to update its state: a sigmoid layer (the input gate) decides which values to update, and a tanh layer generates the new candidate values. The update is implemented by the following formulas:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

Continuing the language model example, the property of the new subject is added to the cell state to replace the old information. The update proceeds as:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

After the LSTM cell state is refreshed, the output value must be computed. The output is determined by the cell state C: a sigmoid layer first determines which part of the state to output, the state is then mapped to values between (-1, 1) with tanh, and the result is multiplied by the sigmoid gate output. The output process is computed as:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \qquad h_t = o_t * \tanh(C_t)$$
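The gate equations above transcribe directly into code; the following is a minimal sketch of one memory-cell step, with weight shapes following the standard convention assumed here:

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM memory-cell update. Each W_* has shape (hidden, hidden + input)
    and each b_* has shape (hidden,)."""
    z = torch.cat([h_prev, x_t], dim=-1)      # [h_{t-1}, x_t]
    f_t = torch.sigmoid(z @ W_f.T + b_f)      # forget gate: what to discard from C
    i_t = torch.sigmoid(z @ W_i.T + b_i)      # input gate: which values to update
    c_tilde = torch.tanh(z @ W_c.T + b_c)     # candidate values ~C_t
    c_t = f_t * c_prev + i_t * c_tilde        # C_t = f_t * C_{t-1} + i_t * ~C_t
    o_t = torch.sigmoid(z @ W_o.T + b_o)      # output gate
    h_t = o_t * torch.tanh(c_t)               # h_t = o_t * tanh(C_t)
    return h_t, c_t
```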
C. Feature fusion and reasoning:
the fusion between the problem feature and the image feature is various, such as multiplying or concatenating the two features together. However, such fusion is insufficient because the relationship between the problem and the image is complicated, and the interaction between them may not be fully utilized by a simple operation. Most current models answer a word by treating VQA as a multiclass classification question. Of course, we can answer the complete sentence by the RNN decoder model. In recent years, attention Mechanism (Attention Mechanism) has become a relatively popular topic in various fields of deep learning. Taking the example of mutual conversation among people, if a second person needs to answer the question of a first person, the question cannot be answered well under the condition that words are omitted and all the words are completely heard, and the question needs to be answered according to the emphasis of the question, wherein the emphasis is the so-called attention. We achieve feature fusion and reasoning through a mechanism of attention. Here we design the Soft Attention mechanism, which is a model that is relatively easy to add to existing network structures.
Let $X = [x_1, x_2, x_3, \ldots, x_N]$ denote the input information, and let the attention variable $z \in [1, N]$ index the positions, where the value of z indicates which piece of information is selected. Given q and X, the probability $a_i$ that position i is selected is computed as:

$$a_i = p(z = i \mid X, q) = \frac{\exp(s(x_i, q))}{\sum_{j=1}^{N} \exp(s(x_j, q))}$$

where $s(x_i, q)$ is a scoring function that computes the attention value at that position.
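A minimal module-level sketch of this soft attention follows; the additive scorer s(x_i, q) = w·tanh(W_x·x_i + W_q·q) is one common choice of scoring function, assumed here for illustration:

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """a_i = exp(s(x_i, q)) / sum_j exp(s(x_j, q)), then a weighted sum over X."""
    def __init__(self, x_dim, q_dim, hidden=512):
        super().__init__()
        self.W_x = nn.Linear(x_dim, hidden, bias=False)
        self.W_q = nn.Linear(q_dim, hidden, bias=False)
        self.w = nn.Linear(hidden, 1, bias=False)

    def forward(self, X, q):
        # X: (B, N, x_dim) picture features; q: (B, q_dim) question encoding
        s = self.w(torch.tanh(self.W_x(X) + self.W_q(q).unsqueeze(1)))  # scores s(x_i, q)
        a = torch.softmax(s, dim=1)           # selection probabilities a_i
        return (a * X).sum(dim=1)             # attended (fused) information
```

Because the attention weights come from a differentiable softmax rather than a hard selection, the module drops into an existing network and trains with ordinary back-propagation.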
Based on the above, the invention establishes a data set belonging to the education system on the basis of the existing VQA model, constructs the VQA model's feature extraction algorithms by combining the advantages of image recognition and text analysis algorithms, and fuses and reasons over the features through an attention mechanism, so that the established VQA model has visual recognition and semantic perception capabilities for the education process and can provide corresponding answers to the questioner's question and picture input.
The principle of human-computer interaction with integrated visual question answering in a web3D environment is shown in FIG. 1. A learner can load the 3D education scene directly in the browser at the web end. During learning, the learner can ask the AI teacher a question through a microphone while a camera acquires picture information; the web end uploads the picture and the question to the server through a socket. The server uses the VQA model to extract the picture's features and encode the question, fuses and reasons over the picture and question features through the attention mechanism, and finally generates the corresponding answer through the decoder. The server feeds the answer back to the web end through the socket, and the answer is displayed in the 3D education scene together with the AI teacher's animation, thus realizing human-computer interaction.
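For illustration, a minimal socket server of the kind described is sketched below; the port, the newline-delimited JSON framing, and the base64 image encoding are assumptions of this sketch, since the patent only specifies that the web end and server end communicate through a socket:

```python
import base64, io, json, socket

from PIL import Image  # assumed image-decoding dependency

def vqa_answer(question: str, picture: Image.Image) -> str:
    return "placeholder answer"              # stand-in for the VQA pipeline above

HOST, PORT = "0.0.0.0", 9000                 # hypothetical bind address and port
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind((HOST, PORT))
srv.listen()
while True:
    conn, _ = srv.accept()
    with conn:
        buf = b""
        while not buf.endswith(b"\n"):       # read one newline-delimited JSON message
            chunk = conn.recv(4096)
            if not chunk:
                break
            buf += chunk
        if not buf:
            continue                         # client closed without sending a message
        msg = json.loads(buf)                # {"question": ..., "image": <base64>}
        picture = Image.open(io.BytesIO(base64.b64decode(msg["image"])))
        reply = {"answer": vqa_answer(msg["question"], picture)}
        conn.sendall(json.dumps(reply).encode("utf-8") + b"\n")
```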
In conclusion, the invention combines Web3D technology with visual question-answering technology: it designs and realizes a world based on educational content in the browser using Web3D technology, builds a set of visual question-answering models suitable for this VR world with deep learning networks, and finally fuses the two, developing an intelligent VR education project that integrates interaction, three dimensions, dynamics, and object recognition. Projects that previously could be opened only in other clients such as PCs and game consoles are moved onto the new stage of the browser: no plug-in of any kind needs to be installed, and a good project experience is obtained simply by opening a web page in the browser.

Claims (5)

1. A human-computer interaction system for integrating visual question answering in a web3D environment, characterized in that it comprises:
a web end and a server end connected through a socket, wherein the web end uses WebGL technology: an education scene model made with a modeling tool is imported through a model loader and the scene is rendered by a renderer, so that a 3D education scene is displayed in a browser; the web end acquires the user's question and picture input information and transmits them to the server end, and acquires the server end's answer and displays it in the 3D education scene together with interactive animations; the server end hosts a visual question-answering model for the education system, and after receiving the question and picture input information transmitted by the web end, obtains the corresponding answer with the visual question-answering model and feeds it back to the web end;
the visual question-answering model comprises: a question encoding module, an image feature extraction module, a feature fusion module and a decoder module; the question encoding module encodes the question with an LSTM network; the image feature extraction module extracts the picture's feature information with a Faster R-CNN model; the feature fusion module fuses and reasons over the question encoding and the picture features based on an attention mechanism; the decoder module decodes the fused and reasoned information output by the feature fusion module and outputs the corresponding answer;
the feature fusion module adopts a soft attention mechanism to fuse and reason over the question encoding and the picture features, the soft attention mechanism comprising:
letting $X = [x_1, x_2, x_3, \ldots, x_N]$ denote the input information and the attention variable $z \in [1, N]$ index the positions, where the value of z indicates which piece of information is selected, the probability $a_i$ that position i is selected, given q and X, is computed as:

$$a_i = p(z = i \mid X, q) = \frac{\exp(s(x_i, q))}{\sum_{j=1}^{N} \exp(s(x_j, q))}$$

where $s(x_i, q)$ is a scoring function that computes the attention value at that position.
2. The human-computer interaction system for integrating visual question answering in a web3D environment of claim 1, characterized in that the web end acquires the user's question through a microphone and the picture input information through a camera.
3. The human-computer interaction system for integrating visual question answering in a web3D environment of claim 1, characterized in that the image feature extraction module extracts the picture's feature information with a Faster R-CNN model, specifically: the Fast R-CNN model first performs convolution over the whole image, then fuses the candidate regions recommended by the selective search algorithm with the feature map computed by the convolutional network through a region-of-interest pooling layer to obtain the feature vectors corresponding to the candidate regions, and then classifies and regresses the feature vectors.
4. The human-computer interaction system for integrating visual question answering in a web3D environment of claim 1, characterized in that the visual question-answering model is trained with picture and text information from the education process, so that it has visual perception and semantic recognition capabilities for education scenes.
5. The human-computer interaction system for integrating visual question answering in a web3D environment of any one of claims 1-4, characterized in that the 3D education scene displayed by the web end can be switched freely.
CN201911099861.9A 2019-11-12 2019-11-12 Human-computer interaction system for integrating visual question answering in web3D environment Active CN110851760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911099861.9A CN110851760B (en) 2019-11-12 2019-11-12 Human-computer interaction system for integrating visual question answering in web3D environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911099861.9A CN110851760B (en) 2019-11-12 2019-11-12 Human-computer interaction system for integrating visual question answering in web3D environment

Publications (2)

Publication Number Publication Date
CN110851760A CN110851760A (en) 2020-02-28
CN110851760B true CN110851760B (en) 2022-12-27

Family

ID=69600399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911099861.9A Active CN110851760B (en) 2019-11-12 2019-11-12 Human-computer interaction system for integrating visual question answering in web3D environment

Country Status (1)

Country Link
CN (1) CN110851760B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111459283A (en) * 2020-04-07 2020-07-28 电子科技大学 Man-machine interaction implementation method integrating artificial intelligence and Web3D
CN112463936B (en) * 2020-09-24 2024-06-07 北京影谱科技股份有限公司 Visual question-answering method and system based on three-dimensional information
CN112873211B (en) * 2021-02-24 2022-03-11 清华大学 Robot man-machine interaction method
CN112926655B (en) * 2021-02-25 2022-05-17 电子科技大学 Image content understanding and visual question and answer VQA method, storage medium and terminal
CN113010656B (en) * 2021-03-18 2022-12-20 广东工业大学 Visual question-answering method based on multi-mode fusion and structural control
CN113837259B (en) * 2021-09-17 2023-05-30 中山大学附属第六医院 Education video question-answering method and system for graph-note-meaning fusion of modal interaction
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 Visual question-answering method based on deep reasoning attention mechanism
CN114840697B (en) * 2022-04-14 2024-04-26 山东大学 Visual question-answering method and system for cloud service robot

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016170A (en) * 2017-03-14 2017-08-04 上海大学 A kind of LED lamp three-dimensional customization emulation mode based on WebGL
CN108549658A (en) * 2018-03-12 2018-09-18 浙江大学 A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium
KR20190092043A (en) * 2018-01-30 2019-08-07 연세대학교 산학협력단 Visual Question Answering Apparatus for Explaining Reasoning Process and Method Thereof
CN110196717A (en) * 2019-06-22 2019-09-03 中国地质大学(北京) A kind of Web3D internet exchange platform and its building method
CN110309850A (en) * 2019-05-15 2019-10-08 山东省计算中心(国家超级计算济南中心) Vision question and answer prediction technique and system based on language priori problem identification and alleviation
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965705B2 (en) * 2015-11-03 2018-05-08 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering
AU2017203904A1 (en) * 2016-06-15 2018-01-18 Dotty Digital Pty Ltd A system, device, or method for collaborative augmented reality

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016170A (en) * 2017-03-14 2017-08-04 上海大学 A kind of LED lamp three-dimensional customization emulation mode based on WebGL
KR20190092043A (en) * 2018-01-30 2019-08-07 연세대학교 산학협력단 Visual Question Answering Apparatus for Explaining Reasoning Process and Method Thereof
CN108549658A (en) * 2018-03-12 2018-09-18 浙江大学 A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium
CN110309850A (en) * 2019-05-15 2019-10-08 山东省计算中心(国家超级计算济南中心) Vision question and answer prediction technique and system based on language priori problem identification and alleviation
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
CN110196717A (en) * 2019-06-22 2019-09-03 中国地质大学(北京) A kind of Web3D internet exchange platform and its building method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"Research and implementation on the WEB3D visualization of digtal moon based on WebGL";Yi Lian,Long He, Jinsong Ping et al.;《2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS)》;20171204;6094-6097 *
"Viewpoint:Restrospective :An axiomatic basis for computer programming";C.A.R. Hoare;《Communication of the ACM》;20091231;第52卷(第10期);30-32 *
"VQA-Med:Overview of the Medical Visual Queswering Task at ImageCIEF 2019";Abacha A B,Hasan S A, Dalta VV,et al.;《Lecture Notes in Computer ence,2019》;20190912;论文第1-5页 *
"基于深度神经网络的图像碎片化信息问答算法";王一蕾 等;《计算机研究与发展》;20181215;第55卷(第12期);2600-2610 *
Iqbal Chowdhury ; Kien Nguyen ; Clinton Fookes ; Sridha Sridharan."A cascaded long short-term memory (LSTM) driven generic visual question answering (VQA)".《 2017 IEEE International Conference on Image Processing (ICIP)》.2018,1842-1846. *

Also Published As

Publication number Publication date
CN110851760A (en) 2020-02-28

Similar Documents

Publication Publication Date Title
CN110851760B (en) Human-computer interaction system for integrating visual question answering in web3D environment
US11907637B2 (en) Image processing method and apparatus, and storage medium
CN110163299B (en) Visual question-answering method based on bottom-up attention mechanism and memory network
CN110377710B (en) Visual question-answer fusion enhancement method based on multi-mode fusion
CN109829541A (en) Deep neural network incremental training method and system based on learning automaton
CN109919221B (en) Image description method based on bidirectional double-attention machine
CN110399518A (en) A kind of vision question and answer Enhancement Method based on picture scroll product
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
CN110866542A (en) Depth representation learning method based on feature controllable fusion
CN109711356B (en) Expression recognition method and system
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN111598979A (en) Method, device and equipment for generating facial animation of virtual character and storage medium
CN117055724B (en) Working method of generating teaching resource system in virtual teaching scene
CN112070040A (en) Text line detection method for video subtitles
CN115953521B (en) Remote digital person rendering method, device and system
CN112530218A (en) Many-to-one accompanying intelligent teaching system and teaching method
Zhang et al. Teaching chinese sign language with a smartphone
Hui et al. A systematic approach for English education model based on the neural network algorithm
CN113591988A (en) Knowledge cognitive structure analysis method, system, computer equipment, medium and terminal
CN113780059A (en) Continuous sign language identification method based on multiple feature points
CN117251057A (en) AIGC-based method and system for constructing AI number wisdom
Yan Computational methods for deep learning: theory, algorithms, and implementations
CN110826510A (en) Three-dimensional teaching classroom implementation method based on expression emotion calculation
CN116541507A (en) Visual question-answering method and system based on dynamic semantic graph neural network
CN113360669B (en) Knowledge tracking method based on gating graph convolution time sequence neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant