CN110851760A - Human-computer interaction system for integrating visual question answering in web3D environment - Google Patents

Human-computer interaction system for integrating visual question answering in web3D environment

Info

Publication number
CN110851760A
CN110851760A (application CN201911099861.9A)
Authority
CN
China
Prior art keywords
model
visual question
education
answering
web3d
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911099861.9A
Other languages
Chinese (zh)
Other versions
CN110851760B (en)
Inventor
谢宁
孔文喆
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201911099861.9A priority Critical patent/CN110851760B/en
Publication of CN110851760A publication Critical patent/CN110851760A/en
Application granted granted Critical
Publication of CN110851760B publication Critical patent/CN110851760B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G09EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09BEDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00Electrically-operated educational appliances
    • G09B5/08Electrically-operated educational appliances providing for individual presentation of information to a plurality of student stations
    • G09B5/14Electrically-operated educational appliances providing for individual presentation of information to a plurality of student stations with provision for individual teacher-student communication
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of VR education and discloses a human-computer interaction system that integrates visual question answering in a web3D environment, solving the problems of a single interaction mode, low interactivity, and lack of engagement in traditional multimedia teaching. The system comprises a web end and a server end connected through a Socket. The web end uses WebGL technology: an education scene model made with a modeling tool is imported through a model loader and the scene is rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end acquires the user's question and picture input information, transmits them to the server end, and obtains the server end's answer for display in the 3D education scene together with interactive animations. The server end hosts a visual question-answering model for the education domain; after receiving the question and picture input information transmitted by the web end, it obtains the corresponding answer with the visual question-answering model and feeds it back to the web end.

Description

Human-computer interaction system for integrating visual question answering in web3D environment
Technical Field
The invention relates to the field of VR education, and in particular to a human-computer interaction system integrating visual question answering in a web3D environment.
Background
Thanks to the development of educational informatization and intelligent hardware, human-computer interactive learning systems based on multimedia technology have been widely applied to teaching. However, current multimedia teaching has the following disadvantages: (1) the interaction mode is single, and students can only operate through a mouse and keyboard; (2) the interactive interface design is monotonous and uninteresting, making it hard to attract students' attention.
Research by educators and psychologists at home and abroad shows that, apart from intelligence, students' level of interest in learning is one of the most important factors affecting their learning outcomes. Therefore, if teaching can be improved from the human-computer interaction perspective, students' interest can be raised, which in turn brings an improvement in learning efficiency.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a human-computer interaction system integrating visual question answering in a web3D environment, solving the problems of a single interaction mode, low interactivity, and lack of engagement in traditional multimedia teaching.
The technical scheme adopted by the invention for solving the technical problems is as follows:
The human-computer interaction system integrating visual question answering in the web3D environment comprises a web end and a server end connected through a Socket. The web end uses WebGL technology: an education scene model made with a modeling tool is imported through a model loader and the scene is rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end acquires the user's question and picture input information, transmits them to the server end, and obtains the server end's answer for display in the 3D education scene together with interactive animations. The server end hosts a visual question-answering model for the education domain; after receiving the question and picture input information transmitted by the web end, it obtains the corresponding answer with the visual question-answering model and feeds it back to the web end.
As a further optimization, the web end acquires the user's questions through a microphone and the picture input information through a camera.
As a further optimization, the visual question-answering model comprises a question encoding module, an image feature extraction module, a feature fusion module, and a decoder module. The question encoding module encodes the question with an LSTM network; the image feature extraction module extracts feature information from the picture with a Faster R-CNN model; the feature fusion module fuses and reasons over the question encoding information and the picture feature information based on an attention mechanism; and the decoder module decodes the fused, reasoned information output by the feature fusion module into the corresponding answer.
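For illustration only, the four modules can be sketched end-to-end in PyTorch as follows; the class layout, the dimensions, and the choice of a classification head over candidate answers as the decoder are assumptions of this sketch, not the patent's reference implementation:

```python
import torch
import torch.nn as nn

class VQAModel(nn.Module):
    """Sketch of the four-module pipeline: question encoding (LSTM),
    image region features (e.g. from Faster R-CNN), attention fusion, decoder."""
    def __init__(self, vocab_size, n_answers, emb=300, hid=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.q_enc = nn.LSTM(emb, hid, batch_first=True)   # question encoding module
        self.score = nn.Linear(hid + img_dim, 1)           # attention scoring s(x_i, q)
        self.decode = nn.Linear(hid + img_dim, n_answers)  # decoder as classifier head

    def forward(self, q_tokens, img_feats):
        # img_feats: (B, N, img_dim) region features from the image module
        _, (h, _) = self.q_enc(self.embed(q_tokens))
        q = h[-1]                                           # (B, hid) question encoding
        n = img_feats.size(1)
        pair = torch.cat([q.unsqueeze(1).expand(-1, n, -1), img_feats], dim=-1)
        alpha = torch.softmax(self.score(pair).squeeze(-1), dim=1)  # soft attention
        ctx = (alpha.unsqueeze(-1) * img_feats).sum(dim=1)  # attended image feature
        return self.decode(torch.cat([q, ctx], dim=-1))     # answer distribution
```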
As a further optimization, the visual question-answering model is trained with picture information and text information from the education process, so that the visual question-answering model has visual perception and semantic recognition functions for education scenes.
As a further optimization, the 3D education scenes displayed by the web end can be switched at will.
As a further optimization, the image feature extraction module uses a Faster R-CNN model to extract feature information of the picture, specifically: the Fast R-CNN model first performs convolution over the whole image, then fuses the candidate regions recommended by the selective search algorithm with the feature map computed by the convolutional network through the region-of-interest pooling layer to obtain the feature vector of each candidate region, and then classifies and regresses the feature vectors.
As a further optimization, the feature fusion module fuses and reasons over the question encoding information and the picture feature information with a soft attention mechanism, which is as follows:
Let X = [x_1, x_2, x_3, ..., x_N] denote the input information, and let the attention variable z ∈ [1, N] be an index over positions, where the value of z indicates the position of the selected information. Given q and X, the probability a_i that any position i is selected is computed as

a_i = p(z = i | X, q) = softmax(s(x_i, q)) = exp(s(x_i, q)) / Σ_{j=1}^{N} exp(s(x_j, q))

where s(x_i, q) is a scoring function that computes the attention value at that position.
The invention has the beneficial effects that:
(1) Visual question-answering technology is combined with Web3D technology to design an AI for a future-education Web3D application system: an intelligent "little teacher" with multi-perception fusion of visual recognition and semantic understanding, which can react sensitively and communicate appropriately when interacting with learners.
(2) The adopted visual question-answering model is built for the education domain; its data set covers most of the picture and text information arising in the education process, so the trained model fits the education setting closely.
(3) WebGL technology is used to construct the three-dimensional education scene at the web end, so users can browse the required content more intuitively, with a feeling of being personally on the scene, which further enhances the liveliness and interactivity of the project.
Therefore, compared with traditional multimedia teaching, the human-computer interaction realized by the invention first improves interactivity, allowing the learner to communicate one-to-one with the system, which raises the learner's attention and immersion; second, the system can automatically perform visual-semantic understanding of the images and language given by the learner, and can thus feed back insights and answers on different knowledge to the learner, improving learning efficiency.
Drawings
FIG. 1 is a schematic diagram of the human-computer interaction of the present invention incorporating visual question answering in a web3D environment;
FIG. 2 is a flow diagram of the Fast R-CNN algorithm;
FIG. 3 is a diagram of the structure of an LSTM memory cell.
Detailed Description
The invention aims to provide a human-computer interaction system integrating visual question answering in a web3D environment, solving the problems of a single interaction mode, low interactivity, and lack of engagement in traditional multimedia teaching. Using WebGL technology, a scene model made with a modeling tool is imported in the browser at the web end through a model loader and the scene is rendered by a renderer, finally displaying a large-scale VR world. A visual question-answering model suited to the education domain is deployed at the server end; it fuses and reasons over features with an attention mechanism, obtains and answers the learner's questions, and feeds the answers back into the three-dimensional education scene at the web end, realizing intelligent web3D + AI human-computer interaction.
The human-computer interaction system integrating visual question answering in a web3D environment comprises a web end and a server end connected through a Socket. The web end uses WebGL technology: an education scene model made with a modeling tool is imported through a model loader and the scene is rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end acquires the user's question and picture input information, transmits them to the server end, and obtains the server end's answer for display in the 3D education scene together with interactive animations. The server end hosts a visual question-answering model for the education domain; after receiving the question and picture input information transmitted by the web end, it obtains the corresponding answer with the visual question-answering model and feeds it back to the web end. In a specific implementation, the visual question-answering model includes a question encoding module, an image feature extraction module, a feature fusion module, and a decoder module: the question encoding module encodes the question with an LSTM network; the image feature extraction module extracts feature information from the picture with a Faster R-CNN model; the feature fusion module fuses and reasons over the question encoding and the picture features based on an attention mechanism; and the decoder module decodes the fused, reasoned information output by the feature fusion module into the corresponding answer.
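Before detailing the model, the web end/server end split can be made concrete with a minimal sketch of the server side using Python's standard socketserver module. The one-JSON-line-per-request framing, the field names, and the answer_question helper are assumptions for illustration, not the patent's protocol:

```python
import json
import socketserver

def answer_question(question: str, image_bytes: bytes) -> str:
    """Placeholder for inference with the visual question-answering model."""
    return "answer produced by the VQA model"

class VQAHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Expect one JSON line per request: {"question": "...", "image_hex": "..."}
        request = json.loads(self.rfile.readline().decode("utf-8"))
        image = bytes.fromhex(request["image_hex"])
        answer = answer_question(request["question"], image)
        self.wfile.write((json.dumps({"answer": answer}) + "\n").encode("utf-8"))

if __name__ == "__main__":
    with socketserver.TCPServer(("0.0.0.0", 9000), VQAHandler) as server:
        server.serve_forever()  # the web end connects to this socket
```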
To establish 3D visual perception and interaction, the core of the system is to build a VQA (visual question answering) model for the education domain. The VQA model takes as input a picture and a free-form, open-ended natural language question about the picture, and generates a natural language answer as output. The image feature extraction, question encoding, and feature fusion and reasoning parts of the model are realized as follows:
a. image feature extraction:
the Faster and better target detection tool is realized by the Faster and better target detection tool realized by anyone of the Microsoft institute of research, such as Neplus, Hosiemin, Ross Girshick and Sunware on the basis of Fast R-CNN. The Faster R-CNN algorithm realizes the real end-to-end target detection and calculation process, and is mainly divided into three parts: 1) a convolutional neural network; 2) regional recommendation Network (RPN); 3) fast R-CNN target detection network. The algorithm still continues the idea that the R-CNN firstly carries out regional recommendation and then carries out classification, but the regional recommendation task by the convolutional neural network is successfully realized, and no additional algorithm is needed to be used for independently operating. The RPN and Fast R-CNN share a convolutional neural network for feature extraction, so that the convolution calculation times are reduced, and the speed of the whole algorithm is improved. Fast R-CNN is to perform convolution calculation on the whole image, then fuse a candidate region recommended by a selective search algorithm with a feature mapping image calculated by a convolution network through a region-of-interest Pooling Layer (RoI Pooling Layer) to obtain a feature vector corresponding to the candidate region, and the frequency of convolution calculation is greatly reduced by the operation of sharing convolution calculation. And the dimensions of the feature vectors are uniform, so that the subsequent classification work is facilitated.
Fast R-CNN was inspired by SPP-Net: its region-of-interest pooling layer fuses the convolutional features with the candidate region borders and pools them to obtain the feature vector of the corresponding region, which is equivalent to a special case of SPP-Net's spatial pyramid pooling layer with only one pyramid level. In addition, to train better and faster, Fast R-CNN uses several methods, two of which are important: multi-task training and mini-batch sampling.
1) Region of interest pooling layer:
the region-of-interest pooling layer transforms the features in each active candidate region into a feature vector of fixed size W x H using a max pooling operation. The region of interest here refers to a rectangular window in the convolution signature, and in FastR-CNN is the segmented region calculated by the selective search algorithm. Each region of interest is represented by a quaternion vector (x, y, w, h), where (x, y) represents the coordinates of the upper left corner and (w, h) represents the height and width of the rectangular window. The region of interest pooling layer divides a window of interest of size W H into W x H mesh sub-windows, each of which is about (W/W) x (H/H), and then pools the eigenvalues in each sub-window maximally into the corresponding output mesh. This applies to each feature channel, as is the standard max pooling operation.
2) Multi-task training:
the multi-task training refers to that the target classification and the regression calculation of the frame of the candidate region are simultaneously used as two parallel output layers, and the training of a classifier SVM and the calculation of the regression amount are not divided into different stages in R-CNN. The first task outputs the probability distribution of each interested area on the K +1 class (wherein K is the category of the data set, and the category is added with background), and the probability is calculated by using a softmax function; the second task is to compute t for the bounding boxk=(tx k,ty k,th k,tw k) The regression offset, parameter K, represents K classes. Using k which is still defined in R-CNNt. Combining two tasks, and performing combined training classification and candidate region frame regression calculation by using a multi-task loss function:
L(p,u,tu,v)=Ldx(p,u)+λ[u≥1]Lloc(tu,v)
the parameter u marks the actual category of the target candidate region content, and normally u ≧ 1, and if u ≧ 0 indicates that the region content is the background;Lcls(p,u)=-logpuIs the probability log-loss function for class u; l isloc(tuV) is the bounding box position penalty function calculated by smoothing the L1 penalty function, tuIs the bounding box for class u prediction; everson brackets [ u is more than or equal to 1]If the mark meets the condition u is more than or equal to 1 in the square brackets, the mark is 1, otherwise, the mark is 0; the parameter λ controls the balance between the two loss functions, and is therefore set to 1 in all experiments, since the contribution of the two loss functions is equally important.
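The combined loss can be transcribed directly; the following PyTorch sketch is illustrative, with tensor shapes and the helper name assumed rather than taken from the patent:

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_scores, bbox_pred, u, v, lam=1.0):
    """L(p, u, t^u, v) = L_cls(p, u) + lam * [u >= 1] * L_loc(t^u, v).

    cls_scores: (B, K+1) raw class scores; bbox_pred: (B, K+1, 4) per-class
    boxes; u: (B,) true class indices (0 = background); v: (B, 4) true boxes."""
    l_cls = F.cross_entropy(cls_scores, u)              # -log p_u via softmax
    t_u = bbox_pred[torch.arange(u.numel()), u]         # t^u: boxes for class u
    fg = (u >= 1).float().unsqueeze(1)                  # Iverson bracket [u >= 1]
    l_loc = (fg * F.smooth_l1_loss(t_u, v, reduction="none")).sum() / u.numel()
    return l_cls + lam * l_loc                          # lam = 1 in all experiments
```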
3) Mini-batch learning:
Mini-batch learning is used to obtain better gradient back-propagation. A convolutional neural network updates its parameters by gradient descent during back-propagation. During training, the entire data set can be fed to the network model at once, with the gradient of each iterative update computed over all samples: this is the traditional (batch) gradient descent method. Alternatively, the update gradient can be computed from a single sample at a time, which is called stochastic gradient descent, also known as online learning. Learning over the entire data set converges more accurately to the location of the extremum and needs fewer iterations, but each iteration takes a long time; for equal computational effort, training on the whole data set converges more slowly than training on a small number of samples. Training on single samples converges quickly per step, but each iteration corrects the parameters toward the gradient of the current sample alone; successive corrections point in different directions and partially cancel each other, introducing a large amount of noise that degrades performance and makes convergence difficult. Therefore a compromise, mini-batch learning, is used to find a balance point between the two extremes. Fast R-CNN sets the mini-batch size to 128, sampling 64 regions of interest from each of two original images for joint training.
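That sampling scheme can be sketched as follows, assuming a hypothetical mapping from images to their candidate regions of interest (each image is assumed to have at least 64 RoIs):

```python
import random

def sample_minibatch(rois_per_image, images, batch_size=128, images_per_batch=2):
    """Draw one Fast R-CNN style mini-batch: 64 RoIs from each of 2 images."""
    chosen = random.sample(images, images_per_batch)
    per_image = batch_size // images_per_batch      # 64 regions of interest each
    batch = [roi for img in chosen
             for roi in random.sample(rois_per_image[img], per_image)]
    return chosen, batch
```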
The algorithm flow of Fast R-CNN is shown in FIG. 2: the convolutional features are processed by the region-of-interest pooling layer, and the resulting features are fed into two parallel computing tasks, classification and localization regression, for training. With these methods and the improved framework, Fast R-CNN achieves better results than R-CNN with shorter training and testing times.
B. Text feature extraction:
for text selection, we use the LSTM network. The result of each learning of the ordinary recurrent neural network is related to the data at the current moment and the data at the previous moment. The special structure of the recurrent neural network makes full use of historical data, so that the recurrent neural network has obvious advantages in processing sequence problems. However, the recurrent neural network model has a gradient vanishing problem, that is, the farther the data belongs to the moment, the smaller the influence of the data on the weight change is, and finally, the training result is often dependent on the data at the closer moment, that is, the long-term memory of the historical data is lacked. Hochereiter et al originally proposed a long-short term memory network, LSTM was an optimized model of RNN, inherited most of the properties of RNN, and solved the problem of gradient disappearance during reverse transfer, after which LSTM was further improved and generalized by a.graves. LSTM has three more sets of controllers for long-short term memory compared to the original RNN: forget gate, input gate, output gate. An LSTM model memory cell structure is shown in FIG. 3.
The forgetting gate serves to selectively discard data in cell state C, the selection process being calculated using the following formula:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
Taking a language model as an example, in the process of predicting a new word, if the cell state includes properties of the subject of the previous sentence, those properties need to be forgotten under the new sentence structure. After the forget gate, the cell must choose how to update its state: a sigmoid layer (the input gate) decides which values to update, and a tanh layer generates a new candidate. This update process is implemented by the following formulas:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

Continuing the language model example, the properties of the new subject are then added to the cell state to replace the old information:

C_t = f_t * C_{t-1} + i_t * C̃_t
after the LSMT cell unit is refreshed, the output value needs to be calculated. The output value is determined by the cell state C, first by using the sigmoid layer to determine the position of the part to be output, then converting to a value between (-1,1) with tanh, and then multiplying with the sigmoid gate output.
The output process is represented by the formula ot=σ(Wo[ht,xt]+bo) Calculate, at the same time, ht=σt*tanh(ct)。
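Putting the three gates together, one LSTM step can be transcribed directly from the formulas above; this NumPy sketch is illustrative, with the weight layout (one matrix per gate over the concatenated [h_{t-1}, x_t]) an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of the LSTM cell described above. W['f'], W['i'], W['c'],
    W['o'] have shape (hidden, hidden + input); b[...] have shape (hidden,)."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])         # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])         # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])     # candidate values
    c_t = f_t * c_prev + i_t * c_tilde         # cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])         # output gate
    h_t = o_t * np.tanh(c_t)                   # output value
    return h_t, c_t
```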
C. Feature fusion and reasoning:
There are many ways to fuse the question feature and the image feature, such as multiplying or concatenating the two features. However, such fusion is insufficient: the relationship between the question and the image is complicated, and a simple operation may not fully exploit their interaction. Most current models treat VQA as a multi-class classification problem and answer with a single word; of course, a complete sentence can be answered with an RNN decoder model. In recent years, the attention mechanism has become a popular topic across many fields of deep learning. Take conversation between people as an example: if one person must answer another's question, the answer will not be good if words are missed, or if every word is heard with equal weight; the question must be answered according to its emphasis, and this emphasis is the so-called attention. We achieve feature fusion and reasoning through an attention mechanism. Here we adopt the soft attention mechanism, a model that is relatively easy to add to existing network structures.
Let X = [x_1, x_2, x_3, ..., x_N] denote the input information, and let the attention variable z ∈ [1, N] be an index over positions, where the value of z indicates the position of the selected information. Given q and X, the probability a_i that a position i is selected is computed as

a_i = p(z = i | X, q) = softmax(s(x_i, q)) = exp(s(x_i, q)) / Σ_{j=1}^{N} exp(s(x_j, q))

where s(x_i, q) is a scoring function that computes the attention value at that position.
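This computation transcribes directly into NumPy. The additive form s(x_i, q) = w · tanh(W_x x_i + W_q q) used below is one common choice of scoring function and is an assumption of the sketch; the patent only requires some scoring function s:

```python
import numpy as np

def soft_attention(X, q, W_x, W_q, w):
    """a_i = softmax(s(x_i, q)) over the N rows of X, plus the attended summary."""
    scores = np.tanh(X @ W_x.T + W_q @ q) @ w      # s(x_i, q) for every position i
    scores -= scores.max()                         # numerical stability
    a = np.exp(scores) / np.exp(scores).sum()      # softmax over positions
    return a, a @ X                                # weights and expected information

# Usage with toy shapes: N positions of dimension d, query of dimension dq
N, d, dq, hdim = 5, 8, 6, 16
rng = np.random.default_rng(0)
a, ctx = soft_attention(rng.normal(size=(N, d)), rng.normal(size=dq),
                        rng.normal(size=(hdim, d)), rng.normal(size=(hdim, dq)),
                        rng.normal(size=hdim))
```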
Based on the above, the invention establishes a data set belonging to the education domain on top of the existing VQA model, constructs the invention's feature extraction algorithm by combining the strengths of an image recognition algorithm and a text analysis algorithm, and fuses and reasons over the features through an attention mechanism, so that the resulting VQA model has visual recognition and semantic perception capabilities in the education process and can provide corresponding answers to the questioner's question and picture input.
The principle of human-computer interaction with visual question answering integrated in a web3D environment is shown in FIG. 1. A learner can load the 3D education scene directly in the browser at the web end. During learning, the learner can ask the AI teacher a question through a microphone while the camera acquires picture information, and the web end uploads the picture and question to the server through a socket. The server uses the VQA model to extract features from the picture, encodes the question, fuses and reasons over the picture and question features through the attention mechanism, and finally generates the corresponding answer through the decoder. The server feeds the answer back to the web end through the socket, and the answer is displayed in the 3D education scene together with the AI teacher's animation, realizing human-computer interaction.
In conclusion, the invention combines Web3D technology with visual question-answering technology: Web3D technology is used to design and realize a realistic, education-content-based world in the browser, while a deep learning network is used to build a visual question-answering model suited to this VR world; finally the model is fused with the Web3D technology, yielding an intelligent VR education project that integrates interaction, three dimensions, dynamics, and object recognition. Projects that previously could only be opened in other clients such as PCs and consoles are moved onto the new stage of the browser: no plug-in of any kind needs to be installed, and a good project experience is obtained simply by opening a web page in the browser.

Claims (7)

1. A human-computer interaction system integrating visual question answering in a web3D environment, characterized in that
it comprises: a web end and a server end connected through a Socket; the web end uses WebGL technology: an education scene model made with a modeling tool is imported through a model loader and the scene is rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end acquires the user's question and picture input information, transmits them to the server end, and obtains the server end's answer for display in the 3D education scene together with interactive animations; and the server end hosts a visual question-answering model for the education domain, and after receiving the question and picture input information transmitted by the web end, obtains the corresponding answer with the visual question-answering model and feeds it back to the web end.
2. The human-computer interaction system of claim 1 with visual question-answering integrated in a web3D environment,
characterized in that the web end acquires the user's questions through a microphone and the picture input information through a camera.
3. The human-computer interaction system of claim 1 with visual question-answering integrated in a web3D environment,
wherein the visual question-answering model comprises: a question encoding module, an image feature extraction module, a feature fusion module and a decoder module; the question encoding module encodes the question with an LSTM network; the image feature extraction module extracts feature information of the picture with a Faster R-CNN model; the feature fusion module fuses and reasons over the question encoding information and the picture feature information based on an attention mechanism; and the decoder module decodes the fused and reasoned information output by the feature fusion module into the corresponding answer.
4. The human-computer interaction system of claim 3 that incorporates visual question answering in a web3D environment,
characterized in that the image feature extraction module uses a Faster R-CNN model to extract feature information of the picture, specifically: the Fast R-CNN model first performs convolution over the whole image, then fuses the candidate regions recommended by the selective search algorithm with the feature map computed by the convolutional network through the region-of-interest pooling layer to obtain the feature vector of each candidate region, and then classifies and regresses the feature vectors.
5. The human-computer interaction system of claim 3 that incorporates visual question answering in a web3D environment,
characterized in that the feature fusion module fuses and reasons over the question encoding information and the picture feature information with a soft attention mechanism, the soft attention mechanism comprising:
letting X = [x_1, x_2, x_3, ..., x_N] denote the input information, with the attention variable z ∈ [1, N] an index over positions whose value indicates the position of the selected information; given q and X, the probability a_i that a position is selected is computed as
a_i = p(z = i | X, q) = softmax(s(x_i, q)) = exp(s(x_i, q)) / Σ_{j=1}^{N} exp(s(x_j, q))
where s(x_i, q) is a scoring function that computes the attention value at that position.
6. The human-computer interaction system of claim 1 with visual question-answering integrated in a web3D environment,
characterized in that the visual question-answering model is trained with picture information and text information from the education process, so that the visual question-answering model has visual perception and semantic recognition functions for education scenes.
7. The human-computer interaction system of any one of claims 1-6 for integrating visual question answering in a web3D environment,
characterized in that the 3D education scenes displayed by the web end can be switched freely.
CN201911099861.9A 2019-11-12 2019-11-12 Human-computer interaction system for integrating visual question answering in web3D environment Active CN110851760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911099861.9A CN110851760B (en) 2019-11-12 2019-11-12 Human-computer interaction system for integrating visual question answering in web3D environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911099861.9A CN110851760B (en) 2019-11-12 2019-11-12 Human-computer interaction system for integrating visual question answering in web3D environment

Publications (2)

Publication Number Publication Date
CN110851760A true CN110851760A (en) 2020-02-28
CN110851760B CN110851760B (en) 2022-12-27

Family

ID=69600399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911099861.9A Active CN110851760B (en) 2019-11-12 2019-11-12 Human-computer interaction system for integrating visual question answering in web3D environment

Country Status (1)

Country Link
CN (1) CN110851760B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111459283A (en) * 2020-04-07 2020-07-28 电子科技大学 Man-machine interaction implementation method integrating artificial intelligence and Web3D
CN112463936A (en) * 2020-09-24 2021-03-09 北京影谱科技股份有限公司 Visual question answering method and system based on three-dimensional information
CN112873211A (en) * 2021-02-24 2021-06-01 清华大学 Robot man-machine interaction method
CN112926655A (en) * 2021-02-25 2021-06-08 电子科技大学 Image content understanding and visual question and answer VQA method, storage medium and terminal
CN113010656A (en) * 2021-03-18 2021-06-22 广东工业大学 Visual question-answering method based on multi-mode fusion and structural control
CN113837259A (en) * 2021-09-17 2021-12-24 中山大学附属第六医院 Modal-interactive, pictorial-and-attention-fused education video question-answering method and system
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 Visual question-answering method based on deep reasoning attention mechanism
CN114840697A (en) * 2022-04-14 2022-08-02 山东大学 Visual question answering method and system of cloud service robot

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
CN107016170A (en) * 2017-03-14 2017-08-04 上海大学 A kind of LED lamp three-dimensional customization emulation mode based on WebGL
US20180130259A1 (en) * 2016-06-15 2018-05-10 Dotty Digital Pty Ltd System, Device or Method for Collaborative Augmented Reality
CN108549658A (en) * 2018-03-12 2018-09-18 浙江大学 A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium
KR20190092043A (en) * 2018-01-30 2019-08-07 연세대학교 산학협력단 Visual Question Answering Apparatus for Explaining Reasoning Process and Method Thereof
CN110196717A (en) * 2019-06-22 2019-09-03 中国地质大学(北京) A kind of Web3D internet exchange platform and its building method
CN110309850A (en) * 2019-05-15 2019-10-08 山东省计算中心(国家超级计算济南中心) Vision question and answer prediction technique and system based on language priori problem identification and alleviation
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
US20180130259A1 (en) * 2016-06-15 2018-05-10 Dotty Digital Pty Ltd System, Device or Method for Collaborative Augmented Reality
CN107016170A (en) * 2017-03-14 2017-08-04 上海大学 A kind of LED lamp three-dimensional customization emulation mode based on WebGL
KR20190092043A (en) * 2018-01-30 2019-08-07 연세대학교 산학협력단 Visual Question Answering Apparatus for Explaining Reasoning Process and Method Thereof
CN108549658A (en) * 2018-03-12 2018-09-18 浙江大学 A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium
CN110309850A (en) * 2019-05-15 2019-10-08 山东省计算中心(国家超级计算济南中心) Vision question and answer prediction technique and system based on language priori problem identification and alleviation
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
CN110196717A (en) * 2019-06-22 2019-09-03 中国地质大学(北京) A kind of Web3D internet exchange platform and its building method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ABACHA A B, HASAN S A, DATLA V V, et al.: "VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019", Lecture Notes in Computer Science, 2019 *
C.A.R. HOARE: "Viewpoint: Retrospective: An axiomatic basis for computer programming", Communications of the ACM *
IQBAL CHOWDHURY; KIEN NGUYEN; CLINTON FOOKES; SRIDHA SRIDHARAN: "A cascaded long short-term memory (LSTM) driven generic visual question answering (VQA)", 2017 IEEE International Conference on Image Processing (ICIP) *
YI LIAN, LONG HE, JINSONG PING, et al.: "Research and implementation on the WEB3D visualization of digital moon based on WebGL", 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) *
WANG Yilei et al.: "Image fragmented-information question answering algorithm based on deep neural networks", Journal of Computer Research and Development *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111459283A (en) * 2020-04-07 2020-07-28 电子科技大学 Man-machine interaction implementation method integrating artificial intelligence and Web3D
CN112463936A (en) * 2020-09-24 2021-03-09 北京影谱科技股份有限公司 Visual question answering method and system based on three-dimensional information
CN112463936B (en) * 2020-09-24 2024-06-07 北京影谱科技股份有限公司 Visual question-answering method and system based on three-dimensional information
CN112873211B (en) * 2021-02-24 2022-03-11 清华大学 Robot man-machine interaction method
CN112873211A (en) * 2021-02-24 2021-06-01 清华大学 Robot man-machine interaction method
CN112926655A (en) * 2021-02-25 2021-06-08 电子科技大学 Image content understanding and visual question and answer VQA method, storage medium and terminal
CN112926655B (en) * 2021-02-25 2022-05-17 电子科技大学 Image content understanding and visual question and answer VQA method, storage medium and terminal
CN113010656A (en) * 2021-03-18 2021-06-22 广东工业大学 Visual question-answering method based on multi-mode fusion and structural control
CN113837259A (en) * 2021-09-17 2021-12-24 中山大学附属第六医院 Modal-interactive, pictorial-and-attention-fused education video question-answering method and system
CN113837259B (en) * 2021-09-17 2023-05-30 中山大学附属第六医院 Education video question-answering method and system for graph-note-meaning fusion of modal interaction
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 Visual question-answering method based on deep reasoning attention mechanism
CN114840697A (en) * 2022-04-14 2022-08-02 山东大学 Visual question answering method and system of cloud service robot
CN114840697B (en) * 2022-04-14 2024-04-26 山东大学 Visual question-answering method and system for cloud service robot

Also Published As

Publication number Publication date
CN110851760B (en) 2022-12-27

Similar Documents

Publication Publication Date Title
CN110851760B (en) Human-computer interaction system for integrating visual question answering in web3D environment
CN110163299B (en) Visual question-answering method based on bottom-up attention mechanism and memory network
US11907637B2 (en) Image processing method and apparatus, and storage medium
CN110377710B (en) Visual question-answer fusion enhancement method based on multi-mode fusion
CN109829541A (en) Deep neural network incremental training method and system based on learning automaton
CN109255359B (en) Visual question-answering problem solving method based on complex network analysis method
CN109919221B (en) Image description method based on bidirectional double-attention machine
CN111598118B (en) Visual question-answering task implementation method and system
CN110866542A (en) Depth representation learning method based on feature controllable fusion
CN117055724B (en) Working method of generating teaching resource system in virtual teaching scene
CN112530218A (en) Many-to-one accompanying intelligent teaching system and teaching method
CN112070040A (en) Text line detection method for video subtitles
Zhang et al. Teaching chinese sign language with a smartphone
Hui et al. A systematic approach for English education model based on the neural network algorithm
CN113591988A (en) Knowledge cognitive structure analysis method, system, computer equipment, medium and terminal
Yan Computational methods for deep learning: theory, algorithms, and implementations
CN110826510A (en) Three-dimensional teaching classroom implementation method based on expression emotion calculation
CN116541507A (en) Visual question-answering method and system based on dynamic semantic graph neural network
CN112036546B (en) Sequence processing method and related equipment
CN115168722A (en) Content interaction prediction method and related equipment
Yang et al. The Application of Interactive Humanoid Robots in the History Education of Museums Under Artificial Intelligence
Zhu et al. Emotion Recognition in Learning Scenes Supported by Smart Classroom and Its Application.
Nunes Deep emotion recognition through upper body movements and facial expression
CN117540024B (en) Classification model training method and device, electronic equipment and storage medium
Senevirathne et al. Imagibot–an image recognition chatbot for sri lankan ancient places

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant