CN110851760B - Human-computer interaction system for integrating visual question answering in web3D environment - Google Patents


Info

Publication number
CN110851760B
CN110851760B (application CN201911099861.9A)
Authority
CN
China
Prior art keywords
model
education
visual question
information
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911099861.9A
Other languages
Chinese (zh)
Other versions
CN110851760A (en)
Inventor
谢宁
孔文喆
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201911099861.9A
Publication of CN110851760A
Application granted
Publication of CN110851760B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00 Electrically-operated educational appliances
    • G09B5/08 Electrically-operated educational appliances providing for individual presentation of information to a plurality of student stations
    • G09B5/14 Electrically-operated educational appliances providing for individual presentation of information to a plurality of student stations with provision for individual teacher-student communication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of VR education and discloses a human-computer interaction system that integrates visual question answering into a web3D environment, solving the single interaction mode, low interactivity, and lack of engagement of traditional multimedia teaching. The system comprises a web end and a server end connected through a socket. The web end uses WebGL technology: an education scene model made with a modeling tool is imported through a model loader and the scene is rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end acquires the user's question and picture input, transmits them to the server end, and displays the server end's answer in the 3D education scene together with interactive animations. The server end hosts a visual question-answering model for the education system; after receiving the question and picture input transmitted by the web end, it obtains the corresponding answer with the visual question-answering model and feeds it back to the web end.

Description

Human-computer interaction system for integrating visual question answering in web3D environment
Technical Field
The invention relates to the field of VR education, in particular to a human-computer interaction system for integrating visual question answering in a web3D environment.
Background
Thanks to the development of educational informatization and intelligent hardware, human-computer interactive learning systems based on multimedia technology have been widely applied in teaching. However, current multimedia teaching has the following disadvantages: (1) the interaction mode is single, as students can only operate through a mouse and keyboard; (2) the interactive interface is monotonously designed, lacks appeal, and struggles to attract students' attention.
Research by educators and psychologists at home and abroad shows that, apart from intelligence, students' interest in learning is one of the most important factors affecting their learning outcomes. Therefore, if teaching is improved from the human-computer interaction side, students' interest can be raised, which in turn improves learning efficiency.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: provide a human-computer interaction system that integrates visual question answering in a web3D environment, solving the single interaction mode, low interactivity, and lack of engagement of traditional multimedia teaching.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A human-computer interaction system for integrating visual question answering in a web3D environment comprises a web end and a server end connected through a socket. The web end uses WebGL technology: an education scene model made with a modeling tool is imported through a model loader and the scene is rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end acquires the user's question and picture input, transmits them to the server end, and displays the server end's answer in the 3D education scene together with interactive animations. The server end hosts a visual question-answering model for the education system; after receiving the question and picture input transmitted by the web end, it obtains the corresponding answer with the visual question-answering model and feeds it back to the web end.
As a further optimization, the web end acquires the user's question through a microphone and the picture input information through a camera.
As a further optimization, the visual question-answering model comprises a question encoding module, an image feature extraction module, a feature fusion module, and a decoder module. The question encoding module encodes the question with an LSTM network; the image feature extraction module extracts the picture's feature information with a Faster R-CNN model; the feature fusion module fuses and reasons over the question encoding and the picture features based on an attention mechanism; and the decoder module decodes the fused and reasoned information output by the feature fusion module and outputs the corresponding answer.
As a further optimization, the visual question-answering model is trained with picture and text information from the education process, so that it has visual perception and semantic recognition capabilities for education scenes.
As a further optimization, the 3D education scenes displayed by the web end can be switched at will.
As a further optimization, the image feature extraction module extracts the picture's feature information with a Faster R-CNN model, specifically: the Fast R-CNN model first performs convolution over the whole image, then fuses the candidate regions recommended by the selective search algorithm with the feature map computed by the convolutional network through a region-of-interest pooling layer to obtain the feature vectors corresponding to the candidate regions, and then classifies and regresses the feature vectors.
As a further optimization, the feature fusion module adopts a soft attention mechanism to fuse and reason over the question encoding and the picture features. The soft attention mechanism is as follows:

Let $X = [x_1, x_2, x_3, \ldots, x_N]$ denote the input information, and let the attention variable $z \in [1, N]$ index the positions, where the value of z indicates which piece of information is selected. Given q and X, the probability $a_i$ that position i is selected is computed as:

$$a_i = p(z = i \mid X, q) = \frac{\exp(s(x_i, q))}{\sum_{j=1}^{N} \exp(s(x_j, q))}$$

where $s(x_i, q)$ is a scoring function that computes the attention value at that position.
The invention has the beneficial effects that:
(1) The invention combines visual question answering with Web3D technology to design an AI for a Web3D application system for future education: an intelligent tutor that fuses visual recognition and semantic understanding across multiple modalities and can react responsively and communicate appropriately when interacting with learners.
(2) The adopted visual question-answering model is based on the education system: its data set covers most of the picture and text information arising in the education process, so the trained model fits the education domain closely.
(3) Building the three-dimensional education scene at the web end with WebGL lets users browse the required content more intuitively and gives them a sense of being personally on the scene, further enhancing the liveliness and interactivity of the project.
Therefore, compared with traditional multimedia teaching, the human-computer interaction realized by the invention first improves interactivity: learners can communicate one-to-one with the system, which raises their attention and immersion. Second, the system automatically performs visual and semantic understanding of the images and language given by the learner, so it can feed back insights and answers about different knowledge, improving learning efficiency.
Drawings
FIG. 1 is a schematic diagram of the human-computer interaction of the present invention integrating visual question answering in a web3D environment;
FIG. 2 is a flow diagram of the Fast R-CNN algorithm;
FIG. 3 is a diagram of the LSTM model memory cell structure.
Detailed Description
The invention aims to provide a human-computer interaction system that integrates visual question answering in a web3D environment, solving the single interaction mode, low interactivity, and lack of engagement of traditional multimedia teaching. Using WebGL technology, a scene model made with a modeling tool is imported in the browser at the web end through a model loader and the scene is rendered by a renderer, finally displaying a large-scale VR world. A visual question-answering model suited to the education system is deployed at the server end; it performs feature fusion and reasoning with an attention mechanism, obtains and answers learners' questions based on the model, and feeds the answers back into the three-dimensional education scene at the web end, thus realizing intelligent web3D + AI human-computer interaction.
The human-computer interaction system for integrating visual question answering in a web3D environment comprises a web end and a server end connected through a socket. The web end uses WebGL technology: an education scene model made with a modeling tool is imported through a model loader and the scene is rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end acquires the user's question and picture input information, transmits them to the server end, and displays the server end's answer in the 3D education scene together with interactive animations. The server end hosts a visual question-answering model for the education system; after receiving the question and picture input information transmitted by the web end, it obtains the corresponding answer with the visual question-answering model and feeds it back to the web end. In a specific implementation, the visual question-answering model includes a question encoding module, an image feature extraction module, a feature fusion module, and a decoder module: the question encoding module encodes the question with an LSTM network; the image feature extraction module extracts the picture's feature information with a Faster R-CNN model; the feature fusion module fuses and reasons over the question encoding and the picture features based on an attention mechanism; and the decoder module decodes the fused and reasoned information output by the feature fusion module and outputs the corresponding answer.
To establish 3D visual perception and interaction, the core of the system is a VQA (visual question answering) model oriented to the education system. The VQA model takes as input a picture and a free-form, open-ended natural-language question about that picture, and generates a natural-language answer as output. The image feature extraction, question encoding, and feature fusion and reasoning parts of the model are implemented as follows:
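As an illustrative, non-limiting sketch of how the four modules connect, the following PyTorch code (an assumption of this description rather than the patent's actual implementation; the layer sizes and the concatenation-based attention scorer are hypothetical) encodes the question with an LSTM, attends over precomputed Faster R-CNN region features, and decodes a single-word answer by classification. Sections A, B, and C below detail each part.

```python
import torch
import torch.nn as nn

class EduVQA(nn.Module):
    """Minimal sketch of the four-module VQA pipeline: question encoder (LSTM),
    precomputed Faster R-CNN region features, attention-based fusion, and a
    classifier-style answer decoder. All sizes are illustrative assumptions."""
    def __init__(self, vocab_size, n_answers, emb_dim=300, hid_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)  # question encoding module
        self.score = nn.Linear(feat_dim + hid_dim, 1)            # attention scorer s(x_i, q)
        self.decoder = nn.Sequential(                            # decoder (single-word VQA)
            nn.Linear(feat_dim + hid_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, n_answers))

    def forward(self, question_ids, region_feats):
        # question_ids: (B, T) token ids; region_feats: (B, N, feat_dim) from Faster R-CNN
        _, (h, _) = self.lstm(self.embed(question_ids))
        q = h[-1]                                                # (B, hid_dim) question code
        q_tiled = q.unsqueeze(1).expand(-1, region_feats.size(1), -1)
        s = self.score(torch.cat([region_feats, q_tiled], -1))   # (B, N, 1) attention scores
        a = torch.softmax(s, dim=1)                              # soft attention weights a_i
        v = (a * region_feats).sum(dim=1)                        # attended image feature
        return self.decoder(torch.cat([v, q], dim=-1))           # answer logits
```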
A. Image feature extraction:
the Faster and better target detection tool is realized by the Faster and better target detection tool realized by anyone of the Microsoft institute of research, such as Neplus, hosiemin, ross Girshick and Sunware on the basis of Fast R-CNN. The Faster R-CNN algorithm realizes the real end-to-end target detection and calculation process, and is mainly divided into three parts: 1) A convolutional neural network; 2) Regional recommendation Network (RPN); 3) Fast R-CNN target detection network. The algorithm still continues the idea that the R-CNN firstly carries out regional recommendation and then carries out classification, but the regional recommendation task by the convolutional neural network is successfully realized, and no additional algorithm is needed to be used for independently operating. The RPN and Fast R-CNN share a convolution neural network to carry out feature extraction, so that the convolution calculation times are reduced, and the speed of the whole algorithm is improved. Fast R-CNN is that convolution calculation is carried out on the whole image, then a candidate area recommended by a selective search algorithm is fused with a feature mapping image calculated by a convolution network through a region-of-interest Pooling Layer (RoI Pooling Layer), so that a feature vector corresponding to the candidate area is obtained, and the frequency of convolution calculation is greatly reduced by the operation of sharing convolution calculation. And the dimensions of the feature vectors are uniform, so that the subsequent classification work is facilitated.
Fast R-CNN was inspired by SPP-Net: its region-of-interest pooling layer fuses the convolution features with the candidate-region borders and pools them to obtain the feature vectors of the corresponding regions, which is equivalent to a special case of the SPP-Net spatial pyramid pooling layer with only one pyramid level. In addition, to train more effectively, Fast R-CNN uses several methods to increase speed, two of which are important: multi-task training and mini-batch sampling.
1) Region of interest pooling layer:
the region-of-interest pooling layer transforms the features in each active candidate region into a feature vector of fixed size W x H using a max pooling operation. The region of interest here refers to a rectangular window in the convolution signature, and in Fast R-CNN is the segmented region computed by the selective search algorithm. Each region of interest is represented by a quaternion vector (x, y, w, h), where (x, y) represents the coordinates of the upper left corner and (w, h) represents the height and width of the rectangular window. The region of interest pooling layer divides the window of interest of size W H into W x H mesh sub-windows, each of which is about (W/W) x (H/H), and then pools the eigenvalues in each sub-window maximally into the corresponding output mesh. This applies to each feature channel, as is the standard max pooling operation.
2) Multi-task training:
the multi-task training refers to that the target classification and the regression calculation of the frame of the candidate region are simultaneously used as two parallel output layers, and the training and the regression calculation of the classifier SVM are not divided into different stages in the R-CNN. The first task outputs the probability distribution of each interested area on the K +1 class (wherein K is the category of the data set, and the category is added with background), and the probability is calculated by using a softmax function; the second task is to compute t for the bounding box k =(t x k ,t y k ,t h k ,t w k ) The regression offset, parameter K, represents the K classes. Using k which is still defined in R-CNN t . Combining two tasks, and performing combined training classification and candidate region frame regression calculation by using a multi-task loss function:
L(p,u,tu,v)=L dx (p,u)+λ[u≥1]L loc (t u ,v)
the parameter u marks the actual category of the candidate region content as the target, normally u is larger than or equal to 1, and if u =0, the region content is a background; l is cls (p,u)=-logp u Is the probability log-loss function for class u; l is loc (t u V) is the frame position loss function calculated by smoothing the L1 loss function, t u Is the bounding box for class u prediction; aves brackets [ u is not less than 1]If the mark meets the condition u is more than or equal to 1 in the square brackets, the mark is 1, otherwise, the mark is 0; the parameter λ controls the balance between the two loss functions, and is therefore set to 1 in all experiments, since the contribution of the two loss functions is equally important.
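A minimal PyTorch sketch of this multi-task loss; the tensor shapes and the helper name are assumptions of this illustration:

```python
import torch
import torch.nn.functional as F

def multitask_loss(class_logits, box_deltas, labels, box_targets, lam=1.0):
    """L(p, u, t^u, v) = L_cls(p, u) + lam * [u >= 1] * L_loc(t^u, v).
    Assumed shapes: class_logits (R, K+1); box_deltas (R, (K+1)*4);
    labels u (R,); box_targets v (R, 4)."""
    loss_cls = F.cross_entropy(class_logits, labels)       # -log p_u over K+1 classes
    fg = labels >= 1                                       # Iverson bracket [u >= 1]
    if fg.any():
        n = int(fg.sum())
        # pick the predicted box t^u of each foreground RoI's true class u
        t_u = box_deltas[fg].view(n, -1, 4)[torch.arange(n), labels[fg]]
        loss_loc = F.smooth_l1_loss(t_u, box_targets[fg])  # smooth L1 position loss
    else:
        loss_loc = loss_cls.new_zeros(())                  # no foreground: box loss is 0
    return loss_cls + lam * loss_loc                       # lam = 1: equal contributions
```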
3) Mini-batch learning:
the least batch of learning is to achieve better gradient back-propagation effects. The convolutional neural network uses a gradient descent method for back propagation of the parameters. In the training process, the whole data set can be completely sent to a network model for training and learning, and the network calculates an iteratively updated gradient value by using all samples, which is a traditional gradient descent method. Alternatively, the updated gradient values of the network can be calculated using only one sample at a time, which is called a stochastic gradient descent method, also called an online learning method. Learning using the entire data set can more accurately converge to the location where the extreme value is located, and the number of iterations is small, but the time taken for each iteration is long. With equal computational effort, the convergence rate of training using the entire data set is slower than with a small number of samples. Although the convergence speed is high by using a small number of samples for training, the correction is carried out towards the gradient direction pointed by the current sample in each iteration process, the correction respectively faces different directions, and the correction respectively turns into halves, so that a large amount of noise is brought to cause performance reduction, and the convergence is difficult to achieve. Therefore, a mediocre approach, minimal batch learning, is used to seek a balance point between the two extremes. The minimum batch size set by Fast R-CNN is 128, and 64 interesting regions are respectively collected from two original images to be trained and learned together.
The algorithm flow of Fast R-CNN is shown in FIG. 2: the convolution features are processed by the region-of-interest pooling layer, and the resulting features are sent to two parallel computing tasks, classification and localization regression, for training. With these methods and the improved framework, Fast R-CNN achieves better results than R-CNN with shorter training and testing times.
B. Text feature extraction:
for text selection, we use the LSTM network. The result of each learning of the ordinary recurrent neural network is related to the data at the current moment and the data at the previous moment. The special structure of the recurrent neural network makes full use of historical data, so that the recurrent neural network has obvious advantages in processing sequence problems. However, the recurrent neural network model has a gradient vanishing problem, that is, the farther the data belongs to the moment, the smaller the influence of the data on the weight change is, and finally, the training result is often dependent on the data at the closer moment, that is, the long-term memory of the historical data is lacked. Hochereiter et al originally proposed a long-short term memory network, LSTM is an optimized model of RNN, inherits most characteristics of RNN, and solves the problem of disappearance of gradients generated in the reverse transfer process, and then LSTM is further improved and popularized by a. LSTM has three more sets of controllers in long-short term memory compared to the original RNN: forget gate, input gate, output gate. An LSTM model memory cell structure is shown in FIG. 3.
The forget gate selectively discards information from the cell state C; its selection process is computed with the following formula:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

Taking a language model as an example: when predicting a new word, if the cell state contains a property of the subject of the previous sentence, that property needs to be forgotten under the new sentence structure. After the forget gate, the cell must decide how to update its state: a sigmoid layer (the input gate) decides which values to update, and a tanh layer generates the new candidate values. The update is implemented by the following formulas:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

Continuing the language model example, the property of the new subject is added to the cell state to replace the old information. The update proceeds as:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

After the LSTM cell state is refreshed, the output value must be computed. The output is determined by the cell state C: a sigmoid layer first determines which part of the state to output, the state is then mapped to values between (-1, 1) with tanh, and the result is multiplied by the sigmoid gate output. The output process is computed as:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \qquad h_t = o_t * \tanh(C_t)$$
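The gate equations above transcribe directly into code; the following is a minimal sketch of one memory-cell step, with weight shapes following the standard convention assumed here:

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM memory-cell update. Each W_* has shape (hidden, hidden + input)
    and each b_* has shape (hidden,)."""
    z = torch.cat([h_prev, x_t], dim=-1)      # [h_{t-1}, x_t]
    f_t = torch.sigmoid(z @ W_f.T + b_f)      # forget gate: what to discard from C
    i_t = torch.sigmoid(z @ W_i.T + b_i)      # input gate: which values to update
    c_tilde = torch.tanh(z @ W_c.T + b_c)     # candidate values ~C_t
    c_t = f_t * c_prev + i_t * c_tilde        # C_t = f_t * C_{t-1} + i_t * ~C_t
    o_t = torch.sigmoid(z @ W_o.T + b_o)      # output gate
    h_t = o_t * torch.tanh(c_t)               # h_t = o_t * tanh(C_t)
    return h_t, c_t
```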
C. Feature fusion and reasoning:
the fusion between the problem feature and the image feature is various, such as multiplying or concatenating the two features together. However, such fusion is insufficient because the relationship between the problem and the image is complicated, and the interaction between them may not be fully utilized by a simple operation. Most current models answer a word by treating VQA as a multiclass classification question. Of course, we can answer the complete sentence by the RNN decoder model. In recent years, attention Mechanism (Attention Mechanism) has become a relatively popular topic in various fields of deep learning. Taking the example of mutual conversation among people, if a second person needs to answer the question of a first person, the question cannot be answered well under the condition that words are omitted and all the words are completely heard, and the question needs to be answered according to the emphasis of the question, wherein the emphasis is the so-called attention. We achieve feature fusion and reasoning through a mechanism of attention. Here we design the Soft Attention mechanism, which is a model that is relatively easy to add to existing network structures.
Let $X = [x_1, x_2, x_3, \ldots, x_N]$ denote the input information, and let the attention variable $z \in [1, N]$ index the positions, where the value of z indicates which piece of information is selected. Given q and X, the probability $a_i$ that position i is selected is computed as:

$$a_i = p(z = i \mid X, q) = \frac{\exp(s(x_i, q))}{\sum_{j=1}^{N} \exp(s(x_j, q))}$$

where $s(x_i, q)$ is a scoring function that computes the attention value at that position.
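A minimal module-level sketch of this soft attention follows; the additive scorer s(x_i, q) = w·tanh(W_x·x_i + W_q·q) is one common choice of scoring function, assumed here for illustration:

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """a_i = exp(s(x_i, q)) / sum_j exp(s(x_j, q)), then a weighted sum over X."""
    def __init__(self, x_dim, q_dim, hidden=512):
        super().__init__()
        self.W_x = nn.Linear(x_dim, hidden, bias=False)
        self.W_q = nn.Linear(q_dim, hidden, bias=False)
        self.w = nn.Linear(hidden, 1, bias=False)

    def forward(self, X, q):
        # X: (B, N, x_dim) picture features; q: (B, q_dim) question encoding
        s = self.w(torch.tanh(self.W_x(X) + self.W_q(q).unsqueeze(1)))  # scores s(x_i, q)
        a = torch.softmax(s, dim=1)           # selection probabilities a_i
        return (a * X).sum(dim=1)             # attended (fused) information
```

Because the attention weights come from a differentiable softmax rather than a hard selection, the module drops into an existing network and trains with ordinary back-propagation.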
Based on the above, the invention establishes a data set belonging to the education system on the basis of the existing VQA model, constructs the VQA model's feature extraction algorithms by combining the advantages of image recognition and text analysis algorithms, and fuses and reasons over the features through an attention mechanism, so that the established VQA model has visual recognition and semantic perception capabilities for the education process and can provide corresponding answers to the questioner's question and picture input.
The principle of human-computer interaction with integrated visual question answering in a web3D environment is shown in FIG. 1. A learner can load the 3D education scene directly in the browser at the web end. During learning, the learner can ask the AI teacher a question through a microphone while a camera acquires picture information; the web end uploads the picture and the question to the server through a socket. The server uses the VQA model to extract the picture's features and encode the question, fuses and reasons over the picture and question features through the attention mechanism, and finally generates the corresponding answer through the decoder. The server feeds the answer back to the web end through the socket, and the answer is displayed in the 3D education scene together with the AI teacher's animation, thus realizing human-computer interaction.
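For illustration, a minimal socket server of the kind described is sketched below; the port, the newline-delimited JSON framing, and the base64 image encoding are assumptions of this sketch, since the patent only specifies that the web end and server end communicate through a socket:

```python
import base64, io, json, socket

from PIL import Image  # assumed image-decoding dependency

def vqa_answer(question: str, picture: Image.Image) -> str:
    return "placeholder answer"              # stand-in for the VQA pipeline above

HOST, PORT = "0.0.0.0", 9000                 # hypothetical bind address and port
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind((HOST, PORT))
srv.listen()
while True:
    conn, _ = srv.accept()
    with conn:
        buf = b""
        while not buf.endswith(b"\n"):       # read one newline-delimited JSON message
            chunk = conn.recv(4096)
            if not chunk:
                break
            buf += chunk
        if not buf:
            continue                         # client closed without sending a message
        msg = json.loads(buf)                # {"question": ..., "image": <base64>}
        picture = Image.open(io.BytesIO(base64.b64decode(msg["image"])))
        reply = {"answer": vqa_answer(msg["question"], picture)}
        conn.sendall(json.dumps(reply).encode("utf-8") + b"\n")
```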
In conclusion, the invention combines Web3D technology with visual question-answering technology: it designs and realizes a world based on educational content in the browser using Web3D technology, builds a set of visual question-answering models suitable for this VR world with deep learning networks, and finally fuses the two, developing an intelligent VR education project that integrates interaction, three dimensions, dynamics, and object recognition. Projects that previously could be opened only in other clients such as PCs and game consoles are moved onto the new stage of the browser: no plug-in of any kind needs to be installed, and a good project experience is obtained simply by opening a web page in the browser.

Claims (5)

1. A human-computer interaction system for integrating visual question answering in a web3D environment, characterized in that it comprises:
a web end and a server end connected through a socket, wherein the web end uses WebGL technology: an education scene model made with a modeling tool is imported through a model loader and the scene is rendered by a renderer, so that a 3D education scene is displayed in a browser; the web end acquires the user's question and picture input information and transmits them to the server end, and acquires the server end's answer and displays it in the 3D education scene together with interactive animations; the server end hosts a visual question-answering model for the education system, and after receiving the question and picture input information transmitted by the web end, obtains the corresponding answer with the visual question-answering model and feeds it back to the web end;
the visual question-answering model comprises: a question encoding module, an image feature extraction module, a feature fusion module and a decoder module; the question encoding module encodes the question with an LSTM network; the image feature extraction module extracts the picture's feature information with a Faster R-CNN model; the feature fusion module fuses and reasons over the question encoding and the picture features based on an attention mechanism; the decoder module decodes the fused and reasoned information output by the feature fusion module and outputs the corresponding answer;
the feature fusion module adopts a soft attention mechanism to fuse and reason over the question encoding and the picture features, the soft attention mechanism comprising:
letting $X = [x_1, x_2, x_3, \ldots, x_N]$ denote the input information and the attention variable $z \in [1, N]$ index the positions, where the value of z indicates which piece of information is selected, the probability $a_i$ that position i is selected, given q and X, is computed as:

$$a_i = p(z = i \mid X, q) = \frac{\exp(s(x_i, q))}{\sum_{j=1}^{N} \exp(s(x_j, q))}$$

where $s(x_i, q)$ is a scoring function that computes the attention value at that position.
2. The human-computer interaction system for integrating visual question answering in a web3D environment of claim 1, characterized in that the web end acquires the user's question through a microphone and the picture input information through a camera.
3. The human-computer interaction system for integrating visual question answering in a web3D environment of claim 1, characterized in that the image feature extraction module extracts the picture's feature information with a Faster R-CNN model, specifically: the Fast R-CNN model first performs convolution over the whole image, then fuses the candidate regions recommended by the selective search algorithm with the feature map computed by the convolutional network through a region-of-interest pooling layer to obtain the feature vectors corresponding to the candidate regions, and then classifies and regresses the feature vectors.
4. The human-computer interaction system for integrating visual question answering in a web3D environment of claim 1, characterized in that the visual question-answering model is trained with picture and text information from the education process, so that it has visual perception and semantic recognition capabilities for education scenes.
5. The human-computer interaction system for integrating visual question answering in a web3D environment of any one of claims 1-4, characterized in that the 3D education scene displayed by the web end can be switched freely.
CN201911099861.9A 2019-11-12 2019-11-12 Human-computer interaction system for integrating visual question answering in web3D environment Active CN110851760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911099861.9A CN110851760B (en) 2019-11-12 2019-11-12 Human-computer interaction system for integrating visual question answering in web3D environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911099861.9A CN110851760B (en) 2019-11-12 2019-11-12 Human-computer interaction system for integrating visual question answering in web3D environment

Publications (2)

Publication Number Publication Date
CN110851760A CN110851760A (en) 2020-02-28
CN110851760B true CN110851760B (en) 2022-12-27

Family

ID=69600399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911099861.9A Active CN110851760B (en) 2019-11-12 2019-11-12 Human-computer interaction system for integrating visual question answering in web3D environment

Country Status (1)

Country Link
CN (1) CN110851760B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111459283A (en) * 2020-04-07 2020-07-28 电子科技大学 Man-machine interaction implementation method integrating artificial intelligence and Web3D
CN112463936B (en) * 2020-09-24 2024-06-07 北京影谱科技股份有限公司 Visual question-answering method and system based on three-dimensional information
CN112873211B (en) * 2021-02-24 2022-03-11 清华大学 Robot man-machine interaction method
CN112926655B (en) * 2021-02-25 2022-05-17 电子科技大学 Image content understanding and visual question and answer VQA method, storage medium and terminal
CN113010656B (en) * 2021-03-18 2022-12-20 广东工业大学 Visual question-answering method based on multi-mode fusion and structural control
CN113837259B (en) * 2021-09-17 2023-05-30 中山大学附属第六医院 Education video question-answering method and system for graph-note-meaning fusion of modal interaction
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 Visual question-answering method based on deep reasoning attention mechanism
CN114840697B (en) * 2022-04-14 2024-04-26 山东大学 Visual question-answering method and system for cloud service robot

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016170A (en) * 2017-03-14 2017-08-04 上海大学 A kind of LED lamp three-dimensional customization emulation mode based on WebGL
CN108549658A (en) * 2018-03-12 2018-09-18 浙江大学 A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium
KR20190092043A (en) * 2018-01-30 2019-08-07 연세대학교 산학협력단 Visual Question Answering Apparatus for Explaining Reasoning Process and Method Thereof
CN110196717A (en) * 2019-06-22 2019-09-03 中国地质大学(北京) A kind of Web3D internet exchange platform and its building method
CN110309850A (en) * 2019-05-15 2019-10-08 山东省计算中心(国家超级计算济南中心) Vision question and answer prediction technique and system based on language priori problem identification and alleviation
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965705B2 (en) * 2015-11-03 2018-05-08 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering
AU2017203904A1 (en) * 2016-06-15 2018-01-18 Dotty Digital Pty Ltd A system, device, or method for collaborative augmented reality

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016170A (en) * 2017-03-14 2017-08-04 上海大学 A kind of LED lamp three-dimensional customization emulation mode based on WebGL
KR20190092043A (en) * 2018-01-30 2019-08-07 연세대학교 산학협력단 Visual Question Answering Apparatus for Explaining Reasoning Process and Method Thereof
CN108549658A (en) * 2018-03-12 2018-09-18 浙江大学 A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium
CN110309850A (en) * 2019-05-15 2019-10-08 山东省计算中心(国家超级计算济南中心) Vision question and answer prediction technique and system based on language priori problem identification and alleviation
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
CN110196717A (en) * 2019-06-22 2019-09-03 中国地质大学(北京) A kind of Web3D internet exchange platform and its building method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"Research and implementation on the WEB3D visualization of digtal moon based on WebGL";Yi Lian,Long He, Jinsong Ping et al.;《2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS)》;20171204;6094-6097 *
"Viewpoint:Restrospective :An axiomatic basis for computer programming";C.A.R. Hoare;《Communication of the ACM》;20091231;第52卷(第10期);30-32 *
"VQA-Med:Overview of the Medical Visual Queswering Task at ImageCIEF 2019";Abacha A B,Hasan S A, Dalta VV,et al.;《Lecture Notes in Computer ence,2019》;20190912;论文第1-5页 *
"基于深度神经网络的图像碎片化信息问答算法";王一蕾 等;《计算机研究与发展》;20181215;第55卷(第12期);2600-2610 *
Iqbal Chowdhury ; Kien Nguyen ; Clinton Fookes ; Sridha Sridharan."A cascaded long short-term memory (LSTM) driven generic visual question answering (VQA)".《 2017 IEEE International Conference on Image Processing (ICIP)》.2018,1842-1846. *

Also Published As

Publication number Publication date
CN110851760A (en) 2020-02-28

Similar Documents

Publication Publication Date Title
CN110851760B (en) Human-computer interaction system for integrating visual question answering in web3D environment
US11907637B2 (en) Image processing method and apparatus, and storage medium
CN110163299B (en) Visual question-answering method based on bottom-up attention mechanism and memory network
CN110377710B (en) Visual question-answer fusion enhancement method based on multi-mode fusion
CN109829541A (en) Deep neural network incremental training method and system based on learning automaton
CN109919221B (en) Image description method based on bidirectional double-attention machine
CN110399518A (en) A kind of vision question and answer Enhancement Method based on picture scroll product
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
CN110866542A (en) Depth representation learning method based on feature controllable fusion
CN109711356B (en) Expression recognition method and system
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN111598979A (en) Method, device and equipment for generating facial animation of virtual character and storage medium
CN117055724B (en) Working method of generating teaching resource system in virtual teaching scene
CN112070040A (en) Text line detection method for video subtitles
CN115953521B (en) Remote digital person rendering method, device and system
CN112530218A (en) Many-to-one accompanying intelligent teaching system and teaching method
Zhang et al. Teaching chinese sign language with a smartphone
Hui et al. A systematic approach for English education model based on the neural network algorithm
CN113591988A (en) Knowledge cognitive structure analysis method, system, computer equipment, medium and terminal
CN113780059A (en) Continuous sign language identification method based on multiple feature points
CN117251057A (en) AIGC-based method and system for constructing AI number wisdom
Yan Computational methods for deep learning: theory, algorithms, and implementations
CN110826510A (en) Three-dimensional teaching classroom implementation method based on expression emotion calculation
CN116541507A (en) Visual question-answering method and system based on dynamic semantic graph neural network
CN113360669B (en) Knowledge tracking method based on gating graph convolution time sequence neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant