CN110851760A - Human-computer interaction system for integrating visual question answering in web3D environment - Google Patents
- Publication number: CN110851760A
- Application number: CN201911099861.9A
- Authority: CN (China)
- Prior art keywords: model, visual question, answering, education, web3d
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B5/00—Electrically-operated educational appliances
- G09B5/08—Electrically-operated educational appliances providing for individual presentation of information to a plurality of student stations
- G09B5/14—Electrically-operated educational appliances providing for individual presentation of information to a plurality of student stations with provision for individual teacher-student communication
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Library & Information Science (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Business, Economics & Management (AREA)
- Educational Administration (AREA)
- Educational Technology (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to the field of VR education and discloses a human-computer interaction system that integrates visual question answering in a web3D environment, solving the problems of traditional multimedia teaching: a single interaction mode, low interactivity, and lack of engagement. The system comprises a web end and a server end connected through a Socket. The web end uses WebGL: an education scene model produced with a modeling tool is imported through a model loader and rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end collects the user's question and picture input, transmits them to the server end, and displays the answer fed back by the server end in the 3D education scene together with interactive animations. The server end hosts a visual question-answering model built for the education system; after receiving the question and picture input transmitted by the web end, it obtains the corresponding answer with the visual question-answering model and feeds the answer back to the web end.
Description
Technical Field
The invention relates to the field of VR education, and in particular to a human-computer interaction system that integrates visual question answering in a web3D environment.
Background
Thanks to the development of educational informatization and intelligent hardware, human-computer interactive learning systems based on multimedia technology have been widely applied in teaching. However, current multimedia teaching has the following disadvantages: (1) the interaction mode is single, as students can operate only with a mouse and keyboard; (2) the interactive interface is monotonous and unengaging, making it hard to hold students' attention.
Research by educators and psychologists at home and abroad shows that, apart from intelligence, a student's level of interest in learning is one of the most important factors influencing learning outcomes. Therefore, if teaching can be improved on the human-computer interaction side, students' interest can be raised, which in turn improves learning efficiency.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a human-computer interaction system that integrates visual question answering in a web3D environment, solving the problems of a single interaction mode, low interactivity, and lack of engagement in traditional multimedia teaching.
The technical scheme adopted by the invention for solving the technical problems is as follows:
The human-computer interaction system integrating visual question answering in a web3D environment comprises a web end and a server end connected through a Socket. The web end uses WebGL: an education scene model produced with a modeling tool is imported through a model loader and rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end collects the user's question and picture input, transmits them to the server end, and displays the answer fed back by the server end in the 3D education scene together with interactive animations. The server end hosts a visual question-answering model built for the education system; after receiving the question and picture input transmitted by the web end, it obtains the corresponding answer with the visual question-answering model and feeds the answer back to the web end.
As a further optimization, the web end acquires the user's question through a microphone and the picture input through a camera.
As a further optimization, the visual question-answering model comprises a question encoding module, an image feature extraction module, a feature fusion module and a decoder module. The question encoding module encodes the question with an LSTM network; the image feature extraction module extracts feature information from the picture with a Faster R-CNN model; the feature fusion module fuses and reasons over the encoded question and the picture features with an attention mechanism; and the decoder module decodes the fused, reasoned information output by the feature fusion module into the corresponding answer.
As a further optimization, the visual question-answering model is trained on picture and text information from the education process, so that it has visual perception and semantic recognition capabilities for education scenes.
As a further optimization, the 3D education scenes displayed on the web end can be switched at will.
As a further optimization, the image feature extraction module extracts feature information from the picture with a Faster R-CNN model, specifically: the Fast R-CNN stage first performs convolution over the whole image, then fuses the candidate regions recommended by the selective search algorithm with the feature maps computed by the convolutional network through a region-of-interest pooling layer to obtain the feature vector for each candidate region, and then classifies the feature vectors and regresses the bounding boxes.
As a further optimization, the feature fusion module fuses and reasons over the encoded question and the picture features with a soft attention mechanism, the soft attention mechanism being as follows:

Let X = [x_1, x_2, x_3, ..., x_N] denote the input information, and let the attention variable z ∈ [1, N] be an index over positions, the value of z indicating the position of the selected information. Given q and X, the probability a_i that position i is selected is computed as:

a_i = p(z = i | X, q) = softmax(s(x_i, q)) = exp(s(x_i, q)) / Σ_{j=1}^{N} exp(s(x_j, q))

where s(x_i, q) is a scoring function that computes the attention value at that position.
The invention has the beneficial effects that:
(1) The visual question-answering technology is combined with Web3D technology to design an AI for a future-oriented Web3D education application system: an intelligent tutor with multi-perception fusion of visual recognition and semantic understanding, which reacts responsively and communicates accordingly in its interaction with learners.
(2) The adopted visual question-answering model is built around the education system: its data set covers most of the picture and text information arising in the education process, so the trained model fits the education system closely.
(3) WebGL is used to build the three-dimensional education scene on the web end, so that users can browse the required content more intuitively, with a feeling of being present in the scene, further enhancing the liveliness and interactivity of the project.
Therefore, compared with traditional multimedia teaching, the human-computer interaction realized by the invention first improves interactivity, letting the learner communicate one-to-one with the system and raising the learner's attention and immersion; second, the system automatically performs visual-semantic understanding of the images and language given by the learner, so it can feed back insights and answers about different knowledge to the learner and improve learning efficiency.
Drawings
FIG. 1 is a schematic diagram of the human-computer interaction of the present invention incorporating visual question answering in a web3D environment;
FIG. 2 is a flow diagram of the Fast R-CNN algorithm;
FIG. 3 is a diagram showing the structure of an LSTM memory cell.
Detailed Description
The invention aims to provide a human-computer interaction system that integrates visual question answering in a web3D environment, solving the problems of a single interaction mode, low interactivity, and lack of engagement in traditional multimedia teaching. Using WebGL, a scene model produced with a modeling tool is imported in the browser on the web end through a model loader and rendered by a renderer, finally displaying a large-scale VR world. A visual question-answering model suited to the education system is deployed on the server end; it fuses and reasons over features with an attention mechanism, obtains and answers the learner's questions, and feeds the answers back into the three-dimensional education scene on the web end, thereby realizing intelligent web3D + AI human-computer interaction.
The human-computer interaction system integrating visual question answering in a web3D environment comprises a web end and a server end connected through a Socket. The web end uses WebGL: an education scene model produced with a modeling tool is imported through a model loader and rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end collects the user's question and picture input, transmits them to the server end, and displays the answer fed back by the server end in the 3D education scene together with interactive animations. The server end hosts a visual question-answering model built for the education system; after receiving the question and picture input transmitted by the web end, it obtains the corresponding answer with the visual question-answering model and feeds the answer back to the web end. In a specific implementation, the visual question-answering model comprises a question encoding module, an image feature extraction module, a feature fusion module and a decoder module. The question encoding module encodes the question with an LSTM network; the image feature extraction module extracts feature information from the picture with a Faster R-CNN model; the feature fusion module fuses and reasons over the encoded question and the picture features with an attention mechanism; and the decoder module decodes the fused, reasoned information output by the feature fusion module into the corresponding answer.
To establish 3D visual perception and interaction, the core of the system is to build a VQA (visual question answering) model oriented to the education system. The VQA model takes as input a picture and a free-form, open-ended natural language question about that picture, and generates a natural language answer as output. The image feature extraction, question encoding, and feature fusion and reasoning parts of the model are realized as follows:
A. Image feature extraction:
the Faster and better target detection tool is realized by the Faster and better target detection tool realized by anyone of the Microsoft institute of research, such as Neplus, Hosiemin, Ross Girshick and Sunware on the basis of Fast R-CNN. The Faster R-CNN algorithm realizes the real end-to-end target detection and calculation process, and is mainly divided into three parts: 1) a convolutional neural network; 2) regional recommendation Network (RPN); 3) fast R-CNN target detection network. The algorithm still continues the idea that the R-CNN firstly carries out regional recommendation and then carries out classification, but the regional recommendation task by the convolutional neural network is successfully realized, and no additional algorithm is needed to be used for independently operating. The RPN and Fast R-CNN share a convolutional neural network for feature extraction, so that the convolution calculation times are reduced, and the speed of the whole algorithm is improved. Fast R-CNN is to perform convolution calculation on the whole image, then fuse a candidate region recommended by a selective search algorithm with a feature mapping image calculated by a convolution network through a region-of-interest Pooling Layer (RoI Pooling Layer) to obtain a feature vector corresponding to the candidate region, and the frequency of convolution calculation is greatly reduced by the operation of sharing convolution calculation. And the dimensions of the feature vectors are uniform, so that the subsequent classification work is facilitated.
Fast R-CNN was inspired by SPP-Net: its region-of-interest pooling layer fuses and pools the convolutional features with the candidate-region borders to obtain the feature vector of the corresponding region, which is equivalent to a special case of the SPP-Net spatial pyramid pooling layer with a pyramid of only one level. In addition, to train better and faster, Fast R-CNN uses several further methods, two of which are important: multi-task training and mini-batch sampling.
1) Region of interest pooling layer:
The region-of-interest pooling layer transforms the features in each valid candidate region into a feature vector of fixed size W × H using a max pooling operation. The region of interest here refers to a rectangular window in the convolutional feature map; in Fast R-CNN it is a segmented region computed by the selective search algorithm. Each region of interest is represented by a four-tuple (x, y, w, h), where (x, y) are the coordinates of the upper-left corner and (w, h) are the width and height of the rectangular window. The region-of-interest pooling layer divides a window of interest of size w × h into a grid of W × H sub-windows, each of size about (w/W) × (h/H), and then max-pools the feature values in each sub-window into the corresponding output cell. As with standard max pooling, this is applied to each feature channel independently.
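A minimal pure-Python sketch of the region-of-interest pooling operation just described: a rectangular window (x, y, w, h) of a single-channel feature map is divided into a fixed output grid, and each sub-window is max-pooled. The function name and the list-of-lists feature map are illustrative assumptions, not the patent's implementation.

```python
def roi_max_pool(feature_map, roi, out_w, out_h):
    """Max-pool the RoI (x, y, w, h) of a 2D feature map to out_h x out_w.

    Each RoI cell is assigned to one output grid cell of size roughly
    (h / out_h) x (w / out_w), and the maximum value per cell is kept.
    """
    x, y, w, h = roi
    out = [[float("-inf")] * out_w for _ in range(out_h)]
    for i in range(h):
        for j in range(w):
            # Map this RoI cell to its output grid cell.
            gi = min(i * out_h // h, out_h - 1)
            gj = min(j * out_w // w, out_w - 1)
            val = feature_map[y + i][x + j]
            if val > out[gi][gj]:
                out[gi][gj] = val
    return out
```

In a real detector this runs per feature channel and per candidate region, yielding the fixed-length vectors that feed the classification and regression heads.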
2) Multi-task training:
The multi-task training means that target classification and the regression of the candidate-region bounding box are treated as two parallel output layers, rather than split into separate stages as in R-CNN, where an SVM classifier is trained and the regression amounts are computed separately. The first task outputs, for each region of interest, a probability distribution p over K + 1 classes (K data set categories plus background), computed with a softmax function; the second task computes, for each of the K classes k, the bounding-box regression offsets t^k = (t_x^k, t_y^k, t_w^k, t_h^k), using the same parameterization of t^k as defined in R-CNN. The two tasks are combined, and classification and candidate-region bounding-box regression are jointly trained with a multi-task loss function:

L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v)

Here the parameter u labels the actual class of the target candidate-region content, with u ≥ 1 in the normal case and u = 0 indicating that the region content is background; L_cls(p, u) = −log p_u is the log-loss on the probability of the true class u; L_loc(t^u, v) is the bounding-box position loss computed with the smooth L1 loss, where t^u is the bounding box predicted for class u and v is the ground-truth box; the Iverson bracket [u ≥ 1] is 1 if the condition u ≥ 1 holds and 0 otherwise; and the parameter λ balances the two loss functions and is set to 1 in all experiments, since the contributions of the two losses are equally important.
3) Mini-batch learning:
Mini-batch learning is used to achieve a better gradient back-propagation effect. The convolutional neural network uses gradient descent to back-propagate into its parameters. During training, the whole data set can be fed to the network model at once, with the network computing each iterative gradient update from all samples; this is the traditional (batch) gradient descent method. Alternatively, the gradient for each update can be computed from only one sample at a time, which is called stochastic gradient descent, also known as online learning. Learning over the entire data set converges more accurately to the location of the extremum and needs few iterations, but each iteration takes a long time; for equal computational effort, training on the entire data set converges more slowly than training on a small number of samples. Training on single samples converges quickly, but each iteration corrects toward the gradient direction of the current sample alone, and since successive corrections point in different directions and partly cancel, considerable noise is introduced, degrading performance and making convergence difficult. Therefore a compromise, mini-batch learning, is used to find a balance point between the two extremes. The mini-batch size set by Fast R-CNN is 128: 64 regions of interest are sampled from each of two original images and trained together.
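The batch/stochastic/mini-batch trade-off above can be sketched with a toy least-squares problem: each parameter update uses the averaged gradient of a small shuffled batch rather than the whole data set or a single sample. The function, learning rate, and batch size here are illustrative (Fast R-CNN itself uses mini-batches of 128 RoIs).

```python
import random

def minibatch_sgd(data, lr=0.01, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x by minimizing mean squared error with mini-batch SGD.

    data: list of (x, y) pairs. Each update averages the gradient over one
    mini-batch only, trading a little gradient noise for much cheaper steps.
    """
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new random batches every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of MSE w.r.t. w over the mini-batch.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w
```

With batch_size=len(data) this degenerates to batch gradient descent, and with batch_size=1 to the online (stochastic) method described in the text.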
The algorithm flow of Fast R-CNN is shown in FIG. 2: the convolutional features are processed by the region-of-interest pooling layer, and the resulting features are fed to two parallel computation tasks for training, classification and localization regression. With these methods and the improved framework, Fast R-CNN achieves better results than R-CNN with shorter training and testing times.
B. Text feature extraction:
for text selection, we use the LSTM network. The result of each learning of the ordinary recurrent neural network is related to the data at the current moment and the data at the previous moment. The special structure of the recurrent neural network makes full use of historical data, so that the recurrent neural network has obvious advantages in processing sequence problems. However, the recurrent neural network model has a gradient vanishing problem, that is, the farther the data belongs to the moment, the smaller the influence of the data on the weight change is, and finally, the training result is often dependent on the data at the closer moment, that is, the long-term memory of the historical data is lacked. Hochereiter et al originally proposed a long-short term memory network, LSTM was an optimized model of RNN, inherited most of the properties of RNN, and solved the problem of gradient disappearance during reverse transfer, after which LSTM was further improved and generalized by a.graves. LSTM has three more sets of controllers for long-short term memory compared to the original RNN: forget gate, input gate, output gate. An LSTM model memory cell structure is shown in FIG. 3.
The forgetting gate serves to selectively discard data in cell state C, the selection process being calculated using the following formula:
ft=σ(Wf·[ht-1,xt]+bf)
Taking a language model as an example: when predicting a new word, if the cell state contains the properties of the subject of the previous sentence, the properties of that old subject need to be forgotten under the new sentence structure. After the forget gate, the cell needs to choose how to update the cell state: a sigmoid layer, the input gate, decides which values to update, and a tanh layer generates a new candidate state. This update is implemented by the following formulas:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)

Continuing the language-model example, the properties of the new subject need to be added to the cell state to replace the old, related information. The cell state is updated as follows:

C_t = f_t · C_{t−1} + i_t · C̃_t
after the LSMT cell unit is refreshed, the output value needs to be calculated. The output value is determined by the cell state C, first by using the sigmoid layer to determine the position of the part to be output, then converting to a value between (-1,1) with tanh, and then multiplying with the sigmoid gate output.
The output process is represented by the formula ot=σ(Wo[ht,xt]+bo) Calculate, at the same time, ht=σt*tanh(ct)。
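The gate equations above can be traced with a minimal scalar sketch of one LSTM step: three sigmoid gates, a tanh candidate, the cell-state update C_t = f_t·C_{t−1} + i_t·C̃_t, and h_t = o_t·tanh(C_t). The scalar weights here are illustrative placeholders, not trained values; real LSTMs use weight matrices over vector states.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step on scalars.

    W holds (forget, input, candidate, output) weight pairs; each pair acts
    on the concatenation [h_{t-1}, x_t]. b holds the four biases.
    """
    (wf, uf), (wi, ui), (wc, uc), (wo, uo) = W
    bf, bi, bc, bo = b
    f_t = sigmoid(wf * h_prev + uf * x_t + bf)        # forget gate
    i_t = sigmoid(wi * h_prev + ui * x_t + bi)        # input gate
    c_tilde = math.tanh(wc * h_prev + uc * x_t + bc)  # candidate state
    c_t = f_t * c_prev + i_t * c_tilde                # cell state update
    o_t = sigmoid(wo * h_prev + uo * x_t + bo)        # output gate
    h_t = o_t * math.tanh(c_t)                        # hidden output
    return h_t, c_t
```

Because the cell state is carried forward additively through f_t rather than repeatedly multiplied through the recurrence, gradients along C survive far longer than in a plain RNN, which is the vanishing-gradient fix described above.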
C. Feature fusion and reasoning:
the fusion between the problem feature and the image feature is various, such as multiplying or concatenating the two features together. However, such fusion is insufficient because the relationship between the problem and the image is complicated, and the interaction between them may not be fully utilized by a simple operation. Current models answer a word mostly by treating VQA as a multi-class classification question. Of course, we can answer the complete sentence by the RNN decoder model. In recent years, Attention Mechanism (Attention Mechanism) has become a relatively popular topic in various fields of deep learning. Taking the example of mutual conversation among people, if a second person needs to answer the question of a first person, the question cannot be answered well under the condition that words are omitted and all the words are completely heard, and the question needs to be answered according to the emphasis of the question, wherein the emphasis is the so-called attention. We achieve feature fusion and reasoning through a mechanism of attention. Here, we design the SoftAttention mechanism, which is a model that is relatively easy to add to existing network structures.
Let X = [x_1, x_2, x_3, ..., x_N] denote the input information, and let the attention variable z ∈ [1, N] be an index over positions, the value of z indicating the position of the selected information. Given q and X, the probability a_i that position i is selected is computed as:

a_i = p(z = i | X, q) = softmax(s(x_i, q)) = exp(s(x_i, q)) / Σ_{j=1}^{N} exp(s(x_j, q))

where s(x_i, q) is a scoring function that computes the attention value at that position.
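The soft attention computation above is a softmax over scores followed by an expectation: score each position with s(x_i, q), normalize the scores into probabilities a_i, and read out Σ_i a_i·x_i. This sketch uses a dot-product score, which is one common choice of s(x_i, q) (an assumption, since the patent does not fix the scoring function); the vectors are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_attention(xs, q):
    """Return (a, attended): selection probabilities and the weighted readout.

    xs: list of N input vectors x_1..x_N; q: query vector.
    """
    scores = [dot(x, q) for x in xs]        # s(x_i, q)
    m = max(scores)                         # subtract max to stabilize exp
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    a = [e / z for e in exps]               # a_i = softmax(s(x_i, q))
    dim = len(xs[0])
    attended = [sum(a[i] * xs[i][k] for i in range(len(xs))) for k in range(dim)]
    return a, attended
```

Because the readout is a probability-weighted average rather than a hard selection, the whole operation stays differentiable, which is what makes soft attention easy to drop into an existing network and train with backpropagation.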
Based on the above, the invention builds a data set belonging to the education system on top of existing VQA models, constructs its feature extraction algorithm by combining the advantages of image recognition and text analysis algorithms, and fuses and reasons over the features through an attention mechanism, so that the resulting VQA model has visual recognition and semantic perception capabilities for the education process and can give corresponding answers to the questioner's question and picture input.
The principle of human-computer interaction with integrated visual question answering in a web3D environment is shown in FIG. 1. A learner can load a 3D education scene directly in the browser on the web end. During learning, the learner can ask the AI teacher a question through the microphone while the camera captures picture information, and the web end uploads the picture and question to the server through the socket. The server uses the VQA model to extract features from the picture, encodes the question, fuses and reasons over the picture and question features through the attention mechanism, and finally generates the corresponding answer through the decoder. The server feeds the answer back to the web end through the socket, and the answer is displayed in the 3D education scene together with the AI teacher's animation, thereby realizing human-computer interaction.
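The request/response flow just described — web end sends question plus picture over a socket, server answers, web end displays the reply — can be sketched with the standard library. The one-shot JSON protocol and the stand-in `answer` function are illustrative assumptions; in the real system the server invokes the VQA model and the client is the browser, not Python.

```python
import json
import socket
import threading

def answer(question, picture):
    # Placeholder for the VQA model: feature extraction, question encoding,
    # attention-based fusion, and decoding would happen here.
    return "a stand-in answer about " + picture

def serve_once(srv):
    """Server end: accept one connection, answer one question, and return."""
    conn, _ = srv.accept()
    with conn:
        req = json.loads(conn.recv(65536).decode())
        reply = {"answer": answer(req["question"], req["picture"])}
        conn.sendall(json.dumps(reply).encode())

def ask(port, question, picture):
    """Web-end side: send the question and picture reference, read the answer."""
    with socket.create_connection(("127.0.0.1", port)) as c:
        payload = {"question": question, "picture": picture}
        c.sendall(json.dumps(payload).encode())
        return json.loads(c.recv(65536).decode())["answer"]
```

In the described system the answer travels back to the browser and is rendered in the 3D scene with the AI teacher's animation; this sketch only shows the shape of the socket exchange.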
In conclusion, by combining Web3D technology with visual question-answering technology, the invention uses Web3D to design and realize a realistic world based on educational content in the browser, uses a deep learning network to build a visual question-answering model suited to that VR world, and finally fuses the model with the Web3D technology, developing an intelligent VR education project that integrates interaction, three dimensions, dynamics and object recognition. Projects that previously could be opened only on other clients such as a PC application or a console are moved onto the new stage of the browser: no plug-in needs to be installed, and a good project experience is obtained simply by opening a web page in the browser.
Claims (7)
1. A human-computer interaction system incorporating visual question-answering in a web3D environment, characterized in that,
comprising: a web end and a server end connected through a Socket; wherein the web end uses WebGL, imports an education scene model produced with a modeling tool through a model loader and renders the scene with a renderer, so that a 3D education scene is displayed in the browser; the web end collects the user's question and picture input, transmits them to the server end, and displays the answer fed back by the server end in the 3D education scene together with interactive animations; and the server end hosts a visual question-answering model built for the education system, and after receiving the question and picture input transmitted by the web end, obtains the corresponding answer with the visual question-answering model and feeds the answer back to the web end.
2. The human-computer interaction system of claim 1 with visual question-answering integrated in a web3D environment,
the method is characterized in that the web end respectively acquires the question and the picture input information of the user through a microphone and a camera.
3. The human-computer interaction system of claim 1 with visual question-answering integrated in a web3D environment,
wherein the visual question-answering model comprises a question encoding module, an image feature extraction module, a feature fusion module and a decoder module; the question encoding module encodes the question with an LSTM network; the image feature extraction module extracts feature information from the picture with a Faster R-CNN model; the feature fusion module fuses and reasons over the encoded question and the picture features with an attention mechanism; and the decoder module decodes the fused, reasoned information output by the feature fusion module into the corresponding answer.
4. The human-computer interaction system integrating visual question answering in a web3D environment according to claim 3,
characterized in that the image feature extraction module extracts feature information from the picture with a Faster R-CNN model, specifically: the model first performs convolution over the whole picture, then fuses the candidate regions recommended by the selective search algorithm with the feature map computed by the convolutional network through a region-of-interest pooling layer to obtain the feature vector corresponding to each candidate region, and finally classifies and regresses these feature vectors.
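The region-of-interest pooling step described in claim 4 — collapsing an arbitrarily sized candidate region of the feature map into a fixed-size grid so it can feed a classifier — can be sketched as follows. The max-pooling rule and the assumption that the region spans at least `output_size` cells in each dimension are illustrative simplifications:

```python
import numpy as np

def roi_pool(feature_map: np.ndarray, roi: tuple, output_size: int = 2) -> np.ndarray:
    """Max-pool a region of interest down to an output_size x output_size grid.

    roi is (x0, y0, x1, y1) in feature-map coordinates; the region is assumed
    to be at least output_size cells wide and tall.
    """
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    ys = np.linspace(0, h, output_size + 1, dtype=int)   # bin boundaries (rows)
    xs = np.linspace(0, w, output_size + 1, dtype=int)   # bin boundaries (cols)
    out = np.empty((output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(36.0).reshape(6, 6)      # toy 6x6 feature map
pooled = roi_pool(fmap, (0, 0, 4, 4))     # 4x4 region -> fixed 2x2 output
```

Because every candidate region is reduced to the same grid, regions of different sizes all yield equal-length feature vectors for the subsequent classification and regression.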
5. The human-computer interaction system integrating visual question answering in a web3D environment according to claim 3,
characterized in that the feature fusion module fuses and reasons over the question encoding and the picture feature information with a soft attention mechanism, in which:
Let X = [x1, x2, x3, ..., xN] denote the input information, and let the attention variable z ∈ [1, N] be an index over positions, where z = i indicates that the i-th piece of information is selected. Given q and X, the probability αi that position i is selected is computed as

αi = p(z = i | X, q) = softmax(s(xi, q)) = exp(s(xi, q)) / Σj=1..N exp(s(xj, q)),

where s(xi, q) is a scoring function that computes the attention value at that position.
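The soft attention computation of claim 5 can be sketched directly from the formula above. Here the scoring function s(xi, q) is assumed to be a plain dot product — one common choice, not necessarily the patent's — and the fused context is the α-weighted sum of the inputs:

```python
import numpy as np

def soft_attention(X: np.ndarray, q: np.ndarray):
    """Attend over N input vectors X (N x d) given a query q (d,).

    alpha_i = softmax_i(s(x_i, q)) with s taken as a dot product; returns the
    selection probabilities and the alpha-weighted context vector.
    """
    scores = X @ q                        # s(x_i, q) for every position i
    scores = scores - scores.max()        # subtract max for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha, alpha @ X

X = np.array([[1.0, 0.0],                # three 2-d input vectors
              [0.0, 1.0],
              [1.0, 1.0]])
q = np.array([2.0, 0.0])                 # query aligned with the first axis
alpha, context = soft_attention(X, q)
```

Positions whose vectors align with the query receive higher α, which is exactly the "selection probability" reading of z in the formula.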
6. The human-computer interaction system integrating visual question answering in a web3D environment according to claim 1,
characterized in that the visual question-answering model is trained on picture and text information from the education process, so that it has visual perception and semantic recognition capabilities for education scenes.
7. The human-computer interaction system integrating visual question answering in a web3D environment according to any one of claims 1-6,
characterized in that the 3D education scene displayed at the web end can be switched freely.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911099861.9A CN110851760B (en) | 2019-11-12 | 2019-11-12 | Human-computer interaction system for integrating visual question answering in web3D environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110851760A true CN110851760A (en) | 2020-02-28 |
CN110851760B CN110851760B (en) | 2022-12-27 |
Family
ID=69600399
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911099861.9A Active CN110851760B (en) | 2019-11-12 | 2019-11-12 | Human-computer interaction system for integrating visual question answering in web3D environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110851760B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111459283A (en) * | 2020-04-07 | 2020-07-28 | 电子科技大学 | Man-machine interaction implementation method integrating artificial intelligence and Web3D |
CN112463936A (en) * | 2020-09-24 | 2021-03-09 | 北京影谱科技股份有限公司 | Visual question answering method and system based on three-dimensional information |
CN112873211A (en) * | 2021-02-24 | 2021-06-01 | 清华大学 | Robot man-machine interaction method |
CN112926655A (en) * | 2021-02-25 | 2021-06-08 | 电子科技大学 | Image content understanding and visual question and answer VQA method, storage medium and terminal |
CN113010656A (en) * | 2021-03-18 | 2021-06-22 | 广东工业大学 | Visual question-answering method based on multi-mode fusion and structural control |
CN113837259A (en) * | 2021-09-17 | 2021-12-24 | 中山大学附属第六医院 | Modal-interactive, pictorial-and-attention-fused education video question-answering method and system |
CN114398471A (en) * | 2021-12-24 | 2022-04-26 | 哈尔滨工程大学 | Visual question-answering method based on deep reasoning attention mechanism |
CN114840697A (en) * | 2022-04-14 | 2022-08-02 | 山东大学 | Visual question answering method and system of cloud service robot |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170124432A1 (en) * | 2015-11-03 | 2017-05-04 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering |
CN107016170A (en) * | 2017-03-14 | 2017-08-04 | 上海大学 | A kind of LED lamp three-dimensional customization emulation mode based on WebGL |
US20180130259A1 (en) * | 2016-06-15 | 2018-05-10 | Dotty Digital Pty Ltd | System, Device or Method for Collaborative Augmented Reality |
CN108549658A (en) * | 2018-03-12 | 2018-09-18 | 浙江大学 | A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree |
CN109902166A (en) * | 2019-03-12 | 2019-06-18 | 北京百度网讯科技有限公司 | Vision Question-Answering Model, electronic equipment and storage medium |
KR20190092043A (en) * | 2018-01-30 | 2019-08-07 | 연세대학교 산학협력단 | Visual Question Answering Apparatus for Explaining Reasoning Process and Method Thereof |
CN110196717A (en) * | 2019-06-22 | 2019-09-03 | 中国地质大学(北京) | A kind of Web3D internet exchange platform and its building method |
CN110309850A (en) * | 2019-05-15 | 2019-10-08 | 山东省计算中心(国家超级计算济南中心) | Vision question and answer prediction technique and system based on language priori problem identification and alleviation |
CN110377710A (en) * | 2019-06-17 | 2019-10-25 | 杭州电子科技大学 | A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion |
Non-Patent Citations (5)
Title |
---|
ABACHA A B, HASAN S A, DATLA V V, ET AL.: "VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019", 《LECTURE NOTES IN COMPUTER SCIENCE, 2019》 *
C.A.R. HOARE: "Viewpoint: Retrospective: An axiomatic basis for computer programming", 《COMMUNICATIONS OF THE ACM》 *
IQBAL CHOWDHURY; KIEN NGUYEN; CLINTON FOOKES; SRIDHA SRIDHARAN: "A cascaded long short-term memory (LSTM) driven generic visual question answering (VQA)", 《2017 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP)》 *
YI LIAN, LONG HE, JINSONG PING ET AL.: "Research and implementation on the WEB3D visualization of digital moon based on WebGL", 《2017 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS)》 *
WANG YILEI ET AL.: "Question answering algorithm for fragmented image information based on deep neural networks", 《JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT》 *
Also Published As
Publication number | Publication date |
---|---|
CN110851760B (en) | 2022-12-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110851760B (en) | Human-computer interaction system for integrating visual question answering in web3D environment | |
CN110163299B (en) | Visual question-answering method based on bottom-up attention mechanism and memory network | |
US11907637B2 (en) | Image processing method and apparatus, and storage medium | |
CN110377710B (en) | Visual question-answer fusion enhancement method based on multi-mode fusion | |
CN109829541A (en) | Deep neural network incremental training method and system based on learning automaton | |
CN109255359B (en) | Visual question-answering problem solving method based on complex network analysis method | |
CN109919221B (en) | Image description method based on bidirectional double-attention machine | |
CN111598118B (en) | Visual question-answering task implementation method and system | |
CN110866542A (en) | Depth representation learning method based on feature controllable fusion | |
CN117055724B (en) | Working method of generating teaching resource system in virtual teaching scene | |
CN112530218A (en) | Many-to-one accompanying intelligent teaching system and teaching method | |
CN112070040A (en) | Text line detection method for video subtitles | |
Zhang et al. | Teaching chinese sign language with a smartphone | |
Hui et al. | A systematic approach for English education model based on the neural network algorithm | |
CN113591988A (en) | Knowledge cognitive structure analysis method, system, computer equipment, medium and terminal | |
Yan | Computational methods for deep learning: theory, algorithms, and implementations | |
CN110826510A (en) | Three-dimensional teaching classroom implementation method based on expression emotion calculation | |
CN116541507A (en) | Visual question-answering method and system based on dynamic semantic graph neural network | |
CN112036546B (en) | Sequence processing method and related equipment | |
CN115168722A (en) | Content interaction prediction method and related equipment | |
Yang et al. | The Application of Interactive Humanoid Robots in the History Education of Museums Under Artificial Intelligence | |
Zhu et al. | Emotion Recognition in Learning Scenes Supported by Smart Classroom and Its Application. | |
Nunes | Deep emotion recognition through upper body movements and facial expression | |
CN117540024B (en) | Classification model training method and device, electronic equipment and storage medium | |
Senevirathne et al. | Imagibot–an image recognition chatbot for sri lankan ancient places |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||