CN110851760A - Human-computer interaction system for integrating visual question answering in web3D environment - Google Patents

Human-computer interaction system for integrating visual question answering in web3D environment

Info

Publication number
CN110851760A
CN110851760A (application CN201911099861.9A)
Authority
CN
China
Prior art keywords
model
visual question
education
answering
web3d
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911099861.9A
Other languages
Chinese (zh)
Other versions
CN110851760B (en)
Inventor
谢宁
孔文喆
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201911099861.9A priority Critical patent/CN110851760B/en
Publication of CN110851760A publication Critical patent/CN110851760A/en
Application granted granted Critical
Publication of CN110851760B publication Critical patent/CN110851760B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G09EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09BEDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00Electrically-operated educational appliances
    • G09B5/08Electrically-operated educational appliances providing for individual presentation of information to a plurality of student stations
    • G09B5/14Electrically-operated educational appliances providing for individual presentation of information to a plurality of student stations with provision for individual teacher-student communication
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of VR education and discloses a human-computer interaction system that integrates visual question answering in a web3D environment, solving the problems of a single interaction mode, low interactivity, and lack of engagement in traditional multimedia teaching. The system comprises a web end and a server end connected through a Socket. The web end uses WebGL technology: an education scene model made with a modeling tool is imported through a model loader and the scene is rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end acquires the user's question and picture input information, transmits them to the server end, and obtains the server end's answer for display in the 3D education scene together with interactive animations. The server end hosts a visual question-answering model for the education domain; after receiving the question and picture input information transmitted by the web end, it obtains the corresponding answer with the visual question-answering model and feeds it back to the web end.

Description

Human-computer interaction system for integrating visual question answering in web3D environment
Technical Field
The invention relates to the field of VR education, and in particular to a human-computer interaction system integrating visual question answering in a web3D environment.
Background
Thanks to the development of educational informatization and intelligent hardware, human-computer interactive learning systems based on multimedia technology have been widely applied to teaching. However, current multimedia teaching has the following disadvantages: (1) the interaction mode is single, and students can only operate through a mouse and keyboard; (2) the interactive interface design is monotonous and uninteresting, making it hard to attract students' attention.
Research by educators and psychologists at home and abroad shows that, apart from intelligence, students' level of interest in learning is one of the most important factors affecting their learning outcomes. Therefore, if teaching can be improved from the human-computer interaction perspective, students' interest can be raised, which in turn brings an improvement in learning efficiency.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a human-computer interaction system integrating visual question answering in a web3D environment, solving the problems of a single interaction mode, low interactivity, and lack of engagement in traditional multimedia teaching.
The technical scheme adopted by the invention for solving the technical problems is as follows:
The human-computer interaction system integrating visual question answering in the web3D environment comprises a web end and a server end connected through a Socket. The web end uses WebGL technology: an education scene model made with a modeling tool is imported through a model loader and the scene is rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end acquires the user's question and picture input information, transmits them to the server end, and obtains the server end's answer for display in the 3D education scene together with interactive animations. The server end hosts a visual question-answering model for the education domain; after receiving the question and picture input information transmitted by the web end, it obtains the corresponding answer with the visual question-answering model and feeds it back to the web end.
As a further optimization, the web end acquires the user's questions through a microphone and the picture input information through a camera.
As a further optimization, the visual question-answering model comprises a question encoding module, an image feature extraction module, a feature fusion module, and a decoder module. The question encoding module encodes the question with an LSTM network; the image feature extraction module extracts feature information from the picture with a Faster R-CNN model; the feature fusion module fuses and reasons over the question encoding information and the picture feature information based on an attention mechanism; and the decoder module decodes the fused, reasoned information output by the feature fusion module into the corresponding answer.
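For illustration only, the four modules can be sketched end-to-end in PyTorch as follows; the class layout, the dimensions, and the choice of a classification head over candidate answers as the decoder are assumptions of this sketch, not the patent's reference implementation:

```python
import torch
import torch.nn as nn

class VQAModel(nn.Module):
    """Sketch of the four-module pipeline: question encoding (LSTM),
    image region features (e.g. from Faster R-CNN), attention fusion, decoder."""
    def __init__(self, vocab_size, n_answers, emb=300, hid=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.q_enc = nn.LSTM(emb, hid, batch_first=True)   # question encoding module
        self.score = nn.Linear(hid + img_dim, 1)           # attention scoring s(x_i, q)
        self.decode = nn.Linear(hid + img_dim, n_answers)  # decoder as classifier head

    def forward(self, q_tokens, img_feats):
        # img_feats: (B, N, img_dim) region features from the image module
        _, (h, _) = self.q_enc(self.embed(q_tokens))
        q = h[-1]                                           # (B, hid) question encoding
        n = img_feats.size(1)
        pair = torch.cat([q.unsqueeze(1).expand(-1, n, -1), img_feats], dim=-1)
        alpha = torch.softmax(self.score(pair).squeeze(-1), dim=1)  # soft attention
        ctx = (alpha.unsqueeze(-1) * img_feats).sum(dim=1)  # attended image feature
        return self.decode(torch.cat([q, ctx], dim=-1))     # answer distribution
```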
As a further optimization, the visual question-answering model is trained with picture information and text information from the education process, so that the visual question-answering model has visual perception and semantic recognition functions for education scenes.
As a further optimization, the 3D education scenes displayed by the web end can be switched at will.
As a further optimization, the image feature extraction module uses a Faster R-CNN model to extract feature information of the picture, specifically: the Fast R-CNN model first performs convolution over the whole image, then fuses the candidate regions recommended by the selective search algorithm with the feature map computed by the convolutional network through the region-of-interest pooling layer to obtain the feature vector of each candidate region, and then classifies and regresses the feature vectors.
As a further optimization, the feature fusion module fuses and reasons over the question encoding information and the picture feature information with a soft attention mechanism, which is as follows:
Let X = [x_1, x_2, x_3, ..., x_N] denote the input information, and let the attention variable z ∈ [1, N] be an index over positions, where the value of z indicates the position of the selected information. Given q and X, the probability a_i that any position i is selected is computed as

a_i = p(z = i | X, q) = softmax(s(x_i, q)) = exp(s(x_i, q)) / Σ_{j=1}^{N} exp(s(x_j, q))

where s(x_i, q) is a scoring function that computes the attention value at that position.
The invention has the beneficial effects that:
(1) Visual question-answering technology is combined with Web3D technology to design an AI for a future-education Web3D application system: an intelligent "little teacher" with multi-perception fusion of visual recognition and semantic understanding, which can react sensitively and communicate appropriately when interacting with learners.
(2) The adopted visual question-answering model is built for the education domain; its data set covers most of the picture and text information arising in the education process, so the trained model fits the education setting closely.
(3) WebGL technology is used to construct the three-dimensional education scene at the web end, so users can browse the required content more intuitively, with a feeling of being personally on the scene, which further enhances the liveliness and interactivity of the project.
Therefore, compared with traditional multimedia teaching, the human-computer interaction realized by the invention first improves interactivity, allowing the learner to communicate one-to-one with the system, which raises the learner's attention and immersion; second, the system can automatically perform visual-semantic understanding of the images and language given by the learner, and can thus feed back insights and answers on different knowledge to the learner, improving learning efficiency.
Drawings
FIG. 1 is a schematic diagram of the human-computer interaction of the present invention incorporating visual question answering in a web3D environment;
FIG. 2 is a flow diagram of the Fast R-CNN algorithm;
FIG. 3 is a diagram of the structure of an LSTM memory cell.
Detailed Description
The invention aims to provide a human-computer interaction system integrating visual question answering in a web3D environment, solving the problems of a single interaction mode, low interactivity, and lack of engagement in traditional multimedia teaching. Using WebGL technology, a scene model made with a modeling tool is imported in the browser at the web end through a model loader and the scene is rendered by a renderer, finally displaying a large-scale VR world. A visual question-answering model suited to the education domain is deployed at the server end; it fuses and reasons over features with an attention mechanism, obtains and answers the learner's questions, and feeds the answers back into the three-dimensional education scene at the web end, realizing intelligent web3D + AI human-computer interaction.
The human-computer interaction system integrating visual question answering in a web3D environment comprises a web end and a server end connected through a Socket. The web end uses WebGL technology: an education scene model made with a modeling tool is imported through a model loader and the scene is rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end acquires the user's question and picture input information, transmits them to the server end, and obtains the server end's answer for display in the 3D education scene together with interactive animations. The server end hosts a visual question-answering model for the education domain; after receiving the question and picture input information transmitted by the web end, it obtains the corresponding answer with the visual question-answering model and feeds it back to the web end. In a specific implementation, the visual question-answering model includes a question encoding module, an image feature extraction module, a feature fusion module, and a decoder module: the question encoding module encodes the question with an LSTM network; the image feature extraction module extracts feature information from the picture with a Faster R-CNN model; the feature fusion module fuses and reasons over the question encoding and the picture features based on an attention mechanism; and the decoder module decodes the fused, reasoned information output by the feature fusion module into the corresponding answer.
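Before detailing the model, the web end/server end split can be made concrete with a minimal sketch of the server side using Python's standard socketserver module. The one-JSON-line-per-request framing, the field names, and the answer_question helper are assumptions for illustration, not the patent's protocol:

```python
import json
import socketserver

def answer_question(question: str, image_bytes: bytes) -> str:
    """Placeholder for inference with the visual question-answering model."""
    return "answer produced by the VQA model"

class VQAHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Expect one JSON line per request: {"question": "...", "image_hex": "..."}
        request = json.loads(self.rfile.readline().decode("utf-8"))
        image = bytes.fromhex(request["image_hex"])
        answer = answer_question(request["question"], image)
        self.wfile.write((json.dumps({"answer": answer}) + "\n").encode("utf-8"))

if __name__ == "__main__":
    with socketserver.TCPServer(("0.0.0.0", 9000), VQAHandler) as server:
        server.serve_forever()  # the web end connects to this socket
```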
To establish 3D visual perception and interaction, the core of the system is to build a VQA (visual question answering) model for the education domain. The VQA model takes as input a picture and a free-form, open-ended natural language question about the picture, and generates a natural language answer as output. The image feature extraction, question encoding, and feature fusion and reasoning parts of the model are realized as follows:
a. image feature extraction:
the Faster and better target detection tool is realized by the Faster and better target detection tool realized by anyone of the Microsoft institute of research, such as Neplus, Hosiemin, Ross Girshick and Sunware on the basis of Fast R-CNN. The Faster R-CNN algorithm realizes the real end-to-end target detection and calculation process, and is mainly divided into three parts: 1) a convolutional neural network; 2) regional recommendation Network (RPN); 3) fast R-CNN target detection network. The algorithm still continues the idea that the R-CNN firstly carries out regional recommendation and then carries out classification, but the regional recommendation task by the convolutional neural network is successfully realized, and no additional algorithm is needed to be used for independently operating. The RPN and Fast R-CNN share a convolutional neural network for feature extraction, so that the convolution calculation times are reduced, and the speed of the whole algorithm is improved. Fast R-CNN is to perform convolution calculation on the whole image, then fuse a candidate region recommended by a selective search algorithm with a feature mapping image calculated by a convolution network through a region-of-interest Pooling Layer (RoI Pooling Layer) to obtain a feature vector corresponding to the candidate region, and the frequency of convolution calculation is greatly reduced by the operation of sharing convolution calculation. And the dimensions of the feature vectors are uniform, so that the subsequent classification work is facilitated.
Fast R-CNN was inspired by SPP-Net: its region-of-interest pooling layer fuses the convolutional features with the candidate region borders and pools them to obtain the feature vector of the corresponding region, which is equivalent to a special case of SPP-Net's spatial pyramid pooling layer with only one pyramid level. In addition, to train better and faster, Fast R-CNN uses several methods, two of which are important: multi-task training and mini-batch sampling.
1) Region of interest pooling layer:
the region-of-interest pooling layer transforms the features in each active candidate region into a feature vector of fixed size W x H using a max pooling operation. The region of interest here refers to a rectangular window in the convolution signature, and in FastR-CNN is the segmented region calculated by the selective search algorithm. Each region of interest is represented by a quaternion vector (x, y, w, h), where (x, y) represents the coordinates of the upper left corner and (w, h) represents the height and width of the rectangular window. The region of interest pooling layer divides a window of interest of size W H into W x H mesh sub-windows, each of which is about (W/W) x (H/H), and then pools the eigenvalues in each sub-window maximally into the corresponding output mesh. This applies to each feature channel, as is the standard max pooling operation.
2) Multi-task training:
the multi-task training refers to that the target classification and the regression calculation of the frame of the candidate region are simultaneously used as two parallel output layers, and the training of a classifier SVM and the calculation of the regression amount are not divided into different stages in R-CNN. The first task outputs the probability distribution of each interested area on the K +1 class (wherein K is the category of the data set, and the category is added with background), and the probability is calculated by using a softmax function; the second task is to compute t for the bounding boxk=(tx k,ty k,th k,tw k) The regression offset, parameter K, represents K classes. Using k which is still defined in R-CNNt. Combining two tasks, and performing combined training classification and candidate region frame regression calculation by using a multi-task loss function:
L(p,u,tu,v)=Ldx(p,u)+λ[u≥1]Lloc(tu,v)
the parameter u marks the actual category of the target candidate region content, and normally u ≧ 1, and if u ≧ 0 indicates that the region content is the background;Lcls(p,u)=-logpuIs the probability log-loss function for class u; l isloc(tuV) is the bounding box position penalty function calculated by smoothing the L1 penalty function, tuIs the bounding box for class u prediction; everson brackets [ u is more than or equal to 1]If the mark meets the condition u is more than or equal to 1 in the square brackets, the mark is 1, otherwise, the mark is 0; the parameter λ controls the balance between the two loss functions, and is therefore set to 1 in all experiments, since the contribution of the two loss functions is equally important.
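The combined loss can be transcribed directly; the following PyTorch sketch is illustrative, with tensor shapes and the helper name assumed rather than taken from the patent:

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_scores, bbox_pred, u, v, lam=1.0):
    """L(p, u, t^u, v) = L_cls(p, u) + lam * [u >= 1] * L_loc(t^u, v).

    cls_scores: (B, K+1) raw class scores; bbox_pred: (B, K+1, 4) per-class
    boxes; u: (B,) true class indices (0 = background); v: (B, 4) true boxes."""
    l_cls = F.cross_entropy(cls_scores, u)              # -log p_u via softmax
    t_u = bbox_pred[torch.arange(u.numel()), u]         # t^u: boxes for class u
    fg = (u >= 1).float().unsqueeze(1)                  # Iverson bracket [u >= 1]
    l_loc = (fg * F.smooth_l1_loss(t_u, v, reduction="none")).sum() / u.numel()
    return l_cls + lam * l_loc                          # lam = 1 in all experiments
```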
3) Mini-batch learning:
Mini-batch learning is used to obtain better gradient back-propagation. A convolutional neural network updates its parameters by gradient descent during back-propagation. During training, the entire data set can be fed to the network model at once, with the gradient of each iterative update computed over all samples: this is the traditional (batch) gradient descent method. Alternatively, the update gradient can be computed from a single sample at a time, which is called stochastic gradient descent, also known as online learning. Learning over the entire data set converges more accurately to the location of the extremum and needs fewer iterations, but each iteration takes a long time; for equal computational effort, training on the whole data set converges more slowly than training on a small number of samples. Training on single samples converges quickly per step, but each iteration corrects the parameters toward the gradient of the current sample alone; successive corrections point in different directions and partially cancel each other, introducing a large amount of noise that degrades performance and makes convergence difficult. Therefore a compromise, mini-batch learning, is used to find a balance point between the two extremes. Fast R-CNN sets the mini-batch size to 128, sampling 64 regions of interest from each of two original images for joint training.
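That sampling scheme can be sketched as follows, assuming a hypothetical mapping from images to their candidate regions of interest (each image is assumed to have at least 64 RoIs):

```python
import random

def sample_minibatch(rois_per_image, images, batch_size=128, images_per_batch=2):
    """Draw one Fast R-CNN style mini-batch: 64 RoIs from each of 2 images."""
    chosen = random.sample(images, images_per_batch)
    per_image = batch_size // images_per_batch      # 64 regions of interest each
    batch = [roi for img in chosen
             for roi in random.sample(rois_per_image[img], per_image)]
    return chosen, batch
```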
The algorithm flow of Fast R-CNN is shown in FIG. 2: the convolutional features are processed by the region-of-interest pooling layer, and the resulting features are fed into two parallel computing tasks, classification and localization regression, for training. With these methods and the improved framework, Fast R-CNN achieves better results than R-CNN with shorter training and testing times.
B. Text feature extraction:
for text selection, we use the LSTM network. The result of each learning of the ordinary recurrent neural network is related to the data at the current moment and the data at the previous moment. The special structure of the recurrent neural network makes full use of historical data, so that the recurrent neural network has obvious advantages in processing sequence problems. However, the recurrent neural network model has a gradient vanishing problem, that is, the farther the data belongs to the moment, the smaller the influence of the data on the weight change is, and finally, the training result is often dependent on the data at the closer moment, that is, the long-term memory of the historical data is lacked. Hochereiter et al originally proposed a long-short term memory network, LSTM was an optimized model of RNN, inherited most of the properties of RNN, and solved the problem of gradient disappearance during reverse transfer, after which LSTM was further improved and generalized by a.graves. LSTM has three more sets of controllers for long-short term memory compared to the original RNN: forget gate, input gate, output gate. An LSTM model memory cell structure is shown in FIG. 3.
The forgetting gate serves to selectively discard data in cell state C, the selection process being calculated using the following formula:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
Taking a language model as an example, in the process of predicting a new word, if the cell state includes properties of the subject of the previous sentence, those properties need to be forgotten under the new sentence structure. After the forget gate, the cell must choose how to update its state: a sigmoid layer (the input gate) decides which values to update, and a tanh layer generates a new candidate. This update process is implemented by the following formulas:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

Continuing the language model example, the properties of the new subject are then added to the cell state to replace the old information:

C_t = f_t * C_{t-1} + i_t * C̃_t
after the LSMT cell unit is refreshed, the output value needs to be calculated. The output value is determined by the cell state C, first by using the sigmoid layer to determine the position of the part to be output, then converting to a value between (-1,1) with tanh, and then multiplying with the sigmoid gate output.
The output process is represented by the formula ot=σ(Wo[ht,xt]+bo) Calculate, at the same time, ht=σt*tanh(ct)。
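Putting the three gates together, one LSTM step can be transcribed directly from the formulas above; this NumPy sketch is illustrative, with the weight layout (one matrix per gate over the concatenated [h_{t-1}, x_t]) an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of the LSTM cell described above. W['f'], W['i'], W['c'],
    W['o'] have shape (hidden, hidden + input); b[...] have shape (hidden,)."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])         # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])         # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])     # candidate values
    c_t = f_t * c_prev + i_t * c_tilde         # cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])         # output gate
    h_t = o_t * np.tanh(c_t)                   # output value
    return h_t, c_t
```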
C. Feature fusion and reasoning:
There are many ways to fuse the question feature and the image feature, such as multiplying or concatenating the two features. However, such fusion is insufficient: the relationship between the question and the image is complicated, and a simple operation may not fully exploit their interaction. Most current models treat VQA as a multi-class classification problem and answer with a single word; of course, a complete sentence can be answered with an RNN decoder model. In recent years, the attention mechanism has become a popular topic across many fields of deep learning. Take conversation between people as an example: if one person must answer another's question, the answer will not be good if words are missed, or if every word is heard with equal weight; the question must be answered according to its emphasis, and this emphasis is the so-called attention. We achieve feature fusion and reasoning through an attention mechanism. Here we adopt the soft attention mechanism, a model that is relatively easy to add to existing network structures.
Let X = [x_1, x_2, x_3, ..., x_N] denote the input information, and let the attention variable z ∈ [1, N] be an index over positions, where the value of z indicates the position of the selected information. Given q and X, the probability a_i that a position i is selected is computed as

a_i = p(z = i | X, q) = softmax(s(x_i, q)) = exp(s(x_i, q)) / Σ_{j=1}^{N} exp(s(x_j, q))

where s(x_i, q) is a scoring function that computes the attention value at that position.
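This computation transcribes directly into NumPy. The additive form s(x_i, q) = w · tanh(W_x x_i + W_q q) used below is one common choice of scoring function and is an assumption of the sketch; the patent only requires some scoring function s:

```python
import numpy as np

def soft_attention(X, q, W_x, W_q, w):
    """a_i = softmax(s(x_i, q)) over the N rows of X, plus the attended summary."""
    scores = np.tanh(X @ W_x.T + W_q @ q) @ w      # s(x_i, q) for every position i
    scores -= scores.max()                         # numerical stability
    a = np.exp(scores) / np.exp(scores).sum()      # softmax over positions
    return a, a @ X                                # weights and expected information

# Usage with toy shapes: N positions of dimension d, query of dimension dq
N, d, dq, hdim = 5, 8, 6, 16
rng = np.random.default_rng(0)
a, ctx = soft_attention(rng.normal(size=(N, d)), rng.normal(size=dq),
                        rng.normal(size=(hdim, d)), rng.normal(size=(hdim, dq)),
                        rng.normal(size=hdim))
```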
Based on the above, the invention establishes a data set belonging to the education domain on top of the existing VQA model, constructs the invention's feature extraction algorithm by combining the strengths of an image recognition algorithm and a text analysis algorithm, and fuses and reasons over the features through an attention mechanism, so that the resulting VQA model has visual recognition and semantic perception capabilities in the education process and can provide corresponding answers to the questioner's question and picture input.
The principle of human-computer interaction with visual question answering integrated in a web3D environment is shown in FIG. 1. A learner can load the 3D education scene directly in the browser at the web end. During learning, the learner can ask the AI teacher a question through a microphone while the camera acquires picture information, and the web end uploads the picture and question to the server through a socket. The server uses the VQA model to extract features from the picture, encodes the question, fuses and reasons over the picture and question features through the attention mechanism, and finally generates the corresponding answer through the decoder. The server feeds the answer back to the web end through the socket, and the answer is displayed in the 3D education scene together with the AI teacher's animation, realizing human-computer interaction.
In conclusion, the invention combines Web3D technology with visual question-answering technology: Web3D technology is used to design and realize a realistic, education-content-based world in the browser, while a deep learning network is used to build a visual question-answering model suited to this VR world; finally the model is fused with the Web3D technology, yielding an intelligent VR education project that integrates interaction, three dimensions, dynamics, and object recognition. Projects that previously could only be opened in other clients such as PCs and consoles are moved onto the new stage of the browser: no plug-in of any kind needs to be installed, and a good project experience is obtained simply by opening a web page in the browser.

Claims (7)

1. A human-computer interaction system integrating visual question answering in a web3D environment, characterized in that
it comprises: a web end and a server end connected through a Socket; the web end uses WebGL technology: an education scene model made with a modeling tool is imported through a model loader and the scene is rendered by a renderer, so that a 3D education scene is displayed in the browser; the web end acquires the user's question and picture input information, transmits them to the server end, and obtains the server end's answer for display in the 3D education scene together with interactive animations; and the server end hosts a visual question-answering model for the education domain, and after receiving the question and picture input information transmitted by the web end, obtains the corresponding answer with the visual question-answering model and feeds it back to the web end.
2. The human-computer interaction system of claim 1 with visual question-answering integrated in a web3D environment,
characterized in that the web end acquires the user's questions through a microphone and the picture input information through a camera.
3. The human-computer interaction system of claim 1 with visual question-answering integrated in a web3D environment,
wherein the visual question-answering model comprises: a question encoding module, an image feature extraction module, a feature fusion module and a decoder module; the question encoding module encodes the question with an LSTM network; the image feature extraction module extracts feature information of the picture with a Faster R-CNN model; the feature fusion module fuses and reasons over the question encoding information and the picture feature information based on an attention mechanism; and the decoder module decodes the fused and reasoned information output by the feature fusion module into the corresponding answer.
4. The human-computer interaction system of claim 3 that incorporates visual question answering in a web3D environment,
characterized in that the image feature extraction module uses a Faster R-CNN model to extract feature information of the picture, specifically: the Fast R-CNN model first performs convolution over the whole image, then fuses the candidate regions recommended by the selective search algorithm with the feature map computed by the convolutional network through the region-of-interest pooling layer to obtain the feature vector of each candidate region, and then classifies and regresses the feature vectors.
5. The human-computer interaction system of claim 3 that incorporates visual question answering in a web3D environment,
characterized in that the feature fusion module fuses and reasons over the question encoding information and the picture feature information with a soft attention mechanism, the soft attention mechanism comprising:
letting X = [x_1, x_2, x_3, ..., x_N] denote the input information, with the attention variable z ∈ [1, N] an index over positions whose value indicates the position of the selected information; given q and X, the probability a_i that a position is selected is computed as
a_i = p(z = i | X, q) = softmax(s(x_i, q)) = exp(s(x_i, q)) / Σ_{j=1}^{N} exp(s(x_j, q))
where s(x_i, q) is a scoring function that computes the attention value at that position.
6. The human-computer interaction system of claim 1 with visual question-answering integrated in a web3D environment,
characterized in that the visual question-answering model is trained with picture information and text information from the education process, so that the visual question-answering model has visual perception and semantic recognition functions for education scenes.
7. The human-computer interaction system of any one of claims 1-6 for integrating visual question answering in a web3D environment,
characterized in that the 3D education scenes displayed by the web end can be switched freely.
CN201911099861.9A 2019-11-12 2019-11-12 Human-computer interaction system for integrating visual question answering in web3D environment Active CN110851760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911099861.9A CN110851760B (en) 2019-11-12 2019-11-12 Human-computer interaction system for integrating visual question answering in web3D environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911099861.9A CN110851760B (en) 2019-11-12 2019-11-12 Human-computer interaction system for integrating visual question answering in web3D environment

Publications (2)

Publication Number Publication Date
CN110851760A true CN110851760A (en) 2020-02-28
CN110851760B CN110851760B (en) 2022-12-27

Family

ID=69600399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911099861.9A Active CN110851760B (en) 2019-11-12 2019-11-12 Human-computer interaction system for integrating visual question answering in web3D environment

Country Status (1)

Country Link
CN (1) CN110851760B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111459283A (en) * 2020-04-07 2020-07-28 电子科技大学 Man-machine interaction implementation method integrating artificial intelligence and Web3D
CN112463936A (en) * 2020-09-24 2021-03-09 北京影谱科技股份有限公司 Visual question answering method and system based on three-dimensional information
CN112873211A (en) * 2021-02-24 2021-06-01 清华大学 Robot man-machine interaction method
CN112926655A (en) * 2021-02-25 2021-06-08 电子科技大学 Image content understanding and visual question and answer VQA method, storage medium and terminal
CN113010656A (en) * 2021-03-18 2021-06-22 广东工业大学 Visual question-answering method based on multi-mode fusion and structural control
CN113837259A (en) * 2021-09-17 2021-12-24 中山大学附属第六医院 Modal-interactive, pictorial-and-attention-fused education video question-answering method and system
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 Visual question-answering method based on deep reasoning attention mechanism
CN114840697A (en) * 2022-04-14 2022-08-02 山东大学 Visual question answering method and system of cloud service robot

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
CN107016170A (en) * 2017-03-14 2017-08-04 上海大学 A kind of LED lamp three-dimensional customization emulation mode based on WebGL
US20180130259A1 (en) * 2016-06-15 2018-05-10 Dotty Digital Pty Ltd System, Device or Method for Collaborative Augmented Reality
CN108549658A (en) * 2018-03-12 2018-09-18 浙江大学 A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium
KR20190092043A (en) * 2018-01-30 2019-08-07 연세대학교 산학협력단 Visual Question Answering Apparatus for Explaining Reasoning Process and Method Thereof
CN110196717A (en) * 2019-06-22 2019-09-03 中国地质大学(北京) A kind of Web3D internet exchange platform and its building method
CN110309850A (en) * 2019-05-15 2019-10-08 山东省计算中心(国家超级计算济南中心) Vision question and answer prediction technique and system based on language priori problem identification and alleviation
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
US20180130259A1 (en) * 2016-06-15 2018-05-10 Dotty Digital Pty Ltd System, Device or Method for Collaborative Augmented Reality
CN107016170A (en) * 2017-03-14 2017-08-04 上海大学 A kind of LED lamp three-dimensional customization emulation mode based on WebGL
KR20190092043A (en) * 2018-01-30 2019-08-07 연세대학교 산학협력단 Visual Question Answering Apparatus for Explaining Reasoning Process and Method Thereof
CN108549658A (en) * 2018-03-12 2018-09-18 浙江大学 A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium
CN110309850A (en) * 2019-05-15 2019-10-08 山东省计算中心(国家超级计算济南中心) Vision question and answer prediction technique and system based on language priori problem identification and alleviation
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
CN110196717A (en) * 2019-06-22 2019-09-03 中国地质大学(北京) A kind of Web3D internet exchange platform and its building method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ABACHA A B, HASAN S A, DATLA V V, et al.: "VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019", Lecture Notes in Computer Science, 2019 *
C.A.R. HOARE: "Viewpoint: Retrospective: An axiomatic basis for computer programming", Communications of the ACM *
IQBAL CHOWDHURY; KIEN NGUYEN; CLINTON FOOKES; SRIDHA SRIDHARAN: "A cascaded long short-term memory (LSTM) driven generic visual question answering (VQA)", 2017 IEEE International Conference on Image Processing (ICIP) *
YI LIAN, LONG HE, JINSONG PING, et al.: "Research and implementation on the WEB3D visualization of digital moon based on WebGL", 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) *
WANG Yilei et al.: "Image fragmented-information question answering algorithm based on deep neural networks", Journal of Computer Research and Development *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111459283A (en) * 2020-04-07 2020-07-28 电子科技大学 Man-machine interaction implementation method integrating artificial intelligence and Web3D
CN112463936A (en) * 2020-09-24 2021-03-09 北京影谱科技股份有限公司 Visual question answering method and system based on three-dimensional information
CN112463936B (en) * 2020-09-24 2024-06-07 北京影谱科技股份有限公司 Visual question-answering method and system based on three-dimensional information
CN112873211B (en) * 2021-02-24 2022-03-11 清华大学 Robot man-machine interaction method
CN112873211A (en) * 2021-02-24 2021-06-01 清华大学 Robot man-machine interaction method
CN112926655A (en) * 2021-02-25 2021-06-08 电子科技大学 Image content understanding and visual question and answer VQA method, storage medium and terminal
CN112926655B (en) * 2021-02-25 2022-05-17 电子科技大学 Image content understanding and visual question and answer VQA method, storage medium and terminal
CN113010656A (en) * 2021-03-18 2021-06-22 广东工业大学 Visual question-answering method based on multi-mode fusion and structural control
CN113837259A (en) * 2021-09-17 2021-12-24 中山大学附属第六医院 Modal-interactive, pictorial-and-attention-fused education video question-answering method and system
CN113837259B (en) * 2021-09-17 2023-05-30 中山大学附属第六医院 Education video question-answering method and system for graph-note-meaning fusion of modal interaction
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 Visual question-answering method based on deep reasoning attention mechanism
CN114840697A (en) * 2022-04-14 2022-08-02 山东大学 Visual question answering method and system of cloud service robot
CN114840697B (en) * 2022-04-14 2024-04-26 山东大学 Visual question-answering method and system for cloud service robot

Also Published As

Publication number Publication date
CN110851760B (en) 2022-12-27

Similar Documents

Publication Publication Date Title
CN110851760B (en) Human-computer interaction system for integrating visual question answering in web3D environment
CN110163299B (en) Visual question-answering method based on bottom-up attention mechanism and memory network
US11907637B2 (en) Image processing method and apparatus, and storage medium
CN110377710B (en) Visual question-answer fusion enhancement method based on multi-mode fusion
CN109829541A (en) Deep neural network incremental training method and system based on learning automaton
CN109255359B (en) Visual question-answering problem solving method based on complex network analysis method
CN109919221B (en) Image description method based on bidirectional double-attention machine
CN111598118B (en) Visual question-answering task implementation method and system
CN110866542A (en) Depth representation learning method based on feature controllable fusion
CN117055724B (en) Working method of generating teaching resource system in virtual teaching scene
CN112530218A (en) Many-to-one accompanying intelligent teaching system and teaching method
CN112070040A (en) Text line detection method for video subtitles
Zhang et al. Teaching chinese sign language with a smartphone
Hui et al. A systematic approach for English education model based on the neural network algorithm
CN113591988A (en) Knowledge cognitive structure analysis method, system, computer equipment, medium and terminal
Yan Computational methods for deep learning: theory, algorithms, and implementations
CN110826510A (en) Three-dimensional teaching classroom implementation method based on expression emotion calculation
CN116541507A (en) Visual question-answering method and system based on dynamic semantic graph neural network
CN112036546B (en) Sequence processing method and related equipment
CN115168722A (en) Content interaction prediction method and related equipment
Yang et al. The Application of Interactive Humanoid Robots in the History Education of Museums Under Artificial Intelligence
Zhu et al. Emotion Recognition in Learning Scenes Supported by Smart Classroom and Its Application.
Nunes Deep emotion recognition through upper body movements and facial expression
CN117540024B (en) Classification model training method and device, electronic equipment and storage medium
Senevirathne et al. Imagibot–an image recognition chatbot for sri lankan ancient places

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant