CN112529054B - Multi-dimensional convolution neural network learner modeling method for multi-source heterogeneous data - Google Patents


Info

Publication number: CN112529054B (application CN202011355627.0A)
Authority: CN (China)
Other versions: CN112529054A (Chinese)
Inventors: 杨宗凯, 廖盛斌, 王小丰
Original and current assignee: Central China Normal University
Application filed by Central China Normal University; priority to CN202011355627.0A
Publication of application CN112529054A; application granted and published as CN112529054B
Legal status: Active (granted)
Prior art keywords: data, learner, layer, neural network, dimensional

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a multi-dimensional convolutional neural network learner modeling method for multi-source heterogeneous data. The method comprises: synchronously acquiring eye movement data, voice data and video data of a learner; preprocessing the eye movement data, voice data and video data; training a multi-dimensional convolutional neural network, and inputting the heat point diagram, spectrogram and human body posture image to be recognized into multi-dimensional convolutional neural networks of identical structure for feature extraction, obtaining three output classification results; and performing space-time multi-dimensional feature modeling analysis by combining the three output classification results. The invention models the learner with multi-source heterogeneous data and can fuse and analyze the learner's learning state from different data sources, a mode that better matches the nature of learning. The learner is modeled in an all-round, multi-dimensional way from characteristics such as emotion, cognition and interaction, so that the learner's real learning state can be represented.

Description

Multi-dimensional convolution neural network learner modeling method for multi-source heterogeneous data
Technical Field
The application relates to the technical field of education informatization, in particular to a multi-dimensional convolution neural network learner modeling method for multi-source heterogeneous data.
Background
With the development of education informatization, constructing a personalized learner model has become key to intelligent education, and big data and artificial intelligence technology are the basis for constructing such a model. Deep modeling research on learners can mine the latent information in the data and reveal the rules and mechanisms of learner emotion, cognition, knowledge construction patterns and the like, thereby further improving education services.
For learner data collection, researchers before the era of artificial intelligence tended to collect learners' basic personal information and behavior information. With the application and popularization of artificial intelligence technology and online learning platforms, people have started to collect all-around data about learners and to use it to restore, to the maximum extent, the learner's real learning state in the learning environment. Such all-around data includes video, audio, biological data, etc., where information from the brain, skin and heart is an important source of learners' biological data; the frequently used biological data include heart rate, electroencephalogram, electrodermal activity, etc.
Convolutional Neural Networks (CNNs) are a research hotspot in the field of artificial intelligence and the basis of computer vision. A convolutional neural network is a typical hierarchical structure that classifies and identifies data by automatically extracting features layer by layer. A 2D convolutional neural network is commonly used to classify static pictures, but a 2D network has certain limitations when extracting features from serialized data such as video and speech, because it cannot identify the temporal relationship between the items in a sequence. A 3D convolutional neural network can capture the correlation between sequence data, but because its convolution kernel has one more (time) dimension than a 2D kernel, the network has more parameters, calculation consumption increases, and calculation is slower.
Chinese patent application No. 201710049075.2 discloses an emotion classification method based on facial expressions, learning scores and voice data, used to evaluate how well students master a class. The method classifies multi-modal data with a convolutional neural network, where each datum is labeled dysphoric, pleased or calm, and finally fuses the classification results with a Gaussian mixture model to obtain a final result that predicts the student's emotion, from which the student's learning state is judged. However, the method has shortcomings: although multi-modal data is used to model students, the selected student characteristics are single; the degree of classroom mastery cannot be analyzed accurately from the perspective of student emotion alone, and deeper information in the data is not fully mined.
Chinese patent application No. 201910056952.8 discloses a method for modeling cognitive ability of a student and recommending personalized courses based on a cognitive diagnosis model, firstly, courses are modeled quantitatively, then, the cognitive ability of the student is modeled according to the learning condition of the student, and finally, personalized course recommendation is carried out on the student.
In summary, existing learner modeling methods mainly have the following disadvantages: 1. Modeling focuses on data of a single structure, which cannot accurately characterize the learner. 2. Attention is paid mostly to characteristics expressing the learner's intelligence, such as knowledge level and cognitive ability, while non-intelligence characteristics of the student, such as emotion and interaction, receive little attention; most existing research focuses on single characteristics, and comprehensive modeling analysis across multiple characteristics is scarce.
Disclosure of Invention
In order to solve the above problems, an embodiment of the present application provides a multi-dimensional convolutional neural network learner modeling method for multi-source heterogeneous data. The method integrates two-dimensional and three-dimensional convolution; compared with using three-dimensional convolution alone, it has fewer parameters, higher calculation speed and reduced calculation consumption. Meanwhile, the method takes multi-source heterogeneous data as input and can model the learner's characteristics, such as emotion, interaction and cognition, at multiple levels and from multiple angles, further improving the precision of learner modeling.
In a first aspect, an embodiment of the present application provides a multi-dimensional convolutional neural network learner modeling method for multi-source heterogeneous data, where the method includes:
(1) Synchronously acquiring eye movement data, voice data and video data of a learner;
specifically, the eye movement data of the learner is collected through an eye movement instrument, the voice data of the learner is collected through a microphone or a professional recording pen, the video data of the learner is collected through a camera, and the eye movement data, the voice data and the video data are collected synchronously.
(2) Preprocessing the eye movement data, the voice data and the video data to respectively obtain a heat point diagram corresponding to the eye movement data, a spectrogram corresponding to the voice data and a human body posture image corresponding to the video data;
specifically, the collected learner eye movement data, voice data and video data are preprocessed: a hotspot graph for each frame of the video is generated with professional software from the learner's eye movement data and video data; the video data is processed by extracting video frames and generating a human body posture image for each frame, yielding an image sequence; the voice data is processed by generating a spectrogram from the audio information. The resulting data are encoded in time order with the format frame000001, frame000002, …, so that the hotspot graphs, spectrograms and human body posture images correspond one to one.
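The time-ordered encoding and one-to-one alignment described above can be sketched as follows; the `frame_name` helper and the alignment table are hypothetical illustrations, since the patent does not specify an implementation:

```python
def frame_name(index: int, prefix: str = "frame", width: int = 6) -> str:
    """Zero-padded, time-ordered frame identifier, e.g. frame000001."""
    return f"{prefix}{index:0{width}d}"

# Hypothetical alignment table: one hotspot graph, spectrogram and posture
# image per time step, all sharing the same frame code.
aligned = [
    {"hotspot": frame_name(i), "spectrogram": frame_name(i), "pose": frame_name(i)}
    for i in range(1, 4)
]
```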
(3) Setting a label for the hotspot graph based on the cognitive state classification of the learner, setting a label for the spectrogram based on the interactive state classification of the learner, and setting a label for the human posture image based on the emotional state classification of the learner;
specifically, the hotspot graph corresponds to the cognitive state of the learner, and the labels are set to be difficult, non-participatory and easy; the spectrogram reflects the interaction level of a learner, and the labels of the spectrogram are set to be high-rising tone and low-depression tone according to the tone; the human body posture image represents the emotional state of the learner, and the label is set to be interested, confused, stressed, boring and relaxed.
(4) Training a multi-dimensional convolutional neural network, and respectively inputting the heat point diagram, the spectrogram and the human body posture image to be recognized into the multi-dimensional convolutional neural network with the same structure for feature extraction to respectively obtain output classification results;
(5) And performing space-time multi-dimensional feature modeling analysis by combining the three output classification results.
Preferably, the heat point diagram, the spectrogram and the human body posture image are one-to-one corresponding serialized data;
in the step (4), the feature extraction of the multi-dimensional convolutional neural network with the same input structure includes:
dividing the serialized data evenly into K segments {S_1, S_2, S_3, …, S_K} in time order;
randomly sampling N images from each segment of the serialized data with equal probability, and inputting the K*N images as input data into the multi-dimensional convolutional neural network, which processes them according to the following formulas:
m(T_1, T_2, T_3, …, T_K) = (f(T_1, W), f(T_2, W), f(T_3, W), …, f(T_K, W))
M(T_1, T_2, T_3, …, T_K) = H(F(m(T_1, T_2, T_3, …, T_K), W_1))
wherein m(T_1, T_2, T_3, …, T_K) represents the feature extraction of the data by the two-dimensional convolutional neural network; T_1, T_2, T_3, …, T_K are sequences of N pictures, each obtained by random sampling from the segments S_1, S_2, S_3, …, S_K respectively; f(T_K, W) is the two-dimensional convolution layer with parameter W; M(T_1, T_2, T_3, …, T_K) represents the final prediction result of the network; F(m(T_1, T_2, T_3, …, T_K), W_1) represents the feature extraction of the data by the three-dimensional convolutional neural network with parameter W_1; and H represents the SoftMax function.
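The segmentation and equal-probability sampling step above can be sketched as follows; this is a minimal illustration in which frames left over when the count is not divisible by K are simply dropped, a choice the patent does not specify:

```python
import random

def sparse_sample(frames, k, n, rng=None):
    """Split `frames` evenly into k time-ordered segments and draw n frames
    from each segment with equal probability, giving k*n frames in total."""
    rng = rng or random.Random(0)
    seg_len = len(frames) // k  # any remainder frames are dropped
    segments = [frames[i * seg_len:(i + 1) * seg_len] for i in range(k)]
    return [rng.sample(seg, n) for seg in segments]

samples = sparse_sample(list(range(90)), k=3, n=3)
```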
Preferably, the training of the multidimensional convolutional neural network in the step (4) includes:
forming a multi-dimensional convolutional neural network by a 2D network and a 3D network, the multi-dimensional convolutional neural network comprising an input layer, two-dimensional convolutional layers, three-dimensional convolutional layers, a maximum pooling layer, an average pooling layer, a BatchNorm layer, and a SoftMax classification layer, the BatchNorm layer following each convolutional layer;
the input layer inputs a sample heat point diagram, a sample spectrogram or a sample human body posture image into the two-dimensional convolutional layer and the maximum pooling layer to obtain static characteristics of input data;
the static features are expanded according to time dimension and then input into the three-dimensional convolutional layer and the maximum pooling layer, wherein the last pooling layer is an average pooling layer, and dynamic information of the input data is obtained;
calculating the error between the classification result output by the SoftMax classification layer and the actual class, back-propagating the error to compute the gradient of each layer's parameters, adjusting the parameters of each layer according to the gradients, and repeating the error back-propagation until the parameters of each layer reach a minimum of the classification output error, at which point iteration stops.
Preferably, the error is calculated by:
loss(x_i) = −l(x_i) · log(p(x_i))
wherein l(x_i) represents the tag value of the ith input data, p(x_i) is the predicted value obtained after the ith input data passes through the convolutional network, and loss(x_i) represents the loss function.
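As a minimal sketch, assuming the standard cross-entropy form used with a SoftMax classifier (the exact form in the patent's formula image is not reproduced here), the per-sample error can be computed as:

```python
import math

def cross_entropy(label_onehot, probs, eps=1e-12):
    """Cross-entropy between a one-hot tag l(x_i) and predicted SoftMax
    probabilities p(x_i): loss = -sum_c l_c * log(p_c)."""
    return -sum(l * math.log(p + eps) for l, p in zip(label_onehot, probs))
```

The loss approaches 0 for a confident correct prediction and grows as the predicted probability of the true class shrinks.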
Preferably, the method for calculating the gradient of each layer's parameters by back-propagating the calculated error includes:
δ^l = ∂L/∂W^l = (∂L/∂z^l) · (∂z^l/∂W^l),  W^l ← W^l − η · δ^l
wherein L represents the error obtained after training on sample data, W^l is the convolution kernel parameter of layer l, z^l represents the convolved output, a^l is the result after activating the convolution, δ^l represents the gradient of the error with respect to the convolution kernel parameters, and η is the learning rate.
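The update rule, parameter minus learning rate times gradient, can be illustrated on a toy one-dimensional problem; this is a sketch of gradient descent in general, not of the patent's network:

```python
def sgd_step(weights, grads, lr):
    """One gradient-descent update per layer: W <- W - eta * delta."""
    return [w - lr * g for w, g in zip(weights, grads)]

# Minimizing L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3):
w = 0.0
for _ in range(100):
    (w,) = sgd_step([w], [2 * (w - 3)], lr=0.1)
```

After enough steps, w converges to the minimizer 3, just as the layer parameters converge toward a minimum of the classification error.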
Preferably, the unfolding manner of unfolding the static features according to the time dimension is as follows:
[B*S,C,H,W]→[B,C,S,H,W]
where B denotes the batch size, S denotes the number of input images, and C, H, and W denote the number of channels and the height and width of the feature map, respectively.
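In NumPy, for example, this unfolding is a reshape that first recovers the batch and sequence axes, followed by an axis transpose; a flat reshape straight to [B, C, S, H, W] would scramble the data. The toy dimensions below are assumptions for illustration:

```python
import numpy as np

B, S, C, H, W = 2, 3, 96, 28, 28
x = np.arange(B * S * C * H * W, dtype=np.float32).reshape(B * S, C, H, W)

# [B*S, C, H, W] -> [B, S, C, H, W] -> [B, C, S, H, W]
y = x.reshape(B, S, C, H, W).transpose(0, 2, 1, 3, 4)
```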
Preferably, the performing spatiotemporal multidimensional feature modeling analysis by combining the three output classification results in the step (5) includes:
classifying the learner into three dimensions according to characteristics according to the output classification result, wherein the dimensions comprise a cognitive dimension, an interaction dimension and an emotion dimension;
sending suggestion information to the learner based on the classification result of the cognitive dimension;
sending interaction reminding information to the learner based on the classification result of the interaction dimension;
and sending learning difficulty adjustment information to the learner based on the classification result of the emotion dimension.
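The three feedback rules above can be sketched as a dispatch table; the action strings and the exact label-to-action pairs below are hypothetical illustrations, not fixed by the patent:

```python
# Hypothetical mapping from (dimension, classification result) to feedback.
FEEDBACK = {
    ("cognitive", "difficult"): "suggest reviewing prerequisite material",
    ("cognitive", "easy"): "suggest advancing to harder content",
    ("interaction", "low-depression tone"): "remind the learner to interact actively",
    ("emotion", "stressed"): "lower the learning difficulty",
    ("emotion", "bored"): "raise the learning difficulty",
}

def feedback_for(dimension: str, label: str) -> str:
    """Return the feedback action for a classification result, if any."""
    return FEEDBACK.get((dimension, label), "no action")
```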
Specifically, the inputs of the three networks are the hotspot graph, the spectrogram and the human body posture image respectively. The hotspot graph reflects the learner's cognitive process, and its labels are set to difficult, non-participatory and easy. Tone in voice data is a concrete reflection of inner mood and emotion, so the spectrogram labels are set to high-rising tone and low-depression tone, where a high-rising tone indicates that the learner has mastered the knowledge, is confident and interacts actively, while a low-depression tone indicates that the learner is confused about the knowledge and unwilling to interact. Different human postures reflect different emotional states, so the human body posture graph is labeled with five classes: interested, confused, stressed, bored and relaxed.
According to the final classification results of the three networks, space-time multi-dimensional modeling analysis is performed on the learner. The learner is divided into three dimensions by characteristic, namely the cognitive, interaction and emotion dimensions, forming a learner model oriented to cognition, interaction and emotion. In the cognitive dimension, the learner's cognitive development is analyzed by effectively extracting information from the eye movement data, and reasonable suggestions are given for the different classification results. In the interaction dimension, the student's interaction is analyzed through the voice data and the learner is reminded to interact actively. In the emotion dimension, the learner's emotional changes over a period of time are analyzed and the learning difficulty is adjusted according to the emotional state. Through this multi-dimensional, comprehensive modeling analysis, the learner's internal cognitive structure is accurately represented, providing support for formulating precise teaching strategies.
The invention has the beneficial effects that: (1) The learner is modeled with multi-source heterogeneous data, so the learning state can be fused and analyzed from different data sources, a mode more consistent with the nature of learning.
(2) The learner is modeled in an all-round, multi-dimensional way from characteristics such as emotion, cognition and interaction, so that the learner's real learning state can be represented.
(3) The convolutional neural network is constructed by fusing two-dimensional and three-dimensional convolution, so that three-dimensional convolution can extract features of the data in the time dimension, while adding two-dimensional convolution to the network effectively reduces training time and calculation overhead.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a schematic flowchart of a multi-dimensional convolutional neural network learner modeling method for multi-source heterogeneous data according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating an example of a multidimensional convolution network according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a 2D network according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a 3D network provided in an embodiment of the present application;
fig. 5 is an exemplary schematic diagram of a subnetwork Inc in a 2D network according to an embodiment of the present application;
fig. 6 is a schematic diagram illustrating an example of a learner modeling process according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
In the following description, the terms "first" and "second" are used for descriptive purposes only and are not intended to indicate or imply relative importance. The following description provides embodiments of the invention, which may be combined with or substituted for one another, and the invention is thus to be construed as embracing all possible combinations of the embodiments described. Thus, if one embodiment includes features A, B and C and another embodiment includes features B and D, the invention should also be considered to include embodiments containing one or more of all other possible combinations of A, B, C and D, even though such embodiments may not be explicitly recited below.
The following description provides examples, and does not limit the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements described without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as appropriate. For example, the described methods may be performed in an order different than the order described, and various steps may be added, omitted, or combined. Furthermore, features described with respect to some examples may be combined into other examples.
The technical idea of the application is as follows: eye movement data, voice data and video data of a learner are collected through devices such as an eye movement instrument, a microphone and a camera, and the corresponding hotspot graph, spectrogram and human body posture image are generated. A deep convolutional neural network is constructed by fusing two-dimensional and three-dimensional convolution: two-dimensional convolution first extracts static features of the input data in the spatial dimension, and three-dimensional convolution then extracts dynamic features from the static features in the time and spatial dimensions. The three convolutional neural networks run independently, and the resulting classification results are used for space-time multi-dimensional modeling analysis, on the basis of which opinions or suggestions are proposed.
Referring to fig. 1, fig. 1 is a schematic flowchart of a multi-dimensional convolutional neural network learner modeling method for multi-source heterogeneous data according to an embodiment of the present application.
Illustratively, the method comprises the steps of:
(1) The voice data, the video data and the eye movement data of the learner are acquired by utilizing equipment such as a microphone, a camera, an eye movement instrument and the like, and all data are acquired synchronously.
(2) The collected voice data, video data and eye movement data are converted into a spectrogram, a human body posture graph and a hotspot graph respectively, the image size is set to 224 × 224, and all images are encoded sequentially in time order.
(3) And sampling the obtained data by using a sparse sampling strategy, and inputting the obtained image data with a fixed size into the three multi-dimensional convolutional neural networks as input values, wherein the structures of the networks are the same, and refer to the attached figure 2.
The structure of the multidimensional convolutional neural network of the embodiment of the invention is shown in fig. 2. The network is composed of 2DNets and 3DNets. The structure of the 2DNets network is shown in fig. 3: 2DNets is composed of two convolutional layers and Inc modules, and the structure of the Inc module is shown in fig. 5. The 3D network structure, shown in fig. 4, is a residual structure composed of 12 three-dimensional convolutions, since shortcut connections and residual networks can effectively avoid the degradation problem. The input of the multidimensional convolutional network is composed of pictures obtained by sparse sampling: the data is first divided into 3 segments in time order, then 1 picture is randomly sampled from each of the three segments to form picture data with 9 channels, so the format of the input data is 9 × 224 × 224. A mini-batch gradient descent algorithm is adopted, converting the data format into [B*3, 3, 224, 224], where B represents the size of each batch of data. The output after the 2D network is [B*3, 96, 28, 28]; this is expanded along the time dimension and input to the 3D network, and the final network output passes through an average pooling layer and a SoftMax layer to obtain the classification.
The feature extraction process of each layer is described in detail below with reference to fig. 3, 4 and 5:
2DNets: the input data format of the network is [9, 224, 224]; it is then set to [B*3, 3, 224, 224] and input into the first convolutional layer and pooling layer. The convolutional layer extracts features from the input data with 64 convolution kernels of size 7*7, the pooling layer has a 3*3 kernel, and the output data format is [B*3, 64, 56, 56]. The second convolution and pooling layer has kernels of size 3*3 with 192 convolution kernels, and the output data format is [B*3, 192, 28, 28]. The result is then input into the Inc modules, each composed of 1*1 and 3*3 convolution kernels, with a different number of kernels per module; the output of the first Inc module is [B*3, 256, 28, 28] and the output of the second is [B*3, 320, 28, 28]. These are input into two convolutional layers with kernels of 1*1 and 3*3 respectively, and the final output of the network is [B*3, 96, 28, 28].
3DNets: the network consists of 6 residual blocks, each containing two three-dimensional convolutional layers whose kernels are uniformly set to 3 x 3; the numbers of feature maps finally output by the residual blocks are 128, 256, 512 and 512 respectively. The result of 2DNets is expanded along the time dimension to [B, 96, 3, 28, 28] and input to the network; the final output is [B, 512, 1, 1, 1].
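The spatial sizes quoted above follow from the usual convolution/pooling output-size formula; the strides and paddings below are assumptions chosen to reproduce the stated 224 → 112 → 56 → 28 progression, since the patent does not list them explicitly:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer:
    floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

s = conv_out(224, kernel=7, stride=2, padding=3)  # first 7*7 convolution
s = conv_out(s, kernel=3, stride=2, padding=1)    # first 3*3 max pool
s = conv_out(s, kernel=3, stride=1, padding=1)    # second 3*3 convolution
s = conv_out(s, kernel=3, stride=2, padding=1)    # second 3*3 max pool
```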
(4) And inputting the network output value into a SoftMax layer to obtain a final classification result.
(5) And (3) performing space-time multi-dimensional feature modeling analysis, and referring to the attached figure 6.
According to the final classification results of the three networks, space-time multi-dimensional modeling analysis is performed on the learner, specifically as follows: the learner is divided into three dimensions by characteristic, namely the cognitive, interaction and emotion dimensions, forming a learner model oriented to cognition, interaction and emotion. In the cognitive dimension, the learner's cognitive development is analyzed by effectively extracting information from the eye movement data, and reasonable suggestions are given for the different classification results. In the interaction dimension, the student's interaction is analyzed through the voice data and the learner is reminded to interact actively. In the emotion dimension, the learner's emotional changes within a certain period are analyzed and the learning difficulty is adjusted according to the emotional state. Through this multi-dimensional, comprehensive modeling analysis, the learner's internal cognitive structure is represented more accurately, providing support for formulating precise teaching strategies.
The above description is only an exemplary embodiment of the present disclosure, and the scope of the present disclosure should not be limited thereby. That is, all equivalent changes and modifications made in accordance with the teachings of the present disclosure are intended to be included within the scope of the present disclosure. Embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (5)

1. A multi-dimensional convolutional neural network learner modeling method of multi-source heterogeneous data, the method comprising:
(1) Synchronously acquiring eye movement data, voice data and video data of a learner;
(2) Preprocessing the eye movement data, the voice data and the video data to respectively obtain a heat point diagram corresponding to the eye movement data, a spectrogram corresponding to the voice data and a human body posture image corresponding to the video data;
(3) Setting a label for the hotspot graph based on the cognitive state classification of the learner, setting a label for the spectrogram based on the interactive state classification of the learner, and setting a label for the human posture image based on the emotional state classification of the learner;
(4) Training a multi-dimensional convolutional neural network, and respectively inputting the heat point diagram, the spectrogram and the human body posture image to be recognized into the multi-dimensional convolutional neural network with the same structure for feature extraction to respectively obtain output classification results;
the heat point diagram, the spectrogram and the human body posture image are serialized data which correspond to each other one by one;
in the step (4), inputting into the multi-dimensional convolutional neural network with the same structure for feature extraction comprises:
averagely dividing the serialized data into K segments {S1, S2, S3, …, SK} according to time sequence;
carrying out equal-probability random sampling of N images on each segment of the serialized data, and inputting the K*N sampled images as input data into the multi-dimensional convolutional neural network for processing according to the following formulas:
m(T1, T2, T3, …, TK) = (f(T1, W), f(T2, W), f(T3, W), …, f(TK, W))
M(T1, T2, T3, …, TK) = H(F(m(T1, T2, T3, …, TK), W1))
wherein m(T1, T2, T3, …, TK) represents feature extraction of the data by the two-dimensional convolutional neural network; T1, T2, T3, …, TK are sequences of N pictures each, the sequences being obtained by random sampling from the segments S1, S2, S3, …, SK respectively; f(TK, W) is the two-dimensional convolution layer with parameter W; M(T1, T2, T3, …, TK) represents the final prediction result of the network; F(m(T1, T2, T3, …, TK), W1) represents feature extraction of the data by the three-dimensional convolutional neural network with parameter W1; and H represents the SoftMax function;
the training multidimensional convolutional neural network in the step (4) comprises the following steps:
forming a multi-dimensional convolutional neural network by a 2D network and a 3D network, the multi-dimensional convolutional neural network comprising an input layer, two-dimensional convolutional layers, three-dimensional convolutional layers, a maximum pooling layer, an average pooling layer, a BatchNorm layer, and a SoftMax classification layer, the BatchNorm layer following each convolutional layer;
the input layer inputs a sample heat point diagram, a sample spectrogram or a sample human body posture image into the two-dimensional convolutional layer and the maximum pooling layer to obtain static characteristics of input data;
the static features are expanded according to time dimension and then input into the three-dimensional convolutional layer and the maximum pooling layer, wherein the last pooling layer is an average pooling layer, and dynamic information of the input data is obtained;
calculating the error between the classification result output by the SoftMax classification layer and the actual classification, back-propagating the calculated error to obtain the gradient of the parameters of each layer, adjusting the connection parameters of each layer according to the gradients, and repeating the error back propagation until the parameters of each layer reach the minimum point of the classification output error, then stopping the iteration;
(5) And performing space-time multi-dimensional feature modeling analysis by combining the three output classification results.
2. The method of claim 1, wherein the error is calculated by:
loss(x_i) = −l(x_i)·log(p(x_i))
wherein l(x_i) represents the tag value of the ith input data, p(x_i) is the predicted value obtained after the ith input data passes through the convolutional network, and loss(x_i) represents the loss function.
3. The method of claim 2, wherein the calculating the gradient of each layer parameter based on the calculated error back propagation comprises:
δ^l = ∂L/∂W^l
W^l ← W^l − η·δ^l
wherein L represents the error obtained after training with the sample data, W^l is the convolution kernel parameter of layer l, z^l represents the convolved output, a^l is the result after activating the convolved output, δ^l represents the gradient of the error with respect to the convolution kernel parameters, and η is the learning rate.
4. The method of claim 1, wherein the unfolding manner of unfolding the static features in the time dimension is as follows:
[B*S,C,H,W]→[B,C,S,H,W]
where B denotes the batch size, S denotes the number of input images, and C, H, and W denote the number of channels and the height and width of the feature map, respectively.
5. The method according to claim 1, wherein said combining three said output classification results in step (5) for spatiotemporal multidimensional feature modeling analysis comprises:
classifying the learner into three dimensions according to characteristics according to the output classification result, wherein the dimensions comprise a cognitive dimension, an interaction dimension and an emotion dimension;
sending suggestion information to the learner based on the classification result of the cognitive dimension;
sending interaction reminding information to the learner based on the classification result of the interaction dimension;
and sending learning difficulty adjustment information to the learner based on the classification result of the emotional dimension.
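The segment sampling recited in claim 1 and the time-dimension unfolding recited in claim 4 can be sketched in NumPy as follows. This is a non-authoritative illustration under hypothetical sizes (120 frames, K=4, N=2, batch B=1, C=1 channel, 32×32 images); the 2D-convolution stage is stood in for by an identity mapping so that only the sampling and the [B*S, C, H, W] → [B, C, S, H, W] reshaping are demonstrated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical serialized data: 120 frames of 1-channel 32x32 images.
frames = rng.standard_normal((120, 1, 32, 32)).astype(np.float32)

def sample_segments(data, K=4, N=2, rng=rng):
    """Divide the sequence into K equal segments according to time
    order and draw N frames from each with equal probability
    (the sampling step of claim 1)."""
    segments = np.array_split(data, K)
    picks = [seg[rng.choice(len(seg), size=N, replace=False)] for seg in segments]
    return np.concatenate(picks)  # K*N frames

sampled = sample_segments(frames)  # shape (8, 1, 32, 32)

# Stand-in for the 2D stage output: identity here; a real network would
# emit C feature maps per frame, stacked along the batch axis as B*S.
B, S = 1, sampled.shape[0]
static = sampled.reshape(B * S, 1, 32, 32)

# Claim 4 unfolding: [B*S, C, H, W] -> [B, C, S, H, W], so that the
# 3D stage can convolve over the time axis S.
unfolded = static.reshape(B, S, 1, 32, 32).transpose(0, 2, 1, 3, 4)

print(sampled.shape, unfolded.shape)
```

The transpose places the sampled-frame axis S between the channel axis and the spatial axes, which matches the `(N, C, D, H, W)` layout that 3D convolution layers conventionally expect.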
CN202011355627.0A 2020-11-27 2020-11-27 Multi-dimensional convolution neural network learner modeling method for multi-source heterogeneous data Active CN112529054B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011355627.0A CN112529054B (en) 2020-11-27 2020-11-27 Multi-dimensional convolution neural network learner modeling method for multi-source heterogeneous data


Publications (2)

Publication Number Publication Date
CN112529054A CN112529054A (en) 2021-03-19
CN112529054B true CN112529054B (en) 2023-04-07

Family

ID=74994046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011355627.0A Active CN112529054B (en) 2020-11-27 2020-11-27 Multi-dimensional convolution neural network learner modeling method for multi-source heterogeneous data

Country Status (1)

Country Link
CN (1) CN112529054B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591988B (en) * 2021-07-30 2023-08-29 华中师范大学 Knowledge cognitive structure analysis method, system, computer equipment, medium and terminal
CN114224342B (en) * 2021-12-06 2023-12-15 南京航空航天大学 Multichannel electroencephalogram signal emotion recognition method based on space-time fusion feature network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9754351B2 (en) * 2015-11-05 2017-09-05 Facebook, Inc. Systems and methods for processing content using convolutional neural networks
US10818019B2 (en) * 2017-08-14 2020-10-27 Siemens Healthcare Gmbh Dilated fully convolutional network for multi-agent 2D/3D medical image registration
NZ759804A (en) * 2017-10-16 2022-04-29 Illumina Inc Deep learning-based techniques for training deep convolutional neural networks
CN108399376B (en) * 2018-02-07 2020-11-06 华中师范大学 Intelligent analysis method and system for classroom learning interest of students
CN108596039B (en) * 2018-03-29 2020-05-05 南京邮电大学 Bimodal emotion recognition method and system based on 3D convolutional neural network



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant