CN114663910A - Multi-mode learning state analysis system - Google Patents

Multi-mode learning state analysis system

Info

Publication number
CN114663910A
CN114663910A
Authority
CN
China
Prior art keywords
information
teacher
acquisition module
unit
posture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210027041.4A
Other languages
Chinese (zh)
Inventor
朱世宇
孙令翠
杨红艳
何桢
田菊艳
余玉清
卢政旭
冉程好
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Institute of Engineering
Original Assignee
Chongqing Institute of Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Institute of Engineering filed Critical Chongqing Institute of Engineering
Priority to CN202210027041.4A
Publication of CN114663910A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/20 Education

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Tourism & Hospitality (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of digital data processing, in particular to a multi-mode learning state analysis system, which comprises an image acquisition module for collecting image data in a classroom and preprocessing the image data to obtain a processed image; a teacher position information acquisition module for acquiring the teacher's position information based on the processed image and a Faster R-CNN target detection model; a teacher posture information acquisition module for acquiring the teacher's posture information based on the processed image and the Faster R-CNN target detection model; a student face information acquisition module for acquiring student face information based on the processed image and an ERT facial feature point detection method; and an analysis module for analyzing the learning state based on the position information, the posture information and the student face information. The students' classroom behaviour can thus be better supervised and their learning efficiency improved.

Description

Multi-mode learning state analysis system
Technical Field
The invention relates to the field of digital data processing, in particular to a multi-mode learning state analysis system.
Background
In traditional classroom education, a teacher judges students' learning states from their facial expressions and head postures, but a teacher's attention is limited: the classroom learning state of every student cannot be observed in time, so the teaching strategy cannot be adjusted to each student's learning situation.
Intelligent teaching is a current buzzword in China's education informatization research; scholars describe it as a new form, a new realm and a new stage of education informatization development, and research on it has advanced to a considerable level. For detecting and analyzing students' classroom learning, existing approaches can be classified into face recognition-based methods, micro-expression recognition methods, and brain wave detection methods.
These methods share a common drawback: each relies on only one variable, so detection precision is low.
Disclosure of Invention
The invention aims to provide a multi-mode learning state analysis system that combines students' facial expressions, the teacher's position and the teacher's in-class voice data for analysis and provides an intuitive analysis result. An analysis of each student's degree of concentration can be given, which helps the teacher make a suitable teaching plan and promotes classroom interaction between teacher and students.
In order to achieve the above object, the present invention provides a system for analyzing learning status based on multiple modalities, which comprises an image acquisition module, a teacher position information acquisition module, a teacher posture information acquisition module, a student face information acquisition module and an analysis module,
the image acquisition module is used for collecting image data in a classroom and preprocessing the image data to obtain a processed image;
the teacher position information acquisition module is used for acquiring the position information of the teacher based on the processed image and the Faster R-CNN target detection model;
the teacher posture information acquisition module is used for acquiring the posture information of the teacher based on the processed image and the Faster R-CNN target detection model;
the student face information acquisition module is used for acquiring student face information based on the processed image and an ERT facial feature point detection method;
the analysis module is used for analyzing the learning state based on the position information, the posture information and the facial information of the student.
The image acquisition module comprises an acquisition unit and a processing unit, wherein the acquisition unit is used for collecting image data of the camera;
and the processing unit is used for standardizing the size of the image data and carrying out normalization processing to obtain a processed image.
Specifically, the image data of the camera is collected by capturing one frame every 5 seconds.
Wherein the teacher position information acquisition module comprises a feature map extraction unit, a candidate unit, a region feature map generation unit and a teacher position acquisition unit,
the feature map extraction unit is used for extracting an original feature map based on the processed image;
the candidate unit is used for inputting the original feature map into a candidate frame extraction network to generate a regional candidate frame;
the regional characteristic diagram generating unit is used for mapping the regional candidate frame into the original characteristic diagram and pooling the regional candidate frame into a regional characteristic diagram;
and the teacher position acquisition unit is used for inputting the regional characteristic diagram into a Faster R-CNN target detection model to acquire teacher position information.
The teacher posture information acquisition module comprises a key point acquisition unit and a normalization unit, wherein the key point acquisition unit is used for acquiring key point information of the posture of the human body based on a Faster R-CNN target detection model;
and the normalization unit is used for carrying out posture normalization processing based on the human posture key point information.
The specific steps of analyzing the learning state based on the position information, the posture information and the facial information of the student are as follows:
constructing and training a multi-modal feature fusion network structure based on a full connection layer;
and mapping the position information, the posture information and the student face information to a feature fusion space for learning state analysis.
The specific steps of analyzing the learning state based on the position information, the posture information and the facial information of the student are as follows:
fusing position information, posture information and student face information by adopting a weighted fusion method to obtain weighted fusion characteristics;
inputting the weighted fusion features into the full connection layer;
and obtaining the classification probability distribution of the weighted fusion characteristics.
According to the multi-mode learning state analysis system, cameras can be placed at the front and the back of a classroom; image data is then acquired through the image acquisition module, the images are processed by the Faster R-CNN target detection model to obtain the teacher's position and posture information, the students' face information in the image data is extracted by the ERT facial feature point detection method, and the students' learning state can then be comprehensively evaluated. The students' classroom behaviour can thus be better supervised and their learning efficiency improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow diagram of the pooling of the present invention.
FIG. 2 is a flow chart of teacher pose information calculation of the present invention.
FIG. 3 is a multi-modal fusion analysis student classroom learning state diagram of the present invention.
FIG. 4 is a flow chart of obtaining three-dimensional model features based on three different modalities through joint learning of a tri-modal convolutional neural network according to the present invention.
Fig. 5 is a structural diagram of the multi-mode learning state analysis system according to the present invention.
FIG. 6 is a block diagram of an image acquisition module of the present invention.
Fig. 7 is a block diagram of a teacher position information acquisition module of the present invention.
Fig. 8 is a block diagram of the teacher's posture information acquisition module of the present invention.
In the figures: 1, image acquisition module; 2, teacher position information acquisition module; 3, teacher posture information acquisition module; 4, student face information acquisition module; 5, analysis module; 11, acquisition unit; 12, processing unit; 21, feature map extraction unit; 22, candidate unit; 23, region feature map generation unit; 24, teacher position acquisition unit; 31, key point acquisition unit; 32, normalization unit.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Example 1
Referring to fig. 1 to 8, the present invention provides a system for analyzing a learning state based on multiple modalities:
comprises an image acquisition module 1, a teacher position information acquisition module 2, a teacher posture information acquisition module 3, a student face information acquisition module 4 and an analysis module 5,
the image acquisition module 1 is used for collecting image data in a classroom and preprocessing the image data to obtain a processed image;
the teacher position information acquisition module 2 is used for acquiring the position information of the teacher based on the processed image and the Faster R-CNN target detection model;
the teacher posture information acquisition module 3 is used for acquiring the posture information of the teacher based on the processed image and the Faster R-CNN target detection model;
the student face information acquisition module 4 is used for acquiring student face information based on the processed image and an ERT facial feature point detection method;
and the analysis module 5 is used for analyzing the learning state based on the position information, the posture information and the facial information of the student.
In this embodiment, cameras can be placed at the front and the back of a classroom; image data is then obtained through the image acquisition module 1, the images are processed by the Faster R-CNN target detection model to obtain the teacher's position and posture information, and the students' face information in the image data is extracted by the ERT (Ensemble of Regression Trees) facial feature point detection method, after which the students' learning state can be comprehensively evaluated, so that their classroom behaviour can be better supervised and their learning efficiency improved. The ERT facial feature point detection method follows the formula:
ξ(t+1) = ξ(t) + r_t(I, ξ(t))
where t denotes the cascade stage index and r_t(·,·) denotes the regressor of the current stage. The regressor takes as input the face image I and the feature point coordinates updated by the previous-stage regressor; the features used can be gray values or other features. When an image is input, the algorithm generates an initial shape, which serves as the initial estimate of the facial feature point positions. A gradient boosting algorithm is used to minimize the error, yielding each cascade regressor.
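As a minimal sketch of this cascade update, assuming the stage regressors have already been trained (the `regressors` list and function names below are illustrative, not part of the patent):

```python
import numpy as np

def cascade_align(image, xi_init, regressors):
    """Refine facial feature point coordinates with the ERT cascade
    update xi(t+1) = xi(t) + r_t(I, xi(t))."""
    xi = np.asarray(xi_init, dtype=np.float32).copy()  # initial shape estimate
    for r_t in regressors:        # one regressor per cascade stage
        xi = xi + r_t(image, xi)  # each stage predicts a residual update
    return xi
```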
Further, the image acquiring module 1 includes an acquiring unit 11 and a processing unit 12, where the acquiring unit 11 is configured to collect image data of a camera;
the processing unit 12 is configured to normalize the size of the image data, and perform normalization processing to obtain a processed image.
Specifically, the image data of the camera is collected by capturing one frame every 5 seconds. This sampling rate is suitable: it satisfies near-real-time requirements while improving image processing efficiency.
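A minimal sketch of the acquisition and processing units, assuming an OpenCV camera source (the function and parameter names are illustrative):

```python
import cv2

def sample_frames(source=0, interval_s=5, size=(256, 256)):
    """Grab one frame every `interval_s` seconds, standardize its size
    and normalize pixel values to [0, 1]."""
    cap = cv2.VideoCapture(source)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS is unknown
    step = max(1, int(fps * interval_s))      # frames between samples
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frame = cv2.resize(frame, size)            # standardize size
            yield frame.astype("float32") / 255.0      # normalization
        idx += 1
    cap.release()
```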
Further, the teacher position information obtaining module 2 includes a feature map extracting unit 21, a candidate unit 22, a region feature map generating unit 23, and a teacher position obtaining unit 24,
the feature map extracting unit 21 is configured to extract an original feature map based on the processed image;
the candidate unit 22 is configured to input the original feature map into a candidate box extraction network to generate a region candidate box;
the region feature map generating unit 23 is configured to map the region candidate box into the original feature map, and pool the region candidate box into the region feature map;
the teacher position obtaining unit 24 is configured to input the region feature map into a Faster R-CNN target detection model to obtain teacher position information.
In the present embodiment, the original feature map is extracted as follows: the teacher behaviour image of the classroom scene, of arbitrary size, is first resized to a fixed size of 256 × 256; the 256 × 256 picture is then input into a basic network (a CNN backbone of convolutional layers) to extract the feature map of the original picture, which the subsequent RPN layer and fully connected layers can share.
The second step: the feature map is input into the RPN to generate region candidate boxes. A sliding window is applied on the feature map to judge and classify target regions, thereby generating region candidate boxes. The RPN first performs a convolution operation on the feature map output by the shared convolutional layers to obtain a feature vector, which is then fed into two fully connected layers: a bounding-box regression layer and a bounding-box classification layer.
The third step: ROI Pooling is performed on each region feature map. Given a feature map and candidate regions, the candidate regions are mapped into the feature map and pooled into region feature maps of uniform size, which are then fed into fully connected layers to obtain a classification vector and position coordinates.
The fourth step: finally, the region feature maps are input into the Faster R-CNN head, which performs further position refinement and category regression on the candidate boxes proposed by the RPN.
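The four steps above can be sketched with an off-the-shelf detector; the pretrained torchvision Faster R-CNN below (with its built-in RPN and ROI pooling) stands in for the patent's trained model, and filtering to the COCO "person" class is an assumption used to approximate locating the teacher:

```python
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect_teacher(image_tensor):
    """image_tensor: 3xHxW float tensor in [0, 1]. Returns bounding boxes
    of detected persons with confidence above 0.8."""
    with torch.no_grad():
        out = model([image_tensor])[0]     # backbone -> RPN -> ROI head
    keep = (out["labels"] == 1) & (out["scores"] > 0.8)  # 1 = person (COCO)
    return out["boxes"][keep]
```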
For each selected region i in the picture of the teacher's classroom behaviour, f_i is defined as the mean-pooled convolutional feature of that region, so the image feature vector has dimension 2048; a linear transform then converts f_i into an h-dimensional vector:
v_i = W_v · f_i + b_v
Thus, the complete representation of the teacher's classroom behaviour image is a set of embedding vectors:
V = {v_1, ..., v_k}, v_i ∈ R^h
where v_i encodes one salient region i and k is the number of regions.
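A minimal sketch of this region-embedding step; the embedding size h = 512 is an assumed value for illustration:

```python
import torch
import torch.nn as nn

class RegionEmbedding(nn.Module):
    """Projects 2048-d mean-pooled region features f_i into h-dimensional
    vectors v_i = W_v f_i + b_v."""
    def __init__(self, in_dim=2048, h=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, h)   # learns W_v and b_v jointly

    def forward(self, region_feats):       # shape (k, 2048), one row per region
        return self.proj(region_feats)     # V = {v_1, ..., v_k}, each h-dim
```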
Further, the teacher posture information acquiring module 3 includes a key point acquiring unit 31 and a normalizing unit 32, where the key point acquiring unit 31 is configured to acquire human posture key point information based on a Faster R-CNN target detection model;
the normalization unit 32 is configured to perform posture normalization processing based on the human posture key point information.
In this embodiment, the Faster R-CNN target detection model acquires the human posture key point information, on which posture normalization processing is performed.
During posture normalization, the key point coordinates are estimated and a transformation matrix is computed from them.
The flow is shown in fig. 2: first, key point coordinates are extracted from the key points produced by the preceding key point detection network; then the maximum-value point is selected as the key point estimate, yielding the key point coordinates.
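A minimal sketch of this estimation step, assuming the detection network outputs one heatmap per key point; the simple centroid-and-scale normalization below is an assumed stand-in for the patent's transformation matrix:

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps):
    """Take the maximum-value location of each heatmap as that key
    point's coordinate estimate."""
    coords = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        coords.append((x, y))
    return np.array(coords, dtype=np.float32)

def normalize_pose(coords):
    """Translate key points to their centroid and scale to unit size so
    postures are comparable across frames."""
    centered = coords - coords.mean(axis=0)
    scale = np.linalg.norm(centered) or 1.0
    return centered / scale
```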
Further, the specific step of analyzing the learning state based on the position information, the posture information and the student face information is:
constructing and training a multi-modal feature fusion network structure based on a full connection layer;
and mapping the position information, the posture information and the student face information to a feature fusion space for learning state analysis.
To overcome the limits of evaluating students' classroom learning state from single-modality data, a multi-modal fusion method is adopted herein, and the students' classroom learning state is analyzed from 3 modalities. The overall flow chart is shown in fig. 3.
A multi-modal feature fusion network based on fully connected layers is constructed and trained, and the multi-dimensional, multi-scale features are mapped into the feature fusion space.
A Softmax function is selected to map the fused feature vector into a probability sequence, preserving more of the original feature information. Softmax computes the output class y^(i) as shown in the formula:
P(y^(i) = k) = exp(η_k) / Σ_{j=1}^{K} exp(η_j)
where η_i is the fused feature value, K is the number of categories, and P is the probability that y^(i) belongs to class k.
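A minimal sketch of this mapping (a standard numerically stable softmax):

```python
import numpy as np

def softmax(eta):
    """Map fused feature values eta (one per class, K in total) to a
    classification probability distribution."""
    z = np.exp(eta - np.max(eta))   # subtract max for numerical stability
    return z / z.sum()
```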
The first stage is as follows: paying attention to the teacher position information characteristics and the student eye offset in each teacher picture to obtain whether the student looks at the teacher
And a second stage: analyzing the characteristic region of the teacher behavior image, wherein the teacher should look at the blackboard or see the blackboard; and then compared to the student's eye offset to conclude that the student should look there at the moment.
And a third stage: similar example 2 was obtained by analyzing the facial feature maps of students with the behavior and location of teachers
Embodiment 2 differs from embodiment 1 only in that the specific steps of analyzing the learning state based on the position information, posture information, and student face information are:
fusing position information, posture information and student face information by adopting a weighted fusion method to obtain weighted fusion characteristics;
inputting the weighted fusion features into the full connection layer;
and obtaining the classification probability distribution of the weighted fusion features.
Three-dimensional model features based on the different modalities are obtained through joint learning of the tri-modal convolutional neural network. Unlike traditional feature fusion methods that use a pooling operation, this method fuses the three feature vectors with a statistically based weighted fusion method; its framework is shown in fig. 4.
The concrete formula is as follows:
F = Σ_{i=1}^{3} α_i · f_i
Σ_{i=1}^{3} α_i = 1
where f_i represents the feature vector extracted from the i-th modality and α_i denotes the weight of that modality. The weighted fusion feature vector is input into fully connected (FC) layers whose dimensions are 512, 256 and C in sequence, where C represents the number of dataset categories. Finally, the classification probability distribution of the three-dimensional model is obtained through a softmax layer. The learning state is divided into four categories: leisurely, general, vague and difficult.
Through a correlation loss function, the method ensures that the multiple modalities can guide each other during training, which speeds up network training and improves the robustness of the final feature vector.
The specific formula is as follows:
L_C(M_i, M_j) = ||ξ(f_{M_i}) - ξ(f_{M_j})||^2
where ||·||^2 expresses the correlation of the two different feature vectors; f_{M_i} represents the feature vector extracted from the features of modality M_i; the subscripts of M index the 1st, 2nd and 3rd modalities; and ξ(·) represents a normalized excitation function. During training, the value of the correlation loss gradually decreases, showing that the different modality features guide each other; this speeds up training convergence and yields more robust feature vectors. Taking modality M_1 as an example, based on this design of the correlation loss function, the final loss function of each modality's network is:
L(M_1) = L_CE(M_1) + L_C(M_1, M_2) + L_C(M_1, M_3)
where L_CE(M_1) expresses the cross-entropy loss of the single modality M_1, and L_C(M_1, M_2) and L_C(M_1, M_3) respectively represent the correlation losses between modality M_1 and modalities M_2 and M_3. Finally, the three single-modality networks are optimized through back-propagation with stochastic gradient descent.
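A minimal sketch of the correlation loss and the combined per-modality loss; L2 normalization stands in for the normalized excitation function ξ, and the unweighted sum of terms is an assumption:

```python
import torch
import torch.nn.functional as F

def correlation_loss(f_mi, f_mj):
    """Squared distance between normalized feature vectors of two
    modalities: L_C(M_i, M_j) = ||xi(f_Mi) - xi(f_Mj)||^2."""
    xi_i = F.normalize(f_mi, dim=-1)
    xi_j = F.normalize(f_mj, dim=-1)
    return ((xi_i - xi_j) ** 2).sum(dim=-1).mean()

def loss_m1(logits_m1, target, f_m1, f_m2, f_m3):
    """Final loss for modality M1: single-modality cross entropy plus
    correlation losses against the other two modalities."""
    ce = F.cross_entropy(logits_m1, target)
    return ce + correlation_loss(f_m1, f_m2) + correlation_loss(f_m1, f_m3)
```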
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A multi-mode learning state analysis system, characterized in that it
comprises an image acquisition module, a teacher position information acquisition module, a teacher posture information acquisition module, a student face information acquisition module and an analysis module,
the image acquisition module is used for collecting image data in a classroom and carrying out preprocessing to obtain a processed image;
the teacher position information acquisition module is used for acquiring the position information of the teacher based on the processed image and the Faster R-CNN target detection model;
the teacher posture information acquisition module is used for acquiring the posture information of the teacher based on the processed image and the Faster R-CNN target detection model;
the student face information acquisition module is used for acquiring student face information based on the processed image and an ERT facial feature point detection method;
the analysis module is used for analyzing the learning state based on the position information, the posture information and the facial information of the student.
2. The multi-mode learning state analysis system of claim 1, wherein
the image acquisition module comprises an acquisition unit and a processing unit, wherein the acquisition unit is used for collecting image data of the camera;
and the processing unit is used for standardizing the size of the image data and carrying out normalization processing to obtain a processed image.
3. The multi-mode learning state analysis system of claim 2, wherein the image data of the camera is collected by capturing one frame every 5 seconds.
4. The multi-mode learning state analysis system of claim 1, wherein
the teacher position information acquisition module comprises a feature map extraction unit, a candidate unit, a region feature map generation unit and a teacher position acquisition unit,
the feature map extraction unit is used for extracting an original feature map based on the processed image;
the candidate unit is used for inputting the original feature map into a candidate frame extraction network to generate a region candidate frame;
the regional characteristic diagram generating unit is used for mapping the regional candidate frame into the original characteristic diagram and pooling the regional candidate frame into a regional characteristic diagram;
and the teacher position acquisition unit is used for inputting the regional characteristic diagram into a Faster R-CNN target detection model to acquire teacher position information.
5. The multi-mode learning state analysis system of claim 4, wherein
the teacher posture information acquisition module comprises a key point acquisition unit and a normalization unit, wherein the key point acquisition unit is used for acquiring key point information of the posture of the human body based on a Faster R-CNN target detection model;
and the normalization unit is used for carrying out posture normalization processing based on the human posture key point information.
6. The multi-mode learning state analysis system of claim 1, wherein
the specific steps of analyzing the learning state based on the position information, the posture information and the student face information are as follows:
constructing and training a multi-modal feature fusion network structure based on a full connection layer;
and mapping the position information, the posture information and the student face information to a feature fusion space for learning state analysis.
7. The multi-mode learning state analysis system of claim 1, wherein
the specific steps of analyzing the learning state based on the position information, the posture information and the student face information are as follows:
fusing position information, posture information and student face information by adopting a weighted fusion method to obtain weighted fusion characteristics;
inputting the weighted fusion features into the full connection layer;
and obtaining the classification probability distribution of the weighted fusion characteristics.
CN202210027041.4A 2022-01-11 2022-01-11 Multi-mode learning state analysis system Pending CN114663910A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210027041.4A CN114663910A (en) 2022-01-11 2022-01-11 Multi-mode learning state analysis system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210027041.4A CN114663910A (en) 2022-01-11 2022-01-11 Multi-mode learning state analysis system

Publications (1)

Publication Number Publication Date
CN114663910A 2022-06-24

Family

ID=82025652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210027041.4A Pending CN114663910A (en) 2022-01-11 2022-01-11 Multi-mode learning state analysis system

Country Status (1)

Country Link
CN (1) CN114663910A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596347A (en) * 2023-07-17 2023-08-15 泰山职业技术学院 Multi-disciplinary interaction teaching system and teaching method based on cloud platform
CN116596347B (en) * 2023-07-17 2023-09-29 泰山职业技术学院 Multi-disciplinary interaction teaching system and teaching method based on cloud platform


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination