CN114708658A - Online learning concentration degree identification method - Google Patents

Online learning concentration degree identification method Download PDF

Info

Publication number
CN114708658A
Authority
CN
China
Prior art keywords
eye
mouth
fatigue
value
human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210332063.1A
Other languages
Chinese (zh)
Inventor
祝玉军
陈锡敏
郭梦丽
武伟
杨丹丹
章智强
陈能
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Normal University
Original Assignee
Anhui Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Normal University filed Critical Anhui Normal University
Priority to CN202210332063.1A priority Critical patent/CN114708658A/en
Publication of CN114708658A publication Critical patent/CN114708658A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education
    • G06Q50/205Education administration or guidance

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Educational Administration (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Educational Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Game Theory and Decision Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Primary Health Care (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method for identifying the concentration degree of online learning, which comprises the following steps: constructing a human fatigue state detection module and a human behavior recognition module; identifying the collected face data through the human fatigue state detection module and detecting the fatigue degree of the human body; and identifying the collected image data of the user's online learning through the human behavior recognition module and detecting whether the user is distracted. The invention has the advantages that it can effectively detect the attention state of online students in real time and analyze their in-class concentration, which is an important index for measuring the quality of online teaching, helps to improve that quality, and better supports the expansion of online teaching.

Description

Online learning concentration degree identification method
Technical Field
The invention relates to the field of computer vision analysis, and in particular to a system for identifying and analyzing students' online learning concentration based on YOLOv5 and the Dlib model.
Background
At present, with the popularization of intelligent electronic devices and the diversification of education forms, learning modes such as online lectures have gradually entered the teaching of colleges and universities, where fatigue and distraction of students have become the main factors affecting the effect of online education. In a traditional classroom, obtaining teaching feedback is time-consuming and laborious: a teacher must attend to the quality of the teaching content while also supervising how carefully students are listening, which degrades teaching quality, and the teacher cannot constantly pay attention to the state of every student. Online lessons are convenient for teachers and students, but owing to practical limitations the students in an online lesson cannot be supervised. In the online mode students merely watch videos; the teacher cannot know each student's listening effect and concentration in real time as in traditional classroom teaching, and cannot monitor students or obtain their feedback, so the quality of students' attention in online lessons drops considerably. Therefore, automatically identifying students' concentration during online lessons by means of computer vision technology is very important for online teaching, and no relevant documents have been disclosed in the prior art.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method for identifying and analyzing students' online learning concentration based on YOLOv5 and the Dlib model, which automatically identifies the concentration degree of students learning online.
In order to achieve the purpose, the invention adopts the technical scheme that:
an online learning concentration degree identification method comprises the following steps:
constructing a human fatigue state detection module and a human behavior recognition module;
identifying the collected face data through a human body fatigue state detection module and detecting the fatigue degree of the human body;
the collected image data of the user online learning is identified through the human behavior identification module, and whether the user is distracted or not is detected.
The human fatigue state detection module first obtains the indexes of the facial landmarks of the left eye, the right eye and the mouth, then grays the video stream through function classes in the OpenCV software library and detects the position information of the eyes and the mouth, then calculates data including the eye opening degree, the number of blinks, the mouth opening degree and the yawning frequency, constructs a function for evaluating the fatigue degree, and compares it with a set threshold so as to detect the fatigue degree of the human body.
The human fatigue state detection module comprises an eye detection step, a mouth detection step and a fatigue judgment step, wherein the eye detection step comprises:
calculating the EAR values of the left eye and the right eye, wherein the EAR value is the aspect ratio of the eye and the average of the left and right eyes is used as the final EAR value; if the EAR values of two continuous frames are smaller than the set threshold, one blinking action is indicated; and counting the number of frames Rolleye meeting the eye-closing characteristic within the set total number of frames Roll of the video stream;
wherein the mouth detection step comprises: calculating the MAR value of the mouth, wherein the MAR value is the aspect ratio of the mouth; determining that a yawn has occurred when the MAR values of the mouth in two continuous frame images are larger than the set threshold, and counting the number of frames Rollmouth meeting the yawning characteristic within the set total number of frames Roll of the video stream;
the fatigue judging step comprises: the fatigue degree is judged based on the Rolleye value detected in the eye detecting step and the Rollmouth value in the mouth detecting step.
The fatigue judgment step comprises judging the fatigue degree with a constructed function PERCLOS, where

$$\mathrm{PERCLOS} = \frac{\mathrm{Rolleye} + \mathrm{Rollmouth}}{\mathrm{Roll}} \times 100\%$$
Setting a fatigue threshold value P1, judging to be in a fatigue state when the PERCLOS value is larger than P1, and judging to be in an awake state otherwise.
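As an illustration of this fatigue decision, the following minimal Python sketch computes the PERCLOS value from the counted frame numbers and compares it with the fatigue threshold P1. The summation form of the formula and the concrete threshold value 0.38 are assumptions made for illustration rather than values taken from the patent.

```python
def perclos(rolleye, rollmouth, roll=150):
    """Proportion of frames in the window showing eye closure or yawning.

    rolleye   -- frames matching the eye-closing characteristic
    rollmouth -- frames matching the yawning characteristic
    roll      -- total number of frames in the evaluation window
    """
    return (rolleye + rollmouth) / float(roll)


def is_fatigued(rolleye, rollmouth, roll=150, p1=0.38):
    """Fatigue state if PERCLOS exceeds the threshold P1, awake state otherwise."""
    return perclos(rolleye, rollmouth, roll) > p1


# Example: 40 closed-eye frames and 25 yawning frames out of 150
print(is_fatigued(40, 25))  # True
```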
Calculating the EAR values of the left and right eyes includes:
Step 1: open the camera to obtain the student's facial image, loop over the video stream, read each picture, expand its dimensions and gray it;
Step 2: use the key function dlib.get_frontal_face_detector() of the Dlib library, which returns the face position detector used to detect the bounds of the face region, and the key function dlib.shape_predictor(predictor_path), which loads a pre-trained model and returns the facial landmark predictor that outputs the feature point coordinates;
face position detection is performed with the detector on the grayed frame, the detected face positions are looped over, and the facial landmark positions are obtained with the shape predictor.
Step 3: extract the frame image to detect the face, roughly locate the eyes according to the facial landmark positions, and then perform skin-color segmentation. Skin-color segmentation locates the eye positions more precisely: because the eyes and the skin differ in color, a jump in the data appears during image processing, and the segmentation is made at that jump;
Step 4: convert the facial landmark information into an array format (the array stores the eye and mouth coordinates so that subsequent processing can extract them easily), extract the coordinates of the left and right eyes, and precisely locate the eyes (rough positioning determines the outline of each part, while precise positioning directly determines the coordinate position: the eyes are first found by rough positioning and their coordinates are then obtained by precise positioning);
Step 5: calculate the EAR values of the left and right eyes with a constructed function and take the average of the two eyes as the final EAR value. The EAR of one eye is calculated as:

$$\mathrm{EAR} = \frac{\left|P_{38}.y - P_{42}.y\right| + \left|P_{39}.y - P_{41}.y\right|}{2\left|P_{37}.x - P_{40}.x\right|}$$

where P38.y, P39.y, P42.y and P41.y are the ordinates of points 38, 39, 42 and 41 of the 68 facial feature points, and P37.x and P40.x are the abscissas of points 37 and 40;
the EAR values of the two eyes are calculated separately and averaged to obtain the final EAR value.
Calculating the MAR value of the mouth comprises the following steps:
Step (1): extract the frame image to detect the face, roughly locate the mouth and perform skin-color segmentation. The mouth outline is roughly drawn according to the difference between the mouth color and the skin color: after the image is grayed, different colors have different depths, so the depth can be judged directly to draw the outline, which quickly and roughly locates the mouth. Skin-color segmentation exploits the fact that the mouth and the skin differ in color, so a jump in the data is seen during image processing and is used for segmentation;
Step (2): convert the facial landmark information into an array format (the position coordinates are stored according to the positions of the 68 facial feature points in the array, so each position can be located precisely), extract the mouth position coordinates, and precisely locate the mouth (precise positioning uses the coordinate information stored in the array; rough positioning draws the outline, precise positioning gives the coordinates);
the MAR value is then calculated with a constructed function:

$$\mathrm{MAR} = \frac{\left|P_{51}.y - P_{59}.y\right| + \left|P_{53}.y - P_{57}.y\right|}{2\left|P_{49}.x - P_{55}.x\right|}$$

where P51.y, P53.y, P57.y and P59.y are the ordinates of points 51, 53, 57 and 59 of the 68 facial feature points, and P49.x and P55.x are the abscissas of points 49 and 55.
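A minimal Python sketch of these two aspect-ratio computations is given below. It assumes the 68 landmarks have already been extracted into a list of (x, y) tuples; the patent numbers the points from 1 to 68, so the 0-based list index of point n is n - 1. Function names are illustrative.

```python
def eye_aspect_ratio(pts, eye_points=(37, 38, 39, 40, 41, 42)):
    """EAR of one eye, using the patent's 1-based point numbers.

    pts        -- list of 68 (x, y) tuples, index 0 holds point 1
    eye_points -- 37-42 for the left eye, 43-48 for the right eye
    """
    p1, p2, p3, p4, p5, p6 = (pts[i - 1] for i in eye_points)
    return (abs(p2[1] - p6[1]) + abs(p3[1] - p5[1])) / (2.0 * abs(p1[0] - p4[0]))


def final_ear(pts):
    """Average of the left-eye and right-eye EAR values."""
    return (eye_aspect_ratio(pts, (37, 38, 39, 40, 41, 42))
            + eye_aspect_ratio(pts, (43, 44, 45, 46, 47, 48))) / 2.0


def mouth_aspect_ratio(pts):
    """MAR from points 49, 51, 53, 55, 57 and 59 of the 68-point model."""
    p49, p51, p53, p55, p57, p59 = (pts[i - 1] for i in (49, 51, 53, 55, 57, 59))
    return (abs(p51[1] - p59[1]) + abs(p53[1] - p57[1])) / (2.0 * abs(p49[0] - p55[0]))
```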
The human behavior recognition module adopts a YOLOv5 neural network model to recognize objects appearing in each frame of pictures in the collected images and judges the concentration state of the student according to the recognized objects.
A pre-trained, weight-adjusted YOLOv5 neural network model is adopted to identify the objects in the acquired images; three objects, namely a mobile phone, a cigarette and a cup, are calibrated in advance, and a learning-distraction state is determined once any of these objects is identified.
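A minimal Python sketch of such a distraction check, using the public Ultralytics YOLOv5 interface through torch.hub, might look as follows. The weights file best.pt, the confidence threshold and the class names are illustrative assumptions; the patent only states that a pre-trained, weight-adjusted YOLOv5 model calibrated on the phone, cigarette and cup classes is used.

```python
import torch

# assumed custom weights fine-tuned on the three calibrated objects
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")

DISTRACTION_CLASSES = {"cell phone", "cigarette", "cup"}   # illustrative class names


def is_distracted(frame, conf_thres=0.5):
    """Return True if any calibrated object is detected in the OpenCV (BGR) frame."""
    results = model(frame[..., ::-1])                 # BGR -> RGB for the model
    det = results.pandas().xyxy[0]                    # columns: xmin..ymax, confidence, name
    hits = det[(det["confidence"] >= conf_thres) & (det["name"].isin(DISTRACTION_CLASSES))]
    return len(hits) > 0
```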
And (4) combining the fatigue degree result and the distraction state result according to a certain period to evaluate the concentration degree and give a concentration degree evaluation.
The invention has the advantages that it can effectively detect the attention state of online students in real time and analyze their in-class concentration, which is an important index for measuring the quality of online teaching, helps to improve that quality and better supports the expansion of online teaching. By identifying, analyzing and recording student behavior in real time during online lessons through computer vision, and combining it with cues from the human facial expression, the learning concentration of students in an online classroom can be analyzed, letting teachers and parents understand a student's overall listening situation in a lesson more intuitively and with less time and effort, which is of important reference value for helping teachers improve their education and teaching methods.
Drawings
The contents of the expressions in the various figures of the present specification and the labels in the figures are briefly described as follows:
FIG. 1 is a flow chart of an identification method of the present invention;
FIG. 2 is a schematic diagram of 68 feature point models of human faces in a Dlib library according to the present invention;
FIG. 3 is a network architecture view of the YOLOv5 model of the present invention;
FIG. 4 is a detailed content visualization of the Focus structure according to the present invention;
FIG. 5 is a schematic diagram of the content visualization of the CBL structure;
FIG. 6 is a diagram showing the content visualization of the CSP1_ x structure;
FIG. 7 is a diagram illustrating the detailed content visualization of the SPP structure;
FIG. 8 is a diagram illustrating the content visualization of CSP2_ x;
FIG. 9 is a diagram illustrating the visualization of the specific content of the Res unit structure;
Detailed Description
The following description of preferred embodiments of the present invention will be made in further detail with reference to the accompanying drawings.
This application is mainly directed at identifying and monitoring students' attention in existing online lessons: through image recognition and computer vision, the characteristic state of the student is acquired and used to judge the student's concentration, thereby giving online lessons and network teaching a basis on which teachers can understand student state. The specific scheme is as follows:
the invention provides a student online learning concentration degree analysis system based on a YOLOv5 and a Dlib model, and aims to solve the problems that the behavior and the action of students are easy to ignore in the background and the online learning concentration degree of the students cannot be effectively analyzed, so that the attention state of the students educated online in real time is effectively detected. An online learning concentration degree identification method comprises the following steps:
constructing a human fatigue state detection module and a human behavior recognition module;
identifying the collected face data through a human body fatigue state detection module and detecting the fatigue degree of the human body;
the method comprises the steps that collected image data of online learning of a user are identified through a human behavior identification module, and whether the user is distracted or not in learning is detected;
and judging the concentration degree of the student based on the fatigue degree and the user analysis degree.
Wherein human fatigue state detection module includes:
firstly, indexes of facial marks of a left eye, a right eye and a mouth are respectively obtained, then, graying processing is carried out on a video stream through function classes in an OpenCV software library, position information of the human eyes and the human mouth is detected, then, data including eye opening degree, eye blinking times, mouth opening degree, yawning frequency and the like are calculated, a function for evaluating fatigue degree is constructed, and the function is compared with a set threshold value so as to detect the fatigue degree of the human body.
The human body behavior recognition module adopts a YOLOv5 neural network model to recognize objects appearing in each frame of picture in the collected image and judges the concentration state of the student according to the recognized objects. Aiming at the YOLOv5 neural network, a pre-trained and weight-modified YOLOv5 neural network model is adopted to identify objects in the acquired image, three objects of a mobile phone, a cigarette and a water cup are calibrated in advance, and after any object is identified, the learning distraction state is judged.
The fatigue degree result and the distraction state result are combined over a certain period to evaluate the concentration degree and give a concentration evaluation. Taking 150 frames of images as an example, the user's concentration is judged from the fatigue state and the distraction state obtained over those 150 frames; the two states can be combined with an AND relation or an OR relation. With the OR relation, the user is judged inattentive as soon as either the fatigue state or the distraction state holds, and a prompt of inattention or insufficient concentration is given.
As shown in fig. 1, the VideoCapture() function of OpenCV is first called to read the video frame by frame and obtain each frame image; then fatigue and distraction are detected by the human fatigue detection module and the human behavior recognition module respectively;
the human body fatigue state detection module uses a Dlib to provide a shape _ predictor _68_ face _ ladmarks model to respectively obtain indexes of facial marks of eyes and a mouth, graying processing is carried out on a video stream through function classes in an OpenCV software library to detect position information of the eyes and the mouth of a human body, then data including eye opening degree, eye blinking times, mouth opening degree, yawning frequency and the like are calculated, a function for evaluating fatigue degree is constructed, and the function is compared with a set threshold value to detect the fatigue degree of the human body.
The human behavior recognition module recognizes objects with the weight-adjusted YOLOv5 model so as to judge the student's behavior state, including playing with a mobile phone, smoking, drinking water and the like; if any one or more of these states is detected, the student is judged distracted and a pop-up prompt can be given to remind the student to pay attention.
And analyzing the class concentration degree of the student based on the combination of the fatigue state and the distraction state, and giving a judgment result every 150 frames to judge the learning concentration degree of the student.
The embodiment of the invention provides a system for analyzing students' online learning concentration based on YOLOv5 and the Dlib model. The system is mainly divided into two modules. One is the human fatigue state detection module, based on the 68-feature-point face detection in the Dlib library: it first obtains the indexes of the left-eye, right-eye and mouth facial landmarks, reads the video with the video capture function of OpenCV (OpenCV is a cross-platform computer vision and machine learning software library released under the BSD license), grays the video stream, detects the position information of the eyes and the mouth, then calculates data including the eye opening degree, the number of blinks, the mouth opening degree and the yawning frequency, constructs a function, and sets a threshold to detect the fatigue degree of the human body. The other is the human behavior recognition module, which selects one of the four models trained by YOLOv5, with adjusted weights, to recognize the objects appearing in each frame and records the number of frames in which they appear, so as to judge whether the student performs a distracting action and how long it lasts. Finally the two modules are combined to give the student's fatigue state data and distraction state data, from which the student's in-class concentration can be judged. The human fatigue state detection module comprises an eye detection step, a mouth detection step and a fatigue judgment step, wherein the eye detection step comprises:
calculating EAR values of the left eye and the right eye, wherein the EAR values are the length-width ratio of the eyes, and the average value of the left eye and the right eye is used as the final EAR value; if the EAR values of two continuous frames are smaller than the set threshold value, indicating that one blinking activity is performed; counting the frame number Rolleye meeting the eye closing characteristic in the total frame number Roll of the set video stream;
wherein the mouth detection step comprises:
calculating the MAR value of the mouth, wherein the MAR value is the aspect ratio of the mouth; determining that a yawn has occurred when the MAR values of the mouth in two continuous frame images are larger than the set threshold, and counting the number of frames Rollmouth meeting the yawning characteristic within the set total number of frames Roll of the video stream;
the fatigue judging step comprises: the fatigue degree is judged based on the Rolleye value detected in the eye detecting step and the Rollmouth value in the mouth detecting step.
The human body fatigue state detection module and the human body behavior recognition module recognize the learning state of the student through the deep learning neural network, and the detection effect of the learning state of the student is output once every 150 frames of pictures.
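The following Python skeleton sketches this per-150-frame flow: OpenCV's VideoCapture reads the camera frame by frame, each frame is passed to the two modules, and a verdict is printed once per window. The function fatigue_update() is a placeholder for the Dlib-based per-frame eye and mouth checks detailed below, is_distracted() refers to the YOLOv5 sketch given earlier, and the 0.38 threshold is again an assumed example value.

```python
import cv2

ROLL = 150                      # evaluation window in frames
PERCLOS_THRESHOLD = 0.38        # assumed value of the fatigue threshold P1


def fatigue_update(gray):
    """Placeholder for the Dlib-based per-frame check: (eye closed?, yawning?)."""
    return False, False


cap = cv2.VideoCapture(0)       # read the camera frame by frame
frame_idx = rolleye = rollmouth = distracted_frames = 0

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    eye_closed, yawning = fatigue_update(gray)   # human fatigue detection module
    distracted = is_distracted(frame)            # YOLOv5 module from the earlier sketch

    rolleye += int(eye_closed)
    rollmouth += int(yawning)
    distracted_frames += int(distracted)
    frame_idx += 1

    if frame_idx == ROLL:                        # one verdict per 150-frame window
        fatigued = (rolleye + rollmouth) / ROLL > PERCLOS_THRESHOLD
        print("inattentive" if fatigued or distracted_frames else "attentive")
        frame_idx = rolleye = rollmouth = distracted_frames = 0

cap.release()
```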
The human fatigue detection module (for detecting eye-related data) comprises the following steps:
step S1: get _ front _ face _ detector () using the key function Dlib in the Dlib library to get the face position detector, which is the face detection algorithm built in Dlib, and detect the boundaries (bases) of the face region using HOG pyramid.
Step S2: a key function Dlib, shape _ predictor (predictor path) in the Dlib library is used to obtain a face feature position detector, which is an algorithm used by Dlib to detect feature points in a region, and outputs coordinates of the feature points, and a pre-trained model (which is transmitted by a file path method and can be pre-trained according to requirements) is required to work normally.
Step S3: and opening the camera to obtain the facial image of the student, circulating from the video stream, reading the picture, performing dimension expansion on the picture, and graying.
Step S4: face position detection is performed using dlib.get _ front _ face _ detector (gray,0), face position information is circulated, and information of the face feature position is obtained using dlib.shape _ predictor (gray, rect).
Step S5: extracting frame images to detect human faces, and roughly positioning eyes to segment skin colors. The positions of human eyes can be better positioned through segmentation, the human eyes and the skin are confirmed by the fact that the gray levels of the human eyes and the skin are inconsistent after the gray levels of the images are achieved, the human eyes and the skin are represented by the fact that data jumps are seen in the image processing process, and therefore segmentation is conducted according to points of data treaties.
Step S6: converting the facial feature information into a format of an array, extracting coordinates of a left eye and a right eye, and accurately positioning the eyes; the array is used for storing and storing eye coordinates and mouth coordinates according to a data format, extracting left eye coordinates and right eye coordinates, accurately positioning eyes, roughly positioning to determine the outline of each part, and then more conveniently realizing accurate positioning to directly determine the coordinate position; coarse positioning: the outline of the mouth is roughly drawn according to the difference between the mouth color and the skin color, and the principle is that after the image is grayed, the depth can be directly judged to draw the outline because the different colors have different depths. The purpose is as follows: better positioning the position of the human mouth, dividing the skin color: the color of the mouth and the skin are different, and the jump of data can be seen in the image processing process, so that the segmentation is carried out; the array comprises position coordinate information of 68 human face characteristic points, and each position can be accurately positioned.
Step S7: the constructor calculates EAR (eye Aspect ratio) values for the left and right eyes, i.e., eye Aspect ratio values, and uses the average of both eyes as the final EAR value. If the EAR values of two consecutive frames are less than the set threshold, it indicates that a blinking activity has been performed. Counting the closed-eye feature count +1, when the count exceeds the threshold and the closed-eye feature of the next frame disappears, saving the count to Rolleye, wherein Rolleye is the number of frames which are consistent with closed-eye, then returning count clear 0 to step S5, otherwise, directly returning to step S5 to check the next frame.
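A minimal Python sketch of steps S1 to S4, using the real Dlib and OpenCV interfaces, is shown below; the path to the 68-point landmark model is the conventional Dlib file name, the helper landmarks_to_points() is illustrative, and final_ear() refers to the aspect-ratio sketch given earlier.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()                                # step S1
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # step S2


def landmarks_to_points(shape):
    """Convert a Dlib full_object_detection into a list of 68 (x, y) tuples."""
    return [(shape.part(i).x, shape.part(i).y) for i in range(68)]


cap = cv2.VideoCapture(0)                                                  # step S3
ok, frame = cap.read()
if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for rect in detector(gray, 0):                                         # step S4
        pts = landmarks_to_points(predictor(gray, rect))
        print("EAR of this face:", final_ear(pts))                         # step S7
cap.release()
```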
Blinking is an unconscious biological feature in normal healthy people with blinking intervals of 2s to 10s and eye closure duration of 100ms to 400 ms. The eye blinking frequency of a healthy person is used as a reference, and the attention state of a single student is judged by calculating the eye blinking frequency of the student, so that the fatigue state of the student is judged. The clear expression of inattention is frequent blinking (blinking interval becomes short) and long-time eye closure (eyelid closure time becomes long), and when frequent blinking and/or long-time eye closure occur to the student, the student can be judged to be inattention and fatigued.
Eye Aspect Ratio, EAR, is calculated. When the human eye is open, the EAR fluctuates above and below a certain value, and when the human eye is closed, the EAR drops rapidly, theoretically approaching zero, so we consider that when the EAR is below a certain threshold, the eye is in a closed state. To detect the number of blinks, a threshold of consecutive frames for the same blink is set. The blinking speed is relatively fast, and the blinking action is completed by 1-2 frames generally. Both thresholds are set according to the actual situation.
When the absolute value of the difference between the two-eye aspect ratio (EAR) in the current frame and that in the previous frame is greater than 0.2, the current frame is considered to be in an eye-closing state (fig. 2 is the 68-point facial feature point diagram, from which it can be seen that points 37-42 are the left eye and points 43-48 the right eye); the position coordinates of points 37-42 and 43-48 can be calculated by the model, and the opening degree can be calculated from these coordinates.
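One possible Python rendering of the closed-eye bookkeeping of step S7 is sketched below. Treating 0.2 as the closed-eye EAR threshold and 2 frames as the consecutive-frame threshold are illustrative assumptions (the text says both thresholds are set according to the actual situation), and the accumulation of count into Rolleye is one reading of the step-S7 description.

```python
EAR_THRESHOLD = 0.2      # below this value the eye is treated as closed (assumed)
EAR_CONSEC_FRAMES = 2    # a blink is completed within 1-2 frames (assumed)


class EyeCounter:
    """Counts blinks and accumulates Rolleye over the evaluation window."""

    def __init__(self):
        self.count = 0       # consecutive closed-eye frames
        self.blinks = 0
        self.rolleye = 0     # frames matching the closed-eye characteristic

    def update(self, ear):
        if ear < EAR_THRESHOLD:
            self.count += 1
        else:
            if self.count >= EAR_CONSEC_FRAMES:
                self.blinks += 1             # one blink completed
                self.rolleye += self.count   # save the closed-eye frame count
            self.count = 0                   # clear and check the next frame
```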
The left eye openness can be obtained by the following formula (the same applies to the right eye):
$$\mathrm{EAR} = \frac{\left|P_{38}.y - P_{42}.y\right| + \left|P_{39}.y - P_{41}.y\right|}{2\left|P_{37}.x - P_{40}.x\right|}$$

The eye opening degree is calculated from the ordinates of points 38, 39, 42 and 41 (P38.y, P39.y, P42.y, P41.y) and the abscissas of points 37 and 40 (P37.x, P40.x). Whether the eyes are open or closed is determined by a threshold; the ratio of the current value to an initial value may also be used as the opening degree and compared by the degree of difference. The maximum eye-closing time (which can be replaced by a number of frames) is measured between the moment the opening degree goes from large to small (entering the eye-closing period) and the moment it goes from small to large (entering the eye-opening period), and the eye-closing frequency is the frequency of entering the closed-eye and open-eye states.
The human body fatigue detection module (detecting mouth-related data) comprises the following steps:
step S8: extracting a frame image to detect a face, and roughly positioning a mouth to segment the skin color;
step S9: converting the facial feature information into a format of an array, extracting mouth position coordinates, and accurately positioning a mouth;
step S10: similar to the blink detection, the constructor calculates a MAR (motion estimate ratio) value, and if the MAR values of two continuous frames are larger than a set threshold value, the yawning is considered to be performed. Counting the yawning feature count as count +1, when the count exceeds the threshold and the yawning feature of the next frame disappears, saving the count to the Rollmouth, wherein the Rollmouth is the frame number which meets the yawning, then returning the count clear 0 to the step S8, otherwise, directly returning to the step S8 to check the next frame.
Yawning is also an instinct response of the human body, and it is like heartbeat and respiration and is not controlled by the will of the human body. When people are sleepy, the yawning action can be realized, the time for yawning by normal people is about 6 seconds, and when the yawning occurs frequently, the students can be judged to be in a fatigue state because of inattention.
The Mouth Aspect Ratio Mouth Aspect Ratio, MAR, was calculated. The MAR fluctuates above and below a certain value when the mouth is open, and theoretically approaches zero when the mouth is closed. We consider the eye to be in an open state when the MAR is above a certain threshold. In order to detect the frequency of the yawning, a threshold value of the continuous frame number of the same yawning needs to be set. The yawning speed is slow, so both thresholds are set according to actual conditions.
The mouth openness is calculated by a method similar to the eye according to the following formula:
$$\mathrm{MAR} = \frac{\left|P_{51}.y - P_{59}.y\right| + \left|P_{53}.y - P_{57}.y\right|}{2\left|P_{49}.x - P_{55}.x\right|}$$

From the 68-point facial feature point diagram of fig. 2, the mouth openness used for yawning detection is calculated from the ordinates of points 51, 53, 57 and 59 at the mouth (P51.y, P53.y, P57.y, P59.y) and the abscissas of points 49 and 55 (P49.x, P55.x). Whether the mouth is open is judged from the mouth openness, and the mouth-open time (which can be replaced by a number of frames) is measured to determine whether the person is yawning; with a reasonable threshold, obtained through extensive experiments, yawning can be distinguished from normal speaking or humming.
Step S12: the PERCLOS value of the student over the past 150 frames is calculated from the Rollmouth and Rolleye values obtained in the above steps; the function for calculating the PERCLOS value is constructed as follows, where Roll is the total number of frames of the video stream being judged, set to 150. The obtained PERCLOS value is compared with the set threshold: if it is greater than the threshold, the student is judged to be in a fatigue state and the system gives a reminder; if it is less than the threshold, the student is judged to be awake.

$$\mathrm{PERCLOS} = \frac{\mathrm{Rolleye} + \mathrm{Rollmouth}}{\mathrm{Roll}} \times 100\%$$
The human behavior recognition module uses the YOLOv5 neural network model for recognition. As shown in FIGS. 3-9, the YOLOv5 model can be divided into four modules: the first module is the picture input end, the second module, Backbone, is the main part of the model, the third module, Neck, is the model's enhanced feature extraction network, and the fourth module, Prediction, is the model's prediction output end.
The main idea is that the obtained picture is divided into grids of three sizes, namely large, medium and small, and the grid cell whose area contains the center point of an object is responsible for detecting that object, so that objects of large, medium and small sizes can all be detected.
YOLOv5 judges the student's behavior by detecting the objects appearing in each frame captured by the camera: for example, when a mobile phone is detected in the picture it is judged that the student is playing with the phone, and when a water cup is detected it is judged that the student is drinking water. Because a computer camera generally only captures the student above the chest, the desktop is not photographed, which reduces the influence of objects lying on the desktop on the judgment of the system model.
The procedure for the Backbone (Backbone) part was: (see FIG. 3 for details)
Step S13: reading pictures from a video stream, inputting a Yolov5 network model, and adopting a mode of conducting Mosaic data enhancement on the pictures at the input end of Yolov 5. Namely, 4 pictures are randomly used, randomly scaled and then randomly distributed for splicing, so that the detection data set is greatly enriched, and particularly, many small targets are added by random scaling, so that the robustness of the network is better.
Step S14: YOLOv5 performs adaptive scaling on the input picture: it calculates the scaling ratio and the scaled size, calculates the black-border padding needed to fill the picture, and converts the picture size to 608 × 608 × 3.
Step S15: the picture enters a backbone part of a YOLOv5 network structure, an original 608 × 608 × 3 image is input into a Focus structure, and is changed into a feature map of 304 × 304 × 12 by adopting a slicing operation, and is finally changed into a feature map of 304 × 304 × 64 by performing a convolution operation of 64 convolution kernels. (see FIG. 4 for details)
Step S16: the obtained picture information was input to the convolution layer, and feature extraction was performed by the convolution layer of 128 convolution kernels at a time, thereby obtaining a 152 × 152 × 128 feature map.
The CSP1_X structure in YOLOv5 (see FIGS. 6 and 9 for details) is applied to the Backbone network, and the CSP2_X structure (see FIG. 8 for details) is applied to the Neck enhanced feature extraction structure.
Step S17: the obtained feature map passes through the CSP1_1 network structure, i.e. a cross-stage partial network: the feature map is split into two parts, one part undergoes a convolution operation, and the other part is tensor-concatenated with the result of that convolution. This network structure alleviates the previously heavy inference computation; by integrating the gradient changes into the feature map from end to end, it enhances the learning capability of the CNN, keeps accuracy while reducing weight, and lowers the computational bottleneck and memory cost.
Step S18: the obtained feature map was passed through one convolution layer (see fig. 5 for details), and a feature map of 76 × 76 × 256 was obtained.
Step S19: the obtained feature map passes through the CSP1_3 network structure; the feature map obtained at this point is denoted T1.
Step S20: the obtained feature map was passed through the convolution layer once, and a feature map of 38 × 38 × 512 was obtained.
Step S21: the obtained feature map passes through the CSP1_3 network structure; the feature map obtained at this point is denoted T2.
Step S22: the obtained feature map was passed through the convolution layer once, and a feature map of 19 × 19 × 1024 was obtained at this time.
Step S23: the obtained feature map is processed by an SPP structure using four max-pooling operations of different scales, with pooling kernel sizes of 13×13, 9×9, 5×5 and 1×1 (1×1 means no processing), and the resulting feature maps of the different scales are tensor-concatenated. (see FIG. 7 for details)
Here the max pooling uses a moving step size of 1 together with padding; for example, when a 13 × 13 input feature map is pooled with a 5 × 5 pooling kernel and a padding of 2, the pooled feature map is still 13 × 13.
Compared with a pure k × k max-pooling approach, the SPP module effectively increases the receptive range of the trunk features and clearly separates out the most important context features.
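To make the slicing of step S15 and the pooling of step S23 concrete, the following PyTorch sketch shows a Focus block and an SPP block with 13/9/5 max-pooling kernels (stride 1, padding k // 2) followed by tensor concatenation. The channel counts and layer arrangement are simplified assumptions and do not reproduce the exact YOLOv5 implementation.

```python
import torch
import torch.nn as nn


class Focus(nn.Module):
    """Slice a 608x608x3 image into a 304x304x12 tensor, then convolve (step S15)."""

    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch * 4, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        # take every second pixel in four phase-shifted patterns and stack the channels
        sliced = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                            x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(sliced)


class SPP(nn.Module):
    """Max-pool the feature map at three scales and concatenate with the input (step S23)."""

    def __init__(self):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in (5, 9, 13))

    def forward(self, x):
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)


img = torch.randn(1, 3, 608, 608)
feat = Focus()(img)           # -> (1, 64, 304, 304)
print(SPP()(feat).shape)      # -> (1, 256, 304, 304): the channel count is multiplied by 4
```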
The steps of the enhanced feature extraction network (Neck) part are as follows: (see FIG. 3 for details)
Step S24: the obtained feature map passes through a CSP2_1 network structure and then a convolution operation in one convolution layer; a 19 × 19 × 512 feature map is obtained and denoted T3.
Step S25: the obtained feature map is up-sampled, that is, the original image is enlarged, and in this case, a feature map of 38 × 38 × 512 is obtained.
Step S26: and carrying out tensor splicing operation on the obtained characteristic graph and the T2 characteristic graph. At this time, a 38 × 38 × 1024 characteristic map was obtained.
Step S27: the obtained feature map is passed through a CSP2_1 network structure, and then a convolution operation is performed on the convolution layer once, so that a feature map of 38 × 38 × 256 is obtained, and is denoted as T4.
Step S28: the obtained feature map is up-sampled, that is, the original image is enlarged, and in this case, a feature map of 76 × 76 × 256 is obtained.
Step S29: and carrying out tensor splicing operation on the obtained characteristic graph and the T1 characteristic graph. At this time, a characteristic map of 76 × 76 × 512 was obtained.
Step S30: when the obtained feature map is passed through the CSP2_1 network structure, a feature map of 76 × 76 × 256 is obtained, and this feature map is denoted as T5.
Step S31: and (4) carrying out tensor splicing operation on the T5 feature map and the T4 feature map by passing the T5 feature map through a convolution layer for one time.
Step S32: when the obtained feature map is passed through the CSP2_1 network structure, a feature map of 38 × 38 × 512 is obtained, and this feature map is denoted as T6.
Step S33: and (4) carrying out tensor splicing operation on the T6 feature map and the T3 feature map by passing the T6 feature map through a convolution layer for one time.
Step S34: when the obtained profile is subjected to the CSP2_1 network structure, a profile of 19 × 19 × 1024 is obtained, and is denoted as T7.
At this point, in the feature utilization part, YOLOv5 extracts multiple feature layers for object detection, three in total, whose shapes are (76, 76, 256), (38, 38, 512) and (19, 19, 1024), corresponding to the middle, lower and bottom layers respectively.
The step of predicting the output end (prediction) part is as follows: (see FIG. 3 for details)
Step S35: the obtained T5, T6 and T7 feature maps are input into convolution layers to obtain the prediction results of the three feature layers, whose shape data are (N, 19, 19, 255), (N, 38, 38, 255) and (N, 76, 76, 255), corresponding to the positions of 3 prediction boxes for each cell of the picture divided into 19 × 19, 38 × 38 and 76 × 76 grids.
Step S36: the three feature layers of YOLOv5 divide the whole picture into grids of 19 × 19, 38 × 38 and 76 × 76 respectively, and each grid cell is responsible for detection in its own region. Knowing that the prediction results of a feature layer correspond to the positions of three prediction boxes, we first reshape them, obtaining (N, 19, 19, 3, 85), (N, 38, 38, 3, 85) and (N, 76, 76, 3, 85). The 85 in the last dimension comprises 4 + 1 + 80, representing x_offset, y_offset, h and w, the confidence, and the classification result respectively.
The decoding process of YOLOv5 is to add x _ offset and y _ offset corresponding to each grid point, the result of the addition is the center of the prediction frame, and then calculate the length and width of the prediction frame by using the prior frame and the combination of h and w. This results in the position of the entire prediction block.
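A generic illustration of this decoding step in Python is sketched below; it follows the classic YOLO decoding scheme described in the paragraph above rather than the exact arithmetic used inside YOLOv5, and all names are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def decode_box(tx, ty, tw, th, grid_x, grid_y, anchor_w, anchor_h,
               grid_size, img_size=608):
    """Decode one raw prediction into a box centre and size in pixels."""
    stride = img_size / grid_size
    cx = (sigmoid(tx) + grid_x) * stride       # x_offset added to the grid point
    cy = (sigmoid(ty) + grid_y) * stride       # y_offset added to the grid point
    w = np.exp(tw) * anchor_w                  # width from the prior (anchor) box and tw
    h = np.exp(th) * anchor_h                  # height from the prior box and th
    return cx, cy, w, h
```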
Step S37: of course, after the final prediction results are obtained, score sorting and non-maximum suppression screening are carried out. First the boxes and scores of each class whose scores are greater than the threshold are taken out, and then non-maximum suppression is applied using the positions and scores of the boxes, i.e. the most reliable box is selected out of a stack of candidate boxes for the same face.
Non-Maximum Suppression (NMS): the candidate boxes are sorted in descending order of confidence; the first candidate box is compared in turn with each following box by IOU (the intersection of the two candidate boxes divided by their union), and boxes whose IOU is larger than a set threshold (considered to belong to the same face) are discarded. The first candidate box is kept and the procedure is repeated on the remaining candidate boxes until only one box is left and kept. The candidate boxes kept at the end are the result after non-maximum suppression (an illustrative sketch is given after step S38).
Step S38: after the object is detected, a behavior and action prompt corresponding to the object is given to remind students of concentrating the attention and not distracting. And the number of distraction sessions was recorded. The distraction of the student is output every 150 frames.
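The illustrative NumPy sketch of the non-maximum suppression procedure referred to above is given here; it is a generic greedy NMS, not code from the patent, and the IOU threshold of 0.45 is an assumed example value.

```python
import numpy as np


def nms(boxes, scores, iou_thres=0.45):
    """Greedy NMS. boxes: (N, 4) array of [x1, y1, x2, y2]; returns the kept indices."""
    order = scores.argsort()[::-1]             # sort candidate boxes by confidence, descending
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))                    # keep the most confident remaining box
        if order.size == 1:
            break
        rest = order[1:]
        # intersection of box i with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thres]          # discard boxes judged to be the same object
    return keep
```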
And finally, combining the human body fatigue detection module with the human body behavior identification module, and finally identifying the class concentration state of the student in a period of time by utilizing the face condition of the student and the common behavior characteristic combination of the student in class, namely whether the student is tired in the period of time and the distraction times of the student in the period of time.
By adopting the student online learning concentration analysis system based on the YOLOv5 and the Dlib model, the technical advantages are embodied as follows:
the utility model provides a when the net lesson study of going on, discern student's facial condition and action (doze and play cell-phone etc.) in the net lesson study. Firstly, video images of students in class are shot in real time through a camera of a notebook computer in a class, and then a software module is designed and divided into a human body fatigue detection module for identifying blinking and yawning of the students; meanwhile, a behavior and action recognition module which causes the students to be not concentrated on watching courseware when playing mobile phones, drinking water and the like is designed. And finally, designing a deep learning network, combining expressions and common behavior characteristics of the students in class, and finally identifying the state analysis condition of the concentration degree of the students in class.
The system for analyzing the concentration of the online learning of the students based on the YOLOv5 and the Dlib model has the following key points and points to be protected:
1. the human fatigue detection module and the human behavior recognition module of the system are combined to form a result of student learning concentration degree analysis, so that teachers can conveniently evaluate scores of online classes at ordinary times.
2. The detected eye data of the person is combined with the data of the mouth to calculate the fatigue degree of the student.
3. The YOLOv5 is used as a network model of the human behavior recognition module, and the model is high in speed, small in size and high in recognition accuracy.
The letters in the drawings in this application are explained as follows:
and (3) upsampling: enlarging the original image, i.e. any technique that allows the image to be changed to a higher resolution. An interpolation method is generally adopted, i.e. a proper interpolation algorithm is adopted to insert new elements between pixel points on the basis of the original image pixels.
slice: the slicing operation is performed on the originally input image.
Concat: tensor splicing, dimensions will expand. The number of features (number of channels) describing the image itself is increased, while the information under each feature is not increased; transverse or longitudinal spatial overlap.
Conv: a convolution operation.
Bn (batch normalization): batch normalization, similar to normal data normalization, is a way to unify scattered data and is also a way to optimize neural networks.
Leaky ReLU: activation function, computed as:

$$y = \begin{cases} x, & x \ge 0 \\ \alpha x, & x < 0 \end{cases}$$

where α is a small positive slope coefficient.
res unit: and (5) residual error structure. First a convolution operation is performed that compresses the width and height of the incoming feature layer, at which point we can obtain a feature layer, which we name layer. Then we perform a convolution of 1X1 and a convolution of 3X3 again on the feature layer and add this result to layer, at which point we construct the residual structure.
and add: tensor addition does not expand dimensionality. The amount of information describing the features of an image increases, but the dimensions of the image do not themselves increase, but the amount of information per dimension increases.
Maxpool: one maximum pooling operation.
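A minimal PyTorch sketch of the Res unit structure described above is given below; the channel count, the activation and the absence of batch normalization are simplifying assumptions and do not reproduce the exact YOLOv5 block.

```python
import torch
import torch.nn as nn


class ResUnit(nn.Module):
    """layer -> 1x1 convolution -> 3x3 convolution -> add the result back onto layer."""

    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, layer):
        out = self.act(self.conv1(layer))
        out = self.act(self.conv2(out))
        return layer + out   # tensor add: dimensions unchanged, information per dimension grows


print(ResUnit()(torch.randn(1, 64, 38, 38)).shape)   # torch.Size([1, 64, 38, 38])
```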
It is clear that the specific implementation of the invention is not restricted to the above-described embodiments, but that various insubstantial modifications of the inventive process concept and technical solutions are within the scope of protection of the invention.

Claims (9)

1. An online learning concentration degree identification method is characterized in that: the method comprises the following steps:
constructing a human fatigue state detection module and a human behavior recognition module;
identifying the collected face data through a human body fatigue state detection module and detecting the fatigue degree of the human body;
the collected image data of the user online learning is identified through the human behavior identification module, and whether the user is distracted or not is detected.
2. The method of claim 1, wherein the online learning concentration recognition method comprises: the human body fatigue state detection module firstly obtains indexes of facial marks of a left eye, a right eye and a mouth respectively, then carries out graying processing on a video stream through function classes in an OpenCV software library, detects position information of the eyes and the mouth, then calculates data including eye opening degree, eye blinking times, mouth opening degree, yawning frequency and the like, constructs a function for evaluating fatigue degree, and compares the function with a set threshold value so as to detect the fatigue degree of the human body.
3. The method of claim 2, wherein the online learning concentration recognition method comprises: the human fatigue state detection module comprises an eye detection step, a mouth detection step and a fatigue judgment step, wherein the eye detection step comprises:
calculating EAR values of the left eye and the right eye, wherein the EAR values are the length-width ratio of the eyes, and the average value of the left eye and the right eye is used as the final EAR value; if the EAR values of two continuous frames are smaller than the set threshold value, indicating that one blinking activity is performed; counting the frame number Rolleye meeting the eye closing characteristic in the total frame number Roll of the set video stream;
wherein the mouth detection step comprises:
calculating the MAR value of the mouth, wherein the MAR value is the aspect ratio of the mouth; determining that a yawn has occurred when the MAR values of the mouth in two continuous frame images are larger than the set threshold, and counting the number of frames Rollmouth meeting the yawning characteristic within the set total number of frames Roll of the video stream;
the fatigue judging step comprises: and judging the fatigue degree based on the Rolleye value detected in the eye detection step and the Rollmouth value in the mouth detection step.
4. The method of claim 3, wherein the online learning concentration recognition method comprises: the fatigue judging step comprises judging the fatigue degree with a constructed function PERCLOS, wherein

$$\mathrm{PERCLOS} = \frac{\mathrm{Rolleye} + \mathrm{Rollmouth}}{\mathrm{Roll}} \times 100\%$$

a fatigue threshold P1 is set; a fatigue state is determined when the PERCLOS value is larger than P1, and an awake state otherwise.
5. The method of claim 3, wherein the online learning concentration recognition method comprises: calculating the EAR values for the left and right eyes includes:
step 1: opening a camera to obtain a facial image of a student, circulating from a video stream, reading a picture, performing dimension expansion on the picture, and performing graying processing;
and 2, step: pre-training a key function Dlib, get _ front _ face _ detector (), used for detecting boundary bounds of a face region in a Dlib library; and a key function Dlib in the Dlib library, shape _ predictor (predictor path), which is used for obtaining the face feature position detector and outputting the feature point coordinates;
face position detection is performed using dlib.get _ front _ face _ detector (gray,0), face position information is circulated, and information of the face feature position is obtained using dlib.shape _ predictor (gray, rect).
And step 3: extracting a frame image to detect a human face, roughly positioning the eye according to the facial feature position information, and then carrying out skin color segmentation;
and 4, step 4: converting the facial feature information into a format of an array, extracting coordinates of a left eye and a right eye, and accurately positioning the eyes;
and 5: and calculating the EAR values of the left eye and the right eye by the constructor, wherein the EAR value is the average value of the left eye and the right eye and serves as the final EAR value, and the EAR value calculation formula of one eye is as follows:
$$\mathrm{EAR} = \frac{\left|P_{38}.y - P_{42}.y\right| + \left|P_{39}.y - P_{41}.y\right|}{2\left|P_{37}.x - P_{40}.x\right|}$$

wherein P38.y, P39.y, P42.y and P41.y are the ordinates of points 38, 39, 42 and 41 of the 68 facial feature points, and P37.x and P40.x are the abscissas of points 37 and 40;
and respectively calculating EAR values of the two eyes, and averaging the EAR values to obtain a final EAR value.
6. The method of claim 3, wherein the online learning concentration recognition method comprises: calculating the MAR value of the mouth comprises the steps of:
step (1): extracting a frame image to detect a human face, and roughly positioning a mouth part to perform skin color segmentation;
step (2): converting the face characteristic information into the format of array, extracting the position coordinates of the mouth, accurately positioning the mouth,
and (3) calculating a MAR value by a constructor, wherein the MAR value calculation function is as follows:
$$\mathrm{MAR} = \frac{\left|P_{51}.y - P_{59}.y\right| + \left|P_{53}.y - P_{57}.y\right|}{2\left|P_{49}.x - P_{55}.x\right|}$$

wherein P51.y, P53.y, P57.y and P59.y are the ordinates of points 51, 53, 57 and 59 of the 68 facial feature points, and P49.x and P55.x are the abscissas of points 49 and 55.
7. The method of claim 1, wherein the online learning concentration recognition method comprises: the human behavior recognition module adopts a YOLOv5 neural network model to recognize objects appearing in each frame of picture in the collected image and judges the learning distraction state of the student according to the recognized objects.
8. The method of claim 7, wherein the online learning concentration recognition method comprises: a pre-trained, weight-adjusted YOLOv5 neural network model is adopted to identify the objects in the acquired images; three objects, namely a mobile phone, a cigarette and a cup, are calibrated in advance, and a learning-distraction state is determined once any of these objects is identified.
9. An online learning concentration recognition method according to any one of claims 1-8, characterized in that: and (4) combining the fatigue degree result and the distraction state result according to a certain period to evaluate the concentration degree and give a concentration degree evaluation.
CN202210332063.1A 2022-03-30 2022-03-30 Online learning concentration degree identification method Pending CN114708658A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210332063.1A CN114708658A (en) 2022-03-30 2022-03-30 Online learning concentration degree identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210332063.1A CN114708658A (en) 2022-03-30 2022-03-30 Online learning concentration degree identification method

Publications (1)

Publication Number Publication Date
CN114708658A true CN114708658A (en) 2022-07-05

Family

ID=82170892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210332063.1A Pending CN114708658A (en) 2022-03-30 2022-03-30 Online learning concentration degree identification method

Country Status (1)

Country Link
CN (1) CN114708658A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114998975A (en) * 2022-07-15 2022-09-02 电子科技大学成都学院 Foreign language teaching method and device based on big data
CN115908070A (en) * 2023-03-10 2023-04-04 深圳市企鹅网络科技有限公司 Online learning management method and system based on cloud platform
CN115908070B (en) * 2023-03-10 2023-05-09 深圳市企鹅网络科技有限公司 Online learning management method and system based on cloud platform
CN117315536A (en) * 2023-09-25 2023-12-29 南通大学 Online learning concentration monitoring method and system
CN117315536B (en) * 2023-09-25 2024-06-04 南通大学 Online learning concentration monitoring method
CN117670616A (en) * 2023-12-18 2024-03-08 中国矿业大学 Online learning state monitoring method and system based on image recognition and position matching
CN117711069A (en) * 2023-12-26 2024-03-15 广东海洋大学 System and method for recognizing and reminding learning behaviors based on YOLOv8

Similar Documents

Publication Publication Date Title
CN114708658A (en) Online learning concentration degree identification method
CN110889672B (en) Student card punching and class taking state detection system based on deep learning
CN105516280B (en) A kind of Multimodal Learning process state information packed record method
CN112183238B (en) Remote education attention detection method and system
CN109740466A (en) Acquisition methods, the computer readable storage medium of advertisement serving policy
US20170140210A1 (en) Image processing apparatus and image processing method
CN110119672A (en) A kind of embedded fatigue state detection system and method
Indi et al. Detection of malpractice in e-exams by head pose and gaze estimation
CN105224285A (en) Eyes open and-shut mode pick-up unit and method
CN108182409A (en) Biopsy method, device, equipment and storage medium
Huang et al. RF-DCM: Multi-granularity deep convolutional model based on feature recalibration and fusion for driver fatigue detection
CN113657168B (en) Student learning emotion recognition method based on convolutional neural network
CN115205764B (en) Online learning concentration monitoring method, system and medium based on machine vision
CN111783687A (en) Teaching live broadcast method based on artificial intelligence
CN115936944B (en) Virtual teaching management method and device based on artificial intelligence
CN111507227A (en) Multi-student individual segmentation and state autonomous identification method based on deep learning
Khan et al. Human distraction detection from video stream using artificial emotional intelligence
CN113239794B (en) Online learning-oriented learning state automatic identification method
Wang et al. Yolov5 enhanced learning behavior recognition and analysis in smart classroom with multiple students
CN113887386A (en) Fatigue detection method based on multi-feature fusion of deep learning and machine learning
KR102245319B1 (en) System for analysis a concentration of learner
CN110826396B (en) Method and device for detecting eye state in video
US20230290118A1 (en) Automatic classification method and system of teaching videos based on different presentation forms
CN110675312A (en) Image data processing method, image data processing device, computer equipment and storage medium
CN115797829A (en) Online classroom learning state analysis method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination