CN116881853A - Attention assessment method, system, equipment and medium based on multi-mode fusion

Info

Publication number: CN116881853A
Application number: CN202311154786.8A
Authority: CN (China)
Prior art keywords: scoring, model, sub, data, time
Legal status: Granted (Active)
Other languages: Chinese (zh)
Other versions: CN116881853B
Inventors: 胡方扬, 魏彦兆, 唐海波
Assignee: Xiaozhou Technology Co., Ltd.
Events: application filed by Xiaozhou Technology Co., Ltd.; priority to CN202311154786.8A; publication of CN116881853A; application granted; publication of CN116881853B

Classifications

    • G06F18/254: Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256: Fusion techniques of classification results relating to different input data, e.g. multimodal recognition
    • G06F18/213: Feature extraction, e.g. by transforming the feature space; summarisation; mappings, e.g. subspace methods
    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses an attention assessment method, system, equipment and medium based on multi-modal fusion. The technical scheme is as follows: multi-modal data of a user to be evaluated are collected, the multi-modal data comprising a plurality of current modality data that reflect the attention characteristics of the user to be evaluated from different angles; each current modality data in the multi-modal data is input into a corresponding trained scoring sub-model to obtain a corresponding sub-score value; and a comprehensive score value is obtained by linearly weighting each sub-score value with its corresponding accuracy weight and time attenuation weight, wherein each accuracy weight is determined according to the accuracy of the corresponding scoring sub-model, and each time attenuation weight is determined according to the corresponding scoring sub-model and a preset time attenuation function. By comprehensively utilizing multi-modal data, the attention state can be reflected more comprehensively, with higher accuracy than any single modality.

Description

Attention assessment method, system, equipment and medium based on multi-mode fusion
Technical Field
The invention belongs to the technical field of attention assessment, and particularly relates to an attention assessment method, system, equipment and medium based on multi-mode fusion.
Background
Attention has an important impact on the efficiency of a person's work and life. Attention states can generally be divided into two categories: focused and distracted. An inability to stay focused for long periods seriously affects the quality of work and life; therefore, being able to accurately assess an individual's attention state in real time is of great importance for improving attention management.
Current studies on evaluating the attention state are mainly based on physiological signal analysis; representative physiological signals include electroencephalogram (EEG), functional magnetic resonance imaging (fMRI), eye movement signals, facial videos, and the like. EEG is a noninvasive physiological signal and the most widely applied. Traditional EEG-based methods mainly extract time-domain and frequency-domain features, but manually designed features cannot describe the complex dynamics of EEG well. In recent years, deep learning has been applied to EEG analysis: attention-related temporal features can be learned directly from raw EEG signals through recurrent neural networks and the like, giving a marked improvement over traditional feature-engineering methods.
However, EEG signals are susceptible to many factors, and accurate determination of the attention state is difficult to achieve with EEG alone. Besides EEG, eye movements and facial videos also contain attention-state information: eye movements reflect the focus of the line of sight, and facial expression changes are associated with mental states. Existing systems such as FaceReader can analyze facial expressions to judge psychological states, but most perform single-modality analysis without effective fusion of multi-modal physiological signals, so their accuracy is inconsistent. How to fuse multi-modal physiological signals and improve the accuracy and reliability of attention-state judgment is therefore a current technical problem.
Disclosure of Invention
The invention aims to provide an attention assessment method, system, equipment and medium based on multi-modal fusion which comprehensively utilize multi-modal data, can reflect attention states more comprehensively, and achieve higher accuracy than any single modality.
The first aspect of the invention discloses an attention assessment method based on multi-modal fusion, comprising the following steps:
collecting multi-modal data of a user to be evaluated, wherein the multi-modal data comprises: a plurality of current modality data reflecting the attention characteristics of the user to be evaluated from different angles;
inputting each current mode data in the multi-mode data into a corresponding trained scoring sub-model to obtain a corresponding sub-scoring value;
and carrying out linear weighting according to each sub-score value, the accuracy weight corresponding to each sub-score value and the time attenuation weight corresponding to each sub-score value to obtain a comprehensive score value, wherein each accuracy weight is determined according to the accuracy of the corresponding score sub-model, and each time attenuation weight is determined according to each corresponding score sub-model and a preset time attenuation function.
Optionally, the time decay function is:

$$w_i = e^{-\lambda_i t}$$

wherein n is the number of scoring sub-models, i is a positive integer with 0 < i ≤ n, t is the current time, λ_i is the optimal time attenuation coefficient corresponding to scoring sub-model i, and w_i is the time attenuation weight corresponding to scoring sub-model i;
the method for determining the optimal time attenuation coefficient comprises the following steps:
defining a time attenuation coefficient for each scoring sub-model;
acquiring a plurality of training data corresponding to each scoring sub-model, and labeling all the training data with real labels;
inputting each training data into a corresponding scoring sub-model to obtain a corresponding prediction score, and comparing each prediction score with a corresponding real label to obtain a corresponding comparison result;
and accumulating and summing all comparison results corresponding to each scoring sub-model to obtain a loss function, updating the corresponding time attenuation coefficient through back propagation of the loss function, and obtaining the corresponding optimal time attenuation coefficient after multiple iterations of training.
Optionally, each scoring sub-model is obtained through training of the following steps:
acquiring a plurality of historical modality data of the user to be evaluated corresponding to each scoring sub-model;
judging whether the sum of the number of the historical modality data and the current modality data corresponding to each scoring sub-model is smaller than a preset threshold value;
if yes, loading each pre-trained general scoring model, selecting a first preset percentage of modality data from all the historical modality data and current modality data corresponding to each general scoring model as a corresponding fine-tuning data set, fixing the parameters of all layers except the last layer of each general scoring model, training the last layer with the corresponding fine-tuning data set, and updating the weights of the last layer to obtain the corresponding scoring sub-model;
if not, loading each pre-trained general scoring model, selecting a second preset percentage of modality data from all the historical modality data and current modality data corresponding to each general scoring model as a training set and the remaining modality data as a verification set, training the corresponding general scoring model with the training set, testing the trained general scoring model with the corresponding verification set every preset number of training epochs until the loss function on the corresponding verification set converges, and taking the trained general scoring model as the corresponding scoring sub-model.
Optionally, determining each accuracy weight according to the accuracy of the corresponding scoring sub-model includes:
Extracting a corresponding feature set from each current mode data of the user to be evaluated;
calculating the degree of differentiation of the corresponding feature set on each scoring sub-model;
determining the accuracy weight of each scoring sub-model according to an accuracy weight calculation formula, wherein the accuracy weight calculation formula is:

$$\alpha_i = \mathrm{acc}_i \cdot \bigl(1 + \gamma\,(d_i - \bar{d})\bigr)$$

wherein n is the number of scoring sub-models, i is a positive integer with 0 < i ≤ n, acc_i is the accuracy corresponding to scoring sub-model i, d_i is the degree of differentiation corresponding to scoring sub-model i, d̄ is the average degree of differentiation, γ is the weight adjustment factor, and α_i is the accuracy weight corresponding to scoring sub-model i.
Optionally, the method further comprises:
judging whether the comprehensive score value is lower than a preset score value or not;
if yes, judging whether the comprehensive score value rises above the preset score value within the confirmation time window; if it does, a short-term attention dip is determined; if it does not, a genuine low-attention state is determined and a warning is issued;
if not, the normal attention state is determined.
Optionally, the method for determining the confirmation time window includes:
collecting a plurality of historical comprehensive score value sequences, wherein each historical comprehensive score value sequence contains a preset number of comprehensive score values;
Calculating the mean value and standard deviation of each historical comprehensive scoring value sequence;
setting a candidate time sequence, wherein the candidate time sequence comprises a plurality of time windows;
counting, for each time window, the proportion of true low-attention states that are correctly predicted out of all true low-attention states across all the historical comprehensive score value sequences, to obtain a corresponding recall rate, and taking the time window with the highest recall rate as a first window;
calculating a first average value as the mean of the means of all the historical comprehensive score value sequences, calculating a second average value as the mean of the standard deviations of all the historical comprehensive score value sequences, and calculating a second window from the first average value and the second average value;
and testing the recall rates of the first window and the second window on the historical comprehensive score value sequence, judging whether the recall rate of the second window is higher than that of the first window, if so, taking the second window as a confirmation time window, and if not, taking the first window as the confirmation time window.
Optionally, the multi-modal data includes any combination of electroencephalogram data, eye movement data, and facial video data.
The second aspect of the invention discloses an attention assessment system based on multi-modal fusion, comprising:
The data acquisition module is used for acquiring multi-mode data of a user to be evaluated, wherein the multi-mode data comprises: a plurality of current modality data reflecting the attention characteristics of the user to be evaluated from different angles;
the scoring sub-module is used for inputting each current mode data in the multi-mode data into a corresponding trained scoring sub-model to obtain a corresponding sub-scoring value;
and the comprehensive scoring module is used for carrying out linear weighting according to each sub-scoring value, the accuracy weight corresponding to each sub-scoring value and the time attenuation weight corresponding to each sub-scoring value to obtain a comprehensive scoring value, wherein each accuracy weight is determined according to the accuracy of the corresponding scoring sub-model, and each time attenuation weight is determined according to each corresponding scoring sub-model and a preset time attenuation function.
In a third aspect the invention discloses a computer device comprising a memory storing a computer program and a processor that implements the steps of the method described above when executing the computer program.
A fourth aspect of the invention discloses a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method described above.
The technical scheme provided by the invention has the following advantages and effects. By comprehensively utilizing multi-modal data, the attention state can be reflected more comprehensively, with higher accuracy than any single modality. The time attenuation weight of each scoring sub-model is determined by the scoring sub-model together with a preset time attenuation function, so that the most recent data has a greater influence on the attention assessment; at the same time this simulates the decay of attention over time, matches the cognitive characteristic that attention cannot remain continuously stable, and enables real-time, highly sensitive monitoring of attention fluctuations. Training the last layer of a general scoring model with a fine-tuning data set yields the corresponding scoring sub-model, a model adjusted to the individual data of the user to be evaluated. As multi-modal data of the user to be evaluated continue to be collected, the user's historical modality data gradually increase until the user becomes an established user; the corresponding pre-trained general scoring model is then loaded with randomly initialized network weights (the pre-training weights are not loaded) and trained with the training set, and the model parameters at the point where the training loss stabilizes are taken as the corresponding scoring sub-model, realizing individualized modeling of the user to be evaluated.
Drawings
FIG. 1 is a flow chart of a method for attention assessment based on multi-modal fusion, disclosed in an embodiment of the invention;
FIG. 2 is a block diagram of a multi-modal fusion-based attention assessment system in accordance with an embodiment of the present invention;
fig. 3 is an internal structural diagram of a computer device disclosed in an embodiment of the present invention.
Detailed Description
In order that the invention may be readily understood, specific embodiments thereof are described in more detail below with reference to the appended drawings.
As used herein, the terms "first" and "second" are used merely to distinguish between names and do not represent a particular number or order unless otherwise specified or defined.
The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items, unless specifically stated or otherwise defined.
The term "fixed" or "connected" as used herein may be directly fixed or connected to an element, or indirectly fixed or connected to an element.
As shown in fig. 1, an embodiment of the present invention discloses a method for evaluating attention based on multi-modal fusion, including:
step S1, acquiring multi-modal data of a user to be evaluated, wherein the multi-modal data comprises: a plurality of current modality data reflecting the attention characteristics of the user to be evaluated from different angles.
The multi-modal data in this embodiment include electroencephalogram data, eye movement data, and facial video data; in other embodiments the multi-modal data may be any two of electroencephalogram data, eye movement data, and facial video data. In practical application, portable electroencephalogram acquisition equipment can be used with a sampling rate of 250 Hz or more to preserve EEG detail, acquiring EEG data from relevant areas such as the frontal and parietal lobes; during acquisition the scalp of the user to be evaluated should be clean and the electrodes in good contact with the skin to ensure signal quality. A non-contact eye tracker using infrared technology at a frequency of 100 Hz or more can be used; the equipment is calibrated during acquisition to ensure accurate pupil position and gaze point information. A high-definition camera with a resolution of 1080p or more can be used to capture the facial details of the user to be evaluated; the imaging angle should cover the whole face to avoid distortion of partial areas, and lighting should be sufficient and uniform, avoiding diffuse reflection that degrades image quality.
And S2, inputting each current mode data in the multi-mode data into a corresponding trained scoring sub-model to obtain a corresponding sub-scoring value.
In this embodiment, after the electroencephalogram data, eye movement data and facial video data are acquired, each must be preprocessed. For example, the electroencephalogram data are filtered with a band-pass filter, typically over 0.5-70 Hz, which effectively removes DC drift and high-frequency noise; common band-pass filters include FIR (Finite Impulse Response) and IIR (Infinite Impulse Response) filters. Noise components caused by electrooculography, electromyography and the like can be separated and removed with methods such as Independent Component Analysis (ICA); in other embodiments, noise reduction can also be achieved with a narrow-band notch filter or similar. The filtered and denoised EEG data are segmented by a time window, which may be 2-10 seconds, into a number of EEG segments, and features are extracted from each segment: statistical features such as the mean, variance, peak value and time-domain waveform energy of each segment are calculated; a fast Fourier transform is applied to each segment to obtain the power spectrum of each frequency band, and the relative energy or power of the main bands, delta (1-3 Hz), theta (4-7 Hz), alpha (8-13 Hz) and beta (14-30 Hz), is extracted; complexity and randomness features of each segment are extracted with methods such as correlation dimension and the largest Lyapunov exponent; a time-frequency map is obtained with methods such as the wavelet transform to describe the energy distribution of each segment over time and frequency; and features such as the correlation and synchrony between EEG segments from different scalp regions are extracted.
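As an illustrative sketch of the EEG preprocessing just described, the following Python fragment applies a 0.5-70 Hz band-pass filter, segments the recording into fixed windows and computes relative band powers; the window length, channel count and SciPy-based implementation are assumptions of this example, not requirements of the embodiment.

```python
# A minimal sketch, assuming SciPy/NumPy, of the EEG preprocessing above:
# 0.5-70 Hz band-pass filtering, fixed-window segmentation, and relative
# band-power features. Window length and channel count are illustrative.
import numpy as np
from scipy.signal import butter, filtfilt, welch

BANDS = {"delta": (1, 3), "theta": (4, 7), "alpha": (8, 13), "beta": (14, 30)}

def bandpass(eeg, fs, lo=0.5, hi=70.0):
    """Zero-phase Butterworth band-pass along the time (last) axis."""
    b, a = butter(4, [lo, hi], btype="band", fs=fs)
    return filtfilt(b, a, eeg, axis=-1)

def segment(eeg, fs, win_s=4.0):
    """Split a (channels, samples) array into non-overlapping windows."""
    step = int(win_s * fs)
    n = eeg.shape[-1] // step
    return eeg[:, : n * step].reshape(eeg.shape[0], n, step)

def band_powers(seg, fs):
    """Relative power per classical EEG band for one (channels, samples) segment."""
    f, pxx = welch(seg, fs=fs, nperseg=min(seg.shape[-1], 512))
    total = pxx.sum(axis=-1)
    return {name: pxx[:, (f >= lo) & (f <= hi)].sum(axis=-1) / total
            for name, (lo, hi) in BANDS.items()}

fs = 250.0                                 # sampling rate from the embodiment
eeg = np.random.randn(8, int(60 * fs))     # stand-in for 60 s of 8-channel EEG
windows = segment(bandpass(eeg, fs), fs)   # (channels, windows, samples)
features = band_powers(windows[:, 0], fs)  # features of the first window
```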
For preprocessing of the eye movement data, a low-pass filter (for example a mean filter or Gaussian filter) can be used for smoothing and denoising. Eye movement calibration is then performed with stimulus reference points, establishing a mapping between eye rotation angle and display coordinates to eliminate errors caused by head movement. In addition, distorted samples caused by eye rotation and eye closure must be detected and removed; velocity and acceleration thresholds can be set to identify them: upper and lower velocity thresholds and upper and lower acceleration thresholds are set, and eye movement data falling outside the velocity bounds or outside the acceleration bounds are removed. The cleaned eye movement data are then divided into segments of fixed duration so that features can be extracted: statistical features such as the number of fixations and fixation durations within each segment are calculated; the scan range, saccade path length and average saccade amplitude of each segment are analyzed; and pupil-diameter change features describing the pupillary response to the stimulus are extracted.
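The eye-movement cleaning step can be sketched as follows; the velocity and acceleration bounds and the fixation-velocity criterion are illustrative assumptions, since the embodiment leaves the concrete thresholds open.

```python
# A hedged sketch of the eye-movement cleaning above: samples whose velocity
# or acceleration falls outside preset bounds (blinks, eye closure) are
# dropped before fixation statistics are computed. All numeric thresholds
# are assumptions for illustration.
import numpy as np

def valid_mask(xy, fs, v_lim=(0.0, 1000.0), a_lim=(0.0, 80000.0)):
    """Boolean mask of valid samples for a (samples, 2) gaze trajectory."""
    vel = np.linalg.norm(np.gradient(xy, axis=0), axis=1) * fs
    acc = np.abs(np.gradient(vel)) * fs
    return (vel >= v_lim[0]) & (vel <= v_lim[1]) & \
           (acc >= a_lim[0]) & (acc <= a_lim[1])

def fixation_stats(xy, fs, v_fix=30.0):
    """Count fixation runs (velocity below v_fix) and their mean duration."""
    vel = np.linalg.norm(np.gradient(xy, axis=0), axis=1) * fs
    fix = np.concatenate(([0], (vel < v_fix).astype(int), [0]))
    edges = np.diff(fix)
    starts, ends = np.where(edges == 1)[0], np.where(edges == -1)[0]
    durs = (ends - starts) / fs
    return {"n_fixations": len(durs),
            "mean_fixation_s": float(durs.mean()) if len(durs) else 0.0}

fs = 100.0                                 # >= 100 Hz tracker from the embodiment
gaze = np.cumsum(np.random.randn(int(10 * fs), 2) * 0.05, axis=0)
stats = fixation_stats(gaze[valid_mask(gaze, fs)], fs)
```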
For preprocessing of the facial video data, the illumination and occlusion of each frame are first checked, and images whose quality does not meet requirements are removed; thresholds for brightness, contrast and integrity can be set. After removing unsatisfactory images, face localization can be performed with Haar, HOG (Histogram of Oriented Gradients) or similar features and the facial region extracted; Haar features include edge, linear, central and diagonal features. The extracted facial image is then rotated, scaled and corrected according to the positions of the eyes, and the corrected image is cropped to the required size to extract local facial regions. Histogram equalization, denoising and similar methods improve facial image quality, realizing image enhancement, after which features are extracted: for example, an expression descriptor based on feature-point displacement can be constructed to represent basic expression categories, thereby extracting expression features; micro-movements are captured with methods such as optical flow to detect the appearance of micro-expressions and extract micro-expression features; three-dimensional head rotation is estimated to judge the direction of attention deviation and extract head posture features; and eye localization is combined with features such as the degree of eye opening to extract eye features.
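A minimal sketch of this facial preprocessing path follows, assuming OpenCV's stock Haar cascade for face localization; the brightness bounds and crop size are illustrative, not values from the patent.

```python
# Quality gating by brightness, Haar-based face localization, cropping, and
# histogram equalization, per the steps described above. Thresholds are
# assumed for illustration.
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess_frame(frame_bgr, size=128):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    if not 40 <= gray.mean() <= 220:     # reject badly lit frames (assumed bounds)
        return None
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                      # no face found: frame unusable
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest detected face
    crop = cv2.resize(gray[y:y + h, x:x + w], (size, size))
    return cv2.equalizeHist(crop)        # histogram equalization before the CNN

frame = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
result = preprocess_frame(frame)         # None here, since noise has no face
```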
The scoring sub-models in this embodiment include an electroencephalogram scoring sub-model, an eye movement scoring sub-model and a face scoring sub-model. The electroencephalogram scoring sub-model uses an LSTM (Long Short-Term Memory) network with a time-step folding layer, a fully connected layer applied at every time step, added after the LSTM so that a sub-score value is obtained for each time step. The preprocessed EEG data are time-series data: input to the sub-model, the LSTM learns the long-term dependency characteristics of the time series, capturing and propagating long-range temporal patterns through its memory cells to obtain temporal features; gating structures such as the input and output gates filter out unimportant information. The temporal features produced by the LSTM are converted by the time-step folding (fully connected) layer into a sub-score value for the EEG data at each time point. The eye movement scoring sub-model uses a convolutional neural network: the preprocessed eye movement data are usually a two-dimensional gaze-trajectory image or heat map, from which the network extracts local features such as precise eyeball positions and fixation clusters; the convolution kernels learn features automatically by sliding over different positions and can detect fine changes in the gaze trajectory, and after the convolutional layers extract local features, a fully connected layer outputs the sub-score value corresponding to the eye movement data. The face scoring sub-model also uses a convolutional neural network: the preprocessed facial video data are usually a sequence of facial images; facial expression features are extracted through convolutional layers, with local connections attending to regions of interest, for example concentrating on key regions such as the eyes and mouth, so that micro-expression features are learned; after the convolutional layers extract the facial features, a fully connected layer converts them into the sub-score value corresponding to the facial video data.
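The EEG scoring sub-model described above can be sketched in PyTorch as follows; the feature dimension and hidden size are illustrative assumptions, and the per-time-step fully connected head plays the role of the time-step folding layer.

```python
# A hedged sketch of the EEG scoring sub-model: an LSTM over the EEG feature
# sequence, followed by a fully connected layer applied at every time step
# that emits one sub-score per time point. Sizes are assumptions.
import torch
import torch.nn as nn

class EEGScoringSubModel(nn.Module):
    def __init__(self, n_features=32, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)   # shared across time steps

    def forward(self, x):
        # x: (batch, time, features) -> per-time-step scores in (0, 1)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.head(h)).squeeze(-1)

model = EEGScoringSubModel()
scores = model(torch.randn(4, 100, 32))    # (batch=4, time=100) sub-scores
```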
Specifically, each scoring sub-model is obtained through training by the following steps:
acquiring a plurality of historical modality data of the user to be evaluated corresponding to each scoring sub-model;
judging whether the sum of the number of the historical modality data and the current modality data corresponding to each scoring sub-model is smaller than a preset threshold value;
if yes, loading each pre-trained general scoring model, selecting a first preset percentage of modality data from all the historical modality data and current modality data corresponding to each general scoring model as a corresponding fine-tuning data set, fixing the parameters of all layers except the last layer of each general scoring model, training the last layer with the corresponding fine-tuning data set, and updating the weights of the last layer to obtain the corresponding scoring sub-model;
if not, loading each pre-trained general scoring model, selecting a second preset percentage of modality data from all the historical modality data and current modality data corresponding to each general scoring model as a training set and the remaining modality data as a verification set, training the corresponding general scoring model with the training set, testing the trained general scoring model with the corresponding verification set every preset number of training epochs until the loss function on the corresponding verification set converges, and taking the trained general scoring model as the corresponding scoring sub-model.
In this embodiment the preset threshold may be set to 200 groups. When the user to be evaluated is a new user, no historical modality data exist, and the amount of collected current modality data is generally below the preset threshold, so each pre-trained general scoring model must be loaded. For electroencephalogram data, the LSTM model corresponding to EEG is trained with preset training samples using a prior-art training method, the preset training samples being EEG data of other users collected in advance, to obtain the trained general scoring model corresponding to EEG data. For eye movement data, the convolutional neural network corresponding to eye movement is likewise trained with preset training samples (eye movement data of other users collected in advance) to obtain the corresponding general scoring model. For facial video data, the convolutional neural network corresponding to facial video is trained with preset training samples (facial video data of other users collected in advance) to obtain the corresponding general scoring model. After the trained general scoring model for each current modality is obtained, fine tuning is performed. The first preset percentage is 10%-20%; for example, if 50 groups of current modality data of the user to be evaluated have been collected, 10 groups (a first preset percentage of 20%) are randomly selected as the corresponding fine-tuning data set. The corresponding general scoring model is trained with this set: specifically, the parameters of all layers except the last layer (the fully connected layer), such as the convolutional layers and folding layer, are fixed, the initial learning rate is set to 0.00001, and the number of training epochs is set to 5; the small learning rate and small number of epochs avoid overfitting. After 5 epochs with the fine-tuning set, the weights of the last layer are updated and the corresponding scoring sub-model is obtained, a model adjusted to the individual data of the user to be evaluated. As multi-modal data of the user continue to be collected, the historical modality data gradually increase; during this growth, a first preset percentage of modality data is selected from the historical and current modality data as the fine-tuning set, so that all general scoring models are trained continuously and model parameters are adjusted dynamically. When the sum of the historical and current modality data corresponding to each scoring sub-model is no longer below the preset threshold, the user to be evaluated has become an established user. The second preset percentage is 80%-90%; for example, with 600 groups of data, 80% (480 groups) serve as the training set and 20% (120 groups) as the verification set. The corresponding pre-trained general scoring model architecture is loaded with randomly initialized network weights (the pre-training weights are not loaded), a smaller learning rate such as 0.001 is set, and the model is trained for 100 epochs, testing it with the corresponding verification set every 10 epochs and recording the verification loss of each test. If the loss does not decrease for 5 consecutive evaluations, the verification loss is deemed converged, training is ended early, and the model parameters at the point where the training loss has stabilized are taken, obtaining the corresponding scoring sub-model and realizing individualized modeling of the user to be evaluated.
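The cold-start fine-tuning branch can be sketched as follows, assuming a stand-in architecture for the pre-trained general scoring model; only the learning rate (0.00001) and the 5 training epochs come from the embodiment.

```python
# A minimal PyTorch sketch: all layers of a pre-trained general scoring model
# are frozen except the final fully connected layer, which is fine-tuned on
# the user's small data set. Architecture and tensors are assumptions.
import torch
import torch.nn as nn

general = nn.Sequential(            # stand-in for a pre-trained general model
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 1),               # the "last layer" to be personalized
)
for p in general.parameters():      # freeze every layer...
    p.requires_grad = False
for p in general[-1].parameters():  # ...then unfreeze only the last one
    p.requires_grad = True

opt = torch.optim.Adam(general[-1].parameters(), lr=1e-5)
loss_fn = nn.MSELoss()
x, y = torch.randn(10, 32), torch.rand(10, 1)   # the ~10-group fine-tuning set

for epoch in range(5):              # 5 epochs per the embodiment
    opt.zero_grad()
    loss = loss_fn(general(x), y)
    loss.backward()
    opt.step()
```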
In this embodiment, after the scoring sub-model corresponding to each current modality is obtained, it can be further fine-tuned with that modality's current data to enhance generalization. The performance of each scoring sub-model on its current modality data is fed back to the corresponding general scoring model as a supervision signal for optimizing it, and this process is repeated: each current modality data set is used continuously to jointly optimize the corresponding scoring sub-model and general scoring model, realizing coordinated optimization of the pair.
And S3, carrying out linear weighting according to each sub-score value, the accuracy weight corresponding to each sub-score value and the time attenuation weight corresponding to each sub-score value to obtain a comprehensive score value, wherein each accuracy weight is determined according to the accuracy of the corresponding score sub-model, and each time attenuation weight is determined according to each corresponding score sub-model and a preset time attenuation function.
In this embodiment, the time decay function is:

$$w_i = e^{-\lambda_i t}$$

wherein n is the number of scoring sub-models, i is a positive integer with 0 < i ≤ n, t is the current time, λ_i is the optimal time attenuation coefficient corresponding to scoring sub-model i, and w_i is the time attenuation weight corresponding to scoring sub-model i;
the method for determining the optimal time attenuation coefficient comprises the following steps:
defining a time attenuation coefficient for each scoring sub-model;
acquiring a plurality of training data corresponding to each scoring sub-model, and labeling all the training data with real labels;
inputting each training data into a corresponding scoring sub-model to obtain a corresponding prediction score, and comparing each prediction score with a corresponding real label to obtain a corresponding comparison result;
and accumulating and summing all comparison results corresponding to each scoring sub-model to obtain a loss function, updating the corresponding time attenuation coefficient through back propagation of the loss function, and obtaining the corresponding optimal time attenuation coefficient after multiple iterations of training.
In particular, λ_i controls the rate at which the sub-score values of scoring sub-model i decay over time, and different settings of λ_i reflect the differing timeliness of the scoring sub-models. For the electroencephalogram scoring sub-model, whose timeliness is very strong, λ_i is generally set larger; for the eye movement scoring sub-model, whose timeliness is weaker, λ_i is usually set smaller. In this embodiment λ_i is initialized to a small value such as 0.01, i.e. the time attenuation coefficient of each scoring sub-model is defined. Each training datum is input into the corresponding scoring sub-model to obtain a predicted score, which is compared with the corresponding real label; the difference between them is calculated with MSE (mean squared error), MAE (mean absolute error), cross entropy or the like as the difference metric, giving the comparison result. The differences between the predicted scores and real labels of all training samples are accumulated and summed to obtain the final loss function, and the time attenuation coefficient is updated through back propagation to minimize this loss. Over many training epochs the coefficient is optimized continuously until the loss converges or the maximum number of epochs is reached; the coefficient finally obtained is the optimal time attenuation coefficient of the corresponding scoring sub-model. The optimal coefficient reflects the sensitivity of attention assessment to time; the time attenuation weight of each sub-model, determined from its optimal coefficient and the current time, makes the most recent data weigh most heavily, so the attention assessment is both responsive and stable.
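A hedged sketch of this decay-coefficient optimization follows: λ_i is a learnable parameter, the decay weight e^(-λ_i·t) scales the sub-model's predictions, and the MSE against the real labels is backpropagated to update λ_i; the score and label tensors are placeholders.

```python
# lambda_i as a learnable parameter initialized near 0.01, optimized by
# backpropagating the MSE between decayed predictions and real labels.
import torch

torch.manual_seed(0)
log_lam = torch.tensor(-4.6, requires_grad=True)  # exp(-4.6) ~ 0.01
opt = torch.optim.Adam([log_lam], lr=0.01)

t = torch.linspace(0, 60, 200)        # age of each sample in seconds
sub_scores = torch.rand(200)          # predictions of one scoring sub-model
labels = torch.rand(200)              # annotated real labels

for epoch in range(300):
    lam = log_lam.exp()               # parameterization keeps lambda positive
    pred = torch.exp(-lam * t) * sub_scores
    loss = torch.mean((pred - labels) ** 2)   # accumulated comparison results
    opt.zero_grad()
    loss.backward()
    opt.step()

optimal_lambda = log_lam.exp().item()
```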
In this embodiment, determining each accuracy weight according to the accuracy of the corresponding scoring sub-model includes:
extracting a corresponding feature set from each current mode data of the user to be evaluated;
calculating the degree of differentiation of the corresponding feature set on each scoring sub-model;
determining the accuracy weight of each scoring sub-model according to an accuracy weight calculation formula, wherein the accuracy weight calculation formula is:

$$\alpha_i = \mathrm{acc}_i \cdot \bigl(1 + \gamma\,(d_i - \bar{d})\bigr)$$

wherein n is the number of scoring sub-models, i is a positive integer with 0 < i ≤ n, acc_i is the accuracy corresponding to scoring sub-model i, d_i is the degree of differentiation corresponding to scoring sub-model i, d̄ is the average degree of differentiation, γ is the weight adjustment factor, and α_i is the accuracy weight corresponding to scoring sub-model i.
Specifically, features are extracted from each current modality data of the user to be evaluated to form a corresponding feature set: the feature set corresponding to the EEG data is the EEG feature set, that corresponding to the eye movement data is the eye movement feature set, and that corresponding to the facial video data is the facial video feature set. To calculate the degree of differentiation of each feature set on the corresponding scoring sub-model, feature sets of the same modality must be collected in advance from N other users; each feature set of the user to be evaluated and of the N users is input into the corresponding scoring sub-model, each sub-model produces a set of feature outputs, and a classifier is used to predict which of the user to be evaluated and the N users each output belongs to. The resulting classification accuracy is the degree of differentiation of the corresponding scoring sub-model. For example, in facial video data, nodding is a common attention-related feature: on average a person's head remains relatively still while concentrating on work, but there are differences between individuals. User A, say, may habitually nod when concentrating, an individual characteristic of A, while other users do not nod when concentrating, which distinguishes them from A. If such individual differences are not taken into account, the system may misjudge user A as inattentive when it detects nodding, because nodding is generally regarded as a sign of inattention. In the present application, calculating the degree of differentiation of the face scoring sub-model reveals that nodding is strongly discriminative for user A, i.e. nodding features effectively distinguish A's different attention states, so the weight of the nodding features in the face scoring sub-model can be strengthened and the parameters adjusted in a personalized manner. When user A nods, the system then correctly judges that A is actually concentrating, improving the accuracy of the assessment of A's attention state. The specific calculation process is as follows:
(1) A section of facial video data of user A in different attention states is collected, and the head-position change between video frames is extracted as the nodding feature.
(2) When user A concentrates, the head-position change is larger than when not concentrating; the focused-state nodding feature set is denoted F1 and the unfocused-state nodding feature set F2.
(3) The nodding feature sets of 3 other users in both the focused and unfocused states are collected and denoted F1_1, F2_1, F1_2, F2_2, F1_3, F2_3 respectively.
(4) These 8 feature sets are input into the face scoring sub-model, each producing an output score: O1_A, O2_A, O1_1, O2_1, O1_2, O2_2, O1_3, O2_3.
(5) The 8 output scores are used as samples, with labels A focused, A unfocused, user 1 focused, user 1 unfocused, user 2 focused, user 2 unfocused, user 3 focused, user 3 unfocused, constructing a classification problem.
(6) After training a logistic regression classifier, the 8 samples O1_A, O2_A, O1_1, O2_1, O1_2, O2_2, O1_3, O2_3 are predicted. Assuming O1_A, O2_A, O1_1, O2_1, O1_2 and O2_2 are predicted correctly while O1_3 and O2_3 are not, the classification accuracy, the correctly predicted samples as a proportion of the total, is 75%, so the degree of differentiation of the nodding feature set on the face scoring sub-model is 75%.
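Steps (4)-(6) can be sketched with scikit-learn as follows; the eight score values are fabricated stand-ins solely to make the example runnable.

```python
# The eight sub-model output scores become samples, the user/state pairs
# become labels, and the classifier's accuracy on them is read off as the
# degree of differentiation.
import numpy as np
from sklearn.linear_model import LogisticRegression

# O1_* = focused-state score, O2_* = unfocused-state score
scores = np.array([[0.90], [0.30],    # O1_A, O2_A
                   [0.80], [0.40],    # O1_1, O2_1
                   [0.70], [0.50],    # O1_2, O2_2
                   [0.60], [0.55]])   # O1_3, O2_3
labels = ["A_f", "A_nf", "u1_f", "u1_nf", "u2_f", "u2_nf", "u3_f", "u3_nf"]

clf = LogisticRegression(max_iter=1000).fit(scores, labels)
differentiation = clf.score(scores, labels)  # fraction of the 8 samples correct
```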
After the degree of differentiation of each scoring sub-model is obtained, the accuracy weight of each scoring sub-model is calculated with the accuracy weight calculation formula. As modality data of the user to be evaluated and of other users continue to be collected, the degrees of differentiation can be recalculated and the accuracy weights adjusted accordingly, optimizing the accuracy weight of each scoring sub-model in real time. The result is an accuracy weight matched to the individual differences of the user to be evaluated, used for attention scoring.
After each sub-score value and its corresponding accuracy weight and time attenuation weight are obtained, the comprehensive score value is calculated with the scoring formula:

$$S = \sum_{i=1}^{n} \alpha_i \, w_i \, s_i$$

wherein n is the number of scoring sub-models, i is a positive integer with 0 < i ≤ n, t is the current time at which each time attenuation weight is evaluated, s_i is the sub-score value output by scoring sub-model i, α_i is the accuracy weight corresponding to scoring sub-model i, w_i is the time attenuation weight corresponding to scoring sub-model i, and S is the comprehensive score value.
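The fusion formula reduces to a few lines; the three sub-scores, weights and decay coefficients below are illustrative values for an EEG / eye movement / face configuration.

```python
# Linear weighting of sub-scores by accuracy and time-decay weights.
import numpy as np

s = np.array([0.82, 0.64, 0.71])       # sub-score values s_i
alpha = np.array([0.45, 0.25, 0.30])   # accuracy weights alpha_i
lam = np.array([0.05, 0.01, 0.02])     # optimal decay coefficients lambda_i
t = 3.0                                # time since each sub-score was produced

w = np.exp(-lam * t)                   # time attenuation weights w_i
S = float(np.sum(alpha * w * s))       # comprehensive score value
```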
The embodiment further includes:
judging whether the comprehensive score value is lower than a preset score value or not;
if yes, judging whether the comprehensive score value rises above the preset score value within the confirmation time window; if it does, a short-term attention dip is determined; if it does not, a genuine low-attention state is determined and a warning is issued;
If not, the normal attention state is determined.
Specifically, the confirmation time window is a period such as 1, 3, 5 or 10 minutes. With the confirmation time window in place, a confirmation stage begins when the comprehensive score value is detected to be below the preset value, and whether the score returns above the normal value within the window is judged. If it does, the user to be evaluated was in a short-term low-attention state; if it does not, the user is in a genuine low-attention state and a warning is issued. The confirmation stage thus avoids false warnings for short-term attention dips.
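The confirmation logic can be sketched as a small state machine; the threshold and window length are illustrative choices.

```python
# A drop below the preset score opens a confirmation window; only a failure
# to recover within the window raises a genuine low-attention warning.
def monitor(score_stream, threshold=0.5, window_s=60.0):
    """score_stream yields (timestamp_s, composite_score); yields warning times."""
    low_since = None
    for ts, score in score_stream:
        if score >= threshold:
            low_since = None              # recovered: short-term dip, no warning
        elif low_since is None:
            low_since = ts                # enter the confirmation stage
        elif ts - low_since >= window_s:
            yield ts                      # genuine low attention: warn
            low_since = ts                # re-arm after warning

stream = [(s, 0.4) for s in range(0, 120, 5)]   # 2 min stuck below threshold
warning_times = list(monitor(iter(stream)))      # -> [60]
```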
The method for determining the confirmation time window in this embodiment includes:
collecting a plurality of historical comprehensive score value sequences, wherein each historical comprehensive score value sequence contains a preset number of comprehensive score values;
calculating the mean value and standard deviation of each historical comprehensive scoring value sequence;
setting a candidate time sequence, wherein the candidate time sequence comprises a plurality of time windows;
counting, for each time window, the proportion of true low-attention states that are correctly predicted out of all true low-attention states across all the historical comprehensive score value sequences, to obtain a corresponding recall rate, and taking the time window with the highest recall rate as a first window;
calculating a first average value as the mean of the means of all the historical comprehensive score value sequences, calculating a second average value as the mean of the standard deviations of all the historical comprehensive score value sequences, and calculating a second window from the first average value and the second average value;
and testing the recall rates of the first window and the second window on the historical comprehensive score value sequence, judging whether the recall rate of the second window is higher than that of the first window, if so, taking the second window as a confirmation time window, and if not, taking the first window as the confirmation time window.
Specifically, a historical comprehensive score value sequence may be {score1, score2, ..., scoren}, and the candidate time sequence may be {T1, T2, ..., Tm}, e.g. {1 min, 3 min, 5 min, 10 min}. When a comprehensive score value in a historical sequence falls below the preset score value, each time window is used to predict the genuine low-attention states in all the historical sequences, and the recall rate of each window is counted; the higher the recall rate, the more low-attention periods the window identifies, and the better the effect. Recall curves for the different windows can be drawn, with the time windows on the x-axis and the corresponding recall on the y-axis, plotting (time window, recall) coordinate points; the peak of the curve gives the time window with the highest recall, which is taken as the first window.
In this embodiment the second window is calculated using the following formula:

$$T_{\mathrm{second}} = k_1\,\mu + k_2\,\sigma$$

wherein T_second is the second window, k1 and k2 are preset coefficients, μ is the first average value (the mean of the sequence means) and σ is the second average value (the mean of the sequence standard deviations). The values of k1 and k2 can be trained by grid search. Specifically, k1 and k2 are treated as hyper-parameters, e.g. k1 ranging over [0.5, 1, 2] and k2 over [5, 10, 20]; a two-dimensional parameter grid is built with k1 on the horizontal axis and k2 on the vertical axis, giving 9 combinations (k1, k2). For each combination a scoring model is trained and an evaluation index, such as recall, is calculated on the verification set; all combinations are traversed to find the pair of hyper-parameter values (k1, k2) with the best index, and the model is retrained with the optimal (k1, k2) to obtain the final result. By traversing the different combinations in the hyper-parameter space and evaluating on the verification set, the optimal values of k1 and k2 are found. Using the mean and standard deviation together evaluates the overall level and fluctuation range of the historical scores more comprehensively, so a more reasonable window length is computed adaptively. If the recall rate of the second window is higher than that of the first window, the adaptive approach has genuinely improved the confirmation effect and the second window is determined to be the confirmation time window; otherwise the first window is determined to be the confirmation time window.
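The k1/k2 grid search can be sketched as follows; recall_for_window is an assumed helper that replays the historical composite-score sequences with a candidate confirmation window and returns the recall on true low-attention periods.

```python
# Grid search over the 3x3 (k1, k2) grid from the embodiment; mu and sigma
# are the first and second averages from the text.
from itertools import product

def grid_search_window(mu, sigma, recall_for_window):
    best = (None, None, -1.0)                    # (k1, k2, recall)
    for k1, k2 in product([0.5, 1.0, 2.0], [5.0, 10.0, 20.0]):
        window = k1 * mu + k2 * sigma            # candidate second window
        r = recall_for_window(window)
        if r > best[2]:
            best = (k1, k2, r)
    return best

# Toy recall function peaking at 90-second windows, for demonstration only:
best_k1, best_k2, best_recall = grid_search_window(
    mu=60.0, sigma=4.0, recall_for_window=lambda w: 1.0 - abs(w - 90.0) / 90.0)
```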
According to the attention assessment method based on multi-modal fusion disclosed in this embodiment, comprehensively utilizing multi-modal data such as EEG, eye movement and facial video data reflects the attention state more comprehensively and yields higher accuracy than any single modality. Determining the time attenuation weight of each scoring sub-model from the sub-model and a preset time attenuation function gives the most recent data greater influence on the attention assessment, while simulating the decay of attention over time, matching the cognitive characteristic that attention cannot remain continuously stable, and enabling real-time, highly sensitive monitoring of attention fluctuations. Training the last layer of a general scoring model with a fine-tuning data set yields the corresponding scoring sub-model, a model adjusted to the individual data of the user to be evaluated; as multi-modal data continue to be collected, the user's historical modality data gradually increase until the user becomes an established user, whereupon the corresponding pre-trained general scoring model is loaded with randomly initialized network weights (the pre-training weights are not loaded) and trained with the training set, and the model parameters at the point where the training loss stabilizes are taken, obtaining the corresponding scoring sub-model and realizing individualized modeling of the user to be evaluated.
As shown in fig. 2, an embodiment of the present invention discloses an attention assessment system based on multi-modal fusion, including:
the data acquisition module 10 is configured to acquire multi-modal data of a user to be evaluated, where the multi-modal data includes: a plurality of current modality data reflecting the attention characteristics of the user to be evaluated from different angles;
the scoring sub-module 20 is configured to input each current mode data in the multi-mode data into a corresponding trained scoring sub-model to obtain a corresponding sub-scoring value;
and the comprehensive scoring module 30 is configured to perform linear weighting according to each sub-score value, an accuracy weight corresponding to each sub-score value, and a time attenuation weight corresponding to each sub-score value to obtain a comprehensive score value, where each accuracy weight is determined according to the accuracy of the corresponding scoring sub-model, and each time attenuation weight is determined according to each corresponding scoring sub-model and a preset time attenuation function.
For the specific configuration of the attention assessment system, reference may be made to the description of the attention assessment method above, which is not repeated here. All or part of the modules of the attention assessment system can be realized by software, hardware, or a combination thereof. The modules may be embedded in hardware, independent of the processor in the computer device, or stored as software in the memory of the computer device, so that the processor can call and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with external terminals through a network connection. The computer program, when executed by a processor, implements the attention assessment method based on multi-modal fusion.
It will be appreciated by those skilled in the art that the structure shown in FIG. 3 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory storing a computer program and a processor that when executing the computer program performs the steps of:
collecting multi-modal data of a user to be evaluated, wherein the multi-modal data comprises: a plurality of current modality data reflecting the attention characteristics of the user to be evaluated from different angles;
inputting each current mode data in the multi-mode data into a corresponding trained scoring sub-model to obtain a corresponding sub-scoring value;
and carrying out linear weighting according to each sub-score value, the accuracy weight corresponding to each sub-score value and the time attenuation weight corresponding to each sub-score value to obtain a comprehensive score value, wherein each accuracy weight is determined according to the accuracy of the corresponding score sub-model, and each time attenuation weight is determined according to each corresponding score sub-model and a preset time attenuation function.
In one embodiment, the time decay function is:
w_i = exp(-λ_i · t)

wherein n is the number of scoring sub-models, 0 < i < n, i is a positive integer, t is the current time, λ_i is the optimal time attenuation coefficient corresponding to the scoring sub-model i, and w_i is the time attenuation weight corresponding to the scoring sub-model i;
the method for determining the optimal time attenuation coefficient comprises the following steps:
defining a time attenuation coefficient for each scoring sub-model;
acquiring a plurality of training data corresponding to each scoring sub-model, and labeling all the training data with real labels;
inputting each training data into the corresponding scoring sub-model to obtain a corresponding prediction score, and comparing each prediction score with the corresponding real label to obtain a corresponding comparison result;
accumulating and summing all the comparison results corresponding to each scoring sub-model to obtain a loss function, updating the corresponding time attenuation coefficient through back propagation of the loss function, and obtaining the corresponding optimal time attenuation coefficient after multiple rounds of iterative training.
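By way of non-limiting illustration, the back-propagation update of the time attenuation coefficients may be sketched as follows; the exponential form w_i = exp(-λ_i · t), the squared-error comparison of predictions and labels, and all numeric values are assumptions for this sketch, and the training data are synthetic stand-ins:

    import torch

    n = 3                                 # number of scoring sub-models
    log_lam = torch.zeros(n, requires_grad=True)  # λ_i = exp(log_lam) stays positive

    # Synthetic stand-ins for the labeled training data described above
    t = torch.rand(100, 1)                # time associated with each sample
    preds = torch.rand(100, n)            # prediction score of each sub-model
    labels = torch.rand(100, 1)           # real labels

    opt = torch.optim.Adam([log_lam], lr=0.05)
    for _ in range(200):                  # multiple rounds of iterative training
        w = torch.exp(-torch.exp(log_lam) * t)        # time attenuation weights
        fused = (w * preds).sum(1, keepdim=True) / w.sum(1, keepdim=True)
        loss = ((fused - labels) ** 2).sum()  # accumulated comparison results
        opt.zero_grad()
        loss.backward()
        opt.step()

    optimal_lam = torch.exp(log_lam).detach()  # optimal time attenuation coefficients
    print(optimal_lam)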
In one embodiment, each of the scoring sub-models is obtained by training as follows:
acquiring a plurality of historical modal data of the user to be evaluated corresponding to each scoring sub-model;
judging whether the total number of the historical modal data and the current modal data corresponding to each scoring sub-model is smaller than a preset threshold;
if yes, loading each pre-trained general scoring model, selecting a first preset percentage of modal data from all historical modal data and current modal data corresponding to each general scoring model as a corresponding fine tuning data set, fixing parameters of all layers except the last layer of each general scoring model, training the last layer by using the corresponding fine tuning data set, and updating the weight of the last layer to obtain a corresponding scoring sub-model;
if not, loading each pre-trained general scoring model, selecting a second preset percentage of modal data from all the historical modal data and current modal data corresponding to each general scoring model as a training set and using the remaining modal data as a verification set, training the corresponding general scoring model with the training set, and testing the trained general scoring model with the corresponding verification set every preset number of training rounds until the loss function on the verification set converges, the trained general scoring model being used as the corresponding scoring sub-model.
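By way of non-limiting illustration, the cold-start branch that fixes all layers except the last one may be sketched as follows; the architecture, the threshold of 500 samples, and the helper name are assumptions for this sketch:

    import torch
    import torch.nn as nn

    def build_scoring_submodel(general_model: nn.Sequential,
                               n_samples: int, threshold: int = 500) -> nn.Sequential:
        """With little per-user data, freeze every layer except the last one so
        that only the last layer's weights are updated by the fine-tuning data
        set; with enough data, leave all parameters trainable for full training."""
        if n_samples < threshold:
            for p in general_model.parameters():
                p.requires_grad = False       # fix parameters of all other layers
            for p in list(general_model.children())[-1].parameters():
                p.requires_grad = True        # update only the last layer
        return general_model

    # Usage with an illustrative pre-trained general scoring model
    model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
    model = build_scoring_submodel(model, n_samples=120)
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-3)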
In one embodiment, determining each accuracy weight according to the accuracy of the corresponding scoring sub-model includes:
extracting a corresponding feature set from each current modal data of the user to be evaluated;
calculating the degree of discrimination of the corresponding feature set for each scoring sub-model;
determining the accuracy weight of each scoring sub-model according to an accuracy weight calculation formula, wherein the accuracy weight calculation formula is:

a_i = acc_i · (1 + β · (d_i - d_avg)) / Σ_{j=1}^{n} acc_j · (1 + β · (d_j - d_avg))

wherein n is the number of scoring sub-models, 0 < i < n, i is a positive integer, acc_i is the accuracy corresponding to the scoring sub-model i, d_i is the degree of discrimination corresponding to the scoring sub-model i, d_avg is the average degree of discrimination, β is the weight adjustment factor, and a_i is the accuracy weight corresponding to the scoring sub-model i.
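By way of non-limiting illustration, the accuracy weight calculation may be sketched as follows, using the formula reconstructed above; the normalized form of that formula and the example numbers are assumptions for this sketch:

    def accuracy_weights(acc, disc, beta=0.5):
        """Accuracy weight per scoring sub-model: the accuracy acc_i is modulated
        by how far its degree of discrimination d_i deviates from the average
        degree of discrimination, and the weights are normalized to sum to 1."""
        d_avg = sum(disc) / len(disc)
        raw = [a * (1 + beta * (d - d_avg)) for a, d in zip(acc, disc)]
        total = sum(raw)
        return [r / total for r in raw]

    # Illustrative accuracies and degrees of discrimination for three sub-models
    print(accuracy_weights(acc=[0.88, 0.81, 0.76], disc=[0.9, 0.6, 0.5]))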
In one embodiment, the method further comprises:
judging whether the comprehensive score value is lower than a preset score value or not;
if yes, judging whether the comprehensive score value rises back above the preset score value within a confirmation time window: if it does, the drop is judged to be a short-term fluctuation; if it does not, it is judged to be a genuine low-attention state and a warning is issued;
if not, a normal attention state is determined.
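By way of non-limiting illustration, this judgment logic may be sketched as follows; the threshold of 0.6, the 5-second confirmation time window, and the class name are assumptions for this sketch:

    class AttentionMonitor:
        """Issue a warning only when the comprehensive score value stays below
        the preset score value for the whole confirmation time window; a score
        that recovers within the window is treated as a short-term fluctuation."""

        def __init__(self, threshold: float = 0.6, window_s: float = 5.0):
            self.threshold = threshold
            self.window_s = window_s
            self.low_since = None        # start time of the current low period

        def update(self, score: float, now: float) -> bool:
            if score >= self.threshold:
                self.low_since = None    # recovered: short-term fluctuation
                return False
            if self.low_since is None:
                self.low_since = now     # entered a low-attention state
            return (now - self.low_since) >= self.window_s  # genuinely low

    monitor = AttentionMonitor()
    for i, s in enumerate([0.8, 0.5, 0.55, 0.4, 0.45, 0.5]):
        if monitor.update(s, now=2.0 * i):
            print(f"t={2 * i}s: genuine low attention, warning issued")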
In one embodiment, the method for determining the acknowledgement time window includes:
collecting a plurality of segments of historical comprehensive score value sequences, wherein each historical comprehensive score value sequence contains a preset number of comprehensive score values;
calculating the mean value and standard deviation of each historical comprehensive scoring value sequence;
setting a candidate time sequence, wherein the candidate time sequence comprises a plurality of time windows;
counting, for each time window, the proportion of correctly predicted true low-attention states among all true low-attention states across all the historical comprehensive score value sequences to obtain a corresponding recall rate, and taking the time window with the highest recall rate as a first window;
calculating a first average value as the mean of the means of all the historical comprehensive score value sequences and a second average value as the mean of the standard deviations of all the historical comprehensive score value sequences, and calculating a second window from the first average value and the second average value;
and testing the recall rates of the first window and the second window on the historical comprehensive score value sequence, judging whether the recall rate of the second window is higher than that of the first window, if so, taking the second window as a confirmation time window, and if not, taking the first window as the confirmation time window.
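By way of non-limiting illustration, the selection between the first window and the second window may be sketched as follows; the recall computation over fixed-length sequences and the rule used to derive the second window from the two average values are assumptions for this sketch:

    import statistics

    def choose_confirmation_window(sequences, is_low, candidates,
                                   threshold=0.6, k=2.0):
        """Pick the candidate window with the highest recall of true
        low-attention sequences (first window), derive a second window from
        the mean of the sequence means and the mean of the sequence standard
        deviations, and keep whichever window has the higher recall."""
        def recall(window):
            positives = sum(is_low)
            hits = sum(1 for seq, low in zip(sequences, is_low)
                       if low and min(seq[:window]) < threshold)
            return hits / positives if positives else 0.0

        first = max(candidates, key=recall)
        mean_of_means = statistics.mean(statistics.mean(s) for s in sequences)
        mean_of_stds = statistics.mean(statistics.stdev(s) for s in sequences)
        second = max(1, round(k * mean_of_stds / mean_of_means * max(candidates)))
        return second if recall(second) > recall(first) else first

    seqs = [[0.7, 0.5, 0.4, 0.55], [0.8, 0.75, 0.7, 0.65], [0.5, 0.45, 0.6, 0.7]]
    print(choose_confirmation_window(seqs, is_low=[1, 0, 1], candidates=[1, 2, 3]))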
In one embodiment, the multi-modal data includes any of: electroencephalogram (EEG) data, eye movement data, and facial video data.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
collecting multi-modal data of a user to be evaluated, wherein the multi-modal data comprises: a plurality of current modality data reflecting the attention characteristics of the user to be evaluated from different angles;
inputting each current mode data in the multi-mode data into a corresponding trained scoring sub-model to obtain a corresponding sub-scoring value;
And carrying out linear weighting according to each sub-score value, the accuracy weight corresponding to each sub-score value and the time attenuation weight corresponding to each sub-score value to obtain a comprehensive score value, wherein each accuracy weight is determined according to the accuracy of the corresponding score sub-model, and each time attenuation weight is determined according to each corresponding score sub-model and a preset time attenuation function.
In one embodiment, the time decay function is:
w_i = exp(-λ_i · t)

wherein n is the number of scoring sub-models, 0 < i < n, i is a positive integer, t is the current time, λ_i is the optimal time attenuation coefficient corresponding to the scoring sub-model i, and w_i is the time attenuation weight corresponding to the scoring sub-model i;
the method for determining the optimal time attenuation coefficient comprises the following steps:
defining a time attenuation coefficient for each scoring sub-model;
acquiring a plurality of training data corresponding to each scoring sub-model, and labeling all the training data with real labels;
inputting each training data into the corresponding scoring sub-model to obtain a corresponding prediction score, and comparing each prediction score with the corresponding real label to obtain a corresponding comparison result;
accumulating and summing all the comparison results corresponding to each scoring sub-model to obtain a loss function, updating the corresponding time attenuation coefficient through back propagation of the loss function, and obtaining the corresponding optimal time attenuation coefficient after multiple rounds of iterative training.
In one embodiment, each of the scoring sub-models is obtained by training as follows:
acquiring a plurality of historical modal data of the user to be evaluated corresponding to each scoring sub-model;
judging whether the total number of the historical modal data and the current modal data corresponding to each scoring sub-model is smaller than a preset threshold;
if yes, loading each pre-trained general scoring model, selecting a first preset percentage of modal data from all historical modal data and current modal data corresponding to each general scoring model as a corresponding fine tuning data set, fixing parameters of all layers except the last layer of each general scoring model, training the last layer by using the corresponding fine tuning data set, and updating the weight of the last layer to obtain a corresponding scoring sub-model;
if not, loading each pre-trained general scoring model, selecting a second preset percentage of modal data from all the historical modal data and current modal data corresponding to each general scoring model as a training set and using the remaining modal data as a verification set, training the corresponding general scoring model with the training set, and testing the trained general scoring model with the corresponding verification set every preset number of training rounds until the loss function on the verification set converges, the trained general scoring model being used as the corresponding scoring sub-model.
In one embodiment, determining each accuracy weight according to the accuracy of the corresponding scoring sub-model includes:
extracting a corresponding feature set from each current modal data of the user to be evaluated;
calculating the degree of discrimination of the corresponding feature set for each scoring sub-model;
determining the accuracy weight of each scoring sub-model according to an accuracy weight calculation formula, wherein the accuracy weight calculation formula is:

a_i = acc_i · (1 + β · (d_i - d_avg)) / Σ_{j=1}^{n} acc_j · (1 + β · (d_j - d_avg))

wherein n is the number of scoring sub-models, 0 < i < n, i is a positive integer, acc_i is the accuracy corresponding to the scoring sub-model i, d_i is the degree of discrimination corresponding to the scoring sub-model i, d_avg is the average degree of discrimination, β is the weight adjustment factor, and a_i is the accuracy weight corresponding to the scoring sub-model i.
In one embodiment, the steps further comprise:
judging whether the comprehensive score value is lower than a preset score value or not;
if yes, judging whether the comprehensive score value rises back above the preset score value within a confirmation time window: if it does, the drop is judged to be a short-term fluctuation; if it does not, it is judged to be a genuine low-attention state and a warning is issued;
if not, a normal attention state is determined.
In one embodiment, the method for determining the acknowledgement time window includes:
collecting a plurality of segments of historical comprehensive score value sequences, wherein each historical comprehensive score value sequence contains a preset number of comprehensive score values;
calculating the mean value and standard deviation of each historical comprehensive scoring value sequence;
setting a candidate time sequence, wherein the candidate time sequence comprises a plurality of time windows;
counting, for each time window, the proportion of correctly predicted true low-attention states among all true low-attention states across all the historical comprehensive score value sequences to obtain a corresponding recall rate, and taking the time window with the highest recall rate as a first window;
calculating a first average value as the mean of the means of all the historical comprehensive score value sequences and a second average value as the mean of the standard deviations of all the historical comprehensive score value sequences, and calculating a second window from the first average value and the second average value;
and testing the recall rates of the first window and the second window on the historical comprehensive score value sequence, judging whether the recall rate of the second window is higher than that of the first window, if so, taking the second window as a confirmation time window, and if not, taking the first window as the confirmation time window.
In one embodiment, the multi-modal data includes any of: electroencephalogram (EEG) data, eye movement data, and facial video data.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium which, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.

Claims (9)

1. A method of attention assessment based on multimodal fusion, comprising:
collecting multi-modal data of a user to be evaluated, wherein the multi-modal data comprises: a plurality of current modality data reflecting the attention characteristics of the user to be evaluated from different angles;
inputting each current mode data in the multi-mode data into a corresponding trained scoring sub-model to obtain a corresponding sub-scoring value;
performing linear weighting according to each sub-score value, an accuracy weight corresponding to each sub-score value and a time attenuation weight corresponding to each sub-score value to obtain a comprehensive score value, wherein each accuracy weight is determined according to the accuracy of a corresponding score sub-model, and each time attenuation weight is determined according to each corresponding score sub-model and a preset time attenuation function;
the time decay function is:
w_i = exp(-λ_i · t)

wherein n is the number of scoring sub-models, 0 < i < n, i is a positive integer, t is the current time, λ_i is the optimal time attenuation coefficient corresponding to the scoring sub-model i, and w_i is the time attenuation weight corresponding to the scoring sub-model i;
the method for determining the optimal time attenuation coefficient comprises the following steps:
defining a time attenuation coefficient for each scoring sub-model;
acquiring a plurality of training data corresponding to each scoring sub-model, and labeling all the training data with real labels;
inputting each training data into the corresponding scoring sub-model to obtain a corresponding prediction score, and comparing each prediction score with the corresponding real label to obtain a corresponding comparison result;
accumulating and summing all the comparison results corresponding to each scoring sub-model to obtain a loss function, updating the corresponding time attenuation coefficient through back propagation of the loss function, and obtaining the corresponding optimal time attenuation coefficient after multiple rounds of iterative training.
2. The attention assessment method based on multi-modal fusion as claimed in claim 1, wherein each of the scoring sub-models is obtained by training as follows:
acquiring a plurality of historical modal data of the user to be evaluated corresponding to each scoring sub-model;
judging whether the total number of the historical modal data and the current modal data corresponding to each scoring sub-model is smaller than a preset threshold;
if yes, loading each pre-trained general scoring model, selecting a first preset percentage of modal data from all historical modal data and current modal data corresponding to each general scoring model as a corresponding fine tuning data set, fixing parameters of all layers except the last layer of each general scoring model, training the last layer by using the corresponding fine tuning data set, and updating the weight of the last layer to obtain a corresponding scoring sub-model;
if not, loading each pre-trained general scoring model, selecting a second preset percentage of modal data from all the historical modal data and current modal data corresponding to each general scoring model as a training set and using the remaining modal data as a verification set, training the corresponding general scoring model with the training set, and testing the trained general scoring model with the corresponding verification set every preset number of training rounds until the loss function on the verification set converges, the trained general scoring model being used as the corresponding scoring sub-model.
3. The attention assessment method based on multi-modal fusion as claimed in claim 1, wherein determining each accuracy weight according to the accuracy of the corresponding scoring sub-model comprises:
extracting a corresponding feature set from each current modal data of the user to be evaluated;
calculating the degree of discrimination of the corresponding feature set for each scoring sub-model;
determining the accuracy weight of each scoring sub-model according to an accuracy weight calculation formula, wherein the accuracy weight calculation formula is:

a_i = acc_i · (1 + β · (d_i - d_avg)) / Σ_{j=1}^{n} acc_j · (1 + β · (d_j - d_avg))

wherein n is the number of scoring sub-models, 0 < i < n, i is a positive integer, acc_i is the accuracy corresponding to the scoring sub-model i, d_i is the degree of discrimination corresponding to the scoring sub-model i, d_avg is the average degree of discrimination, β is the weight adjustment factor, and a_i is the accuracy weight corresponding to the scoring sub-model i.
4. The attention assessment method based on multi-modal fusion as claimed in claim 1, further comprising:
judging whether the comprehensive score value is lower than a preset score value or not;
if yes, judging whether the comprehensive score value rises back above the preset score value within a confirmation time window: if it does, the drop is judged to be a short-term fluctuation; if it does not, it is judged to be a genuine low-attention state and a warning is issued;
if not, a normal attention state is determined.
5. The attention assessment method based on multi-modal fusion as claimed in claim 4, wherein the determination method of the confirmation time window includes:
collecting a plurality of segments of historical comprehensive score value sequences, wherein each historical comprehensive score value sequence contains a preset number of comprehensive score values;
calculating the mean value and standard deviation of each historical comprehensive scoring value sequence;
setting a candidate time sequence, wherein the candidate time sequence comprises a plurality of time windows;
counting, for each time window, the proportion of correctly predicted true low-attention states among all true low-attention states across all the historical comprehensive score value sequences to obtain a corresponding recall rate, and taking the time window with the highest recall rate as a first window;
calculating a first average value as the mean of the means of all the historical comprehensive score value sequences and a second average value as the mean of the standard deviations of all the historical comprehensive score value sequences, and calculating a second window from the first average value and the second average value;
and testing the recall rates of the first window and the second window on the historical comprehensive score value sequence, judging whether the recall rate of the second window is higher than that of the first window, if so, taking the second window as a confirmation time window, and if not, taking the first window as the confirmation time window.
6. The attention assessment method based on multi-modal fusion as claimed in claim 1, wherein the multi-modal data includes any of: electroencephalogram (EEG) data, eye movement data, and facial video data.
7. An attention assessment system based on multimodal fusion, comprising:
the data acquisition module is used for acquiring multi-mode data of a user to be evaluated, wherein the multi-mode data comprises: a plurality of current modality data reflecting the attention characteristics of the user to be evaluated from different angles;
the scoring sub-module is used for inputting each current mode data in the multi-mode data into a corresponding trained scoring sub-model to obtain a corresponding sub-scoring value;
the comprehensive scoring module is used for carrying out linear weighting according to each sub-scoring value, the accuracy weight corresponding to each sub-scoring value and the time attenuation weight corresponding to each sub-scoring value to obtain a comprehensive scoring value, wherein each accuracy weight is determined according to the accuracy of the corresponding scoring sub-model, each time attenuation weight is determined according to each corresponding scoring sub-model and a preset time attenuation function, and the time attenuation function is as follows:
w_i = exp(-λ_i · t)

wherein n is the number of scoring sub-models, 0 < i < n, i is a positive integer, t is the current time, λ_i is the optimal time attenuation coefficient corresponding to the scoring sub-model i, and w_i is the time attenuation weight corresponding to the scoring sub-model i;
the method for determining the optimal time attenuation coefficient comprises the following steps:
defining a time attenuation coefficient for each scoring sub-model;
acquiring a plurality of training data corresponding to each scoring sub-model, and labeling all the training data with real labels;
inputting each training data into the corresponding scoring sub-model to obtain a corresponding prediction score, and comparing each prediction score with the corresponding real label to obtain a corresponding comparison result;
accumulating and summing all the comparison results corresponding to each scoring sub-model to obtain a loss function, updating the corresponding time attenuation coefficient through back propagation of the loss function, and obtaining the corresponding optimal time attenuation coefficient after multiple rounds of iterative training.
8. Computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
CN202311154786.8A 2023-09-08 2023-09-08 Attention assessment method, system, equipment and medium based on multi-mode fusion Active CN116881853B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311154786.8A CN116881853B (en) 2023-09-08 2023-09-08 Attention assessment method, system, equipment and medium based on multi-mode fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311154786.8A CN116881853B (en) 2023-09-08 2023-09-08 Attention assessment method, system, equipment and medium based on multi-mode fusion

Publications (2)

Publication Number Publication Date
CN116881853A true CN116881853A (en) 2023-10-13
CN116881853B CN116881853B (en) 2024-01-05

Family

ID=88257314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311154786.8A Active CN116881853B (en) 2023-09-08 2023-09-08 Attention assessment method, system, equipment and medium based on multi-mode fusion

Country Status (1)

Country Link
CN (1) CN116881853B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117687313A (en) * 2023-12-29 2024-03-12 广东福临门世家智能家居有限公司 Intelligent household equipment control method and system based on intelligent door lock

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103995804A (en) * 2013-05-20 2014-08-20 中国科学院计算技术研究所 Cross-media topic detection method and device based on multimodal information fusion and graph clustering
CN111861569A (en) * 2020-07-23 2020-10-30 中国工商银行股份有限公司 Product information recommendation method and device
CN114366103A (en) * 2022-01-07 2022-04-19 北京师范大学 Attention assessment method and device and electronic equipment
CN116127853A (en) * 2023-03-03 2023-05-16 北京工业大学 Unmanned driving overtaking decision method based on DDPG (distributed data base) with time sequence information fused
CN116386862A (en) * 2023-02-10 2023-07-04 平安科技(深圳)有限公司 Multi-modal cognitive impairment evaluation method, device, equipment and storage medium
CN116530938A (en) * 2023-05-09 2023-08-04 中国人民解放军军事科学院***工程研究院 Cognitive enhancement training system and method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIE-LIN QIU et al.: "Correlated Attention Networks for Multimodal Emotion Recognition", 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 2656-2660 *
XIAO Xuan: "Research on Personalized Recommendation Algorithms Based on Multimodal Emotion Fusion", China Master's Theses Full-text Database, Information Science and Technology, No. 03, pages 138-3325 *


Also Published As

Publication number Publication date
CN116881853B (en) 2024-01-05

Similar Documents

Publication Publication Date Title
Zhang et al. Driver fatigue detection based on eye state recognition
Kalka et al. Estimating and fusing quality factors for iris biometric images
CN102547123B (en) Self-adapting sightline tracking system and method based on face recognition technology
Ran et al. Cataract detection and grading based on combination of deep convolutional neural network and random forests
CN106108894A (en) A kind of emotion electroencephalogramrecognition recognition method improving Emotion identification model time robustness
CN105894039A (en) Emotion recognition modeling method, emotion recognition method and apparatus, and intelligent device
Hollingsworth et al. Iris recognition using signal-level fusion of frames from video
CN108596087B (en) Driving fatigue degree detection regression model based on double-network result
CA2559381A1 (en) Interactive system for recognition analysis of multiple streams of video
CN110464367B (en) Psychological anomaly detection method and system based on multi-channel cooperation
CN105373767A (en) Eye fatigue detection method for smart phones
CN104636580A (en) Health monitoring mobile phone based on human face
Busey et al. Characterizing human expertise using computational metrics of feature diagnosticity in a pattern matching task
Rigas et al. Towards a multi-source fusion approach for eye movement-driven recognition
CN112733772A (en) Real-time cognitive load and fatigue degree detection method and system in storage sorting task
CN116343284A (en) Attention mechanism-based multi-feature outdoor environment emotion recognition method
CN111297327A (en) Sleep analysis method, system, electronic equipment and storage medium
Saxena et al. Towards efficient calibration for webcam eye-tracking in online experiments
CN117171658A (en) Cognitive load judging method based on brain intelligent technology
CN116881853B (en) Attention assessment method, system, equipment and medium based on multi-mode fusion
CN117370828A (en) Multi-mode feature fusion emotion recognition method based on gating cross-attention mechanism
CN109063652B (en) Signal processing method, system and computer storage medium
CN116305048A (en) Identity recognition method based on improved residual error shrinkage network
CN109919050A (en) Identity recognition method and device
CN111222374A (en) Lie detection data processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant