CN117576771A - Visual attention assessment method, device, medium and equipment - Google Patents

Visual attention assessment method, device, medium and equipment

Info

Publication number
CN117576771A
Authority
CN
China
Prior art keywords
tested
eye movement
attention
movement track
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410069098.XA
Other languages
Chinese (zh)
Other versions
CN117576771B (en)
Inventor
周宏豪
马宁
郑迪
董波
陈奕菡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202410069098.XA priority Critical patent/CN117576771B/en
Publication of CN117576771A publication Critical patent/CN117576771A/en
Application granted granted Critical
Publication of CN117576771B publication Critical patent/CN117576771B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G06V40/193 Preprocessing; Feature extraction
    • G06V40/197 Matching; Classification
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Ophthalmology & Optometry (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The specification discloses a visual attention assessment method, device, medium and equipment. An observation video and the eye movement track of a tested watching the observation video are acquired. Each semantic feature corresponding to the observation video is extracted with a pre-trained video semantic segmentation model. A preset cognitive decision model is determined, and each semantic feature of the observation video and the tested eye movement track are input into the cognitive decision model to obtain the attention parameters of the tested. The attention parameters reflect the degree of attention and preference of the tested for each semantic feature in the observation video. The attention parameters of the tested are input into a pre-trained classification model corresponding to the observation video to obtain the classification result output by the classification model, and the attention map of the tested is drawn according to the classification result, so that which information the tested eye movement track focuses on can be intuitively known.

Description

Visual attention assessment method, device, medium and equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a visual attention assessment method, apparatus, medium, and device.
Background
As research on the human brain progresses, studies show that information collected by the eyes can account for up to 80% of the information processed by the brain. From the perspective of human evolution, the evolution of the eye greatly promoted the development of the brain. The eye movement track reflects the strategy by which humans collect information from the external world and is the basis of the brain's cognition of that world.
When the brain suffers from cognitive impairment or pathology, cognitive ability is impaired, which is the origin of many neurological diseases. This also changes the strategy of acquiring information, and therefore the eye movement track. Such changes include semantic-level changes of the eye movement strategy, i.e., changes in how information is collected about the semantic content of the viewed image.
At present, models built through machine learning predict the mapping relation between visual attention characteristics and neurological diseases based on eye movement tracks, so as to assist doctors in diagnosis. However, such approaches generally cannot consider both the physical information and the semantic information of the video watched along the eye movement track, and the interpretability of the predicted diagnosis result cannot be guaranteed, so the visual attention characteristics cannot be accurately assessed from the eye movement track.
To this end, the present specification provides a visual attention assessment method, apparatus, medium, and device.
Disclosure of Invention
The present specification provides a visual attention assessment method, apparatus, medium and device to partially solve the above-mentioned problems of the prior art.
The technical scheme adopted in the specification is as follows:
the present specification provides a visual attention assessment method comprising:
acquiring an observation video and a tested eye movement track of a tested watching the observation video;
extracting each semantic feature corresponding to the observed video according to a pre-trained semantic segmentation model;
determining a preset cognitive decision model, and inputting each semantic feature of the observed video and the tested eye movement track into the cognitive decision model to obtain the tested attention parameter;
inputting the attention parameter to be tested into a pre-trained classification model to obtain a classification result output by the classification model, and drawing the attention map to be tested according to the classification result.
Optionally, the cognitive decision model includes a gaze probability sub-model and an attention parameter sub-model;
inputting each semantic feature of the observed video and the tested eye movement track into the cognitive decision model to obtain the tested attention parameter, wherein the method specifically comprises the following steps of:
Determining the fixation point of each frame of image to be watched according to the tested eye movement track;
for each frame of image, determining the gazing probability of the gazing point of the frame of image when the observed video is watched under test through the gazing probability submodel according to the initialized attention parameter and each semantic feature of the gazing point of the frame of image;
and determining the attention parameter to be tested through the attention parameter submodel according to the gazing probability of the gazing point of each frame of image.
Optionally, the classification model is trained using the following method, wherein:
generating a virtual eye movement track according to the observation video;
taking the attention parameter of the virtual eye movement track and the observation video as training samples, and taking the coincidence probability of the gaze point and each semantic feature in the virtual eye movement track as a sample label;
inputting attention parameters contained in the training sample into a classification model to be trained, and obtaining a classification result of the virtual eye movement track output by the classification model to be trained;
and determining loss according to the difference between the sample label and the classification result of the virtual eye movement track, and training the classification model to be trained by taking the minimum loss as an optimization target.
Optionally, generating a virtual eye movement track according to the observation video specifically includes:
generating a virtual eye movement track of the virtual tested watching the observed video according to the observed video and the coincidence probability of the preset virtual tested fixation point and each semantic feature;
and screening out the virtual eye movement track conforming to the real eye movement data through virtual eye movement track screening conditions, wherein the screening conditions at least comprise the probability distribution of the fixation point, the eye movement speed and the identification of a discrimination network.
Optionally, the gaze probability sub-model includes a gaze sub-model and a probability sub-model;
according to the initialized attention parameters and semantic features of the gaze point of the frame image, determining the gazing probability of the gaze point of the frame image when the observed video is watched under test through the gazing probability submodel, wherein the method specifically comprises the following steps:
for each pixel point in the frame image, determining the gazing utility of the pixel point through the gazing sub-model according to the initialized attention parameters corresponding to the semantic features and the semantic features corresponding to the pixel point;
determining the gazing utility of the frame image according to the gazing utility of each pixel point in the frame image;
Determining the coordinates of the fixation point of the frame image according to the tested eye movement track;
determining the gazing utility of a pixel point of a corresponding position of the coordinates of the gazing point of the frame image in the frame image as the gazing utility of the gazing point of the frame image;
and determining the gazing probability of the gazing point of the frame image according to the gazing utility of the gazing point of the frame image and the gazing utility of the frame image through the probability submodel.
Optionally, extracting each semantic feature corresponding to the observed video according to a pre-trained semantic segmentation model specifically includes:
inputting the frame image into a pre-trained semantic segmentation model for each frame image of the observed video, and determining semantic features of each pixel point of the frame image, wherein the semantic features comprise: character features, object features, brightness features, motion features, face features;
when the semantic features of each pixel point of the frame image comprise face features, determining a face region in the frame image;
inputting the face region into a trained face feature point detection model, and determining a face core region in the face region;
determining a first weight of each pixel point of the face core area in the frame image and a second weight of each pixel point of the non-face core area in the frame image, wherein the first weight is larger than the second weight, and the larger the weight of each pixel point is, the larger the influence on the attention parameter is;
And determining the semantic features of each pixel point of the frame image according to the first weight and the second weight.
Optionally, obtaining an observation video and a tested eye movement track of a tested watching the observation video specifically includes:
obtaining an observation video and a virtual eye contour, and displaying the virtual eye contour on a screen;
collecting images in a designated area and displaying the images on the screen, wherein the designated area is the area where the tested eyes are located when the tested eye movement track is collected;
displaying prompt information on the screen, wherein the prompt information is used for prompting the tested eyes to coincide with the outline of the virtual eyes;
determining the position of the tested eye on the screen according to the collected images in the appointed area;
and when the position of the tested eye on the screen is matched with the position of the virtual eye outline displayed on the screen, playing the observation video, and collecting the tested eye movement track of the tested watching the observation video.
The present specification provides a visual attention assessment device comprising:
the acquisition module is used for acquiring an observation video and a tested eye movement track of a tested watching the observation video;
The extraction module is used for extracting each semantic feature corresponding to the observation video according to the pre-trained semantic segmentation model;
the determining module is used for determining a preset cognitive decision model, and inputting each semantic feature of the observed video and the tested eye movement track into the cognitive decision model to obtain the tested attention parameter;
and the classification module is used for inputting the attention parameter to be tested into a pre-trained classification model to obtain a classification result output by the classification model, and drawing the attention map to be tested according to the classification result.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the visual attention assessment method described above.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the visual attention assessment method described above when executing the program.
The above-mentioned at least one technical scheme that this specification adopted can reach following beneficial effect:
according to the visual attention assessment method provided by the specification, the observation video and the tested eye movement track of the tested watching observation video are obtained. And extracting each semantic feature corresponding to the observed video according to the pre-trained semantic segmentation model. And determining a preset cognitive decision model, and inputting each semantic feature of the observed video and the tested eye movement track into the cognitive decision model to obtain the tested attention parameter. And inputting the attention parameters of the tested eye movement track into a pre-trained classification model to obtain a classification result output by the classification model, and drawing a tested attention pattern according to the classification result.
Each semantic feature of the observation video is acquired through a pre-trained semantic segmentation model, and the attention parameters are determined. The attention parameters represent the preference of the tested for each semantic feature and reflect the degree of attention paid to each semantic feature when predicting the attention characteristics of the eye movement track, so that which information the tested eye movement track focuses on can be intuitively known, and the interpretability of the attention map is improved. This avoids the problem that the interpretability of the model-predicted diagnosis result cannot be guaranteed and the visual attention characteristics therefore cannot be accurately assessed from the eye movement track.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate and explain the exemplary embodiments of the present specification and their description, are not intended to limit the specification unduly. In the drawings:
fig. 1 is a schematic flow chart of a visual attention assessment method according to an embodiment of the present disclosure;
FIG. 2 is a schematic illustration of an attention pattern provided herein;
FIG. 3a is a schematic diagram of a training discrimination network provided in the present specification;
FIG. 3b is a schematic diagram of a usage discrimination network provided in the present specification;
FIG. 4 is a schematic illustration of a visual attention assessment provided herein;
FIG. 5 is a schematic view of a visual attention assessment device provided herein;
fig. 6 is a schematic structural diagram of an electronic device corresponding to fig. 1 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present application based on the embodiments herein.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a flow chart of a visual attention assessment method provided in the present specification, specifically including the following steps:
s100: and acquiring an observation video and a tested eye movement track of the observation video to be tested.
Since obtaining the observation video and the eye movement track of the tested watching the observation video usually involves processing a large amount of data and places high computational demands on the device, in the embodiments of the present disclosure the visual attention assessment method may be executed by a server. Of course, the present specification does not limit which device carries out the visual attention assessment method: devices such as a personal computer, a mobile terminal or a server may be used to acquire the observation video and the tested eye movement track and carry out the visual attention assessment method. For convenience of description, the following description takes the server as the execution subject.
In one or more embodiments herein, the server may obtain the observation video for the subject to view from a database of vision research institutions, network platforms, and related institutions. The obtained observation video may be a dedicated observation video set by a relevant institution for eye movement track study, or may be a non-dedicated observation video retrieved on a network platform. The present specification is not limited herein and may be acquired according to actual circumstances.
In one or more embodiments herein, the server may obtain a subject eye movement trajectory for a subject viewing an observation video through an eye movement tracking device.
Specifically, before the tested watches the observation video, the eye movement state of the tested is calibrated against fixation points on the screen of the eye tracking device, so that while the tested watches the observation video, the eye tracking device can collect the tested eye movement track.
S102: and extracting each semantic feature corresponding to the observed video according to the pre-trained semantic segmentation model.
In one or more embodiments of the present disclosure, a server inputs an acquired observation video into a pre-trained semantic segmentation model, and obtains each semantic feature of the observation video output by the pre-trained semantic segmentation model.
Specifically, the server extracts each semantic feature of each pixel point in each frame image of the observation video through the pre-trained semantic segmentation model. In the observation video, the elements are mainly persons and objects, both of which may move, and the brightness changes of different areas in the observation video differ.
The brightness changes of different areas in the observation video may attract the attention of the tested, and so may moving objects and persons; moreover, when a person appears in the observation video, the attention paid to the person and to objects differs. In general, when a person is present in the observation video, more of the tested's attention is allocated to the person, and when no person is present, the tested's attention falls on the objects of interest.
Therefore, persons, objects, brightness, motion and the like may all change the attention of the tested. In this specification, the semantics of the observation video are segmented using a semantic segmentation model, and the semantic features are divided into at least person, object, brightness and motion features, i.e., whether each pixel belongs to a person or an object, how bright it is, and whether it is moving.
In one or more embodiments of the present disclosure, semantic features that may be extracted by the semantic segmentation model may be preset, and a semantic segmentation model may be specifically trained.
In one or more embodiments of the present disclosure, the server may further extract physical features from semantic features such as motion and brightness of the observed video more accurately through digital image processing techniques.
Specifically, the server can obtain the brightness feature of each frame image by calculating the gray value of the frame image, or calculate the gray value of each pixel point of the frame image to obtain the brightness feature of each pixel point. The server can determine the motion feature of each frame image from the brightness change between the preceding and following frame images, i.e., the frame difference method: the brightness change between consecutive frames is identified and then binarized to obtain the motion feature.
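The gray-value and frame-difference computation described above can be sketched as follows; the function name, normalization and binarization threshold are illustrative assumptions rather than part of this disclosure:

```python
import cv2
import numpy as np

def brightness_and_motion_features(prev_frame, curr_frame, motion_threshold=25):
    """Per-pixel brightness (gray value) and binarized motion (frame difference) features."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)

    # Brightness feature: normalized gray value of each pixel of the current frame.
    brightness = curr_gray.astype(np.float32) / 255.0

    # Motion feature: absolute brightness change between consecutive frames,
    # binarized with a threshold (frame-difference method).
    diff = cv2.absdiff(curr_gray, prev_gray)
    motion = (diff > motion_threshold).astype(np.float32)

    return brightness, motion
```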
S104: and determining a preset cognitive decision model, and inputting each semantic feature of the observed video and the tested eye movement track into the cognitive decision model to obtain the tested attention parameter.
In one or more embodiments of the present disclosure, after determining each semantic feature of the observation video, the server also needs to determine the positions, i.e. coordinates, of the actual gaze points in the tested eye movement track. When the tested gazes at a certain area of a certain frame image in the observation video, the tested can be considered to be attracted by the semantic features of that area; therefore, after each semantic feature of the observation video is determined, the attention of the tested is analyzed according to the positions of the gaze points.
In one or more embodiments of the present disclosure, the server may determine a preset cognitive decision model, and input each semantic feature of the observed video and the eye movement track to be tested into the cognitive decision model to obtain the attention parameter to be tested.
Specifically, the cognitive decision model includes a gaze probability sub-model and an attention parameter sub-model. The gaze point of the tested in each frame image is determined according to the tested eye movement track. For each frame image, according to the initialized attention parameters and each semantic feature of the gaze point of the frame image, the gazing probability of the gaze point of the frame image when the tested watches the observation video is determined through the gaze probability sub-model. According to the gazing probability of the gaze point of each frame image, the attention parameters of the tested are determined through the attention parameter sub-model.
By determining the attention parameter to be tested, the process of accurately describing the eye movement track better explains the attention characteristics of the tested person when watching the observation video.
Further, in one or more embodiments of the present disclosure, the gaze probability sub-model includes a gaze sub-model and a probability sub-model, and for each frame of image of the observed video, for each pixel point in the frame of image, the server determines, according to each initialized attention parameter corresponding to each semantic feature and each semantic feature corresponding to the pixel point, a gaze utility of the pixel point through the gaze sub-model.
Specifically, according to the coordinates of each pixel point in the frame imageAnd the character feature corresponding to the pixel extracted from the observed video +.>Object characteristics->Luminance characteristics->And movement characteristics->The gazing utility of the pixel point is calculated by adopting the following formula:
wherein,and C, all represent attention parameters. C represents the distraction parameter in the attention parameter, when the point of regard is not at any one of the four types of characteristics of character, object, brightness and motion,the four attention parameters have values of 0, and the attention distraction parameter at this time reflects the gazing effect of the gazing point at a point except for the person, the object, the brightness and the movement in the observed video.
When the tested person looks at a certain area of a certain frame of image in the observed video, the tested person can be considered to be attracted by the semantic features of the area, or look at the area relativelyIn other areas of the same frame of image, higher benefits can be provided to the subject, which are called gaze utility. Each semantic feature may provide benefits, and the benefits expected from a pixel point in the gaze image may be related only to the content of the observed video, e.g., the benefits expected from a gaze character may be available And (3) representing. The fixation result of the specific eye movement track to be tested also depends on the preference of the semantic features, for example, the preference of watching the person can be used +.>And (3) representing. Multiplying the semantic features with both the attention parameters (i.e., preferences) can obtain the actual gaze utility under test.
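The per-pixel utility above can be evaluated as in the following sketch; the array layout, dictionary keys and parameter names are illustrative assumptions:

```python
import numpy as np

def gaze_utility_map(features, theta, distraction):
    """Per-pixel gazing utility: weighted sum of semantic features plus a distraction term.

    features: dict of HxW arrays for 'person', 'object', 'brightness', 'motion'.
    theta:    dict of scalar attention parameters for the same keys.
    distraction: scalar distraction parameter C.
    """
    h, w = next(iter(features.values())).shape
    utility = np.full((h, w), distraction, dtype=np.float64)
    for name in ('person', 'object', 'brightness', 'motion'):
        utility += theta[name] * features[name]
    return utility
```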
In one or more embodiments of the present disclosure, for each frame image of the observation video, the server may determine the gazing utility of the frame image from the gazing utilities of the pixel points of that frame image, for example as their sum. For each pixel point, the gazing probability of the pixel point is then determined through the probability sub-model as the ratio of the gazing utility of the pixel point to the gazing utility of the frame image, i.e. a formula of the following form:

P(x, y) = u(x, y) / Σ_(x', y') u(x', y')

where the sum runs over the pixel points of the frame image.
according to the eye movement track to be tested, the server can determine the coordinates of the fixation point of each frame of image in the observed video. For each gaze point, determining the gaze utility of the pixel point of the position of the coordinates of the gaze point in the image of the gaze point as the gaze utility of the gaze point. And taking the ratio of the gazing utility of the gazing point to the gazing utility of the image where the gazing point is located as the gazing probability of the gazing point through the probability submodel.
In one or more embodiments of the present description, the server may further calculate the attention parameters that maximize the gazing probability through maximum likelihood estimation. The attention parameter sub-model based on maximum likelihood estimation is easy to understand and implement and yields a more accurate estimate of the attention parameters, particularly when the number of pixel points is large. A formula of the following form can be used:

(θ_person, θ_object, θ_brightness, θ_motion, C) = argmax Σ_t log P(g_t)

where g_t is the gaze point of the t-th frame image and P(g_t) is its gazing probability under the current attention parameters.
The attention parameters that maximize the gazing probability, calculated according to the attention parameter sub-model and the maximum likelihood estimation method, are taken as the attention parameters of the tested.
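A minimal sketch of this estimation step, assuming the frame-level utility is the sum over pixels and using a generic numerical optimizer (the optimizer choice and parameter packing are assumptions):

```python
import numpy as np
from scipy.optimize import minimize

FEATURE_NAMES = ('person', 'object', 'brightness', 'motion')

def negative_log_likelihood(params, feature_maps, gaze_points):
    """params = [theta_person, theta_object, theta_brightness, theta_motion, C]."""
    nll = 0.0
    for features, (gx, gy) in zip(feature_maps, gaze_points):
        # Per-pixel gazing utility: weighted sum of semantic features plus distraction C.
        utility = params[4] + sum(p * features[n] for p, n in zip(params[:4], FEATURE_NAMES))
        utility = np.maximum(utility, 1e-9)       # keep utilities positive so ratios are probabilities
        p_gaze = utility[gy, gx] / utility.sum()  # gazing probability of this frame's gaze point
        nll -= np.log(p_gaze)
    return nll

def estimate_attention_parameters(feature_maps, gaze_points):
    x0 = np.full(5, 0.2)  # initialized attention parameters
    res = minimize(negative_log_likelihood, x0,
                   args=(feature_maps, gaze_points), method='Nelder-Mead')
    return res.x
```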
S106: inputting the attention parameter to be tested into a pre-trained classification model to obtain a classification result output by the classification model, and drawing the attention map to be tested according to the classification result.
In one or more embodiments of the present disclosure, the server may determine an attention profile of the eye movement trace of the subject according to the determined attention parameters of each frame of image of the observed video of the subject.
In one or more embodiments of the present disclosure, the server may input the attention parameter to be tested into the pre-trained classification model, obtain a classification result output by the pre-trained classification model, and draw the attention map to be tested according to the classification result.
The mean and variance of the classification results are calculated to draw the attention pattern of the subject.
Fig. 2 is a schematic diagram of an attention pattern provided in the present specification. In the figure, the black dots represent the mean value and the confidence range is represented between two black crosses.
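An attention map of the kind shown in fig. 2 (per-dimension mean with a confidence range) could be drawn as in the following sketch; matplotlib and the 95% interval are assumptions, not part of the disclosure:

```python
import numpy as np
import matplotlib.pyplot as plt

def draw_attention_map(classification_results,
                       labels=('person', 'object', 'brightness', 'motion', 'distraction')):
    """classification_results: array of shape (n_runs, n_dimensions)."""
    results = np.asarray(classification_results)
    mean = results.mean(axis=0)
    # Confidence range drawn as +/- 1.96 standard errors around the mean.
    half_range = 1.96 * results.std(axis=0, ddof=1) / np.sqrt(results.shape[0])

    plt.errorbar(range(len(labels)), mean, yerr=half_range,
                 fmt='ko', capsize=4)  # black dots = mean, caps = confidence range
    plt.xticks(range(len(labels)), labels)
    plt.ylabel('attention probability')
    plt.show()
```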
Based on the visual attention evaluation method shown in fig. 1, an observation video and a subject eye movement track of a subject watching the observation video are acquired. And extracting each semantic feature corresponding to the observed video according to the pre-trained semantic segmentation model. And determining a preset cognitive decision model, and inputting each semantic feature of the observed video and the tested eye movement track into the cognitive decision model to obtain the tested attention parameter. And inputting the attention parameters of the tested eye movement track into a pre-trained classification model to obtain a classification result output by the classification model, and drawing a tested attention pattern according to the classification result.
Each semantic feature of the observation video is acquired through a pre-trained semantic segmentation model, and the attention parameters are determined. The attention parameters represent the preference of the tested for each semantic feature and reflect the degree of attention paid to each semantic feature when predicting the attention characteristics of the eye movement track, so that which information the tested eye movement track focuses on can be intuitively known, and the interpretability of the attention map is improved. This avoids the problem that the interpretability of the model-predicted diagnosis result cannot be guaranteed and the visual attention characteristics therefore cannot be accurately assessed from the eye movement track.
Additionally, in one or more embodiments of the present description, the server may obtain the observation video and a virtual eye contour, and display the virtual eye contour on a screen. Images in a designated area are collected and displayed on the screen, where the designated area is the area in which the tested eyes are located when the tested eye movement track is collected. Prompt information is displayed on the screen to prompt the tested to align the eyes with the virtual eye contour. The position of the tested eyes on the screen is determined from the collected images of the designated area. When the position of the tested eyes on the screen matches the position of the virtual eye contour displayed on the screen, the observation video is played and the tested eye movement track of the tested watching the observation video is collected.
The server can also judge whether the collected position of the tested eye is matched with the position of the virtual eye outline. If yes, playing the observation video, and collecting the eye movement track of the observation video to be watched. If not, continuously displaying the prompt information on the screen.
This can help tested who are difficult to calibrate, such as infants between 0 and 3 years old, to use eye tracking equipment such as an eye tracker normally.
In one or more embodiments of the present disclosure, the displayed observation video is a continuous video, and, without considering blinks, the tested eye movement track that the server needs to collect is a continuous eye movement track. If, while the observation video is playing, the collected images show that the position of the tested eyes in the designated area no longer matches the position of the virtual eye contour displayed on the screen, a prompt to re-collect the eye movement track is displayed on the screen and the observation video is played again.
While the tested eye movement track is being collected, the virtual eye contour is not displayed on the screen, so as not to interfere with the tested watching the observation video.
In one or more embodiments of the present disclosure, the server may determine that the position of the tested eye on the screen matches the position of the virtual eye contour displayed on the screen when the distance between the position of the center of the tested eye on the screen and the position of the center of the virtual eye contour displayed on the screen is within a preset range.
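A sketch of this position-matching check follows; the distance threshold is an assumed value:

```python
import math

def eye_matches_contour(eye_center, contour_center, max_distance_px=30):
    """True when the tested eye center on screen is within the preset range
    of the virtual eye contour center."""
    dx = eye_center[0] - contour_center[0]
    dy = eye_center[1] - contour_center[1]
    return math.hypot(dx, dy) <= max_distance_px
```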
In one or more embodiments of the present disclosure, when the server obtains less eye movement track data, and the training sample for training the classification model is insufficient, a large number of virtual eye movement tracks of the virtual observation video under test may be generated as the training sample.
Specifically, according to the superposition probability of the observed video and each semantic feature corresponding to the preset virtual tested, generating a virtual eye movement track of the virtual tested observed video. And screening out a virtual eye movement track conforming to the real eye movement data by using a virtual eye movement track screening condition, wherein the screening condition at least comprises a fixation point probability distribution, an eye movement speed and a discrimination network identification, and the coincidence probability refers to the probability of each semantic feature to be watched.
In the stage of generating the virtual eye movement trajectory, the server may set a plurality of virtual subjects, and set the coincidence probability of the semantic features in advance for each virtual subject. For example, a virtual test is generated in which only a person is seen, and the probability of coincidence of the corresponding semantic features is 100% for a person, 0% for an object, 0% for brightness, and 0% for motion. Virtual test is generated to only see the object, and the coincidence probability of the corresponding semantic features is 0% of the person, 100% of the object, 0% of the brightness and 0% of the motion. Then, the gaze point randomly generated for each frame of image of the observation video can constitute one example of a virtual eye movement track.
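One way to realize this generation stage is sketched below: for each frame, a semantic feature type is sampled according to the virtual tested's coincidence probabilities and a gaze point is drawn from the pixels carrying that feature. The sampling scheme and function names are assumptions:

```python
import numpy as np

def generate_virtual_track(feature_maps, coincidence_probs, rng=None):
    """feature_maps: per-frame dict of HxW binary masks for 'person', 'object', 'brightness', 'motion'.
    coincidence_probs: dict mapping those keys plus 'distraction' to probabilities summing to 1."""
    rng = rng or np.random.default_rng()
    names = list(coincidence_probs.keys())
    probs = np.array([coincidence_probs[n] for n in names])
    track = []
    for features in feature_maps:
        chosen = str(rng.choice(names, p=probs))
        if chosen != 'distraction' and features[chosen].any():
            ys, xs = np.nonzero(features[chosen])       # pixels carrying the chosen feature
            idx = rng.integers(len(xs))
            track.append((int(xs[idx]), int(ys[idx])))
        else:
            # Distraction (or feature absent in this frame): any pixel of the frame.
            h, w = next(iter(features.values())).shape
            track.append((int(rng.integers(w)), int(rng.integers(h))))
    return track
```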
In the stage of screening the virtual eye movement track, the server can screen the virtual eye movement track which accords with the real human eye movement data through virtual eye movement track screening conditions such as gaze point probability distribution, eye movement speed, discrimination network identification and the like.
The gaze point probability distribution refers to statistics of real eye movement tracks of different observation videos, and the real gaze point probability distribution is obtained. The specification requires that the gaze point probability distribution of the generated virtual eye movement trajectory is close to the real gaze point probability distribution.
Eye movement speed refers to the natural eye movement speed controlled by the eye muscles. Eye movement velocity exhibits a significant nonlinear correlation with the magnitude of the angle of view being tested (angle from front view). Eye movement velocity is calculated by the position and time interval of two consecutive gaze points. The position of the gaze point is then converted into a viewing angle. The specification requires that the virtual tested eye movement speed of the generated virtual eye movement track is always within a reasonable range corresponding to the visual angle.
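The speed check can be computed from consecutive gaze points as in the following sketch, converting screen positions to visual angles; the screen geometry parameters are assumptions:

```python
import math

def angle_between_gaze_points_deg(p0, p1, screen_center_px, px_per_cm, viewing_distance_cm):
    """Visual angle between two gaze directions, with the eye viewing the screen center front-on."""
    def gaze_vector(p):
        x = (p[0] - screen_center_px[0]) / px_per_cm
        y = (p[1] - screen_center_px[1]) / px_per_cm
        return (x, y, viewing_distance_cm)
    v0, v1 = gaze_vector(p0), gaze_vector(p1)
    dot = sum(a * b for a, b in zip(v0, v1))
    n0 = math.sqrt(sum(a * a for a in v0))
    n1 = math.sqrt(sum(a * a for a in v1))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (n0 * n1)))))

def eye_movement_speed_deg_per_s(p0, p1, dt_s, screen_center_px, px_per_cm, viewing_distance_cm):
    """Eye movement speed between two consecutive gaze points, in degrees per second."""
    return angle_between_gaze_points_deg(p0, p1, screen_center_px,
                                         px_per_cm, viewing_distance_cm) / dt_s
```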
The discrimination network identification, also called the "real person mode", trains a deep-learning-based discrimination network on published real human eye movement track data and on virtual eye movement track data that already conforms to the gaze point probability distribution and eye movement speed, so that the network distinguishes real eye movement track data from virtual eye movement track data. The present specification requires that a generated virtual eye movement track be recognized as real eye movement track data by the discrimination network. Further, a generation network may also be established, thus constituting a classical generative adversarial network (Generative Adversarial Networks, GAN).
Fig. 3a is a schematic diagram of a training discrimination network provided in the present specification. And at the stage of training the discrimination network, inputting the real eye movement track data into the discrimination network, recognizing the discrimination network as a real person, inputting the virtual eye movement track data into the discrimination network, and recognizing the discrimination network as virtual.
Fig. 3b is a schematic diagram of a usage discrimination network provided in the present specification. A determination network is used to determine whether the virtual eye movement trace data can be identified as a real person. The virtual eye movement locus data recognized as a real person is regarded as a virtual eye movement locus required to be generated in the specification.
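A possible discrimination network of the kind shown in fig. 3a/3b is sketched below as a small GRU over gaze-point sequences in PyTorch; the architecture and hyperparameters are assumptions, not the network of this disclosure:

```python
import torch
import torch.nn as nn

class EyeTrackDiscriminator(nn.Module):
    """Classifies a gaze-point sequence as real (label 1) or virtual (label 0)."""
    def __init__(self, hidden_size=64):
        super().__init__()
        self.gru = nn.GRU(input_size=2, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, tracks):                   # tracks: (batch, T, 2) normalized coordinates
        _, h = self.gru(tracks)                  # h: (1, batch, hidden_size)
        return torch.sigmoid(self.head(h[-1]))  # probability of being a real track

# Training sketch: real tracks labelled 1 and virtual tracks labelled 0, optimized with
# binary cross-entropy; virtual tracks that the trained network classifies as real are
# kept as the generated virtual eye movement tracks.
```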
In one or more embodiments of the present description, the method for the server to train the classification model may be to generate a virtual eye movement track from the observed video. And taking the attention parameter and the observation video of the virtual eye movement track as training samples, and taking the coincidence probability of each semantic feature corresponding to the fixation point of the virtual eye movement track to be tested in the virtual eye movement track as a sample mark. And inputting the attention parameters contained in the training sample into the classification model to be trained to obtain the classification result of the virtual eye movement track output by the classification model to be trained. And determining loss according to the difference between the sample label and the classification result of the virtual eye movement track, taking the minimum loss as an optimization target, and training a classification model to be trained.
By generating a virtual eye movement trajectory representative of a virtual test, an unsupervised visual attention assessment can be achieved.
In one or more embodiments of the present description, the server may also set the difference between virtual tested classes. The number of virtual tested classes, for each of which an equal number of virtual eye movement tracks is generated, is determined from the inter-class difference as follows:
Δ represents the difference between classes and is a percentage; d is the feature dimension (e.g., person, object, brightness, motion, distraction: 5 dimensions); N is the number of virtual tested classes.
For example, with Δ = 20%, an equal number (e.g., 100) of virtual eye movement tracks is generated for each of the resulting classes of attention allocation over the semantic features.
First, the possible values are obtained from the inter-class difference. For example, for Δ = 20%, the value set is {0%, 20%, 40%, 60%, 80%, 100%}.
These values are then arranged and combined to obtain all possible d-dimensional vectors, i.e., a Cartesian product is calculated.
Finally, the vectors whose elements sum to 1 are selected, i.e., all possible attention allocation proportions; their number is N. The attention allocation proportion is the intrinsic attention preference of the tested (independent of the observation video content) and is therefore used as the output label of the classification model.
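The enumeration just described can be sketched as follows; the itertools-based implementation and function name are assumptions:

```python
from itertools import product

def attention_allocation_classes(delta_percent=20, d=5):
    """All d-dimensional attention allocation vectors with step delta_percent summing to 100%."""
    levels = list(range(0, 101, delta_percent))   # e.g. 0%, 20%, ..., 100%
    return [v for v in product(levels, repeat=d) if sum(v) == 100]

# Example: the virtual tested classes for delta = 20% and d = 5, with
# N = len(attention_allocation_classes(20, 5)).
```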
In one or more embodiments of the present description, the attention parameters are calculated and the classification models are trained. The number of classification models depends on the specific classification method, such as decision tree, support vector machine or neural network. For example, the simplest one-vs-rest scheme requires N classifiers, each taking the current class as the positive class and the remaining classes as negative classes.
In one or more embodiments of the present description, the server may also use the trained classification models for discrimination evaluation of observation videos. The content of an observation video affects the classification accuracy of the classification model, so comparing the classification accuracy obtained under different inter-class differences Δ allows the discrimination of the observation video to be evaluated: Δ is varied multiple times and the relation between Δ and classification accuracy is plotted, which constitutes the observation video discrimination evaluation. The higher the overall classification accuracy for a video, the better its discrimination.
In one or more embodiments of the present disclosure, when a face exists in an observation video acquired by a server, a pre-trained face feature point detection model may also be used to implement semantic segmentation of the face and the face.
Inputting each frame of image of the observation video into a pre-trained semantic segmentation model, and determining semantic features of each pixel point of the frame of image, wherein the semantic features comprise: character features, object features, brightness features, motion features, and face features. And when the semantic features of each pixel point of the frame image comprise the face features, determining the face region in the frame image. And inputting the face region into a trained face feature point detection model, and determining a face core region in the face region. Determining a first weight of each pixel point of a face core area in the frame image and a second weight of each pixel point of a non-face core area in the frame image, wherein the first weight is larger than the second weight, and the larger the first weight is, the larger the influence on the attention parameter is. And determining the semantic features of each pixel point of the frame image according to the first weight and the second weight.
The semantic segmentation model segments and identifies semantic features of each frame of image, the segmented character features are used for describing the attributes of characters in the image, the object features describe the attributes of various objects in the image, the brightness features describe the brightness degree of each pixel point in the image, and the motion features describe the motion information of each pixel point in the image. Face features are a subclass of character features, and mainly focus on the attributes of faces in images.
In one or more embodiments of the present disclosure, the person feature comprises a face feature, and only the face feature is weighted after the first weight and the second weight are determined.
When the server detects a face, a bounding box (bounding box) can be used to cut out the current frame image to obtain a face region. And obtaining the face characteristics of the face through a face characteristic point detection model which is trained in advance.
For example, a face region exists in the clipped current frame image, a face core region and a non-face core region of the face region are determined according to a face feature point detection model, the weight of the determined pixel points of the face core region is greater than one, the weight of the determined pixel points of the non-face core region is determined to be less than one, and when the character feature is calculated according to the weight, the weighting result of the pixel points of the face core region is greater than the weighting result of the pixel points of the non-face core region.
When a face appears in one frame of image, marks of the character features are given different weights, and the weights of the semantic features of the face core areas (five sense organs) are set to be larger than those of the semantic features of the rest of non-face core areas of the character, so that the judgment standard for watching the character is improved.
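A sketch of this face-core weighting, assuming the person mask and the face core mask from the facial-feature-point detection step are already available (the specific weight values are assumptions, chosen only so that the core weight is greater than one and the non-core weight smaller than one):

```python
import numpy as np

def weighted_person_feature(person_mask, face_core_mask, core_weight=1.5, non_core_weight=0.8):
    """Weight the person feature so that face-core (facial-feature) pixels contribute more.

    person_mask:    HxW binary mask of person pixels (face features are a subclass).
    face_core_mask: HxW binary mask of the face core region from facial feature point detection.
    """
    weights = np.where(face_core_mask > 0, core_weight, non_core_weight)
    return person_mask.astype(np.float32) * weights
```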
In one or more embodiments of the present description, there is a feature coupling problem because people in the observed video may often move. If five attention parameters are calculated in parallel, the specific gravity of active attention and passive attention cannot be controlled, the bias introduced by the observed video cannot be eliminated, and the generated attention spectrum cannot generate uniform evaluation standards for all the observed videos. This problem can be solved by calculating the probabilities of active attention and passive attention independently.
Specifically, the present application composes an active attention set {person, object, distraction}, a passive attention set {brightness, motion, distraction}, and a joint attention set {active (including person and object without distinguishing them), passive (including brightness and motion without distinguishing them), distraction}. Classification is then performed with trained classification models for each set separately, and the probabilities output by the classification models for the three sets are obtained: (p_person, p_object, p_C1) for the active set, (p_brightness, p_motion, p_C2) for the passive set, and (P_active, P_passive, P_C) for the joint set, where p_C1, p_C2 and P_C are the distraction probabilities corresponding to the three sets. Of these, only P_C is kept as the distraction probability, while P_active and P_passive represent the active and passive attention probabilities respectively. The probabilities corresponding to the five attention parameter dimensions are finally obtained by weighting the probabilities inside the active and passive sets by these two probabilities, e.g. in a form such as P(person) = P_active · p_person / (p_person + p_object), and correspondingly for object, brightness and motion.
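The combination step can be sketched as follows; the within-set normalization shown is a reconstruction of the weighting described above and may differ from the exact formula of the disclosure:

```python
def combine_attention_sets(active_probs, passive_probs, joint_probs):
    """active_probs:  (p_person, p_object, p_distraction_active)
    passive_probs: (p_brightness, p_motion, p_distraction_passive)
    joint_probs:   (P_active, P_passive, P_distraction)"""
    p_person, p_object, _ = active_probs
    p_brightness, p_motion, _ = passive_probs
    P_active, P_passive, P_distraction = joint_probs

    # Weight the within-set probabilities (excluding distraction) by the joint-set probabilities.
    return {
        'person':      P_active * p_person / (p_person + p_object),
        'object':      P_active * p_object / (p_person + p_object),
        'brightness':  P_passive * p_brightness / (p_brightness + p_motion),
        'motion':      P_passive * p_motion / (p_brightness + p_motion),
        'distraction': P_distraction,
    }
```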
in one or more embodiments of the present description, fig. 4 is a schematic diagram of one visual attention assessment provided herein. The conceptual layer of fig. 4 shows the reaction process of the human brain, and the calculation layer shows the technical concept of the scheme. The lower right hatched area in fig. 4 represents a picture that can be viewed by the eyes of a person, a rectangle in the picture representing an object, and a small person representing a person. In the human brain at the lower left of fig. 4, three oval hatched areas represent three areas of the human brain, respectively. The leftmost elliptical area is responsible for receiving pictures seen by eyes by a human brain, belongs to perception recognition, and represents extracting semantic features in the pictures in a conceptual layer, and represents extracting the semantic features by using a semantic segmentation model in a calculation layer. The middle elliptic region is responsible for the attention distribution of the human brain, belongs to attention selection, and in the conceptual layer, the human brain can have different attention distribution on each semantic feature, and the calculation layer calculates attention parameters through a cognitive decision model. The right shadow area is responsible for the physiological activities of eyes, belongs to physiological control, controls eye muscles and nerve fixation pictures in a conceptual layer, ensures that the fixation point of the eyes falls on semantic features attractive to a tested, and determines the eye movement track through eye movement calibration and fixation point identification in a calculation layer.
The above provides a visual attention assessment method for one or more embodiments of the present specification. Based on the same idea, the present specification further provides a corresponding visual attention assessment device, as shown in fig. 5.
Fig. 5 is a schematic diagram of a visual attention assessment device provided in the present specification, specifically including:
the acquisition module 500 is configured to acquire an observation video and a tested eye movement track of a tested person watching the observation video;
the extracting module 502 is configured to extract each semantic feature corresponding to the observed video according to a pre-trained semantic segmentation model;
the determining module 504 is configured to determine a preset cognitive decision model, and input each semantic feature of the observed video and the eye movement track to be tested into the cognitive decision model to obtain the attention parameter to be tested;
and the classification module 506 is configured to input the attention parameter to be tested into a pre-trained classification model, obtain a classification result output by the classification model, and draw the attention map to be tested according to the classification result.
Optionally, the determining module 504 is specifically configured to determine the gaze point of the tested in each frame image according to the tested eye movement track; for each frame image, determine, through the gaze probability sub-model and according to the initialized attention parameters and each semantic feature of the gaze point of the frame image, the gazing probability of the gaze point of the frame image when the tested watches the observation video; and determine the attention parameters of the tested through the attention parameter sub-model according to the gazing probability of the gaze point of each frame image.
Optionally, the classification module 506 is further configured to train a classification model by using the following method, generate a virtual eye movement track according to the observed video, use an attention parameter of the virtual eye movement track and the observed video as a training sample, use a coincidence probability of a gaze point and each semantic feature in the virtual eye movement track as a sample label, input the attention parameter contained in the training sample into the classification model to be trained, obtain a classification result of the virtual eye movement track output by the classification model to be trained, determine a loss according to a difference between the sample label and the classification result of the virtual eye movement track, and train the classification model to be trained with the minimum loss as an optimization target.
Optionally, the classifying module 506 is further specifically configured to generate a virtual eye movement track of the virtual tested to watch the observed video according to the observed video and a preset coincidence probability of the virtual tested gaze point and each semantic feature, and screen out a virtual eye movement track conforming to the real eye movement data through a virtual eye movement track screening condition, where the screening condition at least includes gaze point probability distribution, eye movement speed and discrimination network identification.
Optionally, the determining module 504 is further specifically configured to: for each pixel point in the frame image, determine the gazing utility of the pixel point through the gaze sub-model according to the initialized attention parameters corresponding to the semantic features and the semantic features corresponding to the pixel point; determine the gazing utility of the frame image according to the gazing utility of each pixel point in the frame image; determine the coordinates of the gaze point of the frame image according to the tested eye movement track; determine the gazing utility of the pixel point at the corresponding position of those coordinates in the frame image as the gazing utility of the gaze point of the frame image; and determine the gazing probability of the gaze point of the frame image through the probability sub-model according to the gazing utility of the gaze point of the frame image and the gazing utility of the frame image.
Optionally, the extracting module 502 is specifically configured to, for each frame of image of the observed video, input the frame of image into a pre-trained semantic segmentation model, and determine semantic features of each pixel point of the frame of image, where the semantic features include: when the semantic features of the pixels of the frame image comprise the human face features, determining a human face region in the frame image, inputting the human face region into a trained human face feature point detection model, determining a face core region in the human face region, determining a first weight of the pixels of the face core region in the frame image, and determining a second weight of the pixels of the non-face core region in the frame image, wherein the first weight is larger than the second weight, the larger the weight of the pixels has larger influence on attention parameters, and determining the semantic features of the pixels of the frame image according to the first weight and the second weight.
Optionally, the obtaining module 500 is specifically configured to obtain an observation video and a virtual eye contour, display the virtual eye contour on a screen, collect an image in a designated area, and display the image on the screen, where the designated area is an area where the tested eye is located when the tested eye movement track is collected, display prompt information on the screen, where the prompt information is used to prompt the tested eye to coincide with the virtual eye contour, determine a position of the tested eye on the screen according to the collected image in the designated area, and play the observation video when the position of the tested eye on the screen matches with the position of the virtual eye contour displayed on the screen, and collect the tested eye movement track of the observed video.
The present specification also provides a computer readable storage medium storing a computer program operable to perform a visual attention assessment method as provided in fig. 1 above.
The present specification also provides a schematic structural diagram of the electronic device shown in fig. 6. At the hardware level, as shown in fig. 6, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, and may of course also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and then runs it to implement the visual attention assessment method described above with respect to fig. 1.
Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present description, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or logic devices.
In the 1990s, it was still easy to tell whether an improvement to a technology was an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). With the development of technology, however, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures: designers nearly always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be implemented by a hardware entity module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a single PLD, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually manufacturing integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code before compiling must likewise be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can easily be obtained merely by slightly logically programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller; examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for implementing various functions may also be regarded as structures within the hardware component. Or even the means for implementing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer-readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, Phase-change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively simple; for the relevant parts, reference may be made to the corresponding description of the method embodiments.
The foregoing is merely an embodiment of the present specification and is not intended to limit the present specification. Various modifications and variations of the present specification will be apparent to those skilled in the art. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the present specification shall be included within the scope of the claims of the present specification.

Claims (10)

1. A visual attention assessment method, comprising:
acquiring an observation video and a tested eye movement track of the tested watching the observation video;
extracting each semantic feature corresponding to the observation video according to a pre-trained semantic segmentation model;
determining a preset cognitive decision model, and inputting each semantic feature of the observation video and the tested eye movement track into the cognitive decision model to obtain the tested attention parameter;
inputting the tested attention parameter into a pre-trained classification model to obtain a classification result output by the classification model, and drawing the attention map of the tested according to the classification result.
2. The method of claim 1, wherein the cognitive decision model comprises a gaze probability sub-model and an attention parameter sub-model;
inputting each semantic feature of the observation video and the tested eye movement track into the cognitive decision model to obtain the tested attention parameter specifically comprises:
determining the gaze point of each frame image watched by the tested according to the tested eye movement track;
for each frame image, determining, through the gaze probability sub-model, the gazing probability of the gaze point of the frame image when the tested watches the observation video, according to the initialized attention parameters and the semantic features of the gaze point of the frame image;
and determining the tested attention parameter through the attention parameter sub-model according to the gazing probability of the gaze point of each frame image.
3. The method of claim 1, wherein the classification model is trained by the following method:
generating a virtual eye movement track according to the observation video;
taking the attention parameter of the virtual eye movement track and the observation video as a training sample, and taking the coincidence probability between the gaze point in the virtual eye movement track and each semantic feature as a sample label;
inputting attention parameters contained in the training sample into a classification model to be trained, and obtaining a classification result of the virtual eye movement track output by the classification model to be trained;
and determining loss according to the difference between the sample label and the classification result of the virtual eye movement track, and training the classification model to be trained by taking the minimum loss as an optimization target.
4. The method of claim 3, wherein generating a virtual eye movement track according to the observation video specifically comprises:
generating a virtual eye movement track of a virtual tested watching the observation video according to the observation video and a preset coincidence probability between the gaze point of the virtual tested and each semantic feature;
and screening out the virtual eye movement track conforming to real eye movement data through virtual eye movement track screening conditions, wherein the screening conditions at least comprise the gaze point probability distribution, the eye movement speed and identification by a discrimination network.
5. The method of claim 2, wherein the gaze probability sub-model comprises a gaze sub-model and a probability sub-model;
and determining, through the gaze probability sub-model, the gazing probability of the gaze point of the frame image when the tested watches the observation video according to the initialized attention parameters and the semantic features of the gaze point of the frame image specifically comprises the following steps:
for each pixel point in the frame image, determining the gazing utility of the pixel point through the gaze sub-model according to the initialized attention parameters corresponding to the semantic features and the semantic features corresponding to the pixel point;
determining the gazing utility of the frame image according to the gazing utility of each pixel point in the frame image;
determining the coordinates of the gaze point of the frame image according to the tested eye movement track;
determining the gazing utility of the pixel point at the position in the frame image corresponding to the coordinates of the gaze point as the gazing utility of the gaze point of the frame image;
and determining, through the probability sub-model, the gazing probability of the gaze point of the frame image according to the gazing utility of the gaze point of the frame image and the gazing utility of the frame image.
6. The method of claim 1, wherein extracting each semantic feature corresponding to the observation video according to a pre-trained semantic segmentation model specifically comprises:
for each frame image of the observation video, inputting the frame image into the pre-trained semantic segmentation model and determining the semantic features of each pixel point of the frame image, wherein the semantic features comprise character features, object features, brightness features, motion features and face features;
when the semantic features of each pixel point of the frame image comprise face features, determining a face region in the frame image;
inputting the face region into a trained face feature point detection model, and determining a face core region in the face region;
determining a first weight of each pixel point of the face core region in the frame image and a second weight of each pixel point of the non-core region of the face in the frame image, wherein the first weight is larger than the second weight, and a pixel point with a larger weight has a larger influence on the attention parameter;
and determining the semantic features of each pixel point of the frame image according to the first weight and the second weight.
7. The method of claim 1, wherein acquiring an observation video and a tested eye movement track of the tested watching the observation video specifically comprises:
obtaining an observation video and a virtual eye contour;
displaying the virtual eye contour and prompt information on a screen, wherein the prompt information is used for prompting the tested to make the tested eyes coincide with the virtual eye contour;
collecting images in a designated area and displaying the collected images on the screen, wherein the designated area is the area where the tested eyes are located when the tested eye movement track is collected;
determining the position of the tested eyes on the screen according to the collected images in the designated area;
and when the position of the tested eyes in the image on the screen matches the position of the virtual eye contour displayed on the screen, playing the observation video and collecting the tested eye movement track of the tested watching the observation video.
8. A visual attention assessment device, comprising:
the acquisition module is used for acquiring an observation video and a tested eye movement track of the tested watching the observation video;
the extraction module is used for extracting each semantic feature corresponding to the observation video according to the pre-trained semantic segmentation model;
the determining module is used for determining a preset cognitive decision model, and inputting each semantic feature of the observation video and the tested eye movement track into the cognitive decision model to obtain the tested attention parameter;
and the classification module is used for inputting the tested attention parameter into a pre-trained classification model to obtain a classification result output by the classification model, and drawing the attention map of the tested according to the classification result.
9. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-7 when executing the program.
CN202410069098.XA 2024-01-17 2024-01-17 Visual attention assessment method, device, medium and equipment Active CN117576771B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410069098.XA CN117576771B (en) 2024-01-17 2024-01-17 Visual attention assessment method, device, medium and equipment


Publications (2)

Publication Number Publication Date
CN117576771A true CN117576771A (en) 2024-02-20
CN117576771B CN117576771B (en) 2024-05-03

Family

ID=89896040



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107929007A (en) * 2017-11-23 2018-04-20 北京萤视科技有限公司 A kind of notice and visual capacity training system and method that tracking and intelligent evaluation technology are moved using eye
CN112086196A (en) * 2020-09-16 2020-12-15 中国科学院自动化研究所 Method and system for multi-selective attention assessment and training
US20220392080A1 (en) * 2021-06-03 2022-12-08 Electronics And Telecommunications Research Institute Apparatus and method for supporting attention test based on attention map and attention movement map
CN113989912A (en) * 2021-12-07 2022-01-28 杜青阳 Cognitive classification and prediction method and system based on eye movement track and deep learning
CN115359092A (en) * 2022-08-19 2022-11-18 北京百度网讯科技有限公司 Method and device for training gaze point prediction model and electronic equipment
CN116091792A (en) * 2023-01-04 2023-05-09 上海大学 Method, system, terminal and medium for constructing visual attention prediction model
CN116955943A (en) * 2023-05-17 2023-10-27 深圳大学 Driving distraction state identification method based on eye movement sequence space-time semantic feature analysis
CN116665111A (en) * 2023-07-28 2023-08-29 深圳前海深蕾半导体有限公司 Attention analysis method, system and storage medium based on video conference system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Aoqi Li et al.: "Scanpath mining of eye movement trajectories for visual attention analysis", 2017 IEEE International Conference on Multimedia and Expo (ICME), 31 August 2017, pages 535-540 *
Cai Chao: "Research on Driver Attention Detection Based on Monocular Vision", China Master's Theses Full-text Database (Electronic Journal), vol. 2020, no. 07, 15 July 2020 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117854714A (en) * 2024-03-07 2024-04-09 之江实验室 Information recommendation method and device based on eye movement tracking
CN117854714B (en) * 2024-03-07 2024-05-24 之江实验室 Information recommendation method and device based on eye movement tracking

Also Published As

Publication number Publication date
CN117576771B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
Fridman et al. Cognitive load estimation in the wild
CN111091576B (en) Image segmentation method, device, equipment and storage medium
Caridakis et al. Modeling naturalistic affective states via facial and vocal expressions recognition
CN117576771B (en) Visual attention assessment method, device, medium and equipment
CN110796199B (en) Image processing method and device and electronic medical equipment
US20150092983A1 (en) Method for calibration free gaze tracking using low cost camera
US20220148333A1 (en) Method and system for estimating eye-related geometric parameters of a user
WO2020042542A1 (en) Method and apparatus for acquiring eye movement control calibration data
CN105224285A (en) Eyes open and-shut mode pick-up unit and method
JP2017215963A (en) Attention range estimation device, learning unit, and method and program thereof
Pang et al. A stochastic model of selective visual attention with a dynamic Bayesian network
CN112418135A (en) Human behavior recognition method and device, computer equipment and readable storage medium
Krassanakis et al. Detection of moving point symbols on cartographic backgrounds
Panetta et al. Software architecture for automating cognitive science eye-tracking data analysis and object annotation
Naveed et al. Eye tracking system with blink detection
CN115205262A (en) Microscopic image processing method and device, computer equipment and storage medium
CN111860057A (en) Face image blurring and living body detection method and device, storage medium and equipment
Masdiyasa et al. A new method to improve movement tracking of human sperms
Vrânceanu et al. NLP EAC recognition by component separation in the eye region
Zhang et al. Eye gaze estimation and its applications
CN112949353A (en) Iris silence living body detection method and device, readable storage medium and equipment
CN111598144A (en) Training method and device of image recognition model
CN115299945A (en) Attention and fatigue degree evaluation method and wearable device
Skowronek et al. Eye Tracking Using a Smartphone Camera and Deep Learning
Laco et al. Depth in the visual attention modelling from the egocentric perspective of view

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant