CN114596624B - Human eye state detection method and device, electronic equipment and storage medium - Google Patents

Human eye state detection method and device, electronic equipment and storage medium

Info

Publication number
CN114596624B
CN114596624B (application CN202210412912.4A)
Authority
CN
China
Prior art keywords
eye
candidate region
feature map
layer
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210412912.4A
Other languages
Chinese (zh)
Other versions
CN114596624A (en)
Inventor
周波
邹小刚
苗瑞
梁书玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Haiqing Zhiyuan Technology Co ltd
Original Assignee
Shenzhen HQVT Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen HQVT Technology Co Ltd filed Critical Shenzhen HQVT Technology Co Ltd
Priority to CN202210412912.4A priority Critical patent/CN114596624B/en
Publication of CN114596624A publication Critical patent/CN114596624A/en
Application granted granted Critical
Publication of CN114596624B publication Critical patent/CN114596624B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of deep learning, and in particular to a human eye state detection method and device, an electronic device and a storage medium. The method comprises: inputting an original picture into a residual-in-residual dense block (RRDB) network to obtain a feature map; inputting the feature map into a region proposal network (RPN) to obtain a plurality of candidate region positioning frames, determining target candidate region positioning frames containing an eye image among the candidate region positioning frames, and calculating the eye state corresponding to each target candidate region positioning frame based on a predefined algorithm. Because the RRDB network makes the feature data of all layers available when producing the required feature map, all of the eye features of the original picture are retained, which improves the accuracy and robustness of eye state detection.

Description

Human eye state detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of deep learning technologies, and in particular, to a method and an apparatus for detecting a state of a human eye, an electronic device, and a storage medium.
Background
With the development of deep learning technology, a series of deep-learning-based target detection algorithms has emerged, such as Faster R-CNN (Region-based Convolutional Neural Networks), Fast R-CNN and the like, which can be used for face detection, pedestrian detection, vehicle detection and so on. Locating the human eyes and detecting their state can effectively assist related work in the computer vision field such as face detection, expression recognition, posture estimation and human-computer interaction. In addition, the blink frequency can be calculated from the eye positions and states in order to judge fatigue, which is simpler and faster than fatigue monitoring with wearable devices that rely on electrocardiogram or electroencephalogram signals.
In the prior art, a statistical model based on the appearance of eyes or an algorithm based on the inherent characteristics and structural information of eyes can be used for positioning the eyes, and further, the shape of the eyes is classified manually, or the state of the eyes is classified based on a human eye template, or the state of the eyes is detected based on a state classification algorithm of machine learning.
However, the above method usually requires a lot of manual involvement and takes a lot of manpower, and the above method for positioning human eyes and classifying states is often affected by factors such as facial expressions, posture changes, illumination, shielding, background interference, image definition, and the like, and is poor in accuracy and robustness.
Disclosure of Invention
The application provides a human eye state detection method, a human eye state detection device, electronic equipment and a storage medium, which are used for solving the problems that the existing human eye positioning and state classification method is easily influenced by factors such as facial expressions, posture changes, illumination, shielding, background interference, image definition and the like, and the accuracy and robustness are poor.
In a first aspect, the present application provides a method for detecting a state of a human eye, the method comprising:
inputting an original picture into a residual-in-residual dense block (RRDB) network to obtain a feature map;
inputting the feature map into a region proposal network (RPN) to obtain a plurality of candidate region positioning frames, determining a target candidate region positioning frame containing an eye image among the candidate region positioning frames, and calculating the eye state corresponding to the target candidate region positioning frame based on a predefined algorithm.
Optionally, the RRDB network includes a plurality of residual dense blocks, each residual dense block including a plurality of convolutional layers; inputting an original picture into a nested residual dense block RRDB network to obtain a feature map, wherein the feature map comprises the following steps:
inputting an original picture into each residual error dense block, and obtaining eye feature data output by a first layer and a middle layer and a local feature map output by a last layer through calculation of a plurality of layers of convolution layers;
fusing the eye feature data of each layer calculated by each residual dense block based on the position area of the local feature map to obtain a feature map comprising the eye feature data
And fusing the fused feature maps corresponding to the residual error dense blocks again to obtain the required feature maps.
Optionally, each convolution layer is provided with a corresponding filtering mechanism, and the filtering mechanism is used for extracting eye feature data corresponding to the original picture; inputting an original picture into the residual error dense block, and obtaining eye feature data output by the first layer and the middle layer and a local feature map output by the last layer through calculation of the multilayer convolution layer, wherein the method comprises the following steps:
and inputting an original picture into the residual error dense block, and sequentially performing dimension reduction processing and eye feature extraction on the original picture by using a filtering mechanism corresponding to each convolution layer to obtain eye feature data output by the first layer and the middle layer and a local feature map output by the last layer.
Optionally, inputting the feature map into a regional suggestion network RPN to obtain a plurality of candidate region positioning frames, and determining a target candidate region positioning frame containing an eye image in the plurality of candidate region positioning frames, includes:
inputting the feature map into an RPN to obtain a plurality of candidate region positioning frames;
mapping the candidate area positioning frames to the characteristic diagram through mapping transformation of coordinates to obtain characteristic matrixes of the candidate area positioning frames;
calculating the intersection ratio of the candidate region positioning frames based on the feature matrix, and judging whether the candidate region positioning frames contain eye images or not based on the intersection ratio;
determining a target candidate area positioning frame according to the judgment result; and the candidate region positioning frame containing the eye image is a candidate region positioning frame with the intersection ratio larger than a preset threshold value.
Optionally, calculating the eye state corresponding to the target candidate region location frame based on a predefined algorithm includes:
scaling the target candidate region positioning frame to a predefined size, and splicing and de-duplicating the scaled target candidate region positioning frame;
and classifying the spliced and de-duplicated candidate regions by utilizing a normalized index function, and outputting the human eye state.
Optionally, the method further includes:
acquiring a training data set; the training data set comprises a human face data set and labels corresponding to the human eye state data set; the label is used for distinguishing an eye opening state, an eye closing state and a background;
and dividing the training data set into a training set, a verification set and a test set according to a predefined proportion, and inputting the training set, the verification set and the test set into the RRDB network for training, verifying and testing.
Optionally, the method further includes:
obtaining statistical verification results, wherein the verification results comprise correct detection, failed detection and wrong detection; the detection correctly indicates that the detection result is that the human eye state is actually the human eye state; the detection failure indicates that the detection result is that the human eye state is actually a background; the detection error indicates that the detection result is that the background is actually a human eye state;
calculating an evaluation index based on the number of correct detections, the number of failed detections and the number of errors in the verification result; the evaluation index is used for evaluating the accuracy of human eye state detection;
adjusting parameters in the RRDB network and the RPN based on the evaluation index.
Optionally, the method further includes:
and when detecting that the number of the pictures in the eye closing state is larger than the preset number, sending an alarm prompt to the user terminal equipment to remind the user of being in a fatigue state.
In a second aspect, the present application provides an apparatus for detecting a state of a human eye, the apparatus comprising:
the input module is used for inputting an original picture into a nested residual error dense block RRDB network to obtain a feature map;
and the processing module is used for inputting the characteristic diagram into a regional suggestion network RPN to obtain a plurality of candidate region positioning frames, determining a target candidate region positioning frame containing the eye image in the candidate region positioning frames, and calculating the eye state corresponding to the target candidate region positioning frame based on a predefined algorithm.
In a third aspect, the present application provides an electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored by the memory to implement the method of any of the first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium storing computer-executable instructions for implementing the method according to any one of the first aspect when executed by a processor.
In summary, the present application provides a method, an apparatus, an electronic device and a storage medium for detecting a human eye state, which can obtain a feature map by inputting an original picture into a nested residual error dense block RRDB network; and further inputting the feature map into a region suggestion network RPN to obtain a plurality of candidate region positioning frames, further determining a plurality of target candidate region positioning frames containing eye images in the candidate region positioning frames, and calculating eye states corresponding to the target candidate region positioning frames based on a predefined algorithm. Therefore, the feature data in all layers can be obtained based on the RRDB network, so that the required feature map is obtained, all the features of the eye of the original picture are reserved, and the accuracy and the robustness of the eye state detection can be improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a human eye state detection method according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an RRDB network according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of a specific human eye state detection method according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a human eye state detection device according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the claimed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The term "plurality" in this application means two or more. The term "and/or" in this application is only one kind of association relationship describing the associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" in the present application generally indicates that the preceding and following related objects are in an "or" relationship; in the formula, the character "/" indicates that the preceding and following related objects are in a relationship of "division". "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple.
In the embodiments of the present application, terms such as "first" and "second" are used to distinguish the same or similar items having substantially the same function and action. For example, the first device and the second device are only used for distinguishing different devices, and the sequence order thereof is not limited. Those skilled in the art will appreciate that the terms "first," "second," etc. do not denote any order or quantity, nor do the terms "first," "second," etc. denote any order or importance.
It is noted that, in the present application, words such as "exemplary" or "for example" are used to mean exemplary, illustrative, or descriptive. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
It is to be understood that the various numerical references referred to in the embodiments of the present application are merely for descriptive convenience and are not intended to limit the scope of the embodiments of the present application.
It should be understood that, in the embodiment of the present application, the size of the serial number of each process described below does not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiment of the present application.
Embodiments of the present application will be described below with reference to the accompanying drawings. Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application. The human eye state detection method provided in the present application may be applied to the application scenario shown in fig. 1, which includes an original picture 101 and a server 102. The server 102 may detect the eye state of a person in the original picture 101. Specifically, the server 102 may input the obtained original picture 101 into a human eye state detection model for detection, so as to obtain the eye state of the person, such as an eye-open state or an eye-closed state. The human eye state detection model includes a Residual-in-Residual Dense Block (RRDB) network and a Region Proposal Network (RPN); the RRDB network is configured to obtain a feature map from the original picture, the feature map retaining the globally dense features, that is, the integrity of the eye feature data; the RPN is configured to generate candidate regions according to the feature map and to perform target classification on the feature matrices formed from the candidate regions, so as to accelerate processing.
In the prior art, a statistical model based on the appearance of eyes or an algorithm based on the inherent characteristics and structural information of eyes can be used for positioning the eyes, and further, the shape of the eyes is classified manually, or the state of the eyes is classified based on a human eye template, or the state of the eyes is detected based on a state classification algorithm of machine learning.
Specifically, the statistical model performs eye positioning by mainly recognizing basic appearance characteristics (such as eye width, shape, color characteristics, and the like) of the eyes, or performs eye positioning by using an algorithm for calculating a structural position of the eyes on the face; further, the current open-close state may be determined by setting a threshold value based on the characteristics of the relevant appearance data of the open eye and the closed eye, or the object detection may be performed based on a state classification algorithm of machine learning, so as to obtain the human eye state, such as the yolo algorithm.
However, the above method usually requires a lot of manual involvement and takes a lot of manpower, and the above method for positioning human eyes and classifying states is often affected by factors such as facial expressions, posture changes, illumination, shielding, background interference, image definition, and the like, and is poor in accuracy and robustness.
In a possible implementation manner, an integral projection method may be used to position the human eyes, and then a Support Vector Machine (SVM), Adaboost (i.e., an iterative algorithm) or other classifier may be trained on the human eye images in the open-closed state to classify the human eye states, so as to obtain the open-closed eye states.
However, the method needs to extract accurate human eye features based on huge prior data so as to perform model training, and the calculated amount is large.
Based on the above, the present application provides a human eye state detection method that improves the Faster R-CNN algorithm for human eye positioning and state detection. The backbone network of the algorithm, the Visual Geometry Group (VGG) 16 network, is replaced by a more effective RRDB network so that globally dense features are retained. An image is input into the RRDB network to obtain a feature map reflecting the open or closed state of the eyes, candidate frames are then generated with an RPN structure, a plurality of target candidate region positioning frames are determined from the candidate frames generated by the RPN, and the eye state in the target candidate region positioning frames is judged based on a predefined algorithm. In this way, the feature data of all layers can be obtained from the RRDB network to produce the required feature map, all of the eye features of the original image are retained, and the accuracy and robustness of human eye state detection can be further improved.
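The overall flow described above can be sketched as a two-stage module. The following is a minimal illustration only, assuming a PyTorch-style implementation (the patent does not name a framework); the class and argument names such as EyeStateDetector, backbone, rpn and head are illustrative and not taken from the patent.

```python
# Minimal sketch of the two-stage pipeline, assuming PyTorch; names are illustrative.
import torch
import torch.nn as nn

class EyeStateDetector(nn.Module):
    """RRDB backbone -> RPN proposals -> per-proposal eye-state classifier."""

    def __init__(self, backbone: nn.Module, rpn: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # RRDB network producing the globally dense feature map
        self.rpn = rpn            # region proposal network producing candidate positioning frames
        self.head = head          # RoI pooling + fully connected layers + Softmax

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feature_map = self.backbone(image)              # step S201
        proposals = self.rpn(feature_map)               # candidate region positioning frames
        eye_states = self.head(feature_map, proposals)  # step S202: open eye / closed eye / background
        return eye_states
```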
Exemplarily, fig. 2 is a schematic flowchart of a method for detecting a human eye state provided in an embodiment of the present application, and as shown in fig. 2, the method in the embodiment of the present application includes:
s201, inputting the original picture into a nested residual error dense block RRDB network to obtain a feature map.
In the embodiment of the present application, the RRDB includes a plurality of Residual Dense Blocks (RDBs) and an overall Residual structure. Compared with the backbone network of VGG16, the RRDB can fully utilize all hierarchical features of the input original picture, where the VGG16 network is composed of 13 convolutional layers and 3 fully-connected layers, each convolutional layer and fully-connected layer uses the ReLU activation function, and only feature extraction can be performed layer by layer, and the RRDB network considers the output results of all previous convolutional layers in each convolutional layer feature extraction, thereby preserving the integrity of feature data.
In this step, the feature map may refer to a feature map containing globally dense features in the original picture, and the feature map adaptively retains and accumulates the eye features of each layer in a fused manner, so that the feature map retains all valid eye features of the original picture. For example, each convolution layer can be judged layer by layer and screened out according with the eye characteristics based on facial expression, posture change, illumination, shielding, background interference, image definition and the like.
For example, in the application scenario of fig. 1, the server 102 may input the obtained original picture 101 into the RRDB network, and further, the original picture 101 is processed by convolution layers in each residual dense block in the RRDB network, so as to obtain a feature map that retains all valid features of the original picture.
S202, inputting the feature map into a region suggestion network RPN to obtain a plurality of candidate region positioning frames, determining target candidate region positioning frames containing eye images in the candidate region positioning frames, and calculating eye states corresponding to the target candidate region positioning frames based on a predefined algorithm.
The target candidate region positioning frames are multiple, and the eye states may include an eye opening state and an eye closing state, or may have other states, which is not specifically limited in this embodiment of the application.
In the embodiment of the application, the RPN is used for generating candidate regions: several convolution layers are added on top of the last layer of the network, and a bounding box and a target score are regressed at each position of a fixed grid. The input of the RPN may be an image of any size, and the output is a set of rectangular candidate region positioning frames with object confidences.
For example, the RPN slides a 3 × 3 window over the convolution feature map output by the last shared convolution layer to generate candidate region frames and obtain image features; meanwhile, each point on the convolution feature map is mapped back to the original picture, and k candidate region frames are predicted at that point. To ensure multi-scale detection, k may be 9: the candidate region frames share the same center, three length-width ratios (1:2, 2:1 and 1:1) are used, and each ratio is combined with 3 areas, so that the obtained candidate region frames cover the eye features as completely as possible.
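As a concrete illustration of the anchor scheme just described (nine frames per feature-map point, three aspect ratios, three areas), the following sketch generates such candidate frames; the function name make_anchors and the use of NumPy are assumptions for illustration, not part of the patent.

```python
import itertools
import numpy as np

def make_anchors(center_x: float, center_y: float,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)) -> np.ndarray:
    """Return k = len(scales) * len(ratios) candidate frames (cx, cy, w, h) sharing one center."""
    anchors = []
    for scale, ratio in itertools.product(scales, ratios):
        # keep the frame area close to scale**2 while varying the width/height ratio
        w = scale * np.sqrt(ratio)
        h = scale / np.sqrt(ratio)
        anchors.append((center_x, center_y, w, h))
    return np.array(anchors)  # shape (9, 4) for the default 3 scales x 3 ratios

# One such set of 9 frames is predicted for every point of the feature map, each point
# mapping back to a 16 x 16 region of the original picture (a feature stride of 16).
```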
Specifically, the RPN is used for generating the candidate region positioning frame, so that the position of human eyes in an original picture can be obtained, the candidate region positioning frame generated by the RPN is projected onto the feature map to obtain a corresponding feature matrix, a plurality of target candidate region positioning frames can be further determined based on the feature matrix, the plurality of target candidate region positioning frames are spliced, a feature map containing complete eye features can be obtained, and the state of human eyes in the feature map can be further judged.
The candidate region positioning frames may be represented by the position information of a group of frames in the picture. The position of each frame can be written as (x, y, w, h), where x and y are the center coordinates of the candidate region positioning frame and w and h are its width and height; equivalently, a frame can be described by its upper-left and lower-right corners. Each point on the feature map corresponds to a 16 × 16 region of the original picture, the aspect ratios of the candidate region positioning frames are 1:2, 2:1 and 1:1, and their scales are 128, 256 and 512 respectively, so that each frame corresponds to a part of the image to be framed and the frames can cover a sufficiently large region.
In this step, the predefined algorithm refers to an algorithm that can distinguish human eye states, for example, a Softmax function (i.e., a normalized exponential function) based on the proposed feature mapping can be used to determine the category of the target candidate region location box proposal, i.e., an open-eye state or a closed-eye state, and the predefined algorithm is not particularly limited in the embodiments of the present application.
For example, in the application scenario of fig. 1, the server 102 may input the feature map obtained based on the RRDB network into the RPN, further may obtain a plurality of candidate region positioning frames, where the plurality of candidate region positioning frames include candidate regions with different scales of each region in the original picture, further may determine a plurality of target candidate region positioning frames including the eye image in the plurality of candidate region positioning frames, further splices the plurality of target candidate region positioning frames, and may calculate, based on a predefined algorithm, eye states, such as an eye open state, corresponding to the spliced target candidate region positioning frames.
It should be noted that, when the method provided by the present application is used to perform eye state detection, a detection result may further include a background, that is, other results besides the eye state, such as a result that no eye state is detected, or a result that only glasses, eyebrows, and the like are detected, which is not specifically limited in this embodiment of the present application.
Therefore, the application provides a human eye state detection method, which can obtain a required characteristic diagram by inputting an original picture into an RRDB network; and further inputting the feature map into a network RPN to obtain a plurality of candidate region positioning frames, further determining a plurality of target candidate region positioning frames containing eye images in the candidate region positioning frames, and calculating the eye state corresponding to the target candidate region positioning frames based on a predefined algorithm. Therefore, the feature data in all layers can be acquired based on the RRDB network, the integrity of the eye feature data is reserved, and a required feature map is further acquired.
Optionally, the RRDB network includes a plurality of residual dense blocks, each residual dense block including a plurality of convolutional layers; inputting an original picture into a nested residual dense block RRDB network to obtain a feature map, wherein the feature map comprises the following steps:
inputting an original picture into each residual error dense block, and obtaining eye feature data output by a first layer and a middle layer and a local feature map output by a last layer through calculation of a plurality of layers of convolution layers;
fusing the eye feature data of each layer calculated by each residual dense block based on the position area of the local feature map to obtain a feature map comprising the eye feature data
And fusing the fused feature maps corresponding to the residual error dense blocks again to obtain the required feature maps.
In this embodiment, the RRDB network may include a plurality of residual dense blocks RDBs, each RDB may include a plurality of convolutional layers, and each convolutional layer corresponds to a corresponding activation function; fig. 3 is a schematic structural diagram of an RRDB network according to an embodiment of the present disclosure; taking the example of three RDBs in one RRDB network as shown in fig. 3, specifically, the output of each RDB in one RRDB network can directly access the next RDB, so that the locally fused features can be delivered continuously, and in particular, the RRDB network contains a continuous memory mechanism and global feature fusion with local residual learning.
On one hand, the RRDB network can not only read the state from the previous RDB, but also fully utilize all convolution layers (Conv) therein through local dense connection, each Conv is corresponding to an LReLU (Leaky-ReLU) function, and further, the feature data output by each Conv can be adaptively reserved based on the local feature fusion of the output of the RDB; on the other hand, by using global residual learning, the RRDB network can combine shallow features and deep features together to obtain global dense features from the original picture, and the local feature fusion can realize extremely high growth rate through the training of a stable larger network, so that when the features of the original picture are obtained, the RRDB-based backbone network can completely and accurately retain the effective features of eyes in all the original pictures.
The local feature map output by each RDB may not include complete eye feature data, so that data output by each convolution layer in the RDB is overlapped based on the local feature map (i.e., the eye feature map), and feature data possibly screened out from the local feature map is supplemented, so that finally obtained feature data is more complete.
It should be noted that, each Conv filter may have different factors, and output feature data thereof is also different, for example, a first convolution layer in a certain RDB may screen out background interference in the feature map, further, a second convolution layer may screen out a mask in the feature map, a third convolution layer may screen out a mouth and a nose below eyes in the feature map, and a fourth convolution layer may screen out an eyebrow and a forehead above eyes in the feature map, and finally, a local feature map including eye features may be obtained; the convolution process of multiple convs in each RDB is not particularly limited in the embodiments of the present application, and the above is only an example.
It is understood that the local feature map is also composed of eye feature data, and the position region where the local feature map is located is the corresponding eye region.
For example, in the application scenario of fig. 1, after the server 102 inputs the original picture 101 into the RRDB network, the specific processing procedure of the RRDB network is as follows: inputting the original picture 101 into an RDB (remote data base) based on each residual dense block in an RRDB (remote random access memory) network, and obtaining eye feature data output by a first layer and a middle layer and a local feature map output by a last layer through calculation of a plurality of convolution layers; further, the eye feature data of each layer obtained by calculating each RDB is overlapped with the eye feature data of each layer by taking the position area where the local feature map is located as a reference, the feature data lacking in the area is supplemented, and then the feature map including the eye feature data can be obtained, and further, the fused feature maps corresponding to each residual error dense block are fused again, that is, the feature data are overlapped, and the feature data are summarized, so that the required feature map is obtained.
Therefore, the method and the device can acquire the more complete characteristic diagram of the eye characteristics, so that the detection result is more accurate.
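The dense connections, LReLU activations, local feature fusion and nested residual structure described above can be expressed compactly in code. Below is a minimal sketch assuming PyTorch; the channel widths, growth rate, residual scaling factor of 0.2 and the choice of three RDBs per RRDB follow the common RRDB formulation and the example of fig. 3, and are assumptions rather than values fixed by the patent.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """Densely connected conv layers: every layer sees the outputs of all earlier layers."""

    def __init__(self, channels: int = 64, growth: int = 32, n_layers: int = 4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels + i * growth, growth, kernel_size=3, padding=1)
            for i in range(n_layers)
        )
        self.fuse = nn.Conv2d(channels + n_layers * growth, channels, kernel_size=1)  # local feature fusion
        self.act = nn.LeakyReLU(0.2, inplace=True)  # LReLU after each Conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for conv in self.convs:
            features.append(self.act(conv(torch.cat(features, dim=1))))
        fused = self.fuse(torch.cat(features, dim=1))
        return x + 0.2 * fused  # local residual learning

class RRDB(nn.Module):
    """Residual-in-residual: stacked residual dense blocks wrapped in an outer residual connection."""

    def __init__(self, channels: int = 64, n_blocks: int = 3):
        super().__init__()
        self.blocks = nn.Sequential(*(ResidualDenseBlock(channels) for _ in range(n_blocks)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + 0.2 * self.blocks(x)  # global residual over the stacked RDBs
```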
Optionally, each convolution layer is provided with a corresponding filtering mechanism, and the filtering mechanism is used for extracting eye feature data corresponding to the original picture; inputting an original picture into the residual error dense block, and obtaining eye feature data output by a first layer and a middle layer and a local feature map output by a last layer through calculation of a plurality of layers of convolution layers, wherein the steps of the method comprise:
and inputting an original picture into the residual error dense block, and sequentially performing dimension reduction processing and eye feature extraction on the original picture by using a filtering mechanism corresponding to each convolution layer to obtain eye feature data output by the first layer and the middle layer and a local feature map output by the last layer.
In this embodiment of the application, the filtering mechanism may refer to filtering out different filtering factors, and extracting feature data meeting preset requirements, where the filtering factors include facial expression, posture change, illumination, shielding, background interference, image sharpness, and features of five sense organs except eyes, and the preset requirements refer to requirements of each convolution layer that meet an output result, such as a dimension of 24 dimensions, and the embodiment of the application is not specifically limited to this.
In this step, each convolution layer has a corresponding filtering mechanism, so as to extract feature data meeting a preset requirement, for example, a first convolution layer can correspondingly screen out background interference in a feature map and output 36-dimensional feature data, a second convolution layer correspondingly screens out a shelter in the feature map and outputs 24-dimensional feature data, a third convolution layer correspondingly screens out a mouth part, a nose part and the like below eyes in the feature map and outputs 12-dimensional feature data, that is, a corresponding local feature map containing eye features.
Illustratively, in the application scenario of fig. 1, the server 102 inputs the original picture 101 into the RDB, sequentially filters the background interference, the occlusion, the eyebrow and forehead above the eyes, and the mouth and nose below the eyes in the original picture 101 by using the corresponding filtering mechanism of each convolution layer, and obtains the 128-dimensional eye feature data of the first layer, the 48-dimensional eye feature data and the 24-dimensional eye feature data of the middle 2 layer, and the 12-dimensional eye feature data output by the last layer.
It should be noted that, in the embodiment of the present application, the filtering mechanism corresponding to each convolution layer and the number of convolution layers included in the residual dense block are not specifically limited, and the above description is only an example.
Therefore, the method and the device can perform dimension reduction processing and eye feature extraction on the original picture based on the filtering mechanism corresponding to the convolutional layer, output eye feature data meeting requirements, have clear layers and improve the processing accuracy.
Optionally, inputting the feature map into a regional suggestion network RPN to obtain a plurality of candidate region positioning frames, and determining a target candidate region positioning frame containing an eye image in the plurality of candidate region positioning frames, includes:
inputting the feature map into an RPN to obtain a plurality of candidate region positioning frames;
mapping the candidate area positioning frames to the characteristic diagram through mapping transformation of coordinates to obtain characteristic matrixes of the candidate area positioning frames;
calculating the intersection ratio of the candidate region positioning frames based on the feature matrix, and judging whether the candidate region positioning frames contain eye images or not based on the intersection ratio;
determining a target candidate area positioning frame according to the judgment result; and the candidate region positioning frame containing the eye image is a candidate region positioning frame with the intersection ratio larger than a preset threshold value.
In this step, for example, if one RRDB has one RDB and one RDB includes four convolution layers, the feature map obtained at this time is 1/16 of the original picture, and further, the candidate region locator box may be mapped to a region corresponding to 16 × 16 on the original picture to obtain a feature matrix corresponding to the candidate region locator box, where the feature matrix is formed by feature data, it may be understood that size transformation and aspect ratio transformation are performed on the mapped region to obtain a corresponding candidate region locator box.
In the embodiment of the application, whether the eye image is included is determined by calculating the intersection ratio of the target candidate region positioning frames, a preset threshold value can be set to be 0.7, namely the intersection ratio is greater than 0.7, the positioning frame is marked as the positioning frame including the eye image, and the marking number is 1; and marking the candidate frame with the intersection ratio less than 0.3 as a positioning frame with the detection result as the background, wherein the background refers to other images except human eyes, and the corresponding mark number is 0. In the application, the number of the marked numbers can be counted to verify the accuracy of the model for detecting the state of the human eye, the counting mode can be manual or machine, and the embodiment of the application is not particularly limited to this.
For example, in the application scenario of fig. 1, after the feature map obtained by the server 102 from the RRDB network is input into the RPN, the specific processing procedure of the RPN is as follows: based on the property of the RPN, a plurality of candidate region positioning frames can be obtained first, and further, the candidate region positioning frames are mapped to corresponding feature maps through mapping transformation of coordinates, so that feature matrixes of the candidate region positioning frames can be obtained; further, the intersection ratio of a plurality of candidate region positioning frames corresponding to the feature matrix is calculated by using the existing intersection definition, and after the intersection ratio of each candidate region positioning frame is obtained, the candidate region positioning frame with the intersection ratio larger than 0.7 is determined as the target candidate region positioning frame, so that each target candidate region positioning frame can contain the eye image.
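A minimal sketch of the intersection-over-union computation and the 0.7 threshold test described above is given below; it assumes that boxes are expressed by their upper-left and lower-right corner coordinates and that the reference box is the annotated eye region, both of which are illustrative assumptions, since the patent describes the computation only at the level of the feature matrices.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_target_frames(candidate_frames, eye_region, threshold: float = 0.7):
    """Keep candidate region positioning frames whose IoU with the eye region exceeds the threshold."""
    return [frame for frame in candidate_frames if iou(frame, eye_region) > threshold]
```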
Therefore, the target candidate region positioning frame containing the eye image can be determined from the candidate region positioning frames, and then the target candidate region positioning frame is used for subsequent calculation, so that the calculation accuracy is improved.
Optionally, calculating the eye state corresponding to the target candidate region location frame based on a predefined algorithm includes:
scaling the target candidate region positioning frame to a predefined size, and splicing and de-duplicating the scaled target candidate region positioning frame;
and classifying the spliced and de-duplicated candidate regions by utilizing a normalized index function, and outputting the human eye state.
In this step, the application may scale the suggestion boxes of different sizes to a fixed size by using a region of interest pooling (RoI Pooling) layer; the RoI Pooling layer also locates the feature data inside the target candidate region positioning frame for subsequent classification.
Specifically, the RoI Pooling layer fixes regions of different sizes to a uniform size: the target candidate region positioning frame is divided into a fixed grid, and a maximum pooling operation is performed on each cell of the grid to obtain the scaled target candidate region positioning frame. The scaled target candidate region positioning frames are then spliced and de-duplicated, and the category of the spliced and de-duplicated candidate regions, namely the eye-open state or the eye-closed state, is determined by a fully connected layer and a normalized exponential (Softmax) function applied to the proposed feature mapping.
It should be noted that the predefined size is a uniform fixed size, which is not specifically limited in the embodiments of the present application, and may be determined according to specific situations.
Therefore, the target candidate region positioning frames are spliced and deduplicated, interference of repeated data is reduced, the human eye state is judged, and accuracy of detecting the human eye state is improved.
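As an illustration of the pooling-and-classification step, the sketch below uses the RoI pooling operator from torchvision and a small fully connected classifier ending in Softmax. The pooled grid size of 7, the hidden width of 256 and the 1/16 spatial scale are assumptions chosen for the example, not values stated in the patent.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class EyeStateHead(nn.Module):
    """RoI pooling to a fixed grid, then fully connected layers and Softmax over three classes."""

    def __init__(self, in_channels: int = 64, pooled: int = 7, n_classes: int = 3):
        super().__init__()
        self.pooled = pooled
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * pooled * pooled, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, n_classes),  # open eye / closed eye / background
        )

    def forward(self, feature_map: torch.Tensor, boxes) -> torch.Tensor:
        # boxes: one (N_i, 4) tensor of (x1, y1, x2, y2) proposals per image, in
        # original-picture coordinates; spatial_scale maps them onto the feature map,
        # which here is assumed to be 1/16 the size of the original picture.
        regions = roi_pool(feature_map, boxes, output_size=self.pooled, spatial_scale=1.0 / 16)
        logits = self.classifier(regions)
        return torch.softmax(logits, dim=1)  # normalized exponential over the three states
```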
Optionally, the method further includes:
acquiring a training data set; the training data set comprises a human face data set and labels corresponding to the human eye state data set; the label is used for distinguishing an eye opening state, an eye closing state and a background;
and dividing the training data set into a training set, a verification set and a test set according to a predefined proportion, and inputting the training set, the verification set and the test set into the RRDB network for training, verifying and testing.
The background refers to images other than human eyes, such as glasses, eyebrows, and mouth worn, and the embodiments of the present application may detect facial features other than the eye-open state and the eye-close state, which are collectively referred to as the background, and the specific content of the background is not limited in the embodiments of the present application.
In the embodiment of the application, the first 50000 images of the CelebFaces Attributes (CelebA) data set and their annotation data are used as the data set for human eye positioning and state detection. The CelebA data set comprises 202599 face images; five landmark points and attribute annotations are provided for the face in each image, the landmark points being located on the left eye, the right eye, the nose tip, the left mouth corner and the right mouth corner, and the attributes including gender, curly or straight hair, whether glasses are worn, whether a hat is worn, and so on.
Specifically, after the training data set is obtained, the face data set needs to be converted into the PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning) VOC (Visual Object Classes) data format. This format mainly includes three parts, namely Annotations, JPEGImages and ImageSets: the JPEGImages directory stores the image samples, the Annotations directory stores the image labeling information, and the ImageSets directory stores the image indexes. The following operations may then be performed on the CelebA data set to implement the improved Faster R-CNN training process:
according to the PASCAL VOC format, storing original images of eyes of each face in the first 50000 images of CelebA data sets into a JPEGImages file, further labeling eye regions of the face images to obtain the region coordinate information and the category of the eyes, introducing the region coordinate information and the category of the eyes into a server by using a predefined machine algorithm, labeling the states of the eyes of the face by using an open source labeling tool labelImg, namely dividing all the data sets into three states of open eyes, closed eyes and a background, correspondingly labeling the states, wherein the states comprise the open eyes state of 11, the open eyes state of 10 and the background of 00, storing the labels into an xml format, and further storing the xml format file into an antibodies file, wherein the xml format file comprises data such as labels and the like.
Further, the open-eye and closed-eye images among the first 50000 images of the CelebA data set are divided at a ratio of 8:1:1 into a training set, a verification set and a test set, and the contents corresponding to the pictures in each set, such as the picture names, are stored in the ImageSets directory so that the required pictures can subsequently be looked up by name, increasing the search speed. The data set division can be implemented in the Python language using the Integrated Development Environment (IDE) tool PyCharm, which is not specifically limited in the embodiments of the present application.
It can be understood that finding a corresponding required picture may be used to delete a part of pictures, and the like, which is not limited in this embodiment of the application.
Further, inputting the divided training set, verification set and test set into the RRDB network to train, verify and test the human eye state detection model, wherein the human eye state detection method corresponding to the steps S201-S202 is the human eye state detection model.
It should be noted that, the predefined ratio is not specifically limited in the embodiments of the present application, and the above is only an example.
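The 8:1:1 split and the ImageSets index files mentioned above can be produced with a short script, for example as sketched below. The directory layout (ImageSets/Main with train.txt, val.txt and test.txt) follows the usual PASCAL VOC convention, and the seeded random shuffle is an assumption; the patent only fixes the ratio and the use of the ImageSets part.

```python
import random
from pathlib import Path

def split_celeba_subset(image_names, seed: int = 0):
    """Split the selected CelebA images into train / verification / test sets at an 8:1:1 ratio."""
    names = list(image_names)
    random.Random(seed).shuffle(names)
    n_train = int(0.8 * len(names))
    n_val = int(0.1 * len(names))
    return names[:n_train], names[n_train:n_train + n_val], names[n_train + n_val:]

def write_imagesets(voc_root: str, train, val, test) -> None:
    """Write the picture-name index files expected under the PASCAL VOC ImageSets directory."""
    sets_dir = Path(voc_root) / "ImageSets" / "Main"
    sets_dir.mkdir(parents=True, exist_ok=True)
    for set_name, subset in (("train", train), ("val", val), ("test", test)):
        (sets_dir / f"{set_name}.txt").write_text("\n".join(Path(p).stem for p in subset))
```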
Therefore, the training of the human eye detection model can be carried out on the human face image and the human eye state corresponding to the human face image, and the accuracy of the human eye detection model detection is improved.
Optionally, the method further includes:
obtaining statistical verification results, wherein the verification results comprise correct detection, failed detection and wrong detection; the detection correctly indicates that the detection result is that the human eye state is actually the human eye state; the detection failure indicates that the detection result is that the human eye state is actually a background; the detection error indicates that the detection result is that the background is actually the human eye state;
calculating an evaluation index based on the number of correct detections, the number of failed detections and the number of errors in the verification result; the evaluation index is used for evaluating the accuracy of human eye state detection;
adjusting parameters in the RRDB network and the RPN based on the evaluation index.
In this step, the accuracy of the human eye state detection may be evaluated using an evaluation index, such as Precision (Precision), Recall (Recall), etc., and the Precision and Recall may be calculated based on the number corresponding to the statistical verification results, including correct detection, failed detection, and incorrect detection.
A correct detection means that the detection result is a human eye state and the ground truth is indeed a human eye state, and the corresponding number of pictures is TP; a failed detection means that the detection result is a human eye state but the ground truth is background, and the corresponding number of pictures is FP; a detection error means that the detection result is background but the ground truth is a human eye state, and the corresponding number of pictures is FN. Then
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Optionally, the data sample with the intersection ratio of the candidate region positioning frames being greater than or equal to the first threshold is a positive sample, which indicates that the detection result is the human eye state; the data sample with the intersection ratio smaller than the second threshold of the candidate region positioning frame is a negative sample, which indicates that the detection result is the background, for example, the data sample with the intersection ratio greater than or equal to 0.5 is a positive sample, the data sample with the intersection ratio smaller than 0.5 is a negative sample, the first threshold and the second threshold may be equal to each other, or may not be equal to each other, which is not specifically limited in this embodiment of the application.
Further, based on the fact that each input picture in the acquired verification set corresponds to the human eye state or background, the number of pictures with correct detection, detection failure and detection error is counted based on the state corresponding to the detection result, then an evaluation index is calculated by using a calculation formula of Precision and Recall, and parameters in the RRDB network and the RPN are adjusted based on the evaluation index.
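The evaluation indexes can be computed directly from the three counts defined above, for example as in the following sketch; the example numbers in the comment are purely illustrative and are not results reported in the patent.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Evaluation indexes from the verification counts.

    tp: pictures detected as an eye state that are actually an eye state (correct detection)
    fp: pictures detected as an eye state that are actually background   (failed detection)
    fn: pictures detected as background that are actually an eye state   (detection error)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative numbers only: 4600 correct, 250 failed and 150 wrong detections
# give precision = 4600 / 4850 ≈ 0.948 and recall = 4600 / 4750 ≈ 0.968.
```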
It should be noted that, the process of adjusting the parameters in the RRDB network and the RPN is not specifically limited in the embodiment of the present application.
Based on this, the evaluation index may be calculated by obtaining a statistical verification result, where the verification result may be calculated by the server by using the intersection ratio and the state corresponding to each input picture, or may be calculated by direct manual calculation and then input to the server, which is not specifically limited in this embodiment of the present application.
Therefore, the embodiment of the application can also classify the verification results, further evaluate the accuracy of the human eye state detection, perform tuning based on the evaluation results, and improve the accuracy of the human eye state detection.
Optionally, the method further includes:
and when detecting that the number of the pictures in the eye closing state is larger than the preset number, sending an alarm prompt to the user terminal equipment to remind the user of being in a fatigue state.
In the embodiment of the application, the state of human eyes can be detected by using the method of steps S201 to S202 for a plurality of continuous human face pictures of the user taken within a preset time period, and if the number of the pictures detected in the eye-closed state within the preset time period is greater than the preset number, it can be determined that the user is in a fatigue state.
Optionally, the blinking frequency may also be calculated based on the states of human eyes corresponding to the multiple continuous pictures of the user's face captured within the preset time period, that is, the blinking frequency is calculated according to the number of pictures corresponding to the eye-open state and the eye-closed state, so as to determine whether the user is in a fatigue state.
In this step, the alert prompt may refer to a prompt message sent by the system to prompt the user to be in a fatigue state, where the alert prompt may be sent to the user terminal device in the form of a piece of music, or may be in the form of making a call or directly displaying a message prompt box on the user terminal device, and the display content is "in a fatigue state".
For example, taking a user driving a vehicle, a camera mounted in the vehicle may capture 30 continuous face images within 1 minute during driving and transmit them to a server for processing. The server performs eye state detection based on the human eye state detection method, and if the number of pictures detected in the eye-closed state is greater than 20, a music prompt is sent to the user terminal device to remind the user of fatigue.
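A minimal sketch of the fatigue decision and the blink-frequency count described above follows; the string labels and the 20-frame threshold mirror the in-vehicle example, and the helper names are illustrative only.

```python
def is_fatigued(eye_states, closed_threshold: int = 20) -> bool:
    """Decide fatigue from the per-frame eye states of one monitoring window.

    eye_states: sequence of labels such as "open", "closed" or "background"
    produced by the detector for the frames captured in the window
    (30 frames per minute in the in-vehicle example above).
    """
    closed = sum(1 for state in eye_states if state == "closed")
    return closed > closed_threshold

def blink_frequency(eye_states) -> int:
    """Count open-to-closed transitions, ignoring frames classified as background."""
    visible = [s for s in eye_states if s in ("open", "closed")]
    return sum(1 for prev, cur in zip(visible, visible[1:])
               if prev == "open" and cur == "closed")
```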
Therefore, the embodiment of the application can also calculate the blinking frequency based on the eye-closing state, so as to judge the fatigue state, and compared with the traditional wearable device which monitors fatigue by utilizing electrocardio or electroencephalogram, the method is simpler and faster, and further sends a prompt to a user, so that the potential safety hazard caused by fatigue driving of people can be reduced.
It should be noted that, in the embodiment of the present application, specific values of the preset threshold, the first threshold, the second threshold, and the preset number are not limited, and may be set in advance or modified manually.
With reference to the foregoing embodiments, fig. 4 is a flowchart of a specific human eye state detection method provided in an embodiment of the present application. As shown in fig. 4, the original picture is input into the backbone network (RRDB) to obtain a feature map; the feature map is then input into the RPN to obtain a plurality of candidate region positioning frames; the candidate region positioning frames are subjected to RoI pooling (region-of-interest pooling) and then input into the classifier to determine the human eye state. The RoI pooling step also collects the feature maps and extracts proposal feature maps used subsequently to distinguish the human eye states.
The human eye state detection method is an improved Faster R-CNN algorithm in which the VGG16 backbone of the original algorithm is replaced by the more effective RRDB; compared with a traditional VGG16-based backbone, the RRDB can make full use of all hierarchical features of the input original picture and preserve the integrity of the data.
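For readers unfamiliar with the Faster R-CNN pipeline, the following PyTorch sketch mirrors the four stages of fig. 4 (backbone, RPN proposals, RoI pooling, classification). The TinyBackbone and TinyHead modules, the hard-coded box and all sizes are placeholders standing in for the RRDB backbone, the RPN and the classifier; only the torchvision roi_pool call is a real library function.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class TinyBackbone(nn.Module):          # stands in for the RRDB backbone
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, 3, stride=16, padding=1)  # 1/16 downsampling, similar to VGG16
    def forward(self, x):
        return self.conv(x)

class TinyHead(nn.Module):              # stands in for the classifier after RoI pooling
    def __init__(self, classes=3):      # open eye / closed eye / background
        super().__init__()
        self.fc = nn.Linear(64 * 7 * 7, classes)
    def forward(self, feats):
        return self.fc(feats.flatten(1)).softmax(dim=-1)

image = torch.randn(1, 3, 224, 224)                 # original picture
feature_map = TinyBackbone()(image)                 # step 1: backbone -> feature map
boxes = torch.tensor([[0., 40., 60., 120., 110.]])  # step 2: one (batch_idx, x1, y1, x2, y2) proposal, here hard-coded instead of an RPN
pooled = roi_pool(feature_map, boxes, output_size=(7, 7), spatial_scale=1 / 16)  # step 3: RoI pooling
probs = TinyHead()(pooled)                          # step 4: classify the human eye state
print(probs.shape)                                  # torch.Size([1, 3])
```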
It should be noted that the eyes, as an important component of the face, convey a great deal of important information, and human eye state detection can effectively assist related work in the computer vision field, such as face detection, expression recognition, posture estimation and human-computer interaction.
In the foregoing embodiments, the human eye state detection method provided in the embodiments of the present application is described, and in order to implement each function in the method provided in the embodiments of the present application, the electronic device serving as an execution subject may include a hardware structure and/or a software module, and implement each function in the form of a hardware structure, a software module, or a hardware structure plus a software module. Whether any of the above functions is implemented as a hardware structure, a software module, or a combination of a hardware structure and a software module depends upon the particular application and design constraints imposed on the technical solution.
For example, fig. 5 is a schematic structural diagram of a human eye state detection device according to an embodiment of the present application. As shown in fig. 5, the device comprises an input module 510 and a processing module 520, wherein the input module 510 is configured to input an original picture into a nested residual dense block (RRDB) network to obtain a feature map;
the processing module 520 is configured to input the feature map into a region suggestion network RPN to obtain a plurality of candidate region positioning frames, determine a target candidate region positioning frame containing an eye image among the plurality of candidate region positioning frames, and calculate a human eye state corresponding to the target candidate region positioning frame based on a predefined algorithm.
Optionally, the RRDB network includes a plurality of residual dense blocks, each residual dense block including a plurality of convolutional layers; the input module 510 includes a convolution calculation unit, a fusion unit and a re-fusion unit;
specifically, the convolution calculating unit is configured to, for each residual dense block, input an original picture into the residual dense block, and obtain eye feature data output by the first layer and the middle layer and a local feature map output by the last layer through multi-layer convolution layer calculation;
the fusion unit is used for fusing the eye feature data of each layer obtained by calculating each residual error dense block based on the position area of the local feature map to obtain the feature map comprising the eye feature data;
And the re-fusion unit is used for re-fusing the fused feature maps corresponding to each residual error dense block to obtain the required feature map.
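A minimal sketch of one residual dense block and its nested fusion is given below, assuming a PyTorch implementation; the layer count, channel widths and the 0.2 residual scaling factor are illustrative choices, not values specified by this embodiment.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """One residual dense block: each convolution layer receives the outputs of all earlier
    layers, and the last layer's local feature map is fused back onto the block input."""
    def __init__(self, channels=64, growth=32, num_layers=4):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(channels + i * growth, growth, kernel_size=3, padding=1)
            for i in range(num_layers)
        ])
        self.local_fusion = nn.Conv2d(channels + num_layers * growth, channels, kernel_size=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            # eye feature data from the first and intermediate layers are kept and densely reused
            feats.append(self.act(conv(torch.cat(feats, dim=1))))
        local = self.local_fusion(torch.cat(feats, dim=1))  # local feature map of the last layer
        return x + 0.2 * local                              # fuse with the block input

class RRDB(nn.Module):
    """Nested RRDB: the fused outputs of several residual dense blocks are fused again."""
    def __init__(self, channels=64, num_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList([ResidualDenseBlock(channels) for _ in range(num_blocks)])

    def forward(self, x):
        out = x
        for block in self.blocks:
            out = block(out)
        return x + 0.2 * out  # re-fuse the block outputs with the input

feature_map = RRDB()(torch.randn(1, 64, 56, 56))
print(feature_map.shape)  # torch.Size([1, 64, 56, 56])
```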
Optionally, each convolution layer is provided with a corresponding filtering mechanism, and the filtering mechanism is used for extracting eye feature data corresponding to the original picture; the convolution calculating unit is specifically configured to:
and inputting an original picture into the residual error dense block, and sequentially performing dimension reduction processing and eye feature extraction on the original picture by using a filtering mechanism corresponding to each convolution layer to obtain eye feature data output by the first layer and the middle layer and a local feature map output by the last layer.
Optionally, the processing module 520 includes a determining unit and a calculating unit; the determining unit is configured to:
inputting the feature map into an RPN to obtain a plurality of candidate region positioning frames;
mapping the candidate region positioning frames onto the feature map through coordinate mapping transformation to obtain feature matrices of the candidate region positioning frames;
calculating the intersection ratio of the candidate region positioning frames based on the feature matrix, and judging whether the candidate region positioning frames contain eye images or not based on the intersection ratio;
determining a target candidate region positioning frame according to the judgment result, wherein a candidate region positioning frame containing an eye image is a candidate region positioning frame whose intersection ratio is larger than a preset threshold value.
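For reference, the intersection ratio (intersection over union) used by the determining unit can be computed as follows; the box coordinates and the 0.5 threshold are illustrative values only.

```python
def intersection_over_union(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A candidate box is kept as a target candidate region positioning frame if its IoU with the
# labelled eye region exceeds a preset threshold (0.5 here, purely illustrative).
candidate, ground_truth = (30, 40, 90, 80), (35, 45, 95, 85)
contains_eye = intersection_over_union(candidate, ground_truth) > 0.5
print(contains_eye)  # True (IoU is about 0.67)
```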
Optionally, the computing unit is configured to:
scaling the target candidate region positioning frame to a predefined size, and splicing and de-duplicating the scaled target candidate region positioning frame;
and classifying the spliced and de-duplicated candidate regions by utilizing a normalized index function, and outputting the human eye state.
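A minimal sketch of the normalized index (softmax) classification performed by the computing unit, with made-up class scores for one spliced and de-duplicated candidate region:

```python
import math

def softmax(scores):
    """Normalized exponential function over a list of class scores."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Scores over the three classes (open eye, closed eye, background); the values are invented.
probs = softmax([2.1, 0.3, -1.0])
labels = ["open eye", "closed eye", "background"]
print(labels[probs.index(max(probs))])  # -> "open eye"
```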
Optionally, the apparatus further includes a training module, where the training module is configured to:
acquiring a training data set; the training data set comprises a human face data set and labels corresponding to the human eye state data set; the label is used for distinguishing an eye opening state, an eye closing state and a background;
and dividing the training data set into a training set, a verification set and a test set according to a predefined proportion, and inputting the training set, the verification set and the test set into the RRDB network for training, verifying and testing.
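As an illustration of the predefined-proportion split performed by the training module, assuming an 8:1:1 ratio that is not specified by this embodiment:

```python
import random

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the labelled face pictures and split them into training / verification / test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

# Each sample pairs a face picture path with its label: open eye / closed eye / background
dataset = [(f"face_{i:04d}.jpg", i % 3) for i in range(1000)]
train_set, val_set, test_set = split_dataset(dataset)
print(len(train_set), len(val_set), len(test_set))  # 800 100 100
```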
Optionally, the apparatus further includes a parameter adjusting module, where the parameter adjusting module is configured to:
obtaining statistical verification results, wherein the verification results include correct detections, failed detections and erroneous detections; a correct detection indicates that the detection result is a human eye state and the actual state is a human eye state; a failed detection indicates that the detection result is a human eye state but the actual state is background; an erroneous detection indicates that the detection result is background but the actual state is a human eye state;
calculating an evaluation index based on the number of correct detection, the number of failed detection and the number of detection errors in the verification result; the evaluation index is used for evaluating the accuracy of human eye state detection;
adjusting parameters in the RRDB network and the RPN based on the evaluation index.
Optionally, the apparatus further includes an alarm module, where the alarm module is configured to:
when it is detected that the number of pictures in the eye-closed state is greater than a preset number, sending an alarm prompt to the user terminal device to remind the user that the user is in a fatigue state.
For specific implementation principles and effects of the human eye state detection device provided in the embodiment of the present application, reference may be made to relevant descriptions and effects corresponding to the above embodiments, which are not described in detail herein.
For example, an embodiment of the present application further provides an electronic device. Fig. 6 is a schematic structural diagram of an electronic device provided in an embodiment of the present application; as shown in fig. 6, the electronic device may include a processor 601 and a memory 602 communicatively coupled to the processor. The memory 602 stores a computer program, and the processor 601 executes the computer program stored in the memory 602, so that the processor 601 performs the method of any of the foregoing embodiments.
The memory 602 and the processor 601 may be connected by a bus 603.
The embodiments of the present application further provide a computer-readable storage medium in which computer-executable instructions are stored; when executed by a processor, the computer-executable instructions implement the human eye state detection method of any of the foregoing embodiments of the present application.
The embodiments of the present application further provide a chip for executing instructions, where the chip is used to execute the human eye state detection method performed by the electronic device in any of the foregoing embodiments of the present application.
The embodiments of the present application further provide a computer program product including program code; when a computer runs the computer program, the program code performs the human eye state detection method executed by the electronic device in any of the foregoing embodiments of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of modules is merely a division of logical functions, and an actual implementation may have another division, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to implement the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing unit, or each module may exist alone physically, or two or more modules may be integrated into one unit. A unit formed by the above modules may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods described in the embodiments of the present application.
It should be understood that the Processor may be a Central Processing Unit (CPU), another general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the present application may be embodied directly in a hardware processor for execution, or executed by a combination of hardware and software modules in the processor.
The Memory may include a Random Access Memory (RAM), and may further include a Non-Volatile Memory (NVM), such as at least one magnetic disk Memory; it may also be a USB flash drive, a removable hard disk, a read-only Memory, a magnetic disk or an optical disk.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The storage medium may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as Static Random-Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk or optical disk. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device or host device.
The above description is only a specific implementation of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any changes or substitutions within the technical scope disclosed in the embodiments of the present application should be covered by the scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A method for detecting a state of a human eye, the method comprising:
inputting an original picture into a nested residual error dense block RRDB network to obtain a feature map;
inputting the characteristic diagram into a region suggestion network RPN to obtain a plurality of candidate region positioning frames, determining a target candidate region positioning frame containing an eye image in the candidate region positioning frames, and calculating a human eye state corresponding to the target candidate region positioning frame based on a predefined algorithm;
wherein the RRDB network comprises a plurality of residual dense blocks, each residual dense block comprising a plurality of convolutional layers; inputting the original picture into a nested residual dense block RRDB network to obtain a feature map, wherein the feature map comprises:
inputting an original picture into each residual error dense block, and obtaining eye feature data output by a first layer and a middle layer and a local feature map output by a last layer through calculation of a plurality of layers of convolution layers;
fusing eye feature data of each layer obtained by calculating each residual dense block based on the position area of the local feature map to obtain a feature map comprising the eye feature data;
fusing the fused feature maps corresponding to the residual error dense blocks again to obtain the required feature maps;
each convolution layer is provided with a corresponding filtering mechanism, the original picture is input into the residual error dense block, and eye feature data output by the first layer and the middle layer and a local feature map output by the last layer are obtained through calculation of the plurality of convolution layers, and the method comprises the following steps:
inputting an original picture into the residual error dense block, and sequentially performing dimensionality reduction processing and eye feature extraction on the original picture by using a filtering mechanism corresponding to each convolution layer to obtain eye feature data output by a first layer and a middle layer and a local feature map output by a last layer;
the filtering mechanism is used for screening out different filtering factors and extracting eye feature data corresponding to an original picture and meeting preset requirements, the filtering factors comprise facial expressions, posture changes, illumination, shielding, background interference, image definition and five-sense-organ features except eyes, and the filtering factors corresponding to each convolution layer are different.
2. The method according to claim 1, wherein inputting the feature map into a region suggestion network RPN, obtaining a plurality of candidate region location boxes, and determining a target candidate region location box containing an eye image among the plurality of candidate region location boxes comprises:
inputting the feature map into an RPN to obtain a plurality of candidate region positioning frames;
mapping the candidate area positioning frames to the characteristic diagram through mapping transformation of coordinates to obtain characteristic matrixes of the candidate area positioning frames;
calculating the intersection ratio of the candidate region positioning frames based on the feature matrix, and judging whether the candidate region positioning frames contain eye images or not based on the intersection ratio;
determining a target candidate area positioning frame according to the judgment result; and the candidate region positioning frame containing the eye image is a candidate region positioning frame with the intersection ratio larger than a preset threshold value.
3. The method of claim 1, wherein calculating the state of the human eye corresponding to the target candidate region location box based on a predefined algorithm comprises:
scaling the target candidate region positioning frame to a predefined size, and splicing and de-duplicating the scaled target candidate region positioning frame;
and classifying the spliced and de-duplicated candidate regions by utilizing a normalized index function, and outputting the human eye state.
4. The method of claim 1, further comprising:
acquiring a training data set; the training data set comprises a human face data set and labels corresponding to the human eye state data set; the label is used for distinguishing an eye opening state, an eye closing state and a background;
and dividing the training data set into a training set, a verification set and a test set according to a predefined proportion, and inputting the training set, the verification set and the test set into the RRDB network for training, verifying and testing.
5. The method of claim 4, further comprising:
obtaining statistical verification results, wherein the verification results comprise correct detection, failed detection and wrong detection; the detection correctly indicates that the detection result is that the human eye state is actually the human eye state; the detection failure indicates that the detection result is that the human eye state is actually a background; the detection error indicates that the detection result is that the background is actually a human eye state;
calculating an evaluation index based on the number of correct detections, the number of failed detections and the number of errors in the verification result; the evaluation index is used for evaluating the accuracy of human eye state detection;
adjusting parameters in the RRDB network and the RPN based on the evaluation index.
6. The method according to any one of claims 1-5, further comprising:
and when detecting that the number of the pictures in the eye closing state is larger than the preset number, sending an alarm prompt to the user terminal equipment to remind the user of being in a fatigue state.
7. An eye state detection apparatus, comprising:
the input module is used for inputting an original picture into a nested residual error dense block RRDB network to obtain a feature map;
the processing module is used for inputting the feature map into a region suggestion network RPN to obtain a plurality of candidate region positioning frames, determining a target candidate region positioning frame containing an eye image in the candidate region positioning frames, and calculating a human eye state corresponding to the target candidate region positioning frame based on a predefined algorithm;
wherein the RRDB network comprises a plurality of residual dense blocks, each residual dense block comprising a plurality of convolutional layers; the input module is specifically configured to:
inputting an original picture into each residual error dense block, and obtaining eye feature data output by a first layer and a middle layer and a local feature map output by a last layer through calculation of a plurality of layers of convolution layers;
fusing eye feature data of each layer obtained by calculating each residual dense block based on the position area of the local feature map to obtain a feature map comprising the eye feature data;
fusing the fused feature maps corresponding to the residual error dense blocks again to obtain the required feature maps;
each convolution layer is provided with a corresponding filtering mechanism, the original picture is input into the residual error dense block, and eye feature data output by the first layer and the middle layer and a local feature map output by the last layer are obtained through calculation of the plurality of convolution layers, and the method comprises the following steps:
inputting an original picture into the residual error dense block, and sequentially performing dimensionality reduction processing and eye feature extraction on the original picture by using a filtering mechanism corresponding to each convolution layer to obtain eye feature data output by a first layer and a middle layer and a local feature map output by a last layer;
the filtering mechanism is used for screening out different filtering factors and extracting eye feature data corresponding to an original picture and meeting preset requirements, the filtering factors comprise facial expressions, posture changes, illumination, shielding, background interference, image definition and five-sense-organ features except eyes, and the filtering factors corresponding to each convolution layer are different.
8. An electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored by the memory to implement the method of any of claims 1-6.
9. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, perform the method of any one of claims 1-6.
CN202210412912.4A 2022-04-20 2022-04-20 Human eye state detection method and device, electronic equipment and storage medium Active CN114596624B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210412912.4A CN114596624B (en) 2022-04-20 2022-04-20 Human eye state detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210412912.4A CN114596624B (en) 2022-04-20 2022-04-20 Human eye state detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114596624A CN114596624A (en) 2022-06-07
CN114596624B true CN114596624B (en) 2022-08-05

Family

ID=81811473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210412912.4A Active CN114596624B (en) 2022-04-20 2022-04-20 Human eye state detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114596624B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021164234A1 (en) * 2020-02-21 2021-08-26 华为技术有限公司 Image processing method and image processing device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10713948B1 (en) * 2019-01-31 2020-07-14 StradVision, Inc. Method and device for alerting abnormal driver situation detected by using humans' status recognition via V2V connection
CN110298262B (en) * 2019-06-06 2024-01-02 华为技术有限公司 Object identification method and device
CN110427871B (en) * 2019-07-31 2022-10-14 长安大学 Fatigue driving detection method based on computer vision
CN111179314B (en) * 2019-12-30 2023-05-02 北京工业大学 Target tracking method based on residual intensive twin network
CN111597884A (en) * 2020-04-03 2020-08-28 平安科技(深圳)有限公司 Facial action unit identification method and device, electronic equipment and storage medium
CN111524135B (en) * 2020-05-11 2023-12-26 安徽继远软件有限公司 Method and system for detecting defects of tiny hardware fittings of power transmission line based on image enhancement

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021164234A1 (en) * 2020-02-21 2021-08-26 华为技术有限公司 Image processing method and image processing device

Also Published As

Publication number Publication date
CN114596624A (en) 2022-06-07

Similar Documents

Publication Publication Date Title
CN110197146B (en) Face image analysis method based on deep learning, electronic device and storage medium
US11922646B2 (en) Tracking surgical items with prediction of duplicate imaging of items
CN109508688B (en) Skeleton-based behavior detection method, terminal equipment and computer storage medium
CN110675487B (en) Three-dimensional face modeling and recognition method and device based on multi-angle two-dimensional face
CN111881770B (en) Face recognition method and system
US8970696B2 (en) Hand and indicating-point positioning method and hand gesture determining method used in human-computer interaction system
WO2020244075A1 (en) Sign language recognition method and apparatus, and computer device and storage medium
CN109492576B (en) Image recognition method and device and electronic equipment
WO2022156525A1 (en) Object matching method and apparatus, and device
CN111860055B (en) Face silence living body detection method, device, readable storage medium and equipment
CN109215131A (en) The driving method and device of conjecture face
CN112101208A (en) Feature series fusion gesture recognition method and device for elderly people
CN113780201B (en) Hand image processing method and device, equipment and medium
CN111523390A (en) Image recognition method and augmented reality AR icon recognition system
CN109784140A (en) Driver attributes' recognition methods and Related product
CN111680546A (en) Attention detection method, attention detection device, electronic equipment and storage medium
CN114463725A (en) Driver behavior detection method and device and safe driving reminding method and device
CN111666976A (en) Feature fusion method and device based on attribute information and storage medium
CN111241961A (en) Face detection method and device and electronic equipment
Hoque et al. Computer vision based gesture recognition for desktop object manipulation
Saju et al. Drowsiness detection system for drivers using HAART training and template matching
CN114596624B (en) Human eye state detection method and device, electronic equipment and storage medium
Baltanas et al. A face recognition system for assistive robots
WO2024000233A1 (en) Facial expression recognition method and apparatus, and device and readable storage medium
Hasan et al. Gesture feature extraction for static gesture recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 518000 Guangdong Shenzhen Baoan District Xixiang street, Wutong Development Zone, Taihua Indus Industrial Park 8, 3 floor.

Patentee after: Shenzhen Haiqing Zhiyuan Technology Co.,Ltd.

Address before: 518000 Guangdong Shenzhen Baoan District Xixiang street, Wutong Development Zone, Taihua Indus Industrial Park 8, 3 floor.

Patentee before: SHENZHEN HIVT TECHNOLOGY Co.,Ltd.

CP01 Change in the name or title of a patent holder