CN114283488B - Method for generating detection model and method for detecting eye state by using detection model - Google Patents


Info

Publication number
CN114283488B
Authority
CN
China
Prior art keywords
image
eye
prediction
convolution
detection model
Prior art date
Legal status
Active
Application number
CN202210218130.7A
Other languages
Chinese (zh)
Other versions
CN114283488A (en)
Inventor
贾福昌
李茂林
Current Assignee
Beijing Superred Technology Co Ltd
Original Assignee
Beijing Superred Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Superred Technology Co Ltd
Priority to CN202210218130.7A
Publication of CN114283488A
Application granted
Publication of CN114283488B

Landscapes

  • Image Analysis (AREA)

Abstract

The present disclosure discloses a method of generating a detection model for detecting an eye state, comprising the steps of: preprocessing a sample image containing a single eye to generate a training image and label data; constructing a detection model and initial network parameters, wherein the detection model is formed by coupling a trunk feature extraction component, a feature pyramid component and an information prediction component, and the eye state comprises a closed state and a non-closed state; inputting the training image into a detection model for processing so as to output a prediction result; and calculating a loss value based on the prediction result and the label data, and adjusting network parameters according to the loss value until a predetermined condition is met, wherein the corresponding detection model is a finally generated detection model for detecting the eye state. The disclosure also discloses a method and a system for detecting the eye state in the eye image based on the detection model. Based on the detection model, the eye state in the eye image can be detected quickly and accurately.

Description

Method for generating detection model and method for detecting eye state by using detection model
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method and a system for detecting an eye state in an eye image.
Background
With the development of video, image processing and pattern recognition technologies, biometric recognition, particularly face recognition technology, has become a stable, accurate and efficient target recognition technology. The eyes are important facial features, and have a wide range of application scenarios in the aspects of target recognition, target tracking and the like. For example, whether eyes are opened or not needs to be judged in the face recognition process, so that the problem of mistaken unlocking when a user sleeps is avoided; for another example, when determining whether the driver has a fatigue driving behavior, a reliable method for determining the eye opening/closing state is extremely important; for another example, in the identity authentication process based on the human face features, it is often necessary to detect the eye opening/closing state of the user to determine that the identified object is a living body; and so on.
In view of the above, a reliable solution for detecting the eye state is needed.
Disclosure of Invention
To this end, the present disclosure provides a method and system for detecting an eye state in an eye image in an attempt to solve or at least alleviate the problems presented above.
According to a first aspect of the present disclosure, there is provided a method of generating a detection model for detecting an eye state, comprising the steps of: preprocessing a sample image containing a single eye to generate a training image and label data; constructing a detection model and initial network parameters, wherein the detection model is formed by coupling a trunk feature extraction component, a feature pyramid component and an information prediction component, and the eye state comprises a closed state and a non-closed state; inputting the training image into a detection model for processing so as to output a prediction result; and calculating a loss value based on the prediction result and the label data, and adjusting network parameters according to the loss value until a preset condition is met, wherein the corresponding detection model is a finally generated detection model for detecting the eye state.
Optionally, in a method according to the present disclosure, the tag data includes: at least one of a preset area label, a state label and a background label, wherein the preset area label indicates a rectangular area containing eyes in a closed state, the state label indicates an eye state category of the sample image, the background label indicates a background category of the sample image, and the background category comprises: an image containing eyes, an image not containing eyes; the predicted results include: at least one of a predicted region containing an eye in a closed state, a state class prediction value, a background class prediction value.
Optionally, in the method according to the present disclosure, the trunk feature extraction component is formed by sequentially coupling 4 hole convolution modules, and each hole convolution module at least includes hole convolution blocks with different dilation rates.
Optionally, in a method according to the present disclosure, the feature pyramid component includes at least a plurality of convolution processing modules and a plurality of feature upsampling modules, where the convolution processing modules include a convolution layer, a normalization layer, and an activation layer coupled in sequence; the characteristic up-sampling module comprises the convolution processing module and an up-sampling layer which are coupled in sequence.
Optionally, in the method according to the present disclosure, the information prediction component includes a closed-eye region prediction branch and a category prediction branch, wherein the closed-eye region prediction branch includes at least a plurality of convolution processing modules adapted to output a prediction region including eyes in a closed state; the category prediction branch at least comprises a plurality of convolution processing modules and a classification layer and is suitable for outputting a state category prediction value and a background category prediction value.
Optionally, in the method according to the present disclosure, the step of inputting the training image into the detection model for processing to output the prediction result includes: inputting the training image into the trunk feature extraction component, processing it sequentially through a plurality of hole convolution modules, outputting a third feature subgraph through the 3rd hole convolution module, and outputting a fourth feature subgraph through the 4th hole convolution module; inputting the third feature subgraph and the fourth feature subgraph into the feature pyramid component, performing feature extraction and sampling, and outputting a first output feature map and a second output feature map; and inputting the first output feature map and the second output feature map into the information prediction component, and outputting a prediction result after convolution processing.
Optionally, in the method according to the present disclosure, the step of inputting the third feature sub-graph and the fourth feature sub-graph into the feature pyramid component, performing feature extraction and sampling, and outputting the first output feature graph and the second output feature graph includes: processing the third feature subgraph at least by a convolution processing module, and fusing the processed feature graph with the fourth feature subgraph to obtain a first fused feature graph; processing the first fusion feature map by at least 1 convolution processing module and 1 feature up-sampling module to obtain a first intermediate feature map; processing the first fusion feature map by at least 2 convolution processing modules and 1 feature up-sampling module to obtain a second intermediate feature map; fusing the first intermediate feature map and the second intermediate feature map to obtain a second fused feature map; performing convolution processing on the second fusion feature map to generate a first output feature map; and processing the first output characteristic diagram through a convolution processing module to generate a second output characteristic diagram.
Optionally, the method according to the present disclosure further comprises the steps of: calculating a first loss using a preset area label and a predicted area containing eyes in a closed state; calculating a second loss by using the state label and the state category predicted value; calculating a third loss by using the background label and the background category predicted value; a loss value is determined based on the first loss, the second loss, and the third loss.
According to a second aspect of the present disclosure, there is provided a method of detecting an eye state, comprising the steps of: inputting an image indicating an object to be detected into a detection model for detecting eye states, and outputting at least a state category predicted value after processing, wherein the state category predicted value is a probability value of eyes in a closed state; and when the probability value in the closed state is larger than a preset threshold value, determining that the eyes of the object to be detected are in the closed state, wherein the detection model is generated by training by executing the method.
Optionally, the method according to the present disclosure further comprises the steps of: inputting an image indicating an object to be detected into a detection model, and outputting a prediction region and a background category prediction value after processing, wherein the prediction region is a rectangular region containing eyes in a closed state, and the background category prediction value is a probability value of the eyes contained in the image; and when the background category predicted value indicates that the image contains human eyes and the state category predicted value indicates that the eyes are in a closed state, determining that the eyes of the object to be detected are in the closed state.
According to a third aspect of the present disclosure, there is provided a system for detecting an eye state, comprising: the image acquisition unit is suitable for acquiring an image of an eye region containing an object to be detected and preprocessing the image to generate an image to be detected; the image processing unit is suitable for inputting the image to be detected into the detection model for processing so as to output a prediction result; the prediction result unit is suitable for determining the eye state of the object to be detected based on the prediction result, wherein the eye state comprises a closed state and a non-closed state; and the convolution network generation unit is suitable for training and generating a detection model for detecting the eye state.
According to a fourth aspect of the present disclosure, there is provided a computing device comprising: at least one processor; and a memory storing program instructions that, when read and executed by the processor, cause the computing device to perform the above-described method.
According to a fifth aspect of the present disclosure, there is provided a readable storage medium storing program instructions which, when read and executed by a computing device, cause the computing device to perform the above method.
According to the technical scheme of the method, a detection model is constructed based on the hole convolution, and corresponding label data are set; and then, training the detection model by using the training image and the label data to obtain the detection model for detecting the eye state. In addition, in order to enhance the accuracy of the state class prediction result output by the detection model, a background label is added when label data is set, and the background label is used as an auxiliary confidence coefficient of state class prediction.
The foregoing description is only an overview of the technical solutions of the present disclosure, and the embodiments of the present disclosure are described below in order to make the technical means of the present disclosure more clearly understood and to make the above and other objects, features, and advantages of the present disclosure more clearly understandable.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a schematic diagram of a system 100 for detecting eye states according to one embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of a computing device 200 according to one embodiment of the present disclosure;
FIG. 3 shows a flow diagram of a method 300 of generating a detection model for detecting eye states according to one embodiment of the present disclosure;
FIG. 4 illustrates a schematic structural diagram of a detection model 400 according to some embodiments of the present disclosure;
FIG. 5 illustrates a schematic structural diagram of a hole convolution module 500 according to some embodiments of the present disclosure;
FIG. 6 illustrates a schematic structural diagram of a feature pyramid component 420 according to some embodiments of the present disclosure;
FIG. 7 illustrates a schematic structural diagram of an information prediction component 430, according to some embodiments of the present disclosure;
fig. 8 shows a flow diagram of a method 800 of detecting an eye state according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
To address the problems in the prior art, the present disclosure provides a convolutional neural network-based scheme for detecting eye states, wherein the eye state comprises a closed state and a non-closed state. Fig. 1 shows a schematic diagram of a system 100 for detecting eye states according to one embodiment of the present disclosure. As shown in fig. 1, the system 100 includes an image acquisition unit 110, an image processing unit 120, a prediction result unit 130, and a convolution network generation unit 140.
It should be noted that the system 100 shown in fig. 1 is merely exemplary. In particular implementations, a different number of units (e.g., image acquisition units) may be included in the system 100, as the present disclosure is not limited in this respect.
The system 100 processes the acquired image containing the eye region of the object to be detected to output a prediction result. The prediction result at least comprises: a prediction region containing an eye in a closed state and a state category prediction value. In some preferred embodiments, the prediction result may further include a background category prediction value.
According to one embodiment, the predicted area is a rectangular area containing an eye in a closed state, e.g. a rectangular box is used to mark the closed eye in the image. In one embodiment, the prediction region is represented as (x, y, w, h), where (x, y) is the center position of the prediction region and (w, h) are the width and height of the rectangular area. If it is detected that the image does not include an eye in a closed state, the prediction region is empty.
In addition, the state class prediction value is the probability that the eye is in a closed state (denoted p1). When p1 is greater than a preset threshold, the system 100 determines that the eye state in the image is the closed state; otherwise, it determines that the eye state in the image is the non-closed state. Also, when the system 100 determines that the eye state in the image is the closed state, the position of the eye in the closed state can be determined in conjunction with the predicted region.
In addition, the background category prediction value can be utilized as an auxiliary confidence for the predicted state category. The background category prediction value predicts whether the image contains an eye, and is the probability that the image contains an eye (denoted p2). If p2 is greater than a preset value, the image is determined to contain an eye; otherwise, the image is determined not to contain an eye. Consider a case where the output p1 is greater than the preset threshold, indicating that the eye state in the image is closed, but the output p2 is smaller than the preset value, indicating that no eye is included in the image: the two outputs are inconsistent and the prediction result is unreliable. That is, the larger the value of p2, the more trustworthy the state class prediction value is.
Based on the system 100, by processing the image including the eyes of the object to be detected, the state of the eyes in the image can be accurately detected, that is, the state of the eyes of the object to be detected can be determined. Providing quick and effective results for subsequent applications.
According to the embodiment of the present disclosure, the image acquisition unit 110 may be deployed in places such as an important conference entrance, a border inspection station, an entry/exit port, an airport, a security checkpoint, the interior of a vehicle, and the like, but is not limited thereto. The image acquisition unit 110 is used to capture an image containing the eye region of an object to be detected (referred to simply as an eye image). The image acquisition unit 110 may be any type of image capturing device, and the present disclosure does not limit its type or hardware configuration.
According to the embodiment of the present disclosure, the image acquisition unit 110 may also preprocess the captured eye image to generate an image to be detected. Preprocessing operations include, but are not limited to: data augmentation such as increasing or decreasing image brightness, adding noise or interference, cropping, scaling, and normalization, after which a new image is generated.
The image processing unit 120 is coupled to the image acquisition unit 110, and receives the image to be detected, and inputs the image to be detected into the detection model for processing, so as to output a prediction result. The detection model is based on a convolutional neural network for detecting the state of the eye in the input image.
The prediction result unit 130 is coupled to the image processing unit 120, and the prediction result unit 130 determines an eye state of the object to be detected, which is a closed state or a non-closed state, based on the prediction result.
In some embodiments, the system 100 further comprises a convolutional network generation unit 140 for training generation of the detection model.
It should be noted that, regarding the specific execution flow of the system 100 and its parts, reference may be made to the following detailed description of the method 300 and the method 800, which is not specifically expanded herein.
System 100 may be implemented by one or more computing devices. FIG. 2 shows a schematic diagram of a computing device 200 according to one embodiment of the present disclosure. It should be noted that the computing device 200 shown in fig. 2 is only one example.
As shown in FIG. 2, in a basic configuration 202, a computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communicating between the processor 204 and the system memory 206.
Depending on the desired configuration, the processor 204 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 204 may include one or more levels of cache, such as a level one cache 210 and a level two cache 212, a processor core 214, and registers 216. Example processor cores 214 may include Arithmetic Logic Units (ALUs), Floating Point Units (FPUs), digital signal processing cores (DSP cores), or any combination thereof. The example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.
Depending on the desired configuration, system memory 206 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. The physical memory in the computing device is usually referred to as a volatile memory RAM, and data in the disk needs to be loaded into the physical memory to be read by the processor 204. System memory 206 may include an operating system 220, one or more applications 222, and program data 224. In some implementations, the application 222 can be arranged to execute instructions on the operating system with the program data 224 by the one or more processors 204. Operating system 220 may be, for example, Linux, Windows, or the like, which includes program instructions for handling basic system services and for performing hardware-dependent tasks. The application 222 includes program instructions for implementing various user-desired functions, and the application 222 may be, for example, but not limited to, a browser, instant messenger, a software development tool (e.g., an integrated development environment IDE, a compiler, etc.), and the like. When the application 222 is installed into the computing device 200, a driver module may be added to the operating system 220.
When the computing device 200 is started, the processor 204 reads program instructions of the operating system 220 from the memory 206 and executes them. Applications 222 run on top of operating system 220, utilizing the interface provided by operating system 220 and the underlying hardware to implement various user-desired functions. When the user starts the application 222, the application 222 is loaded into the memory 206, and the processor 204 reads the program instructions of the application 222 from the memory 206 and executes the program instructions.
Computing device 200 also includes storage 232, storage 232 including removable storage 236 and non-removable storage 238, each of removable storage 236 and non-removable storage 238 being connected to storage interface bus 234.
Computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., output devices 242, peripheral interfaces 244, and communication devices 246) to the basic configuration 202 via the bus/interface controller 230. The example output device 242 includes a graphics processing unit 248 and an audio processing unit 250. They may be configured to facilitate communication with various external devices, such as a display 253 or speakers, via one or more a/V ports 252. Example peripheral interfaces 244 can include a serial interface controller 254 and a parallel interface controller 256, which can be configured to facilitate communications with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 258. An example communication device 246 may include a network controller 260, which may be arranged to facilitate communications with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or dedicated wired network, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
The computing device 200 also includes a storage interface bus 234 coupled to the bus/interface controller 230. The storage interface bus 234 is coupled to the storage device 232, and the storage device 232 is adapted for data storage. The example storage device 232 may include removable storage 236 (e.g., CD, DVD, U-disk, removable hard disk, etc.) and non-removable storage 238 (e.g., hard disk drive, HDD, etc.).
In the computing device 200 according to the present disclosure, the application 222 includes instructions for performing the method 300 of generating a detection model for detecting an eye state of the present disclosure, and/or instructions for performing the method 800 of detecting an eye state of the present disclosure, which may instruct the processor 204 to perform the above-described methods of the present disclosure to train and generate a detection model for detecting an eye state and to detect the eye state in an eye image using the detection model.
Fig. 3 shows a flow diagram of a method 300 of generating a detection model for detecting eye states according to one embodiment of the present disclosure. The method 300 may be performed by the image acquisition unit 110 and the convolutional network generation unit 140. The method 300 is directed to generating a detection model through training. By utilizing the detection model, the eye state in the eye image can be detected, and the position of the eye in the closed state can be determined, so that on one hand, the accuracy of eye state type identification is ensured, and on the other hand, the speed of eye state detection is increased.
As shown in fig. 3, the method 300 begins at step S310. In step S310, the sample image containing a single eye is preprocessed by the image acquisition unit 110 to generate a training image and label data.
The sample image may be obtained via the image acquisition unit 110, or may be obtained from other video image databases, which is not limited in this disclosure. According to one embodiment of the present disclosure, an eye image including a single eye (the eye may be in a closed state or in a non-closed state) is selected as a sample image. Of course, some images not including eyes may be selected as negative examples, which are not described in detail in this disclosure.
On one hand, a sample image is preprocessed to generate a training image. The preprocessing at least comprises: increasing or decreasing image brightness, adding noise, blurring, scaling, normalization, and the like. According to an embodiment of the present disclosure, after adjusting properties of the sample image such as brightness, contrast, and noise, the processed image is scaled to a uniform size (optionally 224 × 224, but not limited thereto), and its pixel values are then normalized.
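As a concrete illustration (the disclosure does not tie the preprocessing to any particular library; the OpenCV/NumPy calls, the augmentation ranges and the [0, 1] normalization below are assumptions), a minimal sketch is:

```python
# Illustrative preprocessing sketch; augmentation ranges, the 224 x 224 target
# size and the [0, 1] normalization are example values, not requirements.
import cv2
import numpy as np

def preprocess_sample(image_bgr, target_size=224):
    img = image_bgr.astype(np.float32)
    img = img * np.random.uniform(0.8, 1.2)            # adjust brightness
    img = img + np.random.normal(0.0, 5.0, img.shape)  # add mild noise
    img = np.clip(img, 0, 255)
    img = cv2.resize(img, (target_size, target_size))  # scale to a uniform size
    return img / 255.0                                  # normalize pixel values
```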
On the other hand, for each sample image, corresponding label data is generated. Wherein the tag data includes: and at least one of a preset area label, a state label and a background label.
According to an embodiment of the present disclosure, for each sample image, a preset area label indicating a predetermined number of rectangular areas containing eyes in a closed state is set, and the preset rectangular areas have different sizes. In one embodiment, the predetermined number is 980. The preset area label is defined as (x_k, y_k, w_k, h_k), k = 1, 2, …, 980, where (x_k, y_k) is the center position and (w_k, h_k) is the size of the k-th preset rectangular area. Thus, the preset area label of each sample image has size 980 × 4.
Meanwhile, a state label is set for each sample image. The state label indicates the eye state category of the sample image. According to the present disclosure, when the sample image contains an eye in a closed state (i.e., the preset area label indicates that a preset rectangular area exists), the state label is set to 1; otherwise, the state label is set to 0.
Furthermore, as previously described, in some embodiments according to the present disclosure, a background label may also be provided to assist in explaining whether the sample image contains an eye. The background label indicates a background category of the sample image, which, according to an embodiment of the present disclosure, includes: the image contains eyes and the image does not contain eyes. In one embodiment, when the sample image contains an eye, the background label is 1; otherwise, the background label is 0.
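For concreteness, the following minimal sketch assembles the three kinds of label data for one sample image; how the 980 preset rectangles are generated is not detailed above, so the hypothetical preset_boxes array is assumed to be supplied by the caller.

```python
# Label-assembly sketch: a 980 x 4 preset area label, a 0/1 state label and a
# 0/1 background label per sample image. The preset rectangles are assumed given.
import numpy as np

def build_labels(preset_boxes, eye_present, eye_closed):
    # preset_boxes: (980, 4) array of rows (x_k, y_k, w_k, h_k), or None when
    # the sample contains no eye in a closed state.
    area_label = np.zeros((980, 4), dtype=np.float32)
    if preset_boxes is not None:
        area_label[:] = preset_boxes
    state_label = 1.0 if eye_closed else 0.0        # eye state category
    background_label = 1.0 if eye_present else 0.0  # image contains an eye or not
    return area_label, state_label, background_label
```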
Subsequently, in step S320, a detection model for detecting the eye state and initial network parameters are constructed.
The detection model is based on a convolution network and is formed by coupling a trunk feature extraction component, a feature pyramid component and an information prediction component. The training image is input into the detection model, and is processed by the trunk feature extraction component, the feature pyramid component and the information prediction component in sequence, and then is finally output by two branches, wherein one branch outputs a predicted eye closing region (namely, a predicted region containing eyes in a closed state), the other branch outputs a state category predicted value, and a background category predicted value can also be output at the same time.
The following illustrates a detection model 400 and the specific structure of the components therein, according to some embodiments of the present disclosure. It should be understood that the following structures are shown by way of example only, and any convolutional network constructed based on the description of the embodiments of the present disclosure is within the scope of the present disclosure.
FIG. 4 illustrates a schematic structural diagram of a detection model 400 according to some embodiments of the present disclosure. As shown in fig. 4, the detection model 400 includes a trunk feature extraction component 410, a feature pyramid component 420, and an information prediction component 430, and the feature pyramid component 420 is coupled to the trunk feature extraction component 410 and the information prediction component 430, respectively.
The following describes the detection model 400 and its internal structure with reference to fig. 4 to 7.
As shown in fig. 4, the trunk feature extraction component 410 is formed by sequentially coupling 4 hole convolution modules 500 (MDC_Block), which are respectively denoted as MDC_Block1, MDC_Block2, MDC_Block3, and MDC_Block4. The 4 hole convolution modules have essentially the same structure, and each comprises hole convolution blocks with different dilation rates.
FIG. 5 illustrates a schematic structural diagram of a hole convolution module 500 according to some embodiments of the present disclosure.
As shown in FIG. 5, the hole convolution module 500 includes a first convolution block 510, a second convolution block 520, a third convolution block 530, a fourth convolution block 540, a pooling layer 550, and an activation layer 560. The first convolution block 510 and the second convolution block 520 each include a hole convolution block (DConv) and an ordinary convolution block (Conv), and the dilation rates of the hole convolution blocks in the first convolution block 510 and the second convolution block 520 differ from each other. The third convolution block 530 and the fourth convolution block 540 include only ordinary convolution blocks (Conv).
It should be noted that after the structure of the hole convolution module 500 is constructed, some parameters need to be preset, for example, the convolution kernel size used in each convolution block, the stride of the convolution kernel, the padding size, the stride of the pooling layer, the activation function selected for the activation layer, and the like. In one embodiment, the stride of the pooling layer 550 is set to Stride = 2, and the LeakyReLU activation function is selected for the activation layer 560. In addition, Table 1 shows some other parameters of the hole convolution module 500.
Table 1: partial parameter example of the hole convolution module 500 (provided as an image in the original publication).
According to an embodiment of the present disclosure, the values of i are different in the 4 hole convolution modules 500. The values of i corresponding to MDC _ Block1, MDC _ Block2, MDC _ Block3 and MDC _ Block4 are 1,2, 3 and 4, respectively.
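Because FIG. 5 and Table 1 are available only as images, the exact wiring and parameter values of the module cannot be reproduced here; the following PyTorch sketch is one plausible reading in which the four convolution blocks run in parallel, and the dilation rates, kernel sizes and channel widths are assumptions.

```python
# Hypothetical sketch of a hole (dilated) convolution module. The parallel
# arrangement of the four blocks and all numeric parameters are assumptions;
# only the block types, the stride-2 pooling and the LeakyReLU follow the text.
import torch
import torch.nn as nn

class MDCBlock(nn.Module):
    def __init__(self, in_ch, out_ch, dilations=(2, 3)):
        super().__init__()
        # First and second blocks: dilated conv (DConv) followed by an ordinary conv,
        # with different dilation rates.
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=dilations[0], dilation=dilations[0]),
            nn.Conv2d(out_ch, out_ch, 3, padding=1))
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=dilations[1], dilation=dilations[1]),
            nn.Conv2d(out_ch, out_ch, 3, padding=1))
        # Third and fourth blocks: ordinary convolutions only.
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.branch4 = nn.Conv2d(in_ch, out_ch, 1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)  # pooling type assumed; text fixes only Stride = 2
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        y = self.branch1(x) + self.branch2(x) + self.branch3(x) + self.branch4(x)
        return self.act(self.pool(y))
```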
According to the embodiment of the present disclosure, the trunk feature extraction component 410 receives the training image, processes it sequentially through the 4 hole convolution modules 500, correspondingly generating a first feature subgraph F1, a second feature subgraph F2, a third feature subgraph F3, and a fourth feature subgraph F4, and outputs F3 and F4 to the feature pyramid component 420.
In one embodiment, the size of the input training image is 224 × 224. Of the two feature subgraphs that are output, F3 has a spatial size of 28 × 28 and F4 has a spatial size of 14 × 14.
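Chaining four such modules reproduces the spatial sizes stated above (224 → 112 → 56 → 28 → 14, one stride-2 pooling per module); the channel widths in the sketch below are assumptions.

```python
# Trunk feature extraction sketch reusing the MDCBlock sketch above; it returns
# the 3rd and 4th feature subgraphs F3 (28 x 28) and F4 (14 x 14).
import torch.nn as nn

class Trunk(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = MDCBlock(3, 16)     # 224 -> 112
        self.block2 = MDCBlock(16, 32)    # 112 -> 56
        self.block3 = MDCBlock(32, 64)    # 56  -> 28
        self.block4 = MDCBlock(64, 128)   # 28  -> 14

    def forward(self, x):
        f1 = self.block1(x)
        f2 = self.block2(f1)
        f3 = self.block3(f2)              # third feature subgraph F3
        f4 = self.block4(f3)              # fourth feature subgraph F4
        return f3, f4
```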
The feature pyramid component 420 includes at least a plurality of convolution processing modules (ConvBM) and a plurality of feature upsampling modules (FeatureUp). FIG. 6 illustrates a structural schematic of a feature pyramid component 420 according to some embodiments of the present disclosure.
As shown in FIG. 6, the feature pyramid component 420 includes coupled convolution processing layer CP1, feature fusion layer CA1, 2 convolution processing layers CP2-CP3, 2 feature upsampling modules U1-U2, feature fusion layer CA2, convolution layer C0, and a convolution processing module ConvBM. The 3 convolution processing layers CP1-CP3 each comprise a convolution processing module (ConvBM) and a max pooling layer (Max Pooling). The 2 feature upsampling modules U1-U2 are identical in structure and each comprise a convolution processing module (ConvBM) and an upsampling layer (UpSample) coupled in sequence.
Similarly, the network parameters in the feature pyramid component 420 also need to be preset. For example, convolution layer C0 has a convolution kernel size of 1 × 1; in CP1-CP3, the stride of the max pooling layer is Stride = 2.
Furthermore, as can be seen from the above description, a plurality of convolution processing modules (ConvBM) are nested in the feature pyramid component 420. In one embodiment, the convolution processing modules (ConvBM) may employ the same network structure. The convolution processing module (ConvBM) includes a convolution layer, a normalization layer (BatchNorm) and an activation layer coupled in sequence, where the convolution kernel size is 3 × 3 and the activation function employed by the activation layer is, for example, LeakyReLU, but is not limited thereto.
According to an embodiment of the present disclosure, the third feature subgraph F3 and the fourth feature subgraph F4 are input to the feature pyramid component 420 for feature extraction and sampling. Specifically, the third feature subgraph F3 is processed by the convolution processing layer CP1 (i.e., at least by a convolution processing module) to obtain a processed feature map, and the processed feature map and the fourth feature subgraph F4 are input to the feature fusion layer CA1 and fused (for example, using Concat) to obtain a first fused feature map. Then, the first fused feature map is processed by the convolution processing layer CP2 (i.e., at least 1 convolution processing module) and the feature upsampling module U1 in sequence to obtain a first intermediate feature map. Meanwhile, the first fused feature map is processed through the convolution processing layers CP2-CP3 (i.e., at least 2 convolution processing modules) and the feature upsampling module U2 in sequence to obtain a second intermediate feature map. Then, the first intermediate feature map and the second intermediate feature map are input to the feature fusion layer CA2 and fused (Concat) to obtain a second fused feature map. The second fused feature map is then convolved by the convolution layer C0 to generate a first output feature map (denoted as O1). On one hand, O1 is output to the subsequent information prediction component 430; on the other hand, the convolution processing module ConvBM processes O1 to generate a second output feature map (denoted as O2), which is also output to the information prediction component 430.
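The strides and upsampling factors inside the pyramid are not fully recoverable from the text, so the following sketch makes assumptions chosen so that O1 comes out at 28 × 28 and O2 at 14 × 14, which matches the 980 (= 28 × 28 + 14 × 14) predictions counted below; conv_bm stands for the ConvBM module (3 × 3 convolution, BatchNorm, LeakyReLU).

```python
# Hypothetical feature pyramid sketch. Channel widths, upsampling target sizes
# and the stride of the trailing ConvBM are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bm(in_ch, out_ch, stride=1):
    # ConvBM: 3x3 convolution + BatchNorm + LeakyReLU activation.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1))

class FeaturePyramid(nn.Module):
    def __init__(self, c3=64, c4=128, mid=128, out=128):
        super().__init__()
        self.cp1 = nn.Sequential(conv_bm(c3, c4), nn.MaxPool2d(2))       # CP1
        self.cp2 = nn.Sequential(conv_bm(2 * c4, mid), nn.MaxPool2d(2))  # CP2
        self.cp3 = nn.Sequential(conv_bm(mid, mid), nn.MaxPool2d(2))     # CP3
        self.u1 = conv_bm(mid, mid)               # conv part of U1
        self.u2 = conv_bm(mid, mid)               # conv part of U2
        self.c0 = nn.Conv2d(2 * mid, out, 1)      # C0, 1x1 convolution
        self.final = conv_bm(out, out, stride=2)  # trailing ConvBM (stride assumed)

    def forward(self, f3, f4):
        fused1 = torch.cat([self.cp1(f3), f4], dim=1)                          # CA1
        mid1 = F.interpolate(self.u1(self.cp2(fused1)), size=(28, 28))           # first intermediate map
        mid2 = F.interpolate(self.u2(self.cp3(self.cp2(fused1))), size=(28, 28))  # second intermediate map
        o1 = self.c0(torch.cat([mid1, mid2], dim=1))   # CA2 + C0 -> O1 (28 x 28)
        o2 = self.final(o1)                            # O2 (14 x 14)
        return o1, o2
```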
In one embodiment, the input F3 has a spatial size of 28 × 28 and F4 has a spatial size of 14 × 14. Two feature maps are output: O1, with size H1 × W1 × C1, and O2, with size H2 × W2 × C2.
Information prediction component 430 includes a closed-eye region predicted branch and a category predicted branch. The eye-closing region prediction branch at least comprises a plurality of convolution processing modules and can output a prediction region comprising eyes in a closed state. The category prediction branch at least comprises a plurality of convolution processing modules and a classification layer to output a state category prediction value and a background category prediction value.
FIG. 7 illustrates a block diagram of the information prediction component 430 according to some embodiments of the present disclosure. As shown in fig. 7, the information prediction component 430 includes 2 prediction branches: a closed-eye region prediction branch 432 and a category prediction branch 434. The closed-eye region prediction branch 432 comprises 3 convolution processing modules ConvBM; the category prediction branch 434 comprises 3 convolution processing modules ConvBM and a classification layer S. According to an embodiment of the present disclosure, the convolution processing modules ConvBM in the information prediction component 430 may be consistent with the convolution processing modules ConvBM in the feature pyramid component 420, and are not described again here. The classification layer S can adopt Softmax to output the prediction probability values under two classes. In conjunction with the above, the output prediction probability values can be expressed as (p1, p2), where p1 and p2 represent the state category prediction value and the background category prediction value, respectively; in one embodiment, p1 is the predicted probability that the eye is in a closed state and p2 is the predicted probability that the image contains an eye.
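A corresponding sketch of the two prediction heads is given below; the intermediate channel width and the 1 × 1 output projections are assumptions, and the Softmax classification layer S named in the text is replaced with a per-channel sigmoid so that the two output probabilities stay independent (a simplification, not the disclosed design).

```python
# Information prediction sketch: a closed-eye region branch and a category
# branch, each built from three ConvBM modules (conv_bm as sketched above).
import torch
import torch.nn as nn

class InfoPrediction(nn.Module):
    def __init__(self, in_ch=128, mid=64):
        super().__init__()
        self.region_branch = nn.Sequential(
            conv_bm(in_ch, mid), conv_bm(mid, mid), conv_bm(mid, mid),
            nn.Conv2d(mid, 4, 1))      # (x, y, w, h) per grid location
        self.class_branch = nn.Sequential(
            conv_bm(in_ch, mid), conv_bm(mid, mid), conv_bm(mid, mid),
            nn.Conv2d(mid, 2, 1))      # (closed-eye score, contains-eye score)

    def forward(self, feat):
        regions = self.region_branch(feat)
        # The text specifies a Softmax layer S; a sigmoid is substituted here
        # so the two probabilities remain independent (sketch simplification).
        probs = torch.sigmoid(self.class_branch(feat))
        return regions, probs
```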
According to one embodiment, the first output feature map O1 and the second output feature map O2 (of size Hi × Wi × Ci, where i takes values 1 and 2, as described above) are each input into the information prediction component 430 and processed by convolution and other operations, and the output prediction results are as follows. O1 and O2 are input into the closed-eye region prediction branch 432, with corresponding output sizes of Hi × Wi × 4 (i = 1, 2). O1 and O2 are input into the category prediction branch 434, with corresponding output sizes of Hi × Wi × 2 (i = 1, 2). For each branch, the two outputs are spliced, finally giving:
the output size of the closed-eye region prediction branch 432: (H1 × W1 + H2 × W2) × 4 = 980 × 4;
the output size of the category prediction branch 434: (H1 × W1 + H2 × W2) × 2 = 980 × 2.
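The figure of 980 follows from flattening the two spatial grids and splicing them: 28 × 28 + 14 × 14 = 784 + 196 = 980. A small sketch of that splicing step, with the grid sizes assumed as above, is:

```python
# Splicing the per-location outputs of O1 (28 x 28 grid) and O2 (14 x 14 grid)
# into 980 predictions per image; `channels` is 4 for the region branch and 2
# for the category branch.
import torch

def splice_branch_outputs(out1, out2, channels):
    # out1: (N, channels, 28, 28); out2: (N, channels, 14, 14)
    flat1 = out1.permute(0, 2, 3, 1).reshape(out1.size(0), -1, channels)  # (N, 784, channels)
    flat2 = out2.permute(0, 2, 3, 1).reshape(out2.size(0), -1, channels)  # (N, 196, channels)
    return torch.cat([flat1, flat2], dim=1)                               # (N, 980, channels)
```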
After the detection model 400 is constructed and the initial network parameters are set, the detection model 400 may be trained. It should be understood that the above description of the network parameters in the detection model 400 is only an example, and in practical applications, a person skilled in the art may set corresponding network parameters according to a network structure, a training process, and the like, which is not described in detail in this disclosure.
In the following step S330, the training image is input to the detection model to be processed to output the prediction result.
In combination with the description in step S320, the training image is input into the detection model, and the prediction result output after processing includes: a prediction region containing an eye in a closed state, a state category prediction value, and a background category prediction value. Specifically, the output of the closed-eye region prediction branch 432 has size 980 × 4 and thus contains 980 prediction regions, each determined by (x, y, w, h), where (x, y) indicates the center position of the prediction region and (w, h) indicates its width and height. The output of the category prediction branch 434 has size 980 × 2 and, for each prediction region, gives the predicted probability that the eye is in the closed state and the predicted probability that the image contains an eye. In one embodiment, the average values are taken to obtain p1 and p2 as the state category prediction value and the background category prediction value, respectively.
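Averaging the 980 per-region probabilities to obtain p1 and p2 follows the text above; which of the 980 boxes is reported as the prediction region is not spelled out, so the argmax rule in the sketch below is an assumption.

```python
# Decoding sketch for one image: regions is a (980, 4) tensor of (x, y, w, h),
# probs is a (980, 2) tensor of (closed-eye prob, contains-eye prob).
import torch

def decode_predictions(regions, probs):
    p1 = probs[:, 0].mean()                    # state category prediction value
    p2 = probs[:, 1].mean()                    # background category prediction value
    best_box = regions[probs[:, 0].argmax()]   # assumed: report the most confident box
    return best_box, p1, p2
```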
Subsequently, in step S340, based on the prediction result and the label data, a loss value is calculated, and the network parameter is adjusted according to the loss value until a predetermined condition is satisfied, and the corresponding detection model is the finally generated detection model for detecting the eye state.
According to one embodiment, the loss value is calculated as follows.
1) A first loss is calculated using the preset area labels and the prediction regions containing eyes in a closed state, and is denoted L1. In one embodiment, the first loss is computed over the k-th prediction region and the k-th preset area for k = 1, 2, …, N, where N = 980 is the number of preset areas.
2) A second loss is calculated using the state label and the state category prediction value, and is denoted L2. In one embodiment, the second loss is computed from the state category prediction value (namely p1, with a value between 0 and 1) and the state label (with a value of 0 or 1).
3) A third loss is calculated using the background label and the background category prediction value, and is denoted L3. In one embodiment, the third loss is computed from the background category prediction value (namely p2, with a value between 0 and 1) and the background label (with a value of 0 or 1).
Finally, a loss value is determined based on the first loss, the second loss, and the third loss. Optionally, the above three losses are weighted and summed to obtain the loss value of the whole training, denoted L:
L = w1·L1 + w2·L2 + w3·L3,
where w1, w2 and w3 are preset weighting coefficients.
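The exact loss expressions appear only as images in the original publication, so the sketch below substitutes commonly used stand-ins (a smooth-L1 term for the first loss and binary cross-entropy for the second and third losses) together with example weights; it illustrates the weighted-sum structure only and is not the disclosed formulation.

```python
# Loss sketch with assumed loss forms and example weights w1, w2, w3.
import torch
import torch.nn.functional as F

def detection_loss(pred_regions, preset_regions, p1, state_label, p2, bg_label,
                   w1=1.0, w2=1.0, w3=0.5):
    # pred_regions, preset_regions: (980, 4) tensors of (x, y, w, h);
    # p1, p2, state_label, bg_label: 0-dim tensors with values in [0, 1].
    l1 = F.smooth_l1_loss(pred_regions, preset_regions)  # first loss (assumed form)
    l2 = F.binary_cross_entropy(p1, state_label)         # second loss (assumed form)
    l3 = F.binary_cross_entropy(p2, bg_label)            # third loss (assumed form)
    return w1 * l1 + w2 * l2 + w3 * l3                   # weighted sum
```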
According to an embodiment of the present disclosure, the loss value is calculated in the above manner, and the network parameters of the detection model are adjusted according to the loss value so as to update the detection model. Then, the training image is input into the updated detection model and a prediction result is output; the network parameters of the detection model are again adjusted using the loss value calculated from the prediction result and the label data, and the detection model is updated. This iterative process is repeated until a predetermined condition is met, and the corresponding detection model is the finally generated detection model for detecting the eye state.
It should be noted that the present disclosure does not impose strict limitations on the predetermined condition; it may be, for example, that the difference between the loss values of two or more successive iterations is smaller than a preset small value, or that a predetermined maximum number of iterative updates has been reached.
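Putting steps S330 and S340 together, an assumed training loop could look like the sketch below; the optimizer, learning rate and stopping tolerance are example choices, and the model is assumed to return per-image regions, p1 and p2 as in the decoding sketch above.

```python
# Training-loop sketch: forward pass, loss from the prediction result and label
# data, parameter update, repeated until a predetermined condition is met.
import torch

def train(model, loader, max_iters=10000, tol=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    prev = None
    for it, (image, area_label, state_label, bg_label) in enumerate(loader):
        regions, p1, p2 = model(image)                   # prediction result
        loss = detection_loss(regions, area_label, p1, state_label, p2, bg_label)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # predetermined condition: loss change below tol, or max iterations reached
        if (prev is not None and abs(prev - loss.item()) < tol) or it + 1 >= max_iters:
            break
        prev = loss.item()
    return model
```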
According to the method 300 of the present disclosure, a detection model is constructed based on the hole convolution, and corresponding tag data is set; and then, training the detection model by using the training image and the label data to obtain the detection model for detecting the eye state. In addition, in order to enhance the accuracy of the state type prediction result output by the detection model, a background label is added when label data is set, and the background label is used as an auxiliary confidence coefficient for state type prediction.
The detection model generated according to the method 300 belongs to a lightweight network, on one hand, the accuracy of eye closing detection of human eyes is guaranteed, and on the other hand, the speed of eye closing detection of human eyes is increased.
Fig. 8 shows a flow diagram of a method 800 of detecting an eye state according to one embodiment of the present disclosure. The method 800 is implemented with the system 100. It should be understood that the method 800 and the method 300 complement each other, and repeated content is not described again here.
As shown in fig. 8, the method 800 begins at step S810. In step S810, an image indicating an object to be detected is input to a detection model for detecting an eye state, and at least a state category prediction value is output after processing. Wherein, the state category predicted value is the probability value that the eye is in the closed state, namely p 1.
In one embodiment, an image indicative of an object to be detected is acquired with the image acquisition unit 110, which is generally expected to include an eye region of the object to be detected.
It should be noted that the image acquired by the image acquisition unit 110 needs to be preprocessed to obtain a preprocessed image. The image input to the detection model is typically a pre-processed image. The preprocessing process may refer to the processing of the sample image in step S310, and is not described herein again.
The image is input to a detection model, and a prediction region and a prediction value of a background category are output after processing. As described above, the predicted region is a rectangular region containing eyes in a closed state, and the background category predicted value is a probability value p2 that the image contains eyes.
According to an embodiment of the present disclosure, the detection model may be generated by method 300 training. For the related description of the processing flow and the prediction result of the input image by the detection model, reference may be made to the related description in the method 300, and details are not repeated here.
Subsequently, in step S820, when the probability value in the closed state is greater than a preset threshold value, it is determined that the eyes of the object to be detected are in the closed state. And then, according to the output prediction area, the eye closing position can be determined.
When the background category prediction value is greater than a preset value, the image is confirmed to contain an eye. Combining the two prediction results, the eyes of the object to be detected are determined to be in a closed state only when the background category prediction value indicates that the image contains human eyes (i.e., p2 is greater than the preset value) and the state category prediction value indicates that the eyes are in a closed state (i.e., p1 is greater than the preset threshold). Moreover, the larger the value of p2, the more trustworthy the detected closed-eye state is.
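A compact sketch of this decision rule is given below; the threshold values are examples, not values fixed by the disclosure.

```python
# Decision rule of steps S810/S820: report a closed eye only when the image is
# judged to contain an eye (p2) and the closed-state probability (p1) is high.
def is_eye_closed(p1, p2, closed_threshold=0.5, eye_threshold=0.5):
    contains_eye = p2 > eye_threshold     # background category check
    eye_closed = p1 > closed_threshold    # state category check
    return contains_eye and eye_closed
```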
According to the method 800 disclosed by the invention, whether the eyes of the object to be detected are in the closed state or not can be accurately detected, and when the eyes are in the closed state, the eyes are accurately positioned to the positions of the eyes in the closed state. The scheme can be applied to various scenes of target detection, such as fatigue driving detection, identity authentication and the like, and provides a quick and effective result for subsequent application.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present disclosure, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the disclosure.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the methods of the present disclosure according to instructions in the program code stored in the memory.
By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with examples of the present disclosure. The required structure for constructing such a system will be apparent from the description above. Moreover, this disclosure is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the present disclosure as described herein, and any descriptions above of specific languages are provided for disclosure of preferred embodiments of the present disclosure.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the disclosure may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various disclosed aspects. However, the disclosed method should not be construed to reflect the intent: that is, the claimed disclosure requires more features than are expressly recited in each claim. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Moreover, those skilled in the art will appreciate that although some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the disclosure and to form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
Additionally, some of the embodiments are described herein as a method or combination of method elements that can be implemented by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purposes of this disclosure.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the disclosure as disclosed herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present disclosure is intended to be illustrative, but not limiting, of its scope, which is set forth in the following claims.

Claims (9)

1. A method of generating a detection model for detecting eye states, comprising the steps of:
preprocessing a sample image containing a single eye to generate a training image and label data, the label data comprising at least one of: a preset area label indicating a rectangular area containing an eye in a closed state, a state label indicating an eye state category of the sample image, and a background label indicating a background category of the sample image, the background category including: an image containing eyes and an image not containing eyes;
establishing a detection model and initial network parameters, wherein the detection model is formed by coupling a trunk feature extraction component, a feature pyramid component and an information prediction component, and the eye state comprises a closed state and a non-closed state;
inputting the training image into the detection model for processing so as to output a prediction result, wherein the prediction result comprises at least one of: a prediction area of an eye in a closed state, a state category prediction value, and a background category prediction value, the background category prediction value serving as an auxiliary confidence for the predicted state category, a larger background category prediction value indicating a more credible state category prediction value; and
calculating a loss value based on the prediction result and the label data, and adjusting network parameters according to the loss value until a predetermined condition is met, wherein the corresponding detection model is the finally generated detection model for detecting the eye state,
in the detection model, the trunk feature extraction component is formed by sequentially coupling 4 hole convolution modules, each hole convolution module comprising a first convolution block, a second convolution block, a third convolution block, a fourth convolution block, a pooling layer and an activation layer, wherein the first convolution block and the second convolution block are each coupled with the fourth convolution block, the third convolution block and the fourth convolution block are each coupled with the pooling layer, and the pooling layer is coupled with the activation layer; the first convolution block and the second convolution block each comprise a hole convolution block and a general convolution block, the hole convolution blocks in the first convolution block and the second convolution block having different hole rates, and the third convolution block and the fourth convolution block comprising only general convolution blocks.
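For readability, a minimal PyTorch sketch of one possible reading of the hole convolution module described in claim 1 is given below. The channel widths, kernel sizes, hole (dilation) rates and the use of element-wise summation as the coupling operation are assumptions, since the claim fixes only the block topology.

```python
import torch.nn as nn

class HoleConvModule(nn.Module):
    """One possible reading of the 'hole convolution module' in claim 1.

    Channel widths, kernel sizes, dilation rates and the fusion operation
    (summation here) are illustrative assumptions, not taken from the claim.
    """
    def __init__(self, in_ch, out_ch, d1=2, d2=3):
        super().__init__()
        # first/second blocks: a hole (dilated) convolution followed by an
        # ordinary convolution, with a different dilation rate per block
        self.block1 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=d1, dilation=d1),
            nn.Conv2d(out_ch, out_ch, 3, padding=1))
        self.block2 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=d2, dilation=d2),
            nn.Conv2d(out_ch, out_ch, 3, padding=1))
        # third/fourth blocks: ordinary convolutions only
        self.block3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.block4 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.pool = nn.MaxPool2d(2)       # pooling layer
        self.act = nn.ReLU(inplace=True)  # activation layer

    def forward(self, x):
        # the first and second blocks both feed the fourth block (fused by summation here)
        b4 = self.block4(self.block1(x) + self.block2(x))
        # the third and fourth blocks feed the pooling layer, then the activation layer
        return self.act(self.pool(self.block3(x) + b4))
```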
2. The method of claim 1, wherein,
the feature pyramid component at least comprises a plurality of convolution processing modules and a plurality of feature upsampling modules, wherein each convolution processing module comprises a convolution layer, a normalization layer and an activation layer which are sequentially coupled, and each feature upsampling module comprises a convolution processing module and an upsampling layer which are sequentially coupled,
the information prediction component comprises a closed-eye region prediction branch and a category prediction branch, wherein the closed-eye region prediction branch at least comprises a plurality of convolution processing modules and is adapted to output a prediction area containing an eye in a closed state; and
the category prediction branch at least comprises a plurality of convolution processing modules and a classification layer, and is adapted to output a state category prediction value and a background category prediction value.
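A minimal sketch of the building blocks named in claim 2 follows. The number of stacked modules, the channel widths, and the 4-value box encoding used by the closed-eye region branch are assumptions; only the layer order inside each module comes from the claim.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """'Convolution processing module' of claim 2: convolution -> normalization -> activation."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

def upsample_block(in_ch, out_ch, scale=2):
    """'Feature upsampling module': a convolution processing module followed by an upsampling layer."""
    return nn.Sequential(conv_block(in_ch, out_ch),
                         nn.Upsample(scale_factor=scale, mode="nearest"))

class PredictionHead(nn.Module):
    """Information prediction component: a closed-eye region branch and a category branch."""
    def __init__(self, in_ch):
        super().__init__()
        # region branch: stacked conv processing modules, 4 box values per location (assumed encoding)
        self.region = nn.Sequential(conv_block(in_ch, in_ch),
                                    conv_block(in_ch, in_ch),
                                    nn.Conv2d(in_ch, 4, 1))
        # category branch: conv processing modules plus a classification layer producing
        # a state-category score and a background-category score
        self.category = nn.Sequential(conv_block(in_ch, in_ch),
                                      conv_block(in_ch, in_ch),
                                      nn.Conv2d(in_ch, 2, 1))

    def forward(self, feat):
        scores = self.category(feat).sigmoid()
        # returns: predicted region, state category prediction, background category prediction
        return self.region(feat), scores[:, 0:1], scores[:, 1:2]
```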
3. The method of claim 1, wherein inputting the training image into the detection model for processing to output a prediction result comprises:
inputting the training image into the trunk feature extraction component, sequentially processing the training image through a plurality of hole convolution modules, outputting a third feature sub-map from the 3rd hole convolution module, and outputting a fourth feature sub-map from the 4th hole convolution module;
inputting the third feature sub-map and the fourth feature sub-map into the feature pyramid component, performing feature extraction and sampling, and outputting a first output feature map and a second output feature map; and
inputting the first output feature map and the second output feature map into the information prediction component, and outputting the prediction result after convolution processing.
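The data flow of claim 3 can be summarized as the sketch below. The module names, the constructor arguments, and the assumption that the same prediction head processes both output feature maps are illustrative only.

```python
import torch.nn as nn

class EyeStateDetector(nn.Module):
    """Sketch of the data flow in claim 3. `backbone` is expected to be a list of
    4 hole convolution modules, `fpn` the feature pyramid component, and `head`
    the information prediction component; all three names are assumptions."""
    def __init__(self, backbone, fpn, head):
        super().__init__()
        self.backbone = nn.ModuleList(backbone)
        self.fpn = fpn
        self.head = head

    def forward(self, image):
        feats, x = [], image
        for module in self.backbone:      # 4 hole convolution modules applied in sequence
            x = module(x)
            feats.append(x)
        f3, f4 = feats[2], feats[3]       # 3rd and 4th feature sub-maps
        out1, out2 = self.fpn(f3, f4)     # feature extraction and sampling in the pyramid
        return self.head(out1), self.head(out2)   # prediction result per output feature map
```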
4. The method of claim 3, wherein inputting the third feature sub-map and the fourth feature sub-map into the feature pyramid component, performing feature extraction and sampling, and outputting the first output feature map and the second output feature map comprises:
processing the third feature sub-map by at least one convolution processing module, and fusing the processed feature map with the fourth feature sub-map to obtain a first fused feature map;
processing the first fused feature map by at least 1 convolution processing module and 1 feature upsampling module to obtain a first intermediate feature map;
processing the first fused feature map by at least 2 convolution processing modules and 1 feature upsampling module to obtain a second intermediate feature map;
fusing the first intermediate feature map and the second intermediate feature map to obtain a second fused feature map;
performing convolution processing on the second fused feature map to generate the first output feature map; and
processing the first output feature map through a convolution processing module to generate the second output feature map.
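A sketch of the fusion order in claim 4 is given below. The channel widths, the stride-2 convolution used to bring the third sub-map to the resolution of the fourth (the fourth is assumed to be half the size of the third), and concatenation as the fusion operation are assumptions; the claim fixes only the number and order of modules.

```python
import torch
import torch.nn as nn

def _conv(in_ch, out_ch, stride=1):
    # "convolution processing module": convolution -> normalization -> activation (claim 2)
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

def _up(in_ch, out_ch):
    # "feature upsampling module": a convolution processing module followed by 2x upsampling
    return nn.Sequential(_conv(in_ch, out_ch), nn.Upsample(scale_factor=2, mode="nearest"))

class FeaturePyramid(nn.Module):
    """Sketch of the fusion steps in claim 4; channel widths, resolution alignment
    and concatenation as the fusion operation are assumptions."""
    def __init__(self, c3, c4, mid=128):
        super().__init__()
        self.reduce3 = _conv(c3, mid, stride=2)                                   # conv module on the 3rd sub-map
        self.path1 = nn.Sequential(_conv(mid + c4, mid), _up(mid, mid))           # 1 conv + 1 upsampling module
        self.path2 = nn.Sequential(_conv(mid + c4, mid), _conv(mid, mid),
                                   _up(mid, mid))                                 # 2 conv + 1 upsampling module
        self.out1_conv = nn.Conv2d(2 * mid, mid, 3, padding=1)
        self.out2_conv = _conv(mid, mid)

    def forward(self, f3, f4):
        fused1 = torch.cat([self.reduce3(f3), f4], dim=1)  # first fused feature map
        inter1 = self.path1(fused1)                        # first intermediate feature map
        inter2 = self.path2(fused1)                        # second intermediate feature map
        fused2 = torch.cat([inter1, inter2], dim=1)        # second fused feature map
        out1 = self.out1_conv(fused2)                      # first output feature map
        out2 = self.out2_conv(out1)                        # second output feature map
        return out1, out2
```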
5. The method of claim 1, wherein calculating a loss value based on the prediction result and the label data comprises:
calculating a first loss using the preset area label and the prediction area containing the eye in the closed state;
calculating a second loss using the state label and the state category prediction value;
calculating a third loss using the background label and the background category prediction value; and
determining the loss value based on the first loss, the second loss and the third loss.
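A sketch of the loss combination in claim 5 follows. The specific loss functions (smooth L1 for the closed-eye region, binary cross-entropy for both categories) and the weighted sum are assumptions; the claim only names three losses and a combined value, and the prediction scores are assumed to be probabilities in [0, 1].

```python
import torch.nn.functional as F

def detection_loss(pred_box, pred_state, pred_bg, gt_box, gt_state, gt_bg,
                   w1=1.0, w2=1.0, w3=1.0):
    """Combine the three losses of claim 5 into one value (weights are assumed)."""
    loss_region = F.smooth_l1_loss(pred_box, gt_box)            # first loss: preset area label vs prediction area
    loss_state = F.binary_cross_entropy(pred_state, gt_state)   # second loss: state label vs state category prediction
    loss_bg = F.binary_cross_entropy(pred_bg, gt_bg)            # third loss: background label vs background category prediction
    return w1 * loss_region + w2 * loss_state + w3 * loss_bg
```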
6. A method of detecting eye state, comprising the steps of:
inputting an image indicating an object to be detected into a detection model for detecting eye states, and, after processing, outputting at least a state category prediction value, a prediction area and a background category prediction value, wherein the state category prediction value is a probability value that an eye is in a closed state, the prediction area is a rectangular area containing the eye in the closed state, and the background category prediction value is a probability value that the image contains an eye; and
determining that the eyes of the object to be detected are in a closed state when the background category prediction value indicates that the image contains an eye and the state category prediction value indicates that the eye is in a closed state, wherein the detection model is generated by performing the method of any one of claims 1-5.
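The decision rule of claim 6 reduces to a conjunction of the two scores; the 0.5 thresholds in the sketch below are illustrative assumptions.

```python
def is_eye_closed(state_score, background_score,
                  state_thresh=0.5, background_thresh=0.5):
    """Report 'closed' only when the background score indicates the image contains
    an eye AND the state score indicates that eye is closed (thresholds assumed)."""
    return background_score > background_thresh and state_score > state_thresh
```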
7. A system for detecting eye states, adapted to perform the method of any one of claims 1-5 and/or the method of claim 6, comprising:
an image acquisition unit adapted to acquire an image containing an eye region of an object to be detected and to preprocess the image to generate an image to be detected;
an image processing unit adapted to input the image to be detected into a detection model for processing so as to output a prediction result;
a prediction result unit adapted to determine the eye state of the object to be detected based on the prediction result, wherein the eye state comprises a closed state and a non-closed state; and
a convolutional network generating unit adapted to train and generate the detection model for detecting the eye state.
8. A computing device, comprising:
at least one processor and a memory storing program instructions;
the program instructions, when read and executed by the processor, cause the computing device to perform the method of any of claims 1-5 and/or perform the method of claim 6.
9. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-5 and/or perform the method of claim 6.
CN202210218130.7A 2022-03-08 2022-03-08 Method for generating detection model and method for detecting eye state by using detection model Active CN114283488B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210218130.7A CN114283488B (en) 2022-03-08 2022-03-08 Method for generating detection model and method for detecting eye state by using detection model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210218130.7A CN114283488B (en) 2022-03-08 2022-03-08 Method for generating detection model and method for detecting eye state by using detection model

Publications (2)

Publication Number Publication Date
CN114283488A (en) 2022-04-05
CN114283488B (en) 2022-06-14

Family

ID=80882278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210218130.7A Active CN114283488B (en) 2022-03-08 2022-03-08 Method for generating detection model and method for detecting eye state by using detection model

Country Status (1)

Country Link
CN (1) CN114283488B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020173135A1 (en) * 2019-02-28 2020-09-03 北京市商汤科技开发有限公司 Neural network training and eye opening and closing state detection method, apparatus, and device
WO2021098300A1 (en) * 2019-11-18 2021-05-27 北京京东尚科信息技术有限公司 Facial parsing method and related devices

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108038466B (en) * 2017-12-26 2021-11-16 河海大学 Multi-channel human eye closure recognition method based on convolutional neural network
CN113392960B (en) * 2021-06-10 2022-08-30 电子科技大学 Target detection network and method based on mixed hole convolution pyramid
CN113705460B (en) * 2021-08-30 2024-03-15 平安科技(深圳)有限公司 Method, device, equipment and storage medium for detecting open and closed eyes of face in image

Also Published As

Publication number Publication date
CN114283488A (en) 2022-04-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant