CN113243886A - Vision detection system and method based on deep learning and storage medium - Google Patents

Vision detection system and method based on deep learning and storage medium

Info

Publication number
CN113243886A
Authority
CN
China
Prior art keywords
posture
user
layer
evaluation module
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110652556.9A
Other languages
Chinese (zh)
Other versions
CN113243886B (en)
Inventor
桑高丽
卢丽
闫超
韩强
陶陶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Yifei Technology Co ltd
Original Assignee
Sichuan Yifei Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Yifei Technology Co ltd filed Critical Sichuan Yifei Technology Co ltd
Priority to CN202110652556.9A priority Critical patent/CN113243886B/en
Publication of CN113243886A publication Critical patent/CN113243886A/en
Application granted granted Critical
Publication of CN113243886B publication Critical patent/CN113243886B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B3/00 Apparatus for testing the eyes; Instruments for examining the eyes
    • A61B3/02 Subjective types, i.e. testing apparatus requiring the active assistance of the patient
    • A61B3/028 Subjective types, i.e. testing apparatus requiring the active assistance of the patient for testing visual acuity; for determination of refraction, e.g. phoropters
    • A61B3/032 Devices for presenting test symbols or characters, e.g. test chart projectors
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B3/00 Apparatus for testing the eyes; Instruments for examining the eyes
    • A61B3/0016 Operational features thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm
    • G06V40/113 Recognition of static hand signs

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Surgery (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Veterinary Medicine (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Animal Behavior & Ethology (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Ophthalmology & Optometry (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a vision detection system, method and storage medium based on deep learning. The vision detection system comprises an identification display module, an image acquisition module, a posture evaluation module and a result evaluation module. The image acquisition module acquires images of the limb actions made by the user's arm and inputs them into the posture evaluation module; the posture evaluation module detects and acquires the posture key points of the arm's limb actions; and the result evaluation module judges the state of the user's arm from the arm posture key points, determines whether the user's action is consistent with the vision detection identification, and outputs a detection result. The result evaluation module makes its judgment from the direction of the user's arm posture rather than from the direction of a finger; since the arm is a larger target than a finger, it is easier to recognize and the detection accuracy is higher.

Description

Vision detection system and method based on deep learning and storage medium
Technical Field
The invention belongs to the technical field of vision detection, and particularly relates to a vision detection system and method based on deep learning and a storage medium.
Background
With the development of science and technology and the popularization of information technology, people spend more and more time on high-tech devices such as mobile phones, computers and televisions, which increases the risk of vision damage; this is especially true for teenagers, for whom vision damage cannot be ignored. Conventional vision testing is generally performed manually at a professional institution, and a user cannot carry it out independently.
Meanwhile, some intelligent vision detection systems exist, such as the intelligent vision detector based on image analysis disclosed in patent CN106778597A, which uses a detector built on traditional machine-vision algorithms to judge the posture and pointing direction of a finger and constructs a vision detection system from the finger's direction information. However, such methods rely on traditional algorithms and have relatively low accuracy. Moreover, a finger is a small target with complex joints, so its posture and pointing direction are difficult to judge accurately, which limits the performance of the system.
In the field of deep learning, machine vision models built on the self-attention mechanism have in recent years achieved excellent accuracy in many fields, but the computational cost of the ordinary self-attention mechanism grows as the fourth power of the image side length as the input size increases, so on large input images the amount of computation becomes prohibitive.
Disclosure of Invention
The present invention aims to provide a vision detection system, method and storage medium based on deep learning that solve the above problems.
The invention is mainly realized by the following technical scheme:
a vision detection system based on deep learning comprises an identification display module, an image acquisition module, a posture evaluation module and a result evaluation module; the identification display module is used for displaying the vision detection identification, and the user utilizes the arm to make corresponding limb actions; the image acquisition module is used for acquiring images of limb actions made by the arms of the user and inputting the images into the posture evaluation module; the gesture evaluation module is used for detecting and acquiring gesture key points of limb actions made by the arms; and the result evaluation module judges the arm state of the user according to the arm posture key point of the user, further judges whether the action of the user is consistent with the vision detection identification, and outputs a detection result.
The invention displays the identification for vision detection through the display module; the user makes the corresponding limb action with an arm; after the image acquisition module captures an image of the limb action, the posture evaluation module, based on deep learning, detects the user's posture key points; and finally the result evaluation module judges whether the user's action is consistent with the vision detection identification. The result evaluation module makes its judgment from the direction of the user's arm posture rather than from the direction of a finger; since the arm is a larger target than a finger, it is easier to recognize and the detection accuracy is higher.
In order to better implement the present invention, further, the posture evaluation module includes a target detection submodule and a posture detection submodule, and the target detection submodule is used for detecting a coordinate frame of a human body; the input of the posture detection submodule is an image area corresponding to a human body, key points of the human body posture are detected, and coordinate information of the key points of the posture is output.
In order to better implement the present invention, further, the gesture detection sub-module includes a plurality of alternating local attention units and a result output unit, which are sequentially arranged from front to back, wherein the alternating local attention units are used for extracting gesture feature information and generating a feature map; and the result output unit is used for up-sampling the feature map to improve the resolution of the feature map and generating final posture key point coordinate information from the feature map.
The target detection submodule detects the coordinate frame (bounding box) of the human body and can be implemented with detectors such as YOLO. The image area corresponding to the human body is then cropped and used as the input of the posture detection submodule, which detects 17 key points of the human body posture. The result output unit up-samples the feature map using deconvolution. The output feature map has 17 channels, corresponding to the 17 key points of the human body posture, and the specific coordinates of each key point are given by the location of the maximum value on its feature map. The posture evaluation module is constructed with an alternating local attention mechanism, which offers higher accuracy and a smaller amount of computation than traditional machine vision methods or convolutional neural networks.
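The decoding rule above (each keypoint's coordinates are taken at the maximum of its heatmap channel) can be sketched in plain Python; the nested-list representation and function name are illustrative, not from the patent.

```python
def decode_keypoints(heatmaps):
    """Decode keypoint coordinates from pose heatmaps.

    heatmaps: a list of 17 2-D grids (lists of lists), one per body
    keypoint, as produced by the result output unit after deconvolution
    up-sampling. Returns a list of (row, col) positions of the maximum
    response, mirroring the rule that each keypoint's coordinates are
    given by the location of the maximum value on its feature map.
    """
    coords = []
    for hm in heatmaps:
        best_val, best_rc = float("-inf"), (0, 0)
        for r, row in enumerate(hm):
            for c, v in enumerate(row):
                if v > best_val:
                    best_val, best_rc = v, (r, c)
        coords.append(best_rc)
    return coords
```

In practice the heatmaps would be tensors and the argmax vectorized, but the rule is the same.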
In order to better implement the present invention, further, the alternating local attention unit includes a region embedding layer and several alternating local attention layers, which are sequentially arranged from front to back, the region embedding layer is used for down-sampling the input image or feature map to fuse the information of all spatial points in the region into a single feature vector; the alternating local attention layer comprises a first region division layer, a first region self-attention layer, a second region division layer and a second region self-attention layer which are sequentially arranged from front to back; the first area dividing layer and the second area dividing layer are respectively used for dividing the feature map into a plurality of areas, and the first area self-attention layer and the second area self-attention layer are respectively used for performing self-attention operation in each area. For a region embedding layer with a downsampling rate of k, it can be implemented by a convolution operation with a convolution kernel size of k and a convolution step size of k.
In order to better implement the present invention, if the size of the input feature map is HxW, the division size of the first region division layer is M and that of the second region division layer is N, then the two layers divide the feature map into (H/M)x(W/M) MxM regions and (H/N)x(W/N) NxN regions, respectively; the division sizes M and N are relatively prime integers. Because M and N are relatively prime, a given feature point is grouped with different feature points under the first and second partitions, which allows information to circulate between regions and lets the alternating local attention layer acquire global feature information.
In order to better implement the present invention, the division sizes of the first area division layer and the second area division layer are 7 and 5, respectively.
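A small sketch of why coprime division sizes mix information across regions: with the sizes M=7 and N=5 given above, two cells grouped together by one partition can fall into different regions under the other. The helper below is illustrative only.

```python
def region_id(r, c, size):
    """Region index of feature-map cell (r, c) when the map is split
    into non-overlapping size x size windows."""
    return (r // size, c // size)

# With coprime sizes M=7 and N=5, cells (0, 4) and (0, 5) share a 7x7
# region but land in different 5x5 regions, so alternating the two
# partitions lets feature information flow between regions.
M, N = 7, 5
a, b = (0, 4), (0, 5)
same_under_M = region_id(*a, M) == region_id(*b, M)  # True
same_under_N = region_id(*a, N) == region_id(*b, N)  # False
```

Stacking layers that alternate the two partitions therefore propagates information globally without any single global attention step.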
Further, when the down-sampling rate of each region embedding layer is 2, the numbers of alternating local attention layers in the 4 alternating local attention units are 2, 4, 10 and 1, respectively.
In order to better implement the invention, the result evaluation module further determines the state of the arm according to the relative position of the elbow and wrist key points in the human posture key points, if the state of the arm of the user is consistent with the vision detection identifier, the detection result is determined to be correct, otherwise, the detection result is determined to be wrong.
The result evaluation module judges which of 5 states the arm is in (leftward, rightward, upward, downward, or other) from the relative positions of the elbow key point, the wrist key point and the other key points among the human posture key points; this state represents the user's response to the vision detection identification. If the response is consistent with the vision detection identification, the result is judged correct; otherwise, it is judged wrong.
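A minimal sketch of this five-way decision, using only the elbow and wrist key points. The dominant-axis rule and the pixel dead-band `margin` are assumptions for illustration; the patent does not specify the exact thresholds.

```python
def arm_direction(elbow, wrist, margin=10):
    """Classify the arm state from elbow and wrist keypoints (x, y),
    in image coordinates with y growing downward. The wrist's dominant
    displacement relative to the elbow gives left/right/up/down;
    anything within the assumed dead-band `margin` is 'other'.
    """
    dx = wrist[0] - elbow[0]
    dy = wrist[1] - elbow[1]
    if abs(dx) >= abs(dy) and dx > margin:
        return "right"
    if abs(dx) >= abs(dy) and dx < -margin:
        return "left"
    if abs(dy) > abs(dx) and dy < -margin:
        return "up"
    if abs(dy) > abs(dx) and dy > margin:
        return "down"
    return "other"
```

The returned state is then compared with the displayed vision detection identification (e.g. the opening direction of a tumbling-E optotype) to score the response.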
A vision detection method based on deep learning is carried out by adopting the vision detection system, and comprises the following steps:
step S100: the vision detection identification is displayed through the identification display module, so that a user can observe the vision detection identification and make corresponding actions by using arms;
step S200: acquiring an image of limb actions of an arm of a user through an image acquisition module;
step S300: inputting the image collected in the step S200 into a posture evaluation module and detecting a posture key point of a user; firstly, detecting a coordinate frame of a human body in an image through a target detection submodule, then cutting an image area corresponding to the human body and inputting the image area into a posture detection submodule, detecting key points of the posture of the human body, and obtaining coordinate information of the posture key points;
step S400: and the result evaluation module judges the state of the arm according to the relative position of the key point of the human posture, if the state of the arm of the user is consistent with the vision detection identifier, the detection result is judged to be correct, and if not, the detection result is judged to be wrong.
In order to better implement the present invention, further, in step S300, the alternating local attention unit is used to extract the pose feature information and generate a feature map, and then the feature map is up-sampled by the result output unit, and coordinate information of the pose key point is generated from the feature map.
A computer readable storage medium storing computer program instructions which, when executed by a processor, implement the vision detection method described above.
The invention has the beneficial effects that:
(1) the result evaluation module in the invention detects the arm gesture direction of the user, but not the finger direction. Compared with the direction of the fingers, the gesture of the arm is larger in target and easier to identify, and the detection precision is higher;
(2) the invention judges the direction of the arm, and thus the user's response, from the relative positions of the elbow and wrist key points and the other key points. Compared with the common approach of judging the orientation of a finger, the arm is a larger target, so the judgment accuracy is higher;
(3) the attitude evaluation module is constructed by adopting an alternating local attention mechanism, and has the advantages of high precision, small calculated amount and the like compared with the traditional machine vision method, a convolutional neural network and the like;
(4) compared with the common self-attention unit, the alternating local attention unit adopted by the invention reduces the computational complexity from O((HW)^2) to O(M^2 x HW) or O(N^2 x HW). Typically M and N are much smaller than H and W, so the amount of computation is greatly reduced.
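As a rough arithmetic check of the claimed saving (the feature-map size below is assumed for illustration and chosen to be divisible by both division sizes):

```python
def global_attention_cost(H, W):
    """Pairwise self-attention over all H*W tokens: (HW)^2 interactions."""
    return (H * W) ** 2

def local_attention_cost(H, W, M):
    """Self-attention restricted to MxM windows: (H*W)/M^2 windows,
    each costing (M^2)^2 interactions, i.e. M^2 * H * W in total
    (assumes H and W are divisible by M)."""
    num_windows = (H // M) * (W // M)
    return num_windows * (M * M) ** 2

H, W = 70, 70                       # assumed size, divisible by 7 and 5
g = global_attention_cost(H, W)     # 4900^2 = 24,010,000
l7 = local_attention_cost(H, W, 7)  # 49 * 4900 = 240,100
l5 = local_attention_cost(H, W, 5)  # 25 * 4900 = 122,500
```

On this 70x70 map the 7x7 and 5x5 partitions cut the interaction count by roughly 100x and 196x respectively, matching the O((HW)^2) vs O(M^2 x HW) comparison above.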
Drawings
FIG. 1 is a functional block diagram of the present invention;
FIG. 2 is a process flow diagram of a pose estimation module;
FIG. 3 is a functional block diagram of a gesture detection sub-module;
FIG. 4 is a schematic diagram of an alternate local attention unit;
FIG. 5 is a schematic diagram of a structure of alternating local attention layers.
Detailed Description
Example 1:
a vision detection system based on deep learning is shown in figure 1 and comprises an identification display module, an image acquisition module, a posture evaluation module and a result evaluation module; the identification display module is used for displaying the vision detection identification, and the user utilizes the arm to make corresponding limb actions; the image acquisition module is used for acquiring images of limb actions made by the arms of the user and inputting the images into the posture evaluation module; the gesture evaluation module is used for detecting and acquiring gesture key points of limb actions made by the arms; and the result evaluation module judges the arm state of the user according to the arm posture key point of the user, further judges whether the action of the user is consistent with the vision detection identification, and outputs a detection result.
Further, the result evaluation module judges the state of the arm according to the relative position of the elbow key point and the wrist key point in the human posture key point, if the state of the arm of the user is consistent with the vision detection identifier, the detection result is judged to be correct, and if not, the detection result is judged to be wrong.
The invention displays the identification for vision detection through the display module; the user makes the corresponding limb action with an arm; after the image acquisition module captures an image of the limb action, the posture evaluation module, based on deep learning, detects the user's posture key points; and finally the result evaluation module judges whether the user's action is consistent with the vision detection identification. The result evaluation module makes its judgment from the direction of the user's arm posture rather than from the direction of a finger; since the arm is a larger target than a finger, it is easier to recognize and the detection accuracy is higher.
Example 2:
in this embodiment, optimization is performed on the basis of embodiment 1, and as shown in fig. 2, the posture evaluation module includes a target detection submodule and a posture detection submodule, where the target detection submodule is used to detect a coordinate frame of a human body; the input of the posture detection submodule is an image area corresponding to a human body, key points of the human body posture are detected, and coordinate information of the key points of the posture is output.
Further, as shown in fig. 3, the gesture detection sub-module includes a plurality of alternating local attention units and a result output unit, which are sequentially arranged from front to back, and the alternating local attention units are configured to extract gesture feature information and generate a feature map; and the result output unit is used for up-sampling the feature map to improve the resolution of the feature map and generating final posture key point coordinate information from the feature map.
Further, as shown in fig. 4, the alternating local attention unit includes a region embedding layer and several alternating local attention layers, which are sequentially arranged from front to back, and the region embedding layer is configured to down-sample an input image or feature map to fuse information of all spatial points in a region into a single feature vector; as shown in fig. 5, the alternating local attention layer includes a first region division layer, a first region self-attention layer, a second region division layer, and a second region self-attention layer, which are sequentially arranged from front to back; the first area dividing layer and the second area dividing layer are respectively used for dividing the feature map into a plurality of areas, and the first area self-attention layer and the second area self-attention layer are respectively used for performing self-attention operation in each area.
Further, if the size of the input feature map is HxW, the partition size of the first region partition layer is M, the partition size of the second region partition layer is N, and the first region partition layer and the second region partition layer partition the feature map into (H/M) x (W/M) MxM regions and (H/N) x (W/N) NxN regions, respectively; the partition sizes M and N are relatively prime integers. H, W is a conventional expression of the size of the feature map, and therefore, the description thereof is omitted.
Further, the division sizes of the first area division layer and the second area division layer are 7 and 5, respectively.
The target detection submodule detects the coordinate frame (bounding box) of the human body and can be implemented with detectors such as YOLO. The image area corresponding to the human body is then cropped and used as the input of the posture detection submodule, which detects 17 key points of the human body posture. The result output unit up-samples the feature map using deconvolution. The output feature map has 17 channels, corresponding to the 17 key points of the human body posture, and the specific coordinates of each key point are given by the location of the maximum value on its feature map. The posture evaluation module is constructed with an alternating local attention mechanism, which offers higher accuracy and a smaller amount of computation than traditional machine vision methods or convolutional neural networks.
Other parts of this embodiment are the same as embodiment 1, and thus are not described again.
Example 3:
a vision detection system based on deep learning is shown in figure 1 and comprises an identification display module, an image acquisition module, a posture evaluation module and a result evaluation module; the identification display module is used for displaying the vision detection identification, and the user utilizes the arm to make corresponding limb actions; the image acquisition module is used for acquiring images of limb actions made by the arms of the user and inputting the images into the posture evaluation module; the gesture evaluation module is used for detecting and acquiring gesture key points of limb actions made by the arms; and the result evaluation module judges the arm state of the user according to the arm posture key point of the user, further judges whether the action of the user is consistent with the vision detection identification, and outputs a detection result.
Further, as shown in fig. 2, the posture evaluation module includes a target detection submodule and a posture detection submodule, and the target detection submodule is used for detecting a coordinate frame of a human body; the input of the posture detection submodule is an image area corresponding to a human body, key points of the human body posture are detected, and coordinate information of the key points of the posture is output.
Further, as shown in fig. 3, the pose estimation module is constructed by using an alternating local attention mechanism, and includes several alternating local attention units, configured to extract pose feature information and generate a feature map, so as to obtain coordinate information of a final pose key point.
As shown in fig. 4, the alternating local attention unit includes a region embedding layer and several alternating local attention layers, which are sequentially arranged from front to back, and the region embedding layer is used to down-sample an input image or feature map to fuse information of all spatial points in a region into a single feature vector; as shown in fig. 5, the alternating local attention layer includes a first region division layer, a first region self-attention layer, a second region division layer, and a second region self-attention layer, which are sequentially arranged from front to back; the first area dividing layer and the second area dividing layer are respectively used for dividing the feature map into a plurality of areas, and the first area self-attention layer and the second area self-attention layer are respectively used for performing self-attention operation in each area.
Further, if the size of the input feature map is HxW, the partition size of the first region partition layer is M, the partition size of the second region partition layer is N, and the first region partition layer and the second region partition layer partition the feature map into (H/M) x (W/M) MxM regions and (H/N) x (W/N) NxN regions, respectively; the partition sizes M and N are relatively prime integers.
Further, the division sizes of the first area division layer and the second area division layer are 7 and 5, respectively.
The invention displays the identification for vision detection through the display module; the user makes the corresponding limb action with an arm; after the image acquisition module captures an image of the limb action, the posture evaluation module, based on deep learning, detects the user's posture key points; and finally the result evaluation module judges whether the user's action is consistent with the vision detection identification. The result evaluation module makes its judgment from the direction of the user's arm posture rather than from the direction of a finger; since the arm is a larger target than a finger, it is easier to recognize and the detection accuracy is higher.
Example 4:
a vision testing method based on deep learning is carried out by adopting the vision testing system as shown in figure 1, and comprises the following steps:
step S100: the vision detection identification is displayed through the identification display module, so that a user can observe the vision detection identification and make corresponding actions by using arms;
step S200: acquiring an image of limb actions of an arm of a user through an image acquisition module;
step S300: inputting the image collected in the step S200 into a posture evaluation module and detecting a posture key point of a user; as shown in fig. 2, firstly, a coordinate frame of a human body in an image is detected by a target detection submodule, then, an image area corresponding to the human body is cut and input into a posture detection submodule, key points of the posture of the human body are detected, and coordinate information of the posture key points is obtained;
step S400: and the result evaluation module judges the state of the arm according to the relative position of the key point of the human posture, if the state of the arm of the user is consistent with the vision detection identifier, the detection result is judged to be correct, and if not, the detection result is judged to be wrong.
Further, in step S300, the alternating local attention units are used to extract the posture feature information and generate a feature map; the result output unit then up-samples the feature map and generates the coordinate information of the posture key points from it.
The invention displays the identification for vision detection through the display module; the user makes the corresponding limb action with an arm; after the image acquisition module captures an image of the limb action, the posture evaluation module, based on deep learning, detects the user's posture key points; and finally the result evaluation module judges whether the user's action is consistent with the vision detection identification. The result evaluation module makes its judgment from the direction of the user's arm posture rather than from the direction of a finger; since the arm is a larger target than a finger, it is easier to recognize and the detection accuracy is higher.
Example 5:
A vision detection method based on deep learning is realized with the vision detection system. As shown in figure 1, the display module first displays the identification for vision detection, and the user makes the corresponding limb action with an arm; after the image acquisition module captures the limb action, the posture evaluation module, based on deep learning, detects the user's posture key points; finally, the result evaluation module judges whether the user's action is consistent with the vision detection identification.
Further, as shown in fig. 2, the pose estimation module is composed of a target detection submodule and a pose detection submodule constructed by using an alternating local attention method. The target detection submodule is used for detecting a coordinate frame of a human body and can be realized by detectors such as yolo and the like. And then, cutting an image area corresponding to the human body, and using the cut image area as the input of a posture detection submodule to detect 17 key points of the human body posture. In this embodiment, the object detection submodule detects a human image in an image by using a yolov5 object detector. The corresponding region is then cropped, scaled to 224x224 size, as input to the pose detection sub-module.
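Because the crop is resized to 224x224 before pose detection, keypoints predicted on the crop must be mapped back to the original image. A hypothetical helper for this coordinate mapping (it assumes the box is resized directly to the network input, omitting any aspect-ratio padding the real pipeline might use):

```python
def crop_to_image(box, keypoint_in_crop, input_size=224):
    """Map a keypoint predicted on the input_size x input_size crop
    back to original-image coordinates. `box` is the detector's
    (x1, y1, x2, y2) person box; the crop is assumed to be scaled
    directly to input_size x input_size (a simplification).
    """
    x1, y1, x2, y2 = box
    sx = (x2 - x1) / input_size   # horizontal scale crop -> image
    sy = (y2 - y1) / input_size   # vertical scale crop -> image
    kx, ky = keypoint_in_crop
    return (x1 + kx * sx, y1 + ky * sy)
```

With this mapping, elbow and wrist positions from the 224x224 crop can be compared in a common coordinate frame regardless of where the person stands in the camera view.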
Further, as shown in fig. 3, the posture detection submodule is constructed mainly with the alternating local attention method. Its structure consists of several repeated alternating local attention units followed by a result output unit. The alternating local attention units extract posture feature information and generate a feature map. The result output unit up-samples the feature map to raise its resolution and generates the final posture key point coordinate information from it. In this embodiment, 4 alternating local attention units are used.
Further, as shown in fig. 4, each alternating local attention unit is composed of a region embedding layer and several repeated alternating local attention layers. The region embedding layer down-samples the input image or feature map and fuses the information of all spatial points in a region into a single feature vector. A region embedding layer with down-sampling rate k can be implemented by a convolution with kernel size k and stride k.
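The equivalence between the region embedding layer and a stride-k convolution can be illustrated with a NumPy sketch: every k x k spatial region is flattened into one vector and linearly projected, which is what a convolution with kernel size k and stride k computes at each output position. The random projection weights and the channel sizes here are purely for demonstration, not taken from the patent.

```python
import numpy as np

def region_embedding(x, k, out_dim, weight=None):
    """Down-sample an (H, W, C) feature map by rate k: fuse each
    k x k region into a single vector, then project to out_dim.
    H and W are assumed divisible by k."""
    H, W, C = x.shape
    # group pixels into (H/k, W/k) regions of k*k*C values each
    regions = (x.reshape(H // k, k, W // k, k, C)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(H // k, W // k, k * k * C))
    if weight is None:  # hypothetical random projection for the demo
        rng = np.random.default_rng(0)
        weight = rng.standard_normal((k * k * C, out_dim))
    return regions @ weight

x = np.ones((56, 56, 32))
y = region_embedding(x, k=2, out_dim=64)
print(y.shape)  # (28, 28, 64): spatial size halved, as with stride 2
```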
As shown in fig. 5, the alternating local attention layer is composed of a first region division layer, a first region self-attention layer, a second region division layer and a second region self-attention layer. If the size of the input feature map is HxW, the division size of the first region division layer is M and the division size of the second region division layer is N; the first and second region division layers divide the feature map into (H/M)x(W/M) regions of size MxM and (H/N)x(W/N) regions of size NxN respectively. The division sizes M and N are relatively prime integers, so that feature information can be exchanged between regions and the alternating local attention layer can acquire global feature information.
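A minimal sketch of the two region divisions, assuming H and W are divisible by both division sizes. With the coprime sizes 7 and 5 of the embodiment, the window borders of the two partitions never coincide inside the map, so alternating the two self-attention layers spreads information across region boundaries.

```python
import numpy as np

def partition(x, m):
    """Split an (H, W, C) feature map into non-overlapping m x m
    windows; self-attention then runs independently inside each
    window. H and W are assumed divisible by m."""
    H, W, C = x.shape
    return (x.reshape(H // m, m, W // m, m, C)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, m * m, C))

x = np.zeros((35, 35, 8))   # 35 is divisible by both 7 and 5
first = partition(x, 7)     # (25, 49, 8): 5x5 grid of 7x7 windows
second = partition(x, 5)    # (49, 25, 8): 7x7 grid of 5x5 windows
print(first.shape, second.shape)
```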
Further, the division size of the first region is 7 and the division size of the second region is 5. The computational complexity of a region self-attention unit is proportional to

(HxW)x(MxM)

or

(HxW)x(NxN),

while the computational complexity of an ordinary self-attention unit is proportional to

(HxW)x(HxW).

Typically M and N are much smaller than H and W, so the computational complexity is greatly reduced.
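The reduction can be checked numerically for a hypothetical 56x56 feature map by counting attended token pairs: region attention with M=7 costs a factor (HxW)/M^2 = 64 less than global self-attention. The feature-map size is an assumption for illustration only.

```python
# Token-pair counts behind the complexity claim, for a
# hypothetical 56 x 56 feature map with window sizes M=7, N=5.
H, W, M, N = 56, 56, 7, 5
full = (H * W) ** 2          # global self-attention: (HW)^2 pairs
windowed_M = H * W * M * M   # region attention: HW * M^2 pairs
windowed_N = H * W * N * N   # region attention: HW * N^2 pairs
print(full, windowed_M, windowed_N)
print(full // windowed_M)    # reduction factor = (H*W) / M^2 = 64
```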
Further, the down-sampling rate of the region embedding layer is 2. In the 4 alternating local attention units, the number of repetitions of the alternating local attention layer is 2, 4, 10 and 1 respectively.
Further, the result output unit up-samples the feature map with deconvolution. The output feature map has 17 channels, corresponding to the 17 key points of the human body posture; the coordinate of each key point is given by the position of the maximum value on its feature map channel.
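Decoding each key point as the per-channel maximum of the 17-channel output feature map can be sketched as follows (NumPy, illustrative only; the deconvolution up-sampling is not reproduced):

```python
import numpy as np

def decode_keypoints(heatmaps):
    """heatmaps: (17, H, W) output feature map, one channel per
    body key point. Each key point's coordinate is the location
    of the maximum value of its channel."""
    coords = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        coords.append((int(x), int(y)))
    return coords

hm = np.zeros((17, 56, 56))
hm[0, 10, 20] = 1.0            # peak of key point 0 at (x=20, y=10)
print(decode_keypoints(hm)[0])  # (20, 10)
```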
The result evaluation module judges whether the arm is in one of 5 states, namely leftward, rightward, upward, downward or other, by judging the relative positions of the elbow key point, the wrist key point and the other key points among the human posture key points; the arm state represents the user's response to the vision detection identifier. If the response is consistent with the vision detection identifier, the result is judged to be correct; if not, it is judged to be wrong.
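A hedged sketch of such a rule: classify the elbow-to-wrist displacement into the five states. The margin threshold and the exact decision rule are hypothetical choices for illustration; the patent specifies only that relative key point positions are used.

```python
def arm_direction(elbow, wrist, margin=10):
    """Classify the arm state from elbow and wrist key points,
    given as (x, y) image coordinates with y growing downward.
    The dominant displacement axis gives left/right/up/down;
    a displacement smaller than `margin` (hypothetical) in both
    axes is classed as 'other'."""
    dx = wrist[0] - elbow[0]
    dy = wrist[1] - elbow[1]
    if abs(dx) < margin and abs(dy) < margin:
        return "other"
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"

print(arm_direction((100, 100), (160, 105)))  # right
print(arm_direction((100, 100), (98, 40)))    # up
print(arm_direction((100, 100), (103, 104)))  # other
```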
The result evaluation module of the invention detects the direction of the user's arm posture rather than the direction of the fingers. Compared with the fingers, the arm posture is a larger target and easier to identify, so the detection precision is higher.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.

Claims (10)

1. A vision detection system based on deep learning, characterized by comprising an identification display module, an image acquisition module, a posture evaluation module and a result evaluation module; the identification display module is used for displaying the vision detection identification, to which the user responds by making a corresponding limb action with an arm; the image acquisition module is used for acquiring images of the limb actions made by the user's arm and inputting the images into the posture evaluation module; the posture evaluation module is used for detecting and acquiring the posture key points of the limb actions made by the arm; and the result evaluation module judges the arm state of the user according to the user's arm posture key points, further judges whether the user's action is consistent with the vision detection identification, and outputs a detection result.
2. The deep learning based vision detection system of claim 1, wherein the gesture evaluation module comprises a target detection submodule and a gesture detection submodule, and the target detection submodule is used for detecting a coordinate frame of a human body; the input of the posture detection submodule is an image area corresponding to a human body, key points of the human body posture are detected, and coordinate information of the key points of the posture is output.
3. The vision detection system based on deep learning of claim 2, wherein the posture detection submodule includes a plurality of alternating local attention units and a result output unit, arranged in sequence from front to back; the alternating local attention units are used for extracting posture feature information and generating a feature map; and the result output unit is used for up-sampling the feature map to improve its resolution and generating the final posture key point coordinate information from the feature map.
4. The vision detection system based on deep learning of claim 3, wherein the alternating local attention unit comprises a region embedding layer and a plurality of alternating local attention layers arranged in sequence from front to back; the region embedding layer is used for down-sampling the input image or feature map to fuse the information of all spatial points in a region into a single feature vector; the alternating local attention layer comprises a first region division layer, a first region self-attention layer, a second region division layer and a second region self-attention layer arranged in sequence from front to back; the first and second region division layers are respectively used for dividing the feature map into a plurality of regions, and the first and second region self-attention layers are respectively used for performing the self-attention operation within each region.
5. The vision detection system based on deep learning of claim 4, wherein if the size of the input feature map is HxW, the division size of the first region division layer is M, the division size of the second region division layer is N, and the first region division layer and the second region division layer divide the feature map into (H/M) x (W/M) MxM regions, (H/N) x (W/N) NxN regions, respectively; the partition sizes M and N are relatively prime integers.
6. The vision testing system based on deep learning of claim 5, wherein the first region division layer and the second region division layer have division sizes of 7 and 5 respectively.
7. The vision detection system based on deep learning of claim 1, wherein the result evaluation module determines the state of the arm according to the relative positions of the elbow and wrist key points among the human posture key points; if the arm state of the user is consistent with the vision detection identification, the detection result is determined to be correct, and otherwise it is determined to be wrong.
8. A vision detection method based on deep learning, performed with the vision detection system of any one of claims 1-7, characterized by comprising the following steps:
step S100: the vision detection identification is displayed through the identification display module, so that a user can observe the vision detection identification and make corresponding actions by using arms;
step S200: acquiring an image of limb actions of an arm of a user through an image acquisition module;
step S300: inputting the image collected in the step S200 into a posture evaluation module and detecting a posture key point of a user; firstly, detecting a coordinate frame of a human body in an image through a target detection submodule, then cutting an image area corresponding to the human body and inputting the image area into a posture detection submodule, detecting key points of the posture of the human body, and obtaining coordinate information of the posture key points;
step S400: and the result evaluation module judges the state of the arm according to the relative position of the key point of the human posture, if the state of the arm of the user is consistent with the vision detection identifier, the detection result is judged to be correct, and if not, the detection result is judged to be wrong.
9. The vision detection method based on deep learning of claim 8, wherein in step S300, the alternating local attention units are used to extract the posture feature information and generate the feature map, after which the result output unit up-samples the feature map and generates the posture key point coordinate information from it.
10. A computer-readable storage medium storing computer program instructions, characterized in that the program instructions, when executed by a processor, implement the method of claim 8 or 9.
CN202110652556.9A 2021-06-11 2021-06-11 Vision detection system and method based on deep learning and storage medium Active CN113243886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110652556.9A CN113243886B (en) 2021-06-11 2021-06-11 Vision detection system and method based on deep learning and storage medium


Publications (2)

Publication Number Publication Date
CN113243886A true CN113243886A (en) 2021-08-13
CN113243886B CN113243886B (en) 2021-11-09

Family

ID=77187718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110652556.9A Active CN113243886B (en) 2021-06-11 2021-06-11 Vision detection system and method based on deep learning and storage medium

Country Status (1)

Country Link
CN (1) CN113243886B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114013431A (en) * 2022-01-06 2022-02-08 宁波均联智行科技股份有限公司 Automatic parking control method and system based on user intention
CN114305317A (en) * 2021-12-23 2022-04-12 广州视域光学科技股份有限公司 Method and system for intelligently distinguishing user feedback optotypes

Citations (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5619291A (en) * 1995-09-01 1997-04-08 Putnam; Mark D. Patient-user interactive psychotherapy apparatus and method
CN103598870A (en) * 2013-11-08 2014-02-26 北京工业大学 Optometry method based on depth-image gesture recognition
CN203524640U (en) * 2013-11-09 2014-04-09 宋秋杰 Eyesight automatic detection instrument for ophthalmology
CN106778597A (en) * 2016-12-12 2017-05-31 朱明 Intellectual vision measurer based on graphical analysis
US20180103838A1 (en) * 2015-01-20 2018-04-19 Green C.Tech Ltd Method and system for automatic eyesight diagnosis
US10089556B1 (en) * 2017-06-12 2018-10-02 Konica Minolta Laboratory U.S.A., Inc. Self-attention deep neural network for action recognition in surveillance videos
CN109145867A (en) * 2018-09-07 2019-01-04 北京旷视科技有限公司 Estimation method of human posture, device, system, electronic equipment, storage medium
CN109785396A (en) * 2019-01-23 2019-05-21 中国科学院自动化研究所 Writing posture monitoring method based on binocular camera, system, device
CN110135243A (en) * 2019-04-02 2019-08-16 上海交通大学 A kind of pedestrian detection method and system based on two-stage attention mechanism
CN110353622A (en) * 2018-10-16 2019-10-22 武汉交通职业学院 A kind of vision testing method and eyesight testing apparatus
CN110825845A (en) * 2019-10-23 2020-02-21 中南大学 Hierarchical text classification method based on character and self-attention mechanism and Chinese text classification method
CN110858295A (en) * 2018-08-24 2020-03-03 广州汽车集团股份有限公司 Traffic police gesture recognition method and device, vehicle control unit and storage medium
CN210372949U (en) * 2019-08-19 2020-04-21 中国人民解放军总医院 Flashlight capable of measuring pupil diameter
CN111091604A (en) * 2019-11-18 2020-05-01 中国科学院深圳先进技术研究院 Training method and device of rapid imaging model and server
CN111178251A (en) * 2019-12-27 2020-05-19 汇纳科技股份有限公司 Pedestrian attribute identification method and system, storage medium and terminal
US20200234078A1 (en) * 2018-06-15 2020-07-23 Shenzhen Sensetime Technology Co., Ltd. Target matching method and apparatus, electronic device, and storage medium
CN112017198A (en) * 2020-10-16 2020-12-01 湖南师范大学 Right ventricle segmentation method and device based on self-attention mechanism multi-scale features
CN112149466A (en) * 2019-06-28 2020-12-29 富士通株式会社 Arm action recognition method and device and image processing equipment
US20210004087A1 (en) * 2018-02-19 2021-01-07 Valkyrie Industries Limited Haptic Feedback for Virtual Reality
CN112270283A (en) * 2020-11-04 2021-01-26 北京百度网讯科技有限公司 Abnormal driving behavior determination method, device, equipment, vehicle and medium
CN112418227A (en) * 2020-10-28 2021-02-26 北京工业大学 Monitoring video truck segmentation method based on double-self-attention mechanism
CN112686234A (en) * 2021-03-22 2021-04-20 杭州魔点科技有限公司 Face image quality evaluation method, electronic device and storage medium
CN112801069A (en) * 2021-04-14 2021-05-14 四川翼飞视科技有限公司 Face key feature point detection device, method and storage medium
CN112883149A (en) * 2021-01-20 2021-06-01 华为技术有限公司 Natural language processing method and device


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SALZER, YAEL et al., "Evaluation of the attention network test using vibrotactile stimulations", Behavior Research Methods *
ZHU Zhangli et al., "Research progress of attention mechanism in deep learning", Journal of Chinese Information Processing *
GAO Yang, "Intelligent Summarization and Deep Learning (Advanced Artificial Intelligence and Robotics Series)", 30 April 2019, Beijing Institute of Technology Press *


Also Published As

Publication number Publication date
CN113243886B (en) 2021-11-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant