CN112749646A - Interactive point-reading system based on gesture recognition - Google Patents

Interactive point-reading system based on gesture recognition

Info

Publication number
CN112749646A
Authority
CN
China
Prior art keywords
image
module
gesture
gesture recognition
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011620981.1A
Other languages
Chinese (zh)
Inventor
黄坚
李慧敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202011620981.1A
Publication of CN112749646A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107 Static hand or arm
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an interactive point-reading system based on gesture recognition. The system takes images captured in real time by a camera as input and preprocesses them. A gesture recognition network then performs gesture recognition on target images containing a hand: it first classifies each pixel in the image to obtain a hand segmentation map, then classifies the gesture from the segmented hand region. The preprocessed image is also passed to an image recognition module, which detects bounding boxes of the objects to be recognized; these boxes are matched against the gesture recognition result, and translation and speech read-aloud are performed on the matched result, realizing the point-reading function. By implementing gesture recognition with a neural network, the system realizes an interactive point-reading function free of the traditional algorithms' dependence on dedicated reading tools and of the constraints existing interactive systems place on the recognized object.

Description

Interactive point-reading system based on gesture recognition
Technical Field
The invention relates to gesture recognition, fingertip detection, object detection, text detection and recognition, and hand segmentation; it belongs to the field of artificial intelligence and particularly relates to an interactive point-reading system based on gesture recognition.
Background
With the continuous development of science and technology, intelligent education products such as point-reading pens and point-reading machines have become widespread. Most current interactive point-reading systems rely on dedicated devices (e.g., a point-reading machine or pen) that directly extract the required text and process it to realize the reading function. Other systems require a special sensing material worn on the fingertip so the fingertip can be recognized. Some methods use artificial intelligence to iteratively detect the fingertip position and then recognize the text: for example, the patent with application number 201910837914.6 builds a finger-feature recognition neural network to locate the fingertip, but recognizes only the fingertip position and the characters in the region in front of it. Chinese patent CN109325464A discloses a character recognition method based on artificial intelligence that realizes finger point-reading with a pure deep-learning pipeline, completing text recognition and word lookup in under 300 ms and greatly improving point-reading efficiency. Still other methods detect fingertips with the aid of hand keypoints. These methods have the following limitations to different degrees:
first, dependence on dedicated reading devices;
secondly, constraints on the recognized object: some interactive point-reading systems can recognize only text and cannot recognize non-text objects;
thirdly, a degree of ambiguity in character recognition: given the fingertip position, how should the text being pointed at be extracted, and how should associations among the texts be determined? Does the fingertip point at a single character, a word, or a passage? How should characters of different sizes be handled?
fourthly, the interaction semantics are not rich enough: existing systems realize only word point-reading, cannot achieve intelligent human-computer interaction, and cannot meet people's demand for intelligence.
Disclosure of Invention
The invention solves the following problems: it overcomes or mitigates, to different degrees, the limitations of the prior art described in the background above, and provides an interactive point-reading system based on gesture recognition that realizes an interactive point-reading function to meet the demand for intelligence.
The system takes images captured in real time by a camera as input, preprocesses them, and performs gesture recognition with a gesture recognition network; the bounding boxes of the objects detected by the image recognition module are matched against the gesture recognition result, and translation and speech read-aloud are performed on the matched result, realizing the point-reading function.
The technical problem to be solved is as follows: to overcome the prior art's dependence on dedicated devices for recognizing gestures and fingertips, realize intelligent interaction, and meet users' demand for intelligence.
The technical scheme is as follows: an interactive point-reading system based on gesture recognition, characterized in that it comprises the following modules: a camera connected to an image preprocessing module; the image preprocessing module connected to both a gesture recognition module and an image recognition module; the gesture recognition module and the image recognition module each connected to an integration module; the integration module connected to a translation module; and the translation module connected to a voice module.
The camera is used for acquiring images in real time;
the image preprocessing module is used for preprocessing the image;
the gesture recognition module recognizes the preprocessed image: it receives the output image of the image preprocessing module, segments it with a gesture recognition network that classifies each pixel and determines its category, thereby obtaining a hand segmentation map; it then classifies the gesture from the geometric shape of the hand region in that map, realizing gesture recognition, and triggers different subsequent processing for different gesture recognition results;
the image recognition module detects and localizes the objects to be recognized: it receives the image output by the preprocessing module, detects the objects with an image recognition algorithm, and passes the bounding box and label information of each object to the integration module;
the integration module receives the outputs of the gesture recognition module and the image recognition module and matches the gesture recognition result against the image recognition bounding boxes (taking the single-fingertip point-reading gesture as an example, the fingertip coordinates are approximated from the geometric shape of the gesture); it completes the two-way matching between the gesture-located fingertip and the image-recognition bounding boxes, outputs the label information of the matched object, and passes it to the translation module for subsequent processing, realizing the point-reading function;
the translation module: translates the label information returned by the integration module into different languages to meet different needs;
the voice module: reads aloud the result of the translation module.
The image preprocessing module preprocesses the image captured by the camera, including image denoising and image scaling; experiments show that denoising improves the classification accuracy of gesture recognition, while scaling the image to different sizes changes the training and recognition time of the gesture recognition network;
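For concreteness, a minimal sketch of this preprocessing stage in Python with OpenCV; the choice of Gaussian blur as the denoising operation and the helper name preprocess are assumptions, since the text specifies only denoising, boundary smoothing, and scaling (800x800 is the resolution used in the experiments described below):

```python
import cv2

def preprocess(frame, size=(800, 800)):
    """Denoise a camera frame and scale it to the network input size."""
    # Gaussian blur suppresses sensor noise and smooths object boundaries;
    # the exact OpenCV denoising call is an assumption, as the text only
    # says the image is denoised and smoothed with the OpenCV library.
    denoised = cv2.GaussianBlur(frame, (5, 5), 0)
    # The experiments described below scale images to 800x800 resolution.
    return cv2.resize(denoised, size, interpolation=cv2.INTER_AREA)

cap = cv2.VideoCapture(0)          # camera acquiring images in real time
ok, frame = cap.read()
if ok:
    net_input = preprocess(frame)  # fed to both recognition modules
```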
the gesture recognition module recognizes gestures to drive the subsequent processing: it receives the output image of the image preprocessing module, segments it with the gesture recognition network by classifying each pixel and determining its category, thereby obtaining a hand segmentation map; it then classifies the gesture from that map, realizing gesture recognition, and performs different subsequent processing for different gestures (several gestures and their processing flows are enumerated in the steps below). The module works as follows:
(1) the image output by the preprocessing module is fed into the gesture recognition network, which classifies each pixel and determines its category, yielding the hand segmentation map of the image;
(2) the hand region segmented in step (1) is classified to recognize the different gestures;
(3) subsequent processing is performed according to the gesture recognized in step (2); several different gestures are described here. If the gesture recognition result is:
1) a single-fingertip point-reading gesture, the centroid of the gesture contour is approximated from the geometric shape of the gesture, and the fingertip coordinates are computed from the centroid coordinates and the geometric characteristics of the single-fingertip point-reading gesture;
2) a camera pause/capture control gesture, the pausing/capturing of the camera is controlled according to the recognition result;
3) a multi-fingertip matching gesture, the geometric shape of the gesture is matched against the result of the image recognition module described above to drive the subsequent processing;
4) another gesture, different processing is performed according to the gesture result;
the image recognition module processes the output of the image preprocessing module and passes the detected bounding boxes and label information to the downstream integration module; connecting the image preprocessing module directly to the image recognition module avoids the possible interference of gestures with the image recognition process;
the integration module receives the gesture recognition result output by the gesture recognition module and the object label information and bounding boxes returned by the image recognition module, and matches the gesture recognition result against the bounding boxes (the single-fingertip point-reading gesture is used here as an example of the matching process); if the fingertip coordinates match a detected object's bounding box, the object's label information is output to the translation module; if no bounding box matches, explanatory information is prompted instead.
The translation module and the voice module call open-source libraries;
the invention provides an interactive click-to-read system based on gesture recognition, which takes an image acquired by a camera in real time as input, preprocesses the input image, and performs gesture recognition on a target image containing a hand by using a gesture recognition network; and transmitting the preprocessed image into an image recognition module, detecting a boundary frame of the object to be recognized, matching the boundary frame with the preprocessed image according to a gesture recognition result, and performing translation and voice reading processing according to a matching result, thereby realizing a point reading function. The system realizes gesture recognition through a gesture recognition network, further realizes an interactive point reading function, and gets rid of dependence of a traditional point reading algorithm on a point reading tool and constraint limitation of an existing interactive point reading system on a recognition object.
Compared with the prior art, the invention has the following advantages:
firstly, it removes the dependence on dedicated point-reading devices;
secondly, it removes the constraints on the recognized object, realizing point-reading of both text and non-text objects;
thirdly, it expands the semantics of interactive point-reading, realizing intelligent human-computer interaction through different gestures and meeting, to a certain extent, people's demand for intelligence;
fourthly, it is highly extensible: the system can also be applied to related fields such as general text recognition, children's picture recognition, and picture-book recognition, as well as to gesture-based fingertip detection and similar tasks.
Drawings
FIG. 1 is a schematic diagram of the overall flow of the system (note: after image preprocessing with the OpenCV library, the preprocessed image is passed to both the gesture recognition network and the image recognition module; the arrows indicate these multiple uses of the preprocessed image);
FIG. 2 is a schematic flow diagram of a gesture recognition network;
FIG. 3 is a network architecture diagram of a gesture recognition network;
FIG. 4 is a schematic flow diagram of an image recognition network;
FIG. 5 is a schematic diagram of the system of the present invention;
fig. 6 shows the original image (a) after OpenCV processing and the corresponding gesture segmentation map (b).
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the intelligent interactive point-reading system based on gesture recognition operates in the following steps:
(1) Images are captured with the camera.
(2) The captured image is compressed and denoised using the OpenCV algorithm library, specifically:
Step 1: the image is denoised and object boundaries are smoothed;
Step 2: the image is scaled to a fixed size; images of different sizes fed into the gesture recognition network and the image recognition network differ in computation time and recognition accuracy (in the experiments the images are scaled to 800x800 resolution).
(3) Each pixel in the image is classified by a pre-trained gesture recognition network, which determines the category of each pixel and thereby yields a hand segmentation map; the gesture is classified from this map, realizing gesture recognition, and subsequent processing follows the recognition result. Several different gestures are listed here for explanation. If the gesture recognition result is a camera pause/capture control gesture, the pausing/capturing of the camera is controlled accordingly; if the result is a non-control gesture:
1) for a single-fingertip point-reading gesture, the centroid of the gesture contour is approximated from the geometric shape of the gesture, and the fingertip coordinates are derived from the centroid coordinates and the geometric characteristics of the gesture. Specifically, the contour point farthest from the contour centroid is found and taken as the fingertip, using the Euclidean distance
d_i = √((x_i − x_c)² + (y_i − y_c)²),
where (x_c, y_c) is the contour centroid and (x_i, y_i) is a point on the contour; the fingertip is the contour point that maximizes d_i (see the sketch after this list). The computed fingertip coordinates are passed to the integration module;
2) for a camera pause/capture control gesture, the pausing/capturing of the camera is controlled according to the recognition result;
3) for a multi-fingertip matching gesture, the geometric shape of the gesture is matched against the result of the image recognition module in the steps below to drive the subsequent processing;
4) other gestures are processed differently according to the gesture recognition result.
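A minimal sketch of the fingertip computation in item 1), assuming OpenCV and NumPy and a binary hand mask produced by the segmentation network; the helper name fingertip_from_mask is illustrative:

```python
import cv2
import numpy as np

def fingertip_from_mask(hand_mask):
    """Approximate the fingertip as the contour point farthest from the
    contour centroid, per the Euclidean-distance formula above.
    `hand_mask` is assumed to be an 8-bit binary hand segmentation map."""
    contours, _ = cv2.findContours(hand_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    if not contours:
        return None
    contour = max(contours, key=cv2.contourArea)       # largest blob = hand
    m = cv2.moments(contour)
    if m["m00"] == 0:
        return None
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]  # contour centroid
    pts = contour.reshape(-1, 2)
    d = np.hypot(pts[:, 0] - cx, pts[:, 1] - cy)       # d_i for each point
    return tuple(pts[np.argmax(d)])                    # farthest point
```

The farthest-point heuristic mirrors the formula above; it assumes the pointing finger is extended away from the palm.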
(4) The image is processed by a pre-trained image detection network, which detects and localizes the objects, returns the bounding box and label information of each detected object, and passes them to the integration module.
(5) The gesture recognition result is matched against the image recognition result: if the fingertip coordinates match a detected object's bounding box, the object's label information is output to the translation module; if no bounding box matches, explanatory information is prompted instead.
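A minimal sketch of this matching step; the (label, (x1, y1, x2, y2)) detection format and the smallest-containing-box tie-break are assumptions, since the text fixes neither:

```python
def match_fingertip(fingertip, detections):
    """Match the fingertip against detected boxes (step (5)).
    `detections` is assumed to be a list of (label, (x1, y1, x2, y2))
    pairs returned by the image recognition module."""
    if fingertip is None:
        return None
    fx, fy = fingertip
    hits = [(label, box) for label, box in detections
            if box[0] <= fx <= box[2] and box[1] <= fy <= box[3]]
    if not hits:
        return None  # no match: the caller prompts explanatory information
    # If several boxes contain the fingertip, prefer the smallest (most
    # specific) one; this tie-break is a design assumption, not stated
    # in the text.
    return min(hits, key=lambda h: (h[1][2] - h[1][0]) * (h[1][3] - h[1][1]))
```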
(6) The label text matched in step (5) is translated and read aloud.
Referring to fig. 2, the gesture recognition neural network pipeline comprises the following steps:
Step 1: training data is acquired by photographing; after image preprocessing, each original image is scaled to 800x800 resolution, and a gesture segmentation map is obtained by manual annotation; the scaled image and its segmentation map together form one input pair;
Step 2: data enhancement is applied to the input images of Step 1 to expand the network's training samples; part of the augmentation used is introduced here:
1) Flipping
The input image is flipped horizontally and vertically using the OpenCV library;
2) Translation
The input image is translated using the OpenCV library.
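A minimal sketch of this augmentation, assuming OpenCV; applying identical transforms to the image and its segmentation map follows from the paired inputs of Step 1, while the 10%-of-width translation offset is an arbitrary placeholder:

```python
import cv2
import numpy as np

def augment(image, mask):
    """Expand one (image, segmentation map) training pair by flipping
    and translating both in the same way."""
    samples = [(image, mask)]
    for flip_code in (0, 1):                 # 0: vertical, 1: horizontal
        samples.append((cv2.flip(image, flip_code),
                        cv2.flip(mask, flip_code)))
    h, w = image.shape[:2]
    shift = np.float32([[1, 0, 0.1 * w],     # shift right by 10% of width
                        [0, 1, 0]])
    samples.append((cv2.warpAffine(image, shift, (w, h)),
                    cv2.warpAffine(mask, shift, (w, h))))
    return samples
```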
Step 3: the gesture recognition network is constructed and the augmented images from Step 2 are fed in for training; a schematic of the network structure is shown in FIG. 3.
Step 4: the network is trained to recognize gestures: a loss function between the predicted segmentation and the labeled segmentation map is computed, its gradient is back-propagated to update the network parameters, and the weights of the deep convolutional neural network are trained on the samples from Step 2 until the network stabilizes, at which point the trained parameters are obtained.
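A minimal PyTorch sketch of this training step. The small convolutional stack below is only a stand-in for the network of FIG. 3, whose architecture the text does not reproduce; it illustrates per-pixel two-class (hand/background) classification trained with a cross-entropy loss and back-propagation:

```python
import torch
import torch.nn as nn

# Stand-in for the network of FIG. 3: maps an RGB image to per-pixel
# scores for 2 classes (hand vs. background).
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 2, 1),
)
criterion = nn.CrossEntropyLoss()       # per-pixel classification loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, masks):
    """images: (N, 3, 800, 800) float tensor; masks: (N, 800, 800) labels."""
    logits = model(images)              # (N, 2, 800, 800) class scores
    loss = criterion(logits, masks)     # loss vs. the labeled segmentation
    optimizer.zero_grad()
    loss.backward()                     # back-propagate the gradient
    optimizer.step()                    # update the network parameters
    return loss.item()
```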
Further, the image recognition network pipeline is similar and comprises the following steps:
1) Training of the non-text object network, see fig. 4:
Step 1: training images are collected (including part of the COCO dataset) and the samples are manually annotated. Briefly, the COCO dataset contains categories including person, bicycle, car, motorbike, aeroplane, bus, cat, dog, kite, etc., as detailed on the official website (http://cocodataset.org);
Step 2: a custom dataset is made and trained following the gesture recognition network procedure. Experiments show that when the intelligent point-reading device recognizes children's picture books, the sample types differ in style from those in the COCO dataset, so a custom dataset must be made and trained. The difference is that the input images need no segmentation maps; instead, the bounding box and label-class information of the objects to be recognized are added to the annotation data (a minimal annotation sketch follows Step 3);
Step 3: the image recognition network is constructed and trained; after training, it is connected to the image recognition module.
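A minimal sketch of the annotation format implied by Step 2, i.e. COCO-style bounding boxes plus category labels added to the label data; every concrete value here is a hypothetical placeholder:

```python
# One COCO-style annotation entry: a bounding box and a category id per
# object to be recognized. All file names, ids, and coordinates below
# are hypothetical placeholders.
annotation = {
    "images": [{"id": 1, "file_name": "picture_book_001.jpg",
                "width": 800, "height": 800}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 17,
                     # COCO boxes are [x, y, width, height]
                     "bbox": [120.0, 200.0, 310.0, 260.0]}],
    "categories": [{"id": 17, "name": "cat"}],
}
```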
2) Training of the text object network:
After repeated experiments, training a self-made dataset for text detection and recognition on existing neural networks such as CRNN and HigherHRNet could not achieve good results under the existing constraints (e.g., the available GPU hardware). Therefore, after a text bounding box is detected, the open-source OCR library tesseract-ocr is used for character recognition, specifically as follows:
Step 1: a dataset is self-made: the training images are converted to tif format and then to box files as required;
Step 2: the jTessBoxEditor tool in tesseract-ocr is opened, the training images are loaded, and the box positions are corrected;
Step 3: training is performed according to the tesseract-ocr requirements, and the trained model is connected to the image recognition module;
Step 4: the detected text bounding box information is passed to tesseract for recognition (a minimal sketch follows).
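A minimal sketch of Step 4, assuming the pytesseract Python wrapper around tesseract-ocr (the text names only the library itself) and an (x1, y1, x2, y2) box format:

```python
import cv2
import pytesseract

def recognize_text(image, box):
    """Crop a detected text bounding box and hand it to Tesseract.
    `box` is assumed to be (x1, y1, x2, y2) pixel coordinates."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)  # OCR prefers grayscale
    return pytesseract.image_to_string(gray).strip()
```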
referring to fig. 5, an intelligent interactive touch-and-talk system based on gesture recognition is characterized by comprising the following modules: the system comprises the following modules: the camera is connected with the image preprocessing module, the image preprocessing module is connected with the gesture recognition module and the image recognition module, the gesture recognition module is connected with the integration module, the image recognition module is connected with the integration module, the integration module is connected with the translation module, and the translation module is connected with the voice module.
The camera is used for acquiring images in real time;
the image preprocessing module is used for preprocessing the image;
the gesture recognition module recognizes the preprocessed image: it receives the image from the preprocessing module, segments it with the gesture recognition network by classifying each pixel and determining its category, obtains the shape and contour of the hand region, classifies the gesture from that shape, realizing gesture recognition, and performs different subsequent processing for different gestures;
the image recognition module detects and localizes the objects to be recognized: it receives the image output by the preprocessing module, detects the objects with an image recognition algorithm, and returns the bounding box and label information of each object to the integration module;
the integration module receives the inputs of the gesture recognition module and the image recognition module; it controls camera capture and pause according to control gestures, and matches the gesture recognition result against the image recognition bounding boxes (if the gesture is a single-fingertip point-reading gesture, the fingertip coordinates are approximated from the geometric shape of the gesture); it completes the two-way matching between the gesture-located fingertip and the image-recognition bounding boxes, outputs the matched object's label information, and passes it to the translation module for subsequent processing, realizing the point-reading function;
the translation module: translates the label information returned by the integration module into different languages to meet different needs;
the voice module: reads aloud the result of the translation module.
As shown in fig. 6, (a) is a captured frame after OpenCV processing (the original image) and (b) is the corresponding gesture segmentation map output by the neural network;
the method for reading by point based on gesture recognition has the following advantages:
firstly, it realizes fingertip recognition without the background-art approach of training a neural network on fingertip samples;
secondly, for single-fingertip recognition, as detailed for the single-fingertip point-reading gesture, the fingertip coordinates can be conveniently computed from the geometric characteristics of the gesture shown in fig. 6(b);
thirdly, the gesture-based point-reading method expands the semantics of interactive point-reading: point-reading, control, and other actions can be triggered by different gestures, better realizing intelligent human-computer interaction and meeting people's demand for intelligence.
The above examples are provided for the purpose of describing the present invention only, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention, and are intended to be within the scope of the invention.

Claims (6)

1. An interactive point-reading system based on gesture recognition, characterized in that the system comprises the following modules: a camera connected to an image preprocessing module; the image preprocessing module connected to a gesture recognition module and an image recognition module; the gesture recognition module and the image recognition module each connected to an integration module; the integration module connected to a translation module; and the translation module connected to a voice module.
The camera is used for acquiring images in real time;
the image preprocessing module is used for preprocessing the image;
the gesture recognition module is used for recognizing the preprocessed image: it receives the input image from the image preprocessing module, segments the image with a gesture recognition network by classifying each pixel and determining its category so as to obtain a hand segmentation map, classifies the gesture from the obtained map to realize gesture recognition, and performs different subsequent processing for different gestures;
the image recognition module is used for detecting and localizing the objects to be recognized: it receives the image output by the image preprocessing module, detects the objects with an image recognition algorithm, returns the bounding box and label information (including object category information) of each object, and passes them to the integration module;
the integration module receives the inputs of the gesture recognition module and the image recognition module, matches the gesture recognition result against the image recognition result, completes the two-way matching between the gesture recognition result and the image-recognition bounding boxes, outputs the label information of the matched object, and passes it to the translation module for subsequent processing to realize the point-reading function;
the translation module: translates the label information returned by the integration module into different languages to meet different needs;
the voice module: reads aloud the result of the translation module.
2. The gesture recognition based interactive point-reading system of claim 1, wherein: the image preprocessing module preprocesses the image captured by the camera, including image denoising and image scaling; experiments show that denoising improves the classification accuracy of gesture recognition, while scaling the image to different sizes changes the training and recognition time of the gesture recognition network.
3. The gesture recognition based interactive point-reading system of claim 1, wherein: the gesture recognition module realizes gesture recognition and the subsequent processing: it receives the output image of the image preprocessing module, segments the image with a gesture recognition network by classifying each pixel and determining its category, obtains a hand segmentation map, classifies the gesture from the obtained map to realize gesture recognition, and performs different subsequent processing for different gesture recognition results, specifically comprising the following steps:
(1) the image from the image preprocessing module is fed into the gesture recognition network, which classifies each pixel and determines its category so as to obtain the hand segmentation map;
(2) the hand region segmented in step (1) is classified, and the different gestures are recognized from the geometric shape of the hand-region contour;
(3) subsequent processing follows the gesture recognized in step (2); several different gestures are described here. If the gesture recognition result is:
1) a single-fingertip point-reading gesture, the centroid of the gesture contour is approximated from the geometric shape of the gesture, and the fingertip coordinates are computed from the centroid coordinates and the geometric characteristics of the single-fingertip point-reading gesture;
2) a camera pause/capture control gesture, the pausing/capturing of the camera is controlled according to the recognition result;
3) a multi-fingertip matching gesture, the geometric shape of the gesture is matched against the result of the image recognition module in claim 1 to drive the subsequent processing;
4) another gesture, different processing is performed according to the gesture result.
4. The gesture recognition based interactive point-reading system of claim 1, wherein: the image recognition module processes the output of the image preprocessing module, detects the objects in the image, and passes the detected bounding boxes and label information to the downstream integration module; connecting the image preprocessing module directly to the image recognition module avoids the possible interference of gestures with the image recognition process.
5. The gesture recognition based interactive point-reading system of claim 1, wherein: the integration module receives the gesture recognition result output by the gesture recognition module and the object label information and bounding boxes returned by the image recognition module, and matches the gesture recognition result against the bounding boxes; if the fingertip coordinates match a detected object's bounding box, the object's label information is output to the translation module; if no bounding box matches, explanatory information is prompted instead.
6. The gesture recognition based interactive point-reading system of claim 1, wherein: the translation module and the voice module call open-source libraries.
CN202011620981.1A 2020-12-30 2020-12-30 Interactive point-reading system based on gesture recognition Pending CN112749646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011620981.1A CN112749646A (en) 2020-12-30 2020-12-30 Interactive point-reading system based on gesture recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011620981.1A CN112749646A (en) 2020-12-30 2020-12-30 Interactive point-reading system based on gesture recognition

Publications (1)

Publication Number Publication Date
CN112749646A true CN112749646A (en) 2021-05-04

Family

ID=75650271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011620981.1A Pending CN112749646A (en) 2020-12-30 2020-12-30 Interactive point-reading system based on gesture recognition

Country Status (1)

Country Link
CN (1) CN112749646A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113936233A (en) * 2021-12-16 2022-01-14 北京亮亮视野科技有限公司 Method and device for identifying finger-designated target
CN114648756A (en) * 2022-05-24 2022-06-21 之江实验室 Book character recognition and reading method and system based on pointing vector
WO2023272604A1 (en) * 2021-06-30 2023-01-05 东莞市小精灵教育软件有限公司 Positioning method and apparatus based on biometric recognition
WO2023283934A1 (en) * 2021-07-16 2023-01-19 Huawei Technologies Co.,Ltd. Devices and methods for gesture-based selection

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120249422A1 (en) * 2011-03-31 2012-10-04 Smart Technologies Ulc Interactive input system and method
CN104157171A (en) * 2014-08-13 2014-11-19 三星电子(中国)研发中心 Point-reading system and method thereof
CN109325464A (en) * 2018-10-16 2019-02-12 上海翎腾智能科技有限公司 A kind of finger point reading character recognition method and interpretation method based on artificial intelligence
CN110443231A (en) * 2019-09-05 2019-11-12 湖南神通智能股份有限公司 A kind of fingers of single hand point reading character recognition method and system based on artificial intelligence
CN111353501A (en) * 2020-02-25 2020-06-30 暗物智能科技(广州)有限公司 Book point-reading method and system based on deep learning
CN111597969A (en) * 2020-05-14 2020-08-28 新疆爱华盈通信息技术有限公司 Elevator control method and system based on gesture recognition
CN112052724A (en) * 2020-07-23 2020-12-08 深圳市玩瞳科技有限公司 Finger tip positioning method and device based on deep convolutional neural network

Similar Documents

Publication Publication Date Title
WO2020078017A1 (en) Method and apparatus for recognizing handwriting in air, and device and computer-readable storage medium
US20210271862A1 (en) Expression recognition method and related apparatus
CN112749646A (en) Interactive point-reading system based on gesture recognition
Kumar et al. Sign language recognition
Kausar et al. A survey on sign language recognition
Pan et al. Real-time sign language recognition in complex background scene based on a hierarchical clustering classification method
Kadhim et al. A Real-Time American Sign Language Recognition System using Convolutional Neural Network for Real Datasets.
US20080008387A1 (en) Method and apparatus for recognition of handwritten symbols
Wang et al. Sparse observation (so) alignment for sign language recognition
CN103092329A (en) Lip reading technology based lip language input method
Hemayed et al. Edge-based recognizer for Arabic sign language alphabet (ArS2V-Arabic sign to voice)
CN102930270A (en) Method and system for identifying hands based on complexion detection and background elimination
Makhmudov et al. Improvement of the end-to-end scene text recognition method for “text-to-speech” conversion
JP2017084349A (en) Memory with set operation function and method for set operation processing using the memory
Shabir et al. Real-time pashto handwritten character recognition using salient geometric and spectral features
Geetha et al. Dynamic gesture recognition of Indian sign language considering local motion of hand using spatial location of Key Maximum Curvature Points
Patil et al. Literature survey: sign language recognition using gesture recognition and natural language processing
CN113220125A (en) Finger interaction method and device, electronic equipment and computer storage medium
Sharma et al. Highly Accurate Trimesh and PointNet based algorithm for Gesture and Hindi air writing recognition
Nahar et al. A robust model for translating arabic sign language into spoken arabic using deep learning
Robert et al. A review on computational methods based automated sign language recognition system for hearing and speech impaired community
Li et al. A novel art gesture recognition model based on two channel region-based convolution neural network for explainable human-computer interaction understanding
Axyonov et al. Method of multi-modal video analysis of hand movements for automatic recognition of isolated signs of Russian sign language
Geetha et al. A 3D stroke based representation of sign language signs using key maximum curvature points and 3D chain codes
Nguyen et al. Vietnamese sign language reader using Intel Creative Senz3D

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210504)