CN211512572U - Interactive blind guiding system - Google Patents

Interactive blind guiding system

Info

Publication number
CN211512572U
CN211512572U (application CN201921601724.6U)
Authority
CN
China
Prior art keywords
processing unit
blind
central processing
voice information
power supply
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201921601724.6U
Other languages
Chinese (zh)
Inventor
彭文杰
余菲
林坤阳
林泽锋
郑东润
范智博
罗家祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201921601724.6U
Application granted
Publication of CN211512572U
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Image Analysis (AREA)

Abstract

The utility model belongs to the technical field of blind guiding systems and relates to an interactive blind guiding system. The interactive blind guiding system comprises a central processing unit and a depth camera, a high-end speech synthesis device, a microphone and a power supply that are connected to it, wherein: the central processing unit is used for system control, target detection, path planning, speech recognition and signal transfer; the depth camera is used for capturing images of the current scene and generating an RGB image and a depth map; the high-end speech synthesis device is used for synthesizing voice information and playing the object-searching result or the road-planning result; the microphone is used for collecting the user's voice information and transmitting it to the central processing unit; and the power supply supplies power to the central processing unit. The utility model can assist blind users in daily life and improve their quality of life.

Description

Interactive blind guiding system
Technical Field
The utility model belongs to the technical field of blind guiding systems and relates to an interactive blind guiding system.
Background
In recent years, with the development of computer science and technology, and strongly driven by deep learning, the various technologies of artificial intelligence, such as speech recognition, image recognition and data mining, have developed substantially and been successfully applied in many products. Deep learning is a key focus of current research in computer vision and one of the common methods for solving problems in complex environments. Computer vision, a milestone in the history of science and technology, plays a very important role in the development of intelligent technology and has received extensive attention from both academia and industry. Among existing deep learning methods, neural networks have achieved good results in target detection.
At present, the intelligent blind guiding systems on the market mainly help blind users travel with infrared-assisted blind guiding canes. They offer no intelligent interaction, leave decisions largely to the user's own judgment, and consequently have a high accident rate. The recently emerging smart glasses for the blind require remote human customer service to achieve interaction, which makes them difficult to popularize: their cost is high, their resource consumption is large, and they depend heavily on network connectivity.
Meanwhile, some low-power target detection networks now achieve accuracy and precision close to those of common target detection networks while requiring far less computation, which makes it possible to deploy deep neural networks on portable devices.
SUMMARY OF THE UTILITY MODEL
In view of the weak interactivity of existing intelligent blind guiding systems, the utility model provides an interactive blind guiding system with good interactivity, which greatly improves the experience of blind users.
The utility model is realized by the following technical scheme:
An interactive blind guiding system comprises a central processing unit and a depth camera, a high-end speech synthesis device, a microphone and a power supply that are connected to the central processing unit, wherein:
the central processing unit: used for system control, target detection, path planning, speech recognition and signal transfer;
the depth camera: used for capturing images of the current scene and generating an RGB image and a depth map;
the high-end speech synthesis device: used for synthesizing voice information and playing the object-searching result or the road-planning result;
the microphone: used for collecting user voice information and transmitting it to the central processing unit;
the power supply: used for supplying power to the central processing unit.
Preferably, the central processing unit is an NVIDIA Jetson TX2 development kit.
Preferably, the depth camera is an Intel-D435 depth camera.
Preferably, the high-end speech synthesis device is a YS-XFSV2 high-end speech synthesis device.
Preferably, the power supply is a 19V mobile power supply.
Compared with the prior art, the utility model has the following advantages and beneficial effects:
The utility model innovatively combines the object-searching function with the blind guiding function. The interactive blind guiding system can help blind users find objects, reducing their dependence on family members; its autonomous path planning function improves travel safety; voice wake-up makes the device convenient to use; and its scene description capability helps blind users understand their surroundings. Together, these functions assist blind users in daily life and improve their quality of life.
Drawings
Fig. 1 is a structural diagram of the interactive blind guiding system according to an embodiment of the utility model;
Fig. 2 is a flowchart of the operation of the interactive blind guiding system according to an embodiment of the utility model.
Detailed Description
The utility model will be described in further detail below with reference to specific embodiments, but the embodiments of the utility model are not limited thereto.
For clarity: in the research and implementation of the interactive blind guiding system, the deep learning and neural network training methods introduced in the related literature are used, and the symbols that appear all have corresponding theoretical bases and source code. The convolutional neural network used for target detection, the neural network used for speech recognition and the dual-channel convolutional neural network used for road planning can all be implemented with existing, commonly used neural network models; the specific network structures of the utility model are therefore not repeated here.
An interactive blind guiding system, as shown in Fig. 1, comprises a central processing unit 1 and a depth camera 2, a high-end speech synthesis device 3, a microphone 4 and a power supply 5 connected to it, wherein:
The central processing unit: used for system control, target detection, path planning, speech recognition and signal transfer, ensuring stable operation of the whole system.
The depth camera: used for capturing images of the current scene and generating an RGB image and a depth map. In this embodiment, an Intel-D435 depth camera is used.
The high-end speech synthesis device: used for synthesizing the voice information output by the central processing unit and playing the object-searching result or the road-planning result. In this embodiment, a YS-XFSV2 high-end speech synthesis device is used.
The microphone: used for collecting user voice information and transmitting it to the central processing unit.
The power supply: used for supplying power to the central processing unit. In this embodiment, a 19V mobile power supply provides continuous power to the TX2, so the system can be operated without a power cord, which greatly improves portability.
In this embodiment, an NVIDIA Jetson TX2 development kit is used as the central processing unit. The NVIDIA Jetson TX2 is an artificial intelligence supercomputer module built on the NVIDIA Pascal™ architecture. The Jetson TX2 is energy efficient and compact, making it suitable for intelligent edge devices such as robots, drones, smart cameras and portable medical devices. Moreover, the Jetson TX2 not only supports all functions of the Jetson TX1 module, but also supports larger and more complex deep neural networks.
The central processing unit runs a target detection unit, a speech recognition unit and a road planning unit, wherein:
The target detection unit: implemented with a common convolutional neural network and trained on a specially curated dataset to locate and classify objects, helping blind users find things.
In this embodiment, training the convolutional neural network includes the following steps (a minimal training sketch is given after this list):
(1) Twenty classes of objects frequently used by blind people in daily life are listed, and about 300 pictures of each class are collected as a dataset through online search, real-scene photography and other means.
Regarding the dataset: existing open-source datasets such as VOC and COCO differ from the real data distribution of the indoor environments where the device operates and do not satisfy the basic object-finding needs of blind users. To alleviate this, the utility model collects some indoor-object datasets available on the Internet and, on that basis, builds roughly 100 to 200 images for each category; data augmentation is performed by random rotation, translation, flipping, brightness adjustment, contrast adjustment and cropping, and a certain amount of data matching the current application scene is proportionally selected from the open-source datasets and added to the new dataset to retrain the model.
(2) The picture size is unified to the 416 × 416 standard size.
(3) The positions of the 20 classes appearing in each picture are boxed and labeled with an annotation tool, and data augmentation is applied to the annotated pictures and annotation files, namely random rotation, translation, flipping, brightness adjustment, contrast adjustment and cropping.
(4) The dataset is shuffled and used as the input of the convolutional neural network; a preset loss function is taken as the training objective, a suitable optimizer is selected, and a learning rate that decays as the training rounds increase is set, after which training of the neural network can begin. Parameters in this training phase use single-precision floating-point numbers.
In this embodiment, the initial learning rate is 0.001 and is reduced to 1/10 of its value at rounds 60 and 90 of training.
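As an illustration only, the following minimal PyTorch sketch shows such a training loop. The tiny stand-in network, loss and dummy data are hypothetical placeholders, since the patent does not disclose its exact detection network or loss function; only the single-precision training, the shuffled 416 × 416 inputs and the learning-rate schedule (0.001, reduced to 1/10 at rounds 60 and 90) come from the text.

    import torch
    import torch.nn as nn
    from torch.optim import SGD
    from torch.optim.lr_scheduler import MultiStepLR
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-in for the undisclosed detection network: any nn.Module fits here.
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 20),
    ).float()  # single-precision parameters during training, as in the text

    criterion = nn.CrossEntropyLoss()  # stand-in for the preset loss function
    optimizer = SGD(model.parameters(), lr=0.001, momentum=0.9)
    # Learning rate falls to 1/10 of its value at rounds 60 and 90.
    scheduler = MultiStepLR(optimizer, milestones=[60, 90], gamma=0.1)

    # Dummy shuffled dataset of 416x416 images labeled with the 20 classes.
    data = TensorDataset(torch.randn(64, 3, 416, 416), torch.randint(0, 20, (64,)))
    loader = DataLoader(data, batch_size=8, shuffle=True)

    for epoch in range(100):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()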
In this embodiment, deployment verification of the convolutional neural network model includes the following steps (a precision-reduction sketch is given after this list):
(1) The precision of the trained convolutional neural network parameters is reduced: half-precision floating-point operations replace the single-precision operations used during training, which further improves the inference speed of the model.
(2) The convolutional neural network model is deployed on the NVIDIA Jetson TX2 development kit and tested on real scenes to verify the object detection effect in the real-life scenarios of blind users.
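A minimal sketch of this FP32-to-FP16 conversion in PyTorch, assuming a trained model and a CUDA device such as the TX2 (on Jetson hardware TensorRT would more commonly be used; this only illustrates the idea):

    import torch
    import torch.nn as nn

    # Stand-in trained model (see the training sketch above).
    model = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 20))
    model.eval()
    model_fp16 = model.cuda().half()   # convert parameters to half precision

    with torch.no_grad():
        x = torch.randn(1, 3, 416, 416, device="cuda").half()  # inputs must be FP16 too
        scores = model_fp16(x)         # faster inference at reduced precision
    print(scores.dtype)                # torch.float16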
The speech recognition unit: used to encode voice commands and output voice information.
The speech recognition unit comprises a wake-word detection module, a keyword detection module and a voice guidance module, and processes the audio once the microphone has accumulated a certain number of samples. In the initial state of the interactive blind guiding system, only the wake-word detection module runs, to save power, and the other modules stand by. The functions and workflows of the submodules are as follows:
Wake-word detection module: used to detect and recognize the wake word, which starts the operation of the system. Its workflow comprises the following steps (a model sketch follows this list):
(1) When the user speaks an instruction, the time-domain signal collected by the microphone undergoes basic processing, including framing, windowing, Fourier transform and taking the logarithm, to obtain a spectrogram.
(2) The spectrogram is encoded to obtain a signal encoding.
Because wake-word detection is a simple task within speech recognition, and to improve the overall speed of the wake-word detection module and avoid excessive complexity while preserving precision and recall, only one-dimensional convolution kernels are used to process the spectrogram into the signal encoding.
(3) Whether the audio contains the wake word is predicted from the signal encoding, which passes through a gated recurrent unit (GRU) layer, a fully connected layer and the like, with dropout applied in between to regularize the neural network; the output is the probability that the audio contains the wake word.
(4) The wake-word probability is compared with the wake-word threshold. If it exceeds the threshold, wake-up succeeds: the keyword detection module is started and the wake-word detection module is temporarily closed. Otherwise the system does not respond, and the wake-word detection steps are repeated.
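A minimal PyTorch sketch of such a detector follows, assuming a log-spectrogram input; the layer sizes and the 0.5 threshold are illustrative guesses, not the patent's actual configuration:

    import torch
    import torch.nn as nn

    class WakeWordNet(nn.Module):
        # 1-D conv encoder + GRU + dropout + FC, outputting P(wake word).
        def __init__(self, n_freq_bins=64):
            super().__init__()
            # Only one-dimensional convolutions, as described in the text.
            self.encoder = nn.Conv1d(n_freq_bins, 128, kernel_size=5, padding=2)
            self.gru = nn.GRU(128, 64, batch_first=True)
            self.dropout = nn.Dropout(0.3)   # random deactivation (dropout)
            self.fc = nn.Linear(64, 1)

        def forward(self, spec):             # spec: (batch, n_freq_bins, time)
            enc = torch.relu(self.encoder(spec))
            out, _ = self.gru(enc.transpose(1, 2))
            logit = self.fc(self.dropout(out[:, -1]))  # last time step
            return torch.sigmoid(logit)      # probability of the wake word

    # The spectrogram itself comes from framing, windowing, FFT and log,
    # e.g. via torchaudio; a dummy log-spectrogram stands in here.
    spec = torch.log(torch.rand(1, 64, 100) + 1e-6)
    prob = WakeWordNet()(spec)
    woken = prob.item() > 0.5                # compare with the wake-word threshold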
Keyword detection module: used to detect and recognize keywords.
In this embodiment, the interactive blind guiding system presets 20 object-category keywords, including: chairs, cups, books, remote controls, glasses, electric kettles, tissues, trash cans, mobile phones, bags, bowls, people, toothbrushes, combs, shoes, purses, keys, pens and backpacks. The keyword detection module accepts only one keyword per input, and its workflow comprises the following steps (a dispatch sketch follows this list):
(1) After the keyword detection module starts, it processes the signal collected by the microphone to obtain its spectrogram.
(2) Since keyword detection can be regarded as a pluralized form of wake-word detection, the module follows the same general steps as the wake-word detection module and finally outputs a vector of probabilities that each preset keyword is present in the audio.
(3) The probability of each keyword is compared with the keyword threshold, keywords whose probability exceeds the threshold are output as detected, and subsequent operations depend on the number of detected keywords, specifically:
if no keyword is detected, the keyword detection step is repeated; if no valid keyword is detected within the specified time, the module enters standby and the wake-word detection module is restarted;
if multiple keywords are detected, the voice guidance module is started to remind the user by voice that only one keyword may be entered at a time and to request new input, and the keyword detection module is restarted;
if exactly one keyword is detected, the corresponding target detection module or road planning module is started according to the detected keyword, and the corresponding behavior is executed.
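The branching above can be sketched in plain Python as follows; the keyword subset and the 0.5 threshold are illustrative assumptions:

    KEYWORDS = ["chair", "cup", "book", "remote control", "glasses"]  # subset for brevity
    THRESHOLD = 0.5   # illustrative keyword threshold

    def dispatch(probs):
        # probs: the probability vector output by the keyword detector.
        detected = [kw for kw, p in zip(KEYWORDS, probs) if p > THRESHOLD]
        if not detected:
            return "retry"             # repeat detection; stand by after a timeout
        if len(detected) > 1:
            return "guide"             # voice guidance: one keyword at a time
        return "start:" + detected[0]  # launch target detection or road planning

    print(dispatch([0.1, 0.8, 0.2, 0.1, 0.3]))   # -> start:cup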
Voice guidance module: responsible for reminding and guiding the user by voice. Its workflow comprises the following steps:
(1) When the keyword detection module detects multiple keywords, a preset reminder is played through the YS-XFSV2 high-end speech synthesis device to inform the user.
(2) After the target detection unit or the road planning unit starts, the YS-XFSV2 high-end speech synthesis module plays a preset voice prompt to the blind user according to the unit's output, as follows:
After the target detection module starts, if no detection result has been obtained yet, a preset voice is played asking the user to move. Once the target is detected successfully, the user is guided to move according to the target center coordinates output by the target detection module (an illustrative mapping is sketched after these steps).
After the road planning module starts, a suitable path is output to the blind user according to the distribution of the current obstacles, and a preset voice is played to guide the user's movement.
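One plausible way to turn the detected target's center coordinate into a spoken direction is sketched below; this mapping is purely illustrative, as the patent does not specify how coordinates are converted into guidance:

    def direction_from_center(cx, image_width=416):
        # Map the detected target's center x-coordinate to a spoken hint.
        third = image_width / 3
        if cx < third:
            return "The object is to your left."
        if cx > 2 * third:
            return "The object is to your right."
        return "The object is in front of you."

    print(direction_from_center(120))   # -> The object is to your left.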
Road planning unit: classifies the feasible directions with a neural network to plan a path for the road ahead and help blind users avoid obstacles effectively. It comprises an image preprocessing module and a neural network module, wherein the image preprocessing module preprocesses the RGB image and the depth map captured by the depth camera, and the neural network module is implemented with a dual-channel convolutional neural network.
After the dual-channel neural network model is trained, the Intel-D435 depth camera outputs RGB images and depth maps in real time; the RGB image is converted into a grayscale map, the grayscale map and the depth map are preprocessed and input into the dual-channel neural network model, and one of five direction instructions (turn left, front-left, forward, front-right, turn right) is output in real time.
Specifically, the grayscale map and the depth map are resized to a uniform 224 × 224 and input into the dual-channel feature extraction network. Because the downsampling rate is 32, two 7 × 7 × 160 output feature maps are obtained; these are concatenated and passed through a fully connected layer to obtain the five-class output, thereby guiding the blind user forward.
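A minimal PyTorch sketch of such a dual-channel network follows. Only the 224 × 224 inputs, the 32× downsampling to 7 × 7 × 160 per branch, the concatenation and the five-way output come from the text; the backbone layers are illustrative:

    import torch
    import torch.nn as nn

    def branch():
        # Illustrative feature extractor: five stride-2 stages give 32x
        # downsampling, so a 1x224x224 input becomes a 160x7x7 feature map.
        chans = [1, 16, 32, 64, 96, 160]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU()]
        return nn.Sequential(*layers)

    class DualChannelNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.gray_branch = branch()    # grayscale channel
            self.depth_branch = branch()   # depth channel
            # Two 7x7x160 maps are concatenated, then a fully connected
            # layer produces the five direction classes.
            self.fc = nn.Linear(2 * 160 * 7 * 7, 5)

        def forward(self, gray, depth):
            f = torch.cat([self.gray_branch(gray), self.depth_branch(depth)], dim=1)
            return self.fc(f.flatten(1))

    DIRECTIONS = ["turn left", "front-left", "forward", "front-right", "turn right"]
    gray, depth = torch.randn(1, 1, 224, 224), torch.randn(1, 1, 224, 224)
    logits = DualChannelNet()(gray, depth)
    print(DIRECTIONS[logits.argmax(1).item()])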
The workflow of the interactive blind guiding system of the utility model, as shown in Fig. 2, comprises:
S101: the interactive blind guiding system processes the audio once the microphone has accumulated a certain number of samples. The wake-word detection module of the speech recognition unit detects and recognizes the wake word in the audio. When the wake word is recognized and its probability exceeds the wake-word threshold, the keyword detection module is started, the wake-word detection module is temporarily closed, and the flow proceeds to S102; otherwise the system does not respond and the wake-word detection step S101 is repeated.
S102: the keyword detection module detects and recognizes keywords in the audio, compares the probability of each recognized keyword with the keyword threshold, outputs keywords whose probability exceeds the threshold as detected, and executes subsequent operations according to the number of detected keywords, specifically:
if no keyword is detected, the keyword detection step is repeated; if no valid keyword is detected within the specified time, the module enters standby and the wake-word detection module is restarted;
if multiple keywords are detected, the voice guidance module is started to remind the user by voice that only one keyword may be entered at a time and to request new input, and the keyword detection module is restarted;
if exactly one keyword is detected, the corresponding target detection unit or road planning unit is started according to the detected keyword, and the corresponding object-searching or real-time road-planning behavior is executed (a capture sketch is given after these steps), as follows:
When searching for an object, the target detection unit obtains the image input from the Intel-D435 depth camera, feeds the RGB image into the convolutional neural network to extract deep image features, determines whether the object requested by the user appears in the image, and converts the result into voice that is broadcast to the user through the voice module.
During real-time road planning, an RGB image and a depth map are acquired from the depth camera and preprocessed; the grayscale map converted from the RGB image, together with the depth map, is input into the trained dual-channel neural network for five-direction path planning, and the planning result is converted into voice output.
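As an illustration of the capture step, a sketch using the pyrealsense2 SDK for the D435 camera follows; frame alignment, normalization and error handling are omitted, and feeding the network would reuse the dual-channel sketch above:

    import numpy as np
    import cv2
    import pyrealsense2 as rs

    pipeline = rs.pipeline()
    cfg = rs.config()
    cfg.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
    cfg.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
    pipeline.start(cfg)

    try:
        frames = pipeline.wait_for_frames()
        color = np.asanyarray(frames.get_color_frame().get_data())
        depth = np.asanyarray(frames.get_depth_frame().get_data())

        # RGB -> grayscale, then resize both inputs to 224x224 for the network.
        gray = cv2.resize(cv2.cvtColor(color, cv2.COLOR_BGR2GRAY), (224, 224))
        depth = cv2.resize(depth.astype(np.float32), (224, 224))
        # gray and depth would now be converted to tensors and fed into the
        # dual-channel network to obtain one of the five direction instructions.
    finally:
        pipeline.stop()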
The above embodiments are preferred embodiments of the utility model, but the embodiments of the utility model are not limited thereto. Any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the utility model shall be regarded as an equivalent replacement and is included within the protection scope of the utility model.

Claims (5)

1. An interactive blind guiding system, characterized by comprising a central processing unit and a depth camera, a high-end speech synthesis device, a microphone and a power supply that are connected to the central processing unit, wherein:
the central processing unit is used for system control, target detection, path planning, speech recognition and signal transfer;
the depth camera is used for capturing images of the current scene and generating an RGB image and a depth map;
the high-end speech synthesis device is used for synthesizing voice information and playing the object-searching result or the road-planning result;
the microphone is used for collecting user voice information and transmitting it to the central processing unit;
the power supply is used for supplying power to the central processing unit.
2. The interactive blind guiding system of claim 1, wherein the central processing unit is an NVIDIA Jetson TX2 development kit.
3. The interactive blind guiding system of claim 1, wherein the depth camera is an Intel-D435 depth camera.
4. The interactive blind guiding system of claim 1, wherein the high-end speech synthesis device is a YS-XFSV2 high-end speech synthesis device.
5. The interactive blind guiding system of claim 1, wherein the power supply is a 19V mobile power supply.
CN201921601724.6U 2019-09-25 2019-09-25 Interactive blind guiding system Expired - Fee Related CN211512572U (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201921601724.6U CN211512572U (en) 2019-09-25 2019-09-25 Interactive blind guiding system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201921601724.6U CN211512572U (en) 2019-09-25 2019-09-25 Interactive blind guiding system

Publications (1)

Publication Number Publication Date
CN211512572U (en) 2020-09-18

Family

ID=72461193

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201921601724.6U Expired - Fee Related CN211512572U (en) 2019-09-25 2019-09-25 Interactive blind guiding system

Country Status (1)

Country Link
CN (1) CN211512572U (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112641608A (en) * 2020-12-31 2021-04-13 遵义师范学院 Blind-guiding auxiliary crutch based on CNN
CN114469661A (en) * 2022-02-24 2022-05-13 沈阳理工大学 Visual content blind guiding auxiliary system and method based on coding and decoding technology
CN114469661B (en) * 2022-02-24 2023-10-03 沈阳理工大学 Visual content blind guiding auxiliary system and method based on coding and decoding technology

Similar Documents

Publication Publication Date Title
CN110728308B (en) Interactive blind guiding system and method based on improved Yolov2 target detection and voice recognition
CN110599557B (en) Image description generation method, model training method, device and storage medium
WO2021135577A9 (en) Audio signal processing method and apparatus, electronic device, and storage medium
Gao et al. Sign language recognition based on HMM/ANN/DP
WO2019062931A1 (en) Image processing apparatus and method
CN109767769A (en) Voice recognition method and device, storage medium and air conditioner
CN106157956A (en) The method and device of speech recognition
CN111026873B (en) Unmanned vehicle and navigation method and device thereof
CN102932212A (en) Intelligent household control system based on multichannel interaction manner
CN211512572U (en) Interactive blind guiding system
CN111967272B (en) Visual dialogue generating system based on semantic alignment
CN106407993A (en) Intelligent voice robot system based on image recognition technology and method thereof
CN113254684B (en) Content aging determination method, related device, equipment and storage medium
CN113421547A (en) Voice processing method and related equipment
CN113516990A (en) Voice enhancement method, method for training neural network and related equipment
CN117193524A (en) Man-machine interaction system and method based on multi-mode feature fusion
CN117351115A (en) Training method of image generation model, image generation method, device and equipment
CN111695408A (en) Intelligent gesture information recognition system and method and information data processing terminal
CN118052907A (en) Text map generation method and related device
Maolanon et al. Development of a wearable household objects finder and localizer device using CNNs on Raspberry Pi 3
CN117273019A (en) Training method of dialogue model, dialogue generation method, device and equipment
CN109871128B (en) Question type identification method and device
CN114469661B (en) Visual content blind guiding auxiliary system and method based on coding and decoding technology
CN113420783B (en) Intelligent man-machine interaction method and device based on image-text matching
CN116339655A (en) Text printing method and system based on voice recognition

Legal Events

Date Code Title Description
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200918