CN110718227A - Multi-mode interaction based distributed Internet of things equipment cooperation method and system - Google Patents
- Publication number: CN110718227A (application CN201910988977.1A)
- Authority: CN (China)
- Prior art keywords: voice, sub, things, equipment, distributed internet
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
- G10L15/34—Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
- H04L67/12—Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
Abstract
The invention discloses a multi-modal-interaction-based cooperation method and system for distributed Internet of Things (IoT) devices. Each sub-device of the distributed IoT collects voice signals in real time through a microphone and performs voice wake-up judgment; a device woken by voice starts its camera to capture face images in real time and performs face detection, and sends the results to an interaction central control over network communication. The central control arbitrates and coordinates according to the voice wake-up and face detection results reported by the sub-devices, determines the device that should really respond to the user's wake-up, keeps listening for voice commands on that device, and clears the wake-up information of the other sub-devices. The user's voice commands are processed in real time, and the corresponding control commands and voice reply content are sent to the responding IoT sub-device. By arbitrating and coordinating over the multi-modal results reported by the distributed IoT devices, the interaction central control improves the accuracy of cooperative interaction and response of the distributed IoT devices.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a multi-modal-interaction-based cooperation method and system for distributed Internet of Things devices.
Background
With continuous progress in the field of artificial intelligence, the accuracy of speech recognition and face detection keeps improving, and many intelligent voice devices have entered daily life. A built-in microphone or microphone array lets a user interact with a smart device at close range or across a certain far-field distance, but beyond that range the accuracy of voice interaction drops or interaction fails altogether. Many home environments now contain several voice-enabled smart devices placed in a distributed fashion, for example a smart speaker in the living room and a smart desk lamp in the bedroom. With the rapid growth of the Internet of Things, interconnecting multiple voice-enabled smart devices is both an inevitable technical trend and a practical need of the smart home, and this scenario calls for a method for cooperative interaction among distributed IoT devices. In the prior art, distributed IoT devices share the same wake-up word, so after the user speaks it, all devices respond at once; the system cannot decide which device should answer the user's request, which seriously degrades the user experience.
Disclosure of Invention
The invention aims to provide a multi-modal-interaction-based cooperation method and system for distributed IoT devices that reduces network latency, improves response speed, resolves the confusion caused by multiple devices waking simultaneously, improves the response accuracy and stability of distributed IoT devices through multi-modal interaction, effectively solves the problem of interconnecting and coordinating multiple distributed voice devices in a home scenario, and improves the user experience in an IoT environment, thereby addressing the problems described in the Background section.
In order to achieve the purpose, the invention provides the following technical scheme:
a distributed Internet of things equipment cooperation method based on multi-modal interaction comprises the following steps:
S1: each sub-device of the distributed IoT locally collects the user's voice in real time and performs voice wake-up judgment;
S2: each sub-device whose voice wake-up is triggered starts its camera to capture the current scene, runs real-time face detection, and computes a face detection result and confidence;
S3: if a face is present in the current scene, the sub-device immediately sends its voice wake-up and face detection results, including but not limited to the voice wake-up and face detection confidences, to the interaction central control; if no face is present, the device clears its wake-up state and reports nothing to the central control;
S4: from the voice wake-up and face detection results received from each sub-device, the interaction central control selects the sub-device with the highest combined wake-up score and face detection confidence as the device that responds to the user's wake-up, notifies it to play a response prompt and keep picking up the user's voice commands, forwards those commands to the voice cloud server for processing, and clears the wake-up information of the other distributed sub-devices;
S5: the voice cloud server performs speech recognition, semantic understanding, dialogue management and speech synthesis in real time to process the user's voice command and returns the response result.
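The arbitration in S3 and S4 can be sketched as follows. This is a minimal illustration, not the patent's implementation; the device names, the score field names, and the rule of summing the two confidences are assumptions made for the example:

```python
def arbitrate(reports):
    """Pick the sub-device that should answer the user's wake-up.

    `reports` maps device id -> dict with 'wake_score' and 'face_conf';
    per S3, devices that detected no face report nothing and so never
    appear here. Returns (winner, losers): the responding device and the
    devices whose wake-up state the central control should clear (S4).
    """
    if not reports:
        return None, []
    winner = max(reports,
                 key=lambda d: reports[d]["wake_score"] + reports[d]["face_conf"])
    losers = [d for d in reports if d != winner]
    return winner, losers
```

For example, with a speaker reporting (0.91, 0.88) and a lamp reporting (0.87, 0.40), the speaker wins and the lamp's wake-up state is cleared.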
Further, in S1, the distributed IoT devices are a set of intelligent terminals, each equipped with a microphone array, including but not limited to a linear 2-mic, linear 4-mic, linear 6-mic, circular 4-mic, or other irregular microphone array.
Furthermore, in S2, the face detection method comprises two steps: image preprocessing and an MTCNN-based face detection algorithm. The MTCNN network consists of three lightweight CNNs (P-Net, R-Net and O-Net); the preprocessed input image passes through the three networks in turn, finally outputting the face detection and key-point detection results.
Further, in S2, the real-time face detection function adopts a face detection algorithm based on a multi-task cascaded convolutional neural network (MTCNN).
The invention provides another technical solution: a multi-modal-interaction-based cooperation system for distributed IoT devices, comprising the distributed IoT devices, an interaction central control, and a voice cloud server. Each distributed IoT device is equipped with a microphone-array audio acquisition module and a camera image acquisition module; the audio module collects voice signals in real time and performs signal processing and voice wake-up. After voice wake-up, the sub-device starts the camera module to capture pictures in real time and run face detection, and when a face is judged to be present, the multi-modal voice-wake-up and face-detection data are transmitted to the interaction central control over the communication link; each device also carries a voice reply and broadcast module. The interaction central control comprises a voice wake-up arbitration module, a voice agent service module and a network communication module. Based on the content uploaded by each distributed IoT device, it determines the sub-device that should respond to the wake-up, lets that sub-device continue listening for the user's voice commands while clearing the wake-up information of the other distributed sub-devices, requests speech recognition and semantic understanding from the voice cloud server over network communication in real time, and after real-time voice processing sends the corresponding control commands and voice reply content to the responding IoT sub-device. The voice cloud server comprises a speech recognition module, a semantic understanding module, a dialogue management module, a speech synthesis module and a network communication module, and returns the response result to the interaction central control through the network communication module.
Compared with the prior art, the invention has the beneficial effects that:
1. In the multi-modal-interaction-based cooperation method and system for distributed IoT devices, the interaction central control connects to all distributed IoT devices over the local area network, makes an arbitration decision from the received wake-up information and face detection results, and quickly determines and notifies the device that should respond; this reduces network latency, improves response speed, and resolves the confusion of multiple devices waking simultaneously.
2. In the multi-modal-interaction-based cooperation method and system for distributed IoT devices, multi-modal interaction improves the response accuracy and stability of the distributed IoT devices, effectively solves the problem of interconnecting and coordinating multiple distributed voice devices in a home scenario, and improves the user experience in the IoT environment.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a flow chart of a face detection method of the present invention;
FIG. 3 is a diagram illustrating the effect of the face detection method of the present invention;
fig. 4 is a block diagram of the system of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, in the embodiment of the present invention: the distributed Internet of things equipment cooperation method based on multi-modal interaction comprises the following steps:
Step 1: each sub-device of the distributed IoT locally collects the user's voice in real time and performs voice wake-up judgment. In this step, the distributed IoT devices are a set of intelligent terminals, each equipped with a microphone array, including but not limited to a linear 2-mic, linear 4-mic, linear 6-mic, circular 4-mic, or other irregular microphone array. When the user utters a wake-up phrase, several distributed IoT devices may receive the wake-up voice signal; if several sub-devices then responded to the user at once, the user experience and voice interaction quality would suffer badly. The user's voice wake-up therefore has to be decided, that is, the wake-up behavior of the multiple sub-devices must be arbitrated: the system determines the IoT sub-device or intelligent terminal that really should respond to the user's wake-up (the woken device best suited to interact with the user) and clears the wake-up responses of all other sub-devices.
Step 2: each sub-device whose voice wake-up is triggered starts its camera to capture the current scene, runs real-time face detection, and computes a face detection result and confidence.
Step 3: if a face is present in the current scene, the sub-device immediately sends its voice wake-up and face detection results, including but not limited to the voice wake-up and face detection confidences, to the interaction central control; if no face is present, the device clears its wake-up state and reports nothing to the central control.
Step 4: from the voice wake-up and face detection results received from each sub-device, the interaction central control selects the sub-device with the highest combined wake-up score and face detection confidence as the device that responds to the user's wake-up, notifies it to play a response prompt and keep picking up the user's voice commands, forwards those commands to the voice cloud server for processing, and clears the wake-up information of the other distributed sub-devices.
Step 5: the voice cloud server performs speech recognition, semantic understanding, dialogue management and speech synthesis in real time to process the user's voice command and returns the response result.
In this embodiment, when determining which sub-device responds to the user's wake-up, face detection is run on every sub-device whose voice wake-up was triggered; the detection result establishes whether a user is present in the current scene, and thus whether the user actually addressed this distributed IoT sub-device with a wake-up phrase or voice command. Confirming the user through the combined voice and visual modalities improves the accuracy and reliability of the cooperative response of the distributed IoT devices.
In the above embodiment, the real-time face detection in step 2 adopts a face detection algorithm based on a multi-task cascaded convolutional neural network (MTCNN). This coarse-to-fine method processes in real time, is fast and effective, and can run on an intelligent terminal, overcoming the drawbacks of traditional face detection algorithms: strict requirements on the environment and on the face, and long detection times.
In the above embodiment, the face detection method comprises two steps: image preprocessing and the MTCNN-based face detection algorithm. The MTCNN consists of three lightweight CNNs (P-Net, R-Net and O-Net); the preprocessed input image passes through the three networks in turn, finally outputting the face detection and key-point detection results.
To better explain the invention, the flow of the face detection algorithm (see fig. 2) comprises the following steps:
Step 21: the picture captured by the camera is taken as input, and a preprocessing step rescales it to several different sizes to form an image pyramid, so that detection is invariant to the scale of the face;
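The image pyramid of step 21 can be sketched as follows. The 20-pixel minimum face size, the 12-pixel P-Net input size, and the 0.709 per-level scale factor are conventional MTCNN choices assumed here for illustration, not values stated in the patent:

```python
def pyramid_scales(height, width, min_face=20, net_input=12, factor=0.709):
    """Scale factors that map faces down to P-Net's 12x12 receptive field."""
    scales = []
    scale = net_input / min_face            # largest scale: min_face -> 12 px
    min_side = min(height, width) * scale
    while min_side >= net_input:            # stop once the image drops below 12 px
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales
```

Resizing the input picture by each returned factor yields the levels of the pyramid fed into P-Net in step 22.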
Step 22: the preprocessed image pyramid is fed into the MTCNN for processing, passing through the 3 sub-networks in a coarse-to-fine manner; the specific process is as follows:
Step 221: the preprocessed image pyramid is fed into P-Net, which outputs a face classification result, candidate image windows, and a facial landmark localization result. P-Net is a fully convolutional network that generates candidate boxes and bounding-box regression vectors, corrects the candidate windows by bounding-box regression, and merges overlapping candidates by non-maximum suppression.
Specifically, the input is a 12 × 12 picture; the generated training data (candidate boxes cropped and resized to 12 × 12) is arranged as a 12 × 12 × 3 tensor before training. In the first layer, 10 convolution kernels of 3 × 3 followed by a 3 × 3 max pooling generate 10 feature maps of 5 × 5; in the second layer, 16 kernels of 3 × 3 × 10 generate 16 feature maps of 3 × 3; in the third layer, 32 kernels of 3 × 3 × 16 generate 32 feature maps of 1 × 1. From these 32 feature maps of 1 × 1, the P-Net feedforward finally outputs 3 vectors: 2 kernels of 1 × 1 × 32 produce 2 feature maps of 1 × 1 for the face/non-face classification, i.e. the face probability; 4 kernels of 1 × 1 × 32 produce 4 feature maps of 1 × 1 for the bounding-box regression; and 10 kernels of 1 × 1 × 32 produce 10 feature maps of 1 × 1 for the facial landmark localization, i.e. the face contour point information.
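The feature-map sizes quoted above (12 to 10 to 5 to 3 to 1) can be checked with a small spatial-size calculator; the stride-2, ceil-mode pooling is an assumption matching common MTCNN implementations rather than a detail stated in the patent:

```python
import math

def out_size(n, k, stride=1, ceil_mode=False):
    """Spatial output size of a 'valid' (unpadded) conv or pooling layer."""
    frac = (n - k) / stride
    return (math.ceil(frac) if ceil_mode else math.floor(frac)) + 1

def pnet_trace(n=12):
    """Trace the spatial sizes through P-Net's layers, starting from n x n."""
    sizes = [n]
    n = out_size(n, 3)                             # conv 3x3, stride 1 -> 10
    sizes.append(n)
    n = out_size(n, 3, stride=2, ceil_mode=True)   # max pool 3x3, stride 2 -> 5
    sizes.append(n)
    n = out_size(n, 3)                             # conv 3x3 -> 3
    sizes.append(n)
    n = out_size(n, 3)                             # conv 3x3 -> 1
    sizes.append(n)
    return sizes
```

Running the trace on a 12-pixel input reproduces the 10, 5, 3, 1 sequence of the description; the same calculator applies to the R-Net and O-Net sizes below.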
In step 222, the candidate windows determined by the P-Net output are fed into R-Net for further classification, which amounts to a refinement pass.
Specifically, according to the coordinates output by P-Net, a patch is cropped from the original image (using a square crop along the longest side, which avoids deformation and retains more detail) and resized to 24 × 24; the resulting 24 × 24 × 3 tensor is fed into R-Net. In the first layer, 28 convolution kernels of 3 × 3 followed by a 3 × 3 max pooling generate 28 feature maps of 11 × 11; in the second layer, 48 kernels of 3 × 3 × 28 with a 3 × 3 max pooling generate 48 feature maps of 4 × 4; in the third layer, 64 kernels of 2 × 2 × 48 generate 64 feature maps of 3 × 3. Finally, a fully connected layer of 128 neurons outputs 3 results: a 2-dimensional result for the face/non-face classification, i.e. the face probability; 4 coordinate offsets for the candidate bounding-box regression; and 10 face contour point values for the facial landmark localization.
In step 223, the patches cut from the original picture according to the candidate-window information output by R-Net in the previous step are fed into O-Net, which determines the final positions of the face box and the feature points, outputting the face/non-face judgment, the face box location, and the positions of the 5 facial feature points.
Specifically, according to the information output by R-Net, a patch is cropped from the original image (processed like the R-Net input: a square crop along the longest side to avoid deformation and retain more detail) and resized to 48 × 48; the resulting 48 × 48 × 3 tensor is fed into O-Net. In the first layer, 32 convolution kernels of 3 × 3 followed by a 3 × 3 max pooling generate 32 feature maps of 23 × 23; in the second layer, 64 kernels of 3 × 3 × 32 with a 3 × 3 max pooling generate 64 feature maps of 10 × 10; in the third layer, 64 kernels of 3 × 3 × 64 with a 2 × 2 max pooling generate 64 feature maps of 4 × 4; in the fourth layer, 128 kernels of 2 × 2 × 64 generate 128 feature maps of 3 × 3. Finally, a fully connected layer of 256 neurons outputs 3 results: a 2-dimensional result for the face/non-face classification, i.e. the face probability; 4 coordinate offsets for the candidate bounding-box regression; and 10 face contour point values for the facial landmark localization.
Step 23: the face detection result and the 5 facial key points are determined from the MTCNN output of step 22; the 5 feature points on the original image are aligned to fixed positions in the picture by an affine transformation, and the face position is framed to display the detection result.
In this embodiment, the outputs of all three sub-networks of the MTCNN are post-processed over the candidate face boxes: according to the probability scores of the face classification results, an overlap criterion (IoU) and non-maximum suppression (NMS) are used to screen the candidates, filtering out most boxes that are not faces.
Specifically, the accuracy of a candidate box is measured with a localization accuracy criterion: the overlap (IoU) of two candidate boxes is defined as the ratio of the area of their intersection to the area of their union.
In this embodiment, non-maximum suppression (NMS) is used to screen the candidate boxes. NMS essentially suppresses elements that are not local maxima; each local maximum represents a neighborhood, whose two parameters (its dimensionality and its size) are variable. Concretely, the candidate boxes are sorted by result confidence and the box A with the highest confidence is selected; any remaining box whose overlap with A exceeds a threshold is deleted while A is kept, and the procedure repeats on the remaining boxes, finally leaving candidate boxes with little mutual overlap and high confidence.
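The overlap criterion and the greedy NMS procedure described above can be sketched as follows; boxes are assumed to be (x1, y1, x2, y2) tuples, and the 0.5 overlap threshold is a typical choice rather than a value from the patent:

```python
def iou(a, b):
    """Overlap ratio: intersection area over union area of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)            # highest-confidence remaining box
        keep.append(best)
        order = [i for i in order      # drop boxes overlapping it too much
                 if iou(boxes[best], boxes[i]) <= threshold]
    return keep
```

For example, of two heavily overlapping boxes with scores 0.9 and 0.8 plus one disjoint box, NMS keeps the 0.9 box and the disjoint box and suppresses the 0.8 box.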
Before the detection stage, the MTCNN face detection method in this embodiment requires training the neural network to learn 3 tasks: face/non-face classification, bounding-box regression, and facial feature point localization (facial landmark localization):
1) The face/non-face classifier is designed according to the cross-entropy loss for face classification:

$L_i^{det} = -\left( y_i^{det} \log(p_i) + (1 - y_i^{det}) \log(1 - p_i) \right)$

where $p_i$ is the probability, predicted by the network, that sample $i$ is a face, and $y_i^{det} \in \{0, 1\}$ is the ground-truth label.
2) Bounding-box regression uses a regression loss computed as a Euclidean distance:

$L_i^{box} = \left\| \hat{y}_i^{box} - y_i^{box} \right\|_2^2$

where $\hat{y}_i^{box}$ are the coordinates predicted by the network and $y_i^{box}$ are the real ground-truth box coordinates; each is a quadruple (upper-left x, upper-left y, height, width).
3) Facial landmark localization, like the bounding-box regression, minimizes the Euclidean distance between the landmark positions predicted by the network and the actual real landmarks:

$L_i^{landmark} = \left\| \hat{y}_i^{landmark} - y_i^{landmark} \right\|_2^2$

where $\hat{y}_i^{landmark}$ are the network's predictions and $y_i^{landmark}$ the actual real landmark coordinates; since there are 5 points in total, each represented by an x and a y value, both $\hat{y}_i^{landmark}$ and $y_i^{landmark}$ are ten-tuples.
In this embodiment, the MTCNN training process minimizes the overall multi-task objective:

$\min \sum_{i=1}^{N} \sum_{j \in \{det,\, box,\, landmark\}} \alpha_j \, \beta_i^j \, L_i^j$

with P-Net and R-Net using $(\alpha_{det} = 1,\ \alpha_{box} = 0.5,\ \alpha_{landmark} = 0.5)$ and O-Net using $(\alpha_{det} = 1,\ \alpha_{box} = 0.5,\ \alpha_{landmark} = 1)$, where $N$ is the number of training samples, $\alpha_j$ indicates the importance of task $j$, $\beta_i^j \in \{0, 1\}$ is the sample-type label, and $L_i^j$ is the loss function defined above.
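The multi-task objective above can be sketched in code as follows; the per-sample dictionary layout and field names are assumptions made for the example, with the sample-type label beta encoded as the beta_box / beta_lmk indicators:

```python
import math

def face_cls_loss(p_face, y):
    """Cross-entropy for the face/non-face task; y is the 0/1 label."""
    eps = 1e-12                       # numerical guard for log(0)
    return -(y * math.log(p_face + eps) + (1 - y) * math.log(1 - p_face + eps))

def sq_euclidean(pred, target):
    """Squared Euclidean distance, used for both box and landmark regression."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def total_loss(samples, alphas):
    """Weighted multi-task objective averaged over the training samples."""
    total = 0.0
    for s in samples:
        total += alphas["det"] * face_cls_loss(s["p_face"], s["y"])
        total += s["beta_box"] * alphas["box"] * sq_euclidean(s["box"], s["box_gt"])
        total += s["beta_lmk"] * alphas["landmark"] * sq_euclidean(s["lmk"], s["lmk_gt"])
    return total / len(samples)
```

A background sample would set both indicators to 0 so that only the classification term contributes, matching the role of the sample-type label in the formula.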
In this embodiment, the per-stage effect of the face detection algorithm is shown in fig. 3. The input picture is preprocessed, i.e. resized to different scales to form an image pyramid; the preprocessed images are fed into P-Net, which outputs candidate face bounding boxes after non-maximum suppression and bounding-box regression; only the patches framed by those candidate boxes are fed into R-Net, where non-maximum suppression and bounding-box regression again yield fewer but progressively more accurate face candidates; finally, O-Net takes the small number of remaining face candidate patches and outputs the face bounding box and the facial feature point results.
Referring to fig. 4, the invention further provides another technical solution: a multi-modal-interaction-based cooperation system for distributed IoT devices, comprising the distributed IoT devices, an interaction central control, and a voice cloud server. Each distributed IoT device is equipped with a microphone-array audio acquisition module and a camera image acquisition module; the audio module collects voice signals in real time and performs signal processing and voice wake-up. After voice wake-up, the sub-device starts the camera module to capture pictures in real time and run face detection, and when a face is judged to be present, the multi-modal voice-wake-up and face-detection data are transmitted to the interaction central control over the communication link; each device also carries a voice reply and broadcast module. The interaction central control comprises a voice wake-up arbitration module, a voice agent service module and a network communication module. Based on the content uploaded by each distributed IoT device, it determines the sub-device that should respond to the wake-up, lets that sub-device continue listening for the user's voice commands while clearing the wake-up information of the other distributed sub-devices, requests speech recognition and semantic understanding from the voice cloud server over network communication in real time, and after real-time voice processing sends the corresponding control commands and voice reply content to the responding IoT sub-device. The voice cloud server comprises a speech recognition module, a semantic understanding module, a dialogue management module, a speech synthesis module and a network communication module, and returns the response result to the interaction central control through the network communication module.
In summary, the invention provides a distributed Internet of things device cooperation method and system based on multi-modal interaction, aimed at the technical problem of poor interaction experience on distributed intelligent voice devices. Each sub-device of the distributed Internet of things collects voice signals in real time through a microphone or microphone array and makes a voice wake-up judgment; on each device hit by voice wake-up, a camera is started to collect face images in real time for face detection. When a voice-awakened sub-device also detects the presence of a face, it sends the result to the interaction central control through network communication or broadcast. The interaction central control arbitrates and coordinates according to the multi-modal voice wake-up and face detection results reported by each sub-device, determines the device that actually responds to the user's wake-up and continues to monitor voice commands, and at the same time clears the wake-up information of the other distributed Internet of things sub-devices. The user's voice commands are sent in real time over the communication link to the voice cloud server for voice recognition, semantic understanding and other real-time voice processing, after which the corresponding control commands and voice reply content are issued to the Internet of things sub-device that responds to the wake-up.
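The central arbitration described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the report field names and the scoring rule (here the product of the voice wake-up and face detection confidences) are assumptions, since the text does not fix an exact combination formula.

```python
from dataclasses import dataclass

@dataclass
class WakeReport:
    """Multi-modal report a sub-device uploads after a wake-up hit.

    Field names are illustrative; the patent only requires that the report
    carry the voice wake-up and face detection confidences.
    """
    device_id: str
    wake_score: float       # voice wake-up confidence, 0..1
    face_confidence: float  # face detection confidence, 0..1

def arbitrate(reports):
    """Pick the single sub-device that should answer the user.

    Devices that saw no face never report (step S3), so every report here
    already passed the face-presence check.  The combination rule (product
    of the two confidences) is an assumed example.
    """
    if not reports:
        return None, []
    winner = max(reports, key=lambda r: r.wake_score * r.face_confidence)
    losers = [r.device_id for r in reports if r.device_id != winner.device_id]
    return winner.device_id, losers  # losers get their wake-up state cleared
```

In this sketch the winning device keeps listening for voice commands while the returned `losers` list would be used to clear the wake-up information of the remaining sub-devices.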
The above description is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any change or substitution that a person skilled in the art can easily conceive within the technical scope disclosed by the present invention, based on its technical solutions and inventive concept, shall be covered within the protection scope of the present invention.
Claims (5)
1. A multi-modal interaction based distributed Internet of things device cooperation method is characterized by comprising the following steps:
S1: each sub-device of the distributed Internet of things locally collects the user's voice in real time and performs a voice wake-up judgment;
S2: each sub-device hit by voice wake-up starts a camera to capture a picture of the current scene, performs real-time face detection, and computes the face detection result and its confidence;
S3: when a face exists in the current scene of a sub-device, the voice wake-up result and the face detection result on the sub-device are immediately transmitted to the interaction central control, the results including but not limited to the confidences of voice wake-up and face detection; if no face exists in the current scene, the sub-device clears its voice wake-up result and does not report to the interaction central control;
S4: according to the received voice wake-up and face detection results of each sub-device, the interaction central control determines the sub-device with the highest voice wake-up score and face detection confidence as the sub-device that responds to the user's wake-up, notifies that sub-device to give a response prompt and continuously pick up the user's voice commands, continuously initiates voice processing requests to the voice cloud server with that sub-device's voice commands, and at the same time clears the wake-up information of the other distributed sub-devices;
S5: the voice cloud server executes voice recognition, semantic understanding, dialogue management and voice synthesis in real time to process the user's voice command and returns the response result.
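The device-side gating of steps S1-S3 can be sketched as follows. This is an illustrative sketch only: the threshold value, the report shape, and the `detect_face` callable (standing in for the camera capture plus face detector of step S2) are assumptions introduced for the example.

```python
def device_side_flow(wake_score, wake_threshold, detect_face):
    """Per-device gating for steps S1-S3 (illustrative sketch).

    `detect_face` stands in for the camera capture plus the face detector
    of step S2 and returns (face_found, confidence).  The threshold value
    and the report dictionary shape are assumptions.
    """
    if wake_score < wake_threshold:        # S1: wake-up word not hit, stay idle
        return None
    face_found, face_conf = detect_face()  # S2: grab a frame, run face detection
    if not face_found:                     # S3: no face -> clear wake state, report nothing
        return None
    # S3: face present -> report both confidences to the interaction central control
    return {"wake_score": wake_score, "face_confidence": face_conf}
```

Only sub-devices that both pass the wake-up threshold and see a face produce a report, which is what lets the central control's arbitration in S4 consider a small candidate set.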
2. The multi-modal interaction based distributed Internet of things device cooperation method as claimed in claim 1, wherein in S1 the distributed Internet of things devices comprise a plurality of smart terminals, each smart terminal having its own microphone array, including but not limited to a linear 2-microphone, linear 4-microphone, linear 6-microphone, circular 4-microphone or irregular microphone array.
3. The multi-modal interaction based distributed Internet of things device cooperation method as claimed in claim 1, wherein in S2 the face detection method comprises two steps: a picture pre-processing operation and a face detection algorithm based on MTCNN; structurally, MTCNN is composed of 3 lightweight CNNs, namely P-Net, R-Net and O-Net, and the pre-processed input picture is processed successively by these 3 networks, which finally output the face detection and keypoint detection results.
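The coarse-to-fine cascade named in the claim can be sketched as a simple pipeline. The three stage functions below are placeholders for the trained lightweight CNNs (P-Net, R-Net, O-Net); their exact signatures are an assumption made for illustration, not the networks' real interfaces.

```python
def mtcnn_cascade(image, p_net, r_net, o_net):
    """Coarse-to-fine MTCNN pipeline: each stage keeps only the candidate
    face boxes that survive the previous one.

    `p_net`, `r_net` and `o_net` are placeholder callables standing in for
    the three trained lightweight CNNs; signatures are assumptions.
    """
    candidates = p_net(image)                 # P-Net: propose candidate face windows
    refined = r_net(image, candidates)        # R-Net: reject false candidates, refine boxes
    faces, keypoints = o_net(image, refined)  # O-Net: final boxes plus facial keypoints
    return faces, keypoints
```

The design point is that the cheap first stage handles the full image while the heavier later stages only see the surviving candidates, which keeps the whole cascade fast enough for the real-time detection required in S2.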
4. The method as claimed in claim 3, wherein in S2 the real-time face detection function adopts a face detection algorithm based on a multi-task cascaded convolutional neural network (MTCNN).
5. The multi-modal interaction based distributed Internet of things device cooperation system as claimed in claim 1, comprising distributed Internet of things devices, an interaction central control and a voice cloud server, wherein each distributed Internet of things device is provided with a microphone array audio acquisition module and a camera image acquisition module; the microphone array audio acquisition module acquires voice signals in real time and performs signal processing and voice wake-up processing; after voice wake-up, the distributed Internet of things sub-device starts the camera image acquisition module to acquire pictures in real time and performs face detection; when a face is judged to exist, the multi-modal data of voice wake-up and face detection are transmitted to the interaction central control through the communication connection; each distributed Internet of things device is further provided with a voice reply and broadcast module; the interaction central control comprises a voice wake-up arbitration module, a voice agent service module and a network communication module; according to the content uploaded by each distributed Internet of things device, it determines through these modules the distributed Internet of things sub-device that needs to wake up and respond, makes that sub-device continue monitoring the user's voice commands while clearing the wake-up information of the other distributed sub-devices, requests voice recognition and semantic understanding from the voice cloud server through the network communication in real time, and after the real-time voice processing issues the corresponding control commands and voice reply content to the Internet of things sub-device that responds to the wake-up; the voice cloud server comprises a voice recognition module, a semantic understanding module, a dialogue management module, a voice synthesis module and a network communication module, and returns the response result to the interaction central control through the network communication module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910988977.1A CN110718227A (en) | 2019-10-17 | 2019-10-17 | Multi-mode interaction based distributed Internet of things equipment cooperation method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110718227A (en) | 2020-01-21 |
Family
ID=69211832
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910988977.1A Pending CN110718227A (en) | 2019-10-17 | 2019-10-17 | Multi-mode interaction based distributed Internet of things equipment cooperation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110718227A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170330563A1 (en) * | 2016-05-13 | 2017-11-16 | Bose Corporation | Processing Speech from Distributed Microphones |
CN107622652A (en) * | 2016-07-15 | 2018-01-23 | 青岛海尔智能技术研发有限公司 | The sound control method and appliance control system of appliance system |
CN108564052A (en) * | 2018-04-24 | 2018-09-21 | 南京邮电大学 | Multi-cam dynamic human face recognition system based on MTCNN and method |
CN110136714A (en) * | 2019-05-14 | 2019-08-16 | 北京探境科技有限公司 | Natural interaction sound control method and device |
CN110288997A (en) * | 2019-07-22 | 2019-09-27 | 苏州思必驰信息科技有限公司 | Equipment awakening method and system for acoustics networking |
CN110322878A (en) * | 2019-07-01 | 2019-10-11 | 华为技术有限公司 | A kind of sound control method, electronic equipment and system |
Non-Patent Citations (4)
Title |
---|
YANG WANG ET AL.: "Research on Face Detection Method Based on Improved MTCNN Network", ICDIP 2019 * |
FENG HUIFANG: "Research on Key Technologies of Facial Landmark Localization Based on Checkpoint Surveillance Video", China Master's Theses Full-text Database, Information Science and Technology * |
KONG DEZHUANG ET AL.: "Research on the Application and Methods of Facial Expression Recognition in Assistive Healthcare", Life Science Instruments * |
ZHANG HENG ET AL.: "Facial Landmark Detection Based on Cascaded Convolutional Networks", Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition) * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113470634A (en) * | 2020-04-28 | 2021-10-01 | 海信集团有限公司 | Control method of voice interaction equipment, server and voice interaction equipment |
CN113470634B (en) * | 2020-04-28 | 2024-05-17 | 海信集团有限公司 | Voice interaction equipment control method, server and voice interaction equipment |
CN112634885A (en) * | 2020-05-18 | 2021-04-09 | 北京如影智能科技有限公司 | Voice wake-up method and device for cross-local area network |
CN111931551A (en) * | 2020-05-26 | 2020-11-13 | 东南大学 | Face detection method based on lightweight cascade network |
CN111931551B (en) * | 2020-05-26 | 2022-04-12 | 东南大学 | Face detection method based on lightweight cascade network |
CN114287151A (en) * | 2020-07-28 | 2022-04-05 | 北京小米移动软件有限公司 | Wireless communication method, terminal, base station, communication device and storage medium |
CN114287151B (en) * | 2020-07-28 | 2024-04-05 | 北京小米移动软件有限公司 | Wireless communication method, terminal, base station, communication device and storage medium |
CN112634872A (en) * | 2020-12-21 | 2021-04-09 | 北京声智科技有限公司 | Voice equipment awakening method and device |
CN112908325A (en) * | 2021-01-29 | 2021-06-04 | 中国平安人寿保险股份有限公司 | Voice interaction method and device, electronic equipment and storage medium |
CN112908325B (en) * | 2021-01-29 | 2022-10-28 | 中国平安人寿保险股份有限公司 | Voice interaction method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110718227A (en) | Multi-mode interaction based distributed Internet of things equipment cooperation method and system | |
CN110728255B (en) | Image processing method, image processing device, electronic equipment and storage medium | |
Steffens et al. | Personspotter-fast and robust system for human detection, tracking and recognition | |
CN110135249B (en) | Human behavior identification method based on time attention mechanism and LSTM (least Square TM) | |
CN113284168A (en) | Target tracking method and device, electronic equipment and storage medium | |
WO2021213158A1 (en) | Real-time face summarization service method and system for intelligent video conference terminal | |
EP3647992A1 (en) | Face image processing method and apparatus, storage medium, and electronic device | |
JP2006011978A (en) | Image processing method and image processor | |
CN108960076B (en) | Ear recognition and tracking method based on convolutional neural network | |
CN108986137B (en) | Human body tracking method, device and equipment | |
CN111680550B (en) | Emotion information identification method and device, storage medium and computer equipment | |
CN111008994A (en) | Moving target real-time detection and tracking system and method based on MPSoC | |
CN111401322A (en) | Station entering and exiting identification method and device, terminal and storage medium | |
CN113850136A (en) | Yolov5 and BCNN-based vehicle orientation identification method and system | |
CN110188179B (en) | Voice directional recognition interaction method, device, equipment and medium | |
CN112700568B (en) | Identity authentication method, equipment and computer readable storage medium | |
Afroze et al. | An empirical framework for detecting speaking modes using ensemble classifier | |
CN112766065A (en) | Mobile terminal examinee identity authentication method, device, terminal and storage medium | |
CN113052136A (en) | Pedestrian detection method based on improved Faster RCNN | |
CN117813581A (en) | Multi-angle hand tracking | |
CN114283461A (en) | Image processing method, apparatus, device, storage medium, and computer program product | |
WO2020237674A1 (en) | Target tracking method and apparatus, and unmanned aerial vehicle | |
CN113276113A (en) | Sight line positioning and voice control system and method for space manipulator on-orbit operation | |
CN117553808B (en) | Deep learning-based robot positioning navigation method, device, equipment and medium | |
CN113903083B (en) | Behavior recognition method and apparatus, electronic device, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200121 |