CN110718227A - Multi-mode interaction based distributed Internet of things equipment cooperation method and system - Google Patents

Multi-mode interaction based distributed Internet of things equipment cooperation method and system

Info

Publication number
CN110718227A
Authority
CN
China
Prior art keywords: voice, sub, things, equipment, distributed internet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910988977.1A
Other languages
Chinese (zh)
Inventor
郑敏 (Zheng Min)
郑炜乔 (Zheng Weiqiao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huachuang Technology Co Ltd
Original Assignee
Shenzhen Huachuang Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huachuang Technology Co Ltd
Priority to CN201910988977.1A
Publication of CN110718227A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/24: Speech recognition using non-acoustical features
    • G10L15/25: Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G10L15/28: Constructional details of speech recognition systems
    • G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L15/34: Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • G10L2015/223: Execution procedure of a spoken command
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/12: Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a distributed Internet of things equipment cooperation method and system based on multi-modal interaction. Each sub-device of the distributed Internet of things collects voice signals in real time through a microphone and performs voice wake-up judgment. Each sub-device hit by the wake-up word starts its camera, collects face images in real time, performs face detection, and sends the results to an interaction central control through network communication. The interaction central control arbitrates and coordinates according to the voice wake-up and face detection results reported by each sub-device, determines the device that should actually respond to the user's wake-up, makes it continue listening for voice commands, and simultaneously clears the wake-up information of the other sub-devices. The user's voice commands are processed in real time, and the corresponding control commands and voice reply content are sent to the responding Internet of things sub-device. By arbitrating and coordinating multi-modal results between the distributed Internet of things devices and the interaction central control, the invention improves the accuracy of collaborative interaction and response of distributed Internet of things equipment.

Description

Multi-mode interaction based distributed Internet of things equipment cooperation method and system
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a distributed Internet of things equipment cooperation method and system based on multi-modal interaction.
Background
With the continuous development of artificial intelligence technology, the accuracy of speech recognition and face detection keeps improving, and many intelligent voice devices have entered daily life. The built-in microphone or microphone array of an intelligent voice device allows a user to interact with it at close range or across a certain far-field distance, but beyond that range the accuracy of voice interaction drops or interaction fails entirely. A typical home now contains several voice-interactive devices placed in a distributed manner, for example an intelligent speaker in the living room and an intelligent desk lamp in the bedroom. With the rapid development of the Internet of things, interconnecting multiple intelligent voice devices is an inevitable technical trend and a practical need of the smart home, and this scenario requires a method for collaborative interaction among distributed Internet of things devices. In the prior art, distributed Internet of things devices use the same wake-up word, so after the user speaks it, all devices respond; it is impossible to judge which device should answer the user's request, which severely degrades the user experience.
Disclosure of Invention
The invention aims to provide a distributed Internet of things equipment cooperation method and system based on multi-modal interaction that reduce network delay, improve response speed, resolve the confusion caused by multiple devices waking up simultaneously, improve the response accuracy and stability of distributed Internet of things equipment through multi-modal interaction, effectively solve the problem of interconnection and cooperative work among multiple distributed voice devices in a home scenario, and improve the user experience in the Internet of things environment, so as to solve the problems raised in the background art.
In order to achieve the purpose, the invention provides the following technical scheme:
A distributed Internet of things equipment cooperation method based on multi-modal interaction comprises the following steps:
S1: each sub-device of the distributed Internet of things locally collects the user's voice in real time and performs voice wake-up judgment;
S2: each sub-device whose wake-up is triggered by the voice starts its camera to capture a picture of the current scene, performs real-time face detection, and computes the face detection result and its confidence;
S3: when a face exists in the current scene of a sub-device, the voice wake-up result and face detection result of that sub-device, including but not limited to the confidences of voice wake-up and face detection, are immediately transmitted to the interaction central control; if no face exists in the current scene, the device clears its voice wake-up result and does not report to the interaction central control;
S4: according to the received voice wake-up and face detection results of each sub-device, the interaction central control determines the sub-device with the maximum voice wake-up score and face detection confidence result as the sub-device that responds to the user's wake-up, notifies that sub-device to give a response prompt and continue picking up the user's voice commands, continuously forwards that sub-device's voice commands to the voice cloud server as voice processing requests, and simultaneously clears the wake-up information of the other distributed sub-devices;
S5: the voice cloud server executes speech recognition, semantic understanding, dialogue management, and speech synthesis in real time to process the user's voice commands and returns the response result.
Further, in S1, the distributed Internet of things equipment denotes a plurality of intelligent terminals, each provided with its own microphone array, including but not limited to a linear 2-microphone, linear 4-microphone, linear 6-microphone, ring 4-microphone, or irregular microphone array.
Furthermore, in S2, the face detection method includes two steps: image preprocessing and an MTCNN-based face detection algorithm. In its network structure, MTCNN consists of three lightweight CNNs, namely P-Net, R-Net, and O-Net; the input preprocessed image passes through the three networks in sequence, and the results of face detection and key point detection are finally output.
Further, in S2, the real-time face detection function adopts a face detection algorithm based on the multi-task cascaded convolutional neural network (MTCNN).
The invention provides another technical scheme: a distributed Internet of things equipment cooperation system based on multi-modal interaction, comprising distributed Internet of things equipment, an interaction central control, and a voice cloud server. The distributed Internet of things equipment is provided with a microphone array audio acquisition module and a camera image acquisition module; the microphone array audio acquisition module collects voice signals in real time and performs signal processing and voice wake-up processing; after voice wake-up, the distributed Internet of things sub-device starts the camera image acquisition module to capture pictures in real time and perform face detection, and when a face is judged to exist, the multi-modal data of voice wake-up and face detection are transmitted to the interaction central control through a communication connection; the distributed Internet of things equipment is further provided with a voice reply and broadcast module. The interaction central control comprises a voice wake-up arbitration module, a voice agent service module, and a network communication module; according to the content uploaded by each distributed Internet of things device, it determines through these modules the distributed sub-device that should be woken up and respond, makes that sub-device continue listening for the user's voice commands while clearing the wake-up information of the other distributed sub-devices, requests speech recognition and semantic understanding from the voice cloud server through network communication in real time, and after real-time voice processing sends the corresponding control commands and voice reply content to the responding Internet of things sub-device. The voice cloud server comprises a speech recognition module, a semantic understanding module, a dialogue management module, a speech synthesis module, and a network communication module, and returns the response result to the interaction central control through the network communication module.
Compared with the prior art, the invention has the beneficial effects that:
1. according to the distributed Internet of things equipment cooperation method and system based on multi-mode interaction, interaction central control is connected with all distributed Internet of things equipment through a local area network, arbitration decision is made according to received awakening information and face detection results, equipment needing to be awakened and responded is quickly determined and informed, network delay is reduced, response speed is improved, and meanwhile messy results of synchronous awakening of multiple equipment are solved.
2. The distributed Internet of things equipment cooperation method and system improve the response accuracy and stability of the distributed Internet of things equipment through multi-modal interaction, effectively solve the problem of interconnection and cooperation among multiple distributed voice devices in a home scenario, and improve the user experience in the Internet of things environment.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a flow chart of a face detection method of the present invention;
FIG. 3 is a diagram illustrating the effect of the face detection method of the present invention;
fig. 4 is a block diagram of the system of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, in the embodiment of the present invention: the distributed Internet of things equipment cooperation method based on multi-modal interaction comprises the following steps:
Step 1: each sub-device of the distributed Internet of things locally collects the user's voice in real time and performs voice wake-up judgment. In this step, the distributed Internet of things equipment denotes a plurality of intelligent terminals, each provided with a microphone array, including but not limited to a linear 2-microphone, linear 4-microphone, linear 6-microphone, ring 4-microphone, or other irregular microphone array. After the user utters a wake-up voice signal, the distributed Internet of things devices receive it. If multiple sub-devices all respond to the user after executing the voice wake-up command, the user experience and voice interaction quality suffer greatly. The user's voice wake-up therefore needs to be adjudicated: the wake-up behavior of the multiple sub-devices is arbitrated, the Internet of things sub-device or intelligent terminal that should really respond to the user's wake-up, i.e. the woken device most suitable for interacting with the user, is determined, and the wake-up responses of the other sub-devices are cleared.
Step 2: each sub-device whose wake-up is triggered by the voice starts its camera to capture a picture of the current scene, performs real-time face detection, and computes the face detection result and its confidence.
Step 3: when a face exists in the current scene of a sub-device, the voice wake-up result and face detection result of that sub-device, including but not limited to the confidences of voice wake-up and face detection, are immediately transmitted to the interaction central control; if no face exists in the current scene, the device clears its voice wake-up result and does not report to the interaction central control.
Step 4: according to the received voice wake-up and face detection results of each sub-device, the interaction central control determines the sub-device with the maximum voice wake-up score and face detection confidence result as the sub-device that responds to the user's wake-up, notifies that sub-device to give a response prompt and continue picking up the user's voice commands, continuously forwards that sub-device's voice commands to the voice cloud server as voice processing requests, and simultaneously clears the wake-up information of the other distributed sub-devices.
Step 5: the voice cloud server executes speech recognition, semantic understanding, dialogue management, and speech synthesis in real time to process the user's voice commands and returns the response result.
According to this embodiment, in the process of determining the sub-device that responds to the user's wake-up, face detection is performed on each sub-device triggered by the voice wake-up. Whether a user is present in the current scene is determined from the face detection result, which indicates whether the user is indeed waking up or issuing a voice command to the distributed Internet of things sub-device. The user's presence is thus confirmed through the multi-modal information of voice and vision, improving the accuracy and reliability of the cooperative response of the distributed Internet of things devices.
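The arbitration in steps 3 and 4 can be pictured with a minimal sketch. The score-combination rule below (summing the wake-up confidence and the face confidence) and all field names are illustrative assumptions; the embodiment only specifies that the sub-device with the maximum voice wake-up score and face detection confidence result is selected.

```python
# Minimal sketch of the central-control arbitration in steps 3 and 4.
# Field names and the additive combination rule are assumptions.

def arbitrate(reports):
    """Pick the single sub-device that should answer the user.

    reports: list of dicts such as
        {"device_id": "lamp-bedroom", "wake_score": 0.91, "face_conf": 0.88},
    sent only by sub-devices that both heard the wake word and saw a face.
    Returns (winner_id, loser_ids).
    """
    if not reports:
        return None, []
    winner = max(reports, key=lambda r: r["wake_score"] + r["face_conf"])
    losers = [r["device_id"] for r in reports
              if r["device_id"] != winner["device_id"]]
    return winner["device_id"], losers

winner, losers = arbitrate([
    {"device_id": "speaker-livingroom", "wake_score": 0.95, "face_conf": 0.90},
    {"device_id": "lamp-bedroom", "wake_score": 0.80, "face_conf": 0.60},
])
# The winner gives the response prompt and keeps listening; the central
# control clears the wake-up state of every device in `losers`.
```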
In the above embodiment, the real-time face detection function involved in step 2 adopts a face detection algorithm based on the multi-task cascaded convolutional neural network (MTCNN), a coarse-to-fine method that is fast, effective, and capable of real-time processing on an intelligent terminal, overcoming the drawbacks of traditional face detection algorithms: strict requirements on the environment and on the face, and long detection times.
In the above embodiment, the face detection method includes two steps: image preprocessing and the MTCNN-based face detection algorithm. In its network structure, MTCNN consists of three lightweight CNNs, namely P-Net, R-Net, and O-Net; the input preprocessed image passes through the three networks in sequence, and the results of face detection and key point detection are finally output.
To better explain the above invention, the flow of the face detection algorithm referred to herein is described with reference to fig. 2; it specifically includes the following steps:
Step 21: first, the picture collected by the camera is input, and through the preprocessing operation of picture size conversion the input picture is scaled to different sizes to form an image pyramid, so as to satisfy scale invariance;
Step 22: the preprocessed image pyramid is input into the MTCNN network, where it is processed coarse-to-fine by the three sub-networks in turn. The specific process is as follows:
Step 221: the preprocessed image pyramid is input into P-Net, which outputs a face classification result, image candidate windows, and a face landmark localization result. P-Net is a fully convolutional network used to generate candidate boxes and bounding-box regression vectors; the candidate windows are corrected with the bounding-box regression method, and overlapping candidate boxes are merged by non-maximum suppression (a sketch of P-Net appears after step 23 below).
Specifically, the input is a 12 × 12 picture; the generated training data (produced by generating boxes and cropping them into 12 × 12 pictures) must be converted into a 12 × 12 × 3 structure before training. Ten 3 × 3 convolution kernels followed by a 3 × 3 max-pooling operation generate 10 feature maps of 5 × 5; the second layer's sixteen 3 × 3 × 10 convolution kernels generate 16 feature maps of 3 × 3; the third layer's thirty-two 3 × 3 × 16 convolution kernels then generate 32 feature maps of 1 × 1. Finally, from the 32 feature maps of 1 × 1, the P-Net feed-forward outputs three vectors: two 1 × 1 × 32 convolution kernels generate 2 feature maps of 1 × 1 for the face two-class classification result, i.e. the probability of a face; four 1 × 1 × 32 convolution kernels generate 4 feature maps of 1 × 1 for the bounding-box regression judgment; and ten 1 × 1 × 32 convolution kernels generate 10 feature maps of 1 × 1 for face landmark localization, i.e. the face contour point information.
Step 222: the image candidate windows determined by the P-Net output are input to R-Net for further classification, which amounts to a refinement pass.
Specifically, according to the coordinates output by P-Net, a picture is cropped from the original image (using a square crop along the longest side, which avoids deformation and retains more detail), its size is converted again to 24 × 24, and the resulting 24 × 24 × 3 structure is input to R-Net. Twenty-eight 3 × 3 convolution kernels followed by a 3 × 3 max-pooling operation generate 28 feature maps of 11 × 11; the second layer's forty-eight 3 × 3 × 28 convolution kernels with a 3 × 3 max-pooling operation generate 48 feature maps of 4 × 4; the third layer's sixty-four 2 × 2 × 48 convolution kernels then generate 64 feature maps of 3 × 3. Finally, a fully connected layer of 128 neurons outputs three results: a 2-dimensional result for the face two-class classification, i.e. the probability of a face; 4 coordinate offsets for the candidate bounding-box regression judgment; and 10 face contour points for face landmark localization.
Step 223: the pictures obtained by cropping the candidate windows output by R-Net in the previous step from the original picture are input to O-Net, which determines the final positions of the face box and the feature points and outputs the face/non-face judgment, the face box location, and the positions of 5 facial feature points.
Specifically, a picture is cropped from the original image according to the information output by R-Net (processed the same way as the R-Net input data: a square crop along the longest side to avoid deformation and retain more detail), its size is converted again to 48 × 48, and the resulting 48 × 48 × 3 structure is input to O-Net. Thirty-two 3 × 3 convolution kernels followed by a 3 × 3 max-pooling operation generate 32 feature maps of 23 × 23; the second layer's sixty-four 3 × 3 × 32 convolution kernels with a 3 × 3 max-pooling operation generate 64 feature maps of 10 × 10; the third layer's sixty-four 3 × 3 × 64 convolution kernels with a 2 × 2 max-pooling operation generate 64 feature maps of 4 × 4; the fourth layer's one hundred twenty-eight 2 × 2 × 64 convolution kernels then generate 128 feature maps of 3 × 3. Finally, a fully connected layer of 256 neurons outputs three results: a 2-dimensional result for the face two-class classification, i.e. the probability of a face; 4 coordinate offsets for the candidate bounding-box regression judgment; and 10 face contour points for face landmark localization.
Step 23: the face detection result and the 5 face key points are determined from the output of the MTCNN network in step 22; the 5 feature points are aligned to specific positions of the picture on the original image through an affine transformation, and the face position is framed to display the detection result.
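For concreteness, the following is a sketch of P-Net in PyTorch, sized to the dimensions recited in step 221 (12 × 12 × 3 input, three output heads). It is an illustrative reconstruction, not the applicant's implementation; in particular, the pooling stride of 2 with ceil mode is an assumption chosen so that the 3 × 3 max-pooling stated above maps the 10 × 10 feature maps to 5 × 5.

```python
import torch
import torch.nn as nn

class PNet(nn.Module):
    """Illustrative P-Net matching the layer shapes described in step 221."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 10, 3), nn.PReLU(10),          # 12x12x3 -> 10 maps, 10x10
            nn.MaxPool2d(3, stride=2, ceil_mode=True),  # -> 10 maps, 5x5
            nn.Conv2d(10, 16, 3), nn.PReLU(16),         # -> 16 maps, 3x3
            nn.Conv2d(16, 32, 3), nn.PReLU(32),         # -> 32 maps, 1x1
        )
        self.cls = nn.Conv2d(32, 2, 1)        # face / non-face scores
        self.box = nn.Conv2d(32, 4, 1)        # bounding-box regression vector
        self.landmark = nn.Conv2d(32, 10, 1)  # 5 landmarks, (x, y) each

    def forward(self, x):
        f = self.backbone(x)
        return self.cls(f), self.box(f), self.landmark(f)

# Being fully convolutional, P-Net accepts any image from the pyramid, not
# only 12x12 crops; each 1x1 cell of the output scores one 12x12 window.
probs, boxes, landmarks = PNet()(torch.randn(1, 3, 12, 12))
print(probs.shape, boxes.shape, landmarks.shape)  # [1,2,1,1] [1,4,1,1] [1,10,1,1]
```

R-Net and O-Net follow the same pattern, with the larger inputs and the fully connected layers described in steps 222 and 223.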
In this embodiment, the outputs of all three sub-networks of the MTCNN process the candidate face boxes: according to the probability scores of the face classification results, an overlap evaluation metric (IoU) and non-maximum suppression (NMS) are used to screen the candidate boxes, filtering out most candidate boxes that are not faces.
Specifically, the accuracy of a candidate box is measured with a localization accuracy metric: the overlap (IoU) of two candidate boxes is defined as the ratio of the area of their intersection to the area of their union.
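A minimal sketch of this overlap computation, assuming boxes are given as (x1, y1, x2, y2) corner tuples:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # intersection area
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```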
In this embodiment, a non-maximum suppression (NMS) method screens the candidate boxes. NMS essentially suppresses elements that are not local maxima while searching for the local maxima, where a local maximum is taken over a neighborhood with two variable parameters, its dimension and its size. Specifically, the candidate boxes are sorted by result confidence and the candidate box A with the highest confidence is selected; any remaining box whose overlap with A exceeds a threshold is deleted while A is kept, and so on, finally screening out the candidate boxes with small mutual overlap and high confidence.
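The screening just described, as a sketch reusing the iou helper above; the 0.5 threshold is an illustrative value, as the embodiment does not fix one:

```python
def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: repeatedly keep the highest-confidence
    box and delete the remaining boxes whose IoU with it exceeds `thresh`."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep  # indices of the surviving candidate boxes
```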
In the MTCNN face detection method of this embodiment, the neural network must be trained before the detection stage, learning three tasks: face/non-face classification, bounding-box regression, and facial feature point localization (also called face landmark localization):
1) The face/non-face classifier is designed according to the cross-entropy loss for face classification:

$$L_i^{det} = -\left( y_i^{det} \log(p_i) + (1 - y_i^{det}) \log(1 - p_i) \right)$$

where $p_i$ is the probability, predicted by the network, that sample $i$ is a face, and $y_i^{det} \in \{0, 1\}$ is the real background label.
2) Bounding-box regression computes a regression loss via the Euclidean distance:

$$L_i^{box} = \left\lVert \hat{y}_i^{box} - y_i^{box} \right\rVert_2^2$$

where $\hat{y}_i^{box}$ are the coordinates predicted by the network and $y_i^{box}$ are the real background coordinates of the image; each is a quadruple (upper-left coordinates x and y, length, width).
3) Face landmark localization, like the bounding-box regression, computes and minimizes the Euclidean distance between the landmark positions predicted by the network and the actual real landmarks:

$$L_i^{landmark} = \left\lVert \hat{y}_i^{landmark} - y_i^{landmark} \right\rVert_2^2$$

where $\hat{y}_i^{landmark}$ is the network prediction and $y_i^{landmark}$ are the actual real landmark coordinates; since there are 5 points in total, each represented by two coordinate values x and y, both $\hat{y}_i^{landmark}$ and $y_i^{landmark}$ are ten-tuples.
In this embodiment, the MTCNN training and learning process minimizes the function:

$$\min \sum_{i=1}^{N} \sum_{j \in \{det,\ box,\ landmark\}} \alpha_j \, \beta_i^{j} \, L_i^{j}$$

with P-Net and R-Net using $(\alpha_{det} = 1,\ \alpha_{box} = 0.5,\ \alpha_{landmark} = 0.5)$ and O-Net using $(\alpha_{det} = 1,\ \alpha_{box} = 0.5,\ \alpha_{landmark} = 1)$, where $N$ is the number of training samples, $\alpha_j$ indicates the importance of task $j$, $\beta_i^{j} \in \{0, 1\}$ is the sample-type label, and $L_i^{j}$ is the corresponding loss function.
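Written out as a sketch in PyTorch, with the per-sample task losses and the beta indicators assumed to be precomputed elsewhere (the defaults are the P-Net/R-Net weights; O-Net would pass a_lmk=1.0):

```python
import torch

def mtcnn_loss(l_det, l_box, l_lmk, b_det, b_box, b_lmk,
               a_det=1.0, a_box=0.5, a_lmk=0.5):
    """Weighted multi-task objective: sum_i sum_j alpha_j * beta_i^j * L_i^j.

    l_*: per-sample task losses, tensors of shape (N,); b_*: 0/1 indicators
    marking whether sample i carries a label for that task.
    """
    return (a_det * (b_det * l_det).sum()
            + a_box * (b_box * l_box).sum()
            + a_lmk * (b_lmk * l_lmk).sum())
```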
In this embodiment, the effect of each stage of the face detection algorithm is shown in fig. 3. The input picture undergoes picture preprocessing, i.e. its size is changed and different scales are output to form a picture pyramid. The preprocessed images are input into the P-Net network, which outputs candidate bounding boxes for faces via non-maximum suppression and box regression. Only the pictures framed by those candidate bounding boxes are input into the R-Net network, which, again using non-maximum suppression and box regression, yields fewer but progressively more accurate face candidate bounding boxes. Finally, only a small number of face candidate pictures are input into the O-Net network, which outputs the face bounding box and the facial feature point results.
Referring to fig. 4, the invention further provides another technical scheme: a distributed Internet of things equipment cooperation system based on multi-modal interaction, comprising distributed Internet of things equipment, an interaction central control, and a voice cloud server. The distributed Internet of things equipment is provided with a microphone array audio acquisition module and a camera image acquisition module; the microphone array audio acquisition module collects voice signals in real time and performs signal processing and voice wake-up processing; after voice wake-up, the distributed Internet of things sub-device starts the camera image acquisition module to capture pictures in real time and perform face detection, and when a face is judged to exist, the multi-modal data of voice wake-up and face detection are transmitted to the interaction central control through a communication connection; the distributed Internet of things equipment is further provided with a voice reply and broadcast module. The interaction central control comprises a voice wake-up arbitration module, a voice agent service module, and a network communication module; according to the content uploaded by each distributed Internet of things device, it determines through these modules the distributed sub-device that should be woken up and respond, makes that sub-device continue listening for the user's voice commands while clearing the wake-up information of the other distributed sub-devices, requests speech recognition and semantic understanding from the voice cloud server through network communication in real time, and after real-time voice processing sends the corresponding control commands and voice reply content to the responding Internet of things sub-device. The voice cloud server comprises a speech recognition module, a semantic understanding module, a dialogue management module, a speech synthesis module, and a network communication module, and returns the response result to the interaction central control through the network communication module.
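The multi-modal report that a sub-device uploads to the interaction central control can be pictured as a small structured message. The JSON field names and the transport implied below are illustrative assumptions; the embodiment specifies only the content (the device's voice wake-up result and confidence, and its face detection result and confidence), not a wire format.

```python
import json
import time

# Illustrative wake-up report from one sub-device to the central control.
# All field names are assumptions; only the payload content follows the text.
report = {
    "device_id": "speaker-livingroom",  # which IoT sub-device is reporting
    "timestamp": time.time(),
    "wake_word_score": 0.95,            # voice wake-up confidence
    "face_detected": True,              # a face exists in the current scene
    "face_confidence": 0.90,            # face detection confidence
}
payload = json.dumps(report).encode("utf-8")  # sent over the local network
# A sub-device that detects no face clears its wake-up state and sends nothing.
```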
In summary: the invention provides a distributed Internet of things equipment cooperation method and system based on multi-modal interaction to solve the technical problem of poor interaction experience with distributed intelligent voice devices. Each sub-device of the distributed Internet of things collects voice signals in real time through a microphone or microphone array and performs voice wake-up judgment. A voice-woken device starts its camera, collects face images in real time, and performs face detection; when a voice-woken sub-device also detects the presence of a face, the results are sent to the interaction central control through network communication or broadcast. The interaction central control arbitrates and coordinates according to the multi-modal voice wake-up and face detection results reported by each sub-device, determines the device that should really respond to the user's wake-up, makes it continue listening for voice commands, and simultaneously clears the wake-up information of the other distributed Internet of things sub-devices. The user's voice commands are sent in real time to the voice cloud server for speech recognition, semantic understanding, and other real-time voice processing, after which the corresponding control commands and voice reply content are issued to the responding Internet of things sub-device.
The above description covers only preferred embodiments of the present invention, and the protection scope of the present invention is not limited thereto; any equivalent solution within the technical scope disclosed by the present invention that a person skilled in the art can readily conceive according to the technical solutions and inventive concept of the present invention shall fall within the protection scope of the present invention.

Claims (5)

1. A multi-modal interaction based distributed Internet of things device cooperation method is characterized by comprising the following steps:
S1: each sub-device of the distributed Internet of things locally collects the user's voice in real time and performs voice wake-up judgment;
S2: each sub-device whose wake-up is triggered by the voice starts its camera to capture a picture of the current scene, performs real-time face detection, and computes the face detection result and its confidence;
S3: when a face exists in the current scene of a sub-device, the voice wake-up result and face detection result of that sub-device, including but not limited to the confidences of voice wake-up and face detection, are immediately transmitted to the interaction central control; if no face exists in the current scene, the device clears its voice wake-up result and does not report to the interaction central control;
S4: according to the received voice wake-up and face detection results of each sub-device, the interaction central control determines the sub-device with the maximum voice wake-up score and face detection confidence result as the sub-device that responds to the user's wake-up, notifies that sub-device to give a response prompt and continue picking up the user's voice commands, continuously forwards that sub-device's voice commands to the voice cloud server as voice processing requests, and simultaneously clears the wake-up information of the other distributed sub-devices;
S5: the voice cloud server executes speech recognition, semantic understanding, dialogue management, and speech synthesis in real time to process the user's voice commands and returns the response result.
2. The multi-modal interaction based distributed Internet of things device cooperation method as claimed in claim 1, wherein in S1 the distributed Internet of things equipment denotes a plurality of intelligent terminals, each provided with its own microphone array, including but not limited to a linear 2-microphone, linear 4-microphone, linear 6-microphone, ring 4-microphone, or irregular microphone array.
3. The multi-modal interaction based distributed Internet of things device cooperation method as claimed in claim 1, wherein in S2 the face detection method includes two steps: image preprocessing and an MTCNN-based face detection algorithm; in its network structure MTCNN consists of three lightweight CNNs, namely P-Net, R-Net, and O-Net, and the input preprocessed image is processed through the three networks in sequence to finally output the results of face detection and key point detection.
4. The method as claimed in claim 3, wherein in S2 the real-time face detection function adopts a face detection algorithm based on the multi-task cascaded convolutional neural network (MTCNN).
5. The multi-modal interaction based distributed Internet of things device cooperation system as claimed in claim 1, characterized by comprising distributed Internet of things equipment, an interaction central control, and a voice cloud server, wherein the distributed Internet of things equipment is provided with a microphone array audio acquisition module and a camera image acquisition module; the microphone array audio acquisition module collects voice signals in real time and performs signal processing and voice wake-up processing; after voice wake-up, the distributed Internet of things sub-device starts the camera image acquisition module to capture pictures in real time and perform face detection, and when a face is judged to exist, the multi-modal data of voice wake-up and face detection are transmitted to the interaction central control through a communication connection; the distributed Internet of things equipment is further provided with a voice reply and broadcast module; the interaction central control comprises a voice wake-up arbitration module, a voice agent service module, and a network communication module, and according to the content uploaded by each distributed Internet of things device it determines, through these modules, the distributed sub-device that should be woken up and respond, makes that sub-device continue listening for the user's voice commands while clearing the wake-up information of the other distributed sub-devices, requests speech recognition and semantic understanding from the voice cloud server through network communication in real time, and after real-time voice processing sends the corresponding control commands and voice reply content to the responding Internet of things sub-device; the voice cloud server comprises a speech recognition module, a semantic understanding module, a dialogue management module, a speech synthesis module, and a network communication module, and returns the response result to the interaction central control through the network communication module.
CN201910988977.1A 2019-10-17 2019-10-17 Multi-mode interaction based distributed Internet of things equipment cooperation method and system Pending CN110718227A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910988977.1A CN110718227A (en) 2019-10-17 2019-10-17 Multi-mode interaction based distributed Internet of things equipment cooperation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910988977.1A CN110718227A (en) 2019-10-17 2019-10-17 Multi-mode interaction based distributed Internet of things equipment cooperation method and system

Publications (1)

Publication Number Publication Date
CN110718227A (en) 2020-01-21

Family

ID=69211832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910988977.1A Pending CN110718227A (en) 2019-10-17 2019-10-17 Multi-mode interaction based distributed Internet of things equipment cooperation method and system

Country Status (1)

Country Link
CN (1) CN110718227A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170330563A1 (en) * 2016-05-13 2017-11-16 Bose Corporation Processing Speech from Distributed Microphones
CN107622652A (en) * 2016-07-15 2018-01-23 青岛海尔智能技术研发有限公司 The sound control method and appliance control system of appliance system
CN108564052A (en) * 2018-04-24 2018-09-21 南京邮电大学 Multi-cam dynamic human face recognition system based on MTCNN and method
CN110136714A (en) * 2019-05-14 2019-08-16 北京探境科技有限公司 Natural interaction sound control method and device
CN110322878A (en) * 2019-07-01 2019-10-11 华为技术有限公司 A kind of sound control method, electronic equipment and system
CN110288997A (en) * 2019-07-22 2019-09-27 苏州思必驰信息科技有限公司 Equipment awakening method and system for acoustics networking

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YANG WANG ET AL.: "Research on Face Detection Method Based on Improved MTCNN Network", ICDIP 2019 *
FENG HUIFANG: "Research on Key Technologies of Facial Landmark Localization Based on Checkpoint Surveillance Video", China Masters' Theses Full-text Database, Information Science and Technology Series *
KONG DEZHUANG ET AL.: "Application and Methods of Facial Expression Recognition in Assistive Medicine", Life Science Instruments *
ZHANG HENG ET AL.: "Facial Landmark Detection Based on Cascaded Convolutional Networks", Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113470634A (en) * 2020-04-28 2021-10-01 海信集团有限公司 Control method of voice interaction equipment, server and voice interaction equipment
CN113470634B (en) * 2020-04-28 2024-05-17 海信集团有限公司 Voice interaction equipment control method, server and voice interaction equipment
CN112634885A (en) * 2020-05-18 2021-04-09 北京如影智能科技有限公司 Voice wake-up method and device for cross-local area network
CN111931551A (en) * 2020-05-26 2020-11-13 东南大学 Face detection method based on lightweight cascade network
CN111931551B (en) * 2020-05-26 2022-04-12 东南大学 Face detection method based on lightweight cascade network
CN114287151A (en) * 2020-07-28 2022-04-05 北京小米移动软件有限公司 Wireless communication method, terminal, base station, communication device and storage medium
CN114287151B (en) * 2020-07-28 2024-04-05 北京小米移动软件有限公司 Wireless communication method, terminal, base station, communication device and storage medium
CN112634872A (en) * 2020-12-21 2021-04-09 北京声智科技有限公司 Voice equipment awakening method and device
CN112908325A (en) * 2021-01-29 2021-06-04 中国平安人寿保险股份有限公司 Voice interaction method and device, electronic equipment and storage medium
CN112908325B (en) * 2021-01-29 2022-10-28 中国平安人寿保险股份有限公司 Voice interaction method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110718227A (en) Multi-mode interaction based distributed Internet of things equipment cooperation method and system
CN110728255B (en) Image processing method, image processing device, electronic equipment and storage medium
Steffens et al. Personspotter-fast and robust system for human detection, tracking and recognition
CN110135249B (en) Human behavior identification method based on time attention mechanism and LSTM (least Square TM)
CN113284168A (en) Target tracking method and device, electronic equipment and storage medium
WO2021213158A1 (en) Real-time face summarization service method and system for intelligent video conference terminal
EP3647992A1 (en) Face image processing method and apparatus, storage medium, and electronic device
JP2006011978A (en) Image processing method and image processor
CN108960076B (en) Ear recognition and tracking method based on convolutional neural network
CN108986137B (en) Human body tracking method, device and equipment
CN111680550B (en) Emotion information identification method and device, storage medium and computer equipment
CN111008994A (en) Moving target real-time detection and tracking system and method based on MPSoC
CN111401322A (en) Station entering and exiting identification method and device, terminal and storage medium
CN113850136A (en) Yolov5 and BCNN-based vehicle orientation identification method and system
CN110188179B (en) Voice directional recognition interaction method, device, equipment and medium
CN112700568B (en) Identity authentication method, equipment and computer readable storage medium
Afroze et al. An empirical framework for detecting speaking modes using ensemble classifier
CN112766065A (en) Mobile terminal examinee identity authentication method, device, terminal and storage medium
CN113052136A (en) Pedestrian detection method based on improved Faster RCNN
CN117813581A (en) Multi-angle hand tracking
CN114283461A (en) Image processing method, apparatus, device, storage medium, and computer program product
WO2020237674A1 (en) Target tracking method and apparatus, and unmanned aerial vehicle
CN113276113A (en) Sight line positioning and voice control system and method for space manipulator on-orbit operation
CN117553808B (en) Deep learning-based robot positioning navigation method, device, equipment and medium
CN113903083B (en) Behavior recognition method and apparatus, electronic device, and storage medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 20200121)