CN110718227A - Multi-mode interaction based distributed Internet of things equipment cooperation method and system - Google Patents
- Publication number: CN110718227A (application CN201910988977.1A)
- Authority: CN (China)
- Prior art keywords: voice, sub, things, equipment, distributed internet
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
- G10L15/34—Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
- H04L67/12—Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
Abstract
The invention discloses a multi-modal-interaction-based cooperation method and system for distributed Internet of Things (IoT) devices. Each sub-device of the distributed IoT collects voice signals in real time through a microphone and performs voice wake-up judgment; a device woken by voice starts its camera to capture face images in real time and performs face detection, and sends the results to an interaction central control over network communication. The central control arbitrates and coordinates according to the voice wake-up and face detection results reported by the sub-devices, determines the device that should really respond to the user's wake-up, keeps listening for voice commands on that device, and clears the wake-up information of the other sub-devices. The user's voice commands are processed in real time, and the corresponding control commands and voice reply content are sent to the responding IoT sub-device. By arbitrating and coordinating over the multi-modal results reported by the distributed IoT devices, the interaction central control improves the accuracy of cooperative interaction and response of the distributed IoT devices.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a multi-modal-interaction-based cooperation method and system for distributed Internet of Things devices.
Background
With continuous progress in the field of artificial intelligence, the accuracy of speech recognition and face detection keeps improving, and many intelligent voice devices have entered daily life. A built-in microphone or microphone array lets a user interact with a smart device at close range or across a certain far-field distance, but beyond that range the accuracy of voice interaction drops or interaction fails altogether. Many home environments now contain several voice-enabled smart devices placed in a distributed fashion, for example a smart speaker in the living room and a smart desk lamp in the bedroom. With the rapid growth of the Internet of Things, interconnecting multiple voice-enabled smart devices is both an inevitable technical trend and a practical need of the smart home, and this scenario calls for a method for cooperative interaction among distributed IoT devices. In the prior art, distributed IoT devices share the same wake-up word, so after the user speaks it, all devices respond at once; the system cannot decide which device should answer the user's request, which seriously degrades the user experience.
Disclosure of Invention
The invention aims to provide a multi-modal-interaction-based cooperation method and system for distributed IoT devices that reduces network latency, improves response speed, resolves the confusion caused by multiple devices waking simultaneously, improves the response accuracy and stability of distributed IoT devices through multi-modal interaction, effectively solves the problem of interconnecting and coordinating multiple distributed voice devices in a home scenario, and improves the user experience in an IoT environment, thereby addressing the problems described in the Background section.
In order to achieve the purpose, the invention provides the following technical scheme:
a distributed Internet of things equipment cooperation method based on multi-modal interaction comprises the following steps:
S1: each sub-device of the distributed IoT locally collects the user's voice in real time and performs voice wake-up judgment;
S2: each sub-device whose voice wake-up is triggered starts its camera to capture the current scene, runs real-time face detection, and computes a face detection result and confidence;
S3: if a face is present in the current scene, the sub-device immediately sends its voice wake-up and face detection results, including but not limited to the voice wake-up and face detection confidences, to the interaction central control; if no face is present, the device clears its wake-up state and reports nothing to the central control;
S4: from the voice wake-up and face detection results received from each sub-device, the interaction central control selects the sub-device with the highest combined wake-up score and face detection confidence as the device that responds to the user's wake-up, notifies it to play a response prompt and keep picking up the user's voice commands, forwards those commands to the voice cloud server for processing, and clears the wake-up information of the other distributed sub-devices;
S5: the voice cloud server performs speech recognition, semantic understanding, dialogue management and speech synthesis in real time to process the user's voice command and returns the response result.
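The arbitration in S3 and S4 can be sketched as follows. This is a minimal illustration, not the patent's implementation; the device names, the score field names, and the rule of summing the two confidences are assumptions made for the example:

```python
def arbitrate(reports):
    """Pick the sub-device that should answer the user's wake-up.

    `reports` maps device id -> dict with 'wake_score' and 'face_conf';
    per S3, devices that detected no face report nothing and so never
    appear here. Returns (winner, losers): the responding device and the
    devices whose wake-up state the central control should clear (S4).
    """
    if not reports:
        return None, []
    winner = max(reports,
                 key=lambda d: reports[d]["wake_score"] + reports[d]["face_conf"])
    losers = [d for d in reports if d != winner]
    return winner, losers
```

For example, with a speaker reporting (0.91, 0.88) and a lamp reporting (0.87, 0.40), the speaker wins and the lamp's wake-up state is cleared.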
Further, in S1, the distributed IoT devices are a set of intelligent terminals, each equipped with a microphone array, including but not limited to a linear 2-mic, linear 4-mic, linear 6-mic, circular 4-mic, or other irregular microphone array.
Furthermore, in S2, the face detection method comprises two steps: image preprocessing and an MTCNN-based face detection algorithm. The MTCNN network consists of three lightweight CNNs (P-Net, R-Net and O-Net); the preprocessed input image passes through the three networks in turn, finally outputting the face detection and key-point detection results.
Further, in S2, the real-time face detection function adopts a face detection algorithm based on a multi-task cascaded convolutional neural network (MTCNN).
The invention provides another technical solution: a multi-modal-interaction-based cooperation system for distributed IoT devices, comprising the distributed IoT devices, an interaction central control, and a voice cloud server. Each distributed IoT device is equipped with a microphone-array audio acquisition module and a camera image acquisition module; the audio module collects voice signals in real time and performs signal processing and voice wake-up. After voice wake-up, the sub-device starts the camera module to capture pictures in real time and run face detection, and when a face is judged to be present, the multi-modal voice-wake-up and face-detection data are transmitted to the interaction central control over the communication link; each device also carries a voice reply and broadcast module. The interaction central control comprises a voice wake-up arbitration module, a voice agent service module and a network communication module. Based on the content uploaded by each distributed IoT device, it determines the sub-device that should respond to the wake-up, lets that sub-device continue listening for the user's voice commands while clearing the wake-up information of the other distributed sub-devices, requests speech recognition and semantic understanding from the voice cloud server over network communication in real time, and after real-time voice processing sends the corresponding control commands and voice reply content to the responding IoT sub-device. The voice cloud server comprises a speech recognition module, a semantic understanding module, a dialogue management module, a speech synthesis module and a network communication module, and returns the response result to the interaction central control through the network communication module.
Compared with the prior art, the invention has the beneficial effects that:
1. In the multi-modal-interaction-based cooperation method and system for distributed IoT devices, the interaction central control connects to all distributed IoT devices over the local area network, makes an arbitration decision from the received wake-up information and face detection results, and quickly determines and notifies the device that should respond; this reduces network latency, improves response speed, and resolves the confusion of multiple devices waking simultaneously.
2. In the multi-modal-interaction-based cooperation method and system for distributed IoT devices, multi-modal interaction improves the response accuracy and stability of the distributed IoT devices, effectively solves the problem of interconnecting and coordinating multiple distributed voice devices in a home scenario, and improves the user experience in the IoT environment.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a flow chart of a face detection method of the present invention;
FIG. 3 is a diagram illustrating the effect of the face detection method of the present invention;
fig. 4 is a block diagram of the system of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, in the embodiment of the present invention: the distributed Internet of things equipment cooperation method based on multi-modal interaction comprises the following steps:
Step 1: each sub-device of the distributed IoT locally collects the user's voice in real time and performs voice wake-up judgment. In this step, the distributed IoT devices are a set of intelligent terminals, each equipped with a microphone array, including but not limited to a linear 2-mic, linear 4-mic, linear 6-mic, circular 4-mic, or other irregular microphone array. When the user utters a wake-up phrase, several distributed IoT devices may receive the wake-up voice signal; if several sub-devices then responded to the user at once, the user experience and voice interaction quality would suffer badly. The user's voice wake-up therefore has to be decided, that is, the wake-up behavior of the multiple sub-devices must be arbitrated: the system determines the IoT sub-device or intelligent terminal that really should respond to the user's wake-up (the woken device best suited to interact with the user) and clears the wake-up responses of all other sub-devices.
Step 2: each sub-device whose voice wake-up is triggered starts its camera to capture the current scene, runs real-time face detection, and computes a face detection result and confidence.
Step 3: if a face is present in the current scene, the sub-device immediately sends its voice wake-up and face detection results, including but not limited to the voice wake-up and face detection confidences, to the interaction central control; if no face is present, the device clears its wake-up state and reports nothing to the central control.
Step 4: from the voice wake-up and face detection results received from each sub-device, the interaction central control selects the sub-device with the highest combined wake-up score and face detection confidence as the device that responds to the user's wake-up, notifies it to play a response prompt and keep picking up the user's voice commands, forwards those commands to the voice cloud server for processing, and clears the wake-up information of the other distributed sub-devices.
Step 5: the voice cloud server performs speech recognition, semantic understanding, dialogue management and speech synthesis in real time to process the user's voice command and returns the response result.
In this embodiment, when determining which sub-device responds to the user's wake-up, face detection is run on every sub-device whose voice wake-up was triggered; the detection result establishes whether a user is present in the current scene, and thus whether the user actually addressed this distributed IoT sub-device with a wake-up phrase or voice command. Confirming the user through the combined voice and visual modalities improves the accuracy and reliability of the cooperative response of the distributed IoT devices.
In the above embodiment, the real-time face detection in step 2 adopts a face detection algorithm based on a multi-task cascaded convolutional neural network (MTCNN). This coarse-to-fine method processes in real time, is fast and effective, and can run on an intelligent terminal, overcoming the drawbacks of traditional face detection algorithms: strict requirements on the environment and on the face, and long detection times.
In the above embodiment, the face detection method comprises two steps: image preprocessing and the MTCNN-based face detection algorithm. The MTCNN consists of three lightweight CNNs (P-Net, R-Net and O-Net); the preprocessed input image passes through the three networks in turn, finally outputting the face detection and key-point detection results.
To better explain the invention, the flow of the face detection algorithm (see fig. 2) comprises the following steps:
Step 21: the picture captured by the camera is taken as input, and a preprocessing step rescales it to several different sizes to form an image pyramid, so that detection is invariant to the scale of the face;
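The image pyramid of step 21 can be sketched as follows. The 20-pixel minimum face size, the 12-pixel P-Net input size, and the 0.709 per-level scale factor are conventional MTCNN choices assumed here for illustration, not values stated in the patent:

```python
def pyramid_scales(height, width, min_face=20, net_input=12, factor=0.709):
    """Scale factors that map faces down to P-Net's 12x12 receptive field."""
    scales = []
    scale = net_input / min_face            # largest scale: min_face -> 12 px
    min_side = min(height, width) * scale
    while min_side >= net_input:            # stop once the image drops below 12 px
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales
```

Resizing the input picture by each returned factor yields the levels of the pyramid fed into P-Net in step 22.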
Step 22: the preprocessed image pyramid is fed into the MTCNN for processing, passing through the 3 sub-networks in a coarse-to-fine manner; the specific process is as follows:
Step 221: the preprocessed image pyramid is fed into P-Net, which outputs a face classification result, candidate image windows, and a facial landmark localization result. P-Net is a fully convolutional network that generates candidate boxes and bounding-box regression vectors, corrects the candidate windows by bounding-box regression, and merges overlapping candidates by non-maximum suppression.
Specifically, the input is a 12 × 12 picture; the generated training data (candidate boxes cropped and resized to 12 × 12) is arranged as a 12 × 12 × 3 tensor before training. In the first layer, 10 convolution kernels of 3 × 3 followed by a 3 × 3 max pooling generate 10 feature maps of 5 × 5; in the second layer, 16 kernels of 3 × 3 × 10 generate 16 feature maps of 3 × 3; in the third layer, 32 kernels of 3 × 3 × 16 generate 32 feature maps of 1 × 1. From these 32 feature maps of 1 × 1, the P-Net feedforward finally outputs 3 vectors: 2 kernels of 1 × 1 × 32 produce 2 feature maps of 1 × 1 for the face/non-face classification, i.e. the face probability; 4 kernels of 1 × 1 × 32 produce 4 feature maps of 1 × 1 for the bounding-box regression; and 10 kernels of 1 × 1 × 32 produce 10 feature maps of 1 × 1 for the facial landmark localization, i.e. the face contour point information.
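The feature-map sizes quoted above (12 to 10 to 5 to 3 to 1) can be checked with a small spatial-size calculator; the stride-2, ceil-mode pooling is an assumption matching common MTCNN implementations rather than a detail stated in the patent:

```python
import math

def out_size(n, k, stride=1, ceil_mode=False):
    """Spatial output size of a 'valid' (unpadded) conv or pooling layer."""
    frac = (n - k) / stride
    return (math.ceil(frac) if ceil_mode else math.floor(frac)) + 1

def pnet_trace(n=12):
    """Trace the spatial sizes through P-Net's layers, starting from n x n."""
    sizes = [n]
    n = out_size(n, 3)                             # conv 3x3, stride 1 -> 10
    sizes.append(n)
    n = out_size(n, 3, stride=2, ceil_mode=True)   # max pool 3x3, stride 2 -> 5
    sizes.append(n)
    n = out_size(n, 3)                             # conv 3x3 -> 3
    sizes.append(n)
    n = out_size(n, 3)                             # conv 3x3 -> 1
    sizes.append(n)
    return sizes
```

Running the trace on a 12-pixel input reproduces the 10, 5, 3, 1 sequence of the description; the same calculator applies to the R-Net and O-Net sizes below.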
In step 222, the candidate windows determined by the P-Net output are fed into R-Net for further classification, which amounts to a refinement pass.
Specifically, according to the coordinates output by P-Net, a patch is cropped from the original image (using a square crop along the longest side, which avoids deformation and retains more detail) and resized to 24 × 24; the resulting 24 × 24 × 3 tensor is fed into R-Net. In the first layer, 28 convolution kernels of 3 × 3 followed by a 3 × 3 max pooling generate 28 feature maps of 11 × 11; in the second layer, 48 kernels of 3 × 3 × 28 with a 3 × 3 max pooling generate 48 feature maps of 4 × 4; in the third layer, 64 kernels of 2 × 2 × 48 generate 64 feature maps of 3 × 3. Finally, a fully connected layer of 128 neurons outputs 3 results: a 2-dimensional result for the face/non-face classification, i.e. the face probability; 4 coordinate offsets for the candidate bounding-box regression; and 10 face contour point values for the facial landmark localization.
In step 223, the patches cut from the original picture according to the candidate-window information output by R-Net in the previous step are fed into O-Net, which determines the final positions of the face box and the feature points, outputting the face/non-face judgment, the face box location, and the positions of the 5 facial feature points.
Specifically, according to the information output by R-Net, a patch is cropped from the original image (processed like the R-Net input: a square crop along the longest side to avoid deformation and retain more detail) and resized to 48 × 48; the resulting 48 × 48 × 3 tensor is fed into O-Net. In the first layer, 32 convolution kernels of 3 × 3 followed by a 3 × 3 max pooling generate 32 feature maps of 23 × 23; in the second layer, 64 kernels of 3 × 3 × 32 with a 3 × 3 max pooling generate 64 feature maps of 10 × 10; in the third layer, 64 kernels of 3 × 3 × 64 with a 2 × 2 max pooling generate 64 feature maps of 4 × 4; in the fourth layer, 128 kernels of 2 × 2 × 64 generate 128 feature maps of 3 × 3. Finally, a fully connected layer of 256 neurons outputs 3 results: a 2-dimensional result for the face/non-face classification, i.e. the face probability; 4 coordinate offsets for the candidate bounding-box regression; and 10 face contour point values for the facial landmark localization.
Step 23: the face detection result and the 5 facial key points are determined from the MTCNN output of step 22; the 5 feature points on the original image are aligned to fixed positions in the picture by an affine transformation, and the face position is framed to display the detection result.
In this embodiment, the outputs of all three sub-networks of the MTCNN are post-processed over the candidate face boxes: according to the probability scores of the face classification results, an overlap criterion (IoU) and non-maximum suppression (NMS) are used to screen the candidates, filtering out most boxes that are not faces.
Specifically, the accuracy of a candidate box is measured with a localization accuracy criterion: the overlap (IoU) of two candidate boxes is defined as the ratio of the area of their intersection to the area of their union.
In this embodiment, non-maximum suppression (NMS) is used to screen the candidate boxes. NMS essentially suppresses elements that are not local maxima; each local maximum represents a neighborhood, whose two parameters (its dimensionality and its size) are variable. Concretely, the candidate boxes are sorted by result confidence and the box A with the highest confidence is selected; any remaining box whose overlap with A exceeds a threshold is deleted while A is kept, and the procedure repeats on the remaining boxes, finally leaving candidate boxes with little mutual overlap and high confidence.
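The overlap criterion and the greedy NMS procedure described above can be sketched as follows; boxes are assumed to be (x1, y1, x2, y2) tuples, and the 0.5 overlap threshold is a typical choice rather than a value from the patent:

```python
def iou(a, b):
    """Overlap ratio: intersection area over union area of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)            # highest-confidence remaining box
        keep.append(best)
        order = [i for i in order      # drop boxes overlapping it too much
                 if iou(boxes[best], boxes[i]) <= threshold]
    return keep
```

For example, of two heavily overlapping boxes with scores 0.9 and 0.8 plus one disjoint box, NMS keeps the 0.9 box and the disjoint box and suppresses the 0.8 box.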
Before the detection stage, the MTCNN face detection method in this embodiment requires training the neural network to learn 3 tasks: face/non-face classification, bounding-box regression, and facial feature point localization (facial landmark localization):
1) The face/non-face classifier is designed according to the cross-entropy loss for face classification:

$L_i^{det} = -\left( y_i^{det} \log(p_i) + (1 - y_i^{det}) \log(1 - p_i) \right)$

where $p_i$ is the probability, predicted by the network, that sample $i$ is a face, and $y_i^{det} \in \{0, 1\}$ is the ground-truth label.
2) Bounding-box regression uses a regression loss computed as a Euclidean distance:

$L_i^{box} = \left\| \hat{y}_i^{box} - y_i^{box} \right\|_2^2$

where $\hat{y}_i^{box}$ are the coordinates predicted by the network and $y_i^{box}$ are the real ground-truth box coordinates; each is a quadruple (upper-left x, upper-left y, height, width).
3) Facial landmark localization, like the bounding-box regression, minimizes the Euclidean distance between the landmark positions predicted by the network and the actual real landmarks:

$L_i^{landmark} = \left\| \hat{y}_i^{landmark} - y_i^{landmark} \right\|_2^2$

where $\hat{y}_i^{landmark}$ are the network's predictions and $y_i^{landmark}$ the actual real landmark coordinates; since there are 5 points in total, each represented by an x and a y value, both $\hat{y}_i^{landmark}$ and $y_i^{landmark}$ are ten-tuples.
In this embodiment, the MTCNN training process minimizes the overall multi-task objective:

$\min \sum_{i=1}^{N} \sum_{j \in \{det,\, box,\, landmark\}} \alpha_j \, \beta_i^j \, L_i^j$

with P-Net and R-Net using $(\alpha_{det} = 1,\ \alpha_{box} = 0.5,\ \alpha_{landmark} = 0.5)$ and O-Net using $(\alpha_{det} = 1,\ \alpha_{box} = 0.5,\ \alpha_{landmark} = 1)$, where $N$ is the number of training samples, $\alpha_j$ indicates the importance of task $j$, $\beta_i^j \in \{0, 1\}$ is the sample-type label, and $L_i^j$ is the loss function defined above.
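The multi-task objective above can be sketched in code as follows; the per-sample dictionary layout and field names are assumptions made for the example, with the sample-type label beta encoded as the beta_box / beta_lmk indicators:

```python
import math

def face_cls_loss(p_face, y):
    """Cross-entropy for the face/non-face task; y is the 0/1 label."""
    eps = 1e-12                       # numerical guard for log(0)
    return -(y * math.log(p_face + eps) + (1 - y) * math.log(1 - p_face + eps))

def sq_euclidean(pred, target):
    """Squared Euclidean distance, used for both box and landmark regression."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def total_loss(samples, alphas):
    """Weighted multi-task objective averaged over the training samples."""
    total = 0.0
    for s in samples:
        total += alphas["det"] * face_cls_loss(s["p_face"], s["y"])
        total += s["beta_box"] * alphas["box"] * sq_euclidean(s["box"], s["box_gt"])
        total += s["beta_lmk"] * alphas["landmark"] * sq_euclidean(s["lmk"], s["lmk_gt"])
    return total / len(samples)
```

A background sample would set both indicators to 0 so that only the classification term contributes, matching the role of the sample-type label in the formula.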
In this embodiment, the per-stage effect of the face detection algorithm is shown in fig. 3. The input picture is preprocessed, i.e. resized to different scales to form an image pyramid; the preprocessed images are fed into P-Net, which outputs candidate face bounding boxes after non-maximum suppression and bounding-box regression; only the patches framed by those candidate boxes are fed into R-Net, where non-maximum suppression and bounding-box regression again yield fewer but progressively more accurate face candidates; finally, O-Net takes the small number of remaining face candidate patches and outputs the face bounding box and the facial feature point results.
Referring to fig. 4, the invention further provides another technical solution: a multi-modal-interaction-based cooperation system for distributed IoT devices, comprising the distributed IoT devices, an interaction central control, and a voice cloud server. Each distributed IoT device is equipped with a microphone-array audio acquisition module and a camera image acquisition module; the audio module collects voice signals in real time and performs signal processing and voice wake-up. After voice wake-up, the sub-device starts the camera module to capture pictures in real time and run face detection, and when a face is judged to be present, the multi-modal voice-wake-up and face-detection data are transmitted to the interaction central control over the communication link; each device also carries a voice reply and broadcast module. The interaction central control comprises a voice wake-up arbitration module, a voice agent service module and a network communication module. Based on the content uploaded by each distributed IoT device, it determines the sub-device that should respond to the wake-up, lets that sub-device continue listening for the user's voice commands while clearing the wake-up information of the other distributed sub-devices, requests speech recognition and semantic understanding from the voice cloud server over network communication in real time, and after real-time voice processing sends the corresponding control commands and voice reply content to the responding IoT sub-device. The voice cloud server comprises a speech recognition module, a semantic understanding module, a dialogue management module, a speech synthesis module and a network communication module, and returns the response result to the interaction central control through the network communication module.
In summary, the invention provides a distributed Internet of things device cooperation method and system based on multi-modal interaction, aimed at the technical problem of poor interaction experience on distributed intelligent voice devices. Each sub-device of the distributed Internet of things collects voice signals in real time through a microphone or microphone array and makes a voice wake-up judgment; on each device hit by voice wake-up, a camera is started to collect face images in real time for face detection. When a voice-awakened sub-device also detects the presence of a face, it sends the result to the interaction central control through network communication or broadcast. The interaction central control arbitrates and coordinates according to the multi-modal voice wake-up and face detection results reported by each sub-device, determines the device that actually responds to the user's wake-up and continues to monitor voice commands, and at the same time clears the wake-up information of the other distributed Internet of things sub-devices. The user's voice commands are sent in real time over the communication link to the voice cloud server for voice recognition, semantic understanding and other real-time voice processing, after which the corresponding control commands and voice reply content are issued to the Internet of things sub-device that responds to the wake-up.
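The central arbitration described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the report field names and the scoring rule (here the product of the voice wake-up and face detection confidences) are assumptions, since the text does not fix an exact combination formula.

```python
from dataclasses import dataclass

@dataclass
class WakeReport:
    """Multi-modal report a sub-device uploads after a wake-up hit.

    Field names are illustrative; the patent only requires that the report
    carry the voice wake-up and face detection confidences.
    """
    device_id: str
    wake_score: float       # voice wake-up confidence, 0..1
    face_confidence: float  # face detection confidence, 0..1

def arbitrate(reports):
    """Pick the single sub-device that should answer the user.

    Devices that saw no face never report (step S3), so every report here
    already passed the face-presence check.  The combination rule (product
    of the two confidences) is an assumed example.
    """
    if not reports:
        return None, []
    winner = max(reports, key=lambda r: r.wake_score * r.face_confidence)
    losers = [r.device_id for r in reports if r.device_id != winner.device_id]
    return winner.device_id, losers  # losers get their wake-up state cleared
```

In this sketch the winning device keeps listening for voice commands while the returned `losers` list would be used to clear the wake-up information of the remaining sub-devices.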
The above description is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any change or substitution that a person skilled in the art can easily conceive within the technical scope disclosed by the present invention, based on its technical solutions and inventive concept, shall be covered within the protection scope of the present invention.
Claims (5)
1. A multi-modal interaction based distributed Internet of things device cooperation method is characterized by comprising the following steps:
S1: each sub-device of the distributed Internet of things locally collects the user's voice in real time and performs a voice wake-up judgment;
S2: each sub-device hit by voice wake-up starts a camera to capture a picture of the current scene, performs real-time face detection, and computes the face detection result and its confidence;
S3: when a face exists in the current scene of a sub-device, the voice wake-up result and the face detection result on the sub-device are immediately transmitted to the interaction central control, the results including but not limited to the confidences of voice wake-up and face detection; if no face exists in the current scene, the sub-device clears its voice wake-up result and does not report to the interaction central control;
S4: according to the received voice wake-up and face detection results of each sub-device, the interaction central control determines the sub-device with the highest voice wake-up score and face detection confidence as the sub-device that responds to the user's wake-up, notifies that sub-device to give a response prompt and continuously pick up the user's voice commands, continuously initiates voice processing requests to the voice cloud server with that sub-device's voice commands, and at the same time clears the wake-up information of the other distributed sub-devices;
S5: the voice cloud server executes voice recognition, semantic understanding, dialogue management and voice synthesis in real time to process the user's voice command and returns the response result.
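The device-side gating of steps S1-S3 can be sketched as follows. This is an illustrative sketch only: the threshold value, the report shape, and the `detect_face` callable (standing in for the camera capture plus face detector of step S2) are assumptions introduced for the example.

```python
def device_side_flow(wake_score, wake_threshold, detect_face):
    """Per-device gating for steps S1-S3 (illustrative sketch).

    `detect_face` stands in for the camera capture plus the face detector
    of step S2 and returns (face_found, confidence).  The threshold value
    and the report dictionary shape are assumptions.
    """
    if wake_score < wake_threshold:        # S1: wake-up word not hit, stay idle
        return None
    face_found, face_conf = detect_face()  # S2: grab a frame, run face detection
    if not face_found:                     # S3: no face -> clear wake state, report nothing
        return None
    # S3: face present -> report both confidences to the interaction central control
    return {"wake_score": wake_score, "face_confidence": face_conf}
```

Only sub-devices that both pass the wake-up threshold and see a face produce a report, which is what lets the central control's arbitration in S4 consider a small candidate set.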
2. The multi-modal interaction based distributed Internet of things device cooperation method as claimed in claim 1, wherein in S1 the distributed Internet of things devices comprise a plurality of smart terminals, each smart terminal having its own microphone array, including but not limited to a linear 2-microphone, linear 4-microphone, linear 6-microphone, circular 4-microphone or irregular microphone array.
3. The multi-modal interaction based distributed Internet of things device cooperation method as claimed in claim 1, wherein in S2 the face detection method comprises two steps: a picture pre-processing operation and a face detection algorithm based on MTCNN; structurally, MTCNN is composed of 3 lightweight CNNs, namely P-Net, R-Net and O-Net, and the pre-processed input picture is processed successively by these 3 networks, which finally output the face detection and keypoint detection results.
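The coarse-to-fine cascade named in the claim can be sketched as a simple pipeline. The three stage functions below are placeholders for the trained lightweight CNNs (P-Net, R-Net, O-Net); their exact signatures are an assumption made for illustration, not the networks' real interfaces.

```python
def mtcnn_cascade(image, p_net, r_net, o_net):
    """Coarse-to-fine MTCNN pipeline: each stage keeps only the candidate
    face boxes that survive the previous one.

    `p_net`, `r_net` and `o_net` are placeholder callables standing in for
    the three trained lightweight CNNs; signatures are assumptions.
    """
    candidates = p_net(image)                 # P-Net: propose candidate face windows
    refined = r_net(image, candidates)        # R-Net: reject false candidates, refine boxes
    faces, keypoints = o_net(image, refined)  # O-Net: final boxes plus facial keypoints
    return faces, keypoints
```

The design point is that the cheap first stage handles the full image while the heavier later stages only see the surviving candidates, which keeps the whole cascade fast enough for the real-time detection required in S2.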
4. The method as claimed in claim 3, wherein in S2 the real-time face detection function adopts a face detection algorithm based on a multi-task cascaded convolutional neural network (MTCNN).
5. The multi-modal interaction based distributed Internet of things device cooperation system as claimed in claim 1, comprising distributed Internet of things devices, an interaction central control and a voice cloud server, wherein each distributed Internet of things device is provided with a microphone array audio acquisition module and a camera image acquisition module; the microphone array audio acquisition module acquires voice signals in real time and performs signal processing and voice wake-up processing; after voice wake-up, the distributed Internet of things sub-device starts the camera image acquisition module to acquire pictures in real time and performs face detection; when a face is judged to exist, the multi-modal data of voice wake-up and face detection are transmitted to the interaction central control through the communication connection; each distributed Internet of things device is further provided with a voice reply and broadcast module; the interaction central control comprises a voice wake-up arbitration module, a voice agent service module and a network communication module; according to the content uploaded by each distributed Internet of things device, it determines through these modules the distributed Internet of things sub-device that needs to wake up and respond, makes that sub-device continue monitoring the user's voice commands while clearing the wake-up information of the other distributed sub-devices, requests voice recognition and semantic understanding from the voice cloud server through the network communication in real time, and after the real-time voice processing issues the corresponding control commands and voice reply content to the Internet of things sub-device that responds to the wake-up; the voice cloud server comprises a voice recognition module, a semantic understanding module, a dialogue management module, a voice synthesis module and a network communication module, and returns the response result to the interaction central control through the network communication module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910988977.1A CN110718227A (en) | 2019-10-17 | 2019-10-17 | Multi-mode interaction based distributed Internet of things equipment cooperation method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110718227A (en) | 2020-01-21 |
Family
ID=69211832
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910988977.1A Pending CN110718227A (en) | 2019-10-17 | 2019-10-17 | Multi-mode interaction based distributed Internet of things equipment cooperation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110718227A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170330563A1 (en) * | 2016-05-13 | 2017-11-16 | Bose Corporation | Processing Speech from Distributed Microphones |
CN107622652A (en) * | 2016-07-15 | 2018-01-23 | 青岛海尔智能技术研发有限公司 | The sound control method and appliance control system of appliance system |
CN108564052A (en) * | 2018-04-24 | 2018-09-21 | 南京邮电大学 | Multi-cam dynamic human face recognition system based on MTCNN and method |
CN110136714A (en) * | 2019-05-14 | 2019-08-16 | 北京探境科技有限公司 | Natural interaction sound control method and device |
CN110288997A (en) * | 2019-07-22 | 2019-09-27 | 苏州思必驰信息科技有限公司 | Equipment awakening method and system for acoustics networking |
CN110322878A (en) * | 2019-07-01 | 2019-10-11 | 华为技术有限公司 | A kind of sound control method, electronic equipment and system |
Non-Patent Citations (4)
Title |
---|
YANG WANG ET AL.: "Research on Face Detection Method Based on Improved MTCNN Network", ICDIP 2019 * |
FENG HUIFANG: "Research on Key Technologies of Facial Landmark Localization Based on Checkpoint Surveillance Video", China Master's Theses Full-text Database, Information Science and Technology * |
KONG DEZHUANG ET AL.: "Research on the Application and Methods of Facial Expression Recognition in Assistive Healthcare", Life Science Instruments * |
ZHANG HENG ET AL.: "Facial Landmark Detection Based on Cascaded Convolutional Networks", Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition) * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113470634A (en) * | 2020-04-28 | 2021-10-01 | 海信集团有限公司 | Control method of voice interaction equipment, server and voice interaction equipment |
CN113470634B (en) * | 2020-04-28 | 2024-05-17 | 海信集团有限公司 | Voice interaction equipment control method, server and voice interaction equipment |
CN112634885A (en) * | 2020-05-18 | 2021-04-09 | 北京如影智能科技有限公司 | Voice wake-up method and device for cross-local area network |
CN111931551A (en) * | 2020-05-26 | 2020-11-13 | 东南大学 | Face detection method based on lightweight cascade network |
CN111931551B (en) * | 2020-05-26 | 2022-04-12 | 东南大学 | Face detection method based on lightweight cascade network |
CN114287151A (en) * | 2020-07-28 | 2022-04-05 | 北京小米移动软件有限公司 | Wireless communication method, terminal, base station, communication device and storage medium |
CN114287151B (en) * | 2020-07-28 | 2024-04-05 | 北京小米移动软件有限公司 | Wireless communication method, terminal, base station, communication device and storage medium |
CN112634872A (en) * | 2020-12-21 | 2021-04-09 | 北京声智科技有限公司 | Voice equipment awakening method and device |
CN112908325A (en) * | 2021-01-29 | 2021-06-04 | 中国平安人寿保险股份有限公司 | Voice interaction method and device, electronic equipment and storage medium |
CN112908325B (en) * | 2021-01-29 | 2022-10-28 | 中国平安人寿保险股份有限公司 | Voice interaction method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110718227A (en) | Multi-mode interaction based distributed Internet of things equipment cooperation method and system | |
CN110728255B (en) | Image processing method, image processing device, electronic equipment and storage medium | |
Steffens et al. | Personspotter-fast and robust system for human detection, tracking and recognition | |
CN110135249B (en) | Human behavior identification method based on time attention mechanism and LSTM (least Square TM) | |
CN113284168A (en) | Target tracking method and device, electronic equipment and storage medium | |
WO2021213158A1 (en) | Real-time face summarization service method and system for intelligent video conference terminal | |
EP3647992A1 (en) | Face image processing method and apparatus, storage medium, and electronic device | |
JP2006011978A (en) | Image processing method and image processor | |
CN108960076B (en) | Ear recognition and tracking method based on convolutional neural network | |
CN108986137B (en) | Human body tracking method, device and equipment | |
CN111680550B (en) | Emotion information identification method and device, storage medium and computer equipment | |
CN111008994A (en) | Moving target real-time detection and tracking system and method based on MPSoC | |
CN111401322A (en) | Station entering and exiting identification method and device, terminal and storage medium | |
CN113850136A (en) | Yolov5 and BCNN-based vehicle orientation identification method and system | |
CN110188179B (en) | Voice directional recognition interaction method, device, equipment and medium | |
CN112700568B (en) | Identity authentication method, equipment and computer readable storage medium | |
Afroze et al. | An empirical framework for detecting speaking modes using ensemble classifier | |
CN112766065A (en) | Mobile terminal examinee identity authentication method, device, terminal and storage medium | |
CN113052136A (en) | Pedestrian detection method based on improved Faster RCNN | |
CN117813581A (en) | Multi-angle hand tracking | |
CN114283461A (en) | Image processing method, apparatus, device, storage medium, and computer program product | |
WO2020237674A1 (en) | Target tracking method and apparatus, and unmanned aerial vehicle | |
CN113276113A (en) | Sight line positioning and voice control system and method for space manipulator on-orbit operation | |
CN117553808B (en) | Deep learning-based robot positioning navigation method, device, equipment and medium | |
CN113903083B (en) | Behavior recognition method and apparatus, electronic device, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200121 |