CN117193524A - Man-machine interaction system and method based on multi-mode feature fusion - Google Patents

Man-machine interaction system and method based on multi-mode feature fusion Download PDF

Info

Publication number
CN117193524A
Authority
CN
China
Prior art keywords
data
mode
interaction
layer
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311071778.7A
Other languages
Chinese (zh)
Inventor
刘志欢
朱广鹏
张骥超
徐秋秋
陈成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Panda Electronics Co Ltd
Nanjing Panda Electronics Manufacturing Co Ltd
Original Assignee
Nanjing Panda Electronics Co Ltd
Nanjing Panda Electronics Manufacturing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Panda Electronics Co Ltd, Nanjing Panda Electronics Manufacturing Co Ltd filed Critical Nanjing Panda Electronics Co Ltd
Priority to CN202311071778.7A priority Critical patent/CN117193524A/en
Publication of CN117193524A publication Critical patent/CN117193524A/en
Pending legal-status Critical Current


Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses an interactive processing system based on multi-modal feature fusion, which comprises an information acquisition layer, a data processing layer, a fusion calculation layer and an interaction feedback layer. The information acquisition layer acquires a plurality of initial modal interaction data of a target object; the data processing layer preprocesses the acquired initial modal interaction data; the fusion calculation layer performs feature fusion on the processed multi-modal interaction data; and the interaction feedback layer outputs the fused interaction result. The multi-modal fusion interaction system provided by the invention performs fusion analysis on the information acquired from the target human body, so that interaction behaviors, interaction configurations and the like can be effectively integrated, and the naturalness of interaction between the user and the human-machine interaction system is greatly improved. At the same time, the user's intention is analyzed more accurately, further improving the accuracy of interactive commands; progress in human-machine personification technology is promoted, allowing devices to converse with users in a more realistic and natural manner.

Description

Man-machine interaction system and method based on multi-mode feature fusion
Technical Field
The invention belongs to the field of man-machine interaction, and particularly relates to a man-machine interaction system and method based on multi-modal feature fusion.
Background
With the continuous development of artificial intelligence technology, human-computer interaction has become more real-time and variable, experience levels have become richer and more diversified, and the requirements for fast, natural, emotional and personalized human-computer interaction are more prominent than ever. Human-computer interaction modes have gone through several iterations, and the diversified interaction modes increase the operating difficulty of intelligent devices; some interaction modes are strongly affected by environmental factors, which reduces the flexibility of intelligent devices; contact-based interaction modes are unfavorable for device hygiene and carry a potential risk of cross-infection. Chinese patent application 201610999277.9, titled "A multi-modal human-computer interaction system and control method therefor", realizes multi-modal data interaction by configuring touch/somatosensory gesture interaction mapping instructions, logic processing, and a cooperative/mutually exclusive access mechanism, and converts user control instructions into service instructions for scheduling and execution by the application system. However, the volume of multi-modal data is huge, preprocessing of the modal data is lacking, and the processing speed is slow; in addition, the acquired modal data are mapped one-to-one to interaction mapping instructions, and no feature extraction or fusion is performed on the multi-modal data, which reduces the response speed of interaction and the accuracy of recognition and feedback.
Disclosure of Invention
The invention aims to: provide an intelligent human-computer interaction processing system with fused modal features, which can effectively integrate interaction behaviors, interaction configurations and the like and greatly improve the naturalness of interaction between the user and the human-computer interaction system; the invention further aims to provide a human-computer interaction method based on multi-modal feature fusion.
The technical scheme is as follows: the invention provides an intelligent human-computer interaction processing system with fused modal characteristics, which comprises an information acquisition layer, a data processing layer, a fusion calculation layer and an interaction feedback layer; the information acquisition layer is used for acquiring a plurality of initial modal interaction data of the target object; the data processing layer is used for preprocessing the acquired multiple initial modal interaction data; the fusion calculation layer is used for carrying out feature fusion on the processed multi-mode interaction data; and the interaction feedback layer is used for outputting the fused interaction result.
The information acquisition layer comprises a camera module, a voice module, a human body sensing module and a touch control module, and is mainly used for the initial acquisition of each modality's data from the target object.
The data processing layer constructs a preset multi-modal preprocessing model based on the presets of each modal sub-model.
The fusion calculation layer comprises a plurality of high-performance computing modules and is used for performing multi-modal classification fusion on the modal data from the data processing layer.
The interaction feedback layer selects a specific feedback mode according to the output fusion result, and makes an accurate response and the next judgment regarding the target object.
An interactive processing method based on multi-modal feature fusion comprises the following steps:
step 1, acquiring a plurality of initial modal interaction data of a target object through the information acquisition layer;
step 2, preprocessing the plurality of initial modal interaction data through the data processing layer;
step 3, performing feature fusion on the preprocessed multi-modal interaction data through the fusion calculation layer;
step 4, outputting the fused interaction result through the interaction feedback layer.
The information preprocessing comprises the following steps:
step 11, transmitting the modal data of each sub-module in the information acquisition layer to the data preprocessing layer through wired or wireless communication;
step 12, the data preprocessing layer determines the modality label type of the initial modal data based on a preset multi-modal preprocessing model;
step 13, for image modal data, cropping and scaling the image data through image preprocessing and adjusting its proportions; then performing data enhancement on the image to improve robustness; normalizing the image pixels to the range [0,1] and reducing the data dimension;
step 14, for voice modal data, performing endpoint detection through voice preprocessing, extracting useful voice segments, and extracting voice frequency-domain features using the short-time Fourier transform; enhancing voice quality through filtering and noise-reduction techniques, dividing the voice into fixed-length segments, and converting them into spectrogram inputs;
step 15, for text modal data, performing text analysis through text preprocessing using natural language processing techniques such as word segmentation and part-of-speech tagging; constructing word vector representations of the text and converting them into vector inputs of fixed dimension; adopting an attention mechanism so that the model automatically learns the importance weights of words, and padding or clipping the input text data to ensure consistent lengths;
step 16, after the different modal data are preprocessed, splicing the image, voice and text modal feature vectors and outputting them to the fusion calculation layer through the sub-preset models.
The specific process of multi-modal classification fusion in step 3 is as follows:
step 31, classifying and extracting features from each kind of preprocessed modal data, such as voice, touch, text and somatosensory data, and extracting and fusing the multi-modal features with a deep learning network model;
step 32, extracting spatial features from images and videos with a convolutional neural network, extracting temporal features from voice and text with a recurrent neural network, and performing weighted fusion of the features of different modalities with an attention mechanism;
step 33, learning the weights of the different modal features with a multi-modal fusion function to obtain fused features; the fusion network adaptively obtains the optimal combination of different modalities through end-to-end training, realizing modality complementarity; the fused features are input into an output network to obtain a command or control signal matched to the interaction scene;
step 34, constructing dialogue management models for different interaction scenes, training the dialogue management models with deep reinforcement learning, predefining operation flows, and completing flexible control of the corresponding devices according to multi-modal information such as the user's voice instructions and gesture actions;
step 35, combining methods such as knowledge graphs and user profiles, continuously accumulating information such as the user's preferences and usage habits, continuously optimizing the interaction strategy, improving command recognition accuracy and naturalness, obtaining the final fusion result, and feeding it back to the interaction feedback layer.
The multi-modal classification fusion process is dynamically adjusted through multiple computing configurations by the high-performance computing modules.
In step 4, the feedback mode of the interaction feedback layer is used in combination with the current voice, action, somatosensory and touch interaction modes; in response to the fusion result of the target object, further judgments are made about the target object's habits, behaviors, emotions and other human activities.
The beneficial effects are that: compared with the prior art, the invention provides the following notable advances. The invention helps the human-computer interaction system present itself to the user through diverse, flexible and natural interaction modes. Compared with the prior-art approach of mapping multi-modal data directly to interaction instructions, the multi-modal interaction modes are fused and preprocessed, the features of the data of different modalities are extracted, and the feature vectors are spliced according to their weights and output to the fusion calculation layer through the sub-preset models, which improves data processing efficiency and the accuracy of the interaction process. Secondly, through the fusion of multi-modal data, the human-computer interaction system achieves a deep understanding of commands, and the user's intention is analyzed more accurately by combining scene, semantics and emotion, further improving the accuracy of interactive commands; progress in human-machine personification technology is promoted, and the device converses with the user in a more realistic and natural manner. By adopting a flexible computing scheme with multiple groups of computing modules, the computation scheduling and allocation problem in the multi-modal fusion process can be dynamically adjusted, resources are allocated reasonably to the greatest extent, and multi-modal fusion precision and efficient interaction performance are ensured.
Drawings
FIG. 1 is a system block diagram of the present invention;
FIG. 2 is a flow chart of the information preprocessing of the present invention;
FIG. 3 is a flow chart of the multi-mode classification fusion of the present invention.
Detailed Description
FIG. 1 is a system block diagram of an intelligent human-computer interaction processing system with multi-modal feature fusion, which comprises an information acquisition layer, a data processing layer, a fusion calculation layer and an interaction feedback layer. The information acquisition layer is used for acquiring a plurality of initial modal interaction data of the target object, and comprises a camera module, a voice module, a human body sensing module and a touch control module: the camera module is a high-definition USB camera or a network camera and is used for collecting image data of the target object; the voice module is a microphone, pickup or sound sensor and is used for collecting sound data of the target object; the human body sensing module is an infrared sensor and a distance sensor and is used for collecting infrared data of the target object; the touch control module is a touch screen or touch sensor and is used for acquiring touch data of the target object. The data processing layer is used for preprocessing the acquired initial modal interaction data; the fusion calculation layer comprises a plurality of high-performance computing modules and is used for performing feature fusion on the processed multi-modal interaction data; and the interaction feedback layer is used for outputting the fused interaction result.
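For illustration only, the following minimal Python sketch shows how the four layers described above could be wired into a single pipeline; the class names (InformationAcquisitionLayer, DataProcessingLayer, FusionComputationLayer, InteractionFeedbackLayer), method names and placeholder data are assumptions for this example and are not specified by the invention.

```python
# Minimal illustrative sketch of the four-layer pipeline described above.
# All class and method names are assumptions for this example only.
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ModalSample:
    """One piece of initial modal interaction data with its modality label."""
    modality: str          # "image", "voice", "text", "touch", ...
    payload: Any           # raw data from the corresponding acquisition module


class InformationAcquisitionLayer:
    """Collects initial modal data from camera, voice, body-sensing and touch modules."""
    def acquire(self) -> List[ModalSample]:
        # In a real system these would read from hardware drivers.
        return [ModalSample("voice", b"...pcm..."), ModalSample("image", b"...jpeg...")]


class DataProcessingLayer:
    """Dispatches each sample to a modality-specific preprocessing sub-model."""
    def preprocess(self, samples: List[ModalSample]) -> Dict[str, Any]:
        return {s.modality: f"features({s.modality})" for s in samples}


class FusionComputationLayer:
    """Fuses the per-modality features into a single interaction result."""
    def fuse(self, features: Dict[str, Any]) -> str:
        return "intent: open_living_room_curtain"  # placeholder fused result


class InteractionFeedbackLayer:
    """Selects a feedback mode and responds to the target object."""
    def respond(self, fused_result: str) -> None:
        print(f"feedback -> {fused_result}")


if __name__ == "__main__":
    samples = InformationAcquisitionLayer().acquire()
    features = DataProcessingLayer().preprocess(samples)
    result = FusionComputationLayer().fuse(features)
    InteractionFeedbackLayer().respond(result)
```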
FIG. 2 is a flow chart of the data preprocessing. The intelligent human-computer interaction processing method with multi-modal feature fusion is implemented through the following steps:
step 1, acquiring a plurality of initial modal interaction data of a target object through the information acquisition layer;
step 2, preprocessing the plurality of initial modal interaction data through the data processing layer;
step 3, performing feature fusion on the preprocessed multi-modal interaction data through the fusion calculation layer;
step 4, outputting the fused interaction result through the interaction feedback layer.
The specific implementation process is as follows: the initial modal data of each sub-module in the information acquisition layer are transmitted to the data preprocessing layer through wired or wireless communication, including serial-port communication, network communication, USB communication and the like; the modality label type of the initial modal data, such as an image data label, a voice data label or a text data label, is then determined based on the preset multi-modal preprocessing model.
For image data, image preprocessing first crops and scales the image data and adjusts its proportions; a fixed size facilitates network processing. Data enhancement, such as random horizontal flipping and color jittering, is then applied to the image to strengthen the robustness of the model. A normalization method adjusts the image pixels to the range [0,1] and reduces the dimension of the image data.
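As an illustrative sketch only, the image-preprocessing step described above could be realized with torchvision transforms as follows; the 224×224 target size, the jitter strengths and the file name are assumptions for the example, not values given in the invention.

```python
# Illustrative image preprocessing: crop/scale, augmentation, [0,1] normalization.
# Target size and jitter strengths are assumptions chosen for the example.
from PIL import Image
from torchvision import transforms

image_preprocess = transforms.Compose([
    transforms.Resize(256),                                 # scale the shorter side
    transforms.CenterCrop(224),                             # crop to a fixed size/proportion
    transforms.RandomHorizontalFlip(p=0.5),                 # data enhancement: random flip
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # color dithering
    transforms.ToTensor(),                                  # HWC uint8 -> CHW float in [0, 1]
])

img = Image.open("frame_from_camera.jpg").convert("RGB")
tensor = image_preprocess(img)                              # shape: (3, 224, 224), values in [0, 1]
```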
For voice data, voice preprocessing performs endpoint detection on the input voice and extracts the voice segments. Voice spectral features are extracted by methods such as the short-time Fourier transform to represent the frequency-domain characteristics of the voice. Filtering and noise-reduction techniques are applied to enhance voice quality. The voice is divided into fixed-length segments and converted into spectrograms that are input to the model.
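An illustrative sketch of the voice-preprocessing step using librosa: silence trimming stands in for endpoint detection, and each fixed-length segment is converted to a log-magnitude STFT spectrogram. The sampling rate, frame parameters, one-second segment length and file name are assumptions for this example.

```python
# Illustrative voice preprocessing: silence trimming as a simple form of endpoint
# detection, STFT spectrogram extraction, and fixed-length segmentation.
import numpy as np
import librosa

SR = 16000                 # sampling rate (assumed)
SEGMENT_LEN = SR * 1       # fixed-length 1-second segments

y, _ = librosa.load("command.wav", sr=SR, mono=True)
y, _ = librosa.effects.trim(y, top_db=25)        # drop leading/trailing silence

# Pad so the signal splits evenly into fixed-length segments.
pad = (-len(y)) % SEGMENT_LEN
y = np.pad(y, (0, pad))
segments = y.reshape(-1, SEGMENT_LEN)

# Convert each segment to a log-magnitude spectrogram (frequency-domain features).
spectrograms = [
    librosa.amplitude_to_db(np.abs(librosa.stft(seg, n_fft=512, hop_length=160)))
    for seg in segments
]
print(len(spectrograms), spectrograms[0].shape)  # (n_freq_bins, n_frames) per segment
```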
For text data, text preprocessing performs text analysis using natural language processing techniques such as word segmentation and part-of-speech tagging. Word vector representations of the text are constructed and converted into vector inputs of fixed dimension. An attention mechanism is adopted so that the model automatically learns the importance weights of words. The input text data are padded or clipped to ensure consistent length. After all the modal data have been preprocessed, the image, voice and text modal feature vectors are spliced and output to the fusion calculation layer through the sub-preset models.
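An illustrative sketch of the text-preprocessing and feature-splicing step: jieba word segmentation, a toy vocabulary, padding/clipping to a fixed length, an embedding lookup, and concatenation of the per-modality feature vectors. The vocabulary, dimensions and placeholder image/voice vectors are assumptions for this example.

```python
# Illustrative text preprocessing and feature splicing. The vocabulary, MAX_LEN
# and embedding size are assumptions chosen for this example.
import jieba
import torch
import torch.nn as nn

MAX_LEN, EMB_DIM = 16, 64
vocab = {"<pad>": 0, "<unk>": 1, "打开": 2, "客厅": 3, "的": 4, "窗帘": 5}

def encode(sentence: str) -> torch.Tensor:
    tokens = list(jieba.cut(sentence))                               # word segmentation
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens][:MAX_LEN]   # clip to MAX_LEN
    ids += [vocab["<pad>"]] * (MAX_LEN - len(ids))                   # pad to fixed length
    return torch.tensor(ids)

embedding = nn.Embedding(len(vocab), EMB_DIM, padding_idx=0)
text_vec = embedding(encode("打开客厅的窗帘")).mean(dim=0)            # fixed-dimension text vector

# Splice per-modality feature vectors before handing them to the fusion layer.
image_vec = torch.randn(128)   # placeholder image features
voice_vec = torch.randn(128)   # placeholder voice features
fused_input = torch.cat([image_vec, voice_vec, text_vec], dim=0)
print(fused_input.shape)       # torch.Size([320])
```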
FIG. 3 is a flow chart of the multi-modal classification fusion of the intelligent human-computer interaction processing method with multi-modal feature fusion. The implementation process is as follows: each kind of preprocessed modal data, such as voice data, touch data, text data and somatosensory data, is classified and its features are extracted; for example, acoustic, prosodic and other voice features are extracted from voice data, touch-gesture features are extracted from touch data, physiological features such as human body actions and facial expressions are extracted from somatosensory data, and sentence features are extracted from text data.
A deep learning network model is adopted to extract and fuse the multi-modal features: a convolutional neural network extracts spatial features from images and videos, a recurrent neural network extracts temporal features from voice and text, and an attention mechanism performs weighted fusion of the features of different modalities. Compared with traditional manual feature engineering, deep learning realizes end-to-end feature learning and fusion and reduces the complexity of manual design. Management modules are built for different interaction scenes; for example, in a smart-home scene, the operation flow of the household devices can be predefined, and different device operation APIs are called according to multi-modal information such as the user's voice instructions and gesture actions, so as to control the corresponding devices. Compared with rule-based methods, a dialogue management model trained with deep reinforcement learning enables more flexible management of the interaction flow. By combining methods such as knowledge graphs and user profiles, information such as user preferences and usage habits is continuously accumulated to realize a personalized interaction experience. As the number of user interactions increases, the interaction strategy can be continuously optimized, improving command recognition accuracy and naturalness.
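An illustrative sketch of attention-based weighted fusion of per-modality feature vectors, in the spirit of the fusion network described above; the feature dimensions, hidden size, number of output commands and module names are assumptions for this example.

```python
# Illustrative attention-weighted fusion of per-modality features.
# Feature dimensions, the shared projection size and module names are assumptions.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Projects each modality to a shared space, scores it, and fuses by weighted sum."""
    def __init__(self, modal_dims: dict, hidden: int = 128, num_commands: int = 10):
        super().__init__()
        self.proj = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in modal_dims.items()})
        self.score = nn.Linear(hidden, 1)             # attention score per modality
        self.head = nn.Linear(hidden, num_commands)   # output network -> command logits

    def forward(self, feats: dict) -> torch.Tensor:
        projected = torch.stack([torch.tanh(self.proj[m](v)) for m, v in feats.items()])
        weights = torch.softmax(self.score(projected), dim=0)   # learned modality weights
        fused = (weights * projected).sum(dim=0)                # modality-complementary feature
        return self.head(fused)                                 # command/control signal


model = AttentionFusion({"image": 512, "voice": 257, "text": 64})
logits = model({"image": torch.randn(512), "voice": torch.randn(257), "text": torch.randn(64)})
print(logits.shape)  # torch.Size([10])
```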
In the multi-modal fusion data flow, the system judges according to the complexity whether several high-performance computing modules need to be run; dynamic scheduling of the sub-modality classifier models is realized through multiple computing configurations, ensuring efficient multi-modal fusion precision and interaction performance, and the fused features are input into the output processing module to obtain the final prediction. This process realizes the deep fusion of the features of different modalities, models the dependencies among multiple modalities, and improves the performance of the interaction system.
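An illustrative sketch of complexity-based scheduling: simple inputs are handled sequentially, while inputs carrying many modalities are dispatched in parallel, standing in for the multiple high-performance computing modules; the threshold, worker count and placeholder classifier are assumptions for this example.

```python
# Illustrative complexity-based scheduling of sub-modality classifiers across
# several computing modules. The threshold and thread-pool size are assumptions.
from concurrent.futures import ThreadPoolExecutor


def run_classifier(modality: str, features) -> str:
    # Placeholder for a sub-modality classifier model.
    return f"{modality}: classified"


def schedule(feature_batch: dict, complexity_threshold: int = 2) -> list:
    """Run sequentially for simple inputs; dispatch in parallel when many modalities arrive."""
    if len(feature_batch) <= complexity_threshold:
        return [run_classifier(m, f) for m, f in feature_batch.items()]
    with ThreadPoolExecutor(max_workers=4) as pool:      # one worker per computing module
        futures = [pool.submit(run_classifier, m, f) for m, f in feature_batch.items()]
        return [fut.result() for fut in futures]


print(schedule({"voice": 1, "image": 2, "text": 3, "touch": 4}))
```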
In a specific implementation, for example, when the user wants to open the living-room curtain, he can say to the system: "Open the living-room curtain." The information acquisition layer of the system receives the voice input and transmits it to the data processing layer for preprocessing, including voice recognition and semantic understanding. The preprocessed voice data is sent to the fusion calculation layer to generate the corresponding sub-modality classifier model. If no other modal data is input, this model is directly used to generate the interactive feedback, that is, the living-room curtain is opened and the user is informed through the speaker: "The living-room curtain has been opened." If the user has predefined gesture actions to control the living-room curtain, for instance swinging the hand to the left to open the curtain and to the right to close it, and the user also swings the hand to the left while issuing the voice command, then both voice and image modal data are present; the multi-modal inputs are processed and fused in this process, which greatly enhances interaction accuracy. Meanwhile, the system can dynamically adjust the fusion strategy and output mode according to user feedback, making the interaction more natural and efficient.
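An illustrative sketch of the decision logic in this living-room-curtain example, combining a voice intent with an optional gesture intent; the intent labels, confidence values and agreement rule are assumptions for this example and are not part of the invention as claimed.

```python
# Illustrative decision logic for the living-room curtain example: a voice intent
# and an optional gesture intent are combined into one device command.
from typing import Optional, Tuple


def fuse_intents(voice_intent: Tuple[str, float],
                 gesture_intent: Optional[Tuple[str, float]] = None) -> str:
    """Each intent is (label, confidence); the gesture modality may be absent."""
    v_label, v_conf = voice_intent
    if gesture_intent is None:
        return v_label if v_conf > 0.5 else "ask_user_to_repeat"
    g_label, g_conf = gesture_intent
    if v_label == g_label:                               # modalities agree -> accept
        return v_label
    return v_label if v_conf >= g_conf else g_label      # otherwise trust the stronger cue


# Voice command "open the living-room curtain" plus a leftward hand swing.
print(fuse_intents(("open_living_room_curtain", 0.82),
                   ("open_living_room_curtain", 0.74)))
```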

Claims (10)

1. The interactive processing system based on the multi-mode feature fusion is characterized by comprising an information acquisition layer, a data processing layer, a fusion calculation layer and an interactive feedback layer; the information acquisition layer is used for acquiring a plurality of initial modal interaction data of the target object; the data processing layer is used for preprocessing the acquired multiple initial modal interaction data; the fusion calculation layer is used for carrying out feature fusion on the processed multi-mode interaction data; and the interaction feedback layer is used for outputting the fused interaction result.
2. The interactive processing system based on multi-modal feature fusion according to claim 1, wherein the information acquisition layer comprises a camera module, a voice module, a human body sensing module and a touch control module, and is used for the initial acquisition of each modality's data from the target object.
3. The interactive processing system based on multi-modal feature fusion according to claim 1, wherein the data processing layer builds a pre-set multi-modal pre-processing model according to the pre-set of each modal sub-model.
4. The interactive processing system based on multi-mode feature fusion according to claim 1, wherein the fusion computing layer is a plurality of high-performance computing modules and is used for performing multi-mode classification fusion on the mode data of the data processing layer.
5. The interactive processing system based on multi-modal feature fusion according to claim 1, wherein the interaction feedback layer selects a specific feedback mode according to the output fusion result, and makes an accurate response and next judgment regarding the target object.
6. A processing method applied to the interactive processing system based on multi-modal feature fusion according to claim 1, characterized by comprising the steps of:
step 1, acquiring a plurality of initial modal interaction data of a target object through an information acquisition layer;
step 2, preprocessing a plurality of initial modal interaction data through a data processing layer based on the initial modal interaction data;
step 3, feature fusion is carried out on the preprocessed multi-mode interaction data through a fusion calculation layer;
step 4, outputting the interaction result after feature fusion through the interaction feedback layer and providing feedback.
7. The interaction processing method based on multi-modal feature fusion according to claim 6, wherein in step 2, preprocessing the plurality of initial modal interaction data through the data processing layer includes the following steps:
step 11, transmitting the modal data of each sub-module in the information acquisition layer to the data preprocessing layer through wired or wireless communication;
step 12, the data preprocessing layer determining the modality label type of the initial modal data based on a preset multi-modal preprocessing model;
step 13, for image modal data, cropping and scaling the image data through image preprocessing and adjusting its proportions, then performing data enhancement on the image to improve robustness, and normalizing the image pixels to the range [0,1] to reduce the data dimension;
step 14, for voice modal data, performing endpoint detection through voice preprocessing, extracting useful voice segments, and extracting voice frequency-domain features using the short-time Fourier transform; enhancing voice quality through filtering and noise-reduction techniques, dividing the voice into fixed-length segments, and converting them into spectrogram inputs;
step 15, for text modal data, performing text analysis through text preprocessing using natural language processing techniques such as word segmentation and part-of-speech tagging; constructing word vector representations of the text and converting them into vector inputs of fixed dimension; adopting an attention mechanism so that the model automatically learns the importance weights of words, and padding or clipping the input text data to ensure consistent lengths;
step 16, after the different modal data are preprocessed, splicing the image, voice and text modal feature vectors and outputting them to the fusion calculation layer through the sub-preset models.
8. The interaction processing method based on multi-modal feature fusion according to claim 6, wherein in the step 3, the specific process of feature fusion of the preprocessed multi-modal interaction data through the fusion calculation layer is as follows:
step 31, classifying and extracting features from the preprocessed voice, touch, text and somatosensory modal data, and extracting and fusing the multi-modal features with a deep learning network model;
step 32, extracting spatial features from images and videos with a convolutional neural network, extracting temporal features from voice and text with a recurrent neural network, and performing weighted fusion of the features of different modalities with an attention mechanism;
step 33, learning the weights of the different modal features with a multi-modal fusion function to obtain fused features; the fusion network adaptively obtains the optimal combination of different modalities through end-to-end training, realizing modality complementarity; the fused features are input into an output network to obtain a command or control signal matched to the interaction scene;
step 34, constructing dialogue management models for different interaction scenes, training the dialogue management models with deep reinforcement learning, predefining operation flows, and completing flexible control of the corresponding devices according to multi-modal information such as the user's voice instructions and gesture actions;
step 35, combining knowledge graph and user profile methods, continuously accumulating the user's preference and usage-habit information, continuously optimizing the interaction strategy, improving command recognition accuracy and naturalness, obtaining the final fusion result, and feeding it back to the interaction feedback layer.
9. The interactive processing method based on multi-modal feature fusion according to claim 6, wherein the multi-modal classification fusion process is dynamically adjusted through multiple computing configurations by the high-performance computing modules.
10. The interactive processing method based on multi-modal feature fusion according to claim 6, wherein in step 4 the feedback mode of the interaction feedback layer is used in combination with the current voice, action, somatosensory and touch interaction modes, and, in response to the fusion result of the target object, further judgments are made about the target object's habits, behaviors, emotions and other human activities.
CN202311071778.7A 2023-08-24 2023-08-24 Man-machine interaction system and method based on multi-mode feature fusion Pending CN117193524A (en)

Priority Applications (1)

Application Number: CN202311071778.7A · Priority Date: 2023-08-24 · Filing Date: 2023-08-24 · Title: Man-machine interaction system and method based on multi-mode feature fusion

Applications Claiming Priority (1)

Application Number: CN202311071778.7A · Priority Date: 2023-08-24 · Filing Date: 2023-08-24 · Title: Man-machine interaction system and method based on multi-mode feature fusion

Publications (1)

Publication Number: CN117193524A · Publication Date: 2023-12-08

Family

ID=88993286

Family Applications (1)

Application Number: CN202311071778.7A · Title: Man-machine interaction system and method based on multi-mode feature fusion · Priority Date: 2023-08-24 · Filing Date: 2023-08-24

Country Status (1)

Country Link
CN (1) CN117193524A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117528197A (en) * 2024-01-08 2024-02-06 北京天工异彩影视科技有限公司 High-frame-rate playback type quick virtual film making system
CN117528197B (en) * 2024-01-08 2024-04-02 北京天工异彩影视科技有限公司 High-frame-rate playback type quick virtual film making system
CN118210383A (en) * 2024-05-21 2024-06-18 南通亚森信息科技有限公司 Information input system

Similar Documents

Publication Publication Date Title
CN110211563B (en) Chinese speech synthesis method, device and storage medium for scenes and emotion
CN105843381B (en) Data processing method for realizing multi-modal interaction and multi-modal interaction system
CN111312245B (en) Voice response method, device and storage medium
CN117193524A (en) Man-machine interaction system and method based on multi-mode feature fusion
CN112162628A (en) Multi-mode interaction method, device and system based on virtual role, storage medium and terminal
CN107452379B (en) Dialect language identification method and virtual reality teaching method and system
CN109410927A (en) Offline order word parses the audio recognition method combined, device and system with cloud
CN110838286A (en) Model training method, language identification method, device and equipment
KR20210070213A (en) Voice user interface
KR101738142B1 (en) System for generating digital life based on emotion and controlling method therefore
CN111161726B (en) Intelligent voice interaction method, device, medium and system
CN117234341B (en) Virtual reality man-machine interaction method and system based on artificial intelligence
CN106653020A (en) Multi-business control method and system for smart sound and video equipment based on deep learning
CN113516990A (en) Voice enhancement method, method for training neural network and related equipment
CN116564338B (en) Voice animation generation method, device, electronic equipment and medium
CN111128240B (en) Voice emotion recognition method based on anti-semantic-erasure
CN111092798B (en) Wearable system based on spoken language understanding
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN117251057A (en) AIGC-based method and system for constructing AI number wisdom
Atkar et al. Speech emotion recognition using dialogue emotion decoder and CNN Classifier
CN116994600B (en) Method and system for driving character mouth shape based on audio frequency
CN117219046A (en) Interactive voice emotion control method and system
CN110363074B (en) Humanoid recognition interaction method for complex abstract events
CN115167674A (en) Intelligent interaction method based on digital human multi-modal interaction information standard
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product

Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination