CN114998700B - Immersion degree calculation method and system for multi-feature fusion in man-machine interaction scene - Google Patents

Immersion degree calculation method and system for multi-feature fusion in man-machine interaction scene

Info

Publication number
CN114998700B
Authority
CN
China
Prior art keywords
user
human
feature
features
coordinate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210663978.0A
Other languages
Chinese (zh)
Other versions
CN114998700A (en)
Inventor
李树涛 (Li Shutao)
宋启亚 (Song Qiya)
孙斌 (Sun Bin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202210663978.0A priority Critical patent/CN114998700B/en
Publication of CN114998700A publication Critical patent/CN114998700A/en
Application granted granted Critical
Publication of CN114998700B publication Critical patent/CN114998700B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 - Eye characteristics, e.g. of the iris
    • G06V40/193 - Preprocessing; Feature extraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70 - Multimodal biometrics, e.g. combining information from different biometric modalities
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Engineering & Computer Science (AREA)
  • Ophthalmology & Optometry (AREA)
  • Manipulator (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an immersion degree calculation method and system for multi-feature fusion in a human-machine interaction scene. The method comprises: determining sequences of multiple features extracted from a real-time image of a user in a human-computer interaction scene, the features comprising part or all of a human-computer interaction distance feature, a human body posture feature, a head posture feature, a facial posture feature, an eye posture feature and a lip movement feature; jointly representing the sequences of the multiple features to obtain a feature representation vector Hde; and classifying the feature representation vector with a classifier to obtain the immersion degree. The method and system can realize multi-feature-fusion immersion calculation in a human-machine interaction scene, accurately evaluate the interaction willingness between a person and a robot, and effectively improve the interaction experience in the human-machine interaction scene; by combining the time-sequence information of the multiple features, the accuracy of the user immersion calculation is effectively improved, which in turn enables selection of the target user and control of the robot's working state.

Description

Immersion degree calculation method and system for multi-feature fusion in man-machine interaction scene
Technical Field
The invention belongs to the technical fields of computer vision, artificial intelligence and the like, and particularly relates to an immersion degree calculation method and system for multi-feature fusion in a man-machine interaction scene.
Background
Information exchange between a robot and a person can be realized through various modes of expression, such as voice, natural language, action and facial expression. Voice is one of the most direct and information-rich communication channels between humans and machines, and is an important interface for intelligent human-computer interaction. However, in a real human-machine interaction scene, complex environmental noise interferes with the performance of voice interaction, so that the machine is falsely awakened or the interaction is easily interrupted, which hinders intelligent robot interaction and degrades the interaction experience. Immersion degree calculation collects information of different modalities in the user's expression through different sensors and technical means, and analyzes various kinds of feature information that can measure the user's immersion during human-computer interaction, so as to estimate the user's interaction willingness. In practice, however, voice is easily affected by the environment and is difficult to track, whereas visual information is relatively stable. By analyzing the user's visual behavior to evaluate the user's immersion level, the robot can understand the user's interaction willingness and provide the corresponding services.
Before human-machine interaction, by evaluating the immersion of users in the surrounding environment, the robot selects a user with high immersion and actively provides services for that user; this effectively avoids the robot reacting identically to users with different immersion levels, avoids false voice wake-ups in noisy environments, and improves the robot's willingness to interact proactively. During human-machine interaction, the robot judges the user's interaction willingness by continuously monitoring the user's immersion and adjusts the degree of interaction accordingly, improving the naturalness of the interaction. When the user's immersion is low, the machine's interactive behavior and willingness are reduced or suspended, which avoids invalid interaction and improves the working efficiency of the robot. Therefore, how to effectively calculate user immersion and judge interaction willingness in a human-computer interaction scene is a key technical problem that must be solved to realize intelligent robot interaction.
Disclosure of Invention
The technical problem the invention aims to solve: in view of the problems in the prior art, the invention provides an immersion degree calculation method and system for multi-feature fusion in a human-computer interaction scene, which can realize multi-feature-fusion immersion calculation in the human-computer interaction scene, accurately evaluate the interaction willingness between a person and a robot, effectively improve the interaction experience in the human-computer interaction scene, and, by combining the time-sequence information of multiple features, effectively improve the accuracy of user immersion calculation.
In order to solve the technical problems, the invention adopts the following technical scheme:
a multi-feature fusion immersion degree calculation method under a man-machine interaction scene comprises the following steps:
1) Determining a sequence of various features extracted from a real-time image of a user in a human-computer interaction scene, wherein the various features comprise part or all of human-computer interaction distance features, human body posture features, head posture features, facial posture features, eye posture features and lip movement features;
2) Performing joint representation on the sequences of the multiple features to obtain a feature representation vector Hde;
3) And classifying the feature expression vector Hde by a classifier to obtain the immersion degree of the user.
Optionally, step 2) includes: standardizing the sequences of the multiple features respectively, so that their dimensions are the same and their values are normalized, forming feature data X; and feeding the feature data X into a gated recurrent unit (GRU) timing network to perform time-sequence modeling on the sequence of each feature, and then selecting the input-layer feature vectors through an attention layer Att to obtain the fused feature representation vector Hde.
Optionally, the classifier adopted in the step 3) is a multi-layer perceptron, and the function expression is:
P(Y) =sigmoid(MLP(Hde))
in the above formula, P(Y) is the obtained immersion degree of the user, sigmoid is the logistic (sigmoid) function that normalizes the output, and MLP(Hde) is the confidence value obtained by the multi-layer perceptron regressing the feature representation vector Hde.
Optionally, the plurality of features in step 1) include a human-computer interaction distance feature, and the expression of the calculation function of the human-computer interaction distance feature is:
In the above formula, f(d_i) is the human-computer interaction distance feature of the i-th user; x_{i,1} and y_{i,1} are the x and y coordinates of the three-dimensional spatial coordinate of the neck feature point of the i-th user; x_{i,2} and y_{i,2} are the x and y coordinates of the three-dimensional spatial coordinate of the left-shoulder feature point of the i-th user; and x_{i,5} and y_{i,5} are the x and y coordinates of the three-dimensional spatial coordinate of the right-shoulder feature point of the i-th user. The neck, left-shoulder and right-shoulder feature points are all obtained by human-body pose estimation on the real-time image of the user in the human-computer interaction scene, and their three-dimensional spatial coordinates are obtained by an image-coordinate-system transformation based on the point coordinates and the depth.
Optionally, the human body posture feature in step 1) includes a human body azimuth feature, and the expression of the calculation function of the human body azimuth feature is:
In the above formula, f(α_i) is the human azimuth angle feature of the i-th user, α_i is the human azimuth angle of the i-th user, and x_{i,1} and y_{i,1} are the x and y coordinates of the three-dimensional spatial coordinate of the neck feature point. The neck feature point is obtained by human-body pose estimation on the real-time image of the user in the human-computer interaction scene, and its three-dimensional spatial coordinate is obtained by an image-coordinate-system transformation based on the point coordinate and the depth.
Optionally, the expression of the calculation function of the head pose feature in step 1) is:
In the above formula, f(h_i) is the head posture feature of the i-th user, α_i is the human azimuth angle of the i-th user, β_i is the head angle of the i-th user, x_{i,1} and y_{i,1} are the x and y coordinates of the three-dimensional spatial coordinate of the neck feature point, and x_{i,k} and y_{i,k} are the x and y coordinates of the three-dimensional spatial coordinate of the k-th nose feature point. The neck feature point and the k-th nose feature point are obtained by human-body pose estimation on the real-time image of the user in the human-computer interaction scene, and their three-dimensional spatial coordinates are obtained by an image-coordinate-system transformation based on the point coordinates and the depth.
Optionally, the expression of the calculation function of the lip movement feature in step 1) is:
In the above formula, f(lar_i) is the lip movement feature of the i-th user, lar_i is the vertical (up-down) lip distance of the i-th user, and σ is the threshold for judging lip movement: when lar_i is greater than or equal to σ, the lips are judged to be open and the lip movement feature is 1; otherwise the lips are judged to be closed and the lip movement feature is 0. The vertical lip distance of the i-th user is obtained by locating the key lip feature points in the real-time image of the user in the human-computer interaction scene and computing from the located key-point coordinates.
Optionally, step 3) further includes a step of comparing the user's immersion degree with a preset threshold: if the user's immersion degree is smaller than the set threshold and the robot is currently in the human-machine interaction state, the robot's human-machine interaction state is suspended or exited; if the user's immersion degree is greater than or equal to the set threshold and the robot is currently in a non-interaction state, the number of detected users is first judged: if it is greater than 1, the user with the highest immersion degree is selected as the target user, and if it is equal to 1, that user is selected as the target user; the robot is then woken up into the human-machine interaction state to interact with the target user. The robot's sensing system stays in the working state in both the interaction and non-interaction states, whereas the robot's motion system is in the working state only in the interaction state and is in the non-working state in the non-interaction state.
In addition, the invention also provides an immersion computing system for multi-feature fusion in a man-machine interaction scene, which comprises a microprocessor and a memory which are connected with each other, wherein the microprocessor is programmed or configured to execute the steps of the immersion computing method for multi-feature fusion in the man-machine interaction scene.
Furthermore, the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program is used for being programmed or configured by a microprocessor to execute the steps of the immersion degree computing method of multi-feature fusion in the man-machine interaction scene.
Compared with the prior art, the invention has the following advantages:
1. Considering that evaluating the user's immersion both before and during interaction can improve the user's interaction experience, the invention realizes multi-feature-fusion immersion calculation in a human-machine interaction scene, accurately evaluates the interaction willingness between a person and a robot, and effectively improves the interaction experience in the human-machine interaction scene.
2. The method determines sequences of multiple features extracted from the real-time image of the user in the human-computer interaction scene; these sequences contain time-sequence information, and through the variety of features (part or all of the human-computer interaction distance feature, human body posture feature, head posture feature, facial posture feature, eye posture feature and lip movement feature) combined with their time-sequence information, the accuracy of user immersion calculation can be effectively improved.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the working principle of the method according to the embodiment of the invention.
Fig. 3 is a schematic diagram of human body feature points used by the lips according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of input data preprocessing and timing relationship modeling in an embodiment of the present invention.
FIG. 5 is a flow chart of comparing immersion with a preset threshold in an embodiment of the present invention.
Detailed Description
As shown in fig. 1, the immersion degree calculating method for multi-feature fusion in the human-computer interaction scene of the embodiment includes:
1) Determining a sequence of various features extracted from a real-time image of a user in a human-computer interaction scene, wherein the various features comprise part or all of human-computer interaction distance features, human body posture features, head posture features, facial posture features, eye posture features and lip movement features;
2) Performing joint representation on the sequences of the multiple features to obtain a feature representation vector Hde;
3) And classifying the feature expression vector Hde by a classifier to obtain the immersion degree of the user.
The interaction objects in the human-machine interaction scene of this embodiment include a person and a robot. The robot may be anthropomorphic or non-anthropomorphic, and comprises a motion system and a sensing system, the sensing system including a sensor and a processor and being mainly used to execute the multi-feature-fusion immersion calculation method of this embodiment. In this embodiment the sensor is a depth camera, but it should be noted that other types of sensors, or combinations of sensors, may be adopted as needed, provided the various features can be acquired.
Referring to fig. 2, steps 2) and 3) of this embodiment perform the multi-feature-fusion immersion calculation. The joint representation of the feature sequences in step 2) aims to obtain a feature representation vector Hde that integrates the multiple features; accordingly, various joint-representation and data-fusion methods may be adopted as required, such as a recurrent neural network (RNN), a long short-term memory network (LSTM), or a gated recurrent network. As an optional implementation, step 2) in this embodiment includes: standardizing (i.e. preprocessing) the sequences of the multiple features respectively, so that their dimensions are the same and their values are normalized to the range [0, 1], forming feature data X; feeding the feature data X into a gated recurrent unit (GRU) timing network to perform time-sequence modeling on the sequence of each feature; and then selecting the input-layer feature vectors through an attention layer Att to obtain the fused feature representation vector Hde, which can be expressed as Hde = Att(GRU(X)). Because the GRU has fewer parameters than the RNN and the LSTM, it effectively alleviates vanishing and exploding gradients in sequence modeling, and its smaller parameter count also reduces the model's computation time, giving better real-time performance. Since the model takes multiple time-sequence features as input, the attention layer Att can assign different weights to the input features through an attention mechanism and effectively select the feature sequences that are important for immersion evaluation; therefore the model of this embodiment uses a gated attention timing network (the GRU network plus the attention layer Att) to represent and fuse the different features. It should be noted that the GRU and the attention layer Att are existing neural-network structures; this embodiment involves only their basic application and not any improvement of them, so their details are not described here.
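As an illustrative sketch of this fusion step (not a reproduction of the patent's network), the following PyTorch code implements a gated attention timing network of the kind described: a GRU encodes the per-frame feature vectors and an attention layer pools its hidden states into the fused representation Hde. The hidden size, the additive-attention form and all names are assumptions.

```python
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    """Hde = Att(GRU(X)): encode per-frame feature vectors with a GRU,
    then pool the hidden states with a learned attention layer."""

    def __init__(self, feat_dim=4, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Additive attention that scores each time step of the GRU output.
        self.att_score = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1))

    def forward(self, x):                 # x: (batch, t, feat_dim), values in [0, 1]
        h, _ = self.gru(x)                # h: (batch, t, hidden_dim)
        w = torch.softmax(self.att_score(h), dim=1)   # per-frame attention weights
        hde = (w * h).sum(dim=1)          # fused feature representation vector Hde
        return hde
```

Pooling with learned attention rather than keeping only the last GRU state is one way to let the model weight the frames that matter most for immersion evaluation, which matches the role the description assigns to the attention layer.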
Referring to fig. 4, step 3) essentially completes the modeling from the time-sequence feature representation vector Hde to the immersion degree. The classifier used in step 3) realizes the mapping between the feature representation vector Hde and the user's immersion, so any feasible classifier model can be adopted as needed. As an optional implementation, step 3) of this embodiment uses a multi-layer perceptron to map the feature vector into the dimension space of the output value, thereby realizing the mapping between the feature representation vector Hde and the user's immersion degree; the function expression is:
P(Y) = sigmoid(MLP(Hde)),
In the above formula, P(Y) is the obtained immersion degree of the user, sigmoid is the logistic (sigmoid) function that normalizes the output, and MLP(Hde) is the confidence value obtained by the multi-layer perceptron regressing the feature representation vector Hde. It should be noted that the multi-layer perceptron is an existing classifier model; this embodiment involves only its basic application and not any improvement of it, so its implementation details are not described here.
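A minimal sketch of the classification step P(Y) = sigmoid(MLP(Hde)) is shown below; the two-layer structure and layer widths are assumptions, since the patent only specifies a multi-layer perceptron followed by a sigmoid.

```python
import torch
import torch.nn as nn

class ImmersionHead(nn.Module):
    """Map the fused vector Hde to an immersion degree P(Y) in (0, 1)."""

    def __init__(self, hidden_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 32), nn.ReLU(),
            nn.Linear(32, 1))                     # confidence value returned by the MLP

    def forward(self, hde):                       # hde: (batch, hidden_dim)
        return torch.sigmoid(self.mlp(hde))       # P(Y) = sigmoid(MLP(Hde))
```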
It should be noted that "determining" in step 1) the sequences of multiple features extracted from the real-time image of the user in the human-computer interaction scene may be understood as selecting or receiving them as input; that is, the extraction of the multiple feature sequences from the real-time image of the user is performed before step 1).
In this embodiment, the real-time image of the user in the human-computer interaction scene is an image to be detected captured with a depth camera, comprising an RGB image I_r and a depth image I_d formed from the depth information. Before the multiple features can be extracted, the RGB image I_r and the depth image I_d are aligned to obtain a color image I_s carrying depth information; alternatively, a depth camera may directly output the color image I_s with depth information. On this basis, using the camera's imaging model and an image transformation, the position (three-dimensional spatial coordinate) of each pixel relative to the robot's coordinate system can be obtained; then, from the feature points of the body, head, face, eyes and lips, the human-computer interaction distance feature, human body posture feature, head posture feature, facial posture feature, eye posture feature and lip movement feature can be extracted. Referring to fig. 2, as an optional implementation, this embodiment specifically adopts four features: the human-computer interaction distance feature, the human azimuth angle feature (a human body posture feature), the head posture feature and the lip movement feature, so the obtained feature data X can be represented as X = {f(d_i), f(α_i), f(h_i), f(lar_i)}, where f(d_i) is the human-computer interaction distance feature of the i-th user, f(α_i) is the human azimuth angle feature of the i-th user, f(h_i) is the head posture feature of the i-th user, and f(lar_i) is the lip movement feature of the i-th user.
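The "image-coordinate-system transformation based on the point coordinates and depth" referred to throughout can be illustrated with a standard pinhole back-projection. The intrinsic parameters (fx, fy, cx, cy) and the assumption that the camera frame coincides with the robot frame are illustrative only; the patent does not specify the camera model.

```python
import numpy as np

def pixel_to_3d(u, v, depth_m, fx, fy, cx, cy):
    """Back-project an aligned pixel (u, v) with depth (metres) to a 3-D point
    in the camera frame; a further rigid transform would map it to the robot frame."""
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```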
For a sequence of image frames I_s = {I_s^m | m = 1, 2, 3, ..., t}, where I_s^m is the m-th image frame, the human-computer interaction distance feature is written as f(d_i) = {(f(d_i))^m | m = 1, 2, 3, ..., t}, where (f(d_i))^m is the human-computer interaction distance feature corresponding to the m-th image frame I_s^m; the human azimuth angle feature as f(α_i) = {(f(α_i))^m | m = 1, 2, 3, ..., t}, where (f(α_i))^m is the human azimuth angle feature corresponding to the m-th image frame I_s^m; the head posture feature as f(h_i) = {(f(h_i))^m | m = 1, 2, 3, ..., t}, where (f(h_i))^m is the head posture feature corresponding to the m-th image frame I_s^m; and the lip movement feature as f(lar_i) = {(f(lar_i))^m | m = 1, 2, 3, ..., t}, where (f(lar_i))^m is the lip movement feature corresponding to the m-th image frame I_s^m, and t is the number of image frames.
In this embodiment, the plurality of features in step 1) include a human-computer interaction distance feature, and the expression of the calculation function of the human-computer interaction distance feature is:
In the above formula, f(d_i) is the human-computer interaction distance feature of the i-th user; x_{i,1} and y_{i,1} are the x and y coordinates of the three-dimensional spatial coordinate of the neck feature point of the i-th user; x_{i,2} and y_{i,2} are the x and y coordinates of the three-dimensional spatial coordinate of the left-shoulder feature point of the i-th user; and x_{i,5} and y_{i,5} are the x and y coordinates of the three-dimensional spatial coordinate of the right-shoulder feature point of the i-th user. The neck, left-shoulder and right-shoulder feature points are obtained by human-body pose estimation on the real-time image of the user in the human-computer interaction scene, and their three-dimensional spatial coordinates are obtained by an image-coordinate-system transformation based on the point coordinates and the depth. A larger f(d_i) indicates a larger human-robot interaction distance, whereas a smaller f(d_i) indicates a smaller one.
In this embodiment, the human keypoint detection framework Lightweight OpenPose is adopted for human-body pose estimation, yielding 18 human-body keypoints Ha:
Ha = {(u_{i,j}, v_{i,j}) | i = 1, 2, 3, ..., n; j = 1, 2, 3, ..., 18},
where n is the number of users and j indexes the extracted keypoints. The neck, left-shoulder and right-shoulder feature points are among these 18 keypoints, specifically the 1st, 2nd and 5th keypoints of Ha; their coordinates in the RGB image I_r can be expressed as:
Hc = {(u_{i,1}, v_{i,1}), (u_{i,2}, v_{i,2}), (u_{i,5}, v_{i,5})},
and the three-dimensional spatial coordinates obtained by the image-coordinate-system transformation are:
Hc = {(x_{i,1}, y_{i,1}, z_{i,1}), (x_{i,2}, y_{i,2}, z_{i,2}), (x_{i,5}, y_{i,5}, z_{i,5})}.
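Since the patent's formula image for f(d_i) is not reproduced in this text, the sketch below only illustrates one plausible distance feature consistent with the description: the planar distance from the robot origin to the torso centre computed from the neck, left-shoulder and right-shoulder coordinates above. The averaging, the normalization and the d_max range are assumptions, not the patent's formula.

```python
import numpy as np

def interaction_distance_feature(neck, l_shoulder, r_shoulder, d_max=5.0):
    """Hypothetical f(d_i): human-robot distance feature from three 3-D keypoints.

    neck, l_shoulder, r_shoulder: (x, y, z) in the robot frame (metres).
    d_max: assumed maximum interaction range used for scaling.
    """
    torso_xy = np.mean([neck[:2], l_shoulder[:2], r_shoulder[:2]], axis=0)
    d_i = float(np.linalg.norm(torso_xy))      # planar distance to the robot origin
    return min(d_i / d_max, 1.0)               # scaled to [0, 1] (assumed)
```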
In this embodiment, the human body posture feature in step 1) includes a human body azimuth feature, and the expression of the calculation function of the human body azimuth feature is:
In the above formula, f(α_i) is the human azimuth angle feature of the i-th user, α_i is the human azimuth angle of the i-th user, and x_{i,1} and y_{i,1} are the x and y coordinates of the three-dimensional spatial coordinate of the neck feature point. The neck feature point is obtained by human-body pose estimation on the real-time image of the user in the human-computer interaction scene, and its three-dimensional spatial coordinate is obtained by an image-coordinate-system transformation based on the point coordinate and the depth.
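The azimuth formula is likewise not reproduced here. A plausible reading, given that only the neck point's x and y coordinates are used, is α_i = atan2(y_{i,1}, x_{i,1}); everything in this sketch beyond that reading, including the normalization, is an assumption.

```python
import math

def azimuth_feature(neck):
    """Hypothetical f(α_i): body-azimuth feature from the neck keypoint (x, y, z)."""
    alpha_i = math.atan2(neck[1], neck[0])      # angle of the user relative to the robot
    return abs(alpha_i) / math.pi               # assumed normalization to [0, 1]
```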
The head pose reflects whether the user focuses on the robot, and feature points (nose feature points) on the nose are selected from face feature points to perform head pose estimation. In this embodiment, the expression of the calculation function of the head pose feature in step 1) is:
In the above formula, f(h_i) is the head posture feature of the i-th user, α_i is the human azimuth angle of the i-th user, β_i is the head angle of the i-th user, x_{i,1} and y_{i,1} are the x and y coordinates of the three-dimensional spatial coordinate of the neck feature point, and x_{i,k} and y_{i,k} are the x and y coordinates of the three-dimensional spatial coordinate of the k-th nose feature point. The neck feature point and the k-th nose feature point are obtained by human-body pose estimation on the real-time image of the user in the human-computer interaction scene, and their three-dimensional spatial coordinates are obtained by an image-coordinate-system transformation based on the point coordinates and the depth.
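The head-pose formula is also not reproduced. The description says it combines the body azimuth α_i with a head angle β_i computed from the neck and a nose feature point, so the sketch below assumes β_i is the planar angle of the nose relative to the neck and that the feature is the normalized difference of the two angles; both assumptions are illustrative only.

```python
import math

def head_pose_feature(neck, nose):
    """Hypothetical f(h_i): head-pose feature from neck and nose 3-D keypoints."""
    alpha_i = math.atan2(neck[1], neck[0])                       # body azimuth
    beta_i = math.atan2(nose[1] - neck[1], nose[0] - neck[0])    # assumed head angle
    return abs(beta_i - alpha_i) / math.pi                       # assumed normalization
```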
In this embodiment, the facial landmark detection model of the open-source Dlib face library is adopted for facial feature point detection, yielding 68 facial feature points Fa:
Fa = {(u_{i,j}, v_{i,j}) | i = 1, 2, 3, ..., n; j = 1, 2, 3, ..., 68},
where n is the number of users, i indexes the users, and j is the serial number of the facial feature point.
In this embodiment, the expression of the calculation function of the lip movement feature in step 1) is:
In the above formula, f(lar_i) is the lip movement feature of the i-th user, lar_i is the vertical (up-down) lip distance of the i-th user, and σ is the threshold for judging lip movement: when lar_i is greater than or equal to σ, the lips are judged to be open and the lip movement feature is 1; otherwise the lips are judged to be closed and the lip movement feature is 0. The vertical lip distance of the i-th user is obtained by locating the key lip feature points in the real-time image of the user in the human-computer interaction scene and computing from the located key-point coordinates. Whether the user is speaking is judged from the lip movement, which reflects the user's intention to interact with the robot; 8 keypoints of the lip region are selected from the facial feature points to calculate the vertical lip-movement distance, namely:
In the above formula, fk_k denotes the k-th facial feature point, with k taken from {49, 51, 52, 53, 55, 57, 58, 59}; that is, 8 facial feature points in the lip region are selected to calculate the vertical lip-movement distance. Referring to fig. 3, k = 49 is the facial feature point at the left mouth corner, k = 55 the point at the right mouth corner, k = 51 to 53 the point at the middle of the upper lip and the points on either side of it, and k = 57 to 59 the point at the middle of the lower lip and the points on either side of it. The vertical lip-movement distance is indicated by the arrowed line segment between facial feature points fk_52 and fk_58; in addition, the horizontal lip-movement distance, indicated by the arrowed line segment between facial feature points fk_49 and fk_55, can also be used as an optional lip movement feature.
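A sketch of the lip-movement feature follows, using the 68-point landmark numbering named above (points 49 to 59 in the figure's 1-indexed convention, i.e. indices 48 to 58 in Dlib's 0-indexed arrays). Normalizing the vertical opening by the fk_49 to fk_55 mouth width, and the threshold value, are assumptions, since the patent's formula and threshold are not reproduced here.

```python
import math

def lip_movement_feature(landmarks, sigma=0.25):
    """Hypothetical f(lar_i): 1 if the lips are judged open, else 0.

    landmarks: list of 68 (u, v) facial landmarks (0-indexed, Dlib convention).
    sigma: assumed threshold on the lip-aspect ratio.
    """
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    vertical = dist(landmarks[51], landmarks[57])   # fk_52 to fk_58 (1-indexed)
    width = dist(landmarks[48], landmarks[54])      # fk_49 to fk_55, for scale (assumed)
    lar_i = vertical / width if width > 0 else 0.0
    return 1 if lar_i >= sigma else 0
```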
Referring to fig. 4, in this embodiment the sequences of the four features (the human-computer interaction distance feature, the human azimuth angle feature, the head posture feature and the lip movement feature) are denoted x_{1,1}~x_{t,1}, x_{1,2}~x_{t,2}, x_{1,3}~x_{t,3} and x_{1,4}~x_{t,4}, where t is the length of each sequence. The sequences of the multiple features are standardized respectively so that their dimensions are the same and their values are normalized; for example, the sequences in fig. 4 together generate feature data X_1 ~ X_b forming a batch of size b. The feature data X_1 ~ X_b is fed into the gated recurrent unit (GRU) timing network to perform time-sequence modeling on the sequence of each feature, and the input-layer feature vectors are then selected through the attention layer Att to obtain the fused feature representation vector Hde. Finally, the corresponding immersion degree is obtained through the multi-layer perceptron.
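A minimal sketch of the preprocessing described here, under the assumption that "standardization" means per-feature min-max scaling to [0, 1] and stacking the four length-t sequences into a (t, 4) array per sample:

```python
import numpy as np

def build_feature_data(dist_seq, azimuth_seq, head_seq, lip_seq):
    """Stack the four length-t feature sequences into one (t, 4) array X,
    min-max scaling each column to [0, 1] (the scaling scheme is an assumption)."""
    x = np.stack([dist_seq, azimuth_seq, head_seq, lip_seq], axis=1).astype(float)
    col_min, col_max = x.min(axis=0), x.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)   # avoid divide-by-zero
    return (x - col_min) / span
```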
Referring to figs. 2 and 5, step 3) of this embodiment further includes a step of comparing the user's immersion degree with a preset threshold: if the user's immersion degree is less than the preset threshold and the robot is currently in the human-machine interaction state, the robot's interaction state is suspended or exited; if the user's immersion degree is greater than or equal to the set threshold and the robot is currently in a non-interaction state, the number of detected users is first judged (if it is greater than 1, the user with the highest immersion degree is selected as the target user; if it is equal to 1, that user is selected as the target user), and the robot is then woken up into the human-machine interaction state to interact with the target user. The robot's sensing system stays in the working state in both the interaction and non-interaction states, whereas the robot's motion system is in the working state only in the interaction state and is otherwise in the non-working state. In this way, before human-machine interaction, by evaluating the immersion of users in the surrounding environment the robot selects a user with high immersion and actively provides services for that user, which effectively prevents the robot from reacting identically to users with different immersion levels and from being falsely woken up in noisy environments, and improves the robot's willingness to interact proactively; during interaction, the robot judges the user's interaction willingness by continuously monitoring the user's immersion and reduces or suspends its interactive behavior when the immersion is low, thereby avoiding invalid interaction, improving the robot's working efficiency and reducing its energy consumption. The method can actively judge the user's immersion to adjust the interaction state in complex human-machine interaction scenes, effectively improving the naturalness of human-machine interaction.
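The decision logic in the preceding paragraph can be summarized by the following control sketch; the threshold value and the robot's state-handling functions are placeholders introduced for illustration, not part of the patent.

```python
def update_interaction_state(robot, immersion_scores, threshold=0.5):
    """Apply the thresholding policy; immersion_scores maps user_id -> P(Y).

    robot is a hypothetical object exposing in_interaction(), current_user(),
    suspend_interaction() and wake_up(user); the threshold value is assumed.
    """
    if robot.in_interaction():
        # During interaction, monitor the immersion of the current target user.
        if immersion_scores.get(robot.current_user(), 0.0) < threshold:
            robot.suspend_interaction()          # pause or exit the interaction state
    elif immersion_scores:
        # Before interaction, pick the single user or the highest-immersion user.
        target, score = max(immersion_scores.items(), key=lambda kv: kv[1])
        if score >= threshold:
            robot.wake_up(target)                # enter the human-machine interaction state
```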
In summary, in this embodiment, the modeling and fusion are performed on multiple features of the face of the user through the gated attention time sequence network, so as to evaluate the immersion score of the user in the interaction scene. Further, before man-machine interaction, the robot selects users with high immersion degree through evaluation of user immersion degree in surrounding environment and actively provides services for the users, so that the same reaction of the robot to users with different immersion degrees and false awakening of voice in a noisy environment are effectively avoided, and the willingness of the robot to actively interact is improved; in man-machine interaction, the robot judges the interaction willingness of the user by continuously monitoring the user immersion degree, and reduces or pauses the interaction behavior and willingness of the robot when the user immersion degree is low, so that invalid interaction is avoided, the working efficiency of the robot is improved, and the working energy consumption of the robot is reduced.
In addition, this embodiment also provides an immersion computing system for multi-feature fusion in a human-machine interaction scene, comprising a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to execute the steps of the above immersion calculation method. Further, a depth camera connected to the microprocessor through a data-acquisition card may also be considered part of the immersion computing system, so that the robot's sensing system essentially constitutes the aforementioned immersion computing system. Further, considering that the robot system comprises an interconnected sensing system and motion system, the complete robot system may also be configured essentially as the aforementioned immersion computing system. The robot's motion system comprises a robot body with driven joints (e.g. servos) and a control unit arranged in the robot body; the control unit comprises a controller and a drive circuit board, and the drive circuit board is connected to the drive motors of the driven joints. The shape of the robot body is not particularly limited and may be anthropomorphic or non-anthropomorphic.
In addition, the embodiment also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program is used for being programmed or configured by a microprocessor to execute the steps of the immersion degree computing method of multi-feature fusion in the man-machine interaction scene.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims (4)

1. The immersion degree calculation method for multi-feature fusion in a man-machine interaction scene is characterized by comprising the following steps of:
1) Determining a sequence of various features extracted from a real-time image of a user in a human-computer interaction scene, wherein the various features comprise part or all of human-computer interaction distance features, human body posture features, head posture features, facial posture features, eye posture features and lip movement features;
2) Performing joint representation on the sequences of the multiple features to obtain a feature representation vector Hde;
3) Classifying the feature expression vector Hde through a classifier to obtain the immersion degree of the user;
Step 2) comprises: standardizing the sequences of the multiple features respectively, so that their dimensions are the same and their values are normalized, forming feature data X; feeding the feature data X into a gated recurrent unit (GRU) timing network to perform time-sequence modeling on the sequence of each feature; and then selecting the input-layer feature vectors through an attention layer Att to obtain the fused feature representation vector Hde;
The classifier adopted in the step 3) is a multi-layer perceptron, and the function expression is as follows:
P(Y) = sigmoid(MLP(Hde))
in the above formula, P(Y) is the obtained immersion degree of the user, sigmoid is the logistic (sigmoid) function that normalizes the output, and MLP(Hde) is the confidence value obtained by the multi-layer perceptron regressing the feature representation vector Hde;
the plurality of features in step 1) comprise human-computer interaction distance features, and the expression of the calculation function of the human-computer interaction distance features is as follows:
In the above formula, f(d_i) is the human-computer interaction distance feature of the i-th user, x_{i,1} and y_{i,1} are the x and y coordinates of the three-dimensional spatial coordinate of the neck feature point of the i-th user, x_{i,2} and y_{i,2} are the x and y coordinates of the three-dimensional spatial coordinate of the left-shoulder feature point of the i-th user, and x_{i,5} and y_{i,5} are the x and y coordinates of the three-dimensional spatial coordinate of the right-shoulder feature point of the i-th user, wherein the neck, left-shoulder and right-shoulder feature points are all obtained by human-body pose estimation on the real-time image of the user in the human-computer interaction scene, and their three-dimensional spatial coordinates are obtained by an image-coordinate-system transformation based on the point coordinates and the depth;
the human body posture features in the step 1) comprise human body azimuth features, and the expression of the calculation function of the human body azimuth features is as follows:
In the above formula, f(α_i) is the human azimuth angle feature of the i-th user, α_i is the human azimuth angle of the i-th user, and x_{i,1} and y_{i,1} are the x and y coordinates of the three-dimensional spatial coordinate of the neck feature point, wherein the neck feature point is obtained by human-body pose estimation on the real-time image of the user in the human-computer interaction scene, and its three-dimensional spatial coordinate is obtained by an image-coordinate-system transformation based on the point coordinate and the depth;
The expression of the calculation function of the head posture feature in step 1) is:
In the above formula, f(h_i) is the head posture feature of the i-th user, α_i is the human azimuth angle of the i-th user, β_i is the head angle of the i-th user, x_{i,1} and y_{i,1} are the x and y coordinates of the three-dimensional spatial coordinate of the neck feature point, and x_{i,k} and y_{i,k} are the x and y coordinates of the three-dimensional spatial coordinate of the k-th nose feature point, wherein the neck feature point and the k-th nose feature point are obtained by human-body pose estimation on the real-time image of the user in the human-computer interaction scene, and their three-dimensional spatial coordinates are obtained by an image-coordinate-system transformation based on the point coordinates and the depth;
The expression of the calculation function of the lip movement characteristics in the step 1) is as follows:
In the above formula, f(lar_i) is the lip movement feature of the i-th user, lar_i is the vertical (up-down) lip distance of the i-th user, and σ is the threshold for judging lip movement: when lar_i is greater than or equal to σ, the lips are judged to be open and the lip movement feature is 1; otherwise the lips are judged to be closed and the lip movement feature is 0, wherein the vertical lip distance of the i-th user is obtained by locating the key lip feature points in the real-time image of the user in the human-computer interaction scene and computing from the located key-point coordinates.
2. The method for calculating the immersion degree of multi-feature fusion in a human-computer interaction scene according to claim 1, wherein step 3) further comprises the step of comparing the user's immersion degree with a preset threshold: if the user's immersion degree is smaller than the preset threshold and the robot is currently in the human-machine interaction state, the robot's human-machine interaction state is suspended or exited; if the user's immersion degree is greater than or equal to the set threshold and the robot is currently in a non-interaction state, the number of detected users is first judged (if the number of users is greater than 1, the user with the highest immersion degree is selected as the target user; if the number of users is equal to 1, that user is selected as the target user), and the robot is then woken up into the human-machine interaction state to perform human-machine interaction with the target user; and wherein the robot's sensing system stays in the working state in both the human-machine interaction state and the non-interaction state, whereas the robot's motion system is in the working state only in the human-machine interaction state and is in the non-working state in the non-interaction state.
3. An immersion computing system for multi-feature fusion in a human-machine interaction scenario comprising a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to perform the steps of the method for immersion computing for multi-feature fusion in a human-machine interaction scenario of claim 1 or 2.
4. A computer readable storage medium having a computer program stored therein, the computer program being for programming or configuring by a microprocessor to perform the steps of the immersion level calculation method of multi-feature fusion in a human-machine interaction scenario according to claim 1 or 2.
CN202210663978.0A 2022-06-14 2022-06-14 Immersion degree calculation method and system for multi-feature fusion in man-machine interaction scene Active CN114998700B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210663978.0A CN114998700B (en) 2022-06-14 2022-06-14 Immersion degree calculation method and system for multi-feature fusion in man-machine interaction scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210663978.0A CN114998700B (en) 2022-06-14 2022-06-14 Immersion degree calculation method and system for multi-feature fusion in man-machine interaction scene

Publications (2)

Publication Number Publication Date
CN114998700A CN114998700A (en) 2022-09-02
CN114998700B true CN114998700B (en) 2024-06-25

Family

ID=83032686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210663978.0A Active CN114998700B (en) 2022-06-14 2022-06-14 Immersion degree calculation method and system for multi-feature fusion in man-machine interaction scene

Country Status (1)

Country Link
CN (1) CN114998700B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8027521B1 (en) * 2008-03-25 2011-09-27 Videomining Corporation Method and system for robust human gender recognition using facial feature localization
CN112100328B (en) * 2020-08-31 2023-05-30 广州探迹科技有限公司 Intent judgment method based on multi-round dialogue

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Head pose estimation based on facial feature point localization; Min Qiusha; Liu Neng; Chen Yating; Wang Zhifeng; Computer Engineering (计算机工程); 2018-02-08 (06); full text *
Research on hand detection and tracking technology for human-computer interaction fusing image and depth information; Niu Chenxiao (钮晨霄); China Master's Theses Full-text Database, Information Science and Technology; 2017-03-15; I140-7 *

Also Published As

Publication number Publication date
CN114998700A (en) 2022-09-02

Similar Documents

Publication Publication Date Title
JP4590717B2 (en) Face identification device and face identification method
CN105825268B (en) The data processing method and system of object manipulator action learning
KR20180057096A (en) Device and method to perform recognizing and training face expression
KR102441171B1 (en) Apparatus and Method for Monitoring User based on Multi-View Face Image
US20050069208A1 (en) Object detector, object detecting method and robot
US11138416B2 (en) Method and apparatus for recognizing an organism action, server, and storage medium
JP2004078316A (en) Attitude recognition device and autonomous robot
JP7520123B2 (en) Systems and methods for automated anomaly detection in mixed human-robot manufacturing processes
JP5598751B2 (en) Motion recognition device
CN109773807B (en) Motion control method and robot
Zhao et al. Real-time sign language recognition based on video stream
CN109643153A (en) Equipment for influencing the virtual objects of augmented reality
CN113056315B (en) Information processing apparatus, information processing method, and program
Kujani et al. Head movements for behavior recognition from real time video based on deep learning ConvNet transfer learning
CN114998700B (en) Immersion degree calculation method and system for multi-feature fusion in man-machine interaction scene
Orlov et al. A gaze-contingent intention decoding engine for human augmentation
KR20230093191A (en) Method for recognizing joint by error type, server
JP5300795B2 (en) Facial expression amplification device, facial expression recognition device, facial expression amplification method, facial expression recognition method, and program
Vignolo et al. The complexity of biological motion
JP2011233072A (en) Robot, method and program for position estimation
Shah et al. Gesture recognition technique: a review
Ho et al. An HMM-based temporal difference learning with model-updating capability for visual tracking of human communicational behaviors
JP4971015B2 (en) Estimator
Kulic et al. Whole body motion primitive segmentation from monocular video
Jayasurya et al. Gesture controlled AI-robot using Kinect

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant