CN111985432B - Multi-modal data fusion method based on Bayesian theorem and adaptive weight adjustment - Google Patents

Multi-modal data fusion method based on Bayesian theorem and adaptive weight adjustment Download PDF

Info

Publication number
CN111985432B
CN111985432B CN202010882365.7A
Authority
CN
China
Prior art keywords
interaction
probability
decision
given
bayesian
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010882365.7A
Other languages
Chinese (zh)
Other versions
CN111985432A (en)
Inventor
左韬
王星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aobo Jiangsu Robot Co ltd
Original Assignee
Wuhan University of Science and Engineering WUSE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Science and Engineering WUSE filed Critical Wuhan University of Science and Engineering WUSE
Priority to CN202010882365.7A priority Critical patent/CN111985432B/en
Publication of CN111985432A publication Critical patent/CN111985432A/en
Application granted granted Critical
Publication of CN111985432B publication Critical patent/CN111985432B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70Multimodal biometrics, e.g. combining information from different biometric modalities
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142Hidden Markov Models [HMMs]
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142Hidden Markov Models [HMMs]
    • G10L15/144Training of HMMs
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal data fusion method based on Bayes' theorem and adaptive weight adjustment, which comprises the following steps: multiple interaction experiments are carried out and data are collected in order to calculate the prior probabilities for Bayes' theorem. During actual interaction, the real-time interaction quality is assessed using the distance between the actual interaction point and the center of the preset interaction range, and this distance factor is folded into an adaptive parameter that serves as the adaptive weight for multi-modal data fusion. Bayes' theorem, combined with the actual interaction result of each modality, then provides a Bayesian weight that further adjusts the interaction result. The method adjusts the decision-fusion weights of the multi-modal data from the two angles of interaction process and interaction result, improving the accuracy and robustness of human-machine interaction based on multi-modal fusion.

Description

Multi-modal data fusion method based on Bayesian theorem and adaptive weight adjustment
Technical Field
The invention relates to the field of multi-modal data fusion, in particular to a multi-modal data fusion method based on Bayesian theorem and adaptive weight adjustment.
Background
Multi-modal fusion combines information from several modalities to perform target prediction (classification or regression). It is one of the earliest research directions of MMML (MultiModal Machine Learning) and remains the most widely applied one. According to the fusion level, multi-modal fusion can be divided into three categories, pixel level, feature level and decision level, which fuse raw data, abstract features and decision results respectively. Feature-level fusion can be further divided into early fusion and late fusion, depending on whether fusion occurs in the early or late stage of feature extraction; hybrid methods that mix several fusion levels also exist.
The fusion model provided by the invention belongs to back-end fusion, i.e. decision fusion: the decision probabilities output by classifiers trained separately on the data of different modalities are fused. The advantage of doing so is that the errors of the fusion model come from different classifiers, and errors from different classifiers are often uncorrelated, do not affect each other, and do not accumulate further. Common back-end fusion modes include maximum-value fusion (max-fusion), average-value fusion (average-fusion), fusion based on Bayes' rule, ensemble learning, and the like; selecting appropriate weights for back-end fusion to improve the robustness of the data-fusion structure is also a hot research issue (see the sketch below).
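The following minimal sketch is not part of the patent; the function names and example values are illustrative. It shows decision-level (back-end) fusion of per-modality probability vectors over N preset interaction targets using max-fusion and (weighted) average fusion:

```python
import numpy as np

def max_fusion(prob_vectors):
    """Element-wise maximum over the modality probability vectors."""
    return np.max(np.stack(prob_vectors), axis=0)

def average_fusion(prob_vectors, weights=None):
    """(Weighted) average of the modality probability vectors."""
    stacked = np.stack(prob_vectors)          # shape (M, N)
    if weights is None:
        weights = np.ones(len(prob_vectors))
    weights = np.asarray(weights, dtype=float)
    return weights @ stacked / weights.sum()

# Example: three modalities, four preset interaction targets.
p_face  = np.array([0.6, 0.2, 0.1, 0.1])
p_point = np.array([0.5, 0.3, 0.1, 0.1])
p_sound = np.array([0.2, 0.5, 0.2, 0.1])
fused = average_fusion([p_face, p_point, p_sound])
print(fused, int(np.argmax(fused)))          # fused probabilities and selected target
```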
The fusion model provides appropriate weights for the fusion process mainly by means of Bayes' theorem and an evaluation of interaction quality. The contribution of the whole system is a method for integrating information from two modalities, a sound modality and a human body posture modality (the latter comprising finger pointing and face orientation). Compared with single-modality human-machine interaction, multi-modal data give higher robustness, and data from different modalities can correct one another.
Disclosure of Invention
The invention integrates, into a multi-modal decision-fusion framework, both Bayes' theorem, which comprehensively examines the judgment results of the different modalities, and adaptive weights, which dynamically examine the interaction quality of each modality. The influence of the correctness of each modality's judgment on the whole interaction process is analysed, and Bayes' theorem is used to consider the results given by the modalities jointly, so that a corresponding decision coefficient is given to each modality. The adaptive weight is derived from the interaction quality of the two video modalities (finger pointing and face orientation); its purpose is to improve the robustness of the interactive system by giving a higher weight to the better-performing modality. The interaction quality is reflected in the geometric distance between the physical interaction direction in a video frame and the center of the preset interaction target, and giving a larger weight when the interaction point lies closer to the target further improves interaction accuracy.
The specific invention content is as follows: a multi-modal data back-end decision fusion method based on Bayesian theorem adaptive parameter adjustment comprises the following steps:
step 1: obtaining the prior probability of a Bayesian formula for multiple times according to the original model;
step 1.1: selecting a proper video frame to judge the finger direction and the face direction of the interactor;
step 1.2: For each video frame selected for the finger pointing modality, the finger interaction point determined in that frame is judged, and a decision vector of length N over the preset interaction points is given, in the following form:
t_pt(i) = [t_1, t_2, ..., t_j, ..., t_N]^T
where t_j = 1 if the j-th preset interaction point is the one selected in the frame, and t_j = 0 otherwise;
step 1.3: adding the given decision vectors corresponding to the finger direction modes to obtain decision probability:
P_pt = (1/F_p) · Σ_{i=1}^{F_p} α_i · t_pt(i)
where F_p is the number of selected video frames.
In the stage of calculating the prior probability by Bayes' theorem, the coefficient α_i is taken as a constant initially set to 1; each element of the decision probability represents the probability that the corresponding preset interaction point is the target interaction point of the finger modality;
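As an illustration of steps 1.2 to 1.3, the sketch below assumes that each selected video frame yields a one-hot decision vector of length N and that the finger modality's decision probability is the α-weighted average of those vectors over the F_p frames; the patent gives the exact formula as an image, so this normalization and the function names are assumptions:

```python
import numpy as np

def decision_vector(selected_target, n_targets):
    """One-hot decision vector t_pt(i): 1 at the selected target, 0 elsewhere."""
    t = np.zeros(n_targets)
    t[selected_target] = 1.0
    return t

def decision_probability(selected_per_frame, n_targets, alpha=None):
    """Alpha-weighted average of per-frame decision vectors (assumed form)."""
    f_p = len(selected_per_frame)            # number of selected video frames F_p
    if alpha is None:
        alpha = np.ones(f_p)                 # constant coefficient in step 1
    weighted = [a * decision_vector(j, n_targets)
                for a, j in zip(alpha, selected_per_frame)]
    return np.sum(weighted, axis=0) / f_p

# Example: 5 frames over 4 preset targets; target 0 is picked in 4 of them.
print(decision_probability([0, 0, 1, 0, 0], 4))   # -> [0.8 0.2 0.  0. ]
```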
step 1.4: the decision vector and decision probability for the face orientation modality are given in the same way:
t_et(i) = [t_1, t_2, ..., t_j, ..., t_N]^T
where t_j = 1 if the j-th preset interaction point is the one determined in the frame and t_j = 0 otherwise,
P_et = (1/F_e) · Σ_{i=1}^{F_e} α_i · t_et(i)
where F_e is the number of video frames selected for the face orientation modality;
step 1.5: Calculating the similarity between the HMM (Hidden Markov Model) of the speech signal and each template through HTK (the Hidden Markov Model Toolkit), and obtaining the probability of each interaction target given by the sound modality:
P_s = [p_s1, p_s2, ..., p_si, ..., p_sN]^T
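How HTK's per-template similarity scores are turned into the probability vector P_s is not spelled out here; the sketch below assumes the scores are log-likelihoods and normalizes them with a softmax, which is one reasonable choice rather than the patent's prescribed one:

```python
import numpy as np

def scores_to_probabilities(log_likelihoods):
    """Turn per-template HMM log-likelihood scores into P_s = [p_s1, ..., p_sN]^T."""
    scores = np.asarray(log_likelihoods, dtype=float)
    scores -= scores.max()        # shift for numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Example: scores of one utterance against four interaction-target templates.
print(scores_to_probabilities([-120.3, -118.9, -131.0, -125.4]))
```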
step 1.6: adding the decision probabilities determined by the three modalities to obtain the decision probability corresponding to each preset interaction point given by the whole interaction process:
P = (P_pt + P_et + P_s) / M
where M is the number of modalities; the interaction point corresponding to the element with the largest value in this decision vector is the interaction point determined for the interaction process;
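A small sketch of step 1.6 follows; whether the summed vector is divided by the number of modalities M is taken from the claim wording and treated here as an assumption:

```python
import numpy as np

def fuse_modalities(p_point, p_face, p_sound, m=3):
    """Add the three modality decision probabilities (averaged over M modalities)
    and return the fused vector plus the index of the selected interaction point."""
    fused = (np.asarray(p_point) + np.asarray(p_face) + np.asarray(p_sound)) / m
    return fused, int(np.argmax(fused))

fused, target = fuse_modalities([0.8, 0.2, 0.0], [0.6, 0.3, 0.1], [0.1, 0.7, 0.2])
print(fused, target)   # the largest element marks the chosen interaction point
```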
step 1.7: the process from step 1.1 to step 1.6 is repeated for a plurality of times, the actual target interaction point of the interactor in each test is given artificially, the correctness of each mode and the whole interaction model are given by taking the actual target interaction point as a standard, and the recorded data are as follows:
Number of experiments  Modality e (eye)  Modality p (point)  Modality s (sound)  g (general)
1 T T T T
2 F T T T
3 T F T T
4 T T T T
5 T T T T
6 T T F T
7 T F F F
8 T T T T
9 T T T T
10 F T T F
11 T T T T
12 T F F F
13 T F T F
14 F T T T
15 T T T F
... ... ... ... ...
In the table, T represents a correct judgment, F represents an incorrect one, and g (general) represents the overall judgment of the interactive system;
step 1.8: counting the number of related events according to the data given in the table, and giving the prior probability of a Bayesian formula according to the central limit theorem:
Figure BDA0002654470990000041
Figure BDA0002654470990000042
Figure BDA0002654470990000051
Figure BDA0002654470990000052
Figure BDA0002654470990000053
The probabilities in the formulas above provide the prior probabilities for the Bayes formula, where N(a) represents the number of occurrences of event a in the table above; for example, N(g = T) represents the number of times the general (overall) judgment is correct in the data table;
step 1, providing a parameter basis (namely Bayesian prior probability) for Bayesian theorem through experiments of which weights are not considered for a plurality of times;
step 2: providing Bayesian prior probability in step 1, then performing man-machine interaction, and adding self-adaptive weight adjustment in step 2;
step 2.1: selecting a proper video frame during interaction to judge the finger direction and the face direction of an interactor;
step 2.2: For each video frame selected for the finger pointing modality, the finger-pointing interaction point determined in that frame is judged, and a decision vector of length N over the preset interaction points is given, of the following form:
t_pt(i) = [t_1, t_2, ..., t_j, ..., t_N]^T
each element in the vector corresponds to a preset interaction target:
Figure BDA0002654470990000054
the value of the element corresponding to the selected interactive point is 1, and correspondingly, the values of other elements are 0 (one interactive target is selected by one video frame);
step 2.3: calculating the geometric distance between the interaction point determined by the image frame and the center of the preset interaction area (because the two points are in the same plane, the distance calculation only considers two-dimensional coordinates):
L_i = ((x_1 - x_2)^2 + (y_1 - y_2)^2)^0.5
step 2.4: the distance L_i is assigned to the adaptive weight α_i:
L_i = α_i
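Steps 2.3 to 2.4 reduce to the short computation below; as stated above, the frame's distance L_i is assigned directly to the adaptive weight α_i (the function name is illustrative):

```python
import math

def adaptive_weight(interaction_point, target_center):
    """L_i = ((x1-x2)^2 + (y1-y2)^2)^0.5, assigned to alpha_i (step 2.4)."""
    x1, y1 = interaction_point
    x2, y2 = target_center
    l_i = math.hypot(x1 - x2, y1 - y2)
    return l_i            # alpha_i = L_i

print(adaptive_weight((0.12, 0.30), (0.10, 0.25)))
```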
Step 2.5: substituting the obtained self-adaptive weight into decision probability vectors of e and p modes:
Figure BDA0002654470990000061
and step 3: on the basis of the Bayes prior probability given in the step 1 and the self-adaptive weight given in the step 2, the final weight adjustment is carried out by utilizing a Bayes formula according to the judgment results of 3 modes;
step 3.1: when the interaction points judged by the three modes are the same, the interaction points are directly selected as interaction targets of the whole system;
step 3.2: when the interaction point of one mode judgment is different from the results of other two same mode judgments (namely the element positions corresponding to the maximum values in the probability vectors given by the three modes are different), a Bayesian formula is introduced for weighting. The following formula is given as an example of different e-mode results and the same p-and s-mode results:
Figure BDA0002654470990000062
the probability corresponds to the final interactive result and takes the result of e-mode judgment,
Figure BDA0002654470990000063
the probability corresponds to the final interaction result and takes the judgment result of the p and s modes;
step 3.2.1: applying the self-adaptive weight and the Bayes formula to the probability vector fusion process of each mode to obtain an improved probability vector:
Figure BDA0002654470990000071
at this time, the interactive target corresponding to the maximum value element in the probability vector is the interactive target determined by the whole improved system;
step 3.3: when the interaction targets determined by the three modalities are different from each other (namely, the three modalities determine three different interaction targets), applying a Bayesian formula of a corresponding situation to the weighting process:
Figure BDA0002654470990000072
Figure BDA0002654470990000073
Figure BDA0002654470990000074
step 3.4: the three Bayes formulas are substituted into the following formula to obtain the final probability vector,
Figure BDA0002654470990000075
and the interactive target corresponding to the element with the largest value in the probability vector is the interactive target selected by the whole improved system.
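As a rough, illustrative sketch of step 3 (the patent's exact Bayes formulas are given as images and are not reproduced here), the snippet below uses the per-modality prior correctness rates from step 1.8 to apportion trust among the proposed interaction targets when the modalities disagree; the weighting scheme shown is an assumption, not the patent's formula:

```python
def bayes_trust(priors, proposals):
    """priors: modality -> prior probability that the modality judges correctly.
    proposals: modality -> index of the interaction target it proposes.
    Returns a normalized trust weight per modality (simplified, assumed scheme)."""
    targets = set(proposals.values())
    # mass supporting each proposed target = sum of priors of modalities proposing it
    mass = {t: sum(priors[m] for m, tgt in proposals.items() if tgt == t)
            for t in targets}
    total = sum(mass.values())
    return {m: mass[proposals[m]] / total for m in proposals}

priors    = {"e": 0.80, "p": 0.85, "s": 0.90}   # estimated in step 1.8
proposals = {"e": 2, "p": 1, "s": 1}            # e dissents; p and s agree on target 1
print(bayes_trust(priors, proposals))           # the agreeing pair receives more trust
```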
The invention has the following advantages and beneficial effects:
the invention utilizes Bayesian theorem to carry out comprehensive classification analysis on the interaction results given by the three modes, thereby providing Bayesian weight values which refer to various mode output results, and the process considers the accuracy of data from the mode output results and aims to give higher weight values to the possibly correct mode analysis results and correspondingly give lower weight values to the possibly wrong mode results, thereby improving the interaction accuracy. Different from the attention of the Bayes principle to interaction results of various modes, the self-adaptive weight adjustment focuses on the interaction process. The quality of the interaction process is evaluated by observing the geometric distance between the actual physical interaction point corresponding to the selected video frame and the preset interaction point in the interaction process, and the interaction accuracy is improved from another angle by endowing an interaction mode with good quality with a higher weight.
Drawings
FIG. 1 is a schematic diagram of human-machine interaction operation, wherein a depth camera is located on an interaction plane.
Fig. 2 is a flow chart of the proposed multimodal data fusion.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the invention is further described in detail below with reference to the accompanying drawings and specific embodiments; the description only explains the invention and is not intended to limit it. Fig. 1 is a schematic diagram of human-computer interaction by an operator. As shown in the figure, the interactor faces the plane in which the preset interaction targets lie (the targets are coplanar, and the z coordinate of this plane is 0 in Fig. 1), and a Kinect depth camera is placed on the interaction-target plane. The interactor selects the interaction target through three modalities: face orientation, finger pointing and voice information (ideally the targets indicated by the three modalities coincide). To ensure interaction accuracy, as shown in the flow of the interaction system in Fig. 2, a large number of interaction experiments (i.e. repeated interaction processes) must first be performed; the correct/incorrect data of each modality and the actual target interaction points of the interactors are collected, the table in step 1.7 is compiled, and the prior probabilities for Bayes' theorem are calculated, providing a data basis for correcting the weight of each modality through the Bayesian principle in step 3. In the actual interaction process (distinct from the repeated data-collection interactions of step 1), step 2 evaluates the interaction video frames in real time (the evaluation criterion being the interaction-point distance) and gives the adaptive weight accordingly. Step 3 uses the Bayesian principle to consider the analysis results of all modalities jointly, gives a result-based weight, and substitutes both the adaptive weight and the Bayesian weight into the probability vector to improve interaction accuracy. The method specifically comprises the following steps:
step 1: obtaining the prior probability of a Bayesian formula for multiple times according to the original model;
step 1.1: selecting a proper video frame to judge the finger direction and the face direction of the interactor;
step 1.2: For each video frame selected for the finger pointing modality, the finger interaction point determined in that frame is judged, and a decision vector of length N over the preset interaction points is given, in the following form:
t_pt(i) = [t_1, t_2, ..., t_j, ..., t_N]^T
where t_j = 1 if the j-th preset interaction point is the one selected in the frame, and t_j = 0 otherwise;
step 1.3: adding the given decision vectors corresponding to the finger direction modes to obtain decision probability:
P_pt = (1/F_p) · Σ_{i=1}^{F_p} α_i · t_pt(i)
where F_p is the number of selected video frames.
In the stage of calculating the prior probability by Bayes' theorem, the coefficient α_i is taken as a constant initially set to 1; each element of the decision probability represents the probability that the corresponding preset interaction point is the target interaction point of the finger modality;
step 1.4: the decision vector and decision probability for the face orientation modality are given in the same way:
t_et(i) = [t_1, t_2, ..., t_j, ..., t_N]^T
where t_j = 1 if the j-th preset interaction point is the one determined in the frame and t_j = 0 otherwise,
P_et = (1/F_e) · Σ_{i=1}^{F_e} α_i · t_et(i)
where F_e is the number of video frames selected for the face orientation modality;
step 1.5: Calculating the similarity between the HMM (Hidden Markov Model) of the speech signal and each template through HTK (the Hidden Markov Model Toolkit), and obtaining the probability of each interaction target given by the sound modality:
P_s = [p_s1, p_s2, ..., p_si, ..., p_sN]^T
step 1.6: adding the decision probabilities determined by the three modalities to obtain the decision probability corresponding to each preset interaction point given by the whole interaction process:
P = (P_pt + P_et + P_s) / M
where M is the number of modalities; the interaction point corresponding to the element with the largest value in this decision vector is the interaction point determined for the interaction process;
step 1.7: the process from step 1.1 to step 1.6 is repeated for a plurality of times, the actual target interaction point of the interactor in each test is given artificially, the correctness of each mode and the whole interaction model are given by taking the actual target interaction point as a standard, and the recorded data are as follows:
Number of experiments  Modality e (eye)  Modality p (point)  Modality s (sound)  g (general)
1 T T T T
2 F T T T
3 T F T T
4 T T T T
5 T T T T
6 T T F T
7 T F F F
8 T T T T
9 T T T T
10 F T T F
11 T T T T
12 T F F F
13 T F T F
14 F T T T
15 T T T F
... ... ... ... ...
In the table, T represents a correct judgment, F represents an incorrect one, and g (general) represents the overall judgment of the interactive system;
step 1.8: counting the number of related events according to the data given in the table, and giving the prior probability of a Bayesian formula according to the central limit theorem:
Figure BDA0002654470990000111
Figure BDA0002654470990000112
Figure BDA0002654470990000113
Figure BDA0002654470990000114
Figure BDA0002654470990000115
The probabilities in the formulas above provide the prior probabilities for the Bayes formula, where N(a) represents the number of occurrences of event a in the table above; for example, N(g = T) represents the number of times the general (overall) judgment is correct in the data table;
step 1, providing a parameter basis (namely Bayesian prior probability) for Bayesian theorem through experiments of which weights are not considered for a plurality of times;
step 2: providing Bayesian prior probability in step 1, then performing man-machine interaction, and adding self-adaptive weight adjustment in step 2;
step 2.1: selecting a proper video frame during interaction to judge the finger direction and the face direction of an interactor;
step 2.2: For each video frame selected for the finger pointing modality, the finger-pointing interaction point determined in that frame is judged, and a decision vector of length N over the preset interaction points is given, of the following form:
t_pt(i) = [t_1, t_2, ..., t_j, ..., t_N]^T
each element in the vector corresponds to a preset interaction target:
Figure BDA0002654470990000121
the value of the element corresponding to the selected interactive point is 1, and correspondingly, the values of other elements are 0 (one interactive target is selected by one video frame);
step 2.3: calculating the geometric distance between the interaction point determined by the image frame and the center of the preset interaction area (since the two points are in the same plane, the distance calculation only considers two-dimensional coordinates):
L_i = ((x_1 - x_2)^2 + (y_1 - y_2)^2)^0.5
step 2.4: the distance L_i is assigned to the adaptive weight value α_i:
L_i = α_i
Step 2.5: substituting the obtained self-adaptive weight into decision probability vectors of e and p modes:
Figure BDA0002654470990000122
and step 3: on the basis of the Bayes prior probability given in the step 1 and the self-adaptive weight given in the step 2, the final weight adjustment is carried out by utilizing a Bayes formula according to the judgment results of 3 modes;
step 3.1: when the interaction points judged by the three modes are the same, the interaction points are directly selected as interaction targets of the whole system;
step 3.2: when the interaction point of one mode judgment is different from the results of other two same mode judgments (namely the element positions corresponding to the maximum values in the probability vectors given by the three modes are different), a Bayesian formula is introduced for weighting. The following formula is given as an example that the e mode results are different and the p and s mode results are the same:
Figure BDA0002654470990000131
the probability corresponds to the final interactive result and takes the result of e-mode judgment,
Figure BDA0002654470990000132
the probability is corresponding to the final interaction result, and the result of judging the p and s modes is taken;
step 3.2.1: applying the self-adaptive weight and the Bayes formula to the probability vector fusion process of each mode to obtain an improved probability vector:
Figure BDA0002654470990000133
at this time, the interactive target corresponding to the maximum value element in the probability vector is the interactive target determined by the whole improved system;
step 3.3: when the interaction targets determined by the three modalities are different from each other (namely, the three modalities determine three different interaction targets), applying a Bayesian formula of a corresponding situation to a weighting process:
Figure BDA0002654470990000134
Figure BDA0002654470990000135
Figure BDA0002654470990000136
step 3.4: the three Bayes formulas are substituted into the following formula to obtain the final probability vector,
Figure BDA0002654470990000141
and the interactive target corresponding to the element with the largest value in the probability vector is the interactive target selected by the whole improved system.

Claims (1)

1. A multi-modal data fusion method based on Bayesian theorem and adaptive weight adjustment is characterized by comprising the following steps:
step 1: obtaining the prior probability of a Bayesian formula for multiple times according to the original model;
step 1.1: selecting a proper video frame to judge the finger direction and the face direction of the interactor;
step 1.2: for each video frame selected for the finger pointing modality, judging the finger interaction point determined in the frame, and giving a decision vector of length N over the preset interaction points, of the following form:
t_pt(i) = [t_1, t_2, ..., t_j, ..., t_N]^T
where t_j = 1 if the j-th preset interaction point is the one selected in the frame, and t_j = 0 otherwise;
step 1.3: adding the given decision vectors corresponding to the finger direction modes to obtain decision probability:
P_pt = (1/F_p) · Σ_{i=1}^{F_p} α_i · t_pt(i)
where F_p is the number of selected video frames,
in the stage of calculating the prior probability by Bayes' theorem, the coefficient α_i is taken as a constant initially set to 1; each element of the decision probability represents the probability that the corresponding preset interaction point is the target interaction point of the finger modality;
step 1.4: the decision vector and decision probability for the face orientation modality are given in the same way:
t_et(i) = [t_1, t_2, ..., t_j, ..., t_N]^T
where t_j = 1 if the j-th preset interaction point is the one determined in the frame and t_j = 0 otherwise,
P_et = (1/F_e) · Σ_{i=1}^{F_e} α_i · t_et(i)
where F_e is the number of video frames selected for the face orientation modality;
step 1.5: calculating, through the Hidden Markov Model Toolkit (HTK), i.e. the speech-recognition toolkit, the similarity between the Hidden Markov Model of the voice signal and each template, and obtaining the probability of each interaction target given by the voice modality:
P_s = [p_s1, p_s2, ..., p_si, ..., p_sN]^T
step 1.6: adding the decision probabilities determined by the three modalities to obtain the decision probability corresponding to each preset interaction point given by the whole interaction process:
P = (P_pt + P_et + P_s) / M
the interaction point corresponding to the element with the largest value in this decision vector is the interaction point determined in the interaction process, wherein M is the number of modalities;
step 1.7: the process from step 1.1 to step 1.6 is repeated for a plurality of times, the actual target interaction point of the interactor in each test is given artificially, the correctness of each mode and the whole interaction model are given by taking the actual target interaction point as a standard, and the recorded data are as follows:
Number of experiments  Modality e (eye)  Modality p (point)  Modality s (sound)  g (general)
1 T T T T
2 F T T T
3 T F T T
4 T T T T
5 T T T T
6 T T F T
7 T F F F
8 T T T T
9 T T T T
10 F T T F
11 T T T T
12 T F F F
13 T F T F
14 F T T T
15 T T T F
... ... ... ... ...
In the table, T represents a correct judgment, F represents an incorrect one, and g, i.e. general, represents the overall judgment of the interactive system;
step 1.8: the number of relevant events is counted according to the data given in the table above, and the prior probability of the Bayes formula is given according to the central limit theorem:
Figure FDA0003726101670000031
Figure FDA0003726101670000032
Figure FDA0003726101670000033
Figure FDA0003726101670000034
Figure FDA0003726101670000035
The probabilities in the above formulas provide the prior probabilities for the Bayes formula, where N(a) represents the number of occurrences of event a in the above table, and N(g = T) represents the number of times the general judgment is correct in the data table;
step 1, providing a parameter basis, namely Bayesian prior probability, for Bayesian theorem through experiments in which weights are not considered for multiple times;
step 2: providing Bayesian prior probability in step 1, then performing man-machine interaction, and adding self-adaptive weight adjustment in step 2;
step 2.1: selecting a proper video frame during interaction to judge the finger direction and the face direction of an interactor;
step 2.2: for each video frame selected for the finger pointing modality, judging the finger-pointing interaction point determined in the frame, and giving a decision vector of length N over the preset interaction points, of the following form:
t_pt(i) = [t_1, t_2, ..., t_j, ..., t_N]^T
each element in the vector corresponds to a preset interaction target:
Figure FDA0003726101670000041
the value of the element corresponding to the selected interactive point is 1, correspondingly, the values of other elements are 0, and one interactive target is selected from one video frame;
step 2.3: and calculating the geometric distance between the interaction point determined by the video frame and the center of the preset interaction area, wherein the two points are positioned on the same plane, so that the distance calculation only considers two-dimensional coordinates:
L_i = ((x_1 - x_2)^2 + (y_1 - y_2)^2)^0.5
step 2.4: the distance L_i is assigned to the adaptive weight α_i:
L_i = α_i
step 2.5: substituting the obtained self-adaptive weight into decision probability vectors of e and p modes:
Figure FDA0003726101670000042
and step 3: on the basis of the Bayes prior probability given in the step 1 and the self-adaptive weight given in the step 2, the final weight adjustment is carried out by utilizing a Bayes formula according to the judgment results of 3 modes;
step 3.1: when the interaction points judged by the three modes are the same, directly selecting the interaction points as interaction targets of the whole system;
step 3.2: when the interaction point judged by a certain mode is different from the results judged by other two same modes, namely the element positions corresponding to the maximum values in the probability vectors given by the three modes are different, introducing a Bayesian formula for weighting, wherein the following formula takes the example that the results of e modes are different and the results of p and s modes are the same:
Figure FDA0003726101670000051
the probability corresponds to the final interaction result and takes the result of e-mode judgment;
Figure FDA0003726101670000052
the probability is corresponding to the final interaction result, and the result of judging the p and s modes is taken;
step 3.2.1: applying the self-adaptive weight and the Bayes formula to the probability vector fusion process of each mode to obtain an improved probability vector:
Figure FDA0003726101670000053
at this time, the interactive target corresponding to the maximum value element in the probability vector is the interactive target determined by the whole improved system;
step 3.3: when the interaction targets determined by the three modalities are different, namely when the three modalities determine three different interaction targets, applying a Bayesian formula of the corresponding situation to the weighting process:
Figure FDA0003726101670000054
Figure FDA0003726101670000055
Figure FDA0003726101670000056
step 3.4: the three Bayes formulas are substituted into the following formula to obtain the final probability vector,
Figure FDA0003726101670000061
and the interactive target corresponding to the element with the largest value in the probability vector is the interactive target selected by the whole improved system.
CN202010882365.7A 2020-08-28 2020-08-28 Multi-modal data fusion method based on Bayesian theorem and adaptive weight adjustment Active CN111985432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010882365.7A CN111985432B (en) 2020-08-28 2020-08-28 Multi-modal data fusion method based on Bayesian theorem and adaptive weight adjustment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010882365.7A CN111985432B (en) 2020-08-28 2020-08-28 Multi-modal data fusion method based on Bayesian theorem and adaptive weight adjustment

Publications (2)

Publication Number Publication Date
CN111985432A CN111985432A (en) 2020-11-24
CN111985432B true CN111985432B (en) 2022-08-12

Family

ID=73440805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010882365.7A Active CN111985432B (en) 2020-08-28 2020-08-28 Multi-modal data fusion method based on Bayesian theorem and adaptive weight adjustment

Country Status (1)

Country Link
CN (1) CN111985432B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065277B (en) * 2021-03-11 2021-10-08 自然资源部国土卫星遥感应用中心 High-resolution remote sensing satellite flutter detection and modeling method in cooperation with multi-load data
CN113616184B (en) * 2021-06-30 2023-10-24 北京师范大学 Brain network modeling and individual prediction method based on multi-mode magnetic resonance image

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7979363B1 (en) * 2008-03-06 2011-07-12 Thomas Cecil Minter Priori probability and probability of error estimation for adaptive bayes pattern recognition
CN102646200A (en) * 2012-03-08 2012-08-22 武汉大学 Image classifying method and system for self-adaption weight fusion of multiple classifiers

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7979363B1 (en) * 2008-03-06 2011-07-12 Thomas Cecil Minter Priori probability and probability of error estimation for adaptive bayes pattern recognition
CN102646200A (en) * 2012-03-08 2012-08-22 武汉大学 Image classifying method and system for self-adaption weight fusion of multiple classifiers

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on trust measurement model based on dynamic Bayesian network; Liang Hongquan et al.; Journal on Communications; 2013-09-25 (No. 09); full text *

Also Published As

Publication number Publication date
CN111985432A (en) 2020-11-24

Similar Documents

Publication Publication Date Title
EP3843035A1 (en) Image processing method and apparatus for target recognition
CN108596087B (en) Driving fatigue degree detection regression model based on double-network result
CN112507996A (en) Face detection method of main sample attention mechanism
CN111985432B (en) Multi-modal data fusion method based on Bayesian theorem and adaptive weight adjustment
CN110555417A (en) Video image recognition system and method based on deep learning
CN115861772A (en) Multi-scale single-stage target detection method based on RetinaNet
CN109389105B (en) Multitask-based iris detection and visual angle classification method
CN109919055B (en) Dynamic human face emotion recognition method based on AdaBoost-KNN
Zheng et al. Improvement of grayscale image 2D maximum entropy threshold segmentation method
CN115439458A (en) Industrial image defect target detection algorithm based on depth map attention
CN109934129B (en) Face feature point positioning method, device, computer equipment and storage medium
CN111860587A (en) Method for detecting small target of picture
EP2535787B1 (en) 3D free-form gesture recognition system and method for character input
CN103105924A (en) Man-machine interaction method and device
CN113799124A (en) Robot flexible grabbing detection method in unstructured environment
WO2019085060A1 (en) Method and system for detecting waving of robot, and robot
CN117274774A (en) Yolov 7-based X-ray security inspection image dangerous goods detection algorithm
CN111144462A (en) Unknown individual identification method and device for radar signals
CN114821423A (en) Fire detection method based on improved YOLOV5
CN110163130A (en) A kind of random forest grader and classification method of the feature pre-align for gesture identification
CN111860265B (en) Multi-detection-frame loss balanced road scene understanding algorithm based on sample loss
CN112329571B (en) Self-adaptive human body posture optimization method based on posture quality evaluation
CN113327269A (en) Unmarked cervical vertebra movement detection method
CN111368625B (en) Pedestrian target detection method based on cascade optimization
CN117237902A (en) Robot character recognition system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231108

Address after: Room B-301, Zhongke Entrepreneurship Center, Changzhou Science and Education City, No. 18 Changwu Middle Road, Changzhou City, Jiangsu Province, 213100

Patentee after: AOBO (JIANGSU) ROBOT CO.,LTD.

Address before: 430081 No. 947 Heping Avenue, Qingshan District, Hubei, Wuhan

Patentee before: WUHAN University OF SCIENCE AND TECHNOLOGY

TR01 Transfer of patent right