CN108986801A - Human-computer interaction method, apparatus and human-computer interaction terminal - Google Patents
Human-computer interaction method, apparatus and human-computer interaction terminal
- Publication number
- CN108986801A CN108986801A CN201710408396.7A CN201710408396A CN108986801A CN 108986801 A CN108986801 A CN 108986801A CN 201710408396 A CN201710408396 A CN 201710408396A CN 108986801 A CN108986801 A CN 108986801A
- Authority
- CN
- China
- Prior art keywords
- gesture
- feature
- text
- target
- speech samples
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/197—Probabilistic grammars, e.g. word n-grams
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Artificial Intelligence (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
An embodiment of the present invention provides a human-computer interaction method, apparatus and human-computer interaction terminal. The method comprises: obtaining control information conveyed by a user, the control information including voice information; extracting text features of the voice information; determining the text feature vectors corresponding to the text features; determining, according to a pre-trained speech classification model, the speech sample matching the text feature vectors, the speech classification model representing the belonging probability of a text feature vector to a corresponding speech sample; taking the voice control instruction corresponding to the determined speech sample as the voice control instruction of the voice information; and generating a target control instruction according to the voice control instruction. The embodiment of the present invention can improve the naturalness and intelligence of human-computer interaction and lower the user threshold of human-computer interaction, thereby providing strong support for the popularization of human-computer interaction.
Description
Technical field
The present invention relates to the field of human-computer interaction technology, and in particular to a human-computer interaction method, apparatus and human-computer interaction terminal.
Background art
Human-computer interaction refers to the technology by which a user and a machine communicate with each other, so that the machine understands the user's intent. Specifically, through human-computer interaction, the user can convey control information to the machine, so that the machine completes the work the user intends.
Human-computer interaction is widely applied in many fields, including mobile phone control and automatic driving. In particular, with the development of robot (e.g., service robot) technology, how to better apply human-computer interaction technology to robot control has become a key point in advancing robot technology.
The inventors of the present invention found that an urgent problem for current human-computer interaction technology is how to improve the naturalness and intelligence of human-computer interaction, so that the user threshold of human-computer interaction is lowered and the technology can be widely adopted.
Summary of the invention
In view of this, embodiments of the present invention provide a human-computer interaction method, apparatus and human-computer interaction terminal, so as to improve the naturalness and intelligence of human-computer interaction and lower the user threshold of human-computer interaction, thereby providing strong support for the popularization of human-computer interaction.
To achieve the above object, the embodiment of the present invention provides the following technical solutions:
A human-computer interaction method, comprising:
obtaining control information conveyed by a user, the control information including voice information;
extracting text features of the voice information;
determining the text feature vectors corresponding to the text features;
determining, according to a pre-trained speech classification model, the speech sample matching the text feature vectors; the speech classification model represents the belonging probability of a text feature vector to a corresponding speech sample;
taking the voice control instruction corresponding to the determined speech sample as the voice control instruction of the voice information;
generating a target control instruction according to the voice control instruction.
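The claimed steps can be illustrated end to end with a minimal sketch. Everything below (the utterances, the keyword features, the command names, and the keyword-overlap scoring standing in for the trained belonging-probability model) is a hypothetical illustration, not the patent's implementation.

```python
def extract_text_features(utterance):
    # Stand-in for speech-to-text plus text feature (keyword) extraction.
    return utterance.lower().split()

# Toy stand-in for the pre-trained speech classification model: known
# speech samples mapped to their voice control instructions.
SAMPLES = {
    "move forward": "CMD_FORWARD",
    "turn left": "CMD_LEFT",
}

def match_speech_sample(features):
    # Keyword overlap serves as a crude proxy for belonging probability.
    def overlap(sample):
        return len(set(features) & set(sample.split()))
    best = max(SAMPLES, key=overlap)
    return best if overlap(best) > 0 else None

def target_control_instruction(utterance):
    features = extract_text_features(utterance)
    sample = match_speech_sample(features)
    return SAMPLES.get(sample)  # None when no sample matches

print(target_control_instruction("please move forward now"))  # CMD_FORWARD
```

In this sketch the voice control instruction of the matched sample directly becomes the target control instruction; in the claimed method the target instruction is generated from it in a further step.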
An embodiment of the present invention also provides a human-computer interaction apparatus, comprising:
a control information obtaining module, configured to obtain control information conveyed by a user, the control information including voice information;
a text feature extraction module, configured to extract text features of the voice information;
a text feature vector determining module, configured to determine the text feature vectors corresponding to the text features;
a speech sample determining module, configured to determine, according to a pre-trained speech classification model, the speech sample matching the text feature vectors; the speech classification model represents the belonging probability of a text feature vector to a corresponding speech sample;
a voice instruction determining module, configured to take the voice control instruction corresponding to the determined speech sample as the voice control instruction of the voice information;
a target instruction generating module, configured to generate a target control instruction according to the voice control instruction.
An embodiment of the present invention also provides a human-computer interaction terminal, comprising at least one memory and at least one processor; the memory stores a program, and the processor calls the program; the program is configured to:
obtain control information conveyed by a user, the control information including voice information;
extract the text features of the voice information;
determine the text feature vectors corresponding to the text features;
determine, according to a pre-trained speech classification model, the speech sample matching the text feature vectors; the speech classification model represents the belonging probability of a text feature vector to a corresponding speech sample;
take the voice control instruction corresponding to the determined speech sample as the voice control instruction of the voice information;
generate a target control instruction according to the voice control instruction.
Based on the above technical solution, the human-computer interaction method provided by the embodiments of the present invention can perform text feature extraction on the voice information in the control information conveyed by the user and determine the corresponding text feature vectors; the speech sample matching the text feature vectors can then be determined according to the pre-trained speech classification model; the voice control instruction corresponding to the determined speech sample is taken as the voice control instruction of the voice information, and a target control instruction is generated from it, realizing the generation of target control instructions for the machine during human-computer interaction.
Since the pre-trained speech classification model can accurately define the probability that each text feature vector belongs to each possibly intended speech sample, the correspondence between speech samples and text feature vectors becomes more accurate. Therefore, with the embodiments of the present invention, a user can conduct human-computer interaction in a way similar to person-to-person communication: after the user conveys natural voice information to the human-computer interaction terminal, the terminal can use the speech classification model to accurately identify the speech sample matching the conveyed voice information, and thereby identify, through the matched speech sample, the voice control instruction intended by that voice information. With the embodiments of the present invention, the user can convey voice information more naturally, and the human-computer interaction terminal can accurately match the speech sample of the user's voice information through the speech classification model, realizing accurate determination of the intended voice control instruction; this improves the naturalness and intelligence of human-computer interaction, lowers the user's threshold for human-computer interaction, and provides strong support for its popularization.
Detailed description of the invention
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a structural block diagram of the human-computer interaction system provided by an embodiment of the present invention;
Fig. 2 is another structural block diagram of the human-computer interaction system provided by an embodiment of the present invention;
Fig. 3 is a structural block diagram of a human-computer interaction terminal;
Fig. 4 is a flowchart of the method for constructing the speech classification model provided by an embodiment of the present invention;
Fig. 5 is a flowchart of the human-computer interaction method provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of an example of human-computer interaction;
Fig. 7 is another flowchart of the human-computer interaction method provided by an embodiment of the present invention;
Fig. 8 is a flowchart of a method for processing gesture posture features with an improved particle filter;
Fig. 9 is a flowchart of the target object recognition method provided by an embodiment of the present invention;
Fig. 10 is a structural block diagram of the human-computer interaction apparatus provided by an embodiment of the present invention;
Fig. 11 is another structural block diagram of the human-computer interaction apparatus provided by an embodiment of the present invention;
Fig. 12 is yet another structural block diagram of the human-computer interaction apparatus provided by an embodiment of the present invention;
Fig. 13 is a further structural block diagram of the human-computer interaction apparatus provided by an embodiment of the present invention;
Fig. 14 is still another structural block diagram of the human-computer interaction apparatus provided by an embodiment of the present invention;
Fig. 15 is a further structural block diagram of the human-computer interaction apparatus provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
The human-computer interaction method provided by the embodiments of the present invention can be applied to robot control, mobile phone control, automatic driving, and the like. For ease of illustration, the method will mainly be introduced below in the context of service robot control; of course, the principles of using the method for mobile phone control, automatic driving, etc. are consistent with those for service robot control, and the descriptions can be cross-referenced.
It should be noted that a service robot is one kind of robot; service robots can be divided into professional-field service robots and personal/home service robots. Service robots have a wide range of applications and are mainly engaged in work such as maintenance, repair, transport, cleaning, security, rescue and monitoring.
Optionally, Fig. 1 is an optional structural block diagram of the human-computer interaction system provided by an embodiment of the present invention. Referring to Fig. 1, the human-computer interaction system may include a human-computer interaction terminal 10 and a service robot 11; the human-computer interaction terminal 10 and the service robot 11 can exchange information through the Internet.
Based on the human-computer interaction system shown in Fig. 1, the user can convey control information to the human-computer interaction terminal; after the terminal understands the control instruction corresponding to the control information conveyed by the user, it can transmit the control instruction to the service robot through the Internet, and the service robot executes the control instruction to complete the work the user intends.
Optionally, the way the user conveys control information to the human-computer interaction terminal may be voice, or voice combined with gestures, etc.
Further, the service robot can transmit its status information and/or vision-based environmental information to the human-computer interaction terminal through the Internet, so that the terminal shows the user the robot's status information and/or the environmental information around the service robot (which can be displayed on the terminal's display screen), allowing the user to convey control information more effectively.
The human-computer interaction system shown in Fig. 1 can transmit information between the human-computer interaction terminal and the service robot through the Internet, realizing the user's remote control of the service robot. Of course, Fig. 1 shows only one optional structure of the human-computer interaction system; optionally, the embodiment of the present invention does not exclude the case where the human-computer interaction terminal is built into the service robot, as shown in Fig. 2, so that the terminal controls the service robot's work through local communication (in forms such as a local wired connection or a wireless local area network). Apart from the communication mode being changed from Internet communication to local communication, the system shown in Fig. 2 may otherwise be similar to that shown in Fig. 1.
Optionally, the human-computer interaction terminal may be regarded as the interaction platform between the service robot and the user, and as the control terminal that controls the service robot. The terminal can be set up separately from the service robot, exchanging information through the Internet, or can be built into the service robot; after understanding the control instruction corresponding to the user's control information, the terminal transmits the corresponding control instruction to the service robot, thereby controlling the robot's control elements (such as motors) and completing the work the user intends.
In the embodiments of the present invention, the human-computer interaction terminal can load a corresponding program to implement the human-computer interaction method provided by the embodiments; the program can be stored by the terminal's memory and called and executed by the terminal's processor. Optionally, Fig. 3 shows an optional structure of the human-computer interaction terminal; referring to Fig. 3, the terminal may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4.
In the embodiments of the present invention, there is at least one of each of the processor 1, communication interface 2, memory 3 and communication bus 4, and the processor 1, communication interface 2 and memory 3 communicate with one another through the communication bus 4; obviously, the communication connections of the processor 1, communication interface 2, memory 3 and communication bus 4 shown in Fig. 3 are only optional.
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
The memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, e.g., at least one magnetic disk memory.
The memory 3 stores a program, and the processor 1 calls the program stored in the memory 3 to implement the human-computer interaction method provided by the embodiments of the present invention.
Voice is a common way for a user to convey control information. The human-computer interaction method provided by the embodiments of the present invention is introduced below for the case where the control information conveyed by the user to the human-computer interaction terminal includes voice. The method described below is applicable to service robot control, mobile phone control, automatic driving, and the like.
To improve the naturalness and intelligence of human-computer interaction, it is important to make the service robot understand the intent of the user's voice more accurately and quickly. The embodiment of the present invention therefore considers constructing an accurate and efficient speech classification model, so as to identify more accurately and quickly the speech sample corresponding to the voice conveyed by the user, and to determine, through the voice control instruction corresponding to that speech sample, the voice control instruction intended by the user's voice.
Fig. 4 is a flowchart of the method for constructing the speech classification model provided by an embodiment of the present invention. The construction method can be implemented by a background server, and the trained speech classification model can be imported into the human-computer interaction terminal, which then identifies the speech sample corresponding to the user's voice; of course, the construction of the speech classification model may also be implemented by the human-computer interaction terminal itself.
Referring to Fig. 4, this method may include:
Step S100: obtain a training corpus, the training corpus recording the speech samples of each voice control instruction, where one voice control instruction corresponds to at least one speech sample.
The training corpus records the speech samples of each voice control instruction collected in advance by the embodiment of the present invention, and in the training corpus one voice control instruction corresponds to at least one speech sample; with the speech samples of each voice control instruction, the speech classification model can be trained using a machine learning algorithm.
Optionally, a speech sample may be the user's natural language, and a voice control instruction may be a control instruction, understandable by the service robot, into which the natural speech is converted.
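A possible in-memory shape for such a corpus is sketched below; the instruction names and sample phrasings are illustrative assumptions, not data from the patent.

```python
# One voice control instruction maps to at least one natural-language
# speech sample, as step S100 requires.
training_corpus = {
    "CMD_FORWARD": ["move forward", "go ahead", "walk straight on"],
    "CMD_STOP": ["stop", "halt right there"],
}

def samples_of(instruction):
    # Return all speech samples recorded for one voice control instruction.
    return training_corpus[instruction]

print(len(samples_of("CMD_FORWARD")))  # 3
```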
Step S110: extract the text features of each speech sample to obtain multiple text features.
The embodiment of the present invention can perform text feature extraction on each speech sample; the text features extracted from one speech sample may number at least one, so that performing text feature extraction on each speech sample yields multiple text features.
Optionally, text features extracted from different speech samples may repeat; the repeated text features can be de-duplicated so that no duplicate text features exist among the obtained multiple text features.
Optionally, the text features of a speech sample may be regarded as features, in forms such as keywords, extracted from the text obtained by converting the speech sample to text.
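This keyword-style extraction with de-duplication can be sketched minimally; the stop-word list and whitespace tokenization are illustrative assumptions, not the patent's method.

```python
STOP_WORDS = {"please", "the", "a", "now"}

def extract_features(transcript):
    # Tokenize the converted text, drop stop words, and de-duplicate
    # while preserving first-occurrence order.
    seen, features = set(), []
    for word in transcript.lower().split():
        if word not in STOP_WORDS and word not in seen:
            seen.add(word)
            features.append(word)
    return features

print(extract_features("Please move move forward now"))  # ['move', 'forward']
```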
Step S120: perform feature vector weighting on each text feature to obtain the text feature vector of each text feature.
Optionally, the embodiment of the present invention can use TF-IDF (term frequency-inverse document frequency, a technique for information retrieval) to perform feature vector weighting on each text feature, thereby obtaining, for each text feature, the corresponding text feature vector, and in total multiple text feature vectors.
It should be noted that TF-IDF is a statistical method used to assess the importance of a word for a document in a corpus: the importance of a word increases in proportion to the number of times it appears in the document, but at the same time decreases in inverse proportion to the number of times it appears in the corpus.
Optionally, for a text feature, the embodiment of the present invention can determine the number of occurrences of the text feature's words in the corresponding speech sample (the speech sample corresponding to a text feature may be regarded as the speech sample from which the text feature was extracted) and the number of occurrences in the training corpus, and thereby determine, from these occurrence counts, the importance of the text feature in the corresponding speech sample; the corresponding text feature vector is then determined according to this importance. The importance is proportional to the number of occurrences of the text feature's words in the speech sample, and inversely proportional to the number of occurrences of those words in the corpus.
Optionally, when obtaining the text feature vectors, if there are n words, then an n-dimensional text feature vector can be obtained accordingly.
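The TF-IDF weighting described above can be computed directly. The tiny corpus below is an illustrative assumption; it only shows that a word appearing in many samples (here "move") receives a lower weight than a more distinctive word ("forward"), matching the proportionality stated above.

```python
import math

# Toy training corpus: three tokenized speech-sample transcripts.
corpus = [
    ["move", "forward"],
    ["move", "back"],
    ["turn", "left"],
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)           # frequency in the sample
    df = sum(1 for d in corpus if term in d)  # samples containing the term
    idf = math.log(len(corpus) / df)          # rarer terms weigh more
    return tf * idf

print(tf_idf("forward", corpus[0], corpus) > tf_idf("move", corpus[0], corpus))  # True
```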
Step S130: model, according to a machine learning algorithm, the belonging probabilities of each text feature vector with the corresponding speech samples, to obtain the speech classification model.
Optionally, the speech samples corresponding to a text feature vector can be understood as the speech samples the text feature vector is intended to express, numbering at least one; the belonging probability of a text feature vector with a corresponding speech sample may be regarded as the probability that the text feature vector belongs to that speech sample.
Optionally, since the text features extracted from different speech samples may be identical, one text feature may correspond to at least one speech sample; correspondingly, the speech samples corresponding to the text feature vector of a text feature may also number at least one. The text feature vector of a text feature can represent the importance of the text feature in the corresponding speech sample, so the embodiment of the present invention can determine the belonging probability of a text feature vector with each corresponding speech sample according to the importance represented by the text feature vector.
Then, using a machine learning algorithm, each text feature vector and the belonging probabilities of each text feature vector with each corresponding speech sample are modeled to obtain the speech classification model; optionally, the speech classification model can represent the belonging probability of a text feature vector with a corresponding speech sample.
Through the text feature vectors and the belonging probabilities of text feature vectors with their corresponding speech samples, the embodiment of the present invention can accurately define the probability that each text feature vector belongs to each possibly intended speech sample, making the correspondence between speech samples and text feature vectors more accurate. With the speech classification model trained in this way, the speech sample to which a natural-language utterance belongs can be accurately determined through its text feature vectors, realizing accurate identification of the speech sample corresponding to the natural language conveyed by the user; subsequently, by taking the voice control instruction of the identified speech sample as the voice control instruction of the natural language, the voice control instruction matching the user's natural language can be accurately determined. This improves the accuracy with which the service robot identifies the voice control instruction intended by the user's natural language, and makes intelligent and natural human-computer interaction possible.
Optionally, the embodiment of the present invention can use the maximum entropy algorithm to model each text feature vector and the membership probability of its corresponding voice control instruction, obtaining a maximum entropy classification model with a uniform probability distribution (one form of the speech classification model); the maximum entropy classification model expresses the membership probability between a text feature vector and its corresponding speech samples.
Optionally, in specific modeling, the embodiment of the present invention can use the following formula:

p*(y|x) = (1/Z(x)) · exp( Σ_{i=1}^{n} λ_i · f_i(x, y) )

where f_i(x, y) is the characteristic function of the i-th text feature vector, and n is the number of characteristic functions, equal to the number of text feature vectors; f_i(x, y) is taken as 1 if the i-th text feature vector and the corresponding speech sample appear in the same collected natural-language utterance, and as 0 otherwise; λ_i is the weight corresponding to f_i(x, y), a Lagrange multiplier; Z(x) is the normalization factor; and p* is the expression of the maximum entropy classification model.
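As an illustrative sketch (not the patent's implementation), the conditional probability p*(y|x) above can be evaluated directly from binary characteristic functions f_i and weights λ_i; the feature functions, weights and class labels below are toy assumptions.

```python
import math

def maxent_prob(features, weights, classes, x):
    """Conditional maximum-entropy model:
    p(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x).

    features: list of binary characteristic functions f_i(x, y) -> 0 or 1
    weights:  list of Lagrange multipliers lambda_i, one per feature
    classes:  candidate speech samples y
    """
    scores = {y: math.exp(sum(w * f(x, y) for f, w in zip(features, weights)))
              for y in classes}
    z = sum(scores.values())           # normalization factor Z(x)
    return {y: s / z for y, s in scores.items()}

# Toy features: fire when a word co-occurs with a speech sample (hypothetical labels).
features = [lambda x, y: 1 if ("move" in x and y == "MOVE") else 0,
            lambda x, y: 1 if ("stop" in x and y == "STOP") else 0]
weights = [2.0, 2.0]
probs = maxent_prob(features, weights, ["MOVE", "STOP"], {"move", "forward"})
best = max(probs, key=probs.get)       # speech sample with highest membership probability
```

The argmax over p*(y|x) is exactly the "select the speech sample with the highest membership probability" step used later in step S230.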
Optionally, when modeling with the maximum entropy algorithm, the modeling process is driven by the known information: it satisfies the known information as far as possible while making no assumptions about the unknown, so the probabilities of all kinds of relevant or irrelevant observations can be applied jointly to text feature vector classification, and its performance is better than other machine learning algorithms such as Bayesian methods. The embodiment of the present invention therefore preferably uses the maximum entropy algorithm to build a speech classification model in the form of a maximum entropy classification model; however, this is only a preferred choice, and the embodiment of the present invention does not exclude other machine learning algorithms such as Bayesian methods.
After the speech classification model is obtained by training, the human-computer interaction terminal uses it to process the control information containing speech conveyed by the user, so as to identify the speech sample matching the user's speech, and takes the voice control instruction corresponding to that speech sample as the voice control instruction of the speech conveyed by the user.
Fig. 5 is a flowchart of the man-machine interaction method provided by an embodiment of the present invention. The method can be applied to a human-computer interaction terminal. Referring to Fig. 5, the method may include:
Step S200, obtaining the control information conveyed by the user, the control information including voice information.
Optionally, the human-computer interaction terminal can collect the control information conveyed by the user through a detector; the control information may include voice information conveyed by the user, and may also include gesture information conveyed by the user. This embodiment discusses the case where the control information includes voice information; the case where it includes gesture information will be described later.
Optionally, the detector may take the form of a speech detector such as a microphone, or a non-contact visual detector such as a three-dimensional camera or an infrared imager; the form of the detector can be set according to the type of the control information and is not fixedly limited.
Step S210, extracting the text feature of the voice information.
Optionally, the embodiment of the present invention can convert the voice information into text and extract the corresponding text feature from the converted text, thereby obtaining the text feature of the voice information.
Step S220, determining the text feature vector corresponding to the text feature.
Optionally, the embodiment of the present invention can determine the text feature vector corresponding to the text feature by TF-IDF; optionally, when determining the text feature vector corresponding to the text feature, the embodiment of the present invention may combine the voice information with the training corpus.
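A minimal sketch of the TF-IDF weighting mentioned above, assuming whitespace-tokenized text and a small illustrative corpus (the patent does not fix a tokenizer or smoothing scheme):

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """TF-IDF weighting: a term frequent in one text but rare across the
    training corpus gets a large weight, reflecting its importance in that
    text (the 'importance' a text feature vector represents)."""
    docs = [doc.split() for doc in corpus]
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                        for t in tf})
    return vectors

vecs = tfidf_vectors(["move forward one meter", "stop moving now", "move left"])
```

Here "forward" outweighs "move" in the first text because "move" also occurs elsewhere in the corpus, so its inverse document frequency is lower.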
Step S230, determining the speech sample matching the text feature vector according to the pre-trained speech classification model.
Optionally, since the pre-trained speech classification model can express the membership probability between a text feature vector and its corresponding speech samples, the embodiment of the present invention can, through the speech classification model, determine the speech samples the text feature vector may belong to and the membership probability for each of them, and then select the speech sample with the highest membership probability as the speech sample matching the text feature vector.
Step S240, taking the voice control instruction corresponding to the identified speech sample as the voice control instruction of the voice information.
Step S250, generating a target control instruction according to the voice control instruction.
The target control instruction can be the final control instruction generated by the human-computer interaction terminal for the service robot. When voice control is used alone, the embodiment of the present invention can use the voice control instruction directly as the target control instruction; when the user also uses gestures, the gesture control instruction corresponding to the user's gesture also serves as a parameter of the target control instruction, so that the target control instruction is generated by combining the voice control instruction expressed by the user's speech with the gesture control instruction expressed by the user's gesture.
Of course, when a target object in the scene of the service robot's environment needs to be controlled (the target object can be understood as the object operated by the service robot under the user's control; this is only one optional control situation), the environment scene may also be combined to identify the target object, so that the service robot performs the control operation corresponding to the target control instruction on the identified target object.
The man-machine interaction method provided by the embodiment of the present invention can perform text feature extraction on the voice information in the control information conveyed by the user and determine the corresponding text feature vector; the speech sample matching the text feature vector can then be determined according to the pre-trained speech classification model; the voice control instruction corresponding to the identified speech sample is taken as the voice control instruction of the voice information, and the target control instruction is generated from it, realizing the generation of the target control instruction for the machine during human-computer interaction.
Since the pre-trained speech classification model can precisely define the probability that each text feature vector belongs to each speech sample it may be intended to express, the correspondence between speech samples and text feature vectors is more accurate. Through the embodiment of the present invention, the user can therefore carry out human-computer interaction in a way similar to person-to-person communication: after the user conveys voice information to the human-computer interaction terminal in natural speech, the terminal can use the speech classification model to accurately identify the speech sample matching the voice information conveyed by the user, and thereby identify the voice control instruction intended by that voice information. With the embodiment of the present invention, the user can convey voice information more naturally, and the human-computer interaction terminal can accurately match the speech sample of the user's voice information through the speech classification model, achieving accurate determination of the voice control instruction the voice information is intended to convey. This improves the naturalness and intelligence of human-computer interaction, lowers the communication threshold for users carrying out human-computer interaction, and provides strong support for the popularization of human-computer interaction.
An example of human-computer interaction using voice information according to the embodiment of the present invention can be as shown in Fig. 6. The user speaks to the human-computer interaction terminal to control the service robot; after the terminal obtains the voice conveyed by the user, it converts the speech into text, extracts the text feature of the text and determines the corresponding text feature vector, determines the speech sample matching the text feature vector through the maximum entropy classification model, and thereby determines the voice control instruction corresponding to that speech sample. The terminal transmits the voice control instruction to the service robot over the Internet, and the service robot executes it, realizing remote control of the service robot by the user. Of course, the human-computer interaction terminal in Fig. 6 may also be built into the service robot.
The embodiment of the present invention can also realize human-computer interaction in combination with user gestures. In this process, the human-computer interaction terminal needs to understand the gesture control instruction corresponding to the user's gesture; to determine this instruction more accurately, the recognition of the user's gesture must be optimized, so that improving the accuracy of user gesture recognition serves to improve the accuracy with which the gesture control instruction is determined.
To improve the accuracy of user gesture recognition, the embodiment of the present invention can improve the recognition accuracy of the gesture position and of the gesture posture separately, and then fuse the determined gesture position with the gesture posture, thereby improving the recognition accuracy of the user's gesture.
Correspondingly, in the method shown in Fig. 5, the control information conveyed by the user can also include gesture information. Optionally, Fig. 7 shows another flowchart of the man-machine interaction method provided by an embodiment of the present invention; the method can be carried out by the human-computer interaction terminal. Referring to Fig. 7, the method may include:
Step S300, obtaining the control information conveyed by the user, the control information including voice information and gesture information; the gesture information includes a gesture position feature and a gesture posture feature.
Optionally, the gesture information can be the original gesture feature information represented by user gesture images of consecutive multiple frames (such as a user gesture image sequence); the original gesture feature information can be extracted from the user gesture images of consecutive multiple frames and can be expressed by the gesture position feature and the gesture posture feature. Optionally, the gesture position feature is a feature related to the position of the gesture, such as the coordinates, speed and acceleration of the hand on the X, Y and Z axes; the gesture posture feature is, for example, the rotation angle of the hand about each axis of the XYZ coordinate system.
Optionally, the user gesture images can be acquired by a non-contact image detector such as a three-dimensional camera or an infrared imager. For example, a stereoscopic vision camera or an infrared imaging sensor can detect and recognize the human hand in real time; relevant sensor hardware includes binocular vision cameras, the Kinect somatosensory sensor and the Leap Motion sensor. Taking the Leap Motion sensor as an example, when the hand is placed over the detection zone, the sensor can acquire three-dimensional gesture images at high frequency and return the rectangular coordinates of the hand relative to the Leap Motion base coordinate system (one form of the gesture position feature) and the rotation angles of the palm about the three coordinate axes (one form of the gesture posture feature), thereby obtaining the gesture information expressed by the gesture position feature and the gesture posture feature.
Optionally, the gesture images of the embodiment of the present invention can be three-dimensional, so that by capturing the three-dimensional gesture of the hand, the user's interaction intention can be recognized and converted into an interaction instruction. Unlike traditional two-dimensional gesture interaction, three-dimensional gesture data has advantages such as rich semantic expression and intuitive mapping.
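The two feature streams described above can be sketched as a simple per-frame record split into a position stream and a posture stream; the field names and `split_features` helper are illustrative, not an actual sensor API.

```python
from dataclasses import dataclass

@dataclass
class GestureFrame:
    """One frame of raw gesture information, as it might be read from a
    non-contact detector such as a Leap Motion sensor (names are assumptions)."""
    position: tuple      # (x, y, z) palm coordinates on the XYZ axes
    velocity: tuple      # (vx, vy, vz) hand speed
    acceleration: tuple  # (ax, ay, az) hand acceleration
    rotation: tuple      # rotation angles about the X/Y/Z axes

def split_features(frames):
    """Separate a gesture image sequence into the position-feature stream
    (to be smoothed by adaptive interval Kalman filtering) and the
    posture-feature stream (to be refined by improved particle filtering)."""
    pos = [(f.position, f.velocity, f.acceleration) for f in frames]
    pose = [f.rotation for f in frames]
    return pos, pose
```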
Step S310, extracting the text feature of the voice information.
Step S320, determining the text feature vector corresponding to the text feature.
Step S330, determining the speech sample matching the text feature vector according to the pre-trained speech classification model.
Step S340, taking the voice control instruction corresponding to the identified speech sample as the voice control instruction of the voice information.
Optionally, for the processing of step S310 to step S340, reference can be made to step S210 to step S240 shown in Fig. 5. In the method shown in Fig. 7, there is also parallel processing of the user gesture images, with the following steps.
Step S350, processing the gesture position feature according to adaptive interval Kalman filtering to obtain a target gesture position feature; and processing the gesture posture feature according to improved particle filtering to obtain a target gesture posture feature.
Optionally, while filtering with the measurement data, the adaptive interval Kalman filter continuously judges from the filtering itself whether the dynamics of the system have changed, and estimates and corrects the model parameters and noise statistics, so as to improve the filter design and reduce the actual filtering error. Processing the original gesture position feature extracted from the user gesture images by adaptive interval Kalman filtering can filter out the influence of detector noise and hand muscle jitter on the user's gesture, improving the accuracy of the processed gesture position feature (i.e. the target gesture position feature).
Processing the quaternion components of the original gesture posture feature extracted from the user gesture images by improved particle filtering can make the quaternion components of the processed gesture posture feature (i.e. the target gesture posture feature) closer to the true quaternion components, improving the accuracy of the processed gesture posture feature.
Step S360, fusing the target gesture position feature and the target gesture posture feature to determine the gesture feature of the user.
The embodiment of the present invention can fuse the target gesture position feature processed by adaptive interval Kalman filtering with the target gesture posture feature processed by improved particle filtering, realizing the determination of the user's gesture feature. Through adaptive interval Kalman filtering and improved particle filtering, the embodiment of the present invention can constrain the spatio-temporal correlation of the gesture position and posture, so as to eliminate the instability and ambiguity of the three-dimensional gesture data as much as possible.
Step S370, determining the gesture control instruction corresponding to the gesture feature.
Optionally, the embodiment of the present invention can set up a gesture control instruction library that records the gesture feature (which can be three-dimensional) corresponding to each gesture control instruction, so that after determining the gesture feature of the user gesture images, the embodiment of the present invention can determine the gesture control instruction corresponding to that gesture feature from the gesture features recorded for each gesture control instruction in the library.
Step S380, generating the target control instruction according to the voice control instruction and the gesture control instruction.
Optionally, step S350 to step S370 and step S310 to step S340 can be parallel, being the processing of control information in user-gesture-image form and in user-speech form respectively; there need be no definite order between step S350 to step S370 and step S310 to step S340.
Optionally, the target control instruction for the service robot can generally be described in the form of a control vector composed of four variables, (C_dir, C_opt, C_val, C_unit), where C_dir is the operation direction keyword, C_opt and C_val are a pair of operation descriptors, namely the operation keyword and the operation value, and C_unit is the operation unit; these variables are called voice control variables, and under normal circumstances all four can be specified by the voice control instruction. Correspondingly, when generating the target control instruction according to the voice control instruction, the embodiment of the present invention can determine the voice control variables corresponding to the voice control instruction, which include the operation direction keyword indicated by the voice control instruction, the operation keyword, the operation value corresponding to the operation keyword, and the operation unit, and then describe the target control instruction with the control vector these voice control variables constitute, realizing the generation of the target control instruction.
When voice and gesture control are combined, the embodiment of the present invention can add a new variable C_hand, i.e. the target control instruction can be modified to be described by a control vector of the following five variables:
(C_dir, C_opt, C_hand, C_val, C_unit);
and when gesture control is not needed, C_hand = NULL.
Correspondingly, when generating the target control instruction according to the voice control instruction and the gesture control instruction, the embodiment of the present invention can determine the voice control variables corresponding to the voice control instruction, which include the operation direction keyword indicated by the voice control instruction, the operation keyword, the operation value corresponding to the operation keyword, and the operation unit; at the same time it determines the gesture control variable corresponding to the gesture control instruction. Combining the voice control variables with the gesture control variable then forms the control vector (C_dir, C_opt, C_hand, C_val, C_unit) describing the target control instruction, realizing the generation of the target control instruction.
Optionally, the embodiment of the present invention can consider that gesture-assisted control is needed when a user gesture image is captured within the detection range of the visual detector; otherwise it considers that gesture control is not needed, and the control vector composed of (C_dir, C_opt, C_val, C_unit) can be determined from the voice information (possibly combined with the environment scene of the service robot).
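The assembly of the five-variable control vector above can be sketched as follows; the dictionary keys and the `build_control_vector` helper are illustrative assumptions, with `None` standing in for the NULL value of C_hand.

```python
def build_control_vector(voice_cmd, gesture_cmd=None):
    """Assemble the target control instruction (C_dir, C_opt, C_hand, C_val, C_unit).
    When no gesture is captured within the detector's range, C_hand is None
    (NULL) and the vector degenerates to the voice-only four-variable form."""
    return (voice_cmd["dir"],   # C_dir: operation direction keyword
            voice_cmd["opt"],   # C_opt: operation keyword
            gesture_cmd,        # C_hand: gesture control variable (or NULL)
            voice_cmd["val"],   # C_val: operation value
            voice_cmd["unit"])  # C_unit: operation unit

cmd = {"dir": "forward", "opt": "move", "val": 1, "unit": "meter"}
v_only = build_control_vector(cmd)                  # voice control alone
combined = build_control_vector(cmd, "point_dir")   # voice + gesture control
```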
Optionally, the means of processing the gesture position feature according to adaptive interval Kalman filtering is introduced below. It should be noted that, in the process of obtaining gesture images through a non-contact visual detector, the user gesture expressed by the gesture images may carry detector noise, which often gives the determined user gesture instability, ambiguity and fuzziness. In addition, when the user performs a gesture operation, unintended motion such as muscle jitter inevitably appears due to human factors, which makes the determined user gesture imprecise. Therefore, the embodiment of the present invention can process the original gesture position feature in the user gesture images by adaptive interval Kalman filtering, so as to filter out the influence of detector noise and hand muscle jitter on the user's gesture.
Optionally, the model of the adaptive interval Kalman filter can be expressed as follows:

x_{k+1} = (Φ + ΔΦ_k) x_k + (Γ + ΔΓ_k) u_k + w_k
z_k = (H + ΔH_k) x_k + v_k

where x_k is the n×1 state vector at time k; in order for the Kalman filter to better estimate the hand position data, variables for hand speed and hand acceleration are introduced into the state vector. Φ is the n×n state transition matrix, designed according to the relationship between displacement, velocity and acceleration; its elements are the coefficients of the position, velocity and acceleration variables satisfying the kinematic formulas. Γ is the n×l control output matrix, a constant matrix determined by gravitational acceleration, which keeps the output of the elements of the system input vector u_k consistent with the state vector. w_k and v_k are noise vectors, with w_k usually Gaussian-distributed. z_k is the m×1 measurement vector at time k (its elements are consistent with those of x_k, obtained from the measurement at time k, e.g. the position, speed and acceleration of the hand in the X, Y and Z directions). H is the m×n observation matrix, a constant matrix describing the relationship between the measurement vector and the state vector. The three matrices introduced here with the Δ symbol denote unknown but norm-bounded perturbation matrices.
Correspondingly, the state x_k of the gesture position feature at time k can be expressed as follows:

x_k = [p_{x,k}, V_{x,k}, A_{x,k}, p_{y,k}, V_{y,k}, A_{y,k}, p_{z,k}, V_{z,k}, A_{z,k}]^T

where p_{x,k}, p_{y,k}, p_{z,k} are the coordinates of the hand on the three X, Y, Z axes in space at time k; V_{x,k}, V_{y,k}, V_{z,k} are the speeds of the hand in the X, Y, Z directions at time k; and A_{x,k}, A_{y,k}, A_{z,k} are the accelerations of the hand in the X, Y, Z directions at time k. Because the adaptive interval Kalman filter is an estimator, it can use the gesture coordinates, gesture speed and acceleration of the previous moment to estimate the position of the current moment more accurately.
In this process, the noise vector can be expressed as w_k = [0, 0, w_x, 0, 0, w_y, 0, 0, w_z]^T, where (w_x, w_y, w_z) is the process noise of the palm's acceleration (noise that does not conform to the overall law of the gesture's acceleration change); this noise vector can be filtered out in the model of the adaptive interval Kalman filter. Thus, by processing the state x_{k-1} of the gesture position feature at time k-1 and the noise vector with the adaptive interval Kalman filter model stated above, the target gesture position feature at time k, with the noise filtered out and the muscle jitter eliminated, can be obtained.
It can be seen that the embodiment of the present invention can determine the law of gesture acceleration change according to the acceleration corresponding to the gesture position feature, filter out noise deviating from that law through the model of the adaptive interval Kalman filter, and, using that model, estimate the gesture coordinates, gesture speed and acceleration of the current moment from those of the previous moment in the noise-filtered gesture position feature, determining the target gesture position feature of the current moment.
The accuracy of the target gesture position feature obtained through adaptive interval Kalman filtering is thus improved, and it can be used to perform coarse control operations on the service robot (since the user cannot, without external aids, move the hand with millimeter-level accuracy, the control performed on the service robot is coarse control).
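As a minimal sketch of the smoothing role described above, the following implements one predict/update cycle of a plain constant-acceleration Kalman filter for a single axis with state [position, velocity, acceleration]; the adaptive and interval extensions of the patent (online noise-statistics correction, the bounded perturbation matrices ΔΦ, ΔΓ, ΔH) are omitted, and the noise parameters are assumptions.

```python
import numpy as np

def kalman_step(x, P, z, dt=0.02, q=1e-3, r=1e-2):
    """One predict/update cycle for one axis, state x = [p, V, A].
    z measures the full state (position, speed, acceleration), as the
    measurement vector in the text does."""
    F = np.array([[1.0, dt, 0.5 * dt**2],   # kinematic transition:
                  [0.0, 1.0, dt],           # p' = p + V*dt + A*dt^2/2, etc.
                  [0.0, 0.0, 1.0]])
    H = np.eye(3)                            # observation matrix
    x = F @ x                                # predict
    P = F @ P @ F.T + q * np.eye(3)
    S = H @ P @ H.T + r * np.eye(3)
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x = x + K @ (z - H @ x)                  # correct with the measurement
    P = (np.eye(3) - K @ H) @ P
    return x, P
```

Feeding it noisy hand-position measurements suppresses jitter that does not fit the position/velocity/acceleration kinematics, which is the effect the adaptive interval filter exploits.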
Optionally, as for the means of processing the gesture posture feature according to improved particle filtering, Fig. 8 shows a method flowchart of processing the gesture posture feature by improved particle filtering, provided by an embodiment of the present invention. The method can be executed by the human-computer interaction terminal; referring to Fig. 8, the method may include:
Step S400, obtaining the rotation angles, about each axis of the three-dimensional coordinate system, of the hand represented by the gesture posture feature.
Step S410, determining the quaternion components according to the rotation angles of the hand about each axis of the three-dimensional coordinate system.
Optionally, the quaternion algorithm can be used to estimate the direction of a rigid body, and the quaternion components can be calculated accordingly. A quaternion is a set of hypercomplex numbers that can describe the attitude of a rigid body in space; in the embodiment of the present invention, the quaternion components can refer to the attitude of the hand. Correspondingly, the embodiment of the present invention can determine, from the original gesture posture feature extracted from the gesture images, the rotation angles of the hand about each axis of the three-dimensional coordinate system (the rotation angles can be one kind of information contained in the original gesture posture feature), and then determine the corresponding quaternion components with the quaternion algorithm.
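The conversion from per-axis rotation angles to quaternion components in step S410 can be sketched as the standard Euler-to-quaternion formula; the ZYX rotation order below is an assumption, since the patent does not fix one.

```python
import math

def euler_to_quaternion(roll, pitch, yaw):
    """Convert the hand's rotation angles about the X/Y/Z axes (radians)
    into quaternion components (w, x, y, z) describing its attitude,
    assuming ZYX rotation order."""
    cr, sr = math.cos(roll / 2), math.sin(roll / 2)
    cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
    cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)
    return (cr * cp * cy + sr * sp * sy,
            sr * cp * cy - cr * sp * sy,
            cr * sp * cy + sr * cp * sy,
            cr * cp * sy - sr * sp * cy)
```

A zero rotation maps to the identity quaternion (1, 0, 0, 0), and every output is a unit quaternion, as an attitude representation requires.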
Step S420, determining the posterior probability of the hand particles according to improved particle filtering.
Optionally, in order to reduce the error brought by using the quaternion algorithm, improved particle filtering is used to enhance the data fusion (what is fused is the attitude data of each particle expressing the hand; the improved particle filter algorithm can choose a better importance density function or optimize the resampling process, with the aim of obtaining accurate hand attitude data). The improved particle filter can process the resampled particles using Markov chain Monte Carlo, so as to improve the diversity of the particles, avoid the local convergence phenomenon of standard particle filtering, and improve the accuracy of data estimation.
Step S430, iteratively processing the quaternion components according to the posterior probability, obtaining target quaternion components and thereby the target gesture posture feature.
Optionally, the target quaternion components can approach the true quaternion components of the hand gesture.
Optionally, when determining the posterior probability of the hand particles, at time t_k the approximate value of the posterior probability of the hand particles can be defined as:

p(x_k | z_{1:k}) ≈ Σ_{i=1}^{N} ω_{i,k} · δ(x_k − x_{i,k})

where x_{i,k} is the i-th state particle at time t_k, N is the number of samples, ω_{i,k} is the normalized weight of the i-th particle at time t_k, and δ is the Dirac function; x_k can be the hand state, which in the embodiment of the present invention is the 4 elements of the quaternion and can be used to indicate the attitude of the hand.
Thus, through the posterior probability of the hand particles, the hand particles (i.e. the quaternion components of the original hand posture feature) can be calculated iteratively, making the particle states approach the true values more and more closely and yielding the true three-dimensional gesture posture (i.e. the target gesture posture feature).
The specific iteration can follow the formula:

x_{i,k} = x_{i,k} + K_k · ( z_k − h(x_{i,k}) − v_{i,k} )

where K_k is the Kalman gain, z_k is the observation, h is the observation operator, and v_{i,k} is the observation error of the i-th particle state at time t_k.
The rigid-body attitude is then expressed with the quaternion (the purpose of calculating the quaternion components is to obtain the rigid-body attitude); at time t_{k+1} the quaternion components of each particle can be expressed as:

q_{i,k+1} = q_{i,k} + (t/2) · Ω(ω) · q_{i,k}

where ω denotes the angular velocity (Ω(ω) being its quaternion rate matrix) and t is the sample time, thereby obtaining the target gesture posture feature.
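The weighted-particle posterior above can be sketched in scalar form: the hand state is estimated as the weight-normalized sum over particles, and multinomial resampling draws a new particle set from the weights. The MCMC diversification step of the improved filter is only noted in a comment, not implemented; function names are illustrative.

```python
import random

def particle_estimate(particles, weights):
    """Weighted-sum approximation of the posterior:
    estimate = sum_i w_i * x_i / sum_i w_i over the particle set."""
    total = sum(weights)
    return sum(w * x for x, w in zip(particles, weights)) / total

def resample(particles, weights, rng=random):
    """Multinomial resampling. An improved particle filter would then
    perturb the resampled set (e.g. with a Markov chain Monte Carlo move)
    to restore particle diversity and avoid local convergence."""
    return rng.choices(particles, weights=weights, k=len(particles))
```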
Through improved particle filtering, the embodiment of the present invention can process the original gesture posture feature extracted from the user gesture images, greatly improving the estimation accuracy of the gesture posture feature, which can likewise support coarse control operations on the service robot.
It should be explained here that the calculation of the particle weights needs to combine the position estimation result of the Kalman filter, since the position and posture of the three-dimensional gesture data are associated in space and time: the velocity and acceleration of the gesture are directional, and their direction must be calculated in the body coordinate system determined by the posture, while the displacement of the gesture in three dimensions needs the posture to be estimated. Therefore, by combining adaptive interval Kalman filtering, the spatio-temporal constraint between position and posture can improve the precision of data estimation: accurate position data allows the particle weights to be calculated better, yielding accurate attitude data, and accurate attitude data in turn allows the position data to be estimated better through velocity and acceleration. By processing and fusing the hand position and attitude data with adaptive interval Kalman filtering and improved particle filtering, the three-dimensional gesture feature of the user can be estimated better, improving the accuracy and robustness of the determined user gesture feature.
Optionally, further, after fusing the target gesture position feature and the target gesture posture feature to determine the gesture feature of the user, the embodiment of the present invention can filter out gesture features the user does not intend to express by a damping method, and further improve the accuracy of gesture recognition by introducing a virtual spring coefficient. Specifically, this can be realized by the following formula:

F = k·D, if D ≤ τ; otherwise the robot does not respond to the three-dimensional gesture input

where F is the robot control instruction input, k is the virtual spring coefficient, D is the distance moved by the hand, and τ is the elastic limit threshold; when D is greater than τ, the robot does not respond to the three-dimensional gesture input. That is, after determining the user's gesture feature, if the distance of the hand movement corresponding to the gesture feature is greater than the set elastic limit threshold, the embodiment of the present invention needs to filter out that gesture feature. Considering that violent hand movement may occur during interaction (different from muscle jitter, here meaning frequent large-range movement of the hand), the three-dimensional gesture data at such moments are unintended input data, so these data are filtered out to keep the system stable.
Correspondingly, when determining the gesture control instruction corresponding to the gesture feature, the embodiment of the present invention can determine the gesture control instruction corresponding to the unfiltered gesture feature.
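The virtual-spring gating just described can be sketched as a one-line threshold rule; the default values of k and τ are illustrative assumptions.

```python
def spring_filter(distance, k=1.0, tau=0.3):
    """Virtual-spring gating of hand movement: within the elastic limit tau
    the control input grows as F = k * D; beyond it, the movement is treated
    as an unintended large-range motion and ignored (the robot does not
    respond, modeled here as F = 0)."""
    return k * distance if distance <= tau else 0.0
```

Small, deliberate motions thus pass through proportionally, while sudden large sweeps are discarded to keep the system stable.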
An optional application example of the embodiment of the present invention may be as follows:
The user says the voice "move in this direction" to the human-computer interaction terminal to control the service robot, and makes a pointing gesture; after the human-computer interaction terminal obtains the voice conveyed by the user, it converts the speech into text, extracts the text feature of the text, determines the text feature vector corresponding to the text feature, and determines the speech sample matching the text feature vector through the maximum entropy classification model, thereby determining the movement-related voice control instruction corresponding to the speech sample;
meanwhile, the human-computer interaction terminal obtains the gesture position feature and gesture posture feature of the user gesture image, processes the gesture position feature according to the adaptive interval Kalman filter, processes the gesture posture feature according to the improved particle filter, and fuses the processed gesture position feature and gesture posture feature to determine the gesture feature of the user; based on the gesture feature, the human-computer interaction terminal can determine the gesture control instruction related to the moving direction;
the human-computer interaction terminal can then, according to the determined voice control instruction and gesture control instruction, control the service robot to move in the direction indicated by the user; that is, the operation instruction the human-computer interaction terminal obtains from the user's natural language is "move", and the moving direction is the direction of the user's finger.
In this human-computer interaction process, the user can combine voice and gesture, so that the exchange between the user and the service robot can be similar to the exchange between users; the human-computer interaction is very convenient and direct, which improves the naturality and intelligence of human-computer interaction, lowers the communication threshold for the user to perform human-computer interaction, and provides strong support for the popularization of human-computer interaction.
Optionally, in some human-computer interaction scenarios, the service robot generally needs to operate on a target object in the environment scene according to the user's control; if the user instructs the service robot to "pick up the cup on the ground", the service robot needs to recognize the target object "cup" in the environment scene, without the user telling the robot which object is the "cup", where the "cup" is, and other such information, so that the service robot can autonomously identify the cup in the environment scene and execute the "pick up" operation; it can be seen that cognition of the environment gives the service robot a certain autonomy, and the user's control appears very simple; therefore, accurately identifying the target object in the environment scene helps promote the naturality and intelligence of human-computer interaction.
Optionally, Fig. 9 shows a target object recognition method flow provided in an embodiment of the present invention; the method can be applied to a human-computer interaction terminal. Referring to Fig. 9, the method may include:
Step S500, obtaining an environment scene image.
Optionally, the embodiment of the present invention may collect the environment scene image through an image collecting device, such as a camera, preset on the service robot; the environment scene image may be considered the image of the environment scene where the service robot is located;
optionally, if the human-computer interaction terminal interacts with the service robot through the Internet, the human-computer interaction terminal may obtain the environment scene image collected by the service robot through the Internet; if the service robot has a built-in human-computer interaction terminal, the human-computer interaction terminal may obtain the environment scene image collected by the image collecting device of the service robot.
Step S510, determining the HOG feature of the environment scene image.
Optionally, the embodiment of the present invention may use the HOG (Histogram of Oriented Gradient) feature to describe the image features in the environment scene image; obviously, the HOG feature is only one optional form of image feature, and image features of other forms may also be used in the embodiment of the present invention.
HOG is primarily used to compute statistics of local image gradient direction information. Relative to other feature descriptors, the advantage of HOG is that its algorithm operates on local cell units of the image, giving it good geometric and photometric invariance.
To describe the image features in the environment scene image using the HOG feature, the environment scene image may first be divided into a certain number of sub-images, and each sub-image is then divided into cell units according to certain rules; then, for each sub-image, the gradient orientation histogram (i.e., the HOG feature) of the pixels in each cell unit is acquired, and the density of each gradient orientation histogram in the sub-image is calculated, so as to normalize each cell unit in the sub-image according to the density; finally, the normalization results of the sub-images are combined to determine the HOG feature of the environment scene image.
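The cell-and-normalize procedure above can be sketched minimally with NumPy; this simplifies the full HOG descriptor (no block-level normalization across neighboring cells, no gradient interpolation), and all sizes are illustrative:

```python
import numpy as np

def hog_features(img, cell=8, bins=9):
    """Simplified HOG: per-cell gradient-orientation histograms with
    per-cell L2 normalization (the full descriptor normalizes over
    blocks of neighboring cells)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)                          # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0    # unsigned orientation
    h, w = img.shape
    feats = []
    for i in range(0, h - cell + 1, cell):          # walk the cell grid
        for j in range(0, w - cell + 1, cell):
            m = mag[i:i + cell, j:j + cell].ravel()
            a = ang[i:i + cell, j:j + cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            feats.append(hist / (np.linalg.norm(hist) + 1e-6))
    return np.concatenate(feats)

img = np.random.rand(32, 32)
f = hog_features(img)
print(f.shape)  # (144,) : 4 x 4 cells x 9 orientation bins
```

Magnitude-weighted orientation histograms make strong edges dominate each cell's descriptor, which is what gives HOG its robustness to small photometric changes.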
Step S520, extracting the target keyword from the voice information conveyed by the user.
Optionally, the target keyword may be the keyword of the target object to be identified in the environment scene, carried in the voice information conveyed by the user; optionally, the target keyword is usually the object of the sentence (the noun of the target object to be operated on in the environment scene, etc.), following the action word in the voice information or associated with an action word in the voice information.
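A toy sketch of picking the target keyword as the noun following the action word; the verb list and stop-word handling are hypothetical, and a real system would use part-of-speech tagging on the recognized text:

```python
# Hypothetical keyword extraction: find the action word, then take the
# first non-stop-word after it as the target keyword. Word lists are
# illustrative only.

ACTION_WORDS = {"pick", "grab", "fetch", "move", "bring"}
STOP_WORDS = {"up", "the", "a", "an", "this", "that"}

def target_keyword(text):
    words = text.lower().split()
    for i, w in enumerate(words):
        if w in ACTION_WORDS:
            # first non-stop-word after the action word
            for cand in words[i + 1:]:
                if cand not in STOP_WORDS:
                    return cand
    return None

print(target_keyword("pick up the cup"))  # cup
```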
Step S530, matching, according to the pre-trained object classification model, the HOG feature corresponding to the target keyword from the HOG features of the environment scene image.
Step S540, determining the object corresponding to the matched HOG feature in the environment scene image as the identified target object.
Optionally, the target object may be considered the object operated on by the service robot under the user's control, i.e., the object targeted when the target control instruction is executed.
In this process, the object classification model can indicate the HOG feature corresponding to each object, and the training and learning of the object classification model are of great importance to the accuracy and efficiency of target identification; here, the embodiment of the present invention may use a deep learning method to train the object classification model; deep learning carries out learning from unlabeled data, which is closer to the learning mode of the human brain, so that concepts can be mastered autonomously through training; in the face of massive data, deep learning algorithms can accomplish what traditional artificial intelligence algorithms cannot, and the output results become more accurate as the amount of processed data increases, which will substantially increase the efficiency of computer information processing; moreover, the training methods of deep learning differ greatly according to the established network structure; in order to allow the robot to complete online learning and obtain the object classification model by training within a short period of time, the embodiment of the present invention intends to adopt a two-stage learning method;
optionally, for any object, the embodiment of the present invention may determine a candidate set from a reduced feature set containing the image features of the object (referred to as the first feature set), and then use a larger, more reliable feature set containing the image features of the object (referred to as the second feature set) to rank the features in the candidate set (the ranking mode may be, for example, in descending order of HOG feature values; the specific ranking rule is not strictly limited), i.e., the image features of the object contained in the second feature set exceed the image features of the object contained in the first feature set; the features at the set ranks in the candidate set are then chosen as the training features of the object, thereby obtaining the training features of the object; performing this processing for any object, the training features of each object can be obtained, and the object classification model is then trained according to the training features of each object.
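The two-stage selection above can be sketched as follows; this is a hypothetical reading in which the small first feature set proposes a candidate set, the larger second feature set re-scores and ranks the candidates in descending order, and the top-ranked candidates become the training features of the object. The feature names and scores stand in for real HOG feature values:

```python
# Hypothetical two-stage training-feature selection. Stage 1: the small
# first feature set proposes candidates. Stage 2: the larger second
# feature set ranks them; the top ranks become the training features.

def select_training_features(first_set, second_set, top_k=2):
    # Stage 1: candidates are features the small first set already covers.
    candidates = [f for f in second_set if f in first_set]
    # Stage 2: rank candidates by the larger set's score, descending.
    ranked = sorted(candidates, key=lambda f: second_set[f], reverse=True)
    return ranked[:top_k]

first = {"edge": 0.4, "corner": 0.3, "blob": 0.2}
second = {"edge": 0.9, "corner": 0.7, "blob": 0.1, "ridge": 0.5}
print(select_training_features(first, second))  # ['edge', 'corner']
```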
Optionally, in human-computer interaction, the robot can identify unknown objects with the aid of the user's empirical knowledge, or correct itself after recognition errors; this requires establishing a training model with labeled data, which can update the learning network parameters of the robot. With the cooperation of the user, on the one hand the robot can better understand the features (Features) of an unknown object through the user's description; on the other hand, the robot can correctly recognize the object through the experience shared by the user (Ground-truth);
in the learning process, the aim is to find the optimal parameters that maximize the recognition accuracy of the system; here, the data input by the user during cooperation for correcting the machine parameters is used as the feature values (Features) and label data (Ground-truth) of the learning network parameters of the robot, so as to update the learning network parameters of the robot according to the feature values and label data.
With the human-computer interaction method provided in the embodiment of the present invention, the user can convey control information to the human-computer interaction terminal by voice, or in the form of voice combined with gesture; the way the user performs human-computer interaction can be similar to the exchange between users, and the human-computer interaction is very convenient and direct; meanwhile, the human-computer interaction terminal can identify the target object in combination with the environment scene of the service robot, without the user further describing the target object to be operated on in the conveyed control information, so that the user's human-computer interaction process appears very simple; it can be seen that the human-computer interaction method provided in the embodiment of the present invention improves the naturality and intelligence of human-computer interaction, lowers the communication threshold for the user to perform human-computer interaction, and provides strong support for the popularization of human-computer interaction.
The human-computer interaction device provided in an embodiment of the present invention is introduced below; the human-computer interaction device described below may be considered the program modules that a human-computer interaction terminal needs in order to realize the human-computer interaction method provided in an embodiment of the present invention. The human-computer interaction device content described below and the human-computer interaction method content described above may be cross-referenced.
Figure 10 is a structural block diagram of a human-computer interaction device provided in an embodiment of the present invention; the device can be applied to a human-computer interaction terminal. Referring to Figure 10, the device may include:
Control information obtaining module 100, for obtaining the control information conveyed by the user, the control information including voice information;
Text feature extraction module 200, for extracting the text feature of the voice information;
Text feature vector determining module 300, for determining the text feature vector corresponding to the text feature;
Speech sample determining module 400, for determining, according to the pre-trained speech classification model, the speech sample matching the text feature vector; the speech classification model indicating the attribution probability of a text feature vector to a corresponding speech sample;
Voice instruction determining module 500, for using the voice control instruction corresponding to the determined speech sample as the voice control instruction of the voice information;
Target instruction generation module 600, for generating the target control instruction according to the voice control instruction.
Optionally, the speech sample determining module 400, for determining, according to the pre-trained speech classification model, the speech sample matching the text feature vector, is specifically configured to:
determine, according to the speech classification model, the speech samples to which the text feature vector may belong, together with the attribution probability of each speech sample to which it may belong;
choose the speech sample with the highest attribution probability as the speech sample matching the text feature vector.
Optionally, Figure 11 shows another structural block diagram of the human-computer interaction device provided in an embodiment of the present invention; as shown in Figure 10 and Figure 11, the device may also include:
Speech classification model training module 700, for obtaining a training corpus, the training corpus recording the speech samples of each voice control instruction, one voice control instruction corresponding to at least one speech sample; extracting the text feature of each speech sample to obtain multiple text features; performing feature vector weighting on each text feature respectively to obtain the text feature vector of each text feature; and modeling, according to a machine learning algorithm, the attribution probability of each text feature vector to the corresponding speech sample, to obtain the speech classification model.
Optionally, the speech classification model training module 700, for performing feature vector weighting on each text feature respectively to obtain the text feature vector of each text feature, is specifically configured to:
for a text feature, determine the number of occurrences of the words of the text feature in the corresponding speech sample, and the number of occurrences in the training corpus;
determine, according to the number of occurrences of the words of the text feature in the corresponding speech sample and in the training corpus, the importance level of the text feature in the corresponding speech sample; wherein the importance level is proportional to the number of occurrences of the words of the text feature in the speech sample, and inversely proportional to the number of occurrences of the words of the text feature in the corpus;
determine the text feature vector corresponding to the text feature according to the importance level.
Optionally, the speech classification model training module 700, for modeling, according to a machine learning algorithm, the attribution probability of each text feature vector to the corresponding speech sample to obtain the speech classification model, is specifically configured to:
model, using a maximum entropy algorithm, the attribution probability of each text feature vector to the corresponding voice control instruction, to obtain a maximum entropy classification model with uniform probability distribution.
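The weighting described above — proportional to a term's occurrences in its speech sample and inversely proportional to its occurrences across the corpus — is the familiar TF-IDF scheme. A minimal sketch follows; the toy corpus and the exact TF-IDF variant are illustrative rather than the patent's:

```python
import math

# Minimal TF-IDF sketch of the importance weighting described above:
# a term's weight grows with its count in the speech sample and shrinks
# with its spread across the whole training corpus.

corpus = [
    "move forward quickly",
    "move backward",
    "pick up the cup",
]

def tfidf(term, sample, corpus):
    tf = sample.split().count(term)                    # in-sample count
    df = sum(term in doc.split() for doc in corpus)    # corpus spread
    idf = math.log(len(corpus) / (1 + df)) + 1.0
    return tf * idf

# "move" appears in many samples, so it weighs less than "cup".
print(tfidf("move", corpus[0], corpus) < tfidf("cup", corpus[2], corpus))  # True
```

The resulting weighted vectors would then be fed to the maximum entropy classifier (often implemented as multinomial logistic regression) to model the attribution probabilities.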
Optionally, the embodiment of the present invention may also perform human-computer interaction in combination with user gestures; correspondingly, the control information may also include gesture information; the gesture information may include: a gesture position feature and a gesture posture feature extracted from a user gesture image;
optionally, Figure 12 shows yet another structural block diagram of the human-computer interaction device provided in an embodiment of the present invention; as shown in Figure 10 and Figure 12, the device may also include:
Adaptive interval Kalman filter processing module 800, for processing the gesture position feature according to an adaptive interval Kalman filter to obtain a target gesture position feature;
Improved particle filter processing module 900, for processing the gesture posture feature according to an improved particle filter to obtain a target gesture posture feature;
Gesture feature determining module 1000, for fusing the target gesture position feature and the target gesture posture feature to determine the gesture feature of the user;
Gesture control instruction determining module 1100, for determining the gesture control instruction corresponding to the gesture feature;
correspondingly, the target instruction generation module 600, for generating the target control instruction according to the voice control instruction, is specifically configured to:
generate the target control instruction according to the voice control instruction and the gesture control instruction.
Optionally, the adaptive interval Kalman filter processing module 800, for processing the gesture position feature according to the adaptive interval Kalman filter to obtain the target gesture position feature, is specifically configured to:
determine the gesture acceleration change rule according to the acceleration corresponding to the gesture position feature;
filter, according to the model of the adaptive interval Kalman filter, the noise deviating from the gesture acceleration change rule;
estimate, using the model of the adaptive interval Kalman filter, the gesture coordinates, gesture speed and acceleration of the current moment according to the gesture coordinates, gesture speed and acceleration of the previous moment in the noise-filtered gesture position feature, to determine the target gesture position feature of the current moment.
Optionally, the improved particle filter processing module 900, for processing the gesture posture feature according to the improved particle filter to obtain the target gesture posture feature, is specifically configured to:
obtain the rotation angles, around each axis of the three-dimensional coordinate system, of the human hand represented by the gesture posture feature;
determine the quaternion components according to the rotation angles of the human hand around each axis of the three-dimensional coordinate system;
determine the posterior probability of the human hand particles according to the improved particle filter;
iteratively process the quaternion components according to the posterior probability to obtain the target quaternion components, i.e., the target gesture posture feature.
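The angles-to-quaternion step above is a standard conversion; the sketch below shows it for rotation angles (roll, pitch, yaw) about the three coordinate axes. The particle-filter reweighting of the quaternion components is not shown:

```python
import math

# Standard conversion from rotation angles about the three coordinate
# axes (roll, pitch, yaw) to quaternion components (w, x, y, z).

def euler_to_quaternion(roll, pitch, yaw):
    cr, sr = math.cos(roll / 2), math.sin(roll / 2)
    cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
    cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)
    w = cr * cp * cy + sr * sp * sy
    x = sr * cp * cy - cr * sp * sy
    y = cr * sp * cy + sr * cp * sy
    z = cr * cp * sy - sr * sp * cy
    return (w, x, y, z)

q = euler_to_quaternion(0.1, 0.2, 0.3)
norm = sum(c * c for c in q) ** 0.5
print(round(norm, 6))  # 1.0 : a rotation quaternion is unit length
```

Working in quaternions rather than raw angles avoids gimbal lock and makes the iterative reweighting of posture hypotheses numerically stable.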
Optionally, the target instruction generation module 600, for generating the target control instruction according to the voice control instruction and the gesture control instruction, is specifically configured to:
determine the voice control variables corresponding to the voice control instruction, the voice control variables including: the operation direction keyword indicated by the voice control instruction, the operation keyword, the operating value corresponding to the operation keyword, and the operating unit; and determine the gesture control variables corresponding to the gesture control instruction;
combine the voice control variables and the gesture control variables to form the control vector describing the target control instruction.
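A hypothetical sketch of combining the two kinds of control variables into one control vector; the field names and the tuple representation of the gesture direction are illustrative, not the patent's:

```python
# Hypothetical combination of voice control variables (direction keyword,
# operation keyword, operating value, operating unit) with a gesture
# control variable (a pointing direction) into one control vector.

from dataclasses import dataclass, asdict

@dataclass
class VoiceControl:
    direction_keyword: str   # e.g. "this direction" (deictic, ambiguous alone)
    operation_keyword: str   # e.g. "move"
    value: float             # operating value
    unit: str                # operating unit

def control_vector(voice, gesture_direction):
    vec = asdict(voice)
    # the gesture resolves the deictic direction word into a real direction
    vec["direction"] = gesture_direction
    return vec

v = VoiceControl("this direction", "move", 1.0, "m")
print(control_vector(v, (0.0, 1.0, 0.0))["operation_keyword"])  # move
```

The point of the combination is that neither modality is complete alone: the voice carries the operation and magnitude, while the gesture supplies the concrete direction.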
Optionally, Figure 13 shows yet another structural block diagram of the human-computer interaction device provided in an embodiment of the present invention; as shown in Figure 12 and Figure 13, the device may also include:
Gesture feature filtering module 1200, for filtering out the gesture feature of the user if the distance of the corresponding hand movement is greater than the set elastic limit threshold;
correspondingly, the gesture control instruction determining module 1100 can be used to determine the gesture control instruction corresponding to the unfiltered gesture feature.
Optionally, Figure 14 shows still another structural block diagram of the human-computer interaction device provided in an embodiment of the present invention; as shown in Figure 10 and Figure 14, the device may also include:
Target object recognition module 1300, for obtaining an environment scene image; determining the image features of the environment scene image; extracting the target keyword from the voice information conveyed by the user; matching, according to the pre-trained object classification model, the image feature corresponding to the target keyword from the image features of the environment scene image, the object classification model indicating the image feature corresponding to each object; and determining the object corresponding to the matched image feature in the environment scene image as the identified target object, the target object being the object targeted when the target control instruction is executed.
Optionally, the training of the object classification model may be realized by an object classification model training module, as shown in Figure 15; Figure 15 shows still another structural block diagram of the human-computer interaction device provided in an embodiment of the present invention; as shown in Figure 14 and Figure 15, the device may also include:
Object classification model training module 1400, for, for any object, determining a candidate set from a first feature set containing the image features of the object, ranking the features in the candidate set by a second feature set containing the image features of the object, and choosing the features at the set ranks in the candidate set as the training features of the object, so as to obtain the training features of each object; wherein the image features of the object contained in the second feature set exceed the image features of the object contained in the first feature set; and training the object classification model according to the training features of each object.
Optionally, the human-computer interaction device provided in the embodiment of the present invention may also be used to:
use the data input by the user for correcting the machine parameters as the feature values and label data of the learning network parameters of the robot; and update the learning network parameters of the robot according to the feature values and label data.
Optionally, the module architecture of the human-computer interaction device described above may be loaded into a human-computer interaction terminal in program form. The structure of the human-computer interaction terminal may be as shown in Fig. 3, comprising: at least one memory and at least one processor; wherein the memory stores a program, and the processor calls the program; the program is used for:
obtaining control information conveyed by a user, the control information including voice information;
extracting a text feature of the voice information;
determining a text feature vector corresponding to the text feature;
determining, according to a pre-trained speech classification model, a speech sample matching the text feature vector; the speech classification model indicating the attribution probability of a text feature vector to a corresponding speech sample;
using the voice control instruction corresponding to the determined speech sample as the voice control instruction of the voice information;
generating a target control instruction according to the voice control instruction.
Each embodiment in this specification is described in a progressive manner; the emphasis of each embodiment is its difference from the other embodiments, and the same or similar parts of the embodiments may refer to each other. As for the device disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, its description is relatively simple, and relevant points may refer to the description of the method part.
Those skilled in the art may further appreciate that the units and algorithm steps described in conjunction with the embodiments disclosed herein can be realized with electronic hardware, computer software, or a combination of the two; in order to clearly demonstrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to achieve the described functions for each specific application, but such implementation should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in conjunction with the embodiments disclosed herein may be implemented directly with hardware, a software module executed by a processor, or a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium well known in the technical field.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principle defined herein can be realized in other embodiments without departing from the core idea or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (15)
1. A human-computer interaction method, characterized by comprising:
obtaining control information conveyed by a user, the control information comprising voice information;
extracting a text feature of the voice information;
determining a text feature vector corresponding to the text feature;
determining, according to a pre-trained speech classification model, a speech sample matching the text feature vector; the speech classification model indicating the attribution probability of a text feature vector to a corresponding speech sample;
using the voice control instruction corresponding to the determined speech sample as the voice control instruction of the voice information;
generating a target control instruction according to the voice control instruction.
2. The human-computer interaction method according to claim 1, characterized in that the determining, according to the pre-trained speech classification model, the speech sample matching the text feature vector comprises:
determining, according to the speech classification model, the speech samples to which the text feature vector may belong, together with the attribution probability of each speech sample to which it may belong;
choosing the speech sample with the highest attribution probability as the speech sample matching the text feature vector.
3. The human-computer interaction method according to claim 1 or 2, characterized by further comprising:
obtaining a training corpus, the training corpus recording speech samples of each voice control instruction, one voice control instruction corresponding to at least one speech sample;
extracting the text feature of each speech sample to obtain multiple text features;
performing feature vector weighting on each text feature respectively to obtain the text feature vector of each text feature;
modeling, according to a machine learning algorithm, the attribution probability of each text feature vector to the corresponding speech sample, to obtain the speech classification model.
4. The human-computer interaction method according to claim 3, characterized in that the performing feature vector weighting on each text feature respectively to obtain the text feature vector of each text feature comprises:
for a text feature, determining the number of occurrences of the words of the text feature in the corresponding speech sample, and the number of occurrences in the training corpus;
determining, according to the number of occurrences of the words of the text feature in the corresponding speech sample and in the training corpus, the importance level of the text feature in the corresponding speech sample; wherein the importance level is proportional to the number of occurrences of the words of the text feature in the speech sample, and inversely proportional to the number of occurrences of the words of the text feature in the corpus;
determining the text feature vector corresponding to the text feature according to the importance level.
5. The human-computer interaction method according to claim 3, characterized in that the modeling, according to a machine learning algorithm, the attribution probability of each text feature vector to the corresponding speech sample to obtain the speech classification model comprises:
modeling, using a maximum entropy algorithm, the attribution probability of each text feature vector to the corresponding voice control instruction, to obtain a maximum entropy classification model with uniform probability distribution.
6. The human-computer interaction method according to claim 1, characterized in that the control information further comprises gesture information; the gesture information comprising: a gesture position feature and a gesture posture feature extracted from a user gesture image;
the method further comprising:
processing the gesture position feature according to an adaptive interval Kalman filter to obtain a target gesture position feature; and processing the gesture posture feature according to an improved particle filter to obtain a target gesture posture feature;
fusing the target gesture position feature and the target gesture posture feature to determine the gesture feature of the user;
determining the gesture control instruction corresponding to the gesture feature;
the generating a target control instruction according to the voice control instruction comprising:
generating the target control instruction according to the voice control instruction and the gesture control instruction.
7. The human-computer interaction method according to claim 6, characterized in that the processing the gesture position feature according to the adaptive interval Kalman filter to obtain the target gesture position feature comprises:
determining a gesture acceleration change rule according to the acceleration corresponding to the gesture position feature;
filtering, according to the model of the adaptive interval Kalman filter, noise deviating from the gesture acceleration change rule;
estimating, using the model of the adaptive interval Kalman filter, the gesture coordinates, gesture speed and acceleration of the current moment according to the gesture coordinates, gesture speed and acceleration of the previous moment in the noise-filtered gesture position feature, to determine the target gesture position feature of the current moment.
8. The man-machine interaction method according to claim 6, characterized in that said processing the gesture posture feature according to an improved particle filter to obtain the target gesture posture feature comprises:
obtaining the rotation angles, about each axis of a three-dimensional coordinate system, of the human hand represented by the gesture posture feature;
determining quaternion components according to the rotation angles of the human hand about each axis of the three-dimensional coordinate system;
determining the posterior probability of each hand particle according to the improved particle filter;
iteratively processing the quaternion components according to the posterior probability to obtain target quaternion components, thereby obtaining the target gesture posture feature.
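The "improved" particle filter itself is not specified in this excerpt. As a sketch of the two steps the claim does name, the code below pairs the textbook Euler-angle-to-quaternion conversion with a toy bootstrap particle filter over a single rotation angle; the Gaussian weights play the role of the per-particle posterior probability.

```python
import math
import random

def euler_to_quaternion(roll, pitch, yaw):
    """Quaternion components (w, x, y, z) from rotation angles about the
    three coordinate axes, as in the claim's angle-to-quaternion step."""
    cr, sr = math.cos(roll / 2), math.sin(roll / 2)
    cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
    cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)
    return (cr * cp * cy + sr * sp * sy,
            sr * cp * cy - cr * sp * sy,
            cr * sp * cy + sr * cp * sy,
            cr * cp * sy - sr * sp * cy)

def particle_filter_angle(observations, n=500, noise=0.05):
    """Toy bootstrap particle filter: estimate one rotation angle from
    noisy observations by weighting particles with a Gaussian likelihood
    (the per-particle 'posterior probability') and resampling."""
    particles = [random.uniform(-math.pi, math.pi) for _ in range(n)]
    for z in observations:
        weights = [math.exp(-((p - z) ** 2) / (2 * noise ** 2))
                   for p in particles]
        particles = random.choices(particles, weights=weights, k=n)
        particles = [p + random.gauss(0, noise / 2) for p in particles]  # jitter
    return sum(particles) / n

random.seed(1)
true_angle = 0.8
obs = [true_angle + random.gauss(0, 0.05) for _ in range(30)]
est = particle_filter_angle(obs)          # filtered rotation angle
q = euler_to_quaternion(est, 0.0, 0.0)    # corresponding quaternion
```

The resulting quaternion is unit-norm by construction, which is what makes it usable as a posture representation.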
9. The man-machine interaction method according to claim 6, characterized in that said generating a target control instruction according to the voice control instruction and the gesture control instruction comprises:
determining voice control variables corresponding to the voice control instruction, the voice control variables including: the operation direction keyword indicated by the voice control instruction, the operation keyword, the operation value corresponding to the operation keyword, and the operation unit; and determining gesture control variables corresponding to the gesture control instruction;
combining the voice control variables and the gesture control variables to form a control vector describing the target control instruction.
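A minimal sketch of the combination step: the voice control variables (direction keyword, operation keyword, value, unit) and the gesture control variable are packed into one structure describing the target instruction. The field names and example values are illustrative, not from the patent.

```python
def build_control_vector(voice_cmd, gesture_cmd):
    """Combine the voice control variables with the gesture control
    variable into a single control vector (here a dict) that describes
    the target control instruction."""
    return {
        "direction": voice_cmd["direction"],   # operation direction keyword
        "operation": voice_cmd["operation"],   # operation keyword
        "value": voice_cmd["value"],           # operation value
        "unit": voice_cmd["unit"],             # operation unit
        "gesture": gesture_cmd,                # gesture control variable
    }

vec = build_control_vector(
    {"direction": "forward", "operation": "move", "value": 2, "unit": "meters"},
    "point_left",
)
```

A downstream executor would then read the vector's fields to drive the robot or terminal.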
10. The man-machine interaction method according to claim 1, characterized in that the method further comprises:
obtaining an environment scene image;
determining image features of the environment scene image;
extracting a target keyword from the voice information conveyed by the user;
matching, according to a pre-trained object-class model, the image feature corresponding to the target keyword from among the image features of the environment scene image, the object-class model indicating the image features corresponding to each object;
determining the object in the environment scene image that corresponds to the matched image feature as the recognized target object, the target object being the object at which the target control instruction is directed when executed.
11. The man-machine interaction method according to claim 10, characterized in that the method further comprises:
for any object, determining a candidate set from a first feature set containing image features of the object, ranking the features in the candidate set according to a second feature set containing image features of the object, and choosing the features at set ranking positions in the candidate set as the training features of the object, so as to obtain the training features of each object; wherein the second feature set contains more image features of the object than the first feature set does;
training the object-class model according to the training features of each object.
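The selection in claim 11 can be sketched as follows: the candidate set holds the object's features from the first feature set, the larger second feature set supplies the scores used to rank them, and the features at the set positions become training features. Feature names and scores here are invented for illustration.

```python
def select_training_features(first_set, second_set, positions=(0, 1, 2)):
    """Rank the candidate set (features from the first feature set) by
    scores recorded in the second feature set, then keep the features at
    the set ranking positions as the object's training features."""
    candidates = list(first_set)                           # candidate set
    ordered = sorted(candidates,
                     key=lambda f: second_set.get(f, 0.0),
                     reverse=True)                         # rank by 2nd set
    return [ordered[i] for i in positions if i < len(ordered)]

# the second feature set contains more of the object's features
first_set = ["edge", "color", "corner", "texture"]
second_set = {"edge": 0.7, "color": 0.9, "corner": 0.1,
              "texture": 0.4, "shape": 0.95}
feats = select_training_features(first_set, second_set)
```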
12. The man-machine interaction method according to claim 1, characterized in that the method further comprises:
taking the data input by the user for correcting the robot's parameters as the feature values and label data for the robot's learning network parameters;
updating the robot's learning network parameters according to the feature values and the label data.
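A minimal stand-in for the update in claim 12: each user correction is treated as one (feature value, label) training pair, and the "learning network" is reduced to a linear model updated by a single gradient step per correction. The learning rate and inputs are illustrative.

```python
def update_parameters(weights, feature, label, lr=0.1):
    """Take one gradient step on a linear model using a user correction
    as the (feature value, label) pair."""
    pred = sum(w * x for w, x in zip(weights, feature))
    err = pred - label                    # error against the user's label
    return [w - lr * err * x for w, x in zip(weights, feature)]

weights = [0.0, 0.0]
# the user repeatedly corrects the robot toward output 1.0 for this input
for _ in range(50):
    weights = update_parameters(weights, [1.0, 0.5], 1.0)
prediction = weights[0] * 1.0 + weights[1] * 0.5
```

After repeated corrections the model's prediction converges to the user-supplied label, which is the intended effect of the claim's update step.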
13. A human-computer interaction device, characterized by comprising:
a control information obtaining module, configured to obtain the control information conveyed by a user, the control information including voice information;
a text feature extraction module, configured to extract text features of the voice information;
a text feature vector determining module, configured to determine the text feature vector corresponding to the text features;
a speech sample determining module, configured to determine, according to a pre-trained speech classification model, the speech sample matching the text feature vector, the speech classification model indicating the probability that a text feature vector belongs to the corresponding speech sample;
a voice instruction determining module, configured to take the voice control instruction corresponding to the determined speech sample as the voice control instruction of the voice information;
a target instruction generating module, configured to generate a target control instruction according to the voice control instruction.
14. The human-computer interaction device according to claim 13, characterized in that the control information further includes gesture information, the gesture information including: a gesture position feature and a gesture posture feature extracted from an image of the user's gesture;
the device further comprises:
an adaptive interval Kalman filter processing module, configured to process the gesture position feature according to an adaptive interval Kalman filter to obtain a target gesture position feature;
an improved particle filter processing module, configured to process the gesture posture feature according to an improved particle filter to obtain a target gesture posture feature;
a gesture feature determining module, configured to fuse the target gesture position feature and the target gesture posture feature to determine the user's gesture feature;
a gesture control instruction determining module, configured to determine the gesture control instruction corresponding to the gesture feature;
wherein the target instruction generating module is configured to generate the target control instruction specifically by: generating the target control instruction according to the voice control instruction and the gesture control instruction.
15. A human-computer interaction terminal, characterized by comprising: at least one memory and at least one processor;
the memory stores a program, and the processor calls the program; the program is configured to:
obtain the control information conveyed by a user, the control information including voice information;
extract text features of the voice information;
determine the text feature vector corresponding to the text features;
determine, according to a pre-trained speech classification model, the speech sample matching the text feature vector, the speech classification model indicating the probability that a text feature vector belongs to the corresponding speech sample;
take the voice control instruction corresponding to the determined speech sample as the voice control instruction of the voice information;
generate a target control instruction according to the voice control instruction.
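The voice pipeline in claim 15 can be sketched end to end: extract text features (a bag-of-words here), form the text feature vector, pick the speech sample with the highest "belonging" score (a dot product here, standing in for the learned belonging probability), and emit that sample's control instruction. The vocabulary, samples, and command names are all invented for the example.

```python
def voice_to_command(text, sample_vectors, sample_commands, vocab):
    """Map recognized text to a control instruction via the speech-sample
    matching step of claim 15 (toy features and scores)."""
    words = text.lower().split()
    vec = [words.count(w) for w in vocab]            # text feature vector
    def score(sample_vec):                           # belonging-score stand-in
        return sum(a * b for a, b in zip(vec, sample_vec))
    best = max(sample_vectors, key=lambda name: score(sample_vectors[name]))
    return sample_commands[best]                     # matched sample's command

vocab = ["move", "stop", "forward", "back"]
samples = {"go_forward": [1, 0, 1, 0], "halt": [0, 1, 0, 0]}
commands = {"go_forward": "CMD_MOVE_FORWARD", "halt": "CMD_STOP"}
cmd = voice_to_command("please move forward now", samples, commands, vocab)
```

In the patent the matching is done by a pre-trained classification model rather than a raw dot product, but the data flow — text, feature vector, best-matching sample, that sample's instruction — is the same.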
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710408396.7A CN108986801B (en) | 2017-06-02 | 2017-06-02 | Man-machine interaction method and device and man-machine interaction terminal |
PCT/CN2018/088169 WO2018219198A1 (en) | 2017-06-02 | 2018-05-24 | Man-machine interaction method and apparatus, and man-machine interaction terminal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710408396.7A CN108986801B (en) | 2017-06-02 | 2017-06-02 | Man-machine interaction method and device and man-machine interaction terminal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108986801A true CN108986801A (en) | 2018-12-11 |
CN108986801B CN108986801B (en) | 2020-06-05 |
Family
ID=64455698
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710408396.7A Active CN108986801B (en) | 2017-06-02 | 2017-06-02 | Man-machine interaction method and device and man-machine interaction terminal |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108986801B (en) |
WO (1) | WO2018219198A1 (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109902155A (en) * | 2018-12-29 | 2019-06-18 | 清华大学 | Multi-modal dialog condition processing method, device, medium and calculating equipment |
CN110047487A (en) * | 2019-06-05 | 2019-07-23 | 广州小鹏汽车科技有限公司 | Awakening method, device, vehicle and the machine readable media of vehicle-mounted voice equipment |
CN110444204A (en) * | 2019-07-22 | 2019-11-12 | 北京艾米智能机器人科技有限公司 | A kind of offline intelligent sound control device and its control method |
CN110491390A (en) * | 2019-08-21 | 2019-11-22 | 深圳市蜗牛智能有限公司 | A kind of method of controlling switch and device |
CN111026320A (en) * | 2019-12-26 | 2020-04-17 | 腾讯科技(深圳)有限公司 | Multi-mode intelligent text processing method and device, electronic equipment and storage medium |
CN112037786A (en) * | 2020-08-31 | 2020-12-04 | 百度在线网络技术(北京)有限公司 | Voice interaction method, device, equipment and storage medium |
CN112102831A (en) * | 2020-09-15 | 2020-12-18 | 海南大学 | Cross-data, information and knowledge modal content encoding and decoding method and component |
CN112101219A (en) * | 2020-09-15 | 2020-12-18 | 济南大学 | Intention understanding method and system for elderly accompanying robot |
CN112257434A (en) * | 2019-07-02 | 2021-01-22 | Tcl集团股份有限公司 | Unmanned aerial vehicle control method, system, mobile terminal and storage medium |
CN112307974A (en) * | 2020-10-31 | 2021-02-02 | 海南大学 | User behavior content coding and decoding method of cross-data information knowledge mode |
CN112395456A (en) * | 2021-01-20 | 2021-02-23 | 北京世纪好未来教育科技有限公司 | Audio data classification method, audio data training device, audio data medium and computer equipment |
CN112527113A (en) * | 2020-12-09 | 2021-03-19 | 北京地平线信息技术有限公司 | Method and apparatus for training gesture recognition and gesture recognition network, medium, and device |
CN112908328A (en) * | 2021-02-02 | 2021-06-04 | 安通恩创信息技术(北京)有限公司 | Equipment control method, system, computer equipment and storage medium |
CN113569712A (en) * | 2021-07-23 | 2021-10-29 | 北京百度网讯科技有限公司 | Information interaction method, device, equipment and storage medium |
CN113674742A (en) * | 2021-08-18 | 2021-11-19 | 北京百度网讯科技有限公司 | Man-machine interaction method, device, equipment and storage medium |
CN113779201A (en) * | 2021-09-16 | 2021-12-10 | 北京百度网讯科技有限公司 | Method and device for recognizing instruction and voice interaction screen |
CN114490971A (en) * | 2021-12-30 | 2022-05-13 | 重庆特斯联智慧科技股份有限公司 | Robot control method and system based on man-machine conversation interaction |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111382267B (en) * | 2018-12-29 | 2023-10-10 | 深圳市优必选科技有限公司 | Question classification method, question classification device and electronic equipment |
CN111796926A (en) * | 2019-04-09 | 2020-10-20 | Oppo广东移动通信有限公司 | Instruction execution method and device, storage medium and electronic equipment |
CN111862946B (en) * | 2019-05-17 | 2024-04-19 | 北京嘀嘀无限科技发展有限公司 | Order processing method and device, electronic equipment and storage medium |
CN112527095A (en) * | 2019-09-18 | 2021-03-19 | 奇酷互联网络科技(深圳)有限公司 | Man-machine interaction method, electronic device and computer storage medium |
CN111177375B (en) * | 2019-12-16 | 2023-06-02 | 医渡云(北京)技术有限公司 | Electronic document classification method and device |
CN113726686A (en) * | 2020-05-26 | 2021-11-30 | 中兴通讯股份有限公司 | Flow identification method and device, electronic equipment and storage medium |
CN112102830B (en) * | 2020-09-14 | 2023-07-25 | 广东工业大学 | Coarse granularity instruction identification method and device |
CN112783324B (en) * | 2021-01-14 | 2023-12-01 | 科大讯飞股份有限公司 | Man-machine interaction method and device and computer storage medium |
CN115476366B (en) * | 2021-06-15 | 2024-01-09 | 北京小米移动软件有限公司 | Control method, device, control equipment and storage medium for foot robot |
CN113539243A (en) * | 2021-07-06 | 2021-10-22 | 上海商汤智能科技有限公司 | Training method of voice classification model, voice classification method and related device |
CN114166204A (en) * | 2021-12-03 | 2022-03-11 | 东软睿驰汽车技术(沈阳)有限公司 | Repositioning method and device based on semantic segmentation and electronic equipment |
CN117122859B (en) * | 2023-09-08 | 2024-03-01 | 广州普鸿信息科技服务有限公司 | Intelligent voice interaction fire-fighting guard system and method |
CN117611924B (en) * | 2024-01-17 | 2024-04-09 | 贵州大学 | Plant leaf phenotype disease classification method based on graphic subspace joint learning |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120239396A1 (en) * | 2011-03-15 | 2012-09-20 | At&T Intellectual Property I, L.P. | Multimodal remote control |
CN102945672A (en) * | 2012-09-29 | 2013-02-27 | 深圳市国华识别科技开发有限公司 | Voice control system for multimedia equipment, and voice control method |
CN104423543A (en) * | 2013-08-26 | 2015-03-18 | 联想(北京)有限公司 | Information processing method and device |
CN104656877A (zh) * | 2013-11-18 | 2015-05-27 | 李君 | Human-machine interaction method based on gesture and speech recognition control as well as apparatus and application of human-machine interaction method |
CN106095109A (en) * | 2016-06-20 | 2016-11-09 | 华南理工大学 | The method carrying out robot on-line teaching based on gesture and voice |
CN106125925A (en) * | 2016-06-20 | 2016-11-16 | 华南理工大学 | Method is arrested based on gesture and voice-operated intelligence |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9152376B2 (en) * | 2011-12-01 | 2015-10-06 | At&T Intellectual Property I, L.P. | System and method for continuous multimodal speech and gesture interaction |
CN107150347B (en) * | 2017-06-08 | 2021-03-30 | 华南理工大学 | Robot perception and understanding method based on man-machine cooperation |
2017-06-02: CN application CN201710408396.7A filed; granted as CN108986801B (status: active)
2018-05-24: PCT application PCT/CN2018/088169 filed (published as WO2018219198A1)
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109902155A (en) * | 2018-12-29 | 2019-06-18 | 清华大学 | Multi-modal dialog condition processing method, device, medium and calculating equipment |
CN110047487A (en) * | 2019-06-05 | 2019-07-23 | 广州小鹏汽车科技有限公司 | Awakening method, device, vehicle and the machine readable media of vehicle-mounted voice equipment |
CN112257434B (en) * | 2019-07-02 | 2023-09-08 | Tcl科技集团股份有限公司 | Unmanned aerial vehicle control method, unmanned aerial vehicle control system, mobile terminal and storage medium |
CN112257434A (en) * | 2019-07-02 | 2021-01-22 | Tcl集团股份有限公司 | Unmanned aerial vehicle control method, system, mobile terminal and storage medium |
CN110444204A (en) * | 2019-07-22 | 2019-11-12 | 北京艾米智能机器人科技有限公司 | A kind of offline intelligent sound control device and its control method |
CN110491390A (en) * | 2019-08-21 | 2019-11-22 | 深圳市蜗牛智能有限公司 | A kind of method of controlling switch and device |
CN111026320B (en) * | 2019-12-26 | 2022-05-27 | 腾讯科技(深圳)有限公司 | Multi-mode intelligent text processing method and device, electronic equipment and storage medium |
CN111026320A (en) * | 2019-12-26 | 2020-04-17 | 腾讯科技(深圳)有限公司 | Multi-mode intelligent text processing method and device, electronic equipment and storage medium |
CN112037786A (en) * | 2020-08-31 | 2020-12-04 | 百度在线网络技术(北京)有限公司 | Voice interaction method, device, equipment and storage medium |
CN112102831A (en) * | 2020-09-15 | 2020-12-18 | 海南大学 | Cross-data, information and knowledge modal content encoding and decoding method and component |
CN112101219A (en) * | 2020-09-15 | 2020-12-18 | 济南大学 | Intention understanding method and system for elderly accompanying robot |
CN112101219B (en) * | 2020-09-15 | 2022-11-04 | 济南大学 | Intention understanding method and system for elderly accompanying robot |
CN112307974A (en) * | 2020-10-31 | 2021-02-02 | 海南大学 | User behavior content coding and decoding method of cross-data information knowledge mode |
CN112527113A (en) * | 2020-12-09 | 2021-03-19 | 北京地平线信息技术有限公司 | Method and apparatus for training gesture recognition and gesture recognition network, medium, and device |
CN112395456B (en) * | 2021-01-20 | 2021-04-13 | 北京世纪好未来教育科技有限公司 | Audio data classification method, audio data training device, audio data medium and computer equipment |
CN112395456A (en) * | 2021-01-20 | 2021-02-23 | 北京世纪好未来教育科技有限公司 | Audio data classification method, audio data training device, audio data medium and computer equipment |
CN112908328A (en) * | 2021-02-02 | 2021-06-04 | 安通恩创信息技术(北京)有限公司 | Equipment control method, system, computer equipment and storage medium |
CN113569712A (en) * | 2021-07-23 | 2021-10-29 | 北京百度网讯科技有限公司 | Information interaction method, device, equipment and storage medium |
CN113569712B (en) * | 2021-07-23 | 2023-11-14 | 北京百度网讯科技有限公司 | Information interaction method, device, equipment and storage medium |
CN113674742A (en) * | 2021-08-18 | 2021-11-19 | 北京百度网讯科技有限公司 | Man-machine interaction method, device, equipment and storage medium |
CN113779201A (en) * | 2021-09-16 | 2021-12-10 | 北京百度网讯科技有限公司 | Method and device for recognizing instruction and voice interaction screen |
CN113779201B (en) * | 2021-09-16 | 2023-06-30 | 北京百度网讯科技有限公司 | Method and device for identifying instruction and voice interaction screen |
CN114490971A (en) * | 2021-12-30 | 2022-05-13 | 重庆特斯联智慧科技股份有限公司 | Robot control method and system based on man-machine conversation interaction |
CN114490971B (en) * | 2021-12-30 | 2024-04-05 | 重庆特斯联智慧科技股份有限公司 | Robot control method and system based on man-machine interaction |
Also Published As
Publication number | Publication date |
---|---|
WO2018219198A1 (en) | 2018-12-06 |
CN108986801B (en) | 2020-06-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108986801A (en) | A kind of man-machine interaction method, device and human-computer interaction terminal | |
Mariappan et al. | Real-time recognition of Indian sign language | |
CN110135249B (en) | Human behavior identification method based on time attention mechanism and LSTM (least Square TM) | |
CN110555481A (en) | Portrait style identification method and device and computer readable storage medium | |
CN103984416A (en) | Gesture recognition method based on acceleration sensor | |
CN110741377A (en) | Face image processing method and device, storage medium and electronic equipment | |
Santhalingam et al. | Sign language recognition analysis using multimodal data | |
CN111680550B (en) | Emotion information identification method and device, storage medium and computer equipment | |
CN111444488A (en) | Identity authentication method based on dynamic gesture | |
CN105917356A (en) | Contour-based classification of objects | |
CN111813910A (en) | Method, system, terminal device and computer storage medium for updating customer service problem | |
Neverova | Deep learning for human motion analysis | |
CN110807391A (en) | Human body posture instruction identification method for human-unmanned aerial vehicle interaction based on vision | |
Yan et al. | Human-object interaction recognition using multitask neural network | |
CN110427864B (en) | Image processing method and device and electronic equipment | |
Kumarage et al. | Real-time sign language gesture recognition using still-image comparison & motion recognition | |
CN113449548A (en) | Method and apparatus for updating object recognition model | |
CN110991279A (en) | Document image analysis and recognition method and system | |
CN111798367A (en) | Image processing method, image processing device, storage medium and electronic equipment | |
CN106599926A (en) | Expression picture pushing method and system | |
CN104635930A (en) | Information processing method and electronic device | |
Li et al. | [Retracted] Human Motion Representation and Motion Pattern Recognition Based on Complex Fuzzy Theory | |
CN111611917A (en) | Model training method, feature point detection device, feature point detection equipment and storage medium | |
CN111571567A (en) | Robot translation skill training method and device, electronic equipment and storage medium | |
CN116503654A (en) | Multimode feature fusion method for carrying out character interaction detection based on bipartite graph structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||