CN117573845B

CN117573845B - Robot natural language understanding method for cross-domain man-machine collaborative operation

Info

Publication number: CN117573845B
Application number: CN202410054169.9A
Authority: CN
Inventors: 宋伟; 袭向明; 谢冰; 张格格; 赵文宇; 朱世强; 顾建军
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2024-01-15
Filing date: 2024-01-15
Publication date: 2024-05-24
Anticipated expiration: 2044-01-15
Also published as: CN117573845A

Abstract

The invention discloses a robot natural language understanding method for cross-domain man-machine collaborative operation. The method comprises the processes of man-machine collaborative operation field/intention/slot position label and relation definition, data set construction, natural language general understanding model construction and parameter learning, natural language understanding scene model construction and parameter learning, online prediction by using the model and the like. Through defining the corresponding relation between the general slot label and the special slot label in the specific field, and the scene feature description and the character feature description, the recognition of the user intention and the recognition of the corresponding slot information are realized, and the cross-field generalization capability of the natural language understanding model is enhanced. The robot natural language understanding scene model constructed by the invention has the capability of processing multi-mode input data, and in the model parameter learning process, the measurement of model stability is increased, the accuracy of natural language understanding is effectively improved, and the false recognition rate is reduced.

Description

Robot natural language understanding method for cross-domain man-machine collaborative operation

Technical Field

The invention relates to the field of man-machine dialogue systems, in particular to a robot natural language understanding method for cross-domain man-machine collaborative operation.

Background

In order to realize cross-domain man-machine collaborative operation, a robot natural language understanding process is an important component of a robot system, and the process enables a robot to understand the intention of a user and slot position information related to the intention by analyzing text input by the user, so that the robot is used for generating man-machine dialogue states, generating dialogue strategies and other subsequent system steps, and therefore, the accuracy of a natural language understanding result has important significance on the overall reliability of the man-machine dialogue system. Common natural language understanding methods include rule/template-based methods, statistical model-based methods and deep learning-based methods, wherein the rule/template-based methods have format limitations on text instructions issued by people, are suitable for a simple man-machine conversation process, and are easy to understand failure due to the diversity of user expression modes in complex scenes, and have strong limitations. The statistical model-based method needs to calculate the conditional probability of various intentions of the dialog text of a given user on the basis of a large number of data sets collected in advance, and has certain statistical significance under the condition that the data sets are huge enough. However, in practical applications, the collection and annotation of scene data is often complex, and the construction of enough datasets to cover the application scene lacks feasibility. The deep learning-based method utilizes various deep learning models, and can simultaneously identify the intention of a user and the slot position information; however, in practical application, on one hand, the deep learning model generally needs scene data to perform fine adjustment on model parameters, which has a certain model migration cost, and on the other hand, the user intention and the slot have a certain mutual constraint relationship, and the existing method lacks of description and measurement of the constraint relationship.

Therefore, aiming at the problems of weak cross-scene generalization capability and lack of constraint relation between natural language understanding results in the existing natural language understanding method, a robot natural language understanding method with the cross-scene generalization capability is needed to solve the technical problem.

Disclosure of Invention

Aiming at the defects of the existing robot natural language understanding method, the invention provides a robot natural language understanding method for cross-domain man-machine collaborative operation, the method divides a slot label into a general slot and a special slot, and combines dialogue history and scene information to understand dialogue intention of a user from the aspects of dialogue domain, user intention type, slot type and the like, thereby improving man-machine dialogue effect and providing support for the cross-domain man-machine collaborative operation.

The invention aims at realizing the following technical scheme:

the first aspect of the invention: a robot natural language understanding method for cross-domain man-machine collaborative operation comprises the following steps:

(1) Constructing a corresponding field tag set, an intention type tag set, a general slot tag set and a special slot tag set for field recognition, intention recognition and slot filling in robot natural language understanding, and defining a corresponding relation between a general slot and a special slot;

(2) Carrying out data cleaning and feature extraction on the natural language text, labeling a field label, an intention type label, a special slot label and a general slot label corresponding to the text, describing the context where the text is located, and thus constructing and obtaining a labeled text data set;

(3) Constructing a natural language general understanding model, namely taking the processed text and characteristics in the marked text data set obtained in the step (2) as input, and taking a general slot label in the data set as output to finish general slot marking of each word element in the text data;

(4) Utilizing the marked text data set constructed in the step (2) to perform parameter learning on the natural language general understanding model constructed in the step (3);

(5) Constructing a natural language understanding scene model, namely taking the processed text and characteristics in the obtained text data set, the context description and the universal slot label of the text in the text data set as input, and taking the field label, the intention type label and the special slot label in the text data set as output to finish the natural language understanding of the text data;

(6) Utilizing the marked text data set constructed in the step (2) to perform parameter learning on the natural language understanding scene model constructed in the step (5);

(7) When the man-machine collaborative operation is performed, the robot receives a text input by a user and performs cleaning and feature extraction in the same manner as in the step (2);

(8) And (3) taking the processed user input text and characteristics obtained in the step (7) as input, and calculating the input text by using the natural language understanding scene model constructed in the step (6) to obtain corresponding field labels, intention classification labels and special slot labels.

Further, the step (2) specifically includes the following sub-steps:

(2.1) data cleaning, including unified code normalization of texts from various sources, deleting messy code words, correcting wrongly written words and grammar errors;

(2.2) extracting features, including word segmentation, part-of-speech tagging and dependency identification;

(2.3) labeling, including field label labeling and intention type label labeling at sentence level, special slot label sequence labeling and general slot label sequence labeling at word element level;

(2.4) contextual descriptions including contextual dialog history descriptions, in-place scene descriptions, and character feature descriptions.

Further, the universal slot labels of the words in the step (3) are obtained by encoding the input word sequence by using an encoding module in a natural language universal understanding model, establishing hidden space vector expression, and simultaneously predicting the universal slot labels corresponding to all the words in the sequence by using an encoding-conditional random field architecture.

Further, the natural language understanding scene model in the step (5) adopts a model structure based on a multi-task deep learning framework, wherein the model structure comprises a coding layer, a feature fusion layer and an output layer, and the features of each layer are as follows:

(4.1) the coding layer realizes the coding of the input of each mode and carries out vectorization characterization respectively;

(4.2) the feature fusion layer processes the output of the coding layer, so as to realize the fusion of the coded modal information and establish uniform information characterization;

(4.3) the output layer processes the output of the feature fusion layer to complete three subtasks, namely a field label prediction task, an intention type classification task and a special slot label prediction task; in the output layer, the input of each subtask comprises the output of the feature fusion layer and the output results of other subtasks.

Further, the text data in the step (5) includes text data of a man-machine conversation history, text data of a scene description and text data of a character feature description; the text feature includes a text dependency feature having a tree/forest like structure.

Further, the natural language understanding scene model in the step (5) has the capability of processing multi-mode input data, and a universal slot label sequence can be obtained after text data is processed by using the natural language universal understanding model.

Further, the comprehensive loss function adopted in the model parameter learning in the step (6) increases the dynamic process and the model stability of three subtasks in the output layer, and the expression of the comprehensive loss function is as follows:

；

wherein, Predicting a loss function of a task for a domain label,/>Loss function for classifying tasks for intent type,/>Predicting a loss function of a task for a special slot label,/>For the model stability metric, X is the set of all input variables involved in step (5), y= (/ >),/>,/>) For natural language understanding in step (5) the output of three subtasks of the scene model output layer, i.e./>Is a domain label,/>For intent type tag,/>Is a special slot label sequence, (x,/>) For a set of input-output sample points,/>Conditional probability for model prediction,/>Respectively, are weight constants,/>Is the model prediction result at the time (t-N).

The invention also provides a robot natural language understanding device for cross-domain man-machine collaborative operation, which comprises the following modules:

constructing a label collection module: firstly, constructing a corresponding field tag set, an intention type tag set, a general slot tag set and a special slot tag set for field recognition, intention recognition and slot filling related in robot natural language understanding, and defining a corresponding relation between a general slot and a special slot;

The first text cleaning and extracting module: performing data cleaning and feature extraction on the text, labeling a field label, an intention type label, a special slot label and a general slot label corresponding to the text, and describing the context where the text is located, thereby constructing a labeled text data set;

Constructing a natural language general understanding model module: the natural language general understanding model takes the processed text and characteristics in the text data set obtained by the first text cleaning and extracting module as input, and takes the general slot labels in the text data set as output to finish the general slot labeling of each word element in the text data;

a first parameter learning module: parameter learning is carried out on the constructed natural language general understanding model by utilizing the marked text data set constructed by the first text cleaning and extracting module;

Constructing a natural language understanding scene model module: the natural language understanding scene model takes the processed text and characteristics in the obtained text data set, the context description and the general slot label of the text in the text data set as input, and takes the field label, the intention type label and the special slot label in the text data set as output to finish the natural language understanding of the text data;

And a second parameter learning module: carrying out parameter learning on the constructed natural language understanding scene model by utilizing the marked text data set constructed by the first text cleaning and extracting module;

The second text cleaning and extracting module: when the man-machine collaborative operation is performed, the robot receives a text input by a user and performs cleaning and feature extraction in the same manner as the first text cleaning and extraction module;

The label acquisition module: and taking the processed user input text and characteristics obtained by the second text cleaning and extracting module as input, and calculating the input text by using the natural language understanding scene model constructed by the second parameter learning module to obtain corresponding field labels, intention classification labels and special slot labels.

There is also provided an electronic device including:

One or more processors;

a memory for storing one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors implement the cross-domain human-machine collaborative work oriented robot natural language understanding method.

And a computer readable storage medium having stored thereon computer instructions which when executed by a processor perform the steps of a cross-domain human-machine collaborative oriented robotic natural language understanding as described.

The beneficial effects of the invention are as follows:

The invention decomposes the robot natural language understanding process into a scene-independent general slot prediction process and a scene-dependent natural language understanding process during the field-crossing man-machine collaborative operation, wherein the former is independent of the scene and can be reused in different fields, and the latter realizes the identification of user intention and the identification of corresponding slot information by defining the corresponding relation between the general slot label and the special slot label in the specific field and the scene characteristic description and the character characteristic description, thereby enhancing the field-crossing generalization capability of the natural language understanding model.

The natural language understanding scene model constructed by the invention has the capability of processing multi-mode input data, wherein the natural language understanding scene model comprises user input text and scene/character feature description represented by text sequences, text dependency relationship features represented by tree/forest-like structures and slot label relationships represented by graph structures, thereby improving the fusion capability of the model to the data and the understanding effect of the model to the user intention.

In the natural language understanding model constructed by the invention, in the model structural design, the measurement of the relationship among subtasks such as the field to which the intention of the user belongs, the intention type, the slot position information and the like is increased, and in the model parameter learning process, the measurement of the model stability is increased, the accuracy of natural language understanding is effectively improved, and the false recognition rate is reduced.

Drawings

FIG. 1 is a diagram of a natural language understanding method offline learning process of the present invention;

FIG. 2 is a diagram of an online prediction process of the present invention;

FIG. 3 is a schematic diagram of domain, intent, and general/special slot label relationships in the method of the present invention;

FIG. 4 is a schematic diagram of a general understanding model of natural language in the method of the present invention;

fig. 5 is a schematic diagram of a natural language understanding scene model in the method of the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and the specific examples. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

The invention provides a robot natural language understanding method for cross-domain man-machine collaborative operation, which mainly comprises the following steps as shown in fig. 1 and 2, wherein the method is specifically described as follows:

(1) Aiming at the problems of domain identification, intention identification and slot filling in robot natural language understanding during cross-domain man-machine collaborative operation, a domain label set, an intention type label set, a general slot label set and a special slot label set are constructed, and corresponding relations between the domain label and the intention type label, between the intention type label and the special slot label, and between the general slot label and the special slot label are defined, as shown in fig. 3, an example of a corresponding relation is shown, and each label set can be distinguished according to the difference of the domains. Specifically:

The domain label identifies the domain to which a given natural language text belongs. For example, in fig. 3, there are 3 kinds of field tags in total, namely, fulfillment (travel), office service (office_service), and communication (communication). Given a natural language text of "take me to go to a toilet bar", if a dialogue occurs in an office scene, the attributed domain label is "office_service"; if it occurs during the journey, the attributed domain label is "travel".

The intent type tag identifies the type of intent that the speaker is to express. For example, in FIG. 3, there are 4 types of intent altogether, namely restaurant reservation (reservation), hotel reservation (hotel reservation), fetch (fetch), and send email (send_email). The text "give me a cup of plain boiled water" expresses the speaker's intent to request "fetch"; the text "help me reserve a hotel in singapore, me will check in on the open day" express the intention of the speaker to request "hotel_reserve".

The special slot tag identifies some additional information of the speaker's expression intent. For example, in fig. 3, there are 19 kinds of dedicated slot tags, i.e., arrival time (time_reserve), arrival time (time_ checkin), departure date (date_leave), arrival date (date_ arrive), guest name (name_name), title (title), number of guests (name_number), room number (room_number), room type (room_type), contact telephone (contact_test), home address (address), number (current), object (object), mail content (mail_content), mail subject (mail_subject), mail cipher address (mail_bcc_address), mail copy address (mail_cc_address), and mail transmission address (mail_to_address). Given the intent of "fetch" expressed by the text "give me a glass of plain boiled water," its "object" is "plain boiled water", "amountj" is "one glass", "to_ serve" (slot) is "me"; the text "help me reserve a hotel in singapore, me is to check in the intention of" expressed "hotel_reserve" on the open day, its "address" (slot) is "singapore", "arrive _date" (slot) is "open day", "guide_name" (slot) is "me", "room_number" (slot) is "one".

The generic slot tag identifies some specific information in the natural language text that is not related to the scene. For example, in fig. 3, there are 26 common slot tags, namely date_period (date_period), time_hour (time_hour), time_second (time_second), time interval (time_period), time duration (time_duration), month name (month_name), month number (month_number), year (year_number), day (day_number), first (first_name), last name (last_name), title (title), person reference (human_reference), number (number), unit (unit), room type (room_type), phone number (phone_number), city name (city_name), city_names (sub_ district _name), street name (mail_name), mail (postcode), object (object), text_address (mail) and mailbox address (mail). In the text of "give me a cup of plain boiled water," the "human_refer" is "me", "object" is "plain boiled water", "quality" is "unit" is "cup"; "help me reserve a hotel in singapore," i am going to live in the open, "human_refer" is "i", "quality" is "one", "unit" is "room", "city_name" is "singapore", "date_period" is "open".

Thus, in the above examples, "give me a cup of plain boiled water" and "help me reserve a hotel in singapore," i want to check in on the open day, "i'm" all have the universal slot label "human_refer" and their corresponding dedicated slot labels are "to_ serve" and "guide_name" respectively for different fields/intentions.

(2) And carrying out data cleaning and feature extraction on the text, labeling a field label, an intention type label, a special slot label and a general slot label corresponding to the text, and describing the context where the text is located, thereby constructing a labeled text data set. In particular, the method comprises the steps of,

(2.1) Data cleansing includes unified code (Unicode) normalization (since chinese text has multiple coding formats, it is required to be converted into a unified format when reading data), correcting mispronounced words, and adjusting the order of the words to conform to grammar.

(2.2) Feature extraction includes named entity recognition, word segmentation, part-of-speech tagging and dependency analysis of the text.

(2.3) Data annotation comprises: selecting proper field labels and intention type labels for natural language texts according to the specific conditions of the contexts, and labeling the whole sentence according to the field label set and the intention type label set constructed in the step (1); and (3) marking the corresponding general slot position label and special slot position label for each word element (token) in the segmented natural language text according to the general/special slot position label set constructed in the step (1).

(2.4) Text-in-context description, i.e., text description of the scene, conversation history, character feature description in which the current speaker is located, for example, one possible text-in-context description of "give me a cup of plain boiled water" is "at a company's 15-tier office area, speaker a is talking to his assistant B, and conveys an indication of work. The conversation history between them is '< conversation details >'. Speaker A is the company's manager and is mainly responsible for the work of products, personnel and the like, and assistant B assists A to complete the work of file transfer, conference and the like during the main work.

(3) And (3) constructing a natural language general understanding model, wherein the model takes the processed text and characteristics in the text data set obtained in the step (2) as input, and takes the general slot labels in the text data set obtained in the step (2) as output, so that the general slot labeling of each token (token) in the text data is realized. The constructed model based on the code-conditional random field (Encoder-CRF) architecture is shown in fig. 4, wherein a code (Encoder) module codes an input token sequence, establishes hidden space vector expression of the token sequence, and a Conditional Random Field (CRF) module predicts all universal slot labels corresponding to all tokens in the sequence simultaneously. For example, in one piece of data in the text data set obtained in the step (2), the text is "give me a cup of plain boiled water", and after word segmentation and part of speech tagging, the text is obtained and is characterized as "give/v|me/r|one/m|cup/q|plain boiled water/n", wherein "|" represents the word segmentation result, "v", "r", "m", "q", "n" respectively represent verbs, pronouns, numbers, works and nouns in the PKU tagging. After being processed by the natural language general understanding model shown in fig. 4, the obtained labeling result is as follows: "give/- |i/human_refer|one/quality|cup/unit|plain boiled water/object";

(4) And (3) utilizing the marked text data set constructed in the step (2) to perform parameter learning on the natural language general understanding model constructed in the step (3).

(5) And (3) constructing a natural language understanding scene model, wherein the model takes the text and the characteristics which are processed in the text data set obtained in the step (2), the context description of the text in the text data set obtained in the step (2), the universal slot label obtained in the step (3) as input, and the domain label, the intention type label and the special slot label in the text data set obtained in the step (2) as output, so that the natural language understanding of the text data is realized. The constructed model is shown in fig. 5, and is divided into a coding layer, a feature fusion layer and an output layer, wherein the features of each layer are as follows:

The coding layer (5.1) realizes coding of each mode input and respectively carries out vectorization characterization, wherein the input comprises text, text characteristics, text data of man-machine conversation history, text data of scene description and text data of character characteristics description, which are respectively shown in a Encoder module, a Encoder2 module, a Encoder module, a Encoder module and a Encoder module in fig. 5, and the specific steps are as follows:

The text comprises token sequences after word segmentation, corresponding part-of-speech tagging sequences, and general slot tag sequences obtained by a natural language general understanding model, for example,

"Give/v/- |I/r/human_reference|one/m/quality|cup/q/unit|plain boiled water/n/object";

the text features, including text dependency features with tree/forest like structures, as shown in table 1,

As an example of a text dependency feature with a tree structure in Table 1, where Stanford notation is employed, "-" indicates irrelevant, "dobj" indicates a direct object, "nummod" indicates a number modifier, "clf" indicates a category modifier, "range" indicates a number indirect object.

TABLE 1

Text data of the man-machine conversation history, for example, "conversation history between them is' < a: xxx. B: xxx. A: xxx. B: xxx >' ".

Text data of the scene description, e.g. "at a layer 15 office of a company, speaker a is talking to his assistant B and communicates a work instruction".

The text data of the character feature description, for example, "speaker A is the total manager of the company, is mainly responsible for the work of products, personnel and the like, and the assistant B assists A to complete the work of file transfer, conference and the like during the main work.

And (5.2) the feature fusion layer processes the output of the coding layer to realize the fusion of the coded modal information and establish uniform information characterization, as shown in a fusion model module in fig. 5.

And (5.3) the output layer processes the output of the feature fusion layer to complete three subtasks, namely a field label prediction task, an intention type classification task and a special slot label prediction task, as shown in a field label prediction model-Decoder 1-field label module, an intention type classification model-Decoder 2-intention type module and a special slot label prediction model-Decoder 3-special slot virtual column module in fig. 5. In the output layer, the input of each subtask comprises the output of the feature fusion layer and the output results of other subtasks. For example, the number of the cells to be processed,

Domain label prediction results: "office_service";

Intent type tag prediction results: "fetch";

special slot label prediction result: "give/o I/to_ serve I/quality cup/unit plain boiled water/object".

(6) And (3) utilizing the marked text data set obtained in the step (2) to perform parameter learning on the natural language understanding scene model constructed in the step (5). In python, a model structure is built using pytorch, the training dataset is randomly divided into training set T, test set P and validation set V, and parameter learning is performed using the training algorithm provided in pytorch, where the loss function used is as follows:

；

Wherein X is the set of all input variables involved in step (5), y= ("a") ,/>,/>) For natural language understanding in step (5) the output of three subtasks of the scene model output layer, i.e./>Is a domain label,/>For intent type tag,/>Is a special slot label sequence, (x,/>) For a set of input-output sample points,/>Conditional probability for model prediction,/>The model output layer three subtasks loss function and model prediction steady state error function,Respectively, are weight constants,/>Is the model prediction result at the time (t-N).

It is easy to see that, besides comparing the output result of the natural language scene model with the true value in the sample point, the loss function measures the variation trend of the model prediction result in the time dimension, because in the model output layer, the output of each subtask is the input of other subtasks, which is equivalent to the establishment of a feedback system, so that the model becomes a dynamic system, and the steady state error of the model needs to be reduced on the premise of ensuring the stability of the model, namely the fluctuation of the model output along with the time variation is smaller.

(7) Receiving user input text, and performing the same cleaning and feature extraction as in step (2). For example, with new data "help me reserve a hotel in singapore, me will check in on tomorrow", the results after processing are as follows:

part of speech tagging: "help/v I/r I reserve/v I/m I/q I Singapore/ns I/u I hotel/n I/w I/r I want/v I tomorrow/t I check in/v";

the dependencies are analyzed as shown in Table 2:

Table 2 has a text dependency feature example of a tree structure, where Stanford notation, "-" indicates irrelevant, "dobj" indicates a direct object, "nummod" indicates a word modifier, "clf" indicates a category modifier, "dep" indicates other dependencies, "punct" indicates punctuation, "assm" indicates an associated marker, "assmod" indicates an associated modifier, "nsubj" indicates a noun subject, "mmod" indicates a morbid verb, "tmod" indicates a time modifier.

TABLE 2

Text data of man-machine conversation history: "NULL";

Text data of scene description: "tomorrow is the first day of the vomit vacation, user A is discussing the vomit vacation iting with user B, they are planning to go along;

Text data of character feature description: "user A and user B are couples, their ages are 25 years and 27 years, respectively, and they both like food and a hot place.

(8) And (3) taking the processed text and characteristics input by the user obtained in the step (7) as input, and processing the natural language scene understanding model obtained in the step (6) to obtain corresponding field labels, intention classification labels and special slot labels. For example, the predicted results are as follows:

General slot label prediction result: "help/-H/prescribed/-reservation /) I one-I/human_refer I/o I tomorrow/date_period|in/-";

domain label prediction results: "travel";

intent type tag prediction results: "hotel_reserve";

The invention provides an electronic device, comprising:

One or more processors;

a memory for storing one or more programs;

When the one or more programs are executed by the one or more processors, the one or more processors implement a cross-domain human-machine collaborative work oriented robotic natural language understanding method as described.

There is also provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of a cross-domain human-machine collaborative oriented robotic natural language understanding method as described.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather to enable any modification, equivalent replacement, improvement or the like to be made within the spirit and principles of the invention.

Claims

1. A robot natural language understanding method for cross-domain man-machine collaborative operation is characterized by comprising the following steps:

(3) Constructing a natural language general understanding model, namely taking the processed text and characteristics in the marked text data set obtained in the step (2) as input, and taking a general slot label in the data set as output to finish marking general slots of each word element in the text data;

(5) Constructing a natural language understanding scene model, namely taking the processed text and characteristics in the obtained text data set, the context description and the general slot position label of the text in the text data set as input, and taking the field label, the intention type label and the special slot position label in the text data set as output to finish the natural language understanding of the text data;

(8) And (3) taking the processed user input text and characteristics obtained in the step (7) as input, and calculating the input text by using the natural language understanding scene model constructed in the step (6) to obtain corresponding field labels, intention type labels and special slot position labels.

2. The method for understanding the natural language of the robot for cross-domain man-machine collaborative operation according to claim 1, wherein the step (2) specifically comprises the following sub-steps:

3. The robot natural language understanding method for cross-domain man-machine collaborative operation according to claim 1, wherein the labeling of the universal slots of each word element in the step (3) is to encode the input word element sequence by using an encoding module in a natural language universal understanding model, establish the hidden space vector expression thereof, and simultaneously predict the universal slot labels corresponding to all word elements in the sequence by using an encoding-conditional random field architecture.

4. The method for understanding natural language of robot for cross-domain man-machine collaborative operation according to claim 1, wherein the natural language understanding scene model in step (5) adopts a model structure based on a multi-task deep learning frame, and comprises a coding layer, a feature fusion layer and an output layer, wherein the features of each layer are as follows:

5. The cross-domain human-machine collaborative operation-oriented robot natural language understanding method according to claim 1, wherein the text data in step (5) includes text data of a human-machine conversation history, text data of a scene description, and text data of a character feature description; the text feature includes a text dependency feature having a tree/forest like structure.

6. The robot natural language understanding method for cross-domain man-machine collaborative operation according to claim 1, wherein the natural language understanding scene model in the step (5) has the capability of processing multi-mode input data, and a universal slot label sequence can be obtained after text data is processed by using a natural language universal understanding model.

7. The method for understanding natural language of robot for cross-domain man-machine collaborative operation according to claim 1, wherein the comprehensive loss function adopted in the model parameter learning in the step (6) increases the dynamic process and model stability for three subtasks in the output layer, and the expression of the comprehensive loss function is as follows:

L(y,P(Y＝y|X＝x))

＝αL₁(y_d,P(Y_d＝y_d|X＝x))+βL₂(y_i,P(Y_i＝y_i|X＝x))+γL₃(y_s,P(Y_s＝y_s|X＝x))

+μL₄(P(Y＝y^(t)|X＝x),...,P(Y＝y^(t-N)|X＝x))

Wherein L ₁ is a loss function of the domain label prediction task, L ₂ is a loss function of the intent type classification task, L ₃ is a loss function of the special slot label prediction task, L ₄ is a model stability metric, X is a set of all input variables involved in step (5), y= (Y _d,Y_i,Y_s) is an output of three subtasks of the natural language understanding scene model output layer in step (5), i.e., Y _d is a domain label, Y _i is an intent type label, Y _s is a special slot label sequence, (X, Y) is a set of input-output sample points, P (y=y|x=x) is a conditional probability of model prediction, α, β, γ, μ are weight constants, respectively, and Y ^(t-N) is a model prediction result at (t-N) time.

8. The robot natural language understanding device for cross-domain man-machine collaborative operation is characterized by comprising the following modules:

constructing a natural language general understanding model module: the natural language general understanding model takes the processed text and characteristics in the text data set obtained by the first text cleaning and extracting module as input, and takes the general slot labels in the text data set as output, so as to finish the labeling of the general slots of each word element in the text data;

Constructing a natural language understanding scene model module: the natural language understanding scene model takes the processed text and characteristics in the obtained text data set, the context description and the general slot position label of the text in the text data set as input, and the field label, the intention type label and the special slot position label in the text data set as output to finish the natural language understanding of the text data;

The label acquisition module: and taking the processed user input text and characteristics obtained by the second text cleaning and extracting module as input, and calculating the input text by using a natural language understanding scene model constructed by the second parameter learning module to obtain corresponding field labels, intention type labels and special slot labels.

9. An electronic device, comprising:

One or more processors;

a memory for storing one or more programs;

The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.

10. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method of any of claims 1-7.