CN111178062A

CN111178062A - Man-machine interaction multi-turn dialogue corpus oriented acceleration labeling method and device

Info

Publication number: CN111178062A
Application number: CN201911212568.9A
Authority: CN
Inventors: 王星光; 陈�峰
Original assignee: Unisound Intelligent Technology Co Ltd
Current assignee: Unisound Intelligent Technology Co Ltd
Priority date: 2019-12-02
Filing date: 2019-12-02
Publication date: 2020-05-19
Anticipated expiration: 2039-12-02
Also published as: CN111178062B

Abstract

The invention discloses an accelerated annotation method and device for multi-turn dialogue corpora oriented to human-computer interaction. The method comprises the following steps: acquiring a user utterance to be annotated and the context of the user utterance to be annotated; performing literal similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score; performing semantic similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first semantic similarity score; performing literal similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score; performing semantic similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second semantic similarity score; and determining candidate recommended labels according to the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score. The technical scheme of the invention reduces annotation errors and increases annotation speed.

Description

Man-machine interaction multi-turn dialogue corpus oriented acceleration labeling method and device

Technical Field

The invention relates to the technical field of computers and information services, in particular to an accelerating labeling method and device for multi-turn dialogue corpora oriented to human-computer interaction.

Background

In human-computer interaction scenarios such as intelligent customer service and early-education machines for children, system logs contain a large number of multi-turn dialogue corpora. Recognizing the conversation behavior (dialogue act, DA) of each utterance in a dialogue corpus plays a key role in understanding the user's true intent. A conversation behavior describes the semantic and interactive function of a user utterance within a conversation. The traditional way of processing such corpora is manual annotation: user utterances are manually labeled with predefined conversation behaviors, and the labeled data then drive machine learning techniques to learn the real intent of user utterances.

Manual annotation has two problems. On the one hand, a multi-turn dialogue corpus contains many types of conversation behaviors; on the other hand, the true intent of a user utterance often becomes clear only in light of its context. Together, these two problems make annotating multi-turn dialogues time-consuming and laborious for the annotator, and also prone to annotation errors.

Disclosure of Invention

The invention provides an accelerating labeling method and device for multi-turn dialogue corpora oriented to human-computer interaction. The technical scheme is as follows:

according to a first aspect of the embodiments of the present invention, there is provided a method for accelerating annotation of a multi-turn dialog corpus facing human-computer interaction, including:

acquiring a user utterance to be annotated and a context of the user utterance to be annotated;

performing literal similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score;

semantic similarity calculation is carried out on the user utterance to be annotated and the conversation behavior to obtain a first semantic similarity score;

performing literal similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score;

performing semantic similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second semantic similarity score;

and determining candidate recommended labels according to the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score.

In one embodiment, the performing a literal similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score includes:

performing word segmentation on the user utterance to be annotated to acquire first n-gram information;

determining a first query word representation according to the first n-gram information;

acquiring a user utterance marked in the corpus;

retrieving the annotated user utterances through a first preset model to obtain a preset number of annotated user utterances ranked most similar to the user utterance to be annotated, together with a first similarity;

and calculating the first similarity through a first preset algorithm to obtain the first literal similarity score.

In one embodiment, the semantic similarity calculation of the user utterance to be annotated and the conversation behavior to obtain a first semantic similarity score includes:

calculating the user utterance to be annotated through a pre-trained unsupervised language model to obtain a first sentence semantic vector;

acquiring a first preset sentence semantic vector of the conversation behavior in the corpus;

and calculating the first preset sentence semantic vector and the first sentence semantic vector through a second preset algorithm to obtain the first semantic similarity score.

In one embodiment, the performing a literal similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score includes:

performing word segmentation on the context of the user utterance to be annotated to acquire second n-gram information;

determining a second query word representation according to the second n-gram information;

obtaining context texts marked with user utterances in the corpus;

retrieving the context texts of the annotated user utterances through a second preset model to obtain a preset number of context texts of annotated user utterances ranked most similar to the context of the user utterance to be annotated, together with a second similarity;

and calculating the second similarity through a third preset algorithm to obtain a second literal similarity score.

In one embodiment, the performing semantic similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second semantic similarity score includes:

calculating the context of the user utterance to be annotated through the pre-trained unsupervised language model to obtain a second sentence semantic vector;

acquiring a second preset sentence semantic vector of the conversation behavior in the corpus;

and calculating the second preset sentence semantic vector and the second sentence semantic vector through a fourth preset algorithm to obtain the second semantic similarity score.

In one embodiment, the determining candidate recommended labels according to the first literal similarity score, the first semantic similarity score, the second literal similarity score, and the second semantic similarity score includes:

calculating the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score through a fifth preset algorithm to obtain a score for each conversation behavior;

and sorting a preset number of conversation behaviors in descending order of score to obtain the candidate recommended labels corresponding to the user utterance to be annotated.

According to a second aspect of the embodiments of the present invention, there is provided a device for accelerating annotation of multiple rounds of dialog corpus facing human-computer interaction, including:

the acquisition module is used for acquiring a user utterance to be annotated and the context of the user utterance to be annotated;

the first calculation module is used for performing literal similarity calculation and semantic similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score and a first semantic similarity score;

the second calculation module is used for performing literal similarity calculation and semantic similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score and a second semantic similarity score;

and the determining module is used for determining candidate recommendation labels according to the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score.

In one embodiment, the first calculation module includes:

the first obtaining submodule is used for performing word segmentation on the user utterance to be annotated so as to obtain first n-gram information;

the first determining submodule is used for determining a first query term representation according to the first n-gram information;

the second obtaining submodule is used for obtaining the annotated user utterances in the corpus;

the first retrieval submodule is used for retrieving the annotated user utterances through a first preset model so as to obtain a preset number of annotated user utterances ranked most similar to the user utterance to be annotated, together with a first similarity;

and the first calculating submodule is used for calculating the first similarity through a first preset algorithm so as to obtain the first literal similarity score.

The second calculation submodule is used for calculating the user utterance to be annotated through a pre-trained unsupervised language model so as to obtain a first sentence semantic vector;

a third obtaining submodule, configured to obtain a semantic vector of a first preset sentence of the conversation behavior in the corpus;

and the third calculation submodule is used for calculating the semantic vector of the first preset sentence and the semantic vector of the first sentence through a second preset algorithm so as to obtain the score of the first semantic similarity.

In one embodiment, the second calculation module includes:

the fourth obtaining submodule is used for carrying out word segmentation on the context of the user utterance to be annotated so as to obtain second n-gram information;

a second determining submodule, configured to determine a second query term representation according to the second n-gram information;

a fifth obtaining submodule, configured to obtain a context text to which the user utterance is tagged in the corpus;

the second retrieval submodule is used for retrieving the context texts of the marked user utterances through a second preset model so as to obtain the context texts and second similarity of a preset number of marked user utterances with the highest context similarity ranking with the user utterances to be marked;

and the fourth calculating submodule is used for calculating the second similarity through a third preset algorithm so as to obtain the second literal similarity score.

A fifth calculation submodule, configured to calculate, through the pre-trained unsupervised language model, a context of the user utterance to be annotated to obtain a second sentence semantic vector;

a sixth obtaining submodule, configured to obtain a second preset sentence semantic vector of the conversation behavior in the corpus;

and the sixth calculating submodule is used for calculating the second preset sentence semantic vector and the second sentence semantic vector through a fourth preset algorithm so as to obtain the second semantic similarity score.

In one embodiment, the determining module includes:

the seventh calculation submodule is used for calculating the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity through a fifth preset algorithm so as to obtain a score of a conversation behavior;

and the arrangement submodule is used for sorting a preset number of conversation behaviors in descending order of score so as to obtain the candidate recommended labels corresponding to the user utterance to be annotated.

The technical scheme provided by the embodiment of the invention can have the following beneficial effects:

Literal similarity calculation and semantic similarity calculation are performed on the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score and a first semantic similarity score, and the same two calculations are performed on the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score and a second semantic similarity score. Candidate recommended labels for the user utterance to be annotated are then determined from these four scores. The candidate recommended labels give the annotator a short list of label options for the utterance, which assists the annotator's judgment, avoids the labor and time otherwise wasted on choosing among many types of conversation behaviors, and improves annotation efficiency. In addition, because the context of the user utterance to be annotated and the semantic similarity scores are combined directly when the candidate recommended labels are determined, the accuracy of the labels can be significantly improved.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a flowchart illustrating an accelerated annotation method for multi-turn interactive corpora according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating an accelerating annotation method for multi-turn interactive corpora according to an embodiment of the present invention;

FIG. 3 is a block diagram of an accelerating annotation device for multi-turn interactive corpora according to an embodiment of the present invention;

FIG. 4 is a block diagram of an accelerating annotation device for multi-turn interactive corpora according to an embodiment of the invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

Fig. 1 is a flowchart illustrating a method for accelerating annotation of multi-turn dialog corpus according to an embodiment of the present invention, as shown in fig. 1, the method can be implemented as the following steps S11-S16:

in step S11, a user utterance to be annotated and a context of the user utterance to be annotated are acquired;

wherein the context of the user utterance to be annotated is the text obtained by concatenating the utterances before and after the user utterance to be annotated, denoted context_u, and the user utterance to be annotated is denoted utter_u. The context is taken to carry the same or a similar meaning as the user utterance itself; this rests on the distributional hypothesis, i.e., if two utterances are similar, their contexts are also similar.

In step S12, performing literal similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score;

conversational behaviors include, but are not limited to: welcome, thank you, farewell, question, etc., and the session behavior is noted da.

In step S13, performing semantic similarity calculation on the utterance of the user to be annotated and the conversation behavior to obtain a first semantic similarity score;

in step S14, performing literal similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score;

in step S15, performing semantic similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second semantic similarity score;

in step S16, a candidate recommendation label is determined according to the first literal similarity score, the first semantic similarity score, the second literal similarity score, and the second semantic similarity score.

It should be noted that, when the annotator has not yet performed any annotation, the steps S11-S16 are not performed, and when the annotator has annotated a small amount of corpus, the steps S11-S16 are performed.

As shown in FIG. 2, in one embodiment, the above step S12 can be implemented as the following steps S21-S25:

in step S21, performing word segmentation on the user utterance to be annotated to obtain first n-gram information; the first n-gram information is updated into the corpus.

In step S22, determining a first query term representation according to the first n-gram information;

for example, with n ∈ {1, 2} and the user utterance to be annotated segmented as {I / today / no money / repayment}, the query term features determined from the utterance are {I, today, no money, repayment, I_today, today_no money, no money_repayment}; when n takes only the value 1, the query term features reduce to the traditional bag-of-words model.
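The construction of the query term features in the example above can be sketched as follows. This is a minimal illustration, assuming the utterance has already been segmented; the function name `ngram_query_terms` and the `_` separator argument are hypothetical and chosen only to reproduce the example.

```python
from typing import List, Sequence


def ngram_query_terms(tokens: Sequence[str],
                      n_values: Sequence[int] = (1, 2),
                      sep: str = "_") -> List[str]:
    """Build query terms from segmented tokens for every n in n_values."""
    terms: List[str] = []
    for n in n_values:
        # Slide a window of n consecutive tokens over the utterance.
        for i in range(len(tokens) - n + 1):
            terms.append(sep.join(tokens[i:i + n]))
    return terms


# Segmented utterance from the example in the text.
tokens = ["I", "today", "no money", "repayment"]
print(ngram_query_terms(tokens))
```

With `n_values=(1,)` the function returns the tokens unchanged, which corresponds to the bag-of-words case mentioned above.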

In step S23, the annotated user utterances in the corpus are acquired.

In step S24, retrieving the labeled user utterance through a first preset model to obtain a preset number of labeled user utterances with the highest similarity to the user utterance to be labeled and a first similarity;

the first preset model is as follows:
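The formula is not reproduced in this text. The symbols defined in the next paragraph (tf, |D|, avgdl, N, n(q_i), k, b) coincide with those of the standard Okapi BM25 ranking function, so, as an assumption rather than a statement of the formula actually filed, the model may take the following form:

```latex
\mathrm{sim}_{\mathrm{text}}(\mathit{utter}_u, D)
  = \sum_{i} \mathrm{IDF}(q_i)\,
    \frac{\mathrm{tf}(q_i, D)\,(k + 1)}
         {\mathrm{tf}(q_i, D) + k\left(1 - b + b\,\dfrac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q_i) = \log\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}
```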

wherein D represents an annotated user utterance in the corpus; sim_text is the first similarity between utter_u and D; q_i is an n-gram term of utter_u; tf(q_i, D) is the frequency of q_i in D; |D| is the number of n-gram terms contained in D; avgdl is the average number of n-gram terms per user utterance in the corpus; N is the total number of user utterances in the corpus; n(q_i) is the number of user utterances in the corpus that contain q_i; and k and b are model parameters that can be set freely as required.
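A retrieval step of this kind can be sketched in a few lines. The sketch below assumes the standard Okapi BM25 form, which matches the symbols defined above but is not confirmed by this text; the function name `bm25_top_k` and the default values k = 1.2 and b = 0.75 are illustrative assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def bm25_top_k(query_terms: Sequence[str],
               docs: Sequence[Sequence[str]],
               k: float = 1.2,
               b: float = 0.75,
               top_k: int = 3) -> List[Tuple[int, float]]:
    """Return (doc_index, sim_text) pairs for the top_k documents, best first."""
    if not docs:
        return []
    n_docs = len(docs)                                    # N
    avgdl = sum(len(d) for d in docs) / n_docs or 1.0     # avgdl (guard against 0)
    df = Counter(t for d in docs for t in set(d))         # n(q_i)
    scored: List[Tuple[int, float]] = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        norm = k * (1 - b + b * len(doc) / avgdl)         # length normalisation
        sim = 0.0
        for q in query_terms:
            if tf[q] == 0:
                continue
            # Classic IDF; it can be negative for terms in over half the docs.
            idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5))
            sim += idf * tf[q] * (k + 1) / (tf[q] + norm)
        scored.append((idx, sim))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy corpus of annotated utterances, each already reduced to n-gram terms.
docs = [["I", "today", "no money", "repayment"], ["thank", "you"],
        ["goodbye"], ["when", "repayment"], ["hello"]]
print(bm25_top_k(["I", "no money", "repayment"], docs, top_k=2))
```

Each document here stands for one annotated utterance D; the returned similarities are the inputs to the first preset algorithm.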

In step S25, calculating a first similarity by a first preset algorithm to obtain a first literal similarity score;

the first literal similarity score may be expressed as score₁(utter_u, da); the first preset algorithm is as follows:

wherein, a plurality of D may exist in the corpus and are marked as the same conversation behavior da.
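The formula of the first preset algorithm is not reproduced in this text. Since several retrieved utterances D may share the same label da, one plausible reading, assumed here purely for illustration, is to sum the first similarities of the retrieved utterances per label. The function name `literal_scores_by_label` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def literal_scores_by_label(hits: Sequence[Tuple[int, float]],
                            labels: Sequence[str]) -> Dict[str, float]:
    """Aggregate retrieval similarities into one score per conversation behavior.

    hits   -- (utterance_index, similarity) pairs from the retrieval step
    labels -- labels[i] is the conversation behavior of annotated utterance i
    """
    scores: Dict[str, float] = defaultdict(float)
    for idx, sim in hits:
        # ASSUMPTION: similarities of utterances with the same label are summed.
        scores[labels[idx]] += sim
    return dict(scores)


labels = ["question", "thanks", "farewell", "question"]
print(literal_scores_by_label([(0, 1.5), (3, 0.25), (1, 0.5)], labels))
```

Other aggregations, such as taking the maximum or the mean per label, would fit the same description; the choice affects how strongly frequent labels are favored.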

Retrieving the annotated user utterances through the first preset model yields the preset number of annotated user utterances ranked most similar to the user utterance to be annotated, without any manual retrieval by the annotator, which reduces the annotator's workload and improves working efficiency.

In one embodiment, the step S13 can be implemented as steps including:

The pre-trained unsupervised language model used to compute the first sentence semantic vector includes, but is not limited to, word2vec, ELMo, BERT and ERNIE. For example, using the ELMo model, the semantic vector of each term in utter_u can be obtained, and summing these term vectors gives the first sentence semantic vector, which can be expressed as vec(utter_u).

The first preset sentence semantic vector of the conversation behavior is expressed as vec(D_da): the set D_da of user utterances labeled da in the corpus is obtained, the sentence semantic vector of each user utterance in the set is calculated, and these vectors are averaged to obtain vec(D_da).

Calculating the first preset sentence semantic vector and the first sentence semantic vector through a second preset algorithm to obtain the first semantic similarity score;

the first semantic similarity score may be expressed as score₂(utter_u, da); the second preset algorithm is as follows:

wherein D_da represents the set of user utterances in the corpus labeled da.
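The vector operations described for this step can be sketched as follows. Summing term vectors into vec(utter_u) and averaging sentence vectors into vec(D_da) follow the text; the use of cosine similarity as the second preset algorithm is an assumption, because the formula is not reproduced here. Term vectors are passed in as plain lists, standing in for the output of whichever pre-trained model is used.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sentence_vector(term_vectors: Sequence[Vector]) -> Vector:
    """vec(utter_u): element-wise sum of the term vectors of one utterance."""
    return [sum(column) for column in zip(*term_vectors)]


def label_centroid(sentence_vectors: Sequence[Vector]) -> Vector:
    """vec(D_da): element-wise mean of the sentence vectors labeled da."""
    count = len(sentence_vectors)
    return [sum(column) / count for column in zip(*sentence_vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """ASSUMED second preset algorithm: cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


# Toy two-dimensional vectors in place of real model outputs.
vec_utter = sentence_vector([[1.0, 0.0], [0.0, 2.0]])
vec_da = label_centroid([[1.0, 2.0], [3.0, 4.0]])
score_2 = cosine(vec_utter, vec_da)
```

Because vec(D_da) depends only on the annotated corpus, it can be recomputed whenever new annotations arrive rather than once per query.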

Calculating the first preset sentence semantic vector and the first sentence semantic vector through the second preset algorithm yields the semantic similarity between the user utterance to be annotated and the conversation behavior; computing it algorithmically makes the similarity result more reliable.

In one embodiment, the step S14 can be implemented as steps including:

performing word segmentation on the context of the user utterance to be annotated to acquire second n-gram information; and updating the second n-gram information into the corpus.

obtaining context texts marked with user utterances in the corpus;

the second preset model is as follows:

wherein D' represents the context text of an annotated user utterance in the corpus; sim_text is the second similarity between context_u and D'; q_i is an n-gram term of context_u; tf(q_i, D') is the frequency of q_i in D'; |D'| is the number of n-gram terms contained in D'; avgdl is the average number of n-gram terms per context text in the corpus; N' is the total number of context texts in the corpus; n(q_i) is the number of context texts in the corpus that contain q_i; and k and b are model parameters that can be set freely as required.

Calculating the second similarity through a third preset algorithm to obtain a second literal similarity score;

the second literal similarity score may be expressed as score₃(context_u, da); the third preset algorithm is as follows:

wherein there may be multiple D' related user utterances in the corpus annotated as the same conversational behavior da.

Retrieving the context texts of the annotated user utterances through the retrieval model yields the preset number of annotated user utterances whose contexts rank most similar to the context of the user utterance to be annotated. No manual retrieval is needed, which further reduces labor cost, and the retrieved annotated user utterances are highly accurate. Calculating the second similarity through the third preset algorithm then yields the literal similarity between the context of the user utterance to be annotated and the conversation behavior, so that the literal similarity obtained is more accurate.

In one embodiment, the step S15 can be implemented as steps including:

calculating the context of the user utterance to be annotated through the pre-trained unsupervised language model to obtain a second sentence semantic vector, which may be denoted vec(context_u).

The second preset sentence semantic vector of the conversation behavior is expressed as vec(D'_da): the set D'_da of contexts of the user utterances labeled da in the corpus is obtained, the sentence semantic vector of each context text in the set is calculated, and these vectors are averaged to obtain vec(D'_da).

The second semantic similarity score may be expressed as score₄(context_u, da); the fourth preset algorithm is as follows:

the similarity between the context of the words of the user to be annotated and the conversation behavior can be obtained by calculating the second preset sentence semantic vector and the second sentence semantic vector through a fourth preset algorithm, so that the annotator does not need to spend a large amount of time for comparing the context with the conversation behavior, and the time spent by the annotator for determining the real intention of the user in the context can be reduced.

In one embodiment, the step S16 can be implemented as steps including:

the fifth preset algorithm is as follows:

score(da) = score₁(utter_u, da) + score₂(utter_u, da) + score₃(context_u, da) + score₄(context_u, da)

Because the conversation behaviors match the user utterance to be annotated to different degrees, sorting a preset number of them in descending order of score yields candidate recommended labels that accurately and intuitively assist the annotator and reduce the errors made during annotation.
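The fifth preset algorithm and the subsequent ranking can be sketched as follows. The unweighted sum follows the formula above; treating a conversation behavior that is missing from one of the four score tables as scoring 0 there, and breaking ties alphabetically, are assumptions made for this illustration. The function name `recommend_labels` is hypothetical.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, float]


def recommend_labels(score1: Scores, score2: Scores,
                     score3: Scores, score4: Scores,
                     top_n: int = 3) -> List[Tuple[str, float]]:
    """Sum the four scores per conversation behavior and rank descending."""
    tables = (score1, score2, score3, score4)
    candidates = set().union(*tables)
    # ASSUMPTION: a label absent from a table contributes 0 for that score.
    total = {da: sum(t.get(da, 0.0) for t in tables) for da in candidates}
    # Highest score first; ties broken by label name for determinism.
    ranked = sorted(total.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


print(recommend_labels(
    {"question": 1.5, "thanks": 0.5},
    {"question": 0.5, "thanks": 0.25, "farewell": 0.25},
    {"question": 1.0},
    {"farewell": 0.5},
    top_n=2,
))
```

Since the literal scores and the semantic scores may lie on different scales, an implementation might normalize or weight them before summing; the text itself specifies only the plain sum.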

For the above mentioned accelerating annotation method for multi-turn dialog corpus in human-computer interaction provided by the embodiment of the present invention, the embodiment of the present invention further provides an accelerating annotation device for multi-turn dialog corpus in human-computer interaction, as shown in fig. 3, the device includes:

an obtaining module 31, configured to obtain a user utterance to be annotated and a context of the user utterance to be annotated;

a first calculating module 32, configured to perform literal similarity calculation and semantic similarity calculation on the user utterance to be annotated and the session behavior to obtain a first literal similarity score and a first semantic similarity score;

a second calculating module 33, configured to perform literal similarity calculation and semantic similarity calculation on the context of the to-be-annotated user utterance and the conversation behavior to obtain a second literal similarity score and a second semantic similarity score;

and the determining module 34 is configured to determine a candidate recommendation label according to the first literal similarity score, the first semantic similarity score, the second literal similarity score, and the second semantic similarity score.
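The division of the device into modules 31 to 34 can be pictured as a small pipeline. The sketch below is only one way to wire such modules together; the class name `AnnotationAccelerator`, the callable signatures and the stub functions in the usage example are hypothetical and not taken from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]


@dataclass
class AnnotationAccelerator:
    # Obtaining module 31: returns (utterance, context) for an utterance id.
    acquire: Callable[[int], Tuple[str, str]]
    # First calculating module 32: (first literal, first semantic) scores.
    score_utterance: Callable[[str], Tuple[Scores, Scores]]
    # Second calculating module 33: (second literal, second semantic) scores.
    score_context: Callable[[str], Tuple[Scores, Scores]]
    # Determining module 34: candidate recommended labels from four scores.
    determine: Callable[[Scores, Scores, Scores, Scores], List[str]]

    def recommend(self, utterance_id: int) -> List[str]:
        utterance, context = self.acquire(utterance_id)
        s1, s2 = self.score_utterance(utterance)
        s3, s4 = self.score_context(context)
        return self.determine(s1, s2, s3, s4)


# Usage with stub modules standing in for real implementations.
device = AnnotationAccelerator(
    acquire=lambda i: ("no money today", "previous and next turns"),
    score_utterance=lambda u: ({"question": 1.0}, {"question": 0.5}),
    score_context=lambda c: ({"refusal": 2.0}, {"refusal": 0.5}),
    determine=lambda a, b, c, d: sorted(
        set(a) | set(b) | set(c) | set(d),
        key=lambda da: -(a.get(da, 0) + b.get(da, 0) + c.get(da, 0) + d.get(da, 0)),
    ),
)
print(device.recommend(0))
```

Keeping each module behind a plain callable makes it possible to swap the retrieval model or the language model without touching the rest of the pipeline.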

As shown in fig. 4, in one embodiment, the first calculation module 32 may include:

the first obtaining submodule 321 is configured to perform word segmentation on the user utterance to be annotated to obtain first n-gram information;

a first determining submodule 322, configured to determine a first query term representation according to the first n-gram information;

a second obtaining submodule 323, configured to obtain a user utterance labeled in a corpus;

a first retrieval sub-module 324, configured to retrieve the labeled user utterances through a first preset model, so as to obtain a preset number of labeled user utterances with a highest similarity to the user utterance to be labeled and a first similarity;

the first calculating submodule 325 is configured to calculate the first similarity through a first preset algorithm to obtain the first literal similarity score.

The second computation submodule 326 is configured to compute the to-be-annotated user utterance through a pre-trained unsupervised language model to obtain a first sentence semantic vector;

a third obtaining submodule 327, configured to obtain a first preset sentence semantic vector of the conversation behavior in the corpus;

a third calculating submodule 328, configured to calculate the first preset sentence semantic vector and the first sentence semantic vector through a second preset algorithm, so as to obtain the first semantic similarity score.

In one embodiment, the second calculation module includes:

In one embodiment, the determining module includes:

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A man-machine interaction multi-turn dialogue corpus oriented acceleration labeling method is characterized by comprising the following steps:

2. The method of claim 1, wherein said performing a literal similarity calculation for said user utterance to be annotated and session behavior to obtain a first literal similarity score comprises:

acquiring a user utterance marked in the corpus;

3. The method of claim 1, wherein the performing semantic similarity calculations on the user utterance to be annotated and the conversational behavior to obtain a first semantic similarity score comprises:

4. The method of claim 1, wherein said performing a literal similarity calculation of the context of the user utterance to be annotated and the conversational behavior to obtain a second literal similarity score comprises:

obtaining context texts marked with user utterances in the corpus;

5. The method of claim 1, wherein the performing semantic similarity calculations on the context of the user utterance to be annotated and the conversational behavior to obtain a second semantic similarity score comprises:

6. The method of claim 1, wherein determining candidate recommended annotations based on the first literal similarity score, first semantic similarity score, second literal similarity score, and the second semantic similarity score comprises:

7. An accelerating annotation device for multi-turn dialogue corpora oriented to human-computer interaction, characterized by comprising:

8. The apparatus of claim 7, wherein the first computing module comprises:

the first calculation submodule is used for calculating the first similarity through a first preset algorithm so as to obtain a first literal similarity score;

9. The apparatus of claim 7, wherein the second computing module comprises:

the fourth calculation submodule is used for calculating the second similarity through a third preset algorithm so as to obtain a second literal similarity score;

10. The apparatus of claim 7, wherein the determining module comprises: