CN111178062A - Man-machine interaction multi-turn dialogue corpus oriented acceleration labeling method and device - Google Patents


Info

Publication number
CN111178062A
Authority
CN
China
Prior art keywords
similarity score
annotated
similarity
semantic
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911212568.9A
Other languages
Chinese (zh)
Other versions
CN111178062B (en)
Inventor
王星光
陈�峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd
Priority to CN201911212568.9A
Publication of CN111178062A
Application granted
Publication of CN111178062B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and device for accelerating the annotation of multi-turn dialogue corpora for human-computer interaction. The method comprises the following steps: acquiring a user utterance to be annotated and the context of the user utterance to be annotated; performing literal similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score; performing semantic similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first semantic similarity score; performing literal similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score; performing semantic similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second semantic similarity score; and determining candidate recommended labels according to the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score. The technical scheme of the invention reduces annotation errors and increases annotation speed.

Description

Man-machine interaction multi-turn dialogue corpus oriented acceleration labeling method and device
Technical Field
The invention relates to the technical field of computers and information services, in particular to an accelerating labeling method and device for multi-turn dialogue corpora oriented to human-computer interaction.
Background
In human-computer interaction applications such as intelligent customer service and early-education machines for children, system logs contain a large number of multi-turn dialogue corpora. Recognizing the conversation behaviors (dialogue acts, DA) in a dialogue corpus plays a key role in understanding the user's true intent. A conversation behavior describes the semantic and interactive function of a user utterance within a conversation. The traditional way of processing a corpus is manual annotation: annotators label each user utterance with one of the predefined conversation behaviors, and the labeled data is then used to drive machine learning techniques that learn the true intent of user utterances.
Manual annotation has two problems. On the one hand, a multi-turn dialogue corpus contains many types of conversation behavior. On the other hand, the true intent of a user utterance often becomes clear only from its context. Together these make annotating multi-turn dialogues laborious and time-consuming for the annotator, and they easily lead to annotation errors.
Disclosure of Invention
The invention provides an accelerating labeling method and device for multi-turn dialogue corpora oriented to human-computer interaction. The technical scheme is as follows:
According to a first aspect of the embodiments of the present invention, there is provided a method for accelerating the annotation of multi-turn dialogue corpora for human-computer interaction, including:
acquiring a user utterance to be annotated and a context of the user utterance to be annotated;
performing literal similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score;
performing semantic similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first semantic similarity score;
performing literal similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score;
performing semantic similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second semantic similarity score;
and determining candidate recommended labels according to the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score.
In one embodiment, the performing a literal similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score includes:
performing word segmentation on the user utterance to be annotated to acquire first n-gram information;
determining a first query word representation according to the first n-gram information;
acquiring the annotated user utterances in the corpus;
retrieving the annotated user utterances through a first preset model to obtain a preset number of annotated user utterances ranked highest in similarity to the user utterance to be annotated, together with their first similarities;
and calculating the first similarity through a first preset algorithm to obtain the first literal similarity score.
In one embodiment, the semantic similarity calculation of the user utterance to be annotated and the conversation behavior to obtain a first semantic similarity score includes:
calculating the user utterance to be annotated through a pre-trained unsupervised language model to obtain a first sentence semantic vector;
acquiring a first preset sentence semantic vector of the conversation behavior in the corpus;
and calculating the first preset sentence semantic vector and the first sentence semantic vector through a second preset algorithm to obtain the first semantic similarity score.
In one embodiment, the performing a literal similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score includes:
performing word segmentation on the context of the user utterance to be annotated to acquire second n-gram information;
determining a second query word representation according to the second n-gram information;
acquiring the context texts of the annotated user utterances in the corpus;
retrieving the context texts of the annotated user utterances through a second preset model to obtain a preset number of context texts ranked highest in similarity to the context of the user utterance to be annotated, together with their second similarities;
and calculating the second similarity through a third preset algorithm to obtain a second literal similarity score.
In one embodiment, the performing semantic similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second semantic similarity score includes:
calculating the context of the user utterance to be annotated through the pre-trained unsupervised language model to obtain a second sentence semantic vector;
acquiring a second preset sentence semantic vector of the conversation behavior in the corpus;
and calculating the second preset sentence semantic vector and the second sentence semantic vector through a fourth preset algorithm to obtain the second semantic similarity score.
In one embodiment, the determining candidate recommended labels according to the first literal similarity score, the first semantic similarity score, the second literal similarity score, and the second semantic similarity score includes:
calculating the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score through a fifth preset algorithm to obtain a score for each conversation behavior;
and sorting the conversation behaviors in descending order of score and taking a preset number of them to obtain the candidate recommended labels corresponding to the user utterance to be annotated.
According to a second aspect of the embodiments of the present invention, there is provided a device for accelerating the annotation of multi-turn dialogue corpora for human-computer interaction, including:
an acquisition module, configured to acquire a user utterance to be annotated and the context of the user utterance to be annotated;
a first calculation module, configured to perform literal similarity calculation and semantic similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score and a first semantic similarity score;
a second calculation module, configured to perform literal similarity calculation and semantic similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score and a second semantic similarity score;
and a determining module, configured to determine candidate recommended labels according to the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score.
In one embodiment, the first calculation module includes:
a first obtaining submodule, configured to perform word segmentation on the user utterance to be annotated to obtain first n-gram information;
a first determining submodule, configured to determine a first query term representation according to the first n-gram information;
a second obtaining submodule, configured to obtain the annotated user utterances in the corpus;
a first retrieval submodule, configured to retrieve the annotated user utterances through a first preset model to obtain a preset number of annotated user utterances ranked highest in similarity to the user utterance to be annotated, together with their first similarities;
and a first calculating submodule, configured to calculate the first similarity through a first preset algorithm to obtain the first literal similarity score.
The second calculation submodule is used for calculating the user utterance to be annotated through a pre-trained unsupervised language model so as to obtain a first sentence semantic vector;
a third obtaining submodule, configured to obtain a semantic vector of a first preset sentence of the conversation behavior in the corpus;
and the third calculation submodule is used for calculating the semantic vector of the first preset sentence and the semantic vector of the first sentence through a second preset algorithm so as to obtain the score of the first semantic similarity.
In one embodiment, the second calculation module includes:
the fourth obtaining submodule is used for carrying out word segmentation on the context of the user utterance to be annotated so as to obtain second n-gram information;
a second determining submodule, configured to determine a second query term representation according to the second n-gram information;
a fifth obtaining submodule, configured to obtain the context texts of the annotated user utterances in the corpus;
a second retrieval submodule, configured to retrieve the context texts of the annotated user utterances through a second preset model to obtain a preset number of context texts ranked highest in similarity to the context of the user utterance to be annotated, together with their second similarities;
and the fourth calculating submodule is used for calculating the second similarity through a third preset algorithm so as to obtain the second literal similarity score.
A fifth calculation submodule, configured to calculate, through the pre-trained unsupervised language model, a context of the user utterance to be annotated to obtain a second sentence semantic vector;
a sixth obtaining submodule, configured to obtain a second preset sentence semantic vector of the conversation behavior in the corpus;
and the sixth calculating submodule is used for calculating the second preset sentence semantic vector and the second sentence semantic vector through a fourth preset algorithm so as to obtain the second semantic similarity score.
In one embodiment, the determining module includes:
a seventh calculation submodule, configured to calculate the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score through a fifth preset algorithm to obtain a score for each conversation behavior;
and an arrangement submodule, configured to sort the conversation behaviors in descending order of score and take a preset number of them to obtain the candidate recommended labels corresponding to the user utterance to be annotated.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
Literal similarity and semantic similarity are calculated between the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score and a first semantic similarity score, and between the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score and a second semantic similarity score. Candidate recommended labels for the user utterance to be annotated are then determined according to the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score. The candidate recommended labels give the annotator label options for the user utterance to be annotated, which assists the annotator's judgment and labeling, avoids the labor and time otherwise wasted because of the many types of conversation behavior, and improves annotation efficiency. In addition, because the context of the user utterance to be annotated and the semantic similarity scores are both taken into account when the candidate recommended labels are determined, the accuracy of the annotation can be significantly improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart illustrating an accelerated annotation method for multi-turn interactive corpora according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an accelerating annotation method for multi-turn interactive corpora according to an embodiment of the present invention;
FIG. 3 is a block diagram of an accelerating annotation device for multi-turn interactive corpora according to an embodiment of the present invention;
FIG. 4 is a block diagram of an accelerating annotation device for multi-turn interactive corpora according to an embodiment of the invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Fig. 1 is a flowchart illustrating a method for accelerating annotation of multi-turn dialog corpus according to an embodiment of the present invention, as shown in fig. 1, the method can be implemented as the following steps S11-S16:
in step S11, a user utterance to be annotated and a context of the user utterance to be annotated are acquired;
Here, the context of the user utterance to be annotated is the text obtained by concatenating the utterances before and after the user utterance to be annotated, and is denoted context_u; the user utterance to be annotated is denoted utter_u. The context of the user utterance to be annotated is taken to have the same or a similar meaning as the utterance itself; this rests on the distributional hypothesis, i.e. if two words are similar, their contexts are also similar.
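The construction of context_u described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_context`, the `window` parameter (how many neighboring turns to include) and the space separator are all assumptions, since the text only says that the preceding and following utterances are concatenated.

```python
def build_context(turns, index, window=1, sep=" "):
    """Concatenate the utterances before and after turns[index].

    `window` and `sep` are hypothetical parameters; the source does not
    state how many neighboring turns are used or how they are joined.
    """
    before = turns[max(0, index - window):index]
    after = turns[index + 1:index + 1 + window]
    return sep.join(before + after)


# Toy dialogue (invented for illustration).
dialogue = [
    "Hello, how can I help?",
    "I have no money to repay today",
    "When can you repay?",
]
utter_u = dialogue[1]
context_u = build_context(dialogue, 1)
```

The utterance itself is excluded from its own context, so the context-based scores carry information that the utterance-based scores do not.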
In step S12, literal similarity calculation is performed on the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score;
conversation behaviors include, but are not limited to, welcome, thank you, farewell and question; a conversation behavior is denoted da.
In step S13, semantic similarity calculation is performed on the user utterance to be annotated and the conversation behavior to obtain a first semantic similarity score;
in step S14, literal similarity calculation is performed on the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score;
in step S15, semantic similarity calculation is performed on the context of the user utterance to be annotated and the conversation behavior to obtain a second semantic similarity score;
in step S16, candidate recommended labels are determined according to the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score.
It should be noted that steps S11-S16 are not performed while the annotator has not yet annotated anything; they are performed once the annotator has annotated a small amount of the corpus.
Literal similarity and semantic similarity are calculated between the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score and a first semantic similarity score, and between the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score and a second semantic similarity score. Candidate recommended labels for the user utterance to be annotated are then determined according to the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score. The candidate recommended labels give the annotator label options for the user utterance to be annotated, which assists the annotator's judgment and labeling, avoids the labor and time otherwise wasted because of the many types of conversation behavior, and improves annotation efficiency. In addition, because the context of the user utterance to be annotated and the semantic similarity scores are both taken into account when the candidate recommended labels are determined, the accuracy of the annotation can be significantly improved.
As shown in FIG. 2, in one embodiment, the above step S12 can be implemented as the following steps S21-S25:
in step S21, performing word segmentation on the user utterance to be annotated to obtain first n-gram information; the first n-gram information is updated into the corpus.
In step S22, determining a first query term representation according to the first n-gram information;
for example, with n ∈ {1, 2} and the user utterance to be annotated segmented as {I / today / no money / repayment}, the lookup-table features determined from the utterance are {I, today, no money, repayment, I_today, today_no money, no money_repayment}; when n takes only the value 1, the lookup-table features reduce to the traditional bag-of-words model.
In step S23, the user utterances that have already been annotated in the corpus are acquired.
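The n-gram example above can be reproduced with a short sketch. The function name `ngram_features` and the underscore joiner are illustrative choices taken from how the example writes its bigrams; word segmentation itself is assumed to have been done already, so the input is a list of tokens.

```python
def ngram_features(tokens, orders=(1, 2), joiner="_"):
    """Build lookup-table features from segmented tokens.

    For each order n in `orders`, every run of n adjacent tokens is
    joined with `joiner`. With orders=(1,) this is a bag of words.
    """
    feats = []
    for n in orders:
        for i in range(len(tokens) - n + 1):
            feats.append(joiner.join(tokens[i:i + n]))
    return feats


# Tokens from the example in the text (already segmented).
tokens = ["I", "today", "no money", "repayment"]
features = ngram_features(tokens)
unigrams_only = ngram_features(tokens, orders=(1,))
```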
In step S24, retrieving the labeled user utterance through a first preset model to obtain a preset number of labeled user utterances with the highest similarity to the user utterance to be labeled and a first similarity;
the first preset model is as follows:
Figure BDA0002298538800000081
Figure BDA0002298538800000082
where D denotes an annotated user utterance in the corpus; sim_text is the first similarity between utter_u and D; q_i is an n-gram term of utter_u; tf(q_i, D) is the frequency of q_i in D; |D| is the number of n-gram terms contained in D; avgdl is the average number of n-gram terms contained in the user utterances of the corpus; N is the total number of user utterances in the corpus; n(q_i) is the number of user utterances in the corpus that contain q_i; and k and b are model parameters that can be set freely as required.
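The formula images for the first preset model are not reproduced on this page, but the symbols defined above (term frequency, document length normalized by avgdl, document counts N and n(q_i), parameters k and b) match those of Okapi BM25. The sketch below therefore assumes a standard BM25 form; the exact IDF variant (here with `+ 1` inside the logarithm so scores stay non-negative), the default values of `k` and `b`, and the function names are assumptions rather than the patent's stated formula.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k=1.2, b=0.75):
    """Score every doc (a non-empty list of n-gram terms) against query_terms.

    Assumed form: sum over q of IDF(q) * tf * (k + 1) / (tf + k * (1 - b + b * |D| / avgdl)).
    """
    n_docs = len(docs)                             # N in the text
    avgdl = sum(len(d) for d in docs) / n_docs     # average number of terms per doc
    doc_freq = Counter()                           # n(q): number of docs containing q
    for d in docs:
        doc_freq.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        total = 0.0
        for q in query_terms:
            if tf[q] == 0:
                continue
            idf = math.log((n_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5) + 1.0)
            norm = tf[q] + k * (1 - b + b * len(d) / avgdl)
            total += idf * tf[q] * (k + 1) / norm
        scores.append(total)
    return scores


def top_k(query_terms, docs, k_top=2, **params):
    """Return (doc_index, similarity) for the k_top highest-scoring docs."""
    scores = bm25_scores(query_terms, docs, **params)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k_top]]


# Toy annotated utterances, already converted to n-gram terms (invented data).
docs = [["no money", "repayment"], ["hello"], ["thanks", "bye"]]
query = ["no money", "repayment"]
scores = bm25_scores(query, docs)
best = top_k(query, docs, k_top=1)
```

The same routine would apply unchanged to the context texts in step S14, with the context n-grams as the query.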
In step S25, calculating a first similarity by a first preset algorithm to obtain a first literal similarity score;
the first literal similarity score may be expressed as score1(utter_u, da); the first preset algorithm is as follows:
Figure BDA0002298538800000091
wherein several utterances D in the corpus may be annotated with the same conversation behavior da.
Retrieving the annotated user utterances through the first preset model yields a preset number of annotated user utterances ranked highest in similarity to the user utterance to be annotated, with no manual search by the annotator, which reduces the annotator's workload and improves working efficiency.
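Because several retrieved utterances can share one conversation behavior, the per-utterance similarities have to be combined into one score per da. The formula image for the first preset algorithm is not reproduced here, so the sketch below simply assumes that the similarities of retrieved utterances with the same label are summed; the function name and the choice of summation (rather than, say, a maximum or an average) are assumptions.

```python
from collections import defaultdict


def literal_scores_by_da(retrieved):
    """Combine retrieval similarities into one literal score per conversation behavior.

    `retrieved` is an iterable of (da_label, similarity) pairs for the
    top-ranked annotated utterances. Summation is an assumed aggregation.
    """
    totals = defaultdict(float)
    for da, sim in retrieved:
        totals[da] += sim
    return dict(totals)


# Toy retrieval result (invented values).
retrieved = [("question", 2.0), ("thanks", 0.5), ("question", 1.0)]
score1 = literal_scores_by_da(retrieved)
```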
In one embodiment, the step S13 can be implemented as steps including:
calculating the user utterance to be annotated through a pre-trained unsupervised language model to obtain a first sentence semantic vector;
wherein the pre-trained unsupervised language model includes, but is not limited to, word2vec, ELMo, BERT and ERNIE. For example, with the ELMo model, the semantic vector of each term in utter_u can be obtained, and summing these vectors gives the first sentence semantic vector, which may be expressed as vec(utter_u).
Acquiring a first preset sentence semantic vector of the conversation behavior in the corpus;
wherein the first preset sentence semantic vector is expressed as vec(D_da): the set D_da of user utterances in the corpus annotated as da is obtained, the sentence semantic vector of each user utterance in the set is calculated, and the vectors are averaged to obtain vec(D_da).
Calculating the first preset sentence semantic vector and the first sentence semantic vector through a second preset algorithm to obtain the first semantic similarity score;
the first semantic similarity score may be expressed as score2(utter_u, da); the second preset algorithm is as follows:
Figure BDA0002298538800000092
wherein D_da denotes the set of user utterances in the corpus annotated as da.
Calculating the first preset sentence semantic vector and the first sentence semantic vector through the second preset algorithm yields the semantic similarity between the user utterance to be annotated and the conversation behavior; computing it algorithmically makes the similarity result more reliable.
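The vector steps described above (sum term vectors into a sentence vector, average the sentence vectors of all utterances annotated as da, then compare the two) can be sketched in plain Python. The formula image for the second preset algorithm is not reproduced here; cosine similarity is assumed as the comparison because it is a common choice for sentence vectors, and the tiny two-dimensional vectors stand in for real model outputs.

```python
import math


def sentence_vector(term_vectors):
    """Sum per-term vectors into one sentence vector, as described for ELMo."""
    return [sum(col) for col in zip(*term_vectors)]


def mean_vector(vectors):
    """Average sentence vectors, e.g. over all utterances annotated as da."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; an assumed stand-in for the second preset algorithm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy 2-d term vectors (invented; a real system would use model embeddings).
vec_utter = sentence_vector([[1.0, 0.0], [0.0, 1.0]])
vec_da = mean_vector([[1.0, 1.0], [3.0, 1.0]])
score2_da = cosine(vec_utter, vec_da)
```

The second semantic similarity score in step S15 would follow the same pattern with context vectors in place of utterance vectors.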
In one embodiment, the step S14 can be implemented as steps including:
performing word segmentation on the context of the user utterance to be annotated to acquire second n-gram information; and updating the second n-gram information into the corpus.
Determining a second query word representation according to the second n-gram information;
acquiring the context texts of the annotated user utterances in the corpus;
retrieving the context texts of the annotated user utterances through a second preset model to obtain a preset number of context texts ranked highest in similarity to the context of the user utterance to be annotated, together with their second similarities;
the second preset model is as follows:
Figure BDA0002298538800000101
Figure BDA0002298538800000102
where D' denotes the context text of an annotated user utterance in the corpus; sim_text is the second similarity between context_u and D'; q_i is an n-gram term of context_u; tf(q_i, D') is the frequency of q_i in D'; |D'| is the number of n-gram terms contained in D'; avgdl is the average number of n-gram terms contained in the context texts of the corpus; N' is the total number of context texts in the corpus; n(q_i) is the number of context texts in the corpus that contain q_i; and k and b are model parameters that can be set freely as required.
Calculating the second similarity through a third preset algorithm to obtain a second literal similarity score;
the second literal similarity score may be expressed as score3(context_u, da); the third preset algorithm is as follows:
Figure BDA0002298538800000103
wherein several context texts D' in the corpus may belong to user utterances annotated with the same conversation behavior da.
Retrieving the context texts of the annotated user utterances through the second preset model yields a preset number of context texts ranked highest in similarity to the context of the user utterance to be annotated, with no manual search, which further reduces labor cost while keeping the retrieved results accurate. Calculating the second similarity through the third preset algorithm then yields the literal similarity between the context of the user utterance to be annotated and the conversation behavior, so the literal similarity obtained in this way is more accurate.
In one embodiment, the step S15 can be implemented as steps including:
calculating the context of the user utterance to be annotated through the pre-trained unsupervised language model to obtain a second sentence semantic vector, which may be expressed as vec(context_u).
Acquiring a second preset sentence semantic vector of the conversation behavior in the corpus;
wherein the second preset sentence semantic vector is expressed as vec(D'_da): the set D'_da of contexts of user utterances in the corpus annotated as da is obtained, the sentence semantic vector of each context text in the set is calculated, and the vectors are averaged to obtain vec(D'_da).
And calculating the second preset sentence semantic vector and the second sentence semantic vector through a fourth preset algorithm to obtain the second semantic similarity score.
The second semantic similarity score may be expressed as score4(context_u, da); the fourth preset algorithm is as follows:
Figure BDA0002298538800000111
Calculating the second preset sentence semantic vector and the second sentence semantic vector through the fourth preset algorithm yields the similarity between the context of the user utterance to be annotated and the conversation behavior, so the annotator does not need to spend a large amount of time comparing the context with the conversation behaviors, and the time spent determining the user's true intent from the context is reduced.
In one embodiment, the step S16 can be implemented as steps including:
calculating the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score through a fifth preset algorithm to obtain a score for each conversation behavior;
the fifth preset algorithm is as follows:
score(da) = score1(utter_u, da) + score2(utter_u, da) + score3(context_u, da) + score4(context_u, da)
and sorting the conversation behaviors in descending order of score and taking a preset number of them to obtain the candidate recommended labels corresponding to the user utterance to be annotated.
Because the conversation behaviors differ in how well they match the user utterance to be annotated, presenting a preset number of them in descending order of score as the candidate recommended labels assists the annotator accurately and intuitively and reduces annotation errors.
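The fifth preset algorithm and the ranking step can be sketched as below. The unweighted sum follows the formula given in the text; treating a conversation behavior that is missing from one score table as scoring 0 there, and breaking ties alphabetically, are assumptions added so the example is deterministic.

```python
def recommend(score1, score2, score3, score4, top_n=3):
    """Sum the four per-da scores and return the top_n candidates, best first.

    Each argument maps a conversation behavior label to a score.
    Missing labels default to 0.0 (an assumption, not stated in the source).
    """
    tables = (score1, score2, score3, score4)
    labels = set().union(*tables)
    total = {da: sum(t.get(da, 0.0) for t in tables) for da in labels}
    # Descending by score; ties broken by label for a stable order.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy score tables (invented values).
s1 = {"question": 3.0, "thanks": 0.5}
s2 = {"question": 0.5, "thanks": 0.25, "farewell": 0.25}
s3 = {"question": 1.0}
s4 = {"farewell": 0.5}
candidates = recommend(s1, s2, s3, s4, top_n=2)
```

Since literal scores of the BM25 kind are unbounded while similarity measures such as cosine are bounded, an implementation might want to normalize the four scores before summing; the source formula does not mention this.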
Corresponding to the accelerated annotation method for human-computer interaction multi-turn dialogue corpora provided by the embodiment of the present invention, an embodiment of the present invention further provides an accelerated annotation device for human-computer interaction multi-turn dialogue corpora. As shown in fig. 3, the device includes:
an obtaining module 31, configured to obtain a user utterance to be annotated and a context of the user utterance to be annotated;
a first calculating module 32, configured to perform literal similarity calculation and semantic similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score and a first semantic similarity score;
a second calculating module 33, configured to perform literal similarity calculation and semantic similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score and a second semantic similarity score;
and a determining module 34, configured to determine candidate recommended annotations according to the first literal similarity score, the first semantic similarity score, the second literal similarity score, and the second semantic similarity score.
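The data flow between the four modules can be sketched as a small class. Everything below is illustrative: the class name `AnnotationDevice`, the callable signatures, and the convention that the last dialogue turn is the utterance while earlier turns form its context are assumptions, and the two calculation modules are passed in as stubs rather than implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]  # conversation behavior -> similarity score


@dataclass
class AnnotationDevice:
    # Each calculation module returns (literal scores, semantic scores); hypothetical signature.
    first_calculation: Callable[[str], Tuple[Scores, Scores]]
    second_calculation: Callable[[str], Tuple[Scores, Scores]]
    top_k: int = 3

    def obtain(self, turns: List[str]) -> Tuple[str, str]:
        """Obtaining module: last turn is the utterance, earlier turns its context (assumption)."""
        if not turns:
            raise ValueError("dialogue must contain at least one turn")
        return turns[-1], " ".join(turns[:-1])

    def determine(self, *score_maps: Scores) -> List[str]:
        """Determining module: sum all scores per behavior, highest total first."""
        totals: Dict[str, float] = {}
        for scores in score_maps:
            for behavior, value in scores.items():
                totals[behavior] = totals.get(behavior, 0.0) + value
        ranked = sorted(totals, key=lambda b: (-totals[b], b))
        return ranked[: self.top_k]

    def recommend(self, turns: List[str]) -> List[str]:
        utterance, context = self.obtain(turns)
        literal_1, semantic_1 = self.first_calculation(utterance)
        literal_2, semantic_2 = self.second_calculation(context)
        return self.determine(literal_1, semantic_1, literal_2, semantic_2)
```

Passing the calculation modules as callables keeps the sketch independent of any particular retrieval model or language model, which the text leaves as "preset".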
As shown in fig. 4, in one embodiment, the first calculation module 32 may include:
a first obtaining submodule 321, configured to perform word segmentation on the user utterance to be annotated to obtain first n-gram information;
a first determining submodule 322, configured to determine a first query term representation according to the first n-gram information;
a second obtaining submodule 323, configured to obtain the annotated user utterances in the corpus;
a first retrieval submodule 324, configured to retrieve the annotated user utterances through a first preset model, so as to obtain a preset number of annotated user utterances that rank highest in similarity to the user utterance to be annotated, together with a first similarity;
a first calculating submodule 325, configured to calculate the first similarity through a first preset algorithm to obtain the first literal similarity score;
a second calculating submodule 326, configured to calculate the user utterance to be annotated through a pre-trained unsupervised language model to obtain a first sentence semantic vector;
a third obtaining submodule 327, configured to obtain a first preset sentence semantic vector of the conversation behavior in the corpus;
and a third calculating submodule 328, configured to calculate the first preset sentence semantic vector and the first sentence semantic vector through a second preset algorithm, so as to obtain the first semantic similarity score.
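The work of submodules 321 and 322 (segmentation, n-gram extraction, and building a query representation) might look like the sketch below. The text does not specify the segmenter, the n-gram order, or the form of the query term representation, so whitespace splitting, unigrams plus bigrams, and a bag-of-n-grams `Counter` are all assumptions; `ngrams` and `query_representation` are hypothetical names.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], max_n: int = 2) -> List[Tuple[str, ...]]:
    """Return all n-grams of the token list for n = 1 .. max_n."""
    grams: List[Tuple[str, ...]] = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams


def query_representation(utterance: str, max_n: int = 2) -> Counter:
    """Build a bag-of-n-grams query from an utterance.

    Whitespace splitting stands in for a real word segmenter (assumption);
    Chinese text would need a dedicated segmentation tool instead.
    """
    tokens = utterance.split()
    return Counter(" ".join(gram) for gram in ngrams(tokens, max_n))
```

A query built this way could then be handed to whatever retrieval model plays the role of the first preset model.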
In one embodiment, the second calculation module 33 includes:
a fourth obtaining submodule, configured to perform word segmentation on the context of the user utterance to be annotated to obtain second n-gram information;
a second determining submodule, configured to determine a second query term representation according to the second n-gram information;
a fifth obtaining submodule, configured to obtain the context texts of the annotated user utterances in the corpus;
a second retrieval submodule, configured to retrieve the context texts of the annotated user utterances through a second preset model, so as to obtain a preset number of context texts of annotated user utterances that rank highest in similarity to the context of the user utterance to be annotated, together with a second similarity;
a fourth calculating submodule, configured to calculate the second similarity through a third preset algorithm to obtain the second literal similarity score;
a fifth calculating submodule, configured to calculate the context of the user utterance to be annotated through the pre-trained unsupervised language model to obtain a second sentence semantic vector;
a sixth obtaining submodule, configured to obtain a second preset sentence semantic vector of the conversation behavior in the corpus;
and a sixth calculating submodule, configured to calculate the second preset sentence semantic vector and the second sentence semantic vector through a fourth preset algorithm to obtain the second semantic similarity score.
In one embodiment, the determining module 34 includes:
a seventh calculating submodule, configured to calculate the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score through a fifth preset algorithm to obtain a score of each conversation behavior;
and an arranging submodule, configured to arrange a preset number of conversation behaviors in descending order of their scores to obtain the candidate recommended annotations corresponding to the user utterance to be annotated.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A man-machine interaction multi-turn dialogue corpus oriented acceleration labeling method is characterized by comprising the following steps:
acquiring a user utterance to be annotated and a context of the user utterance to be annotated;
performing literal similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score;
semantic similarity calculation is carried out on the user utterance to be annotated and the conversation behavior to obtain a first semantic similarity score;
performing literal similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score;
performing semantic similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second semantic similarity score;
and determining candidate recommended annotations according to the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score.
2. The method of claim 1, wherein said performing literal similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score comprises:
performing word segmentation on the user utterance to be annotated to acquire first n-gram information;
determining a first query term representation according to the first n-gram information;
acquiring the annotated user utterances in the corpus;
retrieving the annotated user utterances through a first preset model to obtain a preset number of annotated user utterances that rank highest in similarity to the user utterance to be annotated, together with a first similarity;
and calculating the first similarity through a first preset algorithm to obtain the first literal similarity score.
3. The method of claim 1, wherein said performing semantic similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first semantic similarity score comprises:
calculating the user utterance to be annotated through a pre-trained unsupervised language model to obtain a first sentence semantic vector;
acquiring a first preset sentence semantic vector of the conversation behavior in the corpus;
and calculating the first preset sentence semantic vector and the first sentence semantic vector through a second preset algorithm to obtain the first semantic similarity score.
4. The method of claim 1, wherein said performing literal similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score comprises:
performing word segmentation on the context of the user utterance to be annotated to acquire second n-gram information;
determining a second query term representation according to the second n-gram information;
obtaining the context texts of the annotated user utterances in the corpus;
retrieving the context texts of the annotated user utterances through a second preset model to obtain a preset number of context texts of annotated user utterances that rank highest in similarity to the context of the user utterance to be annotated, together with a second similarity;
and calculating the second similarity through a third preset algorithm to obtain the second literal similarity score.
5. The method of claim 1, wherein said performing semantic similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second semantic similarity score comprises:
calculating the context of the user utterance to be annotated through the pre-trained unsupervised language model to obtain a second sentence semantic vector;
acquiring a second preset sentence semantic vector of the conversation behavior in the corpus;
and calculating the second preset sentence semantic vector and the second sentence semantic vector through a fourth preset algorithm to obtain the second semantic similarity score.
6. The method of claim 1, wherein said determining candidate recommended annotations according to the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score comprises:
calculating the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score through a fifth preset algorithm to obtain a score of each conversation behavior;
and arranging a preset number of conversation behaviors in descending order of their scores to obtain the candidate recommended annotations corresponding to the user utterance to be annotated.
7. An accelerated annotation device for human-computer interaction multi-turn dialogue corpora, characterized by comprising:
an obtaining module, configured to obtain a user utterance to be annotated and a context of the user utterance to be annotated;
a first calculating module, configured to perform literal similarity calculation and semantic similarity calculation on the user utterance to be annotated and the conversation behavior to obtain a first literal similarity score and a first semantic similarity score;
a second calculating module, configured to perform literal similarity calculation and semantic similarity calculation on the context of the user utterance to be annotated and the conversation behavior to obtain a second literal similarity score and a second semantic similarity score;
and a determining module, configured to determine candidate recommended annotations according to the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score.
8. The apparatus of claim 7, wherein the first calculating module comprises:
a first obtaining submodule, configured to perform word segmentation on the user utterance to be annotated to obtain first n-gram information;
a first determining submodule, configured to determine a first query term representation according to the first n-gram information;
a second obtaining submodule, configured to obtain the annotated user utterances in the corpus;
a first retrieval submodule, configured to retrieve the annotated user utterances through a first preset model, so as to obtain a preset number of annotated user utterances that rank highest in similarity to the user utterance to be annotated, together with a first similarity;
a first calculating submodule, configured to calculate the first similarity through a first preset algorithm to obtain the first literal similarity score;
a second calculating submodule, configured to calculate the user utterance to be annotated through a pre-trained unsupervised language model to obtain a first sentence semantic vector;
a third obtaining submodule, configured to obtain a first preset sentence semantic vector of the conversation behavior in the corpus;
and a third calculating submodule, configured to calculate the first preset sentence semantic vector and the first sentence semantic vector through a second preset algorithm to obtain the first semantic similarity score.
9. The apparatus of claim 7, wherein the second calculating module comprises:
a fourth obtaining submodule, configured to perform word segmentation on the context of the user utterance to be annotated to obtain second n-gram information;
a second determining submodule, configured to determine a second query term representation according to the second n-gram information;
a fifth obtaining submodule, configured to obtain the context texts of the annotated user utterances in the corpus;
a second retrieval submodule, configured to retrieve the context texts of the annotated user utterances through a second preset model, so as to obtain a preset number of context texts of annotated user utterances that rank highest in similarity to the context of the user utterance to be annotated, together with a second similarity;
a fourth calculating submodule, configured to calculate the second similarity through a third preset algorithm to obtain the second literal similarity score;
a fifth calculating submodule, configured to calculate the context of the user utterance to be annotated through the pre-trained unsupervised language model to obtain a second sentence semantic vector;
a sixth obtaining submodule, configured to obtain a second preset sentence semantic vector of the conversation behavior in the corpus;
and a sixth calculating submodule, configured to calculate the second preset sentence semantic vector and the second sentence semantic vector through a fourth preset algorithm to obtain the second semantic similarity score.
10. The apparatus of claim 7, wherein the determining module comprises:
a seventh calculating submodule, configured to calculate the first literal similarity score, the first semantic similarity score, the second literal similarity score and the second semantic similarity score through a fifth preset algorithm to obtain a score of each conversation behavior;
and an arranging submodule, configured to arrange a preset number of conversation behaviors in descending order of their scores to obtain the candidate recommended annotations corresponding to the user utterance to be annotated.
CN201911212568.9A 2019-12-02 2019-12-02 Acceleration labeling method and device for man-machine interaction multi-round dialogue corpus Active CN111178062B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911212568.9A CN111178062B (en) 2019-12-02 2019-12-02 Acceleration labeling method and device for man-machine interaction multi-round dialogue corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911212568.9A CN111178062B (en) 2019-12-02 2019-12-02 Acceleration labeling method and device for man-machine interaction multi-round dialogue corpus

Publications (2)

Publication Number Publication Date
CN111178062A (en) 2020-05-19
CN111178062B (en) 2023-05-05

Family

ID=70646366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911212568.9A Active CN111178062B (en) 2019-12-02 2019-12-02 Acceleration labeling method and device for man-machine interaction multi-round dialogue corpus

Country Status (1)

Country Link
CN (1) CN111178062B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343248A (en) * 2021-07-19 2021-09-03 北京有竹居网络技术有限公司 Vulnerability identification method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060173686A1 (en) * 2005-02-01 2006-08-03 Samsung Electronics Co., Ltd. Apparatus, method, and medium for generating grammar network for use in speech recognition and dialogue speech recognition
CN109740126A (en) * 2019-01-04 2019-05-10 平安科技(深圳)有限公司 Text matching technique, device and storage medium, computer equipment
CN110096567A (en) * 2019-03-14 2019-08-06 中国科学院自动化研究所 Selection method, system are replied in more wheels dialogue based on QA Analysis of Knowledge Bases Reasoning
CN110222154A (en) * 2019-06-10 2019-09-10 武汉斗鱼鱼乐网络科技有限公司 Similarity calculating method, server and storage medium based on text and semanteme

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郭鸿奇; 李国佳: "A sentence similarity calculation method based on multi-prototype vector representation of words" (一种基于词语多原型向量表示的句子相似度计算方法) *

Also Published As

Publication number Publication date
CN111178062B (en) 2023-05-05

Similar Documents

Publication Publication Date Title
CN112069298B (en) Man-machine interaction method, device and medium based on semantic web and intention recognition
CN111125335B (en) Question and answer processing method and device, electronic equipment and storage medium
CN101996631B (en) Method and device for aligning texts
US8200490B2 (en) Method and apparatus for searching multimedia data using speech recognition in mobile device
CN110276071B (en) Text matching method and device, computer equipment and storage medium
CN111090727B (en) Language conversion processing method and device and dialect voice interaction system
CN107305541A (en) Speech recognition text segmentation method and device
WO2003010754A1 (en) Speech input search system
CN116166782A (en) Intelligent question-answering method based on deep learning
WO2016200902A2 (en) Systems and methods for learning semantic patterns from textual data
CN111666764B (en) Automatic abstracting method and device based on XLNet
CN117149984B (en) Customization training method and device based on large model thinking chain
CN109063182B (en) Content recommendation method based on voice search questions and electronic equipment
CN112231451B (en) Reference word recovery method and device, conversation robot and storage medium
CN113326702A (en) Semantic recognition method and device, electronic equipment and storage medium
Moyal et al. Phonetic search methods for large speech databases
CN110633724A (en) Intention recognition model dynamic training method, device, equipment and storage medium
CN117708157A (en) SQL sentence generation method and device
CN111159381A (en) Data searching method and device
CN111178062B (en) Acceleration labeling method and device for man-machine interaction multi-round dialogue corpus
CN116628146A (en) FAQ intelligent question-answering method and system in financial field
JP2007065029A (en) Syntax/semantic analysis system and program, and speech recognition system
CN116186219A (en) Man-machine dialogue interaction method, system and storage medium
CN115759048A (en) Script text processing method and device
CN116090450A (en) Text processing method and computing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant