CN112699226A - Method and system for semantic confusion detection

Info

Publication number: CN112699226A
Application number: CN202011586654.9A
Authority: CN
Inventors: 汪燕燕; 陈述; 沈艺; 张兵兵; 钟涛
Original assignee: Jiangsu Suning Cloud Computing Co ltd
Current assignee: Jiangsu Suning Cloud Computing Co ltd
Priority date: 2020-12-29
Filing date: 2020-12-29
Publication date: 2021-04-23
Also published as: CA3144128A1

Abstract

The invention discloses a method and a system for semantic confusion detection, wherein the method comprises the steps of acquiring a full-scale knowledge base of a conversation platform; performing surface semantic analysis on every two question sentences in the full-scale knowledge base, and identifying a first candidate confusing question sentence pair set; identifying a second set of candidate confusing question pairs based on the full-scale knowledge base by using a sentence vector model; fusing the first candidate confusion question pair set and the second candidate confusion question pair set to obtain a target candidate confusion question pair set; updating the full-scale knowledge base based on the set of target candidate confusing question pairs. The system for semantic confusion detection provided by the invention improves the quality of a knowledge base by adopting the method for semantic confusion detection, realizes data iteration by constructing a data closed loop, and further improves the precision of confusion detection.

Description

Method and system for semantic confusion detection

Technical Field

The invention relates to the technical field of natural language processing, in particular to a method and a system for semantic confusion detection.

Background

Each knowledge point in a dialogue platform's knowledge base corresponds to one intention class of questions. A knowledge point may contain edge questions that do not actually belong to it, and question pairs with similar semantics may exist between different knowledge points. For example, the knowledge point (intention class) "applying for the price guarantee service" may contain similar questions such as: how do I apply for the price guarantee / I just bought it and the price dropped, how do I apply for the price guarantee / how do I trade in an old machine for money / I bought it yesterday, how much cheaper is it today / can the robot I bought be recycled / etc. The question about trading in an old machine and the question about recycling the robot do not belong to the knowledge point (intention class) of applying for the price guarantee service, so these two questions are called edge questions. As another example, question 1: "Hello, is there a discount if I buy more?" (standard question of its knowledge point: how to buy in bulk), and question 2: "Is it cheaper if I buy more?" (standard question of its knowledge point: can the price be discounted). The two questions are semantically similar but belong to different knowledge points (intention classes), so question 1 and question 2 are called a confusing question pair. The existence of edge questions and confusing question pairs reduces the purity of the data under each knowledge point, and in turn reduces the accuracy of user intention recognition. Therefore, semantic confusion detection is an important method for improving the quality of an intelligent conversation platform, and is of great significance for constructing the data closed loop of a conversation platform.

At present, the main method of semantic confusion detection is to take all confusion classes as a new class, and further convert the semantic confusion detection into a classification problem to solve. However, this solution has two disadvantages:

First, existing solutions cannot adapt to irregular modifications of the knowledge base of an intelligent dialogue platform, and cannot form a data closed loop over the whole operation period of the platform. Specifically, the corpus configured on the platform is not fixed and is modified from time to time, whereas semantic confusion is defined relative to the corpus of a specific knowledge base; the two are in conflict. In addition, different dialogue systems are configured with different knowledge bases and therefore have different confusion classes, so no exact range can be given for the new category. Because the definition of the confusion category is uncertain, a data closed loop cannot be constructed.

Second, existing solutions cannot continuously iterate and optimize the model as the intention data in the knowledge base grows, and cannot perform real-time/instant confusion detection on the platform knowledge base. On the one hand, as the data volume of the dialogue platform keeps increasing, the knowledge base corpus becomes richer and richer, but a classification formulation is limited by the number of categories and cannot make full use of this large corpus to keep optimizing the model. On the other hand, the dialogue platform knowledge base changes constantly during the operation period of the semantic robot, while a solution that converts semantic confusion detection into a classification problem is limited in its online deployment and cannot be extended; that is, it cannot perform real-time/instant confusion detection on the platform knowledge base.

Disclosure of Invention

The invention aims to provide a method and a system for semantic confusion detection, which are used for purifying linguistic data of a platform knowledge base, improving the quality of the knowledge base, constructing a data closed loop by linking a data layer and a training layer, realizing data iteration and further improving the precision of the confusion detection.

In order to achieve the above purpose, the invention provides the following technical scheme:

a method for semantic confusion detection, comprising:

acquiring a full knowledge base of a conversation platform;

performing surface semantic analysis on every two question sentences in the full-scale knowledge base, and identifying a first candidate confusing question sentence pair set;

identifying a second set of candidate confusing question pairs based on the full-scale knowledge base by using a sentence vector model;

fusing the first candidate confusion question pair set and the second candidate confusion question pair set to obtain a target candidate confusion question pair set;

updating the full-scale knowledge base based on the set of target candidate confusing question pairs.

Preferably, the method for performing surface semantic analysis on every two question sentences in the full-scale knowledge base and identifying the first candidate confusing question pair set includes:

calculating semantic similarity between every two question sentences in the full-scale knowledge base by utilizing various surface layer semantic analysis methods based on corresponding surface layer semantic features, and obtaining a plurality of surface layer semantic confusion question sentence pair sets which are in one-to-one correspondence with the surface layer semantic analysis methods based on the semantic similarity;

removing question pairs belonging to the same knowledge point in the surface semantic confusion question pair set;

and screening out a first candidate confusion question pair set from all the surface layer semantic confusion question pair sets by using a voting mechanism.

Preferably, the surface semantic analysis method comprises one or more of a jaccard similarity algorithm, a word vector model method and a TF-IDF method.

Preferably, the method for identifying the second set of candidate confusing question pairs by using the sentence vector model based on the full-scale knowledge base comprises the following steps:

coding all question sentences by using a sentence vector model, constructing an index library, and simultaneously acquiring semantic representation vectors of all question sentences;

querying the index library, by using a distance function on the semantic representation vectors, for the K confusion question sentences corresponding to any detected question sentence, wherein K is greater than or equal to 0;

and pairing the detected question sentence with each of the K confusion question sentences that belongs to a different knowledge point from the detected question sentence, and storing the resulting confusion question pairs into a second candidate confusion question pair set.

In particular, the index library comprises a FAISS library and the distance function comprises a cosine distance function.

Preferably, an intersection taking mode or a union taking mode is selected according to user requirements, and the first candidate confusing question pair set and the second candidate confusing question pair set are fused to obtain a target candidate confusing question pair set.

Preferably, the method for semantic confusion detection further comprises detecting edge question sentences within a knowledge point, and the specific method comprises:

acquiring a central value of any knowledge point and the radius from all question sentences in the knowledge point to the central value;

storing all question sentences in the knowledge point whose distance to the central value is greater than the radius into an edge question sentence candidate set of the knowledge point;

calculating outlier factors of all questions in the edge question candidate set, and storing the questions with the outlier factors larger than a preset threshold value into the edge question set of the knowledge point;

and updating the full-scale knowledge base based on the edge question set of the knowledge points.

Further, the central value of any knowledge point is the average value of the feature codes of each question sentence in any knowledge point;

the radius from all the question sentences in the knowledge points to the central value is the average distance from all the question sentences in the knowledge points to the central value.

Preferably, the method for updating the full-scale knowledge base based on the target candidate confusing question pair set and/or the edge question set of knowledge points comprises the following steps:

storing the target candidate confusing question set and/or the edge question set of the knowledge points in a database, and displaying the target candidate confusing question set and/or the edge question set on a front-end page for a user to check;

and moving each question sentence that the audit determines to be classified under the wrong knowledge point to the correct knowledge point, or deleting it from the full-scale knowledge base, so as to update the full-scale knowledge base.

A system for semantic confusion detection comprises a data acquisition module, a first confusion detection module, a second confusion detection module, a fusion module and a data feedback module, wherein,

the data acquisition module is used for acquiring a full-scale knowledge base of the conversation platform;

the first confusion detection module identifies a first candidate confusion question pair set based on the surface semantic features of each question in the full-scale knowledge base;

the second confusion detection module identifies a second set of candidate confusing question pairs using a sentence vector model based on the full-scale knowledge base;

the fusion module is used for fusing the first candidate confusion question pair set and the second candidate confusion question pair set to obtain a target candidate confusion question pair set;

the data feedback module updates the full-scale knowledge base based on the set of target candidate confusing question pairs.

Compared with the prior art, the method and the system for semantic confusion detection provided by the invention have the following beneficial effects:

In the method for semantic confusion detection, a first candidate confusing question pair set and a second candidate confusing question pair set are first identified by a surface semantic analysis method and a sentence vector model analysis method, respectively; the two sets are then fused to obtain a target candidate confusing question pair set; and finally the full-scale knowledge base is updated based on the target candidate confusing question pair set. The updated knowledge base can be used to train the confusion detection system, so that the data layer and the training layer are linked, a data closed loop is constructed, data iteration is realized, and the precision of confusion detection is further improved.

The system for semantic confusion detection provided by the invention improves the quality of a knowledge base by adopting the method for semantic confusion detection, realizes data iteration by constructing a data closed loop, and further improves the precision of confusion detection.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow chart illustrating a method for semantic confusion detection according to an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating a process of obtaining a set of target candidate confusing question pairs according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a process of acquiring an edge question set of knowledge points in the embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

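Referring to FIG. 1, the present embodiment provides a method for semantic confusion detection, including:

acquiring a full-scale knowledge base of a conversation platform;

performing surface semantic analysis on every two question sentences in the full-scale knowledge base, and identifying a first candidate confusing question pair set;

identifying a second candidate confusing question pair set by using a sentence vector model based on the full-scale knowledge base;

fusing the first candidate confusing question pair set and the second candidate confusing question pair set to obtain a target candidate confusing question pair set;

updating the full-scale knowledge base based on the target candidate confusing question pair set.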

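In actual use, after the full-scale knowledge base of the conversation platform is obtained, text processing such as stop-word removal can be performed first, and the confusion detection operation is performed afterwards. Referring to FIG. 2, the method for performing surface semantic analysis on every two question sentences in the full-scale knowledge base to identify the first candidate confusing question pair set includes:

calculating semantic similarity between every two question sentences in the full-scale knowledge base by using a plurality of surface semantic analysis methods based on the corresponding surface semantic features, and obtaining, based on the semantic similarity, a plurality of surface semantic confusing question pair sets in one-to-one correspondence with the surface semantic analysis methods;

removing question pairs belonging to the same knowledge point from the surface semantic confusing question pair sets;

and screening out the first candidate confusing question pair set from all the surface semantic confusing question pair sets by using a voting mechanism.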

In specific implementation, the surface semantic analysis method comprises one or more of a Jaccard similarity algorithm, a word vector model method and a TF-IDF method. The Jaccard similarity method represents the semantic information of a question by marking, with 0 or 1, whether each word of the vocabulary occurs in the question, and takes the Jaccard similarity of two such representations as the semantic similarity. The word vector model method represents the semantic information of a question by concatenating the max-pooled and the average-pooled word vectors of the question, and then calculates the semantic similarity between questions with a distance function such as the cosine distance. The TF-IDF method uses a bag-of-words model to calculate the TF-IDF value of each word in a question, represents the semantic features of the question with these values, and uses a distance function to calculate the semantic similarity of two questions.
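The three surface measures described above can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the function names, the smoothed IDF formula, and the assumption that questions arrive already tokenized and that `word_vectors` is a plain dictionary of pre-trained vectors are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two questions seen as sets of words (0/1 presence)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def tfidf_vectors(corpus):
    """Bag-of-words TF-IDF vectors over a shared vocabulary.

    The smoothed IDF used here is an assumed variant, not taken from the source.
    """
    vocab = sorted({t for doc in corpus for t in doc})
    n_docs = len(corpus)
    doc_freq = Counter(t for doc in corpus for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        vectors.append([tf[t] / len(doc) * idf[t] if doc else 0.0 for t in vocab])
    return vectors


def pooled_sentence_vector(tokens, word_vectors, dim):
    """Concatenate the max-pooled and mean-pooled word vectors of a question."""
    rows = [word_vectors[t] for t in tokens if t in word_vectors]
    if not rows:
        return [0.0] * (2 * dim)
    max_pool = [max(col) for col in zip(*rows)]
    mean_pool = [sum(col) / len(rows) for col in zip(*rows)]
    return max_pool + mean_pool
```

Each measure returns a similarity that can be compared against its own preset threshold; the pooled vectors and the TF-IDF vectors are both compared with `cosine`.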

For questions with different intentions, surface semantic analysis methods such as the Jaccard similarity method, the word vector model method and the TF-IDF method are used to represent the surface semantic features of each question from several aspects, and the semantic similarity between every two questions is calculated. Confusing question pairs are then identified by comparing the semantic similarity with a preset similarity threshold, which yields a plurality of surface semantic confusing question pair sets in one-to-one correspondence with the surface semantic analysis methods. Next, a voting mechanism screens out, from all the surface semantic confusing question pair sets, the question pairs that are contained in a majority of the sets, and these pairs form the first candidate confusing question pair set. Taking three surface semantic confusing question pair sets as an example, if two of the sets both contain the question pairs A1 and A2, then A1 and A2 are stored in the first candidate confusing question pair set.
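The thresholding, same-knowledge-point removal and voting steps can be sketched as below. The data layout (a list of `(question_id, knowledge_point, tokens)` tuples), the function names, and the strict-majority rule are assumptions made for illustration; the source only states that pairs contained in most of the sets are kept.

```python
from collections import Counter
from itertools import combinations


def confusing_pairs(questions, similarity, threshold):
    """Pairs from different knowledge points whose similarity reaches the threshold.

    questions: list of (question_id, knowledge_point, tokens) tuples.
    similarity: any function mapping two token lists to a score.
    """
    pairs = set()
    for (id_a, kp_a, tok_a), (id_b, kp_b, tok_b) in combinations(questions, 2):
        if kp_a == kp_b:
            continue  # pairs inside one knowledge point are not confusing pairs
        if similarity(tok_a, tok_b) >= threshold:
            pairs.add(frozenset((id_a, id_b)))
    return pairs


def majority_vote(pair_sets):
    """Keep the pairs that appear in more than half of the given pair sets."""
    votes = Counter(pair for pair_set in pair_sets for pair in pair_set)
    needed = len(pair_sets) // 2 + 1
    return {pair for pair, count in votes.items() if count >= needed}
```

With three analysis methods, `majority_vote` keeps a pair when at least two of the three sets contain it, which matches the A1/A2 example above.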

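In the method for semantic confusion detection provided in this embodiment, the method for identifying a second candidate confusing question pair set by using a sentence vector model based on the full-scale knowledge base includes:

encoding all question sentences by using the sentence vector model, constructing an index library, and at the same time acquiring the semantic representation vectors of all question sentences;

querying the index library, by using a distance function on the semantic representation vectors, to obtain the K confusion question sentences corresponding to any detected question sentence, wherein K is greater than or equal to 0 and is set by the user according to requirements;

and pairing the detected question sentence with each of the K confusion question sentences that belongs to a different knowledge point from the detected question sentence, and storing the resulting confusion question pairs into the second candidate confusing question pair set.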

The index library comprises a FAISS library, and the distance function comprises a cosine distance function, a Euclidean distance function and the like. Faiss is a similarity search library developed by Facebook AI Research. It is a high-performance library for similarity search and clustering of dense vectors, supports search over billions of vectors, and is currently among the most mature approximate nearest neighbor search libraries. It contains a number of algorithms that search sets of vectors of arbitrary size, together with supporting code for algorithm evaluation and parameter tuning. In addition, Faiss is written in C++ and provides a Python interface that integrates with NumPy, which makes it convenient to call.

In specific implementation, the confusing questions of every question in the full-scale knowledge base may be identified one by one; alternatively, a list of questions to be queried (for example, the questions newly added to the knowledge base) may be input, and the confusing questions of each question in the list are identified one by one. In either case, a confusing question set is obtained for each detected question. The questions in that set that belong to the same knowledge point as the detected question are then deleted, and the detected question forms a confusing question pair with each of the remaining confusing questions.
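The query-and-filter step can be sketched as follows. To keep the example self-contained, a brute-force cosine search in plain Python stands in for the index library; an actual deployment would delegate the top-K search to an index library such as FAISS. The function names and the use of list positions as question identifiers are assumptions made for illustration.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so that the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def top_k_similar(query_idx, unit_vectors, k):
    """Indices of the k questions most similar to the query, excluding the query itself."""
    query = unit_vectors[query_idx]
    scored = [
        (sum(a * b for a, b in zip(query, other)), idx)
        for idx, other in enumerate(unit_vectors)
        if idx != query_idx
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))  # most similar first
    return [idx for _, idx in scored[:k]]


def second_candidate_pairs(vectors, knowledge_points, k, query_indices=None):
    """Pair each detected question with its top-k neighbours from other knowledge points.

    query_indices: optional list of questions to check (e.g. newly added ones);
    when omitted, every question in the knowledge base is checked.
    """
    unit_vectors = [normalize(v) for v in vectors]
    if query_indices is None:
        query_indices = range(len(vectors))
    pairs = set()
    for query_idx in query_indices:
        for neighbour in top_k_similar(query_idx, unit_vectors, k):
            if knowledge_points[neighbour] != knowledge_points[query_idx]:
                pairs.add(frozenset((query_idx, neighbour)))
    return pairs
```

Passing `query_indices` corresponds to checking only a list of newly added questions, while omitting it corresponds to scanning the full-scale knowledge base.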

Further, an intersection mode or a union mode is selected according to user requirements, and the first candidate confusing question pair set and the second candidate confusing question pair set are fused to obtain the target candidate confusing question pair set. For example, if the user wants the target candidate confusing question pair set to have higher precision, the intersection mode is selected to fuse the two sets; if the user wants it to have higher recall, the union mode is selected.
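The fusion step amounts to a set intersection or union. A minimal sketch, assuming the two candidate sets hold hashable pair objects (here `frozenset`s of question identifiers) and using a hypothetical `mode` parameter:

```python
def fuse(first_candidates, second_candidates, mode="intersection"):
    """Fuse two candidate pair sets.

    "intersection" favours precision; "union" favours recall.
    """
    if mode == "intersection":
        return first_candidates & second_candidates
    if mode == "union":
        return first_candidates | second_candidates
    raise ValueError(f"unknown fusion mode: {mode!r}")
```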

In addition, referring to FIG. 3, the method for semantic confusion detection according to the embodiment of the present invention further includes detecting edge question sentences within a knowledge point, and the specific method includes:

acquiring a central value of any knowledge point and the radius from all question sentences in the knowledge point to the central value;

storing all question sentences in the knowledge point whose distance to the central value is greater than the radius into the edge question sentence candidate set of the knowledge point;

calculating the outlier factors of all the questions in the edge question candidate set, and storing the questions with the outlier factors larger than a preset threshold value into the edge question set of the knowledge point;

and updating the full-scale knowledge base based on the edge question set of the knowledge point.

The central value of any knowledge point is the average value of the feature codes of each question sentence in any knowledge point; the radius from all the question sentences in the knowledge points to the central value is the average distance from all the question sentences in the knowledge points to the central value.
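The central value, the radius and the resulting candidate set can be sketched as below. The Euclidean distance is an assumption made for the example, since this passage does not fix the distance function, and the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def centroid(vectors):
    """Central value: the component-wise mean of the question encodings."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def edge_candidates(vectors, distance=euclidean):
    """Indices of questions lying farther from the centre than the average distance."""
    centre = centroid(vectors)
    distances = [distance(v, centre) for v in vectors]
    radius = sum(distances) / len(distances)  # radius: mean distance to the centre
    return [idx for idx, d in enumerate(distances) if d > radius]
```

The candidates returned here are only a pre-filter; the outlier factor is then computed for these candidates alone.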

In the specific implementation process, following proximity-based outlier detection, the k-nearest-neighbor distance of a question represents the outlier factor of the question. The outlier factor of any question is calculated by using the KNN (k-nearest neighbor) algorithm and an outlier algorithm, and the specific method comprises the following steps:

acquiring the full-scale knowledge base, and semantically encoding each question with the sentence vector model to obtain the semantic representation vector of the question;

for each question in a knowledge point, calculating the distance between the question and the other questions by using a distance function such as the cosine distance, and taking the maximum distance among the k nearest neighbors of the question as the outlier factor of the question.

The sentence vector model is trained on the original corpus knowledge base of the dialogue platform and is deployed independently on the platform, so that it is provided as a service. If the number of samples in a knowledge point is less than k, the maximum distance between the detected question and the other questions in that knowledge point is taken as the outlier factor; k is an adjustable parameter.
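The k-nearest-neighbor outlier factor, including the fallback for knowledge points with fewer than k other samples, can be sketched as follows. The function names are hypothetical, the vectors are assumed to be the sentence-model encodings of the questions of a single knowledge point, and returning 0.0 for a knowledge point with a single question is an assumption not covered by the source.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm if norm else 1.0


def knn_outlier_factor(idx, vectors, k, distance=cosine_distance):
    """Largest distance among the k nearest neighbours of question idx.

    If fewer than k other questions exist, the largest distance to any
    other question in the knowledge point is used instead.
    """
    distances = sorted(
        distance(vectors[idx], other)
        for other_idx, other in enumerate(vectors)
        if other_idx != idx
    )
    if not distances:
        return 0.0  # assumed convention for a single-question knowledge point
    return distances[min(k, len(distances)) - 1]


def edge_questions(candidate_indices, vectors, k, threshold):
    """Candidates whose outlier factor exceeds the preset threshold."""
    return [
        idx for idx in candidate_indices
        if knn_outlier_factor(idx, vectors, k) > threshold
    ]
```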

In addition, the local reachability density of a question can also be used as its outlier factor, and the user can choose which outlier factor to use.

Finally, the full-scale knowledge base is updated based on the target candidate confusing question pair set and/or the edge question set of the knowledge points, and the specific method comprises the following steps:

storing the target candidate confusing question pair set and/or the edge question set of the knowledge points in a database, and displaying them on a front-end page for a user to audit;

and moving each question sentence that the audit determines to be classified under the wrong knowledge point to the correct knowledge point, or deleting it from the full-scale knowledge base, so as to update the full-scale knowledge base.

In specific implementation, the target candidate confusing question pair set and/or the edge question set of the knowledge points are stored in a database such as MySQL and displayed on the front-end page. Operators or AI trainers review the edge questions and the confusing question pairs in these sets and decide whether to remove a question or to move it to the correct intention. This reduces the operators' cost of maintaining the knowledge base, assists them in maintaining the conversation robot platform, and improves conversation efficiency.
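Applying the audit results to the knowledge base can be sketched as below. The in-memory dictionary, the `("move", question, target)` / `("delete", question)` / `("keep", question)` decision tuples and the function name are all hypothetical; an actual system would read the decisions from the database behind the front-end page.

```python
def apply_audit(knowledge_base, decisions):
    """Return an updated copy of the knowledge base after manual audit.

    knowledge_base: dict mapping each question to its knowledge point.
    decisions: iterable of ("move", question, target_knowledge_point),
               ("delete", question) or ("keep", question) tuples.
    """
    updated = dict(knowledge_base)
    for decision in decisions:
        action, question = decision[0], decision[1]
        if action == "move":
            updated[question] = decision[2]  # reassign to the correct knowledge point
        elif action == "delete":
            updated.pop(question, None)  # drop the question from the knowledge base
        elif action != "keep":
            raise ValueError(f"unknown audit action: {action!r}")
    return updated
```

The updated dictionary is what would feed the next round of detection and retraining in the data closed loop.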

In addition, the semantic confusion detection process described above is continuously iterated. The system for semantic confusion detection provides the target candidate confusing question pair set and/or the edge question set of the knowledge points; the results are manually audited and fed back; the knowledge base is corrected; and the system for semantic confusion detection and the sentence vector model it uses are retrained on the corrected knowledge base. Continuous iteration of these steps forms positive feedback of data and thus a data closed loop, which further improves the accuracy of semantic confusion detection.

The method for semantic confusion detection provided by the embodiment of the invention is mainly divided into two subtasks, namely detection of edge question sentences within knowledge points and detection of confusing question sentences between knowledge points. The two subtasks perform confusion detection on the knowledge point data in the knowledge base from two different angles: detecting, inside a knowledge point (i.e., inside a certain intention class), edge questions (i.e., outliers) that do not belong to that knowledge point; and identifying confusing question pairs between knowledge points (i.e., question pairs with similar semantics across two or more knowledge points).

The detection of edge questions within a knowledge point finds the questions that do not belong to that knowledge point. For example, the knowledge point (intention class) "applying for the price guarantee service" may contain similar questions such as: how do I apply for the price guarantee / I just bought it and the price dropped, how do I apply for the price guarantee / how do I trade in an old machine for money / I bought it yesterday, how much cheaper is it today / can the robot I bought be recycled / etc. The question about trading in an old machine and the question about recycling the robot do not belong to the knowledge point (intention class) of applying for the price guarantee service, so these two questions are called edge questions and need to be found. Edge questions that have been confirmed by the audit are moved to the correct knowledge point or deleted, so as to improve the accuracy of the knowledge base.

The detection of confusing questions between knowledge points finds pairs of semantically similar questions that belong to different knowledge points. For example, question 1: "Hello, is there a discount if I buy more?" (standard question of its knowledge point: how to buy in bulk), and question 2: "Is it cheaper if I buy more?" (standard question of its knowledge point: can the price be discounted). The two questions belong to different knowledge points (intention classes), so question 1 and question 2 are called a confusing question pair. Confusing question pairs that have been confirmed by the audit are moved to the correct knowledge points or deleted, so as to improve the accuracy of the knowledge base.

The updated knowledge base can be used for training the confusion detection system, a data layer and a training layer are linked, a data closed loop is constructed, data iteration is realized, and the precision of confusion detection is further improved.

Example two

The embodiment of the invention provides a system for semantic confusion detection, which comprises a data acquisition module, a first confusion detection module, a second confusion detection module, a fusion module and a data feedback module, wherein the data acquisition module is used for acquiring a full-scale knowledge base of a conversation platform; the first confusion detection module identifies a first candidate confusing question pair set based on the surface semantic features of each question in the full-scale knowledge base; the second confusion detection module identifies a second candidate confusing question pair set by using a sentence vector model based on the full-scale knowledge base; the fusion module is used for fusing the first candidate confusing question pair set and the second candidate confusing question pair set to obtain a target candidate confusing question pair set; and the data feedback module updates the full-scale knowledge base based on the target candidate confusing question pair set.

The system for semantic confusion detection provided by the embodiment of the invention makes the sentence vector model service and the index library service independent modules, which, together with the data acquisition module, a data processing module, the confusion detection module, the fusion module and the data feedback module of the web end, form the framework of the whole confusion detection processing system. The data processing module performs text processing such as stop-word removal. The confusion detection module is divided into edge question detection within knowledge points and confusing question pair detection between knowledge points; the two processes are independent of and decoupled from each other, so that either can assist operators in operating and maintaining the knowledge base in actual application. In addition, the system for semantic confusion detection provided by the embodiment of the invention serves as one link in the data closed loop of a semantic dialogue robot platform and connects the data layer and the training layer. The corpus of the platform knowledge base is thereby updated and purified and the quality of the knowledge base is improved; the updated knowledge base can in turn be used to train the confusion detection system, so that a data closed loop is constructed, data iteration is realized, and the accuracy of confusion detection is further improved.

The system for semantic confusion detection provided by the invention adopts the method for semantic confusion detection in the first embodiment, improves the quality of the knowledge base, realizes data iteration by constructing a data closed loop, and further improves the precision of the confusion detection. Compared with the prior art, the beneficial effects of the system for semantic confusion detection provided by the embodiment of the present invention are the same as the beneficial effects of the method for semantic confusion detection provided by the first embodiment, and other technical features in the system are the same as those disclosed in the method of the previous embodiment, which are not repeated herein.

In the foregoing description of embodiments, the particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A method for semantic confusion detection, comprising:

acquiring a full knowledge base of a conversation platform;

2. The method for semantic confusion detection according to claim 1, wherein performing surface semantic analysis between two question pairs in the full-scale knowledge base, and identifying the first candidate set of confusing question pairs comprises:

3. The method for semantic confusion detection according to claim 2, wherein the surface layer semantic analysis method comprises one or more of a jaccard similarity algorithm, a word vector model method, and a TF-IDF method.

4. The method for semantic confusion detection as in claim 1, wherein identifying a second set of candidate confusing question pairs using a sentence vector model based on the full-scale knowledge base comprises:

5. The method for semantic confusion detection as claimed in claim 4, wherein the index library comprises a FAISS library and the distance function comprises a cosine distance function.

6. The method for semantic confusion detection according to claim 1, wherein an intersection manner or a union manner is selected according to user requirements, and the first candidate set of confusing question pairs and the second candidate set of confusing question pairs are fused to obtain a target set of confusing question pairs.

7. The method for semantic confusion detection according to any of claims 1-6, further comprising detection of edge question sentences within knowledge points, the specific method comprising:

8. The method for semantic confusion detection according to claim 7, wherein the central value of any knowledge point is an average value of feature codes of each question sentence in any knowledge point;

9. The method for semantic confusion detection according to claim 7, wherein the method of updating the full-scale knowledge base based on the set of target candidate confusing question pairs and/or the set of edge question pairs of knowledge points comprises:

10. A system for semantic confusion detection, which is characterized by comprising a data acquisition module, a first confusion detection module, a second confusion detection module, a fusion module and a data feedback module, wherein,