CN112699226A - Method and system for semantic confusion detection - Google Patents


Info

Publication number
CN112699226A
Authority
CN
China
Prior art keywords
question
confusion
semantic
candidate
full
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011586654.9A
Other languages
Chinese (zh)
Inventor
汪燕燕
陈述
沈艺
张兵兵
钟涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Suning Cloud Computing Co ltd
Original Assignee
Jiangsu Suning Cloud Computing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Suning Cloud Computing Co ltd filed Critical Jiangsu Suning Cloud Computing Co ltd
Priority to CN202011586654.9A priority Critical patent/CN112699226A/en
Publication of CN112699226A publication Critical patent/CN112699226A/en
Priority to CA3144128A priority patent/CA3144128A1/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for semantic confusion detection. The method comprises: acquiring a full-scale knowledge base of a conversation platform; performing surface semantic analysis on every two questions in the full-scale knowledge base to identify a first candidate confusing question pair set; identifying a second candidate confusing question pair set from the full-scale knowledge base by using a sentence vector model; fusing the first candidate confusing question pair set and the second candidate confusing question pair set to obtain a target candidate confusing question pair set; and updating the full-scale knowledge base based on the target candidate confusing question pair set. The system for semantic confusion detection applies this method to improve the quality of the knowledge base, realizes data iteration by constructing a data closed loop, and thereby further improves the precision of confusion detection.

Description

Method and system for semantic confusion detection
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method and a system for semantic confusion detection.
Background
Each knowledge point in the knowledge base of a dialogue platform corresponds to one intention class of questions. A knowledge point may contain edge questions that do not belong to it, and question pairs with similar semantics may exist between different knowledge points. For example, the following similar questions may appear under the knowledge point (intention class) of applying for the price guarantee service: how do I apply for a price guarantee / the price dropped right after I bought it, how do I apply for a price guarantee / how do I trade in an old machine / what I bought yesterday is cheaper today / can the robot I bought be recycled / etc. The questions about trading in an old machine and recycling a robot do not belong to the knowledge point (intention class) of applying for the price guarantee service, so these two questions are called edge questions. As another example, question 1: hello, is there a discount if I buy more (standard question of its knowledge point: how to buy in bulk), and question 2: is it cheaper if I buy more (standard question of its knowledge point: can the price be discounted), are semantically similar but belong to different knowledge points (intention classes); question 1 and question 2 are therefore called a confusing question pair. The existence of edge questions and confusing question pairs lowers the purity of the data under each knowledge point and in turn reduces the accuracy of user intention recognition. Semantic confusion detection is therefore an important means of improving the quality of an intelligent conversation platform and is significant for constructing the data closed loop of a conversation platform.
At present, the main method of semantic confusion detection is to treat all confusion classes as a new class, thereby converting semantic confusion detection into a classification problem. However, this solution has two disadvantages:
First, the existing solutions cannot adapt to irregular modification of the knowledge base of an intelligent dialogue platform, nor do they form a data closed loop over the whole operating period of the platform. Specifically, the corpus configured on the platform is not fixed and can be modified from time to time, whereas semantic confusion is defined relative to the corpus of a specific knowledge base; the two are in conflict. In addition, different dialogue systems are configured with different knowledge bases, and the corresponding confusion classes also differ, so an exact range for the new class cannot be given. Because the definition of the confusion class is uncertain, a data closed loop cannot be constructed.
Second, the existing solutions cannot continuously iterate and optimize the model as the intention data of the knowledge base grows, and cannot perform real-time/instant confusion detection on the platform knowledge base. On the one hand, as the data volume of the dialogue platform keeps increasing, the knowledge base corpus becomes richer, and the existing solutions have no good way of using this data to keep optimizing the model: a classification formulation is limited by the number of classes and cannot make full use of a large corpus. On the other hand, the knowledge base of the dialogue platform changes constantly during the operating period of the semantic robot, and a solution that converts semantic confusion detection into a classification problem is limited in online deployment and cannot be extended; that is, it cannot perform real-time/instant confusion detection on the platform knowledge base.
Disclosure of Invention
The invention aims to provide a method and a system for semantic confusion detection that clean the corpus of the platform knowledge base and improve the quality of the knowledge base, link the data layer and the training layer to construct a data closed loop, realize data iteration, and thereby further improve the precision of confusion detection.
In order to achieve the above purpose, the invention provides the following technical scheme:
a method for semantic confusion detection, comprising:
acquiring a full knowledge base of a conversation platform;
performing surface semantic analysis on every two question sentences in the full-scale knowledge base, and identifying a first candidate confusing question sentence pair set;
identifying a second set of candidate confusing question pairs based on the full-scale knowledge base by using a sentence vector model;
fusing the first candidate confusion question pair set and the second candidate confusion question pair set to obtain a target candidate confusion question pair set;
updating the full-scale knowledge base based on the set of target candidate confusing question pairs.
Preferably, the method for performing surface semantic analysis on every two questions in the full-scale knowledge base and identifying the first candidate confusing question pair set includes:
calculating semantic similarity between every two question sentences in the full-scale knowledge base by utilizing various surface layer semantic analysis methods based on corresponding surface layer semantic features, and obtaining a plurality of surface layer semantic confusion question sentence pair sets which are in one-to-one correspondence with the surface layer semantic analysis methods based on the semantic similarity;
removing question pairs belonging to the same knowledge point in the surface semantic confusion question pair set;
and screening out a first candidate confusion question pair set from all the surface layer semantic confusion question pair sets by using a voting mechanism.
Preferably, the surface semantic analysis method comprises one or more of a jaccard similarity algorithm, a word vector model method and a TF-IDF method.
Preferably, the method for identifying the second set of candidate confusing question pairs by using the sentence vector model based on the full-scale knowledge base comprises the following steps:
coding all question sentences by using a sentence vector model, constructing an index library, and simultaneously acquiring semantic representation vectors of all question sentences;
inquiring K confusion question sentences corresponding to any detected question sentence from the index database by using a distance function based on the semantic representation vector, wherein K is more than or equal to 0;
and forming confusion question pairs by the confusion question sentences which belong to different knowledge points with the detected question sentences in the K confusion question sentences and the detected question sentences respectively, and storing the confusion question pairs into a second candidate confusion question pair set.
In particular, the index library comprises a FAISS library and the distance function comprises a cosine distance function.
Preferably, an intersection taking mode or a union taking mode is selected according to user requirements, and the first candidate confusing question pair set and the second candidate confusing question pair set are fused to obtain a target candidate confusing question pair set.
Preferably, the method for semantic confusion detection further comprises detecting edge questions within a knowledge point, and the specific method comprises:
acquiring a central value of any knowledge point and the radius from all question sentences in the knowledge point to the central value;
storing all question sentences in the knowledge point whose distance to the central value is greater than the radius into an edge question sentence candidate set of the knowledge point;
calculating outlier factors of all questions in the edge question candidate set, and storing the questions with the outlier factors larger than a preset threshold value into the edge question set of the knowledge point;
and updating the full-scale knowledge base based on the edge question set of the knowledge points.
Further, the central value of any knowledge point is the average value of the feature codes of each question sentence in any knowledge point;
the radius from all the question sentences in the knowledge points to the central value is the average distance from all the question sentences in the knowledge points to the central value.
Preferably, the method for updating the full-scale knowledge base based on the target candidate confusing question pair set and/or the edge question set of knowledge points comprises the following steps:
storing the target candidate confusing question pair set and/or the edge question set of the knowledge points in a database, and displaying them on a front-end page for a user to review;
and moving each question that the review judges to be classified under the wrong knowledge point to the correct knowledge point, or deleting it from the full-scale knowledge base, so as to update the full-scale knowledge base.
A system for semantic confusion detection comprises a data acquisition module, a first confusion detection module, a second confusion detection module, a fusion module and a data feedback module, wherein,
the data acquisition module is used for acquiring a full-scale knowledge base of the conversation platform;
the first confusion detection module identifies a first candidate confusion question pair set based on the surface semantic features of each question in the full-scale knowledge base;
the second confusion detection module identifies a second set of candidate confusing question pairs using a sentence vector model based on the full-scale knowledge base;
the fusion module is used for fusing the first candidate confusion question pair set and the second candidate confusion question pair set to obtain a target candidate confusion question pair set;
the data feedback module updates the full-scale knowledge base based on the set of target candidate confusing question pairs.
Compared with the prior art, the method and the system for semantic confusion detection provided by the invention have the following beneficial effects:
the method for semantic confusion detection first identifies a first candidate confusion question pair set and a second candidate confusion question pair set by using the surface layer semantic analysis methods and the sentence vector model respectively; then fuses the two sets to obtain a target candidate confusion question pair set; and finally updates the full-scale knowledge base based on the target candidate confusion question pair set. The updated knowledge base can be used for training the confusion detection system, so that the data layer and the training layer are linked, a data closed loop is constructed, data iteration is realized, and the precision of confusion detection is further improved.
The system for semantic confusion detection provided by the invention improves the quality of a knowledge base by adopting the method for semantic confusion detection, realizes data iteration by constructing a data closed loop, and further improves the precision of confusion detection.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart illustrating a method for semantic confusion detection according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a process of obtaining a set of target candidate confusing question pairs according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process of acquiring an edge question set of knowledge points in the embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1, the present embodiment provides a method for semantic confusion detection, including:
acquiring a full knowledge base of a conversation platform;
performing surface semantic analysis on every two question sentences in the full-scale knowledge base, and identifying a first candidate confusing question sentence pair set;
identifying a second candidate confusing question pair set by using a sentence vector model based on the full-scale knowledge base;
fusing the first candidate confusion question pair set and the second candidate confusion question pair set to obtain a target candidate confusion question pair set;
updating the full-scale knowledge base based on the target candidate confusing question pair set.
The method for semantic confusion detection first identifies a first candidate confusion question pair set and a second candidate confusion question pair set by using the surface layer semantic analysis methods and the sentence vector model respectively; then fuses the two sets to obtain a target candidate confusion question pair set; and finally updates the full-scale knowledge base based on the target candidate confusion question pair set. The updated knowledge base can be used for training the confusion detection system, so that the data layer and the training layer are linked, a data closed loop is constructed, data iteration is realized, and the precision of confusion detection is further improved.
In actual use, after the full-scale knowledge base of the conversation platform is obtained, text preprocessing such as stop-word removal can be performed before the confusion detection operation. Referring to FIG. 2, the method for performing surface semantic analysis on every two questions in the full-scale knowledge base to identify the first candidate confusing question pair set includes:
calculating semantic similarity between every two question sentences in the full-scale knowledge base by utilizing various surface layer semantic analysis methods based on corresponding surface layer semantic features, and obtaining a plurality of surface layer semantic confusion question sentence pair sets which are in one-to-one correspondence with the surface layer semantic analysis methods based on the semantic similarity;
removing question pairs belonging to the same knowledge point in the surface semantic confusion question pair set;
and screening out a first candidate confusion question pair set from all the surface layer semantic confusion question pair sets by using a voting mechanism.
In specific implementation, the surface semantic analysis methods comprise one or more of a Jaccard similarity algorithm, a word vector model method and a TF-IDF method. The Jaccard similarity method represents the semantic information of a question by marking with 0 or 1 whether each word of the vocabulary appears in it, and uses the Jaccard similarity as the semantic similarity. The word vector model method represents the semantic information of a question by concatenating the max-pooled and average-pooled word vectors, and then calculates the semantic similarity between questions with a distance function such as the cosine distance function. The TF-IDF method uses a bag-of-words model to calculate the TF-IDF value of each word in a question, represents the semantic features of the question with these values, and calculates the semantic similarity of two questions with a distance function.
For questions with different intentions, surface semantic analysis methods such as the Jaccard similarity method, the word vector model method and the TF-IDF method represent the surface semantic features of the questions from multiple aspects, and the semantic similarity between every two questions is calculated. Confusing question pairs are then identified by comparing the semantic similarity with a preset similarity threshold, which yields a plurality of surface semantic confusing question pair sets in one-to-one correspondence with the surface semantic analysis methods. Next, a voting mechanism screens out, from all the surface semantic confusing question pair sets, the pairs that are contained in a majority of the sets, and these form the first candidate confusing question pair set. Taking three surface semantic confusing question pair sets as an example, if two of the sets both contain the surface semantic confusing question pairs A1 and A2, then A1 and A2 are stored into the first candidate confusing question pair set.
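As an illustration of two of the surface measures described above, the following is a minimal sketch in plain Python. It assumes questions have already been tokenized into word lists; the function names and the smoothed IDF formula are illustrative assumptions, not details taken from the patent. The word vector model method is omitted because it needs pretrained word embeddings.

```python
import math
from collections import Counter


def jaccard(a_tokens, b_tokens):
    """Jaccard similarity of two questions, using binary word presence."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tfidf_vectors(corpus_tokens):
    """Return one sparse {word: tf-idf} dict per question (bag-of-words).

    The smoothed IDF used here is one common variant (an assumption).
    """
    n = len(corpus_tokens)
    # Document frequency: in how many questions each word occurs.
    df = Counter(w for toks in corpus_tokens for w in set(toks))
    vectors = []
    for toks in corpus_tokens:
        tf = Counter(toks)
        vectors.append({
            w: (c / len(toks)) * (math.log((1 + n) / (1 + df[w])) + 1.0)
            for w, c in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Each function returns a similarity in the range 0 to 1, so the same preset threshold logic can be applied to either measure.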
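The thresholding and voting steps can be sketched as follows. This is a minimal illustration under stated assumptions: questions are given as hypothetical `(question_id, knowledge_point, tokens)` tuples, a pair is flagged when its similarity is greater than or equal to the threshold, and "majority" is taken to mean more than half of the per-method sets. None of these specifics are fixed by the description.

```python
from collections import Counter
from itertools import combinations


def confusing_pairs(questions, similarity, threshold):
    """Pairs from different knowledge points whose similarity reaches the threshold.

    `questions` is a list of (question_id, knowledge_point, tokens) tuples;
    `similarity` is any function of two token lists returning a float.
    """
    pairs = set()
    for (id_a, kp_a, tok_a), (id_b, kp_b, tok_b) in combinations(questions, 2):
        if kp_a == kp_b:
            continue  # pairs inside one knowledge point are removed
        if similarity(tok_a, tok_b) >= threshold:
            pairs.add(frozenset((id_a, id_b)))
    return pairs


def majority_vote(pair_sets):
    """Keep the pairs that appear in more than half of the per-method sets."""
    votes = Counter(p for s in pair_sets for p in s)
    return {p for p, n in votes.items() if n > len(pair_sets) / 2}
```

With three per-method sets, a pair found by any two of them survives the vote, which matches the A1/A2 example above.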
In the method for semantic confusion detection provided in this embodiment, the method for identifying a second set of candidate confusing question pairs using a sentence vector model based on a full-scale knowledge base includes:
coding all question sentences by using a sentence vector model, constructing an index library, and simultaneously acquiring semantic representation vectors of all question sentences;
querying the index library with a distance function, based on the semantic representation vectors, to obtain K confusion question sentences corresponding to any detected question sentence, wherein K is more than or equal to 0 and is set by a user according to requirements;
and forming confusion question pairs by the confusion question sentences which belong to different knowledge points with the detected question sentences in the K confusion question sentences and the detected question sentences respectively, and storing the confusion question pairs into a second candidate confusion question pair set.
The index library comprises a FAISS library, and the distance function comprises a cosine distance function, a Euclidean distance function and the like. Faiss, a similarity search library developed by Facebook AI Research, is a high-performance library for similarity search and clustering of dense vectors; it supports search over billion-scale vector sets and is one of the most mature approximate nearest neighbor search libraries at present. It contains a number of algorithms that search vector sets of arbitrary size, together with supporting code for algorithm evaluation and parameter tuning. In addition, Faiss is written in C++ and provides a Python interface that integrates with NumPy, which makes it convenient to call.
In specific implementation, the confusing questions of every question in the full-scale knowledge base may be identified one by one; alternatively, a list of questions to be queried (for example, a list of questions newly added to the knowledge base) may be input, and the confusing questions identified one by one for the questions in the list. In either case a set of confusing questions is obtained for each queried question; the questions in that set that belong to the same knowledge point as the queried question are then deleted, and the queried question forms a confusing question pair with each of the remaining confusing questions.
Further, an intersection mode or a union mode is selected according to user requirements, and the first candidate confusing question pair set and the second candidate confusing question pair set are fused to obtain the target candidate confusing question pair set. For example, if the user wants the target candidate confusing question pair set to have higher precision, the intersection mode is selected to fuse the two sets; if the user wants it to have higher recall, the union mode is selected.
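The query step can be sketched as below. To keep the example self-contained, a brute-force cosine search in plain Python stands in for the index library; a real deployment would query an index such as Faiss instead. The function names and the dict-based inputs are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def top_k_confusing(query_id, vectors, knowledge_point, k):
    """Confusion pairs between a queried question and its K nearest neighbours.

    `vectors` maps question id -> sentence vector and `knowledge_point` maps
    question id -> knowledge point name. Neighbours that share the queried
    question's knowledge point are dropped. Brute-force cosine similarity
    replaces the index library here (an assumption made for illustration).
    """
    q = normalize(vectors[query_id])
    scored = []
    for other_id, vec in vectors.items():
        if other_id == query_id:
            continue
        sim = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((sim, other_id))
    scored.sort(reverse=True)  # most similar first
    neighbours = [other_id for _, other_id in scored[:k]]
    return [(query_id, n) for n in neighbours
            if knowledge_point[n] != knowledge_point[query_id]]
```

Passing only the newly added questions as `query_id` values corresponds to the incremental mode described above, while looping over every question corresponds to the full scan.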
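The fusion step reduces to a set operation. A minimal sketch, assuming each candidate set holds hashable pair objects such as `frozenset` pairs of question ids; the `mode` parameter name is illustrative.

```python
def fuse(first, second, mode="intersection"):
    """Fuse two candidate pair sets.

    "intersection" keeps only pairs found by both detectors (favours precision);
    "union" keeps pairs found by either detector (favours recall).
    """
    if mode == "intersection":
        return first & second
    if mode == "union":
        return first | second
    raise ValueError("mode must be 'intersection' or 'union'")
```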
In addition, referring to FIG. 3, the method for semantic confusion detection according to the embodiment of the present invention further includes detecting edge questions within a knowledge point, and the specific method includes:
acquiring a central value of any knowledge point and the radius from all question sentences in the knowledge point to the central value;
storing all question sentences in the knowledge point whose distance to the central value is greater than the radius into the edge question sentence candidate set of the knowledge point;
calculating the outlier factors of all the questions in the edge question candidate set, and storing the questions with the outlier factors larger than a preset threshold value into the edge question set of the knowledge point;
and updating the full-scale knowledge base based on the edge question set of the knowledge points.
The central value of any knowledge point is the average value of the feature codes of each question sentence in any knowledge point; the radius from all the question sentences in the knowledge points to the central value is the average distance from all the question sentences in the knowledge points to the central value.
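The central value, radius and candidate selection can be sketched as follows. The description does not say which distance is used for the radius, so Euclidean distance is assumed here; the function name and the dict input are hypothetical.

```python
import math


def edge_candidates(vectors):
    """Return (centre, radius, candidate_ids) for one knowledge point.

    `vectors` maps question id -> feature vector for the questions of a single
    knowledge point. The centre is the mean vector, the radius is the mean
    distance to the centre, and candidates lie farther away than the radius.
    Euclidean distance is an assumption made for this sketch.
    """
    if not vectors:
        return [], 0.0, []
    ids = list(vectors)
    dim = len(vectors[ids[0]])
    centre = [sum(vectors[i][d] for i in ids) / len(ids) for d in range(dim)]
    dist = {i: math.dist(vectors[i], centre) for i in ids}
    radius = sum(dist.values()) / len(ids)
    return centre, radius, [i for i in ids if dist[i] > radius]
```

Because the radius is an average, some questions typically exceed it; the outlier factor in the next step narrows this candidate set further.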
In the specific implementation process, outlier detection based on nearest neighbor distance is used: for a question, its k-nearest-neighbor distance represents the outlier factor of the question. The outlier factor of any question is calculated by using the KNN nearest neighbor algorithm and an outlier algorithm, and the specific method comprises the following steps:
acquiring a full knowledge base, and performing semantic coding on a question through a sentence vector model to acquire a semantic representation vector of the question;
for each question in a knowledge point, calculating the distance between the question and the other questions by using a distance function such as the cosine distance function, wherein the maximum distance among the k nearest neighbors of the question represents the outlier factor of the question.
The sentence vector model is obtained by training on the original corpus knowledge base of the dialogue platform, and is deployed independently on the conversation platform so that it is exposed as a service. If the sample size of a knowledge point is less than k, the maximum distance between the detected question and the other questions in that knowledge point is taken as the outlier factor; k is an adjustable parameter.
In addition, the local reachability density of a question can also be used as the outlier factor, and the user can choose which outlier factor to use.
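The k-nearest-neighbor outlier factor described above can be sketched as follows, assuming cosine distance and k of at least 1. The function names and the dict input are illustrative; the local reachability density variant is not shown.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv) if nu and nv else 1.0


def outlier_factor(query_id, vectors, k):
    """Distance from a question to its k-th nearest neighbour in the knowledge point.

    `vectors` maps question id -> sentence vector for one knowledge point.
    If fewer than k other questions exist, the largest available distance
    is used, as the description specifies. Assumes k >= 1.
    """
    dists = sorted(cosine_distance(vectors[query_id], v)
                   for i, v in vectors.items() if i != query_id)
    if not dists:
        return 0.0  # a lone question has no neighbours to compare with
    return dists[min(k, len(dists)) - 1]


def edge_questions(candidates, vectors, k, threshold):
    """Candidates whose outlier factor exceeds the preset threshold."""
    return [q for q in candidates if outlier_factor(q, vectors, k) > threshold]
```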
And finally, updating a full knowledge base based on the target candidate confusing question set and/or the edge question set of the knowledge points, wherein the specific method comprises the following steps:
storing the target candidate confusing question pair set and/or the edge question set of the knowledge points in a database, and displaying them on a front-end page for a user to review;
and moving each question that the review judges to be classified under the wrong knowledge point to the correct knowledge point, or deleting it from the full-scale knowledge base, so as to update the full-scale knowledge base.
In specific implementation, the target candidate confusing question pair set and/or the edge question set of the knowledge point are stored in a database such as MySQL and displayed on the front-end page. Operators or AI trainers review the edge questions and confusing question pairs in the sets and decide whether to remove a question or move it to the correct intention. This reduces the cost of maintaining the knowledge base, assists operators in maintaining the conversation robot platform, and improves conversation efficiency.
In addition, the above describes one pass of semantic confusion detection, and the process is continuously iterated. The system for semantic confusion detection provides the target candidate confusion question pair set and/or the edge question set of a knowledge point; the results are manually reviewed and fed back; the knowledge base is corrected; and the system for semantic confusion detection, the sentence vector model it uses and the like are retrained on the corrected knowledge base. Continuous iteration forms positive feedback of data and thus a data closed loop, which further improves the accuracy of semantic confusion detection.
The method for semantic confusion detection provided by the embodiment of the invention is mainly divided into two subtasks, namely detection of edge questions within knowledge points and detection of confusing question pairs between knowledge points. The two subtasks perform confusion detection on the knowledge point data in the knowledge base from two different angles: detecting, inside a knowledge point (i.e., inside a certain intention class), edge questions (i.e., outliers) that do not belong to it; and identifying confusing question pairs between knowledge points (i.e., question pairs with similar semantics that sit in two or more different knowledge points).
The detection of edge questions within a knowledge point finds the questions that do not belong to that knowledge point. For example, the following similar questions may appear under the knowledge point (intention class) of applying for the price guarantee service: how do I apply for a price guarantee / the price dropped right after I bought it, how do I apply for a price guarantee / how do I trade in an old machine / what I bought yesterday is cheaper today / can the robot I bought be recycled / etc. The questions about trading in an old machine and recycling a robot do not belong to the knowledge point (intention class) of applying for the price guarantee service; they are edge questions that need to be found. The edge questions confirmed by review are moved to the correct knowledge point or deleted, so as to improve the accuracy of the knowledge base.
The detection of confusing question pairs between knowledge points finds semantically similar questions that belong to different knowledge points. For example, question 1: hello, is there a discount if I buy more (standard question of its knowledge point: how to buy in bulk), and question 2: is it cheaper if I buy more (standard question of its knowledge point: can the price be discounted), belong to different knowledge points (intention classes); question 1 and question 2 are called a confusing question pair. The questions of the reviewed confusing question pairs are moved into the correct knowledge points or deleted, so as to improve the accuracy of the knowledge base.
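The feedback step can be sketched as a small function that applies review decisions to the knowledge base. The data shapes are assumptions for illustration: the knowledge base is a dict from knowledge point name to a set of questions, and each decision is a hypothetical `(question, current_kp, action, target_kp)` tuple whose action is `"keep"`, `"move"` or `"delete"`. Database storage and the front-end page are out of scope here.

```python
def apply_review(knowledge_base, decisions):
    """Return an updated copy of the knowledge base after manual review.

    "move" relocates a question to `target_kp`, "delete" removes it, and
    "keep" leaves it where it is. The input knowledge base is not modified.
    """
    updated = {kp: set(qs) for kp, qs in knowledge_base.items()}
    for question, current_kp, action, target_kp in decisions:
        if action not in ("keep", "move", "delete"):
            raise ValueError(f"unknown review action: {action!r}")
        if action == "keep":
            continue
        updated[current_kp].discard(question)
        if action == "move":
            updated.setdefault(target_kp, set()).add(question)
    return updated
```

Returning a copy keeps the pre-review knowledge base available, which may be convenient when the corrected version is later used for retraining.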
The updated knowledge base can be used for training the confusion detection system, a data layer and a training layer are linked, a data closed loop is constructed, data iteration is realized, and the precision of confusion detection is further improved.
Example two
The embodiment of the invention provides a system for semantic confusion detection, which comprises a data acquisition module, a first confusion detection module, a second confusion detection module, a fusion module and a data feedback module, wherein the data acquisition module is used for acquiring a full-scale knowledge base of a conversation platform; the first confusion detection module identifies a first candidate confusion question pair set based on the surface semantic features of each question in the full-scale knowledge base; the second confusion detection module identifies a second candidate confusion question pair set by using a sentence vector model based on the full-scale knowledge base; the fusion module is used for fusing the first candidate confusion question pair set and the second candidate confusion question pair set to obtain a target candidate confusion question pair set; and the data feedback module updates the full-scale knowledge base based on the target candidate confusion question pair set.
The system for semantic confusion detection provided by the embodiment of the invention makes the sentence vector model service and the index library service into independent modules which, together with the data acquisition module, a data processing module, the confusion detection module, the fusion module and the data feedback module of the web end, form the framework of the whole confusion detection processing system. The data processing module performs text processing such as stop-word removal. The confusion detection module is divided into edge question detection within knowledge points and confusing question pair detection between knowledge points; the two processes are independent and decoupled from each other, so that each can assist operators in maintaining the knowledge base in practical applications. In addition, the system serves as one link in the data closed loop of the semantic dialogue robot platform: it links the data layer and the training layer, and updates and cleans the corpus of the platform knowledge base to improve its quality. The updated knowledge base can in turn be used for training the confusion detection system, so that a data closed loop is constructed, data iteration is realized, and the accuracy of confusion detection is further improved.
The system for semantic confusion detection provided by the invention adopts the method for semantic confusion detection in the first embodiment, improves the quality of the knowledge base, realizes data iteration by constructing a data closed loop, and further improves the precision of the confusion detection. Compared with the prior art, the beneficial effects of the system for semantic confusion detection provided by the embodiment of the present invention are the same as the beneficial effects of the method for semantic confusion detection provided by the first embodiment, and other technical features in the system are the same as those disclosed in the method of the previous embodiment, which are not repeated herein.
In the foregoing description of embodiments, the particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples.
The above description covers only specific embodiments of the present invention, but the scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A method for semantic confusion detection, comprising:
acquiring a full-scale knowledge base of a conversation platform;
performing surface semantic analysis on every two question sentences in the full-scale knowledge base, and identifying a first candidate confusing question sentence pair set;
identifying a second set of candidate confusing question pairs based on the full-scale knowledge base by using a sentence vector model;
fusing the first candidate confusion question pair set and the second candidate confusion question pair set to obtain a target candidate confusion question pair set;
updating the full-scale knowledge base based on the set of target candidate confusing question pairs.
2. The method for semantic confusion detection according to claim 1, wherein performing surface semantic analysis on every two question sentences in the full-scale knowledge base and identifying the first candidate confusing question sentence pair set comprises:
calculating semantic similarity between every two question sentences in the full-scale knowledge base by utilizing various surface layer semantic analysis methods based on corresponding surface layer semantic features, and obtaining a plurality of surface layer semantic confusion question sentence pair sets which are in one-to-one correspondence with the surface layer semantic analysis methods based on the semantic similarity;
removing question pairs belonging to the same knowledge point in the surface semantic confusion question pair set;
and screening out a first candidate confusion question pair set from all the surface layer semantic confusion question pair sets by using a voting mechanism.
3. The method for semantic confusion detection according to claim 2, wherein the surface layer semantic analysis method comprises one or more of a jaccard similarity algorithm, a word vector model method, and a TF-IDF method.
4. The method for semantic confusion detection as in claim 1, wherein identifying a second set of candidate confusing question pairs using a sentence vector model based on the full-scale knowledge base comprises:
coding all question sentences by using a sentence vector model, constructing an index library, and simultaneously acquiring semantic representation vectors of all question sentences;
querying, from the index library, K confusion question sentences corresponding to any detected question sentence by using a distance function based on the semantic representation vectors, wherein K is greater than or equal to 0;
and, for each of the K confusion question sentences that belongs to a knowledge point different from that of the detected question sentence, forming a confusion question pair with the detected question sentence, and storing the confusion question pairs into the second candidate confusion question pair set.
5. The method for semantic confusion detection as claimed in claim 4, wherein the index library comprises a FAISS library and the distance function comprises a cosine distance function.
6. The method for semantic confusion detection according to claim 1, wherein an intersection manner or a union manner is selected according to user requirements, and the first candidate confusion question pair set and the second candidate confusion question pair set are fused in the selected manner to obtain the target candidate confusion question pair set.
7. The method for semantic confusion detection according to any of claims 1-6, further comprising detection of edge question sentences within knowledge points, the specific method comprising:
acquiring a central value of any knowledge point and the radius from all question sentences in the knowledge point to the central value;
storing all question sentences in the knowledge point whose distance to the central value is greater than the radius into an edge question sentence candidate set of the knowledge point;
calculating outlier factors of all questions in the edge question candidate set, and storing the questions with the outlier factors larger than a preset threshold value into the edge question set of the knowledge point;
and updating the full-scale knowledge base based on the edge question set of the knowledge points.
8. The method for semantic confusion detection according to claim 7, wherein the central value of any knowledge point is an average value of feature codes of each question sentence in any knowledge point;
the radius from all the question sentences in the knowledge points to the central value is the average distance from all the question sentences in the knowledge points to the central value.
9. The method for semantic confusion detection according to claim 7, wherein the method of updating the full-scale knowledge base based on the target candidate confusion question pair set and/or the edge question set of the knowledge points comprises:
storing the target candidate confusion question pair set and/or the edge question set of the knowledge points in a database, and displaying them on a front-end page for a user to audit;
and moving each question sentence whose audit result is a wrong knowledge point classification to the correct knowledge point, or deleting it from the full-scale knowledge base, so as to update the full-scale knowledge base.
10. A system for semantic confusion detection, which is characterized by comprising a data acquisition module, a first confusion detection module, a second confusion detection module, a fusion module and a data feedback module, wherein,
the data acquisition module is used for acquiring a full-scale knowledge base of the conversation platform;
the first confusion detection module identifies a first candidate confusion question pair set based on the surface semantic features of each question in the full-scale knowledge base;
the second confusion detection module identifies a second set of candidate confusing question pairs using a sentence vector model based on the full-scale knowledge base;
the fusion module is used for fusing the first candidate confusion question pair set and the second candidate confusion question pair set to obtain a target candidate confusion question pair set;
the data feedback module updates the full-scale knowledge base based on the set of target candidate confusing question pairs.
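The edge question detection of claims 7 and 8 can be sketched as follows. This is an illustration under assumptions, not the claimed implementation: the feature encodings in `ENCODINGS` are hypothetical two-dimensional vectors, Euclidean distance is assumed, and because the claims do not fix how the outlier factor is computed, the ratio of a question's distance to the radius is used here as a simple stand-in for it.

```python
import math

# Hypothetical feature encodings of the questions in ONE knowledge point.
ENCODINGS = {
    "q1": [1.0, 0.0],
    "q2": [-1.0, 0.0],
    "q3": [0.0, 1.0],
    "q4": [0.0, -1.0],
    "q5": [0.0, 5.0],
}

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def euclidean(u, v):
    """Euclidean distance (an assumed choice of distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def edge_questions(encodings, factor_threshold):
    """Return the edge questions of one knowledge point."""
    # Central value: average of the feature encodings (claim 8).
    centre = mean_vector(list(encodings.values()))
    dist = {q: euclidean(v, centre) for q, v in encodings.items()}
    # Radius: average distance of all questions to the central value (claim 8).
    radius = sum(dist.values()) / len(dist)
    # Candidates: questions farther from the centre than the radius (claim 7).
    candidates = [q for q, d in dist.items() if d > radius]
    # ASSUMPTION: distance / radius stands in for the outlier factor.
    return [q for q in candidates if dist[q] / radius > factor_threshold]
```

Calling `edge_questions(ENCODINGS, factor_threshold=1.5)` flags only the question that lies far from the rest of its knowledge point; such questions would then be shown to an operator for audit as described in claim 9.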
CN202011586654.9A 2020-12-29 2020-12-29 Method and system for semantic confusion detection Pending CN112699226A (en)

Priority Applications (2)

Application Number Publication Number Priority Date Filing Date Title
CN202011586654.9A CN112699226A (en) 2020-12-29 2020-12-29 Method and system for semantic confusion detection
CA3144128A CA3144128A1 (en) 2020-12-29 2021-12-29 Method and system for detecting semantic confusion

Publications (1)

Publication Number Publication Date
CN112699226A true CN112699226A (en) 2021-04-23

Family

ID=75511441

Country Status (2)

Country Link
CN (1) CN112699226A (en)
CA (1) CA3144128A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106707233A (en) * 2017-03-03 2017-05-24 广东工业大学 Multi-side positioning method and multi-side positioning device based on outlier detection
CN106776751A (en) * 2016-11-22 2017-05-31 上海智臻智能网络科技股份有限公司 The clustering method and clustering apparatus of a kind of data
CN109101579A (en) * 2018-07-19 2018-12-28 深圳追科技有限公司 customer service robot knowledge base ambiguity detection method
CN111190998A (en) * 2019-12-10 2020-05-22 上海八斗智能技术有限公司 Question-answering robot system based on hybrid model and question-answering robot
CN111581354A (en) * 2020-05-12 2020-08-25 金蝶软件(中国)有限公司 FAQ question similarity calculation method and system
CN112035598A (en) * 2020-11-03 2020-12-04 北京淇瑀信息科技有限公司 Intelligent semantic retrieval method and system and electronic equipment

Also Published As

Publication number Publication date
CA3144128A1 (en) 2022-06-29

Similar Documents

Publication Publication Date Title
CN107291783B (en) Semantic matching method and intelligent equipment
CN110727779A (en) Question-answering method and system based on multi-model fusion
CN109472033A (en) Entity relation extraction method and system in text, storage medium, electronic equipment
CN114186084B (en) Online multi-mode Hash retrieval method, system, storage medium and equipment
CN113159187B (en) Classification model training method and device and target text determining method and device
CN110222192A (en) Corpus method for building up and device
CN116484024A (en) Multi-level knowledge base construction method based on knowledge graph
CN111859953A (en) Training data mining method and device, electronic equipment and storage medium
CN111767376B (en) Question-answering system and method based on dynamic knowledge graph
CN112541070A (en) Method and device for excavating slot position updating corpus, electronic equipment and storage medium
CN113254507A (en) Intelligent construction and inventory method for data asset directory
CN114495143A (en) Text object identification method and device, electronic equipment and storage medium
CN115587207A (en) Deep hash retrieval method based on classification label
CN116340530A (en) Intelligent design method based on mechanical knowledge graph
CN111783464A (en) Electric power-oriented domain entity identification method, system and storage medium
CN117151222A (en) Domain knowledge guided emergency case entity attribute and relation extraction method thereof, electronic equipment and storage medium
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN115168537A (en) Training method and device of semantic retrieval model, electronic equipment and storage medium
CN111428502A (en) Named entity labeling method for military corpus
CN115797795B (en) Remote sensing image question-answer type retrieval system and method based on reinforcement learning
CN110705274A (en) Fusion type word meaning embedding method based on real-time learning
CN112699226A (en) Method and system for semantic confusion detection
CN116431746A (en) Address mapping method and device based on coding library, electronic equipment and storage medium
CN113807102B (en) Method, device, equipment and computer storage medium for establishing semantic representation model
CN115795060A (en) Entity alignment method based on knowledge enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination