CN117688189A - Knowledge graph, knowledge base and large language model fused question-answering system construction method - Google Patents

Knowledge graph, knowledge base and large language model fused question-answering system construction method

Info

Publication number
CN117688189A
CN117688189A (application CN202311821070.9A)
Authority
CN
China
Prior art keywords
question
model
knowledge
entity
knowledge graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311821070.9A
Other languages
Chinese (zh)
Other versions
CN117688189B (en)
Inventor
田茂春
李镇江
蓝日成
杨跃
甘郝新
范光伟
赵平
王清正
刘怡心
刘斌
张水平
赖杭
***
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Datengxia Water Control Project Development Co ltd
Pearl River Hydraulic Research Institute of PRWRC
Original Assignee
Guangxi Datengxia Water Control Project Development Co ltd
Pearl River Hydraulic Research Institute of PRWRC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Datengxia Water Control Project Development Co ltd and Pearl River Hydraulic Research Institute of PRWRC
Priority to CN202311821070.9A
Publication of CN117688189A
Application granted
Publication of CN117688189B
Active legal status
Anticipated expiration legal status


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for constructing a question-answering system that fuses a knowledge graph, a knowledge base and a large language model, belonging to the technical field of natural language processing, and provides a complete method for building such a system. Tailored to the data characteristics of the water conservancy industry, the question-answering system is customized along multiple dimensions, and a complete set of construction methods is provided, covering model selection, training strategies and dataset construction. A pipeline of natural language processing models forms a complete question-processing architecture, and all required datasets are constructed from the target knowledge graph without large amounts of manual labeling. The framework ensures the accuracy and comprehensiveness of the knowledge graph question answering while coupling the knowledge base, the knowledge graph and the large language model, so that their strengths complement one another and the user experience is improved.

Description

Knowledge graph, knowledge base and large language model fused question-answering system construction method
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a method for constructing a question-answering system integrating a knowledge graph, a knowledge base and a large language model.
Background
A question-answering system is an important means for humans to obtain information from large-scale data. A question-answering system based on natural language processing technology allows users to pose questions in an intuitive, natural way and thereby obtain the information they need. In general, knowledge sources can be divided into knowledge graphs, knowledge bases and general documents according to the storage medium. A knowledge base is a broad concept covering any form of knowledge storage, whereas a knowledge graph is a specific form of knowledge base that emphasizes semantic relations among entities and is easy to query and reason over. Different vertical domains maintain their own independent knowledge graphs and document libraries; building question-answering systems on top of them is of great significance for science popularization, knowledge learning and decision support, and helps users quickly retrieve knowledge sources and discover potential relations among different knowledge objects.
However, existing question-answering systems have the following problems. Knowledge graphs in the water conservancy industry are characterized by complex data structures and entities of varying length, and multi-hop queries under specific conditions account for a large proportion of usage scenarios, which places higher demands on entity extraction, entity linking and inference rule design. Furthermore, deep-learning natural language processing models rely heavily on high-quality manually labeled datasets, which is the greatest difficulty in constructing a question-answering system. Finally, most existing question-answering systems focus on breakthroughs in a single technology, such as entity extraction, graph reasoning or overall modeling, and tend to ignore the completeness of the system itself. Knowledge graphs are limited in size, and manually constructed rules usually cover only the most common question types, so question-answering systems built in this way are always limited and difficult to apply in real life and work.
Disclosure of Invention
The aim of the invention is to solve the above problems by providing a method for constructing a question-answering system integrating a knowledge graph, a knowledge base and a large language model.
The technical scheme adopted by the invention is as follows: a method for constructing a question-answering system integrating a knowledge graph, a knowledge base and a large language model comprises the following steps:
S1: acquiring the question input by the user, and extracting all entity mentions present in the question using a deep learning model;
S2: retrieving potential link entities in the specified knowledge graph for each entity mention using a candidate entity ranking algorithm;
S3: classifying the question posed by the user using a preset set of question templates, and selectively returning one of a knowledge graph answer, a large language model answer or a knowledge base answer according to the classification result.
In a preferred embodiment, the construction method includes acquiring the question input by the user, i.e. extracting the natural language question passed from the interface as a character string. All entity mentions present in the question are extracted using a deep learning model; the specific steps include model training and model prediction.
In a preferred embodiment, in the step S1, the model training includes the steps of:
S1.1: design question seed templates according to the service requirements, and set a category label for each template. Retrieve all entities in the knowledge graph, randomly fill them into the seed templates according to their entity types, and record the filled position indexes to form the entity extraction dataset D. The filled template sentences serve as the dataset samples, and the recorded position indexes serve as the labels.
S1.2: apply EDA data augmentation to the entity extraction dataset D, specifically Random Mask (RM), Random Deletion (RD), Random Insertion (RI) and Synonym Replacement (SR), to obtain the augmented dataset D_e-ner.
S1.3: using enhanced dataset D e-ner A named entity recognition model is trained, the named entity recognition model uses a PRGC (PotentialRelationandGlobalCorrespondence) architecture, and an Encoder part uses a Hadamard large open source Chinese pre-training Roberta-wwm model.
In a preferred embodiment, in the step S1, model prediction uses the trained PRGC model to extract named entities from the question input by the user, obtaining all entity mentions.
In a preferred embodiment, in the step S2, the specific steps are as follows:
S2.1: traverse all entities in the knowledge graph to form a candidate entity list, store the list with a Faiss vector index, and use the m3e-base model as the text vectorization model.
S2.2: for each entity mention, use Faiss vector library as L 2 Top-5 entities most relevant to similarity retrieval as candidate link entities E sim And obtains a normalized similarity value as a vector similarity score S sim
S2.3: for E sim The popularity score of each candidate link entity is calculated, and the specific calculation formula is as follows:
wherein: in-deg (e) is the sum of the outbound and inbound degrees of entity e, and α is a hyper-parameter (typically a positive integer, which varies according to the complexity of the knowledge-graph).
S2.4: the search score is the sum of the vector similarity score and the candidate entity popularity score, namely:
reorder candidate link entity E based on search score sim And obtaining the most relevant entity mentioned by each entity, and completing entity linking.
In a preferred embodiment, in the step S3, the model training includes the steps of:
S3.1: use the augmented dataset D_e-ner obtained in S1.2 to further construct a question classification dataset:
s3.2: classifying data sets D using questions e-cls And training a question classification model. The question classification model uses a Bert-FC architecture, wherein the Bert model uses a Hadamard large open source Chinese pre-trained Roberta-wwm model.
In a preferred embodiment, in the step S3.1, the model training includes the steps of:
S3.1.1: denote the question as Sentence, extract all entities in the question from the recorded entity filling positions, and denote them e_1, e_2, …, e_n.
S3.1.2: using special tags [ CLS ] and [ SEP ], the question is spliced with all entities into the following form:
Q=[CLS],Sentence,[SEP],e 1 ,[SEP],e 2 ,[SEP]……
wherein Q is taken as the sample and the category of the seed template as the label, yielding the question classification dataset D_e-cls.
In a preferred embodiment, in the step S3, the step of model prediction includes:
S3.3: according to the question input by the user and the obtained candidate link entities, classify further with the question classification model to obtain the category the question belongs to, and correspondingly return a knowledge graph answer, a large model answer or a knowledge base answer. The details are as follows:
S3.3.1: knowledge graph answer: after the linked entity is structurally mapped, a Cypher query statement is invoked for querying and reasoning, and a specific entity or path is returned as the knowledge graph answer.
S3.3.2: large model answer: construct a Prompt from the question and the attribute information of the linked entities, and input the Prompt into the large model to obtain the large model answer. The large model may use an open API or a localized private deployment.
S3.3.3: knowledge base answer: the knowledge base answers contain the summarized results of the large model, as well as the background knowledge about the questions and the sources of the background knowledge. The method specifically comprises the following three steps:
S3.3.3.1: split local files (e.g. PDF documents) into a private knowledge base using the Python docx library or pdfplumber library; there are four splitting rules (implemented with regular expressions):
(1) Chinese and English periods that are not preceded and followed by digits (i.e. not decimal points) are replaced with the newline character (\n).
(2) Where digits matching (\d\d) appear immediately after a punctuation mark, a newline (\n) is inserted between the punctuation and the digits.
(3) Chinese and English semicolons are replaced by \n.
(4) The sentence preceding a Chinese or English colon (:) is added as a shared sentence to each subsequent clause.
Finally, the documents are split on the newline character (\n) to form the knowledge base.
S3.3.3.2: use Faiss together with the text embedding model m3e-base to retrieve knowledge from the knowledge base according to the user's question as background knowledge, while recording the source file of the knowledge.
S3.3.3.3: form a Prompt from the user's question and the knowledge base background knowledge, input the Prompt into the large model, and use the large model's summarized answer as the knowledge base answer. The large model used is the same as in S3.3.2.
In a preferred embodiment, in the step S1, extraction refers to extracting possible entity mentions from the natural language text and identifying the key objects queried by the question text, which is designed to further explore a new paradigm for question-answering systems. On the one hand, entity mentions that may exist in the text are extracted by a natural language processing model, so keyword extraction is realized intelligently and adapts to constantly changing question contexts; on the other hand, a text similarity algorithm is used to retrieve related knowledge from the knowledge base, which enhances the expertise of the large language model answers and greatly alleviates the hallucination problem of the large language model. In addition, the system also provides a user-friendly interaction service.
In a preferred embodiment, in the step S3, the system interaction flow is as follows: (1) after preprocessing, the question is fed into the named entity recognition module and the entity linking module to link the question to candidate entities in the knowledge graph; (2) the question, combined with the successfully linked candidate entities, is passed through the text classification model to complete intention recognition and template matching; (3) according to the recognized intention and template, the system automatically selects an answer mode for the question, including knowledge graph query, knowledge graph reasoning, querying the large language model with the knowledge base and prompt words, and querying the large language model alone; (4) different types of answers are returned according to the answer mode and presented to the user through interfaces of different styles.
In summary, due to the adoption of the technical scheme, the beneficial effects of the invention are as follows:
1. The invention provides a complete method for constructing a question-answering system. Tailored to the data characteristics of the water conservancy industry, the question-answering system is customized along multiple dimensions, and a complete set of construction methods is provided, covering model selection, training strategies and dataset construction. While ensuring functional accuracy, the capabilities of deep learning techniques in the question-answering system are strengthened, and the roles of the knowledge graph and other technologies in the question-answering system are fully exploited.
2. In the invention, for important data in the knowledge graph, the question-answering system gives knowledge graph answers using rule templates and Cypher query statements, so that users can intuitively examine the knowledge context and inspect relations between knowledge entities through operations such as double-click expansion; for general data in the knowledge base, the question-answering system gives summarized answers through the large language model, sparing users the tedious work of reading large numbers of files while providing the knowledge text and source files, achieving a well-grounded question-answering effect; for other, broader data, the question-answering system answers through the large language model, so that every question asked by the user can be answered and the user experience is improved.
Drawings
Fig. 1 is a schematic flow diagram of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1:
examples:
A method for constructing a question-answering system integrating a knowledge graph, a knowledge base and a large language model comprises the following steps:
S1: acquiring the question input by the user, and extracting all entity mentions present in the question using a deep learning model;
S2: retrieving potential link entities in the specified knowledge graph for each entity mention using a candidate entity ranking algorithm;
S3: classifying the question posed by the user using a preset set of question templates, and selectively returning one of a knowledge graph answer, a large language model answer or a knowledge base answer according to the classification result.
The construction method includes acquiring the question input by the user, i.e. extracting the natural language question passed from the interface as a character string. All entity mentions present in the question are extracted using a deep learning model; the specific steps include model training and model prediction.
In step S1, the model training includes the steps of:
S1.1: design question seed templates according to the service requirements, and set a category label for each template. Retrieve all entities in the knowledge graph, randomly fill them into the seed templates according to their entity types, and record the filled position indexes to form the entity extraction dataset D. The filled template sentences serve as the dataset samples, and the recorded position indexes serve as the labels.
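A minimal Python sketch of S1.1 is given below: seed templates are filled with knowledge graph entities by type, and the filled position indexes are recorded as labels. The template syntax, category labels and sample entity list are illustrative assumptions, not the patent's actual data.

```python
# Sketch of S1.1: build the entity extraction dataset D from seed templates.
import random

seed_templates = [
    # (template text with a typed slot, category label of the template) - assumed format
    ("{reservoir}的总库容是多少？", "attribute_query"),
    ("{reservoir}位于哪条河流上？", "single_hop_relation"),
]

entities_by_type = {  # hypothetical entities retrieved from the knowledge graph
    "reservoir": ["大藤峡水利枢纽", "三峡水库"],
}

def build_entity_extraction_dataset(n_samples: int = 1000, seed: int = 42):
    """Return a list of (sentence, [(start, end, entity)], category) triples (dataset D)."""
    random.seed(seed)
    dataset = []
    for _ in range(n_samples):
        template, category = random.choice(seed_templates)
        slot_type = template[template.index("{") + 1 : template.index("}")]
        entity = random.choice(entities_by_type[slot_type])
        prefix = template[: template.index("{")]
        sentence = template.replace("{" + slot_type + "}", entity)
        start = len(prefix)                       # position index of the filled entity
        dataset.append((sentence, [(start, start + len(entity), entity)], category))
    return dataset

if __name__ == "__main__":
    print(build_entity_extraction_dataset(n_samples=3))
```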
S1.2: apply EDA data augmentation to the entity extraction dataset D, specifically Random Mask (RM), Random Deletion (RD), Random Insertion (RI) and Synonym Replacement (SR), to obtain the augmented dataset D_e-ner.
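A minimal sketch of the four EDA operations in S1.2, applied at character level to a Chinese question. The mask token, probabilities and synonym dictionary are assumptions; in practice the recorded entity position indexes must be re-aligned after augmentation.

```python
# Sketch of S1.2: random mask, random deletion, random insertion, synonym replacement.
import random

SYNONYMS = {"库容": ["蓄水量"], "位于": ["地处"]}  # hypothetical synonym dictionary

def random_mask(text, p=0.1, mask="[MASK]"):
    return "".join(mask if random.random() < p else ch for ch in text)

def random_delete(text, p=0.1):
    kept = [ch for ch in text if random.random() >= p]
    return "".join(kept) if kept else text

def random_insert(text, n=1):
    chars = list(text)
    for _ in range(n):
        pos = random.randrange(len(chars) + 1)
        chars.insert(pos, random.choice(text))   # insert a character drawn from the text itself
    return "".join(chars)

def synonym_replace(text):
    for word, syns in SYNONYMS.items():
        if word in text:
            text = text.replace(word, random.choice(syns), 1)
    return text

def augment(sentence):
    """Return one augmented variant per EDA operation."""
    return [op(sentence) for op in (random_mask, random_delete, random_insert, synonym_replace)]
```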
S1.3: using enhanced dataset D e-ner A named entity recognition model is trained, the named entity recognition model uses a PRGC (PotentialRelationandGlobalCorrespondence) architecture, and an Encoder part uses a Hadamard large open source Chinese pre-training Roberta-wwm model.
In step S1, model prediction uses the trained PRGC model to extract named entities from the question input by the user, obtaining all entity mentions.
In step S2, the specific steps are as follows:
S2.1: traverse all entities in the knowledge graph to form a candidate entity list, store the list with a Faiss vector index, and use the m3e-base model as the text vectorization model.
S2.2: for each entity mention, use Faiss vector library as L 2 Top-5 entities most relevant to similarity retrieval as candidate link entities E sim And obtains a normalized similarity value as a vector similarity score S sim
S2.3: for E sim The popularity score of each candidate link entity is calculated, and the specific calculation formula is as follows:
wherein: in-deg (e) is the sum of the outbound and inbound degrees of entity e, and α is a hyper-parameter (typically a positive integer, which varies according to the complexity of the knowledge-graph).
S2.4: the search score is the sum of the vector similarity score and the candidate entity popularity score, namely:
reorder candidate link entity E based on search score sim And obtaining the most relevant entity mentioned by each entity, and completing entity linking.
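A minimal sketch of S2.3–S2.4. The exact popularity formula is not reproduced in this text, so the log-damped degree form below (with hyper-parameter α) is an assumption made for illustration; only the search score being the sum of the vector similarity score and the popularity score follows the description above.

```python
# Sketch of S2.3-S2.4: degree-based popularity score and re-ranking of candidates.
import math

def popularity_score(degree: int, alpha: int = 2) -> float:
    """Assumed popularity: log-damped total degree, scaled by hyper-parameter alpha."""
    return math.log(1 + degree) / alpha

def rerank(candidates, degrees, alpha: int = 2):
    """candidates: [(entity, vector_similarity)]; degrees: {entity: in-degree + out-degree}."""
    scored = [
        (entity, sim + popularity_score(degrees.get(entity, 0), alpha))  # search score
        for entity, sim in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

best_entity, _ = rerank(
    [("大藤峡水利枢纽", 0.92), ("三峡水库", 0.40)],
    degrees={"大藤峡水利枢纽": 35, "三峡水库": 80},
)[0]
```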
In step S3, the model training includes the steps of:
S3.1: use the augmented dataset D_e-ner obtained in S1.2 to further construct a question classification dataset:
s3.2: classifying data sets D using questions e-cls And training a question classification model. The question classification model uses a Bert-FC architecture, wherein the Bert model uses a Hadamard large open source Chinese pre-trained Roberta-wwm model.
In step S3.1, the model training comprises the steps of:
S3.1.1: denote the question as Sentence, extract all entities in the question from the recorded entity filling positions, and denote them e_1, e_2, …, e_n.
S3.1.2: using special tags [ CLS ] and [ SEP ], the question is spliced with all entities into the following form:
Q=[CLS],Sentence,[SEP],e 1 ,[SEP],e 2 ,[SEP]……
wherein Q is taken as the sample and the category of the seed template as the label, yielding the question classification dataset D_e-cls.
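A minimal sketch of building one classification sample in the Q format of S3.1.2; whether the tokenizer's own special tokens are reused instead of literal strings is an implementation detail not fixed by the text.

```python
# Sketch of S3.1.2: splice the question and its entities into a [CLS]/[SEP] sample.
def build_classification_sample(sentence: str, entities: list[str], label: str):
    parts = ["[CLS]", sentence, "[SEP]"]
    for entity in entities:                       # e_1 [SEP] e_2 [SEP] ...
        parts.extend([entity, "[SEP]"])
    return "".join(parts), label                  # (sample Q, seed-template category)

sample, label = build_classification_sample(
    "大藤峡水利枢纽位于哪条河流上？", ["大藤峡水利枢纽"], "single_hop_relation"
)
```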
In step S3, the step of model prediction includes:
S3.3: according to the question input by the user and the obtained candidate link entities, classify further with the question classification model to obtain the category the question belongs to, and correspondingly return a knowledge graph answer, a large model answer or a knowledge base answer. The details are as follows:
S3.3.1: knowledge graph answer: after the linked entity is structurally mapped, a Cypher query statement is invoked for querying and reasoning, and a specific entity or path is returned as the knowledge graph answer.
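A minimal sketch of S3.3.1 using the Neo4j Python driver; the connection settings, node label, property names and relationship filtering are assumptions about the graph schema.

```python
# Sketch of S3.3.1: run a Cypher query for the linked entity and return related nodes.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # assumed

def kg_answer(entity_name: str, relation: str = "位于"):
    cypher = (
        "MATCH (e:Entity {name: $name})-[r]->(t) "
        "WHERE type(r) = $rel RETURN t.name AS answer"
    )
    with driver.session() as session:
        records = session.run(cypher, name=entity_name, rel=relation)
        return [record["answer"] for record in records]

print(kg_answer("大藤峡水利枢纽"))
```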
S3.3.2: large model answer: construct a Prompt from the question and the attribute information of the linked entities, and input the Prompt into the large model to obtain the large model answer. The large model may use an open API or a localized private deployment.
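A minimal sketch of S3.3.2: the question and the linked entity's attributes are assembled into a Prompt and sent to a large model through an OpenAI-compatible client. The base URL, model name and prompt wording are assumptions; a locally deployed model exposing the same API could be substituted.

```python
# Sketch of S3.3.2: build a Prompt from entity attributes and query the large model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local deployment

def llm_answer(question: str, entity_attributes: dict) -> str:
    facts = "；".join(f"{k}：{v}" for k, v in entity_attributes.items())
    prompt = f"已知实体属性：{facts}。请根据以上信息回答：{question}"
    response = client.chat.completions.create(
        model="local-llm",                                  # hypothetical model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```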
S3.3.3: knowledge base answer: the knowledge base answers contain the summarized results of the large model, as well as the background knowledge about the questions and the sources of the background knowledge. The method specifically comprises the following three steps:
S3.3.3.1: split local files (e.g. PDF documents) into a private knowledge base using the Python docx library or pdfplumber library; there are four splitting rules (implemented with regular expressions):
(1) Chinese and English periods that are not preceded and followed by digits (i.e. not decimal points) are replaced with the newline character (\n).
(2) Where digits matching (\d\d) appear immediately after a punctuation mark, a newline (\n) is inserted between the punctuation and the digits.
(3) Chinese and English semicolons are replaced by \n.
(4) The sentence preceding a Chinese or English colon (:) is added as a shared sentence to each subsequent clause.
Finally, the documents are split on the newline character (\n) to form the knowledge base, as sketched below.
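A minimal sketch of the splitting rules above; the concrete regular expressions are one interpretation of rules (1)–(4), written for illustration only.

```python
# Sketch of S3.3.3.1: split extracted document text into knowledge base chunks.
import re

def split_into_chunks(text: str) -> list[str]:
    # (4) share the sentence before a Chinese/English colon with each clause that follows it
    def share_colon_prefix(line: str) -> str:
        if "：" not in line and ":" not in line:
            return line
        head, tail = re.split(r"[：:]", line, maxsplit=1)
        clauses = [c for c in re.split(r"[；;]", tail) if c.strip()]
        return "\n".join(f"{head}：{c.strip()}" for c in clauses) if clauses else line

    text = "\n".join(share_colon_prefix(l) for l in text.split("\n"))
    # (1) Chinese/English periods not surrounded by digits -> newline
    text = re.sub(r"(?<!\d)[。.](?!\d)", "\n", text)
    # (2) digits right after punctuation (e.g. list numbering) -> newline before the digits
    text = re.sub(r"([，,；;：:])(\d)", r"\1\n\2", text)
    # (3) remaining Chinese/English semicolons -> newline
    text = re.sub(r"[；;]", "\n", text)
    return [chunk.strip() for chunk in text.split("\n") if chunk.strip()]

print(split_into_chunks("水库调度规程：汛期控制水位；枯期保障供水。"))
```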
S3.3.3.2: use Faiss together with the text embedding model m3e-base to retrieve knowledge from the knowledge base according to the user's question as background knowledge, while recording the source file of the knowledge.
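A minimal sketch of S3.3.3.2, pairing each indexed chunk with its source file so that both the background knowledge and its provenance can be returned; the checkpoint name and the in-memory bookkeeping are assumptions.

```python
# Sketch of S3.3.3.2: retrieve background knowledge and its source file from the knowledge base.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("moka-ai/m3e-base")               # assumed m3e-base checkpoint
chunks = [("水库调度规程：汛期控制水位", "调度规程.docx"),       # (chunk text, source file)
          ("大藤峡工程位于珠江流域黔江河段", "工程简介.pdf")]

embeddings = model.encode([text for text, _ in chunks], convert_to_numpy=True).astype("float32")
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

def retrieve_background(question: str, k: int = 2):
    """Return [(chunk text, source file)] for the Top-k most relevant chunks."""
    query = model.encode([question], convert_to_numpy=True).astype("float32")
    _, ids = index.search(query, min(k, len(chunks)))
    return [chunks[i] for i in ids[0]]
```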
S3.3.3.3: form a Prompt from the user's question and the knowledge base background knowledge, input the Prompt into the large model, and use the large model's summarized answer as the knowledge base answer. The large model used is the same as in S3.3.2.
In step S1, extraction refers to extracting possible entity mentions from the natural language text and identifying the key objects queried by the question text, which is designed to further explore a new paradigm for question-answering systems. On the one hand, entity mentions that may exist in the text are extracted by a natural language processing model, so keyword extraction is realized intelligently and adapts to constantly changing question contexts; on the other hand, a text similarity algorithm is used to retrieve related knowledge from the knowledge base, which enhances the expertise of the large language model answers and greatly alleviates the hallucination problem of the large language model. In addition, the system also provides a user-friendly interaction service.
In step S3, the system interaction flow is as follows: (1) after preprocessing, the question is fed into the named entity recognition module and the entity linking module to link the question to candidate entities in the knowledge graph; (2) the question, combined with the successfully linked candidate entities, is passed through the text classification model to complete intention recognition and template matching; (3) according to the recognized intention and template, the system automatically selects an answer mode for the question, including knowledge graph query, knowledge graph reasoning, querying the large language model with the knowledge base and prompt words, and querying the large language model alone; (4) different types of answers are returned according to the answer mode and presented to the user through interfaces of different styles.
The invention provides a complete method for constructing a question-answering system. Tailored to the data characteristics of the water conservancy industry, the question-answering system is customized along multiple dimensions, and a complete set of construction methods is provided, covering model selection, training strategies and dataset construction. While ensuring functional accuracy, the capabilities of deep learning techniques in the question-answering system are strengthened, and the roles of the knowledge graph and other technologies in the question-answering system are fully exploited.
In the invention, for important data in the knowledge graph, the question-answering system gives knowledge graph answers using rule templates and Cypher query statements, so that users can intuitively examine the knowledge context and inspect relations between knowledge entities through operations such as double-click expansion; for general data in the knowledge base, the question-answering system gives summarized answers through the large language model, sparing users the tedious work of reading large numbers of files while providing the knowledge text and source files, achieving a well-grounded question-answering effect; for other, broader data, the question-answering system answers through the large language model, so that every question asked by the user can be answered and the user experience is improved. The framework ensures the accuracy and comprehensiveness of the knowledge graph question answering while coupling the knowledge base, the knowledge graph and the large language model, so that their strengths complement one another and the user experience is improved.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises an element.
The previous description is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for constructing a question-answering system integrating a knowledge graph, a knowledge base and a large language model, characterized in that the construction method comprises the following steps:
S1: acquiring the question input by the user, and extracting all entity mentions present in the question using a deep learning model;
S2: retrieving potential link entities in the specified knowledge graph for each entity mention using a candidate entity ranking algorithm;
S3: classifying the question posed by the user using a preset set of question templates, and selectively returning one of a knowledge graph answer, a large language model answer or a knowledge base answer according to the classification result.
2. The method for constructing a question-answering system integrating a knowledge graph, a knowledge base and a large language model as claimed in claim 1, wherein: the construction method comprises acquiring the question input by the user, i.e. extracting the natural language question passed from the interface as a character string, and extracting all entity mentions present in the question using a deep learning model; the specific steps include model training and model prediction.
3. The method for constructing a question-answering system integrating knowledge graph, knowledge base and large language model as claimed in claim 1, wherein: in the step S1, the model training includes the following steps:
S1.1: designing question seed templates according to the service requirements, and setting a category label for each template; retrieving all entities in the knowledge graph, randomly filling them into the seed templates according to their entity types, and recording the filled position indexes to form the entity extraction dataset D; the filled template sentences serve as the dataset samples, and the recorded position indexes serve as the labels;
S1.2: applying EDA data augmentation to the entity extraction dataset D, specifically random masking, random deletion, random insertion and synonym replacement, to obtain the augmented dataset D_e-ner;
S1.3: using enhanced dataset D e-ner Training a named entity recognition model, wherein the named entity recognition model uses a PRGC architecture, and an Encoder part uses a Hadamard large open source Chinese pre-training Roberta-wwm model.
4. The method for constructing a question-answering system integrating a knowledge graph, a knowledge base and a large language model as claimed in claim 1, wherein: in the step S1, model prediction uses the trained PRGC model to extract named entities from the question input by the user, obtaining all entity mentions.
5. The method for constructing a question-answering system integrating knowledge graph, knowledge base and large language model as claimed in claim 1, wherein: in the step S2, the specific steps are as follows:
S2.1: traversing all entities in the knowledge graph to form a candidate entity list, storing the list with a Faiss vector index, and using the m3e-base model as the text vectorization model;
S2.2: for each entity mention, using the Faiss vector index to retrieve the Top-5 most relevant entities by L2 similarity as candidate link entities E_sim, and obtaining the normalized similarity values as the vector similarity scores S_sim;
S2.3: for E sim The popularity score of each candidate link entity is calculated, and the specific calculation formula is as follows:
wherein: in-deg (e) is the sum of the outbound and inbound degrees of entity e, and α is a hyper-parameter;
S2.4: the search score of each candidate link entity is the sum of its vector similarity score and its popularity score;
re-ranking the candidate link entities E_sim by the search score, taking the most relevant entity for each entity mention, and completing the entity linking.
6. The method for constructing a question-answering system integrating a knowledge graph, a knowledge base and a large language model as claimed in claim 1, wherein: in the step S3, a preset set of question templates is used to classify the questions posed by the user, and a deep learning model is required in the process of selectively returning one of a knowledge graph answer, a large language model answer or a knowledge base answer according to the classification result; using the deep learning model involves model training and model prediction, and the model training comprises the following steps:
S3.1: using the obtained augmented dataset D_e-ner to further construct a question classification dataset:
S3.2: training a question classification model on the question classification dataset D_e-cls; the question classification model uses a BERT-FC architecture, wherein the BERT model is the HIT open-source Chinese pre-trained RoBERTa-wwm model.
7. The method for constructing a question-answering system integrating knowledge graph, knowledge base and large language model as claimed in claim 6, wherein: in the step S3.1, the model training includes the following steps:
S3.1.1: denoting the question as Sentence, extracting all entities in the question from the recorded entity filling positions, and denoting them e_1, e_2, …, e_n;
S3.1.2: using special tags [ CLS ] and [ SEP ], the question is spliced with all entities into the following form:
Q=[CLS],Sentence,[SEP],e 1 ,[SEP],e 2 ,[SEP]……
wherein Q is taken as the sample and the category of the seed template as the label, yielding the question classification dataset D_e-cls.
8. The method for constructing a question-answering system integrating a knowledge graph, a knowledge base and a large language model as claimed in claim 1, wherein: in the step S3, a preset set of question templates is used to classify the questions posed by the user, and a deep learning model is required in the process of selectively returning one of a knowledge graph answer, a large language model answer or a knowledge base answer according to the classification result; using the deep learning model involves model training and model prediction, and the step of model prediction comprises:
S3.3: according to the question input by the user and the obtained candidate link entities, classifying further with the question classification model to obtain the category the question belongs to, and correspondingly returning a knowledge graph answer, a large model answer or a knowledge base answer; the details are as follows:
S3.3.1: knowledge graph answer: after the linked entity is structurally mapped, invoking a Cypher query statement for querying and reasoning, and returning a specific entity or path as the knowledge graph answer;
S3.3.2: large model answer: constructing a Prompt from the question and the attribute information of the linked entities, and inputting the Prompt into the large model to obtain the large model answer; the large model may use an open API or a localized private deployment;
s3.3.3: knowledge base answer: the knowledge base answers contain the summarized results of the large model, and also the related background knowledge of the questions and the sources of the background knowledge; the method specifically comprises the following three steps:
S3.3.3.1: splitting local files into a private knowledge base using the Python-based docx library or pdfplumber library, with four splitting rules implemented by regular expressions; finally, splitting the documents on the newline character (\n) to form the knowledge base;
S3.3.3.2: using Faiss together with the text embedding model m3e-base to retrieve knowledge from the knowledge base according to the user's question as background knowledge, while recording the source file of the knowledge;
S3.3.3.3: forming a Prompt from the user's question and the knowledge base background knowledge, inputting the Prompt into the large model, and using the large model's summarized answer as the knowledge base answer; the large model used is the same as in S3.3.2.
9. The method for constructing a question-answering system integrating a knowledge graph, a knowledge base and a large language model as claimed in claim 1, wherein: in the step S1, extraction refers to extracting possible entity mentions from the natural language text and identifying the key objects queried by the question text, which is designed to further explore a new paradigm for question-answering systems; deterministic short answers are extracted based on the large language model, while the knowledge graph and the knowledge base are combined to jointly empower the question-answering system.
10. The method for constructing a question-answering system integrating a knowledge graph, a knowledge base and a large language model as claimed in claim 9, wherein: in the step S3, the system interaction flow is as follows: (1) after preprocessing, the question is fed into the named entity recognition module and the entity linking module to link the question to candidate entities in the knowledge graph; (2) the question, combined with the successfully linked candidate entities, is passed through the text classification model to complete intention recognition and template matching; (3) according to the recognized intention and template, the system automatically selects an answer mode for the question, including knowledge graph query, knowledge graph reasoning, querying the large language model with the knowledge base and prompt words, and querying the large language model alone; (4) different types of answers are returned according to the answer mode and presented to the user through interfaces of different styles.
CN202311821070.9A 2023-12-27 2023-12-27 Knowledge graph, knowledge base and large language model fused question-answering system construction method Active CN117688189B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311821070.9A CN117688189B (en) 2023-12-27 2023-12-27 Knowledge graph, knowledge base and large language model fused question-answering system construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311821070.9A CN117688189B (en) 2023-12-27 2023-12-27 Knowledge graph, knowledge base and large language model fused question-answering system construction method

Publications (2)

Publication Number Publication Date
CN117688189A true CN117688189A (en) 2024-03-12
CN117688189B CN117688189B (en) 2024-06-14

Family

ID=90126446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311821070.9A Active CN117688189B (en) 2023-12-27 2023-12-27 Knowledge graph, knowledge base and large language model fused question-answering system construction method

Country Status (1)

Country Link
CN (1) CN117688189B (en)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180144051A1 (en) * 2016-11-18 2018-05-24 Facebook, Inc. Entity Linking to Query Terms on Online Social Networks
US20200218988A1 (en) * 2019-01-08 2020-07-09 International Business Machines Corporation Generating free text representing semantic relationships between linked entities in a knowledge graph
WO2021017290A1 (en) * 2019-07-31 2021-02-04 平安科技(深圳)有限公司 Knowledge graph-based entity identification data enhancement method and system
WO2021139283A1 (en) * 2020-06-16 2021-07-15 平安科技(深圳)有限公司 Knowledge graph question-answer method and apparatus based on deep learning technology, and device
CN112100351A (en) * 2020-09-11 2020-12-18 陕西师范大学 Method and equipment for constructing intelligent question-answering system through question generation data set
CN112328773A (en) * 2020-11-26 2021-02-05 四川长虹电器股份有限公司 Knowledge graph-based question and answer implementation method and system
CN113515613A (en) * 2021-06-25 2021-10-19 华中科技大学 Intelligent robot integrating chatting, knowledge and task question answering
CN113934831A (en) * 2021-10-19 2022-01-14 中电积至(海南)信息技术有限公司 Knowledge graph question-answering method based on deep learning
CN115858799A (en) * 2022-06-29 2023-03-28 齐鲁工业大学 Knowledge representation learning method integrating ordered relationship path and entity description information
CN115658845A (en) * 2022-09-30 2023-01-31 中国科学院软件研究所 Intelligent question-answering method and device suitable for open-source software supply chain
CN117149974A (en) * 2023-08-31 2023-12-01 东南大学 Knowledge graph question-answering method for sub-graph retrieval optimization
CN117171329A (en) * 2023-09-28 2023-12-05 浙大城市学院 Semantic analysis-based traditional Chinese medicine domain knowledge graph question-answering method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
AN BO et al., "Knowledge base question answering *** with fused knowledge representation", Scientia Sinica Informationis, No. 11, 21 November 2018 (2018-11-21) *
ZHANG SEN et al., "Entity linking for knowledge base question answering based on multi-dimensional matching", Research and Development, 31 December 2021 (2021-12-31) *
ZHANG FANGRONG et al., "Research on entity-relation extraction methods in knowledge base question answering ***", Computer Engineering and Applications, No. 11, 31 December 2020 (2020-12-31) *
ZHANG HEYI et al., "Research on question answering *** fusing large language models and knowledge graphs", Journal of Frontiers of Computer Science and Technology, Vol. 17, No. 10, 8 December 2023 (2023-12-08) *
FAN JUNJIE et al., "Research on intelligent military knowledge graph question answering services for open-source intelligence in the digital intelligence era", Data Analysis and Knowledge Discovery, 26 October 2023 (2023-10-26) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118132729A (en) * 2024-04-28 2024-06-04 支付宝(杭州)信息技术有限公司 Answer generation method and device based on medical knowledge graph
CN118152547A (en) * 2024-05-11 2024-06-07 青岛网信信息科技有限公司 Robot answer method, medium and system according to understanding capability of questioner

Also Published As

Publication number Publication date
CN117688189B (en) 2024-06-14

Similar Documents

Publication Publication Date Title
CN110399457B (en) Intelligent question answering method and system
US10678816B2 (en) Single-entity-single-relation question answering systems, and methods
CN112115238B (en) Question-answering method and system based on BERT and knowledge base
CN111475623B (en) Case Information Semantic Retrieval Method and Device Based on Knowledge Graph
CN117688189B (en) Knowledge graph, knowledge base and large language model fused question-answering system construction method
US8818789B2 (en) Knowledge system method and apparatus
KR100533810B1 (en) Semi-Automatic Construction Method for Knowledge of Encyclopedia Question Answering System
US8874431B2 (en) Knowledge system method and apparatus
CN103488724B (en) A kind of reading domain knowledge map construction method towards books
CN110321432A (en) Textual event information extracting method, electronic device and non-volatile memory medium
CN115599902B (en) Oil-gas encyclopedia question-answering method and system based on knowledge graph
Dobson Interpretable Outputs: Criteria for Machine Learning in the Humanities.
CN115982338A (en) Query path ordering-based domain knowledge graph question-answering method and system
Säily et al. Explorations into the social contexts of neologism use in early English correspondence
CN111666374A (en) Method for integrating additional knowledge information into deep language model
Gammack et al. Semantic knowledge management system for design documentation with heterogeneous data using machine learning
CN114996455A (en) News title short text classification method based on double knowledge maps
CN111858885B (en) Keyword separation user question intention identification method
KR20220015129A (en) Method and Apparatus for Providing Book Recommendation Service Based on Interactive Form
Chen et al. FAQ system in specific domain based on concept hierarchy and question type
Ceylan Application of Natural Language Processing to Unstructured Data: A Case Study of Climate Change
Almotairi et al. A review on question answering systems: Domains, modules, techniques and challenges
Meguellati et al. Feature selection for location metonymy using augmented bag-of-words
Sergeev An Application of Semantic Relation Extraction Models
Jain Representation and curation of knowledge graphs with embeddings

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant