CN109947921B

CN109947921B - Intelligent question-answering system based on natural language processing

Info

Publication number: CN109947921B
Application number: CN201910207884.0A
Authority: CN
Inventors: 陈婧怡; 陈慧萍; 杜鹏; 丁翰雯
Original assignee: Changzhou Campus of Hohai University
Current assignee: Changzhou Campus of Hohai University
Priority date: 2019-03-19
Filing date: 2019-03-19
Publication date: 2022-09-02
Anticipated expiration: 2039-03-19
Also published as: CN109947921A

Abstract

The invention discloses an intelligent question-answering system based on natural language processing, which comprises a knowledge base construction module, a question-answering pair management module and a question-answering matching module; the knowledge base building module comprises a document preprocessing module, a document structure tree building module and a question and answer pair building module; the question-answer pair management module comprises a task management module, a document management module, a keyword management module and a question-answer pair operation module; the question-answer matching module is used for matching questions asked by the user with question-answer pairs created by the knowledge base generation module.

Description

Intelligent question-answering system based on natural language processing

Technical Field

The invention belongs to the technical field of intelligent customer service, and particularly relates to an intelligent question-answering system based on natural language processing.

Background

With the rapid development of the Internet and the widespread use of personal computers, more and more messages and data are distributed as electronic documents over the Hypertext Transfer Protocol. The speed and capability of data retrieval therefore face significant challenges. How to obtain the information users need, accurately and promptly, from such an ocean of information has become a major problem in the development of the Internet.

Search engine technology is a well-established information retrieval technology, but as Internet data grows explosively, the disadvantages of search engines have gradually emerged. Traditional search engines such as Baidu, Google, and Canon usually accept only keywords as input. Ordinary users often find it difficult to condense their query intention into a small number of keywords that express it accurately. In addition, a search engine does not return a concise, accurate answer but a list of web page fragments. These fragments usually contain a lot of noise, and the user still has to read them, or even the original web pages, to find the answer needed.

In order to improve the user experience of information retrieval, research is being conducted on question-answering systems directly using natural language as input and output, and users can directly express their query requirements in text or voice mode using natural language. After understanding the query intention of the user, the question-answering system directly returns accurate answers expressed in a natural language form to the user through a series of retrieval, analysis and processing. Therefore, the question-answering system is a more convenient, friendly and accurate service for users.

For enterprises with manual customer service, a question-answering system can save a large amount of manpower, and it is more stable and efficient. For example, China Mobile's traditional customer service channels include the 10086 hotline with human agents and staffed service windows in business halls. These channels incur costs such as communication charges, training costs, and human resources, and are limited by time (24-hour service cannot be provided) and location (centralized customer service offices). As an enterprise's customer base grows, the customer service team is often overwhelmed by the huge volume of inquiries.

Therefore, amid the wave of enterprise modernization, informatization, and intelligent development, the intelligent question-answering system has emerged.

At present, foreign English-language question-answering systems mainly include the START question-answering system developed at the Massachusetts Institute of Technology, the AnswerBus question-answering system developed at the University of Michigan, Microsoft's AskMSR question-answering system, and the Ask Jeeves question-answering system. Beyond English-language systems, there are also cross-language question-answering evaluations such as CLEF.

Compared with the progress of foreign research, research on Chinese-language question-answering systems started later in China, after 1970, and it was not until 1980 that the Institute of Linguistics of the Chinese Academy of Sciences developed the first Chinese man-machine dialogue system in China. At present, Tsinghua University, Fudan University, Beijing Language and Culture University and others have made many achievements in Chinese natural language research, for example the EasyNav campus navigation system of Tsinghua University and the question-answering system on character relationships in Dream of the Red Chamber developed by the Chinese Academy of Sciences.

The knowledge base is one of the core competitive strengths of an intelligent question-answering system, and building a high-quality knowledge base is one of the industry's open problems. Traditional manual construction is time-consuming, labor-intensive, and narrow in coverage. Current approaches convert unstructured data into structured knowledge graphs for storage, which requires substantial human resources and technical support; knowledge graph storage is not flexible enough, its structure is complex, and knowledge base queries are neither efficient nor accurate enough. Existing intelligent question-answering systems can answer only a small number of common questions and cannot answer them accurately. An automated scheme is therefore urgently needed that can build a high-quality knowledge base from given documents (such as product manuals, case documents, and user guides), making the question-answering system more intelligent.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides an intelligent question-answering system based on natural language processing, which extracts as many high-quality question-answer pairs from documents as possible and improves the efficiency and accuracy of knowledge base retrieval when answering questions.

The technical problem to be solved by the invention is realized by the following technical scheme:

An intelligent question-answering system based on natural language processing comprises a knowledge base construction module, a question-answering pair management module and a question-answering matching module; the knowledge base building module comprises a document preprocessing module, a document structure tree building module and a question and answer pair building module; the question-answer pair management module comprises a task management module, a document management module, a keyword management module and a question-answer pair operation module; the question-answer matching module is used for matching questions provided by the user with question-answer pairs created by the knowledge base generation module.

Further, the document preprocessing module is configured to filter useless information in the document, and the filtering process includes:

filtering useless information out of the received documents with regular expressions and outputting a file set OUT1;

removing repeated parts in the file set OUT1 by adopting a longest common subsequence algorithm to obtain a file set OUT2;

classifying the file set OUT2 according to the set granularity, and removing the common part in each classified document to obtain a file set OUT3 containing a directory and a text;

and classifying the file set OUT3 by adopting a longest common substring algorithm, and removing common parts of all classified documents to obtain a text set OUT4.

Further, the document structure tree building module is configured to build a document structure tree, and the building process includes:

1) analyzing to obtain HTML source codes of the text, and traversing according to the depth to build an HTML tree;

2) adjusting the structure of the constructed HTML tree to enable leaf nodes of the tree to directly form an answer part of a question-answer pair so as to generate a document structure tree;

3) performing a depth-first traversal of the document structure tree to generate a problem keyword structure tree.
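
Steps 1) to 3) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that headings (`<h1>` to `<h6>`) carry the document structure and that paragraph-like text under a heading becomes a leaf (answer) node. The names `Node`, `StructureTreeBuilder`, `build_structure_tree` and `keyword_tree` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from html.parser import HTMLParser

LEAF_LEVEL = 7  # assumed marker: deeper than any heading level


@dataclass
class Node:
    title: str                     # heading text (keyword) or leaf text (answer)
    level: int = 0                 # 0 = root, 1-6 = heading level, 7 = leaf
    children: list[Node] = field(default_factory=list)


class StructureTreeBuilder(HTMLParser):
    """Assumed heuristic: headings become keyword nodes, other text becomes leaves."""

    HEADINGS = {f"h{i}": i for i in range(1, 7)}

    def __init__(self, root_title: str) -> None:
        super().__init__()
        self.root = Node(root_title)
        self._stack = [self.root]   # path from the root to the current heading
        self._heading_level = None  # set while inside a heading tag
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._flush_text()
            self._heading_level = self.HEADINGS[tag]

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._heading_level is not None:
            title = "".join(self._buffer).strip()
            self._buffer.clear()
            level, self._heading_level = self._heading_level, None
            # Climb until the top of the stack is a shallower heading (or the root).
            while self._stack[-1].level >= level:
                self._stack.pop()
            node = Node(title, level)
            self._stack[-1].children.append(node)
            self._stack.append(node)
        elif tag in ("p", "li", "div", "td"):
            self._flush_text()

    def handle_data(self, data):
        self._buffer.append(data)

    def _flush_text(self):
        text = "".join(self._buffer).strip()
        self._buffer.clear()
        if text:
            self._stack[-1].children.append(Node(text, LEAF_LEVEL))

    def close(self):
        super().close()
        self._flush_text()


def build_structure_tree(html: str, root_title: str) -> Node:
    builder = StructureTreeBuilder(root_title)
    builder.feed(html)
    builder.close()
    return builder.root


def keyword_tree(node: Node) -> Node:
    """Depth-first copy of the tree with the leaf (answer) nodes removed."""
    kept = [keyword_tree(c) for c in node.children if c.level != LEAF_LEVEL]
    return Node(node.title, node.level, kept)
```

Nesting by heading level is only one possible way to "adjust" an HTML tree; real documents that express structure through lists or CSS classes would need different rules.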

The rule for generating the problem keyword structure tree is as follows:

a) traversing to a leaf node;

b) a child node contains punctuation that marks a complete sentence;

c) the child node has a branch and complies with the following decision rule:

c1) semantic approximation of each child node;

c2) the child subtrees are identical in structure.

Further, the question-answer pair construction module is used for constructing question-answer pairs, and the construction process includes:

1) the question-answer pair construction module performs a depth-first traversal of the obtained document structure tree, takes the keyword set obtained along each path as candidate question keywords, and, on reaching the parent node of a leaf node, removes the parent node information to form an answer, generating a set of keyword group-answer entries;

2) after questions are generated, when question-answer pairs are constructed, any pair whose keyword, question sentence, or answer is null is discarded;

3) repeated question sentences are removed to obtain preliminary question-answer pairs, with the root node taken as the keyword; if the keyword does not match the question, a keyword for the question-answer pair is generated by word segmentation and named entity extraction;

4) when a node that is already a plain question is encountered during traversal, it does not enter the question generation process: the question is used directly, its subordinate nodes are used as the answer to form a question-answer pair, and the entity being asked about is extracted from the question to form the keyword.
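
The traversal and filtering in steps 1) to 3) might look like the sketch below. It assumes the document structure tree is given as a nested `dict` whose string values are leaf answers, and it takes question generation as a caller-supplied function; `iter_paths`, `build_qa_pairs` and `make_question` are hypothetical names. The fallback to word segmentation and named entity extraction, and the handling of plain questions in step 4), are left out.

```python
def iter_paths(tree, path=()):
    """Depth-first traversal yielding (keyword path, answer text) for each leaf."""
    for title, child in tree.items():
        if isinstance(child, dict):
            yield from iter_paths(child, path + (title,))
        else:
            yield path + (title,), child


def build_qa_pairs(tree, make_question):
    pairs, seen = [], set()
    for keywords, answer in iter_paths(tree):
        question = make_question(keywords)
        # Step 2: discard the pair if keyword, question, or answer is empty.
        if not keywords or not question or not answer or not answer.strip():
            continue
        # Step 3: drop repeated question sentences.
        if question in seen:
            continue
        seen.add(question)
        pairs.append(
            {
                "keyword": keywords[0],  # step 3: the root node is the keyword
                "question": question,
                "answer": answer.strip(),
            }
        )
    return pairs
```

For example, `build_qa_pairs({"Broadband": {"Speed": "200 Mbps"}}, lambda p: "what is " + " ".join(p))` yields a single pair whose keyword is `"Broadband"`.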

Further, question generation is specifically as follows: Chinese word segmentation is performed on the problem keyword structure tree to construct custom lexicons, and questions are generated by a semantic template method. Leaf nodes are subtracted from the document structure tree to generate the problem keyword structure tree. It is first judged whether a subtree node contains, or exactly matches, a keyword of the custom lexicons, and if so the word is deleted; it is then judged whether the subtree node contains a keyword of the verb lexicon or the attributive modifier lexicon, and the syntactic transformation is performed by category to finally generate the question.

Further, the task management module is used for managing task issuing and task state monitoring; the document management module is used for managing file uploading, file decompression and document group query; the question-answer pair operation module is used for managing the adding, deleting, modifying and querying of question-answer pairs.

Further, the question matching of the question-answer matching module comprises:

receiving a user question Q1;

performing inverted indexing of Q1 using the keyword set;

computing the longest common subsequence of Q1 and each keyword;

calculating the matching rate of Q1 and each keyword in the keyword library, and taking the keyword with the highest matching rate as the keyword of Q1;

indexing the set of questions in the database that have the same keyword, using the keyword of Q1;

and computing the short-text similarity between Q1 and each question in the question set, and returning to the user, as the answer to Q1, the answer corresponding to the question with the maximum similarity.

Further, the keyword extraction includes: traversing all document titles and computing the frequency of all separators; selecting the special symbol with the highest frequency as the separator, splitting out the keywords and generating a word frequency map; and filtering out phrases with high word frequency, then segmenting the question and extracting the nouns or gerunds in it as the keywords of the question.
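
The separator and word-frequency part of this step can be sketched as follows. The sketch assumes that a "special symbol" is any character that is neither alphanumeric nor whitespace, and that a phrase counts as "high frequency" when it appears in more than a given share of titles; the names `pick_separator`, `title_phrases` and the threshold `max_share` are hypothetical. Segmenting the question and selecting nouns or gerunds needs a part-of-speech tagger and is not shown.

```python
from collections import Counter


def pick_separator(titles):
    """Return the most frequent non-alphanumeric, non-space character, if any."""
    counts = Counter(
        ch for title in titles for ch in title
        if not ch.isalnum() and not ch.isspace()
    )
    return counts.most_common(1)[0][0] if counts else None


def title_phrases(titles, max_share=0.5):
    """Split titles on the separator and drop phrases shared by too many titles."""
    sep = pick_separator(titles)
    split = [
        [part.strip() for part in (title.split(sep) if sep else [title]) if part.strip()]
        for title in titles
    ]
    # Document frequency: in how many titles does each phrase occur?
    freq = Counter(phrase for parts in split for phrase in set(parts))
    limit = max_share * len(titles)
    kept = [[p for p in parts if freq[p] <= limit] for parts in split]
    return sep, freq, kept
```

With titles of the form `"Help Center - Broadband - Fees"`, the hyphen is picked as the separator and a site-wide phrase such as `"Help Center"` is filtered out because it occurs in every title.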

The beneficial effects include:

(1) High automation: after the user uploads the documents, the whole process, from parsing and extracting the text to constructing the question-answer pairs and completing the knowledge base, can be fully automatic.

(2) Flexible storage: existing knowledge bases store information in a structured form that is not easy to extend. The invention instead stores information as question-answer pairs, which are easy to extend, store, retrieve, and query, and can be exported directly as FAQs (frequently asked questions).

(3) High answering accuracy: the question-answer pairs are extracted using the document structure tree, and as long as the document structure tree is of high quality, the extraction accuracy of the question-answer pairs can theoretically reach 100%.

(4) High query efficiency: when matching questions and answers, searching for the keyword first and then searching the questions under that keyword greatly improves retrieval efficiency.

Drawings

FIG. 1 is a schematic structural view of the present invention;

FIG. 2 is a system architecture diagram of the present invention;

FIG. 3 is a flow chart of the operation of the present invention.

Detailed Description

To further describe the technical features and effects of the present invention, the present invention will be further described with reference to the accompanying drawings and detailed description.

The invention analyzes the input document, constructs the document structure tree, extracts question-answer pairs with high quality as much as possible from the document, realizes the automatic generation of the question-answer pairs based on rules, provides a reliable solution for conveniently and efficiently constructing and managing the knowledge base, greatly improves the efficiency and the accuracy of the knowledge base retrieval, and promotes the intelligent question-answer system to be more efficiently and widely applied.

As shown in FIGS. 1-3, an intelligent question-answering system based on natural language processing includes a knowledge base construction module, a question-answer pair management module and a question-answer matching module; the knowledge base building module comprises a document preprocessing module, a document structure tree building module and a question and answer pair building module; the question-answer pair management module comprises a task management module, a document management module, a keyword management module and a question-answer pair operation module; the question-answer matching module is used for matching questions provided by the user with question-answer pairs created by the knowledge base generation module.

Once built, the knowledge base is stored in a database in the form of question-answer pairs. The back end uses a Tomcat server and a MySQL database, and the front end can be displayed on either a PC or a mobile phone, switching freely between the two.
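
One way to picture storage in question-answer-pair form is a single table indexed by keyword. The sketch below uses Python's built-in `sqlite3` purely as a stand-in for the MySQL database named above; the table name `qa_pair`, its columns, and the helper functions are assumptions for illustration, not the schema used by the invention.

```python
import sqlite3

SCHEMA = """
CREATE TABLE qa_pair (
    id       INTEGER PRIMARY KEY,
    task_id  TEXT NOT NULL,
    keyword  TEXT NOT NULL,
    question TEXT NOT NULL,
    answer   TEXT NOT NULL
);
CREATE INDEX idx_qa_keyword ON qa_pair (keyword);
"""


def open_store(path=":memory:"):
    """Create the (hypothetical) question-answer table in a fresh database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_pair(conn, task_id, keyword, question, answer):
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO qa_pair (task_id, keyword, question, answer) "
            "VALUES (?, ?, ?, ?)",
            (task_id, keyword, question, answer),
        )


def questions_for(conn, keyword):
    """Return every (question, answer) stored under one keyword."""
    rows = conn.execute(
        "SELECT question, answer FROM qa_pair WHERE keyword = ? ORDER BY id",
        (keyword,),
    )
    return rows.fetchall()
```

The index on `keyword` reflects the lookup order described in this document: find the keyword first, then search only the questions stored under it.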

In actual operation, an operator uploads a ZIP-format compressed file to be analyzed to the intelligent question-answering system. The system decompresses the file and passes the decompressed file path and the task ID to the document preprocessing module, which extracts the text from the files. A document structure tree is then built from the extracted text, and the question-answer pair construction module traverses the document structure tree, extracts keywords, constructs question-answer pairs, and finally stores them in the database.
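
The upload step might be sketched as below, assuming a random hexadecimal task ID and one working directory per task. The function name `accept_upload` and the ID scheme are hypothetical; the description only states that the decompressed path and a task ID are handed to the preprocessing module.

```python
import uuid
import zipfile
from pathlib import Path


def accept_upload(zip_path, work_dir):
    """Unpack an uploaded archive; return (task_id, sorted list of HTML files)."""
    task_id = uuid.uuid4().hex  # assumed task-ID scheme
    target = Path(work_dir) / task_id
    target.mkdir(parents=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    html_files = sorted(
        p for p in target.rglob("*") if p.suffix.lower() in {".html", ".htm"}
    )
    return task_id, html_files
```

The returned file list is what a preprocessing step would iterate over; non-HTML members of the archive are extracted but ignored here.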

Specifically, in the process of constructing question-answer pairs, the document is first preprocessed, that is, the effective information is extracted from the original HTML file. To remove interfering information, regular expressions are first used to filter useless information (mainly link tags, CSS, JS scripts, comments, and empty tag pairs) out of the received documents, outputting a file set OUT1;

then, repeated parts in the file set OUT1 are removed with a longest common subsequence algorithm to obtain a file set OUT2;

then, the file set OUT2 is classified according to the set granularity, and the common part of each class of documents is removed to obtain a file set OUT3 containing directories and texts;

and finally, the file set OUT3 is classified with a longest common substring algorithm, and the common parts of all classified documents are removed to obtain a text set OUT4.
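
The first filtering step (producing OUT1) can be illustrated with a few regular expressions. The patterns below are assumptions chosen to cover the categories listed above (link tags, CSS, JS scripts, comments, empty tag pairs); they are not the expressions used by the invention, and `filter_useless_markup` is a hypothetical name.

```python
import re

# Assumed patterns, one per category of useless markup.
_PATTERNS = [
    re.compile(r"<!--.*?-->", re.DOTALL),                                      # comments
    re.compile(r"<script\b[^>]*>.*?</script\s*>", re.DOTALL | re.IGNORECASE),  # JS
    re.compile(r"<style\b[^>]*>.*?</style\s*>", re.DOTALL | re.IGNORECASE),    # CSS
    re.compile(r"<link\b[^>]*>", re.IGNORECASE),                               # link tags
]
# An opening tag immediately followed (whitespace aside) by its own closing tag.
_EMPTY_PAIR = re.compile(r"<([a-zA-Z][\w-]*)\b[^>]*>\s*</\1\s*>")


def filter_useless_markup(html: str) -> str:
    for pattern in _PATTERNS:
        html = pattern.sub("", html)
    # Removing one empty pair can leave its parent empty, so repeat until stable.
    previous = None
    while previous != html:
        previous = html
        html = _EMPTY_PAIR.sub("", html)
    return html
```

Regular expressions are adequate for this kind of coarse stripping, but they do not parse HTML; malformed or unusual markup may need a real parser.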

(Note: the Longest Common Subsequence (LCS) problem is that of finding the longest subsequence common to all sequences in a set of sequences (usually two). A sequence is a longest common subsequence of the given sequences if it is a subsequence of each of them and is the longest of all sequences meeting this condition.)
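
Because the preprocessing uses both a longest common subsequence and a longest common substring, the difference is worth making concrete. The sketch below shows the textbook dynamic programme for the subsequence and uses the standard library's `difflib.SequenceMatcher.find_longest_match` for the substring; the function names are hypothetical, and how the invention uses the results to remove common parts is not shown.

```python
from difflib import SequenceMatcher


def longest_common_subsequence(a: str, b: str) -> str:
    """Textbook O(len(a) * len(b)) dynamic programme; characters need not be adjacent."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back from the bottom-right corner to recover one LCS.
    out, i, j = [], len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def longest_common_substring(a: str, b: str) -> str:
    """Longest run of adjacent characters that occurs in both strings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a : match.a + match.size]
```

A subsequence may skip characters, so it tolerates small insertions between two near-duplicate documents, whereas a substring must be contiguous and therefore isolates shared blocks such as a repeated header.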

After preprocessing is completed, construction of the document structure tree begins, including:

1) parsing the HTML source code (OUT4) of the text, and traversing it depth-first to build an HTML tree;

2) adjusting the structure of the constructed HTML tree so that the leaf nodes of the tree directly form the answer part of a question-answer pair, generating a document structure tree. (Because the documents may contain authoring errors, or may not have been authored in a way that follows the displayed layout, a small part of the document structure tree may not be generated accurately enough; the affected question-answer pairs are filtered out for manual review.)

The rule for generating the problem keyword structure tree is as follows:

a) traversing to a leaf node;

c) the child node has a branch and complies with the following decision rule:

c1) each child node is semantically similar (determined via the Baidu short-text similarity interface);

c2) The child subtrees are identical in structure.

And then constructing question-answer pairs on the basis, specifically comprising the following steps:

2) after the question is generated, if any part of the key words, the question sentences and the answers is null value when the question-answer pairs are constructed, the question-answer pairs are discarded;

The method for generating the problems specifically comprises the following steps:

Chinese word segmentation is performed on the problem keyword structure tree to construct custom lexicons, and questions are generated by a semantic template method. Leaf nodes are subtracted from the document structure tree to generate the problem keyword structure tree. It is first judged whether a subtree node contains, or exactly matches, a keyword of the custom lexicons ACML and BCML, and if so the word is deleted; it is then judged whether the subtree node contains a keyword of the verb lexicon VL or the attributive modifier lexicon AL, and the syntactic transformation is performed by category to generate the question.

The lexicons ACML, BCML, VL and AL are built by performing Chinese word segmentation with Stanford CoreNLP (an open-source natural language processing toolkit from Stanford University) and then manually selecting words within a certain threshold range as the content of each lexicon.

The detailed process for generating a question is as follows:

S0. For each node of the problem keyword structure tree, Chinese word segmentation is performed with Stanford CoreNLP, words within a certain threshold range are selected manually, and the custom lexicons are constructed: the class A nonsense lexicon (ACML), the class B nonsense lexicon (BCML), the verb lexicon (VL), and the attributive modifier lexicon (AL). The class A nonsense lexicon contains words such as "user guide", "welcome use", and "know"; when a node contains such a word, the redundant part is removed by deleting the word. The class B nonsense lexicon contains words for which the whole node contributes nothing to question generation, so the whole node is deleted.

S1. Set the granularity of effective problem keyword nodes to 4 (the value must be greater than 2), and select the first subtree.

S2. Pruning: traverse each node of the subtree; if the node contains Chinese punctuation or a word of the class A nonsense lexicon ACML (such as "help center" or "user guide"), delete the node directly; if the node contains a word of the class B nonsense lexicon BCML (such as "user guide", "welcome use", or "know"), keep the node and delete the word; otherwise, do nothing.

S3. Branch cutting: judge whether the depth of the subtree obtained after pruning in S2 is greater than the granularity of effective problem keyword nodes; if so, return a null value and go to S8; otherwise, continue with S4.

S4. Process the subtree according to its depth: if the subtree depth is 1, go to S5; if the subtree depth is 2, go to S6; otherwise, go to S7.

S5. Perform syntactic analysis on the current subtree. If the words of the node contain a word of the verb lexicon VL, the generated question structure Stc51 is:

"how" + <VL> + <the remaining words of the node after removing the verb, word order unchanged>

Otherwise, the generated question structure Stc52 is:

"what is" + <node 1>

Go to S8.

S6. The generated question structure Stc6 is:

"what is" + <node 2> + "of" + <node 1>

Go to S8.

S7. Judge whether the end node contains a word (normal or abnormal) of the attributive lexicon AL. If so, the generated question structure Stc71 is:

"what is" + <AL> + <node (length-1)> + "of" + <node 1> + <node 2> + … + <node (length-2)>

Otherwise, the generated question structure Stc72 is:

"what is" + <node (length)> + "of" + <node 1> + <node 2> + … + <node (length-1)>

Go to S8.

S8. If the next subtree is not empty, select the next subtree and go to S2; otherwise, the algorithm is complete and exits.
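
Steps S2 to S7 for a single subtree can be sketched as below. Several things here are assumptions made for illustration: a subtree is treated as a root-to-end chain of keyword nodes, the two nonsense lexicons are named by their effect (`drop_node_words` deletes the whole node, `strip_words` deletes only the word) rather than by class letter, the templates are rendered in English word order, and an empty chain after pruning returns `None`. All function and parameter names are hypothetical.

```python
import re
from typing import Optional, Sequence

GRANULARITY = 4  # S1: granularity of effective keyword nodes
# Chinese punctuation per S2; the ASCII marks are an added assumption
# so that English examples behave the same way.
_PUNCT = re.compile(r"[，。！？；：、,.!?;:]")


def prune(path: Sequence[str], drop_node_words, strip_words) -> list:
    """S2: delete nodes with punctuation or a drop-node word; strip filler words."""
    kept = []
    for node in path:
        if _PUNCT.search(node) or any(w in node for w in drop_node_words):
            continue
        for word in strip_words:
            node = node.replace(word, "")
        node = " ".join(node.split())  # tidy whitespace left by the deletion
        if node:
            kept.append(node)
    return kept


def generate_question(path, drop_node_words, strip_words, verbs, attributes) -> Optional[str]:
    nodes = prune(path, drop_node_words, strip_words)
    # S3: too deep returns null; an empty chain is treated the same (assumption).
    if not nodes or len(nodes) > GRANULARITY:
        return None
    if len(nodes) == 1:  # S5
        node = nodes[0]
        verb = next((v for v in verbs if v in node), None)
        if verb is not None:  # Stc51
            rest = " ".join(node.replace(verb, "", 1).split())
            return f"how {verb} {rest}".strip()
        return f"what is {node}"  # Stc52
    if len(nodes) == 2:  # S6, Stc6
        return f"what is {nodes[1]} of {nodes[0]}"
    *head, before_last, last = nodes  # S7
    if any(a in last for a in attributes):  # Stc71
        return f"what is {last} {before_last} of {' '.join(head)}"
    return f"what is {last} of {' '.join(head + [before_last])}"  # Stc72
```

For instance, the chain `["user guide", "broadband", "installation fee"]` with `"user guide"` in `strip_words` is pruned to two nodes and produces `"what is installation fee of broadband"`.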

Note: the lexicons are named as follows:

Class A nonsense lexicon: A Class of Meaningless Lexicon (ACML)

Class B nonsense lexicon: B Class of Meaningless Lexicon (BCML)

Verb lexicon: Verb Lexicon (VL)

Attributive lexicon: Attribute Lexicon (AL)

The matching process between the questions posed by the user and the question-answer pairs in the knowledge base is as follows:

S1. Receive a user question Q1;

S2. Perform inverted indexing of Q1 using the keyword set resident in memory:

S2.1. Compute the longest common subsequence (LCS) of Q1 and each keyword t;

S2.2. Calculate the matching rate of the question against each keyword in the keyword library as (character length of LCS(t, Q1)) / (character length of t), and take the keyword with the maximum value as the keyword of Q1;

S3. Use the keyword of Q1 to index the set of questions in the knowledge base that have the same keyword;

S4. Compute the short-text similarity between Q1 and each question in the set, and return to the user the answer corresponding to the maximum similarity.
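
The matching steps S1 to S4 can be sketched as follows. The knowledge base is assumed to be an in-memory `dict` mapping each keyword to its list of (question, answer) pairs, and `difflib.SequenceMatcher.ratio()` stands in for the short-text similarity measure, which the description does not specify. The names `lcs_length`, `match_rate` and `answer` are hypothetical.

```python
from difflib import SequenceMatcher


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, keeping only one table row."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch == other else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def match_rate(keyword: str, question: str) -> float:
    """S2.2: length of LCS(keyword, question) divided by the keyword length."""
    return lcs_length(keyword, question) / len(keyword) if keyword else 0.0


def answer(question: str, knowledge_base: dict):
    """knowledge_base maps keyword -> list of (stored question, answer)."""
    if not knowledge_base:
        return None
    # S2: the keyword with the highest matching rate becomes the keyword of Q1.
    keyword = max(knowledge_base, key=lambda t: match_rate(t, question))
    # S3: only the questions stored under that keyword are candidates.
    candidates = knowledge_base[keyword]
    if not candidates:
        return None
    # S4: pick the most similar stored question (stand-in similarity measure).
    best = max(candidates, key=lambda qa: SequenceMatcher(None, question, qa[0]).ratio())
    return best[1]
```

Dividing by the keyword length means a keyword that appears in full inside the question scores 1.0 regardless of how long the question is, which favors exact keyword mentions over partial overlaps.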

The invention has a wide range of applications and can provide a highly automated intelligent customer service system for various industries.

For company information queries, the directory levels of a typical enterprise website are numerous and complicated, and it is difficult for visitors to obtain the information they need in time. When employees need information from other departments, or want to look up questions about their own department, they can only leaf through materials and documents, which is time-consuming and inconvenient. Through the intelligent system, the company can accurately answer its customers' requests for information, save website visitors' query time, and attract more potential customers; it can also build a knowledge base from the company's internal documents and materials, making study and lookup easier for employees.

In the medical field, patients often know little about a hospital, cannot see a doctor in time, and do not know which department or which specialist to register with, or have questions about medication. The system can provide medical inquiry services, give patients hospital and medication information in time, make visits more convenient, help keep order in the hospital, and reduce misunderstandings and conflicts caused by information not being communicated in time.

In distance education, building the knowledge base of a distance education platform is costly; it is built entirely by hand, which takes time and labor. The system makes it more convenient to build a knowledge base for the platform and makes it easier for primary and middle school students to acquire subject knowledge.

The above embodiments do not limit the present invention in any way, and all technical solutions obtained by taking equivalent substitutions or equivalent changes fall within the scope of the present invention.

Claims

1. An intelligent question-answering system based on natural language processing is characterized by comprising a knowledge base construction module, a question-answering pair management module and a question-answering matching module; the knowledge base building module comprises a document preprocessing module, a document structure tree building module and a question and answer pair building module; the question-answer pair management module comprises a task management module, a document management module, a keyword management module and a question-answer pair operation module; the question-answer matching module is used for matching questions provided by the user with question-answer pairs created by the knowledge base generation module;

the document structure tree building module is used for building a document structure tree, and the building process comprises the following steps:

3) deeply traversing the document structure tree to generate a problem keyword structure tree;

the question-answer pair module is used for constructing question-answer pairs, and the construction process comprises the following steps:

performing depth-first traversal on the obtained document structure tree by the question-answer pair construction module, taking the obtained keyword set in each path as a question alternative keyword, traversing the father node of the leaf node to remove father node information to form an answer, and generating a keyword group-answer set;

after generating questions, when constructing question-answer pairs, if any part of keywords, question sentences and answers is null value, abandoning the question-answer pairs;

removing repeated question sentences, preliminarily obtaining question-answer pairs, taking root nodes as keywords, and if the keywords are not matched with the questions, generating the keywords as the keywords of the question-answer pairs by using a word segmentation and named entity extraction method;

when a pure question does not enter a question generation process in the traversal process, the question is directly used as a question, subordinate nodes are used as answers, a question-answer pair is formed, and a question-setting entity is extracted from the question to form keyword export.

2. The intelligent question-answering system based on natural language processing according to claim 1, wherein the document preprocessing module is used for filtering useless information in the document, and the filtering process comprises:

filtering useless information out of the received documents with regular expressions and outputting a file set OUT1;

3. The intelligent question-answering system based on natural language processing according to claim 1, wherein the rules for generating the question keyword structure tree are as follows:

a) traversing to a leaf node;

c) the child node has a branch and complies with the following decision rule: c1) Semantic approximation of each child node;

c2) the child subtrees are identical in structure.

4. The intelligent question-answering system based on natural language processing according to claim 1, wherein the generated question is specifically: performing Chinese word segmentation on the problem keyword structure tree to construct a custom word bank, and generating a question through a semantic template method: subtracting leaf nodes from the document structure tree to generate a problem keyword structure tree, firstly judging whether subtree nodes contain keywords of a user-defined word bank, and if the subtree nodes contain the keywords or are completely matched with the keywords, deleting the words; and then, judging whether the subtree nodes contain the keywords of the verb thesaurus and the fixed-language modifier thesaurus, classifying, performing syntactic transformation, and finally generating the problem.

5. The intelligent question answering system based on natural language processing according to claim 1, wherein the task management module is used for managing task issuing and task state monitoring; the document management is used for managing file uploading, file decompression and document group query; the question-answer pair operation module is used for managing the addition, deletion, modification and query operations of the question-answer pair.

6. The intelligent question-answering system based on natural language processing according to claim 1, wherein the question matching of the question-answering matching module comprises:

receiving a user question Q1;

performing inverted indexing of Q1 using the keyword set;

calculating the matching rate of Q1 and each keyword in the keyword library, and taking the keyword with the highest matching rate value as the keyword of Q1;

7. The intelligent question-answering system based on natural language processing according to claim 1, wherein the keyword extraction comprises: traversing all document titles and computing the frequency of all separators; selecting the special symbol with the highest frequency as the separator, splitting out the keywords and generating a word frequency map; and filtering out phrases with high word frequency, then segmenting the question and extracting the nouns or gerunds in it as the keywords of the question.