CN117171333A - Electric power file question-answering type intelligent retrieval method and system - Google Patents
- Publication number: CN117171333A
- Application number: CN202311451435.3A
- Authority: CN
- Country: China
- Prior art keywords: document, word, text, file, user
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of information retrieval, and specifically relates to a question-answering intelligent retrieval method and system for electric power files. To address the defect that existing retrieval methods cannot balance retrieval accuracy and diversity, the invention adopts the following technical scheme. A question-answering intelligent retrieval method for power files comprises the following steps: step S1, user semantic analysis, comprising: extracting user semantic concepts and expanding user semantics; step S2, document retrieval and processing, comprising: building a file database, measuring document similarity, and constructing an intent graph to represent the relationships between document data and query statements; step S3, answer extraction, comprising: presenting retrieval results according to the user's retrieval intent combined with traditional relevance features. The method and system overcome the disadvantage that traditional retrieval methods cannot balance exact matching and diversity matching.
Description
Technical Field
The invention belongs to the technical field of information retrieval, and particularly relates to a power file question-answering type intelligent retrieval method and system.
Background
In recent years, researchers in China and abroad have proposed many information retrieval methods based on different theories, improving information retrieval capability to some extent, but certain limitations remain.
Chinese patent application with publication number CN 113987146A discloses an intelligent question-answering system for an electric power intranet, comprising an intelligent question-answering module, a first control module, a second control module, and a third control module, where the intelligent question-answering module comprises an input module and an output module. The input module is used for the user to enter search content; the semantic understanding module performs semantic understanding of the search content; the file crawling and searching module crawls file data sources and builds file indexes; the database crawling and searching module crawls the business database; the application-module database outputs application-module data (at least including address links of application modules) according to the semantic understanding module's interpretation of the search content; and the output module outputs file indexes and/or service information and/or application-module data. The system can meet the need for refined search and improves the efficiency with which power users obtain the answers they require.
However, in power systems, accuracy and diversity are both very important, and an information retrieval method that combines them is urgently needed. A good search method needs to achieve the following: 1. improve the accuracy of information retrieval; 2. improve recommendation diversity on the basis of retrieval accuracy; 3. reduce redundancy in computation, increase computation speed, and optimize the user experience.
Disclosure of Invention
To address the defect that existing retrieval methods cannot achieve both retrieval accuracy and diversity, the invention provides a question-answering intelligent retrieval method and system for power files that delivers both accurate and diverse retrieval results. Further, retrieval speed is improved and the user experience is optimized.
In order to achieve the above purpose, the invention adopts the following technical scheme: a question-answering intelligent retrieval method for power files, comprising:
step S1, user semantic analysis, comprising the following steps: performing intention classification on the semantics by using a feature vector by adopting a classification algorithm to obtain a concept set of the user semantics, and realizing the extraction of the concept of the user semantics; synonym expansion is carried out on keywords in the concepts to obtain an expanded concept set, so that semantic expansion of users is realized;
step S2, document retrieval and processing, comprising: structuring the power files and building a file database; measuring document similarity by the intent-coverage similarity between documents, and constructing a document intent graph database;
step S3, answer extraction, including: completing preliminary matching of a user semantic expansion concept set and a document concept set according to the inverted index; coding all the related characters returned by sparse matching through a pre-training model; updating the query and the representation of each document on the intent graph to obtain a context-aware query representation and an intent-aware document representation, and presenting the search results in accordance with the user's search intent in combination with the relevance features.
The question-answering intelligent retrieval method for power files of the invention comprises a user semantic analysis step, a document retrieval and processing step, and an answer extraction step. The user semantic analysis step mainly faces the question-answer port, using machine learning to extract and expand user semantic concepts. The document retrieval and processing step mainly faces the data port: a file database is built by structuring the power-industry files, document similarity is measured by the intent-coverage similarity between documents, and an intent graph database based on document intent is constructed. The answer extraction step mainly faces the connection port: sparse vector matching is realized via natural language processing, machine-reading-comprehension question-answer matching is completed by combining a graph convolutional neural network with a BERT model, and the retrieval results are finally presented. The method can effectively improve the sensitivity and retrieval performance of the retrieval system, deeply understand and strengthen users' semantic expansion, and provide diversified retrieval results according to the user's intent while maintaining matching accuracy, finally overcoming the disadvantage that traditional retrieval methods cannot balance exact matching and diversity matching.
In the step S1, a support vector machine classification algorithm is adopted, TF-IDF is used as a feature vector to carry out intention classification on the semantics, and 1-gram and 2-gram models are adopted to obtain a concept set of the user semantics, so that the concept extraction of the user semantics is realized; and carrying out synonym expansion on keywords in the concepts based on the synonym table to obtain an expanded concept set, and realizing semantic expansion of users.
As an improvement, step S1 includes:
s11, performing feature representation by using TF-IDF;
step S12, training a support vector machine classifier to classify the intention;
and step S13, training the 1-gram and 2-gram models to obtain a concept set of user semantics.
As an improvement, in step S1, a synonym expansion is performed on keywords in a concept based on a synonym table, and an expanded concept set is obtained to implement semantic expansion of a user, including:
firstly, constructing a synonym table by using an existing professional dictionary or vocabulary library, wherein the synonym table comprises a group of synonyms or words of a paraphrasing;
secondly, jieba word segmentation is carried out on concepts or texts to be expanded, and the texts are segmented into single words;
then, for each word after word segmentation, searching whether corresponding synonyms exist in a synonym table, and if so, adding the synonyms into the concept to serve as expansion of the word;
and finally, merging the expanded synonyms with the keywords in the original concepts, and removing the repeated words.
As an improvement, step S2 includes:
step S21, constructing a file database, comprising:
first, collecting power files and converting them into a text format; preprocessing the text, including removing noise, punctuation marks and stop words; analyzing the text content in the text file, and extracting the characters in the file;
second, text is processed using natural language processing, including: segmenting the file's text content into chapters, clauses, and paragraphs, and identifying the entities involved in the file, such as organization names, place names, and dates, to help further organize and classify the file;
then, extracting keywords from the file, and finding out core words and topics in the file so as to facilitate subsequent retrieval and classification;
and finally, establishing a file database according to the analyzed and processed text content, storing the structured file data by using the relational database, and establishing an index for the data in the file database so as to quickly search and inquire.
As an improvement, step S2 includes:
step S22, constructing a document intent graph database, comprising the following steps:
firstly, classifying similarity of willingness of documents through a pre-training language model, and judging whether willingness coverage association exists between two documents or not;
secondly, selecting a neo4j graph database model to store association data and willingness similarity between documents;
then, each document in the document data is represented as a node in the graph database, and an edge is created between every two similar documents;
finally, the preprocessed document data and the similarity relation are imported into a graph database.
As an improvement, step S3 includes:
step S31, performing primary matching on the inverted index, including:
firstly, processing each document in a document set to generate a corresponding inverted index, wherein the structure of the inverted index is a word list, each word corresponds to one or more document IDs, and the document IDs are documents containing the word;
secondly, matching concept words in the expanded user query with word lists in the inverted index, and finding out a corresponding document ID list for each concept word;
finally, according to the matched document ID, corresponding document content or concept sets are acquired, wherein the document content or concept sets contain information related to the query intention of the user.
As an improvement, step S3 includes:
step S32, encoding by using a BERT model, including:
firstly, word segmentation is carried out on an input text, and is completed by using a special word segmentation device of a pre-training model, wherein the word segmentation result is a series of words or sub-words, and each word corresponds to a number and is used for subsequent input representation;
secondly, the BERT model adds special marks to the text input so that the model can distinguish the beginning and end of sentences;
then, the BERT model adopts a Transformer encoder to encode the words of the text layer by layer, obtaining a vector representation of each word;
in BERT, the vector of each word is composed of its original word vector and a position encoding, and the Transformer encoder deep-encodes the text through multi-layer self-attention and feed-forward neural networks to capture context information; the attention is calculated as follows:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V,

where Q is the query vector, K is the key vector, V is the value vector, and d_k represents the dimension of the input vectors.

In the attention calculation, the three calculation vectors — query, key, and value — are formed by linear transformations of the input; attention scores between every pair of sequence positions are computed from these three vectors, and the keywords are used to find the most relevant search results. On the other hand, to give the attention computation richer levels, multi-head attention computes the same input in attention layers with different angles, expressing the association logic and attention characteristics between sequences in different subspaces, so as to obtain different output results and understandings.
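As an illustrative sketch of the scaled dot-product attention just described — with toy random inputs, not the invention's actual model — the following NumPy code computes Attention(Q, K, V) = softmax(QK^T/√d_k)V for a single head; multi-head attention runs several such heads in parallel over differently projected inputs.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # pairwise attention scores
    weights = softmax(scores, axis=-1)    # each query's weights sum to 1
    return weights @ V, weights

# toy example: 3 query positions, 4 key/value positions, d_k = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```

In a real Transformer layer, Q, K, and V are produced from the same input by learned linear projections, one set per head.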
As an improvement, step S3 includes:
step S33, obtaining a query representation and a document representation by using a graph convolution neural network and calculating a diversity score, wherein the step comprises the following steps:
the diversity features extracted by the graph convolution neural network are used to generate a diversity score for the documents, the document nodes updating their representations by the information collected from their neighbors, specifically formulated as follows:
,
wherein,is the identifier of each layer in the graph convolutional neural network,/for each layer>Is an undirected intended adjacency matrix added to the self-loop, < >>For the degree matrix->Is a nodeFeature matrix, wherein->Is the dimension of the node feature, +.>Is->Layer-specific trainable weight matrix of a layer, < ->Representing an activation function, typically a ReLU function;
based on diversity features extracted from current intent graphTo calculate a diversity score, expressed as:
,
wherein,is a multi-layer perceptron (Multilayer Perceptron);
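The propagation rule above can be sketched as follows; the adjacency matrix, node features, and weights are toy stand-ins, and the diversity-score head is reduced to a simple sum in place of the trained multi-layer perceptron.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D~^-1/2 (A + I) D~^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])            # adjacency with self-loops
    d = A_tilde.sum(axis=1)                      # node degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    H_next = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W
    return np.maximum(H_next, 0.0)               # ReLU activation

# toy intent graph: 4 documents, edges where intents overlap
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.eye(4)                  # one-hot node features
W = np.full((4, 2), 0.5)       # trainable weights (fixed here for illustration)
H1 = gcn_layer(A, H, W)

# stand-in for the MLP head: reduce each node's features to one score
scores = H1.sum(axis=1)
```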
and S34, integrating the inverted index result and the diversity score result to obtain the integrated ranking of the search documents considering both accuracy and diversity, and displaying the integrated ranking to the user according to the score size.
A power file question-and-answer intelligent retrieval system, the power file question-and-answer intelligent retrieval system comprising:
the user semantic analysis module is used for: performing intention classification on the semantics by using a feature vector by adopting a classification algorithm to obtain a concept set of the user semantics, and realizing the extraction of the concept of the user semantics; synonym expansion is carried out on keywords in the concepts to obtain an expanded concept set, so that semantic expansion of users is realized;
a document retrieval and processing module for: structuring the power files and building a file database; measuring document similarity by the intent-coverage similarity between documents, and constructing a document intent graph database;
Answer extraction module for: completing preliminary matching of a user semantic expansion concept set and a document concept set according to the inverted index; coding all the related characters returned by sparse matching through a pre-training model; updating the query and the representation of each document on the intent graph to obtain a context-aware query representation and an intent-aware document representation, and presenting the search results in accordance with the user's search intent in combination with conventional relevance features.
The power file question-answering type intelligent retrieval method and system can effectively improve the sensitivity and retrieval performance of the retrieval system, deeply understand and strengthen the semantic expansion of users, provide diversified retrieval results according to the user will on the basis of keeping the matching accuracy, and finally overcome the disadvantage that the traditional retrieval method cannot give consideration to both accurate matching and diversity matching; the machine reading understanding question-answer matching is completed through the graphic neural network and the natural language processing technology, and the semantic matching and the intention matching are fused, so that diversified accurate retrieval results are presented.
Drawings
FIG. 1 is a flow chart of an intelligent retrieval method of an embodiment of the present invention.
Fig. 2 is a schematic diagram of a Support Vector Machine (SVM) classification principle in an embodiment of the present invention.
FIG. 3 is a network diagram of a BERT pre-training model of an embodiment of the invention.
FIG. 4 is a graph of a graph convolutional neural network calculation according to an embodiment of the present invention.
Fig. 5 is a block diagram of an intelligent retrieval system according to an embodiment of the invention.
Detailed Description
The following describes the technical solutions in the embodiments of the invention; the described embodiments are only preferred embodiments of the invention, not all of them. All other embodiments obtained by a person skilled in the art based on these embodiments without creative effort fall within the protection scope of the invention.
Referring to fig. 1 to 5, an intelligent search method for electric power files according to an embodiment of the present invention includes:
step S1, user semantic analysis, comprising the following steps: performing intention classification on the semantics by using a feature vector by adopting a classification algorithm to obtain a concept set of the user semantics, and realizing the extraction of the concept of the user semantics; synonym expansion is carried out on keywords in the concepts to obtain an expanded concept set, so that semantic expansion of users is realized;
step S2, document retrieval and processing, comprising the following steps: carrying out structuring treatment on the power file, and establishing a file database; measuring the similarity of the documents by covering the similarity of the intents among the documents, and extracting more accurate document diversity relation; establishing a classifier to judge whether two different documents contain the same or similar intents, and constructing an intent graph to represent the relationship between document data and query sentences;
step S3, answer extraction, including: completing preliminary matching of a user semantic expansion concept set and a document concept set according to the inverted index; coding all the related characters returned by sparse matching through a pre-training model; updating the query and the representation of each document on the intent graph to obtain a context-aware query representation and an intent-aware document representation, and presenting the search results in accordance with the user's search intent in combination with conventional relevance features.
Referring to fig. 1, the main steps of the power file question-answer intelligent search method of the present embodiment include user semantic analysis, document search and processing, and answer extraction. The user semantic analysis step mainly comprises user semantic concept extraction and user semantic expansion. The document retrieval and processing steps mainly include creating a document database, constructing an intent graph to represent the relationships between document data and query statements. The answer extraction step mainly comprises the steps of preliminary matching of a user semantic expansion concept set and a document concept set, encoding related words, updating the query and the representation of each document on an intent, and presenting a search result according to the search intent of the user and combining traditional relevance features.
In the embodiment, in step S1, a support vector machine (Support Vector Machine, SVM) classification algorithm is adopted, TF-IDF is used as a feature vector to carry out intention classification on the semantics, and 1-gram and 2-gram models are adopted to obtain a concept set of the user semantics, so that user semantic concept extraction is realized; and carrying out synonym expansion on keywords in the concepts based on the synonym table to obtain an expanded concept set, and realizing semantic expansion of users.
In this embodiment, step S1 includes:
step S11, adopting TF-IDF to perform characteristic representation,
specifically, TF refers to the frequency of occurrence of a word in one text, IDF refers to the importance of a word to the text in the entire set, expressed as:
TF = total number of occurrences of a certain word in the text/total number of words of the text,
idf=log (total number of text in corpus/(number of text containing the word + 1)),
TF-IDF=TF*IDF,
the final TF-IDF code considers the importance of each word in terms of its frequency of occurrence in the text and the importance in the entire text set, thereby representing a text with a vector;
Taking the three query texts "I want to query the energy management platform", "Is there a charging regulation for the latest electricity price?", and "Are electricity charges consistent across different areas of the same city?" as examples, the TF-IDF values of the corresponding words are calculated and the texts are encoded. The calculated TF-IDF values are shown in the following table.
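A minimal sketch of the TF and IDF formulas above, applied to hypothetical tokenized stand-ins for the three example queries (the actual word segmentation and corpus are not reproduced here):

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF per the formulas above:
    TF  = occurrences of the word in the text / total words in the text
    IDF = log(total texts in corpus / (texts containing the word + 1))
    """
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))   # document frequency
    vectors = []
    for d in docs:
        tf = Counter(d)
        total = len(d)
        vectors.append({w: (c / total) * math.log(n / (df[w] + 1))
                        for w, c in tf.items()})
    return vectors

# hypothetical tokenized queries standing in for the three examples
docs = [["query", "energy", "platform"],
        ["electricity", "price", "regulation"],
        ["electricity", "charge", "city", "consistent"]]
vecs = tf_idf(docs)
```

Note that with the +1 smoothing in the IDF denominator, a word appearing in two of the three texts gets IDF log(3/3) = 0, so only corpus-distinctive words carry weight.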
Step S12, training a support vector machine classifier to classify the intention,
referring to fig. 2, the goal of the support vector machine model is to find a hyperplane to separate the two types of data points. Specifically, given a set of training samplesWherein->Representing input feature vectors, ++>Representing the corresponding category label->The aim of a two-class support vector machine is to find a hyperplane +.>The following conditions are satisfied:
for all belonging to a categorySample->There is->,
For all belonging to a categorySample->There is->,
Wherein,is the normal vector,/->Is the intercept point of the beam,
in this process, it is desirable to maximize the distance of the support vector to the hyperplane, such hyperplane being referred to as the maximum-interval hyperplane, the optimization problem of the maximum interval is expressed as:
wherein,for->The vector is subjected to a dot product,
for all training samples, the constraints are:
,
the multi-classification problem is solved by adopting a one-to-many strategy, in the one-to-many strategy, each category is independently used as one category, K classification support vector machine models are constructed, each classification model is used for distinguishing one category from all other categories, and for the first categoryThe samples of which are marked as positive examples, and the samples of the other K-1 categories are marked as negative examples, each category being +.>The representation is made of a combination of a first and a second color,
: category->As positive examples, all the remaining categories are negative examples,
: category->As positive examples, all the remaining categories are negative examples,
...
: category->As positive examples, all the remaining categories are negative examples,
in the training stage, training each two-class support vector machine model to obtain a corresponding weight vector and a bias term;
when prediction is carried out, inputting a new sample into each support vector machine model, and then selecting the category with the highest output score as a final prediction result;
the coded query text is subjected to SVM topic classification, and topics can be extracted. The text is classified into a theme framework related to energy cost.
Step S13, training 1-gram and 2-gram models to obtain concept sets of user semantics,
specifically, the probability of a sentence segment occurring in the corpus is calculated with a statistical language model; by Bayesian chain decomposition, it is the product of conditional probabilities of each word appearing given the words before it, expressed as:

P(w_1 w_2 … w_T) = ∏ over t of P(w_t | w_1 … w_{t−1}),

where each conditional probability is estimated from counts as:

P(w_t | w_1 … w_{t−1}) = count(w_1 … w_t) / count(w_1 … w_{t−1}),

where w_1 … w_t represents the word string from the first to the t-th word in the sentence segment and count(·) represents the number of occurrences of a word string. Evidently the probability of a word is related to all words preceding it; assuming each word is related only to the previous n−1 words, this can be converted into the form:

P(w_t | w_1 … w_{t−1}) ≈ P(w_t | w_{t−n+1} … w_{t−1}).

The 1-gram and 2-gram models are the special cases n = 1 and n = 2.
In this embodiment, in step S1, synonym expansion is performed on keywords in the concept based on a synonym table, so as to obtain an expanded concept set to implement semantic expansion of a user, including:
first, a synonym table is constructed using an existing specialized dictionary or lexicon, which contains a set of synonyms or paraphrased words. The synonym table is shown in the following table.
Secondly, jieba word segmentation is carried out on concepts or texts to be expanded, and the texts are segmented into single words. The word segmentation results are shown in the following table.
And then, for each word after word segmentation, searching whether corresponding synonyms exist in a synonym table, and if so, adding the synonyms into the concept as expansion of the word. The synonym expansion results are shown in the following table.
And finally, merging the expanded synonyms with the keywords in the original concepts, and removing the repeated words.
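The four expansion steps above can be sketched as follows; the synonym table and token list are hypothetical, and the jieba segmentation step is replaced with a precomputed token list for brevity:

```python
def expand_with_synonyms(tokens, synonym_table):
    """Look up each segmented word in the synonym table, append any
    synonyms found, then merge and de-duplicate (order-preserving)."""
    expanded = []
    for w in tokens:
        expanded.append(w)
        expanded.extend(synonym_table.get(w, []))
    seen, result = set(), []
    for w in expanded:
        if w not in seen:
            seen.add(w)
            result.append(w)
    return result

# hypothetical domain synonym table (a real one is built from a
# professional dictionary or vocabulary library)
synonyms = {"electricity price": ["tariff", "power rate"],
            "query": ["search", "look up"]}
tokens = ["query", "electricity price", "policy"]   # stand-in for jieba output
concepts = expand_with_synonyms(tokens, synonyms)
```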
In this embodiment, step S2 includes:
step S21, constructing a file database, comprising:
first, collecting power files and converting them into a text format; preprocessing the text, including removing noise, punctuation marks and stop words; analyzing the text content in the text file, and extracting the characters in the file;
next, text is processed using natural language processing (Natural Language Processing, NLP), including: segmenting the text content of the file, dividing the text content into chapters, clauses and paragraphs, identifying entities involved in the file, such as organization names, place names and dates, and helping to further organize and classify the file;
then, extracting keywords from the file to find out core words and topics in the file, which is helpful for subsequent retrieval and classification;
and finally, establishing a file database according to the analyzed and processed text content, storing the structured file data by using the relational database, and establishing an index for the data in the file database so as to quickly search and inquire.
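As a minimal sketch of the final step — storing structured file data in a relational database and indexing it for fast lookup — the following uses SQLite as a stand-in for whatever relational database an implementation would choose; the titles and bodies are hypothetical:

```python
import sqlite3

# store structured file records and index them for lookup
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE power_files (
    id INTEGER PRIMARY KEY,
    title TEXT, section TEXT, body TEXT)""")
conn.execute("CREATE INDEX idx_title ON power_files(title)")

rows = [(1, "Tariff Regulation", "Ch.1", "electricity price adjustment rules"),
        (2, "Outage Notice", "Ch.2", "planned maintenance schedule")]
conn.executemany("INSERT INTO power_files VALUES (?,?,?,?)", rows)

# a simple substring query; a production system would use full-text search
hits = conn.execute(
    "SELECT id, title FROM power_files WHERE body LIKE ?",
    ("%electricity price%",)).fetchall()
```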
In this embodiment, step S2 includes:
step S22, constructing a document intent graph database, comprising the following steps:
firstly, classifying similarity of willingness of documents through a pre-training language model, and judging whether willingness coverage association exists between two documents or not;
secondly, selecting a neo4j graph database model to store association data and willingness similarity between documents;
then, each document in the document data is represented as a node in the graph database, and an edge is created between every two similar documents;
finally, the preprocessed document data and the similarity relation are imported into a graph database.
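A sketch of this graph-loading step: rather than connecting to a live neo4j instance, the code below builds the Cypher statements that would create one node per document and one similarity edge per classifier-confirmed pair (the document titles and the 0.87 score are hypothetical):

```python
def build_intent_graph_cypher(docs, similar_pairs):
    """Emit the Cypher statements that would load the intent graph into
    neo4j: one node per document, one SIMILAR_INTENT edge per pair."""
    stmts = [f"MERGE (d{i}:Document {{id: {i}, title: '{t}'}})"
             for i, t in docs]
    stmts += [f"MATCH (a:Document {{id: {i}}}), (b:Document {{id: {j}}}) "
              f"MERGE (a)-[:SIMILAR_INTENT {{score: {s}}}]->(b)"
              for i, j, s in similar_pairs]
    return stmts

docs = [(1, "Tariff Regulation"), (2, "Price Adjustment Notice")]
pairs = [(1, 2, 0.87)]   # classifier output: docs 1 and 2 share an intent
statements = build_intent_graph_cypher(docs, pairs)
```

In a real deployment these statements would be run through the neo4j Python driver inside a session/transaction.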
In this embodiment, step S3 includes:
step S31, performing primary matching on the inverted index, including:
firstly, processing each document in a document set to generate a corresponding inverted index, wherein the structure of the inverted index is a word list, each word corresponds to one or more document IDs, and the document IDs are documents containing the word;
secondly, matching concept words in the expanded user query with word lists in the inverted index, and finding out a corresponding document ID list for each concept word;
finally, according to the matched document ID, corresponding document content or concept sets are acquired, wherein the document content or concept sets contain information related to the query intention of the user. The matching results are shown in the following table.
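The inverted-index construction and primary matching described above can be sketched as follows; the toy documents and query words are assumptions introduced for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def match(index, concept_words):
    """Return, per concept word, the sorted list of matching document IDs."""
    return {w: sorted(index.get(w, set())) for w in concept_words}

docs = {
    1: "transformer overload protection",
    2: "line overload alarm",
    3: "voltage regulation",
}
index = build_inverted_index(docs)
result = match(index, ["overload", "voltage"])
print(result)  # {'overload': [1, 2], 'voltage': [3]}
```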
In this embodiment, step S3 includes:
and step S32, encoding by adopting a BERT (Bidirectional Encoder Representations from Transformers) model.
The BERT pre-training model is based on bidirectional training of the Transformer model. By performing unsupervised pre-training on a large-scale text corpus, it learns a generic representation of natural language. Referring to the BERT pre-training model network diagram of fig. 3, Trm is the attention calculation module of the Transformer model, E1-En represents the input text sequence, each E learns the representation of a word using bidirectional context, the input E is encoded by a bidirectional attention mechanism that takes into account the context information on the left and right sides of the word, and the resulting T1-Tn represents the encoded output sequence. This method fully reflects the different dependency relations within the corresponding range, so that the representation is richer and the semantics more accurate.
Specifically, the coding process of the BERT model includes:
firstly, the input text is tokenized using the dedicated tokenizer of the pre-training model; the tokenization result is a series of words or sub-words, each corresponding to an ID used for the subsequent input representation;
secondly, the BERT model adds special tags to the text input so that the model can distinguish the beginning and end of sentences, e.g., a [CLS] tag is prepended to each text, representing the beginning of a sentence, and a [SEP] tag is appended at the end of each sentence, representing its end;
then, the BERT model adopts a Transformer encoder to encode the words of the text layer by layer, obtaining a vector representation of each word;
in BERT, the vector of each word is composed of its original word vector (word embedding) and position encoding (positional encoding), and the Transformer encoder deep-encodes the text through multi-layer self-attention and feed-forward neural networks to capture context information; the attention is calculated as follows:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q$ is the query vector, $K$ is the key vector, $V$ is the value vector, and $d_k$ represents the dimension of the input vectors;
in the attention calculation, the three calculation vectors query, key and value are formed through linear transformations, the attention scores between every pair of sequence positions are computed from these three vectors, and the keywords are used to find the most relevant search results; in addition, to give the attention computation richer levels, the same input is computed in attention heads of different angles, expressing the association logic and attention characteristics between sequences in different spaces, thereby obtaining different output results and understandings.
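The scaled dot-product attention formula above can be traced with a tiny numeric example in plain Python; the 2-dimensional keys and the values are assumptions chosen to keep the arithmetic visible.

```python
import math

def softmax(row):
    m = max(row)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(r, c)) for c in zip(*b)] for r in a]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for small Python lists."""
    d_k = len(K[0])
    K_t = [list(c) for c in zip(*K)]             # transpose of K
    scores = matmul(Q, K_t)                       # Q K^T
    scores = [[v / math.sqrt(d_k) for v in row] for row in scores]
    weights = [softmax(row) for row in scores]    # attention weights
    return matmul(weights, V)                     # weighted sum of values

Q = [[1.0, 0.0]]                  # one query, aligned with the first key
K = [[1.0, 0.0], [0.0, 1.0]]      # two keys
V = [[10.0], [20.0]]              # their values
out = attention(Q, K, V)          # leans toward V[0], i.e. a value near 13.3
```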
In this embodiment, step S3 includes:
step S33, acquiring a query representation and a document representation by using a graph convolution neural network (Graph Convolutional Network, GCN) and calculating a diversity score.
Referring to fig. 4, the graph convolutional neural network learns node representations by convolution on a large heterogeneous graph consisting of the document intent graph and the search text. Each node has a feature vector representing its attributes or features, and edges represent relationships between nodes. Nodes a-f in the figure represent text vectors; the graph convolutional network performs a weighted aggregation of each node's neighbor information to update the corresponding node's representation, and a self-loop is added so that the target node's own state also influences the update. In fig. 4, node a gathers the information of the whole graph through two layers of aggregation; convolution layers can be stacked to capture the relationships between nodes at different distances, learning more complex graph structural features.
Specifically, the process of obtaining a query representation and a document representation and calculating a diversity score using a graph convolution neural network includes:
the diversity features extracted by the graph convolutional neural network are used to generate a diversity score for the documents; the document nodes update their representations with the information collected from their neighbors, formulated as follows:

$$H^{(l+1)}=\sigma\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right),$$

where $l$ is the layer index in the graph convolutional network, $\tilde{A}=A+I$ is the undirected intent adjacency matrix with self-loops added, $\tilde{D}$ is its degree matrix, $H^{(0)}=X\in\mathbb{R}^{N\times d}$ is the node feature matrix with $d$ the dimension of the node features, $W^{(l)}$ is the layer-specific trainable weight matrix of the $l$-th layer, and $\sigma$ represents an activation function, typically ReLU;

based on the diversity features $h$ extracted from the current intent graph, the diversity score is calculated as:

$$s_{\mathrm{div}}=\mathrm{MLP}(h),$$

where MLP is a multi-layer perceptron.
The diversity score calculation results are shown in the following table.
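The GCN propagation rule above can be sketched for a toy two-node graph; the adjacency matrix, features, and weight are assumptions chosen so the normalization arithmetic is easy to follow.

```python
import math

def gcn_layer(A, H, W):
    """One graph-convolution layer: ReLU(D^-1/2 (A+I) D^-1/2 H W),
    written with plain Python lists for a tiny toy graph."""
    n = len(A)
    # add self-loops: A_tilde = A + I
    A_tilde = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    # symmetric normalization D^-1/2 A_tilde D^-1/2
    norm = [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

    def matmul(X, Y):
        return [[sum(a * b for a, b in zip(r, c)) for c in zip(*Y)] for r in X]

    Z = matmul(matmul(norm, H), W)
    return [[max(0.0, v) for v in row] for row in Z]  # ReLU

A = [[0, 1], [1, 0]]   # two connected nodes
H = [[1.0], [2.0]]     # 1-d node features
W = [[1.0]]            # illustrative trainable weight
H_out = gcn_layer(A, H, W)   # both nodes average toward 1.5
```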
And S34, integrating the inverted index result and the diversity score result to obtain a comprehensive ranking of the retrieved documents that considers both accuracy and diversity, displayed to the user in descending order of score. The results are shown in the following table.
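One simple way to realize the integration in step S34 is a weighted combination of the two score sets; the patent does not specify the fusion rule, so the linear weighting and the value of `alpha` below are assumptions for illustration only.

```python
def fuse_rankings(relevance, diversity, alpha=0.7):
    """Combine inverted-index relevance scores with GCN diversity scores
    into one ranking. The linear weighting and alpha are illustrative
    assumptions, not taken from the patent."""
    combined = {
        d: alpha * relevance.get(d, 0.0) + (1 - alpha) * diversity.get(d, 0.0)
        for d in set(relevance) | set(diversity)
    }
    # descending order of the fused score
    return sorted(combined, key=combined.get, reverse=True)

relevance = {"d1": 0.9, "d2": 0.5, "d3": 0.4}   # sparse-matching scores
diversity = {"d1": 0.1, "d2": 0.8, "d3": 0.9}   # GCN diversity scores
ranking = fuse_rankings(relevance, diversity)
print(ranking)  # ['d1', 'd2', 'd3']
```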
The power file question-answering intelligent retrieval method comprises a user semantic analysis step, a document retrieval and processing step and an answer extraction step. The user semantic analysis step mainly faces the question-answering port and adopts machine learning to realize the extraction and expansion of user semantic concepts. The document retrieval and processing step mainly faces the data port: a file database is constructed by structuring the power-industry files, document similarity is measured by the intent-coverage similarity between documents, and an intent graph database based on document intent is constructed. The answer extraction step mainly faces the connection port: sparse vector matching is realized by natural language processing, machine-reading-comprehension question-answer matching is completed by combining a graph convolutional neural network with the BERT model, and finally the retrieval results are presented. The method effectively improves the sensitivity and retrieval performance of the retrieval system, deeply understands and enhances user semantic expansion, and provides diversified retrieval results according to user intent while maintaining matching accuracy, thereby overcoming the disadvantage that traditional retrieval methods cannot balance exact matching and diversity matching; combining the graph convolutional neural network with the BERT model reduces computational redundancy, improves computation speed, and optimizes user experience.
Referring to fig. 1 to 5, the embodiment of the invention also provides a power file question-answer type intelligent retrieval system, which comprises:
the user semantic analysis module is used for: performing intention classification on the semantics by using a feature vector by adopting a classification algorithm to obtain a concept set of the user semantics, and realizing the extraction of the concept of the user semantics; synonym expansion is carried out on keywords in the concepts to obtain an expanded concept set, so that semantic expansion of users is realized;
a document searching and processing module for: carrying out structuring treatment on the power file, and establishing a file database; measuring the similarity of the documents by covering the similarity of the intents among the documents, and extracting more accurate document diversity relation; establishing a classifier to judge whether two different documents contain the same or similar intents, and constructing an intent graph to represent the relationship between document data and query sentences;
answer extraction module for: completing preliminary matching of a user semantic expansion concept set and a document concept set according to the inverted index; coding all the related characters returned by sparse matching through a pre-training model; updating the query and the representation of each document on the intent graph to obtain a context-aware query representation and an intent-aware document representation, and presenting the search results in accordance with the user's search intent in combination with conventional relevance features.
Referring to FIG. 5, the answer extraction module constructs a heterogeneous graph based on the document intent graph and the query text, wherein X1-Xn in the figure represent different document nodes and Xq represents the query text node. On one hand, feature expression vectors of the different texts, namely the relevance feature vectors, are obtained on the basis of the BERT model embedding; on the other hand, a selected document d2 is obtained according to the content of the query text, the overall intent graph is adjusted accordingly, and the adjusted intent graph is input into the graph convolutional neural network, where Conv1 and Conv2 in the figure represent different graph convolution layers; the semantic representations of all documents are updated and aggregated to obtain the corresponding aggregated feature vectors, with Z1-Zn representing the updated document nodes and Zq the updated query text node. The original relevance feature vectors and the aggregated updated feature vectors are then input into a multi-layer perceptron network, and the outputs are concatenated to obtain the final diversity representation.
While the invention has been described in terms of specific embodiments, it will be apparent to those skilled in the art that the invention is not limited to the specific embodiments described. Any modifications which do not depart from the functional and structural principles of the present invention are intended to be included within the scope of the appended claims.
Claims (10)
1. An intelligent search method for a question-answer type electric power file is characterized by comprising the following steps of: the power file question-answer type intelligent retrieval method comprises the following steps:
step S1, user semantic analysis, comprising the following steps: performing intention classification on the semantics by using a feature vector by adopting a classification algorithm to obtain a concept set of the user semantics, and realizing the extraction of the concept of the user semantics; synonym expansion is carried out on keywords in the concepts to obtain an expanded concept set, so that semantic expansion of users is realized;
step S2, document retrieval and processing, comprising the following steps: carrying out structuring treatment on the power file, and establishing a file database; measuring the similarity of the documents by covering the similarity of the intents among the documents, and constructing a document intent graph database;
step S3, answer extraction, including: completing preliminary matching of a user semantic expansion concept set and a document concept set according to the inverted index; coding all the related characters returned by sparse matching through a pre-training model; updating the query and the representation of each document on the intent graph to obtain a context-aware query representation and an intent-aware document representation; and presenting the search result according to the search intention of the user and the correlation characteristic.
2. The intelligent search method for question-answering of a power file according to claim 1, wherein the method comprises the following steps: in the step S1, a support vector machine classification algorithm is adopted, TF-IDF is used as a feature vector to carry out intention classification on the semantics, and a 1-gram model and a 2-gram model are adopted to obtain a concept set of the user semantics, so that the concept extraction of the user semantics is realized; and carrying out synonym expansion on keywords in the concepts based on the synonym table to obtain an expanded concept set, and realizing semantic expansion of users.
3. The intelligent search method for question-answering of the power file according to claim 2, wherein the method comprises the following steps: the step S1 comprises the following steps:
step S11, adopting TF-IDF to perform characteristic representation,
specifically, TF refers to the frequency of a word in one text, and IDF refers to the importance of the word across the entire text set, expressed as:

TF = (number of occurrences of the word in the text) / (total number of words in the text),

IDF = log(total number of texts in the corpus / (number of texts containing the word + 1)),

TF-IDF = TF × IDF,

the final TF-IDF encoding considers both each word's frequency within a text and its importance across the whole text set, thereby representing a text as a vector;
step S12, training a support vector machine classifier to classify the intention,
specifically, given a set of training samples $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ represents the input feature vector and $y_i \in \{+1, -1\}$ the corresponding category label, the goal of the binary support vector machine is to find a hyperplane $w^{\top}x + b = 0$ satisfying the following conditions:

for all samples $x_i$ belonging to the category $y_i = +1$: $w^{\top}x_i + b \geq 1$;

for all samples $x_i$ belonging to the category $y_i = -1$: $w^{\top}x_i + b \leq -1$;

where $w$ is the normal vector and $b$ is the intercept.

In this process, the distance from the support vectors to the hyperplane is to be maximized; such a hyperplane is called the maximum-margin hyperplane, and the maximum-margin optimization problem is expressed as:

$$\min_{w,b}\ \frac{1}{2}\,w^{\top}w,$$

where $w^{\top}w$ is the dot product of $w$ with itself;

for all training samples, the constraint is:

$$y_i\left(w^{\top}x_i + b\right) \geq 1,\quad i = 1,\dots,n;$$
the multi-class problem is solved with a one-vs-rest strategy: K binary support vector machine models are constructed, each classification model distinguishing one category from all other categories; for the $k$-th model, samples of category $k$ are marked as positive examples and the samples of the remaining K-1 categories as negative examples, each model being represented as $f_k$:

$f_1$: category 1 as positive examples, all remaining categories as negative examples;

$f_2$: category 2 as positive examples, all remaining categories as negative examples;

...

$f_K$: category K as positive examples, all remaining categories as negative examples;
in the training stage, training each two-class support vector machine model to obtain a corresponding weight vector and a bias term;
when prediction is carried out, inputting a new sample into each support vector machine model, and then selecting the category with the highest output score as a final prediction result;
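The one-vs-rest prediction step can be sketched as follows. The three intent-class names and the trained weight vectors/biases are illustrative assumptions standing in for the K trained binary SVMs.

```python
def ovr_predict(models, x):
    """One-vs-rest prediction: score the sample with each binary model
    (here linear scorers w.x + b standing in for trained SVMs) and
    return the class with the highest output score."""
    def score(w, b):
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    scores = {label: score(w, b) for label, (w, b) in models.items()}
    return max(scores, key=scores.get)

# Illustrative (weight vector, bias) pairs for 3 hypothetical intent classes.
models = {
    "lookup": ([1.0, 0.0], 0.0),
    "compare": ([0.0, 1.0], -0.5),
    "stats": ([-1.0, -1.0], 0.2),
}
pred = ovr_predict(models, [0.2, 0.9])
print(pred)  # 'compare' scores highest for this sample
```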
step S13, training 1-gram and 2-gram models to obtain concept sets of user semantics,
specifically, the probability of a sentence segment in the corpus is calculated with a statistical language model; by the Bayesian chain decomposition, it is the product of the conditional probabilities of each word given the words before it:

$$P(w_1 w_2 \cdots w_T) = \prod_{t=1}^{T} P\!\left(w_t \mid w_1 \cdots w_{t-1}\right),$$

where each conditional probability is estimated from counts as:

$$P\!\left(w_t \mid w_1 \cdots w_{t-1}\right) = \frac{C\!\left(w_1^{t}\right)}{C\!\left(w_1^{t-1}\right)},$$

where $w_1^{t}$ represents the word string from the first to the $t$-th word in the sentence segment and $C(\cdot)$ the number of occurrences of a word string. The probability of a word is thus related to all words preceding it; assuming the word is related only to the preceding n-1 words, this converts to the following form:

$$P\!\left(w_t \mid w_1^{t-1}\right) \approx P\!\left(w_t \mid w_{t-n+1}^{t-1}\right);$$

the 1-gram and 2-gram models are the special cases n = 1 and n = 2.
4. The intelligent search method for question-answering of the power file according to claim 2, wherein the method comprises the following steps: in step S1, performing synonym expansion on keywords in the concept based on the synonym table to obtain an expanded concept set to realize semantic expansion of the user, including:
firstly, constructing a synonym table by using an existing professional dictionary or vocabulary library, wherein the synonym table comprises a group of synonyms or words of a paraphrasing;
secondly, word segmentation is carried out on concepts or texts to be expanded, and the texts are segmented into single words;
then, for each word after word segmentation, searching whether corresponding synonyms exist in a synonym table, and if so, adding the synonyms into the concept to serve as expansion of the word;
and finally, merging the expanded synonyms with the keywords in the original concepts, and removing the repeated words.
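The four synonym-expansion steps of this claim can be sketched as follows; the synonym table entries are assumptions introduced for the example.

```python
def expand_concepts(keywords, synonym_table):
    """Expand each keyword with its synonyms from the table, then merge
    and remove repeated words, mirroring the steps of claim 4."""
    expanded = []
    for word in keywords:
        expanded.append(word)
        expanded.extend(synonym_table.get(word, []))  # add synonyms if present
    # deduplicate while keeping first-seen order
    return list(dict.fromkeys(expanded))

synonym_table = {
    "transformer": ["power transformer"],
    "fault": ["failure", "defect"],
}
expanded = expand_concepts(["transformer", "fault", "failure"], synonym_table)
print(expanded)
# ['transformer', 'power transformer', 'fault', 'failure', 'defect']
```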
5. The intelligent search method for question-answering of a power file according to claim 1, wherein the method comprises the following steps: the step S2 comprises the following steps:
step S21, constructing a file database, comprising:
first, collecting power files and converting them into a text format; preprocessing the text, including removing noise, punctuation marks and stop words; analyzing the text content in the text file, and extracting the characters in the file;
second, text is processed using natural language processing, including: segmenting the text content of the file, dividing the text content into chapters, clauses and paragraphs, and identifying entities involved in the file by identifying the entities of the text content;
then, extracting keywords from the file, and finding out core words and topics in the file;
and finally, establishing a file database according to the analyzed and processed text content, storing the structured file data by using the relational database, and establishing an index for the data in the file database.
6. The intelligent search method for question-answering of a power file according to claim 5, wherein the method comprises the following steps: the step S2 comprises the following steps:
step S22, constructing a document intent graph database, comprising the following steps:
firstly, the intent similarity of documents is classified through a pre-trained language model, judging whether an intent-coverage association exists between two documents;
secondly, a graph database model is selected to store the association data and intent similarity between documents;
then, each document in the document data is represented as a node in the graph database, and an edge is created between every two similar documents;
finally, the preprocessed document data and the similarity relation are imported into a graph database.
7. The intelligent search method for question-answering of a power file according to claim 1, wherein the method comprises the following steps: the step S3 comprises the following steps:
step S31, performing primary matching on the inverted index, including:
firstly, each document in the document set is processed to generate the corresponding inverted index; the structure of the inverted index is a word list, where each word corresponds to one or more document IDs, each ID identifying a document containing that word;
secondly, matching concept words in the expanded user query with word lists in the inverted index, and finding out a corresponding document ID list for each concept word;
finally, according to the matched document ID, corresponding document content or concept sets are acquired, wherein the document content or concept sets contain information related to the query intention of the user.
8. The intelligent search method for question-answering of a power file according to claim 7, wherein: the step S3 comprises the following steps:
step S32, encoding by using a BERT model, including:
firstly, the input text is tokenized using the dedicated tokenizer of the pre-training model; the tokenization result is a series of words or sub-words, each corresponding to an ID used for the subsequent input representation;
secondly, the BERT model adds special tags to the text input so that the model can distinguish the beginning and end of sentences;
then, the BERT model adopts a Transformer encoder to encode the words of the text layer by layer, obtaining a vector representation of each word;
in BERT, the vector of each word is composed of its original word vector and position encoding, and the Transformer encoder deep-encodes the text through multi-layer self-attention and feed-forward neural networks to capture context information; the attention is calculated as follows:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q$ is the query vector, $K$ is the key vector, $V$ is the value vector, and $d_k$ represents the dimension of the input vectors;
in the attention calculation, the three calculation vectors query, key and value are formed through linear transformations, the attention scores between every pair of sequence positions are computed from these three vectors, and the keywords are used to find the most relevant search results; in addition, to give the attention computation richer levels, the same input is computed in attention heads of different angles, expressing the association logic and attention characteristics between sequences in different spaces, thereby obtaining different output results and understandings.
9. The intelligent search method for question-answering of power files according to claim 8, wherein the method comprises the following steps: the step S3 comprises the following steps:
step S33, obtaining a query representation and a document representation by using a graph convolution neural network and calculating a diversity score, wherein the step comprises the following steps:
the diversity features extracted by the graph convolutional neural network are used to generate a diversity score for the documents; the document nodes update their representations with the information collected from their neighbors, formulated as follows:

$$H^{(l+1)}=\sigma\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right),$$

where $l$ is the layer index in the graph convolutional network, $\tilde{A}=A+I$ is the undirected intent adjacency matrix with self-loops added, $\tilde{D}$ is its degree matrix, $H^{(0)}=X\in\mathbb{R}^{N\times d}$ is the node feature matrix with $d$ the dimension of the node features, $W^{(l)}$ is the layer-specific trainable weight matrix of the $l$-th layer, and $\sigma$ represents an activation function;

based on the diversity features $h$ extracted from the current intent graph, the diversity score is calculated as:

$$s_{\mathrm{div}}=\mathrm{MLP}(h),$$

where MLP is a multi-layer perceptron;
and S34, integrating the inverted index result and the diversity score result to obtain a comprehensive ranking of the retrieved documents that considers both accuracy and diversity, displayed to the user in descending order of score.
10. An intelligent search system for question-answering of a power file is characterized in that: the power file question-answering type intelligent retrieval system comprises:
the user semantic analysis module is used for: performing intention classification on the semantics by using a feature vector by adopting a classification algorithm to obtain a concept set of the user semantics, and realizing the extraction of the concept of the user semantics; synonym expansion is carried out on keywords in the concepts to obtain an expanded concept set, so that semantic expansion of users is realized;
a document searching and processing module for: carrying out structuring treatment on the power file, and establishing a file database; measuring the similarity of the documents by covering the similarity of the intents among the documents, and constructing a document intent graph database;
answer extraction module for: completing preliminary matching of a user semantic expansion concept set and a document concept set according to the inverted index; coding all the related characters returned by sparse matching through a pre-training model; updating the query and the representation of each document on the intent graph to obtain a context-aware query representation and an intent-aware document representation, and presenting the search results in accordance with the user's search intent in combination with the relevance features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311451435.3A CN117171333A (en) | 2023-11-03 | 2023-11-03 | Electric power file question-answering type intelligent retrieval method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117171333A true CN117171333A (en) | 2023-12-05 |
Family
ID=88932173
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117496542A (en) * | 2023-12-29 | 2024-02-02 | 恒生电子股份有限公司 | Document information extraction method, device, electronic equipment and storage medium |
CN118013020A (en) * | 2024-04-09 | 2024-05-10 | 北京知呱呱科技有限公司 | Patent query method and system for generating joint training based on retrieval |
CN118093834A (en) * | 2024-04-22 | 2024-05-28 | 邦宁数字技术股份有限公司 | AIGC large model-based language processing question-answering system and method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101246492A (en) * | 2008-02-26 | 2008-08-20 | 华中科技大学 | Full text retrieval system based on natural language |
CN109885672A (en) * | 2019-03-04 | 2019-06-14 | 中国科学院软件研究所 | A kind of question and answer mode intelligent retrieval system and method towards online education |
CN110674279A (en) * | 2019-10-15 | 2020-01-10 | 腾讯科技(深圳)有限公司 | Question-answer processing method, device, equipment and storage medium based on artificial intelligence |
CN111046661A (en) * | 2019-12-13 | 2020-04-21 | 浙江大学 | Reading understanding method based on graph convolution network |
CN111611361A (en) * | 2020-04-01 | 2020-09-01 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Intelligent reading, understanding, question answering system of extraction type machine |
CN114036262A (en) * | 2021-11-15 | 2022-02-11 | 中国人民大学 | Graph-based search result diversification method |
CN116881425A (en) * | 2023-08-08 | 2023-10-13 | 武汉烽火普天信息技术有限公司 | Universal document question-answering implementation method, system, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||