CN117171333A - Electric power file question-answering type intelligent retrieval method and system

Electric power file question-answering type intelligent retrieval method and system

Info

Publication number
CN117171333A
Authority
CN
China
Prior art keywords: document, word, text, file, user
Legal status
Pending
Application number
CN202311451435.3A
Other languages
Chinese (zh)
Inventor
胡若云
孙钢
王庆娟
丁伟斌
沈艳阳
宋宛净
吴要毛
张维
张德奇
蒋颖
景伟强
肖吉东
钟震远
侯昱杰
楼斐
Current Assignee
State Grid Zhejiang Electric Power Co Ltd
Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
State Grid Zhejiang Electric Power Co Ltd
Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Application filed by State Grid Zhejiang Electric Power Co Ltd, Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd filed Critical State Grid Zhejiang Electric Power Co Ltd


Classifications

    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of information retrieval, and particularly relates to a power file question-answering type intelligent retrieval method and system. To address the defect that existing retrieval methods cannot give consideration to both retrieval accuracy and diversity, the invention adopts the following technical scheme. A power file question-answering type intelligent retrieval method comprises the following steps: step S1, user semantic analysis, comprising: extracting the user's semantic concepts and expanding the user's semantics; step S2, document retrieval and processing, comprising: establishing a file database, measuring document similarity, and constructing an intent graph to represent the relationships between document data and query terms; step S3, answer extraction, comprising: presenting the retrieval results according to the user's search intention in combination with traditional relevance features. The power file question-answering type intelligent retrieval method and system overcome the disadvantage that traditional retrieval methods cannot give consideration to both precise matching and diversity matching.

Description

Electric power file question-answering type intelligent retrieval method and system
Technical Field
The invention belongs to the technical field of information retrieval, and particularly relates to a power file question-answering type intelligent retrieval method and system.
Background
In recent years, many researchers at home and abroad propose many information retrieval methods based on different theories, so that the information retrieval capability is improved to a certain extent, but certain limitations still exist.
Chinese patent application publication number CN 113987146 A discloses an intelligent question-answering system dedicated to an electric power intranet, which comprises an intelligent question-answering module, a first control module, a second control module and a third control module, the intelligent question-answering module comprising an input module and an output module. The input module is used for the user to input search content. The semantic understanding module performs semantic understanding of the search content. The file crawling and searching module crawls file data sources and builds file indexes. The database crawling and searching module crawls the business database. The application module database outputs application module data, at least including address links of the application modules, according to the semantic understanding module's understanding of the search content. The output module outputs file indexes and/or service information and/or application module data. The system can meet the demand for refined search and improves the efficiency with which power users obtain the answers they need.
However, in power systems, accuracy and diversity are both very important, and an information retrieval method that combines retrieval accuracy with diversity is urgently needed. A good retrieval method needs to achieve the following: 1. improve the accuracy of information retrieval; 2. improve recommendation diversity on the basis of retrieval accuracy; 3. reduce redundant computation, increase computation speed, and optimize the user experience.
Disclosure of Invention
Aiming at the defect that the existing retrieval method cannot achieve both retrieval accuracy and diversity, the invention provides a power file question-answer type intelligent retrieval method and system, and the accuracy and the diversity of retrieval results are achieved. Further, the retrieval speed is improved, and the user experience is optimized.
In order to achieve the above purpose, the invention adopts the following technical scheme. A power file question-answering type intelligent retrieval method, the method comprising:
step S1, user semantic analysis, comprising the following steps: performing intention classification on the semantics by using a feature vector by adopting a classification algorithm to obtain a concept set of the user semantics, and realizing the extraction of the concept of the user semantics; synonym expansion is carried out on keywords in the concepts to obtain an expanded concept set, so that semantic expansion of users is realized;
step S2, document retrieval and processing, comprising the following steps: performing structured processing on the power files and establishing a file database; measuring document similarity by the intent-coverage similarity between documents, and constructing a document intent graph database;
step S3, answer extraction, including: completing preliminary matching of a user semantic expansion concept set and a document concept set according to the inverted index; coding all the related characters returned by sparse matching through a pre-training model; updating the query and the representation of each document on the intent graph to obtain a context-aware query representation and an intent-aware document representation, and presenting the search results in accordance with the user's search intent in combination with the relevance features.
The question-answering type intelligent retrieval method for power files of the invention comprises a user semantic analysis step, a document retrieval and processing step, and an answer extraction step. The user semantic analysis step mainly faces the question-answering port, using machine learning to extract and expand the user's semantic concepts. The document retrieval and processing step mainly faces the data port: a file database is built by performing structured processing on power-industry files, document similarity is measured by the intent-coverage similarity between documents, and an intent graph database based on document intent is constructed. The answer extraction step mainly faces the connection port: sparse vector matching is realized with natural language processing, machine-reading-comprehension question-answer matching is completed by combining a graph convolutional neural network with a BERT model, and finally the retrieval results are presented. The method can effectively improve the sensitivity and retrieval performance of the retrieval system, deeply understand and strengthen the user's semantic expansion, and provide diversified retrieval results according to the user's intent while maintaining matching accuracy, finally overcoming the disadvantage that traditional retrieval methods cannot give consideration to both precise matching and diversity matching.
In the step S1, a support vector machine classification algorithm is adopted, TF-IDF is used as a feature vector to carry out intention classification on the semantics, and 1-gram and 2-gram models are adopted to obtain a concept set of the user semantics, so that the concept extraction of the user semantics is realized; and carrying out synonym expansion on keywords in the concepts based on the synonym table to obtain an expanded concept set, and realizing semantic expansion of users.
As an improvement, step S1 includes:
s11, performing feature representation by using TF-IDF;
step S12, training a support vector machine classifier to classify the intention;
and step S13, training the 1-gram and 2-gram models to obtain a concept set of user semantics.
As an improvement, in step S1, a synonym expansion is performed on keywords in a concept based on a synonym table, and an expanded concept set is obtained to implement semantic expansion of a user, including:
firstly, constructing a synonym table by using an existing professional dictionary or vocabulary library, wherein the synonym table comprises a group of synonyms or words of a paraphrasing;
secondly, jieba word segmentation is carried out on concepts or texts to be expanded, and the texts are segmented into single words;
then, for each word after word segmentation, searching whether corresponding synonyms exist in a synonym table, and if so, adding the synonyms into the concept to serve as expansion of the word;
and finally, merging the expanded synonyms with the keywords in the original concepts, and removing the repeated words.
As an improvement, step S2 includes:
step S21, constructing a file database, comprising:
first, collecting power files and converting them into a text format; preprocessing the text, including removing noise, punctuation marks and stop words; analyzing the text content in the text file, and extracting the characters in the file;
second, the text is processed using natural language processing, including: segmenting the text content of the file into chapters, clauses and paragraphs; and recognizing the entities involved in the file, such as organization names, place names and dates, to help further organize and classify the file;
then, extracting keywords from the file, and finding out core words and topics in the file so as to facilitate subsequent retrieval and classification;
and finally, establishing a file database according to the analyzed and processed text content, storing the structured file data by using the relational database, and establishing an index for the data in the file database so as to quickly search and inquire.
As an improvement, step S2 includes:
step S22, constructing a document intent graph database, comprising the following steps:
firstly, classifying the intent similarity of documents through a pre-trained language model, judging whether an intent-coverage association exists between two documents;
secondly, selecting a neo4j graph database model to store the association data and intent similarity between documents;
then, each document in the document data is represented as a node in the graph database, and an edge is created between every two similar documents;
finally, the preprocessed document data and the similarity relation are imported into a graph database.
As an improvement, step S3 includes:
step S31, performing primary matching on the inverted index, including:
firstly, processing each document in a document set to generate a corresponding inverted index, wherein the structure of the inverted index is a word list, each word corresponds to one or more document IDs, and the document IDs are documents containing the word;
secondly, matching concept words in the expanded user query with word lists in the inverted index, and finding out a corresponding document ID list for each concept word;
finally, according to the matched document ID, corresponding document content or concept sets are acquired, wherein the document content or concept sets contain information related to the query intention of the user.
As an improvement, step S3 includes:
step S32, encoding by using a BERT model, including:
firstly, word segmentation is carried out on the input text, using the pre-trained model's dedicated tokenizer; the segmentation result is a series of words or sub-words, each corresponding to an ID used for the subsequent input representation;
secondly, the BERT model adds special marks to the text input so that the model can distinguish the beginning and end of sentences;
then, the BERT model adopts a Transformer encoder to encode the words of the text layer by layer to obtain a vector representation of each word;
in BERT, the vector of each word is composed of its original word vector and position coding, and the Transformer encoder deep-codes the text through a multi-layer self-attention mechanism and feed-forward neural network to capture context information; the multi-head attention mechanism is calculated as follows:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V,
wherein Q is the query vector, K is the key vector, V is the value vector, and d_k represents the dimension of the input vectors;
for the calculation of attention, three vectors (the query vector, the key vector and the value vector) are formed through linear transformations; based on these three vectors, the attention scores between every pair of sequence positions are calculated one by one, and the keywords are used to find the most relevant search results; on the other hand, to give the attention computation richer levels, multiple attention heads compute over the same input from different angles, expressing the associated logic and attention characteristics between sequences in different subspaces, so as to obtain different output results and understandings.
As an improvement, step S3 includes:
step S33, obtaining a query representation and a document representation by using a graph convolution neural network and calculating a diversity score, wherein the step comprises the following steps:
the diversity features extracted by the graph convolution neural network are used to generate a diversity score for the documents, the document nodes updating their representations by the information collected from their neighbors, specifically formulated as follows:
wherein,is the identifier of each layer in the graph convolutional neural network,/for each layer>Is an undirected intended adjacency matrix added to the self-loop, < >>For the degree matrix->Is a nodeFeature matrix, wherein->Is the dimension of the node feature, +.>Is->Layer-specific trainable weight matrix of a layer, < ->Representing an activation function, typically a ReLU function;
based on diversity features extracted from current intent graphTo calculate a diversity score, expressed as:
wherein,is a multi-layer perceptron (Multilayer Perceptron);
and S34, integrating the inverted index result and the diversity score result to obtain the integrated ranking of the search documents considering both accuracy and diversity, and displaying the integrated ranking to the user according to the score size.
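As an illustration of steps S33 and S34, the following minimal numpy sketch propagates node features over a toy intent graph with one graph-convolution layer and scores each document with a single linear layer standing in for the MLP; the adjacency matrix, node features and weights are hypothetical stand-ins, not values from the invention:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H' = ReLU(D̃^-1/2 (A+I) D̃^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])         # add self-loops
    d = A_tilde.sum(axis=1)                  # node degrees of the self-looped graph
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # D̃^{-1/2}
    return np.maximum(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W, 0.0)

def diversity_scores(H, W_mlp, b_mlp):
    """A one-layer stand-in for the MLP, mapping node features to one scalar score."""
    return (H @ W_mlp + b_mlp).ravel()

rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # 3-document intent graph
H = rng.normal(size=(3, 4))                                # node features, d = 4
W = rng.normal(size=(4, 4))                                # trainable layer weights
H1 = gcn_layer(A, H, W)
scores = diversity_scores(H1, rng.normal(size=(4, 1)), 0.0)  # one score per document
```

Sorting documents by `scores` together with the inverted-index relevance result would give the integrated ranking of step S34.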
A power file question-and-answer intelligent retrieval system, the power file question-and-answer intelligent retrieval system comprising:
the user semantic analysis module is used for: performing intention classification on the semantics by using a feature vector by adopting a classification algorithm to obtain a concept set of the user semantics, and realizing the extraction of the concept of the user semantics; synonym expansion is carried out on keywords in the concepts to obtain an expanded concept set, so that semantic expansion of users is realized;
a document searching and processing module for: carrying out structuring treatment on the power file, and establishing a file database; measuring the similarity of the documents by covering the similarity of the intents among the documents, and constructing a document intent graph database; the method comprises the steps of carrying out a first treatment on the surface of the
Answer extraction module for: completing preliminary matching of a user semantic expansion concept set and a document concept set according to the inverted index; coding all the related characters returned by sparse matching through a pre-training model; updating the query and the representation of each document on the intent graph to obtain a context-aware query representation and an intent-aware document representation, and presenting the search results in accordance with the user's search intent in combination with conventional relevance features.
The power file question-answering type intelligent retrieval method and system can effectively improve the sensitivity and retrieval performance of the retrieval system, deeply understand and strengthen the user's semantic expansion, and provide diversified retrieval results according to the user's intent while maintaining matching accuracy, finally overcoming the disadvantage that traditional retrieval methods cannot give consideration to both precise matching and diversity matching. Machine-reading-comprehension question-answer matching is completed through the graph neural network and natural language processing technology, and semantic matching is fused with intent matching, so that diversified and accurate retrieval results are presented.
Drawings
FIG. 1 is a flow chart of an intelligent retrieval method of an embodiment of the present invention.
Fig. 2 is a schematic diagram of a Support Vector Machine (SVM) classification principle in an embodiment of the present invention.
FIG. 3 is a network diagram of a BERT pre-training model of an embodiment of the invention.
FIG. 4 is a graph of a graph convolutional neural network calculation according to an embodiment of the present invention.
Fig. 5 is a block diagram of an intelligent retrieval system according to an embodiment of the invention.
Detailed Description
The following description of the technical solutions in the embodiments of the invention covers only the preferred embodiments, not all of them. All other embodiments obtained by a person skilled in the art from these embodiments without inventive effort fall within the scope of protection of the invention.
Referring to fig. 1 to 5, an intelligent search method for electric power files according to an embodiment of the present invention includes:
step S1, user semantic analysis, comprising the following steps: performing intention classification on the semantics by using a feature vector by adopting a classification algorithm to obtain a concept set of the user semantics, and realizing the extraction of the concept of the user semantics; synonym expansion is carried out on keywords in the concepts to obtain an expanded concept set, so that semantic expansion of users is realized;
step S2, document retrieval and processing, comprising the following steps: performing structured processing on the power files and establishing a file database; measuring document similarity by the intent-coverage similarity between documents to extract a more accurate document-diversity relation; establishing a classifier to judge whether two different documents contain the same or similar intents, and constructing an intent graph to represent the relationship between the document data and the query sentences;
step S3, answer extraction, including: completing preliminary matching of a user semantic expansion concept set and a document concept set according to the inverted index; coding all the related characters returned by sparse matching through a pre-training model; updating the query and the representation of each document on the intent graph to obtain a context-aware query representation and an intent-aware document representation, and presenting the search results in accordance with the user's search intent in combination with conventional relevance features.
Referring to fig. 1, the main steps of the power file question-answering type intelligent retrieval method of this embodiment include user semantic analysis, document retrieval and processing, and answer extraction. The user semantic analysis step mainly comprises user semantic concept extraction and user semantic expansion. The document retrieval and processing step mainly comprises creating a file database and constructing an intent graph to represent the relationships between document data and query statements. The answer extraction step mainly comprises preliminary matching of the user semantic expansion concept set with the document concept set, encoding the related words, updating the representations of the query and of each document on the intent graph, and presenting the retrieval results according to the user's search intention in combination with traditional relevance features.
In the embodiment, in step S1, a support vector machine (Support Vector Machine, SVM) classification algorithm is adopted, TF-IDF is used as a feature vector to carry out intention classification on the semantics, and 1-gram and 2-gram models are adopted to obtain a concept set of the user semantics, so that user semantic concept extraction is realized; and carrying out synonym expansion on keywords in the concepts based on the synonym table to obtain an expanded concept set, and realizing semantic expansion of users.
In this embodiment, step S1 includes:
step S11, adopting TF-IDF to perform characteristic representation,
specifically, TF refers to the frequency of occurrence of a word in one text, IDF refers to the importance of a word to the text in the entire set, expressed as:
TF = total number of occurrences of a certain word in the text/total number of words of the text,
idf=log (total number of text in corpus/(number of text containing the word + 1)),
TF-IDF=TF*IDF,
the final TF-IDF code considers the importance of each word in terms of its frequency of occurrence in the text and the importance in the entire text set, thereby representing a text with a vector;
is there a charge regulation "i want to query the energy management platform", "is the latest electricity price expanded? "is the electric charge of different areas of the same city consistent? "the three query texts are taken as examples to respectively calculate TF-IDF values of the corresponding words, and the text encoding is carried out. The calculated TF-IDF values are shown in the following table.
Step S12, training a support vector machine classifier to classify the intention,
referring to fig. 2, the goal of the support vector machine model is to find a hyperplane to separate the two types of data points. Specifically, given a set of training samplesWherein->Representing input feature vectors, ++>Representing the corresponding category label->The aim of a two-class support vector machine is to find a hyperplane +.>The following conditions are satisfied:
for all belonging to a categorySample->There is->
For all belonging to a categorySample->There is->
Wherein,is the normal vector,/->Is the intercept point of the beam,
in this process, it is desirable to maximize the distance of the support vector to the hyperplane, such hyperplane being referred to as the maximum-interval hyperplane, the optimization problem of the maximum interval is expressed as:
wherein,for->The vector is subjected to a dot product,
for all training samples, the constraints are:
the multi-classification problem is solved by adopting a one-to-many strategy, in the one-to-many strategy, each category is independently used as one category, K classification support vector machine models are constructed, each classification model is used for distinguishing one category from all other categories, and for the first categoryThe samples of which are marked as positive examples, and the samples of the other K-1 categories are marked as negative examples, each category being +.>The representation is made of a combination of a first and a second color,
: category->As positive examples, all the remaining categories are negative examples,
: category->As positive examples, all the remaining categories are negative examples,
...
: category->As positive examples, all the remaining categories are negative examples,
in the training stage, training each two-class support vector machine model to obtain a corresponding weight vector and a bias term;
when prediction is carried out, inputting a new sample into each support vector machine model, and then selecting the category with the highest output score as a final prediction result;
the coded query text is subjected to SVM topic classification, and topics can be extracted. The text is classified into a theme framework related to energy cost.
Step S13, training 1-gram and 2-gram models to obtain concept sets of user semantics,
specifically, the probability of the sentence segment in the corpus is calculated through a statistical language model, and the conditional probability product of the word appearing on the basis of the existence of the front word is calculated according to Bayesian chain decomposition, which is expressed as follows:
is expressed as:
wherein,representing a word string of words from the first to the t-th word in the sentence segment, ++>Representing the number of occurrences of a word string in a sentence segment, it is apparent that the probability of occurrence of a word is related to all words preceding it, assuming thisThe individual word is related to only the first n-1 words, and can be converted into the following form:
the 1-gram and 2-gram models are special cases when n=1, 2.
In this embodiment, in step S1, synonym expansion is performed on keywords in the concept based on a synonym table, so as to obtain an expanded concept set to implement semantic expansion of a user, including:
first, a synonym table is constructed using an existing specialized dictionary or lexicon, which contains a set of synonyms or paraphrased words. The synonym table is shown in the following table.
Secondly, jieba word segmentation is carried out on concepts or texts to be expanded, and the texts are segmented into single words. The word segmentation results are shown in the following table.
And then, for each word after word segmentation, searching whether corresponding synonyms exist in a synonym table, and if so, adding the synonyms into the concept as expansion of the word. The synonym expansion results are shown in the following table.
And finally, merging the expanded synonyms with the keywords in the original concepts, and removing the repeated words.
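A minimal sketch of the expansion-and-merge procedure above; the patent segments text with jieba, so here pre-segmented word lists and a hypothetical two-entry synonym table stand in for that step:

```python
# Hypothetical synonym table built from a domain dictionary (jieba segmentation
# is assumed to have already produced the word lists passed to expand())
synonyms = {
    "电价": ["电费", "用电价格"],
    "查询": ["检索", "查找"],
}

def expand(words):
    expanded = list(words)
    for w in words:
        expanded.extend(synonyms.get(w, []))   # add synonyms found in the table
    seen, result = set(), []
    for w in expanded:                         # merge and remove repeated words
        if w not in seen:
            seen.add(w)
            result.append(w)
    return result

print(expand(["查询", "电价", "电费"]))
```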
In this embodiment, step S2 includes:
step S21, constructing a file database, comprising:
first, collecting power files and converting them into a text format; preprocessing the text, including removing noise, punctuation marks and stop words; analyzing the text content in the text file, and extracting the characters in the file;
next, text is processed using natural language processing (Natural Language Processing, NLP), including: segmenting the text content of the file, dividing the text content into chapters, clauses and paragraphs, identifying entities involved in the file, such as organization names, place names and dates, and helping to further organize and classify the file;
then, extracting keywords from the file to find out core words and topics in the file, which is helpful for subsequent retrieval and classification;
and finally, establishing a file database according to the analyzed and processed text content, storing the structured file data by using the relational database, and establishing an index for the data in the file database so as to quickly search and inquire.
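One way to realize the relational store plus index described above, sketched with SQLite and its FTS5 full-text index (assuming FTS5 is compiled in, as it is in most Python builds); the two document rows are invented examples:

```python
import sqlite3

# A minimal relational store with a full-text index, assuming documents have
# already been cleaned and reduced to a title and a body
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files(id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.execute(
    "CREATE VIRTUAL TABLE files_fts USING fts5("
    "title, body, content='files', content_rowid='id')"
)
docs = [
    ("Electricity price regulation", "charge standards for residential users"),
    ("Outage notice", "planned maintenance of the distribution network"),
]
for title, body in docs:
    cur = conn.execute("INSERT INTO files(title, body) VALUES(?, ?)", (title, body))
    conn.execute(
        "INSERT INTO files_fts(rowid, title, body) VALUES(?, ?, ?)",
        (cur.lastrowid, title, body),
    )

# Fast keyword lookup over the indexed file database
rows = conn.execute(
    "SELECT title FROM files_fts WHERE files_fts MATCH 'charge'"
).fetchall()
```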
In this embodiment, step S2 includes:
step S22, constructing a document intent graph database, comprising the following steps:
firstly, classifying the intent similarity of documents through a pre-trained language model, judging whether an intent-coverage association exists between two documents;
secondly, selecting a neo4j graph database model to store the association data and intent similarity between documents;
then, each document in the document data is represented as a node in the graph database, and an edge is created between every two similar documents;
finally, the preprocessed document data and the similarity relation are imported into a graph database.
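An in-memory sketch of the graph construction above: a Jaccard token-overlap test stands in for the pre-trained intent classifier, and the three documents are invented. In the actual system the nodes and edges would be written to neo4j (e.g. via Cypher CREATE statements) rather than kept in a Python list:

```python
def is_similar(a, b, threshold=0.5):
    """Stand-in for the pretrained-model intent classifier: Jaccard overlap."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) >= threshold

# Hypothetical preprocessed documents: id -> intent-bearing text
docs = {
    1: "electricity price query",
    2: "electricity price regulation query",
    3: "outage repair",
}

# One node per document; an edge wherever two documents cover similar intents
edges = [(i, j) for i in docs for j in docs if i < j and is_similar(docs[i], docs[j])]
```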
In this embodiment, step S3 includes:
step S31, performing primary matching on the inverted index, including:
firstly, processing each document in a document set to generate a corresponding inverted index, wherein the structure of the inverted index is a word list, each word corresponds to one or more document IDs, and the document IDs are documents containing the word;
secondly, matching concept words in the expanded user query with word lists in the inverted index, and finding out a corresponding document ID list for each concept word;
finally, according to the matched document ID, corresponding document content or concept sets are acquired, wherein the document content or concept sets contain information related to the query intention of the user. The matching results are shown in the following table.
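The three matching steps above can be sketched as follows, with invented document word lists and an invented expanded concept set:

```python
from collections import defaultdict

# Hypothetical documents: id -> word list after preprocessing
docs = {
    1: ["electricity", "price", "regulation"],
    2: ["energy", "platform", "charge"],
    3: ["electricity", "charge", "areas"],
}

# Build the inverted index: word -> set of document IDs containing it
inverted = defaultdict(set)
for doc_id, words in docs.items():
    for w in words:
        inverted[w].add(doc_id)

# Match each expanded concept word against the index and collect candidates
query_concepts = ["electricity", "charge"]
candidates = set().union(*(inverted[w] for w in query_concepts if w in inverted))
```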
In this embodiment, step S3 includes:
and step S32, encoding by adopting a BERT (Bidirectional Encoder Representations from Transformers) model.
The principle of the BERT pre-training model is bidirectional training based on the Transformer model. By unsupervised pre-training on a large-scale text corpus, it learns a general representation of natural language. Referring to the BERT pre-training model network diagram of fig. 3, Trm is the attention calculation module of the Transformer model, E1-En represents the input text sequence, each E learns the representation of a word using bidirectional context, the input E is encoded by a bidirectional attention mechanism taking into account the context information on both the left and right sides of the word, and T1-Tn represents the encoded output sequence. This fully reflects the different dependency relations in the corresponding range, making the representation richer and the semantics more accurate.
Specifically, the coding process of the BERT model includes:
firstly, the input text is tokenized using the pre-trained model's dedicated tokenizer; the tokenization result is a series of words or sub-words, each mapped to an ID used for the subsequent input representation;
second, the BERT model adds special tokens to the text input so that the model can distinguish the beginning and end of a sentence: a [CLS] token is prepended to each text, marking the start of the sequence, and a [SEP] token is appended at the end of each sentence, marking its end;
then, the BERT model applies a Transformer encoder to encode the words of the text layer by layer, yielding a vector representation of each word;
in BERT, the vector of each word is composed of its original word embedding (Word vector) and a positional encoding (Positional Encoding); the Transformer encoder then deep-encodes the text through stacked self-attention and feed-forward layers to capture contextual information. The scaled dot-product attention underlying the multi-head mechanism is computed as:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where Q is the query vector, K the key vector, V the value vector, and $d_k$ the dimension of the input vectors;
for the attention computation, linear transformations first produce the three computation vectors (query, key, and value); attention scores are then computed pairwise across the sequence from these vectors, allowing the most relevant search results to be located from the key words. Further, to give the attention computation richer levels, multi-head attention applies the computation to the same input in several attention subspaces, expressing the associated logic and attention characteristics between sequences from different angles and yielding complementary outputs and interpretations.
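The scaled dot-product attention formula above can be checked with a minimal pure-Python sketch (single head, no learned projections; the toy Q, K, V values are assumptions):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# The query aligns with the first key, so the first value dominates.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))
```

In multi-head attention, this computation is repeated with different learned linear projections of Q, K, and V, and the per-head outputs are concatenated.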
In this embodiment, step S3 includes:
step S33, acquiring a query representation and a document representation by using a graph convolution neural network (Graph Convolutional Network, GCN) and calculating a diversity score.
Referring to fig. 4, the graph convolutional neural network learns node representations by convolution over a large heterogeneous graph composed of the document intent graph and the search text. Each node carries a feature vector describing its attributes, and edges represent the relationships between nodes. Nodes a-f in the figure denote text vectors; the network updates each node's representation by weighted aggregation of its neighbors' information, while a self-loop lets the node's own state also influence the update. In fig. 4, node a gathers information from the whole graph through two layers of aggregation; stacking convolution layers captures relationships between nodes at different distances and learns more complex graph-structural features.
Specifically, the process of obtaining a query representation and a document representation and calculating a diversity score using a graph convolution neural network includes:
the diversity features extracted by the graph convolutional neural network are used to generate a diversity score for the documents; each document node updates its representation from the information collected from its neighbors, formulated as:

$$H^{(l+1)}=\sigma\!\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\,\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right)$$

where $l$ indexes the layers of the graph convolutional neural network, $\tilde{A}=A+I$ is the undirected intent adjacency matrix with self-loops added, $\tilde{D}$ is its degree matrix, $H^{(l)}\in\mathbb{R}^{N\times d}$ is the node feature matrix with $d$ the dimension of the node features, $W^{(l)}$ is the trainable weight matrix specific to layer $l$, and $\sigma$ denotes an activation function, typically ReLU;
based on the diversity features $H$ extracted from the current intent graph, the diversity score is computed as:

$$score_{div}=\mathrm{MLP}(H)$$

where MLP is a multi-layer perceptron.
The diversity score calculation results are shown in the following table.
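A minimal sketch of one propagation step of the GCN layer rule above (pure Python, identity weight matrix for clarity; the toy graph and features are assumptions):

```python
import math

def gcn_layer(A, H, W):
    """One GCN propagation step:
    H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    n = len(A)
    # Add self-loops so each node's own state influences the update.
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in A_hat]
    d_inv_sqrt = [1.0 / math.sqrt(d) for d in deg]
    # Symmetrically normalized adjacency.
    norm = [[d_inv_sqrt[i] * A_hat[i][j] * d_inv_sqrt[j] for j in range(n)]
            for i in range(n)]

    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
                 for j in range(len(Y[0]))] for i in range(len(X))]

    Z = matmul(matmul(norm, H), W)
    return [[max(0.0, v) for v in row] for row in Z]  # ReLU

A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]      # path graph a-b-c
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # node feature matrix
W = [[1.0, 0.0], [0.0, 1.0]]               # identity weights for clarity
print(gcn_layer(A, H, W))
```

Stacking two such layers lets a node aggregate information from its two-hop neighborhood, matching the two-layer aggregation of node a in fig. 4.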
And S34, integrating the inverted index result with the diversity score result to obtain an integrated ranking of the retrieved documents that balances accuracy and diversity, and displaying the ranking to the user in order of score. The results are shown in the following table.
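Step S34's integration of the two score sources can be sketched as a simple weighted blend (the linear-combination form and the weight `alpha` are assumptions, not the patent's specified fusion rule):

```python
def integrated_ranking(relevance, diversity, alpha=0.6):
    """Blend inverted-index relevance with graph-derived diversity and
    rank documents by the combined score, highest first."""
    combined = {
        doc: alpha * relevance.get(doc, 0.0)
             + (1 - alpha) * diversity.get(doc, 0.0)
        for doc in set(relevance) | set(diversity)
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

relevance = {"d1": 0.9, "d2": 0.7, "d3": 0.4}  # from inverted-index matching
diversity = {"d1": 0.2, "d2": 0.8, "d3": 0.9}  # from the GCN diversity score
ranking = integrated_ranking(relevance, diversity)
print(ranking[0][0])  # d2: strong on both axes, so ranked first
```

A document that is merely relevant (d1) or merely diverse (d3) ranks below one that balances both, which is the behavior the integrated ranking is designed to produce.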
The power file question-answer intelligent retrieval method comprises a user semantic analysis step, a document retrieval and processing step, and an answer extraction step. The user semantic analysis step mainly faces the question-and-answer port and uses machine learning to extract and expand user semantic concepts. The document retrieval and processing step mainly faces the data port: a file database is built by structuring the power-industry files, document similarity is measured by the intent-coverage similarity between documents, and an intent graph database based on document intent is constructed. The answer extraction step mainly faces the connection port: sparse vector matching is realized with natural language processing, machine-reading-comprehension question-answer matching is completed by combining a graph convolutional neural network with a BERT model, and the retrieval results are finally presented. The method effectively improves the sensitivity and retrieval performance of the retrieval system, deeply understands and strengthens users' semantic expansion, and provides diversified retrieval results according to the user's intent while maintaining matching accuracy, thereby overcoming the shortcoming of traditional retrieval methods that cannot balance exact matching with diversity matching. Combining the graph convolutional neural network with the BERT model also reduces computational redundancy, improves computation speed, and optimizes the user experience.
Referring to fig. 1 to 5, the embodiment of the invention also provides a power file question-answer type intelligent retrieval system, which comprises:
the user semantic analysis module is used for: performing intention classification on the semantics by using a feature vector by adopting a classification algorithm to obtain a concept set of the user semantics, and realizing the extraction of the concept of the user semantics; synonym expansion is carried out on keywords in the concepts to obtain an expanded concept set, so that semantic expansion of users is realized;
a document searching and processing module for: carrying out structuring treatment on the power file, and establishing a file database; measuring the similarity of the documents by covering the similarity of the intents among the documents, and extracting more accurate document diversity relation; establishing a classifier to judge whether two different documents contain the same or similar intents, and constructing an intent graph to represent the relationship between document data and query sentences;
answer extraction module for: completing preliminary matching of a user semantic expansion concept set and a document concept set according to the inverted index; coding all the related characters returned by sparse matching through a pre-training model; updating the query and the representation of each document on the intent graph to obtain a context-aware query representation and an intent-aware document representation, and presenting the search results in accordance with the user's search intent in combination with conventional relevance features.
Referring to FIG. 5, in the answer extraction module a heterogeneous graph is constructed based on the document intent graph and the query text, where X1-Xn denote the different document nodes and Xq denotes the query-text node. On one hand, feature expression vectors of the different texts, i.e. the relevance feature vectors, are obtained on the basis of the BERT model embedding; on the other hand, a selected document d2 is obtained according to the content of the query text, and the overall intent graph is adjusted accordingly. The adjusted intent graph is input into the graph convolutional neural network, in which Conv1 and Conv2 denote different graph convolution layers; the semantic representations of all documents are updated and aggregated to obtain the corresponding feature vectors, with Z1-Zn the updated document nodes and Zq the updated query-text node. Finally, the original relevance feature vectors and the aggregated, updated feature vectors are input into a multi-layer perceptron network, and the outputs are concatenated to obtain the final diversity representation.
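The concatenate-then-MLP fusion described above can be sketched with a single-layer stand-in for the perceptron (all weights and feature values are illustrative assumptions):

```python
def mlp_score(features, weights, bias=0.0):
    """Single-layer stand-in for the multi-layer perceptron head:
    weighted sum of the features followed by a ReLU."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return max(0.0, z)

def diversity_representation(relevance_vec, aggregated_vec, weights):
    """Concatenate the BERT relevance vector with the GCN-aggregated
    vector, then score the joint representation."""
    joint = relevance_vec + aggregated_vec  # list concatenation
    return mlp_score(joint, weights)

score = diversity_representation([0.8, 0.1], [0.3, 0.6],
                                 [0.5, 0.2, 0.4, 0.3])
print(round(score, 3))  # 0.72
```

A real implementation would use a trained multi-layer perceptron over high-dimensional BERT and GCN embeddings; the structure (concatenation followed by a learned scoring head) is what this sketch illustrates.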
While the invention has been described in terms of specific embodiments, it will be apparent to those skilled in the art that the invention is not limited to the specific embodiments described. Any modifications which do not depart from the functional and structural principles of the present invention are intended to be included within the scope of the appended claims.

Claims (10)

1. An intelligent search method for a question-answer type electric power file is characterized by comprising the following steps of: the power file question-answer type intelligent retrieval method comprises the following steps:
step S1, user semantic analysis, comprising the following steps: performing intention classification on the semantics by using a feature vector by adopting a classification algorithm to obtain a concept set of the user semantics, and realizing the extraction of the concept of the user semantics; synonym expansion is carried out on keywords in the concepts to obtain an expanded concept set, so that semantic expansion of users is realized;
step S2, document retrieval and processing, comprising the following steps: carrying out structuring treatment on the power file, and establishing a file database; measuring the similarity of the documents by covering the similarity of the intents among the documents, and constructing a document intent graph database;
step S3, answer extraction, including: completing preliminary matching of a user semantic expansion concept set and a document concept set according to the inverted index; coding all the related characters returned by sparse matching through a pre-training model; updating the query and the representation of each document on the intent graph to obtain a context-aware query representation and an intent-aware document representation; and presenting the search result according to the search intention of the user and the correlation characteristic.
2. The intelligent search method for question-answering of a power file according to claim 1, wherein the method comprises the following steps: in the step S1, a support vector machine classification algorithm is adopted, TF-IDF is used as a feature vector to carry out intention classification on the semantics, and a 1-gram model and a 2-gram model are adopted to obtain a concept set of the user semantics, so that the concept extraction of the user semantics is realized; and carrying out synonym expansion on keywords in the concepts based on the synonym table to obtain an expanded concept set, and realizing semantic expansion of users.
3. The intelligent search method for question-answering of the power file according to claim 2, wherein the method comprises the following steps: the step S1 comprises the following steps:
step S11, adopting TF-IDF to perform characteristic representation,
specifically, TF is the frequency with which a word occurs in a given text, and IDF measures how important the word is to the texts of the entire set, expressed as:
TF = (number of occurrences of the word in the text) / (total number of words in the text),
IDF = log(total number of texts in the corpus / (number of texts containing the word + 1)),
TF-IDF = TF * IDF,
the final TF-IDF encoding weighs each word both by its frequency of occurrence within the text and by its importance across the entire text set, thereby representing each text as a vector;
step S12, training a support vector machine classifier to classify the intention,
specifically, given a set of training samples $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ denotes the input feature vector and $y_i \in \{+1, -1\}$ the corresponding category label, the goal of the binary support vector machine is to find a hyperplane $w^{T}x + b = 0$ satisfying the following conditions:

for all samples $x_i$ belonging to the category $y_i = +1$: $w^{T}x_i + b \geq 1$;

for all samples $x_i$ belonging to the category $y_i = -1$: $w^{T}x_i + b \leq -1$;

where $w$ is the normal vector and $b$ is the intercept;

in this process the distance from the support vectors to the hyperplane is maximized; such a hyperplane is called the maximum-margin hyperplane, and the maximum-margin optimization problem is expressed as:

$$\min_{w,b}\ \frac{1}{2}\|w\|^{2} = \frac{1}{2}\, w \cdot w$$

where $w \cdot w$ denotes the dot product of the vector $w$ with itself;

for all training samples, the constraints are:

$$y_i\left(w^{T}x_i + b\right) \geq 1,\quad i = 1, \dots, N;$$
the multi-classification problem is solved by a one-vs-rest strategy: each category is treated separately, and K binary support vector machine models are constructed, each distinguishing one category from all the others; for the $k$-th model, samples of category $k$ are labeled as positive examples and the samples of the remaining $K-1$ categories as negative examples, each model $f_k$ being represented as:

$f_1$: category 1 as positive examples, all remaining categories as negative examples,

$f_2$: category 2 as positive examples, all remaining categories as negative examples,

...

$f_K$: category $K$ as positive examples, all remaining categories as negative examples;
in the training stage, training each two-class support vector machine model to obtain a corresponding weight vector and a bias term;
when prediction is carried out, inputting a new sample into each support vector machine model, and then selecting the category with the highest output score as a final prediction result;
step S13, training 1-gram and 2-gram models to obtain concept sets of user semantics,
specifically, the probability of a sentence segment in the corpus is computed by a statistical language model; according to the Bayesian chain decomposition, it is the product of the conditional probabilities of each word appearing given the words preceding it:

$$P(w_1 w_2 \cdots w_T)=\prod_{t=1}^{T} P\!\left(w_t \mid w_1 \cdots w_{t-1}\right)$$

each conditional probability being estimated as:

$$P\!\left(w_t \mid w_1^{t-1}\right)=\frac{C\!\left(w_1^{t}\right)}{C\!\left(w_1^{t-1}\right)}$$

where $w_1^{t}$ denotes the word string from the first to the $t$-th word of the sentence segment and $C(\cdot)$ denotes the number of occurrences of a word string in the corpus; evidently the probability of a word depends on all the words preceding it; assuming the word depends only on the preceding $n-1$ words, this can be converted into:

$$P\!\left(w_t \mid w_1^{t-1}\right)\approx P\!\left(w_t \mid w_{t-n+1}^{t-1}\right)$$

the 1-gram and 2-gram models are the special cases n = 1 and n = 2.
4. The intelligent search method for question-answering of the power file according to claim 2, wherein the method comprises the following steps: in step S1, performing synonym expansion on keywords in the concept based on the synonym table to obtain an expanded concept set to realize semantic expansion of the user, including:
firstly, constructing a synonym table by using an existing professional dictionary or vocabulary library, wherein the synonym table comprises a group of synonyms or words of a paraphrasing;
secondly, word segmentation is carried out on concepts or texts to be expanded, and the texts are segmented into single words;
then, for each word after word segmentation, searching whether corresponding synonyms exist in a synonym table, and if so, adding the synonyms into the concept to serve as expansion of the word;
and finally, merging the expanded synonyms with the keywords in the original concepts, and removing the repeated words.
5. The intelligent search method for question-answering of a power file according to claim 1, wherein the method comprises the following steps: the step S2 comprises the following steps:
step S21, constructing a file database, comprising:
first, collecting the power files and converting them into text format; preprocessing the text, including removing noise, punctuation marks, and stop words; parsing the text content of each file and extracting the text it contains;
second, processing the text with natural language processing, including: segmenting the text content of the file into chapters, clauses, and paragraphs, and performing entity recognition on the text content to identify the entities involved in the file;
then, extracting keywords from the file, and finding out core words and topics in the file;
and finally, establishing a file database according to the analyzed and processed text content, storing the structured file data by using the relational database, and establishing an index for the data in the file database.
6. The intelligent search method for question-answering of a power file according to claim 5, wherein the method comprises the following steps: the step S2 comprises the following steps:
step S22, constructing a document intent graph database, comprising the following steps:
firstly, classifying the intent similarity of documents with a pre-trained language model, and judging whether an intent-coverage association exists between two documents;
secondly, selecting a graph database model to store the inter-document association data and intent similarity;
then, each document in the document data is represented as a node in the graph database, and an edge is created between every two similar documents;
finally, the preprocessed document data and the similarity relation are imported into a graph database.
7. The intelligent search method for question-answering of a power file according to claim 1, wherein the method comprises the following steps: the step S3 comprises the following steps:
step S31, performing primary matching on the inverted index, including:
firstly, processing each document in the document set to generate a corresponding inverted index, where the index is structured as a vocabulary list in which each word maps to one or more document IDs, each ID identifying a document that contains the word;
secondly, matching concept words in the expanded user query with word lists in the inverted index, and finding out a corresponding document ID list for each concept word;
finally, according to the matched document IDs, retrieving the corresponding document content or concept sets, which contain the information related to the user's query intent.
8. The intelligent search method for question-answering of a power file according to claim 7, wherein: the step S3 comprises the following steps:
step S32, encoding by using a BERT model, including:
firstly, the input text is tokenized using the pre-trained model's dedicated tokenizer; the tokenization result is a series of words or sub-words, each mapped to an ID used for the subsequent input representation;
secondly, the BERT model adds special marks to the text input so that the model can distinguish the beginning and end of sentences;
then, the BERT model applies a Transformer encoder to encode the words of the text layer by layer, yielding a vector representation of each word;
in BERT, the vector of each word is composed of its original word embedding and a positional encoding; the Transformer encoder deep-encodes the text through stacked self-attention and feed-forward layers to capture contextual information, the attention underlying the multi-head mechanism being computed as:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where Q is the query vector, K the key vector, V the value vector, and $d_k$ the dimension of the input vectors;
for the attention computation, linear transformations first produce the three computation vectors (query, key, and value); attention scores are then computed pairwise across the sequence from these vectors, allowing the most relevant search results to be located from the key words; further, to give the attention computation richer levels, multi-head attention applies the computation to the same input in several attention subspaces, expressing the associated logic and attention characteristics between sequences from different angles and yielding complementary outputs and interpretations.
9. The intelligent search method for question-answering of power files according to claim 8, wherein the method comprises the following steps: the step S3 comprises the following steps:
step S33, obtaining a query representation and a document representation by using a graph convolution neural network and calculating a diversity score, wherein the step comprises the following steps:
the diversity features extracted by the graph convolutional neural network are used to generate a diversity score for the documents; each document node updates its representation from the information collected from its neighbors, formulated as:

$$H^{(l+1)}=\sigma\!\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\,\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right)$$

where $l$ indexes the layers of the graph convolutional neural network, $\tilde{A}=A+I$ is the undirected intent adjacency matrix with self-loops added, $\tilde{D}$ is its degree matrix, $H^{(l)}\in\mathbb{R}^{N\times d}$ is the node feature matrix with $d$ the dimension of the node features, $W^{(l)}$ is the trainable weight matrix specific to layer $l$, and $\sigma$ denotes an activation function;
based on the diversity features $H$ extracted from the current intent graph, the diversity score is computed as:

$$score_{div}=\mathrm{MLP}(H)$$

where MLP is a multi-layer perceptron;
and S34, integrating the inverted index result with the diversity score result to obtain an integrated ranking of the retrieved documents that balances accuracy and diversity, and displaying the ranking to the user in order of score.
10. An intelligent search system for question-answering of a power file is characterized in that: the power file question-answering type intelligent retrieval system comprises:
the user semantic analysis module is used for: performing intention classification on the semantics by using a feature vector by adopting a classification algorithm to obtain a concept set of the user semantics, and realizing the extraction of the concept of the user semantics; synonym expansion is carried out on keywords in the concepts to obtain an expanded concept set, so that semantic expansion of users is realized;
a document searching and processing module for: carrying out structuring treatment on the power file, and establishing a file database; measuring the similarity of the documents by covering the similarity of the intents among the documents, and constructing a document intent graph database;
answer extraction module for: completing preliminary matching of a user semantic expansion concept set and a document concept set according to the inverted index; coding all the related characters returned by sparse matching through a pre-training model; updating the query and the representation of each document on the intent graph to obtain a context-aware query representation and an intent-aware document representation, and presenting the search results in accordance with the user's search intent in combination with the relevance features.
CN202311451435.3A 2023-11-03 2023-11-03 Electric power file question-answering type intelligent retrieval method and system Pending CN117171333A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311451435.3A CN117171333A (en) 2023-11-03 2023-11-03 Electric power file question-answering type intelligent retrieval method and system


Publications (1)

Publication Number Publication Date
CN117171333A true CN117171333A (en) 2023-12-05

Family

ID=88932173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311451435.3A Pending CN117171333A (en) 2023-11-03 2023-11-03 Electric power file question-answering type intelligent retrieval method and system

Country Status (1)

Country Link
CN (1) CN117171333A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246492A (en) * 2008-02-26 2008-08-20 华中科技大学 Full text retrieval system based on natural language
CN109885672A (en) * 2019-03-04 2019-06-14 中国科学院软件研究所 A kind of question and answer mode intelligent retrieval system and method towards online education
CN110674279A (en) * 2019-10-15 2020-01-10 腾讯科技(深圳)有限公司 Question-answer processing method, device, equipment and storage medium based on artificial intelligence
CN111046661A (en) * 2019-12-13 2020-04-21 浙江大学 Reading understanding method based on graph convolution network
CN111611361A (en) * 2020-04-01 2020-09-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Intelligent reading, understanding, question answering system of extraction type machine
CN114036262A (en) * 2021-11-15 2022-02-11 中国人民大学 Graph-based search result diversification method
CN116881425A (en) * 2023-08-08 2023-10-13 武汉烽火普天信息技术有限公司 Universal document question-answering implementation method, system, device and storage medium


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117496542A (en) * 2023-12-29 2024-02-02 恒生电子股份有限公司 Document information extraction method, device, electronic equipment and storage medium
CN117496542B (en) * 2023-12-29 2024-03-15 恒生电子股份有限公司 Document information extraction method, device, electronic equipment and storage medium
CN118013020A (en) * 2024-04-09 2024-05-10 北京知呱呱科技有限公司 Patent query method and system for generating joint training based on retrieval
CN118093834A (en) * 2024-04-22 2024-05-28 邦宁数字技术股份有限公司 AIGC large model-based language processing question-answering system and method

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN111259127B (en) Long text answer selection method based on transfer learning sentence vector
CN117171333A (en) Electric power file question-answering type intelligent retrieval method and system
CN112115238A (en) Question-answering method and system based on BERT and knowledge base
CN113254659A (en) File studying and judging method and system based on knowledge graph technology
CN111914556B (en) Emotion guiding method and system based on emotion semantic transfer pattern
CN110263325A (en) Chinese automatic word-cut
CN111767325B (en) Multi-source data deep fusion method based on deep learning
CN112163089B (en) High-technology text classification method and system integrating named entity recognition
CN111639183A (en) Financial industry consensus public opinion analysis method and system based on deep learning algorithm
CN113360582B (en) Relation classification method and system based on BERT model fusion multi-entity information
CN113051922A (en) Triple extraction method and system based on deep learning
CN115759092A (en) Network threat information named entity identification method based on ALBERT
CN114936277A (en) Similarity problem matching method and user similarity problem matching system
CN115688784A (en) Chinese named entity recognition method fusing character and word characteristics
CN114064901B (en) Book comment text classification method based on knowledge graph word meaning disambiguation
CN116010553A (en) Viewpoint retrieval system based on two-way coding and accurate matching signals
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
CN113220964B (en) Viewpoint mining method based on short text in network message field
CN114356990A (en) Base named entity recognition system and method based on transfer learning
CN114265936A (en) Method for realizing text mining of science and technology project
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN113869054A (en) Deep learning-based electric power field project feature identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination