CN111897926A - Chinese query expansion method integrating deep learning and expansion word mining intersection - Google Patents
- Publication number
- CN111897926A CN111897926A CN202010774430.4A CN202010774430A CN111897926A CN 111897926 A CN111897926 A CN 111897926A CN 202010774430 A CN202010774430 A CN 202010774430A CN 111897926 A CN111897926 A CN 111897926A
- Authority
- CN
- China
- Prior art keywords
- word
- expansion
- chinese
- pseudo
- weight
- Prior art date
- Legal status (assumed by Google; not a legal conclusion)
- Withdrawn
Classifications
- G06F16/3338 — Query expansion
- G06F16/3325 — Reformulation based on results of preceding query
- G06F16/3334 — Selection or weighting of terms from queries, including natural language queries
- G06F16/3335 — Syntactic pre-processing, e.g. stopword elimination, stemming
- G06F16/334 — Query execution
(all under G06F — Electric digital data processing; G06F16/00 — Information retrieval; G06F16/33 — Querying)
Abstract
The invention provides a Chinese query expansion method that fuses deep learning with expansion word mining by intersection. First, a deep learning tool performs word-embedding semantic learning training on the initial retrieval document set to obtain a word-embedding expansion word set rich in contextual semantic information. Next, a pseudo-relevance feedback expansion word mining method based on Copulas theory mines association rule patterns from the top-ranked pseudo-relevance feedback documents of the initial retrieval to obtain a rule expansion word set carrying statistically derived associations between feature words. Finally, the word-embedding expansion word set and the rule expansion word set are intersected to obtain the final expansion word set, improving the quality of the expansion words. By integrating deep learning with expansion word mining, the method extracts high-quality expansion words related to the original query, which mitigates query topic drift and word mismatch, improves text retrieval performance, and offers good application value and promotion prospects.
Description
Technical Field
The invention relates to a Chinese query expansion method integrating deep learning and expansion word mining, belonging to the technical field of information retrieval.
Background
Query expansion is one of the key technologies for addressing query topic drift and word mismatch in information retrieval. It modifies the weights of the original query or adds words related to it, producing a new, longer query that describes the semantics or topic implied by the original query more completely and accurately, compensates for deficiencies in the user's query, and improves the retrieval performance of an information retrieval system. The core problems of query expansion are the source of the expansion terms and the design of the expansion model. With the development of network technology and the arrival of the big data era, users' demands on information retrieval keep growing; accurately retrieving the required information from massive data has made query expansion a research hotspot in the field of information retrieval.
In recent decades, researchers have studied query expansion models from different perspectives and methods and obtained abundant results; relevance feedback expansion based on association pattern mining and the recently emerging query expansion based on deep learning have drawn particular attention from researchers at home and abroad. For example, Bouziri et al. proposed expansion word mining based on supervised learning (see: Bouziri A, Latiri C, Gaussier E, et al. Learning query expansion from association rules between terms [C]. Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), Lisbon, Portugal, 2015: 525-530.) and an expansion word mining method based on a learning-to-rank model. Kuzi et al. (see: Kuzi S, Shtok A, Kurland O. Query expansion using word embeddings [C]. Proceedings of the 25th ACM International Conference on Information and Knowledge Management. New York: ACM Press, 2016: 1929-1932.) applied the CBOW model of the deep learning tool Word2vec to train word vectors on the retrieval corpus and selected feature words semantically related to the query to realize query expansion.
Experimental results show that these query expansion methods are effective and improve information retrieval performance. However, existing query expansion methods have not fully solved query topic drift, word mismatch, and related technical problems in information retrieval. Moreover, expansion words derived from association patterns carry statistically derived feature word associations but lack the semantic information of the document context, while word-embedding expansion words derived from word-vector semantic learning training carry rich contextual semantics but no statistically derived feature word associations.
To exploit the complementary strengths of rule expansion words and word-embedding expansion words and offset their respective deficiencies, the invention integrates deep learning with Copulas-theory-based expansion word mining and proposes a Chinese query expansion method fusing the two by intersection. The method can alleviate query topic drift and word mismatch in text retrieval systems, improve text retrieval performance, and has good application value and broad promotion prospects.
Disclosure of Invention
The invention aims to provide a Chinese query expansion method fusing deep learning and expansion word mining by intersection for use in the field of information retrieval, such as practical Chinese search engines and web retrieval systems, to improve and enhance the query performance of information retrieval systems and reduce query topic drift and word mismatch.
The invention adopts the following specific technical scheme:
a Chinese query expansion method integrating deep learning and expanded word mining intersection comprises the following steps:
Step 1, search the Chinese document set with the original query to obtain the initial retrieval document set, and perform Chinese word segmentation and stop word removal preprocessing on the initial retrieval document set.
Step 2, performing word embedding semantic learning training on the initial inspection document set by using a deep learning tool to obtain a feature word and word embedding vector set, and specifically comprising the following steps:
(2.1) Perform word-embedding semantic learning training on the initial retrieval pseudo-relevance feedback document set using the Skip-gram model of Google's open-source word vector tool word2vec (see https://code.google.com/p/word2vec/) to obtain the word-embedding vector set of the initial retrieval document feature words.
(2.2) In the word-embedding vector set of the initial retrieval document feature words, compute the word-vector cosine similarity VCos(q_i, cet_j) between each query term q_i (q_i ∈ Q, where Q = (q_1, q_2, …, q_n) is the original query term set and 1 ≤ i ≤ n) and every word-embedding candidate expansion word (cet_1, cet_2, …, cet_m), as shown in formula (1), where 1 ≤ j ≤ m. The word-embedding candidate expansion words are the non-query terms in the word-embedding vector set.
In the formula (1), vcetjIndicating that the jth word is embedded into the candidate expansion word cetjWord vector value of, vqiRepresenting the ith query term qiThe word vector value of.
(2.3) Given a minimum vector cosine similarity threshold minqvcos, for each query term q_i extract the word-embedding candidate expansion words with VCos(q_i, cet_j) ≥ minqvcos as the word-embedding expansion words of q_i (q_i et_1, q_i et_2, …, q_i et_p1). Merge the word-embedding expansion words of all query terms q_1, q_2, …, q_n, remove duplicates, and obtain the final word-embedding expansion word set ET_WE (Expansion Term from Word Embedding) of the original query term set Q; compute the word-embedding expansion word weight w_WEET, then go to step 3. ET_WE is shown as formula (2):
The weight w_WEET of a word-embedding expansion word is the vector cosine similarity between the query term and the expansion word, as shown in formula (3); when a word appears repeatedly, its word-embedding expansion word weight equals the cumulative sum of its vector similarities.
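The selection and weighting of steps (2.2)–(2.3) can be sketched in Python. This is a minimal illustration with toy two-dimensional vectors; names such as `word_embedding_expansion` are illustrative, not from the patent:

```python
import math

def vcos(v1, v2):
    # Formula (1): cosine similarity of two word vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)

def word_embedding_expansion(query_vecs, candidate_vecs, minqvcos):
    """Keep candidates with VCos >= minqvcos for some query term;
    a repeated word's weight is the cumulative sum of its similarities."""
    weights = {}
    for q, qv in query_vecs.items():
        for cet, cv in candidate_vecs.items():
            sim = vcos(qv, cv)
            if sim >= minqvcos:
                weights[cet] = weights.get(cet, 0.0) + sim
    return weights

# hypothetical toy vectors
queries = {"q1": [1.0, 0.0], "q2": [0.8, 0.6]}
cands = {"cet1": [0.9, 0.1], "cet2": [0.0, 1.0]}
et_we = word_embedding_expansion(queries, cands, minqvcos=0.7)
# cet1 passes the threshold for both q1 and q2, so its weight accumulates;
# cet2 falls below minqvcos for every query term and is discarded
```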
Step 3, adopting a Copulas theory-based pseudo-related feedback expansion word mining method to mine rule expansion words in the initial detection pseudo-related feedback document set, and establishing a rule expansion word set, wherein the method specifically comprises the following steps:
(3.1) Extract the top m documents of the initial retrieval document set to construct the initial retrieval pseudo-relevance feedback document set; perform Chinese word segmentation, Chinese stop word removal, and feature word extraction preprocessing on it; compute the feature word weights; and finally construct the pseudo-relevance feedback Chinese document library and the Chinese feature word library.
The invention adopts the TF-IDF (term frequency-inverse document frequency) weighting technique (see: Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al. Modern Information Retrieval. Translated by Wang Zhijin et al. China Machine Press, 2005: 21-22.) to compute the feature word weights.
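As a rough illustration of the TF-IDF weighting mentioned above (the patent does not specify the exact variant; the common tf × log(N/df) form is assumed here):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Per-document feature word weights: w(t, d) = tf(t, d) * log(N / df(t)).
    This is one common TF-IDF variant, assumed for illustration."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for d in docs:
        df.update(set(d))
    out = []
    for d in docs:
        tf = Counter(d)                  # raw term frequency in this doc
        out.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return out

# toy segmented pseudo-relevance feedback documents
docs = [["信息", "检索", "扩展"], ["信息", "查询"], ["扩展", "词"]]
w = tfidf_weights(docs)
```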
(3.2) Take the feature words in the Chinese feature word library as the 1_candidate item sets C1.
(3.3) Compute the Copulas-based support CSup(C1) of C1. If CSup(C1) ≥ the minimum support threshold ms, take C1 as a 1_frequent item set L1 and add it to the frequent item set collection FIS (Frequent ItemSet).
CSup (Copulas-based Support) denotes the support based on Copulas theory. CSup(C1) is calculated as shown in equation (4). In equation (4), C1_Count denotes the frequency with which the 1_candidate set C1 occurs in the pseudo-relevance feedback Chinese document library, AllDoc_Count denotes the total number of documents in that library, C1_Weight denotes the item set weight of C1 in that library, and AllItems_Weight denotes the cumulative weight sum of all Chinese feature words in that library.
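Equation (4) combines a frequency-based support and a weight-based support. The text names the four quantities but the combining copula itself is not reproduced here, so the sketch below uses the simple product copula C(u, v) = u·v purely as an illustrative assumption, not the patent's actual formula:

```python
def csup(count, all_doc_count, weight, all_items_weight):
    """Copulas-based support CSup, sketched with a product copula.
    ASSUMPTION: the marginal supports are combined as u * v; the patent's
    equation (4) may use a different copula."""
    u = count / all_doc_count        # frequency-based marginal support
    v = weight / all_items_weight    # weight-based marginal support
    return u * v

# toy numbers: the item set occurs in 6 of 20 feedback documents and
# carries weight 3.0 out of a total feature word weight of 40.0
s = csup(6, 20, 3.0, 40.0)   # 0.3 * 0.075 = 0.0225
```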
(3.4) Using the self-join method, join the (k−1)_frequent item sets L_{k−1} to derive the k_candidate sets C_k, where k ≥ 2.
The self-join method adopts the candidate set join method given in the Apriori algorithm (see: Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., USA, 1993: 207-216.).
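The Apriori-style self-join of step (3.4) can be sketched as follows, with item sets kept as sorted tuples; the subset-pruning step is part of standard Apriori candidate generation:

```python
def self_join(frequent_km1):
    """Apriori candidate generation: join two (k-1)_frequent item sets
    that agree on their first k-2 items, then prune any candidate with
    an infrequent (k-1)-subset."""
    cands = set()
    items = sorted(frequent_km1)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i], items[j]
            if a[:-1] == b[:-1]:          # first k-2 items agree
                cands.add(a + (b[-1],))
    fk = set(frequent_km1)
    return {c for c in cands
            if all(c[:m] + c[m + 1:] in fk for m in range(len(c)))}

# toy 2_frequent item sets over items a, b, c
l2 = {("a", "b"), ("a", "c"), ("b", "c")}
c3 = self_join(l2)   # the single 3_candidate ("a", "b", "c")
```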
(3.5) When mining reaches a 2_candidate set C2: if C2 does not contain an original query term, delete C2; if C2 contains an original query term, retain it and go to step (3.6). When mining reaches a k_candidate set Ck with k ≥ 3, go directly to step (3.6).
(3.6) Compute the Copulas-based support CSup(Ck) of Ck. If CSup(Ck) ≥ ms, then Ck is a k_frequent item set Lk; add it to FIS and go to step (3.7). Otherwise go directly to step (3.7).
CSup(Ck) is calculated as shown in equation (5). In equation (5), Ck_Count denotes the frequency with which the k_candidate set Ck occurs in the pseudo-relevance feedback Chinese document library, and Ck_Weight denotes the item set weight of Ck in that library. AllDoc_Count and AllItems_Weight are defined as in equation (4).
(3.7) Increment k by 1 and return to step (3.4), continuing the steps in order until Lk is an empty set; frequent item set mining then finishes, and the process goes to step (3.8).
(3.8) Take a k_frequent item set Lk (k ≥ 2) out of FIS.
(3.9) Extract proper subset item sets Etj and Qi of Lk such that Qi ∩ Etj = ∅ and Qi ∪ Etj = Lk, where Etj is a proper subset item set containing no query terms, Qi is a proper subset item set containing query terms, and Q is the original query term set.
(3.10) Compute the Copulas-based confidence CConf(Qi→Etj) of the association rule Qi→Etj. If CConf(Qi→Etj) ≥ the minimum confidence threshold mc, add Qi→Etj to the association rule set AR (Association Rule). Then return to step (3.9), extract other proper subset item sets Etj and Qi from Lk, and proceed as before, looping until every proper subset item set of Lk has been taken exactly once. Then return to step (3.8), start a new round of association rule pattern mining by taking another Lk from FIS, and proceed as before, looping until every k_frequent item set Lk in FIS has been taken exactly once. Association rule pattern mining is then finished, and the process goes to step (3.11).
CConf (Copulas-based Confidence) denotes the confidence based on Copulas theory; CConf(Qi→Etj) is expressed by equation (6). In equation (6), Qi_Count denotes the frequency with which the proper subset item set Qi occurs in the pseudo-relevance feedback Chinese document library, Qi_Weight denotes the item set weight of Qi in that library, (Qi∪Etj)_Count denotes the frequency with which the item set (Qi∪Etj) occurs in that library, and (Qi∪Etj)_Weight denotes the item set weight of (Qi∪Etj) in that library. AllDoc_Count and AllItems_Weight are defined as in equation (4).
(3.11) Extract the association rule consequents Etj from the association rule set AR as rule expansion words to obtain the rule expansion word set ET_AR (Expansion Term from Association Rules); compute the rule expansion word weight wEt, then go to step 4.
The ET _ AR is shown as formula (7):
In formula (7), Ret_i denotes the i-th rule expansion word.
The rule expansion word weight wEtThe calculation formula is shown in formula (8):
in the formula (8), max () represents the maximum value of the confidence of the association rule, and when the same rule expansion word appears in a plurality of association rule patterns at the same time, the maximum value of the confidence is taken as the weight of the rule expansion word.
Step 4, performing intersection fusion on the rule expansion word set and the word embedding expansion word set to obtain a final expansion word, and realizing query expansion, wherein the specific steps are as follows:
(4.1) Perform an intersection operation on the rule expansion word set ET_AR and the word-embedding expansion word set ET_WE to obtain the final expansion word set ETS_Q (Expansion Term Set for Query Q) of the original query term set Q, and compute the final expansion word weight w(ETi).
The final expansion word set ETS_Q is calculated as shown in equation (9):

ETS_Q = ET_AR ∩ ET_WE    (9)
The final expansion word weight w(ETi) is calculated as shown in equation (10):

w(ETi) = wEt + wWEET    (10)
(4.2) Combine the final expansion words with the original query into a new query and search the Chinese documents again, realizing query expansion.
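Step 4 reduces to a dictionary intersection with additive weights, per equations (9) and (10); a minimal sketch with hypothetical word sets:

```python
def fuse(et_ar, et_we):
    """Equation (9): ETS_Q = ET_AR ∩ ET_WE;
    equation (10): w(ETi) = wEt + wWEET for each surviving word."""
    return {t: et_ar[t] + et_we[t] for t in et_ar.keys() & et_we.keys()}

et_ar = {"检索": 0.8, "文档": 0.6}   # rule words with rule confidences
et_we = {"检索": 0.9, "语义": 0.7}   # embedding words with cosine weights
ets_q = fuse(et_ar, et_we)          # only "检索" survives, weight 0.8 + 0.9
```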
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a Chinese query expansion method fusing deep learning and expansion word mining by intersection. A deep learning tool performs word-embedding semantic learning training on the initial retrieval document set to obtain a word-embedding expansion word set rich in contextual semantic information; a Copulas-theory-based pseudo-relevance feedback expansion word mining method mines association rule patterns from the top-ranked pseudo-relevance feedback documents to obtain a rule expansion word set carrying statistically derived feature word associations; and the two sets are intersected to obtain the final expansion word set, improving expansion word quality. Experimental results show that the method suppresses query topic drift and word mismatch, improves information retrieval performance, achieves higher retrieval performance than comparable recent methods, and has good application value and promotion prospects.
(2) Four similar query expansion methods from recent years were selected as comparison methods, with the Chinese corpus of the international standard dataset NTCIR-5 CLIR as experimental data. The results show that, compared with baseline retrieval, the MAP of the proposed method improves by up to 27.87% on average, a larger average improvement than that of the comparison methods (18.21%). The effect is significant, indicating that the method's retrieval performance surpasses both the baseline and the comparison methods; it can improve information retrieval performance, reduce query drift and word mismatch, and has high application value and broad promotion prospects.
Drawings
FIG. 1 is a general flow diagram of the method for expanding Chinese queries by merging deep learning and expanded word mining intersections according to the present invention.
Detailed Description
Firstly, in order to better explain the technical scheme of the invention, the related concepts related to the invention are introduced as follows:
1. item set
In text mining, a text document is regarded as a transaction, each feature word in the document is called an item, a set of feature word items is called an item set, and the number of items in an item set is called the item set length. A k_item set is an item set containing k items, k being its length.
2. Antecedent and consequent of an association rule
Let x and y be arbitrary feature word item sets. An implication of the form x → y is called an association rule, where x is called the rule antecedent and y the rule consequent.
3. Rule expansion word
A rule expansion word is an expansion word drawn from the consequent item set of an association rule whose antecedent item set is the original query term set.
4. Rule expansion word weight calculation
The confidence of an association rule whose antecedent item set consists of the original query terms is taken as the rule expansion word weight wEt.
The expansion word weight wEt is calculated as shown in formula (11). In formula (11), for the association rule Qi→Etj, Qi is an item set containing query terms (the rule antecedent) and Etj is an item set containing expansion words but no query terms (the rule consequent); AllDoc_Count denotes the total number of documents in the pseudo-relevance feedback Chinese document library; AllItems_Weight denotes the cumulative weight sum of all Chinese feature words in that library; Qi_Count denotes the frequency with which the item set Qi occurs in that library; Qi_Weight denotes the item set weight of Qi; (Qi∪Etj)_Count denotes the frequency with which the item set (Qi∪Etj) occurs in that library; and (Qi∪Etj)_Weight denotes the item set weight of (Qi∪Etj). max() denotes the maximum of the association rule confidences; when the same rule expansion word appears in several association rule patterns simultaneously, the maximum confidence is taken as its weight.
5. Word embedding expansion word and weight thereof
Word-embedding expansion words are derived from the word-embedding vector set, as follows. In the word-embedding vector set, compute the word-vector cosine similarity VCos(q_i, cet_j) between each query term q_i of the original query term set Q (q_i ∈ Q, Q = (q_1, q_2, …, q_n), 1 ≤ i ≤ n) and every word-embedding candidate expansion word (cet_1, cet_2, …, cet_m). Given a minimum similarity threshold minqvcos, extract for each query term q_i the candidates whose word-vector cosine similarity is no less than minqvcos as the word-embedding expansion words of q_i (q_i et_1, q_i et_2, …, q_i et_p1); merge the word-embedding expansion words of all query terms q_1, q_2, …, q_n and remove duplicates to obtain the final word-embedding expansion word set ET_WE (Expansion Term from Word Embedding) of the original query term set Q. The word-embedding candidate expansion words are the non-query terms in the word-embedding vector set.
The word embedding extension word ET _ WE is shown as equation (12):
The vector cosine similarity VCos(q_i, cet_j) is calculated as shown in equation (13):

VCos(q_i, cet_j) = (v_{q_i} · v_{cet_j}) / (‖v_{q_i}‖ ‖v_{cet_j}‖)    (13)

In equation (13), v_{cet_j} denotes the word vector of the j-th word-embedding candidate expansion word cet_j, and v_{q_i} denotes the word vector of the i-th query term q_i.
The vector cosine similarity between the query term and the word-embedding expansion word is taken as the word-embedding expansion word weight. The weight w_WEET is shown in formula (14); when a word appears repeatedly, the word-embedding expansion word weight equals the cumulative sum of its vector similarities.
6. Support degree and confidence degree based on Copulas theory
Copula function theory (see: Sklar A. Fonctions de répartition à n dimensions et leurs marges [J]. Publications de l'Institut de Statistique de l'Université de Paris, 1959, 8: 229-231.) describes the dependence between variables and can combine and connect distributions of arbitrary form into an effective multivariate distribution function. Drawing on Copulas function theory, the invention proposes the Copulas-based support CSup (Copulas-based Support) and the Copulas-based confidence CConf (Copulas-based Confidence), described in detail below.
The Copulas-based support CSup(T1∪T2) of a feature word item set (T1∪T2) is calculated as shown in equation (15). In equation (15), (T1∪T2)_Count denotes the frequency with which the item set (T1∪T2) occurs in the pseudo-relevance feedback Chinese document library, (T1∪T2)_Weight denotes the item set weight of (T1∪T2) in that library, AllDoc_Count denotes the total number of documents in that library, and AllItems_Weight denotes the cumulative weight sum of all Chinese feature words in that library.
The Copulas-based confidence CConf(T1→T2) of an association rule (T1→T2) is calculated as shown in equation (16). In equation (16), T1_Count denotes the frequency with which the item set T1 occurs in the pseudo-relevance feedback Chinese document library, T1_Weight denotes the item set weight of T1 in that library, (T1∪T2)_Count and (T1∪T2)_Weight denote the frequency and item set weight of (T1∪T2) in that library, and AllDoc_Count and AllItems_Weight are defined as in equation (15).
The invention is further explained below by referring to the drawings and specific comparative experiments.
As shown in FIG. 1, the method for expanding Chinese queries by fusion of deep learning and expanded word mining intersection of the present invention comprises the following steps:
Step 1, search the Chinese document set with the original query to obtain the initial retrieval document set, and perform Chinese word segmentation and stop word removal preprocessing on the initial retrieval document set.
Step 2, performing word embedding semantic learning training on the initial inspection document set by using a deep learning tool to obtain a feature word and word embedding vector set, and specifically comprising the following steps:
and (2.1) performing word embedding semantic learning training on the primary detection pseudo-related feedback document set by adopting a Skip-gram model of a deep learning tool Google open source word vector tool word2vec to obtain a word embedding vector set of the primary detection document characteristic words.
(2.2) In the word-embedding vector set of the initial retrieval document feature words, compute the word-vector cosine similarity VCos(q_i, cet_j) between each query term q_i (q_i ∈ Q, where Q = (q_1, q_2, …, q_n) is the original query term set and 1 ≤ i ≤ n) and every word-embedding candidate expansion word (cet_1, cet_2, …, cet_m), as shown in formula (1), where 1 ≤ j ≤ m. The word-embedding candidate expansion words are the non-query terms in the word-embedding vector set.
In formula (1), v_{cet_j} denotes the word vector of the j-th word-embedding candidate expansion word cet_j, and v_{q_i} denotes the word vector of the i-th query term q_i.
(2.3) Given a minimum vector cosine similarity threshold minqvcos, for each query term q_i extract the word-embedding candidate expansion words with VCos(q_i, cet_j) ≥ minqvcos as the word-embedding expansion words of q_i (q_i et_1, q_i et_2, …, q_i et_p1). Merge the word-embedding expansion words of all query terms q_1, q_2, …, q_n, remove duplicates, and obtain the final word-embedding expansion word set ET_WE (Expansion Term from Word Embedding) of the original query term set Q; compute the word-embedding expansion word weight w_WEET, then go to step 3. ET_WE is shown as formula (2):
The weight w_WEET of a word-embedding expansion word is the vector cosine similarity between the query term and the expansion word; when a word appears repeatedly, its word-embedding expansion word weight equals the cumulative sum of its vector similarities.
Step 3, adopting a Copulas theory-based pseudo-related feedback expansion word mining method to mine rule expansion words in the initial detection pseudo-related feedback document set, and establishing a rule expansion word set, wherein the method specifically comprises the following steps:
(3.1) Extract the top m documents of the initial retrieval document set to construct the initial retrieval pseudo-relevance feedback document set; perform Chinese word segmentation, Chinese stop word removal, and feature word extraction preprocessing on it; compute the feature word weights; and finally construct the pseudo-relevance feedback Chinese document library and the Chinese feature word library.
The invention adopts TF-IDF weighting technology to calculate the weight of the feature words.
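As a minimal sketch of the TF-IDF weighting step, assuming the classic tf(t, d) · log(N / df(t)) variant (the text names TF-IDF but does not fix an exact formula):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    # docs: list of token lists (the pseudo-relevance feedback library after
    # Chinese word segmentation and stop word removal).
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))  # document frequency of each feature word
    weights = []
    for d in docs:
        tf = Counter(d)
        # tf * idf; words appearing in every document get weight 0.
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["查询", "扩展", "扩展"], ["查询", "检索"]]
w = tfidf_weights(docs)
```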
(3.2) Take the feature words in the Chinese feature word library as the 1_candidate itemsets C1.
(3.3) Compute the Copulas-theory-based support CSup(C1) of C1; if CSup(C1) ≥ the minimum support threshold ms, take C1 as a 1_frequent itemset L1 and add it to the frequent itemset set FIS (Frequent ItemSet).
CSup (Copulas-based Support) denotes support based on Copulas theory. CSup(C1) is calculated as shown in formula (4):
In formula (4), C1_Count denotes the frequency with which the 1_candidate itemset C1 occurs in the pseudo-relevance feedback Chinese document library, AllDoc(count) denotes the total number of documents in that library, C1_Weight denotes the itemset weight of C1 in the library, and AllItems(weight) denotes the cumulative sum of the weights of all Chinese feature words in the library.
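Since the body of formula (4) is not reproduced in this text, the following sketch only fixes the two components it names, the frequency ratio C1_Count / AllDoc(count) and the weight ratio C1_Weight / AllItems(weight), and combines them through a pluggable copula function. The product (independence) copula used as the default is an assumption, not the patent's actual formula.

```python
def csup(count, all_doc, weight, all_items, copula=lambda u, v: u * v):
    # Copulas-based support CSup. The combining copula is pluggable because
    # the equation body is missing here; C(u, v) = u * v is a placeholder.
    u = count / all_doc      # frequency-based support component
    v = weight / all_items   # weight-based support component
    return copula(u, v)

# Itemset seen in 3 of 10 feedback documents, carrying 1.5 of 6.0 total weight.
s = csup(count=3, all_doc=10, weight=1.5, all_items=6.0)
```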
(3.4) Use the self-join method to join the (k−1)_frequent itemsets Lk-1 and derive the k_candidate itemsets Ck, where k ≥ 2.
The self-join method uses the candidate itemset join method given in the Apriori algorithm.
(3.5) When mining reaches a 2_candidate itemset C2: if C2 does not contain an original query term, delete C2; if C2 contains an original query term, retain it and proceed to step (3.6). When mining reaches a k_candidate itemset Ck with k ≥ 3, proceed directly to step (3.6).
(3.6) Compute the Copulas-theory-based support CSup(Ck) of Ck; if CSup(Ck) ≥ ms, then Ck is a k_frequent itemset Lk: add it to FIS and proceed to step (3.7); otherwise proceed directly to step (3.7).
CSup(Ck) is calculated as shown in formula (5):
In formula (5), Ck_Count denotes the frequency with which the k_candidate itemset Ck occurs in the pseudo-relevance feedback Chinese document library, and Ck_Weight denotes the itemset weight of Ck in that library. AllDoc(count) and AllItems(weight) are defined as in formula (4).
(3.7) Increment k by 1 and return to step (3.4) to continue the subsequent steps, until Lk is an empty set; frequent itemset mining then finishes, and the procedure moves to step (3.8).
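Steps (3.2)–(3.7) can be sketched as an Apriori-style level-wise loop. How the itemset weight is aggregated, and which copula combines the two support components, are not spelled out in this text, so both choices below are labeled assumptions in the code.

```python
from itertools import combinations

def mine_frequent_itemsets(docs, weights, query_terms, ms):
    # docs: list of frozensets of feature words; weights: per-document
    # feature word weight dicts; query_terms: original query term set.
    all_doc = len(docs)
    all_items = sum(sum(w.values()) for w in weights)

    def csup(itemset):
        # Assumption: itemset weight = summed weights of the itemset's words
        # over the documents containing it; copula = product copula.
        hits = [i for i, d in enumerate(docs) if itemset <= d]
        count = len(hits)
        weight = sum(weights[i][t] for i in hits for t in itemset)
        return (count / all_doc) * (weight / all_items)

    # 1_candidates -> 1_frequent itemsets L1 (steps 3.2-3.3).
    level = [frozenset([t]) for t in {t for d in docs for t in d}]
    level = [c for c in level if csup(c) >= ms]
    fis = list(level)
    k = 2
    while level:
        # Simplified self-join: unions of (k-1)_frequent itemset pairs (3.4).
        cands = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        if k == 2:
            # Step (3.5): a 2_candidate must contain an original query term.
            cands = {c for c in cands if c & query_terms}
        level = [c for c in cands if csup(c) >= ms]  # step (3.6)
        fis.extend(level)
        k += 1  # step (3.7)
    return fis

docs = [frozenset({"q", "a"}), frozenset({"q", "a"}), frozenset({"b"})]
weights = [{"q": 1.0, "a": 1.0}, {"q": 1.0, "a": 1.0}, {"b": 1.0}]
fis = mine_frequent_itemsets(docs, weights, frozenset({"q"}), ms=0.1)
```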
(3.8) taking out k _ frequent item set L from FISkAnd k is more than or equal to 2.
(3.9) extraction of LkIs set of proper subset entries EtjAnd QiAnd is andQi∪Etj=Lk,et (E) describedjFor a proper subset of terms set without query terms, said QiThe method comprises the steps of setting a proper subset item set containing query terms, wherein Q is an original query term set.
(3.10) calculate association rule Q based on Copulas theoryi→EtjConfidence of (CConf) (Q)i→Etj) If CConf (Q)i→ETj) If the confidence coefficient is more than or equal to the minimum confidence coefficient threshold mc, Q is addedi→EtjAdding into the association rule set AR (Association rule), then, proceeding to step (3.9), from LkTo re-extract the other proper subset item sets EtjAnd QiSequentially proceeding the next steps, and circulating the steps until LkIf and only if all proper subset entries in the set are retrieved once, then proceed to step (3.8), perform a new round of association rule pattern mining, and retrieve any other L from the FISkThen, the subsequent steps are performed sequentially, and the process is circulated until all k _ frequent item sets L in the FISkIf and only if all are taken out once, then the association rule pattern mining is finished, and the process goes to the following step (3.11).
CConf (Copulas-based Confidence) denotes confidence based on Copulas theory. CConf(Qi→Etj) is given by formula (6):
In formula (6), Qi_Count denotes the frequency with which the proper subset itemset Qi occurs in the pseudo-relevance feedback Chinese document library, Qi_Weight denotes the itemset weight of Qi in that library, (Qi∪Etj)_Count denotes the frequency with which the itemset (Qi∪Etj) occurs in the library, and (Qi∪Etj)_Weight denotes the itemset weight of (Qi∪Etj) in the library. AllDoc(count) and AllItems(weight) are defined as in formula (4).
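In the same hedged spirit as the support sketch, one possible reading of formula (6) divides a combined (count, weight) measure of Qi ∪ Etj by that of Qi. This is an assumption only, since the formula body is not reproduced in the text; with the product copula the AllDoc/AllItems normalizers cancel in the ratio, so raw counts and weights are used here.

```python
def cconf(q_count, q_weight, qe_count, qe_weight, copula=lambda u, v: u * v):
    # Copulas-based confidence CConf(Qi -> Etj), sketched as the ratio of the
    # combined measure of (Qi U Etj) to that of Qi. Assumption, not formula (6).
    num = copula(qe_count, qe_weight)
    den = copula(q_count, q_weight)
    return num / den if den else 0.0

# Qi occurs in 4 documents with itemset weight 2.0; Qi U Etj in 2 with weight 1.0.
c = cconf(q_count=4, q_weight=2.0, qe_count=2, qe_weight=1.0)
```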
(3.11) Extract the association rule consequents Etj from the association rule set AR as rule expansion words to obtain the rule expansion word set ET_AR (Expansion Terms from Association Rules), compute the rule expansion word weight wEt, and then proceed to step 4.
The ET _ AR is shown as formula (7):
In formula (7), Reti denotes the i-th rule expansion word.
The rule expansion word weight wEt is calculated as shown in formula (8):
In formula (8), max() takes the maximum confidence over the association rules: when the same rule expansion word appears in several association rule patterns simultaneously, the maximum confidence is taken as that rule expansion word's weight.
Step 4, performing intersection fusion on the rule expansion word set and the word embedding expansion word set to obtain a final expansion word, and realizing query expansion, wherein the specific steps are as follows:
(4.1) Perform the intersection operation on the rule expansion word set ET_AR and the word embedding expansion word set ET_WE to obtain the final expansion word set ETS_Q (Expansion Term Set for Query Q) of the original query term set Q, and compute the final expansion word weight w(ETi).
The final extended word set ETS _ Q is calculated as shown in equation (9):
The final expansion word weight w(ETi) is calculated as shown in formula (10):
w(ETi) = wEt + wWEET    (10)
(4.2) Combine the final expansion words with the original query into a new query and retrieve the Chinese documents again, thereby realizing query expansion.
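Step 4's intersection fusion and weight formula (10) reduce to a few lines; the input dictionaries below are illustrative values, not results from the patent.

```python
def fuse(et_ar, et_we):
    # Formula (9)/(10): the final expansion word set is the intersection of the
    # rule expansion words and the word embedding expansion words, and the
    # final weight is w(ETi) = wEt + wWEET.
    return {t: et_ar[t] + et_we[t] for t in et_ar.keys() & et_we.keys()}

et_ar = {"ret1": 0.8, "ret2": 0.5}   # rule expansion words with wEt
et_we = {"ret1": 1.2, "cet9": 0.9}   # word embedding expansion words with wWEET
final = fuse(et_ar, et_we)
```

Only "ret1" survives the intersection, with weight 0.8 + 1.2 = 2.0.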
Experimental design and results:
We compared the method of the present invention with similar prior-art methods in retrieval experiments to demonstrate its effectiveness.
1. Experimental environment and experimental data:
The experimental corpus is the NTCIR-5 CLIR Chinese text standard corpus (see http://research.nii.ac.jp/ntcir/data/data-en.html), containing 901,446 traditional Chinese documents (converted to simplified Chinese for the experiments) distributed over 8 data sets as shown in Table 1. The NTCIR-5 CLIR corpus provides 50 Chinese queries, 4 types of query topics, and result sets under 2 evaluation standards, namely Rigid (highly relevant and relevant to the query) and Relax (highly relevant, relevant, and partially relevant to the query). The retrieval experiments use the Description (Desc for short, a long query) and Title (a short query) query topics. The evaluation index for the retrieval results is MAP (Mean Average Precision).
TABLE 1 original corpus and its quantity
2. The reference retrieval and comparison method comprises the following steps:
the experimental basic retrieval environment is built by Lucene.
The reference retrieval is a retrieval result obtained by submitting an original query to Lucene.
The comparative method is described as follows:
Comparative method 1: the word-vector-based query expansion method of the literature (see: Kan Linyuan, Ko Jiu, et al. Study of a word vector method for patent query expansion [J]. Computer Science and Exploration, 2018, 12(6): 972-980.), with parameters α = 0.1 and k = 60.
Comparative method 2: rule expansion words mined with the weighted association pattern mining technique of the literature (see: Huang Mingxuan. Chinese-English cross-language query expansion [J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(3): 307-.), with parameters c = 0.1 and mi = 0.0001; the experimental results are the averages over ms = 0.004, 0.005, 0.006, and 0.007.
The Skip-gram model word embedding semantic learning training parameters used by the invention are: batch_size = 128, embedding_size = 300, skip_window = 2, num_skip = 4, num_sampled = 64.
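These parameter names match the TensorFlow word2vec tutorial implementation. A roughly equivalent gensim configuration is sketched below, with the caveat that the mapping (especially batch_size to batch_words, and num_skip, which has no gensim counterpart) is an assumption.

```python
# Assumed mapping of the reported skip-gram hyperparameters onto gensim.
gensim_params = dict(
    vector_size=300,   # embedding_size = 300
    window=2,          # skip_window = 2
    sg=1,              # Skip-gram model (not CBOW)
    negative=64,       # num_sampled = 64 (negative sampling)
    batch_words=128,   # batch_size = 128 (approximate correspondence)
)

def train_skipgram(tokenized_sentences):
    # tokenized_sentences: lists of words from the initially retrieved
    # documents after Chinese word segmentation.
    from gensim.models import Word2Vec  # optional dependency
    return Word2Vec(sentences=tokenized_sentences, min_count=1, **gensim_params)
```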
3. The experimental methods and results are as follows:
The MAP averages obtained by the method of the invention are shown in Tables 2 and 3, where the average amplification (%) is the overall average improvement of the invention's retrieval results over the 8 data sets relative to the baseline retrieval and the comparative expansion methods.
The average amplification is computed as follows: first compute, on each data set, the improvement of the invention's retrieval result relative to the baseline retrieval and comparative expansion methods; then sum the improvements over the data sets and divide by 8, obtaining the overall average improvement of the invention's retrieval results relative to the other methods.
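The average amplification computation described above is, in code (the MAP values are illustrative, not the patent's results):

```python
def average_amplification(ours, baseline):
    # Per-dataset amplification (%) = (our MAP - baseline MAP) / baseline MAP * 100;
    # the total average amplification sums these over the data sets and divides
    # by the number of data sets (8 in the experiments above).
    pct = [(o - b) / b * 100 for o, b in zip(ours, baseline)]
    return sum(pct) / len(pct)

avg = average_amplification([0.30, 0.24], [0.25, 0.20])
```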
TABLE 2 MAP value of search performance (Title query) for the method of the present invention and the reference search and comparison method
TABLE 3 Search performance MAP values of the method of the present invention and the reference retrieval and comparison methods (Desc query)
Tables 2 and 3 show that, compared with the reference retrieval, the MAP values of the retrieval results of the method of the invention improve markedly, with an average amplification of 27.87%; the MAP values of the method are also mostly higher than those of the comparison methods, indicating that its expansion retrieval performance exceeds both the reference retrieval and the similar comparison methods. The experimental results show that the method is effective and can genuinely improve information retrieval performance, with high application value and broad prospects for adoption.
Claims (7)
1. A Chinese query expansion method integrating deep learning and expanded word mining intersection is characterized by comprising the following steps:
step 1, searching a Chinese document set for original query to obtain a primary check document set, and performing Chinese word segmentation and stop word removal preprocessing on the primary check document set;
step 2, performing word embedding semantic learning training on the initial inspection document set by using a deep learning tool to obtain a feature word and word embedding vector set, and specifically comprising the following steps:
(2.1) performing word embedding semantic learning training on the primary detection pseudo-related feedback document set by adopting a deep learning tool to obtain a word embedding vector set of the primary detection document feature words;
(2.2) in the word embedding vector set of the feature words of the initially retrieved documents, calculating the word vector cosine similarity VCos(qi, cetj) between each query term qi (qi ∈ Q, where Q is the original query term set, Q = {q1, q2, …, qn}, 1 ≤ i ≤ n) and every word embedding candidate expansion word (cet1, cet2, …, cetm), where 1 ≤ j ≤ m; the word embedding candidate expansion words are the non-query terms in the word embedding vector set;
(2.3) given a minimum vector cosine similarity threshold minqvcos, extracting for each query term qi the word embedding candidate expansion words whose VCos(qi, cetj) ≥ minqvcos as the word embedding expansion words of qi (qiet1, qiet2, …, qietp1), merging the word embedding expansion words of all query terms q1, q2, …, qn and removing duplicates to obtain the final word embedding expansion word set ET_WE of the original query term set Q, calculating the word embedding expansion word weight wWEET, and then proceeding to step 3;
step 3, adopting a Copulas theory-based pseudo-related feedback expansion word mining method to mine rule expansion words in the initial detection pseudo-related feedback document set, and establishing a rule expansion word set, wherein the method specifically comprises the following steps:
(3.1) extracting the first m documents from the initially retrieved document set to construct the pseudo-relevance feedback document set, performing Chinese word segmentation, Chinese stop word removal and feature word extraction preprocessing on it, calculating the feature word weights, and finally constructing the pseudo-relevance feedback Chinese document library and the Chinese feature word library;
(3.2) taking the feature words in the Chinese feature word library as the 1_candidate itemsets C1;
(3.3) calculating the Copulas-theory-based support CSup(C1) of C1; if CSup(C1) ≥ the minimum support threshold ms, taking C1 as a 1_frequent itemset L1 and adding it to the frequent itemset set FIS;
(3.4) using the self-join method to join the (k−1)_frequent itemsets Lk-1 and derive the k_candidate itemsets Ck, where k ≥ 2;
(3.5) when mining reaches a 2_candidate itemset C2: if C2 does not contain an original query term, deleting C2; if C2 contains an original query term, retaining it and proceeding to step (3.6); when mining reaches a k_candidate itemset Ck with k ≥ 3, proceeding directly to step (3.6);
(3.6) calculating the Copulas-theory-based support CSup(Ck) of Ck; if CSup(Ck) ≥ ms, then Ck is a k_frequent itemset Lk: adding it to FIS and proceeding to step (3.7); otherwise proceeding directly to step (3.7);
(3.7) incrementing k by 1 and returning to step (3.4) to continue the subsequent steps, until Lk is an empty set; frequent itemset mining then finishes and the procedure moves to step (3.8);
(3.8) taking a k_frequent itemset Lk out of FIS, where k ≥ 2;
(3.9) extracting the proper subset itemsets Etj and Qi of Lk, with Qi ∩ Etj = ∅ and Qi ∪ Etj = Lk, where Etj is a proper subset itemset containing no query terms, Qi is a proper subset itemset containing query terms, and Q is the original query term set;
(3.10) calculating the Copulas-theory-based confidence CConf(Qi→Etj) of the association rule Qi→Etj; if CConf(Qi→Etj) ≥ the minimum confidence threshold mc, adding Qi→Etj to the association rule set AR; then returning to step (3.9), extracting another pair of proper subset itemsets Etj and Qi from Lk and carrying out the subsequent steps in turn, repeating until each proper subset itemset of Lk has been taken out exactly once; then returning to step (3.8) for a new round of association rule pattern mining, taking another Lk out of FIS and carrying out the subsequent steps in turn, repeating until every k_frequent itemset Lk in FIS has been taken out exactly once, at which point association rule pattern mining finishes and the procedure moves to step (3.11);
(3.11) extracting the association rule consequents Etj from the association rule set AR as rule expansion words to obtain the rule expansion word set ET_AR, calculating the rule expansion word weight wEt, and then proceeding to step 4;
step 4, performing intersection fusion on the rule expansion word set and the word embedding expansion word set to obtain a final expansion word, and realizing query expansion, wherein the specific steps are as follows:
(4.1) performing the intersection operation on the rule expansion word set ET_AR and the word embedding expansion word set ET_WE to obtain the final expansion word set ETS_Q of the original query term set Q, and calculating the final expansion word weight w(ETi);
And (4.2) combining the final expansion word with the original query into a new query, and searching the Chinese document again to realize query expansion.
2. The method for expanding Chinese queries by intersection fusion of deep learning and expanded word mining according to claim 1, wherein:
in the step (2.2), the word vector cosine similarity VCos(qi, cetj) is calculated according to formula (1):
in formula (1), vcetj denotes the word vector value of the j-th word embedding candidate expansion word cetj, and vqi denotes the word vector value of the i-th query term qi;
in the step (2.3), the final word embedding expansion word set ET_WE of the original query term set Q is shown in formula (2);
the word embedding expansion word weight wWEET is the vector cosine similarity between the query term and the word embedding expansion word, as shown in formula (3); when a duplicated word occurs, the weight of that word embedding expansion word equals the cumulative sum of the duplicated word's vector similarities;
3. the method for expanding Chinese queries by intersection fusion of deep learning and expanded word mining according to claim 1, wherein:
in the step (3.3), CSup(C1) is calculated as shown in formula (4):
in formula (4), C1_Count denotes the frequency with which the 1_candidate itemset C1 occurs in the pseudo-relevance feedback Chinese document library, AllDoc(count) denotes the total number of documents in that library, C1_Weight denotes the itemset weight of C1 in the library, and AllItems(weight) denotes the cumulative sum of the weights of all Chinese feature words in the library;
in the step (3.6), CSup(Ck) is calculated as shown in formula (5):
in formula (5), Ck_Count denotes the frequency with which the k_candidate itemset Ck occurs in the pseudo-relevance feedback Chinese document library, and Ck_Weight denotes the itemset weight of Ck in that library; AllDoc(count) and AllItems(weight) are defined as in formula (4);
in the step (3.10), CConf(Qi→Etj) is given by formula (6):
in formula (6), Qi_Count denotes the frequency with which the proper subset itemset Qi occurs in the pseudo-relevance feedback Chinese document library, Qi_Weight denotes the itemset weight of Qi in that library, (Qi∪Etj)_Count denotes the frequency with which the itemset (Qi∪Etj) occurs in the library, and (Qi∪Etj)_Weight denotes the itemset weight of (Qi∪Etj) in the library; AllDoc(count) and AllItems(weight) are defined as in formula (4);
in the step (3.11), ET_AR is shown in formula (7):
in formula (7), Reti denotes the i-th rule expansion word;
the rule expansion word weight wEt is calculated as shown in formula (8):
in formula (8), max() takes the maximum confidence over the association rules: when the same rule expansion word appears in several association rule patterns simultaneously, the maximum confidence is taken as that rule expansion word's weight.
4. The method for expanding Chinese queries by intersection fusion of deep learning and expanded word mining according to claim 1, wherein:
in the step (4.1), the final extended word set ETS _ Q is calculated as shown in equation (9):
the final expansion word weight w(ETi) is calculated as shown in formula (10):
w(ETi) = wEt + wWEET    (10).
5. The method for expanding Chinese queries by intersection fusion of deep learning and expanded word mining according to claim 1, wherein: in the step (2.1), the deep learning tool adopts the Skip-gram model of Google's open-source word vector tool word2vec.
6. The method for expanding Chinese queries by intersection fusion of deep learning and expanded word mining according to claim 1, wherein: in the step (3.1), a TF-IDF weighting technology is adopted to calculate the weight of the feature words.
7. The method for expanding Chinese queries by intersection fusion of deep learning and expanded word mining according to claim 1, wherein: in the step (3.4), the self-join method adopts the candidate itemset join method given in the Apriori algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010774430.4A CN111897926A (en) | 2020-08-04 | 2020-08-04 | Chinese query expansion method integrating deep learning and expansion word mining intersection |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111897926A true CN111897926A (en) | 2020-11-06 |
Family
ID=73245586
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010774430.4A Withdrawn CN111897926A (en) | 2020-08-04 | 2020-08-04 | Chinese query expansion method integrating deep learning and expansion word mining intersection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111897926A (en) |
Cited By (3)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN112765966A (en) * | 2021-04-06 | 2021-05-07 | 腾讯科技(深圳)有限公司 | Method and device for removing duplicate of associated word, computer readable storage medium and electronic equipment
CN112765966B (en) * | 2021-04-06 | 2021-07-23 | 腾讯科技(深圳)有限公司 | Method and device for removing duplicate of associated word, computer readable storage medium and electronic equipment
CN114036516A (en) * | 2021-10-27 | 2022-02-11 | 西安电子科技大学 | Unknown sensitive function discovery method based on two-stage analogy reasoning
-
2020
- 2020-08-04 CN CN202010774430.4A patent/CN111897926A/en not_active Withdrawn
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wen et al. | Research on keyword extraction based on word2vec weighted textrank | |
Mahata et al. | Theme-weighted ranking of keywords from text documents using phrase embeddings | |
CN104182527A (en) | Partial-sequence itemset based Chinese-English test word association rule mining method and system | |
CN111897926A (en) | Chinese query expansion method integrating deep learning and expansion word mining intersection | |
CN111753066A (en) | Method, device and equipment for expanding technical background text | |
CN111897922A (en) | Chinese query expansion method based on pattern mining and word vector similarity calculation | |
CN109739953B (en) | Text retrieval method based on chi-square analysis-confidence framework and back-part expansion | |
CN109726263B (en) | Cross-language post-translation hybrid expansion method based on feature word weighted association pattern mining | |
CN109684463B (en) | Cross-language post-translation and front-part extension method based on weight comparison and mining | |
CN111723179A (en) | Feedback model information retrieval method, system and medium based on concept map | |
CN111897928A (en) | Chinese query expansion method for embedding expansion words into query words and counting expansion word union | |
CN111897924A (en) | Text retrieval method based on association rule and word vector fusion expansion | |
CN111897921A (en) | Text retrieval method based on word vector learning and mode mining fusion expansion | |
Bouziri et al. | Learning query expansion from association rules between terms | |
Heidary et al. | Automatic text summarization using genetic algorithm and repetitive patterns | |
CN111897927B (en) | Chinese query expansion method integrating Copulas theory and association rule mining | |
Li et al. | Deep learning and semantic concept space are used in query expansion | |
CN108416442B (en) | Chinese word matrix weighting association rule mining method based on item frequency and weight | |
CN111897919A (en) | Text retrieval method based on Copulas function and pseudo-correlation feedback rule expansion | |
CN109684464B (en) | Cross-language query expansion method for realizing rule back-part mining through weight comparison | |
CN111897925B (en) | Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning | |
CN109684465B (en) | Text retrieval method based on pattern mining and mixed expansion of item set weight value comparison | |
Wu et al. | Beyond greedy search: pruned exhaustive search for diversified result ranking | |
CN111897923A (en) | Text retrieval method based on intersection expansion of word vector and association mode | |
CN113064978A (en) | Project construction period rationality judgment method and device based on feature word matching |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20201106 |