CN111897923A - Text retrieval method based on intersection expansion of word vector and association mode - Google Patents
- Publication number
- Publication number: CN111897923A (application number CN202010774137.8A)
- Authority
- CN
- China
- Prior art keywords
- word
- expansion
- word vector
- rule
- formula
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval of unstructured textual data
- G06F16/33—Querying
- G06F16/331—Query processing (G06F16/3331)
- G06F16/3332—Query translation
- G06F16/3338—Query expansion
- G06F16/332—Query formulation
- G06F16/3325—Reformulation based on results of preceding query
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
- G06F16/334—Query execution
Abstract
The invention provides a text retrieval method based on intersection expansion of word vectors and association patterns. A user query is first run against a Chinese document set to obtain an initially retrieved document set. Two processes are then applied to this set: rule expansion word mining and word vector semantic learning training, which yield a rule expansion word set and a word vector expansion word set respectively. The rule expansion words carry association information between feature words based on statistical analysis, while the word vector expansion words carry rich contextual semantic information. The two sets are fused by intersection to obtain the final expansion word set, improving expansion word quality and realizing query expansion. Experimental results show that the method alleviates query topic drift and word mismatch in information retrieval, improves retrieval performance, outperforms comparable methods of recent years, and has good application value and promotion prospects.
Description
Technical Field
The invention relates to a text retrieval method based on intersection expansion of word vectors and associated modes, and belongs to the technical field of information retrieval.
Background
In the era of big data, faced with the massive information resources of the Internet, how network users can accurately and efficiently retrieve the information they need from network big data has long been a concern of the information retrieval field in both academia and industry. Query expansion is one of the core technologies for addressing word mismatch and query topic drift in information retrieval: feature words semantically related to the original query are added to it, compensating for the lack of semantic information in an overly short original query and thereby improving retrieval performance. Information retrieval methods based on query expansion have drawn attention from scholars at home and abroad. For example, Pan et al. (see: Min Pan, Jimmy Huang, Tingting He, et al. A Simple Kernel Co-occurrence-based Enhancement for Pseudo-Relevance Feedback [J]. Journal of the Association for Information Science and Technology (JASIST), 2020, 71(3): 264-281.) applied a pseudo-relevance feedback query expansion method based on kernel term co-occurrence semantics in information retrieval, and experiments showed its effectiveness. Latiri et al. (see: Latiri C, Haddad H, Hamrouni T.) mined association rules between terms for query expansion. Chinese scholars have proposed pseudo-relevance feedback query expansion based on matrix-weighted association rule mining (Chinese journal article, 2009, 20(7): 1854-) and a word vector method for patent query expansion (see: Journal of Frontiers of Computer Science and Technology, 2018, 12(6): 972-980.) in which expansion words are selected by computing word vector cosine similarity; experimental results show that these methods can improve retrieval performance.
However, existing query expansion methods have not fully solved the problems of query topic drift and word mismatch in information retrieval. To better improve retrieval performance and effectively suppress these problems, the invention provides a text retrieval method based on intersection expansion of word vectors and association patterns, which improves text information retrieval performance and has good application value and broad promotion prospects.
Disclosure of Invention
The invention aims to provide a text retrieval method based on intersection expansion of word vectors and association patterns for use in the information retrieval field, for example in Chinese web information retrieval systems or search engines. The method can improve the query performance of an information retrieval system and reduce query topic drift and word mismatch in information retrieval.
The invention adopts the following specific technical scheme:
a text retrieval method based on intersection expansion of word vectors and association modes comprises the following steps:
step 1, a Chinese user queries and searches a Chinese document set to obtain a primary check document, and a primary check document set is constructed.
And 2, extracting m primary detection documents from the primary detection document set, constructing a primary detection pseudo-related feedback document set, performing Chinese word segmentation and stop word removal on the primary detection pseudo-related feedback document set, extracting Chinese feature words, calculating a weight of the feature words, and finally constructing a pseudo-related feedback Chinese document library and a Chinese feature word library.
The invention adopts TF-IDF (term frequency-inverse document frequency) weighting technology (see the literature: Ricardo Baeza-Yates Berthier Ribeiro-Net, et al, WangZhijin et al, modern information retrieval, mechanical industry Press, 2005: 21-22.) to calculate the weight of the feature words.
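The TF-IDF weighting of step 2 can be sketched as follows; this is a minimal illustration of the classic tf × log(N/df) scheme, not necessarily the exact variant of the cited textbook.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Compute TF-IDF weights for each term in each tokenized document.

    docs: list of token lists (already segmented, stop words removed).
    Returns a list of {term: weight} dicts, one per document.
    Assumes the classic tf * log(N / df) weighting.
    """
    n_docs = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights
```

A term occurring in every document receives weight 0, since log(N/df) vanishes when df = N.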
Step 3: mine rule expansion words in the initial pseudo-relevance feedback document set using Copulas-theory-based support and confidence, and construct the rule expansion word set, as follows:
(3.1) Mining the 1_candidate sets C1: extract feature words directly from the Chinese feature word library as the 1_candidate item sets C1.
(3.2) Mining the 1_frequent item sets L1: calculate the Copulas-function-based support Copulas_S(C1) of each C1, extract the C1 whose Copulas_S(C1) is not lower than the minimum support threshold ms as the 1_frequent item sets L1, and add them to the frequent item set FI (Frequent Itemset).
Copulas_S (Copulas-based Support) denotes the support based on the Copulas function.
Copulas_S(C1) is calculated as shown in formula (1).
In formula (1), Frequency[C1] denotes the frequency of the 1_candidate set C1 in the pseudo-relevance feedback Chinese document library, SumCount denotes the total number of documents in that library, Weight[C1] denotes the item set weight of C1 in that library, and SumWeight denotes the accumulated weight of all Chinese feature words in that library.
(3.3) Generating the k_candidate sets Ck (k ≥ 2): derive the k_candidate sets Ck by self-joining the (k-1)_frequent item sets Lk-1.
The self-join uses the candidate join method given in the Apriori algorithm (see: Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D C, USA, 1993: 207-).
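The self-join of step (3.3) follows the Apriori candidate join; a minimal sketch (item sets represented as sorted tuples, with the standard subset-pruning check) might look like:

```python
def apriori_self_join(frequent_prev):
    """Generate k_candidate sets by self-joining the (k-1)_frequent sets,
    as in the Apriori candidate join step. Item sets are sorted tuples.
    """
    if not frequent_prev:
        return set()
    k = len(next(iter(frequent_prev))) + 1
    prev = set(frequent_prev)
    candidates = set()
    items = sorted(prev)
    for a in items:
        for b in items:
            # join two (k-1)-sets that agree on their first k-2 items
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # prune: every (k-1)-subset of the candidate must be frequent
                if all(cand[:i] + cand[i + 1:] in prev for i in range(k)):
                    candidates.add(cand)
    return candidates
```

For example, joining the 2_frequent sets {a,b}, {a,c}, {b,c} yields the single 3_candidate {a,b,c}.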
(3.4) Pruning the 2_candidate sets C2: if a C2 does not contain an original query term, delete it; if a C2 contains an original query term, retain it and proceed to step (3.5).
(3.5) Mining the k_frequent item sets Lk (k ≥ 2): calculate the Copulas-function-based support Copulas_S(Ck) of each Ck, extract the Ck whose Copulas_S(Ck) is not lower than the minimum support threshold ms as the k_frequent item sets Lk, and add them to FI.
Copulas_S(Ck) is calculated as shown in formula (2).
In formula (2), Frequency[Ck] denotes the frequency of Ck in the pseudo-relevance feedback Chinese document library, Weight[Ck] denotes the item set weight of Ck in that library, and SumCount and SumWeight are defined as in formula (1).
(3.6) Increment k by 1 and return to step (3.3) to continue the subsequent steps; when Lk is the empty set, frequent item set mining ends and the procedure goes to step (3.7).
(3.7) Take any Lk (k ≥ 2) from FI.
(3.8) Extract proper subset item sets Let and Lq of Lk such that Lq ∪ Let = Lk and Lq ∩ Let = ∅, where Let is a proper subset item set containing no query terms, Lq is a proper subset item set consisting of query terms, and Q is the original query term set.
(3.9) Mining the association rule Lq → Let: calculate the Copulas-function-based confidence Copulas_C(Lq → Let) of the rule, extract the association rules Lq → Let whose Copulas_C(Lq → Let) is not lower than the minimum confidence threshold mc, and add them to the association rule set AR (Association Rule).
Copulas_C (Copulas-based Confidence) denotes the confidence of an association rule based on the Copulas function.
Copulas_C(Lq → Let) is calculated as shown in formula (3).
In formula (3), Frequency[Lq] denotes the frequency of the proper subset item set Lq in the pseudo-relevance feedback Chinese document library, Weight[Lq] denotes the item set weight of Lq in that library, Frequency[Lk] denotes the frequency of the item set Lk in that library, and Weight[Lk] denotes the item set weight of Lk in that library; SumCount and SumWeight are defined as in formula (1).
(3.10) Extract the association rules whose Copulas_C(Lq → Let) is not lower than the minimum confidence threshold mc and add them to the association rule set AR (Association Rule); then return to step (3.8) to extract other proper subset item sets Let and Lq from Lk and carry out the following steps in turn, looping until every proper subset item set of Lk has been taken out exactly once; then go to step (3.7) for a new round of association rule pattern mining, take any other Lk from FI, and perform the subsequent steps in turn, looping until every k_frequent item set Lk in FI has been taken out exactly once; association rule pattern mining then ends, and the procedure goes to step (3.11).
(3.11) Extract the consequent item sets Let of the association rules from the association rule set AR, where Let = (Ret1, Ret2, …, Rets), s ≥ 1; extract the rule expansion words from the item sets Let, remove duplicates to obtain the rule expansion word set ARET (expansion Term from Association Rules), calculate the rule expansion word weights wRet, and go to step 4.
ARET is shown in formula (4).
In formula (4), Reti denotes the i-th rule expansion word after duplicate terms are removed, i ≥ 1.
The rule expansion word weight wRet is calculated as shown in formula (5).
In formula (5), max() takes the maximum confidence over the association rules: when the same rule expansion word appears in several association rule patterns at the same time, the maximum confidence is taken as its weight.
Step 4: perform word vector semantic learning training on the initially retrieved document set with a deep learning tool to obtain the word vector expansion word set, as follows:
(4.1) Perform word vector semantic learning training on the initial pseudo-relevance feedback document set with a deep learning tool to obtain the feature word vector set of the initially retrieved documents.
The deep learning tool is the Skip-gram model of Google's open-source word vector tool word2vec (see https://code.google.com/p/word2vec/).
(4.2) In the feature word vector set of the initially retrieved documents, calculate the word vector cosine similarity CosSim(qi, cetj) between each query term qi (qi ∈ Q, where Q is the original query term set, Q = (q1, q2, …, qn), 1 ≤ i ≤ n) and all word vector candidate expansion words (cet1, cet2, …, cetm), 1 ≤ j ≤ m, as shown in formula (6). The word vector candidate expansion words are the non-query terms in the word vector set.
In formula (6), vcetj denotes the word vector of the j-th candidate expansion word cetj and vqi denotes the word vector of the i-th query term qi; CosSim(qi, cetj) is the cosine of the angle between vqi and vcetj, i.e. their inner product divided by the product of their vector norms.
(4.3) Given a minimum vector cosine similarity threshold minSim, extract the candidate expansion words whose CosSim(qi, cetj) is not lower than minSim as the word vector expansion words (qiet1, qiet2, …, qietp1) of query term qi; pool the expansion words of q1, q2, …, qn, remove duplicate words, obtain the final word vector expansion word set WEET (expansion Term from Word Embedding) of the original query term set Q, calculate the word vector expansion word weights wweet, and go to step 5.
WEET is shown in formula (7).
The word vector expansion word weight wweet is the vector cosine similarity between the query term and the word vector expansion word, as shown in formula (8); when a word is selected repeatedly, its weight equals the accumulated sum of the vector similarities of the repeated occurrences.
Step 5: intersect the rule expansion word set and the word vector expansion word set to obtain the final expansion words, realizing intersection expansion of word vectors and association patterns, as follows:
(5.1) Intersect the rule expansion word set ARET and the word vector expansion word set WEET to obtain the final expansion word set FETS (Final Expansion Term Set) of the original query term set Q, realizing intersection expansion of word vectors and association patterns; calculate the final expansion word weights wFet and go to step 6.
FETS is shown in formula (9).
In formula (9), Fetn denotes the n-th final expansion word.
The final expansion word weight wFet is the sum of the rule expansion word weight wRet and the word vector expansion word weight wweet, as shown in formula (10):
wFet = wRet + wweet (10)
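Step (5.1) and formula (10) can be sketched as a dictionary intersection:

```python
def intersect_expansion(aret, weet):
    """Fuse the rule expansion word set (ARET) and the word vector
    expansion word set (WEET) by set intersection; the final weight is
    w_Fet = w_Ret + w_weet (formula (10)).

    aret, weet: {term: weight} dicts.
    Returns the FETS {term: weight} dict.
    """
    return {t: aret[t] + weet[t] for t in aret.keys() & weet.keys()}
```

Only words produced by both the association-pattern mining and the word vector learning survive, which is the quality filter the method relies on.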
and 6, combining the final expansion word and the original query word into a document set in the new query re-retrieval, obtaining a final retrieval result and returning the final retrieval result to the user.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a text retrieval method based on intersection expansion of word vectors and association patterns. A user query is first run against a Chinese document set to obtain an initially retrieved document set; rule expansion word mining and word vector semantic learning training are then applied to this set to obtain a rule expansion word set and a word vector expansion word set respectively. The rule expansion words carry association information between feature words based on statistical analysis, while the word vector expansion words carry rich contextual semantic information; the two sets are fused by intersection to obtain the final expansion word set, improving expansion word quality and realizing query expansion. Experimental results show that the method improves information retrieval performance, outperforms comparable methods of recent years, and has good application value and promotion prospects.
(2) Four similar query expansion methods that have appeared in recent years were selected as comparison methods, with the Chinese corpus of the standard data set NTCIR-5 CLIR as experimental data. The results show that, compared with the baseline retrieval, the MAP of the proposed method increases by as much as 27.87% on average, and compared with the comparison methods it increases by as much as 18.21% on average. The effect is significant, indicating that the retrieval performance of the method is better than that of both the baseline retrieval and the comparison methods; the method can improve information retrieval performance, reduce query drift and word mismatch, and has high application value and broad promotion prospects.
Drawings
Fig. 1 is a general flow diagram of a text retrieval method based on word vector and association pattern intersection expansion according to the present invention.
Detailed Description
First, to better explain the technical scheme of the invention, the relevant concepts are introduced as follows:
1. Item set
In text mining, a text document is regarded as a transaction, each feature word in the document is called an item, a set of feature word items is called an item set, and the number of items in an item set is called the item set length. A k_item set is an item set containing k items, k being its length.
2. Association rule antecedent and consequent
Let x and y be any feature word item sets; an implication of the form x → y is called an association rule, where x is called the rule antecedent and y the rule consequent.
3. Copulas function, and support and confidence of feature word association patterns
Copula function theory (see: Sklar A. Fonctions de répartition à n dimensions et leurs marges [J]. Publications de l'Institut de Statistique de l'Université de Paris, 1959, 8(1): 229-) is used to describe the correlation between variables; marginal distributions of arbitrary form can be combined and connected into a valid multivariate distribution function.
The invention uses a Copula function to integrate the frequency and the weight of a feature word item set into the support and confidence of the feature word association pattern, with the support Copulas_S (Copulas-based Support) and confidence Copulas_C (Copulas-based Confidence) calculated as follows.
The support Copulas_S(T1 ∪ T2) of the feature word item set (T1 ∪ T2) is calculated as shown in formula (11).
In formula (11), Frequency[T1 ∪ T2] denotes the frequency of the item set (T1 ∪ T2) in the pseudo-relevance feedback Chinese document library, Weight[T1 ∪ T2] denotes the item set weight of (T1 ∪ T2) in that library, SumCount denotes the total number of documents in the pseudo-relevance feedback Chinese document library, and SumWeight denotes the accumulated weight of all Chinese feature words in that library.
The confidence Copulas_C of the association rule (T1 → T2) is calculated as shown in formula (12).
In formula (12), Frequency[T1] denotes the frequency of the item set T1 in the pseudo-relevance feedback Chinese document library and Weight[T1] denotes the item set weight of T1 in that library; Frequency[T1 ∪ T2], Weight[T1 ∪ T2], SumCount and SumWeight are defined as in formula (11).
4. Rule expansion word
A rule expansion word is an expansion word drawn from the consequent item set of an association rule whose antecedent item set is the original query item set.
5. Rule expansion word weight calculation
The invention takes the confidence of the association rule as the rule expansion word weight wRet.
The rule expansion word weight wRet is calculated as shown in formula (13).
In formula (13), Let is the item set that contains the expansion word Ret and no query terms, Lq is the item set consisting of query terms, Q is the original query term set, Lq ∪ Let = Lk, and Lq ∩ Let = ∅; Frequency[Lq] denotes the frequency of the item set Lq in the pseudo-relevance feedback Chinese document library, Weight[Lq] denotes the item set weight of Lq in that library, Frequency[Lk] denotes the frequency of the item set Lk in that library, and Weight[Lk] denotes the item set weight of Lk in that library; SumCount and SumWeight are defined as in formula (11); max() takes the maximum confidence over the association rules, and when the same rule expansion word appears in several association rule patterns at the same time, the maximum confidence is taken as its weight.
6. Word vector expansion words and their weights
The word vector expansion words are obtained as follows: in the feature word vector set of the initially retrieved documents, calculate the word vector cosine similarity CosSim(qi, cetj) between each query term qi (qi ∈ Q, where Q is the original query term set, Q = (q1, q2, …, qn), 1 ≤ i ≤ n) and all word vector candidate expansion words (cet1, cet2, …, cetm), 1 ≤ j ≤ m; given a minimum vector cosine similarity threshold minSim, extract the candidate expansion words whose CosSim(qi, cetj) is not lower than minSim as the word vector expansion words (qiet1, qiet2, …, qietp1) of query term qi; pool the expansion words of q1, q2, …, qn and remove duplicate words to obtain the final word vector expansion word set WEET (expansion Term from Word Embedding) of the original query term set Q. The word vector candidate expansion words are the non-query terms in the word vector set.
The word vector cosine similarity CosSim(qi, cetj) is shown in formula (14).
In formula (14), vcetj denotes the word vector of the j-th candidate expansion word cetj and vqi denotes the word vector of the i-th query term qi.
The word vector expansion word set WEET is shown in formula (15).
The invention takes the vector cosine similarity between a query term and a word vector expansion word as the weight of the word vector expansion word.
The word vector expansion word weight wweet is shown in formula (16); when a word is selected repeatedly, its weight equals the accumulated sum of the vector similarities of the repeated occurrences.
The invention is further explained below by referring to the drawings and specific comparative experiments.
As shown in fig. 1, the text retrieval method based on the intersection expansion of the word vector and the association mode of the present invention includes the following steps:
step 1, a Chinese user queries and searches a Chinese document set to obtain a primary check document, and a primary check document set is constructed.
And 2, extracting m primary detection documents from the primary detection document set, constructing a primary detection pseudo-related feedback document set, performing Chinese word segmentation and stop word removal on the primary detection pseudo-related feedback document set, extracting Chinese feature words, calculating a feature word weight by adopting a TF-IDF weighting technology, and finally constructing a pseudo-related feedback Chinese document library and a Chinese feature word library.
Step 3: mine rule expansion words in the initial pseudo-relevance feedback document set using Copulas-theory-based support and confidence, and construct the rule expansion word set, as follows:
(3.1) Mining the 1_candidate sets C1: extract feature words directly from the Chinese feature word library as the 1_candidate item sets C1.
(3.2) Mining the 1_frequent item sets L1: calculate the Copulas-function-based support Copulas_S(C1) of each C1, extract the C1 whose Copulas_S(C1) is not lower than the minimum support threshold ms as the 1_frequent item sets L1, and add them to the frequent item set FI (Frequent Itemset).
Copulas_S (Copulas-based Support) denotes the support based on the Copulas function.
Copulas_S(C1) is calculated as shown in formula (1).
In formula (1), Frequency[C1] denotes the frequency of the 1_candidate set C1 in the pseudo-relevance feedback Chinese document library, SumCount denotes the total number of documents in that library, Weight[C1] denotes the item set weight of C1 in that library, and SumWeight denotes the accumulated weight of all Chinese feature words in that library.
(3.3) Generating the k_candidate sets Ck (k ≥ 2): derive the k_candidate sets Ck by self-joining the (k-1)_frequent item sets Lk-1.
The self-join uses the candidate set join method given in the Apriori algorithm.
(3.4) Pruning the 2_candidate sets C2: if a C2 does not contain an original query term, delete it; if a C2 contains an original query term, retain it and proceed to step (3.5).
(3.5) Mining the k_frequent item sets Lk (k ≥ 2): calculate the Copulas-function-based support Copulas_S(Ck) of each Ck, extract the Ck whose Copulas_S(Ck) is not lower than the minimum support threshold ms as the k_frequent item sets Lk, and add them to FI.
Copulas_S(Ck) is calculated as shown in formula (2).
In formula (2), Frequency[Ck] denotes the frequency of Ck in the pseudo-relevance feedback Chinese document library, Weight[Ck] denotes the item set weight of Ck in that library, and SumCount and SumWeight are defined as in formula (1).
(3.6) after k is added with 1, the step (3.3) is carried out to continue the subsequent steps until the L iskAnd (4) if the item set is an empty set, finishing the mining of the frequent item set, and turning to the step (3.7).
(3.7) optionally taking out L from FIkAnd k is more than or equal to 2.
(3.8) extraction of LkIs a proper subset of item sets LetAnd LqAnd is andLq∪Let=Lk,said LetFor a proper subset of terms set without query terms, said LqThe method comprises the steps of setting a proper subset item set containing query terms, wherein Q is an original query term set.
(3.9) mining association rule Lq→Let: calculating the association rule L based on Copulas functionq→LetConfidence copolas _ C (L)q→Let) Extracting confidence Copulas _ C (L)q→Let) Association rule L not lower than minimum confidence threshold mcq→LetTo the association rule set ar (association rule).
The copula _ c (copula based confidence) represents the confidence of the association rule based on the copula function.
The copolas _ C (L)q→Let) The formula (3) is shown as follows:
in formula (3), Frequency [ L ]q]Representing a proper subset item set LqFrequency of occurrence in pseudo-relevance feedback Chinese document library, Weight [ Lq]Representing a proper subset item set LqTerm set weight, Frequency [ L ], in pseudo-relevance-feedback Chinese document libraryk]Representing a set of items LkFrequency of occurrence in pseudo-relevance feedback Chinese document library, Weight [ Lk]Representing a set of items LkItem set weights in a pseudo-relevance feedback Chinese document library. SumCount and SumWeight are as defined for formula (1).
(3.10) Add each association rule Lq → Let whose Copulas_C(Lq → Let) is not less than the minimum confidence threshold mc to the association rule set AR; then return to step (3.8) to extract other proper subset item sets Let and Lq from Lk and carry out the subsequent steps in turn, repeating until every pair of proper subsets of Lk has been taken out exactly once; then go to step (3.7) for a new round of association rule pattern mining, taking out another Lk from FI and carrying out the subsequent steps in turn, repeating until every k_frequent item set Lk in FI has been taken out exactly once, at which point association rule pattern mining is finished and the process goes to the following step (3.11).
(3.11) Extract the consequent item sets Let of the association rules from the association rule set AR, where Let = (Ret1, Ret2, …, Rets), s ≥ 1; extract rule expansion words from the item sets Let and remove duplicates to obtain the rule expansion word set ARET (Expansion Terms from Association Rules), and calculate the rule expansion word weight wRet; then proceed to step 4.
The ARET is shown in formula (4):
In formula (4), Reti denotes the ith rule expansion word after duplicates have been removed, where i ≥ 1.
The rule expansion word weight wRet is calculated as shown in formula (5):
In formula (5), max() denotes the maximum confidence over the association rules; when the same rule expansion word appears in several association rule patterns at once, the maximum confidence is taken as that expansion word's weight.
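The max-over-confidences weighting of formula (5) can be sketched as follows; the helper name and the (antecedent, consequent, confidence) triple representation of a mined rule are assumptions for illustration:

```python
def rule_term_weights(rules):
    """Per formula (5): a rule expansion word produced by several
    association rules gets the maximum confidence as its weight.
    rules: iterable of (antecedent, consequent, confidence) triples."""
    weights = {}
    for _antecedent, consequent, confidence in rules:
        for term in consequent:
            weights[term] = max(weights.get(term, 0.0), confidence)
    return weights
```

For example, if term "a" is the consequent of two rules with confidences 0.4 and 0.7, its weight is 0.7.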
Step 4, perform word vector semantic learning training on the initial retrieval document set using a deep learning tool to obtain the word vector expansion word set. The specific steps are as follows:
(4.1) Perform word vector semantic learning training on the initial pseudo-relevance feedback document set using a deep learning tool to obtain the initial retrieval document feature word vector set.
The deep learning tool used by the invention is the Skip-gram model of Google's open-source word vector tool word2vec.
(4.2) In the initial retrieval document feature word vector set, calculate the word vector cosine similarity CosSim(qi, cetj) between each query term qi (qi ∈ Q, where Q is the original query term set, Q = (q1, q2, …, qn), 1 ≤ i ≤ n) and all word vector candidate expansion words (cet1, cet2, …, cetm), as shown in formula (6), where 1 ≤ j ≤ m. The word vector candidate expansion words are the non-query feature terms in the word vector set.
In formula (6), vcetj represents the word vector value of the jth word vector candidate expansion word cetj, and vqi represents the word vector value of the ith query term qi.
(4.3) Given a minimum vector cosine similarity threshold minSim, extract the word vector candidate expansion words whose CosSim(qi, cetj) is not lower than minSim as the word vector expansion words (qiet1, qiet2, …, qietp1) of query term qi; pool the expansion words of q1, q2, …, qn, remove duplicates, and obtain the final word vector expansion word set WEET (Expansion Terms from Word Embedding) of the original query term set Q; calculate the word vector expansion word weight wweet, and then go to step 5.
The WEET is shown in formula (7):
The word vector expansion word weight wweet is the vector cosine similarity between the query term and the word vector expansion word, as shown in formula (8); when a word recurs, its weight equals the cumulative sum of the vector similarities of its occurrences.
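The cosine similarity of formula (6) and the cumulative-sum weighting of formula (8) can be sketched as below; the function names, dict-of-vectors representation, and `min_sim` parameter name are illustrative assumptions:

```python
import math

def cos_sim(u, v):
    """Formula (6): cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def weet_weights(query_vecs, cand_vecs, min_sim):
    """Formulas (7)-(8): keep candidates whose similarity to a query
    term is >= min_sim; a candidate selected by several query terms
    gets the cumulative sum of those similarities as its weight."""
    weights = {}
    for q_vec in query_vecs.values():
        for cet, c_vec in cand_vecs.items():
            s = cos_sim(q_vec, c_vec)
            if s >= min_sim:
                weights[cet] = weights.get(cet, 0.0) + s
    return weights
```

A candidate close to two query terms thus receives the sum of both similarities, which is exactly the repeated-word rule stated above.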
Step 5, intersect the rule expansion word set with the word vector expansion word set to obtain the final expansion words, realizing the intersection expansion of word vectors and association patterns. The specific steps are as follows:
(5.1) Intersect the rule expansion word set ARET with the word vector expansion word set WEET to obtain the final expansion word set FETS (Final Expansion Term Set) of the original query term set Q, realizing the intersection expansion of word vectors and association patterns; calculate the final expansion word weight wFet, and then proceed to step 6.
The FETS is represented by formula (9):
In formula (9), Fetn represents the nth final expansion word.
The final expansion word weight wFet is the sum of the rule expansion word weight wRet and the word vector expansion word weight wweet, as shown in formula (10):
wFet = wRet + wweet    (10)
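Step 5 and formula (10) amount to a keyed intersection with summed weights. A minimal sketch, assuming both expansion sets are dicts mapping expansion word → weight (the function name is hypothetical):

```python
def intersect_expansion(aret, weet):
    """FETS = ARET ∩ WEET (formula (9)); each final expansion word's
    weight is wRet + wweet (formula (10))."""
    return {term: aret[term] + weet[term]
            for term in aret.keys() & weet.keys()}
```

Only words supported by both the association-rule evidence and the word-vector evidence survive, which is the design rationale of the intersection expansion.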
Step 6, combine the final expansion words with the original query terms into a new query, re-retrieve the document set, obtain the final retrieval result, and return it to the user.
Experimental design and results:
We compare the proposed method experimentally with existing similar methods to demonstrate its effectiveness.
1. Experimental environment and experimental data:
The experimental corpus is the NTCIR-5 CLIR Chinese text standard corpus (see http://research.nii.ac.jp/ntcir/data/data-en.html), containing 901,446 traditional Chinese documents (converted to simplified Chinese for the experiments) distributed over 8 datasets as shown in Table 1. The NTCIR-5 CLIR corpus has 50 Chinese queries, 4 types of query topics, and result sets under 2 evaluation criteria, namely Rigid (highly relevant and relevant to the query) and Relax (highly relevant, relevant, and partially relevant to the query). The retrieval experiments use the Description (Desc, a long query) and Title (a short query) topic fields. The evaluation metric for retrieval results is P@5.
TABLE 1 Original corpora and their document counts
2. Baseline retrieval and comparison methods:
The basic retrieval environment for the experiments is built with Lucene.
The baseline retrieval result is obtained by submitting the original query to Lucene.
The comparison methods are described as follows:
Comparative method 1: rule expansion words are mined with a document-weighted frequent pattern mining technique based on multiple minimum support thresholds (see: Zhang H R, Zhang J W, Wei X Y, et al. A new frequency pattern mining with weighted multiple minimum supports [J]. Intelligent Automation & Soft Computing, 2017, 23(4): 605-) to realize query expansion. Parameters: the experimental results are the averages over ms = 0.1, 0.15, 0.2, 0.25, with mc = 0.1 and the remaining thresholds (LMS, HMS, WT) set to 0.1.
Comparative method 2: rule expansion words are mined with a fully-weighted positive and negative association pattern mining technique (see: Huang M X, Jiang C Q. Vietnamese-English cross-language query translation expansion based on fully-weighted positive and negative association pattern mining [J]. Acta Electronica Sinica, 2018, 46(12): 3029-3036.) to realize query expansion. Parameters: the experimental results are the averages over ms = 0.10, 0.11, 0.12, 0.13, with mc = 0.1, α = 0.3, minPR = 0.1, and minNR = 0.01.
The Skip-gram model word embedding semantic learning training parameters used by the invention are: batch_size = 128, embedding_size = 300, skip_window = 2, num_skip = 4, num_sampled = 64.
3. The experimental methods and results are as follows:
The averages of the P@5 retrieval results obtained by the 50 Chinese queries on the 8 datasets of the NTCIR-5 CLIR corpus are shown in Tables 2 and 3, where the average increase (%) is the overall average increase of the proposed method's retrieval results over the baseline retrieval and the comparison expansion methods across the 8 datasets. The average increase is computed as follows: first compute, on each dataset, the increase of the proposed method's result over the baseline retrieval or comparison expansion method; then sum the increases over the 8 datasets and divide by 8 to obtain the overall average increase of the proposed method over the other methods.
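The average-increase computation described above can be sketched as follows; the P@5 values in the usage example are illustrative, not taken from Tables 2 and 3:

```python
def average_increase(ours, others):
    """Percentage increase of our P@5 over another method on each
    dataset, summed and divided by the number of datasets (8 in the
    paper's setup)."""
    increases = [(o - b) / b * 100.0 for o, b in zip(ours, others)]
    return sum(increases) / len(increases)
```

With hypothetical values ours = [0.6, 0.5] and baseline = [0.5, 0.4], the per-dataset increases are 20% and 25%, so the average increase is 22.5%.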
TABLE 2 Retrieval performance P@5 values of the proposed method versus the baseline retrieval and comparison methods (Title queries)
TABLE 3 Retrieval performance P@5 values of the proposed method versus the baseline retrieval and comparison methods (Desc queries)
Tables 2 and 3 show that, compared with the baseline retrieval, the proposed method's P@5 values improve markedly, with an average increase of up to 20.44%; most of its P@5 values also exceed those of the comparison methods, so its expansion retrieval performance surpasses both the baseline retrieval and the similar comparison methods. The experimental results indicate that the method is effective, genuinely improves information retrieval performance, and has high application value and broad prospects for adoption.
Claims (1)
1. A text retrieval method based on intersection expansion of word vectors and association modes is characterized by comprising the following steps:
step 1, a Chinese user query retrieves the Chinese document set to obtain initial retrieval documents, and an initial retrieval document set is constructed;
step 2, extract the top m initial retrieval documents from the initial retrieval document set to construct the initial pseudo-relevance feedback document set; perform Chinese word segmentation and stop-word removal on it and extract the Chinese feature words; calculate the feature word weights using the TF-IDF weighting technique; and finally construct the pseudo-relevance feedback Chinese document library and the Chinese feature word library;
step 3, mine rule expansion words in the initial pseudo-relevance feedback document set using the Copulas-theory-based support and confidence, and construct the rule expansion word set, specifically as follows:
(3.1) mining the 1_candidate item set C1: directly extract feature words from the Chinese feature word library as the 1_candidate item set C1;
(3.2) mining the 1_frequent item set L1: calculate the Copulas-function-based support Copulas_S(C1) of C1, and extract each C1 whose Copulas_S(C1) is not lower than the minimum support threshold ms as a 1_frequent item set L1, adding it to the frequent item set FI;
the Copulas_S(C1) is calculated as shown in formula (1):
in formula (1), Frequency[C1] represents the frequency of occurrence of the 1_candidate item set C1 in the pseudo-relevance feedback Chinese document library, SumCount represents the total number of documents in the pseudo-relevance feedback Chinese document library, Weight[C1] represents the item set weight of the 1_candidate item set C1 in the pseudo-relevance feedback Chinese document library, and SumWeight represents the cumulative sum of the weights of all Chinese feature words in the pseudo-relevance feedback Chinese document library;
(3.3) generating the k_candidate item set Ck, where k ≥ 2: derive the k_candidate item set Ck by self-joining the (k-1)_frequent item sets Lk-1; the self-join uses the candidate item set join method given in the Apriori algorithm;
(3.4) pruning the 2_candidate item set C2: if C2 does not contain an original query term, delete C2; if C2 contains an original query term, keep C2; then go to step (3.5);
(3.5) mining the k_frequent item set Lk, where k ≥ 2: calculate the Copulas-function-based support Copulas_S(Ck) of Ck, and extract each Ck whose Copulas_S(Ck) is not lower than the minimum support threshold ms as a k_frequent item set Lk, adding it to FI;
the Copulas_S(Ck) is calculated as shown in formula (2):
in formula (2), Frequency[Ck] represents the frequency of occurrence of Ck in the pseudo-relevance feedback Chinese document library, Weight[Ck] represents the item set weight of Ck in the pseudo-relevance feedback Chinese document library, and SumCount and SumWeight are defined as in formula (1);
(3.6) increment k by 1 and return to step (3.3) to continue the subsequent steps; when Lk is an empty set, frequent item set mining is finished, and the process goes to step (3.7);
(3.7) take out any Lk from FI, where k ≥ 2;
(3.8) extract proper subset item sets Let and Lq of Lk such that Lq ∪ Let = Lk, where Let is a proper subset item set containing no query terms, Lq is a proper subset item set containing query terms, and Q is the original query term set;
(3.9) mine association rules Lq → Let: calculate the Copulas-function-based confidence Copulas_C(Lq → Let) of the association rule Lq → Let, and add each association rule Lq → Let whose confidence Copulas_C(Lq → Let) is not lower than the minimum confidence threshold mc to the association rule set AR;
the Copulas_C(Lq → Let) is calculated as shown in formula (3):
in formula (3), Frequency[Lq] represents the frequency of occurrence of the proper subset item set Lq in the pseudo-relevance feedback Chinese document library, Weight[Lq] represents the item set weight of Lq in the pseudo-relevance feedback Chinese document library, Frequency[Lk] represents the frequency of occurrence of the item set Lk in the pseudo-relevance feedback Chinese document library, and Weight[Lk] represents the item set weight of Lk in the pseudo-relevance feedback Chinese document library; SumCount and SumWeight are as defined for formula (1);
(3.10) add each association rule Lq → Let whose Copulas_C(Lq → Let) is not less than the minimum confidence threshold mc to the association rule set AR; then return to step (3.8) to extract other proper subset item sets Let and Lq from Lk and carry out the subsequent steps in turn, repeating until every pair of proper subsets of Lk has been taken out exactly once; then go to step (3.7) for a new round of association rule pattern mining, taking out another Lk from FI and carrying out the subsequent steps in turn, repeating until every k_frequent item set Lk in FI has been taken out exactly once, at which point association rule pattern mining is finished and the process goes to the following step (3.11);
(3.11) extract the consequent item sets Let of the association rules from the association rule set AR, where Let = (Ret1, Ret2, …, Rets), s ≥ 1; extract rule expansion words from the item sets Let and remove duplicates to obtain the rule expansion word set ARET, and calculate the rule expansion word weight wRet; then go to step 4;
the ARET is shown in formula (4):
in formula (4), Reti denotes the ith rule expansion word after duplicates have been removed, where i ≥ 1;
the rule expansion word weight wRet is calculated as shown in formula (5):
in formula (5), max() denotes the maximum confidence over the association rules; when the same rule expansion word appears in several association rule patterns at once, the maximum confidence is taken as that expansion word's weight;
step 4, perform word vector semantic learning training on the initial retrieval document set using a deep learning tool to obtain the word vector expansion word set, specifically comprising the following steps:
(4.1) perform word vector semantic learning training on the initial pseudo-relevance feedback document set using a deep learning tool to obtain the initial retrieval document feature word vector set;
the deep learning tool is the Skip-gram model of Google's open-source word vector tool word2vec;
(4.2) in the initial retrieval document feature word vector set, calculate the word vector cosine similarity CosSim(qi, cetj) between each query term qi (qi ∈ Q, where Q is the original query term set, Q = (q1, q2, …, qn), 1 ≤ i ≤ n) and all word vector candidate expansion words (cet1, cet2, …, cetm), as shown in formula (6), where 1 ≤ j ≤ m; the word vector candidate expansion words are the non-query feature terms in the word vector set;
in formula (6), vcetj represents the word vector value of the jth word vector candidate expansion word cetj, and vqi represents the word vector value of the ith query term qi;
(4.3) given a minimum vector cosine similarity threshold minSim, extract the word vector candidate expansion words whose CosSim(qi, cetj) is not lower than minSim as the word vector expansion words (qiet1, qiet2, …, qietp1) of query term qi; pool the expansion words of q1, q2, …, qn, remove duplicates, and obtain the final word vector expansion word set WEET of the original query term set Q; calculate the word vector expansion word weight wweet, and then go to step 5;
the WEET is shown in formula (7):
the word vector expansion word weight wweet is the vector cosine similarity between the query term and the word vector expansion word, as shown in formula (8); when a word recurs, its weight equals the cumulative sum of the vector similarities of its occurrences;
step 5, intersect the rule expansion word set with the word vector expansion word set to obtain the final expansion words, realizing the intersection expansion of word vectors and association patterns, specifically as follows:
(5.1) intersect the rule expansion word set ARET with the word vector expansion word set WEET to obtain the final expansion word set FETS of the original query term set Q, realizing the intersection expansion of word vectors and association patterns; calculate the final expansion word weight wFet, and then go to step 6;
the FETS is represented by formula (9):
in formula (9), Fetn represents the nth final expansion word;
the final expansion word weight wFet is the sum of the rule expansion word weight wRet and the word vector expansion word weight wweet, as shown in formula (10):
wFet = wRet + wweet    (10)
step 6, combine the final expansion words with the original query terms into a new query, re-retrieve the document set, obtain the final retrieval result, and return it to the user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010774137.8A CN111897923A (en) | 2020-08-04 | 2020-08-04 | Text retrieval method based on intersection expansion of word vector and association mode |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111897923A true CN111897923A (en) | 2020-11-06 |
Family
ID=73245599
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010774137.8A Withdrawn CN111897923A (en) | 2020-08-04 | 2020-08-04 | Text retrieval method based on intersection expansion of word vector and association mode |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111897923A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108052593B (en) | Topic keyword extraction method based on topic word vector and network structure | |
Wen et al. | Research on keyword extraction based on word2vec weighted textrank | |
CN108846029B (en) | Information correlation analysis method based on knowledge graph | |
CN106570120A (en) | Process for realizing searching engine optimization through improved keyword optimization | |
Huang et al. | An approach on Chinese microblog entity linking combining *** encyclopaedia and word2vec | |
CN111897922A (en) | Chinese query expansion method based on pattern mining and word vector similarity calculation | |
CN108334573B (en) | High-correlation microblog retrieval method based on clustering information | |
CN109739953B (en) | Text retrieval method based on chi-square analysis-confidence framework and back-part expansion | |
CN111897926A (en) | Chinese query expansion method integrating deep learning and expansion word mining intersection | |
CN111897923A (en) | Text retrieval method based on intersection expansion of word vector and association mode | |
CN111897924A (en) | Text retrieval method based on association rule and word vector fusion expansion | |
CN111897927B (en) | Chinese query expansion method integrating Copulas theory and association rule mining | |
Li et al. | Complex query recognition based on dynamic learning mechanism | |
Li et al. | Deep learning and semantic concept space are used in query expansion | |
Miyanishi et al. | Time-aware latent concept expansion for microblog search | |
CN111897921A (en) | Text retrieval method based on word vector learning and mode mining fusion expansion | |
CN111897919A (en) | Text retrieval method based on Copulas function and pseudo-correlation feedback rule expansion | |
Kuhr et al. | Context-specific adaptation of subjective content descriptions | |
CN111897928A (en) | Chinese query expansion method for embedding expansion words into query words and counting expansion word union | |
Yang et al. | An improved pagerank algorithm based on time feedback and topic similarity | |
CN108416442B (en) | Chinese word matrix weighting association rule mining method based on item frequency and weight | |
CN111897925B (en) | Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning | |
CN109684464B (en) | Cross-language query expansion method for realizing rule back-part mining through weight comparison | |
Yan | Research on keyword extraction based on abstract extraction | |
CN111897920A (en) | Text retrieval method based on word embedding and association mode union expansion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20201106