CN111897923A - Text retrieval method based on intersection expansion of word vector and association mode - Google Patents


Info

Publication number
CN111897923A
Authority
CN
China
Prior art keywords
word
expansion
word vector
rule
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010774137.8A
Other languages
Chinese (zh)
Inventor
黄名选 (Huang Mingxuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Finance and Economics
Original Assignee
Guangxi University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University of Finance and Economics filed Critical Guangxi University of Finance and Economics
Priority to CN202010774137.8A
Publication of CN111897923A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/33: Querying
    • G06F16/331: Query processing (G06F16/3331)
    • G06F16/332: Query formulation
    • G06F16/3325: Reformulation based on results of preceding query
    • G06F16/3332: Query translation
    • G06F16/3334: Selection or weighting of terms from queries, including natural language queries
    • G06F16/3335: Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/3338: Query expansion
    • G06F16/334: Query execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text retrieval method based on the intersection expansion of word vectors and association modes. First, a user query retrieves a Chinese document set to obtain an initially retrieved document set. Rule expansion word mining and word vector semantic learning training are then performed on this set, yielding a rule expansion word set and a word vector expansion word set respectively: the rule expansion words carry association information between feature words derived from statistical analysis, while the word vector expansion words carry rich contextual semantic information. The two sets are fused by intersection to obtain the final expansion word set, improving the quality of the expansion words and realizing query expansion. Experimental results show that the method can reduce query topic drift and word mismatch in information retrieval, improve information retrieval performance, outperform similar comparison methods of recent years, and has good application value and promotion prospects.

Description

Text retrieval method based on intersection expansion of word vector and association mode
Technical Field
The invention relates to a text retrieval method based on the intersection expansion of word vectors and association modes, and belongs to the technical field of information retrieval.
Background
In the face of the massive Internet information resources of the big data era, how network users can accurately and efficiently retrieve the information they need from the big data of the web has long been a concern of the information retrieval field in both academia and industry. Query expansion is one of the core key technologies for solving word mismatch and query topic drift in information retrieval: it adds other feature words related to the semantics of the original query, compensating for the lack of semantic information caused by an overly simple original query, so as to improve information retrieval performance. Information retrieval methods based on query expansion have drawn attention from scholars at home and abroad. For example, Pan et al. (see: Min Pan, Jimmy Huang, Tingting He, et al. A Simple Kernel Co-occurrence-based Enhancement for Pseudo-Relevance Feedback [J]. Journal of the Association for Information Science and Technology (JASIST), 2020, 71(3): 264-281.) used a pseudo-relevance feedback query expansion method based on kernel co-occurrence semantics in information retrieval, with experiments showing the validity of the method; Latiri et al. (see: Latiri C, Haddad H, Hamrouni T. Towards an effective automatic query expansion process using an association rule mining approach [J]. Journal of Intelligent Information Systems, 2012, 39(1): 209-247.) mined association rules for automatic query expansion; Huang et al. (see: 黄名选, et al. Pseudo-relevance feedback query expansion based on matrix-weighted association rule mining [J]. Journal of Software, 2009, 20(7): 1854-.) proposed a pseudo-relevance feedback query expansion method mined from matrix-weighted association rules; and word-vector approaches (see: Research on a word vector method for patent query expansion [J]. Journal of Frontiers of Computer Science and Technology, 2018, 12(6): 972-980.) realize query expansion in information retrieval by selecting expansion words through word vector cosine similarity. Experimental results show that these methods can improve retrieval performance.
However, existing query expansion methods have not yet fully solved technical problems such as query topic drift and word mismatch in information retrieval. To better improve information retrieval performance and effectively suppress query topic drift and word mismatch, the invention provides a text retrieval method based on the intersection expansion of word vectors and association modes, which improves text information retrieval performance and has good application value and broad promotion prospects.
Disclosure of Invention
The invention aims to provide a text retrieval method based on the intersection expansion of word vectors and association modes for use in the information retrieval field, for example in Chinese web information retrieval systems or search engines; it can improve the query performance of an information retrieval system and reduce query topic drift and word mismatch in information retrieval.
The invention adopts the following specific technical scheme:
a text retrieval method based on intersection expansion of word vectors and association modes comprises the following steps:
Step 1, a Chinese user query retrieves the Chinese document set to obtain the initially retrieved documents, from which an initially retrieved document set is constructed.
Step 2, extract m documents from the initially retrieved document set to construct an initial pseudo-relevance feedback document set, perform Chinese word segmentation and stopword removal on it, extract the Chinese feature words and calculate the feature word weights, and finally construct the pseudo-relevance feedback Chinese document library and the Chinese feature word library.
The invention adopts the TF-IDF (term frequency-inverse document frequency) weighting technique (see: Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al.; Chinese translation by Wang Zhijin et al. Modern Information Retrieval. China Machine Press, 2005: 21-22.) to calculate the feature word weights.
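As an illustration of this weighting step, the following Python sketch computes one common TF-IDF variant for already-segmented documents. It is a minimal sketch only, since the patent specifies the technique by citation rather than by an exact formula, and the toy corpus is a placeholder:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute a TF-IDF weight for every feature word of every document.

    docs: list of documents, each a list of already-segmented,
    stopword-free Chinese feature words.
    Returns one {word: weight} dict per document.
    """
    n_docs = len(docs)
    df = Counter()                       # document frequency per word
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({
            w: (1 + math.log(tf[w])) * math.log(n_docs / df[w])
            for w in tf
        })
    return weighted

# Toy usage with three tiny "documents".
docs = [["查询", "扩展", "检索"], ["词", "向量", "检索"], ["查询", "词", "向量"]]
print(tfidf_weights(docs)[0])
```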
Step 3, mine rule expansion words in the initial pseudo-relevance feedback document set using the Copulas-theory-based support and confidence, and construct the rule expansion word set. The specific steps are as follows:
(3.1) Mine the 1_candidate itemsets C1: extract feature words directly from the Chinese feature word library as the 1_candidate itemsets C1.

(3.2) Mine the 1_frequent itemsets L1: calculate the Copulas-function-based support Copulas_S(C1) of each C1, take each C1 whose Copulas_S(C1) is not lower than the minimum support threshold ms as a 1_frequent itemset L1, and add it to the frequent itemset set FI (Frequent Itemset).
Copulas_S (Copulas-based support) denotes the support based on the Copulas function. Copulas_S(C1) is calculated as shown in equation (1):

Copulas_S(C1) = C(Frequency[C1]/SumCount, Weight[C1]/SumWeight)    (1)

where C(u, v) denotes the Copulas function coupling its two arguments (the concrete form of equation (1) appears only as an image in the original publication). In equation (1), Frequency[C1] represents the frequency with which the 1_candidate itemset C1 occurs in the pseudo-relevance feedback Chinese document library, SumCount represents the total number of documents in that library, Weight[C1] represents the itemset weight of C1 in that library, and SumWeight represents the accumulated weight of all Chinese feature words in that library.
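A minimal sketch of this support computation follows, assuming for illustration the simple product copula C(u, v) = u*v (the patent's concrete copula is given only in the original equation image):

```python
def copulas_support(frequency, sum_count, weight, sum_weight,
                    copula=lambda u, v: u * v):
    """Copulas-based support of an itemset.

    Couples the frequency marginal u = frequency / sum_count with the
    weight marginal v = weight / sum_weight through a copula C(u, v).
    The product copula used by default is an assumption for
    illustration; the patent's actual copula may differ.
    """
    u = frequency / sum_count
    v = weight / sum_weight
    return copula(u, v)

# Toy usage: an itemset seen in 12 of 50 feedback documents, with
# accumulated weight 3.4 against a corpus weight sum of 96.0.
print(copulas_support(12, 50, 3.4, 96.0))  # ~0.0085 with the product copula
```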
(3.3) Generate the k_candidate itemsets Ck, k ≥ 2: derive the k_candidate itemsets Ck by self-joining the (k-1)_frequent itemsets Lk-1.

The self-join adopts the candidate itemset join method given in the Apriori algorithm (see: Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., USA, 1993: 207-216.).

(3.4) Prune the 2_candidate itemsets C2: if a C2 does not contain any original query term, delete it; if it contains an original query term, keep it and proceed with it to step (3.5).
(3.5) Mine the k_frequent itemsets Lk, k ≥ 2: calculate the Copulas-function-based support Copulas_S(Ck) of each Ck, take each Ck whose Copulas_S(Ck) is not lower than the minimum support threshold ms as a k_frequent itemset Lk, and add it to FI.

Copulas_S(Ck) is calculated as shown in equation (2):

Copulas_S(Ck) = C(Frequency[Ck]/SumCount, Weight[Ck]/SumWeight)    (2)

In equation (2), Frequency[Ck] represents the frequency with which Ck occurs in the pseudo-relevance feedback Chinese document library, Weight[Ck] represents the itemset weight of Ck in that library, and SumCount and SumWeight are defined as in equation (1).
(3.6) Increase k by 1 and return to step (3.3) to continue the subsequent steps, until Lk is an empty set; frequent itemset mining then ends, and the procedure moves to step (3.7).
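Steps (3.1)-(3.6) follow the Apriori level-wise scheme; the sketch below mirrors them in Python. It is illustrative only: `support_fn` stands in for Copulas_S, the join is a simplified self-join, and the query-term pruning of step (3.4) is applied at every level for brevity:

```python
def mine_frequent_itemsets(doc_itemsets, query_terms, support_fn, ms):
    """Apriori-style level-wise mining of steps (3.1)-(3.6).

    doc_itemsets: documents as frozensets of feature words.
    support_fn:   callable standing in for the Copulas-based support
                  Copulas_S of an itemset.
    ms:           minimum support threshold.
    """
    items = {w for doc in doc_itemsets for w in doc}
    level = [frozenset([w]) for w in sorted(items)
             if support_fn(frozenset([w])) >= ms]           # L1
    fi = list(level)                                        # FI
    k = 2
    while level:
        joined = {a | b for a in level for b in level
                  if len(a | b) == k}                       # self-join
        joined = {c for c in joined if c & query_terms}     # prune (3.4)
        level = [c for c in joined if support_fn(c) >= ms]  # Lk
        fi.extend(level)
        k += 1
    return fi

# Toy usage with a frequency-only support standing in for Copulas_S.
docs = [frozenset(["查询", "扩展", "检索"]), frozenset(["查询", "检索"])]
support = lambda s: sum(1 for d in docs if s <= d) / len(docs)
print(mine_frequent_itemsets(docs, frozenset(["查询"]), support, ms=0.5))
```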
(3.7) Take any k_frequent itemset Lk, k ≥ 2, out of FI.
(3.8) Extract proper subset itemsets Let and Lq of Lk such that

Lq ∩ Let = ∅, Lq ∪ Let = Lk, Let ∩ Q = ∅,

where Let is a proper subset itemset containing no query terms, Lq is a proper subset itemset containing query terms, and Q is the original query term set.
(3.9) Mine the association rules Lq → Let: calculate the Copulas-function-based confidence Copulas_C(Lq → Let) of each rule, and add each association rule Lq → Let whose Copulas_C(Lq → Let) is not lower than the minimum confidence threshold mc to the association rule set AR (Association Rule).

Copulas_C (Copulas-based confidence) denotes the confidence of an association rule based on the Copulas function. Copulas_C(Lq → Let) is calculated as shown in equation (3):

Copulas_C(Lq → Let) = Copulas_S(Lk) / Copulas_S(Lq) = C(Frequency[Lk]/SumCount, Weight[Lk]/SumWeight) / C(Frequency[Lq]/SumCount, Weight[Lq]/SumWeight)    (3)

In equation (3), Frequency[Lq] represents the frequency with which the proper subset itemset Lq occurs in the pseudo-relevance feedback Chinese document library, Weight[Lq] represents the itemset weight of Lq in that library, Frequency[Lk] represents the frequency with which the itemset Lk occurs in that library, and Weight[Lk] represents the itemset weight of Lk in that library. SumCount and SumWeight are defined as in equation (1).
(3.10) Add each rule whose Copulas_C(Lq → Let) is not lower than the minimum confidence threshold mc to the association rule set AR, then return to step (3.8) to extract other proper subset itemsets Let and Lq from Lk and carry out the subsequent steps in turn, looping until each proper subset itemset of Lk has been taken out exactly once; then go to step (3.7) for a new round of association rule pattern mining, taking any other Lk out of FI and carrying out the subsequent steps in turn, looping until every k_frequent itemset Lk in FI has been taken out exactly once; association rule pattern mining then ends, and the procedure moves to step (3.11).
(3.11) Extract the rule consequent itemsets Let from the association rule set AR, where Let = (Ret1, Ret2, …, Rets), s ≥ 1; extract the rule expansion words from the itemsets Let and remove duplicates to obtain the rule expansion word set ARET (Expansion Terms from Association Rules); calculate the rule expansion word weights wRet, and then move to step 4.

ARET is shown in equation (4):

ARET = {Ret1, Ret2, …, Reti, …}    (4)

In equation (4), Reti denotes the ith rule expansion word after duplicate removal, i ≥ 1.
The rule expansion word weight wRet is calculated as shown in equation (5):

wRet = max(Copulas_C(Lq → Let)), taken over the association rules whose consequent Let contains the expansion word Ret    (5)

In equation (5), max() takes the maximum confidence among the association rules: when the same rule expansion word appears in several association rule patterns at once, the maximum confidence is taken as its weight.
Step 4, perform word vector semantic learning training on the initially retrieved document set with a deep learning tool to obtain the word vector expansion word set. The specific steps are as follows:

(4.1) Perform word vector semantic learning training on the initial pseudo-relevance feedback document set with a deep learning tool to obtain the feature word vector set of the initially retrieved documents.

The deep learning tool is the Skip-gram model of Google's open-source word vector tool word2vec (see https://code.google.com/p/word2vec/).
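As an illustration of step (4.1), the sketch below trains Skip-gram vectors with gensim's word2vec implementation; the use of gensim in place of the original Google C tool is an assumption for illustration, and the corpus and hyperparameter values are placeholders:

```python
from gensim.models import Word2Vec

# Each "sentence" is one segmented, stopword-free feedback document.
feedback_docs = [["查询", "扩展", "检索", "方法"],
                 ["词", "向量", "语义", "学习"],
                 ["查询", "词", "向量", "检索"]]

model = Word2Vec(sentences=feedback_docs,
                 vector_size=300,  # word vector dimension
                 window=2,         # context window width
                 sg=1,             # 1 selects the Skip-gram model
                 negative=64,      # negative-sampling count
                 min_count=1)      # keep every feature word

# Feature word vector set: word -> 300-dimensional vector.
vectors = {w: model.wv[w] for w in model.wv.index_to_key}
```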
(4.2) In the feature word vector set of the initially retrieved documents, calculate the word vector cosine similarity CosSim(qi, cetj) between each query term qi (qi ∈ Q, where Q is the original query term set, Q = (q1, q2, …, qn), 1 ≤ i ≤ n) and every word vector candidate expansion word (cet1, cet2, …, cetm), as shown in equation (6), where 1 ≤ j ≤ m. The word vector candidate expansion words are the non-query terms in the word vector set.

CosSim(qi, cetj) = (vqi · vcetj) / (|vqi| × |vcetj|)    (6)

In equation (6), vcetj denotes the word vector of the jth word vector candidate expansion word cetj, and vqi denotes the word vector of the ith query term qi.

(4.3) Given a minimum vector cosine similarity threshold minSim, extract the candidate expansion words whose CosSim(qi, cetj) is not lower than minSim as the word vector expansion words (qiet1, qiet2, …, qietpi) of query term qi; pool the expansion words of q1, q2, …, qn and remove duplicates to obtain the final word vector expansion word set WEET (Expansion Terms from Word Embedding) of the original query term set Q; calculate the word vector expansion word weights wweet, and then move to step 5.
WEET is shown in equation (7):

WEET = (q1et1, …, q1etp1) ∪ (q2et1, …, q2etp2) ∪ … ∪ (qnet1, …, qnetpn), with duplicates removed    (7)

The word vector expansion word weight wweet is the vector cosine similarity between a query term and the word vector expansion word, as shown in equation (8); when a word is selected repeatedly, its weight equals the accumulated sum of the corresponding vector similarities:

wweet(et) = Σ CosSim(qi, et), summed over the query terms qi for which et is selected    (8)
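Steps (4.2)-(4.3) reduce to a cosine-similarity filter over the learned vectors; a minimal numpy-based sketch (with `vectors` as produced by the training sketch above, and the repeated-word accumulation of equation (8)):

```python
import numpy as np

def word_vector_expansion(vectors, query_terms, min_sim):
    """Sketch of steps (4.2)-(4.3): build WEET with weights w_weet.

    vectors:     {word: np.ndarray} feature word vector set.
    query_terms: set of original query terms present in `vectors`.
    min_sim:     minimum vector cosine similarity threshold minSim.
    A word selected by several query terms accumulates the
    similarities, per equation (8).
    """
    w_weet = {}
    for q in query_terms:
        vq = vectors[q]
        for w, vw in vectors.items():
            if w in query_terms:      # candidates are non-query terms
                continue
            sim = float(np.dot(vq, vw) /
                        (np.linalg.norm(vq) * np.linalg.norm(vw)))
            if sim >= min_sim:
                w_weet[w] = w_weet.get(w, 0.0) + sim
    return w_weet

# Usage with the gensim vectors trained above, e.g.:
# weet = word_vector_expansion(vectors, {"查询", "检索"}, min_sim=0.3)
```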
Step 5, intersect the rule expansion word set with the word vector expansion word set to obtain the final expansion words, realizing the intersection expansion of word vectors and association modes. The specific steps are as follows:

(5.1) Intersect the rule expansion word set ARET with the word vector expansion word set WEET to obtain the final expansion word set FETS (Final Expansion Term Set) of the original query term set Q, realizing the intersection expansion of word vectors and association modes; calculate the final expansion word weights wFet, and then move to step 6.
FETS is shown in equation (9):

FETS = ARET ∩ WEET = {Fet1, Fet2, …, Fetn, …}    (9)

In equation (9), Fetn denotes the nth final expansion word.

The final expansion word weight wFet is the sum of the rule expansion word weight wRet and the word vector expansion word weight wweet, as shown in equation (10):

wFet = wRet + wweet    (10)
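Step 5 and equation (10) amount to intersecting the two weighted dictionaries built above; a one-function sketch:

```python
def intersect_expansions(w_ret, w_weet):
    """Step 5 / equation (10): FETS = ARET ∩ WEET, with
    w_Fet = w_Ret + w_weet for every word found by both routes."""
    common = w_ret.keys() & w_weet.keys()
    return {w: w_ret[w] + w_weet[w] for w in common}

# e.g. intersect_expansions({"检索": 0.8}, {"检索": 0.4, "语义": 0.3})
# -> {"检索": 1.2}
```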
Step 6, combine the final expansion words with the original query terms into a new query, retrieve the document set again, obtain the final retrieval result, and return it to the user.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a text retrieval method based on the intersection expansion of word vectors and association modes. First, a user query retrieves the Chinese document set to obtain an initially retrieved document set; rule expansion word mining and word vector semantic learning training are then performed on this set, yielding a rule expansion word set and a word vector expansion word set respectively. The rule expansion words carry association information between feature words derived from statistical analysis, while the word vector expansion words carry rich contextual semantic information; fusing the two sets by intersection gives the final expansion word set, improving expansion word quality and realizing query expansion. Experimental results show that the method can improve information retrieval performance, outperforms similar comparison methods of recent years, and has good application value and promotion prospects.
(2) Four similar query expansion methods from recent years are selected as comparison methods, with the Chinese corpus of the standard data set NTCIR-5 CLIR as experimental data. The experimental results show that, compared with the baseline retrieval, the MAP of the method of the invention achieves a maximum average amplification of 27.87%, and its average amplification over the comparison methods reaches 18.21%; the experimental effect is significant. The retrieval performance of the method is thus better than that of the baseline retrieval and the comparison methods; it can improve information retrieval performance, reduce query topic drift and word mismatch in information retrieval, and has high application value and broad promotion prospects.
Drawings
Fig. 1 is a general flow diagram of a text retrieval method based on word vector and association pattern intersection expansion according to the present invention.
Detailed Description
First, to better explain the technical scheme of the invention, the related concepts involved are introduced as follows:
1. Itemset

In text mining, a text document is regarded as a transaction, each feature word in the document is called an item, a set of feature word items is called an itemset, and the number of items in an itemset is called the itemset length. A k_itemset is an itemset containing k items, where k is the itemset length.
2. Association rule antecedent and consequent

Let x and y be arbitrary feature word itemsets; an implication of the form x → y is called an association rule, where x is called the rule antecedent and y the rule consequent.
3. The Copulas function and the support and confidence of feature word association patterns

Copulas theory (see: Sklar A. Fonctions de répartition à n dimensions et leurs marges [J]. Publications de l'Institut de Statistique de l'Université de Paris, 1959, 8(1): 229-231.) describes the correlation between variables, and can combine and couple distributions of arbitrary form into an effective multivariate distribution function.

The invention uses the Copulas function to integrate the frequency and the weight of a feature word itemset into the support and confidence of feature word association patterns, and gives the calculation formulas of the support Copulas_S (Copulas-based support) and the confidence Copulas_C (Copulas-based confidence) as follows.
The support Copulas_S(T1 ∪ T2) of a feature word itemset (T1 ∪ T2) is calculated as shown in equation (11):

Copulas_S(T1 ∪ T2) = C(Frequency[T1 ∪ T2]/SumCount, Weight[T1 ∪ T2]/SumWeight)    (11)

where C(u, v) denotes the Copulas coupling function. In equation (11), Frequency[T1 ∪ T2] represents the frequency with which the itemset (T1 ∪ T2) occurs in the pseudo-relevance feedback Chinese document library, Weight[T1 ∪ T2] represents the itemset weight of (T1 ∪ T2) in that library, SumCount represents the total number of documents in that library, and SumWeight represents the accumulated weight of all Chinese feature words in that library.
The confidence Copulas_C of an association rule (T1 → T2) is calculated as shown in equation (12):

Copulas_C(T1 → T2) = Copulas_S(T1 ∪ T2) / Copulas_S(T1)    (12)

In equation (12), Frequency[T1] represents the frequency with which the itemset T1 occurs in the pseudo-relevance feedback Chinese document library, Weight[T1] represents the itemset weight of T1 in that library, and Frequency[T1 ∪ T2], Weight[T1 ∪ T2], SumCount and SumWeight are defined as in equation (11).
4. Rule expansion word
A rule expansion word is an expansion word drawn from the consequent itemset of an association rule whose antecedent itemset is made up of original query terms.
5. Rule expansion word weight calculation
The invention takes the confidence of the association rule as the rule expansion word weight wRet.

The rule expansion word weight wRet is calculated as shown in equation (13):

wRet = max(Copulas_C(Lq → Let)), taken over the association rules whose consequent Let contains the expansion word Ret    (13)

In equation (13), Let is the proper subset itemset that contains the expansion word Ret and no query terms, Lq is the proper subset itemset containing query terms, Q is the original query term set, and Lq ∩ Let = ∅, Lq ∪ Let = Lk, Let ∩ Q = ∅. Frequency[Lq] represents the frequency with which the itemset Lq occurs in the pseudo-relevance feedback Chinese document library, Weight[Lq] represents the itemset weight of Lq in that library, Frequency[Lk] represents the frequency with which the itemset Lk occurs in that library, and Weight[Lk] represents the itemset weight of Lk in that library; SumCount and SumWeight are defined as in equation (11); max() takes the maximum confidence among the association rules: when the same rule expansion word appears in several association rule patterns at once, the maximum confidence is taken as its weight.
6. Word vector expansion word and weight thereof
The meaning of the word vector expansion word is described as follows:
In the feature word vector set of the initially retrieved documents, the word vector cosine similarity CosSim(qi, cetj) is calculated between each query term qi (qi ∈ Q, where Q is the original query term set, Q = (q1, q2, …, qn), 1 ≤ i ≤ n) and every word vector candidate expansion word (cet1, cet2, …, cetm), 1 ≤ j ≤ m. Given a minimum vector cosine similarity threshold minSim, the candidate expansion words whose CosSim(qi, cetj) is not lower than minSim are taken as the word vector expansion words (qiet1, qiet2, …, qietpi) of query term qi; pooling these over q1, q2, …, qn and removing duplicates yields the final word vector expansion word set WEET (Expansion Terms from Word Embedding) of the original query term set Q. The word vector candidate expansion words are the non-query terms in the word vector set.

The word vector cosine similarity CosSim(qi, cetj) is shown in equation (14):

CosSim(qi, cetj) = (vqi · vcetj) / (|vqi| × |vcetj|)    (14)

In equation (14), vcetj denotes the word vector of the jth word vector candidate expansion word cetj, and vqi denotes the word vector of the ith query term qi.

The word vector expansion word set WEET is shown in equation (15):

WEET = (q1et1, …, q1etp1) ∪ (q2et1, …, q2etp2) ∪ … ∪ (qnet1, …, qnetpn), with duplicates removed    (15)

The invention takes the vector cosine similarity between a query term and a word vector expansion word as the weight of that expansion word.

The word vector expansion word weight wweet is shown in equation (16); when a word is selected repeatedly, its weight equals the accumulated sum of the corresponding vector similarities:

wweet(et) = Σ CosSim(qi, et), summed over the query terms qi for which et is selected    (16)
The invention is further explained below with reference to the accompanying drawing and specific comparative experiments.
As shown in fig. 1, the text retrieval method based on the intersection expansion of the word vector and the association mode of the present invention includes the following steps:
Step 1, a Chinese user query retrieves the Chinese document set to obtain the initially retrieved documents, from which an initially retrieved document set is constructed.

Step 2, extract m documents from the initially retrieved document set to construct an initial pseudo-relevance feedback document set, perform Chinese word segmentation and stopword removal on it, extract the Chinese feature words and calculate the feature word weights with the TF-IDF weighting technique, and finally construct the pseudo-relevance feedback Chinese document library and the Chinese feature word library.
Step 3, mine rule expansion words in the initial pseudo-relevance feedback document set using the Copulas-theory-based support and confidence, and construct the rule expansion word set. The specific steps are as follows:

(3.1) Mine the 1_candidate itemsets C1: extract feature words directly from the Chinese feature word library as the 1_candidate itemsets C1.

(3.2) Mine the 1_frequent itemsets L1: calculate the Copulas-function-based support Copulas_S(C1) of each C1, take each C1 whose Copulas_S(C1) is not lower than the minimum support threshold ms as a 1_frequent itemset L1, and add it to the frequent itemset set FI (Frequent Itemset).

Copulas_S (Copulas-based support) denotes the support based on the Copulas function. Copulas_S(C1) is calculated as shown in equation (1):

Copulas_S(C1) = C(Frequency[C1]/SumCount, Weight[C1]/SumWeight)    (1)

In equation (1), Frequency[C1] represents the frequency with which the 1_candidate itemset C1 occurs in the pseudo-relevance feedback Chinese document library, SumCount represents the total number of documents in that library, Weight[C1] represents the itemset weight of C1 in that library, and SumWeight represents the accumulated weight of all Chinese feature words in that library.
(3.3) Generate the k_candidate itemsets Ck, k ≥ 2: derive the k_candidate itemsets Ck by self-joining the (k-1)_frequent itemsets Lk-1; the self-join uses the candidate itemset join method given in the Apriori algorithm.

(3.4) Prune the 2_candidate itemsets C2: if a C2 does not contain any original query term, delete it; if it contains an original query term, keep it and proceed with it to step (3.5).

(3.5) Mine the k_frequent itemsets Lk, k ≥ 2: calculate the Copulas-function-based support Copulas_S(Ck) of each Ck, take each Ck whose Copulas_S(Ck) is not lower than the minimum support threshold ms as a k_frequent itemset Lk, and add it to FI.

Copulas_S(Ck) is calculated as shown in equation (2):

Copulas_S(Ck) = C(Frequency[Ck]/SumCount, Weight[Ck]/SumWeight)    (2)

In equation (2), Frequency[Ck] represents the frequency with which Ck occurs in the pseudo-relevance feedback Chinese document library, Weight[Ck] represents the itemset weight of Ck in that library, and SumCount and SumWeight are defined as in equation (1).

(3.6) Increase k by 1 and return to step (3.3) to continue the subsequent steps, until Lk is an empty set; frequent itemset mining then ends, and the procedure moves to step (3.7).
(3.7) Take any k_frequent itemset Lk, k ≥ 2, out of FI.

(3.8) Extract proper subset itemsets Let and Lq of Lk such that

Lq ∩ Let = ∅, Lq ∪ Let = Lk, Let ∩ Q = ∅,

where Let is a proper subset itemset containing no query terms, Lq is a proper subset itemset containing query terms, and Q is the original query term set.
(3.9) Mine the association rules Lq → Let: calculate the Copulas-function-based confidence Copulas_C(Lq → Let) of each rule, and add each association rule Lq → Let whose Copulas_C(Lq → Let) is not lower than the minimum confidence threshold mc to the association rule set AR (Association Rule).

Copulas_C (Copulas-based confidence) denotes the confidence of an association rule based on the Copulas function. Copulas_C(Lq → Let) is calculated as shown in equation (3):

Copulas_C(Lq → Let) = Copulas_S(Lk) / Copulas_S(Lq)    (3)

In equation (3), Frequency[Lq] represents the frequency with which the proper subset itemset Lq occurs in the pseudo-relevance feedback Chinese document library, Weight[Lq] represents the itemset weight of Lq in that library, Frequency[Lk] represents the frequency with which the itemset Lk occurs in that library, and Weight[Lk] represents the itemset weight of Lk in that library. SumCount and SumWeight are defined as in equation (1).

(3.10) Add each rule whose Copulas_C(Lq → Let) is not lower than the minimum confidence threshold mc to the association rule set AR, then return to step (3.8) to extract other proper subset itemsets Let and Lq from Lk and carry out the subsequent steps in turn, looping until each proper subset itemset of Lk has been taken out exactly once; then go to step (3.7) for a new round of association rule pattern mining, taking any other Lk out of FI and carrying out the subsequent steps in turn, looping until every k_frequent itemset Lk in FI has been taken out exactly once; association rule pattern mining then ends, and the procedure moves to step (3.11).
(3.11) Extract the rule consequent itemsets Let from the association rule set AR, where Let = (Ret1, Ret2, …, Rets), s ≥ 1; extract the rule expansion words from the itemsets Let and remove duplicates to obtain the rule expansion word set ARET (Expansion Terms from Association Rules); calculate the rule expansion word weights wRet, and then move to step 4.

ARET is shown in equation (4):

ARET = {Ret1, Ret2, …, Reti, …}    (4)

In equation (4), Reti denotes the ith rule expansion word after duplicate removal, i ≥ 1.

The rule expansion word weight wRet is calculated as shown in equation (5):

wRet = max(Copulas_C(Lq → Let)), taken over the association rules whose consequent Let contains the expansion word Ret    (5)

In equation (5), max() takes the maximum confidence among the association rules: when the same rule expansion word appears in several association rule patterns at once, the maximum confidence is taken as its weight.
Step 4, perform word vector semantic learning training on the initially retrieved document set with a deep learning tool to obtain the word vector expansion word set. The specific steps are as follows:

(4.1) Perform word vector semantic learning training on the initial pseudo-relevance feedback document set with a deep learning tool to obtain the feature word vector set of the initially retrieved documents.

The deep learning tool is the Skip-gram model of Google's open-source word vector tool word2vec.
(4.2) In the feature word vector set of the initially retrieved documents, calculate the word vector cosine similarity CosSim(qi, cetj) between each query term qi (qi ∈ Q, where Q is the original query term set, Q = (q1, q2, …, qn), 1 ≤ i ≤ n) and every word vector candidate expansion word (cet1, cet2, …, cetm), as shown in equation (6), where 1 ≤ j ≤ m; the word vector candidate expansion words are the non-query terms in the word vector set.

CosSim(qi, cetj) = (vqi · vcetj) / (|vqi| × |vcetj|)    (6)

In equation (6), vcetj denotes the word vector of the jth word vector candidate expansion word cetj, and vqi denotes the word vector of the ith query term qi.

(4.3) Given a minimum vector cosine similarity threshold minSim, extract the candidate expansion words whose CosSim(qi, cetj) is not lower than minSim as the word vector expansion words (qiet1, qiet2, …, qietpi) of query term qi; pool the expansion words of q1, q2, …, qn and remove duplicates to obtain the final word vector expansion word set WEET of the original query term set Q; calculate the word vector expansion word weights wweet, and then move to step 5.

WEET is shown in equation (7):

WEET = (q1et1, …, q1etp1) ∪ (q2et1, …, q2etp2) ∪ … ∪ (qnet1, …, qnetpn), with duplicates removed    (7)

The word vector expansion word weight wweet is the vector cosine similarity between a query term and the word vector expansion word, as shown in equation (8); when a word is selected repeatedly, its weight equals the accumulated sum of the corresponding vector similarities:

wweet(et) = Σ CosSim(qi, et), summed over the query terms qi for which et is selected    (8)
Step 5, intersect the rule expansion word set with the word vector expansion word set to obtain the final expansion words, realizing the intersection expansion of word vectors and association modes. The specific steps are as follows:

(5.1) Intersect the rule expansion word set ARET with the word vector expansion word set WEET to obtain the final expansion word set FETS of the original query term set Q, realizing the intersection expansion of word vectors and association modes; calculate the final expansion word weights wFet, and then move to step 6.

FETS is shown in equation (9):

FETS = ARET ∩ WEET = {Fet1, Fet2, …, Fetn, …}    (9)

In equation (9), Fetn denotes the nth final expansion word.

The final expansion word weight wFet is the sum of the rule expansion word weight wRet and the word vector expansion word weight wweet, as shown in equation (10):

wFet = wRet + wweet    (10)
Step 6, combine the final expansion words with the original query terms into a new query, retrieve the document set again, obtain the final retrieval result, and return it to the user.
Experimental design and results:
We compare the method of the invention with existing similar methods to illustrate its effectiveness.
1. Experimental environment and experimental data:
The experimental corpus is the Chinese text standard corpus of NTCIR-5 CLIR (see http://research.nii.ac.jp/ntcir/data/data-en.html), comprising 901,446 traditional Chinese documents (converted to simplified Chinese for the experiments) distributed over 8 data sets, as shown in Table 1. The NTCIR-5 CLIR corpus provides 50 Chinese queries, 4 types of query topics, and result sets under 2 evaluation criteria, namely Rigid (highly relevant and relevant documents count as relevant) and Relax (highly relevant, relevant and partially relevant documents count as relevant). The retrieval experiments use the Description (Desc, a long query) and Title (a short query) topic types. The evaluation metric for the retrieval results is P@5.
Table 1. Original corpora and their document counts
[Table 1 appears as an image in the original publication.]
2. Baseline retrieval and comparison methods:

The basic retrieval environment for the experiments is built with Lucene.

The baseline retrieval is the result obtained by submitting the original queries to Lucene.

The comparison methods are described as follows:
Comparison method 1: query expansion that mines rule expansion words with a weighted frequent pattern mining technique using multiple minimum support thresholds (see: Zhang H R, Zhang J W, Wei X Y, et al. A new frequent pattern mining with weighted multiple minimum supports [J]. Intelligent Automation & Soft Computing, 2017, 23(4): 605-.). Parameters: the experimental results are the averages over ms = 0.1, 0.15, 0.2 and 0.25, with mc and the remaining parameters (LMS, HMS, WT) set to 0.1.
Comparison method 2: query expansion that mines rule expansion words with a fully weighted positive and negative association pattern mining technique (see: 黄名选, 蒋曹清. Vietnamese-English cross-language query post-translation expansion based on fully weighted positive and negative association pattern mining [J]. Acta Electronica Sinica, 2018, 46(12): 3029-3036.). Parameters: the experimental results are the averages over ms = 0.10, 0.11, 0.12 and 0.13, with mc = 0.1, α = 0.3, minPR = 0.1 and minNR = 0.01.
The Skip-gram word embedding semantic learning training parameters used by the invention are: batch_size = 128, embedding_size = 300, skip_window = 2, num_skips = 4, num_sampled = 64.
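These parameter names follow the TensorFlow word2vec tutorial implementation of Skip-gram; collected as a config for reference (the dict itself is only an illustrative container, with the values from the text):

```python
skipgram_config = {
    "batch_size": 128,      # training mini-batch size
    "embedding_size": 300,  # word vector dimension
    "skip_window": 2,       # context words considered on each side
    "num_skips": 4,         # (target, context) pairs drawn per window
    "num_sampled": 64,      # negative samples per batch
}
```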
3. The experimental methods and results are as follows:
The average values of the P@5 retrieval results obtained by the 50 Chinese queries on the 8 data sets of the NTCIR-5 CLIR corpus are shown in Tables 2 and 3, where the average amplification (%) is the overall average increase of the method's retrieval results over the baseline retrieval or a comparison expansion method across the 8 data sets. The average amplification is computed as follows: first, the increase of the method's retrieval result over the baseline retrieval or comparison expansion method is computed on each data set; the per-dataset increases are then summed and divided by 8, giving the overall average amplification of the method's retrieval results relative to the other method.
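As arithmetic, the average amplification is a mean of per-dataset relative improvements; a sketch with placeholder P@5 values:

```python
def average_amplification(ours, baseline):
    """Total average amplification over the 8 NTCIR-5 CLIR data sets:
    per-dataset percentage gain of our P@5 over the other method's,
    summed and divided by the number of data sets."""
    gains = [(a - b) / b * 100.0 for a, b in zip(ours, baseline)]
    return sum(gains) / len(gains)

# Placeholder P@5 values (not the paper's numbers), for 8 data sets:
print(average_amplification([0.44] * 8, [0.36] * 8))   # ~22.22
```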
Table 2. P@5 retrieval performance of the inventive method versus the baseline retrieval and comparison methods (Title queries)
[Table 2 appears as an image in the original publication.]

Table 3. P@5 retrieval performance of the inventive method versus the baseline retrieval and comparison methods (Desc queries)
[Table 3 appears as an image in the original publication.]
Tables 2 and 3 show that, compared with the baseline retrieval, the P@5 retrieval results of the method of the invention improve markedly, with a maximum average amplification of 20.44%, and most of its P@5 values exceed those of the comparison methods; the expansion retrieval performance of the method is therefore higher than that of the baseline retrieval and the similar comparison methods. The experimental results show that the method is effective, can genuinely improve information retrieval performance, and has high application value and broad promotion prospects.

Claims (1)

1. A text retrieval method based on intersection expansion of word vectors and association modes is characterized by comprising the following steps:
step 1, a Chinese user query retrieves the Chinese document set to obtain the initially retrieved documents, from which an initially retrieved document set is constructed;

step 2, extract the top m documents from the initially retrieved document set to construct an initial pseudo-relevance feedback document set, perform Chinese word segmentation and stopword removal on it, extract the Chinese feature words and calculate the feature word weights with the TF-IDF weighting technique, and finally construct the pseudo-relevance feedback Chinese document library and the Chinese feature word library;
step 3, mine rule expansion words in the initial pseudo-relevance feedback document set using the Copulas-theory-based support and confidence, and construct the rule expansion word set, the specific steps being as follows:

(3.1) mine the 1_candidate itemsets C1: extract feature words directly from the Chinese feature word library as the 1_candidate itemsets C1;

(3.2) mine the 1_frequent itemsets L1: calculate the Copulas-function-based support Copulas_S(C1) of each C1, take each C1 whose Copulas_S(C1) is not lower than the minimum support threshold ms as a 1_frequent itemset L1, and add it to the frequent itemset set FI;

Copulas_S(C1) is calculated as shown in equation (1):

Copulas_S(C1) = C(Frequency[C1]/SumCount, Weight[C1]/SumWeight)    (1)

where C(u, v) denotes the Copulas coupling function; in equation (1), Frequency[C1] represents the frequency with which the 1_candidate itemset C1 occurs in the pseudo-relevance feedback Chinese document library, SumCount represents the total number of documents in that library, Weight[C1] represents the itemset weight of C1 in that library, and SumWeight represents the accumulated weight of all Chinese feature words in that library;
(3.3) generate the k_candidate itemsets Ck, k ≥ 2: derive the k_candidate itemsets Ck by self-joining the (k-1)_frequent itemsets Lk-1; the self-join adopts the candidate itemset join method given in the Apriori algorithm;

(3.4) prune the 2_candidate itemsets C2: if a C2 does not contain any original query term, delete it; if it contains an original query term, keep it and proceed with it to step (3.5);

(3.5) mine the k_frequent itemsets Lk, k ≥ 2: calculate the Copulas-function-based support Copulas_S(Ck) of each Ck, take each Ck whose Copulas_S(Ck) is not lower than the minimum support threshold ms as a k_frequent itemset Lk, and add it to FI;

Copulas_S(Ck) is calculated as shown in equation (2):

Copulas_S(Ck) = C(Frequency[Ck]/SumCount, Weight[Ck]/SumWeight)    (2)

in equation (2), Frequency[Ck] represents the frequency with which Ck occurs in the pseudo-relevance feedback Chinese document library, Weight[Ck] represents the itemset weight of Ck in that library, and SumCount and SumWeight are defined as in equation (1);
(3.6) increase k by 1 and return to step (3.3) to continue the subsequent steps, until Lk is an empty set; frequent itemset mining then ends, and the procedure moves to step (3.7);

(3.7) take any k_frequent itemset Lk, k ≥ 2, out of FI;

(3.8) extract proper subset itemsets Let and Lq of Lk such that Lq ∩ Let = ∅, Lq ∪ Let = Lk, and Let ∩ Q = ∅, where Let is a proper subset itemset containing no query terms, Lq is a proper subset itemset containing query terms, and Q is the original query term set;
(3.9) mine the association rules Lq → Let: calculate the Copulas-function-based confidence Copulas_C(Lq → Let) of each rule, and add each association rule Lq → Let whose Copulas_C(Lq → Let) is not lower than the minimum confidence threshold mc to the association rule set AR;

Copulas_C(Lq → Let) is calculated as shown in equation (3):

Copulas_C(Lq → Let) = Copulas_S(Lk) / Copulas_S(Lq)    (3)

in equation (3), Frequency[Lq] represents the frequency with which the proper subset itemset Lq occurs in the pseudo-relevance feedback Chinese document library, Weight[Lq] represents the itemset weight of Lq in that library, Frequency[Lk] represents the frequency with which the itemset Lk occurs in that library, and Weight[Lk] represents the itemset weight of Lk in that library; SumCount and SumWeight are defined as in equation (1);

(3.10) add each rule whose Copulas_C(Lq → Let) is not lower than the minimum confidence threshold mc to the association rule set AR, then return to step (3.8) to extract other proper subset itemsets Let and Lq from Lk and carry out the subsequent steps in turn, looping until each proper subset itemset of Lk has been taken out exactly once; then go to step (3.7) for a new round of association rule pattern mining, taking any other Lk out of FI and carrying out the subsequent steps in turn, looping until every k_frequent itemset Lk in FI has been taken out exactly once; association rule pattern mining then ends, and the procedure moves to step (3.11);
(3.11) extract the rule consequent itemsets Let from the association rule set AR, where Let = (Ret1, Ret2, …, Rets), s ≥ 1; extract the rule expansion words from the itemsets Let and remove duplicates to obtain the rule expansion word set ARET; calculate the rule expansion word weights wRet, and then move to step 4;

ARET is shown in equation (4):

ARET = {Ret1, Ret2, …, Reti, …}    (4)

in equation (4), Reti denotes the ith rule expansion word after duplicate removal, i ≥ 1;

the rule expansion word weight wRet is calculated as shown in equation (5):

wRet = max(Copulas_C(Lq → Let)), taken over the association rules whose consequent Let contains the expansion word Ret    (5)

in equation (5), max() takes the maximum confidence among the association rules: when the same rule expansion word appears in several association rule patterns at once, the maximum confidence is taken as its weight;
step 4, perform word vector semantic learning training on the initially retrieved document set with a deep learning tool to obtain the word vector expansion word set, the specific steps being as follows:

(4.1) perform word vector semantic learning training on the initial pseudo-relevance feedback document set with a deep learning tool to obtain the feature word vector set of the initially retrieved documents;

the deep learning tool is the Skip-gram model of Google's open-source word vector tool word2vec;
(4.2) in the feature word vector set of the initially retrieved documents, calculate the word vector cosine similarity CosSim(qi, cetj) between each query term qi (qi ∈ Q, where Q is the original query term set, Q = (q1, q2, …, qn), 1 ≤ i ≤ n) and every word vector candidate expansion word (cet1, cet2, …, cetm), as shown in equation (6), where 1 ≤ j ≤ m; the word vector candidate expansion words are the non-query terms in the word vector set;

CosSim(qi, cetj) = (vqi · vcetj) / (|vqi| × |vcetj|)    (6)

in equation (6), vcetj denotes the word vector of the jth word vector candidate expansion word cetj, and vqi denotes the word vector of the ith query term qi;

(4.3) given a minimum vector cosine similarity threshold minSim, extract the candidate expansion words whose CosSim(qi, cetj) is not lower than minSim as the word vector expansion words (qiet1, qiet2, …, qietpi) of query term qi; pool the expansion words of q1, q2, …, qn and remove duplicates to obtain the final word vector expansion word set WEET of the original query term set Q; calculate the word vector expansion word weights wweet, and then move to step 5;

WEET is shown in equation (7):

WEET = (q1et1, …, q1etp1) ∪ (q2et1, …, q2etp2) ∪ … ∪ (qnet1, …, qnetpn), with duplicates removed    (7)

the word vector expansion word weight wweet is the vector cosine similarity between a query term and the word vector expansion word, as shown in equation (8); when a word is selected repeatedly, its weight equals the accumulated sum of the corresponding vector similarities:

wweet(et) = Σ CosSim(qi, et), summed over the query terms qi for which et is selected    (8)
step 5, intersect the rule expansion word set with the word vector expansion word set to obtain the final expansion words, realizing the intersection expansion of word vectors and association modes, the specific steps being as follows:

(5.1) intersect the rule expansion word set ARET with the word vector expansion word set WEET to obtain the final expansion word set FETS of the original query term set Q, realizing the intersection expansion of word vectors and association modes; calculate the final expansion word weights wFet, and then move to step 6;

FETS is shown in equation (9):

FETS = ARET ∩ WEET = {Fet1, Fet2, …, Fetn, …}    (9)

in equation (9), Fetn denotes the nth final expansion word;

the final expansion word weight wFet is the sum of the rule expansion word weight wRet and the word vector expansion word weight wweet, as shown in equation (10):

wFet = wRet + wweet    (10)
step 6, combine the final expansion words with the original query terms into a new query, retrieve the document set again, obtain the final retrieval result, and return it to the user.
CN202010774137.8A 2020-08-04 2020-08-04 Text retrieval method based on intersection expansion of word vector and association mode Withdrawn CN111897923A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010774137.8A CN111897923A (en) 2020-08-04 2020-08-04 Text retrieval method based on intersection expansion of word vector and association mode

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010774137.8A CN111897923A (en) 2020-08-04 2020-08-04 Text retrieval method based on intersection expansion of word vector and association mode

Publications (1)

Publication Number Publication Date
CN111897923A 2020-11-06

Family

ID=73245599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010774137.8A Withdrawn CN111897923A (en) 2020-08-04 2020-08-04 Text retrieval method based on intersection expansion of word vector and association mode

Country Status (1)

Country Link
CN (1) CN111897923A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
WW01: Invention patent application withdrawn after publication (application publication date: 20201106)