CN103593425A - Preference-based intelligent retrieval method and system - Google Patents

Publication number
CN103593425A
Authority
CN
China
Prior art keywords
user
retrieval
subject matter
result
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310549069.5A
Other languages
Chinese (zh)
Other versions
CN103593425B (en)
Inventor
李鹏
周育忠
王庆红
龚婷
陈传夫
王平
冉从敬
吴江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
CSG Electric Power Research Institute
Research Institute of Southern Power Grid Co Ltd
Original Assignee
Wuhan University WHU
Research Institute of Southern Power Grid Co Ltd
Application filed by Wuhan University WHU, Research Institute of Southern Power Grid Co Ltd
Priority to CN201310549069.5A
Publication of CN103593425A
Application granted
Publication of CN103593425B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of data retrieval and discloses a preference-based intelligent retrieval method and system. In the method, a subject preference model of a user is established on the basis of the subject classification of data, user characteristics and operation logs; query expansion is performed using the user's subject preference model and the user's retrieval input to obtain a primary retrieval result; the data are scored for subject preference using the user's subject preference model and the distribution of the data over the subjects, and the primary retrieval result is ranked in a personalized, subject-preference-based manner; and secondary feedback retrieval is performed on the ranked primary retrieval result using a combined model of relevance feedback and pseudo relevance feedback to obtain a final retrieval result. The method and system determine the subject distribution of data resources through subject indexing, build retrieval vectors that better represent user needs through subject-based query expansion, relevance feedback and related techniques, and provide the user with retrieval results that better meet the user's potential needs.

Description

Preference-based intelligent retrieval method and system
Technical field
The present invention relates to the field of data retrieval, and in particular to a preference-based intelligent retrieval method and system.
Background art
With the steady rise of social informatization and the rapid development of IT equipment, the volume of stored information is growing exponentially. At the same time, people demand more of information acquisition, and using retrieval technology to quickly find the useful information required is becoming increasingly difficult. Traditional search engines retrieve by keyword; even when several keywords are combined, a query against the massive volume of network information still returns millions of results, and finding the most needed information among them is a heavy burden for the user. The most critical problem in data retrieval today is therefore how to find, among the retrieval results, the information the user needs most.
In the prior art, a search engine or data retrieval system ranks the retrieval results using certain statistics, so that results with higher relevance are presented to the user first. Such statistics mainly include keyword frequency, matching degree and click-through rate. They are computed over the definite content of the data itself; although the processing volume is large, the content is well defined and the approach is relatively easy to implement. Some more advanced systems optimize further, for example by classifying the data or by expanding the keywords using statistical features of text semantics, so that the top-ranked results are as relevant as possible to the query keywords. However, these approaches rely mainly on a single query request submitted by the user (a combination of keywords, time, retrieval scope and similar requirements) and on the textual information of the subject terms in the data. Both sources carry limited usable content, and the information in the data itself cannot reflect differences between users. Even after such optimization, the retrieval results can hardly reflect the differing needs of different users, so the retrieval efficiency, precision and user satisfaction of existing approaches fall short of the ideal.
Summary of the invention
In view of the above defects in the prior art, the technical problem to be solved by the present invention is how to optimize retrieval for the differences between users.
To solve the above technical problem, in one aspect the present invention provides a preference-based intelligent retrieval method, comprising the steps of:
S1, establishing a subject preference model of the user based on the subject classification of the data, user characteristics and operation logs;
S2, performing query expansion using the user's subject preference model and the user's retrieval input, to obtain a primary retrieval result;
S3, scoring the data for subject preference using the user's subject preference model and the distribution of the data over the subjects, and performing personalized, subject-preference-based ranking of the primary retrieval result;
S4, performing secondary feedback retrieval on the ranked primary retrieval result using a combined model of relevance feedback and pseudo relevance feedback, to obtain a final retrieval result.
Preferably, in step S1, establishing the user's subject preference model comprises the steps of:
establishing a subject vector space according to the subject classification;
determining a predefined subject preference vector of the user according to the user characteristics;
determining a historical subject preference vector of the user according to the operation logs;
weighting the predefined subject preference vector and the historical subject preference vector to obtain the user's subject preference model.
Preferably, in step S2, performing the query expansion comprises the steps of:
computing the probability distribution of the terms in the document set corresponding to the query terms of the user's retrieval input;
computing, for each subject term in the vector space of the user's subject preference model, the probability distribution of the terms in the document set corresponding to that subject term;
measuring the mutual difference between the two probability distributions, selecting the subject terms with the smaller difference, and adding them to the retrieval vector with a certain weight.
Preferably, in step S3, the personalized retrieval ranking comprises the steps of:
computing the vector similarity between each result in the primary retrieval result and the user's subject preference model, to score each result on the subjects the user prefers;
computing a quality score for each result;
obtaining a final ranking score for each result by weighting the vector similarity, the score on the user-preferred subjects and the quality score, and ranking the results in the primary retrieval result by the final ranking score.
Preferably, in step S4, the secondary feedback retrieval comprises the steps of:
determining the vector set of relevant results in the primary retrieval result using relevance feedback;
determining the vector set of irrelevant results in the primary retrieval result using pseudo relevance feedback;
performing a feedback query with the combination of the user's subject preference model, the vector set of relevant results, the vector set of irrelevant results and the original query vector.
In another aspect, the present invention also provides a preference-based intelligent retrieval system, comprising:
a user subject preference identification module, configured to establish a subject preference model of the user based on the subject classification of the data, user characteristics and operation logs;
a query expansion module, configured to perform query expansion using the user's subject preference model and the user's retrieval input, to obtain a primary retrieval result;
a retrieval ranking module, configured to score the data for subject preference using the user's subject preference model and the distribution of the data over the subjects, and to perform personalized, subject-preference-based ranking of the primary retrieval result;
a feedback retrieval module, configured to perform secondary feedback retrieval on the ranked primary retrieval result using a combined model of relevance feedback and pseudo relevance feedback, to obtain a final retrieval result.
Preferably, the user subject preference identification module further comprises:
a subject vector space module, configured to establish a subject vector space according to the subject classification;
a predefined preference module, configured to determine a predefined subject preference vector of the user according to the user characteristics;
a historical preference module, configured to determine a historical subject preference vector of the user according to the operation logs;
a preference model acquisition module, configured to weight the predefined subject preference vector and the historical subject preference vector to obtain the user's subject preference model.
Preferably, the query expansion module further comprises:
a query term distribution module, configured to compute the probability distribution of the terms in the document set corresponding to the query terms of the user's retrieval input;
a subject term distribution module, configured to compute, for each subject term in the vector space of the user's subject preference model, the probability distribution of the terms in the document set corresponding to that subject term;
an expansion module, configured to measure the mutual difference between the two probability distributions, select the subject terms with the smaller difference, and add them to the retrieval vector with a certain weight.
Preferably, the retrieval ranking module further comprises:
a subject scoring module, configured to compute the vector similarity between each result in the primary retrieval result and the user's subject preference model, to score each result on the subjects the user prefers;
a quality scoring module, configured to compute a quality score for each result;
a ranking module, configured to obtain a final ranking score for each result by weighting the vector similarity, the score on the user-preferred subjects and the quality score, and to rank the results in the primary retrieval result by the final ranking score.
Preferably, the feedback retrieval module further comprises:
a relevance feedback module, configured to determine the vector set of relevant results in the primary retrieval result using relevance feedback;
a pseudo relevance feedback module, configured to determine the vector set of irrelevant results in the primary retrieval result using pseudo relevance feedback;
a feedback module, configured to perform a feedback query with the combination of the user's subject preference model, the vector set of relevant results, the vector set of irrelevant results and the original query vector.
The present invention provides a preference-based intelligent retrieval method and system that determine the subject distribution of data resources through subject indexing, build retrieval vectors that better represent user needs through subject-based query expansion, relevance feedback and related techniques, and then, through an intelligent ranking model incorporating the user's subject preference, provide the user with retrieval results that better meet the user's potential needs. The algorithm and system of the invention can identify the user's latent information needs, described in terms of a professional thesaurus, and thereby achieve better retrieval results.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the preference-based intelligent retrieval method in one embodiment of the present invention;
Fig. 2 is a schematic flowchart of the subject-based query expansion algorithm in a preferred embodiment of the present invention;
Fig. 3 is a schematic flowchart of the relevance feedback algorithm combined with subjects in a preferred embodiment of the present invention;
Fig. 4 is a schematic diagram of the module structure of the preference-based intelligent retrieval system in a typical application scenario of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are preferred modes of carrying out the invention; the description is intended to illustrate the general principles of the invention and not to limit its scope. The scope of protection of the present invention is defined by the claims, and all other embodiments obtained by those of ordinary skill in the art from the embodiments herein without creative effort fall within that scope.
The prior art optimizes mainly the data being retrieved; at best it classifies and expands the retrieved data precisely and then matches them against the subject terms in a single query request submitted by the user. Although this greatly improves retrieval precision, it does not reflect differences between users: identical query requests produce identical retrieval results, which is clearly at odds with the fact that different users have different needs.
In the embodiments of the present invention, the user's potential needs are obtained by observing and analyzing the user's retrieval behavior over a relatively long period. User needs are combined with data classification, and explicit and implicit relevance feedback are integrated into the optimization of information retrieval, so that the differences in users' needs are reflected accurately and the overall efficiency and precision of data retrieval are improved.
Referring to Fig. 1, in one embodiment of the present invention, the preference-based intelligent retrieval method comprises the steps of:
S1, establishing a subject preference model of the user based on the subject classification of the data, user characteristics and operation logs;
S2, performing query expansion using the user's subject preference model and the user's retrieval input, to obtain a primary retrieval result;
S3, scoring the data for subject preference using the user's subject preference model and the distribution of the data over the subjects, and performing personalized, subject-preference-based ranking of the primary retrieval result;
S4, performing secondary feedback retrieval on the ranked primary retrieval result using a combined model of relevance feedback and pseudo relevance feedback, to obtain a final retrieval result.
The various preferred forms of the above embodiment are explained further below. In the following preferred embodiments, to bring out the technical principles and practical effects of the invention, the data being retrieved are limited to scientific and technical information. Those skilled in the art will understand, however, that scientific and technical information is only one specific category of data; the technical solution of the present invention can be applied directly to all kinds of digital information, and the following preferred embodiments should not be regarded as limiting the invention.
Users have latent subject needs when acquiring data resources. Taking scientific literature as an example, users in different fields have markedly different needs for the same keyword, which makes this implicit subject need more pronounced. In a preferred embodiment of the present invention, step S1 uses a subject-term category table to map user needs and to discover the user's preferences over the classification of document resources, providing a sound basis for intelligent retrieval. Subject preference is considered mainly from the following two aspects:
1. Predefinition of the user's subject preference
Different users have different characteristics, many of which reflect the user's potential needs. Some of the user's subject preferences can therefore be predefined from user characteristics (such as the user's region, function, or scope of job responsibilities). For example, a user in a high-voltage testing post in the power industry has specific needs for documents on power transformers, circuit breakers, instrument transformers and the like. Subject terms can thus be extracted from the documents of that post and, together with the job-responsibility subject terms, mapped onto standard subject categories to serve as the predefinition of the user's need preference. More preferably, step S1 represents the user's subject preference with a vector space model:
First, the distribution of subjects is analyzed and an $n$-dimensional subject vector space $[(k_1, w_1), (k_2, w_2), \ldots, (k_n, w_n)]$ is established, where $k_i$ is the $i$-th subject and $w_i$ is the user's degree of preference for $k_i$, $i \in \{1, 2, \ldots, n\}$.
Then, subject terms are extracted from the user characteristics (such as job-responsibility subject terms and post documents), the frequency of each subject term is counted, and its probability distribution is computed as
$$p_{sub_i} = \frac{freq_{sub_i}}{freq_{sub\_total}}$$
where $freq_{sub_i}$ is the term frequency of subject term $sub_i$ and $freq_{sub\_total}$ is the total term frequency of the set of subject terms.
Finally, $p_{sub_i}$, after a certain systematic adjustment, is used to characterize the user's degree of preference for each subject term $sub_i$, giving the predefined user subject preference vector $W_{pre} = (w'_1, w'_2, \ldots, w'_n)$, where $w'_i$, $i = 1, 2, \ldots, n$, denotes the user's predefined degree of preference for subject $k_i$.
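As a rough illustration of the predefinition step, the sketch below counts subject-term frequencies and normalizes them into $p_{sub_i}$. The function name `predefined_preference` and the sample terms are hypothetical, and because the text does not specify the "systematic adjustment", the sketch simply assumes $w'_i = p_{sub_i}$.

```python
from collections import Counter

def predefined_preference(subject_terms, subjects):
    """Return W_pre as the relative frequency of each subject term.

    subject_terms: subject terms extracted from user characteristics
                   (hypothetical input, e.g. from post documents).
    subjects:      ordered list of subjects spanning the vector space.
    Assumption: no further adjustment is applied, so w'_i = p_sub_i.
    """
    counts = Counter(t for t in subject_terms if t in subjects)
    total = sum(counts.values())  # freq_sub_total
    if total == 0:
        return [0.0] * len(subjects)
    return [counts[s] / total for s in subjects]

subjects = ["power transformer", "circuit breaker", "instrument transformer"]
terms = ["power transformer", "circuit breaker",
         "power transformer", "instrument transformer"]
w_pre = predefined_preference(terms, subjects)
print(w_pre)
```

Keeping `subjects` as an ordered list fixes the index of each dimension, so the resulting vector can later be combined with other vectors over the same subject space.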
2. Discovering the user's subject preference from user operation logs
The user's retrieval behavior is only one part of the user's overall information-acquisition behavior; related behavior includes clicking, downloading and bookmarking documents in the system, and all of these operations are recorded in the system logs. The user's subject preference can therefore be mined from the large volume of operation-log information, providing a basis for intelligent retrieval. Step S1 of the above method also establishes a complete operation-log collection mechanism and uses the operation logs to determine the user's subject preference.
Specifically, the logs are collected and analyzed to obtain the set of documents the user has operated on, $D_{op} = \{d_{op_1}, d_{op_2}, \ldots, d_{op_N}\}$. For each $d_i \in D_{op}$, the frequencies of the user's click, download, bookmark and similar operations on $d_i$ are counted, a weight is assigned to each type of operation, and the weighted sum gives the user's access frequency for $d_i$. From the subject indexing of the documents, the distribution of $d_i$ over the subject terms is obtained; combined with the access frequency of $d_i$, this yields the user's access frequency for each subject term. This is taken as the user's degree of subject preference and mapped into the subject vector space, giving the user's historical subject preference vector $W_{op} = (w_1, w_2, \ldots, w_n)$.
Finally, the two subject preferences are weighted to determine the user's subject preference $W = \alpha_1 W_{pre} + \alpha_2 W_{op}$, where $\alpha_1$ and $\alpha_2$ are the weights of the two vectors, preset or adjusted according to the desired emphasis. Note that the preference obtained from log analysis changes over time and must be updated as the logs are updated.
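The log-based preference and the final weighting can be sketched as follows. All names, the operation weights and the sample log are hypothetical. Two points are assumptions of the sketch, not statements of the text: a document's access frequency is credited in full to every subject it is indexed under, and $W_{op}$ is normalized to sum to 1.

```python
def log_preference(op_counts, op_weights, doc_subjects, n_subjects):
    """Build W_op from operation logs.

    op_counts:    {doc_id: {operation: count}} taken from the logs.
    op_weights:   {operation: weight}, e.g. a download counts more than a click.
    doc_subjects: {doc_id: [subject indices]} from subject indexing.
    """
    w_op = [0.0] * n_subjects
    for doc_id, ops in op_counts.items():
        # Weighted access frequency of this document.
        freq = sum(op_weights.get(op, 0.0) * c for op, c in ops.items())
        for k in doc_subjects.get(doc_id, []):
            w_op[k] += freq
    total = sum(w_op)
    return [w / total for w in w_op] if total else w_op

def combine(w_pre, w_op, alpha1, alpha2):
    """W = alpha1 * W_pre + alpha2 * W_op, component by component."""
    return [alpha1 * a + alpha2 * b for a, b in zip(w_pre, w_op)]

op_weights = {"click": 1.0, "download": 2.0, "bookmark": 3.0}
op_counts = {"d1": {"click": 2, "download": 1}, "d2": {"bookmark": 1}}
doc_subjects = {"d1": [0], "d2": [1, 2]}

w_op = log_preference(op_counts, op_weights, doc_subjects, 3)
w = combine([0.5, 0.25, 0.25], w_op, alpha1=0.5, alpha2=0.5)
print(w_op, w)
```

Because the log-derived preference drifts over time, `log_preference` would be recomputed whenever the logs are refreshed, while `w_pre` stays fixed until the user's characteristics change.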
A query request is the direct expression of the user's query need and likewise contains a latent subject need. This subject need reflects, to some extent, the user's abstraction and summary of the required documents and so better reflects the user's needs. Subject terms also serve as labels of document resources: they reflect the core content and classification of a document and express its essence well. With both aspects in mind, step S2 of the present invention selects subject terms for query expansion, which considerably improves the retrieval effect; the algorithm flow is shown in Fig. 2.
If the user's retrieval input is itself a standard subject term, related subject terms can be found through relations such as broader and narrower terms in the subject category table and used for query expansion. Often, however, there is no explicit association between the query request entered by the user and the latent subject need; in that case an association can be established through the historically retrieved documents and the subject-indexed documents. As shown in Fig. 2, the basic idea is as follows:
Let the set of documents corresponding to the user's retrieval request Q be $D_{query} = \{d_{q_1}, d_{q_2}, \ldots, d_{q_N}\}$. Segmenting each document in $D_{query}$ into words yields a set of terms, denoted $T_{query} = \{t_{q_1}, t_{q_2}, \ldots, t_{q_N}\}$. For each $t_{q_i} \in T_{query}$, the probability
$$p_{qt_i} = \frac{freq_{t_{q_i}}}{freq_{total}}$$
is computed, giving the probability distribution of the term set $T_{query}$ corresponding to $D_{query}$, denoted $F_{query} = (p_{qt_1}, p_{qt_2}, \ldots, p_{qt_N})$, where $freq_{t_{q_i}}$ is the term frequency of $t_{q_i}$ and $freq_{total}$ is the sum of the term frequencies of the terms in $T_{query}$.
For a subject term in the subject vector space, the subject indexing of the documents likewise yields a set of documents, denoted $D_{subject} = \{d_{s_1}, d_{s_2}, \ldots, d_{s_N}\}$. Similarly, a term set is obtained from this document set, and computing the corresponding term frequencies gives the probability distribution of the term set corresponding to $D_{subject}$, denoted $F_{subject} = (p_{st_1}, p_{st_2}, \ldots, p_{st_N})$.
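A minimal sketch of how such a term distribution might be computed is given below. The function name `term_distribution` and the sample documents are hypothetical, and whitespace splitting stands in for the word segmentation a real system would need for Chinese text. A shared `vocabulary` is assumed so that the query-side and subject-side distributions use the same indices.

```python
from collections import Counter

def term_distribution(docs, vocabulary):
    """Relative term frequencies over a document set.

    docs:       list of document texts (D_query or D_subject).
    vocabulary: ordered list of terms; a shared vocabulary is an
                assumption that keeps both distributions aligned.
    """
    # Whitespace tokenization is a stand-in for proper word segmentation.
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts[t] for t in vocabulary)  # freq_total
    return [counts[t] / total if total else 0.0 for t in vocabulary]

vocabulary = ["transformer", "fault", "oil"]
d_query = ["transformer fault diagnosis", "transformer oil"]
f_query = term_distribution(d_query, vocabulary)
print(f_query)
```

Terms outside the vocabulary (here "diagnosis") are ignored, so the returned values always sum to 1 when at least one vocabulary term occurs.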
Once these two probability distributions have been obtained, the subject terms most relevant to the query terms can be found by computing the similarity of the distributions, and then used for subject-term query expansion.
To compute the similarity between the probability distributions of the two document sets corresponding to the query terms and a subject term, the Kullback-Leibler divergence (KL divergence, also called relative entropy) is preferably used.
$D_{KL}(F_{subject} \| F_{query})$ then gives the difference of the distribution $F_{subject}$ relative to $F_{query}$, and the subject terms with the smaller differences are used to build the query expansion:
$$D_{KL}(F_{subject} \| F_{query}) = \sum_i p_{st_i} \log \frac{p_{st_i}}{p_{qt_i}}$$
To obtain a better query expansion effect, the distributions of query requests and subject terms over the document vectors of the system were studied further and the above computation was optimized accordingly: the Jensen-Shannon divergence is chosen as a smoothed measure, and $D_{JS}(F_{subject} \| F_{query})$ is computed to weigh the mutual difference between $F_{subject}$ and $F_{query}$:
$$D_{JS}(F_{subject} \| F_{query}) = \frac{1}{2} D_{KL}(F_{subject} \| R) + \frac{1}{2} D_{KL}(F_{query} \| R), \quad \text{where } R = \frac{1}{2}(F_{subject} + F_{query})$$
After the subject terms with smaller distribution differences are selected, they are added to the retrieval vector with a certain weight to build the expanded query vector and improve retrieval efficiency.
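The two divergences and the selection of expansion terms can be sketched as below. The helper names (`kl`, `js`, `pick_expansion_terms`), the `top_k` cut-off and the sample distributions are hypothetical; the text only says that the subject terms with the smaller difference are chosen.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) with natural logarithm.

    Terms with p_i == 0 contribute nothing. A zero q_i where p_i > 0
    would raise ZeroDivisionError, which is one reason the smoothed
    Jensen-Shannon form is preferred.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: both inputs are compared with their mean R."""
    r = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, r) + 0.5 * kl(q, r)

def pick_expansion_terms(f_query, subject_dists, top_k=1):
    """Return the top_k subject terms whose distribution is closest to f_query."""
    ranked = sorted(subject_dists, key=lambda s: js(subject_dists[s], f_query))
    return ranked[:top_k]

f_query = [0.5, 0.25, 0.25]
subject_dists = {
    "transformer": [0.5, 0.3, 0.2],
    "relay": [0.0, 0.1, 0.9],
}
chosen = pick_expansion_terms(f_query, subject_dists)
print(chosen)
```

Unlike the KL divergence, the Jensen-Shannon divergence is symmetric and always finite, because the mean distribution $R$ is non-zero wherever either input is non-zero.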
In step S3 of the above method, the core of personalized retrieval ranking is to weight the ranking by the user's subject preference on top of document relevance ranking. The user's subject preference vector $W$ is obtained from the user's subject preference model. For the set of retrieved documents, the subject distribution vector of each document, $V = (v_1, v_2, \ldots, v_n)$, is obtained from the subject indexing of the documents. The vector similarity $sim(V, W)$ of $W$ and $V$ is then computed to score each retrieved document on the subjects the user prefers: the higher the value of $sim(V, W)$, the higher the document's preference score, where
$$sim(V, W) = \frac{\sum_{k=1}^{n} v_k \times w_k}{\sqrt{\left(\sum_{k=1}^{n} v_k^2\right)\left(\sum_{k=1}^{n} w_k^2\right)}}$$
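The similarity above can be computed with a few lines of code. The sketch assumes the standard cosine form and returns 0 for a zero vector, a convention the text does not address; the function name is hypothetical.

```python
import math

def cosine_similarity(v, w):
    """Cosine similarity between a document subject vector V and preference W."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v) * sum(b * b for b in w))
    # Assumption: a zero vector has similarity 0 with everything.
    return dot / norm if norm else 0.0

w = [0.45, 0.275, 0.275]   # user's subject preference vector
v = [1.0, 0.0, 0.0]        # a document indexed under the first subject only
print(cosine_similarity(v, w))
```

Because the measure depends only on direction, a document indexed under many subjects is not favored merely for having larger vector components.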
Besides the weighting by user subject preference, document quality is another important weighting indicator. Many factors bear on the evaluation of document quality; here documents are scored on four factors: the citation factor of the paper, the download frequency, the rank of the publishing journal, and whether the document is a self-built resource. The self-built resource factor reflects that the organization collects document resources in two ways, by purchasing commercial resources and by gathering resources itself; resources gathered in-house by discipline have been manually reviewed and are therefore of higher quality. The weight of each factor is given in Table 1.
Factor    f_citation    f_download    f_journal    f_self-built
Weight    0.5           0.1           0.2          0.2
Table 1. Weights of the document quality evaluation factors
The score of each factor for a document is obtained by normalizing the relevant fields in the document metadata. After weighting, the quality score of the document is $G_{factor} = 0.45 f_{citation} + 0.15 f_{download} + 0.2 f_{journal} + 0.2 f_{self-built}$.
The final ranking score of a retrieved document is computed by weighting the above two scores together with the document-query similarity score: $G_{sort} = \beta_1 G_{query} + \beta_2 \, sim(V, W) + \beta_3 G_{factor}$, where $G_{query}$ is the score returned by Lucene for the specific user query, and $\beta_1$, $\beta_2$ and $\beta_3$ are the weights of the respective scores. Different weight settings are considered in the computation and are determined by system usage and the distribution of documents.
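The ranking computation might be sketched as follows. The function names, the dictionary layout of a result and the sample $\beta$ values are hypothetical. The factor weights are left as a parameter: the default follows the coefficients written in the $G_{factor}$ formula, while Table 1 lists 0.5 and 0.1 for the first two factors.

```python
def quality_score(factors, weights=(0.45, 0.15, 0.2, 0.2)):
    """G_factor: weighted sum of the normalized quality factors.

    factors: (f_citation, f_download, f_journal, f_self_built), each in [0, 1].
    weights: defaults to the coefficients of the G_factor formula; pass
             (0.5, 0.1, 0.2, 0.2) to use the values listed in Table 1.
    """
    return sum(w * f for w, f in zip(weights, factors))

def final_score(g_query, sim_vw, g_factor, betas):
    """G_sort = beta1 * G_query + beta2 * sim(V, W) + beta3 * G_factor."""
    b1, b2, b3 = betas
    return b1 * g_query + b2 * sim_vw + b3 * g_factor

def rank(results, betas):
    """Sort results by descending final ranking score."""
    return sorted(
        results,
        key=lambda r: final_score(r["g_query"], r["sim"],
                                  quality_score(r["factors"]), betas),
        reverse=True,
    )

results = [
    {"id": "a", "g_query": 0.9, "sim": 0.1, "factors": (0.2, 0.2, 0.2, 0.2)},
    {"id": "b", "g_query": 0.6, "sim": 0.9, "factors": (1.0, 1.0, 1.0, 1.0)},
]
ranked = rank(results, betas=(0.5, 0.3, 0.2))
print([r["id"] for r in ranked])
```

In this example document "b" overtakes "a" despite a lower query score, because it matches the user's preferred subjects and scores well on quality. `g_query` is assumed to be normalized to a range comparable with the other two scores before weighting.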
Relevance feedback, as a supplement to the retrieval request, can effectively improve retrieval accuracy. In step S4 of the above method, relevance feedback and pseudo relevance feedback are combined, and subject classification and the user's operation logs are used to delimit the sets of relevant and irrelevant documents effectively, so that the feedback achieves a better effect. The relevance feedback algorithm flow is shown in Fig. 3:
After a primary retrieval, the user marks the retrieval results for relevance. From the user's marks, the relevant document vector set $D_r$ and the irrelevant document vector set $D_{nr}$ are established. With the relevant and irrelevant documents available, a relevance feedback retrieval vector can be built following the idea of the Rocchio algorithm:
$$\vec{q}_m = \gamma_1 \vec{q}_0 + \gamma_2 \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma_3 \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$$
where $\vec{q}_0$ is the original query vector, $D_r$ and $D_{nr}$ are the known relevant and irrelevant document sets, and $\gamma_1$, $\gamma_2$ and $\gamma_3$ are the corresponding weights.
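A plain-Python sketch of the Rocchio update is shown below. The function names and sample vectors are hypothetical, and treating an empty document set as a zero centroid is an assumption of the sketch.

```python
def centroid(vectors, dim):
    """Mean of a list of vectors; an empty set yields the zero vector."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def rocchio(q0, relevant, irrelevant, g1, g2, g3):
    """q_m = g1 * q0 + g2 * centroid(D_r) - g3 * centroid(D_nr)."""
    dim = len(q0)
    c_r = centroid(relevant, dim)
    c_nr = centroid(irrelevant, dim)
    return [g1 * q0[i] + g2 * c_r[i] - g3 * c_nr[i] for i in range(dim)]

q0 = [1.0, 0.0]
d_r = [[0.0, 1.0], [0.0, 3.0]]   # vectors of documents marked relevant
d_nr = [[2.0, 0.0]]              # vectors of documents marked irrelevant
q_m = rocchio(q0, d_r, d_nr, g1=1.0, g2=0.5, g3=0.5)
print(q_m)
```

The update moves the query toward the centroid of the relevant documents and away from the centroid of the irrelevant ones. Implementations often clip negative components to zero, which this sketch does not do.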
In the usage scenario of this system, however, applying the above formula directly does not give the best relevance feedback. The model is improved in two respects: how the relevant document set $D_r$ and the irrelevant document set $D_{nr}$ are delimited and filtered, and how the feedback vector is combined with the subject preference vector to build the feedback query vector.
Since the user marks only a limited number of documents after a primary retrieval, the user's retrieval history and subject interest distribution are used to help decide which documents are relevant and which are not. Documents whose relevance the user marks and judges directly constitute explicit relevance feedback; this is the basis of relevance feedback and is given a higher weight in the computation. For documents in the Top-N retrieval results that the user has not marked, the similarity between the document's subject vector and the user's preferred subject vector is computed; those with high similarity are added to the relevant documents and those with low similarity to the irrelevant documents. When these documents enter the relevance feedback computation, their preferred-subject similarity score can be used as their weight $l_j$. This relieves the user's operating burden while still obtaining the document sets required for feedback retrieval.
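The handling of unmarked Top-N documents could look like the sketch below. The thresholds `high` and `low` and all names are hypothetical, since the text only distinguishes "high" and "low" similarity; documents falling between the two thresholds are left out of both sets.

```python
import math

def cosine(v, w):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v) * sum(b * b for b in w))
    return dot / norm if norm else 0.0

def split_unmarked(unmarked, w, high=0.7, low=0.3):
    """Assign unmarked Top-N documents to pseudo-relevant / pseudo-irrelevant sets.

    unmarked: {doc_id: subject vector} for documents the user did not mark.
    w:        user's subject preference vector.
    Returns two lists of (doc_id, similarity); the similarity can later
    serve as the document weight l_j.
    """
    relevant, irrelevant = [], []
    for doc_id, v in unmarked.items():
        s = cosine(v, w)
        if s >= high:
            relevant.append((doc_id, s))
        elif s <= low:
            irrelevant.append((doc_id, s))
    return relevant, irrelevant

w = [1.0, 0.0]
unmarked = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 2.0]}
rel, irr = split_unmarked(unmarked, w)
print(rel, irr)
```

Documents the user marked explicitly would bypass this step and enter the relevant or irrelevant set directly with the higher weight described above.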
After the document ranges of $D_r$ and $D_{nr}$ have been determined, let $D_r$ denote the set of relevant document vectors and $D_{nr}$ the set of non-relevant document vectors. For each document $d_j$ in these sets, the high-frequency terms and their term frequencies are taken to build the document vector $\vec{d}_j = (freq_{t_1}, freq_{t_2}, \ldots, freq_{t_n})$, where $freq_{t_i}$ is the frequency of term $t_i$ in the document.
After the feedback document vectors have been determined, their weights are adjusted further. Documents marked directly by the user are assigned a weight of 1; the weights of the other documents are computed from the similarity between the document subject vector and the user subject preference vector. Each feedback document is then added to the feedback retrieval vector with its corresponding weight. The user subject preference vector is also added to the feedback vector with weight δ. According to usage statistics, values of δ between 0.2 and 0.3 work best. In addition, because the non-relevant documents are mainly selected automatically by the system from documents the user did not mark, their uncertainty is high. To make feedback retrieval more stable, a single document, selected by similarity calculation as the most non-relevant one, is taken to represent $D_{nr}$ in the calculation.
That is, only $\arg\max_{\vec{d} \in D_{nr}} \operatorname{sim}(\vec{d}, \vec{q}_0)$ from the non-relevant document set enters the calculation.
Combining the above considerations gives the improved feedback query formula, which is used for the query expansion of feedback retrieval:
$$\vec{q}_m = \vec{q}_0 + \sum_{\vec{d}_j \in D_r} l_j \vec{d}_j + \delta \cdot W - \arg\max_{\vec{d} \in D_{nr}} \operatorname{sim}(\vec{d}, \vec{q}_0)$$
where $\vec{q}_0$ is the original query vector, $D_r$ and $D_{nr}$ are the known relevant and non-relevant document sets, $l_j$ is the weight of each relevant document, $W$ is the user subject preference vector, and δ is the weight of $W$. The expanded query computed with this formula is used for secondary feedback retrieval, improving retrieval precision and recall.
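The improved feedback query can be sketched in Python as shown here. The sketch assumes that the query, the documents and the preference vector W are all sparse dictionaries over the same term space, and that sim is cosine similarity; neither assumption is stated in this document. The function names and the default `delta=0.25` (chosen inside the 0.2 to 0.3 range reported above) are illustrative.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def improved_feedback_query(q0, relevant, weights, nonrelevant, preference, delta=0.25):
    """q_m = q0 + sum_j l_j * d_j + delta * W - argmax_{d in D_nr} sim(d, q0).

    relevant / weights: parallel lists of relevant document vectors and their l_j
    nonrelevant:        list of non-relevant document vectors (may be empty)
    preference:         user subject preference vector W
    """
    qm = defaultdict(float, q0)
    for doc, l_j in zip(relevant, weights):
        for term, w in doc.items():
            qm[term] += l_j * w
    for term, w in preference.items():
        qm[term] += delta * w
    if nonrelevant:
        # Only one representative of D_nr is subtracted: the non-relevant
        # document with the highest similarity to the original query.
        representative = max(nonrelevant, key=lambda d: cosine(d, q0))
        for term, w in representative.items():
            qm[term] -= w
    return dict(qm)
```

Subtracting a single representative instead of the whole non-relevant centroid limits the damage when the automatically selected non-relevant documents are wrong, which is the stability argument made in the text.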
One of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments can be carried out by a program that instructs the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above methods. The storage medium may be ROM/RAM, a magnetic disk, an optical disc, a memory card, and so on. Accordingly, corresponding to the above method, the present invention also discloses a preference-based intelligent retrieval system, comprising:
A user subject preference identification module, which builds the user subject preference model based on the subject classification of the data, user characteristics and operation logs;
A query expansion module, which uses the user subject preference model and the user's retrieval input to perform query expansion and obtain the initial retrieval results;
A retrieval ranking module, which uses the user subject preference model and the distribution of the data over the subjects to score the data on subject preference, and performs personalized ranking of the initial retrieval results based on subject preference;
A feedback retrieval module, which uses a model that fuses relevance feedback and pseudo-relevance feedback to perform secondary feedback retrieval on the ranked initial retrieval results and obtain the final retrieval results.
As an example of a typical application scenario of the present invention, the above technical solution is used to build a subsystem of the Southern Power Grid information center system. The intelligent retrieval system makes full use of the comprehensively collected user log information and the descriptor category table to mine users' demand preferences in depth and, on that basis, supports personalized retrieval and improves retrieval accuracy and user satisfaction. The system uses Lucene 4.3 as the underlying retrieval technology and provides a unified access entry. A user subject preference identification module, a related-subject intelligent prompting and query expansion module, a subject-based relevance feedback module, and a personalized retrieval ranking module that incorporates subjects are designed, thereby building a personalized intelligent retrieval system. Referring to Fig. 4, the system modules are designed as follows:
(1) User subject preference identification module: the system analyzes user operation logs, counts operations such as clicks, downloads and favorites by subject category, and uses a weight for each operation type to compute an access heat score for each subject, which serves as the user's degree of preference for that subject. This computation involves log analysis over large volumes of data, which a single machine can hardly support. The system therefore uses the Hadoop platform and performs the log analysis with MapReduce distributed computing.
(2) Query expansion module: when the user submits a retrieval request, the system segments the query statement with the ICTCLAS word segmenter. The Jensen-Shannon divergence is used to measure the relatedness between the segmented query terms and the descriptors, and the descriptors with high relatedness are used for query expansion to build a new retrieval vector. The descriptors can also be suggested to the user, helping the user express the retrieval need more precisely.
(3) Retrieval ranking module: the system provides several ranking interfaces. In comprehensive ranking, the relevance between the document and the query terms is the basis of the ranking. The user's subject preferences are taken into account and expressed as a subject preference vector. The distance between the document's vector in subject space and the user subject preference vector is computed and, as the document's weighted score on user preference, added to the overall ranking score. In addition, a quality score is computed for each document: the citation count, the download count and the journal impact factor are normalized and multiplied by their respective weights to obtain the quality score, which is then added to the overall ranking score with a certain weight.
(4) Feedback retrieval module: the subject-based relevance feedback module lets the user mark which documents in the first ranked results are relevant. Unmarked documents are collected from the result pages the user has browsed and serve as the initial non-relevant documents. Based on the probabilities obtained from user log analysis, documents of uncertain relevance are then filtered out of this set. With the selected relevant and non-relevant documents, feedback query expansion is performed with the above algorithm and secondary feedback retrieval is carried out, further focusing on the results the user wants most.
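As an illustration of the descriptor selection in module (2), the Jensen-Shannon divergence between the term distribution of the query's result set and that of each descriptor's data set can be computed as in the following sketch. The function names, the number of descriptors kept (`k`) and the expansion weight are hypothetical; the distributions are assumed to be dictionaries from term to probability that each sum to 1.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two term distributions."""
    terms = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m; zero terms are skipped.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def expand_query(query_dist, descriptor_dists, k=2, weight=0.3):
    """Pick the k descriptors whose distributions differ least from the query's.

    descriptor_dists: dict descriptor -> term distribution
    Returns dict descriptor -> expansion weight (weight is an assumed constant).
    """
    ranked = sorted(
        descriptor_dists,
        key=lambda name: js_divergence(query_dist, descriptor_dists[name]),
    )
    return {name: weight for name in ranked[:k]}
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (no shared terms), so a fixed cut-off could be used instead of a fixed `k` if preferred.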
The present invention provides a preference-based intelligent retrieval method and system. Subject indexing is used to determine the subject distribution of the data resources; subject-based query expansion and relevance feedback are used to build a retrieval vector that better represents the user's needs; and an intelligent ranking model that incorporates user subject preferences provides the user with retrieval results that better match the user's latent needs. The algorithm and system of the present invention can identify users' latent information needs as described by a professional thesaurus, and therefore achieve better retrieval results.
The above description illustrates and describes several preferred embodiments of the present invention. As stated earlier, however, it should be understood that the invention is not limited to the forms disclosed herein, which should not be regarded as excluding other embodiments. The invention can be used in various other combinations, modifications and environments, and can be altered within the scope of the inventive concept described herein through the above teachings or the skill or knowledge of the related art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the protection scope of the appended claims.

Claims (10)

1. A preference-based intelligent retrieval method, characterized in that the method comprises the steps of:
S1, building a user subject preference model based on the subject classification of the data, user characteristics and operation logs;
S2, using the user subject preference model and the user's retrieval input to perform query expansion and obtain initial retrieval results;
S3, using the user subject preference model and the distribution of the data over the subjects to score the data on subject preference, and performing personalized ranking of the initial retrieval results based on subject preference;
S4, using a model that fuses relevance feedback and pseudo-relevance feedback to perform secondary feedback retrieval on the ranked initial retrieval results and obtain final retrieval results.
2. The method according to claim 1, characterized in that, in step S1, building the user subject preference model comprises the steps of:
Building a subject vector space according to the subject classification;
Determining the user's predefined subject preference vector according to the user characteristics;
Determining the user's historical subject preference vector according to the operation logs;
Weighting the predefined subject preference vector and the historical subject preference vector to obtain the user subject preference model.
3. The method according to claim 1, characterized in that, in step S2, performing the query expansion comprises the steps of:
Computing the probability distribution of each term in the data set corresponding to the query terms in the user's retrieval input;
Computing the probability distribution of each term in the data set corresponding to each descriptor in the vector space of the user subject preference model;
Measuring the difference between the above two kinds of probability distribution, selecting the descriptors whose probability distributions differ less, and adding them to the retrieval vector with a certain weight.
4. The method according to claim 1, characterized in that, in step S3, the personalized ranking comprises the steps of:
Computing the vector similarity between each result in the initial retrieval results and the user subject preference model, and thereby judging the score of each result on the subjects the user prefers;
Computing the quality score of each result;
Obtaining the final ranking score of each result by weighting the vector similarity, the score on the subjects the user prefers and the quality score, and ranking the results in the initial retrieval results according to the final ranking score.
5. The method according to claim 1, characterized in that, in step S4, the secondary feedback retrieval comprises the steps of:
Using the relevance feedback to determine the vector set of the relevant results in the initial retrieval results;
Using the pseudo-relevance feedback to determine the vector set of the non-relevant results in the initial retrieval results;
Combining the user subject preference model, the vector set of the relevant results, the vector set of the non-relevant results and the original query vector to perform the feedback query.
6. A preference-based intelligent retrieval system, characterized in that the system comprises:
A user subject preference identification module, for building a user subject preference model based on the subject classification of the data, user characteristics and operation logs;
A query expansion module, which uses the user subject preference model and the user's retrieval input to perform query expansion and obtain initial retrieval results;
A retrieval ranking module, which uses the user subject preference model and the distribution of the data over the subjects to score the data on subject preference, and performs personalized ranking of the initial retrieval results based on subject preference;
A feedback retrieval module, which uses a model that fuses relevance feedback and pseudo-relevance feedback to perform secondary feedback retrieval on the ranked initial retrieval results and obtain final retrieval results.
7. The system according to claim 6, characterized in that the user subject preference identification module further comprises:
A subject vector space module, for building a subject vector space according to the subject classification;
A predefined preference module, for determining the user's predefined subject preference vector according to the user characteristics;
A historical preference module, for determining the user's historical subject preference vector according to the operation logs;
A preference model acquisition module, for weighting the predefined subject preference vector and the historical subject preference vector to obtain the user subject preference model.
8. The system according to claim 6, characterized in that the query expansion module further comprises:
A query term distribution module, for computing the probability distribution of each term in the data set corresponding to the query terms in the user's retrieval input;
A descriptor distribution module, for computing the probability distribution of each term in the data set corresponding to each descriptor in the vector space of the user subject preference model;
An expansion module, for measuring the difference between the above two kinds of probability distribution, selecting the descriptors whose probability distributions differ less, and adding them to the retrieval vector with a certain weight.
9. The system according to claim 6, characterized in that the retrieval ranking module further comprises:
A subject score module, for computing the vector similarity between each result in the initial retrieval results and the user subject preference model, and thereby judging the score of each result on the subjects the user prefers;
A quality score module, for computing the quality score of each result;
A ranking module, for obtaining the final ranking score of each result by weighting the vector similarity, the score on the subjects the user prefers and the quality score, and ranking the results in the initial retrieval results according to the final ranking score.
10. The system according to claim 6, characterized in that the feedback retrieval module further comprises:
A relevance feedback module, for using the relevance feedback to determine the vector set of the relevant results in the initial retrieval results;
A pseudo-relevance feedback module, for using the pseudo-relevance feedback to determine the vector set of the non-relevant results in the initial retrieval results;
A feedback module, for combining the user subject preference model, the vector set of the relevant results, the vector set of the non-relevant results and the original query vector to perform the feedback query.
CN201310549069.5A 2013-11-08 2013-11-08 Preference-based intelligent retrieval method and system Active CN103593425B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310549069.5A CN103593425B (en) 2013-11-08 2013-11-08 Preference-based intelligent retrieval method and system

Publications (2)

Publication Number Publication Date
CN103593425A true CN103593425A (en) 2014-02-19
CN103593425B CN103593425B (en) 2015-01-07

Family

ID=50083566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310549069.5A Active CN103593425B (en) 2013-11-08 2013-11-08 Preference-based intelligent retrieval method and system

Country Status (1)

Country Link
CN (1) CN103593425B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101539930A (en) * 2009-04-21 2009-09-23 武汉大学 Search method of related feedback images

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
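TANG Xiaoling, HE Tianyun (唐晓玲, 何天云): "Research on a personalized retrieval model based on subject preference" (基于主题偏好的个性化检索模型研究), Journal of Intelligence (《情报杂志》), vol. 30, no. 4, 30 April 2011 (2011-04-30), pages 134 - 136 *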

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462611A (en) * 2015-01-05 2015-03-25 五八同城信息技术有限公司 Modeling method, ranking method, modeling device and ranking device for information ranking model
CN104462611B (en) * 2015-01-05 2018-06-08 五八同城信息技术有限公司 Modeling method, sort method and model building device, the collator of information sorting model
CN105045875B (en) * 2015-07-17 2018-06-12 北京林业大学 Personalized search and device
CN105045875A (en) * 2015-07-17 2015-11-11 北京林业大学 Personalized information retrieval method and apparatus
CN105512298A (en) * 2015-12-10 2016-04-20 成都陌云科技有限公司 Interested content prediction method based on machine learning
CN105550282A (en) * 2015-12-10 2016-05-04 成都陌云科技有限公司 User interest forecasting method by utilizing multidimensional data
CN106547864B (en) * 2016-10-24 2019-07-16 湖南科技大学 A kind of Personalized search based on query expansion
CN106547864A (en) * 2016-10-24 2017-03-29 湖南科技大学 A kind of Personalized search based on query expansion
CN108520033A (en) * 2018-03-28 2018-09-11 华中师范大学 Enhancing pseudo-linear filter model information search method based on superspace simulation language
CN108846050A (en) * 2018-05-30 2018-11-20 重庆望江工业有限公司 Core process knowledge intelligent method for pushing and system based on multi-model fusion
CN108846050B (en) * 2018-05-30 2022-01-21 重庆望江工业有限公司 Intelligent core process knowledge pushing method and system based on multi-model fusion
CN109213908A (en) * 2018-08-01 2019-01-15 浙江工业大学 A kind of academic meeting paper supplying system based on data mining
CN109361929B (en) * 2018-09-28 2021-05-28 武汉斗鱼网络科技有限公司 Method for determining live broadcast room label and related equipment
CN109361929A (en) * 2018-09-28 2019-02-19 武汉斗鱼网络科技有限公司 A kind of method and relevant device of determining direct broadcasting room label
CN109408713A (en) * 2018-10-09 2019-03-01 哈尔滨工程大学 A kind of software requirement searching system based on field feedback
CN109408713B (en) * 2018-10-09 2020-12-04 哈尔滨工程大学 Software demand retrieval system based on user feedback information
CN110046243A (en) * 2019-04-23 2019-07-23 北京恒冠网络数据处理有限公司 A kind of patent personalized retrieval analysis system based on big data
CN110427400A (en) * 2019-06-21 2019-11-08 贵州电网有限责任公司 Search method is excavated based on operation of power networks information interactive information user's demand depth
CN110489638A (en) * 2019-07-08 2019-11-22 广州视源电子科技股份有限公司 A kind of searching method, device, server, system and storage medium
CN110569431A (en) * 2019-08-14 2019-12-13 深圳市赛为智能股份有限公司 public opinion information monitoring method and device, computer equipment and storage medium
CN110659768A (en) * 2019-08-14 2020-01-07 中国科学院计算机网络信息中心 Data publication academic influence evaluation and prediction method
CN110659768B (en) * 2019-08-14 2023-01-17 中国科学院计算机网络信息中心 Academic influence evaluation and prediction method for data publications
CN113505290A (en) * 2021-08-31 2021-10-15 上海飞旗网络技术股份有限公司 Information retrieval method and system for user-defined user intention model
CN115906155A (en) * 2022-11-04 2023-04-04 浙江联运知慧科技有限公司 Data management system of sorting center
CN116719954A (en) * 2023-08-04 2023-09-08 中国人民解放军海军潜艇学院 Information retrieval method, electronic equipment and storage medium
CN116719954B (en) * 2023-08-04 2023-10-17 中国人民解放军海军潜艇学院 Information retrieval method, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN103593425B (en) 2015-01-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant