CN103593425A

CN103593425A - Preference-based intelligent retrieval method and system

Info

Publication number: CN103593425A
Application number: CN201310549069.5A
Authority: CN
Inventors: 李鹏; 周育忠; 王庆红; 龚婷; 陈传夫; 王平; 冉从敬; 吴江
Original assignee: Wuhan University WHU; Research Institute of Southern Power Grid Co Ltd
Current assignee: Wuhan University WHU; CSG Electric Power Research Institute; Research Institute of Southern Power Grid Co Ltd
Priority date: 2013-11-08
Filing date: 2013-11-08
Publication date: 2014-02-19
Anticipated expiration: 2033-11-08
Also published as: CN103593425B

Abstract

The invention relates to the field of data retrieval, and discloses a preference-based intelligent retrieval method and system. The method comprises the following steps: a subject preference model of a user is established on the basis of the subject classification of data, user characteristics and operation logs; query expansion is performed using the subject preference model of the user and the retrieval input of the user to obtain a primary retrieval result; subject preference scoring is performed on the data using the subject preference model of the user and the distribution of the data over all subjects, and personalized retrieval ranking based on subject preference is performed on the primary retrieval result; secondary feedback retrieval is performed on the ranked primary retrieval result through a combined model of relevance feedback and pseudo-relevance feedback to obtain a final retrieval result. According to the method and system, the subject distribution of data resources is determined through subject indexing, retrieval vectors that better represent user requirements are built through subject-based query expansion, relevance feedback and related techniques, and retrieval results that better meet the potential requirements of the user are provided.

Description

Preference-based intelligent retrieval method and system

Technical field

The present invention relates to the field of data retrieval, and in particular to a preference-based intelligent retrieval method and system.

Background art

With the continuous advance of social informatization and the rapid development of IT equipment, the volume of stored information is growing exponentially. At the same time, people's requirements for obtaining information keep rising, and using retrieval technology to quickly find the useful information required is becoming increasingly difficult. Traditional search engines retrieve by keyword; even when several keywords are combined, a search over massive network information still returns millions of results, and finding the most needed information among them is a considerable task for the user. The most critical problem in data retrieval today is therefore how to find, within the retrieval results, the information the user needs most.

In the prior art, a search engine or data retrieval system can rank retrieval results on the basis of certain statistical information so that results with higher relevance are presented to the user first. Such statistics mainly include keyword frequency, matching degree and click-through rate. They are computed over the definite content of the data itself; although the processing volume is large, the content is well defined and the approach is relatively easy to implement. Some more advanced systems optimize further, for example by classifying the data or by expanding the keywords with statistical features of text semantics, so that the top-ranked results are as relevant as possible to the retrieval keywords. However, these approaches rely mainly on the subject terms in a single query request submitted by the user (a combination of keywords, time, retrieval scope and similar requirements) and on the text information of the data. Because the usable content of these two kinds of information is limited, and because the information of the data itself cannot reflect differences between users, the retrieval results can hardly reflect the differing needs of different users even after such optimization. As a result, the retrieval efficiency, accuracy and user satisfaction of existing approaches fall short of the ideal.

Summary of the invention

In view of the above defects in the prior art, the technical problem to be solved by the present invention is how to optimize retrieval for the differences between users.

To solve this technical problem, in one aspect, the present invention provides a preference-based intelligent retrieval method comprising the steps of:

S1, establishing a subject preference model of the user on the basis of the subject classification of data, user characteristics and operation logs;

S2, performing query expansion using the subject preference model of the user and the retrieval input of the user to obtain a primary retrieval result;

S3, performing subject preference scoring of the data using the subject preference model of the user and the distribution of the data over each subject, and performing personalized retrieval ranking based on subject preference on the primary retrieval result;

S4, performing secondary feedback retrieval on the ranked primary retrieval result using a combined model of relevance feedback and pseudo-relevance feedback to obtain a final retrieval result.

Preferably, in step S1, establishing the subject preference model of the user comprises the steps of:

establishing a subject vector space according to the subject classification;

determining a predefined subject preference vector of the user according to the user characteristics;

determining a historical subject preference vector of the user according to the operation logs;

weighting the predefined subject preference vector and the historical subject preference vector to obtain the subject preference model of the user.

Preferably, in step S2, performing query expansion comprises the steps of:

calculating the probability distribution of each term in the data set corresponding to the search terms in the retrieval input of the user;

calculating the probability distribution of each term in the data set corresponding to each subject term in the vector space of the subject preference model of the user;

measuring the difference between the two probability distributions, selecting the subject terms with the smaller difference, and adding them to the retrieval vector with a certain weight.

Preferably, in step S3, the personalized retrieval ranking comprises the steps of:

evaluating the score of each result in the primary retrieval result on the subjects preferred by the user by calculating the vector similarity between that result and the subject preference model of the user;

calculating a quality score of each result;

obtaining a final ranking score of each result from a weighted combination of the vector similarity, the score on the subjects preferred by the user and the quality score, and ranking the results in the primary retrieval result by the final ranking score.

Preferably, in step S4, the secondary feedback retrieval comprises the steps of:

determining the vector set of relevant results in the primary retrieval result using the relevance feedback;

determining the vector set of irrelevant results in the primary retrieval result using the pseudo-relevance feedback;

combining the subject preference model of the user, the vector set of the relevant results, the vector set of the irrelevant results and the original query vector to perform a feedback query.

In another aspect, the present invention also provides a preference-based intelligent retrieval system comprising:

a user subject preference identification module, configured to establish a subject preference model of the user on the basis of the subject classification of data, user characteristics and operation logs;

a query expansion module, configured to perform query expansion using the subject preference model of the user and the retrieval input of the user to obtain a primary retrieval result;

a retrieval ranking module, configured to perform subject preference scoring of the data using the subject preference model of the user and the distribution of the data over each subject, and to perform personalized retrieval ranking based on subject preference on the primary retrieval result;

a feedback retrieval module, configured to perform secondary feedback retrieval on the ranked primary retrieval result using a combined model of relevance feedback and pseudo-relevance feedback to obtain a final retrieval result.

Preferably, the user subject preference identification module further comprises:

a subject vector space module, configured to establish a subject vector space according to the subject classification;

a predefined preference module, configured to determine a predefined subject preference vector of the user according to the user characteristics;

a historical preference module, configured to determine a historical subject preference vector of the user according to the operation logs;

a preference model acquisition module, configured to weight the predefined subject preference vector and the historical subject preference vector to obtain the subject preference model of the user.

Preferably, the query expansion module further comprises:

a search term distribution module, configured to calculate the probability distribution of each term in the data set corresponding to the search terms in the retrieval input of the user;

a subject term distribution module, configured to calculate the probability distribution of each term in the data set corresponding to each subject term in the vector space of the subject preference model of the user;

an expansion module, configured to measure the difference between the two probability distributions, select the subject terms with the smaller difference, and add them to the retrieval vector with a certain weight.

Preferably, the retrieval ranking module further comprises:

a subject scoring module, configured to evaluate the score of each result in the primary retrieval result on the subjects preferred by the user by calculating the vector similarity between that result and the subject preference model of the user;

a quality scoring module, configured to calculate a quality score of each result;

a ranking module, configured to obtain a final ranking score of each result from a weighted combination of the vector similarity, the score on the subjects preferred by the user and the quality score, and to rank the results in the primary retrieval result by the final ranking score.

Preferably, the feedback retrieval module further comprises:

a relevance feedback module, configured to determine the vector set of relevant results in the primary retrieval result using the relevance feedback;

a pseudo-relevance feedback module, configured to determine the vector set of irrelevant results in the primary retrieval result using the pseudo-relevance feedback;

a feedback module, configured to combine the subject preference model of the user, the vector set of the relevant results, the vector set of the irrelevant results and the original query vector to perform a feedback query.

The present invention provides a preference-based intelligent retrieval method and system. Subject indexing is used to determine the subject distribution of data resources; subject-based query expansion, relevance feedback and related techniques are used to build retrieval vectors that better represent user needs; and an intelligent ranking model that incorporates the user's subject preference then provides the user with retrieval results that better meet the user's potential needs. The algorithm and system implemented by the present invention can identify a user's latent information needs as described by a professional thesaurus, and therefore achieve better retrieval performance.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the preference-based intelligent retrieval method in one embodiment of the present invention;

Fig. 2 is a schematic flow chart of the subject-based query expansion algorithm in a preferred embodiment of the present invention;

Fig. 3 is a schematic flow chart of the subject-combined relevance feedback algorithm in a preferred embodiment of the present invention;

Fig. 4 is a schematic diagram of the module structure of the preference-based intelligent retrieval system in a typical application scenario of the present invention.

Embodiment

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are preferred embodiments for implementing the present invention; the description is intended to illustrate the general principles of the invention and not to limit its scope. The scope of protection of the present invention is defined by the claims, and all other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within that scope.

The prior art optimizes mainly the data being retrieved. In the best case it merely classifies and expands the retrieved data precisely and then matches it against the subject terms in a single query request submitted by the user. Although this improves retrieval accuracy considerably, it does not reflect differences between users: as long as the query requests are identical, the retrieval results are identical, which clearly differs from the real situation in which different users have different needs.

In the embodiments of the present invention, the user's potential needs are obtained by observing and analyzing the user's retrieval behavior over a relatively long period. User needs are combined with data classification, and explicit and implicit relevance feedback are incorporated into retrieval optimization. This accurately reflects the differences in user needs and effectively improves the overall efficiency and accuracy of data retrieval.

Referring to Fig. 1, in one embodiment of the present invention, the preference-based intelligent retrieval method comprises steps S1 to S4 as set out in the summary above.

The preferred forms of the above embodiment are explained in more detail below. In the following preferred embodiments, to better highlight the technical principles and practical effect of the invention, the data being retrieved is limited to scientific and technical information. Those skilled in the art should understand, however, that scientific and technical information is only one specific category of data; the technical solution of the present invention can be applied directly to many kinds of digital information, and the following preferred embodiments should not be regarded as limiting the invention.

Users have latent subject needs when obtaining data resources. Taking scientific and technical literature as an example, users in different fields have markedly different needs for the same keyword, which makes this implicit subject need especially apparent. In a preferred embodiment of the present invention, step S1 uses a subject term category table to map user needs and to discover the user's preferences over the classification of document resources, providing a sound basis for intelligent retrieval. Subject preference is considered mainly from the following two aspects:

1. Predefinition of the user's subject preference

Different users have different characteristics, many of which reflect the user's potential needs. Some of the user's subject preferences can therefore be predefined from user characteristics (such as the user's region, function information or the scope of the user's post). For example, a user in a high-voltage test post in the power industry has a particular need for document resources on power transformers, isolating switches, instrument transformers and the like. Subject terms can thus be extracted from the documents of that post and, together with the subject terms of the job responsibilities, mapped onto standard subject categories to serve as the predefinition of the user's preference. More preferably, step S1 represents the user's subject preference with a vector space model:

First, the subject distribution is analyzed and an N-dimensional subject vector space [(k_1, w_1), (k_2, w_2), ..., (k_N, w_N)] is established, where k_i is the i-th subject and w_i is the user's preference degree on k_i, i ∈ {1, 2, ..., N}.

Then, subject terms are extracted from the user characteristics (such as job-responsibility subject terms and post documents), the frequencies of these subject terms are counted, and their probability distribution is calculated:

p_{sub_i} = \frac{freq_{sub_i}}{freq_{sub\_total}}

where freq_{sub_i} is the word frequency of subject term sub_i and freq_{sub\_total} is the total word frequency of the subject term set.

Finally, p_{sub_i}, after a certain adjustment, is used to characterize the user's preference degree on each subject term sub_i, giving the predefined user subject preference vector W_{pre} = (w'_1, w'_2, ..., w'_n), where w'_i (i = 1, 2, ..., n) denotes the user's predefined preference degree on subject k_i.
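The predefinition step can be illustrated with a minimal Python sketch. It assumes the subject terms have already been extracted from the user characteristics; the function name `predefined_preference` and the `scale` parameter (standing in for the unspecified adjustment) are hypothetical and not taken from the patent.

```python
from collections import Counter

def predefined_preference(extracted_terms, subjects, scale=1.0):
    """Return W_pre: one preference degree per subject, computed as the
    (scaled) relative frequency of that subject term among the terms
    extracted from the user characteristics."""
    subject_set = set(subjects)
    counts = Counter(t for t in extracted_terms if t in subject_set)
    total = sum(counts.values())
    if total == 0:
        # No known subject term was found: no predefined preference.
        return [0.0] * len(subjects)
    return [scale * counts[k] / total for k in subjects]

# Illustrative data only: subject terms extracted from a post description.
subjects = ["power transformer", "isolating switch", "instrument transformer"]
extracted = ["power transformer", "isolating switch",
             "power transformer", "instrument transformer"]
w_pre = predefined_preference(extracted, subjects)
```

With `scale=1.0` the vector is simply the probability distribution of the subject terms; any other adjustment would replace the multiplication.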

2. Discovering the user's subject preference from user operation logs

The user's retrieval behavior is only one part of the user's overall information-seeking behavior. Related operations include clicking, downloading and collecting documents in the system, all of which are recorded in the system log. The user's subject preference can therefore be mined from the large volume of operation log information to provide a basis for intelligent retrieval. Step S1 of the method also establishes a complete operation log collection mechanism and uses the operation logs to determine the user's subject preference.

Specifically, the logs are collected and analyzed to obtain the set of documents the user has operated on, D_{op} = {d_{op1}, d_{op2}, ..., d_{opN}}. For each d_i ∈ D_{op}, the frequencies of the user's click, download, collection and similar operations on d_i are counted, different operations are given different weights, and the user's access frequency for d_i is calculated after weighting. From the subject indexing of the documents, the distribution of d_i over the subject terms can be obtained; combining this with the access frequency of d_i gives the user's access frequency on each subject term. This access frequency is taken as the user's subject preference degree and mapped into the subject vector space, giving the user's subject preference vector W_{op} = (w_1, w_2, ..., w_n).

Finally, the two subject preferences above are weighted to determine the user's subject preference W = α_1 W_{pre} + α_2 W_{op}, where α_1 and α_2 are the weights of the two vectors, preset or adjusted according to the desired emphasis. Note that the user preference obtained from log analysis changes over time and must be updated as the logs are updated.
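A minimal Python sketch of the log-based preference and the weighted combination W = α_1 W_pre + α_2 W_op follows. The operation weights in `OP_WEIGHTS`, the normalization of W_op to sum to 1, and the equal default values of α_1 and α_2 are assumptions made for illustration; the text only states that operations receive different weights and that the two vectors are weighted.

```python
from collections import defaultdict

# Hypothetical operation weights (not specified in the text).
OP_WEIGHTS = {"click": 1.0, "download": 2.0, "collect": 3.0}

def log_preference(log, doc_subjects, n_subjects):
    """log: iterable of (doc_id, operation) records.
    doc_subjects: doc_id -> list of the document's weights over the subjects.
    Returns W_op, normalized to sum to 1 (an assumption of this sketch)."""
    access = defaultdict(float)
    for doc_id, op in log:
        access[doc_id] += OP_WEIGHTS.get(op, 0.0)  # weighted access frequency
    w_op = [0.0] * n_subjects
    for doc_id, freq in access.items():
        shares = doc_subjects.get(doc_id, [0.0] * n_subjects)
        for i, share in enumerate(shares):
            w_op[i] += freq * share  # spread access frequency over subjects
    total = sum(w_op)
    return [w / total for w in w_op] if total else w_op

def combine(w_pre, w_op, alpha1=0.5, alpha2=0.5):
    """W = alpha1 * W_pre + alpha2 * W_op."""
    return [alpha1 * a + alpha2 * b for a, b in zip(w_pre, w_op)]

# Illustrative data only.
log = [("d1", "click"), ("d1", "download"), ("d2", "collect")]
doc_subjects = {"d1": [1.0, 0.0], "d2": [0.5, 0.5]}
w_op = log_preference(log, doc_subjects, n_subjects=2)
w = combine([0.5, 0.5], w_op)
```

Because the log-based preference drifts over time, a deployment along these lines would recompute `w_op` whenever the logs are refreshed.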

The query request directly reflects the user's query need and also contains a latent subject need. This subject need reflects, to some extent, the user's abstraction and summary of the documents required, and better reflects what the user wants. Subject terms also serve as labels of document resources: they reflect the core content and classification of a document and express its essence better. Taking both aspects into account, step S2 of the present invention selects subject terms for query expansion, which considerably improves retrieval performance. The algorithm flow is shown in Fig. 2.

If the user's retrieval input is itself a standard subject term, related subject terms can be found for query expansion through relations such as hypernyms and hyponyms in the subject category table. Often, however, there is no explicit association between the query request entered by the user and the latent subject need. In that case an association can be established through the historically retrieved documents and the subject-indexed documents. As shown in Fig. 2, the basic idea is as follows:

Let the document set corresponding to the user's retrieval request Q be D_{query} = {d_{q1}, d_{q2}, ..., d_{qN}}. Segmenting each document in D_{query} into words yields a set of terms, denoted T_{query} = {t_{q1}, t_{q2}, ..., t_{qN}}. For each t_{qi} ∈ T_{query}, the probability

p_{{qt}_{i}} = \frac{freq_{{qt}_{i}}}{freq_{total}}

is computed, which gives the probability distribution of the term set T_{query} corresponding to D_{query}, denoted

F_{query} = (p_{{qt}_{1}}, p_{{qt}_{2}}, . . ., p_{{qt}_{N}}),

where freq_{{qt}_{i}} is the word frequency of t_{qi} and freq_{total} is the sum of the word frequencies of the terms in T_{query}.

For a subject term in the subject vector space, the subject indexing of the documents likewise gives a document set, denoted D_{subject} = {d_{s1}, d_{s2}, ..., d_{sN}}. Similarly, a term set is obtained from this document set, and calculating the corresponding word frequencies gives the probability distribution of the term set corresponding to D_{subject}, denoted

F_{subject} = (p_{{st}_{1}}, p_{{st}_{2}}, . . ., p_{{st}_{N}}) .

Once these two probability distributions have been obtained, the subject terms most relevant to the search terms can be found by calculating the similarity of the distributions, and then used for subject term query expansion.

When calculating the similarity of the probability distributions of the two groups of documents corresponding to the search terms and to a subject term, the Kullback-Leibler divergence (KL divergence, also called relative entropy) is preferably used.

D_{KL} (F_{subject} \| F_{query}) then gives the difference of the distribution F_{subject} relative to F_{query}, and the subject terms with the smaller difference are selected to build the query expansion:

D_{KL} (F_{subject} \| F_{query}) = \sum_{i} p_{{st}_{i}} \log \frac{p_{{st}_{i}}}{p_{{qt}_{i}}};

To obtain a better query expansion effect, the distribution of the query request and of the subject terms over the system's document vectors was studied further and the above calculation was optimized accordingly: the Jensen-Shannon divergence is chosen for a smoothed calculation, and D_{JS} (F_{subject} \| F_{query}) is used to measure the mutual difference between F_{subject} and F_{query}:

D_{JS} (F_{subject} \| F_{query}) = \frac{1}{2} D_{KL} (F_{subject} \| R) + \frac{1}{2} D_{KL} (F_{query} \| R);

where

R = \frac{1}{2} (F_{subject} + F_{query}) .

After the subject terms with the smaller probability distribution difference have been selected, they are added to the retrieval vector with a certain weight to build the expanded query vector and improve retrieval efficiency.
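The selection of expansion subject terms by Jensen-Shannon divergence can be sketched in Python as follows. This is an illustrative sketch only: documents are assumed to be already segmented into lists of terms, both distributions are computed over a shared vocabulary so that they are comparable, and the function names and the parameter `k` (number of subject terms to keep) are hypothetical.

```python
import math
from collections import Counter

def term_distribution(docs, vocab):
    """Relative frequency of each vocabulary term over a set of
    pre-segmented documents (lists of terms)."""
    vocab_set = set(vocab)
    counts = Counter(t for doc in docs for t in doc if t in vocab_set)
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in vocab]

def kl(p, q):
    """D_KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """D_JS(p || q) = 1/2 D_KL(p || R) + 1/2 D_KL(q || R), R = (p + q) / 2."""
    r = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, r) + 0.5 * kl(q, r)

def expansion_subjects(query_docs, subject_docs, k=1):
    """Return the k subject terms whose term distribution is closest
    (smallest JS divergence) to that of the query's documents."""
    vocab = sorted({t for d in query_docs for t in d}
                   | {t for docs in subject_docs.values() for d in docs for t in d})
    f_query = term_distribution(query_docs, vocab)
    scored = sorted((js(term_distribution(docs, vocab), f_query), subject)
                    for subject, docs in subject_docs.items())
    return [subject for _, subject in scored[:k]]

# Illustrative data only.
query_docs = [["transformer", "oil", "fault"]]
subject_docs = {
    "power transformer": [["transformer", "oil", "winding"]],
    "isolating switch": [["switch", "arc", "contact"]],
}
chosen = expansion_subjects(query_docs, subject_docs, k=1)
```

The mixture R is nonzero wherever either distribution is nonzero, which is why the JS form avoids the division by zero that a direct KL computation can hit when a term is missing from the query distribution.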

In step S3 of the method, the core of personalized retrieval ranking is to weight the ranking by the user's subject preference on top of the document relevance ranking. The user's subject preference vector W is obtained from the user's subject preference model. For the document set returned by retrieval, the subject distribution vector V = (v_1, v_2, ..., v_n) of each document can be obtained from its subject indexing. The score of a retrieved document on the subjects preferred by the user can then be evaluated by calculating the vector similarity sim (V, W) between W and V: the higher the value of sim (V, W), the higher the document's preference score. Here,

sim (V, W) = \frac{\sum_{k = 1}^{n} v_{k} \times w_{k}}{\sqrt{(\sum_{k = 1}^{n} v_{k}^{2}) (\sum_{k = 1}^{n} w_{k}^{2})}} .
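The similarity above is the cosine of the angle between V and W. A minimal Python sketch follows; returning 0.0 when either vector is all zeros is an assumption of this sketch, since the formula is undefined in that case.

```python
import math

def cosine_similarity(v, w):
    """sim(V, W): dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v) * sum(b * b for b in w))
    return dot / norm if norm else 0.0

# A document whose subject distribution matches the user's preference
# direction scores higher than one concentrated on another subject.
w_user = [0.7, 0.2, 0.1]
score_close = cosine_similarity([0.6, 0.3, 0.1], w_user)
score_far = cosine_similarity([0.0, 0.1, 0.9], w_user)
```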

Besides the weighting by user subject preference, document quality is another important weighting criterion. Many factors bear on the evaluation of document quality; here, documents are scored on four factors: how often the paper is cited, how often it is downloaded, the rank of the journal in which it was published, and whether it is a self-built resource. The self-built resource factor reflects the fact that the organization collects document resources in two ways, by purchasing resources and by collecting them itself. Resources collected by the organization according to its specialties have been manually reviewed and therefore have higher quality. The weight of each factor is given in Table 1.

Factor	f_{quote}	f_{download}	f_{periodical}	f_{self-built}
Weight	0.5	0.1	0.2	0.2

Table 1. Weights of the document quality evaluation factors

The score of each factor is obtained by normalizing the relevant fields in the document metadata. After weighting, the quality score of the document is G_{factor} = 0.45 f_{quote} + 0.15 f_{download} + 0.2 f_{periodical} + 0.2 f_{self-built}.

By weighting the two scores above together with the similarity score between the document and the query, the final ranking score of a retrieved document is calculated as G_{sort} = β_1 G_{query} + β_2 sim (V, W) + β_3 G_{factor}, where G_{query} is the score returned by Lucene for a specific user query, and β_1, β_2 and β_3 are the weights of the respective scores. Different weight settings may be used in the calculation, determined according to how the system is used and how the documents are distributed.
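The ranking computation can be sketched in Python as below. The default quality weights follow the G_factor formula (Table 1 lists slightly different values, so they are kept as a parameter); the β values are purely hypothetical because the text leaves them to be tuned. `g_query` stands for the relevance score returned by the underlying engine and `sim` for a precomputed sim (V, W); both are plain numbers in this sketch.

```python
# Weights taken from the G_factor formula; pass another dict to use Table 1.
QUALITY_WEIGHTS = {"quote": 0.45, "download": 0.15,
                   "periodical": 0.2, "self_built": 0.2}

def quality_score(factors, weights=QUALITY_WEIGHTS):
    """G_factor: weighted sum of the normalized quality factors (each in [0, 1])."""
    return sum(weights[name] * factors.get(name, 0.0) for name in weights)

def final_score(g_query, sim_vw, g_factor, betas=(0.5, 0.3, 0.2)):
    """G_sort = beta1*G_query + beta2*sim(V, W) + beta3*G_factor.
    The beta values are illustrative assumptions."""
    b1, b2, b3 = betas
    return b1 * g_query + b2 * sim_vw + b3 * g_factor

def rank(results):
    """Sort result records by descending final ranking score."""
    return sorted(
        results,
        key=lambda r: final_score(r["g_query"], r["sim"],
                                  quality_score(r["factors"])),
        reverse=True,
    )

# Illustrative data only: "b" matches the query less well than "a" but fits
# the user's subject preference and has high quality, so it ranks first.
results = [
    {"id": "a", "g_query": 0.9, "sim": 0.1, "factors": {}},
    {"id": "b", "g_query": 0.6, "sim": 0.9,
     "factors": {"quote": 1.0, "download": 1.0,
                 "periodical": 1.0, "self_built": 1.0}},
]
ranked_ids = [r["id"] for r in rank(results)]
```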

Relevance feedback supplements the retrieval request and can effectively improve retrieval accuracy. In step S4 of the method, relevance feedback and pseudo-relevance feedback are combined, and the subject category classification and the user's operation logs are analyzed to delimit the relevant and irrelevant documents effectively, so that the feedback achieves a better effect. The specific algorithm flow of the relevance feedback is shown in Fig. 3.

After the primary retrieval, the user marks the relevance of the retrieval results. From these marks, a relevant document vector set D_r and an irrelevant document vector set D_{nr} are established. Once the relevant and irrelevant documents have been obtained, a relevance feedback retrieval vector can be built following the idea of the Rocchio algorithm:

{\vec{q}}_{m} = \gamma_{1} {\vec{q}}_{0} + \gamma_{2} \frac{1}{| D_{r} |} \sum_{{\vec{d}}_{j} \in D_{r}} {\vec{d}}_{j} - \gamma_{3} \frac{1}{| D_{nr} |} \sum_{{\vec{d}}_{j} \in D_{nr}} {\vec{d}}_{j};

where {\vec{q}}_{0} is the original query vector, D_r and D_{nr} are the known relevant and irrelevant document sets, and γ_1, γ_2, γ_3 are the respective weights.
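The Rocchio update above can be written as a short Python sketch. The default values of γ_1, γ_2 and γ_3 are commonly used illustrative choices, not values given in the text, and vectors are assumed to be plain lists of equal length.

```python
def rocchio(q0, relevant, irrelevant, g1=1.0, g2=0.75, g3=0.15):
    """q_m = g1*q0 + g2*centroid(D_r) - g3*centroid(D_nr).
    The gamma defaults are assumptions of this sketch."""
    def centroid(docs):
        if not docs:
            return [0.0] * len(q0)  # an empty set contributes nothing
        return [sum(col) / len(docs) for col in zip(*docs)]

    c_rel, c_irr = centroid(relevant), centroid(irrelevant)
    return [g1 * q + g2 * r - g3 * n for q, r, n in zip(q0, c_rel, c_irr)]

# Illustrative data only: the query moves toward the second dimension
# (present in the relevant documents) and away from the first.
q_m = rocchio([1.0, 0.0],
              relevant=[[0.0, 1.0], [0.0, 1.0]],
              irrelevant=[[1.0, 0.0]],
              g1=1.0, g2=1.0, g3=1.0)
```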

In the usage scenario of this system, however, applying the above formula directly does not give the best relevance feedback effect. The model is therefore improved in two respects: the delimitation and filtering of the documents in the relevant document set D_r and the irrelevant document set D_{nr}, and the combination of the feedback vector with the subject preference vector to build the feedback query vector.

After the primary retrieval, the user marks only a limited number of documents, so the user's retrieval history and distribution of subject interests are needed to help decide which documents are relevant and which are not. The documents the user marks directly, together with the relevance judgments given, constitute explicit relevance feedback; this is the basis of relevance feedback and is given a higher weight in the feedback calculation. For the documents in the Top-N retrieval results that the user has not marked, the similarity between the document's subject vector and the user's preference subject vector is calculated: documents with high similarity are added to the relevant documents and documents with low similarity to the irrelevant documents. When these two groups of documents enter the user's relevance feedback calculation, their preference subject similarity score can be used as their weight l_j. This reduces the user's burden while still yielding the document sets required for feedback retrieval.
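The delimitation of D_r and D_nr described above can be sketched in Python as follows. The similarity thresholds `hi` and `lo` are hypothetical (the text only speaks of "high" and "low" similarity), explicitly marked documents receive weight 1, and unmarked documents whose similarity falls between the thresholds are left out of both sets, which is a choice made by this sketch.

```python
import math

def cosine(v, w):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v) * sum(b * b for b in w))
    return dot / norm if norm else 0.0

def feedback_sets(top_n, marked, w_pref, hi=0.6, lo=0.2):
    """top_n: doc_id -> subject vector of each Top-N result.
    marked: doc_id -> True (relevant) / False (irrelevant), set by the user.
    Returns two dicts mapping doc_id to its feedback weight l_j."""
    relevant, irrelevant = {}, {}
    for doc_id, subject_vec in top_n.items():
        if doc_id in marked:
            # Explicit feedback: full weight.
            (relevant if marked[doc_id] else irrelevant)[doc_id] = 1.0
            continue
        s = cosine(subject_vec, w_pref)  # pseudo feedback via preference
        if s >= hi:
            relevant[doc_id] = s
        elif s <= lo:
            irrelevant[doc_id] = s
    return relevant, irrelevant

# Illustrative data only.
top_n = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 2.0], "d": [0.0, 1.0]}
rel, irr = feedback_sets(top_n, marked={"d": True}, w_pref=[1.0, 0.0])
```

Here "d" is relevant because the user said so, even though its subject vector is far from the preference vector, which reflects the higher priority of explicit marks.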

After the document scopes of D_r and D_{nr} have been determined, D_r denotes the set of relevant document vectors and D_{nr} the set of irrelevant document vectors. For each document in these sets, its high-frequency terms and their word frequencies are taken to build the document vector, where freq_{t_i} is the word frequency of term t_i in the document.

After the feedback document vectors have been determined, their weights are adjusted further. Documents marked directly by the user are given weight 1; for the other documents the weight is calculated from the similarity between the document's subject vector and the user's subject preference vector. Each feedback document is thus added to the feedback retrieval vector with its corresponding weight. The user's subject preference vector is also added to the feedback vector with weight δ. Usage statistics indicate that values of δ between 0.2 and 0.3 give better results. In addition, because the irrelevant documents are mainly selected automatically by the system from the documents the user did not mark, their uncertainty is high. To make feedback retrieval more stable, a single document chosen by similarity calculation is used to represent D_{nr} in the calculation.

That is, only one representative document from the irrelevant document set enters the calculation.

Taking the above considerations together, the improved feedback query formula is

{\vec{q}}_{m} = {\vec{q}}_{0} + \sum_{{\vec{d}}_{j} \in D_{r}} l_{j} {\vec{d}}_{j} + \delta \cdot W - \arg \max_{\vec{d} \in D_{nr}} sim (\vec{d}, {\vec{q}}_{0})

which is used to perform the query expansion for feedback retrieval.

Here {\vec{q}}_{0} is the original query vector, D_r and D_{nr} are the known relevant and irrelevant document sets, l_j is the weight of each relevant document, W is the user's subject preference vector, and δ is the weight of W. The expanded query calculated with this formula is used for the secondary feedback retrieval, improving retrieval precision and recall.
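A minimal Python sketch of the improved feedback query follows. It assumes that the query vector, the document vectors and the preference vector W are all expressed in the same vector space and have the same length, which the formula requires but the text does not spell out; the function name and data are illustrative.

```python
import math

def cosine(v, w):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v) * sum(b * b for b in w))
    return dot / norm if norm else 0.0

def improved_feedback(q0, relevant, irrelevant, w_pref, delta=0.25):
    """q_m = q0 + sum_j l_j*d_j + delta*W - argmax_{d in D_nr} sim(d, q0).
    relevant: list of (l_j, d_j) pairs; irrelevant: list of vectors.
    delta defaults to a value inside the 0.2-0.3 range given in the text."""
    q_m = list(q0)
    for l_j, d_j in relevant:
        q_m = [q + l_j * d for q, d in zip(q_m, d_j)]
    q_m = [q + delta * w for q, w in zip(q_m, w_pref)]
    if irrelevant:
        # A single representative of D_nr, selected as in the formula.
        rep = max(irrelevant, key=lambda d: cosine(d, q0))
        q_m = [q - d for q, d in zip(q_m, rep)]
    return q_m

# Illustrative data only.
q_m = improved_feedback(
    q0=[1.0, 0.0, 0.0],
    relevant=[(1.0, [0.0, 1.0, 0.0]), (0.5, [0.0, 0.0, 2.0])],
    irrelevant=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    w_pref=[0.0, 4.0, 0.0],
)
```

Subtracting one representative vector rather than the centroid of D_nr keeps a noisy, automatically selected irrelevant set from dominating the update.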

Those of ordinary skill in the art will understand that all or part of the steps of the method in the above embodiments can be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method of the above embodiments; the storage medium may be ROM/RAM, a magnetic disk, an optical disc, a memory card or the like. Accordingly, corresponding to the above method, the present invention also discloses a preference-based intelligent retrieval system comprising the user subject preference identification module, the query expansion module, the retrieval ranking module and the feedback retrieval module described above.

As an example of a typical application scenario of the present invention, the above technical solution was used to build a subsystem of the Southern Power Grid information center system. The intelligent retrieval system makes full use of the comprehensively collected user log information and the subject-term category table to mine users' demand preferences in depth, and uses them as support to meet users' personalized retrieval needs and to improve retrieval accuracy and satisfaction. The system uses Lucene 4.3 as the underlying retrieval technology and provides a unified access entry. A user subject preference identification module, a related-subject intelligent prompting and query expansion module, a subject-based relevance feedback module, and a subject-fused personalized retrieval ranking module are designed, thereby building the personalized intelligent retrieval system. Referring to Fig. 4, the system modules are designed as follows:

(1) User subject preference identification module: the system analyzes the user operation logs, counts operations such as clicks, downloads, and favorites by subject category, and uses a weight for each operation type to calculate an access popularity score for each subject, which serves as the user's degree of preference for that subject. This calculation involves analyzing a large volume of log data, which a single machine can hardly support. The system therefore uses the Hadoop platform and performs the log analysis through MapReduce distributed computation.
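The scoring step of module (1) can be sketched on a single machine as follows. This is an illustration only: the record layout `(user, subject, action)`, the weight values in `ACTION_WEIGHTS`, and the final normalization are assumptions, not values given in the embodiment. The loop mirrors the map step (emit a weight per log record) and the reduce step (sum per user and subject) that a MapReduce job would distribute.

```python
from collections import defaultdict

# Assumed weights per operation type; the embodiment does not give values.
ACTION_WEIGHTS = {"click": 1.0, "download": 3.0, "favorite": 5.0}

def subject_preferences(log_records, action_weights=ACTION_WEIGHTS):
    """Return {user: {subject: preference}} from (user, subject, action) records."""
    scores = defaultdict(lambda: defaultdict(float))
    for user, subject, action in log_records:
        # "Map": weight of this operation; "reduce": sum per (user, subject).
        scores[user][subject] += action_weights.get(action, 0.0)
    prefs = {}
    for user, by_subject in scores.items():
        total = sum(by_subject.values())
        # Normalize so each user's preferences sum to 1 (an assumption).
        prefs[user] = {s: v / total for s, v in by_subject.items()} if total else {}
    return prefs
```

For example, a user with one click and one download on one subject and one click on another ends up with preferences of 0.8 and 0.2 under the assumed weights.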

(2) Query expansion module: when the user submits a retrieval request, the system uses the ICTCLAS word segmenter to segment the query statement. The degree of relevance between each query term and the subject terms is calculated using the Jensen-Shannon divergence measure, and the subject terms with high relevance are used for query expansion to build a new retrieval vector. The candidate terms can also be presented to the user as prompts, helping the user express his or her retrieval need more clearly.
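The Jensen-Shannon comparison in module (2) can be sketched as below. The embodiment does not say how a query term or a subject term is turned into a probability distribution, so this sketch simply assumes that each is already given as a distribution over the same set of outcomes (for instance, documents or co-occurring words), and treats lower divergence as higher relevance. The names `js_divergence` and `expand_query` are hypothetical.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence; with base-2 logs it lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def expand_query(term_dist, subject_dists, top_k=2):
    """Return the top_k subject terms closest to the query term."""
    ranked = sorted(subject_dists,
                    key=lambda s: js_divergence(term_dist, subject_dists[s]))
    return ranked[:top_k]
```

Unlike the plain Kullback-Leibler divergence, the Jensen-Shannon divergence is symmetric and stays finite when one distribution assigns zero probability to an outcome, which makes it convenient for comparing sparse term distributions.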

(3) Retrieval ranking module: the system provides multiple ranking interfaces. In the comprehensive ranking, the relevance between the document and the search terms serves as the basis of the ranking. The user's subject preference is taken into account and expressed as a subject preference vector. The distance between the document's vector in subject space and the user's subject preference vector is calculated and added to the overall ranking score as the document's weighted score on user preference. In addition, a quality score is calculated for each document: the document's citation frequency, download frequency, journal impact factor, and similar indicators are normalized and multiplied by their respective weights to obtain the quality score, which is then added to the overall ranking score with a certain weight.
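The combined ranking score of module (3) can be sketched as follows. Several details are assumptions made for illustration: cosine similarity stands in for the vector distance, min-max scaling stands in for the normalization, and the weights `w_pref`, `w_quality`, and `quality_weights` as well as the dictionary field names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def min_max(values):
    """Scale values to [0, 1]; all-equal inputs map to 0.0."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def rank(docs, user_pref, w_pref=0.3, w_quality=0.2,
         quality_weights=(0.4, 0.3, 0.3)):
    """Return document ids ordered by relevance + preference + quality."""
    cites = min_max([d["citations"] for d in docs])
    downloads = min_max([d["downloads"] for d in docs])
    impacts = min_max([d["impact_factor"] for d in docs])
    wc, wd, wi = quality_weights
    scored = []
    for d, c, dl, im in zip(docs, cites, downloads, impacts):
        quality = wc * c + wd * dl + wi * im
        preference = cosine(d["subjects"], user_pref)
        total = d["relevance"] + w_pref * preference + w_quality * quality
        scored.append((total, d["id"]))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]
```

Here `d["relevance"]` stands for the base relevance score returned by the underlying retrieval engine. With equal base relevance and equal quality indicators, the document whose subject vector is closer to the user's preference vector is ranked first.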

(4) Feedback retrieval module: in the subject-based relevance feedback module, the user is prompted to mark which documents in the first-round ranking results are relevant. Unmarked documents are collected from the result pages the user has browsed and serve as the initial non-relevant documents. Based on the probabilistic results of the user log analysis, documents of uncertain relevance are then filtered out of this set. With the relevant and non-relevant documents selected, feedback query expansion is performed using the above algorithm and secondary feedback retrieval is carried out, further focusing on the retrieval results the user most wants.

The above description illustrates and describes several preferred embodiments of the present invention. As stated above, however, it should be understood that the invention is not limited to the forms disclosed herein, which should not be regarded as excluding other embodiments. The invention can be used in various other combinations, modifications, and environments, and can be altered within the scope of the inventive concept described herein through the above teachings or the skill or knowledge of the related art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the protection scope of the appended claims.

Claims

1. A preference-based intelligent retrieval method, characterized in that the method comprises the steps of:

2. The method according to claim 1, characterized in that, in step S1, establishing the subject preference model of the user comprises the steps of:

establishing a subject vector space according to the subject classification;

3. The method according to claim 1, characterized in that, in step S2, performing the query expansion comprises the steps of:

4. The method according to claim 1, characterized in that, in step S3, the personalized retrieval ranking comprises the steps of:

calculating a quality score for each result;

5. The method according to claim 1, characterized in that, in step S4, the secondary feedback retrieval comprises the steps of:

6. A preference-based intelligent retrieval system, characterized in that the system comprises:

7. The system according to claim 6, characterized in that the user subject preference identification module further comprises:

8. The system according to claim 6, characterized in that the query expansion module further comprises:

9. The system according to claim 6, characterized in that the retrieval ranking module further comprises:

10. The system according to claim 6, characterized in that the feedback retrieval module further comprises: