CN103425710A

CN103425710A - Subject-based searching method and device

Info

Publication number: CN103425710A
Application number: CN2012101668529A
Authority: CN
Inventors: 方高林; 王海峰
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2012-05-25
Filing date: 2012-05-25
Publication date: 2013-12-04

Abstract

The invention provides a subject-based searching method and device. The method comprises: performing subject analysis on each document in a document library by using a subject analysis model to determine the subject distribution corresponding to each document; performing subject analysis on a query input by a user by using the subject analysis model to determine the subject distribution corresponding to the query; calculating the subject matching degree between the query and each document by using the subject distribution corresponding to the query and the subject distribution corresponding to each document; and obtaining the matching degree between the query and each document by using the subject matching degree, and determining the search results according to the matching degree between the query and each document. Because subject matching is adopted instead of keyword matching, a document can still be recalled even if its wording is not fully consistent with the user's query or it does not match redundant terms in the query, and the search results match the query in subject to the greatest extent, so that search recall and accuracy are improved.

Description

A searching method and device based on theme

[technical field]

The present invention relates to the field of computer technology, and in particular to a searching method and device based on theme.

[background technology]

With the development of computer network technology, search engines have become an important means for people to obtain information. A user inputs a search term (query) into a search engine, and the search engine searches the crawled documents for those relevant to the query and ranks them by degree of relevance. The widely adopted search model is the vector space model, whose basic idea is to represent the query and each document as term vectors. The weight of each vector component can be the term frequency (TF) or the term frequency-inverse document frequency (TF-IDF), and the similarity between the term vector of the query and the term vector of the document is then calculated as the measure of relevance. Various variants exist in practical applications, but in essence all of them find the keywords matched between the query and the document and assign them weights.
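By way of non-limiting illustration, the keyword-based vector space model described above can be sketched in Python as follows. The smoothed IDF formula, the tokenized sample documents and all function names are assumptions made for this sketch and are not taken from any particular search engine.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one TF-IDF weight dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF; the "+ 1.0" is an assumption of this sketch.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical, already-segmented documents and query.
docs = [["lincoln", "streamlined", "design"],
        ["good-looking", "lincoln", "photos"]]
vecs, idf = tfidf_vectors(docs)
query = ["good-looking", "lincoln"]
qvec = {t: c * idf.get(t, 0.0) for t, c in Counter(query).items()}
scores = [cosine(qvec, v) for v in vecs]
```

In this sketch the first document scores lower than the second merely because it lacks the literal keyword, which is the limitation discussed in the following paragraphs.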

In other words, traditional search is based on keyword matching, which leads to the following two situations:

First, when the same keywords do not occur or occur only rarely, a document whose actual content reflects the need behind the query is difficult to rank near the top and may not be recalled at all. For example, a user inputs the query "good-looking Lincoln". Some documents contain "Lincoln with streamlined design" or "white, black or red Lincoln", which also relate to the appearance of the Lincoln, but because the keyword "good-looking" does not occur they may not be recalled or may not rank near the top, even though these documents are in fact what the user wants to find, namely documents about the attractive appearance of the Lincoln.

Second, some queries contain redundant expressions, so that documents that do not match the redundant words cannot be recalled or cannot rank near the top. For example, a user inputs the query "after divorce, what to do if I still hate my ex-husband", in which "divorce" is a redundant word. A document whose content is a solution for still hating one's ex-husband cannot be recalled or cannot rank near the top because it does not match the keyword "divorce".

[summary of the invention]

In view of this, the invention provides a searching method and device based on theme, so as to improve search recall rate and accuracy.

The specific technical solution is as follows:

A searching method based on theme, in which a subject analysis model is used to carry out subject analysis on each document in a document library to determine the theme distribution corresponding to each document, the method further comprising:

S1, carrying out subject analysis on the query input by the user by using the subject analysis model, and determining the theme distribution corresponding to the query;

S2, calculating the theme matching degree of the query and each document by using the theme distribution corresponding to the query and the theme distribution corresponding to each document;

S3, obtaining the matching degree of the query and each document by using the theme matching degree, and determining the search results of the query according to the matching degree of the query and each document.

According to a preferred embodiment of the present invention, the subject analysis model comprises: the descriptors corresponding to each theme and the weight of each descriptor in the theme to which it belongs.

According to a preferred embodiment of the present invention, the subject analysis model is a probabilistic latent semantic analysis (PLSA) model or a latent Dirichlet allocation (LDA) model.

According to a preferred embodiment of the present invention, the descriptors comprise at least one of the following: domain words, binary combinations of polysemous words, entity words, or combinations of entity words.

According to a preferred embodiment of the present invention, in the training process of the subject analysis model, when descriptors are extracted from documents, the semantically related words of each descriptor are further extracted in a statistics-based manner, and the semantically related words of each descriptor are used as one document in the corpus to participate in the training of the subject analysis model.

According to a preferred embodiment of the present invention, carrying out subject analysis on each document in the document library by using the subject analysis model and determining the theme distribution corresponding to each document specifically comprises:

A1, carrying out word segmentation on each document in the document library, selecting from the word segmentation result the words that meet a statistical requirement as descriptors, and obtaining the descriptor set of each document;

A2, inputting the descriptor set corresponding to each document into the subject analysis model, and obtaining the themes to which each document is mapped and the probability of being mapped to each theme.

According to a preferred embodiment of the present invention, selecting from the word segmentation result the words that meet the statistical requirement as descriptors comprises:

clustering the word segmentation result, and selecting, from the categories whose number of contained words meets a preset quantity requirement, the words whose occurrence counts rank in the top M as descriptors, M being a positive integer.

According to a preferred embodiment of the present invention, step S1 specifically comprises:

S11, carrying out word segmentation on the query input by the user, selecting from the word segmentation result the words that meet an inverse document frequency (IDF) statistical requirement as descriptors, and forming the descriptor set corresponding to the query from the selected descriptors; or further carrying out related-word expansion on the selected descriptors, and forming the descriptor set corresponding to the query from the selected descriptors and the related words obtained by the expansion;

S12, inputting the descriptor set corresponding to the query into the subject analysis model, and obtaining the themes to which the query is mapped and the probability of being mapped to each theme.
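As a non-limiting illustration of step S11, the following Python sketch selects query descriptors by an IDF threshold and optionally expands them with related words. The threshold value, the IDF table, the related-word table and the function name `query_descriptors` are hypothetical. Treating words missing from the IDF table as having an IDF of zero is also an assumption of this sketch.

```python
def query_descriptors(tokens, idf, min_idf=2.0, related=None):
    """Keep segmented query words whose IDF meets the threshold,
    then optionally append related words of the selected descriptors."""
    selected = [t for t in tokens if idf.get(t, 0.0) >= min_idf]
    expanded = list(selected)
    for t in selected:
        for r in (related or {}).get(t, []):
            if r not in expanded:
                expanded.append(r)
    return expanded

# Hypothetical IDF values and related-word table.
idf = {"good-looking": 3.1, "lincoln": 5.2, "the": 0.1}
related = {"good-looking": ["appearance"]}
descriptors = query_descriptors(["the", "good-looking", "lincoln"], idf, related=related)
```

The resulting descriptor set would then be input into the subject analysis model as in step S12.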

According to a preferred embodiment of the present invention, calculating the theme matching degree of the query and each document comprises:

obtaining the theme matching degree of the query and each document by calculating the cosine similarity or the relative entropy (KL distance) between the theme distribution corresponding to the query and the theme distribution corresponding to each document.
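As a non-limiting illustration, the two measures named above can be sketched in Python over theme distributions held as dictionaries. The epsilon smoothing used to keep the relative entropy finite, and the sample distributions, are assumptions of this sketch. Note that the cosine similarity grows with the degree of matching, whereas the KL distance shrinks, so a system using the latter would need to invert or negate it, and how that is done is not specified here.

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two theme distributions (dict: theme -> probability)."""
    themes = set(p) | set(q)
    dot = sum(p.get(t, 0.0) * q.get(t, 0.0) for t in themes)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def kl_divergence(p, q, eps=1e-9):
    """Relative entropy KL(p || q) over the union of themes.
    The eps smoothing is an assumption that avoids log(0)."""
    themes = set(p) | set(q)
    ps = {t: p.get(t, 0.0) + eps for t in themes}
    qs = {t: q.get(t, 0.0) + eps for t in themes}
    zp, zq = sum(ps.values()), sum(qs.values())
    return sum((ps[t] / zp) * math.log((ps[t] / zp) / (qs[t] / zq)) for t in themes)

# Hypothetical theme distributions for a query and two documents.
query_dist = {"topic1": 0.7, "topic2": 0.3}
doc_a = {"topic1": 0.6, "topic2": 0.4}
doc_b = {"topic3": 0.9, "topic1": 0.1}
```

With these sample values, `doc_a` is closer to the query than `doc_b` under both measures.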

According to a preferred embodiment of the present invention, the method further comprises: determining at least one of the demand feature, time attribute feature, regional attribute feature, entity attribute feature and semantic dependency relation feature of each document;

step S1 further comprises: determining at least one of the demand feature, time attribute feature, regional attribute feature, entity attribute feature and semantic dependency relation feature of the query;

step S2 further comprises at least one of the following matching degree calculations: calculating the demand matching degree between the query and each document by using the demand feature of the query and the demand feature of each document; calculating the time attribute matching degree between the query and each document by using the time attribute feature of the query and the time attribute feature of each document; calculating the regional attribute matching degree between the query and each document by using the regional attribute feature of the query and the regional attribute feature of each document; calculating the entity attribute matching degree between the query and each document by using the entity attribute feature of the query and the entity attribute feature of each document; or calculating the semantic dependency relation matching degree between the query and each document by using the semantic dependency relation feature of the query and the semantic dependency relation feature of each document;

in step S3, the matching degree of the query and each document is obtained further in combination with at least one of the demand matching degree, time attribute matching degree, regional attribute matching degree, entity attribute matching degree and semantic dependency relation matching degree between the query and each document.

According to a preferred embodiment of the present invention, the matching degree Score(q, d) of the query and each document is obtained according to the following formula:

Score(q, d) = Sem(q, d) [λ₁ Score1 + λ₂ Score2]

where Score1 is the theme matching degree of the query and the document, Score2 is the demand matching degree of the query and the document, Sem(q, d) represents at least one of the time attribute matching degree, regional attribute matching degree, entity attribute matching degree and semantic dependency relation matching degree of the query and the document, and λ₁ and λ₂ are preset weighting parameters.
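As a non-limiting illustration, the formula above can be sketched in Python as follows. The text does not specify how several attribute matching degrees are merged into Sem(q, d), so taking their product is purely an assumption of this sketch, as are the default weights and the function names.

```python
from math import prod

def sem_factor(attribute_matches):
    """Hypothetical aggregation of the supplied attribute matching degrees
    (time, region, entity, dependency) into Sem(q, d): their product."""
    return prod(attribute_matches.values()) if attribute_matches else 1.0

def combined_score(theme_match, demand_match, sem, lam1=0.7, lam2=0.3):
    """Score(q, d) = Sem(q, d) * (lam1 * Score1 + lam2 * Score2)."""
    return sem * (lam1 * theme_match + lam2 * demand_match)

# Hypothetical matching degrees for one query-document pair.
sem = sem_factor({"time": 1.0, "region": 0.5})
score = combined_score(theme_match=0.8, demand_match=0.5, sem=sem)
```

Because Sem(q, d) multiplies the weighted sum, a document that fails an attribute match entirely (a factor of zero) is suppressed regardless of its theme matching degree.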

According to a preferred embodiment of the present invention, determining the search results of the query according to the matching degree of the query and each document in step S3 comprises:

including in the search results the documents whose matching degree with the query meets a preset matching degree requirement, and determining the ranking of each document in the search results according to its matching degree with the query.

The present invention also provides a searching device based on theme, the device comprising:

a first subject analysis unit, configured to carry out subject analysis on each document in the document library by using the subject analysis model and determine the theme distribution corresponding to each document;

a second subject analysis unit, configured to carry out subject analysis on the query input by the user by using the subject analysis model and determine the theme distribution corresponding to the query;

a theme matching degree computing unit, configured to calculate the theme matching degree of the query and each document by using the theme distribution corresponding to the query and the theme distribution corresponding to each document;

a search result determination unit, configured to obtain the matching degree of the query and each document by using the theme matching degree, and determine the search results of the query according to the matching degree of the query and each document.

According to a preferred embodiment of the present invention, the device further comprises: a subject model training unit, configured to train the subject analysis model and, in the training process, when descriptors are extracted from documents, further extract the semantically related words of each descriptor in a statistics-based manner and use the semantically related words of each descriptor as one document in the corpus to participate in the training of the subject analysis model.

According to a preferred embodiment of the present invention, the first subject analysis unit specifically comprises:

a first descriptor determination subunit, configured to carry out word segmentation on each document in the document library, select from the word segmentation result the words that meet a statistical requirement as descriptors, and obtain the descriptor set of each document;

a first distribution determination subunit, configured to input the descriptor set corresponding to each document into the subject analysis model and obtain the themes to which each document is mapped and the probability of being mapped to each theme.

According to a preferred embodiment of the present invention, when selecting from the word segmentation result the words that meet the statistical requirement as descriptors, the first descriptor determination subunit clusters the word segmentation result and selects, from the categories whose number of contained words meets a preset quantity requirement, the words whose occurrence counts rank in the top M as descriptors, M being a positive integer.

According to a preferred embodiment of the present invention, the second subject analysis unit specifically comprises:

a second descriptor determination subunit, configured to carry out word segmentation on the query input by the user, select from the word segmentation result the words that meet an inverse document frequency (IDF) statistical requirement as descriptors, and form the descriptor set corresponding to the query from the selected descriptors; or further carry out related-word expansion on the selected descriptors, and form the descriptor set corresponding to the query from the selected descriptors and the related words obtained by the expansion;

a second distribution determination subunit, configured to input the descriptor set corresponding to the query into the subject analysis model and obtain the themes to which the query is mapped and the probability of being mapped to each theme.

According to a preferred embodiment of the present invention, the theme matching degree computing unit obtains the theme matching degree of the query and each document by calculating the cosine similarity or the relative entropy (KL distance) between the theme distribution corresponding to the query and the theme distribution corresponding to each document.

According to a preferred embodiment of the present invention, the device further comprises:

a first additional feature determination unit, configured to determine at least one of the demand feature, time attribute feature, regional attribute feature, entity attribute feature and semantic dependency relation feature of each document;

a second additional feature determination unit, configured to determine at least one of the demand feature, time attribute feature, regional attribute feature, entity attribute feature and semantic dependency relation feature of the query;

an additional matching degree computing unit, configured to carry out at least one of the following matching degree calculations: calculating the demand matching degree between the query and each document by using the demand feature of the query and the demand feature of each document; calculating the time attribute matching degree between the query and each document by using the time attribute feature of the query and the time attribute feature of each document; calculating the regional attribute matching degree between the query and each document by using the regional attribute feature of the query and the regional attribute feature of each document; calculating the entity attribute matching degree between the query and each document by using the entity attribute feature of the query and the entity attribute feature of each document; or calculating the semantic dependency relation matching degree between the query and each document by using the semantic dependency relation feature of the query and the semantic dependency relation feature of each document;

the search result determination unit obtains the matching degree of the query and each document further in combination with at least one of the demand matching degree, time attribute matching degree, regional attribute matching degree, entity attribute matching degree and semantic dependency relation matching degree between the query and each document.

According to a preferred embodiment of the present invention, the search result determination unit obtains the matching degree Score(q, d) of the query and each document according to the following formula:

Score(q, d) = Sem(q, d) [λ₁ Score1 + λ₂ Score2]

According to a preferred embodiment of the present invention, the search result determination unit specifically includes in the search results the documents whose matching degree with the query meets a preset matching degree requirement, and determines the ranking of each document in the search results according to its matching degree with the query.

As can be seen from the above technical solutions, the present invention carries out subject analysis on the documents and on the query respectively, determines the theme distribution corresponding to each document in the document library and the theme distribution corresponding to the query, and determines the search results of the query by calculating the theme matching degree between the query and each document. In other words, the present invention adopts theme matching rather than keyword matching and returns to the user search results that match the theme being expressed. A document can be recalled even if its wording is not fully consistent with the user's query or it does not match the redundant words in the query, and the search results match the query on theme to the greatest extent, thereby improving search recall rate and accuracy.

[brief description of the drawings]

Fig. 1 is the main flowchart of the searching method based on theme provided by Embodiment 1 of the present invention;

Fig. 2 is the detailed flowchart of the searching method based on theme provided by Embodiment 3 of the present invention;

Fig. 3 is the structural diagram of the searching device provided by Embodiment 4 of the present invention;

Fig. 4 is a schematic diagram of the matching degree calculation involved in the embodiments of the present invention.

[detailed description of the embodiments]

In order to make the purpose, technical solutions and advantages of the present invention clearer, the present invention is described below with reference to the drawings and specific embodiments.

Embodiment 1

Fig. 1 is the main flowchart of the searching method based on theme provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method can comprise the following steps:

Step 101: carrying out subject analysis on the query input by the user by using the subject analysis model and determining the theme distribution corresponding to the query, and carrying out subject analysis on each document in the document library by using the subject analysis model and determining the theme distribution corresponding to each document.

The subject analysis model involved in this step is established in advance and comprises the descriptors contained in each theme and the weight of each descriptor in the theme to which it belongs. This subject analysis model can be used to determine the theme distribution corresponding to the query and the theme distribution corresponding to each document. The establishment and content of the subject analysis model will be described in detail in Embodiment 2. The process of carrying out subject analysis on the query and the documents will be described in detail in Embodiment 3.

Step 102: calculating the theme matching degree of the query and each document by using the theme distribution corresponding to the query and the theme distribution corresponding to each document.

This step in fact uses the theme distributions of the query and of each document to determine the matching degree of the query and each document on theme, which serves as the basis for the subsequent selection of search results.

Step 103: obtaining the matching degree of the query and each document by using the theme matching degree, and determining the search results of the query according to the matching degree.

In this step, in addition to the theme matching degree, the matching degree of the query and each document can further be obtained in combination with at least one of the demand matching degree, time attribute matching degree, regional attribute matching degree, entity attribute matching degree and semantic dependency relation matching degree between the query and each document. This process will be described in detail in Embodiment 3.

When the search results of the query are determined according to the matching degree, the documents whose matching degree with the query meets a preset matching degree requirement can be included in the search results, and the ranking of the documents in the search results can be determined according to their matching degree with the query.
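As a non-limiting illustration, the selection and ranking just described can be sketched in Python. Interpreting the "preset matching degree requirement" as a minimum score is an assumption of this sketch, as are the threshold value and the function name.

```python
def select_results(match_scores, min_score=0.3):
    """Keep documents whose matching degree reaches the threshold,
    ordered from the highest matching degree to the lowest."""
    kept = [(doc_id, s) for doc_id, s in match_scores.items() if s >= min_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Hypothetical matching degrees for three documents.
results = select_results({"d1": 0.9, "d2": 0.1, "d3": 0.5})
```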

Embodiment 2

In the embodiments of the present invention, the subject analysis model can be a model that describes theme probabilities, including but not limited to: the probabilistic latent semantic analysis (PLSA) model, the latent Dirichlet allocation (LDA) model, etc.

LSA (latent semantic analysis) is a method that uses mathematical and statistical techniques to extract the lexical items in documents, infer the semantic relations between them, build a semantic index, and organize the documents into a semantic space; that is, lexical items with strong semantic association are mapped to the same theme. PLSA builds on the latent semantic indexing of LSA and uses a probabilistic model to describe the relations between documents and latent semantics and between latent semantics and lexical items. The so-called latent semantics are the themes referred to in the embodiments of the present invention.

LDA is an unsupervised machine learning technique used to identify theme information hidden in large-scale document sets or corpora. It adopts the bag-of-words approach: each document is represented as a probability distribution over some themes, and each theme is represented as a probability distribution over many words.

Because LSA and PLSA are both mature prior art, they are not described in further detail here. In other words, theme probability models such as PLSA or LDA can map each lexical item onto the corresponding theme, so that each theme is formed of lexical items; the model can further include the probability of each lexical item in the theme to which it belongs, this probability being obtained from word frequency statistics.

The lexical items in traditional PLSA and LDA are single words. However, many words have multiple meanings, which can make the representation ambiguous, and this kind of ambiguity is difficult to resolve within the subject analysis model. In addition, because the threshold for merging different themes during training is difficult to determine, the trained subject analysis model may contain similar themes. In view of this, the embodiments of the present invention can represent themes in a disambiguated manner, and the lexical items can take the following forms:

The first form, domain words: words of a particular professional domain that basically do not occur in other domains; such words have strong discriminative power. For example, words such as {application program, thread, compiler} belong to the same domain as words such as {computer, programming}. Domain words are usually unigrams.

The second form, binary combinations of polysemous words: a word that may be ambiguous, i.e. a polysemous word, can be disambiguated by combining it with another unigram to form a binary combination. For example, "apple" is a polysemous word, and its senses can be distinguished by combining it into "apple#mobile phone", "apple#computer", "apple#fruit", etc., where # denotes the binary combination relation.

The third form, entity words or combinations of entity words: the domain specificity of entity words lies between that of domain words and general words, and an entity word usually belongs to several domains. Because the discriminative power of a single entity word may be low, a combination of two entity words can be used as the form of expression.
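As a non-limiting illustration of the second form, the following Python sketch rewrites a polysemous word into "word#context" binary combinations when a known context word co-occurs in the same token list. The context table and the function name `disambiguate` are hypothetical, and the text does not prescribe how the partner words are chosen.

```python
def disambiguate(tokens, polysemous, context_terms):
    """Replace a polysemous token by 'token#context' items for every
    known context word present in the same token list."""
    present = set(tokens)
    items = []
    for tok in tokens:
        if tok in polysemous:
            partners = [c for c in context_terms.get(tok, []) if c in present]
            if partners:
                items.extend(f"{tok}#{c}" for c in partners)
                continue
        items.append(tok)  # unambiguous word, or no context word found
    return items

# Hypothetical context table for the polysemous word "apple".
context = {"apple": ["mobile phone", "computer", "fruit"]}
lexical_items = disambiguate(["apple", "mobile phone", "price"], {"apple"}, context)
```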

As described above, theme probability models such as PLSA or LDA may fail to converge on themes and may produce similar themes because of lexical item ambiguity. Analysis also suggests that treating document co-occurrence as the basis of semantic relatedness is what introduces the noise. To further address this problem, the semantically related words of each lexical item can be extracted by a statistics-based method, and all the semantically related words corresponding to a lexical item are used as one document in the corpus to participate in the training of the subject analysis model, which greatly reduces the influence of noise and ambiguity and improves the accuracy of subject analysis.

For example, the related words extracted by statistics for the lexical item "notebook computer" include {notebook, laptop, notebook battery, dual-core, wireless network card, thinkpad, processor, graphics card}; these related words are used as one document and trained by PLSA or LDA to obtain the corresponding theme. As another example, the related words extracted by statistics for "apple#computer" include {apple, computer, imac, desktop, desktop computer}; these related words are used as one document and trained by PLSA or LDA to obtain the corresponding theme.
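As a non-limiting illustration, the following Python sketch builds one pseudo-document of related words per lexical item. The text only states that the extraction is statistics-based, so the use of raw co-occurrence counts, the `top_n` cutoff and the function name are assumptions of this sketch. The PLSA or LDA training itself is not shown.

```python
from collections import Counter, defaultdict

def build_pseudo_documents(corpus, top_n=3):
    """For each lexical item, collect the top_n words that co-occur with it
    most often across the corpus; each list is one training pseudo-document."""
    cooc = defaultdict(Counter)
    for doc in corpus:
        terms = set(doc)
        for t in terms:
            for u in terms:
                if u != t:
                    cooc[t][u] += 1
    return {t: [u for u, _ in counts.most_common(top_n)]
            for t, counts in cooc.items()}

# Hypothetical segmented corpus.
corpus = [["notebook computer", "battery", "processor"],
          ["notebook computer", "battery"],
          ["notebook computer", "battery", "graphics card"]]
pseudo_docs = build_pseudo_documents(corpus, top_n=1)
```

Each value of `pseudo_docs` would be added to the training corpus as a separate document.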

The subject analysis model finally established by this embodiment comprises: the lexical items (i.e. descriptors) corresponding to each theme and the weight of each lexical item in the theme to which it belongs.

Embodiment 3

Fig. 2 is the detailed flowchart of the searching method based on theme provided by Embodiment 3 of the present invention. As shown in Fig. 2, the flow specifically comprises the following steps:

Step 201: carrying out descriptor analysis on each document in the document library to obtain the descriptor set of each document.

The process of carrying out descriptor analysis on a document first performs word segmentation on the document; descriptors can then be selected based on TF or TF-IDF, i.e. the words whose TF or TF-IDF meets the requirement are selected as descriptors. This approach generally performs well, but for documents whose wording is relatively scattered, the word frequency statistics show no obvious features. In addition, in some cheating documents the cheater piles up words irrelevant to the theme of the text, so relying purely on word frequency information clearly reflects the theme inaccurately. Therefore, the embodiments of the present invention provide a preferred implementation: the words obtained after word segmentation of the document are first clustered, and the descriptors of the document are then extracted according to the number of words contained in each category and the occurrence count of each word within its category, forming the descriptor set of the document.

The present invention does not limit the manner in which the words are clustered. When descriptors are extracted after clustering, the words whose occurrence counts rank in the top M can be selected as descriptors from the categories whose number of contained words meets a preset quantity requirement, M being a positive integer. For example, clustering yields several categories, some containing more words and some fewer. Categories whose number of contained words does not meet the preset quantity requirement are discarded; from the categories that do meet it, the words whose occurrence counts rank in the top M can be selected as descriptors.
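As a non-limiting illustration, the cluster-based selection can be sketched in Python. Since the clustering method is left open, the sketch takes a precomputed word-to-cluster mapping as input. That mapping, the minimum cluster size, the application of the top-M rule per cluster and the function name are all assumptions of this sketch.

```python
from collections import Counter, defaultdict

def select_descriptors(tokens, cluster_of, min_cluster_size=2, top_m=2):
    """Drop clusters with too few distinct words, then take the top_m most
    frequent words from each remaining cluster as descriptors."""
    counts = Counter(tokens)
    clusters = defaultdict(set)
    for tok in counts:
        if tok in cluster_of:
            clusters[cluster_of[tok]].add(tok)
    descriptors = []
    for words in clusters.values():
        if len(words) < min_cluster_size:
            continue  # cluster does not meet the quantity requirement
        ranked = sorted(words, key=lambda w: (-counts[w], w))
        descriptors.extend(ranked[:top_m])
    return sorted(descriptors)

# Hypothetical segmented document with one piled-up, off-theme word.
tokens = ["computer", "display", "computer", "notebook",
          "display", "computer", "lottery"]
cluster_of = {"computer": 0, "display": 0, "notebook": 0, "lottery": 1}
descriptor_set = select_descriptors(tokens, cluster_of)
```

In this sample the off-theme word falls into a cluster of its own, which is discarded, so it does not become a descriptor.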

Step 202: inputting the descriptor set corresponding to each document into the subject analysis model to obtain the theme distribution corresponding to each document.

This step is in fact the process of carrying out semantic analysis on each document. After the descriptor set corresponding to a document is input into a subject analysis model such as a PLSA model or an LDA model, the theme distribution of the document is obtained, i.e. the themes to which the document is mapped and the probability of being mapped to each theme.

For example, the descriptor set of a certain document is {computer, display, widescreen, notebook}. After analysis with the existing subject analysis model, the probability that the document is mapped to theme topic1 is 0.5, the probability that it is mapped to theme topic2 is 0.2, and the probability that it is mapped to theme topic3 is 0.1.
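As a non-limiting illustration, a model stored as "descriptors per theme with weights" can be used to map a descriptor set onto themes in a greatly simplified way: sum the weights of the matching descriptors under each theme and normalize. This is not PLSA or LDA inference, which would estimate the distribution iteratively; the model contents and the function name below are hypothetical and chosen only to show the shape of the input and output.

```python
def theme_distribution(descriptors, model):
    """Simplified mapping: score each theme by the summed weights of the
    document's descriptors under that theme, then normalize to sum to 1."""
    raw = {theme: sum(weights.get(d, 0.0) for d in descriptors)
           for theme, weights in model.items()}
    total = sum(raw.values())
    if not total:
        return {}
    return {theme: s / total for theme, s in raw.items() if s > 0}

# Hypothetical model: theme -> {descriptor: weight}.
model = {
    "topic1": {"computer": 0.5, "notebook": 0.3},
    "topic2": {"display": 0.2},
    "topic3": {"fruit": 0.9},
}
dist = theme_distribution(["computer", "display", "widescreen", "notebook"], model)
```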

In addition, in the process of semantic analysis, besides calculating the theme distribution corresponding to each document, at least one of the demand feature, time attribute feature, regional attribute feature, entity attribute feature and semantic dependency relation feature of each document can also be analyzed.

The demand feature of a document is analyzed on the basis of a demand analysis model established in advance, which comprises the demand words contained in each demand type and the weights of the demand words. The demand analysis model is established in a manner similar to the subject analysis model, except that the corpus is based on demand words, so it is not described in detail here.

The demand words in the document are extracted; this extraction can be based on a dictionary or on templates and can use existing techniques, which are not described again here. After the demand word set corresponding to each document is obtained, it is input into the demand analysis model to obtain the demand distribution of each document, i.e. the demand types to which each document is mapped and the probability of being mapped to each demand type.

The time attribute feature and the regional attribute feature of a document can be analyzed on the basis of static attributes of the document, such as its publication time or the region in which it was produced, or on the basis of its content, for example by extracting time information or region information from the document.

The entity attribute feature of a document is analyzed on the basis of its content: the content of the document is analyzed and structured information of the form "entity - entity attribute - attribute value" is extracted from it, for example the structured information shown in Table 1. This extraction can be carried out based on templates and can use existing techniques, which are not described again here.

Table 1

Entity word	Entity word attribute	Attribute value
Baidu	Address	No. 10, Shangdi 10th Street, Haidian District, Beijing
Baidu Company	Phone	59928888

The semantic dependency relation feature of a document is analyzed by matching preset dependency templates against the semantic chunks in the document, thereby determining the semantic chunks that have a semantic dependency relation and the corresponding relation. A semantic chunk can be represented as a binary combination of words. The types of semantic dependency relation include but are not limited to the qualified relation and the donor and accepter relationship. A qualified relation is one in which one word qualifies (restricts) another, such as "Yao Ming's height" or "Super Rabbit usage". A donor and accepter relationship is one in which one word is the agent and the other is the patient (the receiver of the action), and the order of the two cannot be reversed. For example, in "Li Na defeats Sharapova", "Li Na, Sharapova" is a donor and accepter relationship, and in "Euro exchange dollar", "Euro, dollar" is a donor and accepter relationship.

The above dependency templates can be obtained by mining search logs or can be defined manually. For example, templates for the qualified relation can include "[basketball star] height" and "[software] usage", and templates for the donor and accepter relationship can include "[noun] verb [noun]".
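As a rough illustration of template matching, the following Python sketch applies the example templates to a piece of text. The category lexicon, the part-of-speech tags and the function names are assumptions made for this example, and the substring test is deliberately crude; the text does not specify how templates are represented or matched.

```python
# Hypothetical category lexicon; a real system might mine it from search logs.
CATEGORY_LEXICON = {
    "basketball star": ["Yao Ming"],
    "software": ["Super Rabbit"],
}

# Qualified-relation templates of the form "[category] head word".
QUALIFIED_TEMPLATES = [("basketball star", "height"), ("software", "usage")]


def match_qualified(text):
    """Return ((modifier, head), relation) pairs for templates found in text."""
    matches = []
    for category, head in QUALIFIED_TEMPLATES:
        for entity in CATEGORY_LEXICON[category]:
            # Crude check: both parts of the template occur in the text.
            if entity in text and head in text:
                matches.append(((entity, head), "qualified"))
    return matches


def match_donor_accepter(tagged_tokens):
    """Match the "[noun] verb [noun]" template on (word, tag) pairs.

    Each semantic chunk is an ordered (agent, patient) pair.
    """
    matches = []
    for i in range(len(tagged_tokens) - 2):
        (w1, t1), (_, t2), (w3, t3) = tagged_tokens[i:i + 3]
        if (t1, t2, t3) == ("noun", "verb", "noun"):
            matches.append(((w1, w3), "donor-accepter"))
    return matches


qualified = match_qualified("Yao Ming's height is how many")
donor_accepter = match_donor_accepter(
    [("Euro", "noun"), ("exchange", "verb"), ("dollar", "noun")]
)
print(qualified)
print(donor_accepter)
```

Each semantic chunk is returned as a binary combination of words, and the donor and accepter pair keeps its order because the relation is not symmetric.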

In addition, it should be noted that steps 201 and 202 above can be performed offline.

Step 203: Perform descriptor analysis on the query input by the user to obtain the descriptor set corresponding to the query.

In this step, words whose IDF meets a preset requirement can be selected from the word segmentation result of the query as descriptors. For example, the words whose IDF ranks in the top N1 are selected, where N1 is a preset positive integer, or the words whose IDF exceeds a preset IDF threshold are selected.

The determined descriptors can directly form the descriptor set. Preferably, related term expansion is further performed on the determined descriptors, and the related terms obtained by expansion together with the determined descriptors form the descriptor set. Related term expansion can be based on an existing related term dictionary. Such a dictionary can be mined on the following principle: if two words appear in the same context in the corpus, the two words are semantically related and are related terms of each other.
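A minimal Python sketch of step 203 follows. It assumes the query has already been segmented and that an IDF table and a related term dictionary are available. The IDF values, the dictionary contents and the function names are hypothetical and serve only to illustrate the two selection rules and the optional expansion.

```python
def select_descriptors(tokens, idf, n1=None, idf_threshold=None):
    """Pick descriptors from a segmented query by IDF.

    If idf_threshold is given, keep words whose IDF exceeds it;
    otherwise keep the n1 words with the highest IDF.
    """
    unique = list(dict.fromkeys(tokens))  # drop duplicates, keep order
    if idf_threshold is not None:
        return [w for w in unique if idf.get(w, 0.0) > idf_threshold]
    ranked = sorted(unique, key=lambda w: idf.get(w, 0.0), reverse=True)
    return ranked[:n1]


def expand_with_related_terms(descriptors, related_terms):
    """Append related terms from a related term dictionary, without duplicates."""
    expanded = list(descriptors)
    for word in descriptors:
        for related in related_terms.get(word, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


# Hypothetical IDF values and related term dictionary, for illustration only.
idf = {"which": 0.4, "Beijing": 3.1, "five-star": 4.2, "hotel": 2.6}
related_terms = {"hotel": ["inn"]}
tokens = ["which", "Beijing", "five-star", "hotel"]

top_n1 = select_descriptors(tokens, idf, n1=2)
descriptor_set = expand_with_related_terms(
    select_descriptors(tokens, idf, idf_threshold=2.0), related_terms
)
print(top_n1)
print(descriptor_set)
```

The top-N1 rule and the threshold rule are alternatives; the sketch applies the threshold rule whenever a threshold is supplied.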

Step 204: Input the descriptor set corresponding to the query into the subject analysis model to obtain the theme distribution corresponding to the query.

Likewise, this step is in effect a semantic analysis of the query. After the descriptor set corresponding to the query is input into a subject analysis model such as a PLSA model or an LDA model, the theme distribution of the query is obtained, that is, the themes to which the query is mapped and the probability of being mapped to each theme.

For example, the query input by the user is "Super Rabbit", and the result of subject analysis on "Super Rabbit" is: the probability of being mapped to Topic:226 is 0.851297, and the probability of being mapped to Topic:1288 is 0.120394. Topic:226 and Topic:1288 in the subject analysis model can be as shown in Embodiment 2.

In addition, during semantic analysis, besides calculating the theme distribution corresponding to the query, at least one of the demand feature, time attribute feature, regional attribute feature, entity attribute feature and semantic dependency relation feature of the query can also be analyzed.

The demand feature of the query is analyzed on the basis of the requirements analysis model established in advance. The demand words in the query are extracted first; this extraction can be based on a dictionary or on templates and can adopt the prior art, so it is not repeated here. The demand words of the query are then input into the requirements analysis model to obtain the demand distribution of the query, that is, the demand types to which the query is mapped and the probability of being mapped to each demand type.

The time attribute feature of the query is analyzed mainly on the basis of the content of the query: words related to time are extracted from the query as the time attribute feature. For example, if the user inputs the query "what major events occurred in 2008", then "2008" is the time attribute feature of this query.
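The text does not say how time-related words are recognized. As one possible approach, the Python sketch below extracts four-digit years with a regular expression; the pattern and the function name are assumptions for this example, and a real system would likely cover many more time expressions.

```python
import re

# Assumed pattern: four-digit years from 1900 to 2099.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_time_attributes(query):
    """Return the year-like tokens found in the query, in order of appearance."""
    return YEAR_PATTERN.findall(query)


time_features = extract_time_attributes("what major events occurred in 2008")
print(time_features)
```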

The regional attribute feature of the query can be analyzed on the basis of attributes of the user who inputs the query, for example by taking the geographic location corresponding to the user's IP address as the regional attribute feature. If the user who inputs the query is in Beijing, "Beijing" can be taken as the regional attribute feature of the query. It can also be analyzed on the basis of the content of the query, by extracting words with regional meaning from the query as the regional attribute feature. For example, if the user inputs the query "which five-star hotels are there in Beijing", "Beijing" is extracted as the regional attribute feature of this query.

The entity attribute feature of the query is analyzed on the basis of the content of the query: structured information of the form "entity-entity attribute" is extracted from the query. This extraction can be carried out on the basis of templates and can likewise adopt the prior art. For example, from the user input "what is Yao Ming's height", the extracted entity attribute feature is "Yao Ming-height". The purpose of extracting this entity attribute feature is that, in the subsequent matching against documents, if a document happens to have the same entity attribute feature, the document can directly supply the attribute value, so the matching degree between that document and the query can be raised.
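The following Python sketch shows one way the entity attribute feature of a query might be compared with the structured information extracted from a document. The function names are hypothetical, the sample triple is loosely based on Table 1, and the mismatch value of 0.8 is borrowed from the example given for step 206.

```python
def lookup_attribute_value(query_feature, document_triples):
    """Return the attribute value if the document holds the same entity attribute."""
    entity, attribute = query_feature
    for doc_entity, doc_attribute, value in document_triples:
        if (doc_entity, doc_attribute) == (entity, attribute):
            return value
    return None


def entity_attribute_matching_degree(query_feature, document_triples, mismatch=0.8):
    """1.0 when the entity attribute is present in the document, else `mismatch`."""
    found = lookup_attribute_value(query_feature, document_triples) is not None
    return 1.0 if found else mismatch


# Structured information of one document (illustrative values).
triples = [("Baidu", "Phone", "59928888")]

value = lookup_attribute_value(("Baidu", "Phone"), triples)
degree = entity_attribute_matching_degree(("Yao Ming", "height"), triples)
print(value, degree)
```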

The semantic dependency relation feature of the query is analyzed by matching preset dependency templates against the semantic chunks in the query, thereby determining the semantic chunks that have a semantic dependency relation and the corresponding relation. A semantic chunk can be represented as a binary combination of words. The types of semantic dependency relation include but are not limited to the qualified relation and the donor and accepter relationship. For example, if the user inputs the query "what is Yao Ming's height", "Yao Ming's height" is analyzed as a qualified relation; if the user inputs the query "Li Na defeats Sharapova", "Li Na, Sharapova" is analyzed as a donor and accepter relationship.

Step 205: Calculate the theme matching degree between the query and each document using the theme distribution corresponding to the query and the theme distribution corresponding to each document.

Because a theme distribution is in effect a vector comprising the themes mapped to and the probability on each of those themes, the theme matching degree between the query and each document can be calculated as the cosine similarity between the vectors. Alternatively, it can be calculated from the relative entropy distance (KL, Kullback-Leibler divergence) between the theme distribution corresponding to the query and the theme distribution corresponding to the document.
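The two measures named in step 205 can be sketched in Python as follows, with each theme distribution stored as a mapping from theme identifier to probability. The query distribution reuses the "Super Rabbit" figures from step 204; the document distribution, the smoothing constant `eps` and the function names are assumptions for this example. How a KL distance is turned into a matching degree is not stated in the text, so the sketch only computes the distance itself, where a smaller value means a closer match.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two sparse theme distributions."""
    dot = sum(p[t] * q[t] for t in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over the union of themes.

    Both distributions are smoothed with eps and renormalized so that
    themes missing from one side do not cause a division by zero.
    """
    themes = p.keys() | q.keys()
    ps = {t: p.get(t, 0.0) + eps for t in themes}
    qs = {t: q.get(t, 0.0) + eps for t in themes}
    zp, zq = sum(ps.values()), sum(qs.values())
    return sum(
        (ps[t] / zp) * math.log((ps[t] / zp) / (qs[t] / zq)) for t in themes
    )


query_themes = {"Topic:226": 0.851297, "Topic:1288": 0.120394}
# Hypothetical document distribution, for illustration only.
doc_themes = {"Topic:226": 0.6, "Topic:1288": 0.3, "Topic:17": 0.1}

cos = cosine_similarity(query_themes, doc_themes)
kl = kl_divergence(query_themes, doc_themes)
print(cos, kl)
```

Cosine similarity is symmetric and bounded, which makes it easy to use directly as a matching degree, whereas KL divergence is asymmetric and unbounded.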

Step 206: Calculate at least one of the demand matching degree, time attribute matching degree, regional attribute matching degree, entity attribute matching degree and semantic dependency relation matching degree between the query and each document, and combine it with the theme matching degree to calculate the matching degree between the query and each document.

The demand matching degree between the query and a document can be obtained by calculating the similarity between the demand feature corresponding to the query and the demand feature corresponding to the document, where the similarity can be calculated as a cosine similarity, a KL distance, or the like.

The time attribute matching degree, regional attribute matching degree, entity attribute matching degree and semantic dependency relation matching degree between the query and a document can be determined according to whether the time attribute, regional attribute, entity attribute and semantic dependency relation of the query and the document are consistent. For example, if they are consistent, the matching degree is determined to be 1; if they are not, it is determined to be 0.8.

The matching degree Score(q, d) between the query and a document can be calculated according to formula (1):

Score(q, d) = Sem(q, d) · [λ₁ · Score1 + λ₂ · Score2]    (1)

Here Score1 is the theme matching degree between the query and the document, Score2 is the demand matching degree between the query and the document, and Sem(q, d) reflects at least one of the time attribute matching degree, regional attribute matching degree, entity attribute matching degree and semantic dependency relation matching degree between the query and the document. λ₁ and λ₂ are preset weighting parameters whose sum can be 1; if the demand matching degree does not participate in the calculation, λ₂ can be set to 0. Sem(q, d) can be the product of at least one of the time attribute matching degree, regional attribute matching degree, entity attribute matching degree and semantic dependency relation matching degree. It can also be set as a whole: for example, if the time attribute, regional attribute, entity attribute and semantic dependency relation all match, Sem(q, d) takes 1, and if any one of them does not match, Sem(q, d) takes 0.8. The matching degree calculation in this case is shown schematically in Fig. 4.
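Formula (1) and the two ways of setting Sem(q, d) can be written as a short Python sketch. The function names and the default weights 0.7 and 0.3 are assumptions for this example; only the values 1 and 0.8 come from the text.

```python
def sem_factor(attribute_matches, mismatch_value=0.8, overall=False):
    """Compute Sem(q, d) from per-attribute consistency flags.

    attribute_matches maps an attribute name (time, region, entity,
    dependency) to True if query and document are consistent on it.
    With overall=False the per-attribute degrees (1 or mismatch_value)
    are multiplied; with overall=True a single value is chosen.
    """
    if overall:
        return 1.0 if all(attribute_matches.values()) else mismatch_value
    product = 1.0
    for consistent in attribute_matches.values():
        product *= 1.0 if consistent else mismatch_value
    return product


def matching_degree(score1, score2, sem, lambda1=0.7, lambda2=0.3):
    """Formula (1): Score(q, d) = Sem(q, d) * (lambda1*Score1 + lambda2*Score2)."""
    return sem * (lambda1 * score1 + lambda2 * score2)


flags = {"time": True, "region": False}
sem = sem_factor(flags)
score = matching_degree(score1=0.9, score2=0.5, sem=sem)
print(sem, score)
```

Setting `lambda2=0` corresponds to the case in which the demand matching degree does not take part in the calculation.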

Step 207: Determine the search results of the query according to the matching degree between each document and the query.

In this step, the documents whose matching degree with the query meets a preset matching degree requirement can be included in the search results, and the ranking of the documents in the search results can be determined according to their matching degree with the query.
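Step 207 can be sketched in Python as a filter followed by a sort. The text only speaks of a "preset matching degree requirement"; treating it as a minimum score is an assumption, as are the function name and the sample scores.

```python
def rank_documents(match_scores, min_score):
    """Keep documents whose matching degree reaches min_score, best first."""
    kept = [(doc, s) for doc, s in match_scores.items() if s >= min_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical matching degrees for three documents.
results = rank_documents({"d1": 0.62, "d2": 0.35, "d3": 0.81}, min_score=0.5)
print(results)
```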

The above is a detailed description of the method provided by the present invention. The device provided by the present invention is described in detail below in conjunction with Embodiment 4.

Embodiment 4

Fig. 3 is a structural diagram of the searching device provided by Embodiment 4 of the present invention. As shown in Fig. 3, the device can comprise: a first subject analysis unit 300, a second subject analysis unit 310, a theme matching degree computing unit 320 and a search result determination unit 330.

The first subject analysis unit 300 is configured to perform subject analysis on each document in the document library using the subject analysis model to determine the theme distribution corresponding to each document.

The first subject analysis unit 300 can specifically comprise a first descriptor determining subunit 301 and a first distribution determining subunit 302.

The first descriptor determining subunit 301 is configured to perform word segmentation on each document in the document library and select words that meet a statistical requirement from the word segmentation result as descriptors, obtaining the descriptor set of each document.

Descriptors can be selected from the word segmentation result on the basis of TF or TF-IDF, that is, the words whose TF or TF-IDF meets the requirement are selected as descriptors. However, for documents whose words are relatively dispersed, the word frequency statistics show no obvious characteristics. In addition, in some cheating documents the cheater piles up words irrelevant to the theme of the text, so relying on word frequency alone clearly reflects the theme inaccurately. Therefore, when selecting words that meet the statistical requirement as descriptors, the word segmentation result can be clustered, and the words whose occurrence counts rank in the top M are selected as descriptors from the classes whose number of contained words meets a preset number requirement, where M is a positive integer.
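The cluster-based selection can be sketched in Python as below. The clustering method is not specified in the text, so the sketch takes a precomputed word-to-cluster assignment as input. It also assumes that the "number of contained words" counts distinct words and that the top M words are taken across all qualifying classes. All names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def select_descriptors_by_cluster(tokens, cluster_of, min_cluster_size, m):
    """Select the M most frequent words from sufficiently large clusters.

    tokens: word segmentation result of one document.
    cluster_of: assumed precomputed mapping from word to cluster id.
    min_cluster_size: minimum number of distinct words a cluster must hold.
    """
    counts = Counter(t for t in tokens if t in cluster_of)
    clusters = defaultdict(list)
    for word in counts:
        clusters[cluster_of[word]].append(word)
    candidates = [
        word
        for words in clusters.values()
        if len(words) >= min_cluster_size
        for word in words
    ]
    # Most frequent first; ties broken alphabetically for a stable result.
    candidates.sort(key=lambda w: (-counts[w], w))
    return candidates[:m]


tokens = ["computer", "display", "computer", "notebook", "widescreen",
          "cheap", "cheap", "cheap", "cheap"]
cluster_of = {"computer": 0, "display": 0, "notebook": 0, "widescreen": 0,
              "cheap": 1}

selected = select_descriptors_by_cluster(tokens, cluster_of, min_cluster_size=2, m=2)
print(selected)
```

In this sample the piled-up word "cheap" has the highest raw frequency but sits alone in its cluster, so it is not chosen as a descriptor.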

The first distribution determining subunit 302 is configured to input the descriptor set corresponding to each document into the subject analysis model to obtain the themes to which each document is mapped and the probability of being mapped to each theme.

The second subject analysis unit 310 is configured to perform subject analysis on the query input by the user using the subject analysis model to determine the theme distribution corresponding to the query.

The second subject analysis unit 310 can specifically comprise a second descriptor determining subunit 311 and a second distribution determining subunit 312.

The second descriptor determining subunit 311 is configured to perform word segmentation on the query input by the user, select words that meet the IDF statistical requirement from the word segmentation result as descriptors, and form the descriptor set corresponding to the query from the selected descriptors; or to further perform related term expansion on the selected descriptors and form the descriptor set corresponding to the query from the selected descriptors and the related terms obtained by expansion.

Related term expansion can be based on an existing related term dictionary. Such a dictionary can be mined on the following principle: if two words appear in the same context in the corpus, the two words are semantically related and are related terms of each other.

The second distribution determining subunit 312 is configured to input the descriptor set corresponding to the query into the subject analysis model to obtain the themes to which the query is mapped and the probability of being mapped to each theme.

The theme matching degree computing unit 320 is configured to calculate the theme matching degree between the query and each document using the theme distribution corresponding to the query and the theme distribution corresponding to each document. Specifically, the theme matching degree between the query and each document can be obtained by calculating the cosine similarity or the relative entropy (KL) distance between the theme distribution corresponding to the query and the theme distribution corresponding to each document.

The search result determination unit 330 is configured to obtain the matching degree between the query and each document using the theme matching degree, and to determine the search results of the query according to the matching degree between the query and each document.

The subject analysis model involved in the embodiments of the present invention comprises the descriptors corresponding to each theme and the weight of each descriptor in the theme to which it belongs. The form adopted can include but is not limited to a PLSA model or an LDA model. The descriptors adopted in the model comprise at least one of the following: the binary combination of field word, polysemant, entity word or entity contamination.

To obtain the above subject analysis model, the device can also comprise a topic model training unit 340 configured to train the subject analysis model; for the training process, see Embodiment 2 above. Preferably, to prevent the problems that probabilistic theme descriptions such as PLSA or LDA may fail to converge and that term ambiguity may produce similar themes, during training, when descriptors are extracted from a document, semantically related terms of the descriptors are further extracted by statistical means, and the semantically related terms of each descriptor are treated as one document in the corpus and participate in the training of the subject analysis model. This can greatly reduce the influence of noise and ambiguity and improve the accuracy of subject analysis.

Besides calculating the matching degree on the basis of theme features, other supplementary features can also be combined. In view of this, the device can also comprise: a first supplementary features determining unit 350, a second supplementary features determining unit 360 and an additional matching degree computing unit 370.

The first supplementary features determining unit 350 is configured to determine at least one of the demand feature, time attribute feature, regional attribute feature, entity attribute feature and semantic dependency relation feature of each document.

The demand feature of a document is analyzed on the basis of a requirements analysis model established in advance. This requirements analysis model comprises the demand words contained in each demand type and the weights of the demand words. The requirements analysis model is established in a manner similar to the subject analysis model, except that the corpus is based on demand words, so it is not described in detail here. The demand words in a document are extracted, which can be done on the basis of a dictionary or templates. After the demand word set corresponding to each document is obtained, it is input into the requirements analysis model to obtain the demand distribution of each document, that is, the demand types to which each document is mapped and the probability of being mapped to each demand type.

The entity attribute feature of a document is analyzed on the basis of the content of the document: the content is analyzed and structured information of the form "entity-entity attribute-attribute value" is extracted from it. This extraction can be carried out on the basis of templates and can adopt the prior art, so it is not repeated here.

The second supplementary features determining unit 360 is configured to determine at least one of the demand feature, time attribute feature, regional attribute feature, entity attribute feature and semantic dependency relation feature of the query.

Likewise, the demand feature of the query is analyzed on the basis of the requirements analysis model established in advance. The demand words in the query are extracted first; this extraction can be based on a dictionary or on templates and can adopt the prior art, so it is not repeated here. The demand words of the query are then input into the requirements analysis model to obtain the demand distribution of the query, that is, the demand types to which the query is mapped and the probability of being mapped to each demand type.

The semantic dependency relation feature of the query is analyzed by matching preset dependency templates against the semantic chunks in the query, thereby determining the semantic chunks that have a semantic dependency relation and the corresponding relation. A semantic chunk can be represented as a binary combination of words. The types of semantic dependency relation include but are not limited to the qualified relation and the donor and accepter relationship.

The additional matching degree computing unit 370 is configured to perform at least one of the following matching degree calculations: calculating the demand matching degree between the query and each document from the demand feature of the query and the demand feature of each document; calculating the time attribute matching degree between the query and each document from the time attribute feature of the query and the time attribute feature of each document; calculating the regional attribute matching degree between the query and each document from the regional attribute feature of the query and the regional attribute feature of each document; calculating the entity attribute matching degree between the query and each document from the entity attribute feature of the query and the entity attribute feature of each document; or calculating the semantic dependency relation matching degree between the query and each document from the semantic dependency relation feature of the query and the semantic dependency relation feature of each document.

In this case, the search result determination unit 330 further combines at least one of the demand matching degree, time attribute matching degree, regional attribute matching degree, entity attribute matching degree and semantic dependency relation matching degree between the query and each document to obtain the matching degree between the query and each document. The matching degree calculation in this case is shown schematically in Fig. 4.

Specifically, the search result determination unit 330 can obtain the matching degree Score(q, d) between the query and each document according to the following formula:

Score(q, d) = Sem(q, d) · [λ₁ · Score1 + λ₂ · Score2]

Here Score1 is the theme matching degree between the query and the document, Score2 is the demand matching degree between the query and the document, and Sem(q, d) reflects at least one of the time attribute matching degree, regional attribute matching degree, entity attribute matching degree and semantic dependency relation matching degree between the query and the document. λ₁ and λ₂ are preset weighting parameters whose sum can be 1; if the demand matching degree does not participate in the calculation, λ₂ can be set to 0. Sem(q, d) can be the product of at least one of the time attribute matching degree, regional attribute matching degree, entity attribute matching degree and semantic dependency relation matching degree. It can also be set as a whole: for example, if the time attribute, regional attribute, entity attribute and semantic dependency relation all match, Sem(q, d) takes 1, and if any one of them does not match, Sem(q, d) takes 0.8.

When determining the search results, the search result determination unit 330 specifically includes in the search results the documents whose matching degree with the query meets the preset matching degree requirement, and determines the ranking of each document in the search results according to its matching degree with the query.

In the device provided by the embodiments of the present invention, the first subject analysis unit 300, the first supplementary features determining unit 350 and the topic model training unit 340 can operate offline, while the other units can operate in real time on the query input by the user.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A subject-based searching method, characterized in that a subject analysis model is used to perform subject analysis on each document in a document library to determine the theme distribution corresponding to each document, and the method further comprises:

2. The method according to claim 1, characterized in that the subject analysis model comprises: the descriptors corresponding to each theme and the weight of each descriptor in the theme to which it belongs.

3. The method according to claim 2, characterized in that the subject analysis model is a probabilistic latent semantic analysis (PLSA) model or a latent Dirichlet allocation (LDA) model.

4. The method according to claim 3, characterized in that the descriptors comprise at least one of the following: the binary combination of field word, polysemant, entity word or entity contamination.

5. The method according to claim 3, characterized in that, in the training process of the subject analysis model, when descriptors are extracted from a document, semantically related terms of the descriptors are further extracted by statistical means, and the semantically related terms of each descriptor are treated as one document in the corpus and participate in the training of the subject analysis model.

6. The method according to claim 2, characterized in that using the subject analysis model to perform subject analysis on each document in the document library to determine the theme distribution corresponding to each document specifically comprises:

7. The method according to claim 6, characterized in that selecting words that meet the statistical requirement from the word segmentation result as descriptors comprises:

8. The method according to claim 2, characterized in that step S1 specifically comprises:

9. The method according to any one of claims 1 to 8, characterized in that calculating the theme matching degree between the query and each document comprises:

10. The method according to claim 1, characterized in that the method further comprises: determining at least one of the demand feature, time attribute feature, regional attribute feature, entity attribute feature and semantic dependency relation feature of each document;

11. The method according to claim 10, characterized in that the matching degree Score(q, d) between the query and each document is obtained according to the following formula:

Score(q, d) = Sem(q, d) · [λ₁ · Score1 + λ₂ · Score2]

12. The method according to claim 1, characterized in that, in step S3, determining the search results of the query according to the matching degree between the query and each document comprises:

13. A subject-based searching device, characterized in that the device comprises:

14. The device according to claim 13, characterized in that the subject analysis model comprises: the descriptors corresponding to each theme and the weight of each descriptor in the theme to which it belongs.

15. The device according to claim 14, characterized in that the subject analysis model is a probabilistic latent semantic analysis (PLSA) model or a latent Dirichlet allocation (LDA) model.

16. The device according to claim 15, characterized in that the descriptors comprise at least one of the following: the binary combination of field word, polysemant, entity word or entity contamination.

17. The device according to claim 15, characterized in that the device further comprises: a topic model training unit configured to train the subject analysis model, wherein in the training process, when descriptors are extracted from a document, semantically related terms of the descriptors are further extracted by statistical means, and the semantically related terms of each descriptor are treated as one document in the corpus and participate in the training of the subject analysis model.

18. The device according to claim 14, characterized in that the first subject analysis unit specifically comprises:

19. The device according to claim 18, characterized in that, when selecting words that meet the statistical requirement from the word segmentation result as descriptors, the first descriptor determining subunit clusters the word segmentation result and selects, from the classes whose number of contained words meets a preset number requirement, the words whose occurrence counts rank in the top M as descriptors, where M is a positive integer.

20. The device according to claim 14, characterized in that the second subject analysis unit specifically comprises:

21. The device according to any one of claims 13 to 20, characterized in that the theme matching degree computing unit obtains the theme matching degree between the query and each document by calculating the cosine similarity or the relative entropy (KL) distance between the theme distribution corresponding to the query and the theme distribution corresponding to each document.

22. The device according to claim 13, characterized in that the device further comprises:

23. The device according to claim 22, characterized in that the search result determination unit obtains the matching degree Score(q, d) between the query and each document according to the following formula:

Score(q, d) = Sem(q, d) · [λ₁ · Score1 + λ₂ · Score2]

24. The device according to claim 13, characterized in that the search result determination unit specifically includes in the search results the documents whose matching degree with the query meets the preset matching degree requirement, and determines the ranking of each document in the search results according to its matching degree with the query.