CN103425710A - Subject-based searching method and device - Google Patents

Subject-based searching method and device

Publication number
CN103425710A
Authority
CN
China
Prior art keywords
document
matching degree
query
theme
descriptor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012101668529A
Other languages
Chinese (zh)
Inventor
方高林
王海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN2012101668529A priority Critical patent/CN103425710A/en
Publication of CN103425710A publication Critical patent/CN103425710A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a subject-based searching method and device. The method comprises: performing subject analysis on each document in a document library by using a subject analysis model to determine the subject distribution corresponding to each document; performing subject analysis on a query input by a user by using the subject analysis model to determine the subject distribution corresponding to the query; calculating the subject matching degree between the query and each document by using the subject distribution corresponding to the query and the subject distribution corresponding to each document; and obtaining the matching degree between the query and each document by using the subject matching degree, and determining the search results according to the matching degree between the query and each document. Because subject matching is adopted instead of keyword matching, a document can still be recalled even if its wording is not fully consistent with the user's query or it does not match redundant terms in the query, and the search results match the query on subject to the greatest extent, so that search recall and accuracy are improved.

Description

A theme-based searching method and device
[technical field]
The present invention relates to the field of computer technology, and in particular to a theme-based searching method and device.
[background technology]
With the development of computer network technology, search engines have become an important means of obtaining information. A user enters a search term (query) into a search engine, and the search engine searches the crawled documents for those relevant to the query and ranks them by degree of relevance. The widely adopted search model is the vector space model, whose basic idea is to represent the query and each document as term vectors. The vector weights can be the term frequency (TF) or the term frequency-inverse document frequency (TF-IDF), and the similarity between the term vector of the query and the term vector of the document is then calculated as the measure of relevance. Many variants exist in practice, but in essence they all find the keywords that match between the query and the document and assign weights to them.
In other words, traditional search is based on keyword matching, which leads to the following two situations:
First, when a document shares no keywords, or very few, with the query, it is difficult to rank it near the top, or even to recall it at all, even if its actual content meets the need expressed by the query. For example, a user inputs the query "good-looking Lincoln". Some documents contain "Lincoln with a streamlined design" or "white, black or red Lincoln". Although they also concern the Lincoln's appearance, they may not be recalled or ranked near the top because the keyword "good-looking" does not occur in them. Yet these are in fact the documents about the Lincoln's good-looking appearance that the user wants to find.
Second, some queries contain redundant expressions, so documents that do not match the redundant words cannot be recalled or cannot be ranked near the top. For example, a user inputs the query "after divorce, what if the former husband is still hated", in which "divorce" is a redundant word. Some documents do offer a solution to still hating the former husband, but because they do not match the keyword "divorce", they cannot be recalled or cannot be ranked near the top.
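The first situation can be reproduced with a minimal sketch of the vector space model described above. The smoothed IDF formula, the tokenization and the three sample documents are assumptions made for illustration, and the query is included in the IDF statistics only to keep the example short.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption) so that very common terms keep a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = ["good-looking", "lincoln"]
docs = [
    ["streamlined", "lincoln"],              # about appearance, but no shared adjective
    ["good-looking", "lincoln", "photos"],   # shares both keywords
    ["apple", "fruit"],                      # unrelated
]
vecs = tfidf_vectors([query] + docs)
scores = [cosine(vecs[0], v) for v in vecs[1:]]
```

The document that is about appearance but lacks the keyword "good-looking" scores well below the one that repeats both query keywords, which is the ranking weakness the background describes.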
[summary of the invention]
In view of this, the invention provides a theme-based searching method and device to improve search recall and accuracy.
The specific technical scheme is as follows:
A theme-based searching method, in which a subject analysis model is used to perform subject analysis on each document in a document library to determine the theme distribution corresponding to each document, the method further comprising:
S1: performing subject analysis on the query input by the user by using the subject analysis model, and determining the theme distribution corresponding to the query;
S2: calculating the theme matching degree between the query and each document by using the theme distribution corresponding to the query and the theme distribution corresponding to each document;
S3: obtaining the matching degree between the query and each document by using the theme matching degree, and determining the search results of the query according to the matching degree between the query and each document.
According to one preferred embodiment of the present invention, the subject analysis model comprises: the descriptors corresponding to each theme and the weight of each descriptor in the theme to which it belongs.
According to one preferred embodiment of the present invention, the subject analysis model is a probabilistic latent semantic analysis (PLSA) model or a latent Dirichlet allocation (LDA) model.
According to one preferred embodiment of the present invention, the descriptors comprise at least one of the following: field words, binary combinations of polysemants, entity words, or entity word combinations.
According to one preferred embodiment of the present invention, in the training process of the subject analysis model, when descriptors are extracted from documents, the semantic related terms of each descriptor are further extracted in a statistics-based manner, and the semantic related terms of each descriptor are used as one document in the corpus to participate in the training of the subject analysis model.
According to one preferred embodiment of the present invention, using the subject analysis model to perform subject analysis on each document in the document library and determine the theme distribution corresponding to each document specifically comprises:
A1: performing word segmentation on each document in the document library, selecting from the word segmentation result the words that meet a statistical requirement as descriptors, and obtaining the descriptor set of each document;
A2: inputting the descriptor set corresponding to each document into the subject analysis model, and obtaining the themes to which each document is mapped and the probability of being mapped to each theme.
According to one preferred embodiment of the present invention, selecting from the word segmentation result the words that meet a statistical requirement as descriptors comprises:
clustering the word segmentation result, and selecting as descriptors the words ranked in the top M by number of occurrences from the categories whose word count meets a preset quantity requirement, M being a positive integer.
According to one preferred embodiment of the present invention, step S1 specifically comprises:
S11: performing word segmentation on the query input by the user, and selecting from the word segmentation result the words that meet an inverse document frequency (IDF) statistical requirement as descriptors; the selected descriptors form the descriptor set corresponding to the query, or the selected descriptors are further expanded with related terms, and the selected descriptors together with the related terms obtained by expansion form the descriptor set corresponding to the query;
S12: inputting the descriptor set corresponding to the query into the subject analysis model, and obtaining the themes to which the query is mapped and the probability of being mapped to each theme.
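Step S11 can be sketched as follows. The source does not state what the IDF statistical requirement is, so a minimum IDF threshold is assumed here; the function name, the document frequency table and the related-term table are all hypothetical.

```python
import math

def select_query_descriptors(query_terms, doc_freq, num_docs, min_idf, related=None):
    """Keep query terms whose IDF reaches min_idf, then optionally add related terms."""
    descriptors = []
    for term in query_terms:
        # Plain IDF with +1 in the denominator to avoid division by zero (an assumption).
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))
        if idf >= min_idf and term not in descriptors:
            descriptors.append(term)
    expanded = list(descriptors)
    for term in descriptors:
        for rel in (related or {}).get(term, []):
            if rel not in expanded:
                expanded.append(rel)
    return expanded

# Hypothetical statistics: very common words have low IDF and are filtered out.
doc_freq = {"what": 900, "if": 950, "former husband": 12, "hated": 40}
related = {"former husband": ["ex-husband"]}
result = select_query_descriptors(
    ["what", "if", "former husband", "hated"], doc_freq, 1000, 2.0, related
)
```

Here the low-IDF words "what" and "if" are dropped, and the remaining descriptors are expanded with one related term before being passed to the subject analysis model in S12.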
According to one preferred embodiment of the present invention, calculating the theme matching degree between the query and each document comprises:
obtaining the theme matching degree between the query and each document by calculating the cosine similarity or the relative entropy (KL distance) between the theme distribution corresponding to the query and the theme distribution corresponding to each document.
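Both measures named above can be sketched over theme distributions stored as dictionaries. The dictionary representation, the `eps` smoothing in the KL computation and the sample distributions are assumptions for illustration.

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two {theme: probability} distributions."""
    topics = set(p) | set(q)
    dot = sum(p.get(t, 0.0) * q.get(t, 0.0) for t in topics)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over the union of themes; eps smoothing (an assumption) avoids log(0)."""
    topics = set(p) | set(q)
    ps = {t: p.get(t, 0.0) + eps for t in topics}
    qs = {t: q.get(t, 0.0) + eps for t in topics}
    zp, zq = sum(ps.values()), sum(qs.values())
    return sum((ps[t] / zp) * math.log((ps[t] / zp) / (qs[t] / zq)) for t in topics)

query_topics = {"topic1": 0.7, "topic2": 0.3}
doc_topics = {"topic1": 0.6, "topic2": 0.3, "topic3": 0.1}
cos_match = cosine_similarity(query_topics, doc_topics)
kl_dist = kl_divergence(query_topics, doc_topics)
```

Note that cosine similarity grows with closeness while the KL distance shrinks, so a KL value would need to be converted before it can serve as a matching degree; the source does not say how, and any particular conversion would be a further assumption.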
According to one preferred embodiment of the present invention, the method further comprises: determining at least one of the demand feature, time attribute feature, regional attribute feature, entity attribute feature and semantic dependency relation feature of each document;
step S1 further comprises: determining at least one of the demand feature, time attribute feature, regional attribute feature, entity attribute feature and semantic dependency relation feature of the query;
step S2 further comprises at least one of the following matching degree calculations: calculating the demand matching degree between the query and each document from the demand feature of the query and the demand feature of each document; calculating the time attribute matching degree between the query and each document from the time attribute feature of the query and the time attribute feature of each document; calculating the regional attribute matching degree between the query and each document from the regional attribute feature of the query and the regional attribute feature of each document; calculating the entity attribute matching degree between the query and each document from the entity attribute feature of the query and the entity attribute feature of each document; or calculating the semantic dependency relation matching degree between the query and each document from the semantic dependency relation feature of the query and the semantic dependency relation feature of each document;
in step S3, the matching degree between the query and each document is obtained by further combining at least one of the demand matching degree, time attribute matching degree, regional attribute matching degree, entity attribute matching degree and semantic dependency relation matching degree between the query and each document.
According to one preferred embodiment of the present invention, the matching degree Score(q,d) between the query and each document is obtained according to the following formula:
$$\mathrm{Score}(q,d) = \mathrm{Sem}(q,d)\,[\lambda_1\,\mathrm{Score1} + \lambda_2\,\mathrm{Score2}]$$
where Score1 is the theme matching degree between the query and the document, Score2 is the demand matching degree between the query and the document, Sem(q,d) reflects at least one of the time attribute matching degree, regional attribute matching degree, entity attribute matching degree and semantic dependency relation matching degree between the query and the document, and $\lambda_1$ and $\lambda_2$ are preset weighting parameters.
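The formula can be written directly as a small function. The default weights below are placeholders, and `sem_factor` is a hypothetical way of combining the attribute matching degrees into Sem(q,d); the source does not specify how they are combined, so the product is an assumption.

```python
def overall_score(topic_match, demand_match, sem, lambda1=0.7, lambda2=0.3):
    """Score(q,d) = Sem(q,d) * (lambda1 * Score1 + lambda2 * Score2)."""
    return sem * (lambda1 * topic_match + lambda2 * demand_match)

def sem_factor(attribute_matches):
    """Hypothetical Sem(q,d): product of whichever attribute matching degrees are available."""
    factor = 1.0
    for value in attribute_matches:
        factor *= value
    return factor

# Time attribute matches fully, regional attribute matches only half.
sem = sem_factor([1.0, 0.5])
score = overall_score(topic_match=0.8, demand_match=0.5, sem=sem)
```

Because Sem(q,d) multiplies the weighted sum, a document that fails an attribute constraint entirely (Sem of zero) receives a zero score regardless of how well its theme matches.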
According to one preferred embodiment of the present invention, determining the search results of the query according to the matching degree between the query and each document in step S3 comprises:
including in the search results the documents whose matching degree with the query meets a preset matching degree requirement, and determining the order of the documents in the search results according to their matching degree with the query.
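A minimal sketch of this selection and ordering step follows, assuming the preset matching degree requirement is a minimum score; the function name and the sample scores are hypothetical.

```python
def select_results(scores, min_score):
    """Keep documents whose matching degree reaches min_score, best match first."""
    kept = [(doc_id, s) for doc_id, s in scores.items() if s >= min_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)

ranked = select_results({"d1": 0.42, "d2": 0.91, "d3": 0.10}, min_score=0.3)
```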
The present invention also provides a theme-based searching device, the device comprising:
a first subject analysis unit, configured to perform subject analysis on each document in the document library by using the subject analysis model, and determine the theme distribution corresponding to each document;
a second subject analysis unit, configured to perform subject analysis on the query input by the user by using the subject analysis model, and determine the theme distribution corresponding to the query;
a theme matching degree calculation unit, configured to calculate the theme matching degree between the query and each document by using the theme distribution corresponding to the query and the theme distribution corresponding to each document;
a search result determination unit, configured to obtain the matching degree between the query and each document by using the theme matching degree, and determine the search results of the query according to the matching degree between the query and each document.
According to one preferred embodiment of the present invention, the subject analysis model comprises: the descriptors corresponding to each theme and the weight of each descriptor in the theme to which it belongs.
According to one preferred embodiment of the present invention, the subject analysis model is a probabilistic latent semantic analysis (PLSA) model or a latent Dirichlet allocation (LDA) model.
According to one preferred embodiment of the present invention, the descriptors comprise at least one of the following: field words, binary combinations of polysemants, entity words, or entity word combinations.
According to one preferred embodiment of the present invention, the device further comprises: a topic model training unit, configured to train the subject analysis model and, in the training process, when descriptors are extracted from documents, to further extract the semantic related terms of each descriptor in a statistics-based manner and use the semantic related terms of each descriptor as one document in the corpus to participate in the training of the subject analysis model.
According to one preferred embodiment of the present invention, the first subject analysis unit specifically comprises:
a first descriptor determination subunit, configured to perform word segmentation on each document in the document library, select from the word segmentation result the words that meet a statistical requirement as descriptors, and obtain the descriptor set of each document;
a first distribution determination subunit, configured to input the descriptor set corresponding to each document into the subject analysis model, and obtain the themes to which each document is mapped and the probability of being mapped to each theme.
According to one preferred embodiment of the present invention, when selecting from the word segmentation result the words that meet a statistical requirement as descriptors, the first descriptor determination subunit clusters the word segmentation result and selects as descriptors the words ranked in the top M by number of occurrences from the categories whose word count meets a preset quantity requirement, M being a positive integer.
According to one preferred embodiment of the present invention, the second subject analysis unit specifically comprises:
a second descriptor determination subunit, configured to perform word segmentation on the query input by the user, and select from the word segmentation result the words that meet an inverse document frequency (IDF) statistical requirement as descriptors; the selected descriptors form the descriptor set corresponding to the query, or the selected descriptors are further expanded with related terms, and the selected descriptors together with the related terms obtained by expansion form the descriptor set corresponding to the query;
a second distribution determination subunit, configured to input the descriptor set corresponding to the query into the subject analysis model, and obtain the themes to which the query is mapped and the probability of being mapped to each theme.
According to one preferred embodiment of the present invention, the theme matching degree calculation unit obtains the theme matching degree between the query and each document by calculating the cosine similarity or the relative entropy (KL distance) between the theme distribution corresponding to the query and the theme distribution corresponding to each document.
According to one preferred embodiment of the present invention, the device further comprises:
a first additional feature determination unit, configured to determine at least one of the demand feature, time attribute feature, regional attribute feature, entity attribute feature and semantic dependency relation feature of each document;
a second additional feature determination unit, configured to determine at least one of the demand feature, time attribute feature, regional attribute feature, entity attribute feature and semantic dependency relation feature of the query;
an additional matching degree calculation unit, configured to perform at least one of the following matching degree calculations: calculating the demand matching degree between the query and each document from the demand feature of the query and the demand feature of each document; calculating the time attribute matching degree between the query and each document from the time attribute feature of the query and the time attribute feature of each document; calculating the regional attribute matching degree between the query and each document from the regional attribute feature of the query and the regional attribute feature of each document; calculating the entity attribute matching degree between the query and each document from the entity attribute feature of the query and the entity attribute feature of each document; or calculating the semantic dependency relation matching degree between the query and each document from the semantic dependency relation feature of the query and the semantic dependency relation feature of each document;
the search result determination unit obtains the matching degree between the query and each document by further combining at least one of the demand matching degree, time attribute matching degree, regional attribute matching degree, entity attribute matching degree and semantic dependency relation matching degree between the query and each document.
According to one preferred embodiment of the present invention, the search result determination unit obtains the matching degree Score(q,d) between the query and each document according to the following formula:
$$\mathrm{Score}(q,d) = \mathrm{Sem}(q,d)\,[\lambda_1\,\mathrm{Score1} + \lambda_2\,\mathrm{Score2}]$$
where Score1 is the theme matching degree between the query and the document, Score2 is the demand matching degree between the query and the document, Sem(q,d) reflects at least one of the time attribute matching degree, regional attribute matching degree, entity attribute matching degree and semantic dependency relation matching degree between the query and the document, and $\lambda_1$ and $\lambda_2$ are preset weighting parameters.
According to one preferred embodiment of the present invention, the search result determination unit specifically includes in the search results the documents whose matching degree with the query meets a preset matching degree requirement, and determines the order of the documents in the search results according to their matching degree with the query.
As can be seen from the above technical solutions, the present invention performs subject analysis on the documents and on the query separately, determines the theme distribution corresponding to each document in the document library and the theme distribution corresponding to the query, and determines the search results of the query by calculating the theme matching degree between the query and each document. In other words, the present invention adopts theme matching rather than keyword matching and returns to the user the search results that match on the theme expressed. A document can be recalled even if its wording is not fully consistent with the user's query or it does not match redundant words in the query, and the search results match the query on theme to the greatest extent, so that search recall and accuracy are improved.
[description of drawings]
Fig. 1 is the main flowchart of the theme-based searching method provided by Embodiment 1 of the present invention;
Fig. 2 is the detailed flowchart of the theme-based searching method provided by Embodiment 3 of the present invention;
Fig. 3 is the structural diagram of the searching device provided by Embodiment 4 of the present invention;
Fig. 4 is a schematic diagram of the matching degree calculation involved in the embodiments of the present invention.
[embodiment]
To make the purpose, technical solutions and advantages of the present invention clearer, the present invention is described below with reference to the drawings and specific embodiments.
Embodiment 1
Fig. 1 is the main flowchart of the theme-based searching method provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method can comprise the following steps:
Step 101: perform subject analysis on the query input by the user by using the subject analysis model to determine the theme distribution corresponding to the query, and perform subject analysis on each document in the document library by using the subject analysis model to determine the theme distribution corresponding to each document.
The subject analysis model involved in this step is established in advance and comprises the descriptors contained in each theme and the weight of each descriptor in the theme to which it belongs. With this model, the theme distribution corresponding to the query and the theme distribution corresponding to each document can be determined. The establishment and content of the subject analysis model are described in detail in Embodiment 2, and the process of performing subject analysis on the query and the documents is described in detail in Embodiment 3.
Step 102: calculate the theme matching degree between the query and each document by using the theme distribution corresponding to the query and the theme distribution corresponding to each document.
This step uses the theme distributions of the query and of each document to determine how well the query matches each document on theme, which serves as the basis for selecting the subsequent search results.
Step 103: obtain the matching degree between the query and each document by using the theme matching degree, and determine the search results of the query according to the matching degree.
In this step, besides the theme matching degree, the matching degree between the query and each document can be obtained by further combining at least one of the demand matching degree, time attribute matching degree, regional attribute matching degree, entity attribute matching degree and semantic dependency relation matching degree between the query and each document. This process is described in detail in Embodiment 3.
When the search results of the query are determined according to the matching degree, the documents whose matching degree with the query meets a preset matching degree requirement can be included in the search results, and the order of the documents in the search results can be determined according to their matching degree with the query.
Embodiment 2
In the embodiments of the present invention, the subject analysis model can be a model that describes theme probabilities, including but not limited to the probabilistic latent semantic analysis (PLSA) model and the latent Dirichlet allocation (LDA) model.
LSA is a method that uses mathematical and statistical techniques to extract the lexical items in documents, infer the semantic relations between them, build a semantic index, and organize the documents into a semantic space structure; that is, lexical items with a high semantic association are mapped to the same theme. PLSA builds on the latent semantic indexing of LSA and uses a probability model to describe the relations between documents and latent semantics and between latent semantics and lexical items. The so-called latent semantics are the themes referred to in the embodiments of the present invention.
LDA is an unsupervised machine learning technique used to identify the theme information hidden in a large-scale document collection or corpus. It adopts the bag-of-words approach: each document is represented as a probability distribution over a number of themes, and each theme is represented as a probability distribution over many words.
Because LSA and PLSA are mature prior art, they are not described in detail here. In short, by describing theme probabilities with PLSA, LDA or the like, each lexical item can be mapped to its corresponding theme, so that each theme is formed from lexical items. The model can further include the probability of each lexical item in the theme to which it belongs, this probability being obtained from word frequency statistics.
In traditional PLSA and LDA, a lexical item is a single word. However, many words have several meanings and may therefore introduce ambiguity of expression, which is difficult to resolve within the subject analysis model. In addition, because the merging threshold between different themes is difficult to determine during training, the trained subject analysis model may contain similar themes. In view of this, the embodiments of the present invention can express themes in a disambiguated manner, and the above lexical items can take the following forms:
The first form, field words: a field word is a word of a certain professional field that essentially does not occur in other fields, so such words have very strong discriminative power. For example, words such as {application program, thread, compiler} belong to the same field as words such as {computer, programming}. A field word is usually a unigram.
The second form, binary combinations of polysemants: for a word that may be ambiguous, i.e., a polysemant, disambiguation can be achieved by combining it with another unigram to form a binary combination. For example, "apple" is a polysemant and can be distinguished by combining it into "apple # mobile phone", "apple # computer", "apple # fruit" and so on, where # denotes the binary combination relation.
The third form, entity words or entity word combinations: the generality of an entity word lies between that of a field word and that of a general word, and an entity word usually belongs to several fields. Because the discriminative power of a single entity word may be low, an expression combining two entity words can be adopted.
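The second form above can be illustrated with a short sketch. It assumes a hypothetical table of polysemants and their disambiguating partner words is already available; how such a table would be built is not described in the source, and falling back to the bare word when no partner is present is also an assumption.

```python
def disambiguate(terms, partners):
    """Replace each polysemant with 'word # partner' pairs found in the same term list."""
    out = []
    for term in terms:
        if term in partners:
            combos = [f"{term} # {p}" for p in partners[term] if p in terms]
            # Fall back to the bare word when no partner is present (an assumption).
            out.extend(combos or [term])
        else:
            out.append(term)
    return out

partners = {"apple": ["mobile phone", "computer", "fruit"]}
lexical_items = disambiguate(["apple", "mobile phone", "screen"], partners)
```

In this example the ambiguous word "apple" becomes the binary combination "apple # mobile phone" because "mobile phone" appears alongside it, while the other words pass through unchanged.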
As introduced above, when theme probabilities are described with PLSA, LDA or the like, lexical item ambiguity may cause the themes to fail to converge and may produce similar themes. Analysis suggests that using document co-occurrence as the basis for semantic relatedness is what introduces the noise. To further address this problem, the semantic related terms of each lexical item can be extracted by a statistics-based method, and all the semantic related terms corresponding to a lexical item are used as one document in the corpus to participate in the training of the subject analysis model. This greatly reduces the influence of noise and ambiguity and improves the accuracy of subject analysis.
For example, the related terms extracted on a statistical basis for the lexical item "notebook computer" comprise {notebook, written or printed documents, Notebook Battery, dual-core, wireless network card, thinkpad, processor, video card}; these related terms are taken as one document and trained with PLSA or LDA to obtain the corresponding themes. As another example, the related terms extracted on a statistical basis for "apple # computer" comprise {apple, computer, imac, desktop computer, desktop computer}; these related terms are taken as one document and trained with PLSA or LDA to obtain the corresponding themes.
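The corpus construction in these examples can be sketched as below. The related-term lists are taken as given (abridged from the examples above), since the source does not say which statistic extracts them; `build_training_corpus` is a hypothetical helper name.

```python
def build_training_corpus(related_terms):
    """Turn each lexical item's related terms into one pseudo-document for training."""
    return [list(terms) for _, terms in sorted(related_terms.items()) if terms]

related_terms = {
    "notebook computer": [
        "notebook", "Notebook Battery", "dual-core", "wireless network card",
        "thinkpad", "processor", "video card",
    ],
    "apple # computer": ["apple", "computer", "imac", "desktop computer"],
}
corpus = build_training_corpus(related_terms)
```

Each inner list is a bag of words that could then be handed to any PLSA or LDA trainer in place of an ordinary document.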
The subject analysis model finally established by this embodiment comprises: the lexical items corresponding to each theme (i.e., the descriptors) and the weight of each lexical item in the theme to which it belongs.
For example, theme 226 comprises the following lexical items and corresponding weights: Uninstall Tool 0.00250754 | optimize the master 0.00248382 | a key Restore Wizard 0.00233508 | office2007 0.00229928 | office2003 0.00226238 | winrar 0.00220789 | optimize settings 0.00219979 | compression software 0.00211016 | system patches 0.00202564 | backup tools 0.00198587 | software Description 0.00196925 | virtual drive 0.00196877 | Uninstaller 0.00196625 | shadow system 0.00193103 | window7 0.00192867 | installation package 0.00192127 | ultraiso 0.00191646 | ylmf 0.00189621 | install uninstall 0.00187417 | run 0.00187176.
Theme 1288 comprises the following lexical items and corresponding weights: malicious programs 0.00274991 | malicious code 0.00258505 | exploits 0.00252883 | Vulnerability Scanning 0.00245838 | Personal Firewall 0.00242022 | Spyware 0.00234155 | system vulnerabilities 0.00233898 | Trojan 0.00231556 | malicious websites 0.00230633 | spyware 0.0022989 | backdoor Trojan 0.00228247 | anti-Spyware 0.0022427 | antivirus Software 0.0022283 | backdoor 0.00220942 | Kingsoft 0.00220005 | Trojans 0.00219518 | Trojan 0.002188 | Trojan firewall 0.00216575 | killing virus 0.00214389 | anti-virus software 0.00213185.
Embodiment tri-,
The detail flowchart of the searching method based on theme that Fig. 2 provides for the embodiment of the present invention three, as shown in Figure 2, this flow process specifically comprises the following steps:
Step 201: each document in document library is carried out to the descriptor analysis, obtain respectively the descriptor set of each document.
Descriptor analysis of a document first performs word segmentation on the document. Descriptors can then be selected based on TF or TF-IDF, i.e., the words whose TF or TF-IDF meets the requirement are selected as descriptors. This approach generally performs well, but for documents whose words are relatively dispersed, the word-frequency statistics show no obvious features. Moreover, in some spam documents the cheater piles up words unrelated to the theme of the text, so relying on word-frequency information alone clearly reflects the theme inaccurately. The embodiment of the present invention therefore provides a preferred implementation: the words obtained by segmenting the document are first clustered, and the descriptors of the document are then extracted according to the number of words contained in each category and the number of occurrences of each word within its category, forming the descriptor set of the document.
The present invention does not limit how the words are clustered. When extracting descriptors after clustering, the M most frequent words can be selected as descriptors from each category whose word count meets a preset quantity requirement, where M is a positive integer. For example, clustering yields several categories, some containing more words and some fewer. Categories whose word count does not meet the preset quantity requirement are discarded; from each category that does meet it, the M most frequent words are selected as descriptors.
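The cluster-based descriptor selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_descriptors`, the parameter names, and the reading of "word quantity" as the number of distinct words in a cluster are all assumptions, and the cluster assignment is taken as an input because the text leaves the clustering method open.

```python
from collections import Counter, defaultdict


def select_descriptors(tokens, cluster_of, min_cluster_size, top_m):
    """Pick descriptors from sufficiently large clusters of a segmented document.

    tokens: words produced by word segmentation (with repeats).
    cluster_of: dict mapping word -> cluster id, from any clustering method.
    min_cluster_size: assumed form of the "preset quantity requirement";
        a cluster is kept only if it holds at least this many distinct words.
    top_m: number of most frequent words taken from each kept cluster (M).
    """
    # Count occurrences of every word that received a cluster assignment.
    counts = Counter(t for t in tokens if t in cluster_of)

    # Group the distinct words by cluster.
    clusters = defaultdict(list)
    for word in counts:
        clusters[cluster_of[word]].append(word)

    descriptors = []
    for cluster_id in sorted(clusters):
        words = clusters[cluster_id]
        if len(words) < min_cluster_size:
            continue  # discard clusters that are too small
        # Most frequent first; ties broken alphabetically for determinism.
        words.sort(key=lambda w: (-counts[w], w))
        descriptors.extend(words[:top_m])
    return descriptors
```

For instance, a lone off-theme word that ends up in its own small cluster is dropped regardless of how often it is repeated, which is the effect the text attributes to clustering when dealing with piled-up irrelevant words.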
Step 202: Input the descriptor set of each document into the subject analysis model to obtain the theme distribution of each document.
This step is in effect semantic analysis of each document. After the descriptor set of a document is input into a subject analysis model such as a PLSA model or an LDA model, the theme distribution of the document is obtained, i.e., the themes the document is mapped to and the probability of being mapped to each theme.
For example, the descriptor set of a document is {computer, display, widescreen, notebook}. After analysis with an existing subject analysis model, the document is mapped to theme topic1 with probability 0.5, to theme topic2 with probability 0.2, and to theme topic3 with probability 0.1.
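To make the idea of a theme distribution concrete, the sketch below maps a descriptor set onto themes using a table of per-theme descriptor weights. It is a deliberately simplified stand-in and not PLSA or LDA inference: each theme is scored by summing the weights of the descriptors it contains, and the scores are normalized to sum to 1. The function name, the weight table, and the scoring rule are assumptions made for illustration, and the sketch does not reproduce the probabilities in the example above.

```python
def theme_distribution(descriptors, theme_weights):
    """Map a descriptor set to normalized theme probabilities.

    descriptors: iterable of descriptor strings for one document or query.
    theme_weights: dict theme id -> dict descriptor -> weight in that theme
        (the shape of model described in the text; the values are hypothetical).
    Returns a dict theme id -> probability; themes with zero score are omitted.
    """
    scores = {}
    for theme, weights in theme_weights.items():
        # Simplified scoring: sum the weights of the descriptors found in this theme.
        score = sum(weights.get(word, 0.0) for word in descriptors)
        if score > 0.0:
            scores[theme] = score

    total = sum(scores.values())
    if total == 0.0:
        return {}
    return {theme: score / total for theme, score in scores.items()}
```

The result is a sparse mapping from theme to probability, which is the representation the later matching steps operate on.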
In addition, during semantic analysis, besides computing the theme distribution of each document, at least one of the following features of each document can also be analyzed: the demand feature, time attribute feature, regional attribute feature, entity attribute feature, and semantic dependency relation feature.
The demand feature of a document is analyzed based on a demand analysis model established in advance. This demand analysis model comprises the demand words contained in each demand type and the weights of those demand words. The demand analysis model is established in a way similar to the subject analysis model, except that the training corpus is based on demand words, so it is not described in detail here.
For example, in the established demand analysis model, the demand words and weights contained in the video demand include: online video 0.00385583 | watch 0.00316739 | turn on light 0.00312384 | complete works 0.00302876 | new window 0.00299306 | private message 0.0028422 | Tudou.com 0.00276037 | watch online 0.0027364 | information subscription 0.00267162 | plot 0.00265838 | backward 0.00257173 | intersection 0.00252181.
The demand words and weights contained in the desktop demand include: wallpaper picture 0.00224348 | screensaver download 0.00201372 | cartoon wallpaper 0.00195555 | cartoon desktop 0.00189441 | wallpaper download 0.00179601 | desktop download 0.0017233 | desktop wallpaper 0.00172064 | animation wallpaper 0.00170366 | wallpaper desktop 0.00167871 | wallpaper screensaver 0.00159849 | beauty wallpaper 0.00159034 | wallpaper album 0.00158929.
Demand words are extracted from the document. This extraction can be based on a dictionary or on templates and can use existing techniques, which are not described here. After the demand word set of each document is obtained, it is input into the demand analysis model to obtain the demand distribution of each document, i.e., the demand types each document is mapped to and the probability of being mapped to each demand type.
The time attribute feature and regional attribute feature of a document can be analyzed based on static attributes of the document, such as its publication time or the region where it was produced. They can also be based on the content of the document, for example by extracting time information or region information from the document.
The entity attribute feature of a document is analyzed based on its content: the content is analyzed and structured information of the form "entity - entity attribute - attribute value" is extracted from it, for example the structured information shown in Table 1. This extraction can be template-based and can use existing techniques, which are not described here.
Table 1
Entity word | Entity attribute | Attribute value
Baidu | Address | No. 10, Shangdi 10th Street, Haidian District, Beijing
Baidu Company | Phone | 59928888
The semantic dependency relation feature of a document is analyzed by matching the semantic chunks in the document against preset dependency templates, thereby determining the semantic chunks that have a semantic dependency relation and the corresponding relation. A semantic chunk can be represented as a binary combination of words. The types of semantic dependency relation can include, but are not limited to, the qualified relation and the donor and accepter relationship. A qualified relation is one in which one word qualifies another, such as "Yao Ming's height" or "Super Rabbit usage". A donor and accepter relationship is one in which one word is the agent and the other is the receiver of the action, and their order cannot be reversed. For example, in "Li Na defeats Sharapova", "Li Na, Sharapova" is a donor and accepter relationship, and in "euro exchanged for dollar", "euro, dollar" is a donor and accepter relationship.
The dependency templates can be obtained by mining search logs or can be defined manually. For example, templates for the qualified relation can include [basketball star] height, [software] usage, etc. Templates for the donor and accepter relationship can include [noun] verb [noun], etc.
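As an illustration of template matching for the qualified relation, the sketch below checks adjacent word pairs against templates of the form [slot] head. The function name, the representation of a template as a (slot, head) tuple, and the slot lexicon are assumptions; the text does not specify how templates are stored or how slots such as [basketball star] are resolved.

```python
def match_qualified_relation(words, templates, slot_lexicon):
    """Find binary semantic chunks that match qualified-relation templates.

    words: segmented words of a document or query, in order.
    templates: list of (slot, head) pairs, e.g. ("basketball star", "height")
        standing for the template "[basketball star] height".
    slot_lexicon: dict slot name -> set of words that can fill that slot
        (hypothetical resource, e.g. a list of basketball stars).
    Returns a list of (modifier, head) chunks in order of appearance.
    """
    chunks = []
    # A semantic chunk is a binary combination of words; adjacent pairs are checked here.
    for first, second in zip(words, words[1:]):
        for slot, head in templates:
            if second == head and first in slot_lexicon.get(slot, ()):
                chunks.append((first, second))
    return chunks
```

Restricting the check to adjacent pairs keeps the example short; a fuller matcher might also allow intervening function words.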
In addition, it should be noted that steps 201 and 202 above can be performed offline.
Step 203: Perform descriptor analysis on the query input by the user to obtain the descriptor set of the query.
In this step, words whose IDF meets a preset requirement can be chosen as descriptors from the word segmentation result of the query, for example the N1 words with the highest IDF, where N1 is a preset positive integer, or the words whose IDF exceeds a preset IDF threshold.
The determined descriptors can directly form the descriptor set. Preferably, related-term expansion is further performed on the determined descriptors, and the related terms obtained by expansion together with the determined descriptors form the descriptor set. Related-term expansion can be based on an existing related-term dictionary, which can be mined on the following principle: if two words occur in the same context in the corpus, the two words are semantically related and are related terms of each other.
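The query-side descriptor analysis above (IDF-based selection followed by optional related-term expansion) can be sketched as follows. The function name, the parameter names, and the dictionary shapes are assumptions for illustration; in particular the IDF table and the related-term dictionary are taken as precomputed inputs.

```python
def query_descriptors(query_words, idf, top_n=None, idf_threshold=None, related_terms=None):
    """Build the descriptor set of a query.

    query_words: words from word segmentation of the query.
    idf: dict word -> inverse document frequency (assumed precomputed).
    top_n: keep only the N1 words with the highest IDF, if given.
    idf_threshold: keep only words whose IDF exceeds this value, if given.
    related_terms: optional dict word -> list of related terms (hypothetical dictionary).
    Returns descriptors first, followed by any expanded related terms.
    """
    # Rank distinct query words by descending IDF (ties broken alphabetically).
    ranked = sorted(set(query_words), key=lambda w: (-idf.get(w, 0.0), w))
    if idf_threshold is not None:
        ranked = [w for w in ranked if idf.get(w, 0.0) > idf_threshold]
    if top_n is not None:
        ranked = ranked[:top_n]

    descriptors = list(ranked)
    if related_terms:
        # Related-term expansion: append terms not already in the set.
        for word in ranked:
            for term in related_terms.get(word, []):
                if term not in descriptors:
                    descriptors.append(term)
    return descriptors
```

Either selection rule can be used on its own; passing both applies the threshold first and then truncates to the top N1 words.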
Step 204: Input the descriptor set of the query into the subject analysis model to obtain the theme distribution of the query.
Likewise, this step is in effect semantic analysis of the query. After the descriptor set of the query is input into a subject analysis model such as a PLSA model or an LDA model, the theme distribution of the query is obtained, i.e., the themes the query is mapped to and the probability of being mapped to each theme.
For example, the query input by the user is "Super Rabbit", and subject analysis of "Super Rabbit" gives a probability of 0.851297 of being mapped to Topic:226 and a probability of 0.120394 of being mapped to Topic:1288, where Topic:226 and Topic:1288 can be as shown in the subject analysis model of Embodiment 2.
In addition, during semantic analysis, besides computing the theme distribution of the query, at least one of the following features of the query can also be analyzed: the demand feature, time attribute feature, regional attribute feature, entity attribute feature, and semantic dependency relation feature.
The demand feature of the query is analyzed based on the demand analysis model established in advance. Demand words are first extracted from the query; this extraction can be based on a dictionary or on templates and can use existing techniques, which are not described here. The demand words of the query are then input into the demand analysis model to obtain the demand distribution of the query, i.e., the demand types the query is mapped to and the probability of being mapped to each demand type.
The time attribute feature of the query is analyzed mainly based on the content of the query: words related to time are extracted from the query as the time attribute feature. For example, if the user inputs the query "what major events occurred in 2008", "2008" is the time attribute feature of this query.
The regional attribute feature of the query can be analyzed based on attributes of the user who inputs the query, for example by using the geographic location corresponding to the user's IP address as the regional attribute feature: if the user who inputs the query is in Beijing, "Beijing" can be used as the regional attribute feature of the query. It can also be based on the content of the query, by extracting words that denote a region from the query as the regional attribute feature. For example, if the user inputs the query "which five-star hotels are there in Beijing", "Beijing" is extracted as the regional attribute feature of this query.
The entity attribute feature of the query is analyzed based on the content of the query: structured information of the form "entity - entity attribute" is extracted from the query. This extraction can be template-based and can likewise use existing techniques. For example, for the user input "what is Yao Ming's height", the extracted entity attribute feature is "Yao Ming - height". The purpose of extracting this feature is the subsequent matching against documents: if a document has exactly the same entity attribute feature, the document can directly return the attribute value, so the matching degree between the document and the query can be raised.
The semantic dependency relation feature of the query is analyzed by matching the semantic chunks in the query against the preset dependency templates, thereby determining the semantic chunks that have a semantic dependency relation and the corresponding relation. A semantic chunk can be represented as a binary combination of words. The types of semantic dependency relation can include, but are not limited to, the qualified relation and the donor and accepter relationship. For example, if the user inputs the query "what is Yao Ming's height", "Yao Ming's height" is identified as a qualified relation; if the user inputs the query "Li Na defeats Sharapova", "Li Na, Sharapova" is identified as a donor and accepter relationship.
Step 205: Using the theme distribution of the query and the theme distribution of each document, calculate the theme matching degree between the query and each document.
Because a theme distribution is in effect a vector comprising the mapped themes and the probabilities on those themes, the theme matching degree between the query and each document can be calculated as the cosine similarity between the vectors. It can also be calculated from the relative entropy distance (KL, Kullback-Leibler divergence) between the theme distribution of the query and the theme distribution of the document.
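Both measures named above can be sketched over sparse theme distributions stored as dictionaries. The function names and the additive smoothing constant `epsilon` are assumptions; smoothing is added only so that the KL divergence stays finite when a theme is missing from one distribution. The text does not say how a KL distance is converted into a matching degree, so the sketch returns the raw divergence, where a smaller value means a closer match.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two sparse theme distributions (dict theme -> prob)."""
    dot = sum(prob * q.get(theme, 0.0) for theme, prob in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # an empty distribution matches nothing
    return dot / (norm_p * norm_q)


def kl_divergence(p, q, epsilon=1e-9):
    """Kullback-Leibler divergence D(p || q) over the union of themes.

    Both distributions are smoothed by epsilon and renormalized first,
    so that themes absent from one side do not produce log(0).
    """
    themes = set(p) | set(q)
    if not themes:
        return 0.0
    p_s = {t: p.get(t, 0.0) + epsilon for t in themes}
    q_s = {t: q.get(t, 0.0) + epsilon for t in themes}
    z_p = sum(p_s.values())
    z_q = sum(q_s.values())
    return sum(
        (p_s[t] / z_p) * math.log((p_s[t] / z_p) / (q_s[t] / z_q))
        for t in themes
    )
```

With the query distribution from the "Super Rabbit" example, a document concentrated on Topic:1288 alone scores a low cosine similarity, since most of the query's probability mass lies on Topic:226.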
Step 206: Calculate at least one of the demand matching degree, time attribute matching degree, regional attribute matching degree, entity attribute matching degree, and semantic dependency relation matching degree between the query and each document, and combine it with the theme matching degree to calculate the matching degree between the query and each document.
The demand matching degree between the query and a document can be obtained by calculating the similarity between the demand feature of the query and the demand feature of the document, where the similarity can be calculated using cosine similarity, KL distance, etc.
The time attribute matching degree, regional attribute matching degree, entity attribute matching degree, and semantic dependency relation matching degree between the query and a document can be determined according to whether the time attribute, regional attribute, entity attribute, and semantic dependency relation of the query and the document are consistent. For example, if they are consistent, the matching degree is set to 1; if they do not match, it is set to 0.8.
The matching degree Score(q, d) between the query and a document can be calculated according to formula (1):
$$\mathrm{Score}(q,d) = \mathrm{Sem}(q,d)\,\bigl[\lambda_1\,\mathrm{Score}_1 + \lambda_2\,\mathrm{Score}_2\bigr] \qquad (1)$$
Here, $\mathrm{Score}_1$ is the theme matching degree between the query and the document, $\mathrm{Score}_2$ is the demand matching degree between the query and the document, and $\mathrm{Sem}(q,d)$ reflects at least one of the time attribute matching degree, regional attribute matching degree, entity attribute matching degree, and semantic dependency relation matching degree between the query and the document. $\lambda_1$ and $\lambda_2$ are preset weight parameters whose sum can be 1; if the demand matching degree does not participate in the calculation, $\lambda_2$ can be set to 0. $\mathrm{Sem}(q,d)$ can be the product of at least one of the time attribute matching degree, regional attribute matching degree, entity attribute matching degree, and semantic dependency relation matching degree. It can also be set as a whole: for example, if the time attribute, regional attribute, entity attribute, and semantic dependency relation all match, $\mathrm{Sem}(q,d)$ takes 1, and if any one of them does not match, $\mathrm{Sem}(q,d)$ takes 0.8. The matching degree calculation in this case is illustrated in Fig. 4.
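Formula (1) and the two ways of forming Sem(q, d) can be sketched as follows. The function names and the default weights 0.7 and 0.3 are assumptions chosen only so that the weights sum to 1; the values 1 and 0.8 for matching and non-matching attributes are the example values given in the text.

```python
def semantic_factor(attribute_matches, mismatch_value=0.8, whole=False):
    """Compute Sem(q, d) from per-attribute match flags.

    attribute_matches: dict attribute name -> bool (True if query and document agree).
    mismatch_value: matching degree used for a non-matching attribute (0.8 in the text).
    whole: if True, use the whole-setting mode (1 if everything matches, else mismatch_value);
        otherwise multiply the per-attribute matching degrees.
    """
    if whole:
        return 1.0 if all(attribute_matches.values()) else mismatch_value
    factor = 1.0
    for matched in attribute_matches.values():
        factor *= 1.0 if matched else mismatch_value
    return factor


def match_score(theme_match, demand_match, sem, lambda1=0.7, lambda2=0.3):
    """Score(q, d) = Sem(q, d) * (lambda1 * Score1 + lambda2 * Score2).

    lambda1 and lambda2 are preset weights; the defaults here are illustrative only.
    Set lambda2 to 0 to leave the demand matching degree out of the calculation.
    """
    return sem * (lambda1 * theme_match + lambda2 * demand_match)
```

Note that the two modes differ when several attributes disagree: the product mode penalizes each mismatch separately, while the whole-setting mode applies a single penalty.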
Step 207: Determine the search results of the query according to the matching degree between each document and the query.
In this step, documents whose matching degree with the query meets a preset matching degree requirement can be included in the search results, and the order of the documents in the search results can be determined according to their matching degree with the query.
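Step 207 can be sketched as a filter followed by a sort. Treating the "preset matching degree requirement" as a minimum score is an assumption, as are the function and parameter names.

```python
def search_results(scores, min_score):
    """Select and order documents for a query.

    scores: dict document id -> matching degree with the query.
    min_score: assumed form of the preset requirement; documents scoring below it are dropped.
    Returns document ids ordered by descending matching degree (ties by id).
    """
    kept = [(doc_id, score) for doc_id, score in scores.items() if score >= min_score]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in kept]
```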
The above is a detailed description of the method provided by the present invention. The device provided by the present invention is described in detail below with reference to Embodiment 4.
Embodiment 4
Fig. 3 is a structural diagram of the search device provided by Embodiment 4 of the present invention. As shown in Fig. 3, the device can comprise: a first subject analysis unit 300, a second subject analysis unit 310, a theme matching degree computing unit 320, and a search result determination unit 330.
The first subject analysis unit 300 is configured to use the subject analysis model to perform subject analysis on each document in the document library and determine the theme distribution of each document.
The first subject analysis unit 300 can specifically comprise a first descriptor determination subelement 301 and a first distribution determination subelement 302.
The first descriptor determination subelement 301 is configured to perform word segmentation on each document in the document library and select, from the word segmentation result, the words that meet a statistical requirement as descriptors, obtaining the descriptor set of each document.
Descriptors can be selected from the word segmentation result based on TF or TF-IDF, i.e., the words whose TF or TF-IDF meets the requirement are selected as descriptors. However, for documents whose words are relatively dispersed, the word-frequency statistics show no obvious features. Moreover, in some spam documents the cheater piles up words unrelated to the theme of the text, so relying on word-frequency information alone clearly reflects the theme inaccurately. Therefore, when selecting the words that meet the statistical requirement from the word segmentation result as descriptors, the word segmentation result can be clustered, and the M most frequent words can be selected as descriptors from each category whose word count meets a preset quantity requirement, where M is a positive integer.
The first distribution determination subelement 302 is configured to input the descriptor set of each document into the subject analysis model to obtain the themes each document is mapped to and the probability of being mapped to each theme.
The second subject analysis unit 310 is configured to use the subject analysis model to perform subject analysis on the query input by the user and determine the theme distribution of the query.
The second subject analysis unit 310 can specifically comprise a second descriptor determination subelement 311 and a second distribution determination subelement 312.
The second descriptor determination subelement 311 is configured to perform word segmentation on the query input by the user and select, from the word segmentation result, the words that meet an IDF statistical requirement as descriptors. The selected descriptors form the descriptor set of the query; alternatively, related-term expansion is further performed on the selected descriptors, and the selected descriptors together with the related terms obtained by expansion form the descriptor set of the query.
Related-term expansion can be based on an existing related-term dictionary, which can be mined on the following principle: if two words occur in the same context in the corpus, the two words are semantically related and are related terms of each other.
The second distribution determination subelement 312 is configured to input the descriptor set of the query into the subject analysis model to obtain the themes the query is mapped to and the probability of being mapped to each theme.
The theme matching degree computing unit 320 is configured to calculate the theme matching degree between the query and each document using the theme distribution of the query and the theme distribution of each document. Specifically, the theme matching degree can be obtained by calculating the cosine similarity or the relative entropy (KL) distance between the theme distribution of the query and the theme distribution of each document.
The search result determination unit 330 is configured to obtain the matching degree between the query and each document using the theme matching degree, and to determine the search results of the query according to the matching degree between the query and each document.
The subject analysis model used in the embodiments of the present invention comprises the descriptors corresponding to each theme and the weight of each descriptor in its theme. The model can take forms including, but not limited to, a PLSA model or an LDA model. The descriptors used in the model comprise at least one of the following: field words, polysemous words, entity words, or binary combinations of entity words.
To obtain the subject analysis model, the device can further comprise a topic model training unit 340 for training the subject analysis model; the training process is described in Embodiment 2 above. Preferably, to prevent probabilistic theme models such as PLSA or LDA from failing to converge, and to avoid similar themes being produced because of term ambiguity, when descriptors are extracted from documents during training, the semantically related terms of each descriptor are further extracted by statistical means, and the semantically related terms of each descriptor are used as a separate document in the corpus to participate in the training of the subject analysis model. This can greatly reduce the influence of noise and ambiguity and improve the accuracy of subject analysis.
Besides calculating the matching degree based on theme characteristics, other additional features can also be combined. In view of this, the device can further comprise a first supplementary feature determining unit 350, a second supplementary feature determining unit 360, and an additional matching degree computing unit 370.
The first supplementary feature determining unit 350 is configured to determine at least one of the demand feature, time attribute feature, regional attribute feature, entity attribute feature, and semantic dependency relation feature of each document.
The demand feature of a document is analyzed based on a demand analysis model established in advance. This demand analysis model comprises the demand words contained in each demand type and the weights of those demand words. The demand analysis model is established in a way similar to the subject analysis model, except that the training corpus is based on demand words, so it is not described in detail here. Demand words are extracted from the document; this extraction can be based on a dictionary or on templates. After the demand word set of each document is obtained, it is input into the demand analysis model to obtain the demand distribution of each document, i.e., the demand types each document is mapped to and the probability of being mapped to each demand type.
The time attribute feature and regional attribute feature of a document can be analyzed based on static attributes of the document, such as its publication time or the region where it was produced. They can also be based on the content of the document, for example by extracting time information or region information from the document.
The entity attribute feature of a document is analyzed based on its content: the content is analyzed and structured information of the form "entity - entity attribute - attribute value" is extracted from it. This extraction can be template-based and can use existing techniques, which are not described here.
The semantic dependency relation feature of a document is analyzed by matching the semantic chunks in the document against preset dependency templates, thereby determining the semantic chunks that have a semantic dependency relation and the corresponding relation. A semantic chunk can be represented as a binary combination of words. The types of semantic dependency relation can include, but are not limited to, the qualified relation and the donor and accepter relationship. A qualified relation is one in which one word qualifies another, such as "Yao Ming's height" or "Super Rabbit usage". A donor and accepter relationship is one in which one word is the agent and the other is the receiver of the action, and their order cannot be reversed. For example, in "Li Na defeats Sharapova", "Li Na, Sharapova" is a donor and accepter relationship, and in "euro exchanged for dollar", "euro, dollar" is a donor and accepter relationship.
The dependency templates can be obtained by mining search logs or can be defined manually. For example, templates for the qualified relation can include [basketball star] height, [software] usage, etc. Templates for the donor and accepter relationship can include [noun] verb [noun], etc.
The second supplementary feature determining unit 360 is configured to determine at least one of the demand feature, time attribute feature, regional attribute feature, entity attribute feature, and semantic dependency relation feature of the query.
Likewise, the demand feature of the query is analyzed based on the demand analysis model established in advance. Demand words are first extracted from the query; this extraction can be based on a dictionary or on templates and can use existing techniques, which are not described here. The demand words of the query are then input into the demand analysis model to obtain the demand distribution of the query, i.e., the demand types the query is mapped to and the probability of being mapped to each demand type.
The time attribute feature of the query is analyzed mainly based on the content of the query: words related to time are extracted from the query as the time attribute feature. For example, if the user inputs the query "what major events occurred in 2008", "2008" is the time attribute feature of this query.
The regional attribute feature of the query can be analyzed based on attributes of the user who inputs the query, for example by using the geographic location corresponding to the user's IP address as the regional attribute feature: if the user who inputs the query is in Beijing, "Beijing" can be used as the regional attribute feature of the query. It can also be based on the content of the query, by extracting words that denote a region from the query as the regional attribute feature. For example, if the user inputs the query "which five-star hotels are there in Beijing", "Beijing" is extracted as the regional attribute feature of this query.
The entity attribute feature of the query is analyzed based on the content of the query: structured information of the form "entity - entity attribute" is extracted from the query. This extraction can be template-based and can likewise use existing techniques. For example, for the user input "what is Yao Ming's height", the extracted entity attribute feature is "Yao Ming - height". The purpose of extracting this feature is the subsequent matching against documents: if a document has exactly the same entity attribute feature, the document can directly return the attribute value, so the matching degree between the document and the query can be raised.
The semantic dependency relation feature of the query is analyzed by matching the semantic chunks in the query against the preset dependency templates, thereby determining the semantic chunks that have a semantic dependency relation and the corresponding relation. A semantic chunk can be represented as a binary combination of words. The types of semantic dependency relation can include, but are not limited to, the qualified relation and the donor and accepter relationship.
The additional matching degree computing unit 370 is configured to perform at least one of the following matching degree calculations: calculating the demand matching degree between the query and each document from the demand feature of the query and the demand feature of each document; calculating the time attribute matching degree between the query and each document from the time attribute feature of the query and the time attribute feature of each document; calculating the regional attribute matching degree between the query and each document from the regional attribute feature of the query and the regional attribute feature of each document; calculating the entity attribute matching degree between the query and each document from the entity attribute feature of the query and the entity attribute feature of each document; or calculating the semantic dependency relation matching degree between the query and each document from the semantic dependency relation feature of the query and the semantic dependency relation feature of each document.
In this case, the search result determination unit 330 further combines at least one of the demand matching degree, time attribute matching degree, regional attribute matching degree, entity attribute matching degree, and semantic dependency relation matching degree between the query and each document to obtain the matching degree between the query and each document. The matching degree calculation in this case is illustrated in Fig. 4.
Specifically, the search result determination unit 330 can obtain the matching degree Score(q, d) between the query and each document according to the following formula:
$$\mathrm{Score}(q,d) = \mathrm{Sem}(q,d)\,\bigl[\lambda_1\,\mathrm{Score}_1 + \lambda_2\,\mathrm{Score}_2\bigr]$$
Here, $\mathrm{Score}_1$ is the theme matching degree between the query and the document, $\mathrm{Score}_2$ is the demand matching degree between the query and the document, and $\mathrm{Sem}(q,d)$ reflects at least one of the time attribute matching degree, regional attribute matching degree, entity attribute matching degree, and semantic dependency relation matching degree between the query and the document. $\lambda_1$ and $\lambda_2$ are preset weight parameters whose sum can be 1; if the demand matching degree does not participate in the calculation, $\lambda_2$ can be set to 0. $\mathrm{Sem}(q,d)$ can be the product of at least one of the time attribute matching degree, regional attribute matching degree, entity attribute matching degree, and semantic dependency relation matching degree. It can also be set as a whole: for example, if the time attribute, regional attribute, entity attribute, and semantic dependency relation all match, $\mathrm{Sem}(q,d)$ takes 1, and if any one of them does not match, $\mathrm{Sem}(q,d)$ takes 0.8.
When determining the search results, the search result determination unit 330 specifically includes in the search results the documents whose matching degree with the query meets a preset matching degree requirement, and determines the order of each document in the search results according to its matching degree with the query.
In the device provided by the embodiment of the present invention, the first subject analysis unit 300, the first supplementary feature determining unit 350, and the topic model training unit 340 can operate offline, while the other units can operate in real time on the query input by the user.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc., made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (24)

1. A theme-based search method, characterized in that a subject analysis model is used to perform subject analysis on each document in a document library and determine the theme distribution corresponding to each document, the method further comprising:
S1: using the subject analysis model to perform subject analysis on the query input by the user, and determining the theme distribution corresponding to said query;
S2: using the theme distribution corresponding to said query and the theme distribution corresponding to each document, calculating the theme matching degree between said query and each document;
S3: using the theme matching degree to obtain the matching degree between said query and each document, and determining the search results of said query according to the matching degree between said query and each document.
2. method according to claim 1, is characterized in that, described subject analysis model comprises: the descriptor that each theme is corresponding and each descriptor are at the weights of affiliated theme.
3. method according to claim 2, is characterized in that, described subject analysis model is that probability latent semantic analysis PLSA model or potential Di Li Cray distribute the LDA model.
4. method according to claim 3, is characterized in that, described descriptor comprise following at least one: the binary combination of field word, polysemant, entity word or entity contamination.
5. The method according to claim 3, characterized in that, during training of the topic analysis model, when topic words are extracted from documents, semantically related words of the topic words are further extracted by statistical means, and the semantically related words of each topic word are treated as a separate document in the corpus that participates in training the topic analysis model.
6. The method according to claim 2, characterized in that using the topic analysis model to perform topic analysis on each document in the document library and determining the topic distribution corresponding to each document specifically comprises:
A1: performing word segmentation on each document in the document library, and selecting from the segmentation result the words that meet a statistical requirement as topic words, to obtain the topic word set of each document;
A2: inputting the topic word set corresponding to each document into the topic analysis model, to obtain the topics to which each document is mapped and the probability of its being mapped to each topic.
7. The method according to claim 6, characterized in that selecting from the segmentation result the words that meet the statistical requirement as topic words comprises:
clustering the segmentation result and, from the clusters whose word count meets a preset quantity requirement, selecting the M most frequently occurring words as topic words, where M is a positive integer.
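The selection step in claim 7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the cluster assignment `cluster_of` is taken as already computed by some clustering method, the "preset quantity requirement" is read as a minimum number of distinct words per cluster, and the names `select_by_cluster`, `min_cluster_size`, and `top_m` are hypothetical.

```python
from collections import Counter, defaultdict


def select_by_cluster(tokens, cluster_of, min_cluster_size, top_m):
    """Pick the top_m most frequent words from each sufficiently large cluster.

    tokens: segmented words of one document.
    cluster_of: dict mapping word -> cluster id (assumed precomputed).
    min_cluster_size: assumed reading of the preset quantity requirement.
    """
    counts = Counter(tokens)

    # Group the distinct words of the document by their cluster.
    clusters = defaultdict(list)
    for word in counts:
        if word in cluster_of:
            clusters[cluster_of[word]].append(word)

    selected = []
    for cluster_id in sorted(clusters):
        words = clusters[cluster_id]
        if len(words) < min_cluster_size:
            continue  # cluster too small to contribute topic words
        # Sort by descending occurrence count, breaking ties alphabetically.
        words.sort(key=lambda w: (-counts[w], w))
        selected.extend(words[:top_m])
    return selected
```

With `top_m` playing the role of M, small clusters (which tend to hold incidental words) contribute nothing to the topic word set.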
8. The method according to claim 2, characterized in that step S1 specifically comprises:
S11: performing word segmentation on the query input by the user, and selecting from the segmentation result the words that meet an inverse document frequency (IDF) statistical requirement as topic words; forming the topic word set corresponding to the query from the selected topic words, or further performing related-word expansion on the selected topic words and forming the topic word set corresponding to the query from the selected topic words together with the related words obtained by expansion;
S12: inputting the topic word set corresponding to the query into the topic analysis model, to obtain the topics to which the query is mapped and the probability of its being mapped to each topic.
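The IDF-based selection in step S11 can be sketched as below. This is a hedged illustration only: the claim does not define the IDF statistical requirement, so the smoothed IDF formula and the threshold `min_idf` are assumptions, and the function names and the `doc_freq` table are hypothetical. The input is taken to be an already segmented query.

```python
import math


def idf(term, doc_freq, num_docs):
    """Smoothed inverse document frequency (the smoothing is an assumption)."""
    return math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1)) + 1.0


def select_topic_words(query_tokens, doc_freq, num_docs, min_idf):
    """Keep query words whose IDF reaches min_idf, preserving first-seen order."""
    seen = set()
    selected = []
    for token in query_tokens:
        if token in seen:
            continue
        seen.add(token)
        if idf(token, doc_freq, num_docs) >= min_idf:
            selected.append(token)
    return selected
```

Words that occur in nearly every document score close to the minimum IDF and are dropped, so only discriminative words reach the topic analysis model.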
9. The method according to any one of claims 1 to 8, characterized in that calculating the topic matching degree between the query and each document comprises:
obtaining the topic matching degree between the query and each document by calculating the cosine similarity or the relative entropy (KL distance) between the topic distribution corresponding to the query and the topic distribution corresponding to each document.
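The two measures named in claim 9 can be sketched as below, treating each topic (theme) distribution as a list of probabilities over the same ordered set of topics. This is a minimal illustration, not the patent's implementation. The `eps` smoothing term is an assumption added to avoid division by zero. Note also that cosine similarity grows with a better match while KL distance shrinks, so a KL-based matching degree would need some inversion that the claim does not specify.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_p * norm_q)


def kl_divergence(p, q, eps=1e-12):
    """Relative entropy KL(p || q); eps is an assumed smoothing constant."""
    return sum(
        a * math.log((a + eps) / (b + eps))
        for a, b in zip(p, q)
        if a > 0.0
    )
```

For example, `cosine_similarity(query_topics, doc_topics)` could be evaluated once per document, with the document distributions computed ahead of time.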
10. The method according to claim 1, characterized in that the method further comprises: determining at least one of a demand feature, a time attribute feature, a region attribute feature, an entity attribute feature, and a semantic dependency feature of each document;
step S1 further comprises: determining at least one of a demand feature, a time attribute feature, a region attribute feature, an entity attribute feature, and a semantic dependency feature of the query;
step S2 further comprises at least one of the following matching degree calculations: calculating a demand matching degree between the query and each document from the demand feature of the query and the demand feature of each document; calculating a time attribute matching degree between the query and each document from the time attribute feature of the query and the time attribute feature of each document; calculating a region attribute matching degree between the query and each document from the region attribute feature of the query and the region attribute feature of each document; calculating an entity attribute matching degree between the query and each document from the entity attribute feature of the query and the entity attribute feature of each document; or calculating a semantic dependency matching degree between the query and each document from the semantic dependency feature of the query and the semantic dependency feature of each document;
in step S3, the matching degree between the query and each document is obtained by further combining at least one of the demand matching degree, time attribute matching degree, region attribute matching degree, entity attribute matching degree, and semantic dependency matching degree between the query and each document.
11. The method according to claim 10, characterized in that the matching degree Score(q, d) between the query and each document is obtained according to the following formula:
$$\mathrm{Score}(q,d) = \mathrm{Sem}(q,d)\,\bigl[\lambda_1\,\mathrm{Score}_1 + \lambda_2\,\mathrm{Score}_2\bigr]$$
where $\mathrm{Score}_1$ is the topic matching degree between the query and the document, $\mathrm{Score}_2$ is the demand matching degree between the query and the document, $\mathrm{Sem}(q,d)$ represents at least one of the time attribute matching degree, region attribute matching degree, entity attribute matching degree, and semantic dependency matching degree between the query and the document, and $\lambda_1$ and $\lambda_2$ are preset weighting parameters.
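The scoring formula of claim 11 is a weighted sum of the topic (theme) and demand matching degrees, scaled by the Sem(q, d) factor. A minimal sketch follows. The function name, the parameter names, and the default weights of 0.5 are hypothetical, since the claim only says the weights are preset, and how Sem(q, d) is derived from the individual attribute matching degrees is left open here.

```python
def combined_score(sem, topic_score, demand_score, lambda1=0.5, lambda2=0.5):
    """Score(q, d) = Sem(q, d) * (lambda1 * Score1 + lambda2 * Score2).

    sem: the Sem(q, d) factor, assumed already computed.
    topic_score: Score1, the topic matching degree.
    demand_score: Score2, the demand matching degree.
    lambda1, lambda2: preset weights (default values are assumptions).
    """
    return sem * (lambda1 * topic_score + lambda2 * demand_score)
```

Because Sem(q, d) multiplies the whole sum, a document that fails the attribute match (Sem near zero) is suppressed regardless of how well its topics match.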
12. The method according to claim 1, characterized in that, in step S3, determining the search results for the query according to the matching degree between the query and each document comprises:
including in the search results the documents whose matching degree with the query meets a preset matching degree requirement, and determining the order of the documents in the search results according to their matching degree with the query.
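The result selection in claim 12 amounts to a filter followed by a sort. The sketch below assumes the preset matching degree requirement is a minimum score and that documents are ordered by descending matching degree. Both are interpretations, and `rank_results` and `min_score` are hypothetical names.

```python
def rank_results(scores, min_score):
    """Return document ids whose score meets min_score, best match first.

    scores: dict mapping document id -> matching degree with the query.
    """
    kept = [(doc_id, score) for doc_id, score in scores.items() if score >= min_score]
    # Descending score; ties are broken by document id for a stable order.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in kept]
```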
13. A topic-based search device, characterized in that the device comprises:
a first topic analysis unit, configured to perform topic analysis on each document in a document library using a topic analysis model, to determine the topic distribution corresponding to each document;
a second topic analysis unit, configured to perform topic analysis on a query input by a user using the topic analysis model, to determine the topic distribution corresponding to the query;
a topic matching degree calculation unit, configured to calculate a topic matching degree between the query and each document using the topic distribution corresponding to the query and the topic distribution corresponding to each document;
a search result determination unit, configured to obtain a matching degree between the query and each document using the topic matching degree, and to determine the search results for the query according to the matching degree between the query and each document.
14. The device according to claim 13, characterized in that the topic analysis model comprises: the topic words corresponding to each topic, and the weight of each topic word within the topic to which it belongs.
15. The device according to claim 14, characterized in that the topic analysis model is a probabilistic latent semantic analysis (PLSA) model or a latent Dirichlet allocation (LDA) model.
16. The device according to claim 15, characterized in that the topic words comprise at least one of the following: domain words, polysemous words, entity words, or binary combinations of entity words.
17. The device according to claim 15, characterized in that the device further comprises: a topic model training unit, configured to train the topic analysis model and, during training, when topic words are extracted from documents, to further extract semantically related words of the topic words by statistical means, the semantically related words of each topic word being treated as a separate document in the corpus that participates in training the topic analysis model.
18. The device according to claim 14, characterized in that the first topic analysis unit specifically comprises:
a first topic word determining subunit, configured to perform word segmentation on each document in the document library and to select from the segmentation result the words that meet a statistical requirement as topic words, to obtain the topic word set of each document;
a first distribution determining subunit, configured to input the topic word set corresponding to each document into the topic analysis model, to obtain the topics to which each document is mapped and the probability of its being mapped to each topic.
19. The device according to claim 18, characterized in that, when selecting from the segmentation result the words that meet the statistical requirement as topic words, the first topic word determining subunit clusters the segmentation result and, from the clusters whose word count meets a preset quantity requirement, selects the M most frequently occurring words as topic words, where M is a positive integer.
20. The device according to claim 14, characterized in that the second topic analysis unit specifically comprises:
a second topic word determining subunit, configured to perform word segmentation on the query input by the user and to select from the segmentation result the words that meet an inverse document frequency (IDF) statistical requirement as topic words; and to form the topic word set corresponding to the query from the selected topic words, or to further perform related-word expansion on the selected topic words and form the topic word set corresponding to the query from the selected topic words together with the related words obtained by expansion;
a second distribution determining subunit, configured to input the topic word set corresponding to the query into the topic analysis model, to obtain the topics to which the query is mapped and the probability of its being mapped to each topic.
21. The device according to any one of claims 13 to 20, characterized in that the topic matching degree calculation unit obtains the topic matching degree between the query and each document by calculating the cosine similarity or the relative entropy (KL distance) between the topic distribution corresponding to the query and the topic distribution corresponding to each document.
22. The device according to claim 13, characterized in that the device further comprises:
a first additional feature determining unit, configured to determine at least one of a demand feature, a time attribute feature, a region attribute feature, an entity attribute feature, and a semantic dependency feature of each document;
a second additional feature determining unit, configured to determine at least one of a demand feature, a time attribute feature, a region attribute feature, an entity attribute feature, and a semantic dependency feature of the query;
an additional matching degree calculation unit, configured to perform at least one of the following matching degree calculations: calculating a demand matching degree between the query and each document from the demand feature of the query and the demand feature of each document; calculating a time attribute matching degree between the query and each document from the time attribute feature of the query and the time attribute feature of each document; calculating a region attribute matching degree between the query and each document from the region attribute feature of the query and the region attribute feature of each document; calculating an entity attribute matching degree between the query and each document from the entity attribute feature of the query and the entity attribute feature of each document; or calculating a semantic dependency matching degree between the query and each document from the semantic dependency feature of the query and the semantic dependency feature of each document;
the search result determination unit obtains the matching degree between the query and each document by further combining at least one of the demand matching degree, time attribute matching degree, region attribute matching degree, entity attribute matching degree, and semantic dependency matching degree between the query and each document.
23. The device according to claim 22, characterized in that the search result determination unit obtains the matching degree Score(q, d) between the query and each document according to the following formula:
$$\mathrm{Score}(q,d) = \mathrm{Sem}(q,d)\,\bigl[\lambda_1\,\mathrm{Score}_1 + \lambda_2\,\mathrm{Score}_2\bigr]$$
where $\mathrm{Score}_1$ is the topic matching degree between the query and the document, $\mathrm{Score}_2$ is the demand matching degree between the query and the document, $\mathrm{Sem}(q,d)$ represents at least one of the time attribute matching degree, region attribute matching degree, entity attribute matching degree, and semantic dependency matching degree between the query and the document, and $\lambda_1$ and $\lambda_2$ are preset weighting parameters.
24. The device according to claim 13, characterized in that the search result determination unit specifically includes in the search results the documents whose matching degree with the query meets a preset matching degree requirement, and determines the order of the documents in the search results according to their matching degree with the query.
CN2012101668529A 2012-05-25 2012-05-25 Subject-based searching method and device Pending CN103425710A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012101668529A CN103425710A (en) 2012-05-25 2012-05-25 Subject-based searching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012101668529A CN103425710A (en) 2012-05-25 2012-05-25 Subject-based searching method and device

Publications (1)

Publication Number Publication Date
CN103425710A (en) 2013-12-04

Family

ID=49650465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012101668529A Pending CN103425710A (en) 2012-05-25 2012-05-25 Subject-based searching method and device

Country Status (1)

Country Link
CN (1) CN103425710A (en)

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103870563A (en) * 2014-03-07 2014-06-18 北京奇虎科技有限公司 Method and device for determining subject distribution of given text
WO2014177104A1 (en) * 2013-12-27 2014-11-06 中兴通讯股份有限公司 Search processing method, device and terminal
CN104933022A (en) * 2014-03-20 2015-09-23 株式会社东芝 Information processing device and information processing method
CN105224521A (en) * 2015-09-28 2016-01-06 北大方正集团有限公司 Key phrases extraction method and use its method obtaining correlated digital resource and device
CN105354182A (en) * 2015-09-28 2016-02-24 北大方正集团有限公司 Method for obtaining related digital resources and method and apparatus for generating special topic by using method
CN105447005A (en) * 2014-08-08 2016-03-30 百度在线网络技术(北京)有限公司 Object push method and device
CN105589972A (en) * 2016-01-08 2016-05-18 天津车之家科技有限公司 Method and device for training classification model, and method and device for classifying search words
CN105930400A (en) * 2016-04-15 2016-09-07 南京大学 Markov decision process model based session search method
CN106055538A (en) * 2016-05-26 2016-10-26 达而观信息科技(上海)有限公司 Automatic extraction method for text labels in combination with theme model and semantic analyses
CN106649818A (en) * 2016-12-29 2017-05-10 北京奇虎科技有限公司 Recognition method and device for application search intentions and application search method and server
CN106709040A (en) * 2016-12-29 2017-05-24 北京奇虎科技有限公司 Application search method and server
CN106951420A (en) * 2016-01-06 2017-07-14 富士通株式会社 Literature search method and apparatus, author's searching method and equipment
CN107103049A (en) * 2017-03-31 2017-08-29 努比亚技术有限公司 A kind of recommendation method and the network equipment
CN107133210A (en) * 2017-04-20 2017-09-05 中国科学院上海高等研究院 Scheme document creation method and system
CN107239438A (en) * 2016-03-28 2017-10-10 阿里巴巴集团控股有限公司 A kind of document analysis method and device
CN108040052A (en) * 2017-12-13 2018-05-15 北京明朝万达科技股份有限公司 A kind of network security threats analysis method and system based on Netflow daily record datas
CN108268443A (en) * 2017-12-21 2018-07-10 北京百度网讯科技有限公司 It determines the transfer of topic point and obtains the method, apparatus for replying text
CN108345605A (en) * 2017-01-24 2018-07-31 苏宁云商集团股份有限公司 A kind of text search method and device
CN109145267A (en) * 2018-08-20 2019-01-04 六度云计算有限公司 Legal decision case matching process and device based on topic model
CN109344252A (en) * 2018-09-12 2019-02-15 东北大学 Microblogging file classification method and system based on high-quality topic expansion
CN110020153A (en) * 2017-11-30 2019-07-16 北京搜狗科技发展有限公司 A kind of searching method and device
CN110287289A (en) * 2019-06-25 2019-09-27 北京金海群英网络信息技术有限公司 A kind of document keyword extraction and the method based on document matches commodity
CN110472233A (en) * 2019-07-16 2019-11-19 清华大学 The relationship method for measuring similarity and system that entity is distributed end to end in knowledge based library
CN110659655A (en) * 2018-06-28 2020-01-07 北京三快在线科技有限公司 Index classification method and device and computer readable storage medium
CN110703968A (en) * 2018-07-09 2020-01-17 北京搜狗科技发展有限公司 Searching method and related device
CN110737750A (en) * 2018-07-03 2020-01-31 百度在线网络技术(北京)有限公司 Data processing method and device for analyzing text audience and electronic equipment
CN111666371A (en) * 2020-04-21 2020-09-15 北京三快在线科技有限公司 Theme-based matching degree determination method and device, electronic equipment and storage medium
CN112100480A (en) * 2020-09-15 2020-12-18 北京百度网讯科技有限公司 Search method, device, equipment and storage medium
CN112685440A (en) * 2020-12-31 2021-04-20 王程 Structural query information expression method for marking search semantic role
CN113343104A (en) * 2021-06-29 2021-09-03 深圳前海微众银行股份有限公司 Information recommendation method, device, equipment and computer storage medium
CN115809312A (en) * 2023-02-02 2023-03-17 量子数科科技有限公司 Search recall method based on multi-channel recall

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1211769A (en) * 1997-06-26 1999-03-24 香港中文大学 Method and equipment for file retrieval based on Bayesian network
US20060248074A1 (en) * 2005-04-28 2006-11-02 International Business Machines Corporation Term-statistics modification for category-based search
CN101840438A (en) * 2010-05-25 2010-09-22 刘宏 Retrieval system oriented to meta keywords of source document

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jia Xiping et al.: "Topic-based document retrieval model", Journal of South China University of Technology (Natural Science Edition) *

Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014177104A1 (en) * 2013-12-27 2014-11-06 中兴通讯股份有限公司 Search processing method, device and terminal
CN103870563A (en) * 2014-03-07 2014-06-18 北京奇虎科技有限公司 Method and device for determining subject distribution of given text
CN103870563B (en) * 2014-03-07 2017-03-29 北京奇虎科技有限公司 It is determined that the method and apparatus of the theme distribution of given text
CN104933022A (en) * 2014-03-20 2015-09-23 株式会社东芝 Information processing device and information processing method
CN104933022B (en) * 2014-03-20 2018-11-13 株式会社东芝 Information processing unit and information processing method
CN105447005A (en) * 2014-08-08 2016-03-30 百度在线网络技术(北京)有限公司 Object push method and device
CN105447005B (en) * 2014-08-08 2020-03-17 北京小度互娱科技有限公司 Object pushing method and device
CN105224521A (en) * 2015-09-28 2016-01-06 北大方正集团有限公司 Key phrases extraction method and use its method obtaining correlated digital resource and device
CN105354182A (en) * 2015-09-28 2016-02-24 北大方正集团有限公司 Method for obtaining related digital resources and method and apparatus for generating special topic by using method
CN105354182B (en) * 2015-09-28 2018-06-26 北大方正集团有限公司 The method and the method and device using its generation special topic for obtaining correlated digital resource
CN105224521B (en) * 2015-09-28 2018-05-25 北大方正集团有限公司 Key phrases extraction method and the method and device using its acquisition correlated digital resource
CN106951420A (en) * 2016-01-06 2017-07-14 富士通株式会社 Literature search method and apparatus, author's searching method and equipment
CN105589972B (en) * 2016-01-08 2019-03-15 天津车之家科技有限公司 The method and device of train classification models, the method and device classified to search term
CN105589972A (en) * 2016-01-08 2016-05-18 天津车之家科技有限公司 Method and device for training classification model, and method and device for classifying search words
CN107239438A (en) * 2016-03-28 2017-10-10 阿里巴巴集团控股有限公司 A kind of document analysis method and device
CN105930400B (en) * 2016-04-15 2019-10-11 南京大学 A kind of session searching method based on markov decision process model
CN105930400A (en) * 2016-04-15 2016-09-07 南京大学 Markov decision process model based session search method
CN106055538A (en) * 2016-05-26 2016-10-26 达而观信息科技(上海)有限公司 Automatic extraction method for text labels in combination with theme model and semantic analyses
CN106055538B (en) * 2016-05-26 2019-03-08 达而观信息科技(上海)有限公司 The automatic abstracting method of the text label that topic model and semantic analysis combine
CN106709040A (en) * 2016-12-29 2017-05-24 北京奇虎科技有限公司 Application search method and server
CN106709040B (en) * 2016-12-29 2021-02-19 北京奇虎科技有限公司 Application search method and server
CN106649818B (en) * 2016-12-29 2020-05-15 北京奇虎科技有限公司 Application search intention identification method and device, application search method and server
CN106649818A (en) * 2016-12-29 2017-05-10 北京奇虎科技有限公司 Recognition method and device for application search intentions and application search method and server
CN108345605B (en) * 2017-01-24 2022-04-05 苏宁易购集团股份有限公司 Text search method and device
CN108345605A (en) * 2017-01-24 2018-07-31 苏宁云商集团股份有限公司 A kind of text search method and device
CN107103049A (en) * 2017-03-31 2017-08-29 努比亚技术有限公司 A kind of recommendation method and the network equipment
CN107133210A (en) * 2017-04-20 2017-09-05 中国科学院上海高等研究院 Scheme document creation method and system
CN110020153A (en) * 2017-11-30 2019-07-16 北京搜狗科技发展有限公司 A kind of searching method and device
CN108040052A (en) * 2017-12-13 2018-05-15 北京明朝万达科技股份有限公司 A kind of network security threats analysis method and system based on Netflow daily record datas
CN108268443A (en) * 2017-12-21 2018-07-10 北京百度网讯科技有限公司 It determines the transfer of topic point and obtains the method, apparatus for replying text
CN108268443B (en) * 2017-12-21 2022-02-25 北京百度网讯科技有限公司 Method and device for determining topic point transfer and acquiring reply text
CN110659655A (en) * 2018-06-28 2020-01-07 北京三快在线科技有限公司 Index classification method and device and computer readable storage medium
CN110659655B (en) * 2018-06-28 2021-03-02 北京三快在线科技有限公司 Index classification method and device and computer readable storage medium
CN110737750A (en) * 2018-07-03 2020-01-31 百度在线网络技术(北京)有限公司 Data processing method and device for analyzing text audience and electronic equipment
CN110703968A (en) * 2018-07-09 2020-01-17 北京搜狗科技发展有限公司 Searching method and related device
CN109145267A (en) * 2018-08-20 2019-01-04 六度云计算有限公司 Legal decision case matching process and device based on topic model
CN109344252A (en) * 2018-09-12 2019-02-15 东北大学 Microblogging file classification method and system based on high-quality topic expansion
CN110287289A (en) * 2019-06-25 2019-09-27 北京金海群英网络信息技术有限公司 A kind of document keyword extraction and the method based on document matches commodity
CN110472233B (en) * 2019-07-16 2021-02-12 清华大学 Relation similarity measurement method and system based on head-tail entity distribution in knowledge base
CN110472233A (en) * 2019-07-16 2019-11-19 清华大学 The relationship method for measuring similarity and system that entity is distributed end to end in knowledge based library
CN111666371A (en) * 2020-04-21 2020-09-15 北京三快在线科技有限公司 Theme-based matching degree determination method and device, electronic equipment and storage medium
CN112100480A (en) * 2020-09-15 2020-12-18 北京百度网讯科技有限公司 Search method, device, equipment and storage medium
CN112685440B (en) * 2020-12-31 2022-03-22 上海欣兆阳信息科技有限公司 Structural query information expression method for marking search semantic role
CN112685440A (en) * 2020-12-31 2021-04-20 王程 Structural query information expression method for marking search semantic role
CN113343104A (en) * 2021-06-29 2021-09-03 深圳前海微众银行股份有限公司 Information recommendation method, device, equipment and computer storage medium
CN115809312A (en) * 2023-02-02 2023-03-17 量子数科科技有限公司 Search recall method based on multi-channel recall
CN115809312B (en) * 2023-02-02 2023-04-07 量子数科科技有限公司 Search recall method based on multi-channel recall

Similar Documents

Publication Publication Date Title
CN103425710A (en) Subject-based searching method and device
US11182445B2 (en) Method, apparatus, server, and storage medium for recalling for search
US11507975B2 (en) Information processing method and apparatus
Bhatia et al. Automatic labelling of topics with neural embeddings
CN111177569B (en) Recommendation processing method, device and equipment based on artificial intelligence
Flati et al. Two is bigger (and better) than one: the wikipedia bitaxonomy project
US9997157B2 (en) Knowledge source personalization to improve language models
Hai et al. Identifying features in opinion mining via intrinsic and extrinsic domain relevance
WO2021218322A1 (en) Paragraph search method and apparatus, and electronic device and storage medium
Baralis et al. Mwi-sum: A multilingual summarizer based on frequent weighted itemsets
CN108920649B (en) Information recommendation method, device, equipment and medium
US20130018894A1 (en) System and method of sentiment data generation
US10152478B2 (en) Apparatus, system and method for string disambiguation and entity ranking
US9805120B2 (en) Query selection and results merging
CN103455562A (en) Text orientation analysis method and product review orientation discriminator on basis of same
CN110309446A (en) The quick De-weight method of content of text, device, computer equipment and storage medium
US10565253B2 (en) Model generation method, word weighting method, device, apparatus, and computer storage medium
CN103699625A (en) Method and device for retrieving based on keyword
CN106325488B (en) A kind of input method, input unit, server and input system
JP4969554B2 (en) Document search server and method using topic graph
Jin et al. Topic initiator detection on the world wide web
US20130018874A1 (en) System and method of sentiment data use
Shen et al. A graph-based approach for ontology population with named entities
CN111737997A (en) Text similarity determination method, text similarity determination equipment and storage medium
KR101651780B1 (en) Method and system for extracting association words exploiting big data processing technologies

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20131204
