Background Art
Search engines give users a tool for searching massive amounts of data and obtaining information. Not every user understands how a search engine works, however, so most users compose their own query statements and assume that the more query words they enter, and the more detailed the query, the more satisfactory the search results will be. In practice this is not necessarily so. On the one hand, for performance reasons a search engine imposes a maximum length on the query statement a user enters; a query that exceeds the maximum length is truncated, and only part of it is used for retrieval. On the other hand, any result that contains some of the query terms is returned, so the results include a large amount of irrelevant information, accuracy is low, and the user's real intention is missed.
Present search engines may also introduce merchant advertisements based on the user's input as a source of income. Sometimes, however, the advertisements retrieved have nothing to do with the user's input. The main cause is again that the search engine fails to identify the user's core demand and merely hits some of the query words.
Therefore, to make search results better satisfy the user's requirements and come closer to the user's essential demand, the search engine must understand the retrieval information the user enters. Given the complexity of natural language, a user's query statement contains many words that merely restrict or qualify, and these words have little practical significance for retrieval. The search engine therefore needs to identify the core, or trunk, of the query, so that search results hit the core words (trunk words) of the query statement rather than insignificant discardable words or modifiers. How to extract the corresponding core words from the user's search demand has become one of the urgent problems in query (Query) analysis in present search engines.
Ideally, when a user enters a query statement, the search engine automatically analyzes it. It identifies the core words of the input, which must be hit before a search result is returned, and it identifies the discardable words and modifiers, whose presence or absence in a result makes little difference. In this way the displayed results (including advertisements) can better satisfy the user's core demand.
To date, few schemes exist for identifying the core words of a user query in a search engine. They fall into roughly two kinds: one extracts core words after the fact from click information on search results; the other analyzes Chinese semantics based on a word framework.
For example, Chinese patent CN102043845A provides a method and apparatus for extracting core keywords based on query sequence clusters. When many search demands in the network lead to the same user-clicked search results, those search demands often reflect the same topic. The method obtains a query sequence cluster consisting of multiple query sequences, each of which corresponds to at least one identical user-clicked search result, and extracts the corresponding core keywords, thereby obtaining the search demand of the users who entered the query sequences in the cluster. Based on these core keywords it can also offer closer search suggestions or related search demands, so that the user gets a better search experience. Its shortcomings are as follows. First, it places high demands on the search engine: performance and result quality must be stable, and the search results must largely satisfy the user's demand; only then are the click results reliable and the analysis based on them consistent with the user's actual demand. Second, search results are generally obtained after the user's query has been processed, for example by Query expansion or Query synonym substitution, so the results do not necessarily contain the user's query terms, and the core words of the query cannot be extracted directly.
As another example, Chinese patent CN102681982A, a method for automatic semantic recognition of natural language sentences that a computer can understand, proposes a way for a computer to understand Chinese accurately. It abandons the earlier approach of selecting and picking words; starting from the linguistic features of Chinese, it uses a word framework to let the computer accurately know the language content entered by the operator and analyze the exact meaning of a Chinese sentence. First, an ontology library is built for a given field: all accurately described, unambiguous words in the field are collected to form the ontology library (comprising a domain-knowledge ontology library and a general-term ontology library). Then, based on the understanding of natural language sentences and on the domain ontology, a semantic frame knowledge base is built. Finally, ontology mapping based on the semantic frames matches natural language sentences to semantic structures. Its shortcomings are as follows. First, Internet information grows sharply every day; new terms keep emerging and common words gradually take on new meanings. Whether such a word acts as a core word or as a modifying auxiliary depends on the user's query statement and cannot be decided once and for all. Second, the semantic frame knowledge base resembles a rule set; it is enormous, cannot be compiled quickly, and its effectiveness needs further investigation and improvement.
Core word identification based on after-the-fact search data first places high demands on the search engine: it is viable only when system performance is stable and result quality is reasonably good. Second, it depends too heavily on search results and user reactions, which easily introduces unnecessary noise (such as advertisements and other information); and because search results are obtained through various transformations, they do not necessarily contain the user's query terms and do not necessarily correspond directly to the query statement. Third, results computed offline can serve only as a reference when later users enter the same or a similar Query, so the recall rate is low.
Core word identification based on a semantic frame knowledge base handles particular entities inadequately: it does not distinguish well those entity words that also have a common meaning. Moreover, the semantic frame knowledge base consists of rules over all kinds of words; compiling them takes a long time, and their effectiveness improves only gradually.
Summary of the Invention
The technical problem solved by the present invention is to provide a method and system for processing user query terms, so as to solve the present inability to identify the core words of a user query.
To solve the above problem, an embodiment of the invention provides a method for processing user query terms, comprising:
establishing a resource bank related to identifying the core words of a user query;
performing basic layering on the query terms entered by the user;
performing entity introduction on the query terms after the basic layering;
outputting the hierarchical structure of the identified query terms.
In the above method, the resource bank related to identifying the core words of a user query is a series of word lists related to that identification, comprising a stop word list, a modifier list and an entity resource dictionary.
In the above method, performing basic layering on the query terms entered by the user comprises:
segmenting the user's query statement into words to obtain a series of query words term and parts of speech pos, namely term[1]_pos[1], term[2]_pos[2], ..., term[n]_pos[n], where term[i] is the i-th word and pos[i] is its part of speech;
using the stop word list and modifier list of the resource bank, together with the parts of speech of the words, to perform basic layering on the query words entered by the user, as follows,
where term[i] denotes the i-th query term, level[i] is its level, stopwordList is the stop word list, requirewordList is the demand word list, and cposList is a list of unimportant parts of speech, including but not limited to adjectives, adverbs, prepositions, interjections, auxiliary words, modal particles, conjunctions and symbols;
if term[i] belongs to the stop word list or its part of speech belongs to cposList, level[i] is 0; if term[i] is a modifier, level[i] is 1; otherwise level[i] is 2.
In the above method, performing entity introduction on the query terms after the basic layering comprises:
extracting the actual entity word set entityList according to the entity dictionary in combination with the user's query statement,
where term[i] denotes the i-th query term, level[i] is its level, and entityList is the set of extracted entities.
In the above method, extracting the actual entity word set entityList according to the entity dictionary in combination with the user's query statement comprises:
taking the category of the user query into account, and extracting an entity word when the category of the entity is related to the classification information; or
extracting entity words by sentence rules.
The above method further comprises, before outputting the hierarchical structure of the identified user query terms:
performing sentence-pattern syntactic analysis on the user query terms; and/or
performing subordination relation recognition on the user query terms.
An embodiment of the invention also provides a system for processing user query terms, comprising:
a resource bank establishing module, configured to establish a resource bank related to identifying the core words of a user query;
a basic layering module, configured to perform basic layering on the query terms entered by the user;
an entity introduction module, configured to perform entity introduction on the query terms after the basic layering;
an output module, configured to output the hierarchical structure of the identified query terms.
In the above system, the resource bank related to identifying the core words of a user query is a series of word lists related to that identification, comprising a stop word list, a modifier list and an entity resource dictionary.
In the above system, performing basic layering on the query terms entered by the user specifically comprises the following.
The basic layering module is configured to segment the user's query statement into words to obtain a series of query words term and parts of speech pos, namely term[1]_pos[1], term[2]_pos[2], ..., term[n]_pos[n], where term[i] is the i-th word and pos[i] is its part of speech;
and is configured to use the stop word list and modifier list of the resource bank, together with the parts of speech of the words, to perform basic layering on the query words entered by the user, as follows,
where term[i] denotes the i-th query term, level[i] is its level, stopwordList is the stop word list, requirewordList is the demand word list, and cposList is a list of unimportant parts of speech, including but not limited to adjectives, adverbs, prepositions, interjections, auxiliary words, modal particles, conjunctions and symbols;
if term[i] belongs to the stop word list or its part of speech belongs to cposList, level[i] is 0; if term[i] is a modifier, level[i] is 1; otherwise level[i] is 2.
The above system further comprises:
a sentence-pattern syntactic analysis module, configured to perform sentence-pattern syntactic analysis on the user query terms;
a subordination relation recognition module, configured to perform subordination relation recognition on the user query terms.
With the technical scheme of the present invention, both the lexical features of the query statement and the special role of entity words are taken into account. Entities are introduced and entity disambiguation is performed to ensure the accuracy of entity extraction. Sentence-pattern syntactic analysis examines the user's query statement as a whole, which avoids both the local-optimum problem caused by judging levels from individual words alone and the inadequate recognition of particular entities caused by relying on whole-sentence structure alone. Finally, the subordination relation is used to further refine the core words of the query statement, so that the core words of the user's sentence are identified and the search engine is given as much supporting information as possible. In addition, the scheme does not depend entirely on the online result information of the search engine, so it is easier to operate and implement.
Detailed Description of the Embodiments
To make the technical problem to be solved, the technical scheme and the beneficial effects of the present invention clearer, the invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
When searching, a user enters a query statement as needed; in general, a query statement consists of several query terms. Because the Chinese language is rich and complex, the statements users enter are diverse, and users spare no words in describing their demand in detail. In fact, many of these words serve only as aids to analysis: they make the expressed meaning more definite but have little practical significance for retrieval. In an embodiment of the invention, the query terms contained in the user's query statement are divided into four grades:
Discardable words are words with no practical meaning, such as stop words and punctuation marks. They can be discarded outright and excluded from the search query, which improves retrieval efficiency without degrading retrieval quality.
Modifiers are words of a modifying nature that the user employs when expressing meaning. They play no decisive role and merely enrich the semantics; a search result may or may not hit them.
Core words are the core of the user's query statement, the words that express the user's search demand. A search result must hit them before it can be returned to the user.
Demand words name an attribute of the thing the user actually needs, and generally supplement or emphasize the user's demand, such as "download", "song", "lyrics" or "film". If a search result hits a demand word, so much the better: it indicates that the result describes such a resource, and the result is ranked higher.
Figure 1 is a flowchart of the first embodiment of the invention, which provides a method for processing user query terms, specifically comprising the following steps.
Step S101: establish a resource bank related to identifying the core words of a user query.
The resource bank is a series of word lists related to identifying the core words of a user query, comprising a stop word list (stopwordList), a modifier list (modifywordList) and an entity resource dictionary (dicResource).
The stop word list contains a series of common Chinese stop words, such as "in" and "what". The modifier list contains common modifiers, such as "beauty" and "good-looking". The entity resource dictionary contains the names of all kinds of current resources, such as novel titles, software names and film titles from the various channels, together with their categories. These can be mined from search logs or crawled from vertical websites, with the required information extracted from them, so that the resource information in the resource bank is as complete as possible.
Step S102: perform basic layering on the query terms entered by the user.
When the user enters a query statement, the statement is segmented into words, which yields a series of query words term and parts of speech pos: term[1]_pos[1], term[2]_pos[2], ..., term[n]_pos[n]. Here term[i] is the i-th word and pos[i] is its part of speech.
The stop word list and modifier list of the resource bank, together with the parts of speech of the words, are used to perform basic layering on the query words entered by the user, as follows,
where term[i] denotes the i-th query term, level[i] is its level, stopwordList is the stop word list, requirewordList is the demand word list, and cposList is a list of unimportant parts of speech, including but not limited to adjectives, adverbs, prepositions, interjections, auxiliary words, modal particles, conjunctions, symbols and the like.
If term[i] belongs to the stop word list or its part of speech belongs to cposList, level[i] is 0 (a discardable word); if term[i] is a modifier, level[i] is 1 (a modifier); otherwise level[i] is 2 (a core word).
This step gives each word of the user's query an initial level.
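The rule of step S102 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name basic_layering, the Level enumeration and the part-of-speech tags are hypothetical, and word segmentation is assumed to have been done already. The order of the checks follows the text, so a word whose part of speech is in cposList is discarded even if it also appears in the modifier list.

```python
from enum import IntEnum


class Level(IntEnum):
    """Levels named in step S102 (the names are assumptions of this sketch)."""
    DISCARD = 0   # discardable word
    MODIFIER = 1  # modifier
    CORE = 2      # core word


def basic_layering(tagged_terms, stopword_list, modifyword_list, cpos_list):
    """Assign an initial level to each (term, pos) pair."""
    levels = []
    for term, pos in tagged_terms:
        if term in stopword_list or pos in cpos_list:
            levels.append(Level.DISCARD)
        elif term in modifyword_list:
            levels.append(Level.MODIFIER)
        else:
            levels.append(Level.CORE)
    return levels


# Hypothetical segmented query with made-up part-of-speech tags.
tagged = [("of", "u"), ("good-looking", "vn"), ("film", "n")]
unimportant_pos = {"a", "d", "p", "e", "u", "y", "c", "w"}
print(basic_layering(tagged, {"of"}, {"good-looking"}, unimportant_pos))
```

The demand word list requirewordList is not used here because the rule quoted above does not reference it.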
Step S103: perform entity introduction on the query terms after the basic layering.
The words in a user's query statement differ in importance and grade. The question is how to distinguish the more important, more representative words. Comparatively speaking, entity words are especially important and generally best reflect the user's original demand. If the query statement contains an entity word, its role should be emphasized.
Entity introduction mainly rescues important words that the basic layering classified as modifiers or discardable words, and assigns them an important grade again.
Given the importance and complexity of entities, whether a word is an entity must be decided in combination with the user's input itself. For example, "why" is a very common word, but it may also appear in the entity dictionary with the category song. Distinguishing the entity use of such words, especially ambiguous ones, is the most important part of this step and may be called entity disambiguation.
Two methods of extracting entities are considered. 1) The first method takes the category of the user query into account: an entity is extracted when its category is related to the Query classification information, and not otherwise.
Specifically, the first method uses external information such as the Query classification (the category of the user's query statement), which is commonly available in search engines. For example, for the query "Zhou Xingchi's comedy Time download" the Query classification is the download class; for "May song Why audition" it is the song class; for "why mobile phone does not connect computer" it is the question-and-answer class.
Entities are extracted from the query statement using this classification information. "Time" is a common word, but in the query above it is actually the title of a film, that is, an entity name whose entity category is film. (Candidate entity words in the user's statement, and their entity categories, are obtained by looking up the entity resource dictionary described above.) When the Query classification (download class) is related to the category of the entity (film), the entity is extracted. As another example, "why" is a stop word and was assigned the discardable grade in the basic layering; through the entity resource dictionary it also appears here as a candidate entity of the song category (there is a song titled "Why"). Since the Query classification (song class) is related to the entity category (song), it is regarded as an entity. In "why mobile phone does not connect computer", by contrast, "why" also appears as a candidate entity, but the Query classification (question-and-answer class) is not related to the entity category (song), so it is not regarded as an entity.
This association can be configured flexibly through a manually maintained association table, which states which entity categories each Query classification may be related to, for example "download class: song, film, TV play, game, software"; "song class: song"; "video class: film, TV play, animation"; and so on.
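The category check of the first method can be sketched as a lookup in such an association table. This is a minimal sketch, not the patented implementation: the table contents are the examples quoted above, while the names CATEGORY_ASSOCIATION and extract_entities_by_category, and the representation of candidates as (word, entity category) pairs taken from the entity resource dictionary, are assumptions made for illustration.

```python
# Query classification -> entity categories it may be related to
# (entries taken from the examples in the text).
CATEGORY_ASSOCIATION = {
    "download": {"song", "film", "TV play", "game", "software"},
    "song": {"song"},
    "video": {"film", "TV play", "animation"},
}


def extract_entities_by_category(candidates, query_category):
    """Keep candidate entities whose category is related to the Query class.

    candidates: list of (word, entity_category) pairs.
    A Query class that is missing from the table keeps nothing.
    """
    allowed = CATEGORY_ASSOCIATION.get(query_category, set())
    return [word for word, category in candidates if category in allowed]


# "why" is a song title: kept for a song query, dropped for a Q&A query.
print(extract_entities_by_category([("why", "song")], "song"))
print(extract_entities_by_category([("why", "song")], "question-answer"))
```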
Of course, in practice not every Query has a classification. What if the user's query statement has none? Experience shows that a Query containing an obvious entity word can generally be classified. If it really cannot, candidates can be prioritized directly by the length of the candidate entity and the number of words into which it is segmented, to ensure accuracy.
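One possible reading of this fallback is a simple ranking of candidates. The sketch below assumes that longer candidates that span more segmented words should come first, on the grounds that they are less likely to be ordinary words; the text does not spell out the direction or any threshold, so both the ordering and the name rank_candidates are assumptions.

```python
def rank_candidates(candidates):
    """Order candidate entities for a query that has no classification.

    candidates: list of (entity_text, tokens) pairs, where tokens is the
    list of segmented words that make up the candidate.
    Assumption: longer text first, then more tokens first.
    """
    return sorted(
        candidates,
        key=lambda item: (len(item[0]), len(item[1])),
        reverse=True,
    )


ranked = rank_candidates([
    ("why", ["why"]),
    ("because love", ["because", "love"]),
])
print([text for text, _ in ranked])
```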
The main purpose of entity introduction is to "rescue" core words. After the basic layering a rough initial layering is in place, but common words may have ended up at the discardable or modifier level, and closer analysis can show that some of them are in fact important entity words. Such words are therefore rescued and given the core word grade. Take "because love": it is segmented into "because" and "love", and "because" is so common that the basic layering assigns it the discardable grade. But it is part of an entity (the song "because love"), so this step gives it the core word grade. As stated above, the main work of entity introduction is entity disambiguation, that is, extracting the entities that are really useful while introducing as little noise as possible, so as to ensure both recall and accuracy. This step uses the two methods described in this section.
Of course, the first method relies on an external Query classification, and its accuracy is relatively high.
2) Extraction by sentence rules, such as (person name | demand word) + word T, or (person name) word T + (demand word): if T appears in the entity dictionary, it is extracted. For example, when the user searches "Cai Zhuoyan's song Why" or "song Why", "why" can be regarded as an entity.
The second method works directly from rules. For instance, an entity word generally appears together with a person name or a demand word (song, film and so on), and this is especially true of entity words that also have a common meaning. In "song Why" above, "why" is an entity, whereas in "why mobile phone does not connect computer" it is not. This method is simple to implement.
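The two rules of the second method can be sketched as a check on the neighbours of each candidate word. The sketch makes several simplifying assumptions that are not in the text: the query is already segmented, particles such as "of" have been removed, only the immediately adjacent words are examined, and the function name and set arguments are hypothetical.

```python
def extract_entities_by_rule(tokens, person_names, demand_words, entity_dict):
    """Extract word T when it is in the entity dictionary and matches
    (person name | demand word) + T, or T + (demand word)."""
    entities = []
    for i, word in enumerate(tokens):
        if word not in entity_dict:
            continue
        prev_word = tokens[i - 1] if i > 0 else None
        next_word = tokens[i + 1] if i + 1 < len(tokens) else None
        preceded = prev_word in person_names or prev_word in demand_words
        followed = next_word in demand_words
        if preceded or followed:
            entities.append(word)
    return entities


names = {"Cai Zhuoyan"}
demands = {"song", "film"}
dictionary = {"why"}
print(extract_entities_by_rule(["song", "why"], names, demands, dictionary))
print(extract_entities_by_rule(
    ["why", "mobile phone", "not", "connect", "computer"],
    names, demands, dictionary))
```

In the second query "why" has neither a person name nor a demand word next to it, so it is not extracted.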
The actual entity word set entityList is extracted according to the entity dictionary in combination with the user's query statement,
where term[i] denotes the i-th query term, level[i] is its level, and entityList is the set of extracted entities.
This step takes the words of the entities contained in the user's query statement (which the basic layering may have marked as discardable or as modifiers) and upgrades their levels, so as to highlight the user's intention.
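The level upgrade described here might look as follows. This is a sketch under assumptions: each entity in entityList is represented as the sequence of segmented words that form it, the core level is the value 2 from the basic layering, and the function name introduce_entities is hypothetical.

```python
CORE = 2  # core word level from the basic layering


def introduce_entities(terms, levels, entity_list):
    """Raise every term that is part of an extracted entity to CORE.

    terms: segmented query words; levels: their current levels;
    entity_list: extracted entities, each given as a sequence of terms.
    """
    new_levels = list(levels)
    for entity in entity_list:
        entity = tuple(entity)
        size = len(entity)
        if size == 0:
            continue
        for start in range(len(terms) - size + 1):
            if tuple(terms[start:start + size]) == entity:
                for k in range(start, start + size):
                    new_levels[k] = CORE
    return new_levels


# "because" was marked discardable (0) but belongs to the song "because love".
print(introduce_entities(["because", "love"], [0, 2], [("because", "love")]))
```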
Step S106: output the hierarchical structure of the identified user query terms.
For each query statement, the steps above finally yield the level of every word the statement contains, that is, whether the word is a demand word, a core word, a modifier or a discardable word.
The steps above essentially complete the identification of the query terms entered by the user. To achieve a better result, an embodiment of the invention may further include the following steps. Steps S104 and S105 may be performed in either order, or only one of them may be used.
Step S104: perform sentence-pattern syntactic analysis on the user query terms.
The preceding two steps, basic layering and entity introduction, layer the words entered by the user, but both work from the perspective of individual words. The query statements users enter contain many fixed sentence patterns, and sentence-pattern rules can be used to assist the layering. Examples are: (from) $Address .* $Address; (.* mobile phone) .* download; (discuss) the (relation) of .* and .*; a composition that (takes) .* as its (topic). The words inside the brackets can be given the modifier level.
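A sentence-pattern rule of this kind can be sketched with a regular expression whose capture groups mark the bracketed part. The sketch is illustrative only: it uses an English rendering of the "(.* mobile phone) .* download" pattern, joins the segmented words with spaces so that word boundaries can be mapped back to character offsets, and uses hypothetical names (PATTERNS, apply_sentence_patterns).

```python
import re

MODIFIER = 1  # modifier level from the basic layering

# English rendering of one pattern from the text; the capture group
# corresponds to the bracketed part of the rule.
PATTERNS = [re.compile(r"(.*mobile phone) .* download")]


def apply_sentence_patterns(terms, levels, patterns=PATTERNS):
    """Give the modifier level to every term inside a captured group."""
    text = " ".join(terms)
    spans = []
    offset = 0
    for term in terms:
        spans.append((offset, offset + len(term)))
        offset += len(term) + 1  # +1 for the joining space
    new_levels = list(levels)
    for pattern in patterns:
        match = pattern.fullmatch(text)
        if match is None:
            continue
        for group in range(1, pattern.groups + 1):
            start, end = match.span(group)
            if start == -1:  # group did not take part in the match
                continue
            for i, (s, e) in enumerate(spans):
                if s >= start and e <= end:
                    new_levels[i] = MODIFIER
    return new_levels


print(apply_sentence_patterns(
    ["Nokia", "mobile phone", "game", "download"], [2, 2, 2, 2]))
```

A query that does not match any pattern keeps its levels unchanged.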
In addition, dependency parsing can be applied to the user's query statement to analyze how the sentence is formed and to obtain the words it contains and the dependency relations between them. Particular sentence structures can then be used to adjust the word-based hierarchical structure from the perspective of the sentence.
This step considers the user's input statement as a whole and adjusts the levels of the words.
Step S105: perform subordination relation recognition on the user query terms.
In one embodiment, the subordination relation is divided into two classes: regional subordination and domain subordination.
Regional subordination is subordination of geographic location. When two place names stand in a subordinate (superior and subordinate) relation, the superior place name is adjusted to a modifier so as to highlight the core place name. In "Haidian, Beijing", for example, Haidian belongs to Beijing, so "Haidian" is more likely to be the core word than "Beijing", and "Beijing" is adjusted to a modifier here. Regional subordination can be recognized by means of place-name codes.
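Recognition through place-name codes can be sketched with hierarchical codes in which a subordinate region shares the significant prefix of its superior. The table below is a small illustrative sample in the style of six-digit administrative division codes, and the prefix rule, the table and the function names are all assumptions of this sketch rather than details given in the text.

```python
MODIFIER = 1  # modifier level from the basic layering

# Illustrative sample of hierarchical place-name codes.
PLACE_CODES = {
    "Beijing": "110000",
    "Haidian": "110108",
    "Shanghai": "310000",
}


def significant_prefix(code):
    """Strip trailing '00' pairs, e.g. '110000' -> '11'."""
    while code.endswith("00"):
        code = code[:-2]
    return code


def is_superior(parent, child):
    """True if parent's code is a proper hierarchical prefix of child's."""
    parent_code = PLACE_CODES[parent]
    child_code = PLACE_CODES[child]
    if parent_code == child_code:
        return False
    return child_code.startswith(significant_prefix(parent_code))


def adjust_region_levels(terms, levels):
    """Demote a place name to MODIFIER when a subordinate place co-occurs."""
    new_levels = list(levels)
    places = [i for i, term in enumerate(terms) if term in PLACE_CODES]
    for i in places:
        for j in places:
            if i != j and is_superior(terms[i], terms[j]):
                new_levels[i] = MODIFIER
    return new_levels


print(adjust_region_levels(["Beijing", "Haidian"], [2, 2]))
```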
Domain subordination concerns the category to which an entity name belongs, such as TV play, film or song; this information comes from the entity dictionary described above. After the entity recognition of step S103, if a word related to the entity's category appears before or after the entity, that word is adjusted to a demand word. In essence, a demand word expresses an attribute of the thing the user is searching for, so it is tied to a concrete entity and generally accompanies one. Therefore, once an entity has been identified, subordination is checked to decide whether a demand word is present. In "Liu Dehua's song Lustily Water", for example, "Lustily Water" belongs to "song", so the word "song" is adjusted to a demand word here, and the core words are "Liu Dehua" and "Lustily Water". This has two benefits: the core words of the user's input are made clear, and the user's essential demand (a song) is made clear, which can be used to optimize the ranking of search results. By contrast, when the user enters "film of Liu Dehua", no subordination relation is recognized; the word "film" remains a core word and is not identified as a demand word, since otherwise the search results might have nothing to do with films.
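The demand word adjustment can be sketched as follows. Several details are assumptions of this sketch: the text gives no numeric value for the demand word level, so 3 is used purely for illustration; the category-related word may appear anywhere in the query rather than only next to the entity; and the names adjust_demand_words, recognized_entities and category_words are hypothetical.

```python
DEMAND = 3  # assumed value; the text does not number the demand level


def adjust_demand_words(terms, levels, recognized_entities, category_words):
    """Mark words related to a recognized entity's category as demand words.

    recognized_entities: {entity term: entity category} from step S103.
    category_words: {entity category: words related to that category}.
    """
    new_levels = list(levels)
    for entity, category in recognized_entities.items():
        related = category_words.get(category, set())
        for i, term in enumerate(terms):
            if term != entity and term in related:
                new_levels[i] = DEMAND
    return new_levels


related_words = {"song": {"song", "lyrics"}, "film": {"film"}}

# "Lustily Water" is a recognized song, so "song" becomes a demand word.
print(adjust_demand_words(
    ["Liu Dehua", "song", "Lustily Water"], [2, 2, 2],
    {"Lustily Water": "song"}, related_words))

# No film entity was recognized, so "film" stays a core word.
print(adjust_demand_words(["Liu Dehua", "film"], [2, 2], {}, related_words))
```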
Figure 2 is a structural diagram of the second embodiment of the invention, which provides a system for processing user query terms, comprising:
a resource bank establishing module 201, configured to establish a resource bank related to identifying the core words of a user query;
a basic layering module 202, configured to perform basic layering on the query terms entered by the user;
an entity introduction module 203, configured to perform entity introduction on the query terms after the basic layering;
an output module 204, configured to output the hierarchical structure of the identified query terms.
Further, the resource bank related to identifying the core words of a user query is a series of word lists related to that identification, comprising a stop word list, a modifier list and an entity resource dictionary.
Further, the basic layering module performing basic layering on the query terms entered by the user specifically comprises the following.
The basic layering module is configured to segment the user's query statement into words to obtain a series of query words term and parts of speech pos, namely term[1]_pos[1], term[2]_pos[2], ..., term[n]_pos[n], where term[i] is the i-th word and pos[i] is its part of speech;
and is configured to use the stop word list and modifier list of the resource bank, together with the parts of speech of the words, to perform basic layering on the query words entered by the user, as follows,
where term[i] denotes the i-th query term, level[i] is its level, stopwordList is the stop word list, requirewordList is the demand word list, and cposList is a list of unimportant parts of speech, including but not limited to adjectives, adverbs, prepositions, interjections, auxiliary words, modal particles, conjunctions and symbols.
If term[i] belongs to the stop word list or its part of speech belongs to cposList, level[i] is 0; if term[i] is a modifier, level[i] is 1; otherwise level[i] is 2.
Further, the system also comprises:
a sentence-pattern syntactic analysis module, configured to perform sentence-pattern syntactic analysis on the user query terms; and/or
a subordination relation recognition module, configured to perform subordination relation recognition on the user query terms.
The above description illustrates and describes a preferred embodiment of the present invention. As stated earlier, however, it should be understood that the invention is not limited to the form disclosed here; the description should not be regarded as excluding other embodiments, and the invention can be used in various other combinations, modifications and environments, and can be altered within the scope of the inventive concept described here through the above teachings or the skill or knowledge of the related art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the protection scope of the appended claims.