CN101004753A

CN101004753A - Method and system for recognizing conception type files

Info

Publication number: CN101004753A
Application number: CN 200710000398
Authority: CN
Inventors: 刘琳
Original assignee: Beijing Sogou Technology Development Co Ltd
Current assignee: Beijing Sogou Technology Development Co Ltd
Priority date: 2007-01-25
Filing date: 2007-01-25
Publication date: 2007-07-25
Anticipated expiration: 2027-01-25
Also published as: CN101004753B

Abstract

A method for identifying concept-type documents includes: reading a source document; matching the document against preset key characters, which are derived from the expression patterns of concept-type documents, and scoring according to the match results; and comparing the total score with a preset judgment threshold to determine whether the source document is a concept-type document.

Description

Method and system for recognizing concept-type documents

Technical field

The present invention relates to the field of network information processing, and in particular to a method and system for recognizing concept-type documents.

Background technology

With the rapid growth of text and multimedia content on the Internet and in other data networks and systems, the volume of network information has increased sharply. How to help users obtain the information they need from this massive body of network information as quickly and accurately as possible is a hot issue in the field of network information processing.

The prior art has proposed various technical schemes for analyzing and processing network information in order to satisfy users' information needs.

Chinese patent application No. 200480021922.5, published on September 6, 2006, describes a system and method for determining the meaning of a document so that the document can be matched with content. The method comprises: accessing a source article; identifying a plurality of regions in the source article; determining at least one local topic associated with each region; analyzing the local topics of each region to identify any unrelated regions; deleting the local topics associated with any unrelated region to determine the relevant topics; analyzing the relevant topics to determine the source meaning of the source article; and matching the source meaning with an item meaning associated with an item from a set of items. The main purpose of this scheme is to remove the influence of unrelated regions on the extraction of the document's main content, so that advertisements published by advertisers are highly relevant to the web page the user is currently browsing, which improves the match between the two and the accuracy of advertisement placement.

Although the above scheme can summarize document content fairly accurately and satisfies the needs of advertisement publishers, what users need most is to obtain the required information as quickly as possible, and the above scheme does not solve that problem. For example, for a keyword such as "sunspot", the above scheme returns a large number of documents that are highly relevant to "sunspot". Analysis shows that some of these documents explain what a sunspot is, some are news reports about sunspots, and some are news documents that use "sunspot" metaphorically, and so on. Which of these documents does the user need most?

Statistical analysis shows that, when documents match the query keyword equally well, that is, when their content is equally relevant to the query term, the concept-type document is usually the best answer; it is therefore necessary to analyze the document collection and identify this category. A concept generally refers to a unit of knowledge (or general semantic unit) formed by a unique combination of characteristics. A concept-type document usually takes the explanation of a concept as its subject and develops its description around the intension and extension of that one concept; that is, the body of a concept-type document generally consists of the definition, characteristics, evolution history, extended explanation, and so on, of the concept.

In large-scale search fields such as Internet search, concept-type documents often satisfy users' queries for noun-type information better than other documents do; such queries include explanations of unfamiliar words, definitions of familiar words, and equivalents of words in other languages.

When displaying search results, the prior art usually ranks the results by keyword frequency, paragraph meaning, document time, and/or link relationships. In current search engine results, concept-type documents have relatively weak link relationships, relatively old update times, and unremarkable keyword frequency, so they are ranked behind other documents that carry the same meaning. Random checks found that this problem is widespread: for query terms such as "Argentina" (an introduction to the country), the "Three Represents" theory (the specific content of the thought), and "bird flu" (the origin and development of the disease), the results consist mainly of news, while documents with concept-type descriptions rank relatively low. Yet statistically the concept-type document is usually the user's best answer, so users obtain the information they need inefficiently.

In short, how to quickly identify concept-type documents from massive network information is a technical problem that those skilled in the art urgently need to solve.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method and system for recognizing concept-type documents, so as to identify concept-type documents from massive network information independently, automatically, and quickly.

To solve the above problem, the present invention discloses an embodiment of a method for recognizing concept-type documents, which may comprise the following steps:

Step a: read a source document;

Step b: match the document against preset key characters and score according to the match results, the preset key characters being derived from the expression patterns of concept-type documents;

Step c: compare the total score with a preset judgment threshold to determine whether the source document is a concept-type document.

Preferably, the key characters may include title keywords, text keywords, and punctuation marks.

Preferably, step b is completed as follows: the title of the source document is matched against the preset title keywords and/or punctuation marks and scored according to a first preset rule; the text of the source document is matched against the preset text keywords and/or punctuation marks and, when a match is found, scored according to a second preset rule.

Preferably, the method may further comprise: matching the title of the source document against the preset title keywords and/or punctuation marks, and cutting the title according to the match results to obtain the notional word of the source document, that is, the word that names the concept.
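As an illustration only, the cutting of the title could be sketched as follows. The cue lists, the English stand-ins for the keywords, the function name `extract_notional_word`, and the choice to return `None` when no cue matches are all assumptions made for this sketch; the invention does not specify them.

```python
# Hypothetical title cues (English stand-ins); the invention does not fix these lists.
PREFIX_KEYWORDS = ["what is"]                       # question-style cues that precede the notional word
SUFFIX_KEYWORDS = ["overview", "brief introduction"]  # extended-style cues that follow it
PUNCTUATION = [":", "?", "-"]


def extract_notional_word(title):
    """Cut matched cues off the title; return the remainder, or None if no cue matched."""
    text = title.strip()
    matched = False
    for kw in PREFIX_KEYWORDS:
        if text.lower().startswith(kw):
            text = text[len(kw):].strip()
            matched = True
    for kw in SUFFIX_KEYWORDS:
        if text.lower().endswith(kw):
            text = text[:-len(kw)].strip()
            matched = True
    # Remove cue punctuation left at either end of the remainder.
    stripped = text.strip("".join(PUNCTUATION) + " ")
    if stripped != text:
        matched = True
    return stripped if matched and stripped else None
```

A title consisting only of the word itself matches no cue in this sketch; a caller could reasonably treat such a title as the notional word directly.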

Preferably, the method may further comprise: matching the notional word of the source document in the text of the source document; when a match is found, judging whether the matched position of the notional word is adjacent to the matched position of a text keyword or punctuation mark; and, if so, scoring according to a third preset rule.

Preferably, the method may further comprise: matching the notional word of the source document in the text of the source document and, when no match is found, scoring according to a fourth preset rule.
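The third and fourth preset rules might be sketched as below. The numeric values, the function name `notional_word_score`, and the definition of "adjacent" (the cue immediately follows the notional word, ignoring whitespace) are assumptions of this sketch; the invention leaves them open.

```python
THIRD_RULE_BONUS = 2.0      # assumed score when the notional word is adjacent to a cue
FOURTH_RULE_PENALTY = -3.0  # assumed score when the notional word never occurs in the text


def notional_word_score(body, notional_word, cues):
    """Score occurrences of the notional word by whether a cue directly follows them."""
    score = 0.0
    found = False
    start = body.find(notional_word)
    while start != -1:
        found = True
        # Text right after this occurrence, with leading whitespace ignored.
        rest = body[start + len(notional_word):].lstrip()
        if any(rest.startswith(cue) for cue in cues):
            score += THIRD_RULE_BONUS
        start = body.find(notional_word, start + 1)
    return score if found else FOURTH_RULE_PENALTY
```

Only cues that follow the notional word are checked here; a fuller version could also look at the text that precedes it.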

Preferably, the method may further comprise: when the source document is determined to be a concept-type document, storing the source document, the concept-type judgment conclusion, and the notional word of the source document in association with one another.

Preferably, the preset text keywords include definition-type keywords, name-type keywords, attribute-type keywords, and/or relation-type keywords.

Preferably, the second preset rule comprises a base score and a match-count attenuation coefficient for each text keyword or punctuation mark. Preferably, the second preset rule may further comprise an in-paragraph position attenuation coefficient and/or a paragraph attenuation coefficient.

The text used for match scoring may comprise all paragraphs of the source document; alternatively, it may comprise only those paragraphs of the source document that meet a precondition.

The present invention also discloses an embodiment of a system for recognizing concept-type documents, which may comprise:

a source document reading device;

a matching device, configured to match the document against preset key characters, the preset key characters being derived from the expression patterns of concept-type documents;

an analysis device, comprising a scoring module and a comparison and determination module, the scoring module being configured to score according to the match results, and the comparison and determination module being configured to compare the total score with a preset judgment threshold to determine whether the source document is a concept-type document.

Preferably, the key characters include title keywords, text keywords, and punctuation marks.

Preferably, the system may further comprise: a preprocessing module, configured to preprocess the source document and obtain the document fragment to be analyzed.

Preferably, the matching device comprises:

a title matching module, configured to match the title of the source document against the preset title keywords and/or punctuation marks; and a text matching module, configured to match the text of the source document against the preset text keywords and/or punctuation marks;

and the scoring module comprises:

a title scoring submodule, configured to score according to a first preset rule when a preset title keyword and/or punctuation mark is matched in the title of the source document; and a text scoring submodule, configured to score according to a second preset rule when a preset text keyword and/or punctuation mark is matched in the text of the source document.

Preferably, the analysis device may further comprise: a notional word acquisition module, configured to cut the title of the source document according to the match results output by the title matching module, so as to obtain the notional word of the source document.

Preferably, the system may further comprise: a notional word matching module, configured to match the notional word of the source document in the text of the source document; and a notional word scoring submodule, configured, when a match is found, to judge whether the matched position of the notional word is adjacent to the matched position of a text keyword or punctuation mark and, if so, to score according to a third preset rule, and, when no match is found, to score according to a fourth preset rule.

Preferably, the system may further comprise: a storage device, configured, when the source document is determined to be a concept-type document, to store the source document, the concept-type judgment conclusion, and the notional word of the source document in association with one another.

The present invention also provides an embodiment of a device for recognizing concept-type documents, which may comprise:

a source document reading module;

a matching and scoring module, configured to match the document against preset key characters and score according to the match results, the preset key characters being derived from the expression patterns of concept-type documents;

a comparison and determination module, configured to compare the total score with a preset judgment threshold to determine whether the source document is a concept-type document.

Preferably, the device may further comprise: a preprocessing module, configured to preprocess the source document and obtain the document fragment to be analyzed.

Preferably, the matching and scoring module comprises: a title matching and scoring submodule, configured to match the title of the source document against the preset title keywords and/or punctuation marks and score according to a first preset rule; and a text matching and scoring submodule, configured to match the text of the source document against the preset text keywords and/or punctuation marks and, when a match is found, score according to a second preset rule.

Preferably, the device may further comprise: a notional word acquisition submodule, configured to match the title of the source document against the preset title keywords and/or punctuation marks and to cut the title according to the match results to obtain the notional word of the source document; and a notional word matching and scoring submodule, configured to match the notional word of the source document in the text of the source document, and, when a match is found, to judge whether the matched position of the notional word is adjacent to the matched position of a text keyword or punctuation mark and, if so, to score according to a third preset rule, and, when no match is found, to score according to a fourth preset rule.

Preferably, the device may further comprise: a storage module, configured, when the source document is determined to be a concept-type document, to store the source document, the concept-type judgment conclusion, and the notional word of the source document in association with one another.

Compared with prior art, the present invention has the following advantages:

The present invention replaces the semantic analysis used in the prior art with an analysis of the manner of description. Using preset keywords and/or punctuation marks, together with an analysis of the document's formatting features, it can determine, without relying on semantic understanding and on the basis of a priori rules of language expression in concept-type documents, whether a document is written in the concept-type manner of description.

Because the present invention achieves recognition by matching alone, the target document needs to be scanned only once or twice; no deep semantic analysis is needed and the intension of the concept need not be examined, which reduces the computing resources and processing time consumed per document. Moreover, during processing the invention uses only the content of the document itself as the source of analysis and does not depend on the content of other documents related to it. The invention is therefore very economical in computing resources, fast, and efficient.

Furthermore, the approach of the present invention also suppresses interference from spam documents produced by keyword stuffing. Spam documents formed simply by piling up multiple keywords are easily excluded from the valid samples.

Description of drawings

Fig. 1 is a flowchart of the steps of embodiment 1 of the method for recognizing concept-type documents of the present invention;

Fig. 2 is a flowchart of the steps of embodiment 2 of the method for recognizing concept-type documents of the present invention;

Fig. 3 is a flowchart of the steps of embodiment 3 of the method for recognizing concept-type documents of the present invention;

Fig. 4a to Fig. 4c are structural block diagrams of an embodiment of the system for recognizing concept-type documents of the present invention;

Fig. 5 is a structural block diagram of another embodiment of the system for recognizing concept-type documents of the present invention.

Embodiment

To make the above objects, features, and advantages of the present invention more apparent, the invention is described in further detail below with reference to the drawings and specific embodiments.

The core idea of the present invention is to start from the expression pattern of concept-type documents and to determine whether a given document is a concept-type document by judging how similar its expression pattern is to that of typical concept-type documents. Specifically, preset key characters (including punctuation marks) serve as the parameters that characterize the expression pattern of concept-type documents; with one or two matching scans of a document, and with a priori rules as reference, concept-type documents can be identified quickly and effectively when processing massive numbers of documents.

Referring to Fig. 1, a flowchart of the steps of embodiment 1 of the method for recognizing concept-type documents of the present invention is shown, which specifically comprises the following steps:

Step 101: read a source document;

A document in the present invention may be a web page in any of various formats, such as Hypertext Markup Language (HTML), Extensible Markup Language (XML), or Extensible Hypertext Markup Language (XHTML); a Portable Document Format (PDF) file; a word-processor or application document file; and so on, as well as non-text resources directly associated with files of the above formats.

The source document in the present invention may be a complete document file, a document fragment obtained by preprocessing the original document file, or a part of a document file.

Step 102: match the source document against preset key characters and score according to the match results, the preset key characters being derived from the expression patterns of concept-type documents;

The body of a concept-type document generally comprises: the definition of the subject concept word (including its various names); attributes (including characteristics, physical data, location information, relational attributes, and so on); evolution history (including sequential or causal explanations of the concept, such as a person's life, the cause of a phenomenon, origin and development, evolutionary history, or migration history); and extended explanation (including static, directly related information such as the content of works, value and significance, methods of use, and latest news).

Concept-type documents on the Internet usually use some of the most common expressions of natural human language to explain a concept. These documents therefore tend to have certain fixed expression patterns. For example, in Chinese, the sentence pattern for defining a noun is usually "X is Y", and the sentence pattern for describing a geographical term is usually "X is located in Y". Examples: "A sunspot is a kind of solar activity that occurs on the photosphere of the sun"; "Tiananmen is located on the central axis of Beijing". As another example, in written Chinese some punctuation marks have a strong indicative function: a colon can indicate a prompt or an explanation; quotation marks can indicate emphasis, a special term, or a special usage; brackets can indicate an explanation or a contrast; and so on.

The prior art usually analyzes Internet documents by semantic analysis. For example, word segmentation can be used to cut apart the words in a document, the meanings of the words can be analyzed, and the words can then be combined to analyze the content of the document and so determine its topic and type. However, semantic analysis based on word segmentation requires steps such as segmentation and the analysis and understanding of word and sentence meaning; the process is relatively complex and may require scanning the text many times, so processing speed is limited to some degree. In processing massive numbers of web pages, this incurs a high computing resource overhead.

By analyzing the expression pattern of concept-type documents, the present invention has found parameters that can accurately characterize that pattern: the key characters (including punctuation marks) commonly used when describing a concept. Some common key characters (including punctuation marks) are first preset; scoring is then performed according to how these key characters match in the source document, and whether the source document is a concept-type document is judged accordingly.

For example, the prior art cannot accurately distinguish between the following two documents:

<1> A sunspot is a kind of solar activity that occurs on the photosphere of the sun; it is the most basic and most obvious phenomenon of solar activity. It is generally believed that a sunspot is actually a huge vortex of hot gas on the surface of the sun, with a temperature of about 4500 degrees Celsius. Because this is lower than the surface temperature of the photosphere, sunspots look like dark spots. Sunspots seldom occur alone; they usually appear in groups. Their activity cycle is 11.2 years. At its peak, sunspot activity can damage the earth's magnetic field and various electronic products and electrical appliances.

<2> In recent days astronomy experts have observed that the sun seems to have had an operation: most sunspots have suddenly disappeared without a trace. The reporter learned yesterday that, because the sun has now entered a temporary low in its activity, the decrease in the number of sunspots is entirely normal and will have no obvious effect on people. It is reported that the appearance of sunspots is related to the cycle of solar activity. Experts call the time when sunspots are most numerous the "peak year of solar activity" and the time when they are fewest the "quiet year of solar activity"; the solar cycle is generally about 11 years. At present the sun is in a low period of activity, and the decrease in the number of sunspots is a normal phenomenon.

In an analysis based on document meaning, the associated concept, or document meaning, of both documents is "sunspot"; yet the two differ markedly in manner of expression, and for a user querying the word "sunspot" the former manner of description generally meets the query need better. The present invention can quickly recognize, from the angle of expression pattern, that the first document is a concept-type document.

Because the present invention achieves recognition by matching alone, the target document needs to be scanned only once or twice; no deep semantic analysis is needed and the intension of the concept need not be examined. Moreover, during processing the invention uses only the content of the document itself as the source of analysis and does not depend on the content of other documents related to it. The invention is therefore very economical in computing resources, fast, and efficient.

In the present invention, the preset key characters may include key characters commonly used in typical concept-type documents, and may also include punctuation marks, because punctuation marks occur frequently in concept-type documents and in significant positions. For example, the key characters include "what is", "overview", "brief introduction", "is located in", "originates in", "alias", the colon, the dash, and so on. More detailed presets of key characters, scoring rules, and the like are described in the later embodiments and are not repeated here.

Embodiment 1 also needs to set a corresponding scoring rule according to how much weight each preset key character carries in deciding whether a document is a concept-type document, and then to score according to how the key characters match in the analyzed document.

Step 103: compare the total score with the preset judgment threshold to determine whether the source document is a concept-type document. For example, if the total score is greater than (or less than) the preset judgment threshold, the source document can be determined to be a concept-type document.
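Steps 101 to 103 can be illustrated by the following minimal sketch. The feature weights, the threshold value, the English stand-ins for the key characters, and the function name `is_concept_type` are assumptions chosen for illustration; the sketch uses the "greater than" convention for the threshold comparison.

```python
# Hypothetical key characters with assumed weights (English stand-ins).
KEY_CHARACTER_SCORES = {
    "what is": 3.0,
    "overview": 2.0,
    "is located in": 2.0,
    "alias": 1.5,
    ":": 0.5,
}
JUDGMENT_THRESHOLD = 3.0  # assumed value


def is_concept_type(text, scores=KEY_CHARACTER_SCORES, threshold=JUDGMENT_THRESHOLD):
    """Add the weight of every key character present and compare the total with the threshold."""
    lowered = text.lower()
    total = sum(weight for feature, weight in scores.items() if feature in lowered)
    return total > threshold
```

Each key character is counted at most once here; scoring repeated matches is left to the more detailed rules of the later embodiments.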

When a source document is determined to be a concept-type document, the document may be marked and then stored separately; or the source document and the judgment conclusion may be stored in association with each other; or the mapping between the address information of the source document and the judgment conclusion may be stored.
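The third storage option, a mapping from the document's address to the judgment conclusion, could for instance be kept in a small relational table. The sketch below uses Python's standard `sqlite3` module; the table name `concept_docs`, its columns, and the helper names are hypothetical and not part of the invention.

```python
import sqlite3


def save_conclusion(conn, url, is_concept):
    """Store (or overwrite) the judgment conclusion for one document address."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS concept_docs (url TEXT PRIMARY KEY, is_concept INTEGER)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO concept_docs (url, is_concept) VALUES (?, ?)",
        (url, int(is_concept)),
    )
    conn.commit()


def load_conclusion(conn, url):
    """Return the stored conclusion as a bool, or None if the address is unknown."""
    row = conn.execute(
        "SELECT is_concept FROM concept_docs WHERE url = ?", (url,)
    ).fetchone()
    return None if row is None else bool(row[0])
```

A caller would open a connection once, for example with `sqlite3.connect(":memory:")` for experiments, and reuse it for all documents.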

Once concept-type documents have been identified from massive network documents, many applications are possible. For example:

Search result ranking: when documents match the meaning of the query term equally well, a document with a concept-type description is more likely than other documents to meet the user's query need; therefore, when a user issues a query, an existing query ranking system can automatically use the recognition result of the present invention to adjust the ranking of the search results.

Dictionary compilation: using the results of concept-type document recognition, concept-type documents covering noun explanations, biographical introductions, culture and history, and the like can be automatically extracted from the Internet and updated, and the correspondence between concept entries and documents can be organized into a dictionary.

Multilingual bilingual dictionary: during concept-type document recognition, key characters that mark a language (such as "English name" or "Latin name") and punctuation marks (such as brackets) can be used to extract and count multilingual word pairs.

Advertisement placement: because a document with a concept-type description is usually more likely to be selected by the user than documents written in other ways, an advertisement placed on a concept-type document is, for the same match with the meaning of a knowledge entry, more likely to be seen and clicked by the inquirer than one placed on other types of document.
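For the multilingual dictionary application mentioned above, the extraction of word pairs from bracketed language labels might look like the following sketch. The regular expression, the two English label strings, and the function name `extract_word_pairs` are assumptions made for illustration; a real implementation would need patterns suited to the language of the documents.

```python
import re
from collections import Counter

# Assumed pattern: a single word, then a bracket holding a language label and the foreign term.
PAIR_PATTERN = re.compile(r"(\w+)\s*\((English name|Latin name):\s*([^)]+)\)")


def extract_word_pairs(text):
    """Count (term, label, foreign term) triples found in bracketed language labels."""
    pairs = Counter()
    for term, label, foreign in PAIR_PATTERN.findall(text):
        pairs[(term, label, foreign.strip())] += 1
    return pairs
```

Counting the pairs across many concept-type documents would allow rarely seen, possibly erroneous pairs to be filtered out.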

Referring to Fig. 2, a flowchart of the steps of embodiment 2 of the method for recognizing concept-type documents of the present invention is shown, in which the preset key characters include title keywords, text keywords, and punctuation marks, and are derived from the expression patterns of concept-type documents. Embodiment 2 specifically comprises the following steps:

Step 201: read a source document;

Step 202: match the title of the source document against the preset title keywords and/or punctuation marks, and score according to a first preset rule;

According to an analysis of written human language (taking Chinese as an example), the title of a concept-type document usually takes one of three forms: question-style, extended-style, and word-style, although it is of course not limited to these three. Accordingly, the preset title keywords may include question-style keywords, which are generally prefix-type question words such as "what is", and extended-style keywords, which are generally abstract nouns that signal a conceptual description, such as "overview" or "brief introduction"; the punctuation marks may include the colon, the question mark, the dash, and so on.

If any of the above preset title keywords and/or punctuation marks is matched in the title of the source document, a score is given that differs from keyword to keyword. The size of the score is determined by the first preset rule, generally according to the weight that the keyword carries in judging and recognizing concept-type documents.

Step 203: match the text of the source document against the preset text keywords and/or punctuation marks and, when a match is found, score according to a second preset rule;

The preset text keywords may fall into roughly four classes: definition-type, name-type, attribute-type, and/or relation-type keywords. All are derived from the relatively fixed expression patterns of the text of concept-type documents. Of course, these four classes are only a preferred embodiment of the present invention; those skilled in the art may preset more or fewer types of text keyword as needed.

Definition-type keywords directly explain the meaning and function of a concept and analyze its intension, for example "is" or "is exactly". Name-type keywords extend the names of a concept so that the reader can understand its meaning by comparing several names, for example "alias", "posthumous title", or "Latin name". Attribute-type keywords describe the intrinsic attributes of a concept so that the reader builds an understanding of the concept itself through its intensional features, for example "height", "area", or "atomic weight". Relation-type keywords describe the relationships between the concept's attributes and external concepts, showing the concept's logical classification and its relative position in space and time, for example "is located in", "originates in", or "found in".

Preferably, the second preset rule may comprise a base score and a match-count attenuation coefficient for each text keyword or punctuation mark, the match-count attenuation coefficient depending on the number of matches. Preferably, the text keywords or punctuation marks within one class may share the same base score and the same match-count attenuation coefficient; that is, as keywords of the same class are matched repeatedly, each new match adds fewer and fewer points.

The reason for the match-count attenuation coefficient is that the first occurrence of a text keyword contributes strongly to recognizing a concept-type document, but as the number of matches grows its contribution gradually weakens and may even become negative: a text keyword that occurs many times usually does not fit the language habits of concept-type description, and the score should be penalized. For example, if "alias" occurs many times in a document, this is unlikely in a concept-type description; more probably the document enumerates the aliases of various things, so it is less likely to be a concept-type document and points should be deducted (a deduction is, of course, also a form of scoring).
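One possible form of the match-count attenuation is a linear decay in which each further match of the same class earns less and eventually costs points. The base scores, the decay step, and the function name `class_score` below are assumed values for illustration only; the invention does not prescribe a particular formula.

```python
# Assumed base score per keyword class and an assumed linear decay step.
BASE_SCORE = {"definition": 3.0, "name": 2.0, "attribute": 1.5, "relation": 1.5}
DECAY_STEP = 0.5


def class_score(keyword_class, match_count):
    """Total score for match_count matches of one class.

    The k-th match (k = 0, 1, 2, ...) earns base * (1 - DECAY_STEP * k),
    so later matches add less and, from the fourth match on, subtract points.
    """
    base = BASE_SCORE[keyword_class]
    return sum(base * (1.0 - DECAY_STEP * k) for k in range(match_count))
```

With these values the total for the "name" class peaks at two or three matches and then falls, which mirrors the "alias" example: a document listing many aliases is penalized rather than rewarded.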

Preferably, the second preset rule further comprises an in-paragraph position attenuation coefficient and/or a paragraph attenuation coefficient. In the typical concept-type manner of description, the name, definition, attributes, relations, and other content that explains the intension of the concept usually come early, while the later content is usually an extended explanation of the intension and an extended account of the extension. Therefore, the score of a keyword that occurs relatively late may be penalized in some way. The in-paragraph position attenuation coefficient limits the score of a text keyword according to where it occurs within a paragraph. The paragraph attenuation coefficient limits the score of a text keyword according to which paragraph of the text it occurs in (an earlier or a later paragraph).
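The two position-based coefficients could be combined into a single multiplier applied to a keyword's score, as in the sketch below. The linear in-paragraph decay, the geometric paragraph decay, the default coefficient values, and the function name `position_weight` are all assumptions of this sketch.

```python
def position_weight(offset_in_paragraph, paragraph_length, paragraph_index,
                    in_paragraph_decay=0.5, paragraph_decay=0.8):
    """Multiplier in (0, 1] that shrinks for matches later in a paragraph or in later paragraphs.

    offset_in_paragraph: character offset of the match within its paragraph
    paragraph_index:     0 for the first paragraph, 1 for the second, and so on
    """
    relative = offset_in_paragraph / max(paragraph_length, 1)
    in_paragraph = 1.0 - in_paragraph_decay * relative   # linear decay inside the paragraph
    across_paragraphs = paragraph_decay ** paragraph_index  # geometric decay across paragraphs
    return in_paragraph * across_paragraphs
```

A match at the very start of the first paragraph keeps its full score, while the same keyword halfway through the third paragraph is scaled down by both factors.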

The above explanation takes text keywords as the example throughout; the matching and scoring of punctuation marks is similar and is therefore not described in detail. For example, explanatory punctuation marks such as the colon and the dash can be scored in the same way as definition keywords, and extending punctuation marks such as brackets can be scored in the same way as name-form keywords.

In addition, both text keywords and punctuation marks are important in the matching judgment. For example, in a common concept type document, keywords are the main basis for judgment and punctuation marks play an auxiliary role; however, in documents that explain a concept by contrasting nouns in several languages, brackets can be a more obvious signal than keywords.

Moreover, the above descriptions of the text keywords (and title keywords) and the presetting rules serve only to illustrate embodiment 2; those skilled in the art can set up corresponding text keywords and presetting rules themselves as required.

In step 203, the text used for matching and scoring comprises all paragraphs of the source document; alternatively, the text used for matching and scoring comprises only the paragraphs of the source document that meet a precondition.

A document written in the concept type describing mode usually shows enough features within its first few paragraphs. Moreover, because the present invention scores matches of key characters of the expression pattern, analyzing too many paragraphs only introduces interference. Therefore, for the concept type recognition of a given document, a conclusion can be reached by processing only a limited number of effective paragraphs. The specific number of paragraphs to be analyzed can be set according to actual needs and is not limited here.

Step 204: compare the score sum with the preset judgment threshold to determine whether this source document is a concept type document.

Compared with embodiment 1, embodiment 2 is a more detailed and preferred embodiment that describes the setting of specific key characters and the matching and scoring process in detail; for the other parts, refer to the associated description of embodiment 1.

Fig. 3 is a flow chart of the steps of embodiment 3 of the concept type document recognition method of the present invention, which is a more preferred embodiment relative to embodiments 1 and 2. Embodiment 3 shown in Fig. 3 can specifically comprise the following steps:

Step 301: read the source document;

Step 302: in the title of the source document, match against the preset title keywords and/or punctuation marks, and score according to the first presetting rule;

Step 303: in the title of the source document, match against the preset title keywords and/or punctuation marks, and cut the title according to the matching result to obtain the notional word of the source document;

Step 303 determines the notional word of the document. The notional word characterizes which concept the document mainly explains, and obtaining it amounts to a preliminary analysis of the document.

For example, for a document titled "black hole brief introduction", the expansion-type title keyword "brief introduction" can be matched and, according to the general expression pattern "notional word + expansion word", the notional word of the document is determined to be "black hole". For "brief introduction: black hole" or "brief introduction---black hole", the notional word "black hole" can be determined by matching the punctuation mark, or by combining the punctuation mark with the title keyword.
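The title-cutting step can be sketched in Python as follows. This is only an illustration under stated assumptions: the keyword list `TITLE_KEYWORDS`, the separator pattern and the function `extract_notional_word` are hypothetical, English strings stand in for the original-language titles, and only the two patterns from the example (keyword at the end, or keyword followed by punctuation at the start) are handled.

```python
import re

# Hypothetical list of expansion-type title keywords.
TITLE_KEYWORDS = ("brief introduction",)

# Separators treated as explanatory punctuation: ASCII or full-width colon,
# a run of two or more hyphens, or a run of em dashes (written as escapes).
SEPARATORS = re.compile(r"\s*(?:[:\uff1a]|-{2,}|\u2014+)\s*")


def extract_notional_word(title):
    """Cut the notional word out of a title, or return None if no keyword matches."""
    text = title.strip()
    for keyword in TITLE_KEYWORDS:
        if text.endswith(keyword):
            # Pattern "notional word + expansion word".
            rest = text[: len(text) - len(keyword)]
        elif text.startswith(keyword):
            # Pattern "expansion word + punctuation + notional word".
            rest = text[len(keyword):]
        else:
            continue
        rest = SEPARATORS.sub(" ", rest).strip()
        return rest or None
    return None


if __name__ == "__main__":
    for sample in ("black hole brief introduction",
                   "brief introduction: black hole",
                   "brief introduction---black hole"):
        print(sample, "->", extract_notional_word(sample))
```

A title that matches no keyword returns `None` in this sketch; the single-phrase case, where the whole title serves as the notional word, would be handled by the caller.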

Step 304: in the text of the source document, match against the preset text keywords and/or punctuation marks, and when a match is found, score according to the second presetting rule;

Step 305: in the text of the source document, match against the notional word of the source document obtained in step 303; when a match is found, judge whether the matched position of the notional word is adjacent to the matched position of a text keyword or punctuation mark, and if so, score according to the third presetting rule; when no match is found, score according to the fourth presetting rule;

Step 305, which matches against the obtained notional word, further verifies the analyzed document. If the notional word extracted from the title cannot be matched in the text, the probability that the analyzed document is a concept type document explaining this notional word is reduced, so a certain penalty can be applied to the score through the fourth presetting rule. Of course, this situation is only an example; in an actual analysis, more complicated strategies may need to be set to obtain accurate results. Those skilled in the art can set them as required, and the present invention need not (and cannot) describe them one by one.

When the notional word extracted from the title can be matched in the text and is adjacent to a text keyword or punctuation mark, the statement at the matched position is likely an explanatory statement about the notional word. This increases the probability that the analyzed document is a concept type document, so points can be added according to the third presetting rule. For example, for a document titled "black hole brief introduction", the notional word determined in step 303 is "black hole". If step 305 matches "black hole" adjacent to the text keyword "is", that is, a statement of the form "black hole is ..." is found in the document text, this sentence pattern is a typical concept type narrating mode and carries higher confidence than a single key character match, so the third presetting rule stipulates a certain bonus. In the same document, however, a sentence of the form "apple is ..." would not be scored according to the third presetting rule. Matching the adjacency of the notional word and a text keyword can therefore increase the judgment accuracy of the present invention.
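The secondary matching of step 305 can be sketched in Python as follows. The sketch makes several assumptions that the text does not fix: "adjacent" is taken to mean that the keyword is the first whitespace-separated token after the notional word, the third and fourth presetting rules are reduced to a flat bonus and a flat penalty, and the function names and the values 40 and -10 are hypothetical.

```python
def has_adjacent_keyword(paragraph, notional_word, keywords):
    """True if some occurrence of the notional word is directly followed by a keyword."""
    start = paragraph.find(notional_word)
    while start != -1:
        # First token after this occurrence of the notional word.
        following = paragraph[start + len(notional_word):].split(maxsplit=1)
        if following and following[0] in keywords:
            return True
        start = paragraph.find(notional_word, start + 1)
    return False


def notional_word_score(paragraphs, notional_word, keywords, bonus=40, penalty=-10):
    """Score the secondary match; bonus and penalty are illustrative values only."""
    if not any(notional_word in p for p in paragraphs):
        return penalty  # fourth presetting rule: notional word not matched in the text
    if any(has_adjacent_keyword(p, notional_word, keywords) for p in paragraphs):
        return bonus  # third presetting rule: adjacency with a text keyword
    return 0  # matched, but not adjacent to any keyword


if __name__ == "__main__":
    body = ["An apple is a fruit.", "A black hole is a region of spacetime."]
    print(notional_word_score(body, "black hole", {"is"}))
```

Token-based adjacency suits the English stand-in text used here; a language written without spaces would need a character-offset comparison of the two matched positions instead.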

Of course, the third presetting rule can take many forms, which the present invention need not (and cannot) describe one by one.

It should be noted that one situation that may occur is that more than one notional word is obtained from the title analysis. In that case, step 305 can also be used to screen these notional words and select the most suitable one. For example, if two notional words are obtained from the title and one of them is matched with an adjacency relation while the other is not, the former is judged to be the notional word of the document.

Of course, in individual cases the title analysis yields many notional words (three, five or even more) or none at all, and a suitable notional word for the document may not finally be obtained. This does not affect the technical effect of the present invention. Embodiment 3 introduces the secondary matching process of the notional word mainly to further increase the accuracy with which the present invention judges concept type documents. Finding a suitable notional word is ideal, but if none is found, the accuracy of the concept type judgment is still ensured. Whether the notional word obtained in embodiment 3 is the most accurate one is therefore not a result that the present invention must pursue.

Step 306: compare the score sum with the preset judgment threshold to determine whether this source document is a concept type document.

Step 307: if this source document is determined to be a concept type document, store the source document, the concept type judgment conclusion and the notional word of this source document in association with one another.

The above three embodiments are all method embodiments of the same core idea of the present invention, and because space is limited each description has its own emphasis; for the parts not detailed in one embodiment, refer to the associated description of the other embodiments.

The document analysis and decision process is illustrated below with a concrete example (embodiment 4):

The title of the source document that is read is "macaque", and the text is as follows (supposing that only two paragraphs need to be analyzed):

Other names: yellow monkey, rhesus monkey, Guangxi monkey. It belongs to the monkey family, and its scientific name is Macaca mulatta.

The macaque is a monkey common in China, with a body length of 43-55 cm and a tail length of 15-24 cm. The head is brown, the upper back is brownish gray or brownish yellow, the lower part is orange-yellow or orange-red, and the belly is light grayish yellow. The nostrils point downward, and it has cheek pouches. The callosities of the buttocks (pian2 zhi1) are conspicuous.

Because the title comprises only one phrase, "macaque" can be determined to be the notional word of the document. For simplicity of illustration, the in-paragraph position attenuation coefficient and the paragraph attenuation coefficient are both set to 1; that is, the position at which a keyword is matched in the text does not influence the score.

Because no preset title keyword or punctuation mark is matched in the title, the title is not scored.

For the first paragraph of the text, the preset text keywords and/or punctuation marks match as follows. The name-form keyword "another name" is matched, giving the name-form category bonus 20*0.6^0=20. The name-form keyword "scientific name" (the formal name) is matched, giving the name-form category bonus 20*0.6^1=12. The relational expression keyword "belongs to" is matched, giving the relational expression category bonus 20*0.9^0=20. Here the name-form keyword category has a basic score of 20 and a matching times attenuation coefficient of 0.6, and the relational expression keyword category has a basic score of 20 and a matching times attenuation coefficient of 0.9. Because "scientific name" is the second name-form keyword matched, one attenuation is applied.

For the second paragraph of the text, matching against the preset text keywords and/or punctuation marks finds the attribute formula keyword "body length", giving the attribute formula category bonus 10*0.4^0=10. Here the attribute formula keyword category has a basic score of 10 and a matching times attenuation coefficient of 0.4.

For the first and second paragraphs of the text, secondary matching is performed with the notional word "macaque" obtained from the title. No match is found in the first paragraph, so no score is given. In the second paragraph, the notional word "macaque" is matched and the definition keyword "is" is matched, and because the matched positions of the notional word and the definition keyword are adjacent, the bonus is 15*0.5^0+40=55. Here the notional word has a basic score of 15 and a matching times attenuation coefficient of 0.5, and the adjacency reward score is 40.

At this point the total score is 117 (20+12+20+10+55). If the predefined threshold is 80, this document can be judged to be a concept type description document. Of course, the predefined threshold can be determined through evaluation tests, and the scores and attenuation coefficients of the various keyword categories are also adjustable.
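The arithmetic of embodiment 4 can be reproduced with a short Python sketch. The basic scores, coefficients, reward and threshold are the values stated in the example; the tuple layout, the names `MATCHES` and `total_score`, and the use of a strict "greater than" comparison against the threshold are assumptions made for illustration.

```python
# (description, basic score, attenuation coefficient,
#  earlier matches in the same category, extra adjacency reward)
MATCHES = [
    ('name-form keyword "another name", paragraph 1', 20, 0.6, 0, 0),
    ('name-form keyword "scientific name", paragraph 1', 20, 0.6, 1, 0),
    ('relational expression keyword "belongs to", paragraph 1', 20, 0.9, 0, 0),
    ('attribute formula keyword "body length", paragraph 2', 10, 0.4, 0, 0),
    ('notional word "macaque" adjacent to "is", paragraph 2', 15, 0.5, 0, 40),
]
THRESHOLD = 80  # predefined threshold from the example


def total_score(matches):
    """Sum basic score * coefficient^(earlier matches) + reward over all matches."""
    return sum(base * coefficient ** earlier + reward
               for _, base, coefficient, earlier, reward in matches)


total = total_score(MATCHES)
# Assumption: the document is accepted when the total exceeds the threshold.
is_concept_type = total > THRESHOLD

if __name__ == "__main__":
    print(round(total), is_concept_type)  # prints "117 True"
```

The position and paragraph attenuation coefficients are omitted because the example sets both to 1.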

Fig. 4a to Fig. 4c are structural block diagrams of embodiments of the recognition system for concept type documents of the present invention, which specifically comprises the following components:

Source document reading device 401;

Coalignment 402, used for matching the document against preset key characters, where the preset key characters are obtained from the expression mode of concept type documents;

Analytical equipment 403, comprising a score module 4031 and a comparison determination module 4032, where the score module is used for scoring according to the matching result, and the comparison determination module is used for comparing the score sum with the preset judgment threshold to determine whether this source document is a concept type document.

Source document reading device 401 obtains an original document, or a preprocessed document, from a document acquisition device or document storage device outside the system; both are referred to below as the document. Besides text data, the document content can also include related data such as format information and hypertext link information. The document is stored in a computer-readable medium, and during processing the document content can be provided to the concept type analysis device through computer-readable instructions.

Preferably, if the document to be analyzed needs to be preprocessed before matching and scoring, the embodiment shown in Fig. 4 can also comprise a pretreatment module 404, used for preprocessing the source document and obtaining the document fragments to be analyzed.

Among devices 401, 402 and 403 of the embodiment shown in Fig. 4, analytical equipment 403 can serve as the master control device, in which case pretreatment module 404 can be located in analytical equipment 403 (see Fig. 4a). Source document reading device 401 sends the document it has read to analytical equipment 403, and the two can be connected by any data transmission channel. Analytical equipment 403 receives the document to be analyzed and, after preprocessing (for example, title processing or division into document fragments to be analyzed), sends it to coalignment 402; it then receives the matching result returned by coalignment 402, scores according to the matching result, and completes the judgment. Analytical equipment 403 and coalignment 402 can be connected by any two-way data transmission channel.

Among devices 401, 402 and 403 of the embodiment shown in Fig. 4, coalignment 402 can also serve as the master control device, in which case pretreatment module 404 can be located in coalignment 402 (see Fig. 4b). Source document reading device 401 sends the document it has read to coalignment 402, and the two can be connected by any data transmission channel. Coalignment 402 receives the document to be analyzed and, after preprocessing (for example, title processing or division into document fragments to be analyzed), completes the matching and sends the matching result to analytical equipment 403; analytical equipment 403 scores according to the matching result and completes the judgment. Analytical equipment 403 and coalignment 402 can be connected by any two-way data transmission channel.

The situation shown in Fig. 4a is explained in more detail below with reference to the preferred embodiment of Fig. 4c:

In Fig. 4c, the preset key characters comprise title keywords, text keywords and punctuation marks.

The coalignment 402 comprises:

Title matching module 4021, used for matching against the preset title keywords and/or punctuation marks in the title of the source document;

Text matching module 4022, used for matching against the preset text keywords and/or punctuation marks in the text of the source document, where the text used for matching and scoring comprises all paragraphs of the source document or, alternatively, only the paragraphs of the source document that meet a precondition.

Correspondingly, the score module can comprise:

Title score submodule 40311, used for scoring according to the first presetting rule when a preset title keyword and/or punctuation mark is matched in the title of the source document;

Text score submodule 40312, used for scoring according to the second presetting rule when a preset text keyword and/or punctuation mark is matched in the text of the source document.

Preferably, the analytical equipment 403 can also comprise:

Notional word acquisition module 4033, used for cutting the title of the source document according to the matching result output by title matching module 4021 to obtain the notional word of the source document;

Notional word score submodule 40313, used for judging, when the notional word is matched, whether the matched position of the notional word is adjacent to the matched position of a text keyword or punctuation mark and, if so, scoring according to the third presetting rule; and for scoring according to the fourth presetting rule when the notional word is not matched.

Correspondingly, the coalignment 402 can also comprise:

Notional word matching module 4023, used for matching against the notional word of the source document in the text of the source document.

Preferably, the preferred embodiment shown in Fig. 4c can also comprise:

Memory storage 405, used for storing, when this source document is determined to be a concept type document, the source document, the concept type judgment conclusion and the notional word of this source document in association with one another.

In Fig. 4c, analytical equipment 403 preprocesses the received document to be analyzed and sends the natural components of the document to coalignment 402 in sequence. The natural components are generally the title and each paragraph of the text; this division usually depends only on explicit structural information such as line breaks and does not require natural language word segmentation. That is, analytical equipment 403 proceeds in order through the document title and the paragraphs of the document text, and each part is scanned by coalignment 402. Coalignment 402 returns the matching results of the various preset keywords and punctuation marks on the document fragment provided by analytical equipment 403 (including whether there is a match, the matched position information and so on). Analytical equipment 403 evaluates the document by scoring, according to the type, position and relative positional relationship in the text of the matched keywords or punctuation marks. When a conclusion can be reached, analytical equipment 403 judges whether the document is of the concept type.
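The division into natural components can be sketched in Python as follows. The sketch assumes that the first non-empty line of the input is the title and that each remaining non-empty line is one paragraph; the function `split_document` and its `max_paragraphs` parameter are hypothetical and only illustrate a purely structural split with an optional limit on the paragraphs analyzed.

```python
def split_document(raw_text, max_paragraphs=None):
    """Split raw text into (title, paragraphs) using line breaks only."""
    lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
    if not lines:
        return "", []
    # Assumption: the first non-empty line is the title.
    title, paragraphs = lines[0], lines[1:]
    if max_paragraphs is not None:
        # Analyze only a limited number of leading paragraphs.
        paragraphs = paragraphs[:max_paragraphs]
    return title, paragraphs


if __name__ == "__main__":
    title, paragraphs = split_document(
        "macaque\n\nfirst paragraph\nsecond paragraph\nthird paragraph\n",
        max_paragraphs=2,
    )
    print(title, paragraphs)
```

Each returned fragment would then be passed in order to the matching step, title first and paragraphs afterwards.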

Preferably, analytical equipment 403 also sends the notional word obtained by analyzing the title matching result to coalignment 402; coalignment 402 matches this notional word on the document fragments provided by analytical equipment 403 and returns the matching result; analytical equipment 403 also includes this matching result in the scoring, and determines one or more notional words.

Finally, analytical equipment 403 can send the document analysis result (whether the document is a concept type document, the corresponding notional word, and so on) to memory storage 405 through a data transmission channel for storage in any medium. Preferably, for a concept type document, the extracted notional word and the document are stored in association as data that a computer or user can identify.

Preferably, the preset text keywords can comprise definition keywords, name-form keywords, attribute formula keywords and/or relational expression keywords. The second presetting rule can comprise a basic score and a matching times attenuation coefficient for each text keyword or punctuation mark, where the attenuation applied depends on the number of matches. Further, the second presetting rule can also comprise an in-paragraph position attenuation coefficient and/or a paragraph attenuation coefficient.

Fig. 5 shows a structural block diagram of another embodiment of the recognition system of the present invention, which specifically comprises the following components:

Source document read module 501;

Coupling score module 502, used for matching the document against preset key characters and scoring according to the matching result, where the preset key characters are obtained from the expression mode of concept type documents;

Comparison determination module 503, used for comparing the score sum with the preset judgment threshold to determine whether this source document is a concept type document.

Preferably, if the document to be analyzed needs to be preprocessed before matching and scoring, the embodiment shown in Fig. 5 can also comprise a pretreatment module 504, used for preprocessing the source document and obtaining the document fragments to be analyzed.

Preferably, the coupling score module 502 comprises: title coupling score submodule 5021, used for matching against the preset title keywords and/or punctuation marks in the title of the source document and scoring according to the first presetting rule; and text matching score submodule 5022, used for matching against the preset text keywords and/or punctuation marks in the text of the source document and, when a match is found, scoring according to the second presetting rule.

Preferably, in the embodiment shown in Fig. 5, coupling score module 502 can also comprise: notional word obtaining submodule 5023, used for matching against the preset title keywords and/or punctuation marks in the title of the source document and cutting the title according to the matching result to obtain the notional word of the source document.

Notional word coupling score submodule 5024, used for matching against the notional word of the source document in the text of the source document; when a match is found, judging whether the matched position of the notional word is adjacent to the matched position of a text keyword or punctuation mark and, if so, scoring according to the third presetting rule; when no match is found, scoring according to the fourth presetting rule.

Preferably, the embodiment shown in Fig. 5 also comprises: memory module 505, used for storing, when this source document is determined to be a concept type document, the source document, the concept type judgment conclusion and the notional word of this source document in association with one another.

The design of the embodiment shown in Fig. 5 is similar to that of the embodiment shown in Fig. 4; the difference lies in how the functional modules are divided. The embodiment shown in Fig. 4 places identical calculating operations (for example, matching or scoring) in one component, whereas the embodiment shown in Fig. 5 places the calculating operations for the same document fragment (for example, the title or the text) in one component. In practice the main difference between the two is the connection relationship between the modules; those skilled in the art can choose between them as required, and this is not described in detail here.

Because the embodiments of the concept type document recognition system shown in Fig. 4 and Fig. 5 correspond to and apply to the foregoing method embodiments, the descriptions of Fig. 4 and Fig. 5 are comparatively brief; for the parts not detailed, refer to the corresponding parts earlier in this specification.

The method and system for concept type document recognition provided by the present invention have been described in detail above. Specific cases have been used herein to set forth the principle and embodiments of the present invention, and the explanation of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, this description should not be construed as limiting the present invention.

Claims

1. A method for recognizing a concept type document, characterized by comprising:

Step a: reading a source document;

2. The method as claimed in claim 1, characterized in that the key characters comprise title keywords, text keywords and punctuation marks.

3. The method as claimed in claim 2, characterized in that step b is completed in the following manner:

in the title of the source document, matching against the preset title keywords and/or punctuation marks, and scoring according to a first presetting rule;

in the text of the source document, matching against the preset text keywords and/or punctuation marks, and when a match is found, scoring according to a second presetting rule.

4. The method as claimed in claim 3, characterized by further comprising:

in the title of the source document, matching against the preset title keywords and/or punctuation marks, and cutting the title according to the matching result to obtain the notional word of the source document.

5. The method as claimed in claim 4, characterized by further comprising:

in the text of the source document, matching against the notional word of the source document; when a match is found, judging whether the matched position of the notional word is adjacent to the matched position of a text keyword or punctuation mark, and if so, scoring according to a third presetting rule.

6. The method as claimed in claim 5, characterized by further comprising:

in the text of the source document, matching against the notional word of the source document; when no match is found, scoring according to a fourth presetting rule.

7. The method as claimed in claim 4, 5 or 6, characterized by further comprising:

if this source document is determined to be a concept type document, storing the source document, the concept type judgment conclusion and the notional word of this source document in association with one another.

8. The method as claimed in claim 2, characterized in that the preset text keywords comprise definition keywords, name-form keywords, attribute formula keywords and/or relational expression keywords.

9. The method as claimed in claim 3, characterized in that the second presetting rule comprises a basic score and a matching times attenuation coefficient for each text keyword or punctuation mark.

10. The method as claimed in claim 9, characterized in that the second presetting rule also comprises an in-paragraph position attenuation coefficient and/or a paragraph attenuation coefficient.

11. The method as claimed in claim 3, characterized in that

the text used for matching and scoring comprises all paragraphs of the source document;

or the text used for matching and scoring comprises only the paragraphs of the source document that meet a precondition.

12. A system for recognizing a concept type document, characterized by comprising:

a source document reading device;

13. The system as claimed in claim 12, characterized in that the key characters comprise title keywords, text keywords and punctuation marks.

14. The system as claimed in claim 13, characterized by further comprising:

a pretreatment module, used for preprocessing the source document and obtaining the document fragments to be analyzed.

15. The system as claimed in claim 14, characterized in that

the coalignment comprises:

a title matching module, used for matching against the preset title keywords and/or punctuation marks in the title of the source document;

a text matching module, used for matching against the preset text keywords and/or punctuation marks in the text of the source document;

the score module comprises:

a title score submodule, used for scoring according to a first presetting rule when a preset title keyword and/or punctuation mark is matched in the title of the source document;

a text score submodule, used for scoring according to a second presetting rule when a preset text keyword and/or punctuation mark is matched in the text of the source document.

16. The system as claimed in claim 15, characterized in that the analytical equipment also comprises:

a notional word acquisition module, used for cutting the title of the source document according to the matching result output by the title matching module to obtain the notional word of the source document.

17. The system as claimed in claim 16, characterized by further comprising:

a notional word matching module, used for matching against the notional word of the source document in the text of the source document;

a notional word score submodule, used for judging, when the notional word is matched, whether the matched position of the notional word is adjacent to the matched position of a text keyword or punctuation mark and, if so, scoring according to a third presetting rule; and for scoring according to a fourth presetting rule when the notional word is not matched.

18. The system as claimed in claim 16 or 17, characterized by further comprising:

a memory storage, used for storing, when this source document is determined to be a concept type document, the source document, the concept type judgment conclusion and the notional word of this source document in association with one another.

19. The system as claimed in claim 13, characterized in that the preset text keywords comprise definition keywords, name-form keywords, attribute formula keywords and/or relational expression keywords.

20. The system as claimed in claim 15, characterized in that the second presetting rule comprises a basic score and a matching times attenuation coefficient for each text keyword or punctuation mark.

21. The system as claimed in claim 20, characterized in that the second presetting rule also comprises an in-paragraph position attenuation coefficient and/or a paragraph attenuation coefficient.

22. The system as claimed in claim 15, characterized in that,

23. A device for recognizing a concept type document, characterized by comprising:

a source document read module;

24. The device as claimed in claim 23, characterized by further comprising:

25. The device as claimed in claim 24, characterized in that the coupling score module comprises:

a title coupling score submodule, used for matching against the preset title keywords and/or punctuation marks in the title of the source document and scoring according to a first presetting rule;

a text matching score submodule, used for matching against the preset text keywords and/or punctuation marks in the text of the source document and, when a match is found, scoring according to a second presetting rule.

26. The device as claimed in claim 25, characterized by further comprising:

a notional word obtaining submodule, used for matching against the preset title keywords and/or punctuation marks in the title of the source document and cutting the title according to the matching result to obtain the notional word of the source document;

a notional word coupling score submodule, used for matching against the notional word of the source document in the text of the source document; when a match is found, judging whether the matched position of the notional word is adjacent to the matched position of a text keyword or punctuation mark and, if so, scoring according to a third presetting rule; when no match is found, scoring according to a fourth presetting rule.

27. The device as claimed in claim 26, characterized by further comprising:

a memory module, used for storing, when this source document is determined to be a concept type document, the source document, the concept type judgment conclusion and the notional word of this source document in association with one another.