Method and device for extracting web page information
Technical field
The present invention relates to the field of data processing, and in particular to a method and device for extracting web page information.
Background
Collected web page information is an important data source for big data analysis. At present there are two main approaches to collecting web page information. One is rule-based: page elements are extracted using regular expressions, XPath, or CSS selectors. The other is statistics-based: a model is trained by machine learning on manually annotated data, and information is extracted according to the model.
The rule-based method analyzes the HTML (HyperText Markup Language) code to find the left and right boundaries of the information to be extracted and extracts the information with regular expressions or other means. Alternatively, it builds a DOM (Document Object Model) tree for the page, selects page elements with XPath or CSS selectors, and thereby selects the elements that contain the information to be extracted.
Rule-based extraction is accurate but has poor applicability: a rule can often extract information from only one kind of page, and extraction errors can occur if the page changes.
The statistics-based method uses machine learning to train a model on accurate, manually annotated results, and then identifies and extracts information with the trained model.
The statistics-based method has good applicability and can be used for a wide range of web pages, but it consumes substantial resources and depends heavily on manual annotation, so the quality of the extracted information is strongly correlated with the quality of the annotation. Accuracy cannot be fully guaranteed: a trained model is not tailored to a specific web page, so new pages may lead to incomplete extraction or extraction failure.
No effective solution has yet been proposed for the problem of low web page information extraction accuracy in the prior art.
Summary of the invention
The main object of the present invention is to provide a method and device for extracting web page information, so as to solve the problem of low web page information extraction accuracy in the prior art.
To achieve the above object, according to one aspect of the embodiments of the present invention, a method for extracting web page information is provided.
The method for extracting web page information according to the present invention includes: obtaining the HyperText Markup Language (HTML) code of a plurality of web pages to be extracted; clustering the plurality of web pages to be extracted according to the HTML code to obtain a plurality of page categories; extracting the target block elements in each page category, wherein a target block element is a block element shared by the different web pages to be extracted in the same page category; extracting the text in the target block element to obtain a text set of the target block element; calculating an index value of the text set, wherein the index value represents the degree of difference among the texts in the text set; and extracting the text in each text set whose index value is greater than a first preset threshold to obtain the web page information.
Further, calculating the index value of the text set includes: recording the number of occurrences of each distinct text in the text set; determining, from the number of occurrences of each distinct text, the total number of occurrences of all texts in the text set; calculating, from the number of occurrences of each distinct text and the total number of occurrences, the occurrence frequency of each distinct text in the text set; and determining the index value of the text set according to the occurrence frequency of each distinct text in the text set.
Further, determining the index value of the text set according to the occurrence frequency of each distinct text in the text set includes calculating the index value of the text set according to the formula
$$E_{Set} = -\sum_{i=1}^{m} p(text_i)\,\log p(text_i)$$
where $E_{Set}$ is the index value of the text set, $m$ is the number of distinct texts in the text set, and $p(text_i)$ is the occurrence frequency of the $i$-th distinct text in the text set.
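As a worked illustration of the index value $E_{Set}$, taking it to be the negated sum of each occurrence frequency multiplied by its logarithm (as the detailed description states), consider two small text sets. The natural logarithm is an assumption here, since the text does not state the base; a different base only rescales the values.

```latex
% Three distinct texts, e.g. {"title 1", "title 2", "title 3"}: m = 3, p(text_i) = 1/3
E_{Set} = -\sum_{i=1}^{3} \tfrac{1}{3}\,\log\tfrac{1}{3} = \log 3 \approx 1.0986

% Three identical texts, e.g. a menu label repeated on every page: m = 1, p(text_1) = 1
E_{Set} = -\,1 \cdot \log 1 = 0
```

The set whose texts all differ gets a high index value, while the set of identical texts gets zero, which is why a threshold on $E_{Set}$ can separate page-specific content from shared boilerplate.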
Further, after extracting the text in each text set whose index value is greater than the first preset threshold to obtain the web page information, the extraction method further includes: recording the category attribute of the text.
Further, the page categories of a first web page to be extracted and a second web page to be extracted, which are any two of the plurality of web pages to be extracted, are determined as follows: building a first tree structure from the HTML code of the first web page to be extracted, and building a second tree structure from the HTML code of the second web page to be extracted; extracting the block elements containing a preset attribute from the first tree structure to obtain first block elements, and extracting the block elements containing the preset attribute from the second tree structure to obtain second block elements; calculating the similarity average of the first web page to be extracted and the second web page to be extracted from the first block elements and the second block elements; comparing the similarity average with a second preset threshold; and, when the similarity average is greater than the second preset threshold, determining that the first and second web pages to be extracted belong to the same page category, or, when the similarity average is less than or equal to the second preset threshold, determining that the first and second web pages to be extracted belong to different page categories.
To achieve the above object, according to another aspect of the embodiments of the present invention, a device for extracting web page information is provided.
The device for extracting web page information according to the present invention includes: an acquiring unit, configured to obtain the HyperText Markup Language (HTML) code of a plurality of web pages to be extracted; a clustering unit, configured to cluster the plurality of web pages to be extracted according to the HTML code to obtain a plurality of page categories; a first extraction unit, configured to extract the target block elements in each page category, wherein a target block element is a block element shared by the different web pages to be extracted in the same page category; a second extraction unit, configured to extract the text in the target block element to obtain a text set of the target block element; a first calculation unit, configured to calculate an index value of the text set, wherein the index value represents the degree of difference among the texts in the text set; and a third extraction unit, configured to extract the text in each text set whose index value is greater than a first preset threshold to obtain the web page information.
Further, the first calculation unit includes: a recording module, configured to record the number of occurrences of each distinct text in the text set; a first determining module, configured to determine, from the number of occurrences of each distinct text, the total number of occurrences of all texts in the text set; a calculating module, configured to calculate, from the number of occurrences of each distinct text and the total number of occurrences, the occurrence frequency of each distinct text in the text set; and a second determining module, configured to determine the index value of the text set according to the occurrence frequency of each distinct text in the text set.
Further, the second determining module includes a calculating submodule, configured to calculate the index value of the text set according to the formula
$$E_{Set} = -\sum_{i=1}^{m} p(text_i)\,\log p(text_i)$$
where $E_{Set}$ is the index value of the text set, $m$ is the number of distinct texts in the text set, and $p(text_i)$ is the occurrence frequency of the $i$-th distinct text in the text set.
Further, the extraction device also includes: a recording unit, configured to record the category attribute of the text after the text in each text set whose index value is greater than the first preset threshold has been extracted to obtain the web page information.
Further, the extraction device also includes: an establishing unit, configured to build a first tree structure from the HTML code of a first web page to be extracted and a second tree structure from the HTML code of a second web page to be extracted, wherein the first and second web pages to be extracted are any two of the plurality of web pages to be extracted; a fourth extraction unit, configured to extract the block elements containing a preset attribute from the first tree structure to obtain first block elements, and to extract the block elements containing the preset attribute from the second tree structure to obtain second block elements; a second calculation unit, configured to calculate the similarity average of the first and second web pages to be extracted from the first block elements and the second block elements; a comparing unit, configured to compare the similarity average with a second preset threshold; and a processing unit, configured to determine that the first and second web pages to be extracted belong to the same page category when the similarity average is greater than the second preset threshold, or to determine that they belong to different page categories when the similarity average is less than or equal to the second preset threshold.
According to the embodiments of the invention, the HTML code of a plurality of web pages to be extracted is obtained; the plurality of web pages to be extracted are clustered according to the HTML code to obtain a plurality of page categories; the target block elements in each page category are extracted, wherein a target block element is a block element shared by the different web pages to be extracted in the same page category; the text content in the target block element is extracted to obtain a text set of the target block element; an index value of the text set is calculated, wherein the index value represents the degree of difference among the texts in the text set; and the text in each text set whose index value is greater than a first preset threshold is extracted to obtain the web page information. Obtaining the HTML code of the web pages to be extracted makes it possible to divide them into page categories and then to obtain the block elements that the different web pages in one page category have in common. The text content of each such block element can then be extracted, and comparing the degree of difference of that text content with a preset threshold determines whether the text content is information that needs to be extracted from the web pages. This solves the problem of low web page information extraction accuracy in the prior art and thereby improves the accuracy of web page information extraction.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for a further understanding of the present invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flow chart of the method for extracting web page information according to an embodiment of the present invention; and
Fig. 2 is a schematic diagram of the device for extracting web page information according to an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that data so described are interchangeable where appropriate, so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described. In addition, the terms "comprising" and "having" and any variations of them are intended to cover non-exclusive inclusion: for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, a method embodiment is provided that can be used to implement the device embodiment of this application. It should be noted that the steps illustrated in the flow chart of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flow chart, in some cases the steps shown or described may be performed in a different order.
According to an embodiment of the present invention, a method for extracting web page information is provided. Fig. 1 is a flow chart of the method for extracting web page information according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps S102 to S112:
S102: Obtain the HyperText Markup Language (HTML) code of a plurality of web pages to be extracted. Specifically, the HTML code of the web pages to be extracted may be obtained simultaneously, or the HTML code of each web page to be extracted may be obtained one by one in sequence.
S104: Cluster the plurality of web pages to be extracted according to the HTML code to obtain a plurality of page categories. That is, the web pages to be extracted are classified according to the HTML code obtained for each of them, and similar web pages among them are grouped into one category. It should be noted that a web page to be extracted can belong to only one page category.
S106: Extract the target block elements in each page category, where a target block element is a block element shared by the different web pages to be extracted in the same page category. Specifically, there may be one target block element or several. In this embodiment, the number of target block elements is determined by the number of block elements shared by the different pages to be extracted in the same page category. A shared block element is a block element whose tag name and attribute are identical across the different pages to be extracted in the same page category, where the attribute is the class attribute or the id attribute. For example, web page 1, web page 2, and web page 3 belong to page category A, and each of the three web pages contains the same 3 block elements, namely div[class="menu"], div[id="title"], and p[class="content"]; page category A then has 3 target block elements.
S108: Extract the text in the target block element to obtain the text set of the target block element. Specifically, the same target block element contains several texts across the pages, and the set of these texts is the text set of that target block element. If there are several target block elements, the text in each of them is extracted to obtain a text set for each target block element. Continuing the example above, for the target block element div[id="title"] the resulting text set is {"title 1", "title 2", "title 3"}.
S110: Calculate the index value of the text set, where the index value represents the degree of difference among the texts in the text set. That is, the degree of difference of the text in the target block element is calculated; the greater the degree of difference, the more the text content in that target block element varies.
S112: Extract the text in each text set whose index value is greater than the first preset threshold to obtain the web page information. That is, only the text in a text set whose index value is greater than the first preset threshold is information that needs to be extracted from the web pages to be extracted. Specifically, the first preset threshold can be set as required.
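Steps S106 and S108 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes well-formed HTML, identifies a block element by a hypothetical `(tag, attribute, value)` signature, considers any tag that carries a `class` or `id` attribute, and uses only the first such attribute on an element. The names `BlockTextParser` and `target_text_sets` are illustrative.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they never hold text of their own.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BlockTextParser(HTMLParser):
    """Maps each (tag, attribute, value) signature to the texts found inside it."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of (signature or None, text parts)
        self.texts = {}          # signature -> list of texts on this page

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        signature = None
        for name, value in attrs:
            if name in ("class", "id") and value:
                signature = (tag, name, value)
                break
        self.open_elements.append((signature, []))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.open_elements:
            return
        signature, parts = self.open_elements.pop()
        if signature is not None:
            self.texts.setdefault(signature, []).append("".join(parts).strip())

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for _, parts in self.open_elements:
            parts.append(data)


def target_text_sets(pages):
    """Return {signature: texts} for block elements shared by all pages of one category."""
    per_page = []
    for html in pages:
        parser = BlockTextParser()
        parser.feed(html)
        per_page.append(parser.texts)
    if not per_page:
        return {}
    shared = set(per_page[0]).intersection(*per_page[1:])
    return {sig: [t for texts in per_page for t in texts[sig]] for sig in shared}
```

Given three pages of one category that each contain `div[class="menu"]`, `div[id="title"]`, and `p[class="content"]`, the function returns one text set per shared element; an element that appears on only one page is left out.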
In this embodiment, obtaining the HTML code of the web pages to be extracted makes it possible to divide them into page categories and then to obtain the block elements that the different web pages in one page category have in common. The text content of each such block element can then be extracted, and comparing the degree of difference of that text content with a preset threshold determines whether the text content is information that needs to be extracted from the web pages. This solves the problem of low web page information extraction accuracy in the prior art and thereby improves the accuracy of web page information extraction.
It should be noted that if there are several target block elements, the index value of the text set of each target block element must be calculated separately, each calculated index value is compared with the first preset threshold, and the text in every text set whose index value is greater than the first preset threshold is extracted.
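The per-element comparison described above amounts to a simple filter. The sketch below assumes the index values have already been calculated; the function name `select_text_sets`, the element labels, and the threshold of 0.5 are illustrative assumptions, not values from the source.

```python
def select_text_sets(text_sets, index_values, first_threshold):
    """Keep only the text sets whose index value exceeds the first preset threshold."""
    return {
        element: texts
        for element, texts in text_sets.items()
        if index_values[element] > first_threshold
    }


# Hypothetical inputs: one text set and one precomputed index value per target block element.
text_sets = {
    'div[class="menu"]': ["Home", "Home", "Home"],
    'div[id="title"]': ["title 1", "title 2", "title 3"],
}
index_values = {'div[class="menu"]': 0.0, 'div[id="title"]': 1.0986}
selected = select_text_sets(text_sets, index_values, first_threshold=0.5)
```

With these inputs only the title element survives, because the menu text is identical on every page and its index value does not exceed the threshold.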
Specifically, the index value of the text set can be calculated through steps 1-1 to 1-4, which are as follows:
Step 1-1: Record the number of occurrences of each distinct text in the text set. Because the text set contains several texts, some of them may have identical content; in this embodiment, occurrences are counted once per distinct text content, that is, for each text whose content differs from the others.
Step 1-2: Determine the total number of occurrences of all texts in the text set from the number of occurrences of each distinct text. Specifically, the total number of occurrences of all texts in the text set equals the sum of the numbers of occurrences of all distinct texts.
Step 1-3: Calculate the occurrence frequency of each distinct text in the text set from its number of occurrences and the total number of occurrences. For example, if a text A that differs from the other texts in the text set occurs 3 times, and the total number of occurrences of all texts in the text set is 30, then the occurrence frequency of text A in the text set is 1/10.
Step 1-4: Determine the index value of the text set according to the occurrence frequency of each distinct text in the text set.
If there are several target block elements, the index value of the text set of each target block element can be calculated by repeating steps 1-1 to 1-4.
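Steps 1-1 to 1-3 can be sketched with the standard library as below. The function name `occurrence_frequencies` is illustrative, and exact fractions are used only so that the 3-out-of-30 example can be checked without floating-point rounding.

```python
from collections import Counter
from fractions import Fraction


def occurrence_frequencies(texts):
    """Return the occurrence frequency of each distinct text in a text set."""
    counts = Counter(texts)       # step 1-1: occurrences of each distinct text
    total = sum(counts.values())  # step 1-2: total occurrences of all texts
    # step 1-3: frequency = occurrences of the text / total occurrences
    return {text: Fraction(count, total) for text, count in counts.items()}


# Text A occurs 3 times among 30 texts in total, so its frequency is 1/10.
frequencies = occurrence_frequencies(["A"] * 3 + ["B"] * 27)
```

The frequencies of all distinct texts always sum to 1, which is what allows step 1-4 to treat them as a distribution.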
Specifically, in this embodiment, determining the index value of the text set according to the occurrence frequency of each distinct text in the text set includes calculating the index value of the text set according to the formula
$$E_{Set} = -\sum_{i=1}^{m} p(text_i)\,\log p(text_i)$$
where $E_{Set}$ is the index value of the text set, $m$ is the number of distinct texts in the text set, and $p(text_i)$ is the occurrence frequency of the $i$-th distinct text in the text set. In other words, the occurrence frequency of each distinct text is multiplied by the logarithm of that frequency, all of the resulting products are summed, and the sum is negated to give the index value of the text set.
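The calculation just described (multiply each frequency by its logarithm, sum, negate) can be sketched as follows. The function name `index_value` is illustrative, and the natural logarithm is an assumption because the text does not state the base.

```python
import math
from collections import Counter


def index_value(texts):
    """Negated sum of p * log(p) over the distinct texts of a text set."""
    counts = Counter(texts)
    total = sum(counts.values())
    # Assumption: natural logarithm; another base only rescales the result.
    return -sum((n / total) * math.log(n / total) for n in counts.values())


varied = index_value(["title 1", "title 2", "title 3"])  # every text differs
uniform = index_value(["Home", "Home", "Home"])          # every text is the same
```

A text set whose texts all differ yields the largest value for its size, and a set of identical texts yields zero, so the first preset threshold separates the two cases.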
Preferably, after the text in each text set whose index value is greater than the first preset threshold has been extracted to obtain the web page information, the method for extracting web page information provided by this embodiment further includes recording the category attribute of the text. Specifically, the category attribute may be title, content, and so on; that is, this embodiment records whether the extracted text is a title, content, or the like.
In this embodiment, recording the category attribute of the extracted text lets the user quickly filter out the required information during subsequent big data analysis, which improves user satisfaction. For example, if the user wants to select, from the extracted web page information, the information that is a title, the user only needs to choose the category attribute "title" to quickly filter out the web page information that meets the requirement.
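Recording a category attribute alongside each extracted text and filtering on it later might look like the sketch below. The record type `ExtractedText` and the function `filter_by_category` are hypothetical names introduced for illustration; the source does not prescribe a storage format.

```python
from dataclasses import dataclass


@dataclass
class ExtractedText:
    text: str
    category: str  # category attribute, e.g. "title" or "content"


def filter_by_category(records, category):
    """Return the extracted texts whose category attribute matches."""
    return [record for record in records if record.category == category]


records = [
    ExtractedText("title 1", "title"),
    ExtractedText("body of page 1", "content"),
    ExtractedText("title 2", "title"),
]
titles = filter_by_category(records, "title")
```

Selecting the category attribute "title" returns only the two title records, in their original order.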
This embodiment also provides a specific way of determining the page category of the pages to be extracted. Taking a first web page to be extracted and a second web page to be extracted as any two of the plurality of web pages to be extracted, the page categories of the first and second web pages to be extracted can be determined through steps 2-1 to 2-5:
Step 2-1: Build a first tree structure from the HTML code of the first web page to be extracted, and build a second tree structure from the HTML code of the second web page to be extracted.
Step 2-2: Extract the block elements containing a preset attribute from the first tree structure to obtain first block elements, and extract the block elements containing the preset attribute from the second tree structure to obtain second block elements. Specifically, the preset attribute is the class attribute or the id attribute. That is, this step extracts only the block elements that carry a class attribute or an id attribute from the HTML code of the first web page to be extracted, and likewise only the block elements that carry a class attribute or an id attribute from the HTML code of the second web page to be extracted.
Step 2-3: Calculate the similarity average of the first web page to be extracted and the second web page to be extracted from the first block elements and the second block elements. In this embodiment, the similarity average can be calculated according to the formula V = (S1 + S2)/2, where V is the similarity average, S1 is the first similarity between the first and second web pages to be extracted, and S2 is the second similarity between the first and second web pages to be extracted. Specifically, the first similarity S1 can be calculated according to a formula [formula missing], where Kp is a block element common to the first and second web pages to be extracted, p runs from 1 to m, m is the number of common block elements, V1Kp is the occurrence frequency of the common block element Kp in the first web page to be extracted, K0k is a first block element in the first web page to be extracted, N1 is the number of first block elements in the first web page to be extracted, and [symbol missing] is the occurrence frequency of the first block element K0k in the first web page to be extracted. The second similarity S2 can be calculated according to a formula [formula missing], where V2Kp is the occurrence frequency of the common block element Kp in the second web page to be extracted, K1k is a second block element in the second web page to be extracted, N2 is the number of second block elements in the second web page to be extracted, and [symbol missing] is the occurrence frequency of the second block element K1k in the second web page to be extracted.
Step 2-4: Compare the similarity average with the second preset threshold. Specifically, the second preset threshold can also be set as required.
Step 2-5: When the similarity average is greater than the second preset threshold, determine that the first web page to be extracted and the second web page to be extracted belong to the same page category; when the similarity average is less than or equal to the second preset threshold, determine that the first web page to be extracted and the second web page to be extracted belong to different page categories.
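Steps 2-2 to 2-5 can be sketched as below. Two points are assumptions rather than statements of the source: the formulas for S1 and S2 are not legible in this text, so each one-sided similarity is taken here to be the occurrences of the common block elements in a page divided by the occurrences of all attributed block elements in that page; and a block element is identified by a hypothetical `(tag, attribute, value)` signature collected by a streaming parser instead of an explicit tree. All function and class names are illustrative.

```python
from collections import Counter
from html.parser import HTMLParser


class AttributedBlockCounter(HTMLParser):
    """Counts the elements that carry a class or id attribute (step 2-2)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("class", "id") and value:
                self.counts[(tag, name, value)] += 1


def attributed_blocks(html):
    parser = AttributedBlockCounter()
    parser.feed(html)
    return parser.counts


def one_sided_similarity(own, other):
    # ASSUMPTION: share of this page's block-element occurrences that belong
    # to elements also present in the other page. The source formula is missing.
    total = sum(own.values())
    if total == 0:
        return 0.0
    common = sum(n for signature, n in own.items() if signature in other)
    return common / total


def similarity_average(html1, html2):
    """Step 2-3: V = (S1 + S2) / 2."""
    blocks1, blocks2 = attributed_blocks(html1), attributed_blocks(html2)
    s1 = one_sided_similarity(blocks1, blocks2)
    s2 = one_sided_similarity(blocks2, blocks1)
    return (s1 + s2) / 2


def same_category(html1, html2, second_threshold):
    """Steps 2-4 and 2-5: same category only if V is strictly greater than the threshold."""
    return similarity_average(html1, html2) > second_threshold
```

The strict comparison mirrors step 2-5: a similarity average exactly equal to the second preset threshold places the two pages in different categories.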
In this embodiment, any two of the pages to be extracted can in turn be treated as the first web page to be extracted and the second web page to be extracted, and steps 2-1 to 2-5 are repeated until the page category of every page to be extracted has been determined. It should be noted that if web page A and web page B belong to the same page category, and web page A and web page D also belong to the same page category, then web page A, web page B, and web page D all belong to the same page category. Once two or more web pages to be extracted belong to the same page category, any other web page to be extracted whose page category still has to be determined only needs to have its similarity average calculated against one web page in that page category; comparing the resulting similarity average with the second preset threshold determines whether the web page belongs to that page category.
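The assignment procedure above, where each new page is compared with a single page of every existing category, can be sketched as follows. The function `cluster_pages` and the toy `jaccard` similarity are illustrative assumptions; in practice the similarity average from step 2-3 would be passed in as the `similarity` argument.

```python
def cluster_pages(pages, similarity, second_threshold):
    """Assign each page to the first category whose representative it resembles."""
    categories = []  # each category is a list of pages; its first page is the representative
    for page in pages:
        for category in categories:
            if similarity(page, category[0]) > second_threshold:
                category.append(page)
                break
        else:
            # No existing category matched, so the page starts a new one.
            categories.append([page])
    return categories


def jaccard(a, b):
    """Toy stand-in for the similarity average: overlap of two sets of block signatures."""
    return len(a & b) / len(a | b) if a | b else 1.0


pages = [{"menu", "title"}, {"menu", "title"}, {"cart"}, {"menu", "title", "ad"}]
categories = cluster_pages(pages, jaccard, second_threshold=0.5)
```

Because every page is appended to at most one category, each page ends up with exactly one page category, consistent with step S104.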
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by software together with the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, or the part of it that contributes over the prior art, can in essence be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
Embodiment 2
According to an embodiment of the present invention, a device for extracting web page information is also provided for implementing the above method for extracting web page information. The extraction device is mainly used to perform the extraction method provided above, and is described in detail below:
Fig. 2 is a schematic diagram of the device for extracting web page information according to an embodiment of the present invention. As shown in Fig. 2, the device mainly includes an acquiring unit 10, a clustering unit 20, a first extraction unit 30, a second extraction unit 40, a first calculation unit 50, and a third extraction unit 60, wherein:
The acquiring unit 10 is configured to obtain the HyperText Markup Language (HTML) code of a plurality of web pages to be extracted. Specifically, the HTML code of the web pages to be extracted may be obtained simultaneously, or the HTML code of each web page to be extracted may be obtained one by one in sequence.
The clustering unit 20 is configured to cluster the plurality of web pages to be extracted according to the HTML code to obtain a plurality of page categories. That is, the web pages to be extracted are classified according to the HTML code obtained for each of them, and similar web pages among them are grouped into one category. It should be noted that a web page to be extracted can belong to only one page category.
The first extraction unit 30 is configured to extract the object block elements in each belonging category, wherein an object block element is a block element shared by the different web pages to be extracted in the same belonging category. Specifically, there may be one object block element or several. In this embodiment of the present invention, the number of object block elements is determined by the number of block elements shared by the different pages to be extracted in the same belonging category. A shared block element is a block element whose tag name and attribute are identical in the different pages to be extracted of the same belonging category, where the attribute is the class attribute or the id attribute. For example, web page 1, web page 2 and web page 3 belong to belonging category A, and three block elements are contained in every one of web page 1, web page 2 and web page 3, namely div[class="menu"], div[id="title"] and p[class="content"]; belonging category A then has three object block elements.
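The selection of shared block elements described above can be sketched as a set intersection. This is a minimal illustration, not the device's actual implementation: representing a block element as a `(tag, attribute, value)` tuple and the function name `shared_block_elements` are assumptions made here for the example.

```python
# Hypothetical representation: each block element is identified by a
# (tag name, attribute name, attribute value) tuple, e.g. ("div", "class", "menu").

def shared_block_elements(pages):
    """Return the block-element signatures present in every page of one category.

    `pages` is a list of collections of signatures, one collection per page.
    """
    if not pages:
        return set()
    # An element is an object block element only if all pages contain it.
    return set.intersection(*(set(page) for page in pages))


page1 = {("div", "class", "menu"), ("div", "id", "title"),
         ("p", "class", "content"), ("div", "class", "banner")}
page2 = {("div", "class", "menu"), ("div", "id", "title"),
         ("p", "class", "content")}
page3 = {("div", "class", "menu"), ("div", "id", "title"),
         ("p", "class", "content"), ("div", "id", "footer")}

# Three signatures are common to all pages of the category.
object_elements = shared_block_elements([page1, page2, page3])
```

Elements that appear in only some of the pages, such as the hypothetical `banner` and `footer` entries, are dropped by the intersection.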
The second extraction unit 40 is configured to extract the text in an object block element to obtain the text collection of that object block element. Specifically, the same object block element contains multiple texts, and the set of these texts is the text collection of the object block element. If there are multiple object block elements, the text in each object block element is extracted to obtain the text collection of each object block element. Continuing the example above, for the object block element div[id="title"], the resulting text collection is {"title 1", "title 2", "title 3"}.
The first computing unit 50 is configured to calculate the desired value of the text collection, wherein the desired value represents the degree of difference among the texts in the text collection, that is, the degree of difference among the texts in the object block element. The greater the degree of difference, the more the contents of the texts in the object block element differ from one another.
The third extraction unit 60 is configured to extract the text in any text collection whose desired value is greater than a first predetermined threshold, thereby obtaining the web page information. That is, only the text in a text collection whose desired value exceeds the first predetermined threshold is information that needs to be extracted from the web pages to be extracted. Specifically, the first predetermined threshold can be set as required.
In this embodiment of the present invention, obtaining the HTML code of multiple web pages to be extracted makes it possible to divide those pages into belonging categories and then to obtain the block elements jointly contained in the different web pages to be extracted under the same belonging category. The text content in each such block element can then be extracted, and comparing the degree of difference of that text content with a predetermined threshold determines whether the text content is information that needs to be extracted from the web pages to be extracted. This solves the problem of low accuracy of web page information extraction in the prior art and thereby improves the accuracy of web page information extraction.
It should be noted that if there are multiple object block elements, the desired value of the text collection of each object block element must be calculated separately, each calculated desired value is compared with the first predetermined threshold, and the text in every text collection whose desired value is greater than the first predetermined threshold is extracted.
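The per-element comparison with the first predetermined threshold can be sketched as follows. The function name `extract_page_info`, the dictionary layout and the numeric values are assumptions chosen for illustration; the desired values are taken as already computed.

```python
def extract_page_info(text_collections, desired_values, first_threshold):
    """Keep only the text collections whose desired value exceeds the threshold.

    text_collections: maps an object block element to its list of texts.
    desired_values:   maps the same object block element to its desired value.
    """
    return {
        element: texts
        for element, texts in text_collections.items()
        if desired_values[element] > first_threshold  # strictly greater, per the description
    }


# Illustrative data only: the menu text is identical on every page, the titles differ.
collections = {
    'div[class="menu"]': ["Home", "Home", "Home"],
    'div[id="title"]': ["title 1", "title 2", "title 3"],
}
values = {'div[class="menu"]': 0.0, 'div[id="title"]': 1.1}

page_info = extract_page_info(collections, values, first_threshold=0.5)
```

With these assumed values, only the title collection is retained, because the menu text does not vary between pages.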
Specifically, the first computing unit 50 includes a logging module, a first determining module, a computing module and a second determining module, wherein:
The logging module is configured to record the number of occurrences of each distinct text in the text collection. Because the text collection includes multiple texts, some of them may have identical content; in this embodiment of the present invention, the number of occurrences in the text collection is counted once for each text whose content differs from the others.
The first determining module is configured to determine, from the number of occurrences of each distinct text, the total number of occurrences of all texts in the text collection. Specifically, the total number of occurrences of all texts equals the sum of the numbers of occurrences of all distinct texts in the text collection.
The computing module is configured to calculate the frequency of occurrence of each distinct text in the text collection from that text's number of occurrences and the total number of occurrences. For example, if the text collection contains a text A that differs from the other texts in the collection, text A occurs 3 times in the collection, and the total number of occurrences of all texts in the collection is 30, then the frequency of occurrence of text A in the text collection is 1/10.
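The counting performed by the logging module, the first determining module and the computing module can be sketched with a counter. The function name `text_frequencies` is an assumption made here for illustration.

```python
from collections import Counter


def text_frequencies(texts):
    """Map each distinct text to its frequency of occurrence in the collection."""
    counts = Counter(texts)        # occurrences of each distinct text
    total = sum(counts.values())   # total occurrences of all texts
    return {text: count / total for text, count in counts.items()}


# Text "A" occurs 3 times among 30 texts, as in the example above.
collection = ["A"] * 3 + [f"other {i}" for i in range(27)]
frequencies = text_frequencies(collection)
```

For this collection the frequency of text "A" is 3/30, that is 1/10, and the frequencies of all distinct texts sum to 1.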
The second determining module is configured to determine the desired value of the text collection from the frequency of occurrence of each distinct text in the text collection.
If there are multiple object block elements, the desired value of the text collection of each object block element can be calculated by repeatedly invoking the logging module, the first determining module, the computing module and the second determining module.
Specifically, the second determining module includes a calculating submodule, which is configured to calculate the desired value of the text collection according to the formula $E_{Set} = -\sum_{i=1}^{m} p(text_i)\,\log p(text_i)$, wherein $E_{Set}$ is the desired value of the text collection, $m$ is the number of distinct texts contained in the text collection, and $p(text_i)$ is the frequency of occurrence of the $i$-th distinct text in the text collection. In this embodiment of the present invention, the frequency of occurrence of each distinct text is multiplied by the logarithm of that frequency, all the resulting products are summed, and the sum is negated; the result is the desired value of the text collection.
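The calculation described above (multiply each frequency by its logarithm, sum, negate) can be sketched as follows. The description does not state the base of the logarithm, so the natural logarithm is assumed here; the function name `desired_value` is likewise an assumption.

```python
import math
from collections import Counter


def desired_value(texts):
    """Negated sum of p * log(p) over the distinct texts of a text collection.

    Assumption: natural logarithm; the source does not specify the base.
    """
    counts = Counter(texts)
    total = sum(counts.values())
    # Each term -p * log(p) is non-negative because 0 < p <= 1.
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


identical = desired_value(["Home", "Home", "Home"])            # no variation
distinct = desired_value(["title 1", "title 2", "title 3"])    # all different
```

A collection whose texts are all identical yields a desired value of zero, while a collection of three different texts yields log 3 under the natural-logarithm assumption, so text that varies between pages scores higher than repeated boilerplate.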
Preferably, the device for extracting web page information provided by this embodiment of the present invention further includes a recording unit, which is configured to record the category attribute of the extracted text after the text in the text collection whose desired value is greater than the first predetermined threshold has been extracted and the web page information obtained. Specifically, the category attribute may be title, content, and so on. In other words, this embodiment records whether the extracted text content is a title, content, or the like.
In this embodiment of the present invention, recording the category attribute of the extracted text lets the user quickly filter out the required information during subsequent big data analysis, which improves user satisfaction. For example, if the user wants to filter the extracted web page information for information whose content is a title, the user only needs to select title as the category attribute to quickly filter out the web page information that meets the requirement.
Preferably, this embodiment of the present invention also provides a specific way of determining the belonging category of a page to be extracted, which may be performed by an establishing unit, a fourth extraction unit, a second computing unit, a comparing unit and a processing unit included in the device for extracting web page information, wherein:
The establishing unit is configured to establish a first tree structure according to the HTML code of a first web page to be extracted and a second tree structure according to the HTML code of a second web page to be extracted, wherein the first web page to be extracted and the second web page to be extracted are any two of the multiple web pages to be extracted.
The fourth extraction unit is configured to extract the block elements that contain a preset attribute from the first tree structure to obtain first block elements, and to extract the block elements that contain the preset attribute from the second tree structure to obtain second block elements. Specifically, the preset attribute is the class attribute or the id attribute; that is, this unit extracts only the block elements containing a class attribute or an id attribute from the HTML code of the first page to be extracted, and only the block elements containing a class attribute or an id attribute from the HTML code of the second page to be extracted.
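The extraction of block elements carrying a class or id attribute can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the set `BLOCK_TAGS` is a hypothetical choice of which tags count as block elements, the names `BlockElementCollector` and `block_elements` are invented for the example, and the parser's event stream stands in for an explicit tree structure.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: which tags count as block elements is not fixed by the description.
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "table", "section", "article",
              "header", "footer", "h1", "h2", "h3"}


class BlockElementCollector(HTMLParser):
    """Counts block elements that carry a class or id attribute."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in BLOCK_TAGS:
            return
        for name, value in attrs:
            # An element with both class and id is recorded once per attribute.
            if name in ("class", "id") and value:
                self.counts[(tag, name, value)] += 1


def block_elements(html):
    """Return a Counter of (tag, attribute, value) signatures found in the HTML."""
    collector = BlockElementCollector()
    collector.feed(html)
    collector.close()
    return collector.counts


sample = ('<div class="menu"><p class="content">a</p><p class="content">b</p>'
          '<span class="note">c</span><div>plain</div></div>')
first_blocks = block_elements(sample)
```

In the sample, the `span` is skipped because it is not in the assumed block-tag set, and the inner `div` is skipped because it has neither a class nor an id attribute.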
The second computing unit is configured to calculate, from the first block elements and the second block elements, the similarity average value of the first web page to be extracted and the second web page to be extracted. In this embodiment of the present invention, the similarity average value may be calculated according to the formula $V = \frac{1}{2}(S1 + S2)$, wherein $V$ is the similarity average value, $S1$ is the first similarity of the first web page to be extracted and the second web page to be extracted, and $S2$ is the second similarity of the first web page to be extracted and the second web page to be extracted. Specifically, the first similarity $S1$ may be calculated according to the formula $S1 = \frac{\sum_{p=1}^{m} V1_{K_p}}{\sum_{k=1}^{N1} V1_{K0_k}}$, wherein $K_p$ is a block element that is identical in the first web page to be extracted and the second web page to be extracted, $p$ takes the values 1 to $m$ in turn, $m$ is the number of identical block elements, $V1_{K_p}$ is the number of occurrences of the identical block element $K_p$ in the first web page to be extracted, $K0_k$ is a first block element in the first web page to be extracted, $N1$ is the number of first block elements in the first web page to be extracted, and $V1_{K0_k}$ is the number of occurrences of the first block element $K0_k$ in the first web page to be extracted. The second similarity $S2$ may be calculated according to the formula $S2 = \frac{\sum_{p=1}^{m} V2_{K_p}}{\sum_{k=1}^{N2} V2_{K1_k}}$, wherein $V2_{K_p}$ is the number of occurrences of the identical block element $K_p$ in the second web page to be extracted, $K1_k$ is a second block element in the second web page to be extracted, $N2$ is the number of second block elements in the second web page to be extracted, and $V2_{K1_k}$ is the number of occurrences of the second block element $K1_k$ in the second web page to be extracted.
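The similarity average value can be sketched as below. The sketch assumes, as the symbol definitions suggest, that each similarity is the share of a page's block-element occurrences that belong to block elements found in both pages; the function name `similarity_average` and the use of `Counter` objects are choices made here for illustration.

```python
from collections import Counter


def similarity_average(first_blocks, second_blocks):
    """V = (S1 + S2) / 2 for two pages given their block-element occurrence counts.

    Assumed reading: S1 = occurrences in page 1 of shared elements divided by
    all block-element occurrences in page 1; S2 likewise for page 2.
    """
    total1 = sum(first_blocks.values())
    total2 = sum(second_blocks.values())
    if total1 == 0 or total2 == 0:
        return 0.0  # a page without qualifying block elements shares nothing
    shared = first_blocks.keys() & second_blocks.keys()
    s1 = sum(first_blocks[k] for k in shared) / total1
    s2 = sum(second_blocks[k] for k in shared) / total2
    return (s1 + s2) / 2


page_a = Counter({"div.menu": 2, "div#title": 1, "p.content": 1})
page_b = Counter({"div.menu": 1, "div#title": 1, "div.sidebar": 2})
v = similarity_average(page_a, page_b)
```

For these illustrative counts, the shared elements account for 3 of 4 occurrences in the first page and 2 of 4 in the second, so the average is 0.625.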
The comparing unit is configured to compare the similarity average value with a second predetermined threshold. Specifically, the second predetermined threshold can likewise be set as required.
The processing unit is configured to determine that the first web page to be extracted and the second web page to be extracted belong to the same belonging category when the comparison shows that the similarity average value is greater than the second predetermined threshold, or to determine that the first web page to be extracted and the second web page to be extracted belong to different belonging categories when the comparison shows that the similarity average value is less than or equal to the second predetermined threshold.
In this embodiment of the present invention, any two of the multiple web pages to be extracted can be taken as the first web page to be extracted and the second web page to be extracted respectively, and the establishing unit, the fourth extraction unit, the second computing unit, the comparing unit and the processing unit are invoked repeatedly until the belonging category of every page to be extracted has been determined. It should be noted that if web page A and web page B belong to the same belonging category, and web page A and web page D also belong to the same belonging category, then web page A, web page B and web page D all belong to the same belonging category. Once two or more web pages to be extracted belong to the same belonging category, any other web page to be extracted whose belonging category needs to be determined only has to have its similarity average value calculated against one web page to be extracted in that belonging category; comparing the resulting similarity average value with the second predetermined threshold determines whether that web page belongs to that belonging category.
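The category assignment described above, in which a new page is compared with a single page of each existing category, can be sketched as follows. The function name `cluster_pages`, the choice of the first page of a category as its representative, and the `similarity` callback are assumptions made for illustration.

```python
def cluster_pages(pages, similarity, second_threshold):
    """Assign every page to exactly one belonging category.

    pages:      maps a page identifier to whatever `similarity` accepts.
    similarity: hypothetical callable returning the similarity average value
                of two pages.
    Returns a list of categories, each a list of page identifiers whose first
    entry serves as the representative for later comparisons.
    """
    categories = []
    for page_id, page in pages.items():
        for category in categories:
            representative = pages[category[0]]
            if similarity(representative, page) > second_threshold:
                category.append(page_id)  # join the first matching category
                break
        else:
            categories.append([page_id])  # no match: start a new category
    return categories


# Toy similarity for demonstration only: overlap of two sets of signatures.
def overlap(a, b):
    return len(a & b) / len(a | b)


demo_pages = {"A": {1, 2, 3}, "B": {1, 2, 3, 4}, "D": {1, 2}, "X": {9}}
demo_categories = cluster_pages(demo_pages, overlap, second_threshold=0.5)
```

In the toy data, pages B and D are each similar enough to page A to join its category, while page X starts a category of its own. Because each page joins at most one category, the requirement that a page has only one belonging category holds by construction.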
As can be seen from the above description, the present invention solves the problem of low accuracy of web page information extraction in the prior art and achieves the effect of improving the accuracy of web page information extraction.
The above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for a part not described in detail in one embodiment, reference may be made to the related description of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
The above is only the preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.