CN104484451B

CN104484451B - Method and device for extracting web page information

Info

Publication number: CN104484451B
Application number: CN201410830367.6A
Authority: CN
Inventors: 侯明午
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2014-12-25
Filing date: 2014-12-25
Publication date: 2017-12-19
Anticipated expiration: 2034-12-25
Also published as: CN104484451A

Abstract

The invention discloses a method and device for extracting web page information. The method includes: obtaining the HyperText Markup Language (HTML) code of multiple web pages to be extracted; clustering the web pages to be extracted according to the HTML code to obtain multiple belonging categories; extracting the target block elements in each belonging category, where a target block element is a block element shared by the different web pages to be extracted in the same belonging category; extracting the text in the target block element to obtain the text collection of the target block element; calculating the index value of the text collection, where the index value represents the degree of difference among the texts in the text collection; and extracting the text in the text collection whose index value is greater than a first preset threshold to obtain the web page information. The invention solves the prior-art problem of low accuracy in web page information extraction and thereby improves extraction accuracy.

Description

Method and device for extracting web page information

Technical field

The present invention relates to the field of data processing, and in particular to a method and device for extracting web page information.

Background art

Collected web page information is an important data source for big data analysis. There are currently two main approaches to collecting it. One is the rule-based method, which extracts page elements using regular expressions, XPath, or CSS selectors. The other is the statistics-based method, which trains a model by machine learning on manually annotated data and extracts information according to that model.

The rule-based method either analyzes the HTML (HyperText Markup Language) code to find the left and right boundaries of the information to be extracted and then extracts it with regular expressions or other means, or builds a DOM (Document Object Model) tree for the page and uses XPath or CSS selectors to select the page elements that contain the information to be extracted.

Rule-based extraction is accurate but has poor applicability: a given set of rules can often extract information from only one kind of page, and extraction errors occur if the page changes.

The statistics-based method uses machine learning to train on accurate, manually annotated results, producing a trained model that is then used to identify and extract information.

The statistics-based method has good applicability and can be used on many kinds of web pages, but it consumes considerable resources and depends heavily on manual annotation, so the quality of the extracted information is strongly correlated with the quality of the annotation. Its accuracy cannot be fully guaranteed: because a trained model is not tailored to a specific web page, a new page may lead to incomplete or failed extraction.

No effective solution has yet been proposed for the prior-art problem of low accuracy in web page information extraction.

Summary of the invention

The main object of the present invention is to provide a method and device for extracting web page information, so as to solve the prior-art problem of low accuracy in web page information extraction.

To achieve this object, according to one aspect of the embodiments of the present invention, a method for extracting web page information is provided.

The method for extracting web page information according to the present invention includes: obtaining the HyperText Markup Language (HTML) code of multiple web pages to be extracted; clustering the multiple web pages to be extracted according to the HTML code to obtain multiple belonging categories; extracting the target block elements in each belonging category, where a target block element is a block element shared by the different web pages to be extracted in the same belonging category; extracting the text in the target block element to obtain the text collection of the target block element; calculating the index value of the text collection, where the index value represents the degree of difference among the texts in the text collection; and extracting the text in the text collection whose index value is greater than a first preset threshold to obtain the web page information.

Further, calculating the index value of the text collection includes: recording the number of occurrences of each distinct text in the text collection; determining, from the number of occurrences of each distinct text, the total number of occurrences of all texts in the text collection; calculating, from the number of occurrences of each distinct text and the total number of occurrences, the frequency of occurrence of each distinct text in the text collection; and determining the index value of the text collection from the frequency of occurrence of each distinct text in the text collection.

Further, determining the index value of the text collection from the frequency of occurrence of each distinct text in the text collection includes: calculating the index value of the text collection according to the formula $E_{Set} = -\sum_{i=1}^{m} p(text_i)\log p(text_i)$, where $E_{Set}$ is the index value of the text collection, $m$ is the number of distinct texts in the text collection, and $p(text_i)$ is the frequency of occurrence of the $i$-th distinct text in the text collection.

Further, after the text in the text collection whose index value is greater than the first preset threshold is extracted to obtain the web page information, the extraction method also includes: recording the category attribute of the text.

Further, the belonging categories of a first web page to be extracted and a second web page to be extracted, which are any two of the multiple web pages to be extracted, are determined as follows: establishing a first tree structure from the HTML code of the first web page to be extracted, and a second tree structure from the HTML code of the second web page to be extracted; extracting the block elements in the first tree structure that contain a preset attribute to obtain first block elements, and extracting the block elements in the second tree structure that contain the preset attribute to obtain second block elements; calculating, from the first block elements and the second block elements, the average similarity of the first and second web pages to be extracted; comparing the average similarity with a second preset threshold; and, if the average similarity is greater than the second preset threshold, determining that the first and second web pages to be extracted belong to the same belonging category, or, if the average similarity is less than or equal to the second preset threshold, determining that they belong to different belonging categories.

To achieve this object, according to another aspect of the embodiments of the present invention, a device for extracting web page information is provided.

The device for extracting web page information according to the present invention includes: an acquiring unit for obtaining the HyperText Markup Language (HTML) code of multiple web pages to be extracted; a clustering unit for clustering the multiple web pages to be extracted according to the HTML code to obtain multiple belonging categories; a first extraction unit for extracting the target block elements in each belonging category, where a target block element is a block element shared by the different web pages to be extracted in the same belonging category; a second extraction unit for extracting the text in the target block element to obtain the text collection of the target block element; a first computing unit for calculating the index value of the text collection, where the index value represents the degree of difference among the texts in the text collection; and a third extraction unit for extracting the text in the text collection whose index value is greater than a first preset threshold to obtain the web page information.

Further, the first computing unit includes: a recording module for recording the number of occurrences of each distinct text in the text collection; a first determining module for determining, from the number of occurrences of each distinct text, the total number of occurrences of all texts in the text collection; a computing module for calculating, from the number of occurrences of each distinct text and the total number of occurrences, the frequency of occurrence of each distinct text in the text collection; and a second determining module for determining the index value of the text collection from the frequency of occurrence of each distinct text in the text collection.

Further, the second determining module includes: a calculating submodule for calculating the index value of the text collection according to the formula $E_{Set} = -\sum_{i=1}^{m} p(text_i)\log p(text_i)$, where $E_{Set}$ is the index value of the text collection, $m$ is the number of distinct texts in the text collection, and $p(text_i)$ is the frequency of occurrence of the $i$-th distinct text in the text collection.

Further, the extraction device also includes: a recording unit for recording the category attribute of the text after the text in the text collection whose index value is greater than the first preset threshold is extracted to obtain the web page information.

Further, the extraction device also includes: an establishing unit for establishing a first tree structure from the HTML code of a first web page to be extracted and a second tree structure from the HTML code of a second web page to be extracted, where the first and second web pages to be extracted are any two of the multiple web pages to be extracted; a fourth extraction unit for extracting the block elements in the first tree structure that contain a preset attribute to obtain first block elements, and extracting the block elements in the second tree structure that contain the preset attribute to obtain second block elements; a second computing unit for calculating, from the first block elements and the second block elements, the average similarity of the first and second web pages to be extracted; a comparing unit for comparing the average similarity with a second preset threshold; and a processing unit for determining that the first and second web pages to be extracted belong to the same belonging category if the average similarity is greater than the second preset threshold, or that they belong to different belonging categories if the average similarity is less than or equal to the second preset threshold.

According to the embodiments of the invention, the HTML code of multiple web pages to be extracted is obtained; the multiple web pages to be extracted are clustered according to the HTML code to obtain multiple belonging categories; the target block elements in each belonging category are extracted, where a target block element is a block element shared by the different web pages to be extracted in the same belonging category; the text content in the target block element is extracted to obtain the text collection of the target block element; the index value of the text collection is calculated, where the index value represents the degree of difference among the texts in the text collection; and the text in the text collection whose index value is greater than the first preset threshold is extracted to obtain the web page information. Obtaining the HTML code of the multiple web pages to be extracted makes it possible to divide them into belonging categories and then to obtain the block elements that the different web pages to be extracted in the same belonging category have in common. The text content of each such block element can then be extracted, and whether that text content is information to be extracted from the web pages can be determined by comparing the degree of difference of the extracted text content with the preset threshold. This solves the prior-art problem of low accuracy in web page information extraction and thereby improves extraction accuracy.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The schematic embodiments of the invention and their description are used to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of the method for extracting web page information according to an embodiment of the present invention; and

Fig. 2 is a schematic diagram of the device for extracting web page information according to an embodiment of the present invention.

Detailed description

To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the scope of protection of the present invention.

It should be noted that the terms "first", "second", and so on in the description, the claims, and the above drawings are used to distinguish similar objects and do not describe a specific order or sequence. Data so used are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion: for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.

Embodiment 1

According to an embodiment of the present invention, a method embodiment that can be used to implement the device embodiment of this application is provided. It should be noted that the steps shown in the flowchart of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order.

According to an embodiment of the present invention, a method for extracting web page information is provided. Fig. 1 is a flowchart of the method for extracting web page information according to an embodiment of the present invention. As shown in Fig. 1, the method includes steps S102 to S112:

S102: Obtain the HyperText Markup Language (HTML) code of multiple web pages to be extracted. Specifically, the HTML code of the multiple web pages to be extracted may be obtained simultaneously, or the HTML code of each web page to be extracted may be obtained one by one in sequence.

S104: Cluster the multiple web pages to be extracted according to the HTML code to obtain multiple belonging categories. That is, the multiple web pages to be extracted are classified according to the HTML code obtained for each of them, and similar web pages among them are grouped into one category. Note that a web page to be extracted can have only one belonging category.

S106: Extract the target block elements in each belonging category, where a target block element is a block element shared by the different web pages to be extracted in the same belonging category. There may be one target block element or several; in this embodiment, their number is determined by the number of block elements shared by the different web pages to be extracted in the same belonging category. A shared block element is a block element whose tag name and attribute are identical across the different web pages to be extracted in the same belonging category, where the attribute is the class attribute or the id attribute. For example, web page 1, web page 2, and web page 3 belong to belonging category A, and the three pages have three block elements in common: div[class="menu"], div[id="title"], and p[class="content"]. Belonging category A then has three target block elements.
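As a rough illustration of how shared block elements might be found, the Python sketch below represents each block element as a (tag, attribute, value) triple and intersects the triples of all pages in one belonging category. The function name `shared_block_elements`, the triple representation, and the extra page-specific elements in the example are assumptions made for illustration; the text does not prescribe a data structure.

```python
def shared_block_elements(pages):
    """Return the block-element signatures present in every page.

    pages: dict mapping a page name to an iterable of
           (tag, attribute, value) triples (hypothetical representation).
    """
    signature_sets = [set(signatures) for signatures in pages.values()]
    if not signature_sets:
        return set()
    # A shared block element must appear in all pages of the category.
    return set.intersection(*signature_sets)


# Example mirroring belonging category A from the text; the "ad-*" and
# "footer" elements are invented page-specific extras.
category_a = {
    "page1": [("div", "class", "menu"), ("div", "id", "title"),
              ("p", "class", "content"), ("div", "class", "ad-1")],
    "page2": [("div", "class", "menu"), ("div", "id", "title"),
              ("p", "class", "content"), ("div", "class", "ad-2")],
    "page3": [("div", "class", "menu"), ("div", "id", "title"),
              ("p", "class", "content"), ("div", "id", "footer")],
}
targets = shared_block_elements(category_a)
```

Only the three signatures common to all pages survive the intersection, which matches the count of three target block elements in the example above.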

S108: Extract the text in the target block element to obtain the text collection of the target block element. Specifically, the same target block element contains multiple texts, and the set of these texts is the text collection of that target block element. If there are multiple target block elements, the text in each is extracted to obtain a text collection for each. Continuing the example above, for the target block element div[id="title"], the resulting text collection is {"title 1", "title 2", "title 3"}.
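One possible way to gather such a text collection is sketched below with Python's standard `html.parser` module. The class `TargetTextParser` and the function `text_collection` are hypothetical names, and the collection is kept as a list rather than a set so that repeated texts remain countable later; neither choice comes from the text.

```python
from html.parser import HTMLParser


class TargetTextParser(HTMLParser):
    """Collects the text inside elements matching one (tag, attr, value)."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.target = (tag, attr, value)
        self.depth = 0    # > 0 while inside a matching element
        self.buffer = []  # text fragments of the current match
        self.texts = []   # one entry per matching element

    def handle_starttag(self, tag, attrs):
        t_tag, t_attr, t_value = self.target
        if self.depth > 0:
            if tag == t_tag:  # nested element with the same tag name
                self.depth += 1
        elif tag == t_tag and dict(attrs).get(t_attr) == t_value:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.target[0]:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def text_collection(pages_html, tag, attr, value):
    """Gather the texts of one target block element across all pages."""
    texts = []
    for html in pages_html:
        parser = TargetTextParser(tag, attr, value)
        parser.feed(html)
        parser.close()
        texts.extend(parser.texts)
    return texts


pages = [
    '<div class="menu">Home</div><div id="title">title 1</div>',
    '<div class="menu">Home</div><div id="title">title 2</div>',
    '<div class="menu">Home</div><div id="title"><b>title</b> 3</div>',
]
titles = text_collection(pages, "div", "id", "title")
menus = text_collection(pages, "div", "class", "menu")
```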

S110: Calculate the index value of the text collection, where the index value represents the degree of difference among the texts in the text collection. That is, the degree of difference among the texts of the target block element is calculated; the greater it is, the more the text content in that target block element varies.

S112: Extract the text in the text collection whose index value is greater than the first preset threshold to obtain the web page information. That is, only the text in a text collection whose index value is greater than the first preset threshold is information that needs to be extracted from the web pages to be extracted. The first preset threshold can be set as required.

In this embodiment, obtaining the HTML code of the multiple web pages to be extracted makes it possible to divide them into belonging categories and then to obtain the block elements that the different web pages to be extracted in the same belonging category have in common. The text content of each such block element can then be extracted, and whether that text content is information to be extracted from the web pages can be determined by comparing the degree of difference of the extracted text content with the preset threshold. This solves the prior-art problem of low accuracy in web page information extraction and thereby improves extraction accuracy.

It should be noted that if there are multiple target block elements, the index value of each target block element's text collection must be calculated separately, each index value is compared with the first preset threshold, and the text in the text collections whose index value is greater than the first preset threshold is extracted.
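The comparison against the first preset threshold could look like the following Python sketch. It assumes the index value of each text collection has already been computed; the function name `select_text_collections`, the example index values, and the threshold 0.5 are all hypothetical.

```python
def select_text_collections(collections, index_values, threshold):
    """Keep only the text collections whose index value exceeds the threshold.

    collections:  dict mapping a target block element to its list of texts
    index_values: dict mapping the same keys to precomputed index values
    threshold:    the first preset threshold (chosen by the user)
    """
    return {
        element: texts
        for element, texts in collections.items()
        if index_values[element] > threshold
    }


# Illustrative inputs: a menu that is identical on every page and a title
# that differs on every page. The index values are made-up numbers.
collections = {
    'div[class="menu"]': ["Home", "Home", "Home"],
    'div[id="title"]': ["title 1", "title 2", "title 3"],
}
index_values = {'div[class="menu"]': 0.0, 'div[id="title"]': 1.1}
extracted = select_text_collections(collections, index_values, threshold=0.5)
```

Text that repeats unchanged across pages, such as a navigation menu, has a low index value and is dropped, while page-specific text is kept.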

Specifically, the index value of the text collection can be calculated through steps 1-1 to 1-4, as follows:

Step 1-1: Record the number of occurrences of each distinct text in the text collection. Because the text collection contains multiple texts, some of them may have identical content; in this embodiment, the number of occurrences in the text collection is counted once for each text whose content differs from the others.

Step 1-2: Determine the total number of occurrences of all texts in the text collection from the number of occurrences of each distinct text. Specifically, the total equals the sum of the numbers of occurrences of all distinct texts.

Step 1-3: Calculate the frequency of occurrence of each distinct text in the text collection from its number of occurrences and the total number of occurrences. For example, if a text A that differs from the other texts in the collection occurs 3 times and the total number of occurrences of all texts in the collection is 30, then the frequency of occurrence of text A in the collection is 1/10.

Step 1-4: Determine the index value of the text collection from the frequency of occurrence of each distinct text in the text collection.

If there are multiple target block elements, the index value of each target block element's text collection can be calculated by repeating steps 1-1 to 1-4.

Specifically, in this embodiment, determining the index value of the text collection from the frequency of occurrence of each distinct text in the text collection includes: calculating the index value of the text collection according to the formula $E_{Set} = -\sum_{i=1}^{m} p(text_i)\log p(text_i)$, where $E_{Set}$ is the index value of the text collection, $m$ is the number of distinct texts in the text collection, and $p(text_i)$ is the frequency of occurrence of the $i$-th distinct text in the text collection. That is, to calculate $E_{Set}$, the frequency of occurrence of each distinct text is multiplied by the logarithm of that frequency, all the resulting products are summed, and the sum is negated; the result is the index value of the text collection.
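Steps 1-1 to 1-4 can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name `index_value` is hypothetical, and the natural logarithm is an assumption, since the text does not state the base of the logarithm.

```python
import math
from collections import Counter


def index_value(texts):
    """Entropy-style index value of a text collection (a list of strings)."""
    counts = Counter(texts)       # step 1-1: occurrences of each distinct text
    total = sum(counts.values())  # step 1-2: total occurrences of all texts
    if total == 0:
        return 0.0
    value = 0.0
    for count in counts.values():
        p = count / total         # step 1-3: frequency of occurrence
        value -= p * math.log(p)  # step 1-4: negated sum of p * log(p)
    return value


same_everywhere = index_value(["Home", "Home", "Home"])
all_different = index_value(["title 1", "title 2", "title 3"])
```

A collection whose texts are all identical yields 0, and a collection of m equally frequent distinct texts yields log(m), so larger values indicate more varied text.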

Preferably, after the text in the text collection whose index value is greater than the first preset threshold is extracted to obtain the web page information, the method provided by this embodiment also includes recording the category attribute of the text. Specifically, the category attribute may be title, content, and so on. That is, this embodiment records whether the extracted text is a title, content, or the like.

In this embodiment, recording the category attribute of the extracted text lets the user quickly filter out the required information during subsequent big data analysis, which improves user satisfaction. For example, a user who wants only the titles among the extracted web page information need only select the category attribute "title" to quickly filter out the information that meets the requirement.
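A record that carries the category attribute alongside the extracted text might be modeled as in the Python sketch below. The `ExtractedText` dataclass and the `filter_by_category` helper are hypothetical names introduced only for illustration, and how the category attribute is assigned to a text is not specified in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedText:
    text: str
    category: str  # category attribute, e.g. "title" or "content"


def filter_by_category(records, category):
    """Return the extracted texts whose category attribute matches."""
    return [record for record in records if record.category == category]


records = [
    ExtractedText("title 1", "title"),
    ExtractedText("Body of page 1", "content"),
    ExtractedText("title 2", "title"),
]
titles_only = filter_by_category(records, "title")
```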

The embodiment of the present invention also provides a specific way of determining the belonging category of a web page to be extracted. Taking a first web page to be extracted and a second web page to be extracted, which are any two of the multiple web pages to be extracted, as an example, their belonging categories can be determined through steps 2-1 to 2-5:

Step 2-1: Establish a first tree structure from the HTML code of the first web page to be extracted, and a second tree structure from the HTML code of the second web page to be extracted.

Step 2-2: Extract the block elements in the first tree structure that contain a preset attribute to obtain first block elements, and extract the block elements in the second tree structure that contain the preset attribute to obtain second block elements. Specifically, the preset attribute is the class attribute or the id attribute. That is, this step extracts only the block elements that contain a class attribute or an id attribute from the HTML code of the first web page to be extracted, and likewise only the block elements that contain a class attribute or an id attribute from the HTML code of the second web page to be extracted.
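The extraction in step 2-2 could be approximated with Python's standard `html.parser` module, as sketched below. Instead of building an explicit tree structure, this sketch scans start tags directly, which is a simplification. The set `BLOCK_TAGS` is an assumption, because the text does not list which tags count as block elements, and `block_signatures` is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Assumption: the tags treated as block elements are not given in the text.
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "table", "section", "article"}


class BlockSignatureParser(HTMLParser):
    """Records (tag, attribute, value) for block elements with class or id."""

    def __init__(self):
        super().__init__()
        self.signatures = []

    def handle_starttag(self, tag, attrs):
        if tag not in BLOCK_TAGS:
            return
        attr_map = dict(attrs)
        # The preset attribute is the class attribute or the id attribute.
        for name in ("class", "id"):
            value = attr_map.get(name)
            if value:
                self.signatures.append((tag, name, value))


def block_signatures(html):
    parser = BlockSignatureParser()
    parser.feed(html)
    parser.close()
    return parser.signatures


sample = (
    '<div class="menu"><span id="logo">x</span></div>'
    '<p>plain paragraph</p>'
    '<div id="title">title 1</div>'
)
found = block_signatures(sample)
```

In the sample, the span is skipped because it is not in the assumed block tag set, and the plain paragraph is skipped because it has neither a class nor an id attribute.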

Step 2-3: Calculate the average similarity of the first and second web pages to be extracted from the first block elements and the second block elements. In this embodiment, the average similarity can be calculated according to the formula V = 1/2 (S1 + S2), where V is the average similarity, S1 is the first similarity of the first and second web pages to be extracted, and S2 is the second similarity of the first and second web pages to be extracted. Specifically, the first similarity S1 is calculated according to a formula [formula missing in source], where Kp is a block element common to the first and second web pages to be extracted, p takes the values 1 to m in turn, m is the number of common block elements, V_1Kp is the frequency of occurrence of the common block element Kp in the first web page to be extracted, K_0k is a first block element in the first web page to be extracted, N1 is the number of first block elements in the first web page to be extracted, and [symbol missing in source] is the number of occurrences of the first block element K_0k in the first web page to be extracted. The second similarity S2 is calculated according to a formula [formula missing in source], where V_2Kp is the frequency of occurrence of the common block element Kp in the second web page to be extracted, K_1k is a second block element in the second web page to be extracted, N2 is the number of second block elements in the second web page to be extracted, and [symbol missing in source] is the number of occurrences of the second block element K_1k in the second web page to be extracted.

Step 2-4: Compare the average similarity with the second preset threshold. The second preset threshold can also be set as required.

Step 2-5: If the average similarity is greater than the second preset threshold, determine that the first and second web pages to be extracted belong to the same belonging category; if the average similarity is less than or equal to the second preset threshold, determine that they belong to different belonging categories.

In this embodiment, any two of the multiple web pages to be extracted can be treated as the first and second web pages to be extracted, and steps 2-1 to 2-5 are repeated until the belonging category of every web page to be extracted has been determined. It should be noted that if web page A and web page B belong to the same belonging category, and web page A and web page D also belong to the same belonging category, then web pages A, B, and D all belong to the same belonging category. Once two or more web pages to be extracted belong to the same belonging category, any other web page whose belonging category is still to be determined need only have its average similarity calculated against one web page in that belonging category; comparing the resulting average similarity with the second preset threshold determines whether that web page belongs to the belonging category.
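The transitive grouping described above can be sketched with a union-find (disjoint-set) structure, as in the Python example below. Union-find is a technique chosen here for illustration and is not named in the text; `cluster_pages` is a hypothetical name, and the similarity function is passed in as a parameter because the similarity formulas are not reproduced here. The example scores and the threshold 0.5 are invented.

```python
def cluster_pages(pages, similarity, threshold):
    """Group pages so that similar pairs share one belonging category.

    pages:      list of page identifiers
    similarity: function (a, b) -> average similarity (supplied by caller)
    threshold:  the second preset threshold
    """
    parent = {page: page for page in pages}

    def find(page):
        # Follow parent links to the representative, halving the path.
        while parent[page] != page:
            parent[page] = parent[parent[page]]
            page = parent[page]
        return page

    for i, a in enumerate(pages):
        for b in pages[i + 1:]:
            # Pages already known to share a category need no comparison.
            if find(a) != find(b) and similarity(a, b) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for page in pages:
        groups.setdefault(find(page), []).append(page)
    return list(groups.values())


# Invented scores: A is similar to B and to D; every other pair is not.
scores = {frozenset(("A", "B")): 0.9, frozenset(("A", "D")): 0.8}


def example_similarity(a, b):
    return scores.get(frozenset((a, b)), 0.1)


categories = cluster_pages(["A", "B", "C", "D"], example_similarity, 0.5)
```

Because A is similar to both B and D, the three pages end up in one group even though B and D are never found similar directly, while C remains in a group of its own.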

It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions, but those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.

From the description of the above embodiments, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.

Embodiment 2

According to an embodiment of the present invention, a device for extracting web page information is also provided for implementing the above method for extracting web page information. The device is mainly used to perform the extraction method provided above, and is described in detail below:

Fig. 2 is a schematic diagram of the device for extracting web page information according to an embodiment of the present invention. As shown in Fig. 2, the device mainly includes an acquiring unit 10, a clustering unit 20, a first extraction unit 30, a second extraction unit 40, a first computing unit 50, and a third extraction unit 60, where:

The acquiring unit 10 is used to obtain the HyperText Markup Language (HTML) code of multiple web pages to be extracted. Specifically, the HTML code of the multiple web pages to be extracted may be obtained simultaneously, or the HTML code of each web page to be extracted may be obtained one by one in sequence.

The clustering unit 20 is used to cluster the multiple web pages to be extracted according to the HTML code to obtain multiple categories. That is, the web pages are classified according to the HTML code obtained for each of them, and similar web pages are grouped into one category. It should be noted that a web page to be extracted can belong to only one category.

The first extraction unit 30 is used to extract the target block elements in each category, wherein a target block element is a block element shared by the different web pages to be extracted in the same category. There may be one target block element or several; in the embodiment of the present invention, the number of target block elements is determined by the number of block elements shared by the different pages in the same category. A shared block element is a block element whose tag name and attribute are identical across the different pages to be extracted in the same category, where the attribute is the class attribute or the id attribute. For example, web page 1, web page 2, and web page 3 belong to category A, and three block elements are contained in every one of these pages, namely div[class="menu"], div[id="title"], and p[class="content"]. Category A then has three target block elements.
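As an illustration only, the selection of shared block elements can be sketched in Python as a set intersection. The sketch assumes each page has already been reduced to a set of (tag name, attribute name, attribute value) triples; the function name `shared_block_elements` and the sample pages, including the `ad-1` and `ad-2` entries, are hypothetical and not part of the embodiment.

```python
from typing import Iterable, Set, Tuple

# A block element is identified by (tag name, attribute name, attribute value).
BlockKey = Tuple[str, str, str]


def shared_block_elements(pages: Iterable[Set[BlockKey]]) -> Set[BlockKey]:
    """Return the block elements that occur in every page of one category."""
    page_list = list(pages)
    if not page_list:
        return set()
    return set.intersection(*page_list)


# Three hypothetical pages of category A, mirroring the example in the text.
common = {("div", "class", "menu"), ("div", "id", "title"), ("p", "class", "content")}
page1 = common | {("div", "class", "ad-1")}
page2 = common | {("div", "class", "ad-2")}
page3 = set(common)

targets = shared_block_elements([page1, page2, page3])
```

Here `targets` holds the three block elements named in the example, because the page-specific `ad-1` and `ad-2` entries do not occur in all three pages.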

The second extraction unit 40 is used to extract the text in each target block element to obtain the text set of that target block element. Specifically, the same target block element contains multiple texts (one from each page), and the set of these texts is the text set of the target block element. If there are several target block elements, the text in each of them is extracted, giving one text set per target block element. Continuing the example above, for the target block element div[id="title"], the resulting text set is {"Title 1", "Title 2", "Title 3"}.

The first computing unit 50 is used to calculate the expected value of the text set, wherein the expected value represents the degree of difference among the texts in the text set. In other words, it measures how much the text in the target block element varies: the greater the degree of difference, the more the content of the texts in that target block element differs from page to page.

The third extraction unit 60 is used to extract the text in each text set whose expected value is greater than a first preset threshold, thereby obtaining the web page information. That is, only the text in a text set whose expected value exceeds the first preset threshold is the information that needs to be extracted from the web pages. The first preset threshold can be set as required.

Specifically, the first computing unit 50 includes a recording module, a first determining module, a computing module, and a second determining module, wherein:

The recording module is used to record the number of occurrences of each distinct text in the text set. Because the text set contains multiple texts, some of them may have identical content; in the embodiment of the present invention, occurrences are counted only per distinct text, that is, per text whose content differs from that of the others in the text set.

The first determining module is used to determine the total number of occurrences of all texts in the text set from the number of occurrences of each distinct text. Specifically, the total number of occurrences equals the sum of the numbers of occurrences of all distinct texts in the text set.

The computing module is used to calculate the frequency of occurrence of each distinct text in the text set from its number of occurrences and the total number of occurrences. For example, suppose the text set contains a text A that differs from the other texts in the set, text A occurs 3 times in the text set, and the total number of occurrences of all texts in the text set is 30. The frequency of occurrence of text A in the text set is then 3/30 = 1/10.

The second determining module is used to determine the expected value of the text set from the frequency of occurrence of each distinct text in the text set.

If there are several target block elements, the expected value of the text set of each target block element can be calculated by repeatedly invoking the recording module, the first determining module, the computing module, and the second determining module.

Specifically, the second determining module includes a calculating submodule, which is used to calculate the expected value of the text set according to the formula $E_{Set} = -\sum_{i=1}^{m} p(text_i)\,\log p(text_i)$, wherein $E_{Set}$ is the expected value of the text set, $m$ is the number of distinct texts contained in the text set, and $p(text_i)$ is the frequency of occurrence of the $i$-th distinct text in the text set. In the embodiment of the present invention, $E_{Set}$ is calculated by multiplying the frequency of occurrence of each distinct text by the logarithm of that frequency, summing all the resulting products, and negating the sum; the result is the expected value of the text set.
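The calculation of E_Set and the subsequent threshold comparison can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the natural logarithm is an assumption (the text does not state the base), and the names `text_set_entropy`, `select_informative`, the sample text sets, and the threshold 0.5 are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def text_set_entropy(texts: List[str]) -> float:
    """E_Set = -sum(p * log(p)) over the distinct texts of one text set."""
    counts = Counter(texts)        # occurrences of each distinct text
    total = sum(counts.values())   # total occurrences of all texts
    if total == 0:
        return 0.0
    # Natural logarithm is assumed here; the source does not give the base.
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def select_informative(text_sets: Dict[str, List[str]],
                       threshold: float) -> Dict[str, List[str]]:
    """Keep only the text sets whose E_Set exceeds the threshold."""
    return {name: texts for name, texts in text_sets.items()
            if text_set_entropy(texts) > threshold}


# Hypothetical text sets for two target block elements across three pages.
text_sets = {
    "div[class=menu]": ["Home", "Home", "Home"],
    "div[id=title]": ["Title 1", "Title 2", "Title 3"],
}
selected = select_informative(text_sets, threshold=0.5)
```

A block element whose text is the same on every page (such as a navigation menu) yields a value of zero and is filtered out, while a block element whose text differs on every page yields log(3) for three pages and is kept.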

Preferably, the device for extracting web page information provided by the embodiment of the present invention further includes a recording unit, which is used to record the category attribute of the text after the text in each text set whose expected value is greater than the first preset threshold has been extracted and the web page information obtained. The category attribute can be, for example, title or content. That is, the embodiment records whether the extracted text is a title, content, and so on.

Preferably, the embodiment of the present invention further provides a specific way of determining the category of a page to be extracted, which can be performed by an establishing unit, a fourth extraction unit, a second computing unit, a comparing unit, and a processing unit included in the device for extracting web page information, wherein:

The establishing unit is used to build a first tree structure from the HTML code of a first web page to be extracted and a second tree structure from the HTML code of a second web page to be extracted, wherein the first and second web pages to be extracted are any two of the multiple web pages to be extracted.

The fourth extraction unit is used to extract the block elements in the first tree structure that contain a preset attribute, obtaining the first block elements, and to extract the block elements in the second tree structure that contain the preset attribute, obtaining the second block elements. The preset attribute is the class attribute or the id attribute. That is, this unit extracts from the HTML code of the first page to be extracted only the block elements that carry a class or id attribute, and likewise extracts from the HTML code of the second page to be extracted only the block elements that carry a class or id attribute.
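The extraction of block elements carrying a class or id attribute can be sketched with Python's standard `html.parser` module. Note the simplifications, which are assumptions of this sketch rather than statements about the embodiment: it uses a streaming parser instead of building an explicit tree structure, it treats any tag with a class or id attribute as a block element, and the names `BlockCollector` and `collect_blocks` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Count start tags that carry a class or id attribute."""

    def __init__(self) -> None:
        super().__init__()
        # Maps (tag name, attribute name, attribute value) to its occurrence count.
        self.blocks: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("class", "id") and value:
                self.blocks[(tag, name, value)] += 1


def collect_blocks(html: str) -> Counter:
    """Return the occurrence count of each class/id block element in one page."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


sample = (
    '<div class="menu">'
    '<p class="content">a</p><p class="content">b</p>'
    '<span>x</span><div id="title">t</div>'
    '</div>'
)
blocks = collect_blocks(sample)
```

The `span` element is ignored because it has neither attribute, and `p[class="content"]` is counted twice, which gives the per-page occurrence counts that a similarity calculation can use.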

The second computing unit is used to calculate the similarity average value of the first web page to be extracted and the second web page to be extracted from the first block elements and the second block elements. In the embodiment of the present invention, the similarity average value can be calculated according to the formula V = 1/2 (S1 + S2), wherein V is the similarity average value, S1 is the first similarity of the first and second web pages to be extracted, and S2 is the second similarity of the first and second web pages to be extracted. The first similarity S1 is calculated according to a formula in which Kp is a block element that is identical in the first and second web pages to be extracted, p runs from 1 to m, m is the number of identical block elements, V_1Kp is the frequency of occurrence of the identical block element Kp in the first web page to be extracted, K_0k is a first block element in the first web page to be extracted, and N1 is the number of first block elements in the first web page to be extracted; the formula also uses the frequency of occurrence of each first block element K_0k in the first web page to be extracted. The second similarity S2 is calculated according to a corresponding formula in which V_2Kp is the frequency of occurrence of the identical block element Kp in the second web page to be extracted, K_1k is a second block element in the second web page to be extracted, and N2 is the number of second block elements in the second web page to be extracted; the formula also uses the frequency of occurrence of each second block element K_1k in the second web page to be extracted.
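The formulas for S1 and S2 are not reproduced in this text, so the following Python sketch rests on an explicit assumption: S1 is taken to be the occurrences of the shared block elements in the first page divided by the occurrences of all block elements in the first page, and S2 is defined the same way for the second page. Only V = 1/2 (S1 + S2) comes from the description; the ratio form, the use of raw occurrence counts, and the name `similarity_average` are hypothetical.

```python
from collections import Counter


def similarity_average(blocks1: Counter, blocks2: Counter) -> float:
    """V = 1/2 (S1 + S2) under the assumed ratio form of S1 and S2."""
    total1 = sum(blocks1.values())
    total2 = sum(blocks2.values())
    if total1 == 0 or total2 == 0:
        return 0.0
    shared = blocks1.keys() & blocks2.keys()        # the identical block elements Kp
    s1 = sum(blocks1[k] for k in shared) / total1   # assumed form of S1
    s2 = sum(blocks2[k] for k in shared) / total2   # assumed form of S2
    return 0.5 * (s1 + s2)


# Hypothetical occurrence counts of class/id block elements in two pages.
page_a = Counter({("div", "class", "menu"): 1,
                  ("p", "class", "content"): 2,
                  ("div", "id", "ad"): 1})
page_b = Counter({("div", "class", "menu"): 1,
                  ("p", "class", "content"): 3})

v = similarity_average(page_a, page_b)
```

With these counts, S1 = 3/4 and S2 = 4/4 under the assumed form, so V = 0.875; averaging the two directions keeps the measure symmetric even when one page has many elements the other lacks.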

The comparing unit is used to compare the similarity average value with a second preset threshold. The second preset threshold can likewise be set as required.

The processing unit is used to determine that the first web page to be extracted and the second web page to be extracted belong to the same category when the comparison shows that the similarity average value is greater than the second preset threshold, and to determine that the first web page to be extracted and the second web page to be extracted belong to different categories when the comparison shows that the similarity average value is less than or equal to the second preset threshold.

In the embodiment of the present invention, any two of the multiple web pages to be extracted can be treated as the first web page to be extracted and the second web page to be extracted respectively, and the establishing unit, the fourth extraction unit, the second computing unit, the comparing unit, and the processing unit are invoked repeatedly until the category of every page to be extracted has been determined. It should be noted that if web page A and web page B belong to the same category, and web page A and web page D also belong to the same category, then web page A, web page B, and web page D all belong to the same category. Once two or more web pages to be extracted have been assigned to the same category, any other web page whose category still needs to be determined only has to be compared with one web page in that category: the similarity average value of the two pages is calculated and compared with the second preset threshold to determine whether the page belongs to that category.
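The transitive grouping described above (A with B, and A with D, implies A, B, and D together) can be sketched with a union-find structure. This is an illustrative sketch under stated assumptions: it compares every pair of pages for simplicity instead of comparing against one representative per category, and the names `cluster_pages` and `jaccard`, the sample data, and the threshold 0.4 are hypothetical. The `jaccard` function is only a stand-in for the similarity average value.

```python
from typing import Callable, Dict, List, Sequence, Set, TypeVar

T = TypeVar("T")


def cluster_pages(pages: Sequence[T],
                  similarity: Callable[[T, T], float],
                  threshold: float) -> List[List[int]]:
    """Group page indices; pages linked by a chain of similar pairs share a group."""
    parent = list(range(len(pages)))

    def find(i: int) -> int:
        # Follow parent links to the group representative, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            if similarity(pages[i], pages[j]) > threshold:
                parent[find(j)] = find(i)   # merge the two groups

    groups: Dict[int, List[int]] = {}
    for i in range(len(pages)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Stand-in similarity measure, used here only to exercise the sketch."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Four hypothetical pages, each reduced to a set of block-element identifiers.
pages = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {9}]
categories = cluster_pages(pages, jaccard, threshold=0.4)
```

Pages 0 and 2 are not similar enough to each other directly, yet both are similar to page 1, so all three end up in one group while page 3 forms its own.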

As can be seen from the above description, the present invention solves the problem of low accuracy of web page information extraction in the prior art and achieves the effect of improving the accuracy of web page information extraction.

The above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related description of the other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed client may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

The above is only the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for extracting web page information, characterized by comprising:

obtaining the HyperText Markup Language (HTML) code of multiple web pages to be extracted;

clustering the multiple web pages to be extracted according to the HTML code to obtain multiple categories;

extracting the target block elements in each category, wherein a target block element is a block element shared by the different web pages to be extracted in the same category;

extracting the text in the target block element to obtain the text set of the target block element;

calculating the expected value of the text set, wherein the expected value represents the degree of difference among the texts in the text set; and

extracting the text in the text set whose expected value is greater than a first preset threshold to obtain the web page information,

wherein the categories of a first web page to be extracted and a second web page to be extracted are determined in the following manner, the first web page to be extracted and the second web page to be extracted being any two of the multiple web pages to be extracted:

building a first tree structure from the HTML code of the first web page to be extracted, and building a second tree structure from the HTML code of the second web page to be extracted;

extracting the block elements in the first tree structure that contain a preset attribute to obtain first block elements, and extracting the block elements in the second tree structure that contain the preset attribute to obtain second block elements, wherein the preset attribute is the class attribute or the id attribute;

calculating the similarity average value of the first web page to be extracted and the second web page to be extracted from the first block elements and the second block elements;

comparing the similarity average value with a second preset threshold; and

determining that the first web page to be extracted and the second web page to be extracted belong to the same category when the comparison shows that the similarity average value is greater than the second preset threshold, or determining that the first web page to be extracted and the second web page to be extracted belong to different categories when the comparison shows that the similarity average value is less than or equal to the second preset threshold,

wherein the similarity average value of the first web page to be extracted and the second web page to be extracted is calculated according to the formula V = 1/2 (S1 + S2), wherein V is the similarity average value, S1 is the first similarity of the first and second web pages to be extracted, and S2 is the second similarity of the first and second web pages to be extracted; the first similarity S1 is calculated according to a formula in which Kp is a block element that is identical in the first and second web pages to be extracted, p runs from 1 to m, m is the number of identical block elements, V_1Kp is the frequency of occurrence of the identical block element Kp in the first web page to be extracted, K_0k is a first block element in the first web page to be extracted, and N1 is the number of first block elements in the first web page to be extracted, the formula also using the frequency of occurrence of each first block element K_0k in the first web page to be extracted; and the second similarity S2 is calculated according to a formula in which V_2Kp is the frequency of occurrence of the identical block element Kp in the second web page to be extracted, K_1k is a second block element in the second web page to be extracted, and N2 is the number of second block elements in the second web page to be extracted, the formula also using the frequency of occurrence of each second block element K_1k in the second web page to be extracted.
2. The extraction method according to claim 1, characterized in that calculating the expected value of the text set comprises:

recording the number of occurrences of each distinct text in the text set;

determining the total number of occurrences of all texts in the text set from the number of occurrences of each distinct text;

calculating the frequency of occurrence of each distinct text in the text set from the number of occurrences of each distinct text and the total number of occurrences; and

determining the expected value of the text set from the frequency of occurrence of each distinct text in the text set.
3. The extraction method according to claim 2, characterized in that determining the expected value of the text set from the frequency of occurrence of each distinct text in the text set comprises:

calculating the expected value of the text set according to the formula $E_{Set} = -\sum_{i=1}^{m} p(text_i)\,\log p(text_i)$, wherein $E_{Set}$ is the expected value of the text set, $m$ is the number of distinct texts contained in the text set, and $p(text_i)$ is the frequency of occurrence of the $i$-th distinct text in the text set.
4. The extraction method according to claim 1, characterized in that, after extracting the text in the text set whose expected value is greater than the first preset threshold and obtaining the web page information, the extraction method further comprises:

recording the category attribute of the text.
5. A device for extracting web page information, characterized by comprising:

an acquiring unit, for obtaining the HyperText Markup Language (HTML) code of multiple web pages to be extracted;

a clustering unit, for clustering the multiple web pages to be extracted according to the HTML code to obtain multiple categories;

a first extraction unit, for extracting the target block elements in each category, wherein a target block element is a block element shared by the different web pages to be extracted in the same category;

a second extraction unit, for extracting the text in the target block element to obtain the text set of the target block element;

a first computing unit, for calculating the expected value of the text set, wherein the expected value represents the degree of difference among the texts in the text set; and

a third extraction unit, for extracting the text in the text set whose expected value is greater than a first preset threshold to obtain the web page information,

wherein the extraction device further comprises:

an establishing unit, for building a first tree structure from the HTML code of a first web page to be extracted and building a second tree structure from the HTML code of a second web page to be extracted, wherein the first web page to be extracted and the second web page to be extracted are any two of the multiple web pages to be extracted;

a fourth extraction unit, for extracting the block elements in the first tree structure that contain a preset attribute to obtain first block elements, and extracting the block elements in the second tree structure that contain the preset attribute to obtain second block elements, wherein the preset attribute is the class attribute or the id attribute;

a second computing unit, for calculating the similarity average value of the first web page to be extracted and the second web page to be extracted from the first block elements and the second block elements;

a comparing unit, for comparing the similarity average value with a second preset threshold; and

a processing unit, for determining that the first web page to be extracted and the second web page to be extracted belong to the same category when the comparison shows that the similarity average value is greater than the second preset threshold, or determining that the first web page to be extracted and the second web page to be extracted belong to different categories when the comparison shows that the similarity average value is less than or equal to the second preset threshold,

wherein the similarity average value of the first web page to be extracted and the second web page to be extracted is calculated according to the formula V = 1/2 (S1 + S2), wherein V is the similarity average value, S1 is the first similarity of the first and second web pages to be extracted, and S2 is the second similarity of the first and second web pages to be extracted; the first similarity S1 is calculated according to a formula in which Kp is a block element that is identical in the first and second web pages to be extracted, p runs from 1 to m, m is the number of identical block elements, V_1Kp is the frequency of occurrence of the identical block element Kp in the first web page to be extracted, K_0k is a first block element in the first web page to be extracted, and N1 is the number of first block elements in the first web page to be extracted, the formula also using the frequency of occurrence of each first block element K_0k in the first web page to be extracted; and the second similarity S2 is calculated according to a formula in which V_2Kp is the frequency of occurrence of the identical block element Kp in the second web page to be extracted, K_1k is a second block element in the second web page to be extracted, and N2 is the number of second block elements in the second web page to be extracted, the formula also using the frequency of occurrence of each second block element K_1k in the second web page to be extracted.
6. The extraction device according to claim 5, characterized in that the first computing unit comprises:

a recording module, for recording the number of occurrences of each distinct text in the text set;

a first determining module, for determining the total number of occurrences of all texts in the text set from the number of occurrences of each distinct text;

a computing module, for calculating the frequency of occurrence of each distinct text in the text set from the number of occurrences of each distinct text and the total number of occurrences; and

a second determining module, for determining the expected value of the text set from the frequency of occurrence of each distinct text in the text set.
7. The extraction device according to claim 6, characterized in that the second determining module comprises:

a calculating submodule, for calculating the expected value of the text set according to the formula $E_{Set} = -\sum_{i=1}^{m} p(text_i)\,\log p(text_i)$, wherein $E_{Set}$ is the expected value of the text set, $m$ is the number of distinct texts contained in the text set, and $p(text_i)$ is the frequency of occurrence of the $i$-th distinct text in the text set.
8. The extraction device according to claim 5, characterized in that the extraction device further comprises:

a recording unit, for recording the category attribute of the text after the text in the text set whose expected value is greater than the first preset threshold has been extracted and the web page information obtained.