CN104484451B - Method and device for extracting web page information - Google Patents

Method and device for extracting web page information

Info

Publication number
CN104484451B
Authority
CN
China
Prior art keywords
extracted
text
webpage
block element
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410830367.6A
Other languages
Chinese (zh)
Other versions
CN104484451A (en)
Inventor
侯明午
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201410830367.6A
Publication of CN104484451A
Application granted
Publication of CN104484451B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and device for extracting web page information. The extraction method includes: obtaining the HyperText Markup Language (HTML) code of multiple web pages to be extracted; clustering the web pages according to their HTML code to obtain multiple page categories; extracting the target block elements in each page category, where a target block element is a block element shared by different web pages to be extracted in the same page category; extracting the text in each target block element to obtain the text set of that target block element; calculating an index value for the text set, where the index value represents the degree of difference among the texts in the text set; and extracting the text in the text sets whose index value is greater than a first preset threshold to obtain the web page information. The invention addresses the low accuracy of web page information extraction in the prior art and thereby improves extraction accuracy.

Description

Method and device for extracting web page information
Technical field
The present invention relates to the field of data processing, and in particular to a method and device for extracting web page information.
Background
Collected web page information is an important data source for big data analysis. There are currently two main approaches to collecting it. One is the rule-based method, which extracts page elements using regular expressions, XPath, or CSS selectors. The other is the statistics-based method, which trains a model on manually labeled data through machine learning and extracts information according to the model.
The rule-based method analyzes the HTML (HyperText Markup Language) code to find the left and right boundaries of the information to be extracted and then extracts the information with regular expressions or other means. Alternatively, it builds a DOM (Document Object Model) tree for the page, selects page elements with XPath or CSS selectors, and thereby selects the elements that contain the information to be extracted.
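As a minimal sketch of the rule-based approach described above, the following Python snippet extracts a title with a boundary-based regular expression and a paragraph with an XPath-style query. The sample page, its `id` and `class` values, and all variable names are hypothetical and serve only as illustration; `xml.etree.ElementTree` supports a limited XPath subset and requires well-formed markup.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample page (not taken from the patent).
page = ('<html><body><div id="title">Hello</div>'
        '<p class="content">World</p></body></html>')

# Rule 1: a regular expression anchored on the left and right boundaries.
TITLE_RULE = r'<div id="title">(.*?)</div>'
match = re.search(TITLE_RULE, page)
title_by_regex = match.group(1) if match else None

# Rule 2: build a tree and select an element with an XPath-style expression.
root = ET.fromstring(page)
node = root.find(".//p[@class='content']")
content_by_xpath = node.text if node is not None else None

# The same rule stops matching as soon as the page markup changes.
changed_page = page.replace('id="title"', 'id="headline"')
broken_match = re.search(TITLE_RULE, changed_page)
```

The last two lines illustrate the weakness noted in the background section: a rule written for one page layout silently fails when the layout changes.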
The rule-based method extracts accurately but has poor applicability: a set of rules can usually extract information from only one kind of page, and extraction errors result if the page changes.
The statistics-based method trains a model by machine learning on accurate, manually labeled results, and then identifies and extracts information with the trained model.
The statistics-based method has good applicability and can be used on a wide variety of web pages, but it consumes substantial resources and depends heavily on manual labeling, so the quality of the extracted information is strongly correlated with the quality of the labels. Its accuracy cannot be fully guaranteed: because a trained model is not tailored to a specific web page, a new page may lead to incomplete or failed extraction.
No effective solution has yet been proposed for the low accuracy of web page information extraction in the prior art.
Summary of the invention
The main object of the present invention is to provide a method and device for extracting web page information, so as to solve the problem of low web page information extraction accuracy in the prior art.
To achieve this object, according to one aspect of the embodiments of the present invention, a method for extracting web page information is provided.
The method for extracting web page information according to the present invention includes: obtaining the HyperText Markup Language (HTML) code of multiple web pages to be extracted; clustering the multiple web pages to be extracted according to the HTML code to obtain multiple page categories; extracting the target block elements in each page category, where a target block element is a block element shared by different web pages to be extracted in the same page category; extracting the text in the target block element to obtain the text set of the target block element; calculating the index value of the text set, where the index value represents the degree of difference among the texts in the text set; and extracting the text in the text sets whose index value is greater than a first preset threshold to obtain the web page information.
Further, calculating the index value of the text set includes: recording the number of occurrences of each distinct text in the text set; determining, from the number of occurrences of each distinct text, the total number of occurrences of all texts in the text set; calculating, from the number of occurrences of each distinct text and the total number of occurrences, the frequency of occurrence of each distinct text in the text set; and determining the index value of the text set from the frequency of occurrence of each distinct text in the text set.
Further, determining the index value of the text set from the frequency of occurrence of each distinct text in the text set includes: calculating the index value of the text set according to the formula $E_{Set} = -\sum_{i=1}^{m} p(text_i)\log p(text_i)$, where $E_{Set}$ is the index value of the text set, $m$ is the number of distinct texts in the text set, and $p(text_i)$ is the frequency of occurrence of the $i$-th distinct text in the text set.
Further, after extracting the text in the text sets whose index value is greater than the first preset threshold to obtain the web page information, the extraction method also includes: recording the category attribute of the text.
Further, the page categories of a first web page to be extracted and a second web page to be extracted are determined as follows, where the first and second web pages to be extracted are any two of the multiple web pages to be extracted: building a first tree structure from the HTML code of the first web page to be extracted, and a second tree structure from the HTML code of the second web page to be extracted; extracting the block elements that contain a preset attribute from the first tree structure to obtain first block elements, and extracting the block elements that contain the preset attribute from the second tree structure to obtain second block elements; calculating, from the first block elements and the second block elements, the average similarity of the first and second web pages to be extracted; comparing the average similarity with a second preset threshold; and, if the average similarity is greater than the second preset threshold, determining that the first and second web pages to be extracted belong to the same page category, or, if the average similarity is less than or equal to the second preset threshold, determining that they belong to different page categories.
To achieve the above object, according to another aspect of the embodiments of the present invention, a device for extracting web page information is provided.
The device for extracting web page information according to the present invention includes: an acquiring unit, configured to obtain the HyperText Markup Language (HTML) code of multiple web pages to be extracted; a clustering unit, configured to cluster the multiple web pages to be extracted according to the HTML code to obtain multiple page categories; a first extraction unit, configured to extract the target block elements in each page category, where a target block element is a block element shared by different web pages to be extracted in the same page category; a second extraction unit, configured to extract the text in the target block element to obtain the text set of the target block element; a first computing unit, configured to calculate the index value of the text set, where the index value represents the degree of difference among the texts in the text set; and a third extraction unit, configured to extract the text in the text sets whose index value is greater than a first preset threshold to obtain the web page information.
Further, the first computing unit includes: a recording module, configured to record the number of occurrences of each distinct text in the text set; a first determining module, configured to determine, from the number of occurrences of each distinct text, the total number of occurrences of all texts in the text set; a computing module, configured to calculate, from the number of occurrences of each distinct text and the total number of occurrences, the frequency of occurrence of each distinct text in the text set; and a second determining module, configured to determine the index value of the text set from the frequency of occurrence of each distinct text in the text set.
Further, the second determining module includes: a calculating submodule, configured to calculate the index value of the text set according to the formula $E_{Set} = -\sum_{i=1}^{m} p(text_i)\log p(text_i)$, where $E_{Set}$ is the index value of the text set, $m$ is the number of distinct texts in the text set, and $p(text_i)$ is the frequency of occurrence of the $i$-th distinct text in the text set.
Further, the extraction device also includes: a recording unit, configured to record the category attribute of the text after the text in the text sets whose index value is greater than the first preset threshold has been extracted to obtain the web page information.
Further, the extraction device also includes: an establishing unit, configured to build a first tree structure from the HTML code of a first web page to be extracted and a second tree structure from the HTML code of a second web page to be extracted, where the first and second web pages to be extracted are any two of the multiple web pages to be extracted; a fourth extraction unit, configured to extract the block elements that contain a preset attribute from the first tree structure to obtain first block elements, and to extract the block elements that contain the preset attribute from the second tree structure to obtain second block elements; a second computing unit, configured to calculate, from the first block elements and the second block elements, the average similarity of the first and second web pages to be extracted; a comparing unit, configured to compare the average similarity with a second preset threshold; and a processing unit, configured to determine that the first and second web pages to be extracted belong to the same page category if the average similarity is greater than the second preset threshold, or that they belong to different page categories if the average similarity is less than or equal to the second preset threshold.
According to the embodiments of the present invention, the HTML code of multiple web pages to be extracted is obtained; the web pages are clustered according to the HTML code to obtain multiple page categories; the target block elements in each page category are extracted, where a target block element is a block element shared by different web pages to be extracted in the same page category; the text content in the target block element is extracted to obtain the text set of the target block element; the index value of the text set is calculated, where the index value represents the degree of difference among the texts in the text set; and the text in the text sets whose index value is greater than the first preset threshold is extracted to obtain the web page information. Obtaining the HTML code of multiple web pages to be extracted makes it possible to divide those pages into page categories and then to obtain the block elements that different pages in the same page category have in common. The text content of each such shared block element can then be extracted, and comparing the degree of difference of that text content with a preset threshold determines whether the text content is information that needs to be extracted from the web pages. This solves the problem of low web page information extraction accuracy in the prior art and thereby improves extraction accuracy.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a method for extracting web page information according to an embodiment of the present invention; and
Fig. 2 is a schematic diagram of a device for extracting web page information according to an embodiment of the present invention.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", and so on in the description, the claims, and the above drawings are used to distinguish similar objects and do not describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have", and any variations of them, are intended to cover non-exclusive inclusion: for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, a method embodiment is provided that can be used to implement the device embodiment of this application. It should be noted that the steps shown in the flowchart of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described can be executed in an order different from the one given here.
According to an embodiment of the present invention, a method for extracting web page information is provided. Fig. 1 is a flowchart of the method for extracting web page information according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps S102 to S112:
S102: Obtain the HyperText Markup Language (HTML) code of multiple web pages to be extracted. Specifically, the HTML code of the multiple web pages can be obtained simultaneously, or the HTML code of each web page can be obtained one by one in sequence.
S104: Cluster the multiple web pages to be extracted according to the HTML code to obtain multiple page categories. That is, the web pages are classified according to the HTML code obtained for each of them, and similar web pages among them are assigned to one page category. Note that a web page to be extracted can have only one page category.
S106: Extract the target block elements in each page category, where a target block element is a block element shared by different web pages to be extracted in the same page category. There may be one target block element or several. In this embodiment, the number of target block elements is determined by the number of block elements shared by the different pages to be extracted in the same page category. A shared block element is a block element whose tag name and attribute are identical across the different pages to be extracted in the same page category; the attribute here is the class attribute or the id attribute. For example, if web page 1, web page 2, and web page 3 belong to page category A, and the three block elements common to all of them are div[class="menu"], div[id="title"], and p[class="content"], then page category A has three target block elements.
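Step S106 can be sketched in Python as an intersection of per-page sets of block-element signatures. This is only an illustration under stated assumptions: the patent does not list which tags count as block elements, so `BLOCK_TAGS` is a hypothetical choice, a multi-valued class string is treated as a single value, and the three sample pages are invented to mirror the div/p example above.

```python
from html.parser import HTMLParser

# Assumption: which tags count as "block elements" is not specified in the source.
BLOCK_TAGS = {"div", "p"}

class BlockSignatureParser(HTMLParser):
    """Collects (tag, attribute name, attribute value) for block tags
    that carry a class or id attribute."""

    def __init__(self):
        super().__init__()
        self.signatures = set()

    def handle_starttag(self, tag, attrs):
        if tag not in BLOCK_TAGS:
            return
        for name, value in attrs:
            if name in ("class", "id") and value:
                self.signatures.add((tag, name, value))

def block_signatures(html):
    parser = BlockSignatureParser()
    parser.feed(html)
    parser.close()
    return parser.signatures

def target_block_elements(pages):
    """Signatures present in every page of one page category."""
    sets = [block_signatures(html) for html in pages]
    return set.intersection(*sets) if sets else set()

# Hypothetical pages belonging to the same page category.
pages = [
    '<div class="menu">Home</div><div id="title">Title 1</div>'
    '<p class="content">A</p><div id="ad1">x</div>',
    '<div class="menu">Home</div><div id="title">Title 2</div>'
    '<p class="content">B</p>',
    '<div class="menu">Home</div><div id="title">Title 3</div>'
    '<p class="content">C</p><p class="note">n</p>',
]
shared = target_block_elements(pages)
```

Elements that appear on only some pages, such as the hypothetical `div[id="ad1"]`, drop out of the intersection.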
S108: Extract the text in the target block element to obtain the text set of the target block element. Specifically, the same target block element contains multiple texts, and the set of those texts is the text set of the target block element. If there are multiple target block elements, the text in each one is extracted to obtain the text set of each. Continuing the example above, for the target block element div[id="title"], the resulting text set is {"title 1", "title 2", "title 3"}.
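A minimal sketch of step S108, assuming well-formed markup in which every target element has an explicit closing tag. The class and function names are hypothetical. Texts are kept in a list and not deduplicated, because the later steps count how often each text occurs.

```python
from html.parser import HTMLParser

class BlockTextParser(HTMLParser):
    """Collects the text inside elements matching one tag/attribute/value."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.target = (tag, attr, value)
        self.depth = 0    # same-tag nesting depth inside a match (0 = outside)
        self.parts = []   # text fragments of the element being read
        self.texts = []   # one entry per matched element

    def handle_starttag(self, tag, attrs):
        t, a, v = self.target
        if self.depth:
            if tag == t:
                self.depth += 1
        elif tag == t and dict(attrs).get(a) == v:
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if self.depth and tag == self.target[0]:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append("".join(self.parts).strip())

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

def text_set(pages, tag, attr, value):
    """Texts of one target block element across all pages of a page category."""
    texts = []
    for html in pages:
        parser = BlockTextParser(tag, attr, value)
        parser.feed(html)
        parser.close()
        texts.extend(parser.texts)
    return texts

# Hypothetical pages of one page category.
pages = [
    '<div class="menu">Home</div><div id="title">Title 1</div>',
    '<div class="menu">Home</div><div id="title">Title 2</div>',
    '<div class="menu">Home</div><div id="title">Title 3</div>',
]
titles = text_set(pages, "div", "id", "title")
menus = text_set(pages, "div", "class", "menu")
```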
S110: Calculate the index value of the text set, where the index value represents the degree of difference among the texts in the text set. That is, the degree of difference of the text in the target block element is calculated; the greater the degree of difference, the more the content of the texts in the target block element differs.
S112: Extract the text in the text sets whose index value is greater than the first preset threshold to obtain the web page information. That is, only the text in a text set whose index value is greater than the first preset threshold is information that needs to be extracted from the web pages to be extracted. The first preset threshold can be set as required.
In this embodiment, obtaining the HTML code of multiple web pages to be extracted makes it possible to divide those pages into page categories and then to obtain the block elements that different pages in the same page category have in common. The text content of each such shared block element can then be extracted, and comparing the degree of difference of that text content with a preset threshold determines whether the text content is information that needs to be extracted from the web pages. This solves the problem of low web page information extraction accuracy in the prior art and thereby improves extraction accuracy.
Note that if there are multiple target block elements, the index value of the text set of each target block element must be calculated separately, each calculated index value must be compared with the first preset threshold, and the text in the text sets whose index value is greater than the first preset threshold is extracted.
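The per-block comparison in steps S110 to S112 might look like the sketch below. It uses the entropy-style index value that the description defines in steps 1-1 to 1-4. The natural logarithm, the threshold value 0.5, and the sample text sets are assumptions made for illustration; the source specifies none of them.

```python
import math
from collections import Counter

def index_value(texts):
    """Entropy-style index value of one text set (log base is an assumption)."""
    counts = Counter(texts)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def select_text_sets(text_sets, threshold):
    """Keep only the text sets whose index value exceeds the threshold."""
    return {block: texts for block, texts in text_sets.items()
            if index_value(texts) > threshold}

# Hypothetical text sets, one per target block element.
text_sets = {
    'div[class="menu"]': ["Home", "Home", "Home"],           # identical on every page
    'div[id="title"]': ["Title 1", "Title 2", "Title 3"],    # differs per page
    'p[class="content"]': ["Body A", "Body B", "Body C"],    # differs per page
}
FIRST_PRESET_THRESHOLD = 0.5  # hypothetical value; the patent leaves it configurable
extracted = select_text_sets(text_sets, FIRST_PRESET_THRESHOLD)
```

The menu block has identical text on every page, so its index value is zero and it is discarded as page chrome, while the title and content blocks vary across pages and are kept.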
Specifically, the index value of a text set can be calculated through steps 1-1 to 1-4, as follows:
Step 1-1: Record the number of occurrences of each distinct text in the text set. Because a text set contains multiple texts, some of them may have identical content; in this embodiment, occurrences are counted once per distinct text in the text set.
Step 1-2: Determine the total number of occurrences of all texts in the text set from the number of occurrences of each distinct text. Specifically, the total number of occurrences of all texts in the text set equals the sum of the numbers of occurrences of all distinct texts.
Step 1-3: Calculate the frequency of occurrence of each distinct text in the text set from its number of occurrences and the total number of occurrences. For example, if text A, which differs from the other texts in the text set, occurs 3 times in the text set and the total number of occurrences of all texts in the text set is 30, then the frequency of occurrence of text A in the text set is 1/10.
Step 1-4: Determine the index value of the text set from the frequency of occurrence of each distinct text in the text set.
If there are multiple target block elements, the index value of the text set of each target block element can be calculated by repeating steps 1-1 to 1-4.
Specifically, in this embodiment, determining the index value of the text set from the frequency of occurrence of each distinct text in the text set includes calculating the index value according to the formula $E_{Set} = -\sum_{i=1}^{m} p(text_i)\log p(text_i)$, where $E_{Set}$ is the index value of the text set, $m$ is the number of distinct texts in the text set, and $p(text_i)$ is the frequency of occurrence of the $i$-th distinct text in the text set. In other words, the frequency of occurrence of each distinct text is multiplied by the logarithm of that frequency, all of the products are summed, and the sum is negated; the result is the index value of the text set.
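Steps 1-1 to 1-4 can be traced one at a time in a short Python sketch. The text set reuses the figures from the example in step 1-3 (text A occurring 3 times out of 30); filling the other 27 occurrences with a single text "B" and using the natural logarithm are assumptions for illustration only.

```python
import math
from collections import Counter

def index_value_steps(texts):
    """Return the intermediate results of steps 1-1 to 1-4 for one text set."""
    counts = Counter(texts)                       # step 1-1: occurrences per distinct text
    total = sum(counts.values())                  # step 1-2: total occurrences
    freqs = {text: n / total                      # step 1-3: frequency of each distinct text
             for text, n in counts.items()}
    value = -sum(p * math.log(p)                  # step 1-4: negated sum of p * log(p)
                 for p in freqs.values())
    return counts, total, freqs, value

# Hypothetical text set: "A" occurs 3 times among 30 texts.
texts = ["A"] * 3 + ["B"] * 27
counts, total, freqs, value = index_value_steps(texts)
```

Because the formula has the form of an information entropy, a text set whose texts are all identical gets an index value of zero, and the value grows as the texts become more varied.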
Preferably, after the text in the text sets whose index value is greater than the first preset threshold has been extracted to obtain the web page information, the method provided by this embodiment also includes recording the category attribute of the text. The category attribute can be, for example, title or content. That is, this embodiment records whether the extracted text content is a title, content, and so on.
In this embodiment, recording the category attribute of the extracted text lets users quickly filter out the information they need during later big data analysis, which improves user satisfaction. For example, a user who wants only the titles among the extracted web page information simply selects the category attribute "title" to quickly obtain the web page information that meets the requirement.
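One way to record and later filter by the category attribute is sketched below. The record layout, the field names, and the example.com URLs are all hypothetical; the patent only states that a category attribute such as title or content is recorded with the extracted text.

```python
from dataclasses import dataclass

@dataclass
class ExtractedText:
    page_url: str   # hypothetical field: where the text came from
    text: str       # the extracted text
    category: str   # recorded category attribute, e.g. "title" or "content"

def filter_by_category(records, category):
    """Return only the records whose category attribute matches."""
    return [r for r in records if r.category == category]

# Hypothetical extraction results.
records = [
    ExtractedText("https://example.com/1", "Title 1", "title"),
    ExtractedText("https://example.com/1", "Body A", "content"),
    ExtractedText("https://example.com/2", "Title 2", "title"),
]
titles = filter_by_category(records, "title")
```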
The embodiments of the present invention also provide a specific way of determining the page category of a page to be extracted. Taking a first web page to be extracted and a second web page to be extracted as any two of the multiple web pages to be extracted, their page categories can be determined through steps 2-1 to 2-5:
Step 2-1: Build a first tree structure from the HTML code of the first web page to be extracted, and a second tree structure from the HTML code of the second web page to be extracted.
Step 2-2: Extract the block elements that contain the preset attribute from the first tree structure to obtain first block elements, and extract the block elements that contain the preset attribute from the second tree structure to obtain second block elements. The preset attribute is the class attribute or the id attribute. That is, this step extracts only the block elements that have a class or id attribute from the HTML code of the first page to be extracted, and only the block elements that have a class or id attribute from the HTML code of the second page to be extracted.
Step 2-3: Calculate the average similarity of the first and second web pages to be extracted from the first block elements and the second block elements. In this embodiment, the average similarity can be calculated according to the formula $V = \frac{1}{2}(S_1 + S_2)$, where $V$ is the average similarity, $S_1$ is the first similarity of the first and second web pages to be extracted, and $S_2$ is their second similarity. Specifically, the first similarity $S_1$ can be calculated according to a formula (omitted from this text), where $K_p$ is a block element common to the first and second web pages to be extracted, $p$ runs from 1 to $m$, $m$ is the number of common block elements, $V1_{K_p}$ is the number of occurrences of the common block element $K_p$ in the first web page to be extracted, $K0_k$ is a first block element in the first web page to be extracted, $N1$ is the number of first block elements in the first web page to be extracted, and a further symbol (also omitted from this text) is the number of occurrences of the first block element $K0_k$ in the first web page to be extracted. The second similarity $S_2$ can be calculated according to a corresponding formula (omitted from this text), where $V2_{K_p}$ is the number of occurrences of the common block element $K_p$ in the second web page to be extracted, $K1_k$ is a second block element in the second web page to be extracted, $N2$ is the number of second block elements in the second web page to be extracted, and a further symbol (also omitted from this text) is the number of occurrences of the second block element $K1_k$ in the second web page to be extracted.
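The sketch below shows one plausible reading of step 2-3. The formulas for S1 and S2 are not reproduced in this text, so the ratio used here (occurrences of the common block elements divided by occurrences of all block elements on that page) is an assumption inferred from the symbol definitions, not the patent's confirmed formula. Only V = 1/2 (S1 + S2) comes directly from the description. The block-element labels are hypothetical.

```python
from collections import Counter

def page_similarity(blocks, shared):
    """ASSUMED form of S1/S2: share of a page's block-element occurrences
    that belong to block elements common to both pages."""
    total = sum(blocks.values())
    if total == 0:
        return 0.0
    return sum(blocks[k] for k in shared) / total

def similarity_average(blocks1, blocks2):
    """V = 1/2 * (S1 + S2), as stated in the description."""
    c1, c2 = Counter(blocks1), Counter(blocks2)
    shared = c1.keys() & c2.keys()
    s1 = page_similarity(c1, shared)
    s2 = page_similarity(c2, shared)
    return 0.5 * (s1 + s2)

# Hypothetical block elements (with class or id) found on two pages.
page1_blocks = ["div.menu", "div#title", "p.content", "p.content"]
page2_blocks = ["div.menu", "div#title", "div#ad"]
v = similarity_average(page1_blocks, page2_blocks)
```

Under this assumed reading, each of S1 and S2 lies between 0 and 1, and two pages with exactly the same block elements score 1.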
Step 2-4: Compare the average similarity with the second preset threshold. The second preset threshold can also be set as required.
Step 2-5: If the average similarity is greater than the second preset threshold, determine that the first web page to be extracted and the second web page to be extracted belong to the same page category; if the average similarity is less than or equal to the second preset threshold, determine that they belong to different page categories.
In this embodiment, any two of the multiple web pages to be extracted can be treated as the first and second web pages to be extracted, and steps 2-1 to 2-5 are repeated until the page category of every page to be extracted has been determined. Note that if web page A and web page B belong to the same page category, and web page A and web page D also belong to the same page category, then web pages A, B, and D all belong to the same page category. Once two or more web pages to be extracted belong to the same page category, any other web page whose page category needs to be determined only has to have its average similarity calculated against one web page in that page category and compared with the second preset threshold to determine whether it belongs to that page category.
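The transitive grouping described above (A with B, A with D, therefore A, B, and D together) can be sketched with a union-find structure. The function names, the similarity table, and the threshold 0.5 are hypothetical. For simplicity this sketch compares every pair of pages, whereas the description notes that comparing a new page with a single member of an existing page category is sufficient.

```python
def cluster_pages(pages, similarity, threshold):
    """Group page indices so that any two pages whose similarity exceeds
    the threshold end up, transitively, in the same page category."""
    parent = list(range(len(pages)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            if similarity(pages[i], pages[j]) > threshold:
                parent[find(i)] = find(j)   # merge the two groups

    groups = {}
    for i in range(len(pages)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical pairwise average similarities between pages A, B, D and E.
TABLE = {("A", "B"): 0.9, ("A", "D"): 0.8, ("B", "D"): 0.3}

def table_similarity(p, q):
    return TABLE.get((p, q), TABLE.get((q, p), 0.0))

pages = ["A", "B", "D", "E"]
categories = cluster_pages(pages, table_similarity, threshold=0.5)
```

B and D land in the same page category even though their own similarity is below the threshold, because both are similar to A.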
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps can be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
From the description of the above embodiments, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
Embodiment 2
According to an embodiment of the present invention, an extraction device for web page information is also provided for implementing the above extraction method for web page information. The extraction device is mainly used to perform the extraction method provided above, and is described in detail below:
Fig. 2 is a schematic diagram of the extraction device for web page information according to an embodiment of the present invention. As shown in Fig. 2, the device mainly includes an acquiring unit 10, a clustering unit 20, a first extraction unit 30, a second extraction unit 40, a first computing unit 50 and a third extraction unit 60, wherein:
The acquiring unit 10 is used to obtain the HyperText Markup Language (HTML) code of multiple web pages to be extracted. Specifically, the HTML code of the multiple web pages to be extracted can be obtained simultaneously, or the HTML code of each web page to be extracted can be obtained one by one in sequence.
The clustering unit 20 is used to cluster the multiple web pages to be extracted according to the HTML code to obtain multiple belonging categories. That is, the web pages to be extracted are classified according to the HTML code obtained for each of them, and similar web pages are placed in one category. It should be noted that a web page to be extracted can have only one belonging category.
The first extraction unit 30 is used to extract the object block elements in each belonging category, where an object block element is a block element shared by the different web pages to be extracted in the same belonging category. There may be one object block element or several; in the embodiments of the present invention, their number is determined by the number of block elements shared by the different pages to be extracted in the same belonging category. A shared block element is a block element whose tag name and attribute are identical in all the different pages to be extracted of the same belonging category, where the attribute is the class attribute or the id attribute. For example, web page 1, web page 2 and web page 3 belong to belonging category A, and three block elements are contained in every one of these web pages: div[class="menu"], div[id="title"] and p[class="content"]. Belonging category A therefore has three object block elements.
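The selection of shared block elements in the example above amounts to a set intersection. The following is a minimal sketch under assumptions of its own: each block element is represented as a hypothetical `(tag, attribute, value)` tuple, and the extra `div[class="ad"]` element on page 1 is invented for illustration.

```python
def shared_block_elements(pages):
    """Return the block elements that occur in every page of one category."""
    element_sets = [set(elements) for elements in pages.values()]
    # An empty category has no shared elements.
    return set.intersection(*element_sets) if element_sets else set()


common = [("div", "class", "menu"), ("div", "id", "title"), ("p", "class", "content")]

# Hypothetical category A: three pages sharing three block elements.
category_a = {
    "page1": common + [("div", "class", "ad")],  # only page 1 has this one
    "page2": common,
    "page3": common,
}

object_block_elements = shared_block_elements(category_a)
```

Here `object_block_elements` holds the three shared elements, and the page-specific `div[class="ad"]` is dropped because pages 2 and 3 do not contain it.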
The second extraction unit 40 is used to extract the text in an object block element to obtain the text collection of that object block element. Specifically, the same object block element contains multiple texts, and the set of these texts is the text collection of that object block element. If there are multiple object block elements, the text in each object block element is extracted to obtain the text collection of each object block element. Continuing the example above, the text collection obtained for the object block element div[id="title"] is {"title 1", "title 2", "title 3"}.
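One way to collect such a text collection is with Python's standard `html.parser` module. The sketch below is illustrative only: the class `BlockTextParser`, the helper `block_text` and the three sample pages are hypothetical, and the attribute value is matched exactly (a `class` attribute listing several classes would need extra handling).

```python
from html.parser import HTMLParser


class BlockTextParser(HTMLParser):
    """Collect the text inside elements matching (tag, attribute, value)."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.target = (tag, attr, value)
        self.depth = 0  # greater than 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        wanted_tag, attr, value = self.target
        if self.depth:
            if tag == wanted_tag:
                self.depth += 1  # nested element with the same tag name
        elif tag == wanted_tag and dict(attrs).get(attr) == value:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.target[0]:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def block_text(html, tag, attr, value):
    parser = BlockTextParser(tag, attr, value)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical pages of one category, differing only in title and body.
pages = [
    f'<div class="menu">Home</div><div id="title">Title {n}</div>'
    f'<p class="content">Body {n}</p>'
    for n in (1, 2, 3)
]
text_collection = [block_text(page, "div", "id", "title") for page in pages]
```

The result is kept as a list rather than a Python `set`, because the later frequency calculation needs to know how often each text occurs.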
The first computing unit 50 is used to calculate the desired value of the text collection, where the desired value represents the degree of difference among the texts in the text collection. That is, it measures how much the texts in the object block element differ: the greater the degree of difference, the more the content of the texts in that object block element varies.
The third extraction unit 60 is used to extract the text in any text collection whose desired value is greater than the first predetermined threshold, thereby obtaining the web page information. That is, only the text in a text collection whose desired value is greater than the first predetermined threshold is information that needs to be extracted from the web pages to be extracted. Specifically, the first predetermined threshold can be set as required.
In the embodiments of the present invention, obtaining the HTML code of multiple web pages to be extracted makes it possible to divide those pages into belonging categories and then to obtain the block elements that the different web pages to be extracted in the same belonging category have in common. The text content of each such block element can then be extracted, and whether that text content is information to be extracted from the web pages is determined by comparing the degree of difference of the extracted text content with a predetermined threshold. This solves the problem of low accuracy of web page information extraction in the prior art and thereby improves the accuracy of web page information extraction.
It should be noted that if there are multiple object block elements, the desired value of the text collection of each object block element is calculated separately, each calculated desired value is compared with the first predetermined threshold, and the text in every text collection whose desired value is greater than the first predetermined threshold is extracted.
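The per-element comparison can be sketched as a simple filter. Everything concrete here is a hypothetical illustration: the function name `extract_information`, the sample desired values and the threshold 0.5 are not taken from the patent.

```python
def extract_information(text_collections, desired_values, threshold):
    """Keep only the text collections whose desired value exceeds the threshold."""
    return {
        element: texts
        for element, texts in text_collections.items()
        # Strict comparison: a value equal to the threshold is not extracted.
        if desired_values[element] > threshold
    }


# Hypothetical text collections for three object block elements.
text_collections = {
    'div[class="menu"]': ["Home", "Home", "Home"],
    'div[id="title"]': ["Title 1", "Title 2", "Title 3"],
    'p[class="content"]': ["Body 1", "Body 2", "Body 3"],
}
# Hypothetical desired values: identical menu text shows no difference.
desired_values = {
    'div[class="menu"]': 0.0,
    'div[id="title"]': 1.1,
    'p[class="content"]': 1.1,
}

web_page_information = extract_information(text_collections, desired_values, threshold=0.5)
```

The navigation menu is discarded because its text never changes between pages, while the title and content collections are kept.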
Specifically, the first computing unit 50 includes a recording module, a first determining module, a computing module and a second determining module, wherein:
The recording module is used to record the number of occurrences of each distinct text in the text collection. Because the text collection includes multiple texts, some of them may have identical content; in the embodiments of the present invention, the number of occurrences in the text collection is counted only for texts whose content differs from one another.
The first determining module is used to determine, from the number of occurrences of each distinct text, the total number of occurrences of all texts in the text collection. Specifically, the total number of occurrences of all texts equals the sum of the numbers of occurrences of all distinct texts in the text collection.
The computing module is used to calculate the frequency of occurrence of each distinct text in the text collection from that text's number of occurrences and the total number of occurrences. For example, suppose the text collection contains a text A that differs from the other texts in the collection, text A occurs 3 times in the collection, and the total number of occurrences of all texts in the collection is 30; the frequency of occurrence of text A in the text collection is then 3/30 = 1/10.
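The counting steps performed by the recording module, the first determining module and the computing module can be sketched as follows. The function name `text_frequencies` and the sample collection are hypothetical; exact fractions are used so that the 3/30 example can be checked without rounding.

```python
from collections import Counter
from fractions import Fraction


def text_frequencies(texts):
    """Map each distinct text to its frequency of occurrence in the collection."""
    counts = Counter(texts)       # number of occurrences of each distinct text
    total = sum(counts.values())  # total number of occurrences of all texts
    return {text: Fraction(count, total) for text, count in counts.items()}


# Text A occurs 3 times among 30 texts in total, as in the example above.
texts = ["A"] * 3 + ["B"] * 27
frequencies = text_frequencies(texts)
```

With this input, `frequencies["A"]` equals `Fraction(1, 10)`, matching the 1/10 in the example.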
The second determining module is used to determine the desired value of the text collection from the frequency of occurrence of each distinct text in the text collection.
If there are multiple object block elements, the desired value of the text collection of each object block element can be calculated by repeatedly invoking the recording module, the first determining module, the computing module and the second determining module.
Specifically, the second determining module includes a calculating submodule, which is used to calculate the desired value of the text collection according to the formula $E_{Set} = -\sum_{i=1}^{m} p(text_i) \log p(text_i)$, where $E_{Set}$ is the desired value of the text collection, $m$ is the number of distinct texts in the text collection, and $p(text_i)$ is the frequency of occurrence of the $i$-th distinct text in the text collection. In other words, in the embodiments of the present invention, the frequency of occurrence of each distinct text in the text collection is multiplied by the logarithm of that frequency, all the resulting products are summed, and the sum is negated to give the desired value of the text collection.
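The calculation described in words above (multiply each frequency by its logarithm, sum, negate) can be sketched as below. The function name `desired_value` is hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated here.

```python
import math


def desired_value(frequencies):
    """Negated sum of p * log(p) over the frequencies of the distinct texts.

    Assumption: natural logarithm; a different base only rescales the result.
    """
    return -sum(p * math.log(p) for p in frequencies if p > 0)


# Three distinct titles, each with frequency 1/3: maximal difference.
titles_value = desired_value([1 / 3, 1 / 3, 1 / 3])

# One text repeated on every page (frequency 1): no difference at all.
menu_value = desired_value([1.0])
```

A collection in which every page carries the same text yields a value of zero, while a collection of three equally frequent distinct texts yields the logarithm of 3, so a threshold between the two separates page-specific content from repeated boilerplate.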
Preferably, the extraction device for web page information provided by the embodiment of the present invention further includes a recording unit, which is used to record the category attribute of the text after the text in the text collection whose desired value is greater than the first predetermined threshold has been extracted and the web page information obtained. Specifically, the category attribute can be title, content, and so on; that is, the embodiment of the present invention records whether the extracted text content is a title, content, etc.
In the embodiments of the present invention, recording the category attribute of the extracted text allows users to quickly filter out the information they need during subsequent big-data analysis, which improves user satisfaction. For example, a user who wants the information whose content is a title among the extracted web page information only has to select the category attribute "title" to quickly filter out the web page information that meets the requirement.
Preferably, the embodiment of the present invention also provides a specific way of determining the belonging category of a web page to be extracted, which can be performed by an establishing unit, a fourth extraction unit, a second computing unit, a comparing unit and a processing unit included in the extraction device for web page information, wherein:
The establishing unit is used to build a first tree structure from the HTML code of the first web page to be extracted and a second tree structure from the HTML code of the second web page to be extracted, where the first web page to be extracted and the second web page to be extracted are any two of the multiple web pages to be extracted.
The fourth extraction unit is used to extract the block elements that include a preset attribute from the first tree structure to obtain the first block elements, and to extract the block elements that include the preset attribute from the second tree structure to obtain the second block elements. Specifically, the preset attribute is the class attribute or the id attribute; that is, this unit extracts only those block elements in the HTML code of the first web page to be extracted that include a class attribute or an id attribute, and likewise only those block elements in the HTML code of the second web page to be extracted that include a class attribute or an id attribute.
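The extraction of block elements carrying a class or id attribute can be sketched with Python's standard `html.parser`. This sketch makes several assumptions of its own: it scans start tags instead of building an explicit tree (which visits the same elements), the tag list `BLOCK_TAGS` is a hypothetical choice of what counts as a block element, and an element carrying both `class` and `id` is counted once per attribute.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: which tags count as block elements is not specified here.
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "table", "section", "article"}


class BlockElementCollector(HTMLParser):
    """Count block elements that carry a class or id attribute."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in BLOCK_TAGS:
            return
        for name, value in attrs:
            if name in ("class", "id") and value:
                self.counts[(tag, name, value)] += 1


def block_elements(html):
    collector = BlockElementCollector()
    collector.feed(html)
    collector.close()
    return collector.counts


# Hypothetical page: the span and the bare <p> are not collected.
page = (
    '<div class="menu"><span class="x">a</span></div>'
    '<div id="title">T</div>'
    '<p class="content">c</p><p class="content">d</p><p>plain</p>'
)
first_block_elements = block_elements(page)
```

The resulting counter records each qualifying block element together with how often it occurs on the page, which is the input the similarity calculation needs.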
The second computing unit is used to calculate the similarity average value of the first web page to be extracted and the second web page to be extracted from the first block elements and the second block elements. In the embodiments of the present invention, the similarity average value can be calculated according to the formula V = 1/2 (S1 + S2), where V is the similarity average value, S1 is the first similarity of the first web page to be extracted and the second web page to be extracted, and S2 is the second similarity of the first web page to be extracted and the second web page to be extracted. Specifically, the first similarity S1 can be calculated according to a formula [formula not reproduced in the source text], where Kp is a block element that is identical in the first web page to be extracted and the second web page to be extracted, p takes the values 1 to m in turn, m is the number of identical block elements, $V1_{Kp}$ is the frequency of occurrence of the identical block element Kp in the first web page to be extracted, $K0_k$ is a first block element in the first web page to be extracted, N1 is the number of first block elements in the first web page to be extracted, and the remaining symbol of the formula (likewise not reproduced) is the frequency of occurrence of the first block element $K0_k$ in the first web page to be extracted. The second similarity S2 is calculated according to a formula [formula not reproduced in the source text], where $V2_{Kp}$ is the frequency of occurrence of the identical block element Kp in the second web page to be extracted, $K1_k$ is a second block element in the second web page to be extracted, N2 is the number of second block elements in the second web page to be extracted, and the remaining symbol of the formula (likewise not reproduced) is the frequency of occurrence of the second block element $K1_k$ in the second web page to be extracted.
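The averaging step V = 1/2 (S1 + S2) can be illustrated as follows. Note the strong assumption: the formulas for S1 and S2 are not reproduced in this text, so the sketch assumes that each similarity is the share of a page's block-element occurrences that belong to block elements found on both pages. That ratio, the function names and the sample counts are hypothetical and may differ from the patented formulas.

```python
def one_sided_similarity(own_counts, shared):
    """ASSUMED form of S1/S2: occurrences of shared elements / all occurrences."""
    total = sum(own_counts.values())
    if total == 0:
        return 0.0
    return sum(own_counts[element] for element in shared) / total


def similarity_average(first_counts, second_counts):
    """V = 1/2 (S1 + S2), with S1 and S2 computed under the assumption above."""
    shared = first_counts.keys() & second_counts.keys()
    s1 = one_sided_similarity(first_counts, shared)
    s2 = one_sided_similarity(second_counts, shared)
    return (s1 + s2) / 2


# Hypothetical occurrence counts of block elements on two pages.
first_counts = {"div.menu": 1, "div#title": 1, "p.content": 2, "div.ad": 1}
second_counts = {"div.menu": 1, "div#title": 1, "p.content": 3}

v = similarity_average(first_counts, second_counts)  # S1 = 4/5, S2 = 5/5
```

Averaging the two one-sided values keeps the measure symmetric: a page that is a small subset of a much larger page scores high from its own side but low from the other side.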
The comparing unit is used to compare the similarity average value with the second predetermined threshold. Specifically, the second predetermined threshold can likewise be set as required.
The processing unit is used to determine, when the comparison shows that the similarity average value is greater than the second predetermined threshold, that the first web page to be extracted and the second web page to be extracted belong to the same belonging category, and to determine, when the comparison shows that the similarity average value is less than or equal to the second predetermined threshold, that the first web page to be extracted and the second web page to be extracted belong to different belonging categories.
In the embodiments of the present invention, any two of the multiple web pages to be extracted can be taken as the first web page to be extracted and the second web page to be extracted, respectively, and the establishing unit, the fourth extraction unit, the second computing unit, the comparing unit and the processing unit are invoked repeatedly until the belonging category of every web page to be extracted has been determined. It should be noted that if web page A and web page B belong to the same belonging category, and web page A and web page D also belong to the same belonging category, then web page A, web page B and web page D all belong to the same belonging category. Once two or more web pages to be extracted have been found to belong to the same belonging category, any other web page to be extracted whose belonging category still needs to be determined only has to be compared with one web page in that category: the similarity average value of the two pages is calculated and compared with the second predetermined threshold, which determines whether the page belongs to that category.
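The shortcut of comparing a new page with a single member of an existing category can be sketched as below. The function name `assign_category`, the choice of the first member as the representative, the similarity scores and the threshold 0.5 are hypothetical illustrations.

```python
def assign_category(page, categories, similarity, threshold):
    """Place a page into the first category whose representative is similar enough."""
    for members in categories:
        representative = members[0]  # one member of the category suffices
        if similarity(page, representative) > threshold:
            members.append(page)
            return members
    # No existing category matched: the page starts a category of its own.
    categories.append([page])
    return categories[-1]


# Hypothetical similarity average values against representative "A".
scores = {frozenset(("E", "A")): 0.7, frozenset(("F", "A")): 0.2}


def lookup(a, b):
    return scores.get(frozenset((a, b)), 0.0)


categories = [["A", "B", "D"]]
assign_category("E", categories, lookup, threshold=0.5)
assign_category("F", categories, lookup, threshold=0.5)
```

Comparing against one representative per category needs far fewer similarity calculations than comparing every pair of pages, at the cost of depending on which member is chosen as the representative.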
As can be seen from the above description, the present invention solves the problem of low accuracy of web page information extraction in the prior art and achieves the effect of improving the accuracy of web page information extraction.
The numbering of the above embodiments of the present invention is for description only and does not represent the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed client can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components can be combined or integrated into another system, or some features can be ignored or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make improvements and refinements without departing from the principles of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (8)

  1. An extraction method for web page information, characterized by comprising:
    obtaining the HyperText Markup Language (HTML) code of multiple web pages to be extracted;
    clustering the multiple web pages to be extracted according to the HTML code to obtain multiple belonging categories;
    extracting the object block elements in each belonging category, wherein an object block element is a block element shared by the different web pages to be extracted in the same belonging category;
    extracting the text in the object block element to obtain the text collection of the object block element;
    calculating the desired value of the text collection, wherein the desired value represents the degree of difference of the texts in the text collection; and
    extracting the text in the text collection whose desired value is greater than a first predetermined threshold to obtain the web page information,
    wherein the belonging categories of a first web page to be extracted and a second web page to be extracted are determined in the following manner, the first web page to be extracted and the second web page to be extracted being any two of the multiple web pages to be extracted:
    building a first tree structure from the HTML code of the first web page to be extracted, and building a second tree structure from the HTML code of the second web page to be extracted;
    extracting the block elements that include a preset attribute from the first tree structure to obtain first block elements, and extracting the block elements that include the preset attribute from the second tree structure to obtain second block elements, wherein the preset attribute is the class attribute or the id attribute;
    calculating the similarity average value of the first web page to be extracted and the second web page to be extracted from the first block elements and the second block elements;
    comparing the similarity average value with a second predetermined threshold; and
    when the comparison shows that the similarity average value is greater than the second predetermined threshold, determining that the first web page to be extracted and the second web page to be extracted belong to the same belonging category, or, when the comparison shows that the similarity average value is less than or equal to the second predetermined threshold, determining that the first web page to be extracted and the second web page to be extracted belong to different belonging categories,
    wherein the similarity average value of the first web page to be extracted and the second web page to be extracted is calculated according to the formula V = 1/2 (S1 + S2), where V is the similarity average value, S1 is the first similarity of the first web page to be extracted and the second web page to be extracted, and S2 is the second similarity of the first web page to be extracted and the second web page to be extracted; the first similarity S1 is calculated according to a formula [formula not reproduced in the source text], where Kp is a block element that is identical in the first web page to be extracted and the second web page to be extracted, p takes the values 1 to m in turn, m is the number of identical block elements, $V1_{Kp}$ is the frequency of occurrence of the identical block element Kp in the first web page to be extracted, $K0_k$ is a first block element in the first web page to be extracted, N1 is the number of first block elements in the first web page to be extracted, and the remaining symbol of the formula (likewise not reproduced) is the frequency of occurrence of the first block element $K0_k$ in the first web page to be extracted; the second similarity S2 is calculated according to a formula [formula not reproduced in the source text], where $V2_{Kp}$ is the frequency of occurrence of the identical block element Kp in the second web page to be extracted, $K1_k$ is a second block element in the second web page to be extracted, N2 is the number of second block elements in the second web page to be extracted, and the remaining symbol of the formula (likewise not reproduced) is the frequency of occurrence of the second block element $K1_k$ in the second web page to be extracted.
  2. The extraction method according to claim 1, characterized in that calculating the desired value of the text collection comprises:
    recording the number of occurrences of each distinct text in the text collection;
    determining, from the number of occurrences of each distinct text, the total number of occurrences of all texts in the text collection;
    calculating, from the number of occurrences of each distinct text and the total number of occurrences, the frequency of occurrence of each distinct text in the text collection; and
    determining the desired value of the text collection from the frequency of occurrence of each distinct text in the text collection.
  3. The extraction method according to claim 2, characterized in that determining the desired value of the text collection from the frequency of occurrence of each distinct text in the text collection comprises:
    calculating the desired value of the text collection according to the formula $E_{Set} = -\sum_{i=1}^{m} p(text_i) \log p(text_i)$, where $E_{Set}$ is the desired value of the text collection, $m$ is the number of distinct texts in the text collection, and $p(text_i)$ is the frequency of occurrence of each distinct text in the text collection.
  4. The extraction method according to claim 1, characterized in that, after extracting the text in the text collection whose desired value is greater than the first predetermined threshold to obtain the web page information, the extraction method further comprises:
    recording the category attribute of the text.
  5. An extraction device for web page information, characterized by comprising:
    an acquiring unit, used to obtain the HyperText Markup Language (HTML) code of multiple web pages to be extracted;
    a clustering unit, used to cluster the multiple web pages to be extracted according to the HTML code to obtain multiple belonging categories;
    a first extraction unit, used to extract the object block elements in each belonging category, wherein an object block element is a block element shared by the different web pages to be extracted in the same belonging category;
    a second extraction unit, used to extract the text in the object block element to obtain the text collection of the object block element;
    a first computing unit, used to calculate the desired value of the text collection, wherein the desired value represents the degree of difference of the texts in the text collection; and
    a third extraction unit, used to extract the text in the text collection whose desired value is greater than a first predetermined threshold to obtain the web page information,
    wherein the extraction device further comprises:
    an establishing unit, used to build a first tree structure from the HTML code of a first web page to be extracted and a second tree structure from the HTML code of a second web page to be extracted, wherein the first web page to be extracted and the second web page to be extracted are any two of the multiple web pages to be extracted;
    a fourth extraction unit, used to extract the block elements that include a preset attribute from the first tree structure to obtain first block elements, and to extract the block elements that include the preset attribute from the second tree structure to obtain second block elements, wherein the preset attribute is the class attribute or the id attribute;
    a second computing unit, used to calculate the similarity average value of the first web page to be extracted and the second web page to be extracted from the first block elements and the second block elements;
    a comparing unit, used to compare the similarity average value with a second predetermined threshold; and
    a processing unit, used to determine, when the comparison shows that the similarity average value is greater than the second predetermined threshold, that the first web page to be extracted and the second web page to be extracted belong to the same belonging category, or to determine, when the comparison shows that the similarity average value is less than or equal to the second predetermined threshold, that the first web page to be extracted and the second web page to be extracted belong to different belonging categories,
    wherein the similarity average value of the first web page to be extracted and the second web page to be extracted is calculated according to the formula V = 1/2 (S1 + S2), where V is the similarity average value, S1 is the first similarity of the first web page to be extracted and the second web page to be extracted, and S2 is the second similarity of the first web page to be extracted and the second web page to be extracted; the first similarity S1 is calculated according to a formula [formula not reproduced in the source text], where Kp is a block element that is identical in the first web page to be extracted and the second web page to be extracted, p takes the values 1 to m in turn, m is the number of identical block elements, $V1_{Kp}$ is the frequency of occurrence of the identical block element Kp in the first web page to be extracted, $K0_k$ is a first block element in the first web page to be extracted, N1 is the number of first block elements in the first web page to be extracted, and the remaining symbol of the formula (likewise not reproduced) is the frequency of occurrence of the first block element $K0_k$ in the first web page to be extracted; the second similarity S2 is calculated according to a formula [formula not reproduced in the source text], where $V2_{Kp}$ is the frequency of occurrence of the identical block element Kp in the second web page to be extracted, $K1_k$ is a second block element in the second web page to be extracted, N2 is the number of second block elements in the second web page to be extracted, and the remaining symbol of the formula (likewise not reproduced) is the frequency of occurrence of the second block element $K1_k$ in the second web page to be extracted.
  6. The extraction device according to claim 5, characterized in that the first computing unit comprises:
    a recording module, used to record the number of occurrences of each distinct text in the text collection;
    a first determining module, used to determine, from the number of occurrences of each distinct text, the total number of occurrences of all texts in the text collection;
    a computing module, used to calculate, from the number of occurrences of each distinct text and the total number of occurrences, the frequency of occurrence of each distinct text in the text collection; and
    a second determining module, used to determine the desired value of the text collection from the frequency of occurrence of each distinct text in the text collection.
  7. The extraction device according to claim 6, characterized in that the second determining module comprises:
    a calculating submodule, used to calculate the desired value of the text collection according to the formula $E_{Set} = -\sum_{i=1}^{m} p(text_i) \log p(text_i)$, where $E_{Set}$ is the desired value of the text collection, $m$ is the number of distinct texts in the text collection, and $p(text_i)$ is the frequency of occurrence of each distinct text in the text collection.
  8. The extraction device according to claim 5, characterized in that the extraction device further comprises:
    a recording unit, used to record the category attribute of the text after the text in the text collection whose desired value is greater than the first predetermined threshold has been extracted and the web page information obtained.
CN201410830367.6A 2014-12-25 2014-12-25 The extracting method and device of Webpage information Active CN104484451B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410830367.6A CN104484451B (en) 2014-12-25 2014-12-25 The extracting method and device of Webpage information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410830367.6A CN104484451B (en) 2014-12-25 2014-12-25 The extracting method and device of Webpage information

Publications (2)

Publication Number Publication Date
CN104484451A CN104484451A (en) 2015-04-01
CN104484451B true CN104484451B (en) 2017-12-19

Family

ID=52758992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410830367.6A Active CN104484451B (en) 2014-12-25 2014-12-25 The extracting method and device of Webpage information

Country Status (1)

Country Link
CN (1) CN104484451B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10049269B2 (en) * 2015-09-30 2018-08-14 Fuji Xerox Co., Ltd. Information processing apparatus, information processing method, and non-transitory computer readable medium
CN108664511B (en) * 2017-03-31 2021-07-13 北京京东尚科信息技术有限公司 Method and device for acquiring webpage information
CN116304457B (en) * 2023-02-27 2024-03-29 山东乾舜广告传媒有限公司 Marking method for webpage multiple information attribute

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541874A (en) * 2010-12-16 2012-07-04 ***通信集团公司 Webpage text content extracting method and device
CN103064966A (en) * 2012-12-31 2013-04-24 中国科学院计算技术研究所 Method for extracting regular noise from single record web pages

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005092889A (en) * 2003-09-18 2005-04-07 Fujitsu Ltd Information block extraction apparatus and method for web page

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
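Classifying and extracting Web pages using clustering techniques (应用聚类技术分类提取Web页面); Cui Huichao (崔慧超) et al.; Computer Knowledge and Technology (《电脑知识与技术》); 2010-01-31; Vol. 6, No. 1; pp. 212-213 *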

Also Published As

Publication number Publication date
CN104484451A (en) 2015-04-01

Similar Documents

Publication Publication Date Title
US11138250B2 (en) Method and device for extracting core word of commodity short text
WO2019218514A1 (en) Method for extracting webpage target information, device, and storage medium
CN105095368B (en) Method and device for sequencing news information
CN103020845B (en) A kind of method for pushing and system of mobile application
CN100478962C (en) Method, device and system for searching web page and device for establishing index database
CN109471937A (en) A kind of file classification method and terminal device based on machine learning
CN107545038B (en) Text classification method and equipment
CN108090162A (en) Information-pushing method and device based on artificial intelligence
CN106354818B (en) Social media-based dynamic user attribute extraction method
US10311120B2 (en) Method and apparatus for identifying webpage type
CN103617290B (en) Chinese machine-reading system
CN106708841B (en) The polymerization and device of website visitation path
CN110263009A (en) Generation method, device, equipment and the readable storage medium storing program for executing of log classifying rules
CN106776710A (en) A kind of picture and text construction of knowledge base method based on vertical search engine
CN106649849A (en) Text information base building method and device and searching method, device and system
CN108197190A (en) A kind of method and apparatus of user's identification
CN104484451B (en) The extracting method and device of Webpage information
CN109451147A (en) A kind of information displaying method and device
CN106776609A (en) Reprint the statistical method and device of quantity in website
CN111159404A (en) Text classification method and device
CN106294363A (en) A kind of forum postings evaluation methodology, Apparatus and system
CN109992665A (en) A kind of classification method based on the extension of problem target signature
CN108241867A (en) A kind of sorting technique and device
CN103970888B (en) Document classifying method based on network measure index
CN106934049B (en) News question selection analysis method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Web page information extraction method and web page information extraction device

Effective date of registration: 20190531

Granted publication date: 20171219

Pledgee: Shenzhen Black Horse World Investment Consulting Co.,Ltd.

Pledgor: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Registration number: 2019990000503

CP02 Change in the address of a patent holder

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Patentee after: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Address before: 100086 8th Floor A, Cuigong Fandian (Cuigong Hotel), No. 76 Zhichun Road, Shuangyushu Area, Haidian District, Beijing

Patentee before: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

PP01 Preservation of patent right

Effective date of registration: 20240604

Granted publication date: 20171219