CN104484451A - Web page information extraction method and web page information extraction device - Google Patents

Web page information extraction method and web page information extraction device

Info

Publication number
CN104484451A
Authority
CN
China
Prior art keywords
text
extracted
webpage
collection
text collection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410830367.6A
Other languages
Chinese (zh)
Other versions
CN104484451B (en)
Inventor
侯明午
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201410830367.6A (patent CN104484451B)
Publication of CN104484451A
Application granted
Publication of CN104484451B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a web page information extraction method and a web page information extraction device. The method comprises the following steps: acquiring the HTML (HyperText Markup Language) code of a plurality of web pages to be extracted; clustering the web pages to be extracted according to the HTML code to obtain a plurality of attribution categories; extracting the target block elements in each attribution category, wherein each target block element is a block element shared by different web pages to be extracted in the same attribution category; extracting the texts in the target block elements to obtain a text collection for each target block element; calculating the index value of the text collection, wherein the index value represents the degree of difference among the texts in the text collection; and extracting the texts in the text collections whose index value is greater than a first preset threshold to obtain the web page information. The method and the device address the low accuracy of web page information extraction in the prior art and thereby improve extraction accuracy.

Description

Method and device for extracting web page information
Technical field
The present invention relates to the field of data processing, and in particular to a method and a device for extracting web page information.
Background
Collected web page information is an important data source for big data analysis. There are currently two main approaches to collecting it. One is rule-based: regular expressions, XPath, or CSS selectors are used to extract page elements. The other is statistics-based: a model is trained by machine learning on manually annotated data, and information is extracted according to the model.
Rule-based methods analyze the HTML (HyperText Markup Language) code to identify the left and right boundaries of the information to be extracted and then extract it with regular expressions or other means. Alternatively, they build a DOM (Document Object Model) tree for the page, select page elements with XPath or CSS selectors, and then select the elements that contain the information to be extracted.
Rule-based extraction is accurate but has poor applicability: it usually works only for one class of pages, and a change to the page can cause extraction errors.
Statistics-based methods use machine learning to train on accurate, manually annotated results, obtain a trained model, and identify and extract information with that model.
Statistics-based methods are widely applicable and can be used for all kinds of web pages, but they consume substantial resources and depend heavily on manual annotation: the quality of the extracted information is strongly correlated with the quality of the annotation. Accuracy cannot be fully guaranteed, and because a trained model is not tailored to a specific web page, extraction from a new page may be incomplete or may fail.
No effective solution has yet been proposed for the low accuracy of web page information extraction in the prior art.
Summary of the invention
The main purpose of the present invention is to provide a method and a device for extracting web page information, so as to solve the problem of low accuracy of web page information extraction in the prior art.
To achieve this purpose, according to one aspect of the embodiments of the present invention, a method for extracting web page information is provided.
The method for extracting web page information according to the present invention comprises: obtaining the HyperText Markup Language (HTML) code of multiple web pages to be extracted; clustering the multiple web pages to be extracted according to the HTML code to obtain multiple attribution categories; extracting the target block elements in each attribution category, wherein a target block element is a block element shared by different web pages to be extracted in the same attribution category; extracting the text in the target block element to obtain the text collection of the target block element; calculating the index value of the text collection, wherein the index value represents the degree of difference among the texts in the text collection; and extracting the text in the text collection whose index value is greater than a first preset threshold to obtain the web page information.
Further, calculating the index value of the text collection comprises: recording the number of occurrences of each distinct text in the text collection; determining, from the number of occurrences of each distinct text, the total number of occurrences of all texts in the text collection; calculating, from the number of occurrences of each distinct text and the total number of occurrences, the occurrence frequency of each distinct text in the text collection; and determining the index value of the text collection from the occurrence frequency of each distinct text in the text collection.
Further, determining the index value of the text collection from the occurrence frequency of each distinct text in the text collection comprises: calculating the index value of the text collection according to the formula $E_{set} = -\sum_{i=1}^{m} p(text_i) \log p(text_i)$, wherein $E_{set}$ is the index value of the text collection, $m$ is the number of distinct texts in the text collection, and $p(text_i)$ is the occurrence frequency of the $i$-th distinct text in the text collection.
Further, after extracting the text in the text collection whose index value is greater than the first preset threshold to obtain the web page information, the method further comprises: recording the category attribute of the text.
Further, the attribution categories of a first web page to be extracted and a second web page to be extracted, which are any two of the multiple web pages to be extracted, are determined as follows: building a first tree structure from the HTML code of the first web page to be extracted, and building a second tree structure from the HTML code of the second web page to be extracted; extracting the block elements that contain a preset attribute from the first tree structure to obtain first block elements, and extracting the block elements that contain the preset attribute from the second tree structure to obtain second block elements; calculating the similarity mean of the first and second web pages to be extracted from the first block elements and the second block elements; comparing the similarity mean with a second preset threshold; and, when the similarity mean is greater than the second preset threshold, determining that the first and second web pages to be extracted belong to the same attribution category, or, when the similarity mean is less than or equal to the second preset threshold, determining that they belong to different attribution categories.
To achieve this purpose, according to another aspect of the embodiments of the present invention, a device for extracting web page information is provided.
The device for extracting web page information according to the present invention comprises: an acquiring unit, configured to obtain the HyperText Markup Language (HTML) code of multiple web pages to be extracted; a clustering unit, configured to cluster the multiple web pages to be extracted according to the HTML code to obtain multiple attribution categories; a first extraction unit, configured to extract the target block elements in each attribution category, wherein a target block element is a block element shared by different web pages to be extracted in the same attribution category; a second extraction unit, configured to extract the text in the target block element to obtain the text collection of the target block element; a first computing unit, configured to calculate the index value of the text collection, wherein the index value represents the degree of difference among the texts in the text collection; and a third extraction unit, configured to extract the text in the text collection whose index value is greater than a first preset threshold to obtain the web page information.
Further, the first computing unit comprises: a recording module, configured to record the number of occurrences of each distinct text in the text collection; a first determination module, configured to determine, from the number of occurrences of each distinct text, the total number of occurrences of all texts in the text collection; a computing module, configured to calculate, from the number of occurrences of each distinct text and the total number of occurrences, the occurrence frequency of each distinct text in the text collection; and a second determination module, configured to determine the index value of the text collection from the occurrence frequency of each distinct text in the text collection.
Further, the second determination module comprises: a calculation submodule, configured to calculate the index value of the text collection according to the formula $E_{set} = -\sum_{i=1}^{m} p(text_i) \log p(text_i)$, wherein $E_{set}$ is the index value of the text collection, $m$ is the number of distinct texts in the text collection, and $p(text_i)$ is the occurrence frequency of the $i$-th distinct text in the text collection.
Further, the device also comprises: a recording unit, configured to record the category attribute of the text after the text in the text collection whose index value is greater than the first preset threshold has been extracted to obtain the web page information.
Further, the device also comprises: a building unit, configured to build a first tree structure from the HTML code of a first web page to be extracted and a second tree structure from the HTML code of a second web page to be extracted, wherein the first and second web pages to be extracted are any two of the multiple web pages to be extracted; a fourth extraction unit, configured to extract the block elements that contain a preset attribute from the first tree structure to obtain first block elements, and to extract the block elements that contain the preset attribute from the second tree structure to obtain second block elements; a second computing unit, configured to calculate the similarity mean of the first and second web pages to be extracted from the first block elements and the second block elements; a comparing unit, configured to compare the similarity mean with a second preset threshold; and a processing unit, configured to determine that the first and second web pages to be extracted belong to the same attribution category when the similarity mean is greater than the second preset threshold, or that they belong to different attribution categories when the similarity mean is less than or equal to the second preset threshold.
According to the embodiments of the present invention, the HTML code of multiple web pages to be extracted is obtained; the multiple web pages to be extracted are clustered according to the HTML code to obtain multiple attribution categories; the target block elements in each attribution category are extracted, wherein a target block element is a block element shared by different web pages to be extracted in the same attribution category; the text content in the target block element is extracted to obtain the text collection of the target block element; the index value of the text collection is calculated, wherein the index value represents the degree of difference among the texts in the text collection; and the text in the text collection whose index value is greater than the first preset threshold is extracted to obtain the web page information. Obtaining the HTML code of the web pages to be extracted makes it possible to divide them into attribution categories, to find the block elements shared by different pages in the same category, and to extract the text content of each shared block element. Comparing the degree of difference among the extracted texts with a preset threshold then determines whether that text content is information that needs to be extracted from the pages. This solves the problem of low accuracy of web page information extraction in the prior art and thereby improves extraction accuracy.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for a further understanding of the present invention. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the method for extracting web page information according to an embodiment of the present invention; and
Fig. 2 is a schematic diagram of the device for extracting web page information according to an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative work shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and so on in the specification, claims, and drawings of the present invention are used to distinguish similar objects and not necessarily to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "comprise" and "have", and any variations of them, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, a method embodiment that can be used to implement the device embodiment of this application is provided. It should be noted that the steps shown in the flowchart of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described can be executed in a different order.
According to an embodiment of the present invention, a method for extracting web page information is provided. Fig. 1 is a flowchart of the method for extracting web page information according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps S102 to S112:
S102: obtain the HyperText Markup Language (HTML) code of multiple web pages to be extracted. Specifically, the HTML code of the multiple web pages to be extracted can be obtained simultaneously, or the HTML code of each web page to be extracted can be obtained one page at a time.
S104: cluster the multiple web pages to be extracted according to the HTML code to obtain multiple attribution categories. That is, the web pages to be extracted are classified according to the HTML code obtained for each of them, and similar pages are placed in the same category. It should be noted that a web page to be extracted can have only one attribution category.
S106: extract the target block elements in each attribution category, wherein a target block element is a block element shared by different web pages to be extracted in the same attribution category. Specifically, there may be one target block element or several; in the embodiments of the present invention, their number is determined by the number of block elements shared by the different pages to be extracted in the same attribution category. A shared block element is a block element whose tag name and attribute are both identical across the different pages to be extracted in the same attribution category, where the attribute is the class attribute or the id attribute. For example, web page 1, web page 2, and web page 3 belong to attribution category A, and each of the three pages contains the same three block elements, namely div[class="menu"], div[id="title"], and p[class="content"]; attribution category A then has three target block elements.
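The selection of shared block elements in S106 can be sketched as a set intersection over per-page element signatures. This is a minimal sketch under stated assumptions, not the patented implementation: the `BLOCK_TAGS` set, the signature string format, and the function names are all hypothetical, and the source does not say which tags count as block elements.

```python
from html.parser import HTMLParser

# Assumption: the source does not enumerate block-level tags; this set is illustrative.
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "table", "section", "article"}


class BlockSignatureParser(HTMLParser):
    """Collects signatures such as div[class="menu"] for block elements
    that carry a class or id attribute."""

    def __init__(self):
        super().__init__()
        self.signatures = set()

    def handle_starttag(self, tag, attrs):
        if tag not in BLOCK_TAGS:
            return
        for name, value in attrs:
            if name in ("class", "id") and value:
                self.signatures.add(f'{tag}[{name}="{value}"]')


def block_signatures(html):
    """Return the set of block-element signatures found in one page."""
    parser = BlockSignatureParser()
    parser.feed(html)
    return parser.signatures


def target_block_elements(pages):
    """Return the signatures shared by every page of one attribution category."""
    signature_sets = [block_signatures(html) for html in pages]
    return set.intersection(*signature_sets) if signature_sets else set()
```

An element that has both a class and an id contributes two signatures in this sketch; whether the method treats such an element as one or two block elements is not stated in the source.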
S108: extract the text in the target block element to obtain the text collection of the target block element. Specifically, the same target block element yields multiple texts, one from each page in which it appears, and the set of these texts is the text collection of that target block element. If there are several target block elements, the text in each of them is extracted to obtain a text collection for each. Continuing the example above, the text collection obtained for the target block element div[id="title"] is {"title 1", "title 2", "title 3"}.
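Building the text collection in S108 can be sketched with the standard-library HTML parser. The sketch assumes one matching element per page and takes only its first occurrence; the class and function names are hypothetical. A list is used rather than a Python set because the later index-value calculation needs the number of occurrences of each text.

```python
from html.parser import HTMLParser


class BlockTextParser(HTMLParser):
    """Collects the text inside the first element matching (tag, attr, value)."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.tag, self.attr, self.value = tag, attr, value
        self.depth = 0      # nesting depth of same-named tags inside the match
        self.done = False   # True once the first match has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag and (self.attr, self.value) in attrs:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def block_text(html, tag, attr, value):
    """Return the stripped text of the first matching block element, or ''."""
    parser = BlockTextParser(tag, attr, value)
    parser.feed(html)
    return "".join(parser.chunks).strip()


def text_collection(pages, tag, attr, value):
    """One text per page for the same target block element."""
    return [block_text(html, tag, attr, value) for html in pages]
```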
S110: calculate the index value of the text collection, wherein the index value represents the degree of difference among the texts in the text collection. That is, the degree of difference among the texts in the target block element is calculated; the larger it is, the more the text content of that target block element varies from page to page.
S112: extract the text in the text collection whose index value is greater than the first preset threshold to obtain the web page information. That is, only the text in a text collection whose index value exceeds the first preset threshold is information that needs to be extracted from the web pages to be extracted. Specifically, the first preset threshold can be set as required.
In the embodiments of the present invention, obtaining the HTML code of the web pages to be extracted makes it possible to divide them into attribution categories, to find the block elements shared by different pages in the same category, and to extract the text content of each shared block element. Comparing the degree of difference among the extracted texts with a preset threshold then determines whether that text content is information that needs to be extracted from the pages. This solves the problem of low accuracy of web page information extraction in the prior art and thereby improves extraction accuracy.
It should be noted that if there are several target block elements, the index value of the text collection of each target block element is calculated separately, each index value is compared with the first preset threshold, and the text in every text collection whose index value is greater than the first preset threshold is extracted.
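The per-element comparison with the first preset threshold can be sketched as a simple filter. The function name, the dictionary layout, and the numeric values in the usage note are illustrative assumptions; the index values are taken as already computed.

```python
def select_text_collections(collections, index_values, threshold):
    """Keep the text collections whose index value exceeds the threshold.

    collections:  maps a target block element to its list of texts
    index_values: maps the same target block elements to their index values
    threshold:    the first preset threshold (chosen by the user)
    """
    return {
        element: texts
        for element, texts in collections.items()
        if index_values[element] > threshold
    }
```

For instance, with a menu block whose text is the same on every page (index value 0) and a title block whose text differs on every page (a positive index value), a threshold of 0.5 would keep only the title block.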
Specifically, the index value of the text collection can be calculated through steps 1-1 to 1-4, as follows:
Step 1-1: record the number of occurrences of each distinct text in the text collection. Because the text collection contains multiple texts, some of them may have identical content; in the embodiments of the present invention, occurrences are counted once per distinct text content.
Step 1-2: determine, from the number of occurrences of each distinct text, the total number of occurrences of all texts in the text collection. Specifically, the total number of occurrences equals the sum of the numbers of occurrences of all distinct texts.
Step 1-3: calculate, from the number of occurrences of each distinct text and the total number of occurrences, the occurrence frequency of each distinct text in the text collection. For example, if a text A that differs from the other texts occurs 3 times in the text collection and the total number of occurrences of all texts in the collection is 30, the occurrence frequency of text A in the collection is 1/10.
Step 1-4: determine the index value of the text collection from the occurrence frequency of each distinct text in the text collection.
If there are several target block elements, the index value of the text collection of each target block element can be calculated by repeating steps 1-1 to 1-4.
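Steps 1-1 to 1-3 can be sketched with a counter. The function name is hypothetical, and exact fractions are used only so that the 3-out-of-30 example from step 1-3 comes out as exactly 1/10.

```python
from collections import Counter
from fractions import Fraction


def occurrence_frequencies(texts):
    """Map each distinct text to its occurrence frequency in the collection."""
    counts = Counter(texts)           # step 1-1: occurrences of each distinct text
    total = sum(counts.values())      # step 1-2: total occurrences of all texts
    # step 1-3: frequency = occurrences / total occurrences
    return {text: Fraction(count, total) for text, count in counts.items()}
```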
Specifically, in the embodiments of the present invention, determining the index value of the text collection from the occurrence frequency of each distinct text in the text collection comprises: calculating the index value according to the formula $E_{set} = -\sum_{i=1}^{m} p(text_i) \log p(text_i)$, wherein $E_{set}$ is the index value of the text collection, $m$ is the number of distinct texts in the text collection, and $p(text_i)$ is the occurrence frequency of the $i$-th distinct text in the text collection. In other words, the occurrence frequency of each distinct text is multiplied by the logarithm of that frequency, all of the resulting products are summed, and the sum is negated to give the index value of the text collection.
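The calculation described above (multiply each frequency by its logarithm, sum, and negate) can be sketched as follows. The function name is hypothetical, and the natural logarithm is an assumption, since the source does not state the base.

```python
import math


def index_value(frequencies):
    """Index value of a text collection from the occurrence frequencies
    of its distinct texts: the negated sum of p * log(p)."""
    # Assumption: natural logarithm; the base is not given in the source.
    return -sum(p * math.log(p) for p in frequencies if p > 0)
```

With this definition, a collection in which every page shows the same text has a single frequency of 1 and an index value of 0, while a collection of three different texts, each with frequency 1/3, has an index value of log 3.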
Preferably, after the text in the text collection whose index value is greater than the first preset threshold has been extracted to obtain the web page information, the method provided by the embodiments of the present invention also comprises recording the category attribute of the text. Specifically, the category attribute can be title, content, and so on. That is, the embodiment records whether the extracted text is a title, body content, or another kind of text.
In the embodiments of the present invention, recording the category attribute of the extracted text lets the user quickly filter out the required information during subsequent big data analysis, which improves user satisfaction. For example, a user who wants only the titles among the extracted web page information simply selects the category attribute "title" to retrieve the matching information quickly.
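Recording a category attribute alongside each extracted text, and filtering on it later, might look like the sketch below. The record type and function name are hypothetical, and the source does not say how the category of a text is decided, so it is simply supplied by the caller here.

```python
from dataclasses import dataclass


@dataclass
class ExtractedText:
    text: str
    category: str  # for example "title" or "content"


def filter_by_category(records, category):
    """Return the extracted texts recorded with the given category attribute."""
    return [record for record in records if record.category == category]
```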
The embodiments of the present invention also provide a specific way of determining the attribution category of a page to be extracted. Taking a first web page to be extracted and a second web page to be extracted, which are any two of the multiple web pages to be extracted, their attribution categories can be determined through steps 2-1 to 2-5:
Step 2-1: build a first tree structure from the HTML code of the first web page to be extracted, and build a second tree structure from the HTML code of the second web page to be extracted.
Step 2-2: extract the block elements that contain a preset attribute from the first tree structure to obtain first block elements, and extract the block elements that contain the preset attribute from the second tree structure to obtain second block elements. Specifically, the preset attribute is the class attribute or the id attribute. That is, this step extracts only the block elements that have a class or id attribute from the HTML code of the first page to be extracted, and likewise from the HTML code of the second page to be extracted.
Step 2-3: calculate the similarity mean of the first and second web pages to be extracted from the first block elements and the second block elements. In the embodiments of the present invention, the similarity mean can be calculated according to the formula $V = \frac{1}{2}(S_1 + S_2)$, wherein $V$ is the similarity mean, $S_1$ is the first similarity of the first and second web pages to be extracted, and $S_2$ is their second similarity. Specifically, the first similarity $S_1$ can be calculated according to a formula (not reproduced in this text), wherein $K_p$ is a block element that is identical in the first and second web pages to be extracted, $p$ takes the values 1 to $m$ in turn, $m$ is the number of identical block elements, $V_{1K_p}$ is the number of occurrences of the identical block element $K_p$ in the first web page to be extracted, $K_{0k}$ is a first block element in the first web page to be extracted, $N_1$ is the number of first block elements in the first web page to be extracted, and a further term (its symbol is not reproduced in this text) is the number of occurrences of the first block element $K_{0k}$ in the first web page to be extracted. The second similarity $S_2$ can be calculated according to a corresponding formula (also not reproduced in this text), wherein $V_{2K_p}$ is the number of occurrences of the identical block element $K_p$ in the second web page to be extracted, $K_{1k}$ is a second block element in the second web page to be extracted, $N_2$ is the number of second block elements in the second web page to be extracted, and a further term (its symbol is not reproduced in this text) is the number of occurrences of the second block element $K_{1k}$ in the second web page to be extracted.
Step 2-4: compare the similarity mean with the second preset threshold. Specifically, the second preset threshold can also be set as required.
Step 2-5: when the similarity mean is greater than the second preset threshold, determine that the first and second web pages to be extracted belong to the same attribution category; when the similarity mean is less than or equal to the second preset threshold, determine that they belong to different attribution categories.
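Steps 2-3 to 2-5 can be sketched as follows. The formulas for the two similarities are missing from this text, so the sketch rests on a loudly stated assumption: each similarity is taken to be the number of occurrences of the shared block elements in a page divided by the number of occurrences of all block elements in that page. Only the averaging formula V = 1/2 (S1 + S2) and the threshold comparison come from the source; the function names and the list-of-signatures input are hypothetical.

```python
from collections import Counter


def similarity_mean(blocks1, blocks2):
    """Similarity mean V = (S1 + S2) / 2 of two pages.

    blocks1, blocks2: lists of block-element signatures (with repeats)
    extracted from the first and second page.
    """
    counts1, counts2 = Counter(blocks1), Counter(blocks2)
    if not counts1 or not counts2:
        return 0.0
    shared = counts1.keys() & counts2.keys()
    # Assumed form of S1 and S2: share of a page's block-element
    # occurrences that belong to elements present in both pages.
    s1 = sum(counts1[k] for k in shared) / sum(counts1.values())
    s2 = sum(counts2[k] for k in shared) / sum(counts2.values())
    return (s1 + s2) / 2


def same_category(blocks1, blocks2, threshold):
    """Step 2-5: same attribution category only if V exceeds the threshold."""
    return similarity_mean(blocks1, blocks2) > threshold
```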
In the embodiments of the present invention, any two of the multiple pages to be extracted can be taken as the first and second web pages to be extracted, and steps 2-1 to 2-5 repeated until the attribution category of every page to be extracted has been determined. It should be noted that if web page A and web page B belong to the same attribution category, and web page A and web page D also belong to the same attribution category, then web pages A, B, and D all belong to the same attribution category. Once two or more web pages to be extracted belong to the same attribution category, any other web page whose category still needs to be determined only has to be compared with one page of that category: its similarity mean with that page is calculated and compared with the second preset threshold to determine whether it belongs to that category.
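The transitive grouping described above (A with B and A with D implies A, B, and D together) can be sketched with a small union-find structure. The function names are hypothetical, and the pairwise test `is_same_category` is assumed to be supplied by the caller, for example a similarity-mean comparison against the second preset threshold.

```python
def cluster_pages(page_ids, is_same_category):
    """Group pages into attribution categories.

    page_ids:         list of page identifiers
    is_same_category: function (a, b) -> bool for a pair of pages
    """
    parent = {page: page for page in page_ids}

    def find(page):
        # Follow parent links to the category representative, compressing the path.
        while parent[page] != page:
            parent[page] = parent[parent[page]]
            page = parent[page]
        return page

    for i, a in enumerate(page_ids):
        for b in page_ids[i + 1:]:
            root_a, root_b = find(a), find(b)
            # Pages already grouped together need no further comparison.
            if root_a != root_b and is_same_category(a, b):
                parent[root_b] = root_a

    groups = {}
    for page in page_ids:
        groups.setdefault(find(page), []).append(page)
    return list(groups.values())
```

Because each page ends up with exactly one representative, every page has exactly one attribution category, as the method requires.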
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps can be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device or the like) to perform the methods described in the embodiments of the present invention.
Embodiment 2
According to an embodiment of the present invention, an extraction device for Webpage information is also provided for implementing the above extracting method of Webpage information. The extraction device is mainly used to perform the extracting method provided above, and is described in detail below:
Fig. 2 is a schematic diagram of the extraction device for Webpage information according to an embodiment of the present invention. As shown in Fig. 2, the device mainly comprises an acquiring unit 10, a clustering unit 20, a first extraction unit 30, a second extraction unit 40, a first computing unit 50 and a third extraction unit 60, wherein:
The acquiring unit 10 is used to obtain the HyperText Markup Language (HTML) code of multiple Webpages to be extracted. Specifically, the HTML code of the multiple Webpages to be extracted can be obtained simultaneously, or the HTML code of each Webpage to be extracted can be obtained one by one.
The clustering unit 20 is used to cluster the multiple Webpages to be extracted according to the HTML code, obtaining multiple belonging kinds. That is, the Webpages to be extracted are classified according to the HTML code obtained for each of them, and similar Webpages are placed in one category. It should be noted that a Webpage to be extracted can have only one belonging kinds.
The first extraction unit 30 is used to extract the object block elements in each belonging kinds, wherein an object block element is a block element shared by the different Webpages to be extracted in the same belonging kinds. Specifically, there can be one object block element or several; in embodiments of the present invention, the number of object block elements is determined by the number of block elements shared by the different Webpages to be extracted in the same belonging kinds. A shared block element is a block element whose tag name and attribute are identical across the different Webpages to be extracted in the same belonging kinds, where the attribute is the class attribute or the id attribute. For example, Webpage 1, Webpage 2 and Webpage 3 belong to belonging kinds A, and each of the three Webpages contains the same 3 block elements, namely div[class="menu"], div[id="title"] and p[class="content"]; the number of object block elements in belonging kinds A is therefore 3.
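As a rough sketch of how shared block elements might be found, the following uses Python's standard `html.parser` to collect a (tag, attribute, value) key for every element carrying a class or id attribute, then intersects the keys across the pages of one belonging kinds. The names `BlockKeyCollector`, `block_keys` and `target_block_elements` are assumptions made for this illustration, and a multi-valued class attribute is treated as a single string.

```python
from html.parser import HTMLParser


class BlockKeyCollector(HTMLParser):
    """Collect (tag, attribute, value) keys for elements with class or id."""

    def __init__(self):
        super().__init__()
        self.keys = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("class", "id") and value:
                self.keys.add((tag, name, value))


def block_keys(html):
    """Return the set of block-element keys found in one page."""
    collector = BlockKeyCollector()
    collector.feed(html)
    return collector.keys


def target_block_elements(pages_html):
    """Return the keys that occur in every page of one belonging kinds."""
    key_sets = [block_keys(html) for html in pages_html]
    return set.intersection(*key_sets) if key_sets else set()


# Three hypothetical pages of belonging kinds A; the ad blocks differ per page.
EXAMPLE_PAGES = [
    '<div class="menu">Home</div><div id="title">title 1</div>'
    '<p class="content">body 1</p><div class="ad-1">x</div>',
    '<div class="menu">Home</div><div id="title">title 2</div>'
    '<p class="content">body 2</p><div class="ad-2">y</div>',
    '<div class="menu">Home</div><div id="title">title 3</div>'
    '<p class="content">body 3</p>',
]
```

For the example pages, the intersection leaves exactly the three shared block elements named in the text; the page-specific blocks drop out.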
The second extraction unit 40 is used to extract the text in an object block element, obtaining the text collection of the object block element. Specifically, the same object block element contains multiple texts, and the set of these texts is the text collection of that object block element. If there are multiple object block elements, the text in each object block element is extracted, obtaining the text collection of each object block element. Continuing the above example, for the object block element div[id="title"], the text collection obtained is {"title 1", "title 2", "title 3"}.
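One way to build such a text collection is sketched below with the standard `html.parser`: the parser tracks nesting depth while inside the chosen block element and gathers its text, and one text per page is collected. The class and function names are hypothetical, and the sketch assumes reasonably well-formed markup (every non-void tag inside the block is closed).

```python
from html.parser import HTMLParser

# Tags that never have a closing tag and so must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextInBlock(HTMLParser):
    """Gather the text found inside one block element, e.g. div[id="title"]."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.target = (tag, attr, value)
        self.depth = 0  # greater than 0 while inside the target element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag not in VOID_TAGS:
                self.depth += 1
        elif tag == self.target[0] and dict(attrs).get(self.target[1]) == self.target[2]:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def block_text(html, tag, attr, value):
    """Return the stripped text of the target block element in one page."""
    parser = TextInBlock(tag, attr, value)
    parser.feed(html)
    return "".join(parser.parts).strip()


def text_collection(pages_html, tag, attr, value):
    """Collect the target block element's text from every page."""
    return [block_text(html, tag, attr, value) for html in pages_html]
```

A list is used rather than a set so that repeated texts keep their occurrence counts, which the later frequency calculation relies on.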
The first computing unit 50 is used to calculate the desired value of the text collection, wherein the desired value represents the degree of difference among the texts in the text collection. That is, the degree of difference of the text in the object block element is calculated; the larger the degree of difference, the more the text content in the object block element varies.
The third extraction unit 60 is used to extract the text in the text collections whose desired value is greater than the first predetermined threshold value, obtaining the Webpage information. That is, only the text in a text collection whose desired value is greater than the first predetermined threshold value is information that needs to be extracted from the Webpages to be extracted. Specifically, the first predetermined threshold value can be set as required.
In embodiments of the present invention, obtaining the HTML code of multiple Webpages to be extracted makes it possible to divide the Webpages into belonging kinds, and then to obtain the block elements shared by the different Webpages to be extracted under the same belonging kinds, so that the text content of the same block element can be extracted. Whether that text content is information that needs to be extracted from the Webpages is then determined by comparing the degree of difference of the extracted text content with a predetermined threshold value. This solves the problem of low accuracy of web page information extraction in the prior art and achieves the effect of improving that accuracy.
It should be noted that, if there are multiple object block elements, the desired value of the text collection of each object block element needs to be calculated separately, each calculated desired value is compared with the first predetermined threshold value, and the text in the text collections whose desired value is greater than the first predetermined threshold value is extracted.
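The per-element comparison against the first predetermined threshold value can be sketched as a simple filter. The function names and the scoring callback are assumptions; in particular, `distinct_ratio` is only a stand-in score used to keep the example short, not the desired value defined by the patent.

```python
def extract_page_information(text_collections, desired_value, first_threshold):
    """Keep only the text collections whose desired value exceeds the threshold.

    `text_collections` maps an object block element to its list of texts;
    `desired_value` is a callback that scores one text collection.
    """
    return {
        block: texts
        for block, texts in text_collections.items()
        if desired_value(texts) > first_threshold
    }


def distinct_ratio(texts):
    """Stand-in score (NOT the patent's formula): share of distinct texts."""
    return len(set(texts)) / len(texts)


EXAMPLE_COLLECTIONS = {
    'div[class="menu"]': ["Home", "Home", "Home"],
    'div[id="title"]': ["title 1", "title 2", "title 3"],
}
```

With the example data, the menu block carries the same text on every page and is dropped, while the title block varies across pages and is kept.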
Specifically, the first computing unit 50 comprises a logging module, a first determination module, a computing module and a second determination module, wherein:
The logging module is used to record the number of occurrences of each distinct text in the text collection. Because the text collection contains multiple texts, some of them may have identical content; in embodiments of the present invention, the number of occurrences in the text collection is counted once for each distinct text.
The first determination module is used to determine, from the number of occurrences of each distinct text, the total number of occurrences of all texts in the text collection. Specifically, the total number of occurrences equals the sum of the numbers of occurrences of all distinct texts.
The computing module is used to calculate the frequency of occurrence of each distinct text in the text collection from its number of occurrences and the total number of occurrences. For example, if a text A that differs from the other texts occurs 3 times in the text collection and the total number of occurrences of all texts in the text collection is 30, then the frequency of occurrence of text A in the text collection is 1/10.
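The three modules above amount to counting, summing and dividing. A minimal sketch, with the function name `text_frequencies` chosen here for illustration and exact fractions used so the 3-in-30 example comes out as 1/10:

```python
from collections import Counter
from fractions import Fraction


def text_frequencies(text_collection):
    """Return the frequency of occurrence of each distinct text."""
    counts = Counter(text_collection)  # occurrences of each distinct text
    total = sum(counts.values())       # total occurrences of all texts
    return {text: Fraction(n, total) for text, n in counts.items()}


# Text "A" occurs 3 times among 30 texts, as in the example in the text.
EXAMPLE_COLLECTION = ["A"] * 3 + ["B"] * 27
```

The frequencies of all distinct texts sum to 1, which is what allows them to be used as a probability distribution when the desired value is calculated.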
The second determination module is used to determine the desired value of the text collection from the frequency of occurrence of each distinct text in the text collection.
If there are multiple object block elements, the desired value of the text collection of each object block element can be calculated by repeatedly calling the logging module, the first determination module, the computing module and the second determination module.
Specifically, the second determination module comprises a calculating sub module, which is used to calculate the desired value of the text collection according to the formula $E_{Set} = -\sum_{i=1}^{m} p(text_i)\,\log_2 p(text_i)$, wherein $E_{Set}$ is the desired value of the text collection, $m$ is the number of distinct texts in the text collection, and $p(text_i)$ is the frequency of occurrence of the $i$-th distinct text in the text collection. That is, in embodiments of the present invention, the frequency of occurrence of each distinct text is multiplied by the base-2 logarithm of that frequency, all the resulting products are summed, and the negative of the sum is the desired value of the text collection.
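The formula has the same form as Shannon entropy over the distribution of distinct texts. A minimal sketch follows; the function name `desired_value` is an assumption, and returning 0 for an empty collection is a choice made here, not something the text specifies.

```python
import math
from collections import Counter


def desired_value(text_collection):
    """Compute E_Set = -sum(p(text_i) * log2(p(text_i))) over distinct texts."""
    counts = Counter(text_collection)
    total = sum(counts.values())
    # An empty collection yields an empty sum, so the result is 0.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A collection in which every page carries the same text (for example a shared menu) scores zero, while a collection of m equally frequent distinct texts scores log2(m), the maximum for m distinct texts. This is why a higher desired value indicates page-specific content.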
Preferably, the extraction device for Webpage information provided by the embodiment of the present invention further comprises a record unit. The record unit is used to record the category attribute of the text after the text in the text collections whose desired value is greater than the first predetermined threshold value has been extracted and the Webpage information obtained. Specifically, the category attribute can be title, content and so on; that is, the embodiment of the present invention records whether the extracted text content is a title, content, etc.
In embodiments of the present invention, recording the category attribute of the extracted text allows users to quickly filter out the information they need during subsequent big data analysis, which improves user satisfaction. For example, a user who wants only the titles among the extracted web page information simply selects the category attribute "title" to quickly obtain the matching information.
Preferably, the embodiment of the present invention also provides a specific way of determining the belonging kinds of the Webpages to be extracted, which can be performed by a set-up unit, a fourth extraction unit, a second computing unit, a comparing unit and a processing unit included in the extraction device for Webpage information, wherein:
The set-up unit is used to build a first tree structure from the HTML code of the first Webpage to be extracted and a second tree structure from the HTML code of the second Webpage to be extracted, wherein the first Webpage to be extracted and the second Webpage to be extracted are any two of the multiple Webpages to be extracted.
The fourth extraction unit is used to extract the block elements comprising a preset attribute in the first tree structure, obtaining the first piece of element, and to extract the block elements comprising the preset attribute in the second tree structure, obtaining the second piece of element. Specifically, the preset attribute is the class attribute or the id attribute; that is, this unit extracts only the block elements that carry a class attribute or an id attribute from the HTML code of the first Webpage to be extracted, and likewise from the HTML code of the second Webpage to be extracted.
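A possible sketch of this extraction step, using the standard `html.parser`, is shown below. It is a simplification: instead of materialising an explicit tree structure, the parser walks the markup in document order and counts how often each block element with a class or id attribute occurs. The names `BlockElementCounter` and `block_elements` are assumptions for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class BlockElementCounter(HTMLParser):
    """Count elements that carry one of the preset attributes (class or id)."""

    PRESET_ATTRIBUTES = ("class", "id")

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.PRESET_ATTRIBUTES and value:
                # Key format mirrors the notation used in the text,
                # e.g. div[class="menu"].
                self.counts[f'{tag}[{name}="{value}"]'] += 1


def block_elements(html):
    """Return occurrence counts of block elements with class or id."""
    parser = BlockElementCounter()
    parser.feed(html)
    return parser.counts
```

Elements without a class or id attribute are ignored, so purely presentational markup does not influence the later similarity calculation.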
The second computing unit is used to calculate the similarity mean value of the first Webpage to be extracted and the second Webpage to be extracted from the first piece of element and the second piece of element. In embodiments of the present invention, the similarity mean value can be calculated according to the formula V = (S1 + S2) / 2, wherein V is the similarity mean value, S1 is the first similarity of the first Webpage to be extracted and the second Webpage to be extracted, and S2 is the second similarity of the first Webpage to be extracted and the second Webpage to be extracted. Specifically, the first similarity S1 is calculated according to a formula in which Kp is a block element common to the first Webpage to be extracted and the second Webpage to be extracted, p runs from 1 to m, m is the number of common block elements, V_1Kp is the number of occurrences of the common block element Kp in the first Webpage to be extracted, K_0k is a first piece of element in the first Webpage to be extracted, N1 is the number of first piece of elements in the first Webpage to be extracted, and the remaining term is the number of occurrences of the first piece of element K_0k in the first Webpage to be extracted. The second similarity S2 is calculated according to a corresponding formula in which V_2Kp is the number of occurrences of the common block element Kp in the second Webpage to be extracted, K_1k is a second piece of element in the second Webpage to be extracted, N2 is the number of second piece of elements in the second Webpage to be extracted, and the remaining term is the number of occurrences of the second piece of element K_1k in the second Webpage to be extracted.
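The formulas for S1 and S2 are not reproduced in this text, so the sketch below rests on an assumption drawn from the symbol definitions alone: each similarity is taken as the occurrences of the common block elements in a page divided by the occurrences of all block elements in that page. Only the averaging step V = (S1 + S2) / 2 comes directly from the description. The function name and the example counts are hypothetical.

```python
def similarity_mean(first_blocks, second_blocks):
    """Return V = (S1 + S2) / 2 for two pages.

    `first_blocks` and `second_blocks` map each block element to its number
    of occurrences in the page. ASSUMPTION (not confirmed by the source):
    S1 = occurrences of common block elements in page 1 / all occurrences in
    page 1, and S2 is the same ratio for page 2.
    """
    common = first_blocks.keys() & second_blocks.keys()
    total_first = sum(first_blocks.values())
    total_second = sum(second_blocks.values())
    if total_first == 0 or total_second == 0:
        return 0.0  # a page without block elements shares nothing
    s1 = sum(first_blocks[k] for k in common) / total_first
    s2 = sum(second_blocks[k] for k in common) / total_second
    return 0.5 * (s1 + s2)


# Hypothetical occurrence counts for two pages.
FIRST_PAGE = {'div[class="menu"]': 2, 'div[id="title"]': 1, 'p[class="ad"]': 1}
SECOND_PAGE = {'div[class="menu"]': 1, 'div[id="title"]': 1, 'p[class="promo"]': 2}
```

Under this assumed reading, identical block-element profiles give a similarity mean value of 1 and disjoint profiles give 0, so the second predetermined threshold value would lie between those bounds.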
The comparing unit is used to compare the similarity mean value with the second predetermined threshold value. Specifically, the second predetermined threshold value can also be set as required.
The processing unit is used to determine that the first Webpage to be extracted and the second Webpage to be extracted belong to the same belonging kinds when the similarity mean value is greater than the second predetermined threshold value, and to determine that the two Webpages belong to different belonging kinds when the similarity mean value is less than or equal to the second predetermined threshold value.
In embodiments of the present invention, any two Webpages among the multiple Webpages to be extracted can be taken as the first Webpage to be extracted and the second Webpage to be extracted, and the set-up unit, the fourth extraction unit, the second computing unit, the comparing unit and the processing unit are called repeatedly until the belonging kinds of every Webpage to be extracted is determined. It should be noted that if Webpage A and Webpage B belong to the same belonging kinds, and Webpage A and Webpage D also belong to the same belonging kinds, then Webpage A, Webpage B and Webpage D all belong to the same belonging kinds. Once two or more Webpages to be extracted belong to the same belonging kinds, any other Webpage whose belonging kinds still needs to be determined only has to be compared with one Webpage in that belonging kinds: the similarity mean value of the two is calculated and compared with the second predetermined threshold value, which determines whether the Webpage belongs to that belonging kinds.
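The shortcut of comparing a new page with just one page of an existing belonging kinds can be sketched as an incremental assignment. The names `assign_categories` and `same_prefix_similarity` are assumptions, and choosing the first member of each category as the page to compare against is an arbitrary choice for this sketch.

```python
def assign_categories(pages, similarity, second_threshold):
    """Assign each page to a category by comparing it with one member.

    `similarity` is a hypothetical callback returning the similarity mean
    value of two pages. Each category is a list whose first page serves as
    the member that new pages are compared against.
    """
    categories = []
    for page in pages:
        for members in categories:
            if similarity(page, members[0]) > second_threshold:
                members.append(page)
                break
        else:
            # No existing category matched, so the page starts a new one.
            categories.append([page])
    return categories


def same_prefix_similarity(a, b):
    """Toy similarity for illustration: 1.0 when the names share a first letter."""
    return 1.0 if a[0] == b[0] else 0.0
```

Each page is compared with at most one page per existing category, so the number of similarity calculations grows with the number of categories rather than with the number of page pairs.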
As can be seen from the above description, the present invention solves the problem of low accuracy of web page information extraction in the prior art and achieves the effect of improving that accuracy.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for a part not described in detail in one embodiment, reference can be made to the related description of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed client can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components can be combined or integrated into another system, or some features can be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed can be through some interfaces, and the indirect coupling or communication connection between units or modules can be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they can be located in one place or distributed over multiple network elements. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention can be integrated into one processing unit, each unit can exist physically on its own, or two or more units can be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a portable hard drive, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present invention. It should be pointed out that those skilled in the art can make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. An extracting method of Webpage information, characterized by comprising:
Obtaining the HyperText Markup Language (HTML) code of multiple Webpages to be extracted;
Clustering the multiple Webpages to be extracted according to the HTML code, obtaining multiple belonging kinds;
Extracting the object block element in each belonging kinds, wherein the object block element is a block element shared by the different Webpages to be extracted in the same belonging kinds;
Extracting the text in the object block element, obtaining the text collection of the object block element;
Calculating the desired value of the text collection, wherein the desired value represents the degree of difference of the text in the text collection; and
Extracting the text in the text collection whose desired value is greater than a first predetermined threshold value, obtaining the Webpage information.
2. The extracting method according to claim 1, characterized in that calculating the desired value of the text collection comprises:
Recording the number of occurrences of each distinct text in the text collection;
Determining, according to the number of occurrences of each distinct text, the total number of occurrences of all texts in the text collection;
Calculating, according to the number of occurrences of each distinct text and the total number of occurrences, the frequency of occurrence of each distinct text in the text collection; and
Determining the desired value of the text collection according to the frequency of occurrence of each distinct text in the text collection.
3. The extracting method according to claim 2, characterized in that determining the desired value of the text collection according to the frequency of occurrence of each distinct text in the text collection comprises:
Calculating the desired value of the text collection according to the formula $E_{Set} = -\sum_{i=1}^{m} p(text_i)\,\log_2 p(text_i)$, wherein $E_{Set}$ is the desired value of the text collection, $m$ is the number of distinct texts in the text collection, and $p(text_i)$ is the frequency of occurrence of the $i$-th distinct text in the text collection.
4. The extracting method according to claim 1, characterized in that, after extracting the text in the text collection whose desired value is greater than the first predetermined threshold value and obtaining the Webpage information, the extracting method further comprises:
Recording the category attribute of the text.
5. The extracting method according to claim 1, characterized in that the belonging kinds of a first Webpage to be extracted and a second Webpage to be extracted are determined in the following manner, wherein the first Webpage to be extracted and the second Webpage to be extracted are any two of the multiple Webpages to be extracted:
Building a first tree structure according to the HTML code of the first Webpage to be extracted, and building a second tree structure according to the HTML code of the second Webpage to be extracted;
Extracting the block elements comprising a preset attribute in the first tree structure, obtaining a first piece of element, and extracting the block elements comprising the preset attribute in the second tree structure, obtaining a second piece of element;
Calculating, according to the first piece of element and the second piece of element, the similarity mean value of the first Webpage to be extracted and the second Webpage to be extracted;
Comparing the similarity mean value with a second predetermined threshold value; and
When the similarity mean value is greater than the second predetermined threshold value, determining that the first Webpage to be extracted and the second Webpage to be extracted belong to the same belonging kinds, or when the similarity mean value is less than or equal to the second predetermined threshold value, determining that the first Webpage to be extracted and the second Webpage to be extracted belong to different belonging kinds.
6. An extraction device for Webpage information, characterized by comprising:
An acquiring unit, for obtaining the HyperText Markup Language (HTML) code of multiple Webpages to be extracted;
A clustering unit, for clustering the multiple Webpages to be extracted according to the HTML code, obtaining multiple belonging kinds;
A first extraction unit, for extracting the object block element in each belonging kinds, wherein the object block element is a block element shared by the different Webpages to be extracted in the same belonging kinds;
A second extraction unit, for extracting the text in the object block element, obtaining the text collection of the object block element;
A first computing unit, for calculating the desired value of the text collection, wherein the desired value represents the degree of difference of the text in the text collection; and
A third extraction unit, for extracting the text in the text collection whose desired value is greater than a first predetermined threshold value, obtaining the Webpage information.
7. The extraction device according to claim 6, characterized in that the first computing unit comprises:
A logging module, for recording the number of occurrences of each distinct text in the text collection;
A first determination module, for determining, according to the number of occurrences of each distinct text, the total number of occurrences of all texts in the text collection;
A computing module, for calculating, according to the number of occurrences of each distinct text and the total number of occurrences, the frequency of occurrence of each distinct text in the text collection; and
A second determination module, for determining the desired value of the text collection according to the frequency of occurrence of each distinct text in the text collection.
8. The extraction device according to claim 7, characterized in that the second determination module comprises:
A calculating sub module, for calculating the desired value of the text collection according to the formula $E_{Set} = -\sum_{i=1}^{m} p(text_i)\,\log_2 p(text_i)$, wherein $E_{Set}$ is the desired value of the text collection, $m$ is the number of distinct texts in the text collection, and $p(text_i)$ is the frequency of occurrence of the $i$-th distinct text in the text collection.
9. The extraction device according to claim 6, characterized in that the extraction device further comprises:
A record unit, for recording the category attribute of the text after the text in the text collection whose desired value is greater than the first predetermined threshold value has been extracted and the Webpage information obtained.
10. The extraction device according to claim 6, characterized in that the extraction device further comprises:
A set-up unit, for building a first tree structure according to the HTML code of a first Webpage to be extracted, and building a second tree structure according to the HTML code of a second Webpage to be extracted, wherein the first Webpage to be extracted and the second Webpage to be extracted are any two of the multiple Webpages to be extracted;
A fourth extraction unit, for extracting the block elements comprising a preset attribute in the first tree structure, obtaining a first piece of element, and extracting the block elements comprising the preset attribute in the second tree structure, obtaining a second piece of element;
A second computing unit, for calculating, according to the first piece of element and the second piece of element, the similarity mean value of the first Webpage to be extracted and the second Webpage to be extracted;
A comparing unit, for comparing the similarity mean value with a second predetermined threshold value; and
A processing unit, for determining that the first Webpage to be extracted and the second Webpage to be extracted belong to the same belonging kinds when the similarity mean value is greater than the second predetermined threshold value, or determining that the first Webpage to be extracted and the second Webpage to be extracted belong to different belonging kinds when the similarity mean value is less than or equal to the second predetermined threshold value.
CN201410830367.6A 2014-12-25 2014-12-25 The extracting method and device of Webpage information Active CN104484451B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410830367.6A CN104484451B (en) 2014-12-25 2014-12-25 The extracting method and device of Webpage information

Publications (2)

Publication Number Publication Date
CN104484451A true CN104484451A (en) 2015-04-01
CN104484451B CN104484451B (en) 2017-12-19

Family

ID=52758992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410830367.6A Active CN104484451B (en) 2014-12-25 2014-12-25 The extracting method and device of Webpage information

Country Status (1)

Country Link
CN (1) CN104484451B (en)

Also Published As

Publication number Publication date
CN104484451B (en) 2017-12-19

Similar Documents

Publication Publication Date Title
CN109145216B (en) Network public opinion monitoring method, device and storage medium
WO2019218514A1 (en) Method for extracting webpage target information, device, and storage medium
CN106960040B (en) A kind of classification of URL determines method and device
CN106156372B (en) A kind of classification method and device of internet site
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN102637172B (en) Webpage blocking marking method and system
WO2015061046A2 (en) Method and apparatus for performing topic-relevance highlighting of electronic text
CN104881458A (en) Labeling method and device for web page topics
CN102567494B (en) Website classification method and device
CN104504086A (en) Clustering method and device for webpage
CN103455411B (en) Method for establishing a log classification model, and user behavior log classification method and device
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN101101605A (en) Method, device and system for searching web page and device for establishing index database
CN104915422A (en) Webpage collecting method and device based on browser
CN110569419A (en) Question-answering system optimization method and device, computer equipment and storage medium
CN103399855A (en) Behavior intention determining method and device based on multiple data sources
CN106980667A (en) Method and apparatus for labeling an article
CN104484451A (en) Web page information extraction method and web page information extraction device
CN106168968A (en) Website classification method and device
CN112579729A (en) Training method and device for document quality evaluation model, electronic equipment and medium
CN112417267A (en) User behavior analysis method and device, computer equipment and storage medium
CN104462061A (en) Word extraction method and word extraction device
CN108076032B (en) Abnormal behavior user identification method and device
CN108875050B (en) Text-oriented digital evidence-obtaining analysis method and device and computer readable medium
CN107807920A (en) Method, device and server for constructing a sentiment dictionary based on big data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Web page information extraction method and web page information extraction device

Effective date of registration: 20190531

Granted publication date: 20171219

Pledgee: Shenzhen Black Horse World Investment Consulting Co.,Ltd.

Pledgor: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Registration number: 2019990000503

CP02 Change in the address of a patent holder

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Patentee after: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Address before: 100086 8th Floor A, Cuigong Hotel, No. 76 Zhichun Road, Shuangyushu Area, Haidian District, Beijing

Patentee before: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

PP01 Preservation of patent right

Effective date of registration: 20240604

Granted publication date: 20171219