CN106095854A - Method and device for determining the position information of an information block

Method and device for determining the position information of an information block

Info

Publication number
CN106095854A
Authority
CN
China
Prior art keywords
node
information
block
weighted value
characteristic information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610389942.2A
Other languages
Chinese (zh)
Other versions
CN106095854B (en)
Inventor
马莘权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201610389942.2A
Publication of CN106095854A
Application granted
Publication of CN106095854B
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for determining the position information of an information block, comprising: converting the web page content to be processed into a model tree, the model tree containing multiple nodes; for each type of information block, searching the multiple nodes for the characteristic information of that type of information block, the characteristic information being information that describes the form in which that type of information block is presented; determining the weight value of each node that contains the characteristic information, where the nodes containing the characteristic information include first nodes and second nodes, a first node being a node that directly contains the characteristic information and a second node being a node that does not itself directly contain the characteristic information; and determining that the position information of the information block clustered by the specific nodes is the position information of that type of information block, the specific nodes including the node with the largest weight value and all nodes subordinate to that node. By means of node clustering, the scheme provided by this application can automatically and accurately locate the position information of an information block, improving the efficiency of information block positioning.

Description

Method and device for determining the position information of an information block
Technical field
The present invention relates to the technical field of information processing, and in particular to a method and device for determining the position information of an information block.
Background
Web pages on today's Internet are usually accompanied by a large number of advertisements and irrelevant links. In particular, on reading-oriented pages such as novel pages, these advertisements and irrelevant links not only waste traffic but also greatly disturb the user.
Therefore, if the interference information in a web page can be filtered out before the page is displayed on the terminal device, both the wasted traffic and the disturbance to the user can be avoided.
Filtering out interference information requires accurately locating the positions of the various types of information blocks in the web page. Taking a novel as an example, the information blocks may be the title and the body text of the novel.
Page layouts generally differ between websites, and even the layout of a single website changes frequently. Existing methods of locating information blocks in web pages all rely on manual work: the page layout must be monitored over the long term, and whenever the layout changes, the manual configuration must be modified accordingly. Although this approach can accurately locate information blocks in web pages, it requires continuous long-term monitoring and maintenance of each website's pages, so it is costly and its positioning efficiency is low.
Summary of the invention
To solve the prior-art problem of low efficiency in locating information blocks in web pages, embodiments of the present invention provide a method for determining the position information of an information block, which can automatically and accurately locate the position information of an information block by means of node clustering, improving the efficiency of information block positioning. Embodiments of the present invention also provide a corresponding device.
A first aspect of the present invention provides a method for determining the position information of an information block, comprising:
converting the web page content to be processed into a model tree, the model tree containing multiple nodes;
for each type of information block, searching the multiple nodes for the characteristic information of that type of information block, the characteristic information being information that describes the form in which that type of information block is presented;
determining the weight value of each node that contains the characteristic information, where the nodes containing the characteristic information include first nodes and second nodes, a first node being a node that directly contains the characteristic information and a second node being a node that does not directly contain the characteristic information;
determining that the position information of the information block clustered by the specific nodes is the position information of that type of information block, the specific nodes including the node with the largest weight value and all nodes subordinate to the node with the largest weight value.
A second aspect of the present invention provides a device for determining the position information of an information block, comprising:
a converting unit, configured to convert the web page content to be processed into a model tree, the model tree containing multiple nodes;
a search unit, configured to search, for each type of information block, the multiple nodes of the model tree produced by the converting unit for the characteristic information of that type of information block, the characteristic information being information that describes the form in which that type of information block is presented;
a first determining unit, configured to determine the weight value of each node that contains the characteristic information found by the search unit, where the nodes containing the characteristic information include first nodes and second nodes, a first node being a node that directly contains the characteristic information and a second node being a node that does not directly contain the characteristic information;
a second determining unit, configured to determine that the position information of the information block clustered by the specific nodes is the position information of that type of information block, the specific nodes including the node with the largest weight value among the weight values determined by the first determining unit and all nodes subordinate to the node with the largest weight value.
Compared with the low efficiency of manually locating information blocks in web pages in the prior art, the method for determining the position information of an information block provided by embodiments of the present invention can automatically and accurately locate the position information of an information block by means of node clustering, improving the efficiency of information block positioning.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an embodiment of the method for determining the position information of an information block in an embodiment of the present invention;
Fig. 2 is a schematic diagram of an example of a model tree in an embodiment of the present invention;
Fig. 3 is a schematic diagram of another example of a model tree in an embodiment of the present invention;
Fig. 4 is a schematic diagram of nodes containing characteristic information extracted from the model tree shown in Fig. 3 in an embodiment of the present invention;
Fig. 5 is another schematic diagram of nodes containing characteristic information extracted from the model tree shown in Fig. 3 in an embodiment of the present invention;
Fig. 6 is a schematic diagram of an embodiment of the device for determining the position information of an information block in an embodiment of the present invention;
Fig. 7 is a schematic diagram of another embodiment of the device for determining the position information of an information block in an embodiment of the present invention;
Fig. 8 is a schematic diagram of an embodiment of a server in an embodiment of the present invention.
Detailed description
Embodiments of the present invention provide a method for determining the position information of an information block, which can automatically and accurately locate the position information of an information block by means of node clustering, improving the efficiency of information block positioning. Embodiments of the present invention also provide a corresponding device. Each is described in detail below.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Referring to Fig. 1, an embodiment of the method for determining the position information of an information block provided by an embodiment of the present invention includes:
101. Convert the web page content to be processed into a model tree, the model tree containing multiple nodes.
Taking text content as an example, web page content includes content such as the title, the body text, and character introductions.
The model tree is formed by dividing the parts of the web page content according to the branches to which they belong and placing each part in a different node according to the hierarchical structure. Fig. 2 is a schematic diagram of a simple model tree. As shown in Fig. 2, the model tree may, according to its hierarchy, include node 0; node 0 has two subordinate nodes, node 1 and node 2; node 1 has one subordinate node, node 11; and node 2 has two subordinate nodes, node 21 and node 22.
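The hierarchy of Fig. 2 can be sketched as a small tree data structure. The following Python fragment is only an illustration under stated assumptions: the class name Node and the helpers build_fig2_tree and names_preorder are hypothetical and do not come from the embodiment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of the model tree (hypothetical representation)."""
    name: str
    children: List["Node"] = field(default_factory=list)


def build_fig2_tree() -> Node:
    """Build the hierarchy described for Fig. 2: 0 -> (1 -> 11), (2 -> 21, 22)."""
    return Node("0", [
        Node("1", [Node("11")]),
        Node("2", [Node("21"), Node("22")]),
    ])


def names_preorder(node: Node) -> List[str]:
    """Visit a node first, then each of its subordinate nodes in order."""
    result = [node.name]
    for child in node.children:
        result.extend(names_preorder(child))
    return result


print(names_preorder(build_fig2_tree()))
```

Starting the traversal at node 0 visits every node exactly once, which is the property the later search step relies on.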
102. For each type of information block, search the multiple nodes for the characteristic information of that type of information block, the characteristic information being information that describes the form in which that type of information block is presented.
The type of an information block refers to the kind of information contained in the web page. Taking text content as an example, the types of information block may include table of contents, article title, article body text, author profile, and index introduction.
The characteristic information of an information block refers to information describing the form in which that type of information block is presented, for example whether the information block takes the form of plain text, links, or pictures.
If the type of information block is article body text, the characteristic information may be plain text; if the type is table of contents, the characteristic information may be links.
The search may start from the top-level node and scan node by node; for example, in Fig. 2 the scan may start from node 0 and then proceed through the other nodes one by one.
Scanning is performed node by node for one type at a time. For the table of contents, each node is scanned for link characteristic information; for the article body text, each node is scanned for plain-text characteristic information.
A model tree may include multiple nodes, but not every node contains the characteristic information of a given type. Therefore, for each type, the nodes that contain the characteristic information corresponding to that type are determined.
For example, it is determined that node 2, node 21 and node 22 contain the plain-text characteristic information of the article body text.
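The node-by-node scan for one type at a time can be sketched as follows. This is a minimal illustration, not the search procedure of the embodiment: the feature labels "plain_text" and "link", the Node class, and the assignment of labels to the Fig. 2 nodes are all assumptions chosen so that the result matches the example of node 2, node 21 and node 22.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    feature: Optional[str] = None  # hypothetical label of what the node directly contains
    children: List["Node"] = field(default_factory=list)


def nodes_with_feature(root: Node, feature: str) -> List[str]:
    """Scan from the top node downward and collect nodes directly containing `feature`."""
    found = [root.name] if root.feature == feature else []
    for child in root.children:
        found.extend(nodes_with_feature(child, feature))
    return found


# Assumed labelling of the Fig. 2 tree, for illustration only.
tree = Node("0", None, [
    Node("1", "link", [Node("11", "link")]),
    Node("2", "plain_text", [Node("21", "plain_text"), Node("22", "plain_text")]),
])

print(nodes_with_feature(tree, "plain_text"))
print(nodes_with_feature(tree, "link"))
```

Running one scan per type keeps the result sets for the table of contents and the article body text separate.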
103. Determine the weight value of each node that contains the characteristic information. The nodes containing the characteristic information include first nodes and second nodes; a first node is a node that directly contains the characteristic information, and a second node is a node that does not directly contain the characteristic information.
Here, a node containing characteristic information may contain it directly or indirectly. Directly containing means that the node itself contains the characteristic information of the type. Indirectly containing means that the node itself does not contain the characteristic information of the type, but a child node or grandchild node of the node does; that is, whenever a node at any level below this node contains the characteristic information of the type, this node is regarded as indirectly containing the characteristic information of the type.
An algorithm for calculating the weight value of each node can be configured in advance. When a node directly contains characteristic information, the relevant parameters of the characteristic information can be input into the weight value algorithm to calculate the weight value of that characteristic information for the node. For example, when the characteristic information is a link, the link length can be input into the weight value algorithm to calculate the weight value of the node's link characteristic; when the characteristic information is plain text, the number of plain-text characters can be input into the weight value algorithm to calculate the weight value of the node's plain text. The weight value algorithms for different kinds of characteristic information may, of course, differ.
When a node contains the characteristic information indirectly, its weight value can be calculated from the weight values of its subordinate child nodes. For example, when node 2 does not contain plain-text characteristic information but node 21 and node 22 do, the weight values of node 21 and node 22 can be calculated with the algorithm described above, and the weight value of node 2 can then be calculated from the weight values of node 21 and node 22.
104. Determine that the position information of the information block clustered by the specific nodes is the position information of that type of information block, the specific nodes including the node with the largest weight value and all nodes subordinate to the node with the largest weight value.
For a given type, once the weight values of all nodes have been determined, the node with the largest weight value can be found among them. For example, if the weight value of node 2 is the largest, and node 21 and node 22 are both subordinate nodes of node 2, then node 2, node 21 and node 22 are all specific nodes as described here.
If node 21 and node 22 have further subordinate nodes, those nodes also belong to the specific nodes.
The position information of the information block clustered by these specific nodes is the position information of that type of information block. That is, the position information of the information block clustered by node 2, node 21 and node 22 is the position information of that type of information block; if the information block clustered by node 2, node 21 and node 22 is the article body text, then the position information of the article body text in this web page is exactly the position information of the information block clustered by node 2, node 21 and node 22.
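Selecting the specific nodes, that is, the node with the largest weight value together with everything below it, can be sketched as follows. The dictionaries CHILDREN and WEIGHTS and the function specific_nodes are hypothetical, and the weight values are illustrative numbers chosen so that node 2 is the maximum; they are not values given for Fig. 2 at this point in the description.

```python
# Hypothetical adjacency list for the Fig. 2 tree.
CHILDREN = {"0": ["1", "2"], "1": ["11"], "2": ["21", "22"],
            "11": [], "21": [], "22": []}

# Illustrative weight values for one type of information block; nodes that
# contain no characteristic information of this type have no entry.
WEIGHTS = {"0": 2.112, "2": 2.64, "21": 1.5, "22": 1.8}


def specific_nodes(children, weights):
    """Return the node with the largest weight value followed by all nodes below it."""
    top = max(weights, key=weights.get)
    result, stack = [], [top]
    while stack:
        current = stack.pop()
        result.append(current)
        # Reverse so that children are visited in their original order.
        stack.extend(reversed(children.get(current, [])))
    return result


print(specific_nodes(CHILDREN, WEIGHTS))
```

Because the whole subtree is returned, any further nodes below node 21 or node 22 would be included automatically.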
Compared with the low efficiency of manually locating information blocks in web pages in the prior art, the method for determining the position information of an information block provided by embodiments of the present invention can automatically and accurately locate the position information of an information block by means of node clustering, improving the efficiency of information block positioning.
Optionally, determining the weight value of each node that contains the characteristic information may include:
for each first node, determining the weight value of that first node, which directly contains the characteristic information, according to a preset algorithm;
for each second node, summing the weight values of its direct child nodes and then multiplying the sum by a contraction factor to determine the weight value of that second node.
Expressed as a formula, with the sum taken over the direct child nodes of the second node:
weight(second node) = θ * ∑ weight(first node)
In an embodiment of the present invention, take node 2, node 21 and node 22 in Fig. 2 as an example. When node 21 and node 22 are first nodes and node 2 is a second node, if the weight value of node 21 is 1.5 and the weight value of node 22 is 1.8, the weight value of node 2 is (1.5 + 1.8) * θ, where θ is the contraction factor. The value of θ may be a number between 0.5 and 1; for example, with θ = 0.8 the weight value of node 2 is (1.5 + 1.8) * 0.8 = 2.64.
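The relation between a second node and its direct child nodes can be checked with a few lines of Python. The function name second_node_weight is an assumption introduced for illustration; only the sum-then-multiply rule and the numbers 1.5, 1.8 and 0.8 come from the description.

```python
def second_node_weight(child_weights, theta):
    """Contraction factor times the sum of the direct child nodes' weight values."""
    return theta * sum(child_weights)


# Node 21 has weight value 1.5, node 22 has weight value 1.8, and theta is 0.8.
node2 = second_node_weight([1.5, 1.8], theta=0.8)
print(round(node2, 2))  # 2.64
```

With θ below 1, a parent that has only one child carrying the characteristic information always scores lower than that child, while a parent that gathers several such children can score higher than any of them.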
Optionally, determining the weight value of each node that contains the characteristic information may include:
for different types of information block, determining in parallel the weight values of the nodes containing the different types of characteristic information.
In an embodiment of the present invention, the weight values for different types of characteristic information can be determined at the same time; for example, the weight values for link characteristic information and the weight values for plain-text characteristic information can be determined simultaneously without conflict.
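One possible way to determine the weight values for several types at the same time is to run one task per type. The sketch below uses Python's standard concurrent.futures module; the per-node measurements in NODES and the placeholder weight algorithm (measurement divided by 100) are assumptions for illustration and are not the preset algorithm of the embodiment.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-node measurements: link length and plain-text character count.
NODES = {
    "11": {"link": 24, "plain_text": 0},
    "21": {"link": 0, "plain_text": 150},
    "22": {"link": 0, "plain_text": 180},
}


def direct_weights(feature):
    """Weight value of every node directly containing `feature` (placeholder algorithm)."""
    return {name: m[feature] / 100 for name, m in NODES.items() if m[feature] > 0}


def weights_by_feature(features):
    """Run one independent task per type of characteristic information."""
    with ThreadPoolExecutor() as pool:
        return dict(zip(features, pool.map(direct_weights, features)))


result = weights_by_feature(["link", "plain_text"])
print(result)
```

The tasks only read the shared tree data and each writes its own result dictionary, which is why the two calculations do not conflict.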
Optionally, converting the web page content to be processed into a model tree may include:
converting the web page content to be processed, which is in HyperText Markup Language (HTML) format, step by step into Document Object Model (DOM) nodes according to the parent-child relationships between nodes, and obtaining a DOM tree once all of the web page content to be processed has been converted;
during the conversion, if an error is detected in the web page content to be processed, correcting the error and placing the corrected content in the corresponding DOM node.
The web pages involved in embodiments of the present invention may be in HyperText Markup Language (HTML) format.
Taking an HTML web page as an example, the process of converting an HTML web page into a Document Object Model (DOM) tree is described below.
The HTML web page is:
The above HTML web page content is converted, according to the levels and branches in the web page content, into the DOM tree shown in Fig. 3; the nodes on the DOM tree may be called DOM nodes.
First, based on the DOM standard, the HTML text is converted into a DOM tree in program memory. Besides avoiding the word segmentation and text semantic recognition that traditional methods require, this brings two further benefits: first, mature tree traversal algorithms can be used to search any position in the DOM tree; second, errors in the HTML document itself, such as incomplete DIV elements, can be corrected while the DOM tree is generated. Even an unrecognized custom element ends up as an ordinary node on the DOM tree and does not hinder traversal of the DOM tree.
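The idea of building a tree while tolerating an unclosed DIV and an unknown custom element can be sketched with Python's standard html.parser module. This is a simplified stand-in, not a DOM implementation and not the conversion used in the embodiment: the class TreeBuilder, the dictionary node layout, the VOID set and the repair policy (an unclosed element is closed implicitly when one of its ancestors closes) are all assumptions.

```python
from html.parser import HTMLParser

# Elements that never have content; a small illustrative subset.
VOID = {"br", "img", "hr", "meta", "link", "input"}


class TreeBuilder(HTMLParser):
    """Build a nested-dictionary tree from HTML, tolerating some malformed input."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "text": "", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "text": "", "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this name; elements left
        # open above it are closed implicitly, and stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1]["text"] += data.strip()


def to_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def child_tags(node):
    return [child["tag"] for child in node["children"]]


# The outer <div> is never closed, and <my-widget> is an unknown custom element.
page = "<html><body><div>Title<div>Body text</div><my-widget>x</my-widget></body></html>"
body = to_tree(page)["children"][0]["children"][0]
outer_div = body["children"][0]
print(child_tags(body), child_tags(outer_div))
```

In this sketch the unclosed outer DIV is closed when its parent BODY closes, and the custom element simply becomes another node, so a traversal over the result is not interrupted by either.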
After the conversion into the DOM tree shown in Fig. 3, the nodes are next searched one by one, for each type of information block, for the characteristic information of that type. For example, when searching for the characteristic information of the title type, the characteristic information is a small amount of continuous text.
According to the search results, node 32 in Fig. 4 directly contains the characteristic information of the title type, while node 30 and node 31 do not directly contain it; the weight value of node 32 can therefore be calculated with the weight value algorithm for titles.
If the calculated weight value of node 32 is 2.57, the weight values of node 31 and node 30 can be calculated in turn by multiplying by the contraction factor. With θ = 0.7, the weight value of node 31 is 2.57 * 0.7 ≈ 1.8, and multiplying the weight value of node 31 by the contraction factor gives a weight value of 1.26 for node 30.
Thus, for the title type, the node with the largest weight value is node 32, so node 32 can be taken as the specific node. The position information of the information block clustered by node 32 is the position information of the title information block; if the position information of the information block clustered by node 32 is /html/body/div[4]/, then the position information of the title is /html/body/div[4]/.
Similarly, the nodes of the DOM tree shown in Fig. 3 can be searched one by one for the characteristic information of the body text. The characteristic information of the body text may be a large number of continuous characters; for example, 30 can be set as the threshold, so that when the number of continuous characters exceeds 30, the content is regarded as body text.
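The threshold test for body text can be written as a short function. The threshold of 30 characters comes from the example above; the function name classify_text, the returned labels, and the choice to treat any non-empty run of at most 30 characters as a title candidate are assumptions made for illustration.

```python
BODY_THRESHOLD = 30  # threshold from the example; the value is configurable


def classify_text(text):
    """Label a run of continuous text by its character count (illustrative rule)."""
    count = len(text.strip())
    if count == 0:
        return None          # no text characteristic information at all
    if count > BODY_THRESHOLD:
        return "body"        # a large number of continuous characters
    return "title"           # assumed: a small amount of continuous text


print(classify_text("Chapter One"))
print(classify_text("x" * 31))
```

A run of exactly 30 characters is not treated as body text, because the rule in the example requires the count to exceed the threshold.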
According to the search results, node 43, node 44 and node 45 shown in Fig. 5 directly contain the characteristic information of the body text. The weight values of node 43, node 44 and node 45 can then be calculated with the weight value algorithm for body text.
If the calculated weight values of node 43, node 44 and node 45 are each 1, and node 41, node 42 and node 40 indirectly contain the characteristic information of the body text, the weight values of node 41, node 42 and node 40 can be calculated with the above relation between first nodes and second nodes using the contraction factor. With contraction factor θ = 0.7, the weight value of node 41 is 1 * 0.7 = 0.7, the weight value of node 42 is (1 + 1) * 0.7 = 1.4, and the weight value of node 40 is (0.7 + 1.4) * 0.7 = 1.47.
Thus, for the body text type, the node with the largest weight value is node 40, so the specific nodes for the body text are determined to include node 40, node 41, node 42, node 43, node 44 and node 45. The position information of the information block clustered by node 40, node 41, node 42, node 43, node 44 and node 45 is the position information of the body text; if the position information of the information block clustered by these nodes is /html/body/div[8]/, then the position information of the body text is /html/body/div[8]/.
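The body text example can be reproduced end to end with a short recursive calculation. The parent-child layout in CHILDREN is inferred from the arithmetic in the example (node 41 with one child, node 42 with two) and the exact assignment of nodes 43, 44 and 45 is an assumption, since Fig. 5 is not reproduced here; the function names are likewise hypothetical.

```python
# Assumed layout: 40 -> 41, 42; 41 -> 43; 42 -> 44, 45.
CHILDREN = {"40": ["41", "42"], "41": ["43"], "42": ["44", "45"]}
DIRECT = {"43": 1.0, "44": 1.0, "45": 1.0}  # weight values of the first nodes
THETA = 0.7                                  # contraction factor


def weight(node):
    """First nodes use their own weight value; second nodes use theta * sum of children."""
    if node in DIRECT:
        return DIRECT[node]
    return THETA * sum(weight(child) for child in CHILDREN.get(node, []))


def subtree(node):
    """The node itself followed by every node below it."""
    nodes = [node]
    for child in CHILDREN.get(node, []):
        nodes.extend(subtree(child))
    return nodes


weights = {name: weight(name) for name in subtree("40")}
top = max(weights, key=weights.get)
specific = sorted(subtree(top))
print(top, round(weights[top], 2), specific)
```

Node 40 wins because it gathers all three body text nodes, even though each level of the tree discounts the sum by θ.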
Optionally, after determining that the position information of the information block clustered by the specific nodes is the position information of that type of information block, the method may further include:
establishing a correspondence between the type of information block and the position information of that type of information block, the correspondence being used to filter interference information out of web pages.
In an embodiment of the present invention, after the position information of each type of information block has been determined, a correspondence between the type of information block and the position information of that type of information block can be established.
The correspondence may be represented as a table or in other forms. If represented as a table, see Table 1.
Table 1: Correspondence between the type of information block and the position information of that type of information block
Type of information block    Position information of that type of information block
Title                        /html/body/div[4]/
Body text                    /html/body/div[8]/
Introduction                 /html/body/div[5]/
Table 1 above lists only a few types for illustration and does not exhaust all correspondences between the type of information block and the position information of that type of information block; the content of Table 1 should therefore not be understood as limiting the content of the correspondence between the type of information block and the position information of that type of information block.
Because a web page may contain interference information such as advertisements, once the positions of the above useful information have been determined, most interference information such as advertisements can be excluded.
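How the correspondence might be used for filtering can be sketched as a lookup keyed by information block type. The three positions below are the ones listed in Table 1; the dictionary names, the function keep_useful, and the representation of a page as a mapping from position to content are assumptions for illustration only.

```python
# Correspondence from Table 1: type of information block -> position information.
POSITIONS = {
    "title": "/html/body/div[4]/",
    "body": "/html/body/div[8]/",
    "introduction": "/html/body/div[5]/",
}


def keep_useful(page_blocks, positions):
    """Keep only content located at, or below, a known information block position."""
    wanted = tuple(positions.values())
    return {path: content for path, content in page_blocks.items()
            if path.startswith(wanted)}


# Hypothetical page content keyed by position.
page_blocks = {
    "/html/body/div[4]/": "Chapter One",
    "/html/body/div[6]/": "Advertisement banner",
    "/html/body/div[8]/p[1]/": "It was a quiet morning.",
}

print(keep_useful(page_blocks, POSITIONS))
```

Everything whose position is not covered by the correspondence, such as the advertisement block in this example, is dropped before display.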
To implement the method for determining the position information of an information block described above with reference to Figs. 1 to 5, this application also provides a corresponding device. The functions performed by the modules of the device can be understood in conjunction with the method embodiments described with reference to Figs. 1 to 5.
Referring to Fig. 6, an embodiment of the device for determining the position information of an information block provided by an embodiment of the present invention includes:
a converting unit 501, configured to convert the web page content to be processed into a model tree, the model tree containing multiple nodes;
a search unit 502, configured to search, for each type of information block, the multiple nodes of the model tree produced by the converting unit 501 for the characteristic information of that type of information block, the characteristic information being information that describes the form in which that type of information block is presented;
a first determining unit 503, configured to determine the weight value of each node that contains the characteristic information found by the search unit 502, where the nodes containing the characteristic information include first nodes and second nodes, a first node being a node that directly contains the characteristic information and a second node being a node that does not directly contain the characteristic information;
a second determining unit 504, configured to determine that the position information of the information block clustered by the specific nodes is the position information of that type of information block, the specific nodes including the node with the largest weight value among the weight values determined by the first determining unit 503 and all nodes subordinate to the node with the largest weight value.
In an embodiment of the present invention, the converting unit 501 converts the web page content to be processed into a model tree containing multiple nodes; the search unit 502 searches, for each type of information block, the multiple nodes of the model tree produced by the converting unit 501 for the characteristic information of that type of information block, the characteristic information being information that describes the form in which that type of information block is presented; the first determining unit 503 determines the weight value of each node that contains the characteristic information found by the search unit 502, where the nodes containing the characteristic information include first nodes, which directly contain the characteristic information, and second nodes, which do not directly contain it; and the second determining unit 504 determines that the position information of the information block clustered by the specific nodes is the position information of that type of information block, the specific nodes including the node with the largest weight value among the weight values determined by the first determining unit 503 and all nodes subordinate to that node. Compared with the low efficiency of manually locating information blocks in web pages in the prior art, the device for determining the position information of an information block provided by embodiments of the present invention can automatically and accurately locate the position information of an information block by means of node clustering, improving the efficiency of information block positioning.
Optionally, the first determining unit 503 is configured to:
for each first node, determine, according to a preset algorithm, the weight value of that first node, which directly contains the feature information; and
for each second node, sum the weight values of its direct child nodes and then multiply the sum by a shrink factor to determine the weight value of that second node.
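The two weighting rules can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Node` structure, the keyword-counting rule standing in for the "preset algorithm", and the shrink factor of 0.5 are all hypothetical, since the text does not specify them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of the model tree (hypothetical minimal structure)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def direct_weight(node: Node, features: List[str]) -> float:
    # Stand-in for the "preset algorithm": count how often the feature
    # strings occur in the node's own text. This scoring rule is an assumption.
    return float(sum(node.text.count(f) for f in features))


def weight(node: Node, features: List[str], shrink: float = 0.5) -> float:
    own = direct_weight(node, features)
    if own > 0:
        # First node: it directly contains the feature information.
        return own
    # Second node: sum the direct children's weights, then apply the shrink factor.
    return shrink * sum(weight(child, features, shrink) for child in node.children)


price = Node("span", "price price")
nested = Node("div", children=[Node("span", "price")])
root = Node("body", children=[price, nested, Node("p", "other")])

print(weight(price, ["price"]))   # 2.0
print(weight(nested, ["price"]))  # 0.5
print(weight(root, ["price"]))    # 1.25
```

In this toy tree the root scores lower than the `span` that actually holds the feature text, which suggests one plausible effect of the shrink factor: it keeps a distant ancestor from outscoring the node where the features are concentrated.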
Optionally, the first determining unit 503 is configured to determine, in parallel for different types of information blocks, the weight values of the nodes that contain the feature information of the different types.
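As a rough sketch of per-type parallelism, the snippet below scores one tree for several block types at once using Python's standard `concurrent.futures` module. The tuple-based tree, the feature lists, and the scoring rule are assumptions made for illustration. Threads are used only to show the structure; for CPU-bound scoring in CPython a process pool would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor

# Each node is (own_text, [children]); a hypothetical minimal model tree.
TREE = ("", [
    ("price 9.99", []),
    ("", [("comment: nice", []), ("comment: ok", []), ("comment: fine", [])]),
])

# Hypothetical feature information for two types of information block.
FEATURES = {"price": ["price"], "comment": ["comment"]}


def max_weight(tree, features, shrink=0.5):
    """Return the largest node weight in the tree for one block type."""
    best = 0.0

    def visit(node):
        nonlocal best
        text, children = node
        child_weights = [visit(child) for child in children]
        own = float(sum(text.count(f) for f in features))
        # Direct hit: use the node's own score; otherwise shrink the children's sum.
        w = own if own > 0 else shrink * sum(child_weights)
        best = max(best, w)
        return w

    visit(tree)
    return best


def weights_by_type(tree, feature_map):
    # One task per block type; the tasks do not share mutable state.
    with ThreadPoolExecutor() as pool:
        futures = {kind: pool.submit(max_weight, tree, feats)
                   for kind, feats in feature_map.items()}
        return {kind: fut.result() for kind, fut in futures.items()}


print(weights_by_type(TREE, FEATURES))  # {'price': 1.0, 'comment': 1.5}
```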
Optionally, the converting unit 501 is configured to:
convert the web page content to be processed, which is in Hypertext Markup Language (HTML) format, step by step into Document Object Model (DOM) nodes according to the parent-child relationships between nodes, and obtain a DOM tree once all of the web page content to be processed has been converted; and
during the conversion, if an error is detected in the web page content to be processed, correct the error and assign the corrected content to the corresponding DOM node.
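A minimal sketch of such a conversion, using Python's standard `html.parser` module, is shown below. The `Node` and `TreeBuilder` classes and the two repair rules (implicitly closing unclosed descendants and ignoring stray closing tags) are assumptions chosen for illustration. The text does not say which errors are corrected or how, and a production HTML parser applies many more rules.

```python
from html.parser import HTMLParser

# Elements that never have children; an abbreviated, illustrative list.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Find the nearest open element with this tag name.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is self.root:
            return  # repair: a stray closing tag is ignored
        # Repair: closing an ancestor implicitly closes its unclosed descendants.
        self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip()


def to_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# The <span> is never closed and </b> has no matching opening tag.
root = to_tree("<div><p>hello<span>world</p></div></b>")
p = root.children[0].children[0]
print(p.tag, p.text, [child.tag for child in p.children])  # p hello ['span']
```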
Optionally, referring to Fig. 7, in another embodiment of the device 50 for determining the position information of an information block provided by the embodiments of the present invention, the device 50 further includes:
an establishing unit 505, configured to establish, after the second determining unit 504 determines the position information of the information block of this type, a correspondence between the type of the information block and the position information of the information block of this type, where the correspondence is used to filter interference information in the web page.
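One way to picture this correspondence is a mapping from block type to a position in the tree, which is then used to keep only the located blocks and drop everything else. The sketch below is hypothetical throughout: the text does not define the format of the position information, so a path of child indexes from the root is assumed, and the dictionary-based tree and the type names are made up.

```python
from typing import Dict, List, Tuple

Path = Tuple[int, ...]  # child indexes from the root; an assumed position format


def node_at(tree: dict, path: Path) -> dict:
    """Follow a path of child indexes from the root to a node."""
    node = tree
    for index in path:
        node = node["children"][index]
    return node


def collect_text(node: dict) -> List[str]:
    """Gather the text of a node and all of its subordinate nodes."""
    parts = [node["text"]] if node["text"] else []
    for child in node["children"]:
        parts.extend(collect_text(child))
    return parts


def extract_blocks(tree: dict, positions: Dict[str, Path]) -> Dict[str, List[str]]:
    # Only content under a recorded position survives; the rest is filtered out.
    return {kind: collect_text(node_at(tree, path)) for kind, path in positions.items()}


def leaf(text: str) -> dict:
    return {"text": text, "children": []}


page = {"text": "", "children": [
    leaf("Advertisement"),
    {"text": "", "children": [leaf("Title"), leaf("Body text")]},
    {"text": "", "children": [leaf("Great post"), leaf("Thanks")]},
]}

# Correspondence between block type and position information (hypothetical values).
positions = {"article": (1,), "comments": (2,)}

print(extract_blocks(page, positions))
# {'article': ['Title', 'Body text'], 'comments': ['Great post', 'Thanks']}
```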
The device for determining the position information of an information block provided by the embodiments of the present invention may be implemented by a server or a physical host. Taking a server as an example, the following describes how the method for determining the position information of an information block is implemented by a server.
Fig. 8 is a schematic structural diagram of the server 60 provided by an embodiment of the present invention. The server 60 includes a processor 610, a memory 650, and a transceiver 630. The memory 650 may include read-only memory and random access memory, and provides operation instructions and data to the processor 610. A part of the memory 650 may also include non-volatile random access memory (NVRAM).
In some embodiments, the memory 650 stores the following elements, executable modules or data structures, or a subset of them, or an extended set of them:
In this embodiment of the present invention, by invoking the operation instructions stored in the memory 650 (the operation instructions may be stored in an operating system), the following operations are performed:
converting the web page content to be processed into a model tree, the model tree containing a plurality of nodes;
for each type of information block, searching the plurality of nodes for the feature information of that type of information block, the feature information being information that describes the form of presentation of that type of information block;
determining the weight value of each node that contains the feature information, where the nodes containing the feature information include first nodes and second nodes, a first node being a node that directly contains the feature information and a second node being a node that does not directly contain the feature information; and
determining that the position information of the information block clustered by a specific node is the position information of the information block of this type, where the specific node includes the node with the largest weight value and all nodes subordinate to the node with the largest weight value.
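The last operation, selecting the node with the largest weight value together with its subordinate nodes, can be sketched as below. The dictionary-based tree, the keyword-counting score, the shrink factor of 0.5, and the use of a child-index path as position information are all assumptions made for this illustration and are not taken from the text.

```python
from typing import List, Tuple


def locate_block(tree: dict, features: List[str], shrink: float = 0.5) -> dict:
    """Return the weight, path and node of the highest-weighted node."""
    best = {"weight": 0.0, "path": None, "node": None}

    def visit(node: dict, path: Tuple[int, ...]) -> float:
        child_weights = [visit(child, path + (i,))
                         for i, child in enumerate(node["children"])]
        own = float(sum(node["text"].count(f) for f in features))
        # First node: own score. Second node: shrunken sum of the children.
        w = own if own > 0 else shrink * sum(child_weights)
        if w > best["weight"]:  # on ties, the node visited first is kept
            best.update(weight=w, path=path, node=node)
        return w

    visit(tree, ())
    return best  # "node" stays None when no feature information is found


def cluster(node: dict) -> List[dict]:
    """The selected node plus every node subordinate to it."""
    nodes = [node]
    for child in node["children"]:
        nodes.extend(cluster(child))
    return nodes


def leaf(text: str) -> dict:
    return {"text": text, "children": []}


page = {"text": "", "children": [
    leaf("home"),
    {"text": "", "children": [leaf("comment: a"), leaf("comment: b"), leaf("comment: c")]},
]}

found = locate_block(page, ["comment"])
print(found["path"], found["weight"], len(cluster(found["node"])))  # (1,) 1.5 4
```

Here the container of the three comments (weight 1.5) beats each individual comment (weight 1.0) and the page root (weight 0.75), so the whole comment list is returned as one block.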
Compared with the prior art, in which information blocks in a web page are located manually and locating efficiency is therefore low, the server provided by this embodiment of the present invention can locate the position information of an information block automatically and accurately by means of node clustering, which improves the efficiency of locating information blocks.
The processor 610 controls the operation of the server 60 and may also be referred to as a central processing unit (CPU). The memory 650 may include read-only memory and random access memory, and provides instructions and data to the processor 610. A part of the memory 650 may also include non-volatile random access memory (NVRAM). In a specific application, the components of the server 60 are coupled together by a bus system 620, which may include, in addition to a data bus, a power bus, a control bus, a status signal bus, and so on. For clarity of description, however, the various buses are all labeled as the bus system 620 in the figure.
The method disclosed in the above embodiments of the present invention may be applied to the processor 610 or implemented by the processor 610. The processor 610 may be an integrated circuit chip with signal processing capability. In implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 610 or by instructions in the form of software. The processor 610 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 650; the processor 610 reads the information in the memory 650 and completes the steps of the above method in combination with its hardware.
Optionally, the processor 610 is configured to:
for each first node, determine, according to a preset algorithm, the weight value of that first node, which directly contains the feature information; and
for each second node, sum the weight values of its direct child nodes and then multiply the sum by a shrink factor to determine the weight value of that second node.
Optionally, the processor 610 is configured to:
determine, in parallel for different types of information blocks, the weight values of the nodes that contain the feature information of the different types.
Optionally, the processor 610 is configured to:
convert the web page content to be processed, which is in Hypertext Markup Language (HTML) format, step by step into Document Object Model (DOM) nodes according to the parent-child relationships between nodes, and obtain a DOM tree once all of the web page content to be processed has been converted; and
during the conversion, if an error is detected in the web page content to be processed, correct the error and assign the corrected content to the corresponding DOM node.
Optionally, the processor 610 is configured to:
establish a correspondence between the type of the information block and the position information of the information block of this type, where the correspondence is used to filter interference information in the web page.
For the above server 60, refer to the descriptions of Fig. 1 to Fig. 5; details are not repeated here.
A person of ordinary skill in the art will understand that all or some of the steps in the methods of the above embodiments may be completed by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include ROM, RAM, a magnetic disk, an optical disc, or the like.
The method and device for determining the position information of an information block provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, a person of ordinary skill in the art may, based on the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. the method for the positional information determining block of information, it is characterised in that include:
Pending web page contents is converted into model tree, described model tree comprises multiple node;
For each type of block of information, the plurality of node is searched for the characteristic information of this type block of information, described spy Reference breath is for for describing the information of the form of expression of this type block of information;
Determining the weighted value of each node comprising described characteristic information, described each node comprising characteristic information includes first Node and Section Point, described first node is for directly comprising the node of described characteristic information, and described Section Point is not direct Comprise the node of described characteristic information;
The positional information determining the block of information that specific node clustered is the positional information of this type block of information, described specific joint Point includes the maximum node of weighted value, and all nodes of the node subordinate of described weighted value maximum.
2. The method according to claim 1, characterized in that determining the weight value of each node that contains the feature information comprises:
for each first node, determining, according to a preset algorithm, the weight value of that first node, which directly contains the feature information; and
for each second node, summing the weight values of its direct child nodes and then multiplying the sum by a shrink factor to determine the weight value of that second node.
3. The method according to claim 1, characterized in that determining the weight value of each node that contains the feature information comprises:
determining, in parallel for different types of information blocks, the weight values of the nodes that contain the feature information of the different types.
4. The method according to any one of claims 1 to 3, characterized in that converting the web page content to be processed into a model tree comprises:
converting the web page content to be processed, which is in Hypertext Markup Language (HTML) format, step by step into Document Object Model (DOM) nodes according to the parent-child relationships between nodes, and obtaining a DOM tree once all of the web page content to be processed has been converted; and
during the conversion, if an error is detected in the web page content to be processed, correcting the error and assigning the corrected content to the corresponding DOM node.
5. The method according to any one of claims 1 to 3, characterized in that, after determining that the position information of the information block clustered by the specific node is the position information of the information block of this type, the method further comprises:
establishing a correspondence between the type of the information block and the position information of the information block of this type, where the correspondence is used to filter interference information in the web page.
6. A device for determining position information of an information block, characterized by comprising:
a converting unit, configured to convert web page content to be processed into a model tree, the model tree containing a plurality of nodes;
a search unit, configured to search, for each type of information block, the plurality of nodes of the model tree produced by the converting unit for the feature information of that type of information block, the feature information being information that describes the form of presentation of that type of information block;
a first determining unit, configured to determine the weight value of each node that contains the feature information found by the search unit, where the nodes containing the feature information include first nodes and second nodes, a first node being a node that directly contains the feature information and a second node being a node that does not directly contain the feature information; and
a second determining unit, configured to determine that the position information of the information block clustered by a specific node is the position information of the information block of this type, where the specific node includes the node with the largest weight value among the weight values of the nodes determined by the first determining unit, and all nodes subordinate to the node with the largest weight value.
7. The device according to claim 6, characterized in that
the first determining unit is configured to:
for each first node, determine, according to a preset algorithm, the weight value of that first node, which directly contains the feature information; and
for each second node, sum the weight values of its direct child nodes and then multiply the sum by a shrink factor to determine the weight value of that second node.
8. The device according to claim 6, characterized in that
the first determining unit is configured to determine, in parallel for different types of information blocks, the weight values of the nodes that contain the feature information of the different types.
9. The device according to any one of claims 6 to 8, characterized in that
the converting unit is configured to:
convert the web page content to be processed, which is in Hypertext Markup Language (HTML) format, step by step into Document Object Model (DOM) nodes according to the parent-child relationships between nodes, and obtain a DOM tree once all of the web page content to be processed has been converted; and
during the conversion, if an error is detected in the web page content to be processed, correct the error and assign the corrected content to the corresponding DOM node.
10. The device according to any one of claims 6 to 8, characterized in that the device further comprises:
an establishing unit, configured to establish, after the second determining unit determines the position information of the information block of this type, a correspondence between the type of the information block and the position information of the information block of this type, where the correspondence is used to filter interference information in the web page.
CN201610389942.2A 2016-06-02 2016-06-02 Method and device for determining position information of information block Active CN106095854B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610389942.2A CN106095854B (en) 2016-06-02 2016-06-02 Method and device for determining position information of information block

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610389942.2A CN106095854B (en) 2016-06-02 2016-06-02 Method and device for determining position information of information block

Publications (2)

Publication Number Publication Date
CN106095854A (en) 2016-11-09
CN106095854B (en) 2022-05-17

Family

ID=57448505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610389942.2A Active CN106095854B (en) 2016-06-02 2016-06-02 Method and device for determining position information of information block

Country Status (1)

Country Link
CN (1) CN106095854B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108874934A (en) * 2018-06-01 2018-11-23 百度在线网络技术(北京)有限公司 Page body extracting method and device
CN113485782A (en) * 2021-07-29 2021-10-08 北京百度网讯科技有限公司 Page data acquisition method and device, electronic equipment and medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060191028A1 (en) * 2003-12-04 2006-08-24 H:S Rigshospitalet And University Of Prince Edward Island Developmental animal model of temporal lobe epilepsy
CN101515287A (en) * 2009-03-24 2009-08-26 崔志明 Automatic generating method of wrapper of complex page
US20100031138A1 (en) * 2008-07-30 2010-02-04 International Business Machines Corporation Method For Generating Simple Object Access Protocol Messages and Process Engine
CN102663023A (en) * 2012-03-22 2012-09-12 浙江盘石信息技术有限公司 Implementation method for extracting web content
CN103198114A (en) * 2013-03-29 2013-07-10 电子科技大学 WEB service matching method based on overlapped oriented graph
CN103491116A (en) * 2012-06-12 2014-01-01 深圳市世纪光速信息技术有限公司 Method and device for processing text-related structural data
CN103559234A (en) * 2013-10-24 2014-02-05 北京邮电大学 System and method for automated semantic annotation of RESTful Web services
CN103559202A (en) * 2013-10-08 2014-02-05 北京奇虎科技有限公司 Webpage content extracting device and method
CN104346405A (en) * 2013-08-08 2015-02-11 阿里巴巴集团控股有限公司 Method and device for extracting information from webpage
CN105183730A (en) * 2014-05-30 2015-12-23 北大方正集团有限公司 Method and device for processing webpage information
CN105630772A (en) * 2016-01-26 2016-06-01 广东工业大学 Method for extracting webpage comment content
CN105677638A (en) * 2016-01-05 2016-06-15 北京工业大学 Web information extraction method

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060191028A1 (en) * 2003-12-04 2006-08-24 H:S Rigshospitalet And University Of Prince Edward Island Developmental animal model of temporal lobe epilepsy
US20100031138A1 (en) * 2008-07-30 2010-02-04 International Business Machines Corporation Method For Generating Simple Object Access Protocol Messages and Process Engine
CN101515287A (en) * 2009-03-24 2009-08-26 崔志明 Automatic generating method of wrapper of complex page
CN102663023A (en) * 2012-03-22 2012-09-12 浙江盘石信息技术有限公司 Implementation method for extracting web content
CN103491116A (en) * 2012-06-12 2014-01-01 深圳市世纪光速信息技术有限公司 Method and device for processing text-related structural data
CN103198114A (en) * 2013-03-29 2013-07-10 电子科技大学 WEB service matching method based on overlapped oriented graph
CN104346405A (en) * 2013-08-08 2015-02-11 阿里巴巴集团控股有限公司 Method and device for extracting information from webpage
CN103559202A (en) * 2013-10-08 2014-02-05 北京奇虎科技有限公司 Webpage content extracting device and method
CN103559234A (en) * 2013-10-24 2014-02-05 北京邮电大学 System and method for automated semantic annotation of RESTful Web services
CN105183730A (en) * 2014-05-30 2015-12-23 北大方正集团有限公司 Method and device for processing webpage information
CN105677638A (en) * 2016-01-05 2016-06-15 北京工业大学 Web information extraction method
CN105630772A (en) * 2016-01-26 2016-06-01 广东工业大学 Method for extracting webpage comment content

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Gu Kaikai et al.: "Transformer fault diagnosis based on compact fusion of fuzzy sets and fault trees", High Voltage Engineering *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108874934A (en) * 2018-06-01 2018-11-23 百度在线网络技术(北京)有限公司 Page body extracting method and device
CN108874934B (en) * 2018-06-01 2021-11-30 百度在线网络技术(北京)有限公司 Page text extraction method and device
CN113485782A (en) * 2021-07-29 2021-10-08 北京百度网讯科技有限公司 Page data acquisition method and device, electronic equipment and medium

Also Published As

Publication number Publication date
CN106095854B (en) 2022-05-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant