CN111026946A - Page information extraction method, device, medium and equipment

Info

Publication number: CN111026946A
Application number: CN201911278179.6A
Authority: CN
Inventors: 丁柳朋
Original assignee: Hangzhou Xinhua Information Technology Co Ltd
Current assignee: Hangzhou Xinhua Information Technology Co Ltd
Priority date: 2019-12-12
Filing date: 2019-12-12
Publication date: 2020-04-17

Abstract

The invention discloses a page information extraction method, a device, a medium and equipment, which comprises the steps of obtaining a page attribute set of a multi-sample page, wherein elements of the page attribute set correspond to an attribute block of one page, and the attribute block of the page comprises a plurality of page sub-attributes; calculating the difference degree of each page and other pages according to the page attribute set; obtaining a classification target value N, and obtaining a clustering target difference value according to the classification target value; clustering the multi-sample pages according to the clustering target difference value to obtain a clustering result; generating an information extraction template corresponding to each class in the clustering result; acquiring a page to be extracted; calculating the difference degree between the page to be extracted and the clustering centers of the various classes, and determining the class with the minimum difference degree as a target class; and extracting the information in the page to be extracted based on the information extraction template corresponding to the target class. The invention provides convenience for extracting the page information.

Description

Page information extraction method, device, medium and equipment

Technical Field

The present invention relates to the field of data processing, and in particular, to a method, an apparatus, a medium, and a device for extracting page information.

Background

The importance of data analysis in the big data era is increasingly highlighted, and how to extract data information from massive page data through data clustering and quickly and accurately grasp the information in the data is one of key contents of webpage information research. The webpage information obtained through webpage data classification, clustering and information extraction can be used in various scenes such as webpage pushing, webpage construction and the like.

Disclosure of Invention

In order to solve technical problems in the prior art, embodiments of the present invention provide a method, an apparatus, a medium, and a device for extracting page information.

A method for extracting page information, the method comprising:

acquiring a page attribute set of a multi-sample page, wherein elements of the page attribute set correspond to attribute blocks of one page, and the attribute blocks of the page comprise a plurality of page sub-attributes;

calculating the difference degree of each page and other pages according to the page attribute set;

obtaining a classification target value N, and obtaining a clustering target difference value according to the classification target value;

clustering the multi-sample pages according to the clustering target difference value to obtain a clustering result;

generating an information extraction template corresponding to each class in the clustering result;

acquiring a page to be extracted;

calculating the difference degree between the page to be extracted and the clustering centers of the various classes, and determining the class with the minimum difference degree as a target class;

and extracting the information in the page to be extracted based on the information extraction template corresponding to the target class.

Preferably, the obtaining a clustering target difference value according to the classification target value includes:

constructing a difference degree graph according to the difference degree of each page and other pages, wherein each vertex in the difference graph represents one page, each vertex and a related node are provided with a unique connecting line, the weight of the connecting line is the difference degree between the page represented by the vertex and the page represented by the related node, and the related node is other vertices adjacent to the vertex;

and calculating a clustering target difference value according to the difference degree graph and the classification target value.

Preferably, the calculating a clustering target difference value according to the difference degree graph and the classification target value includes:

acquiring a first vertex set and a first connecting line set according to the difference degree graph;

initializing a second vertex set and a second connecting line set, wherein the second vertex set has one and only one element, and the second connecting line set is empty;

constructing a first attribute set, wherein each element in the first attribute set records a first attribute of a vertex, the first attribute representing, when the vertex is in the difference set of the first vertex set and the second vertex set, the minimum weight among the connecting lines between the vertex and its related elements in the second vertex set, a related element being a vertex that is adjacent to the vertex in the difference degree graph;

constructing a second attribute set, wherein each element in the second attribute set records a second attribute of a vertex, the second attribute representing, when the vertex is in the difference set of the first vertex set and the second vertex set, the other endpoint of the minimum-weight connecting line between the vertex and its related elements in the second vertex set, a related element being a vertex that is adjacent to the vertex in the difference degree graph;

executing a preset operation, and updating the second vertex set, the second connecting line set, the first attribute set and the second attribute set until a preset requirement is met;

and sorting the elements in the first attribute set in descending order by value, and determining the value of the (N-1)-th element as the clustering target difference value.

Preferably, the generating an information extraction template corresponding to each class in the clustering result includes:

for each class, randomly determining one page, and taking a document object model of the page as an information extraction template;

and sequentially comparing the document object models of other pages in the class with the information extraction template from the root node according to the top-down sequence, if the label of the current node of the information extraction template is the same as that of the current node of the document object models of other pages, keeping the current node, and if the label of the current node of the information extraction template is different from that of the current node of the document object models of other pages, deleting the current node from the information extraction template.

Preferably, the degree of difference between any two pages A and B can be expressed as

where each page has n attributes, and A_i and B_i denote the i-th attributes of pages A and B, respectively.

A page information extraction apparatus, the apparatus comprising:

the page attribute set acquisition module is used for acquiring a page attribute set of a multi-sample page, wherein elements of the page attribute set correspond to an attribute block of one page, and the attribute block of the page comprises a plurality of page sub-attributes;

the difference degree calculation module is used for calculating the difference degree of each page and other pages according to the page attribute set;

the clustering target difference value determining module is used for acquiring a classification target value N and acquiring a clustering target difference value according to the classification target value;

the clustering module is used for clustering the multi-sample pages according to the clustering target difference value to obtain a clustering result;

the information extraction template determining module is used for generating an information extraction template corresponding to each class in the clustering result;

the to-be-extracted page acquisition module is used for acquiring a to-be-extracted page;

the target class determining module is used for calculating the difference degree between the page to be extracted and the clustering centers of all the classes and determining the class with the minimum difference degree as a target class;

and the information extraction module is used for extracting the information in the page to be extracted based on the information extraction template corresponding to the target class.

A computer storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement a method of page information extraction.

A page information extraction apparatus, comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded by the processor and executes a page information extraction method.

The invention provides a method, a device, a medium and equipment for extracting page information. According to the invention, a classification target value is determined for the multi-sample page, sample clustering can be rapidly and accurately carried out, and the information extraction template is obtained through the clustering result of the sample clustering, so that convenience is provided for subsequent page information extraction.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions and advantages of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a flow chart of a method for extracting page information according to the present invention;

FIG. 2 is a flowchart of obtaining the clustering target difference value according to the classification target value according to the present invention;

FIG. 3 is a flowchart of calculating the clustering target difference value according to the difference degree graph and the classification target value according to the present invention;

FIG. 4 is a flowchart of generating an information extraction template corresponding to each class in the clustering result according to the present invention;

FIG. 5 is a block diagram of a device for extracting page information according to the present invention;

FIG. 6 is a hardware structural diagram of an apparatus for implementing the method provided by the embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In order to make the objects, technical solutions and advantages disclosed in the embodiments of the present invention more clearly apparent, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the embodiments of the invention and are not intended to limit the embodiments of the invention.

In the following, the terms "first", "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present embodiment, "a plurality" means two or more unless otherwise specified. In order to facilitate understanding of the technical solutions and the technical effects thereof described in the embodiments of the present invention, the embodiments of the present invention first explain related terms:

An embodiment of the present invention provides a method for extracting page information. As shown in FIG. 1, the method may include:

S101, obtaining a page attribute set of a multi-sample page, wherein elements of the page attribute set correspond to attribute blocks of one page, and the attribute blocks of the page comprise a plurality of page sub-attributes.

Each page has page attributes, which in the embodiment of the present invention are gathered in the attribute block of the page. All pages have the same number and the same kinds of attributes but different attribute values, and every attribute value is normalized to fall within [0, 1].
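As an illustration of such a page attribute set, the following minimal sketch stores one attribute block per page and scales every attribute into [0, 1]. The text does not say how normalization is performed, so min-max scaling is an assumption here, and the function name `normalize_attributes` and the page identifiers are hypothetical.

```python
from typing import Dict, List


def normalize_attributes(raw: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Scale each attribute to [0, 1] across all pages.

    `raw` maps a page identifier to its attribute block (a list of
    sub-attribute values). Min-max scaling is an assumption; the source
    only states that the values are normalized to [0, 1].
    """
    page_ids = list(raw)
    n = len(raw[page_ids[0]])  # every page has the same number of attributes
    lows = [min(raw[p][i] for p in page_ids) for i in range(n)]
    highs = [max(raw[p][i] for p in page_ids) for i in range(n)]
    scaled: Dict[str, List[float]] = {}
    for p in page_ids:
        scaled[p] = [
            # A constant attribute carries no information; map it to 0.0.
            0.0 if highs[i] == lows[i]
            else (raw[p][i] - lows[i]) / (highs[i] - lows[i])
            for i in range(n)
        ]
    return scaled


if __name__ == "__main__":
    pages = {"a": [10.0, 2.0], "b": [20.0, 2.0], "c": [30.0, 4.0]}
    print(normalize_attributes(pages))
```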

S103, calculating the difference degree between each page and other pages according to the page attribute set.

In particular, the degree of difference between any two pages A and B can be expressed as
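The formula itself is not reproduced in this text, so the sketch below cannot claim to be the patent's measure. It assumes, purely for illustration, a Euclidean distance over the n normalized attributes A_i and B_i; the names `difference_degree` and `pairwise_differences` are hypothetical.

```python
import math
from typing import List, Sequence


def difference_degree(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed measure: Euclidean distance between two attribute blocks."""
    if len(a) != len(b):
        raise ValueError("pages must have the same number of attributes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_differences(pages: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix of difference degrees between every pair of pages."""
    n = len(pages)
    diff = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diff[i][j] = diff[j][i] = difference_degree(pages[i], pages[j])
    return diff


if __name__ == "__main__":
    print(pairwise_differences([[0.0, 0.0], [0.3, 0.4], [1.0, 1.0]]))
```

Any other symmetric, non-negative measure over the attribute blocks could be substituted without changing the later steps.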

S105, acquiring a classification target value N, and acquiring a clustering target difference value according to the classification target value.

The obtaining of the clustering target difference value according to the classification target value, as shown in FIG. 2, includes:

S1051, constructing a difference degree graph according to the difference degree between each page and other pages, wherein each vertex in the difference degree graph represents one page, each vertex and a related node are provided with a unique connecting line, the weight of each connecting line is the difference degree between the page represented by the vertex and the page represented by the related node, and the related node is another vertex adjacent to the vertex.

S1053, calculating a clustering target difference value according to the difference degree graph and the classification target value.

Specifically, the calculating a clustering target difference value according to the difference degree graph and the classification target value, as shown in FIG. 3, includes:

S10531, a first vertex set and a first connecting line set are obtained according to the difference degree graph.

Specifically, the first connecting line set records each connecting line in the difference degree graph, and the value of each element in the first connecting line set is the weight of the connecting line, that is, a difference degree. Each vertex in the difference degree graph is included in the first vertex set. In particular, each vertex and each connecting line may have its own number.
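A minimal sketch of the first vertex set and the first connecting line set follows. It assumes the difference degree graph is complete (every pair of pages is connected) and that vertices are numbered by page index; both choices, and the name `build_difference_graph`, are illustrative assumptions.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def build_difference_graph(
    diff: List[List[float]],
) -> Tuple[List[int], Dict[Tuple[int, int], float]]:
    """Return (first vertex set, first connecting line set).

    `diff[i][j]` is the difference degree between pages i and j.
    Each connecting line is keyed by its numbered endpoints (i, j), i < j,
    and its value is the weight, i.e. the difference degree.
    """
    n = len(diff)
    vertices = list(range(n))
    edges = {(i, j): diff[i][j] for i, j in combinations(range(n), 2)}
    return vertices, edges


if __name__ == "__main__":
    d = [[0.0, 0.2, 0.9], [0.2, 0.0, 0.5], [0.9, 0.5, 0.0]]
    print(build_difference_graph(d))
```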

S10533, initializing a second vertex set and a second connecting line set, wherein the second vertex set has one and only one element, and the second connecting line set is empty.

In particular, the element may be any one vertex of the first set of vertices.

S10535, constructing a first attribute set, wherein each element in the first attribute set records a first attribute of a vertex, the first attribute representing, when the vertex is in the difference set of the first vertex set and the second vertex set, the minimum weight among the connecting lines between the vertex and its related elements in the second vertex set, a related element being a vertex that is adjacent to the vertex in the difference degree graph.

S10537, constructing a second attribute set, wherein each element in the second attribute set records a second attribute of a vertex, the second attribute representing, when the vertex is in the difference set of the first vertex set and the second vertex set, the other endpoint of the minimum-weight connecting line between the vertex and its related elements in the second vertex set, a related element being a vertex that is adjacent to the vertex in the difference degree graph.

S10539, executing a preset operation, and updating the second vertex set, the second connecting line set, the first attribute set and the second attribute set until a preset requirement is met.

Specifically, the executing the preset operation to update the second vertex set, the second connection set, the first attribute set, and the second attribute set until the preset requirement is met includes:

performing the following operations until the second vertex set and the first vertex set have the same element number:

(1) selecting a target connecting line with the minimum weight value from the first connecting line set, wherein a first vertex of the target connecting line is positioned in a second vertex set, and a second vertex of the target connecting line is positioned in a difference set of the first vertex set and the second vertex set;

(2) adding the vertex positioned in the difference set of the first vertex set and the second vertex set in the target connecting line into the second vertex set, and adding the target connecting line into the second connecting line set;

(3) updating the first set of attributes and the second set of attributes.

S105311, sorting the elements in the first attribute set in descending order by value, and determining the value of the (N-1)-th element as the clustering target difference value.
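Steps S10531 to S105311 follow the shape of a minimum-spanning-tree construction in the style of Prim's algorithm. The sketch below is one possible reading, not the definitive implementation: it assumes a complete graph given as a matrix, starts from vertex 0, treats "(N-1)-th" as 1-indexed, and excludes the start vertex from the sorted values because it has no incoming connecting line. The names `clustering_target_difference`, `min_weight` and `nearest` are hypothetical.

```python
import math
from typing import List, Tuple


def clustering_target_difference(diff: List[List[float]], n_classes: int) -> float:
    """Grow a spanning tree and return the (N-1)-th largest tree edge weight."""
    n = len(diff)
    if not 2 <= n_classes <= n:
        raise ValueError("classification target value must satisfy 2 <= N <= pages")

    in_tree = [False] * n          # membership in the second vertex set
    min_weight = [math.inf] * n    # first attribute set
    nearest = [-1] * n             # second attribute set
    tree_edges: List[Tuple[int, int, float]] = []  # second connecting line set

    # S10533: the second vertex set starts with exactly one vertex.
    in_tree[0] = True
    for v in range(1, n):
        min_weight[v] = diff[0][v]
        nearest[v] = 0

    # S10539: repeat until both vertex sets have the same number of elements.
    for _ in range(n - 1):
        # (1) minimum-weight line from the tree to a vertex outside it
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: min_weight[v])
        # (2) move the vertex and the line into the second sets
        in_tree[u] = True
        tree_edges.append((nearest[u], u, min_weight[u]))
        # (3) update both attribute sets for the remaining vertices
        for v in range(n):
            if not in_tree[v] and diff[u][v] < min_weight[v]:
                min_weight[v] = diff[u][v]
                nearest[v] = u

    # S105311: descending sort, then take the (N-1)-th element (1-indexed).
    weights = sorted((w for _, _, w in tree_edges), reverse=True)
    return weights[n_classes - 2]


if __name__ == "__main__":
    d = [
        [0.0, 0.1, 0.8, 0.9],
        [0.1, 0.0, 0.7, 0.85],
        [0.8, 0.7, 0.0, 0.2],
        [0.9, 0.85, 0.2, 0.0],
    ]
    print(clustering_target_difference(d, 2))
```

This version scans all vertices on each iteration, which costs O(n²) overall and suits a dense matrix; a heap-based variant would be preferable for sparse graphs.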

S107, clustering the multi-sample pages according to the clustering target difference value to obtain a clustering result.

Specifically, the difference degrees of the pages in the same class in the clustering result are not greater than the clustering target difference value, and the difference degrees of any two pages in different classes are greater than the clustering target difference value.
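One way to obtain classes that are separated by more than the clustering target difference value is to link every pair of pages whose difference degree does not exceed it and take the connected components. The sketch below does this with a small union-find structure. This is an assumed realization, since the text does not name a clustering procedure, and `cluster_pages` is a hypothetical name. Note that it guarantees the between-class condition directly, while pages in the same class are only guaranteed to be connected through a chain of links within the threshold.

```python
from typing import Dict, List


def cluster_pages(diff: List[List[float]], threshold: float) -> List[List[int]]:
    """Group page indices whose difference degree is within `threshold`."""
    n = len(diff)
    root = list(range(n))

    def find(x: int) -> int:
        # Follow parent links to the representative, halving the path.
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            # "not greater than" follows the wording of the text.
            if diff[i][j] <= threshold:
                root[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


if __name__ == "__main__":
    d = [
        [0.0, 0.1, 0.8, 0.9],
        [0.1, 0.0, 0.7, 0.85],
        [0.8, 0.7, 0.0, 0.2],
        [0.9, 0.85, 0.2, 0.0],
    ]
    print(cluster_pages(d, 0.5))
```

Whether the comparison is strict matters when the threshold equals an existing difference degree, so the number of resulting classes should be checked against N in practice.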

S109, generating an information extraction template corresponding to each class in the clustering result.

Specifically, the generating an information extraction template corresponding to each class in the clustering result, as shown in FIG. 4, includes:

S1091, for each class, one page is determined randomly, and a document object model of the page is used as an information extraction template.

S1093, comparing the document object models of other pages in the class with the information extraction template from the root node in sequence from top to bottom, if the label of the current node of the information extraction template is the same as that of the current node of the document object models of other pages, keeping the current node, and if the label of the current node of the information extraction template is different from that of the current node of the document object models of other pages, deleting the current node from the information extraction template.
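The top-down comparison in S1091 and S1093 can be sketched on a simplified tree. The `Node` class below stands in for a real document object model, children are aligned by position, and the first page of the class is used instead of a random one so that the result is reproducible. All three choices are assumptions made for illustration, as are the names `prune_template` and `build_template`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Simplified DOM node: a tag (label) and ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def prune_template(template: Optional[Node], other: Optional[Node]) -> Optional[Node]:
    """Keep a template node only if the other page has the same tag there."""
    if template is None or other is None or template.tag != other.tag:
        return None  # tags differ: delete this node (and its subtree)
    kept: List[Node] = []
    for i, child in enumerate(template.children):
        counterpart = other.children[i] if i < len(other.children) else None
        pruned = prune_template(child, counterpart)
        if pruned is not None:
            kept.append(pruned)
    return Node(template.tag, kept)


def build_template(pages: List[Node]) -> Optional[Node]:
    """Start from one page's tree and prune it against every other page."""
    template: Optional[Node] = pages[0]
    for other in pages[1:]:
        template = prune_template(template, other)
        if template is None:
            break
    return template


if __name__ == "__main__":
    p1 = Node("html", [Node("body", [Node("div"), Node("span"), Node("p")])])
    p2 = Node("html", [Node("body", [Node("div"), Node("table"), Node("p")])])
    print(build_template([p1, p2]))
```

Positional alignment is the weakest part of this sketch: once a node has been deleted, later pages are compared against shifted positions, so a production version would need a more careful node matching.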

S1011, acquiring the page to be extracted.

S1013, calculating the difference degree between the page to be extracted and the clustering centers of the classes, and determining the class with the minimum difference degree as a target class.

Specifically, the algorithm for the difference degree has already been given in the embodiment of the present invention, so the difference degree between the page to be extracted and each class can be obtained in the same manner, which is not described herein again.
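The selection of the target class can be sketched as an arg-min over the clustering centers. The text does not define how a clustering center is computed, so the sketch assumes it is the per-attribute mean of the class members, and it reuses an assumed Euclidean difference degree. The names `cluster_center` and `target_class` and the class labels are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def difference_degree(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed measure: Euclidean distance between two attribute blocks."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_center(members: Sequence[Sequence[float]]) -> List[float]:
    """Assumed clustering center: the mean of each attribute over the class."""
    n = len(members[0])
    return [sum(m[i] for m in members) / len(members) for i in range(n)]


def target_class(page: Sequence[float], classes: Dict[str, List[Sequence[float]]]) -> str:
    """Return the class whose center has the minimum difference degree to `page`."""
    centers = {name: cluster_center(members) for name, members in classes.items()}
    return min(centers, key=lambda name: difference_degree(page, centers[name]))


if __name__ == "__main__":
    classes = {
        "news": [[0.1, 0.2], [0.2, 0.2]],
        "shop": [[0.9, 0.8], [0.8, 0.9]],
    }
    print(target_class([0.15, 0.25], classes))
```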

S1015, extracting the information in the page to be extracted based on the information extraction template corresponding to the target class.

The embodiment of the invention discloses a page information extraction method, which can perform sample clustering rapidly and accurately by determining a classification target value for a multi-sample page, and obtain an information extraction template through a clustering result of the sample clustering, thereby providing convenience for subsequent page information extraction.

The present invention also provides a page information extraction device. As shown in FIG. 5, the device includes:

a page attribute set obtaining module 201, configured to obtain a page attribute set of a multi-sample page, where an element of the page attribute set corresponds to an attribute block of one page, and the attribute block of the page includes multiple page sub-attributes;

the difference degree calculating module 203 is used for calculating the difference degree of each page and other pages according to the page attribute set;

a clustering target difference value determining module 205, configured to obtain a classification target value N, and obtain a clustering target difference value according to the classification target value;

the clustering module 207 is configured to cluster the multi-sample pages according to the clustering target difference values to obtain clustering results;

an information extraction template determining module 209, configured to generate an information extraction template corresponding to each class in the clustering result;

a to-be-extracted page acquiring module 2011, configured to acquire a page to be extracted;

the target class determining module 2013 is configured to calculate a difference between the page to be extracted and the cluster centers of the classes, and determine the class with the smallest difference as a target class;

and an information extraction module 2015, configured to extract information in the page to be extracted based on the information extraction template corresponding to the target class.

Specifically, the embodiments of the page information extraction device and the method of the present invention are all based on the same inventive concept. For details, please refer to the method embodiment, which is not described herein.

The embodiment of the invention also provides a computer storage medium, and the computer storage medium can store a plurality of instructions. The instruction may be suitable for being loaded by a processor and executing a method for extracting page information according to an embodiment of the present invention, which refers to the method embodiment.

Further, FIG. 6 shows a hardware structure diagram of an apparatus for implementing the method provided by the embodiment of the present invention, and the apparatus may participate in forming or containing the device or system provided by the embodiment of the present invention. As shown in FIG. 6, the device 10 may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, the device may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in FIG. 6 is only an illustration and is not intended to limit the structure of the electronic device. For example, device 10 may also include more or fewer components than shown in FIG. 6, or have a different configuration than shown in FIG. 6.

It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single, stand-alone processing module, or incorporated in whole or in part into any of the other elements in the device 10 (or mobile device). As referred to in the embodiments of the application, the data processing circuit acts as a processor control (e.g. selection of a variable resistance termination path connected to the interface).

The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the method described in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, so as to implement the above-mentioned page information extraction method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 104 may further include memory located remotely from processor 102, which may be connected to device 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of such networks may include wireless networks provided by the communication provider of the device 10. In one example, the transmission device 106 includes a network adapter (NIC) that can be connected to other network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 can be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.

The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the device 10 (or mobile device).

It should be noted that: the precedence order of the above embodiments of the present invention is only for description, and does not represent the merits of the embodiments. And specific embodiments thereof have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the device and server embodiments, since they are substantially similar to the method embodiments, the description is simple, and the relevant points can be referred to the partial description of the method embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for extracting page information is characterized by comprising the following steps:

acquiring a page to be extracted;

2. The method according to claim 1, wherein the obtaining a clustering target difference value according to the classification target value comprises:

3. The method of claim 2, wherein the calculating a clustering target difference value from the difference map and the classification target value comprises:

4. The method according to claim 1, wherein the generating an information extraction template corresponding to each class in the clustering result comprises:

5. The method of claim 1, wherein:

the degree of difference between any two pages A and B can be expressed as

where each page has n attributes, and A_i and B_i respectively denote the i-th attributes of pages A and B.

6. A page information extraction apparatus, characterized in that the apparatus comprises:

7. A computer storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement a method of extracting page information as claimed in any one of claims 1 to 5.

8. A page information extraction apparatus, characterized in that the apparatus comprises a processor and a memory, wherein the memory stores at least one instruction, at least one program, a set of codes or a set of instructions, and the at least one instruction, the at least one program, the set of codes or the set of instructions is loaded by the processor and executes a page information extraction method according to any one of claims 1 to 5.