CN110309364B

CN110309364B - Information extraction method and device

Info

Publication number: CN110309364B
Application number: CN201810176124.3A
Authority: CN
Inventors: 王策; 张锋
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-03-02
Filing date: 2018-03-02
Publication date: 2023-03-28
Anticipated expiration: 2038-03-02
Also published as: CN110309364A

Abstract

The embodiment of the application discloses an information extraction method and device, which are used for improving configuration efficiency. The method comprises the following steps: acquiring page information of a target page; establishing a document model according to the page information, and determining target path information corresponding to the target page according to a configuration file, wherein the configuration file is used for extracting the target document information and comprises the path information corresponding to at least one page; if the target path information contains target characters, determining node path information of at least one node in the document model, and determining target node path information matched with the target path information in the node path information; and extracting first document information through the target node path information, wherein the first document information is the target document information.

Description

Information extraction method and device

Technical Field

The present application relates to the field of computer applications, and in particular, to an information extraction method and apparatus.

Background

Information extraction (IE) refers to structuring the information contained in text into a table-like organized form. The input of an information extraction system is original text (which may be web page data or plain text content), and the output is information points in a fixed format. Information points are extracted from various documents and then integrated in a unified form.

For extracting structured page information, a configurator may configure, for each category of page, an XML Path Language (xpath) expression for the values of certain attributes of that category of page. After acquiring the page information of a page, a server may extract the corresponding attribute values from the page information through the xpath preconfigured by the configurator, thereby acquiring the required information.

Within a single page, the xpath corresponding to an attribute value has a fixed format. However, some attributes have multiple values, and the number of values for such an attribute differs from page to page. For example, on the encyclopedia page of mataire, the tag content includes: industry figure, economic figure, figure and Internet figure, as shown in FIG. 1, and the xpaths corresponding to the tags in the page are shown in Table 1 below; on the encyclopedia page of a certain Liu, the tag content includes: music figure, actor, singer, entertainment figure, producer and figure, as shown in FIG. 2, and the xpaths corresponding to the tags in the page are shown in Table 2 below.

Attribute name	Attribute value	xpath
Tag	Industry figure	//*[@id="open-tag-item"]/span[1]
Tag	Economic figure	//*[@id="open-tag-item"]/span[2]
Tag	Figure	//*[@id="open-tag-item"]/span[3]
Tag	Internet figure	//*[@id="open-tag-item"]/span[4]

TABLE 1

Attribute name	Attribute value	xpath
Tag	Music figure	//*[@id="open-tag-item"]/span[1]
Tag	Actor	//*[@id="open-tag-item"]/span[2]
Tag	Singer	//*[@id="open-tag-item"]/span[3]
Tag	Entertainment figure	//*[@id="open-tag-item"]/span[4]
Tag	Producer	//*[@id="open-tag-item"]/span[5]
Tag	Figure	//*[@id="open-tag-item"]/span[6]

TABLE 2

It can be seen that, to extract the values of such attributes from different pages, one xpath must be configured for each value. Configuring a large number of xpaths by enumeration greatly reduces configuration efficiency.

Disclosure of Invention

The embodiment of the application provides an information extraction method and device, which are used for improving configuration efficiency.

In view of this, a first aspect of the present application provides an information extraction method, including:

acquiring page information of a target page;

establishing a document model according to the page information, and determining target path information corresponding to the target page according to a configuration file, wherein the configuration file is used for extracting the target document information and comprises the path information corresponding to at least one page;

if the target path information contains target characters, determining node path information of at least one node in the document model, and determining target node path information matched with the target path information in the node path information;

and extracting first document information through the target node path information, wherein the first document information is the target document information.

In view of this, the second aspect of the present application provides an information extraction apparatus, including:

the acquisition module is used for acquiring page information of a target page;

the establishing module is used for establishing a document model according to the page information;

the first determining module is used for determining target path information corresponding to the target page according to a configuration file, wherein the configuration file is used for extracting target document information, and the configuration file comprises path information corresponding to at least one page;

the second determining module is used for determining node path information of at least one node in the document model when the target path information contains target characters;

a third determining module, configured to determine target node path information that matches the target path information in the node path information;

and the extraction module is used for extracting first document information through the target node path information, wherein the first document information is the target document information.

Optionally, in a possible implementation manner of the second aspect, the second determining module is specifically configured to determine an expression corresponding to the target path information, and generate the node path information according to the expression and a position of the node in the document model.

A third aspect of the present application provides an information extraction apparatus, including: a processor and a memory;

the memory is used for storing programs;

the processor is configured to execute the program, and specifically includes the following steps:

acquiring page information of a target page;

Optionally, in a possible implementation manner of the third aspect, the processor further specifically performs the following steps: and determining an expression corresponding to the target path information, and generating the node path information according to the expression and the position of the node in the document model.

A fourth aspect of the present application provides a computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of the first aspect described above.

According to the technical scheme, the embodiment of the application has the following advantages:

After the information extraction device obtains the page information of a page, it can establish a document model according to the page information and determine the target path information corresponding to the page according to the configuration file. If the target path information contains a target character, the information extraction device can determine node path information of at least one node in the document model, determine the target node path information that matches the target path information among the node path information, and extract the required document information through the target node path information. The configuration file is the file required for extracting target document information, and it contains the path information configured by a configurator for different pages. In this embodiment, for target path information containing a target character, the required document information can be extracted through the target node path information matched with the target path information. Based on this scheme, when configuring path information for a page, a configurator can uniformly configure a single xpath containing the target character for the different values of the same attribute, without configuring a large number of xpaths by enumeration, thereby improving configuration efficiency.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description are only some embodiments of the present application.

FIG. 1 is a diagram illustrating an attribute corresponding to a plurality of attribute values according to an embodiment of the present application;

FIG. 2 is another diagram illustrating that one attribute corresponds to a plurality of attribute values in the embodiment of the present application;

FIG. 3 is a schematic diagram of an information extraction system in an embodiment of the present application;

FIG. 4 is a flowchart of an embodiment of an information extraction method according to an embodiment of the present application;

FIG. 5 is a flowchart illustrating an information extraction method according to another embodiment of the present application;

FIG. 6 is a schematic diagram of an embodiment of an information extraction apparatus in an embodiment of the present application;

FIG. 7 is a schematic diagram of another embodiment of an information extraction apparatus in an embodiment of the present application;

FIG. 8 is a schematic diagram of another embodiment of an information extraction device in the embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments.

The terms "first," "second," "third," "fourth," and the like in the description and claims of this application and in the preceding drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

To facilitate understanding, some terms referred to in the implementation of the present application are described below:

XML Path Language (xpath): a language for locating a portion of an Extensible Markup Language (XML) document, XML being a subset of the Standard Generalized Markup Language. xpath is based on the tree structure of XML, which has different types of nodes, including element nodes, attribute nodes and text nodes, and provides the ability to look up nodes in the data structure tree.

Document Object Model (DOM) Tree: the DOM parses a HyperText Markup Language (HTML) page and generates an HTML tree structure and the corresponding access methods.

Traversal: visiting each node in the tree once, and only once, in sequence along a certain search route.
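The three terms above can be illustrated with a minimal sketch. It assumes Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of xpath, and a hypothetical well-formed page fragment; real HTML pages generally require an HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = (
    "<html><body>"
    "<div id='open-tag-item'><span>A</span><span>B</span></div>"
    "</body></html>"
)

# Build the tree-structured document model.
root = ET.fromstring(PAGE)

# Traversal: iter() visits every element exactly once, in document order.
visited = [elem.tag for elem in root.iter()]

# xpath lookup: locate the second span under the element with the given id.
second = root.find(".//*[@id='open-tag-item']/span[2]")
second_text = second.text
```

Here `visited` lists the element tags in the order they are reached, and `second_text` holds the text of the node located by the xpath.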

For convenience of understanding, the following describes a scenario in which the information extraction method and apparatus in the present application are applicable:

As shown in FIG. 3, for each page, a configurator configures xpaths for the values of certain attributes of the page. For the different values of the same attribute, the configurator uniformly configures one xpath containing a target character, so that these values correspond to the same xpath. The configurator loads the configured xpaths into the target server. The target server then extracts the target document information of a page from the page information obtained from other web servers, according to the information extraction method in the present application.

By the information extraction method, the server can extract needed target document information from a large number of webpages, and then can establish a knowledge base according to the target documents for users to use.

It should be understood that the information extraction method in the present application may be applied to other scenarios besides the scenario described above, and the present application is not limited thereto.

Based on the above scenario, the following describes an information extraction method in the present application, and referring to fig. 4, an embodiment of the information extraction method in the present application includes:

401. The information extraction device acquires page information of a target page;

after the information extraction device finishes loading the configuration file for extracting the target document information, the information extraction device obtains page information of a target page, where the target page may be an HTML page, an XHTML page, or other pages, and the application is not limited in detail.

The configuration file loaded by the information extraction apparatus includes path information corresponding to at least one page, and specifically, each page may be identified by a Uniform Resource Locator (URL) in the configuration file, that is, the configuration file includes a URL of at least one page and the path information corresponding to the URL.

It should be understood that the path information in the configuration file may be xpath information or other information, and the embodiment is not limited in this respect.

It should also be understood that, in this embodiment, the configuration file may also include other information, and is not limited herein.

402. The information extraction device establishes a document model according to the page information and determines target path information corresponding to a target page according to the configuration file;

after the information extraction device obtains the page information, a document model corresponding to the target page is established according to the page information, and the corresponding target path information of the target page is determined.

Specifically, the information extraction means may determine the target path information by: the information extraction device analyzes the page information of the target page to obtain the URL of the target page, firstly determines the URL identification code corresponding to the URL of the target page according to the corresponding relation in the configuration file, and then determines the attribute identification corresponding to the URL identification code and the path information corresponding to each attribute identification.

It should be understood that the information extraction device may also determine the target path information in other manners, and the specific application is not limited thereto.

Specifically, the document model in this embodiment may specifically be a tree-structured document model.

As an alternative, the information extraction device may establish the document model by: the information extraction device analyzes the page information of the target page through the document structured model to obtain HTML documents, and generates a document model with a Tree structure, namely a DOM Tree, according to the HTML documents.

It should be understood that the information extraction device may also establish the document model in other ways, and the specific application is not limited thereto.

403. The information extraction device determines whether the target path information contains a target character, if yes, go to step 404;

after the information extraction device determines the target path information, judging whether the path information contains target characters or not aiming at each target path information, if so, indicating that the attribute corresponding to the target path information has a plurality of attribute values, and extracting the target document information by the information extraction device through the processes described in the following 404 to 405; if not, the information extraction device may determine a node in the document model corresponding to the target path information, extract content corresponding to the node, that is, extract the target document information through the target path information, and the information extraction device may also execute other processes, which is not limited in this application.

404. The information extraction device determines node path information of at least one node in the document model and determines target node path information matched with the target path information in the node path information;

for any target path information, if the path information includes a target character, the information extraction apparatus may determine node path information of at least one node from the document model, and match the determined node path information with the target path information, and if the matching is successful, determine that the node path information is the target node path information, and the information extraction apparatus performs step 405.

405. The information extraction device extracts first document information through the target node path information;

after the information extraction device determines the target node path information, the document information (first document information) is extracted through the target node path information, and the extracted document information is the target document information to be extracted by the target node path information.

In this embodiment, for target path information including a target character, the required document information may be extracted through target node path information matched with the target path information, and based on this scheme, when configuring path information for a page, a configurator may configure one xpath including a target character uniformly for different values corresponding to the same attribute, without configuring a large number of xpaths through an enumeration method, thereby improving configuration efficiency.

Based on the above embodiment corresponding to fig. 4, it can be seen that the information extraction apparatus may determine the target node path information matched with the target path information in various ways, and one of the following ways is taken as an example to describe in detail the information extraction method in the present application, please refer to fig. 5, where another embodiment of the information extraction method in the present application includes:

501. The information extraction device acquires page information of a target page;

Specifically, each page may be identified by a Uniform Resource Locator (URL) in the configuration file, that is, the configuration file includes the URL of at least one page and the path information corresponding to the URL.

As an alternative, the target document information used by the configuration file for extraction may be an attribute value corresponding to an attribute in the page.

The configuration file may include the Uniform Resource Locator (URL) of at least one page, URL identification codes, attribute names, path information, and the correspondence between these items, where each URL corresponds to one URL identification code, each URL identification code corresponds to one or more attribute identifiers, and each attribute identifier corresponds to one piece of path information. It should be understood that the path information is used to extract the value of the attribute identified by the corresponding attribute identifier, and that an attribute with multiple values still has only one piece of path information (that is, the path information corresponding to each of its values is the same).

It should be understood that the attribute identifier may be an attribute name or other identifier, which is not limited herein, and the path information may be xpath information or other information, which is not limited herein.

For example, the configuration file preloaded by the information extraction apparatus includes a first file (pattern.conf) and a second file (xpath.conf). The first file contains URL patterns (pattern) written as regular expressions and the identification code (pattern_id) corresponding to each pattern, as shown in Table 3 below:

pattern_id	pattern
0	^https://baike\.***\.com/item/.+/\d+$
1	^https://baike\.***\.com/subview/\d+/\d+\.htm$

TABLE 3

The second file contains the URL identification code, the attribute name, and the xpath (path information), where for an attribute with multiple values only one xpath containing "%d" (the target character) is configured, as shown in Table 4 below:

TABLE 4

It should be understood that the configuration file may also include other information, and is not limited in particular.

502. The information extraction device establishes a document model according to the page information and determines target path information corresponding to a target page according to the configuration file;

after the information extraction device obtains the page information, a document model corresponding to the target page is established according to the page information, and the target path information corresponding to the target page is determined.

Specifically, the information extraction means may determine the target path information as follows: the information extraction device parses the page information of the target page to obtain the URL of the target page, first determines the URL identification code corresponding to that URL according to the correspondence in the configuration file, and then determines the attribute identifiers corresponding to the URL identification code and the path information corresponding to each attribute identifier. The path information corresponding to the determined attribute identifiers is the target path information.
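The lookup from URL to URL identification code to path information might be sketched as follows. This is an illustration under stated assumptions, not the configuration format of the application: the dictionaries stand in for pattern.conf and xpath.conf, the domain `example` replaces the domain elided as "***" in Table 3, and the attribute name `attr` and the function name `target_paths` are hypothetical.

```python
import re

# Hypothetical stand-in for pattern.conf: pattern_id -> URL regular expression.
# "example" is a placeholder for the domain elided in Table 3.
PATTERNS = {
    0: r"^https://baike\.example\.com/item/.+/\d+$",
    1: r"^https://baike\.example\.com/subview/\d+/\d+\.htm$",
}

# Hypothetical stand-in for xpath.conf: pattern_id -> {attribute name: xpath}.
# A multi-valued attribute gets a single xpath containing the target character "%d".
XPATHS = {
    0: {"attr": "/html/body/div[4]/div[2]/div/div[%d]/dl[1]/dd/h1"},
}


def target_paths(url):
    """Return the {attribute name: xpath} mapping configured for this URL."""
    for pattern_id, pattern in PATTERNS.items():
        if re.match(pattern, url):
            return XPATHS.get(pattern_id, {})
    return {}  # no pattern matches: nothing is configured for this page
```

For example, `target_paths("https://baike.example.com/item/foo/123")` returns the mapping configured for pattern_id 0, while a URL that matches no pattern yields an empty mapping.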

Specifically, the document model in this embodiment may specifically be a document model with a tree structure.

As an alternative, the information extraction apparatus may create the document model as follows: the information extraction device parses the page information of the target page through the document structured model to obtain HTML documents, and generates a document model with a tree structure, namely a DOM Tree, according to the HTML documents.

503. The information extraction device determines whether the target path information includes the target character, if yes, step 504 is executed, and if not, step 509 is executed;

after the information extraction device determines the target path information, judging whether the path information contains target characters or not for each target path information, if so, indicating that the attribute corresponding to the target path information has a plurality of attribute values, and extracting the target document information by the information extraction device through the flow described in the following 504 to 505; if not, go to step 509.

504. The information extraction device determines node path information of at least one node in the document model;

when the information extraction means determines that the target path information contains the target character, the information extraction means may determine node path information of at least one node in the document model, and perform step 505.

It should be understood that the position of a node in the document model may be described in different manners, that is, path information may use different expressions. Taking xpath as an example, the path expressions of xpath are shown in Table 5 below:

TABLE 5

Generally, target path information will take two expressions, namely "/" and "//".

As an alternative, the information extraction means may determine the node path information by: and determining an expression corresponding to the target path information, and generating the node path information of the node according to the expression and the position of the node in the document model, wherein the generated node path information is the same as the expression of the target path information.

Specifically, the information extraction device may traverse the DOM Tree from the root node, record a traversed tag (tag) path, combine the traversed tag path with the traversed node tag to obtain node path information of the node, and an expression used by the node path information is the same as an expression used by the target path information.
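The traversal described above can be sketched as follows, assuming the "/" expression and Python's `xml.etree.ElementTree` as the document model. The function name `node_paths` is hypothetical, and the convention of adding a positional index only when a tag has several same-tag siblings is an assumption chosen to resemble the example paths used later in the description.

```python
import xml.etree.ElementTree as ET


def node_paths(root):
    """Return {absolute path: element} for every element, visiting each once."""
    paths = {}

    def walk(elem, path):
        paths[path] = elem
        # Count same-tag children so an index is added only when it is needed.
        totals = {}
        for child in elem:
            totals[child.tag] = totals.get(child.tag, 0) + 1
        seen = {}
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            step = child.tag
            if totals[child.tag] > 1:
                step += f"[{seen[child.tag]}]"
            # Combine the recorded tag path with the tag of the traversed node.
            walk(child, f"{path}/{step}")

    walk(root, f"/{root.tag}")
    return paths
```

Given a tree whose `div` has two `span` children, the generated paths end in `span[1]` and `span[2]`, so they use the same "/" expression as a configured target path.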

505. The information extraction device determines a first character string sequence corresponding to the first node path information and determines a second character string sequence corresponding to the target path information;

after determining the first node path information, the information extraction apparatus determines a first character string sequence corresponding to the first node path information, and determines a second character string sequence corresponding to the target path information, where the first node path information is any one of the node path information determined in step 504.

As alternatives, the information extraction device may determine the first character string sequence each time it generates the node path information of one node, and then execute step 506; it may determine the first character string sequence for each piece of node path information only after the node path information of every node has been generated, and then execute step 506; or it may determine the first character string sequence for each piece of already generated node path information while the node path information of the remaining nodes is still being generated, and then execute step 506. This is not limited in this application.

Specifically, the information extraction means may determine the first character string sequence and the second character string sequence as follows: the information extraction device splits the node path information and the target path information at certain special characters in the path information to obtain the first character string sequence and the second character string sequence. Taking an xpath that uses the "/" expression as an example, the information extraction device can split the node path information and the target path information at the symbol "/". For example, if the node path information is "/html/body/div[4]/", the first character string sequence obtained after splitting at "/" is: html, body, div[4].
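The splitting step can be sketched in a few lines. The function name `split_path` is hypothetical; the sketch assumes the "/" expression and simply discards the empty pieces produced by leading or trailing separators.

```python
def split_path(path):
    """Split path information at "/" and drop the empty leading/trailing pieces."""
    return [part for part in path.split("/") if part]


node_sequence = split_path("/html/body/div[4]/")
```

With the node path information from the example above, `node_sequence` is the list `["html", "body", "div[4]"]`.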

506. The information extraction device matches the pairs of character strings at corresponding positions in the first character string sequence and the second character string sequence; if all pairs of character strings match, step 507 is executed, and if any pair of character strings does not match, step 509 is executed;

after the information extraction device determines the first character string sequence and the second character string sequence, the pairs of character strings at corresponding positions in the two sequences are matched.

That is, the information extraction device matches the character strings that occupy the same position in the two sequences; if the character strings at every position match, step 507 is executed, and if the character strings at any position do not match, step 509 is executed.

Specifically, the information extraction device judges whether the i-th character string in the first character string sequence is the same as the i-th character string in the second character string sequence. If so, the two character strings match. If not, the device judges whether the i-th character string of the second character string sequence contains the target character. If it does not, the two character strings do not match. If it does, the device judges whether the characters of the i-th character string of the first character string sequence, other than the first character corresponding to the target character, are the same as the characters of the i-th character string of the second character string sequence other than the target character. If they are not the same, the two character strings do not match; if they are the same, the two character strings match.

The first character corresponding to the target character can be determined by a special symbol; for example, the content inside the square brackets "[ ]" in the i-th character string of the first character string sequence corresponds to the content inside the square brackets "[ ]" in the i-th character string of the second character string sequence.

Taking a path that uses the "/" expression as an example: the target path information corresponding to a page is "/html/body/div[4]/div[2]/div/div[%d]/dl[1]/dd/h1", named input_xpath, and the first node path information is "/html/body/div[4]/div[2]/div/div[2]/dl[1]/dd/h1", named cur_xpath. cur_xpath is split at "/", and the resulting array (the first character string sequence), named cur_vec, is: html, body, div[4], div[2], div, div[2], dl[1], dd, h1. input_xpath is split at "/", and the resulting array (the second character string sequence), named input_vec, is: html, body, div[4], div[2], div, div[%d], dl[1], dd, h1.

The two arrays are traversed together. The current value of cur_vec is obtained and named cs; the current value of input_vec is obtained and named is. If cs equals is, the current tag matches; for example, when the current value of cur_vec is html and the current value of input_vec is html, cs = is, the current tag matches, and the next values are obtained for matching. If cs and is are not equal, for example when the current value of cur_vec is div[2] and the current value of input_vec is div[%d], and is contains the target character %d, the contents of "[ ]" in both cs and is are replaced by 0, that is, div[2] becomes div[0] and div[%d] becomes div[0]. After the replacement cs = is, so the current tag matches, and matching continues with the next tag.

Optionally, before matching the character strings in the first character string sequence with those in the second character string sequence, the information extraction device may first determine whether the length of the first character string sequence equals the length of the second character string sequence. If not, the information extraction device determines that the first node path information does not match the target path information and performs step 509; if so, the information extraction device proceeds with the matching of this step.

For example, in the above example, after cur_xpath and input_xpath are split on "/", if the two resulting arrays differ in length, input_xpath and cur_xpath do not match; if the two arrays have the same length, the arrays are traversed for matching.
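A minimal Python sketch of the whole-path comparison, including the optional length check, might look like the following. The names split_path, path_matches, node_xpath and target_xpath are hypothetical and chosen only for this illustration; node_xpath stands for the first node path information and target_xpath for the configured target path information.

```python
import re

TARGET_CHAR = "%d"                       # assumed placeholder spelling
_BRACKET = re.compile(r"\[[^\]]*\]")     # one bracketed index, e.g. "[2]"


def split_path(xpath: str) -> list:
    """Split a "/"-separated path into its tags, dropping the empty leading item."""
    return [tag for tag in xpath.split("/") if tag]


def path_matches(node_xpath: str, target_xpath: str) -> bool:
    node_vec = split_path(node_xpath)      # first character string sequence
    target_vec = split_path(target_xpath)  # second character string sequence
    # Optional early exit: sequences of different length cannot match.
    if len(node_vec) != len(target_vec):
        return False
    for cs, is_ in zip(node_vec, target_vec):
        if cs == is_:
            continue
        if TARGET_CHAR not in is_:
            return False
        if _BRACKET.sub("[0]", cs) != _BRACKET.sub("[0]", is_):
            return False
    return True                            # every pair of tags matched


node = "/html/body/div[4]/div[2]/div/div[2]/dl[1]/dd/h1"
target = "/html/body/div[4]/div[2]/div/div[%d]/dl[1]/dd/h1"
print(path_matches(node, target))
```

The length check is cheap and lets most non-matching nodes be rejected before any tag-by-tag comparison takes place.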

507. The information extraction device determines that the first node path information is target node path information matched with the target path information;

When every pair of character strings at corresponding positions in the first character string sequence and the second character string sequence matches, the information extraction device determines that the first node path information matches the target path information, namely that the first node path information is the target node path information.

508. The information extraction device extracts first document information through the target node path information;

after the information extraction device determines target node path information matched with the target path information, first document information is extracted through the target node path information, and the first document information is target document information to be extracted by the target path information, namely document information needed by a server.

Specifically, the first document information is extracted through the target node path information, that is, the node pointed by the target node path information is determined, and the content corresponding to the node is extracted.

It should be noted that, in the present application, after the information extraction device extracts the first document information, the first document information may be output, specifically, the first document information is an attribute value corresponding to the target path information, and the attribute value corresponds to the target attribute, so that the information extraction device may output the attribute name of the attribute and the attribute value.

For example, if the target node path information matches the xpath "/html/body/div[4]/div[2]/div/div[%d]/div[4]" in the configuration file described in table 4, the document information extracted through the target node path information is "person", and the output result is <tag, person>, which corresponds to the target page.
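The extraction and output steps can be illustrated end to end with the short Python sketch below. It is not the implementation of this application: the page content, the CONFIG mapping from attribute name to xpath, and the function names are all assumptions made for the example. It also assumes that a node's path carries an index only when the node has siblings with the same tag, as the example paths in this description suggest, and it uses the standard library xml.etree.ElementTree as a stand-in document model.

```python
import re
import xml.etree.ElementTree as ET

TARGET_CHAR = "%d"                       # assumed placeholder spelling
_BRACKET = re.compile(r"\[[^\]]*\]")


def node_paths(root):
    """Yield (absolute_path, element) for every element of the document model."""
    def walk(elem, path):
        yield path, elem
        totals = {}
        for child in elem:
            totals[child.tag] = totals.get(child.tag, 0) + 1
        seen = {}
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            step = child.tag
            if totals[child.tag] > 1:    # index only when siblings share the tag
                step = f"{child.tag}[{seen[child.tag]}]"
            yield from walk(child, f"{path}/{step}")
    yield from walk(root, f"/{root.tag}")


def path_matches(node_xpath, target_xpath):
    node_vec, target_vec = node_xpath.split("/"), target_xpath.split("/")
    if len(node_vec) != len(target_vec):
        return False
    for cs, is_ in zip(node_vec, target_vec):
        if cs == is_:
            continue
        if TARGET_CHAR not in is_:
            return False
        if _BRACKET.sub("[0]", cs) != _BRACKET.sub("[0]", is_):
            return False
    return True


def extract(page, config):
    """Return (attribute name, attribute value) pairs for every matching node."""
    root = ET.fromstring(page)
    results = []
    for attr_name, target_xpath in config.items():
        for path, elem in node_paths(root):
            if path_matches(path, target_xpath):
                results.append((attr_name, (elem.text or "").strip()))
    return results


# Hypothetical page and configuration, for illustration only.
PAGE = "<html><body><div><span>header</span></div><div><h1>person</h1></div></body></html>"
CONFIG = {"tag": "/html/body/div[%d]/h1"}
print(extract(PAGE, CONFIG))
```

With this hypothetical page, the single configured path matches the h1 node under the second div, so the sketch outputs the pair ("tag", "person"), analogous to the <tag, person> result above.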

509. The information extraction device executes other processes.

When the information extraction device determines that the target path information does not include the target character, the information extraction device executes another process, specifically, the information extraction device may determine a node in the document model corresponding to the target path information, and extract content corresponding to the node, that is, extract the target document information through the target path information.
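For the branch in which the target path information contains no target character, a direct lookup is sufficient. The sketch below illustrates this with Python's standard xml.etree.ElementTree, whose find method accepts a limited XPath subset including positional predicates; the function names, the sample page and the "%d" placeholder spelling are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TARGET_CHAR = "%d"   # assumed placeholder spelling


def needs_matching(target_xpath: str) -> bool:
    """True when the configured path contains the target character."""
    return TARGET_CHAR in target_xpath


def extract_direct(root, target_xpath: str):
    """Look up the single node addressed by a fully specified absolute path."""
    prefix = "/" + root.tag
    if target_xpath == prefix:
        return root.text
    if not target_xpath.startswith(prefix + "/"):
        return None                      # path does not start at the root tag
    # ElementTree paths are relative to the root element, hence the "." prefix.
    node = root.find("." + target_xpath[len(prefix):])
    return None if node is None else node.text


page = "<html><body><div><h1>Title</h1></div><div><p>first</p><p>second</p></div></body></html>"
root = ET.fromstring(page)

path = "/html/body/div[2]/p[2]"
if not needs_matching(path):
    print(extract_direct(root, path))
```

A path that does contain the target character would instead go through the node path matching described in the preceding steps.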

When the information extraction device determines that a pair of character strings in the first character string sequence and the second character string sequence does not match, the information extraction device executes other processes; specifically, the information extraction device may determine that the first node path information does not match the target path information.

In this embodiment, for target path information including a target character, the required document information may be extracted through target node path information matched with the target path information, and based on this scheme, when configuring path information for a page, an operation and maintenance worker may configure one xpath including a target character uniformly for different values corresponding to the same attribute, without configuring a large number of xpaths through an enumeration method, thereby improving configuration efficiency.

Secondly, the embodiment provides a plurality of specific ways for determining the path information of the target node, and improves the flexibility of the scheme.

With reference to fig. 6, an embodiment of an information extraction apparatus in the present application includes:

an obtaining module 601, configured to obtain page information of a target page;

the establishing module 602 is configured to establish a document model according to the page information;

a first determining module 603, configured to determine, according to a configuration file, target path information corresponding to a target page, where the configuration file is used to extract required target document information, and the configuration file includes path information corresponding to at least one page;

a second determining module 604, configured to determine node path information of at least one node in the document model when the target path information includes the target character;

a third determining module 605, configured to determine target node path information that matches the target path information in the node path information;

the extracting module 606 is configured to extract first document information through the target node path information, where the first document information is target document information.

It should be understood that the above-mentioned process executed by each module in the information extraction apparatus corresponding to fig. 6 may refer to the process of the method embodiment corresponding to fig. 4, which is not described herein again.

In this embodiment, for target path information including a target character, the extraction module 606 may extract required document information through target node path information matched with the target path information, and based on this scheme, when configuring path information for a page, an operation and maintenance worker may configure one xpath including a target character uniformly for different values corresponding to the same attribute, without configuring a large number of xpaths through an enumeration method, thereby improving configuration efficiency.

To facilitate understanding of the information extraction apparatus in the present application, referring to fig. 7, an embodiment of an information extraction apparatus in the embodiment of the present application includes:

an obtaining module 701, configured to obtain page information of a target page;

an establishing module 702, configured to establish a document model according to the page information;

a first determining module 703, configured to determine, according to a configuration file, target path information corresponding to a target page, where the configuration file is used to extract required target document information, and the configuration file includes path information corresponding to at least one page;

a second determining module 704, configured to determine node path information of at least one node in the document model when the target path information includes the target character;

a third determining module 705, configured to determine target node path information that matches the target path information in the node path information;

an extraction module 706, configured to extract first document information through the target node path information, where the first document information is target document information;

wherein the third determining module 705 comprises:

a first determining unit 7051, configured to determine a first character string sequence corresponding to the first node path information;

a second determining unit 7052, configured to determine a second character string sequence corresponding to the target path information, where the first node path information is node path information of any node in the path information of at least one node;

a matching unit 7053, configured to match a plurality of pairs of character strings corresponding to positions in the first character string sequence and the second character string sequence;

a third determining unit 7054, configured to determine, when the plurality of pairs of character strings are all matched, that the node path information of the first node is target node path information matched with the target path information;

optionally, the information extracting apparatus may further include:

a judging module 707, configured to judge whether a sequence length of the first string sequence is equal to a sequence length of the second string sequence;

a fourth determining module 708, configured to determine that the first node path information does not match the target path information when the judging module 707 determines that the sequence lengths are not equal;

the matching unit 7053 is specifically configured to match the plurality of pairs of character strings at corresponding positions in the first character string sequence and the second character string sequence when the judging module 707 determines that the sequence lengths are equal.

Optionally, the matching unit 7053 may include:

a first judging subunit 70531, configured to judge whether the ith character string of the first character string sequence is the same as the ith character string of the second character string sequence;

a first determining subunit 70532, configured to determine that the ith character string of the first character string sequence matches the ith character string of the second character string sequence when the first judging subunit 70531 determines that they are the same;

a second judging subunit 70533, configured to, when the first judging subunit 70531 determines that they are not the same, judge whether the ith character string of the second character string sequence includes the target character;

a second determining subunit 70534, configured to determine that the ith character string of the first character string sequence does not match the ith character string of the second character string sequence when the second judging subunit 70533 determines that the target character is not included;

a third judging subunit 70535, configured to, when the second judging subunit 70533 determines that the target character is included, judge whether the other characters, except the first character corresponding to the target character, in the ith character string of the first character string sequence are the same as the other characters, except the target character, in the ith character string of the second character string sequence;

a third determining subunit 70536, configured to determine that the ith character string of the first character string sequence does not match the ith character string of the second character string sequence when the third judging subunit 70535 determines that they are not the same;

a fourth determining subunit 70537, configured to determine that the ith character string of the first character string sequence matches the ith character string of the second character string sequence when the third judging subunit 70535 determines that they are the same.

It should be understood that, for the process executed by each module in the information extraction apparatus corresponding to fig. 7, reference may be made to the process of the method embodiment corresponding to fig. 5, which is not described herein again.

In this embodiment, for target path information including a target character, the extraction module 706 may extract required document information through target node path information matched with the target path information, and based on this scheme, when configuring path information for a page, an operation and maintenance worker may configure one xpath including a target character uniformly for different values corresponding to the same attribute, without configuring a large number of xpaths through an enumeration method, thereby improving configuration efficiency.

Secondly, this embodiment provides a specific way of determining the target node path information, which improves the realizability of the scheme.

The information extraction device in the present application is described above from the perspective of functional modules, and the information extraction device in the present application is described below from the perspective of physical hardware, and fig. 8 is a schematic structural diagram of an information extraction device 80 according to an embodiment of the present invention. The information extraction apparatus 80 may include an input device 810, an output device 820, a processor 830, and a memory 840. The output device in the embodiments of the present invention may be a display device.

Memory 840 may include both read-only memory and random access memory and provides instructions and data to processor 830. A portion of Memory 840 may also include Non-Volatile Random Access Memory (NVRAM).

Memory 840 stores the following elements, executable modules or data structures, or a subset or an extended set thereof:

Operating instructions: including various operational instructions for performing various operations.

Operating system: including various system programs for implementing various basic services and for handling hardware-based tasks.

In this embodiment of the present invention, the processor 830 is configured to perform the following steps:

acquiring page information of a target page, establishing a document model according to the page information, and determining target path information corresponding to the target page according to a configuration file, wherein the configuration file is used for extracting required target document information and comprises path information corresponding to at least one page; and if the target path information contains the target character, determining node path information of at least one node in the document model, determining target node path information matched with the target path information in the node path information, and extracting first document information through the target node path information, wherein the first document information is the target document information.

Optionally, the processor 830 is specifically configured to: determining a first character string sequence corresponding to first node path information, and determining a second character string sequence corresponding to target path information, wherein the first node path information is node path information of any node in the path information of at least one node; matching a plurality of pairs of character strings corresponding to positions in the first character string sequence and the second character string sequence; and if the plurality of pairs of character strings are all matched, determining that the node path information of the first node is the target node path information matched with the target path information.

Optionally, the processor 830 is further configured to perform the following process: judging whether the sequence length of the first character string sequence is equal to the sequence length of the second character string sequence; if not, determining that the first node path information is not matched with the target path information; and if so, executing a step of matching a plurality of pairs of character strings corresponding to the positions in the first character string sequence and the second character string sequence.

Optionally, the processor 830 is specifically configured to: judging whether the ith character string of the first character string sequence is the same as the ith character string of the second character string sequence; if so, determining that the ith character string of the first character string sequence is matched with the ith character string of the second character string sequence; if not, judging whether the ith character string of the second character string sequence contains the target character; if not, determining that the ith character string of the first character string sequence does not match the ith character string of the second character string sequence; if yes, judging whether the characters except the first character corresponding to the target character in the ith character string of the first character string sequence are the same as the characters except the target character in the ith character string of the second character string sequence; if not, determining that the ith character string of the first character string sequence is not matched with the ith character string of the second character string sequence; if so, determining that the ith character string of the first character string sequence is matched with the ith character string of the second character string sequence.

Optionally, the processor 830 is specifically configured to: determining an expression corresponding to the target path information; and generating node path information according to the expression and the position of the node in the document model.

The processor 830 controls the operation of the information extracting apparatus 80, and the processor 830 may also be called a Central Processing Unit (CPU). Memory 840 may include both read-only memory and random-access memory, and provides instructions and data to processor 830. A portion of the memory 840 may also include NVRAM. In a particular application, the various components of the information extraction device 80 are coupled together by a bus system 850, wherein the bus system 850 may include a power bus, a control bus, a status signal bus, and the like, in addition to a data bus. For clarity of illustration, however, the various busses are illustrated as the bus system 850.

The method disclosed in the above embodiments of the present invention may be applied to the processor 830, or implemented by the processor 830. The processor 830 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 830 or by instructions in the form of software. The processor 830 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the various methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 840, and the processor 830 reads the information in the memory 840 and performs the steps of the above method in combination with its hardware.

The related description of fig. 8 can be understood by referring to the related description and effects of the method portions of fig. 4 and 5, and will not be described in detail herein.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive (U disk), a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. An information extraction method, comprising:

acquiring page information of a target page;

establishing a document model according to the page information, and determining target path information corresponding to the target page according to a configuration file, wherein the configuration file is used for extracting the target document information, the configuration file comprises at least one path information corresponding to the page, and different attribute values corresponding to the same attribute in the page are uniformly configured with one path information containing target characters;

2. The method of claim 1, wherein the determining the target node path information that matches the target path information in the node path information comprises:

determining a first character string sequence corresponding to first node path information, and determining a second character string sequence corresponding to target path information, wherein the first node path information is node path information of any node in the path information of the at least one node;

matching a plurality of pairs of character strings corresponding to the positions in the first character string sequence and the second character string sequence respectively;

and if the plurality of pairs of character strings are all matched, determining that the node path information of the first node is the target node path information matched with the target path information.

3. The method of claim 2, wherein the matching the pairs of strings corresponding to the positions in the first string sequence and the second string sequence comprises:

judging whether the sequence length of the first character string sequence is equal to the sequence length of the second character string sequence or not;

if not, determining that the first node path information is not matched with the target path information;

and if so, executing the step of respectively matching a plurality of pairs of character strings corresponding to the positions in the first character string sequence and the second character string sequence.

4. The method according to claim 2 or 3, wherein the matching of the pairs of strings corresponding to the positions in the first string sequence and the second string sequence respectively comprises:

judging whether the ith character string of the first character string sequence is the same as the ith character string of the second character string sequence or not;

and if so, determining that the ith character string of the first character string sequence is matched with the ith character string of the second character string sequence.

5. The method of claim 4, further comprising:

if not, judging whether the ith character string of the second character string sequence contains the target character;

if not, determining that the ith character string of the first character string sequence does not match the ith character string of the second character string sequence;

if yes, judging whether other characters except the first character corresponding to the target character in the ith character string of the first character string sequence are the same as other characters except the target character in the ith character string of the second character string sequence;

if not, determining that the ith character string of the first character string sequence is not matched with the ith character string of the second character string sequence;

if yes, determining that the ith character string of the first character string sequence is matched with the ith character string of the second character string sequence.

6. The method of any of claims 1 to 3, wherein said determining node path information for at least one node in said document model comprises:

determining an expression corresponding to the target path information;

and generating the node path information according to the expression and the position of the node in the document model.

7. An information extraction apparatus, characterized by comprising:

the acquisition module is used for acquiring page information of a target page;

the first determining module is used for determining target path information corresponding to the target page according to a configuration file, wherein the configuration file is used for extracting target document information, the configuration file comprises path information corresponding to at least one page, and different attribute values corresponding to the same attribute in the page are uniformly configured with path information containing target characters;

8. The apparatus of claim 7, wherein the third determining module comprises:

the first determining unit is used for determining a first character string sequence corresponding to the first node path information;

a second determining unit, configured to determine a second character string sequence corresponding to target path information, where the first node path information is node path information of any node in the path information of the at least one node;

the matching unit is used for matching a plurality of pairs of character strings corresponding to the positions in the first character string sequence and the second character string sequence respectively;

and a third determining unit, configured to determine, when the plurality of pairs of character strings are all matched, that the node path information of the first node is target node path information matched with the target path information.

9. The apparatus of claim 8, further comprising:

the judging module is used for judging whether the sequence length of the first character string sequence is equal to the sequence length of the second character string sequence;

a fourth determining module, configured to determine that the first node path information does not match the target path information when the judging module determines that the sequence lengths are not equal;

the matching unit is specifically configured to match a plurality of pairs of character strings corresponding to positions in the first character string sequence and the second character string sequence, respectively, when the judging module determines that the sequence lengths are equal.

10. The apparatus according to claim 8 or 9, wherein the matching unit comprises:

a first judging subunit, configured to judge whether an ith character string of the first character string sequence is the same as an ith character string of the second character string sequence;

a first determining subunit configured to determine that an ith character string of the first character string sequence matches an ith character string of the second character string sequence when the first judging subunit determines that they are the same;

a second judging subunit, configured to, when the first judging subunit determines that the first character string sequence is not the same, judge whether an ith character string of the second character string sequence includes the target character;

a second determining subunit, configured to determine that the ith character string of the first character string sequence does not match the ith character string of the second character string sequence when the second judging subunit determines that the target character is not included;

a third judging subunit configured to, when the second judging subunit determines that the target character is included, judge whether or not characters other than the first character corresponding to the target character in the ith character string of the first character string sequence are the same as characters other than the target character in the ith character string of the second character string sequence;

a third determining subunit, configured to determine that the ith character string of the first character string sequence does not match the ith character string of the second character string sequence when the third judging subunit determines that they are not the same;

a fourth determining subunit, configured to determine that the ith character string of the first character string sequence matches the ith character string of the second character string sequence when the third judging subunit determines that they are the same.
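The decision chain of the subunits above can be sketched for a single pair of strings as follows. This is a hedged illustration only: the target character is assumed to be `*`, the function name `strings_match` is hypothetical, and, as a simplifying assumption beyond the claim wording, the target character is allowed to stand for one or more characters so that multi-digit indexes such as `li[12]` also match `li[*]`. Only the first occurrence of the target character is treated as a wildcard.

```python
TARGET_CHAR = "*"  # assumed target character; the claims do not fix its value


def strings_match(first: str, second: str, target_char: str = TARGET_CHAR) -> bool:
    """Compare the ith string of the node path with the ith string of the target path."""
    # Step 1: identical strings match.
    if first == second:
        return True

    # Step 2: different strings without the target character do not match.
    if target_char not in second:
        return False

    # Step 3: compare the characters other than those covered by the target
    # character. Everything before and after the target character must agree.
    prefix, _, suffix = second.partition(target_char)
    covered = len(first) - len(prefix) - len(suffix)
    return covered >= 1 and first.startswith(prefix) and first.endswith(suffix)
```

With these assumptions, `strings_match("li[3]", "li[*]")` and `strings_match("li[12]", "li[*]")` both succeed, whereas `strings_match("ul[3]", "li[*]")` fails because the characters outside the wildcard differ.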

11. An information extraction apparatus, characterized by comprising: a processor and a memory;

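the memory is configured to store a program;

the processor is configured to execute the program in the memory to perform the following steps: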

acquiring page information of a target page;

12. The apparatus of claim 11, wherein the processor is further configured to perform the steps of:

determining a first character string sequence corresponding to first node path information, and determining a second character string sequence corresponding to the target path information, wherein the first node path information is the node path information of any one node in the node path information of the at least one node;

and if the plurality of pairs of character strings at corresponding positions in the first character string sequence and the second character string sequence all match, determining that the node path information of the first node is the target node path information matched with the target path information.

13. The apparatus of claim 12, wherein the processor further performs the steps of:

judging whether the sequence length of the first character string sequence is equal to the sequence length of the second character string sequence;

14. The apparatus according to claim 12 or 13, wherein the processor is configured to perform the steps of:

if the ith character string of the first character string sequence is the same as the ith character string of the second character string sequence, determining that the ith character string of the first character string sequence matches the ith character string of the second character string sequence;

15. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any of claims 1-6.