CN102495847B

CN102495847B - Network commodity information extraction method

Info

Publication number: CN102495847B
Application number: CN201110363931.4A
Authority: CN
Inventors: 刘崟; 吴浩苗
Original assignee: ZHEJIANG PANSHI INFORMATION TECHNOLOGY Co Ltd
Current assignee: Zhejiang Panxing Shuzhi Technology Co ltd
Priority date: 2011-11-16
Filing date: 2011-11-16
Publication date: 2017-04-19
Anticipated expiration: 2031-11-16
Also published as: CN102495847A

Abstract

The invention relates to a network commodity information extraction method comprising the steps of: (1) generating an initial template for network commodity information extraction using a template generation tool; and (2) applying the initial template to extract commodity information from websites. With the template generation tool, templates are generated, processed, and modified during information extraction, so the information is extracted semi-automatically. Required specific information, such as commodity names, commodity image URLs (uniform resource locators), and prices, can be extracted from web pages and labeled quickly and accurately. The method makes the operation visual, simplifies the related operations, reduces the error rate, and improves work efficiency.

Description

A network commodity information extraction method

Technical field

The present invention relates to a network commodity information extraction method.

Background art

In recent years, with the rapid development of e-commerce, enterprises and individuals of all kinds have begun carrying out marketing activities over the Internet. As a result, the Internet has accumulated a vast amount of commodity information and has become the largest source of such information. Much of this information has great commercial value, such as price, place of origin, distributor, sales volume, and customer reviews.

Classifying and analyzing these data and presenting them in a suitable form can help enterprises make business decisions. For example, an enterprise that manufactures and sells pressure cookers needs to decide how to price its own products, how to keep track of rapidly changing market prices in its industry (particularly competitors' price changes), how to learn the geographic scope of competitors' sales and their sales channels, and how to compare and position the features of its own products. All of these tasks rest on extracting information accurately from web pages.

At present, web page information extraction falls into three main types: manual, fully automatic, and semi-automatic. Manual extraction is accurate, but the workload is large, efficiency is low, and the cost is high. Fully automatic extraction is inexpensive and efficient, but its accuracy is poor and it is technically difficult. Semi-automatic extraction relies on a small amount of manual annotation, so the workload is small, and human intervention helps ensure accuracy, which makes it the more feasible approach.

Summary of the invention

The object of the present invention is to overcome the above deficiencies of the prior art and to provide a semi-automatic network commodity information extraction method that quickly and accurately extracts and labels the required specific information from web pages.

The technical solution adopted by the present invention to solve the above problem is a network commodity information extraction method, characterized in that the method comprises the following steps:

1. Generate an initial template for network commodity information extraction using a template generation tool;

2. Extract commodity information from a website using the initial template. This step includes:

a. On the product category page of the website, annotate the page manually and extract all commodity category names and list page URLs from the page, adding them to a category queue;

b. Take the list page at the head of the category queue and have it annotated manually. When annotation is complete, store the category path and the generated template in a category-template correspondence table. Extract multiple commodity detail page URLs and the next-page URL from the list page, put the commodity detail page URLs into a web page pool, and append the next-page URL to the tail of the category queue;

c. Select a detail page from the web page pool and have it annotated manually. When annotation is complete, store the resulting template in the category-template correspondence table as well, so that each category path has two templates, one for list pages and one for detail pages;

d. Process the URLs in the web page pool one by one using the detail page template for that category until the web page pool is empty;

e. Take the list page at the head of the category queue and check whether a list page template for its category path exists in the category-template correspondence table;

If one exists, analyze the page with that template;

If none exists, try the templates of the other categories in the category-template correspondence table one by one. If the extracted data are correct, add the new correspondence to the category-template correspondence table. If the data are wrong, submit the page for manual template annotation and add the resulting template to the category-template correspondence table as well.

f. Process the list pages in the category queue one by one until the queue is empty.

Compared with the prior art, the present invention has the following advantages and effects:

1. With the template tool of the present invention, an ordinary user can, after a few minutes of simple training, define an information extraction template within 10 minutes without the involvement of a programmer familiar with HTML, which lowers the skill requirements for the work. The visual interface of the extraction tool makes the work more intuitive, simplifies the related operations, reduces the error rate, and improves work efficiency.

2. With the extraction flow of the present invention, the various differing cases among similar web pages can be found automatically, which makes manual handling easier. The design of the extraction flow also makes it easier to find and reuse earlier templates, effectively reducing the number of manually customized templates.

Description of the drawings

Fig. 1 is a schematic diagram of the commodity information extraction workflow according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of the commodity information extracted in an embodiment of the present invention.

Fig. 3 is a schematic diagram of the category-template correspondence table established by the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and an embodiment.

Referring to Fig. 1 to Fig. 3, this embodiment takes the "food" category of "Taobao" as an example and describes the whole commodity information extraction process in detail.

1. Generate an initial template for network commodity information extraction using a template generation tool. The template generation tool is a browser plug-in designed by the applicant of the present invention. The procedure is as follows:

(1) The user browses web pages freely in the browser until reaching a page from which information needs to be extracted;

(2) The user clicks the "template generation plug-in" icon on the browser toolbar to start the extraction tool;

(3) The user clicks the "start collection" button to begin the extraction process. From this point on, a blue frame appears as the mouse moves over each part of the page, marking the position to be extracted;

(4) The user clicks the "new landmark" or "new record" button to create a "landmark" or a "record", then selects the region to extract in the page. The template generation tool automatically produces the corresponding path according to heuristic rules;

(5) The user attaches information such as a variable name and remarks to this path to indicate its meaning;

(6) Steps (4) and (5) are repeated until all fields of interest have been annotated;

(7) The user clicks the "apply" button, and the template generation tool extracts the corresponding field contents from the current page using the currently defined template and displays them;

(8) If the contents are correct, the user clicks the "save" button to save the template. If they are not, the user can make some manual adjustments to the template and then save it.
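The source does not disclose the template format or the path syntax that the plug-in generates. As a rough illustration only, a template can be pictured as a list of named paths that is applied to a page, as in step (7). The sketch below assumes a well-formed page and uses the path syntax of Python's `xml.etree.ElementTree`; the names `TemplateField` and `apply_template` and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class TemplateField:
    name: str       # variable name entered by the user in step (5)
    path: str       # path produced in step (4); ElementTree path syntax is an assumption
    note: str = ""  # free-form remark


def apply_template(fields, page_xml):
    """Step (7): extract every annotated field from the current page."""
    root = ET.fromstring(page_xml)
    result = {}
    for field in fields:
        node = root.find(field.path)
        # A path that matches nothing yields None so the user can spot it.
        result[field.name] = node.text.strip() if node is not None and node.text else None
    return result


# Hypothetical, well-formed sample page.
PAGE = """<html><body>
  <div class="item"><h1> Dark chocolate 100g </h1><span class="price">25.00</span></div>
</body></html>"""

template = [
    TemplateField("name", ".//div[@class='item']/h1", "commodity name"),
    TemplateField("price", ".//span[@class='price']", "listed price"),
]

extracted = apply_template(template, PAGE)
print(extracted)
```

Real commodity pages are rarely well-formed XML, so a production tool would need a tolerant HTML parser; the sketch only shows the relationship between annotated paths, variable names, and extracted values.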

a. On the product category page of the website, annotate the page manually and extract all commodity category names and list page URLs from the page, adding them to a category queue.

Specifically: the "goods catalogue" page of the "Taobao" food section is extracted. Manual annotation generates a template that returns a List-type object $foodCat holding all commodity category names and list page URLs in the page, and these are added to a category queue.
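The shape of the $foodCat result and of the queue can be sketched as follows. The record keys, the example URLs, and the function name `enqueue_categories` are assumptions made for illustration; the source only states that the list holds category names and list page URLs.

```python
from collections import deque

# Hypothetical result of the category page template ($foodCat in the text).
food_cat = [
    {"category": "chocolate/DIY chocolate", "list_url": "http://example.com/list/chocolate"},
    {"category": "preserves", "list_url": "http://example.com/list/preserves"},
]


def enqueue_categories(records, queue):
    """Add one (category path, list page URL) pair per extracted category."""
    for record in records:
        queue.append((record["category"], record["list_url"]))
    return queue


category_queue = enqueue_categories(food_cat, deque())
print(category_queue[0])  # the head of the queue is processed first in step b
```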

b. Take the list page at the head of the category queue and have it annotated manually. When annotation is complete, store the category path and the generated template in a category-template correspondence table. Extract multiple commodity detail page URLs and the next-page URL from the list page, put the commodity detail page URLs into a web page pool, and append the next-page URL to the tail of the category queue.

Specifically: the list page "chocolate/DIY chocolate" at the head of the category queue is annotated manually with the template generation tool, which generates the list page extraction template for the "chocolate" category. The template and the category path are stored in a category-template correspondence table. The template extracts multiple commodity detail page URLs and returns two results. The first is a List-type object $foodList that stores the detail page URL and title of each chocolate product; it is a list because one page contains multiple commodities. The second is a variable nextPage that stores the next-page URL. Because the last page has no next page, this variable is optional. The commodity detail page URLs are put into a web page pool, and the next-page URL is appended to the tail of the category queue.
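How the two results are routed can be sketched in a few lines. The dictionary keys `foodList` and `nextPage` follow the variable names in the text; the function name, the record layout, and the example URLs are assumptions.

```python
from collections import deque


def handle_list_result(category, result, page_pool, category_queue):
    """Route one list page extraction result as described in step b."""
    # foodList: one entry per commodity on the page (detail page URL plus title).
    for item in result["foodList"]:
        page_pool.append(item["url"])
    # nextPage is optional: the last page of a category has none.
    next_page = result.get("nextPage")
    if next_page:
        category_queue.append((category, next_page))


pool = deque()
queue = deque([("preserves", "http://example.com/list/preserves")])
result = {
    "foodList": [
        {"url": "http://example.com/item/1", "title": "Dark chocolate"},
        {"url": "http://example.com/item/2", "title": "DIY chocolate kit"},
    ],
    "nextPage": "http://example.com/list/chocolate?page=2",
}
handle_list_result("chocolate/DIY chocolate", result, pool, queue)
```

Appending the next page to the tail rather than the head means that the second page of "chocolate" waits behind the first page of every other category already in the queue.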

c. Select a detail page from the web page pool and have it annotated manually. When annotation is complete, store the resulting template in the category-template correspondence table as well, so that each category path has two templates, one for list pages and one for detail pages.

Specifically: a chocolate detail page is selected from the web page pool and annotated manually with the template tool, generating a "chocolate" detail page template, which is also stored in the category-template correspondence table. The "chocolate" category path then has two templates, one for list pages and one for detail pages. What this template extracts is the final result to be obtained, consisting of two parts: commodity information and merchant information.

d. Process the URLs in the web page pool one by one using the detail page template for that category until the web page pool is empty.

Specifically: for the chocolate detail page URLs in the web page pool, information is extracted one by one using the "chocolate detail page" template until the web page pool is empty. In other words, the "chocolate detail page" template is first tried on every page to see whether it handles the various cases that occur.

Arranging the steps in this order makes problems easy to find. Trying the template on the multiple commodities of a single list page usually reveals most of the variations among pages for that kind of commodity, which also provides a good basis for the later analysis of similar and other commodity categories.

If the data extracted by the detail page template do not satisfy the data validation rules, the template is handed over for manual correction.
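The source does not say what the data validation rules check. Purely as an illustration, a rule set could be a mapping from field name to a predicate, with any failing field sending the template back for manual correction. The three rules and the field names below are assumptions.

```python
import re

# Hypothetical rules; the source does not specify its validation rules.
RULES = {
    "name": lambda v: bool(v and v.strip()),
    "price": lambda v: bool(v and re.fullmatch(r"\d+(\.\d{1,2})?", v)),
    "image_url": lambda v: bool(v and v.startswith(("http://", "https://"))),
}


def failed_fields(record):
    """Return the names of fields that break a rule; an empty list means the record passes."""
    return [name for name, ok in RULES.items() if not ok(record.get(name))]


good = {"name": "Dark chocolate", "price": "25.00", "image_url": "http://example.com/a.jpg"}
bad = {"name": "Dark chocolate", "price": "call us", "image_url": ""}

print(failed_fields(good))
print(failed_fields(bad))
```

A non-empty result identifies which fields the template extracted incorrectly, which is the information a person correcting the template would need.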

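e. Take the list page at the head of the category queue and check whether a list page template for its category path exists in the category-template correspondence table.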

Specifically: the list page at the head of the category queue is taken, and the category-template correspondence table is checked for a list page template for its category path;

If one exists, the page is analyzed with that template. After the detail pages of the 45 commodities on page 1 of the "chocolate" list have been analyzed, analysis of the next category, "preserves/jujubes/plums/preserved fruit", can begin;

If none exists, for example because the "preserves" category has not yet been analyzed and the category-template correspondence table therefore holds no template for it, extraction can first be attempted with the "chocolate" template. If the extracted data are correct, the correspondence is added to the category-template correspondence table. If the data are wrong, the page is submitted for manual template annotation, and the resulting template is also added to the category-template correspondence table.
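The lookup-then-reuse logic of step e can be sketched as below. This is a simplification under stated assumptions: templates are modelled as plain callables that map a page to extracted data, the table holds only list page templates, and `is_valid` and `annotate` stand in for the data validation rules and the manual annotation step. None of these names come from the source.

```python
def find_list_template(category, page, table, is_valid, annotate):
    """Return a list page template for `category`, reusing existing ones where possible."""
    if category in table:
        return table[category]
    # Try the templates already recorded for other categories, one by one.
    for other_template in list(table.values()):
        if is_valid(other_template(page)):
            table[category] = other_template  # record the new correspondence
            return other_template
    # No existing template fits: fall back to manual annotation.
    table[category] = annotate(category, page)
    return table[category]


# Stand-ins for a real template, a validation rule, and a manual annotation session.
chocolate_tpl = lambda page: page.get("items")
table = {"chocolate": chocolate_tpl}
is_valid = lambda data: bool(data)
annotate = lambda category, page: (lambda p: p.get("rows"))

reused = find_list_template("preserves", {"items": ["plum"]}, table, is_valid, annotate)
manual = find_list_template("tea", {"rows": ["oolong"]}, table, is_valid, annotate)
```

In this run the "preserves" page has the same layout as "chocolate", so the existing template is reused, while the hypothetical "tea" page has a different layout and triggers the manual path.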

Thanks to this step, when the list pages and detail pages of the various categories are fairly similar, only a few manually configured templates are needed to process all pages.

f. Process the list pages in the category queue one by one until the queue is empty.

Throughout the process, the queue design has a considerable influence on the timing of manual intervention. By designing the enqueue timing carefully, the parts that need manual intervention all fall in the early stage of the crawl. Once a certain number of web pages have been processed and most page variants have been covered, the whole process no longer needs manual intervention and can proceed automatically.

The present invention is based on a custom language for extracting objects from web pages and a domain-specific template generation tool. It learns during information extraction and modifies the templates accordingly. It is a semi-automatic extraction method that can quickly and accurately extract and label the required specific information from web pages, such as commodity names, commodity image URLs, and prices.
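The effect of the queue order can be seen in a small simulation. The sketch below assumes that `fetch_list` and `fetch_detail` hide all template lookup, annotation, and extraction; these two callables, the `crawl` function, and the toy site are hypothetical and only show the order in which pages are visited.

```python
from collections import deque


def crawl(categories, fetch_list, fetch_detail):
    """Queue-driven loop: list pages come from the category queue, detail pages from the pool."""
    category_queue = deque(categories)  # (category, list page URL) pairs
    page_pool = deque()
    records = []
    while category_queue:  # step f: until the queue is empty
        category, list_url = category_queue.popleft()
        detail_urls, next_page = fetch_list(category, list_url)
        page_pool.extend(detail_urls)
        if next_page:  # later pages go to the tail of the queue
            category_queue.append((category, next_page))
        while page_pool:  # step d: drain the pool
            records.append(fetch_detail(category, page_pool.popleft()))
    return records


# Toy site: list page URL -> (detail page URLs, next-page URL or None).
SITE = {
    "choc/1": (["c1", "c2"], "choc/2"),
    "choc/2": (["c3"], None),
    "pres/1": (["p1"], None),
}

order = crawl(
    [("chocolate", "choc/1"), ("preserves", "pres/1")],
    fetch_list=lambda category, url: SITE[url],
    fetch_detail=lambda category, url: url,
)
print(order)
```

In this toy run the first page of every category is visited before the second page of any category, so every category's layout is encountered early. This is consistent with the observation that manual intervention is concentrated at the start of the crawl.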

Claims

1. A network commodity information extraction method, characterized in that the method comprises the following steps:

(1) generating an initial template for network commodity information extraction using a template generation tool;

(2) extracting commodity information from a website using the initial template, this step including:

a. on the product category page of the website, annotating the page manually and extracting all commodity category names and list page URLs from the page, and adding them to a category queue;

b. taking the list page at the head of the category queue and having it annotated manually; when annotation is complete, storing the category path and the generated template in a category-template correspondence table; extracting multiple commodity detail page URLs and the next-page URL from the list page, putting the commodity detail page URLs into a web page pool, and appending the next-page URL to the tail of the category queue;

c. selecting a detail page from the web page pool and having it annotated manually; when annotation is complete, storing the resulting template in the category-template correspondence table as well, so that each category path has two templates, one for list pages and one for detail pages;

d. processing the URLs in the web page pool one by one using the detail page template for that category until the web page pool is empty;

e. taking the list page at the head of the category queue and checking whether a list page template for its category path exists in the category-template correspondence table;

if one exists, analyzing the page with that template;

if none exists, trying the templates of the other categories in the category-template correspondence table one by one; if the extracted data are correct, adding the new correspondence to the category-template correspondence table; if the data are wrong, submitting the page for manual template annotation and adding the resulting template to the category-template correspondence table as well;

f. processing the list pages in the category queue one by one until the queue is empty.