Summary of the invention
The object of the invention is to overcome the above deficiencies of the prior art and to provide a semi-automatic method for extracting online commodity information, so that the required specific information can be extracted from web pages and labeled quickly and accurately.
The technical solution adopted by the invention to solve the above problem is a method for extracting online commodity information, characterized in that the method comprises the following steps:
1. Generate an initial template for extracting online commodity information with a template generation tool;
2. Use the initial template to extract commodity information from a website. This step comprises:
A. On the product category page of the website, annotate manually and extract all commodity category names and list page URLs in the page, and add them to a category queue;
B. Take the list page at the head of the category queue and hand it over for manual annotation. When annotation is complete, store the category path and the generated template in a category-template correspondence table. Extract the commodity detail page URLs and the next-page URL from the list page, put the detail page URLs into a web page pool, and append the next-page URL to the tail of the category queue;
C. Select one detail page from the web page pool and hand it over for manual annotation. When annotation is complete, store the resulting template in the category-template correspondence table as well, so that the category path now has two templates, one for list pages and one for detail pages;
D. Process the URLs in the web page pool one by one with the existing detail page template of that category, until the web page pool is empty;
E. Take the list page at the head of the category queue and check whether a list page template for its category path exists in the category-template correspondence table;
If it exists, parse the page with that template;
If it does not exist, try the templates of the other categories in the category-template correspondence table one by one. If the extracted data is correct, add the correspondence to the category-template correspondence table. If the data is wrong, submit the page for manual template annotation and add the resulting template to the category-template correspondence table as well.
F. Process the list pages in the category queue one by one until the queue is empty.
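The order in which steps A to F visit pages can be illustrated with a minimal Python sketch. It covers scheduling only: manual annotation and template matching are omitted, and the `crawl` function, the `list_pages` lookup that stands in for list page extraction, and the sample URLs are all assumptions made for illustration, not part of the described method.

```python
from collections import deque


def crawl(categories, list_pages):
    """Return the order in which list and detail pages are processed.

    categories: (category_path, list_url) pairs, as produced by step A.
    list_pages: hypothetical stand-in for list page extraction; maps a
                list URL to (detail_urls, next_page_url_or_None).
    """
    category_queue = deque(categories)  # step A: fill the category queue
    page_pool = deque()                 # web page pool of detail URLs
    visited = []

    while category_queue:               # step F: until the queue is empty
        # Steps B and E: take the list page at the head of the queue.
        category, list_url = category_queue.popleft()
        visited.append(("list", list_url))
        detail_urls, next_url = list_pages[list_url]
        page_pool.extend(detail_urls)
        if next_url is not None:
            # The next page goes to the tail, behind the other categories.
            category_queue.append((category, next_url))
        # Step D: drain the pool before touching the next list page.
        while page_pool:
            visited.append(("detail", page_pool.popleft()))
    return visited
```

Because the next-page URL is appended to the tail, the first list page of every category is processed before the second page of any category, so new page layouts tend to surface early in the crawl.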
Compared with the prior art, the invention has the following advantages and effects:
1. With the template tool of the invention, an ordinary user can, after a few minutes of simple training, define an information extraction template within 10 minutes, with no need for a programmer familiar with HTML to step in, which lowers the skill required for the work. The visual interface of the extraction tool makes the work more intuitive, simplifies the related operations, reduces the error rate, and improves efficiency.
2. With the extraction flow of the invention, the differing cases among similar web pages are found automatically and can then be handled manually. The flow is also designed to make it easy to find and reuse earlier templates, which effectively reduces the number of templates that must be customized by hand.
Detailed description of the embodiments
The invention is described in further detail below with reference to the accompanying drawings and an embodiment.
Referring to Fig. 1 to Fig. 3, this embodiment takes the "food" category of "Taobao" as an example and describes the whole commodity information extraction process in detail.
1. Generate an initial template for extracting online commodity information with the template generation tool. The template generation tool is a browser plug-in designed by the applicant. The procedure is as follows:
(1) The user browses freely in the browser until reaching a web page from which information is to be extracted;
(2) Click the "template generation plug-in" icon in the browser toolbar to start the extraction tool;
(3) Click the "start collection" button to begin the extraction process. From now on, as the mouse moves over the parts of the page, a blue frame appears to indicate the position to be extracted;
(4) Click the "new landmark" or "new record" button to create a "landmark" or a "record", then select the region to extract in the page. The template generation tool produces the corresponding path automatically according to heuristic rules;
(5) The user attaches information such as a variable name and remarks to this path to describe its meaning;
(6) Repeat steps (4) and (5) until all fields of interest have been annotated;
(7) Click the "apply" button. The template generation tool extracts the corresponding field contents from the current page according to the template as currently defined and displays them;
(8) If the contents are correct, the user can click the "save" button to save the template. If they are not, the user can adjust the template manually and then save it.
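As a rough illustration of what steps (4) to (7) produce, a template can be thought of as a list of named paths that is applied to a page. The Python sketch below is not the format of the applicant's tool, which the text does not specify: the `Field` class, the use of the XPath subset supported by `xml.etree.ElementTree`, and the sample page are all assumptions, and the page is assumed to be well-formed markup.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Field:
    name: str         # variable name entered by the user in step (5)
    path: str         # path from step (4); XPath syntax is an assumption
    remark: str = ""  # free-text remark from step (5)


def apply_template(fields, page_source):
    """Step (7): extract each field's text from a well-formed page."""
    root = ET.fromstring(page_source)
    result = {}
    for field in fields:
        node = root.find(field.path)
        has_text = node is not None and node.text
        result[field.name] = node.text.strip() if has_text else None
    return result


# Hypothetical sample page and template, for illustration only.
SAMPLE_PAGE = """<html><body>
<h1 class="title">Dark chocolate 100g</h1>
<span class="price">25.00</span>
</body></html>"""

TEMPLATE = [
    Field("name", ".//h1[@class='title']", "commodity name"),
    Field("price", ".//span[@class='price']", "unit price"),
]
```

Calling `apply_template(TEMPLATE, SAMPLE_PAGE)` returns a dictionary keyed by variable name. A field whose path matches nothing comes back as `None`, which is the kind of result the user would inspect in step (8) before saving.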
2. Use the initial template to extract commodity information from a website. This step comprises:
A. On the product category page of the website, annotate manually and extract all commodity category names and list page URLs in the page, and add them to a category queue.
Specifically: the "commodity catalog" page of "Taobao" food is extracted. Manual annotation produces a template that returns an object $foodCat of type List, which stores all commodity category names and list page URLs in the page; these are added to a category queue.
B. Take the list page at the head of the category queue and hand it over for manual annotation. When annotation is complete, store the category path and the generated template in a category-template correspondence table. Extract the commodity detail page URLs and the next-page URL from the list page, put the detail page URLs into a web page pool, and append the next-page URL to the tail of the category queue.
Specifically: the list page "chocolate/DIY chocolate" at the head of the category queue is annotated manually with the template generation tool, which produces a list page extraction template for the "chocolate" category. The template and the category path are stored in a category-template correspondence table. The template extracts multiple commodity detail page URLs and returns two results. One is an object $foodList of type List that stores the detail page URLs and titles of the chocolate products; it is a list because one page contains multiple commodities. The other is a variable nextPage that stores the next-page URL; because the last page has no next page, this variable is optional. The commodity detail page URLs are put into a web page pool, and the next-page URL is appended to the tail of the category queue.
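The two results returned by the list page template, and what happens to them, might be modeled as in the following Python sketch. The `ListPageResult` class, the `dispatch` function, and the example URLs are hypothetical; only the names `$foodList` and `nextPage` and the optional nature of the latter come from the description.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ListPageResult:
    # Mirrors $foodList: one (detail page URL, title) pair per commodity.
    food_list: List[Tuple[str, str]]
    # Mirrors nextPage: None on the last page, which has no next page.
    next_page: Optional[str] = None


def dispatch(result, category, page_pool, category_queue):
    """Route one list page result to the pool and the category queue."""
    for url, _title in result.food_list:
        page_pool.append(url)  # detail URLs go to the web page pool
    if result.next_page is not None:
        # The next-page URL joins the tail of the category queue.
        category_queue.append((category, result.next_page))
```

Treating `next_page` as optional keeps the last page of a category from enqueuing an empty URL.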
C. Select one detail page from the web page pool and hand it over for manual annotation. When annotation is complete, store the resulting template in the category-template correspondence table as well, so that the category path now has two templates, one for list pages and one for detail pages.
Specifically: one chocolate detail page is selected from the web page pool and annotated manually with the template tool, producing a "chocolate" detail page template, which is also stored in the category-template correspondence table. The "chocolate" category path now has two templates, one for list pages and one for detail pages. What this detail page template extracts is the final result to be obtained, which comprises two parts: commodity information and merchant information.
D. Process the URLs in the web page pool one by one with the existing detail page template of that category, until the web page pool is empty.
Specifically: information is extracted from the chocolate detail page URLs in the web page pool one by one with the "chocolate detail page" template, until the web page pool is empty. In other words, the "chocolate detail page" template is first tried on every page to see whether it copes with the various cases that occur.
This ordering makes problems easy to spot. Trying the template on the multiple commodities of one list page generally reveals the differences found across most pages of that kind of commodity, which also provides a good basis for the later analysis of similar and other categories of commodities.
If the data extracted with a page template does not satisfy the data verification rules, the template is handed over for manual correction.
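The description does not say what the data verification rules are. Purely as an assumed example, the Python sketch below checks that a record has a non-empty name and a positive numeric price; the `validate_record` function and both rules are illustrative and would be replaced by whatever rules the target fields call for.

```python
from decimal import Decimal, InvalidOperation


def validate_record(record):
    """Return a list of problems; an empty list means the record passes.

    Both rules below are assumed examples, not rules from the description.
    """
    problems = []
    if not (record.get("name") or "").strip():
        problems.append("name is empty")
    try:
        if Decimal(str(record.get("price"))) <= 0:
            problems.append("price is not positive")
    except InvalidOperation:
        # Raised for missing or non-numeric prices.
        problems.append("price is not a number")
    return problems
```

A record that returns a non-empty list would be the trigger for handing the template over for manual correction.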
E. Take the list page at the head of the category queue and check whether a list page template for its category path exists in the category-template correspondence table;
If it exists, parse the page with that template;
If it does not exist, try the templates of the other categories in the category-template correspondence table one by one. If the extracted data is correct, add the correspondence to the category-template correspondence table. If the data is wrong, submit the page for manual template annotation and add the resulting template to the category-template correspondence table as well.
Specifically: take the list page at the head of the category queue and check whether a list page template for its category path exists in the category-template correspondence table;
If it exists, parse the page with that template. Once the detail pages of the 45 commodities on page 1 of the "chocolate" list have been analyzed, analysis of the next category, "candied fruit/jujubes/plums/preserved fruit", can begin;
If it does not exist, for example because the "candied fruit" category has not been analyzed before and so has no template in the category-template correspondence table, extraction is first attempted with the "chocolate" template. If the data is correct, the correspondence is added to the category-template correspondence table. If the data is wrong, the page is submitted for manual template annotation, and the resulting template is likewise added to the category-template correspondence table.
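The lookup-then-reuse logic of step E can be sketched in Python as follows. The function name `find_list_template` and the `apply` and `is_valid` callables are hypothetical placeholders for template application and data verification, and the correspondence table is assumed to be a plain dictionary keyed by category path.

```python
def find_list_template(category, page, table, apply, is_valid):
    """Return a list page template for `category`, or None.

    table:    dict mapping category path -> template (assumed layout).
    apply:    hypothetical callable, apply(template, page) -> data.
    is_valid: hypothetical callable, is_valid(data) -> bool.
    """
    if category in table:
        return table[category]  # a template already exists: use it
    # Otherwise try the templates of the other categories one by one.
    for template in list(table.values()):
        if is_valid(apply(template, page)):
            table[category] = template  # record the new correspondence
            return template
    # Nothing fits: the caller submits the page for manual annotation
    # and stores the resulting template in the table.
    return None
```

Returning `None` instead of raising keeps the manual annotation path in the caller, which is where the human operator is involved.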
Thanks to this step, when the list pages and detail pages of the different categories are fairly similar, only a few manually configured templates are needed to process all pages.
F. Process the list pages in the category queue one by one until the queue is empty.
Throughout the process, the design of the queue has a considerable influence on when manual intervention occurs. By designing the enqueue timing well, the parts that need manual intervention are all concentrated in the early stage of the crawl. Once a certain number of pages have been processed and most page variants are covered, the process no longer needs manual intervention and can continue automatically.
The invention is based on a custom language for extracting target web pages and a domain-specific template generation tool. It then learns during information extraction and revises the templates. It is a semi-automatic extraction method that can quickly and accurately extract and label the required specific information from web pages, such as commodity name, commodity picture URL, and price.