CN102495847B - Network commodity information extraction method - Google Patents

Publication number
CN102495847B
CN102495847B CN201110363931.4A CN201110363931A CN102495847B CN 102495847 B CN102495847 B CN 102495847B CN 201110363931 A CN201110363931 A CN 201110363931A CN 102495847 B CN102495847 B CN 102495847B
Authority
CN
China
Prior art keywords
template
classification
page
queue
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110363931.4A
Other languages
Chinese (zh)
Other versions
CN102495847A (en)
Inventor
刘崟
吴浩苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Panxing Shuzhi Technology Co ltd
Original Assignee
ZHEJIANG PANSHI INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZHEJIANG PANSHI INFORMATION TECHNOLOGY Co Ltd
Priority to CN201110363931.4A
Publication of CN102495847A
Application granted
Publication of CN102495847B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a network commodity information extraction method. The method comprises the steps of (1) generating an initial template for network commodity information extraction with a template generation tool, and (2) applying the initial template to extract commodity information from websites. With the template generation tool, templates are generated and revised during information extraction, so that extraction is semi-automatic and the required specific information, such as commodity names, commodity image URLs (uniform resource locators), and prices, can be accurately and quickly extracted from web pages and labeled. The method makes the operation visual and the related operations more convenient, reduces the error rate, and improves work efficiency.

Description

A network commodity information extraction method
Technical field
The present invention relates to a network commodity information extraction method.
Background
In recent years, with the rapid development of e-commerce, enterprises and individuals of all kinds have been carrying out marketing activities over the Internet. As a result, the Internet has accumulated a vast amount of commodity information and has become the largest source of such information. This information includes items of great commercial value such as price, place of origin, distributor, sales volume, and customer reviews.
Classifying and analyzing these data and presenting them in a suitable manner can help an enterprise's business decisions. For example, an enterprise that manufactures and sells pressure cookers needs to know how to price its own products, how to track the fast-changing market prices in its industry (in particular competitors' price changes), how to learn competitors' sales territories and sales channels, and how to compare and position the features of its own products. The basis of all these processes is the accurate extraction of information from web pages.
Web page information extraction currently falls into three kinds: manual, fully automatic, and semi-automatic. Manual extraction is accurate, but the workload is large, efficiency is low, and cost is high. Fully automatic extraction is cheap and efficient, but its accuracy is poor and it is technically difficult. Semi-automatic extraction relies on a small amount of manual annotation; the workload is small, and human intervention gives better assurance of accuracy, so it is a comparatively feasible approach.
Summary of the invention
The object of the invention is to overcome the above deficiencies of the prior art by providing a semi-automatic network commodity information extraction method that quickly and accurately extracts and labels the required specific information from web pages.
The technical solution adopted by the invention to solve the above problem is a network commodity information extraction method, characterized in that the method comprises the following steps:
1. Generate an initial template for network commodity information extraction using a template generation tool.
2. Extract commodity information from a website using the initial template. This step includes:
A. On the product category page of the website, annotate manually and extract all commodity category names and list page URLs in the page, adding them to a category queue.
B. Take the list page at the head of the category queue and annotate it manually. When done, store the category path and the generated template in a category-template mapping table. Extract multiple commodity detail page URLs and the next-page URL from the list page, hand the detail page URLs to a web page pool, and add the next-page URL to the tail of the category queue.
C. Select a detail page from the web page pool and annotate it manually. When done, store the resulting template in the category-template mapping table as well, so that one category path has two templates, one for list pages and one for detail pages.
D. Process the URLs in the web page pool one by one using the detail page template of this category, until the pool is empty.
E. Take the list page at the head of the category queue and check whether a list page template for its category path exists in the category-template mapping table.
If it exists, analyze the page with that template.
If it does not exist, try the templates of the other categories in the mapping table one by one. If the extracted data are correct, add the new correspondence to the mapping table; if the data are wrong, submit the page for manual annotation of a template and add that template to the mapping table as well.
F. Process the list pages in the category queue one by one until the queue is empty.
Compared with the prior art, the invention has the following advantages and effects:
1. With the template tool of the invention, after a few minutes of simple training an ordinary user can define an information extraction template within 10 minutes, without the involvement of a programmer familiar with HTML, which lowers the skill required of staff. The extraction tool's visual interface makes the work more intuitive, simplifies the related operations, reduces the error rate, and improves work efficiency.
2. With the extraction workflow of the invention, the various differences among similar web pages are found automatically, which makes manual handling easier. The workflow is also designed so that earlier templates are easy to find and reuse, effectively reducing the number of manually customized templates.
Brief description of the drawings
Fig. 1 is a schematic diagram of the commodity information extraction workflow in an embodiment of the invention.
Fig. 2 is a schematic diagram of the commodity information extracted in an embodiment of the invention.
Fig. 3 is a schematic diagram of the category-template mapping table built by the invention.
Detailed description of the embodiments
The invention is described in further detail below with reference to the accompanying drawings and an embodiment.
Referring to Fig. 1 to Fig. 3, this embodiment takes the "food" category of "Taobao" as an example and describes the whole commodity information extraction process in detail.
1. Generate an initial template for network commodity information extraction using the template generation tool. The template generation tool is a browser plug-in designed by the applicant. The procedure is as follows:
(1) The user browses web pages freely in the browser until reaching a page from which information is to be extracted.
(2) Click the "template generation plug-in" icon in the browser toolbar to start the extraction tool.
(3) Click the "start collection" button to begin the extraction process. From this point on, a blue frame appears as the mouse moves over each part of the page, marking the position that would be extracted.
(4) Click the "new landmark" or "new record" button to create a "landmark" or a "record", then select the region to extract in the page. The template generation tool automatically produces the corresponding path according to heuristic rules.
(5) The user attaches information such as a variable name and remarks to this path to indicate its meaning.
(6) Repeat steps (4) and (5) until all fields of interest have been annotated.
(7) Click the "apply" button. The template generation tool extracts the corresponding field contents from the current page using the template as currently defined and displays them.
(8) If the content is correct, the user can click the "save" button to save the template. If it is not, the user can adjust the template manually and then save it.
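The result of steps (4) to (7) can be pictured as a list of annotated paths that is applied to a page. The following is only a minimal sketch: the patent does not disclose the template language or the path syntax, so the `Field` and `Template` classes, the use of ElementTree paths, and the sample page fragment are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import xml.etree.ElementTree as ET


@dataclass
class Field:
    name: str         # variable name entered by the user in step (5)
    path: str         # path produced in step (4); ElementTree syntax is an assumption
    remark: str = ""  # free-text remark describing the field


@dataclass
class Template:
    fields: List[Field] = field(default_factory=list)

    def apply(self, page: str) -> Dict[str, Optional[str]]:
        """Step (7): extract every annotated field from the page."""
        root = ET.fromstring(page)
        result: Dict[str, Optional[str]] = {}
        for f in self.fields:
            node = root.find(f.path)
            has_text = node is not None and node.text
            result[f.name] = node.text.strip() if has_text else None
        return result


# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html><body>
  <h1 class="title">Dark chocolate 100g</h1>
  <span class="price">25.00</span>
</body></html>
"""

template = Template([
    Field("name", ".//h1[@class='title']", "commodity name"),
    Field("price", ".//span[@class='price']", "unit price"),
])
extracted = template.apply(PAGE)
```

A field whose path matches nothing comes back as `None`, which gives the user something concrete to check in step (8) before saving the template.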
2. Extract commodity information from the website using the initial template. This step includes:
A. On the product category page of the website, annotate manually and extract all commodity category names and list page URLs in the page, adding them to a category queue.
Specifically: the "commodity catalog" page of "Taobao" food is annotated manually to generate a template. The template returns an object $foodCat of List type that stores all commodity category names and list page URLs in the page, which are added to a category queue.
B. Take the list page at the head of the category queue and annotate it manually. When done, store the category path and the generated template in a category-template mapping table. Extract multiple commodity detail page URLs and the next-page URL from the list page, hand the detail page URLs to a web page pool, and add the next-page URL to the tail of the category queue.
Specifically: the list page "chocolate/DIY chocolate" at the head of the category queue is annotated manually with the template generation tool, producing the list page extraction template for the "chocolate" category. The template and the category path are stored in a category-template mapping table. The template extracts multiple commodity detail page URLs and returns two results. One is an object $foodList of List type that stores the detail page URLs and titles of the chocolate commodities; it is a list because one page contains multiple commodities. The other is a variable nextPage that stores the next-page URL. Because the last page has no next page, this variable is optional. The commodity detail page URLs are handed to a web page pool, and the next-page URL is added to the tail of the category queue.
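The two results of the list page template and their routing can be sketched as follows. This is an illustration under assumptions: the `ListPageResult` type, the `dispatch` helper, and the example.com URLs are hypothetical; only the names `$foodList` and `nextPage` and the pool/queue routing come from the description.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ListPageResult:
    food_list: List[Tuple[str, str]]  # (detail page URL, title), mirrors $foodList
    next_page: Optional[str] = None   # mirrors nextPage; None on the last page


def dispatch(result, category_queue, page_pool, category_path):
    """Hand detail URLs to the page pool and queue the next list page, if any."""
    for url, _title in result.food_list:
        page_pool.append((category_path, url))
    if result.next_page is not None:
        # The next list page goes to the tail, behind the other categories.
        category_queue.append((category_path, result.next_page))


category_queue = deque()
page_pool = deque()
result = ListPageResult(
    food_list=[
        ("https://example.com/item/1", "Dark chocolate"),
        ("https://example.com/item/2", "Milk chocolate"),
    ],
    next_page="https://example.com/chocolate?page=2",
)
dispatch(result, category_queue, page_pool, "food/chocolate")
```

Modeling `next_page` as an optional value keeps the last page of a category from enqueuing anything.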
C. Select a detail page from the web page pool and annotate it manually. When done, store the resulting template in the category-template mapping table as well, so that one category path has two templates, one for list pages and one for detail pages.
Specifically: a chocolate detail page is selected from the web page pool and annotated manually with the template tool, producing a "chocolate" detail page template, which is also stored in the category-template mapping table. The "chocolate" category path then has two templates, one for list pages and one for detail pages. What this template extracts is the final result to be obtained, consisting of two parts: commodity information and merchant information.
D. Process the URLs in the web page pool one by one using the detail page template of this category, until the pool is empty.
Specifically: for the chocolate detail page URLs in the web page pool, information is extracted one by one using the "chocolate detail page" template until the pool is empty. In other words, the "chocolate detail page" template is first tried on every page to see whether it fits the various situations that occur.
This ordering makes problems easy to find. Trying the template on the multiple commodities of one list page generally reveals most of the variations among the web pages of that kind of commodity, and also provides a good basis for the later analysis of similar and other kinds of commodities.
If the data extracted by a page template do not satisfy the data verification rules, the template is handed over for manual correction.
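The description does not state what the data verification rules are. As a sketch only, the check below assumes three hypothetical rules (non-empty name, decimal price, absolute image URL) on the fields named in the abstract; the field keys and the `validate` function are illustrative assumptions.

```python
import re

# Hypothetical rule: a price is digits with at most two decimal places.
PRICE_RE = re.compile(r"^\d+(\.\d{1,2})?$")


def validate(record):
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    if not (record.get("name") or "").strip():
        errors.append("name is empty")
    if not PRICE_RE.match(record.get("price") or ""):
        errors.append("price is not a decimal number")
    if not (record.get("image_url") or "").startswith(("http://", "https://")):
        errors.append("image_url is not an absolute URL")
    return errors


good = {"name": "Dark chocolate", "price": "25.00",
        "image_url": "https://example.com/a.jpg"}
bad = {"name": " ", "price": "¥25", "image_url": "/a.jpg"}
```

A non-empty result from `validate` would be the trigger for handing the template back for manual correction.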
E. Take the list page at the head of the category queue and check whether a list page template for its category path exists in the category-template mapping table.
If it exists, analyze the page with that template.
If it does not exist, try the templates of the other categories in the mapping table one by one. If the extracted data are correct, add the new correspondence to the mapping table; if the data are wrong, submit the page for manual annotation of a template and add that template to the mapping table as well.
Specifically: take the list page at the head of the category queue and check whether a list page template for its category path exists in the category-template mapping table.
If it exists, analyze the page with that template. Once the detail pages of the 45 commodities on page 1 of the "chocolate" list have been analyzed, analysis of the next category, "preserves/jujubes/plums/candied fruit", can begin.
If it does not exist, for example because the "preserves" category has not been analyzed yet and therefore has no template in the mapping table, extraction is first attempted with the "chocolate" template. If the data are correct, the new correspondence is added to the mapping table; if the data are wrong, the page is submitted for manual annotation of a template, which is also added to the mapping table.
Thanks to this step, when the list pages and detail pages of the various categories are fairly similar, only a few manually configured templates are needed to process all pages.
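The lookup with fallback in step E can be sketched as below. Templates are modeled as plain callables and pages as dictionaries purely for illustration; `find_list_template`, `is_valid`, and `annotate_manually` are hypothetical names, and the validity check stands in for whatever "data are correct" means in practice.

```python
from typing import Callable, Dict, List, Tuple

Page = dict                             # stand-in for a parsed page (assumption)
Template = Callable[[Page], List[str]]  # a template returns the extracted items
Mapping = Dict[Tuple[str, str], Template]


def find_list_template(category_path, page, mapping, is_valid, annotate_manually):
    """Return a list page template for the category, reusing others if they fit."""
    key = (category_path, "list")
    if key in mapping:                  # template already known for this category
        return mapping[key]
    for (_other_path, kind), candidate in list(mapping.items()):
        if kind == "list" and is_valid(candidate(page)):
            mapping[key] = candidate    # record the new correspondence
            return candidate
    template = annotate_manually(category_path, page)  # human fallback
    mapping[key] = template
    return template


def chocolate_tpl(page):
    return page.get("items", [])


def annotate_manually(category_path, page):
    # Stands in for a person producing a new template with the plug-in.
    return lambda p: p.get("products", [])


def is_valid(data):
    return len(data) > 0


mapping = {("food/chocolate", "list"): chocolate_tpl}
reused = find_list_template("food/preserves", {"items": ["plum"]},
                            mapping, is_valid, annotate_manually)
fresh = find_list_template("food/tea", {"products": ["green tea"]},
                           mapping, is_valid, annotate_manually)
```

In this toy run the "preserves" page has the same layout as "chocolate" and reuses its template, while the "tea" page does not and falls through to manual annotation.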
F. Process the list pages in the category queue one by one until the queue is empty.
Throughout the process, the queue design strongly influences when manual intervention occurs. By choosing the enqueue timing well, the parts that need manual intervention all fall in the early stage of the crawl. Once a certain number of pages have been processed and most page variants are covered, the process no longer needs manual intervention and can proceed automatically.
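The effect of the queue ordering can be seen in a small simulation. This is a sketch under assumptions: the `site` dictionary, its URLs, and the `get_template` stub (which logs every point where a person would have to annotate) are invented for illustration, and template reuse across categories is left out to keep it short.

```python
from collections import deque

# Toy site: URL -> parsed page (hypothetical data for illustration only).
site = {
    "choc/1": {"details": ["choc/a", "choc/b"], "next": "choc/2"},
    "choc/2": {"details": ["choc/c"], "next": None},
    "pres/1": {"details": ["pres/x"], "next": None},
    "choc/a": {"name": "A"}, "choc/b": {"name": "B"},
    "choc/c": {"name": "C"}, "pres/x": {"name": "X"},
}

annotations = []  # log of (category, kind) pairs that needed a human
templates = {}    # stand-in for the category-template mapping table


def get_template(category, kind):
    key = (category, kind)
    if key not in templates:
        annotations.append(key)  # a person would annotate a page here
        if kind == "list":
            templates[key] = lambda page: (page["details"], page["next"])
        else:
            templates[key] = lambda page: page["name"]
    return templates[key]


def crawl(start_pages):
    queue = deque(start_pages)  # category queue of (category, list page URL)
    records = []
    while queue:
        category, list_url = queue.popleft()  # take the head of the queue
        detail_urls, next_page = get_template(category, "list")(site[list_url])
        pool = deque(detail_urls)             # web page pool for this list page
        while pool:                           # drain the pool before moving on
            page = site[pool.popleft()]
            records.append(get_template(category, "detail")(page))
        if next_page is not None:
            queue.append((category, next_page))  # next page goes to the tail
    return records


records = crawl([("chocolate", "choc/1"), ("preserves", "pres/1")])
```

Because each next-page URL goes to the tail, the first page of every category is visited before any second page, so all four annotations in this run happen before `choc/2` is processed without any human step.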
The invention is based on a domain-specific custom language for extraction from target web pages and on a template generation tool. It learns during information extraction and revises the templates accordingly, and is therefore a semi-automatic extraction method that can quickly and accurately extract and label the required specific information from web pages, such as commodity name, commodity image URL, and price.

Claims (1)

1. A network commodity information extraction method, characterized in that the method comprises the following steps:
(1) generating an initial template for network commodity information extraction using a template generation tool;
(2) extracting commodity information from a website using the initial template, this step including:
A. on the product category page of the website, annotating manually and extracting all commodity category names and list page URLs in the page, and adding them to a category queue;
B. taking the list page at the head of the category queue and annotating it manually; when done, storing the category path and the generated template in a category-template mapping table; extracting multiple commodity detail page URLs and the next-page URL from the list page, handing the detail page URLs to a web page pool, and adding the next-page URL to the tail of the category queue;
C. selecting a detail page from the web page pool and annotating it manually; when done, storing the resulting template in the category-template mapping table as well, so that one category path has two templates, one for list pages and one for detail pages;
D. processing the URLs in the web page pool one by one using the detail page template of this category, until the pool is empty;
E. taking the list page at the head of the category queue and checking whether a list page template for its category path exists in the category-template mapping table;
if it exists, analyzing the page with that template;
if it does not exist, trying the templates of the other categories in the mapping table one by one; if the extracted data are correct, adding the new correspondence to the mapping table; if the data are wrong, submitting the page for manual annotation of a template and adding that template to the mapping table as well;
F. processing the list pages in the category queue one by one until the queue is empty.
CN201110363931.4A 2011-11-16 2011-11-16 Network commodity information extraction method Active CN102495847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110363931.4A CN102495847B (en) 2011-11-16 2011-11-16 Network commodity information extraction method

Publications (2)

Publication Number Publication Date
CN102495847A CN102495847A (en) 2012-06-13
CN102495847B CN102495847B (en) 2017-04-19

Family

ID=46187672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110363931.4A Active CN102495847B (en) 2011-11-16 2011-11-16 Network commodity information extraction method

Country Status (1)

Country Link
CN (1) CN102495847B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102867053A (en) * 2012-09-12 2013-01-09 北京奇虎科技有限公司 Method, device and system for collecting effective information web pages in website information
CN103593429B (en) * 2013-11-07 2017-02-15 北京奇虎科技有限公司 Commodity template failure detection method and device
CN103853823B (en) * 2014-02-26 2017-01-18 中国科学院计算技术研究所 Online encyclopedia oriented entity attribute extraction method and system
CN104268766A (en) * 2014-10-21 2015-01-07 中国建设银行股份有限公司 Synchronous release method, device and terminal for e-commerce products
CN105528403B (en) * 2015-12-02 2020-01-03 小米科技有限责任公司 Target data identification method and device
CN109389434A (en) * 2018-10-12 2019-02-26 罗挺 A kind of market capacity determines method, apparatus, equipment and readable storage medium storing program for executing
CN110264315B (en) * 2019-06-20 2023-04-11 北京百度网讯科技有限公司 Introduction information generation method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1845098A (en) * 2006-02-20 2006-10-11 南京工业大学 Artificial fine-grained webpage information acquisition method
CN101464905A (en) * 2009-01-08 2009-06-24 中国科学院计算技术研究所 Web page information extraction system and method
CN102184184A (en) * 2011-04-07 2011-09-14 安徽博约信息科技有限责任公司 Method for acquiring webpage dynamic information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7519607B2 (en) * 2002-08-14 2009-04-14 Anderson Iv Robert Computer-based system and method for generating, classifying, searching, and analyzing standardized text templates and deviations from standardized text templates

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
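"Web information extraction technology based on automatically generated templates"; Zhang Yanchao et al.; Journal of Beijing Jiaotong University; 2009-12-04; Vol. 33, No. 5; pp. 41-45 *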

Also Published As

Publication number Publication date
CN102495847A (en) 2012-06-13

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: A northern Software Park C district Internet advertising building No. 45 Hangzhou 310011 Zhejiang province Gongshu District Xiangyuan Road

Applicant after: ZHEJIANG PANSHI INFORMATION TECHNOLOGY Co.,Ltd.

Address before: A northern Software Park C district Internet advertising building No. 45 Hangzhou 310011 Zhejiang province Gongshu District Xiangyuan Road

Applicant before: ZHEJIANG PANSHI INFORMATION TECHNOLOGY Co.,Ltd.

COR Change of bibliographic data
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20221019

Address after: 6th Floor, Building 3, No. 45, Xiangyuan Road, Hangzhou City, Zhejiang Province, 310000

Patentee after: Zhejiang Panxing Shuzhi Technology Co.,Ltd.

Address before: 310011 Panshi Internet Advertising Building, Zone C, North Software Park, No. 45, Xiangyuan Road, Gongshu District, Hangzhou, Zhejiang

Patentee before: ZHEJIANG PANSHI INFORMATION TECHNOLOGY Co.,Ltd.