CN105045838A - Network crawler system based on distributed storage system - Google Patents
Network crawler system based on distributed storage system
- Publication number
- CN105045838A (application CN201510377049.3A)
- Authority
- CN
- China
- Prior art keywords
- module
- webpage
- page
- url
- service
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention discloses a network crawler system based on a distributed storage system. The system comprises a basic service module, a grabber, and a task scheduling module, a parsing service module, a page downloading module, a page updating module and a data storage module, all arranged in the grabber. The task scheduling module controls the procedure by which the grabber grabs data; the parsing service module parses web page content and provides information extraction with user-defined templates; the page downloading module downloads the source code of web pages; the page updating module acquires the data of updated web pages; the data storage module uses a structured information extraction method to store the extracted content in a database of the distributed storage system; and the basic service module provides flow control for the grabber, the grabber's monitoring and warning mechanism, a URL deduplication service, a URL normalization service and a js/css resource management service. The crawling method is flexible and intelligent, and the structured extraction of web page content is automated.
Description
Technical field
The invention belongs to the technical field of computer data mining and search, and relates to a network crawler system based on a distributed storage system and to a method of structured information extraction.
Background art
With the development and spread of Internet technology, Web resources are growing explosively, and web pages have become an important source of information in people's daily lives. Internet resources are diverse, open, dynamic and heterogeneous, and cannot be managed in a unified way, which makes it difficult for people to find the information they need quickly and accurately. The heterogeneity of Internet resources also makes it difficult to obtain structured information.
Summary of the invention
The object of the invention is to provide, in view of the deficiencies of the prior art, a network crawler system based on a distributed storage system with which the required information can be found quickly and accurately.
The specific technical solution for achieving the object of the invention is as follows:
A network crawler system based on a distributed storage system, characterized in that the system comprises a basic service module, a grabber, and a task scheduling module, a parsing service module, a page downloading module, a page updating module and a data storage module arranged in the grabber. The task scheduling module controls the flow by which the grabber captures data. The parsing service module parses web page content and provides information extraction with user-defined templates. The page downloading module downloads the source code of web pages and supports pages that load JavaScript and AJAX as well as asynchronously, dynamically loaded forms. The page updating module obtains the data of web pages after they have been updated. The data storage module uses the structured information extraction method to store the extracted content in the database of the distributed storage system. The basic service module provides flow control for the grabber, the grabber's monitoring and alarm mechanism, a URL deduplication service, a URL normalization service and a js/css resource management service.
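As an illustration of two of the services attributed to the basic service module, the following is a minimal Python sketch of URL normalization and URL deduplication. The patent does not specify the normalization rules or how the deduplication service is implemented; the rules below (lowercasing the host, dropping default ports and fragments, sorting query parameters), the names `normalize_url` and `UrlDeduplicator`, and the in-memory set are all assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` (hypothetical normalization rules)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_ports = {"http": 80, "https": 443}
    if parts.port and parts.port != default_ports.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Sort query parameters so that parameter order does not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, query, ""))


class UrlDeduplicator:
    """In-memory stand-in for the URL deduplication service (an assumption)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_new(self, url: str) -> bool:
        """Record `url` and report whether it was seen for the first time."""
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

In a deployment spread over several servers, the set would presumably live in shared storage rather than in one process; the sketch only shows that deduplication is most useful when applied to normalized URLs.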
The structured information extraction method comprises:
A) A vector space model algorithm based on a constructed dictionary, with the following steps:
1) According to the domain whose data the user wants to capture, build a dictionary of keywords for that domain in advance, and regard this dictionary as an m-dimensional term vector, denoted β_m;
2) Use a word segmentation tool to split the text content of the web page into individual words;
3) Count the number of times each word occurs in the dictionary; the higher the count, the higher the relevance;
4) Each word has a corresponding count; treat each word as one dimension, so that a web page with n words is regarded as an n-dimensional vector;
5) Each web page thus yields an n-dimensional term vector, denoted α_n;
6) Compute the relevance of the web page to the dictionary by the following formula:
7) Web pages whose relevance S(α_n, β_m) is greater than the preset threshold θ are placed in the queue awaiting structured extraction;
B) Build a page template, with the following steps:
1) Analyze the HTML structure of the web page; a web page comprises a head tag and a body tag;
2) Within the head tag, the target tag is the title tag;
3) Within the body tag, the target tags are the p tag, the a tag and the form tag;
4) Combine the above target tags to complete the template of the web page;
C) Structured information extraction, with the following step:
Use the page template built in B) to extract the information from the pages to be extracted that were obtained by the vector space model algorithm in A). This yields structured data, which is stored in the distributed system database in XML format.
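The relevance filtering in part A) can be sketched in Python as follows. The relevance formula itself is not reproduced in this text, so the sketch assumes cosine similarity, a common choice for vector space models, with every dictionary keyword given weight 1. The regex-based `tokenize` function is only a stand-in for a real word segmentation tool, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Stand-in for a word segmentation tool: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def relevance(page_text: str, dictionary: set[str]) -> float:
    """Cosine similarity between the page's term counts and the dictionary.

    Assumption: the dictionary vector has weight 1 for every (lowercase)
    keyword, and both vectors are compared over the union of page words
    and keywords.
    """
    counts = Counter(tokenize(page_text))
    if not counts or not dictionary:
        return 0.0
    # Dot product: total number of keyword occurrences in the page.
    dot = sum(counts[word] for word in dictionary)
    page_norm = math.sqrt(sum(c * c for c in counts.values()))
    dict_norm = math.sqrt(len(dictionary))
    return dot / (page_norm * dict_norm)


def select_for_extraction(pages: dict[str, str], dictionary: set[str],
                          theta: float) -> list[str]:
    """Return the URLs of pages whose relevance exceeds the threshold theta."""
    return [url for url, text in pages.items()
            if relevance(text, dictionary) > theta]
```

Under this assumed formula the score lies between 0 and 1, so a threshold θ can be read as a minimum degree of topical overlap between a page and the domain dictionary.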
The flow for capturing data is as follows:
1) The user supplies seed URLs as the entry points for capturing Internet web pages;
2) The user customizes the basic service configuration file, which covers flow control, URL filtering rules, the URL deduplication service, URL normalization, the browser engine and the monitoring service;
3) Qualifying URLs are captured according to the URL filtering rules and the URL deduplication service;
4) The page downloading module downloads the page corresponding to each qualifying URL one by one; multi-threaded concurrency is used to speed up page downloads; the page downloading module calls the browser engine in the basic service, and the browser engine loads and renders JavaScript/AJAX pages by calling the Chrome kernel, which ensures that the downloaded page data is complete;
5) The task scheduling module starts the parsing engine module, which puts the external link URLs that satisfy the filtering rules into the queue to be crawled;
6) The task scheduling module starts the data storage module, which uses the structured information extraction method to store the extracted content in the database of the distributed storage system.
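The capture flow above can be sketched as a small crawl loop. This is an illustrative Python sketch, not the patented implementation: `fetch`, `extract_links` and `allowed` are hypothetical callables that stand in for the page downloading module, the parsing engine module and the URL filtering rules, and the download concurrency uses the standard-library thread pool. Rendering JavaScript through a browser engine is outside the scope of the sketch.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def crawl(seeds: Iterable[str],
          fetch: Callable[[str], str],
          extract_links: Callable[[str], Iterable[str]],
          allowed: Callable[[str], bool],
          max_pages: int = 100,
          workers: int = 4) -> dict[str, str]:
    """Breadth-first crawl; returns {url: page source} for downloaded pages."""
    seen: set[str] = set()
    frontier: deque[str] = deque()
    for url in seeds:
        if allowed(url) and url not in seen:
            seen.add(url)
            frontier.append(url)

    pages: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(pages) < max_pages:
            # Take one batch from the queue and download it concurrently.
            batch = [frontier.popleft()
                     for _ in range(min(len(frontier), max_pages - len(pages)))]
            for url, source in zip(batch, pool.map(fetch, batch)):
                pages[url] = source
                # Parse step: enqueue new links that pass the filter and dedup.
                for link in extract_links(source):
                    if allowed(link) and link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return pages
```

Passing the download and parsing steps in as functions keeps the loop testable without network access; an exception raised by `fetch` propagates to the caller, so a real crawler would also need retry and error handling.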
Compared with the prior art, the invention has the following advantages:
(1) It improves the efficiency of the web crawler. Multithreading increases crawl concurrency, and the structured information extraction method improves the efficiency of extracting web page information.
(2) By using the vector space model method, the invention can crawl web pages on a particular topic.
(3) The invention combines the vector space model with the template method to extract web page information intelligently and automatically.
(4) The invention is easy to operate and low in cost. Configuring the system's configuration file and a few Linux servers is enough to crawl a million-scale volume of web page data.
Brief description of the drawings
Fig. 1 is an architecture diagram of the system of the invention;
Fig. 2 is a flowchart of the structured information extraction method of the invention;
Fig. 3 is a flowchart of the system of the invention.
Detailed description
Embodiment
The invention is described in detail below, taking the crawling of basic information about Youku series as an example.
1) The seed URLs http://movie.youku.com and http://tv.youku.com serve as the entry points for the system's crawl of Youku.
2) Customize the system's configuration file: flow control, URL filtering rules, the URL deduplication service, URL normalization, the browser engine and the monitoring service.
3) Youku limits the access frequency of any single IP address. The local grabber goes through a proxy and switches to a different IP address every hour, so that the Youku website does not reject the system's capture requests.
4) The system's crawl tasks are prioritized from high to low as follows: crawling Youku URLs, crawling Youku basic information pages, crawling Youku playback information pages, crawling Youku per-episode playback information, crawling Youku comment information pages, and crawling Youku incremental pages.
5) Starting from the seed URLs http://movie.youku.com and http://tv.youku.com, the system discovers the URLs of external links, reads the URL filtering rules in the Youku configuration file, and filters out the URLs that do not satisfy them. The URLs that satisfy the filtering rules are put into the queue to be crawled.
6) Download the page for each URL.
7) Build the keyword dictionary for the video domain. The dictionary is built from the 75 domain lexicons provided by Data Hall (Datatang); regard the dictionary as an m-dimensional term vector, denoted β_m.
8) Segment the page content of each URL, splitting the web page content into individual words.
9) Count the total number of times each web page word occurs in the dictionary.
10) Each word has a corresponding count; treat each word as one dimension, so that a web page with n words is regarded as an n-dimensional vector, denoted α_n.
11) Compute the relevance of the web page to the dictionary:
12) The threshold θ preset by the system is 0.4. Web pages for which the result of step 11) is greater than 0.4 are placed in the queue awaiting structured extraction.
13) This embodiment builds a template only for the Youku series basic information page. Note: a production environment needs more than one template, for example templates for the Youku series basic information page, the playback information page, the per-episode playback information page, the comment information page, the Youku aggregate index page, and so on.
14) Youku series basic information page template:
15) Using the template from the previous step, the system automatically matches web pages against the template, stores the matched information in an XML file, and stores the XML file in the database of the distributed storage system. A sample of the structured extracted data is as follows:
16) Through the above steps, the Youku series basic information pages can be crawled.
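The template matching and XML storage steps can be illustrated with a short Python sketch. The actual Youku template and the sample output are not reproduced in this text, so the sketch only follows the general recipe described for building a template: collect the title, p, a and form tags and serialize what was found as XML. The class and function names, the XML layout and the choice to keep only the `href` and `action` attributes are assumptions, and the parser expects reasonably well-formed HTML.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Target tags named in the template-building steps.
TARGET_TAGS = {"title", "p", "a", "form"}
# Attributes kept in the output (an assumption made for this sketch).
KEPT_ATTRS = {"href", "action"}


class TemplateExtractor(HTMLParser):
    """Collect a record (tag, attributes, text) for each target tag."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[dict] = []
        self._open: list[dict] = []  # stack of target elements being read

    def handle_starttag(self, tag, attrs):
        if tag in TARGET_TAGS:
            record = {"tag": tag, "attrs": dict(attrs), "text": ""}
            self.records.append(record)
            self._open.append(record)

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag name, if any.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i]["tag"] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        # Text belongs to every target element that is currently open.
        for record in self._open:
            record["text"] += data


def page_to_xml(url: str, html: str) -> str:
    """Extract the target tags from `html` and serialize them as XML."""
    parser = TemplateExtractor()
    parser.feed(html)
    parser.close()
    root = ET.Element("page", {"url": url})
    for record in parser.records:
        attrs = {k: v for k, v in record["attrs"].items()
                 if k in KEPT_ATTRS and v is not None}
        node = ET.SubElement(root, record["tag"], attrs)
        node.text = " ".join(record["text"].split())  # collapse whitespace
    return ET.tostring(root, encoding="unicode")
```

The returned string could then be written to whatever database the distributed storage system provides; that storage call is deliberately left out because the patent does not name a specific database.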
Claims (3)
1. A network crawler system based on a distributed storage system, characterized in that the system comprises a basic service module, a grabber, and a task scheduling module, a parsing service module, a page downloading module, a page updating module and a data storage module arranged in the grabber; the task scheduling module controls the flow by which the grabber captures data; the parsing service module parses web page content and provides information extraction with user-defined templates; the page downloading module downloads the source code of web pages and supports pages that load JavaScript and AJAX as well as asynchronously, dynamically loaded forms; the page updating module obtains the data of web pages after they have been updated; the data storage module uses a structured information extraction method to store the extracted content in the database of the distributed storage system; and the basic service module provides flow control for the grabber, the grabber's monitoring and alarm mechanism, a URL deduplication service, a URL normalization service and a js/css resource management service.
2. The network crawler system according to claim 1, characterized in that the structured information extraction method comprises:
A) A vector space model algorithm based on a constructed dictionary, with the following steps:
1) According to the domain whose data the user wants to capture, build a dictionary of keywords for that domain in advance, and regard this dictionary as an m-dimensional term vector, denoted β_m;
2) Use a word segmentation tool to split the text content of the web page into individual words;
3) Count the number of times each word occurs in the dictionary; the higher the count, the higher the relevance;
4) Each word has a corresponding count; treat each word as one dimension, so that a web page with n words is regarded as an n-dimensional vector;
5) Each web page thus yields an n-dimensional term vector, denoted α_n;
6) Compute the relevance of the web page to the dictionary by the following formula:
7) Web pages whose relevance S(α_n, β_m) is greater than the preset threshold θ are placed in the queue awaiting structured extraction;
B) Build a page template, with the following steps:
1) Analyze the HTML structure of the web page; a web page comprises a head tag and a body tag;
2) Within the head tag, the target tag is the title tag;
3) Within the body tag, the target tags are the p tag, the a tag and the form tag;
4) Combine the above target tags to complete the template of the web page;
C) Structured information extraction, with the following step:
Use the page template built in B) to extract the information from the pages to be extracted that were obtained by the vector space model algorithm in A); this yields structured data, which is stored in the distributed system database in XML format.
3. The network crawler system according to claim 1, characterized in that the flow for capturing data is as follows:
1) The user supplies seed URLs as the entry points for capturing Internet web pages;
2) The user customizes the basic service configuration file, which covers flow control, URL filtering rules, the URL deduplication service, URL normalization, the browser engine and the monitoring service;
3) Qualifying URLs are captured according to the URL filtering rules and the URL deduplication service;
4) The page downloading module downloads the page corresponding to each qualifying URL one by one; multi-threaded concurrency is used to speed up page downloads; the page downloading module calls the browser engine in the basic service, and the browser engine loads and renders JavaScript/AJAX pages by calling the Chrome kernel, which ensures that the downloaded page data is complete;
5) The task scheduling module starts the parsing engine module, which puts the external link URLs that satisfy the filtering rules into the queue to be crawled;
6) The task scheduling module starts the data storage module, which uses the structured information extraction method to store the extracted content in the database of the distributed storage system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510377049.3A CN105045838A (en) | 2015-07-01 | 2015-07-01 | Network crawler system based on distributed storage system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510377049.3A CN105045838A (en) | 2015-07-01 | 2015-07-01 | Network crawler system based on distributed storage system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105045838A true CN105045838A (en) | 2015-11-11 |
Family
ID=54452385
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510377049.3A Pending CN105045838A (en) | 2015-07-01 | 2015-07-01 | Network crawler system based on distributed storage system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105045838A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110307467A1 (en) * | 2010-06-10 | 2011-12-15 | Stephen Severance | Distributed web crawler architecture |
CN103838732A (en) * | 2012-11-21 | 2014-06-04 | 大连灵动科技发展有限公司 | Vertical search engine in life service field |
CN103258017A (en) * | 2013-04-24 | 2013-08-21 | 中国科学院计算技术研究所 | Method and system for parallel square crossing network data collection |
CN103310012A (en) * | 2013-07-02 | 2013-09-18 | 北京航空航天大学 | Distributed web crawler system |
Non-Patent Citations (1)
Title |
---|
万涛: "基于hadoop的分布式网络爬虫研究与实现", 《中国优秀硕士学位论文全文数据库信息科技辑 》 * |
Cited By (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105930482A (en) * | 2016-04-29 | 2016-09-07 | 北京小米移动软件有限公司 | Method and apparatus for matching keyword with network data |
CN106126715A (en) * | 2016-06-30 | 2016-11-16 | 北京奇虎科技有限公司 | The method and apparatus that in a kind of webpage, rendering data is included |
CN106126715B (en) * | 2016-06-30 | 2019-06-04 | 北京奇虎科技有限公司 | The method and apparatus that rendering data is included in a kind of webpage |
CN108228151A (en) * | 2016-12-22 | 2018-06-29 | 北京询达数据科技有限公司 | A kind of design method of new network robot |
CN107317724A (en) * | 2017-06-06 | 2017-11-03 | 中证信用增进股份有限公司 | Data collecting system and method based on cloud computing technology |
CN107317724B (en) * | 2017-06-06 | 2020-12-11 | 中证信用增进股份有限公司 | Data acquisition system and method based on cloud computing technology |
CN107273481A (en) * | 2017-06-10 | 2017-10-20 | 苏州唯亚信息科技股份有限公司 | Suitable for the maintaining method of enterprise customer's R & D Database |
CN107273499A (en) * | 2017-06-16 | 2017-10-20 | 成都布林特信息技术有限公司 | Data grab method based on vertical search engine |
CN109213824A (en) * | 2017-06-29 | 2019-01-15 | 北京京东尚科信息技术有限公司 | Data grabber system, method and apparatus |
CN109213824B (en) * | 2017-06-29 | 2022-03-04 | 北京京东尚科信息技术有限公司 | Data capture system, method and device |
CN107729564A (en) * | 2017-11-13 | 2018-02-23 | 北京众荟信息技术股份有限公司 | A kind of distributed focused web crawler web page crawl method and system |
CN108108440A (en) * | 2017-12-21 | 2018-06-01 | 北京慧数科技有限公司 | The acquisition method of proxy server and internet data |
CN108170803A (en) * | 2017-12-28 | 2018-06-15 | 南京烽火软件科技有限公司 | A kind of internet information is layered acquisition method |
CN108170803B (en) * | 2017-12-28 | 2021-12-21 | 南京烽火天地通信科技有限公司 | Internet information layered acquisition method |
CN110309389A (en) * | 2018-03-14 | 2019-10-08 | 北京嘀嘀无限科技发展有限公司 | Cloud computing system |
CN108520024A (en) * | 2018-03-22 | 2018-09-11 | 河海大学 | Binary cycle crawler system and its operation method based on Spark Streaming |
CN109255063A (en) * | 2018-08-01 | 2019-01-22 | 宜人恒业科技发展(北京)有限公司 | A kind of method and apparatus crawling web page contents |
CN109284434A (en) * | 2018-09-12 | 2019-01-29 | 东莞数汇大数据有限公司 | Web page contents crawling method, system and storage medium based on R language |
CN109446441A (en) * | 2018-09-26 | 2019-03-08 | 北京邮电大学 | A kind of credible distributed capture storage system of general Web Community |
CN109446441B (en) * | 2018-09-26 | 2020-11-03 | 北京邮电大学 | General credible distributed acquisition and storage system for network community |
CN109299371A (en) * | 2018-10-16 | 2019-02-01 | 珠海智慧创新科技有限公司 | A kind of policy information acquisition management system based on distributed reptile technology |
CN109522466A (en) * | 2018-10-20 | 2019-03-26 | 河南工程学院 | A kind of distributed reptile system |
CN111125589B (en) * | 2018-10-31 | 2023-09-05 | 新方正控股发展有限责任公司 | Data acquisition method and device and computer readable storage medium |
CN111125589A (en) * | 2018-10-31 | 2020-05-08 | 北大方正集团有限公司 | Data acquisition method and device and computer readable storage medium |
CN109299392A (en) * | 2018-11-21 | 2019-02-01 | 安徽云融信息技术有限公司 | A kind of optimization method of web crawlers crawl data |
CN109522562A (en) * | 2018-11-30 | 2019-03-26 | 济南浪潮高新科技投资发展有限公司 | A kind of webpage Knowledge Extraction Method based on text image fusion recognition |
CN110020062A (en) * | 2019-04-12 | 2019-07-16 | 北京邮电大学 | A kind of customized web crawlers method and system |
CN110020062B (en) * | 2019-04-12 | 2021-09-24 | 北京邮电大学 | Customizable web crawler method and system |
CN110297960A (en) * | 2019-06-17 | 2019-10-01 | 中电科大数据研究院有限公司 | A kind of distributed DOC DATA acquisition system based on configuration |
CN110765334A (en) * | 2019-09-10 | 2020-02-07 | 北京字节跳动网络技术有限公司 | Data capture method, system, medium and electronic device |
CN110737813A (en) * | 2019-09-26 | 2020-01-31 | 苏州浪潮智能科技有限公司 | method, equipment and medium for improving efficiency of reptile |
CN110737813B (en) * | 2019-09-26 | 2022-07-29 | 苏州浪潮智能科技有限公司 | Method, equipment and medium for improving efficiency of reptiles |
CN111026945B (en) * | 2019-12-05 | 2024-01-26 | 北京创鑫旅程网络技术有限公司 | Multi-platform crawler scheduling method, device and storage medium |
CN111026945A (en) * | 2019-12-05 | 2020-04-17 | 北京创鑫旅程网络技术有限公司 | Multi-platform crawler scheduling method and device and storage medium |
CN111488508A (en) * | 2020-04-10 | 2020-08-04 | 长春博立电子科技有限公司 | Internet information acquisition system and method supporting multi-protocol distributed high concurrency |
CN112818201A (en) * | 2021-02-07 | 2021-05-18 | 四川封面传媒有限责任公司 | Network data acquisition method and device, computer equipment and storage medium |
CN113704589A (en) * | 2021-09-03 | 2021-11-26 | 海粟智链(青岛)科技有限公司 | Internet system for collecting industrial chain data |
CN113704589B (en) * | 2021-09-03 | 2023-10-13 | 海粟智链(青岛)科技有限公司 | Internet system for collecting industrial chain data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105045838A (en) | Network crawler system based on distributed storage system | |
Mehmood et al. | Implementing big data lake for heterogeneous data sources | |
EP2919133A1 (en) | Method and system for identifying a sensor to be deployed in a physical environment | |
CN102831252B (en) | A kind of method for upgrading index data base and device, searching method and system | |
CN102662703A (en) | Method and device for loading application program plugins | |
Devarakonda et al. | Mercury: reusable metadata management, data discovery and access system | |
CN103440243A (en) | Teaching resource recommendation method and device thereof | |
CN111626568B (en) | Knowledge base construction method and knowledge search method and system in natural disaster field | |
CN103714116A (en) | Webpage information extracting method and webpage information extracting equipment | |
CN104598536B (en) | A kind of distributed network information structuring processing method | |
KR101087134B1 (en) | Digital Data Tagging Apparatus, Tagging and Search Service Providing System and Method by Sensory and Environmental Information | |
KR20190131778A (en) | Web Crawler System for Collecting a Structured and Unstructured Data in Hidden URL | |
CN103902667A (en) | Simple network information collector achieving method based on meta-search | |
CN108650546A (en) | Barrage processing method, computer readable storage medium and electronic equipment | |
CN114117242A (en) | Data query method and device, computer equipment and storage medium | |
CN102902792B (en) | list page identification system and method | |
Bross et al. | Mapping the blogosphere with rss-feeds | |
Sourav et al. | Recent trends of big data in precision agriculture: a review | |
JP2014532220A (en) | Net comment collection method and system | |
CN113094568A (en) | Data extraction method based on data crawler technology | |
Antunes et al. | Context storage for m2m scenarios | |
CN102929948A (en) | List page identification system and method | |
CN107679168B (en) | Target website content acquisition method based on java platform | |
Soldatos et al. | Multimedia search over integrated social and sensor networks | |
CN103793516A (en) | Method and device for obtaining URL icon |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20151111 |