CN105045838A - Network crawler system based on distributed storage system - Google Patents


Publication number
CN105045838A
Authority
CN
China
Prior art keywords
module
webpage
page
url
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510377049.3A
Other languages
Chinese (zh)
Inventor
贺樑
黄保荃
杨燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-07-01
Filing date
2015-07-01
Publication date
2015-11-11
Application filed by East China Normal University
Priority to CN201510377049.3A
Publication of CN105045838A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention discloses a network crawler system based on a distributed storage system. The system comprises a basic service module, a grabber, as well as a task scheduling module, a parsing service module, a page downloading module, a page updating module and a data storage module, which are all arranged in the grabber, wherein the task scheduling module controls a procedure of grabbing data by the grabber; the parsing service module parses the content of a webpage and provides user-defined template extraction information; the page downloading module downloads a source code of the webpage; the page updating module acquires data information of the updated webpage; the data storage module, with a structural information extraction method, stores the extracted content into a database of the distributed storage system; and the basic service module completes flow control of the grabber, a monitoring and warning mechanism of the grabber, a URL deduplication service, a URL normalization service and a js/css resource management service. The network crawler system based on the distributed storage system has the characteristics that a crawler method is flexible and intelligent and automatic structural extraction of webpage content information is realized.

Description

Network crawler system based on a distributed storage system
Technical field
The invention belongs to the technical field of computer data mining and search, and relates to a network crawler system based on a distributed storage system and to a method for structured information extraction.
Background
With the development and spread of Internet technology, Web resources are growing explosively, and webpages have become an important source of information in people's daily life. Internet resources are diverse, open, dynamic and heterogeneous, and cannot be managed uniformly, which makes it difficult for people to find the information they need quickly and accurately. The heterogeneity of Internet resources also makes it difficult to obtain structured information.
Summary of the invention
The object of the invention is to provide, in view of the deficiencies of the prior art, a network crawler system based on a distributed storage system that can find the required information quickly and accurately.
The specific technical solution for achieving the object of the invention is as follows:
A network crawler system based on a distributed storage system, characterized in that the system comprises a basic service module, a grabber, and a task scheduling module, a parsing service module, a page downloading module, a page updating module and a data storage module arranged in the grabber. The task scheduling module controls the flow by which the grabber crawls data; the parsing service module parses webpage content and provides user-defined templates for extracting information; the page downloading module downloads the source code of webpages and supports pages that load JavaScript and Ajax as well as asynchronously and dynamically loaded forms; the page updating module obtains the data of webpages after they have been updated; the data storage module uses the structured information extraction method to store the extracted content in the database of the distributed storage system; and the basic service module provides flow control for the grabber, the monitoring and alarm mechanism of the grabber, the URL deduplication service, the URL normalization service and the js/css resource management service.
The structured information extraction method comprises:
A) A vector space model algorithm based on a constructed dictionary, with the following steps:
1) According to the domain whose data the user wants to crawl, build a dictionary of keywords for that domain in advance; the dictionary is regarded as an m-dimensional term vector, denoted β_m;
2) Use a word segmentation tool to split the text content of the webpage into individual words;
3) Count the number of times each word occurs in the dictionary; the more often a word occurs, the higher the relevance;
4) Each word has a corresponding count; treating each word as one dimension and supposing the webpage has n words, the webpage is regarded as an n-dimensional vector;
5) Each webpage thus yields an n-dimensional term vector, denoted α_n;
6) The relevance between the webpage and the dictionary is calculated by the following formula:
7) Webpages whose relevance S(α_n, β_m) is greater than a preset threshold θ are put into the queue awaiting structured extraction;
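The relevance formula itself is not reproduced in this text, so the following is only a minimal sketch of steps 1) to 7) under a stated assumption: that S(α_n, β_m) is a cosine similarity, with the page vector holding raw word counts and the dictionary vector holding a weight of 1 per keyword. The function names `tokenize`, `relevance` and `select_pages` are hypothetical, and the regular-expression tokenizer merely stands in for the word segmentation tool (Chinese text would need a real segmenter).

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Placeholder for the word segmentation tool: lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def relevance(page_text, dictionary):
    """Assumed cosine similarity between page word counts and dictionary keywords."""
    counts = Counter(tokenize(page_text))       # alpha: one dimension per page word
    keywords = {k.lower() for k in dictionary}  # beta: weight 1 per dictionary keyword
    if not counts or not keywords:
        return 0.0
    # Only dictionary keywords contribute to the dot product.
    dot = sum(counts[k] for k in keywords)
    norm_alpha = math.sqrt(sum(c * c for c in counts.values()))
    norm_beta = math.sqrt(len(keywords))
    return dot / (norm_alpha * norm_beta)


def select_pages(pages, dictionary, theta):
    """Return the URLs whose relevance exceeds the preset threshold theta."""
    return [url for url, text in pages.items()
            if relevance(text, dictionary) > theta]
```

With this definition a page made up only of dictionary keywords scores close to 1, and a page sharing no word with the dictionary scores 0, so a threshold θ between the two separates on-topic from off-topic pages.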
B) Building the page template, with the following steps:
1) Analyze the HTML structure of the webpage; a webpage comprises a head tag and a body tag;
2) Within the head tag, the target tag is the title tag;
3) Within the body tag, the target tags are the p tag, the a tag and the form tag;
4) Combine the above target tags to complete the template of the webpage;
C) Structured information extraction, with the following step:
Use the webpage template built in B) to extract information from the webpages to be extracted that were obtained by the vector space model algorithm in A), finally obtaining structured data, which is stored in the distributed system database in XML format.
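Parts B) and C) can be illustrated with Python's standard `html.parser` and `xml.etree.ElementTree` modules. This is a sketch under assumptions, not the patented implementation: the `TEMPLATE` dictionary, the `TemplateExtractor` class, the `to_xml` function and the `<page>` root element are all hypothetical, and the parser assumes reasonably well-formed markup in which target tags are explicitly closed.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical template: the target tags named in B) for each page section.
TEMPLATE = {"head": ["title"], "body": ["p", "a", "form"]}


class TemplateExtractor(HTMLParser):
    """Collects the text of target tags inside the head and body sections."""

    def __init__(self, template):
        super().__init__()
        self.template = template
        self.section = None   # "head" or "body" while inside that section
        self.stack = []       # currently open target tags
        self.records = []     # finished records, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in self.template:
            self.section = tag
        elif self.section and tag in self.template[self.section]:
            self.stack.append({"tag": tag, "attrs": dict(attrs), "text": []})

    def handle_data(self, data):
        if self.stack:
            # Text is attributed to the innermost open target tag only.
            self.stack[-1]["text"].append(data)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1]["tag"] == tag:
            record = self.stack.pop()
            record["text"] = " ".join("".join(record["text"]).split())
            self.records.append(record)
        elif tag == self.section:
            self.section = None


def to_xml(url, html):
    """Match a page against TEMPLATE and serialize the result as XML."""
    parser = TemplateExtractor(TEMPLATE)
    parser.feed(html)
    parser.close()
    root = ET.Element("page", {"url": url})
    for record in parser.records:
        node = ET.SubElement(root, record["tag"])
        if record["tag"] == "a" and record["attrs"].get("href"):
            node.set("href", record["attrs"]["href"])
        node.text = record["text"]
    return ET.tostring(root, encoding="unicode")
```

The returned XML string is what a storage layer would then write to the distributed database; that write is left out here because the text does not name the database.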
The data crawling flow is as follows:
1) The user provides seed URLs as the entry points for crawling Internet webpages;
2) The user customizes the basic service configuration file, which covers: flow control, URL filtering rules, the URL deduplication service, URL normalization, the browser engine and the monitoring service;
3) According to the URL filtering rules and the URL deduplication service, qualifying URLs are crawled;
4) The page downloading module downloads the page corresponding to each qualifying URL one by one; multi-threaded concurrency is used to speed up page downloading; the page downloading module calls the browser engine in the basic service, and the browser engine loads and renders JavaScript/Ajax webpages by calling the Chrome kernel, ensuring the integrity of the downloaded page data;
5) The task scheduling module starts the parsing engine module, and the parsing engine module puts outbound link URLs that satisfy the filtering rules into the queue to be crawled;
6) The task scheduling module starts the data storage module, which uses the structured information extraction method to store the extracted content in the database of the distributed storage system.
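The flow above can be sketched as a small crawl loop. This is an illustrative, single-threaded simplification: the function names `normalize` and `crawl`, the regular-expression form of the filtering rule, and the injected `fetch`, `extract_links` and `store` callables are assumptions, and the multi-threaded download and browser-engine rendering of step 4) are deliberately left to the caller-supplied `fetch`.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize(url):
    """URL normalization: lowercase scheme and host, drop the fragment, default path '/'."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def crawl(seeds, allow_pattern, fetch, extract_links, store, max_pages=100):
    """Breadth-first crawl from seed URLs with filtering and deduplication."""
    allow = re.compile(allow_pattern)   # URL filtering rule (assumed to be a regex)
    seen = set()                        # URL deduplication service
    queue = deque()                     # queue of URLs to be crawled
    for seed in seeds:
        url = normalize(seed)
        if url not in seen:
            seen.add(url)
            queue.append(url)
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        page = fetch(url)               # page downloading module
        fetched += 1
        store(url, page)                # data storage module
        for link in extract_links(page):  # parsing engine module
            candidate = normalize(urljoin(url, link))
            if allow.search(candidate) and candidate not in seen:
                seen.add(candidate)
                queue.append(candidate)
    return seen
```

Passing the download, parsing and storage steps in as callables keeps the loop independent of any particular HTTP client or database, which the text does not specify.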
Compared with the prior art, the invention has the following advantages:
(1) It improves the efficiency of the web crawler. Multi-threading increases crawling concurrency, and the structured information extraction method improves the efficiency of extracting webpage information.
(2) The invention uses the vector space model method, so that webpages about a particular topic can be crawled.
(3) The invention combines the vector space model with the template method to extract webpage information intelligently and automatically.
(4) The invention is easy to operate and low in cost. Only the configuration file of the system and several Linux servers are needed to crawl webpage data on the order of millions of pages.
Brief description of the drawings
Fig. 1 is an architecture diagram of the system of the invention;
Fig. 2 is a flow chart of the structured information extraction method of the invention;
Fig. 3 is a flow chart of the system of the invention.
Detailed description
Embodiment
The invention is described in detail below, taking the crawling of Youku series basic information as an example.
Youku
1) http://movie.youku.com and http://tv.youku.com serve as the entry points for the system's crawl of Youku.
2) Customize the configuration file of the system: flow control, URL filtering rules, the URL deduplication service, URL normalization, the browser engine and the monitoring service.
3) Youku limits the access frequency for a single IP address. The local grabber therefore goes through a proxy and switches to a different IP address every hour, so that the Youku website does not reject the system's crawl requests.
4) The crawl tasks of the system are ordered by priority from high to low: crawling Youku URLs, crawling Youku basic information pages, crawling Youku playback information pages, crawling Youku episode playback information, crawling Youku comment information pages, and crawling Youku incremental pages.
5) Starting from the seed URLs http://movie.youku.com and http://tv.youku.com, the system discovers the URLs of outbound links, reads the URL filtering rules in the Youku configuration file, and filters out the URLs that do not satisfy the filtering rules. The URLs that satisfy the filtering rules are put into the queue to be crawled.
6) Download the page for each URL.
7) Build the keyword dictionary for the video domain. The dictionary is built from the 75 domain lexicons provided by Datatang, and is regarded as an m-dimensional term vector, denoted β_m.
8) Segment the page content of each URL, splitting the webpage content into individual words.
9) Count the total number of times the words of each webpage occur in the dictionary.
10) Each word has a corresponding count; treating each word as one dimension and supposing the webpage has n words, the webpage is regarded as an n-dimensional vector, denoted α_n.
11) Calculate the relevance between the webpage and the dictionary:
12) The threshold θ preset by the system is 0.4. Webpages for which the result calculated in step 11) is greater than 0.4 are put into the queue awaiting structured extraction.
13) This embodiment builds a template only for the Youku series basic information webpage. Note: a real production environment needs more than one template, for example a Youku series basic information page template, a playback information page template, an episode playback information page template, a comment information page template, a Youku aggregate index page template, and so on.
14) Youku series basic information webpage template:
15) Using the template from the previous step, the system automatically matches webpages against the template, saves the matched information in an XML file, and stores the XML file in the distributed storage system database. A sample of the structurally extracted data is as follows:
16) Through the above steps, the Youku series basic information webpages can be crawled.
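Two scheduling details of the embodiment, the hourly proxy switch and the task priority order, lend themselves to a short sketch. Everything named here is hypothetical: the task-kind labels in `PRIORITY`, the `TaskQueue` class and the `proxy_for` function are illustrative choices, and the text does not say how the system actually implements either mechanism.

```python
import heapq
import itertools

# Hypothetical labels for the six task kinds, highest priority first (lowest number).
PRIORITY = {
    "url_discovery": 0,
    "basic_info": 1,
    "play_info": 2,
    "episode_info": 3,
    "comments": 4,
    "incremental": 5,
}


class TaskQueue:
    """Priority queue of crawl tasks; ties are served in insertion order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, kind, url):
        # The counter breaks ties so that equal-priority tasks stay first-in, first-out.
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._counter), kind, url))

    def pop(self):
        _, _, kind, url = heapq.heappop(self._heap)
        return kind, url

    def __len__(self):
        return len(self._heap)


def proxy_for(now_seconds, proxies, period_seconds=3600):
    """Pick the proxy for the current time slot, rotating once per period."""
    slot = int(now_seconds // period_seconds)
    return proxies[slot % len(proxies)]
```

In use, a grabber would call `proxy_for(time.time(), proxies)` before each request and pop tasks from `TaskQueue` until it is empty; the round-robin rotation assumes the proxy list has at least two entries so that consecutive hours use different addresses.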

Claims (3)

1. A network crawler system based on a distributed storage system, characterized in that the system comprises a basic service module, a grabber, and a task scheduling module, a parsing service module, a page downloading module, a page updating module and a data storage module arranged in the grabber; the task scheduling module controls the flow by which the grabber crawls data; the parsing service module parses webpage content and provides user-defined templates for extracting information; the page downloading module downloads the source code of webpages and supports pages that load JavaScript and Ajax as well as asynchronously and dynamically loaded forms; the page updating module obtains the data of webpages after they have been updated; the data storage module uses the structured information extraction method to store the extracted content in the database of the distributed storage system; and the basic service module provides flow control for the grabber, the monitoring and alarm mechanism of the grabber, the URL deduplication service, the URL normalization service and the js/css resource management service.
2. The network crawler system according to claim 1, characterized in that the structured information extraction method comprises:
A) A vector space model algorithm based on a constructed dictionary, with the following steps:
1) According to the domain whose data the user wants to crawl, build a dictionary of keywords for that domain in advance; the dictionary is regarded as an m-dimensional term vector, denoted β_m;
2) Use a word segmentation tool to split the text content of the webpage into individual words;
3) Count the number of times each word occurs in the dictionary; the more often a word occurs, the higher the relevance;
4) Each word has a corresponding count; treating each word as one dimension and supposing the webpage has n words, the webpage is regarded as an n-dimensional vector;
5) Each webpage thus yields an n-dimensional term vector, denoted α_n;
6) The relevance between the webpage and the dictionary is calculated by the following formula:
7) Webpages whose relevance S(α_n, β_m) is greater than a preset threshold θ are put into the queue awaiting structured extraction;
B) Building the page template, with the following steps:
1) Analyze the HTML structure of the webpage; a webpage comprises a head tag and a body tag;
2) Within the head tag, the target tag is the title tag;
3) Within the body tag, the target tags are the p tag, the a tag and the form tag;
4) Combine the above target tags to complete the template of the webpage;
C) Structured information extraction, with the following step:
Use the webpage template built in B) to extract information from the webpages to be extracted that were obtained by the vector space model algorithm in A), finally obtaining structured data, which is stored in the distributed system database in XML format.
3. The network crawler system according to claim 1, characterized in that the data crawling flow is as follows:
1) The user provides seed URLs as the entry points for crawling Internet webpages;
2) The user customizes the basic service configuration file, which covers: flow control, URL filtering rules, the URL deduplication service, URL normalization, the browser engine and the monitoring service;
3) According to the URL filtering rules and the URL deduplication service, qualifying URLs are crawled;
4) The page downloading module downloads the page corresponding to each qualifying URL one by one; multi-threaded concurrency is used to speed up page downloading; the page downloading module calls the browser engine in the basic service, and the browser engine loads and renders JavaScript/Ajax webpages by calling the Chrome kernel, ensuring the integrity of the downloaded page data;
5) The task scheduling module starts the parsing engine module, and the parsing engine module puts outbound link URLs that satisfy the filtering rules into the queue to be crawled;
6) The task scheduling module starts the data storage module, which uses the structured information extraction method to store the extracted content in the database of the distributed storage system.
CN201510377049.3A 2015-07-01 2015-07-01 Network crawler system based on distributed storage system Pending CN105045838A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510377049.3A CN105045838A (en) 2015-07-01 2015-07-01 Network crawler system based on distributed storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510377049.3A CN105045838A (en) 2015-07-01 2015-07-01 Network crawler system based on distributed storage system

Publications (1)

Publication Number Publication Date
CN105045838A true CN105045838A (en) 2015-11-11

Family

ID=54452385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510377049.3A Pending CN105045838A (en) 2015-07-01 2015-07-01 Network crawler system based on distributed storage system

Country Status (1)

Country Link
CN (1) CN105045838A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110307467A1 (en) * 2010-06-10 2011-12-15 Stephen Severance Distributed web crawler architecture
CN103258017A (en) * 2013-04-24 2013-08-21 中国科学院计算技术研究所 Method and system for parallel square crossing network data collection
CN103310012A (en) * 2013-07-02 2013-09-18 北京航空航天大学 Distributed web crawler system
CN103838732A (en) * 2012-11-21 2014-06-04 大连灵动科技发展有限公司 Vertical search engine in life service field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
万涛: "基于hadoop的分布式网络爬虫研究与实现", 《中国优秀硕士学位论文全文数据库信息科技辑 》 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20151111
