CN105447146A - Massive data collecting and exchanging system and method - Google Patents

Info

Publication number
CN105447146A
CN105447146A CN201510843249.3A CN201510843249A CN105447146A CN 105447146 A CN105447146 A CN 105447146A CN 201510843249 A CN201510843249 A CN 201510843249A CN 105447146 A CN105447146 A CN 105447146A
Authority
CN
China
Prior art keywords
data
event
transmission channel
receiver
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510843249.3A
Other languages
Chinese (zh)
Inventor
朱志祥
梁小江
肖跃雷
于金良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Aite Informatization Engineering Consultation Co Ltd
Original Assignee
Shaanxi Aite Informatization Engineering Consultation Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Aite Informatization Engineering Consultation Co Ltd
Priority to CN201510843249.3A
Publication of CN105447146A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a massive data collecting and exchanging system. The system adopts a proxy (agent) model, in which each proxy comprises a data collector, a transmission channel, and a receiver. The data collector collects data from a data source, converts the data into events through processing, and sends them into the transmission channel in the form of events (each comprising event header information and data); it supports a plurality of data receivers. The transmission channel caches the events sent by the data collector. The receiver extracts events from the transmission channel and, according to the corresponding configuration, stores the file in a file system or a database, or submits it to a remote server or a next-level proxy. In the massive data collecting and exchanging system and method, all proxies are independent of one another and can exchange data from a plurality of data sources in parallel, which separates data read-in from data write-out and makes the system architecture more flexible, efficient, and lightweight.

Description

A massive data collection and exchange system and method
Technical field
The present invention relates to the field of big data and data collection, and specifically to a massive data collection and exchange system and method.
Background
With the development and accumulation of information and communication technology (ICT), data of all kinds has grown explosively, so that data at the terabyte (TB), petabyte (PB), and even exabyte (EB) scale has become the norm, and the big data era has arisen. Although big data originated in information technology and is becoming increasingly widespread and mature, its impact on social and economic activity is not limited to the technical level. More fundamentally, it offers a new way of viewing the world: decisions will increasingly be made on the basis of data analysis rather than, as before, mostly on experience and intuition.
Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within a tolerable time. That conventional software tools cannot process big data means that the machines we use every day cannot complete big data storage and analysis tasks, while the price of a high-performance mainframe doubles or rises several times over as its performance increases. How can these problems be solved? Distributed clusters solve them well. The open-source distributed system architecture Hadoop was designed and developed to solve massive data storage and processing in the Internet era. Put simply, Hadoop is a distributed computing and storage system that makes it easier to develop and run parallel processing of large-scale data. It offers strong horizontal scalability, low cost, high efficiency, and reliability. Hadoop's user base has expanded from traditional Internet companies to the telecommunications, power, healthcare, and financial industries, and it is being applied ever more widely.
Although these features make Hadoop well suited to storing and processing big data, much raw data is stored on standalone machines rather than in a Hadoop cluster. If this data cannot be exchanged into the Hadoop cluster, none of Hadoop's advantages can be realized. How to exchange this raw data onto the Hadoop platform is therefore the first problem to solve, and a fast, efficient, safe, and reliable way to exchange data from different data sources into Hadoop is urgently needed. The Hadoop project currently has a sub-project, the data transfer tool Sqoop, which can exchange data between relational databases and Hadoop, but it has two shortcomings: 1. it can exchange data only with relational databases; 2. Sqoop depends on the Hadoop environment to run and cannot exchange data without Hadoop.
To address the above problems, the present invention proposes a massive data collection and exchange system and method.
Summary of the invention
The present invention is a massive data collection and exchange system and method whose purpose is to realize data exchange between different data sources and a big data processing platform.
The technical solution of the present invention is a massive data collection and exchange system characterized in that the system adopts an agent (proxy) model. Each agent of the system comprises a data collector, a transmission channel, and a receiver. The agents are independent of one another and can exchange data from multiple data sources in parallel, which separates data read-in from data write-out and makes the system architecture more flexible, lightweight, and efficient.
The data collector is responsible for collecting data from the data source, converting it into events through processing, and sending it into the transmission channel in the form of events (each comprising event header information and data); it supports several types of data receiver.
The transmission channel caches the events sent by the data collector. To ensure data reliability during transfer, an event is deleted from the current transmission channel only after it has been cached in the next transmission channel or processed by the receiver.
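The deletion rule for the transmission channel can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `TransmissionChannel`, the methods `peek` and `commit`, and the helper `drain` are hypothetical names introduced here, and an in-memory queue is assumed.

```python
from collections import deque


class TransmissionChannel:
    """In-memory buffer that drops an event only after the consumer confirms it."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        if len(self._events) >= self.capacity:
            raise OverflowError("channel is full")
        self._events.append(event)

    def peek(self):
        # Return the oldest event without removing it (None if empty).
        return self._events[0] if self._events else None

    def commit(self):
        # Called once the downstream stage has handled the oldest event.
        self._events.popleft()

    def __len__(self):
        return len(self._events)


def drain(channel, handle):
    """Pass events to `handle`; stop, keeping the event, when handling fails."""
    delivered = 0
    while (event := channel.peek()) is not None:
        try:
            handle(event)
        except Exception:
            break          # the event stays in the channel for a later retry
        channel.commit()   # delete only after successful processing
        delivered += 1
    return delivered
```

If `handle` raises, for example because the destination is unreachable, the event remains buffered and a later call to `drain` can retry it.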
The receiver extracts events from the transmission channel and, according to the corresponding configuration, stores the file in a file system or database, or submits it to a remote server or to the next-level agent.
The data receivers supported by the data collector include file, directory, and database.
The transmission channel types include file and memory.
The receiver types include the Hadoop Distributed File System (HDFS), the non-relational database HBase (Hadoop Database), the message system Kafka, and file.
The present invention is also a massive data collection and exchange method characterized in that it comprises the following steps.
1) Write the agent's configuration file according to requirements.
2) Start the agent according to the written configuration file. Once the agent has started successfully, data transfer begins: the data receiver reads data from the external data source into the agent; the data read in is encapsulated into events and sent to the transmission channel, where it is cached until the receiver extracts it; the receiver extracts these events, parses them back into raw data, and stores the data at the final destination. After the agent starts, data transfer is automatic, and changed data is also collected automatically as the data changes.
Step 1) is implemented as follows.
100) The data receiver type must be configured according to the type of the external data source. If the data source is the files under a directory, the receiver type is configured as spooling directory (SpoolingDirectory, spooldir), and the location of the data source must also be configured.
101) The transmission channel type is configured as required; the transmission channel also requires options such as the channel capacity and the transfer capacity.
102) The receiver type depends on where the user ultimately wants the data stored. When HDFS is selected as the receiver, configure the location on HDFS where files are stored and the file size.
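Configuration steps 100) to 102) can be sketched as follows. The source does not give the configuration file syntax, so this Python sketch expresses the three sections as a dictionary; every key name, path, and number below is a hypothetical placeholder, and `validate_config` is an illustrative helper rather than part of the described system.

```python
# Hypothetical configuration for one agent; all key names are illustrative.
AGENT_CONFIG = {
    # Step 100: data receiver type and the location of the data source.
    "data_receiver": {"type": "spooldir", "location": "/data/incoming"},
    # Step 101: channel type, channel capacity, and transfer capacity.
    "channel": {"type": "memory", "capacity": 10000, "transfer_capacity": 100},
    # Step 102: receiver type plus HDFS-specific storage path and file size.
    "receiver": {"type": "hdfs", "path": "/collected/events", "file_size_bytes": 134217728},
}

REQUIRED_OPTIONS = {
    "data_receiver": {"type", "location"},
    "channel": {"type", "capacity", "transfer_capacity"},
    "receiver": {"type"},
}


def validate_config(config):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for section, required in REQUIRED_OPTIONS.items():
        if section not in config:
            problems.append(f"missing section: {section}")
            continue
        for option in sorted(required - set(config[section])):
            problems.append(f"missing option: {section}.{option}")
    return problems
```

Checking the configuration before starting the agent catches a missing source location or channel capacity early, before any data is read.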
The data transfer within the agent in step 2) proceeds as follows.
200) The data collector reads data from the external data source at the configured address. After reading, it first determines whether the data is new. If so, it preprocesses the data, formats it as specified, adds header information, and encapsulates it into an event.
201) The data collector sends the event to one or more transmission channels. A transmission channel can be regarded as a buffer that holds the event until the receiver extracts and processes it.
202) The receiver extracts the event from the transmission channel, parses it back into raw data, and writes the data to the destination by calling a client interface, or the data serves as the external data source of a next-level agent.
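Steps 200) to 202) can be sketched in Python as below. This is only an illustration under stated assumptions: the new-data check is modeled as a set of already-seen records, preprocessing is reduced to stripping whitespace, and a plain list stands in for the destination reached through the client interface. The function names `collect` and `deliver` are hypothetical.

```python
from collections import deque


def collect(records, seen, channel):
    """Steps 200-201: wrap each new record in an event and buffer it."""
    for record in records:
        if record in seen:          # skip data that was already collected
            continue
        seen.add(record)
        # Assumed event layout: header information plus the formatted data.
        event = {"headers": {"length": len(record)}, "body": record.strip()}
        channel.append(event)


def deliver(channel, destination):
    """Step 202: take events out of the channel and store the raw data."""
    while channel:
        event = channel[0]
        destination.append(event["body"])   # stand-in for a client write call
        channel.popleft()                   # remove only after the write
```

Calling `collect` twice with overlapping input buffers each record only once, which mirrors the check for new data in step 200).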
Brief description of the drawings
Fig. 1 is the overall architecture diagram of the system.
Fig. 2 is the data flow diagram inside an agent.
Detailed description
The present invention is described in detail below with reference to the accompanying drawings. The following detailed description does not limit the invention; rather, the scope of the invention is determined by the claims.
The present invention is a massive data collection and exchange system whose overall architecture is shown in Fig. 1. The system adopts an agent (proxy) model. Each agent of the system comprises a data collector, a transmission channel, and a receiver. The agents are independent of one another and can exchange data from multiple data sources in parallel, which separates data read-in from data write-out and makes the system architecture more flexible, lightweight, and efficient.
The data collector is responsible for collecting data from the data source, converting it into events through processing, and sending it into the transmission channel in the form of events (each comprising event header information and data); it supports several types of data receiver, such as file, directory, and database.
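The two-part event (header information plus data) could be modeled as in the Python sketch below. The source does not specify a wire format, so the layout used here (a 4-byte big-endian header length, a JSON-encoded header, then the raw body) is purely an assumption, as are the names `Event`, `to_bytes`, and `from_bytes`.

```python
from dataclasses import dataclass, field
import json


@dataclass
class Event:
    """Basic unit of exchange: header information plus the data itself."""
    headers: dict = field(default_factory=dict)
    body: bytes = b""

    def to_bytes(self):
        # Assumed layout: 4-byte header length, JSON header, raw body.
        header_bytes = json.dumps(self.headers, sort_keys=True).encode("utf-8")
        return len(header_bytes).to_bytes(4, "big") + header_bytes + self.body

    @classmethod
    def from_bytes(cls, raw):
        header_len = int.from_bytes(raw[:4], "big")
        headers = json.loads(raw[4:4 + header_len].decode("utf-8"))
        return cls(headers=headers, body=raw[4 + header_len:])
```

Keeping the header separate from the body lets a channel or receiver inspect metadata, such as the source type, without parsing the data.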
The transmission channel caches the events sent by the data collector. To ensure data reliability during transfer, an event is deleted from the current transmission channel only after it has been cached in the next transmission channel or processed by the receiver. Supported channel types include file, memory, and others.
The receiver extracts events from the transmission channel and, according to the corresponding configuration, stores the file in a file system or database, or submits it to a remote server or to the next-level agent. Supported receiver types include HDFS, HBase, Kafka, file, and others.
The present invention is also a massive data collection and exchange method. Completing one data exchange with this method requires the following steps.
1) First write the agent's configuration file according to requirements. The specific steps are as follows.
100) The data receiver type must be configured according to the type of the external data source. For example, if the data source is the files under a directory, the receiver type is configured as spooldir, and the location of the data source, such as the absolute file path, must also be configured. There are also some type-specific attributes, which are not described in detail here.
101) The transmission channel type, such as memory or file, is configured as required; the transmission channel also requires options such as the channel capacity and the transfer capacity.
102) The receiver type, such as HDFS or HBase, depends on where the user ultimately wants the data stored. Each receiver type also has its own specific parameters; for example, when HDFS is selected as the receiver, configure the location on HDFS where files are stored, the file size, and so on.
2) Start the agent according to the written configuration file. Once the agent has started successfully, data transfer begins: the data receiver reads data from the external data source into the agent; the data read in is encapsulated into events and sent to the transmission channel, where it is cached until the receiver extracts it; the receiver extracts these events, parses them back into raw data, and stores the data at the final destination, such as HDFS in Fig. 1, which completes one data transfer. After the agent starts, data transfer is automatic, and changed data is also collected automatically as the data changes.
Within step 2), the data flow inside an agent is shown in Fig. 2. Events carry the data flow throughout the data exchange system; the event is the basic unit of data exchange. The data collector reads data from the external data source at the configured address. After reading, it first determines whether the data is new. If so, it preprocesses the data, formats it as specified, adds header information, and encapsulates it into an event. The data collector then sends the event to one or more transmission channels. A transmission channel can be regarded as a buffer that holds the event until the receiver extracts and processes it. The receiver extracts the event from the transmission channel, parses it back into raw data, and writes the data to the destination by calling a client interface, or the data serves as the external data source of a next-level agent. This is permitted: in Fig. 1, for example, agents 1, 2, and 3 use their own receivers as the data source of the next-level agent 4.
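The multi-level arrangement, in which agents 1, 2, and 3 feed agent 4, can be sketched in Python as follows. The `Agent` class and its methods are hypothetical, each agent's channel is assumed to be an in-memory queue, and a list named `store` stands in for the final destination.

```python
from collections import deque


class Agent:
    def __init__(self, name, write):
        self.name = name
        self.channel = deque()
        self.write = write      # callable that receives each outgoing event

    def collect(self, data):
        # Wrap data read from an external source in an event.
        self.channel.append({"headers": {"agent": self.name}, "body": data})

    def accept(self, event):
        # Used when this agent is the next level for another agent.
        self.channel.append(event)

    def flush(self):
        # Hand each buffered event downstream, removing it only afterwards.
        while self.channel:
            self.write(self.channel[0])
            self.channel.popleft()


store = []                                   # stand-in for the final destination
agent4 = Agent("agent4", store.append)
first_level = [Agent(f"agent{i}", agent4.accept) for i in (1, 2, 3)]
for agent in first_level:
    agent.collect(f"data from {agent.name}")
    agent.flush()
agent4.flush()
```

Because each first-level agent only needs a callable to write to, the same class serves both as a leaf agent and as the aggregating next-level agent.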
This is a very straightforward design. It is worth noting that the system provides a large number of built-in data collectors (such as the file, web server, and database in Fig. 1), transmission channels (file, memory, etc.), and receivers (such as HDFS in Fig. 1). Different types of data collector, transmission channel, and receiver can be combined freely. The user sets the combination in the configuration file, which makes the system simple and flexible to use. For example, a transmission channel can cache events in memory or persist them to the local file system, and a receiver can write logs to HDFS, to HBase, or even to another data collector.
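The free combination of component types can be sketched in Python as a lookup against sets of known type names. The type names below come from the lists in this description, while the set names and the `build_agent` function are hypothetical.

```python
import itertools

# Type names taken from the description; the set names are illustrative.
COLLECTORS = {"file", "directory", "database"}
CHANNELS = {"memory", "file"}
RECEIVERS = {"hdfs", "hbase", "kafka", "file"}


def build_agent(collector, channel, receiver):
    """Validate a user-chosen combination and return it as an agent description."""
    for kind, value, known in (
        ("collector", collector, COLLECTORS),
        ("channel", channel, CHANNELS),
        ("receiver", receiver, RECEIVERS),
    ):
        if value not in known:
            raise ValueError(f"unknown {kind} type: {value}")
    return {"collector": collector, "channel": channel, "receiver": receiver}


# Every collector type can be paired with every channel and receiver type.
ALL_COMBINATIONS = [
    build_agent(*combo)
    for combo in itertools.product(sorted(COLLECTORS), sorted(CHANNELS), sorted(RECEIVERS))
]
```

With three collector types, two channel types, and four receiver types, this yields 24 valid combinations, each selectable purely through configuration.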

Claims (7)

1. A massive data collection and exchange system, characterized in that: the system adopts an agent (proxy) model; each agent of the system comprises a data collector, a transmission channel, and a receiver; the agents are independent of one another and can exchange data from multiple data sources in parallel, which separates data read-in from data write-out and makes the system architecture more flexible, lightweight, and efficient;
the data collector is responsible for collecting data from the data source, converting it into events through processing, and sending it into the transmission channel in the form of events (each comprising event header information and data), and supports several types of data receiver;
the transmission channel caches the events sent by the data collector; to ensure data reliability during transfer, an event is deleted from the current transmission channel only after it has been cached in the next transmission channel or processed by the receiver;
the receiver extracts events from the transmission channel and, according to the corresponding configuration, stores the file in a file system or database, or submits it to a remote server or to the next-level agent.
2. The system as claimed in claim 1, characterized in that: the data receivers supported by the data collector include file, directory, and database.
3. The system as claimed in claim 1, characterized in that: the transmission channel types include file and memory.
4. The system as claimed in claim 1, characterized in that: the receiver types include HDFS, HBase, Kafka, and file.
5. A massive data collection and exchange method, characterized in that the method comprises the following steps:
1) write the agent's configuration file according to requirements;
2) start the agent according to the written configuration file; once the agent has started successfully, data transfer begins: the data receiver reads data from the external data source into the agent; the data read in is encapsulated into events and sent to the transmission channel, where it is cached until the receiver extracts it; the receiver extracts these events, parses them back into raw data, and stores the data at the final destination; after the agent starts, data transfer is automatic, and changed data is also collected automatically as the data changes.
6. The method as claimed in claim 5, characterized in that step 1) is implemented as follows:
100) the data receiver type must be configured according to the type of the external data source; if the data source is the files under a directory, the receiver type is configured as spooldir, and the location of the data source must also be configured;
101) the transmission channel type is configured as required, and the transmission channel also requires options such as the channel capacity and the transfer capacity;
102) the receiver type depends on where the user ultimately wants the data stored; when HDFS is selected as the receiver, configure the location on HDFS where files are stored and the file size.
7. The method as claimed in claim 5, characterized in that the data transfer within the agent in step 2) proceeds as follows:
200) the data collector reads data from the external data source at the configured address; after reading, it first determines whether the data is new; if so, it preprocesses the data, formats it as specified, adds header information, and encapsulates it into an event;
201) the data collector sends the event to one or more transmission channels, where a transmission channel can be regarded as a buffer that holds the event until the receiver extracts and processes it;
202) the receiver extracts the event from the transmission channel, parses it back into raw data, and writes the data to the destination by calling a client interface, or the data serves as the external data source of a next-level agent.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510843249.3A CN105447146A (en) 2015-11-26 2015-11-26 Massive data collecting and exchanging system and method

Publications (1)

Publication Number Publication Date
CN105447146A true CN105447146A (en) 2016-03-30

Family

ID=55557322

Country Status (1)

Country Link
CN (1) CN105447146A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101483887A (en) * 2009-02-25 2009-07-15 南京邮电大学 Multi-proxy collaboration method applied to wireless multimedia sensor network
CN102801559A (en) * 2012-08-03 2012-11-28 南京富士通南大软件技术有限公司 Intelligent local area network data collecting method
CN103731298A (en) * 2013-11-15 2014-04-16 中国航天科工集团第二研究院七〇六所 Large-scale distributed network safety data acquisition method and system
US20140297826A1 (en) * 2013-04-01 2014-10-02 Electronics And Telecommunications Research Institute System and method for big data aggregation in sensor network
CN105025090A (en) * 2015-06-24 2015-11-04 上海斐讯数据通信技术有限公司 Data transmission customization system and method

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202324A (en) * 2016-06-30 2016-12-07 北京奇虎科技有限公司 The data processing method of a kind of real-time calculating platform and device
CN106202324B (en) * 2016-06-30 2020-10-30 北京奇虎科技有限公司 Data processing method and device for real-time computing platform
CN106227855A (en) * 2016-07-28 2016-12-14 努比亚技术有限公司 A kind of transacter, system and method
CN106383758A (en) * 2016-09-22 2017-02-08 郑州云海信息技术有限公司 Operation system-based information acquisition method
CN108614820A (en) * 2016-12-09 2018-10-02 腾讯科技(深圳)有限公司 The method and apparatus for realizing the parsing of streaming source data
CN109088782A (en) * 2018-11-01 2018-12-25 郑州云海信息技术有限公司 The log collecting method and device of distributed system
CN109857448A (en) * 2018-12-30 2019-06-07 贝壳技术有限公司 A kind of multi-data source cut-in method and device
CN114500315A (en) * 2021-12-31 2022-05-13 深圳云天励飞技术股份有限公司 Equipment state monitoring method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160330