CN108319652A

CN108319652A - Columnar file storage system and method for elevator data based on HDFS

Info

Publication number: CN108319652A
Application number: CN201711465597.7A
Authority: CN
Inventors: 万敏; 张仪; 丁凌峰; 张雷; 陈小游
Original assignee: Zhejiang New Zailing Technology Co Ltd
Current assignee: Zhejiang New Zailing Technology Co Ltd
Priority date: 2017-12-28
Filing date: 2017-12-28
Publication date: 2018-07-24

Abstract

The present invention provides a columnar file storage system for elevator data based on HDFS, comprising a storage system layer, a file processing system layer and a data source layer. The data source layer includes elevator real-time status data and elevator trigger-type data: the real-time status data is reported once per second, and the trigger-type data is reported when the elevator's state changes. The file processing layer includes a data loading module, a data normalization module, a file management module and an index management module. The system is built on HDFS and ElasticSearch and provides a storage scheme for this technical problem. HDFS is a highly fault-tolerant, highly scalable file system which, combined with the Parquet columnar storage file format, provides high-throughput data storage and access.

Description

Columnar file storage system and method for elevator data based on HDFS

Technical field

The present invention relates to the field of big data storage, and in particular to the application of the open-source big data components Parquet and HDFS.

Background art

In the elevator Internet of Things (IoT) industry, elevators carry a variety of sensing devices that collect operating data such as running speed, car temperature and human presence detection. Such data is varied in type, large in scale, reported at high frequency and strongly time-ordered. These four characteristics make elevator operating data difficult to store and analyze: in elevator IoT scenarios, where operating parameters are collected in large volumes, both storage and retrieval face significant performance bottlenecks.

Chinese invention patent application CN 106919675 discloses a data storage method and device. According to the received data to be stored, a preset first field is looked up in the data, the data is stored in ElasticSearch, and an index is built from the preset first field and saved. A preset second field is also obtained from the data, the data is stored in Parquet, and an index is built under a target directory and saved. That technical solution does not specify the storage medium for the Parquet files and offers neither scalability nor smooth capacity expansion for massive data storage, which is a technical risk. It also does not design a partitioned storage scheme, and its synchronous processing of ElasticSearch and Parquet may slow the process down and cause disorder when data is pulled in batches.

Summary of the invention

The first technical problem to be solved by the present invention is to provide a columnar file storage system for elevator data based on HDFS, comprising a storage system layer, a file processing system layer and a data source layer. The data source layer includes elevator real-time status data and elevator trigger-type data: the real-time status data is reported once per second, and the trigger-type data is reported when the elevator's state changes. The file processing layer includes a data loading module, a data normalization module, a file management module and an index management module. The data loading module is connected to the data source layer; it loads and sorts the data and places it in a cache. The data normalization module generates Parquet files from the data in the distributed cache database according to the file partition rule and business logic. The storage system layer includes a distributed file system and a full-text index system. The file management module manages the folders and Parquet files in the distributed file system, and the index management module manages the full-text index system according to the Parquet file directories.

Further, the file partition rule is to partition first by time and then, within each time partition, by file size.

Further, the file format rule is constrained by the message part of the Parquet configuration file.

The present invention also provides a columnar file storage method for elevator data based on HDFS. The method uses the above system and includes the following steps:

（1）Data loading;

（1.1）Obtain elevator real-time data and elevator trigger-type data from the Kafka message bus using the stream computing engine Spark Streaming;

（1.2）Within the Spark Streaming window, sort the data in reverse order along the time dimension;

（1.3）Store the sorted data in data cache queues according to data category;

（2）Data normalization;

（2.1）Read the Parquet configuration file;

（2.2）Read the data in the data cache queues in batches, in a loop;

（2.3）Generate Parquet files from the data according to the Parquet configuration file;

（2.4）Store the files in a temporary directory;

（3）Create file directories;

（3.1）Scan the temporary file directory;

（3.2）Determine the file storage directory;

（3.3）Create the file directory in HDFS;

（3.4）Upload the files to the corresponding directory;

（4）Create the index;

（4.1）Scan the temporary file directory;

（4.2）Generate index records from the file names;

（4.3）Call the ElasticSearch server to create the index.

Further, file directories are created in chronological order, and files are named according to the rule yyyy-mm-dd hh:mm:ss~yyyy-mm-dd hh:mm:ss.par, where the first time is the earliest data time in the file and the second is the latest.

The beneficial effects of the present invention are as follows. The system is built on HDFS and ElasticSearch and provides a storage scheme for the above technical problem. HDFS is a highly fault-tolerant, highly scalable file system which, combined with the Parquet columnar storage file format, provides high-throughput data storage and access. The invention focuses on a storage scheme for massive time-series data in the elevator IoT field: by storing data in partitions and building an overall index by time, it highlights the advantage of pulling data in batches for data analysis.
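The naming rule can be illustrated with a short Python sketch. It assumes the timestamps of all records in one file are available as `datetime` objects; the function name `parquet_file_name` is hypothetical and does not come from the patent.

```python
from datetime import datetime

# yyyy-mm-dd hh:mm:ss, as in the naming rule above
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parquet_file_name(timestamps):
    """Build '<earliest>~<latest>.par' from the data times held in one file."""
    if not timestamps:
        raise ValueError("a file must contain at least one record")
    earliest, latest = min(timestamps), max(timestamps)
    return f"{earliest.strftime(TIME_FORMAT)}~{latest.strftime(TIME_FORMAT)}.par"
```

Because both times are zero-padded and ordered from year to second, file names within one directory sort lexicographically in the same order as their start times.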

Description of the drawings

Fig. 1 is the system architecture diagram of the present invention.

Fig. 2 is the flow chart of data loading.

Fig. 3 is the flow chart of data normalization.

Fig. 4 is an example of a Parquet configuration file.

Fig. 5 is the flow chart of file management.

Fig. 6 is the flow chart of index management.

Detailed description of the embodiments

Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings. It should be noted that the embodiments only describe the invention specifically and do not limit it.

With reference to Fig. 1, the system of the present invention can be divided into three levels: a storage system layer, a file processing system layer and a data source layer. The data source layer includes elevator real-time status data and elevator trigger-type data: the real-time status data is reported once per second, and the trigger-type data is reported when the elevator's state changes. The file processing layer includes a data loading module, a data normalization module, a file management module and an index management module. The data loading module is connected to the data source layer; it loads and sorts the data and places it in a cache. The data normalization module generates Parquet files from the data in the distributed cache database according to the file partition rule and the file format rule. The storage system layer includes a distributed file system and a full-text index system. The file management module manages the folders and Parquet files in the distributed file system, and the index management module manages the full-text index system according to the Parquet file directories.

At present, the file partition rule in this embodiment is mainly based on time, for example partitioning by day; within the partition for a given day, files are further partitioned by file size. In Fig. 3, the checks on whether the file size exceeds a threshold and whether the elapsed time exceeds a timeout threshold embody this partition rule.

The file format rule is constrained by the message part of the Parquet configuration file.

The present invention uses the above system to store elevator IoT data by category. The detailed process is as follows:
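A minimal sketch of this two-level rule, assuming each record is a `(timestamp, payload_bytes)` pair and that the size threshold is a byte count. The names `partition_records` and `max_bytes` are illustrative only, and the timeout check from Fig. 3 is left out.

```python
from collections import defaultdict
from datetime import datetime


def partition_records(records, max_bytes):
    """Group (timestamp, payload) records by day, then split each day by size.

    Returns {"yyyy-mm-dd": [[record, ...], ...]}; each inner list is one file.
    """
    by_day = defaultdict(list)
    for ts, payload in sorted(records, key=lambda r: r[0]):
        by_day[ts.strftime("%Y-%m-%d")].append((ts, payload))

    partitions = {}
    for day, day_records in by_day.items():
        files, current, current_size = [], [], 0
        for ts, payload in day_records:
            size = len(payload)
            # Start a new file once adding this record would exceed the threshold.
            if current and current_size + size > max_bytes:
                files.append(current)
                current, current_size = [], 0
            current.append((ts, payload))
            current_size += size
        if current:
            files.append(current)
        partitions[day] = files
    return partitions
```

Splitting before the threshold is exceeded is one possible design choice; the text only says that the file size is compared against a threshold.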

（1）Data loading, as shown in Fig. 2.

（1.1）Obtain elevator real-time data and elevator trigger-type data from the Kafka message bus using the stream computing engine Spark Streaming.

（1.2）Within the Spark Streaming window, the data is sorted in reverse order along the time dimension. The window duration can be configured as needed and is generally set to 1 minute. The goal is to keep the rate at which data is fetched from each Kafka partition broadly consistent and to avoid large differences in instantaneous consumption rates.

（1.3）The sorted data is stored in data cache queues according to data category. In this embodiment the data categories are mainly elevator real-time status data and elevator trigger-type data; that is, a separate data cache queue is set up for each of these two categories, and data of each category is stored in its own queue.
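Steps (1.2) and (1.3) can be sketched in plain Python, without Spark or Kafka. The sketch assumes each record is a dict with `time` and `category` keys and that "reverse order" means descending by time; the category labels and function names are hypothetical.

```python
from collections import deque

# Hypothetical labels for the two data categories named in the text.
REAL_TIME_STATUS = "status"
TRIGGER = "trigger"


def new_queues():
    """Create one cache queue per data category."""
    return {REAL_TIME_STATUS: deque(), TRIGGER: deque()}


def load_window(window_records, queues):
    """Sort one window's records by time (descending) and enqueue by category."""
    ordered = sorted(window_records, key=lambda r: r["time"], reverse=True)
    for record in ordered:
        queues[record["category"]].append(record)
    return queues
```

In a real deployment the window contents would come from Spark Streaming and the queues would live in a distributed cache; here both are in-process stand-ins.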

（2）Data normalization, as shown in Figure 3.

（2.1）Read the Parquet configuration file; proceed to the next step if it is read successfully, otherwise stop. The configuration file mainly contains two parts, message and schemas. The message part mainly describes the Parquet file storage format. The schemas part mainly describes the parameters of the data connector, the data converter and the file storage rules, which correspond to the three modules connect, transverter and files respectively. The file format is shown in Fig. 4.

（2.2）Read the data in the data cache queue in batches, in a loop. Using the parameters of the connector part of the Parquet configuration file (the Kafka settings in connect), read the raw data from Kafka. If there is no data, the flow ends; otherwise the data read is cached in memory. During this process, if the cached data size reaches the threshold (router.maxFileSize), proceed to the next flow; if the threshold is still not reached, wait for a period of time (router.timeout) and then proceed to the next flow.

（2.3）Generate Parquet files from the data according to the Parquet configuration file. Read the converter-related settings in the configuration file. The first item configures the conversion class, which is mainly responsible for converting between data formats; in the example, JsonToParquetTransverter.class converts JSON data into Parquet files. The conversion also reads the message part of the configuration file, which specifies the exact format of the Parquet files. Files are named according to the file partition rule (router.rules) and the time range of the data they contain. The naming rule is yyyy-mm-dd hh:mm:ss~yyyy-mm-dd hh:mm:ss.par, where the first time is the earliest data time and the second is the latest.

（2.4）Store the file in a temporary directory.
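The size-or-timeout decision in step (2.2) can be sketched as a small buffer class. This assumes payloads are byte strings; `BatchBuffer` and its parameters are hypothetical names that stand in for router.maxFileSize and router.timeout, and the clock is injectable so the behavior can be checked without waiting.

```python
import time


class BatchBuffer:
    """Accumulate records until a size threshold or a timeout is reached."""

    def __init__(self, max_file_size, timeout_seconds, clock=time.monotonic):
        self.max_file_size = max_file_size      # stands in for router.maxFileSize
        self.timeout_seconds = timeout_seconds  # stands in for router.timeout
        self._clock = clock
        self._records = []
        self._size = 0
        self._started = None

    def add(self, payload):
        """Cache one payload in memory; the timer starts with the first record."""
        if self._started is None:
            self._started = self._clock()
        self._records.append(payload)
        self._size += len(payload)

    def should_flush(self):
        """True when the cached size or the waiting time reaches its threshold."""
        if not self._records:
            return False
        if self._size >= self.max_file_size:
            return True
        return self._clock() - self._started >= self.timeout_seconds

    def flush(self):
        """Hand the cached batch to the next step and reset the buffer."""
        batch = self._records
        self._records, self._size, self._started = [], 0, None
        return batch
```

A caller would loop over the cache queue, call `add` for each record, and pass the result of `flush` to the Parquet conversion step whenever `should_flush` returns true.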

（3）Create file directories.

The file management module manages folders in the HDFS file system and uploads Parquet files to the corresponding folder; the flow is shown in Fig. 5.

（3.1）Scan the temporary file directory, extract the dates from the two times at the start of each file name, and check whether a folder matching the date (yyyy-mm-dd) in the file names exists; if so,

（3.2）Determine the file storage directory;

（3.3）Create the file directory in HDFS. The directory is named after the date (yyyy-mm-dd), for example /hdfs/bigdata/2018-01-01;

（3.4）Upload the file to the corresponding directory;
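The directory decision in step (3) can be sketched as follows. The sketch assumes the target directory is derived from the date of the first time in the file name, and it only plans the work; it does not call HDFS. The helper names and the `root` default (taken from the example path above) are illustrative.

```python
import posixpath
import re

# Matches 'yyyy-mm-dd hh:mm:ss~yyyy-mm-dd hh:mm:ss.par' and captures the start date.
NAME_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2}"
    r"~\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.par$"
)


def target_directory(file_name, root="/hdfs/bigdata"):
    """Return the date-named directory that a Parquet file belongs in."""
    match = NAME_RE.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name!r}")
    return posixpath.join(root, match.group("date"))


def plan_uploads(file_names, existing_dirs, root="/hdfs/bigdata"):
    """Return (directories to create, {file name: target directory})."""
    to_create, placement = [], {}
    for name in file_names:
        directory = target_directory(name, root)
        if directory not in existing_dirs and directory not in to_create:
            to_create.append(directory)
        placement[name] = directory
    return to_create, placement
```

Separating the plan from the actual HDFS calls keeps the date logic testable on its own; how a file that spans midnight should be handled is not stated in the text.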

（4）Create the index.

The index management module builds an index over the time range of the data in each Parquet file, which narrows the range of data to traverse and speeds up data queries; the flow is shown in Fig. 6.

（4.1）Scan the temporary file directory;

（4.2）Generate index records from the file names;

（4.3）Call the ElasticSearch server to create the index.
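Step (4.2) can be sketched by parsing the time range back out of the file name. The record fields (`path`, `start_time`, `end_time`) are assumptions, since the patent does not specify the index schema, and the call to ElasticSearch itself is omitted. The second function shows how such records narrow a time-range query to the files that can contain matching data.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def index_record(file_name, directory):
    """Build one index record from a 'start~end.par' file name."""
    if not file_name.endswith(".par"):
        raise ValueError(f"unexpected file name: {file_name!r}")
    start_text, end_text = file_name[: -len(".par")].split("~")
    start = datetime.strptime(start_text, TIME_FORMAT)
    end = datetime.strptime(end_text, TIME_FORMAT)
    return {
        "path": f"{directory}/{file_name}",
        "start_time": start.isoformat(),
        "end_time": end.isoformat(),
    }


def files_overlapping(records, query_start, query_end):
    """Return paths whose [start_time, end_time] range overlaps the query range."""
    lo, hi = query_start.isoformat(), query_end.isoformat()
    return [r["path"] for r in records if r["start_time"] <= hi and r["end_time"] >= lo]
```

In the described system the overlap test would be a range query against the full-text index rather than a list scan.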

Claims

1. A columnar file storage system for elevator data based on HDFS, characterized in that it comprises a storage system layer, a file processing system layer and a data source layer, wherein the data source layer includes elevator real-time status data and elevator trigger-type data, the elevator real-time status data is reported once per second, and the elevator trigger-type data is reported when the elevator's state changes; the file processing layer includes a data loading module, a data normalization module, a file management module and an index management module; the data loading module is connected to the data source layer, loads and sorts the data therein, and places it in a cache; the data normalization module generates Parquet files from the data in the distributed cache database according to a file partition rule and a file format rule; the storage system layer includes a distributed file system and a full-text index system; the file management module manages the folders and Parquet files in the distributed file system; and the index management module manages the full-text index system according to the Parquet file directories.

2. The columnar file storage system for elevator data based on HDFS according to claim 1, characterized in that the file partition rule is to partition first by day and then, within each day partition, by file size.

3. The columnar file storage system for elevator data based on HDFS according to claim 1, characterized in that the file format rule is constrained by the message part of the Parquet configuration file.

4. A columnar file storage method for elevator data based on HDFS, characterized in that the method uses the system according to claim 1 and includes the following steps:

（1）Data loading;

（2）Data normalization;

（2.1）Read the Parquet configuration file;

（2.2）Read the data in the data cache queues in batches, in a loop;

（2.4）Store the files in a temporary directory;

（3）File management;

（3.1）Scan the temporary file directory;

（3.4）Upload the files to the corresponding directory;

（4）Index management;

（4.1）Scan the temporary file directory;

（4.2）Generate index records from the file names;

（4.3）Call the ElasticSearch server to create the index.

5. The columnar file storage method for elevator data based on HDFS according to claim 2, characterized in that file directories are created in chronological order, and the file naming rule is yyyy-mm-dd hh:mm:ss~yyyy-mm-dd hh:mm:ss.par, where the first time is the earliest data time and the second time is the latest data time.