CN108319652A - HDFS-based columnar file storage system and method for elevator data - Google Patents

Publication number
CN108319652A
Authority
CN
China
Prior art keywords
data
file
elevator
hdfs
parquet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711465597.7A
Other languages
Chinese (zh)
Inventor
万敏
张仪
丁凌峰
张雷
陈小游
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang New Zailing Technology Co Ltd
Original Assignee
Zhejiang New Zailing Technology Co Ltd
Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems
    • G06F16/11: File system administration, e.g. details of archiving or snapshots
    • G06F16/116: Details of conversion of file system types or formats
    • G06F16/13: File access structures, e.g. distributed indices
    • G06F16/134: Distributed indices

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an HDFS-based columnar file storage system for elevator data, comprising a storage system layer, a file processing system layer and a data source layer. The data source layer includes elevator real-time status data and elevator event-triggered data; the real-time status data are reported once per second, and the event-triggered data are reported when a change occurs in the elevator. The file processing layer includes a data loading module, a data normalization module, a file management module and an index management module. The system is designed on HDFS and ElasticSearch and provides a storage scheme that addresses this technical difficulty. HDFS is a highly fault-tolerant, highly scalable file system; combined with the Parquet columnar file format, it can provide high-throughput data storage and access.

Description

HDFS-based columnar file storage system and method for elevator data
Technical field
The present invention relates to the field of big data storage, and in particular to the application of the open-source big data components Parquet and HDFS.
Background
In the elevator IoT industry, elevators carry a variety of sensing devices to collect operating data, including elevator running speed, car temperature, human presence sensing and so on. Such data are characterized by many data types, large scale, high reporting frequency and strong temporal ordering. These four characteristics make elevator operating data difficult to store and analyze. In elevator IoT scenarios, where elevator operating parameters are collected in large volumes, both data storage and retrieval face significant performance bottlenecks.
Chinese invention patent application CN 106919675 discloses a data storage method and device. According to the received data to be stored, a preset first field is looked up in the data, the data are stored in ElasticSearch, and an index is built from the preset first field and saved. According to the received data to be stored, a preset second field is obtained from the data, the data are stored in Parquet, and an index is built under a target directory and saved. That technical solution does not specify the storage medium for the Parquet files and offers neither scalability for massive data storage nor smooth capacity expansion, which poses a technical risk. It also does not design a partitioned storage scheme, and processing ElasticSearch and Parquet synchronously may slow down and disorder the process when data are pulled in batches.
Summary of the invention
The first technical problem to be solved by the present invention is to provide an HDFS-based columnar file storage system for elevator data, comprising a storage system layer, a file processing system layer and a data source layer. The data source layer includes elevator real-time status data and elevator event-triggered data; the real-time status data are reported once per second, and the event-triggered data are reported when a change occurs in the elevator. The file processing layer includes a data loading module, a data normalization module, a file management module and an index management module. The data loading module is connected to the data source layer and loads and sorts the data therein before placing them in a cache. The data normalization module generates Parquet files from the data in the distributed cache database according to the file partitioning rule and business logic. The storage system layer includes a distributed file system and a full-text index system. The file management module manages the folders and Parquet files in the distributed file system, and the index management module manages the full-text index system according to the Parquet file directory.
Further, the file partitioning rule partitions first by time and then, within each time partition, by file size.
Further, the file format rule is constrained by the message part of the Parquet configuration file.
The present invention also provides an HDFS-based columnar file storage method for elevator data. The method uses the above system and includes the following steps:
(1) Data loading;
(1.1) Use the stream computing engine Spark Streaming to obtain elevator real-time data and elevator event-triggered data from the Kafka message bus;
(1.2) Within the Spark Streaming window, sort the data in reverse order along the time dimension;
(1.3) Store the sorted data in data cache queues according to data category;
(2) Data normalization;
(2.1) Read the Parquet configuration file;
(2.2) Read the data in the data cache queues cyclically in batches;
(2.3) Generate Parquet files from the data according to the Parquet configuration file;
(2.4) Store the files in a temporary directory;
(3) Create the file directory;
(3.1) Scan the temporary file directory;
(3.2) Determine the file storage directory;
(3.3) Create the file directory in HDFS;
(3.4) Upload the files to the corresponding directory;
(4) Create the index;
(4.1) Scan the temporary file directory;
(4.2) Generate index records from the file names;
(4.3) Call the ElasticSearch server to create the index.
Further, file directories are created in chronological order, and the file naming rule is yyyy-mm-dd hh:mm:ss~yyyy-mm-dd hh:mm:ss.par, where the first time is the earliest data time and the second time is the latest data time.
The beneficial effects of the invention are as follows. The system is designed on HDFS and ElasticSearch and provides a storage scheme that addresses this technical difficulty. HDFS is a highly fault-tolerant, highly scalable file system; combined with the Parquet columnar file format, it can provide high-throughput data storage and access. The invention focuses on a storage scheme for massive time-series data in the elevator IoT field: it stores data in partitions and builds a coarse-grained index by time, which highlights the advantage of pulling data in batches for data analysis.
Brief description of the drawings
Fig. 1 is the system architecture diagram of the present invention.
Fig. 2 is the flow chart of data loading.
Fig. 3 is the flow chart of data normalization.
Fig. 4 is an example of a Parquet configuration file.
Fig. 5 is the flow chart of file management.
Fig. 6 is the flow chart of index management.
Detailed description
Specific embodiments of the present invention are described in further detail below with reference to the drawings. Note that the embodiments only describe the present invention in detail and do not limit it.
With reference to Fig. 1, the system of the invention is divided into three levels: a storage system layer, a file processing system layer and a data source layer. The data source layer includes elevator real-time status data and elevator event-triggered data; the real-time status data are reported once per second, and the event-triggered data are reported when a change occurs in the elevator. The file processing layer includes a data loading module, a data normalization module, a file management module and an index management module. The data loading module is connected to the data source layer and loads and sorts the data therein before placing them in a cache. The data normalization module generates Parquet files from the data in the distributed cache database according to the file partitioning rule and the file format rule. The storage system layer includes a distributed file system and a full-text index system. The file management module manages the folders and Parquet files in the distributed file system, and the index management module manages the full-text index system according to the Parquet file directory.
At present, the file partitioning rule is mainly time-based. In this embodiment, data are partitioned by day, and within the same day's partition they are further partitioned by file size. The checks in Fig. 3 of whether the file size exceeds a threshold and whether the waiting time exceeds a timeout threshold embody this partitioning rule.
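The two-level rule can be illustrated with a minimal Python sketch. It is not the patented implementation: the record layout (a dict with a `ts` datetime), the `max_bytes` parameter and the use of the JSON-encoded length as a size estimate are all assumptions made for illustration.

```python
import json
from collections import defaultdict
from datetime import datetime


def partition_records(records, max_bytes):
    """Group records by calendar day, then split each day into size-bounded chunks."""
    by_day = defaultdict(list)
    for record in records:
        by_day[record["ts"].strftime("%Y-%m-%d")].append(record)

    partitions = {}
    for day, day_records in by_day.items():
        chunks, current, current_size = [], [], 0
        for record in day_records:
            # Assumed size estimate: length of the JSON encoding in bytes.
            size = len(json.dumps(record, default=str).encode("utf-8"))
            # Start a new chunk when adding this record would exceed the limit.
            if current and current_size + size > max_bytes:
                chunks.append(current)
                current, current_size = [], 0
            current.append(record)
            current_size += size
        if current:
            chunks.append(current)
        partitions[day] = chunks
    return partitions
```

Each key of the returned dict corresponds to one day partition, and each chunk inside it to one candidate file.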
The file format rule is constrained by the message part of the Parquet configuration file.
The present invention uses the above system to store elevator IoT data by category. The detailed process is as follows:
(1) Data loading, as shown in Fig. 2.
(1.1) Use the stream computing engine Spark Streaming to obtain elevator real-time data and elevator event-triggered data from the Kafka message bus.
(1.2) Within the Spark Streaming window, sort the data in reverse order along the time dimension. The window duration can be configured as needed and is generally set to 1 minute. The goal is to keep the rate at which data are fetched from each Kafka partition broadly consistent and to avoid large gaps in instantaneous consumption rates.
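As a rough illustration of step (1.2), the sketch below buckets records into fixed windows and sorts each window by time. It is a plain-Python stand-in, not Spark Streaming code; the record layout (`ts` as a datetime) and the reading of "reverse order" as newest-first are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone


def sort_windows(records, window_seconds=60, newest_first=True):
    """Bucket records into fixed-length windows and sort each window by time."""
    buckets = defaultdict(list)
    for record in records:
        epoch = int(record["ts"].timestamp())
        # Align each record to the start of the window it falls in.
        window_start = epoch // window_seconds * window_seconds
        buckets[window_start].append(record)
    # Windows are returned oldest first; records inside follow `newest_first`.
    return [
        sorted(buckets[start], key=lambda r: r["ts"], reverse=newest_first)
        for start in sorted(buckets)
    ]
```

The default of 60 seconds mirrors the 1-minute window mentioned in the text.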
(1.3) Store the sorted data in data cache queues according to data category. In this embodiment the data categories are mainly elevator real-time status data and elevator event-triggered data; that is, a separate data cache queue is set up for each of these two categories, and data of each category are stored in their own queue.
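The routing in step (1.3) can be sketched as follows. The category names, the `category` field and the use of in-process `deque` objects in place of the cache queues are hypothetical choices for illustration only.

```python
from collections import deque

# Hypothetical category names for the two data classes named in the text.
CATEGORIES = ("realtime_status", "triggered")


def route_to_queues(records, queues=None):
    """Append each record to the queue that matches its category."""
    if queues is None:
        queues = {name: deque() for name in CATEGORIES}
    for record in records:
        # An unknown category raises KeyError rather than being dropped silently.
        queues[record["category"]].append(record)
    return queues
```

Because records are appended in arrival order, the ordering produced by the previous sorting step is preserved inside each queue.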
(2) Data normalization, as shown in Fig. 3.
(2.1) Read the Parquet configuration file; proceed to the next step if the read succeeds, otherwise stop. The configuration file mainly contains two parts, message and schemas. The message part mainly describes the Parquet file storage format. The schemas part mainly describes the parameters of the data connector, the data converter and the file storage rules, which correspond to the three modules connect, transverter and files respectively. The file format is shown in Fig. 4.
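Since Fig. 4 is not reproduced here, the following sketch only guesses at the shape of such a configuration. The part names `message`, `schemas`, `connect`, `transverter` and `files`, the `router.*` keys and `JsonToParquetTransverter.class` come from the text; the field names, server address, topic, units and the placement of the `router.*` keys under `files` are assumptions.

```python
# Hypothetical configuration; every value and the key nesting are illustrative.
CONFIG = {
    "message": (
        "message elevator_status {\n"
        "  required binary elevator_id (UTF8);\n"
        "  required int64 ts;\n"
        "  optional double speed;\n"
        "}"
    ),
    "schemas": {
        "connect": {"bootstrap.servers": "localhost:9092", "topic": "elevator-status"},
        "transverter": {"class": "JsonToParquetTransverter.class"},
        "files": {
            "router.maxFileSize": 134217728,  # assumed to be bytes
            "router.timeout": 60,             # assumed to be seconds
            "router.rules": "day",
        },
    },
}

REQUIRED_SCHEMA_MODULES = ("connect", "transverter", "files")


def validate_config(config):
    """Return the config if both parts and all three modules are present."""
    missing = [key for key in ("message", "schemas") if key not in config]
    schemas = config.get("schemas", {})
    missing += [f"schemas.{name}" for name in REQUIRED_SCHEMA_MODULES if name not in schemas]
    if missing:
        # Mirrors "otherwise stop": the caller aborts on a bad configuration.
        raise ValueError("missing configuration parts: " + ", ".join(missing))
    return config
```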
(2.2) Read the data in the data cache queues cyclically in batches. According to the connector parameters in the Parquet configuration file (Kafka is configured in connect), raw data are read from Kafka. If there are no data, the flow ends; otherwise the data read are cached in memory. During this process, if the cached data size reaches the threshold (router.maxFileSize), the flow proceeds to the next step; if the threshold is still not reached, the flow waits for a period (router.timeout) and then proceeds to the next step.
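The size-or-timeout decision in step (2.2) can be sketched with a small buffer class. This is an assumed design rather than the author's code: `max_bytes` and `timeout_s` stand in for router.maxFileSize and router.timeout, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time


class BatchBuffer:
    """In-memory buffer that is ready to flush on size or on timeout."""

    def __init__(self, max_bytes, timeout_s, clock=time.monotonic):
        self.max_bytes = max_bytes
        self.timeout_s = timeout_s
        self.clock = clock
        self._items = []
        self._size = 0
        self._started = None  # time at which the first item was buffered

    def add(self, payload: bytes):
        if self._started is None:
            self._started = self.clock()
        self._items.append(payload)
        self._size += len(payload)

    def should_flush(self):
        if not self._items:
            return False  # nothing buffered, so there is nothing to write
        if self._size >= self.max_bytes:
            return True   # size threshold reached
        return self.clock() - self._started >= self.timeout_s

    def flush(self):
        """Return the buffered payloads and reset the buffer."""
        items, self._items, self._size, self._started = self._items, [], 0, None
        return items
```

A reader loop would call `add` for each message, check `should_flush` after each read, and pass the result of `flush` to the file-generation step.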
(2.3) Generate Parquet files from the data according to the Parquet configuration file. The converter-related configuration is read from the configuration file; its first item configures the conversion class, which is mainly responsible for converting between data formats. In the example, JsonToParquetTransverter.class converts JSON data to Parquet files. The conversion also reads the message part of the configuration file, which specifies the exact format of the Parquet files. Files are named according to the file partitioning rule (router.rules) and the time range of the data they contain. The file naming rule is yyyy-mm-dd hh:mm:ss~yyyy-mm-dd hh:mm:ss.par, where the first time is the earliest data time and the second time is the latest data time.
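The naming rule can be expressed in a few lines of Python. The function name and its input (an iterable of datetimes taken from the batch) are illustrative assumptions; only the name pattern itself comes from the text.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parquet_file_name(timestamps):
    """Build '<earliest>~<latest>.par' from the data times in one batch."""
    timestamps = list(timestamps)
    earliest, latest = min(timestamps), max(timestamps)
    return f"{earliest.strftime(TIME_FORMAT)}~{latest.strftime(TIME_FORMAT)}.par"
```

For a batch spanning 2018-01-01 00:00:00 to 2018-01-01 00:09:59, this yields `2018-01-01 00:00:00~2018-01-01 00:09:59.par`.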
(2.4) Store the files in a temporary directory.
(3) Create the file directory.
The file management module manages folders in the HDFS file system and uploads the Parquet files to the corresponding folders; the flow is shown in Fig. 5.
(3.1) Scan the temporary file directory, extract the date from the start time in each file name, and check whether a folder matching that date (yyyy-mm-dd) exists; if so, proceed to step (3.2);
(3.2) Determine the file storage directory;
(3.3) Create the file directory in HDFS; the directory is named by date (yyyy-mm-dd), for example /hdfs/bigdata/2018-01-01;
(3.4) Upload the files to the corresponding directory;
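Steps (3.1) to (3.4) can be sketched as a planning function that derives the target directory from each file name. No HDFS client is called here; the function names, the `existing_dirs` argument and the default root (taken from the example path in the text) are assumptions.

```python
import posixpath
from datetime import datetime


def hdfs_target_dir(file_name, root="/hdfs/bigdata"):
    """Derive the date-named directory from the start time in the file name."""
    start_time = file_name.split("~", 1)[0]
    day = start_time.split(" ", 1)[0]
    datetime.strptime(day, "%Y-%m-%d")  # raises ValueError on a malformed date
    return posixpath.join(root, day)


def plan_uploads(file_names, existing_dirs, root="/hdfs/bigdata"):
    """Return (directories to create, list of (file name, destination path))."""
    known = set(existing_dirs)
    to_create, uploads = [], []
    for name in sorted(file_names):
        target = hdfs_target_dir(name, root)
        if target not in known:
            known.add(target)
            to_create.append(target)
        uploads.append((name, posixpath.join(target, name)))
    return to_create, uploads
```

An actual implementation would then create each directory in `to_create` and upload each file with whichever HDFS client the deployment uses.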
(4) Create the index.
The index management module builds an index over the time range of the data in each Parquet file, which narrows the range of data to be traversed and speeds up data queries; the flow is shown in Fig. 6.
(4.1) Scan the temporary file directory;
(4.2) Generate index records from the file names;
(4.3) Call the ElasticSearch server to create the index.
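The index record of step (4.2) can be sketched as a small document built from the file name, plus a lookup that shows how such records narrow a query to the relevant files. The field names `path`, `start_time` and `end_time` are hypothetical, and no ElasticSearch call is made; the dict is only the kind of document that could be sent to the server in step (4.3).

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def index_record(file_name, directory):
    """Parse '<start>~<end>.par' into a time-range index document."""
    stem = file_name[: -len(".par")] if file_name.endswith(".par") else file_name
    start_text, end_text = stem.split("~", 1)
    start = datetime.strptime(start_text, TIME_FORMAT)
    end = datetime.strptime(end_text, TIME_FORMAT)
    return {
        "path": f"{directory}/{file_name}",
        "start_time": start.isoformat(),
        "end_time": end.isoformat(),
    }


def files_overlapping(records, query_start, query_end):
    """Return paths of files whose time range intersects the query range."""
    hits = []
    for record in records:
        start = datetime.fromisoformat(record["start_time"])
        end = datetime.fromisoformat(record["end_time"])
        if start <= query_end and end >= query_start:
            hits.append(record["path"])
    return hits
```

Only the files returned by the lookup need to be opened, which is the reduction in traversal range described above.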

Claims (5)

1. An HDFS-based columnar file storage system for elevator data, characterized in that it comprises a storage system layer, a file processing system layer and a data source layer, wherein the data source layer includes elevator real-time status data and elevator event-triggered data; the elevator real-time status data are reported once per second, and the elevator event-triggered data are reported when a change occurs in the elevator; the file processing layer includes a data loading module, a data normalization module, a file management module and an index management module; the data loading module is connected to the data source layer and loads and sorts the data therein before placing them in a cache; the data normalization module generates Parquet files from the data in the distributed cache database according to the file partitioning rule and the file format rule; the storage system layer includes a distributed file system and a full-text index system; the file management module manages the folders and Parquet files in the distributed file system; and the index management module manages the full-text index system according to the Parquet file directory.
2. The HDFS-based columnar file storage system for elevator data according to claim 1, characterized in that the file partitioning rule partitions first by day and then, within each day partition, by file size.
3. The HDFS-based columnar file storage system for elevator data according to claim 1, characterized in that the file format rule is constrained by the message part of the Parquet configuration file.
4. An HDFS-based columnar file storage method for elevator data, characterized in that the method uses the system according to claim 1 and includes the following steps:
(1) Data loading;
(1.1) Use the stream computing engine Spark Streaming to obtain elevator real-time data and elevator event-triggered data from the Kafka message bus;
(1.2) Within the Spark Streaming window, sort the data in reverse order along the time dimension;
(1.3) Store the sorted data in data cache queues according to data category;
(2) Data normalization;
(2.1) Read the Parquet configuration file;
(2.2) Read the data in the data cache queues cyclically in batches;
(2.3) Generate Parquet files from the data according to the Parquet configuration file;
(2.4) Store the files in a temporary directory;
(3) File management;
(3.1) Scan the temporary file directory;
(3.2) Determine the file storage directory;
(3.3) Create the file directory in HDFS;
(3.4) Upload the files to the corresponding directory;
(4) Index management;
(4.1) Scan the temporary file directory;
(4.2) Generate index records from the file names;
(4.3) Call the ElasticSearch server to create the index.
5. The HDFS-based columnar file storage method for elevator data according to claim 2, characterized in that file directories are created in chronological order, and the file naming rule is yyyy-mm-dd hh:mm:ss~yyyy-mm-dd hh:mm:ss.par, where the first time is the earliest data time and the second time is the latest data time.
CN201711465597.7A 2017-12-28 2017-12-28 HDFS-based columnar file storage system and method for elevator data Pending CN108319652A (en)

Publications (1)

Publication Number Publication Date
CN108319652A (en) 2018-07-24

Family

ID=62893227

Country Status (1)

Country Link
CN (1) CN108319652A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180724