CN105069703B - A method for managing massive power grid data

Info

Publication number: CN105069703B
Application number: CN201510487734.1A
Authority: CN
Inventors: 刘志刚; 魏晓光; 陈剑飞; 刘小宝; 戴昭
Original assignee: State Grid Corp of China SGCC; Jinan Power Supply Co of State Grid Shandong Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; Jinan Power Supply Co of State Grid Shandong Electric Power Co Ltd
Priority date: 2015-08-10
Filing date: 2015-08-10
Publication date: 2018-08-28
Anticipated expiration: 2035-08-10
Also published as: CN105069703A

Abstract

The present invention provides a method for managing massive power grid data. The method includes: building a power grid user data management system, integrating the data collected by each power grid subsystem, and mining and analyzing power grid user data with a parallel computing framework; and, based on the data management system, performing parallel load prediction with a distributed load prediction algorithm. The method fuses and integrates the data of the various power grid user systems and migrates traditional data computation methods to a distributed platform, meeting the processing requirements of massive data.

Description

A method for managing massive power grid data

Technical field

The present invention relates to smart grids, and in particular to a method for managing massive power grid data.

Background art

Collecting, transmitting, and storing real-time power grid user data, together with rapid analysis and processing of the accumulated massive multi-source historical data, can effectively improve demand-side management; managing user data supports the safe, robust, and reliable operation of the smart grid. As the number of sensors and smart devices keeps increasing, the data these devices acquire and transmit are also growing exponentially. The data include not only the electricity consumption collected by smart meters but also temperature, weather, humidity, geographic information, wind speed, and other information that various sensors collect at fixed frequencies. User data are becoming more complex.

China's power generation and transmission technology differs little from that abroad, but there are large differences on the consumption side, especially at the user end. Because a suitable market mechanism has not yet formed, the conditions for deploying smart power consumption technology in China are not yet mature, and it is difficult to support effective integration of smart distribution systems with user management systems. Overall, massive data management for power grid users faces the following challenges. The rapid development of smart meters and Internet of Things technology produces massive data in widely varying formats, and data definitions differ between organizations, which makes processing and integration difficult. For massive data, how to build a model that represents the data in a standard form, and how to integrate data based on that model, are problems in urgent need of a solution. Because data are collected in many ways and communication channels vary in quality, the data received are of poor quality and the ability to govern them is insufficient; knowledge mined from such poor data is therefore unreliable and cannot support accurate decisions. This has caused adverse effects worldwide and seriously troubles the information society. Data types are complex, and traditional relational databases and file storage formats can no longer meet the demands of rapidly growing massive data.

Summary of the invention

To solve the above problems of the prior art, the present invention proposes a method for managing massive power grid data, comprising:

Building a power grid user data management system, integrating the data collected by each power grid subsystem, and mining and analyzing power grid user data with a parallel computing framework; and, based on the data management system, performing parallel load prediction with a distributed load prediction algorithm.

Preferably, the architecture of the power grid user data management system is divided into an application layer, a data analysis computation layer, and a data management layer. The power grid user data management system is built with Hadoop; a data storage system is established on the platform using HDFS and HBase; and the MapReduce parallel computing framework and the Storm in-memory parallel computing framework are built on the platform as the massive data computation and analysis system, which analyzes the massive data of power grid users. The data management layer collects and integrates data. Data collection covers data acquired from smart meters, data acquisition and monitoring systems, and various sensors; integrating these data includes migrating them to cluster servers for management. During data integration, a data transfer tool extracts and integrates the data: the data and historical data generated by each independent system are extracted and integrated into HBase with the data transfer tool, a Java persistence tool is used to operate on the column-oriented database, and the online data generated by applications based on distributed computing are written to HBase. The data analysis computation layer is used for storage and computational analysis of massive data: electrical load data and related data are stored in HBase; the parallel computing module MapReduce performs parallel batch computation and analysis on massive data; and the memory-based parallel computing module Storm is used for data-intensive iterative computation, reading the data a service needs into memory so that they can be queried directly from memory when needed.

Preferably, performing parallel load prediction with a distributed load prediction algorithm, based on the data management system, further comprises:

The training process of the algorithm is executed by three MapReduce job classes, the output of each MapReduce job serving as the input of the next. The decision model obtained after training is stored in the Hadoop distributed cluster. Training is divided into three parts: generating the data dictionary; generating the decision trees; and forming the decision tree set;

Generating the data dictionary includes describing the sample data to be trained: a file is generated that describes the condition attributes and the decision attribute of the samples, recording the types of the condition attribute values, the position of the decision attribute, and whether the model to be created performs classification or regression. This process is completed by the first MapReduce job; each Map process reads a part of the experimental data and records the attribute types of the data and the load value or class label. The generated description file is stored as key/value pairs in HDFS, the Hadoop file system;

Generating the decision trees includes the following parallel procedure:

1) K sample data sets TS_1, TS_2, ..., TS_K, each the same size as the original sample data set, are drawn from the original data set at random with replacement; each sample data set is the training set of one decision tree, the sample data sets differ from one another, and each is the same size as the original data set;

2) The number m of attributes randomly selected at each node is determined from the number M of attributes in the sample data, where m << M: m is the square root of M in the classification model and M/3 in the regression model. The information content of each of the m attributes is computed, and the best attribute is selected for branching;

3) Nodes are built recursively to generate a decision tree; the K decision trees are generated in parallel, with one Map generating one decision tree. This process is completed by the second MapReduce job;

Forming the decision tree set includes combining the individual decision tree classifiers. Each decision tree produces one result: if the decision tree set is used for classification, its final result is chosen by vote; when it is used for regression prediction, the K trees give K values and the final value is the average over the trees. This process is completed by the third MapReduce job.

Preferably, in the deployment architecture of the HBase system, a control center acts as the manager of the entire distributed real-time database and stores metadata, including key information such as the division of labor among nodes, node state, data partitioning mode, data block locations, task scheduling, and security management. The control centers keep their metadata consistent with one another through a synchronization mechanism. The data analysis computation layers are logical peers: each deploys the same processes and performs the same logical operations, and a transaction-based redundant backup mechanism is used between them. The power grid user data management system uses HDFS as the distributed file system for underlying storage and builds a time-series control component for massive power grid data to store the time-series data of the power grid business. The time-series control component builds a time-series data model, uniformly receives and stores the collected time-series data according to this dedicated model, and provides a unified query interface to external callers;

In terms of storage mode, data are stored as key-value pairs, that is, in column-oriented form, with the column family as the basic unit of storage and permission control. Empty columns occupy no space in actual storage, following a sparse-table design. In deploying the data architecture, the traditional C/S pattern of multiple clients and a single server is abandoned in favor of a distributed multi-server cluster, in which all data are spread over multiple computers in the cluster according to the replication factor. The bottom layer of the time-series control component relies on the column-oriented database; processing time-series data is abstracted as the basic operations of reading, writing, adding, deleting, and modifying data in the HBase database. The top layer of the software consists of the clients of the time-series control component and third-party application clients. All clients perform their operations through the Java API; every API call is decomposed, through type parsing and function parsing modules, into one database operation or an ordered sequence of database operations. These sets of database operations are invoked through RPC inside the control component, and the data operations are finally completed uniformly through the asynchronous HBase operation API.

Compared with the prior art, the present invention has the following advantages:

The present invention proposes a method for managing massive power grid data that fuses and integrates the data of the various power grid user systems and migrates traditional data computation methods to a distributed platform, meeting the processing requirements of massive data.

Detailed description

The following is a detailed description of one or more embodiments of the invention. The invention is described in connection with these embodiments, but it is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention covers many alternatives, modifications, and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of them.

One aspect of the invention provides a method for processing massive power grid user data. A basic management system for massive data is built on a Hadoop cluster; the data collected by each power grid subsystem are integrated into massive data storage; and a parallel computing framework is used for rapid mining and analysis of the massive power grid user data. Taking electrical load prediction as an example application, traditional load prediction is migrated to a distributed computing platform, and parallel load prediction is realized with a load prediction algorithm based on decision trees. Based on the practical needs of massive power grid user data analysis, the invention builds a power grid user data management system centered on analysis and computation, whose basic framework is divided into an application layer, a data analysis computation layer, and a data management layer.

The framework builds the power grid user data management system with Hadoop. A massive data storage system is established on the platform using HDFS and HBase, and the MapReduce parallel computing framework and the Storm in-memory parallel computing framework are built on the platform as the massive data computation and analysis system, which analyzes the massive data of power grid users.

The data management layer collects and integrates data. Data collection covers data acquired from smart meters, data acquisition and monitoring systems, and various sensors. These include not only data internal to the power grid but also a large amount of related data. The data are produced by equipment from different vendors, in widely varying formats and with data definitions that differ between organizations; they form a massive data stream that is difficult to process and integrate. Integrating these data means migrating the data generated by legacy systems to cluster servers, where they are managed efficiently.

To address this data integration difficulty, the platform uses a data transfer tool to extract and integrate the data: the data and historical data generated by each independent system are extracted and integrated into HBase. A Java persistence tool is used to operate on the column-oriented database, and the online data generated by applications based on distributed computing are written to HBase.

The data analysis computation layer provides storage and computational analysis of massive data. The distributed computing layer is built on Hadoop; massive data are stored in the distributed file system HDFS and managed with HBase.

The platform stores electrical load data and related data in HBase. HBase is a column-oriented database, which makes it convenient to query entire columns. The prediction algorithm used needs to read entire columns repeatedly during learning, so the storage characteristics of HBase match the algorithm's data access requirements.

The parallel computing module MapReduce performs parallel batch computation and analysis on massive data, and the memory-based parallel computing module Storm is used for data-intensive iterative computation. Storm provides an in-memory parallel computing framework: the framework reads the data a service needs into memory, and the data are queried directly from memory when required. This is faster than the disk-based data access of MapReduce, shortens the running time of the service, and reduces I/O operations.

Load prediction is a key link in power grid planning and an important computational basis for substation and grid-structure planning. High-precision short-term load prediction can effectively reduce generation costs and plays a key role. The invention uses an improved ensemble learning method with decision trees as the base learners, comprising multiple decision trees trained by the random subspace method. A sample to be classified is input, each decision tree produces a classification result, and the final classification is chosen by vote among the trees' results. This overcomes some shortcomings of a single decision tree, offers good scalability and parallelism, can effectively address the fast processing of massive data, and has good application prospects for electrical load prediction in massive-data environments.

The entire load prediction process executes the training of the algorithm with three MapReduce job classes, the output of each MapReduce job serving as the input of the next. The decision model obtained after training is stored in the Hadoop distributed cluster. Training is divided into three parts: generating the data dictionary; generating the decision trees; and forming the decision tree set. Generating the data dictionary means describing the sample data to be trained: a file is generated that describes the condition attributes and the decision attribute of the samples, recording the types of the condition attribute values, the position of the decision attribute, and whether the model to be created performs classification or regression. This process is completed by the first MapReduce job; each Map process reads a part of the experimental data and records the attribute types of the data and the load value or class label. The generated description file is stored as key/value pairs in HDFS, the Hadoop file system, for use by the subsequent MapReduce jobs.
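The data dictionary step can be illustrated with a minimal single-process sketch. It only imitates the map and reduce roles in plain Python and does not use Hadoop; the function names, the two type labels ("numeric", "categorical"), and the dictionary layout are assumptions made for illustration and are not specified in the description.

```python
from collections import defaultdict


def map_describe(rows, decision_index):
    """Map role: for one split of the data, emit (column index, observed type)."""
    for row in rows:
        for i, value in enumerate(row):
            if i == decision_index:
                continue  # the decision attribute is recorded separately
            try:
                float(value)
                yield i, "numeric"
            except ValueError:
                yield i, "categorical"


def reduce_describe(pairs):
    """Reduce role: a column is numeric only if every observed value was numeric."""
    seen = defaultdict(set)
    for column, kind in pairs:
        seen[column].add(kind)
    return {c: ("numeric" if kinds == {"numeric"} else "categorical")
            for c, kinds in seen.items()}


def build_dictionary(splits, decision_index, task):
    """Combine the per-split map output into one key/value description.

    `task` is "classification" or "regression" (hypothetical labels).
    """
    pairs = [p for split in splits for p in map_describe(split, decision_index)]
    return {
        "attributes": reduce_describe(pairs),
        "decision_index": decision_index,
        "task": task,
    }


# Two splits, as if read by two Map processes; the last column is the load value.
splits = [[["21.5", "sunny", "310.0"]], [["19.0", "rain", "295.5"]]]
dictionary = build_dictionary(splits, decision_index=2, task="regression")
```

In a real deployment the resulting key/value description would be written to HDFS so that the later jobs can read it; that step is omitted here.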

Generating the decision trees is the core of the entire parallel algorithm. Its parallelism lies in the following aspects: 1) K sample data sets TS_1, TS_2, ..., TS_K, each the same size as the original sample data set, are drawn from the original data set at random with replacement. Because the draws are made with replacement, the original data set can be sampled in parallel without the draws affecting one another. One TS is the training set of one decision tree; each TS is different and the same size as the original data set, which both ensures that the decision trees differ and avoids losing the knowledge scale of the original data set.
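Drawing the K training sets can be sketched as follows. This is a minimal illustration in plain Python, assuming that each sample is drawn with its own random generator so that the K draws are independent and could run in parallel; the function names and the seeding scheme are hypothetical.

```python
import random


def bootstrap_sample(dataset, index, seed=0):
    """Draw one training set TS_index: len(dataset) records, with replacement.

    Each sample uses its own generator (seed + index), so samples do not
    depend on each other and can be drawn by separate workers.
    """
    rng = random.Random(seed + index)
    n = len(dataset)
    return [dataset[rng.randrange(n)] for _ in range(n)]


def bootstrap_samples(dataset, k, seed=0):
    """Draw K training sets, one per decision tree."""
    return [bootstrap_sample(dataset, i, seed) for i in range(k)]


records = list(range(10))           # stand-in for the original data set
training_sets = bootstrap_samples(records, k=4)
```

The original data set is only read, never modified, which is what allows the draws to proceed in parallel.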

2) The number m of attributes randomly selected at each node is determined from the number M of attributes in the sample data (m << M): m is the square root of M in the classification model and M/3 in the regression model. The information content of each of the m attributes is computed, and the best attribute is selected for branching;
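The attribute selection in step 2) can be sketched as below. The description does not define "information content"; this sketch assumes it means entropy-based information gain over categorical attributes, and rounds m down to an integer of at least 1. Both choices, and all function names, are assumptions for illustration.

```python
import math
import random
from collections import Counter


def attributes_per_node(total, task):
    """m = sqrt(M) for classification, M/3 for regression (rounded down, min 1)."""
    if task == "classification":
        return max(1, int(math.sqrt(total)))
    return max(1, total // 3)


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def choose_split(rows, labels, task="classification", rng=random):
    """Pick m random candidate attributes and return the one with the best gain."""
    total = len(rows[0])
    candidates = rng.sample(range(total), attributes_per_node(total, task))
    return max(candidates, key=lambda a: information_gain(rows, labels, a))


rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["high", "high", "low", "low"]
best = choose_split(rows, labels, rng=random.Random(1))
```

Because only m of the M attributes are examined at each node, different trees tend to branch on different attributes even when their training sets overlap.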

3) Nodes are built recursively to generate a decision tree. The K decision trees are generated in parallel, with one Map generating one decision tree, which realizes the parallelism of the algorithm. This process is completed by the second MapReduce job, which has only a Map phase and no Reduce phase.

Forming the decision tree set means combining the individual decision tree classifiers. Each decision tree produces one result: if the decision tree set is used for classification, its final result is chosen by vote; when it is used for regression prediction, the K trees give K values and the final value is the average over the trees. This process is completed by the third MapReduce job.
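The combination rule can be sketched in a few lines. This is a minimal illustration in plain Python; the function name and the task labels are hypothetical, and the tie-breaking behavior (the label seen first among equally frequent ones wins) is a property of this sketch, not something the description specifies.

```python
from collections import Counter
from statistics import mean


def combine(predictions, task):
    """Combine the K per-tree results into one final result.

    Classification: majority vote. Regression: average of the K values.
    """
    if task == "classification":
        # most_common keeps first-seen order among ties
        return Counter(predictions).most_common(1)[0][0]
    return mean(predictions)


label = combine(["peak", "valley", "peak"], "classification")
load = combine([300.0, 310.0, 320.0], "regression")
```

Here `label` is the class named by most trees and `load` is the mean of the three per-tree load values.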

The entire model is built on the Hadoop distributed cluster, which stores the massive data in distributed form. MapReduce parallelizes the algorithm, so that the algorithm can rely on the storage and computing capacity of the Hadoop cluster to mine the data and compute predictions. The whole process executes in parallel, which can effectively improve prediction accuracy and the ability of the load prediction system to process massive data.

In the deployment architecture of the above HBase system, a control center acts as the manager of the entire distributed real-time database and stores metadata, including key information such as the division of labor among nodes, node state, data partitioning mode, data block locations, task scheduling, and security management. Two control centers are generally deployed (more are also possible); they keep their metadata consistent with one another through a synchronization mechanism. This eliminates the risk that a single point of failure at the control center causes the loss of overall system function, and it also lays the foundation for load balancing of concurrent requests. The data analysis computation layer stores the shards of the massive data and also performs the various computations; the number of data analysis computation layers is limited only by hard constraints such as Ethernet bandwidth and the physical conditions of the machine room. The data analysis computation layers are logical peers: each deploys the same processes and performs the same logical operations and, following the control center's data partitioning rules, stores only the data belonging to its own partition, which achieves distributed storage. Because node faults and failures occur frequently in a distributed architecture, a transaction-based redundant backup mechanism is used between data analysis computation layers: the same transaction is synchronized to one or several other data analysis computation layers (depending on the configured replication factor), which provides high data reliability and lays the foundation for load balancing of data access.
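How a replication factor spreads one partition over several peer nodes can be sketched as follows. The description does not say how the control center chooses the nodes; this sketch assumes a simple deterministic policy (hash the partition key to a starting node, then take the following nodes in order). The function name, node names, and policy are illustrative assumptions only.

```python
import hashlib


def place_replicas(partition_key, nodes, replication_factor):
    """Return the nodes that hold one partition: a primary plus its backups.

    Assumed policy: hash the key to pick a starting node, then use the next
    nodes in list order until `replication_factor` distinct nodes are chosen.
    """
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:4], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


nodes = ["node-1", "node-2", "node-3", "node-4"]   # hypothetical peer nodes
holders = place_replicas("meter-0001", nodes, replication_factor=2)
```

With such a mapping, a write for a partition is applied as one transaction on every node in `holders`, and reads can be served by any of them, which is the basis for the load balancing mentioned above.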

The power grid user data management system uses HDFS as the distributed file system for underlying storage. On this basis, a time-series control component for massive power grid data is built to store the time-series data of the power grid business. The time-series control component builds a time-series data model, uniformly receives and stores the collected time-series data according to this dedicated model, and provides a unified query interface to external callers.

In terms of the specific storage mode, unlike the row-and-column table structure of a traditional relational database, data are stored as key-value pairs, that is, in column-oriented form, with the column family as the basic unit of storage and permission control. Empty columns occupy no space in actual storage, following a sparse-table design. This solves the space waste caused by differing sampling periods. In deploying the data architecture, the traditional C/S pattern of multiple clients and a single server is also abandoned. A distributed multi-server cluster is used instead: all data are spread over multiple computers in the cluster according to the replication factor, which strengthens the storage security of the data and improves query efficiency.
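The sparse-table idea can be illustrated with a small in-memory stand-in. This is not the HBase API; it is a hypothetical Python class that keeps one entry per written cell, keyed by row key, column family, and column qualifier, so that a column that was never written for a given row occupies nothing.

```python
class SparseTable:
    """In-memory stand-in for a sparse, column-family-oriented key-value table."""

    def __init__(self):
        # (row key, column family, qualifier) -> value; only written cells exist
        self._cells = {}

    def put(self, row, family, qualifier, value):
        self._cells[(row, family, qualifier)] = value

    def get(self, row, family, qualifier, default=None):
        return self._cells.get((row, family, qualifier), default)

    def row(self, row):
        """All written cells of one row, as {(family, qualifier): value}."""
        return {(f, q): v for (r, f, q), v in self._cells.items() if r == row}

    def cell_count(self):
        return len(self._cells)


table = SparseTable()
# meter-1 reports hourly, meter-2 every 15 minutes: different sampling periods
table.put("meter-1", "load", "00:00", 3.2)
table.put("meter-2", "load", "00:00", 1.0)
table.put("meter-2", "load", "00:15", 1.1)
```

A fixed-schema relational table would need a "00:15" column for meter-1 as well, holding an empty value; here only the three written cells are stored.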

The bottom layer of the time-series control component relies on the column-oriented database. Processing time-series data can be abstracted as the basic operations of reading, writing, adding, deleting, and modifying data in the HBase database. The top layer of the software consists of the clients of the time-series control component and third-party application clients. All clients perform their operations through the Java API. Through type parsing and function parsing modules, every API call can be decomposed into one database operation or an ordered sequence of database operations. These sets of database operations are invoked through RPC inside the control component, and the data operations are finally completed uniformly through the asynchronous HBase operation API.

A time-series data record consists of four fields: measurement object, timestamp, measured value, and tags. The tags consist of one or more key/value pairs that further describe the measurement object; a measurement object combined with its tags forms a measurement item. The tag design makes it easy for users to query the values of the measurement items they care about. The control component stores data in a storage layer, and the storage layer is a distributed file storage system with a key/value structure. Storing time-series data efficiently in the distributed storage layer, so that tens of billions of data points can be stored with minimal memory and disk space, is the key problem that a good storage structure design must solve. For this purpose, the design of the tables in the columnar database HBase, on which the distributed real-time database management layer relies, should follow these principles. The fixed-length row key used by the time-series control component should contain as much retrieval information as possible. The stored data generally contain a large number of measurement objects and tags, and these fields are variable-length; therefore an ID table is set up to store this information, each entry is given a globally unique number, and the number is combined with the timestamp to form the row key. Each row should store as much information as possible; for example, the data collected in a distributed manner over a certain period are merged and submitted as one row. This scheme reduces the number of row keys in the whole table and thus speeds up row retrieval. Data are stored as time progresses, and a stateless storage scheme is used to provide system survivability.

The keys and values of each measurement object and tag are numbered by hash mapping. To improve query efficiency, this mapping information is stored in the ID table in two copies: one maps measurement objects, tag keys, and tag values to their hash numbers; the other maps hash numbers back to measurement objects, tag keys, and tag values. All hash numbers have a fixed length of 3 bytes. The time-series data of measurement objects are stored in another table, whose row key is: measurement object ID + base time + tag key ID + tag value ID. The base time field is the on-the-hour time corresponding to the time-series record to be stored; the base time is 4 bytes and every other field is 3 bytes. The time-series data within one hour are stored in one row of the table: a record is stored in its row under the column corresponding to its offset Δt from the base time, where Δt = record timestamp - base time. When a row is full, storage continues in a new next row.
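The row key layout can be sketched as follows. The byte widths (3-byte IDs, 4-byte base time) and the one-hour row span come from the description; everything else is an assumption for illustration: timestamps are taken to be Unix seconds, IDs are assigned sequentially in place of the hash numbering, tag pairs are sorted so that the same tag set always yields the same key, and all names are hypothetical.

```python
import struct

ID_WIDTH = 3            # bytes per measurement-object / tag-key / tag-value ID
SECONDS_PER_ROW = 3600  # one row holds one hour of data


class IdTable:
    """Two-way mapping between names and fixed-width numeric IDs."""

    def __init__(self):
        self.name_to_id = {}
        self.id_to_name = {}

    def id_for(self, name):
        # Sequential numbering stands in for the hash numbering (assumption).
        if name not in self.name_to_id:
            new_id = len(self.name_to_id) + 1
            self.name_to_id[name] = new_id
            self.id_to_name[new_id] = name
        return self.name_to_id[name]


def row_key(ids, measurement, timestamp, tags):
    """Return (row key bytes, column offset in seconds) for one record."""
    base_time = timestamp - timestamp % SECONDS_PER_ROW   # on-the-hour time
    key = ids.id_for(measurement).to_bytes(ID_WIDTH, "big")
    key += struct.pack(">I", base_time)                   # 4-byte base time
    for tag_key, tag_value in sorted(tags.items()):
        key += ids.id_for(tag_key).to_bytes(ID_WIDTH, "big")
        key += ids.id_for(tag_value).to_bytes(ID_WIDTH, "big")
    return key, timestamp - base_time                     # delta t


ids = IdTable()
key, delta_t = row_key(ids, "active_power", 1440000125, {"feeder": "F12"})
```

All records of the same measurement item within the same hour share one row key and differ only in the column offset, so one hour of data lands in one row, as the paragraph above describes.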

Obviously, those skilled in the art should understand that each module or step of the invention described above can be implemented with a general-purpose computing system. The modules or steps can be concentrated in a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by a computing system. The invention is thus not limited to any specific combination of hardware and software.

It should be understood that the above specific embodiments of the invention serve only to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent replacement, or improvement made without departing from the spirit and scope of the invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all variations and modifications that fall within the scope and boundaries of the claims, or the equivalents of such scope and boundaries.

Claims

1. A method for managing massive power grid data, characterized by comprising:

building a power grid user data management system, integrating the data collected by each power grid subsystem, and mining and analyzing power grid user data with a parallel computing framework; and, based on the data management system, performing parallel load prediction with a distributed load prediction algorithm;

wherein the architecture of the power grid user data management system is divided into an application layer, a data analysis computation layer, and a data management layer; the power grid user data management system is built with Hadoop, a data storage system is established on the platform using HDFS and HBase, and the MapReduce parallel computing framework and the Storm in-memory parallel computing framework are built on the platform as the massive data computation and analysis system, which analyzes the massive data of power grid users; the data management layer collects and integrates data; the data collection covers data acquired from smart meters, data acquisition and monitoring systems, and various sensors, and integrating these data includes migrating them to cluster servers for management; during data integration, a data transfer tool extracts and integrates the data, the data and historical data generated by each independent system being extracted and integrated into HBase with the data transfer tool, a Java persistence tool being used to operate on the column-oriented database, and the online data generated by applications based on distributed computing being written to HBase; the data analysis computation layer is used for storage and computational analysis of massive data; electrical load data and related data are stored in HBase; the parallel computing module MapReduce performs parallel batch computation and analysis on massive data, and the memory-based parallel computing module Storm is used for data-intensive iterative computation, reading the data a service needs into memory so that they can be queried directly from memory when needed;

wherein performing parallel load prediction with a distributed load prediction algorithm, based on the data management system, further comprises:

executing the training process of the algorithm with three MapReduce job classes, the output of each MapReduce job serving as the input of the next, the decision model obtained after training being stored in the Hadoop distributed cluster, and the training being divided into three parts: generating the data dictionary; generating the decision trees; and forming the decision tree set;

wherein generating the data dictionary includes describing the sample data to be trained, a file being generated that describes the condition attributes and the decision attribute of the samples and records the types of the condition attribute values, the position of the decision attribute, and whether the model to be created performs classification or regression; this process is completed by the first MapReduce job, each Map process reading a part of the experimental data and recording the attribute types of the data and the load value or class label; the generated description file is stored as key/value pairs in HDFS, the Hadoop file system;

2) the number m of attributes randomly selected at each node is determined from the number M of attributes in the sample data, where m << M, m being the square root of M in the classification model and M/3 in the regression model; the information content of each of the m attributes is computed, and the best attribute is selected for branching;

3) nodes are built recursively to generate a decision tree; the K decision trees are generated in parallel, with one Map generating one decision tree, this process being completed by the second MapReduce job;

wherein forming the decision tree set includes combining the individual decision tree classifiers, each decision tree producing one result; if the decision tree set is used for classification, its final result is chosen by vote; when it is used for regression prediction, the K trees give K values and the final value is the average over the trees; this process is completed by the third MapReduce job.

2. The method according to claim 1, characterized in that, in the deployment architecture of the HBase system, a control center acts as the manager of the entire distributed real-time database and stores metadata, including key information on the division of labor among nodes, node state, data partitioning mode, data block locations, task scheduling, and security management; the control centers keep their metadata consistent with one another through a synchronization mechanism; the data analysis computation layers are logical peers, each deploying the same processes and performing the same logical operations, and a transaction-based redundant backup mechanism is used between them; the power grid user data management system uses HDFS as the distributed file system for underlying storage and builds a time-series control component for massive power grid data to store the time-series data of the power grid business; the time-series control component builds a time-series data model, uniformly receives and stores the collected time-series data according to this dedicated model, and provides a unified query interface to external callers;

in terms of storage mode, data are stored as key-value pairs, that is, in column-oriented form, with the column family as the basic unit of storage and permission control; empty columns occupy no space in actual storage, following a sparse-table design; in deploying the data architecture, the traditional C/S pattern of multiple clients and a single server is abandoned in favor of a distributed multi-server cluster, in which all data are spread over multiple computers in the cluster according to the replication factor; the bottom layer of the time-series control component relies on the column-oriented database, and processing time-series data is abstracted as the basic operations of reading, writing, adding, deleting, and modifying data in the HBase database; the top layer of the software consists of the clients of the time-series control component and third-party application clients; all clients perform their operations through the Java API, every API call being decomposed, through type parsing and function parsing modules, into one database operation or an ordered sequence of database operations; these sets of database operations are invoked through RPC inside the control component, and the data operations are finally completed uniformly through the asynchronous HBase operation API.