CN108491507A

CN108491507A - A parallel continuous query method for uncertain traffic stream data based on a Hadoop distributed environment

Info

Publication number: CN108491507A
Application number: CN201810240305.8A
Authority: CN
Inventors: 徐维祥; 李灵博
Original assignee: Beijing Jiaotong University
Current assignee: Beijing Jiaotong University
Priority date: 2018-03-22
Filing date: 2018-03-22
Publication date: 2018-09-04
Anticipated expiration: 2038-03-22
Also published as: CN108491507B

Abstract

The present invention discloses a query method for uncertain high-dimensional traffic stream data based on a Hadoop distributed environment, comprising the following steps: receiving and storing stream data through the MapReduce batch-processing computing framework, and preprocessing the stream data to obtain a data set; performing parallelized adaptive DBSCAN clustering on the data set and generating query results in combination with the query requirements. The invention copes effectively with changes in data characteristics and provides efficient query results in real time.

Description

A parallel continuous query method for uncertain traffic stream data based on a Hadoop distributed environment

Technical field

The present invention relates to the field of managing massive uncertain stream data, and more particularly to a parallel continuous query method for uncertain traffic stream data based on a Hadoop distributed environment.

Background art

With the rapid development of the information society, data in every field worldwide is growing explosively. According to the 2016 China Big Data Trading White Paper, China's big data industry is expected to reach 1.3626 trillion yuan by 2020. The data market has not only grown in scale by orders of magnitude; the data itself has also acquired a series of new characteristics, including being unstructured, multi-source and heterogeneous, and dynamically evolving.

Smart transportation systems cover every aspect of transportation, and introducing big data concepts into them brings new ideas and technical substance to the development of smart railway and highway systems. The massive, complex traffic data streams accumulated by smart transportation systems come from many sources and take many forms, and their processing is data-intensive. At the same time, because of data integration, transmission delay, low equipment precision, and other complicating factors, uncertain data is widespread in traffic data streams. As smart transportation systems receive growing attention, making sound use of big data resources has become an essential task for traffic data management platforms.

Applying big data technology to the mining and analysis of massive traffic operation data provides smart transportation with methodological and data support. Stream data is the normal form of data in a traffic system. In smart transportation, traffic stream data is massive and is stored and exchanged at high rates, so collecting, storing, and retrieving it has become a key problem for vehicle remote monitoring platforms. Moreover, meeting modern traffic control needs such as traffic guidance requires fairly accurate judgment and prediction of the traffic state, which in turn requires accurate traffic stream data in real time. However, current systems are not robust enough to assess data quality on their own, so some dimensions of the traffic stream data may contain missing values and other features of an uncertain data stream. Without reliable raw data, the various traffic control needs are hard to meet, and the overall value of the smart transportation system suffers considerably.

Managing massive data on Hadoop makes full use of the scalability of MapReduce and addresses the scalability and scale problems faced in managing massive data. However, MapReduce is a batch-processing computing framework and does not adapt well to stream data, so it needs auxiliary mechanisms to handle stream data effectively.

Therefore, improving existing query algorithms in combination with big data and cloud computing, so that they integrate more adaptively into smart transportation systems and meet the requirement for parallel continuous queries over streaming traffic data that is uncertain and high-dimensional, has become a strong driver for accelerating the development of traffic systems.

Summary of the invention

The object of the present invention is to provide a query method for uncertain high-dimensional traffic stream data based on a Hadoop distributed environment. The proposed method copes effectively with changes in data characteristics and provides efficient query results in real time.

To achieve the above object, the present invention adopts the following technical solution: a query method for uncertain high-dimensional traffic stream data based on a Hadoop distributed environment, comprising the following steps:

receiving and storing stream data through the MapReduce batch-processing computing framework, and preprocessing the stream data to obtain a data set;

performing parallelized adaptive DBSCAN clustering on the data set and generating query results in combination with the query requirements.

Preferably, receiving and storing stream data through the MapReduce batch-processing computing framework and preprocessing the stream data comprises:

using a sliding-window mode so that the MapReduce batch-processing computing framework receives and stores stream data transmitted in real time in a stream data environment;

performing data item screening and principal component analysis dimensionality reduction on the received stream data to obtain dimension-reduced data;

calculating standard deviations from the dimension-reduced data and substituting them into the interval expression to obtain the data set.

Preferably, performing parallelized adaptive DBSCAN clustering on the data set and generating query results in combination with the query requirements comprises:

decomposing the data set into several data subsets;

performing adaptive DBSCAN clustering on each of the data subsets to obtain the data distribution features and data structure features of each data subset;

integrating the data distribution features and data structure features of the data subsets to obtain the data distribution features and data structure features of the whole data set;

partitioning the data according to the data structure features of the whole data set and generating query results in combination with the query requirements.

Preferably, performing adaptive DBSCAN clustering on each of the data subsets to obtain the data distribution features and data structure features of each data subset comprises:

computing the KNN distribution of each data subset and analyzing it statistically to obtain parameter setting values;

performing DBSCAN clustering on each of the data subsets according to the parameter setting values to obtain the data distribution features and data structure features of each data subset.

The beneficial effects of the present invention are as follows:

(1) A sliding-window mode is introduced to read the stream data, and the concept of "interval numbers" is introduced to rewrite the data, which to some extent compensates for the analysis error caused by uncertain data and gives the MapReduce batch-processing computing framework a means of handling stream data;

(2) The core computation obtains data structure features through adaptive DBSCAN clustering, which copes effectively with changes in data characteristics and improves query efficiency. Particularly with massive data, clustering explores the common features of the data efficiently and so allows the overall characteristics of the data to be mined rapidly.

Brief description of the drawings

Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

Fig. 1 is a flowchart of the steps of the query method of the present invention;

Fig. 2 is a flowchart of the steps of the data preprocessing part of the present invention;

Fig. 3 is a schematic diagram of the result of rewriting raw data according to the interval number representation in the present invention;

Fig. 4 is a flowchart of the steps of the core computation part of the present invention;

Fig. 5 is a flowchart of the steps of adaptive DBSCAN clustering in the present invention.

Detailed description of the embodiments

To explain the present invention more clearly, it is further described below with reference to preferred embodiments and the accompanying drawings. Similar components are denoted by the same reference numerals in the drawings. Those skilled in the art will understand that the following specific description is illustrative rather than restrictive and should not limit the scope of protection of the present invention.

In view of the streaming, high-dimensional, and uncertain nature of traffic data in the smart transportation setting, a comprehensive query method for uncertain traffic stream data is proposed. It is set in a distributed environment, is based on a density clustering algorithm, copes effectively with changes in data characteristics, and provides efficient query results in real time.

With the above objective in mind, the query method of the present invention for uncertain high-dimensional traffic stream data based on a Hadoop distributed environment is described in detail below. As shown in Fig. 1, it comprises the following steps:

Step 100: receive and store stream data through the MapReduce batch-processing computing framework, and preprocess the stream data to obtain a data set;

Step 200: perform parallelized adaptive DBSCAN clustering on the data set and generate query results in combination with the query requirements.

Fig. 2 is a flowchart of the data preprocessing part of the present invention. As shown in Fig. 2, step 100 comprises the following steps:

Step 110: using a sliding-window model, receive the stream data transmitted in real time and store the raw data temporarily in a buffer, so that the data within each unit time slice can be processed in real time;

Stream data is a sequence of data with particular characteristics and can usually be regarded as a dynamic data set that grows without bound over time. In brief, its main characteristics are: (1) data arrives in real time, rapidly and without end; (2) the arrival order of the data is independent, and characteristics such as generation rate and timing are unknown; (3) the data stream changes over time; (4) a single pass is required, since data generally cannot be retrieved and processed again once it has been handled; (5) analysis of large volumes of stream data generally only requires query results to meet an accuracy tolerance, so results are approximate. Given these characteristics, the MapReduce framework needs corresponding auxiliary work to adapt it to stream data processing.

The MapReduce framework on the Hadoop platform is a batch-processing computing framework. Batch processing can compute arbitrary queries over different data sets and is generally used for in-depth analysis of large data sets. Stream processing, by contrast, ingests a sequence of data and incrementally updates metrics, reports, and summary statistics in response to each arriving data record, which suits real-time monitoring and response functions better. Batch processing and stream processing are not incompatible, however: the two can be combined into a hybrid mode that maintains a real-time processing layer and a batch-processing layer side by side, giving a more adaptable and practical processing scheme. Introducing a sliding-window model to read or receive data promptly, combined with a smaller data window and a buffer cache, therefore reduces the algorithm's memory requirement while still allowing data to be received in time and recent data to be analyzed in depth.

The sliding window is implemented as follows: the data transmitted in real time is stored in a memory buffer, and each data block contains multiple received raw records. The sliding window reads a fixed number of data blocks at a time. As time passes, the window position moves and the data in the new window is read, so that processing and feature analysis focus on recent data.

It is worth noting that the sliding-window model has a shortcoming of its own: stale data in the window cannot be deleted completely in time, which wastes a certain amount of memory. To address this limitation, the present invention, having introduced the sliding-window model, handles expired tuples promptly through buffer conversion. This prevents expired tuples that were not deleted in time from wasting memory resources and from affecting the later clustering process and its results. These design improvements raise clustering quality and data-processing capacity while greatly reducing memory overhead.
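By way of illustration only, the block-based sliding window with prompt removal of expired data might be sketched in Python as follows. The class name `SlidingWindow`, its methods, and the choice to evict expired blocks at the moment a new block arrives are assumptions made for this sketch; the description does not specify how the buffer conversion is implemented.

```python
from collections import deque


class SlidingWindow:
    """Hypothetical window that keeps only the most recent `size` data blocks."""

    def __init__(self, size):
        self.size = size
        self.blocks = deque()

    def push(self, block):
        """Add a newly received block and return the blocks that expired."""
        self.blocks.append(block)
        expired = []
        # Evict the oldest blocks immediately so expired tuples never linger.
        while len(self.blocks) > self.size:
            expired.append(self.blocks.popleft())
        return expired

    def records(self):
        """Flatten the blocks currently in the window into one list of records."""
        return [record for block in self.blocks for record in block]


window = SlidingWindow(size=2)
window.push([1, 2])
window.push([3, 4])
expired = window.push([5, 6])  # the oldest block leaves the window
current = window.records()
```

Because eviction happens inside `push`, the window never holds more than `size` blocks, which is one simple way to avoid the memory waste described above.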

Step 120: with reference to the data characteristics accumulated from historical data, and according to the accuracy requirements of the practical problem, screen the high-dimensional raw data received in each unit time slice for the data items that most influence the principal components, and then perform a simplified principal component analysis dimensionality reduction;

With the development of smart transportation, the number of data records and data attributes in the traffic field is expanding rapidly, and high-dimensional data accounts for a considerable share. This can easily lead to quite poor performance in data analysis applications, so for a big data processing platform data dimensionality reduction is becoming increasingly important.

Under the time constraints of stream data processing, the data characteristics accumulated from historical data are exploited: principal component analysis is run repeatedly on large amounts of historical or recent traffic data of each category to identify the data items that generally have a larger influence on the principal components newly formed for each data format. Within the accuracy permitted, and for the particular requirement at hand, data item screening and principal component analysis dimensionality reduction are then carried out, guided by the preferences obtained from this analysis and calculation.

As part of data preprocessing, this step effectively reduces the time complexity of the dimensionality reduction algorithm. It not only makes subsequent processing of high-dimensional data easier but also improves the running efficiency of the algorithm.
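As a rough illustration of the dimensionality reduction in step 120, the following Python sketch extracts only the first principal component by power iteration and projects the data onto it. The function names, the sample data, and the use of power iteration are assumptions for this sketch; the description does not specify the screening rule or how the simplified principal component analysis is computed.

```python
import math


def center(rows):
    """Subtract the column mean from every value."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_component(cov, iterations=50):
    """Approximate the eigenvector of the largest eigenvalue by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Reduce each row to its coordinate along the given component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]


# Hypothetical records with three data items; the third barely varies.
rows = [[1.0, 2.0, 0.5], [2.0, 4.1, 0.4], [3.0, 6.0, 0.6], [4.0, 7.9, 0.5]]
centered = center(rows)
pc1 = leading_component(covariance(centered))
scores = project(centered, pc1)
```

The magnitudes of the entries of `pc1` indicate how strongly each data item influences the component, which is one possible basis for the data item screening mentioned above.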

Step 130: for the dimension-reduced data in the unit time slice obtained in step 120, calculate the standard deviation of each data item and substitute it into the interval number expression, thereby rewriting each data item of the data and forming newly defined data point objects that make up the data set.

According to interval number theory, the error vector corresponding to data point $X_i$ is the vector of standard deviations of its data items, denoted here by $\sigma_i$. Measurement data falls within the interval $[X_i - \sigma_i,\ X_i + \sigma_i]$ with probability 68.3%, within $[X_i - 2\sigma_i,\ X_i + 2\sigma_i]$ with probability 95.4%, and within $[X_i - 3\sigma_i,\ X_i + 3\sigma_i]$ with probability 99.7%. A suitable interval representation is selected according to the error accuracy actually required.

Fig. 3 is a schematic diagram of the result of rewriting raw data according to the interval number representation in the present invention. In the illustrated example, one such interval is used as the rewritten form of the raw data point $X_i$, so that the raw data is rewritten into newly defined data objects, which form the data set.
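The interval rewriting of step 130 can be illustrated with a short Python sketch. It assumes normally distributed measurement error, uses the sample standard deviation of each column, and takes a multiplier `k` chosen by the caller; these choices and the function names are assumptions for illustration, not details given in the description.

```python
import math
import statistics


def coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


def to_interval_rows(rows, k=1):
    """Rewrite each value x in column j as the interval (x - k*s_j, x + k*s_j)."""
    columns = list(zip(*rows))
    sigmas = [statistics.stdev(col) for col in columns]  # one deviation per data item
    return [[(x - k * s, x + k * s) for x, s in zip(row, sigmas)] for row in rows]


# Hypothetical dimension-reduced records with two data items each.
rows = [[10.0, 1.0], [12.0, 3.0], [14.0, 5.0]]
intervals = to_interval_rows(rows, k=1)
```

Here `coverage(1)`, `coverage(2)`, and `coverage(3)` evaluate to roughly 0.683, 0.954, and 0.997, matching the probabilities quoted for the three interval widths, so a larger `k` trades tighter intervals for higher coverage.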

Fig. 4 is a flowchart of the core computation part of the present invention. As shown in Fig. 4, step 200 comprises the following steps:

Step 210: using the MapReduce parallel computing framework of the Hadoop distributed processing system, partition the data set obtained from steps 110 to 130 into several smaller data subsets, then perform adaptive DBSCAN clustering on each data subset to obtain the data distribution features and data structure features of the small-scale data subsets;

Step 220: integrate the clustering results of the data subsets through the MapReduce parallel computing framework to obtain the data structure features and data distribution features of the whole data set;

When processing large data samples, traditional single-machine algorithms often suffer from excessive time and space overhead and poor result accuracy. For general data processing algorithms, clustering algorithms included, limited system memory means that memory and I/O consumption rise sharply when the data volume increases dramatically. Rewriting the algorithm so that it can be deployed in a distributed environment, and processing the data sample set in blocks, avoids these problems very effectively.

The distributed processing system processes the data set in blocks, so that the large raw data set is reasonably divided into several smaller data subsets that can be processed in parallel.

The parallel algorithm is therefore implemented with the MapReduce parallel computing framework of the Hadoop distributed processing system. Parallel data processing and clustering are implemented by writing the Map() function, so that the data decomposed into small data subsets can be processed faster and with more accurate results.
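The partition, map, and integrate pattern of steps 210 and 220 can be sketched in plain Python without Hadoop. In this sketch the per-subset work is a simple (count, sum) summary that stands in for the clustering, and the function names and round-robin partitioning are assumptions for illustration only; a real deployment would run the map and reduce phases as MapReduce jobs.

```python
from functools import reduce


def partition(records, n_parts):
    """Split records into n_parts subsets of nearly equal size (round-robin)."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_subset(subset):
    """Stand-in for per-subset clustering: summarize the subset as (count, sum)."""
    return (len(subset), sum(subset))


def reduce_summaries(a, b):
    """Merge two partial summaries into one."""
    return (a[0] + b[0], a[1] + b[1])


records = [3.0, 5.0, 4.0, 8.0, 6.0, 4.0]
subsets = partition(records, n_parts=3)
partials = list(map(map_subset, subsets))      # map phase, one result per subset
count, total = reduce(reduce_summaries, partials)  # integration phase
overall_mean = total / count
```

Each subset is processed independently in the map phase, which is what allows the subsets to be handled in parallel, and the reduce phase only needs the small partial results rather than the raw data.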

Step 230: for the specific practical requirement, partition the data according to the data structure features obtained in step 220, then query the relevant data regions to obtain targeted query results.

Once the data has been decomposed and the per-partition clustering results have been integrated, query conditions are designed according to the final query requirements so that the data can be analyzed and studied in a more targeted way.

Fig. 5 is a flowchart of the adaptive DBSCAN clustering of the present invention. In the current era of big data, feature analysis and requirement exploration built on massive data are increasingly becoming the mainstream research direction of big data analysis and processing. Cloud computing and machine learning have gradually developed to meet the need to process and study large data sets. Clustering algorithms can thoroughly mine the typical features in data distribution and structure and are an approach with considerable potential in machine learning.

The present invention therefore performs the core computation and processing of the data with a clustering algorithm, making full use of the strength of clustering in efficiently exploring the common features of data. This serves the rapid mining of the overall characteristics of the data, in contrast to existing algorithms, which query and focus on individual data items.

The DBSCAN clustering algorithm is a typical density-based clustering algorithm and an efficient one. It relies mainly on two parameters, the radius Eps and the density threshold minPts, whose settings have a crucial influence on the running speed of the clustering and the quality of the clustering results.

In existing DBSCAN clustering algorithms, the values of Eps and minPts are essentially supplied by the user in advance. The user sets the parameters from experience, revises them according to trial results, and gradually finds parameter values that produce reasonably good clustering results. This is a parameter selection scheme that meets requirements fairly well, but for data sets with a large volume of data, repeatedly clustering and then revising the parameters by comparing results incurs a resource cost on every run that cannot be ignored, and the accuracy of parameter selection is also lower than for small data sets. The present invention therefore introduces a way of setting the parameter values adaptively by examining the data set.

As shown in Fig. 5, step 210 comprises the following steps:

Step 211: decompose the data set using the Map() function of MapReduce so that the clustering can be computed in parallel;

Step 212: compute the KNN distribution of each data subset. From the k-dist distribution curves, find a value k0 of k whose curve representatively reflects the shape of the other dist_k curves, and analyze the k0-nearest-neighbor distance data (dist_k0) statistically. According to the statistical analysis, the region where the dist_k probability distribution is densest is taken as the setting value of the radius parameter Eps. A model is therefore fitted to the data of the selected region, the most suitable model is found, and the inflection point f(x0) of the curve is calculated, giving the radius Eps = f(x0);

The parameters Eps and minPts of the DBSCAN clustering algorithm can be selected adaptively by computing the KNN distribution of the data set and analyzing it with mathematical statistics, which yields better-founded parameter setting values.

Specifically, the distance distribution matrix DISTn×n is first calculated from the input preprocessed data set D. The value of each element of the matrix is computed and each row of DISTn×i is sorted in ascending order to obtain the KNN distribution. From the k-dist distribution curves, a value k0 of k is found whose curve representatively reflects the shape of the other dist_k curves, and the k0-nearest-neighbor distance data (dist_k0) is analyzed statistically. The analysis shows that dist_k contains a point at which smooth variation turns into a sharp rise, that is, the region where the dist_k probability distribution is densest, which can be taken as the setting value of the radius parameter Eps. Several models, such as Fourier, Gaussian, and polynomial models, are therefore fitted to the data, the most suitable model is found, and the inflection point f(x0) of the curve is calculated, giving the radius Eps = f(x0).
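The idea of reading Eps off the k-dist curve can be sketched in Python as follows. Note that this sketch does not perform the model fitting described above: as a simpler substitute it locates the knee with a chord heuristic, picking the curve value that lies farthest below the straight line joining the first and last points. The function names, the sample points, and the fixed `k` are assumptions for illustration.

```python
import math


def k_distances(points, k):
    """Ascending list of each point's distance to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def knee_value(curve):
    """Value of the sorted curve that lies farthest below the end-to-end chord."""
    n = len(curve)
    if n < 3:
        return curve[-1]
    first, last = curve[0], curve[-1]
    best_index, best_gap = 0, -1.0
    for i, y in enumerate(curve):
        chord = first + (last - first) * i / (n - 1)
        gap = chord - y
        if gap > best_gap:
            best_index, best_gap = i, gap
    return curve[best_index]


# Two tight groups of four points and one distant outlier (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (30, 30)]
curve = k_distances(points, k=2)
eps = knee_value(curve)
```

On this data the curve stays flat for the eight clustered points and jumps for the outlier, so the heuristic returns the last value before the sharp rise, which plays the role of Eps = f(x0).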

Step 213: once the radius Eps has been obtained in step 212, the density threshold minPts is calculated by counting the objects in the Eps-neighborhood of each point in turn and then taking the mathematical expectation of these counts, which is the value of minPts;

Step 214: with the well-founded values of Eps and minPts calculated in the preceding steps as the main parameter settings, perform density-based DBSCAN clustering on the data subset. During clustering, the distance between data points is no longer calculated with the Euclidean distance formula but with the interval number distance, finally yielding the clustering result of the data subset and an analysis of its data structure features.
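Steps 213 and 214 can be illustrated together with a compact Python sketch. The interval distance used here (root of the averaged squared differences of the lower and upper bounds) is only one common choice and is an assumption, since the description does not give the interval number distance formula. Counting a point as a member of its own neighborhood and rounding the mean count up are likewise assumptions, as are all function names.

```python
import math


def interval_distance(a, b):
    """Assumed distance between points whose coordinates are (low, high) intervals."""
    total = 0.0
    for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b):
        total += ((a_lo - b_lo) ** 2 + (a_hi - b_hi) ** 2) / 2
    return math.sqrt(total)


def estimate_min_pts(points, eps, distance):
    """Mean size of the Eps-neighborhoods (each point counts itself), rounded up."""
    counts = [sum(1 for q in points if distance(p, q) <= eps) for p in points]
    return math.ceil(sum(counts) / len(counts))


def dbscan(points, eps, min_pts, distance):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points)) if distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # only core points expand the cluster
    return labels


# Hypothetical one-dimensional interval data: three overlapping intervals and one far away.
points = [[(0.0, 2.0)], [(1.0, 3.0)], [(2.0, 4.0)], [(20.0, 22.0)]]
eps = 1.5
min_pts = estimate_min_pts(points, eps, interval_distance)
labels = dbscan(points, eps, min_pts, interval_distance)
```

Passing the distance as a parameter keeps the clustering routine unchanged if a different interval number distance is preferred.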

Test results show that the method handles stream data processing effectively; as the data set grows, the time shortens rapidly and the efficiency is clearly higher than that of general query algorithms; moreover, the error introduced by uncertain data and noise points is effectively reduced, and the method places few demands on the characteristics of the data set.

Obviously, the above embodiments are merely examples given to illustrate the present invention clearly and do not limit its embodiments. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is not possible to list all embodiments exhaustively here, and any obvious change or variation derived from the technical solution of the present invention remains within its scope of protection.

Claims

1. A query method for uncertain high-dimensional traffic stream data based on a Hadoop distributed environment, characterized by comprising the following steps:

receiving and storing stream data through the MapReduce batch-processing computing framework, and preprocessing the stream data to obtain a data set;

2. The query method according to claim 1, characterized in that receiving and storing stream data through the MapReduce batch-processing computing framework and preprocessing the stream data comprises:

using a sliding-window mode so that the MapReduce batch-processing computing framework receives and stores stream data transmitted in real time in a stream data environment;

performing data item screening and principal component analysis dimensionality reduction on the received stream data to obtain dimension-reduced data;

3. The query method according to claim 1, characterized in that performing parallelized adaptive DBSCAN clustering on the data set and generating query results in combination with the query requirements comprises:

decomposing the data set into several data subsets;

performing adaptive DBSCAN clustering on each of the data subsets to obtain the data distribution features and data structure features of each data subset;

integrating the data distribution features and data structure features of the data subsets to obtain the data distribution features and data structure features of the whole data set;

4. The query method according to claim 3, characterized in that performing adaptive DBSCAN clustering on each of the data subsets to obtain the data distribution features and data structure features of each data subset comprises:

performing DBSCAN clustering on each of the data subsets according to the parameter setting values to obtain the data distribution features and data structure features of each data subset.