CN105554069A - Big data processing distributed cache system and method thereof - Google Patents

Big data processing distributed cache system and method thereof

Info

Publication number
CN105554069A
Authority
CN
China
Prior art keywords
cache unit
big data
value
cloud computing
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510891553.5A
Other languages
Chinese (zh)
Other versions
CN105554069B (en)
Inventor
马艳
陈玉峰
朱文兵
杜修明
郑建
袁海燕
任敬国
邹立达
苏东亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Shandong Zhongshi Yitong Group Co Ltd
Original Assignee
State Grid Corp of China SGCC
Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Shandong Zhongshi Yitong Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd, Shandong Zhongshi Yitong Group Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN201510891553.5A priority Critical patent/CN105554069B/en
Publication of CN105554069A publication Critical patent/CN105554069A/en
Application granted granted Critical
Publication of CN105554069B publication Critical patent/CN105554069B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/01: Protocols
    • H04L 67/10: Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097: Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284: Relational databases
    • G06F 16/285: Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a big data processing distributed cache system and a method thereof. The method comprises the steps of: dividing a big data processing server into a plurality of cache units, each cache unit storing data in the form of key-value pairs; calculating the value of each cache unit according to how frequently it is accessed, sorting the cache units by value, and extracting all cache units within a preset value threshold range; and clustering the extracted cache units into a preset number of clusters and assigning each cluster of cache units to a cloud computing cache node for storage. With this method, network data transmission between nodes is reduced when data is accessed or processed, which shortens processing time and effectively improves the efficiency of big data processing.

Description

Big data processing distributed cache system and method thereof
Technical field
The invention belongs to the field of big data applications, and in particular relates to a big data processing distributed cache system and a method thereof.
Background art
The development of Internet technology has caused data volumes to grow sharply. With the rapid development of data technology, the amount of data that can be stored and processed has reached an unprecedented magnitude and keeps growing at a rate exceeding Moore's Law. The core value of big data lies precisely in storing and analyzing massive data. In commercial environments, data processing service providers package big data processing as a service and sell it to users.
For real-time data analysis, users place requirements on both processing performance and response time. The performance of big data processing therefore needs to be optimized to improve data processing efficiency, and caching is an important means of increasing big data processing speed.
Storing data in a high-speed cache significantly improves data I/O efficiency and thus speeds up data processing. However, compared with external storage devices such as disks, cache memory is expensive, and big data is by nature massive, so storing all the data in the cache is uneconomical and infeasible. Users access only a portion of the data frequently and in real time, so it is the frequently accessed, important data that should be placed in the cache.
Compared with traditional data caching, big data caching has its own characteristics:
Data is organized and stored in key-value (Key-Value) structures. The granularity, format, and replacement algorithm of the cache still need further study to suit the storage structure of big data.
Big data processing depends on cloud computing platforms. The data touched by big data processing is often related, and placing related data close together reduces the cost of data transfer. For example, suppose a processing task needs two pieces of data, A and B. If A and B are stored on two different nodes, one of them must be transferred to the other node before processing can proceed; if A and B are stored together on one node, network transmission is avoided and processing efficiency improves. Once the data to be cached has been determined, a method is needed to place that data on suitable nodes.
Summary of the invention
To overcome the shortcomings of the prior art, the invention provides a big data processing distributed caching method. The method clusters cache units and stores each cluster of cache units in a corresponding cloud computing cache node, thereby accelerating big data processing.
To achieve the above object, the invention adopts the following technical solution:
A big data processing distributed cache system, comprising: a big data memory and a distributed cloud computing server that communicate with each other;
the big data memory is divided into several cache units, and each cache unit stores data in the form of key-value pairs;
the distributed cloud computing server is provided with several cloud computing cache nodes, a big data extraction module, and a cloud computing cache node distribution module;
the big data extraction module calculates the value of each cache unit according to its access frequency, sorts the cache units by value, and extracts all cache units within a preset value threshold range;
the cloud computing cache node distribution module clusters all the extracted cache units within the preset value threshold range into a preset number of clusters, and assigns each cluster of cache units to a cloud computing cache node for storage.
The big data memory comprises a RAM memory and a FLASH memory.
The data in the cache units of the big data memory is updated according to a predetermined period.
A caching method of the big data processing distributed cache system, comprising:
dividing the big data processing server into several cache units, each cache unit storing data in the form of key-value pairs;
calculating the value of each cache unit according to its access frequency, sorting the cache units by value, and extracting all cache units within a preset value threshold range; and
clustering all the extracted cache units within the preset value threshold range into a preset number of clusters, and assigning each cluster of cache units to a cloud computing cache node for storage.
Before the value of the cache units is calculated, the data in the cache units is updated according to the predetermined period.
The value of a cache unit is calculated as:
$p_i^j = \alpha \cdot p_i^{j-1} + (1-\alpha) \cdot n_i^j \cdot \beta$
where $p_i^j$ denotes the value of the $i$-th cache unit in the $j$-th period; $p_i^{j-1}$ denotes the value of the $i$-th cache unit in the $(j-1)$-th period; $\alpha$ is the period influence factor, a constant; $\beta$ is the data value factor of the $i$-th cache unit, a constant; $n_i^j$ is the number of times the $i$-th cache unit is accessed within the $j$-th period; $i$ and $j$ are positive integers greater than or equal to 1, and $n_i^j$ is an integer greater than or equal to 0.
In the cloud computing cache nodes, the Memcache mechanism is adopted to cache the big data.
The k-means algorithm is used to cluster all the extracted cache units within the preset value threshold range.
The big data memory comprises a RAM memory and a FLASH memory.
The beneficial effects of the invention are:
(1) The distributed cloud computing server of the invention is provided with several cloud computing cache nodes, and each cloud computing cache node stores a predetermined number of cache unit clusters, so that when data is accessed or processed, network data transmission between nodes is reduced, processing time is shortened, and the efficiency of big data processing is effectively improved;
(2) The cloud computing cache nodes of the distributed cloud computing server can adopt multiple storage mechanisms for big data, including the Memcache mechanism, and the multiple cloud computing cache nodes in the big data processing distributed cache system of the invention ensure that big data is cached and processed in a distributed manner.
Brief description of the drawings
Fig. 1 is a flow chart of the big data processing distributed caching method of the invention.
Detailed description of embodiments
The invention is further described below with reference to the accompanying drawing and embodiments:
The big data processing distributed cache system of the invention comprises a big data memory and a distributed cloud computing server, which communicate with each other.
The big data memory and the distributed cloud computing server are described in detail in turn below:
(1) Big data memory:
The big data memory is divided into several cache units, and each cache unit stores data in the form of key-value pairs. The big data memory comprises a RAM memory and a FLASH memory.
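Purely as an editorial illustration (not part of the patent text), a cache unit can be pictured as a small key-value store plus the bookkeeping needed later for the value calculation; the class and field names in this Python sketch are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class CacheUnit:
    """One cache unit of the big data memory: a key-value store plus
    the per-period access counter used by the value calculation."""
    unit_id: int
    data: dict = field(default_factory=dict)  # key-value pairs
    accesses: int = 0                         # n_i^j: accesses in the current period
    value: float = 0.0                        # p_i^j: value from the previous period

    def get(self, key):
        self.accesses += 1
        return self.data.get(key)

    def put(self, key, val):
        self.data[key] = val
```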
(2) Distributed cloud computing server:
The distributed cloud computing server is provided with several cloud computing cache nodes, a big data extraction module, and a cloud computing cache node distribution module.
The big data extraction module calculates the value of each cache unit according to its access frequency, sorts the cache units by value, and extracts all cache units within a preset value threshold range.
The cloud computing cache node distribution module clusters all the extracted cache units within the preset value threshold range into a preset number of clusters, and assigns each cluster of cache units to a cloud computing cache node for storage.
The data in the cache units of the big data memory is updated according to a predetermined period.
Fig. 1 shows the caching method of the big data processing distributed cache system of the invention; the method is described in detail below with reference to Fig. 1.
Specifically, the caching method comprises:
Step 1: the big data processing server is divided into several cache units, and each cache unit stores data in the form of key-value pairs;
Step 2: the value of each cache unit is calculated according to its access frequency, the cache units are sorted by value, and all cache units within a preset value threshold range are extracted;
Step 3: all the extracted cache units within the preset value threshold range are clustered, and each of the preset number of clusters of cache units is assigned to a cloud computing cache node for storage.
Before the value of the cache units is calculated, the data in the cache units is updated according to the predetermined period.
In step 2, the value of a cache unit is calculated as:
$p_i^j = \alpha \cdot p_i^{j-1} + (1-\alpha) \cdot n_i^j \cdot \beta$
where $p_i^j$ denotes the value of the $i$-th cache unit in the $j$-th period; $p_i^{j-1}$ denotes the value of the $i$-th cache unit in the $(j-1)$-th period; $\alpha$ is the period influence factor, a constant; $\beta$ is the data value factor of the $i$-th cache unit, a constant; $n_i^j$ is the number of times the $i$-th cache unit is accessed within the $j$-th period; $i$ and $j$ are positive integers greater than or equal to 1, and $n_i^j$ is an integer greater than or equal to 0.
The more urgent the required return time of accesses to the i-th cache unit, the higher its β value. Data accesses can be classified by the urgency of the required return time into three classes: real-time, general, and relaxed. The three classes correspond to different β values, with more urgent accesses receiving higher β values. The access frequency and access urgency of any cache unit within a period can be counted from the data access records.
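As a minimal sketch of this value update, with the β constants for the three urgency classes and the α value chosen purely for illustration (the patent does not specify concrete values):

```python
# Illustrative beta values for the three urgency classes; more urgent
# accesses carry a higher data value factor (assumed, not prescribed).
BETA = {"real-time": 1.0, "general": 0.5, "relaxed": 0.1}

def update_value(p_prev: float, n_accesses: int, alpha: float, beta: float) -> float:
    """p_i^j = alpha * p_i^{j-1} + (1 - alpha) * n_i^j * beta"""
    return alpha * p_prev + (1 - alpha) * n_accesses * beta

# Example: a unit valued 2.0 in the last period, accessed 30 times this
# period at "general" urgency, with period influence factor alpha = 0.6:
p = update_value(2.0, 30, alpha=0.6, beta=BETA["general"])
# 0.6 * 2.0 + 0.4 * 30 * 0.5 = 7.2
```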
In the cloud computing cache nodes, the Memcache mechanism is adopted to cache the big data.
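For illustration only, storing and reading one key-value pair on such a node could look as follows, here using the pymemcache client; the client library, node address, and key naming are assumptions, not anything the patent prescribes.

```python
from pymemcache.client.base import Client

# Hypothetical cloud computing cache node running memcached on port 11211.
node = Client(("cache-node-1", 11211))
node.set("unit42:sensor_reading", b"38.5")   # store one key-value pair
print(node.get("unit42:sensor_reading"))     # -> b'38.5'
```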
The k-means algorithm is used to cluster all the extracted cache units within the preset value threshold range, and each of the preset number of clusters of cache units is assigned to a cloud computing cache node for storage.
If a cluster is larger than the capacity of a node, the cluster is divided again using the k-means algorithm and stored on as few nodes as possible.
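A sketch of this clustering and capacity-splitting step, assuming each cache unit is described by a numeric feature vector (for example, its row of the co-access edge-weight matrix described below) and that node capacity is counted in cache units; scikit-learn's KMeans stands in for the k-means step.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_units(features: np.ndarray, k: int, node_capacity: int) -> list:
    """Cluster cache units (rows of `features`) into k groups with k-means;
    any cluster larger than a node's capacity is divided again with k-means
    into ceil(size / capacity) sub-clusters, i.e. as few nodes as possible."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(features)
    clusters = []
    for c in range(k):
        members = np.where(labels == c)[0]        # unit indices in cluster c
        if len(members) <= node_capacity:
            clusters.append(members.tolist())
            continue
        sub_k = -(-len(members) // node_capacity)  # ceil division
        sub = KMeans(n_clusters=sub_k, n_init=10).fit_predict(features[members])
        for s in range(sub_k):
            clusters.append(members[sub == s].tolist())
    return clusters
```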
Before all the extracted cache units within the preset value threshold range are clustered, a connected graph is built over them:
Each cache unit within the preset value threshold range is treated as a vertex. If two cache units are accessed by the same data processing task, an edge of weight 1 is added between the two vertices, and edge weights accumulate over repeated co-accesses. All cache units within the preset value threshold range thus form a connected graph.
The built graph is checked for emptiness. If it is not empty, all the extracted cache units within the preset value threshold range are clustered; otherwise no clustering is performed, since the number of extracted cache units is then one, and this single cache unit is stored directly in a cloud computing cache node.
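A sketch of the co-access graph construction, assuming the access records are available as a list of processing tasks, each listing the cache units it touched together (the record format is an assumption):

```python
from collections import defaultdict
from itertools import combinations

def build_coaccess_graph(task_logs) -> dict:
    """Each cache unit is a vertex; every pair of units accessed by the
    same data processing task gains an edge of weight 1, and weights of
    repeated co-accesses accumulate."""
    weights = defaultdict(int)  # (u, v) with u < v -> accumulated edge weight
    for units in task_logs:
        for u, v in combinations(sorted(set(units)), 2):
            weights[(u, v)] += 1
    return dict(weights)

# Example: units 1 and 2 are co-accessed by two tasks, so their edge has weight 2.
graph = build_coaccess_graph([[1, 2], [1, 2, 3], [3]])
# {(1, 2): 2, (1, 3): 1, (2, 3): 1}; an empty graph means only one cache
# unit was extracted, and it is stored directly without clustering.
```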
Although the specific embodiments of the invention have been described above with reference to the accompanying drawing, they do not limit the protection scope of the invention. Those of ordinary skill in the art should understand that various modifications or variations made on the basis of the technical solution of the invention without creative work still fall within the protection scope of the invention.

Claims (9)

1. A big data processing distributed cache system, characterized by comprising: a big data memory and a distributed cloud computing server that communicate with each other;
the big data memory is divided into several cache units, and each cache unit stores data in the form of key-value pairs;
the distributed cloud computing server is provided with several cloud computing cache nodes, a big data extraction module, and a cloud computing cache node distribution module;
the big data extraction module calculates the value of each cache unit according to its access frequency, sorts the cache units by value, and extracts all cache units within a preset value threshold range;
the cloud computing cache node distribution module clusters all the extracted cache units within the preset value threshold range into a preset number of clusters, and assigns each cluster of cache units to a cloud computing cache node for storage.
2. The big data processing distributed cache system as claimed in claim 1, characterized in that the big data memory comprises a RAM memory and a FLASH memory.
3. The big data processing distributed cache system as claimed in claim 1, characterized in that the data in the cache units of the big data memory is updated according to a predetermined period.
4. A caching method of the big data processing distributed cache system as claimed in claim 1, characterized by comprising:
dividing the big data processing server into several cache units, each cache unit storing data in the form of key-value pairs;
calculating the value of each cache unit according to its access frequency, sorting the cache units by value, and extracting all cache units within a preset value threshold range; and
clustering all the extracted cache units within the preset value threshold range into a preset number of clusters, and assigning each cluster of cache units to a cloud computing cache node for storage.
5. The caching method as claimed in claim 4, characterized in that before the value of the cache units is calculated, the data in the cache units is updated according to the predetermined period.
6. The caching method as claimed in claim 4, characterized in that the value of a cache unit is calculated as:
$p_i^j = \alpha \cdot p_i^{j-1} + (1-\alpha) \cdot n_i^j \cdot \beta$
where $p_i^j$ denotes the value of the $i$-th cache unit in the $j$-th period; $p_i^{j-1}$ denotes the value of the $i$-th cache unit in the $(j-1)$-th period; $\alpha$ is the period influence factor, a constant; $\beta$ is the data value factor of the $i$-th cache unit, a constant; $n_i^j$ is the number of times the $i$-th cache unit is accessed within the $j$-th period; $i$ and $j$ are positive integers greater than or equal to 1, and $n_i^j$ is an integer greater than or equal to 0.
7. The caching method as claimed in claim 4, characterized in that in the cloud computing cache nodes, the Memcache mechanism is adopted to cache the big data.
8. The caching method as claimed in claim 4, characterized in that the k-means algorithm is used to cluster all the extracted cache units within the preset value threshold range.
9. The caching method as claimed in claim 4, characterized in that the big data memory comprises a RAM memory and a FLASH memory.
CN201510891553.5A 2015-12-04 2015-12-04 A kind of big data processing distributed cache system and its method Active CN105554069B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510891553.5A CN105554069B (en) 2015-12-04 2015-12-04 A kind of big data processing distributed cache system and its method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510891553.5A CN105554069B (en) 2015-12-04 2015-12-04 A kind of big data processing distributed cache system and its method

Publications (2)

Publication Number Publication Date
CN105554069A true CN105554069A (en) 2016-05-04
CN105554069B CN105554069B (en) 2018-09-11

Family

ID=55833001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510891553.5A Active CN105554069B (en) 2015-12-04 2015-12-04 A kind of big data processing distributed cache system and its method

Country Status (1)

Country Link
CN (1) CN105554069B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528833A (en) * 2016-11-14 2017-03-22 天津南大通用数据技术股份有限公司 Method and device for dynamic redistribution of data of MPP (Massively Parallel Processing) database
CN107645541A (en) * 2017-08-24 2018-01-30 阿里巴巴集团控股有限公司 Date storage method, device and server
CN107704591A (en) * 2017-10-12 2018-02-16 西南财经大学 A kind of data processing method of the intelligent wearable device based on cloud computing non-database framework
CN107995020A (en) * 2017-10-23 2018-05-04 北京兰云科技有限公司 A kind of asset valuation method and apparatus

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102984203A (en) * 2012-10-31 2013-03-20 深圳市深信服电子科技有限公司 Method and device and system for improving use ratio of high-cache device based on cloud computing
CN103051701A (en) * 2012-12-17 2013-04-17 北京网康科技有限公司 Cache admission method and system
CN103475690A (en) * 2013-06-17 2013-12-25 携程计算机技术(上海)有限公司 Memcached instance configuration method and Memcached instance configuration system
CN104050043A (en) * 2014-06-17 2014-09-17 华为技术有限公司 Share cache perception-based virtual machine scheduling method and device
CN104219327A (en) * 2014-09-27 2014-12-17 上海瀚之友信息技术服务有限公司 Distributed cache system
US20150106884A1 (en) * 2013-10-11 2015-04-16 Broadcom Corporation Memcached multi-tenancy offload

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102984203A (en) * 2012-10-31 2013-03-20 深圳市深信服电子科技有限公司 Method and device and system for improving use ratio of high-cache device based on cloud computing
CN103051701A (en) * 2012-12-17 2013-04-17 北京网康科技有限公司 Cache admission method and system
CN103475690A (en) * 2013-06-17 2013-12-25 携程计算机技术(上海)有限公司 Memcached instance configuration method and Memcached instance configuration system
US20150106884A1 (en) * 2013-10-11 2015-04-16 Broadcom Corporation Memcached multi-tenancy offload
CN104050043A (en) * 2014-06-17 2014-09-17 华为技术有限公司 Share cache perception-based virtual machine scheduling method and device
CN104219327A (en) * 2014-09-27 2014-12-17 上海瀚之友信息技术服务有限公司 Distributed cache system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
罗建平 et al.: "大数据负载的体系结构特征分析" [Analysis of architectural characteristics of big data workloads], 《计算机科学》 [Computer Science] *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528833A (en) * 2016-11-14 2017-03-22 天津南大通用数据技术股份有限公司 Method and device for dynamic redistribution of data of MPP (Massively Parallel Processing) database
CN107645541A (en) * 2017-08-24 2018-01-30 阿里巴巴集团控股有限公司 Date storage method, device and server
CN107645541B (en) * 2017-08-24 2021-03-02 创新先进技术有限公司 Data storage method and device and server
CN107704591A (en) * 2017-10-12 2018-02-16 西南财经大学 A kind of data processing method of the intelligent wearable device based on cloud computing non-database framework
CN107995020A (en) * 2017-10-23 2018-05-04 北京兰云科技有限公司 A kind of asset valuation method and apparatus

Also Published As

Publication number Publication date
CN105554069B (en) 2018-09-11

Similar Documents

Publication Publication Date Title
CN103678172B (en) Local data cache management method and device
CN105554069A (en) Big data processing distributed cache system and method thereof
US20110307685A1 (en) Processor for Large Graph Algorithm Computations and Matrix Operations
CN104407879B (en) A kind of power network sequential big data loaded in parallel method
CN105740424A (en) Spark platform based high efficiency text classification method
CN111913649B (en) Data processing method and device for solid state disk
US10817178B2 (en) Compressing and compacting memory on a memory device wherein compressed memory pages are organized by size
CN104199942B (en) A kind of Hadoop platform time series data incremental calculation method and system
CN110413776B (en) High-performance calculation method for LDA (text-based extension) of text topic model based on CPU-GPU (Central processing Unit-graphics processing Unit) collaborative parallel
CN105005585A (en) Log data processing method and device
CN111984400A (en) Memory allocation method and device of neural network
CN107851063A (en) The dynamic coding algorithm of intelligently encoding accumulator system
CN111210004B (en) Convolution calculation method, convolution calculation device and terminal equipment
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN109416688B (en) Method and system for flexible high performance structured data processing
CN105701861A (en) Point cloud sampling method and system
CN115730555A (en) Chip layout method, device, equipment and storage medium
CN106202152B (en) A kind of data processing method and system of cloud platform
CN106201918B (en) A kind of method and system based on big data quantity and extensive caching quick release
CN103543959B (en) The method and device of mass data cache
CN109213745B (en) Distributed file storage method, device, processor and storage medium
CN202093513U (en) Bulk data processing system
CN104050189B (en) The page shares processing method and processing device
CN103473368A (en) Virtual machine real-time migration method and system based on counting rank ordering
CN110413540A (en) A kind of method, system, equipment and the storage medium of FPGA data caching

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant