CN113407620B

CN113407620B - Data block placement method and system based on heterogeneous Hadoop cluster environment

Info

Publication number: CN113407620B
Application number: CN202010185518.2A
Authority: CN
Inventors: 宋莹; 许家豪
Original assignee: Beijing Information Science and Technology University
Current assignee: Best Innovation Beijing Technology Co ltd
Priority date: 2020-03-17
Filing date: 2020-03-17
Publication date: 2023-04-21
Anticipated expiration: 2040-03-17
Also published as: CN113407620A

Abstract

The invention provides a data block placement method and system for a heterogeneous Hadoop cluster environment. The heat of each data block (how hot or cold it is) is measured by calculating its access frequency in each period, and the data blocks are then placed on different data nodes according to their heat. Correlation is taken into account during placement: correlated data blocks are placed in a scattered manner and are not stored on the same data node at the same time, which avoids several data blocks on one data node being accessed simultaneously and reduces the load on that data node. The placement strategy provided by the invention improves the execution performance of the cluster and its resource utilization.

Description

Data block placement method and system based on heterogeneous Hadoop cluster environment

Technical Field

The invention relates to the replication of data block copies according to the heat (how hot or cold) of data blocks in a Hadoop cluster so as to improve cluster performance, and belongs to the field of distributed computing.

Background

With the continuous development of internet technology, we have entered the era of big data, and technologies related to big data are being applied ever more widely. Hadoop is currently the most popular open-source big data framework. It is a platform that can process massive data offline and in parallel; it is highly reliable, highly scalable, efficient, low-cost and open source; and it has become the preferred mass data processing solution of many internet companies. Hadoop mainly comprises the Hadoop Distributed File System (HDFS) and the MapReduce distributed computing framework. Although Hadoop is by now mature, some aspects still require improvement and optimization.

Many files are stored in HDFS, both large files, which consist of a plurality of data blocks, and small files (which are numerous), each of which occupies only part of a data block. The heat of a data block is measured by how frequently users access it: the higher the access frequency, the higher the heat. There are therefore hot data (data with a high access frequency) and cold data (data with a low access frequency). Hot data, being data that users access often, presents two problems: 1) because its access frequency is high, hot data may be accessed by many users at the same time, which increases the burden on the node; 2) because users access it frequently, its response time must satisfy user expectations. Conventional Hadoop faces both problems.

The traditional Hadoop system is designed for a homogeneous computing environment consisting of a group of machines with the same configuration, in which every node has the same storage performance and disk capacity. When data is written into HDFS, it is divided into a plurality of blocks of the same size, and Hadoop then distributes the data blocks evenly over the nodes at random. However, clusters running Hadoop today are often heterogeneous computing environments, and the data stored in Hadoop differ in heat. Hot data is accessed often and by many users, which requires the nodes storing it to have high storage performance, whereas cold data is rarely or never accessed and only needs to be stored. With respect to data heat, therefore, the homogeneous-cluster design of traditional Hadoop is neither efficient nor practical.

The default replica policy of Hadoop has shortcomings with respect to user demand, storage performance and system resources. In a heterogeneous cluster it suffers from problems such as low system resource utilization, unbalanced node load, poor fault tolerance, and heavy network transmission and communication load, which may even lead to failures.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a data block placement method based on a heterogeneous Hadoop cluster environment, which comprises the following steps:

step 1, dividing the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-heat data blocks, normal data blocks and cold data blocks according to the frequency with which they are accessed, and classifying the data nodes in the heterogeneous cluster environment by performance according to the performance of each data node and a preset performance standard;

step 2, carrying out correlation analysis on the data blocks, and marking the data blocks with correlation in each classification of the data blocks;

step 3, executing a data block placement strategy, and placing each data block on the data nodes with different classifications according to the classifications of the data blocks and the data nodes;

step 4, when the data block placement strategy is executed, judging whether the data node selected for the current data block already holds other data blocks correlated with it, and if so, re-executing step 3 within the same data node classification to select another data node;

step 5, finishing the placement of the current data block, and executing step 3 again until all the data nodes finish the placement.

In the data block placement method based on the heterogeneous Hadoop cluster environment, step 1 comprises:

step 11, acquiring, through a log collection tool, the number of read operations M on each data block in the heterogeneous cluster environment within a specified period T, and obtaining the access frequency B_f of the current period from the access frequency B_f(pre) of the previous period and a balance factor τ:

step 12, calculating an average access frequency B_F(avg) from the access frequency of the data block in each period so as to measure the heat of the data block, and dividing the data blocks into hot data blocks and cold data blocks in order of heat from high to low.

In the data block placement method based on the heterogeneous Hadoop cluster environment, step 2 comprises:

step 21, performing correlation analysis using the covariance cov between data blocks, according to the access frequencies of the data blocks in different periods obtained in step 1:

where n is the number of cycles, i is the current cycle, X and Y represent the access frequencies of data blocks B1 and B2, respectively, in the current cycle, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;

step 22, judging whether the covariance cov is positive; if so, the access frequencies of the two data blocks vary in the same direction and data blocks B1 and B2 have access correlation; otherwise, data blocks B1 and B2 have no access correlation.
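The formula referred to in step 21 is not reproduced in this text. As a sketch only, assuming the standard population covariance, which is consistent with the symbol definitions given for step 21 (the original may normalize by n − 1 instead of n, which would not change the sign tested in step 22), it would read:

```latex
\operatorname{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right),
\qquad
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i,
\qquad
\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i
```

Here X_i and Y_i denote the access frequencies of data blocks B1 and B2 in the i-th cycle; the subscript notation is an assumption introduced for this sketch.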

The invention also provides a data block placement system based on the heterogeneous Hadoop cluster environment, which comprises:

the module 1 divides the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-heat data blocks, normal data blocks and cold data blocks according to the frequency with which they are accessed, and classifies the data nodes in the heterogeneous cluster environment by performance according to the performance of each data node and a preset performance standard;

the module 2 is used for carrying out correlation analysis on the data blocks and marking the data blocks with correlation in each classification of the data blocks;

the module 3 executes a data block placement strategy, and places each data block on the data nodes with different classifications according to the classifications of the data blocks and the data nodes;

the module 4 judges, when the data block placement strategy is executed, whether the data node selected for the current data block already holds other data blocks correlated with it, and if so, the module 3 is executed again within the same data node classification to select another data node;

the module 5 finishes the placement of the current data block, and the module 3 is executed again until all the data nodes finish the placement.

The data block placement system based on the heterogeneous Hadoop cluster environment, wherein the module 1 comprises:

the module 11 obtains, through a log collection tool, the number of read operations M on each data block in the heterogeneous cluster environment within the specified period T, and obtains the access frequency B_f of the current period from the access frequency B_f(pre) of the previous period and the balance factor τ:

the module 12 calculates an average access frequency B_F(avg) from the access frequency of the data block in each period so as to measure the heat of the data block, and divides the data blocks into hot data blocks and cold data blocks in order of heat from high to low.

The data block placement system based on the heterogeneous Hadoop cluster environment, wherein the module 2 comprises:

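the module 21 performs correlation analysis using the covariance cov between data blocks, according to the access frequencies of the data blocks in different periods obtained by the module 1:

where n is the number of cycles, i is the current cycle, X and Y represent the access frequencies of data blocks B1 and B2, respectively, in the current cycle, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;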

the module 22 determines whether the covariance cov is positive, if so, it indicates that the variation trend of the access frequencies of the two data blocks is consistent, and the data blocks B1 and B2 have access correlation, otherwise, it indicates that the data blocks B1 and B2 do not have access correlation.

The advantages of the invention are as follows:

The invention provides a replica strategy for data blocks in a heterogeneous Hadoop cluster environment. First, the heat of each data block is measured by calculating its access frequency in each period. The data blocks are then placed on different data nodes according to their heat. Correlation is taken into account during placement: correlated data blocks are placed in a scattered manner and are not stored on the same data node at the same time, which avoids several data blocks on one data node being accessed simultaneously and reduces the load on the data node. The placement strategy provided by the invention improves the execution performance of the cluster and its resource utilization. Fig. 1 details the overall flow of the invention, and Fig. 2 details the flow of the data block placement strategy.

Drawings

FIG. 1 is a flowchart of the data block placement strategy based on a heterogeneous Hadoop cluster environment;

FIG. 2 is a detailed flow chart of a data block placement strategy.

Detailed Description

The invention aims to provide a data block placement strategy based on a heterogeneous Hadoop cluster environment that addresses the hot and cold data present in existing Hadoop clusters and improves the execution performance and resource utilization of the cluster.

Specifically, the invention comprises the following steps:

A. Judging the heat of the data blocks. The implementation method comprises the following steps:

A1. Calculating the access frequency of each data block;

A1-1. The Flume log collection tool is used to acquire the number of read operations M on each data block in HDFS within a specified period T. Because the access frequency may differ considerably from one period to the next, a balance factor τ is set, and the access frequency of the previous period is denoted B_f(pre). The access frequency B_f of the current period is calculated as follows:

A1-2. The access frequency B_f(i) of the data block in the i-th period can be derived from formula (1) in step A1-1, where B_f(0) represents the access frequency when the data block is created. Because a newly created data block has no access history, the value of B_f(0) is calculated as follows:

A2. Calculating the average access frequency B_F(avg) from the per-period access frequencies of the data block obtained in step A1;

A3. B_F(avg) from step A2 is used to measure the heat of the data blocks, and the data blocks are divided, from high heat to low, into hot data blocks, medium-heat data blocks, normal data blocks and cold data blocks.
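Step A can be sketched in Python as follows. The formulas of steps A1-1 and A1-2 are not reproduced in this text, so the sketch assumes an exponentially smoothed rate, τ·(M/T) + (1 − τ)·B_f(pre), with B_f(pre) taken as 0 for the first period. That smoothing rule, the function names and the heat thresholds are all assumptions made for illustration, not the patent's own definitions.

```python
from statistics import mean


def smoothed_frequency(reads, period, prev_freq, tau):
    """Blend this period's raw rate M/T with the previous period's frequency.

    ASSUMPTION: exponential smoothing weighted by the balance factor tau;
    the patent's formula (1) is not available in this text.
    """
    return tau * (reads / period) + (1 - tau) * prev_freq


def frequency_history(read_counts, period, tau):
    """Return the per-period access frequencies B_f(0), B_f(1), ... of one block.

    ASSUMPTION: B_f(pre) is 0 before the first period (no access history).
    """
    history, prev = [], 0.0
    for reads in read_counts:
        prev = smoothed_frequency(reads, period, prev, tau)
        history.append(prev)
    return history


def classify_heat(avg_freq, thresholds=(10.0, 5.0, 1.0)):
    """Map B_F(avg) to one of the four heat classes; the cut-offs are hypothetical."""
    hot, medium, normal = thresholds
    if avg_freq >= hot:
        return "hot"
    if avg_freq >= medium:
        return "medium"
    if avg_freq >= normal:
        return "normal"
    return "cold"


# Three periods of length T = 10 with M = 120, 80 and 100 reads.
history = frequency_history([120, 80, 100], period=10, tau=0.5)
avg = mean(history)          # B_F(avg)
label = classify_heat(avg)   # heat class of the block
```

In practice the thresholds would presumably be derived from the distribution of B_F(avg) across the cluster rather than fixed constants.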

B. The implementation method of the data block correlation analysis comprises the following steps:

B1. Data blocks having access correlation;

B1-1. Correlation here mainly means that data blocks in the cluster are related to one another to some degree. Take data block B1 and data block B2 as an example. In the first case, a user who accesses data block B1 accesses data block B2 at the same time. In the second case, when the access frequency of data block B1 increases or decreases from period to period, the access frequency of data block B2 changes linearly in the same direction. In either case, the invention regards data blocks B1 and B2 as correlated;

B2. Method of detecting correlation;

B2-1. Correlation analysis is performed using covariance on the access frequencies of each data block in the different periods obtained in step A. For example, the correlation of data blocks B1 and B2 can be calculated using the following formula:

where n is the number of cycles, i is the current cycle, X and Y represent the access frequencies of data blocks B1 and B2, respectively, and $\bar{X}$ and $\bar{Y}$ represent their average access frequencies over the n cycles;

B2-2. If the calculated covariance cov is positive, the access frequencies of the two data blocks vary in the same direction. If cov is 0, the two data blocks are uncorrelated. If cov is negative, the two data blocks are negatively correlated, which is not the focus of the invention; the invention is mainly concerned with positive correlation, that is, a positive cov value;

B2-3. This method detects whether two data blocks are correlated, but more than two data blocks may be correlated with one another. Therefore, if data block B1 is found to be correlated with data block B2, and data block B2 (or B1) is in turn correlated with data block B3, then data blocks B1, B2 and B3 are all regarded as correlated;
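The pairwise test of steps B2-1 and B2-2 can be sketched as below. The sketch assumes the population covariance (division by n); since only the sign is tested, dividing by n − 1 would give the same decision. The function names are illustrative.

```python
def covariance(xs, ys):
    """Population covariance of two equally long access-frequency series."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("series must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def positively_correlated(xs, ys):
    """True when cov > 0, the only case the placement strategy acts on."""
    return covariance(xs, ys) > 0


# Per-period access frequencies of three blocks (made-up values).
b1 = [2.0, 4.0, 6.0]
b2 = [1.0, 2.0, 3.0]   # rises together with b1
b3 = [6.0, 4.0, 2.0]   # falls while b1 rises
```

With these series, `positively_correlated(b1, b2)` holds and `positively_correlated(b1, b3)` does not, so only B1 and B2 would be marked as correlated.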

B3. Creating and marking sets of correlated data blocks;

B3-1. In step A the data blocks are classified by heat (hot, medium-heat, normal and cold data blocks). Each classification is traversed in turn to detect correlation among its data blocks. If two or more data blocks are correlated, a correlation set C = {B1, B2, …, Bn} is established, where n is the number of data blocks in the set, and each set is marked with the BID of its first data block;
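One way to build the correlation sets of step B3, including the transitive merging described in B2-3, is a union-find pass over the blocks of one heat class. This is a sketch under that assumption; the patent does not name a data structure, and `build_correlation_sets` and the `is_correlated` predicate are hypothetical names.

```python
from itertools import combinations


def build_correlation_sets(block_ids, is_correlated):
    """Group the BIDs of one heat class into correlation sets.

    block_ids: BIDs in queue order.
    is_correlated: hypothetical predicate (bid_a, bid_b) -> bool, e.g. cov > 0.
    Returns {BID of first member: [member BIDs]}; uncorrelated blocks are omitted.
    """
    parent = {bid: bid for bid in block_ids}

    def find(bid):
        # Follow parent links to the set representative, shortening the path.
        while parent[bid] != bid:
            parent[bid] = parent[parent[bid]]
            bid = parent[bid]
        return bid

    for a, b in combinations(block_ids, 2):
        if is_correlated(a, b):
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two sets

    groups = {}
    for bid in block_ids:  # queue order keeps each set's first block first
        groups.setdefault(find(bid), []).append(bid)
    # Mark each set with the BID of its first data block.
    return {members[0]: members for members in groups.values() if len(members) > 1}


# B1~B2 and B2~B3 are correlated, so B1, B2 and B3 end up in one set.
pairs = {("B1", "B2"), ("B2", "B3")}
sets = build_correlation_sets(
    ["B1", "B2", "B3", "B4"],
    lambda a, b: (a, b) in pairs or (b, a) in pairs,
)
```

The pairwise loop is quadratic in the number of blocks per class, which may matter for large clusters; the text does not say how the comparison is bounded.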

C. The data node classification method comprises the following steps:

C1. Hardware differences are mainly reflected in CPU, disk I/O, network and memory. Memory differs mainly in size, with little difference in performance, and network transmission is not the focus of the invention, so these two items are not considered; the classification standard therefore focuses on CPU and disk I/O;

C2. The data nodes are classified roughly as follows: 1) machines with strong CPU and I/O performance are called the MAX class; 2) machines with strong CPU performance are called the CPU class; 3) machines with strong I/O performance are called the IO class; 4) machines in which both are average are called the CIM class; 5) machines in which both are weak are called the CIB class;
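The five node classes can be illustrated with a small classifier. The text only refers to a preset performance standard, so the normalized scores and the `strong` and `weak` cut-offs below are assumptions chosen for the example.

```python
def classify_node(cpu_score, io_score, strong=0.7, weak=0.3):
    """Assign a data node to the MAX, CPU, IO, CIM or CIB class.

    ASSUMPTION: scores are normalized to [0, 1]; the cut-offs are hypothetical.
    """
    cpu_strong = cpu_score >= strong
    io_strong = io_score >= strong
    if cpu_strong and io_strong:
        return "MAX"
    if cpu_strong:
        return "CPU"
    if io_strong:
        return "IO"
    if cpu_score < weak and io_score < weak:
        return "CIB"
    return "CIM"  # neither strong nor both weak: treated as average


# Made-up (cpu_score, io_score) pairs for five nodes.
scores = {
    "D1": (0.9, 0.8),
    "D2": (0.9, 0.4),
    "D3": (0.5, 0.9),
    "D4": (0.5, 0.5),
    "D5": (0.1, 0.2),
}
classes = {node: classify_node(cpu, io) for node, (cpu, io) in scores.items()}
```

A node with one weak and one average dimension falls into CIM here; the text does not say how such mixed cases are handled, so that choice is also an assumption.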

D. The implementation method of the heat-based data block placement strategy comprises the following steps:

D1. In step A the data blocks are divided into hot data blocks, medium-heat data blocks, normal data blocks and cold data blocks, and four queues are generated according to these four classifications;

D1-1. Hot data block queue B(h) = {B1, B2, …, Bm}, where m is the number of data blocks; medium-heat data block queue B(m) = {B1, B2, …, Bj}, where j is the number of data blocks; normal data block queue B(n) = {B1, B2, …, Bk}, where k is the number of data blocks; cold data block queue B(c) = {B1, B2, …, Bn}, where n is the number of data blocks;

D2. Several groups of data node queues are obtained according to the classification in step C2;

D2-1. 1) MAX class data node queue D(MAX) = {D1, D2, …, Dm}, where m is the number of data nodes; 2) CPU class data node queue D(CPU) = {D1, D2, …, Dn}, where n is the number of data nodes; 3) IO class data node queue D(IO) = {D1, D2, …, Dj}, where j is the number of data nodes; 4) CIM class data node queue D(CIM) = {D1, D2, …, Dk}, where k is the number of data nodes; 5) CIB class data node queue D(CIB) = {D1, D2, …, Dl}, where l is the number of data nodes;

D2-2. The data nodes in queue D(MAX) store only copies of the data blocks in queue B(h). The data blocks in queues B(m) and B(n) can be stored on the data nodes in queues D(CPU), D(IO) and D(CIM). Medium-heat data blocks are preferentially stored on IO class data nodes (because the performance support that higher-heat data mainly requires is I/O), then on CPU class data nodes, and finally on CIM class data nodes. Normal data blocks are preferentially stored on CIM class data nodes, and CPU class and IO class nodes are considered only when the storage of the CIM class data nodes is saturated. The cold data blocks in queue B(c) can only be stored on the data nodes in queue D(CIB). Fig. 2 illustrates the method in more detail;
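The mapping in D2-2 from heat class to node class can be written as a preference table with fallback. This is a sketch: the `PREFERENCE` table, the function name and the free-space test for saturation are assumptions, and the text does not state whether normal blocks fall back to CPU or IO nodes first.

```python
# Preferred node classes per heat class, in order, following D2-2.
# ASSUMPTION: the CPU-before-IO fallback order for normal blocks is not given.
PREFERENCE = {
    "hot": ["MAX"],
    "medium": ["IO", "CPU", "CIM"],
    "normal": ["CIM", "CPU", "IO"],
    "cold": ["CIB"],
}


def candidate_nodes(heat, node_queues, free_space, block_size):
    """Return the nodes of the first preferred class that can still hold the block.

    ASSUMPTION: a node counts as saturated when its free space is below block_size.
    """
    for node_class in PREFERENCE[heat]:
        fits = [node for node in node_queues.get(node_class, [])
                if free_space[node] >= block_size]
        if fits:
            return fits
    return []


node_queues = {"MAX": ["D1"], "IO": ["D2"], "CPU": ["D3"], "CIM": ["D4"], "CIB": ["D5"]}
free_space = {"D1": 256, "D2": 0, "D3": 256, "D4": 256, "D5": 256}  # MB, made-up values
medium_targets = candidate_nodes("medium", node_queues, free_space, block_size=128)
```

Because the only IO class node is full in this example, the medium-heat block falls back to the CPU class node D3.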

D3. The placement strategy considers data correlation;

D3-1. For data blocks of every classification, whether correlated data blocks exist is also considered before replication. If two or more correlated data blocks were stored on one node, that node might have to serve accesses to several data blocks whenever a user accesses one of them, so correlated data blocks should be placed on separate nodes;

D3-2. The correlation sets are obtained in step B. First, the BID of the data block is used to query whether a corresponding correlation set exists. If not, this step is skipped. If so, the data blocks correlated with it are recorded and the current node is marked, and the marked node is not considered when those correlated data blocks are subsequently distributed;
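The check in D3-1 and D3-2 amounts to skipping any candidate node that already holds a block from the same correlation set. A minimal sketch, assuming the placement state is kept as a mapping from node to the set of BIDs stored there (the names `pick_node` and `placement` are hypothetical):

```python
def pick_node(bid, candidates, correlation_sets, placement):
    """Pick the first candidate node that holds no block correlated with `bid`.

    correlation_sets: iterable of collections of BIDs (from step B).
    placement: dict mapping node -> set of BIDs already stored on it.
    Returns None when every candidate already holds a correlated block.
    """
    related = set()
    for members in correlation_sets:
        if bid in members:
            related |= set(members) - {bid}
    for node in candidates:
        if not (placement.get(node, set()) & related):
            return node
    return None


placement = {"D1": {"B1"}, "D2": set()}
correlation_sets = [{"B1", "B2"}]
# B2 is correlated with B1, which already sits on D1, so D2 is chosen.
chosen = pick_node("B2", ["D1", "D2"], correlation_sets, placement)
placement[chosen].add("B2")
```

When `pick_node` returns None, a caller would presumably move on to the next preferred node class; the text does not spell out this case.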

D4. Considering data locality and network transmission, most copies of a data block are stored on the same rack; however, to preserve cluster availability, two of the copies of the data block must be stored on different racks.
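The rack rule in D4 can be read as: keep replicas together on one rack where possible, but make sure they span at least two racks. The sketch below implements that reading; the replica count of 3, the choice of primary rack and the function name are assumptions, not details given in the text.

```python
def choose_replica_nodes(nodes_by_rack, replicas=3):
    """Pick replica nodes that cluster on one rack but span at least two racks.

    nodes_by_rack: dict mapping rack name -> list of eligible nodes (hypothetical input).
    ASSUMPTION: the rack with the most eligible nodes serves as the primary rack.
    """
    if replicas < 2 or len(nodes_by_rack) < 2:
        raise ValueError("need at least 2 replicas and 2 racks")
    racks = sorted(nodes_by_rack, key=lambda rack: len(nodes_by_rack[rack]), reverse=True)
    primary, secondary = racks[0], racks[1]
    if not nodes_by_rack[secondary]:
        raise ValueError("second rack has no eligible node")
    # All but one replica stay on the primary rack for locality.
    chosen = nodes_by_rack[primary][: replicas - 1]
    # The last replica goes to a different rack for availability.
    chosen.append(nodes_by_rack[secondary][0])
    if len(chosen) < replicas:
        raise ValueError("not enough eligible nodes on the primary rack")
    return chosen


targets = choose_replica_nodes({"rack1": ["D1", "D2", "D3"], "rack2": ["D4"]})
```

In a full implementation the eligible nodes per rack would already be filtered by heat class and by the correlation check of D3.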

In order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.

A. Judging the cold and hot degree of the data block; B. data block correlation analysis; C. classifying data nodes; D. heat-based data block placement policies. One specific embodiment is as follows:

Specifically, the invention comprises the following steps:

A1. Calculating the access frequency of each data block;

A1-1. The Flume log collection tool is used to acquire the number of read operations M on each data block in HDFS within the specified period T. Because the access frequency may differ considerably from one period to the next, a balance factor τ is set. The access frequency of the previous period is denoted B_f(pre), and the access frequency B_f of the current period is calculated as follows:

A1-2. The access frequency B_f(i) of the data block in the i-th period can be derived from formula (1) in step A1-1, where B_f(0) represents the access frequency when the data block is created. Because a newly created data block has no access history, the value of B_f(i) is 0, and the formula is as follows:

B1. Data blocks having correlation;

B1-1. Correlation here mainly means that data blocks in the cluster are related to one another to some degree. Take data block B1 and data block B2 as an example. In the first case, a user who accesses data block B1 accesses data block B2 at the same time. In the second case, when the access frequency of data block B1 increases or decreases from period to period, the access frequency of data block B2 changes linearly in the same direction. In either case, data blocks B1 and B2 are regarded as correlated;

B2. Method of detecting correlation;

where n is the number of cycles, i is the current cycle, X and Y represent the access frequencies of data blocks B1 and B2, respectively, and $\bar{X}$ and $\bar{Y}$ represent their average access frequencies over the n cycles;

B3. Creating and marking sets of correlated data blocks;

C. The data node classification method comprises the following steps:

D2-2. The data nodes in queue D(MAX) store only copies of the data blocks in queue B(h). The data blocks in queues B(m) and B(n) can be stored on the data nodes in queues D(CPU), D(IO) and D(CIM). Medium-heat data blocks are preferentially stored on IO class data nodes (because the performance support that higher-heat data mainly requires is I/O), then on CPU class data nodes, and finally on CIM class data nodes. Normal data blocks are preferentially stored on CIM class data nodes, and CPU class and IO class nodes are considered only when the storage of the CIM class data nodes is saturated. The cold data blocks in queue B(c) can only be stored on the data nodes in queue D(CIB);

D3. The placement strategy considers data correlation;

D3-1. For data blocks of every classification, whether correlated data blocks exist is also considered before copying. If two or more correlated data blocks were stored on one node, that node might have to serve accesses to several data blocks whenever a user accesses one of them, so correlated data blocks should be replicated to separate nodes;

The following is a system example corresponding to the above method example, and this embodiment mode may be implemented in cooperation with the above embodiment mode. The related technical details mentioned in the above embodiments are still valid in this embodiment, and in order to reduce repetition, they are not repeated here. Accordingly, the related technical details mentioned in the present embodiment can also be applied to the above-described embodiments.

Claims

1. The data block placement method based on the heterogeneous Hadoop cluster environment is characterized by comprising the following steps of:

step 5, completing the placement of the current data block, and executing step 3 again until all the data nodes are placed;

wherein the step 1 comprises:

step 12, calculating an average access frequency B_F (avg) according to the access frequency of each period of the data block so as to measure the heat of the data block, and dividing the data block into a hot data block and a cold data block according to the sequence from high heat to low heat;

the step 2 comprises the following steps:

2. A data block placement system based on a heterogeneous Hadoop cluster environment, comprising:

a module 5, completing the placement of the current data block, and executing the module 3 again until all the data nodes are placed;

wherein the module 1 comprises:

the module 12 calculates an average access frequency B_F (avg) according to the access frequency of each period of the data block so as to measure the heat of the data block, and divides the data block into a hot data block and a cold data block according to the sequence from high heat to low heat;

the module 2 comprises:
