CN113407620B - Data block placement method and system based on heterogeneous Hadoop cluster environment - Google Patents


Info

Publication number
CN113407620B
Authority
CN
China
Prior art keywords
data
data blocks
data block
blocks
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010185518.2A
Other languages
Chinese (zh)
Other versions
CN113407620A (en
Inventor
宋�莹
许家豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Best Innovation Beijing Technology Co ltd
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University filed Critical Beijing Information Science and Technology University
Priority to CN202010185518.2A priority Critical patent/CN113407620B/en
Publication of CN113407620A publication Critical patent/CN113407620A/en
Application granted granted Critical
Publication of CN113407620B publication Critical patent/CN113407620B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data block placement method and system for a heterogeneous Hadoop cluster environment. The heat (hot or cold degree) of each data block is measured by calculating its access frequency in each period, and the data blocks are then placed on different data nodes according to their heat. Correlation is taken into account during placement: correlated data blocks are scattered rather than stored on the same data node, which avoids several data blocks on one data node being accessed at the same time and reduces the load on that data node. The placement strategy improves the execution performance and resource utilization of the cluster.

Description

Data block placement method and system based on heterogeneous Hadoop cluster environment
Technical Field
The invention relates to replica placement that improves cluster performance according to the heat (hot or cold degree) of data blocks in a Hadoop cluster, and belongs to the field of distributed computing.
Background
With the continuous development of Internet technology, we have entered the era of big data, and technologies related to big data are being applied ever more widely. Hadoop is currently the most popular open-source big data framework. It is a big data platform that can process massive data offline and in parallel; it is highly reliable, highly scalable, efficient, low-cost, and open source, and it has become the preferred massive-data processing solution of many Internet companies. Hadoop mainly comprises the Hadoop Distributed File System (HDFS) and the MapReduce distributed computing framework. Although Hadoop is by now mature, some aspects still need improvement and optimization.
HDFS stores many files, both large files, each consisting of a plurality of data blocks, and small files (large files are numerous), each of which occupies only part of a data block. The heat of a data block is measured by how frequently users access it: the higher the access frequency, the higher the heat. There are therefore hot data (data with high access frequency) and cold data (data with low access frequency). Hot data, which users access often, raises two problems: 1) because its access frequency is high, hot data may be accessed by many users at the same time, which increases the load on the node; 2) because users access it frequently, its response time must meet user expectations. Conventional Hadoop faces both problems.
The traditional Hadoop system is designed for a homogeneous computing environment made up of machines with the same configuration. In a homogeneous cluster every node has the same storage performance and disk capacity. When data is written to HDFS, it is divided into blocks of equal size, and Hadoop then distributes the blocks evenly across the nodes at random. However, clusters that run Hadoop today are often heterogeneous, and the data stored in Hadoop differs in heat. Hot data is accessed often and by many users, so the nodes that store it need high storage performance; cold data is rarely or never accessed and only needs to be stored. With respect to data heat, therefore, the homogeneous-cluster design of traditional Hadoop is neither efficient nor practical.
The default replica policy of Hadoop has shortcomings with respect to user demand, storage performance, system resources, and so on. In a heterogeneous cluster it leads to problems such as low system resource utilization, unbalanced node load, poor fault tolerance, and heavy network transmission and communication load, which may even cause failures.
Disclosure of Invention
To address the defects of the prior art, the invention provides a data block placement method based on a heterogeneous Hadoop cluster environment, comprising the following steps:
step 1, dividing the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-heat data blocks, normal data blocks, and cold data blocks according to how frequently they are accessed, and classifying the data nodes in the heterogeneous cluster environment by performance according to the performance of each data node and a preset performance standard;
step 2, performing correlation analysis on the data blocks, and marking the correlated data blocks within each data block class;
step 3, executing a data block placement strategy that places each data block on a data node of the appropriate class according to the classes of the data blocks and of the data nodes;
step 4, when executing the data block placement strategy, determining whether the data node selected for the current data block already holds other data blocks correlated with it, and if so, re-executing step 3 within the same data node class to select another data node;
step 5, completing the placement of the current data block, and executing step 3 again until all data blocks have been placed.
In the data block placement method based on the heterogeneous Hadoop cluster environment, step 1 comprises:
step 11, obtaining, through a log collection tool, the number of read operations M on each data block in the heterogeneous cluster environment within a specified period T, and obtaining the access frequency B_f of the current period from M, T, the access frequency B_f(pre) of the previous period, and a balance factor τ:
[Formula image not reproduced: B_f as a function of M, T, τ, and B_f(pre)]
step 12, calculating an average access frequency B_F(avg) from the access frequency of the data block in each period to measure the heat of the data block, and dividing the data blocks into classes from hot to cold in descending order of heat.
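The formula for B_f is an image in the original publication and is not available in this text. The Python sketch below therefore assumes one common smoothing form, an exponentially weighted average B_f = τ·(M/T) + (1 − τ)·B_f(pre). That weighting, the function names, and the sample values are assumptions for illustration only and are not taken from the patent.

```python
def access_frequency(reads, period, prev_freq, tau):
    """Smoothed access frequency of one data block for one period.

    ASSUMPTION: exponentially weighted form; the patent's exact formula
    is an image that is not reproduced in this text.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    raw = reads / period  # reads per unit time in the current period
    return tau * raw + (1.0 - tau) * prev_freq


def frequency_history(reads_per_period, period, tau, initial=0.0):
    """Return the smoothed frequencies for periods 1..n.

    `initial` plays the role of B_f(0); a newly created block has no
    access history, so it defaults to 0.
    """
    history = []
    prev = initial
    for reads in reads_per_period:
        prev = access_frequency(reads, period, prev, tau)
        history.append(prev)
    return history


def average_frequency(history):
    """B_F(avg): mean of the per-period access frequencies."""
    return sum(history) / len(history) if history else 0.0
```

With τ close to 1 the current period dominates; with τ close to 0 the history dominates, which damps large swings between periods.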
In the data block placement method based on the heterogeneous Hadoop cluster environment, step 2 comprises:
step 21, performing correlation analysis using the covariance cov between data blocks, based on the access frequencies of the data blocks in different periods obtained in step 1:
[Formula image not reproduced: covariance cov of the per-period access frequencies of two data blocks]
where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2, respectively, in the current period, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;
step 22, determining whether the covariance cov is positive; if so, the access frequencies of the two data blocks vary in the same direction and data blocks B1 and B2 have access correlation; otherwise, data blocks B1 and B2 have no access correlation.
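The covariance test in steps 21 and 22 can be sketched in Python as follows. The sketch assumes the sample normalisation (division by n − 1); the patent's formula image is not available here, and the choice of n or n − 1 does not affect the sign test in step 22. The function names are illustrative.

```python
def covariance(xs, ys):
    """Sample covariance of two equally long access-frequency series.

    ASSUMPTION: normalised by (n - 1); only the sign matters for step 22.
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("series must have the same number of periods")
    if n < 2:
        raise ValueError("at least two periods are required")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1)


def has_access_correlation(xs, ys):
    """Step 22: the blocks are correlated only if the covariance is positive."""
    return covariance(xs, ys) > 0
```

For example, the series [1, 2, 3] and [2, 4, 6] rise together and give a positive covariance, while [1, 2, 3] and [3, 2, 1] move in opposite directions and are not marked as correlated.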
The invention also provides a data block placement system based on a heterogeneous Hadoop cluster environment, comprising:
module 1, which divides the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-heat data blocks, normal data blocks, and cold data blocks according to how frequently they are accessed, and classifies the data nodes in the heterogeneous cluster environment by performance according to the performance of each data node and a preset performance standard;
module 2, which performs correlation analysis on the data blocks and marks the correlated data blocks within each data block class;
module 3, which executes a data block placement strategy that places each data block on a data node of the appropriate class according to the classes of the data blocks and of the data nodes;
module 4, which, when the data block placement strategy is executed, determines whether the data node selected for the current data block already holds other data blocks correlated with it, and if so, invokes module 3 again within the same data node class to select another data node;
module 5, which completes the placement of the current data block and invokes module 3 again until all data blocks have been placed.
In the data block placement system based on the heterogeneous Hadoop cluster environment, module 1 comprises:
module 11, which obtains, through a log collection tool, the number of read operations M on each data block in the heterogeneous cluster environment within a specified period T, and obtains the access frequency B_f of the current period from M, T, the access frequency B_f(pre) of the previous period, and a balance factor τ:
[Formula image not reproduced: B_f as a function of M, T, τ, and B_f(pre)]
module 12, which calculates an average access frequency B_F(avg) from the access frequency of the data block in each period to measure the heat of the data block, and divides the data blocks into classes from hot to cold in descending order of heat.
In the data block placement system based on the heterogeneous Hadoop cluster environment, module 2 comprises:
module 21, which performs correlation analysis using the covariance cov between data blocks, based on the access frequencies of the data blocks in different periods obtained by module 1:
[Formula image not reproduced: covariance cov of the per-period access frequencies of two data blocks]
where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2, respectively, in the current period, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;
module 22, which determines whether the covariance cov is positive; if so, the access frequencies of the two data blocks vary in the same direction and data blocks B1 and B2 have access correlation; otherwise, data blocks B1 and B2 have no access correlation.
The advantages of the invention are as follows:
The invention provides a replica placement strategy for data blocks in a heterogeneous Hadoop cluster environment. First, the heat of each data block is measured by calculating its access frequency in each period; the data blocks are then placed on different data nodes according to their heat. Correlation is taken into account during placement: correlated data blocks are scattered and are not stored on the same data node, which avoids simultaneous access to several data blocks on one data node and reduces the load on the data node. The placement strategy improves the execution performance and resource utilization of the cluster. Fig. 1 shows the overall flow of the invention, and Fig. 2 shows the flow of the data block placement strategy in detail.
Drawings
FIG. 1 is a flowchart of the data block placement strategy in a heterogeneous Hadoop cluster environment;
FIG. 2 is a detailed flow chart of a data block placement strategy.
Detailed Description
The invention aims to provide, for the hot and cold data found in existing Hadoop clusters, a data block placement strategy based on a heterogeneous Hadoop cluster environment that improves the execution performance and resource utilization of the cluster.
Specifically, the invention comprises the following steps:
A. Determining the heat of the data blocks. The implementation is as follows:
A1. Calculating the access frequency of each data block;
A1-1. The number of read operations M on each data block in HDFS within a specified period T is obtained with the Flume log collection tool. Because the access frequency may differ greatly from one period to the next, a balance factor τ is set. With the access frequency of the previous period denoted B_f(pre), the access frequency B_f of the current period is calculated as follows:
[Formula (1) image not reproduced: B_f as a function of M, T, τ, and B_f(pre)]
A1-2. The access frequency B_f(i) of the data block in the i-th period can be derived from formula (1) in step A1-1, where B_f(0) represents the access frequency when the data block is created. Because a newly created data block has no access history, the value of B_f(0) is given as follows:
[Formula image not reproduced]
A2. Calculating the average access frequency B_F(avg) from the per-period access frequencies of the data block obtained in the preceding step;
A3. Using B_F(avg) from step A2 to measure the heat of the data blocks, and dividing the data blocks, in descending order of heat, into hot data blocks, medium-heat data blocks, normal data blocks, and cold data blocks.
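The text does not state the boundaries that separate the four heat classes. The Python sketch below therefore takes three cut-off values as parameters; the parameter names, the `Heat` enum, and the sample numbers are hypothetical and serve only to illustrate step A3.

```python
from enum import Enum


class Heat(Enum):
    HOT = "hot"
    MEDIUM = "medium-heat"
    NORMAL = "normal"
    COLD = "cold"


def classify_heat(avg_freq, hot_min, medium_min, normal_min):
    """Map an average access frequency B_F(avg) to one of four heat classes.

    ASSUMPTION: the cut-offs are caller-supplied; the patent gives no values.
    """
    if not hot_min > medium_min > normal_min:
        raise ValueError("cut-offs must be strictly decreasing")
    if avg_freq >= hot_min:
        return Heat.HOT
    if avg_freq >= medium_min:
        return Heat.MEDIUM
    if avg_freq >= normal_min:
        return Heat.NORMAL
    return Heat.COLD


def group_by_heat(avg_by_block, hot_min, medium_min, normal_min):
    """Group block IDs by heat class, hottest first within each class."""
    groups = {heat: [] for heat in Heat}
    ordered = sorted(avg_by_block.items(), key=lambda item: -item[1])
    for block_id, avg in ordered:
        heat = classify_heat(avg, hot_min, medium_min, normal_min)
        groups[heat].append(block_id)
    return groups
```

In practice the cut-offs could be fixed values or derived from the distribution of B_F(avg) across the cluster; either choice is outside what the text specifies.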
B. Data block correlation analysis. The implementation is as follows:
B1. Data blocks having access correlation;
B1-1. Correlation here mainly means that data blocks in the cluster are related to some degree. For example, given data block B1 and data block B2, in the first case a user who accesses data block B1 also accesses data block B2 at the same time; in the second case, when the access frequency of data block B1 rises or falls over the periods, the access frequency of data block B2 changes linearly in the same direction. In either case, the invention regards data blocks B1 and B2 as correlated;
B2. Method of detecting correlation;
B2-1. Based on the access frequencies of each data block in different periods obtained in step A, correlation analysis is performed using covariance. For example, the correlation of data blocks B1 and B2 can be calculated with the following formula:
[Formula image not reproduced: covariance cov of the per-period access frequencies of two data blocks]
where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2, respectively, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;
B2-2. If the calculated covariance cov is positive, the access frequencies of the two data blocks vary in the same direction; if cov is 0, the two data blocks are uncorrelated and are treated as independent; if cov is negative, the two data blocks are negatively correlated. Negative correlation is not the focus of the invention, which is mainly concerned with positive correlation, that is, a positive cov value;
B2-3. This detection method checks whether two data blocks are correlated, but more than two data blocks may be correlated with one another. Therefore, if data block B1 is correlated with data block B2 and, during detection, data block B2 (or B1) is also found to be correlated with data block B3, then data blocks B1, B2, and B3 are all regarded as correlated;
B3. Creating and marking sets of correlated data blocks;
In step A the data blocks were classified by heat (hot, medium-heat, normal, and cold data blocks). Each class is traversed and the correlation of its data blocks is checked in turn. If two or more data blocks are correlated, a correlation set C = {B1, B2, …, Bn} is created, where n is the number of data blocks in the set, and each set is marked with the BID of its first data block;
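Steps B2-3 and B3 amount to grouping blocks whose pairwise covariance is positive, with correlation carried transitively, and labelling each group by its first block ID. The Python sketch below uses a union-find structure for the grouping; that data structure, the sample normalisation of the covariance, and all names are assumptions chosen for illustration, since the text does not say how the sets are built.

```python
from itertools import combinations


def _covariance(xs, ys):
    """Sample covariance; only its sign is used below."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation_sets(freq_by_block):
    """Build correlation sets for the blocks of one heat class.

    freq_by_block maps a block ID (BID) to its per-period access
    frequencies; all series must cover the same periods (at least two).
    Returns a dict keyed by the first BID of each set.
    """
    bids = list(freq_by_block)
    parent = {bid: bid for bid in bids}

    def find(bid):
        # Follow parent links to the representative, halving the path.
        while parent[bid] != bid:
            parent[bid] = parent[parent[bid]]
            bid = parent[bid]
        return bid

    for a, b in combinations(bids, 2):
        if _covariance(freq_by_block[a], freq_by_block[b]) > 0:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two groups

    groups = {}
    for bid in bids:  # input order is preserved inside each group
        groups.setdefault(find(bid), []).append(bid)

    # Only groups of two or more blocks form a correlation set.
    return {members[0]: members
            for members in groups.values() if len(members) >= 2}
```

In the usage below, B1 and B3 have negative covariance with each other, but each is positively correlated with B2, so all three end up in one set marked "B1":

```python
sets = correlation_sets({"B1": [2, 0, 1], "B2": [1, 0, 2], "B3": [0, 1, 2]})
```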
C. Data node classification. The method is as follows:
C1. Hardware differences lie mainly in CPU, disk I/O, network, and memory. Memory differences are mainly a matter of capacity, with little difference in performance, and network transmission is not the focus of the invention, so these two items are not considered; the classification standard focuses on CPU and disk I/O;
C2. The classes are roughly as follows: 1) machines with strong CPU and strong IO performance are called the MAX class; 2) machines with strong CPU performance are called the CPU class; 3) machines with strong IO performance are called the IO class; 4) machines in which both are average are called the CIM class; 5) machines in which both are weak are called the CIB class;
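The text gives no numeric criteria for "strong", "average", or "weak". The Python sketch below assumes each node has normalised CPU and IO scores in [0, 1] and two hypothetical thresholds; the scores, the thresholds, and the handling of mixed cases (for example average CPU with weak IO, which falls into CIM here) are assumptions for illustration.

```python
def classify_node(cpu_score, io_score, strong=0.7, weak=0.3):
    """Assign a data node to MAX, CPU, IO, CIM, or CIB.

    ASSUMPTION: scores are normalised to [0, 1]; `strong` and `weak`
    are hypothetical cut-offs, not values from the patent.
    """
    if not 0.0 <= weak < strong <= 1.0:
        raise ValueError("expected 0 <= weak < strong <= 1")
    cpu_strong = cpu_score >= strong
    io_strong = io_score >= strong
    if cpu_strong and io_strong:
        return "MAX"   # both CPU and IO are strong
    if cpu_strong:
        return "CPU"   # only CPU is strong
    if io_strong:
        return "IO"    # only IO is strong
    if cpu_score < weak and io_score < weak:
        return "CIB"   # both are weak
    return "CIM"       # remaining nodes are treated as average


def node_queues(scores_by_node, strong=0.7, weak=0.3):
    """Build one node list per class from {node: (cpu_score, io_score)}."""
    queues = {name: [] for name in ("MAX", "CPU", "IO", "CIM", "CIB")}
    for node, (cpu, io) in scores_by_node.items():
        queues[classify_node(cpu, io, strong, weak)].append(node)
    return queues
```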
D. Heat-based data block placement strategy. The implementation is as follows:
D1. In step A the data blocks were divided into hot, medium-heat, normal, and cold data blocks; four queues are generated from these four classes;
D1-1. Hot data block queue B(h) = {B1, B2, …, Bm}, where m is the number of data blocks; medium-heat data block queue B(m) = {B1, B2, …, Bj}, where j is the number of data blocks; normal data block queue B(n) = {B1, B2, …, Bk}, where k is the number of data blocks; cold data block queue B(c) = {B1, B2, …, Bn}, where n is the number of data blocks;
D2. Several data node queues are obtained from the classification in step C2;
D2-1. 1) MAX class data node queue D(MAX) = {D1, D2, …, Dm}, where m is the number of data nodes; 2) CPU class data node queue D(CPU) = {D1, D2, …, Dn}, where n is the number of data nodes; 3) IO class data node queue D(IO) = {D1, D2, …, Dj}, where j is the number of data nodes; 4) CIM class data node queue D(CIM) = {D1, D2, …, Dk}, where k is the number of data nodes; 5) CIB class data node queue D(CIB) = {D1, D2, …, Dl}, where l is the number of data nodes;
D2-2. Data nodes in queue D(MAX) store only replicas of the data blocks in queue B(h). Data blocks in queues B(m) and B(n) can be stored on data nodes in queues D(CPU), D(IO), and D(CIM). Medium-heat data blocks are stored preferentially on IO class data nodes (because the performance that higher-heat data mainly needs is IO), then on CPU class data nodes, and finally on CIM class data nodes. Normal data blocks are stored preferentially on CIM class data nodes; CPU class and IO class nodes are considered only when the storage of the CIM class data nodes is saturated. Data blocks in the cold queue B(c) can be stored only on data nodes in queue D(CIB). Fig. 2 illustrates the method in more detail;
D3. The placement strategy takes data correlation into account;
D3-1. For data blocks of every class, whether correlated data blocks exist is also checked before replication, so that two or more correlated data blocks are not stored on the same node. Otherwise, when a user accesses one of them, the node may have to serve accesses to several data blocks at once; correlated data blocks should therefore be placed on separate nodes;
D3-2. The correlation sets were obtained in step B. The BID of the data block is first used to query whether a corresponding correlation set exists. If not, this step is skipped. If so, the data blocks correlated with this data block are recorded and the current node is marked, and the current node is not considered when those data blocks are distributed;
D4. Considering data locality and network transmission, most replicas of a data block are stored on the same rack; but, following the principle of cluster availability, two of the replicas of the data block must be stored on different racks.
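The selection logic of steps D2-2 and D3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: one replica per block, capacity counted in blocks, CPU tried before IO when CIM nodes are saturated (the text does not order these two), and no rack handling for step D4. All names and data shapes are hypothetical.

```python
# Node classes to try for each heat class, in order of preference.
# ASSUMPTION: for normal blocks the text does not order CPU and IO.
PREFERENCE = {
    "hot": ["MAX"],
    "medium": ["IO", "CPU", "CIM"],
    "normal": ["CIM", "CPU", "IO"],
    "cold": ["CIB"],
}


def choose_node(bid, heat, nodes_by_class, stored, capacity, correlated):
    """Return the first suitable node for block `bid`, or None.

    A node is skipped if it is saturated or already holds a block
    that is correlated with `bid`.
    """
    related = correlated.get(bid, set())
    for node_class in PREFERENCE[heat]:
        for node in nodes_by_class.get(node_class, []):
            blocks = stored.setdefault(node, set())
            if len(blocks) >= capacity[node]:
                continue  # storage of this node is saturated
            if blocks & related:
                continue  # node already holds a correlated block
            return node
    return None


def place_all(blocks, nodes_by_class, capacity, correlated):
    """Place (bid, heat) pairs in order; unplaceable blocks map to None."""
    stored = {}
    placement = {}
    for bid, heat in blocks:
        node = choose_node(bid, heat, nodes_by_class, stored,
                           capacity, correlated)
        if node is not None:
            stored[node].add(bid)
        placement[bid] = node
    return placement
```

A small usage example: B1 and B2 are correlated hot blocks, so they land on different MAX nodes, and the second cold block finds the only CIB node full.

```python
placement = place_all(
    [("B1", "hot"), ("B2", "hot"), ("B3", "cold"), ("B4", "cold")],
    {"MAX": ["n1", "n2"], "CIB": ["n3"]},
    {"n1": 2, "n2": 2, "n3": 1},
    {"B1": {"B2"}, "B2": {"B1"}},
)
```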
In order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.
The method comprises: A. determining the heat of the data blocks; B. data block correlation analysis; C. data node classification; D. a heat-based data block placement strategy. One specific embodiment is as follows:
Specifically, the invention comprises the following steps:
A. Determining the heat of the data blocks. The implementation is as follows:
A1. Calculating the access frequency of each data block;
A1-1. The number of read operations M on each data block in HDFS within a specified period T is obtained with the Flume log collection tool. Because the access frequency may differ greatly from one period to the next, a balance factor τ is set. With the access frequency of the previous period denoted B_f(pre), the access frequency B_f of the current period is calculated as follows:
[Formula (1) image not reproduced: B_f as a function of M, T, τ, and B_f(pre)]
A1-2. The access frequency B_f(i) of the data block in the i-th period can be derived from formula (1) in step A1-1, where B_f(0) represents the access frequency when the data block is created. Because a newly created data block has no access history, the value of B_f(0) is 0, and the formula is as follows:
[Formula image not reproduced]
A2. Calculating the average access frequency B_F(avg) from the per-period access frequencies of the data block obtained in the preceding step;
A3. Using B_F(avg) from step A2 to measure the heat of the data blocks, and dividing the data blocks, in descending order of heat, into hot data blocks, medium-heat data blocks, normal data blocks, and cold data blocks.
B. Data block correlation analysis. The implementation is as follows:
B1. Data blocks having correlation;
B1-1. Correlation here mainly means that data blocks in the cluster are related to some degree. For example, given data block B1 and data block B2, in the first case a user who accesses data block B1 also accesses data block B2 at the same time; in the second case, when the access frequency of data block B1 rises or falls over the periods, the access frequency of data block B2 changes linearly in the same direction. In either case, data blocks B1 and B2 are regarded as correlated;
B2. Method of detecting correlation;
B2-1. Based on the access frequencies of each data block in different periods obtained in step A, correlation analysis is performed using covariance. For example, the correlation of data blocks B1 and B2 can be calculated with the following formula:
[Formula image not reproduced: covariance cov of the per-period access frequencies of two data blocks]
where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2, respectively, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;
B2-2. If the calculated covariance cov is positive, the access frequencies of the two data blocks vary in the same direction; if cov is 0, the two data blocks are uncorrelated and are treated as independent; if cov is negative, the two data blocks are negatively correlated. Negative correlation is not the focus of the invention, which is mainly concerned with positive correlation, that is, a positive cov value;
B2-3. This detection method checks whether two data blocks are correlated, but more than two data blocks may be correlated with one another. Therefore, if data block B1 is correlated with data block B2 and, during detection, data block B2 (or B1) is also found to be correlated with data block B3, then data blocks B1, B2, and B3 are all regarded as correlated;
B3. Creating and marking sets of correlated data blocks;
In step A the data blocks were classified by heat (hot, medium-heat, normal, and cold data blocks). Each class is traversed and the correlation of its data blocks is checked in turn. If two or more data blocks are correlated, a correlation set C = {B1, B2, …, Bn} is created, where n is the number of data blocks in the set, and each set is marked with the BID of its first data block;
C. Data node classification. The method is as follows:
C1. Hardware differences lie mainly in CPU, disk I/O, network, and memory. Memory differences are mainly a matter of capacity, with little difference in performance, and network transmission is not the focus of the invention, so these two items are not considered; the classification standard focuses on CPU and disk I/O;
C2. The classes are roughly as follows: 1) machines with strong CPU and strong IO performance are called the MAX class; 2) machines with strong CPU performance are called the CPU class; 3) machines with strong IO performance are called the IO class; 4) machines in which both are average are called the CIM class; 5) machines in which both are weak are called the CIB class;
D. Heat-based data block placement strategy. The implementation is as follows:
D1. In step A the data blocks were divided into hot, medium-heat, normal, and cold data blocks; four queues are generated from these four classes;
D1-1. Hot data block queue B(h) = {B1, B2, …, Bm}, where m is the number of data blocks; medium-heat data block queue B(m) = {B1, B2, …, Bj}, where j is the number of data blocks; normal data block queue B(n) = {B1, B2, …, Bk}, where k is the number of data blocks; cold data block queue B(c) = {B1, B2, …, Bn}, where n is the number of data blocks;
D2. Several data node queues are obtained from the classification in step C2;
D2-1. 1) MAX class data node queue D(MAX) = {D1, D2, …, Dm}, where m is the number of data nodes; 2) CPU class data node queue D(CPU) = {D1, D2, …, Dn}, where n is the number of data nodes; 3) IO class data node queue D(IO) = {D1, D2, …, Dj}, where j is the number of data nodes; 4) CIM class data node queue D(CIM) = {D1, D2, …, Dk}, where k is the number of data nodes; 5) CIB class data node queue D(CIB) = {D1, D2, …, Dl}, where l is the number of data nodes;
D2-2. Data nodes in queue D(MAX) store only replicas of the data blocks in queue B(h). Data blocks in queues B(m) and B(n) can be stored on data nodes in queues D(CPU), D(IO), and D(CIM). Medium-heat data blocks are stored preferentially on IO class data nodes (because the performance that higher-heat data mainly needs is IO), then on CPU class data nodes, and finally on CIM class data nodes. Normal data blocks are stored preferentially on CIM class data nodes; CPU class and IO class nodes are considered only when the storage of the CIM class data nodes is saturated. Data blocks in the cold queue B(c) can be stored only on data nodes in queue D(CIB);
D3. The placement strategy takes data correlation into account;
D3-1. For data blocks of every class, whether correlated data blocks exist is also checked before replication, so that two or more correlated data blocks are not stored on the same node. Otherwise, when a user accesses one of them, the node may have to serve accesses to several data blocks at once; correlated data blocks should therefore be replicated to separate nodes;
D3-2. The correlation sets were obtained in step B. The BID of the data block is first used to query whether a corresponding correlation set exists. If not, this step is skipped. If so, the data blocks correlated with this data block are recorded and the current node is marked, and the current node is not considered when those data blocks are distributed;
D4. Considering data locality and network transmission, most replicas of a data block are stored on the same rack; but, following the principle of cluster availability, two of the replicas of the data block must be stored on different racks.
The following is a system embodiment corresponding to the above method embodiment, and the two may be implemented in cooperation with each other. The technical details mentioned in the above embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here. Likewise, the technical details mentioned in this embodiment can also be applied to the above embodiment.
The invention also provides a data block placement system based on a heterogeneous Hadoop cluster environment, which comprises:
module 1, which divides the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-heat data blocks, normal data blocks and cold data blocks according to how frequently the data blocks are accessed, and classifies the data nodes in the heterogeneous cluster environment by performance according to the performance of the data nodes and a preset performance standard;
module 2, which performs correlation analysis on the data blocks and marks the correlated data blocks within each classification of data blocks;
module 3, which executes a data block placement strategy and places each data block on data nodes of the corresponding classification according to the classifications of the data blocks and of the data nodes;
module 4, which, when the data block placement strategy is executed, judges whether the data node selected for the current data block already holds other data blocks related to that data block and, if so, executes module 3 again within the same data node classification to select another data node for placement;
module 5, which completes the placement of the current data block and executes module 3 again until all the data nodes finish the placement.
In the data block placement system based on the heterogeneous Hadoop cluster environment, module 1 comprises:
module 11, which obtains, through a log collection tool, the number of read operations M of each data block in the heterogeneous cluster environment within a specified period T, and obtains the access frequency B_f of the current period from the access frequency B_f(pre) of the previous period and a balance factor τ:
Figure BDA0002414041590000121
and module 12, which calculates an average access frequency B_F(avg) from the access frequency of the data block in each period in order to measure the heat of the data block, and divides the data blocks into hot data blocks and cold data blocks in order of heat from high to low.
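The expression for B_f is given only in the formula figure, so the sketch below illustrates module 12 alone: it takes the per-period access frequencies as given, averages them, and assigns heat labels in order from high to low. The function name `classify_by_heat`, the input layout, and the split into four equal-sized groups by rank are assumptions made for illustration; the source does not state the thresholds between classes.

```python
from statistics import mean

# Heat labels ordered from high heat to low heat.
HEAT_LEVELS = ["hot", "medium", "normal", "cold"]

def classify_by_heat(freq_history):
    """Label each block by its average access frequency.

    freq_history -- dict: BID -> non-empty list of per-period access
                    frequencies (assumed to be computed by module 11)
    Returns dict: BID -> heat label.
    """
    # Average access frequency per block, the heat measure of module 12.
    avg = {bid: mean(freqs) for bid, freqs in freq_history.items()}
    # Rank blocks from hottest to coldest.
    ranked = sorted(avg, key=avg.get, reverse=True)
    labels = {}
    for rank, bid in enumerate(ranked):
        # Assumed rule: split the ranking into four equal-sized groups.
        level = rank * len(HEAT_LEVELS) // len(ranked)
        labels[bid] = HEAT_LEVELS[level]
    return labels
```

A rank-based split keeps the example independent of any absolute frequency scale; fixed frequency thresholds would work equally well if a deployment defines them.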
In the data block placement system based on the heterogeneous Hadoop cluster environment, module 2 comprises:
module 21, which performs correlation analysis using the covariance cov between data blocks, based on the access frequencies of the data blocks in different periods obtained in module A:
Figure BDA0002414041590000122
where n is the number of cycles, i is the current cycle, X and Y represent the access frequencies of data blocks B1 and B2, respectively, in the current cycle, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;
module 22, which judges whether the covariance cov is positive: if so, the access frequencies of the two data blocks follow a consistent trend and data blocks B1 and B2 have access correlation; otherwise, data blocks B1 and B2 have no access correlation.
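The sign test of module 22 can be sketched as below. The covariance is computed as the mean product of deviations from the per-block averages; dividing by n is an assumption (the normalization appears only in the formula figure), and dividing by n - 1 instead would not change the sign. The function names and the `freq_history` layout are hypothetical.

```python
from itertools import combinations

def covariance(x, y):
    """Covariance of two equally long access-frequency series.

    Normalizes by n (assumed); the sign is the same with n - 1.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("series must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

def has_access_correlation(x, y):
    """Module 22: blocks are correlated when the covariance is positive."""
    return covariance(x, y) > 0

def build_correlation_sets(freq_history):
    """Map each BID to the set of BIDs it is correlated with.

    freq_history -- dict: BID -> list of per-period access frequencies
    """
    sets = {bid: set() for bid in freq_history}
    for a, b in combinations(freq_history, 2):
        if has_access_correlation(freq_history[a], freq_history[b]):
            sets[a].add(b)
            sets[b].add(a)
    return sets
```

The pairwise comparison is quadratic in the number of blocks, so a practical system would likely restrict it to blocks within the same classification, as the marking step describes.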

Claims (2)

1. A data block placement method based on a heterogeneous Hadoop cluster environment, characterized by comprising the following steps:
step 1, dividing the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-heat data blocks, normal data blocks and cold data blocks according to how frequently the data blocks are accessed, and classifying the data nodes in the heterogeneous cluster environment by performance according to the performance of the data nodes and a preset performance standard;
step 2, performing correlation analysis on the data blocks, and marking the correlated data blocks within each classification of data blocks;
step 3, executing a data block placement strategy, and placing each data block on data nodes of the corresponding classification according to the classifications of the data blocks and of the data nodes;
step 4, when the data block placement strategy is executed, judging whether the data node selected for the current data block already holds other data blocks related to that data block and, if so, re-executing step 3 within the same data node classification to select another data node for placement;
step 5, completing the placement of the current data block, and executing step 3 again until all the data nodes are placed;
wherein step 1 comprises:
step 11, obtaining, through a log collection tool, the number of read operations M of each data block in the heterogeneous cluster environment within a specified period T, and obtaining the access frequency B_f of the current period from the access frequency B_f(pre) of the previous period and a balance factor τ:
Figure FDA0004112830070000011
step 12, calculating an average access frequency B_F(avg) from the access frequency of the data block in each period in order to measure the heat of the data block, and dividing the data blocks into hot data blocks and cold data blocks in order of heat from high to low;
step 2 comprises:
step 21, performing correlation analysis using the covariance cov between data blocks, based on the access frequencies of the data blocks in different periods obtained in step A:
Figure FDA0004112830070000012
where n is the number of cycles, i is the current cycle, X and Y represent the access frequencies of data blocks B1 and B2, respectively, in the current cycle, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;
step 22, judging whether the covariance cov is positive: if so, the access frequencies of the two data blocks follow a consistent trend and data blocks B1 and B2 have access correlation; if not, data blocks B1 and B2 have no access correlation.
2. A data block placement system based on a heterogeneous Hadoop cluster environment, comprising:
module 1, which divides the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-heat data blocks, normal data blocks and cold data blocks according to how frequently the data blocks are accessed, and classifies the data nodes in the heterogeneous cluster environment by performance according to the performance of the data nodes and a preset performance standard;
module 2, which performs correlation analysis on the data blocks and marks the correlated data blocks within each classification of data blocks;
module 3, which executes a data block placement strategy and places each data block on data nodes of the corresponding classification according to the classifications of the data blocks and of the data nodes;
module 4, which, when the data block placement strategy is executed, judges whether the data node selected for the current data block already holds other data blocks related to that data block and, if so, executes module 3 again within the same data node classification to select another data node for placement;
module 5, which completes the placement of the current data block and executes module 3 again until all the data nodes are placed;
wherein module 1 comprises:
module 11, which obtains, through a log collection tool, the number of read operations M of each data block in the heterogeneous cluster environment within a specified period T, and obtains the access frequency B_f of the current period from the access frequency B_f(pre) of the previous period and a balance factor τ:
Figure FDA0004112830070000023
module 12, which calculates an average access frequency B_F(avg) from the access frequency of the data block in each period in order to measure the heat of the data block, and divides the data blocks into hot data blocks and cold data blocks in order of heat from high to low;
module 2 comprises:
module 21, which performs correlation analysis using the covariance cov between data blocks, based on the access frequencies of the data blocks in different periods obtained in module A:
Figure FDA0004112830070000031
where n is the number of cycles, i is the current cycle, X and Y represent the access frequencies of data blocks B1 and B2, respectively, in the current cycle, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;
module 22, which judges whether the covariance cov is positive: if so, the access frequencies of the two data blocks follow a consistent trend and data blocks B1 and B2 have access correlation; otherwise, data blocks B1 and B2 have no access correlation.
CN202010185518.2A 2020-03-17 2020-03-17 Data block placement method and system based on heterogeneous Hadoop cluster environment Active CN113407620B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010185518.2A CN113407620B (en) 2020-03-17 2020-03-17 Data block placement method and system based on heterogeneous Hadoop cluster environment

Publications (2)

Publication Number Publication Date
CN113407620A CN113407620A (en) 2021-09-17
CN113407620B true CN113407620B (en) 2023-04-21

Family

ID=77677033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010185518.2A Active CN113407620B (en) 2020-03-17 2020-03-17 Data block placement method and system based on heterogeneous Hadoop cluster environment

Country Status (1)

Country Link
CN (1) CN113407620B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103023995A (en) * 2012-11-29 2013-04-03 中国电力科学研究院 Hadoop-based distributive type cloud storage type automatic grading data managing system
CN103593452A (en) * 2013-11-21 2014-02-19 北京科技大学 Data intensive computing cost optimization method based on MapReduce mechanism
CN103631894A (en) * 2013-11-19 2014-03-12 浪潮电子信息产业股份有限公司 Dynamic copy management method based on HDFS
CN103856567A (en) * 2014-03-26 2014-06-11 西安电子科技大学 Small file storage method based on Hadoop distributed file system
CN103942289A (en) * 2014-04-12 2014-07-23 广西师范大学 Memory caching method oriented to range querying on Hadoop
CN104133882A (en) * 2014-07-28 2014-11-05 四川大学 HDFS (Hadoop Distributed File System)-based old file processing method
CN105183839A (en) * 2015-09-02 2015-12-23 华中科技大学 Hadoop-based storage optimizing method for small file hierachical indexing
CN106156283A (en) * 2016-06-27 2016-11-23 江苏迪纳数字科技股份有限公司 Isomery Hadoop based on data temperature and joint behavior stores method
CN108519856A (en) * 2018-03-02 2018-09-11 西北大学 Based on the data block copy laying method under isomery Hadoop cluster environment
CN109446114A (en) * 2018-10-12 2019-03-08 咪咕文化科技有限公司 Spatial data caching method and device and storage medium
CN110096350A (en) * 2019-04-10 2019-08-06 山东科技大学 Cold and hot region division energy saving store method based on the prediction of clustered node load condition
CN110515920A (en) * 2019-08-30 2019-11-29 北京浪潮数据技术有限公司 A kind of mass small documents access method and system based on Hadoop
CN110647497A (en) * 2019-07-19 2020-01-03 广东工业大学 HDFS-based high-performance file storage and management system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016199955A1 (en) * 2015-06-10 2016-12-15 울산과학기술원 Code dispersion hash table-based map-reduce system and method
US10127238B1 (en) * 2015-12-08 2018-11-13 EMC IP Holding Company LLC Methods and apparatus for filtering dynamically loadable namespaces (DLNs)
US10303391B2 (en) * 2017-10-30 2019-05-28 AtomBeam Technologies Inc. System and method for data storage, transfer, synchronization, and security
US11003477B2 (en) * 2019-02-08 2021-05-11 Intel Corporation Provision of input/output classification in a storage system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
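Research on robot path planning based on artificial neural networks; Chen Qirui et al.; Computer Knowledge and Technology; 227-229 *
Optimization of data replica placement strategy in heterogeneous Hadoop clusters; Liu Yan et al.; Journal of Huazhong University of Science and Technology (Natural Science Edition); 63-68 *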

Also Published As

Publication number Publication date
CN113407620A (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN108829494B (en) Container cloud platform intelligent resource optimization method based on load prediction
CN107734052B (en) Load balancing container scheduling method facing component dependence
Zhang et al. Virtual machine placement strategy using cluster-based genetic algorithm
CN105389211B (en) Memory allocation method and delay perception-Memory Allocation device suitable for NUMA architecture
CN111381928B (en) Virtual machine migration method, cloud computing management platform and storage medium
CN107273200B (en) Task scheduling method for heterogeneous storage
Ubarhande et al. Novel data-distribution technique for Hadoop in heterogeneous cloud environments
CN108519856B (en) Data block copy placement method based on heterogeneous Hadoop cluster environment
CN113391913A (en) Distributed scheduling method and device based on prediction
CN104182278A (en) Method and device for judging busy degree of computer hardware resource
US20230229580A1 (en) Dynamic index management for computing storage resources
CN115454569A (en) Resource load data processing method and device and storage medium
Song et al. Rethinking graph data placement for graph neural network training on multiple GPUs
Wang et al. Lunule: an agile and judicious metadata load balancer for cephfs
Thomas et al. Survey on MapReduce scheduling algorithms
CN107193940A (en) Big data method for optimization analysis
CN113407620B (en) Data block placement method and system based on heterogeneous Hadoop cluster environment
Bawankule et al. A classification framework for straggler mitigation and management in a heterogeneous Hadoop cluster: A state-of-art survey
CN114020218B (en) Hybrid de-duplication scheduling method and system
CN116755872A (en) TOPSIS-based containerized streaming media service dynamic loading system and method
Myint et al. Comparative analysis of adaptive file replication algorithms for cloud data storage
Gao et al. A load-aware data migration scheme for distributed surveillance video processing with hybrid storage architecture
CN115309502A (en) Container scheduling method and device
Mao et al. A fine-grained and dynamic MapReduce task scheduling scheme for the heterogeneous cloud environment
Mao et al. FiGMR: A fine-grained mapreduce scheduler in the heterogeneous cloud

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240329

Address after: 100095, 2nd Floor, Building 1, Baijiatong Shangpingyuan, Haidian District, Beijing, 20218

Patentee after: Beijing United Power Cultural Media Co.,Ltd.

Country or region after: China

Address before: 100101 12 Xiaoying East Road, Qinghe, Haidian District, Beijing

Patentee before: BEIJING INFORMATION SCIENCE AND TECHNOLOGY University

Country or region before: China

TR01 Transfer of patent right

Effective date of registration: 20240708

Address after: Room 309, 3rd Floor, Building D, No. 2-2, Beijing Shichuang High tech Development Corporation, 2 Shangdi Information Road, Haidian District, Beijing 100088

Patentee after: Best Innovation (Beijing) Technology Co.,Ltd.

Country or region after: China

Address before: 100095, 2nd Floor, Building 1, Baijiatong Shangpingyuan, Haidian District, Beijing, 20218

Patentee before: Beijing United Power Cultural Media Co.,Ltd.

Country or region before: China