CN110569310A - Management method of relational big data in cloud computing environment - Google Patents

Management method of relational big data in cloud computing environment

Info

Publication number
CN110569310A
CN110569310A (application CN201910883879.1A)
Authority
CN
China
Prior art keywords
data
sub
chunkdb
node
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910883879.1A
Other languages
Chinese (zh)
Inventor
李晓涛
金炯华
朱海平
倪明堂
黄培
张卫平
吴淑敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Provincial Institute Of Intelligent Robotics
Original Assignee
Guangdong Provincial Institute Of Intelligent Robotics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Provincial Institute Of Intelligent Robotics filed Critical Guangdong Provincial Institute Of Intelligent Robotics
Priority to CN201910883879.1A priority Critical patent/CN110569310A/en
Publication of CN110569310A publication Critical patent/CN110569310A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 Indexing structures
    • G06F 16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F 16/278 Data partitioning, e.g. horizontal or vertical partitioning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284 Relational databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A management method of relational big data in a cloud computing environment is based on the MapReduce architecture and comprises the following steps: constructing a bottom-layer module, namely a distributed database ChunkDB, to form a ChunkDB database cluster for receiving, managing and storing various types of data; constructing an upper-layer module, namely a distributed data access interface unit and a Hadoop cluster, in communication connection with the distributed database ChunkDB; and constructing a calculation module, namely a ChunkDB-based parallel calculation unit and an HDFS-based parallel calculation unit, connected with the distributed data access interface unit for calculating and processing the related data. The invention realizes the partitioning, distributed storage and distributed index optimization of relational big data, and can greatly improve the efficiency of storing, managing and querying/analyzing relational big data.

Description

Management method of relational big data in cloud computing environment
Technical Field
The invention relates to the field of data processing, in particular to a management method of relational big data in a cloud computing environment.
Background
With the widespread use of computer technology, the continuous development of data acquisition technologies such as sensors and RFID, and the continuous growth of inexpensive storage capacity, it has become possible to collect and store exponentially growing data resources. The management and analysis of relational mass data has attracted extensive attention from academia and industry and has become an important research direction and hotspot.
Current database management systems provide convenient and efficient data storage and management for online transactions. However, the nature of mass data management and analysis applications dictates that TB-level and even PB-level data must be managed with different implementation mechanisms. Conventional relational database management systems cannot meet this need. Conventional data warehousing and online analytical processing also fall short, because as data volumes grow ever faster, centralized data management is increasingly unable to meet users' needs. The management and analysis of mass data face many challenges, mainly in the following aspects:
(1) How to process relational data with an excellent parallel computing architecture. For example, MapReduce is a parallel framework for distributed computing tasks. It simplifies data processing on very large clusters built from commodity machines and isolates application programs from the underlying distributed processing mechanisms: users only need to consider how the Map and Reduce processes should be implemented to meet business requirements, while data partitioning, task scheduling, node communication and the like are completed automatically by the framework. The MapReduce framework offers high scalability and high parallelism, but its techniques for processing relational data are not yet mature.
(2) How to efficiently organize and store relational data. The ever-growing mass of relational data must be managed and stored so that the system can manage it effectively and process and analyze it conveniently and quickly. For example, mass data may be partitioned according to certain conditions and then distributed as uniformly as possible across the nodes to balance computing and storage resources. However, this means that many nodes participate in a single computation, which inevitably causes a large amount of data transmission between nodes. How data is stored therefore has to be balanced among parallelism, balance and efficiency, and the organization and management of data is a challenge for mass data management and analysis technology.
(3) How to implement fast interactive data analysis. How to distribute large-scale data across the nodes, quickly compute aggregated data, quickly respond to online analytical processing operations, and quickly discover the rules and models hidden in massive data are problems that mass data analysis applications must solve. The nature of such applications requires that data be stored in a distributed manner and that computation be pushed to where the data resides. This presents new challenges for global aggregate queries: first, large-scale data migration among nodes creates network communication pressure; second, a large volume of intermediate results puts pressure on storage and management; third, execution skew and synchronization overhead arise between nodes. Reasonable partitioning and data placement are therefore needed to reduce data migration between nodes and relieve network communication pressure, and relational mass data analysis techniques need to be studied.
In summary, research on methods for managing and analyzing relational mass data has both important theoretical significance and broad practical value. Query analysis of mass data can discover potential rules through statistical analysis of large-scale data and make use of models that are continuously corrected by newly generated data. The invention stores and manages massive relational data based on the MapReduce parallel computing architecture in a cloud computing environment, and designs and implements an effective management method for relational mass data.
Disclosure of Invention
In order to solve the technical problem, the invention provides a method for managing relational big data in a cloud computing environment.
In order to solve the technical problems, the invention adopts the following technical scheme:
A management method of relational big data in a cloud computing environment is based on a MapReduce framework and comprises the following steps:
Constructing a bottom layer module which is a distributed database ChunkDB to form a ChunkDB database cluster for receiving, managing and storing various types of data;
Constructing an upper module which is a distributed data access interface unit and a Hadoop cluster and is in communication connection with a distributed database ChunkDB;
Constructing a calculation module, namely a ChunkDB-based parallel calculation unit and an HDFS-based parallel calculation unit, connected with the distributed data access interface unit to calculate and process the related data.
After receiving data, the distributed database ChunkDB performs a blocking operation on the data and divides a data table into a plurality of sub-table blocks according to a set division rule. The sub-table blocks are stored on different nodes of the distributed database ChunkDB; sub-table blocks on different nodes are in a parallel relationship and exist independently in table form on their nodes.
After the distributed database ChunkDB performs the blocking operation on the data, the data are stored in a distributed manner: the data tables are stored on each node using equal-sized sub-table blocks as the storage unit, and each sub-table block has at least one copy.
The sub-table blocks are stored using a hash distribution mode and a round-robin (polling) distribution mode. When hash distribution is used, each sub-table block is numbered, the block number id of each sub-table block serves as the distribution key, and the distribution key is passed to a hash function to obtain the data node on which each sub-table block is stored;
When round-robin distribution is used, all data nodes are ordered in a set sequence, and the sub-table blocks are then stored on the data nodes one by one in that order; once every data node has been visited, storage starts again from the beginning and continues in turn until all sub-table blocks have been stored;
After a sub-table block is stored, the first copy of the sub-table block is stored on another data node in the same rack as the node where the sub-table block is located, and the second copy of the sub-table block is stored on a data node in a different rack.
The distributed database ChunkDB comprises a Master node, which stores all the Metadata information of the distributed database ChunkDB; the Master node is the manager and maintainer of the distributed database and is used for storing the block metadata of the data and managing node information.
The distributed data access interface unit expands a data access interface, and expands a DBInputFormat data interface of a MapReduce architecture, so that MapReduce can be combined with a distributed database storage unit ChunkDB to realize parallel acquisition of input data from a relational database.
The MapReduce architecture is extended so that it can support and is compatible with the distributed database ChunkDB, realizing ChunkDB-based MapReduce computation.
The invention has the following beneficial effects:
1) The method applies distributed storage, management and application to relational big data and, compared with the traditional approach of using a single high-performance host node, improves the parallel processing efficiency of relational big data. 2) The method has good fault tolerance. Drawing on the advantages of the MapReduce architecture, ChunkDB periodically checks the health of each node; nodes that do not respond are marked as inactive in the metadata, no new data blocks are written to them, and all blocks on them are no longer available. Combined with the redundancy strategy for data blocks, unresponsive nodes do not affect the system as a whole. 3) The method has good scalability: the number of data nodes in the ChunkDB system can be adjusted dynamically according to the data scale of a specific application, and data nodes can be added to or removed from the cluster, so the scalability of the cluster is well guaranteed.
Drawings
FIG. 1 is a schematic diagram of the overall module structure of the present invention;
FIG. 2 is a schematic diagram of the overall structure of the ChunkDB system of the present invention;
FIG. 3 is a schematic diagram of the structure of the ChunkDB-based MapReduce calculation according to the present invention.
Detailed Description
For further understanding of the features and technical means of the present invention, as well as the specific objects and functions attained by the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description.
In one embodiment of the invention, a distributed cluster is formed by 10 server nodes, and each node uses the open-source PostgreSQL relational database management system to store sub-table blocks.
As shown in fig. 1, a management method of relational big data in a cloud computing environment, based on a MapReduce architecture, includes the following steps:
Constructing a bottom-layer module, namely the distributed database ChunkDB, to form a ChunkDB database cluster for receiving, managing and storing various types of data.
Constructing an upper-layer module, namely the distributed data access interface unit and the Hadoop cluster, in communication connection with the distributed database ChunkDB.
Constructing a calculation module, namely the ChunkDB-based parallel calculation unit and the HDFS-based parallel calculation unit, connected with the distributed data access interface unit to calculate and process the related data.
The system as a whole comprises a ChunkDB database cluster and a Hadoop cluster, which share the same hardware cluster node environment. Any ordinary node in the cluster environment is at the same time a DataNode of the HDFS distributed file system, a sub-database node of ChunkDB, and a TaskTracker node for MapReduce computation. The MapReduce parallel computing module stores and manages massive relational big data either as ChunkDB databases or as HDFS files.
A distributed database ChunkDB is constructed on the cluster. The nodes of the ChunkDB system play three important roles: Master node, DataNode and Client. A relational database is configured on each child node of the ChunkDB database cluster, and one of them serves as the Master node of the distributed database ChunkDB. The Master node can be regarded as the administrator and maintainer of the distributed database system and is mainly responsible for storing the block metadata of the data and managing node information. On the one hand, the Master node monitors the state of each node and maintains a node information table; on the other hand, it stores the position, copy information and so on of the table blocks corresponding to each data table in the system. The DataNode is the basic unit of data storage in the distributed database system; it stores data in its local database system and communicates with the Master node. A Client is an application that needs to obtain distributed data. The Client can read the metadata of the data from the Master node and then read the data directly from the DataNode, so the data itself does not pass through the Master node. The Client can also write data directly to the DataNode and then return the metadata of that data to the Master node.
Table 1 below shows the cluster node configuration.
Table 1 lists 10 nodes in total: node cn01 serves as the Master node of ChunkDB, the NameNode of HDFS and the JobTracker node of MapReduce, while the remaining 9 nodes serve as PostgreSQL database sub-nodes of ChunkDB, DataNodes of HDFS and TaskTracker nodes of MapReduce.
Fig. 2 shows the overall structure of the distributed database ChunkDB and the three important roles of the ChunkDB system, namely the Master node, the DataNode and the Client. The Master node mainly manages several types of Metadata information of the ChunkDB system.
Table 2 is shown below.
Table 2 is the structure db_nodes of the configuration information metadata table for ChunkDB database cluster nodes. It includes the host name of each node in the ChunkDB database cluster, the user name and password of the sub-database, the JDBC connection driver information of the sub-database, whether the node is active, and the like. The Master node maintains the cluster node configuration information table in real time and updates the flag indicating whether each node is active. Through this table, ChunkDB determines whether a node is alive and whether the data table sub-blocks on each node are valid. The metadata table also provides the Client program with the parameters required to connect to each sub-node database.
Table 3 is shown below.
Table 3 is the structure tables_info of the metadata information table for ChunkDB data tables, which includes the name of the table, whether it needs to be divided, the number of tuples per sub-block, the number of copies required for the sub-blocks, and the like. When a data table is loaded, ChunkDB writes its configuration information into this metadata table according to the specific settings. Based on this metadata, it is then determined whether the data table is stored as a full table on all nodes or must be divided into sub-blocks for storage. If the data table needs to be stored in blocks, it is partitioned according to the block size and number of copies and stored on different nodes. ChunkDB maintains the copies of the data sub-table blocks in real time according to the sub-block information in the metadata table, so that the number of valid copies is guaranteed and the data security of the data table is ensured.
Table 4 is shown below.
Table 4 is the structure partitions_info of the data table blocking information metadata table of ChunkDB. It includes the table name, the sub-table block number, the copy number, the host name where the sub-block resides, the JDBC connection URL, and the like. When ChunkDB loads a data table, the table is partitioned according to the configuration information in tables_info, the data of each block is written into the database of the corresponding sub-node in the cluster, and the sub-block information of the data table is written into the partitions_info metadata table. When the Client program reads the data table, it obtains the sub-block information for the table by accessing the partitions_info metadata table and then reads the data directly from the corresponding sub-nodes.
These types of information are the main metadata stored by the Master node. Combining the data stored in these metadata tables, ChunkDB detects and maintains the ChunkDB database cluster nodes and data copies in real time, providing the Client program with reliable data information.
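As an illustration only, the contents of the three metadata tables described above can be modelled as plain Java records; the class and field names below (DbNode, tuplesPerBlock, and so on) are assumptions of this sketch and are not taken from the patent.

// Hypothetical in-memory view of the metadata kept by the Master node.
// Field names are assumptions; the description above only states what kind
// of information each metadata table holds.
public final class ChunkDbMetadata {

    // One entry of db_nodes: configuration of a cluster node.
    public record DbNode(String hostName, String dbUser, String dbPassword,
                         String jdbcDriver, String jdbcUrl, boolean active) {}

    // One entry of tables_info: storage configuration of a data table.
    public record TableInfo(String tableName, boolean partitioned,
                            long tuplesPerBlock, int replicaCount) {}

    // One entry of partitions_info: location of a sub-table block and its copies.
    public record PartitionInfo(String tableName, int blockId, int copyNo,
                                String hostName, String jdbcUrl) {}
}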
The specific process of storing and dividing data in the distributed database ChunkDB is as follows:
Step 1: Build the basic cluster environment. The distributed database ChunkDB is implemented on top of the MapReduce architecture, so a MapReduce cluster environment must be built first; the basic MapReduce environment is set up using the open-source Hadoop environment. In addition, each node in the cluster must install a standalone relational database management system to act as a child node of the distributed database ChunkDB.
Step 2: Partition the data. Tables of relational big data typically contain a huge amount of data. Only by dividing a data table into blocks and storing the blocks on different nodes of the system can the parallel operability of the data table be improved. The data is partitioned according to a set rule on the number of records per block. For example, for a data table with 100 million row tuples, if every 1 million rows of data form one sub-table block, the whole data table yields 100 sub-table blocks. The system then stores the 100 sub-table blocks on multiple nodes of the cluster, after which different sub-table blocks of the data table can be processed in parallel, improving processing efficiency. The sub-table blocks of a data table exist independently as relational tables on their nodes, so each sub-table block can be accessed and retrieved independently, and query optimization such as building an index can be performed on each block separately. A minimal partitioning sketch is given below.
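This is a minimal illustration of the row-count based split, assuming blocks are described by half-open row ranges; the class and method names are invented for this sketch.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch: divide a table of totalRows tuples into sub-table blocks of
// at most rowsPerBlock tuples each, numbering the blocks from 0.
public final class TablePartitioner {

    public record Block(int id, long startRow, long endRow) {}

    public static List<Block> partition(long totalRows, long rowsPerBlock) {
        List<Block> blocks = new ArrayList<>();
        int id = 0;
        for (long start = 0; start < totalRows; start += rowsPerBlock) {
            long end = Math.min(start + rowsPerBlock, totalRows);
            blocks.add(new Block(id++, start, end)); // rows [start, end)
        }
        return blocks;
    }

    public static void main(String[] args) {
        // The example from the description: 100 million rows with 1 million
        // rows per block yields 100 sub-table blocks.
        System.out.println(partition(100_000_000L, 1_000_000L).size()); // prints 100
    }
}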
Step 3: Storage structure of the data. The data table is stored in the database system on each data node in units of equal-sized sub-table blocks, and each sub-table block corresponds to at least one copy to guarantee fault tolerance. The number of records contained in a sub-table block and the number of copies of a sub-table block are configurable parameters. Distributing the copies evenly makes it easy to keep the load balanced when some nodes fail.
When the Client writes data to the data nodes, the data table is divided according to the configured sub-table block size and number of copies; the Client then writes each sub-table block directly to the corresponding node without passing through the Master node, and records information such as the position of each sub-table block in the Metadata of the Master node. When the Client reads data, it first accesses the Metadata of the Master node to obtain the information about each sub-table block of the table and the nodes holding the copies, and then retrieves the data directly from the corresponding DataNode without relaying through the Master node, which avoids the single-node bottleneck. A minimal sketch of this read path follows.
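The sketch below assumes the metadata table and column names used above (partitions_info with table_name, block_id and jdbc_url columns) and a per-node naming convention for sub-table blocks; all of these names are assumptions of the sketch, not details fixed by the patent.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Sketch of the Client read path: ask the Master for the sub-block locations of
// a table, then read each sub-table block directly from its data node, so the
// data itself never passes through the Master node.
public final class ChunkDbClientRead {

    public record BlockLocation(int blockId, String jdbcUrl) {}

    static List<BlockLocation> lookupBlocks(String masterJdbcUrl, String user, String password,
                                            String tableName) throws Exception {
        List<BlockLocation> result = new ArrayList<>();
        // String concatenation keeps the sketch short; real code would use PreparedStatement.
        String sql = "SELECT block_id, jdbc_url FROM partitions_info WHERE table_name = '" + tableName + "'";
        try (Connection master = DriverManager.getConnection(masterJdbcUrl, user, password);
             Statement st = master.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                result.add(new BlockLocation(rs.getInt("block_id"), rs.getString("jdbc_url")));
            }
        }
        return result;
    }

    static void readBlock(BlockLocation block, String user, String password,
                          String tableName, String condition) throws Exception {
        // Each sub-table block exists as an ordinary relational table on its node
        // (assumed here to be named <table>_block_<id>), so it can be queried and
        // indexed independently.
        String sql = "SELECT * FROM " + tableName + "_block_" + block.blockId() + " WHERE " + condition;
        try (Connection node = DriverManager.getConnection(block.jdbcUrl(), user, password);
             Statement st = node.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                // process one tuple of the sub-table block
            }
        }
    }
}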
The Master node constantly monitors the health of each node in the cluster and records the real-time state of the cluster in the Metadata. Sub-table blocks of data tables located on failed nodes are redistributed and rewritten, so that the copy safety and fault tolerance of the data tables are guaranteed.
Step 4: Distribution of the sub-table blocks. After a data table is partitioned according to the configured information, one data table can correspond to a number of data sub-table blocks. Two strategies are used to distribute the data sub-table blocks: hash distribution and round-robin (polling) distribution. When hash distribution is used, the block number id of a block serves as the distribution key; the distribution key is passed to a hash function to obtain the data node on which each data sub-table block should be stored. When round-robin distribution is used, the system orders all data nodes in a fixed sequence and then stores the data sub-table blocks on the data nodes one by one in that order. Once every data node has been visited, distribution starts again from the beginning and continues in turn until all data sub-table blocks have been stored. A minimal sketch of both strategies is given below.
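This sketch assumes the data nodes are held in a fixed-order list; the modulo-based hash shown here is only one possible hash function, not necessarily the one used by ChunkDB.

import java.util.List;

// Sketch of the two sub-table block distribution strategies described above.
public final class BlockPlacement {

    // Hash distribution: the block number id is the distribution key and is
    // passed to a hash function to select the data node.
    public static String hashPlacement(int blockId, List<String> dataNodes) {
        int index = Math.floorMod(Integer.hashCode(blockId), dataNodes.size());
        return dataNodes.get(index);
    }

    // Round-robin (polling) distribution: nodes are visited in a fixed order and
    // the assignment wraps around to the first node once every node has been used.
    public static String roundRobinPlacement(int blockId, List<String> dataNodes) {
        return dataNodes.get(blockId % dataNodes.size());
    }
}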
An effective method for storing the remaining copies of a data sub-table block is as follows: after the sub-table blocks have been placed according to the hash or round-robin distribution strategy, the first copy of each sub-table block is stored on another data node in the same rack as the node holding the sub-table block, and the second copy is stored on a data node in a different rack. This reduces write traffic between racks and improves write performance.
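The rack-aware copy placement can be sketched as follows; the mapping from node to rack and the choice of the first eligible node are simplifications assumed for this sketch.

import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch of rack-aware copy placement: the first copy goes to another node in the
// same rack as the node holding the sub-table block, the second copy to a node in
// a different rack.
public final class ReplicaPlacement {

    public static Optional<String> firstCopyNode(String primaryNode,
                                                 Map<String, String> nodeToRack,
                                                 List<String> dataNodes) {
        String rack = nodeToRack.get(primaryNode);
        return dataNodes.stream()
                .filter(n -> !n.equals(primaryNode) && rack.equals(nodeToRack.get(n)))
                .findFirst();
    }

    public static Optional<String> secondCopyNode(String primaryNode,
                                                  Map<String, String> nodeToRack,
                                                  List<String> dataNodes) {
        String rack = nodeToRack.get(primaryNode);
        return dataNodes.stream()
                .filter(n -> !rack.equals(nodeToRack.get(n)))
                .findFirst();
    }
}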
Step 5: Maintain the metadata information of the Master node. The Master node stores all Metadata information of the ChunkDB distributed database and can be regarded as the administrator and maintainer of the distributed database system. The Metadata mainly comprises three parts: the configuration information of each cluster node, the metadata information of the data tables of the distributed system, and the metadata information of the data table sub-blocks.
For the upper-layer distributed data access interface unit, the goals are that the distributed database ChunkDB can support large-scale MapReduce parallel computation, that MapReduce computation can acquire data from the relational database with high parallelism, and that optimization strategies of traditional relational databases, such as indexes, can be used. The implementation of this part involves the following key steps:
Step 1: Extend the data access interface. The MapReduce architecture provides the DBInputFormat data interface for operating relational databases, so that data can be read from a relational database and then further processed by Map and Reduce tasks. By extending the DBInputFormat data interface of the MapReduce architecture, MapReduce can be combined with the ChunkDB distributed database, input data can truly be acquired from the relational database fully in parallel, the parallel advantage of MapReduce operating on a relational database is exploited, and the query optimization strategies of traditional databases can be applied at the same time.
Step 2: Extend the MapReduce computation flow for relational big data. After the relational big data are stored in the ChunkDB system, the execution flow of ChunkDB-based MapReduce computation is similar to the MapReduce computation flow based on distributed files, so ChunkDB is well compatible with the MapReduce parallel framework.
Fig. 3 is a schematic diagram of the ChunkDB-based MapReduce computing structure and shows the composition of the upper-layer distributed data access interface of the invention. After a user submits a MapReduce job on the JobTracker node, the system reads the data table block metadata from the Master node of ChunkDB, i.e. the block information of the data tables to be operated on by the MapReduce computation. According to the cluster situation, the JobTracker starts Map tasks on some nodes and passes each piece of block information to one Map task (a localization strategy is applied so that, as far as possible, block information whose data is local is assigned to the local Map task). Each Map task then connects to the sub-node database according to the connection driver, URL, user name, password and other information recorded in the block information, and reads data from the sub-node database according to the retrieval conditions. The data are then processed in the Map task with the same processing flow as the file-based MapReduce flow, passed to Reduce tasks through the Shuffle process, further processed by the Reduce function, and finally the MapReduce result is returned to the user.
The execution flow of ChunkDB-based MapReduce computation is similar to that of file-based MapReduce computation, so ChunkDB is well compatible with the MapReduce parallel framework. However, the way the JobTracker obtains data block information and the way a Map task obtains the actual data of a block differ from the current MapReduce computation, so the MapReduce parallel architecture is extended to support the ChunkDB database, allowing ChunkDB and MapReduce to be combined effectively.
By extending the DBInputFormat data operation interface of the MapReduce framework, the way MapReduce operates on distributed files is imitated: instead of reading file data block information from the NameNode master node of the distributed file system, the block information of a data table is read from the Master node of the ChunkDB distributed database. The JobTracker node of the cluster then generates the input data splits (Split) for the MapReduce computation from the block information read from the Master node and sets the relevant information for each Split, including the database connection information of the data block (the database driver, the JDBC connection URL, the database user name and password) and the host information of the data block (since moving computation is cheaper than moving data, this enables the optimization of moving the computation to the data). A minimal sketch of such a split is given below.
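Only the standard Hadoop InputSplit and Writable contracts are used in this sketch; the class name, field names and the assumption that one split corresponds to one sub-table block are illustrative, not the patent's actual implementation.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

// Sketch of an input split describing one sub-table block of a ChunkDB table. It
// carries the connection information of the sub-node database that stores the block
// plus the host name, so the scheduler can move the computation to the data.
public class ChunkDbSplit extends InputSplit implements Writable {

    private String jdbcDriver;  // driver class of the sub-node database
    private String jdbcUrl;     // JDBC connection URL of the sub-node database
    private String user;
    private String password;
    private String blockTable;  // name of the sub-table block on that node
    private String host;        // host storing the block (locality hint)
    private long length;        // approximate number of tuples in the block

    public ChunkDbSplit() {}    // required for Writable deserialization

    public ChunkDbSplit(String jdbcDriver, String jdbcUrl, String user, String password,
                        String blockTable, String host, long length) {
        this.jdbcDriver = jdbcDriver;
        this.jdbcUrl = jdbcUrl;
        this.user = user;
        this.password = password;
        this.blockTable = blockTable;
        this.host = host;
        this.length = length;
    }

    @Override
    public long getLength() { return length; }

    @Override
    public String[] getLocations() { return new String[] { host }; }

    @Override
    public void write(DataOutput out) throws IOException {
        Text.writeString(out, jdbcDriver);
        Text.writeString(out, jdbcUrl);
        Text.writeString(out, user);
        Text.writeString(out, password);
        Text.writeString(out, blockTable);
        Text.writeString(out, host);
        out.writeLong(length);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        jdbcDriver = Text.readString(in);
        jdbcUrl = Text.readString(in);
        user = Text.readString(in);
        password = Text.readString(in);
        blockTable = Text.readString(in);
        host = Text.readString(in);
        length = in.readLong();
    }
}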
After the default DBInputFormat data interface has been extended, the MapReduce framework operates the ChunkDB distributed database in the same way it operates FileInputFormat and the other data interfaces used for distributed files; only the parameters required by the query need to be set, such as the SQL statement to execute. The method can therefore be combined conveniently with the MapReduce framework and is very simple to use.
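For comparison, the sketch below configures a job with the standard Hadoop DBInputFormat, which is the pattern the extended ChunkDB interface is described as following: only connection parameters and the query SQL are supplied. The JDBC URL, credentials, table and record class here are placeholders, and the sketch does not represent the patent's own interface.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public final class ChunkDbJobSetup {

    // Minimal value class read from the database (placeholder fields).
    public static class OrderRecord implements Writable, DBWritable {
        long id;
        double amount;

        public void readFields(ResultSet rs) throws SQLException { id = rs.getLong(1); amount = rs.getDouble(2); }
        public void write(PreparedStatement ps) throws SQLException { ps.setLong(1, id); ps.setDouble(2, amount); }
        public void readFields(DataInput in) throws IOException { id = in.readLong(); amount = in.readDouble(); }
        public void write(DataOutput out) throws IOException { out.writeLong(id); out.writeDouble(amount); }
    }

    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        DBConfiguration.configureDB(conf,
                "org.postgresql.Driver",                 // sub-node database driver
                "jdbc:postgresql://cn02:5432/chunkdb",   // placeholder JDBC URL
                "dbuser", "dbpassword");                 // placeholder credentials

        Job job = Job.getInstance(conf, "chunkdb-mapreduce");
        job.setInputFormatClass(DBInputFormat.class);
        // The query and the matching count query are the only query parameters needed.
        DBInputFormat.setInput(job, OrderRecord.class,
                "SELECT id, amount FROM orders",
                "SELECT COUNT(*) FROM orders");
        return job;
    }
}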
Finally, the invention has been extensively tested and put into practical use. Practice shows that the relational big data management method in the cloud computing environment has efficient query performance, and the performance of parallel computation on relational big data can be further improved by combining the MapReduce distributed computing architecture with traditional relational database optimizations such as indexes.
In addition, the method provided by the invention has good fault tolerance and expandability.
ChunkDB periodically checks the health of each node in the cluster. Nodes that do not respond are marked in the metadata as inactive; no new data blocks are written to such a node, and all blocks on it are no longer available. If the number of copies of some data blocks falls below the critical value, the Master node finds other available copies of the corresponding blocks and regenerates new copies on active nodes. If a data node is unavailable because of, for example, a network problem, but the data on it is not damaged, the node is checked and repaired, its state is marked as active again, and the copies on it can continue to be used. If a data node is unavailable for other reasons, such as damage to its database system, the node is marked as damaged after inspection; the block information of that node is then deleted from the metadata block information table, i.e. the data blocks on that node are permanently unavailable.
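The copy maintenance described above can be sketched as follows; the structures, threshold check and node selection are simplifications invented for this sketch rather than the exact ChunkDB algorithm.

import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Sketch of the fault-tolerance loop: blocks whose number of live copies has fallen
// below the required count get a new copy on an active node that does not already
// hold the block.
public final class ReplicaMaintenance {

    public record BlockCopies(String tableName, int blockId, Set<String> hostingNodes) {}

    public static Optional<String> pickNodeForNewCopy(BlockCopies block, List<String> activeNodes) {
        return activeNodes.stream()
                .filter(node -> !block.hostingNodes().contains(node))
                .findFirst();
    }

    public static void maintain(List<BlockCopies> blocks, Map<String, Boolean> nodeActive,
                                int requiredCopies) {
        List<String> activeNodes = nodeActive.entrySet().stream()
                .filter(Map.Entry::getValue)
                .map(Map.Entry::getKey)
                .toList();
        for (BlockCopies block : blocks) {
            long liveCopies = block.hostingNodes().stream()
                    .filter(node -> nodeActive.getOrDefault(node, false))
                    .count();
            if (liveCopies < requiredCopies) {
                pickNodeForNewCopy(block, activeNodes).ifPresent(target -> {
                    // copy the sub-table block from a live replica to 'target' and
                    // record the new copy in the partitions_info metadata table
                });
            }
        }
    }
}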
According to the data scale of a specific application, the number of data nodes in the ChunkDB system may need to be adjusted, and data nodes may be added to or removed from the cluster; in other words, the scalability of the cluster must be well guaranteed. When a data node needs to be removed, it is first marked as damaged in the metadata information. All data blocks on that node are then considered lost by the system, and all block information for that node is deleted from the data block metadata of the Master node; at this point the data node can be removed. At the same time, the ChunkDB fault-tolerance mechanism generates new copies for data blocks whose number of copies has fallen below the critical value, ensuring the safety of each data block. When a data node needs to be added, its configuration information, including the host information of the node, the user name and password of the sub-database on the node, and the JDBC connection driver information, is added to the metadata of the Master node, and the node state is marked as active; from then on, ChunkDB writes new data blocks to the newly added data node.
Although the present invention has been described in detail with reference to the embodiments, it will be apparent to those skilled in the art that modifications, equivalents, improvements, and the like can be made in the technical solutions of the foregoing embodiments or in some of the technical features of the foregoing embodiments, but those modifications, equivalents, improvements, and the like are all within the spirit and principle of the present invention.

Claims (7)

1. A management method of relational big data in a cloud computing environment, which is based on a MapReduce architecture and comprises the following steps:
Constructing a bottom layer module which is a distributed database ChunkDB to form a ChunkDB database cluster for receiving, managing and storing various types of data;
Constructing an upper module which is a distributed data access interface unit and a Hadoop cluster and is in communication connection with a distributed database ChunkDB;
And constructing a calculation module, namely a ChunkDB-based parallel calculation unit and an HDFS-based parallel calculation unit, connected with the distributed data access interface unit for calculating and processing the related data.
2. The method for managing relational big data in the cloud computing environment according to claim 1, wherein the ChunkDB performs a block operation on the data after receiving the data, and divides the data table into a plurality of sub-table blocks according to a set division rule, the sub-table blocks are respectively stored in different nodes of the ChunkDB, the sub-table blocks on the different nodes are in a parallel relationship, and the sub-table blocks are independently stored in the form of tables on the different nodes.
3. The method for managing relational big data in the cloud computing environment according to claim 2, wherein the chunk db stores the data after performing a blocking operation on the data, the data tables are stored on each node by using sub-table blocks with the same size as a storage unit, and each sub-table block corresponds to at least one copy.
4. The method for managing the relational big data in the cloud computing environment according to claim 3, wherein the table sub-blocks are stored in a hash distribution mode and a polling distribution mode, when the hash distribution mode is used, each sub-table block is numbered, a block number id of each sub-table block is used as a distribution key, and the distribution key is transmitted to a hash function, so that a data node where each sub-table block is stored is obtained;
when the polling distribution is used, all the data nodes are sequenced according to a set sequence, then all the sub-table blocks are stored on the data nodes one by one according to the sequence, if the data nodes are polled once, the data nodes are stored from the starting point again and continue to be stored in turn until all the sub-table blocks are completely stored;
and after the sub-table block is stored, storing the first copy of the sub-table block on other data nodes of the same rack as the node where the sub-table block is located, and storing the second copy of the sub-table block on other data nodes of a different rack from the node where the sub-table block is located.
5. The method for managing relational big data in the cloud computing environment according to claim 4, wherein the distributed database ChunkDB comprises a Master node, the Master node stores all Metadata information Metadata of the distributed database ChunkDB, and the Master node is a manager and a maintainer of the distributed database and is used for storing partitioned Metadata information of data and managing information of the nodes.
6. The method for managing relational big data in the cloud computing environment according to claim 5, wherein the distributed data access interface unit expands a data access interface, and expands a DBInputFormat data interface of a MapReduce architecture, so that MapReduce can be combined with a distributed database storage unit ChunkDB to realize parallel acquisition of input data from a relational database.
7. The method for managing relational big data in the cloud computing environment according to claim 6, wherein the MapReduce architecture is extended to enable the MapReduce architecture to support a distributed database ChunkDB and have compatibility, so that MapReduce computing based on the ChunkDB is realized.
CN201910883879.1A 2019-09-18 2019-09-18 Management method of relational big data in cloud computing environment Pending CN110569310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910883879.1A CN110569310A (en) 2019-09-18 2019-09-18 Management method of relational big data in cloud computing environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910883879.1A CN110569310A (en) 2019-09-18 2019-09-18 Management method of relational big data in cloud computing environment

Publications (1)

Publication Number Publication Date
CN110569310A true CN110569310A (en) 2019-12-13

Family

ID=68781020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910883879.1A Pending CN110569310A (en) 2019-09-18 2019-09-18 Management method of relational big data in cloud computing environment

Country Status (1)

Country Link
CN (1) CN110569310A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116992235A (en) * 2023-08-09 2023-11-03 哈尔滨天君科技有限公司 Big data analysis system and method for computer parallelization synchronization

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160092510A1 (en) * 2014-09-26 2016-03-31 Applied Materials, Inc. Optimized storage solution for real-time queries and data modeling
CN107329982A (en) * 2017-06-01 2017-11-07 华南理工大学 A kind of big data parallel calculating method stored based on distributed column and system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160092510A1 (en) * 2014-09-26 2016-03-31 Applied Materials, Inc. Optimized storage solution for real-time queries and data modeling
CN107329982A (en) * 2017-06-01 2017-11-07 华南理工大学 A kind of big data parallel calculating method stored based on distributed column and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
师金钢 (Shi Jingang): "Research on Key Technologies of Real-Time Data Warehouses Based on the MapReduce Architecture", Doctoral Dissertation, Northeastern University *
师金钢 (Shi Jingang) et al.: "Parallel Queries on Relational Data Warehouses Based on MapReduce", Journal of Northeastern University (Natural Science) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116992235A (en) * 2023-08-09 2023-11-03 哈尔滨天君科技有限公司 Big data analysis system and method for computer parallelization synchronization

Similar Documents

Publication Publication Date Title
US11816126B2 (en) Large scale unstructured database systems
US10509785B2 (en) Policy-driven data manipulation in time-series database systems
US10769148B1 (en) Relocating data sharing operations for query processing
US11580070B2 (en) Utilizing metadata to prune a data set
Padhy Big data processing with Hadoop-MapReduce in cloud systems
US20130110873A1 (en) Method and system for data storage and management
US11314717B1 (en) Scalable architecture for propagating updates to replicated data
WO2015041714A1 (en) Interest-driven business intelligence systems including event-oriented data
Borkar et al. Have your data and query it too: From key-value caching to big data management
US11841845B2 (en) Data consistency mechanism for hybrid data processing
US10776368B1 (en) Deriving cardinality values from approximate quantile summaries
Zhao et al. Toward efficient and flexible metadata indexing of big data systems
US11886508B2 (en) Adaptive tiering for database data of a replica group
CN110569310A (en) Management method of relational big data in cloud computing environment
Nidzwetzki et al. BBoxDB: a distributed and highly available key-bounding-box-value store
Naseri Seyedi Noudoust et al. A quorum-based data consistency approach for non-relational database
Alapati et al. Apache cassandra: An introduction
US11995084B1 (en) Database system for querying time-series data stored in a tiered storage using a cloud platform
US12007977B1 (en) Selectively applying a replication log for logical database replica creation
US11550793B1 (en) Systems and methods for spilling data for hash joins
Nazdrowicz A relational database environment for numerical simulation backend storage
KR20170085786A (en) System and method for storing data in big data platform
Cheng et al. BF-matrix: A secondary index for the cloud storage
Selvaraj Advanced Database Techniques and Optimization
Nidzwetzki BBoxDB–A Distributed Key-Bounding-Box-Value Store

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191213)