CN112395085B - HDFS-based distributed relational database scheduling method

Info

Publication number: CN112395085B
Application number: CN202011226082.3A
Authority: CN
Inventors: 李发明
Original assignee: Shenzhen China Blog Imformation Technology Co ltd
Current assignee: Shenzhen China Blog Imformation Technology Co ltd
Priority date: 2020-11-05
Filing date: 2020-11-05
Publication date: 2022-10-25
Anticipated expiration: 2040-11-05
Also published as: CN112395085A

Abstract

The invention provides a scheduling method for a distributed relational database based on HDFS (Hadoop Distributed File System), belonging to the field of database management. The method builds on the bottom-layer nodes of the relational database: a scheduler is arranged at each node. After the scheduler receives a task from a caller, it splits the task according to the system's internal scheduling policy. Subtasks suited to the node are issued directly to the node's processing engine for execution. For subtasks not suited to the node, the required data is fetched from other nodes through the interconnection among the processing engines, and the current processing engine then executes those subtasks on the fetched data. The execution results are aggregated and fed back to the caller through the scheduler. The invention combines the scheduling and interface logic with the bottom-layer nodes, so no dedicated scheduling nodes need to be added. This unifies the architecture, eases node management, keeps response times fast under multitasking, avoids contention for data and computing resources, and achieves efficient scheduling under multitasking.

Description

HDFS-based distributed relational database scheduling method

Technical Field

The invention belongs to the field of database management, and particularly relates to a scheduling method for a distributed relational database based on HDFS (Hadoop Distributed File System).

Background

With the advent of the internet era, big data storage has become the foundation of back-end operations, and distributed databases have become an option for large-scale data storage. In a large-scale distributed database, service data is stored across multiple servers in a multi-center, distributed fashion, and the data is accessed and shared among them. During data sharing, when data of the same type arrives at the same node, problems such as path congestion, packet path identification errors, and packet loss tend to occur. Distributed databases therefore need reasonable data scheduling. Inappropriate data scheduling also wastes resources, makes upper-layer applications run inefficiently, and impairs the timeliness of data analysis and the accuracy of data queries.

In the prior art, a scheduling engine is usually added to the distributed relational database and performs the scheduling tasks. However, when the scheduling workload is heavy or individual tasks become complex, the burden on the scheduling engine increases sharply and the scheduling cost rises.

Disclosure of Invention

In view of the above defects or shortcomings in the prior art, the present invention aims to provide a scheduling method for a distributed relational database based on HDFS. The method makes full use of the distributed network of the distributed database by adding a scheduling engine and a user interface at the child nodes, so that no separate scheduling node is needed. A task is sent through the user interface directly to the scheduling engine at a node, which splits the task and sends the resulting commands to other nodes. This reduces link load and implementation cost, matches data resources to computing resources efficiently, and improves database performance.

In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:

An embodiment of the invention provides a scheduling method for a distributed relational database based on HDFS (Hadoop Distributed File System), in which a scheduler is arranged at each bottom-layer node of the relational database. After receiving a task from a caller, the scheduler at a node splits the task according to the system's internal scheduling policy. It issues the subtasks suited to the node directly to the current processing engine, which executes them on the current node's data. For subtasks not suited to the node, the required data is fetched through the interconnection among the processing engines, and the current processing engine executes those subtasks on the fetched data. These execution results are aggregated with the results obtained from the node's own data, and the scheduler feeds the aggregated result back to the caller.

As a preferred embodiment of the present invention, the scheduler at the node receives the caller's task through a user interface arranged between the scheduler and the caller.

As a preferred embodiment of the present invention, the scheduler at the node receives the caller's task through a distributor arranged between the scheduler and the caller. The distributor distributes the tasks it receives from callers and issues them to the schedulers at the nodes based on a random or round-robin (RoundRobin) policy, combined with manual assignment and a communication and coordination mechanism among the callers.
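A distributor of this kind can be illustrated with a minimal sketch. The patent does not define any data structures, so the `Distributor` class, its `policy` and `manual` parameters, and the node and caller names below are all hypothetical; manual assignment is modelled simply as a caller-to-node table that overrides the automatic policy.

```python
import itertools
import random
from typing import Dict, List, Optional


class Distributor:
    """Hypothetical distributor: picks the node scheduler for each caller task."""

    def __init__(self, nodes: List[str], policy: str = "round_robin",
                 manual: Optional[Dict[str, str]] = None,
                 seed: Optional[int] = None):
        if policy not in ("round_robin", "random"):
            raise ValueError(f"unknown policy: {policy}")
        self.nodes = list(nodes)
        self.policy = policy
        # Manual assignment (assumed form): caller id -> pinned node.
        self.manual = dict(manual or {})
        self._cycle = itertools.cycle(self.nodes)
        self._rng = random.Random(seed)

    def pick_node(self, caller: str) -> str:
        # A manually assigned node takes precedence over the automatic policy.
        if caller in self.manual:
            return self.manual[caller]
        if self.policy == "random":
            return self._rng.choice(self.nodes)
        return next(self._cycle)


distributor = Distributor(["node-a", "node-b", "node-c"],
                          manual={"reporting": "node-c"})
picks = [distributor.pick_node(c)
         for c in ["app1", "app2", "reporting", "app3", "app4"]]
print(picks)
```

In this sketch a manually pinned caller does not advance the round-robin cursor, which is one possible design choice rather than something the patent prescribes.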

As a preferred embodiment of the present invention, under the internal scheduling policy, the scheduler splits a task into subtasks as follows. First, the subtasks that can be executed on the current data of the current node are split off and issued directly to the processing engine for execution. Second, the part of the task that cannot be executed on the current node's data is split into other subtasks. When the current processing engine receives these other subtasks, it dynamically reads the data over network links from the nodes that hold it, following the principle of maximizing cross-node read efficiency, and then executes the other subtasks on the data read.

As a preferred embodiment of the present invention, when the scheduler at each node determines whether a split subtask can be executed on the data of that node, the specific scheduling policy it adopts is set according to the data characteristics of the node.

As a preferred embodiment of the present invention, the specific scheduling policy adopts a fair sharing and/or capacity-based scheduling policy.

As a preferred embodiment of the present invention, the method further comprises: when some nodes fail while the subtasks are being executed, the scheduler tries as far as possible to keep the work already done from failing; when a node fails, as long as its data is not completely lost, the subtasks of the failed node are re-executed on other healthy nodes.

As a preferred embodiment of the present invention, the method further comprises: when some nodes fail while the subtasks are being executed, the scheduler cancels the subtasks on all nodes, then splits and schedules the task again according to the latest node resource conditions and executes the task from the beginning.

As a preferred embodiment of the present invention, the scheduling method includes:

step S101, a caller issues a task to the scheduler at a node;

step S102, the scheduler at the node splits the task according to the scheduling policy, first splitting off the subtasks suited to the current node's HDFS database and issuing them directly to the processing engine of the current node;

step S103, the current processing engine directly uses the data at the current HDFS node to execute the subtasks, and feeds the execution result back to the current scheduler;

step S104, the scheduler at the node splits off the subtasks not suited to the current node's HDFS database and issues them to the current processing engine;

step S105, the current processing engine fetches data from other nodes as the subtasks require, executes the current subtasks on the fetched data, and feeds the execution result back to the current scheduler;

and step S106, the current scheduler aggregates the subtask execution results based on the current node's data with the subtask execution results based on other nodes' data, and feeds the aggregated result back to the caller.
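Steps S101 to S106 can be traced in a small in-memory sketch. Everything in it is an assumption made for illustration: the `Node` class, the shard names, and the choice of a simple sum as the task are hypothetical, since the patent specifies neither data structures nor an API.

```python
from typing import Dict, List


class Node:
    """Hypothetical bottom-layer node holding a scheduler and a processing engine."""

    def __init__(self, name: str, shards: Dict[str, List[int]]):
        self.name = name
        self.shards = shards            # data stored locally on this node
        self.peers: List["Node"] = []   # interconnected nodes

    # Processing engine: run a subtask on local data (S103).
    def execute_local(self, shard: str) -> int:
        return sum(self.shards[shard])

    # Processing engine: fetch data from a peer, then run the subtask here (S105).
    def execute_remote(self, shard: str) -> int:
        for peer in self.peers:
            if shard in peer.shards:
                rows = list(peer.shards[shard])   # data copied to this node
                return sum(rows)
        raise KeyError(f"no node holds shard {shard!r}")

    # Scheduler: receive the task (S101), split it, aggregate the results.
    def submit(self, shards_needed: List[str]) -> int:
        local = [s for s in shards_needed if s in self.shards]       # S102
        remote = [s for s in shards_needed if s not in self.shards]  # S104
        local_results = [self.execute_local(s) for s in local]
        remote_results = [self.execute_remote(s) for s in remote]
        return sum(local_results) + sum(remote_results)              # S106


node_a = Node("a", {"orders_1": [1, 2, 3]})
node_b = Node("b", {"orders_2": [10, 20]})
node_a.peers, node_b.peers = [node_b], [node_a]

total_via_a = node_a.submit(["orders_1", "orders_2"])
total_via_b = node_b.submit(["orders_1", "orders_2"])
print(total_via_a, total_via_b)
```

Either node can act as the entry point and returns the same aggregated result, which mirrors the idea that no dedicated scheduling node is required.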

As a preferred embodiment of the present invention, in step S101, when the caller has few tasks, the caller issues the tasks directly to the scheduler through the user interface; when the task volume is large, the caller sends the tasks to the distributor, and the distributor issues them to the schedulers at the nodes based on a random or round-robin (RoundRobin) policy, combined with manual assignment and a communication and coordination mechanism among the callers.

The invention has the following beneficial effects:

In the HDFS-based distributed relational database scheduling method provided by the embodiment of the invention, a scheduler is arranged at each bottom-layer node of the relational database. After receiving a task from a caller, the scheduler at a node splits the task according to the system's internal scheduling policy. It issues the subtasks suited to the node directly to the current processing engine, which executes them on the current node's data. For subtasks not suited to the node, the required data is fetched through the interconnection among the processing engines, and the current processing engine executes those subtasks on the fetched data. These results are aggregated with the results obtained from the node's own data, and the scheduler feeds the aggregated result back to the caller. The HDFS-based distributed relational database has a bottom-layer distributed architecture with a large number of nodes. Because the scheduling and interface logic is combined with the bottom-layer nodes, no nodes dedicated to scheduling need to be added, and task coordination is achieved through the interconnection among the nodes. This unifies the architecture, eases node maintenance and management, keeps response times sufficiently fast when callers submit multiple tasks, avoids contention for data and computing resources, and achieves efficient scheduling under multitasking.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments thereof, made with reference to the following drawings:

FIG. 1 is a schematic diagram of a centralized scheduling method in the prior art;

FIG. 2 is a schematic diagram of a distributed scheduling method in the prior art;

FIG. 3 is a schematic diagram of a scheduling method for a distributed relational database based on HDFS in an embodiment of the present invention;

FIG. 4 is a timing diagram illustrating a method for scheduling a distributed relational database based on HDFS according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.

It should be noted that the embodiments and the features of the embodiments may be combined with each other where no conflict arises. The present invention is described in detail below with reference to the attached drawings and in conjunction with the embodiments.

As shown in FIG. 1, in the centralized scheduling method for a distributed relational database, a scheduling engine is set up in the database as an independent node dedicated to the external interface. All callers initiate tasks through this scheduling engine. The node obtains tasks from the external callers, splits them and dispatches them to the bottom-layer nodes for processing, receives the bottom-layer nodes' results locally for final merging, and returns the merged result to the external caller. The node thus becomes the central control node of the system, responsible for command and data interaction with both the bottom-layer nodes and the clients, which easily creates a single point of failure and a bottleneck. A backup node can usually be set up to guard against the master node going offline, but there is no good way to deal with the I/O bottleneck caused by concentrating the data flow.

As shown in FIG. 2, in the distributed scheduling method for a distributed relational database, a scheduler and an external interface are deployed on several nodes. An external caller selects a node, either according to its own policy or at random, as the temporary control node for a task, and the selected node is responsible for splitting, scheduling, and result merging for that task. When several callers operate simultaneously, they select different nodes as their respective control nodes. Although this avoids the single point of failure and the bottleneck, the number of scheduling nodes keeps growing as the system expands, so increasing system capacity also increases scheduling overhead.

The embodiments of the invention target an HDFS-based distributed relational database whose bottom layer uses a distributed architecture with a large number of nodes. The scheduling and interface logic is combined with the bottom-layer nodes, with which it has a natural affinity. As a result, no nodes dedicated to scheduling need to be added, tasks are coordinated through the interconnection among the nodes, the architecture is unified, and node maintenance and management are made easier.

As shown in FIG. 3, in the scheduling method of a distributed relational database according to an embodiment of the present invention, a scheduler and a user interface are provided at each bottom-layer node of the relational database. After receiving a task from a caller, the scheduler at a node splits the task according to the system's internal scheduling policy and issues the subtasks suited to the node directly to the current processing engine. For the subtasks that depend on other nodes, the data of those nodes is fetched through the interconnection among the processing engines, and the current processing engine executes these subtasks on the fetched data. The results are then merged with the results obtained from the node's own data and fed back to the caller through the current scheduler.

The scheduling method of the invention, built on node-level schedulers, is divided into two parts according to the system boundary: an external task connection stage and an internal execution stage. The distributed relational database system encapsulates its complex internal structure and scheduling logic and, through a calling interface, offers the external application layer a high-level interface for submitting tasks and obtaining results. The external caller need not be concerned with the scheduling policy or execution details inside the distributed system; it only issues a request and obtains the result. Moreover, because the bottom-layer nodes interconnect directly with one another in the internal execution stage without passing through upper-layer nodes, and the bottom-layer nodes available to be called far outnumber the callers' tasks, the response speed remains sufficiently fast when callers submit multiple tasks.

In the external task connection stage, the caller directly selects a suitable node scheduler, sends it a task request, and receives its feedback. In the internal execution stage, the current node's scheduler splits the task according to the internal scheduling policy. It first splits off and issues the subtasks suited to the current node. It then splits off the other subtasks, for which the processing engine of the current node fetches data from other nodes and executes the subtasks on the fetched data. Finally, it aggregates the execution results of all subtasks and feeds them back to the caller.

Because schedulers and user interfaces are deployed in a distributed manner, the embodiment of the invention configures a caller coordination mechanism across all nodes, so as to avoid the load imbalance that would result if all callers connected to only one node or a few concentrated nodes. Preferably, depending on the organizational structure and service communication mode of the center, a unified distributor with a random or round-robin (RoundRobin) policy is adopted, combined with manual assignment and a communication and coordination mechanism among the callers, to assign each caller a specific node to connect to.

As described above, a task scheduling system generally has to solve two problems: splitting one task and scheduling it across multiple service nodes, and scheduling multiple tasks submitted to the system in the same period. A suitable scheduling policy therefore provides fair queuing, or queuing by reasonable priority, under multitasking, avoids contention for data resources and computing resources, achieves the best match between them, and safeguards overall efficiency under multitasking.

The HDFS-based distributed relational database provided by the embodiment of the invention uses a Hadoop system as its underlying platform. Both storage and processing are decentralized: data can be distributed across a large number of nodes, and the nodes can also perform various operations on the data. The nodes are connected by a high-speed network, preferably gigabit links within one IDC or across several IDCs interconnected through dedicated optical fiber. The nodes are peers with no priority among them. Data is sharded according to some logic and stored dispersed across the nodes, often with some redundancy (for example, 2x or 3x). Consequently, when data resources and computing resources cannot be matched efficiently, the computing resources must fetch large amounts of data from other nodes over the network, which is very costly compared with a node's local I/O. If this happens frequently, the network may become congested and prevent other tasks or subtasks from completing.

Under the internal scheduling policy used in the internal execution stage, the scheduler first issues the split subtasks to the node's processing engine. When a subtask cannot be completed with the data available to the current processing engine, the data is read dynamically over network links from the nodes that hold it. To make cross-node reads efficient, the subtask splitting stage makes full use of the current node's data while also taking into account the selection and combination of other nodes and the volume of data involved, so that the overall efficiency of cross-node reading is maximized.
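The patent does not say how the selection and combination of other nodes is optimized. One plausible reading, shown here purely as an assumption, is a greedy plan that reads from as few remote nodes as possible and prefers lightly loaded ones. The function `plan_remote_reads`, the replica map, and the load figures are all hypothetical.

```python
from typing import Dict, List, Set


def plan_remote_reads(needed: Set[str],
                      replicas: Dict[str, Set[str]],
                      load: Dict[str, float]) -> Dict[str, List[str]]:
    """Greedy sketch: map each chosen remote node to the shards read from it."""
    remaining = set(needed)
    plan: Dict[str, List[str]] = {}
    while remaining:
        # For every candidate node, collect the still-missing shards it holds.
        coverage: Dict[str, Set[str]] = {}
        for shard in remaining:
            for node in replicas.get(shard, ()):
                coverage.setdefault(node, set()).add(shard)
        if not coverage:
            raise LookupError(f"no replica available for: {sorted(remaining)}")
        # Prefer the node serving the most shards; break ties by lower load,
        # then by name so the result is deterministic.
        best = min(coverage,
                   key=lambda n: (-len(coverage[n]), load.get(n, 0.0), n))
        plan[best] = sorted(coverage[best])
        remaining -= coverage[best]
    return plan


read_plan = plan_remote_reads(
    needed={"s1", "s2", "s3"},
    replicas={"s1": {"n2", "n3"}, "s2": {"n2"}, "s3": {"n3", "n4"}},
    load={"n2": 0.7, "n3": 0.2, "n4": 0.1},
)
print(read_plan)
```

A greedy cover is not guaranteed to be optimal; it is used here only because it is short and makes the trade-off between node count and node load visible.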

The processing engine of the node then interconnects with the processing engines of other nodes. Because a caller may call any of several nodes, each node flexibly adopts a scheduling policy suited to the actual task when splitting, and the nodes need not all adopt the same policy. When splitting, the node takes full account of its own computing resources and of the load on the other interconnected nodes, so that join operations complete as efficiently as possible and the distributed use of each node's data is improved.

Preferably, fair sharing and/or capacity-based scheduling strategies are employed in embodiments of the present invention.
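The patent names fair sharing and capacity-based policies without defining them; Hadoop YARN ships schedulers with similar names (the Fair Scheduler and the Capacity Scheduler). As one common interpretation of fair sharing, the sketch below computes a max-min fair allocation of a node's capacity among competing tasks. The function name, the notion of "slots", and the demand figures are assumptions for illustration, not the patent's algorithm.

```python
from typing import Dict


def max_min_fair_share(capacity: float,
                       demands: Dict[str, float]) -> Dict[str, float]:
    """Max-min fairness: small demands are met in full, the rest split evenly."""
    alloc = {name: 0.0 for name in demands}
    active = {name for name, demand in demands.items() if demand > 0}
    remaining = capacity
    while active and remaining > 1e-12:
        share = remaining / len(active)
        # Tasks whose outstanding demand fits within an equal share.
        satisfied = {n for n in active if demands[n] - alloc[n] <= share}
        if not satisfied:
            # Nobody can be fully satisfied: give everyone the equal share.
            for n in active:
                alloc[n] += share
            break
        for n in satisfied:
            remaining -= demands[n] - alloc[n]
            alloc[n] = demands[n]
        active -= satisfied
    return alloc


shares = max_min_fair_share(10.0, {"q1": 2.0, "q2": 4.0, "q3": 8.0})
print(shares)
```

A capacity-based policy would instead reserve a fixed fraction of the node for each queue; that variant is not shown here.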

In a large-scale system, the number of nodes is large and the resource and load conditions of each node differ. Such a system inevitably faces the problem of operating-environment reliability: power, cooling, network, hardware faults, software anomalies, and the like can cause node faults or anomalies. The scheduler must therefore be ready to handle node failure at any time. A node failure may be a momentary network interruption, a brief restart, or a permanent loss of the node. Nodes in a Hadoop distributed system are not tightly coupled, do not share complex system resources, and do not use shared storage, so tasks cannot be migrated and recovered between nodes immediately. Therefore, when some nodes fail during task execution, this embodiment chooses between two coping strategies. The first tries as far as possible to keep the work already done from failing: when a node fails, as long as its data is not completely lost, the subtasks of the failed node are re-executed on other healthy nodes. The second is the simplest scheduling strategy: when some nodes fail, the subtasks on all nodes are cancelled, the task is split and scheduled again according to the latest node resource conditions, and the task is executed from the beginning.
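The two coping strategies can be contrasted in a short sketch. The `NodeDown` exception, the `run` and `plan` callables, and the subtask and node names are hypothetical stand-ins; the patent describes the strategies only in prose.

```python
from typing import Callable, Dict, List, Set


class NodeDown(Exception):
    """Hypothetical signal that a node failed while running a subtask."""

    def __init__(self, node: str):
        super().__init__(node)
        self.node = node


def run_with_retry(assignment: Dict[str, str],
                   replicas: Dict[str, List[str]],
                   run: Callable[[str, str], int]) -> Dict[str, int]:
    """Strategy 1: re-run only the failed subtask on another node holding its data."""
    results: Dict[str, int] = {}
    for sub, node in assignment.items():
        candidates = [node] + [n for n in replicas.get(sub, []) if n != node]
        for candidate in candidates:
            try:
                results[sub] = run(sub, candidate)
                break
            except NodeDown:
                continue   # try the next replica; finished subtasks are kept
        else:
            raise RuntimeError(f"data lost for subtask {sub!r}")
    return results


def run_with_restart(plan: Callable[[Set[str]], Dict[str, str]],
                     healthy: Set[str],
                     run: Callable[[str, str], int],
                     max_attempts: int = 3) -> Dict[str, int]:
    """Strategy 2: on any failure, drop all partial results, re-plan, start over."""
    healthy = set(healthy)
    for _ in range(max_attempts):
        assignment = plan(healthy)     # split again on current resources
        try:
            return {sub: run(sub, node) for sub, node in assignment.items()}
        except NodeDown as exc:
            healthy.discard(exc.node)  # partial results are discarded
    raise RuntimeError("task failed after re-planning")


down = {"n1"}


def run(sub: str, node: str) -> int:
    if node in down:
        raise NodeDown(node)
    return len(sub)   # placeholder for real work


def plan(healthy: Set[str]) -> Dict[str, str]:
    nodes = sorted(healthy)
    return {sub: nodes[i % len(nodes)] for i, sub in enumerate(["scan", "join"])}


retry_results = run_with_retry({"scan": "n1", "join": "n2"},
                               {"scan": ["n1", "n3"]}, run)
restart_results = run_with_restart(plan, {"n1", "n2", "n3"}, run)
print(retry_results, restart_results)
```

The first strategy preserves completed work at the cost of tracking replicas per subtask; the second is simpler but repeats all work after any failure.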

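FIG. 4 is a timing diagram of a scheduling method of the HDFS-based distributed relational database according to a preferred embodiment of the present invention. As shown in FIG. 4, in this embodiment, the scheduling method includes:

Step S101, the caller issues a task to the scheduler at a node.

When the caller has few tasks, the caller issues the tasks directly to the scheduler through the user interface. When the task volume is large, the caller sends the tasks to the distributor, and the distributor issues them to the schedulers at the nodes based on a random or round-robin (RoundRobin) policy, combined with manual assignment and a communication and coordination mechanism among the callers, so that tasks do not concentrate or back up at any one node.

Step S102, the scheduler at the node splits the task according to the scheduling policy, first splitting off the subtasks suited to the current node's HDFS database and issuing them directly to the processing engine of the current node.

Step S103, the current processing engine directly uses the data at the current HDFS node to execute the subtasks, and feeds the execution result back to the current scheduler.

Step S104, the scheduler at the node splits off the subtasks not suited to the current node's HDFS database and issues them to the current processing engine.

Step S105, the current processing engine fetches data from other nodes as the subtasks require, executes the current subtasks on the fetched data, and feeds the execution result back to the current scheduler.

Step S106, the current scheduler aggregates the subtask execution results based on the current node's data with the subtask execution results based on other nodes' data, and feeds the aggregated result back to the caller.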

As the above technical solution shows, the HDFS-based distributed relational database scheduling method provided by the embodiment of the invention relies on the bottom-layer distributed architecture of the HDFS-based distributed relational database, which has a large number of nodes. The scheduling and interface logic is combined with the bottom-layer nodes, so no nodes dedicated to scheduling need to be added, and tasks are coordinated through the interconnection among the nodes. This unifies the architecture, eases node maintenance and management, keeps response times sufficiently fast when callers submit multiple tasks, avoids contention for data and computing resources, and achieves efficient scheduling under multitasking.

The foregoing describes only preferred embodiments of the invention and illustrates the technical principles employed. Those skilled in the art will appreciate that the scope of the invention is not limited to the specific combination of the features described above, but also covers other embodiments formed by any combination of those features or their equivalents without departing from the inventive concept, for example, technical solutions formed by substituting the above features with features of similar function disclosed in (but not limited to) the present invention.

Claims

1. A scheduling method for a distributed relational database based on HDFS (Hadoop Distributed File System), characterized in that a scheduler is arranged at each bottom-layer node of the relational database; after receiving a task from a caller, the scheduler at a node splits the task according to the system's internal scheduling policy, issues the subtasks suited to the node directly to the current processing engine, which executes them on the data of the current node, and, for the subtasks not suited to the node, fetches the data those subtasks require through the interconnection among the processing engines; the current processing engine executes those subtasks on the fetched data to obtain an execution result, the execution result is aggregated with the execution result obtained from the node's own data, and the scheduler feeds the aggregated result back to the caller;

according to the internal scheduling strategy, when a scheduler splits subtasks of a task, firstly, an executable subtask is split according to current data of a node and is directly issued to a processing engine for execution, secondly, a task part which cannot be executed according to the current data of the node is split into other subtasks, when the current processing engine receives the other subtasks, data reading is dynamically carried out from the node containing the data through a network link according to a cross-node reading efficiency maximization principle, and then the other subtasks are executed based on the read data.

2. The HDFS-based distributed relational database scheduling method according to claim 1, wherein the scheduler at the node receives the caller's call task through a user interface provided between the scheduler and the caller.

3. The HDFS-based scheduling method for a distributed relational database according to claim 1, wherein the scheduler at the node receives a call task of a caller, and the call task is implemented by a distributor arranged between the scheduler and the caller, and the distributor distributes the received call task of the caller and issues the received call task to the scheduler at the node based on a random or RoundRobin policy in combination with manual assignment and a communication and coordination mechanism between the callers.

4. The HDFS-based scheduling method for a distributed relational database according to claim 3, wherein when the scheduler at each node determines whether a split subtask can be executed on the data of that node, the specific scheduling policy adopted is set according to the data characteristics of the node.

5. The HDFS based distributed relational database scheduling method according to claim 4, wherein the specific scheduling policy is fair sharing and/or capacity based scheduling policy.

6. The HDFS-based distributed relational database scheduling method according to any one of claims 1 to 3, wherein the method further comprises: when some nodes fail while the subtasks are being executed, the scheduler tries as far as possible to keep the work already done from failing; when a node fails, as long as its data is not completely lost, the subtasks of the failed node are re-executed on other healthy nodes.

7. The HDFS-based distributed relational database scheduling method according to any one of claims 1 to 3, wherein the method further comprises: when partial node failure occurs in the process of executing the subtasks, the scheduler cancels the subtasks on all the nodes uniformly, and then performs splitting scheduling of the tasks again according to the latest node resource condition to execute the tasks from the beginning.

8. The HDFS-based distributed relational database scheduling method according to claim 1, wherein the scheduling method comprises:

step S101, a caller issues a task to the scheduler at a node;

step S105, the current processing engine calls data at other nodes according to the subtasks, executes the current subtasks according to the called data, and obtains and feeds back an execution result to the current scheduler:

9. The HDFS-based distributed relational database scheduling method according to claim 8, wherein in step S101, when there are few tasks for the caller, the caller directly issues the tasks to the scheduler through a user interface; when the task amount is large, the calling party sends the task to the distributor, and the distributor sends the task to the scheduler at the node based on a random or RoundRobin strategy and by combining manual assignment and a communication and coordination mechanism among the calling parties.