CN107357688B - Distributed system and fault recovery method and device thereof - Google Patents

Distributed system and fault recovery method and device thereof

Info

Publication number
CN107357688B
CN107357688B (Application CN201710630823.6A)
Authority
CN
China
Prior art keywords
metadata
redo log
master node
mirror image
metadata mirror
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710630823.6A
Other languages
Chinese (zh)
Other versions
CN107357688A (en)
Inventor
褚建辉
卢申朋
刘东辉
王新栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangzhou Shenma Mobile Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shenma Mobile Information Technology Co Ltd
Priority to CN201710630823.6A priority Critical patent/CN107357688B/en
Publication of CN107357688A publication Critical patent/CN107357688A/en
Priority to PCT/CN2018/097262 priority patent/WO2019020081A1/en
Application granted
Publication of CN107357688B publication Critical patent/CN107357688B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1458Management of the backup or restore process
    • G06F11/1464Management of the backup or restore process for networked environments

Abstract

The invention discloses a distributed system and a fault recovery method and device thereof. In the method, a slave node and/or the master node acquires and stores a metadata mirror recording the scheduling information and system state on the master node at a certain moment; the master node acquires and stores a redo log recording all of its operations after that moment; and, during failure recovery, the master node calls the metadata mirror and its corresponding redo log. When the master node fails, it can therefore be quickly restored to its pre-failure state from the previously recorded metadata mirror and redo log.

Description

Distributed system and fault recovery method and device thereof
Technical Field
The present invention relates to the field of distributed technologies, and in particular, to a distributed system and a fault recovery method and apparatus thereof.
Background
A distributed system is a software system built on top of a network that connects and coordinates a plurality of machines so that they cooperatively complete tasks, such as computing tasks and storage tasks. Most existing distributed systems adopt a master-slave architecture, and fig. 1 is a schematic diagram of the structure of a distributed system using this architecture. As shown in fig. 1, a master-slave distributed system is typically composed of a master node (master) and a plurality of slave nodes (slave). The master node serves as the central scheduling node of the distributed system and generally provides metadata storage and query, cluster node state management, decision making, task issuing, and similar functions.
Therefore, a failover mechanism is needed so that, when the master node crashes due to an unknown error, it can be restored to the state before the error occurred and the loss of the master node's data is avoided.
Disclosure of Invention
The invention provides a fault recovery scheme for the master node of a distributed system: metadata mirrors of the master node are acquired at one or more moments and the master node's subsequent operations are recorded in a redo log, so that when the master node fails it can be quickly restored to its pre-failure state from the previously recorded metadata mirror and redo log.
According to one aspect of the invention, a distributed system is provided, which comprises a master node for scheduling tasks and managing the system state and a plurality of slave nodes for running the scheduled tasks, wherein one or more slave nodes and/or the master node acquire and store a metadata mirror recording the scheduling information and system state on the master node at a certain moment; the master node acquires and stores a redo log recording all operations of the master node after that moment; and the master node calls the metadata mirror and its corresponding redo log to perform failure recovery.
The master node can thus be quickly restored to its pre-fault state from the previously recorded metadata mirror and redo log, which improves recovery efficiency compared with recording a log file alone.
Preferably, one or more slave nodes and/or the master node perform the metadata mirror acquisition and saving operations when triggered by the master node and/or an external command. Different trigger modes can therefore be set according to the characteristics of the distributed system.
Preferably, the master node responds to a slave node's request only after the corresponding operation has been recorded in the redo log and stored. This ensures that the redo log records every operation of the master node in its entirety.
Preferably, one or more slave nodes and/or the master node continuously acquire and store metadata mirrors of the master node at a plurality of different times, and the master node continuously acquires and stores the redo logs corresponding to those times. During failure recovery the master node may call the latest metadata mirror and its corresponding redo log; if the latest metadata mirror and/or its corresponding redo log are unavailable, it may instead call the data of the latest time at which both a metadata mirror and its corresponding redo log are available. Storing memory mirrors and corresponding redo logs for several different times therefore improves fault tolerance during failure recovery.
Preferably, one or more slave nodes and/or the master node directly acquire and store the memory state of the master node at a certain time as the metadata mirror. The metadata mirror may be stored by task group, so that the corresponding metadata mirrors can be efficiently organized by group during subsequent recovery.
According to another aspect of the present invention, there is also provided a failure recovery apparatus of a distributed system, the distributed system including a master node for scheduling tasks and managing the system state and a plurality of slave nodes for running the tasks, the apparatus being configured to perform failure recovery when the master node fails and comprising: a mirror acquisition unit for acquiring and storing a metadata mirror recording the scheduling information and system state on the master node at a certain moment; a redo log acquisition unit for acquiring and storing a redo log recording all operations of the master node after that moment; and a failure recovery unit for calling the metadata mirror and its corresponding redo log to perform failure recovery.
Preferably, the mirror acquiring unit performs the metadata mirror acquiring and saving operations under the trigger of a master node, a device and/or an external command.
Preferably, the master node responds to a slave node's request only after the corresponding operation has been recorded in the redo log by the redo log acquisition unit and stored.
Preferably, the mirror image obtaining unit continuously obtains and saves the metadata mirror images of the master node at a plurality of different times, and the redo log obtaining unit continuously obtains and saves the redo logs respectively corresponding to the plurality of different times.
Preferably, the failure recovery unit calls the latest metadata mirror and its corresponding redo log to perform failure recovery.
Preferably, when the latest metadata mirror and/or its corresponding redo log are unavailable, the failure recovery unit calls the data of the latest time at which both a metadata mirror and its corresponding redo log are available to perform failure recovery.
Preferably, the mirror image obtaining unit directly obtains and stores the memory state of the master node at a certain time as the metadata mirror image.
Preferably, the image obtaining unit stores the metadata image according to the task grouping.
According to still another aspect of the present invention, there is also provided a failure recovery method of a distributed system, the distributed system including a plurality of slave nodes for running tasks, the method including: acquiring and storing a metadata mirror recording the scheduling information and system state at a certain moment; acquiring and storing a redo log recording all scheduling operations after that moment; and calling the metadata mirror and its corresponding redo log to perform failure recovery.
Preferably, the metadata images of the master node at a plurality of different times are continuously obtained and saved, and the redo logs respectively corresponding to the plurality of different times are continuously obtained and saved.
Preferably, invoking the metadata mirror and the corresponding redo log for failure recovery at the time of failure recovery may include: calling the latest metadata mirror image and the corresponding redo log thereof to carry out fault recovery when the fault is recovered; and when the latest metadata mirror image and/or the corresponding redo log are/is unavailable, calling the data at the latest moment when the metadata mirror image and the corresponding redo log are both available for fault recovery.
Preferably, the memory state of the master node at a certain time can be directly acquired and saved as the metadata mirror.
According to the distributed system and the fault recovery method and device thereof provided by the invention, metadata mirrors of the master node are obtained at one or more moments and the subsequent operations of the master node are recorded in the redo log, so that when the master node fails it can be quickly restored to its pre-failure state from the previously recorded metadata mirrors and redo log.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
Fig. 1 is an architectural diagram of a distributed system showing a master-slave architecture.
FIG. 2 is a schematic flow chart diagram illustrating a method of fault recovery in accordance with an embodiment of the present invention.
FIG. 3 is a diagram illustrating the continuous saving of multiple metadata images and redo logs.
Fig. 4 is a schematic block diagram showing the structure of a failure recovery apparatus according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
For the distributed system of the master-slave architecture shown in fig. 1, the master node stores data necessary for the normal operation and scheduling of the system, such as system status data and current scheduling data, so the loss of this data has a great impact on the system. A failover mechanism is therefore needed that allows the master node to be restored to a stable and reliable state when it encounters an unknown error. To this end, a log file of all operations of the master node may be recorded and persistently stored on disk. Once the master node fails, even if all the data in its memory is lost, it can still be restored to the state before the failure by replaying the recorded log file the next time it is started.
Under this scheme the master node operates as follows: before executing each operation, the master node records the operation into the log file, and only after the record succeeds does it execute the operation, that is, update the data in memory accordingly. The recovery flow after a failure is: read the log file and sequentially modify the data in memory according to the operations of the master node recorded there. This recovery method, which relies only on recording write operations in a log file, is simple to implement, but the recovery process takes a long time.
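For illustration only, the following Python sketch shows the log-file-only approach just described; every name in it is illustrative and not taken from the text above. Each operation is flushed to disk before the in-memory data is updated, and recovery replays the whole log from the beginning, which is why recovery time grows with the length of the log.

import json
import os

class LogOnlyMaster:
    """Minimal sketch of the log-file-only recovery approach (illustrative names)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}  # in-memory metadata (scheduling information, system state)

    def apply(self, op):
        # 1. Record the operation in the log file and flush it to disk.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(op) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # 2. Only after the record succeeds, update the in-memory data.
        self.state[op["key"]] = op["value"]

    def recover(self):
        # Replay every recorded operation in order; cost grows with the log size.
        self.state = {}
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as f:
            for line in f:
                op = json.loads(line)
                self.state[op["key"]] = op["value"]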
After intensive research, the inventors found that, while the log file of the master node's operations is being recorded, a mirror file of the data stored on the master node at a certain moment can also be acquired from time to time. This mirror file represents the state data of the master node at the corresponding moment, so when the master node fails, the latest mirror file can be called together with the operations recorded in the log file after the moment corresponding to that mirror file, and the master node can be recovered from the called data. Compared with relying on the log file alone, this greatly shortens the recovery time.
Based on the above concept, the present invention provides a failure recovery scheme for the master node in a distributed system, and this scheme can be implemented by the distributed system shown in fig. 1. As shown in fig. 1, the distributed system of the present invention may include a master node for scheduling tasks and managing the system state and a plurality of slave nodes for running the scheduled tasks. The master node and the slave nodes may each be deployed on a server; the master node may be deployed on an independent server separate from the slave nodes, or on the same server as one of the slave nodes. As a preferred embodiment, different nodes may be deployed on different servers. While the distributed system shown in fig. 1 consists of one master node and a plurality of slave nodes, it should be understood that the distributed system of the present invention may also include a plurality of master nodes, and may include other devices besides the master and slave nodes, such as backup master nodes, failover databases, and the like.
The following describes in detail a specific flow of implementing the failure recovery scheme by the distributed system of the present invention. FIG. 2 is a schematic flow chart diagram illustrating a method of fault recovery in accordance with an embodiment of the present invention. The method shown in fig. 2 may be implemented by the distributed system shown in fig. 1, and in particular, may be implemented by a master node in the distributed system.
Referring to fig. 2, in step S210, a metadata mirror recording the scheduling information and system state on the master node at a certain time is acquired and saved.
For a distributed system with a master-slave architecture, the whole system becomes unavailable once the master node crashes. Considering the importance of the master node, it therefore usually does not run specific tasks directly but is only responsible for keeping the distributed system running and for scheduling and distributing tasks, while the specific tasks are executed by the slave nodes. That is, the master node is mainly responsible for parsing task requests, allocating resources, and locating target data or nodes according to the metadata, and a specific task is executed by the slave node designated by the master node. Metadata is data that describes data; in the present invention it refers to the data that the master node is responsible for storing and managing. Since the master node is used to schedule tasks and manage the system state, the metadata may be the data recording the scheduling information and system state on the master node at a certain time. For example, for a Hadoop distributed system, the metadata may be system-related description data, system state data, and current task scheduling and state data; for a distributed storage system, the metadata may be data describing state information (e.g., storage locations) of user data.
The metadata mirror of the master node acquired at a certain time may be a mapping of the master node's memory state at that time, so the memory state of the master node at that time may be directly acquired and stored as the metadata mirror. In a specific implementation, the metadata mirror of the master node at a certain time may be obtained by means such as a snapshot (disk snapshot) or a dump (backing up the file system).
The operation of obtaining the metadata mirror may be executed by the master node, by one or more slave nodes, or by a backup master node in the distributed system. The obtained metadata mirror may be persistently stored on a local disk or in a distributed file system, for example in a failover database.
As an optional embodiment of the present invention, the master node may schedule tasks concurrently by group, and the obtained metadata mirror may then consist of metadata mirrors under multiple groups. The obtained metadata mirrors may be stored by task group, with the metadata mirrors belonging to the same task group stored under the same directory, so that the corresponding metadata mirrors can be efficiently organized by group during subsequent recovery, as in the sketch below.
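For illustration only, the following sketch shows one possible way, under assumed directory and file names that are not prescribed above, of persisting a metadata mirror grouped by task: each capture gets a time-stamped mirror directory, with one sub-directory per task group.

import json
import os
import time

def save_metadata_mirror(base_dir, state_by_group):
    # Persist the master's in-memory metadata as a mirror, one directory per
    # task group (illustrative layout, not the claimed implementation).
    stamp = time.strftime("%Y%m%d-%H%M%S")
    mirror_dir = os.path.join(base_dir, "mirror-" + stamp)
    for group, group_state in state_by_group.items():
        group_dir = os.path.join(mirror_dir, group)  # same directory per task group
        os.makedirs(group_dir, exist_ok=True)
        with open(os.path.join(group_dir, "metadata.json"), "w") as f:
            json.dump(group_state, f)
    return mirror_dir

# Example layout produced for two task groups (illustrative paths):
#   <base_dir>/mirror-20240101-120000/group-a/metadata.json
#   <base_dir>/mirror-20240101-120000/group-b/metadata.json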
In step S220, a redo log recording all operations of the master node after the time may be obtained and saved by the master node. Operations referred to herein may refer to operations performed by a master node on metadata or operations performed by a master node on its stored data.
Each operation performed by the master node may be recorded in a redo log, which records the master node's operation information in sequence. For each operation to be performed by the master node, the operation may be executed only after it has been recorded in the redo log and persisted. In this way, if the master node encounters an error while executing the operation, the operation can be recovered from the data recorded in the redo log. Conversely, if an operation were executed first and recorded afterwards, and an error occurred during execution or before the record was stored, the operation could not be recovered and would have to be performed again from the beginning.
For example, when a slave node requests a task from the master node (e.g., a computing task or a storage task), the master node may record the operation of issuing the target data to the slave node in the redo log, and only after the record has been successfully persisted does the master node send the target data to the slave node in response to its request. In other words, for a request from a slave node, the request is responded to only after the master node's operation for that request has been recorded in the redo log and persistently stored, as in the sketch below.
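For illustration only, the record-then-respond rule can be sketched as follows; the function and field names are assumptions made for the example. The essential point is that the redo-log record is flushed to persistent storage before the master node answers the slave node.

import json
import os

class RedoLog:
    """Append-only redo log (illustrative sketch)."""

    def __init__(self, path):
        self.path = path

    def append(self, op):
        with open(self.path, "a") as f:
            f.write(json.dumps(op) + "\n")
            f.flush()
            os.fsync(f.fileno())  # persisted before the master node responds

def handle_task_request(redo_log, slave_id, target_data):
    # Record the issuing operation first ...
    redo_log.append({"type": "issue_task", "slave": slave_id, "data": target_data})
    # ... and only after the record is durable, respond to the slave node.
    return {"slave": slave_id, "payload": target_data}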
In step S230, when a failure occurs, the metadata mirror and its corresponding redo log are called to perform failure recovery.
As described above, the metadata mirror can be regarded as a mapping of the master node's memory state at a certain time, and the redo log records all operations of the master node after that time. Therefore, when the master node fails, failure recovery can be performed from the metadata mirror acquired before the failure together with the operations recorded in the redo log between the time corresponding to that mirror and the failure, restoring the master node to its pre-failure state. Taking redo logs recorded in a file system as an example, recovery can proceed as follows: after the master node is restarted, it first traverses the metadata mirror directory in the file system, finds the latest metadata mirror and loads it into memory, then loads the redo log that follows the latest metadata mirror and replays it; once loading is finished, the whole recovery process is complete. A sketch of this flow follows.
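For illustration only, the recovery flow just described can be sketched as follows, assuming the time-stamped mirror directories of the earlier sketch and a redo-log segment stored next to each mirror; these file-layout details are assumptions, not taken from the text above.

import glob
import json
import os

def recover_master(base_dir):
    # 1. Find and load the latest metadata mirror into memory.
    state = {}
    mirrors = sorted(d for d in glob.glob(os.path.join(base_dir, "mirror-*"))
                     if os.path.isdir(d))
    if not mirrors:
        return state
    latest = mirrors[-1]
    for meta_file in glob.glob(os.path.join(latest, "*", "metadata.json")):
        group = os.path.basename(os.path.dirname(meta_file))
        with open(meta_file) as f:
            state[group] = json.load(f)
    # 2. Load the redo log that follows this mirror and replay its operations.
    redo_path = latest + ".redo"
    if os.path.exists(redo_path):
        with open(redo_path) as f:
            for line in f:
                op = json.loads(line)
                state.setdefault(op["group"], {})[op["key"]] = op["value"]
    return state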
As an alternative embodiment of the present invention, when saving metadata mirrors of the master node, a plurality of metadata mirrors corresponding to different times may be saved. While the redo log is being recorded, the metadata mirror acquisition operation can be executed periodically or in response to a preset trigger condition being met. The trigger condition may be, for example, that a certain parameter reaches a predetermined value or interval, or it may be an external trigger command. For example, the metadata mirror acquisition operation may be performed once every time a predetermined number of operations have been recorded in the redo log, or once every predetermined time interval; one such policy is sketched below.
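For illustration only, one possible trigger policy (an assumption, not prescribed above) combines an operation-count threshold with a time interval: a new metadata mirror is taken after a fixed number of redo-log records or after a fixed amount of time, whichever comes first.

import time

class MirrorTrigger:
    """Illustrative trigger policy for taking a new metadata mirror."""

    def __init__(self, max_ops=10000, max_seconds=300.0):
        self.max_ops = max_ops
        self.max_seconds = max_seconds
        self.ops_since_mirror = 0
        self.last_mirror_time = time.monotonic()

    def record_op(self):
        # Call after appending one redo-log record; returns True when a new
        # metadata mirror should be acquired.
        self.ops_since_mirror += 1
        due = (self.ops_since_mirror >= self.max_ops
               or time.monotonic() - self.last_mirror_time >= self.max_seconds)
        if due:
            self.ops_since_mirror = 0
            self.last_mirror_time = time.monotonic()
        return due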
Further, while the master node's operations are being recorded in the redo log, the redo logs corresponding to the plurality of different times (i.e., to the plurality of metadata mirrors) may be continuously acquired. FIG. 3 is a schematic diagram illustrating the principle of persisting a plurality of metadata mirror files and their corresponding redo logs.
Referring to fig. 3, at time t1 a metadata mirror 1 of the master node is obtained; the operations of the master node between times t1 and t2 are recorded and saved in redo log 1; at time t2 a metadata mirror 2 of the master node is obtained; the operations of the master node between times t2 and t3 are recorded and saved in redo log 2; and so on, yielding metadata mirrors corresponding to times t1, t2, and t3 and the redo logs corresponding to the mirrors at these different times.
Thus, if the master node crashes at time t4, it may first call the latest metadata mirror (i.e., the mirror at time t3) and its corresponding redo log (the redo log for the segment t3-t4) for failover. If the latest metadata mirror or redo log is unavailable, the next most recent metadata mirror (i.e., the mirror at time t2) and redo log (i.e., the redo log for the segment t2-t3) may be called instead, and so on, pushing back until an available data file is found. Storing memory mirrors and corresponding redo logs for several different times therefore improves fault tolerance during failure recovery.
In other words, the solution of the present application can trigger the acquisition and storage of the metadata mirror (e.g., save the state at time t3) under certain conditions or commands and then begin continuously recording the redo log (i.e., record all operations after t3). After a failure at time t4, the state at time t3 can be restored and all operations after t3 replayed, so that the master node quickly returns to its state at time t4. The fallback rule can be sketched as follows.
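For illustration only, the fallback rule can be sketched as follows, again using the assumed mirror/redo-log layout of the earlier sketches: mirrors are examined from newest to oldest, and the first pair whose mirror directory and redo-log segment are both readable is chosen as the recovery point.

import glob
import os

def pick_recovery_point(base_dir):
    # Walk the mirrors from newest to oldest and return the first one whose
    # mirror directory and redo-log segment are both available.
    candidates = sorted(glob.glob(os.path.join(base_dir, "mirror-*")), reverse=True)
    for mirror_dir in candidates:
        redo_path = mirror_dir + ".redo"
        if os.path.isdir(mirror_dir) and os.path.isfile(redo_path):
            return mirror_dir  # newest usable (mirror, redo log) pair
    return None  # no usable pair; recovery from these files is not possible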
When acquiring a metadata mirror of the master node at a certain time, for example metadata mirror 1 at time t1 in fig. 3, the master node's service is usually not stopped, yet acquiring metadata mirror 1 takes some time. Metadata mirror 1 acquired at time t1 is therefore likely to already include some of the operations recorded in redo log 1 after time t1, so if the master node fails at time t2 and is recovered from metadata mirror 1 and the corresponding redo log 1, the recovered state of the master node may not be consistent with the state before the failure.
Therefore, as an optional embodiment of the present invention, while the metadata mirror at a certain time is being acquired, the times of the operations concurrently recorded in the redo log may be tracked, and after the metadata mirror has been acquired, the corresponding operations may be removed from the redo log. This avoids the situation in which the acquired metadata mirror already includes operations that are also recorded in the subsequent redo log, so that the metadata mirror and its corresponding redo log are strictly aligned in time. One possible bookkeeping scheme is sketched below.
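For illustration only, one possible bookkeeping scheme (assumed for the example: each redo-log record carries a sequence number, a detail not spelled out above) removes from the redo log every record whose effect is already reflected in the newly acquired metadata mirror, so that the mirror and its redo log do not overlap in time.

import json

def trim_redo_log(redo_path, last_seq_in_mirror):
    # Drop every redo record already captured by the mirror, keeping only the
    # records with a sequence number greater than the mirror's high-water mark.
    with open(redo_path) as f:
        records = [json.loads(line) for line in f]
    kept = [r for r in records if r["seq"] > last_seq_in_mirror]
    with open(redo_path, "w") as f:
        for r in kept:
            f.write(json.dumps(r) + "\n")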
The fault recovery method of the present invention has been described in detail so far in conjunction with fig. 2-3. In addition, the fault recovery scheme of the present invention can also be implemented by a fault recovery apparatus. Fig. 4 shows a block diagram of a fault recovery apparatus according to an embodiment of the present invention. The functional blocks of the fault recovery apparatus 400 may be implemented by hardware, software, or a combination of hardware and software implementing the principles of the present invention. It will be appreciated by those skilled in the art that the functional blocks described in fig. 4 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.
The failure recovery apparatus 400 shown in fig. 4 may be used to implement the failure recovery method shown in fig. 2, and only the functional modules that the failure recovery apparatus 400 may have and the operations that each functional module may perform are briefly described below, and for the details involved therein, reference may be made to the description above in conjunction with fig. 2, and details are not repeated here. It should be noted that the failure recovery apparatus 400 may be the main node itself or may be a backup main node.
As shown in fig. 4, the failure recovery apparatus of the present invention may include a mirror obtaining unit 410, a redo log obtaining unit 420, and a failure recovery unit 430. The mirror obtaining unit 410 may obtain and store a metadata mirror recording the scheduling information and system state on the master node at a certain time, the redo log obtaining unit 420 may obtain and store a redo log recording all operations of the master node after that time, and the failure recovery unit 430 may call the metadata mirror and its corresponding redo log to perform failure recovery.
Preferably, the mirror obtaining unit 410 may perform the metadata mirror obtaining and saving operations when triggered by the master node, the apparatus, and/or an external command. The mirror obtaining unit 410 may directly obtain and store the memory state of the master node at a certain time as the metadata mirror. Further, the mirror obtaining unit 410 may store the metadata mirror by task group.
Preferably, the master node responds to a slave node's request only after the corresponding operation has been recorded in the redo log by the redo log obtaining unit 420 and stored.
Preferably, the mirror obtaining unit 410 continuously obtains and saves metadata mirrors of the master node at a plurality of different times, and the redo log obtaining unit 420 continuously obtains and saves the redo logs corresponding to those times. In this case, the failure recovery unit 430 calls the latest metadata mirror and its corresponding redo log to perform failure recovery, and when the latest metadata mirror and/or its corresponding redo log are unavailable, the failure recovery unit 430 may call the data of the latest time at which both a metadata mirror and its corresponding redo log are available.
The distributed system and the failure recovery method and apparatus thereof according to the present invention have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (20)

1. A distributed system comprising a master node for scheduling tasks and managing system state and a plurality of slave nodes for running the scheduled tasks,
one or more slave nodes and/or the master node acquire and store a metadata mirror image recorded with scheduling information and a system state at a certain moment on the master node;
the master node acquires and stores redo logs recording all operations of the master node after the moment; and
the master node calls the metadata mirror and its corresponding redo log to perform failure recovery,
wherein the master node schedules tasks concurrently by group, the metadata mirror comprises metadata mirrors under a plurality of groups, the metadata mirrors belonging to the same task group are stored under the same directory, and the master node organizes the corresponding metadata mirrors by group during failure recovery,
and wherein the master node's calling of the metadata mirror and its corresponding redo log during failure recovery comprises:
after being restarted, the master node finds the latest metadata mirror and loads it into memory;
loading the redo log that follows the metadata mirror; and
replaying the operations recorded in the redo log.
2. The distributed system of claim 1, wherein one or more of the slave nodes and/or the master node performs the metadata mirror acquire and save operations triggered by the master node and/or an external command.
3. The distributed system of claim 1 wherein the master node responds to requests from the slave nodes after each of its operations is recorded in the redo log and stored.
4. The distributed system of claim 1, wherein one or more of the slave nodes and/or the master node continuously acquire and save metadata images of the master node at a plurality of different times, and
and the master node continuously acquires and stores redo logs respectively corresponding to the different moments.
5. The distributed system of claim 4, wherein the primary node, upon failover, invokes the most recent metadata image and its corresponding redo log for failover.
6. The distributed system of claim 4, wherein the master node invokes the most recent data for which both the metadata image and its corresponding redo log are available for failover when the most recent metadata image and/or its corresponding redo log are unavailable.
7. The distributed system of claim 1, wherein one or more of the slave nodes and/or the master node directly acquire and save the memory state of the master node at a time as the metadata mirror.
8. The distributed system of claim 1, wherein the metadata image is stored in task groups.
9. A failure recovery apparatus of a distributed system including a master node for scheduling tasks and managing a system state and a plurality of slave nodes for running the tasks, for performing failure recovery when the master node fails, comprising:
the mirror image acquisition unit is used for acquiring and storing the metadata mirror image recorded with the scheduling information and the system state at a certain moment on the main node;
the redo log obtaining unit is used for obtaining and storing redo logs recording all operations of the main node after the moment; and
a failure recovery unit for calling the metadata mirror image and the corresponding redo log to perform failure recovery when the failure is recovered,
wherein the master node schedules tasks concurrently by group, the metadata mirror comprises metadata mirrors under a plurality of groups, the metadata mirrors belonging to the same task group are stored under the same directory, and the failure recovery unit organizes the corresponding metadata mirrors by group during failure recovery,
and wherein the failure recovery unit is configured to:
find, after the master node is restarted, the latest metadata mirror and load it into memory;
load the redo log that follows the metadata mirror; and
replay the operations recorded in the redo log.
10. The apparatus of claim 9, wherein the image acquisition unit performs the metadata image acquisition and saving operations triggered by the master node, the apparatus, and/or an external command.
11. The apparatus of claim 9, wherein the master node responds to the request of the slave node after each operation thereof is recorded and stored in the redo log by the redo log retrieving unit.
12. The apparatus of claim 9, wherein the image acquisition unit continuously acquires and saves metadata images of the master node at a plurality of different times, and
the redo log obtaining unit continuously obtains and stores the redo logs respectively corresponding to the plurality of different moments.
13. The apparatus of claim 12, wherein the failover unit is to invoke the most recent metadata image and its corresponding redo log for failover upon failover.
14. The apparatus according to claim 12, wherein, when the latest metadata mirror and/or its corresponding redo log are unavailable, the failure recovery unit calls the data of the latest time at which both a metadata mirror and its corresponding redo log are available to perform failure recovery.
15. The apparatus of claim 9, wherein the mirror obtaining unit directly obtains and saves a memory state of the master node at a certain time as the metadata mirror.
16. The apparatus of claim 9, wherein the image acquisition unit stores the metadata image in terms of task packets.
17. A method of fault recovery for a distributed system comprising a plurality of slave nodes for running tasks, the method comprising:
acquiring and storing a metadata mirror image recorded with scheduling information and a system state at a certain moment;
acquiring and storing redo logs recorded with all scheduling operations after the moment; and
calling the metadata mirror image and the corresponding redo log thereof for fault recovery when the fault is recovered,
wherein the master node schedules tasks concurrently by group, the metadata mirror comprises metadata mirrors under a plurality of groups, the metadata mirrors belonging to the same task group are stored under the same directory, and the master node organizes the corresponding metadata mirrors by group during failure recovery,
and wherein calling the metadata mirror and its corresponding redo log to perform failure recovery comprises:
after being restarted, the master node finds the latest metadata mirror and loads it into memory;
loading the redo log that follows the metadata mirror; and
replaying the operations recorded in the redo log.
18. The method of claim 17, wherein
metadata mirrors of the master node at a plurality of different times are continuously acquired and saved, and
redo logs respectively corresponding to the plurality of different times are continuously acquired and saved.
19. The method of claim 18, wherein invoking the metadata image and its corresponding redo log for failover upon failover comprises:
calling the latest metadata mirror image and the corresponding redo log thereof to perform fault recovery when the fault is recovered; and
and when the latest metadata mirror image and/or the corresponding redo log are/is unavailable, calling the data at the latest moment when the metadata mirror image and the corresponding redo log are both available for fault recovery.
20. The method of claim 17, wherein a memory state of a master node at a time is directly obtained and saved as the metadata mirror.
CN201710630823.6A 2017-07-28 2017-07-28 Distributed system and fault recovery method and device thereof Active CN107357688B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710630823.6A CN107357688B (en) 2017-07-28 2017-07-28 Distributed system and fault recovery method and device thereof
PCT/CN2018/097262 WO2019020081A1 (en) 2017-07-28 2018-07-26 Distributed system and fault recovery method and apparatus thereof, product, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710630823.6A CN107357688B (en) 2017-07-28 2017-07-28 Distributed system and fault recovery method and device thereof

Publications (2)

Publication Number Publication Date
CN107357688A CN107357688A (en) 2017-11-17
CN107357688B (en) 2020-06-12

Family

ID=60285161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710630823.6A Active CN107357688B (en) 2017-07-28 2017-07-28 Distributed system and fault recovery method and device thereof

Country Status (2)

Country Link
CN (1) CN107357688B (en)
WO (1) WO2019020081A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107357688B (en) * 2017-07-28 2020-06-12 广东神马搜索科技有限公司 Distributed system and fault recovery method and device thereof
CN108390771B (en) * 2018-01-25 2021-04-16 ***股份有限公司 Network topology reconstruction method and device
CN108427728A (en) * 2018-02-13 2018-08-21 百度在线网络技术(北京)有限公司 Management method, equipment and the computer-readable medium of metadata
CN109189480B (en) * 2018-07-02 2021-11-09 新华三技术有限公司成都分公司 File system starting method and device
CN109144792A (en) * 2018-10-08 2019-01-04 郑州云海信息技术有限公司 Data reconstruction method, device and system and computer readable storage medium
CN109656911B (en) * 2018-12-11 2023-08-01 江苏瑞中数据股份有限公司 Distributed parallel processing database system and data processing method thereof
CN111104226B (en) * 2019-12-25 2024-01-26 东北大学 Intelligent management system and method for multi-tenant service resources
CN112379977A (en) * 2020-07-10 2021-02-19 中国航空工业集团公司西安飞行自动控制研究所 Task-level fault processing method based on time triggering
CN111880969A (en) * 2020-07-30 2020-11-03 上海达梦数据库有限公司 Storage node recovery method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8401998B2 (en) * 2010-09-02 2013-03-19 Microsoft Corporation Mirroring file data
CN103294701A (en) * 2012-02-24 2013-09-11 联想(北京)有限公司 Distributed file system and data processing method
CN104216802A (en) * 2014-09-25 2014-12-17 北京金山安全软件有限公司 Memory database recovery method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107357688B (en) * 2017-07-28 2020-06-12 广东神马搜索科技有限公司 Distributed system and fault recovery method and device thereof

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8401998B2 (en) * 2010-09-02 2013-03-19 Microsoft Corporation Mirroring file data
CN103294701A (en) * 2012-02-24 2013-09-11 联想(北京)有限公司 Distributed file system and data processing method
CN104216802A (en) * 2014-09-25 2014-12-17 北京金山安全软件有限公司 Memory database recovery method and device

Also Published As

Publication number Publication date
CN107357688A (en) 2017-11-17
WO2019020081A1 (en) 2019-01-31

Similar Documents

Publication Publication Date Title
CN107357688B (en) Distributed system and fault recovery method and device thereof
US20220188003A1 (en) Distributed Storage Method and Device
US11397648B2 (en) Virtual machine recovery method and virtual machine management device
CN102158540A (en) System and method for realizing distributed database
JP2007179551A (en) Method and apparatus for backup and recovery using storage based journaling
CN114466027B (en) Cloud primary database service providing method, system, equipment and medium
CN110807062B (en) Data synchronization method and device and database host
CN108566291B (en) Event processing method, server and system
CN109582686B (en) Method, device, system and application for ensuring consistency of distributed metadata management
EP3974973A1 (en) Virtual machine backup method and device based on cloud platform data center
CN111651523A (en) MySQL data synchronization method and system of Kubernetes container platform
CN112416889A (en) Distributed storage system
CN106873902B (en) File storage system, data scheduling method and data node
JP7215971B2 (en) METHOD AND APPARATUS FOR PROCESSING DATA LOCATION IN STORAGE DEVICE, COMPUTER DEVICE AND COMPUTER-READABLE STORAGE MEDIUM
CN113986450A (en) Virtual machine backup method and device
CN111226200B (en) Method, device and distributed system for creating consistent snapshot for distributed application
CN107943615B (en) Data processing method and system based on distributed cluster
CN111752892A (en) Distributed file system, method for implementing the same, management system, device, and medium
CN116048878A (en) Business service recovery method, device and computer equipment
CN113515574B (en) Data synchronization method and device
CN111208949B (en) Method for determining data rollback time period in distributed storage system
CN114385755A (en) Distributed storage system
CN114860505A (en) Object storage data asynchronous backup method and system
CN111176886B (en) Database mode switching method and device and electronic equipment
CN111198849A (en) Power supply data read-write system based on Hadoop and working method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200811

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping square B radio tower 13 layer self unit 01

Patentee before: Guangdong Shenma Search Technology Co.,Ltd.
