CN111106947B - Node downtime repairing method and device, electronic equipment and readable storage medium - Google Patents

Node downtime repairing method and device, electronic equipment and readable storage medium

Info

Publication number
CN111106947B
CN111106947B (application CN201811270222.XA)
Authority
CN
China
Prior art keywords
node, data, execute, restarting, restarted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811270222.XA
Other languages
Chinese (zh)
Other versions
CN111106947A (en)
Inventor
申航
高宇
杨稼晟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Beijing Kingsoft Cloud Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Beijing Kingsoft Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd, Beijing Kingsoft Cloud Technology Co Ltd filed Critical Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN201811270222.XA priority Critical patent/CN111106947B/en
Publication of CN111106947A publication Critical patent/CN111106947A/en
Application granted granted Critical
Publication of CN111106947B publication Critical patent/CN111106947B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1458 Management of the backup or restore process
    • G06F11/1469 Backup restoration techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Retry When Errors Occur (AREA)

Abstract

The embodiment of the invention provides a node downtime repairing method and device, electronic equipment and a readable storage medium. The method comprises the following steps: when the first node or the second node is detected to be down, restarting the down node of the first node and the second node, wherein the restarted node of the first node and the second node serves as the slave node, and the node of the first node and the second node that does not execute the restarting operation serves as the master node; and determining that the downtime repair of the node is completed when it is detected that the data in the node that does not execute the restarting operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node that does not execute the restarting operation is in a closed state. By applying the embodiment of the invention, the downtime repair of a node can be completed when the first node or the second node in the single master-slave service system is down, so that the data security in the single master-slave service system is improved.

Description

Node downtime repairing method and device, electronic equipment and readable storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for repairing a node downtime, an electronic device, and a readable storage medium.
Background
A database is a repository that organizes, stores, and manages data according to a data structure. To ensure high availability of databases, a single master-slave service system is often employed to provide data access services. The single master-slave service system comprises a master node, a slave node and a management node. The master node is used for providing data reading service and data writing service; the slave node is used for backing up data on the master node; the management node is used for managing the master node and the slave node. In this way, the same data can be stored by both the master node and the slave node, so that the reliability of the data can be ensured by keeping two copies of the data.
However, when a master node or a slave node in a single master-slave service system goes down, a copy of stored data is lost; at this time, the data in the single master-slave service system lacks backup data, so that the security of the data is reduced.
Disclosure of Invention
An object of the embodiments of the present invention is to provide a node downtime repairing method, apparatus, electronic device, and readable storage medium, so as to recover the down node when the first node or the second node in a single master-slave service system is down, thereby improving the security of data in the single master-slave service system. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a node downtime repairing method, which is applied to a management node in a single master-slave service system, where the single master-slave service system further includes: a first node and a second node; the first node is set as a slave node and a data persistence mechanism is started; the second node is set as a master node and the data persistence mechanism is closed; the method may comprise the following steps: when the first node or the second node is detected to be down, restarting the down node of the first node and the second node, wherein the restarted node of the first node and the second node serves as the slave node, and the node of the first node and the second node that does not execute the restarting operation serves as the master node; and determining that the downtime repair of the node is completed when it is detected that the data in the node that does not execute the restarting operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node that does not execute the restarting operation is in a closed state.
Optionally, the step of restarting the down node of the first node and the second node may include: judging whether a down node in the first node and the second node starts a data persistence mechanism or not; if the data persistence mechanism is not started, starting the data persistence mechanism for the down node in the first node and the second node, and restarting the down node in the first node and the second node; and if the data persistence mechanism is started, restarting the down node in the first node and the second node.
Optionally, before the step of restarting the down one of the first node and the second node, the method may further include at least one of the following steps: when the first node is detected to be not down and the second node is detected to be down, the first node is upgraded to be the main node, and the second node is modified to be the slave node; and when the first node is detected to be down and the second node is not detected to be down, maintaining the state that the first node is a slave node and the second node is a master node.
Optionally, in an embodiment of the present invention, the method may further include: in the process of copying the data in the node that does not execute the restarting operation to the restarted node and persisting the data stored in the restarted node, detecting whether the node that does not execute the restarting operation is down; when the node that does not execute the restarting operation is down, stopping running the restarted node; writing the persistent data generated by the data persistence mechanism started by the first node into the node that does not execute the restarting operation; and restarting the node that does not execute the restarting operation and the stopped restarted node.
Optionally, before the step of restarting the node that does not execute the restarting operation and the stopped restarted node, the method may further include: judging whether the message digest algorithm MD5 value of the persistent data written into the node that does not execute the restarting operation matches the MD5 value of the persistent data generated by the data persistence mechanism started by the first node; if they do not match, re-executing the step of writing the persistent data generated by the data persistence mechanism started by the first node into the node that does not execute the restarting operation; and if they match, triggering the steps of restarting the node that does not execute the restarting operation and the stopped restarted node.
Optionally, the step of triggering the restarting of the node that does not execute the restarting operation and the stopped restarted node may include: judging whether the node that does not execute the restarting operation has started a data persistence mechanism; when the node that does not execute the restarting operation has not started the data persistence mechanism, starting the data persistence mechanism for it, and triggering the steps of restarting the node that does not execute the restarting operation and the stopped restarted node; and when the node that does not execute the restarting operation has started the data persistence mechanism, giving up starting the data persistence mechanism for it, and triggering the steps of restarting the node that does not execute the restarting operation and the stopped restarted node.
Optionally, before the step of restarting the down node of the first node and the second node, the method may further include: controlling the node of the first node and the second node that is not down to close its own write data service; and the step of determining that the downtime repair of the node is completed when it is detected that the data in the node that does not execute the restarting operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node that does not execute the restarting operation is in a closed state may include: when it is detected that the data in the node that does not execute the restarting operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node that does not execute the restarting operation is in a closed state, controlling the node that does not execute the restarting operation to start its own write data service; and after detecting that the node that does not execute the restarting operation has started the write data service, determining that the downtime repair of the node is completed.
Optionally, the step of determining that the downtime repair of the node is completed when it is detected that the data in the node that does not perform the restart operation has been copied to the restarted node, the data stored in the restarted node has been data persisted, and the data persistence mechanism of the node that does not perform the restart operation is in the off state may include: under the conditions that the data in the node which does not execute the restarting operation is copied to the node after restarting and the data stored in the node after restarting is subjected to data persistence, judging whether a data persistence mechanism is started by the node which does not execute the restarting operation or not; under the condition that the data persistence mechanism is started by the node which does not execute the restarting operation, the data persistence mechanism started by the node which does not execute the restarting operation is closed; and deleting the persistent data generated by the data persistence mechanism started by the node which does not execute the restarting operation, and completing the downtime restoration of the node.
Optionally, the first node and the second node are created on different node builders, and the node builder comprises a kernel virtualization LXC container or a virtual machine; the step of restarting the down node of the first node and the second node may include: detecting whether the node builder of the down node of the first node and the second node is down; if the node builder is down, restarting or newly building the node builder of the down node of the first node and the second node; and running the down node of the first node and the second node.
In a second aspect, an embodiment of the present invention provides a node downtime repairing apparatus, which is applied to a management node in a single master-slave service system, where the single master-slave service system further includes: a first node and a second node; the first node is set as a slave node and a data persistence mechanism is started; the second node is set as a master node and the data persistence mechanism is closed; the apparatus may include: a first restarting module, configured to restart the down node of the first node and the second node when it is detected that the first node or the second node is down, wherein the restarted node of the first node and the second node serves as the slave node, and the node of the first node and the second node that does not execute the restarting operation serves as the master node; and a determining module, configured to determine that the downtime repair of the node is completed when it is detected that the data in the node that does not execute the restarting operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node that does not execute the restarting operation is in a closed state.
In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device is a device corresponding to a management node in a single master-slave service system, and the single master-slave service system further includes: a first node and a second node; the first node is set as a slave node and a data persistence mechanism is started; the second node is set as a master node and the data persistence mechanism is closed; the electronic device comprises a processor and a memory, wherein: the memory is used for storing a computer program; and the processor is used for implementing the method steps of any of the above node downtime repairing methods when executing the program stored in the memory.
In a fourth aspect, an embodiment of the present invention provides a readable storage medium, where the readable storage medium is a readable storage medium in an electronic device; the electronic device is a device corresponding to a management node in a single master-slave service system, and the single master-slave service system further comprises: a first node and a second node; the first node is set as a slave node and a data persistence mechanism is started; the second node is set as a master node and the data persistence mechanism is closed; the readable storage medium stores a computer program, and the computer program is executed by a processor of the electronic device to perform the method steps of any of the above node downtime repairing methods.
In a fifth aspect, an embodiment of the present invention provides a computer program product including instructions, which is executable on an electronic device, where the electronic device is a device corresponding to a management node in a single master-slave service system, and the single master-slave service system further includes: a first node and a second node; the first node is set as a slave node and a data persistence mechanism is started; the second node is set as a master node and the data persistence mechanism is closed; when the instructions are run on the electronic device, they cause the electronic device to perform the method steps of any of the above node downtime repairing methods.
In the embodiment of the present invention, the single master-slave service system includes: a first node, a second node and a management node. And the first node is set as a slave node and the data persistence mechanism is started, and the second node is set as a master node and the data persistence mechanism is closed. When detecting that the first node or the second node is down, the management node in the single master-slave service system may restart the down node in the first node and the second node. And after the restarting operation is executed, the restarted node in the first node and the second node is used as a slave node, and the node which does not execute the restarting operation in the first node and the second node is used as a master node. Then, the management node may determine that the downtime repair of the node is completed when it is detected that the data in the node that does not perform the restart operation has been copied to the restarted node, the data stored in the restarted node has been subjected to data persistence, and the data persistence mechanism of the node that does not perform the restart operation is in a closed state. Therefore, when the first node or the second node in the single master-slave service system goes down, the down node can be recovered, so that the data in the single master-slave service system can be stored in duplicate through the first node and the second node, and the safety of the data in the single master-slave service system is improved on the basis of not influencing the service performance of the single master-slave service system.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from these drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a single master-slave service system according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for repairing a node downtime according to an embodiment of the present invention;
fig. 3 is a flowchart of another method for repairing a node downtime according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a device for repairing a node downtime according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to solve the problem that when a master node or a slave node in a single master-slave service system is down, the security of data in the system is reduced, embodiments of the present invention provide a method and an apparatus for repairing the down of the node, an electronic device, and a readable storage medium.
In the following, referring to fig. 1, a single master-slave service system in the embodiment of the present invention and technical terms related to the embodiment of the present invention are explained.
In an embodiment of the present invention, a single master-slave service system is a system for providing data access services in a database (e.g., a redis database, a kind of open-source Key-Value database). The single master-slave service system comprises a first node, a second node and a management node. And the first node is set as a slave node and a data persistence mechanism is started; the second node is set as the master node and the data persistence mechanism is turned off.
Referring to fig. 1, a single master-slave service system includes a master node a, a slave node a, and a management node; the other single master-slave service system comprises a master node B, a slave node B and a management node; and the other single master-slave service system comprises a master node C, a slave node C and a management node.
The management node is not shown in fig. 1. It will be understood by those skilled in the art that the management node in each single master-slave service system in the above example may be an electronic device other than the electronic device D1 and the electronic device D2, and the management nodes in the above example may be the same or different.
The master node a and the slave node a are arranged on different electronic devices. The master node B and the slave node B are also provided on different electronic devices. It is reasonable that the master node a and the master node B may be provided on the same electronic device D1, and the slave node a and the slave node B may be provided on the same electronic device D2. The electronic device in the embodiment of the present invention may be a mobile device, or may be a server, but is not limited to this.
In addition, the client 1, the client 2, and the client 3 shown in fig. 1 are programs that provide services to customers and correspond to the master nodes and slave nodes in the respective single master-slave service systems. This is prior art and will not be described in detail here.
In the embodiment of the present invention, the master node refers to an application installed on an electronic device and configured to provide services such as a data reading service and a data writing service. In general, the master node may also be understood as: an electronic device capable of providing services such as a data reading service and a data writing service. Moreover, the master node does not turn on (i.e., turns off) the data persistence mechanism.
Similarly, the slave node refers to an application installed on an electronic device and used for backing up data cached by the master node. In general, the slave node may also be understood as: an electronic device capable of backing up the data cached by the master node. Moreover, the slave node turns on the data persistence mechanism.
The management node is an application installed on the electronic device and used for managing the master node and the slave node in the single master-slave service system. In general, a management node may also be understood as: an electronic device capable of managing a master node and a slave node in a single master-slave service system.
Wherein, the data persistence mechanism is: a mechanism to transition data between a persistent state and a transient state. Colloquially, transient data, such as cache data, is persisted as persistent data. Persistent data obtained based on the persistence mechanism can be permanently stored in the storage device, and even if the storage device is down, the persistent data cannot be lost as long as the persistent data is not damaged.
For example, when a data persistence mechanism is turned on for a slave node in a single master-slave service system, the write operation of the slave node is recorded as persistent data. When the slave node is down, the persistent data is not lost. In this way, after the slave node is restarted, the persistent data can be written into the cache of the slave node, so that the data cached by the slave node before downtime is recovered. Since the slave node in the embodiment of the present invention is used for backing up the data of the master node, opening the data persistence mechanism for the slave node does not affect the service performance of the service provided by the single master-slave service system.
When the master node cannot provide its service, the master node is down. Similarly, when the slave node cannot provide its service, the slave node is down.
Specifically, for the data persistence mechanism AOF (Append Only File), when appendonly in the configuration file corresponding to the slave node is set to yes, the data persistence mechanism of the slave node is turned on. At this time, the write operations of the slave node are recorded in the persistent file appendonly.aof. When appendonly in the configuration file corresponding to the slave node is set to no, the data persistence mechanism of the slave node is closed.
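As an illustrative sketch (not part of the patent text), the same switch can also be flipped at runtime through the CONFIG SET command; the example below uses the Python redis-py client, and the host names are assumptions:

```python
# Illustrative sketch only: toggle the AOF data persistence mechanism of a
# redis node at runtime via CONFIG SET / CONFIG GET. Host names are assumed.
import redis

def set_persistence(host: str, enabled: bool, port: int = 6379) -> None:
    node = redis.Redis(host=host, port=port)
    # 'appendonly yes' turns the AOF persistence mechanism on, 'no' closes it.
    node.config_set('appendonly', 'yes' if enabled else 'no')

def persistence_enabled(host: str, port: int = 6379) -> bool:
    node = redis.Redis(host=host, port=port)
    return node.config_get('appendonly').get('appendonly') == 'yes'

# First node (slave): persistence mechanism started.
set_persistence('first-node.example', True)
# Second node (master): persistence mechanism closed.
set_persistence('second-node.example', False)
```

Note that CONFIG SET only changes the running configuration; making the change survive a process restart additionally requires editing the configuration file or issuing CONFIG REWRITE.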
After a node in the database turns on the data persistence mechanism, the node needs additional resources to keep the data persistence mechanism running normally, so the node's capability to provide other services is weakened. For this reason, in the prior art, the persistence mechanism is not started for the single master-slave service system, that is, the data persistence mechanism is started for neither the master node nor the slave node in the single master-slave service system.
The following describes a method for repairing the node downtime, provided by the embodiment of the present invention, by taking an example that a single master-slave service system is a system for providing data access services in a redis database.
The management node in the single master-slave service system provided by the embodiment of the invention is used for executing the method steps of the node downtime repair method, thereby realizing the management of the master node and the slave node in the single master-slave service system. The single master-slave service system provided by the embodiment of the invention comprises a first node, a second node and a management node. And the first node is set as a slave node and a data persistence mechanism is started; the second node is set as the master node and the data persistence mechanism is turned off.
The management node, the first node, the second node, the master node and the slave nodes related in the embodiment of the present invention are all nodes in a redis database, and thus these nodes are redis nodes.
Referring to fig. 2, the method for repairing the node downtime may include the following steps:
s101: when the first node or the second node is detected to be down, restarting the down node of the first node and the second node; the method comprises the following steps that a restarted node in a first node and a second node is used as a slave node, and a node which does not execute restarting operation in the first node and the second node is used as a master node;
it is understood that the management node may detect in real time whether the first node and the second node in the single master-slave service system are down. The management node can call a sentinel system in the prior art to detect whether the first node and the second node in the single master-slave service system are down.
When it is detected that the second node is down and the first node is normal, the sentinel system can be called to carry out failover on the single master-slave service system. In other words, the first node in the single master-slave service system may be upgraded to the master node, and the configuration file of the down second node may be modified so that the down second node becomes the slave node. In this way, a switch of the master-slave roles of the first node and the second node can be achieved. The down node of the first node and the second node may then be restarted.
In addition, when it is detected that the second node is normal and the first node is down, the state that the second node which is not down is used as the main node and the first node which is down is used as the slave node can be maintained. In this way, there is no need to switch the master-slave roles of the first and second nodes. The downed one of the first and second nodes may then be restarted.
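In redis terms, both branches reduce to the SLAVEOF command (SLAVEOF NO ONE promotes a node to master; SLAVEOF host port demotes it to a slave of that master). A sketch of the two cases, with host names assumed and sentinel-driven failover left out:

```python
# Role-switch sketch for the two downtime cases above. Host names are
# assumptions; real failover would normally go through the sentinel system.
import redis

FIRST = ('first-node.example', 6379)
SECOND = ('second-node.example', 6379)

def on_second_node_down() -> None:
    # Second node (master) is down, first node is normal: promote the first
    # node; calling slaveof() with no arguments issues SLAVEOF NO ONE.
    redis.Redis(*FIRST).slaveof()
    # The down second node's configuration is modified so that, once it is
    # restarted, it rejoins as a slave of the first node.

def on_first_node_down() -> None:
    # First node (slave) is down, second node is normal: no role switch is
    # needed; after restart the first node is re-attached as a slave.
    redis.Redis(*FIRST).slaveof(SECOND[0], SECOND[1])
```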
The data cached by the down node before it went down is still cached in the node that is not down, so the data in the down node can subsequently be recovered using the data stored in the node that is not down.
In addition, in the embodiment of the present invention, when there is no node downtime in the single master-slave service system, the first node set as the slave node starts the data persistence mechanism, and the second node set as the master node does not start the data persistence mechanism. Thus, when there is a first node or a second node down in the single master-slave service system, the operation of restarting the down one of the first node and the second node may include:
and judging whether a down node in the first node or the second node starts a data persistence mechanism or not. And if the data persistence mechanism is not started, starting the data persistence mechanism for the shutdown node, and restarting the shutdown node. And if the data persistence mechanism is started, directly restarting the down node.
Specifically, in the case where the second node is down and the first node is normal, the first node which is not down is upgraded to be the master node, and the down second node is modified to be the slave node. At this time, it may be determined that the master node obtained by the upgrade has started the data persistence mechanism, and that the slave node obtained by the modification has not started the data persistence mechanism. In this case, the data persistence mechanism of the modified slave node (i.e., the second node) may be turned on, for example, by setting appendonly in the configuration file corresponding to the second node to yes. The second node is then restarted.
In the case where the second node is normal and the first node is down, the second node which is not down still serves as the master node, and the down first node still serves as the slave node. At this time, it may be determined that the slave node has started the data persistence mechanism, and that the master node has not started the data persistence mechanism. Therefore, the first node can be restarted directly without starting its data persistence mechanism.
Restarting the first node means that the first node is started again so that the first node can provide the service provided by the slave node. After restarting the first node, it can be checked through the redis ROLE command whether the role of the first node is indeed a slave node.
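For example, with the redis-py client the check might look as follows (the host name is an assumption):

```python
# Sketch: verify after restart that the first node really acts as a slave,
# using the ROLE command mentioned above. The host name is an assumption.
import redis

first = redis.Redis(host='first-node.example', port=6379)
reply = first.execute_command('ROLE')
# For a slave, ROLE replies [b'slave', master_host, master_port, state, offset].
if reply[0] != b'slave':
    raise RuntimeError(f'unexpected role after restart: {reply[0]!r}')
```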
S102: and determining that the downtime restoration of the node is completed under the condition that the data in the node which does not execute the restarting operation is detected to be copied to the restarted node, the data stored in the restarted node is subjected to data persistence, and the data persistence mechanism of the node which does not execute the restarting operation is in a closed state.
It is understood that, after restarting the downed one of the first node and the second node, it may be detected whether the condition is satisfied: the data in the node which does not execute the restarting operation is copied to the restarted node, the data stored in the restarted node is subjected to data persistence, and a data persistence mechanism of the node which does not execute the restarting operation is in a closed state. And when the condition is met, determining that the downtime restoration of the node is completed.
When it is detected that the data in the node which does not perform the restart operation has been copied to the restarted node, it indicates that the restarted node stores the same data as the node which does not perform the restart operation. At this time, the restarted node realizes the backup of the data in the node which does not execute the restarting operation, and improves the safety of the data in the single master-slave service system.
When the data stored in the restarted node is detected to have completed data persistence, it indicates that the data in the single master-slave service system can be permanently stored in the storage device at this time, so as to implement persistent protection on the data in the single master-slave service system.
When it is detected that the operation of copying the data in the node that does not execute the restarting operation to the restarted node is not completed, or that the data persistence operation on the data stored in the restarted node is not completed, the copy operation and the data persistence operation are repeated and continued until both are completed.
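One way to implement these completion checks is to poll standard INFO fields on both nodes; the field names below exist in redis, but treating replication-offset equality as "copy complete" is an illustrative assumption, as are the host names:

```python
# Completion-check sketch for S102: replication done, persistence done,
# master's persistence mechanism closed. Host names are assumptions.
import redis

master = redis.Redis(host='master.example', port=6379)    # did not restart
replica = redis.Redis(host='replica.example', port=6379)  # restarted node

def copy_complete() -> bool:
    r = replica.info('replication')
    m = master.info('replication')
    return (r.get('master_link_status') == 'up'
            and r.get('slave_repl_offset') == m.get('master_repl_offset'))

def persistence_complete() -> bool:
    p = replica.info('persistence')
    return p.get('aof_enabled') == 1 and p.get('aof_rewrite_in_progress') == 0

def master_persistence_closed() -> bool:
    return master.config_get('appendonly').get('appendonly') == 'no'

if copy_complete() and persistence_complete() and master_persistence_closed():
    print('node downtime repair completed')
```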
In addition, in the case where the second node is down and the first node is normal, the first node which is not down is upgraded to be the master node, and the down second node is modified to be the slave node. That is, the master node obtained by the upgrade has the data persistence mechanism turned on. In order to ensure the performance of the read-write data service of the upgraded master node, when the completion of the copy operation and the data persistence operation is detected, the data persistence mechanism of the upgraded master node can be closed. In addition, after the data persistence mechanism of the upgraded master node is closed, the persistent data generated by the data persistence mechanism started by the upgraded master node can be deleted, so that storage space can be saved. Then, it may be determined that the repair of the down node in the single master-slave service system is completed.
In the case where the second node is normal and the first node is down, the second node which is not down still serves as the master node, and the down first node still serves as the slave node. That is, the master node does not have the data persistence mechanism turned on at this time, and the performance of the master node's read-write data service is not reduced. Therefore, when the completion of the copy operation and the data persistence operation is detected, it can be determined that the repair of the down node in the single master-slave service system is completed.

In the embodiment of the present invention, the single master-slave service system includes: a first node, a second node and a management node. The first node is set as a slave node and the data persistence mechanism is turned on, and the second node is set as a master node and the data persistence mechanism is turned off. When detecting that the first node or the second node is down, the management node in the single master-slave service system may restart the down node of the first node and the second node. After the restarting operation is executed, the restarted node of the first node and the second node serves as the slave node, and the node of the first node and the second node that does not execute the restarting operation serves as the master node. Then, the management node may determine that the downtime repair of the node is completed when it is detected that the data in the node that does not execute the restarting operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node that does not execute the restarting operation is in a closed state. Therefore, when the first node or the second node in the single master-slave service system is down, the down node can be recovered, so that the data in the single master-slave service system can be stored in duplicate by the first node and the second node, and the security of the data in the single master-slave service system is improved without affecting the service performance of the single master-slave service system.
In the process of copying the data in the node that does not execute the restarting operation to the restarted node, or in the process of persisting the data stored in the restarted node, if the node that does not execute the restarting operation goes down, the operation in progress is interrupted and the downtime repair is interrupted. In order to repair the down node in the case where the new master node goes down during the above copy process or data persistence process, thereby ensuring the security of data in the single master-slave service system, the node downtime repairing steps shown in fig. 3 may also be executed:
the step S201 is the same as the step S101, and after the step S201 is executed, the step S202 is triggered. The step of S206 is the same as the step of S102, and is not described herein again.
S202: in the process of copying the data in the node that does not execute the restarting operation to the restarted node and persisting the data stored in the restarted node, detecting whether the node that does not execute the restarting operation is down; when it is detected that the node that does not execute the restarting operation is down, executing step S203; when it is detected that the node that does not execute the restarting operation is not down, executing step S206;
after restarting the down node of the first node and the second node, the data in the node (i.e., the current master node) that does not perform the restart operation is copied to the restarted node (i.e., the current slave node), and after the data is written by the restarted node, the data persistence operation is performed on the written data. If a node that does not perform the reboot operation in the replication process goes down, it is likely that the current slave node cannot replicate all data in the current master node, and this may cause the downtime repair process to be interrupted.
In addition, if the node that does not execute the restarting operation goes down during the persistence process, the data in the current slave node cannot be persistently stored, and owing to the downtime of the current master node, the security of the data in the single master-slave service system is reduced.
Therefore, whether the node that does not execute the restarting operation is down can be detected while the copy operation and the data persistence operation are being executed.
S203: stopping running the restarted node, and triggering the step S204;
S204: Writing the persistent data generated by the data persistence mechanism started by the first node into the node which does not execute the restart operation, and triggering the step S205;
S205: Restarting the node which does not execute the restarting operation and the restarted node after stopping running, and triggering the step S206;
when the downtime of the node which does not execute the restarting operation is detected in the process of copying the data in the node which does not execute the restarting operation to the node which does not execute the restarting operation and carrying out data persistence on the data stored in the node which does not execute the restarting operation, the node which does not execute the restarting operation is likely to copy only a part of the data or carry out data persistence on only a part of the data. In order to recover the service of the single master-slave service system without losing data, the operation of the restarted node may be stopped, that is, the restarted node is forced to go down.
Then, the persistent data generated by the data persistence mechanism started by the first node in the single master-slave service system can be written into the node that does not execute the restarting operation. After that, the node that does not execute the restarting operation and the stopped restarted node are restarted.
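Taken together, steps S203 to S205 might be sketched as follows; the file paths, host name and restart helper are assumptions, not the patent's prescribed implementation:

```python
# Sketch of steps S203-S205: stop the restarted node, write the first node's
# persistent data into the node that did not execute the restarting
# operation, then restart both. Paths, hosts and restart_node are assumed.
import shutil
import redis

def restart_node(name: str) -> None:
    """Assumed helper: restart the named node, e.g. through its node builder."""
    raise NotImplementedError

def repair_after_new_master_downtime() -> None:
    restarted = redis.Redis(host='restarted-node.example', port=6379)
    # S203: stop running the restarted node; skip the final save because its
    # data set may be incomplete.
    restarted.shutdown(nosave=True)
    # S204: write the persistent data generated by the first node's data
    # persistence mechanism into the node that did not execute the restart.
    shutil.copy('/data/first-node/appendonly.aof',
                '/data/second-node/appendonly.aof')
    # S205: restart the node that did not execute the restarting operation,
    # then the stopped restarted node.
    restart_node('second-node')
    restart_node('first-node')
```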
S206: determining that the downtime repair of the node is completed when it is detected that the data in the node that does not execute the restarting operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node that does not execute the restarting operation is in a closed state.

For example, when it is detected that the first node is down and the second node is normal, after the first node is restarted, the second node is the node that does not execute the restarting operation, and the first node is the restarted node. When it is then detected that the second node goes down while the data in the second node is being copied to the restarted first node and the data stored in the first node is being persisted, the first node may be stopped. The persistent data generated by the data persistence mechanism started by the first node is then written into the second node, and the second node and the first node are restarted. Thereafter, it may be determined that the downtime repair is completed if it is detected that the data in the second node has been copied to the first node, the data stored in the first node has been persisted, and the data persistence mechanism of the second node is in a closed state.
It is to be understood that, in step S206, the node that does not execute the restarting operation and the restarted node refer to the nodes that held these roles before step S205.
In order to ensure that the persistent data generated by the data persistence mechanism started by the first node in step S204 is written completely into the node that does not execute the restarting operation, so that the service of the single master-slave service system is restored without loss, after step S204 is executed it may be judged whether the message digest algorithm MD5 value of the persistent data written to the node that does not execute the restarting operation matches the MD5 value of the persistent data generated by the data persistence mechanism started by the first node.
If they do not match, it indicates that the persistent data generated by the first node has not been written completely and correctly to the node that does not execute the restarting operation. At this time, the operation of writing the persistent data generated by the data persistence mechanism started by the first node to the node that does not execute the restarting operation may be re-executed, until the MD5 value of the persistent data written to the node that does not execute the restarting operation matches the MD5 value of the persistent data generated by the data persistence mechanism started by the first node.
If they match, it indicates that the persistent data generated by the first node has been completely and correctly written into the node that does not execute the restarting operation; at this time, the operations of restarting the node that does not execute the restarting operation and the stopped restarted node can be triggered.
MD5 (Message-Digest Algorithm 5) is a hash algorithm widely used by computers to ensure the integrity and consistency of transmitted information.
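A sketch of this integrity check, assuming the persistent data is an AOF file copied between hosts (the file paths are assumptions):

```python
# MD5 integrity-check sketch: compare the digest of the copy written to the
# node that did not execute the restarting operation with the digest of the
# source generated by the first node. File paths are assumptions.
import hashlib

def md5_of_file(path: str) -> str:
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()

source_md5 = md5_of_file('/data/first-node/appendonly.aof')
written_md5 = md5_of_file('/data/second-node/appendonly.aof')

if written_md5 == source_md5:
    print('match: trigger the restart of both nodes')
else:
    print('mismatch: re-execute the write of the persistent data')
```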
In the case of a match, in order to improve the security of the data in the single master-slave service system, it may further be judged whether the node that does not execute the restarting operation has started a data persistence mechanism. If it has not, the data persistence mechanism is started for it, and the steps of restarting the node that does not execute the restarting operation and the stopped restarted node are triggered. In this way, it can be avoided that part of the data in the single master-slave service system is lost if, after the two nodes are restarted, the node that does not execute the restarting operation goes down again while its data is being copied and persisted.
For example, when it is detected that the first node is down and the second node is normal, after the first node is restarted the second node is the node that does not execute the restarting operation and still serves as the master node, while the first node is the restarted node and still serves as the slave node. When it is detected that the second node goes down while the data in the second node is being copied to the restarted first node and the data stored in the first node is being persisted, the restarted first node may be stopped and the persistent data generated by the data persistence mechanism started by the first node may be written into the second node. When the MD5 value of the persistent data written into the second node matches the MD5 value of the persistent data generated by the data persistence mechanism started by the first node, whether the second node has started a data persistence mechanism can be judged. Since the second node has not started the data persistence mechanism, the data persistence mechanism may first be started for the second node. Then, the second node and the first node are restarted.
In this way, the following situation can be avoided: after the second node and the stopped first node are restarted, new data is written into the second node, and part of the data in the single master-slave service system is lost because the second node goes down again while its data is being copied and persisted.
If the node that does not execute the restarting operation has already started the data persistence mechanism, starting the data persistence mechanism for it may be given up, and the steps of restarting the node that does not execute the restarting operation and the stopped restarted node may be triggered directly.
In addition, since the node that is not down has been deployed as the master node, the current master node can provide data read-write services externally. If the current master node adds cached data after providing the write data service externally, then after the down node of the first node and the second node is restarted, the amount of data that the current slave node must copy from the current master node increases and the copy takes longer. The probability that the current master node goes down during the replication process then increases.
When the current master node is down in the copying process, part of data in the master node cannot be copied to the current slave node, which may cause the loss of the part of data. In particular, when newly added cache data is lost, the newly added cache data cannot be recovered.
Therefore, in order to avoid unrecoverable data loss, before restarting the down node of the first node and the second node, the node of the first node and the second node that is not down may further be controlled to close its own write data service. Accordingly, when it is detected that the data in the node that does not execute the restarting operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node that does not execute the restarting operation is in a closed state, the node that does not execute the restarting operation can be controlled to start its own write data service. In this way, after the node that does not execute the restarting operation is controlled to start its own write data service and the data persistence mechanism is determined to be closed, it can be ensured that the performance of the read-write data service of the node that does not execute the restarting operation is not affected, and the downtime repair of the node can be determined to be completed at this time.
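The patent does not prescribe how the write data service is closed. One possible mechanism on redis, shown here purely as an assumption, is the min-replicas-to-write parameter: while it is set higher than the number of connected replicas, the master rejects write commands:

```python
# Assumed sketch for closing/opening the master's write data service using
# the redis parameter 'min-replicas-to-write'. While the value exceeds the
# number of connected replicas, write commands fail on the master; setting
# it back to 0 re-opens the write service. The host name is an assumption.
import redis

master = redis.Redis(host='master.example', port=6379)

def close_write_service() -> None:
    # Larger than any replica count in this two-node system, so writes are
    # rejected for the whole duration of the repair.
    master.config_set('min-replicas-to-write', 10)

def open_write_service() -> None:
    master.config_set('min-replicas-to-write', 0)
```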
Furthermore, the first node and the second node in the embodiments of the present invention may be created on different node builders. The node builder may be a kernel virtualization LXC container or a virtual machine, but is not limited thereto. That is, the node builder is a device virtualized on a physical device.
In this case, the specific manner of restarting the down node of the first node and the second node may be: detecting whether the node builder of the down node of the first node and the second node is down; if the node builder is down, restarting or newly building the node builder of the down node of the first node and the second node; and then running the down node of the first node and the second node. When the down node of the first node and the second node runs successfully, the restart of the down node is completed.
In addition, when restarting or newly building a node builder of a down node in the first node and the second node fails, or when running the down node in the first node and the second node fails after the node builder is successfully created, alarm information of the failure of restarting the down node needs to be generated so as to prompt technicians to take measures in time.
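For the LXC case, a sketch of this restart flow, assuming the node builder is an LXC container driven through the lxc-* command-line tools (the container name and the in-container start command are assumptions):

```python
# Node-builder restart sketch for the LXC case: check the container state,
# restart it if needed, then run the down redis node inside it. The container
# name and the redis start command are assumptions; failures raise, which
# would produce the alarm information described above.
import subprocess

def builder_is_down(container: str) -> bool:
    # 'lxc-info -n <name> -s' prints the container state, e.g. 'State: RUNNING'.
    out = subprocess.run(['lxc-info', '-n', container, '-s'],
                         capture_output=True, text=True)
    return 'RUNNING' not in out.stdout

def restart_down_node(container: str) -> None:
    if builder_is_down(container):
        if subprocess.run(['lxc-start', '-n', container]).returncode != 0:
            raise RuntimeError('restarting the node builder failed: alarm')
    # Run the down node inside the builder; the config is assumed to set
    # 'daemonize yes' so the command returns once the node is started.
    started = subprocess.run(['lxc-attach', '-n', container, '--',
                              'redis-server', '/etc/redis/redis.conf'])
    if started.returncode != 0:
        raise RuntimeError('running the down node failed: alarm')
```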
The manner of restarting the node that does not execute the restart operation may refer to the manner of restarting the downed node, which is not described in detail herein.
In addition, when it is detected that the first node and the second node are down at the same time, the persistent data generated by the data persistence mechanism started by the first node can be written into the node builder of the second node. Then, the first node and the second node are restarted, and when the data in the node builder of the second node has been copied to the first node, it can be determined that the downtime repair of the nodes is completed. Therefore, even when both the first node and the second node in the single master-slave service system are down, the downtime repair of the nodes can be completed, and the data security in the single master-slave service system is improved.
In summary, by applying the embodiments of the present invention, when the first node or the second node in the single master-slave service system goes down, the down node can be recovered, so as to improve the security of data in the single master-slave service system.
Corresponding to the foregoing method embodiment, an embodiment of the present invention further provides a node downtime repairing apparatus, which is applied to a management node in a single master-slave service system, where the single master-slave service system further includes: a first node and a second node; the first node is set as a slave node and a data persistence mechanism is started; the second node is set as a master node and the data persistence mechanism is closed. Referring to fig. 4, the apparatus may include: a first restarting module 301, configured to restart the down node of the first node and the second node when the first node or the second node is detected to be down, wherein the restarted node of the first node and the second node serves as the slave node, and the node of the first node and the second node that does not execute the restarting operation serves as the master node;
the determining module 302 is configured to determine that the downtime repair of the node is completed when it is detected that data in the node that does not perform the restart operation has been copied to the restarted node, data persistence of the data stored in the restarted node is completed, and a data persistence mechanism of the node that does not perform the restart operation is in a shutdown state.
In the embodiment of the present invention, the single master-slave service system includes: a first node, a second node and a management node. The first node is set as a slave node and the data persistence mechanism is started, and the second node is set as a master node and the data persistence mechanism is closed. By applying the apparatus provided by the embodiment of the invention, when the management node in the single master-slave service system detects that the first node or the second node is down, the down node of the first node and the second node can be restarted. After the restarting operation is executed, the restarted node of the first node and the second node serves as the slave node, and the node of the first node and the second node that does not execute the restarting operation serves as the master node. Then, the management node may determine that the downtime repair of the node is completed when it is detected that the data in the node that does not execute the restarting operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node that does not execute the restarting operation is in a closed state. Therefore, when the first node or the second node in the single master-slave service system is down, the down node can be recovered, so that the data in the single master-slave service system can be stored in duplicate by the first node and the second node, and the security of the data in the single master-slave service system is improved without affecting the service performance of the single master-slave service system.
Optionally, the first restarting module may include:
the judging unit is used for judging whether a data persistence mechanism is started by a down node of the first node and the second node;
the first restarting unit is used for starting the data persistence mechanism for the down node in the first node and the second node and restarting the down node in the first node and the second node when the judging unit judges that the data persistence mechanism is not started;
and the second restarting unit is used for restarting the down node in the first node and the second node when the judging unit judges that the data persistence mechanism is started.
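A minimal sketch of these judging and restarting units, continuing the illustrative Node class and restart() helper from the sketch above (both are assumptions, not part of the embodiment):

    def restart_down_node(node: Node) -> None:
        """Judging unit: check whether the down node's persistence mechanism
        is started; first restarting unit: start it if not, then restart;
        second restarting unit: if it is already started, just restart."""
        if not node.persistence_on:
            node.persistence_on = True
        restart(node)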
Optionally, the apparatus may further comprise at least one of:
the modification module is used for upgrading the first node to the master node and modifying the second node to a slave node, before the down node of the first node and the second node is restarted, when it is detected that the first node is not down and the second node is down;
and the processing module is used for maintaining the state in which the first node is set as the slave node and the second node is set as the master node, before the down node of the first node and the second node is restarted, when it is detected that the first node is down and the second node is not down.
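The role assignment made by the modification module and the processing module before the restart can be illustrated as follows, again reusing the hypothetical Node class from the earlier sketch:

    def assign_roles_before_restart(first: Node, second: Node) -> None:
        """Modification module: if the master (second node) is down, promote
        the first node. Processing module: if the slave (first node) is down,
        keep the existing role assignment."""
        if second.down and not first.down:
            first.role, second.role = "master", "slave"
        elif first.down and not second.down:
            first.role, second.role = "slave", "master"   # state maintained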
Optionally, in an embodiment of the present invention, the apparatus may further include:
the detection module is used for detecting whether the node which does not execute the restarting operation is down or not in the process of copying the data in the node which does not execute the restarting operation to the node which is restarted and carrying out data persistence on the data stored in the node which is restarted;
the stopping module is used for stopping running the restarted nodes when the detecting module detects that the nodes which do not execute the restarting operation are down;
the writing module is used for writing the persistent data generated by the data persistence mechanism started by the first node into the node which does not execute the restarting operation after the stopping module stops running the restarted node;
and the second restarting module is used for restarting the node which does not execute the restarting operation and the restarted node after the running is stopped when the writing module writes the persistent data generated by the data persistence mechanism started by the first node into the node which does not execute the restarting operation.
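A non-limiting sketch of this double-failure path, reusing the hypothetical Node class; the snapshot argument and the snapshot attribute are stand-ins for the persisted data file generated by the first node's persistence mechanism:

    def handle_failure_during_repair(restarted: Node, survivor: Node,
                                     snapshot: bytes) -> None:
        """Stopping module: stop the restarted node. Writing module: write
        the first node's persisted data into the node that did not execute
        the restart operation. Second restarting module: restart both."""
        restarted.down = True              # take the restarted node out of service
        survivor.snapshot = snapshot       # hypothetical attribute for the copy
        for node in (survivor, restarted): # restart both nodes
            node.down = False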
Optionally, in an embodiment of the present invention, the apparatus may further include:
the judging module is used for judging whether the message digest algorithm (MD5) value of the persistent data written into the node which does not execute the restart operation matches the MD5 value of the persistent data generated by the data persistence mechanism started by the first node, when the writing module writes the persistent data generated by the data persistence mechanism started by the first node into the node which does not execute the restart operation;
the first triggering module is used for triggering the writing module to write the persistent data generated by the data persistence mechanism started by the first node into the node which does not execute the restart operation again, when the judging module judges that the MD5 values do not match;
and the second triggering module is used for triggering the second restarting module when the judging module judges that the MD5 values match.
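The MD5 check performed by the judging module and the retry driven by the first triggering module can be sketched with Python's standard hashlib; the file path and the max_retries bound are assumptions added for the sketch, not part of the described device:

    import hashlib

    def write_with_md5_check(dst_path: str, payload: bytes,
                             max_retries: int = 3) -> bool:
        """Write the persisted data, re-read it, and compare MD5 values;
        rewrite on mismatch (first trigger), return True on match so the
        second trigger can proceed to the restart."""
        src_md5 = hashlib.md5(payload).hexdigest()
        for _ in range(max_retries):
            with open(dst_path, "wb") as f:
                f.write(payload)
            with open(dst_path, "rb") as f:
                if hashlib.md5(f.read()).hexdigest() == src_md5:
                    return True
        return False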
Optionally, the second triggering module may be specifically configured to:
judging whether a node which does not execute the restarting operation starts a data persistence mechanism or not;
under the condition that the node which does not execute the restarting operation does not start the data persistence mechanism, starting the data persistence mechanism for the node which does not execute the restarting operation, and triggering a second restarting module;
and under the condition that the node which does not execute the restarting operation starts the data persistence mechanism, giving up starting the data persistence mechanism for the node which does not execute the restarting operation, and triggering a second restarting module.
Optionally, in an embodiment of the present invention, the apparatus may further include:
the control module is configured to control, before the first restarting module 301 restarts the down node of the first and second nodes, the non-down node of the first and second nodes to close its own write data service;
the determining module 302 may be specifically configured to:
under the conditions that it is detected that the data in the node which does not execute the restart operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node which does not execute the restart operation is in the off state, controlling the node which does not execute the restart operation to start its own data writing service; and after detecting that the node which does not execute the restart operation has started the data writing service, determining that the downtime repair of the node is complete.
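The write-service gating described above can be sketched as follows, continuing the earlier hypothetical helpers; closing writes for the whole repair window is what keeps the two copies consistent while replication is in flight:

    def repair_with_write_gate(first: Node, second: Node) -> None:
        """Control module: close the surviving node's write service before
        the restart; re-open it only once the three completion conditions
        hold, at which point the downtime repair is complete."""
        survivor = second if first.down else first
        survivor.writes_enabled = False
        restarted = repair_single_failure(first, second)
        # ...wait for replication and persistence to finish (omitted)...
        if repair_complete(restarted, survivor):
            survivor.writes_enabled = True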
Optionally, the determining module 302 may be specifically configured to:
under the condition that the data in the node which does not execute the restarting operation is detected to be copied to the node after restarting and the data stored in the node after restarting is detected to be subjected to data persistence, judging whether a data persistence mechanism is started by the node which does not execute the restarting operation or not;
under the condition that the data persistence mechanism is started by the node which does not execute the restarting operation, the data persistence mechanism started by the node which does not execute the restarting operation is closed;
and deleting the persistent data generated by the data persistence mechanism started by the node which does not execute the restart operation, thereby completing the downtime repair of the node.
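A short sketch of this clean-up step, with persist_file as an assumed path to the persisted data generated by the node that did not execute the restart operation:

    import os

    def disable_survivor_persistence(survivor: Node, persist_file: str) -> None:
        """Turn off the surviving master's persistence mechanism and delete
        the persisted data it generated, so that after repair only the
        slave pays the persistence cost."""
        if survivor.persistence_on:
            survivor.persistence_on = False
            if os.path.exists(persist_file):
                os.remove(persist_file)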
Optionally, the first node and the second node are created on different node builders, and a node builder comprises a kernel-level virtualization (LXC) container or a virtual machine;
the first restart module 301 may specifically be configured to:
detecting whether the node builder of the down node of the first node and the second node is down;
if so, restarting or rebuilding the node builder of the down node of the first node and the second node;
and running the down node of the first node and the second node.
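For an LXC-based node builder, the detection and restart of the builder itself might look like the sketch below; it assumes the classic LXC command-line tools (lxc-info, lxc-start, lxc-attach) are available, and a virtual-machine builder would use the hypervisor's equivalent calls instead:

    import subprocess

    def ensure_builder_running(container: str) -> None:
        """Detect whether the down node's builder (an LXC container here)
        is itself down, and restart it before running the node."""
        state = subprocess.run(["lxc-info", "-n", container, "-s"],
                               capture_output=True, text=True).stdout
        if "RUNNING" not in state:
            subprocess.run(["lxc-start", "-n", container, "-d"], check=True)
        # the node process is then started inside the container, e.g. with
        # lxc-attach -n <container> -- <node start command>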
Corresponding to the above method embodiment, an embodiment of the present invention further provides an electronic device, where the electronic device is the device corresponding to the management node in a single master-slave service system, and the single master-slave service system further includes: a first node and a second node; the first node is set as a slave node and a data persistence mechanism is started; the second node is set as the master node and the data persistence mechanism is turned off; referring to fig. 5, the electronic device comprises a processor 401 and a memory 402, wherein:
a memory 402 for storing a computer program;
the processor 401 is configured to implement the method steps of any one of the above-described methods for repairing a node downtime when executing the program stored in the memory 402.
When the electronic device corresponding to the management node in the single master-slave service system detects that the first node or the second node is down, the down node of the first node and the second node can be restarted. After the restart operation is executed, the restarted node of the first node and the second node serves as the slave node, and the node of the first node and the second node that does not execute the restart operation serves as the master node. Then, the management node may determine that the downtime repair of the node is complete when it is detected that the data in the node that does not execute the restart operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node that does not execute the restart operation is in the off state. Therefore, when the first node or the second node in the single master-slave service system goes down, the down node can be recovered, so that the data in the single master-slave service system is stored in duplicate by the first node and the second node, and the security of the data in the single master-slave service system is improved without affecting the service performance of the single master-slave service system.
Corresponding to the above method embodiment, the embodiment of the present invention further provides a readable storage medium, where the readable storage medium is a readable storage medium in an electronic device; the electronic device is the device corresponding to the management node in a single master-slave service system, and the single master-slave service system further comprises: a first node and a second node; the first node is set as a slave node and a data persistence mechanism is started; the second node is set as the master node and the data persistence mechanism is turned off;
the readable storage medium stores therein a computer program, and the computer program, when executed by a processor of the electronic device, implements the method steps of any one of the above-described methods for repairing the downtime of the node.
After the computer program stored in the storage medium provided in the embodiment of the present invention is executed by the processor of the electronic device corresponding to the management node, when the management node in the single master-slave service system detects that the first node or the second node is down, the down node of the first node and the second node may be restarted. After the restart operation is executed, the restarted node of the first node and the second node serves as the slave node, and the node of the first node and the second node that does not execute the restart operation serves as the master node. Then, the management node may determine that the downtime repair of the node is complete when it is detected that the data in the node that does not execute the restart operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node that does not execute the restart operation is in the off state. Therefore, when the first node or the second node in the single master-slave service system goes down, the down node can be recovered, so that the data in the single master-slave service system is stored in duplicate by the first node and the second node, and the security of the data in the single master-slave service system is improved without affecting the service performance of the single master-slave service system.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a computer program product including instructions which, when run on an electronic device, cause the electronic device to perform any one of the node downtime repairing methods described above. The electronic device is the device corresponding to the management node in a single master-slave service system, and the single master-slave service system further comprises: a first node and a second node; the first node is set as a slave node and a data persistence mechanism is started; the second node is set as the master node and the data persistence mechanism is turned off;
in the computer program product including the instructions provided in the embodiments of the present invention, after the instructions are run by the processor of the electronic device corresponding to the management node, when the management node in the single master-slave service system detects that the first node or the second node is down, the down node of the first node and the second node may be restarted. After the restart operation is executed, the restarted node of the first node and the second node serves as the slave node, and the node of the first node and the second node that does not execute the restart operation serves as the master node. Then, the management node may determine that the downtime repair of the node is complete when it is detected that the data in the node that does not execute the restart operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node that does not execute the restart operation is in the off state. Therefore, when the first node or the second node in the single master-slave service system goes down, the down node can be recovered, so that the data in the single master-slave service system is stored in duplicate by the first node and the second node, and the security of the data in the single master-slave service system is improved without affecting the service performance of the single master-slave service system.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, the electronic device, the readable storage medium, and the computer program product embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to the partial description of the method embodiments for relevant points.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for repairing node downtime, applied to a management node in a single master-slave service system, wherein the single master-slave service system further comprises: a first node and a second node; the first node is set as a slave node and a data persistence mechanism is started; the second node is set as the master node and the data persistence mechanism is turned off; the method comprises the following steps:
when the first node or the second node is detected to be down, restarting the down node of the first node and the second node; the restarted node of the first node and the second node is used as a slave node, and the node which does not execute the restarting operation of the first node and the second node is used as a master node;
under the conditions that it is detected that the data in the node which does not execute the restart operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node which does not execute the restart operation is in the off state, determining that the downtime repair of the node is complete;
in the process of copying the data in the node which does not execute the restarting operation to the restarted node and carrying out data persistence on the data stored in the restarted node, detecting whether the node which does not execute the restarting operation is down;
under the condition that the node which does not execute the restarting operation is down, stopping running the restarted node;
writing the persistent data generated by the data persistence mechanism started by the first node into the node which does not execute the restart operation;
judging whether the message digest algorithm (MD5) value of the persistent data written into the node which does not execute the restart operation matches the MD5 value of the persistent data generated by the data persistence mechanism started by the first node;
if not, re-executing the step of writing the persistent data generated by the data persistence mechanism started by the first node into the node which does not execute the restart operation;
and if they match, restarting the node which does not execute the restart operation and the restarted node whose running was stopped.
2. The method according to claim 1, wherein said step of restarting said down one of said first and second nodes comprises:
judging whether a data persistence mechanism is started by a down node of the first node and the second node;
if the data persistence mechanism is not started, starting the data persistence mechanism for the down node in the first node and the second node, and restarting the down node in the first node and the second node;
and if the data persistence mechanism is started, restarting the down node in the first node and the second node.
3. The method of claim 1, wherein prior to said step of restarting said down one of said first and second nodes, said method further comprises at least one of:
when it is detected that the first node is not down and the second node is down, upgrading the first node to the master node, and modifying the second node to a slave node;
and when the first node is detected to be down and the second node is not detected to be down, maintaining the state that the first node is a slave node and the second node is a master node.
4. The method according to claim 1, wherein the restarting the node which does not execute the restart operation and the restarted node whose running was stopped comprises:
judging whether the node which does not execute the restarting operation starts a data persistence mechanism or not;
under the condition that the node which does not execute the restart operation has not started a data persistence mechanism, starting the data persistence mechanism for the node which does not execute the restart operation, and triggering the step of restarting the node which does not execute the restart operation and the restarted node whose running was stopped;
and under the condition that the node which does not execute the restart operation has started a data persistence mechanism, giving up starting the data persistence mechanism for the node which does not execute the restart operation, and triggering the step of restarting the node which does not execute the restart operation and the restarted node whose running was stopped.
5. The method according to any one of claims 1-3, further comprising, prior to said step of restarting the one of the first and second nodes that is down:
controlling the node which is not down of the first node and the second node to close its own write data service;
the step of determining that the downtime repair of the node is complete when it is detected that the data in the node which does not execute the restart operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node which does not execute the restart operation is in the off state comprises: under the conditions that it is detected that the data in the node which does not execute the restart operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node which does not execute the restart operation is in the off state, controlling the node which does not execute the restart operation to start its own data writing service; and after detecting that the node which does not execute the restart operation has started the data writing service, determining that the downtime repair of the node is complete.
6. The method according to any one of claims 1 to 3, wherein the step of determining that the downtime repair of the node is complete, in the case that it is detected that the data in the node which does not execute the restart operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node which does not execute the restart operation is in the off state, comprises:
under the condition that data in a node which does not execute the restarting operation is detected to be copied to a restarted node and data stored in the restarted node is detected to be subjected to data persistence, judging whether a data persistence mechanism is started by the node which does not execute the restarting operation or not;
under the condition that the data persistence mechanism is started by the node which does not execute the restarting operation, the data persistence mechanism started by the node which does not execute the restarting operation is closed;
and deleting the persistent data generated by the data persistence mechanism started by the node which does not execute the restart operation, thereby completing the downtime repair of the node.
7. The method of any one of claims 1-3, wherein the first node and the second node are created on different node builders, and a node builder comprises a kernel-level virtualization (LXC) container or a virtual machine;
the step of restarting the down node of the first node and the second node includes: detecting whether the node builder of the down node of the first node and the second node is down; if so, restarting or rebuilding the node builder of the down node of the first node and the second node;
and running the down node in the first node and the second node.
8. A node downtime repairing device, applied to a management node in a single master-slave service system, wherein the single master-slave service system further comprises: a first node and a second node; the first node is set as a slave node and a data persistence mechanism is started; the second node is set as the master node and the data persistence mechanism is turned off; the device comprises:
a first restarting module, configured to restart a down node of the first node and the second node when the first node or the second node is detected to be down; the restarted node of the first node and the second node is used as a slave node, and the node which does not execute the restarting operation of the first node and the second node is used as a master node;
the determining module is used for determining that the downtime repair of the node is complete when it is detected that the data in the node which does not execute the restart operation has been copied to the restarted node, the data stored in the restarted node has been persisted, and the data persistence mechanism of the node which does not execute the restart operation is in the off state;
the detection module is used for detecting whether the nodes which do not execute the restarting operation are down or not in the process of copying the data in the nodes which do not execute the restarting operation to the nodes which are restarted and carrying out data persistence on the data stored in the nodes which are restarted;
the stopping module is used for stopping running the restarted nodes when the detecting module detects that the nodes which do not execute the restarting operation are down;
a writing module, configured to write persistent data generated by a data persistence mechanism started by the first node into the node that does not execute the restart operation after the stopping module stops running the restarted node;
the judging module is configured to judge whether the message digest algorithm (MD5) value of the persistent data written into the node that does not execute the restart operation matches the MD5 value of the persistent data generated by the data persistence mechanism started by the first node, when the writing module writes the persistent data generated by the data persistence mechanism started by the first node into the node that does not execute the restart operation;
the first triggering module is used for triggering the writing module to write the persistent data generated by the data persistence mechanism started by the first node into the node which does not execute the restart operation again, when the judging module judges that the MD5 values do not match;
the second triggering module is used for triggering the second restarting module when the judging module judges that the MD5 values match;
the second restarting module is configured to restart the node that does not execute the restarting operation and the restarted node after the operation is stopped when the writing module writes the persistent data generated by the data persistence mechanism started by the first node into the node that does not execute the restarting operation.
9. An electronic device, wherein the electronic device is the device corresponding to a management node in a single master-slave service system, and the single master-slave service system further includes: a first node and a second node; the first node is set as a slave node and a data persistence mechanism is started; the second node is set as the master node and the data persistence mechanism is turned off; the electronic device comprises a processor and a memory, wherein:
the memory is used for storing a computer program;
the processor is configured to implement the method steps of any one of claims 1-7 when executing the program stored in the memory.
10. A readable storage medium, wherein the readable storage medium is a readable storage medium in an electronic device; the electronic device is the device corresponding to a management node in a single master-slave service system, and the single master-slave service system further includes: a first node and a second node; the first node is set as a slave node and a data persistence mechanism is started; the second node is set as the master node and the data persistence mechanism is turned off;
the readable storage medium has stored therein a computer program which, when being executed by a processor of the electronic device, carries out the method steps of any one of claims 1-7.
CN201811270222.XA 2018-10-29 2018-10-29 Node downtime repairing method and device, electronic equipment and readable storage medium Active CN111106947B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811270222.XA CN111106947B (en) 2018-10-29 2018-10-29 Node downtime repairing method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811270222.XA CN111106947B (en) 2018-10-29 2018-10-29 Node downtime repairing method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111106947A CN111106947A (en) 2020-05-05
CN111106947B true CN111106947B (en) 2023-02-07

Family

ID=70419964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811270222.XA Active CN111106947B (en) 2018-10-29 2018-10-29 Node downtime repairing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111106947B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360323A (en) * 2011-10-28 2012-02-22 东莞市正欣科技有限公司 Method and system for self-repairing down of network server
CN107844386A (en) * 2016-09-19 2018-03-27 北京金山云网络技术有限公司 A kind of data backup, restoration methods and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8539113B2 (en) * 2011-06-16 2013-09-17 Hewlett-Packard Development Company, L.P. Indicators for streams associated with messages
CN105933391B (en) * 2016-04-11 2019-06-21 聚好看科技股份有限公司 A kind of node expansion method, apparatus and system
CN107229541B (en) * 2017-06-20 2019-11-26 携程旅游信息技术(上海)有限公司 Backup method, standby system and the server of Transaction Information
CN108628717A (en) * 2018-03-02 2018-10-09 北京辰森世纪科技股份有限公司 A kind of Database Systems and monitoring method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360323A (en) * 2011-10-28 2012-02-22 东莞市正欣科技有限公司 Method and system for self-repairing down of network server
CN107844386A (en) * 2016-09-19 2018-03-27 北京金山云网络技术有限公司 A kind of data backup, restoration methods and device

Also Published As

Publication number Publication date
CN111106947A (en) 2020-05-05

Similar Documents

Publication Publication Date Title
RU2751551C1 (en) Method and apparatus for restoring disrupted operating ability of a unit, electronic apparatus and data storage medium
CN106951345B (en) Consistency test method and device for disk data of virtual machine
US9146839B2 (en) Method for pre-testing software compatibility and system thereof
US8806476B2 (en) Implementing a software installation process
US7689859B2 (en) Backup system and method
US8788774B2 (en) Protecting data during different connectivity states
US9720786B2 (en) Resolving failed mirrored point-in-time copies with minimum disruption
US7509544B2 (en) Data repair and synchronization method of dual flash read only memory
TWI839587B (en) Method and device for managing software updates , and non-transitory computer readable storage medium
US20200142791A1 (en) Method for the implementation of a high performance, high resiliency and high availability dual controller storage system
CN113254048B (en) Method, device and equipment for updating boot program and computer readable medium
US11150831B2 (en) Virtual machine synchronization and recovery
CN111106947B (en) Node downtime repairing method and device, electronic equipment and readable storage medium
CN113032183A (en) System management method, device, computer equipment and storage medium
US20190026195A1 (en) System halt event recovery
CN114356658A (en) Processing method of firmware upgrading exception, computer equipment and readable storage medium
CN107797885B (en) Electronic device and control method thereof
US20240211257A1 (en) Cache to Receive Firmware Generated Data During Firmware Update
US12032944B2 (en) State machine operation for non-disruptive update of a data management system
US20230385091A1 (en) Managing updates on virtual machine systems
US12019618B2 (en) Prechecking for non-disruptive update of a data management system
US20240095010A1 (en) Configuration management for non-disruptive update of a data management system
CN116501573A (en) Firmware detection method, firmware detection device, electronic device, storage medium and program product
CN117519745A (en) Software upgrading method and device
CN115562803A (en) Automatic recovery method, device, equipment and storage medium for mirror image file

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant