CN111309515B - Disaster recovery control method, device and system - Google Patents

Disaster recovery control method, device and system

Info

Publication number
CN111309515B
Authority
CN
China
Prior art keywords
site
service
instance
fault
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811513686.9A
Other languages
Chinese (zh)
Other versions
CN111309515A (en)
Inventor
赵洪锟
钱义勇
岳晓明
王晓伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201811513686.9A
Publication of CN111309515A
Application granted
Publication of CN111309515B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1479 Generic software techniques for error detection or fault masking
    • G06F 11/1482 Generic software techniques for error detection or fault masking by means of middleware or OS functionality
    • G06F 11/1484 Generic software techniques for error detection or fault masking by means of middleware or OS functionality involving virtual machines
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533 Hypervisors; Virtual machine monitors
    • G06F 9/45558 Hypervisor-specific management and integration aspects
    • G06F 2009/45591 Monitoring or debugging support

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Hardware Redundancy (AREA)

Abstract

A disaster recovery control method, device and system are provided to solve the problem that, when all instances of the same service in a primary site fail, neither the primary site nor the backup site can continue to provide that service to clients. The method comprises the following steps: for a first service provided by the primary site, determining the working state of each instance of the first service on a plurality of virtual machines in the primary site; when the number of failed instances among all instances of the first service in the primary site satisfies a fault policy, determining a first decision result, where the first decision result instructs the backup site to take over the services of the primary site; and sending the first decision result to the backup site.

Description

Disaster recovery control method, device and system
Technical Field
The present application relates to the field of computer technologies, and in particular, to a disaster recovery control method, device, and system.
Background
With the development of cloud computing, software systems are evolving toward service-oriented architectures, and a customer's large software system often consists of multiple services or components, such as databases, message middleware, and business applications. In a data center, private cloud, or public cloud environment, the failure of an instance on which a service is deployed in a site causes a service outage, which brings significant economic loss and reputation damage to customers.
The common site disaster recovery scheme today is to establish two peer sites, where one serves as the primary site that provides services to clients and the other serves as a backup site that provides backup capability for the primary site. The backup site monitors whether the primary site has failed by monitoring the heartbeat messages exchanged with the primary site. When the primary site goes down because of a disaster such as an earthquake, a fire, or a broken network link, the heartbeat messages between the primary site and the backup site are interrupted, so when the backup site detects the interruption it takes over the services of the primary site and continues to provide services to clients.
However, when all instances deployed for a certain service in the primary site fail and the service is interrupted, the heartbeat messages between the primary site and the backup site remain normal, so neither the primary site nor the backup site can provide that service to clients.
Disclosure of Invention
The present application provides a disaster recovery control method, device and system to solve the problem that, when all instances of the same service in a primary site fail, neither the primary site nor the backup site can continue to provide that service to clients.
In a first aspect, the present application provides a disaster recovery control method, including: for a first service provided by a primary site, determining the working state of each instance of the first service on a plurality of virtual machines in the primary site; when the number of failed instances among all instances of the first service in the primary site satisfies a fault policy, determining a first decision result, where the first decision result instructs the backup site to take over the services of the primary site; and sending the first decision result to the backup site. In this embodiment of the application, the primary site monitors the working state of each instance of the first service on its virtual machines and can then decide, according to those working states, whether to switch the service to the backup site. For example, when the primary site detects that all instances of a certain service have failed, it instructs the backup site to take over its services, so the backup site can continue to provide the service to clients, reducing the customer's economic loss and reputation damage.
In one possible design, the fault policy is formulated for the first service.
In one possible design, when determining the working state of each instance of the first service on the plurality of virtual machines in the primary site, a fault start time may be determined for each instance, and a fault duration is then determined from the fault start time. If the fault duration is greater than a fault time threshold, the working state of the instance is determined to be faulty; if the fault duration is less than or equal to the fault time threshold, the working state of the instance is determined to be not faulty. In this design, whether an instance has failed can be judged accurately from its fault duration, which improves the accuracy of the decision result.
In one possible design, when determining the fault start time of an instance, the application health of the instance may be received and recorded. If the received application health of the instance is abnormal and the last recorded application health of the instance is normal, or there is no record of the instance's application health, the fault start time of the instance is determined to be the current time. When the application health of an instance jumps from normal to abnormal, the jump time can be taken as the time the fault began, so this design yields a relatively accurate fault start time and improves the accuracy of the decision result.
In one possible design, when determining the fault start time of an instance, if the application health of the instance reported by a first virtual machine is not received, the fault start time of the instance may be determined to be the current time, where the first virtual machine is the virtual machine in the primary site on which the instance is deployed. If the application health of the instance reported by the first virtual machine is not received, the first virtual machine can be considered to have failed, and therefore the instance on it can be considered to have failed as well; this design allows the fault start time to be determined promptly, improving the accuracy of the decision result.
In one possible design, when determining the working state of each instance of the first service on the plurality of virtual machines in the primary site, the time at which the application health of the instance was last received may be determined for each instance, and an interruption time is derived from that time. If the interruption time is greater than a fault time threshold, the working state of the instance is determined to be faulty; if the interruption time is less than or equal to the fault time threshold, the working state of the instance is determined to be not faulty. If the application health of an instance reported by the first virtual machine has not been received for a long time, the first virtual machine can be considered to have failed, and therefore the instance on it can be considered to have failed as well; the interruption time is then at least as long as the instance has been faulty, so whether the instance has failed can be determined promptly from the interruption time, improving the accuracy of the decision result.
In one possible design, the primary site may determine the fault time threshold by parsing the fault policy.
In one possible design, after determining the working state of each instance of the first service in the primary site, if the number of failed instances among all instances of the first service does not satisfy the fault policy, a second decision result may be determined, where the second decision result does not instruct the backup site to take over the services of the primary site. In this design, if the instances of the first service do not satisfy the fault policy, it can be determined not to instruct the backup site to take over, so the primary site continues to provide services to clients.
In one possible design, the first decision result may be sent to the backup site through an arbitration service. In this design, the backup site can still obtain the primary site's decision result when the heartbeat network between the backup site and the primary site is interrupted, so it can take over the services of the primary site in time, reducing the risk of a service outage.
In one possible design, sending the first decision result to the backup site through the arbitration service may be implemented by writing the first decision result into the instance of the arbitration service at the primary site, where instances of the arbitration service may be deployed in the primary site, the backup site and an arbitration site. In this design, the arbitration service instances in the primary site, the backup site and the arbitration site form a cluster, so the first decision result written by the primary site's arbitration service instance is shared across the cluster, and the backup site can obtain it by querying the cluster.
In a second aspect, the present application provides a primary site comprising a plurality of virtual machines, and a working state unit, a decision unit and a sending unit deployed in a first virtual machine. The working state unit is configured to determine, for a first service provided by the primary site, the working state of each instance of the first service on the plurality of virtual machines. The decision unit is configured to determine a first decision result when the number of failed instances among all instances of the first service on the plurality of virtual machines satisfies a fault policy, where the first decision result instructs a backup site to take over the services of the primary site. The sending unit is configured to send the first decision result to the backup site.
In a possible design, the working state unit may be specifically configured to: determine, for each instance of the first service on the plurality of virtual machines in the primary site, the fault start time of the instance; determine a fault duration from the fault start time; if the fault duration is greater than a fault time threshold, determine that the working state of the instance is faulty; and if the fault duration is less than or equal to the fault time threshold, determine that the working state of the instance is not faulty.
In one possible design, when determining the fault start time of an instance, the working state unit may be specifically configured to: receive and record the application health of the instance; and if the received application health of the instance is abnormal and the last recorded application health of the instance is normal, or there is no record of the instance's application health, determine that the fault start time of the instance is the current time.
In one possible design, when determining the fault start time of an instance, the working state unit may be specifically configured to: if the application health of the instance reported by the first virtual machine is not received, determine that the fault start time of the instance is the current time, where the first virtual machine is the virtual machine in the primary site on which the instance is deployed.
In a possible design, the working state unit may be specifically configured to: determine, for each instance of the first service on the plurality of virtual machines in the primary site, the time at which the application health of the instance was last received; determine an interruption time from that time; if the interruption time is greater than a fault time threshold, determine that the working state of the instance is faulty; and if the interruption time is less than or equal to the fault time threshold, determine that the working state of the instance is not faulty.
In a possible design, the decision unit may be further configured to: determine a second decision result when the number of failed instances among all instances of the first service does not satisfy the fault policy, where the second decision result does not instruct the backup site to take over the services of the primary site.
In a possible design, the sending unit may be specifically configured to send the first decision result to the backup site through an arbitration service.
In one possible design, the first virtual machine is one of the plurality of virtual machines, or the first virtual machine is not one of the plurality of virtual machines.
In a third aspect, the present application provides a primary site running a plurality of virtual machines that include instances of a first service. The primary site comprises a disaster recovery service module configured to: determine the working state of each instance of the first service on the plurality of virtual machines; if the number of failed instances among all instances of the first service satisfies a fault policy, determine a first decision result, where the first decision result instructs a backup site to take over the services of the primary site; and send the first decision result to the backup site.
In one possible design, when determining the working state of each instance of the first service on the plurality of virtual machines, the disaster recovery service module may be specifically configured to: determine, for each instance, the fault start time of the instance; determine a fault duration from the fault start time; if the fault duration is greater than a fault time threshold, determine that the working state of the instance is faulty; and if the fault duration is less than or equal to the fault time threshold, determine that the working state of the instance is not faulty.
In one possible design, each of the plurality of virtual machines may include a disaster recovery agent module, and each disaster recovery agent module is configured to report, to the disaster recovery service module, the application health of the instance of the first service on the virtual machine where the agent module is located. When determining the fault start time of an instance, the disaster recovery service module may be specifically configured to: receive and record the application health of the instance reported by a first disaster recovery agent module, where the first disaster recovery agent module is the disaster recovery agent module on the virtual machine on which the instance is deployed; and if the received application health of the instance is abnormal and the last recorded application health of the instance is normal, or there is no record of the instance's application health, determine that the fault start time of the instance is the current time.
In one possible design, each of the plurality of virtual machines may include a disaster recovery agent module configured to report, to the disaster recovery service module, the application health of the instance of the first service on the virtual machine where the agent module is located. When determining the fault start time of an instance, the disaster recovery service module may be specifically configured to: if the application health of the instance reported by the first disaster recovery agent module is not received, determine that the fault start time of the instance is the current time, where the first disaster recovery agent module is the disaster recovery agent module on the virtual machine on which the instance is deployed.
In one possible design, each of the plurality of virtual machines may include a disaster recovery agent module configured to report, to the disaster recovery service module, the application health of the instance of the first service on the virtual machine where the agent module is located. When determining the working state of each instance of the first service, the disaster recovery service module may be specifically configured to: determine, for each instance, the time at which the first disaster recovery agent module last reported the application health of the instance, where the first disaster recovery agent module is the disaster recovery agent module on the virtual machine on which the instance is deployed; determine an interruption time from that time; if the interruption time is greater than a fault time threshold, determine that the working state of the instance is faulty; and if the interruption time is less than or equal to the fault time threshold, determine that the working state of the instance is not faulty.
In one possible design, the primary site may further include a heartbeat interface configured to send and receive heartbeat messages between the primary site and the backup site. When sending the first decision result to the backup site, the disaster recovery service module may be specifically configured to send the first decision result to the backup site through the heartbeat interface.
In one possible design, the disaster recovery service module may be further configured to: if the number of failed instances among all instances of the first service does not satisfy the fault policy, determine a second decision result, where the second decision result does not instruct the backup site to take over the services of the primary site.
In one possible design, the primary site may further include an arbitration service module for providing an arbitration service. When sending the first decision result to the backup site, the disaster recovery service module may be specifically configured to send the first decision result to the backup site through the arbitration service provided by the arbitration service module.
In one possible design, the primary site may include two disaster recovery service modules, one acting as the active service and the other as the standby service. The standby disaster recovery service module is configured to take over the work of the active disaster recovery service module when the active disaster recovery service module fails.
In a fourth aspect, the present application provides a site comprising a processor, a memory, a communication interface and a bus, where the processor, the memory and the communication interface are connected by the bus and communicate with one another. The memory stores computer-executable instructions, and when the device runs, the processor executes the computer-executable instructions in the memory to perform, using the hardware resources of the device, the steps of the method in the first aspect or any possible implementation of the first aspect.
In a fifth aspect, the present application provides a disaster recovery system, including the primary site as described in the second aspect or any one of the designs of the second aspect, and the backup site.
In one possible design, the disaster recovery system may further include an arbitration site configured to provide an arbitration service for the primary site and the backup site.
In one possible design, the primary site, the backup site and the arbitration site each include an arbitration service module. The arbitration service module of the arbitration site provides the arbitration service for the primary site and the backup site; the arbitration service module of the primary site sends the primary site's decision result to the backup site through the arbitration service; and the arbitration service module of the backup site obtains the primary site's decision result through the arbitration service.
In a sixth aspect, the present application provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of the first aspect or any one of the possible implementations of the first aspect.
In a seventh aspect, the application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method as described in the first aspect or any one of the possible implementations of the first aspect.
The implementations provided in the above aspects may be further combined to provide additional implementations of the present application.
Drawings
Fig. 1 is a schematic diagram of a single-site protection scheme according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a disaster recovery protection scheme according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a disaster recovery system according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a primary site architecture according to an embodiment of the present application;
Fig. 5 is a schematic flowchart of the first disaster recovery service module updating the fault start time of a first instance according to an embodiment of the present application;
Fig. 6 is a schematic flowchart of the first disaster recovery service module detecting the working state of the first instance according to an embodiment of the present application;
Fig. 7 is a schematic flowchart of disaster recovery switching performed by the first disaster recovery service module according to an embodiment of the present application;
Fig. 8 is a schematic flowchart of a disaster recovery control method according to an embodiment of the present application;
Fig. 9 is a schematic diagram of a disaster recovery switching process according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a primary site according to an embodiment of the present application;
Fig. 11 is a schematic structural diagram of a primary site according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings.
With the development of cloud computing, software systems are evolving toward service-oriented architectures, and a customer's large software system often consists of multiple services or components, such as databases, message middleware, and business applications. In a data center, private cloud, or public cloud environment, the failure of an instance on which a service is deployed in a site causes a service outage, which brings significant economic loss and reputation damage to customers.
To address the problem that a service may be interrupted when an instance fails, a single-site protection scheme may be adopted. The commonly used single-site protection schemes are:
(1) Cluster scheme: multiple instances of the same service are deployed in a site to form a cluster. When some of the instances fail, the other instances in the cluster can continue to provide the service. For example, referring to fig. 1, three instances are deployed for application A in a site; when one of the three instances fails, the other two instances can still provide the service of application A.
(2) Active/standby (cold standby) scheme: two instances of the same service are deployed in a site, where one instance is running and the other is stopped. With an auxiliary monitoring mechanism, when the running instance fails, the stopped instance is started and continues to provide the service. For example, referring to fig. 1, two Nginx instances (Nginx, "Engine X", is a web server, reverse proxy server and e-mail proxy server) are deployed in a site, one running and one stopped; when the running Nginx fails, the stopped Nginx is started to provide the Nginx service.
(3) Active/standby (hot standby) scheme: two instances of the same service are deployed in a site, both running, where one instance is the active service and the other is the standby service. When the active instance fails, the standby instance becomes active and continues to provide the service. For example, referring to fig. 1, two database (DB) instances are deployed, one as the active service and the other as the standby service; when the active DB fails, the standby DB becomes the active service and continues to provide the service.
Single-site protection schemes solve the problem of some instances of a service failing inside a single site. If all instances of the same service within a single site fail, however, the service is interrupted and can no longer be provided.
To address the problem that services stop when a whole site fails, a disaster recovery protection scheme may be adopted: two sites are established, where one serves as the primary site providing services to clients and the other serves as a backup site providing backup capability for the primary site. The backup site monitors whether the primary site has failed by monitoring the heartbeat messages exchanged with the primary site. When the primary site goes down because of a disaster such as an earthquake, a fire, or a broken network link, the heartbeat messages between the primary site and the backup site are interrupted, so when the backup site detects the interruption it takes over the services of the primary site and continues to provide services to clients, as shown in fig. 2. Disaster recovery protection can be divided into geo-redundant disaster recovery, in which the two sites are deployed in different cities, and intra-city disaster recovery, in which the two sites are deployed at different locations in the same city.
However, when all instances deployed for a certain service in the primary site fail and the service is interrupted, the heartbeat messages between the primary site and the backup site remain normal, so neither the primary site nor the backup site can provide that service to clients.
Therefore, in the embodiments of the present application, the working state of each instance in the site is monitored, and services are switched to the backup site according to those working states, which gives finer-grained monitoring than the prior art in which switchover happens only after the whole primary site loses power or fails. For example, when an application in the site cannot provide its service, or when all instances of a certain service fail, the embodiments of the present application can switch to the backup site, so the backup site can continue to provide services to clients, reducing the customer's economic loss and reputation damage.
The term "plurality" as used in the present application means two or more. The term "at least one" means one or more, that is, it includes one, two, three or more.
In addition, it should be understood that in the description of the present application, the words "first," "second," and the like are used merely for distinguishing between the descriptions and not for indicating or implying any relative importance or order.
The disaster recovery control method provided by the embodiments of the present application can be applied to the disaster recovery system shown in fig. 3. The disaster recovery system includes a primary site and a backup site, and may further include an arbitration site. The primary site provides services to clients and also determines a decision result according to the working state of each instance, where the decision result is either a first decision result, which instructs the backup site to take over the services of the primary site, or a second decision result, which does not instruct the backup site to take over the services of the primary site. The backup site takes over the services of the primary site when instructed by the primary site. The arbitration site provides an arbitration service for the primary site and the backup site; the primary site is further configured to send its decision result to the backup site through the arbitration service, and the backup site is further configured to obtain the primary site's decision result through the arbitration service.
In one embodiment, as shown in fig. 4, the primary site may include at least one disaster recovery agent module and a first disaster recovery service module, where each disaster recovery agent module is independently deployed on one virtual machine run by the primary site, and the first disaster recovery service module may be deployed on any virtual machine in the primary site. It should be understood that fig. 4 is only an example and does not limit the number of virtual machines, disaster recovery agent modules or first disaster recovery service modules included in the primary site.
Each disaster recovery agent module is configured to report the application health of each instance on the virtual machine where it is located to the first disaster recovery service module.
A disaster recovery agent module may report the application health of each instance on its virtual machine through steps A1 to A4:
A1. Read the deployment information of the applications on the virtual machine.
A2. Determine, from the deployment information, the instances included on the virtual machine, such as a key service management service, a database or message middleware.
A3. Collect the application health of each instance on the virtual machine. The disaster recovery agent module may collect the application health in an inspection-like manner, for example by querying each instance's process, service state or keep-alive interface.
A4. Report the application health of each instance to the first disaster recovery service module, for example by calling an interface provided by the first disaster recovery service module, such as a YAML-defined interface. A sketch of such an agent is given below.
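The following is a minimal sketch, in Python, of a disaster recovery agent implementing steps A1 to A4. It assumes a JSON deployment descriptor, a TCP keep-alive probe and an HTTP reporting endpoint; the file path, URL and field names are illustrative assumptions and are not specified by the patent.

import json
import socket
import time
import urllib.request

# Hypothetical deployment descriptor listing the instances on this virtual machine (steps A1/A2).
DEPLOYMENT_FILE = "/etc/dr-agent/deployment.json"
# Hypothetical reporting endpoint exposed by the first disaster recovery service module (step A4).
DR_SERVICE_URL = "http://dr-service.primary-site.local:8080/report_health"

def probe_instance(instance):
    """Step A3: collect the application health, here by probing the instance's keep-alive port."""
    try:
        with socket.create_connection((instance["host"], instance["port"]), timeout=2):
            return "normal"
    except OSError:
        return "abnormal"

def report_once():
    with open(DEPLOYMENT_FILE) as f:
        instances = json.load(f)["instances"]          # step A2: instances deployed on this VM
    for inst in instances:
        payload = json.dumps({
            "service": inst["service"],                # e.g. "database", "message-middleware"
            "instance_id": inst["id"],
            "health": probe_instance(inst),
            "timestamp": time.time(),
        }).encode()
        req = urllib.request.Request(DR_SERVICE_URL, data=payload,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req, timeout=5)         # step A4: report to the service module

if __name__ == "__main__":
    while True:                                        # report periodically, like an inspection round
        report_once()
        time.sleep(10)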
The first disaster recovery service module is configured to determine a decision result based on the application health reported by each disaster recovery agent module and to send the first decision result to the backup site.
As a possible implementation, the first disaster recovery service module may determine the decision result through steps B1 to B4:
B1. The first disaster recovery service module parses the fault policy configuration file. The fault policy configuration file may include a fault policy formulated for all services, or fault policies formulated separately for several services; for example, it may include a fault policy tailored for a first service, a fault policy formulated for a second service, and so on. A possible layout of such a file is sketched below.
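The patent does not specify the format of the fault policy configuration file. The sketch below assumes a YAML layout with a default policy plus per-service overrides, parsed with PyYAML; all field names (fault_time_threshold_s, policy, count_threshold, ratio_threshold) are assumptions chosen to match the policies described later in step B4.

import yaml  # PyYAML, assumed available

# Hypothetical content of the fault policy configuration file parsed in step B1.
FAULT_POLICY_YAML = """
default:
  fault_time_threshold_s: 60     # unhealthy this long before an instance counts as failed
  policy: all_failed             # switch only when every instance of the service has failed
services:
  database:
    fault_time_threshold_s: 30
    policy: count_threshold
    count_threshold: 4           # more than 4 failed instances triggers the switch
  message-middleware:
    policy: ratio_threshold
    ratio_threshold: 0.5         # more than half of the instances failed triggers the switch
"""

def load_fault_policies(text=FAULT_POLICY_YAML):
    cfg = yaml.safe_load(text)
    def policy_for(service):
        # fall back to the policy formulated for all services when no per-service policy exists
        return {**cfg["default"], **cfg.get("services", {}).get(service, {})}
    return policy_for

policy_for = load_fault_policies()
print(policy_for("database"))       # per-service policy overrides the default
print(policy_for("business-app"))   # no per-service policy, the default applies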
B2. The first disaster recovery service module receives the application health reported by the disaster recovery agent modules and updates the fault start time of each instance in its cache according to the application health of the instance.
As an example, for each instance on a virtual machine, the first disaster recovery service module sets the fault start time of the instance in the cache to the current time if the received application health of the instance is abnormal and the last application health of the instance recorded in the cache is normal, or there is no record of the instance's application health.
If the received application health of the instance is normal, the first disaster recovery service module may set the application health of the instance to normal in the cache and may also set the fault start time of the instance to 0.
In another example, if the application health of the instance reported by the first virtual machine is not received, the first disaster recovery service module may consider the first virtual machine, and therefore the instance, to have failed, and may set the fault start time of the instance in the cache to the current time.
For a better understanding of the embodiments of the present application, the process of step B2 is described in detail below with reference to a specific application scenario, taking a first instance as an example. Referring to fig. 5, the first disaster recovery service module updates the fault start time of the first instance as follows (a code sketch of this cache update is given after the flowchart steps):
S501, the first disaster recovery service module receives the application health condition of the first instance reported by a first disaster recovery agent module, where the first disaster recovery agent module is the disaster recovery agent module deployed on the virtual machine where the first instance is located. Step S502 is performed.
S502, the first disaster recovery service module determines whether the fault starting time of the first instance exists in the cache. If yes, go to step S503; if not, go to step S507.
S503, the first disaster recovery service module determines whether the application health condition of the first instance in the cache is normal. If yes, go to step S504; if not, step S511 is performed.
S504, the first disaster recovery service module determines whether the application health condition of the first instance reported by the first disaster recovery agent module is normal. If yes, go to step S505, if no, go to step S506.
S505, the first disaster recovery service module does not update the application health condition and the fault start time of the first instance in the cache.
S506, the first disaster recovery service module sets the fault starting time of the first instance in the cache as the current time.
S507, the first disaster recovery service module adds the record of the first instance in the cache. Step S508 is performed.
S508, the first disaster recovery service module determines whether the application health condition of the first instance reported by the first disaster recovery agent module is normal. If yes, go to step S509, if no, go to step S510.
S509, the first disaster recovery service module sets the application health condition of the first instance to be normal and sets the failure start time of the first instance to be 0 in the cache.
S510, the first disaster recovery service module sets the fault starting time of the first instance as the current time in the cache.
S511, the first disaster recovery service module determines whether the application health condition of the first instance reported by the first disaster recovery agent module is normal. If yes, go to step S512, if no, go to step S513.
S512, the first disaster recovery service module sets the application health condition of the first instance to be normal in the cache, and updates the fault starting time of the first instance to be 0.
S513, the first disaster recovery service module does not update the application health condition and the fault start time of the first instance in the cache.
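A minimal sketch of the cache update in fig. 5 (S501 to S513), assuming an in-memory dictionary as the cache; a fault start time of 0 marks a healthy instance, as in S509 and S512.

import time

# cache maps instance_id -> {"health": "normal" or "abnormal", "fault_start": float}
cache = {}

def on_health_report(instance_id, health, now=None):
    """Update the cached fault start time of one instance (fig. 5, S501 to S513)."""
    now = now if now is not None else time.time()
    record = cache.get(instance_id)
    if record is None:                                                       # S502 "no" -> S507
        if health == "normal":                                               # S508
            cache[instance_id] = {"health": "normal", "fault_start": 0}      # S509
        else:
            cache[instance_id] = {"health": "abnormal", "fault_start": now}  # S510
        return
    if record["health"] == "normal":                                         # S503 "yes"
        if health == "normal":
            pass                                                             # S505: nothing to update
        else:
            record["health"] = "abnormal"                                    # S506: the fault starts now
            record["fault_start"] = now
    else:                                                                    # S503 "no" -> S511
        if health == "normal":
            record["health"] = "normal"                                      # S512: instance recovered
            record["fault_start"] = 0
        else:
            pass                                                             # S513: keep the recorded start time

# Usage: a report of "abnormal" following "normal" records the jump time as the fault start time.
on_health_report("db-1", "normal")
on_health_report("db-1", "abnormal")
print(cache["db-1"])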
B3. The working state of each instance is periodically detected according to the fault start time of each instance in the cache.
In one implementation, for each instance, when the fault start time of the instance exists in the cache: if the fault start time of the instance is 0, or the application health of the instance in the cache is normal, the working state of the instance is determined to be not faulty. Otherwise, the fault duration is determined from the fault start time; illustratively, Δt1 = t1 - t2, where Δt1 is the fault duration, t1 is the current time, and t2 is the fault start time. When the fault duration is greater than a fault time threshold, the working state of the instance is determined to be faulty; when the fault duration is less than or equal to the fault time threshold, the working state of the instance is determined to be not faulty.
If the fault start time of the instance does not exist in the cache and the application health of the instance has never been received, the fault start time of the instance may be set to the current time and recorded in the cache.
In another implementation, the time at which the application health of the instance was last received is determined, and an interruption time is derived from it; illustratively, Δt2 = t1 - t3, where Δt2 is the interruption time, t1 is the current time, and t3 is the time the application health of the instance was last received. If the interruption time is greater than the fault time threshold, the working state of the instance is determined to be faulty; if the interruption time is less than or equal to the fault time threshold, the working state of the instance is determined to be not faulty.
For example, the fault time threshold may be configured in the fault policy, so the first disaster recovery service module can determine the fault time threshold by parsing the fault policy configuration file in step B1.
For a better understanding of the embodiments of the present application, the process of step B3 is described in detail below with reference to a specific application scenario. The first disaster recovery service module traverses the instances and detects the working state of each one. Taking the first instance, which belongs to the first service, as an example, and referring to fig. 6, the first disaster recovery service module detects the working state of the first instance as follows (a code sketch is given after the flowchart steps):
S601, the first disaster recovery service module determines whether an application health condition of a first instance from the first disaster recovery agent module has been received. If yes, go to step S602. If not, go to step S608.
S602, the first disaster recovery service module determines the interruption time of the first instance, where the interruption time of the first instance is the difference between the current time and the time at which the application health condition of the first instance was last received from the first disaster recovery agent module. Step S603 is performed.
S603, the first disaster recovery service module determines whether the interruption time of the first instance is greater than a failure time threshold in a failure policy corresponding to the first service. If yes, go to step S604. If not, go to step S605.
S604, the first disaster recovery service module determines that the working status of the first instance is a failure.
S605, the first disaster recovery service module determines a failure duration of the first instance, where the failure duration of the first instance is a difference between a current time and a failure start time of the first instance. Step S606 is performed.
S606, the first disaster recovery service module determines whether the failure duration of the first instance is greater than a failure time threshold in a failure policy corresponding to the first service. If yes, go to step S604. If not, go to step S607.
S607, the first disaster recovery service module determines that the working status of the first instance is not faulty.
S608, the first disaster recovery service module determines whether the fault start time of the first instance exists in the cache. If yes, go to step S605. If not, go to step S609.
S609, the first disaster recovery service module sets the fault starting time of the first instance as the current time in the cache.
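A sketch of the periodic detection in fig. 6 (S601 to S609), reusing the cache layout of the previous sketch and assuming each record also keeps the time at which the last health report was received; the field names are assumptions.

import time

# Assumed record layout: {"health": ..., "fault_start": float, "last_report": float or None}
def detect_state(record, fault_time_threshold_s, now=None):
    """Return True if the working state of the instance is faulty (fig. 6, S601 to S609)."""
    now = now if now is not None else time.time()
    if record.get("last_report") is None:              # S601 "no": never heard from the agent module
        if not record.get("fault_start"):              # S608 "no"
            record["fault_start"] = now                # S609: start counting from now
            return False
        # S608 "yes": fall through to the fault-duration check (S605)
    else:
        interruption = now - record["last_report"]     # S602: interruption time, Δt2 = t1 - t3
        if interruption > fault_time_threshold_s:      # S603
            return True                                # S604: reports stopped for too long
    if record.get("fault_start"):                      # a healthy instance has fault_start == 0
        duration = now - record["fault_start"]         # S605: fault duration, Δt1 = t1 - t2
        if duration > fault_time_threshold_s:          # S606
            return True                                # S604
    return False                                       # S607: the instance is not faulty

# Usage: an instance whose agent stopped reporting 120 s ago exceeds a 60 s threshold and is faulty.
stale = {"health": "abnormal", "fault_start": time.time() - 120, "last_report": time.time() - 120}
print(detect_state(stale, 60))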
B4. For each service, determine whether the number of failed instances among all instances of the service satisfies the fault policy. If so, determine a first decision result, that is, instruct the backup site to take over the services of the primary site. If not, determine a second decision result and do not instruct the backup site to take over the services of the primary site.
For example, the fault policy may include a threshold on the number of failed instances, so that the fault policy is satisfied when the number of failed instances of the service is greater than the threshold, and not satisfied otherwise. For instance, if the threshold on the number of failed instances is 4, the fault policy is satisfied when more than 4 instances of the service are determined to be faulty, and not satisfied otherwise.
Alternatively, the fault policy may include a threshold on the proportion of failed instances, so that the fault policy is satisfied when the proportion of failed instances among all instances of the service is greater than the threshold, and not satisfied otherwise. For example, if the fault policy is that more than half of the instances have failed, the policy is satisfied when the proportion of failed instances exceeds 50% of all instances of the service. As another example, if the fault policy is that all instances have failed, the policy is satisfied when every instance of the service has failed, and not satisfied otherwise.
If the fault policy configuration file includes a fault policy formulated for all services, whether the number of failed instances among all instances of the service satisfies that policy is determined. If the fault policy configuration file includes fault policies formulated separately for several services, whether the number of failed instances among all instances of the service satisfies the fault policy formulated for that service is determined. A sketch of this policy check is given below.
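A minimal sketch of the policy check in step B4, under the assumed policy fields from the configuration sketch above (count threshold, proportion threshold, or all instances failed).

def policy_satisfied(instance_states, policy):
    """instance_states: list of booleans, True meaning the working state of that instance is faulty."""
    failed = sum(instance_states)
    total = len(instance_states)
    kind = policy.get("policy", "all_failed")
    if kind == "count_threshold":
        return failed > policy["count_threshold"]       # e.g. more than 4 failed instances
    if kind == "ratio_threshold":
        return total > 0 and failed / total > policy["ratio_threshold"]  # e.g. more than half
    return total > 0 and failed == total                # default: every instance of the service failed

# Usage: 5 of 6 database instances failed against a count threshold of 4 -> first decision result.
print(policy_satisfied([True] * 5 + [False], {"policy": "count_threshold", "count_threshold": 4}))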
The primary site may further include a second disaster recovery service module deployed on any virtual machine in the primary site. The second disaster recovery service module is configured to take over the work of the first disaster recovery service module when the first disaster recovery service module fails. While the first disaster recovery service module is running, the second disaster recovery service module may be in a stopped state or a running state; the embodiments of the present application do not limit this.
In one implementation, the primary site may further include a heartbeat interface configured to send and receive heartbeat messages between the primary site and the backup site.
As an example, when sending the first decision result to the backup site, the first disaster recovery service module may send the first decision result to the backup site through the heartbeat interface.
The primary site may further include an arbitration service module, which may be deployed on any virtual machine of the primary site and is configured to store decision results. The first disaster recovery service module is further configured to write the decision result into the arbitration service module after determining it.
The backup site and the arbitration site may each also include an arbitration service module. The arbitration service modules in the primary site, the backup site and the arbitration site can form a cluster; after the primary site's arbitration service module writes the decision result, the decision result is shared within the cluster, so the backup site can obtain it by querying.
In another example, when sending the first decision result to the backup site, the first disaster recovery service module may also save the first decision result in the arbitration service module, so that the first decision result is delivered to the backup site through the arbitration service.
The primary site may include two arbitration service modules, where one provides the arbitration service as the active module and the other acts as the standby module; when the active arbitration service module fails, the standby arbitration service module becomes active and continues to provide the arbitration service. While the active arbitration service module is running, the standby arbitration service module may be in a stopped state or a running state; the present application does not limit this. A sketch of how decision results might be shared through the arbitration service is given below.
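The patent does not name a concrete technology for the arbitration service cluster. The sketch below assumes an etcd-like replicated key-value store accessed through the python-etcd3 client; the key name and host are placeholders, and the choice of etcd itself is an assumption. The primary site writes the decision result and the backup site obtains it by querying.

import json
import etcd3  # python-etcd3; using etcd as the arbitration store is an assumption, not from the patent

DECISION_KEY = "/dr/primary-site/decision"

# Arbitration service instance reachable from this site (host and port are placeholders).
arbitration = etcd3.client(host="arbitration.example.local", port=2379)

def publish_decision(decision):
    """Primary site: write the decision result into its arbitration service instance;
    the cluster replicates it to the backup site and the arbitration site."""
    arbitration.put(DECISION_KEY, json.dumps(decision))

def query_decision():
    """Backup site: obtain the primary site's decision result by querying the cluster."""
    value, _ = arbitration.get(DECISION_KEY)
    return json.loads(value) if value is not None else None

# Primary site, on a first decision result:
publish_decision({"result": "first", "service": "database"})
# Backup site, polling periodically:
print(query_decision())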
For a better understanding of the embodiments of the present application, the disaster recovery switching process performed by the first disaster recovery service module is described in detail below with reference to a specific application scenario. Referring to fig. 7 (a code sketch is given after the flowchart steps):
S701, the first disaster recovery service module traverses the services and determines whether the number of failed instances of any service satisfies the corresponding fault policy. If not, go to step S702. If yes, go to step S703.
S702, the first disaster recovery service module obtains a second decision result. Step S707 is performed.
S703, the first disaster recovery service module obtains a first decision result. Step S704 is performed.
S704, the first disaster recovery service module determines whether the local site is the primary site. If yes, go to step S705. If not, the process ends.
S705, the first disaster recovery service module determines whether the heartbeat between the primary site and the backup site is normal. If yes, go to step S706. If not, go to step S707.
S706, the first disaster recovery service module sends the first decision result to the backup site through the heartbeat interface, instructing the backup site to be promoted to the primary site. After being promoted, the backup site takes over the services of the original primary site. Step S707 is performed.
S707, the first disaster recovery service module writes the decision result into the arbitration service module.
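A sketch of the switching flow in fig. 7 (S701 to S707), reusing policy_satisfied from the step B4 sketch; the callables standing in for the heartbeat interface, the arbitration service module and the primary/backup role check are assumptions, not interfaces defined by the patent.

def run_switch_decision(services, instance_states, policy_for, site_is_primary,
                        heartbeat_ok, send_over_heartbeat, write_to_arbitration):
    """One pass of the disaster recovery switching flow (fig. 7, S701 to S707)."""
    # services: iterable of service names; instance_states: dict service -> list of per-instance failed flags
    # S701: does the number of failed instances of any service satisfy its fault policy?
    switch_needed = any(
        policy_satisfied(instance_states[svc], policy_for(svc)) for svc in services
    )
    if not switch_needed:
        decision = {"result": "second"}          # S702: do not instruct a takeover
        write_to_arbitration(decision)           # S707
        return decision
    decision = {"result": "first"}               # S703: instruct the backup site to take over
    if not site_is_primary():                    # S704: only the primary site drives the switch
        return decision
    if heartbeat_ok():                           # S705: heartbeat with the backup site is normal
        send_over_heartbeat(decision)            # S706: promote the backup site directly
    write_to_arbitration(decision)               # S707: also share the result via the arbitration service
    return decision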
In another embodiment, the structures of the backup site and the arbitration site in the disaster recovery system shown in fig. 3 may refer to the structure of the primary site shown in fig. 4, and the description is not repeated here.
The first disaster recovery service module in the backup site may also be configured to receive the first decision result from the primary site through the heartbeat interface of the backup site and to perform the process of taking over the services of the primary site.
The first disaster recovery service module of the backup site may be further configured to obtain the decision result of the primary site through the arbitration service and, if the obtained decision result is the first decision result, to perform the process of taking over the services of the primary site. For example, the first disaster recovery service module of the backup site may periodically query the arbitration service module in the backup site to obtain the decision result of the primary site.
In one embodiment, if the first disaster recovery service module of the backup site has already received the first decision result from the primary site through the heartbeat interface of the backup site, and the decision result later obtained through the arbitration service is also the first decision result, the first disaster recovery service module processes the first decision result only once. A sketch of this behavior is given below.
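A minimal sketch of how the backup site might act on the first decision result exactly once, whether it arrives over the heartbeat interface or via the arbitration service; the takeover routine and the polling interval are assumptions.

import time

class BackupSiteDecisionWatcher:
    """Processes the primary site's first decision result at most once."""

    def __init__(self, query_decision, take_over):
        self.query_decision = query_decision   # e.g. query_decision() from the arbitration sketch
        self.take_over = take_over             # routine that takes over the services of the primary site
        self.handled = False

    def on_heartbeat_decision(self, decision):
        """Called when a decision result arrives over the heartbeat interface."""
        self._handle(decision)

    def poll_forever(self, interval_s=5):
        """Periodically query the arbitration service for the primary site's decision result."""
        while True:
            decision = self.query_decision()
            if decision is not None:
                self._handle(decision)
            time.sleep(interval_s)

    def _handle(self, decision):
        if decision.get("result") == "first" and not self.handled:
            self.handled = True                # act on the first decision result only once
            self.take_over()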
The disaster recovery control method provided by the embodiments of the present application is further described below with reference to the disaster recovery system shown in fig. 3. Fig. 8 is a flowchart of the disaster recovery control method provided by the present application. The method can be carried out by the primary site and the backup site in fig. 3 and may include the following steps:
S801, for a first service provided by the primary site, the primary site determines the working state of each instance of the first service on a plurality of virtual machines in the primary site. The first service may be any service provided by the primary site, or the first service may be a key service of the primary site, such as a database or a key service management service.
In one embodiment, the primary site determines the working state of each instance of the first service on the plurality of virtual machines in the primary site through steps C1 to C4:
C1. For each instance of the first service on the plurality of virtual machines in the primary site, the primary site determines the fault start time of the instance.
As an example, the primary site may receive and record the application health of the instance. If the received application health of the instance is abnormal and the last recorded application health of the instance is normal, or there is no record of the instance's application health, the fault start time of the instance is determined to be the current time. The application health of the instance may be reported by a first virtual machine, where the first virtual machine is the virtual machine in the primary site on which the instance is deployed; further, the first virtual machine may report the application health of the instance periodically.
In another example, if the primary site does not receive the application health of the instance reported by the first virtual machine, the first virtual machine, and therefore the instance, may be considered to have failed, so the fault start time of the instance may be determined to be the current time.
And C2, the master station determines the fault duration according to the fault starting time. Illustratively, Δt1=t1-t 2, where Δt1 is the duration of the fault, t1 is the current time, and t2 is the start of the fault.
And C3, if the fault duration is greater than a fault time threshold, the master station determines that the working state of the instance is a fault.
And C4, if the fault duration is smaller than or equal to the fault time threshold, the master station determines that the working state of the instance is not faulty.
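For illustration only, the following Python sketch shows one way steps C1 to C4 could be implemented. The class name FaultTracker, the report values "normal"/"abnormal", and the 30-second threshold are illustrative assumptions, not part of the claimed method.

```python
import time

FAULT_TIME_THRESHOLD = 30.0  # seconds; an assumed, deployment-specific value


class FaultTracker:
    """Tracks per-instance fault start times and derives the working state."""

    def __init__(self):
        self.last_health = {}        # instance_id -> "normal" / "abnormal"
        self.fault_start_time = {}   # instance_id -> fault start timestamp (t2)

    def on_health_report(self, instance_id, health):
        # C1: record the fault start time when the reported health is abnormal
        # and the previously recorded health is normal (or there is no record).
        previous = self.last_health.get(instance_id)
        if health == "abnormal" and previous in (None, "normal"):
            self.fault_start_time[instance_id] = time.time()
        elif health == "normal":
            self.fault_start_time.pop(instance_id, None)
        self.last_health[instance_id] = health

    def working_state(self, instance_id):
        # C2: fault duration Δt1 = t1 - t2.
        # C3/C4: compare the fault duration with the fault time threshold.
        start = self.fault_start_time.get(instance_id)
        if start is None:
            return "ok"
        fault_duration = time.time() - start
        return "fault" if fault_duration > FAULT_TIME_THRESHOLD else "ok"
```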
In another embodiment, the primary site determines the working state of the first service in each instance of the multiple virtual machines in the primary site, and may further be implemented by steps D1 to D4:
D1, for each instance of the first service in the plurality of virtual machines in the primary site, determining the time at which the application health condition of the instance was last received;
D2, determining the interruption time according to the time at which the application health condition of the instance was last received. Illustratively, Δt2 = t1 - t3, where Δt2 is the interruption time, t1 is the current time, and t3 is the time at which the application health condition of the instance was last received.
D3, if the interruption time is greater than a fault time threshold, determining that the working state of the instance is a fault.
D4, if the interruption time is less than or equal to the fault time threshold, determining that the working state of the instance is not a fault.
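Analogously, a minimal sketch of steps D1 to D4 under the same illustrative assumptions: the working state is derived from how long ago a health report for the instance was last received.

```python
import time

FAULT_TIME_THRESHOLD = 30.0  # seconds; illustrative value


class InterruptionTracker:
    """Derives the working state from when a health report was last received."""

    def __init__(self):
        self.last_report_time = {}   # instance_id -> timestamp of last report (t3)

    def on_health_report(self, instance_id):
        # D1: remember when the application health of this instance was last received.
        self.last_report_time[instance_id] = time.time()

    def working_state(self, instance_id):
        # D2: interruption time Δt2 = t1 - t3.
        # D3/D4: compare the interruption time with the fault time threshold.
        last = self.last_report_time.get(instance_id)
        if last is None:
            return "fault"   # never reported; treated here as a fault
        interruption = time.time() - last
        return "fault" if interruption > FAULT_TIME_THRESHOLD else "ok"
```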
In an implementation, the disaster recovery agent modules in the plurality of virtual machines in the primary site may report the application health condition of each instance of the first service to the first disaster recovery service module of the primary site. The first disaster recovery service module may then determine the working state of each instance of the first service based on the application health condition of each instance.
For the process by which the disaster recovery agent modules in the plurality of virtual machines in the primary site report the application health condition of each instance of the first service to the first disaster recovery service module of the primary site, refer to steps A1 to A4 above; the description is not repeated here. For the process by which the first disaster recovery service module determines the working state of each instance of the first service, refer to step B3 above; the detailed description is not repeated here.
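The agent-side reporting could look like the following sketch. The HTTP transport, the endpoint URL, the report interval, and the check_health() probe are all illustrative assumptions; the embodiments do not prescribe a particular transport or report format.

```python
import json
import time
import urllib.request

REPORT_INTERVAL = 5.0                                # seconds; assumed value
DR_SERVICE_URL = "http://dr-service.local/health"    # hypothetical endpoint


def check_health(instance_id):
    """Placeholder probe; a real agent would check the process, port, etc."""
    return "normal"


def report_loop(instance_ids):
    """Periodically reports the application health of each local instance."""
    while True:
        payload = {
            "timestamp": time.time(),
            "instances": {i: check_health(i) for i in instance_ids},
        }
        request = urllib.request.Request(
            DR_SERVICE_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        try:
            urllib.request.urlopen(request, timeout=2)
        except OSError:
            pass  # a missing report is treated by the service module as a fault
        time.sleep(REPORT_INTERVAL)
```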
S802, when the number of instances whose working state is a fault, among all instances of the first service in the primary site, meets a fault policy, the primary site determines a first decision result, where the first decision result instructs the backup site to take over the service of the primary site.
Otherwise, if the number of instances whose working state is a fault, among all instances of the first service in the primary site, does not meet the fault policy, the primary site may determine a second decision result, where the second decision result does not instruct the backup site to take over the service of the primary site.
For example, the fault policy may include a threshold on the number of faulty instances, so that when the number of instances whose working state is a fault is greater than that threshold, the fault policy is determined to be met; otherwise, it is determined not to be met. For example, if the threshold on the number of faulty instances is 4, the fault policy is determined to be met when more than 4 instances of the service have a working state of fault, and otherwise determined not to be met.
Alternatively, the fault policy may include a threshold on the proportion of faulty instances, so that when the proportion of faulty instances among all instances of the service is greater than that threshold, the fault policy is determined to be met; otherwise, it is determined not to be met. For example, if the fault policy is that more than half of the instances are faulty, the policy is determined to be met when faulty instances account for more than 50% of all instances of the service, and otherwise determined not to be met. As another example, if the fault policy is that all instances are faulty, the policy is determined to be met when every instance of the service has failed, and otherwise determined not to be met.
Specifically, the fault policy may be a policy formulated for the first service, or a policy formulated for all services; this is not specifically limited in the present application.
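A minimal sketch of how either kind of fault policy could be evaluated; the dictionary-based policy format is an illustrative assumption.

```python
def fault_policy_met(states, policy):
    """states: per-instance working states ("fault" / "ok"); policy: a dict."""
    faulted = sum(1 for state in states if state == "fault")
    if "count_threshold" in policy:            # e.g. {"count_threshold": 4}
        return faulted > policy["count_threshold"]
    if "ratio_threshold" in policy:            # e.g. {"ratio_threshold": 0.5}
        return bool(states) and faulted / len(states) > policy["ratio_threshold"]
    if policy.get("all_instances"):            # "all instances are faulty"
        return bool(states) and faulted == len(states)
    return False


# Example: 5 of 8 instances are faulty and the policy is "more than half".
print(fault_policy_met(["fault"] * 5 + ["ok"] * 3, {"ratio_threshold": 0.5}))  # True
```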
In implementation, step S802 may be performed by a first disaster recovery service module in the host site. The process of determining the decision result by the first disaster recovery service module may refer to the above step B4, and the description thereof will not be repeated here.
S803, the primary site sends the first decision result to the backup site.
In one implementation, the primary site may send the first decision result directly to the backup site. Illustratively, the primary site may send the first decision result to the backup site over a heartbeat network with the backup site.
In another implementation, the primary site may send the first decision result to the backup site through the arbitration service. For example, the primary site may write the first decision result into the instance of the arbitration service deployed at the primary site, where instances of the arbitration service may be deployed in the primary site, the backup site, and the arbitration site. The instance of the arbitration service in the primary site, the instance in the backup site, and the instance in the arbitration site form a cluster, so the first decision result written by the instance of the arbitration service of the primary site is shared within the cluster. The backup site may therefore obtain the first decision result from the cluster by querying.
In implementation, step S803 may be performed by the first disaster recovery service module of the primary site.
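A hedged sketch of step S803 via the arbitration service. ArbitrationClient stands in for a client of whatever replicated store the three arbitration-service instances form (the embodiments do not name one); the key and the in-memory put()/get() are illustrative.

```python
DECISION_KEY = "/dr/primary-site/decision"   # hypothetical key


class ArbitrationClient:
    """Stand-in for a client of the replicated arbitration-service cluster."""

    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = value    # a real cluster would replicate this write

    def get(self, key):
        return self._store.get(key)


def publish_decision(arbitration, decision):
    # Primary site: write the decision result into its local arbitration-service
    # instance; the cluster shares it with the backup site.
    arbitration.put(DECISION_KEY, decision)


def poll_decision(arbitration):
    # Backup site: periodically query its local arbitration-service instance.
    return arbitration.get(DECISION_KEY)
```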
S804, the backup site takes over the service of the primary site after receiving the first decision result. The backup site may take over all services of the primary site, or the backup site may take over only the first service of the primary site; this is not specifically limited in the present application.
In implementation, step S804 may be performed by the first disaster recovery service module of the backup site.
According to this embodiment of the application, the primary site can monitor the working state of each instance and then decide, based on those working states, whether the service should actually be switched to the backup site. For example, when the primary site detects that all instances of a service have failed, it instructs the backup site to take over the service of the primary site, so that the backup site can continue to provide the service to clients, thereby reducing the clients' economic loss and reputation damage.
In order to better understand the embodiment of the present application, a process of disaster recovery switching is described in detail below with reference to the disaster recovery system shown in fig. 4. The disaster recovery switching process is shown in fig. 9:
S901, the disaster recovery agent module of the primary site collects the application health condition of each instance on its virtual machine.
For example, the disaster recovery agent module may collect the application health condition of each instance on the virtual machine where the disaster recovery agent module is located; refer to steps A1 to A3 above, the description is not repeated here.
S902, the disaster recovery agent module of the master site reports the acquired application health condition to the first disaster recovery service module of the master site.
S903, the first disaster recovery service module of the main site analyzes the fault policy configuration file.
S904, the first disaster recovery service module of the main site gathers the application health conditions reported by the disaster recovery agent module of the main site, and makes a decision result.
For the process by which the first disaster recovery service module summarizes the application health conditions reported by the disaster recovery agent modules of the primary site and makes the decision result, refer to steps B2 to B4 above; the detailed description is not repeated here.
S905, the first disaster recovery service module of the master site writes the decision result into the arbitration service module of the master site.
S906, if the decision result made by the first disaster recovery service module of the main site is that the backup site is instructed to take over the service of the main site and the heartbeat between the main site and the backup site is normal, the first disaster recovery service module of the main site calls the heartbeat interface of the main site to send the decision result through the heartbeat network between the main site and the backup site.
S907, after the first disaster recovery service module of the backup site receives the decision result instructing the backup site to take over the service of the primary site, it performs the operation of taking over the service of the primary site.
S908, the first disaster recovery service module of the backup site periodically queries the arbitration service module of the backup site for the decision result of the primary site.
S909, if the decision result queried in the arbitration service module of the backup site instructs the backup site to take over the service of the primary site, the first disaster recovery service module of the backup site performs the operation of taking over the service of the primary site.
In one implementation, if, before step S909, the first disaster recovery service module of the backup site has already received from the primary site a decision result instructing the backup site to take over the service of the primary site, then when the decision result it queries in the arbitration service module of the backup site also instructs the backup site to take over the service of the primary site, the operation of taking over the service of the primary site is not performed again.
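A minimal sketch of the once-only takeover described in S907 to S909: whether the decision result arrives over the heartbeat network or is later found in the arbitration service, the takeover operation runs at most once. The flag-and-lock deduplication and the decision value "TAKE_OVER_PRIMARY" are illustrative choices.

```python
import threading


class BackupSiteDRService:
    """Handles takeover decisions arriving via heartbeat or arbitration polling."""

    def __init__(self):
        self._taken_over = False
        self._lock = threading.Lock()

    def _take_over_primary_services(self):
        ...  # start the first service locally, switch client traffic, etc.

    def handle_decision(self, decision):
        # Called both when a decision arrives over the heartbeat network (S907)
        # and when one is found by polling the arbitration service (S908/S909).
        if decision != "TAKE_OVER_PRIMARY":
            return
        with self._lock:
            if self._taken_over:      # already handled via the other channel
                return
            self._taken_over = True
        self._take_over_primary_services()
```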
Based on the same inventive concept as the embodiments described above, an embodiment of the present invention provides a primary site 100, specifically configured to implement the method described in the embodiment depicted in fig. 8. A plurality of virtual machines run on the primary site 100, and instances of the first service may run on these virtual machines. A working state unit 101, a decision unit 102 and a sending unit 103 are deployed on a first virtual machine running on the primary site, where the first virtual machine may be one of the plurality of virtual machines, or may not be one of the plurality of virtual machines. Taking the case where the first virtual machine is not one of the plurality of virtual machines as an example, the structure of the primary site 100 may be as shown in fig. 10. It should be understood that fig. 10 is only an exemplary illustration of the primary site structure, and is not intended to limit the number of virtual machines included in the primary site, the number and type of services provided by the primary site, the relationship of the first virtual machine to the plurality of virtual machines, and so on.
The working state unit 101 is configured to determine, for a first service provided by the primary site, a working state of each instance of the first service in the plurality of virtual machines in the primary site. The decision unit 102 is configured to determine a first decision result when the number of instances whose working state is a fault, among all instances of the first service in the plurality of virtual machines in the primary site, meets a fault policy, where the first decision result instructs a backup site to take over the service of the primary site. The sending unit 103 is configured to send the first decision result to the backup site.
In one implementation, the working state unit 101 may be specifically configured to: determining, for each instance of the first service in a plurality of virtual machines in the primary site, a failure start time for the instance; determining a fault duration according to the fault start time; if the fault duration is greater than a fault time threshold, determining that the working state of the instance is a fault; and if the fault duration is less than or equal to the fault time threshold, determining that the working state of the instance is not faulty.
In an exemplary illustration, when determining the fault start time of the instance, the working state unit 101 may be specifically configured to: receive and record the application health condition of the instance; and if the received application health condition of the instance is abnormal, and the application health condition of the instance recorded last time is normal or there is no record of the application health condition of the instance, determine that the fault start time of the instance is the current time.
In another exemplary illustration, the working state unit 101, when determining the fault start time of the instance, may be further specifically configured to: if the application health condition of the instance reported by the first virtual machine is not received, determining the fault starting time of the instance as the current time, wherein the first virtual machine is the virtual machine for deploying the instance in the master site.
In another implementation manner, the working state unit 101 may be further specifically configured to: determining, for each instance of the first service in the plurality of virtual machines in the host site, a time of last receipt of an application health of the instance; determining an interruption time according to the last time the application health condition of the instance was received; if the interruption time is greater than a fault time threshold, determining that the working state of the instance is a fault; and if the interruption time is smaller than or equal to the fault time threshold value, determining that the working state of the instance is not faulty.
The decision unit 102 may be further configured to: when the number of instances whose working state is a fault, among all instances of the first service, does not meet the fault policy, determine a second decision result, where the second decision result does not instruct the backup site to take over the service of the primary site.
In a possible implementation, the sending unit 103 may be specifically configured to: and sending the first decision result to the standby site through arbitration service.
The primary site 100 may be the primary site in the embodiment corresponding to fig. 3 or fig. 4, and is configured to perform the operations performed by the primary site in the embodiments corresponding to fig. 5 to fig. 9. The working state unit 101, the decision unit 102 and the sending unit 103 in the primary site 100 may be software units in the first disaster recovery service module shown in fig. 4.
The division of modules in the embodiments of the present application is schematic and is merely a logical function division; there may be other division manners in actual implementation. In addition, the functional modules in the embodiments of the present application may be integrated into one processor, may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
Where the integrated module is implemented in hardware, the primary site may include a processor 802, as shown in fig. 11. Multiple virtual machines may run on the processor 802. The hardware entity corresponding to the above modules may be the processor 802. The processor 802 may be a central processing unit (CPU), a digital processing module, or the like. The primary site may also include communication interfaces 801A and 801B; the processor 802 may send and receive messages between the primary site and the backup site through the communication interface 801A, where the communication interface 801A may be a heartbeat interface. The processor 802 may send and receive messages to and from the arbitration site through the communication interface 801B. The primary site further comprises a memory 803 for storing programs executed by the processor 802. The memory 803 may be a nonvolatile memory such as a hard disk drive (HDD) or a solid state drive (SSD), or may be a volatile memory such as a random access memory (RAM). The memory 803 may also be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The processor 802 is configured to execute the program code stored in the memory 803, and in particular to perform the corresponding functions of the disaster recovery agent module and the first disaster recovery service module.
The specific connection medium between the communication interface 801A, the communication interface 801B, the processor 802, and the memory 803 is not limited in the embodiment of the present application. In the embodiment of the present application, the memory 803, the processor 802, the communication interface 801A and the communication interface 801B are connected through the bus 804 in fig. 11; the bus is shown by a thick line in fig. 11, and the connection manner between other components is only illustrated schematically and is not limited thereto. The bus may be classified into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 11, but this does not mean that there is only one bus or one type of bus.
The primary site shown in fig. 11 may be the primary site in the embodiment corresponding to fig. 3 or fig. 4. The processor 802 in the primary site executes the computer readable instructions in the memory 803, which may cause the primary site to perform the operations performed by the primary site in the embodiments corresponding to fig. 5 to fig. 9. The memory 803 stores an operating system based on Linux, Unix, or Windows, and virtual machine software instructions for generating virtual machines on the operating system. Based on the operating system, the processor 802 executes the virtual machine software instructions to run a plurality of virtual machines on the primary site shown in fig. 11, thereby obtaining a primary site including a plurality of virtual machines as shown in fig. 4 or fig. 10.
The embodiment of the application also provides a computer readable storage medium for storing the computer software instructions to be executed by the above processor, which contains the program to be executed by the processor.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state drive (SSD).
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. The skilled person may use different methods for each specific application to achieve the described functionality.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Claims (9)

1. A disaster recovery control method, the method comprising:
determining, for a first service provided by a primary site, a working state of the first service in each instance in a plurality of virtual machines in the primary site;
when the number of instances whose working state is a fault, among all instances of the first service in the plurality of virtual machines in the primary site, meets a fault policy, determining a first decision result, wherein the first decision result instructs a backup site to take over a service of the primary site;
sending the first decision result to the backup site;
wherein determining the working state of the first service in each instance in the plurality of virtual machines in the primary site comprises:
determining, for each instance of the first service in the plurality of virtual machines in the primary site, the time at which an application health of the instance was last received;
determining an interruption time according to the time at which the application health of the instance was last received;
if the interruption time is greater than a fault time threshold, determining that the working state of the instance is a fault; and
if the interruption time is less than or equal to the fault time threshold, determining that the working state of the instance is not a fault.
2. The method of claim 1, wherein after determining the working state of the first service in each instance in the plurality of virtual machines in the primary site, the method further comprises:
if the number of instances whose working state is a fault, among all instances of the first service in the primary site, does not meet the fault policy, determining a second decision result, wherein the second decision result does not instruct the backup site to take over the service of the primary site.
3. The method of claim 1, wherein the sending the first decision result to the backup site comprises:
sending the first decision result to the backup site through an arbitration service.
4. A primary site, comprising: a plurality of virtual machines, and a working state unit, a decision unit and a sending unit deployed on a first virtual machine, wherein
the working state unit is configured to determine, for a first service provided by the primary site, a working state of the first service in each instance in the plurality of virtual machines in the primary site;
the decision unit is configured to determine a first decision result when the number of instances whose working state is a fault, among all instances of the first service in the plurality of virtual machines in the primary site, meets a fault policy, wherein the first decision result instructs a backup site to take over a service of the primary site;
the sending unit is configured to send the first decision result to the backup site;
the working state unit is specifically configured to:
determine, for each instance of the first service in the plurality of virtual machines in the primary site, the time at which an application health of the instance was last received;
determine an interruption time according to the time at which the application health of the instance was last received;
if the interruption time is greater than a fault time threshold, determine that the working state of the instance is a fault; and
if the interruption time is less than or equal to the fault time threshold, determine that the working state of the instance is not a fault.
5. The primary site of claim 4, wherein the decision unit is further configured to:
when the number of instances whose working state is a fault, among all instances of the first service, does not meet the fault policy, determine a second decision result, wherein the second decision result does not instruct the backup site to take over the service of the primary site.
6. The primary site of claim 4, wherein the sending unit is specifically configured to:
send the first decision result to the backup site through an arbitration service.
7. The primary site of claim 4, wherein the first virtual machine is one of the plurality of virtual machines, or the first virtual machine is not one of the plurality of virtual machines.
8. A disaster recovery system comprising a primary site as claimed in any one of claims 4 to 7, and a backup site.
9. The disaster recovery system of claim 8, further comprising an arbitration site;
wherein the arbitration site is configured to provide an arbitration service for the primary site and the backup site.
CN201811513686.9A 2018-12-11 2018-12-11 Disaster recovery control method, device and system Active CN111309515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811513686.9A CN111309515B (en) 2018-12-11 2018-12-11 Disaster recovery control method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811513686.9A CN111309515B (en) 2018-12-11 2018-12-11 Disaster recovery control method, device and system

Publications (2)

Publication Number Publication Date
CN111309515A CN111309515A (en) 2020-06-19
CN111309515B true CN111309515B (en) 2023-11-28

Family

ID=71150545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811513686.9A Active CN111309515B (en) 2018-12-11 2018-12-11 Disaster recovery control method, device and system

Country Status (1)

Country Link
CN (1) CN111309515B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112714461B (en) * 2021-01-29 2022-05-31 四川安迪科技实业有限公司 DAMA satellite network central station protection switching method
CN116962153A (en) * 2022-04-20 2023-10-27 华为云计算技术有限公司 Disaster recovery management method and disaster recovery management equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102473105A (en) * 2010-01-04 2012-05-23 阿瓦雅公司 Packet mirroring between primary and secondary virtualized software images for improved system failover performance
CN103118100A (en) * 2013-01-25 2013-05-22 武汉大学 Guarantee method and guarantee system for improving usability of virtual machine application
CN103778031A (en) * 2014-01-15 2014-05-07 华中科技大学 Distributed system multilevel fault tolerance method under cloud environment
CN103812675A (en) * 2012-11-08 2014-05-21 中兴通讯股份有限公司 Method and system for realizing allopatric disaster recovery switching of service delivery platform
CN104205060A (en) * 2012-04-12 2014-12-10 国际商业机器公司 Providing application based monitoring and recovery for a hypervisor of an ha cluster
CN108632057A (en) * 2017-03-17 2018-10-09 华为技术有限公司 A kind of fault recovery method of cloud computing server, device and management system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080189700A1 (en) * 2007-02-02 2008-08-07 Vmware, Inc. Admission Control for Virtual Machine Cluster


Also Published As

Publication number Publication date
CN111309515A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
EP3518110B1 (en) Designation of a standby node
WO2018036148A1 (en) Server cluster system
CN102394914A (en) Cluster brain-split processing method and device
US7702667B2 (en) Methods and systems for validating accessibility and currency of replicated data
US9450700B1 (en) Efficient network fleet monitoring
CN105933391A (en) Node capacity expansion method, device and system
CN110830283B (en) Fault detection method, device, equipment and system
US20080288812A1 (en) Cluster system and an error recovery method thereof
CN103354503A (en) Cloud storage system capable of automatically detecting and replacing failure nodes and method thereof
CN107229425B (en) Data storage method and device
CN113595836A (en) Heartbeat detection method of high-availability cluster, storage medium and computing node
CN111309515B (en) Disaster recovery control method, device and system
CN114064374A (en) Fault detection method and system based on distributed block storage
CN111800484B (en) Service anti-destruction replacing method for mobile edge information service system
CN108512753B (en) Method and device for transmitting messages in cluster file system
WO2019109019A1 (en) Reducing recovery time of an application
CN111342986B (en) Distributed node management method and device, distributed system and storage medium
CN113489149B (en) Power grid monitoring system service master node selection method based on real-time state sensing
CN117076196A (en) Database disaster recovery management and control method and device
CA2241861C (en) A scheme to perform event rollup
CN116668269A (en) Arbitration method, device and system for dual-activity data center
CN110675614A (en) Transmission method of power monitoring data
WO2019241199A1 (en) System and method for predictive maintenance of networked devices
CN114553900B (en) Distributed block storage management system, method and electronic equipment
CN114124803B (en) Device management method and device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant