CN118101435A - High-availability service method and system based on dual-machine hot standby

Info

Publication number: CN118101435A
Application number: CN202311844082.3A
Authority: CN (China)
Legal status: Pending
Prior art keywords: node, server, slave, master, mode
Inventors: 徐跃福, 傅阳, 陈刚, 杨竞毅
Assignee: Next Day Technology Chongqing Co ltd
Filing date: 2023-12-28
Publication date: 2024-05-28
Classification: Hardware Redundancy

Abstract

The invention discloses a high-availability service method based on dual-machine hot standby, which comprises a master server and a slave server serving as service nodes. The application programs on the two servers are built as a multi-process application spanning hardware based on the Microsoft Orleans framework, and the master server and the slave server communicate through the VRRP protocol; one of the two servers operates in host mode, and host mode is switched from the current server to the other server when triggered by a VRRP heartbeat failure. The invention solves the problem of insufficient persistent data protection in the prior art.

Description

High-availability service method and system based on dual-machine hot standby
Technical Field
The invention relates to the technical field of server backup, in particular to a high-availability service method and system based on dual-machine hot standby.
Background
With the continuing expansion of information technology applications, data now permeates every aspect of life and has become an important asset for enterprises and institutions. Because data is vulnerable, research on data protection has always been an important subject in the industry. In industries where data value is particularly high, ever-increasing demands on data protection create a need for persistent data protection. However, the server-side persistent data protection of the prior art still has some problems:
(1) The persistent data protection of the prior art is limited to stand-alone protection. Current persistent protection is mostly directed at single-machine protection. In practical production applications, however, a dual-machine environment is very common: two business machines are divided into a master and a slave, and a data disk is mounted on the master and slave through a dual-machine system. When the master fails, master and slave switch automatically, so service processing is not interrupted and the data remains on the same data disk. Current single-machine protection hardly meets the requirements of dual-machine services, because in a dual-machine environment the real data disks are generally products on which third-party persistent data protection cannot be installed; it can only be installed on the master and slave business machines. If a third-party continuous data protection product is installed on the master and slave business machines, it effectively protects master and slave as two separate disks in single-machine protection mode, and data is inevitably lost when master and slave switch.
(2) Dual-machine protection is limited to non-persistent protection. The data backup of dual-machine protection is generally based on periodic data copying, whose backup granularity is far coarser than that of real-time protection and block-level protection. Data written between two backup points can be lost, whereas block-level protection ensures that the snapshot disk and the real disk are consistent at block level at the snapshot moment, so the data is more complete.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a high-availability service method and system based on dual-machine hot standby, to solve the problem of insufficient persistent data protection in the prior art.
In order to achieve the above object, the present invention provides the following technical solutions:
The high-availability service method based on dual-machine hot standby comprises a master server and a slave server serving as service nodes. The application programs on the two servers are built as a multi-process application spanning hardware based on the Microsoft Orleans framework, and the master server and the slave server communicate through the VRRP protocol; one of the two servers operates in host mode, and host mode is switched from the current server to the other server when triggered by a VRRP heartbeat failure.
Further, the two servers are provided with the same PostgreSQL database, Redis cache, MinIO file server, gateway program, Portainer container management tool, pgwatch database tool and application programs;
the Portainer container management tool provides a Docker management interface for the gateway program;
the application programs run as service programs in Docker containers according to service requests, and the gateway program coordinates the running of the service programs and provides reverse proxy service, so that the same application program has only one running instance across the two service nodes;
the pgwatch database tool sets the PostgreSQL database into master-slave mode and starts a backup stream; the Redis cache operates in master-slave mode; the MinIO file server operates as a cluster in mirror mode.
Further, for the PostgreSQL database, host mode switching between the master and slave servers is implemented as follows:
1) The keepalived tool is used to apply the same routing configuration to the master server and the slave server:
in non-preemption mode, the master server and the slave server are configured with a BACKUP router state, a network interface, a virtual router ID, a priority, a VRRP heartbeat-packet check period, a notification script, authentication, the PostgreSQL database virtual IP and the heartbeat-triggered fault conditions of the VRRP protocol;
the notification script records the current state of the PostgreSQL database and the switching log of the current virtual IP; the application program accesses the database only through the PostgreSQL virtual IP, and the virtual IP always points to the PostgreSQL database of the server running in host mode;
2) The pgwatch database tool processing procedure comprises the following steps:
Step a, check the connection state of the PostgreSQL database; if the connection is normal, go to step b; if the connection times out, go to step e;
Step b, check the operation mode of the PostgreSQL database; if it is host mode, go to step c; if it is standby mode, go to step f;
Step c, check the switching log of the current virtual IP through the notification script; if the log record indicates that the current node state is host, go to step d; otherwise, it is judged that the current PostgreSQL database is running in host mode while the master-slave relationship is broken, and the database's follower state is restored;
Step d, check the backup stream; if its state is normal, the current PostgreSQL database is judged to be running normally in host mode; if the backup stream is interrupted, start checking and restoring the backup stream;
Step e, check whether the number of timeouts has reached the maximum; if so, modify the record value in the notification script to indicate that the PostgreSQL database is running abnormally, and start checking and recovery;
Step f, check the switching log of the current virtual IP through the notification script; if the log record indicates that the current node state is standby, go to step g; if it indicates that the current node state is host, switch the PostgreSQL database of the current node to host mode; if it indicates that the PostgreSQL database is not running normally, start checking and recovery;
Step g, check the state of the backup stream; if it is interrupted, start checking and restoring the backup stream; if it is normal, the PostgreSQL database of the current node runs in standby mode.
The invention also provides a high-availability service system based on dual-machine hot standby, comprising:
a processor; a memory for storing the processor-executable instructions;
The processor is configured to read the executable instructions from the memory and execute the instructions to implement the method according to the present invention.
The invention has the beneficial effects that:
1. The integrated dual-machine hot-standby solution, covering the database (PostgreSQL), cache/message queue (Redis), file server (MinIO), gateway program and application programs, is applicable to servers of multiple service types.
2. Under the coordination of the gateway program, each application runs as a single instance and no running-state conflicts occur. If an application is detected with no running instance, a start instruction is issued automatically; if the host and the standby machine start the same application simultaneously, a built-in mutual-exclusion start mechanism ensures that only one instance starts successfully. This single-node application model suits small and medium-scale application scenarios and has low hardware resource requirements.
3. The method supports continuous, repeated automatic switching. Except for database state recovery, which requires human intervention after a master-standby switch, recovery is unattended and automatic; the keepalived services on a failed node need not be stopped during operation, and as long as the two machines do not fail simultaneously, long-term stable operation is maintained.
The scheme of the invention builds a flexible, lightweight dual-machine hot-standby implementation with only two servers. When one host fails, all necessary services, including databases, caches, file servers and application programs, switch over automatically and unattended, and the scheme can serve as infrastructure for common application deployment scenarios. Compared with common high-availability schemes on the market, it achieves dual-machine hot standby with as few as two servers and a low hardware-configuration threshold. Built on free open-source tools, it is low-cost, highly customizable and flexible, and the functions of the components can be freely matched and extended according to specific requirements.
Drawings
FIG. 1 is a flow chart of the processing logic of the pgwatch database tool on a PostgreSQL database in one embodiment of the invention.
Detailed Description
The following describes the embodiments of the present disclosure clearly and fully with reference to the accompanying drawings; evidently, the described embodiments are only some, not all, of the embodiments of the present disclosure. The following description of at least one exemplary embodiment is merely illustrative and in no way limits the disclosure, its application, or its uses. All other embodiments obtained by those of ordinary skill in the art based on the embodiments disclosed herein without inventive effort fall within the scope of the present disclosure.
The relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
Meanwhile, it should be understood that, for convenience of description, the sizes of the parts shown in the drawings are not drawn to actual scale.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
The invention will be described in further detail with reference to specific embodiments and drawings.
The invention is based on the VRRP protocol and the Microsoft Orleans framework.
VRRP protocol: VRRP stands for Virtual Router Redundancy Protocol, a protocol that makes routers highly available. N routers providing the same function are combined into a router group containing one master and multiple backups. The master holds a VIP (virtual IP address) through which services are provided externally, and the master sends multicast advertisements; when the backups stop receiving VRRP packets, the master is considered down, and one backup is then selected as the new master according to VRRP priority. In this way, high availability of the routers is guaranteed. keepalived achieves high availability via the VRRP protocol.
Microsoft Orleans: orleans is a cross-platform framework for building reliable and scalable distributed applications. Distributed applications are defined as applications that span multiple processes, often using peer-to-peer communication to cross hardware boundaries. Orleans extend from a single local server to hundreds or thousands of distributed, highly available applications in the cloud. Orleans extend the familiar concepts and C# idioms to a multi-server environment. Orleans are flexibly scalable in design. When a host joins a cluster, it can accept the new activation. When a host exits the cluster due to a vertical shrink or computer failure, the previous activation on that host will be re-activated on the remaining hosts as needed.
The invention provides a high-availability service method based on dual-machine hot standby, comprising a master server and a slave server serving as service nodes. The application programs on the two servers are built as a multi-process application spanning hardware based on the Microsoft Orleans framework, and the master server and the slave server communicate through the VRRP protocol; one of the two servers operates in host mode, and host mode is switched from the current server to the other server when triggered by a VRRP heartbeat failure.
According to the invention, only two servers are needed: when one of them can no longer work due to downtime or network failure, service automatically switches to the other server, which continues the work, ensuring uninterrupted operation of important applications.
The two servers are provided with the same PostgreSQL database, Redis cache, MinIO file server, gateway program, Portainer container management tool, pgwatch database tool and application programs;
the Portainer container management tool provides a Docker management interface for the gateway program;
the application programs run as service programs in Docker containers according to service requests, and the gateway program coordinates the running of the service programs and provides reverse proxy service, so that the same application program has only one running instance across the two service nodes;
the pgwatch database tool sets the PostgreSQL database into master-slave mode and starts a backup stream; the Redis cache operates in master-slave mode; the MinIO file server operates as a cluster in mirror mode.
The following describes the scheme in detail using typical application scenarios as examples.
Embodiment one, PostgreSQL database application scenario:
① keepalived configuration
The main logic of the notification script (pgkpsate.sh) is to record the current state of the database and the switching log of the current virtual IP:
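The script itself appears only as an image in the original publication. A minimal shell sketch of the described logic follows; the state-file and log paths, and the passwordless psql invocation, are illustrative assumptions, not the patent's actual code.

#!/bin/bash
# pgkpsate.sh -- hypothetical reconstruction of the notify script described above.
# keepalived invokes notify scripts as: <script> INSTANCE|GROUP <name> MASTER|BACKUP|FAULT
TYPE=$1
NAME=$2
STATE=$3                                       # new VIP state after a switch
STATE_FILE=/etc/keepalived/pg_vip_state        # assumed location of the switch record
LOG_FILE=/var/log/keepalived/pg_vip_switch.log # assumed switch-log location

# Record the current PostgreSQL role alongside the new VIP state.
PG_ROLE=$(psql -U postgres -Atc "SELECT CASE WHEN pg_is_in_recovery() THEN 'standby' ELSE 'host' END;" 2>/dev/null || echo "unreachable")

echo "$STATE" > "$STATE_FILE"
echo "$(date '+%F %T') instance=$NAME vip_state=$STATE pg_role=$PG_ROLE" >> "$LOG_FILE"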
The virtual IP state switch of keepalived is triggered by heartbeat failure, generally caused by network failure or server downtime. In addition, the database needs periodic health checks of its running state, with corresponding follow-up processing logic executed under certain conditions.
The application program always accesses the database service through the virtual IP, and the virtual IP always points to the server node whose database is running in host mode. After a failure, the database service remains available as long as the database of one node runs in host mode and the virtual IP correctly points to that node.
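For reference, a minimal keepalived configuration consistent with the items listed earlier might look as follows on one node; the interface name, virtual router ID, priorities, password and VIP are illustrative assumptions.

! Sketch of the PostgreSQL VRRP instance (illustrative values)
vrrp_instance PG_VIP {
    state BACKUP                 ! both nodes are configured as BACKUP routers
    nopreempt                    ! non-preemption mode
    interface eth0               ! network interface (assumed)
    virtual_router_id 51
    priority 100                 ! e.g. 100 on one node, 90 on the other
    advert_int 1                 ! VRRP heartbeat-packet check period, seconds
    authentication {
        auth_type PASS
        auth_pass pgvip          ! shared secret (assumed)
    }
    virtual_ipaddress {
        10.1.2.100/24            ! PostgreSQL virtual IP (assumed)
    }
    notify /etc/keepalived/pgkpsate.sh
}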
② Pgwatch supporting processing logic
Pgwatch's supporting processing logic is a periodic health check. Its core purpose is, after a keepalived switchover, to promote the database on the server now pointed to by the virtual IP from its original standby operating mode to host operating mode, so that business programs can keep running normally (the database is read-only in standby mode). The processing logic of a single cycle is shown in Fig. 1; a shell sketch of one cycle follows the steps below:
Step a, check the connection state of the PostgreSQL database; if the connection is normal, go to step b; if the connection times out, go to step e;
Step b, check the operation mode of the PostgreSQL database; if it is host mode, go to step c; if it is standby mode, go to step f;
Step c, check the switching log of the current virtual IP through the notification script; if the log record indicates that the current node state is host, go to step d; otherwise, it is judged that the current PostgreSQL database is running in host mode while the master-slave relationship is broken, and the database's follower state is restored;
Step d, check the backup stream; if its state is normal, the current PostgreSQL database is judged to be running normally in host mode; if the backup stream is interrupted, start checking and restoring the backup stream;
Step e, check whether the number of timeouts has reached the maximum; if so, modify the record value in the notification script to indicate that the PostgreSQL database is running abnormally, and start checking and recovery;
Step f, check the switching log of the current virtual IP through the notification script; if the log record indicates that the current node state is standby, go to step g; if it indicates that the current node state is host, switch the PostgreSQL database of the current node to host mode; if it indicates that the PostgreSQL database is not running normally, start checking and recovery;
Step g, check the state of the backup stream; if it is interrupted, start checking and restoring the backup stream; if it is normal, the PostgreSQL database of the current node runs in standby mode.
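As announced above, a shell sketch of one check cycle follows. It maps steps a-g onto standard PostgreSQL facilities (pg_is_in_recovery(), pg_stat_replication, pg_stat_wal_receiver, pg_promote()); file paths and the timeout limit are assumptions, and the patent's own pgwatch code is not reproduced in this text.

#!/bin/bash
# Hypothetical sketch of one pgwatch check cycle (steps a-g).
STATE_FILE=/etc/keepalived/pg_vip_state
TIMEOUT_COUNT_FILE=/tmp/pg_timeout_count
MAX_TIMEOUTS=3

# Step a: connection check
if ! psql -U postgres -Atc "SELECT 1;" >/dev/null 2>&1; then
    # Step e: count timeouts; mark abnormal once the maximum is reached
    n=$(( $(cat "$TIMEOUT_COUNT_FILE" 2>/dev/null || echo 0) + 1 ))
    echo "$n" > "$TIMEOUT_COUNT_FILE"
    [ "$n" -ge "$MAX_TIMEOUTS" ] && echo "FAULT" > "$STATE_FILE"   # triggers check-and-recover
    exit 0
fi
echo 0 > "$TIMEOUT_COUNT_FILE"

MODE=$(psql -U postgres -Atc "SELECT CASE WHEN pg_is_in_recovery() THEN 'standby' ELSE 'host' END;")
VIP_STATE=$(cat "$STATE_FILE" 2>/dev/null)

if [ "$MODE" = "host" ]; then                       # Step b -> c
    if [ "$VIP_STATE" = "MASTER" ]; then            # Step c -> d: check the backup stream
        STREAMS=$(psql -U postgres -Atc "SELECT count(*) FROM pg_stat_replication;")
        [ "$STREAMS" -eq 0 ] && echo "backup stream interrupted: check and recover"
    else
        echo "host mode but VIP log says standby: master-slave relation broken, restore follower state"
    fi
else                                                # Step b -> f
    if [ "$VIP_STATE" = "MASTER" ]; then
        psql -U postgres -Atc "SELECT pg_promote();"    # promote standby to host mode (PostgreSQL >= 12)
    elif [ "$VIP_STATE" = "FAULT" ]; then
        echo "database flagged abnormal: check and recover"
    else                                            # Step g: verify the WAL receive stream
        RECV=$(psql -U postgres -Atc "SELECT count(*) FROM pg_stat_wal_receiver;")
        [ "$RECV" -eq 0 ] && echo "backup stream interrupted: check and recover"
    fi
fi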
③ Manual handling after a database switch
The initial state is that the host database and the standby database run normally in host mode and standby mode respectively, a backup stream is established, and the standby follows the host in real time through the backup stream; in this normal state the data of both databases is complete.
After the first switch, the original standby becomes the host and its data is complete. The manual state-recovery procedure is to back up the database of the current host and restore it onto the original host, and then run the original host's database in standby mode following the current host.
If state recovery is not performed after the first switch and at least one further switch occurs, neither the host database nor the standby database is complete any more. The data must then be merged onto the current host according to business logic, restored to the other machine with the current host as the reference, and the other machine then runs in standby mode following the current host.
Embodiment two, redis cache and MinIO file server application scenario:
① keepalived configuration
down_timer_adverts 100    # VRRP timeout-check multiplier: the timeout time is down_timer_adverts x advert_int. If a heartbeat-packet check fails but the heartbeat recovers within the timeout time, it is not counted as a timeout and notify is not triggered; this setting avoids unnecessary switching caused by brief jitter.
Heartbeat failure directly causes the virtual IP state to switch; the cause is generally network failure or server downtime. In addition, if the minio or redis-server process on the master node dies, the priority of the current node drops (as sketched below), demoting it from master to slave and promoting the original slave to master, which indirectly causes the virtual IP state to switch.
The application program always uses the virtual IP to access the cache and the file server, and the virtual IP always points to the server node running in host mode. After a failover, MinIO operates as a mirror-mode cluster and needs no follow-up processing, while Redis operates in master-slave mode and requires supporting operations: the Redis service on the node pointed to by the virtual IP becomes the master node, and the Redis service of the other node becomes a slave node following it (the slave's Redis is read-only). After this supporting processing finishes, the cache and the file server remain available as long as they run normally and the virtual IP correctly points to the node.
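A keepalived sketch of the process-triggered priority drop follows, using the vrrp_track_process facility of keepalived 2.x; process names, weights, addresses and the notify script name are illustrative assumptions.

! Sketch of process tracking for the Redis/MinIO VRRP instance (illustrative values)
vrrp_track_process chk_redis {
    process redis-server
    weight -30               ! drop priority by 30 when redis-server dies
}
vrrp_track_process chk_minio {
    process minio
    weight -30
}
vrrp_instance CACHE_VIP {
    state BACKUP
    nopreempt
    interface eth0
    virtual_router_id 52
    priority 100
    advert_int 1
    track_process {
        chk_redis
        chk_minio
    }
    virtual_ipaddress {
        10.1.2.101/24        ! cache/file-server virtual IP (assumed)
    }
    notify /etc/keepalived/cache_notify.sh
}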
② Pgwatch supporting processing logic
The Redis master-slave check and recovery logic is executed periodically. The core processing logic of a single execution is as follows (exception handling is omitted for brevity):
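The core code referenced above appears only as images in the original publication. The following shell sketch reconstructs the described check with redis-cli; node addresses, port, password and the state-file path are assumptions.

#!/bin/bash
# Hypothetical sketch of the periodic Redis master-slave check and recovery.
SELF=10.1.2.11; PEER=10.1.2.12; PORT=6379; AUTH=secret          # assumed nodes
VIP_STATE=$(cat /etc/keepalived/cache_vip_state 2>/dev/null)    # MASTER or BACKUP, from the notify script

role()   { redis-cli -h "$1" -p "$PORT" -a "$AUTH" --no-auth-warning INFO replication | awk -F: '/^role:/{gsub("\r","");print $2}'; }
slaves() { redis-cli -h "$1" -p "$PORT" -a "$AUTH" --no-auth-warning INFO replication | awk -F: '/^connected_slaves:/{gsub("\r","");print $2}'; }

if [ "$VIP_STATE" = "MASTER" ]; then
    # Target state: this node is master and the peer follows it.
    [ "$(role "$SELF")" != "master" ] && redis-cli -h "$SELF" -p "$PORT" -a "$AUTH" --no-auth-warning REPLICAOF NO ONE
    if [ "$(role "$PEER")" != "slave" ] || [ "$(slaves "$SELF")" -eq 0 ]; then
        redis-cli -h "$PEER" -p "$PORT" -a "$AUTH" --no-auth-warning REPLICAOF "$SELF" "$PORT"
    fi
else
    # Target state: this node is a read-only slave of the peer.
    [ "$(role "$SELF")" != "slave" ] && redis-cli -h "$SELF" -p "$PORT" -a "$AUTH" --no-auth-warning REPLICAOF "$PEER" "$PORT"
fi

REPLICAOF NO ONE promotes a node to master; REPLICAOF <host> <port> demotes it to a read-only follower.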
In short, the logic checks whether the current states of the Redis instances on the two nodes are consistent with the target state and, when they are not, executes commands to convert them to the target state.
Unlike the database, switchovers of the cache and the file server complete automatically without manual processing, so they can occur many times unattended, with no need to manually restore data or state.
Embodiment three, gateway program application scenario:
① keepalived configuration
The main logic of the notification script (svcchange.sh) is to record the switching log of the current virtual IP:
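A minimal sketch of svcchange.sh under the same assumptions as the database notify script (keepalived passes type, name and new state as arguments); paths are illustrative.

#!/bin/bash
# svcchange.sh -- hypothetical reconstruction: only records the VIP switch log.
TYPE=$1; NAME=$2; STATE=$3
echo "$STATE" > /etc/keepalived/gw_vip_state
echo "$(date '+%F %T') instance=$NAME vip_state=$STATE" >> /var/log/keepalived/gw_vip_switch.log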
Heartbeat failure directly causes the virtual IP state to switch; the cause is generally network failure or server downtime. In addition, if the gateway process on the master node dies, the priority of the current node drops, demoting it from master to slave and promoting the original slave to master, which indirectly causes the virtual IP state to switch.
② Gateway program core logic description
The core logic of the gateway program is to coordinate the running of business programs and to provide reverse proxy service; the corresponding core code segments are described below.
A. Coordinating operation of business programs
Orleans is used to maintain business-program run information: the two gateway nodes share the same information, which is persisted into the Redis cache. The relevant logic sits in the startup phase of the gateway program, as follows:
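The code itself appears only as images in the original publication. The following C# sketch illustrates the described startup wiring, assuming the public Redis clustering and persistence providers for Orleans; the Redis address, cluster IDs and storage name are illustrative assumptions.

// Hypothetical sketch: both gateway nodes join one Orleans cluster and persist
// business-program run information in Redis, so each node sees the same state.
using Microsoft.Extensions.Hosting;
using Orleans.Configuration;
using Orleans.Hosting;
using StackExchange.Redis;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(silo =>
    {
        silo.Configure<ClusterOptions>(o => { o.ClusterId = "gw"; o.ServiceId = "gateway"; });
        // Redis-backed membership: the two gateway nodes discover each other here.
        silo.UseRedisClustering(o => o.ConfigurationOptions = ConfigurationOptions.Parse("10.1.2.101:6379"));
        // Grain state (which business program runs where) survives gateway restarts.
        silo.AddRedisGrainStorage("serviceRuns",
            o => o.ConfigurationOptions = ConfigurationOptions.Parse("10.1.2.101:6379"));
    })
    .Build();

await host.RunAsync();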
A business program that accesses the gateway service must execute the following logic during its startup phase, so that only one instance of the same business program can start across the two nodes. The instance keeps a heartbeat with the server side in its role as client, continuously reports the state of the business program, and stops the current business program if the heartbeat fails:
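The original code images are not reproduced here. The C# sketch below illustrates the described mutual-exclusion start and heartbeat loop; the grain interface IServiceRunGrain and its methods are hypothetical names introduced for illustration.

using System;
using System.Threading;
using System.Threading.Tasks;
using Orleans;

public interface IServiceRunGrain : IGrainWithStringKey
{
    // Returns true only for the first instance that claims the service name;
    // a second start attempt on the other node gets false and must exit.
    Task<bool> TryAcquire(string nodeId);
    // Heartbeat; returns false if this instance no longer holds the claim.
    Task<bool> Beat(string nodeId);
}

public static class ServiceStartup
{
    public static async Task RunAsync(IClusterClient client, string serviceName, string nodeId, CancellationToken ct)
    {
        var grain = client.GetGrain<IServiceRunGrain>(serviceName);

        // Mutual-exclusion start: only one of the two nodes wins the claim.
        if (!await grain.TryAcquire(nodeId))
        {
            Console.WriteLine($"{serviceName}: already running on the other node, exiting.");
            Environment.Exit(0);
        }

        try
        {
            // Report state periodically; stop the current program when the heartbeat fails.
            while (!ct.IsCancellationRequested)
            {
                if (!await grain.Beat(nodeId))
                {
                    Console.WriteLine($"{serviceName}: heartbeat rejected, stopping.");
                    Environment.Exit(1);
                }
                await Task.Delay(TimeSpan.FromSeconds(5), ct);
            }
        }
        catch (OperationCanceledException)
        {
            // normal shutdown
        }
    }
}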
The custom load-balancing service is defined as follows; its policy is to forward requests to the currently active business-program node:
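The definition appears as an image in the original publication. The sketch below assumes the gateway's reverse proxy is YARP (the patent does not name the library) and implements its ILoadBalancingPolicy extension point; IActiveNodeTracker is a hypothetical service standing in for the Orleans-backed run information.

using System.Collections.Generic;
using Microsoft.AspNetCore.Http;
using Yarp.ReverseProxy.LoadBalancing;
using Yarp.ReverseProxy.Model;

public interface IActiveNodeTracker
{
    // Address of the node where the single running instance currently lives.
    string GetActiveAddress(string clusterId);
}

public sealed class ActiveNodePolicy : ILoadBalancingPolicy
{
    private readonly IActiveNodeTracker _tracker;

    public ActiveNodePolicy(IActiveNodeTracker tracker) => _tracker = tracker;

    public string Name => "ActiveNode";

    public DestinationState? PickDestination(HttpContext context, ClusterState cluster,
        IReadOnlyList<DestinationState> availableDestinations)
    {
        // Forward the request to the destination hosting the active instance.
        var active = _tracker.GetActiveAddress(cluster.ClusterId);
        foreach (var d in availableDestinations)
        {
            if (d.Model.Config.Address == active)
                return d;
        }
        return availableDestinations.Count > 0 ? availableDestinations[0] : null;
    }
}

The policy is registered in the container (for example as a singleton ILoadBalancingPolicy) and selected per cluster by its name, "ActiveNode".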
Configuration file example of the reverse proxy:
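The example file itself is an image in the original publication. A YARP-style appsettings.json consistent with the addresses described below would look roughly like this (section and key names assume YARP; the route path /vm matches the access URL in the example):

{
  "ReverseProxy": {
    "Routes": {
      "vm-route": {
        "ClusterId": "vm",
        "Match": { "Path": "/vm/{**catch-all}" }
      }
    },
    "Clusters": {
      "vm": {
        "LoadBalancingPolicy": "ActiveNode",
        "Destinations": {
          "node1": { "Address": "http://10.1.3.165:5700/" },
          "node2": { "Address": "http://10.1.2.135:5700/" }
        }
      }
    }
  }
}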
In the example configuration above, if the gateway program virtual IP is 10.1.2.200, the gateway program port is 5010, and the business program's physical addresses are http://10.1.3.165:5700/ and http://10.1.2.135:5700/, then in dual-machine hot-standby mode the business program should be accessed through http://10.1.2.200:5010/vm.
The invention also provides a high-availability service system based on dual-machine hot standby, comprising:
a processor; a memory for storing the processor-executable instructions;
The processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of the present invention.
The foregoing describes in detail the technical solutions provided by the embodiments of the present invention, and specific examples have been used to illustrate their principles and implementations; the above description of the embodiments is only intended to help understand the principles of the embodiments of the present invention. Meanwhile, for those skilled in the art, there will be variations in specific implementation and application scope according to these embodiments; accordingly, this description should not be construed as limiting the present invention.

Claims (7)

1. The high-availability service method based on dual-machine hot standby is characterized by comprising a master server and a slave server serving as service nodes, wherein the application programs on the two servers are built as a multi-process application spanning hardware based on the Microsoft Orleans framework, and the master server and the slave server communicate through the VRRP protocol; one of the two servers operates in host mode, and host mode is switched from the current server to the other server when triggered by a VRRP heartbeat failure.
2. The high-availability service method based on dual-machine hot standby according to claim 1, wherein: the two servers are provided with the same PostgreSQL database, Redis cache, MinIO file server, gateway program, Portainer container management tool, pgwatch database tool and application programs;
the Portainer container management tool provides a Docker management interface for the gateway program;
the application programs run as service programs in Docker containers according to service requests, and the gateway program coordinates the running of the service programs and provides reverse proxy service, so that the same application program has only one running instance across the two service nodes;
the pgwatch database tool sets the PostgreSQL database into master-slave mode and starts a backup stream; the Redis cache operates in master-slave mode; the MinIO file server operates as a cluster in mirror mode.
3. The high-availability service method based on dual-machine hot standby according to claim 2, wherein: for the PostgreSQL database, host mode switching between the master server and the slave server is implemented as follows:
1) the keepalived tool is used to apply the same routing configuration to the master server and the slave server:
in non-preemption mode, the master server and the slave server are configured with a BACKUP router state, a network interface, a virtual router ID, a priority, a VRRP heartbeat-packet check period, a notification script, authentication, the PostgreSQL database virtual IP and the heartbeat-triggered fault conditions of the VRRP protocol;
the notification script records the current state of the PostgreSQL database and the switching log of the current virtual IP; the application program accesses the database only through the PostgreSQL virtual IP, and the virtual IP always points to the PostgreSQL database of the server running in host mode;
2) The pgwatch database tool processing procedure comprises the following steps:
Step a, check the connection state of the PostgreSQL database; if the connection is normal, go to step b; if the connection times out, go to step e;
Step b, check the operation mode of the PostgreSQL database; if it is host mode, go to step c; if it is standby mode, go to step f;
Step c, check the switching log of the current virtual IP through the notification script; if the log record indicates that the current node state is host, go to step d; otherwise, it is judged that the current PostgreSQL database is running in host mode while the master-slave relationship is broken, and the database's follower state is restored;
Step d, check the backup stream; if its state is normal, the current PostgreSQL database is judged to be running normally in host mode; if the backup stream is interrupted, start checking and restoring the backup stream;
Step e, check whether the number of timeouts has reached the maximum; if so, modify the record value in the notification script to indicate that the PostgreSQL database is running abnormally, and start checking and recovery;
Step f, check the switching log of the current virtual IP through the notification script; if the log record indicates that the current node state is standby, go to step g; if it indicates that the current node state is host, switch the PostgreSQL database of the current node to host mode; if it indicates that the PostgreSQL database is not running normally, start checking and recovery;
Step g, check the state of the backup stream; if it is interrupted, start checking and restoring the backup stream; if it is normal, the PostgreSQL database of the current node runs in standby mode.
4. The high-availability service method based on dual-machine hot standby according to claim 3, wherein: a notification delay for network interface state changes is specified that is less than the countdown broadcast-interval value and the check-interval value in all virtual network instance configurations.
5. The high-availability service method based on dual-machine hot standby according to claim 3, wherein: for the Redis cache and the MinIO file server, host mode switching between the master server and the slave server is implemented as follows:
1) start process monitoring of the Redis cache and the MinIO file server;
2) in non-preemption mode, the master server and the slave server are configured with a BACKUP router state, a network interface, a virtual router ID, a priority, a VRRP heartbeat-packet check period, a notification script, authentication, the PostgreSQL database virtual IP and the heartbeat-triggered fault conditions of the VRRP protocol;
3) when the Redis cache or MinIO file server process in host mode dies, the current node is demoted from host mode to standby mode, the other node is promoted from its original standby mode to host mode, and the virtual IP state switches;
4) periodically perform Redis cache master-slave relationship checking and recovery:
Step a1, obtain the master-slave target state and actual state of the current node's Redis cache; if both the target state and the actual state are master but no slave node is connected, make the other node's Redis cache a slave of the current node's Redis cache; otherwise, go to step b1;
Step b1, if the target state is master and the actual state is not master, make the current node's Redis cache the master and the other node's Redis cache a slave of it; otherwise, go to step c1;
Step c1, if the target state is slave and the actual state of the other node is master but no slave node is connected, make the current node's Redis cache a slave of the other node; otherwise, go to step d1;
Step d1, if the target state is slave and the actual state of the other node is not master, make the other node's Redis cache the master and the current node's Redis cache a slave of it; otherwise, the current check cycle ends.
6. The high-availability service method based on dual-machine hot standby according to claim 5, wherein: for the gateway program, host mode switching between the master server and the slave server is implemented as follows:
1) start process monitoring of the gateway;
2) in non-preemption mode, the master server and the slave server are configured with a BACKUP router state, a network interface, a virtual router ID, a priority, a VRRP heartbeat-packet check period, a notification script, authentication, the PostgreSQL database virtual IP and the heartbeat-triggered fault conditions of the VRRP protocol;
3) the two gateway nodes share the same business-program run information, which is persisted into the Redis cache, and the gateway program coordinates the business programs as follows:
a2, when a business program accesses the gateway service, check the number of started instances of that business program to ensure that only one instance runs; the instance keeps a heartbeat with the gateway server as a client and continuously reports its state, and the current business program is stopped when the heartbeat fails;
b2, the gateway program of the master node periodically checks the running state of each business program and, if one is not running, sends a start instruction to start it;
c2, when the gateway program provides reverse proxy service, it performs load-balancing scheduling: requests are forwarded to the currently active business-program node.
7. A dual hot standby based high availability service system comprising:
A processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any of claims 1-6.
CN202311844082.3A 2023-12-28 2023-12-28 High-availability service method and system based on dual-machine hot standby Pending CN118101435A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination