CN111240749B

CN111240749B - Suspending control method, device, equipment and storage medium of instance in cluster system

Info

Publication number: CN111240749B
Application number: CN201811436447.8A
Authority: CN
Inventors: 吁玲; 王璇; 竺士杰; 任赣
Original assignee: China Mobile Group Zhejiang Co Ltd
Current assignee: China Mobile Group Zhejiang Co Ltd
Priority date: 2018-11-28
Filing date: 2018-11-28
Publication date: 2023-07-21
Anticipated expiration: 2038-11-28
Also published as: CN111240749A

Abstract

The embodiment of the invention provides a method and a device for controlling suspension of an instance in a cluster system, wherein the method comprises the following steps: in an instance to be subjected to high availability test in a cluster system, deploying a thread control module for suspending the threads of the instance; and sending a plurality of simulation request signals matched with the number of threads of the instance to the thread control module so that the thread control module carries out suspending operation on all threads of the instance according to the simulation request signals. The embodiment of the invention ensures the test accuracy of the high-availability test when the instance is suspended.

Description

Suspending control method, device, equipment and storage medium of instance in cluster system

Technical Field

The embodiment of the invention relates to the technical field of communication, in particular to a suspension control method and device for an instance in a cluster system.

Background

With the rapid development of information technology, enterprises place ever higher demands on service availability. For an enterprise, service unavailability caused by system downtime or other reasons directly affects its operating income, image and customer satisfaction, and may even lead to legal disputes. Clusters have become the mainstream architecture in the server industry. Although the reliability of individual hardware components keeps improving, cluster systems still fail frequently because environmental complexity and the uncertainty of human factors grow with cluster size, which makes it especially necessary to address the high availability of the system from the software side. Availability means the percentage of time that the system operates normally without downtime. Typically, high availability tests are performed before and after the system goes into service and for a period of time after each code update.

Applications are typically deployed on middleware instances, and high availability tests of middleware applications fall into two types: instance stop (down) and instance suspension (hang). Instance down is the state in which the instance has stopped. Instance hang is the state in which the port of the middleware instance is still listening and its process can still be detected, but the instance in fact cannot respond to service requests, for example when the process runs out of memory and stops responding, or when the thread pool of the process is exhausted and stops responding.

A typical high availability architecture for existing middleware includes several middleware servers and a load balancer, with service requests distributed by the load balancer to the middleware servers for processing. However, during system operation an instance will often enter the hang state. When this happens, the load balancer needs a mechanism to identify the abnormal instance and forward service requests to normal instances so that the service remains continuously available. It is therefore especially necessary to perform high availability tests for instance hang.

The existing high availability test method for a middleware instance hang is as follows: an ultra-large batch of port call requests is sent (via Telnet commands) to the instance to be hung, in the hope of hanging it. Practical testing shows, however, that because this method sends signals at an ultra-large magnitude, the number of commands sent per unit time is extremely large. If commands are sent too frequently, a large amount of host resources is consumed and the maximum number of TCP/IP connections of the host is exhausted, so that host performance becomes abnormal, other instances on the host are affected, and the high availability test is distorted. In addition, the method does not call any internal method of the instance, so the instance cannot be hung completely and the outcome is random: practical testing shows that the instance may not end up hung at all, which makes the result unpredictable.

In summary, the prior art has the problem of low accuracy of the test results when performing high availability tests.

Disclosure of Invention

The embodiment of the invention provides a suspension control method and device for an instance in a cluster system, which are used for solving the problem of lower accuracy in the prior art when the instance is subjected to high-availability test.

In order to solve the above problem, in a first aspect, an embodiment of the present invention provides a suspension control method for an instance in a cluster system, where the method includes:

in an instance to be subjected to high availability test in a cluster system, deploying a thread control module for suspending the threads of the instance;

and sending a plurality of simulation request signals matched with the number of threads of the instance to the thread control module so that the thread control module carries out suspending operation on all threads of the instance according to the simulation request signals.

In a second aspect, an embodiment of the present invention provides a suspension control apparatus for an instance in a cluster system, where the apparatus includes:

a deployment module, configured to deploy, in an instance to be subjected to high availability test in the cluster system, a thread control module for suspending the threads of the instance;

and the signal sending module is used for sending a plurality of simulation request signals matched with the number of threads of the instance to the thread control module so that the thread control module can carry out suspending operation on all threads of the instance according to the simulation request signals.

In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements steps of a suspension control method for an instance in the cluster system when the computer program is executed.

In a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a suspension control method of an instance in a clustered system as described.

According to the method and the device for controlling the suspension of an instance in a cluster system, a thread control module is deployed in the middleware instance to be subjected to high availability test, and the thread control module is then triggered to control all threads of the instance by sending a plurality of simulation request signals matched with the number of threads of the instance. All threads are thereby suspended and can no longer respond to requests, so that the instance is completely suspended and the suspension result is deterministic. Because the number of simulation request signals is controlled, the method also avoids the abnormal host performance, and the resulting impact on other instances of the host, that arises when an ultra-large batch of port call requests is sent to the instance to be suspended. The test accuracy of the high availability test for instance suspension is thus ensured.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart illustrating steps of a method for suspension control of an instance in a clustered system in accordance with an embodiment of the present invention;

FIG. 2 is a schematic diagram of a high availability architecture for instances in a cluster system in accordance with an embodiment of the present invention;

FIG. 3 is a block diagram illustrating a suspension control apparatus for an instance in a cluster system in accordance with one embodiment of the invention;

FIG. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

FIG. 1 is a flowchart of the steps of a suspension control method for an instance in a cluster system according to an embodiment of the present invention. The method includes the following steps:

step 101: in an instance to be tested for high availability in a clustered system, a thread control module for suspending threads of the instance is deployed.

In this step, specifically, when a high availability test for instance suspension needs to be performed on the cluster system, a thread control module for suspending the threads of the instance may be deployed in the instance to be subjected to the high availability test in the cluster system.

The method can be applied to an architecture with high availability of an instance in a cluster system, as shown in fig. 2, and the architecture includes a signal transmitter 21, where the signal transmitter 21 can be used to perform the step, that is, to deploy a thread control module 22 in an instance to be tested with high availability in the cluster system, so as to perform a suspension operation on a thread of the instance.

Specifically, the signal transmitter 21 may include an automated deployment module 211 and a signal transmission module 212, where after the architecture receives a task of performing a high availability test, the automated deployment module 211 of the signal transmitter 21 may be started to transmit and automatically deploy a deployment package of the thread control module 22 to an instance to be subjected to the high availability test. The thread control module 22 may be configured to perform a suspension operation on a thread of an instance, thereby implementing suspension control of the instance.

Step 102: and sending a plurality of simulation request signals matched with the number of threads of the instance to the thread control module so that the thread control module can carry out suspending operation on all threads of the instance according to the simulation request signals.

In this step, specifically, after the thread control module is deployed in the instance, a plurality of simulation request signals matched with the number of threads of the instance may be sent to the thread control module, so that the thread control module carries out a suspension operation on all threads of the instance according to the plurality of simulation request signals, thereby implementing complete suspension of the instance.

Because the number of simulation request signals sent is matched with the number of threads of the instance, the suspension result of the instance is accurate and the high availability test result is effectively controlled. This avoids the prior-art problem that the instance cannot be suspended completely. It also avoids the prior-art problem that, when an ultra-large batch of port call requests is sent to the instance to be suspended, the large number of signals sent per unit time consumes a large amount of host resources, affects the performance of other instances of the host, and thereby makes the high availability test result inaccurate. The availability and stability of the cluster system are improved accordingly.

Specifically, the simulation request signal may be a request signal that simulates page access.

In addition, the thread control module 22 may include a signal receiving unit 221 and a thread suspending unit 222. The signal receiving unit 221 is configured to receive the plurality of simulation request signals transmitted by the signal transmitter 21. The signal receiving unit 221 may use a page component program from WEB programming, which exposes an interface for the signal transmitter 21 to access. The signal transmitter 21 is in turn configured with a program for accessing that interface and uses concurrent program statements to access it in batches, that is, to transmit a plurality of simulation request signals in batches. The thread suspending unit 222 may use multithreading programming techniques to schedule thread suspension: each time the signal transmitter 21 transmits one simulation request signal to the thread control module 22, the thread control module 22 suspends one thread. When a plurality of simulation request signals matching the number of threads of the instance are transmitted in a batch, the thread suspending unit 222 can call the thread control program to perform the thread suspension operation and fully suspend all threads included in the instance. At this point the load balancing module 23 can balance simulation request signals subsequently addressed to the suspended instance to other instances, so that the high availability of the system can be tested.
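The one-request-per-thread behaviour described above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the class `ThreadControlModule`, the function `hang_instance`, and the use of a `ThreadPoolExecutor` as a stand-in for the thread pool of the instance are all assumptions made for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ThreadControlModule:
    """Hypothetical stand-in for thread control module 22."""

    def __init__(self):
        self._release = threading.Event()  # set() wakes every suspended thread
        self._lock = threading.Lock()
        self._suspended = 0

    def handle_request(self):
        """Receive one simulated request and park the worker that serves it."""
        with self._lock:
            self._suspended += 1
        self._release.wait()  # the worker blocks here until released
        with self._lock:
            self._suspended -= 1

    def suspended_count(self):
        with self._lock:
            return self._suspended

    def release_all(self):
        """Switch every suspended thread back to the non-suspended state."""
        self._release.set()


def hang_instance(pool, module, thread_count, timeout=5.0):
    """Send exactly one simulated request per worker thread of the pool,
    then wait (up to `timeout` seconds) until all of them are parked."""
    futures = [pool.submit(module.handle_request) for _ in range(thread_count)]
    deadline = time.monotonic() + timeout
    while module.suspended_count() < thread_count and time.monotonic() < deadline:
        time.sleep(0.01)
    return futures
```

With a pool of four workers, four simulated requests park every worker, and a further request submitted to the pool stays queued until `release_all()` is called, which mirrors an instance whose port is listening but which cannot respond.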

In this way, in this embodiment, a thread control module for suspending the threads of the instance is deployed in the instance to be subjected to high availability test in the cluster system, and a plurality of simulation request signals matched with the number of threads of the instance are sent to the thread control module, so that the thread control module suspends all threads of the instance according to the plurality of simulation request signals. Because the number of simulation request signals sent is matched with the number of threads of the instance, the suspension result of the instance is accurate and the high availability test result is effectively controlled. This avoids the prior-art problem that the instance cannot be suspended completely, as well as the prior-art problem that sending an ultra-large batch of port call requests consumes a large amount of host resources per unit time, affects the performance of other instances of the host, and makes the high availability test result inaccurate. The availability and stability of the cluster system are improved accordingly.

Further, when a plurality of simulation request signals matching the number of threads of the instance are sent to the thread control module, a target amount of simulation request signals required to suspend all threads included in the instance may first be acquired; the target amount of simulation request signals is then sent to the thread control module, so that the thread control module can carry out the suspension operation on all threads in the instance according to the target amount of simulation request signals.

To obtain the target amount of simulation request signals required to suspend all threads included in the instance, the transmission amount of simulation request signals in each time slice within a preset period, together with the number of suspended threads in the instance, can be used as training samples. The correspondence between the number of suspended threads in the instance and the transmission amount of simulation request signals is learned from these samples. The target amount of simulation request signals required to suspend all threads included in the instance is then determined from the learned correspondence.

Specifically, an instance (including one running inside a Docker container) typically sets a maximum thread count and a minimum thread count. If the maximum thread count of an instance is not set, that is, if the thread pool is allowed to expand elastically without limit, the total number of executing threads can grow rapidly when service concurrency is high, which easily causes the instance to run out of memory. In addition, if the total number of threads is not set, the thread pool may be unable to adjust elastically in time when traffic surges, so that a batch of service requests fails. The number of threads of the instance in this embodiment is therefore a preset value, for example 500.

In addition, training is performed with the transmission amount of simulation request signals in each time slice within the preset period and the number of threads actually suspended in the instance as training samples. Because the logistic regression (Logistic) algorithm converges easily and quickly reaches a globally optimal solution, the Logistic algorithm can be used to train on the samples. Assuming that the number of suspended threads in the instance and the transmission amount of simulation request signals follow a Bernoulli distribution, the cost function can be iterated by gradient descent, and the algorithm stops when the gradient converges. At that point the correspondence between the number of suspended threads and the transmission amount of simulation request signals is obtained, and from this correspondence the target amount of simulation request signals required to suspend all threads included in the instance, that is, the target amount matched with the number of threads of the instance, can be determined.

In this way, the correspondence between the number of suspended threads and the transmission amount of simulation request signals is obtained by training, and the target amount of simulation request signals matched with the number of threads of the instance is derived from it. This ensures the accuracy of the target amount, so that the instance is completely suspended when the target amount of simulation request signals is sent. It avoids both the case where the instance cannot be completely suspended because too few simulation request signals are sent and the case where the performance of the host of the instance is affected because too many are sent.
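The training step can be sketched as follows, under stated assumptions. The text does not give the model's exact form, so the sketch assumes that the fraction of suspended threads is modelled as a logistic function of the (rescaled) number of requests sent, fitted by gradient descent on the cross-entropy cost. The names `fit_logistic` and `target_amount`, the learning rate, the stopping tolerance, and the rule "the predicted suspended count rounds up to the full thread count" are all hypothetical choices.

```python
import math


def fit_logistic(samples, thread_count, lr=1.0, max_iter=20000, tol=1e-6):
    """Fit p(x) = sigmoid(w * x / scale + b) by gradient descent.

    samples: list of (requests_sent, threads_suspended), one per time slice.
    p(x) models the fraction of the instance's threads that are suspended.
    """
    scale = max(x for x, _ in samples)  # rescale inputs to roughly [0, 1]
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(max_iter):
        grad_w = grad_b = 0.0
        for x, suspended in samples:
            z = x / scale
            p = 1.0 / (1.0 + math.exp(-(w * z + b)))
            err = p - suspended / thread_count  # cross-entropy gradient term
            grad_w += err * z
            grad_b += err
        grad_w /= n
        grad_b /= n
        w -= lr * grad_w
        b -= lr * grad_b
        if math.hypot(grad_w, grad_b) < tol:  # stop once the gradient converges
            break
    return w, b, scale


def target_amount(w, b, scale, thread_count):
    """Smallest request count whose predicted suspended-thread count
    rounds up to thread_count (an assumed decision rule)."""
    if w <= 0:
        raise ValueError("fitted curve does not increase with requests sent")
    p_full = 1.0 - 0.5 / thread_count
    z = (math.log(p_full / (1.0 - p_full)) - b) / w  # invert the sigmoid
    return math.ceil(z * scale)
```

Given samples collected per time slice, `target_amount(*fit_logistic(samples, 500), 500)` would return the request count at which the fitted curve predicts that all 500 threads are suspended.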

Further, after a plurality of simulation request signals matching the number of threads of the instance are sent to the thread control module, it may also be detected whether all threads included in the instance are in a suspended state; when all threads included in the instance are detected to be in a suspended state, it is detected whether the cluster system is in a high availability state.

After detecting whether the cluster system is in the high availability state, a release operation can further be performed on the suspended threads in the instance, so that the suspended threads are switched to the non-suspended state.

Specifically, to detect whether all threads included in the instance are in a suspended state, a dump log that records thread states on the host where the instance is located can be obtained, and whether each thread is suspended is judged from the thread state information recorded in the dump log.

It should be noted that, if some of the threads included in the instance are detected not to be suspended, the number of simulation request signals sent to the thread control module may be increased until all threads in the instance are in the suspended state.
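A minimal sketch of such a dump check is given below. The text does not specify the dump format, so the sketch assumes a Java-style thread dump in which each thread entry begins with the thread name in double quotes and is followed by its stack frames. The worker-name prefix and the marker string identifying the suspension frame are hypothetical parameters that a real deployment would have to supply.

```python
import re

# Assumed format: each thread entry starts with a line like  "thread-name" ...
HEADER = re.compile(r'^"(?P<name>[^"]+)"')


def suspended_workers(dump_text, worker_prefix, marker):
    """Map each worker thread name to True if its stack contains `marker`."""
    status = {}
    current = None
    for line in dump_text.splitlines():
        match = HEADER.match(line)
        if match:
            name = match.group("name")
            # Only threads whose name starts with worker_prefix are tracked.
            current = name if name.startswith(worker_prefix) else None
            if current is not None:
                status[current] = False
        elif current is not None and marker in line:
            status[current] = True
    return status


def all_suspended(dump_text, worker_prefix, marker, expected_threads):
    """True only if every expected worker thread appears and is suspended."""
    status = suspended_workers(dump_text, worker_prefix, marker)
    return len(status) == expected_threads and all(status.values())
```

Matching on a stack-frame marker rather than on the thread state alone is a deliberate choice in this sketch: idle pool threads also appear as waiting in a dump, so the state by itself would not distinguish a suspended worker from an idle one.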
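The top-up behaviour described above can be sketched as a bounded feedback loop. The callables `send_requests` and `count_suspended` are hypothetical hooks standing in for the signal transmitter and the thread detection step; the rule of sending exactly the shortfall in each round, and the cap on the number of rounds, are assumptions chosen to keep the request volume small.

```python
def suspend_until_complete(send_requests, count_suspended, thread_count,
                           initial_amount, max_rounds=10):
    """Send simulated requests in rounds until every thread is suspended.

    send_requests(n):   hypothetical hook that sends n simulated requests.
    count_suspended():  hypothetical hook returning the suspended-thread count.
    Returns the total number of requests sent.
    """
    sent = 0
    amount = initial_amount
    for _ in range(max_rounds):
        send_requests(amount)
        sent += amount
        suspended = count_suspended()
        if suspended >= thread_count:
            return sent
        # Top up by the shortfall only, so the host is never flooded.
        amount = thread_count - suspended
    raise RuntimeError("instance not fully suspended after max_rounds rounds")
```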

Specifically, the architecture of the cluster system where the instances are highly available may further include a thread management module 24, where the thread management module 24 includes a thread detection unit 241 and a thread release unit 242. In this embodiment, the thread detecting unit 241 may detect whether all the threads included in the instance are in the suspended state, and perform a release operation on the threads in the suspended state in the instance by the thread releasing unit 242, so that the threads in the suspended state are switched to the non-suspended state.

In addition, the high availability test mainly tests the failover mechanism of the instances; that is, the high availability of the cluster system is verified as long as the instances under test can take over from one another when a fault occurs. Therefore, when detecting whether the cluster system is in a high availability state, it can be detected whether the service success rate of the cluster system while all threads included in the instance are suspended is the same as the service success rate of the cluster system while the instance is not suspended; if so, the cluster system is in a high availability state. If the service success rate of the cluster system while all threads included in the instance are suspended is lower than the service success rate while the instance is not suspended, the log of the instance needs to be obtained to determine whether the instance is still receiving simulation request signals or whether errors are reported; if errors are reported, the high availability mechanism of the cluster system is defective.

In this way, a thread control module for suspending the threads of the instance is deployed in the instance to be subjected to high availability test in the cluster system, and a plurality of simulation request signals matched with the number of threads of the instance are sent to the thread control module, so that the thread control module suspends all threads of the instance according to the plurality of simulation request signals. Because the number of simulation request signals sent is matched with the number of threads of the instance, the suspension result of the instance is accurate and the high availability test result is effectively controlled. This avoids the prior-art problem that the instance cannot be suspended completely, as well as the prior-art problem that sending an ultra-large batch of port call requests to suspend the instance consumes a large amount of host resources and makes the high availability test result inaccurate.

FIG. 3 is a block diagram of a suspension control apparatus for an instance in a cluster system according to an embodiment of the present invention. The apparatus includes:
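The success-rate comparison can be sketched as below. The function names and the `tolerance` parameter are hypothetical; the text only requires that the two rates be the same, which corresponds to the default tolerance of zero.

```python
def success_rate(outcomes):
    """Fraction of successful service requests (outcomes are booleans)."""
    if not outcomes:
        raise ValueError("no request outcomes recorded")
    return sum(outcomes) / len(outcomes)


def is_highly_available(baseline_outcomes, hang_outcomes, tolerance=0.0):
    """Compare the service success rate while the instance is fully suspended
    with the rate measured while it is not suspended.

    `tolerance` is an assumed allowance for measurement noise.
    """
    return success_rate(hang_outcomes) >= success_rate(baseline_outcomes) - tolerance
```

If the function returns `False`, the next step described above applies: the log of the instance is inspected to see whether it is still receiving requests or reporting errors.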

a deployment module 301, configured to deploy, in an instance to be tested for high availability in the clustered system, a thread control module for suspending a thread of the instance;

and the signal sending module 302 is configured to send a plurality of simulation request signals matched with the number of threads of the instance to the thread control module, so that the thread control module carries out a suspension operation on all threads of the instance according to the plurality of simulation request signals.

Optionally, the signal sending module 302 includes:

an obtaining unit, configured to obtain a target amount of simulation request signals required to suspend all threads included in the instance;

and a sending unit, configured to send the target amount of simulation request signals to the thread control module, so that the thread control module can carry out the suspension operation on all threads in the instance according to the target amount of simulation request signals.

Optionally, the obtaining unit is configured to use the transmission amount of simulation request signals in each time slice within a preset period and the number of suspended threads in the instance as training samples, to learn from these samples the correspondence between the number of suspended threads in the instance and the transmission amount of simulation request signals, and to determine, from the learned correspondence, the target amount of simulation request signals required to suspend all threads included in the instance.

Optionally, the apparatus further comprises:

the thread management module is used for detecting whether all threads included in the instance are in a suspended state; when all threads included in the instance are detected to be in a suspended state, it is detected whether the clustered system is in a high availability state.

Optionally, the thread management module is further configured to perform a release operation on the thread in the suspended state in the instance, so that the thread in the suspended state is switched to the non-suspended state.

According to the suspension control device for the instance in the cluster system, the thread control module for suspending the threads of the instance is deployed in the instance to be subjected to the high availability test in the cluster system, and a plurality of simulation request signals matched with the number of the threads of the instance are sent to the thread control module, so that the thread control module suspends all the threads of the instance according to the simulation request signals, and the accuracy of the suspending result of the instance and the effective control of the high availability test result are realized based on the fact that the sending number of the simulation request signals is matched with the number of the threads of the instance.

In addition, FIG. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. The electronic device may include: a processor 410, a communication interface (Communications Interface) 420, a memory 430 and a communication bus 440, wherein the processor 410, the communication interface 420 and the memory 430 communicate with each other via the communication bus 440. The processor 410 may invoke a computer program stored in the memory 430 and executable on the processor 410 to perform the methods provided by the above embodiments, including, for example: in an instance to be subjected to high availability test in a cluster system, deploying a thread control module for suspending the threads of the instance; and sending a plurality of simulation request signals matched with the number of threads of the instance to the thread control module so that the thread control module carries out suspending operation on all threads of the instance according to the simulation request signals.

Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art or in a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

Embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the methods provided by the above embodiments, for example, comprising: in an instance to be subjected to high availability test in a cluster system, deploying a thread control module for suspending the threads of the instance; and sending a plurality of simulation request signals matched with the number of threads of the instance to the thread control module so that the thread control module carries out suspending operation on all threads of the instance according to the simulation request signals.

The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for controlling suspension of an instance in a cluster system, the method comprising:

sending a plurality of simulation request signals matched with the number of threads of an instance to the thread control module so that the thread control module carries out suspending operation on all threads of the instance according to the simulation request signals;

wherein the sending, to the thread control module, of a plurality of simulation request signals matched with the number of threads of an instance comprises:

obtaining a target quantity of simulation request signals required for suspending all threads included in the instance;

and sending the target quantity of simulation request signals to the thread control module, so that the thread control module suspends all threads in the instance according to the target quantity of simulation request signals.

2. The method of claim 1, wherein the obtaining a target quantity of simulation request signals required for suspending all threads included in the instance comprises:

taking, as training samples, the number of simulation request signals sent and the number of threads suspended in the instance in each time slice within a preset period, and training on these samples to obtain the correspondence between the number of suspended threads in the instance and the number of simulation request signals sent;

and determining, according to the correspondence obtained through training, the target quantity of simulation request signals required for suspending all threads included in the instance.

3. The method of claim 1, wherein after sending to the thread control module a plurality of simulation request signals matched with the number of threads of an instance, the method further comprises:

detecting whether all threads included in the instance are in a suspended state;

and when all threads included in the instance are detected to be in a suspended state, detecting whether the cluster system is in a high availability state.

4. The method of claim 3, wherein after the detecting whether the cluster system is in a high availability state, the method further comprises:

releasing the threads in the suspended state in the instance, so as to switch those threads to a non-suspended state.

5. A suspension control device for an instance in a cluster system, the device comprising:

a signal sending module, configured to send a plurality of simulation request signals matched with the number of threads of the instance to the thread control module, so that the thread control module suspends all threads of the instance according to the simulation request signals;

wherein the sending, to the thread control module, of a plurality of simulation request signals matched with the number of threads of an instance comprises:

6. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method for controlling suspension of an instance in a cluster system according to any one of claims 1 to 4.

7. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for controlling suspension of an instance in a cluster system according to any one of claims 1 to 4.
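As an illustration only, the correspondence recited in claim 2 between the number of simulation request signals sent and the number of suspended threads might be learned as sketched below. The claims do not name a training method; this sketch assumes a linear relationship fitted by ordinary least squares, and the function names and sample values are hypothetical.

```python
import math


def fit_line(samples):
    """Least-squares fit of suspended = slope * sent + intercept.

    `samples` holds one (signals_sent, threads_suspended) pair per time slice.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    if sxx == 0:
        raise ValueError("samples must contain at least two distinct send counts")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def target_quantity(samples, total_threads):
    """Smallest signal count whose predicted suspended-thread count covers all threads."""
    slope, intercept = fit_line(samples)
    if slope <= 0:
        raise ValueError("sending more signals does not suspend more threads")
    return max(0, math.ceil((total_threads - intercept) / slope))


if __name__ == "__main__":
    # Made-up training samples: (signals sent, threads suspended) per time slice.
    history = [(10, 5), (20, 10), (30, 15), (40, 20)]
    print(target_quantity(history, total_threads=32))
```

With these made-up samples every two signals suspend one thread, so the fitted line predicts that 64 signals are needed to suspend an instance with 32 threads.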